XTech 2005: XML, the Web and beyond.
Validation against a single schema is often not enough to tell whether a particular XML instance is valid in all respects. This has been formally recognised by the ISO Document Schema Definition Languages (DSDL) project, which defines a set of complementary schema technologies, including the grammar based validation of Relax NG and the rule based validation of Schematron. A similar set of options is becoming increasingly practical using W3C XML technologies: specifically through the combination of W3C XML Schema and XForms.
XML is by definition extensible. An XML vocabulary, such as XHTML, constrains that extensibility. The structures that make up a vocabulary are often formally defined in a schema. Such schemas are used to validate XML documents purporting to conform to their XML vocabularies. For example, I am writing this paper using a popular text editor (Emacs). Whilst it does not prevent me from adding any XML markup that I like, it is configured (via James Clark's nXML mode) to encourage me to author XML defined by the DocBook vocabulary, in conformance to the Relax NG schema provided by the organisers of XTech2005. Emacs warns me when I break any structure defined by the schema, and it prompts me with valid options as I navigate through the markup.
Once I have finished writing the paper I will submit it electronically via the conference Web site. The paper will be validated against the same schema I used to help me author it, to ensure that it will work with the publishing mechanisms used to generate a more human friendly view of the text than that which I see. If validation fails, submission will also fail and I will be told to fix the XML.
So schemas are undeniably useful. They provide a formal definition of an XML vocabulary. Furthermore, we can use schemas with appropriate software to help us to generate valid XML, and then again to check that XML is valid for processing.
There are several widely used XML schema languages available. The grammar-based languages (DTDs, W3C XML Schema (WXS) and Relax NG) define XML structures. With Schematron we define assertions against which to check patterns found in XML instances. I will not dwell on the structural languages here, but point the reader instead to Eric van der Vlist's excellent article on XML.com to learn more.
Every user of a schema language is likely to hit upon the constraints of the language at some point. For example, most authors of non-trivial DTDs or WXS schemas have come up against the ban on nondeterministic content models (which says, more-or-less, that a validator must be able to work out which sub-structure it is currently validating by looking only at the current element), and then they discover that Relax NG happily validates such structures.
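As an illustration, here is a minimal sketch of such a content model in the Relax NG XML syntax. The element names r, a, b and c are hypothetical. The equivalent DTD declaration, <!ELEMENT r ((a, b) | (a, c))>, is nondeterministic, because on meeting an <a/> element a validator cannot tell which branch of the choice it is in without looking ahead.

```xml
<!-- Hypothetical vocabulary: r contains either (a, b) or (a, c). -->
<element name="r" xmlns="http://relaxng.org/ns/structure/1.0">
  <choice>
    <group>
      <element name="a"><empty/></element>
      <element name="b"><empty/></element>
    </group>
    <group>
      <element name="a"><empty/></element>
      <element name="c"><empty/></element>
    </group>
  </choice>
</element>
```

A DTD or WXS author would normally have to refactor this to the deterministic form (a, (b | c)).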
Even more commonly felt is the inability to combine rule based assertions with structural definitions. Let's take a simple fictitious example. Suppose we have a fridge with some cartons of milk in it:
<fridge>
  <contents>
    <dairyProduce>
      <milk type="semi-skimmed">
        <volume uom="litres">3</volume>
      </milk>
      <milk type="full-fat">
        <volume uom="litres">3</volume>
      </milk>
    </dairyProduce>
  </contents>
</fridge>
We can assume that the structure of the XML instance maintained by the fridge is defined by an industry standard WXS schema (chilledML). WXS defines the structures and simple data types for the vocabulary perfectly adequately. However, our fridge is configurable. Amongst other things we can set the maximum volumes of its liquid contents. In a moment of extreme health consciousness, we have decided that the maximum allowed volume of full-cream milk at any one time is to be just 1 litre.
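For illustration, the part of chilledML that describes milk might look something like the following sketch. chilledML is fictitious, so every declaration here is an assumption made for the sake of the example. WXS can say that a volume is a decimal number carrying a uom attribute, but WXS 1.0 has no way of making the permitted range of <volume/> depend on the value of the type attribute of its parent.

```xml
<!-- Hypothetical fragment of chilledML: structure and simple data types only. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="milk">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="volume">
          <xs:complexType>
            <xs:simpleContent>
              <!-- The value is a decimal; the unit of measure is an attribute. -->
              <xs:extension base="xs:decimal">
                <xs:attribute name="uom" type="xs:string" use="required"/>
              </xs:extension>
            </xs:simpleContent>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="type" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```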
The contents of our fridge remain perfectly valid against the schema, but are now invalid against our rule. The schema cannot be altered, so what to do? The fridge could use a conventional programming language to define the constraint, or it could perhaps use a Schematron assertion to express the rule:
<rule context="milk[@type='full-fat']">
  <assert test="volume &lt;= 1">The registered user of this fridge says that
  there must be no more than 1 litre of full-fat milk in here.</assert>
</rule>
The Schematron rule defines a pattern in XPath. If the Schematron processor finds a match for the pattern (i.e. a <milk/> element with a type attribute with a value of 'full-fat') it will apply the assertion that the child element <volume/> must have a numeric value less than or equal to 1. If the assertion evaluates to false, then the Schematron processor produces a validation report containing the text of the assertion. A clever fridge might even associate such a Schematron assertion with the WXS schema in some way, so that it can validate its content all the more easily.
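A rule has to sit inside a pattern, and a pattern inside a schema, before a Schematron processor can run it. A minimal sketch of the complete schema, assuming the ISO Schematron namespace (a Schematron 1.5 processor would expect http://www.ascc.net/xml/schematron instead):

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- Fires for every full-fat milk element in the instance. -->
    <rule context="milk[@type='full-fat']">
      <!-- Note that '<' must be escaped inside an attribute value. -->
      <assert test="volume &lt;= 1">The registered user of this fridge says that
      there must be no more than 1 litre of full-fat milk in here.</assert>
    </rule>
  </pattern>
</schema>
```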
However, given that we already have a WXS schema, an equally neat way of expressing the constraint and then associating it with the schema is to use an XForms model.
Briefly, an XForms model is a useful bucket into which you can put XML instances, references to the schemas that describe them, and model item properties (MIPs).
This is not an exhaustive list, but is plenty for our purposes. Our interests lie mostly in MIPs, which are themselves a bit of a bucketful. In our example MIPs play the same role as Schematron assertions. They are defined using the XForms bind expression. A bind consists of an XPath defined pattern from one of the instances referenced within the XForms model, associated with one or more assertions. The following MIPs are available: type, readonly, required, relevant, calculate, constraint and p3ptype.
To our fridge, MIPs are a convenient means of decorating the structures defined by the WXS schema with some extra assertions expressed in XPath 1.0. Crucially to our fridge, the XForms model defines how the application of MIPs to an XML instance must combine with the structural and data type validation provided by an underlying schema. In XForms the schema is a first class validation object, to which MIP defined assertions are subsidiary. For example, it is an error for MIPs to contradict a schema in any way, and an XML instance is validated against any associated schema before MIPs are applied.
However, it is legal to define MIPs that further constrain schema defined structures and values in an instance. In this context we want to constrain the value of a node defined in a schema to be a positive integer. First we need to define our model:
<xfm:model xmlns:xfm="http://www.w3.org/2002/xforms" schema="chilledML.xsd">
  <xfm:instance src="my_fridge.xml" id="fridgeML"/>
</xfm:model>
Note that the schema attribute on the <model/> element contains a space-separated list of schema URIs. The <instance/> element refers to our external source XML document, and gives it an id so that we can refer to it within the model. An XForms processor must attempt to validate instances against any schemas listed in the schema attribute, ignoring any schemaLocation or noNamespaceSchemaLocation attributes within the instances.
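For instance, a model that draws on two schemas might be declared as in the sketch below. The second schema name, fridgeSettings.xsd, is hypothetical and is there only to show the list syntax.

```xml
<!-- Two schemas, separated by white space, in a single schema attribute. -->
<xfm:model xmlns:xfm="http://www.w3.org/2002/xforms"
           schema="chilledML.xsd fridgeSettings.xsd">
  <xfm:instance src="my_fridge.xml" id="fridgeML"/>
</xfm:model>
```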
We need to constrain the allowed values of the <volume/> element. Unsurprisingly we express the assertion with the constraint MIP:
<xfm:model xmlns:xfm="http://www.w3.org/2002/xforms" schema="chilledML.xsd">
  <xfm:instance src="my_fridge.xml" id="fridgeML"/>
  <xfm:bind nodeset="instance('fridgeML')/contents/dairyProduce/milk[@type = 'full-fat']"
            constraint="volume &lt;= 1"/>
</xfm:model>
MIPs are expressed using the <bind/> element. The nodeset attribute value is an XPath 1.0 expression that defines the nodeset to be constrained. The constraint attribute defines the constraint itself, again in terms of XPath 1.0. We end up with an expression not a million miles away from the Schematron assertion we started with.
Note the use of the instance() function in the nodeset attribute. This is a bit like using document() in XSLT 1.0: it allows us to refer to an instance by the value of its id. Strictly speaking we don't need to use instance() here, because XForms defines the first instance declared in a model to be the default. However, making the XML instance explicit by using instance() is safest. After all, we might do the following:
Supposing our fridge stores configurable settings in an XML instance like this:
<fridge-settings>
  <full-cream-milk>1</full-cream-milk>
</fridge-settings>
We can alter the XForms model to include our fridge settings, and use them directly to constrain the amount of creamy milk we store:
<xfm:model xmlns:xfm="http://www.w3.org/2002/xforms" schema="chilledML.xsd">
  <xfm:instance src="my_fridge_settings.xml" id="fridgeSettings"/>
  <xfm:instance src="my_fridge.xml" id="chilledML"/>
  <xfm:bind nodeset="instance('chilledML')/contents/dairyProduce/milk[@type='full-fat']"
            constraint="volume &lt;= instance('fridgeSettings')/full-cream-milk"/>
</xfm:model>
Our fridge now constrains the <volume/> child of the full-fat <milk/> element in our chilledML document to a value less than or equal to the value of the <full-cream-milk/> element in the fridge settings XML document. It would be just as easy to write MIPs that make optional elements or attributes mandatory, or to restrict the number of times an element repeats, for example.
The relevance MIP is another useful tool in the box. We use it not to state that a given node is invalid, but to state that the node is simply not relevant to the instance when the MIP evaluates to false. The non-relevant node and its descendants are treated as if they no longer exist, until the MIP evaluates to true again.
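A sketch of what such MIPs might look like follows. The limit of four cartons, and the decision to insist on a <saladStuff/> element, are assumptions made purely for illustration. Because a bind only applies to nodes that actually exist in the instance, the presence of an optional child is asserted here with a constraint on its parent, rather than with a required MIP on the child itself.

```xml
<!-- Make the optional saladStuff structure mandatory: contents must hold exactly one. -->
<xfm:bind nodeset="instance('chilledML')/contents"
          constraint="count(saladStuff) = 1"/>
<!-- Restrict repetition: no more than four milk cartons (hypothetical limit). -->
<xfm:bind nodeset="instance('chilledML')/contents/dairyProduce"
          constraint="count(milk) &lt;= 4"/>
```

Each bind targets a different node, so neither conflicts with the constraint already placed on the full-fat <milk/> element.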
Let us suppose for a moment that we sometimes store salad stuff in our fridge. chilledML defines an optional structure to cater for such an eventuality:
<fridge>
  <contents>
    <saladStuff>
      <lettuce>
        <quantity>2</quantity>
      </lettuce>
      <tomato>
        <weight uom="kg">0.25</weight>
      </tomato>
    </saladStuff>
  </contents>
</fridge>
chilledML also has an optional structure to allow salad dressings. However, our configurable fridge allows us to say that we are only interested in salad dressing if we have some lettuce.
<xfm:bind nodeset="instance('chilledML')/contents/saladDressing"
          relevant="instance('chilledML')/contents/saladStuff/lettuce"/>
We are not saying that there must be salad dressing if we have lettuce, simply that if we have lettuce, then salad dressing matters. If we don't have lettuce, then we don't need to worry about salad dressing. If we felt more strongly about our salad dressing, we might add a required MIP, so that not only would salad dressing be relevant should we have lettuce, but it becomes mandatory. Furthermore, we could use a calculate MIP to determine exactly how much salad dressing we require, given a certain quantity of lettuce.
<xfm:bind nodeset="instance('chilledML')/contents/saladDressing"
          relevant="instance('chilledML')/contents/saladStuff/lettuce"/>
<xfm:bind nodeset="instance('chilledML')/contents/saladDressing/mayonnaise"
          required="true()"/>
<xfm:bind nodeset="instance('chilledML')/contents/saladDressing/mayonnaise/volume"
          calculate="30 * instance('chilledML')/contents/saladStuff/lettuce/quantity"/>
We have gone beyond the scope of validation with this last set of MIPs, as we now have a rule intended to modify an XML instance should the need arise. However, the third bind, which calculates the amount of mayonnaise we need, could just as easily have been a constraint. Even so, the relevance MIP itself does more than validate. A node that is not relevant is treated as if it doesn't exist by an XForms processor. It is actually pruned from the tree at the point of saving an XML instance that has been used within an XForm.
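As a sketch of that alternative, the calculate bind could be replaced by a constraint that checks the stored volume rather than overwriting it. The factor of 30 is carried over from the example above, and the bind is meant to take the place of the calculate bind, not to sit alongside it.

```xml
<!-- Validates, rather than computes, the amount of mayonnaise. -->
<xfm:bind nodeset="instance('chilledML')/contents/saladDressing/mayonnaise/volume"
          constraint=". = 30 * instance('chilledML')/contents/saladStuff/lettuce/quantity"/>
```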
Thus far we have a WXS schema that defines XML structures and their values, formally combined with a pattern based assertion. Usefully we have not had to alter our WXS schema to achieve this. Additionally, we know that any MIPs we define will not contradict the underlying WXS schema, so any instance validated by our XForms model will be valid against the WXS schema.
All this is good, but we still haven't replicated quite all of the Schematron assertion we began with. The assertion includes some text from which a useful validation report can be built, whilst our MIPs are so far mute. XForms actually defines four elements that we ought to be able to use to great effect in this context: <label/>, <help/>, <hint/> and <alert/>.
Each of these elements is designed to provide some kind of explanation to a human, and very usefully, their contents can be generated dynamically at runtime by using them in conjunction with the XForms <output/> element. Of these <alert/> is probably the most appropriate, as it is associated with validation errors. Ideally we would like to be able to do something like this:
<xfm:bind nodeset="instance('chilledML')/contents/dairyProduce/milk[@type = 'full-fat']"
          constraint="volume &lt;= instance('fridgeSettings')/full-cream-milk">
  <xfm:alert>
    <xfm:output value="concat('The amount of full-cream milk must not exceed ',
                       instance('fridgeSettings')/full-cream-milk)"/>
  </xfm:alert>
</xfm:bind>
However the use of alert as a child of bind is not permitted in XForms 1.0. To use alert, and to build a human-readable validation report we have to make use of the UI part of XForms. This is unfortunate not so much because we end up designing a human readable validation report with the XForms UI, but because the resources associated with validation are now split across the XForms model and the UI.
Ideally alerts would be defined inside the model, associated with a particular MIP. They could then manifest themselves as appropriate in some kind of report, XForms based or otherwise. An alert bound to a node in an XML instance is just not as useful as an alert bound to a MIP. XML nodes can potentially become invalid for several reasons, and an alert bound to such a node is presented no matter why the node is invalid. Thus it must be general purpose. On the other hand, an alert bound to a MIP could be authored to reflect the validity test in its parent MIP.
That said, it is possible to build a reasonable XForms UI based validation report. This can either be made read-only, or can take advantage of the fact that XForms is designed to allow user interaction. In the latter case a report not only tells a human what is wrong with an XML instance, but can allow the human to make corrections. If nothing else this makes an XForms based validation service very useful as a debugging tool for developers.
If we take our original constraint, we can build a simple validation report with the following markup:
<xfm:input bind="full-cream-milk">
  <xfm:label>Quantity of Full Cream Milk</xfm:label>
  <xfm:help>Enter a whole number between 0 and <xfm:output
    value="instance('fridgeSettings')/full-cream-milk"/>
  </xfm:help>
  <xfm:hint>The quantity of full-cream milk currently in your fridge.</xfm:hint>
  <xfm:alert>The quantity of full cream milk is not an allowed amount. <br/> There can be no
    more than <xfm:output
    value="concat(instance('fridgeSettings')/full-cream-milk, ' litre(s).')"/>
  </xfm:alert>
</xfm:input>
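The bind attribute on <input/> refers to a <bind/> element by its id, so the model needs a bind with a matching id. A minimal sketch of how the model and the control might sit together in an XHTML host document follows. It assumes that the bind targets the <volume/> node itself, so that the control bound to it reports that node's validity; the page title is hypothetical.

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xfm="http://www.w3.org/2002/xforms">
  <head>
    <title>Fridge validation report</title>
    <xfm:model schema="chilledML.xsd">
      <xfm:instance src="my_fridge.xml" id="chilledML"/>
      <xfm:instance src="my_fridge_settings.xml" id="fridgeSettings"/>
      <!-- The id lets UI controls refer to this bind with bind="full-cream-milk". -->
      <xfm:bind id="full-cream-milk"
                nodeset="instance('chilledML')/contents/dairyProduce/milk[@type='full-fat']/volume"
                constraint=". &lt;= instance('fridgeSettings')/full-cream-milk"/>
    </xfm:model>
  </head>
  <body>
    <!-- The input control bound to "full-cream-milk" goes here. -->
  </body>
</html>
```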
This XForms markup can be embedded in XHTML, or any other suitable host language. If supported, CSS can be used to control the styling of the report. So, assuming an invalid instance, a simple report might look like this:
[Figure: rendered validation report for the invalid instance]
If I change the value of full-cream milk to a valid quantity, the form updates to display:
[Figure: rendered validation report for the valid instance]
We have already discussed one limitation of XForms for validation: the fact that a user will be presented with the same alert whether a node is invalid because it has a value outside a permitted range, or its value is a string when it should be an integer. Unfortunately there are others.
Perhaps most significant is the fact that the XForms recommendation prohibits an author from defining assertions for a node with more than one bind statement. For example, it is not legal to define one assertion that node foo must have a value of less than 10 if node bar has a positive value, and another assertion that node foo must have a value greater than 10 if node bar has a value that is negative. Instead all assertions must be combined into one, often unwieldy, XPath expression. This is unsatisfactory as assertions become difficult to understand and to write. Perhaps more significant is the fact that once again alert is made less useful. The author cannot compose an alert for each logical assertion, but only a general purpose one that covers all eventualities.
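To illustrate, the two assertions about foo and bar have to be folded into a single constraint along the following lines. This is a sketch in which foo and bar are assumed to be sibling children of the root element of the default instance.

```xml
<!-- "bar positive implies foo < 10" AND "bar negative implies foo > 10" -->
<xfm:bind nodeset="foo"
          constraint="(not(../bar &gt; 0) or . &lt; 10) and
                      (not(../bar &lt; 0) or . &gt; 10)"/>
```

Each implication is written as "not(condition) or consequence", which is already harder to read than two separate assertions would be, and the expression grows with every rule added.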
Other limitations are as much about authoring convenience as anything else. For example, XForms allows bind statements to be nested. This should be a very useful feature; however, it is currently defined in such a way as to make it useless.
Despite the limitations, XForms shows great potential as a technology for both expressing pattern-based assertions, in the style of Schematron, and combining them in a useful way with the structural validation defined by WXS. XForms provides developers with a very neat mechanism for enhancing validity rules defined in WXS schemas without having to alter an original schema in any way.
This is particularly useful for people relying on industry standard schemas, such as those produced for the UK Life Assurance Industry by the organisation I work for, Origo Services. In these cases the WXS schema is effectively read-only, but is often not prescriptive enough to be useful for anything but the loosest validation. To be able to constrict schema definitions by making particular optional structures mandatory, and others non-relevant, or by restricting enumerations, is very useful indeed. Even for WXS schema designers there are occasionally constraints that can only be expressed as XPath based assertions. For that reason Origo has taken the decision to issue all new schemas with XForms models, where necessary. Although, initially at any rate, these XForms models should be viewed primarily as machine-readable documentation, Origo has also undertaken to provide XForms UIs along with the XForms models it produces, primarily for use as debugging tools for developers.
The relatively simple combination of a WXS schema with XForms MIPs to create a more comprehensive validation service is probably only a small part of what is needed in the long-run. Given that the XML documents defined by organisations like Origo often have a life in several different contexts (as they move through and between organisations), it is conceivable that one WXS schema and one XForms model will be insufficient to provide complete validation. It is not unusual for three or more organisations to be involved in populating and validating a single XML instance. In such circumstances it would be useful to be able to chain together, or layer validation services, XForms model based or otherwise.
So, finally, is XForms a suitable candidate for an XML validation tool? It is not perfect by any means, but it shows great potential, and its underlying design is elegant and robust. The XForms model is by and large about defining a rich validation context for one or more XML instances. The XForms UI allows authors to build a human-friendly, interactive view of XML instances that sit within that context. So XForms obviously has a great deal to say about XML validation where human interaction is key. Less obviously perhaps, the design of XForms offers the potential for the model to be used in situations where humans play a smaller role. It is possible to execute an XForm with no human intervention at all. So, perhaps it is too early to say how widely XForms will be used for its validating properties, but there is certainly considerable potential, and much useful work can be done with XForms already.
Mark Seaborne
Standards Architect, Origo Services Ltd http://www.origoservices.com
Mark has been working with XML since 1998. He currently works on b-to-b XML message standards for use in the UK Life Insurance Industry. He represents the industry as a member of the W3C XForms WG.