XTech 2005: XML, the Web and beyond.
XML has enjoyed great success as the format of choice for document-oriented content in single-source publishing solutions. In such applications, content held in XML can be presented in many different ways simply by using different stylesheets that describe presentation semantics. However, making the most of separating content and presentation with data-oriented XML requires a more complex publishing framework. Using a pipeline of simple components that individually filter, group, annotate and so on can make complex transformations easier to write and debug and can enhance the reusability of code. Configuring those components using custom markup languages enables non-programmers to adjust pipelines, for example enabling designers to edit a document template to alter the layout and style of information on a page.
One of the great benefits of XML is that it facilitates single-source publishing: the content of your documents is written in a single XML source document, which can then be presented in many different ways.
Back in pre-XML days, if you wanted to make a document available both on the web and in print, you probably had two copies of that document: one in HTML, on the website; another in Word or some other word-processor, which you could print out. Not only did that mean you were using double the disk space, but managing this duplication was hard work in three ways:
How does XML help with these problems? Well, in a single-source publishing framework, you have a single XML document holding the content of a document. If you want to change that content, there's only one place where you have to do it. Each presentation format for the document is controlled by a stylesheet, so if you need to update the appearance of your entire website, you just change the stylesheet for that website. And this separation of content and presentation means that authors, designers and technical experts can work in parallel, with authors focussing on changes to the XML document, designers on changes to the stylesheet, and technical experts on the processes that perform the transformations.
Separating content from presentation is part of the mantra of XML, and many organisations have adopted XML on the strength of the single-source publishing vision described above. But in real life, things are usually more complicated than that. Have a look at the following example.
A building society needs to communicate with its customers to provide monthly statements, yearly financial summaries, notification letters (such as when a mortgage or loan has been accepted or rejected) and other messages of varying kinds.
The raw information used to create the building society's letters originates in several databases. The XML that's extracted from these databases is in a standard format that usually needs to be filtered, sorted, grouped and generally re-jigged to focus on the information that's really relevant for a particular message.
The letters themselves include content written by marketing and legal specialists within the building society, and they're constantly revising the content of these letters. In addition, the building society prides itself on personalising its communication with customers: the native language of the customer is used in all communication; they can choose a variety of channels – hard-copy letters, emails, web pages, even text messages; and the age of the customer is used to determine whether a formal or informal style is used. Subsidiaries of the building society use different letterheads and different website designs, so although the content might usually stay the same, it has to adapt to fit different layouts and styles; plus, the designers often change the way the pages look.
The above example illustrates several factors that come into play in complex document-generation scenarios:
In this paper, I'll discuss how the use of pipelines and XML-based configuration can help with the management of complex document generation.
If you try to carry out complex document-generation such as that described above using a single transformation step, you quickly run into problems. Programs that carry out complex transformations are unwieldy, hard to understand and therefore hard to debug and maintain, simply because they need to do so much.
In situations where the same XML gets transformed to a range of output formats, using separate programs for each transformation leads to more problems. You get repetition of code: whether a bank statement is rendered in HTML, XSL-FO or plain text, the content still has to be filtered and grouped in the same way. You get version-control problems: if the filtering or grouping of the data in the display needs to change, then every program needs to be updated to match.
Pipelining is a technique that helps you break up transformations in a way that maximizes their simplicity (so that they're easy to write and maintain) and reusability (so that the same code is used whenever the same operation needs to take place). In a pipeline, a complex transformation is broken up into several simple component transformations, with the output of earlier components becoming the input to later components. Not only are the individual transformations easier to write, but creating a pipeline helps you to debug the process as a whole by testing the input/output of individual components in the pipeline. What's more, at run-time, the output from one step in the pipeline can be cached and reused, which saves processing time.
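To give a feel for how small an individual component can be, here is a minimal sketch of a filtering component written in XSLT 1.0. It passes its input through unchanged apart from comments and processing instructions, which it drops. The choice of what to filter out is purely an assumption made for the sake of illustration.

```xml
<?xml version="1.0"?>
<!-- Filtering component: copies the document, minus comments and
     processing instructions. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node that no other template handles. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Filter: an empty template produces no output for the nodes it matches.
       The explicit priority is needed because comment() and node() would
       otherwise have the same default priority. -->
  <xsl:template match="comment()|processing-instruction()" priority="1" />

</xsl:stylesheet>
```

Because the component does only one thing, it can be tested on its own and reused in any pipeline that needs the same clean-up.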
XML pipeline components usually accept XML as input and generate some XML as output. They generally fall into the following categories, though sometimes a component will combine two or more tasks:
We're used to viewing attribute and element defaulting as an operation that should be carried out by a validator, since both DTDs and XML Schema support the ability to indicate the default or fixed value of an attribute (or element in the case of XML Schema). However, as in RELAX NG, validation and annotation should be seen as distinct operations.
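As a sketch of annotation carried out as its own step, independent of any validator, the following XSLT 1.0 component adds a default value for a missing attribute. The `transaction` element and its `currency` attribute with a default of `GBP` are hypothetical names chosen for the example.

```xml
<?xml version="1.0"?>
<!-- Annotating component: supplies a default attribute value. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Copy everything that is not explicitly annotated. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical default: a transaction with no currency is in GBP. -->
  <xsl:template match="transaction[not(@currency)]">
    <xsl:copy>
      <xsl:attribute name="currency">GBP</xsl:attribute>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```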
The components themselves can be written in any language. In many cases, the appropriate language for an XML transformation is XSLT, since XSLT is designed for doing XML transformations, but more general-purpose programming languages can be used instead.
A pipeline may contain any number and any mix of component transformations. In general, filtering, annotating and restructuring occur before padding in a pipeline, and padding before translating. If necessary, generating comes first and rendering last: getting the data into an XML format as soon as possible eases the transformation process as a whole.
Creating individual components is all very well, but the point of a pipeline is that they need to fit together. Pipelines can vary a great deal in complexity. While a simple pipeline is just a sequence of components where the output of each directly forms the input of the next, more complex pipelines have to deal with the following:
Each pipeline component needs to be able to take at least one XML document as input and generate at least one XML document as output. The form of those XML documents depends on the technology that manages the overall pipeline process. The options are:
Serialized XML documents are a stream of bytes representing an XML document in a particular character encoding.
One advantage of this format is that it's very low-level and each piece of markup that could possibly be significant will be reported to a component (you don't have to worry about whitespace, comments or processing instructions being ignored, for example). Using serialized XML documents also makes it easy to cache the results of a component, so that it can be reused, and to check the input and output of a component for debugging.
The biggest disadvantage is that both parsing and serializing XML can be comparatively lengthy operations, and it's wasteful for each component to have to parse a document that was only serialized by the previous component in the pipeline. Another disadvantage is that XML documents can only contain what XML documents can contain: they can't
SAX events are the most common method of passing around XML documents between components. Each pipeline component generates a stream of SAX events, which is consumed by the next component in the pipeline.
The biggest advantage of SAX events is that the document never has to be held in memory as a whole, which means that SAX-based pipelines can be very efficient, but this is only the case if the components themselves don't need to create a more permanent internal representation of the XML document. (XSLT-based components can't take advantage of the streaming nature of SAX events; STX can.)
The disadvantage of using SAX events is that SAX itself isn't very flexible in terms of the information that it can pass around. In particular, it can't be used to pass around the typing information generated after validation against a schema (a PSVI). But this is only a disadvantage of SAX (which is purposefully kept simple) rather than of all event-based APIs.
The final alternative is using an object model of an XML document, commonly a W3C DOM, XOM or a custom object model determined by the pipeline framework.
An advantage of using an object model is that it can include additional information that can't be represented in either a serialized XML document or using SAX events. Most notably, object models have the flexibility to hold a PSVI, which can be useful in components based on XSLT 2.0 or XQuery.
The disadvantage of using an object model is that it takes up memory, though this is less of a disadvantage if the object model is in a form that can be used as-is within a component. For example, if you pass an ordinary DOM as the source document for an XSLT processor, it will usually translate this DOM into an internal representation that optimizes the kind of access required by an XSLT processor; essentially, the same document is held in memory twice. If the same XSLT processor can be used in all components, then the optimized tree can be passed between components, cutting down on memory consumption.
Fitting the components together into a pipeline is usually the job of an XML application server. There are a number of XML application servers around that use pipelines to create documents, such as Cocoon, AxKit, Jelly, Orbeon or Thunderhead. Each of these application servers uses an XML-based configuration file to define the pipeline that's used when a particular document is requested. There are also several more experimental pipeline definition languages that do the same kind of thing.
For simple pipelines, you can use Ant or just write your own code.
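A simple three-step pipeline driven by Ant might look something like the following sketch. It uses Ant's `xslt` task with its `in`, `out` and `style` attributes; all the file, directory and target names are hypothetical. Each step writes a serialized XML document to disk, so intermediate results can be inspected when debugging.

```xml
<?xml version="1.0"?>
<!-- All file, directory and target names are hypothetical. -->
<project name="statement-pipeline" default="translate" basedir=".">

  <target name="prepare">
    <mkdir dir="build" />
  </target>

  <!-- Step 1: annotate the raw data. -->
  <target name="annotate" depends="prepare">
    <xslt in="statement.xml" out="build/annotated.xml"
          style="annotate.xsl" />
  </target>

  <!-- Step 2: pad the data with document content. -->
  <target name="pad" depends="annotate">
    <xslt in="build/annotated.xml" out="build/document.xml"
          style="template.xsl" />
  </target>

  <!-- Step 3: translate the document into HTML. -->
  <target name="translate" depends="pad">
    <xslt in="build/document.xml" out="build/statement.html"
          style="to-html.xsl" />
  </target>

</project>
```

The `depends` attributes fix the order of the steps, which is all the control a purely sequential pipeline needs.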
Adopting a pipelining approach to generating documents helps divide complex document-generation tasks into simpler steps. However, it doesn't address the problem of ensuring that these simple steps can be controlled by the appropriate person. For example, the "padding" component in a pipeline focuses on creating human-readable documents in which data (from an XML document) is arranged, such as the congratulatory paragraphs that accompany a successful loan application. The padding itself needs to be provided by a marketing expert rather than a technical expert, but if the component needs to be written in XSLT (say) then it's going to be difficult for the marketing expert to do it without a lot of help. This is where the second key in complex document generation comes into play: configuration.
Every component in a pipeline will need to be configured in some way, and the more general a component is, the more configuration it requires. An XInclude processor component, which simply included files according to the XInclude specification, wouldn't require any configuration, but an XSLT processor component would need at least a stylesheet and possibly settings for stylesheet parameters as well. For the appropriate person to be able to configure a component in a pipeline, the configuration for that component has to be within their capabilities.
There are two kinds of components that require particular care: padding (where document-oriented content is added to data) and translating (where an XML document is styled for a particular output format). We'll look at each of these briefly here, and discuss the ways in which personalisation might occur within a pipeline. The example we'll examine uses a bank statement, shown below.
<statement xmlns="http://www.example.org/bank/statement">
<account branch="12-3456" number="12345678" type="BCA">
<name>Acme Consultants</name>
</account>
<balance date="2002-10-31">7254.98</balance>
<transactions>
<transaction date="2002-11-04" type="DPC" payee="PAYE"
amount="262.84" />
<transaction date="2002-11-04" type="DPC" payee="MD"
amount="1275.63" />
<transaction date="2002-11-04" type="DPC" payee="NI"
amount="279.48" balance="5437.03" />
<transaction date="2002-11-12" type="BAC" payer="BOB"
amount="2578.28" balance="8015.31" />
<transaction date="2002-11-29" type="CHQ" ref="000011"
amount="5600.00" balance="13615.31" />
</transactions>
</statement>
The dream of separating content and presentation originates from the publishing arena, where an XML document might hold the content of a journal article or a book chapter. In these document-oriented applications, the content of the XML document source is largely the same as the content in the final presentation – you might add a few numbers here and there, and resolve a few references, but there's no large-scale addition of content to the page.
In data-oriented applications, however, the content on (say) a web page will usually be sourced from several different locations:
The content/presentation division in this scenario is a lot more blurred than in a publishing context. You could class both the conditional and static content as presentational: after all, the true content – the data – could be presented in many other ways. But often conditional and static content needs to be replicated across presentations – the bank's address might appear in both a web page and a paper version of a bank statement.
A good way to think about this process, then, is as a pipeline with two stages:
Let's look at the transformation of the data-oriented XML to document-oriented XML more closely. Transforming data-oriented XML to document-oriented XML usually requires a different kind of stylesheet from that used when styling document-oriented XML. This stylesheet usually looks like the document being generated, so is usually referred to as a document template. Document templates contain:
If you have any knowledge of XSLT, you'll recognise that XSLT can serve
as the language in which a document template is written, and that simplified
stylesheets (in which the document element is a literal result element rather
than an xsl:stylesheet or xsl:transform
element) are particularly suited to the task. XSLT's
xsl:value-of instruction accesses and inserts information
from the source XML; xsl:if, xsl:choose
and xsl:for-each instructions provide conditional content;
and xsl:copy-of, together with the document()
function, can insert content from elsewhere.
For example, an XSLT document template for the bank statement we're using as an example might look like that shown below.
<doc xmlns="http://www.example.org/document"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:st="http://www.example.org/bank/statement"
xsl:version="1.0">
<title>Bank Statement</title>
<xsl:copy-of select="document('bank-details.xml')" />
<xsl:for-each select="st:statement/st:account">
<table>
<row>
<cell>Branch No</cell>
<cell><xsl:value-of select="@branch" /></cell>
<cell>Account No</cell>
<cell><xsl:value-of select="@number" /></cell>
</row>
<row>
<cell>Statement of</cell>
<cell cols="3">
<xsl:choose>
<xsl:when test="@type = 'BCA'">Business Current Account</xsl:when>
...
</xsl:choose>
</cell>
</row>
<row>
<cell>Account for</cell>
<cell cols="3"><xsl:value-of select="st:name" /></cell>
</row>
</table>
</xsl:for-each>
<table>
<row>
<cell>Particulars</cell>
<cell>Withdrawn</cell>
<cell>Paid In</cell>
<cell>Date</cell>
<cell>Balance</cell>
</row>
<xsl:for-each select="st:statement/st:balance">
<row>
<cell>Balance Forward</cell><cell /><cell />
<cell><xsl:value-of select="@date" /></cell>
<cell><xsl:value-of select="." /></cell>
</row>
</xsl:for-each>
<xsl:for-each select="st:statement/st:transactions">
<xsl:for-each select="st:transaction">
<row>
<cell>
<xsl:value-of select="@type" />
<xsl:text> </xsl:text>
<xsl:value-of select="@payee | @payer | @ref" />
</cell>
<cell>
<xsl:if test="@payee">
<xsl:value-of select="@amount" />
</xsl:if>
</cell>
<cell>
<xsl:if test="not(@payee)">
<xsl:value-of select="@amount" />
</xsl:if>
</cell>
<cell><xsl:value-of select="@date" /></cell>
<cell><xsl:value-of select="@balance" /></cell>
</row>
</xsl:for-each>
<row>
<cell>Totals</cell>
<cell><xsl:value-of select="sum(st:transaction[@payee]/@amount)" /></cell>
<cell><xsl:value-of select="sum(st:transaction[not(@payee)]/@amount)" /></cell>
<cell><xsl:value-of select="st:transaction[last()]/@date" /></cell>
<cell><xsl:value-of select="st:transaction[last()]/@balance" /></cell>
</row>
</xsl:for-each>
</table>
<xsl:copy-of select="document('disclaimer.xml')" />
</doc>
However, XSLT is rarely used as the language for document templates, the main reason being that document templates need to be easy for non-programmers to edit, since it's non-programmers who have to write the majority of the content that goes into document templates. As you can see from the above listing, XSLT document templates involve complexities such as namespaces and advanced XPath syntax that are beyond the expertise of most authors. Instead, organisations often develop specialist document template languages.
Specialist document templates are built around particular applications, and don't have to cater for the wide variety of markup languages supported by XSLT. In some cases, document templates are actually written using editors that are designed for the target document-oriented markup language. This is particularly the case when that markup language is XHTML, for which there are multitudes of editors available. Alternatively, some XML editors such as XMetaL, FrameMaker, or Word 2003, can be customised to support WYSIWYG-style editing of specialised vocabularies. In other cases, authors are comfortable with editing the raw document-oriented XML, but don't want to have to worry about XSLT and XPath syntax.
Typically, the various instructions that describe how to construct the page are inserted as comments, specialist elements that are ignored by the editing environment, or (more rarely) as processing instructions. The instructions themselves can be specialised: they need only deal with the kind of XML that the document template will be used to access. The following listing shows an example of what a specialist document template for bank statements might look like.
<doc xmlns="http://www.example.org/document">
<title>Bank Statement</title>
<include href="bank-details.xml" />
<table>
<row>
<cell>Branch No</cell>
<cell><data insert="branch-number" /></cell>
<cell>Account No</cell>
<cell><data insert="account-number" /></cell>
</row>
<row>
<cell>Statement of</cell>
<cell cols="3"><data insert="account-type" /></cell>
</row>
<row>
<cell>Account for</cell>
<cell cols="3"><data insert="account-name" /></cell>
</row>
</table>
<table>
<row>
<cell>Particulars</cell>
<cell>Withdrawn</cell>
<cell>Paid In</cell>
<cell>Date</cell>
<cell>Balance</cell>
</row>
<row>
<cell>Balance Forward</cell><cell /><cell />
<cell><data insert="statement-date" /></cell>
<cell><data insert="balance-brought-forward" /></cell>
</row>
<repeat each="transaction">
<row>
<cell>
<data insert="transaction-particulars" />
</cell>
<cell>
<if test="transaction-withdrawl">
<data insert="transaction-amount" />
</if>
</cell>
<cell>
<if test="!transaction-withdrawl">
<data insert="transaction-amount" />
</if>
</cell>
<cell><data insert="transaction-date" /></cell>
<cell><data insert="transaction-balance" /></cell>
</row>
</repeat>
<row>
<cell>Totals</cell>
<cell><data insert="total-withdrawn" /></cell>
<cell><data insert="total-paid-in" /></cell>
<cell><data insert="last-transaction-date" /></cell>
<cell><data insert="final-balance" /></cell>
</row>
</table>
<include href="disclaimer.xml" />
</doc>
Specialist document templates hide a lot of the complexity involved in transforming from data-oriented to document-oriented XML. An instruction such as:
<data insert="last-transaction-date" />
might involve the application that processes the document template retrieving the date of the final transaction via a complex path through the source XML, and then formatting the result using a particular date format.
The templates can be processed through specialist software, but often they are processed using XSLT. When document templates are simple, they are usually processed as another input, alongside the data XML. The XSLT code interprets the instructions embedded within the document template, using code such as that shown below.
<xsl:template match="doc:data[@insert = 'last-transaction-date']">
<xsl:call-template name="format-date">
<xsl:with-param name="date"
select="$statement/st:transactions/st:transaction[last()]/@date" />
</xsl:call-template>
</xsl:template>
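The `format-date` template called here is not shown in the paper. A minimal sketch of what it might contain, assuming that the date arrives in the `YYYY-MM-DD` form used in the statement XML and that the desired output is of the form `29 NOV 2002`, is:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical implementation: turns 2002-11-29 into 29 NOV 2002. -->
  <xsl:template name="format-date">
    <xsl:param name="date" />
    <!-- Month abbreviations packed into one string, three letters each. -->
    <xsl:variable name="months"
                  select="'JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC'" />
    <xsl:variable name="month" select="number(substring($date, 6, 2))" />
    <!-- number() drops any leading zero from the day. -->
    <xsl:value-of select="number(substring($date, 9, 2))" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="substring($months, ($month - 1) * 3 + 1, 3)" />
    <xsl:text> </xsl:text>
    <xsl:value-of select="substring($date, 1, 4)" />
  </xsl:template>

</xsl:stylesheet>
```

Keeping the formatting in one named template means that a change of date format only has to be made in one place, however many keywords insert dates.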
The benefits of specialist document templates – hiding the complexity of accessing and formatting data within separate code – soon become limits. The software that processes the document template needs to anticipate all the possible types of data that might be inserted into the document, and how they should be formatted. Authors are very keen on having complexity hidden from them until they need to do something complex, at which point they quickly become frustrated with having to negotiate with a programmer in order to insert unanticipated data in unforeseen formats.
An alternative approach is to provide a middle ground that combines the flexibility of XSLT document templates with the ease-of-use of specialist document templates. This is achieved by creating a document template language that has the full power of XPath for accessing data, but adding a pre-processing step to the pipeline so that data that is often accessed is available through easy paths, and data is reformatted in standard ways.
For example, a pre-processing step might take the bank statement XML and augment it with properly formatted dates, expansions of codes, totals and so on, so that it eventually looked like that shown below.
<statement xmlns="http://www.example.org/bank/statement">
<branch-number>12-3456</branch-number>
<account-number>12345678</account-number>
<account-name>Acme Consultants</account-name>
<account-type>Business Current Account</account-type>
<account branch="12-3456" number="12345678" type="BCA">
<name>Acme Consultants</name>
</account>
<statement-date>31 OCT 2002</statement-date>
<balance-brought-forward>7254.98</balance-brought-forward>
<balance date="2002-10-31">7254.98</balance>
<transactions>
<transaction date="2002-11-04" type="DPC" payee="PAYE"
amount="262.84">
<transaction-particulars>DPC PAYE</transaction-particulars>
<transaction-withdrawl>true</transaction-withdrawl>
<transaction-date>4 NOV 2002</transaction-date>
<transaction-amount>262.84</transaction-amount>
<transaction-balance />
</transaction>
<transaction date="2002-11-04" type="DPC" payee="MD"
amount="1275.63">
<transaction-particulars>DPC MD</transaction-particulars>
<transaction-withdrawl>true</transaction-withdrawl>
<transaction-date>4 NOV 2002</transaction-date>
<transaction-amount>1275.63</transaction-amount>
<transaction-balance />
</transaction>
<transaction date="2002-11-04" type="DPC" payee="NI"
amount="279.48" balance="5437.03">
<transaction-particulars>DPC NI</transaction-particulars>
<transaction-withdrawl>true</transaction-withdrawl>
<transaction-date>4 NOV 2002</transaction-date>
<transaction-amount>279.48</transaction-amount>
<transaction-balance>5437.03</transaction-balance>
</transaction>
<transaction date="2002-11-12" type="BAC" payer="BOB"
amount="2578.28" balance="8015.31">
<transaction-particulars>BAC BOB</transaction-particulars>
<transaction-withdrawl>false</transaction-withdrawl>
<transaction-date>12 NOV 2002</transaction-date>
<transaction-amount>2578.28</transaction-amount>
<transaction-balance>8015.31</transaction-balance>
</transaction>
<transaction date="2002-11-29" type="CHQ" ref="000011"
amount="5600.00" balance="13615.31">
<transaction-particulars>CHQ 000011</transaction-particulars>
<transaction-withdrawl>false</transaction-withdrawl>
<transaction-date>29 NOV 2002</transaction-date>
<transaction-amount>5600.00</transaction-amount>
<transaction-balance>13615.31</transaction-balance>
</transaction>
</transactions>
<total-withdrawn>1817.95</total-withdrawn>
<total-paid-in>8178.28</total-paid-in>
<last-transaction-date>29 NOV 2002</last-transaction-date>
<final-balance>13615.31</final-balance>
</statement>
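A pre-processing step of this kind could itself be an XSLT component. The following XSLT 1.0 stylesheet is a partial sketch under some assumptions: it adds the account details, the per-transaction annotations and the totals, but leaves out the formatted dates and the expansion of the account type code, and it places the new elements in a slightly different order from the listing above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.example.org/bank/statement"
                xmlns:st="http://www.example.org/bank/statement"
                exclude-result-prefixes="st">

  <!-- Copy everything by default, so the original data survives. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Add easily reached summary elements around the original content. -->
  <xsl:template match="st:statement">
    <xsl:copy>
      <xsl:apply-templates select="@*" />
      <branch-number><xsl:value-of select="st:account/@branch" /></branch-number>
      <account-number><xsl:value-of select="st:account/@number" /></account-number>
      <account-name><xsl:value-of select="st:account/st:name" /></account-name>
      <balance-brought-forward><xsl:value-of select="st:balance" /></balance-brought-forward>
      <xsl:apply-templates select="node()" />
      <total-withdrawn>
        <xsl:value-of select="format-number(
          sum(st:transactions/st:transaction[@payee]/@amount), '0.00')" />
      </total-withdrawn>
      <total-paid-in>
        <xsl:value-of select="format-number(
          sum(st:transactions/st:transaction[not(@payee)]/@amount), '0.00')" />
      </total-paid-in>
      <final-balance>
        <xsl:value-of select="st:transactions/st:transaction[last()]/@balance" />
      </final-balance>
    </xsl:copy>
  </xsl:template>

  <!-- Annotate each transaction; a payee attribute marks a withdrawal. -->
  <xsl:template match="st:transaction">
    <xsl:copy>
      <xsl:apply-templates select="@*" />
      <transaction-particulars>
        <xsl:value-of select="concat(@type, ' ', @payee | @payer | @ref)" />
      </transaction-particulars>
      <transaction-withdrawl><xsl:value-of select="boolean(@payee)" /></transaction-withdrawl>
      <transaction-amount><xsl:value-of select="@amount" /></transaction-amount>
      <transaction-balance><xsl:value-of select="@balance" /></transaction-balance>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```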
The document template can then include proper XPaths rather than being
limited to pre-agreed keywords, but has all the important information, in the
correct formats, easily available. The document template in this example could
look very similar to the specialist document template shown earlier,
but the content of the
data elements' insert attributes would be interpreted as
XPath expressions rather than keywords. The instruction:
<data insert="last-transaction-date" />
actually points to the last-transaction-date child of
the statement element in the annotated data.
Since it isn't possible to evaluate XPath expressions on-the-fly using
XSLT (although some XSLT processors offer extension functions for this
purpose), the document template needs to be compiled into an XSLT stylesheet,
which is then used to process the annotated data. The compiled version of the
data element above is simply:
<xsl:value-of select="last-transaction-date" />
and the XSLT that performs this compilation looks like:
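<xsl:template match="doc:data">
<!-- Built with xsl:element so that the instruction is output, not executed -->
<xsl:element name="xsl:value-of">
<xsl:attribute name="select"><xsl:value-of select="@insert" /></xsl:attribute>
</xsl:element>
</xsl:template>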
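For a whole document template, a compiling stylesheet might use `xsl:namespace-alias`, so that the generated instructions can be written as ordinary elements. The following is a sketch rather than a definitive implementation. It assumes that the `each`, `test` and `insert` attributes all hold XPath expressions relative to the `statement` element (for example `transactions/transaction` and `transaction-withdrawl = 'true'`). It also assumes that the generated stylesheet is run by an XSLT 2.0 processor, so that `xpath-default-namespace` lets authors write element names without prefixes. The `out` prefix and its namespace URI are arbitrary.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:out="http://www.example.org/xslt-alias"
                xmlns:doc="http://www.example.org/document"
                exclude-result-prefixes="doc">

  <!-- Elements written with the out: prefix are emitted as XSLT elements. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl" />

  <!-- Wrap the compiled template in a stylesheet with one template. -->
  <xsl:template match="/">
    <out:stylesheet version="2.0"
        xpath-default-namespace="http://www.example.org/bank/statement">
      <out:template match="/*">
        <xsl:apply-templates />
      </out:template>
    </out:stylesheet>
  </xsl:template>

  <!-- Static content in the template is copied through as it stands. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Each instruction compiles to its XSLT equivalent. -->
  <xsl:template match="doc:data">
    <out:value-of select="{@insert}" />
  </xsl:template>

  <xsl:template match="doc:repeat">
    <out:for-each select="{@each}">
      <xsl:apply-templates />
    </out:for-each>
  </xsl:template>

  <xsl:template match="doc:if">
    <out:if test="{@test}">
      <xsl:apply-templates />
    </out:if>
  </xsl:template>

  <xsl:template match="doc:include">
    <out:copy-of select="document('{@href}')" />
  </xsl:template>

</xsl:stylesheet>
```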
This mixed approach grants an author all the flexibility of an XSLT document template, with the simplicity of the specialised document template. It also provides a mechanism for evolving the document template and separating technical and authoring roles. As authors write document templates, they may realise that they need easy ways to access additional information or require that information in different formats. The fact that the data hasn't been annotated with this information doesn't stop the author from using it and formatting it using the full power of XPath's functions and operators. If enough document templates require particular pieces of information, they can be added into the annotated document at a later date. Thus, negotiations with technical experts do not have to hold up the author's work on the document.
The previous sections focussed on adding content to data in order to create a document-oriented piece of XML that could be presented in many ways using stylesheets. The data is annotated with additional information and then transformed into document-oriented XML using an XSLT stylesheet generated by compiling a document template. The document-oriented XML is then styled using other stylesheets to create web pages or print editions.
If we turn now to look at styling document-oriented XML, we'll see that we encounter many of the same issues that we encountered when creating that document-oriented XML from the original data. Styling involves three processes:
Designers control all these processes; their primary responsibility is ensuring that the content looks good on the page. Like authors, designers know little and care less about XSLT, and they often want to use WYSIWYG-type authoring tools to try out their designs.
Adding formatting to an XML document is similar to the process of adding content to a document, though the focus here is on content that is specific for a given presentation medium, and on a general layout into which the content should be inserted. One model is to use a document layout with named regions, and map the content supplied in the XML into those regions. For example, a web page layout might use a table in which each cell's ID names a region into which content should be inserted, such as that shown below.
<table>
<tr>
<td colspan="3" id="navigation"></td>
</tr>
<tr>
<td rowspan="2" id="menu"></td>
<td colspan="2" id="title"></td>
</tr>
<tr>
<td id="content"></td>
<td id="links"></td>
</tr>
<tr>
<td colspan="3" id="footer"></td>
</tr>
</table>
The labelled regions can be filled with content from the document-oriented XML, usually by arranging with the authors to include sections with the same name within that XML, or by using a document template along the same lines as that used to generate the document-oriented XML from the original data.
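The first of these options, matching regions to sections by name, can be sketched in XSLT 1.0 as follows. The layout is the source document. The `content-uri` parameter, the `section` element and its `name` attribute are hypothetical, and the sketch assumes that the section content has already been translated into HTML, so that it can simply be copied into place.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: location of the document-oriented XML. -->
  <xsl:param name="content-uri" select="'document.xml'" />
  <xsl:variable name="content" select="document($content-uri)" />

  <!-- The layout itself is copied through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- A cell with an id is a region: fill it from the section with
       the same name, if there is one. -->
  <xsl:template match="td[@id]">
    <xsl:variable name="region" select="@id" />
    <xsl:copy>
      <xsl:apply-templates select="@*" />
      <xsl:copy-of select="$content//section[@name = $region]/node()" />
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the layout is ordinary HTML, designers can keep editing it in their usual tools without touching the stylesheet.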
The approaches used when creating web pages and creating print editions are usually different when it comes to adding style to the presentation format.
In a web page, separate CSS stylesheets can hold look-and-feel
information, so the important thing about the generated HTML is that it
contains sufficient hooks to enable the CSS styles to be targeted. Usually,
this means that the HTML contains class attributes that reflect the names of
the original elements in the document-oriented XML. For example, the
title element from the document-oriented XML becomes an
h1 element with a class of title in the HTML:
<h1 class="title">Bank Statement</h1>
and a separate CSS stylesheet contains the statements that style this in black-on-green text:
h1.title { color: black; background-color: green; }
This separation has the advantage that the CSS can be edited separately, by someone with no knowledge of XSLT, to change the look-and-feel of the web pages.
Print media, on the other hand, are generated using XSL-FO, in which the
formatting objects themselves hold stylistic information. The XSLT that
generates the XSL-FO ensures that all occurrences of the same kind of text have
the same style. In this example, the XSL-FO would contain a
fo:block element with stylistic attributes:
<fo:block color="black" background-color="green">Bank Statement</fo:block>
The stylistic information can be added using an attribute set in XSLT, as follows:
<xsl:attribute-set name="title">
<xsl:attribute name="color">black</xsl:attribute>
<xsl:attribute name="background-color">green</xsl:attribute>
</xsl:attribute-set>
<xsl:template match="title">
<fo:block use-attribute-sets="title">
<xsl:apply-templates />
</fo:block>
</xsl:template>
However, attribute sets require knowledge of XSLT, and they have the disadvantage that the XSLT stylesheet must be changed each time the look-and-feel of the document changes. In many organisations, then, a separate file, using a specialised markup language, is used to provide the attributes that are placed on the XSL-FO elements. To indicate that titles are black on green, for example, such a style file might include:
<style name="title" fo:color="black" fo:background-color="green" />
This style file can be interpreted by an XSLT stylesheet that generates XSL-FO, or compiled into an XSLT stylesheet containing attribute sets for each style. Usually the style file will actually generate a stylesheet module that is included or imported into the main stylesheet.
The advantage of creating an XSLT stylesheet from the style file, rather than interpreting it at run-time, is that the resulting XSLT stylesheet will run faster than one that interprets a style file dynamically. The XSLT stylesheet only needs to be generated once, at design time, and it can then be compiled, cached and reused multiple times at run time.
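Compiling a style file into a stylesheet module of attribute sets takes very little code. The following XSLT 1.0 sketch assumes that the `style` elements are in no namespace, that their styling attributes are in the XSL-FO namespace, and that they sit somewhere inside a single style file; the `out` prefix and its namespace URI are arbitrary.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:out="http://www.example.org/xslt-alias"
                xmlns:fo="http://www.w3.org/1999/XSL/Format"
                exclude-result-prefixes="fo">

  <!-- Elements written with the out: prefix are emitted as XSLT elements. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl" />

  <!-- The generated module holds one attribute set per style. -->
  <xsl:template match="/">
    <out:stylesheet version="1.0">
      <xsl:apply-templates select="//style" />
    </out:stylesheet>
  </xsl:template>

  <!-- Each fo: attribute on a style becomes one xsl:attribute. -->
  <xsl:template match="style">
    <out:attribute-set name="{@name}">
      <xsl:for-each select="@fo:*">
        <out:attribute name="{local-name()}">
          <xsl:value-of select="." />
        </out:attribute>
      </xsl:for-each>
    </out:attribute-set>
  </xsl:template>

</xsl:stylesheet>
```

For the `title` style given earlier, this generates an attribute set named `title` that sets `color` to `black` and `background-color` to `green`.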
When documents are generated automatically, rather than written individually, it begins to be possible to personalise and customise those documents to suit the reader. For example, the documents can be in the reader's favourite language, use a formal or informal style, contain legal information that is relevant to the reader's location, or be designed for the reader's screen resolution.
There are several ways to build personalisation into the document generation process, each of which is suitable for different kinds of customizations:
In most document-generation applications, a combination of these mechanisms will be used to create an output that is designed specifically for the reader.
In the real world, creating a range of outputs from XML is a complex business. This paper has shown how pipelining transformations and configuring them using XML-based templates can make the process simpler, easier to maintain, and allow the appropriate people to have control over appropriate sub-tasks.
While this is a general approach that has been used successfully in many organisations, in most cases the pipelining framework and pipeline components are home-grown. Even when an off-the-shelf solution, such as those listed earlier, is used, the pipeline control language that's used in one solution can't be used in another. For pipelining to come into its own, we need a standard pipeline definition language that can be adopted by all document-generation frameworks.
Pipeline components are designed to be reusable, but again individual pipeline components tend to be home-grown. A standard pipeline component API would enhance sharing of configurable components between organisations. A particular issue here is whether schema validation information should be included in the set of information passed between components: while including such information reduces the work that has to be done by the receiving component, there's then no guarantee that the correct (version of the) schema was used to perform the validation.
The discussion of configuration above used XSLT as an example. Either XSLT or XQuery can be used to perform XML-to-XML transformations, particularly those in which the source XML is data-oriented. Neither language has a mechanism for dynamically evaluating an XPath expression, which is needed for interpreting document templates that include XPath expressions. Where XSLT wins over XQuery here is that because it's written in XML itself, it's easy to generate XSLT from a document template. But neither language is particularly good as a pipeline component language because neither is streamable: both require an object model of the entire document in order to operate.
Pipelining isn't by any means a new methodology, but it's a highly effective one. Approaching transformation problems by first breaking them down into their smaller components creates simpler, more reusable code. While the above discussion has focused on completely separate components held together within an external application framework, it's also possible to have single XSLT 2.0 stylesheets or XQuery queries that perform a pipelined transformation internally using temporary trees. Even if your transformation doesn't require complex application frameworks, taking a pipelining approach in a single stylesheet can be beneficial.
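As a minimal sketch of such an internal pipeline, the following XSLT 2.0 stylesheet runs two stages over the bank statement, holding the result of the first in a temporary tree and using a separate mode for each stage. The particular stages (keeping only withdrawals, then adding a `particulars` attribute) are assumptions chosen to keep the example short.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:st="http://www.example.org/bank/statement">

  <xsl:template match="/">
    <!-- Stage 1: filter, holding the result in a temporary tree. -->
    <xsl:variable name="filtered">
      <xsl:apply-templates select="." mode="filter" />
    </xsl:variable>
    <!-- Stage 2: annotate the filtered tree and output the result. -->
    <xsl:apply-templates select="$filtered" mode="annotate" />
  </xsl:template>

  <!-- Both stages copy anything that they don't explicitly handle. -->
  <xsl:template match="@*|node()" mode="filter annotate">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current" />
    </xsl:copy>
  </xsl:template>

  <!-- Stage 1 rule: drop transactions that are not withdrawals. -->
  <xsl:template match="st:transaction[not(@payee)]" mode="filter" />

  <!-- Stage 2 rule: add a hypothetical particulars attribute. -->
  <xsl:template match="st:transaction" mode="annotate">
    <xsl:copy>
      <xsl:copy-of select="@*" />
      <xsl:attribute name="particulars"
                     select="concat(@type, ' ', @payee)" />
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Each mode plays the role of a separate component, so a stage can later be moved into its own stylesheet if the pipeline outgrows a single file.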
An early draft of this paper was written for Thunderhead; thanks to them for letting me adapt it for a wider audience.
Jeni Tennison
Jeni Tennison Consulting Ltd
Jeni Tennison is an independent consultant specialising in XSLT and XML Schema development. She trained as a knowledge engineer, gaining a PhD in collaborative ontology development, and since becoming a consultant has worked on using XML in a wide variety of areas, including publishing, water monitoring and financial services. She is the author of "XSLT & XPath On The Edge" (Hungry Minds, 2001) and "Beginning XSLT" (Wrox, 2002), one of the founders of the EXSLT initiative to standardise extensions to XSLT and XPath, and an invited expert on the W3C XSL Working Group.