XTech 2005: XML, the Web and beyond.

The Apache Webserver as a Platform for XML Applications

Abstract

This paper introduces the Apache 2 Filter architecture, and explains its significance in enabling XML applications on the Web. It then presents a case study and several examples of reusable modules based on and enabled by the filter architecture, and seeks to give the reader an insight into some of the kinds of functions Apache is well-suited to implementing.

Introduction

Traditionally, XML applications on the Web use Apache only as a dispatcher and to serve static pages. For dynamic content, an application server is commonly used, with Apache serving as a proxy for it. Some users may turn straight to Java without giving it a moment's thought! A second approach is to extend Apache using an environment based on one of the popular scripting languages such as Perl, PHP or Python. Some fairly sophisticated and powerful XML frameworks, such as AxKit (Perl), are available.

In the days of Apache 1, it was a natural choice to take one of these approaches. There was of course a native Application Programming Interface (API) for Apache, but it was quite limiting for many applications, and working around the limitations typically involved far more programmer effort than most users could justify. A few XML applications for Apache 1 exist, but general-purpose reusable components such as XSLT transformation perform extremely poorly.

With the release of Apache 2.0 in 2002, this changed completely. Now Apache itself is a powerful applications platform. The API is altogether richer, and the developer effort required to work effectively with it is much reduced. Not only does it offer an excellent development platform; it can also offer major performance and scalability improvements compared to Java and traditional application servers.

This paper will briefly describe the reasons why Apache 2 is so much more powerful than earlier Apaches as an applications platform, particularly for XML and other markup processing. It will then present an overview of existing technologies and new developments with which the author is familiar, with particular reference to general-purpose components and modules that can be used equally well in their own right or together with your existing applications.

One caveat: for this paper I have concentrated on the reusable modules made possible by the Apache 2 filter architecture. I have unfortunately not had the time or space to discuss Apache as a platform for web services.

The Apache 2 Architecture

There are several changes in Apache 2 that affect its suitability as an XML and general-purpose applications platform:

Filtering in Apache 2

The simplest possible formulation of a webserver is a program that listens for HTTP requests and returns a response when it receives one.

However, in real life, webservers split the task into several phases. For example, when a resource is requested, they may need to check that it exists and that the client is authorised to receive it before sending it to the client, and generate an error message if necessary. This is the basic architecture of Apache 1 (and other webservers).

Apache 2 introduces a second axis: processing data through any number of filters. Filters serve many purposes, but their key features are that they can be inserted at will, regardless of what else is going on; they can be arbitrarily chained; and (in Apache 2.1 and upwards) they can be configured dynamically according to the actual data, even where that is not known in advance (such as when proxying an application server that Apache knows nothing about). Filters can look at and modify, or even completely replace, the data passing between the server and the client. This makes them particularly well-suited to tasks such as processing markup or, indeed, image data. A standard application such as XSLT transformation, implemented as an output filter module, works equally well with any content source, whether that is a static page, content generated dynamically by CGI or PHP, or content proxied from a Java or other application server.
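The chaining idea can be illustrated outside Apache with a minimal Python sketch. This is not the Apache filter API, which is written in C; it only shows how independent filters compose over a stream of chunks without knowing where the content came from. All names here (`replace_filter`, `uppercase_filter`, `run_chain`, `static_page`) are hypothetical and chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Filter = Callable[[Iterable[bytes]], Iterator[bytes]]


def uppercase_filter(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Transform each chunk as it arrives; never buffers the whole body."""
    for chunk in chunks:
        yield chunk.upper()


def replace_filter(old: bytes, new: bytes) -> Filter:
    """Build a filter that replaces `old` with `new` inside each chunk.

    Simplification: matches that span a chunk boundary are not handled.
    """
    def _filter(chunks: Iterable[bytes]) -> Iterator[bytes]:
        for chunk in chunks:
            yield chunk.replace(old, new)
    return _filter


def run_chain(source: Iterable[bytes], filters: List[Filter]) -> Iterator[bytes]:
    """Wrap the content source in each filter in turn."""
    stream: Iterator[bytes] = iter(source)
    for f in filters:
        stream = f(stream)
    return stream


def static_page() -> Iterator[bytes]:
    """Stand-in content source; it could equally be CGI output or a proxied response."""
    yield b"<p>hello "
    yield b"world</p>"


output = b"".join(
    run_chain(static_page(), [replace_filter(b"world", b"web"), uppercase_filter])
)
```

Because each filter sees only an iterable of chunks, the same chain can be placed in front of any content source, and filters can be added or removed without changing the others.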

Filters are well-suited to any kind of data processing task with both data inputs and outputs: this naturally encompasses a wide range of XML applications. Best of all are tasks that can process data in chunks without having to wait for the entire input, which Apache can handle extremely efficiently. In the case of XML applications, this relies on the parser supporting the mode of input available in Apache filters, where data arrives a chunk at a time. Amongst the parsers with which this author is familiar, expat and libxml2 are well-suited to use in Apache; Xerces-C, Tidy and OpenSP are not.
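What "supporting the mode of input available in filters" means in practice can be sketched with Python's standard binding to expat, which accepts input a piece at a time. The sketch below is an illustration only, not code from any Apache module; the function name `count_elements` and the sample chunks are assumptions made for the example. Note that the chunk boundaries fall in the middle of tags and text, as they may in a real data stream.

```python
import xml.parsers.expat


def count_elements(chunks):
    """Parse XML supplied as arbitrary byte chunks and count start tags by name."""
    counts = {}

    def start_element(name, attrs):
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element

    for chunk in chunks:
        # isfinal=False: more data will follow, so expat keeps its state
        # and fires callbacks for whatever is complete so far.
        parser.Parse(chunk, False)

    # An empty final call tells expat that the input has ended.
    parser.Parse(b"", True)
    return counts


# Chunk boundaries deliberately split a tag name and a text node.
chunks = [b"<doc><it", b"em/><item>te", b"xt</item></doc>"]
counts = count_elements(chunks)
```

A parser that insists on reading the complete document itself cannot be driven this way, and would force the filter to buffer the entire response before any processing could start.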

XML Applications with Apache 2

In principle, you can do anything in your own customised handler, CGI script, or application server, without reference to the Apache processing or data axes. That trivial observation applies equally to Apache 1 and Apache 2. Such monolithic all-in-one applications exist, and fulfil useful functions including web services such as XML-RPC. But implementing processing in a filter gives you a component that can be reused in a wide range of applications, and freely mixed and matched with other processing.

A Case Study: Modularisation of Site Valet

The author's Site Valet (valet.webthing.com) includes a number of online tools for markup analysis. One of these is Page Valet, a formal validator, directly comparable with the W3C and WDG validators (w3.org and htmlhelp.com). Like those validators, it started life as a CGI program running under Apache 1 or another server.

One goal of Page Valet is to improve the presentation of results to users compared to the other validators. Today it offers users a choice of radically different presentations by generating results as XML and transforming them with an XSLT stylesheet selected by the user from a menu.

XSLT leads the way

This approach was first developed in 2001, when the site was running on Apache 1.3. Different implementations were tried, including running the XSLT in CGI, and XSLT modules for Apache. The only approach that worked was CGI: running XSLT in Apache 1.3 was several thousand times slower and leaked memory.

Around the end of 2001, development work moved to Apache 2.0. Philipp Dunkel's XSLT module for 2.0 was at an early stage of development, but unlike the 1.3 XSLT modules, it worked well enough to consider for production use even then. This was almost certainly the first XML module to take advantage of the Apache 2 filter architecture. It is now one of several XSLT modules available for Apache 2.

Experimenting with a General-Purpose XML platform

My own work around then focussed on a general-purpose XML module, mod_xml, which in addition to XSLT enabled Valet tools to run as a web service. Jim Ley developed a client using XMLHTTP to provide a "validate" menu option in selected browsers using raw XML. This went into production in 2002. mod_xml prototyped a fully modular architecture for XML applications, with both input and output filters to transform data to and from an XML format used by a core application. mod_xml is now obsolete and the two applications developed with it have both been replaced, but the architecture it prototyped lives on!

mod_xml also included an XSLT filter, distinguished by a number of performance optimisations including pre-compiling and caching stylesheets, and the ability to accept an in-memory DOM tree in place of XML text as its input. This is now incorporated into mod_transform (www.outoforder.cc), an XSLT filter module that now supports additional technologies including XInclude.

Benefits of Modularisation

Another goal of Site Valet was to offer web accessibility analysis. The first attempts to do so were offered as optional extras in validation, but as the analysis matured, it became a separate tool, AccessValet. Since the modular architecture, including input handling and XSLT transformation, was already in place, the AccessValet software was able to ignore these tasks and concentrate exclusively on markup analysis, thus greatly simplifying the development work. This simplification also made it feasible to rewrite Page Valet as a module.

Fast Filtering with SAX

Of course, XSLT is powerful, but it is also a significant performance hit on a busy webserver, particularly as the size and complexity of documents and transformations grow. Stream-based processing with SAX is inherently far faster and more scalable. Since a busy webserver may need to process thousands of concurrent requests, this is an important consideration.

Fortunately, streamed parsing with SAX is also an excellent fit with the Apache filter architecture. This author has developed a number of SAX-based markup filters, and considers them to be his most interesting work over the past three years.

This approach originates in an HTML-processing filter: mod_accessibility serves to transform content (HTML or XHTML) to improve accessibility and empower end-users. This was followed by the author's most widely-used module, mod_proxy_html, which serves in a proxy server to rewrite HTML links into the proxy's address space, so that local/private links work from outside the private network. These transformations may be applied to every page served, so a fast, efficient and scalable implementation is particularly important.
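The kind of link rewriting described here can be sketched with a streaming, event-driven HTML parser. The example below uses Python's standard `html.parser` as a stand-in; it is not mod_proxy_html's implementation, and the class name `LinkRewriter`, the attribute list and the two example URL prefixes are assumptions made for illustration. Processing instructions and some declaration forms are ignored to keep the sketch short.

```python
from html import escape
from html.parser import HTMLParser


class LinkRewriter(HTMLParser):
    """Re-emit HTML as it is parsed, rewriting URL attributes on the way."""

    URL_ATTRS = {"href", "src", "action"}  # assumed subset of link attributes

    def __init__(self, internal_prefix, public_prefix):
        # convert_charrefs=False keeps entity references as separate events,
        # so they can be passed through unchanged.
        super().__init__(convert_charrefs=False)
        self.internal_prefix = internal_prefix
        self.public_prefix = public_prefix
        self.out = []

    def _rewrite(self, name, value):
        if name in self.URL_ATTRS and value.startswith(self.internal_prefix):
            return self.public_prefix + value[len(self.internal_prefix):]
        return value

    def _emit_tag(self, tag, attrs, close=""):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. "disabled"
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(self._rewrite(name, value))))
        self.out.append("<" + " ".join(parts) + close + ">")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " /")

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def rewrite_stream(chunks, internal_prefix, public_prefix):
    """Feed chunks to the parser as they arrive and return the rewritten page."""
    rewriter = LinkRewriter(internal_prefix, public_prefix)
    for chunk in chunks:
        rewriter.feed(chunk)
    rewriter.close()
    return "".join(rewriter.out)


# The second chunk starts in the middle of the <a> tag.
page = ['<p>See <a hr', 'ef="http://internal.example/docs/">docs</a></p>']
result = rewrite_stream(page, "http://internal.example/", "http://gateway.example/app/")
```

Only the attribute values named in `URL_ATTRS` are inspected; everything else is copied through as the events arrive, which is what keeps this style of processing cheap enough to apply to every page.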

The Apache XML Namespace Framework

The most significant SAX-based XML technology is the Apache XML Namespace Framework. This is an extension to the Apache API that enables XML processing modules to be developed quickly and easily. The advantages of the framework are that:

As a simple measure of how much module development is simplified, we can compare modules performing similar tasks with and without it. This is not entirely comparing like with like, but it gives a rough comparison:

The basis of the namespace framework is a filter module that parses markup with SAX2, and dispatches on namespace to a processor registered to handle that namespace. Any module may register a namespace handler. There are currently two modules that implement the namespace extension to Apache, and several namespace modules that use it.
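The dispatch-on-namespace idea can be sketched in a few lines with Python's standard SAX2 interface. This illustrates the principle only and is not the framework's C API: the class `NamespaceDispatcher`, its `register` method, the handler signature and the example namespace URI are all hypothetical. For brevity, elements in unregistered namespaces are passed through without their attributes or prefixes.

```python
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces
from xml.sax.saxutils import escape


class NamespaceDispatcher(ContentHandler):
    """Route elements to a handler registered for their namespace URI."""

    def __init__(self):
        super().__init__()
        self.handlers = {}  # namespace URI -> callable(localname, attrs) -> str
        self.out = []

    def register(self, namespace_uri, handler):
        self.handlers[namespace_uri] = handler

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        handler = self.handlers.get(uri)
        if handler is None:
            # No handler registered: pass the element through (simplified).
            self.out.append("<%s>" % local)
        else:
            plain = {attr_local: value for (_uri, attr_local), value in attrs.items()}
            self.out.append(handler(local, plain))

    def endElementNS(self, name, qname):
        uri, local = name
        if uri not in self.handlers:
            self.out.append("</%s>" % local)

    def characters(self, content):
        self.out.append(escape(content))


def process(chunks, dispatcher):
    """Parse the chunks incrementally with namespace processing switched on."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(dispatcher)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return "".join(dispatcher.out)


DEMO_NS = "http://example.org/ns/demo"  # hypothetical namespace


def demo_handler(localname, attrs):
    """Hypothetical namespace module: expands <ex:greet name="..."/>."""
    if localname == "greet":
        return "Hello, %s!" % attrs.get("name", "world")
    return ""


dispatcher = NamespaceDispatcher()
dispatcher.register(DEMO_NS, demo_handler)
chunks = [b'<p xmlns:ex="http://example.org/ns/demo">Say: <ex:gr',
          b'eet name="XTech"/></p>']
result = process(chunks, dispatcher)
```

Adding support for another namespace in this sketch means writing one more handler function and registering it; the parsing and dispatching code is shared.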

A few examples of namespace modules are:

Server Side Includes and Edge Side Includes

Apache's mod_include implements processing directives in HTML comments. mod_include directives take the form <!--#directive var="value" ...-->, which maps trivially to a namespace handler <ssi:directive var="value"/> in an XML context. mod_xhtml implements server side includes both as a comment handler and as a separate namespace, leaving it to users which form they prefer. This enables SSI to be combined freely with (other) namespace-based applications, without the overhead of parsing it a second time with mod_include.
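To show how direct the mapping between the two forms is, here is a small Python sketch that converts comment-style directives to the element form. It is an illustration only and not how mod_xhtml works internally; the function name `ssi_comment_to_element` and the regular expression are assumptions. It handles only double-quoted attribute values, and the resulting document would still need to declare the `ssi` prefix to be namespace-well-formed.

```python
import re

# <!--#directive name="value" ... -->  with zero or more double-quoted attributes
SSI_COMMENT = re.compile(r'<!--#(\w+)((?:\s+\w+="[^"]*")*)\s*-->')


def ssi_comment_to_element(text):
    """Rewrite comment-style SSI directives as empty <ssi:...> elements."""
    return SSI_COMMENT.sub(
        lambda m: "<ssi:%s%s/>" % (m.group(1), m.group(2)),
        text,
    )


converted = ssi_comment_to_element(
    '<p><!--#include virtual="/footer.html" --></p>'
)
```

Ordinary comments, which do not start with `#`, are left untouched by the pattern.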

ESI is a more extensive postprocessing language that implements a set of processing directives that mix an <esi:...> namespace with <!--esi ...--> comment-based directives. A prototype ESI parser was published in 2003. It is not maintained, but like most of the software mentioned in this paper it is available as open source, and can be maintained on demand.

Scholarly Publication

The Apache Tutor site specializes in tutorials for applications development with Apache. It includes an online editor and a facility for users to add comments (annotations), which are presented as margin notes. Articles are stored in an XML format, using a custom namespace for application-specific information, including article structure, ownership and permissions, revision and locking information, and annotations. The actual article contents are held as XHTML, and articles are served to browsers presented using namespace handlers.

Reverse Proxy

mod_proxy_html cannot be implemented as a namespace handler because it must be able to accept input that is not well-formed XML. However, a similar module mod_proxy_xml has recently been released. Like mod_proxy_html, it serves to rewrite links into a proxy's address space. It implements proxy namespace handlers for XHTML and for WML and reduces the problem of implementing reverse-proxying for another namespace to one of writing a single function. As noted above, it is a good deal shorter and simpler than mod_proxy_html, due in large part to using the simpler namespace API.

SQL and Forms Handling

mod_sql implements a handler for including SQL queries in XML pages. Working with a forms parsing module (mod_form) and the Apache DBD API (Apache's database-independent framework for accessing an SQL backend), it takes full advantage of Apache's threaded MPMs with connection pooling to provide a vastly more efficient and scalable means of SQL access than traditional markup-based options such as PHP.

Script Processing

Current work in progress includes implementing Tcl processing in Apache, for a client upgrading its content management and publishing software from Vignette StoryServer. The Tcl processor takes several forms, and the primary requirement is to emulate Vignette's extensions and support the client's legacy documents, but for future use a <tcl:...> namespace is implemented, so the option of moving to a pure XML framework is open.

mod_publisher: the Universal Markup Filter

To conclude our review of XML processing in Apache, we will introduce the author's most comprehensive work in the area. mod_publisher is designed as the universal markup filter (for both XML and HTML) bringing together all the SAX techniques discussed here with additional capabilities including server-side includes, macros and on-the-fly editing, generalised markup rewriting, and DTD-based corrections, as well as implementing the full namespace API and a second extension API to enable other modules to hook different SAX events directly.

Biography

Nick Kew

WebÞing

Nick Kew is a veteran systems and software developer, with a strong belief in the potential of the global IT infrastructure to transform our lives. His work includes developing the technologies (including Apache), and the standards (working with the W3C) to help us realise that potential.