XTech 2005: XML, the Web and beyond.

All XML Databases are Equal

Discuss this paper on the XTech wiki
View XML source for this paper

Keywords

Abstract

But some are more equal than others! This paper discusses the issues involved in evaluating an XML database according to the requirements of your project.

Introduction

As anyone searching for XML storage solutions can tell you, the choice is bewildering. It can be hard to distinguish between the wealth of XML databases currently on the market, and very little comparative work has been done.

It would be wrong to assume that all XML databases suit the needs of your project - unlike relational databases, there is no standard way to define or implement products in this field. The relational model has its "normalized form", but no parallel exists with XML (and indeed, why should it?). The end result is that XML databases may perform wildly differently depending on your data and querying needs.

This paper will discuss the major ways in which an XML database's implementation can affect its performance with differing types of data, query or update usage.

Native vs Relational Backed vs SQL/XML

The earliest XML databases provided a layer on top of a relational database. This involves decomposing the XML document into relational database tables based on the document's schema. Reference integrity is maintained using foreign keys, and document order is stored using a special column in the database tables. This approach allows the wealth of SQL knowledge to be applied to the stored data. It also allows you to utilize the SQL database, and DBA that you may already use in your company.

There are many disadvantages of this approach, however. SQL databases are not very well suited to too extensive or too permissive a schema, which means they are especially bad at handling the XML data model when it isn't restricted by a schema. Any XML query language has to be translated to SQL first, before it is executed, which is typically slower, and storing and materializing documents is a complicated process involving SQL queries that use many joins across multiple tables. These problems mean that this is rarely the fastest or most efficient approach to take when storing XML data.

Native XML databases are based on the XML data model, rather than the relational data model. This means that they are designed from the ground up to deal with elements, attributes and text nodes, and they naturally handle document order and referential integrity issues. Any indexes maintained by the native XML database will be directly applicable to the query language of choice, and there is no translation phase when compiling an query. This means that native XML databases have the potential to be the fastest XML databases.

On the negative side, the XML data model has been around for far less time than the relational data model, so the products available tend to be far less mature. However, some of the native XML databases are built on top of non-relational database back ends (that aren't based on the relational/SQL data model), which means they can inherit the back end's maturity and robustness - so these are worth looking out for.

Recently a number of SQL databases have been gaining support for XML, with the SQL-2003 XML related specifications SQLXML2003. This adds an XML data type to SQL, and allows you to embed XQuery XQuery queries in your SQL queries. Typically, separate indexes are maintained by the database to speed up XQuery queries. These databases do promise the best of both worlds, in terms of relational and XML database compatibility.

Single Schema vs Partial Schema vs Schema-Less

A number of XML database implementations restrict the documents held in a single collection to a specified schema. Relational backed databases often do this, although this paradigm is not restricted to that type of XML database. This has parallels to the relational method of specifying a schema for a table. This paradigm has advantages where the data being stored has a fixed, infrequently changed schema, and the data must be validated before entering the database.

However, one of the big advantages of using XML is that it is extremely good for semi-structured data that has a loose content model, and a schema that has the potential to change. This is the case with the schema for DocBook and HTML, for instance, as it often is with mixed content XML.

In this case, you are better off choosing an XML database that does not require a fixed schema for each collection. These implementations are often simply based on the XML data model - native XML databases often allow this.

A few XML databases give users the best of both worlds, and allow them to define a schema for sub-trees in their XML documents. This allows a strictly validated node to be placed inside otherwise semi-structured or loosely typed data.

Automatic Indexes vs User-Defined Indexes

Indexes are sorted lists of information about a document, that can be used to speed up document querying. To a point, the more indexes an XML database maintains, the faster it can make queries. However, maintaining indexes incurs a speed penalty to document adds, updates and deletions.

Often, if XML databases provide indexes, the user will have no say in what indexes are stored. There will be an automatic set of index information that is maintained as the documents are added and updated, which may include the name and location of elements and attributes, and their values. This can take the pain out of using an XML database, since the user never has to think about indexing - it happens in the background.

But like so many things in software engineering, XML databases are most efficient when they are configured for users' specific needs. XML databases that are configurable in this way often allow users to specify a range of different index types, taking into account more information, like the paths to a node or a full-text index of its value.

By choosing indexes, it is possible to make sure that the most specific indexes are chosen, which will narrow the scope of the search quicker, reducing query times. Also, since less index information is calculated, document updates will be quicker. This is by no means the easy option though, as bad choices of index can equally result in poor query times and poor update times.

Full Text Search

Document-centric data will often benefit from using an XML database that supports full text indexing and search. This is a facility that allows fast word and phrase searching to be done across documents. Although the W3C are in the process of defining full text extensions to the XQuery language, plain XQuery does not support this - so XML databases that can perform full text queries do so currently by using proprietary extensions.

Document Level vs Node Level Storage

The storage model for the documents in the database can often vary. Documents can be stored as some kind of BLOB (binary large object), so the whole document string is preserved. This has the benefit of simplicity, and round tripping of even non-XML infoset information. There are many down sides to this approach; for instance, the entire document has to be parsed to perform any queries on it, consuming more memory, and taking more time.

It also makes partial updates very slow, if they are possible at all, and means that indexes can be less effective, since they can often only reference documents, rather than the nodes inside documents. This makes this approach more feasible for many small documents, rather than a few large documents.

Many XML databases decompose the documents when they store them, into units based on the nodes in the document's XML infoset XMLInfoset. This effectively stores a pre-parsed version of the document. This approach allows querying without parsing the entire document into memory. It is capable of supporting partial updates, and indexes that reference nodes directly. In practice, this approach is often faster in all operations than a whole document storage method, however it typically only preserves the XML infoset information from the document.

Whole Document Update vs Partial Update

Many XML databases to date have concentrated on query performance, but it is worth noting that update performance can vary wildly as well. There is normally a trade off between database query speed and database update speed.

Update speed is also affected by whether the database supports partial updates or not, in other words, if modifications to a part of the document require rewriting the entire document. This can be the case, not only in databases where the documents are stored as BLOBS, but also in cases where the node ID system is required to be recalculated in order to maintain some significant property of it, like document ordering.

Transactional vs Non-Transactional

Engineers who often use relational databases will be used to the guarantees of ACID (atomicity, consistency, isolation, and durability) semantics. However, since many XML databases have been built from scratch, they have not yet implemented the ACID guarantees - or if they have, they are not as robust yet, due to product immaturity. This applies mostly to native XML databases.

Look out for XML databases that are built on top of existing non-relational or relational databases, as they will inherit the mature ACID semantics of their base product.

ACID semantics are essential for storing any kind of mission critical data, and the atomicity guarantees that come form using transactions are invaluable when doing any kind of concurrency. If you are doing either of these things, then you need to examine the maturity and ACID guarantees of an XML database carefully.

Embedded vs Server/Client

When most people think of a database, they think of a stand-alone server daemon, that is accessed by communicating over the TCP/IP stack. Most XML databases fit into this category, and there are some definite advantages to this approach. For instance, databases can easily be run on dedicated computers using this approach, which helps with scalability.

On the other hand, an embedded database doesn't run as a separate process. It is usually linked with a running program, and executes in the same process as the application. This has benefits as it reduces the significant overhead of the TCP/IP stack and calling protocol, opting instead to use direct method calls.

However, databases that guarantee ACID semantics are often sensitive to how they are shut down, meaning that an exception in your application process can often cause an embedded database to become corrupt.

Security

As in all databases, some XML databases come with built in security, and others leave this side of things down to their users. The usual mix of authentication, permissions and encryption apply just as much to XML databases that contain sensitive information.

Use Cases

A web portal to book reviews, stored as individual XML files

A book review is certainly document-centric and semi-structured, so that rules out a relational backed XML database in favour of either a native or SQL/XML one. If an organisation's book reviewers find it hard to keep to the constraints of the initial book review schema, an xml database that supports partial schema, or is schema-less, may be a better choice.

Users may find that they don't often need to retrieve parts of a book review, and that update speed doesn't matter to them, as the reviews are updated infrequently. This may lead them to choose an implementation that is simpler to use, with automatic indexes, document level storage and whole document update.

However, if the organisation wishes to provide an advanced search facility for finding reviews of books, full text search would become important.

Storing and analyzing large documents of pharmaceutical test data

The test data from pharmaceutical tests is in the form of large, repetitious, highly structured documents. Once it is produced, it is never updated, only heavily queried during the analysis process. A relational backed XML database would probably do quite well here - as would using a relational database instead of an XML database. If a native or SQL/XML database was used, it would have to support node level storage and indexing of the document.

Schema support here would probably protect against bad data, unless the organisation wanted to build in a certain amount of tolerance and robustness to the system. User defined indexes might help fine-tune query performance, as complex analysis could involve days of querying. Full text search would almost certainly not be useful.

Archiving financial transactions from an e-commerce website

A high volume of transaction data is constantly being received, and needs to be archived in real time. A transaction is a structured and small amount of data. Reports on sets of transactions are required from time to time.

A native or SQL/XML database would be the right choice here, since a relational backed database is unlikely to give the required throughput. If each transaction is stored in a document, then it is possible to choose a database that implemented document level storage, since queries into a transaction are rare.

However, to add a new transaction to an existing document that contained many transactions, a database that supported node level storage and partial update would be required, to avoid a performance bottleneck during updates.

It would probably not be important to fine-tune queries, so automatic indexes would be sufficient. However, due to the mission-critical data and high concurrency, ACID semantics would need to be guaranteed, so a transactional XML database would be needed. The security facilities of the database would also be important, considering the sensitive information being stored.

Conclusion

Hopefully this paper has furnished you with a basic understanding of the issues involved in evaluating an XML database. There has been a lot of progress in XML databases recently, and it is a product area that is well worth looking into.

There are no hard and fast rules for choosing an XML database, and there is certainly no clear market leader as yet. A product evaluation is best done with an understanding of the issues involved, and of the intended data, queries and use cases to be fulfilled.

Bibliography

[IND5P2] Indexing XML Data Stored in a Relational Database
VLDB Conference, 2004
[AnatomyXMLDB] Anatomy of a Native XML Database
Sleepycat Software, 2004
[ContentProc] Simplifying Content Processing
Mark Logic Corporation, March 2004
[Bourret] XML and Databases
1999-2004
[XMLInfoset] XML Information Set (Second Edition)
4 February 2004
[SQLXML2003] ISO/IEC 9075-14:2003, Information technology - Database languages - SQL - Part 14: XML-Related Specifications (SQL/XML)
2004
[XQuery] XQuery 1.0: An XML Query Language
4 April 2005

Biography

John Snelson

Technical Lead, Parthenon Computing Ltd http://www.parthcomp.com

John has been working with XML for 6 years, through software engineering and technical architect positions at Decisionsoft Ltd, and currently at Parthenon Computing Ltd. He has been a major contributor to Decisionsoft's Pathan XPath 2 engine, as well as having contributed to Xerces-C. More recently, John has worked on the Sleepycat DB XML project, playing a significant role in the implementation of the 2.0 version this XML database.