XTech 2005: XML, the Web and beyond.

XML and Relational Storage–Are they mutually exclusive?

Abstract

Relational databases are the dominant data store by providing storage and processing mechanisms that offers both efficient techniques for storing structured data and high-performance query evaluation. On the other hand, XML is a newer, portable data format to exchange semi-structured data between disparate systems or applications. Businesses today require both XML and relational storage for easy exchange of data and added flexibility. When does it make sense to combine these technologies? And what is the best method for doing this?

XML and databases

Enterprises are keeping increasingly large amounts of business critical data in XML format. Managing large amounts of data efficiently and securely is a problem that is traditionally solved by database management systems. The reasons for storing XML in a database system are the same as for relational data: persistent storage, transactional consistency, recoverability, high availability, security, efficient search and update operations, and scalability. These are all features which make a database a much more appropriate repository for XML data than, for example, a file system. Thus, XML databases have gained increasing popularity and importance in recent years.

What exactly is an XML database? We can generally distinguish between XML-enabled databases and native XML databases. We call a database an XML-enabled database if its core storage and processing model is not the XML data model. In many cases its core is the relational model and a mapping between the XML data model and the relational data model is required. All major relational database systems can be considered XML-enabled databases because they support this mapping to manage XML. DB2 Universal Database(TM) V8 with the DB2 XML Extender is a typical example of an XML-enabled database.

The term native XML database is used in different ways by various groups. One definition that you are likely to encounter says that a native XML database has the following three characteristics:

In particular, this definition allows for a mapping from the XML data model to a different data model for storage and processing. This is what we have defined as an XML-enabled database. Thus, we require a native XML database to also have the following two properties:

This narrowed definition means that XML is more than an externalized data type–it is how the data is handled both logically and physically. The data is represented as XML right down to its physical storage schema on the disk. This model is the best for efficient searching of the XML data. It is also well-suited for evolving XML schemas. Various native XML databases have emerged in recent years, typically offered by small and specialized vendors. In addition to XML-enabled databases and native XML databases, we also want to introduce the idea of a hybrid database. A hybrid database is a relational database that is XML-enabled, but also offers native XML capabilities as defined above. It is a database that supports both the relational data model and the XML data model in all its processing and storage mechanisms. The planned release of DB2 Universal Database is a hybrid database.

The questions addressed in this paper are: Which type of database is right for your XML data? How can you combine XML and relational data management in a single system?

Combining XML and relational data

XML often needs to be combined with structured relational data within a relational database that includes varying levels of support for XML data. The reasons for including XML data with structured relational data are as varied as the data. The motivation for storing XML in a relational database nearly always includes the following:

This assumes that you already have XML data, or that the decision to use XML as a data format has already been made. However, consider the key differences between relational and XML data before deciding to use either format for a specific subset of your data. Neither XML nor relational is the best choice for every application. Various data management needs exist for which the relational data model is insufficient and XML is the better choice. Yet, there are also many usage scenarios where the relational data model is the best choice and where using XML would be an unnecessary burden. We examine these scenarios in the following section.

When to use XML and when to use relational for your data

You have more choices than ever before when considering how your data should be encoded, stored, and retrieved. Consider both the immediate and long-term consequences of the possible choices. This section examines the advantages and disadvantages of the relational data model and the XML data model. This highlights some important design questions. If you are starting from scratch, your first question might be, "Should I store my data in a traditional relational structure or as XML?" Consider the contrasting characteristics of the relational and XML data models shown in the table below.

Relational model XML model Tabular representation of your dataHierarchical representation of your dataStrongly structured. Static schema definitions. The same schema applies to every row of a table. Semi-structured. Flexible schema definitions. An XML schema may or may not exist for some or all of your XML documents. XML Schemas are easily extended.

All relationships are defined by primary keys and foreign keys.

A document contains both the data and relationship information that describes how the data is related.

Order is unimportant. The information is organized in sets which are unordered by definition.

Order is significant. Information is organized in sequences which are ordered by definition. Strongly typed. Each column has exactly one data type.Optionally typed. Types might or might not be defined for some or all elements and attributes in an XML schema.ANSI/ISO Standardization W3C Standardization 3-value logic: true, false, null 2-value logic: true, false Null valuesEmpty elements, missing elements

A key difference between the two models is that relational data is strongly structured and typed, while XML can be much more loosely structured and typed. XML is therefore also often called semi-structured data. In a relational table, every row has exactly the same number of columns, and each column has exactly one well-defined data type. This is very stringent, but allows very efficient processing of the data. However, the relational model can be too stringent for some applications. XML is typically a good choice on these applications. XML is much more flexible. For example, XML elements can be allowed to occur optionally or to occur multiple times under the same parent. Also, you might have an XML schema for some, but not all of your XML documents. And if you have an XML schema, it might define the structure and data types only for parts of a document, leaving it undefined for other parts. Specifically, XML elements and attributes might or might not have a data type defined. Beyond that, the type of an element can be a complex type or even a union of types, which is hard–if not impossible–to represent in the relational model.

If you choose to store your data in a relational structure, you have numerous design considerations to address that cannot be covered in this article. Generally, relational is often the right choice if one or more of the following statements about your data are true:

Similarly, you might be better off storing your data in XML and combining it with relational data if one or more of the following statements are true:

Choosing a database storage option for your XML data

If you choose to combine your XML data with relational data, you have further design considerations. Primarily, how will you store and access your XML data in the relational or hybrid database of your choice? Typically you have the option to store XML documents in one piece as text values in a CLOB or Varchar column, or to decompose and convert your XML into a relational format. The latter approach is also called shredding. Additionally, a hybrid database offers native XML storage as a third option. The following sections discuss the pros and cons of each of these three options, and then compares them.

Storing XML as CLOB or VARCHAR

In relational databases you can store entire XML documents in a variable length character type (Varchar) or as character large objects (CLOB). If XML documents are inserted into CLOB or VARCHAR columns, they are typically inserted as unparsed text objects. Avoiding the XML parsing at insert time guarantees very high insert performance. However, without XML parsing the structure of the XML documents is entirely ignored. This precludes the database from doing intelligent and efficient search and sub-document level extract operations on the stored text objects. The only remedy is to invoke the XML parser at query execution time to “look into” the XML documents so that search conditions can be evaluated. Thus, the high insert performance comes at the cost of low search and extract performance. Only a blind full document retrieval, which again ignores the internal XML structure, is a fast way of reading XML documents from CLOB or VARCHAR columns. Sub-document level updates to individual elements or attributes in XML documents also requires costly XML parsing. Note that the parsing cost for queries and updates increases proportionally to the size of the XML documents.

In a nutshell, the CLOB or VARCHAR approach enjoys simplicity and high performance for insert and full document retrieval at the cost of poor search, extract, and update performance.

The CLOB or VARCHAR storage can be a reasonable choice for applications that have one or more of the following properties:

The problem of poor search performance can be improved with side tables, which are also called property tables. The idea is that certain XML element or attribute values, which are frequently used by the application to find documents, are extracted and inserted into separate relational tables. At query time, predicates on these side tables and a join with the main table can significantly speed up search performance. This is illustrated by the following example.

In this example, we store XML document in a CLOB where each document represent an order. In our case an order has a date, a customer name, contains multiple part numbers, and includes additional order information. Here is a sample of one order document:

<order date="2004-11-18">
	<customer>Thompson</customer>
	<part key="82"> .... </part>
	<part key="83"> .... </part>
	<orderinfo>
				 ....
	</orderinfo>
</order>

We frequently search orders by date, customer, and part keys. Therefore, we decide to extract this information into side tables that can be joined with the main table containing the CLOB column with the full documents, as shown in figure 1 below. The population of the side tables requires XML parsing during insert. This shifts some of the parsing overhead from query to insert time. That is, we are buying higher query and search performance by paying an additional cost for inserts. The side tables effectively act as indexes into the CLOB values. Items that occur multiple times in an XML document, such as the part key in our example, must be in a separate side table with a 1-to-many relationship to the main table.

XML documents stored as CLOB with side tables

The following SQL join query efficiently identifies and retrieves all order documents that include part number 83. No XML parsing is required, and only matching documents are touched in the main table:

SELECT order
FROM mainTable, partSideTable
WHERE mainTable.ID = partSideTable.ID
  AND partSideTable.part_key = 83;

Without the side table all order documents in the main table must be read and parsed to find the ones that contain part number 83. For applications with high performance requirements, this is often too time consuming. Depending on the database system, the SQL statement looks similar to this:

SELECT order
FROM mainTable
WHERE extract(order, '/order/part/@key') = 83;

Shredding XML into a relational schema

This approach decomposes XML documents entirely, discards the XML tags, and stores the element and attribute values in regular relational tables (see left side of Figure 2). In this process, most aspects of the XML document structure are usually lost, such as the relative order of elements in the document. Shredding an XML document requires XML parsing at insert time, and a single XML insert can result in a substantial number of relational inserts into a potentially large number of tables. You can view shredding as an extension of the side table approach above, such that all element and attribute values are maintained in relational tables without storing the actual input document. At each level of nesting in the XML input document, each element that can occur multiple times typically needs to be represented by a separate table in the relational schema. This is acceptable for simple XML documents with a relatively small number of distinct elements. However, more complex XML documents easily require dozens or even hundreds of tables in the relational schema. Defining such a mapping from XML to a relational schema can be a very complicated task, sometimes too complicated to be spractical.

A key advantage of the shredding approach is that after an XML document is decomposed, you can use the full power and performance of plain SQL to query and update the data at the relational level. This can be supported by relational indexes. Retrieving decomposed XML documents in full means that the shredded XML needs to be recomposed. This process is also called XML publishing (see right side of Figure 2). Full reconstruction of a document requires a join of as many tables into which the document has been shredded. Thus, retrieval of complex documents can be inefficient if the number of tables is large. Compared to the CLOB approach, shredding allows you to reconstruct documents that are different from the ones that were previously inserted. The SQL language standard has been extended with XML publishing functions to construct XML documents from any kind of relational data. For this SQL/XML publishing, the origin of the relational data does not matter. That is, you can reconstruct previously shredded documents, or you can compose new documents from your existing relational data without ever inserting XML into your database.

Shredding and Publishing of XML data, to and from a Relational Schema

The following example shows where shredding XML to relational works well. The following XML document contains information about two employees of the sales department. The structure of the XML data is very regular and repetitive, which is easily mapped to a relational schema. Figure 3 shows that we need two tables: one for the department information and one for the employee information.

<DEPARTMENT  deptid=“15” deptname=“Sales”>
        <EMPLOYEE>
             <EMPNO>10</EMPNO>
             <FIRSTNAME>CHRISTINE</FIRSTNAME>
             <LASTNAME>SMITH</LASTNAME>
             <BIRTHDATE>1933-08-24</BIRTHDATE>
             <SALARY>52750.00</SALARY>
        </EMPLOYEE>
        <EMPLOYEE>
              <EMPNO>27</EMPNO>
              <FIRSTNAME>MICHAEL</FIRSTNAME>
              <LASTNAME>THOMPSON</LASTNAME>
              <BIRTHDATE>1948-02-02</BIRTHDATE>
              <SALARY>41250.00</SALARY>
          </EMPLOYEE>
</DEPARTMENT>
XML Document Shredded into 2 Tables

After the data is shredded, you can use standard SQL/XML publishing functions such as XMLELEMENT, XMLATTRIBUTES, XMLAGG, and so forth to reconstruct the XML document. The following SQL/XML statement reads relational data from the tables in Figure 3 and composes the same document that we shredded into these tables. Note that the nesting of the SQL/XML publishing functions corresponds to the structure of the document. The first XMLELEMENT function constructs a top level element named DEPARTMENT that contains two attributes and an aggregation of EMPLOYEE elements. The aggregation is needed because multiple employees are in each department. Each EMPLOYEE element then contains a forest of elements that hold the employee-specific information. The function XMLFOREST is simply an abbreviation so that we do not have to write a separate XMLELEMENT function for each column of the employee table. This SQL/XML statement will produce a separate XML document for each distinct department, one document for each result row. The outermost function, XML2CLOB, converts the constructed XML value into a CLOB character type that the application can access. Alternatively, the new SQL/XML standard function XMLSERIALIZE allows you to cast the constructed XML value to CHAR or VARCHAR values of a specific length.

SELECT XML2CLOB(
    XMLELEMENT(NAME "DEPARTMENT", 
               XMLATTRIBUTES (d.deptid, d.deptname ),
               XMLAGG( XMLELEMENT(NAME "EMPLOYEE", 
                            XMLFOREST (e.empno, 
                                       e.firstname, 
                                       e.lastname, 
                                       e.birthdate, 
                                       e.salary)      
						                            )
                      )
               ) 
      ) AS "deptdoc"
FROM department d, employee e
WHERE d.deptid = e.deptid
GROUP BY  e.dept;

A common pain point of shredding is the required mapping between XML schema and relational schema. While simple in the example above, it is complicated for more complex XML documents. Also, a separate mapping has to be defined for each distinct type of document. That is, for each XML schema involved in an application. If a new document with an unexpected structure is received, it cannot be processed until a new mapping is defined for this document. Also, changes to the XML schemas that are used by the application can easily break the existing mappings.

For example, imagine the following XML schema change: The "max-occurrence" indicator for an element E is changed from 1 to greater than 1. This means that the relational column that holds the existing values for E must be split out into a separate table by itself. Furthermore, primary key and foreign key values must be artificially introduced to establish the 1-to-many relationship between the new and the old table. This is a daunting task if the table already contains a large amount of data. Thus, once data has been inserted, many types of changes to the schema are not feasible. This severely restricts the flexibility that XML is typically used for in the first place.

In addition to the schema mapping, the shredding approach might also require the translation of XPath or XQuery into SQL. This translation is known to be difficult, and the resulting SQL can be inefficient. Beyond that, complex XQueries can even be untranslatable into SQL. Thus, shredding is most feasible if only simple XPath operations are used or if the applications are designed to work directly against the underlying relational schema.

In summary, shredding can be a reasonable choice for applications that have one or more of the following properties:

Storing XML in a native XML data type

Various database systems provide an XML data type so that you can define columns of type XML, as shown in the following CREATE TABLE statement.

CREATE TABLE orders(orderkey INTEGER, order XML);

Documents can be inserted into the XML column and can be updated and searched. Although the database system might let you view and manipulate the data in that column as XML, it does not necessarily store and process your XML documents natively. Under the covers it might still convert XML data to relational, and Xpath to SQL. This is fundamentally the same as the shredding approach discussed above. Or, the XML type might be internally implemented as a CLOB with all the pros and cons that were already explained. Thus, read your database system's documentation carefully if it claims to provide "native XML" support. It might not be as native as you think.

Remember that we defined native XML support as a system which processes and stores XML data using the XML data model. This means that XML documents are stored and manipulated in a parsed format, such as the XML Infoset or the XQuery/XPath Data Model. This is different from the textual angle-bracket representation of XML, it requires XML parsing but no mapping from the XML data model to a different data model. Specifically, the XML data is not stored as text and not shredded to relational or object-relational structures. An internal representation is used for processing and storage on disk, which reflects the hierarchical structure of the XML data. After all, XML is not just a mark-up language, but also a hierarchical data model. Figure 4 shows an XML document and and its hierarchical representation. A true native XML database system uses trees of nodes as the fundamental storage and processing model.

XML Document and its hierarchical representation

How does native XML storage compare to CLOB/Varchar storage? XML parsing at insert time leads to somewhat lower insert performance than for plain CLOB storage without side tables. Similarly, full document retrieval from native XML storage requires serialization, which converts the parsed XML format back into its original textual representation. Again, this can never be as fast as reading a full XML document from a CLOB/Varchar column, which does not involve serialization (because the XML is already stored in text form). The key advantage of the native XML storage is that the structure of the XML documents is explicitly represented in the database. Thus, search, extract, and update operations do not require XML parsing can be executed at vastly higher performance than with CLOB storage. Basically, native XML storage sacrifices some of the insert and retrieval performance to gain much higher query performance. This is a reasonable trade-off for most applications because business data is more often searched and analyzed than inserted. As a result, most business data typically has one insert and many queries.

How does native XML storage compare to shredding? The key advantage of native XML over shredding is obvious: No mapping exists between data models. The XML document structure and order is preserved. XQuery and Xpath can be processed without translation into SQL and without further XML parsing. XQueries and Xpath expressions, which are often navigational in nature, can be efficiently processed by traversal of the stored document trees. Documents with and without XML schema can be stored in a native XML column, even documents with changing and evolving schemas, without the need to adjust any complex mapping. Retrieval of full documents or document fragments do not require joins across many tables. Native XML storage is the only efficient option if the XML data contains hundreds of optional elements and attributes, but each indidvidual instance document contains only a fraction of them. For this sparse data, the cost of normalization is prohibitive and denormalization is impractical because of limits to the maximum width of a row and maximum number of columns in a table. Hence, there is a need to persist and search XML natively along-side other relational data.

In summary, native XML storage is a good choice for applications that have one or more of the following properties:

Storing XML natively does not preclude it from being seamlessly integrated with relational data in the same database or even in the same table. The SQL language has been enhanced with new functions such as XMLQUERY, XMLEXISTS, and XMLTABLE. These functions can be used in SQL statements to access XML data and combine it with relational data.

The function XMLTABLE produces a virtual SQL table containing data that is derived from XML documents stored in the XML database. This can be used to provide an application with a relational (tabular) view of natively stored XML data. Thus, you do not necessarily need to shred XML into relational tables for it to be accessed by relational applications such as BI reporting tools. As an example, let's revisit the department document which we shredded into two tables in the previous section (Figure 3). Here we assume that the same document is inserted in the XML column "dept" of the table "personnel". The following SQL statement applies the XMLTABLE function to this XML column to produce a relational representation of the data just like the table in Figure 3 above.

CREATE TABLE personnel( dept XML );

SELECT DEPTID, EMPNO, FIRSTNAME, LASTNAME, BIRTHDATE, SALARY 
FROM XMLTABLE ( '$dept/DEPARTMENT' 
      PASSING personnel.dept AS "dept" BY REF
      COLUMNS DEPTID    INTEGER PATH '@deptid
              EMPNO     INTEGER PATH 'EMPLOYEE/EMPNO', 
              FIRSTNAME VARCHAR(50) PATH 'EMPLOYEE/FIRSTNAME',  
              LASTNAME  VARCHAR(50) PATH 'EMPLOYEE/FIRSTNAME',  
              BIRTHDATE VARCHAR(50) PATH 'EMPLOYEE/FIRSTNAME',  
              SALARY    DECIMAL(8,2) PATH 'EMPLOYEE/SALARY' );

Summary

The table below summarizes the key advantages and disadvantages of the three XML storage options that we discussed in this paper. None of the three approaches is the single best choice for every application. Applications with specific characteristics and requirements may find that CLOB storage or shredding is the best choice for them. However, for a large share of XML applications only native XML storage offers the flexibility and performance that they require. Furthermore, most business applications do not deal with XML data alone. They have existing relational data and also continue to produce relational data. There is no "either XML or relational," it's typically "XML and relational." XML data and relational data must be processed in an integrated manner, and database systems are required to provide efficient support for both. For many applications, the best choice will be a hybrid relational and XML database system that integrates true native XML support with mature relational database technology. Thus, to answer the question in the title of this paper, XML and relational storage are not and cannot be mutually exclusive.

Feature CLOB Shred NativeSchema FlexibilityBestBad Best XML Search PerformanceBadGood Best Full document retrieval performance Best Bad Good Partial document retrieval performance Bad Good Best Insert PerformanceBestBad Good Update performance (sub-document level) Bad Good Best Sub-document level concurreny control BadGood BestFull document deleteBestBadGood Preserves document structure and order BestBad BestXML Parsing required at insert timeOptional YesYes XML Parsing required at query time (to search & extract elements/attributes)Yes NoNo

-----------------------------

Acknowledgement

We would like to thank Matthias Nicola (IBM Silicon Valley Lab) for his contributions to this paper, such as the comparison of the database storage options for XML data and the corresponding examples. We also wish to thank David Oberstadt (IBM Silicon Valley Lab) for his help with editing and designing this paper.

Biography

George Lapis

Technical Manager, DB2 UDB Compiler Development , IBM Corporation

George Lapis has worked in database software for more than 25 years. He was a member of R* and Starburst research projects at IBM's Almaden Research Center in San Jose, California. He also was a member of the compiler development team for several releases of DB2 Universal Database. His expertise is mostly in compiler technology and implementation. For the last several years he has lead the compiler development team at IBM's Silicon Valley Lab in San Jose, California working on SQL, XML, and XQuery for DB2 Universal Database.