XTech 2005: XML, the Web and beyond.

Matching Python idioms to XML idioms



Abstract

Python offers programmers built-in tools for the universal approaches to XML processing: SAX and DOM. These libraries were developed by people who are as much a part of the XML world as the Python world. Such duality is rare in the Python community, which has in general been quite hostile to the likes of SAX and DOM (and often the likes of XML itself). The main quarrel with SAX and DOM is that they lack the flexibility and expressive power that attracts people to Python in the first place.

As usual in such communities, the backlash comes in the form of code. There has been a recent explosion in tools that provide "Pythonic" approaches to XML processing, including ElementTree and Amara XML Toolkit, the latter developed by Mr. Ogbuji.

Some examples of these methods are:

Pythonic tree APIs. These take a basic tree data structure (as in DOM), but exclusively use Python conventions, rather than the DOM object model. The results tend to differ greatly from DOM, and in particular require far fewer operations for typical navigation and manipulation. The resulting tree structures are generic, and the same classes of node objects are instantiated regardless of the XML vocabulary that was parsed. The preeminent example is ElementTree.

Python data bindings. These, like the above, are typically data structures that mimic XML's hierarchy, but rather than using generic node objects, they specialize objects to match each XML vocabulary being parsed. The resulting access and manipulation idioms thus use familiar property names from the XML instances. Examples include Gnosis Utilities, XMLObject and Amara Bindery.

Schema-driven data bindings. These are data bindings which use an XML schema to establish or to help specialize the resulting node objects from parsing an instance. The main example is generateDS.

Push and pull DOMs. These are a compromise between streaming APIs such as SAX and tree APIs such as DOM. They allow the developer to process the document in streaming mode, and switch to subtree creation when useful for easier processing. Examples include Python's PullDOM and Amara Pushdom.

Declarative SAX frameworks. These use some declarative language (usually based on XPath) to help automate complex state machinery in SAX handlers. The main example is Amara Saxtools.

SAX co-routine frameworks. These are frameworks that use Python's generators to effect a flow-of-control within SAX handlers that maintains local scope across actual SAX events, greatly simplifying state management. The main example is Amara Tenorsax.

This paper and presentation offer, in numerous code examples, an overview of the more interesting Python-specific idioms in XML processing tools, and focus on Amara XML Toolkit and the several components it provides in order to allow maximum flexibility in XML processing methods.

Introduction

Python, like some other programming languages, is a product as much of a certain philosophy of programming as of mechanical language design. Python programmers often see themselves as having escaped bothersome conventions in other languages that stifle their expression and productivity. To the extent that XML, being a cross-language technology, brings in a whiff of these other conventions, it is viewed with suspicion and even hostility by many in the Python community. There are varying degrees of this reaction, but it is pervasive enough that it cannot be ignored by anyone developing tools for processing XML in Python. Right from the start, the Python community has struggled with ways to reconcile XML's idiosyncrasies with Python's. It is only now that some of the fruit of this work is coming into reflexive acceptance. This is not to say that it is only now that good tools for XML processing in Python are emerging; on the contrary, such tools emerged almost immediately in 1997 and 1998. But it is only now that Python-specific idioms for XML are moving out of the experimental realm into the mainstream.

Python has the universal SAX and DOM (with some minor variations to properly fit the language), but these are well known, and shall not be directly treated in this paper. Going beyond SAX and DOM there are many differentiated approaches to making XML native to Python, and I classify all of them under two banners. The first is tree APIs, which one might think of as analogous to DOM (although the actual syntax and structures are unrecognizable as anything DOM-like). The second is stream APIs, which are in turn somewhat analogous to SAX, although again usually vastly different in syntax and structure.

This paper does not present an introduction to any of the Python tools discussed. There are brief examples in order to demonstrate typical workings with respect to particular XML characteristics. For more of an introduction see Python Paradigms.

Tree APIs

Tree APIs are characterized by easy access to all the data in an XML document, for both reading and modification. They typically require the entire document to reside in memory, which implies practical limits on the size of documents that can be efficiently processed. The basic species of tree APIs define a standard object structure to represent the parts and pieces of XML. You can think of this as defining a class and naming convention to correspond to each type of information item in the XML Infoset. All elements are represented using the same classes and object reference names. All attributes are represented using the same classes and object reference names, different from those of elements, and so on for text and, if supported, processing instructions, comments, etc. As such, these can be considered approximate translations of the Infoset, an abstract model, to a Python representation. Accordingly I call the first class of tree APIs I'll discuss Python XML Infosets.

The second sort of tree API I shall examine in detail is the Python data binding. A data binding is a system for viewing XML documents as databases or programming language data structures, and vice versa. There are many aspects of data bindings, including rules for converting XML into specialized Python data structures, and the reverse (unmarshalling and marshalling), using schemata to provide hints and intended data constructs to marshalling and unmarshalling systems, mapping XML data patterns to Python functions, and controlling Python data structures with native XML technologies such as XPath and XSLT patterns. A data binding essentially serves as a very Pythonic API, but in this paper, the main distinction made in calling a system a data binding lies in the basics of marshalling and unmarshalling. In data bindings, the very object structure of the resulting Python data structure is set by the XML vocabulary. The object reference names come from the XML vocabulary as well. Data bindings in effect hide the XML Infoset in the API, in contrast to Python XML Infosets. They are based on the model of the XML instance, rather than of the Infoset.

ElementTree

I'll be focusing on ElementTree as an example of Python XML Infosets. ElementTree is developed by Fredrik Lundh. It is a collection of lightweight utilities for XML processing. At its heart is a Python XML Infoset. It doesn't really cover much of the Infoset. The focus is squarely on elements, and there are no other node types. Element objects themselves act as Python lists of the element children.

Navigation. ElementTree does not have a representation for the Document information item. When you perform a parse, you get the document element as an object. This means you don't get any document type declaration (DTDecl) information, XML declaration information or root level processing instructions (PIs) (such as stylesheet instructions) and comments. Fredrik Lundh does show an example of a separate code module, ElementTree PIParser, for containing this document element and capturing PIs and comments (but not DTDecl information). This code module adds PIs and comments throughout the document, which are omitted by default. This means that if your code might be using this alternate parser, you have to check whether the top-level object you get from a parse is an element or a document node. Each element is accessed directly as a Python list in order to get the element children:

for child in element:
    process_element(child)

In effect, a subset (the elements) of the children property of element information items is merged into the representation of the element itself. If you use the PIParser, then this subset also includes these other information items. Attributes are accessed from the element object using the attrib data attribute, which returns a Python dictionary object. This is a very close match to the attributes property.

for key in elem.attrib:
    process_attribute(name=key, value=elem.attrib[key])
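
As a runnable sketch of the two idioms above, the following uses xml.etree.ElementTree, the packaging of ElementTree shipped in later Python standard libraries (an assumption of this sketch, since the examples in this paper predate it). The sample markup and variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup, invented for this sketch
SAMPLE = '<ul class="nav" id="menu"><li>Home</li><li>About</li></ul>'

ul = ET.fromstring(SAMPLE)

# An element behaves like a Python list of its child elements
child_tags = [child.tag for child in ul]
child_count = len(ul)
first_item = ul[0]

# Attributes are held in a plain dictionary
attributes = dict(ul.attrib)
```

Because the element is itself the list of children, ordinary list operations such as len, indexing and iteration need no extra accessor.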

You can use Python's iterator protocol to walk over a full tree in document order. In order to do so, you call the getiterator method on a node and iterate over the results.

for elem in tree.getiterator():
    process_element(elem)

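Text content is represented as a chain of simple data members on element instances: text holds the character data before an element's first child, and tail holds the character data that follows the element's end tag.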
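
A small sketch of how this plays out with mixed content, again assuming the standard library's xml.etree.ElementTree. The helper flatten_text is hypothetical, written only to show how the pieces chain together.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p>Hello <em>big</em> world</p>')
em = p[0]

leading = p.text     # character data before the first child
inner = em.text      # character data inside the em element
trailing = em.tail   # character data after the em end tag

def flatten_text(elem):
    """Rebuild the full text content by walking text and tail members."""
    parts = [elem.text or '']
    for child in elem:
        parts.append(flatten_text(child))
        parts.append(child.tail or '')
    return ''.join(parts)

full_text = flatten_text(p)
```

The tail member belongs to the preceding element rather than to the parent, which is one of the quirks of this model when handling mixed content.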

Namespaces. ElementTree supports namespaces by using James Clark's notation directly for element and attribute names. This is a rather different mechanism from most XML processing APIs. So the XHTML document element would have the full name of {http://www.w3.org/1999/xhtml}html. In cases where a prefix is used, this prefix does not appear anywhere in the resulting name. The main advantage of this approach is that it is simple for cases where the XML uses XML Namespaces in its purest form, especially in the matter of whether namespace prefixes are important. Unfortunately, many XML conventions violate some of these principles, especially by using scoped, namespace-qualified names within attribute and text content. Dealing with such cases requires preservation of prefixes. I developed and Fredrik Lundh refined a separate specialized parser module that preserves namespace prefixes (see the bibliography for these and more resources relating to ElementTree and namespaces).
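
To make the notation concrete, here is a minimal sketch assuming the standard library's xml.etree.ElementTree. The helper split_clark is hypothetical and not part of ElementTree.

```python
import xml.etree.ElementTree as ET

XHTML_NS = 'http://www.w3.org/1999/xhtml'

# The h prefix is used in the source but does not survive in element names
DOC = ('<h:html xmlns:h="http://www.w3.org/1999/xhtml">'
       '<h:head><h:title>Demo</h:title></h:head></h:html>')

root = ET.fromstring(DOC)
root_tag = root.tag

def split_clark(name):
    """Split a {uri}local name into (uri, local); uri is None if unqualified."""
    if name.startswith('{'):
        uri, local = name[1:].split('}', 1)
        return uri, local
    return None, name

title = root.find('{%s}head/{%s}title' % (XHTML_NS, XHTML_NS))
```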

In addition to XML core matters such as Infoset and namespaces, I also look at data selection and directed processing using XPath (including the XSLT patterns subset). XPath has become part of the basic tool-kit for XML processing across platforms and languages, and an essential component of the XML idiom.

XPath. ElementTree has limited support for XPath, supporting most of what some call "tumblers": simple, chained use of the child axis on elements--no additional axes (including the attribute axis) nor predicates, nor any XPath functions. The XPath support is not extensible. It includes no namespaces support. The easiest way to approximate XPath predicates in ElementTree is to use iterators coupled with custom Python code for filtering purposes. To give a trivial example, in order to get all stylesheet links in XHTML, you would do the following:

for elem in tree.getiterator():
    #Use get() because most elements have no rel attribute
    if elem.attrib.get(u'rel') == u'stylesheet':
        process_stylesheet_link(elem)
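
A self-contained version of this filtering idiom might look as follows. It assumes the standard library's xml.etree.ElementTree, where iter() is the later name for getiterator(); the sample page and the generator name stylesheet_links are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, invented for this sketch
PAGE = '''<html><head>
<link rel="stylesheet" href="main.css"/>
<link rel="alternate" href="feed.xml"/>
<link rel="stylesheet" href="print.css"/>
</head><body><p>No rel attribute here</p></body></html>'''

def stylesheet_links(root):
    """Yield every element whose rel attribute equals "stylesheet"."""
    for elem in root.iter():
        # get() returns None for elements without a rel attribute
        if elem.attrib.get('rel') == 'stylesheet':
            yield elem

hrefs = [link.get('href') for link in stylesheet_links(ET.fromstring(PAGE))]
```

Wrapping the filter in a generator keeps the predicate logic in one place, which is roughly the role an XPath predicate would otherwise play.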

Python XML Infosets have the advantage of using far fewer operations than DOM in any given task. They are simpler to learn and generally less quirky. They do inevitably suffer from quirks in areas related to deficiencies or even useful characteristics of XML, such as XML Namespaces on one hand and mixed content on the other.

Mapping XML structures to object structures

Before diving into data bindings, it is important to consider some of the ramifications of reflecting the XML instance structure directly into Python. Take Listing 1, a simple XML example serving as an inventory file for a library.

Listing 1. Simple XML example for data bindings
<library>
  <name>The XML Institute Public Library</name>
  <book isbn="0764547607">
    <title>The XML Bible, 2nd Edition</title>
  </book>
  <book isbn="0321150406">
    <title>Effective XML</title>
  </book>
  <book isbn="1861005946">
    <title>Beginning XSLT</title>
  </book>
</library>

The XML hierarchy doesn't really match data structure hierarchies in most programming languages. As an example, Listing 2 is a sort of pseudo-code data structure that corresponds to the library inventory document.

Listing 2. Example of a programming language structure based on Listing 1
structure library
  begin
    string name;
    list<book> books;
  end;

structure book
  begin
    string isbn;
    string title;
  end;

The library structure comprises a reference, name, to an object of type string, and a reference, books, to a list object designated to hold objects of type book. The structure book is pretty straightforward. The books reference doesn't really correspond to anything in the original XML. It is purely a link from the library structure to the contained book structures. The library structure is analogous to an element information item, and the books reference corresponds to a projection on the children property on that element item, a projection which selects a subsection of the element items in the children list.

As a side note, the need for structure references in programming languages often makes programmers assume that they need similar conventions in XML. People coming from a programming language background will often insist on replacing the XML design in Listing 1 with that in Listing 3.

Listing 3. Simple XML example with container element added to correspond to structure reference
<library>
  <name>The XML Institute Public Library</name>
  <books>
    <book isbn="0764547607">
      <title>The XML Bible, 2nd Edition</title>
    </book>
    <book isbn="0321150406">
      <title>Effective XML</title>
    </book>
    <book isbn="1861005946">
      <title>Beginning XSLT</title>
    </book>
  </books>
</library>

This is not always good XML design, and it also opens up a "turtles all the way down" problem for mechanical mapping between XML and programming data structures: technically the new books element needs its own new structure reference from the library structure, and again from the books element to its XML children. It's always dangerous to try to make such mappings too literally. Another common problem with such literal mappings is that in Listing 2, many people are used to the fact that the relationship between the book structure and its isbn member is the same as that to its title member; these people often end up expressing both as either element or attribute, without considering that in XML design, the former is better expressed as an attribute and the latter as a child element.

The programming structures in Listing 2 are actually very different in nature from the element structures in Listing 1 and even Listing 3. The structure members are actually named references to other data. The key point is that the names--"name", "books", etc.--do not represent the actual string or list-of-book data items ("objects", of course, in OO languages), but rather the relationship between each instance of the structure and these objects ("associations" in OO modeling). In most programming languages, you deal with object references and the actual objects being referenced in tandem, without considering the distinction between the two (although advanced techniques and a firm command of most languages require programmers to thoroughly understand this distinction). Listing 4 is a very literal translation to XML that takes this distinction into account.

It's not easy to illustrate static data structures in such a dynamic language as Python, but Listing 4 is an attempt.

Listing 4. Python example of a programming language structure based on Listing 2
class book:
    def __init__(self):
        self.isbn = ""
        self.title = ""

class library:
    def __init__(self):
        self.name = ""
        self.books = []

a_lib = library()
a_lib.books.append(book())

In essence what a Python data binding does is to automate the translation from the XML of Listing 1 to the structure in Listing 4, as part of the basic parse operation, and to maintain that interpretation through mutation and re-serialization. One important difference is that in data bindings attributes are typically differentiated from elements in some way, so that book.isbn isn't modeled in exactly the same way as book.title. Maintaining this distinction helps with re-serialization and maintains the integrity of XML tools such as XPath. This opens up some interesting choices. In particular, how does one name object references and class instances? And how many of Python's data structure tricks does one use? Answering such questions is usually a matter of trading off Python user convenience against fidelity to the XML (which allows for data integrity).
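
To show what is being automated, here is a hand-written sketch of that translation for the library document of Listing 1. It assumes the standard library's xml.etree.ElementTree for parsing; the function bind_library and the capitalized class names are hypothetical, and a real data binding would derive the equivalent structure from the document rather than from hard-coded names.

```python
import xml.etree.ElementTree as ET

LIBRARY_XML = """<library>
  <name>The XML Institute Public Library</name>
  <book isbn="0764547607">
    <title>The XML Bible, 2nd Edition</title>
  </book>
  <book isbn="0321150406">
    <title>Effective XML</title>
  </book>
  <book isbn="1861005946">
    <title>Beginning XSLT</title>
  </book>
</library>"""

class Book:
    def __init__(self):
        self.isbn = ""
        self.title = ""

class Library:
    def __init__(self):
        self.name = ""
        self.books = []

def bind_library(xml_text):
    """Hand-coded unmarshalling of the library vocabulary (illustration only)."""
    root = ET.fromstring(xml_text)
    lib = Library()
    lib.name = root.findtext('name')
    for book_elem in root.findall('book'):
        book = Book()
        book.isbn = book_elem.get('isbn')          # from an attribute
        book.title = book_elem.findtext('title')   # from a child element
        lib.books.append(book)
    return lib

a_lib = bind_library(LIBRARY_XML)
```

Note that this sketch flattens away the attribute versus element distinction, which is exactly the information a data binding typically keeps in order to re-serialize faithfully.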

Amara Bindery

I'll be focusing on the Bindery component of Amara XML Toolkit as an example of Python data bindings. Amara XML Toolkit is developed by me, Uche Ogbuji. The most important component is Amara Bindery, the data binding core. It does cover most of the Infoset, but in the case of elements and attributes, it hides this fact by adopting the vocabulary of the XML document into the naming scheme for the object tree. The Python objects that represent the XML information items are known as binding objects.

Navigation. Document and Element objects are fairly similar to the corresponding Infoset information items and represent the children property as an xml_children list, containing a mixture of text and object references in document order. Most specialized property names on binding objects are prefixed with "xml" in order to minimize clashes with names from the XML vocabulary (names starting with "xml" in any case are reserved in XML 1.0). Such clashes become an important consideration in the data binding approach. Elements are created in classes for each XML name, organized according to namespaces (in a manner transparent to most users).

#This is just a literal example.  Most users would use XPath
for node in elem.xml_children:
    process_node(node)

Elements are also available as direct object references on the parent object, named according to the XML element's local name. Attributes are "flattened" into direct object references on each element, although they are also, for reasons of integrity, available through a dictionary called xml_attributes.

#Accessing the attributes on an XHTML img element:
print img.src
print img.height
print img.width

Text objects are included in the xml_children list, as mentioned above. The entire child text content of an element can also be obtained at a go by using Python's unicode type conversion function.

#Accessing the text of an XHTML p element
print unicode(p)

Amara uses Python iterators to represent multiple child elements with the same name. This is necessary because of the object reference naming scheme, and works naturally in the Python idiom:

#Accessing all p elements in an XHTML body
for elem in body.p:
    process_p(elem)
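
The naming scheme described above can be imitated in a few lines of plain Python. The following toy wrapper is emphatically not Amara's implementation or API; it is a hypothetical sketch, built on the standard library's xml.etree.ElementTree, of how child elements and attributes can be surfaced under names taken from the XML vocabulary. Unlike Amara, it always returns child elements as a list.

```python
import xml.etree.ElementTree as ET

class ToyBinding:
    """Hypothetical wrapper exposing XML names as Python attributes."""

    def __init__(self, elem):
        self._elem = elem

    def __getattr__(self, name):
        # Invoked only when normal attribute lookup fails
        elem = self.__dict__['_elem']
        children = [ToyBinding(c) for c in elem if c.tag == name]
        if children:
            return children              # child elements win over attributes
        if name in elem.attrib:
            return elem.attrib[name]     # "flattened" XML attribute
        raise AttributeError(name)

    def __str__(self):
        # Entire text content of the element, at any depth
        return ''.join(self._elem.itertext())

body = ToyBinding(ET.fromstring(
    '<body><p>One</p><img src="a.png" width="10"/><p>Two</p></body>'))

paragraphs = [str(p) for p in body.p]
img_src = body.img[0].src
```

Even this toy shows why name clashes matter in the data binding approach: an element with both a child and an attribute of the same name forces a choice about which one the Python attribute refers to.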

Namespaces. Amara supports all aspects of namespaces, preserving local names, namespaces and prefixes (for those unfortunate cases where prefixes are significant). Binding object reference names are based strictly on local name.

XPath. Amara supports almost all of XPath; the cases that are not supported tend to be on the extreme margins. It supports the various axes (including the attribute axis), namespaces, predicates and the XPath function library. This allows you to use XML's idiom directly for those situations where the model mismatch means that Python is a bit awkward. Rather than directly access an element's xml_children list to iterate over all elements, a user would probably write:

for child in elem.xml_xpath(u'*'):
    process_node(child)

Remember that the XPath u'*' selects only the children of the context node (assuming the context node is an element, which is usually the case). In order to get all stylesheet links in XHTML, one would write:

for elem in doc.xml_xpath(u'//*[@rel="stylesheet"]'):
    process_stylesheet_link(elem)

In order to get all links at any depth within a div with ID "blogroll" in XHTML, one would write:

for elem in doc.xml_xpath(u'//div[@id="blogroll"]//a'):
    process_blog_link(elem)

This saves the user from having to maintain state and tree depth while working with a Python iterator.
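
For comparison, here is a sketch of the bookkeeping that the blogroll query replaces when no XPath engine is at hand. It assumes the standard library's xml.etree.ElementTree; the sample page and the function blogroll_links are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, invented for this sketch
PAGE = '''<html><body>
<div id="main"><a href="/home">Home</a></div>
<div id="blogroll">
  <ul><li><a href="http://example.org/a">A</a></li>
      <li><span><a href="http://example.org/b">B</a></span></li></ul>
</div>
</body></html>'''

def blogroll_links(elem, inside=False):
    """Yield a elements at any depth below a div whose id is "blogroll".

    The inside flag is the state that the XPath expression tracks for us.
    """
    for child in elem:
        if inside and child.tag == 'a':
            yield child
        child_inside = inside or (
            child.tag == 'div' and child.get('id') == 'blogroll')
        yield from blogroll_links(child, child_inside)

hrefs = [a.get('href') for a in blogroll_links(ET.fromstring(PAGE))]
```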

Other Python data bindings include Gnosis Utilities (which is quite versatile), generateDS (which is schema driven), xmltramp (which is simple almost to the point of sketchiness) and XIST (which uses a Python form of schema, and is quite versatile).

Stream APIs

Stream APIs report the bits of the XML document as they're parsed. This means that they can discard those parts of the document that are not in scope at the time, which makes them similarly efficient to SAX, but with a Python-friendly twist. I'll cover two of these, but more briefly than my discussion of the tree APIs.

Pulldom

Pull DOMs are available for several languages. They are designed to give developers the ease of DOM and the efficiency of SAX. Pull DOMs only load in parts of an XML document as they are requested. Python's pull DOM is part of the standard library (the only such software examined in this paper). Listing 5 is a snippet illustrating the workings of pulldom.

Listing 5. Python pulldom example
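    from xml.dom import pulldom

    #xhtml_file is a file name or file-like object
    events = pulldom.parse(xhtml_file)
    for (event, node) in events:
        if event == pulldom.START_ELEMENT:
            if node.tagName == "a":
                #Build the full DOM subtree for this anchor only
                events.expandNode(node)
                process_anchor(node)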

The loop for (event, node) in events: is a lightweight one where the DOM node is purely a skeleton. When one reaches the interesting part of the XML (in this case, any XHTML anchor element), expandNode is used to build the complete node at the current subtree. On each loop iteration, any previously expanded nodes will usually have gone out of scope, which means the Python garbage collector can mop them up. Navigation is a matter of using the current event's details to manage state and thus keep track of one's context in the XML. It is effectively linear, except when dealing with expanded DOM subtrees, which provide all the navigation methods of DOM. Python's pulldom supports namespaces in a pretty straightforward manner. Python's pulldom does not by itself support XPath. One can gain XPath support on expanded nodes by installing PyXML.
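
A self-contained variant of Listing 5, parsing from a string so that it can be run as is, might look like this. The sample markup and the function anchor_hrefs are invented for illustration; pulldom.parseString, expandNode and the DOM methods used are part of the standard library.

```python
from xml.dom import pulldom

# Hypothetical sample markup, invented for this sketch
XHTML = ('<html><body><p>See <a href="http://example.org/">the '
         '<em>example</em> site</a> and <a href="/about">about</a>.</p>'
         '</body></html>')

def anchor_hrefs(xml_text):
    """Return (href, number of em descendants) for each anchor element."""
    found = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == 'a':
            # Until expanded, the node has no children to navigate
            events.expandNode(node)
            em_count = len(node.getElementsByTagName('em'))
            found.append((node.getAttribute('href'), em_count))
    return found

anchors = anchor_hrefs(XHTML)
```

After expandNode returns, the event stream resumes after the end of the expanded element, so the contents of an anchor are not reported a second time.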

Amara Pushbind

Pushbind also operates in streaming mode, and also instantiates subtrees of interest for full access, but there are two key differences from pull DOMs:

  • Rather than having to write procedural code to maintain state through the XML stream in order to find the XML subtrees of interest, the user registers XSLT patterns (a simple subset of XPath) that declare ahead of time the parts of the XML that are of interest. Pushbind takes care of all the state management, and instantiates the requested subtrees in turn and pushes them back to the user's code. This difference is analogous to the difference between "pull" and "push" processing in XSLT. (See the XSLT FAQ entry on push versus pull.)
  • The instantiated subtrees are Amara binding objects rather than DOM nodes.

The following code snippet illustrates the workings of Amara Pushbind.

Listing 6. Example of Amara Pushbind, functionally equivalent to Listing 5
for elem in binderytools.pushbind(u'a', source=xhtml_file):
    process_anchor(elem)

The listing is much shorter than the pulldom example, because the user no longer has to write any code to track progress through the tree. You can just declare that you're looking for all a elements. That declaration, in the form of an XSLT pattern, is the first argument to the pushbind function. Each matching element is instantiated as a full subtree available through the resulting iterator, and in Listing 6 is handled in the body of the for loop. The subtree is an Amara binding object, so all the facilities discussed in the Amara Bindery section above are available.
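
A rough analogue of this push style can be sketched with the iterparse function from the standard library's xml.etree.ElementTree. This is not Pushbind: it matches on a bare element name rather than an XSLT pattern, and it yields generic elements rather than binding objects. The generator push_elements and the sample markup are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample markup, invented for this sketch
XHTML = (b'<html><body><p>See <a href="http://example.org/">the '
         b'<em>example</em> site</a> and <a href="/about">about</a>.</p>'
         b'</body></html>')

def push_elements(source, tag):
    """Yield each complete element named tag as the parser finishes it."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            yield elem
            # Drop the subtree's content once the consumer is done with it
            elem.clear()

links = [(a.get('href'), ''.join(a.itertext()))
         for a in push_elements(io.BytesIO(XHTML), 'a')]
```

The caller's loop contains no state management at all, which is the property the push approach is after. In this sketch the emptied elements remain attached to their parents, so memory use still grows slowly with document size.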

The Amara package also includes TenorSAX, another stream API which is not covered in this paper. TenorSAX uses Python generators to try to make SAX processing a little less disjointed and thus easier.
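
The general idea of driving SAX with generators can be sketched with the standard library alone. The following is not TenorSAX's API; the handler class CoroutineHandler and the generator collect_titles are hypothetical names used only to show how a generator keeps local state alive across separate SAX callbacks.

```python
import xml.sax

LIBRARY_XML = b"""<library>
  <name>The XML Institute Public Library</name>
  <book isbn="0764547607"><title>The XML Bible, 2nd Edition</title></book>
  <book isbn="0321150406"><title>Effective XML</title></book>
  <book isbn="1861005946"><title>Beginning XSLT</title></book>
</library>"""

def collect_titles(results):
    """Generator that receives SAX events and gathers title text."""
    while True:
        kind, value = yield
        if (kind, value) != ('start', 'title'):
            continue
        parts = []   # local state that survives across SAX callbacks
        while True:
            kind, value = yield
            if kind == 'end' and value == 'title':
                break
            if kind == 'chars':
                parts.append(value)
        results.append(''.join(parts))

class CoroutineHandler(xml.sax.ContentHandler):
    """Forward each SAX event to a generator via send()."""

    def __init__(self, consumer):
        super().__init__()
        self._consumer = consumer
        next(self._consumer)   # advance to the first yield

    def startElement(self, name, attrs):
        self._consumer.send(('start', name))

    def endElement(self, name):
        self._consumer.send(('end', name))

    def characters(self, content):
        self._consumer.send(('chars', content))

titles = []
xml.sax.parseString(LIBRARY_XML, CoroutineHandler(collect_titles(titles)))
```

The nested loop in collect_titles reads like straight-line code for "inside a title element", where a conventional handler would need flags or a stack stored on the handler object.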

Conclusion

As I've mentioned before, Python has a rich variety and flexibility of XML processing tools. There are tools and techniques for every taste and need, numbering at least 80 (whether there is too much clutter is certainly a matter of debate, but there are about 5-10 main ones to choose from). The trick has always been to figure out which aspects of XML to use as they are, and which to abandon in favor of native Python conventions. The key has always been in actual user profiles. XPath is almost universally accepted as a useful XML technology for use from Python while many have been turning away from raw SAX or DOM. In this paper you have seen some of the many ways in which XML models are interpreted in Python, which might help you in your choice of tools, or might clarify areas of needed extension and experimentation.

Bibliography

[XML Infoset] W3C XML Information Set (InfoSet) Recommendation
[ElementTree] ElementTree
[Introduction to ElementTree] "Simple XML Processing With elementtree", XML.com, 12 February 2003
[ElementTree/XML Namespaces] "XML Namespaces Support in Python Tools, Part Three", XML.com, 30 June 2004
[ElementTree PIParser] Reading processing instructions and comments with ElementTree
[XML Namespaces] W3C Namespaces in XML 1.1 Recommendation
[ElementTree/XML Namespaces 2] More XML
[Amara XML Toolkit] Amara XML Toolkit
[Introduction to the Amara XML Toolkit] "Introducing the Amara XML Toolkit", XML.com, 19 January 2004
[Amara for output] "Making Old Things New Again" (Proper XML output using 4Suite and Amara), XML.com, 20 April 2004
[Python's pull DOM] "Using (Python's) pull-based DOMs"
[PyXML] Python/XML Libraries
[XSLT FAQ entry on push versus pull] Push vs Pull
[Python Paradigms] Python Paradigms for XML, 11 December 2003

Biography

Uche Ogbuji

Principal consultant, Fourthought, Inc.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a Computer Engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia.