XTech 2005: XML, the Web and beyond.
Fast Infoset @ Java.net is an open source project created by Sun Microsystems to provide access to a fast, fully-featured and robust implementation of the ITU-T/ISO Fast Infoset specification. The Fast Infoset specification defines an alternate serialization for XML that uses a binary encoding. This paper describes the features of the Fast Infoset standard as well as the software components that are currently available from Java.net. Additionally, it provides background information relating this work with other "binary XML" initiatives, such as the work just recently completed by the XML Binary Characterization working group at W3C.
The Fast Infoset specification describes an open, standards-based binary format that is based on the XML Information Set XML Information Set. The XML Information Set specifies the result of parsing an XML document, referred to as an XML infoset (or just an infoset), and a glossary of terms to identify infoset components, referred to as information items and properties. An XML infoset is an abstract model of the information stored in an XML document; it establishes a separation between data and information in a way that suits most common uses of XML. In fact, several of the concrete XML data models are defined by referring to XML infoset items and their properties. For example, SOAP Version 1.2 SOAP 1.2 makes use of this abstraction to define the information in a SOAP message without ever referring to XML 1.X, and the SOAP HTTP binding specifically allows for alternative media types that “provide for at least the transfer of the SOAP XML Infoset.”
Fast Infoset is a "Binary XML" format with respect to the XML Binary Characterization Working Group W3C Notes. The term "Binary XML" is an oxymoron since there is no such thing as Binary XML. "Binary XML" refers to binary formats that can be integrated into the "XML stack" and converted to and from XML.
An XML infoset serialized according to the Fast Infoset specification is referred to as a fast infoset document. Fast infoset documents always retain the hierarchical structure described by the corresponding XML infoset, and depending on which features are selected, can be self-contained or not. Fast infoset documents that are self-contained can be converted to and from XML without information loss, at least with respect to the information items and properties defined in the XML Information Set. Stated differently, Fast Infoset is round trippable with respect to XML, modulo the XML Information Set.
This paper proceeds in the following manner:
Mostly gibberish, if viewed in a text editor! A fast infoset document
appears as rows of hexadecimal octets if viewed using the UNIX command
od -A x -t x1. Some information may be discernible in ASCII
text editors; for example UTF-8 encoded Unicode character strings consisting of
Basic Latin characters may be discernible and correspond to the [local name],
[namespace name] and [prefix] properties of attribute information items,
element information items or chunks of character information items. However,
unless you are interested in reading the specification and looking at
hexadecimal dumps, the best approach to viewing a fast infoset document is to
use a tool to convert it to XML.
The first few bytes of a fast infoset document have a reserved meaning and are important for interoperability. A fast infoset document may begin with a 32-bit binary header, consisting of the hexadecimal octets 0xE0, 0x00, 0x00 and 0x01, or a UTF-8 encoded XML declaration that uses the "finf" encoding followed by the 32-bit binary header. In the absence of metadata, such as a MIME type in a Content-Type HTTP header field or an XML declaration, a parser can identify a Fast Infoset stream by looking at the first two octets of the 32-bit binary header: no well-formed XML document encoded in UTF-8, UTF-16 or UCS-4 can start with those two octets. The last two octets of the 32-bit binary header are used to encode the Fast Infoset version (1.0). If an XML declaration is present then a parser can identify a fast infoset document regardless of the encoding method. Thus, the use of an XML declaration is encouraged given that it also allows parsers not capable of processing fast infoset documents to fail gracefully.
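The header check described above can be sketched in a few lines of Java. This is only an illustration: the class and method names are hypothetical, and the sketch covers only the case where the document starts directly with the binary header (it does not look for a leading XML declaration that uses the "finf" encoding).

```java
class FastInfosetHeaderSketch {
    // The 32-bit binary header quoted in the text: 0xE0 0x00 0x00 0x01.
    static final byte[] BINARY_HEADER = {(byte) 0xE0, 0x00, 0x00, 0x01};

    // Returns true if the first two octets match the first two octets of the
    // binary header. The remaining two octets carry the version and are not
    // examined here.
    static boolean looksLikeFastInfoset(byte[] start) {
        return start.length >= 2
                && start[0] == BINARY_HEADER[0]
                && start[1] == BINARY_HEADER[1];
    }
}
```

A caller would typically read the first few octets of a stream into a buffer, pass them to looksLikeFastInfoset, and then choose between a Fast Infoset parser and an XML parser.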
The Fast Infoset specification, ITU-T Rec. X.891 | ISO/IEC 24824-1 (Fast Infoset), is being standardized at the ITU-T and ISO. As of this writing, the ISO Final Committee Draft ballot has been completed, and the specification has gone for Consent to Last Call at the ITU-T Study Group 17 meeting in Moscow, 30 March - 8 April 2005. The specification is available to all ITU-T sector members and can also be obtained via the corresponding ISO national body in your location. A document based on Annex D of ITU-T Rec. X.891 | ISO/IEC 24824-1 has been published that presents examples of encoding XML infosets as fast infoset documents.
Two Fast Infoset-related MIME types have been successfully approved for
registration by the IESG. The first, application/fastinfoset, identifies
arbitrary fast infoset documents including the serialization of SOAP 1.1
message infosets. The second, application/soap+fastinfoset, identifies
SOAP 1.2 message infosets serialized as fast infoset documents.
The Fast Infoset specification is classified as a generic application of ASN.1. However, no ASN.1 toolkit is required, and it is not necessary to understand other ASN.1 specifications to implement Fast Infoset. The binary encoding is formally specified using an ASN.1 schema and the Encoding Control Notation (ECN), which in a nutshell is a very powerful specification for making customizations to the Packed Encoding Rules to support optimally designed encodings, such as the Fast Infoset encoding or the encoding of legacy protocols. A developer wishing to implement Fast Infoset is not required to understand ECN and can implement it by reading: the normative sections; Annex C which explains in English the details of the encoding in clear and precise terms; and Annex D which contains examples.
The Fast Infoset format has been designed to optimize the axes of compression, serialization and parsing, while retaining the properties of self-description and simplicity. The approach has been to find, when not taking advantage of advanced features, a “sweet spot” where moderate compression can be achieved but not at the undue expense of creation, processing performance and simplicity.
Applying redundancy compression to an XML document will optimize compression but at the expense of both serialization and parsing. The general use of redundancy compression can work well when there are reasonably powerful computers communicating over low bandwidth links but it does not work well in the following cases:
Alternative technologies can produce smaller documents that are both faster to serialize and parse than equivalent fast infoset documents at the expense of self-description and simplicity; examples include the mapping of W3C XML Schema to ASN.1 schema and application of the Packed Encoding Rules, or the format defined in MPEG-7 System extensions.
Fast Infoset is a very extensible format in which it is possible, via the use of encoding algorithms, to selectively apply redundancy-based compression or optimized encodings to certain fragments. Using this capability, as well as other advanced features, it is possible to tune the "sweet spot" for a particular application domain.
Fast Infoset supports the encoding of all information items but not necessarily all properties of certain information items. As is the case for XML, properties such as [in-scope namespaces], [parent] and [owner element] are not part of the encoding but can be computed during parsing. Other properties are simply not encoded and cannot be computed: namely the [specified], [attribute type] and [references] properties of an attribute information item and the [element content whitespace] property of an element information item.
In general, the following relationship holds for an XML document that does not contain or point to a document type declaration (DTD):
If an XML document does contain or point to a DTD then full round tripping may not be possible since some properties are computed from processing a DTD (for example the [attribute type] property). This does not prevent the use of Fast Infoset in SOAP-based Web services since SOAP 1.2 forbids the use of DTDs. Note that an infoset parsed from a fast infoset document can still be validated against a W3C XML Schema to produce a post-schema-validation infoset (PSVI) that includes properties such as [attribute type] in attribute information items.
Not all data in an XML document is regarded as information by the XML Information Set (see Appendix D of the XML Information Set for a detailed list). For example, given an infoset, it is impossible to determine whether the characters of a chunk of character information items have been produced as a result of replacing a parameter-entity reference with a literal entity value or not. It follows that character-by-character equivalence cannot be attained with any format that is based on the infoset model. In fact, most of the programming data models that have been created for XML have similar characteristics: XML processors that make use of these data models do not support character-by-character round trippability. Two good examples are:
the two forms of an empty element (<foo/> and <foo></foo>); and
the kind of quotation marks that delimit an attribute value (foo='bar' and foo="bar").
These syntactic differences may be important for an XML editor application or a text-based versioning system. In this sense, what constitutes information is in the eye of the beholder; yet the XML Information Set is sufficient to serve a large portion of the XML applications that are currently in use.
Fast Infoset supports two types of character strings:
Identifying strings are character strings associated with the [local name], [namespace name] and [prefix] properties of element information items and attribute information items (as well as certain properties of other less common information items). Identifying strings are likely to occur more than once in an infoset. Identifying strings are encoded in UTF-8 on the first occurrence, assigned an index, and referenced using that index for every subsequent occurrence. The use of UTF-8 for the encoding of all these strings appears limiting at first, but it adds to the simplicity of Fast Infoset and only minimally penalizes non-western encodings given that each string is encoded only once. (If the encoding of such strings in a document becomes problematic, as may be the case for small documents with little or no repetition of strings, then an external vocabulary can be used instead.)
Non-identifying strings are character strings associated with a chunk of character information items and the [normalized value] property of an attribute information item (as well as certain properties of less common information items). In general, non-identifying strings are less likely to repeat than identifying strings. Hence, it is a serializer's choice to decide whether such strings should be assigned an index or not; a common heuristic is to assign an index to strings whose length is, for example, less than 7 characters. Non-identifying strings may be encoded in UTF-8 or UTF-16BE (Big Endian), encoded using a restricted alphabet, or encoded using an encoding algorithm.
Fast Infoset mandates support for the same character encoding schemes that are required by all XML processors. Serializers that need to encode non-identifying strings in other character encoding schemes may use application-defined encoding algorithms. However, the use of these encoders may hinder interoperability for two reasons: (i) a character encoding scheme may not be widely implemented and (ii) an application-defined encoding algorithm may not be universally understood.
A vocabulary is a fundamental concept of the Fast Infoset encoding. It enables the compression of fast infoset documents by replacing repeating information with small positive integers. A vocabulary consists of a set of tables. Each table holds information of a certain kind, such as entries for identifying strings, non-identifying strings or name surrogates. The index associated with a table's entry is used throughout the document to refer to that piece of information. For a detailed example of tables, indexing and name surrogates, the reader is referred to UBL Example.
When serializing an infoset to a fast infoset document, a serializer checks whether certain information (e.g., a [local name] property) is present in the corresponding table of the vocabulary. If the information is not present, then the following occurs:
If the information is present as an entry, then the following occurs:
A literal or its corresponding index is encoded in a manner that enables a parser to re-create an identical vocabulary table during the decoding phase.
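The indexing behaviour described above (a literal on the first occurrence, an index afterwards, and a parser that rebuilds the same table) can be sketched in Java as follows. This is a minimal illustration and not the Java.net implementation: the class names, the NOT_PRESENT sentinel and the choice to start indexes at 1 are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Serializer side: maps a string to the index it was assigned (hypothetical class).
class SerializerTableSketch {
    static final int NOT_PRESENT = -1;
    private final Map<String, Integer> indexes = new HashMap<>();

    // Returns the existing index, or NOT_PRESENT after adding the string with
    // the next free index. In the latter case the caller encodes the literal.
    int obtainIndex(String s) {
        Integer existing = indexes.get(s);
        if (existing != null) {
            return existing;
        }
        indexes.put(s, indexes.size() + 1); // assumption: indexes start at 1
        return NOT_PRESENT;
    }
}

// Parser side: rebuilds the same table from the literals it decodes (hypothetical class).
class ParserTableSketch {
    private final List<String> entries = new ArrayList<>();

    // Called when a literal is decoded; it implicitly receives the next index.
    void addLiteral(String s) {
        entries.add(s);
    }

    // Called when an index is decoded.
    String get(int index) {
        return entries.get(index - 1);
    }
}
```

Because both sides add entries in document order, the parser's table ends up with the same index assignments as the serializer's table without the table itself ever being transmitted.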
When processing (i.e., either parsing or serializing) a fast infoset document, vocabulary tables can be constructed dynamically. The vocabulary that is obtained once the processing of the document is complete is referred to as the final vocabulary. Fast Infoset also supports the concept of an initial vocabulary which can be specified using a pre-defined set of table entries and/or a reference to an external vocabulary. A serializer will encode a reference to the external vocabulary (i.e., a URI) and the set of table entries at the head of the document. The final vocabulary will consist of the initial vocabulary plus the table entries generated dynamically during the processing of the document.
Because the tables of the external vocabulary are never encoded, the serializer and parser must agree on the URI that is used to reference the tables so that the parser can reproduce in full the infoset that was serialized. Thus, the use of an external vocabulary in fast infoset documents results in a format that preserves the hierarchical structure of the infoset but is no longer self-describing or self-contained.
An external vocabulary may be specified by one of the following:
An external vocabulary is advantageous when the cost of literally encoding information is too high with respect to the size of a document. This is often the case with small documents because there is less repeating information. Applying redundancy compression to small XML documents can result in poor compression for precisely the same reasons. In some cases, the use of an external vocabulary can result in documents whose sizes are smaller than those obtained by applying redundancy compression techniques.
The further tables of an initial vocabulary may be used when the information that needs to be indexed is known up front, which is often the case when the complete infoset is retained in memory (for example, a DOM document). This can be useful when it is appropriate to assign smaller indexes in inverse proportion to the frequency of information.
A vocabulary consists of twelve tables. The following (more commonly used) tables are introduced:
The use of multiple tables reduces the size of indexes that are encoded,
reduces the search space when serializing to ascertain whether information has
been indexed, and allows for optimal implementation. For example, the Java SAX
API returns [local name] properties as java.lang.String and chunks
of character information items as char[]. Hence, the LOCAL NAME table and
CONTENT CHARACTER CHUNK table can be implemented using the data types that are
best suited for each of the supported APIs.
A name surrogate consists of a tuple of three indexes corresponding to the table entries for the [local name], [namespace name] and [prefix] properties of an element information item or an attribute information item. Therefore, a name surrogate is essentially a representation of a qualified name.
Name surrogates enable a second level of indexing to improve compression, parsing and potentially serializing, particularly if an index of a name surrogate can be encoded directly without the need to obtain an index from the [local name], [namespace name] and [prefix] properties. In this case, it is possible for an element information item and the [local name], [namespace name] and [prefix] properties to be encoded in just one byte (if the index of the name surrogate is less than or equal to 32).
A name surrogate, unless encoded as part of the initial vocabulary, is a conceptual representation. Implementations may choose an appropriate and optimal qualified name representation for entries of the ELEMENT NAME and ATTRIBUTE NAME as long as such entries can be added and assigned indexes in the same manner as if name surrogates were used.
Alignment at octet boundaries occurs for the encoding of each information item and also for the octets of each character string. This translates into the addition of padding bits in certain cases, but whenever possible the Fast Infoset encoding attempts to pack extra information (e.g., indexes or length prefixes) into the spare bits of an octet in order to achieve better compactness. For large documents, this packing can make a significant difference in the size of the resulting document. Bit packing adds a very small cost to the serialization of fast infoset documents because information must be combined with a bitwise OR, but this is partly offset by the fact that fewer bytes end up being written. The cost of unpacking information when parsing can be reduced by using lookup tables and a set of parsing states.
Character strings are always length prefixed in Fast Infoset. Length prefixing always favors the parser for the allocation of resources or the early production of an error when lengths are considered too large (i.e., supporting indefinite lengths always favors the serializer). Often the length of a character string will be unknown until the character string is written out. For example, a character string that is to be encoded in UTF-8 may contain characters beyond the Basic Latin range resulting in the encoding of at least 2 octets per character.
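The following sketch illustrates why the length is often unknown up front: the number of octets produced by UTF-8 depends on the characters, not only on how many there are. The class and method names are hypothetical, and the sketch deliberately does not show how Fast Infoset itself encodes the length prefix.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthSketch {
    // Number of octets the string occupies once encoded in UTF-8. A serializer
    // needs this value before it can write the length prefix.
    static int encodedLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, the three-character string "abc" occupies three octets, whereas the single character U+00E9 (e with acute accent) occupies two and U+20AC (the euro sign) occupies three.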
The [children] property of an element information item or document information item is encoded using an indefinite length encoding which uses a special terminator to indicate the end of a list. This technique enables support for streaming given that it does not require pre-knowledge of the exact number of children.
For a detailed explanation of alignment, padding, packing and length prefixing, the reader is referred to the document based on Annex D of ITU-T Rec. X.891 | ISO/IEC 24824-1.
Restricted alphabets can be used to optimize the size of character strings for a chunk of character information items or a [normalized value] of an attribute information item. A restricted alphabet is a character string containing two or more characters, all of which must be distinct.
A restricted alphabet is essentially an array of characters where each character is assigned an integer according to its position in the array. Given a restricted alphabet and a character string containing only characters that are members of the restricted alphabet, each character is encoded as follows:
The number of bits encoded for each integer is determined by the size of the restricted alphabet.
Fast Infoset specifies two built-in restricted alphabets: a numeric restricted alphabet consisting of the characters “0123456789-+.E ” and a date and time restricted alphabet consisting of the characters “0123456789-:TZ ”. Both alphabets result in an encoding that consists of 4 bits per character (instead of 8 bits for UTF-8 or 16 bits for UTF-16). Since all characters in a restricted alphabet must be valid XML characters, it is not necessary to perform character validation when parsing restricted alphabet encodings.
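As an illustration of the 4-bits-per-character idea, the sketch below maps each character to its position in the numeric alphabet quoted above and packs two positions into each octet. The class and method names are hypothetical. Two further choices are assumptions of this sketch rather than statements about the specification: the first character goes into the high half of the octet, and an odd-length string is padded with the otherwise unused value 0xF. The sketch only handles alphabets of at most 15 characters.

```java
import java.io.ByteArrayOutputStream;

class RestrictedAlphabetSketch {
    // The numeric restricted alphabet quoted in the text (15 characters).
    static final String NUMERIC = "0123456789-+.E ";

    // Encodes each character as its 4-bit position in the alphabet.
    static byte[] encode(String alphabet, String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i += 2) {
            int hi = position(alphabet, s.charAt(i));
            // Assumption: pad the last octet with 0xF when the length is odd.
            int lo = (i + 1 < s.length()) ? position(alphabet, s.charAt(i + 1)) : 0xF;
            out.write((hi << 4) | lo);
        }
        return out.toByteArray();
    }

    // Reverses encode(), dropping a trailing 0xF padding value.
    static String decode(String alphabet, byte[] octets) {
        StringBuilder sb = new StringBuilder();
        for (byte b : octets) {
            int hi = (b >> 4) & 0xF;
            int lo = b & 0xF;
            sb.append(alphabet.charAt(hi));
            if (lo != 0xF) {
                sb.append(alphabet.charAt(lo));
            }
        }
        return sb.toString();
    }

    private static int position(String alphabet, char c) {
        int p = alphabet.indexOf(c);
        if (p < 0) {
            throw new IllegalArgumentException("Character not in alphabet: " + c);
        }
        return p;
    }
}
```

With this sketch the four-character string "3.14" occupies two octets (0x3C, 0x14) rather than the four octets it would need in UTF-8.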
An application-defined restricted alphabet can be specified by adding a character string to the restricted alphabet table of the initial vocabulary (either to the restricted alphabet table of an external vocabulary or to the restricted alphabet table included in the document). Such restricted alphabets are assigned indexes greater than the number of built-in restricted alphabets (which are already added to the table and assigned fixed indexes). When a character string is encoded according to a restricted alphabet, the index of the alphabet in the restricted alphabet table is encoded first so that a parser can determine what alphabet was used.
Encoding algorithms can be used to optimize the size and/or processing speed of character strings for a chunk of character information items or a [normalized value] of an attribute information item.
An encoding algorithm is specified according to a pattern of characters and how a character string matching the pattern is converted to and from an octet string (i.e., a binary value). It is also required that the character string can be converted to and from the octet string without any loss of information so that round-tripping of an infoset is attainable even when encoding algorithms are used.
It is important to stress that encoding algorithms (and also restricted alphabets) do not represent a type system. Any character string that matches a pattern of characters according to an encoding algorithm can use an encoding algorithm regardless of, for example, the data type of the characters specified by a W3C XML Schema. There may be a correspondence between certain encoding algorithms (especially for the built-in encoding algorithms; see later) and W3C XML Schema data types, but there is no dependency or tight coupling.
Conceptually when serializing an infoset to a fast infoset document using encoding algorithms, character strings are converted to octet strings. However, in practice this conceptual process of converting from characters may not be necessary if an application has non-character-based data that can be converted to an octet string without producing an intermediate character string (the inverse also applies to parsing). In those cases, serializing, parsing and data binding are simplified and optimized at the same time. For example, an array of 32-bit integers can be converted directly to an octet string using the built-in int encoding algorithm without the need to first produce a lexeme.
The Fast Infoset specification describes a number of built-in encoding algorithms for arrays of:
All integers and floating point numbers are encoded with the most significant bit first.
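The direct conversion of an array of 32-bit integers to an octet string, without an intermediate lexical form, might look like the following sketch. The class and method names are hypothetical, and the sketch reads "most significant bit first" as big-endian byte order; it illustrates the idea rather than reproducing the built-in int encoding algorithm.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class IntArrayOctetsSketch {
    // Writes each int as four octets, most significant octet first.
    static byte[] toOctets(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                .order(ByteOrder.BIG_ENDIAN);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Reads the octets back into an int array.
    static int[] fromOctets(byte[] octets) {
        if (octets.length % Integer.BYTES != 0) {
            throw new IllegalArgumentException("Length is not a multiple of 4");
        }
        ByteBuffer buf = ByteBuffer.wrap(octets).order(ByteOrder.BIG_ENDIAN);
        int[] values = new int[octets.length / Integer.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getInt();
        }
        return values;
    }
}
```

An application holding an int[] can pass it straight to toOctets, and a data-binding layer on the receiving side can call fromOctets, so neither side has to format or parse decimal text.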
The encoding of an array of octets (i.e. a blob) is specified in Fast Infoset using the base64 encoding algorithm. Note that this algorithm is named in accordance with the specified pattern of characters, namely the pattern that conforms to the canonical base64 encoding as specified by IETF RFC 2045. Conceptually, an octet string is base64-encoded to a character string during serialization and base64-decoded during parsing. In practice, applications can encode the octet string directly and, contrary to the name of the encoding algorithm, never perform base64 encoding and decoding at all!
An application-defined encoding algorithm can be specified by:
Such encoding algorithms will be assigned indexes greater than the built-in encoding algorithms (which are already added to the table and assigned fixed indexes). When a character string is encoded according to an encoding algorithm, the index of the URI for application-defined encoding algorithms or the index of the built-in encoding algorithm in the encoding algorithm table is encoded first so that a parser can determine which encoding algorithm was used.
An open source, Java-based implementation of Fast Infoset technology is hosted at Java.net. The source code is available under the Apache License, version 2.0. Even though this project was initiated by Sun Microsystems, it is open to the general public to join and participate in the development process.
As of this writing, there exist SAX, StAX and DOM serializers and parsers. The SAX library is currently at the most advanced stage in terms of Fast Infoset features supported by the serializer and parser. All three serializers and parsers pass infoset round-trip tests on a wide range of XML documents.
The source code is divided into two areas. The code under the
com.sun.xml.fastinfoset package is implementation specific; the
code under the org.jvnet.fastinfoset package is API
specific.
All serializers inherit from the base class
Encoder. This class provides fundamental support for the
encoding of all information items and properties independent of any particular
serializer API. As a result, concrete serializers tend to be quite lightweight
because most of the encoding is deferred to the Encoder class.
Additional advantages of this code sharing are fewer bugs, a simpler bug fixing
model and interoperability among all the flavors of serializers
supported.
All parsers inherit from the base class Decoder. This class provides basic
support for decoding, namely, buffering and decoding of information that is
common to all parsers.
The Decoder class is fairly lightweight because parsers tend
to have different models for reporting information. SAX provides a callback or
push model while StAX provides a pull
model. These two models are sufficiently different that it was deemed
appropriate, primarily for performance reasons, for each parser to be
implemented separately to avoid sharing a common but incompatible model.
Unfortunately, this decision carries all the disadvantages associated with code
duplication. To alleviate these disadvantages, all parsers make use of decoding
state tables defined in the
DecoderStateTables class. Given an octet that represents
the beginning of an information item, a decoding table enables a parser to
efficiently determine its state via a simple table lookup using the octet's
value. To exemplify, the DecoderStateTables.EII table enables a
parser to determine easily the type and encoding state of a child information
item of an element information item. Specifically, a state such as
DecoderStateTables.CII_UTF8_LARGE_LENGTH represents the following
knowledge:
There are a total of 24 different states represented in this table. An
important state is the STATE_ILLEGAL state. Not all octets
represent valid states, and thus an important function of the tables is to
return STATE_ILLEGAL so that a parser can produce an error. This
technique helps in producing efficient and robust implementations. There are
also decoding state tables for the document information item, attribute
information items, identifying strings and non-identifying strings that
correspond to the [normalized value] property.
The use of decoding state tables results in heavy use of switch statements. Therefore, the parsing of fast infoset documents is likely to perform better on a Java virtual machine (JVM) that efficiently supports case selection in O(1) time, as opposed to less efficient JVMs that support a linear search (in O(n) time) or a binary search (in O(log n) time). (Note that for byte code geeks out there, it has been verified that all relevant switch statements in the code compile to the tableswitch instruction, which enables efficient implementation in either O(1) or O(log n) time.)
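The table-lookup-plus-switch technique can be sketched as follows. Everything specific in this example is hypothetical: the class name, the three states other than STATE_ILLEGAL, and above all the octet ranges assigned to them, which are chosen only for illustration and are not taken from the Fast Infoset specification or from the Java.net code.

```java
class DecodingStateTableSketch {
    static final int STATE_ILLEGAL = 0;     // default value of every table slot
    static final int STATE_ELEMENT = 1;     // hypothetical state
    static final int STATE_CHARACTERS = 2;  // hypothetical state
    static final int STATE_TERMINATOR = 3;  // hypothetical state

    // One entry per possible octet value.
    static final int[] TABLE = new int[256];

    static {
        // Hypothetical octet ranges, for illustration only.
        fill(0x00, 0x7F, STATE_ELEMENT);
        fill(0x80, 0xBF, STATE_CHARACTERS);
        fill(0xF0, 0xF0, STATE_TERMINATOR);
        // All other octets keep the default STATE_ILLEGAL.
    }

    private static void fill(int from, int to, int state) {
        for (int i = from; i <= to; i++) {
            TABLE[i] = state;
        }
    }

    // A single array lookup selects the state; the switch dispatches on it.
    static String classify(byte octet) {
        switch (TABLE[octet & 0xFF]) {
            case STATE_ELEMENT:
                return "element";
            case STATE_CHARACTERS:
                return "characters";
            case STATE_TERMINATOR:
                return "terminator";
            default:
                throw new IllegalStateException(
                        "Illegal octet: 0x" + Integer.toHexString(octet & 0xFF));
        }
    }
}
```

Because the case labels are small consecutive integers, a Java compiler can typically emit a tableswitch for the switch statement, and octets that map to no valid state fall through to the error branch.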
Parsers and serializers use different techniques to implement vocabulary
tables: serializers use hash tables and parsers use optimized arrays. The
vocabulary implementations can be found under the
com.sun.xml.fastinfoset.vocab package, and the hash
tables and arrays can be found under the
com.sun.xml.fastinfoset.util
package. Note that this is not an uncommon technique; for example, the
Xerces
parser provides optimal support for interning of strings (symbols) and the
management of such strings using references.
The hash tables and arrays are not required to be general as is the case
for classes in the java.util package, and they are optimized for
specific data types (such as java.lang.String and char[]). In addition, they
are optimized to support the concept of
initial and external vocabularies. When parsing and serializing, initial and
external vocabularies are essentially read only. Thus, there is optimal support
to defer lookup to a read-only table if the information is not present: a table
may refer to a read-only table (the table of an initial vocabulary), which in
turn may refer to another read-only table (the table of an external
vocabulary).
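The chain of read-only tables described above can be sketched with a table that defers to a parent when a lookup misses. This is a simplified illustration and not the code in the com.sun.xml.fastinfoset.util package: the class name, the NOT_PRESENT sentinel, and the choice to continue index numbering from the parent (starting at 1) are assumptions of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

class ChainedTableSketch {
    static final int NOT_PRESENT = -1;

    // Table consulted when a lookup misses; null for the outermost table.
    // It is treated as read only once a child has been created on top of it.
    private final ChainedTableSketch readOnlyParent;
    private final Map<String, Integer> indexes = new HashMap<>();
    private final int firstIndex;

    ChainedTableSketch(ChainedTableSketch readOnlyParent) {
        this.readOnlyParent = readOnlyParent;
        // Assumption: numbering continues where the parent's numbering stopped.
        this.firstIndex = (readOnlyParent == null) ? 1 : readOnlyParent.nextIndex();
    }

    int nextIndex() {
        return firstIndex + indexes.size();
    }

    // Adds a new entry to this table only; callers should lookup() first.
    int add(String s) {
        int index = nextIndex();
        indexes.put(s, index);
        return index;
    }

    // Looks in this table, then defers to the read-only parent chain.
    int lookup(String s) {
        Integer index = indexes.get(s);
        if (index != null) {
            return index;
        }
        return (readOnlyParent == null) ? NOT_PRESENT : readOnlyParent.lookup(s);
    }
}
```

A per-document table can then be created cheaply on top of a shared initial vocabulary table, which in turn sits on top of a shared external vocabulary table, without copying any entries.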
Although the current implementation may not be the most efficient possible, it provides a good degree of flexibility for experimenting with vocabulary features for serialization and parsing while at the same time providing adequate performance.
The Fast Infoset API is currently a “work in progress” and aims to provide a degree of separation from the implementation so that developers can use a stable interface while enabling the implementations to develop and change.
An API is necessary for developers wishing to use the advanced features of the Fast Infoset technology. If those features are not required, then a parser or serializer can be instantiated and operated via one of the aforementioned standard XML APIs. However, in some cases a parser may report an error if a fast infoset document contains encoded information that requires the use of advanced features (e.g., application-defined encoding algorithms or external vocabularies). However, it is possible for a parser to ignore data encoded using these advanced features by using generic property-setting facilities supported by most XML APIs.
As of this writing, there is a first attempt at defining a vocabulary API, but this has yet to be implemented, and it is likely to require some time and careful thought to create a good balance of functionality and flexibility while allowing implementations the freedom to provide their own optimizations. The most mature aspect of the API is the encoding algorithm support in SAX.
Encoding algorithms are supported as extensions to the SAX 2 API. For
chunks of character information items, support is implemented in a similar
manner to the
org.xml.sax.ext.LexicalHandler. For [normalized value]
properties, support is implemented by extending the
org.xml.sax.Attributes interface.
The registration of a
PrimitiveTypeContentHandler
allows the parser to report chunks of characters encoded as built-in
encoding algorithm data in the form of arrays of primitive types instead of
characters. For example, if a SAX parser decodes data for the
float encoding algorithm, then such data can be decoded to
an instance of float[] and reported using the
PrimitiveTypeContentHandler.floats() method.
Registration of an
EncodingAlgorithmContentHandler allows the parser to
report chunks of characters encoded as octets for any encoding algorithm
(built-in or application-defined) by using the
EncodingAlgorithmContentHandler.octets() method. In addition, this
interface can be used to report built-in and application-defined algorithms as
Java objects. Implementations of application-defined encoding algorithms (cf.
EncodingAlgorithm) can be registered with an associated
URI so that the encoding algorithm data can be decoded to a specific Java
object. Note that the registration aspect of the API is not SAX specific and
may be used with all parsers and serializers.
SAX parsers report [normalized value] properties encoded as encoding
algorithm data using the
EncodingAlgorithmAttributes
interface. The SAX Attributes interface may be cast to
this interface to check if encoding algorithm data is present.
New handlers can be registered using the
XMLReader.setProperty() method or specific methods on the
FastInfosetReader
interface. This interface also documents the relationship between the
handlers and how encoding algorithm data is reported based on what is
registered.
The SAX API is primarily used for parsing, but it can also be used for
serializing (in pipelining or filtering scenarios it is often the case that SAX
or SAX-like events are connected in terms of consuming and producing filters).
The Fast Infoset project provides a SAX serializer implementation,
SAXDocumentSerializer, and a SAX parser implementation,
SAXDocumentParser. Once a SAX serializer or parser is instantiated, the
standard SAX 2 API may be used.
The following Java code serializes a simple infoset (represented as SAX
events) to a fast infoset document (represented as the bytes of a
ByteArrayOutputStream):
SAXDocumentSerializer s = new SAXDocumentSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
s.setOutputStream(baos);
Attributes attributes = new AttributesImpl();
s.startDocument();
s.startPrefixMapping("ns1", "http://namespace/ns1");
s.startElement("http://namespace/ns1", "root", "ns1:root", attributes);
s.startPrefixMapping("ns2", "http://namespace/ns2");
s.startElement("http://namespace/ns2", "element", "ns2:element", attributes);
String c = "content";
s.characters(c.toCharArray(), 0, c.length());
s.endElement("http://namespace/ns2", "element", "ns2:element");
s.endPrefixMapping("ns2");
s.endElement("http://namespace/ns1", "root", "ns1:root");
s.endPrefixMapping("ns1");
s.endDocument();
The following Java code will output an XML document (using
System.out) from the fast infoset document by using the JAXP
transformation API to parse the fast infoset document and produce an XML
document:
SAXDocumentParser s = new SAXDocumentParser();
ByteArrayInputStream bais = new ByteArrayInputStream(baos.toByteArray());
SAXSource ss = new SAXSource(s, new InputSource(bais));
StreamResult sr = new StreamResult(System.out);
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer();
t.transform(ss, sr);
The following is output to System.out (the XML has been formatted for readability):
<ns1:root xmlns:ns1="http://namespace/ns1">
<ns2:element xmlns:ns2="http://namespace/ns2">content</ns2:element>
</ns1:root>
This simple example shows how easy it is to create and process fast infoset documents.
The W3C XML Binary Characterization WG (XBC WG) provided an analysis to establish the need for a binary XML format by collecting use cases and gathering a list of properties induced by those use cases XBC Use Cases XBC Properties. In addition, the XBC WG proposed a detailed set of measurements for certain properties XBC Measurement Methodologies as well as a final document that characterizes a desirable binary XML format XBC Characterization. In the rest of this section, we relate features in the Fast Infoset format with some of the properties listed in Section 6 of the XBC Characterization document (several of the properties not mentioned in this Section are discussed elsewhere in the paper).
Section 6 of the XBC Characterization document shows a table of properties that MUST be supported and MUST NOT be prevented. The list of properties that must be supported includes a subset of the so-called W3C required properties which are:
Formats that do not support this basic set of properties will not conform to the architectural principles and best practices laid down by the WWW Architecture document published by W3C.
The Fast Infoset format is neutral with regard to transport, platform and human language. It is transport and platform neutral because it is based on ASN.1 technology, and it does not make any assumptions about the transport beyond those already listed in the definition of the Transport Independence property. In addition, the Fast Infoset specification (ITU-T X.891 | ISO/IEC 24824-1) is open and free of encumbrance, so the Royalty Free property is also supported.
The other two properties, Content Type Management and Integratable into
XML Stack, refer to the ease of integration of a new format into the Web and
the existing XML stack, respectively. Both of these properties are supported by
Fast Infoset. First, Fast Infoset applications can use the IESG-approved MIME
type application/fastinfoset; the decision to use a new
MIME type as opposed to an encoding for an existing one (such as
text/xml) was based on the desire to use this facility in
cases where either a subset of HTTP that does not include support for
encodings was used (as is common in some Web services toolkits) or a
protocol other than HTTP was used. Second, no other XML specification is
required in order to use Fast Infoset encoding even though, in some cases,
extensions to certain specifications or APIs are desirable to take full
advantage of the features offered by the format. Examples of these are
extensions to APIs such as SAX and DOM to handle binary data (e.g., to avoid
base64 encoding) or changes to certain specifications to add support for a
canonical version of Fast Infoset. (The open source version of Fast Infoset
technology at Java.net, Fast Infoset @ Java.net, includes support for all
these advanced features and proposes extensions to standard APIs without
making their use mandatory.)
Perhaps the two most important properties listed in Section 5 of the XBC Characterization document are Processing Efficiency and Compactness. The former is listed as a MUST NOT prevent due to the difficulty of establishing an accurate and comparable measure for it. Any alternate format for XML that does not address these two properties simultaneously is doomed to failure. This is the reason compression formats such as Gzipped-XML are not suitable binary XML formats: they tend to deliver good compactness but only at the expense of poorer performance. This is a consequence of the fact that these formats are not Directly Readable and Writable, so compression must be preceded by serialization, and decompression must be followed by parsing, which is why processing speed can only be worse for such formats. We believe that, even without using any of its advanced features, Fast Infoset strikes a good balance between Processing Efficiency and Compactness while at the same time preserving other key properties that made XML successful, such as support for Schema Extensions and Deviations and Self-Contained documents, two properties that are essential for building systems that are loosely coupled.
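The extra processing steps that Gzipped-XML imposes can be made concrete with a short sketch that uses only the JDK. The class and method names are hypothetical. The producer has to serialize to XML text before it can compress, and the consumer has to decompress before it can parse, so neither side can read or write the compressed form directly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class GzipXmlPipelineSketch {
    // Producer: the infoset is first serialized to XML text, then compressed.
    static byte[] serializeThenCompress(String xml) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                gzip.write(xml.getBytes(StandardCharsets.UTF_8));
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Consumer: the octets are first decompressed, then parsed as XML.
    // Returns the number of elements seen, as a stand-in for real processing.
    static int decompressThenParse(byte[] gzipped) {
        final int[] elements = {0};
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(in, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts) {
                    elements[0]++;
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return elements[0];
    }

    public static void main(String[] args) {
        String xml = "<ns1:root xmlns:ns1=\"http://namespace/ns1\">"
                + "<ns2:element xmlns:ns2=\"http://namespace/ns2\">content</ns2:element>"
                + "</ns1:root>";
        byte[] gzipped = serializeThenCompress(xml);
        System.out.println("elements parsed: " + decompressThenParse(gzipped));
    }
}
```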
Processing Efficiency and Compactness can be simultaneously improved even further by using restricted alphabets and encoding algorithms. Restricted alphabets can be employed to pack multiple characters into one byte, while encoding algorithms are designed to transfer data type values in binary format. An example of the latter is the base64 encoding algorithm, which can be used to send large chunks of binary data without the need to encode them in base64 (c.f., ). The use of encoding algorithms typically improves serialization, parsing and data binding, three key aspects that are listed as part of the Processing Efficiency property's description.
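As a rough illustration of how a restricted alphabet packs several characters into one byte, the sketch below maps a 15-character numeric alphabet to 4-bit indexes and stores two characters per octet. The alphabet string, the use of the value 15 as padding and the class name are assumptions of this sketch; the exact bit layout defined by the specification may differ.

```java
final class RestrictedAlphabetSketch {
    // Assumed numeric alphabet: 15 characters, so each index fits in 4 bits.
    static final String NUMERIC = "0123456789-+.e ";
    // Index 15 is unused by the alphabet and serves here as padding.
    static final int PAD = 0xF;

    // Packs two characters per octet: even positions in the high nibble,
    // odd positions in the low nibble.
    static byte[] pack(String text) {
        byte[] out = new byte[(text.length() + 1) / 2];
        for (int i = 0; i < out.length * 2; i++) {
            int nibble = PAD;
            if (i < text.length()) {
                nibble = NUMERIC.indexOf(text.charAt(i));
                if (nibble < 0) {
                    throw new IllegalArgumentException("Not in alphabet: " + text.charAt(i));
                }
            }
            out[i / 2] |= (i % 2 == 0) ? (nibble << 4) : nibble;
        }
        return out;
    }

    // Reverses pack(), skipping the padding nibble if present.
    static String unpack(byte[] packed) {
        StringBuilder sb = new StringBuilder();
        for (byte b : packed) {
            int high = (b >> 4) & 0xF;
            int low = b & 0xF;
            if (high != PAD) {
                sb.append(NUMERIC.charAt(high));
            }
            if (low != PAD) {
                sb.append(NUMERIC.charAt(low));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = "3.14159e+2";
        byte[] packed = pack(value);
        System.out.println(value.length() + " characters packed into " + packed.length + " octets");
        System.out.println("unpacked: " + unpack(packed));
    }
}
```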
There are many formats that are conceptually similar to Fast Infoset; some are proprietary, and some are open. To the best of the authors' knowledge none of the other formats combines the following:
Most of the formats rely on the concept of vocabularies (or dictionaries) of character strings and the indexing (tokenization) of such strings to achieve moderate compression and improvement in parsing.
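The vocabulary idea can be sketched in a few lines. In this hypothetical example, which does not reflect any particular format, a string is sent literally the first time it occurs and is added to a table; later occurrences are replaced by the table index. The "L:" and "I:" token prefixes and the zero-based indexes are conventions of the sketch only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class VocabularySketch {
    // Encoder: first occurrence is a literal, repeats become index references.
    static List<String> encode(List<String> strings) {
        Map<String, Integer> table = new HashMap<>();
        List<String> tokens = new ArrayList<>();
        for (String s : strings) {
            Integer index = table.get(s);
            if (index == null) {
                table.put(s, table.size());
                tokens.add("L:" + s);
            } else {
                tokens.add("I:" + index);
            }
        }
        return tokens;
    }

    // Decoder: rebuilds the same table while reading, so no table is transmitted.
    static List<String> decode(List<String> tokens) {
        List<String> table = new ArrayList<>();
        List<String> strings = new ArrayList<>();
        for (String token : tokens) {
            String payload = token.substring(2);
            if (token.startsWith("L:")) {
                table.add(payload);
                strings.add(payload);
            } else {
                strings.add(table.get(Integer.parseInt(payload)));
            }
        }
        return strings;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("item", "price", "item", "item", "price");
        List<String> tokens = encode(names);
        System.out.println(tokens);
        System.out.println(decode(tokens));
    }
}
```

Because the decoder grows its table in the same order as the encoder, repeated element and attribute names cost only a small index after their first appearance, which accounts for both the size reduction and the faster parsing.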
Microsoft's Indigo platform will ship with a proprietary binary XML format. An Indigo binding may be customized to use different message encoding formats such as XML, MTOM and “binary XML,” as presented in the Programming Indigo session at the Indigo Day conference. Details are sketchy, but the format is a binary encoding of the XML Information Set and probably not unlike Fast Infoset. It would be interesting to investigate the development of a Fast Infoset message encoding formatter for Indigo.
The WAP Binary XML Content Format (WBXML) specification, contrary to some reports, is not a W3C Recommendation but a W3C Note. The latest specification, dated July 2001, may be found at the OMA. WBXML does not support XML namespaces. Although WBXML is limited in support and features, the nature of the problem it attempted to solve, that is, to reduce the size of documents transmitted over wireless networks, is still a concern for mobile operators. In addition, WBXML has attempted to address the processing constraints of limited processing power and/or limited battery life.
The XBIS format developed by Dennis Sosnoski using the open source XBIS implementation has excellent performance characteristics and support for XML namespaces. However, it is not standardized and is not as feature-rich as the Fast Infoset format.
The open source Nux toolkit provides XQuery support for XOM. In addition, a binary format called bnux is provided. The bnux format utilizes the properties of the XOM object model (similar to DOM) to generate optimal indexes of repeating character strings that are encoded before the structure and information of the document is encoded. Bnux requires that the complete document be present to perform serialization, and it cannot be used in streaming scenarios. The characteristics of bnux can also be supported using features of the Fast Infoset encoding, and it would be interesting to explore the integration of Fast Infoset into Nux in the same manner as bnux was implemented.
Extensible Schema Binary Compression (XSBC) has been developed as a general
approach to binary serialization of XML documents and for investigating the
binary encoding of X3D documents. Conceptually XSBC is very similar to Fast
Infoset when using certain advanced features such as external vocabularies and
encoding algorithms. Fast Infoset is now being used to specify the base
encoding of binary X3D documents (see ). XSBC
development continues to serve as a platform to experiment with
binary serializations, while the use of Fast Infoset serves to define an
interoperable and stable format for specifying binary X3D documents.
There are many other formats that people have proposed that fall into the category of "binary XML". It is simply impossible for us to cover them all: one or more papers would be required for that! Most of these formats can be classified based on the amount of external information they rely on. External information can come in various flavors, but in most cases it is present in the form of a schema. As a general rule, the more schema information, the more compact the representation but the tighter the degree of coupling between producers and consumers. A good source to learn more about these formats is the set of position papers accepted to the W3C Workshop.
There is some confusion over the distinction between Fast Infoset encoding and MTOM/XOP that requires some explanation. MTOM (SOAP Message Transmission Optimization Mechanism) is a W3C Recommendation that depends on another W3C Recommendation, XOP (XML-binary Optimized Packaging), for the packaging of binary data within SOAP messages in a way that avoids base64 encoding. XOP solves the specific case of embedding binary data in XML infosets, and MTOM solves the specific case for SOAP message infosets. XOP only supports the embedding of binary data for chunks of character information items.
The Fast Infoset specification describes a binary encoding of the XML Information Set and allows for the direct embedding of binary data for chunks of character information items or for the [normalized value] property of attribute information items. In a nutshell, Fast Infoset can do what MTOM/XOP can do and a lot more.
Conceptually, these two technologies are similar enough with respect to the management of binary data. In particular, Web services toolkits will require similar modifications to either binding frameworks or parsers to ensure that the binary data is never base64 encoded or decoded.
The Web3D consortium Web3D provides a forum for the creation of open standards for real-time 3D communication and is responsible for the Extensible 3D (X3D) Graphics standards. X3D X3D is an Open Standards XML-based 3D file format designed to enable real-time communication of 3D data across network applications.
The XBC WG Use Cases document XBC Use Cases has a section on X3D graphics justifying the case for a binary encoding of X3D XML documents. The properties listed for this use case match those that are supported by Fast Infoset. The Web3D consortium has agreed to adopt Fast Infoset as the base encoding for binary X3D documents, which is being standardized in ISO (Part 3 of ISO/IEC 19776, Extensible 3D (X3D) encodings). The X3D binary encoding makes use of the built-in encoding algorithm feature of Fast Infoset and specifies application-defined encoding algorithms for the efficient encoding of floating point numbers and coordinate data relevant to the description of 3D information.
The people responsible for the standardization work of Fast Infoset and the binary encoding for X3D worked to ensure that the Fast Infoset specification would meet all the requirements for use with X3D in a generic and extensible manner. This included work on the Fast Infoset specification and a prototype to ensure that both compression and processing performance requirements were met. This has proved to be a fruitful working relationship benefitting both sides. Note that Fast Infoset supports the X3D requirements in a generic manner that makes it feasible to apply it to the binary encoding of SVG documents, which is an interesting area for future investigation.
A member of the X3D group is contributing to the development of the Fast Infoset open source project and it is planned to integrate the implementation into the Java based X3D toolkit and X3D browser. Preliminary results show that binary X3D documents are 10% to 30% of the size of XML X3D documents (i.e., 90% to 70% smaller) and can be parsed 5 to 7 times faster than XML X3D documents. When lossy compression is applied for the efficient encoding of 3D information, binary X3D documents can be 5% of the size of XML X3D documents (i.e., 95% smaller) with no visible loss of detail.
Our group at Sun Microsystems has been investigating the use of binary encodings for Web services for about 2 years Fast Web Services. We have researched (and are still researching) encodings with various characteristics in order to find the ideal match to satisfy the properties of a Web service application, and more generally, a Service Oriented Architecture (SOA).
The W3C XBC working group identified two use cases for which a binary encoding is desirable in order to speed up Web services (cf. XBC Use Cases). These two use cases are: (i) Web Services within the Enterprise and (ii) Web Services for Small Devices. The former is mainly concerned with increasing system throughput to enable the use of Web services for intranet applications that tend to be less loosely coupled and confined to a single security domain. The latter is mainly concerned with the compactness of the encoding to enable the exchange of Web services messages over low-bandwidth, high-latency networks. Given that the Fast Infoset encoding is faster to process and is also more compact, it is applicable to both of these use cases.
The Java Web Services Developer Pack JWSDP version 1.6,
to be released in June of 2005, now supports the Fast Infoset encoding as part
of JAX-RPC 1.1.3. For ease of deployment, this new version of JAX-RPC also
supports a form of HTTP content negotiation that can be used to "turn on" Fast
Infoset during message exchanges. Content negotiation is completely
driven by the client and uses the standard HTTP headers
Accept and Content-Type. The initial
request is always encoded in XML, but the client has the option of including
the MIME type application/fastinfoset as part of the
Accept list. If the request is received by a Fast
Infoset-enabled service, the reply will be encoded in Fast Infoset, and so will
be the remainder of the conversation between the client and the service as long
as the client continues to use the same artifact (e.g., the same stub instance)
to converse with the server. Please refer to the JWSDP version 1.6
documentation for more details.
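The negotiation described above can be sketched as a small state machine. This is not JAX-RPC code: the class names and methods are hypothetical, and the sketch assumes that the client advertises application/fastinfoset in the Accept header of an XML-encoded request and that the service answers in Fast Infoset only if it supports it.

```java
final class ContentNegotiationSketch {
    static final String FAST_INFOSET = "application/fastinfoset";
    static final String XML = "text/xml";

    // Service side: pick the reply media type from the request's Accept header.
    static String chooseReplyType(String acceptHeader, boolean serviceSupportsFastInfoset) {
        if (acceptHeader == null || !serviceSupportsFastInfoset) {
            return XML;
        }
        for (String value : acceptHeader.split(",")) {
            // Drop any parameters such as ";q=0.8" before comparing.
            int semicolon = value.indexOf(';');
            String type = (semicolon < 0 ? value : value.substring(0, semicolon)).trim();
            if (FAST_INFOSET.equalsIgnoreCase(type)) {
                return FAST_INFOSET;
            }
        }
        return XML;
    }

    // Client side: one instance stands for one stub; the first request is XML,
    // and the stub switches once a Fast Infoset reply has been seen.
    static final class Stub {
        private String requestType = XML;

        String nextRequestContentType() {
            return requestType;
        }

        String acceptHeader() {
            return XML + ", " + FAST_INFOSET;
        }

        void onReply(String replyContentType) {
            if (FAST_INFOSET.equalsIgnoreCase(replyContentType)) {
                requestType = FAST_INFOSET;
            }
        }
    }

    public static void main(String[] args) {
        Stub stub = new Stub();
        System.out.println("first request:  " + stub.nextRequestContentType());
        stub.onReply(chooseReplyType(stub.acceptHeader(), true));
        System.out.println("second request: " + stub.nextRequestContentType());
    }
}
```

Keeping the state in the stub matches the description that the conversation stays in Fast Infoset only while the client continues to use the same artifact; a new stub would start again with an XML request.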
As the saying goes, when it comes to performance, "your mileage will vary." Fast Infoset is designed to optimize parsing and serialization, so the key to understanding the potential gains associated with using this technology is understanding the percentage of time your application spends in these two tasks. The greater the percentage, the greater the improvement will be.
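This relationship can be estimated with a simple back-of-the-envelope calculation in the style of Amdahl's law. The sketch below is not a measurement: the method name and the sample figures are hypothetical. If a fraction p of the run time is spent in parsing and serialization, and that part becomes s times faster, the overall speedup is 1 / ((1 - p) + p / s).

```java
final class SpeedupEstimateSketch {
    // fraction: share of total time spent parsing and serializing (0.0 to 1.0)
    // factor:   how many times faster that share becomes
    static double overallSpeedup(double fraction, double factor) {
        if (fraction < 0.0 || fraction > 1.0 || factor <= 0.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1] and factor must be positive");
        }
        return 1.0 / ((1.0 - fraction) + fraction / factor);
    }

    public static void main(String[] args) {
        // Hypothetical figures: parsing and serialization become 5 times faster.
        for (double fraction : new double[] {0.1, 0.5, 0.9}) {
            System.out.printf("fraction %.1f -> overall speedup %.2f%n",
                    fraction, overallSpeedup(fraction, 5.0));
        }
    }
}
```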
As part of the source code available from the Fast Infoset @ Java.net project Fast Infoset @ Java.net, there is a tool called Japex that we have used to write a number of different micro-benchmarks for our Fast Infoset implementation. All these performance reports are available from the project's Web page. To summarize, parsing micro-benchmarks show an average improvement of 3 to 10 times depending on the XML parser in question, and JAX-RPC micro-benchmarks show an improvement of 2 to 5 times depending on the structure of the messages exchanged, with the higher improvements achieved when base64 encoding is avoided.
This paper has introduced Fast Infoset and has explained some of its features that result in fast infoset documents being both smaller in size and faster to parse than equivalent XML documents. The Fast Infoset specification (ITU-T Rec. X.891 | ISO/IEC 24824-1) is being standardized jointly in ISO and ITU-T and is progressing as planned, with pre-publication as a Recommendation expected in approximately mid-June 2005 and full publication as a Recommendation | International Standard some time after the final ISO ballot completes in September 2005.
The Fast Infoset open source implementation @ Java.Net was introduced and many of the Fast Infoset features that are implemented were explained. The open source implementation sends out an important message: Fast Infoset is royalty free and others are free to implement Fast Infoset or contribute to Fast Infoset @ Java.Net on agreeable terms.
Two disparate use cases of Fast Infoset were presented: Web services and 3D graphics. In both these cases the Fast Infoset implementation @ Java.Net has been used and has been shown to offer measurable benefits, in addition to proving that the implementation is maturing in terms of features supported and robustness. The adoption of the Fast Infoset specification by the Web3D consortium for standardization of the binary encoding of X3D XML documents shows that Fast Infoset is meeting the needs of quite an interesting and demanding community with clear marketing requirements.
The authors would like to thank Todd Freter from Sun Microsystems for his editorial work on this paper.
Paul Sandoz
Staff Engineer, Sun Microsystems
Santiago Pericas-Geertsen
Staff Engineer, Sun Microsystems