XTech 2005: XML, the Web and beyond.

ROME: ESCAPE syndication hell, a java developer perspective


Abstract

ROME (Rss and atOM utilitiEs) is an open source project aimed at developing a set of Atom/RSS Java utilities that make it easy to work in Java with most syndication formats. ROME and its various subprojects will be presented.

What is ROME

Rss and atOM utilitiEs (ROME) in Java is an open source project aimed at developing a set of Atom/RSS Java utilities that make it easy to work in Java with most syndication formats. The project was started in April 2004 as an internal Sun project by Alejandro Abdelnur, Elaine Chien and Patrick Chanezon. The various flavors of the RSS and Atom syndication formats were reaching a tipping point in terms of adoption at the time, and Sun's newly appointed president Jonathan Schwartz announced that Sun was adopting RSS, so we expected many Sun products to start parsing and generating RSS. We decided to create a library that would be common to all Sun products. A few months after releasing the first internal versions we had a few internal customers, and we thought it would make sense to release it as open source, in order to draw on the syndication and Java experts outside the company. We released ROME on java.net under the Apache 2 license in June 2004.

Initially we did not want to reinvent the wheel, so why did we start our own library? When we looked around for Java libraries to take care of parsing and generating RSS, we were not satisfied with what we found: there was no Java equivalent of Mark Pilgrim's excellent Universal Feed Parser for Python. Informa was the main contender, but it was incomplete, designed like a C library, and its development focus was on the persistence layer and JSP taglibs. At that time we missed Kevin Burton's FeedParser, hidden in jakarta-commons, which had an interesting SAX-based design.

Project ROME was started out of this frustration. Our requirements were to ESCAPE from Syndication Feeds Hell. In order to allow that the library had to be:

We set out to create this library in the same spirit as the JDOM library for XML manipulation in Java, incorporating the pearls of wisdom on API design and refactoring from Elliotte Rusty Harold, the author of XOM (see Air Bags and Other Design Principles, which links to his 6 interviews with Bill Venners). The ROME implementation itself uses JDOM, though.

What's in a name? The codename for the library was initially Rome, for the fun of all the pig latin we could use, backed by our slogan: All Feeds lead to Rome. However, a problem with Sun's lawyers obliged us to change it to an acronym: ROME, Rss and atOM utilitiEs. We used the term Java utilities in the initial definition, though what we shipped initially, ROME core, included only the parsers, generators, io and plugin infrastructure. From the beginning we expected ROME to be an umbrella project for other syndication utilities, handling all the various aspects of syndication: fetching, persistence, modules for the various extensions to the syndication formats, handling of syndication-related formats (OPML, FOAF), XML healing, and so on. This turned out to be a smart idea, since a community of developers formed around ROME, creating new subprojects for these various aspects.

The ROME library

The ROME Core project itself includes a set of parsers and generators for the various flavors of syndication feeds, as well as converters to convert from one format to another. The parsers can give you back Java objects that are either specific to the format you want to work with, or a generic normalized SyndFeed class that lets you work with the data without bothering about the incoming or outgoing feed type. Today ROME handles all flavors of RSS (0.90, 0.91 Netscape, 0.91 Userland, 0.92, 0.93, 0.94, 1.0 and 2.0) and Atom 0.3 feeds. ROME 0.5, released in January 2005, was the first version marked as beta. ROME 0.6 was released in April 2005.

When we started ROME we identified 9 flavors of feed formats (estimates of the total number vary between authors :-). These formats are specified with various degrees of rigor, from Dave Winer's pretty loose specifications to the more formal Atom documents from the IETF. But for what most applications use these feeds for, the semantics of what is represented are pretty much the same. So instead of creating a union of all the data models, where you have to fish for what the feed holds for you, we decided to create an intersection of all of them. We call the Java interface to this pivot format a SyndFeed.

This approach has a number of advantages:

It has one main drawback though: the conversion is lossy, so if you work at the SyndFeed level, you lose some level of detail. If you need all the details, you need to work at the WireFeed level.

ROME Core Architecture describes how ROME works. The horizontal axis separates the objects used for parsing and generation from the resulting JavaBeans. The vertical axis separates the abstract Synd (Feed, Entry) level from the concrete, format-specific Wire level. There is one parser and one generator per feed type, and one Module parser and generator per feed type. These parsers and generators are used by the WireFeedInput and WireFeedOutput classes to create and consume WireFeed beans. Depending on the format, a format-specific subclass of WireFeed will be generated. If you work at the Synd level, SyndFeedInput and SyndFeedOutput use their Wire counterparts to create and consume SyndFeeds. Converters provide the glue between the various formats and the SyndFeed pivot format.

The ROME Core Architecture
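
The relationship between wire beans, converters and the pivot format can be illustrated with a deliberately tiny sketch. None of the types below are ROME classes: PivotFeed, RssChannel, AtomFeed and Converter are hypothetical stand-ins, reduced to two or three fields, that only show why a round trip through an intersection model is lossy.

```java
final class PivotSketch {

    /** Hypothetical pivot bean: only what every format can express. */
    static final class PivotFeed {
        String title;
        String link;
    }

    /** Hypothetical RSS-flavored wire bean with one format-specific field. */
    static final class RssChannel {
        String title;
        String link;
        int ttlMinutes;   // has no slot in the pivot
    }

    /** Hypothetical Atom-flavored wire bean. */
    static final class AtomFeed {
        String title;
        String alternateLink;
    }

    /** One converter per wire format, each mapping to and from the pivot. */
    interface Converter<W> {
        PivotFeed toPivot(W wire);
        W fromPivot(PivotFeed pivot);
    }

    static final Converter<RssChannel> RSS = new Converter<RssChannel>() {
        public PivotFeed toPivot(RssChannel c) {
            PivotFeed p = new PivotFeed();
            p.title = c.title;
            p.link = c.link;
            return p;     // ttlMinutes is dropped here
        }
        public RssChannel fromPivot(PivotFeed p) {
            RssChannel c = new RssChannel();
            c.title = p.title;
            c.link = p.link;
            return c;
        }
    };

    static final Converter<AtomFeed> ATOM = new Converter<AtomFeed>() {
        public PivotFeed toPivot(AtomFeed f) {
            PivotFeed p = new PivotFeed();
            p.title = f.title;
            p.link = f.alternateLink;
            return p;
        }
        public AtomFeed fromPivot(PivotFeed p) {
            AtomFeed f = new AtomFeed();
            f.title = p.title;
            f.alternateLink = p.link;
            return f;
        }
    };
}
```

Converting between two formats is then a matter of chaining toPivot of one converter with fromPivot of the other, which is why adding a format needs one new converter rather than one per existing format.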

ROME is designed to be pluggable: the parsers, generators and converters that ship out of the box are specified and configured in the rome.properties file inside the ROME jar. ROME looks for /rome.properties in all classpath entries and aggregates them, which makes it easy to replace or reconfigure all these pieces. Some of our users have done that to temporarily fix a bug in ROME Core until we shipped the next release. Others use it to add their own modules to the feeds they generate.
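
The classpath aggregation described above can be sketched with plain JDK calls. This is not ROME's own loader: the class name PluginConfigLoader and the merge policy (joining values that share a key with a space) are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Properties;

final class PluginConfigLoader {

    /** Loads every resource with the given name that the class loader can see. */
    static List<Properties> loadAll(ClassLoader loader, String resourceName) throws IOException {
        List<Properties> found = new ArrayList<>();
        Enumeration<URL> urls = loader.getResources(resourceName);
        while (urls.hasMoreElements()) {
            try (InputStream in = urls.nextElement().openStream()) {
                Properties p = new Properties();
                p.load(in);
                found.add(p);
            }
        }
        return found;
    }

    /** Merges files in order; values sharing a key are joined with a space (assumed policy). */
    static Properties merge(List<Properties> files) {
        Properties merged = new Properties();
        for (Properties file : files) {
            for (String key : file.stringPropertyNames()) {
                String previous = merged.getProperty(key);
                String value = file.getProperty(key).trim();
                merged.setProperty(key, previous == null ? value : previous + " " + value);
            }
        }
        return merged;
    }
}
```

A caller would combine the two steps as PluginConfigLoader.merge(PluginConfigLoader.loadAll(loader, "rome.properties")); note that ClassLoader.getResources takes the resource name without a leading slash.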

Sample Code

Converting any feed to RSS 1.0
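    import java.io.Reader;
    import java.io.Writer;
    import com.sun.syndication.feed.synd.SyndFeed;
    import com.sun.syndication.io.SyndFeedInput;
    import com.sun.syndication.io.SyndFeedOutput;

    Reader reader = ...   // source feed, in any supported format
    Writer writer = ...   // destination for the RSS 1.0 output

    // Parse the incoming feed into the format-neutral SyndFeed bean.
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(reader);

    // Switch the target format, then serialize.
    feed.setFeedType("rss_1.0");

    SyndFeedOutput output = new SyndFeedOutput();
    output.output(feed, writer);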

ROME Core XML Goodies

ROME is XML-strict and Feed-lenient. It is XML-strict because ROME parsers rely on JDOM to do the XML parsing, so if the XML is malformed, the feed won't be parsed. But when it comes to feeds, ROME follows Postel's law: be liberal in what you accept, and conservative in what you send. It is feed-lenient because ROME relaxes constraints on feed formats when parsing a feed, but enforces them when generating one. We hope this behavior will help in the production of cleaner, specification-compliant feeds as ROME is used to generate more and more feeds.

Because of its XML-strictness, and because feeds that are malformed XML are a common reality, ROME core contains a few goodies related to XML processing. XmlReader deals with charset encoding inconsistencies. This is a very common problem for XML applications and we plan to integrate this code in JAXP. Alejandro described the algorithm we use in the ROME wiki.
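
To give a feel for the kind of logic involved, here is a minimal sketch of encoding detection from the first bytes of a document. It is not the XmlReader algorithm documented on the wiki: it only looks at the byte order mark and the XML declaration, ignores the HTTP Content-Type header and UTF-32, and the class name EncodingSniffer is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EncodingSniffer {

    // Matches encoding="..." inside an XML declaration at the very start of the text.
    private static final Pattern DECLARED =
        Pattern.compile("^<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    /** Guesses the encoding from the first bytes of an XML document. */
    static String detect(byte[] head) {
        // 1. A byte order mark, if present, wins.
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        // 2. Otherwise read the prolog byte-for-byte and look for a declared encoding.
        String prolog = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED.matcher(prolog);
        if (m.find()) return m.group(1).toUpperCase(Locale.ROOT);
        // 3. XML 1.0 defaults to UTF-8 when nothing is declared.
        return "UTF-8";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The real difficulty, and the reason a dedicated class is worthwhile, is reconciling this guess with what the transport layer claims when the two disagree.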

XmlHealer solves a few common issues that cause feeds to be malformed XML: it trims XML streams and resolves HTML entities. We thought about implementing a HealingParser that would solve more problems, like closing tags, but we prefer to wait for more evidence that this is a widespread problem before tackling that project.
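
The two healing steps can be sketched on plain strings as follows. This is an illustration under assumptions, not the XmlHealer implementation: the class name FeedTextHealer is hypothetical, the entity table is a four-entry sample, and the sketch does not special-case CDATA sections.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FeedTextHealer {

    private static final Pattern NAMED_ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");
    private static final Map<String, Integer> HTML_ENTITIES = new HashMap<>();
    static {
        // Tiny illustrative subset of the HTML 4 entity table.
        HTML_ENTITIES.put("nbsp", 160);
        HTML_ENTITIES.put("copy", 169);
        HTML_ENTITIES.put("eacute", 233);
        HTML_ENTITIES.put("mdash", 8212);
    }

    /** Drops anything before the first '<' and after the last '>'. */
    static String trim(String xml) {
        int start = xml.indexOf('<');
        int end = xml.lastIndexOf('>');
        return (start < 0 || end < start) ? xml : xml.substring(start, end + 1);
    }

    /** Rewrites known HTML entities as numeric references; anything else is left alone. */
    static String resolveHtmlEntities(String xml) {
        Matcher m = NAMED_ENTITY.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = HTML_ENTITIES.get(m.group(1));
            String replacement = (codePoint == null) ? m.group() : "&#" + codePoint + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because the five entities predefined by XML (amp, lt, gt, quot, apos) are not in the table, they pass through unchanged, which is what an XML parser expects.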

The ROME fetcher

The ROME fetcher subproject was created by Nick Lothian. It implements all the nitty-gritty details involved in fetching feeds over HTTP, including HTTP caching with a pluggable cache, gzip compression, and charset encoding detection (see XmlReader, which requires some complex logic involving various RFCs and XML specifications).

The ROME fetcher implements conditional GET (based on ETags) to avoid fetching feeds that have not changed, and has experimental support for Bob Wyman's proposal to use RFC 3229 delta encoding in order to reduce the amount of bandwidth wasted in serving RSS and Atom files. Currently it implements only the feed Instance Method, but we plan to add support for the f-range method as well.
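
The bookkeeping behind an ETag-based conditional GET can be sketched without any network code. The class below is hypothetical and is not part of the ROME fetcher; it only shows which header to send and how to react to a 200 or a 304 response, leaving the actual HTTP call to the caller.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class ConditionalGetCache {

    private static final class CachedFeed {
        final String etag;
        final String body;
        CachedFeed(String etag, String body) { this.etag = etag; this.body = body; }
    }

    private final Map<String, CachedFeed> entries = new HashMap<>();

    /** Headers to add to the next GET for this URL. */
    Map<String, String> requestHeaders(String url) {
        CachedFeed cached = entries.get(url);
        if (cached == null || cached.etag == null) return Collections.emptyMap();
        return Collections.singletonMap("If-None-Match", cached.etag);
    }

    /** Returns the feed body to hand to the parser, updating the cache as needed. */
    String onResponse(String url, int status, String etag, String body) {
        if (status == 304) {
            CachedFeed cached = entries.get(url);
            if (cached == null) throw new IllegalStateException("304 without a cached copy: " + url);
            return cached.body;   // unchanged: reuse the stored copy
        }
        if (status == 200) {
            entries.put(url, new CachedFeed(etag, body));
            return body;
        }
        throw new IllegalStateException("Unexpected HTTP status " + status + " for " + url);
    }
}
```

On a 304 the server sends no body at all, so the saving for a feed that rarely changes is close to the full size of the feed on every poll.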

ROME Modules

We expect the embedding of additional namespaced data in syndication payloads to be an important trend in the next few years, and we have architected ROME to allow for it. ROME has an extensible plugin architecture that allows developers to create module handlers for the various extensions allowed by the syndication formats. Today ROME handles the Dublin Core and Syndication modules from the RSS 1.0 specification. We plan to implement all modules defined in the various specifications.

Joe Regger is a good example of a datablogger who uses ROME to add all sorts of metadata to his feeds, for example triathlon-related metadata.

We see a future where syndication formats become a generic envelope format for time-based queries, embedding whatever namespaced payload could be useful on the receiving end. This could lead to a few interesting trends in syndication software: plugin frameworks for syndication readers (browsers) that let the user display and interact with this data, server-side transformation agents that turn this data into a suitable representation for clients (feedburner), and servers using syndication formats as an envelope for their time-based data (Amazon OpenSearch). Some modules we'd like to implement for ROME in the near future are a UBL module and an Amazon OpenSearch module (they just added 3 elements to RSS 2.0).

ROME plans and usage

Java Syndication Libraries merger

This year we discovered Kevin Burton's FeedParser project: it is a Java syndication library designed around SAX, while ROME is based on DOM. We argued about the respective trade-offs of SAX (harder to develop with) and DOM (higher memory usage, slower) until Nick Lothian benchmarked ROME against FeedParser and showed that for feeds under 500 KB, ROME was faster. Today most feeds are under 500 KB, so our design works well. But if syndication formats really become an envelope for inter-application data exchange, feeds may soon grow bigger. Our discussions led us to think that we could implement event-based ROME parsers relying on SAX events fired by FeedParser. This would be a great integration point between the libraries, and would make sure that we agree on the Java API to access feeds.
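
For readers unfamiliar with the event-based style, the sketch below collects item titles from an RSS 2.0 document using only the JDK's SAX parser. It shows the general technique and the state tracking it requires; it is neither FeedParser's API nor a planned ROME parser, and the class name ItemTitleHandler is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class ItemTitleHandler extends DefaultHandler {

    final List<String> titles = new ArrayList<>();
    private boolean inItem;
    private StringBuilder text;   // non-null only while inside an item's <title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("item")) inItem = true;
        else if (inItem && qName.equals("title")) text = new StringBuilder();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so accumulate.
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title") && text != null) {
            titles.add(text.toString().trim());
            text = null;
        } else if (qName.equals("item")) {
            inItem = false;
        }
    }

    /** Parses an RSS 2.0 style document and returns its item titles in document order. */
    static List<String> titlesOf(String xml) {
        ItemTitleHandler handler = new ItemTitleHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
        return handler.titles;
    }
}
```

The handler never holds more than one title in memory, which is the appeal for very large feeds; the price is the explicit inItem and text state that a tree-based API would make unnecessary.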

In January, I proposed a joint presentation at JavaOne with the FeedParser team and Dave Johnson from the Roller blogging server: Java Syndication Babel: let's paint the picture together!. We have a java-syndication mailing list where we started discussing convergence between our efforts. Dave Johnson has implemented the Atom publishing API in Java, in the context of writing his book Blogs, Wikis, and Feeds In Action. Since the Atom publishing API consists of applying HTTP verbs to Atom entries, the Java types for the API's input and output should be ROME beans. We will present our projects in a common session at JavaOne 2005 and, if things go well, may try to gather them together under a larger blogging and syndication tool umbrella project at Apache. We initially thought about standardizing the interfaces for all this through the JCP, but we felt it made more sense to work together on a set of interoperable implementations in a common open source project. We'll consider the JCP if the need for standardization arises later on.

ROME short term plans

Our most immediate plan is to create a new subproject called ROME modules, to host additional modules such as PRISM, Amazon OpenSearch or UBL.

Mark Woodman and Amin created a subproject called Aqueduct to define a DAO (Data Access Object) interface for ROME persistence. Today they have an implementation based on Prevayler (in memory, with transparent serialization to disk), suitable for small client applications. They are working on Hibernate and Castor implementations, which more scalable server-side applications would require.

Another potential development angle is to include an RDF library in ROME, in order to make it useful for Semantic Web applications, following Henry Story and Danny Ayers's work on building an Atom OWL ontology.

The project has a lot of momentum; see Roman Numbers for a few statistics. In the Java & Web Services community, where ROME belongs on java.net, we're the 4th most popular project, after jaxb, jaxp and jax-rpc but before jwsdp (Java Web Services Developer Pack).

Roman Numbers

In April Mark Woodman started a logo contest, with many submissions. We have a feed for these:-)

ROME usage

The project's Powered By ROME wiki page grows regularly. Some projects and products using ROME include xWiki, SnipSnap, Roller and Sun Portal Server. The most fun recent project using ROME is Public Interactive, an ASP of on-line collaborative tools, community engagement technologies, content syndication services and member and audience relationship management systems for the public broadcasting industry. ROME is used for syndicating news content and podcasts published locally by stations in the Public Interactive network.

All Feeds lead to ROME

We hope that ROME will help Java developers start working with syndication formats while avoiding Syndication hell: the current syndication explosion will be an opportunity to build many innovative and cool new applications, and Java is now well equipped to do so.

All Feeds lead to ROME!

Biography

Patrick Chanezon

Software architect, Sun Microsystems http://www.sun.com

Patrick Chanezon is a software architect in the Portal group at Sun. He has worked remotely from Paris, France since 2001, with teams located in the US and India.

He helped launch blogs.sun.com and co-created the ROME (Rss and atOM utilitiEs) project, an open source library designed to make writing syndication applications in Java simpler.

He ported Sun ONE Portal Server to Sun, IBM and BEA application servers. He is now working on adding weblog and syndication capabilities to all Java Enterprise System products.

His resume is at P@ Resume and his blog is named P@ Log.

Biography

Alejandro Abdelnur

Software Engineer, Sun Microsystems http://www.sun.com

Alejandro Abdelnur is a Sun Java System Portal Server architect, with a focus on the development of portal-related standards and overseeing their implementation in the Sun Java System Portal Server.

He is co-leader of the entire JSR 168 Java Portlet Specification effort, and under his direction, Sun created the specification itself and the test compatibility kit, two of the three components that make up a JSR spec.

Within Sun, Alejandro previously worked as a software engineer in the eCommerce group. Prior to joining Sun, he worked at Sybase Argentina in the areas of presales and consulting.

His blog is named Tucu's Weblog.