XTech 2005: XML, the Web and beyond.

Bridging XHTML, XML and RDF with GRDDL

Abstract

While SGML and XML languages have long been able to describe the syntactic constraints of their vocabularies using DTDs and other schema languages, no specific mechanism exists for mapping these syntactic constraints to their semantic implications.

GRDDL, a technology under development at W3C, makes it possible to incorporate semantics from XML vocabularies and XHTML conventions into the Semantic Web by re-using existing extensibility hooks of the Web. This paper explains the basic principles of its mechanisms, and explores how it can be applied by various communities.

Introduction

Re-using the technologies that share documents on the Web to also share information and data that computers can process directly is an idea as old as the Web itself.

The Semantic Web, built on the Resource Description Framework (RDF), is the point of reference for sharing computer-processable information on the Web. At this time, however, far more information is available in non-Semantic Web formats than in RDF-based ones: (X)HTML documents obviously, but also a significant set of other formats encoding images, documents, spreadsheets, newsfeeds, etc. Not all of this information is formalized enough to fit in the framework set by the Semantic Web, but a lot of it is, and would benefit from being integrated into this Web of data.

This paper describes GRDDL, standing for "Gleaning Resource Descriptions from Dialects of Languages", a technology under development at W3C designed to fill part of this gap by allowing document authors to automatically associate formalized RDF statements with XHTML and XML-based formats.

Bridging semantics across markup languages

Markup languages allow the use of descriptive names to annotate and structure documents, encode data, describe images, etc. These descriptive names and the structure they create add implicit or explicit semantics to documents. For instance,

Semantics in HTML snippet
<html>
<head><title>Example of an HTML document</title>
</head>

asserts more or less formally that this HTML document has a title of "Example of an HTML document". This assertion can be translated into RDF/N3 [N3] as <document.html> dc:title "Example of an HTML document".
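To make this translation concrete, here is a minimal Python sketch that reads the title out of an HTML snippet and prints the corresponding N3 statement. It is only an illustration of the idea, not part of GRDDL itself: the class and function names are hypothetical, and the dc: prefix is assumed to be declared elsewhere.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text content of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is taken into account.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def title_as_n3(html_text, document_uri):
    """Return a one-line N3 statement for the document title, or None."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    if parser.title is None:
        return None
    # Escape backslashes and double quotes for the N3 string literal.
    escaped = parser.title.strip().replace("\\", "\\\\").replace('"', '\\"')
    return '<%s> dc:title "%s" .' % (document_uri, escaped)


snippet = """<html>
<head><title>Example of an HTML document</title>
</head>
"""
print(title_as_n3(snippet, "document.html"))
```

The semantics here are hard-coded in the application, which is precisely the situation the rest of this section argues against.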

While SGML and XML languages have long been able to describe the syntactic constraints of their vocabularies using DTDs and other schema languages, no specific mechanism exists for mapping these syntactic constraints to their semantic implications.

But why would one need to do this mapping? Most of the time, a markup vocabulary is developed for an application-specific purpose, and the semantics bound to this vocabulary are encoded in the applications themselves.

The problem with that approach is that all the work done to define the semantics of these vocabularies precisely gets encapsulated in application code, and cannot be re-used in new contexts without developing new code. In other words, it limits the potential for re-use of existing data.

Moreover, some vocabularies are designed to fit in a multitude of formats, with various syntactic constraints and processing models.

For instance, Creative Commons [CC] defines a set of metadata to describe the licensing rights associated with various forms of content. Users of this metadata vocabulary will want to embed this information in a very wide range of formats (SMIL, SVG, OAI, XHTML). The syntactic flexibility allowed by these formats varies widely, and the way to publish the information must be adapted to each of them. Moreover, a processor trying to detect and analyse the various ways this information is embedded in the host formats would need to learn each of the particular embedding methods.

While RDF [RDF] and OWL [OWL], W3C Recommendations since February 2004, provide a direct solution for re-using and combining vocabularies, many existing applications and markup systems cannot realistically be moved to this data model, and even today, many new applications are likely to be built on existing XML and HTML toolkits with their well-deployed workflow tools, rather than on the less ubiquitous RDF ones.

In addition to these XML use cases, the need to incorporate fine-grained metadata in HTML documents, beyond what the HTML specification defines, has arisen again and again in the history of the Web, either to benefit from the deployment of well-known RDF vocabularies, or simply to use HTML as a lever to deploy what has been casually called the "lowercase semantic web", namely the possibility of encoding lightweight semantics through HTML markup conventions.

In all the cases above it seems natural to use RDF and OWL as well-grounded foundations to describe the semantics associated with these markup constructions, as the Cambridge Communiqué [CambComm] suggested in 1999.

GRDDL, standing for "Gleaning Resource Descriptions from Dialects of Languages" [GRDDL], proposes a mechanism to make these associations possible for any XML or XHTML document. GRDDL grounds these associations in URI space, making it simple to extend the collection of transformations as new XML vocabularies are deployed.

GRDDL mechanisms

Specifying a Transformation For a Family of Documents

Let us refer to a class of XML or XHTML documents that share a specific semantically-meaningful structure as a "family" of documents. The idea behind GRDDL is that a family of documents will advertise its family via a well-known URI and that dereferencing this URI should lead a GRDDL processor to an algorithm to map from the structure to the semantics. The XHTML1 namespace URI is an example of one such family identifier.

The existence of such a well-known URI for vocabularies deployed on the Web is consistent with the World Wide Web Architecture [WEBARCH] described by the W3C Technical Architecture Group:

To benefit from and increase the value of the World Wide Web, agents should provide URIs as identifiers for resources.

In the case of XML vocabularies, the most widely deployed form of URI as identifier for a family of documents is the namespace of the root element - although this is not enforced by any specification; see the relevant TAG issue [mixedNamespaceMeaning-13] for more details.

For instance, all P3P Policy Reference files [P3P] start with a META element in the http://www.w3.org/2002/01/P3Pv1 namespace; interestingly in the case of P3P, the associated meaning of the XML vocabulary has been formally translated into RDF, as described in the RDF Schema for P3P [P3P-RDF]. Similarly, SVG files and XML Schema files can be recognized by the namespace of their root element, making it a good identifier for the type of structure and semantics associated with these families of documents.

XHTML has also been used as a container for sub-vocabularies (e.g. the XHTML Friends Network proposal [XFN]). The proper way to anchor these additional semantics in the Web is to use the profile attribute on the head element, as warranted by the HTML 4.01 specification [HTML4]:

The profile attribute of the HEAD specifies the location of a meta data profile. The value of the profile attribute is a URI. User agents may use this URI in two ways:

  • As a globally unique name. User agents may be able to recognize the name (without actually retrieving the profile) and perform some activity based on known conventions for that profile. For instance, search engines could provide an interface for searching through catalogs of HTML documents, where these documents all use the same profile for representing catalog entries.
  • As a link. User agents may dereference the URI and perform some activity based on the actual definitions within the profile (e.g., authorize the usage of the profile within the current HTML document). [The HTML4] specification does not define formats for profiles.

Indeed, an XHTML document using the set of relationships defined in XFN [XFN] must reference the XFN profile in its head element.
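The two kinds of family identifiers discussed above (the namespace of the root element, and the XHTML profile attribute) can be collected with a few lines of code. The following Python sketch is only an illustration: the function names are hypothetical, and the profile value is treated as a whitespace-separated list of URIs, as HTML 4.01 permits.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def split_tag(tag):
    """Split an ElementTree tag of the form '{ns}local' into (ns, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return None, tag


def family_uris(xml_text):
    """Return the candidate family identifier URIs of a document:
    the namespace of the root element, then any XHTML head profile URIs."""
    root = ET.fromstring(xml_text)
    ns, local = split_tag(root.tag)
    uris = []
    if ns:
        uris.append(ns)
    if ns == XHTML_NS and local == "html":
        head = root.find("{%s}head" % XHTML_NS)
        if head is not None:
            # The profile attribute may list several URIs separated by spaces.
            uris.extend(head.get("profile", "").split())
    return uris
```

For a P3P Policy Reference file this returns only the P3P namespace URI; for an XHTML document carrying a profile it returns the XHTML namespace URI followed by the profile URI(s).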

Both for XML and XHTML, when such a family identifier URI exists and is dereferenceable, GRDDL proposes that dereferencing this URI should provide one or more algorithms that transform an instance document of this family into RDF/XML statements. When such a transformation exists, GRDDL specifies that these statements are indeed part of the intended meaning of the document. This is illustrated in the figure below.

Extracting RDF Statements from a P3P document using GRDDL

Practically speaking, this works as follows:

Given the XML nature of the targeted vocabularies, and the growing availability of XSLT processors, GRDDL suggests that algorithms should be expressed in this language, so that a processor configured to fetch new algorithms on the fly could use XSLT [XSLT1] as a common transformation language. At the time of this writing, XSLT 2 is still a Working Draft [XSLT2], and the question of whether GRDDL should require support for XSLT 2 has not been fully addressed, although its admittedly superior expressive power makes it an interesting candidate.

This does not prevent the use of URIs as simple identifiers relying on a library for well-known transformation algorithms, nor the use of other techniques than XSLT to process the said XML documents.
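As an illustration of this last point, a processor that does not fetch anything from the Web could rely on a local library of transformations keyed by family URI. The Python sketch below assumes a hypothetical vocabulary (the http://example.org/ns/book namespace and its title element are invented for the example) and implements the transformation as a plain function rather than as XSLT.

```python
import xml.etree.ElementTree as ET

# Hypothetical family URI, used only for this illustration.
EXAMPLE_NS = "http://example.org/ns/book"


def book_to_triples(root, doc_uri):
    """Hypothetical transformation: map <title> children to dc:title."""
    triples = []
    for title in root.findall("{%s}title" % EXAMPLE_NS):
        triples.append((doc_uri, "dc:title", title.text or ""))
    return triples


# Library of well-known transformations, keyed by family URI.
KNOWN_TRANSFORMATIONS = {EXAMPLE_NS: book_to_triples}


def glean(xml_text, doc_uri):
    """Apply the locally known transformation for the root namespace, if any."""
    root = ET.fromstring(xml_text)
    ns = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else None
    transform = KNOWN_TRANSFORMATIONS.get(ns)
    return transform(root, doc_uri) if transform else []
```

The trade-off is the one discussed earlier: such a library works without network access, but every new vocabulary requires new code in the processor.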

A point has been left open in the process above: how to detect the URI referencing the algorithms in the namespace (or XHTML profile) document, given that there is no standard format for it? (See W3C TAG issue [namespaceDocument-8].) Indeed, namespace owners have put a wide variety of documents as representations of their URIs, from a simple HTML document to content-negotiated schemas, DTDs, or RDDL documents that dispatch between these various relevant data.

To solve that problem, GRDDL instructs a processor not to look for a particular format, but to look for a given RDF property stated by the namespace or profile document. And since it is not possible to assume that all namespace and profile documents are published in RDF, the GRDDL processing is applied here recursively: namely, if the given namespace/profile document is not in RDF/XML but in some XML format, the processor should simply try to extract RDF statements from it using GRDDL processing.
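The recursion can be sketched on a toy model of the Web held in memory. Everything in the following Python example is an assumption made for illustration: the URIs, the property name ex:transformation, the transformation names, and the way a transformation is "run" (a simple function call standing in for XSLT processing).

```python
# Hypothetical property that a family document uses to point at a transformation.
TRANSFORMATION = "ex:transformation"

# Toy Web: URI -> document. An "rdf" document states its triples directly;
# an "xml" document only advertises the family it belongs to.
WEB = {
    "http://example.org/ns/a": {
        "format": "rdf",
        "triples": [("http://example.org/ns/a", TRANSFORMATION, "a2rdf")],
    },
    "http://example.org/ns/b": {
        "format": "xml",
        "family": "http://example.org/ns/schema",
    },
    "http://example.org/ns/schema": {
        "format": "rdf",
        "triples": [("http://example.org/ns/schema", TRANSFORMATION, "schema2rdf")],
    },
}

# Stand-ins for running a named transformation on the document at `uri`.
TRANSFORMS = {
    "schema2rdf": lambda uri: [(uri, TRANSFORMATION, "b2rdf")],
}


def statements(uri, seen=None):
    """RDF statements gleaned from the document at `uri`."""
    seen = set() if seen is None else seen
    if uri in seen or uri not in WEB:
        return []  # unknown document, or a loop in the chain of families
    seen.add(uri)
    doc = WEB[uri]
    if doc["format"] == "rdf":
        return list(doc["triples"])
    # Not RDF: apply the same processing to the document itself.
    result = []
    for name in transformations_for(doc["family"], seen):
        run = TRANSFORMS.get(name)
        if run is not None:
            result.extend(run(uri))
    return result


def transformations_for(family_uri, seen=None):
    """Transformations that the family document states for its own family."""
    return [o for (s, p, o) in statements(family_uri, seen)
            if s == family_uri and p == TRANSFORMATION]
```

Here the namespace document at http://example.org/ns/b is not RDF, so its statements are obtained by first resolving the transformations of its own family; the `seen` set is a simple guard against circular references, which is one possible answer to the termination question and not necessarily the one GRDDL adopts.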

The examples attached to the XSLT-based GRDDL processor [XSLT-DEMO] show how this recursive processing can be applied fairly easily to namespace and profile documents given in XML Schema, XHTML and RDDL formats, and can be easily extended to any XML format. The figure "Applying GRDDL recursively through an XML Schema-based namespace document" illustrates how this would work when applied to a namespace represented by an XML Schema - thus partially implementing one of the goals put forward in the Cambridge Communiqué [CambComm].

A final question arises with this recursive method: how to stop the recursion without having to modify all the namespace and profile documents of formats that may be used only a few times as containers for GRDDL markup? Not only would this be unlikely to be achievable, it would also limit the number of ways one could use a given format as a container for GRDDL markup.

The following section details the second GRDDL mechanism that makes this possible.

Specifying a Transformation For an Individual Document

There are various situations where it is not possible, practical or desirable to have a URI identifying a given family of documents, or to have it referenced in the places mentioned above (i.e. as namespace of the root element or as profile in the head element), or to change the representation available at such a URI.

Thus, GRDDL provides a second mechanism that allows authors to associate an individual document with a given transformation algorithm. It does so through

This mechanism means that a GRDDL processor should

This mechanism offers a simple and short way to close the recursive processing explained above: in the case of an XML Schema-based namespace document (see the figure below), one would just need to add a single transformation reference to the XML Schema to allow the GRDDL processor to extract the transformations that apply to all the documents defined in this namespace.
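The following Python sketch shows how a processor might collect the transformation references that a single document declares for itself. The concrete names are assumptions that should be checked against the GRDDL specification [GRDDL]: a transformation attribute in the http://www.w3.org/2003/g/data-view# namespace on the root element of an XML document, and, for XHTML, link elements with rel="transformation" in a head carrying the http://www.w3.org/2003/g/data-view profile.

```python
import xml.etree.ElementTree as ET

# Assumed names; verify against the GRDDL specification before relying on them.
DATA_VIEW_NS = "http://www.w3.org/2003/g/data-view#"
DATA_VIEW_PROFILE = "http://www.w3.org/2003/g/data-view"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def document_transformations(xml_text):
    """Return the transformation URIs that a single document declares."""
    root = ET.fromstring(xml_text)
    # XML case: whitespace-separated URIs in an attribute on the root element.
    uris = root.get("{%s}transformation" % DATA_VIEW_NS, "").split()
    # XHTML case: link elements in a head that carries the data-view profile.
    head = root.find("{%s}head" % XHTML_NS)
    if head is not None and DATA_VIEW_PROFILE in head.get("profile", "").split():
        for link in head.findall("{%s}link" % XHTML_NS):
            rels = link.get("rel", "").split()
            if "transformation" in rels and link.get("href"):
                uris.append(link.get("href"))
    return uris
```

Under these assumptions, adding one attribute to the root element of an XML Schema is enough for the function to return the transformation to apply to that schema.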

Applying GRDDL recursively through an XML Schema-based namespace document

How can these mechanisms be used in practice with vocabularies deployed today? How do they allow communities to take part in the Semantic Web goal of re-using information as much as possible?

Scenarios of applications

Most of the deployed vocabularies (either XML or XHTML based) are the results of some agreement inside a community on the meaning of the terms defined by the vocabulary.

Some of these communities may wish to integrate these vocabularies into the Semantic Web, either to benefit from the network effect promised by sharing a common format to carry semantics, or to solve a particular problem of vocabulary integration, or simply to re-use some of the tools available for Semantic Web technologies to make inferences, visualize relationships, or ease indexing by Semantic Web bots.

We explore here a few usage scenarios of various communities, and how they could effectively use GRDDL to meet their needs. This section is purely putative and does not necessarily reflect the thinking of these communities.

The W3C XHTML Working Group scenario

The HTML Working Group is working on XHTML 2.0 [XHTML 2.0]. One of the most promising aspects of this new version of XHTML is the possibility of expressing a very wide range of RDF statements in XHTML markup. This would offer a whole new community the opportunity to take a direct part in the Semantic Web.

The current proposal makes it possible to extract RDF/XML statements from XHTML 2 documents using XSLT, making it compatible with the GRDDL mechanisms. As such, any Semantic Web agent implementing GRDDL would be able to parse XHTML 2 documents, as long as the XHTML Working Group publishes in the XHTML 2 namespace document a proper link to the relevant XSLT style sheet.

The W3C SVG Working Group scenario

The SVG specification offers a metadata element, and its associated XML Schema allows any type of markup inside this element, which makes it possible to embed RDF/XML statements inside any kind of SVG content.

To help make these metadata elements available to Semantic Web agents, the SVG Working Group could decide to update the SVG namespace document, published in XHTML, to make it point to an XSL style sheet that would simply extract the content of such metadata elements expressed in RDF/XML, instantly making all this metadata part of the Semantic Web.
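The extraction itself is simple. The Python sketch below does what such a style sheet would do, using the standard library instead of XSLT; the function name and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_in_svg_metadata(svg_text):
    """Return the rdf:RDF elements found directly inside svg:metadata elements."""
    root = ET.fromstring(svg_text)
    found = []
    for metadata in root.iter("{%s}metadata" % SVG_NS):
        found.extend(metadata.findall("{%s}RDF" % RDF_NS))
    return found


sample = """<svg xmlns="http://www.w3.org/2000/svg">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="">
        <dc:title>A circle</dc:title>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <circle r="10"/>
</svg>"""

for rdf in rdf_in_svg_metadata(sample):
    print(ET.tostring(rdf, encoding="unicode").strip())
```

Metadata elements that contain something other than RDF/XML are simply ignored.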

The XHTML Friends Network community scenario

The XFN community has defined a set of relationship names [XFN] anchored in URI space through an XHTML profile. Since these relationships have been made explicit for authors and developers of authoring tools, it is reasonable to assume that people using these conventions agree that using them does express the intended meaning.

If the XFN community wanted to bring their data into the Semantic Web, for instance to be able to re-use the existing visualization tools developed for Semantic Web languages, they would simply need:

  • to create an RDF/XML description of these relationships - or to make the existing XHTML description equivalent to a set of RDF/XML ones using GRDDL
  • to update the existing XHTML profile to make it reference an XSL transformation that would turn links with XFN relationships into proper RDF statements, à la [ foaf:homepage <myhomepage.html>] xfn:met [ foaf:homepage <http://buddy.example.net/homepage/> ]. (in RDF/N3 for ease of reading)
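The second step can be sketched as follows. This Python example stands in for the XSL transformation and makes several assumptions: the function name is hypothetical, only a subset of the XFN relationship names is recognized, and the foaf: and xfn: prefixes are taken to be declared elsewhere.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A subset of the XFN relationship names, enough for the illustration.
XFN_RELS = {"friend", "acquaintance", "contact", "met", "co-worker", "colleague"}


def xfn_to_n3(xhtml_text, page_uri):
    """Turn XHTML links carrying XFN rel values into N3 statements."""
    root = ET.fromstring(xhtml_text)
    lines = []
    for a in root.iter("{%s}a" % XHTML_NS):
        href = a.get("href")
        if not href:
            continue
        # rel holds a space-separated list; keep only the XFN values.
        for rel in a.get("rel", "").split():
            if rel in XFN_RELS:
                lines.append(
                    "[ foaf:homepage <%s> ] xfn:%s [ foaf:homepage <%s> ] ."
                    % (page_uri, rel, href)
                )
    return lines
```

A link such as `<a rel="met" href="http://buddy.example.net/homepage/">` found in myhomepage.html yields the statement given in the list above.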

The blogging community scenario

The blogging community has been defining and re-using a number of HTML conventions to mark up relationships between blog authors (XFN), geographical indications (GeoURL), communication endpoints (trackback, pingback), link endorsement (nofollow), etc.

This growing number of conventions allows the encoding of a fascinating amount of data that a few ad-hoc applications have started to explore. But given the diversity of the topics and the number of conventions in use, exploiting each new convention currently requires developing a new application.

Moreover, a number of these conventions have not been grounded in URI space, which makes the process of interpreting them and evolving them somewhat fragile; in the future, new conventions may end up using clashing names for lack of coordination or due to bad timing.

To lower these risks and benefit from a stronger foundation, the blogging community could create a profile (or a set of profiles) that would associate the existing conventions with a URI; if this profile was set up with a set of GRDDL transformations, any content referencing the profile could automatically be processed by Semantic Web agents supporting GRDDL. Moreover, any new convention added to the profile (as a result of a consensus in the community) could be supported in processing tools by simply adding to the profile a link to a new XSL style sheet.

The Creative Commons community scenario

Creative Commons has defined a set of RDF properties describing the licenses it proposes for authors to use on their content. It wants search engines to be able to parse this data in a wide variety of formats, from XHTML to the Open Archives Initiative format.

But each of these formats has a set of syntactic constraints that need to be addressed separately, using a different embedding technique.

Instead of having to maintain a library of formats that Creative Commons search engines need to know how to process, Creative Commons could create a set of XSL transformations that would work for each of these formats, and either propose that they be referenced directly from individual documents, or work with the communities owning these formats to have them include the links in the relevant namespace and profile documents.

These various examples reveal an interesting property of GRDDL: given its levels of indirection, it can be adapted to a wide range of community processes, and be used as a technical tool to express existing consensus among these communities.

GRDDL status and future development

Specification

As of April 1st 2005, the most recent publication of GRDDL is a Coordination Group Note dated April 2004, produced by a task force set up to explore ways to solve the long-standing question of how to embed RDF in HTML - see [STORING] for more details.

Since May 2004, this task force has been integrated into the Semantic Web Best Practices and Deployment Working Group; this group serves today as the forum for discussing GRDDL (esp. through the public-rdf-in-xhtml-tf@w3.org mailing list) and for determining what future development GRDDL needs.

As of April 2005, the GRDDL specification has not been endorsed by W3C Membership.

The authors of the GRDDL specification are interested in feedback on whether this specification should go through the W3C Recommendation track to help disseminate it through the relevant communities.

On the technical side, one of the interesting issues yet to be resolved concerns the support for XSLT 2 in GRDDL processors; although XSLT 2 is only at Working Draft stage at this time and thus not very widely deployed yet, using it would bring a whole new set of functions and capabilities that are likely to prove very useful in transforming existing XML and XHTML structures into RDF/XML. In particular, XSLT 2.0 may be needed to make XHTML 2.0 fully processable through GRDDL.

Implementations

As of April 1st 2005, five partial or full implementations of GRDDL have been announced.

The diversity of implementations and their number at this stage of development of the specification is a positive sign of the interest in the technology among the Semantic Web community; hopefully this number should grow even larger, and GRDDL could become a basic part of any RDF toolkit.

Test Suite

To accompany both the development of the specification and the development of GRDDL implementations, a test suite has been developed, with a series of test cases for each mechanism described in the specification, and a small Python test harness that automates the running of the test suite.

Conclusion

GRDDL proposes a set of mechanisms strongly anchored in the Web Architecture through its use of URIs, and has the potential to address a number of issues that have arisen through the co-deployment of XHTML, XML-based vocabularies and RDF-based technologies.

The authors of the GRDDL specification are interested in feedback on the technical content of the specification, as well as on the status that potential users of this technology would like to see attached to this specification: is the development of this specification through the W3C Process to make a W3C Recommendation needed to help its deployment in and out of the Semantic Web community?

Acknowledgements

Many thanks to Ralph Swick, Dan Connolly and Dan Brickley for their reviews, suggestions and ideas that have helped write this document.

Bibliography

[CambComm] The Cambridge Communiqué
7 October 1999
[CC] Implementing Creative Commons Metadata
[FOAF] FOAF Vocabulary Specification
3 April 2005
http://xmlns.com/foaf/0.1/
[GRDDL] Gleaning Resource Descriptions from Dialects of Languages (GRDDL)
13 April 2004
[HTML4] HTML 4.01 Specification
24 December 1999
[mixedNamespaceMeaning-13] What is the meaning of a document composed of content in mixed namespaces?
22 April 2002
[N3] Notation 3, An RDF language for the Semantic Web
[namespaceDocument-8] What should a "namespace document" look like?
14 January 2002
[OWL] OWL Web Ontology Language Overview
10 February 2004
[P3P] The Platform for Privacy Preferences 1.0 (P3P1.0) Specification
16 April 2002
[P3P-RDF] An RDF Schema for P3P
25 January 2002
[RDF] Resource Description Framework (RDF): Concepts and Abstract Syntax
10 February 2004
[STORING] Storing Data in Documents: The Design History and Rationale for GRDDL
[SVG] Scalable Vector Graphics (SVG) 1.2
27 October 2004
[WEBARCH] Architecture of the World Wide Web, Volume One
15 December 2004
[XFN] XFN 1.1 relationships meta data profile
[XHTML 2.0] XHTML 2.0
22 July 2004
[XSLT-DEMO] Demonstration of GRDDL applied to XML
[XSLT1] XSL Transformations (XSLT) Version 1.0
16 November 1999
[XSLT2] XSL Transformations (XSLT) Version 2.0
4 April 2005

Biography

Dominique Hazaël-Massieux

W3C Quality Assurance Activity Lead and Systems Engineer, World Wide Web Consortium (W3C)

Dominique holds an engineering degree from the "Grande Ecole" Ecole Centrale Paris. He has been working at W3C since 2000; he started as the W3C Webmaster, then split his efforts between the development of tools for the W3C community and the lead of W3C Quality Assurance Activity.

As an aside to this, he's a Semantic Web enthusiast, and has been more or less directly involved in several Semantic Web Advanced Development projects at W3C. Liking XSLT a lot, his most recent interests in Semantic Web technologies are in bridging HTML, XML and RDF, especially with the help of GRDDL, a work-in-progress specification he co-authored.

Dominique maintains a blog on the various computer-related topics of interest to him.