XTech 2005: XML, the Web and beyond.
The BBC World Service has traditionally been a radio service and moved into the Web in 1994, with RSS starting in 2003. The motivation for addressing new media is to more effectively address the audience, giving timely news and information, personalised for them, when and where they want. RSS is a step forward in the BBC's promise of quality content for its entire audience, but also has disadvantages. This presentation will highlight the knowledge obtained from syndicating 35 different languages to 3 different types of users:
1. Desktop Reader: The casual user who wants to read BBC content in an RSS reader instead of a browser.
2. Aggregator: This is usually an engine/computer of some kind (e.g. Blogdigger, Feedster, Daypop).
3. Engineer: This category of user ranges from programmers who build internal intranet applications based on our information, to the enthusiastic open source developer who builds web applications for everyone to see.
The main points of this paper will include:
1. Why RSS testing is complex and different from web page testing.
2. The hard and important choices when thinking about languages in RSS, such as identification, language-specific features (e.g. directionality), styling, and advertising the service.
3. How to best support right to left languages.
4. How the BBC World Service are supporting engineers and advanced aggregators with extra meta-data.
5. What needs to change in the RSS and ATOM standards regarding languages.
6. How the BBC World Service is using RSS as a base for syndication with humans and machines.
BBC World Service is a large part of the publicly funded British Broadcasting Corporation. Its remit is balanced news-gathering while broadcasting to and engaging with its audience. The World Service was traditionally a radio-only service; an HTML Web based service began in 1996, and an RSS service began rolling out in 2003. This paper chiefly concerns the RSS roll-out in non-English languages. Language Services write news in several different languages, from Albanian to Vietnamese, meaning the content is original, not translated (there is a large difference between translated and original content). RSS allows the BBC to get more use out of the HTML content that is already produced.
For quite some time the World Service only syndicated content to other sites in business deals, but there was always a plan to offer the audience syndicated content for their own use. Late in 2003, the first publicly syndicated RSS feeds were released onto the Internet for the audience to use. They are still available in largely the same form they started in, RSS 0.91, which is simply title, link and description. In late 2004, 35 out of 43 languages were publicly syndicated via RSS 1.0.
The more detailed advantages of publicly syndicated content are well known and out of the scope of this paper. Therefore, this paper focuses on the feasibility of worldwide RSS syndication.
The public RSS market is something that people tend to underestimate; it is a new market with loads of energy and potential for innovation. Though the market is large, it can be syndicated to, if done correctly. The market can be split into three main groups; each one is treated slightly differently, but can be catered for with one type of RSS feed. These main groups are the desktop reader, the aggregator, and the engineer.
The World Service content is unique throughout all these markets because of its language orientation; unfortunately, RSS is not quite ready for language orientation and a world wide audience.
"Language tags can be (and should be) used to indicate the language of text in HTML and XML documents." (Martin Dürst & Richard Ishida, W3C)
RSS is not a W3C standard, and does not follow this mantra. To improve things, it is not sufficient for content producers (such as the BBC) to unilaterally use such markup; the software engineers who build Aggregators and RSS readers also need to be aware of such markup and to write code that handles it appropriately. It is helpful to start by looking at how language information is marked up in HTML, XML, XHTML, and RDF.
For HTML 4, language tags are specified with the lang attribute.
<html lang="en-GB">
The lang attribute is attached to the root html element of an HTML document. The value of the attribute follows RFC 3066 to describe the language. This is usually an ISO-639 two or three letter language code, often followed by an ISO-3166 two letter country code. IANA registered names and experimental tags such as x-babeldutch (described in the next example) may also be used.
<h1 lang="nl-NL">het UK stelt ... uit</h1>
<p lang="x-babeldutch">De Britse overheid ... maakt</p>
<p lang="en-GB">The European ... August.</p>
The lang attribute can also be used in-line to mark sections of text in different languages. Browsers fully understand the language attribute and know what to do with it. There was a time early in the web's existence when content producers did not put language attributes in their HTML. This may seem quite alien now, but the same mistakes are being made again.
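As a rough illustration of what "follows RFC 3066" means for software that reads these tags, the sketch below checks only the syntax of a language tag (a primary subtag of one to eight letters, then hyphen-separated subtags of one to eight letters or digits). The function names are hypothetical, and the check says nothing about whether a tag is actually registered or meaningful.

```python
import re

# RFC 3066 syntax: a primary subtag of 1-8 letters, followed by zero or more
# hyphen-separated subtags of 1-8 letters or digits.
_RFC3066 = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def is_wellformed_language_tag(tag):
    """Return True if `tag` matches the RFC 3066 Language-Tag syntax."""
    return _RFC3066.fullmatch(tag) is not None


def split_language_tag(tag):
    """Split a tag into its lower-cased primary subtag and the remaining subtags."""
    primary, *rest = tag.split("-")
    return primary.lower(), rest


print(is_wellformed_language_tag("en-GB"))   # True
print(is_wellformed_language_tag("en_GB"))   # False: underscore is not a separator
print(split_language_tag("en-GB"))           # ('en', ['GB'])
```

The check is strict about the eight-character limit on each subtag, so a long experimental subtag would be rejected; a reader that wants to be forgiving could relax that limit.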
In XML, the xml:lang attribute needs to be specified. This attribute is designed for identifying the human language used in the scope of the element to which it is attached. It works in the same way as the lang attribute:
<text x="450" y="250" xml:lang="nl-NL">het UK stelt ... uit</text>
In XHTML the xml:lang attribute is used along with the lang attribute from HTML 4, but xml:lang takes precedence:
<p xml:lang="en-GB" lang="en-GB">The European Union ... law in August.</p>
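Because xml:lang applies to the whole scope of the element it is attached to, a consumer has to carry the value down to child elements that do not set their own. The following is a minimal sketch of that inheritance using Python's standard ElementTree parser; the sample document and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages_in_scope(element, inherited=None):
    """Yield (tag, language) pairs, inheriting xml:lang from ancestor elements."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages_in_scope(child, lang)


# Hypothetical sample: the second paragraph overrides the document language.
doc = ET.fromstring(
    '<doc xml:lang="en-GB">'
    "<p>The European Union ...</p>"
    '<p xml:lang="nl-NL">het UK stelt ... uit</p>'
    "</doc>"
)
result = list(languages_in_scope(doc))
print(result)
```

ElementTree elements have no parent pointer, which is why the inherited value is passed down during the walk rather than looked up from each element.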
Even with the intimate relationship between RDF and Dublin Core, the xml:lang attribute is still used:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en">
<rdf:Description rdf:about="http://dublincore.org/">
<dc:title>BBC Worldservice</dc:title>
<dc:description>The BBC World Service, broadcasting in 43 different languages 24 hours a day</dc:description>
<dc:language>en</dc:language>
</rdf:Description>
</rdf:RDF>
RSS is an umbrella term for a format that spans several different versions. The RSS family was started by Netscape at version number 0.90 in the late 1990s and was soon obsolete. Afterwards, two separate paths were taken (see Figure 1); RSS 0.91 was written by Dave Winer in 2000, while the RSS Dev group wrote RSS 1.0 a short time later in 2001.
The definition of language in RSS is indeed an interesting topic and a somewhat dark corner of RSS. Most try to avoid it at all costs and there is little said about the matter except in the area of encodings. As we have already seen, XML defines an xml:lang attribute that can be used to define language usage at nearly any level of an XML document.
RSS 0.91 has one required language element and it is restricted to a small subset of languages. Most notably, languages missing from RSS 0.91 for World Service include Arabic, Urdu, Vietnamese, and Thai. There is no way to specify languages separate from the whole syndication document. In the RSS 0.92 specification, the language element was optional, with the absence of a tag being used for channels using potentially more than one language. Although this is a valid reason, making the language element optional does not solve the problem. A better solution, used in RSS 2.0 and RSS 1.0, is to move the language element down into the item level of the RSS structure.
In the RSS 2.0 specification, the language element is still optional. It defines a small set of languages, but also allows languages to be specified using RFC 3066, in the same way as the HTML lang attribute and the xml:lang attribute in XHTML and XML. As Tony Hammond describes, RSS endeavoured to be a simple, easy to understand format with relatively modest goals. After it became a popular format, developers wanted to extend it using modules defined in namespaces, as specified by the W3C. RSS 2.0 adds that capability, following a simple rule: an RSS feed may contain elements not described in the specification, but only if those elements are defined in a namespace.
Describing language is a goal with lots of benefits, but my personal opinion is that the RSS 2.0 specification does not have enough structure to really make this happen. It is possible to extend RSS 2.0 using namespaced modules, and since xml:lang is an attribute available to any XML language, it is reasonable to assume people will use it.
The RSS 0.91, 0.92 and 2.0 family is a confusing bunch, written by a lone developer who gave away his good efforts to the public domain. Some still do not agree that RSS 2.0 is an official specification at all. Language handling within the RSS 1.0 specification seems a lot easier because it is based around RDF and has quite a lot to say about extending the specification.
The crux of the difference between RSS 1.0 and earlier (or lateral) versions lies in its extensibility via XML Namespaces and RDF (Resource Description Framework) compliance.
Namespace-based modules allow compartmentalized extensibility, which allows RSS to be extended without changing the core format.
RSS 1.0, being RDF, supports the use of xml:lang to indicate the language of textual content. This is a better solution than a single top level language element. The BBC World Service have found these advantages compelling, and more recently developed World Service RSS feeds are all in RSS 1.0.
There is another standard that sits slightly out of the RSS debate, although some people do see it as the next generation after RSS 2.0. The standard is called ATOM and has been in the past referred to as PIE. ATOM encourages the use of RFC 3066 language tags. However, its being in a beta state makes it a difficult standard for the BBC to support at this stage.
Popular blogs, such as those found in Technorati (an Aggregator), tend to use RSS 1.0, with some using RSS 2.0 and several also providing ATOM. News services tend to use RSS 2.0, but the Guardian and the BBC World Service use RSS 1.0. Some RSS feeds from the BBC World Service are still in a long development cycle and therefore still use RSS 0.91, as they have since their first conception.
In conclusion, the RSS 2.0 specification is not very clear when it comes to language support. The RSS 1.0 specification, being RDF based, puts support for internationalisation at the core of the standard. Extensibility is key to serving the different target markets in different languages.
Developing RSS feeds for 35 different languages has stressed the technology in a number of ways:
The following is an example of how we mark up language information in RSS 1.0:
<item rdf:about="http://www.bbc.co.uk/go/wsy/pub/rss/1.0/-/persian/news/story/2005/04/050403_la-iraq-parliamentspeaker.shtml">
<title xml:lang="fa">رييس مجمع ملی عراق انتخاب شد </title>
<link>http://www.bbc.co.uk/go/wsy/pub/rss/1.0/-/persian/news/story/2005/04/050403_la-iraq-parliamentspeaker.shtml</link>
<description xml:lang="fa">نمايندگان پارلمان عراق برای سومين بار از زمان تشکيل اين پارلمان در ماه ژانويه، تشکيل جلسه دادند و پس از چندين هفته اختلاف نظر، حاجم الحسنی ، وزير کنونی صنايع را به عنوان رييس پارلمان انتخاب کردند. </description>
<dc:language>fa</dc:language>
</item>
Consideration was given to using only an xml:lang attribute or only the dc:language element to indicate the language of content, but putting both, while redundant, allows the feeds to work well with more RSS readers.
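From the reader's side, the redundancy means either marker can be used. The sketch below shows one way a reader might pick an item's language: the xml:lang attribute on the title first, then dc:language, then a caller-supplied default. This is an assumed lookup order for illustration, not a description of how any particular RSS reader works. The sample feed is hypothetical, with placeholder titles and example.org URLs, and the function does not handle xml:lang inherited from ancestor elements.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def item_language(item, default=None):
    """Pick an item's language: title xml:lang, then dc:language, then default."""
    title = item.find("rss:title", NS)
    if title is not None and title.get(XML_LANG):
        return title.get(XML_LANG)
    dc_language = item.findtext("dc:language", default=None, namespaces=NS)
    if dc_language and dc_language.strip():
        return dc_language.strip()
    return default


# Hypothetical feed: one item with both markers, one with only dc:language,
# and one with neither.
FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="http://example.org/a">
    <title xml:lang="fa">...</title>
    <link>http://example.org/a</link>
    <dc:language>fa</dc:language>
  </item>
  <item rdf:about="http://example.org/b">
    <title>...</title>
    <link>http://example.org/b</link>
    <dc:language>ur</dc:language>
  </item>
  <item rdf:about="http://example.org/c">
    <title>...</title>
    <link>http://example.org/c</link>
  </item>
</rdf:RDF>"""

root = ET.fromstring(FEED)
languages = [item_language(item) for item in root.findall("rss:item", NS)]
print(languages)
```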
Around the Internet, there is other syndicated content in non-Latin languages. However, these non-Latin language feeds tend to describe themselves as American English. There is no easy way for a machine to tell their correct language.
Elaph, an Arabic on-line magazine, use a simple RSS 2.0 feed to syndicate their content. Elaph take the view that their users know that the feed is in Arabic and do not use any mechanism for marking up the language of their content.
iTiran, a Persian Blogger, similarly assumes his readers know his feed is Persian, to the extent of using Persian in the meta-data, such as:
<dc:subject>فرهنگ</dc:subject>
Even blogs from a single service provider will sometimes include language information, and sometimes not.
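A machine can at least notice when a feed's declared language is implausible for the script it is written in. The sketch below is a crude heuristic, not a language detector: it uses the first word of each letter's Unicode character name as a rough script label and flags text declared as English whose letters are mostly non-Latin. Both function names are hypothetical.

```python
import unicodedata


def dominant_script(text):
    """Return the most common rough script label among the letters in `text`."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # The first word of the Unicode name (LATIN, ARABIC, CJK, ...)
            # serves as an approximate script label.
            script = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
            counts[script] = counts.get(script, 0) + 1
    if not counts:
        return None
    return max(counts, key=counts.get)


def looks_mislabelled(declared_language, text):
    """Flag text declared as English whose letters are mostly non-Latin."""
    primary = declared_language.split("-")[0].lower()
    script = dominant_script(text)
    return primary == "en" and script is not None and script != "LATIN"


print(looks_mislabelled("en-us", "BBC World Service"))  # False
```

This only catches the specific case described above; it cannot tell Persian from Arabic or Urdu, which is exactly why explicit language markup in the feed matters.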
Arabic, like Urdu and Persian (Farsi), is read right to left. There is no simple way to indicate right to left text in RSS without using the HTML dir (directionality) attribute.
On Web pages, directionality is identified by the use of right-to-left markers, both with the dir="rtl" attribute value, and by the use of characters that are marked as right to left characters in the Unicode code charts. Style sheets should not be used for this purpose.
Consideration was given to adding the right-to-left information explicitly within the RSS feed, for example, by using the dir attribute from HTML or XHTML.
<title dir="rtl" xml:lang="fa">امپراتوری زنجيره ای مک دونالدز ۵۰ ساله شد </title>
<title xhtml:dir="rtl" xml:lang="fa">امپراتوری زنجيره ای مک دونالدز ۵۰ ساله شد </title>
However, these are non-standard and cause the resulting file to not be valid RDF, and hence not RSS 1.0.
Since in the BBC World Service content the text in each element is from a single language and mainly from a single Unicode directionality type, the text will be correctly displayed right-to-left or left-to-right by following the Unicode directionality algorithm.
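The reasoning above can be sketched in code. In the Unicode bidirectional algorithm, when no direction is given by markup, the base direction of a paragraph is taken from its first strongly directional character. The function below is a simplified illustration of that one rule using Python's unicodedata module; it ignores explicit embedding and isolate controls, and the function name and the default value are assumptions.

```python
import unicodedata


def base_direction(text, default="ltr"):
    """Return 'rtl' or 'ltr' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found (for example, digits and punctuation only).
    return default


print(base_direction("BBC World Service"))  # ltr
print(base_direction("\u0627\u0645\u067e"))  # rtl (Arabic-script letters)
```

Because each element in the feeds holds text in a single language, the first strong character is a reliable indicator here; mixed-direction text would need the full algorithm, which rendering engines already implement.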
There is collective thought that only a tiny subset of structured HTML should be used in the description element. Content providers like the BBC are used to having significant control over the presentation of their content to the end-user. For example, most Arabic on the BBC World Service site is in bold font, to enhance its legibility. Simplified and Traditional Chinese have much smaller in-line spacing than Latin languages. RSS does not provide the ability to add such styling or presentation information, which, from a content provider's point of view, can be a problem.
The BBC, along with many content providers, is reliant on software engineers of RSS readers and parsers to have knowledge of language differences and adjust the presentation accordingly. Although this does observe the separation of concerns, there are issues since content providers may have expert knowledge of their audience and the presentation expected.
When approaching this area, it is fairly easy to think that setting the language type in the language element will be enough, but actually we are relying on software to read the language type and change the direction, in-line spacing, and styling automatically. This currently does not happen for many RSS parsers.
My belief is that if content publishers do not provide RSS feeds with correctly structured language meta-data that software engineers can cut their teeth on and test their applications against, then the stalemate will persist as it does today. Certainly this is one way of looking at it. The other view point is that software engineers need to put language features into their software, otherwise there is no point in content providers using correctly structured meta-data and modules to describe language content. The BBC World Service RSS feeds will hopefully help motivate RSS reader software engineers to correctly support language features.
I personally favour conversations between content producers and software engineers as I believe we can drive part of the way towards better standards for international RSS adoption that are not possible alone. It would benefit everyone in the long run just as moving from loose HTML to stricter XHTML allowed most browser makers to build better browsers.
The BBC believes automatic discovery of RSS feeds is critical for RSS adoption. Auto-discovery requires little instruction, and when working smoothly can be as simple as one click by an audience member. There is RSS auto-discovery for the BBC World Service sites, but that is only half the battle.
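Auto-discovery conventionally works by placing a link element with rel="alternate" and a feed media type in the head of the HTML page. The sketch below shows how a client might find such links with Python's standard HTML parser. The page, the example.org URL, and the set of accepted media types are assumptions for illustration; the paper does not state the exact markup used on the World Service sites.

```python
from html.parser import HTMLParser

# Media types commonly used to advertise feeds (an assumed, non-exhaustive set).
FEED_TYPES = {"application/rss+xml", "application/rdf+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect <link rel="alternate"> elements that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = {name: (value or "") for name, value in attrs}
        rels = attributes.get("rel", "").lower().split()
        media_type = attributes.get("type", "").lower()
        if "alternate" in rels and media_type in FEED_TYPES and attributes.get("href"):
            self.feeds.append({
                "href": attributes["href"],
                "title": attributes.get("title", ""),
                "hreflang": attributes.get("hreflang", ""),
            })


# Hypothetical page advertising one Persian feed alongside a stylesheet link.
PAGE = """<html><head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="/style.css">
<link rel="alternate" type="application/rss+xml" title="Persian news"
      hreflang="fa" href="http://example.org/persian/index.xml">
</head><body></body></html>"""

finder = FeedLinkFinder()
finder.feed(PAGE)
feeds = finder.feeds
print(feeds)
```

The hreflang attribute is worth collecting because it lets a client show the feed's language before fetching it.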
Unfortunately, not everyone has an RSS reader installed, so a second option of the RSS image button is relied upon to alert the person to an available RSS feed. There are also help pages written in different languages (Czech for example) to explain what RSS is and what benefit it has for the audience member.
The final hope is going straight for the browser, bypassing third party applications. Accessing RSS feeds usually requires using a browser in the first instance. At that point, if the browser can deal with RSS, there is little need to use a third party application, especially when many of those applications actually use the built in browser to display the contents of the RSS anyway.
Opera was one of the first browsers to support RSS directly, then Firefox followed with its live bookmarks feature. Other browsers have followed by providing support for RSS. However, there is a better solution than downloading a new browser. Pre-installed browsers with RSS support give the operating system RSS support from the moment it is installed. Novell Suse Linux comes with Firefox pre-installed while other Linux distributions use either Firefox or another RSS reader like Straw.
The first major browser with RSS support to be bundled with its operating system will be Safari 2.0 (dubbed Safari RSS), which is scheduled to go public in late April 2005 (see Figure 3).
The most widely deployed browser, Microsoft's Internet Explorer, does not currently support RSS, but it appears that the next version will. This may significantly increase the audience for RSS content and drop the barrier to entry to zero.
The BBC World Service really sees RSS 1.0 as a multi-purpose technology to suit all types of usage, from mobile readers to mass Aggregators, internal systems to public syndication. There is already thought about delivering relevant weather maps, images, audio, video, etc to BBC World Service partners and our audience.
An article written by Tony Hammond, titled "The Role of RSS in Science Publishing: Syndication and Annotation on the Web", goes into detail about RDF syndication and its advantages for scientific publishing. We are seeing parallels between the advantages to science and some of the advantages we would like to give our internal systems, audience, and business partners. The PRISM (Publishing Requirements for Industry Standard Meta-data) specification can be used inside of RSS 1.0 and has a lot to offer BBC World Service partners in a Really Simple Syndication format.
This is also where we can serve our engineer market. Tenbyten and Newsmap are good applications built on top of RSS feeds. The engineer market grows on good structured content supplied by RSS feeds.
The BBC tests its Web sites primarily using browsers such as Internet Explorer and Firefox, with the most popular OSes, primarily Windows variants. In this way, the tests the BBC runs can simulate the experience of most of our Web audience.
The RSS market is much more fragmented, with about 50 popular desktop RSS readers. When considering both the OS and other environmental factors, there are more than one hundred plausible configurations for the end audience. When combined with the number of languages supported, the variability of the users’ language preferences, and the language support within the OS, there are literally thousands of different end-user tests that would need to be performed in order to use the same approach to testing, trying to ensure that most of the audience have an acceptable experience. This approach to testing is too expensive.
Thus, the World Service aims to produce RSS 1.0 output that is valid to the standards. We try to identify RSS readers that are also valid to the standards, and try to work with those to improve the end user experience.
It is not viable to test beyond validation and conformance of the RSS feed.
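To give a flavour of what an automated conformance check involves, the sketch below tests a small subset of RSS 1.0 requirements: the document is well-formed XML, the root is rdf:RDF, there is one channel with title, link and description, and each item has an rdf:about attribute, a title and a link. It is an illustration only and is not the validation process used by the BBC; a real validator checks far more, including the channel's items sequence and full RDF validity.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS = "http://purl.org/rss/1.0/"


def check_rss10(xml_text):
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag != f"{{{RDF}}}RDF":
        return ["root element is not rdf:RDF"]

    problems = []
    channels = root.findall(f"{{{RSS}}}channel")
    if len(channels) != 1:
        problems.append(f"expected exactly one channel, found {len(channels)}")
    for channel in channels:
        for name in ("title", "link", "description"):
            if channel.find(f"{{{RSS}}}{name}") is None:
                problems.append(f"channel is missing <{name}>")

    for index, item in enumerate(root.findall(f"{{{RSS}}}item"), start=1):
        if item.get(f"{{{RDF}}}about") is None:
            problems.append(f"item {index} is missing rdf:about")
        for name in ("title", "link"):
            if item.find(f"{{{RSS}}}{name}") is None:
                problems.append(f"item {index} is missing <{name}>")
    return problems
```

A check like this can run against every language feed on each publish, which is far cheaper than testing each feed in each reader.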
Even with all the changes and progress, the problem of what the audience sees is still the biggest worry. We do not want to embed style elements inside the RSS description element and if we did we would be very respectful of the nature of RSS being content and structure, unlike HTML, which is a mixture of Style, Presentation, and Content. We still rely on our audience of right to left RSS feeds like Urdu and Arabic to be using RSS readers that use browsers for displaying content.
If RSS is to be a worldwide technology that anyone can read or write, both problems must be seriously considered, and we must come to a simple standard that does not harm any language used. The XML standard of xml:lang seems to work well but does not solve the problem of the actual RSS readers being able to render the text correctly in other languages. We believe a two-pronged approach is needed, and BBC World Service RSS feeds are a good starting point for what can be done from the content provider's point of view. The fact that one can actually search RSS feeds of all languages is a great achievement and shows software engineers what is possible when you have correctly structured and identified RSS feeds in any language.
All the elements for worldwide adoption of RSS are there; it just takes time, and both content producers and software engineers need to meet half way and promote good use of RSS. If either side ignores this and tries to go it alone, RSS will end up where HTML was back in 1997. Content producers will fill the RSS feeds with style to make their content stand out and be unique to a chosen RSS reader, while software engineers will try to make sense of meta-data that does not exist and create applications that are built on assumptions. Both are misleading and not good for the worldwide adoption of RSS technology.
Many thanks to those who made significant contributions to this paper. In alphabetical order; Deborah Cawkwell, BBC World Service. Jeremy J. Carroll, HP Labs. Joel Chippendale, BBC News online. Miles Metcalfe, Ravensbourne College of Design and Communication. Sarah Forrester, my lovely wife.
Ian Forrester
New Media Software Engineer, BBC World Service
Ian Forrester works for the BBC World Service as a New Media Software Engineer. His background is in design and information architecture, and he has been an advocate of XML and web standards for several years. Ian graduated from Ravensbourne College of Design and Communication in 2001 with a degree in Interaction Design. He worked for clients including Compaq, Ample, Reuters, and AOL Europe before returning to Ravensbourne to work and lecture in information design and development.
His previous projects include developing RSS for the BBC, developing an XML application to facilitate meta-data tagging for course documentation, establishing an archive of student work using XML, designing a freedom of information site based on Word ML, and re-engineering the Ravensbourne website and intranet with a “serious injection of XML, standards, and accessibility.”
Ian previously conducted weekly hands-on sessions with interaction design students in the areas of XML, syndication, web services, and networking. He continues to arrange external lectures for the College, including a day with Richard Stallman and Cory Doctorow from the Electronic Frontier Foundation.
He spends time participating in working groups throughout the BBC in the areas of RSS Syndication, XML, and CSS. In the near future, he plans to continue his education in information design and develop different ways to connect the BBC with its world-wide audience.