XTech 2005: XML, the Web and beyond.
This paper describes how we* built the talkeuro web site [talkeuro] and what the Social Documents approach means. It is in three sections. The first describes the process used to develop and build the site. The second covers what this kind of annotation project offers, what makes it different and how it can be extended, with notes on future developments. The final section describes how the site is currently being used.
Documents, if they are to be of use, need to be read and discussed; otherwise they are just a statement of record. The European Union published the European Constitution in October for signature by the heads of the twenty-five European countries. The European Constitution [Europa] is an important document which will affect the lives of 500 million people. The EU published it on a web site in PDF format. This is great for printing, but it is not a means of getting people to read, understand and engage with the 350-page document.
Last summer, when the European Constitution was in draft form, a group of us discussed the recent publication and how it was a seemingly closed process, one designed not to involve the people of Europe, despite the fact that many of them would be invited to vote for or against the constitution in referendums over the next 18 months. We resolved to build a web site to address this. Mark Simpkins had recently published a web site to allow for responses to the UK Government ID card Bill. This was a simple, quick site [consultationprocess] designed to put the text out for comment: essentially a weblog populated with the paragraphs from the Bill.
For talkeuro we extended this idea, automating more of the process, building a more category based approach and combining the weblog format with a wiki. The resulting platform allows Social Documents to be created instead of PDFs for publication. A Social Document encourages direct engagement with the content of the document. It provides the means of discussion itself, enabling people's commentary and placing it alongside the document on a section by section basis. Those reading the document can therefore read the content and the opinions of those reading before them [SocialDocuments].
To restructure the content from a PDF, we developed a set of scripts to take the output of pdftohtml and convert this into an interim XML format. Once the data was held as XML rather than PDF, we could manipulate and categorise the content on the basis of the document structure. PDF as commonly used rarely contains any of the document hierarchy; it is possible to hold this in a PDF, but current exporting tools do not take advantage of it.
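The restructuring step can be sketched roughly as follows. This is a minimal illustration and not the project's own scripts. It assumes the pdftohtml output has already been reduced to a list of plain text lines, that article headings take the form "Article I-44", and it uses hypothetical element names (constitution, article, para) for the interim XML.

```python
import re
import xml.etree.ElementTree as ET

# Assumed heading pattern, e.g. "Article I-44" or "Article III-148".
ARTICLE_HEADING = re.compile(r"^Article\s+([IVX]+)-(\d+)\s*$")


def lines_to_xml(lines):
    """Group plain text lines into <article> elements (hypothetical schema)."""
    root = ET.Element("constitution")
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ARTICLE_HEADING.match(line)
        if match:
            # A heading starts a new article; part and number become attributes.
            current = ET.SubElement(
                root, "article",
                part=match.group(1).lower(), number=match.group(2))
        elif current is not None:
            ET.SubElement(current, "para").text = line
        # Lines before the first heading (front matter) are skipped in this sketch.
    return root


# Placeholder text, not taken from the constitution itself.
sample = ["Front matter", "Article I-1", "First paragraph.", "",
          "Article I-2", "Second article text."]
tree = lines_to_xml(sample)
print(ET.tostring(tree, encoding="unicode"))
```

Once the text is held in a structure like this, grouping articles by part or section is a matter of querying attributes rather than re-parsing page layout.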
To build the site, we have used a combination of existing software products, partly for speed and partly for the rich set of tools already developed. The main public-facing aspect of the site uses MovableType [MovableType] from Six Apart, which gives us an XML-friendly weblog and publishing platform. We wrote a set of scripts to take our XML and publish it through the XML-RPC interface that MT provides. This will enable us to create all 20 language versions using a mostly automated process. Cutting and pasting 450 articles is not an appealing task.
MovableType has an extensive plugin library and an active developer community. This means that the project could use existing software and concentrate its efforts on data scraping scripts. Obviously there are downsides; for example, bugs in MovableType are only fixable by Six Apart. Luckily this was not a major issue. Essentially we could concentrate on writing the glue code to make MT behave as we needed it to and to manipulate the data into the formats required. MT also supports multiple weblogs from the same install out of the box, which makes system administration simpler. MediaWiki [MediaWiki], our chosen wiki software, has similar benefits, as it is a stable, mature application and provides rich language support. The final product that talkeuro is built from is the PunBB [PunBB] message board system, which was chosen for its clean looks and lightweight functionality.
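As an illustration of the publishing step, the sketch below posts one article through the metaWeblog.newPost call, which MovableType exposes over XML-RPC. It is not the project's own script: the article dictionary keys, the endpoint URL and the credentials are placeholders, and category assignment is left out.

```python
import xmlrpc.client


def connect(endpoint):
    """Create an XML-RPC proxy; no network traffic happens until a call is made."""
    return xmlrpc.client.ServerProxy(endpoint)


def build_post(article):
    """Map an article dict (hypothetical keys) onto the metaWeblog post struct."""
    return {"title": article["title"], "description": article["body"]}


def publish_article(server, blog_id, username, password, article, publish=True):
    """Send one article to the weblog and return the id the server reports."""
    return server.metaWeblog.newPost(
        blog_id, username, password, build_post(article), publish)


# Usage, with placeholder endpoint and credentials (not executed here):
# server = connect("http://example.org/mt/mt-xmlrpc.cgi")
# for article in articles:
#     publish_article(server, "1", "username", "password", article)
```

Keeping the struct building separate from the network call means the mapping can be checked against a stand-in server object before any real weblog is touched.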
The choice of software platform is important, but the most critical work is arguably the information architecture design for the site. Three key factors led this work: the desire for clean, human-understandable URLs; a firm commitment to site longevity, and thus to avoiding implementation-dependent details such as exposing the file type through an html extension, as recommended in the W3C Architecture reference [W3C]; and finally the need for the URI scheme to support multilingual cross-linking. These three requirements do not pull in different directions, as the focus is on clean, simple URLs.
The chosen format for the site URIs is based on the core structure of the constitution. The constitution is arranged into 450 articles, each of which is placed in sections and subsections, a common format for long legal documents. To allow for multilingual cross-linking, it was important to ensure that no words which would need translation were present in the URLs, so a format of section_subsection, taking the roman numerals used in the constitution, was chosen. This gives URLs of the format i_ii or ii_iv_3_4 etc. Finally, to allow the articles themselves to be represented, we published them at URLs such as http://talkeuro.com/uk/i_v_iii/article_i44_enhanced_c so that they can be recombined at a later point, as the only explicit classification is on a sectional basis.
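The URL scheme can be expressed as a small sketch. The function names are hypothetical, the layout of a section URL (language code followed by the section path) is inferred from the article example above, and the language code "fr" is an assumption. The trailing title fragment is passed in unchanged rather than derived, since the rule for shortening titles is not described here.

```python
BASE = "http://talkeuro.com"


def section_path(parts):
    """Join section and subsection markers, e.g. ["II", "IV", 3, 4] -> ii_iv_3_4."""
    return "_".join(str(part).lower() for part in parts)


def section_url(lang, parts):
    """Assumed layout: base / language code / section path."""
    return f"{BASE}/{lang}/{section_path(parts)}"


def article_url(lang, parts, part, number, title_slug):
    """Build an article URL; title_slug is supplied by the caller as-is."""
    return f"{section_url(lang, parts)}/article_{part.lower()}{number}_{title_slug}"


print(article_url("uk", ["I", "V", "III"], "I", 44, "enhanced_c"))
print(section_url("fr", ["I", "V", "III"]))
```

Because the section path contains only numerals, moving between language versions of a section only requires changing the language segment.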
The European Constitution will, if accepted by the people of Europe, be a long-lived document. The treaties it is replacing date back over fifty years. The comments that people are making now, during the process of ratification or whilst thinking about referendums, will therefore have a historic context for future decades. This highlights an issue around the long-term management of the site content, in terms of ensuring migration from version to version of hosting software and operating systems. However, one of the most pertinent issues is allowing the annotation to be read in a meaningful way. Five years from now it should be possible to read comments made prior to November 2006, when all ratification should be complete, separately from comments made after this point. Later, other treaties or agreements may influence thinking around the constitution. Finally, there is the issue of amendments to the text. Handling these is possible, but the annotation will have been made on the previous text, not the new text. Means to represent both the earlier text and the new are being developed.
Multilingual support and multiuser support are two areas that are hard to achieve using current tools. Software publishers and developers do not seem to expect multiple language versions to be developed on the same site. Their expectation seems to be that a single localised version will be used, rather than multi-language support out of the box. This is a reasonable expectation, but for talkeuro, and for future European projects, full multilingual support is required. Related to this is the lack of template management for multiuser editing; an internationalised template management system with version control would be ideal. Hopefully sites like talkeuro will encourage support for this kind of system. We are working with tool providers to highlight the need for this functionality.
Having described the processes used to build and support talkeuro, this paper will now discuss the people who use the site and how they use it. Ensuring that the site makes the annotation coming from the public visible is an important task. Weblog software is highly configurable, but very focused on the now, whereas talkeuro relies on the public to generate the contemporary content. Making this new content obvious to the users is therefore critical.
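Reading comments on either side of a milestone can be sketched as a simple partition. The text gives only the month of November 2006, so the exact cutoff day below is an assumption, as is the shape of the comment records.

```python
from datetime import date

# Assumed cutoff day; the paper says only "November 2006".
RATIFICATION_CUTOFF = date(2006, 11, 1)


def split_by_cutoff(comments, cutoff=RATIFICATION_CUTOFF):
    """Split comment records (dicts with a 'posted' date) around the cutoff."""
    before = [c for c in comments if c["posted"] < cutoff]
    after = [c for c in comments if c["posted"] >= cutoff]
    return before, after


comments = [
    {"article": "i-44", "posted": date(2005, 5, 2)},
    {"article": "i-44", "posted": date(2007, 1, 15)},
]
before, after = split_by_cutoff(comments)
print(len(before), len(after))
```

The same partition could be applied around the date of an amendment, so that annotation made against an earlier text is shown with that text.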
We feel that annotation of public facing documents using the Social Documents framework described is a significant new model for civic engagement and the wider world of social software. The next section of this paper discusses ways to encourage, explore and represent the user behaviour on sites such as talkeuro.
When people comment on a site they are engaging in a conversation with a range of other people. Comments on weblogs are a great way to interact with the author or the topic being discussed. The act of commenting is similar to writing marginal notes in a book, but there are several key differences.
When you comment in a book, you own the comment and the book. People tend not to write in library books. The annotation is usually private and more meaningful to the person who wrote it. The meaning of the comment is based on the reader's knowledge and the place in the book. When you comment on the web the comment becomes a public statement. The privacy of the annotation in a book is lost, but also the comment needs to be understood by everyone, not just the author of the comment. This makes it harder, as you are not just writing for yourself. Some recent research points at the dramatic fall off when people are asked to make private annotation in public, see Churchill [Churchill] and Marshall [Marshall97,Marshall04].
However the web is full of vibrant forums with many comments, not all of which are spam! They are a form of conversation; for a weblog this is both with the author of the article and with the readers of the article.
So, is the annotation metaphor in some ways unhelpful? Social software, ranging from message boards to weblogs, has moved the annotation of written material into a conversation. For a weblog, each post becomes a forum or mini message board; certainly this is true of the more popular weblogs.
Yet, does this move from annotation to conversation lose something of value in the transfer? Message boards and weblogs are very focused on the current moment or the last five comments or the last two weeks. Everything moves forward at a hectic pace. The advent of weblog spam is only encouraging this focus on the now, as current spam-blocking tools encourage the shutting down of conversation after two weeks in many cases. Some people are removing comments altogether.
Whether or not you follow this trend depends on how much you value the information that you as author or publisher have already published. Closing down comments is a shame, as it denies the capability for long term engagement and focuses your readers even more on the now. Search engines extend the reach of your articles, but offering your readers a closed conversation is not very welcoming.
Stewart Brand and his Long Now Foundation [LongNow] and Chris Anderson's concept of the Long Tail [LongTail] both offer strong recommendations for the utility of a long-term world view. The Long Now Foundation essentially encourages a view of the world further ahead than the next quarter, whilst the Long Tail concept shows the strength present in a good archive of content. The less frequently accessed content can be a rich source of income or audience, given the quantity of it compared to that released or published today. Electronic publishers of all media have a good stake in this area, but we lack the tools to see what is happening. Current analysis tools are in many cases little better than throwing things into heaps.
What motivates people to comment on a document or have a conversation, and how do we currently support these needs on our web sites? There are essentially three common types of dialogue.
Firstly there is the weblog comment or annotation. This is focused on a particular, often narrow, topic and is in response to the published article or a reply to a previous comment.
Secondly there is the general message board. The user can define their own topic, but within the confines of the overall topics set up by the site owner.
Lastly there are the collaborative writing tools like wikis. In this case there is much more blurring between the author and publisher, but the topic and structure usually come from the publisher, or are influenced by them. So the three can be described as annotation, discussion and writing, and they need to be supported separately, as they provide for different levels of engagement with a subject.
A static document published in this way offers a new model for interaction: the only new content for the site will come from the readers and their comments. Thus the focus changes from a curiosity with the new to an ongoing engagement with the whole document. There will be no new content published by the site owner, as would be the norm. The European Constitution is a final document and the interaction is with each article of the document.
Within the talkeuro project we support these three tasks with three tools: the weblog for annotation, the wiki for writing and the message boards for discussion. Of these, annotation is the most important.
The wiki is there to catch those people who are deeply involved with the subject and want to write more than a simple annotation. The message boards are for new users to get a more general feel for the area. They give a gentle introduction to the site, but allow users to cite sections of the document using the permalinks. There is also a sister project, consultationprocess, looking at a faster-moving engagement around the Government consultation.
In traditional web sites and weblogs new content mainly comes from the author or publisher. On message boards it mainly comes from the public, but it is heavily focused on an ongoing conversation. When you put a static document online and allow annotation on every page then activity is across the document, as there is no driver to the newest content. The static document as weblog concept with annotation thus changes the landscape for web sites. The lack of new content from the author means the discussion is between the users of the site, purely mediated by their response to the document.
It may be helpful to delve into the area of urban planning for a moment. Kevin Lynch [Lynch] describes the levels of a city and how we learn to navigate within it. He describes landmarks, routes and neighbourhoods, all familiar terms from web information architecture. For these static document annotation sites, the commenting behaviour is like the activity of a city. It can best be thought of as a marketplace, with the different articles representing stalls and the commenting the conversation. Over the space of months and years different conversations will happen in different areas of the site. Given that talkeuro is planned to be live for 10-15 years, there is plenty of time for users to become familiar with the topics and lay down layers of conversation. Being able to see where today's conversation is happening is easy; seeing the pattern over weeks and months is a harder task.
Recognising and visualising this slower ebb and flow is not easy with current tools, which focus too much on recency and popularity. How do you show that one particular area was busy 2-3 months ago, or that some section gets consistent traffic but not enough to make it into the top five for comments or visits on a regular basis?
This visualisation is useful both for the user of the site and for the site owner, as it allows each to see the possible conversations occurring across the site. There are many data points which can be used to exhibit this behaviour, and the analysis allows interpretation of the behaviour. It is a matter of finding the right axes to spread them across and using other attributes to stretch the data out, so that people can see the individual articles and comments. It is important that these displays are placed at the right locations on the site, so as to engage the attention of the readers at the right time. Then it is a case of using colour, size and counts to display this information in a way that gives meaning without adding clutter. Embellishing the navigational text is a useful mechanism for giving additional meaning: brighter colours or larger text emphasise where things are happening. Techniques such as Edward Tufte's sparklines [Tufte] will also work well. Other attributes that are available range from web server logs, to the location of the visitors, to their email addresses, to popularity or recency measures of the comments or edits.
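One way to look past recency is to count comments per article per month and then ask for articles that are active in every month of a period, whether or not they ever reach a top five. The sketch below assumes comments arrive as (article, date) pairs; the function names and sample data are illustrative only.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_activity(comments):
    """Return {article: Counter({(year, month): comment count})}."""
    activity = defaultdict(Counter)
    for article, posted in comments:
        activity[article][(posted.year, posted.month)] += 1
    return activity


def steady_articles(activity, months, minimum=1):
    """Articles with at least `minimum` comments in every listed month."""
    return sorted(
        article for article, counts in activity.items()
        if all(counts[month] >= minimum for month in months))


# Illustrative data only.
comments = [
    ("iii-148", date(2005, 1, 5)),
    ("iii-148", date(2005, 2, 9)),
    ("i-44", date(2005, 1, 20)),
    ("i-44", date(2005, 1, 21)),
]
activity = monthly_activity(comments)
print(steady_articles(activity, [(2005, 1), (2005, 2)]))
```

Here i-44 has the busiest single month, yet only iii-148 shows activity across both months, which is the kind of pattern that a display based on recency or popularity would hide.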
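Two of the display ideas above can be approximated in a few lines. The first function renders a series of counts as a row of Unicode block characters, which is a rough text stand-in for a sparkline rather than Tufte's graphic form. The second scales navigational text size with activity. The function names and the percentage range are assumptions made for this sketch.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(counts):
    """Render counts as block characters scaled between their min and max."""
    if not counts:
        return ""
    low, high = min(counts), max(counts)
    if high == low:
        return BARS[0] * len(counts)
    span = high - low
    return "".join(BARS[(c - low) * (len(BARS) - 1) // span] for c in counts)


def emphasis_percent(count, busiest, smallest=100, largest=200):
    """Scale a font size (in percent) linearly with an article's comment count."""
    if busiest <= 0:
        return smallest
    return smallest + (largest - smallest) * count // busiest


print(sparkline([0, 1, 2, 3, 4, 5, 6, 7]))
print(emphasis_percent(5, 10))
```

Placed next to a section link, a row like this shows at a glance whether an area was busy months ago or has been steadily visited, without adding a separate chart to the page.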
Tagging gives a second way into the content allowing people to see what is being read or written about, then seeing all topics matching this area. Examples such as the Folksonomic Zeitgeist on the Observer Blog [Observer] or the tag display on Flickr [Flickr] show how it is possible to use a level of indirection, via the tags, to lead your users to new content. Using Technorati tagging gives the content of the site a new external level of visibility.
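The indirection that tags provide can be sketched as an inverted index from tag to articles. The tags and article identifiers below are invented for illustration, and the sketch does not use any Technorati or Flickr interface.

```python
from collections import defaultdict


def build_tag_index(tagged):
    """Invert {article: [tags]} into {tag: set of articles}."""
    index = defaultdict(set)
    for article, tags in tagged.items():
        for tag in tags:
            index[tag.lower()].add(article)
    return index


def related_articles(article, tagged, index):
    """Other articles sharing at least one tag with the given article."""
    related = set()
    for tag in tagged.get(article, ()):
        related |= index[tag.lower()]
    related.discard(article)
    return sorted(related)


# Invented tags, for illustration only.
tagged = {
    "i-44": ["cooperation", "institutions"],
    "iii-148": ["services"],
    "i-19": ["Institutions"],
}
index = build_tag_index(tagged)
print(related_articles("i-44", tagged, index))
```

A reader who arrives at one article through a tag can then be offered the other articles filed under the same tag, which is the second route into the content described above.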
The aim of all of this is to direct people to engage with the document and allow them to understand the layers of conversation which have occurred. For talkeuro this will be over a long time period, 5-10 years or more. All of this is enabled by the URL design chosen for talkeuro. This is critical for the weblog and wiki elements of the site, less so for the message boards, as they are more of a contemporary nature. To determine the best URI structures the principles of REST [REST] were used, as well as a close analysis of the text being represented.
Psychology and the social sciences have a lot to teach us in this area, in terms of understanding individual and social user behaviour. The hypertext and web literature also has much to offer in terms of analysis of user behaviour; see the ACM Digital Library [ACM], which holds a twenty-year archive of research in its journals. IBM have published tutorials [IBM] on how to do better data analysis using more powerful statistical techniques.
The talkeuro project [betageek] will continue to explore this novel area of social software. We intend to involve users in iterative design post launch and appeal to the development community to collaborate with us in exploring the future for engagement with public documents.
So what are people doing with the site? Currently the search behaviour is the most interesting. The most frequent search referral is for terms such as iii-148; people are searching for individual articles of the constitution. Due to the uniqueness of these references, the talkeuro.com web site is highly ranked for these terms.
Annotation is slowly building, but the announcement of the UK election has put a bit of a dampener on the UK activity. However, French interest is growing with the impending French referendum, in late June. It is expected that after the UK election, interest in the European Constitution will pick up too. In fact the French version of the site is currently much busier deeper in the site; people are reading more articles on the French version. To return to the marketplace metaphor, people are visiting the stalls, rather than noticing there is a market.
The talkeuro project represents a new type of social software and will hopefully provide a useful basis for other projects of a civic nature. The recently launched UK election site TheyWantToBeElected.com/manifestos/ [TheyWantToBeElected] was based on the work in talkeuro. Making sense of what people think and have said about important documents is a key task if we are to be able to help people engage with the decisions that affect their lives.
*We are Gavin Bell, Etienne Pollard, Lucy Serpell, Charles Collicut, Mark Simpkins and James Stewart, with lots of help from Ben Hammersley, Rod Mclaren, Stefan Magdalinski, Tom Loosemore, Tom Steinberg, Dave Green, Anno Mitchell and Richard Sandford, plus the development communities of MediaWiki, MovableType and TextDrive.
Gavin Bell
Director, talkeuro http://www.talkeuro.com
Gavin Bell is one of a group of volunteers behind talkeuro.com, development notes for which are on betageek.co.uk. Gavin designs infrastructure web products at the BBC, having previously worked in academia, publishing, and advertising. In his spare time, he is interested in reactivating political debate as well as the future of social software. He writes widely, though intermittently, on takeoneonion.org.