XTech 2005: XML, the Web and beyond.
Proponents of the topic map standard have said that the topic map paradigm allows for "seamless knowledge integration" or "global knowledge federation". However, in many, if not most, topic map applications that have been publicly presented, the information contained within the topic map is largely a snapshot in time of a specific domain of knowledge and does not change very much. While the topic map paradigm does allow for the merging of topic maps, this is simply the merging of snapshots and rarely the dynamic addition of new knowledge where views of a topic may change significantly.
The nature of knowledge is that it is always changing. A snapshot is current only in the instant in which it was created. The snapshots become outdated very quickly. A paradigm that claims to integrate or federate knowledge must be able to be just as flexible and dynamic.
This is not to say that the snapshots of information organized within topic maps are a bad thing. I also do not claim that the topic map paradigm cannot be as dynamic as the world in which we all live. I suggest, however, that a paradigm shift to a more dynamic form of information gathering and organization is called for and that the topic map paradigm might still be useful in this more dynamic environment.
Many of today's information suppliers provide interfaces to their collections using web services and APIs. Examples of such suppliers include Amazon and LexisNexis. For example, the LexisNexis Web Services Kit allows users to select a group of sources, submit a query and retrieve documents in XHTML or a semantic markup format such as NITF. Custom applications can be written using the API to reuse the information within the documents. This customization can allow the user to move from simple browsing to knowledge discovery by harvesting the specific bits of information that are important and repurposing them for their own specific application.
The example mentioned above could be described as simply a series of topic map merges that occur whenever a new document set is retrieved. This description would essentially be accurate, but the differentiator is that the application manages the merges, not the user as is frequently shown in product demonstrations. The ability to re-run saved searches allows new knowledge to be added into the knowledge base dynamically. This new knowledge can even be identified and presented to the user in a different manner, if needed.
This paper will present a methodology where data from multiple large information suppliers is searched, retrieved and manipulated to construct a knowledge base dynamically. It will also demonstrate how a topic map based system can be used to organize the knowledge base and merge it with other snapshot based topic maps. An example system will be demonstrated to show how all the pieces fit together into an integrated system.
A web service is a collection of protocols and standards used for exchanging data between applications. Software applications written in various programming languages and running on various platforms can use web services to exchange data over computer networks such as the Internet, in a manner similar to inter-process communication on a single computer. This interoperability (e.g. between Java and Python, or Windows and GNU/Linux applications) is due to the use of open standards such as XML.
A web service supplies a specific set of operations that other applications can access and use. This allows an application on one system to send requests to and receive responses from applications on other computers. Most web services use HTTP to exchange information, although other protocols could be used. Web services can range in complexity from trivial, such as adding two numbers, to extremely complex.
Web services possess certain characteristics that distinguish them from other computing models:
Although web services have a great deal of potential there are still many problems regarding development and deployment. One major concern is the newness of the data standards and the fear that further development is needed before they can be considered complete. Many developers fear that web services will be too slow for use in high-performance situations and add strain on network resources. Security is also a major concern to those considering offering web services.
Part of what differentiates web services from other computing models is the use of XML and XML-based standards. The most common standards used within web services are SOAP, WSDL and UDDI. SOAP provides a communication mechanism between services and applications. WSDL provides a method for describing services to other programs. UDDI enables the creation of searchable web services registries.
The purpose of SOAP is to enable data transfer between two systems over a network. SOAP messages are the most common means by which the systems exchange data. A SOAP message sent to a web service requests that the service execute a particular task. The service uses the information contained within the message to perform its function and returns the result via another SOAP message.
A SOAP message consists of three parts: an envelope, a header and a body. The envelope wraps the entire message and contains the header and the body. The header provides information about such things as security or routing, if it is present at all. The body contains the application-specific data that is being communicated. Basic SOAP messages do not involve extensive amounts of code and there is little in the way of special software needed to send or receive SOAP messages.
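The three-part structure described above can be sketched in a few lines of Java using only the XML classes that ship with the JDK. This is a minimal illustration, not the message format of any particular service: the SOAP 1.1 envelope namespace is the standard one, but the application namespace and the SecurityToken, Search and Query element names are hypothetical.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SoapEnvelopeSketch {
    // Namespace of the SOAP 1.1 envelope.
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";
    // Hypothetical application namespace, used only for illustration.
    static final String APP_NS = "http://example.org/search";

    /** Builds an envelope whose header carries a token and whose body carries a query. */
    static String buildEnvelope(String securityToken, String query) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            // The envelope wraps the entire message.
            Element envelope = doc.createElementNS(SOAP_NS, "soap:Envelope");
            doc.appendChild(envelope);

            // The optional header holds security or routing information.
            Element header = doc.createElementNS(SOAP_NS, "soap:Header");
            Element token = doc.createElementNS(APP_NS, "app:SecurityToken");
            token.setTextContent(securityToken);
            header.appendChild(token);
            envelope.appendChild(header);

            // The body holds the application-specific data.
            Element body = doc.createElementNS(SOAP_NS, "soap:Body");
            Element search = doc.createElementNS(APP_NS, "app:Search");
            Element queryElement = doc.createElementNS(APP_NS, "app:Query");
            queryElement.setTextContent(query);
            search.appendChild(queryElement);
            body.appendChild(search);
            envelope.appendChild(body);

            // Serialize the DOM tree to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build SOAP envelope", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildEnvelope("example-token", "beer and moose"));
    }
}
```

In practice a SOAP toolkit would generate such messages from the service's WSDL document; the hand-built version only makes the envelope, header and body visible.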
Nearly every web service published on the Internet is accompanied by an associated WSDL document. The WSDL document lists the service's capabilities, states its location and provides instructions regarding its use. It defines the kinds of messages the service can send and receive and specifies the data that a calling application must provide in order for the service to perform its task. The WSDL document also provides specific information that informs applications about how to connect to and communicate with the web service.
UDDI enables developers to publish and locate web services on a network. It defines the electronic capabilities and business processes supported by the web service. This information is stored in registries. A UDDI registry is similar to a phone book for the web services. It contains information about companies and their contact information, classification information about the companies and details on their electronic capabilities and data related to supported web services and business practices.
The LexisNexis Web Services Kit (WSK) offers developers a flexible solution to seamlessly integrate the full range of LexisNexis content (over 32,000 sources) into corporate workflow and business applications. This solution is an XML based application programming interface (API) that follows industry accepted standards to provide complete integration and presentation control.
With WSK, it is possible to control those portions of the online research process that interact directly with users. For example, the developer can design the mechanism for users to request information. The request is packaged in a SOAP message to the WSK API. Once the request is processed, a response message containing the XML documents or other information requested is sent. The documents can then be styled for presentation to users or stored in a local database for later processing.
The following sections will step through the process of retrieving information using the WSK.
Before a search can be run, a check is made to determine that the person or application issuing a request is registered with LexisNexis and is authorized to request that type of activity. Since the Web is a stateless environment and every request for service is processed independently, this requirement could prove to be cumbersome. To make this process easier, research applications can submit the user's credentials at the beginning of the user's session. Upon validation, the WSK will issue a security token in the response message. This token must then be included in each subsequent request for service as a proof of authorization.
For security purposes, this security token expires after 24 hours. After that time, an application will need to re-authenticate in order to receive a new security token. Additionally, all authentication messages must be exchanged using the SSL secured communications method (https).
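One way an application might handle the token lifetime is to cache the token and re-authenticate only when it has expired. The sketch below assumes the 24-hour lifetime stated above; the class name and the authenticator callback are hypothetical and stand in for whatever HTTPS authentication call the application actually makes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

class TokenCacheSketch {
    // Lifetime of a security token as described in the text.
    static final Duration TOKEN_LIFETIME = Duration.ofHours(24);

    // Hypothetical callback that performs the authentication exchange
    // over HTTPS and returns a fresh security token.
    private final Supplier<String> authenticator;
    private String token;
    private Instant issuedAt;

    TokenCacheSketch(Supplier<String> authenticator) {
        this.authenticator = authenticator;
    }

    /** Returns a token valid at the given time, re-authenticating if needed. */
    synchronized String tokenFor(Instant now) {
        boolean expired = token == null || !now.isBefore(issuedAt.plus(TOKEN_LIFETIME));
        if (expired) {
            token = authenticator.get();
            issuedAt = now;
        }
        return token;
    }

    public static void main(String[] args) {
        TokenCacheSketch cache = new TokenCacheSketch(() -> "demo-token");
        System.out.println(cache.tokenFor(Instant.now()));
    }
}
```

A production application would probably refresh the token somewhat before the 24 hours are up, so that a request in flight is not rejected at the boundary.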
When a user accesses one of the LexisNexis flagship products (such as lexis.com or nexis.com), the service knows the user's entitlements (authorized access to specific services, content, and features) and adjusts the user interface accordingly. It presents only the sources, features, and services authorized for use by that specific user. The user cannot build a search that uses an unauthorized source.
However, with the WSK, users are using the research applications to access the LexisNexis services. Therefore, a method is needed to determine which sources they are entitled to use for their searches. Sources can be located through an operation that notifies the research application if a specific source, based on a string of characters within its name, is available for a given client. Details about the source can also be retrieved.
The search operation is used to locate documents that satisfy the parameters specified in the search request. The search specifies the words or phrases that should or should not appear within the documents. In addition, other parameters allow the search to be further restricted by specifying the source(s) to search, the terms to look for within the candidate documents, and a date range or other restrictions that must be observed. The search operation also allows the application to specify how the search result set is delivered. Delivery options include display format (cite list or full document text), markup method, sort order (date, relevance or source specific), and the range of documents desired from within the set. There are two markup method options: display and semantic. When the display markup option is selected, the returned documents are marked up using XHTML. When the semantic markup option is selected, the returned documents are marked up in a more semantically rich markup scheme. News documents are returned in NITF. Other types of documents are returned in LexisNexis-specific semantic markup. Other public interchange markup standards may be adopted in the future based on market conditions.
The response message contains the search results in the format requested. Typically applications initially request that the document be delivered in a list format because it contains the pertinent information about each document found. A brief excerpt from the document makes it easier for the user to decide whether or not the document in question is relevant. Each item also includes a unique document identifier that can be used to retrieve the requested document in whatever display format or markup scheme is desired. This unique identifier is valid for a limited amount of time.
Once a search set has been returned, it is also possible to narrow the results set to more relevant documents by applying additional search restrictions to it.
Results can be retrieved by specifying a range of documents. This returns a subset of the entire search result set. Each time a request is made for a new range of documents, it is possible to specify a different format or type of markup. The response message for this type of retrieval contains the collection of documents in the format and markup requested. It also includes a document identifier for each document delivered that can be stored and used to retrieve specific documents individually.
Applications can also retrieve documents based on their document identifiers. In the case mentioned above where a list of documents is retrieved, the user can specify that the full text of a particular document be retrieved. Some documents also have attachments associated with them. The WSK provides a mechanism for retrieving the attached documents for display or storage.
Users often develop a search to retrieve specific information about a topic and then want to re-run the search at a later time. The WSK provides methods for saving searches and managing the saved searches. Any number of searches can be saved, each with a user-specified name for later use. The names can then be used to recall and rerun the searches from time to time to obtain the most current information available about the topic.
Some documents may hold a particular interest to members of an organization and will be accessed frequently by them. These documents could be a particular news or magazine article, documents and reports related to a specific work project, or documents about the organization itself. Therefore, some organizations have requested permission to store some documents on and serve them from their local server.
LexisNexis is committed to calculating accurate document accesses and paying information vendors the appropriate royalties due them. The issue of how to determine document accesses when served from a source outside the LexisNexis system needed to be addressed.
To resolve this issue, a mechanism for reporting external document accesses was developed. This mechanism is used in conjunction with an agreement between LexisNexis and the requesting organization as to how this process will be implemented. Since most sources are covered by an organization's subscription agreement, these access counts should not impact the invoice received by the organization. However, some sources may be considered premium or be outside the boundaries of the subscription. In these cases, those counts would be used to calculate additional charges just as though those documents were accessed directly from the LexisNexis Research Services.
All full documents already have royalty information embedded within them. This information consists of some proprietary information used internally by LexisNexis as well as an encrypted document token. Applications must store that document token and provide a mechanism for maintaining a count of each time that particular document is accessed from the local server. Then, at intervals specified in the agreement with LexisNexis, the application must issue a message reporting a list of those documents accessed from the local server along with the current access count for each one. The response message will indicate that the document access count reports were received and the count for those documents that were reported successfully should be reset.
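The bookkeeping this requires on the local server can be sketched as follows. The class and method names are hypothetical and the actual reporting message is omitted. One design choice is an assumption rather than something stated above: instead of setting a confirmed count to zero, the sketch subtracts the reported amount, so that accesses recorded between taking the snapshot and receiving the response are not lost.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class AccessCountSketch {
    // Access counts keyed by the encrypted document token found in each document.
    private final Map<String, Integer> counts = new LinkedHashMap<>();

    /** Called each time a stored document is served from the local server. */
    synchronized void recordAccess(String documentToken) {
        counts.merge(documentToken, 1, Integer::sum);
    }

    /** Copy of the current counts, to be sent in the periodic report. */
    synchronized Map<String, Integer> snapshotForReport() {
        return new LinkedHashMap<>(counts);
    }

    /**
     * Resets the counts for documents the response confirmed as received.
     * Only the reported amount is subtracted; unconfirmed documents keep
     * their counts and are reported again next time.
     */
    synchronized void acknowledge(Map<String, Integer> reported, Set<String> confirmedTokens) {
        for (String documentToken : confirmedTokens) {
            Integer sent = reported.get(documentToken);
            if (sent == null) {
                continue;
            }
            // Returning null removes the entry once nothing is left to report.
            counts.computeIfPresent(documentToken,
                    (key, current) -> current > sent ? current - sent : null);
        }
    }

    synchronized int countOf(String documentToken) {
        return counts.getOrDefault(documentToken, 0);
    }
}
```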
The Amazon E-Commerce Service (ECS) is an API that allows you to access Amazon data and functionality through a Web site or Web-enabled application. The ECS follows the standard Web services model: users of the service request data through XML over HTTP using REST or SOAP and data is returned by the service as an XML-formatted stream of text.
The Amazon Web sites enable product sales from Amazon plus many other vendors. The products available through the Amazon E-Commerce Service (ECS) draw from a huge inventory and include a majority of the products available on an Amazon Web site. ECS is available for all of the Amazon sites:
Through ECS, the following types of data can be accessed:
ECS is a read-only system. Product data cannot be sent back to Amazon via ECS. The only data that is transmitted back to Amazon is information about the customer shopping carts managed by the application and their contents. So an application cannot, for example, use ECS to allow customers to create wish lists or reviews and submit them to the Amazon Web site. Additionally, if application developers are selling products through amazon.com, they cannot use ECS to define new products or manage inventory.
ECS provides two types of inquiries: search and lookup. A search is a request that returns information matching specified criteria. Searches can return no data (if nothing matches the criteria specified) or multiple objects that match the search criteria. An example of a search might be a request to retrieve all books about constitutional law. A lookup is a request for a specific object or set of objects, specified by a unique identifier(s). An example of a lookup might be to retrieve information about a book by its Amazon Standard Identification Number (ASIN).
The search operation in ECS uses keywords or other criteria to search for products. This operation combines several of the searches that might be familiar from use of the amazon.com website, including keyword search, power search, author search, artist search, actor search, director search, manufacturer search, and text stream search. Setting up a search operation consists of three steps: choosing the Amazon store to search, specifying search parameters, and requesting the desired output.
The text stream search retrieves products based on a block of text specified in the request. The text block could be a search term, a paragraph from a blog, an article excerpt, or any other text for which product matches are to be retrieved. When Amazon receives the request, it parses out recognized keywords and returns an equal number of products (ten total) for each recognized keyword. For example, if a request is sent with five recognized keywords, Amazon will return two products matching each recognized keyword. This functionality is available only on the US store.
The power search is used to perform book searches on Amazon using a complex query string. Complex query strings take the form key:value, where keys include ASIN, author, author-exact, author-begins, keywords, keywords-begin, language, publisher, subject, subject-words-begin, subject-begins, title, title-words-begin, and title-begins. For example, the query "author:ambrose" returns a list of books that include "Ambrose" in the author name. A query of "subject:history and (spain or mexico) and not military and language:spanish" would return a list of books in the Spanish language on the subject of either Spanish or Mexican history, excluding all items with military in their subject.
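The two example queries can be assembled programmatically with a few small helpers. This is a sketch of the string handling only: the helper names are hypothetical, and how the encoded string is attached to an actual ECS request is not shown here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class PowerSearchSketch {
    /** A single key:value term, for example author:ambrose. */
    static String term(String key, String value) {
        return key + ":" + value;
    }

    static String and(String left, String right) {
        return left + " and " + right;
    }

    static String andNot(String left, String right) {
        return left + " and not " + right;
    }

    /** A parenthesized list of alternatives, for example (spain or mexico). */
    static String anyOf(String... values) {
        return "(" + String.join(" or ", values) + ")";
    }

    /** Encodes the query so it can be placed in a URL query string. */
    static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String query = and(
                andNot(and(term("subject", "history"), anyOf("spain", "mexico")), "military"),
                term("language", "spanish"));
        System.out.println(query);
        System.out.println(encode(query));
    }
}
```

Encoding matters because the query contains spaces, colons and parentheses, none of which can be sent unescaped when the request is made through the REST interface.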
To facilitate customer purchases, ECS allows an application to create and manage a shopping cart of products that a customer wishes to purchase. The shopping cart is a temporary data structure that is stored at Amazon as long as it is in use. Amazon carts with items expire and are deleted if they remain inactive for more than 90 days. Empty carts are deleted after seven days. The shopping cart features allow a more complete shopping experience to be created for customers, as well as give the opportunity to earn commissions through the Amazon Associates program for sales referred to Amazon.
To build a shopping cart, customer selections (which consist of item and quantity) are submitted to ECS. ECS then creates an individual shopping cart with item descriptions and current price information filled in. A variety of ECS functions can be used for adding more products and managing existing products in the cart. When the customer has finished assembling their order, the shopping cart is transferred to Amazon (through the provided URL) for completion of the sale transaction.
The shopping cart operations of ECS create a remote cart that is separate from any shopping carts the customer may already have on the Amazon Web sites. This remote cart has its own identifiers and is considered an Amazon cart owned by the remote application (not the customer). ECS has no information that would allow it to associate the shopping cart with a particular customer or Amazon account until the cart has been submitted for purchase. The application must keep track internally of which cart belongs to which customer. The mechanism for storing this customer session-level information will vary across environments.
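A minimal in-memory version of that tracking might look like the sketch below. All names are hypothetical. The 90-day and seven-day limits come from the description of cart expiry earlier in this paper; treating the seven-day limit for empty carts as counting from the last activity is an assumption.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class RemoteCartRegistrySketch {
    static final Duration CART_WITH_ITEMS_LIFETIME = Duration.ofDays(90);
    static final Duration EMPTY_CART_LIFETIME = Duration.ofDays(7);

    private static final class CartRef {
        final String cartId;
        final int itemCount;
        final Instant lastActivity;

        CartRef(String cartId, int itemCount, Instant lastActivity) {
            this.cartId = cartId;
            this.itemCount = itemCount;
            this.lastActivity = lastActivity;
        }

        // True if the remote cart has probably been deleted at Amazon.
        boolean presumedExpired(Instant now) {
            Duration lifetime = itemCount > 0 ? CART_WITH_ITEMS_LIFETIME : EMPTY_CART_LIFETIME;
            return now.isAfter(lastActivity.plus(lifetime));
        }
    }

    // Maps the application's own customer or session key to the remote cart.
    private final Map<String, CartRef> cartsByCustomer = new HashMap<>();

    /** Call after every cart operation that succeeds for this customer. */
    void recordActivity(String customerKey, String cartId, int itemCount, Instant now) {
        cartsByCustomer.put(customerKey, new CartRef(cartId, itemCount, now));
    }

    /** Returns the customer's cart identifier unless the cart is presumed expired. */
    Optional<String> cartIdFor(String customerKey, Instant now) {
        CartRef ref = cartsByCustomer.get(customerKey);
        if (ref == null) {
            return Optional.empty();
        }
        if (ref.presumedExpired(now)) {
            cartsByCustomer.remove(customerKey);
            return Optional.empty();
        }
        return Optional.of(ref.cartId);
    }
}
```

A web application would more likely keep this mapping in its session store or database, but the relationship being recorded is the same.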
The LexisNexis WSK allows application developers to access a wide range of the full LexisNexis collection (within subscription limitations). While the semantic markup provides more detailed markup, the display markup can be more consistent across a wider range of document types. This allows an application to process a greater variety of the data with the same general rules. The example application described in this paper is based on caselaw data retrieved using the display markup scheme, but could easily be extended to include news or financial data.
The sample application is built using a set of open source tools, standards, and applications that have been integrated to demonstrate the concepts discussed in this paper. It is written in Java. The application includes code based on WSK that searches and retrieves documents. The documents are parsed in order to identify specific pieces that are then passed to a topic map engine. The full documents and the topic map are stored in a native XML database. While the application does store the full retrieved documents, this is not required for the application to work as described, but rather to allow easier analysis and debugging.
TMAPI (http://www.tmapi.org): TMAPI is a programming interface for accessing and manipulating data held in a topic map. The TMAPI specification defines a set of core interfaces which must be implemented by a compliant application as well as a set of additional interfaces which may be implemented by a compliant application or which may be built upon the core interfaces. TMAPI was developed in an open process by developers working on topic map processors and topic map applications and placed into the public domain.
XTM4XMLDB (http://sourceforge.net/projects/xtm4xmldb): XTM4XMLDB is an open source topic map engine implementing the TMAPI interface with a native XML database, such as eXist or Apache Xindice, as a backend. It is written in Java. XTM4XMLDB provides a topic map specific middle ground between an application and the database. This saves the developer the trouble of having to learn the specifics of either TMAPI, XML:DB, or the eXist API.
eXist (http://exist-db.org): eXist is an open source native XML database featuring efficient, index-based XQuery processing, automatic indexing, extensions for full-text search, XUpdate support and tight integration with existing XML development tools. The database also implements the November 2003 working draft of XQuery 1.0 with the exception of the XML schema related features. The database is completely written in Java and may be deployed in a number of ways, either running as a stand-alone server process, inside a servlet-engine or directly embedded into an application. Within the sample application eXist runs as a stand-alone server.
eXist provides schema-less storage of XML documents in hierarchical collections. Even though the storage is schema-less, documents can be validated as they are added to the database and document type information is maintained within the stored document. Using an extended XPath syntax, users may query a distinct part of the collection hierarchy or even all the documents contained in the database. Despite being lightweight, eXist’s query engine implements efficient, index-based query processing. An enhanced indexing scheme supports quick identification of structural relationships between nodes, such as parent-child, ancestor-descendant or previous/next-sibling. Based on path join algorithms, a wide range of path expression queries is processed only using index information. Access to the actual nodes, which are stored in the central XML document store, is not required for these types of expressions.
eXist is well-suited for applications dealing with small to large collections of XML documents which are occasionally updated. It provides a number of extensions to standard XPath to efficiently process full text queries, including keyword searches, queries on proximity of search terms or regular expressions. Several access methods are available including HTTP, XML-RPC, SOAP and WebDAV. Java applications, including the sample application in this paper, use the XML:DB API, a common interface for access to native or XML-enabled databases, which has been proposed by the vendor independent XML:DB initiative.
There are several items that appear consistently throughout US caselaw. These include court information (docket number, court name, etc.), case names, judge information, parties in the case, core terms within the case, headnote information that discusses the most salient points of law within the case, opinions, dissents, concurrences, and cited cases and statutes. Some of these pieces of information provide descriptive metadata that is readily represented within a topic map. Other pieces that are primarily large areas of flowing text do not map well to a topic map. In addition, certain ontologies can be defined, such as the US court structure and the taxonomy used to classify headnote topics. These items are also stored as topic maps that can be merged with the cases as needed. This allows the hierarchies to be maintained and updated without directly affecting the caselaw topic map.
Within the caselaw topic map, the following items are harvested from each case retrieved and loaded into the topic map: court name, case information (including long name, short name, docket number, cite IDs, posture, overview, outcome), judges hearing the case, core terms from within the case, headnote classification and referenced cases and statutes.
Each retrieved case can result in the addition of dozens of new topics that are instances of the types previously mentioned. However, the limited validity of the document identifier makes it an inappropriate choice for use in identifying the document in the topic map. Instead of directly representing the document, the topic map represents the case. The topics harvested from the document are then attached to the case itself. In order to have a more persistent identifier for the documents, the LexisNexis Identifier (LNI) is used. The LNI is stored in a <META> tag in the header of the XHTML document. In order to re-retrieve the document a new search must be submitted using the LNI.
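Pulling the LNI out of the document header could be done along the following lines with the JDK's DOM parser. The meta name "LNI" and the sample identifier are hypothetical; the actual attribute names and value format used in the delivered documents are not specified here.

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LniExtractorSketch {
    /**
     * Returns the content attribute of the first meta element whose name
     * attribute matches metaName, ignoring case in both the element name
     * and the attribute value.
     */
    static Optional<String> metaContent(String xhtml, String metaName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // "*" selects every element in the document, in document order.
            NodeList elements = doc.getElementsByTagName("*");
            for (int i = 0; i < elements.getLength(); i++) {
                Element element = (Element) elements.item(i);
                if ("meta".equalsIgnoreCase(element.getLocalName())
                        && metaName.equalsIgnoreCase(element.getAttribute("name"))) {
                    return Optional.of(element.getAttribute("content"));
                }
            }
            return Optional.empty();
        } catch (Exception e) {
            throw new IllegalArgumentException("Document is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical document; a real one may carry a DOCTYPE, in which
        // case the parser should be configured not to fetch the DTD.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head>"
                + "<meta name=\"LNI\" content=\"EXAMPLE-LNI-0001\"/>"
                + "</head><body/></html>";
        System.out.println(metaContent(xhtml, "LNI").orElse("(none)"));
    }
}
```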
A case is identified in many different ways. Within LexisNexis each file has a unique LNI. This value can be used to create a LexisNexis-specific Published Subject Identifier (PSI) for the case. However, such a value would most likely not be appropriate for publicly exchanged PSIs. Each case has a long (full) name. Most cases also have short names that are used when the case is referred to. The case is assigned a docket number when it is heard in court. Case reporters also assign citations to the case that are different based on each reporter. The names can be modelled within the topic map as names or variant names. The identifiers could be represented as variant names of the case, since a case can have multiple identifiers. They could also be used to build PSIs for the case. By building PSIs using these identifiers, the merging mechanisms built into the topic map model can be used to automatically connect new cases as they are retrieved and loaded into the topic map.
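The merging behaviour this relies on can be illustrated without a topic map engine. The sketch below is not TMAPI or XTM4XMLDB code; it is a simplified stand-in in which two topics are merged whenever they share a subject identifier. The PSI base URL, the normalization rule and the citations in the usage example are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class PsiMergeSketch {
    // Hypothetical base URL for case PSIs.
    static final String PSI_BASE = "http://example.org/psi/case/";

    /** Builds a PSI from a docket number or citation by normalizing it. */
    static String psiFor(String identifier) {
        String slug = identifier.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "-");
        return PSI_BASE + slug;
    }

    static final class Topic {
        final Set<String> subjectIdentifiers = new LinkedHashSet<>();
        final Set<String> names = new LinkedHashSet<>();
    }

    // Every PSI points at the single topic that currently owns it.
    private final Map<String, Topic> topicsByPsi = new HashMap<>();

    /** Adds a case; any existing topic sharing one of its PSIs is merged into it. */
    Topic add(String name, String... identifiers) {
        Topic merged = new Topic();
        merged.names.add(name);
        for (String identifier : identifiers) {
            merged.subjectIdentifiers.add(psiFor(identifier));
        }
        // Absorb the names and identifiers of every topic that shares a PSI.
        for (String psi : new ArrayList<>(merged.subjectIdentifiers)) {
            Topic existing = topicsByPsi.get(psi);
            if (existing != null) {
                merged.names.addAll(existing.names);
                merged.subjectIdentifiers.addAll(existing.subjectIdentifiers);
            }
        }
        // Re-point all identifiers, including absorbed ones, at the merged topic.
        for (String psi : merged.subjectIdentifiers) {
            topicsByPsi.put(psi, merged);
        }
        return merged;
    }

    int topicCount() {
        return new HashSet<>(topicsByPsi.values()).size();
    }

    public static void main(String[] args) {
        PsiMergeSketch map = new PsiMergeSketch();
        map.add("Example v. Sample", "100 Ex. 1");
        Topic merged = map.add("Example", "100 Ex. 1", "No. 00-0001");
        System.out.println(merged.names + " " + merged.subjectIdentifiers);
    }
}
```

In a real topic map engine the same effect falls out of the merging rules of the topic map model, so the application only has to assign the PSIs consistently.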
As a case progresses through the court system, it is assigned a different docket number by each court that hears it. It also receives a new citation from each of the case reporters. Special associations can be created that track a case through the courts. This allows the user to see what points of law become instrumental in an argument or what areas are overruled.
Cases and statutes referenced from within the case are also represented as individual topics and associated to the referencing case. The application can determine if and when to retrieve the referenced documents. In some cases, it might be appropriate to automatically retrieve cases to complete the topic map. However, doing so could result in a large number of retrievals, each with a cost. In the sample application, documents are retrieved on an on-demand basis. Since the referenced documents are not retrieved, there is no LNI available from which to construct a PSI. For this reason, the decision was made to use each unique case citation to build a PSI to allow merging. In this scheme it is likely that a case will have multiple PSIs.
The judges hearing the case can be represented as separate topics that are then associated to each case they hear. Several different types of associations and roles can be developed to describe how they ruled or who was the primary author of a portion of a decision. The judges can also be associated to a specific court and given a specific role, such as justice or chief justice.
Each core term discovered within the case is represented as a topic and associated to each case in which it occurs. Each headnote item can be used to connect the legal taxonomy to the case. Since the headnote information is stored as a full taxonomy, it becomes possible to group cases by areas of law, in either a more granular or less granular fashion. This is done by merging the legal taxonomy topic map with the caselaw topic map.
Consider a search within US Supreme Court cases using the search string "beer and moose". Four documents are returned. One considers a case where Amtrak is sued for refusing to allow an artist's advertisement in one of its stations. Another, from Oklahoma, considers a statute under which the legal drinking age differed between males and females. Another is a suit by the Democratic National Committee against the CBS television network for not televising certain editorial advertisements. The last case challenges the constitutionality of the state of California's rules regarding the sale of alcoholic beverages at adult entertainment establishments. On the face of it, these cases seem to have little in common. However, when the topics are harvested and added to the topic map, new information begins to jump out immediately. For example, by looking at the core terms from the cases, it becomes evident that 3 of the 4 cases are, in fact, First Amendment cases where freedom of speech and expression are central issues. This might not have become evident in the standard lexis.com views. The ability to look at the metadata from different points of view allows for the discovery of new information.
This discovery of new information becomes evident in just 4 cases in this example. As more and more cases are added to the topic map more commonalities continue to appear that can become vital in the preparation for a legal argument in court.
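The kind of commonality described above amounts to inverting the case-to-core-term relationship and looking for terms attached to several cases. A small sketch of that view follows; the class name is hypothetical, and the case labels and terms in the usage example are placeholders rather than data taken from the four cases.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CoreTermIndexSketch {
    // Inverted index: core term -> names of the cases in which it occurs.
    private final Map<String, Set<String>> casesByTerm = new TreeMap<>();

    void addCase(String caseName, String... coreTerms) {
        for (String term : coreTerms) {
            casesByTerm
                    .computeIfAbsent(term.toLowerCase(Locale.ROOT), key -> new TreeSet<>())
                    .add(caseName);
        }
    }

    /** Core terms that occur in at least minCases distinct cases. */
    Map<String, Set<String>> sharedBy(int minCases) {
        Map<String, Set<String>> shared = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : casesByTerm.entrySet()) {
            if (entry.getValue().size() >= minCases) {
                shared.put(entry.getKey(), entry.getValue());
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        CoreTermIndexSketch index = new CoreTermIndexSketch();
        // Placeholder cases and terms, for illustration only.
        index.addCase("Case A", "first amendment", "advertising");
        index.addCase("Case B", "equal protection", "beer");
        index.addCase("Case C", "first amendment", "broadcast");
        index.addCase("Case D", "first amendment", "beer");
        System.out.println(index.sharedBy(3));
    }
}
```

Within the topic map the same question is answered by following the associations from a core term topic to the cases that play a role in them.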
While any of this information can be searched and retrieved through the LexisNexis system, doing so requires looking for it explicitly. Many of history's most important discoveries were made when not looking for the thing discovered. The topic map views allow such discoveries to be made.
lexis.com allows searches to be run periodically using the ECLIPSE feature. The sample application can also be set up to run saved searches periodically. By adjusting the date range applied to the search, only new results will be retrieved. Any new answer sets can be then merged into the topic map. Depending on the presentation system being used, new topics or associations can be identified to the user as they are added to the topic map.
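The date-range adjustment can be sketched as a small piece of state kept with each saved search. The names are hypothetical and the call that actually runs the search is left out. Starting the new range on the day of the last completed run, rather than the day after, is an assumption: it may return a few documents twice, which the merge into the topic map absorbs, but it avoids missing documents added later on that day.

```java
import java.time.LocalDate;

class SavedSearchSketch {
    /** Date restriction for one run; from is null when there is no lower bound. */
    static final class DateRange {
        final LocalDate from;
        final LocalDate to;

        DateRange(LocalDate from, LocalDate to) {
            this.from = from;
            this.to = to;
        }
    }

    final String name;
    final String query;
    private LocalDate lastCompletedRun; // null until the first run succeeds

    SavedSearchSketch(String name, String query) {
        this.name = name;
        this.query = query;
    }

    /** Range to apply on the next run: everything since the last completed run. */
    DateRange nextRange(LocalDate today) {
        return new DateRange(lastCompletedRun, today);
    }

    /** Call only after the results have been merged into the topic map. */
    void markCompleted(LocalDate runDate) {
        lastCompletedRun = runDate;
    }

    public static void main(String[] args) {
        SavedSearchSketch search = new SavedSearchSketch("beer-and-moose", "beer and moose");
        LocalDate today = LocalDate.now();
        DateRange range = search.nextRange(today);
        System.out.println("from " + range.from + " to " + range.to);
        search.markCompleted(today);
    }
}
```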
As topics are added to the topic maps, searches can be run against ECS to determine if there are any items available for sale on Amazon that are related to the topic. If ECS returns an empty set, no link is created. For example, special searches can be run to determine if a judge in a case is the author of any books, as opposed to being the subject of any books. Another example includes checking to see if a headnote topic occurs within the title of a publication or as one of the keywords associated with the publication. In addition, the topic types assigned to specific topics can be used to add more intelligence to the ECS searches. For example, by knowing that a judge is a person, the application might be able to do a power search on the judge's name using an author query.
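Choosing a query from the topic type might be sketched as below. The topic types and the mapping are hypothetical, and the keys used (author, title, keywords) are those listed for the power search earlier in this paper. The sketch passes the topic name through unchanged apart from lower-casing; how multi-word values need to be quoted in a power search is not addressed.

```java
import java.util.Locale;

class EcsQueryChooserSketch {
    // Hypothetical topic types drawn from the caselaw topic map.
    enum TopicType { JUDGE, HEADNOTE_TOPIC, CORE_TERM }

    /** Picks a power search query string based on what kind of thing the topic is. */
    static String powerQueryFor(TopicType type, String topicName) {
        String value = topicName.trim().toLowerCase(Locale.ROOT);
        switch (type) {
            case JUDGE:
                // A judge is a person, so look for books written by the judge.
                return "author:" + value;
            case HEADNOTE_TOPIC:
                // Look for the topic in the title or among the keywords.
                return "title:" + value + " or keywords:" + value;
            default:
                return "keywords:" + value;
        }
    }

    public static void main(String[] args) {
        System.out.println(powerQueryFor(TopicType.JUDGE, "Ambrose"));
        System.out.println(powerQueryFor(TopicType.HEADNOTE_TOPIC, "Copyright"));
    }
}
```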
As the user manipulates the topic map, materials that are related to the topics covered in the case can be presented as items available for purchase. This model could be further extended as other web service APIs are added to the application.
It is important to note that as topics are added to the topic map and ECS is queried for the availability of materials that might be appropriate to the topic, the materials returned from Amazon are not placed into the topic map. Only links that perform the appropriate searches are added to the topic map. This is done because the Amazon inventory, as well as the suppliers and prices, changes constantly. It would be a daunting, if not impossible, task to keep the topic map synchronized with the Amazon databases. By including only links that run specific searches, the application is able to provide the most current information without having to store it.
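Storing a search link instead of the search results might look like the sketch below. The base URL is a placeholder on a reserved domain and the query parameter names are assumptions; neither reflects the real Amazon interface. `Topic`, `search_link` and `attach_search_link` are hypothetical names.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode

# Placeholder endpoint; not a real service address.
SEARCH_BASE = "https://shop.example.invalid/search"


@dataclass
class Topic:
    name: str
    occurrences: list = field(default_factory=list)  # links only, never result data


def search_link(field_name, value):
    """Build a link that runs the search when followed."""
    return SEARCH_BASE + "?" + urlencode({"field": field_name, "q": value})


def attach_search_link(topic, field_name):
    """Record the link on the topic once; results are fetched live by the link."""
    link = search_link(field_name, topic.name)
    if link not in topic.occurrences:
        topic.occurrences.append(link)
    return link
```

Since only the link is stored, nothing in the topic map goes stale when inventory or prices change; the cost is that every view of the materials requires a live request.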
This same concept could also be applied to the storage of the documents retrieved from LexisNexis. While the application does currently store these documents, doing so opens the possibility of having outdated documents within the database. If the sample application were taken to a production state, the conventional wisdom would be to manage only the metadata in the topic map and to reference the source documents so that the full text could be retrieved if it were needed at a later date.
API - Application Programming Interface
ASIN - Amazon Standard Identification Number
ECS - E-commerce Service
HTTP - Hypertext Transfer Protocol
LNI - LexisNexis Identifier
NITF - News Industry Text Format
PSI - Published Subject Identifier
REST - Representational State Transfer
SOAP - Simple Object Access Protocol
UDDI - Universal Description, Discovery and Integration
URL - Uniform Resource Locator
WSDL - Web Services Description Language
WSK - Web Services Kit
XHTML - Extensible Hypertext Markup Language
XTM - XML Topic Maps
This paper has presented a methodology where data from LexisNexis and Amazon is searched, retrieved and manipulated to construct a topic map dynamically using web service interfaces provided by the organizations. Through the LexisNexis Web Services Kit, users have the ability to select a group of sources, submit a query and retrieve documents. The sample application can then process the metadata from the documents within the answer set into a topic map. A topic map viewer can then be used to manipulate the topic map views. This allows users to move from simple browsing to knowledge discovery by examining commonalities within large answer sets.
The ability to re-run saved searches allows new knowledge to be added into the knowledge base dynamically. This new knowledge can even be identified and presented to the user in a different manner, if needed.
The addition of the Amazon API demonstrates how an e-commerce application can be developed on top of the topic map and how the topic map can be used to build focused subject matter based browsing of Amazon materials.
Many thanks to Tony Feldkamp for making me aware of the "beer and moose" search and for his initial work in setting up topic maps based on caselaw data. Thanks to Steve Petric, Doug Heitkamp, Bob Hodgeman, Kevin Remhof, and the entire Web Services Kit development team for their assistance and for putting together a great product.
Eric Freese
Consulting Software Engineer, LexisNexis http://www.lexisnexis.com
Eric Freese is a consulting software engineer with LexisNexis. He has 17 years of experience in the areas of document, information, and knowledge management with specific expertise in the development and implementation of XML technologies. His experience includes research, analysis, specification, design, development, testing, implementation, integration and management of information systems in a wide range of environments. He has significant research experience in human interface design, graphics interface development and artificial intelligence. Freese was a founding member of TopicMaps.Org, the organization that developed the XML Topic Maps (XTM) specification, and served as the chairman of this group. He is also the chief architect and developer of SemanText, an open source application that uses topic maps to harvest and manage knowledge.