Wednesday, August 06, 2008

As the vision unfolds, software still can't surf.
By Andrew Brenneman
http://www.bookbusinessmag.com/story/story.bsp?sid=113283&var=story

Tim Berners-Lee, director of the World Wide Web Consortium (W3C), outlined a strategy for the future of the Web in a series of papers and articles published between 1998 and 2001. He observed that while there was a wealth of information available for people to explore on the Web, computers had difficulty extracting information from it. The Web consists largely of free-form text, and computers have great difficulty understanding human language. While search engines can index the Web, a human being is required to interpret the search results. You may be able to surf the Web, but your computer can't. The value of the World Wide Web is significantly compromised, Berners-Lee argued, without the ability for systems to interpret its content.

He proposed a solution: the Semantic Web, which would provide a bridge between the language of humans and the language of computers. It consists of a set of standards for creating XML-based tags that describe information contained on the Web in a way that computers can understand. The Semantic Web would act as a global database that software applications could meaningfully explore. Your computer could surf.
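
To make the idea concrete, here is a minimal sketch of that kind of machine-readable tagging, written in Python with the rdflib library (not mentioned in the article); the URLs, the subject term and the use of the Dublin Core vocabulary are illustrative assumptions, not any publisher's actual markup.

    # A minimal sketch: describing a Web page about apples (the fruit) in RDF,
    # the family of standards behind the Semantic Web. All identifiers are made up.
    from rdflib import Graph, URIRef, Literal
    from rdflib.namespace import DCTERMS

    g = Graph()
    page = URIRef("http://example.com/articles/field-guide-to-apples")

    # Each triple states one fact about the page in a form software can interpret.
    g.add((page, DCTERMS.title, Literal("A Field Guide to Apples")))
    g.add((page, DCTERMS.subject, URIRef("http://example.com/subjects/apples-fruit")))

    # Serialize as RDF/XML, the XML-based tagging the article refers to.
    print(g.serialize(format="xml"))

An application that understands the vocabulary can tell the page is about a fruit, something it cannot reliably infer from the free-form text alone.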

The implications for content providers are significant, and fall into two categories:

1. Value Chain Integration: A common way of labeling subject matter and meaning within content (the contents of the content) would help integrate the parties along the publishing value chain: authors, publishers, distributors, retailers and consumers.

2. Research Value: The value of content for research would be enormously increased. Content that is semantically structured could be queried, as one would query a relational database. Software research agents could continually comb through the Web, looking for significant information to aid in research. For example, a research agent could be programmed to continually monitor the Web for new findings involving the correlation between thyroid cancer and any polychlorinated biphenyl congener in Northern Europe. This would have a profound impact on legal, scientific and scholarly research.
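
A rough sketch of what such a standing query might look like, again in Python with rdflib, against an entirely invented vocabulary (ex:) and a hand-written sample of tagged findings; none of these names come from a real tagging scheme.

    from rdflib import Graph

    # A few hand-written triples standing in for semantically tagged Web content.
    g = Graph()
    g.parse(data="""
        @prefix ex: <http://example.org/vocab/> .
        ex:Study42 ex:reports ex:Finding7 .
        ex:Finding7 ex:condition ex:ThyroidCancer ;
                    ex:substance ex:PCB153 ;
                    ex:region    ex:NorthernEurope .
        ex:PCB153 ex:memberOf ex:PolychlorinatedBiphenyls .
    """, format="turtle")

    # The agent's standing question, expressed as a SPARQL query.
    results = g.query("""
        PREFIX ex: <http://example.org/vocab/>
        SELECT ?study ?congener WHERE {
            ?study    ex:reports   ?finding .
            ?finding  ex:condition ex:ThyroidCancer ;
                      ex:substance ?congener ;
                      ex:region    ex:NorthernEurope .
            ?congener ex:memberOf  ex:PolychlorinatedBiphenyls .
        }
    """)

    for study, congener in results:
        print(study, congener)  # the findings the agent would flag for a researcher

Pointed at live, tagged Web content rather than a hand-written sample, and run on a schedule, a query like this is the research agent the article describes.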

It has been a decade since Berners-Lee presented this vision, and the Semantic Web is yet to be. The content on the Web is still, for the most part, in human language, undecipherable by software. While there has been much research on semantic technologies, they have not been widely deployed over the last 10 years. By contrast, HTML took only a couple of years to become a global standard.

What Currently Exists
Was Berners-Lee wrong about the Semantic Web? To begin to answer that, we can first examine what methods have evolved to manage and extract value from the content on the Web.

· Search Engines: Search engines, principally Google, Yahoo and Microsoft Live Search, are the primary means for exploring content on the Web. A search engine's results are semantically "fuzzy" or imprecise, because a search engine indexes words and not their meanings: "apple" will return search results with both fruit and computers. Inexact or not, search engines provide tremendous value and, for many, structure the Web experience.

· Folksonomies: In the current Web 2.0 era, communities of users dynamically submit content to share with others on the Web. Users create and assign labels to the content. These labels, or "tags," describe the subject matter and help connect it with other content. This is similar in principle to the application of tags within the Semantic Web model, with an important distinction: The Semantic Web only uses tags from a standard taxonomy of terms, a "controlled vocabulary." Web 2.0 tags are typically user-defined and uncontrolled, and are referred to as forming a "folksonomy." A folksonomy is inexact because one user's tags will likely not correspond with another's. Folksonomies, therefore, cannot be used efficiently by software. A person is still required to interpret them. Like search results, folksonomies are "fuzzy," but sometimes "fuzzy" is good enough.
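
A toy illustration of the difference, in Python; every tag and term below is invented for the example.

    # Three users tag the same article with free-form folksonomy labels.
    user_tags = {
        "user_a": ["pcbs", "cancer risk"],
        "user_b": ["polychlorinated biphenyls", "oncology"],
        "user_c": ["pollution", "health"],
    }

    # Software sees three label sets with no term in common ...
    print(set(tag for tags in user_tags.values() for tag in tags))

    # ... whereas a controlled vocabulary would assign one shared, predictable term.
    controlled_subject = "Polychlorinated biphenyls -- Health aspects"
    print(controlled_subject)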

Examples of the Semantic Approach's Value
In looking at the dominance of search engines and Web 2.0 folksonomies, we may well conclude that the model of the Semantic Web has been usurped by other less cumbersome and more organic methods.

But I don't think that is the case. There are some compelling examples emerging of how the semantic approach is adding value to published content.

· Book Industry Standards and Communications (BISAC). Publishing professionals know all about taxonomies. They use them every day. BISAC and Library of Congress subject headings are, in fact, standardized taxonomies, used to connect partners along the publishing value chain. Publishers do not typically embed BISAC tags according to the Semantic Web technical specification, nor do BISAC subject categories contain the detail necessary to perform research. Ted Hill, a publishing consultant who specializes in digital supply chain issues, points out that BISAC was created to let booksellers know on which shelf in a bookstore to place a book. "BISAC subject codes were part of a strategy to reduce double-stocking and cut the cost of inventory," notes Hill, "not promote discovery by search engines." However, BISAC is conceptually consistent with the semantic vision described by Berners-Lee.

· Alexander Street Press. Founded in 2000, Alexander Street Press might be the most forward-thinking electronic content aggregator in the humanities. Alexander Street Press acquires, prepares and electronically distributes collections of books, documents and rich-media content for humanities research. Their preparation includes a very detailed application of semantic tags from controlled vocabularies, which dramatically increases the value of the content for research. This process requires domain expertise, curatorial care and technical know-how. According to Alexander Street Press President Stephen Rhind-Tutt, "There is a general underestimation of the value of librarianship and cataloging."

The value of the results is clear, however. Semantic preparation enables researchers to extract facts from collections of content, not just find search terms. Rhind-Tutt observes, "Researchers can ask questions that are much harder to ask [than] if the content was not semantically structured." Alexander Street's longevity is a testament to the value it is creating in the humanities research marketplace.

· Knovel. Knovel provides semantically structured collections of engineering and technical content, including text, charts and tables. This allows the information contained in articles to be queried, as one would query a database. Technical researchers can find answers contained in large bodies of content with great efficiency. For example, a technical researcher could submit a query to find studies that address polymers with a specific tensile strength at a given temperature range. This is tremendously more efficient than simply putting a search engine on top of a collection of thousands of journal articles. In the context of the cost of an engineer's time (and the research time saved), the economic value is enormous.
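
A toy sketch in Python of the kind of query this enables, assuming the properties reported in each article have already been captured as structured records; all field names and figures below are invented, not Knovel's actual data model.

    # Find studies reporting a polymer with at least 90 MPa tensile strength
    # across the 20-100 deg C range. Data and field names are hypothetical.
    articles = [
        {"title": "Study A", "polymer": "PEEK",
         "tensile_strength_mpa": 97, "temp_range_c": (20, 150)},
        {"title": "Study B", "polymer": "HDPE",
         "tensile_strength_mpa": 26, "temp_range_c": (-40, 60)},
    ]

    def matches(article, min_strength, lo, hi):
        """True if the study covers [lo, hi] deg C at or above min_strength MPa."""
        t_lo, t_hi = article["temp_range_c"]
        return article["tensile_strength_mpa"] >= min_strength and t_lo <= lo and hi <= t_hi

    print([a["title"] for a in articles if matches(a, 90, 20, 100)])

A keyword search over the same articles could only find the words "tensile strength"; it could not compare the numbers.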

Was Berners-Lee's vision of the Semantic Web on target? BISAC, Alexander Street Press and Knovel are evidence that the semantic approach can increase the value of content through discoverability and research efficiencies. While considerable effort is required to structure content in this way, it yields, in many cases, a significant return.

What has not taken place is the wholesale transformation of the Web. The Web has not become a global, semantically structured database. Instead, there are islands of semantically structured content inside commercial walled gardens (subscription services) or within defined communities, as in the case of BISAC. The reason is economic: There often is insufficient justification for the investment required to structure content semantically.

In addition, adoption of semantic technologies has been slow because they are built upon other standards, particularly XML and Web Services (a standard way for software applications to connect with one another over the Internet). It has taken time for XML and Web Services to become widespread. With those standards in place, the semantic approach can and will be increasingly used. However, this will occur only in specific areas of content where there is a particular, usually financial, rationale for doing so.

Your computer still won't be able to surf, but it may be able to swim some laps in the pool. And that may be enough.

Andrew Brenneman is managing director of Finitiv, a digital media consultancy. He has 20 years of experience leading pioneering digital media initiatives in publishing and advertising, including NETg's Skill Builder, Thomson Learning's WebTutor, FreeMark Mail and MsDewey.com. Brenneman also founded the Digital Media Group of The University of Chicago Press Books Division, where he led digital distribution for the Books Division and the development of The Chicago Manual of Style Online.