What are libraries doing (or not doing) about linked data? This was the question that the W3C Library Linked Data Incubator Group investigated between May 2010 and August 2011. In this post, I will take a look at the final report of the W3C Library Linked Data Incubator Group (October 2011) and provide an overview of their recommendations and my own analysis of the issues. Incubator Groups were a program that the W3C ran from 2006-2012 to get work done quickly on innovative ideas where there wasn’t enough to actually begin working on creating the web standards for which the W3C exists. (The Incubator Group program has transitioned into Community and Business Groups).
In this report, the participants in the group made several key recommendations aimed at library leaders, library standards bodies, data and systems designers, and librarian and archivists. The recommendations indicate just how far we are from really being able to implement open linked data in every library but also reveal the current landscape.
The report calls on library leaders to identify potentially very useful sets of data that can be exposed easily using current practices. That is, they should not try to revolutionize workflows, but to evolve towards more linked data. They mention authority files as an example of a data set that is ideal for this purpose, since authority files are lists of real world people with attributes that connect to real things. Having some semantic context for authority files helps–we could imagine a scenario in which you are searching for a common name, but the system recognizes that you are searching for a twentieth century American author and so does not show you a sixteenth century British author. Catalogers don’t necessarily have to do anything differently, either, since these authority files can link to other data to make a whole
picture. VIAF (Virtual International Authority File) is a project between OCLC and several national libraries to create such a linked international authority file using linked data and enter into the semantic web.
Library leadership must face the issue of rights in an open data world. It is a trope that libraries hold much valuable cultural and bibliographic data. Yet in many cases we have purchased or leased this data from a vendor rather than creating it ourselves (certainly we do in the case of indexes and often with catalog records)–and the license terms may not allow for open sharing of the data. We must be aware that exposing linked data openly is probably not going to mesh well with the way we have done things traditionally. Harvard recently released 12 million bibliographic records under a CC0 (public domain) license. Many libraries might not be in the position to release their own bibliographic records if they did not create them originally. Of course the same goes for indexes or bibliographies, other categories of traditional library materials that seems ripe for linking semantically. Library leadership will have to address this before open linked data is truly possible.
Library Standards Bodies
The report calls on library standards bodies to attack the problem from both sides. First, librarians need to be involved with standardizing semantic web technologies in a way that meets their needs and ensures that the library world stays in line with the way the technology is moving generally. Second, creators of library data standards need to ensure that those standards are compatible with semantic web technologies. Library data, when encoded in MARC, combines meaning and the structure in one unit. This works well for people who are reading the data, but is not easy for computers to parse semantically. For instance, consider:
245 10|aPride and prejudice /|cby Jane Austen.
which viewed in the browser or on the catalog card like:
Pride and prejudice / by Jane Austen.
The 245 tells us that this is a main title, and then the 1 tells us there is an added entry, in this case for Jane Austen. The 0 tells us that the title doesn’t begin with an article, or “nonfiling character”. The |a gives the actual title, followed by a / character, and then the |c is the statement of responsibility, followed by a period. Note that there is semantic meaning mixed together with punctuation and words that are helpful for people, such as “by”, which follow the rules of AACR2. There are good reasons for these rules, but the rules were meant to serve the information needs of humans. Given the capabilities of computers to parse and present structured data meaningfully to humans, it seems vital to make library data understandable to computers and know that we can use it to make something more useful to people. You may have noticed that HTML has changed over the past few years in the same way that library data will have to change. If you, for instance, want to give emphasis to a word, you use the <em></em> tags. People know the word is emphasized because it’s in italics, the computer knows it’s emphasized because you told it that it was. Indicating that a word should be italicized using the <i></i> tags looks the same to a human reader who can understand the context for the use of italics, but doesn’t tell the computer that the word is particularly important. HTML 5 has even more use of semantic tags to make more of the standard ways of presenting information on the web meaningful to computers.
The recommendations for data and systems designers are to start building tools that use linked data. Without a “killer app”, it’s hard to get excited about semantic technologies. Just after my last post went up, Google released its “Knowledge Graph”. This search takes words that traditionally would be matched as words, and matches them with “things.” For instance, if I type the search string Lincoln Hall into Google. Google guesses that I probably mean a concert venue in Chicago with that name and shows me that as the first result. It also displays a map, transit directions, reviews, and an upcoming schedule on the sidebar–certainly very convenient if that’s what I was looking for. But below the results for the concert venue, I get a box stating “See results about Lincoln Hall, Climber.” When I click on this, my results change to information about the Australian climber who recently died, and the side bar changes to information about him. Now as a librarian, I know that there would have been many ways to improve my search. But because semantic web technologies allow Google’s algorithms to understand that despite having the same name, an entity of a concert venue and a mountaineer are very different. This neatly disposes of the need for sophisticated searching for facts about things. Whether this is, indeed, revolutionary remains to be seen. But try it as a user. You might be pleasantly surprised by how it makes your search easier. It may be that web-scale discovery will do the same thing for libraries, but this is a tool that remains out of reach of many libraries.
Librarians and Archivists
Librarians and archivists have, as always, a duty to collect and preserve linked data sets. We know how valuable the earliest examples of any piece of data storage are–whether it’s a clay tablet, a book, or an index. We create bibliographies to see how knowledge changed over time or in different contexts. We need to be careful to preserve important data sets currently being produced, and maintain them over time so they remain accessible for future needs. But there’s another danger inherent in not being scrupulous about data integrity. Maintaining accurate and diverse data sets will help keep future information factual and unbiased. When a fact is one step removed from its source, it becomes even more difficult to check it for accuracy. While outright falsehood or misstatement is possible to correct, it will also be important to present alternate perspectives to ensure that scholarship can progress. (For an example of the issues in only presenting the most mainstream understanding of history, see “The ‘Undue Weight’ of Truth on Wikipedia”). If linked data doesn’t help us find out anything novel, will there have been a point in linking it?