Yesterday, I was fortunate to be invited by Andrew Stauffer and Bethany Nowviskie to present at their Rare Book School course, Digitizing the Historical Record. I talked about Linked Open Data (LOD), and afterward, Dana Wheeles talked about the NINES project and how they use RDF and LOD.
I tried to present a gentle, mostly non-technical introduction to LOD, with an example of it in action. Hopefully, this post will also serve as a 50,000-foot overview.
The Linked Open Data Universe
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
The first thing to know about LOD is that it’s everywhere. Look at the Linked Open Data cloud diagram above. All of these institutions are publishing data that anyone can use, and their data references others’ data also.
Linked Data vs Open Data vs RDF Data
First we need to unpack the term Linked Open Data:
Linked is an approach to data. You need to provide context for your data; you need to point to others’ data.
Open is a policy. Your data is out there for others to look at and use; you explicitly give others this permission.
Data is a technology and a set of standards. Your data is available using an RDF data model (usually) so computers can easily process it.
(See Christopher Gutteridge’s post for more about this distinction.)
Creating LOD can seem overwhelming. Where do you start? What do you have to do? It’s not an all-or-nothing proposition. You can take what you have, figure out how close you are to LOD, and work gradually toward making your information a full member of the LOD cloud. The LOD community talks about having four-star data or five-star data. Here is what the different stars denote, from one star to five:
- You’ve released the data using any format under an open license that allows others to view and use your data;
- You’ve released the data in a structured format so that some program can deal with it (e.g., Excel);
- You’ve released the data in a non-proprietary format, like CSV;
- You’ve used HTTP URIs (things you can type into your web browser’s location bar) to identify things in your data and made those URIs available on the web so others can point to your stuff;
- You explicitly link your data to others’ data to provide context.
(This is all over the web. Michael Hausenblas’ explanation with examples is a good starting point.)
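The stars build on each other: a dataset only earns a star once it has earned all of the ones before it. Here is a minimal Python sketch of that idea. The `Dataset` checklist and the `star_rating` function are names made up for this illustration; they are not part of any LOD tool.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical checklist for one published dataset."""
    open_license: bool     # star 1: released under an open license
    structured: bool       # star 2: structured format (e.g., Excel)
    non_proprietary: bool  # star 3: non-proprietary format (e.g., CSV)
    uses_http_uris: bool   # star 4: HTTP URIs identify things
    links_to_others: bool  # star 5: links out to others' data


def star_rating(ds: Dataset) -> int:
    """Count stars in order, stopping at the first unmet requirement."""
    checks = [ds.open_license, ds.structured, ds.non_proprietary,
              ds.uses_http_uris, ds.links_to_others]
    stars = 0
    for met in checks:
        if not met:
            break
        stars += 1
    return stars


# An openly licensed Excel spreadsheet: structured, but proprietary.
spreadsheet = Dataset(True, True, False, False, False)
print(star_rating(spreadsheet))  # 2
```

The point of the sketch is only that you can stop at any star and still have something useful, then come back later for the next one.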
A large part of this is about representing knowledge so computers can easily process it. Often LOD is encoded using Resource Description Framework (RDF). This provides a way to model information using a series of statements. Each statement has three parts: a subject, a predicate, and an object. Subjects and predicates must be URIs. Objects can be URIs (linked data) or data literals.
The predicates that you can use are grouped into vocabularies. Each vocabulary is used for a specific domain.
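To make the three-part shape concrete, here is a small Python sketch of two statements. Everything in it is an assumption for illustration: the `URI` and `Triple` types, the `example.org` addresses, and the made-up vocabulary at `http://example.org/vocab/`. Real projects would normally use an RDF library instead of hand-rolled tuples.

```python
from typing import NamedTuple


class URI(str):
    """A string meant to be read as a URI rather than as literal text."""


class Triple(NamedTuple):
    subject: URI
    predicate: URI
    obj: object  # either a URI (a link) or a plain literal such as a str


# Hypothetical vocabulary: both predicates below share this namespace.
VOCAB = "http://example.org/vocab/"

book = URI("http://example.org/books/1")  # hypothetical subject

statements = [
    # Object is a literal: just a value.
    Triple(book, URI(VOCAB + "title"), "An Example Book"),
    # Object is a URI: a link to another thing that can have its own statements.
    Triple(book, URI(VOCAB + "author"), URI("http://example.org/people/1")),
]


def is_link(triple: Triple) -> bool:
    """True when the object is a URI, i.e. the statement points somewhere."""
    return isinstance(triple.obj, URI)


for t in statements:
    kind = "link" if is_link(t) else "literal"
    print(kind, t.predicate)
```

The second statement is the "linked" part of linked data: its object is something you can look up and learn more about.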
We’re getting abstract, so let’s ground this discussion by looking at a specific vocabulary and set of statements.
Friend of a Friend
For describing people, there’s a vocabulary standard called Friend of a Friend (FOAF). I’ve used that on my web site to provide information about me. (The file on my website is in RDF/XML, which can be frightening. I’ve converted it to Turtle, which we can walk through more easily.)
I’ll show you parts of it line-by-line.
(Ahem. Before we start, a disclaimer: I need to update my FOAF file. It doesn’t reflect best practices. The referencing URL isn’t quite the way it should be, and it uses deprecated FOAF predicates. That said, if you can ignore my dirty laundry, it still illustrates the points I want to make about the basic structure of RDF.)
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
This just says that anywhere foaf: appears later, replace it with the URL http://xmlns.com/foaf/0.1/.
<> a <http://xmlns.com/foaf/0.1/Person>;
This is a statement. <> just means that it’s talking about the document itself, which in this case is a stand-in for me. The predicate here is a, which is a shortcut that’s used to tell what type of thing something is. In this case, it says that I’m a person, as FOAF defines it.
And because the line ends in a semicolon, the rest of the statements are also about me. Or more specifically, about <>.
    foaf:firstName "Eric";
    foaf:surname "Rochester";
    foaf:name "Eric Rochester";
    foaf:nick "Eric";
This set of statements still has the implied subject of me, and they use a series of predicates from FOAF. The object of each is a literal string, giving a value. Roughly this translates into four statements:
- Eric’s first name is “Eric.”
- Eric’s surname is “Rochester.”
- Eric’s full name is “Eric Rochester.”
- Eric’s nickname is “Eric.”
The next statement is a little different:
foaf:workplaceHomepage <http://www.scholarslab.org/> .
This final statement has a URI as the object. It represents this statement:
- Eric’s workplace’s home page is “http://www.scholarslab.org/”.
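If it helps to see the shorthand unpacked, here is a Python sketch that writes out every statement from the walk-through in full, one subject-predicate-object line each. It is only an illustration under a few assumptions: the document URL `http://example.org/foaf.rdf` is a placeholder for wherever the FOAF file actually lives, the `expand` and `to_ntriples` helpers are made up for this post, and the literals are not escaped the way a real serializer would escape them.

```python
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}

# Placeholder for the FOAF document's own address (an assumption).
DOCUMENT = "http://example.org/foaf.rdf"


def expand(name):
    """Turn a prefixed name such as 'foaf:nick' into a full URI."""
    if name == "a":  # Turtle shorthand for rdf:type
        return "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    prefix, _, local = name.partition(":")
    return PREFIXES[prefix] + local


# (predicate, object, object_is_uri) for each statement in the walk-through.
ROWS = [
    ("a", expand("foaf:Person"), True),
    ("foaf:firstName", "Eric", False),
    ("foaf:surname", "Rochester", False),
    ("foaf:name", "Eric Rochester", False),
    ("foaf:nick", "Eric", False),
    ("foaf:workplaceHomepage", "http://www.scholarslab.org/", True),
]


def to_ntriples(subject):
    """Spell each statement out in full, with no prefixes or semicolons."""
    lines = []
    for predicate, obj, is_uri in ROWS:
        rendered = "<%s>" % obj if is_uri else '"%s"' % obj
        lines.append("<%s> <%s> %s ." % (subject, expand(predicate), rendered))
    return lines


for line in to_ntriples(DOCUMENT):
    print(line)
```

Each printed line repeats the subject and uses full URIs, which is exactly what the prefix and the semicolons let the Turtle version leave out.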
If this was a little overwhelming, thank you for sticking around this far. Now here’s what you need to know about modeling information using RDF:
- Everything is expressed as subject-predicate-object statements; and
- Predicates are grouped into vocabularies.
The rest is just details.
Linked Open Data and the Semantic Web
During my presentation, someone pointed out that this all sounds a lot like the Semantic Web.
Yes, it does. LOD is the Semantic Web with less focus on understanding and more focus on what we can do. Understanding may come later, or it may not, but in the meantime we can still do some pretty cool things.
The benefit of all this is that it provides another layer for the internet. You can use this information to augment your own services (e.g., Google augments its search results with RDF data about product reviews) or build services on top of this information.
If you’re curious for more or still aren’t convinced, visit the Open Bibliographic Data Guide. They make a business case and articulate some use cases for LOD for libraries and other institutions.
Discussing LOD can get pretty abstract and pretty meta. To keep things grounded, I spent a few hours and threw together a quick demonstration of what you can do with LOD.
The Library of Congress’ Chronicling America project exposes data about the newspapers in its archives using RDF. It’s five-star data, too. For example, to tell the geographic location that the papers covered, it links to both GeoNames and DBpedia. The LoC doesn’t provide the coordinates of these cities, but because they express the places with a link, I can follow those and read the latitude and longitude from there.
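The link-following step can be sketched in a few lines of Python. This is not the script from the demonstration: it uses two small in-memory lists as stand-ins for the separately published datasets, so it runs without touching the network. The `example.org` URIs and the coordinate values are made up, and using `dcterms:coverage` as the predicate for "place covered" is an assumption about how the data is modeled. The two WGS84 predicates come from the W3C Basic Geo vocabulary.

```python
# Predicate URIs. COVERAGE is an assumption about the newspaper data's model.
COVERAGE = "http://purl.org/dc/terms/coverage"
LAT = "http://www.w3.org/2003/01/geo/wgs84_pos#lat"
LONG = "http://www.w3.org/2003/01/geo/wgs84_pos#long"

# Stand-in for the newspaper publisher's data (made-up URIs).
newspaper_data = [
    ("http://example.org/papers/daily-example", COVERAGE,
     "http://example.org/places/exampleville"),
]

# Stand-in for a separate gazetteer's data (made-up URIs and values).
place_data = [
    ("http://example.org/places/exampleville", LAT, "38.03"),
    ("http://example.org/places/exampleville", LONG, "-78.48"),
]


def objects(triples, subject, predicate):
    """Return every object of the statements matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def coordinates(paper):
    """Follow a paper's place links and read latitude/longitude from there."""
    found = []
    for place in objects(newspaper_data, paper, COVERAGE):
        lats = objects(place_data, place, LAT)
        longs = objects(place_data, place, LONG)
        if lats and longs:  # skip places with no coordinates
            found.append((float(lats[0]), float(longs[0])))
    return found


print(coordinates("http://example.org/papers/daily-example"))
```

In a real run, the second lookup would mean fetching the RDF behind each place link rather than reading a local list, but the shape is the same: the newspaper data only has to name the place, and someone else's data supplies the coordinates.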
Here are the results of one run of the script. (I randomly pick 100 newspapers from the LoC, so the results of each run are different.)
You can find the source for this example on both GitHub and Bitbucket:
Throughout this post, I’ve tried to link to some resources. Here are a few more (not all of these will be appropriate to a novice):