
Retrieving and Using Taxonomy Data from DBpedia

October 30, 2014

DBpedia, as described in the recent semanticweb.com article DBpedia 2014 Announced, is “a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web.” It currently has over 3 billion triples (that is, facts stored using the W3C standard RDF data model) available for use by applications, making it a cornerstone of the semantic web.

A surprising amount of this data is expressed using the SKOS vocabulary, the W3C standard model for taxonomies used by the Library of Congress, the New York Times, and many other organizations to publish their taxonomies and subject headers. (semanticweb.com has covered SKOS many times in the past.) DBpedia has data about over a million SKOS concepts, arranged hierarchically and ready for you to pull down with simple queries so that you can use them in your RDF applications to add value to your own content and other data.

Where is this taxonomy data in DBpedia?

Many people think of DBpedia as mostly storing the fielded “infobox” information that you see in the gray boxes on the right side of Wikipedia pages—for example, the names of the founders and the net income figures that you see on the right side of the Wikipedia page for IBM. If you scroll to the bottom of that page, you’ll also see the categories that have been assigned to IBM in Wikipedia such as “Companies listed on the New York Stock Exchange” and “Computer hardware companies.” The Wikipedia page for Computer hardware companies lists companies that fall into this category, as well as two other interesting sets of information: subcategories (or, in taxonomist parlance, narrower categories) such as “Computer storage companies” and “Fabless semiconductor companies,” and then, at the bottom of the page, categories that are broader than “Computer hardware companies” such as “Computer companies” and “Electronics companies.”

How does DBpedia store this categorization information? The DBpedia page for IBM shows that DBpedia includes triples saying that IBM has Dublin Core subject values such as category:Companies_listed_on_the_New_York_Stock_Exchange and category:Computer_hardware_companies. The DBpedia page for the category Computer_hardware_companies shows that it is a SKOS concept with values for the two key properties of a SKOS concept: a preferred label and broader values. The category:Computer_hardware_companies concept is itself the broader value of several other concepts such as category:Fabless_semiconductor_companies. Because it’s the broader value of other concepts and has its own broader values, it can be both a parent node and a child node in a tree of taxonomic terms, so DBpedia has the data that lets you build a taxonomy hierarchy around any of its categories.

Querying for the data we want

DBpedia includes a SPARQL endpoint, a service that accepts queries sent to it in the SPARQL query language (another recurring topic on semanticweb.com). These queries are sent using the HTTP protocol, which means that you can request query results using a web browser or just about any programming language. DBpedia also includes a web-based “SNORQL” form where you can enter SPARQL queries directly.

Most SPARQL queries, like those of its relational ancestor SQL, are SELECT statements that request columns of data that meet certain conditions. The following SPARQL SELECT query uses two triple patterns (that is, triple statements with variables substituted in certain positions to show the kinds of triples that we want) to ask for all of the skos:broader values of the “Computer hardware companies” concept and the preferred labels of those concepts:

PREFIX cat: <http://dbpedia.org/resource/Category:> 
SELECT ?broaderConcept ?preferredLabel WHERE {
  cat:Computer_hardware_companies skos:broader ?broaderConcept .
  ?broaderConcept skos:prefLabel ?preferredLabel . 
}

The “Computer hardware companies” concept’s actual identifier is the URI http://dbpedia.org/resource/Category:Computer_hardware_companies, which is abbreviated in the query using the cat: prefix declared at the beginning of the query. The skos: prefix would normally need to be declared as well, but it’s one of the predeclared prefixes on the SNORQL form. You can paste this query into the SNORQL form and click the Go button to run it yourself. (Keep in mind that DBpedia, as a nonprofit community project, does not have 100% uptime.)
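Outside the SNORQL form, the same query can be sent over plain HTTP. The following Python sketch shows one way this might look using only the standard library. It assumes the endpoint address http://dbpedia.org/sparql and a JSON result format requested through the Accept header, neither of which is stated above, and the function names are hypothetical. Because the SNORQL form's predeclared prefixes may not apply here, the sketch declares skos: explicitly.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: DBpedia's public SPARQL endpoint address (not given in the article).
ENDPOINT = "http://dbpedia.org/sparql"

SELECT_QUERY = """\
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cat: <http://dbpedia.org/resource/Category:>
SELECT ?broaderConcept ?preferredLabel WHERE {
  cat:Computer_hardware_companies skos:broader ?broaderConcept .
  ?broaderConcept skos:prefLabel ?preferredLabel .
}
"""


def build_select_request(query, endpoint=ENDPOINT):
    """Build an HTTP GET request carrying the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_select(query, endpoint=ENDPOINT):
    """Send the query and return one dict per result row (needs network access)."""
    with urlopen(build_select_request(query, endpoint), timeout=30) as response:
        bindings = json.load(response)["results"]["bindings"]
    # Flatten each row to {variable name: value string}.
    return [{name: cell["value"] for name, cell in row.items()} for row in bindings]
```

Calling run_select(SELECT_QUERY) should return one dictionary per row with the keys broaderConcept and preferredLabel, provided the endpoint is reachable.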

In addition to SELECT queries for retrieving columns of query results, the SPARQL query language offers an alternative form called CONSTRUCT queries that tell the SPARQL engine to construct triples from the information identified by the triple patterns in the WHERE clause. These queries might construct new triples based on the existing data, or they might create copies of the existing ones so that you can retrieve them.

For example, if you paste the following into the SNORQL form and run it, each time it finds triples matching the two patterns in the WHERE clause it will create copies of them and an additional third triple saying that the resource referenced by the ?broaderConcept variable is an instance of the skos:Concept class:

PREFIX cat: <http://dbpedia.org/resource/Category:> 
CONSTRUCT {
  cat:Computer_hardware_companies skos:broader ?broaderConcept .
  ?broaderConcept skos:prefLabel ?preferredLabel . 
  ?broaderConcept rdf:type skos:Concept .
}
WHERE {
  cat:Computer_hardware_companies skos:broader ?broaderConcept .
  ?broaderConcept skos:prefLabel ?preferredLabel . 
}
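To make the mechanics of CONSTRUCT concrete, here is a small Python sketch of what a SPARQL engine conceptually does with a template: for each solution found by the WHERE clause, it fills in the template's variables and emits the resulting triples. This is an illustration only, not how DBpedia's engine is implemented. The function name is hypothetical, and the two sample solutions are hand-written stand-ins based on the broader categories mentioned earlier, not actual query output.

```python
# Template triples from the CONSTRUCT clause; terms starting with "?" are variables.
TEMPLATE = [
    ("cat:Computer_hardware_companies", "skos:broader", "?broaderConcept"),
    ("?broaderConcept", "skos:prefLabel", "?preferredLabel"),
    ("?broaderConcept", "rdf:type", "skos:Concept"),
]

# Illustrative solutions only (hand-written, not retrieved from DBpedia).
SOLUTIONS = [
    {"?broaderConcept": "cat:Computer_companies",
     "?preferredLabel": '"Computer companies"'},
    {"?broaderConcept": "cat:Electronics_companies",
     "?preferredLabel": '"Electronics companies"'},
]


def instantiate(template, solutions):
    """Fill the template once per solution and collect the resulting triples."""
    triples = set()
    for solution in solutions:
        for pattern in template:
            variables = [term for term in pattern if term.startswith("?")]
            # A template triple with an unbound variable produces no output.
            if all(var in solution for var in variables):
                triples.add(tuple(solution.get(term, term) for term in pattern))
    return triples


constructed = instantiate(TEMPLATE, SOLUTIONS)
```

Note the rule in the inner loop: a template triple that mentions a variable with no value in a given solution is silently left out, which is why a misspelled variable name in a CONSTRUCT template yields fewer triples rather than an error.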

To extend this basic idea—and to have a little fun for the Halloween season—the next query constructs a set of triples about narrower values for the Wikipedia category Horror films (identified in DBpedia with the URI http://dbpedia.org/resource/Category:Horror_films) as well as for the more specific horror film categories two and three levels down from that:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX cat: <http://dbpedia.org/resource/Category:> 
CONSTRUCT {

  ?childConcept rdf:type skos:Concept .
  ?childConcept skos:broader cat:Horror_films .
  ?childConcept skos:prefLabel ?childLabel . 

  ?grandchildConcept rdf:type skos:Concept .
  ?grandchildConcept skos:broader ?childConcept . 
  ?grandchildConcept skos:prefLabel ?grandchildLabel . 

  ?greatgrandchildConcept rdf:type skos:Concept .
  ?greatgrandchildConcept skos:broader ?grandchildConcept . 
  ?greatgrandchildConcept skos:prefLabel ?greatgrandchildLabel . 

  <http://www.dataversity.net/taxonomy/films> rdf:type skos:ConceptScheme .
  <http://www.dataversity.net/taxonomy/films> skos:hasTopConcept cat:Horror_films .
  cat:Horror_films skos:prefLabel "Horror films" . 
}
WHERE {
  {
    ?childConcept skos:broader cat:Horror_films .
    ?childConcept skos:prefLabel ?childLabel . 
  }
  UNION
  {
    ?childConcept skos:broader cat:Horror_films .
    ?grandchildConcept skos:broader ?childConcept . 
    ?grandchildConcept skos:prefLabel ?grandchildLabel . 
  }
  UNION
  {
    ?childConcept skos:broader cat:Horror_films .
    ?grandchildConcept skos:broader ?childConcept . 
    ?greatgrandchildConcept skos:broader ?grandchildConcept . 
    ?greatgrandchildConcept skos:prefLabel ?greatgrandchildLabel . 
  }
}

A few notes about the query:

  • SPARQL syntax offers several other ways to structure this same query, but I wanted to keep it as simple to read as possible.
  • The previous query looked for skos:broader values of cat:Computer_hardware_companies, searching up the taxonomy hierarchy, but this new query starts with values that have cat:Horror_films as their skos:broader values, searching down the taxonomy hierarchy from there.
  • The UNION keywords gather together the triples matching the three groups of triple patterns specified inside the WHERE clause. A rearranged version of this query that used SPARQL’s OPTIONAL keyword could have achieved the same effect as the version using UNION, but it would have run much more slowly.
  • The query also creates a SKOS concept scheme, which groups together concepts. In this case, it has only one—horror films—but more could be added.
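Because the three UNION groups follow a regular pattern, a query like this could also be generated for any number of levels. The Python sketch below is one possible way to do that. The variable names (?level1Concept, ?level2Concept, and so on) and the function names are my own and replace the child/grandchild naming used above, and the concept scheme triples are left out to keep the example short.

```python
PREFIXES = (
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
    "PREFIX cat: <http://dbpedia.org/resource/Category:>\n"
)


def union_branch(root, level):
    """One WHERE group: walk down `level` steps from root, then fetch the label."""
    lines = [f"    ?level1Concept skos:broader {root} ."]
    for n in range(2, level + 1):
        lines.append(f"    ?level{n}Concept skos:broader ?level{n - 1}Concept .")
    lines.append(f"    ?level{level}Concept skos:prefLabel ?level{level}Label .")
    return "  {\n" + "\n".join(lines) + "\n  }"


def construct_template(root, depth):
    """Template triples (type, broader, prefLabel) for every level."""
    lines = []
    for n in range(1, depth + 1):
        parent = root if n == 1 else f"?level{n - 1}Concept"
        lines.append(f"  ?level{n}Concept rdf:type skos:Concept .")
        lines.append(f"  ?level{n}Concept skos:broader {parent} .")
        lines.append(f"  ?level{n}Concept skos:prefLabel ?level{n}Label .")
    return "\n".join(lines)


def build_query(root="cat:Horror_films", depth=3):
    """Assemble a CONSTRUCT query with one UNION group per level."""
    branches = "\n  UNION\n".join(
        union_branch(root, n) for n in range(1, depth + 1)
    )
    return (
        PREFIXES
        + "CONSTRUCT {\n" + construct_template(root, depth) + "\n}\n"
        + "WHERE {\n" + branches + "\n}\n"
    )
```

With depth=3 this produces the same three-group shape as the query above; a larger depth adds more groups, at the cost of a heavier query for the endpoint to run.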

Retrieving and using the data

If you execute this query on DBpedia’s SNORQL form, you will see the constructed triples listed there. If you want to actually retrieve the triples in an RDF syntax such as Turtle so that you can load it into a taxonomy management system, you have several options:

  • You can use an HTTP retrieval tool such as cURL or Wget.
  • If retrieval of HTTP resources is not built into your favorite programming or scripting language, there is probably a library available to make that possible.
  • You can use SPARQL’s SERVICE keyword in a locally run query to retrieve triples from a remote SPARQL endpoint such as DBpedia.
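As a sketch of the second option, the following Python code sends a CONSTRUCT query to the endpoint and saves the response as a Turtle file. It relies on several assumptions not stated above: the endpoint address http://dbpedia.org/sparql, that the endpoint accepts a form-encoded POST with a "query" field, and that it honors an Accept header of text/turtle. The function names and the output path are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: DBpedia's public SPARQL endpoint address (not given in the article).
ENDPOINT = "http://dbpedia.org/sparql"


def build_construct_request(query, endpoint=ENDPOINT):
    """Build a form-encoded POST request asking for Turtle output."""
    body = urlencode({"query": query}).encode("utf-8")
    headers = {
        "Accept": "text/turtle",  # assumed to be supported by the endpoint
        "Content-Type": "application/x-www-form-urlencoded",
    }
    # Supplying `data` makes urllib issue a POST instead of a GET.
    return Request(endpoint, data=body, headers=headers)


def save_turtle(query, path, endpoint=ENDPOINT):
    """Run the query, write the returned bytes to `path`, return the byte count."""
    with urlopen(build_construct_request(query, endpoint), timeout=60) as response:
        data = response.read()
    with open(path, "wb") as out:
        out.write(data)
    return len(data)
```

Calling save_turtle with the text of the CONSTRUCT query and a file name such as "horror_films.ttl" would then leave a file that a taxonomy tool can import, provided the endpoint is reachable. POST is used here rather than GET only because long queries can make for unwieldy URLs.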

After I retrieved the triples constructed by the query shown above, I loaded them into TopQuadrant’s TopBraid EVN taxonomy editor, and we can see that “Indian comedy horror films” is three levels down the taxonomy tree from “Horror films”:

[Screenshot: Indian horror films in TopQuadrant's TopBraid EVN taxonomy editor]

Of the data associated with this concept, note the two “has broader” values, both shown as hypertext links to those concepts in the screen shot: “Comedy horror films” and “Indian horror films.” The presence of multiple skos:broader values for many of this taxonomy’s concepts means that the retrieved taxonomy is a polyhierarchy, so that in addition to showing up on the Concept Hierarchy tree as a child of “Indian horror films,” as shown here, this same “Indian comedy horror films” concept will also appear as a child of “Comedy horror films” on the hierarchy.
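The effect of multiple skos:broader values can be illustrated with a few lines of Python. This sketch indexes (narrower, broader) pairs in both directions and reports concepts that have more than one parent. It uses only the two relationships named above, written as plain labels rather than URIs, and the function names are hypothetical.

```python
from collections import defaultdict

# (narrower concept, broader concept) pairs taken from the example above.
BROADER_PAIRS = [
    ("Indian comedy horror films", "Comedy horror films"),
    ("Indian comedy horror films", "Indian horror films"),
]


def index_hierarchy(broader_pairs):
    """Return (parents, children) lookup tables built from broader pairs."""
    parents = defaultdict(set)
    children = defaultdict(set)
    for narrower, broader in broader_pairs:
        parents[narrower].add(broader)
        children[broader].add(narrower)
    return parents, children


def multi_parent_concepts(parents):
    """Concepts with more than one broader value, i.e. the polyhierarchy cases."""
    return sorted(c for c, broader in parents.items() if len(broader) > 1)


parents, children = index_hierarchy(BROADER_PAIRS)
```

A tree view built from the children table lists “Indian comedy horror films” under both parents, which is the behavior described for the Concept Hierarchy tree.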

There’s not much other data shown on the right side of the screen, but that’s because the CONSTRUCT query didn’t retrieve much else. Additions to the query could retrieve additional data about these concepts such as the URLs of their Wikipedia pages and owl:sameAs links to related pages at other sources. These are a good example of the links in “Linked Data,” making it possible to combine data about these resources from additional sources outside of this particular SPARQL endpoint.

Remember that, for better or worse, the data is based on Wikipedia data. If you extend the structure of the query above to retrieve lower, more specific levels of horror film categories, you’ll probably find the work of film scholars who’ve done serious research as well as the work of nutty people who are a little too into their favorite subgenres. A SKOS-based taxonomy management tool such as TopBraid EVN lets you revise and curate this data and metadata to support your own business needs, and to manage the relationship between the data you retrieve and your own additions.

So, with the query above as a starting point, you can start pulling taxonomies about any of the huge collection of categories that DBpedia offers, as well as data about the members of those categories—business, culture, science, technology, and more. Even comedy horror movies!

Image: courtesy MakeSweet.com

About the author

Bob DuCharme is Director of Digital Media Solutions at TopQuadrant and author of O’Reilly’s “Learning SPARQL.” He will be speaking on the use of taxonomy data from DBpedia at the upcoming Taxonomy Boot Camp November 4th in Washington, D.C.
