Introduction to: Triplestores

Triplestores are Database Management Systems (DBMS) for data modeled using RDF. Unlike Relational Database Management Systems (RDBMS), which store data in relations (or tables) and are queried using SQL, triplestores store RDF triples and are queried using SPARQL.
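To make the idea of "storing triples and querying them by pattern" concrete, here is a minimal sketch in Python. The class name `TinyTripleStore` and its methods are hypothetical and exist only for illustration; a real triplestore adds indexing, persistence and a full SPARQL engine on top of this basic idea.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TinyTripleStore:
    """A toy in-memory collection of triples (illustrative only)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None plays the role of a variable."""
        for t in self._triples:
            if ((s is None or t[0] == s)
                    and (p is None or t[1] == p)
                    and (o is None or t[2] == o)):
                yield t


store = TinyTripleStore()
store.add("ex1:Bob", "rdf:type", "ex1:FullProfessor")
store.add("ex1:Bob", "foaf:name", "Bob Smith")

# Roughly what SELECT ?o WHERE { ex1:Bob foaf:name ?o } asks for.
names = [o for (_, _, o) in store.match(s="ex1:Bob", p="foaf:name")]
```

A linear scan like this is fine for a handful of triples; it is the part that production systems replace with specialized indexes.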

A key feature of many triplestores is the ability to do inference. Note also that a DBMS typically offers concurrency control, security, logging, recovery, and updates, in addition to loading and storing data. Not all triplestores offer all of these capabilities (yet).

Triplestore Implementations

Triplestores can be broadly classified into three categories: native triplestores, RDBMS-backed triplestores and NoSQL triplestores.

Native triplestores are those that are implemented from scratch and exploit the RDF data model to efficiently store and access the RDF data. These include: 4Store, AllegroGraph, BigData, Jena TDB, Sesame, Stardog, OWLIM and uRiKa.

RDBMS-backed triplestores are built by adding an RDF-specific layer to an existing RDBMS. These include: Jena SDB, IBM DB2 and Virtuoso.
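As a rough illustration of the RDBMS-backed approach, the sketch below stores triples in a single three-column table using Python's built-in `sqlite3` module and answers a two-pattern query with a self-join. The table name `triples` and its layout are assumptions made for this example; they are not the schema used by Jena SDB, DB2 or Virtuoso.

```python
import sqlite3

# An in-memory database with one table holding (subject, predicate, object) rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE triples (s TEXT, p TEXT, o TEXT, PRIMARY KEY (s, p, o))"
)
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex1:Bob", "rdf:type", "ex1:FullProfessor"),
        ("ex1:Bob", "foaf:name", "Bob Smith"),
        ("ex1:Alice", "rdf:type", "ex1:AssistantProfessor"),
        ("ex1:Alice", "foaf:name", "Alice Jones"),
    ],
)

# Roughly: SELECT ?name WHERE { ?x rdf:type ex1:FullProfessor . ?x foaf:name ?name }
# Each triple pattern becomes one copy of the table; the shared variable ?x
# becomes the join condition.
rows = conn.execute(
    """
    SELECT t2.o
    FROM triples AS t1
    JOIN triples AS t2 ON t1.s = t2.s
    WHERE t1.p = 'rdf:type'
      AND t1.o = 'ex1:FullProfessor'
      AND t2.p = 'foaf:name'
    """
).fetchall()
full_professor_names = [row[0] for row in rows]
```

The self-join is the key point: a query with many triple patterns turns into many joins over the same table, which is one reason these layers tend to add their own indexes and schema tricks.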

NoSQL triplestores have recently been investigated as possible storage managers for RDF. For example, CumulusRDF is built on top of Cassandra, and the folks at Seevl have implemented a triplestore on top of Redis.

Several benchmarks have been developed to evaluate the performance of triplestores. Popular ones are the Lehigh University Benchmark (LUBM), the Berlin SPARQL Benchmark (BSBM), SP2Bench and, more recently, the DBpedia Benchmark. The 2007 best paper at VLDB compares RDBMS-backed triplestores. What’s the best triplestore? Check the results of each of the benchmarks. There is no one right answer.

Recently, the Linked Data Benchmark Council (LDBC) has been formed to establish industry cooperation between triplestore vendors (just imagine the TPC of triplestores).

Triplestores and Inferencing

A key feature of triplestores is the ability to do inferencing for queries. For example, consider the following OWL ontology:

ex1:FullProfessor  rdfs:subClassOf  ex1:Professor .
ex1:AssistantProfessor  rdfs:subClassOf  ex1:Professor .
ex1:Professor  owl:equivalentClass  ex2:Teacher .

This states that a Full Professor and an Assistant Professor are both Professors. It also refers to a second ontology with the class Teacher, which is equivalent to Professor. Now consider the following data:

ex1:Bob  rdf:type  ex1:FullProfessor .
ex1:Alice  rdf:type  ex1:AssistantProfessor .
ex2:Mary  rdf:type  ex2:Teacher .

This states that Bob is a Full Professor, Alice is an Assistant Professor and Mary is a Teacher. Given the following SPARQL query:

SELECT ?x
 WHERE {
 ?x rdf:type ex1:Professor
 }

If inferencing is not enabled, the answer to the query is empty because the data does not assert that anybody is a professor. However, it is possible to infer that ex1:Bob, ex1:Alice and ex2:Mary are all professors because we know that Full Professors and Assistant Professors are both Professors and Teachers are also Professors.
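The reasoning in this example can be sketched in a few lines of Python. This is only an illustration of the idea, under the assumption that subclass axioms are written with the standard rdfs:subClassOf property and that owl:equivalentClass is treated as a subclass relation in both directions. The function names are hypothetical, and real triplestores implement inference in far more sophisticated ways.

```python
from collections import defaultdict

schema = [
    ("ex1:FullProfessor", "rdfs:subClassOf", "ex1:Professor"),
    ("ex1:AssistantProfessor", "rdfs:subClassOf", "ex1:Professor"),
    ("ex1:Professor", "owl:equivalentClass", "ex2:Teacher"),
]
data = [
    ("ex1:Bob", "rdf:type", "ex1:FullProfessor"),
    ("ex1:Alice", "rdf:type", "ex1:AssistantProfessor"),
    ("ex2:Mary", "rdf:type", "ex2:Teacher"),
]


def build_superclass_edges(schema):
    """Map each class to its direct superclasses."""
    parents = defaultdict(set)
    for s, p, o in schema:
        if p == "rdfs:subClassOf":
            parents[s].add(o)
        elif p == "owl:equivalentClass":
            # Equivalence behaves like subClassOf in both directions.
            parents[s].add(o)
            parents[o].add(s)
    return parents


def all_superclasses(cls, parents):
    """Follow superclass edges transitively, starting from cls."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def instances_of(target, data, parents, infer):
    """Answer SELECT ?x WHERE { ?x rdf:type target }, with or without inference."""
    result = set()
    for s, p, o in data:
        if p != "rdf:type":
            continue
        classes = {o}
        if infer:
            classes |= all_superclasses(o, parents)
        if target in classes:
            result.add(s)
    return result


parents = build_superclass_edges(schema)
without_inference = instances_of("ex1:Professor", data, parents, infer=False)
with_inference = instances_of("ex1:Professor", data, parents, infer=True)
```

Here `without_inference` is the empty set, while `with_inference` contains ex1:Bob, ex1:Alice and ex2:Mary, matching the two outcomes described above.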

Another common predicate that is widely used for inferencing is owl:sameAs. Consider the following data:

ex1:Bob foaf:name "Bob Smith" .
ex1:Bob owl:sameAs ex2:Smith .
ex2:Smith foaf:phone "555-1234" .

Given the following SPARQL query:

SELECT ?name ?phone
 WHERE {
 ex1:Bob foaf:name ?name .
 ex1:Bob foaf:phone ?phone .
 }

Without inferencing, the answer to the query would be empty because there is no triple that has as subject ex1:Bob and predicate foaf:phone. However, if inferencing is enabled, we can infer that Bob has the phone “555-1234” because ex1:Bob is the same as ex2:Smith and ex2:Smith has a phone of “555-1234”.
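The owl:sameAs example can be sketched the same way. The snippet below treats owl:sameAs as a symmetric, transitive link between identifiers and looks up properties across all aliases of a resource. The helper names (`aliases`, `values`, `name_and_phone`) are hypothetical and the approach is a simplification; it is not how any particular triplestore implements owl:sameAs.

```python
data = [
    ("ex1:Bob", "foaf:name", "Bob Smith"),
    ("ex1:Bob", "owl:sameAs", "ex2:Smith"),
    ("ex2:Smith", "foaf:phone", "555-1234"),
]


def aliases(resource, triples):
    """Return resource plus everything linked to it by owl:sameAs, in either direction."""
    seen = {resource}
    frontier = [resource]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p != "owl:sameAs":
                continue
            if s == current:
                other = o
            elif o == current:
                other = s
            else:
                continue
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen


def values(subject, predicate, triples, infer):
    """Objects of (subject, predicate, ?o), optionally merging owl:sameAs aliases."""
    subjects = aliases(subject, triples) if infer else {subject}
    return {o for s, p, o in triples if s in subjects and p == predicate}


def name_and_phone(subject, triples, infer):
    """Answer the two-pattern query: both a name and a phone must be found."""
    names = values(subject, "foaf:name", triples, infer)
    phones = values(subject, "foaf:phone", triples, infer)
    return sorted((n, ph) for n in names for ph in phones)


without_inference = name_and_phone("ex1:Bob", data, infer=False)
with_inference = name_and_phone("ex1:Bob", data, infer=True)
```

Without inference the result is empty because no phone is attached to ex1:Bob directly; with inference the single row ("Bob Smith", "555-1234") is returned.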

SPARQL 1.1 introduces Entailment Regimes, which allow a user to specify the type of inference (entailment) that is needed.

Triplestores vs NoSQL Graph Databases

NoSQL Graph Databases such as Neo4j, HyperGraphDB and InfiniteGraph store and manage data that is represented as a graph. Given that the RDF data model can be interpreted as a graph, it is fair to say that Triplestores are a type of NoSQL Graph Database. However, Triplestores differ from NoSQL Graph Databases in several ways. Triplestores have been implemented to store RDF, which is a special kind of graph: a directed labeled graph. NoSQL Graph Databases can store different types of graphs: unlabeled graphs, undirected graphs, weighted graphs, hypergraphs, etc.

Triplestores are made to be queried by a standardized query language: SPARQL. SPARQL is based on graph pattern matching. Even though NoSQL Graph Databases don’t have a standardized query language, these databases can not only perform graph pattern matching queries but are also highly performant at reachability and navigational queries (find the shortest path between two nodes, or how these two nodes are connected). In fact, SPARQL 1.1 introduces Property Paths, which allow users to specify these types of reachability and navigational queries in Triplestores. However, because existing SPARQL query engines rely on data structures tailored to efficiently perform pattern matching tasks, an efficient implementation of these graph-based tasks may require further extensions to existing Triplestores.
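To show what a navigational query looks like when triples are viewed as graph edges, here is a small Python sketch of a breadth-first shortest-path search over one predicate. The data (the `ex:a` to `ex:e` resources) and the function name `shortest_path` are made up for this example. Note that a SPARQL 1.1 property path such as `ex:a foaf:knows+ ex:d` only tests reachability, while this sketch also returns the path itself.

```python
from collections import deque

triples = [
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:b", "foaf:knows", "ex:c"),
    ("ex:c", "foaf:knows", "ex:d"),
    ("ex:a", "foaf:knows", "ex:e"),
    ("ex:e", "foaf:knows", "ex:d"),
]


def shortest_path(start, goal, triples, predicate):
    """Breadth-first search along directed edges labeled with predicate.

    Returns the list of nodes on a shortest path, or None if goal is unreachable.
    """
    # Build an adjacency list from the triples that use the chosen predicate.
    adjacency = {}
    for s, p, o in triples:
        if p == predicate:
            adjacency.setdefault(s, []).append(o)

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adjacency.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


path = shortest_path("ex:a", "ex:d", triples, "foaf:knows")
no_path = shortest_path("ex:d", "ex:a", triples, "foaf:knows")
```

With this data, `path` is the two-hop route through ex:e rather than the three-hop route through ex:b and ex:c, and `no_path` is None because the edges are directed.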

Summary

This is obviously not an exhaustive list of Triplestores and I know I may have missed many (apologies). The Wikipedia page on Triplestores offers a much larger list. Please feel free to add them in the comments. Additionally, please share in the comments if you know which Triplestore is backing large public SPARQL endpoints. For example, Stardog is powering the recently launched BestBuy SPARQL endpoint; Virtuoso powers the DBpedia SPARQL endpoint as well as the Linked Open Data Cloud Cache and several datasets in the LOD cloud; the BBC uses OWLIM; and the UK’s data.gov.uk is powered by Jena TDB.

About the Author

Juan Sequeda is a Ph.D. student at the University of Texas at Austin and an NSF Graduate Research Fellow. His research is at the intersection of the Semantic Web and Relational Databases. He co-created the Consuming Linked Data Workshop series and regularly gives talks at academic and industry semantic web conferences. Juan is an Invited Expert on the W3C RDB2RDF Working Group and an editor of the “Direct Mapping of Relational Data to RDF” specification. Juan is also the founder of a new startup, Capsenta, which is a spin-off from his research.
