Triplestores 101: Storing Data for Efficient Inferencing


Relational databases are the workhorses of many corporate analytical and reporting applications, but the rise of Big Data has led to the rise of alternative NoSQL database models that are better suited to the nature of the data and the kinds of questions that will be asked of it. Triplestores are a kind of NoSQL database that stores data as “triples” rather than in the traditional relational structure. They are similar to graph databases, and are often treated as a specialized kind of graph database, but they also have some distinct differences. These databases can handle trillions of records and support inferencing, making them excellent for Analytics. Because triplestores identify data with URIs, they support querying and reasoning about the Semantic Web.

Triplestores and Resource Description Framework

Unlike relational databases, which store data in tables, triplestores store data as statements in Subject-Predicate-Object form, such as “Jessica teaches Computer Science”; each statement is called a triple. This data representation uses the Resource Description Framework (RDF), a standard model for publishing and sharing data on the Web. The subject, predicate, and object can all be URIs, enabling data to be easily linked.

The collection of statements in a triplestore forms a graph database of facts. The subjects and objects are the nodes of the graph, and the predicates form the edges between them. Triples can also be grouped under a name, itself a URI, creating a named graph. There can optionally be a schema model, called an ontology, which provides a formal description of the data.
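
As a minimal sketch of this model, the statements can be held as plain Python tuples, with the nodes and edges derived from them. The `http://example.org/` namespace and every name under it are hypothetical, chosen only for illustration; a real triplestore would not be built this way.

```python
# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

# Each triple is (subject, predicate, object).
triples = {
    (EX + "Jessica", EX + "teaches", EX + "ComputerScience"),
    (EX + "Jessica", EX + "worksAt", EX + "StateUniversity"),
    (EX + "ComputerScience", EX + "offeredBy", EX + "StateUniversity"),
}

# Subjects and objects are the nodes of the graph.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}

# Each predicate is a labeled, directed edge from subject to object.
edges = sorted(triples)

def outgoing(node):
    """Return the (predicate, object) pairs leaving a node, sorted."""
    return sorted((p, o) for s, p, o in triples if s == node)
```

Calling `outgoing(EX + "Jessica")` walks the two edges that leave the Jessica node, which is the kind of traversal a graph of facts makes natural.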

Triplestores have three possible architectures:

  • In-memory: which stores the triples in main memory
  • Native Store: which provides persistent storage as a triplestore
  • Non-native Store: which provides persistent storage using a third-party RDBMS.
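
To make the in-memory option concrete, here is a toy store that keeps every triple in main memory and indexes it by subject, predicate, and object. The class name `InMemoryTripleStore` and the `ex:` identifiers are assumptions made for this sketch; production triplestores use far more elaborate index structures.

```python
from collections import defaultdict

class InMemoryTripleStore:
    """Toy in-memory store (hypothetical, for illustration only)."""

    def __init__(self):
        self._triples = set()
        # One index per position so lookups avoid scanning every triple.
        self._by_s = defaultdict(set)
        self._by_p = defaultdict(set)
        self._by_o = defaultdict(set)

    def add(self, s, p, o):
        t = (s, p, o)
        self._triples.add(t)
        self._by_s[s].add(t)
        self._by_p[p].add(t)
        self._by_o[o].add(t)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        candidates = [self._triples]
        if s is not None:
            candidates.append(self._by_s.get(s, set()))
        if p is not None:
            candidates.append(self._by_p.get(p, set()))
        if o is not None:
            candidates.append(self._by_o.get(o, set()))
        # Intersect the candidate sets, starting from the smallest.
        candidates.sort(key=len)
        result = set(candidates[0])
        for c in candidates[1:]:
            result &= c
        return sorted(result)

store = InMemoryTripleStore()
store.add("ex:Jessica", "ex:teaches", "ex:ComputerScience")
store.add("ex:Omar", "ex:teaches", "ex:Mathematics")
store.add("ex:Jessica", "ex:worksAt", "ex:StateUniversity")
```

A native store would persist comparable indexes to disk, while a non-native store would map the same three positions onto columns of a relational table.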

The Advantages of Triplestores

Triplestores offer several advantages compared to traditional database management systems:

  • Flexibility: There’s no need to define a schema in advance, and no need for artificial entities such as tables to represent a many-to-many relationship. The lack of a predefined data schema means that altering the data model is easy.
  • Easy Import/Export: RDF can be imported or exported using the standardized formats N-Triples or N-Quads. As a result, users aren’t locked in to using a specific vendor’s implementation.
  • Efficient Querying: Triplestores can be queried using the language SPARQL. SQL queries can become complicated and inefficient if the database wasn’t designed with columns to join and indexes to make the search efficient, whereas triplestores can easily handle complex queries.
  • Easy Sharing: Because triplestores use URIs, sharing data is simple. This is an advantage for analytics programs that need to bring together data from multiple sources.
  • Relationship Discovery: When combined with ontologies that formally define the objects and their relationship types, triplestores support inferencing that enables the discovery of implicit facts and relationships.
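
The relationship discovery point can be sketched with a tiny forward-chaining reasoner. It applies simplified versions of two RDFS rules (subclass transitivity, and propagation of `rdf:type` through `rdfs:subClassOf`) until nothing new appears. The `ex:` facts are invented for this example, and the loop is an assumption about how one might illustrate inferencing, not a description of how any product implements it.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

def infer(triples):
    """Forward-chain two RDFS-style rules until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p2 != SUBCLASS_OF or o != s2:
                    continue
                if p == SUBCLASS_OF:
                    # A subClassOf B, B subClassOf C => A subClassOf C
                    new.add((s, SUBCLASS_OF, o2))
                elif p == RDF_TYPE:
                    # x type A, A subClassOf B => x type B
                    new.add((s, RDF_TYPE, o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

# Explicit facts plus a two-level class hierarchy (the ontology).
facts = {
    ("ex:Jessica", RDF_TYPE, "ex:Professor"),
    ("ex:Professor", SUBCLASS_OF, "ex:Teacher"),
    ("ex:Teacher", SUBCLASS_OF, "ex:Person"),
}
closure = infer(facts)
implicit = closure - facts  # triples that were never stated directly
```

Here `implicit` holds facts such as "Jessica is a Person", which no one entered but which follow from the ontology.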

Querying Triplestores Using SPARQL

Triplestores are queried using a language called SPARQL (SPARQL Protocol and RDF Query Language). As with SQL, data can be extracted in tabular format using a Select query. A query is structured so that a prefix defines the namespace, and a select clause defines the result set to be returned from the specified data set wherever the triples match a query pattern. Additional clauses, such as order by and distinct, modify the result set. Variables are used in the clauses, and using the same variable in multiple patterns in the query defines a join.
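
The way shared variables define joins can be sketched in a few lines of Python. The functions below are a toy evaluator written for this article, not a SPARQL engine, and the `ex:` data is invented. They mimic what a query such as `SELECT DISTINCT ?person ?school WHERE { ?person ex:teaches ?course . ?course ex:offeredBy ?school } ORDER BY ?person` asks for.

```python
def match_pattern(pattern, triple, binding):
    """Extend a binding so the pattern matches the triple, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result

def select(variables, patterns, triples):
    """Evaluate the patterns in order and project the requested variables."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match_pattern(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    # The set gives DISTINCT; sorting orders rows by the first variable.
    return sorted({tuple(b[v] for v in variables) for b in bindings})

data = [
    ("ex:Jessica", "ex:teaches", "ex:ComputerScience"),
    ("ex:Omar", "ex:teaches", "ex:Mathematics"),
    ("ex:ComputerScience", "ex:offeredBy", "ex:EngineeringSchool"),
    ("ex:Mathematics", "ex:offeredBy", "ex:ScienceSchool"),
]
rows = select(
    ["?person", "?school"],
    [("?person", "ex:teaches", "?course"), ("?course", "ex:offeredBy", "?school")],
    data,
)
```

Because `?course` appears in both patterns, only combinations where the taught course and the offered course are the same survive, which is exactly a join.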

SPARQL also supports a Construct query that generates the results as RDF, an Ask query that answers Yes/No questions, and a Describe query that describes the resources that match the query.
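
The difference between these query forms can be illustrated with a single triple pattern. The helper names `matches`, `ask`, and `construct` are hypothetical and cover only a small part of what SPARQL offers; Describe is left out because its result is implementation-defined.

```python
def matches(pattern, triples):
    """Yield variable bindings for one triple pattern ('?x' marks a variable)."""
    for triple in triples:
        binding = {}
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield binding

def ask(pattern, triples):
    """ASK: report whether at least one match exists."""
    return any(True for _ in matches(pattern, triples))

def construct(template, pattern, triples):
    """CONSTRUCT: build new triples by filling the template from each match."""
    return {
        tuple(binding.get(term, term) for term in template)
        for binding in matches(pattern, triples)
    }

data = [
    ("ex:Jessica", "ex:teaches", "ex:ComputerScience"),
    ("ex:Omar", "ex:teaches", "ex:Mathematics"),
]
has_course = ask(("ex:Jessica", "ex:teaches", "?course"), data)
taught_by = construct(
    ("?course", "ex:taughtBy", "?person"),
    ("?person", "ex:teaches", "?course"),
    data,
)
```

The Construct call returns triples rather than rows, so its output could be loaded straight back into a store.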

Because triplestores use URIs to reference data, queries can be run from public SPARQL endpoints such as one provided by the UK government. These public endpoints and the use of URIs mean that any accessible data can easily be joined to any other accessible data.
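
As a rough sketch of how a client might address such an endpoint, the SPARQL Protocol lets a query be sent as the `query` parameter of an HTTP GET request, with the `Accept` header selecting the result format. The endpoint URL below is a placeholder, not a real service, and the request is built but deliberately not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint; substitute the URL of a real public SPARQL endpoint.
ENDPOINT = "https://example.org/sparql"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 5
"""

def build_request(endpoint, query):
    """Build a SPARQL Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_request(ENDPOINT, QUERY)

# Sending is left out so the sketch runs offline. With a real endpoint:
#   import json
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       results = json.load(response)
```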

Use Cases for Triplestores

The graph-based nature of triplestores makes them good for reasoning about data where the relationship between items in the database is important. This enables reasoning about the relationships in social networks and tying social network analysis to temporal and geospatial indexing to identify trends that support marketing efforts.

Triplestores are also widely used for Semantic Analysis, such as text mining. Sentences can be diagrammed as graphs, and Machine Learning algorithms can disambiguate and reason about the entities. Triplestores are also used to link between structured databases and documents of unstructured data. Their ability to support inferencing makes triplestores a key technology behind search and discovery applications.

Because the Semantic relationships between entities are understood, triplestores can support smart searches that bring back related results, not just items that are exact matches to the query. This enables functionality such as search term expansion and intelligent recommendation engines.
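
Search term expansion can be sketched as a walk over the relationships attached to a term. The example borrows the SKOS property names `skos:narrower` and `skos:altLabel`, but the vocabulary itself is invented and the modeling is simplified (plain strings stand in for both concepts and labels).

```python
def expand_term(term, triples, predicates=("skos:narrower", "skos:altLabel")):
    """Return the term plus everything reachable through the given predicates."""
    expanded = {term}
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p in predicates and o not in expanded:
                expanded.add(o)
                frontier.append(o)
    return sorted(expanded)

# Invented vocabulary, for illustration only.
vocabulary = [
    ("database", "skos:narrower", "triplestore"),
    ("database", "skos:narrower", "relational database"),
    ("triplestore", "skos:altLabel", "RDF store"),
    ("relational database", "skos:altLabel", "RDBMS"),
    ("spreadsheet", "skos:altLabel", "worksheet"),
]
terms = expand_term("database", vocabulary)
```

A search for "database" would then also look for "triplestore", "RDF store", and the other related terms, while unrelated entries such as "worksheet" stay out.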

Triplestores aren’t appropriate for transactional applications that require updating sets of data. In many cases, the most effective strategy is to combine triplestores with legacy databases, using the triplestore’s inferencing capability to add depth to query results. This is achieved by using the triplestore to create smart Metadata that describes the contents of the relational database.

Implementations of Triplestores

There are many available implementations of triplestores. Triplestores can be compared using RDF benchmarks, with several common ones listed here. Some of the most popular triplestores in use include:

  • AllegroGraph is a native implementation of triplestores from Franz. Along with queries in SPARQL, it enables access from Java, Python, Lisp, plus HTTP client requests from languages including Ruby and Python.
  • Apache Jena is an open-source framework for applications using Semantic Web and linked data. Its components include TDB, which is a native triplestore, plus SPARQL and API access.
  • BlazeGraph offers a scalable graph database designed for high performance. It offers multiple versions, including high availability and embedded support options.
  • MarkLogic is a native triplestore designed to provide enterprise-grade functionality. It provides high availability and disaster recovery, horizontal scalability, and role-based security to limit access to triples.
  • Oracle Spatial and Graph offers a native RDF graph designed for linked data and social network analysis. Data can be queried with SPARQL and Java APIs. RDF views can be used to support semantic analysis of relational data.
  • Eclipse rdf4j, formerly known as Sesame, is an open source Java framework for working with RDF data. It provides both in-memory and native data stores, in addition to APIs for integrating with third-party RDF databases.
  • GraphDB is a graph database built on Sesame. It offers a free edition, a standard edition available in the cloud or as a local installation, and an enterprise edition that supports high availability.
  • Virtuoso is available in both open source and commercial editions, with the commercial edition providing replication and a virtual database engine. Virtuoso provides both SQL and RDF capabilities in a single server.
  • 3Store is an open source triplestore implementation using MySQL as the backing database. RDF data can be queried using SPARQL at the command line, via HTTP, or through a C-language API.

Due to the standardization and portability triplestores provide, selecting a triplestore database doesn’t have to be a long-term commitment. Users can start with a free, open source implementation to experiment and begin working with triplestores, and then upgrade to a vendor-supported, enterprise-grade implementation once their projects and needs have evolved. As a result, there’s no reason for companies to delay experimenting with triplestores to see what they can contribute to their Analytics projects.
