In order for the Semantic Web to become a reality and a success, there needs to be data on the web published as Linked Data. However, data on the web is not a new thing. People have been publishing raw data for a long time as XML, CSV or even spreadsheets. Data can also be accessed through APIs. But where does most of the data on the web come from? Relational Databases!
In 2007, it was estimated that Internet-accessible databases contained up to 500 times more data than the static web, and that roughly 70% of websites were backed by relational databases. The quantity of data suggests that the success of the Semantic Web depends on developing methods for making relational databases accessible to the Semantic Web.
In October 2007, the W3C hosted the first workshop on RDF access to Relational Databases.
This led to the creation of the RDB2RDF Incubator group, which operated during 2008 and 2009. The objective was to survey existing approaches to map relational databases to RDF and decide whether a standard RDB2RDF mapping language was needed and/or possible. The output of this group was a recommendation to create a group in order to standardize an RDB2RDF mapping language.
The RDB2RDF Working group started in 2009 with the objective of standardizing “a language for mapping relational data and relational database schemas into RDF and OWL, tentatively called the RDB2RDF Mapping language, R2RML”.
Why do we need RDB2RDF?
The need to map relational data to RDF is increasing. With the rise of Linked Data, more and more people want to publish their data on the web following the Linked Data principles and most probably the data is in relational databases. RDF can also be used for data integration. Using a common standard data model, with a standard query language (SPARQL) is very attractive.
The two use cases for RDB2RDF are publishing relational data as RDF on the web and combining relational data with existing RDF.
Use Case 1: This use case exemplifies the desire of people wanting to join the Semantic Web by publishing their data as RDF and offering a SPARQL endpoint to their database. The next step would be to create links from their dataset to other RDF datasets on the web; however, this is not in the scope of RDB2RDF.
Use Case 2: This use case is oriented to data integration. This can be divided into three sub use cases where we would like to combine our relational data with:
- Structured data (relational databases, spreadsheets, CSV, etc.)
- Existing RDF data on the web (Linked Data)
- Unstructured data (HTML, PDF, etc.)
We assume that the other sources we would like to combine have already been converted to RDF.
RDB2RDF can be implemented in two ways. First, relational data can be physically converted to RDF in an ETL (Extract-Transform-Load) process and then stored in an RDF triple store. The triple store can also hold other RDF data from different sources and would have all the data integrated. An advantage of this approach is that it is a straightforward and fast way to achieve data integration. The disadvantage is clear: you are keeping a separate copy of the relational data. If you want to integrate your active and dynamic relational database with other sources, this may not be the most appealing approach. Furthermore, we are dependent on the existing RDF triple stores and their scalability, however that is a different topic.
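To make the ETL approach concrete, here is a minimal sketch of what such a conversion does, using only Python's standard library. The table, base URI and vocabulary URI are hypothetical examples, not part of any particular RDB2RDF tool: each row becomes a subject URI built from its primary key, and each non-key column becomes a predicate, emitted as N-Triples.

```python
import sqlite3

# Hypothetical example database: a single "person" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Alice', 'alice@example.org')")

BASE = "http://example.org/person/"   # assumed base URI for row subjects
VOCAB = "http://example.org/vocab#"   # assumed vocabulary URI for column predicates

def table_to_ntriples(conn, table, key):
    """Emit one N-Triples line per (row, non-key column) pair."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [c[0] for c in cursor.description]
    for row in cursor:
        subject = f"<{BASE}{row[columns.index(key)]}>"
        for col, value in zip(columns, row):
            if col != key:
                yield f'{subject} <{VOCAB}{col}> "{value}" .'

triples = list(table_to_ntriples(conn, "person", "id"))
for t in triples:
    print(t)
```

A real ETL tool would of course also handle foreign keys (as object properties between subjects), datatypes and escaping, and would load the output into a triple store rather than print it.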
A different approach is not to materialize the relational data as RDF and instead leave the relational data where it belongs: in the relational database. Creating a mapping between the relational data and RDF can allow on-the-fly SPARQL queries on top of a relational database. In other words, a SPARQL query is translated to SQL, which is then executed on the relational database. This approach works for Use Case 1: simply publishing relational data on the web as RDF and Linked Data, through a SPARQL endpoint. However, we are not achieving the goals of Use Case 2 because we are just querying our relational database with SPARQL. We would have to enable federated SPARQL queries over different SPARQL endpoints, which again is a different topic. However, an experimental approach that achieves some level of data integration is presented by the work of Hartig et al. and in SQUIN, which allows querying the web of linked data as if it were a database. This approach enables a SPARQL query to execute over different data sources by following the links between RDF data on the web. This is an experimental but exciting approach; however, it also deserves another post by itself.
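The core idea of the on-the-fly approach can be sketched in a few lines. The mapping table below is a hypothetical illustration (predicate URI to table/column); given a single SPARQL triple pattern, the translator produces the SQL that answers it. Real query translators such as those in D2RQ or Ultrawrap handle joins across patterns, filters, optional patterns and proper escaping; this sketch covers only the simplest case.

```python
# Hypothetical predicate-to-column mapping for a single "person" table,
# where the "id" primary key forms the subject URI.
MAPPING = {
    "http://example.org/vocab#name": ("person", "name"),
    "http://example.org/vocab#email": ("person", "email"),
}

def translate(predicate, obj=None):
    """Translate the SPARQL pattern { ?s <predicate> ?o } -- or the same
    pattern with a bound literal object -- into SQL over the mapped table."""
    table, column = MAPPING[predicate]
    sql = f"SELECT id, {column} FROM {table}"
    if obj is not None:
        sql += f" WHERE {column} = '{obj}'"  # naive quoting, illustration only
    return sql

print(translate("http://example.org/vocab#name"))
# a pattern with a bound object, e.g. { ?s vocab:name "Alice" }:
print(translate("http://example.org/vocab#name", "Alice"))
```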
Existing RDB2RDF Tools
RDB2RDF approaches and tools have been presented over the last 5 years. The RDB2RDF Incubator group presented a survey on these existing tools. We will now present a brief list and short description of some RDB2RDF tools:
Asio Semantic Bridge for Relational Databases
Asio’s SBRD enables integration of relational databases into the Semantic Web by allowing SPARQL queries over the relational database. An initial OWL ontology is generated from the database schema, which can then be mapped to a defined domain OWL ontology.
D2RQ consists of a mapping language between relational database schemas and RDFS/OWL ontologies. The D2RQ platform creates an RDF view of the relational database, which can be accessed through Jena, Sesame and the SPARQL query language. Additionally, using D2R Server, the relational database can be accessed via the Web through the SPARQL protocol and as Linked Data. The first release of DBpedia in 2007 was done using D2R Server.
Metatomix Semantic Platform
Metatomix’s Semantic Platform allows mapping a relational database to an ontology and outputting the relational data as RDF. The mapping is done through a graphical Eclipse plugin. Other structured sources can map to the same ontology, allowing data integration under the same ontology.
RDBtoOnto is an automatic tool that generates a populated ontology in RDFS/OWL from a relational database, acting as an ETL tool. This automated tool also provides a user interface that allows specific configurations.
SquirrelRDF is a tool that allows relational databases to be queried using SPARQL. This tool takes a simplistic approach by not performing any complex model mapping like D2RQ.
Triplify is a lightweight plug-in that exposes relational database data as RDF and Linked Data on the Web. There is no SPARQL support. The desired data to be exposed is defined in a series of SQL queries. Triplify is written in PHP and has been adapted to several popular web applications (WordPress, Joomla, osCommerce, etc.).
ODEMapster is a plugin for the NeOn toolkit, which provides a GUI to manage mappings between the relational database and RDFS/OWL ontologies. The mappings are expressed in the R2O language.
Oracle Database 11g supports RDF, RDFS and OWL data management as a native triple store. It also integrates relational data with other RDF data and is able to combine SQL queries over relational data with RDF graphs and ontologies stored together. It also provides support for Jena.
Ultrawrap is a tool that automatically exposes relational databases as RDF and allows them to be queried using SPARQL. An OWL ontology is generated, which can then be mapped to a domain OWL ontology through a GUI. This tool makes maximal re-use of existing commercial SQL infrastructure by letting the SQL optimizer do the SPARQL query execution. This tool will be released in summer 2010.
Virtuoso RDF Views maps relational data into RDF and allows SPARQL queries to be executed over the relational database and at the same time with a local RDF store, enabling integration of relational and RDF data.
Future of RDB2RDF
Depending on the use case, RDB2RDF is either the end or the means to an end. If you are interested in getting your relational database on the Semantic Web as Linked Data, then you can use existing RDB2RDF tools that will get you to that goal very quickly. However, if your goal is data integration, RDB2RDF is a means to an end: you can achieve data integration once your data is in RDF, and RDB2RDF will get you there.
RDB2RDF has a very exciting future in the short term. The W3C’s RDB2RDF Working Group recently published a Use Cases and Requirements document. The next step is to start working on a standardized mapping language. One of the interesting challenges in the long term is to see the adoption of this standard by the major database vendors.
Disclaimer: The views of this blog post reflect only the author’s view and do not reflect the views of the W3C RDB2RDF Working Group.