Time to reflect on Google Refine insofar as it relates to the RDF extension produced by DERI Researcher/Linked Data Technologist Richard Cyganiak and DERI Research Master Student Fadi Maali, which debuted a couple of months back. The RDF extension adds a graphical user interface for exporting Refine project data in RDF format, and the pairing is worth a closer look by the enterprise community.
Refine – the messy-data cleanup tool once known as Freebase Gridworks, prior to Google’s acquisition of Metaweb – has an established reputation in open government data and journalism communities, and the RDF extension certainly has application there. But the Excel-laden enterprise should also consider how it can profit from the matched capabilities. Think of how much critical business information is locked inside uncommunicative Excel spreadsheets – reams and reams of them – and individual databases. Refine presents an opportunity to take in that data and clean it up (something Excel itself isn’t particularly focused on) and, with the RDF extension, to free it from its silos, integrate it with other data sets, and just plain open the door to getting a whole lot more use out of it.
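To make the spreadsheet-to-RDF idea concrete, here is a minimal Python sketch that turns rows of tabular data into N-Triples statements, one per cell. It is not how the RDF extension itself works (the extension is driven through a graphical interface inside Refine); the `http://example.org/` namespace, the column names and the sample rows are all assumptions invented for illustration.

```python
import csv
import io

# Hypothetical namespace and sample data, invented for illustration only.
BASE = "http://example.org/"

SAMPLE_CSV = """id,name,city
1,Acme Corp,Galway
2,Widget Ltd,Dublin
"""


def escape_literal(value):
    """Escape backslashes, double quotes and line breaks for an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def rows_to_ntriples(csv_text, key_column="id"):
    """Emit one N-Triples line for every non-empty, non-key cell in the table."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The key column identifies the row, so it becomes part of the subject IRI.
        subject = f"<{BASE}customer/{row[key_column].strip()}>"
        for column, cell in row.items():
            value = (cell or "").strip()
            if column == key_column or not value:
                continue
            # Each remaining column header becomes a predicate in a made-up vocabulary.
            predicate = f"<{BASE}vocab/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


if __name__ == "__main__":
    print("\n".join(rows_to_ntriples(SAMPLE_CSV)))
```

Each row becomes a subject, each column header a predicate and each cell a literal value. That is the basic shape that lets data from separate spreadsheets be loaded into one triple store and queried together.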
This can be an important piece, then, of the bigger notion of the enterprise data space, where there’s a single way to access, query and search data that now lives in its own little pockets. “It seems RDF is a great technology for implementing this sort of abstract idea of the enterprise data space,” Cyganiak says.
Surfacing information from integrated RDF data stores in dashboards and custom portals for average business users will require more than the piece of the equation that Refine and the RDF extension offer, but it’s an important step toward that goal. “The extension is part of a bigger story to make [a] big difference,” he says.
There’s more motivation now than ever to get spreadsheet data into RDF triple stores. In the last few years, Cyganiak points out, triple stores have become easier to set up, more scalable, and more stable, and that’s been accompanied by an “immensely more useful” SPARQL 1.1 language for querying data sets. SPARQL 1.1 is still officially a work in progress, he notes, but several vendors have preliminary support for it.
“Getting all your data into triple stores gets easier and with SPARQL you really get interesting and powerful interrogation of this data,” he says. One stumbling block remains even after cleansed spreadsheet information has been converted to RDF and integrated: SPARQL is complex. It’s a good tool for software developers and data analysts, but the general business user “would benefit a lot from a good user interface that makes it easy to create their own reports and show information on dashboards. … Easily building powerful interfaces, visualization and so on over triple stores is something where we see a big opportunity to make a big difference.”
The good news for anyone heading in this direction is that, so far, Cyganiak hasn’t experienced or heard of any problems converting data sets – including spreadsheets with thousands of rows – from Google Refine to RDF. “That’s not a big data set in the big picture but it is a big spreadsheet task,” he says. “In our experience it works and it scales. This is not the technology for processing millions of records, no, but spreadsheets you download from somewhere or just have on the desktop, they will work.”
Planned for this month is an update for the RDF extension that will make it possible to reconcile data from a spreadsheet against a reference data set before conversion. When users put together data from a number of sources, it’s important that the data lines up on shared entities, meaning that references to them are reconciled so that you can connect Person A on one spreadsheet to data about Person A on another, for instance. “You want to reconcile these sort of references to the same entity from different data sources,” he says. With the update it will be easier to match them against an existing master data list, such as a master list of customers.
Says Cyganiak, “This will help the task of making sure that multiple data sources actually match up once you’ve converted to RDF.”
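The reconciliation idea can be illustrated with a small Python sketch that matches differently spelled names against a master list. It is not the mechanism the planned update uses, which the article does not describe. It substitutes simple string similarity from the standard library’s `difflib`, and the master list, the identifiers and the 0.8 cutoff are assumptions made up for the example.

```python
import difflib

# Hypothetical master customer list (canonical name -> identifier), invented for illustration.
MASTER = {
    "Acme Corporation": "http://example.org/customer/1",
    "Widget Limited": "http://example.org/customer/2",
}


def normalize(name):
    """Lower-case the name, drop periods and commas, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def reconcile(name, master=MASTER, cutoff=0.8):
    """Return the identifier of the closest master entry, or None if nothing clears the cutoff."""
    by_normalized = {normalize(key): ident for key, ident in master.items()}
    matches = difflib.get_close_matches(
        normalize(name), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[matches[0]] if matches else None


if __name__ == "__main__":
    for raw in ["ACME Corporation.", "Acme Corporaton", "Globex"]:
        print(raw, "->", reconcile(raw))
```

Once two spreadsheets resolve “ACME Corporation.” and “Acme Corporaton” to the same identifier, the RDF produced from each will describe the same entity. Names that match nothing come back as `None` so they can be reviewed by hand instead of being guessed.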