[EDITOR’S NOTE: Recently, we reported on the creation of a semantic data wrapper for the GoogleArt project. At the time, the wrapper only offered data for individual paintings and there was no good way to access the full data set. In this deeply technical guest post by the wrapper’s creator, Christophe Guéret, he outlines how to grab the full data set.
If you do something interesting with this data, we would love to hear about it! Leave a comment below.]
Some weeks ago, a first version of a wrapper for the GoogleArt project from Google was put online (see also this blog post).
This wrapper, initially offering semantic data only for individual paintings, has now been extended to museums. The front page of GoogleArt is also available as RDF, providing a machine-readable list of museums. This index page makes it possible, and easy, to download an entire snapshot of the data set, so let's see how to do that.
Downloading the data set from a wrapper
Wrappers around web services offer an RDF representation of content available at the original source. For instance, the SlideShare wrapper provides an RDF representation of a presentation page from the SlideShare web site. The GoogleArt wrapper takes the same approach for paintings and museums listed on the GoogleArt site. Typically, these wrappers work by mimicking the URI scheme of the site they are wrapping: changing the hostname (and part of the path) of the original resource's URL to that of the wrapper gives you access to the corresponding data.
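As a sketch of that URI rewriting (the painting path below is an illustrative assumption, not a documented mapping; only the wrapper hostname comes from this post), the transformation amounts to a string substitution:

```shell
# Illustrative only: the museum/painting path is hypothetical.
ORIGINAL="http://www.googleartproject.com/museums/moma/the-starry-night"
# Swap the original hostname for that of the wrapper to de-reference the RDF:
WRAPPED=$(printf '%s\n' "$ORIGINAL" | sed 's|www.googleartproject.com|linkeddata.few.vu.nl/googleart|')
printf '%s\n' "$WRAPPED"
```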
From a linked data perspective, wrappers do a valid job at providing de-referenceable URIs for the entities they describe. However, the "de-referencing only" scheme makes them more difficult to query. Wrappers don't offer SPARQL endpoints, as they don't store the data they serve; that data is computed on the fly when the URIs are accessed. To query a wrapper, one has to rely on a service that harvests and indexes the different documents, much like the way Web documents are found, for which the semantic web index Sindice is the state-of-the-art solution.
But such an external indexing service may not provide you with the entire set of triples, nor allow downloading big chunks of its harvested data. In that case, the best way to get the entire dataset locally is to use a spider to download the content published under the different URIs. LDSpider, an application developed by Andreas Harth (AIFB), Juergen Umbrich (DERI), Aidan Hogan and Robert Isele, is the perfect tool for doing that. LDSpider crawls linked data resources, storing the triples it finds in an N-Quads file. N-Quads are triples extended with a fourth term naming the graph they come from; LDSpider thereby keeps track of the source of each triple in the final result.
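To make the N-Quads format concrete, here is a toy line (the painting URI and title are made up, not actual wrapper output): the fourth term names the graph, i.e. the document the triple was crawled from, which is how LDSpider records provenance:

```shell
# A single illustrative N-Quad: subject, predicate, object, graph.
QUAD='<http://example.org/painting1> <http://purl.org/dc/terms/title> "Sunflowers" <http://linkeddata.few.vu.nl/googleart/index.rdf> .'
# The graph term is the next-to-last whitespace-separated field (before the final ".").
printf '%s\n' "$QUAD" | awk '{print $(NF-1)}'
```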
Using a few simple commands, it is possible to harvest all the triples published by the GoogleArt Wrapper. As of the time of writing, there seemed to be a bug with the latest release of LDSpider (1.1d) that prevented us from downloading the data. However, everything worked fine with the trunk version, which can be downloaded and compiled as follows:
svn checkout http://ldspider.googlecode.com/svn/trunk/ ldspider-read-only
cd ldspider-read-only
ant build
Once we have LDSpider ready to go, point it to the index page “-u http://linkeddata.few.vu.nl/googleart/index.rdf”, ask for a load-balanced crawl “-c” and request to stay within the same domain name “-y” as the starting resource. This last step is very important! Since the resources published by the wrapper are connected to DBpedia resources, omitting the “-y” would allow the crawler to download the content of the resources pointed to in DBpedia, then the content of the resources DBpedia points to, and so on… Set the last parameter to the name of the output file “-o data.nq” and you are ready to go:
java -jar dist/ldspider-trunk.jar -u http://linkeddata.few.vu.nl/googleart/index.rdf -y -c -o data.nq
After some time (24 minutes in our case), you get a file with all the data, plus some header quads with extra information about each downloaded resource:
_:header1087646481301043174989 .
_:header1087646481301043174989 "200"^^ .
_:header1087646481301043174989 "Fri, 25 Mar 2011 08:51:04 GMT" .
_:header1087646481301043174989 "TornadoServer/1.0" .
_:header1087646481301043174989 "5230" .
_:header1087646481301043174989 "application/rdf+xml" .
_:header1087646481301043174989 "Keep-Alive" .
To filter these out and get only the data contained in the document, simply use a grep:
grep -v "_:header" data.nq > gartwrapper.nq
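To see what the grep does, here is a self-contained toy run (the quads are made up for illustration): the crawler's header triples share the "_:header" blank-node prefix, so "grep -v" drops them while genuine data quads survive:

```shell
# Two made-up quads: one data triple and one crawler header triple.
cat > sample.nq <<'EOF'
<http://example.org/a> <http://example.org/p> "x" <http://example.org/g> .
_:header1 <http://example.org/status> "200" <http://example.org/g> .
EOF
# Drop the header blank nodes, keeping only the data:
grep -v "_:header" sample.nq > filtered.nq
cat filtered.nq
```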
The final document “gartwrapper.nq” contains around 37k triples, of which 1.6k are links to DBpedia URIs. More information about the data set is available through its CKAN package description. That description also contains a link to a pre-made dump.
This download technique is applicable to downloading the content provided by any wrapper or data set for which only de-referenceable URIs are provided. However, we should stress that completeness depends on a seed URI listing all (or most of) the published resources: the spider works by following links, so be sure to start from well-connected resources. If several seeds are needed to cover the entire data set, either repeat the same process starting from each of them or use the dedicated option from LDSpider (“-d”).
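For the multi-seed case, one way is to keep the seeds in a plain file and iterate over them; note that the exact behaviour of LDSpider's "-d" option is an assumption on our part here, so its invocation is only sketched in a comment:

```shell
# Hypothetical seed list: one URI per line (add further seeds as needed).
cat > seeds.txt <<'EOF'
http://linkeddata.few.vu.nl/googleart/index.rdf
EOF
# Sketch (not run here): either pass the list via LDSpider's dedicated option,
#   java -jar dist/ldspider-trunk.jar -d seeds.txt -y -c -o data.nq
# or iterate the single-seed command over each line:
while read -r seed; do
  printf 'would crawl: %s\n' "$seed"
done < seeds.txt
```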