The European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL), Europe’s leading life sciences laboratory, launched a new RDF platform this fall that hosts data from six of the public database archives it maintains. That includes peer-reviewed and published data, submitted through large-scale experiments, from databases covering genes and gene expression, proteins (with SIB), pathways, samples, biomodels and molecules with drug-like properties. And next week, during a competition at SWAT4LS in Edinburgh, it’s hoping to draw developers with innovative use case ideas for life-sciences apps that can leverage that data to the benefit of bioinformaticians or bench biologists.
“We need developers to build apps on top of the platform, to build apps to pull in data from these and other sources,” explains Andy Jenkinson, Technical Project Manager at EMBL-EBI. “There is the potential using semantic technology to build those apps more rapidly,” he says, as it streamlines the integration of biological data, a huge challenge given the data’s complexity and variety. Such apps would also be a great help to lab scientists who are not familiar with working directly with RDF data and SPARQL queries.
EBI maintains a remote data access layer for software developers who want to build applications on its data, and that layer stands to benefit the most from bringing semantic technology into the picture, says Jenkinson. “That’s where, in theory, the promise of semantic web technologies comes in. If you manage to get all different disciplines to provide data in a common, standardized, semantically unified way, then you can really create these cross-domain queries, that you can ask questions of the data that would be very difficult to answer using the more isolated approach,” he says.
“You don’t need to download all the expression data, all the chemical data, all the biological data. You can just run queries across different datasets because we did the work of aligning them to each other.”
Kickstarting the integration of discrete biomolecular data is important to the scientific goal of getting a complete picture of what whole systems are doing and of better understanding how they can go wrong. Bringing together data that now lives in separate communities can also make it easier and faster to find connections among elements such as proteins, genomes, gene expression and drugs, connections that in turn enable drug repurposing. “We can try to do this in a more direct way – to use knowledge that we already have, to connect the information we already have so you can better derive new meaning from it,” says Jenkinson.
The six datasets the RDF platform project has started with – UniProt (9 billion triples alone), ChEMBL, Expression Atlas, Reactome, BioSamples and BioModels – were chosen partly to ensure that a variety of data will be represented. “These are intended to be production-quality services. The data in the RDF platform is released around the same time as we release the primary [raw] data set,” he says. “We do want to expand with more EBI data but we need to see how people are using this first of all.”
So far, things are playing out pretty well. “They are being used, more than we expected, which is a nice problem to have,” he says. At the end of two years the metrics will be reviewed and a decision will be made about moving forward with a wider rollout. Getting the first six datasets up and running required building the core hardware and software ecosystem, of course, so the stage is well-set for the next teams at the institute that would like to add their data to the mix. Overall, EBI has some 35 petabytes of storage used for molecular biology data in its care.
For a look at some example queries possible with the current RDF resources, go here.
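For readers who have not worked with SPARQL, here is a rough sketch in Python of what preparing such a query might look like. It is only an illustration under stated assumptions: the endpoint address and the class IRI are placeholders rather than real EBI addresses, and the function names are made up for this example. The one fixed convention it relies on is that the SPARQL 1.1 Protocol lets a client send the query text to an endpoint in a `query` URL parameter.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption for illustration, not a real EBI address.
ENDPOINT = "https://example.org/sparql"


def build_query(class_iri: str, limit: int = 10) -> str:
    """Return a SPARQL SELECT query listing items of one class with their labels.

    class_iri is a hypothetical class identifier supplied by the caller.
    """
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?item ?label WHERE {\n"
        f"  ?item rdf:type <{class_iri}> ;\n"
        "        rdfs:label ?label .\n"
        "}\n"
        f"LIMIT {int(limit)}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as the 'query' parameter of an HTTP GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Hypothetical class IRI; a real dataset would publish its own vocabulary.
    q = build_query("http://example.org/Protein", limit=5)
    print(build_request_url(ENDPOINT, q))
```

The sketch stops short of sending the request, since the right address and vocabulary depend on the dataset being queried.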