The GO Browse Genomic Data Browser application that took top honors at the recent Tetherless World Constellation hackathon, co-sponsored by Elsevier, should shortly be available as a live demo. It’s on the to-do list for Jim McCusker, the PhD student at TWC and part-time software developer at the Yale University School of Medicine who created the application as a visual way to browse linked medical datasets on the genetics of cancer.
The data sources included comparisons of different cancers based on cell lines curated by the National Cancer Institute. “Basically, it measures the level of gene expression for every gene in the human genome,” says McCusker of the data. “The great thing is you can then do automated differential gene expression, so you can do statistical tests to see what genes are significantly expressed from one cancer to the rest.” GO Browse presents this information visually, showing which categories of genes, grouped by cell process, are more differentially expressed.
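To make the "one cancer versus the rest" comparison concrete, here is a minimal sketch of a differential expression score. It is not the analysis GO Browse or GenePattern actually runs: the choice of Welch's t-statistic, the gene names and the expression values are all assumptions made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def rank_genes(expression):
    """Order genes by how strongly the target cancer differs from the rest."""
    scores = {
        gene: welch_t(values["target"], values["rest"])
        for gene, values in expression.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented toy values: expression of each gene in one cancer ("target")
# and in all the other cancers ("rest"). Gene names are placeholders.
expression = {
    "GENE_A": {"target": [8.1, 7.9, 8.3], "rest": [5.0, 5.2, 4.9, 5.1]},
    "GENE_B": {"target": [6.0, 6.1, 5.9], "rest": [6.0, 5.8, 6.2, 6.1]},
}

ranked = rank_genes(expression)
for gene, score in ranked:
    print(f"{gene}: t = {score:.2f}")
```

A real pipeline would also turn each statistic into a p-value and correct for the thousands of genes tested at once; the sketch stops at the ranking step.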
McCusker downloaded the cancer-gene data; used the GenePattern genomic analysis platform for gene expression analysis, RNA-sequence analysis and so forth; employed Tim Lebo’s csv2rdf4lod tool to convert data about genes and gene assignments from the National Center for Biotechnology Information (NCBI) into the Gene Ontology; and converted the actual expression data using a set of Python scripts he wrote for that purpose. Infrastructure elements behind the work include SADI for Python and the Linked Open Biomedical Data (LOBD) project, which incorporates a dozen NCBI datasets with hundreds of millions of RDF triples.
The csv2rdf4lod conversion tool, McCusker says, provides high-quality conversions of conventional spreadsheet data into Linked Open Data, and furthermore does it in a transparent way: the data provenance is available with the actual data. “So it’s very valuable for discovering new data and finding ways of tying it together,” McCusker says.
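The general idea of turning spreadsheet rows into triples while keeping provenance alongside the data can be sketched in a few lines of Python. This is not csv2rdf4lod's actual output, configuration or vocabulary: the `example.org` namespace, the column names and the `csv_to_ntriples` helper are hypothetical, and the W3C PROV `wasDerivedFrom` property is used here only as one plausible way to record where a row came from.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace, for illustration only
PROV_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def csv_to_ntriples(csv_text, source_uri, id_column):
    """Convert CSV rows to N-Triples lines, one provenance triple per row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}gene/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is already encoded in the subject
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} <{BASE}vocab/{column}> "{escaped}" .')
        # Record which source file this row was derived from.
        triples.append(f"{subject} <{PROV_DERIVED_FROM}> <{source_uri}> .")
    return triples


# A single illustrative row; the layout is an assumption, not NCBI's format.
sample = "gene_id,symbol,go_term\n7157,TP53,GO:0006915\n"
triples = csv_to_ntriples(sample, "http://example.org/source/genes.csv", "gene_id")
print("\n".join(triples))
```

Because every subject carries a link back to its source file, anyone consuming the triples can trace a statement to the spreadsheet it was derived from.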
The nested bubble-packing graphics of the app (which McCusker emphasizes is not yet complete) indicate high-level categories within each visualization of certain types of cancers, such as apoptosis in adenocarcinoma, from which users can drill down into the areas of a genome where cancers are active in terms of expressing RNA. “A lot of what a cell does is based on how much RNA for a particular protein is expressed at a given time, and that is one way of measuring the activity of a cell,” McCusker explains.
The Future of Semantic Web and Life Sciences Data
Semantic web technologies, McCusker thinks, offer an opportunity to provide more context to the wealth of high-throughput experimental data that already exists in open formats on the web. The National Institutes of Health, he notes, requires the researchers it funds to make their articles openly available, along with the data behind them (without violating any privacy requirements, of course). To this end, researchers follow the public database guidelines most appropriate to the data in question (MIAME compliance, for instance, for Gene Expression Omnibus submissions), but the actual data is only partly standardized, he says. Work is underway to republish biomedical databases as Linked Data, and McCusker is involved in efforts to improve the ontological aspects of projects such as Bio2RDF.
He’s hopeful that the GO Browse technology will be useful for a number of purposes. The framework can be applied to any medical data set where there are comparisons of two or more states. A computational biologist could take a tool like this, add her own data, and create a quality interactive visualization to embed in an online version of an article she wrote. Or, for that matter, do the same simply to better understand her own data.
Equally important, those individuals don’t themselves have to be semantic web experts. McCusker says a lot of computational biologists use Python for programming, so GO Browse’s front-end accommodates that, giving them a good opportunity to publish their data and algorithms in a new way using technology with which they are already familiar. Because the SADI for Python implementation uses the SuRF RDF library, they can work with Python objects directly and let the back-end infrastructure take care of managing the object-to-RDF mappers.
“They don’t have to worry about the particulars of RDF,” he says.
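What that looks like from the programmer's side can be suggested with a small stand-in. The sketch below does not use SuRF or SADI and does not reproduce their APIs; the `GeneResult` class, its field names, its values and the `example.org` namespace are all invented to show the general pattern of writing ordinary Python and letting a mapping layer produce the triples.

```python
from dataclasses import dataclass, fields

BASE = "http://example.org/"  # hypothetical namespace, for illustration only


@dataclass
class GeneResult:
    """A plain Python object; each field name doubles as a predicate name."""

    gene_id: str
    symbol: str
    t_statistic: float

    def to_triples(self):
        """Map every field to a (subject, predicate, object) tuple."""
        subject = f"{BASE}result/{self.gene_id}"
        return [
            (subject, f"{BASE}vocab/{f.name}", getattr(self, f.name))
            for f in fields(self)
        ]


# The biologist only touches the object; the numbers here are made up.
result = GeneResult(gene_id="7157", symbol="TP53", t_statistic=4.2)
triples = result.to_triples()
for triple in triples:
    print(triple)
```

The calling code never mentions RDF at all, which is the division of labor McCusker describes.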
McCusker would like to advance what he’s begun by creating a web forum for users to publish or upload their own data, and curate the experimental conditions into actual semantic web resources even if they don’t have programming experience. “I want it to be simple enough so that a savvy person can look at this and say, ‘I can knock off one of these things, too,’” he says. “The goal is to set an example that when you are doing computational biology and you do it using semantic web technology, and existing ontologies, and existing resources, that by itself becomes an advantage to your resource. You actually get more out of your data than someone who’s not doing that.”