One in 50 American children has autism, according to the latest figures released by the Centers for Disease Control and Prevention in March. One of the winning entries in the YarcData Graph Analytics Challenge, announced in April, could help researchers better understand the causes of the condition.
Taking second place in the competition, the work of Adam Lugowski, Dr. John Gilbert, and Kevin Dewesse of the University of California at Santa Barbara leveraged a dataset created for the Mayo Clinic Smackdown project. That dataset has the same structure, property types, and scale as the medical organization’s actual Big Data sets around autism, but uses publicly available data in place of the real thing. The team can’t use the real data because it includes private information about patients, diagnoses, prescriptions, and the like.
But the actual data deployed for the project doesn’t matter, says Lugowski. “The goal is to find relationships we have never thought of before, and this way it doesn’t prejudice the algorithm,” he says. Using YarcData’s uRiKA graph analytics appliance, the algorithm queries the Smackdown dataset – which in its smallest version has almost 40 million RDF triples and in its largest is about 100 times bigger, mirroring the size of all the Mayo Clinic’s actual autism data – to discover commonalities, mimicking how the real data sets could be queried in search of common precursors among clusters of patients who share the diagnosis.
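RDF triples are simply subject–predicate–object statements, and the commonality search the article describes can be sketched in a few lines of Python. The patient names and predicates below are made up for illustration – they are not the Smackdown dataset’s actual schema – but the idea is the same: find attributes shared by patients who have the diagnosis.

```python
from collections import Counter

# Hypothetical triples in (subject, predicate, object) form, standing in
# for the RDF structure of a dataset like Smackdown's.
triples = [
    ("patient1", "hasDiagnosis", "autism"),
    ("patient1", "tookMedication", "medA"),
    ("patient2", "hasDiagnosis", "autism"),
    ("patient2", "tookMedication", "medA"),
    ("patient3", "hasDiagnosis", "other"),
    ("patient3", "tookMedication", "medB"),
]

# Patients with the diagnosis of interest
diagnosed = {s for s, p, o in triples
             if p == "hasDiagnosis" and o == "autism"}

# Count how often each other attribute co-occurs with the diagnosis
shared = Counter((p, o) for s, p, o in triples
                 if s in diagnosed and p != "hasDiagnosis")
# Here ("tookMedication", "medA") co-occurs with both diagnosed patients
```

A real query of this shape would be expressed in SPARQL against the triple store rather than scanned in memory, but the logic – join patients to the diagnosis, then aggregate over their other attributes – is the same.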
SPARQL itself posed a challenge for the project, however. “SPARQL by its nature has these very local queries with local relationships when you want to find those relationships in your data, but clustering is a very global algorithm,” he says. “You can look at the entire graph at once.”
The algorithm the team applied to the challenge is called Peer Pressure. Created a few years ago, it is based on the idea that if you are in a cluster, you probably want your neighbors to be in the same cluster, so you pressure them to join it. The team’s contribution was getting the algorithm to run on a SPARQL system by recasting it as an iterative algorithm. As Steve Reinhardt, solution architect at YarcData, writes in a blog post, “The notion of iteration is only obliquely supported in SPARQL.”
Says Lugowski: “We have a set of queries that runs one iteration of the algorithm, and then we wrote a script that will loop as many times as necessary.”
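Peer Pressure is simple enough to sketch. The pure-Python version below is an illustration, not the team’s SPARQL implementation, but it mirrors the structure Lugowski describes: one function performs a single iteration – every vertex adopts the cluster most common among its neighbors – and a driver loops until the assignments stop changing.

```python
from collections import Counter

def peer_pressure_step(graph, clusters):
    """One iteration: each vertex adopts the cluster that most of its
    neighbors belong to (ties broken by the smaller cluster id)."""
    new_clusters = {}
    for v, neighbors in graph.items():
        votes = Counter(clusters[u] for u in neighbors)
        # Highest vote count wins; lowest id wins ties, for determinism
        new_clusters[v] = min(votes, key=lambda c: (-votes[c], c))
    return new_clusters

def peer_pressure(graph, max_iters=100):
    """Run iterations until no vertex changes cluster (or max_iters)."""
    clusters = {v: v for v in graph}  # start: each vertex is its own cluster
    for _ in range(max_iters):
        updated = peer_pressure_step(graph, clusters)
        if updated == clusters:  # converged
            break
        clusters = updated
    return clusters

# Toy graph: two triangles joined by a single edge (adjacency lists)
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
result = peer_pressure(graph)
# The two triangles settle into two separate clusters
```

In the team’s setup, the body of `peer_pressure_step` is what the set of SPARQL queries computes on the appliance, and the surrounding loop is the external script.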
The YarcData uRiKA appliance was the right platform, first because it has enough memory to hold the data. “Just the sheer amount of data is quite something, and uRiKA has the advantage that it is meant for handling big data,” says Lugowski. “The second is that the unique processors it has can handle certain types of queries very well, and in particular they are very good at query joins. …We have a few joins in our queries so this makes the algorithms run fast on uRiKA.”
The team also tried to do the work on a standard off-the-shelf machine – by no means a slouch, with 32 cores and 128 GB of RAM – yet couldn’t even load the entirety of the smallest data set. “It’s extremely difficult on standard hardware,” Lugowski says, given that many of these relationships seem somewhat random.
The Mayo Clinic now has the team’s code to determine how it can be used with the real data; one remaining challenge is figuring out how best to present the results to scientists in a meaningful way. But the Mayo Clinic isn’t the only one that can take advantage of the capabilities the work brings to the table. YarcData has other customers who need the same ability to run global graph algorithms within the SPARQL framework, Lugowski says: “Their customers have asked for something like this, to do clustering, … to do these global graph algorithms.”
Indeed, writes Reinhardt in his blog, many such iterative algorithms “from the data-mining and machine-learning domains are relevant to the problems our customers want to solve on our uRiKA appliance with RDF and SPARQL.”