Early in 2011, I wrote a piece here on SemanticWeb.com which explored the relationship between Semantic Technologies and supercomputing's venerable rock star, Cray. Then, earlier this year, Cray spun out a new division to focus on exploring massive graph databases, something which should resonate with the semantic technology community. The new division, YarcData, differentiates itself quite clearly from Cray, leading with a data-led proposition and typically operating at quite a different price point to its eye-wateringly expensive parent.
I sat down with YarcData President Arvind Parthasarathi during the Semantic Technology & Business Conference in San Francisco, to get an update on YarcData and to hear why the company is investing $100,000 in prizes for a new 'Big Data Graph Analytics Challenge.'
RDF and SPARQL are certainly powerful foundations for the whole Semantic Web vision, but it's rare to see them discussed in positive terms beyond the Semantic Web community itself. YarcData, it seems, wants to change that. To them, SPARQL is the method of choice for querying massive graph databases. And, to them, "massive graph databases" aren't just full of Semantic Web enthusiasts' triples. No, to YarcData, massive graph databases are their differentiator when it comes to tapping enterprise IT buyers' apparently limitless enthusiasm (and budget) for Big Data. And that's a far more lucrative market than the Semantic Web community.
Most of the dominant solutions in the Big Data space today rely upon concepts like Map/Reduce. They're designed for commodity hardware, and the only way to fit massive data volumes onto commodity hardware is to chop the data up into pieces, share the pieces out across lots of machines, and then work out how to glue the results back together at the end. That's exactly what tools like Hadoop do, and they're actually pretty good at it. Typically, the problems being addressed are what Parthasarathi terms 'needle in a haystack' problems: you know what you're looking for (the needle) and you just keep searching until you find it.
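The chop, share, and glue pattern can be sketched in a few lines of Python. This is a toy, single-process illustration rather than anything Hadoop actually runs; the function names and the round-robin split are assumptions made for the example.

```python
from functools import reduce


def split_into_chunks(records, n_chunks):
    """Deal records out round-robin, standing in for spreading data across machines."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk, needle):
    """Map step: each 'machine' searches only its own chunk of the haystack."""
    return [record for record in chunk if record == needle]


def combine(left, right):
    """Reduce step: glue two partial result lists back together."""
    return left + right


def find_needle(records, needle, n_chunks=4):
    chunks = split_into_chunks(records, n_chunks)
    partial_results = [map_chunk(chunk, needle) for chunk in chunks]
    return reduce(combine, partial_results, [])


haystack = ["hay"] * 10 + ["needle"] + ["hay"] * 5
matches = find_needle(haystack, "needle")
```

Each chunk can be searched without knowing anything about the others, which is why this kind of problem spreads so comfortably across cheap machines.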
YarcData (and Cray) address a very different type of problem. Rather than facilitating search, they're interested in supporting discovery across complex graphs of data. Despite the huge size of Facebook (over 100 petabytes in just one of the company's dozens of Hadoop clusters, for example), it's not that significant a technical challenge to search through Facebook to find named people you know, or to pull back the profiles for all of your friends. It's far harder, Parthasarathi suggests, to tackle a graph problem with the same data. Imagine, for example, that you want to assemble a football team from amongst your Facebook friends. You need to select the set of Facebook users that are your friends, and then select a subset of those on the basis of various criteria such as location, age, or even interest in football. Graph problems become even more complex when the graph must be chopped up and spread across different machines. So YarcData's answer is to avoid lots of cheap commodity boxes, and to deliver a single computer powerful enough to hold the graph you're working with in memory. Oracle, IBM and others, of course, have had similar epiphanies. But they're not putting SPARQL at the top of the heap, and YarcData is.
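To make the football team example concrete, here is a minimal Python sketch over a tiny in-memory graph. The people, cities, ages, and the machine placement are all invented for illustration, and nothing here reflects how Facebook or YarcData actually store data.

```python
# Hypothetical friendship graph: each user maps to the set of their friends.
FRIENDS = {
    "me": {"ann", "bob", "cat", "dan"},
    "ann": {"me", "bob"},
    "bob": {"me", "ann"},
    "cat": {"me"},
    "dan": {"me"},
}

# Hypothetical profile data for the friends.
PROFILES = {
    "ann": {"city": "Leeds", "age": 29, "interests": {"football", "chess"}},
    "bob": {"city": "Leeds", "age": 41, "interests": {"football"}},
    "cat": {"city": "York", "age": 25, "interests": {"football"}},
    "dan": {"city": "Leeds", "age": 31, "interests": {"cycling"}},
}


def pick_team(user, city, max_age, interest):
    """Select the user's friends, then filter them on location, age and interest."""
    return sorted(
        friend
        for friend in FRIENDS.get(user, set())
        if PROFILES[friend]["city"] == city
        and PROFILES[friend]["age"] <= max_age
        and interest in PROFILES[friend]["interests"]
    )


def cross_machine_edges(placement):
    """Count friendships whose two ends have been placed on different machines."""
    crossing = set()
    for user, friends in FRIENDS.items():
        for friend in friends:
            if placement[user] != placement[friend]:
                crossing.add(frozenset((user, friend)))
    return len(crossing)


team = pick_team("me", city="Leeds", max_age=35, interest="football")
placement = {"me": 0, "ann": 0, "bob": 1, "cat": 1, "dan": 0}
crossing = cross_machine_edges(placement)
```

With everything in one process, following a friendship is a dictionary lookup. Once the graph is split, every friendship counted by `cross_machine_edges` becomes a network hop, which is the cost a single large in-memory machine is meant to avoid.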
SPARQL, Parthasarathi enthuses, offers an industry-standard way to query graph data. Paired with high performance hardware (YarcData's, of course), he sees a winning combination. SysAdmins installing one of YarcData's appliances simply see a standard SuSE Linux box. Graph data-aware developers see a standard SPARQL endpoint. YarcData's customers see insight, particularly among the "forbidden problems" that Parthasarathi suggests they've never had the means to explore before. And YarcData watches the business roll in.
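For developers who haven't met one, a standard SPARQL endpoint is simply an HTTP service that accepts a query and returns results. The Python sketch below only builds such a request and does not send it. The endpoint URL and the person URI are placeholders, not real YarcData addresses, and the FOAF vocabulary is an assumption chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint: an assumption, not a real YarcData address.
ENDPOINT = "http://example.org/sparql"

# Ask for the names of everyone a (hypothetical) person knows, using FOAF terms.
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?friend ?name
WHERE {
  <http://example.org/people/me> foaf:knows ?friend .
  ?friend foaf:name ?name .
}
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL Protocol GET request; the query travels in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Passing `request` to `urllib.request.urlopen` would send it to a live endpoint. Because the protocol is standardised, the same request should work against any compliant SPARQL store, which is the portability Parthasarathi is pointing to.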
The problem, of course, with a proposition that's at its most compelling when tackling problems customers haven't addressed before is that awareness of the need is pretty low. YarcData doesn't address a serious pain point. YarcData doesn't make an existing process run faster, or cheaper, or more reliably. YarcData lets customers do things they've probably given up wanting to do, and that's a problem.
Hence the competition: a challenge in which entrants are competing for a share of $100,000 and the chance to run their task on one of YarcData's machines. The $70,000 first prize is certainly good for its winner, but YarcData will be hoping that prospective customers see the potential for similar insight inside their own organisations.
Most of the 'obvious' use cases for complex graphs are trite or contrived. YarcData is counting on the prize money to spark the interest of people with an ability to think creatively. Forget that this is a "forbidden problem" that can't be solved. With a powerful graph database, maybe a whole class of forbidden problems stops being forbidden, and we just need to find new questions to ask.
YarcData's appliance would be expensive to buy outright, but it's actually delivered through a subscription which makes the investment somewhat more manageable. Even with some amazing competition entries to galvanise interest, though, YarcData faces an uphill struggle to persuade cash-strapped IT departments to deploy this alongside all of their existing data management and business intelligence solutions. The company may need to find a way to let an awful lot more people experience the real prize from its competition: the chance to run a job on a YarcData machine, and to see what it can do.