Earlier this month, I travelled to Silicon Valley to attend O’Reilly Media’s new conference, Strata. The theme was data, especially ‘Big’ Data, and in amongst the Hadoops and the Cassandras and the BigTables and the Map/Reduces, I was searching for the companies making connections to the semantic technologies routinely discussed by readers of SemanticWeb.com. Disappointingly, direct correlations were hard to find, but there were some glimmers of recognition that may benefit from the community’s attention.
First, though, what is this ‘Big Data’ that is getting everyone so excited? As I noted in a recent post on my own blog, it’s often simply characterised as “anything that requires more than a single machine to run.” I’ve also heard people say it’s any data that requires special measures to handle. So far, so vague and woolly. We’d all probably recognise really big data if we saw it. It’s the stuff Excel can’t even begin to load. It’s the stuff that makes you beg for one of those Crays I wrote about last month. It’s the stuff that makes you think the ‘High-Memory Quadruple Extra Large Instance’ (yes, they’re really called that) you’ve rented from Amazon might not be big enough. But where’s the line? When does ‘Big Data’ become ‘sort-of-big Data,’ or ‘pah! this data is so wee it only needs a slide rule to process’ data?
Mike Driscoll gave a presentation early in the event, loosely paraphrased in this post he and Roger Ehrenberg submitted to GigaOM. In it, he drew an analogy with Alberta’s Tar Sands, long known to be rich in oil but too expensive to exploit viably.
In a similar vein, much of the world’s most valuable information is trapped in digital sand, siloed in servers scattered around the globe. These vast expanses of data — streaming from our smart phones, DVRs, and GPS-enabled cars — require mining and distillation before they can be useful.
Both oil and sand, information and data, share another parallel: In recent years, technology has catalyzed dramatic drops in the costs of extracting each.
Broader economic and geopolitical trends have perhaps been more important than technological advance in making the tar sands viable, and a similar argument could be made for data. We’re spending so much on collecting and storing the stuff that we really should be getting more value from it. Maybe economically squeezed CFOs are finally asking hard questions about the real value of all that data, and wondering whether they need to keep it at all. Equally, just as energy security drives US interests in Alberta, might ‘information security’ begin to figure highly in the thinking of CIOs adrift in an increasingly unstable environment? There, though, this analogy must quickly be brought to an end. The tar sands raise a host of emotionally fraught environmental issues that, thankfully, have no obvious parallel in the virtual world of data.
Attending session after session, I heard about the ever-accelerating flow of data from mobile phones, tiny ambient sensors in traffic lights and cars and kettles, social networks, financial trading systems, and a host of other sources. I heard about the advances in infrastructure intended to cope with this flood, and the importance of the Cloud in giving a new breed of data scientists the computing power they require. I heard about the role of ‘web scale’ companies such as Amazon and Google and LinkedIn and Facebook and Twitter in designing the first rudimentary tools that could hope to process a never-ending torrent of real-time data, and I heard about companies like Cloudera building businesses around packaging those raw tools for use by normal people.
And whilst much of what I heard sounded ripe for the injection of semantic smarts, I don’t think I heard ‘semantic’ used in presentation or conversation unless I said it first. Maybe semantic technologies can’t cope with processing that much data, that fast? Or maybe people just think they can’t? Maybe there’s a disconnect between the implied precision and flawless recall of a well-formed ontology and the more laissez-faire belief in ‘eventual consistency’, as the current generation of Big Data tools seeks broad patterns that are ‘good enough’ in scary volumes of data?
So where were the examples that fused Big Data with Semantic Technology? I certainly wasn’t alone in looking. A team from Talis were on the prowl, dropping hints about Kasabi’s role in this space. Tyler Bell, Factual’s new Director of Product, was there. Writing things like this, and saying some of the things he said over lunch, he’s clearly on to something too. Dave Beckett (of Redland fame) was there to tell me just how many miles I had to travel for decent coffee (lots). There was even a Linked Data Meetup scheduled for one evening, although I ended up being distracted elsewhere and don’t know how many actually showed up.
Following the success of this first event, O’Reilly have already announced a follow-up in New York City in September. GigaOM is also planning a Big Data event in New York City in March, extending their current Structure franchise. Let’s round up a scary posse of SemanticWeb.com readers with good stories to tell, and show this Big Data crowd that these two communities have much to share.
Disclaimers: GigaOM pays me to write on Cloud Computing for their Analyst site, GigaOM Pro. I am a former employee of (and current shareholder in) Talis.
Image Credit: Paul Butler