Big Data is about to get real. For years we’ve heard about the promise of putting data to work, gleaning “actionable insights” from mountains of data. But for most enterprises, most of the time, this translated into media hype, not business value.
Recently, however, I had the chance to sit down with two organizations doing groundbreaking Big Data work, and it brought home how real Big Data is, or soon will be. In both cases, the open source nature of Big Data technologies was critical to helping them iterate toward their leadership positions.
Open Source as Innovator
The dramatic rise of Big Data – analyst firm IDC says the Big Data market grew by $34 billion in 2012 alone – directly derives from its open source roots. Whereas open source sometimes played the role of fast-following copycat to proprietary, often expensive innovations, today all of the industry’s major computing trends – Big Data, cloud, mobile – are open source phenomena.
This isn’t surprising when we consider where Big Data was born: in Silicon Valley web companies that had a serious need to scale, and an equally serious need to understand their hundreds of millions of customers. Importantly, none of these companies had any need to sell the technology – that’s not their business – and plenty of reason to open source it, including using open source as a lure for talented developers.
Hence, Hadoop, Storm, MongoDB, Cassandra, Dremel, and other successful Big Data technologies are all open source, and all were born on the web, not in some fusty enterprise software company. And as traditional enterprises have discovered Big Data needs similar to those of the early web pioneers, they have embraced these same open-source technologies, as Brad Hedlund captures:
As properties such as Yahoo!, Google, Facebook, Amazon became great successes, their architects and software engineers realized that they had moved mountains…The tremendous problems of efficiently running large scale applications on low cost infrastructure had been solved…At the very same time, enterprise IT begins to encounter some of the very same problems solved by the large web provider, such as scalable data warehousing and analytics (so called “Big Data”). Additionally, the software driven distributed systems that solve problems of infrastructure efficiency and management at very large scale could also be applied to infrastructure at a smaller enterprise IT scale (why not?). And finally, the cost savings of an application infrastructure designed to operate on low cost commodity hardware can be realized at any scale, large web or enterprise IT.
Given the above, it’s not surprising that open-source technologies dominate the list of must-have tech skills, as technology job site Dice.com’s latest survey reveals.
Indeed.com, another leading job site for technology professionals, shows much the same thing.
IT professionals with these skills earn significantly more than peers with different skillsets, according to a Dice.com salary survey: $100,000, on average, versus cloud/virtualization ($90,000) and mobile ($80,000).
The reason? Big Data finally is starting to pay big returns on investment.
Two Big Data Pioneers
On a recent trip to Chicago, I was fortunate to spend time with two Big Data bigwigs: Brett Goldstein, Chief Data Officer for the City of Chicago and the man behind the City’s incredibly cool WindyGrid project; and Dr. Philip Shelley, CTO at Sears Holdings. Both are pushing historically conservative organizations to the cutting edge of Big Data, and showing big benefits along the way.
– City of Chicago
The City of Chicago, like any large city, has its share of crime. But a new program aims to change that, among other things. Chicago’s WindyGrid project pulls data from across multiple, disparate agencies, allowing law enforcement and other groups to provide coordinates and get a real-time view into what’s happening in that area:
[C]ity officials might look at a high crime area, while also mapping out the number of liquor permits for a neighborhood, along with the amount of nearby abandoned buildings. Using transcripts from resident complaints or 911 calls [or data from any number of 30 different City agencies or departments], officials could also see trending concerns for the area, like broken lights or stolen garbage cans, and the times the incidents are occurring. If the high crime area also has a high number of liquor permits, for example, officials could then see if other neighborhoods also faced both issues, allowing them to create a more effective response for those areas.
The benefits go beyond law enforcement, however. The data could also be used to improve health care access, streamline public transportation, or any number of things. The key is to aggregate the data (stored in the MongoDB NoSQL database for scalability and geospatial purposes, with Hadoop also being considered to assist the City in the future) and then correlate the data, while also running real-time analytics against it. If a 311 report comes in that street lights are out, and this correlates with 911 calls, and the City can also see that on average it takes 10 days to fix the lights, then the City better understands how to resolve its crime issue at a meta level.
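To make the correlation idea concrete, here is a minimal sketch in plain Python. It is not the City's code: the Event record, the haversine_km and calls_near_outage helpers, the 0.5 km radius, and the sample coordinates are all hypothetical, and a real deployment would more likely lean on the database's own geospatial queries than on an in-memory loop. It simply finds 911 calls that happened near a reported street-light outage during the roughly 10 days the light might stay broken.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass
class Event:
    kind: str        # hypothetical labels: "311_light_out" or "911_call"
    lat: float
    lon: float
    when: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def calls_near_outage(outage, calls, radius_km=0.5, window=timedelta(days=10)):
    """Return calls within radius_km of the outage, made during the window
    that starts when the outage was reported."""
    return [
        c for c in calls
        if timedelta(0) <= c.when - outage.when <= window
        and haversine_km(outage.lat, outage.lon, c.lat, c.lon) <= radius_km
    ]


# Made-up sample data for illustration only.
outage = Event("311_light_out", 41.8781, -87.6298, datetime(2013, 1, 1))
calls = [
    Event("911_call", 41.8790, -87.6300, datetime(2013, 1, 3)),   # close, in window
    Event("911_call", 41.9500, -87.6500, datetime(2013, 1, 4)),   # too far away
    Event("911_call", 41.8785, -87.6295, datetime(2013, 2, 1)),   # too late
]
nearby = calls_near_outage(outage, calls)
print(len(nearby), "call(s) near the outage while the light was likely out")
```

Counting such matches across every outage is one simple way to test whether slow repairs and emergency calls move together.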
It’s an extraordinarily impressive project, one that CDO Goldstein was able to undertake because of open-source tools. Open source gives him agility, as “proprietary solutions [make it] hard to get the data out.” For a government organization, it is critical not to be beholden to any particular vendor, but the cost savings from open source are also important. As he states, “You don’t have to make a multi-million-dollar investment to get a fancy GUI and something meaningful. If you bring something over to Linux, between Python and R you can produce some remarkable outcomes. These are some really low-cost solutions.” In other words, not only does open source improve Big Data innovation, it simultaneously lowers costs.
How often does something like that happen?
Not content to just use open source, however, Goldstein’s plan is to open source the project so that other cities or organizations can use it and improve it, consistent with the City’s focus on open data, open standards, and open source. As he says, “My intent is, when we build things, we put the code on GitHub.”
What is fascinating here is that while the early Big Data technology emerged from the web companies, perhaps the next wave will come from governments, enterprises, and others who see value in improving their Big Data technologies by sharing them. This would represent huge progress.
– Sears Holdings
Sears has been around for over 100 years, and so has built up no shortage of legacy systems along the way. But even the newer systems couldn’t handle Sears’ Big Data needs, as Shelley told an audience at ITA’s “Demystifying Big Data” panel. So Sears decommissioned millions of dollars’ worth of proprietary data warehousing tools, replacing them with a suite of open-source technologies, with Hadoop at the heart of everything as its “data hub.” Shelley calls Hadoop his new “mainframe,” by which he means it’s the center of all the computing Sears does.
Previously, Sears could only keep 90 days of data – keeping more was both too expensive and simply not scalable – but now it keeps all its raw, transactional data, later transforming it into a myriad of formats as needed by individual organizations within Sears. The system is so powerful, as Shelley told the ITA audience (and as he previously said at Hadoop Summit), that the retailing giant can personalize marketing campaigns, coupons, and offers down to the individual customer.
I had a hard time believing this, so I followed up with him after the panel. Sears has over 10 million products and over 100 million customers, yet it can parse the data from in-store register check-outs, my online activity, and more to put a paper coupon in the mail to me that gives me a highly customized deal, perhaps on something sitting in local inventory that Sears would like to sell. It’s almost mind-blowingly cool.
And it’s very real.
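As a rough illustration of the kind of matching involved, and not a description of Sears’ actual system, the sketch below picks one offer for one customer. Everything in it is an assumption made for the example: the pick_offer function, the category labels, and the inventory layout are hypothetical. It favors the category the customer buys most often, then the in-stock item with the most units on hand.

```python
from collections import Counter


def pick_offer(purchases, local_inventory):
    """Pick one product to put on a coupon (hypothetical logic).

    purchases: list of category names taken from the customer's history
    local_inventory: dict mapping product -> (category, units_in_stock)
    """
    category_counts = Counter(purchases)
    candidates = [
        (category_counts[category], units, product)
        for product, (category, units) in local_inventory.items()
        if units > 0 and category_counts[category] > 0
    ]
    if not candidates:
        return None
    # Tuples compare left to right: purchase count first, then stock level.
    return max(candidates)[2]


# Made-up sample data for illustration only.
history = ["tools", "tools", "appliances"]
inventory = {
    "cordless drill": ("tools", 40),
    "socket set": ("tools", 12),
    "dishwasher": ("appliances", 75),
    "treadmill": ("fitness", 90),
}
offer = pick_offer(history, inventory)
print(offer)
```

A production system would weigh far more signals, but the shape of the problem is the same: join what a customer has done with what a nearby store needs to sell.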
Like the City of Chicago’s Goldstein, Shelley was quick to point out that gone are the days of capital expenditures for his Big Data projects. He gets better technology for free or, at worst, on a manageable subscription model. Innovation, it turns out, is comparatively cheap.
Big Data Comes of Age
Once upon a time, we talked about the power of data to personalize advertising, pricing, etc. to our needs, or the ability to use data to fight crime and improve things like medical care. But all we got tended to be poorly tailored advertising.
In the Windy City, things are changing. Big Data has become real, and increasingly real-time. Sears and the City of Chicago, which should be the last to the Big Data party, are among the first. This should give hope to everyone else who is tasked with putting data to work for their organization.
Importantly, neither Sears nor the City of Chicago started out as Big Data experts. In both cases, they downloaded and experimented with open-source software, allowing them to learn with no capital expenditures, upfront or over time. The better they’ve become with the technology, the more they’ve iterated on their projects, fine-tuning them to make them sing.
2013, then, is the year that Big Data becomes real. It’s when Big Data ceases to be something that only Google or Facebook does, and instead becomes something that more traditional organizations adopt, whether big or small.
Matt Asay is Vice President of Corporate Strategy at 10gen, the company behind the MongoDB NoSQL database. With more than a decade spent in open source, Matt is a recognized open source advocate and board member emeritus of the Open Source Initiative (OSI).