by Angela Guess
Gregory Mone has written an insightful look beyond Hadoop for the Communications of the ACM magazine. Mone begins, "Pandora will not discuss exactly how much data it churns through daily, but head of playlist engineering Eric Bieschke says the company has at least 20 billion thumb ratings. Once every 24 hours, Pandora adds the last day's data to its historical pool—not just thumbs, but information on skipped songs and more—and runs a series of machine learning, collaborative filtering, and collective intelligence tasks to ensure it makes even smarter suggestions for its users. A decade ago this would have been prohibitively expensive. Four years ago, though, Bieschke says Pandora began running these tasks in Apache Hadoop, an open source software system that processes enormous datasets across clusters of cheap computers. 'Hadoop is cost efficient, but more than that, it makes it possible to do super large-scale machine learning,' he says. Pandora's working dataset will only grow, and Hadoop is also designed for expansion. 'It's so much easier to scale. We can literally just buy a bunch of commodity hardware and add it to the cluster'."
Mone continues, "Bieschke is hardly alone in his endorsement. In just a few years, Hadoop has grown into the system of choice for engineers analyzing big data in fields as diverse as finance, marketing, and bioinformatics. At the same time, the changing nature of data itself, along with a desire for faster feedback, has sparked demand for new approaches, including tools that can deliver ad hoc, real-time processing, and the ability to parse the interconnected data flooding out of social networks and mobile devices. 'Hadoop is going to have to evolve,' says Mike Miller, chief scientist at Cloudant, a cloud database service based in Boston, MA. 'It's very clear that there is a need for other tools.' Indeed, inside and outside the Hadoop ecosystem, that evolution is already well under way."
photo credit: Pandora