Hadoop is Beginning to Stare Newer Big Data Approaches in the Face


by James Kobielus

Waves pass, though the soft shorelines they sculpt might endure longer than you’d expect.

Every new technology platform enters the world as a wave that builds and builds until it’s overtaken by the next wave to come along. That’s the historical perspective expressed by a Forrester “S-curve” or a Gartner “hype cycle”.

However, this is not to imply that successive technology waves necessarily cancel each other out. If the waves are in phase, converging toward a common evolutionary endpoint, they mutually reinforce into an even more powerful wave.

That’s definitely the case with big data. In the present era, Hadoop’s wave is still very much on the rise, but we can already see the larger waves that may overtake it as the big data ocean reshapes our planet along smarter contours.

Recently, I posted my thoughts on the possibility that Apache Spark might drive convergence of Hadoop with NoSQL databases in the coming era of the all-in-memory big data cloud.

Slightly over a year ago, I observed that Hadoop would evolve to support specific deployment roles–such as landing/staging, sandboxing, and archiving–within multi-tier, hybrid big-data architectures that also include MPP RDBMS, in-memory, stream computing, and NoSQL platforms.

And last summer I blogged on the emergence of a big data “omega architecture” into which all of these approaches would eventually dissolve and evolve.

None of this means that Hadoop is likely to become obsolete or to stop evolving into new niches and roles. In fact, Hadoop is likely to be as foundational, in the long-range perspective, as relational database management systems (RDBMSs) have been since IBM invented them in the 1970s. Looking at the 40-plus year history of RDBMS evolution, one sees a data platform approach with remarkable staying power and adaptability. It’s safe to say that RDBMS in one form or another will still be a key data platform for many uses long after most people reading these words have gone to meet the choir.

Hadoop is still a young pachyderm, compared to RDBMSs. It has not even reached its 10th birthday yet, so it’s probably a long way from arriving at any specific plateau in adoption, innovation, or maturity. However, let’s remember that tech industry trends cycle faster than ever. A decade in the early 21st century packs more successive tech innovation waves than a decade in the mid-20th century. As with “dog years” vs. “people years,” a “Hadoop year” (in maturation milestones) is probably equivalent to several “RDBMS years.”

For a technology and market so young, it feels a bit early for Hadoop to be having an identity crisis, but that seems to be what’s happening. As noted in this recent Gartner blog by Merv Adrian and Nick Heudecker, the advent of “Hadoop 2.0”–marked by divergent vendor implementations of old and new subprojects–has muddied the already fuzzy question of what Hadoop is and is not. They sum up this increasing industry muddle with a rhetorical question: “If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name ‘Hadoop’ really mean anything anymore?”

I agree with them that Hadoop has no clear boundaries. Hate to say I told you so, but, in fact, I’ve been saying that for at least the past 3 years, ever since my analyst days. To add to what the Gartner analysts are saying, I think the Hadoop industry “identity crisis” stems in part from the industry’s lack of standardization or even a unifying vision for what Hadoop is and can evolve into. No one has ever stepped forward to present a coherent vision for Hadoop’s ongoing development. When will development of Hadoop’s various components be substantially complete? What is the reference architecture within which Hadoop’s current and future components will be developed and evolved under Apache’s auspices? Where does Hadoop fit within the growing menagerie of big-data technologies? Where does Hadoop leave off and Spark, graph databases, document databases, and other big-data technologies begin? What requirements and features, best addressed elsewhere, should be left off the Apache Hadoop community’s development agenda?

Back in my analyst days, I had to make a judgment call on what to include and exclude from the core scope of Hadoop. I had no choice–it was absolutely essential if I were to do my then-job of evaluating commercial Hadoop offerings on a reasonably apples-to-apples basis. Now, as the employee of a Hadoop solution provider, any similar call I might make would obviously be colored by the specter of vested interest–but that’s beside the point.

We as an industry shouldn’t be forcing analysts or vendors to develop their own definitions of what does and does not fall within Hadoop’s scope. That definition should come from an authoritative industry standards group–not, as Gartner states, from the “mind of the beholder.”
