Recently, there’s been much discussion regarding Apache Hadoop’s capability and usability (or, for the latter, the lack thereof). Hadoop is an extremely useful but disruptive open source technology that enables users to store and analyze poly-structured data in large volumes. As enterprises look to analyze their data for increased business insight, they also want to keep data management costs low. But here’s the rub: a major obstacle to successfully integrating Hadoop into the enterprise ecosystem is the scarcity of relevant skills. Many businesses simply don’t have personnel skilled in setting up, managing and analyzing data in their Hadoop clusters. In an enterprise, such an initiative can span complex hybrid environments: systems of record such as transactional systems, an enterprise data warehouse, CRM and master data management, combined with content repositories, in-memory analytics for dynamic offer management and pricing, and NoSQL stores.
In fact, during a recent TweetChat exploring the challenges of incorporating Hadoop, Wikibon analyst Jeff Kelly commented: “beyond tech skills u [sic] need willingness to fail, curiosity, perseverance.” Combine this skills gap with lengthy deployment time (an average of two weeks) and confusion over where Hadoop truly fits into existing architecture, and many business users have begun to wonder: “Is Hadoop really enterprise-ready?”
To that question, I answer yes – if one considers the developments and imperatives below.
Advances in technology are mitigating issues with lack of skills
Considered a leading challenge in many Hadoop deployments, the scarcity of skills is a major cause for concern for many businesses. In general, companies can easily get their data into Hadoop but have difficulty actually gaining insights from it. Hadoop is often referred to as a “data lake”: a massively scalable, inexpensive place to store nearly unlimited amounts of uncategorized data in any format, with the promise of deep insights from “all data.” But if businesses lack the skills to gainfully fish from this “lake” using the tools they are accustomed to, it grows murky from disuse, voiding the entire reason for implementing the technology in the first place.
In the past couple of years, companies have needed to hire or train workers with skills in Hadoop and its corresponding tools, including MapReduce, Pig, HBase and Hive. However, a growing number of technology companies are releasing powerful SQL capabilities and integrations for Hadoop, making it accessible to the much larger population of SQL programmers and SQL-aware toolsets.
In addition, text, statistical and predictive analytics algorithms running at scale, closer to the storage, are critical for real-time, accurate and actionable insights. A vibrant ecosystem is starting to form around bringing these capabilities to Hadoop.
Utilizing solutions created specifically for increasing ease of use for the Hadoop platform and corresponding tools is crucial for: 1) making the overall business case for Hadoop deployment, and 2) being able to smartly and efficiently perform complex data analysis.
Architecting to decrease deployment time
New technologies, such as plug-and-play expert integrated systems that combine Hadoop-based software, servers and storage in a single architecture, can dramatically decrease deployment time – from weeks to mere minutes.
Appliance approaches not only help minimize the need for advanced Hadoop expertise but also enable businesses to deploy their Hadoop-specific big data architecture on a tight turnaround. This saves money and increases efficiency. It also lets them venture into big data one rack at a time.
Some enterprises deploy Hadoop clusters on an application-specific or department-specific basis because of limitations in the granularity of Hadoop’s security models. Although this has short-term utility, the better approach is a multi-instanced, common infrastructure/cloud model.
Hadoop can be used on both structured and unstructured data
A common misconception is that Hadoop can only be used for unstructured data analysis. In fact, its capabilities extend well beyond that, to processing structured and semi-structured data as well. Principal use cases include data refinement, filtering and transformation; warehouse augmentation and off-loading; and information discovery and exploration. Going further, we’ve also seen enterprises use Hadoop for next-generation business intelligence.
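To make the data-refinement use case concrete, here is a minimal Python sketch of the mapper side of such a job, written in the style of a Hadoop Streaming mapper. The record layout (user ID, country, amount) and the validity rules are hypothetical stand-ins, not taken from any particular deployment; a real Streaming job would read `sys.stdin` and write to stdout.

```python
#!/usr/bin/env python3
# Hadoop Streaming-style mapper sketch: refine raw, tab-delimited records
# before loading them into a warehouse. The field layout and validity rules
# below are hypothetical -- adapt them to the real schema. In an actual
# Streaming job this would consume sys.stdin and print to stdout.

def refine(lines):
    """Drop malformed records and normalize the ones that survive."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:              # skip records missing fields
            continue
        user_id, country, amount = fields
        try:
            value = float(amount)         # skip records with bad numerics
        except ValueError:
            continue
        # Emit a cleaned, consistently formatted record.
        yield f"{user_id}\t{country.strip().upper()}\t{value:.2f}"

if __name__ == "__main__":
    sample = ["u1\tus\t12.5\n", "broken\n", "u2\tde\tabc\n"]
    for record in refine(sample):
        print(record)
```

Because each record is handled independently, this kind of refinement parallelizes naturally across a Hadoop cluster.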
In light of this, businesses need not feel that their intentions for Hadoop must fall within a single use case. On its own, Hadoop can be utilized for a number of different capabilities, and when combined with other systems it can become a powerful basis for unique solutions. However, MapReduce is batch-oriented by nature, so in use cases that require real-time processing of data, one has to consider augmenting a Hadoop-based system with real-time, in-memory data processing capabilities.
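The batch-oriented character of MapReduce can be seen in a minimal, self-contained Python sketch of the classic word count, which mimics the map, shuffle and reduce phases in-process (the real phases run as scheduled stages across a cluster; nothing here is vendor-specific):

```python
#!/usr/bin/env python3
# Minimal in-process sketch of MapReduce's batch nature: the classic
# word count. In a real Hadoop job, map, shuffle and reduce run as
# separate scheduled stages distributed across the cluster.
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit (word, 1) pairs -- the mapper's role."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Sort pairs by key (the shuffle), then sum counts per word."""
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

if __name__ == "__main__":
    text = ["hadoop stores data", "hadoop processes data"]
    print(dict(reduce_phase(map_phase(text))))
```

The batch character shows in the control flow: every input record must be mapped and shuffled before any reducer can emit a result, which is precisely why latency-sensitive workloads pair Hadoop with in-memory or streaming systems.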
Hadoop doesn’t replace your traditional systems – it augments them
Hadoop supplements the enterprise data platform; it does not replace it. In a recent interview with Wikibon’s Jeff Kelly, IBM’s Tim Vincent explained that businesses need not think of Hadoop as the be-all and end-all – it is truly an extension of traditional infrastructure. As such, businesses should be keenly aware that Hadoop, like many other big data platforms, addresses many, but not all, big data needs. For instance, in-memory and stream computing, not Hadoop, are optimized for front-end BI query acceleration and continuous data processing, respectively. Furthermore, hybrid architectures that integrate capabilities from several different platforms can provide customized big data handling, depending on need.
Make the case for business buy-in to avoid silos
While not yet a significant issue, businesses should begin to guard against Hadoop-created silos and be mindful of how to avoid them. The accidental creation of silos can be circumvented through a strategy that evolves from a single deployment and is built specifically to connect data sources as needed. Another, equally important, way to achieve the same goal is to foster a business culture that denounces what Jeff Kelly calls “data hoarding.”
Getting business buy-in is essential to any big data analytics strategy.