How well do enterprises understand when to take advantage of Hadoop and when to leverage NoSQL? How well prepared are their IT teams with the skills needed to go in either direction? And how well are they grappling with the realities of Big Data?
These were some of the questions tackled by Randy Secrist, Director of Professional Services at Basho Technologies, during his Enterprise Data World 2015 Conference presentation on the topic of Hadoop Versus NoSQL Use Cases. It’s a good time to revisit the issue as 2016 gets into full swing and the technologies continue to grow their audience: Allied Market Research expects that the global NoSQL market will reach $4.2 billion by the end of 2020, growing at a 35.1% CAGR from 2014 to 2020. The global Hadoop market, it says, is expected to garner revenue of $84.6 billion by 2021, which represents a CAGR of 63.4% from 2016 to 2021.
Both technologies help businesses grapple with Big Data, but Big Data has more than one face, Secrist explained. Although neither technology uses SQL, NoSQL and Hadoop aim at handling separate Big Data workloads. Currently, many businesses are caught in a phase that Secrist described as generally wanting a place to put their data and to gain something from doing so. “There is some vague need to do something and that usually results in confusion,” he said. It’s an iterative process to figure out the true goals for data availability or insights and then translate that knowledge into action.
Which Path to Take
Choose the tool that matches the organization’s greatest pain point, he advised. Organizations that mainly want to understand what is going on in their business, how it is working, and what they might need to change to do better should consider solutions for batch-oriented analysis of stored data. “If you are going down the insight path then you are leading down more to Hadoop [and] the more analytic batch-processing workload,” he said.
As an example, he discussed Big Data competitions he has participated in where teams analyzed anonymous Health Spending Account (HSA) data sets to discover how people use these accounts. One of the teams was made up of Machine Learning Data Scientists: They were unfamiliar with what the HSA data codes meant, but were able to use pure math and statistics to analyze the sets and show visually how different conditions and procedures clustered together. One can think of this team as learning things about a business issue that weren’t known before. Another team, made up of health care specialists, achieved the same results, though its members understood in advance why certain procedures would be clustered together thanks to their familiarity with the relationships of diseases to treatments and procedures. One can think of this team as gaining proof for what they already knew to be true of a business issue.
The biggest pain point for other organizations revolves around real-time operations, where elastic scalability, high availability, and performance matter. “If you have a performance problem, fix that,” he said. NoSQL products can have advantages in these respects. One Basho customer, for instance, accepts emails carrying large amounts of varied data at a very high rate for sites including Pinterest, which has some 70 million active users per day. The customer needs to ingest that data as fast as possible, and it wants to take on more customers like Pinterest, which means more volume. “So they need a way to predictably scale to the point where they can know they can handle x amount of throughput, and know how this is going to work today, as well as their projected growth pattern,” he said.
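To make the idea of predictable scaling concrete, here is a minimal back-of-the-envelope sketch in Python. It is not Basho's or the customer's method. The function names, the 25% headroom, the per-node throughput, and the growth rate are all hypothetical, and the sketch assumes throughput scales linearly as nodes are added.

```python
import math


def nodes_needed(peak_ops_per_sec, ops_per_node, headroom=0.25):
    """Nodes required to serve a peak load with spare capacity.

    Assumes (hypothetically) that throughput scales linearly with node count.
    """
    return math.ceil(peak_ops_per_sec * (1 + headroom) / ops_per_node)


def project_cluster(current_peak, annual_growth, ops_per_node, years):
    """Return (year, projected peak, nodes needed) for each year in the plan."""
    plan = []
    for year in range(years + 1):
        peak = current_peak * (1 + annual_growth) ** year
        plan.append((year, round(peak), nodes_needed(peak, ops_per_node)))
    return plan


if __name__ == "__main__":
    # Illustrative numbers only: 20,000 ops/s today, 50% yearly growth,
    # and 5,000 ops/s sustained per node.
    for year, peak, nodes in project_cluster(20_000, 0.5, 5_000, 3):
        print(f"year {year}: peak {peak} ops/s -> {nodes} nodes")
```

The point of the exercise is the one Secrist makes: if capacity per node is known and growth is projected, the cluster size for "x amount of throughput" becomes a number that can be planned for rather than guessed.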
Live Real, Hope for the Ideal
That said, the NoSQL space is still evolving. He noted that there have been efforts to characterize all the different types of NoSQL databases, but that those characterizations are morphing as the systems themselves change. Basho’s own solution, Riak, for example, has blurred together the idea of key-value and document store NoSQL databases. And, “as far as the enterprise goes, most of these all are very new,” he said, with few companies having more than three to five years of experience with such systems.
Such a lack of experience, not surprisingly, still leads to unrealistic expectations of the technology. For example, people believe developers will immediately become more efficient by going schemaless. He also believes it’s healthy to consider all of an enterprise’s options closely. For instance, when data is presented to users operationally and fetching it takes seconds longer than it should, the fix may lie in remodeling the database already in use. “You don’t necessarily need to switch to Cassandra [one of Riak’s competitors] to solve the problem,” he said.
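As one small illustration of fixing a slow fetch inside the database already in use, the sketch below uses Python's built-in sqlite3 module to compare the query plan for a lookup before and after adding an index. The table, column, and index names are hypothetical, and indexing is only one of several remodeling options; the talk did not prescribe a specific fix.

```python
import sqlite3

# Hypothetical lookup that an application runs on every page view.
LOOKUP = "SELECT id, total FROM orders WHERE customer_id = ?"


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one string (last column is the detail)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
        [(i % 1000, float(i)) for i in range(10_000)],
    )

    # Without an index, SQLite has to scan the whole table for each lookup.
    before = query_plan(conn, LOOKUP, (42,))

    # Remodel: add an index that matches how the application reads the data.
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    after = query_plan(conn, LOOKUP, (42,))

    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print("before:", before)
    print("after: ", after)
```

The exact wording of the plan varies between SQLite versions, but the change from a full scan to an index search is the kind of result that can make a switch to a different database unnecessary.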
In any case, mixing and matching the two problems and solutions can be tricky:
“If you are trying to figure out what your data means, then it’s harder to know how to operationalize it at the same time,” he said. “One half of this problem is that you have to define your access patterns to this data – how do you want to query it, how do you want to see it. And on the other half, you don’t really care about that. You want to capture everything and store it and analyze it, and they are two very different things.”
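The two halves Secrist describes can be sketched as two toy data structures in Python. Neither class represents Riak, Hadoop, or any real product's API. The class names, the event fields, and the sample data are all hypothetical, chosen only to show why the two designs pull in different directions.

```python
from collections import defaultdict


class OperationalStore:
    """Key-value style: data is laid out for one access pattern chosen up front."""

    def __init__(self):
        self._latest_by_user = {}

    def put(self, event):
        # Only the newest event per user is kept; older ones are overwritten.
        self._latest_by_user[event["user"]] = event

    def latest_for(self, user):
        # A single key lookup, fast and predictable, but it answers one question.
        return self._latest_by_user.get(user)


class EventLog:
    """Capture-everything style: append raw events and decide the questions later."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def batch_count_by(self, field):
        # A full scan over everything stored: slower, but any field can be analyzed.
        counts = defaultdict(int)
        for event in self._events:
            counts[event.get(field)] += 1
        return dict(counts)


if __name__ == "__main__":
    events = [
        {"user": "a", "action": "pin", "device": "ios"},
        {"user": "b", "action": "view", "device": "web"},
        {"user": "a", "action": "view", "device": "web"},
    ]
    store, log = OperationalStore(), EventLog()
    for e in events:
        store.put(e)
        log.append(e)

    print(store.latest_for("a"))
    print(log.batch_count_by("device"))
```

The operational store cannot answer a question nobody planned for, such as a count by device, because it discarded the history. The log can answer it, but only by scanning everything, which is not something to do on every user request.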
Given that dilemma, he again emphasized the importance of determining what the company’s biggest issue is. “If you already know there is a problem, like a performance problem with your application, focus on that and solve that first,” he said. On the other hand, if you truly don’t know your business as well as you should and need the benefit of data-driven insights, that’s the path to head down.
That doesn’t mean that the future won’t bring forth an ideal NoSQL database that can deliver high availability, low latency, scalability, and failure resistance, and also enable organizations to gain as much insight as possible. But for now the strengths lie in operationalizing: determining how best to meet throughput requirements, how to scale out effectively, and how to deliver premium fault tolerance.
Hone Engineering Skill Sets
Secrist also said this is a good time for engineers to grow their skill sets in both spaces so that they can better understand how to do the job. From the availability perspective, for instance, the focus should be on iterating on solutions to find the best way to store and present only the data that is needed.
As with a file cabinet, you can’t just jam things in indiscriminately, or there will be junk everywhere and it will be impossible to find what you need when you open the drawer to look for it. “If you want data to be available you must first know how to find it so you can make it available,” he said.
In contrast, when gaining insights is a priority, so too is storing all the Big Data you get. It’s simply impossible to know in advance what data ultimately will have value. “Focus on learning effectively for your organization and team,” said Secrist.