Big Data Processing 101: The What, Why, and How

By on

kf_bdpr101_111616First came Apache Lucene, which was, and still is, a free, full-text, downloadable search library. It can be used to analyze normal text for the purpose of developing an index. The index maps each term, “remembering” its location. When the term is searched for, Lucene immediately knows all the places where that term had existed. This makes the search process much faster, and much more efficient, than having to seek the term out anew, each time it is searched for. It also laid the foundation for an alternative method for Big Data processing. Doug Cutting created Lucene in 1999, making it free, by way of Apache, in 2001. (The Apache Software Foundation is an open source, innovation software community.)

In 2002, after Lucene became popular, Doug was joined by Mike Cafarella to make improvements on Lucene. They pulled the processing and storage components of the webcrawler Nutch from Lucene and applied it to Hadoop, as well as the programming model, MapReduce (developed by Google in 2004, and shared per the Open Patent Non-Assertion Pledge). Using these concepts, Doug began working with Yahoo in 2006, to build a “search engine” comparable to Google’s.  The shared and combined concepts made Hadoop a leader in search engine popularity. The fact that Apache Hadoop is free, and compatible with most common computer systems, certainly helped it gain in popularity, as did the fact “other” software programs are also compatible with Hadoop, allowing for greater freedom in the search process.

Hadoop, by itself, can operate using a single machine. This can be useful for experimentation, but normally Hadoop runs in a cluster configuration. The number of clusters can be a few nodes to a few thousand nodes. Hadoop’s efficiency comes from working with batch processes set up in parallel. Rather than having data moved through a network to a specific processing node, large problems are dealt with by dividing them into smaller, more easily solved problems. The smaller problems are solved, and then the combined results provide a final answer to the large problem. Hadoop also allows for the efficient and cost-effective storage of large datasets (maps). Doug Cutting and Mike Cafarella developed the underlying systems and framework using Java, and then adapted Nutch to work on top of it. One of the benefits of the new system allowed the computers to self-monitor, as opposed to having a person monitoring them 24/7, to assure the system doesn’t drop out.

It took a few years for Yahoo to completely transfer its web index to Hadoop, but this slow process gave the company time for intelligent decision making, including the decision to create a “research grid” for their Data Scientists. This grid started with a few dozen nodes and, as Hadoop’s technology evolved, grew to several hundred as the Data Scientists continued to add more data. Other well-known websites using Hadoop include:

  • Yahoo
  • eBay
  • Amazon
  • Facebook
  • Twitter
  • LinkedIn
  • Hulu


Spark is fast becoming another popular system for Big Data processing. Spark is compatible with Hadoop (helping it to work faster), or it can work as a standalone processing engine. Hadoop’s software works with Spark’s processing engine, replacing the MapReduce section. This, in turn, can lead to a variety of alternative processing scenarios, which may include a mixture of algorithms and tools from the two systems.  Cloudera is one example of a business replacing Hadoop’s MapReduce with Spark. As a standalone processor, Spark does not come with its own distributed storage layer, but can use Hadoop’s distributed file system (HDFS).

Spark is different from Hadoop and Google’s MapReduce model because of its superior memory, which speeds up processing time. As an alternative system, Spark can circumvent MapReduce’s imposed linear dataflow, in turn providing a more flexible data screening system.


Apache Flink is an engine which processes streaming data. Spark, by way of comparison, operates in batch mode, and cannot operate on rows as efficiently as Flink can. While Flink can handle batch processes, it does this by treating them as a special case of streaming data. If you are processing streaming data in real time, Flink is the better choice. According to Stephan Ewen, “We reworked the DataStream API heavily since version 0.9. I personally subscribe to the vision that data streaming can subsume many of today’s batch applications, and Flink has added many features to make that possible.”

Flink can handle high volume data streams while keeping a low processing latency, and its DataStream API has added a large number of advanced features, such as supporting Event Time, supporting out-of-order streams, and a very user-friendly, customizable windowing component. The engine has also gained a number of operational features, such as high availability, and better monitoring of metrics and the web dashboard.

The most obvious user friendly features of Flink’s 1.0 release are the “savepoints” and the CEP (Complex Event Processing) library. The CEP library lets users design the data sequence’s search conditions and the sequence of events. The savepoints record a snapshot of the stream processor at certain points in time. This feature is quite useful because it can be used for rerunning streaming computations, or upgrading programs.


 Apache Storm is designed to easily process unbounded streams of data. It is written in Clojure, an all-purpose language that emphasizes functional programming, but is compatible with all programming languages. Storm is a distributed real-time computation system, whose applications are designed as directed acyclic graphs. It can process over a million tuples a second, per node, and is highly scalable.

Apache Storm can be used for real-time analytics, distributed Machine Learning, and a number of other situations, especially those with high data velocity. Storm can be run with YARN and is compatible Hadoop.

Storm is a stream processing engine without batch support, but comes with Trident, a highly functional abstraction layer. Trident is functionally similar to Spark, because it processes mini-batches.


Apache Samza also processes distributed streams of data. Samza is built on Apache Kafka for messaging and uses YARN for cluster resource management. Samza uses a simple API, and unlike the majority of low-level API messaging systems, it offers a simple, callback-based, process message. When a computer in the cluster drops out, the YARN component transparently moves the tasks to another computer.

Samza incorporates Kafka as a way to guarantee the processed messages are in the same order they were received, and assures none of the messages are lost. Kafka creates ordered, re-playable, partitioned, fault-tolerant streams, while YARN provides a distribution environment for Samza.

Though Samza comes with Kafka and YARN, it also has a pluggable API allowing it to work with other messaging systems.

LinkedIn uses Samza, stating it is critical for their members have a positive experience with the notifications and emails they receive from LinkedIn. Instead of each application sending emails to LinkedIn members, all emails are sent through a central Samza email distribution system, combining and organizing the email requests, and then sending a summarized email, based on windowing criteria and specific policies, to the member.

Big Data Conclusions

The IDC predicts Big Data revenues will reach $187 billion in 2019. The use of Big Data will continue to grow and processing solutions are available. There are a number of open source solutions available for processing Big Data, along with numerous enterprise solutions that have many additional features to the open source platforms. Many of the solutions are specialized to give optimum performance within a specific niche (or hardware with specific configurations). It is worth noting several of the best Big Data processing tools are developed in open source communities. The most popular one is still Hadoop, its development has initiated a new industry of products, services, and jobs.

Hadoop has been a large step in the evolution of processing Big Data, but it does have some limitations which are under continual development. The various frameworks have a fair amount of compatibility, and can be used experimentally in a mix-and-match fashion to produce the desired results. There are no hard rules when combining these systems, but there are guidelines and suggestions available. The article, Storm vs Spark vs Samza, compares the three systems, and describes Samza as underrated. There are multiple solutions for processing Big Data and organizations need to compare each of them to find what suits their individual needs best.

Leave a Reply

We use technologies such as cookies to understand how you use our site and to provide a better user experience. This includes personalizing content, using analytics and improving site operations. We may share your information about your use of our site with third parties in accordance with our Privacy Policy. You can change your cookie settings as described here at any time, but parts of our site may not function correctly without them. By continuing to use our site, you agree that we can save cookies on your device, unless you have disabled cookies.
I Accept