Hadoop Overview: A Big Data Toolkit

Big Data isn’t new. Forbes traces the origins back to the “information explosion” concept first identified in 1941. The challenge has been to develop practical methods for dealing with the 3Vs: Volume, Variety, and Velocity. Without tools to support and simplify the manipulation and analysis of large data sets, the ability to use that data to generate knowledge was limited. The current interest and growth in Big Data, Data Science, and Analytics is largely because the tools for working with Big Data have finally arrived. Hadoop is an important piece of any enterprise’s Big Data plan.

First developed in 2005, Hadoop is an open source framework for reliable and scalable distributed computing. Working with Hadoop, large datasets can be efficiently and cost-effectively stored and processed using commodity hardware.

The efficiency comes from executing batch processes in parallel. Data doesn’t need to be moved across the network to a central processing node. Instead, big problems are solved by breaking them into smaller problems that are solved independently, and then combining the results to derive a final answer.

The cost effectiveness comes from the use of commodity hardware. The large data sets are simply broken up and stored on ordinary sized local disks. Failures are handled through software, rather than high cost servers with high availability features.

Hadoop’s Components

The Web generates a lot of data. In order to search it effectively, Google needed to keep it indexed, but maintaining that index was extremely challenging. To make this possible, Google created the MapReduce style of processing. MapReduce programming uses two functions, a map job that converts a data set into key/value pairs, and a reduce job that combines the outputs of the map job into a single result. This approach to problem solving was then adopted by developers who were working on Apache’s “Nutch” web-crawler project, and developed into Hadoop.

Hadoop is made up of four modules; the two key modules provide storage and processing functions.

MapReduce: As described above, this component supports the distributed computation of large projects by breaking the problem into smaller parts and combining the results to derive a final answer.
Hadoop Distributed File System (HDFS): This component supports the distributed storage of large data files. HDFS splits up the data and distributes it across the nodes in the cluster; it creates multiple copies of the data for redundancy and reliability purposes. If a node fails, HDFS will automatically access the data from one of the replicas. The data managed by HDFS can be either structured or unstructured; it can support almost any format. Hadoop does not require the use of HDFS; however, other file systems such as S3 or the MapR File System can also be used with Hadoop.
Yet Another Resource Negotiator (YARN): Introduced in Hadoop 2.0, YARN provides scheduling services and manages the Hadoop cluster’s computing resources. Through YARN’s features, Hadoop can run other frameworks besides MapReduce. This has extended Hadoop’s functionality so that it can support real-time, interactive computations on streaming data in addition to batch job processing.
Hadoop Common: The library and utilities that support the other three modules.

Hadoop can run on a single machine, which is useful for experimenting with it, but normally Hadoop runs in a cluster configuration. Clusters can range from just a few nodes to thousands of nodes.

When the data is managed by HDFS, there is a master NameNode that holds the file index. The data is stored on slave DataNodes.

Because of the introduction of YARN, the scheduling jobs that run on the nodes are different in Hadoop 1 and Hadoop 2. Hadoop 1 managed resources with the JobTracker and TaskTracker processes on the master and slave nodes respectively. In Hadoop 2, they are replaced by YARN’s ResourceManager, NodeManager, and ApplicationMaster daemons.

Benefits of Using Hadoop (and Limitations)

Hadoop provides a number of advantages for solving Big Data applications:

Hadoop is cost-effective: Ordinary, commodity hardware is used to achieve large-capacity storage plus high availability and fault tolerant computing.

Hadoop solves problems efficiently: The efficiency is partly due to using the multiple nodes to work on the problem’s parts in parallel, and partly from performing computation on the storage nodes, eliminating delays due to moving data from storage to compute nodes. Because data is not moved between servers, the volume does not overload the network.

Hadoop is extensible: Servers can be added dynamically, and each machine added provides an increase in both storage and compute capacity.

Hadoop is flexible: Although most commonly used to run MapReduce, it can be used to run other applications, as well. It can handle any type of data, structured or unstructured.

These benefits and flexibility don’t mean that Hadoop is suitable for every problem. Problems with smaller data sets can most likely be more easily solved with traditional methods. The HDFS was intended to support write-once read-many operations, and may not work for applications that need to make data updates.

Hadoop also may not be an appropriate choice for storing highly sensitive data. Although Hadoop has a security model, the default configuration disables it. Administrators need to make appropriate choices to ensure data is encrypted and protected as needed.

One of the most popular alternatives to Hadoop is Spark. Like Hadoop, Spark is an Apache project for running computations on large datasets. Spark can leverage the Hadoop framework, especially HDFS, but it can achieve better performance by keeping data in-memory instead of on disk. Spark is also a good choice for real-time analytics, although Hadoop 2.0’s new features allow Hadoop to support streaming analytics as well as batch processes.

Typical Use Cases

Hadoop is widely used by organizations in many different business domains. Cloudera lists 10 common problems that are suited to Hadoop analysis:

Risk modeling
Customer churn analysis
Recommendation engine
Ad targeting
Point of sale transaction analysis
Analyzing network data to predict failure
Threat analysis
Trade surveillance
Search quality
Data sandbox

Most of those use cases are driven by business needs, but the tenth one on the list—data sandbox—is only one of several purely technical reasons for using Hadoop. Hadoop is cost effective simply for data storage. It can be easily be used as a staging area before loading data into a data warehouse.

Industries that have applied Hadoop to their Big Data problems in the past few years include retail, banking, healthcare, and many others. The Hadoop website lists numerous well known firms with clusters containing from fewer than a dozen up to 4500 nodes, including Amazon, EBay, Facebook, Hulu, LinkedIn, Twitter, and Yahoo.

The Hadoop Ecosystem

Hadoop is typically used in conjunction with several other Apache products to form a complete analytics processing environment. These products include:

Pig: Pig is a scripting language that makes the data manipulations commonly needed for analytics (the ETL—extract, transform, and load—operations) easy. Scripts written in “Pig Latin” get turned into MapReduce jobs.

Hive: Hive provides a query language, similar to SQL, which can interface with a Hadoop-based data warehouse.

Oozie: Oozie is a job scheduler that can be used to manage Hadoop jobs.

Sqoop: Sqoop provides tools for moving data between Hadoop and relational databases.

For users who find creating and supporting their own Hadoop environment too complex, there are several vendors who provide supported versions that make getting started easier. They also provide enhanced services that make Hadoop enterprise-ready. Some vendors include:

Amazon Web Services: AWS can rapidly provision a Hadoop cluster, add resources to it, and provides administrative support as a managed service. Other products often used with Hadoop, like Pig and Hive, can also be deployed.

Google Cloud Platform: Similar to AWS, Google lets you rapidly provision Hadoop clusters and associated resources like Pig and Hive. Other Google offerings, like BigQuery and Cloud DataStore connectors, make it easy to access data stored in Google’s data warehouse and cloud storage services.

Cloudera: One of the co-creators of Hadoop is now Cloudera’s Chief Architect. The firm offers a fully supported environment for enterprise Hadoop, including professional services.

HortonWorks: HortonWorks offers Hortonworks Data Platform, based on Hadoop, to support enterprise data lakes and analysis, plus Hortonworks DataFlow for real-time data collection and analysis.

In addition to those Apache products and third-party Hadoop environments, there are additional third-party products that make working with Hadoop easier in your own environment, such as BMC’s Control-M for Hadoop, which makes managing Hadoop batch jobs easier.

Hadoop Looking Forward

Despite the widespread use of Hadoop, there are some questions about its future. The platform is fairly complex to work with, making getting started challenging for businesses. The rise of Spark has led some to consider it as the future of Big Data, rather than Hadoop. Nevertheless, for the foreseeable future, understanding Hadoop will be necessary for organizations that want to start getting value out of their databases, data warehouses, and data lakes.

Data Topics

Hadoop Overview: A Big Data Toolkit

Leave a Reply Cancel reply