Big Data Ecosystem Updates: Hadoop, Containers, and VMs Explained

Twenty years ago, a startup called VMware built its business on a platform for creating virtual machines that could run operating systems such as Linux, Windows, and others. As server processing capacity increased, basic applications could no longer make full use of the abundant new resources. Enter Virtual Machines (VMs), which run on top of a physical server and imitate a specific hardware system. A hypervisor is the software, firmware, or hardware that creates and runs VMs.

VMs using different operating systems can run on the same server. For example, a Unix VM can run on the same server as a Linux VM. Each VM comes with its own applications, binaries, and libraries. Rather than buying a new computer capable of running Unix software, a business can keep its existing server and add a VM to it. This is a much simpler, much less expensive way to meet the changing needs of an organization.

Server virtualization is a technique that divides a physical server into several smaller virtual servers with the help of virtualization software. With this approach, each virtual server can run multiple operations simultaneously. James Kobielus, the Lead Analyst at Wikibon, said in a DATAVERSITY® interview:

“The great advantages of server virtualization are that you can make greater utilization of the hardware resources that you’ve invested in. So, you only need to buy new capacity when you actually need it. That’s virtual machines. Now, an issue with virtual machines is that it can be fairly complex to manage all of these disparate machine images inside all of these disparate virtual machines on all these disparate platforms, it can become an administrative burden, quite complex. It’s not really very straightforward.”
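The utilization argument in the quote above can be made concrete with a small sketch. It is only an illustration: the workload sizes, the eight-core host capacity, and the `first_fit` helper are hypothetical and are not taken from any virtualization product.

```python
def first_fit(workloads, host_capacity):
    """Place each workload (cores needed) on the first host that has room.

    Returns a list of hosts, where each host is the list of workloads
    (VMs) that were placed on that physical server.
    """
    hosts = []
    for cores in workloads:
        for host in hosts:
            if sum(host) + cores <= host_capacity:
                host.append(cores)
                break
        else:
            # No existing host had room, so "buy" another physical server.
            hosts.append([cores])
    return hosts


# Hypothetical core requirements for six applications.
workloads = [2, 4, 1, 3, 2, 4]
host_capacity = 8  # assumed cores per physical server

# Without virtualization: one physical server per application.
dedicated_hosts = len(workloads)

# With virtualization: several VMs share each physical server.
consolidated = first_fit(workloads, host_capacity)

used = sum(workloads)
print("dedicated utilization:", used / (dedicated_hosts * host_capacity))
print("consolidated utilization:", used / (len(consolidated) * host_capacity))
```

Under these assumed numbers the same six workloads fit on fewer physical servers once they are packed as VMs, which is the sense in which new capacity is only bought when it is actually needed.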

And Then Came Containers

While there are tools available, virtual machine technology can be quite difficult to work with. For example, decoupling specific, resource-consuming applications in a virtualized environment is not straightforward. Microservices and containerization offer an easier alternative. With containerized microservices, distinct pieces of application code, such as a database’s query processor and its back-end indexing logic, can be split into separate workloads.

Containers are similar to VMs in that software written for one system can run on very different servers, and containers also allow applications to be packaged together with their libraries and dependencies. However, while VMs imitate a hardware system, containers carry only their own software environment and use the host operating system as their base.

VMs take up more space, while containers take up less. “VMs can require a substantial amount of resource overhead, such as network input/output, memory, and disk, because an individual VM runs its own operating system, while containers do not,” remarked Kobielus. Containers instead share the kernel of the host operating system (OS). Additionally, an operating system supporting containers can be smaller, and have fewer features, than an OS for a virtual machine. Containers start much more quickly and use only a fraction of the memory needed to boot an entire operating system. Clearly, containers are the next evolutionary step. Kobielus commented:

“Containerization has really caught on in the last five years. This is the way to do microservices, and to distribute a platform agnostic, virtualized server environment, and it works. The containers can run, not just in servers, but on client devices, and so forth. So, what we’re seeing is that containerization is really the heart of what’s often called cloud native computing.”
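The overhead difference between VMs and containers described above can be sketched as simple arithmetic. The figures below are illustrative assumptions, not measurements, and the model ignores hypervisor and container-runtime overhead; `total_memory_mb` is a hypothetical helper written for this example.

```python
def total_memory_mb(instances, app_mb, os_mb, shared_os):
    """Estimate memory needed to run `instances` copies of an application.

    shared_os=False models VMs: every instance carries its own guest OS.
    shared_os=True models containers: all instances share one host OS kernel.
    """
    os_copies = 1 if shared_os else instances
    return instances * app_mb + os_copies * os_mb


# Assumed figures: 10 instances, 200 MB per application, 1024 MB per OS.
vm_total = total_memory_mb(10, app_mb=200, os_mb=1024, shared_os=False)
container_total = total_memory_mb(10, app_mb=200, os_mb=1024, shared_os=True)

print("VMs:", vm_total, "MB")
print("containers:", container_total, "MB")
```

The gap grows with the number of instances, because the per-instance operating system cost is paid only once when the kernel is shared.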

Containerization technology has been part of Linux for a long time, but running containers still requires containerization software. These days Docker is one of the most popular container platforms, said Kobielus, and while there are other containerization technologies, they all plug into Linux. Basically, Linux is the OS, and Linux containers can be implemented through Docker, Mesos, and various others. Kobielus added, “And then you can run the application logic inside of a Docker container, and then scale those up independently.”

Docker support is available on the majority of Linux platforms, making it easy to run containers and their applications. They will run on essentially any Linux platform, as well as on non-Linux platforms. Through the use of containers, microservices can be moved flexibly among OSs and underlying hardware platforms.

Kubernetes is an open source container orchestration system designed to automate the deployment, scaling, and management of containerized applications. It was originally developed by Google, but is now maintained by the Cloud Native Computing Foundation. NetApp now uses a Kubernetes distribution, embedded in its environment, to orchestrate storage resources and the containerization of storage throughout a distributed cloud fabric.
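Orchestrators of this kind are commonly described as control loops that compare a desired state with the actual state and correct any difference. The following is a minimal sketch of that idea only. It does not use the Kubernetes API, and the `reconcile` function and the `app-N` container names are hypothetical.

```python
def reconcile(desired, running):
    """Return the container names that should be running after one pass.

    desired: how many replicas the operator asked for.
    running: names of the containers that are currently running.
    """
    running = list(running)  # work on a copy; leave the caller's list alone
    next_id = 0
    # Scale up: start new replicas until the desired count is reached.
    while len(running) < desired:
        name = f"app-{next_id}"
        next_id += 1
        if name not in running:
            running.append(name)
    # Scale down: stop the most recently added replicas first.
    while len(running) > desired:
        running.pop()
    return running


state = reconcile(3, ["app-0"])   # one replica running, three wanted
print(state)
state = reconcile(1, state)       # demand drops, so scale back down
print(state)
```

A real orchestrator repeats this comparison continuously, so a crashed container is simply another difference between desired and actual state that the next pass corrects.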

StackPointCloud developed a Kubernetes-based control plane for managing federated trusted storage clusters and synchronizing persistent storage containers among public cloud providers. NetApp, a large data storage vendor, then took that technology and turned it into the NetApp Kubernetes Service, which allows customers to launch a Kubernetes cluster, or storage cluster, in as little as three clicks.

“It can scale up to be used by hundreds of users, which allows customers to deploy containers that can scale from a single user interface,” commented Kobielus. This is containerization of storage, whereas much of the containerization evolution to date has focused on applications and middleware functionality.

According to Kobielus:

“One of the traditional vulnerabilities or weaknesses of Kubernetes or Docker, and for that matter, Linux containers, has been that they weren’t geared for storage or persistence. However, Wikibon has provided a fair amount of innovation regarding storage space in terms of leveraging Kubernetes and containers, Docker and so forth, for data persistence within the cloud environment.”

There have been several initiatives to containerize storage (storage is sometimes described as the heart of Big Data Analytics). NetApp claims its new Kubernetes service can run a StackPoint engine inside AWS, Google Cloud Platform, and Microsoft Azure. (It also supports DigitalOcean, Packetclouds, etc.) Additionally, the Cloud Native Computing Foundation has Rook, which is a storage containerization and orchestration backplane for unstructured data.

Hadoop Storage

A trend is forming in the world of big data analytics platforms: Hadoop is increasingly being used for storage. It is being used for data storage, data archiving, data transformation, and Data Governance. Hadoop is an open source core platform used by many organizations working with big data for a variety of purposes. Consequently, the Hadoop Distributed File System (HDFS) has become quite popular. Hadoop is being used in on-premises clouds, public clouds, and hybrid clouds. The Hadoop ecosystem is now in the process of being containerized. Red Hat is one of the prime implementers of Kubernetes in the cloud. Kobielus said:

“Innovators are basically taking all the components of the Hadoop ecosystem into their plan, and then containerizing them so they can be deployed and scaled and managed independently. They are then orchestrated in various combinations using Kubernetes.”

Much of the real-world enterprise deployment of these open source platform technologies for data analytics involves just that. “They are combining them in various and sundry ways that the original developers of these separate open source projects didn’t fully anticipate,” commented Kobielus. All of those platforms are being containerized, and the trend is not going to change anytime soon.
