Data Container Architecture: Answers and Challenges

Data containers are a recent step in the evolution of the cloud. Their purpose is to run reliably as they are transferred from one computer environment to another. They are isolated, improve security, and are “usually” easy to work with. Containers promote application portability through their novel use of packaging, and the philosophy of “develop once, deploy anywhere.” Containers are still in the early stages of growth, though many issues can be eased with the correct container architecture.

Renat Zubairov, CEO of elastic.io (a vendor of hybrid integration platforms), said:

“The lifecycle management for containers brings challenges. When people decide to start using containers, they usually assume that this is the same as, or at least similar to, using virtual machines. In reality, containers differ significantly from virtual machines.”

While virtual machines have their own strengths, the primary benefit of containers is their small size, combined with the capacity to run many containers simultaneously on a single server. Another advantage of containers is their modularity. For example, an app can be divided into several services, each running in its own container, using an architectural approach called microservices. Containers and microservices are designed to work well together. These strengths are the reasons DevOps teams prefer containers when developing, testing, and building apps in the cloud.

Containers have made using the cloud easier. A container is, in essence, a self-contained package of application code and its dependencies that can be installed on a variety of computer systems. By design, containers promote the use of private and hybrid clouds. A private cloud typically offers more control over security than a public cloud, but still allows the container to be shared when necessary. Hybrid clouds allow for the use of private clouds to do the bulk of the work, and the use of tools available on public clouds for fine-tuning.

Many organizations work with hybrid clouds and containers to improve the lifecycle of developing applications by using the appropriate tools, such as continuous integration and continuous delivery. Additionally, containers support the Agile and DevOps philosophies of efficient and continuous software delivery practices.

Containers emphasize easy portability and management at the expense of persistence. They are, by design, meant to be both isolated and easily disposable. This combination is both their strength, and their weakness.

Data Persistence

When a container writes data to a disk, it uses a virtual file system that “exists” within the container itself. The contents generally disappear after the container has been terminated or has crashed. A container, as an isolated system, does not support “persistence” by default. Processes within the container do not write data directly onto the host’s filesystem. Containers are isolated and do not share storage with other containers. Nor do they share storage with local applications, daemons, etc., that exist in the host operating system.
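The behavior described above can be illustrated with a toy model. This is only a sketch in plain Python, not how a container runtime is implemented: the `Host` and `Container` classes, the `/data` mount path, and the `appdata` volume name are all hypothetical names chosen for the example.

```python
class Host:
    """Toy stand-in for a host machine that owns named volumes."""

    def __init__(self):
        self.volumes = {}  # volume name -> {path: data}


class Container:
    """Toy container with a private writable layer and optional volume mounts."""

    def __init__(self, host, mounts=None):
        self.host = host
        self.layer = {}  # private writable layer, discarded on removal
        self.mounts = dict(mounts or {})  # mount path -> volume name
        for volume_name in self.mounts.values():
            host.volumes.setdefault(volume_name, {})

    def _store_for(self, path):
        # Paths under a mount point go to the host-owned volume;
        # everything else goes to the container's own layer.
        for mount_path, volume_name in self.mounts.items():
            if path.startswith(mount_path.rstrip("/") + "/"):
                return self.host.volumes[volume_name]
        return self.layer

    def write(self, path, data):
        self._store_for(path)[path] = data

    def read(self, path):
        return self._store_for(path).get(path)

    def remove(self):
        # Removing the container throws away its private layer only.
        self.layer.clear()


host = Host()
first = Container(host, mounts={"/data": "appdata"})
first.write("/tmp/scratch.txt", "lost on removal")
first.write("/data/orders.txt", "kept on the host")
first.remove()

second = Container(host, mounts={"/data": "appdata"})
print(second.read("/tmp/scratch.txt"))
print(second.read("/data/orders.txt"))
```

The second container sees nothing under `/tmp`, because that data lived in the first container's private layer, while the file under the mounted path is still there because the host owns it.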

The isolation presented by containers is in direct opposition to the concept of persistence, though there are ways to manage the persistence of data for containers. Each method, however, involves trade-offs, especially in regard to portability and isolation. Configuring a container for persistence makes the process of securing and deploying it more difficult.

Docker’s volume plug-ins and Kubernetes’ persistent volumes framework both support the use of external volumes for storing and accessing data. You may come across the terms “stateless” and “stateful”: stateless containers do not store data, while stateful containers require some form of persistent storage.

Host Based Persistence

Host-based persistence is an early form of data durability for containers, which has matured and now supports several situations. This type of architecture uses the underlying host for persistence and storage, and bypasses the union filesystem backends in order to access the host’s native filesystem. The data is stored outside of the container, so it remains available even after the container is removed.

The architecture supporting host-based persistence allows multiple containers to share volumes. However, data corruption can take place when multiple containers write to a single shared volume (in this case, “volume” is more like volume four of a book series rather than the amount of data moving through the system). Developers must ensure the applications have been designed to write safely to these shared data stores.

Because the data lives on the host’s native filesystem, the volumes can be read and written using “normal” Linux tools, but generally this should not be done. It runs the risk of causing data corruption if the containers and applications are not prepared for direct access.
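One common way for applications to coordinate writes to a shared file is advisory file locking. The sketch below is an assumption about how a writer might protect itself, not something the article prescribes: the `append_record` function and the `events.log` file name are hypothetical, and a temporary directory stands in for the shared mount. It relies on `fcntl.flock`, so it runs on Linux and other Unix-like systems only.

```python
import fcntl
import os
import tempfile


def append_record(path, record):
    """Append one line to a shared file while holding an exclusive advisory lock."""
    with open(path, "a", encoding="utf-8") as handle:
        # Blocks until any other cooperating writer releases its lock.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            handle.write(record.rstrip("\n") + "\n")
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before unlocking
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


shared_dir = tempfile.mkdtemp()  # stands in for the shared volume's mount point
log_path = os.path.join(shared_dir, "events.log")

append_record(log_path, "container-a: started")
append_record(log_path, "container-b: started")

with open(log_path, encoding="utf-8") as handle:
    print(handle.read(), end="")
```

Advisory locks only help when every writer uses them, which is one reason editing a volume by hand with ordinary tools is risky. Lock behavior on distributed or network filesystems can also differ from a local disk, so this should be checked against the storage actually in use.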

Docker Solutions

There are three architectural Docker solutions for providing host-based persistence, each with subtle differences in how they are implemented. They are:

Implicit Per-Container Storage: Creates a storage “sandbox” for the container requesting host-based persistence. By default, a directory is created on the host (under /var/lib/docker/volumes) when the container is created.

Unfortunately, this storage is tied to the container’s lifecycle: the directory can be deleted by the Docker Engine when the container is removed (for example, when the container was started with --rm or is removed with docker rm -v). These directories may also disappear when the Docker Engine crashes. Note that the data saved in the sandbox is accessible only to the container that requested it, and not to other containers.

Explicit Shared Storage (called “Data Volumes” in Docker): This second technique is used to share data among multiple containers that run on the same host. It requires an explicit location on the host filesystem to be used as a mount in one or more of the containers. This technique is useful when multiple containers need read-write access to the same directory. Since the directory on the host is created outside of Docker Engine’s context, it is available even after removing every container or even stopping Docker Engine. This technique is the most popular one used by DevOps teams. Data volumes can be accessed directly from the Docker host.

The problem with this technique is that the containers are no longer portable; persistence has replaced portability. The data now resides with the host, and does not transfer with the container.
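The difference between the first two techniques shows up in how the container is started. The following sketch only builds `docker run` command lines and does not execute them. The `docker_run_args` helper, the image name `example/app:latest`, the container names, and the paths are hypothetical; the `--mount` syntax is Docker's own.

```python
import shlex


def docker_run_args(image, name, anonymous_volume=None, bind_mount=None):
    """Build a `docker run` argument list for the two single-host techniques."""
    args = ["docker", "run", "-d", "--name", name]
    if anonymous_volume:
        # Implicit per-container storage: no source is given, so Docker
        # chooses and manages the host directory itself.
        args += ["--mount", f"type=volume,target={anonymous_volume}"]
    if bind_mount:
        # Explicit shared storage: a host directory that exists outside
        # Docker's management is mounted at a path inside the container.
        host_dir, container_dir = bind_mount
        args += ["--mount", f"type=bind,source={host_dir},target={container_dir}"]
    args.append(image)
    return args


implicit = docker_run_args("example/app:latest", "app-1", anonymous_volume="/data")
explicit = docker_run_args(
    "example/app:latest", "app-2", bind_mount=("/srv/shared", "/data")
)

print(shlex.join(implicit))
print(shlex.join(explicit))
```

The second command names a host path, which is exactly what ties the container to that host: another machine must offer the same directory before the same command will work there.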

Shared Multi-Host Storage: Combines a distributed filesystem with explicit storage. Containerized workloads in production are often run in a clustered environment, with multiple hosts providing the compute, network, and storage capabilities needed. In this arrangement, every node mounts the distributed filesystem at the same location, and that location is then used as a shared mount point for containers.

Docker, combined with host-based persistence, has its uses in specific cases, but it also severely limits the portability of containers by tying them to a specific host. Additionally, this system does not take advantage of the specialized storage backends designed for data-intensive workloads. In an effort to resolve these limitations, Docker has added volume plugins that extend the container’s capabilities to include different kinds of storage backends, without changing the deployment architecture or application design.

StorageOS: This product is operated primarily through Kubernetes, the orchestration platform. It sits between containers and the underlying storage, and can run on-premises, in the cloud, or in hybrid cloud situations. It offers storage for containers in Red Hat OpenShift, Kubernetes, and Docker, and comes with features that automate and protect storage.

A free version of StorageOS is available on the Docker Hub, if you want to check it out.
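With a volume plugin, the volume is created through a named driver and then mounted by name, so the container definition no longer mentions a host path. The sketch below builds the two commands without running them. It is generic and not specific to StorageOS: the driver name `example/driver`, the `size` option, the volume name, and the image are placeholders, since real driver names and options depend on the plugin installed.

```python
def volume_create_args(volume, driver, options=None):
    """Build `docker volume create` arguments for a plugin-backed volume."""
    args = ["docker", "volume", "create", "--driver", driver]
    for key, value in sorted((options or {}).items()):
        args += ["--opt", f"{key}={value}"]  # options are driver-specific
    args.append(volume)
    return args


def run_with_volume_args(image, volume, target):
    """Build `docker run` arguments that mount a named volume."""
    mount = f"type=volume,source={volume},target={target}"
    return ["docker", "run", "-d", "--mount", mount, image]


create_cmd = volume_create_args("orders-data", "example/driver", {"size": "10G"})
run_cmd = run_with_volume_args("example/app:latest", "orders-data", "/data")

print(" ".join(create_cmd))
print(" ".join(run_cmd))
```

Because the container refers to the volume only by name, the same run command can be reused on any host where the plugin can attach that volume.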

Kubernetes Manages Volume Lifetimes

Kubernetes offers a different kind of “volume.” In Docker, volumes are directories on a disk, or inside another container. A Kubernetes volume, on the other hand, works a little differently. It is enclosed within a pod. (Examples of pods include a peapod, or a pod of whales, or a group of containers sharing a network.) A Kubernetes volume will outlive any containers running inside the pod, and the data is preserved, even after container restarts. Kubernetes pods also support a variety of volume types, and can use all, or some, simultaneously.

When the pod ceases to exist, however, the volume disappears, and any history of what took place also disappears.
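A pod-scoped volume of this kind can be sketched as a pod manifest. The example below builds the manifest as a Python dictionary and prints it as JSON. It uses the Kubernetes `emptyDir` volume type, which is one volume type that follows the lifetime described above. The pod name, container names, image names, and the `/work` mount path are hypothetical.

```python
import json


def pod_with_shared_volume(name):
    """Return a pod manifest in which two containers share one pod-scoped volume."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Declared at pod level, so the volume belongs to the pod,
            # not to either container.
            "volumes": [{"name": "scratch", "emptyDir": {}}],
            "containers": [
                {
                    "name": "writer",
                    "image": "example/writer:latest",
                    "volumeMounts": [{"name": "scratch", "mountPath": "/work"}],
                },
                {
                    "name": "reader",
                    "image": "example/reader:latest",
                    "volumeMounts": [
                        {"name": "scratch", "mountPath": "/work", "readOnly": True}
                    ],
                },
            ],
        },
    }


manifest = pod_with_shared_volume("demo-pod")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl, which accepts JSON as well as YAML. If either container restarts, the files under `/work` remain; once the pod itself is deleted, the `emptyDir` contents are gone.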

Container Orchestration

“Container orchestration” describes the automated process of organizing and scheduling individual containers for applications that use microservices across multiple clusters. If eight containers are running four applications, it is not that difficult to organize and arrange the processes and deployment of the containers, so no automation is required. However, if 800 containers and 400 applications are operating, management becomes more difficult. When operating at scale, the orchestration of containers (the automation of deployment, scaling, networking, and availability) becomes a necessity.

Container orchestration manages the container’s lifecycle, and is particularly useful in large, complicated environments. Container orchestration controls and automates:

Provisioning and deploying containers
Moving containers when a shortage of resources exists within the host, or if the host dies
Load balancing between containers
Scaling up, or removing, containers to distribute and balance the application load across the host infrastructure
Redundancy and the availability of containers
Allocating resources among containers
Configuring applications according to the containers running them
Monitoring the containers and their hosts
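Two of the tasks listed above, placing containers according to available resources and moving them when a host dies, can be sketched with a toy scheduler. This is a deliberately simplified model and not how Kubernetes or any real orchestrator schedules work: the `Node` class, the abstract "units" of capacity, and the "most free capacity" placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy host with a fixed capacity measured in abstract resource units."""

    name: str
    capacity: int
    containers: dict = field(default_factory=dict)  # container name -> units
    alive: bool = True

    @property
    def free(self):
        return self.capacity - sum(self.containers.values())


def place(nodes, container, units):
    """Put the container on the live node with the most free capacity."""
    candidates = [n for n in nodes if n.alive and n.free >= units]
    if not candidates:
        raise RuntimeError(f"no node can fit {container}")
    target = max(candidates, key=lambda n: n.free)
    target.containers[container] = units
    return target.name


def handle_node_failure(nodes, failed_name):
    """Mark a node as dead and re-place every container it was running."""
    failed = next(n for n in nodes if n.name == failed_name)
    failed.alive = False
    orphans, failed.containers = failed.containers, {}
    return {name: place(nodes, name, units) for name, units in orphans.items()}


nodes = [Node("node-a", 8), Node("node-b", 8)]
place(nodes, "web-1", 2)
place(nodes, "web-2", 2)
place(nodes, "db-1", 1)

moved = handle_node_failure(nodes, "node-b")
print(moved)
print({n.name: sorted(n.containers) for n in nodes})
```

Doing this by hand for eight containers is manageable. The point of an orchestrator is to run the same kind of loop continuously, across hundreds of containers, along with the load balancing, configuration, and monitoring tasks the list also names.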

Restructuring Staff for Containers

As mentioned earlier, containers support the Agile and DevOps philosophies of efficient and continuous software delivery practices. Typically, the Operations Team (Ops) is responsible for the infrastructure (computers, the network, storage, etc.) and system resources (the file system, runtime libraries, etc.). The Development Team (Dev) works together in developing products and/or services. Changing over to a container system involves merging the responsibilities of the Ops and Dev teams. Reorganizing the tech staff is necessary when shifting to a container system, and may involve hiring temporary employees/contractors for specific projects.



Keith D. Foote
