By Mahendraprabu Sundarraj.
In the last decade, real-time analytics became a reality in multiple industries thanks to advancements such as big data, distributed data processing, and easy-to-scale cloud infrastructure.
While eCommerce and tech companies have reaped the benefits of real-time analytics faster than most other industries, largely because they can keep most of their IT infrastructure in the cloud, industrial manufacturing companies still face challenges in reaching real-time analytics maturity. A typical factory must run on-premise systems such as an MES (Manufacturing Execution System) and edge device applications to meet high-availability and data security requirements. Hence the hybrid cloud (on-premise plus cloud) is widely accepted as the reference architecture for smart factories, and it is why the major cloud vendors, such as AWS, Azure, and Google Cloud, offer IoT-based cloudlet solutions to stream data from on-premise IT applications to the cloud.
In a hybrid cloud, data travels farther than in cloud-native infrastructure, which increases overall system latency for real-time analytics applications. The main challenges to achieving real-time analytics in a hybrid cloud fall into three categories:
- Network speed: Network bandwidth determines how fast data is transferred from an on-premise edge device to the cloud. In a global manufacturing company, data often must travel from one continent to another even when a cloud region is available close to a given factory. The benefits of streaming infrastructure or in-memory data processing platforms are limited when data must travel long distances.
- Compute power: Scaling up compute power in a local data center requires significant capital spend and time. If a smart factory increases production, data volume grows, and the demand for computing power in the on-premise infrastructure also goes up. If compute power is not scaled up continuously, real-time analytics will remain out of reach.
- Availability: Most edge device data needs heavy preparation before it is available to analyze for useful patterns. Data preparation therefore adds to system latency on top of the challenges above.
Below are the building blocks for real-time analytics in a hybrid cloud environment. Each one eliminates or reduces the impact of the challenges listed above.
Prepare data close to its source. Data preparation is best done close to the data source. Typical edge devices in a smart factory produce data either as small packets at fixed time intervals or as a continuous stream. Either way, consolidating the data at fixed intervals and preparing it in the local data center goes a long way toward reducing overall system latency. It might sound as if this introduces another batch processing layer that could increase latency. However, preparing the data in small batches before transferring it to the cloud eliminates or reduces data preparation at the centralized cloud data lake. Consolidating data preparation in the cloud not only increases system latency but also increases cloud compute cost; note that storage is cheap in the cloud, but compute is not.
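As a rough illustration of this micro-batching idea, the sketch below (plain Python, with hypothetical field names, not tied to any specific MES or edge platform) buffers raw sensor readings, drops malformed ones, and emits a small consolidated batch at flush time so that only prepared data leaves the factory:

```python
from collections import defaultdict

class EdgeBatchPreparer:
    """Buffers raw edge readings and emits a prepared batch at fixed intervals.

    The field names (sensor_id, value) are illustrative only.
    """

    def __init__(self):
        self._buffer = []

    def add(self, reading):
        # Basic cleansing at the source: drop malformed or incomplete
        # readings instead of shipping them to the cloud.
        if "sensor_id" in reading and isinstance(reading.get("value"), (int, float)):
            self._buffer.append(reading)

    def flush(self):
        # Consolidate the interval's readings into one small prepared batch:
        # here, the mean value per sensor.
        by_sensor = defaultdict(list)
        for r in self._buffer:
            by_sensor[r["sensor_id"]].append(r["value"])
        self._buffer.clear()
        return {sid: sum(vals) / len(vals) for sid, vals in by_sensor.items()}

prep = EdgeBatchPreparer()
prep.add({"sensor_id": "press-1", "value": 10.0})
prep.add({"sensor_id": "press-1", "value": 14.0})
prep.add({"sensor_id": "oven-3", "value": 200.0})
prep.add({"sensor_id": "oven-3"})          # malformed: no value, dropped
batch = prep.flush()
print(batch)                               # {'press-1': 12.0, 'oven-3': 200.0}
```

In a real factory, flush would run on a timer and hand the prepared batch to whatever transfer mechanism streams it to the cloud.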
Format data. Big data file systems and cloud data lakes work most efficiently with optimized columnar data formats. As part of the data preparation close to the source, each prepared batch can be converted into a columnar format before being transferred to the cloud. Optimized formats often shrink files to half or even a quarter of their original size, so the data can be transferred in less than half the time it usually takes.
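In practice a batch would be written in a real columnar format such as Parquet or ORC, but the standard-library-only sketch below shows why the columnar layout shrinks transfer size: field names are stored once instead of once per record, and similar values sit next to each other, which also helps compression:

```python
import json
import zlib

# Illustrative edge readings; the sensor names and fields are made up.
records = [
    {"sensor_id": f"press-{i % 4}", "status": "OK", "value": 10.0 + i % 7}
    for i in range(1000)
]

# Row-oriented layout: every record repeats every field name.
row_bytes = json.dumps(records).encode()

# Column-oriented layout: each field name appears once, values kept together.
columns = {key: [r[key] for r in records] for key in records[0]}
col_bytes = json.dumps(columns).encode()

print(len(col_bytes) < len(row_bytes))                   # True: columnar is smaller
print(len(zlib.compress(col_bytes)) < len(col_bytes))    # True: and it compresses further
```

Formats like Parquet go much further than this toy layout (encodings, per-column compression, statistics), but the size reduction the article describes comes from the same principle.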
Catalog data. Data catalogs (metadata about the data stored across a distributed cluster) make data retrieval much more straightforward; it is as if all the data came from a single node. When most of the data preparation happens before the data reaches the cloud, it is available to downstream analytics applications as soon as it lands in the cloud data lake. Data catalogs further simplify data retrieval and increase the efficiency of in-memory data processing systems.
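Real deployments would use something like a Hive Metastore or the AWS Glue Data Catalog; the toy in-memory catalog below (all names and paths are illustrative) only shows the lookup pattern, mapping table partitions to file locations so consumers find data without scanning the whole store:

```python
from datetime import date

class DataCatalog:
    """Toy metadata catalog: maps (table, partition date) to file locations."""

    def __init__(self):
        self._entries = {}   # (table, partition_date) -> list of file paths

    def register(self, table, partition_date, path):
        # Called as each prepared batch lands in the data lake.
        self._entries.setdefault((table, partition_date), []).append(path)

    def locate(self, table, start, end):
        # Consumers ask the catalog where the data lives instead of
        # scanning the distributed file system themselves.
        return [path
                for (tbl, day), paths in sorted(self._entries.items())
                if tbl == table and start <= day <= end
                for path in paths]

catalog = DataCatalog()
catalog.register("press_metrics", date(2021, 3, 1), "s3://lake/press/2021-03-01/a.parquet")
catalog.register("press_metrics", date(2021, 3, 2), "s3://lake/press/2021-03-02/a.parquet")
catalog.register("oven_metrics", date(2021, 3, 1), "s3://lake/oven/2021-03-01/a.parquet")
files = catalog.locate("press_metrics", date(2021, 3, 1), date(2021, 3, 2))
print(files)
```

Because registration happens the moment a prepared batch arrives, downstream applications can query the newest partition immediately, which is exactly the availability benefit described above.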
Handle a variety of workloads. Data is usually consumed in a wide variety of ways: APIs, distributed query engines, visual analytics applications, and machine learning. Copying the data to multiple storage platforms to serve these workloads increases latency, so the data storage layer must be chosen carefully to handle a wide variety of analytics workloads from a single copy of the data.
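A minimal sketch of that single-copy idea, with made-up sensor data: one column-oriented table in memory serving both an API-style point read and an analytics-style aggregation, with no second storage platform and no copy in between:

```python
# One column-oriented table serving two very different workloads.
# The sensor names and values are illustrative only.
table = {
    "sensor_id": ["press-1", "press-2", "oven-3", "press-1"],
    "value":     [12.0,      9.5,       200.0,    14.0],
}

def api_lookup(sensor_id):
    # API-style workload: latest reading for one sensor (scan from the end).
    for i in range(len(table["sensor_id"]) - 1, -1, -1):
        if table["sensor_id"][i] == sensor_id:
            return table["value"][i]
    return None

def analytics_mean():
    # Analytics-style workload: aggregate over the whole value column.
    return sum(table["value"]) / len(table["value"])

print(api_lookup("press-1"))   # 14.0
print(analytics_mean())        # 58.875
```

Storage platforms that expose one dataset to SQL engines, BI tools, and ML training jobs alike follow the same principle at scale: the workloads differ, but the data is not duplicated to serve them.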
These building blocks will help industrial manufacturing companies build real-time or near-real-time analytics applications even on a hybrid cloud infrastructure.