by Sunil Soares
Brian M. Williams of IBM also provided insights for this blog.
Big Data Governance is the application of governance disciplines to Big Data, which is generally characterized in terms of the three “V’s” of volume, velocity and variety. In this blog, we will focus on governance over the velocity dimension of Big Data. Streaming (or complex event processing) applications analyze high volumes of high-velocity data in real time without landing interim results to disk. These applications support a multitude of use cases, including algorithmic trading for capital markets, surveillance for law enforcement and threat prevention for homeland security. Streaming applications need to consider two critical disciplines of Big Data Governance: data quality and information lifecycle management. We consider each discipline below.
Data Quality
Streaming applications need to consider two aspects of the underlying data:
- Sources – Data that is available as input to the streams application. This could come from a socket connection, database query, Java Message Service (JMS) topic/queue or file. Each source can usually be described by a schema, even though the transport medium may vary.
- Sinks or destinations – Data that is produced by a streams application and sent or written to a socket connection, database table, JMS topic/queue, file or web service. Once again, the destination schema must be modeled.
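The idea of modeling a schema for both sources and sinks can be sketched in a few lines of Python. This is a minimal illustration, not any particular product's API; the record fields (sensor ID, timestamp, temperature) and the CSV wire format are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical schema shared by a source and a sink; the fields are
# illustrative assumptions, not tied to any specific streaming product.
@dataclass
class SensorReading:
    sensor_id: str
    timestamp: float
    temperature: float

def parse_source_line(line: str) -> SensorReading:
    """Parse one CSV-formatted record arriving from a socket or file source."""
    sensor_id, ts, temp = line.strip().split(",")
    return SensorReading(sensor_id, float(ts), float(temp))

def format_sink_line(r: SensorReading) -> str:
    """Serialize a record for a socket or file sink using the same schema."""
    return f"{r.sensor_id},{r.timestamp},{r.temperature}"
```

The point is that the same schema definition governs the data whether the transport is a socket, a file or a JMS queue; only the parse/format layer changes.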
Before building a streaming application, Big Data teams need to understand the characteristics of the data. For example, to support the real-time analysis of Twitter data, the Big Data team would use the Twitter API to download a sample set of Tweets for further analysis. Profiling streaming data is similar to profiling data in traditional projects: both need to understand the characteristics of the underlying data, such as the frequency of null values. However, profiling streaming data also needs to consider two additional properties of the source data:
- Temporal alignment – Streaming applications need to discover the temporal offset when joining, correlating and matching data from different sources. For example, a streaming application that needs to combine data from two sensors needs to know that one sensor generates events every second while another generates events every three seconds.
- Rate of arrival – Streaming applications need to understand the rate of arrival of data:
- Does the data arrive continuously?
- Are there bursts in the data?
- Are there gaps in the arrival of the data?
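A simple way to profile rate of arrival from a sample of event timestamps is to examine the inter-arrival times. The sketch below is a hedged illustration: the thresholds for what counts as a “burst” (an unusually short gap) or a “gap” (an unusually long one) are arbitrary assumptions, and a real profiling exercise would tune them to the source.

```python
def profile_arrivals(timestamps, burst_factor=0.25, gap_factor=4.0):
    """Profile a sample of event timestamps (in seconds, sorted ascending).

    Returns the mean inter-arrival time, the number of bursts (intervals
    much shorter than the mean) and the number of gaps (intervals much
    longer than the mean). The factor thresholds are illustrative defaults.
    """
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = sum(deltas) / len(deltas)
    bursts = sum(1 for d in deltas if d < mean * burst_factor)
    gaps = sum(1 for d in deltas if d > mean * gap_factor)
    return {"mean_interarrival": mean, "bursts": bursts, "gaps": gaps}
```

Running this over a downloaded sample answers all three questions above: a small spread around the mean suggests continuous arrival, while the burst and gap counts flag irregular behavior the streaming application must tolerate.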
Let us consider the following case study, which describes a simple streaming application that correlates data from motion and temperature sensors in a school building. The Big Data team has profiled the data to understand its characteristics: data from the motion sensors arrives every 30 seconds, while data from the temperature sensors arrives every 60 seconds. The streaming application uses this information to conduct a temporal alignment of the motion and temperature sensor data that arrive at different intervals. It accomplishes this by creating a window during which it holds both types of sensor events in memory, so that it can match the two streams of data. The streaming application also uses reference data indicating that room A is a classroom and room B is the boiler room. The streaming application stores temperature data every 10 minutes in Hadoop. The analytics team uses the data at rest in Hadoop to build a normalized model indicating that the average temperature reading of the boiler room is 65 degrees at 3 AM and that of the classroom is 75 degrees at 9 AM. The streaming application then uses the available data to generate alerts, such as when sensor data does not arrive for five minutes, when the temperature of the boiler room rises to 75 degrees at 3 AM, or when the motion sensor detects movement in the classroom at 5 AM.
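The temporal alignment in this case study can be sketched as a window-based join. The code below is a simplified illustration, assuming motion events every 30 seconds and temperature events every 60 seconds as profiled above; events are modeled as (timestamp, room, value) tuples, and the 60-second window size is chosen to cover the slower stream.

```python
def align(motion_events, temp_events, window=60):
    """Pair each temperature reading with the latest motion reading for the
    same room that falls within `window` seconds before it.

    Events are (timestamp, room, value) tuples. In a real streaming engine
    the window would be held in memory and evicted as time advances; here
    we scan small in-memory lists for clarity.
    """
    pairs = []
    for t_ts, t_room, t_val in temp_events:
        candidates = [(m_ts, m_val) for m_ts, m_room, m_val in motion_events
                      if m_room == t_room and 0 <= t_ts - m_ts <= window]
        if candidates:
            _, m_val = max(candidates)  # latest motion event in the window
            pairs.append((t_room, t_ts, t_val, m_val))
    return pairs
```

With the two streams aligned per room, the alert rules from the case study (boiler-room temperature at 3 AM, classroom motion at 5 AM) become simple predicates over the joined tuples.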
Information Lifecycle Management
Information lifecycle management (ILM) is turned on its head in the context of real-time, streaming data. When data arrives at high velocity, Big Data teams need to know which data is valuable and which needs to be persisted. If the streaming analytics application can make this determination “in the moment,” then it can apply ILM policies to data in motion. For example, a streaming application may analyze sensor readings every tenth of a millisecond but store readings only once per second. However, when sensor readings begin to indicate anomalous behavior, the streaming analytics application may store every reading leading up to and after the event.
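This "store sparsely until something interesting happens" policy can be sketched as follows. This is a minimal illustration under stated assumptions: readings are processed in order, a ring buffer retains the few most recent unsaved readings so they can be back-filled once an anomaly is detected, and the threshold and stride values are arbitrary examples.

```python
from collections import deque

def select_for_storage(readings, threshold, normal_stride=10, pre_event=3):
    """Apply an ILM policy to data in motion.

    Under normal conditions, persist every `normal_stride`-th reading.
    Once a reading crosses `threshold`, back-fill the last `pre_event`
    unsaved readings (kept in a ring buffer) and persist everything after.
    Returns the list of (index, value) pairs selected for storage.
    """
    stored = []
    recent = deque(maxlen=pre_event)  # readings seen but not yet stored
    anomalous = False
    for i, value in enumerate(readings):
        if not anomalous and value > threshold:
            anomalous = True
            stored.extend(recent)  # readings leading up to the event
            recent.clear()
        if anomalous or i % normal_stride == 0:
            stored.append((i, value))
        else:
            recent.append((i, value))
    return stored
```

A production system would also need a rule for returning to the sparse policy once readings normalize; that is omitted here for brevity.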
Let us consider a case study in which a network monitoring system analyzes streaming data for abnormal events. The system analyzes NetFlow data from different routers; each NetFlow record contains statistical information such as the source IP address, destination IP address, and the number of bytes and packets. The network monitoring application profiles the data in real time and compares it with historical norms. It may observe an increase in traffic to a social media website at 9 AM when employees begin their workdays. However, if it notices an abnormally large volume of outgoing network traffic to a previously unknown destination, it may be a sign of exfiltration (data leaving the company’s network). The network monitoring system accomplishes real-time analytics by keeping a portion of network history in memory. The security operations team needs to determine how much data should live in memory. For example, it may decide to keep two hours of NetFlow records in memory and persist the readings to disk every minute for historical analysis. However, the application may begin to store all of the network history on disk once it observes the abnormal behavior.
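The in-memory history and baseline comparison from this case study can be sketched as below. This is an illustrative simplification, not a real NetFlow parser: the record is reduced to a destination IP and a byte count, the window is bounded by record count rather than by two hours of wall-clock time, and the "three times the baseline" rule for flagging abnormal traffic is an assumed example policy.

```python
from collections import deque

class FlowMonitor:
    """Keep recent flow records in memory and flag abnormal outgoing traffic.

    A flow is abnormal (in this simplified policy) when it targets a
    destination not seen in the in-memory window AND its byte count exceeds
    `spike_factor` times the window's average byte count.
    """
    def __init__(self, window_size=100, spike_factor=3.0):
        self.history = deque(maxlen=window_size)  # (dst_ip, bytes) records
        self.spike_factor = spike_factor

    def observe(self, dst_ip, bytes_out):
        """Record a flow; return True if it looks like possible exfiltration."""
        known = {ip for ip, _ in self.history}
        baseline = (sum(b for _, b in self.history) / len(self.history)
                    if self.history else None)
        abnormal = (baseline is not None
                    and dst_ip not in known
                    and bytes_out > baseline * self.spike_factor)
        self.history.append((dst_ip, bytes_out))
        return abnormal
```

When `observe` returns True, the application could switch its ILM policy as described above, persisting the full in-memory history to disk for forensic analysis rather than sampling once a minute.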
Streaming data introduces exciting new possibilities for Big Data analytics. As discussed in this blog, the Big Data Governance of streaming data needs to consider data quality and information lifecycle management issues that are specific to this environment.