Data Architects and CDOs today face a variety of demands, including delivering Real-Time Analytics, predicting customer behavior, and operationalizing Machine Learning. All of these initiatives share one critical requirement: real-time data. But to get value from real-time data, organizations must continuously collect it and integrate it in-flight across heterogeneous systems.
For years, companies have sought the ability to get immediate value from their data – whether from databases, files, message queues, or devices – while it’s on the move. But it’s a struggle when developer solutions rely on multiple best-of-breed open source projects, enterprise data sources (including transactional databases) keep multiplying, and pipelines must seamlessly span on-premises, Cloud, and Edge Computing systems.
Piecing everything together for delivery to an open source solution like Hadoop for distributed Big Data Analytics, or to Kafka for real-time data pipelines and streaming apps, challenges even the best organizations. Consideration must be given to a variety of factors: what to use for ingestion to create streams; what to use for stream processing and analysis to transform and correlate data; and what to use for delivery, alerting, and visualization. Finally, one must consider how to build the coding glue that makes everything work together at scale and integrates seamlessly into the existing architecture.
Strong, streamlined integration has only grown in importance in today’s environment, where data modernization equates to choosing the technology that is most appropriate to answer the questions organizations want to ask of their data. Data Warehouses, for example, are not the best repository for data that will be harnessed for Machine Learning, whereas graph databases are better at surfacing connections in data that models can learn from and use to make predictions. In addition to continuously ingesting real-time data from multiple sources, integration solutions must also ensure data transport reliability for mission-critical applications whenever time-series analysis or complex event processing – or anything else that builds up state over time – comes into play.
No developer can afford to create a streaming data solution that, in the event of failure, can’t determine at what point the stream stopped and pick up where things left off. In industries such as finance and healthcare, either money or lives are at stake, according to Steve Wilkes, CTO of Striim.
Striim for Real-Time Data Integration
Striim is an end-to-end platform that not only takes on the challenge of Real-Time Data Integration, but also offers built-in stream processing, Data Analytics, alerting and dashboards for enterprises to drive a modern Data Architecture and gain more value from their high-velocity data. With dashboards, for example, users can view historical data, add data filters, search on data, and perform queries against data flows, in addition to viewing the real-time streaming ‘now.’
Striim does it all with an eye to mission-critical applications, building a failure-aware architecture with checkpointing and rollback.
“We want to enable developers to focus on solving business problems and not plumbing problems,” said Wilkes. “We provide the plumbing, the data movement and processing capabilities, and they plug in their own value adds – the business logic – to move the business forward.”
The platform takes into account enterprise-class considerations including scalability, since all the pieces combined to support streaming integration and Real-Time Analytics have to scale together. As an example, Wilkes points to enriching streaming data for Real-Time Analytics, which requires not only the streaming data, but contextual information that needs to be accessed from external systems and joined with the stream.
Reading directly from databases or external caches would involve queries that can take milliseconds each, blowing up any promise of real-time movement and processing. Striim solves such latency issues with a built-in in-memory data grid (or distributed cache) that supports automatic scaling over its clustered architecture. “Our queries inherently run in a cluster, with all the required data in-memory, so it’s high performance,” Wilkes noted.
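The enrichment pattern Wilkes describes can be sketched in a few lines. This is a minimal, illustrative example, not Striim’s actual API: the cache contents, event fields, and function names are all assumptions, standing in for a distributed in-memory data grid holding context that would otherwise require a per-event database query.

```python
# Hedged sketch: enriching a stream of events with contextual data held
# in an in-memory cache, avoiding a database round trip per event.
# Cache contents and event fields are illustrative, not Striim's API.

context_cache = {
    "cust-1": {"name": "Acme Corp", "tier": "gold"},
    "cust-2": {"name": "Globex", "tier": "silver"},
}

def enrich(event, cache):
    """Join a streaming event with cached context by customer id."""
    context = cache.get(event["customer_id"], {})
    return {**event, **context}

stream = [
    {"customer_id": "cust-1", "amount": 120.0},
    {"customer_id": "cust-2", "amount": 35.5},
]

enriched = [enrich(e, context_cache) for e in stream]
```

Because the lookup is a local in-memory read rather than a network query, the join cost per event stays in microseconds, which is the point Wilkes makes about running queries in a cluster with all required data in memory.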
Furthermore, Striim performs continuous processing using SQL-based queries, since SQL is the language of data, and a common language that Software Developers, Data Scientists, and Business Analysts all use. As for operationalizing Machine Learning, Wilkes discussed the ability to apply such models as new data streams in from a variety of sources in real time.
“It’s one thing to build a model and do things manually, and another where you can monitor flow, website traffic, or security data in real-time and make predictions from that,” he said.
Reflecting Data Reality
Striim’s technology architecture enables it to update all metrics in real-time as new data streams come in from various sources.
“If you run a query against a stream, it will output zero or more events every time it receives a new event, doing so continually with very low latency,” said Wilkes. “If you run a query against a time window over a stream, the query reruns and you get new results every time the contents of that window change.”
And if you do complex event processing – looking for patterns or sequences of events over time – the query generates results whenever new events complete a match.
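The time-window behavior Wilkes describes – rerunning the query whenever the window contents change – can be sketched as follows. This is a simplified illustration under stated assumptions (timestamps in seconds, a 10-second window, a running average standing in for the query), not Striim’s implementation.

```python
from collections import deque

# Hedged sketch of a time-windowed continuous query: each arriving
# event updates the window, evicts aged-out events, and re-emits the
# aggregate. The 10-second span and average are illustrative choices.

class TimeWindow:
    def __init__(self, span):
        self.span = span          # window length in seconds
        self.events = deque()     # (timestamp, value) pairs in order

    def add(self, ts, value):
        self.events.append((ts, value))
        # Evict events that have aged out of the window.
        while self.events and self.events[0][0] <= ts - self.span:
            self.events.popleft()
        # Re-run the "query" (here: an average) on every change.
        return sum(v for _, v in self.events) / len(self.events)

window = TimeWindow(span=10)
results = [window.add(ts, v) for ts, v in [(1, 4.0), (5, 6.0), (12, 8.0)]]
# results == [4.0, 5.0, 7.0]: the event at t=1 ages out by t=12
```

A production engine would handle out-of-order events, watermarks, and persistence of window state for the checkpointing and rollback mentioned above; this sketch shows only the core emit-on-change semantics.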
Data correlation is used to join together multiple streams of different live data metrics – such as a window of events or security data – which can then be used to support fast decision-making. One of Striim’s clients, for example, is a leading financial services company that leverages the technology for Security Analytics. The company created a distributed security hub based on Kafka that takes in all security data from firewalls, VPNs, and other devices. Striim then reads that data, correlates it, and writes the correlated events back into Kafka.
The correlation process means that the financial services company’s security analysts can see all the events that happened related to an IP address in the last second, for example, in a single record. Dozens of correlated security alarms from different sources in that one second might be cause to raise an alert:
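The per-IP, per-second correlation described above can be sketched as a grouping step followed by an alert rule. Everything here is illustrative – the field names, sources, and the “three distinct sources” threshold are assumptions, not the client’s actual logic – but it shows how dozens of raw alarms collapse into one record per IP per second.

```python
from collections import defaultdict

# Hedged sketch: correlate security events from multiple sources into
# one record per (ip, second), raising an alert when enough distinct
# sources fire together. Fields and threshold are illustrative.

def correlate(events, alert_threshold=3):
    buckets = defaultdict(list)
    for e in events:
        # Bucket each event by source IP and whole second.
        buckets[(e["ip"], int(e["ts"]))].append(e)
    correlated = []
    for (ip, second), evts in buckets.items():
        sources = {e["source"] for e in evts}
        correlated.append({
            "ip": ip,
            "second": second,
            "events": evts,
            "alert": len(sources) >= alert_threshold,
        })
    return correlated

events = [
    {"ip": "10.0.0.5", "ts": 100.2, "source": "firewall"},
    {"ip": "10.0.0.5", "ts": 100.6, "source": "vpn"},
    {"ip": "10.0.0.5", "ts": 100.9, "source": "ids"},
    {"ip": "10.0.0.9", "ts": 100.4, "source": "firewall"},
]

records = correlate(events)
```

Only the first IP trips the alert rule here, which mirrors the goal Wilkes describes next: surfacing the few genuinely high-profile combinations out of a massive flow of individual alarms.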
“One problem they were trying to solve was their security analysts were spending most of their time looking at all the alerts coming off of different security devices,” said Wilkes. “This let them combine all these things together to raise alerts for only the really high-profile things among the twenty terabytes a day of data flowing through their system.”
Use cases like this enhance the argument that the natural way to deal with data is in an event-driven fashion, because data itself is created event by event, he said. “Every piece of data exists because someone entered something in an application, used their phone, or went to a website,” Wilkes pointed out. “Data is not created in batches. That’s not the natural way to deal with it. The natural way is the streaming way.”
The good news is, Wilkes said, that enough memory is available now at a low enough cost that organizations can support a continuous streaming integration process that delivers instant data insights when needed. Getting immediate value from data doesn’t have to be a dream – now it can be a reality.
Photo Credit: Hilch/Shutterstock.com