DATAVERSITY® recently interviewed John Schroeder, the Founder of MapR, to find out his thoughts on what is approaching on the Data Management horizon. Schroeder has more than 20 years in the Enterprise Software space, with a focus on Database Management and Business Intelligence. Such a background gives Schroeder insight into how the world of Data Management has changed over time and what major trends are occurring now.
Artificial Intelligence Re-Emerges
Artificial Intelligence (AI) is now back in mainstream discussions, as the umbrella buzzword for Machine Intelligence, Machine Learning, Neural Networks, and Cognitive Computing, Schroeder said.
There is going to be a rapid adoption of AI using straightforward algorithms deployed on large data sets to address repetitive automated tasks, he said. “Google has documented [that] simple algorithms, executed frequently against large datasets yield better results than other approaches using smaller sets.” Compared to traditional platforms,
“Horizontally scalable platforms that can process the three V’s: velocity, variety and volume – using modern and traditional processing models – can provide 10-20 times the cost efficiency,” he adds. “We’ll see the highest value from applying Artificial Intelligence to high-volume repetitive tasks.”
Schroeder illustrates one simple use of AI that involves grouping specific customer shopping attributes into clusters. “Clustering is one of the very basic AI algorithms because once you can cluster items, then you can predict some behavior,” he said. It’s now possible to tune up an algorithm against a massive amount of data so that clusters get tighter and more useful very quickly, which keeps the data fresh and relevant, he said. When the standard deviation between points in an individual cluster is as tight as possible, it’s possible to make assumptions across the cluster, and provide offers and services to other customers within that cluster with reasonable expectation of success.
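The clustering Schroeder describes can be sketched with a bare-bones k-means loop. Everything here is illustrative: the two features (imagine average basket size and monthly visit count) and the fixed initial centroids are assumptions, not anything from the interview; a production system would use a library implementation with k-means++ initialization over far larger datasets.

```python
def kmeans(points, k, iters=20):
    """Cluster 2-D points into k groups; returns (centroids, labels)."""
    # Toy initialization: first and last points. Real systems use
    # random restarts or k-means++ to avoid poor local optima.
    centroids = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2
                + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members,
        # which tightens the spread within each cluster over iterations.
        for c in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels

# Two obvious customer segments: small occasional baskets vs. large frequent ones.
customers = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.2),
             (8.0, 9.0), (8.3, 8.7), (7.9, 9.1)]
centroids, labels = kmeans(customers, k=2)
```

Once the clusters are tight, the cluster a new customer falls into becomes the basis for the predictions and offers Schroeder mentions.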
When clustering is built into an operational system for an online retailer, like Amazon or Wal-Mart, the potential for influencing behavior is significant. In an online catalog with static pricing, the shopping cart abandonment rate is “through the roof,” he said. But with the use of Artificial Intelligence, stores can recommend other products, while in real time, search competitive pricing, dynamically adjust that price, and offer in-store coupons and price guarantees so customers feel that they are getting what they need for the best price available.
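The dynamic price adjustment described above can be reduced to a small decision rule. This is a hypothetical sketch, not a retailer's actual logic: the function name, the margin floor, and the match-the-lowest-competitor policy are all assumptions for illustration.

```python
def dynamic_offer(list_price, competitor_prices, cost=None, floor_margin=0.05):
    """Hypothetical real-time price adjustment: match the lowest
    competitor price, but never sell below a minimum margin over cost."""
    best_rival = min(competitor_prices)
    candidate = min(list_price, best_rival)  # undercut or match rivals
    if cost is not None:
        # Margin floor keeps the dynamic price from going below cost + 5%.
        candidate = max(candidate, cost * (1 + floor_margin))
    return round(candidate, 2)

# A rival lists the item at $47.50, so the $49.99 list price drops to match.
print(dynamic_offer(49.99, [47.50, 52.00], cost=40.0))
```

In practice this rule would run inside the operational system at page-render time, fed by a live competitive-pricing search.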
“It’s the speed of the world. Address every single subscriber on an individual basis in real time, before they switch to another company,” he said.
Big Data Governance vs Competitive Advantage
The “governance vs. data value” tug-of-war will be front and center moving forward. Enterprises hold a wealth of information about their customers and partners, and now face an escalating conflict between the Data Governance required for compliance and the freedom to use data to provide business value, all while avoiding damaging data leaks or breaches.
Schroeder said Master Data Management (MDM) is a big issue and it’s been a big issue for some time. It’s “very, very, very difficult for any organization to keep up” with governance, lineage, security, and access, especially while expanding the amount of data used in the organization. He says that smarter organizations are asking, “What part of our data has to be governed and be compliant, and what are other data sources that don’t require that? So it gets them out of the rat hole of trying to MDM everything in the world.”
“If I said, ‘Why don’t you go home tonight and take an Excel spreadsheet of every item in your house, and then log anything anybody touches, uses, or eats.’ You couldn’t get anything else done, right? So you’d have to say, ‘Somebody ate a banana, I’ve got to go update the database.’”
Leading organizations will apply Data Management between regulated and non-regulated use cases, he said. Regulated use cases require Data Governance, Data Quality, and Data Lineage so a regulatory body can report and track data through all transformations to the originating source. This is mandatory and necessary, but limiting for non-regulatory use cases where real-time data and a mix of structured and unstructured data yields more effective results.
Companies Focus on Data Lakes, Not Swamps
Organizations are shifting from the “build it and they will come” Data Lake approach to a business-driven data approach. Use case orientation drives the combination of analytics and operations, Schroeder said.
Some companies dream of a Data Lake where everything is collected in “one centralized, secure, fully-governed place, where any department can access anytime, anywhere,” Schroeder says. This could sound attractive at a high level, but too often results in a Data Swamp, which can’t address real-time and operational use case requirements, and ends up looking more like a rebuilt Data Warehouse.
In reality, today’s world moves faster.
Schroeder says that enterprises require analytics and operational capabilities to address customers, process claims and interface with devices in real time on an individual level. To compete with the fast-moving world of today:
“E-commerce sites must provide individualized recommendations and price checks in real time. Healthcare organizations must process valid claims and block fraudulent claims by combining analytics with operational systems. Media companies are now personalizing content served through set-top boxes. Auto manufacturers and ride sharing companies are interoperating at scale with cars and the drivers.”
And it’s not enough to have a business use case pre-defined. The business has to be “visionary enough that they think about the next few use cases as well, so they don’t want to paint themselves into a corner by only servicing the first use case.”
He predicts that businesses that define use cases in advance will be the most successful because, “The customers do a better job of articulating the requirements, they know what the value’s going to be,” which is the opposite of a generalized “build it, they’ll come” idea.
Delivering these use cases requires an Agile platform that can provide both analytical and operational processing to increase value from additional use cases that span from back office analytics to front office operations. Organizations will push aggressively beyond an “asking questions” approach and architect to drive initial and long term business value.
Data Agility Separates Winners and Losers
Schroeder says that processing and analytic models will evolve to provide a similar level of agility to that of DevOps, as organizations realize that data agility – the ability to understand data in context and take business action – is the source of competitive advantage.
“The mistake that companies can make is implementing for a single approach. They’ll say, ‘All we really need is to be able to do Spark processing. So we’re going to do this in a technology that can only do Spark.’ Then they get three months down the road and they say, ‘Well, now we’ve got to dashboard that out to a lot of subscribers, so we need to do global messaging [but] the platform we deployed on won’t do that. What do we do now?’”
Instead of bringing in another technology for messaging and trying to find a way to pipe data between Spark and the global messaging, then setting up access control and security roles and all that entails, companies can use technology that allows them to be more Agile and less siloed into one particular platform, he said:
“The emergence of Agile processing models will enable the same instance of data to support multiple uses: batch analytics, interactive analytics, global messaging, database, and file-based models. Analytic models are more Agile when a single instance of data can support a broader set of tools. The end result is an Agile development and application platform that supports the broadest range of processing and analytic models.”
Blockchain Transforms Select Financial Service Applications
“There will be select, transformational use cases in financial services that emerge with broad implications for the way data is stored and transactions [are] processed,” said Schroeder. “Blockchain provides obvious efficiency for consumers,” he said, “because customers won’t have to wait for that SWIFT transaction or worry about the impact of a central datacenter leak.”
Don Tapscott, co-author with Alex Tapscott of Blockchain Revolution, agrees with Schroeder in a LinkedIn article entitled “Here’s Why Blockchains will Change your Life”:
“Big banks and some governments are implementing blockchains as distributed ledgers to revolutionize the way information is stored and transactions occur. Their goals are laudable—speed, lower cost, security, fewer errors, and the elimination of central points of attack and failure.”
Schroeder goes on to say that as a trust protocol, blockchain provides “a global distributed ledger that changes the way data is stored and transactions are processed.” Because it runs on computers distributed throughout the world, adds Tapscott,
“There is no central database to hack. The blockchain is public: anyone can view it at any time because it resides on the network, not within a single institution charged with auditing transactions and keeping records.”
Transactions are stored in time-stamped blocks, each referring to the preceding block, which stores the data in a form that cannot be altered, said Schroeder. “For enterprises, blockchain presents a cost savings and opportunity for competitive advantage.”
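The block structure just described, time-stamped blocks that each carry the hash of their predecessor, can be sketched as a minimal hash chain. This is a toy illustration of the tamper-evidence property only; a real blockchain adds distribution, consensus, and proof-of-work or similar mechanisms, none of which are modeled here.

```python
import hashlib
import json
import time

def make_block(data, prev_hash):
    """Build a time-stamped block whose hash covers its contents
    and the hash of the preceding block."""
    block = {"timestamp": time.time(), "data": data, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()
    ).hexdigest()
    return block

def verify_chain(chain):
    """A chain is valid only if every block's stored hash matches its
    contents and links to its predecessor's hash."""
    for i, block in enumerate(chain):
        body = {k: v for k, v in block.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if block["hash"] != expected:
            return False  # block contents were altered after hashing
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False  # link to the preceding block is broken
    return True

genesis = make_block("genesis", prev_hash="0" * 64)
chain = [genesis, make_block({"from": "A", "to": "B", "amount": 10}, genesis["hash"])]
assert verify_chain(chain)        # untampered chain validates
chain[1]["data"]["amount"] = 999  # altering stored data...
assert not verify_chain(chain)    # ...breaks verification
```

Because each hash depends on the previous one, rewriting any historical block invalidates every block after it, which is the "cannot be altered" property Schroeder points to.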
Machine Learning Maximizes Microservices Impact
Data Management will see an increase in the integration of Machine Learning and microservices, he said. Previous deployments of microservices focused on lightweight services. Those that have incorporated Machine Learning,
“have typically been limited to ‘fast data’ integrations that were applied to narrow bands of streaming data,” Schroeder says. “We’ll see a development shift to stateful applications that leverage Big Data, and the incorporation of Machine Learning approaches that use large amounts of historical data to better understand the context of newly arriving streaming data.”