Trends in the Data Management space tend to emerge rapidly and mature slowly. Those most likely to shape how Data Management is practiced in the coming months are explored below. They reflect the maturation of several technologies and platforms that are setting the direction for the field.
As a subset of predictive analytics, Machine Learning serves a pivotal function for the enterprise today: it offers a way to tailor consumer interaction and foster loyalty in the wide-open space of the digital era. In 2014, Machine Learning was most commonly deployed to support ecommerce, predict customer churn, and assist in various aspects of scientific research. Its capacity to use existing data to predict future outcomes applies to many other verticals and facets of the enterprise, from initial marketing efforts to improved customer support.
For example, data and analytics consultants Clarity Solution Group worked with a healthcare payer to predict out-of-network utilization as a means of measuring the efficacy of steerage and replacement programs. When integrated with Big Data initiatives, Machine Learning can provide insight into virtually any aspect of organizational interest. Tripp Smith of Clarity commented on Machine Learning’s potential to impact the enterprise:
“There’s really broad scale applications in terms of measurement and complex correlating factors with one’s business. I think there’s lots of opportunities and when you take that scalability and applicability to the depths of data that you can work with in these very cost effective, massively parallel environments, now you can actually take those techniques and apply them in new and unique ways.”
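To make the churn-prediction use case mentioned above more concrete, here is a minimal sketch of a model learning from past outcomes to score new customers. It is a hand-rolled logistic regression in plain Python, not the method used by Clarity or any vendor; the feature names, the scaling constants, and every data point are invented for illustration.

```python
import math

# Invented toy history: (months since last purchase, support tickets) -> churned (1) or stayed (0).
HISTORY = [
    ((1.0, 0.0), 0), ((2.0, 1.0), 0), ((1.0, 1.0), 0), ((3.0, 0.0), 0),
    ((8.0, 3.0), 1), ((9.0, 4.0), 1), ((7.0, 5.0), 1), ((10.0, 2.0), 1),
]

# Assumed upper bounds used to bring both features into a 0..1 range.
SCALE = (10.0, 5.0)


def scale(features):
    return [value / limit for value, limit in zip(features, SCALE)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, features):
    """Predicted probability of churn for one customer."""
    return sigmoid(sum(w * x for w, x in zip(weights, scale(features))) + bias)


def train(rows, epochs=2000, learning_rate=0.5):
    """Fit the weights with plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, churned in rows:
            error = score(weights, bias, features) - churned
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, scale(features))]
            bias -= learning_rate * error
    return weights, bias


def churn_probability(model, features):
    weights, bias = model
    return score(weights, bias, features)


MODEL = train(HISTORY)
```

Calling `churn_probability(MODEL, (9.0, 4.0))` scores a long-inactive customer with several tickets, while `churn_probability(MODEL, (1.0, 0.0))` scores a recently active one. In practice a library model trained on far more data would replace this loop.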
The data landscape as a whole has been waiting for the widespread adoption of Big Data since the term became popular around 2012. The enterprise’s advantage in embracing these technologies will be significantly enhanced by contemporary developments in SQL, Hadoop, and Spark. Almost every major Big Data vendor and platform supports SQL in some way, which will have profound effects on both adoption rates and the way that Big Data is deployed. With SQL as the common framework used by data engineers and Data Scientists, organizations can reap the performance benefits of Big Data applications with Machine Learning analytics without devoting substantial time or money to writing and testing code. As a result, Big Data initiatives are now much more accessible to developers, more affordable, and easier to integrate with existing enterprise infrastructure.
Big Data initiatives are also more profitable due to a number of advancements in Hadoop, including increased SQL support through Hive on HDFS, query optimization, and ACID transactions. These developments have made Hadoop a much more reliable foundation for building consistent applications while decreasing the time it takes to build them. These improvements, when combined with Hadoop’s cost-effectiveness and scalability, significantly enhance its role as a Big Data platform. Consequently, it enables organizations to expand their functionality while also reducing costs, which lets them focus more on business and less on technology. Smith mentioned that:
“If you think about some of the things you’re using as far as an operational data store for fast data sets, or basic ELT type processes, we’ve demonstrated that Hadoop not only satisfies those use cases but it also has proven scale…It actually allows our clients to focus more on solving their business problems than trying to readdress technology problems and spending money on really expensive, proprietary software.”
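The ELT pattern Smith refers to, where raw data is loaded first and then transformed with SQL inside the engine, can be sketched as follows. This is an illustration only: Python's built-in `sqlite3` stands in for a SQL-on-Hadoop engine such as Hive, and the table names, column names, and claim rows are all invented.

```python
import sqlite3

# Invented raw rows, landed exactly as a source system might send them (all text).
RAW_CLAIMS = [
    ("c1", "member-1", "IN_NETWORK", "120.50"),
    ("c2", "member-1", "OUT_OF_NETWORK", "300.00"),
    ("c3", "member-2", "IN_NETWORK", "80.00"),
    ("c4", "member-2", "in_network", "45.25"),
]


def run_elt(rows):
    conn = sqlite3.connect(":memory:")

    # Extract + Load: store the data untouched, with no cleansing up front.
    conn.execute(
        "CREATE TABLE raw_claims "
        "(claim_id TEXT, member_id TEXT, network TEXT, amount TEXT)"
    )
    with conn:  # the whole load commits together or rolls back together
        conn.executemany("INSERT INTO raw_claims VALUES (?, ?, ?, ?)", rows)

    # Transform: typing, normalization, and aggregation happen in SQL, after loading.
    with conn:
        conn.execute(
            """
            CREATE TABLE member_spend AS
            SELECT member_id,
                   SUM(CASE WHEN UPPER(network) = 'OUT_OF_NETWORK'
                            THEN CAST(amount AS REAL) ELSE 0.0 END)
                       AS out_of_network_spend,
                   SUM(CAST(amount AS REAL)) AS total_spend
            FROM raw_claims
            GROUP BY member_id
            """
        )

    return conn.execute(
        "SELECT member_id, out_of_network_spend, total_spend "
        "FROM member_spend ORDER BY member_id"
    ).fetchall()


MEMBER_SPEND = run_elt(RAW_CLAIMS)
```

Because the transformation is expressed in SQL rather than custom code, the same statement could in principle be pointed at a much larger engine with few changes, which is the accessibility argument made above.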
Some of Hadoop’s newfound transactional and operational capabilities are attributable to Apache Spark. Spark is a processing engine designed to accommodate Big Data sets, analytics (including Machine Learning), and fast transactions. It has a streaming component as well as a layer that arbitrates between memory and disk by caching data. Spark handles transactions far faster than MapReduce’s traditional batch-oriented processing, and can support real-time Business Intelligence and analytics. Its capacity for data integration and complex calculations improves Hadoop’s overall utility for Big Data sets and for building applications on them.
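The difference between batch-oriented processing and a streaming engine that keeps working state in memory can be illustrated in plain Python. This sketch does not use Spark or any Spark API; the class, function, and event data are hypothetical and exist only to show the idea of updating cached results per micro-batch instead of rescanning everything.

```python
from collections import defaultdict


class RunningTotals:
    """Keeps per-channel totals in memory and updates them one micro-batch at a time."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.batches_seen = 0

    def update(self, batch):
        # Only the newly arrived events are touched; earlier results stay cached.
        for channel, amount in batch:
            self.totals[channel] += amount
        self.batches_seen += 1
        return dict(self.totals)


def batch_totals(all_events):
    """Batch-style alternative: rescan every event whenever an answer is needed."""
    totals = defaultdict(float)
    for channel, amount in all_events:
        totals[channel] += amount
    return dict(totals)


# Invented events arriving in three small batches.
MICRO_BATCHES = [
    [("web", 10.0), ("store", 5.0)],
    [("web", 2.5)],
    [("store", 7.5), ("mobile", 1.0)],
]

stream = RunningTotals()
SNAPSHOTS = [stream.update(batch) for batch in MICRO_BATCHES]
ALL_EVENTS = [event for batch in MICRO_BATCHES for event in batch]
```

Both approaches arrive at the same final totals, but the streaming version has an up-to-date answer after every batch, which is what makes near real-time reporting feasible.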
Although some of the initial hype surrounding Data Science and its educational programs has subsided, it still plays a valuable role for organizations looking to access Big Data to solve business problems and enhance processes. The aforementioned advancements in SQL and Hadoop’s various components mean that there is less of a need for Data Scientists, particularly with several analytics vendors offering self-service advanced analytics options. Additionally, the Data Science knowledge that once depended on some of the lesser-known aspects of Java has given way to more accessible languages such as R and Python.
What these improvements ultimately mean for Data Scientists is that they are freed from relatively mundane (yet necessary) tasks such as writing code and managing configurations, leaving more time to dedicate to end solutions that directly address business needs. Smith stated: “What we’re starting to see more and more of as tools like Hive become more robust is our clients starting to offload some of the ETL to ELT processes and then also looking at Hadoop as an option for replacement of where your operational data stores have stood in the past.”
Increasing support for SQL options with Big Data enables the enterprise to combine the best attributes of SQL with those of Big Data platforms. While the scalability of NoSQL offerings is formidable, their capability to provide a schema-free environment is another valued attribute that can greatly enhance modern data management strategies. The ability to analyze data regardless of schema is one of the distinct advantages of Data Lakes and similar data stores. Such repositories are most beneficial for rapidly ingesting data without discarding any of it, which particularly benefits real-time operational capabilities. This approach enables organizations to analyze data of various types without schemas, and to move that data into modeled formats (in SQL environments in some cases) when they are ready.
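The idea of ingesting everything first and applying a model only when the data is read can be sketched in a few lines. This is a simplified illustration rather than how any particular Data Lake product works; the record shapes and field names are invented.

```python
import json

# Invented raw events with differing shapes, stored exactly as received.
RAW_LINES = [
    '{"id": 1, "customer": "a", "amount": "19.99"}',
    '{"id": 2, "customer": "b", "amount": 5, "coupon": "SPRING"}',
    '{"id": 3, "user": {"name": "c"}, "total": 12.5}',
]


def ingest(lines):
    """Schema-free ingestion: keep every record and every field, impose no structure."""
    return [json.loads(line) for line in lines]


def read_orders(records):
    """Schema on read: project whatever shape arrived onto (id, customer, amount)."""
    rows = []
    for record in records:
        customer = record.get("customer") or record.get("user", {}).get("name")
        amount = record.get("amount", record.get("total"))
        rows.append((record["id"], customer, float(amount)))
    return rows


LAKE = ingest(RAW_LINES)
ORDERS = read_orders(LAKE)
```

Nothing is discarded at ingestion (the `coupon` field is still in `LAKE`), and the modeled view in `ORDERS` can be redefined later without reloading the source data.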
Big Data Governance
The larger volumes and greater variety of data that Big Data brings require clear governance protocols, especially when massive repositories such as Data Lakes are involved. Data Governance in such situations requires a renewed emphasis on architecture and the design of a data management strategy. Desired inputs include business rules, glossaries, and the responsibilities of those in positions of governance. With a reference architecture designed to automate these principles, outputs should include quantifiable data quality and metadata standards that are consistently met. Data models and reusable elements of code also help automate Data Governance. Smith reflected on the value of architecture for Data Governance:
“When we look at Data Management problems, we really see an opportunity to put a very consistent framework around them in terms of architecture and in terms of the design patterns that you use. When you think about the actual execution in Data Management, what you want to be able to do is actually quantify data quality; you want to have assertions around your business rules and assumptions about the data.”
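Quantifying data quality through assertions about business rules, as Smith describes, might look something like the sketch below. The rules, the glossary of plan names, the threshold, and the sample records are all hypothetical; a real governance program would draw them from its own business glossary.

```python
# Invented sample records.
RECORDS = [
    {"member_id": "m1", "age": 34, "plan": "gold"},
    {"member_id": "m2", "age": -5, "plan": "gold"},
    {"member_id": None, "age": 51, "plan": "silver"},
    {"member_id": "m4", "age": 28, "plan": "platinum"},
]

# Hypothetical business rules, each expressed as a reusable check on one record.
RULES = {
    "member_id_present": lambda r: bool(r.get("member_id")),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
    "plan_in_glossary": lambda r: r.get("plan") in {"bronze", "silver", "gold"},
}


def quality_report(records, rules):
    """Return the share of records passing each rule (1.0 means every record passes)."""
    total = len(records)
    if total == 0:
        return {name: 1.0 for name in rules}
    return {
        name: sum(1 for record in records if check(record)) / total
        for name, check in rules.items()
    }


def failing_rules(report, threshold):
    """List the rules whose pass rate falls below the agreed quality threshold."""
    return sorted(name for name, rate in report.items() if rate < threshold)


REPORT = quality_report(RECORDS, RULES)
```

Running `failing_rules(REPORT, 0.9)` lists the rules that miss a 90 percent target, giving governance teams a number to track over time instead of an impression.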
The future of Data Management is heavily influenced by Big Data and increasingly advanced analytics options that include Machine Learning and other Cognitive Computing capabilities. It is reassuring for many enterprises to note the role that SQL will play in Big Data, particularly with the improvements to Hadoop and its operational capabilities facilitated by Spark and Hive.
The role of Data Modeling will vary by context: it is extremely useful from a governance perspective, less so when exploring data, necessary for application building, and vital to the role that Data Science plays for the enterprise. All of the aforementioned developments should help to relieve Data Scientists of some of the daily code and technology integration responsibilities they traditionally carried, allowing them to live up to the hype around this field by developing solutions that readily prove their value to the enterprise and its business objectives.