by Charles Roe
The two terms in the title of this article are relative newcomers on the Data Management (DM) scene and across the tech world. For a greenhorn, Big Data has taken over the landscape with prodigious speed; search for Big Data and, after the inevitable Wikipedia entry as the first result, there are another 1.3 billion pages. Data Science is not so blessed; the term has been bandied about in the industry for some years, but it is still only getting started. The job of Data Scientist is the newest, sexiest employment opportunity for the jacks-and-jills-of-all-trades in the DM marketplace. Fortunately for the many millions of enterprises worldwide now immersed in the surging tides of Big Data, Data Science is there to hold their hands, help them weather the storm and emerge bigger, stronger and ostensibly richer from their new (and interminable) Big Data experience.
What is Big Data?
To clear up any misconceptions, this article borrows a simple definition from Brian Hopkins and Boris Evelson: “Big data: techniques and technologies that make handling data at extreme scale economical.” It is an unpretentious definition that conceals a labyrinth of meanings, which this article will not attempt to disentangle (the Further References part at the end of this article has some links to deeper discussions of this topic). Also, according to IBM:
“Every day, we create 2.5 quintillion bytes of data–so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: from sensors used to gather climate information, posts to social media sites, digital pictures and videos posted online, transaction records of online purchases, and from cell phone GPS signals to name a few. This data is big data.”
And one last quote from Edd Dumbill at O’Reilly Radar should clear up any further misconceptions:
“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
Thus, Big Data comprises the collection, storage, analysis and sharing of the voluminous amounts of data now gathered throughout the world from traditional data warehouse systems, the Internet and other sources such as GPS tracking systems, mobile phone platforms and satellites, to name just a few. The term deals with datasets that are gargantuan; enterprises like Facebook, Twitter, Google, Amazon, Yahoo! and other big players in the data collection marathon have single distributed clusters exceeding 20 to 30 petabytes, and they are growing daily. In January of 2012, the data storage company Cleversafe announced a newly designed 10 exabyte storage system.
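To put these magnitudes side by side, the short Python sketch below converts the figures quoted above into common units. It assumes decimal (SI) units, where a petabyte is 10^15 bytes and an exabyte is 10^18 bytes; the function names are illustrative only. Note that IBM's daily figure describes data created worldwide, not data stored in any single system, so the comparison is purely a sense-of-scale exercise.

```python
# Decimal (SI) units are an assumption; binary units (PiB, EiB) would differ slightly.
PETABYTE = 10**15
EXABYTE = 10**18

# Figures quoted in the article.
DAILY_BYTES_CREATED = 25 * 10**17      # IBM: 2.5 quintillion bytes per day
LARGE_STORE_BYTES = 30 * PETABYTE      # upper end of the 20 to 30 petabyte range
CLEVERSAFE_BYTES = 10 * EXABYTE        # the 10 exabyte system announced in 2012


def daily_volume_in_exabytes() -> float:
    """Express the daily worldwide data creation figure in exabytes."""
    return DAILY_BYTES_CREATED / EXABYTE


def days_to_fill(capacity_bytes: int) -> float:
    """Days of worldwide data creation needed to fill a store of the given size."""
    return capacity_bytes / DAILY_BYTES_CREATED


def stores_filled_per_day(store_bytes: int) -> float:
    """How many stores of the given size one day of created data would fill."""
    return DAILY_BYTES_CREATED / store_bytes


if __name__ == "__main__":
    print(f"Daily creation: {daily_volume_in_exabytes()} EB")
    print(f"Days to fill 10 EB: {days_to_fill(CLEVERSAFE_BYTES)}")
    print(f"30 PB stores per day: {stores_filled_per_day(LARGE_STORE_BYTES):.1f}")
```

Under these assumptions, one day of worldwide data creation comes to 2.5 exabytes, which is why even a 10 exabyte system is small next to the global total.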
Big Data is traditionally characterized by the three Vs (VVV): Volume, Velocity and Variety:
- Volume: The amounts of data being created, replicated, shared, collected and distributed worldwide keep growing, and they have overwhelmed traditional database systems, which can no longer handle the petabyte volumes being collected through social networking, e-commerce and other high-volume data channels. Some organizations are collecting many terabytes of data per day. Other Big Data sources include weather information and meteorology, military data collection, medical and other research data, space and scientific data, along with the influx of mobile device data, social networking applications, predictive traffic systems and others. Big Data systems are those that can deal with such volumes in cost-effective ways.
- Velocity: Speed is key to collecting, analyzing and utilizing data for any enterprise. The explosion of e-commerce usage across the globe, the necessity of immediate point-of-interaction statistics and other velocity requirements have necessitated the creation of systems that can handle them. Big Data comprises the platforms that can reliably process such transactions rapidly: eBay cannot wait minutes for its bids and millions of other price alterations to update, it needs them now; a smartphone GPS app won’t work very well if it takes 14 minutes to update the user’s location, the user needs it now; Amazon cannot make money off its personal recommendations if it cannot consistently collect and assemble that data for each customer.
- Variety: Data in traditional relational storage systems was nicely boxed together in tables, columns and rows. It was, and still is, easily searchable and clearly delineated. Such is no longer the case with social networking, blogs, videos, chat rooms, different browsers, graphics, raw sensor data collected from numerous sources, emails and a veritable glut of other sources. Relational systems cannot effectively deal with such variety; Big Data systems were designed for such purposes.
Recently, others have added a fourth V to the mix: some call it Variability, others Value, and still others add Validity as a possibility. So over the next few years the acronym might become VVVV or even VVVVVV, though for now VVV is enough to keep people talking, enterprises worrying and DM professionals working.
What about Data Science?
There are no Data Science classes offered at universities, nor are there any books on the subject - not yet anyway, though they are probably on the way. Data Science does not fulfill the General Ed science requirement, but there are many Data Science websites with lots of information about this emerging field and, rather auspiciously, enough people are talking about it to give a more definitive idea of where it is heading. Data science is the practice of “translating massive data into predictive insights that lead to results.” This involves a skill set that Drew Conway lays out in his Data Science Venn Diagram:
- Hacking Skills: “Being able to manipulate text files at the command-line, understanding vectorized operations, thinking algorithmically; these are the hacking skills that make for a successful data hacker.”
- Math and Statistics Knowledge: “Once you have acquired and cleaned the data, the next step is to actually extract insight from it. In order to do this, you need to apply appropriate math and statistics methods, which require at least a baseline familiarity with these tools.”
- Substantive Expertise: “Data plus math and statistics only gets you machine learning, which is great if that is what you are interested in, but not if you are doing data science.”
Data science is not just the scientific study of data, though a deep love of data is necessary to be a successful data scientist; data science is “[t]he ability to take data -- to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it.” Data science “requires skills ranging from traditional computer science to mathematics to art,” and the ability to then take all of that data and tell a story with it. Data science is the poetry of data manipulation, and data scientists are the masters of what D.J. Patil calls "data jujitsu": the people who have the technical expertise, curiosity, cleverness and storytelling ability to “use both data and science to create something new.”
Illustrations - What’s the Point of Data Science?
The real reason Big Data has taken up residence with Data Science is the reality of the marketplace. Big Data is the new Colossus of Rhodes that stands guard over the world of Data Management, but this Colossus is not going to fall down in an earthquake; Big Data is not going anywhere. Enterprises need Data Science and multi-skilled Data Scientists to help navigate the ever-growing world of Big Data; those that invest in qualified Data Scientists will be able to traverse the Big Data quagmire more skillfully than those that do not. The illustrations below are meant only for discussion; they demonstrate some of the areas where Data Science can help move enterprises in the right direction.
In 2009, retail spending was approximately 6 percent of the US economy. But retail’s portion of consumer spending dropped from 50 percent in 1990 to roughly 42 percent in 2009. So, while consumers still spend considerable money in the retail sector, retail profitability is under intense pressure and retail businesses need to innovate. What can Data Science do for the retail industry?
- Increase marketing effectiveness through better analysis and use of customer demographics, purchase preferences, location-based statistics, average purchase sizes, personalized recommendations (online and in-store), collaborative filtering, smartphone apps, improved in-store promotions and others.
- Optimization of in-store shopping performance through heat sensors and image analysis to better understand shopper behavior patterns. Such analysis allows for better store layout, product placement and labor utilization, especially in terms of customer-service-oriented numbers.
- Enhanced customer experiences through multi-level rewards programs based on social media sentiment analyses, marketing campaign responses, click-stream Web data and peer sentiment data that help leverage buyer preferences, which in turn allows better pricing, assortment, placement and design development.
- Integrated performance optimization from employee surveys and customer shopping data, cashier transactional data like customers per hour, attendance tracking, employee scheduling, customer complaints and a range of other data that can be viewed together.
- Improved supply chain development through better inventory tracking and management, distribution channels, supplier negotiations based on in-store shopping records, and cost-saving measures from “GPS-enabled big data telematics,” which help with routing and transport management.
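Of the techniques named in the list above, collaborative filtering is perhaps the easiest to show in a few lines. The Python sketch below is a minimal user-based variant using cosine similarity: it scores products a customer has not yet rated by taking a similarity-weighted average of other customers' ratings. The customer names, products and ratings are invented purely for illustration, and production recommenders are considerably more involved; this is a sketch of the idea, not any retailer's actual method.

```python
from math import sqrt

# Hypothetical ratings (customer -> {product: rating from 1 to 5}), invented for illustration.
RATINGS = {
    "ana":   {"boots": 5, "tent": 4, "lantern": 5},
    "ben":   {"boots": 4, "tent": 5},
    "chloe": {"boots": 1, "stove": 5, "lantern": 2},
    "dev":   {"tent": 4, "stove": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse rating vectors stored as dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


def recommend(target, ratings, top_n=3):
    """Rank products the target has not rated by similarity-weighted average rating."""
    seen = ratings[target]
    weighted_sum, weight_total = {}, {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        similarity = cosine_similarity(seen, other_ratings)
        if similarity <= 0:
            continue  # customers with nothing in common contribute nothing
        for item, rating in other_ratings.items():
            if item in seen:
                continue  # only score products the target has not rated
            weighted_sum[item] = weighted_sum.get(item, 0.0) + similarity * rating
            weight_total[item] = weight_total.get(item, 0.0) + similarity
    scores = {item: weighted_sum[item] / weight_total[item] for item in weighted_sum}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    for product, score in recommend("ben", RATINGS):
        print(f"{product}: {score:.2f}")
```

In this toy data set, "ben" rates boots and tents much as "ana" does, so the lantern that "ana" rated highly ranks above the stove. At retail scale the same idea has to run over millions of customers and products, which is where the Big Data platforms discussed earlier come in.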
The American healthcare sector accounts for roughly 17 percent of GDP and employs some 11 percent of the working-age population; it is growing at an average rate of about 5 percent per year and is expected to grow even more over the next 20 years due to the Baby Boomer generation. Big Data initiatives in hospitals, and throughout the healthcare industry, have the ability to increase performance, enable technological improvements, enhance patient care, improve record keeping, reduce administrative costs and inefficiency and, in turn, reduce healthcare costs that are escalating to dangerous levels.
- Collection and analysis of “optimal treatment pathways” through a combined system of comparative effectiveness research, wherein the entire healthcare industry works together to optimize the treatment of certain diseases, would speed care, lower costs and increase the effectiveness of patient care. It would lower cases of overtreatment and undertreatment, allow for an easier system of individualized care within such pathways and build a system that would become more optimal as it got larger. It would also facilitate more effective clinical decision support systems not just within one hospital, but across the field, fostering a level of transparency that exists only in some areas today.
- The growth of remote patient monitoring systems would lower in-hospital stay costs, allow for better collection of patient data and allow better transmission and analysis of feedback for chronically ill patients. In 2010, more than 80 percent of healthcare costs were due to chronically ill patients; a Big Data system utilizing Data Science could better collect, monitor, analyze and provide care for these patients. Advances such as “chip-on-a-pill” technologies could further optimize the entire system.
- Integrated patient record, billing and insurance record systems could lessen fraud, overpayment, duplicate care, missed diagnoses, misunderstandings and mistakes, increase Medicare/Medicaid efficiency and improve the entire system.
- The costs of pharmaceutical and equipment R&D are astronomical. Data Science can help the medical research industry through predictive analysis and modeling, advanced algorithms and statistical analysis for clinical trials, improved allocation of research funds and schedules, analysis of data across the entire research-to-patient continuum, and the advancement of personalized care.
All industries need Data Science and the improvement and optimization of Big Data initiatives: manufacturing will benefit from more efficient production-to-consumer systems across the board; public sector administration can optimize productivity, innovation and operational efficiency; and consumers can access and utilize all this data through personalized data experiences, especially as the growth of mobile devices allows for better travel reports and traffic routing, shopping preferences, more effective health procedures, better information retrieval systems and a range of others that are just now beginning. Big Data needs Data Science and vice versa; together they allow us to create, manage, analyze and utilize data in all areas of our lives.