by Charles Roe
The transformations within the Data Management industry are nothing less than phenomenal. Look back 40 or 50 years and we see the creation of IBM’s IMS in 1966 and the CODASYL work of 1969 that defined the early specifications for DDL and DML. The 1970s saw extensive work on relational database structures, the development of UNIX, more advances in file-oriented processing with the growth of COBOL and RPG, plus numerous other advances in computer networking with ARPANET and in computer architecture in general.
The decades that followed have seen the rise of SQL and other relational database technologies, data warehousing in the early 1980s, innovations in client-server computing, ER and other modeling specifications, and the remarkable progression in computing power that continues today. The growth of the Internet changed the game for everyone, and brought about the necessity for ever more balanced, thoughtful, and focused processes and procedures for the management of data in all organizations large and small. This list is only a small sample; many books have been written about this history.
The modern world of Data Management is filled with such essential disciplines as Data Governance, Master Data Management, Metadata, Business Intelligence, Data Architecture, and others, packaged and elucidated well by DAMA in their DMBOK and covered at conferences, in white papers, and on hundreds of industry blogs. Anyone involved in the industry has their favorite sources and their primary disciplines of interest.
The recent history of Data Management has seen the growth of new concepts like distributed computing, non-relational databases, semantics, Cloud computing, Agile, and the behemoth that has overtaken many conference rooms under a seemingly ambiguous name: Big Data. Throughout this history, in every industry, there is always something recherché, a new trend, an exotic or exquisite new idea that grabs the interest of pundits, gurus, and boardrooms alike. Big Data is certainly one of those in Data Management, but it’s more a Leviathan than a stylish new inspiration. Data Science, along with its cutting-edge practitioners, aptly named Data Scientists, is now the next big thing. Yet Data Science didn’t just show up by chance. It has actually been around for decades; history just didn’t need it until recently. It’s not just a trend, it’s a necessity.
Data Science combines the allure of Big Data, the fascination of Unstructured Data, the precision of advanced mathematics and statistics, the innovation of social media, the creativity of storytelling, the investigation and inquiry of forensics, and the ability to use all of those skills together while still being able to demonstrate the results to non-technical audiences. Data Science is the new vogue, the place to be for the best of the best, the new sexy, and its importance to the industry is only going to intensify.
How did this all transpire?
In the words of Anjul Bhambri, VP of Big Data Products at IBM, the Data Scientist is “part analyst, part artist.” Such a combination has a number of requirements, or precursors, which allowed it to evolve. The first is Big Data; without the growth of Big Data in recent years, the need for Data Scientists would not be so pronounced. The statistics about the growth of Big Data are everywhere and common knowledge to those in the Data Management industry; we’ve now gone from gigabytes to terabytes to petabytes to exabytes, and are now moving into the world of zettabytes. Big Data requires the ability to collect, store, analyze, and derive meaning from data quantities larger than ever before. The role of the Data Scientist is to grapple with that data, tame it, and extract significant information from it that allows their enterprise to gain an advantage in the marketplace.
The second reason such a job title now exists is complexity. A typical data analyst of the past would probably be working in only one or a small number of systems. A Data Scientist is required to extract data from multiple systems and sources that range from typical relational databases to social media aggregates, transactional systems to BI platforms, and document stores to GPS sensor data, along with a multitude of other sources that depend entirely on the organization in question and its needs at that time.
The term Data Science has been bandied about since the 1960s. The work of Peter Naur first started to popularize it, as he wanted to substitute Data Science for computer science; his seminal work Concise Survey of Computer Methods laid the groundwork for Data Science. In the 1990s the International Federation of Classification Societies actually added the term Data Science to their classifications. It wasn’t until the early part of this century that the term really started to take on the meaning (and thus the sexiness) it has today. In 2001 William S. Cleveland wrote a paper entitled “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics,” detailing the need for a new academic discipline, which he called Data Science, that combined statistics with computer science. The term then started to become more commonplace: the Data Science Journal began in 2002, the Journal of Data Science in 2003, and a host of others have come along since.
Data Science is now everywhere. Job postings abound for organizations looking for qualified Data Scientists and a number of universities are starting to offer courses and degrees in Data Science:
- Data Science Institute, Columbia University, and UC Berkeley all offer courses in Data Science now.
- University of Washington and Syracuse University offer certification programs in Data Science.
- College of Charleston has a B.S. in Data Science.
- Michigan State University, DePaul University, University of Michigan (Dearborn), New York University and many others are offering Master’s Degrees in various disciplines such as Business Analytics, Predictive Analysis, and Data Curation.
The listing above is only a small number of the schools worldwide that have offerings at all levels in Data Science related disciplines. Data Scientists are needed and they have the sexiest job in the entire industry. Computer science is now cool and Data Science is calling for the “coolest” of the cool kids to come and play.
What does a Data Scientist actually accomplish?
Data applications are ubiquitous; the Internet is founded on the data-driven products that allow us to do everything from building webpages to using search engines, playing online games to using e-commerce platforms, doing online banking, sending emails, chatting, engaging in social media, posting photos or videos, blogging, and reading the news. But, data is also the unrefined gold that corporations depend on; that gold is in their non-relational data stores, SQL databases, CRM systems, spreadsheets, email memos, revenue reports, and an assortment of other structures. The job of a Data Scientist is to sift through all those disparate systems, evaluate the data, analyze it, find innovative ways to use it, and then report their findings to the various stakeholders who need that data to do their jobs better. A few notable Data Science breakthroughs include:
- Gracenote’s CDDB: this database allows users to automatically look up, label, and retrieve the metadata for any music they own and want to burn to a CD. It is used extensively by iTunes.
- Google’s PageRank: The PageRank algorithm developed by Google set the company above other search engines. It revolutionized the way Internet searching occurred and succeeded by collecting information not just from the page itself but from innumerable sources outside of the actual webpage being ranked.
- Amazon searches: Amazon uses a complex system of search correlations to keep track of exactly what you’ve bought and searched for, along with other users’ searches in similar categories, to provide comprehensive recommendations for future purchases.
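To make the PageRank idea above concrete, here is a minimal sketch of the textbook power-iteration form of the algorithm, in which a page's score comes from the pages that link to it rather than from its own content. This is an illustration only, not Google's production system: the `pagerank` function, the `toy_web` link graph, the damping factor of 0.85, and the iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by incoming links (textbook power iteration).

    links: dict mapping each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with rank spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share each round.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web, invented for illustration.
toy_web = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}

ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.4f}")
```

In this toy graph the page that receives the most links ends up ranked highest, while the page that nothing links to ends up lowest, even though the code never looks at what any page actually says.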
The “people you may know” mechanism at LinkedIn was essential to the growth of the company and, according to DJ Patil, allowed LinkedIn to evaluate and analyze their data in small increments, thus keeping down costs and time for development:
“It would have been easy to turn this into a high-ceremony development project that would take thousands of hours of developer time, plus thousands of hours of computing time to do massive correlations across LinkedIn’s membership. But the process worked quite differently: it started out with a relatively small, simple program that looked at members’ profiles and made recommendations accordingly. Asking things like, did you go to Cornell? Then you might like to join the Cornell Alumni group. It then branched out incrementally. In addition to looking at profiles, LinkedIn’s data scientists started looking at events that members attended. Then at books members had in their libraries. The result was a valuable data product that analyzed a huge database – but it was never conceived as such. It started small, and added value iteratively. It was an agile, flexible process that built toward its goal incrementally, rather than tackling a huge mountain of data all at once.”
The Data Scientists at LinkedIn used their Data Science tools to create one of the most successful social media outlets in the world today.
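The incremental approach described in the quote above can be sketched in a few lines. The sketch below is a hypothetical illustration, not LinkedIn's code: the profile fields, the rule functions, and the group names are all invented for the example. It starts with one simple rule based on a member's schools and then adds a second rule based on events, without rewriting the first.

```python
def recommend_groups(profile, rules):
    """Apply each rule in turn and collect suggestions without duplicates."""
    suggestions = []
    for rule in rules:
        for group in rule(profile):
            if group not in suggestions:
                suggestions.append(group)
    return suggestions


def school_rule(profile):
    # First, simplest rule: suggest an alumni group for each school.
    return [f"{school} Alumni" for school in profile.get("schools", [])]


def event_rule(profile):
    # Added later: suggest a group for each event the member attended.
    return [f"{event} Attendees" for event in profile.get("events", [])]


# Hypothetical member profile, invented for illustration.
member = {"schools": ["Cornell"], "events": ["Example Conference"]}

first_version = recommend_groups(member, [school_rule])
second_version = recommend_groups(member, [school_rule, event_rule])
print(first_version)
print(second_version)
```

Because each rule is an independent function, a new data source can be added as another entry in the list, which mirrors the "start small, add value iteratively" process the quote describes.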
Most recently, Nate Silver, the NY Times blogger and now-famous Data Scientist, prognosticated the victory of Barack Obama over Mitt Romney in the 2012 Presidential Election. Mr. Silver also correctly called 49 out of the 50 states in the 2008 election. His statistical successes are a complex amalgamation of Monte Carlo simulations, historical data collection, thousands of sources from state and federal agencies, advanced correlation techniques, methodological consistency, and a reliance on probabilities rather than pundit predictions. His use of creative data collection with precise mathematical calculations allowed him to say, as of the Friday before the election, that Obama had an 80.9% chance of winning. It seems Mr. Silver was correct.
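The Monte Carlo idea mentioned above can be illustrated with a deliberately simplified sketch. This is not Mr. Silver's model: the `simulate_election` function, the electoral vote counts, and the per-state win probabilities are all invented for the example, and the code assumes each state's outcome is independent of the others, which real forecasting models generally do not. The point is only to show how many simulated elections turn state-level probabilities into a single overall chance of winning.

```python
import random


def simulate_election(states, trials=10_000, seed=0):
    """Estimate a candidate's chance of winning by repeated simulation.

    states: list of (electoral_votes, win_probability) pairs for the candidate.
    Assumes state outcomes are independent (a simplification).
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total_votes = sum(votes for votes, _ in states)
    needed = total_votes // 2 + 1  # a strict majority of all votes

    wins = 0
    for _ in range(trials):
        # Flip a weighted coin for each state and add up the votes won.
        won = sum(votes for votes, p in states if rng.random() < p)
        if won >= needed:
            wins += 1
    return wins / trials


# Hypothetical states, invented for illustration:
# (electoral votes, probability the candidate carries the state).
toy_states = [(10, 0.9), (20, 0.5), (15, 0.3), (8, 0.7)]

chance = simulate_election(toy_states)
print(f"Estimated chance of winning: {chance:.1%}")
```

A statement such as "an 80.9% chance of winning" is this kind of number: the share of simulated elections the candidate wins, rather than a prediction of the vote margin.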
The examples listed above are only some of the more famous; thousands of others abound in which Data Science is working for organizations large and small to help them gain knowledge from the data they collect. Retail stores are using Data Science to better collect and analyze seasonal shopping trends; the health care industry is employing advanced data analytics to build comprehensive patient care systems to better track patient histories, evaluate care programs, prescribe drugs, and plan for future possibilities; WHO and other scientific organizations are analyzing cell phone data and malaria infestation maps to better correlate the prevalence and spread of malaria cases throughout Africa; IBM is working with the city of Lyon, France to create a traffic management system that will help the city better manage traffic flow.
There are now innumerable definitions of Data Science around, but one of the best and most all-inclusive was written by Gil Press in his recent Forbes article on Data Science:
“A data scientist is an engineer who employs the scientific method and applies data-discovery tools to find new insights in data. The scientific method—the formulation of a hypothesis, the testing, the careful design of experiments, the verification by others—is something they take from their knowledge of statistics and their training in scientific disciplines. The application (and tweaking) of tools comes from their engineering, or more specifically, computer science and programming background. The best data scientists are product and process innovators and sometimes, developers of new data-discovery tools.”
Mr. Press concludes his article by saying “[T]hat’s the definition of sexy.” Computer science, and now its progeny Data Science, are no longer confined to the world of geekdom, dimly lit server rooms, or mom’s basement. Data Science is now the place to be; it not only takes a person with a multi-disciplinary approach and keen intelligence to become a Data Scientist, it takes an innovative and revolutionary spirit to dive into those vast data streams and come up swimming with new ideas, new procedures, and new ways to affect the all-important bottom line.
It’s pioneering. It’s risqué. It’s recherché. It’s cool.