by Charles Roe
Do you ever wonder what Facebook does with all those likes that people give and receive? Or how Netflix figures out exactly what movies to recommend to you? Or how Google can infer exactly what you are trying to search for as soon as you begin typing something into the search box? What about the ads on LinkedIn that relate directly to your profile, the music lists in iTunes and the scores of other connections that just seem to happen? Those examples are just a few of the multitude of data-related events that are constantly being collected and analyzed. The counting of mouse clicks on an advertisement, bounce rates on webpages, traditional BI analytics, CRM data collection, mobile phone GPS updates, keyword data mining and so many more combine to keep data analysts, data stewards, data architects, data modelers, data center project managers, BI team managers, ETL developers, database engineers and everyone else concerned with the ever-increasing accumulation of data busy, with no lack of employment opportunities.
To go back to the questions at the top, is Facebook counting those likes only to tally them, and is Netflix only recommending movies to you because they are nice? The answer is a not-so-simple no. In 2011, humans created and replicated some 1.8 zettabytes of data, or 1.8 trillion gigabytes; such data is the lifeblood of corporations, hence the oft-quoted statement that “data is the new oil.” All of the data collected into data warehouses, analyzed in traditional BI applications, and accumulated in raw and uncooked formats in the many newly prevalent NoSQL systems is useful. It allows businesses to better understand their customers, their competitors and the trends happening at all times everywhere in the world. However, if those businesses cannot effectively and efficiently analyze all their collected data, it will just sit in the data warehouse, take up space and cost money to store. The newest and hottest job in Data Management belongs to the person who can interpret such data in innovative ways for their employer: the Data Scientist.
What exactly is a Data Scientist? There are now probably as many definitions as job openings, but in some ways that is a positive for anyone wanting to get into such a career: get the requisite skills, sell yourself as an innovator, and doors you didn’t even know existed will open. Some of the biggest names in the Data Management field have already weighed in on this latest employment opportunity:
- “Data scientists are part digital trendspotter and part storyteller stitching various pieces of information together. These are people or teams at organizations that sift through the explosion of data to discover what the data is telling them.” Anjul Bhambhri, Vice President of Big Data Products at IBM.
- “A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data.” Dr. DJ Patil, Data Scientist in Residence at Greylock Partners and former Chief Scientist, Chief Security Officer and Head of Analytics and Data Teams at the LinkedIn Corporation.
- “A data scientist is a rare hybrid, a computer scientist with the programming abilities to build software to scrape, combine, and manage data from a variety of sources and a statistician who knows how to derive insights from the information within. S/he combines the skills to create new prototypes with the creativity and thoroughness to ask and answer the deepest questions about the data and what secrets it holds.” Jake Porway, Data without Borders and New York Times.
- Data scientists are “analytically-minded, statistically and mathematically sophisticated data engineers who can infer insights into business and other complex systems out of large quantities of data.” Steve Hillion, Vice President of Analytics at EMC Greenplum.
According to a 2011 McKinsey Global Institute (MGI) study, “[B]y 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” If we take the definitions of Data Scientist from the experts quoted above and set them beside the “people with deep analytical skills” from the MGI study, we can reasonably infer that a substantial number of the needed jobs are in Data Science and related professions, not just in 2018 but also right now and moving into the future.
What does it take to be a Data Scientist?
- Mathematics: Data Scientists must be competent mathematicians. They must understand computational linear algebra (that is, matrix computations), numerical analysis, matrix analysis and related areas. Most data mining applications use matrix computations as their fundamental algorithms, so a strong understanding of them is essential. While familiarity with traditional tools like MATLAB and with newer distributed computational packages such as Apache Mahout is important, it is vital that a Data Scientist understands the math, can adjust it, can build upon what has already been created, and is innovative enough to create new algorithms.
- Statistical Analysis: A strong knowledge of R, SAS, SciPy, Stata, SPSS and other statistical analysis tools is important (the list is quite large); expert skills in at least two of them, along with the ability to discuss them during an in-depth technical interview, are necessary to get the best jobs. A background in statistical analysis through education or work experience is also necessary.
- Programming/Scripting Languages: A clear understanding of various languages such as C/C++, Java, PHP, Ruby, Perl and Python is a necessity. The Data Scientist job is highly technical, so deep technical expertise is central to obtaining a good position.
- Relational Databases: Know your way around SQL-based systems. Understand primary and foreign keys, indexing, querying, normalization, constraints and other core features of relational databases. A solid skill set in accessing and manipulating data within an RDBMS provides a good foundation for a career as a Data Scientist.
- Distributed Computing Systems and Tools: NoSQL platforms are becoming more prevalent all the time in the Data Management field. Many are open source systems, so it is possible to learn them by studying outside of work. Go into an interview with a solid background in a range of the systems and tools, such as Hadoop/HBase, Cassandra, Hive, Pig and MapReduce. Be able to discuss the differences between graph databases, document stores, key/value stores and BigTable implementations. Understand distributed caching, sharding, scalability and other key terminology in the field today.
- Data Mining: Learn the primary tools used in Data Mining today. Take a Data Mining course if necessary, or better yet get real-world experience at your workplace. Data Scientists must understand how to find patterns in the vast data sets they work with. A strong background in Data Mining, which also ties into machine learning, statistics and familiarity with operational databases, is a central feature of any Data Scientist’s job.
- Data Modeling: While Data Scientists may not sit in their cubes and model data all day long, they do have to be able to understand the models, present them to C-level executives and use them to improve the Data Management systems within a given enterprise. Thus, comprehension of the many modeling tools, techniques and methodologies, such as ERwin, Agile, ORM diagrams, UML class diagrams, CRC cards, conceptual/logical/physical schemas, DDL, Bachman diagrams, the Zachman Framework and others, is valuable. You don’t have to be a data modeler, but the ability to speak their language will go a long way.
- Visualization: This skill harks back to the experts’ quotes. A central aspect of the Data Scientist’s job is telling a story. Data Scientists must be able to take the hard data from within the data warehouse and other storage facilities, scrub it, mine it for the most important and business-focused parts, and present it in visual forms that business users can understand and employ. Therefore, the ability to work with visualization tools such as Flare, HighCharts, AmCharts, D3.js, Processing, Google Visualization API, Raphael.js or any number of other visualization packages is central to the Data Scientist’s skill set. Data Scientists have to provide a data narrative that anyone in the enterprise can follow, understand and use.
- Creativity and Innovation: Having the first seven points is essential to getting a good job, but Data Scientists don’t just sit around and look at data. Data Scientists must be able to innovate in the collection, analysis and usage of data for their enterprise in novel ways so that all that “enterprise-critical” data is put to advantageous use. A Data Scientist must be able to look at Facebook likes, click-through and bounce rate stats, social media comments, traditional BI charts and CRM transaction records, along with traffic spikes due to changing weather conditions, YouTube video releases and a near-endless supply of other possible patterns, events and unexpected phenomena, and see connections that are not usually noticed. Data Scientists must demonstrate actionable innovation and up-to-the-minute creativity; they must love data.
- Communication and Business Perspicacity: Data Scientists are crossbreeds, the amalgamation of IT expertise and business smarts. Such skill sets are rare; they have usually been treated as mutually exclusive, but not anymore. A Data Scientist must be able to spend hours buried in statistical analysis and data mining, more hours with data modelers, and then even more hours creating deft graphical illustrations that explain everything during meeting after meeting with business users from all levels of the enterprise. Stereotypical IT people are not always the best communicators; they prefer their computer screens to real-life communication. Such stereotypes are no longer relevant for Data Scientists, who must be able to sell their grand ideas to everyone on both sides of the enterprise fence.
- Education: An M.S. in Math, Statistics, Computer Science, Engineering or some other related technical field is not 100% necessary, but it will certainly help your cause. A B.S. in one of those fields is a must, and an M.S. shows the ability to work within a system and complete tasks with deadlines, as well as a background in theoretical principles. Add to that M.S. many years of experience in the field, with knowledge and direct experience in all the points listed above, and your resume will make it to the top pile rather than the slush bin.
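To make the Relational Databases item in the list above more concrete, here is a minimal sketch in Python using the standard library's `sqlite3` module. The schema, table names, sample rows and helper function names are all hypothetical and chosen only for illustration; the sketch simply shows a primary key, a foreign key, a couple of constraints, an index and a join query working together.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,   -- primary key
            name TEXT NOT NULL UNIQUE   -- column constraints
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),  -- foreign key
            amount      REAL NOT NULL CHECK (amount > 0)
        );
        -- Index on the foreign key column, which lookups by customer can use.
        CREATE INDEX idx_orders_customer ON orders(customer_id);
        """
    )
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(1, 20.0), (1, 15.0), (2, 10.0)],
    )
    conn.commit()
    return conn


def totals_per_customer(conn):
    """Join the two tables and aggregate order count and total per customer."""
    query = """
        SELECT c.name, COUNT(o.id), SUM(o.amount)
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id, c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


def orphan_order_rejected(conn):
    """Return True if an order for a non-existent customer is refused."""
    try:
        conn.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", (99, 5.0))
    except sqlite3.IntegrityError:
        conn.rollback()
        return True
    conn.rollback()
    return False


if __name__ == "__main__":
    db = build_demo_db()
    for row in totals_per_customer(db):
        print(row)
    print("orphan order rejected:", orphan_order_rejected(db))
```

The same ideas carry over to any SQL-based system, although the exact syntax for constraints and indexes varies between products.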
The Data Scientist is a rather new manifestation in the Data Management industry; the actual meaning of the term, the relevance of the job to enterprises worldwide, the skills and responsibilities necessary for the position and exactly what such a position will mean in the coming years are among the hottest topics in the tech blogosphere. Data Scientists are part alchemist, taking base data and turning it into gold; they are data novelists who use the power of algorithms, visualizations and creativity to tell the story of data to their enterprise; they are amalgamations of one part forensics investigator, one part journalist, one part scientist, one part computer geek, one part salesperson and one part poet, who take the petabytes and exabytes of Big Data and make sense of them; they are the new sexy, and they are needed now and into the future in ever-increasing numbers.