Simplistic and Credulous Data Science: Part One

Click to learn more about author Steve Miller.

A few months ago, I was solicited by one of my many LinkedIn Data Science groups to participate in a survey. I’m generally a bit suspicious of surveys and review the questions carefully before deciding to respond. This initiative, though, posed a question that quite intrigued me. Simply put, the item of interest asked respondents to note one or more developments in the DS discipline/profession that bothered them. The request was particularly timely, as I’d been seething over recent reads in the analytics media.

My current beefs revolve around two concerns: 1) the seemingly still unresolved definition of Data Science among some practitioners; and 2) the absence of methodological rigor in much of what is proffered as exemplary Data Science. Though 2) is more disconcerting than 1), both smack of over-simplicity, or what distinguished statistician Brad Efron called “credulousness” in an interview years ago. In the remainder of Part 1, I address the definition of Data Science. Next month, I’ll take on DS methodological rigor.

When the Data Science discipline began to gel ten years ago, there was a hue and cry from many practicing statisticians that DS was just a gussied-up, over-hyped rebranding of the same work they’d been doing for years. And for some practitioners, like myself, there was some truth in that assessment, since statistical modeling was but a small part of our work, which was mostly consumed with Data Management and computation. The old bromide that statisticians spent 20% of their time doing statistics and 80% of their time doing data rang true for many. Alas, even statisticians whose functions were narrowly confined to modeling piled on. No less a luminary than data journalist Nate Silver agreed.

Over time, the distinction between pure statistics and Data Science began to clarify. Columbia statistician Andrew Gelman rejected the Data Science = statistics view, opining instead that “Statistics is the least important part of Data Science…There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of Data Science. . . .The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of Data Science is more of an option….To put it another way: you can do tech without statistics, but you can’t do it without coding and databases.”

Stanford statistician David Donoho agreed with Gelman in his outstanding Data Science treatise. For Donoho, Data Science consists of the following six sub-fields:

1. Data Exploration and Preparation

2. Data Representation and Transformation

3. Computing with Data

4. Data Modeling

5. Data Visualization and Presentation

6. Science about Data Science

Only 1), 4), and 5) were included in my graduate statistical training. In addition, today’s statistics curricula often include 3), while graduate analytics programs cover 2) as well – though the topics generally aren’t presented in as much detail.

The core courses of UC Berkeley’s laudable Master of Information and Data Science curriculum offer broad coverage of Donoho’s Data Science:

1. Python for Data Science

2. Research Design and Application for Data and Analysis

3. Fundamentals of Data Engineering

4. Applied Machine Learning

5. Statistics for Data Science

I particularly like the focus on research methods and causal analysis, which embellish Donoho’s 4) and 6). More and more programs are adding similar methodological emphases.

With a cue from this program, my take is that the content of Data Science can be represented with the mnemonic ABCCDE, which, in time-sequence order, is: Business, Data, Causal, Computation, Exploration, and Algorithms. Noteworthy for me is that the first four of these foci were not much emphasized in my graduate training.

So what’s my beef? Though DS may be less confused with statistics these days, it is instead often equated with machine learning (ML), deep learning (DL), and even AI. But just as Data Science has a broader purview than statistics, so too is it broader than the modeling emphases of ML, DL, and AI. Doubters need only check DS job postings.

Part Two next month.
