
Data Science and Data Analysis


Click to learn more about author Steve Miller.

I met up with a grad school friend of 40 years the other day. While he earned a doctorate and became an academic luminary, I departed the program with a master's in statistical science and went on first to the not-for-profit world and then to the business world. Both of us recently retired from full-time employment and now satisfy our work cravings with contract consulting.

My friend recently read the seminal paper "50 Years of Data Science" by Stanford professor David Donoho and wanted my take on how this interpretation of DS history mapped to my career in data and analytics. "Thanks for asking," I responded. It turns out quite well.

The CliffsNotes version of Donoho's thesis is that development of the current field of Data Science has been in the works for a long time, born of frustration with the narrow purview of academic statistics in the 1960s. "More than 50 years ago, John Tukey called for a reformation of academic statistics. In 'The Future of Data Analysis', he pointed to the existence of an as-yet unrecognized science, whose subject of interest was learning from data, or 'Data Analysis'. Ten to twenty years ago, John Chambers, Bill Cleveland and Leo Breiman independently once again urged academic statistics to expand its boundaries beyond the classical domain of theoretical statistics; Chambers called for more emphasis on data preparation and presentation rather than statistical modeling; and Breiman called for emphasis on prediction rather than inference. Cleveland even suggested the catchy name 'Data Science' for his envisioned field." In short, Data Science advances statistics from its mathematical roots to more balanced mathematical, data, and computational foci. I'd encourage readers to invest the hour or so it takes to consume this important article.

I sensed the beginnings of this divide in my grad school years, recognizing both a concern by some professors with the "over-mathematization" of statistical science, as well as the emergence of significant computational progress that lifted every analytic boat. In 1979, just about all my computer work was on mainframes with FORTRAN and PL/I; by 1982 most was on minicomputers with Unix/C/Ingres and PCs with MS-DOS. SAS, originally written for IBM mainframes and the statistical software of choice at the time, was ported to minicomputers and PCs in the early 1980s. At that time as well, resampling techniques like the bootstrap, fueled by computation, were starting to come of age in the statistical world.
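For readers who haven't met it, the bootstrap idea is simple: resample the observed data with replacement many times, recompute the statistic each time, and read uncertainty off the resulting distribution. A minimal sketch in modern Python might look something like this (the patient sample is invented for illustration, and NumPy stands in for the FORTRAN and SAS code of that era):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample: lengths of stay (days) for 50 patients
sample = rng.gamma(shape=2.0, scale=3.0, size=50)

# Bootstrap: resample with replacement many times and recompute the statistic
n_boot = 10_000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# Percentile 95% confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {sample.mean():.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```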

I well remember one of my first assignments as an internal hospital consultant to forecast the prevalence of cerebrovascular disease in the hospital network. A piece of cake assembling the data and applying regression/time series techniques – just the kind of work I’d done as a research assistant in grad school. Life was good.

Not so fast, though. Next up was designing and implementing a perinatal registry that accumulated 500 attributes across more than 10,000 birth records per year. The challenges were foremost ones of data management and computation – assembling, wrangling, cleaning, reporting on, and managing the data were my jobs. So I developed database and programming expertise by necessity, becoming in time a capable data programmer. Alas, the statistical work was far downstream from implementation of the then-new relational database system to manage the data.
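To give a flavor of that wrangling work in today's terms, here is a tiny, hypothetical sketch in Python with pandas. The registry fields and values are invented, and pandas plays the role that the relational database and hand-written programs played back then:

```python
import pandas as pd

# Hypothetical extract of perinatal registry records (a few of the ~500 attributes)
records = pd.DataFrame({
    "record_id":      [101, 102, 102, 103, 104],
    "birth_weight_g": [3200, 2850, 2850, None, 4100],
    "gest_age_wk":    [39, 36, 36, 41, 40],
    "apgar_5min":     [9, 7, 7, 10, 8],
})

# Typical wrangling steps: de-duplicate and derive new fields
clean = (
    records
    .drop_duplicates(subset="record_id")
    .assign(low_birth_weight=lambda d: d["birth_weight_g"] < 2500)
)

# Simple downstream reporting: record counts by gestational-age group
groups = pd.cut(clean["gest_age_wk"], bins=[0, 37, 45], labels=["preterm", "term"])
print(clean.groupby(groups, observed=True)["record_id"].count())
```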

Those evolving Data Management and wrangling skills drove my business consulting work from 1985-2005, with the initiatives at first called decision support and then ultimately Data Warehousing/Business Intelligence (DW/BI). Data was pre-eminent, followed by the computational processes of munging, cleaning, and managing. More often than not, BI tools like BusinessObjects and Cognos were superimposed on a data repository implemented with database software such as Oracle or Microsoft SQL Server. SAS software connected to the DW for statistical analysis. Occasionally, full-blown analytic apps were delivered.

The ascendance of open source changed the analytics landscape fifteen years ago, with databases like PostgreSQL and MySQL, agile languages such as Python and Ruby, and the R statistical computing platform encouraging an even greater commitment to analytics and facilitating the emergence of companies whose products were data and analytics. Add proprietary self-service analytics/visualization tools like Tableau to the Data Analysis mix as well. Since then, I've done much more of both data exploration and statistical analysis than in the early years. In many cases, exploratory data analysis (EDA) suffices. When statistical modeling is called for, however, the emphasis is much more on pure prediction/forecasting than on the inference-generating models of classical statistics – another contrast noted by Donoho.
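To make that prediction-versus-inference contrast concrete, here is a minimal, hypothetical sketch in Python: rather than scrutinizing coefficient p-values, the model is judged by its accuracy on held-out data. The data are synthetic and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship with three predictors
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=1.0, size=500)

# Prediction-first emphasis: hold out data and judge the model by out-of-sample
# accuracy, not by significance tests on its coefficients
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("held-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```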

When I size up my career against Donoho's Six Divisions of Greater Data Science, I feel I've worked fairly intimately with the first five: (1) Data Exploration and Preparation, (2) Data Representation and Transformation, (3) Computing with Data, (4) Data Modeling, and (5) Data Visualization and Presentation. Only (6) Science about Data Science, a much more academic pursuit now growing exponentially, has been unaddressed.

My professor friend absorbed my chronology, opining that his career was primarily about deep dives into (4) Data Modeling and (6) Science about Data Science. While we both expressed overall satisfaction with our careers, we acknowledged a bit of melancholy for not having had extensive opportunities to touch all six. Perhaps today's data scientists will be presented with challenges in each.
