Methodically Reproducible Data in 2018

Click to learn more about author Steve Miller.

I generally establish a theme to my involvement early on when I attend Strata Data. Missing only last year, I’ve been witness to Strata’s breadth of topics. The last few years have included deep dives into the then-new Spark platform and iPython notebook (now Jupyter notebook/JupyterLab) as well as the makings of Data Scientists. Actually, the latter is a recurring topic and changes every year with the maturing discipline.

This year, my theme was methodology. A first intention was to extend Drew Conway’s iconic Data Science Venn diagram, adding an emphasis on scientific research design. Data Science is science, after all, and, as such, should be viewed through the lens of the theory, hypotheses, measurement, and testing that scientists deploy in their work. For me, the validity of Data Science measurement is critical and generally troublesome in DS work: both internal validity, which assesses whether the analyst is measuring what she purports to measure, and external validity, which has to do with the actual population to which the measurement generalizes.

No less a Data Science powerhouse than UC Berkeley recognizes the importance of research design in its DS curriculum. Indeed, having hired several newly-minted PhDs who learned analytics and computation writing their dissertations, I feel those coming to Data Science from a strong research background that includes the authorship of papers, theses, or dissertations, have a leg up. It’s easier to train scientists in technology than it is to teach technologists scientific methods.

In addition to the methodical science angle of DS, there’s also a methodical computation focus, which was a point of emphasis this year for the evolving DS portfolio. Several speakers mentioned the now-blurring distinction between data scientists and developers on DS teams, noting the heightened computational expectations for Data Scientists. It seems that at the same time DS is expanding, thus allowing for more division of labor, there’s a rising software engineering bar as well.

Everybody Lies author and NY Times writer Seth Stephens-Davidowitz took on the internal validity measurement problem head-on in his entertaining keynote. Stephens-Davidowitz offers evidence that asking people direct questions in live surveys invites invalid responses, as respondents are inclined to answer so as either to not embarrass themselves or to please the surveyors. An analysis of interviews and web behavior for recent POTUS elections suggests that answers to point-blank questions about preferences are less indicative than race-related web searches.

UCLA computer science colleagues Miryung Kim and Muhammad Gulzar presented the analysis/findings of their surveys of 793 Data Scientists at Microsoft, where they examined work and educational background, job topics, tools of expertise, and activities. Using k-means clustering, K&G identified nine categories/roles of Data Scientists on Microsoft teams, including data shaper, platform builder, data evangelist, and do it all polymath. They highlighted challenges such as poor Data Quality and heterogeneous technology toolkits.

Kim and Gulzar also addressed methodical computation, proffering peer reviews, cross validation, and dogfood simulation as means to assure software quality. In addition, they presented debugging software to elevate the software engineering discipline of Data Science.

What we didn’t have time to resolve in this fast-paced talk was my question on the external validity of inferring from a sample of Microsoft Data Scientists only. Surely the DS mix at a mature company like Microsoft is different from that at startup data companies. How might a sample more “representative” of the DS population have impacted overall findings?

Pinterest’s Frances Haugen is committed to building a heterogeneous data organization that combines the best expertise from both business units and Data Scientists. Acknowledging that data work is hard, Haugen, who has the heterogeneous academic bona fides of an engineering degree plus a Harvard MBA, led the creation of a company-wide Data Science curriculum using a “reverse classroom” philosophy to empower those with backgrounds as diverse as product operations and sales to be effective users of analytic insights. This training, on the heels of a detailed roadmap to make Pinterest more data-centric, proved decisive.

Haugen rejects “spreadsheet-zilla” as opaque with poorly understood built-in assumptions. Like me, she prefers power and flexibility to ease of use in her company’s analytics tools. And she’s a stickler for methodical computation, emphasizing the power of reproducible quantitative analyses that derive from deploying Jupyter notebooks.

Clare Gollnick’s talk on Limits of Inference was perhaps the most provocative of all. Her point of departure was the appalling lack of reproducibility, the inability to replicate findings of studies, in scientific research. Many scientists now believe the cause of this problem is misapplication of traditional inferential statistics, designed for small problems, when used on a larger scale than intended.

The metaphor of an infinite number of monkeys banging on typewriters is often offered as illustrative: eventually, one of the monkeys will author “Hamlet”. Gollnick correctly identifies over-searching, “torturing data till it confesses”, p-hacking, overfitting, et al. as culprits, adopting a philosophy of science position that searching breaks evidence: that evidence is soundest when searching is limited and targeted. Alas, only those with evidence publish.
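
To make the over-searching point concrete, here is a minimal sketch of my own (not from Gollnick's talk). It generates an outcome of pure noise, then searches many equally random candidate features for the strongest correlation. The function names, sample size, and feature counts are illustrative assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def best_noise_correlation(n_features, n_samples, seed=0):
    """Largest |r| found when searching n_features pure-noise features.

    Both the outcome and every feature are independent Gaussian noise,
    so any correlation that turns up is an artifact of searching.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


# One targeted test versus a broad search over 1,000 candidates.
single = best_noise_correlation(n_features=1, n_samples=30)
searched = best_noise_correlation(n_features=1000, n_samples=30)
print(f"one feature tested:    |r| = {single:.2f}")
print(f"best of 1000 features: |r| = {searched:.2f}")
```

With 30 observations, an |r| of roughly 0.36 is what a conventional two-sided 5% significance test calls significant. Searching a thousand noise features will almost always turn up a "finding" well beyond that, even though there is nothing to find.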

I couldn’t agree more with Gollnick. I’ve pretty much stopped using p-values, significance tests, and confidence intervals except when analyzing targeted/limited experiments/randomization. I also share her view that splitting data into train/validate/test and using cross-validation model fitting techniques mitigates the risks of evidence by searching. Cautious search for evidence is best. Hopefully, we can learn from our current failings and adopt an effective large scale statistical inference methodology.
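
The mitigation can be sketched in the same spirit. The example below is my own illustration under assumed names and sizes, and it uses a single holdout split, the simplest form of train/test, rather than full cross-validation. It searches for the best noise feature on the training rows only, then scores that one chosen feature on the rows that were held out.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def search_then_validate(n_features=5000, n_train=40, n_test=40, seed=1):
    """Pick the best noise feature on training rows, score it on test rows.

    Returns (train_r, test_r) for the single feature that looked best
    during the search. All data are independent Gaussian noise.
    """
    rng = random.Random(seed)
    n = n_train + n_test
    outcome = [rng.gauss(0, 1) for _ in range(n)]
    best_train_r, best_feature = 0.0, None
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n)]
        # The search sees only the training rows.
        r = pearson(feature[:n_train], outcome[:n_train])
        if abs(r) > abs(best_train_r):
            best_train_r, best_feature = r, feature
    # The held-out rows are touched once, after the search is finished.
    test_r = pearson(best_feature[n_train:], outcome[n_train:])
    return best_train_r, test_r


train_r, test_r = search_then_validate()
print(f"searched (train) r = {train_r:+.2f}")
print(f"held-out (test)  r = {test_r:+.2f}")
```

The training correlation is inflated by the search. The held-out correlation is an honest single look, so it should typically fall back toward zero. The design choice that matters is that the test rows play no part in choosing the feature.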

Next Spring, Strata moves from Silicon Valley to San Francisco. Much as I love the convenience of conferencing in San Jose, I welcome the change!
