By Steve Miller.
This is the Second Part of a two-part series. Read Part One here.
About ten years ago, a software vendor with whom my intelligence consulting company partnered conducted a survey on the usage of the analytics platforms then prominent in the space. As a proselyte of both R and Python, I was keenly interested in the results for statistical analysis/machine learning. It turned out that, with over 500 respondents at the survey’s close, both R and Python were indeed highly rated — but so was the vendor’s own statistical/ML offering, pretty much an unknown at the time.
A little investigation revealed that access to the survey was almost exclusively through the vendor’s website. And who would be visiting that site? Mainly customers and prospects already interested in, or knowledgeable about, the vendor’s analytics components. There was little chance the sample was representative of the population of all customers and prospects; it was instead heavily biased toward those predisposed to the vendor’s solutions. The “findings” were therefore rejected.
Fast forward to a survey on Data Science compensation posted on LinkedIn earlier this year. A seemingly impressive sample size of over 700 revealed that DS salaries had continued their quarterly growth in 2018 over previous years, with one aberrant declining data point. No detail, though, was provided on the construction of the sample, and no auxiliary information such as respondents’ experience or residence was included. Though certainly not a random sample, could it still have been representative of the DS population and hence unbiased? Perhaps, but impossible to tell. In the absence of that supplemental information, I had to assume the skeptic’s posture and reject those findings as well.
It seems more than a little ironic that in today’s data age, we should be constantly questioning the validity of our statistical results, often against the most basic challenge of whether the sample represents the population it purports to. After all, doesn’t “big” data imply representativeness and unbiasedness? The answer is, alas, no.
An interesting paper by Harvard statistician Xiao-Li Meng sheds light on the quandary of sample size versus random selection. “It clearly would be foolish to ignore such big datasets because they are not probabilistic or representative. But in order to use them, we minimally need to know how much they can help or whether they can actually do more harm than help.” As Meng illustrates, it can easily be the case that a 5% random sample is superior to an 80% non-random one. ‘The qualitative answer clearly is “it depends”, on how non-random the larger sample is. We would imagine that a small departure from being random should not overwhelm the large gain in sample size. But how small must it be? And indeed, how to quantify “better” or being “non-random”?’
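Meng’s point is easy to demonstrate with a toy simulation. A minimal sketch — the population, the response mechanism, and every number here are hypothetical illustrations, not Meng’s: when the probability of responding is correlated with the quantity being measured, an 80% sample can miss the true mean by far more than a 5% simple random sample.

```python
import random
import statistics

random.seed(42)

# Hypothetical population of 100,000 values (think salaries, in $000s)
N = 100_000
population = [random.gauss(100, 20) for _ in range(N)]
true_mean = statistics.mean(population)

# 5% simple random sample: every unit equally likely to respond
srs = random.sample(population, k=N // 20)

# ~80% "big data" sample with a non-random response mechanism:
# higher earners are assumed more likely to answer a salary survey
biased = [x for x in population
          if random.random() < (0.95 if x > 100 else 0.65)]

print(f"true mean:           {true_mean:.2f}")
print(f"5% random sample:    {statistics.mean(srs):.2f}  (n={len(srs):,})")
print(f"~80% biased sample:  {statistics.mean(biased):.2f}  (n={len(biased):,})")
```

On a typical run the biased sample overstates the mean by a few units, while the random sample lands within fractions of a unit of the truth — a sixteenfold size advantage that buys nothing.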
Blithely accepting the results of methodologically deficient surveys is an illustration of the DS credulousness discussed in the last blog. At the same time, I in no way self-identify as a statistical bigot, unwilling even to consider the validity of a survey whose representativeness I cannot assess through mathematical calculation. In fact, I see one of the big differences between statisticians and data scientists like myself to be that the latter are generally more practical and less doctrinaire than the former. Whatever works.
What inspires my confidence in the subject selection of modern non-probability surveys is a careful delineation of how the sample represents the population of interest on potentially confounding attributes such as geography, education, and experience. The closer those variables match the population, the more the data scientist can trust that the sample represents it. Indeed, systematic matching of confounders is at the heart of causal analysis, which purports to determine treatment effects in the absence of randomization.
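Such a delineation can start as simply as tabulating sample shares against known population benchmarks. A minimal sketch, with entirely hypothetical region shares and a made-up sample of 700 respondents:

```python
from collections import Counter

# Hypothetical benchmark: known population shares for one confounding
# attribute (region), versus the shares observed in a survey sample
population_shares = {"NA": 0.40, "EU": 0.30, "APAC": 0.20, "Other": 0.10}
sample = ["NA"] * 380 + ["EU"] * 160 + ["APAC"] * 110 + ["Other"] * 50

n = len(sample)
counts = Counter(sample)
print(f"{'region':<8}{'sample':>8}{'population':>12}{'gap':>8}")
for region, pop_share in population_shares.items():
    samp_share = counts[region] / n
    print(f"{region:<8}{samp_share:>8.2f}{pop_share:>12.2f}"
          f"{samp_share - pop_share:>+8.2f}")

# One crude overall summary: total variation distance between the
# sample and population distributions (0 = perfect match)
tvd = 0.5 * sum(abs(counts[r] / n - p) for r, p in population_shares.items())
print(f"total variation distance: {tvd:.3f}")
```

A table like this — one per confounding attribute — is exactly the auxiliary information the LinkedIn salary survey lacked; large gaps flag attributes on which the sample would need reweighting or matching before its findings could be trusted.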
Many do it already, but I’d like to see all data surveys include an analysis of representativeness that speaks to the combination of random selection and sample matching on key attributes of the population. Then, ever the skeptic, I could make an informed assessment of the validity of the findings and accept or reject them accordingly.