Past performance is no guarantee of future results. – ubiquitous investment disclaimer
Never make predictions, especially about the future. – Casey Stengel
You can file this under “semantic nit-picking.” Or maybe not. The phrases “predictive analytics” and “predictive models” get thrown around a lot these days, especially in the context of what we are doing with all that Big Data accumulating all over the place. This got me musing about what, exactly, the definition of predictive analytics is – and how predictive these analytics really are.
Looking around at some of the big analytical software vendor sites, it is just about impossible to find a definition. Everyone says they are doing it, or making it possible, but nobody defines what “it” is. The best characterization I’ve run across is in a Wikipedia article, which also parses the nuanced differences between predictive, descriptive, and decision models. In a nutshell, predictive analytics are based on predictive models which assign a probability or score to an individual actor (customer, patient, web surfer, etc.); risk scoring by insurance underwriters is a classic example. Descriptive models look more at groups or cohorts and describe historical group patterns, while decision models are concerned with defining the data, rules, and logic that go into a specific business decision. Predictive and descriptive models may be subsumed in a decision model. (These may be authoritative definitions, but I think in vernacular usage “predictive analytics” is often used as a synonym for descriptive analytics.)
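To make the Wikipedia-style definition concrete: a predictive model in this sense is just a function that maps an individual’s attributes to a score or probability. Here is a toy sketch – the features (age, prior admissions) and the weights are entirely invented for illustration, not drawn from any real underwriting or clinical model:

```python
import math

def risk_score(age, prior_admissions):
    """Toy scoring model: assigns one individual a probability-like score.
    The weights below are made up for illustration only."""
    z = -4.0 + 0.03 * age + 0.8 * prior_admissions  # invented linear score
    return 1.0 / (1.0 + math.exp(-z))               # squash to (0, 1)

# Score one hypothetical patient
print(round(risk_score(70, 2), 3))  # → 0.426
```

The point is only structural: the model scores *individuals*, which is what distinguishes it from a descriptive model summarizing a cohort.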
All well and good, but the crux of my problem with this is that the track record of nearly every field at making accurate predictions is rather dismal – starting with the “dismal science” itself, economics. Stock market crashes, recessions, depressions, and soon-to-burst market bubbles have all gone unpredicted. The stock market has long been used in analyses as a leading economic indicator, but it turns out to be a really awful predictor. For example, the markets had a big run-up just before the great recession began in 2007 – a rather misleading indicator which contributed to failed predictions. In health care, predictive models that, say, score a person’s likelihood of being admitted to the hospital are still considered pretty good even when they are accurate well less than 50 percent of the time. Oddly enough, the predictive models most of us consult daily – the weather forecasts – have a bad reputation among those of us who get caught in unexpected showers or have prepared for the snowstorm that never arrives, but they actually do relatively well, mainly because the models keep getting refined over many, many years of data collection and analysis.
Which leads to my main point: predictive models are really hypotheses based on analysis of the past. The term “predictive model” is itself a paradox. It is necessarily retrodictive – it may provide a model for the etiology or pattern of past events had the model existed before they happened, but it is only a hypothesis about the future, and when the hypothesis isn’t borne out the model is refined, again, based on what has already transpired. Rather than “predictive analytics,” they are more “hopeful analytics” – the hope being that future patterns conform to what was observed in the past.
Alas, this seems to be more the exception than the rule. Stochastic or unpredictable events have a nasty habit of rendering the future non-conformant with predictions. If this seems too pedantically philosophical, it was succinctly summed up by another great American philosopher out of the baseball world, Yogi Berra, who said, “The future ain’t what it used to be.” (For another perspective, read the insightful blogs by Roger Pielke, Jr., starting with “False Positive Science: Why We Can’t Predict the Future” on the Freakonomics® Web site.)
I suppose anyone analyzing the title of this post would have predicted my answer is “no,” else I wouldn’t have posed the question. (There! – a perfect predictive model that says rhetorical questions are raised in order to be knocked down.) Indeed, I think history tells us there is a certain hubris that attends any term prefaced with the word “predictive.” So why not call “predictive models” what they really are: hypothetical models? And call “predictive analytics” simply analytics. We are analyzing data about the past to create retrodictive models in order to make hypotheses about future events. We will only know how well we did by later analyzing what actually happened. As the Nobel laureate economist Milton Friedman wrote in one of his scholarly papers, “The only relevant test of the validity of a hypothesis is comparison of prediction with experience.”
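Friedman’s test – comparing prediction with experience – is the only grading scheme available, and it can only be applied after the fact. A minimal sketch, with invented data standing in for a model’s calls and the outcomes experience later delivered:

```python
# Hypothetical predictions (e.g., admitted / not admitted) vs. what
# actually happened. Both lists are invented for illustration.
predicted = [1, 0, 1, 1, 0, 1]
observed  = [1, 0, 0, 1, 0, 0]

hits = sum(p == o for p, o in zip(predicted, observed))
accuracy = hits / len(observed)
print(f"accuracy: {accuracy:.0%}")  # → accuracy: 67%
```

Note that nothing in this calculation could have been done before the observed column existed – which is the whole point: the model is graded retrodictively.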
This is more than a semantic quibble. We are staking a lot on the promise of Big Data to enable us to make predictions and, therefore, better decisions about what to do to affect outcomes. But if the data are funneled through flawed models, what good is the big quantity? During an Enterprise Data World panel discussion last April on Big Data, Neil Raden made the salient point that “having lots more data doesn’t necessarily lead to better decisions.” We are reliant on the models employed in the analytics, which are demonstrably imperfect and probably always will be; they can only be made better if the accumulation of data leads to continuous refinement of the models.
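The “continuous refinement” point can be sketched in miniature: even the simplest model parameter – here just a running base rate, fed an invented stream of observed outcomes – is nothing more than an ever-updated summary of the past:

```python
# Invented stream of observed binary outcomes arriving over time
outcomes = [0, 1, 0, 0, 1, 1, 0]

rate, n = 0.0, 0
for y in outcomes:
    n += 1
    rate += (y - rate) / n  # incremental running-mean update

print(round(rate, 3))  # → 0.429
```

Each new observation nudges the estimate; the model is always a digest of what has already happened, never a guarantee about what comes next.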
Put another way, if you base decisions on models that approach the predictive accuracy of some of the best Big Data-driven models in the world, you would be as accurate as the weather forecasts. Does that give you the warm-and-fuzzies? We need some humility, including in the rubrics we employ.
Of course, if Big Data is all about quantity over quality, as I’ve written before, we could take that approach to predictions too. I’m thinking of the fabulous quip by Linux kernel developer Alan Cox in a 2002 interview: “I figure lots of predictions is best. People will forget the ones I get wrong and marvel over the rest.”