
Is There Such a Thing As Predictive Analytics?

July 16, 2012

by John Biderman

Past performance is no guarantee of future results. – ubiquitous investment disclaimer

Never make predictions, especially about the future. – Casey Stengel

You can file this under “semantic nit-picking.” Or maybe not.  The phrases “predictive analytics” and “predictive models” get thrown around a lot these days, especially in the context of what we are doing with all that Big Data accumulating all over the place.  This got me musing about what exactly the definition of predictive analytics is, and how predictive these analytics really are.

Looking around at some of the big analytical software vendor sites, it is just about impossible to find a definition.  Everyone says they are doing it, or making it possible, but nobody defines what “it” is.  The best characterization I’ve run across is in a Wikipedia article, which also parses the nuanced differences between predictive, descriptive, and decision models.  In a nutshell, predictive analytics are based on predictive models, which assign a probability or score to an individual actor (customer, patient, web surfer, etc.); risk scoring by insurance underwriters is a classic example.  Descriptive models look more at groups or cohorts and describe historical group patterns, while decision models are concerned with defining the data, rules, and logic that go into a specific business decision.  Predictive and descriptive models may be subsumed in a decision model.  (These may be authoritative definitions, but I think in vernacular usage “predictive analytics” is often used as a synonym for descriptive analytics.)
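The distinction is easy to sketch in code.  A predictive model, in the Wikipedia sense, is a function that assigns a score to one individual actor; a descriptive model summarizes a historical cohort.  Here is a minimal illustration in Python – the risk factors, weights, and logistic form are all invented for the example, not any real underwriting model:

```python
import math

# Invented weights for invented risk factors (illustrative only).
WEIGHTS = {"age": 0.03, "prior_claims": 0.8, "smoker": 1.1}
BIAS = -4.0

def risk_score(actor):
    """Predictive model: assign a probability-like score to ONE actor."""
    z = BIAS + sum(WEIGHTS[k] * actor[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))  # logistic squashing into (0, 1)

def cohort_summary(cohort):
    """Descriptive model: describe historical patterns of a GROUP."""
    n = len(cohort)
    return {
        "n": n,
        "smoker_rate": sum(a["smoker"] for a in cohort) / n,
        "mean_claims": sum(a["prior_claims"] for a in cohort) / n,
    }

cohort = [
    {"age": 40, "prior_claims": 0, "smoker": 0},
    {"age": 55, "prior_claims": 2, "smoker": 1},
]
print(risk_score(cohort[1]))   # one individual's score
print(cohort_summary(cohort))  # a group-level description
```

The point of the contrast: `risk_score` says something (hypothetical) about one customer’s future; `cohort_summary` only restates the group’s past.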

All well and good, but the crux of my problem with this is that nearly every field’s track record for making accurate predictions is rather dismal – starting with the “dismal science” itself, economics.  Stock market crashes, recessions, depressions, and soon-to-burst market bubbles have all gone unpredicted.  The stock market has long been used in analyses as a leading economic indicator, but it turns out to be a really awful predictor.  For example, the markets had a big run-up just before the Great Recession began in 2007 – a rather misleading indicator which contributed to failed predictions.  In health care, predictive models that, say, score a person’s likelihood of being admitted to the hospital are still considered pretty good even when they are accurate well under 50 percent of the time.  Oddly enough, the predictive models we all consult most days – the weather forecasts – have a bad reputation among those of us who get caught in unexpected showers or have prepared for the snowstorm that never arrives, but they actually do relatively well, mainly because the models keep getting refined over many, many years of data collection and analysis.

Which leads to my main point: predictive models are really hypotheses based on analysis of the past.  The term “predictive model” is itself a paradox.  It is necessarily retrodictive – it may provide a model for the etiology or pattern of past events had the model existed before they happened, but it is only a hypothesis about the future, and when the hypothesis isn’t borne out the model is refined, again, based on what has already transpired.  Rather than “predictive analytics,” they are more “hopeful analytics” – the hope being that future patterns conform to what was observed in the past.

Alas, this seems to be more the exception than the rule.  Stochastic or unpredictable events have a nasty habit of rendering the future non-conformant with predictions.  If this seems too pedantically philosophical, it was succinctly summed up by another great American philosopher out of the baseball world, Yogi Berra, who said, “The future ain’t what it used to be.”  (For another perspective, read the insightful blogs by Roger Pielke, Jr., starting with “False Positive Science: Why We Can’t Predict the Future” on the Freakonomics® Web site.)

I suppose anyone analyzing the title of this post would have predicted my answer is “no,” else I wouldn’t have posed the question. (There! – a perfect predictive model that says rhetorical questions are raised in order to be knocked down.)  Indeed, I think history tells us there is a certain hubris that attends any term prefaced with the word “predictive.”  So why not call “predictive models” what they really are: hypothetical models?  And call “predictive analytics” simply analytics.  We are analyzing data about the past to create retrodictive models in order to make hypotheses about future events.  We will only know how well we did by later analyzing what actually happened.  As the Nobel laureate economist Milton Friedman wrote in one of his scholarly papers, “The only relevant test of the validity of a hypothesis is comparison of prediction with experience.”

This is more than a semantic quibble.  We are staking a lot on the promise of Big Data to enable us to make predictions and, therefore, better decisions about what to do to affect outcomes.  But if the data are funneled through flawed models, what good is the big quantity?  During an Enterprise Data World panel discussion last April on Big Data, Neil Raden made the salient point that “having lots more data doesn’t necessarily lead to better decisions.”  We are reliant on the models employed in the analytics which are demonstrably imperfect and probably always will be, but can only be made better if the accumulation of data leads to continuous refinement of the models.
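That refine-with-experience loop – predict, observe what actually happened, measure the miss, refit – can be sketched in a few lines.  This is a deliberately naive stand-in (the “model” is just a historical mean), meant only to show the shape of the cycle Friedman’s test implies:

```python
# Naive sketch of the refinement loop: hypothesize from the past,
# compare prediction with experience, then refit on the larger history.

def fit(history):
    """Stand-in 'model': predict the historical mean."""
    return sum(history) / len(history)

def evaluate(prediction, actuals):
    """Friedman's test: compare prediction with experience (mean abs error)."""
    return sum(abs(prediction - a) for a in actuals) / len(actuals)

history = [10.0, 12.0, 11.0]
model = fit(history)               # a hypothesis based on the past
new_observations = [15.0, 16.0]   # the future fails to conform
error = evaluate(model, new_observations)

history += new_observations       # accumulate the new data...
model = fit(history)              # ...and refine the model, after the fact
```

Note that the refinement step is itself retrodictive: the improved model only knows what has already transpired, which is the paradox the post is about.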

Put another way, if you base decisions on models that approach the predictive accuracy of some of the best Big Data-driven models in the world, you would be as accurate as the weather forecasts.  Does that give you the warm-and-fuzzies?  We need some humility, including in the rubrics we employ.

Of course, if Big Data is all about quantity over quality, as I’ve written before, we could take that approach to predictions too.  I’m thinking of the fabulous quip by Linux kernel developer Alan Cox in a 2002 interview: “I figure lots of predictions is best. People will forget the ones I get wrong and marvel over the rest.”

About the author

John Biderman has over 20 years of experience in application development, database modeling, systems integration, and enterprise information architecture. He has consulted to Fortune 500 clients in the US, UK, and Asia. At Harvard Pilgrim Health Care (a New England-based not-for-profit health plan) he works in the areas of data architecture standards and policies, data integration, logical data modeling, enterprise SOA message architecture, metadata capture, data quality interventions, engaging the business in data stewardship processes, and project leadership.

  • John,

    While I fully agree with your main point, one effective way to characterize predictive analytics is “converting future uncertainties into usable probabilities,” because that is really what we do when scoring customers or making forecasts.  The future can never be predicted, but it can be guesstimated.  PA is a rigorous means of doing so.

    I invite your comments on this post: http://www.simafore.com/blog/bid/57259/is-predictive-analytics-a-misnomer

  • John Biderman

    Thanks for your comment. “Usable probabilities” – I like it. And it gets to my point that we need a more accurately descriptive term than “predictive” which promises more than it can deliver.

  • Given the definition of “predictive” here, then nothing can be “predictive” (since no one knows the future, only the past), so we should really just remove “predictive” from the English language!

    I think it is better to treat predictive models like this: they predict future behavior (duh!), but we should realize that “All models are wrong, but some are useful” (Box). The problem with the examples of poor predictive accuracy you give is that while their classification accuracy may be bad, they are still quite useful nevertheless. I’d be ecstatic if I could predict if my stocks were going up or down in the next quarter 50.5% of the time. Successful marketing campaigns often provide a lift of 3 or 4 over a random draw, increasing response rates from say 1% to 3% or 4%. That means the models are wrong 96% of the time, but may be wildly profitable.

    That stated, I fully agree with you that humility is in order for us in the field.  We rarely know or can discover causal relationships, so we have to make do with the data we have.  But we have to communicate to the consumers of our models what the models don’t do (which is uncover why people purchase what they purchase – no, the Target pregnancy models were *not* perfect, like a crystal ball!).
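    The lift arithmetic in this comment works out as follows – the contact cost and revenue-per-response figures below are illustrative assumptions, not from the comment, but they show how a model that is wrong about almost everyone it selects can still pay for itself:

```python
# A model with lift 3 raises response from a 1% base rate to 3% among
# the contacts it selects -- wrong 97% of the time, yet profitable under
# these (assumed) campaign economics.
base_rate = 0.01                 # response rate of a random draw
lift = 3.0
model_rate = base_rate * lift    # 3% respond among model-selected contacts

contacts = 100_000
cost_per_contact = 1.00          # assumed mailing cost
revenue_per_response = 50.00     # assumed value of one response

responders = contacts * model_rate
profit = responders * revenue_per_response - contacts * cost_per_contact
wrong_fraction = 1 - model_rate

print(f"wrong {wrong_fraction:.0%} of the time, profit ${profit:,.0f}")
```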

  • I knew John Biderman when he was a boy. I correctly predicted that he would be a high achiever because he had a first class intelligence, a positive outlook & a good sense of humor.
