Advanced mathematics can feel detached from everyday reality until you realize how practical it is. The models and algorithms that power big data applications are largely mathematical constructs. Tighten the equations that describe a patterned relationship, and you may be able to radically accelerate the computer program built to carry out those computations.
Without the right algorithms, big data is just a blunt-force instrument that may never be able to truly crack the kernel of some analytical problem. In that vein, I took great interest in a recent article by Vincent Granville on the so-called “curse of big data.”
Curse? No, it’s not demonic possession. It refers to an issue with the traditional mathematical techniques used to find statistical patterns in big data. Apparently, the increasing incidence of outliers in larger data sets can create spurious correlations, which might obscure the true patterns in the data or “reveal” patterns that just don’t exist. An outlier is any observation point that is far out of range from the others, due to experimental errors, measurement variability, or other reasons that might dilute the quality of the data set under investigation.
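To make the outlier effect concrete, here is a minimal sketch in Python. The data points are invented for illustration and the `pearson` helper is written out by hand; none of it comes from Granville's article. It shows one extreme observation turning a correlation of zero into one that looks nearly perfect.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, written out by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data: y bounces between +2 and -2 with no linear trend in x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, -2, 2, -2, -2, 2, -2, 2]
r_clean = pearson(xs, ys)

# Append a single far-out observation, e.g. a measurement error.
r_with_outlier = pearson(xs + [100], ys + [100])

print(f"without outlier: {r_clean:.3f}")
print(f"with outlier:    {r_with_outlier:.3f}")
```

The first eight points are uncorrelated by construction, yet the one added point pulls the coefficient above 0.99. The more observations a data set holds, the more chances there are for a point like that to slip in.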
Applying brute-force machine-driven analysis to outlier-studded data sets won’t address the underlying issue with the math. The algorithms that crunch this data originate in the minds of mathematicians, after all. From a statistical analyst’s point of view, getting data sets under control requires all the outlier-analysis tools in their professional kitbags. Outlier exclusion has long been a key best practice, regardless of the size of the data set that the statistical analyst is working with.
I’m not a mathematician, so I can’t evaluate the new approach that won the “curse of big data” contest. But the following statement highlights the fact that blindly throwing more compute power at big data problems can be counterproductive (as well as costly):
“Instead of running heavy computations, they used mathematical thinking and leveraged their expertise in mathematical optimization as well as in permutation theory and combinatorics. And they succeeded. This proves that sometimes, mathematical modeling can beat even the most powerful system of clustered computers.”
Another interesting aspect of the winning approach (from an IBM-er, no less) is that it involved revisiting mathematical problems that have resisted solution for a few centuries. Out of curiosity, smart people have been grappling with abstract mathematical problems since the dawn of civilization, never realizing that their “ivory tower” efforts might someday have practical applications. I’ll just excerpt the passage that highlights this fact:
“The new metrics designed by Granville for big data are indeed very old. They precede the R squared by at least 50 years, and were first investigated by statisticians in the 18-th century. But the L-1 version (the oldest framework) was quickly abandoned because it was thought to be mathematically intractable, at least for 18th century mathematicians….The conjecture proved by Puget shows that the L-1 model leads to exact, tractable mathematical formulas. It reverses a long established trend among statisticians.”
Now, for those of you who want to go deeper into the math, here’s how the post described it:
“The problem consisted in finding an exact formula for a new type of correlation and goodness-of-fit metrics, designed specifically for big data, generalizing the Spearman’s rank coefficient, and being especially robust for non-bounded, ordinal data found in large data sets. From a mathematical point of view, the new metric is based on L-1 rather than L-2 theory: In other words, it relies on absolute rather than squared differences. Using squares (or higher powers) is what makes traditional metrics such as R squared notoriously sensitive to outliers, and avoided by savvy statistical modelers.”
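The absolute-versus-squared distinction in that passage can be sketched with a toy example. This is not the metric from the contest; it only compares two generic ways of scoring how far one ranking sits from another, and the rankings and function names are made up for illustration.

```python
def l1_distance(ranks_a, ranks_b):
    """Sum of absolute rank differences (an L-1 style score)."""
    return sum(abs(a - b) for a, b in zip(ranks_a, ranks_b))

def l2_distance(ranks_a, ranks_b):
    """Sum of squared rank differences (an L-2 style score)."""
    return sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))

# Hypothetical rankings of ten items.
reference = list(range(1, 11))

# Every item is off by one position: many small disagreements.
many_small = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

# Only the first and last items trade places: one extreme disagreement.
one_big = [10, 2, 3, 4, 5, 6, 7, 8, 9, 1]

l1_small = l1_distance(reference, many_small)  # 10
l2_small = l2_distance(reference, many_small)  # 10
l1_big = l1_distance(reference, one_big)       # 18
l2_big = l2_distance(reference, one_big)       # 162

print("L-1 ratio, one big vs many small:", l1_big / l1_small)
print("L-2 ratio, one big vs many small:", l2_big / l2_small)
```

Under the absolute score, the single extreme swap counts 1.8 times as much as the ten small ones. Under the squared score it counts 16.2 times as much. That amplification of large deviations is the sensitivity to outliers that the quoted passage attributes to squared metrics.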
Data science relies on these sorts of mathematical insights. And the sciences in general rely on data science to furnish the statistical toolset needed to confirm patterns in empirical data sets.
No, none of these equations are particularly sexy. But that’s beside the point. Getting our heads around big data would be impossible without them, just as controlling spacecraft would have been unthinkable if Newton hadn’t laid down those fundamental gravity equations so long ago.