Advanced algorithms power the engines of statistical analysis, but they’re not invisible ghosts in the machine. Instead, data scientists compose and tweak algorithms visually, writing the underlying code and then exploring the resultant visuals to see how well they distill the correlations, classifications and other patterns in the data.
Data scientists rely on advanced visualizations to work more productively with advanced algorithms. When developing their models, statistical analysts look for the specific visualizations that are most effective at elucidating the sought-after patterns. For example, statistical analysts use the visualizations generated by classification algorithms, such as decision trees (DT), to determine whether they sort the observations into categories that make some empirical sense. If the visualizations are difficult to square with a domain expert's grasp of the problem domain, then the algorithms, the data, and/or the expert's knowledge need to be adjusted.
Just as there are many ways to skin the proverbial cat, there are many machine-learning algorithms that can be applied – to varying degrees of effectiveness – to any particular classification task. What I found fascinating about this blog was how the author, data scientist Takashi J. Ozaki, used comparative visualizations to assess the effectiveness of several classification algorithms against a common data set.
Claiming to be “not a serious expert in machine learning and its scientific basis,” the author touted the learning-curve boost offered by algorithmic output visualizations: “For such people [as himself], explaining meanings of algorithms or theorems is not helpful for understanding how they work – instead, visualized feature…will help us, I believe.”
What he displays are the visualizations of the “decision boundaries” from the most common classification algorithms used in supervised learning: DT, logistic regression (LR), support vector machine (SVM), neural networks (NN), and random forest (RF). The most interesting aspect of his discussion is how, simply based on what his eyes reveal, he flips between numerical and impressionistic assessments of algorithm fitness. For example:
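Ozaki's comparison can be reproduced in outline with a few lines of Python. This is a hedged sketch, not his original code (his post works in R, and the data set and hyperparameters below are stand-ins): train the same five classifier families on one 2-D data set, then predict over a dense grid of points – the class assigned to each grid cell is exactly what gets rendered as a "decision boundary" plot.

```python
# Sketch: fit DT, LR, SVM, NN, and RF on a shared 2-D data set and
# compute each model's class predictions over a dense grid. Plotting
# those grids (e.g. with matplotlib's contourf) draws the decision
# boundaries Ozaki compares. Data set and settings are illustrative.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# A small synthetic 2-D classification problem (two interleaved arcs).
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(),
    "SVM": SVC(kernel="rbf"),
    "NN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Dense grid covering the feature space; each model's predictions over it
# trace out that model's decision boundary when plotted.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

boundaries = {}
for name, model in models.items():
    model.fit(X, y)
    boundaries[name] = model.predict(grid).reshape(xx.shape)
    print(f"{name}: training accuracy {model.score(X, y):.2f}")
```

Laying the five `boundaries` arrays side by side, colored by predicted class with the training points overlaid, is the visual comparison the blog post walks through.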
“Some results look like joking. In particular those of SVM #2 and #3 were crazy, too over-fitted, almost no generalized. On the other hand, SVM #1 was too generalized: yeah, it well followed the true boundaries, but its accuracy was bad (approx. 80%). NN was not bad, but also a little over-fitted. Accuracy of RF was great (100%) but looks a little over-fitted too. I know it’s hard to balance generalization and accuracy, but just in my opinion SVM #1 or #3 can be ‘not bad’.”
Even the layperson can sense, visually, what statistical concepts such as “over-fitted” and “generalized” refer to. The visuals that Ozaki calls “over-fitted” resemble gerrymandered legislative-district maps, lacking smooth curves and having dots of various colors straying beyond the lines that surround their primary clusters. The ones he calls “generalized” are primarily smooth curves and colored dots neatly tucked inside their respective cluster line-cordons – the latter exhibit tidy statistical classifications that are more likely to be found in other data samples drawn from the same population.
That suggests a beauty metric for statistical visualizations: the simpler, more elegant, and more natural they appear, the more likely it is that the underlying algorithm captures some deep pattern that’s close to the truth.