
Revisiting the Data Science Suitcase

By Steve Miller


Two years ago, I wrote a blog entitled “What Size is Your Suitcase?” in which I recounted a holiday shopping “dilemma” my wife and I experienced as we purchased suitcases as gifts for each other. The muddle revolved around which size bags to buy: large ones that could handle all our travel needs, or more agile small to mid-size pieces that would be convenient for 95+% of our planned trips, if inadequate for the extremes. I chose the latter, settling on a 21-inch spinner that fits in the overhead bins of commercial aircraft. My wife initially went large, opting for a bulky 25-incher. Once she had it in hand, though, she deemed it clunky and had me exchange it for the easier-to-maneuver 23-inch model.

I used the suitcase dilemma as a metaphor for the types of decisions I saw being made in the analytics technology world by customers of Inquidia, the consultancy I worked for at the time. Companies we contracted with were invariably confronted with decisions on the type, size, and complexity of solutions to implement, and they often initially demanded the 100% answer to their forecast needs over the mid to long range. One customer pondered a Hadoop “Big Data” ecosystem surrounding its purported 10 TB of analytic data. In reality, the “real” data size was more like 1 TB and growing slowly, easily managed by an open source analytic database. The customer seemed disappointed when we told them they didn’t need a Big Data solution. Another customer fretted about the data size limitations of the R statistical platform, confiding they might go instead with an expensive proprietary competitor. It turned out that their largest statistical data set the year we engaged was a modest 2 GB. I showed them R working comfortably with a 20 GB data table on a 64 GB RAM Wintel notebook, and they were sold. My summary take: be a suitcase-skeptic. Don’t be too quick to purchase the largest, handle-all-cases bags. Consider instead the frugality and simplicity of a 95+% solution, simultaneously planning for, but not implementing, the 100% case.
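
For readers curious what that in-memory R exercise looks like, here is a minimal sketch using the data.table package, which is what makes multi-gigabyte tables comfortable on a well-provisioned notebook. The file name and columns (customer_id, amount) are hypothetical illustrations, not the customer’s actual data.

```r
# Minimal sketch: loading and summarizing a large delimited table in R.
# File name and columns are hypothetical.
library(data.table)

# fread() is fast and multi-threaded; a table well into the tens of GB
# is workable on a 64 GB RAM machine.
dt <- fread("transactions.csv")

# Grouped aggregation over tens of millions of rows runs in seconds.
summary_by_customer <- dt[, .(total_spend = sum(amount),
                              n_orders    = .N),
                          by = customer_id]

head(summary_by_customer)
```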

I found a fellow suitcase-skeptic when reviewing the splendid presentation “Best Practices for Using Machine Learning in Business in 2018” by data scientist Szilárd Pafka. Pafka teased his readers with the subtitle “Deeper than Deep Learning,” complaining that today’s AI is pretty much yesterday’s ML and that deep learning is overkill for many business prediction applications. “No doubt, deep learning has had great success in computer vision, some success in sequence modeling (time series/text), and (combined with reinforcement learning) fantastic results in virtual environments such as playing games… However, in most problems with tabular/structured data (mix of numeric and categorical variables) as most often encountered in business problems, deep learning usually cannot match the predictive accuracy of tree-based ensembles such as random forests or boosting/GBMs.” And of course deep learning models are generally a good deal more cumbersome to work with than gradient boosting ensembles. So Pafka is a suitcase-skeptic on deep learning for traditional business ML uses, preferring a 95% ensemble solution to most challenges. He also promotes open source, multi-language (R and Python) packages such as H2O and xgboost as ML cornerstones. “The best open source tools are on par or better in features and performance compared to the commercial tools, so unlike 10+ years ago when a majority of people used various expensive tools, nowadays open source rules.”
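
To show how little ceremony the 95% ensemble solution demands, here is a minimal gradient boosting sketch with the open source xgboost package from R, using a toy dataset bundled with the package. The hyperparameters are illustrative only, and the call reflects the classic xgboost R API rather than any recommendation of Pafka’s specific setup.

```r
# Minimal sketch: a gradient boosted tree ensemble on tabular data with xgboost.
library(xgboost)

# Toy binary-classification data shipped with the package.
data(agaricus.train, package = "xgboost")
data(agaricus.test,  package = "xgboost")

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(agaricus.test$data,  label = agaricus.test$label)

# A modest boosting run is often competitive on mixed numeric/categorical
# business data; parameters here are placeholders, not tuned values.
model <- xgb.train(
  params  = list(objective = "binary:logistic", max_depth = 4, eta = 0.3),
  data    = dtrain,
  nrounds = 50
)

preds <- predict(model, dtest)
```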

Count Pafka as a skeptic on distributed analytics as well. It’s not that analytics clusters have no value; it’s just that they’re oftentimes needlessly deployed. I couldn’t make the argument better than Szilárd: ‘And the good news is that you most likely don’t need distributed “Big Data” ML tools. Even if you have Terabytes of raw data (e.g. user clicks) after you prepare/refine your data for ML (e.g. user behavior features) your model matrix is much smaller and will fit in RAM.’ He cites Netflix’s neural net library Vectorflow, “an efficient solution in a single machine setting, lowering iteration time of modeling without sacrificing the scalability for small to medium size problems (100M rows).” To be sure, there are many instances for which distributed/cluster computing for analytics is the best, and perhaps only, choice. The suitcase-skeptic, though, opts for the 95% case, planning for distributed solutions but saving them for when they’re truly necessary.
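
To make the in-RAM argument concrete, here is a minimal sketch of the raw-events-to-model-matrix step, assuming a hypothetical event-level clickstream file with user_id, page, and ts columns. Terabytes of raw clicks typically collapse to one compact row per user once aggregated to the entity actually being modeled.

```r
# Minimal sketch: event-level clicks aggregated to a per-user feature matrix.
# Columns (user_id, page, ts) are hypothetical.
library(data.table)

clicks <- fread("clicks_sample.csv")   # raw event-level data

# One row per user: a handful of behavioral features instead of millions of events.
features <- clicks[, .(
  n_clicks    = .N,
  n_pages     = uniqueN(page),
  days_active = uniqueN(as.Date(ts))
), by = user_id]

# Even ~100M users with a few dozen numeric features fits in RAM on one large server.
dim(features)
```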

I’ve had many discussions with companies deploying analytics about batch versus real-time data loading and model scoring. Out of the gate, most will claim that real-time updates are a sine qua non, that is, until they understand the complexity and cost of that approach. They then often compromise on small batch windows of hours or even minutes. Pafka is a suitcase-skeptic here too. “Batch scoring is usually simpler to do. I think batch scoring is perfectly fine if you don’t need real-time scoring/ you don’t do real-time decisions. Batch can be daily, hourly, every 5 minutes if you want. You can use the same ML lib as for training from R or python.” Again, adopt the simpler 95% strategy and save the biggest solutions for when they’re really needed.
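
A batch scoring job along these lines can be as simple as the following sketch, run from a scheduler such as cron, using the same library that trained the model. The model file, input path, and column names are hypothetical, and xgboost stands in for whatever ML library a team actually uses.

```r
# Minimal sketch: scheduled batch scoring with the training-time library.
# Paths and column names are hypothetical.
library(xgboost)
library(data.table)

model    <- xgb.load("churn_model.xgb")   # model persisted at training time
new_data <- fread("to_score_today.csv")   # today's batch of records to score

# Score the batch and write results for downstream systems to pick up.
scores <- predict(model, as.matrix(new_data[, !"customer_id"]))
fwrite(data.table(customer_id = new_data$customer_id, score = scores),
       "scores_today.csv")
```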

I’d recommend readers consume Pafka’s best practices enthusiastically. I also think the suitcase-skeptic approach is the right one for companies getting started with analytics and learning as they go. Plan for the 100% solution down the road while implementing the 95% case that can deliver results immediately. I suspect Data Science luminary Szilárd Pafka would agree.
