Productivity in Data Science isn’t a matter of output in any quantitative sense. It’s more an issue of the quality of what Data Scientists produce.
In a Data Science context, quality refers to the validity and relevance of the insights that statistical models are able to distill from the data. As I stated last year, the more data you have, the more stories that Data Scientists can tell with it, though many of those narratives may be entirely (albeit inadvertently) fictitious.
Given the paramount importance of actionable insights, the productivity of Data Science teams can’t be neatly reduced to throughput or any other quantitative metric. Data Scientists can easily pump up their aggregate output along myriad dimensions, such as more sources, more data, more pipeline processes, more variables, more iterations, and more visualizations. But that doesn’t necessarily get them any closer to delivering high-quality analytics for predictive, prescriptive, and other uses.
What happens to the quality of their output when you automate what Data Scientists do? Well, of course, the sheer quantity of Data Science artifacts being produced will grow by leaps and bounds. But it’s not clear how Data Scientists, overwhelmed by a glut of Machine Learning models and other auto-generated algorithmic artifacts, can automate the task of distinguishing high-quality outputs from the inevitable useless junk.
Even when it involves unsupervised methods, Machine Learning is never entirely automated. Data Scientists must still prepare the data sets, specify the algorithms, execute them, and interpret the results. The process of extracting insights and applying them within the context of particular data-driven applications remains an inherently creative, exploratory process that demands human judgment. Crowdsourcing the evaluation of unsupervised Machine Learning models’ results, as is often done with CAPTCHA tests, doesn’t change this fundamental imperative. Automating the execution of the algorithms themselves may be the least important aspect of the overall process.
Nevertheless, there’s no stopping Data Science automation. As Data Science Central’s William Vorhies noted here, commercial tools are enabling automation of many pipeline tasks relating to data engineering (e.g., cleansing, normalization, skewness removal, transformation) and modeling (champion model selection, feature selection, algorithm selection, fitness metric selection). According to Vorhies’ analysis, some of the key pipeline processes that will continue to require manual methods (to varying degrees) include data-engineering tasks such as cluster analysis and exception handling, as well as Data Modeling tasks such as feature engineering and missing-data imputation.
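To make the "champion model selection" and "algorithm selection" tasks concrete, here is a minimal sketch of how such automation typically works, using scikit-learn as an assumed illustrative toolkit (the candidate algorithms, hyperparameter grids, and synthetic data are illustrative choices, not a reference to any specific commercial product Vorhies discusses):

```python
# A hedged sketch of automated algorithm/champion selection:
# cross-validated search over alternative algorithms and settings,
# with the best-scoring candidate emerging as the "champion" model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A pipeline bundling data engineering (scaling) with modeling
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression())])

# Candidate algorithms and hyperparameters to search over;
# the "clf" step is swapped between alternative estimators
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)],
     "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [RandomForestClassifier(random_state=0)],
     "clf__n_estimators": [50, 100]},
]

# The automated search: cross-validation picks the champion
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(type(search.best_estimator_.named_steps["clf"]).__name__)
```

Note that even in this small sketch, a human still chose the candidate algorithms, the grids, and the fitness metric; the automation only exhausts the search space those judgments define.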
Despite what he says, automation is in fact coming to those judgment-intensive Machine Learning development processes too. As discussed in this recent MIT Technology Review article, researchers are developing Machine Learning systems that automate feature learning, modeling, and training. At the bleeding edge of Data Science R&D, this research focus is alternately called “automated Machine Learning” or “learning to learn.” Leading commercial organizations (e.g., Google), nonprofit research institutes (e.g., OpenAI), and universities (e.g., MIT, University of California Berkeley) have their smartest computer scientists working on it.
Breakthrough results in automated Machine Learning are being reported with greater frequency. Some researchers have found that they can automate creation of Machine Learning algorithms that handle particular tasks—such as object recognition—better than equivalent algorithms designed by human experts. Almost certainly, these research initiatives will leverage advances in “transfer learning” and “generative Machine Learning”, which will enable automatic repurposing of algorithmic artifacts from prior Machine Learning projects.
Should today’s professional Data Scientists worry about all this? Are they at risk of being automated out of their core jobs? What value can human Data Scientists add when the world is full of AI algorithms that endlessly and reliably spin out superior AI algorithms for every conceivable task?
Researchers consider automated Machine Learning a productivity tool to help human Data Scientists deal with growing workloads. In fact, that’s how one researcher frames it in the cited MIT Technology Review article: “Easing the burden on the Data Scientist is a big payoff. It could make you more productive, make you better models, and make you free to explore higher-level ideas.”
That’s all well and good, but how will Data Scientists maintain quality standards in the face of advancing automation? To some degree, manual quality assurance will always remain essential, a core task for which human experts will be responsible. Under most likely future scenarios, Data Scientists will need to review the output of their automated tools in order to ensure the validity and actionability of the results. This is analogous to how high-throughput manufacturing facilities dedicate personnel to test samples of their production runs before they’re shipped to the customer.
If nothing else, established Data Scientists will need to perform manual reviews of auto-generated assets prior to putting them into production. I discussed this latter requirement in detail in this recent post. Fortunately, as Vorhies states in the above-cited post, this manual review, which he calls “expert override mode,” is built into the emerging generation of Self-service Data Science Modeling tools. These, he states, “allow a[n experienced] Data Scientist to see what assumptions had been made and to modify them.”
The practical limits of Data Science automation are qualitative. We dare not automate these processes any further than we can vouch for the quality of their output. Without that manual review step, the risks of entirely automating the Data Science pipeline may prove unacceptable to society at large.