
Don’t Fence Me In: Tensions on the Data Science Frontier


Click here to learn more about author James Kobielus.

Data scientists can be a stubborn, proud, and independent lot. In their hearts, they’re new-age prospectors whose primary occupation is to uncover hidden veins of insight buried deep in data.

Self-starting genius is a great thing, but it’s not what pays the bills. In the new world of analytics-driven business, data scientists are evolving into a species of industrial engineer who must conform to a remorselessly repeatable workflow.

This is the decidedly unsexy concept of nitty-gritty business data science taking place in a sort of automated “model factory.” Though the concept isn’t new, it’s rapidly moved from a niche approach to more of a mainstream practice in the past few years as many organizations have realigned their business models around applications that feed off the fruits of data science.

What this means in practice is that the work life of the 21st century business data scientist revolves around:

  • Producing a steady stream of analytic assets, including machine-learning models, cognitive algorithms, and data-driven apps;
  • Deploying those assets into mission-critical business processes, including those that drive sales, marketing, customer engagement, materials management, and much more;
  • Monitoring in real time the effectiveness of those deployed assets in achieving the intended business outcomes; and
  • Processing a never-ending stream of change requests for tweaking those assets to drive business results while boosting their operating efficiency.

There’s no turning back to older, more manual practices in these areas. Automation can help scale the engineering of data and refinement of models at an ever-faster pace. It can boost the efficiency and throughput of the most arduous, labor-intensive, and repeatable data-engineering tasks: discovery, profiling, sampling, and preparation. As I noted in this recent post, data scientists are automating the ingestion and analysis of new data types—especially the image, audio and video content that is so fundamental to the streaming media and cognitive computing revolutions—through sample machine-learning pipelines for computer vision and speech as well as data loaders for other data types.

Automation can also accelerate the building, refinement, and deployment of ever larger, more complex statistical models. And as I noted in this post, unsupervised learning approaches are facilitating automation of more—but far from all—of the exploratory processes that have heretofore required manual approaches.

Just as important, automation is the key to independently verifying, and as a result confirming, your data scientists’ findings. Automation of their entire lifecycle, from data engineering to model building and testing, can enable data scientists to repeat the precise circumstances under which a prior finding was obtained. In this way, automation can reduce the possibility that inadvertent human errors in the results verification process might have altered some key variables when re-running a particular version of a particular model against particular data sets. And automation can even help your data scientists be far more productive in conducting real-world experiments, where specific variables can be selectively tweaked while the rest remain constant.
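To make the idea of a repeatable run concrete, here is a minimal Python sketch. It is an illustration, not a description of any particular product or team's pipeline: the function names `run_fingerprint` and `run_experiment`, the toy "model" (a scaled sample mean), and the config keys `seed`, `sample_size`, and `scale` are all hypothetical. The sketch records a hash of the exact configuration and data behind each result, seeds its own random number generator so a rerun draws the same sample, and then changes one variable while holding the rest constant.

```python
import hashlib
import json
import random


def run_fingerprint(config, data):
    """Hash the exact config and data so a rerun can show it used the same inputs."""
    payload = json.dumps({"config": config, "data": data}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_experiment(config, data):
    """Toy stand-in for a model: the mean of a seeded random sample, scaled."""
    rng = random.Random(config["seed"])  # private RNG, unaffected by global state
    sample = rng.sample(data, config["sample_size"])
    estimate = config["scale"] * sum(sample) / len(sample)
    return {"fingerprint": run_fingerprint(config, data), "estimate": estimate}


data = list(range(100))
baseline = {"seed": 42, "sample_size": 10, "scale": 1.0}

# Same config and data: the rerun should reproduce the earlier finding exactly.
first = run_experiment(baseline, data)
rerun = run_experiment(baseline, data)

# Controlled experiment: tweak one variable, hold everything else constant.
variant = dict(baseline, scale=2.0)
tweaked = run_experiment(variant, data)

print(first == rerun)                                # same inputs, same result
print(first["fingerprint"] == tweaked["fingerprint"])  # inputs differ, so hashes differ
```

Because the fingerprint travels with the result, a reviewer can tell at a glance whether two findings came from identical inputs or whether some variable was changed between runs.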

Clearly, the new era of automated data science is fast emerging. However, many established data scientists are greeting this new age with anxiety.

Ever the realists, data scientists recognize the need to automate more of their work in team-oriented environments. But they continue to emphasize the enduring need for manual methods that involve expert judgment, creative ideation, and ad-hoc exploratory processes. For example, many interactive data exploration and feature engineering functions will continue to require expert judgments from teams of statistical analysts and domain experts. For this reason, data scientists resist efforts to reduce their work to a chain of repeatable, fine-grained tasks in a sort of virtual assembly line.

To the working data scientist, this trend toward automation can feel a bit like the closing of the frontier. Their protests against team-centric factory discipline might come across as a bit like Gene Autry singing “Don’t Fence Me In.” As noted in this recent article, some data scientists are resisting their employers’ efforts to rein in their long-established “free-roam” work-style. As businesses grow their internal data science competencies, they are instituting more formalized workflows that hinge on embedded teams of operational data scientists and tool-based automation. As this trend intensifies, opportunities for unaffiliated staff data scientists to freely dispense advice, focus on special projects, and use idiosyncratic working methods will steadily diminish.

But rest assured that the outlook for data scientists, as a profession, is not dire. Automation of their core tasks will not throw masses of them out of work. Instead, automation will boost data scientists’ productivity by leaps and bounds. In the process, it will free them from industrial-grade drudgery, especially the thankless task of data preparation, to focus on solving business problems rather than tweaking errant algorithms.
