Click to learn more about author Rosaria Silipo.
An interview with three data scientists and guided automation experts
There is currently a lot of talk about automated machine learning. There is also a high level of skepticism.
WANT TO STAY IN THE KNOW?
Get our weekly newsletter in your inbox with the latest Data Management articles, webinars, events, online courses, and more.
I am here with data scientists Paolo Tamagnini, Simon Schmid and Christian Dietz.. to ask a few questions on this topic from their point of view and I found this concept of guided automation quite interesting as well, since it is directly involved in the practice of automated machine learning.
Rosaria Silipo: What is automated machine learning?
Christian Dietz: Automated machine learning is about building a system, process or application able to automatically create, train and test machine learning models with as little human input as possible. The CRISP-DM cycle was introduced almost 20 years ago and is now an established process with standard steps, such as data preparation, feature engineering and feature optimization, model training, model optimization, model testing, and deployment, which are common to most data science projects. The aim of automation is to remove human interaction as much as possible from these steps.
There are different algorithms and strategies to do this which vary by complexity and performance, but the main idea is empowering business analysts to train a great number of models and deliver the best one with just a small amount of configuration.
We usually only talk about automating machine learning, but it’s really about fitting automation into as many steps of the cycle as possible — not just the training/selection of models. For example, applications to automate data wrangling or data visualization are starting to appear as well.
Rosaria: Can automated machine learning really fully automate the Data Science cycle with no expert intervention?
Simon Schmid: This is a tricky question! Some say it can, some say it can’t.
In my opinion, automated machine learning can fully automate the Data Science cycle for standard data science problems. You know the scenario: You have some data, the data is quite general and describes the problem well, no unbalanced classes. You choose a model, you train it on the training set, and you evaluate it on the test set. If performance is acceptable, you deploy it. No major surprises. In this case, the whole cycle can be automated, even introducing a few additional optimization steps.
However, more complex Data Science problems are likely to call for some degree of human — or expert — input.
For example, the domain expert can add some unique knowledge about the data treatment and filtering before continuing with the machine learning process. Also, when the data domain becomes more complex than simple tabular data, e.g., including text, images or time series, the domain expert can contribute with custom techniques for data preparation, data partitioning, and feature engineering.
Basically, the answer to your question is “sometimes.” For this very reason, our team works with a framework that allows both options. You can run a fully automated cycle or you can decide to intervene at select points along the way. This is the functionality offered through a feature called Guided Analytics. Guided Analytics allows you to intersperse your workflow with interaction points and thus steer the data science application in different directions, if this is needed.
Rosaria: You have described this already, but I think our readers will benefit from a few additional details. Can you tell us more about Guided Analytics?
Paolo Tamagnini: Guided Analytics is about flexibly adding interaction points in the data pipeline — that is in between the sequence of steps the data go through during the analysis. When you develop a data processing or data analytics application, you don’t just develop it for yourself but also for other people to use too. So, in order to give anyone the opportunity to tweak how the analysis proceeds, you should add some interaction points at strategic locations throughout the pipeline.
The data pipeline is also called a workflow and the interaction points are web pages, easily created without any scripting via Wrapped Metanodes. For example, in a reporting application, you can request the user to select the time window or the KPI to be displayed; in a data wrangling application you can ask which data sources should be blended together and what kind of features should be built; in a machine learning application you can ask the user to specify the target variable as well as the input variables, the model(s) that should be trained, and whether feature engineering needs to be performed.
Rosaria: So, Guided Automation is your interpretation of an application for automated machine learning. Can you briefly describe how it works?
Christian: Guided Automation is what comes out when you combine Guided Analytics with automated machine learning. You can flexibly ask the business analyst to add their expertise when it is necessary and, consequently, automate the standard pieces of the analysis. Just what the right amount of automation and interaction is depends from problem to problem. Sometimes, you can proceed with the default choices, and sometimes you need further input to refine and steer the process.
Common interaction points in guided analytics for automated machine learning applications are when the data are uploaded and the target is selected, specifying which features are to be used as input and which models trained, while automation involves mostly hyperparameter optimization and feature selection. You can also add optional interaction points, for example to customize feature engineering or to select the execution environment for custom scalability.
The process, encompassing automation and interaction points, is shown in the diagram below. We have followed the general recipe in this diagram when implementing a blueprint workflow for guided automation.
Rosaria: This blueprint for guided automation … is this blueprint a software solution I can buy?
Paolo: Buy? No, it’s free! Just like all example and blueprint workflows we work with at. All you need to do is download an open source copy, start it, open the EXAMPLES Server, locate the blueprint at 50_Applications/36_Guided_Analytics_for_ML_Automation, drag and drop it to your LOCAL workspace, and then you can automate the machine learning process on your own data. Please feel free to customize it though. It is a blueprint in the end, and you can improve it and tailor it to your data science problem to achieve the best performance.
Rosaria: Nice! Do I need special software to properly run the blueprint for guided automation?
Paolo: Absolutely not. On KNIME Server, you can access the interaction points remotely from any web browser, which is quite useful. But using the open source and free Analytics Platform, you can also run the blueprint, access the interaction points, and interact with the workflow via the client built-in web browser.
Rosaria: Where can I learn how to customize the blueprint for guided automation?
Paolo: In general, you can start from a free e-learning course. There you can learn more about: how to access data, how to perform ETL operations, and how to display plots and charts.
Thank you Christian, Paolo and Simon for your time and the clear answers! Now I know more about the general concept of automated machine learning, Guided Analytics, and the blueprint you have created, Guided Automation.