By Paolo Tamagnini.
Welcome to our integrated deployment blog series, where we focus on solving the challenges around productionizing Data Science.
Topics will include:
- Resolving the challenges of deploying models
- Building guided analytics applications that create not only a model but a complete model process, using our component approach to AutoML to collaborate on projects
- Setting up an infrastructure to not only monitor but automatically update production workflows
The key feature to solving many of these issues is integrated deployment, and, in this blog, we explain that concept with practical examples.
Data scientists, regardless of what package they use, are used to training machine learning models to solve business issues. The classic approaches to creating Data Science, such as the CRISP-DM cycle, support this. But the reality is that a great model can never simply be put into production. A model needs the data prepared and surfaced to it in production in exactly the same way as when it was created. And there may be other aspects involved in using the model and surfacing its results in the correct form that are not intrinsic to the model itself.
To date, that huge gap in the process of moving from creating a great model to using it in production has been left to the user, regardless of whether you are using our package or another package such as Python. Effectively, you have always needed to manually design two workflows — one to train your model and another to deploy it. With our platform, the deployment workflow can now be created automatically, thanks to a new extension.
There is a small introductory blog explaining integrated deployment here. In this article, we’d like to dive a bit deeper. To do that, we will look at two existing workflow examples of model creation and model deployment. We will then redo them so that the creation workflow automatically generates the production workflow.
This approach was used in the Data Science learnathon workshop that we have been running for many years. Run as onsite and online events, this workshop provides an overview of how to use our platform not only for creating great Data Science but for productionizing it. We build two workflows.
The first workflow is the modeling workflow. It is used to access the available data, blend it into a single table, clean out missing values and other inconsistencies, apply domain expertise by creating new features, and then train, optimize, and validate models.
The second workflow is the deployment workflow, which not only loads all the settings trained in the modeling workflow but also rebuilds the data preprocessing that the model expects. In many cases, the deployment workflow is not a standalone workflow: it is designed to be called via REST API by an external application, creating a very simple service that receives new data as input and returns the model output via an HTTP request.
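To make the REST interaction concrete, here is a minimal Python sketch of what a client of such a deployed workflow might do. The endpoint URL, JSON field names, and the response schema are all hypothetical, for illustration only; the actual schema is defined by the deployed workflow.

```python
import json

# Build the JSON body a hypothetical churn-scoring endpoint expects.
# Field names ("input", "data", "churn_prediction") are illustrative.
def build_request(customers):
    """Wrap new customer rows in a JSON request body."""
    return json.dumps({"input": {"data": customers}})

def parse_response(body):
    """Extract churn predictions from a JSON response body."""
    return [row["churn_prediction"] for row in json.loads(body)["output"]["data"]]

payload = build_request([{"customer_id": 42, "calls": 3, "contract_months": 12}])
# The actual call would be an HTTP POST of `payload` to the deployed
# workflow's REST endpoint, e.g. with the requests library.
fake_response = '{"output": {"data": [{"customer_id": 42, "churn_prediction": 1}]}}'
print(parse_response(fake_response))  # [1]
```

The point is that the deployment workflow sits behind a plain HTTP interface, so any external application that can send JSON can consume the model.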
In this example, we train a model to predict the churn of existing customers given the stored data of previous customers. The modeling workflow accesses the data from a database and joins it with data from an Excel file. The data is prepared by recomputing the domain of each column, converting a few columns from categorical to numerical, and partitioning it into two sets, the training set and the test set. A missing value imputation model is created based on the distribution of the training set, and parameter optimization is performed to find the optimal settings for the random forest (e.g., number of trees), which is trained right after. The trained model is used to compute churn predictions on the test set, which contains customers the model has never seen during training. Via an interactive view, the classification threshold of the model is optimized and applied to the test set. The evaluation of the model is checked with both the default threshold and the newly optimized one via confusion matrices. The missing value model and the random forest model are saved for the deployment workflow.
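For readers who think in code, the same modeling steps can be sketched with scikit-learn. This is an analogy, not the platform's implementation: synthetic data stands in for the database and Excel sources, and the threshold value is a placeholder for what would come from the interactive view.

```python
# Sketch of the modeling steps: impute missing values on the training
# distribution, optimize the number of trees, train, then tune the threshold.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.05] = np.nan        # inject some missing values
y = (np.nan_to_num(X[:, 0]) + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # missing value model
    ("forest", RandomForestClassifier(random_state=0)),
])

# Parameter optimization, e.g. the number of trees.
search = GridSearchCV(pipe, {"forest__n_estimators": [50, 100]}, cv=3)
search.fit(X_train, y_train)

# Apply a tuned classification threshold instead of the default 0.5.
proba = search.predict_proba(X_test)[:, 1]
threshold = 0.4   # placeholder: chosen interactively / from ROC analysis
predictions = (proba >= threshold).astype(int)
print(search.best_params_)
```

Note that the imputer and the forest live in one Pipeline, so the exact preprocessing fitted on the training set travels with the model.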
The overall modeling workflow is shown in Figure 1 below.
To deploy this simple churn prediction model (before version 4.2), the data scientists had to manually create a new workflow (Figure 2 below), and node by node rebuild the sequence of steps, including manually duplicating the preparation of the raw data so that the previously created and stored models can be used.
This manual work required the user to spend time dragging and dropping the same nodes that were already used in the modeling workflow. Additionally, the user had to make sure the written model files could be found by the deployment workflow, and that new data could move in and out of the deployment workflow via the JSON format required by the REST API framework.
In this special case, where the binary classification threshold was optimized, the data scientists even had to manually type in the new threshold value.
Deployment using this manual setup was standard practice but time-consuming. Whenever something changed in the modeling workflow, the deployment workflow had to be updated by hand. Consider, for example, training a model other than a random forest, or adding another step in the data preparation. Retraining the same model and redeploying it was possible, but automatically changing nodes was not.
- Integrated deployment empowers you to deploy automatically and flexibly from your modeling workflow.
- How does the churn prediction modeling workflow look when integrated deployment is applied?
In Figure 3 below, you can see the same workflow as in Figure 1, with the exception that a few new nodes are used. These are the Capture nodes from the Integrated Deployment Extension. The data scientist can design the deployment workflow while building the modeling workflow by capturing the segments to be deployed. In this simple example, only two workflow segments are captured for deployment: the data preparation and the scoring, framed in purple in Figure 3. Any node input connection that does not come from the Capture Workflow Start node is fixed as a parameter in the deployment workflow. In this case, the only dynamic input and output of the captured nodes is a data port specified in the Capture Workflow Start and End nodes. The two captured workflow segments are then combined via a Workflow Combiner node, and the deployment workflow is automatically written to the Server or to the local repository via a Workflow Writer node.
It is important to emphasize that the Workflow Writer node has created a completely configured and functional workflow.
In Figure 4 below, you can have a look at the automatically generated deployment workflow. All the connections that were not delimited in the modeling workflow by the Capture Workflow Start and Capture Workflow End nodes are static and are imported by PortObject Reference Reader nodes. These are generic reader nodes that load the static parameters captured during training. In Figure 4, the example deployment workflow reads in three parameters: the missing value model, the random forest model, and the double value to be used as the binary classification threshold.
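In code-based workflows, the analogous pattern is persisting the static parameters at training time and reading them back at deployment time. A minimal sketch with Python's pickle module, where the artifact names and contents are illustrative placeholders:

```python
import pickle

# Training side: persist the three static parameters the deployment
# script will need. Objects here are stand-ins (a fitted imputer and a
# fitted estimator in a real pipeline).
artifacts = {
    "missing_value_model": {"calls": 3.2, "contract_months": 14.0},  # column means
    "model": "random-forest-placeholder",
    "threshold": 0.42,
}
with open("churn_artifacts.pkl", "wb") as f:
    pickle.dump(artifacts, f)

# Deployment side: read the same parameters back, mirroring what the
# PortObject Reference Reader nodes do in the generated workflow.
with open("churn_artifacts.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded["threshold"])  # 0.42
```

Keeping all static parameters in one versioned artifact avoids the "model file not found" failure mode described above for manually built deployment workflows.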
In a scenario where data is prepared and models are trained and deployed routinely, integrated deployment becomes extremely useful for retraining and redeploying on the fly with updated settings. This can be fully automated with the Deploy Workflow to Server node, which passes the created workflows to the Server, where models are deployed when using the Analytics Platform. You can see an example of the new Deploy Workflow to Server node using the Server Connection in Figure 5.
In the animation below, the Workflow Executor node is added, the Workflow Object is connected to its input, and, via the dialog, the right number of input and output ports is created. This setup offers the model-agnostic framework needed for machine learning interpretability techniques such as LIME, Shapley values, and SHAP.
Even if you do not have access to a Server, the Integrated Deployment Extension can be extremely useful for executing a piece of a workflow over and over again. Imagine you would like to test a workflow multiple times without having to copy the entire sequence of nodes onto different branches. With the new Workflow Executor node, you can reuse a captured workflow on the fly using a black-box approach (Figure 5 below). This comes in extremely handy when working with the Machine Learning Interpretability Extension.
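The black-box idea translates directly to code: the captured preprocessing and model become one callable that interpretability methods can invoke repeatedly on perturbed inputs. The sketch below uses a hand-rolled linear scorer as a stand-in for the real captured workflow; the imputation means, weights, and threshold are invented for illustration.

```python
import numpy as np

# One black-box function wrapping imputation, scoring, and the tuned
# threshold, so tools like LIME or SHAP can call it on perturbed samples.
means = np.array([3.2, 14.0])     # missing-value imputation parameters
weights = np.array([0.8, -0.1])   # stand-in for the trained model
threshold = 0.42

def churn_scorer(X):
    """Impute missing values, score, and apply the tuned threshold."""
    X = np.where(np.isnan(X), means, X)
    scores = 1.0 / (1.0 + np.exp(-(X @ weights)))   # sigmoid of linear score
    return (scores >= threshold).astype(int)

batch = np.array([[np.nan, 12.0], [1.0, 20.0]])
print(churn_scorer(batch))  # [1 0]
```

Because the whole segment is behind one function, it can be re-executed on any branch or by any interpretability library without duplicating nodes.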
This introductory example is, of course, only a first demonstration of how integrated deployment enhances analytic workflows. In the upcoming episodes of this series, we will see how this new extension empowers an expert to flexibly train, score, deploy, maintain, and monitor machine learning models in an automated fashion. Stay tuned!