Guided Labeling Episode 7: Weak Supervision Deployed via Guided Analytics

By Paolo Tamagnini.

Welcome to the seventh episode of our Guided Labeling Blog Series. In the last six episodes, we covered the theory behind active learning and weak supervision. Today, we present a practical example of implementing weak supervision via guided analytics, built as a workflow.

A Document Classification Problem

Let’s assume you want to train a document classifier, a supervised machine learning model that predicts a category for each of your unlabeled documents. Such a model is needed, for example, when dealing with large collections of unlabeled medical records, legal documents, or spam emails, a problem that recurs across several industries.

In our example, we will:

Build an application able to digest any kind of document
Transform the documents into bags of words
Train a weak supervision model using labeling functions provided by the user

We would not need weak supervision if we had labels for each document in our training set, but as our document corpus is unlabeled, we will use weak supervision and create a web-based application to ask the document expert to provide heuristics (labeling functions).

Figure 1: The weak supervision framework to train a document classifier — a document expert provides labeling functions for documents to the system. The produced weak label sources are fed to the label model, which outputs the probabilistic labels to train the final discriminative model, which will be deployed as the final document classifier.
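As a rough illustration of the bag-of-words step listed earlier in this section, here is a minimal Python sketch. It is not the implementation used in the workflow; the function names `tokenize` and `bag_of_words` and the simple lowercase word tokenizer are assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, rows); each row holds one document's term counts."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Two toy documents, invented for this example.
docs = ["A great movie, great cast.", "A boring movie."]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)  # ['a', 'boring', 'cast', 'great', 'movie']
print(rows)        # [[1, 0, 1, 2, 1], [1, 1, 0, 0, 1]]
```

Each row is the feature vector that the final discriminative model would later see for one document.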

Labeling Function in Document Classification

What kind of labeling function should we use for this weak supervision problem?

Well, we need a heuristic, a rule, which looks for something in the text of a document and, based on that, applies the label to the document. If the rule does not find any matching text, it can leave the label missing.

As a quick example, let’s imagine we want to perform sentiment analysis on movie reviews and label each review as either “positive (P)” or “negative (N).” Each movie review is therefore a document, and we need to build a reasonably accurate labeling function to label certain documents as “positive (P).” A practical example is pictured in Figure 2 below.

Figure 2: An example of a labeling function. The rule matches the first document, a movie review, and assigns it a positive label; a slightly different document that the rule does not match is left unlabeled.
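A labeling function of this kind can be sketched in a few lines of Python. The phrase "loved it" and the helper name `lf_contains_phrase` are hypothetical and chosen only for illustration; the rule shown in Figure 2 may differ.

```python
def lf_contains_phrase(phrase, label):
    """Build a labeling function: return `label` if `phrase` occurs, else None."""
    needle = phrase.lower()

    def labeling_function(document):
        # None plays the role of the missing label when the rule does not match.
        return label if needle in document.lower() else None

    return labeling_function

# Hypothetical rule: reviews containing "loved it" are labeled positive (P).
lf_loved = lf_contains_phrase("loved it", "P")

print(lf_loved("I loved it from start to finish."))  # P
print(lf_loved("I watched it last night."))          # None
```

Returning `None` rather than guessing a label lets the rule abstain on documents it has nothing to say about.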

By providing many labeling functions like the one in Figure 2, it is possible to train a weak supervision model that is able to detect sentiment in movie reviews. The input of the label model (Figure 1) would be similar to the table shown in Figure 3 (below). As you can see, no feature data is attached to such a table, only the output of several labeling functions on all available training data.

Figure 3: The output labels of labeling functions for sentiment analysis. In this table, the outputs of eight labeling functions are displayed for hundreds of movie reviews. Each labeling function is a column, and each movie review is a row. A labeling function leaves a missing label when it does not apply to the movie review. If it does apply, it outputs either a positive or a negative sentiment label. In weak supervision, this kind of table is called a Weak Label Sources Matrix and can be used to train a machine learning model.
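To make the structure of such a matrix concrete, the sketch below applies three keyword rules to four toy reviews. The rules, the reviews, and the names `keyword_lf` and `build_weak_label_matrix` are all invented for this example.

```python
def keyword_lf(keyword, label):
    """Return a labeling function that votes `label` when `keyword` appears."""
    needle = keyword.lower()

    def lf(document):
        return label if needle in document.lower() else None

    return lf

# Hypothetical rules; a document expert would supply the real ones.
labeling_functions = {
    "lf_great": keyword_lf("great", "P"),
    "lf_loved": keyword_lf("loved", "P"),
    "lf_boring": keyword_lf("boring", "N"),
}

reviews = [
    "A great film, I loved every minute.",
    "Boring plot and a boring cast.",
    "Great visuals, but the story was boring.",
    "I saw it on a Tuesday.",
]

def build_weak_label_matrix(documents, lfs):
    """One row per document, one column per labeling function."""
    return [[lf(doc) for lf in lfs.values()] for doc in documents]

matrix = build_weak_label_matrix(reviews, labeling_functions)
print(list(labeling_functions))  # column names
for row in matrix:
    print(row)
```

The third review receives conflicting votes and the fourth receives none, which are exactly the cases the label model has to resolve.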

Once the labeling functions are provided, it only takes a few moments to apply them to thousands of documents and feed their outputs to the label model (Figure 4 below).

Figure 4: The labeling function output in the weak supervision framework. We feed the outputs of the labeling functions to the label model. The label model produces probabilistic labels, which, alongside the bag of words data, can be used to train the final document classifier.
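To show what "probabilistic labels" look like, here is a deliberately simplified stand-in for the label model: an unweighted vote over the sources that fired. This is not the algorithm behind the label model described in this series, which estimates how much each source should be trusted from patterns of agreement and conflict; the function name `vote_label_model` is an assumption made for this sketch.

```python
def vote_label_model(matrix, classes):
    """Turn each row of weak labels into a probability distribution over classes."""
    probabilistic = []
    for row in matrix:
        votes = [label for label in row if label is not None]
        if not votes:
            # No source fired: fall back to a uniform distribution.
            probabilistic.append({c: 1.0 / len(classes) for c in classes})
        else:
            # Share of the fired sources that voted for each class.
            probabilistic.append({c: votes.count(c) / len(votes) for c in classes})
    return probabilistic

# Toy Weak Label Sources Matrix: rows are documents, columns are sources.
matrix = [
    ["P", "P", None],
    ["N", "P", "N"],
    [None, None, None],
]
probs = vote_label_model(matrix, classes=["P", "N"])
for p in probs:
    print(p)
```

The resulting distributions, rather than hard labels, are what the final discriminative model would be trained on together with the bag-of-words features.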

Guided Analytics with Weak Supervision on the WebPortal

In order to enable the document expert to create a weak supervision model, we can use Guided Analytics. Using a web-based application that offers a sequence of interactive views, the user can:

Upload the documents
Define the possible labels the final document classifier needs to predict
Input the labeling functions
Train the label model
Train the discriminative model
Assess the performance

We created a blueprint for this kind of application in a sequence of three interactive views, as shown in Figure 5 (below). The generated web-based application can be accessed via any web browser in the WebPortal.

Figure 5: The three views generated by the open source Guided Analytics Application blueprint. The application aims to enable document experts to create a weak supervision model by providing labeling functions via interactive views.

This application is implemented as a workflow (Figure 6 below), currently available on the Hub. The workflow uses the Weak Supervision extension to train the label model with a Weak Label Model Learner node, and a Gradient Boosted Trees Learner node to train the discriminative model. Other algorithms besides Gradient Boosted Trees are also available and can be used in conjunction with the Weak Label Model nodes (Figure 6).

Figure 6: The workflow behind the Guided Analytics Application and the nodes available in the Analytics Platform to perform Weak Supervision. The workflow compares the performance of the label model probabilistic output with the performance of the final discriminative model via an interactive view. The available nodes are listed in the lower part of the screenshot. The nodes framed in yellow train a label model, and the nodes framed in green train a discriminative model. The workflow in this example uses Gradient Boosted Trees.

When Does Weak Supervision Work?

In this episode of our Guided Labeling Blog Series, we have shown how to use weak supervision for document classification. We have described a single use case here, but the same approach can be applied to images, tabular data, multiclass classification, and many other scenarios. As long as your domain expert can provide the labeling functions, the open source Analytics Platform can provide a workflow that is deployed on the Server and made accessible via the WebPortal.

What are the requirements for the labeling functions/sources in order to train a good weak supervision model?

Moderate Number of Label Sources: The label sources need to be sufficient in number — in certain use cases, up to 100.
Label Sources Are Uncorrelated: Currently, the implementation of the label model does not take into account strong correlations. So it is best if your domain expert does not provide labeling functions that depend on one another.
Sources Overlap: The labeling functions/sources need to overlap in order for the algorithm to detect patterns of agreement and conflict. If the labeling sources provide labels for sets of samples that do not intersect, the weak supervision approach is not going to be able to estimate which source should be trusted.
Sources Are Not Too Sparse: If all labeling functions label only a small percentage of the total number of samples, this will affect the model performance.
Sources Are Better Than Random Guessing: This is an easy requirement to satisfy. It should be possible to create labeling functions simply by laying down the logic used by manual labeling work as rules.
No Adversarial Sources Allowed: Weak supervision is considerably more flexible than other machine learning strategies when dealing with noisy labels, i.e., weak label sources that are only somewhat better than random guessing. Despite this, weak supervision is not flexible enough to deal with weak sources that are always wrong. This might happen when one of the labeling functions is faulty and, consequently, worse than random guessing. When collecting weak label sources, it is more important to focus on spotting those “bad apples” than to spend time decreasing the overall noise in the Weak Label Sources Matrix.
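Several of the requirements above can be checked directly on the Weak Label Sources Matrix before any model is trained. The sketch below computes per-source coverage, the share of rows where sources overlap or conflict, and per-source accuracy against a small hand-checked sample. All function names, the toy data, and the idea of flagging sources that score below 1 / (number of classes) are assumptions made for illustration, not part of the described workflow.

```python
def coverage(matrix):
    """Fraction of rows each source labels (one value per column)."""
    n_rows = len(matrix)
    return [sum(row[j] is not None for row in matrix) / n_rows
            for j in range(len(matrix[0]))]

def overlap_rate(matrix):
    """Fraction of rows labeled by at least two sources."""
    return sum(sum(l is not None for l in row) >= 2 for row in matrix) / len(matrix)

def conflict_rate(matrix):
    """Fraction of rows where the sources that fired disagree."""
    return sum(len({l for l in row if l is not None}) > 1 for row in matrix) / len(matrix)

def source_accuracy(matrix, truth):
    """Accuracy of each source on the rows it labels; None if it never fires."""
    accuracies = []
    for j in range(len(matrix[0])):
        fired = [(row[j], t) for row, t in zip(matrix, truth) if row[j] is not None]
        accuracies.append(sum(p == t for p, t in fired) / len(fired) if fired else None)
    return accuracies

# Toy matrix (rows: documents, columns: sources) and hand-checked labels.
matrix = [
    ["P", "N", None],
    ["N", "P", "N"],
    [None, None, "N"],
    ["P", None, "P"],
]
truth = ["P", "N", "N", "P"]
classes = ["P", "N"]

print(coverage(matrix))       # [0.75, 0.5, 0.75]
print(overlap_rate(matrix))   # 0.75
print(conflict_rate(matrix))  # 0.5

accuracies = source_accuracy(matrix, truth)
# Flag sources that do worse than random guessing on the checked sample.
flagged = [j for j, a in enumerate(accuracies)
           if a is not None and a < 1 / len(classes)]
print(accuracies)  # [1.0, 0.0, 1.0]
print(flagged)     # [1]
```

In this toy data the second source is wrong every time it fires, so it is the kind of "bad apple" worth removing or fixing before training.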

Looking Ahead

In the upcoming final episode of the Guided Labeling Blog Series, we will look at how to combine active learning and weak supervision in a single, interactive Guided Analytics application.
