Click to learn more about author Alejandro Correa Bahnsen.
There are a variety of Machine Learning algorithms, and each has its own strengths and weaknesses. In this second article in a series on Machine Learning algorithms, I introduce Random Forests, a supervised algorithm used for classification and regression. If you missed my Introduction to Machine Learning and Decision Trees, I encourage you to read that article first, as it provides a foundation that I’m building on.
Before we dig into Random Forests, you must first understand the concept of an ensemble-learning model. An ensemble-learning model aggregates multiple Machine Learning models to improve performance. Each model, when used on its own, may be weak. However, when the models are combined in an ensemble, their aggregated output is stronger and therefore generally more accurate.
For example, decision trees are considered weak when used alone. But when a large number of decision trees are used in a Random Forest, the outputs are aggregated and the results represent a strong ensemble.
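To make the weak-versus-strong idea concrete, here is a minimal simulation sketch. It assumes, purely for illustration, that each weak model is right 60% of the time and that the models make their mistakes independently of one another. Real decision trees are only partly independent, so the gain in practice is smaller. The names `weak_vote` and `majority_vote` are hypothetical helpers, not part of any library.

```python
import random

def weak_vote(truth, accuracy, rng):
    """Return the true binary label with probability `accuracy`, else the wrong one."""
    return truth if rng.random() < accuracy else 1 - truth

def majority_vote(votes):
    """Return the label (0 or 1) chosen by more than half of the voters."""
    return 1 if sum(votes) * 2 > len(votes) else 0

rng = random.Random(42)
n_models, accuracy, n_trials = 101, 0.6, 2000  # illustrative assumptions

single_correct = 0
ensemble_correct = 0
for _ in range(n_trials):
    truth = rng.randint(0, 1)
    votes = [weak_vote(truth, accuracy, rng) for _ in range(n_models)]
    single_correct += votes[0] == truth              # one weak model alone
    ensemble_correct += majority_vote(votes) == truth  # all models voting together

single_acc = single_correct / n_trials
ensemble_acc = ensemble_correct / n_trials
print(f"one weak model: {single_acc:.2f}, ensemble of {n_models}: {ensemble_acc:.2f}")
```

Under these assumptions the ensemble is right far more often than any single voter, which is the effect a Random Forest tries to capture.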
Understanding Bias and Variance
An algorithm’s strength or weakness is a reflection, in part, of its bias and variance—two sources of error exhibited by every Machine Learning model. Bias and variance are measured by training a Machine Learning model on different parts of the same data and comparing the outputs generated by the model to the actual outputs of the data.
Bias is the measure of how the predicted values of a model differ from the actual values. Bias occurs when an algorithm makes too many simplifying assumptions. This causes the algorithm to predict values that differ from the actual values.
Variance is the measure of how spread out the predictions are. Variance occurs when an algorithm is sensitive to small changes in the training dataset. The higher the variance, the more strongly the algorithm is influenced by the specifics of the data.
Ideally, both bias and variance are low. This indicates that the model predicts values very close to the correct values no matter which sample of the dataset it was trained on. When this occurs, you can trust that the model has learned the underlying patterns in the data rather than the quirks of one sample.
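The measurement described above can be sketched in a few lines. This example assumes a known target function (`true_f`, here simply x squared) so that the error can be computed exactly, and it draws a fresh noisy sample for each training round instead of slicing one fixed dataset. It compares two hypothetical models: `fit_mean`, which always predicts the average and makes a very strong simplifying assumption, and `fit_nearest`, which copies the closest training point and is very sensitive to the sample.

```python
import random
import statistics

def true_f(x):
    """Assumed ground-truth relationship, chosen only for illustration."""
    return x * x

def sample_dataset(rng, n=30, noise=0.3):
    """Draw n noisy (x, y) observations of true_f."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, true_f(x) + rng.gauss(0.0, noise)))
    return data

def fit_mean(data):
    """Oversimplified model: predict the mean of y everywhere."""
    mean_y = statistics.fmean([y for _, y in data])
    return lambda x: mean_y

def fit_nearest(data):
    """Flexible model: predict the y of the nearest training x."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]

def bias_and_variance(fit, x0, rng, rounds=300):
    """Train on many samples and summarize the predictions at x0."""
    preds = [fit(sample_dataset(rng))(x0) for _ in range(rounds)]
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2  # squared bias
    variance = statistics.pvariance(preds)                 # spread of predictions
    return bias_sq, variance

rng = random.Random(7)
mean_bias_sq, mean_var = bias_and_variance(fit_mean, 0.9, rng)
nn_bias_sq, nn_var = bias_and_variance(fit_nearest, 0.9, rng)
print(f"mean model:    bias^2={mean_bias_sq:.3f}, variance={mean_var:.3f}")
print(f"nearest model: bias^2={nn_bias_sq:.3f}, variance={nn_var:.3f}")
```

In this setup the mean model should show the larger bias and the nearest-point model the larger variance, mirroring the two sources of error described above.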
Bias and Variance in Random Forests
To understand how bias and variance play out in random forests, we need to take a step back and consider decision trees. Decision trees can model complex relationships, but they sometimes overfit the noise in the data. In other words, they don’t generalize well. While an individual tree is usually accurate on the data it was trained on, decision trees often vary widely when trained on different samples of the same dataset. As a result, decision trees are known for showing high variance and low bias.
The objective behind random forests is to take a set of high-variance, low-bias decision trees and transform them into a model that has both low variance and low bias. By aggregating the outputs of individual decision trees, random forests reduce the variance that can cause errors in a single tree. For classification, the forest takes a majority vote across the trees; for regression, it averages their outputs. This smooths out the variance so that the model is less likely to produce results far from the real values.
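The variance-smoothing effect can be illustrated with a small simulation. This sketch assumes each "tree" is an unbiased but noisy predictor whose errors are independent of the others; `noisy_prediction` is a hypothetical stand-in, not a real decision tree. Because real trees trained on overlapping data are correlated, an actual forest reduces variance by less than this idealized case.

```python
import random
import statistics

rng = random.Random(0)
TRUE_VALUE = 10.0   # the value every stand-in tree is trying to predict
NOISE_SD = 1.0      # spread of a single tree's prediction (assumed)
N_TREES = 50
N_TRIALS = 2000

def noisy_prediction():
    """One stand-in 'tree': right on average (low bias) but noisy (high variance)."""
    return rng.gauss(TRUE_VALUE, NOISE_SD)

# Predictions from a single tree, repeated over many trials.
single = [noisy_prediction() for _ in range(N_TRIALS)]

# Predictions from a forest that averages N_TREES trees per trial.
averaged = [
    statistics.fmean([noisy_prediction() for _ in range(N_TREES)])
    for _ in range(N_TRIALS)
]

single_var = statistics.pvariance(single)
averaged_var = statistics.pvariance(averaged)
print(f"variance of one tree: {single_var:.3f}")
print(f"variance of the average of {N_TREES} trees: {averaged_var:.3f}")
```

Both sets of predictions center on the same value, so the bias stays low, while the averaged predictions are spread over a much narrower range.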
A random forest trains each decision tree on a different random sample of the training data, typically drawn with replacement (a bootstrap sample). At each node, the tree considers only a randomly selected subset of the attributes and splits on the best one among them. This element of randomness makes the individual trees less correlated with one another. As a result, the errors of individual trees tend to point in different directions and are largely cancelled out by the model’s majority voting strategy.
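The two sources of randomness can be sketched with a deliberately tiny forest. This is an illustrative toy, not a production implementation: each "tree" is a one-split stump, so choosing random attributes per tree is the same as choosing them per node, and the functions `fit_stump`, `fit_forest`, and `predict` are hypothetical names. The synthetic labels (1 when at least two of three features exceed 0.5) are also an assumption made for the demonstration.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Fit a one-split tree: try every threshold on the allowed features
    and keep the split with the fewest training errors."""
    majority = Counter(labels).most_common(1)[0][0]
    best_errors = sum(y != majority for y in labels)
    best_rule = None  # (feature, threshold, left_label, right_label)
    for feat in features:
        for t in sorted({row[feat] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feat] <= t]
            right = [y for row, y in zip(rows, labels) if row[feat] > t]
            if not right:
                continue  # a threshold at the maximum value splits nothing
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if errors < best_errors:
                best_errors, best_rule = errors, (feat, t, l_lab, r_lab)
    if best_rule is None:
        return lambda row: majority  # no split beat the majority guess
    f, t, l_lab, r_lab = best_rule
    return lambda row: l_lab if row[f] <= t else r_lab

def fit_forest(rows, labels, n_trees=75, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random attribute subset."""
    rng = random.Random(seed)
    n = len(rows)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # sample rows with replacement
        feats = rng.sample(range(len(rows[0])), n_features)   # random attribute subset
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return trees

def predict(trees, row):
    """Majority vote across all trees."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]

# Synthetic data: label is 1 when at least two of the three features exceed 0.5.
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(3)] for _ in range(150)]
labels = [int(sum(v > 0.5 for v in row) >= 2) for row in rows]

forest = fit_forest(rows, labels)
forest_acc = sum(predict(forest, r) == y for r, y in zip(rows, labels)) / len(rows)
tree_acc = sum(forest[0](r) == y for r, y in zip(rows, labels)) / len(rows)
print(f"single stump accuracy: {tree_acc:.2f}, forest accuracy: {forest_acc:.2f}")
```

Each stump sees only one attribute and so can capture only part of the pattern; the vote across stumps trained on different attributes and different samples recovers much more of it.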
Random Forests in the Real World
With the holidays around the corner and the inevitability of flight delays in your near future, let’s imagine that you’re looking for book recommendations. You find a website where real people make book recommendations based on your preferences.
To begin, you complete a questionnaire about your reading preferences. This provides a baseline for the type of books that you might enjoy. Each individual user works as a decision tree, using these criteria to make their recommendations to you. However, it’s unlikely that every user will accurately generalize your reading preferences. For example, one user may incorrectly conclude that you don’t like historical fiction, and therefore eliminate it from their recommendations. These errors occur because users have only a limited amount of information about your preferences, and they’re guided by their own biases. To fix this, the site combines the suggestions from many users (each acting as a decision tree) and uses majority voting on their suggestions (thereby creating a random forest).
There still remains one problem: If each user is given the same data from the same questionnaire, their suggestions will lack variety and may be highly correlated, repeating the same mistakes. To fix this and generate a wider range of recommendations, the site provides each user with a random subset of your answers. As a result, each user has fewer criteria with which to make recommendations, but different users rely on different criteria. Majority voting eliminates the extreme outliers, leaving you with an accurate and varied list of recommended books to read while you’re sitting at the airport.
Random forests have a number of advantages and disadvantages that should be considered when deciding whether they are appropriate for a given use case. Advantages include the following:
- There is no need for feature normalization
- Individual decision trees can be trained in parallel
- Random forests are widely used
- They reduce overfitting
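The first advantage follows from how trees split data: a threshold split depends only on the ordering of a feature’s values, not on their scale. The sketch below illustrates this with a hypothetical `best_split` helper and made-up income figures; it finds the lowest-error threshold on a single feature and checks that min-max scaling sends the same rows to the same side.

```python
def best_split(values, labels):
    """Return the set of row indices sent left by the lowest-error threshold.
    Labels are 0 or 1; each side predicts its own majority label."""
    best_errors, best_left = None, None
    for t in sorted(set(values))[:-1]:  # skip the max so the right side is never empty
        left = [i for i, v in enumerate(values) if v <= t]
        right = [i for i, v in enumerate(values) if v > t]
        errors = 0
        for side in (left, right):
            ones = sum(labels[i] for i in side)
            errors += min(ones, len(side) - ones)  # mistakes when predicting the majority
        if best_errors is None or errors < best_errors:
            best_errors, best_left = errors, frozenset(left)
    return best_left

# Made-up example data: raw incomes and a binary label.
incomes = [21000.0, 35000.0, 48000.0, 52000.0, 75000.0, 90000.0]
labels = [0, 0, 0, 1, 1, 1]

# Min-max scaling to the range [0, 1] preserves the ordering of the values.
lo, hi = min(incomes), max(incomes)
scaled = [(v - lo) / (hi - lo) for v in incomes]

raw_partition = best_split(incomes, labels)
scaled_partition = best_split(scaled, labels)
print("same partition after scaling:", raw_partition == scaled_partition)
```

The threshold value itself changes with the scale, but the grouping of rows does not, which is why normalization is usually unnecessary for tree-based models.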
The disadvantages of random forests include the following:
- They’re not easily interpretable
- They’re not a state-of-the-art algorithm