Machine Learning Algorithms: Introduction to Random Forests

By Alejandro Correa Bahnsen  /  December 18, 2017

There are a variety of Machine Learning algorithms, and each has its own strengths and weaknesses. In this second article in a series on Machine Learning algorithms, I introduce Random Forests, a supervised algorithm used for classification and regression. If you missed my Introduction to Machine Learning and Decision Trees, I encourage you to read that article first, as it provides a foundation that I’m building on.

Before we dig into Random Forests, you must first understand the concept of an ensemble-learning model. An ensemble-learning model aggregates multiple Machine Learning models to improve performance. Each model, when used on its own, is relatively weak. When the models are combined in an ensemble, however, they are stronger together and therefore tend to generate more accurate results.

For example, decision trees are considered weak when used alone. But when a large number of decision trees are used in a Random Forest, the outputs are aggregated and the results represent a strong ensemble.
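To see why aggregation can help, here is a minimal sketch of the arithmetic behind majority voting. It rests on simplifying assumptions that are not part of the article: a two-class problem, an odd number of models, every model correct with the same probability, and errors that are fully independent of one another. The function name and the 0.6 accuracy figure are hypothetical and chosen only for illustration.

```python
from math import comb


def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent models is correct.

    Assumes a binary problem in which each model is right with the same
    probability p_correct, independently of all the others.
    """
    needed = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Ensemble accuracy for a growing number of equally weak models.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the ensemble's accuracy rises as models are added. Real models trained on the same data are never fully independent, so the gain in practice is smaller than this idealized calculation suggests.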

Understanding Bias and Variance

An algorithm’s strength or weakness is a reflection, in part, of its bias and variance—two sources of error exhibited by every Machine Learning model. Bias and variance are measured by training a Machine Learning model on different parts of the same data and comparing the outputs generated by the model to the actual outputs of the data.

Bias is the measure of how far, on average, the predicted values of a model are from the actual values. Bias occurs when an algorithm makes too many simplifying assumptions, which causes it to miss real patterns in the data and predict values that are systematically off.

Variance is the measure of how spread out the predictions are when the same algorithm is trained on different samples of the data. Variance occurs when an algorithm is sensitive to small changes in the training dataset. The higher the variance, the more strongly the algorithm is influenced by the specifics of the data it happened to see.

Ideally, both bias and variance are low. This indicates that the model will predict values that are very close to the correct values, regardless of which sample of the dataset it was trained on. When this occurs, you can be more confident that the model has learned the underlying patterns in the dataset rather than its noise.
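The two sources of error described above can be estimated with a small simulation. The sketch below is illustrative only and makes several assumptions: the data comes from a made-up function (`true_fn`) with Gaussian noise, fresh training sets are drawn from that generator rather than from parts of one fixed dataset, and the two toy models (`constant_model` and `nearest_neighbor_model`) are hypothetical stand-ins for an overly simple and an overly flexible algorithm.

```python
import random
from statistics import mean, pvariance


def true_fn(x):
    """Assumed ground-truth relationship (hypothetical, for illustration)."""
    return x * x


def sample_training_set(rng, n=30, noise=0.3):
    """Draw n noisy (x, y) pairs from the assumed data-generating process."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_fn(x) + rng.gauss(0, noise)) for x in xs]


def constant_model(train, x0):
    """Very simple model: ignores x0 and predicts the mean training target."""
    return mean(y for _, y in train)


def nearest_neighbor_model(train, x0):
    """Very flexible model: copies the target of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x0))[1]


def bias_and_variance(model, x0, n_repeats=500, seed=0):
    """Retrain on many training sets and summarize the predictions at x0."""
    rng = random.Random(seed)
    preds = [model(sample_training_set(rng), x0) for _ in range(n_repeats)]
    bias = mean(preds) - true_fn(x0)  # systematic offset from the truth
    variance = pvariance(preds)       # spread across training sets
    return bias, variance


for name, model in [("constant", constant_model),
                    ("nearest neighbor", nearest_neighbor_model)]:
    bias, variance = bias_and_variance(model, x0=0.9)
    print(f"{name}: bias={bias:.3f}, variance={variance:.3f}")
```

In this setup the constant model should show the larger bias and the nearest-neighbor model the larger variance, which mirrors the trade-off the definitions describe.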

Bias and Variance in Random Forests

To understand how bias and variance play out in random forests, we need to take a step back and consider decision trees. Decision trees model complex relationships, but sometimes they overfit the noise in the data. In other words, they aren’t general enough. While the models they produce are usually accurate on the data they were trained on, decision trees often show a large degree of variability between different data samples from the same dataset. As a result, decision trees are known for showing high variance and low bias.

The objective behind random forests is to take a set of high-variance, low-bias decision trees and transform them into a model that has both low variance and low bias. By aggregating the outputs of individual decision trees, random forests reduce the variance that can cause errors in decision trees. For classification, majority voting selects the class predicted by most of the individual trees; for regression, the trees’ outputs are averaged. This smooths out the variance so that the model is less likely to produce results far from the real values.
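The aggregation step itself is simple. As a minimal sketch, assuming each tree has already produced a prediction for one input, the following shows majority voting for classification and averaging for regression. The function names and the sample predictions are hypothetical.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_predictions):
    """Return the label predicted by the largest number of trees.

    Ties are broken in favor of the label that appears first in the list.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of five classification trees for one input.
votes = ["delayed", "on time", "delayed", "delayed", "on time"]
print(aggregate_classification(votes))  # delayed

# Hypothetical outputs of four regression trees for one input.
estimates = [42.0, 38.5, 45.0, 40.5]
print(aggregate_regression(estimates))  # 41.5
```

A single tree that is badly wrong on a given input is outvoted, or diluted in the average, as long as most of the other trees do not make the same mistake.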

A random forest trains each decision tree on a different random sample of the training data, typically a bootstrap sample drawn with replacement. At each node of each tree, the split is chosen from a randomly selected subset of the attributes rather than from all of them. This element of randomness makes the individual trees less correlated with one another. As a result, the errors of individual trees tend to differ from tree to tree and are largely cancelled out when the trees’ outputs are aggregated.
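The two sources of randomness can be sketched in a few lines. This is not a full random forest: it only shows how the training rows for each tree and the candidate attributes for one split might be drawn. The helper names, the attribute list, and the choice of the square root of the number of attributes as the subset size (a common heuristic for classification) are assumptions made for illustration.

```python
import math
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]


def candidate_features(feature_names, rng, k=None):
    """Pick the random subset of attributes considered at one split.

    If k is not given, use the square root of the number of attributes
    (an assumed default, rounded down, with a minimum of 1).
    """
    if k is None:
        k = max(1, int(math.sqrt(len(feature_names))))
    return rng.sample(feature_names, k)


rng = random.Random(42)
rows = list(range(10))  # stand-in for ten training examples
features = ["genre", "length", "author", "year", "rating",
            "language", "series", "format", "publisher"]  # hypothetical attributes

# Each tree gets its own sample of rows; each split gets its own attribute subset.
for tree_id in range(3):
    sample = bootstrap_sample(rows, rng)
    split_candidates = candidate_features(features, rng)
    print(tree_id, sorted(sample), split_candidates)
```

Because each tree sees different rows and weighs different attributes at each split, no single attribute or example dominates every tree, which is what keeps the trees from all making the same mistakes.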

Random Forests in the Real World

With the holidays around the corner and the inevitability of flight delays in your near future, let’s imagine that you’re looking for book recommendations. You find a website where real people make book recommendations based on your preferences.

To begin, you complete a questionnaire about your reading preferences. This provides a baseline for the type of books that you might enjoy. Each individual user works as a decision tree, using these criteria to make their recommendations to you. However, it’s unlikely that every user will accurately generalize your reading preferences. For example, one user may incorrectly conclude that you don’t like historical fiction, and therefore eliminate it from their recommendations. These errors occur because users have only a limited amount of information about your preferences, and they’re guided by their own biases. To fix this, the site combines the suggestions from many users (each acting as a decision tree) and uses majority voting on their suggestions (thereby creating a random forest).

One problem remains: if each user is given the same data from the same questionnaire, their suggestions will lack variety and may be highly correlated, so they will tend to share the same mistakes. To fix this and generate a wider range of recommendations, the site provides each user with a random subset of your answers. As a result, each user has fewer criteria with which to make their recommendations, but different users rely on different criteria. Majority voting eliminates the extreme outliers, leaving you with an accurate and varied list of recommended books to read while you’re sitting at the airport.

Summary

Random forests have a number of advantages and disadvantages that should be considered when deciding whether they are appropriate for a given use case. Advantages include the following:

  • There is no need for feature normalization
  • Individual decision trees can be trained in parallel
  • Random forests are widely used
  • They reduce overfitting compared with a single decision tree

The disadvantages of random forests include the following:

  • They’re not easily interpretable
  • They’re not a state-of-the-art algorithm

About the author

Dr. Alejandro Correa Bahnsen is the Chief Data Scientist at Easy Solutions. With a passion for Machine Learning, he considers himself a technology evangelist of Data Science. He has more than a decade of experience developing and applying predictive models to real-world issues such as cyber fraud, human resources analytics, credit scoring, churn modeling, and direct marketing. In addition to advising the Easy Solutions executive team and customers on unique fraud challenges, Alejandro manages the Data Science team, tests Big Data processing engines, and researches the application of Deep Learning to electronic fraud prevention. He also creates and develops Machine Learning algorithms related to phishing detection, user identification, and malware prevention. He is constantly improving Easy Solutions’ products with Data Science and Artificial Intelligence capabilities. Alejandro holds a PhD in Machine Learning and Pattern Recognition from Luxembourg University. He has published over 15 academic and industrial papers in noteworthy peer-reviewed publications. He has also taught the following subjects at the university level: econometrics, financial risk management, Machine Learning, and Natural Language Processing.
