Random Forests

This article is not assessed by the IB but may be helpful to deepen your understanding. Plus, I think it's cool.

Big Idea

Imagine trying to guess something by asking lots of different people – each with slightly different knowledge.  A random forest is like that. It’s a clever way to build really accurate predictions using many decision trees working together.

Decision Trees: Think of them as branching diagrams that help you make decisions based on information (like whether a student passes an exam).
Random Forests: Instead of just one tree, we build hundreds of these trees – each trained on slightly different versions of the data.  They then “vote” to give the final answer.

Why Do Random Forests Work So Well?

The key problem with a single decision tree is that it can get too specific and overfit the data, meaning it performs really well on the training data but poorly on new, unseen data. Random forests solve this by:

1.  Reducing Overfitting: Each tree makes its own errors, but because the trees differ, those errors tend to cancel out when the predictions are combined.
2.  Diversity is Key: Random sampling of rows and random selection of features make the trees differ from one another, so they don’t all make the same mistakes.

How Do They Work – The Building Blocks

Bagging (Bootstrap Aggregating): Each tree is built on a random sample of your data, with some data points repeated (like drawing with replacement). This creates different perspectives.
Random Feature Selection: At each split when building a tree, only a random subset of the available features is considered. This stops every tree from relying on the same few strong features.
Aggregation: Once all the trees have made their predictions, they are combined:

  1. Classification (like predicting if a student passes): The most common answer wins (majority vote).
  2. Regression (like predicting house prices): We average out all the individual predictions.
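
The two aggregation rules above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function names majority_vote and average_prediction, and the example tree outputs, are made up for this sketch.

```python
from collections import Counter
from statistics import mean

def majority_vote(predictions):
    """Return the most common label among the trees' predictions."""
    # most_common(1) returns a list holding the single (label, count) pair
    # with the highest count.
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(predictions)

# Hypothetical outputs from five trees for one student (classification).
tree_votes = ["Pass", "Fail", "Pass", "Pass", "Fail"]
print(majority_vote(tree_votes))  # Pass (3 votes against 2)

# Hypothetical outputs from four trees for one house (regression).
tree_prices = [310_000, 295_000, 305_000, 290_000]
print(average_prediction(tree_prices))
```

Using an odd number of trees for a two-class problem avoids tied votes.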

The core idea behind bagging – Bootstrap Aggregating – is to create multiple, slightly different versions of your original dataset. This diversity is crucial for reducing overfitting and building a robust random forest. Here’s how it works in more detail:

  1. Bootstrap Sampling: This is the key technique. Instead of creating entirely new datasets from scratch, we “sample with replacement” from our original data. Let's break that down:

    • Sampling: We randomly select rows (observations) from your dataset.

    • With Replacement: This means after we pick a row, we put it back into the pool of possible selections and repeat the process. This is what makes it different from simple random sampling – some rows will be selected multiple times, while others won’t be selected at all.

  2. The Degree of Randomness: In the standard algorithm, each bootstrap sample has the same number of rows as the original dataset. Because rows are drawn with replacement, a sample of a large dataset typically contains only about 63% of the distinct original rows; the remaining slots are filled by repeats. Some implementations also let you choose a smaller sample size, which makes the trees differ more from one another.

  3. Why does this matter? Because some data points are repeated, and others are omitted. This creates datasets that aren't identical to the original. Each of these slightly different datasets is then used to train a separate decision tree.
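
Sampling with replacement is easy to try out. The sketch below assumes a stand-in dataset of 1000 row IDs and a helper named bootstrap_sample (both invented for this example). It draws one bootstrap sample of the same size as the data and measures how many distinct rows ended up in it.

```python
import random

def bootstrap_sample(rows, rng, size=None):
    """Draw `size` rows with replacement (defaults to len(rows))."""
    if size is None:
        size = len(rows)
    # rng.choices samples with replacement, so repeats are possible.
    return rng.choices(rows, k=size)

rng = random.Random(0)    # fixed seed so the run is repeatable
rows = list(range(1000))  # stand-in row IDs for a 1000-row dataset

sample = bootstrap_sample(rows, rng)
unique_fraction = len(set(sample)) / len(rows)

print(len(sample))                # same size as the original data
print(round(unique_fraction, 2))  # fraction of distinct rows that were drawn
```

For a dataset this large the fraction should land near 0.63, because the chance that a given row is never picked in n draws is (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows.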

Small Example:

Let’s say you have a dataset with 5 students:

Student ID   Study Hours   Attendance   Past Grade   Pass/Fail
1            20            90           85           Pass
2            15            75           70           Fail
3            25            95           90           Pass
4            18            80           80           Fail
5            22            88           92           Pass

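Here’s how a bootstrap sample of size 3 might be created (in practice the sample is usually the same size as the dataset, but 3 keeps the example short):

  • Step 1: Randomly select one student, say Student ID 1, and copy that row into the sample.

  • Step 2: Put that student back into the pool, so the same student could be picked again.

  • Step 3: Repeat steps 1 & 2 until the sample has 3 rows. Suppose the next two picks are Student IDs 3 and 5. (Notice that Student IDs 2 and 4 were never selected, and that any student could have been picked twice.)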

A possible bootstrap sample would be:

Student ID   Study Hours   Attendance   Past Grade   Pass/Fail
1            20            90           85           Pass
3            25            95           90           Pass
5            22            88           92           Pass

Now, you would train a decision tree on this sample of 3 students. You’d get a slightly different tree than if you had trained it on the entire original dataset (with 5 students).
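
To see how different trees end up with different samples, here is a small sketch using the five students from the table. The helper draw_bootstrap and the seed values are arbitrary choices for illustration; each seed stands in for the sample one tree would receive.

```python
import random

# The five students from the table:
# (student_id, study_hours, attendance, past_grade, result)
students = [
    (1, 20, 90, 85, "Pass"),
    (2, 15, 75, 70, "Fail"),
    (3, 25, 95, 90, "Pass"),
    (4, 18, 80, 80, "Fail"),
    (5, 22, 88, 92, "Pass"),
]

def draw_bootstrap(rows, size, seed):
    """Draw `size` rows with replacement using a seeded generator."""
    rng = random.Random(seed)
    sample = []
    for _ in range(size):
        # Pick one row; it stays in the pool, so it can be picked again.
        sample.append(rng.choice(rows))
    return sample

# Three trees would each get their own sample (seeds are arbitrary).
for seed in (1, 2, 3):
    sample = draw_bootstrap(students, size=3, seed=seed)
    print([row[0] for row in sample])  # student IDs in this sample
```

Running it prints three lists of student IDs. Some lists may contain the same ID twice, which is exactly what "with replacement" allows.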

Key Takeaway: Each bootstrap sample is likely to differ from the others, so each decision tree in the random forest is trained on slightly different data. This diversity is what makes the random forest so robust and accurate.

 

 

The Process – Step-by-Step

1.  Training Phase:
     • Create many bootstrap samples of your data.
     • Build a decision tree on each sample. At each split, randomly select a subset of features and split on whichever of those features divides the data best.

2.  Prediction Phase:
     • Feed new data into all the trees.
     • Each tree makes its prediction.
     • The random forest combines these predictions (voting or averaging) to give you the final answer.
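
The two phases can be sketched end to end in plain Python. This is a deliberately simplified illustration, not how a real library does it: each "tree" here is a one-split stump, and every function name (train_stump, train_forest, predict_forest, and so on) is invented for this sketch. The data is the five-student example, and the two students being predicted are hypothetical.

```python
import random
from collections import Counter

# Toy data: (study_hours, attendance, past_grade) -> label.
X = [(20, 90, 85), (15, 75, 70), (25, 95, 90), (18, 80, 80), (22, 88, 92)]
y = ["Pass", "Fail", "Pass", "Fail", "Pass"]

def majority(labels, default="Fail"):
    """Most common label, or `default` if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def train_stump(X, y, feature_ids):
    """Fit a one-split tree, searching only the given features."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            left_label = majority(left, default=majority(y))
            right_label = majority(right, default=majority(y))
            # Count training rows this split would get wrong.
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]  # (feature, threshold, left_label, right_label)

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def train_forest(X, y, n_trees, n_features, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Training phase: bootstrap sample (row indices drawn with replacement).
        idx = rng.choices(range(len(X)), k=len(X))
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # Only a random subset of features may be used for the split.
        feature_ids = rng.sample(range(len(X[0])), n_features)
        forest.append(train_stump(Xb, yb, feature_ids))
    return forest

def predict_forest(forest, row):
    # Prediction phase: every tree votes and the majority wins.
    return majority([predict_stump(stump, row) for stump in forest])

forest = train_forest(X, y, n_trees=25, n_features=2)
print(predict_forest(forest, (24, 93, 88)))  # hypothetical strong student
print(predict_forest(forest, (12, 60, 55)))  # hypothetical weak student
```

On this toy data the forest should print Pass for the first student and Fail for the second. A real random forest grows deeper trees and usually picks splits with a measure such as Gini impurity rather than a raw error count, but the bootstrap, feature subset, and voting steps follow the same pattern.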

Key Properties of Random Forests

Property          Description
Ensemble Method   Combines multiple models
High Accuracy     Often better than single models
Robust to Noise   Less sensitive to outliers
Non-Linear        Can capture complex relationships
Parallelizable    Trees can be trained independently


Advantages & Limitations

  1. Advantages:  Great accuracy, handles large datasets well, doesn’t need a ton of data preparation.
  2. Limitations: Less easy to understand than a single tree, takes longer to train (because of many trees), can still overfit if not tuned properly.

 

Important Settings (Hyperparameters)

These are settings you can adjust to fine-tune your random forest:

  1. n_estimators (number of trees): More trees generally improve accuracy up to a point, after which the gains level off while training time keeps growing.
  2. Max depth: Limits how many splits deep each tree can grow. Shallower trees are simpler, which helps prevent overfitting.
  3. Number of features per split: How many randomly chosen features each split may consider. Fewer features means more randomness, which keeps trees from becoming too similar.
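
As a rough illustration of the third setting, the sketch below picks the features that a single split would be allowed to consider. The function pick_split_features and the fourth feature, sleep_hours, are hypothetical additions for this example. The square-root default is a common rule of thumb for classification, not the only option.

```python
import math
import random

def pick_split_features(feature_names, rng, max_features=None):
    """Return the random subset of features one split may consider."""
    if max_features is None:
        # Common heuristic for classification: about sqrt(number of features).
        max_features = max(1, round(math.sqrt(len(feature_names))))
    # rng.sample draws without replacement, so no feature appears twice.
    return rng.sample(feature_names, max_features)

# "sleep_hours" is a made-up extra feature for this sketch.
features = ["study_hours", "attendance", "past_grade", "sleep_hours"]
rng = random.Random(42)

# Each call stands for one split; different splits see different features.
for split in range(3):
    print(pick_split_features(features, rng))
```

Setting max_features equal to the total number of features removes this source of randomness, leaving bootstrap sampling as the only thing that makes the trees differ.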

 

Example: Predicting IB CS Pass Rates

Imagine you want to predict if a student passes IB Computer Science based on factors like study hours, attendance, and past grades. A random forest would train many decision trees using this data, and then combine their predictions (pass/fail) to make the final prediction.


When Should You Use Random Forests?

  1. You need high accuracy in your predictions.
  2. Your data is complex and has non-linear relationships.
  3. Interpretability isn’t a top priority.