Big Idea
A decision tree is a supervised learning algorithm that classifies data by recursively splitting it based on feature values, forming a tree-like structure where each path represents a sequence of decisions that leads to a final category.
In the context of A4.3.2, decision trees are used to:
- Learn patterns from labelled training data
- Predict discrete categorical outcomes (e.g., “disease” vs “no disease”)
- Provide interpretable, rule-based classifications
This directly aligns with the IB requirement that classification techniques predict discrete outcomes from labelled data
1. Core Concept: What is a Decision Tree?
A decision tree consists of:
- Root node → first decision (most important feature)
- Internal nodes → intermediate decisions
- Branches → outcomes of decisions
- Leaf nodes → final classification (output)
Key Idea:
The model splits the dataset repeatedly to increase “purity” (i.e., groups contain mostly one class).
A decision tree consists of:
Root node
The root node is the topmost node in the tree and represents the first test or decision applied to the entire dataset. It is usually based on the feature that best separates the classes at the start of training. In other words, it is the first question the model asks about the data.
Internal nodes
Internal nodes are the non-terminal decision points inside the tree. Each internal node applies a test to one feature and divides the data into smaller subsets. These nodes are called “internal” because they are between the root and the leaves. They represent the intermediate stages of classification.
Branches
Branches are the connections between nodes that represent the possible outcomes of a decision or test. For example, if a node asks “Fever?” the branches might be “Yes” and “No”. A branch therefore shows which path a data point follows depending on its feature value.
Leaf nodes
Leaf nodes are the terminal nodes at the ends of the tree. They do not split further. Instead, each leaf gives the final predicted class or category, such as “spam”, “not spam”, “disease present”, or “disease absent”. A leaf node is therefore the final classification output of the model.
Purity
Purity refers to how homogeneous the data in a node is with respect to class labels. A node is more pure when most or all data points in that node belong to the same class. For example, a node containing 19 “Pass” cases and 1 “Fail” case is more pure than a node containing 10 “Pass” and 10 “Fail” cases.
Key idea: repeated splitting to increase purity
A decision tree works by recursively splitting the dataset into smaller subsets so that each subset becomes more class-homogeneous than before. The aim is to reduce class mixing at each step. As the tree grows, the algorithm tries to create nodes that contain data points from fewer classes, ideally only one class. This is why we say the tree increases purity as it moves from the root toward the leaves.
A precise way to say this is:
A decision tree classifies data by applying a sequence of feature-based tests that partition the training data into progressively more pure subsets, until terminal nodes can assign a class label with sufficient confidence.
2. How Decision Trees Perform Classification
To explain:
A decision tree algorithm:
- Starts with all training data at the root
- Selects the best feature to split the data
- Divides data into subsets based on that feature
- Repeats recursively until:
- Data is pure (all same class), or
- A stopping condition is met
Why this works:
- Each split reduces uncertainty
- The model constructs a hierarchical decision boundary
- New data follows the same path → classification
3. Simple Example
Problem: Classify whether a student passes
Features:
- Studies? (Yes/No)
- Attendance? (High/Low)
Training Data:
Studies | Attendance | Result
Yes | High | Pass
Yes | Low | Pass
No | High | Pass
No | Low | Fail
Tree:
Studies?
/ \
Yes No
Pass Attendance?
/ \
High Low
Pass Fail
Prediction:
New student:
- Studies = No
- Attendance = Low
→ Path → Fail
4. Medium Example
Email Spam Classification
Features:
- Contains “free”? (Yes/No)
- Sender known? (Yes/No)
- Number of links (>3)?
Tree (simplified):
Contains "free"?
/ \
Yes No
Spam Sender known?
/ \
Yes No
Not Spam Links > 3?
/ \
Yes No
Spam Not Spam
Why this works:
- The algorithm identifies features that best separate spam vs non-spam
- Each decision reduces classification uncertainty
5. Advanced Example
Medical Diagnosis (IB-recommended context)
Goal: Predict disease category based on symptoms
Features:
- Fever (Yes/No)
- Cough (Yes/No)
- Blood test result (High/Normal)
- Age group
Example Decision Path:
Fever → Yes
→ Blood Test → High
→ Diagnosis → Infection A
Technical Depth:
The tree uses metrics such as:
- Entropy
- Information Gain
- Gini Index
to choose the optimal split.
Why:
- These metrics quantify impurity
- The algorithm selects splits that maximize information gain
6. Mathematical Intuition (HL Insight)
Entropy (uncertainty measure):
[
H(S) = - \sum p_i \log_2 p_i
]
- High entropy → mixed classes
- Low entropy → pure node
Information Gain:
[
IG = H(parent) - \sum weighted\ H(children)
]
→ Choose split with maximum IG
7. Strengths and Limitations
Strengths
- Highly interpretable (human-readable rules)
- Handles categorical and numerical data
- Requires minimal preprocessing
Limitations
- Overfitting (especially deep trees)
- Sensitive to small data changes
- Greedy splitting (not globally optimal)
8. Comparison to K-NN (Required Context)
| Feature | Decision Tree | K-NN |
|---|---|---|
| Model type | Explicit model (tree) | Lazy learner |
| Training | Builds structure | Stores data |
| Prediction | Fast | Slow |
| Interpretability | High | Low |
| Boundary | Axis-aligned splits | Distance-based |