Big Idea

A decision tree is a supervised learning algorithm that classifies data by recursively splitting it based on feature values, forming a tree-like structure where each path represents a sequence of decisions that leads to a final category.

In the context of A4.3.2, decision trees are used to:

Learn patterns from labelled training data
Predict discrete categorical outcomes (e.g., “disease” vs “no disease”)
Provide interpretable, rule-based classifications

This directly aligns with the IB requirement that classification techniques predict discrete outcomes from labelled data

1. Core Concept: What is a Decision Tree?

A decision tree consists of:

Root node → first decision (most important feature)
Internal nodes → intermediate decisions
Branches → outcomes of decisions
Leaf nodes → final classification (output)

Key Idea:

The model splits the dataset repeatedly to increase “purity” (i.e., groups contain mostly one class).

A decision tree consists of:

Root node
The root node is the topmost node in the tree and represents the first test or decision applied to the entire dataset. It is usually based on the feature that best separates the classes at the start of training. In other words, it is the first question the model asks about the data.

Internal nodes
Internal nodes are the non-terminal decision points inside the tree. Each internal node applies a test to one feature and divides the data into smaller subsets. These nodes are called “internal” because they are between the root and the leaves. They represent the intermediate stages of classification.

Branches
Branches are the connections between nodes that represent the possible outcomes of a decision or test. For example, if a node asks “Fever?” the branches might be “Yes” and “No”. A branch therefore shows which path a data point follows depending on its feature value.

Leaf nodes
Leaf nodes are the terminal nodes at the ends of the tree. They do not split further. Instead, each leaf gives the final predicted class or category, such as “spam”, “not spam”, “disease present”, or “disease absent”. A leaf node is therefore the final classification output of the model.

Purity
Purity refers to how homogeneous the data in a node is with respect to class labels. A node is more pure when most or all data points in that node belong to the same class. For example, a node containing 19 “Pass” cases and 1 “Fail” case is more pure than a node containing 10 “Pass” and 10 “Fail” cases.

Key idea: repeated splitting to increase purity
A decision tree works by recursively splitting the dataset into smaller subsets so that each subset becomes more class-homogeneous than before. The aim is to reduce class mixing at each step. As the tree grows, the algorithm tries to create nodes that contain data points from fewer classes, ideally only one class. This is why we say the tree increases purity as it moves from the root toward the leaves.

A precise way to say this is:

A decision tree classifies data by applying a sequence of feature-based tests that partition the training data into progressively more pure subsets, until terminal nodes can assign a class label with sufficient confidence.

2. How Decision Trees Perform Classification

To explain:

A decision tree algorithm:

Starts with all training data at the root
Selects the best feature to split the data
Divides data into subsets based on that feature
Repeats recursively until:
- Data is pure (all same class), or
- A stopping condition is met

Why this works:

Each split reduces uncertainty
The model constructs a hierarchical decision boundary
New data follows the same path → classification

3. Simple Example

Problem: Classify whether a student passes

Features:

Studies? (Yes/No)
Attendance? (High/Low)

Training Data:

Studies | Attendance | Result
Yes     | High       | Pass
Yes     | Low        | Pass
No      | High       | Pass
No      | Low        | Fail

Tree:

          Studies?
         /       \
      Yes         No
     Pass     Attendance?
              /         \
           High         Low
           Pass         Fail

Prediction:

New student:

Studies = No
Attendance = Low

→ Path → Fail

4. Medium Example

Email Spam Classification

Features:

Contains “free”? (Yes/No)
Sender known? (Yes/No)
Number of links (>3)?

Tree (simplified):

         Contains "free"?
         /              \
      Yes              No
     Spam         Sender known?
                  /           \
               Yes             No
              Not Spam        Links > 3?
                              /       \
                           Yes        No
                          Spam     Not Spam

Why this works:

The algorithm identifies features that best separate spam vs non-spam
Each decision reduces classification uncertainty

5. Advanced Example

Medical Diagnosis (IB-recommended context)

Goal: Predict disease category based on symptoms

Features:

Fever (Yes/No)
Cough (Yes/No)
Blood test result (High/Normal)
Age group

Example Decision Path:

Fever → Yes
→ Blood Test → High
→ Diagnosis → Infection A

Technical Depth:

The tree uses metrics such as:

Entropy
Information Gain
Gini Index

to choose the optimal split.

Why:

These metrics quantify impurity
The algorithm selects splits that maximize information gain

6. Mathematical Intuition (HL Insight)

Entropy (uncertainty measure):

[
H(S) = - \sum p_i \log_2 p_i
]

High entropy → mixed classes
Low entropy → pure node

Information Gain:

[
IG = H(parent) - \sum weighted\ H(children)
]

→ Choose split with maximum IG

7. Strengths and Limitations

Strengths

Highly interpretable (human-readable rules)
Handles categorical and numerical data
Requires minimal preprocessing

Limitations

Overfitting (especially deep trees)
Sensitive to small data changes
Greedy splitting (not globally optimal)

8. Comparison to K-NN (Required Context)

Feature	Decision Tree	K-NN
Model type	Explicit model (tree)	Lazy learner
Training	Builds structure	Stores data
Prediction	Fast	Slow
Interpretability	High	Low
Boundary	Axis-aligned splits	Distance-based