A4.3.1 Explain how linear regression is used to predict continuous outcomes. (HL only)

• The relationship between the independent (predictor) and dependent (response) variables

• The significance of the slope and intercept in the regression equation

• How well the model fits the data, often assessed using measures like r².

The Big Idea

Linear regression is one of the most fundamental algorithms in machine learning and statistics, used to model the relationship between variables and predict continuous outcomes—such as temperature, income, or test scores. The central idea is to fit a straight line to a set of data points that best describes the relationship between an independent variable (also called the predictor) and a dependent variable (also called the response).

This method assumes that the response variable can be expressed as a linear combination of the input variable(s), allowing us to make predictions about new, unseen data based on patterns learned from the training data.

Core Components of Linear Regression

1. Independent vs. Dependent Variables

Independent variable (x): The input or feature used to make a prediction
- Example: number of study hours
Dependent variable (y): The output or value we want to predict
- Example: exam score

Linear regression assumes a direct, linear relationship:

$y = mx + b$

Where:

$y$ is the predicted outcome
$x$ is the input (independent variable)
$m$ is the slope of the line
$b$ is the y-intercept
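The text does not say how the line is chosen. A minimal sketch, assuming the usual ordinary least-squares method (pick the m and b that minimise the sum of squared vertical distances between the points and the line), might look like the following. The function name `fit_line` and the five data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: study hours and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [55, 60, 65, 70, 75]

m, b = fit_line(hours, scores)
print(m, b)  # 5.0 50.0
```

These made-up points lie exactly on a line, so the fit recovers m = 5 and b = 50 with no error. Real data would scatter around the line.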

2. Interpretation of Slope and Intercept

Slope (m)

Indicates the rate of change of the dependent variable with respect to the independent variable.
In practical terms, it tells you how much y is expected to increase or decrease for each unit increase in x.

Example: If $m = 5$, then for every additional hour studied, the predicted score increases by 5 points.

Intercept (b)

Represents the predicted value of $y$ when $x = 0$.
It shows where the line crosses the y-axis.

Example: If $b = 50$, then a student who studies 0 hours is predicted to score 50.
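The two example values above ($m = 5$, $b = 50$) can be combined into a small prediction function. This is only an illustrative sketch; the name `predict` is not from the syllabus.

```python
def predict(x, m=5.0, b=50.0):
    """Predicted score for x hours of study, using y = m*x + b."""
    return m * x + b


for hours in (0, 1, 2):
    print(hours, predict(hours))
# 0 50.0   <- the intercept: prediction when x = 0
# 1 55.0   <- each extra hour adds the slope, 5 points
# 2 60.0
```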

3. Evaluating Model Fit: The Coefficient of Determination (R²)

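The coefficient of determination, written R² (or r² in simple linear regression, where it equals the square of the correlation coefficient r), measures the proportion of the variance in the dependent variable that the regression line explains.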

Range: $0 \leq R^2 \leq 1$ for a least-squares line with an intercept, evaluated on the data it was fitted to
Interpretation:
- $R^2 = 0$: The model explains none of the variability in the data.
- $R^2 = 1$: The model perfectly explains all variability in the data.
- Higher values indicate better fit, but R² alone does not guarantee that the model is good; a high value can also be a sign of overfitting.

Example: An R² of 0.85 means that 85% of the variance in test scores can be explained by the number of hours studied.
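As a rough sketch of how this number is obtained, R² can be computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the data below are hypothetical; the predictions come from the example line y = 5x + 50 used earlier.

```python
def r_squared(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    # Variation the model fails to explain (squared residuals).
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    # Total variation of the actual values around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical actual scores for students who studied 1..5 hours.
actual = [52, 61, 64, 72, 76]
# Predictions from the illustrative line y = 5x + 50.
predicted = [5 * x + 50 for x in (1, 2, 3, 4, 5)]

print(round(r_squared(actual, predicted), 3))  # 0.955
```

Here about 95.5% of the variation in the made-up scores is accounted for by the line.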

Other metrics often used for regression evaluation include:

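- Mean Absolute Error (MAE): the average absolute difference between actual and predicted values
- Mean Squared Error (MSE): the average squared difference between actual and predicted values
- Root Mean Squared Error (RMSE): the square root of the MSE, expressed in the same units as y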
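The three error metrics listed above can be sketched in a few lines. The helper name `regression_errors` and the data are illustrative assumptions, not part of the syllabus.

```python
import math


def regression_errors(ys, preds):
    """Return (MAE, MSE, RMSE) for actual values ys and predictions preds."""
    n = len(ys)
    residuals = [y - p for y, p in zip(ys, preds)]
    mae = sum(abs(r) for r in residuals) / n   # average absolute error
    mse = sum(r * r for r in residuals) / n    # average squared error
    rmse = math.sqrt(mse)                      # back in the units of y
    return mae, mse, rmse


# Hypothetical actual scores and predictions from y = 5x + 50, x = 1..5.
actual = [52, 61, 64, 72, 76]
predicted = [55, 60, 65, 70, 75]

mae, mse, rmse = regression_errors(actual, predicted)
print(mae, mse, round(rmse, 3))  # 1.6 3.2 1.789
```

Unlike R², these values are in the units of the response (or its square, for MSE), so lower is better and they cannot be compared across differently scaled datasets.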

Student-Relatable Example

Imagine you're trying to predict a student's final grade based on how many hours they studied per week.

You collect data from 30 students and plot the number of study hours vs. final grades.
You use linear regression and get the equation:
$\text{Grade} = 6.2 \times \text{Hours} + 48$
- Slope = 6.2: Each additional hour of study increases the predicted grade by 6.2 points.
- Intercept = 48: A student who doesn’t study at all is predicted to score 48.
The model yields an R² of 0.81: Study hours explain 81% of the variation in student grades.

This tells us that study hours are a strong predictor, but the remaining 19% of the variation is not explained by the model and may reflect other factors (sleep, teaching quality, etc.).
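The fitted equation from this example can be applied directly to new students. The sketch below only evaluates the stated equation; the function name `predicted_grade` and the chosen hour values are assumptions for illustration.

```python
SLOPE = 6.2       # grade points gained per extra weekly study hour
INTERCEPT = 48.0  # predicted grade at 0 hours


def predicted_grade(hours):
    """Apply the example equation Grade = 6.2 * Hours + 48."""
    return SLOPE * hours + INTERCEPT


for h in (0, 3, 5):
    print(h, round(predicted_grade(h), 1))
# 0 48.0
# 3 66.6
# 5 79.0
```

Predictions are most trustworthy for study hours within the range actually observed in the 30 students; the equation says nothing reliable about values far outside it.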

Summary

Linear regression is a powerful tool for predicting continuous outcomes based on one or more input variables. By modeling the linear relationship between predictor and response variables, it allows us to quantify, interpret, and forecast real-world values. Understanding the slope, intercept, and goodness-of-fit (R²) helps determine whether the model provides reliable and meaningful predictions.