Activation Functions

This article is not assessed by the IB but may be helpful to deepen your understanding. Plus, I think it's cool.

Big Idea

Activation functions define how a neuron transforms its weighted input into an output signal. Without them, a neural network would be a purely linear system, however many layers it had, and could not model the complex, non-linear relationships that machine learning tasks such as classification, regression, and pattern recognition depend on (A4.1, A4.3 in the IB guide).
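To make the "purely linear" point concrete, here is a minimal sketch using single-number "layers". The weights and the function names (`two_linear_layers`, `collapsed`, `with_relu`) are made up for illustration and do not come from the IB guide. It shows that two stacked linear layers behave exactly like one linear layer, while putting a ReLU between them breaks that equivalence.

```python
# Hypothetical weights and biases, chosen only for illustration.
W1, B1 = 2.0, 1.0
W2, B2 = -3.0, 0.5


def two_linear_layers(x):
    """Two layers with no activation function in between."""
    hidden = W1 * x + B1
    return W2 * hidden + B2


def collapsed(x):
    """One linear layer that is algebraically identical to the stack above."""
    return (W2 * W1) * x + (W2 * B1 + B2)


def with_relu(x):
    """The same two layers, with a ReLU applied to the hidden value."""
    hidden = max(0.0, W1 * x + B1)
    return W2 * hidden + B2


def is_midpoint(f):
    """A straight line always satisfies f(0) == (f(-1) + f(1)) / 2."""
    return f(0.0) == (f(-1.0) + f(1.0)) / 2


for x in (-1.0, 0.0, 1.0):
    print(x, two_linear_layers(x), collapsed(x), with_relu(x))
```

In this sketch the first two columns of output always match, which is why extra layers add nothing without an activation function. The third column does not lie on a straight line, so the network with ReLU can represent a shape that no single linear layer can.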

 

Activation Functions — Technical Reference Table

| Name | Formula | Student-Friendly Description | Common Use Cases |
| --- | --- | --- | --- |
| Binary Step | \( f(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases} \) | Outputs either 0 or 1. Like a simple ON/OFF switch. | Early neural networks, theoretical models (rare in modern ML) |
| Linear (Identity) | \( f(x) = x \) | Output equals input. No transformation at all. | Output layer for regression (predicting continuous values) |
| Sigmoid (Logistic) | \( f(x) = \frac{1}{1 + e^{-x}} \) | Squashes values into the range 0 to 1. Can be interpreted as a probability. | Binary classification (output layer), older neural networks |
| Tanh (Hyperbolic Tangent) | \( f(x) = \tanh(x) \) | Like sigmoid, but outputs between -1 and 1. Centered around zero. | Hidden layers (older networks), RNNs |
| ReLU (Rectified Linear Unit) | \( f(x) = \max(0, x) \) | Outputs 0 for negative values and keeps positive values unchanged. Very simple and efficient. | Default for hidden layers in most deep neural networks |
| Leaky ReLU | \( f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} \), with \( \alpha \) a small fixed constant (e.g. 0.01) | Like ReLU, but allows a small slope for negative inputs (avoids "dead neurons"). | Hidden layers where plain ReLU fails |
| Parametric ReLU (PReLU) | \( f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} \), with \( \alpha \) learned | Similar to Leaky ReLU, but the slope \( \alpha \) is learned during training. | Advanced deep learning architectures |
| ELU (Exponential Linear Unit) | \( f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \le 0 \end{cases} \) | Smooth curve for negative values, which helps learning stability. | Deep networks requiring faster convergence |
| Swish | \( f(x) = x \cdot \text{sigmoid}(x) \) | Smooth and non-monotonic; combines linear and sigmoid behavior. | Modern deep learning (e.g., Google's EfficientNet models) |
| Softmax | \( f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \) | Converts a vector into probabilities that sum to 1. | Multi-class classification (output layer) |
| Softplus | \( f(x) = \ln(1 + e^x) \) | Smooth version of ReLU. No sharp corner at 0. | When smooth gradients are required |
| GELU (Gaussian Error Linear Unit) | \( f(x) = x \cdot \Phi(x) \), where \( \Phi \) is the standard normal cumulative distribution function | Weights inputs probabilistically instead of applying a hard cutoff like ReLU. | Transformers (e.g., modern NLP models) |
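Several of the formulas in the table can be written in a few lines of code. The following is a minimal sketch in plain Python (standard library only), not how a deep learning framework implements them. The function names, the default `alpha` values, and the numerical-stability rearrangements in `sigmoid`, `softplus`, and `softmax` are my own choices for illustration; the rearranged forms are mathematically equal to the table's formulas but avoid overflow for large inputs.

```python
import math


def binary_step(x):
    return 1.0 if x >= 0 else 0.0


def sigmoid(x):
    # Same value as 1 / (1 + e^-x); split into two cases to avoid overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # alpha=0.01 is a commonly used default, assumed here.
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def swish(x):
    return x * sigmoid(x)


def softplus(x):
    # Equal to ln(1 + e^x), rearranged so e^x is never computed for large x.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gelu(x):
    # Phi(x), the standard normal CDF, written with the error function.
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


def softmax(xs):
    # Subtracting the maximum does not change the result but prevents overflow.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    for x in (-2.0, 0.0, 2.0):
        print(x, relu(x), leaky_relu(x), sigmoid(x), math.tanh(x), gelu(x))
    print(softmax([1.0, 2.0, 3.0]))
```

Tanh is not redefined because `math.tanh` already provides it. PReLU is not shown separately: its forward pass is the same as `leaky_relu`, and the difference is that `alpha` would be updated during training rather than fixed.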