Big Idea
Activation functions define how a neuron transforms its weighted input into an output signal. Without them, a neural network would be a purely linear system: any stack of linear layers collapses into a single linear transformation, no matter how many layers it has. Such a network cannot model the complex, non-linear relationships that are essential for machine learning tasks like classification, regression, and pattern recognition (A4.1, A4.3 in the IB guide).
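The short Python sketch below illustrates this with two one-input "layers". The weights and biases are hypothetical values chosen only for the demonstration. Stacking the two linear layers gives exactly the same outputs as one linear layer, while placing a ReLU between them produces a function that a single linear layer cannot reproduce.

```python
def linear(w, b):
    """Return a one-input, one-output linear layer: x -> w*x + b."""
    return lambda x: w * x + b


def relu(x):
    """ReLU activation: 0 for negative inputs, unchanged otherwise."""
    return max(0.0, x)


# Hypothetical weights and biases, chosen only for illustration.
layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)


def stacked_linear(x):
    """Two linear layers with no activation in between."""
    return layer2(layer1(x))


# Algebraically: -3*(2x + 1) + 0.5 = -6x - 2.5, a single linear layer.
collapsed = linear(-6.0, -2.5)


def stacked_with_relu(x):
    """The same two layers with a ReLU applied after the first one."""
    return layer2(relu(layer1(x)))


inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear_outputs = [stacked_linear(x) for x in inputs]
collapsed_outputs = [collapsed(x) for x in inputs]
relu_outputs = [stacked_with_relu(x) for x in inputs]

print(linear_outputs == collapsed_outputs)  # the two-layer stack is one line
print(relu_outputs)  # flat for negative inputs, then sloped: not a line
```

The ReLU version is flat wherever the first layer's output is negative and sloped elsewhere, so it has a bend that no single straight line can match.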
Activation Functions — Technical Reference Table
| Name | Formula | Student-Friendly Description | Common Use Cases |
|---|---|---|---|
| Binary Step | \( f(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases} \) | Outputs either 0 or 1, like a simple ON/OFF switch. | Early neural networks, theoretical models (rare in modern ML) |
| Linear (Identity) | \( f(x) = x \) | Output equals input. No transformation at all. | Output layer for regression (predicting continuous values) |
| Sigmoid (Logistic) | \( f(x) = \frac{1}{1 + e^{-x}} \) | Squashes any input into a value between 0 and 1, which can be interpreted as a probability. | Binary classification (output layer), older neural networks |
| Tanh (Hyperbolic Tangent) | \( f(x) = \tanh(x) \) | Like sigmoid but outputs between -1 and 1. Centered around zero. | Hidden layers (older networks), RNNs |
| ReLU (Rectified Linear Unit) | \( f(x) = \max(0, x) \) | Outputs 0 for negative values, keeps positive values unchanged. Very simple and efficient. | Default for hidden layers in most deep neural networks |
| Leaky ReLU | \( f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} \) | Like ReLU but keeps a small fixed slope \( \alpha \) (e.g., 0.01) for negative inputs, which avoids “dead neurons”. | Hidden layers where plain ReLU leaves many neurons stuck at 0 |
| Parametric ReLU (PReLU) | \( f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} \) | Same form as Leaky ReLU, but the slope \( \alpha \) is learned during training. | Advanced deep learning architectures |
| ELU (Exponential Linear Unit) | \( f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \le 0 \end{cases} \) | Smooth curve for negative values, helps learning stability. | Deep networks requiring faster convergence |
| Swish | \( f(x) = x \cdot \text{sigmoid}(x) \) | Smooth and non-monotonic; combines linear and sigmoid behavior. | Modern deep networks (e.g., Google's EfficientNet) |
| Softmax | \( f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \) | Converts a vector into probabilities that sum to 1. | Multi-class classification (output layer) |
| Softplus | \( f(x) = \ln(1 + e^x) \) | Smooth version of ReLU. No sharp corner at 0. | When smooth gradients are required |
| GELU (Gaussian Error Linear Unit) | \( f(x) = x \cdot \Phi(x) \), where \( \Phi \) is the standard normal CDF | Weights inputs probabilistically instead of applying a hard cutoff like ReLU. | Transformers (e.g., modern NLP models such as BERT and GPT) |
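
To make the formulas in the table concrete, here is a minimal Python sketch of most of them using only the standard `math` module. The function names and the default `alpha` values are assumptions made for illustration, not taken from any particular library. The scalar versions are written for readability and are not hardened against overflow for very large inputs.

```python
import math


def binary_step(x):
    """1 for x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """0 for negative inputs, unchanged otherwise."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """ReLU with a small slope alpha for x <= 0 (alpha is an assumed default).

    PReLU has the same form; the only difference is that alpha is
    learned during training instead of being fixed.
    """
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """Identity for x > 0, smooth exponential curve for x <= 0."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def swish(x):
    """x multiplied by sigmoid(x)."""
    return x * sigmoid(x)


def softplus(x):
    """ln(1 + e^x), a smooth approximation of ReLU."""
    return math.log1p(math.exp(x))


def gelu(x):
    """x * Phi(x), with Phi the standard normal CDF written via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1.

    Subtracting the maximum first does not change the result but
    keeps the exponentials from overflowing.
    """
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Example usage with hypothetical scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Tanh is available directly as `math.tanh`, and the identity function needs no code. In practice these functions are applied element-wise to whole arrays by a numerical library rather than one number at a time.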