Activation Functions | Computer Science KB

Big Idea

Activation functions define how a neuron transforms its weighted input into an output signal. Without them, a neural network would be a purely linear system: stacking any number of linear layers still collapses into a single linear map, so the network could not model the complex, non-linear relationships that machine learning tasks such as classification, regression, and pattern recognition depend on (A4.1, A4.3 in the IB guide).
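The point about linearity can be checked with a few lines of arithmetic. The sketch below is illustrative only: the weights and biases (`W1`, `B1`, `W2`, `B2`) and the function names are made-up values chosen for the example, not taken from any real network.

```python
# Two hypothetical one-input "layers", each just a weight and a bias.
# These numbers are arbitrary and exist only for illustration.
W1, B1 = 2.0, 1.0
W2, B2 = -3.0, 0.5


def linear_stack(x):
    """Two layers with no activation function in between."""
    hidden = W1 * x + B1
    return W2 * hidden + B2


def collapsed(x):
    """The single linear layer that linear_stack is equivalent to."""
    return (W2 * W1) * x + (W2 * B1 + B2)


def relu_stack(x):
    """The same two layers, with ReLU applied to the hidden value."""
    hidden = max(0.0, W1 * x + B1)
    return W2 * hidden + B2


for x in (-2.0, 0.0, 1.5):
    print(x, linear_stack(x), collapsed(x), relu_stack(x))
```

For every input, `linear_stack` and `collapsed` agree, so the second layer adds no expressive power. With ReLU in between, the output at `x = -2.0` differs, which shows the stack no longer reduces to a single linear function.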

Activation Functions — Technical Reference Table

Name	Formula	Student-Friendly Description	Common Use Cases
Binary Step	\( f(x) = \begin{cases} 1 & x \ge 0 \\ 0 & x < 0 \end{cases} \)	Outputs either 0 or 1, like a simple ON/OFF switch.	Early neural networks, theoretical models (rare in modern ML)
Linear (Identity)	\( f(x) = x \)	Output equals input. No transformation at all.	Output layer for regression (predicting continuous values)
Sigmoid (Logistic)	\( f(x) = \frac{1}{1 + e^{-x}} \)	Squashes any input into the range 0 to 1, so the output can be interpreted as a probability.	Binary classification (output layer), older neural networks
Tanh (Hyperbolic Tangent)	\( f(x) = \tanh(x) \)	Like sigmoid, but outputs range from -1 to 1 and are centered on zero.	Hidden layers (older networks), RNNs
ReLU (Rectified Linear Unit)	\( f(x) = \max(0, x) \)	Outputs 0 for negative values, keeps positive values unchanged. Very simple and efficient.	Default for hidden layers in most deep neural networks
Leaky ReLU	\( f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} \)	Like ReLU, but keeps a small fixed slope \( \alpha \) for negative inputs (avoids “dead neurons”).	Hidden layers where plain ReLU leaves neurons stuck at zero
Parametric ReLU (PReLU)	\( f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases} \), with \( \alpha \) learned	Same form as Leaky ReLU, but the slope \( \alpha \) is learned during training.	Advanced deep learning architectures
ELU (Exponential Linear Unit)	\( f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \le 0 \end{cases} \)	Smooth curve for negative values, which helps learning stability.	Deep networks requiring faster convergence
Swish	\( f(x) = x \cdot \text{sigmoid}(x) \)	Smooth and non-monotonic; combines linear and sigmoid behavior.	Modern deep learning (e.g., EfficientNet)
Softmax	\( f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \)	Converts a vector into probabilities that sum to 1.	Multi-class classification (output layer)
Softplus	\( f(x) = \ln(1 + e^x) \)	Smooth version of ReLU. No sharp corner at 0.	When smooth gradients are required
GELU (Gaussian Error Linear Unit)	\( f(x) = x \cdot \Phi(x) \), where \( \Phi \) is the standard normal CDF	Weights each input by the probability \( \Phi(x) \) instead of applying a hard cutoff like ReLU.	Transformers (e.g., BERT, GPT)
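
The formulas in the table translate almost directly into code. Below is a minimal sketch in plain Python using only the standard `math` module. The function names and the default values `alpha=0.01` (Leaky ReLU) and `alpha=1.0` (ELU) are assumptions chosen for illustration; the table does not fix them. Tanh is available as `math.tanh`, the linear activation simply returns its input, and PReLU has the same form as `leaky_relu` with `alpha` treated as a trainable parameter.

```python
import math


def binary_step(x):
    return 1.0 if x >= 0 else 0.0


def sigmoid(x):
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):  # alpha=0.01 is an assumed default
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):  # alpha=1.0 is an assumed default
    # math.expm1(x) computes e^x - 1 accurately for small x.
    return x if x > 0 else alpha * math.expm1(x)


def swish(x):
    return x * sigmoid(x)


def softplus(x):
    # ln(1 + e^x) rewritten as max(x, 0) + ln(1 + e^-|x|) to avoid overflow.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def gelu(x):
    # Phi(x), the standard normal CDF, expressed through the error function.
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


def softmax(xs):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), sigmoid(x), swish(x), gelu(x))
print(softmax([1.0, 2.0, 3.0]))
```

Unlike the other functions, `softmax` takes a whole list of values rather than a single number, because each output depends on every input. Production frameworks apply these functions to entire arrays at once, but the per-element arithmetic is the same.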