A4.3.9 Describe how CNNs are designed to adaptively learn spatial hierarchies of features in images.
• Convolutional neural network (CNN) basic architecture: input layer, convolutional layers, activation functions, pooling layers, fully connected layers, output layer
• The effect of the number of layers, kernel size and stride, activation function selection, and the loss function on how CNNs process input data and classify images
The Big Idea
Convolutional Neural Networks (CNNs) are a specialized class of neural networks designed to process and understand visual data. Unlike standard fully connected artificial neural networks (ANNs), CNNs are particularly well suited to image-related tasks because they can automatically learn spatial hierarchies of features (such as edges, textures, shapes, and objects) directly from pixel data.
Through layers of convolutional filters, non-linear activation, and pooling, CNNs capture both low-level features (e.g., corners, edges) and high-level features (e.g., faces, digits, animals) as input flows deeper through the network. CNNs are the foundation of modern computer vision, powering applications from face detection to medical image analysis.
Spatial Hierarchies
Spatial hierarchies refer to the progressive structure of visual features learned by a Convolutional Neural Network (CNN) as data passes through its layers. They capture the idea that complex visual patterns are built from simpler ones, layer by layer.
Hierarchical Structure:
- Lower layers detect local, low-level features like edges, textures, and color gradients.
- Intermediate layers combine these to detect parts of objects (e.g., corners, shapes, motifs).
- Higher layers integrate the parts into full objects or abstract concepts (e.g., a digit, a face, a cat).
This hierarchy enables CNNs to generalize from raw pixels to meaningful semantic content, learning increasingly abstract and spatially broad features at deeper layers.
Example: In a CNN trained to recognize dogs:
- Early layers may activate on lines and curves.
- Middle layers may respond to parts such as ears or paws.
- Final layers may respond to full dog faces or specific breeds.
Spatial hierarchies are what allow CNNs to move from fine-grained details to a high-level interpretation of an image, which is essential for robust performance in real-world image tasks.
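One way to see why deeper layers respond to spatially broader features is to compute the receptive field: the patch of input pixels that can influence a single value at a given layer. The sketch below uses the standard recurrence for convolution and pooling layers; the function name `receptive_field` and the five-layer stack are illustrative assumptions, not part of the syllabus.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf = 1     # a single input pixel "sees" only itself
    jump = 1   # distance, in input pixels, between neighbouring outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) * jump
        jump *= stride              # striding spreads neighbouring outputs apart
        sizes.append(rf)
    return sizes


# Hypothetical stack: conv 3x3, pool 2x2 (stride 2), conv 3x3, pool 2x2 (stride 2), conv 3x3
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
print(receptive_field(stack))  # [3, 4, 8, 10, 18]
```

In this example a value in the first layer depends on a 3×3 patch of pixels, while a value in the fifth layer depends on an 18×18 patch, which is large enough to cover an object part rather than a single edge.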
1. CNN Architecture: Layer-by-Layer Overview
Input Layer
- Accepts image data, typically as a 3D array (height × width × channels)
- Example: A 28×28 grayscale image → input shape = (28, 28, 1)
Convolutional Layers
- Apply kernels (small learnable filters) that scan across the image to detect features
- Each kernel produces a feature map that highlights the presence of a specific feature (e.g., edge, corner)
Key parameters:
- Kernel size (e.g., 3×3): Defines the size of the filter
- Stride: Controls how far the filter moves at each step
- Padding: Adds border pixels (usually zeros) around the input; "same" padding preserves the spatial dimensions at stride 1, while "valid" (no padding) lets the output shrink
As more convolutional layers are added, the network can detect increasingly abstract features.
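The interaction of kernel, stride, and padding can be sketched in plain Python. This is a minimal single-channel illustration, not how a deep learning library implements it; the function name `conv2d`, the toy image, and the hand-written vertical-edge kernel are assumptions made for the example (in a real CNN the kernel values are learned). Like most deep learning libraries, it computes cross-correlation, meaning the kernel is not flipped.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide a square kernel over a 2D image (lists of lists); return the feature map."""
    k = len(kernel)
    if padding:
        # Zero-pad all four borders of the image.
        width = len(image[0]) + 2 * padding
        image = ([[0] * width for _ in range(padding)]
                 + [[0] * padding + list(row) + [0] * padding for row in image]
                 + [[0] * width for _ in range(padding)])
    h, w = len(image), len(image[0])
    out_h = (h - k) // stride + 1   # output size = (n + 2p - k) // s + 1
    out_w = (w - k) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the k x k window whose top-left corner is (i*s, j*s).
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i * stride + di][j * stride + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 5x5 image: dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

print(conv2d(image, kernel))             # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
print(conv2d(image, kernel, stride=2))   # [[3, 0], [3, 0]]
print(len(conv2d(image, kernel, padding=1)))  # 5, i.e. "same" size at stride 1
```

The feature map is large where the edge sits under the kernel and zero over the flat bright region. Stride 2 halves the number of positions visited, and one pixel of zero padding keeps the 5×5 size.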
Activation Functions
- Applied element-wise after each convolution; without them, a stack of convolutions would collapse into a single linear operation
- Common choices:
  - ReLU (Rectified Linear Unit): f(x) = max(0, x); fast and effective
  - Leaky ReLU: Allows a small gradient for negative inputs
  - Sigmoid/Tanh: Rarely used in the hidden layers of modern CNNs due to vanishing gradient issues
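The activation functions listed above are short enough to write out directly. The sketch below is illustrative only; the leaky slope `alpha=0.01` is a common default chosen here as an assumption, and `sigmoid_grad` is a helper added to show why sigmoid gradients vanish for large inputs.

```python
import math


def relu(x):
    """Pass positive values through unchanged; clamp negatives to zero."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope (alpha, assumed 0.01)."""
    return x if x > 0 else alpha * x


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), leaky_relu(x), round(sigmoid(x), 4), round(math.tanh(x), 4))

# The sigmoid gradient peaks at 0.25 and shrinks quickly away from zero,
# whereas the ReLU gradient stays at 1 for every positive input.
for x in (0.0, 3.0, 6.0):
    print(x, sigmoid_grad(x))
```

Because backpropagation multiplies one such gradient per layer, a chain of values at or below 0.25 shrinks rapidly with depth, which is the vanishing gradient issue mentioned for sigmoid and tanh.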
Pooling Layers (Subsampling)
- Reduce the spatial size of feature maps while retaining essential information
- Provide a degree of translation invariance (small shifts in the input change the output little) and reduce computation
- Common types:
  - Max pooling: Takes the maximum value in each region (e.g., 2×2 pool with stride 2)
  - Average pooling: Takes the average of each region
Pooling layers compress information and, by reducing the number of values passed to later layers, can help limit overfitting.
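A small sketch of max and average pooling makes the size reduction concrete. The function name `pool2d` and the 4×4 feature map are made up for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Summarise each size x size window by its maximum or its average."""
    if mode == "max":
        reduce = max
    else:
        reduce = lambda values: sum(values) / len(values)
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [feature_map[i * stride + di][j * stride + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]

print(pool2d(fmap))               # [[4, 2], [2, 8]]
print(pool2d(fmap, mode="avg"))   # [[2.5, 1.0], [0.75, 6.5]]

# A strong activation shifted by one pixel inside the same window
# produces the same max-pooled output.
print(pool2d([[9, 0], [0, 0]]) == pool2d([[0, 9], [0, 0]]))  # True
```

The 4×4 map becomes 2×2, so later layers handle a quarter as many values. The last line shows the limited form of translation invariance that pooling provides: it holds only for shifts that stay within one pooling window.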
Fully Connected Layers
- Operate on the output of the final pooling/convolution layer after it is flattened into a 1D vector
- Standard feedforward ANN layers used to make final predictions
- Often the last 1–2 layers in a CNN
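Flattening and a fully connected layer can be sketched in a few lines. The helper names `flatten` and `dense`, and all the weights and biases, are invented for illustration; in a trained network the weights would be learned.

```python
def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat list of values."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]


# Two 2x2 feature maps, as might come out of a final pooling layer.
maps = [[[1, 2], [3, 4]],
        [[0, 1], [1, 0]]]
vector = flatten(maps)
print(vector)  # [1, 2, 3, 4, 0, 1, 1, 0]

# Two output units, each connected to all eight inputs (made-up weights).
weights = [[1, 0, 0, 0, 0, 0, 0, 1],
           [0.5] * 8]
biases = [0.0, -1.0]
print(dense(vector, weights, biases))  # [1.0, 5.0]
```

Every output unit has a weight for every input value. This is why fully connected layers usually hold most of a small CNN's parameters, and why they are kept to the last one or two layers.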
Output Layer
- Uses softmax (for multi-class classification) or sigmoid (for binary classification)
- Produces class probabilities or confidence scores
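As a rough sketch of how the output layer turns raw scores into probabilities, the block below implements softmax and sigmoid. The example scores are made up, and subtracting the maximum score before exponentiating is a common numerical-stability trick rather than something the syllabus requires.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(scores)                       # subtract max to avoid overflow
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    """Convert one raw score into a probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical raw scores for three classes.
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])   # roughly [0.659, 0.242, 0.099]
print(probs.index(max(probs)))        # 0, the predicted class index

# Binary case: a single score mapped to one probability.
print(sigmoid(0.0))                   # 0.5
```

Softmax preserves the ranking of the scores, so the predicted class is the one with the largest raw score. The probabilities mainly matter for the loss function and for reporting confidence.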
2. Design Factors That Affect CNN Performance
| Parameter | Effect |
|---|---|
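| Number of Layers | Deeper networks can model more complex features, but they are more prone to overfitting (mitigated by regularization and more data) and to vanishing gradients (mitigated by activation choice, normalization, or skip connections). |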
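| Kernel Size | Smaller kernels (e.g., 3×3) are common; stacking several small kernels covers the same receptive field as one larger kernel with fewer parameters (two 3×3 layers see a 5×5 region using 18 weights per channel pair instead of 25). |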
| Stride | Larger strides reduce feature map size more quickly but may lose fine-grained detail. |
| Activation Function | ReLU is the usual default due to its simplicity and performance; poor choices can cause vanishing gradients or slow learning. |
| Loss Function | Defines the error that training minimizes; common choices include categorical cross-entropy (multi-class classification), binary cross-entropy (binary classification), and mean squared error (regression tasks). |
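Several rows of the table can be checked with simple arithmetic. The helpers below are an illustrative sketch: the function names, the 28×28 input, and the 64-channel layers are assumptions chosen for the example.

```python
import math


def output_size(n, kernel, stride=1, padding=0):
    """Side length of the feature map for an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


def conv_params(kernel, channels_in, channels_out):
    """Learnable weights plus one bias per filter in a convolutional layer."""
    return kernel * kernel * channels_in * channels_out + channels_out


def cross_entropy(probs, target_index):
    """Categorical cross-entropy for one example: -log(probability of the true class)."""
    return -math.log(probs[target_index])


# Stride: a larger stride shrinks the feature map faster.
print(output_size(28, kernel=3, stride=1))  # 26
print(output_size(28, kernel=3, stride=2))  # 13

# Kernel size: two stacked 3x3 layers vs one 5x5 layer (64 channels in and out).
print(2 * conv_params(3, 64, 64))  # 73856
print(conv_params(5, 64, 64))      # 102464

# Loss: confident correct predictions cost little, confident wrong ones cost a lot.
prediction = [0.7, 0.2, 0.1]
print(round(cross_entropy(prediction, 0), 4))  # 0.3567 when class 0 is the true class
print(round(cross_entropy(prediction, 2), 4))  # 2.3026 when class 2 is the true class
```

Both kernel configurations cover a 5×5 receptive field, but the stacked version uses about 28% fewer parameters and adds an extra non-linearity between the two layers.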
Student-Relatable Example
Scenario: You want to build an app that can recognize different types of school lunch items from a photo—classifying trays as containing pizza, salad, sandwiches, or pasta.
- You collect and label hundreds of lunch tray images.
- A CNN is trained to recognize:
  - Low-level features like textures (lettuce leaves, crust patterns)
  - Mid-level features like the shape of bread or a bowl
  - High-level features like the full structure of a pizza slice or sandwich
The deeper the network, the more compositional and abstract the learned features become, which helps your model classify even messy or partially occluded lunch trays correctly.
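To make the scenario concrete, the sketch below traces how the data shape might change through a small CNN for the four lunch classes. Everything about the architecture is a hypothetical choice for illustration: the 64×64 RGB input, the two convolution/pooling stages, and the filter counts are not prescribed by the scenario.

```python
CLASSES = ["pizza", "salad", "sandwich", "pasta"]  # labels from the scenario

# Hypothetical architecture as (layer kind, argument) pairs.
LAYERS = [
    ("conv_same", 16),  # 3x3 conv, stride 1, "same" padding, 16 filters
    ("pool", 2),        # 2x2 max pooling, stride 2
    ("conv_same", 32),  # 3x3 conv, stride 1, "same" padding, 32 filters
    ("pool", 2),        # 2x2 max pooling, stride 2
]


def trace_shapes(input_shape, layers):
    """Return the (height, width, channels) shape after each layer."""
    h, w, c = input_shape
    shapes = [(h, w, c)]
    for kind, arg in layers:
        if kind == "conv_same":
            c = arg                      # only the channel count changes
        elif kind == "pool":
            h, w = h // arg, w // arg    # height and width shrink
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((h, w, c))
    return shapes


shapes = trace_shapes((64, 64, 3), LAYERS)
for shape in shapes:
    print(shape)

h, w, c = shapes[-1]
flat_size = h * w * c
print(flat_size)                                    # 8192 values after flattening
print(flat_size * len(CLASSES) + len(CLASSES))      # 32772 parameters in the output layer
```

The spatial size falls from 64×64 to 16×16 while the channel count rises from 3 to 32. The network trades pixel-level detail for a richer set of features before the final layer produces one score per lunch class.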
Summary
Convolutional Neural Networks are purpose-built for image processing, using layered structures to learn spatial hierarchies of features in visual data. Through convolution, activation, pooling, and fully connected layers, CNNs detect patterns from pixels to objects. Design decisions like layer depth, kernel size, and activation functions significantly influence how well the model performs on real-world classification tasks. Whether recognizing handwritten digits, detecting disease in X-rays, or identifying lunch food, CNNs form the backbone of modern computer vision.