Convolutional Neural Network Visualized
Explore how CNNs process images through multiple layers of abstraction
Input Image
The CNN process begins with a raw input image. For our example, we'll use a handwritten digit '5'. The image is represented as a matrix of pixel values.

Digital Representation
Computers see images as arrays of numbers. Each pixel is represented as a value between 0 (black) and 255 (white) for grayscale images.
[
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 110, 190, 253, 70, 0, 0],
[0, 0, 191, 40, 0, 191, 0, 0],
[0, 0, 160, 0, 0, 120, 0, 0],
[0, 0, 127, 195, 210, 20, 0, 0],
[0, 0, 0, 0, 40, 173, 0, 0],
[0, 0, 75, 60, 20, 230, 0, 0],
[0, 0, 90, 230, 180, 35, 0, 0]
]
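As a concrete sketch, the grid above can be loaded as a NumPy array; the values mirror the example matrix:

```python
import numpy as np

# The 8x8 grayscale digit above as a NumPy array (values 0-255)
image = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 110, 190, 253, 70, 0, 0],
    [0, 0, 191, 40, 0, 191, 0, 0],
    [0, 0, 160, 0, 0, 120, 0, 0],
    [0, 0, 127, 195, 210, 20, 0, 0],
    [0, 0, 0, 0, 40, 173, 0, 0],
    [0, 0, 75, 60, 20, 230, 0, 0],
    [0, 0, 90, 230, 180, 35, 0, 0],
])
print(image.shape)  # (8, 8)
print(image.max())  # 253
```

Real datasets such as MNIST store digits the same way, just at 28×28 resolution.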
Convolution Operation
The convolution operation slides a filter (kernel) across the input image to detect features like edges, textures, or patterns.
Kernel Sliding
Kernel/Filter
Edge detection filter
Feature Map
Resulting feature map from convolution operation
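A minimal sketch of the sliding operation, assuming "valid" padding and stride 1, with a Sobel-style vertical edge kernel standing in for the edge detection filter:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel across the image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1  # output height shrinks by kernel size - 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Elementwise multiply the window by the kernel, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A common 3x3 vertical-edge detection kernel (Sobel-style)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 input
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (3, 3)
```

Note how a 5×5 input and a 3×3 kernel yield a 3×3 feature map; deep learning frameworks add padding and stride options on top of this same core loop.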
ReLU Activation
The Rectified Linear Unit (ReLU) introduces non-linearity to the network by converting all negative values to zero, allowing the network to learn complex patterns.
Before ReLU
Feature map contains both positive and negative values
f(x) = max(0, x)
After ReLU
Negative values are replaced with zeros, introducing non-linearity
Why Non-Linearity Matters
Without non-linear activation functions like ReLU, the neural network would only be able to learn linear relationships in the data, significantly limiting its ability to solve complex problems. ReLU enables the network to model more complex functions while being computationally efficient.
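In code, ReLU is a single elementwise operation; a minimal NumPy sketch on a toy feature map:

```python
import numpy as np

# Toy feature map with both positive and negative values
feature_map = np.array([[-3.0, 2.0],
                        [ 5.0, -1.0]])

# ReLU: f(x) = max(0, x), applied elementwise
relu = np.maximum(0, feature_map)
print(relu)
# [[0. 2.]
#  [5. 0.]]
```

The negatives become zeros while the positives pass through unchanged, which is also why ReLU is so cheap to compute compared with sigmoids or tanh.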
Pooling Layer
Pooling reduces the spatial dimensions of the feature maps, preserving the most important information while cutting computation and helping to reduce overfitting.
Max Pooling (2×2 Window)
Max Pooling
For each 2×2 window, keep only the maximum value
Pooled Feature Map
Benefits of Pooling
- Reduces spatial dimensions by 75% (each 2×2 window collapses to a single value)
- Preserves important features
- Makes feature detection more robust to small shifts in position
- Reduces overfitting
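A minimal NumPy sketch of 2×2 max pooling with stride 2, matching the window size described above:

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling, stride 2: keep the max of each non-overlapping window."""
    h, w = fmap.shape
    # Reshape so each 2x2 window sits on its own pair of axes, then take the max
    return fmap[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [1, 8, 3, 4]], dtype=float)

print(max_pool_2x2(fmap))
# [[6. 4.]
#  [8. 9.]]
```

The 4×4 map shrinks to 2×2 (a 75% reduction in values), yet each surviving number is the strongest activation in its region.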
Deep Layer Abstraction
As we progress through deeper layers of the CNN, the network learns increasingly abstract representations of the input image, from simple edges to complex shapes and patterns.
Layer 1: Edges & Corners
Layer 2: Simple Shapes
Layer 3: Complex Features
Hierarchy of Features
Early Layers (e.g., Layer 1)
Detect low-level features like edges, corners, and basic textures. These are the building blocks for more complex pattern recognition.
Middle Layers (e.g., Layer 2)
Combine edges and textures into more complex patterns and shapes like circles, squares, and simple object parts.
Deep Layers (e.g., Layer 3)
Recognize complex, high-level concepts specific to the training dataset, such as eyes, faces, or entire objects.
Flattening and Fully Connected Layer
The final stage of a CNN involves flattening the feature maps into a single vector and passing it through fully connected layers to make predictions.
Feature Maps to Vector
Flattening
The 2D feature maps are converted into a 1D vector by arranging all the values in a single row. This allows the network to transition from convolutional layers to fully connected layers.
Fully Connected Network
Prediction: "5"
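A minimal sketch of the flatten-then-classify step; the shapes here (4 feature maps of 2×2, 10 output classes) and the random weights are illustrative assumptions, not values from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled feature maps: 4 maps, each 2x2
feature_maps = rng.random((4, 2, 2))

# Flattening: arrange all values into a single 1D vector
flat = feature_maps.reshape(-1)
print(flat.shape)  # (16,)

# One fully connected layer mapping 16 features to 10 classes (digits 0-9)
W = rng.random((10, 16))
b = np.zeros(10)
logits = W @ flat + b        # one score per digit class
prediction = int(np.argmax(logits))  # class with the highest score
```

In a trained network, `W` and `b` are learned so that the highest score lands on the correct digit; here they are random, so the prediction is arbitrary.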
Learning via Backpropagation
The CNN learns by comparing its predictions with the true labels, calculating the error, and then propagating this error backward through the network to update weights.
Backpropagation
Backpropagation calculates how much each neuron's weight contributed to the output error. It then adjusts these weights to minimize the error in future predictions, using the chain rule of calculus to distribute error responsibility throughout the network.
Gradient Descent
The network uses gradient descent to adjust weights in the direction that reduces error. By repeatedly processing many examples and making small weight updates, the model gradually improves its ability to recognize patterns and make accurate predictions.
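The update rule can be illustrated on a toy one-weight model; the loss function, learning rate, and data point below are illustrative assumptions, not part of the article's network:

```python
# Gradient descent on a single weight for the loss L(w) = (w*x - y)^2
x, y = 2.0, 6.0   # one training example: we want w*x = y, so the ideal w is 3
w = 0.0           # initial weight
lr = 0.1          # learning rate (step size)

for _ in range(100):
    error = w * x - y        # forward pass: prediction minus target
    grad = 2 * error * x     # dL/dw via the chain rule (backpropagation)
    w -= lr * grad           # step opposite the gradient to reduce the loss

print(round(w, 3))  # converges toward 3.0
```

Each iteration nudges `w` in the direction that shrinks the error; a full CNN does the same thing simultaneously for millions of weights, with the chain rule distributing the error back through every layer.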