---
language: en
tags:
- image-classification
- mnist
- emnist
- digit-recognition
- pytorch
- resnet
license: mit
datasets:
- mnist
- emnist
pipeline_tag: image-classification
---
# Handwritten Digit Classifier
A PyTorch image classification model that recognizes handwritten digits (0–9), built on a pretrained ResNet-18 backbone (ImageNet weights) fine-tuned on a combined MNIST + EMNIST dataset with aggressive data augmentation. Achieves 99.46% accuracy on the combined test set.
## Model Details
| Property | Value |
|---|---|
| Architecture | ResNet-18 (pretrained on ImageNet) |
| Framework | PyTorch |
| Task | Image Classification (10 classes, digits 0–9) |
| Input Size | 32 × 32 (grayscale, converted to 3-channel) |
| Output | Softmax probabilities over digits 0–9 |
| Test Accuracy | 99.46% |
| Training Device | CUDA (GPU) |
| Epochs | 7 |
| Batch Size | 256 |
| Optimizer | Adam (differential learning rates) |
| Loss Function | CrossEntropyLoss |
| LR Scheduler | StepLR (step=2, gamma=0.5) |
## Architecture
The model uses a ResNet-18 backbone pretrained on ImageNet, with the default classification head replaced by a custom fully-connected head:
```
ResNet-18 Backbone (pretrained on ImageNet-1K)
        ↓
Linear(512 → 128)
        ↓
ReLU()
        ↓
Dropout(0.3)
        ↓
Linear(128 → 10)
        ↓
Softmax (at inference)
```
Differential learning rates were used to preserve pretrained features while allowing the new head to learn faster:

- Pretrained backbone layers: `lr = 0.0001`
- New classification head (last 4 param groups): `lr = 0.001`
The dropout layer (p=0.3) reduces overfitting given the simplicity of digit images relative to the model's capacity.
## Dataset
The model was trained on a combined MNIST + EMNIST (digits split) dataset for greater diversity and robustness.
### MNIST
| Property | Value |
|---|---|
| Classes | 10 (digits 0–9) |
| Training set | 60,000 grayscale images |
| Test set | 10,000 grayscale images |
| Image size | 28 × 28 pixels |
| Source | yann.lecun.com/exdb/mnist |
### EMNIST (digits split)
| Property | Value |
|---|---|
| Classes | 10 (digits 0–9) |
| Training set | 240,000 grayscale images |
| Test set | 40,000 grayscale images |
| Image size | 28 × 28 pixels |
| Source | NIST Special Database 19 |
**Combined total:** 300,000 training images and 50,000 test images.
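Combining the two datasets is straightforward with `torch.utils.data.ConcatDataset`. A sketch using dummy tensors in place of the real `torchvision.datasets.MNIST` and `torchvision.datasets.EMNIST(split="digits")` downloads, so it runs offline:

```python
import torch
from torch.utils.data import ConcatDataset, TensorDataset

# Stand-ins for the real datasets (which are drop-in replacements here):
#   torchvision.datasets.MNIST(root, train=True, download=True, transform=...)
#   torchvision.datasets.EMNIST(root, split="digits", train=True, download=True, transform=...)
mnist_like = TensorDataset(torch.zeros(60, 1, 28, 28), torch.zeros(60, dtype=torch.long))
emnist_like = TensorDataset(torch.zeros(240, 1, 28, 28), torch.zeros(240, dtype=torch.long))

# ConcatDataset chains them; indexing and len() work across both.
combined = ConcatDataset([mnist_like, emnist_like])
print(len(combined))  # 300
```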
## Training
The model was trained for 7 epochs on CUDA with a StepLR scheduler (halving LR every 2 epochs). Loss decreased consistently across all epochs.
| Epoch | Loss |
|---|---|
| 1 | 0.1732 |
| 2 | 0.0635 |
| 3 | 0.0446 |
| 4 | 0.0409 |
| 5 | 0.0340 |
| 6 | 0.0307 |
| 7 | 0.0279 |
**Final Test Accuracy:** 99.46%
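The training loop with this StepLR schedule can be sketched as follows; a toy linear model and one random batch per epoch stand in for the real ResNet-18 and the combined DataLoader:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs standalone.
model = nn.Linear(8, 10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(7):
    x = torch.randn(32, 8)                 # stand-in batch of features
    y = torch.randint(0, 10, (32,))        # stand-in digit labels
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # halves the LR every 2 epochs

# 1e-3 * 0.5**(7 // 2) = 0.000125 after 7 epochs (halved 3 times)
print(optimizer.param_groups[0]["lr"])
```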
## Data Augmentation
Aggressive augmentation was applied during training to improve generalization to real-world handwriting styles:
| Augmentation | Parameters |
|---|---|
| Random Rotation | ±15° |
| Random Affine (translate) | ±15% horizontal and vertical |
| Random Affine (shear) | 10° |
| Random Perspective | distortion scale 0.3, p=0.3 |
| Color Jitter | brightness ±0.3, contrast ±0.3 |
| Normalization | mean (0.5, 0.5, 0.5), std (0.5, 0.5, 0.5) |
No augmentation was applied to the test set (only resize + normalize).
## Preprocessing
At inference, input images go through the following pipeline:
- Convert to grayscale
- Invert colors (white background → black background to match MNIST format)
- Resize to 32 × 32
- Convert to 3-channel (grayscale replicated across RGB channels for ResNet compatibility)
- Normalize with mean `(0.5, 0.5, 0.5)` and std `(0.5, 0.5, 0.5)`
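The inversion step (white background → black) can also be done with PIL's `ImageOps.invert`, which is equivalent to the `255 - pixel` numpy approach for 8-bit grayscale images:

```python
from PIL import Image, ImageOps

# Dummy light-gray image standing in for a white-background digit scan
img = Image.new("L", (28, 28), color=200)
inverted = ImageOps.invert(img)  # each pixel becomes 255 - value
print(inverted.getpixel((0, 0)))  # 55
```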
## Usage
```python
import torch
import torch.nn as nn
from torchvision import transforms, models
from huggingface_hub import hf_hub_download
from PIL import Image
import numpy as np

# Load model
model = models.resnet18(weights=None)
model.fc = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.3), nn.Linear(128, 10)
)
weights_path = hf_hub_download(
    repo_id="AdityaManojShinde/handwritten_digit_classifier",
    filename="mnist_model.pth",
)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Preprocessing pipeline
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Inference
image = Image.open("your_digit.png").convert("L")
img_array = 255 - np.array(image)  # invert: white bg → black bg
image = Image.fromarray(img_array)
img_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(img_tensor)
    probs = torch.nn.functional.softmax(output, dim=1)[0]
    predicted = probs.argmax().item()

print(f"Predicted digit: {predicted} ({probs[predicted].item() * 100:.1f}% confidence)")
```
## Limitations
- Works best with centered, clearly written single digits on a plain background.
- Not suitable for multi-digit recognition or digit detection in natural scenes.
- May struggle with highly stylized or non-standard digit handwriting not represented in MNIST/EMNIST.
## License
This model is released under the MIT License.