---
language: en
tags:
- image-classification
- mnist
- emnist
- digit-recognition
- pytorch
- resnet
license: mit
datasets:
- mnist
- emnist
pipeline_tag: image-classification
---

# Handwritten Digit Classifier

A PyTorch image classification model that recognizes handwritten digits (0–9), built on a **pretrained ResNet-18** backbone (ImageNet weights) fine-tuned on a combined **MNIST + EMNIST** dataset with aggressive data augmentation. Achieves **99.46% accuracy** on the combined test set.

---

## Model Details

| Property            | Value                                          |
|---------------------|------------------------------------------------|
| **Architecture**    | ResNet-18 (pretrained on ImageNet)             |
| **Framework**       | PyTorch                                        |
| **Task**            | Image Classification (10 classes, digits 0–9)  |
| **Input Size**      | 32 × 32 (grayscale, converted to 3-channel)    |
| **Output**          | Softmax probabilities over digits 0–9          |
| **Test Accuracy**   | **99.46%**                                     |
| **Training Device** | CUDA (GPU)                                     |
| **Epochs**          | 7                                              |
| **Batch Size**      | 256                                            |
| **Optimizer**       | Adam (differential learning rates)             |
| **Loss Function**   | CrossEntropyLoss                               |
| **LR Scheduler**    | StepLR (step_size=2, gamma=0.5)                |

---

## Architecture

The model uses a **ResNet-18** backbone pretrained on ImageNet, with the default classification head replaced by a custom fully-connected head:

```
ResNet-18 Backbone (pretrained on ImageNet1K)
        ↓
Linear(512 → 128)
        ↓
ReLU()
        ↓
Dropout(0.3)
        ↓
Linear(128 → 10)
        ↓
Softmax (at inference)
```

**Differential learning rates** were used to preserve pretrained features while allowing the new head to learn faster:

- Pretrained backbone layers: `lr = 0.0001`
- New classification head (the last 4 parameter groups): `lr = 0.001`

The dropout layer (p = 0.3) helps curb overfitting, since digit images are simple relative to the model's capacity.

---

## Dataset

The model was trained on a **combined MNIST + EMNIST (digits split)** dataset for greater diversity and robustness.

### MNIST

| Property         | Value                   |
|------------------|-------------------------|
| **Classes**      | 10 (digits 0–9)         |
| **Training set** | 60,000 grayscale images |
| **Test set**     | 10,000 grayscale images |
| **Image size**   | 28 × 28 pixels          |
| **Source**       | [yann.lecun.com/exdb/mnist](http://yann.lecun.com/exdb/mnist/) |

### EMNIST (digits split)

| Property         | Value                    |
|------------------|--------------------------|
| **Classes**      | 10 (digits 0–9)          |
| **Training set** | 240,000 grayscale images |
| **Test set**     | 40,000 grayscale images  |
| **Image size**   | 28 × 28 pixels           |
| **Source**       | [NIST Special Database 19](https://www.nist.gov/itl/products-and-services/emnist-dataset) |

**Combined total:** 300,000 training images and 50,000 test images.

---

## Training

The model was trained for **7 epochs** on CUDA with a StepLR scheduler (halving the learning rate every 2 epochs). Loss decreased consistently across all epochs; a sketch of the training setup is shown below, followed by the per-epoch losses.
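For reference, the snippet below sketches the training setup described above: the combined MNIST + EMNIST data with the augmentation pipeline from the Data Augmentation section, the custom head, the differential learning rates, and the StepLR schedule. It is a minimal reconstruction from this card rather than the original training script, so details such as the `data` root path, the exact transform order, and the EMNIST orientation handling are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

# Training-time augmentation (parameters mirror the Data Augmentation table below)
train_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),  # replicate grayscale across 3 channels
    transforms.Resize((32, 32)),
    transforms.RandomRotation(15),
    transforms.RandomAffine(degrees=0, translate=(0.15, 0.15), shear=10),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.3),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Combined MNIST + EMNIST (digits split) training data.
# NOTE: raw EMNIST images are stored transposed relative to MNIST; the original
# training code may apply an orientation fix that is omitted here.
train_set = ConcatDataset([
    datasets.MNIST("data", train=True, download=True, transform=train_transform),
    datasets.EMNIST("data", split="digits", train=True, download=True,
                    transform=train_transform),
])
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)

# ResNet-18 backbone with the custom classification head
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Sequential(
    nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.3), nn.Linear(128, 10)
)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Differential learning rates: slow for the pretrained backbone,
# 10x faster for the 4 new parameter tensors of the head
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc.")]
optimizer = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-3},
])
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)
criterion = nn.CrossEntropyLoss()

for epoch in range(7):
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # halves both learning rates every 2 epochs
```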
| Epoch | Loss   |
|-------|--------|
| 1     | 0.1732 |
| 2     | 0.0635 |
| 3     | 0.0446 |
| 4     | 0.0409 |
| 5     | 0.0340 |
| 6     | 0.0307 |
| 7     | 0.0279 |

**Final Test Accuracy: 99.46%**

---

## Data Augmentation

Aggressive augmentation was applied during training to improve generalization to real-world handwriting styles:

| Augmentation              | Parameters                                |
|---------------------------|-------------------------------------------|
| Random Rotation           | ±15°                                      |
| Random Affine (translate) | ±15% horizontal and vertical              |
| Random Affine (shear)     | 10°                                       |
| Random Perspective        | distortion scale 0.3, p = 0.3             |
| Color Jitter              | brightness ±0.3, contrast ±0.3            |
| Normalization             | mean (0.5, 0.5, 0.5), std (0.5, 0.5, 0.5) |

No augmentation was applied to the test set (only resize and normalize).

---

## Preprocessing

At inference, input images go through the following pipeline:

1. Convert to **grayscale**
2. **Invert** colors (white background → black background, to match the MNIST format)
3. **Resize** to 32 × 32
4. Convert to **3 channels** (grayscale replicated across RGB channels for ResNet compatibility)
5. **Normalize** with mean `(0.5, 0.5, 0.5)` and std `(0.5, 0.5, 0.5)`

---

## Usage

```python
import torch
import torch.nn as nn
from torchvision import transforms, models
from huggingface_hub import hf_hub_download
from PIL import Image
import numpy as np

# Rebuild the architecture: ResNet-18 with the custom classification head
model = models.resnet18(weights=None)
model.fc = nn.Sequential(
    nn.Linear(512, 128),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(128, 10)
)

# Download and load the fine-tuned weights
weights_path = hf_hub_download(
    repo_id="AdityaManojShinde/handwritten_digit_classifier",
    filename="mnist_model.pth"
)
model.load_state_dict(torch.load(weights_path, map_location="cpu"))
model.eval()

# Preprocessing pipeline (grayscale → 3 channels, resize, normalize)
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

# Inference
image = Image.open("your_digit.png").convert("L")
img_array = 255 - np.array(image)  # invert: white bg → black bg
image = Image.fromarray(img_array)
img_tensor = transform(image).unsqueeze(0)

with torch.no_grad():
    output = model(img_tensor)
    probs = torch.nn.functional.softmax(output, dim=1)[0]

predicted = probs.argmax().item()
print(f"Predicted digit: {predicted} ({probs[predicted]*100:.1f}% confidence)")
```

---

## Limitations

- Works best with **centered, clearly written** single digits on a plain background.
- Not suitable for multi-digit recognition or digit detection in natural scenes.
- May struggle with highly stylized or non-standard handwriting not represented in MNIST/EMNIST.

---

## License

This model is released under the [MIT License](https://opensource.org/licenses/MIT).