abdurafay19 committed
Commit ddef3d5 · verified · 1 Parent(s): 038d8bd

Update README.md

Files changed (1):
  1. README.md +93 -44
README.md CHANGED
@@ -37,7 +37,7 @@ This model is a CNN trained from scratch on the MNIST benchmark dataset. It acce
37
  - **Demo:** [Hugging Face Space](https://huggingface.co/spaces/abdurafay19/Digit-Classifier)
38
 
39
  ---
40
-
41
  ## Uses
42
 
43
  ### Direct Use
@@ -87,11 +87,11 @@ This model is **not** suitable for:
87
  import torch
88
  from torchvision import transforms
89
  from PIL import Image
90
- from model import CNN # your model definition
91
 
92
  # Load model
93
- model = CNN()
94
- model.load_state_dict(torch.load("model.pt"))
95
  model.eval()
96
 
97
  # Preprocess image
@@ -122,31 +122,36 @@ print(f"Predicted digit: {prediction}")
122
  - **Dataset:** [MNIST](https://huggingface.co/datasets/mnist) — 70,000 grayscale images (60,000 train / 10,000 test)
123
  - **Input size:** 28×28 pixels, single channel
124
  - **Classes:** 10 (digits 0–9)
125
- -
126
  ### Training Procedure
127
 
128
  #### Preprocessing
129
 
130
- - Images resized to 28×28 and converted to grayscale tensors
131
- - Pixel values normalized using MNIST dataset mean and standard deviation
132
- - Random horizontal flips and small rotations applied for data augmentation
133
 
134
  #### Training Hyperparameters
135
 
136
- | Parameter | Value |
137
- |------------------|----------------|
138
- | Optimizer | Adam |
139
- | Learning Rate | 1e-3 |
140
- | Batch Size | 64 |
141
- | Epochs | 28 |
142
- | Loss Function | CrossEntropyLoss |
143
- | Dropout | 0.5 |
144
- | Training regime | fp32 |
145
 
146
  #### Speeds, Sizes, Times
147
 
148
- - **Training time:** ~10 minutes on a single GPU (NVIDIA T4)
149
- - **Model size:** ~2.4 MB (`.pt` file)
150
  - **Inference speed:** <50ms per image (CPU)
151
 
152
  ---
@@ -161,7 +166,7 @@ Evaluated on the standard MNIST test split — 10,000 images not seen during tra
161
 
162
  #### Factors
163
 
164
- Evaluation was performed across all 10 digit classes equally. No disaggregation by subpopulation was conducted (MNIST does not include demographic metadata).
165
 
166
  #### Metrics
167
 
@@ -172,17 +177,32 @@ Evaluation was performed across all 10 digit classes equally. No disaggregation
172
 
173
  | Metric | Value |
174
  |---------------|---------|
175
- | Test Accuracy | 99.16% |
176
 
177
  #### Summary
178
 
179
- The model achieves **99.16% accuracy** on the MNIST test set, consistent with state-of-the-art results for CNN-based approaches on this benchmark.
180
 
181
  ---
182
 
183
  ## Model Examination
184
 
185
- The model's convolutional filters learn edge detectors and stroke patterns in the first layers, which compose into digit-specific features in deeper layers. Standard CNN interpretability techniques (e.g., Grad-CAM) can be applied to visualize which regions most influence predictions.
186
 
187
  ---
188
 
@@ -193,8 +213,8 @@ Carbon emissions estimated using the [ML Impact Calculator](https://mlco2.github
193
  | Factor | Value |
194
  |-----------------|------------------------|
195
  | Hardware Type | NVIDIA T4 GPU |
196
- | Hours Used | ~0.2 hrs (10 min) |
197
- | Cloud Provider | Google Colab / Local |
198
  | Compute Region | Singapore |
199
  | Carbon Emitted | ~0.01 kg CO₂eq (est.) |
200
 
@@ -204,33 +224,59 @@ Carbon emissions estimated using the [ML Impact Calculator](https://mlco2.github
204
 
205
  ### Model Architecture
206
207
  #### Convolutional Blocks
208
 
209
- | Layer | Output Shape | Details |
210
- |-------------|---------------|------------------------------|
211
- | Conv2d | (64, 28, 28) | 64 filters, 3×3, padding=1 |
212
- | BatchNorm2d | (64, 28, 28) | — |
213
- | ReLU | (64, 28, 28) | inplace=True |
214
- | MaxPool2d | (64, 14, 14) | 2×2 |
215
- | Conv2d | (128, 14, 14) | 128 filters, 3×3, padding=1 |
216
- | BatchNorm2d | (128, 14, 14) | — |
217
- | ReLU | (128, 14, 14) | inplace=True |
218
- | MaxPool2d | (128, 7, 7) | 2×2 |
219
 
220
  #### Fully Connected Layers
221
 
222
- | Layer | Output | Details |
223
- |----------|--------|-------------------------------------|
224
- | Flatten | 6272 | — |
225
- | Linear | 512 | + BatchNorm1d + ReLU + Dropout(0.4) |
226
- | Linear | 128 | + BatchNorm1d + ReLU + Dropout(0.2) |
227
- | Linear | 10 | Raw logits |
228
 
229
- **Total Parameters: ~3.5M** — Kaiming Normal initialization throughout.
230
 
231
  ### Compute Infrastructure
232
 
233
- - **Hardware:** NVIDIA T4 / any CUDA-capable GPU (or CPU for inference)
234
  - **Software:** Python 3.10+, PyTorch 2.0, torchvision
235
 
236
  ---
@@ -242,7 +288,7 @@ If you use this model in your work, please cite:
242
  **BibTeX:**
243
  ```bibtex
244
  @misc{digit-classifier-2026,
245
- author = Abdul Rafay,
246
  title = {Handwritten Digit Classifier (CNN on MNIST)},
247
  year = {2026},
248
  publisher = {Hugging Face},
@@ -263,6 +309,9 @@ If you use this model in your work, please cite:
263
  | MNIST | A benchmark dataset of 70,000 handwritten digit images |
264
  | Softmax | Activation function that converts raw outputs to probabilities summing to 1 |
265
  | Dropout | Regularization technique that randomly disables neurons during training |
266
  | Grad-CAM | Gradient-weighted Class Activation Mapping — a model interpretability technique |
267
 
268
  ---
 
37
  - **Demo:** [Hugging Face Space](https://huggingface.co/spaces/abdurafay19/Digit-Classifier)
38
 
39
  ---
41
  ## Uses
42
 
43
  ### Direct Use
 
87
  import torch
88
  from torchvision import transforms
89
  from PIL import Image
90
+ from model import Model # your model definition
91
 
92
  # Load model
93
+ model = Model()
94
+ model.load_state_dict(torch.load("mnist_best.pth", map_location="cpu"))
95
  model.eval()
96
 
97
  # Preprocess image
 
122
  - **Dataset:** [MNIST](https://huggingface.co/datasets/mnist) — 70,000 grayscale images (60,000 train / 10,000 test)
123
  - **Input size:** 28×28 pixels, single channel
124
  - **Classes:** 10 (digits 0–9)
125
+
126
  ### Training Procedure
127
 
128
  #### Preprocessing
129
 
130
+ - Images converted to tensors and normalized using MNIST dataset mean (0.1307) and std (0.3081)
131
+ - Training augmentation: random rotation (±10°), random affine with translation (±10%), scale (0.9–1.1×), and shear (±5°)
132
+ - Test images: normalization only — no augmentation
133
 
134
  #### Training Hyperparameters
135
 
136
+ | Parameter | Value |
137
+ |-----------------|------------------------------|
138
+ | Optimizer | AdamW |
139
+ | Learning Rate | 3e-3 (max, OneCycleLR) |
140
+ | Weight Decay | 1e-4 |
141
+ | Batch Size | 64 |
142
+ | Epochs | 50 |
143
+ | Loss Function | CrossEntropyLoss |
144
+ | Label Smoothing | 0.1 |
145
+ | LR Scheduler | OneCycleLR (10% warmup, cosine anneal) |
146
+ | Dropout (conv) | 0.25 (Dropout2d) |
147
+ | Dropout (FC) | 0.25 |
148
+ | Random Seed | 23 |
149
+ | Training regime | fp32 |
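
A minimal sketch of this optimizer/scheduler/loss setup (the steps-per-epoch value assumes the 60,000-image train split at batch size 64; the stand-in model is illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(23)  # random seed from the table

model = nn.Linear(784, 10)  # stand-in for the CNN

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

steps_per_epoch = 60_000 // 64 + 1  # 938 batches per epoch
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=3e-3,
    epochs=50,
    steps_per_epoch=steps_per_epoch,
    pct_start=0.1,          # 10% warmup
    anneal_strategy="cos",  # cosine anneal after the peak
)
```

With OneCycleLR, `scheduler.step()` is called after every `optimizer.step()` (per batch), not once per epoch.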
150
 
151
  #### Speeds, Sizes, Times
152
 
153
+ - **Training time:** ~10 minutes on a single GPU (NVIDIA T4, Google Colab)
154
+ - **Model parameters:** 160,842
155
  - **Inference speed:** <50ms per image (CPU)
156
 
157
  ---
 
166
 
167
  #### Factors
168
 
169
+ Evaluation was performed across all 10 digit classes. No disaggregation by subpopulation was conducted (MNIST does not include demographic metadata).
170
 
171
  #### Metrics
172
 
 
177
 
178
  | Metric | Value |
179
  |---------------|---------|
180
+ | Test Accuracy | 99.43% |
181
+
182
+ #### Per-Class Accuracy
183
+
184
+ | Digit | Correct | Errors | Accuracy |
185
+ |-------|---------|--------|----------|
186
+ | 0 | 980 | 0 | 100.0% |
187
+ | 1 | 1132 | 3 | 99.7% |
188
+ | 2 | 1025 | 7 | 99.3% |
189
+ | 3 | 1008 | 2 | 99.8% |
190
+ | 4 | 976 | 6 | 99.4% |
191
+ | 5 | 885 | 7 | 99.2% |
192
+ | 6 | 949 | 9 | 99.1% |
193
+ | 7 | 1020 | 8 | 99.2% |
194
+ | 8 | 968 | 6 | 99.4% |
195
+ | 9 | 1000 | 9 | 99.1% |
196
 
197
  #### Summary
198
 
199
+ The model achieves **99.43% accuracy** on the MNIST test set (57 total errors out of 10,000). Digit 0 achieves perfect classification. The most challenging classes are 6 and 9 (9 errors each), consistent with their visual similarity.
200
 
201
  ---
202
 
203
  ## Model Examination
204
 
205
+ The model's convolutional filters learn edge detectors and stroke patterns in early layers, which compose into digit-specific features in deeper layers. Standard CNN interpretability techniques (e.g., Grad-CAM) can be applied to visualize which regions most influence predictions.
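
As an illustration, a minimal Grad-CAM pass can be written with forward/backward hooks. This is a generic sketch, not code from this repository; `conv_layer` is whichever convolution you want to inspect:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cam(model, conv_layer, x, class_idx):
    """Return a (B, H, W) heatmap of where conv_layer supports class_idx."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(x)
        model.zero_grad()
        logits[0, class_idx].backward()  # gradient of the target class score
    finally:
        h1.remove()
        h2.remove()
    # Channel weights: global-average-pool the gradients over spatial dims
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * acts["a"]).sum(dim=1))  # weighted sum of activations
    return cam / (cam.max() + 1e-8)                 # normalize to [0, 1]
```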
206
 
207
  ---
208
 
 
213
  | Factor | Value |
214
  |-----------------|------------------------|
215
  | Hardware Type | NVIDIA T4 GPU |
216
+ | Hours Used | ~0.2 hrs (10 min) |
217
+ | Cloud Provider | Google Colab |
218
  | Compute Region | Singapore |
219
+ | Carbon Emitted | ~0.01 kg CO₂eq (est.) |
220
 
 
224
 
225
  ### Model Architecture
226
 
227
+ The model uses 4 convolutional blocks followed by a compact fully connected head.
228
+
229
  #### Convolutional Blocks
230
 
231
+ | Block | Layer | Output Shape | Details |
232
+ |---------|-------------|----------------|--------------------------------------|
233
+ | Block 1 | Conv2d | (32, 28, 28) | 32 filters, 3×3, padding=1 |
234
+ | | BatchNorm2d | (32, 28, 28) | — |
235
+ | | ReLU | (32, 28, 28) | — |
236
+ | | MaxPool2d | (32, 14, 14) | 2×2 |
237
+ | | Dropout2d | (32, 14, 14) | p=0.25 |
238
+ | Block 2 | Conv2d | (64, 14, 14) | 64 filters, 3×3, padding=1 |
239
+ | | BatchNorm2d | (64, 14, 14) | — |
240
+ | | ReLU | (64, 14, 14) | — |
241
+ | | MaxPool2d | (64, 7, 7) | 2×2 |
242
+ | | Dropout2d | (64, 7, 7) | p=0.25 |
243
+ | Block 3 | Conv2d | (128, 7, 7) | 128 filters, 3×3, padding=1 |
244
+ | | BatchNorm2d | (128, 7, 7) | — |
245
+ | | ReLU | (128, 7, 7) | — |
246
+ | | MaxPool2d | (128, 3, 3) | 2×2 |
247
+ | | Dropout2d | (128, 3, 3) | p=0.25 |
248
+ | Block 4 | Conv2d | (256, 3, 3) | 256 filters, **1×1** kernel (no pad) |
249
+ | | BatchNorm2d | (256, 3, 3) | — |
250
+ | | ReLU | (256, 3, 3) | — |
251
+ | | MaxPool2d | (256, 1, 1) | 2×2 |
252
+ | | Dropout2d | (256, 1, 1) | p=0.25 |
253
 
254
  #### Fully Connected Layers
255
 
256
+ | Layer | Output | Details |
257
+ |----------|--------|----------------------|
258
+ | Flatten | 256 | 256 × 1 × 1 = 256 |
259
+ | Linear | 128 | + ReLU + Dropout(0.25) |
260
+ | Linear | 10 | Raw logits |
261
+
262
+ **Total Parameters: 160,842**
263
 
264
+ #### Shape Flow
265
+
266
+ ```
267
+ Input: (B, 1, 28, 28)
268
+ Block 1: (B, 32, 14, 14)
269
+ Block 2: (B, 64, 7, 7)
270
+ Block 3: (B, 128, 3, 3)
271
+ Block 4: (B, 256, 1, 1)
272
+ Flatten: (B, 256)
273
+ FC1: (B, 128)
274
+ Output: (B, 10)
275
+ ```
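
Putting the tables together, the architecture can be sketched as a PyTorch module. Layer and class names here are illustrative, not the author's source, but the parameter count matches the 160,842 reported above:

```python
import torch
import torch.nn as nn

class DigitCNN(nn.Module):
    """Sketch of the 4-block CNN described in the tables above."""

    def __init__(self, p_drop: float = 0.25):
        super().__init__()

        def block(c_in, c_out, kernel, padding):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel, padding=padding),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Dropout2d(p_drop),
            )

        self.features = nn.Sequential(
            block(1, 32, 3, 1),     # (B, 32, 14, 14)
            block(32, 64, 3, 1),    # (B, 64, 7, 7)
            block(64, 128, 3, 1),   # (B, 128, 3, 3)
            block(128, 256, 1, 0),  # (B, 256, 1, 1) — 1×1 kernel, no padding
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),           # (B, 256)
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(128, 10),     # raw logits
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```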
276
 
277
  ### Compute Infrastructure
278
 
279
+ - **Hardware:** NVIDIA T4 GPU (Google Colab)
280
  - **Software:** Python 3.10+, PyTorch 2.0, torchvision
281
 
282
  ---
 
288
  **BibTeX:**
289
  ```bibtex
290
  @misc{digit-classifier-2026,
291
+ author = {Abdul Rafay},
292
  title = {Handwritten Digit Classifier (CNN on MNIST)},
293
  year = {2026},
294
  publisher = {Hugging Face},
 
309
  | MNIST | A benchmark dataset of 70,000 handwritten digit images |
310
  | Softmax | Activation function that converts raw outputs to probabilities summing to 1 |
311
  | Dropout | Regularization technique that randomly disables neurons during training |
312
+ | BatchNorm | Batch Normalization — normalizes layer activations to stabilize and speed up training |
313
+ | OneCycleLR | Learning rate schedule with warmup and cosine decay for faster convergence |
314
+ | Label Smoothing | Softens hard targets to reduce overconfidence and improve generalization |
315
  | Grad-CAM | Gradient-weighted Class Activation Mapping — a model interpretability technique |
316
 
317
  ---