---
license: mit
language:
- en
library_name: sklearn
tags:
- mnist
- image-classification
- digits
- handwritten
- computer-vision
- logistic-regression
- machine-learning
datasets:
- ylecun/mnist
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: image-classification
---

# MNIST Handwritten Digit Classifier

A classical machine learning approach to handwritten digit recognition using Logistic Regression on the MNIST dataset.

## Model Description

This model classifies 28x28 grayscale images of handwritten digits (0-9) using a simple yet effective Logistic Regression classifier. The project serves as an introduction to image classification and the MNIST dataset.

### Intended Uses

- **Educational**: Learning image classification fundamentals
- **Benchmarking**: Baseline for comparing more complex models
- **Research**: Exploring classical ML on image data
- **Prototyping**: Quick digit recognition experiments

## Training Data

**Dataset**: [ylecun/mnist](https://huggingface.co/datasets/ylecun/mnist)

| Split | Images |
|-------|--------|
| Train | 60,000 |
| Test | 10,000 |
| **Total** | **70,000** |

### Data Characteristics

| Property | Value |
|----------|-------|
| Image Size | 28 x 28 pixels |
| Channels | 1 (Grayscale) |
| Classes | 10 (digits 0-9) |
| Pixel Range | 0-255 (raw), 0-1 (normalized) |
| Format | PNG/NumPy arrays |

### Class Distribution

The dataset is relatively balanced across all 10 digit classes.
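This balance can be verified from the label column once the data is loaded (see the Usage section). A minimal sketch of such a check, shown here on a hypothetical toy label vector rather than the real MNIST labels:

```python
import numpy as np

def class_distribution(labels):
    """Return per-class counts and the max/min imbalance ratio."""
    counts = np.bincount(np.asarray(labels), minlength=10)
    return counts, counts.max() / counts.min()

# Toy stand-in: 5 samples of each digit (perfectly balanced)
toy_labels = np.repeat(np.arange(10), 5)
counts, ratio = class_distribution(toy_labels)
print(counts)  # [5 5 5 5 5 5 5 5 5 5]
print(ratio)   # 1.0
```

Applied to the real `y_train` from the Usage section, the ratio stays close to 1, with no digit class dominating.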
## Model Architecture

### Preprocessing Pipeline

```
Raw Image (28x28, uint8)
    ↓
Normalize to [0, 1] (divide by 255)
    ↓
Flatten to vector (784 dimensions)
    ↓
Logistic Regression Classifier
    ↓
Softmax Probabilities (10 classes)
```

### Classifier Configuration

```python
LogisticRegression(
    max_iter=100,
    solver='lbfgs',
    multi_class='multinomial',
    n_jobs=-1
)
```

| Parameter | Value | Description |
|-----------|-------|-------------|
| max_iter | 100 | Maximum iterations for convergence |
| solver | lbfgs | L-BFGS optimization algorithm |
| multi_class | multinomial | True multiclass (not one-vs-rest) |
| n_jobs | -1 | Use all CPU cores |

*Note: the `multi_class` parameter is deprecated in scikit-learn >= 1.5. Multinomial behavior is already the default for the `lbfgs` solver, so the argument can simply be omitted on newer versions.*

## Performance

### Test Set Results

| Metric | Score |
|--------|-------|
| Accuracy | ~92% |
| Macro F1 | ~92% |
| Macro Precision | ~92% |
| Macro Recall | ~92% |

### Per-Class Performance

| Digit | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| 0 | ~0.95 | ~0.97 | ~0.96 |
| 1 | ~0.95 | ~0.97 | ~0.96 |
| 2 | ~0.91 | ~0.89 | ~0.90 |
| 3 | ~0.89 | ~0.90 | ~0.90 |
| 4 | ~0.92 | ~0.92 | ~0.92 |
| 5 | ~0.88 | ~0.87 | ~0.87 |
| 6 | ~0.94 | ~0.95 | ~0.94 |
| 7 | ~0.93 | ~0.91 | ~0.92 |
| 8 | ~0.88 | ~0.87 | ~0.88 |
| 9 | ~0.89 | ~0.90 | ~0.90 |

*Note: Performance varies slightly between runs.*

### Common Confusion Pairs

- 4 ↔ 9 (similar upper loops)
- 3 ↔ 8 (curved shapes)
- 5 ↔ 3 (similar strokes)
- 7 ↔ 1 (vertical strokes)

## Usage

### Installation

```bash
pip install scikit-learn pandas numpy matplotlib seaborn pillow
```

### Load and Preprocess Data

```python
import pandas as pd
import numpy as np
from io import BytesIO
from PIL import Image

# Load from Hugging Face
df_train = pd.read_parquet("hf://datasets/ylecun/mnist/mnist/train-00000-of-00001.parquet")
df_test = pd.read_parquet("hf://datasets/ylecun/mnist/mnist/test-00000-of-00001.parquet")

def extract_image(row):
    """Extract an image cell as a 28x28 numpy array."""
    img_data = row['image']
    if isinstance(img_data, dict) and 'bytes' in img_data:
        return np.array(Image.open(BytesIO(img_data['bytes'])))
    if isinstance(img_data, Image.Image):
        return np.array(img_data)
    return np.array(img_data)

# Prepare data
X_train = np.array([extract_image(row) for _, row in df_train.iterrows()])
y_train = df_train['label'].values
X_test = np.array([extract_image(row) for _, row in df_test.iterrows()])
y_test = df_test['label'].values

# Normalize and flatten
X_train_flat = X_train.astype('float32').reshape(-1, 784) / 255.0
X_test_flat = X_test.astype('float32').reshape(-1, 784) / 255.0
```

### Train Model

```python
from sklearn.linear_model import LogisticRegression
import joblib

model = LogisticRegression(
    max_iter=100,
    solver='lbfgs',
    multi_class='multinomial',
    n_jobs=-1
)
model.fit(X_train_flat, y_train)

# Save for later inference
joblib.dump(model, 'mnist_model.pkl')
```

### Inference

```python
import joblib
import numpy as np
from PIL import Image

# Load model
model = joblib.load('mnist_model.pkl')

def predict_digit(image):
    """
    image: 28x28 numpy array or PIL Image
    returns: (predicted digit 0-9, array of 10 class probabilities)
    """
    if isinstance(image, Image.Image):
        image = np.array(image)

    # Preprocess
    image_flat = image.astype('float32').reshape(1, 784) / 255.0

    # Predict
    prediction = model.predict(image_flat)[0]
    probabilities = model.predict_proba(image_flat)[0]
    return prediction, probabilities

# Example
digit, probs = predict_digit(test_image)
print(f"Predicted: {digit} (confidence: {probs[digit]:.2%})")
```

### Visualization

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Confusion Matrix (uses the test arrays prepared above)
y_pred = model.predict(X_test_flat)
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=range(10), yticklabels=range(10))
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - MNIST')
plt.show()
```

### Average Digit Visualization

```python
# Compute and plot the mean image per digit
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for digit in range(10):
    ax = axes[digit // 5, digit % 5]
    mask = y_train == digit
    mean_img = X_train[mask].mean(axis=0)
    ax.imshow(mean_img, cmap='hot')
    ax.set_title(f'Digit: {digit}')
    ax.axis('off')
plt.tight_layout()
plt.show()
```

## Limitations
- **Simple Model**: Logistic Regression doesn't capture spatial relationships
- **No Data Augmentation**: Sensitive to rotation, scaling, and translation
- **Grayscale Only**: Won't work with color images
- **Fixed Size**: Requires exactly 28x28 input
- **Clean Data**: Struggles with noisy or poorly centered digits

## Comparison with Other Approaches

| Model | MNIST Accuracy |
|-------|----------------|
| **Logistic Regression** | **~92%** |
| Random Forest | ~97% |
| SVM (RBF kernel) | ~98% |
| MLP (2 hidden layers) | ~98% |
| CNN (LeNet-5) | ~99% |
| Modern CNNs | ~99.7% |

## Technical Specifications

### Dependencies

```
scikit-learn>=1.0.0
pandas>=1.3.0
numpy>=1.20.0
matplotlib>=3.4.0
seaborn>=0.11.0
pillow>=8.0.0
```

### Hardware Requirements

| Task | Hardware | Time |
|------|----------|------|
| Training | CPU | ~2-5 min |
| Inference | CPU | < 1 ms per image |
| Memory | RAM | ~500 MB |

## Files

```
MNIST/
├── README_HF.md             # This model card
├── mnist_exploration.ipynb  # Full exploration notebook
├── mnist_model.pkl          # Trained model (generated)
└── figures/                 # Visualizations (generated)
```

## Citation

```bibtex
@article{lecun1998mnist,
  title={Gradient-based learning applied to document recognition},
  author={LeCun, Yann and Bottou, L{\'e}on and Bengio, Yoshua and Haffner, Patrick},
  journal={Proceedings of the IEEE},
  volume={86},
  number={11},
  pages={2278--2324},
  year={1998}
}

@misc{mnist_hf,
  title={MNIST Dataset},
  author={LeCun, Yann and Cortes, Corinna and Burges, Christopher J.C.},
  howpublished={Hugging Face Datasets},
  url={https://huggingface.co/datasets/ylecun/mnist}
}
```

## License

MIT License

## Acknowledgments

- Yann LeCun for creating MNIST
- Scikit-learn team for the ML library
- Hugging Face for dataset hosting

---

## Next Steps

For better performance, consider:

1. **More Complex Models**: SVM, Random Forest, Neural Networks
2. **Deep Learning**: CNNs with PyTorch/TensorFlow
3. **Data Augmentation**: Rotation, scaling, elastic deformations
4. **Feature Engineering**: HOG, SIFT features
5. **Ensemble Methods**: Combine multiple classifiers
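The Random Forest row in the comparison table can be reproduced along these lines. To stay self-contained, this sketch uses scikit-learn's bundled 8x8 `load_digits` dataset instead of MNIST (no download required); substituting the `X_train_flat` / `y_train` arrays from the Usage section gives the MNIST figure. Hyperparameters here are illustrative, not tuned.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Bundled 8x8 digits as a small stand-in for MNIST
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_tr, y_tr)
print(f"Random Forest accuracy: {accuracy_score(y_te, rf.predict(X_te)):.4f}")
```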
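For the ensemble suggestion, scikit-learn's `VotingClassifier` combines fitted classifiers directly; soft voting averages each member's predicted probabilities. Again a sketch on the bundled `load_digits` data rather than MNIST, with illustrative hyperparameters:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Soft voting averages each member's predict_proba output
ensemble = VotingClassifier(
    estimators=[
        ('logreg', LogisticRegression(max_iter=1000)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ],
    voting='soft',
)
ensemble.fit(X_tr, y_tr)
print(f"Ensemble accuracy: {ensemble.score(X_te, y_te):.4f}")
```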