---
license: mit
language:
- en
library_name: sklearn
tags:
- mnist
- image-classification
- digits
- handwritten
- computer-vision
- logistic-regression
- machine-learning
datasets:
- ylecun/mnist
metrics:
- accuracy
- f1
- precision
- recall
pipeline_tag: image-classification
---

# MNIST Handwritten Digit Classifier

A classical machine learning approach to handwritten digit recognition using Logistic Regression on the MNIST dataset.

## Model Description

This model classifies 28x28 grayscale images of handwritten digits (0-9) using a simple yet effective Logistic Regression classifier. The project serves as an introduction to image classification and the MNIST dataset.

### Intended Uses

- **Educational**: Learning image classification fundamentals
- **Benchmarking**: Baseline for comparing more complex models
- **Research**: Exploring classical ML on image data
- **Prototyping**: Quick digit recognition experiments

## Training Data

**Dataset**: [ylecun/mnist](https://huggingface.co/datasets/ylecun/mnist)

| Split | Images |
|-------|--------|
| Train | 60,000 |
| Test | 10,000 |
| **Total** | **70,000** |

### Data Characteristics

| Property | Value |
|----------|-------|
| Image Size | 28 x 28 pixels |
| Channels | 1 (Grayscale) |
| Classes | 10 (digits 0-9) |
| Pixel Range | 0-255 (raw), 0-1 (normalized) |
| Format | PNG/NumPy arrays |

### Class Distribution

The dataset is relatively balanced across all 10 digit classes.

## Model Architecture

### Preprocessing Pipeline

```
Raw Image (28x28, uint8)
        ↓
Normalize to [0, 1] (divide by 255)
        ↓
Flatten to vector (784 dimensions)
        ↓
Logistic Regression Classifier
        ↓
Softmax Probabilities (10 classes)
```
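
The steps above can also be expressed as a single scikit-learn `Pipeline`, which keeps the normalization, flattening, and classifier together in one object. A minimal sketch (the smoke-test data here is synthetic, not MNIST):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Normalize uint8 pixels to [0, 1] and flatten 28x28 images to 784-vectors
preprocess = FunctionTransformer(
    lambda X: X.astype("float32").reshape(len(X), -1) / 255.0
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=100, solver="lbfgs", n_jobs=-1)),
])

# Smoke test on tiny random "images" with two fake classes
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(20, 28, 28), dtype=np.uint8)
y_demo = np.arange(20) % 2
pipeline.fit(X_demo, y_demo)
print(pipeline.predict(X_demo[:3]).shape)  # (3,)
```

On real data, `pipeline.fit(X_train, y_train)` would accept the raw 28x28 image stack directly, since the transformer handles flattening.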

### Classifier Configuration

```python
LogisticRegression(
    max_iter=100,
    solver='lbfgs',
    multi_class='multinomial',
    n_jobs=-1
)
```

| Parameter | Value | Description |
|-----------|-------|-------------|
| max_iter | 100 | Maximum iterations for convergence |
| solver | lbfgs | L-BFGS optimization algorithm |
| multi_class | multinomial | True multiclass (not OvR) |
| n_jobs | -1 | Use all CPU cores |

Note: the `multi_class` argument is deprecated in recent scikit-learn releases; the `lbfgs` solver uses the multinomial loss by default, so the argument can simply be omitted there.

## Performance

### Test Set Results

| Metric | Score |
|--------|-------|
| Accuracy | ~92% |
| Macro F1 | ~92% |
| Macro Precision | ~92% |
| Macro Recall | ~92% |

### Per-Class Performance

| Digit | Precision | Recall | F1-Score |
|-------|-----------|--------|----------|
| 0 | ~0.95 | ~0.97 | ~0.96 |
| 1 | ~0.95 | ~0.97 | ~0.96 |
| 2 | ~0.91 | ~0.89 | ~0.90 |
| 3 | ~0.89 | ~0.90 | ~0.90 |
| 4 | ~0.92 | ~0.92 | ~0.92 |
| 5 | ~0.88 | ~0.87 | ~0.87 |
| 6 | ~0.94 | ~0.95 | ~0.94 |
| 7 | ~0.93 | ~0.91 | ~0.92 |
| 8 | ~0.88 | ~0.87 | ~0.88 |
| 9 | ~0.89 | ~0.90 | ~0.90 |

*Note: Performance varies slightly between runs.*
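
Metrics like those tabulated above can be reproduced with `sklearn.metrics.classification_report`. A sketch with stand-in labels (in practice, `y_test` and `y_pred` would be the real test labels and `model.predict(X_test_flat)`):

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Stand-in labels for illustration only
y_test = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 1, 2])

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")   # Accuracy: 87.50%
print(classification_report(y_test, y_pred, digits=3))      # per-class P/R/F1
```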

### Common Confusion Pairs

- 4 ↔ 9 (similar upper loops)
- 3 ↔ 8 (curved shapes)
- 5 ↔ 3 (similar strokes)
- 7 ↔ 1 (vertical strokes)
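
Pairs like these can be read off programmatically by ranking the largest off-diagonal entries of a confusion matrix. A minimal sketch on a made-up 4-class matrix:

```python
import numpy as np

# Toy confusion matrix (rows: true label, cols: predicted label)
cm = np.array([
    [50,  0,  1,  2],
    [ 0, 48,  5,  0],
    [ 1,  6, 44,  1],
    [ 3,  0,  1, 49],
])

# Zero the diagonal, then sort the remaining cells by count
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
pairs = sorted(
    ((off_diag[i, j], i, j) for i in range(len(cm)) for j in range(len(cm))),
    reverse=True,
)
for count, true_label, pred_label in pairs[:3]:
    print(f"{true_label} -> {pred_label}: {count} misclassifications")
```

On MNIST, running this on the 10x10 matrix from the Visualization section below surfaces the same pairs listed above.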

## Usage

### Installation

```bash
pip install scikit-learn pandas numpy matplotlib seaborn pillow huggingface_hub
```

(`huggingface_hub` provides the `hf://` filesystem used by `pd.read_parquet` below.)

### Load and Preprocess Data

```python
import pandas as pd
import numpy as np
from io import BytesIO
from PIL import Image

# Load from Hugging Face
df_train = pd.read_parquet("hf://datasets/ylecun/mnist/mnist/train-00000-of-00001.parquet")
df_test = pd.read_parquet("hf://datasets/ylecun/mnist/mnist/test-00000-of-00001.parquet")

def extract_image(row):
    """Extract the 'image' entry of a row as a 28x28 numpy array."""
    img_data = row['image']
    if isinstance(img_data, dict) and 'bytes' in img_data:
        # Parquet stores the image as encoded bytes; decode with PIL
        return np.array(Image.open(BytesIO(img_data['bytes'])))
    # Already a PIL Image or array-like
    return np.array(img_data)

# Prepare data
X_train = np.array([extract_image(row) for _, row in df_train.iterrows()])
y_train = df_train['label'].values

# Normalize and flatten
X_train_flat = X_train.astype('float32').reshape(-1, 784) / 255.0
```

### Train Model

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=100,
    solver='lbfgs',
    multi_class='multinomial',
    n_jobs=-1
)
model.fit(X_train_flat, y_train)
```
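
The inference snippet below loads `mnist_model.pkl`, so the trained model has to be saved first. A minimal round-trip sketch with `joblib` (a tiny stand-in model keeps the example self-contained; in practice, dump the `model` fitted above):

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model on toy data; replace with the MNIST-fitted `model`
X = np.array([[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.8]])
y = np.array([0, 1, 0, 1])
model = LogisticRegression().fit(X, y)

joblib.dump(model, 'mnist_model.pkl')      # save to disk
restored = joblib.load('mnist_model.pkl')  # reload
print((restored.predict(X) == model.predict(X)).all())  # True
```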

### Inference

```python
import joblib
import numpy as np
from PIL import Image

# Load model
model = joblib.load('mnist_model.pkl')

# Predict single image
def predict_digit(image):
    """
    image: 28x28 numpy array or PIL Image
    returns: (predicted digit 0-9, array of 10 class probabilities)
    """
    if isinstance(image, Image.Image):
        image = np.array(image)

    # Preprocess: normalize and flatten, matching training
    image_flat = image.astype('float32').reshape(1, 784) / 255.0

    # Predict
    prediction = model.predict(image_flat)[0]
    probabilities = model.predict_proba(image_flat)[0]

    return prediction, probabilities

# Example (test_image: any 28x28 grayscale array, e.g. one extracted from df_test)
digit, probs = predict_digit(test_image)
print(f"Predicted: {digit} (confidence: {probs[digit]:.2%})")
```

### Visualization

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Confusion matrix on the test set
# (X_test_flat and y_test are prepared from df_test the same way as the training data)
y_pred = model.predict(X_test_flat)
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=range(10), yticklabels=range(10))
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - MNIST')
plt.show()
```

### Average Digit Visualization

```python
# Compute and plot the mean image for each digit class
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for digit in range(10):
    ax = axes[digit // 5, digit % 5]
    mask = y_train == digit
    mean_img = X_train[mask].mean(axis=0)
    ax.imshow(mean_img, cmap='hot')
    ax.set_title(f'Digit: {digit}')
    ax.axis('off')
plt.tight_layout()
plt.show()
```

## Limitations

- **Simple Model**: Logistic Regression doesn't capture spatial relationships
- **No Data Augmentation**: Sensitive to rotation, scaling, translation
- **Grayscale Only**: Won't work with color images
- **Fixed Size**: Requires exactly 28x28 input
- **Clean Data**: Struggles with noisy or poorly centered digits
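
To feed images that are not already 28x28 grayscale, they must first be converted and resized. A minimal preprocessing sketch with Pillow (this assumes MNIST-style dark-background, light-digit images; real-world photos typically also need inversion and centering):

```python
import numpy as np
from PIL import Image

def to_mnist_format(img: Image.Image) -> np.ndarray:
    """Convert an arbitrary PIL image to a 28x28 grayscale uint8 array."""
    gray = img.convert('L')                        # collapse to 1 channel
    small = gray.resize((28, 28), Image.BILINEAR)  # force fixed size
    return np.array(small, dtype=np.uint8)

# Example with a synthetic 100x80 RGB image
rgb = Image.new('RGB', (100, 80), color=(30, 30, 30))
arr = to_mnist_format(rgb)
print(arr.shape, arr.dtype)  # (28, 28) uint8
```

The resulting array can then go straight into `predict_digit` above.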

## Comparison with Other Approaches

| Model | MNIST Accuracy |
|-------|----------------|
| **Logistic Regression** | **~92%** |
| Random Forest | ~97% |
| SVM (RBF kernel) | ~98% |
| MLP (2 hidden layers) | ~98% |
| CNN (LeNet-5) | ~99% |
| Modern CNNs | ~99.7% |

## Technical Specifications

### Dependencies

```
scikit-learn>=1.0.0
pandas>=1.3.0
numpy>=1.20.0
matplotlib>=3.4.0
seaborn>=0.11.0
pillow>=8.0.0
huggingface_hub
```

### Hardware Requirements

| Task | Hardware | Time |
|------|----------|------|
| Training | CPU | ~2-5 min |
| Inference | CPU | < 1 ms per image |

Peak memory usage is roughly 500 MB of RAM.

## Files

```
MNIST/
├── README_HF.md             # This model card
├── mnist_exploration.ipynb  # Full exploration notebook
├── mnist_model.pkl          # Trained model (generated)
└── figures/                 # Visualizations (generated)
```

## Citation

```bibtex
@article{lecun1998mnist,
  title={Gradient-based learning applied to document recognition},
  author={LeCun, Yann and Bottou, L{\'e}on and Bengio, Yoshua and Haffner, Patrick},
  journal={Proceedings of the IEEE},
  volume={86},
  number={11},
  pages={2278--2324},
  year={1998}
}

@misc{mnist_hf,
  title={MNIST Dataset},
  author={LeCun, Yann and Cortes, Corinna and Burges, Christopher J.C.},
  howpublished={Hugging Face Datasets},
  url={https://huggingface.co/datasets/ylecun/mnist}
}
```

## License

MIT License

## Acknowledgments

- Yann LeCun for creating MNIST
- Scikit-learn team for the ML library
- Hugging Face for dataset hosting

---

## Next Steps

For better performance, consider:

1. **More Complex Models**: SVM, Random Forest, Neural Networks
2. **Deep Learning**: CNNs with PyTorch/TensorFlow
3. **Data Augmentation**: Rotation, scaling, elastic deformations
4. **Feature Engineering**: HOG, SIFT features
5. **Ensemble Methods**: Combine multiple classifiers
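
As an illustration of the ensemble idea in step 5, scikit-learn's `VotingClassifier` can combine several of the models listed above. A minimal sketch on toy data (on MNIST, fit it on `X_train_flat` instead; SVC on 60,000 samples is slow):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy separable 2-class data standing in for the flattened digit vectors
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 4)) + np.repeat([[0], [3]], 30, axis=0)
y = np.repeat([0, 1], 30)

ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=200)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('svm', SVC(probability=True, random_state=0)),
    ],
    voting='soft',  # average predicted class probabilities
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```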