---
language: en
license: mit
tags:
- text-generation
- gpt2
- mlx
- apple-silicon
- knowledge-distillation
- finewebedu
- text-completion
datasets:
- roneneldan/TinyStories
- HuggingFaceFW/fineweb-edu
library_name: transformers
pipeline_tag: text-generation
model-index:
- name: nanoGPT-MLX-53M
  results:
  - task:
      type: text-generation
    dataset:
      name: FineWebEdu
      type: HuggingFaceFW/fineweb-edu
    metrics:
    - name: Training Loss
      type: loss
      value: 3.46
    - name: Validation Loss
      type: loss
      value: 6.71
---
# nanoGPT-MLX-53M: Ultra-Fast GPT on Apple Silicon

⚡ 25,476 tokens/sec inference | 🚀 157 tokens/sec generation | 💾 101 MB model size | ⏱️ 161 ms latency

A compact 53M-parameter GPT model trained with knowledge distillation in under 3 hours on an Apple M2 Pro, optimized for speed and efficiency with the MLX framework.
Perfect for:

- 📱 On-device text generation
- ⚡ Low-latency applications
- 🎓 Educational projects & prototyping
- 💻 Resource-constrained environments
Key achievement: Inference runs 3.6x faster than training throughput (25,476 vs. ~7,000 tokens/sec), thanks to MLX optimization on Apple Silicon.
## Quick Stats

| Metric | Value |
|---|---|
| ⚡ Inference Speed | 25,476 tokens/sec (batch) |
| 🚀 Generation Speed | 157.5 tokens/sec (real-time) |
| 💾 Model Size (FP16) | 101 MB |
| 💾 Model Size (FP32) | 202 MB |
| ⏱️ Latency (avg) | 161 ms |
| ⏱️ Latency (P95) | 172 ms |
| 📊 Parameters | 53M (8 layers, 384d, 8 heads) |
| 🎓 Teacher Model | GPT-OSS-20B (377x larger) |
| 📚 Training Data | FineWebEdu (10M tokens) |
| ⏰ Training Time | 2.7 hours on M2 Pro |
## Model Description

- Architecture: GPT-2 style transformer (see the config sketch after this list)
- Parameters: 53,990,464 (53M) - compact and efficient
- Training Framework: MLX (Apple Silicon optimized)
- Context Length: 512 tokens
- Vocabulary: 50,257 tokens (GPT-2 tokenizer)
- Training Method: Knowledge Distillation from GPT-OSS-20B (20B params)
- Training Data: FineWebEdu (10M tokens of high-quality educational web content)
- Hardware: M2 Pro with 16GB RAM (consumer laptop!)
- Training Duration: 35,000 iterations (~161 minutes)
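For orientation, the numbers above map onto a Hugging Face `GPT2Config` roughly as follows. This is a sketch for illustration; the mapping is an assumption, not the exact config shipped with the checkpoint:

```python
from transformers import GPT2Config

# Field values taken from the model card above; the config itself is assumed.
config = GPT2Config(
    vocab_size=50_257,  # GPT-2 tokenizer vocabulary
    n_positions=512,    # context length
    n_embd=384,         # embedding / hidden dimension
    n_layer=8,          # transformer blocks
    n_head=8,           # attention heads
)
```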
## Model Architecture

```
├── Embedding Layer: 50,257 vocab × 384 dim
├── 8× Transformer Blocks
│   ├── Multi-Head Attention (8 heads)
│   ├── Layer Normalization
│   ├── Feed-Forward Network (384 → 1536 → 384)
│   └── Residual Connections
├── Final Layer Normalization
└── Language Model Head (tied with embeddings)
```
Total Parameters: ~53M

- Embedding parameters: ~20M
- Transformer parameters: ~33M
- Weight tying: embedding weights shared with the output layer (see the sketch below)
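A minimal sketch of what weight tying looks like in MLX, assuming the `nn.Embedding.as_linear` helper; class and attribute names here are illustrative, not the repository's actual module:

```python
import mlx.nn as nn

class TiedGPTSkeleton(nn.Module):
    """Illustrative skeleton showing the tied embedding / output head."""

    def __init__(self, vocab_size: int = 50_257, dims: int = 384):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, dims)

    def __call__(self, token_ids):
        x = self.wte(token_ids)       # (batch, seq, dims) token embeddings
        # ... 8 transformer blocks and a final LayerNorm would run here ...
        return self.wte.as_linear(x)  # logits computed with the same weight matrix
```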
## Training Details

### Training Data

Dataset: FineWebEdu
- Source: HuggingFaceFW/fineweb-edu (see the loading sketch below)
- Size: 10M tokens
- Content: High-quality educational web content
- Topics: Science, technology, culture, history, and more
- Quality: Filtered for educational value and coherence
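For example, a small sample of FineWebEdu can be streamed with the `datasets` library. The `sample-10BT` subset name is one of the dataset's published configs; whether the author used exactly this subset is an assumption:

```python
from datasets import load_dataset

# Stream the dataset rather than downloading it in full; "sample-10BT" is one
# of FineWebEdu's published sample configs (assumed here, not stated by the card).
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

for doc in ds.take(3):  # peek at a few documents
    print(doc["text"][:200])
```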
Initial Base: TinyStories

- Used for initial model warm-up before distillation
- Helps the model learn basic language structure
### Training Procedure

- Optimizer: AdamW
- Learning Rate: 3e-4 with cosine decay to 1.5e-5 (see the schedule sketch below)
- Warmup: 2,000 iterations
- Batch Size: 12
- Total Iterations: 35,000
- Hardware: Apple M2 Pro (16GB RAM)
- Training Speed: ~7,000 tokens/sec
- Training Time: 161 minutes (~2.7 hours)
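A minimal sketch of this optimizer setup using MLX's built-in schedule helpers (`linear_schedule`, `cosine_decay`, `join_schedules`); the actual training script may wire this up differently:

```python
import mlx.optimizers as optim

WARMUP_STEPS = 2_000
TOTAL_STEPS = 35_000

# Linear warmup to 3e-4, then cosine decay down to 1.5e-5.
warmup = optim.linear_schedule(0.0, 3e-4, steps=WARMUP_STEPS)
decay = optim.cosine_decay(3e-4, TOTAL_STEPS - WARMUP_STEPS, end=1.5e-5)
lr_schedule = optim.join_schedules([warmup, decay], [WARMUP_STEPS])

optimizer = optim.AdamW(learning_rate=lr_schedule)
```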
### Knowledge Distillation

This model was trained using knowledge distillation:

- Teacher Model: GPT-OSS-20B (20B params) via Groq API
- Student Model: This 53M parameter model
- Distillation Method: A weighted combination of hard (cross-entropy) loss and soft (teacher-matching) loss, as sketched below
- Alpha: 0.7 (hard loss weight) / 0.3 (soft loss weight)
- Temperature: 2.0 for softening distributions
- Teacher Usage: ~1,099 teacher samples generated during training
- Benefit: Learns from the larger model's knowledge while maintaining efficiency
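A minimal sketch of the combined objective described above, written with MLX; the function and tensor names are illustrative assumptions, not the repository's actual code:

```python
import mlx.core as mx
import mlx.nn as nn

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.7, T=2.0):
    # Hard loss: standard next-token cross-entropy against the ground-truth tokens.
    hard = nn.losses.cross_entropy(student_logits, targets, reduction="mean")

    # Soft loss: cross-entropy against the teacher's temperature-softened
    # distribution (equivalent to KL divergence up to a term constant in the
    # student's parameters). The T**2 factor keeps gradient scale comparable.
    s = student_logits / T
    student_logp = s - mx.logsumexp(s, axis=-1, keepdims=True)  # log-softmax
    teacher_p = mx.softmax(teacher_logits / T, axis=-1)
    soft = -(teacher_p * student_logp).sum(axis=-1).mean() * (T ** 2)

    # alpha weights the hard loss (0.7); the remainder weights the soft loss (0.3).
    return alpha * hard + (1.0 - alpha) * soft
```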
## Intended Use

### Primary Use Cases
- Text Completion: Continuing and completing text passages
- Creative Writing: Story and narrative generation
- Educational: Learning about transformers and knowledge distillation
- Prototyping: Quick experiments with small-scale LLMs
- Resource-Constrained Environments: Running LLMs on consumer hardware
- MLX Framework Demonstration: Showcasing Apple Silicon training capabilities
### What This Model Does Well

- ✅ Text continuation with basic coherence
- ✅ Generating grammatically correct sentences
- ✅ Simple narrative patterns
- ✅ Fast inference on Apple Silicon
- ✅ Low resource requirements
### What This Model Does NOT Do

- ❌ Not a chat/assistant model: Not trained for conversation or instructions
- ❌ Limited reasoning: 53M parameters is too small for complex logic
- ❌ No factual accuracy: Not designed for knowledge retrieval
- ❌ Short context: Limited to 512 tokens
- ❌ Repetitive patterns: May generate loops in longer sequences
## Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "JackSuuu/nanogpt-mlx-53m-finewebedu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example 1: Story continuation (what it does best)
prompt = "Once upon a time, in a magical forest"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# Example 2: Text completion
prompt = "The scientist discovered that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=80,
    temperature=0.7,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Real Generation Examples

Prompt: "Once upon a time, in a magical forest"
Output: (model generates a story-like continuation with basic narrative structure)

Prompt: "The scientist discovered"
Output: (model continues with scientific-sounding text)

Note: This is a base language model, not an instruction-following or chat model. For best results, use natural text prompts rather than questions or commands.
### Using with MLX (Native)

```python
import mlx.core as mx
from src.model import create_model
from src.generate import generate_text

# Load MLX model
config = {...}  # Your config
model = create_model(config)
model.load_weights("checkpoint.npz")

# Generate
text = generate_text(
    model,
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.8,
)
print(text)
```
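Under the hood, temperature sampling in a generation loop like `generate_text` typically boils down to something like the following (an illustrative sketch, not the repository's implementation):

```python
import mlx.core as mx

def sample_next_token(logits: mx.array, temperature: float = 0.8) -> mx.array:
    # Lower temperatures sharpen the distribution; higher ones flatten it.
    # mx.random.categorical samples from unnormalized log-probabilities.
    return mx.random.categorical(logits / temperature)
```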
## Performance

### Inference Performance (What Users Care About 🚀)

| Metric | Value | Notes |
|---|---|---|
| Batch Inference | 25,476 tokens/sec | 3.6x faster than training |
| Real-time Generation | 157.5 tokens/sec | Interactive use case ready |
| Average Latency | 161ms | Low-latency applications |
| P95 Latency | 172ms | Consistent performance |
| P99 Latency | 179ms | Stable under load |
| Model Size (FP16) | 101 MB | Runs on mobile devices |
| Model Size (FP32) | 202 MB | Fits in RAM easily |
| Memory Usage | ~1.7GB | During training with batch=12 |
### Training Metrics

| Metric | Value | Notes |
|---|---|---|
| Training Loss | 3.46 | Final training loss |
| Validation Loss | 6.71 | Some overfitting (see below) |
| Best Validation Loss | 4.74 | Reached around iteration 15,000 |
| Training Speed | 7,000 tokens/sec | M2 Pro, batch size 12 |
| Training Time | 161 minutes (~2.7 hours) | Consumer hardware |
| Total Iterations | 35,000 | Training loss plateaued |
| Teacher Samples | 1,099 | From GPT-OSS-20B |
| Evaluation Speed | 24,779 tokens/sec | Fast validation |
### Model Quality

Perplexity: 827.85 (FineWebEdu validation set)

Context: This perplexity reflects the model's 53M-parameter size and the diversity of the FineWebEdu dataset (educational web content). For reference, GPT-2 Small (124M parameters) achieves ~29 perplexity on WebText, while GPT-2 Medium (355M) achieves ~26. Higher perplexity is expected for a compact model on complex content; within its size class, the model holds up well on text-completion tasks.
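As a sanity check, perplexity is the exponential of the average validation cross-entropy, so the two reported numbers are consistent:

$$
\mathrm{PPL} = e^{\mathcal{L}_{\text{val}}} \approx e^{6.71} \approx 821,
$$

which matches the measured 827.85 up to evaluation and rounding details (827.85 corresponds to a loss of $\ln 827.85 \approx 6.72$).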
### Model Characteristics

Strengths:

- ✅ Grammatically correct text generation
- ✅ Basic sentence structure understanding
- ✅ Fast inference on Apple Silicon
- ✅ Low memory footprint (~200 MB in FP32)
- ✅ Efficient knowledge distillation from a 20B teacher
Known Limitations:

- ⚠️ Overfitting: Validation loss (6.71) > training loss (3.46) indicates some overfitting
- ⚠️ Repetitive patterns: May generate repeated phrases in longer text
- ⚠️ Limited coherence: Best for 50-100 tokens; quality degrades beyond that
- ⚠️ Not factual: Not trained for accurate information retrieval
- ⚠️ No instruction following: Not a chat or assistant model
## Limitations and Biases

### Model Limitations

- Context Window: Limited to 512 tokens
- Model Size: 53M parameters limits capability compared with larger models
- Training Data: Only 10M tokens of FineWebEdu (plus a TinyStories warm-up), so the model may not generalize well
- Knowledge Cutoff: No specific knowledge cutoff (depends on the training data)
### Potential Biases

- Training data (the TinyStories warm-up and FineWebEdu) may carry biases present in children's literature and web content
- Limited diversity in training data
- No explicit bias-mitigation techniques applied
### Not Suitable For
- Production applications requiring factual accuracy
- Legal, medical, or financial advice
- Content requiring long-term coherence
- Tasks requiring reasoning or computation
## Training Infrastructure
- Hardware: Apple M2 Pro with 16GB RAM
- Framework: MLX 0.0.9+
- OS: macOS
- GPU: Apple Silicon GPU (Metal)
- Memory Usage: ~4-6GB during training
## Citation

If you use this model, please cite:

```bibtex
@software{nanogpt-mlx-53m,
  title  = {nanoGPT-MLX-53M: Compact GPT with Knowledge Distillation on Apple Silicon},
  author = {Jack Su},
  year   = {2025},
  url    = {https://github.com/JackSuuu/nanoGPT-on-MLX},
  note   = {53M parameter model trained using Apple MLX framework with knowledge distillation from GPT-OSS-20B}
}
```
## Related Work
- nanoGPT: Original PyTorch implementation by Andrej Karpathy
- MLX: Apple's array framework for machine learning on Apple silicon
- TinyStories: Dataset by Eldan & Li (Microsoft Research)
- FineWebEdu: High-quality web dataset by HuggingFace
## License

MIT License - see the repository for details.
## Acknowledgments
- MLX Team at Apple for the excellent framework
- TinyStories authors for the dataset
- HuggingFace for FineWebEdu and model hosting
- Andrej Karpathy for nanoGPT inspiration
## Model Card Authors

Jack Su

## Model Card Contact

For questions or issues, please open an issue on the GitHub repository.
## Training Notes

This model demonstrates:

- Efficient training on consumer hardware (M2 Pro, 16GB RAM)
- Knowledge distillation effectiveness for small models
- MLX framework capabilities on Apple Silicon
- Realistic expectations for 53M-parameter models

The model performs appropriately for its size: it is not meant to compete with billion-parameter models, but rather showcases what is achievable with limited resources and knowledge distillation.

This model is primarily for educational and research purposes. Use responsibly! 🚀