---
language: en
license: mit
tags:
  - text-generation
  - gpt2
  - mlx
  - apple-silicon
  - knowledge-distillation
  - finewebedu
  - text-completion
datasets:
  - roneneldan/TinyStories
  - HuggingFaceFW/fineweb-edu
library_name: transformers
pipeline_tag: text-generation
model-index:
  - name: nanoGPT-MLX-53M
    results:
      - task:
          type: text-generation
        dataset:
          name: FineWebEdu
          type: HuggingFaceFW/fineweb-edu
        metrics:
          - name: Training Loss
            type: loss
            value: 3.46
          - name: Validation Loss
            type: loss
            value: 6.71
---

# nanoGPT-MLX-53M: Ultra-Fast GPT on Apple Silicon

⚡ 25,476 tokens/sec inference | 🚀 157 tokens/sec generation | 💾 101MB model size | ⏱️ 161ms latency

A compact 53M-parameter GPT model trained with knowledge distillation in under 3 hours on an Apple M2 Pro, optimized for speed and efficiency with the MLX framework.

Perfect for:

- 📱 On-device text generation
- ⚡ Low-latency applications
- 🎓 Educational projects & prototyping
- 💻 Resource-constrained environments

Key Achievement: Inference throughput is 3.6x higher than training throughput, thanks to MLX optimization on Apple Silicon.

## Quick Stats

| Metric | Value |
|---|---|
| ⚡ Inference Speed | 25,476 tokens/sec (batch) |
| 🚀 Generation Speed | 157.5 tokens/sec (real-time) |
| 💾 Model Size (FP16) | 101 MB |
| 💾 Model Size (FP32) | 202 MB |
| ⏱️ Latency (avg) | 161 ms |
| ⏱️ Latency (P95) | 172 ms |
| 📊 Parameters | 53M (8 layers, 384d, 8 heads) |
| 🎓 Teacher Model | GPT-OSS-20B (377x larger) |
| 📚 Training Data | FineWebEdu (10M tokens) |
| ⏰ Training Time | 2.7 hours on M2 Pro |

## Model Description

- Architecture: GPT-2 style transformer
- Parameters: 53,990,464 (53M) - compact and efficient
- Training Framework: MLX (Apple Silicon optimized)
- Context Length: 512 tokens
- Vocabulary: 50,257 tokens (GPT-2 tokenizer)
- Training Method: Knowledge distillation from GPT-OSS-20B (20B params)
- Training Data: FineWebEdu (10M tokens of high-quality educational web content)
- Hardware: M2 Pro with 16GB RAM (consumer laptop!)
- Training Duration: 35,000 iterations (~161 minutes)

## Model Architecture

```
├── Embedding Layer: 50,257 vocab × 384 dim
├── 8× Transformer Blocks
│   ├── Multi-Head Attention (8 heads)
│   ├── Layer Normalization
│   ├── Feed-Forward Network (384 → 1536 → 384)
│   └── Residual Connections
├── Final Layer Normalization
└── Language Model Head (tied with embeddings)
```

Total Parameters: ~53M

- Embedding parameters: ~20M
- Transformer parameters: ~33M
- Weight tying: Embedding weights shared with output layer
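
For illustration, the dimensions above map onto a standard GPT-2 configuration. Below is a minimal sketch using Hugging Face `GPT2Config` to instantiate an equivalent architecture; the exact MLX training configuration (dropout, biases, exact parameter count) is not reproduced here and may differ.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical reconstruction from the numbers above; the actual MLX training
# config may differ in details (dropout, biases, exact parameter count).
config = GPT2Config(
    vocab_size=50257,          # GPT-2 tokenizer vocabulary
    n_positions=512,           # context length
    n_embd=384,                # model dimension
    n_layer=8,                 # transformer blocks
    n_head=8,                  # attention heads
    n_inner=1536,              # feed-forward hidden size (4 * n_embd)
    tie_word_embeddings=True,  # LM head shares weights with token embeddings
)
model = GPT2LMHeadModel(config)
```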

## Training Details

### Training Data

Dataset: FineWebEdu

- Source: HuggingFaceFW/fineweb-edu
- Size: 10M tokens
- Content: High-quality educational web content
- Topics: Science, technology, culture, history, and more
- Quality: Filtered for educational value and coherence

Initial Base: TinyStories

- Used for initial model warm-up before distillation
- Helps the model learn basic language structure

### Training Procedure

- Optimizer: AdamW
- Learning Rate: 3e-4 with cosine decay to 1.5e-5
- Warmup: 2,000 iterations
- Batch Size: 12
- Total Iterations: 35,000
- Hardware: Apple M2 Pro (16GB RAM)
- Training Speed: ~7,000 tokens/sec
- Training Time: 161 minutes (~2.7 hours)
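
As a rough illustration, the warmup-plus-cosine schedule implied by these hyperparameters can be written as follows (a sketch only; the exact schedule in the training script may differ in details such as where the decay horizon ends):

```python
import math

MAX_LR = 3e-4       # peak learning rate
MIN_LR = 1.5e-5     # final learning rate after cosine decay
WARMUP_ITERS = 2_000
TOTAL_ITERS = 35_000

def learning_rate(it: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if it < WARMUP_ITERS:
        return MAX_LR * (it + 1) / WARMUP_ITERS
    progress = (it - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```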

### Knowledge Distillation

This model was trained using knowledge distillation:

- Teacher Model: GPT-OSS-20B (20B params) via Groq API
- Student Model: This 53M parameter model
- Distillation Method: Soft-target learning combined with the hard cross-entropy loss
- Alpha: 0.7 (hard loss weight) / 0.3 (soft loss weight)
- Temperature: 2.0 for softening distributions
- Teacher Usage: ~1,099 teacher samples generated during training
- Benefit: Learns from the larger model's knowledge while maintaining efficiency
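
A minimal sketch of this loss combination, written in PyTorch for readability rather than MLX; the function and variable names are illustrative, and details such as the T² scaling of the soft term are assumptions, not taken from the actual training code:

```python
import torch
import torch.nn.functional as F

ALPHA = 0.7        # weight on the hard (cross-entropy) loss
TEMPERATURE = 2.0  # softening temperature for teacher/student distributions

def distillation_loss(student_logits, teacher_logits, targets):
    """Hard CE on ground-truth tokens plus soft KL to the teacher distribution."""
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), targets.view(-1)
    )
    soft = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * (TEMPERATURE ** 2)
    return ALPHA * hard + (1.0 - ALPHA) * soft
```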

## Intended Use

### Primary Use Cases

1. Text Completion: Continuing and completing text passages
2. Creative Writing: Story and narrative generation
3. Educational: Learning about transformers and knowledge distillation
4. Prototyping: Quick experiments with small-scale LLMs
5. Resource-Constrained Environments: Running LLMs on consumer hardware
6. MLX Framework Demonstration: Showcasing Apple Silicon training capabilities

### What This Model Does Well

- ✅ Text continuation with basic coherence
- ✅ Generating grammatically correct sentences
- ✅ Simple narrative patterns
- ✅ Fast inference on Apple Silicon
- ✅ Low resource requirements

### What This Model Does NOT Do

- ❌ Not a chat/assistant model: Not trained for conversation or instructions
- ❌ Limited reasoning: 53M parameters is too small for complex logic
- ❌ No factual accuracy: Not designed for knowledge retrieval
- ❌ Short context: Limited to 512 tokens
- ❌ Repetitive patterns: May generate loops in longer sequences

## Example Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model_name = "JackSuuu/nanogpt-mlx-53m-finewebedu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example 1: Story continuation (what it does best)
prompt = "Once upon a time, in a magical forest"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

# Example 2: Text completion (do_sample=True is required for temperature to take effect)
prompt = "The scientist discovered that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs.input_ids, max_length=80, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Real Generation Examples

**Prompt:** "Once upon a time, in a magical forest"

**Output:** (Model generates story-like continuation with basic narrative structure)

**Prompt:** "The scientist discovered"

**Output:** (Model continues with scientific-sounding text)

Note: This is a base language model, not an instruction-following or chat model. For best results, use natural text prompts rather than questions or commands.

### Using with MLX (Native)

```python
import mlx.core as mx
from src.model import create_model
from src.generate import generate_text

# Load MLX model
config = {...}  # Your config
model = create_model(config)
model.load_weights("checkpoint.npz")

# Generate
text = generate_text(
    model,
    prompt="Once upon a time",
    max_tokens=100,
    temperature=0.8
)
print(text)
```

## Performance

### Inference Performance (What Users Care About 🚀)

| Metric | Value | Notes |
|---|---|---|
| Batch Inference | 25,476 tokens/sec | 3.6x faster than training |
| Real-time Generation | 157.5 tokens/sec | Interactive use case ready |
| Average Latency | 161 ms | Low-latency applications |
| P95 Latency | 172 ms | Consistent performance |
| P99 Latency | 179 ms | Stable under load |
| Model Size (FP16) | 101 MB | Runs on mobile devices |
| Model Size (FP32) | 202 MB | Fits in RAM easily |
| Memory Usage | ~1.7 GB | During training with batch=12 |
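
These numbers were measured with the native MLX implementation. As a rough way to sanity-check generation speed and latency with the Hugging Face export on your own machine (a sketch, not the original benchmark script):

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("JackSuuu/nanogpt-mlx-53m-finewebedu")

inputs = tokenizer("Once upon a time", return_tensors="pt")
start = time.perf_counter()
out = model.generate(inputs.input_ids, max_new_tokens=64, do_sample=True, temperature=0.8)
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec, {elapsed * 1000:.0f} ms total")
```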

### Training Metrics

| Metric | Value | Notes |
|---|---|---|
| Training Loss | 3.46 | Excellent convergence |
| Validation Loss | 6.71 | Some overfitting (see below) |
| Best Val Loss | 4.74 | Achieved ~iteration 15K |
| Training Speed | 7,000 tokens/sec | M2 Pro, batch=12 |
| Training Time | 161 minutes (2.7 hours) | Consumer hardware! |
| Total Iterations | 35,000 | Fully converged |
| Teacher Samples | 1,099 | From GPT-OSS-20B |
| Evaluation Speed | 24,779 tokens/sec | Fast validation |

### Model Quality

- Perplexity: 827.85 (FineWebEdu validation set)

  Context: This perplexity reflects the model's 53M parameter size and the complexity of the FineWebEdu dataset (diverse educational web content). For reference, GPT-2 Small (124M parameters) achieves ~29 perplexity on WebText, while GPT-2 Medium (355M) achieves ~26. The higher perplexity is expected for a compact model on complex content, and the model performs well for its size class in text completion tasks.
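
For reference, perplexity is simply the exponentiated per-token cross-entropy, so the reported 827.85 is consistent with an un-rounded validation loss of roughly 6.72 nats per token:

```python
import math

# Perplexity = exp(cross-entropy loss). A loss near 6.72 matches the reported value;
# the rounded validation loss of 6.71 gives a slightly lower number.
print(math.exp(6.719))  # ~828.0
print(math.exp(6.71))   # ~820.6
```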

### Model Characteristics

Strengths:

- ✅ Grammatically correct text generation
- ✅ Basic sentence structure understanding
- ✅ Fast inference on Apple Silicon
- ✅ Low memory footprint (~200MB)
- ✅ Efficient knowledge distillation from 20B teacher

Known Limitations:

- ⚠️ Overfitting: Val loss (6.71) > train loss (3.46) indicates some overfitting
- ⚠️ Repetitive patterns: May generate repeated phrases in longer text (see the sampling sketch below)
- ⚠️ Limited coherence: Best for 50-100 tokens, degrades beyond that
- ⚠️ Not factual: Not trained for accurate information retrieval
- ⚠️ No instruction following: Not a chat or assistant model
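
The repetition issue can often be reduced at generation time with standard sampling controls; a sketch using the Hugging Face `generate()` API (parameter values are illustrative, tune them for your use case):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("JackSuuu/nanogpt-mlx-53m-finewebedu")
inputs = tokenizer("Once upon a time, in a magical forest", return_tensors="pt")

outputs = model.generate(
    inputs.input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.2,  # discourage re-using recently generated tokens
    no_repeat_ngram_size=3,  # forbid repeating any 3-gram verbatim
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```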

## Limitations and Biases

### Model Limitations

1. Context Window: Limited to 512 tokens
2. Model Size: 53M parameters limits capability vs larger models
3. Training Data: Only 10M tokens of FineWebEdu (plus a TinyStories warm-up), so the model may not generalize well
4. Knowledge Cutoff: No specific knowledge cutoff (training data dependent)

### Potential Biases

- Training data (TinyStories and FineWebEdu) may contain biases present in children's literature and web content
- Limited diversity in training data
- No explicit bias mitigation techniques applied

### Not Suitable For

- Production applications requiring factual accuracy
- Legal, medical, or financial advice
- Content requiring long-term coherence
- Tasks requiring reasoning or computation

## Training Infrastructure

- Hardware: Apple M2 Pro with 16GB RAM
- Framework: MLX 0.0.9+
- OS: macOS
- GPU: Apple Silicon GPU (Metal)
- Memory Usage: ~4-6GB during training

## Citation

If you use this model, please cite:

```bibtex
@software{nanogpt-mlx-53m,
  title = {nanoGPT-MLX-53M: Compact GPT with Knowledge Distillation on Apple Silicon},
  author = {Jack Su},
  year = {2025},
  url = {https://github.com/JackSuuu/nanoGPT-on-MLX},
  note = {53M parameter model trained using Apple MLX framework with knowledge distillation from GPT-OSS-20B}
}
```

## Related Work

- nanoGPT: Original PyTorch implementation by Andrej Karpathy
- MLX: Apple's array framework for machine learning on Apple silicon
- TinyStories: Dataset by Eldan & Li (Microsoft Research)
- FineWebEdu: High-quality web dataset by HuggingFace

## License

MIT License - See repository for details

## Acknowledgments

- MLX Team at Apple for the excellent framework
- TinyStories authors for the dataset
- HuggingFace for FineWebEdu and model hosting
- Andrej Karpathy for nanoGPT inspiration

## Model Card Authors

Jack Su

## Model Card Contact

For questions or issues, please open an issue on the GitHub repository.

## Training Notes

This model demonstrates:

- Efficient training on consumer hardware (M2 Pro, 16GB RAM)
- Knowledge distillation effectiveness for small models
- MLX framework capabilities for Apple Silicon
- Realistic expectations for 53M parameter models

The model performs appropriately for its size - it's not meant to compete with billion-parameter models but rather showcases what's achievable with limited resources and knowledge distillation.


This model is primarily for educational and research purposes. Use responsibly! πŸš€