---
language:
- en
license: mit
tags:
- text-generation
- pytorch
- gpt
- transformers
- pre-ln
- causal-lm
datasets:
- HuggingFaceFW/fineweb-edu
library_name: transformers
pipeline_tag: text-generation
metrics:
- perplexity
widget:
- text: "Once upon a time"
  example_title: "Story Beginning"
- text: "The capital of France is"
  example_title: "Factual Question"
- text: "In the field of machine learning,"
  example_title: "Technical Topic"
---
# NanoGPT 53M - Pre-LN Transformer

A 53-million-parameter GPT model trained from scratch on FineWebEdu educational content. The model implements a Pre-LayerNorm (Pre-LN) transformer architecture and is compatible with the Hugging Face Transformers library.

**Model Format:** PyTorch (cross-platform compatible)
**Training Framework:** Apple MLX (exported to PyTorch for universal compatibility)
## Model Details

### Architecture
- Model Type: GPT (Decoder-only Transformer)
- Parameters: 53M (52,990,464 total; ~33.7M unique with weight tying)
- Architecture Pattern: Pre-LayerNorm (Pre-LN)
- Layers: 8 transformer blocks
- Hidden Size: 384
- Attention Heads: 8
- Feedforward Dimension: 1536
- Context Length: 512 tokens
- Vocabulary Size: 50,257 (GPT-2 tokenizer)
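
These values can also be read back from the hub checkpoint itself. A quick inspection sketch follows; attribute names depend on the custom config class shipped with the repository, so treat the output as indicative:

```python
from transformers import AutoConfig

# Load the packaged configuration (trust_remote_code is required for the
# custom architecture); printed field names come from the repo's config class.
config = AutoConfig.from_pretrained(
    "jacksuuuu/nanogpt-mlx-53m-finewebedu", trust_remote_code=True
)
print(config)
```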
### Training
- Framework: Apple MLX (training), PyTorch (export)
- Dataset: FineWebEdu - 10M tokens of educational web content
- Training Hardware: Apple M2 Pro (16GB unified memory)
- Checkpoint: 35000 iterations
- Training Method: Base pretraining (20K iters) + Knowledge Distillation (15K iters)
- Teacher Model: GPT-OSS-20B (via Groq API)
## Architecture Highlights

This model uses a Pre-LayerNorm architecture, in contrast to the Post-LN arrangement of the original Transformer:
```python
# Pre-LN (this model)
x = x + attn(ln(x))
x = x + ff(ln(x))

# vs. Post-LN (original Transformer)
x = ln(x + attn(x))
x = ln(x + ff(x))
```
Pre-LN provides better training stability and is used in modern transformers (GPT-3, PaLM, LLaMA).
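
For reference, here is a minimal PyTorch sketch of a Pre-LN block following the pattern above. The class and module layout are illustrative, not the packaged `NanoGPTBlock` implementation; the default dimensions mirror this model's configuration:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN transformer block: LayerNorm is applied *before*
    each sub-layer, and the residual path stays un-normalized."""

    def __init__(self, d_model=384, n_heads=8, d_ff=1536, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        # x = x + attn(ln(x)): normalize first, then attend, then add the residual
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + h
        # x = x + ff(ln(x))
        x = x + self.ff(self.ln2(x))
        return x
```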
## Training Details
- Dataset: FineWebEdu (diverse educational web content)
- Training Tokens: 10M
- Base Training: 20,000 iterations (loss 0.758)
- Knowledge Distillation: 15,000 additional iterations with GPT-OSS-20B as teacher
- Total Iterations: 35,000
- Batch Size: 12
- Learning Rate: 3e-4 with cosine decay (base), 3e-5 (distillation)
- Final Training Loss: 3.46
- Distillation Method: 50% hard loss (ground truth) + 50% soft loss (teacher); see the sketch below
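
A rough PyTorch-style sketch of the 50/50 distillation objective described above. The actual training loop ran in MLX, and the temperature value here is an assumption:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, temperature=1.0):
    """Equal mix of cross-entropy against ground truth ("hard") and KL divergence
    against the teacher distribution ("soft"). alpha=0.5 reproduces the 50/50 split."""
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), targets.view(-1)
    )
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft
```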
## Performance Benchmarks
Measured on Apple M2 Pro (16GB unified memory):
| Metric | Value |
|---|---|
| Model Size | 53.0M parameters |
| Memory (fp32) | 202.1 MB |
| Memory (fp16) | 101.1 MB |
| Training Throughput | 27,355 tokens/sec |
| Batch Processing | 13.36 batches/sec (batch=4, seq=512) |
| Inference Speed | 169.9 tokens/sec |
| Generation Latency | ~0.59s per 100 tokens |
| Activation Memory | 843 MB (batch=4, seq=512) |
**Note:** Benchmarks were measured at checkpoint 20,000. This release (checkpoint 35,000) includes additional knowledge distillation training.
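
The fp32/fp16 memory figures follow directly from the parameter count; a quick sanity check:

```python
# Weight memory from the parameter count (values in MiB);
# small buffers and framework overhead are ignored.
n_params = 52_990_464
print(f"fp32: {n_params * 4 / 2**20:.1f} MiB")  # ~202.1
print(f"fp16: {n_params * 2 / 2**20:.1f} MiB")  # ~101.1
```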
## Usage

### Basic Text Generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (trust_remote_code is required for the custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/nanogpt-mlx-53m-finewebedu",
    trust_remote_code=True
)

# Generate text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    top_k=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
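
Since perplexity is the reported metric, a minimal evaluation sketch is shown below, reusing `model` and `tokenizer` from above. It assumes the custom model class follows the usual `AutoModelForCausalLM` convention of returning a loss when `labels` are provided:

```python
import torch

# Perplexity on a single snippet: exp of the mean token-level cross-entropy.
text = "The mitochondria is the powerhouse of the cell."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
perplexity = torch.exp(out.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```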
### Example Output

**Prompt:** "Once upon a time"

**Generated (checkpoint 35,000, with distillation):**

```
Once upon a time: "the)." as in KDE, set by an article of the U and
updated to the existing of a network. For requirements of the application
to an individual to the data above above above above...
```
**Note:** This checkpoint reflects knowledge distillation training: the model has picked up broader patterns from the teacher (GPT-OSS-20B), but generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case (a sketch follows below).
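
As a starting point for such fine-tuning, here is a sketch using the Hugging Face `Trainer`. The dataset (`roneneldan/TinyStories`) and hyperparameters are illustrative choices, not part of this model's training recipe, and the sketch assumes the custom model class returns a loss from `labels` like standard Transformers models:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# The GPT-2 tokenizer has no pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

# Example corpus only -- swap in your own data.
dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")

def tokenize(batch):
    # Truncate to the model's 512-token context length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="nanogpt-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=3e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```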
## Model Architecture

```
NanoGPTLMHeadModel(
  (transformer): NanoGPTModel(
    (token_embedding): Embedding(50257, 384)
    (position_embedding): Embedding(512, 384)
    (blocks): ModuleList(
      (0-7): 8 x NanoGPTBlock(
        (ln1): LayerNorm((384,), eps=1e-05)
        (attn): NanoGPTAttention(
          (qkv_proj): Linear(384, 1152)
          (out_proj): Linear(384, 384)
        )
        (ln2): LayerNorm((384,), eps=1e-05)
        (ff): FeedForward(
          (fc1): Linear(384, 1536)
          (fc2): Linear(1536, 384)
        )
      )
    )
    (ln_f): LayerNorm((384,), eps=1e-05)
  )
  (lm_head): Linear(384, 50257)
)
```
**Note:** token_embedding and lm_head weights are tied (shared), reducing the unique parameter count from 53.0M to roughly 33.7M. The tying can be checked directly, as shown below.
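
A quick check of the weight tying from the loaded model, using the module names from the printout above:

```python
# The LM head should share storage with the token embedding when weights are tied.
emb = model.transformer.token_embedding.weight
head = model.lm_head.weight
print(emb.data_ptr() == head.data_ptr())  # True when the weights are tied
print(tuple(emb.shape))                   # (50257, 384)
```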
## Training Configuration

```json
{
  "vocab_size": 50257,
  "d_model": 384,
  "n_layers": 8,
  "n_heads": 8,
  "d_ff": 1536,
  "context_length": 512,
  "dropout": 0.1,
  "batch_size": 12,
  "learning_rate": 3e-4,
  "weight_decay": 0.1,
  "max_iters": 20000
}
```
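
As a sanity check, the reported total of 52,990,464 parameters can be reproduced from this configuration, assuming biases on every linear layer and no bias on the tied `lm_head`:

```python
# Back-of-the-envelope parameter count from the configuration above.
V, d, n_layers, d_ff, ctx = 50257, 384, 8, 1536, 512

embeddings = V * d + ctx * d                    # token + position embeddings
attn = (d * 3 * d + 3 * d) + (d * d + d)        # qkv_proj + out_proj (with biases)
ff = (d * d_ff + d_ff) + (d_ff * d + d)         # fc1 + fc2 (with biases)
norms = 2 * 2 * d                               # ln1 + ln2 (weight and bias each)
block = attn + ff + norms

total = embeddings + n_layers * block + 2 * d + V * d  # + ln_f + lm_head (no bias)
print(total)            # 52990464
print(total - V * d)    # 33691776 unique parameters with weight tying
```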
## Limitations
- Context length: Limited to 512 tokens
- Domain: Trained on educational web content (FineWebEdu)
- Size: 53M parameters is relatively small compared to modern LLMs
- Generation: Best for short-form content (stories, paragraphs)
- No instruction tuning: This is a base language model, not instruction-tuned
## Intended Use

**Primary use cases:**
- Educational demonstrations of transformer training
- Resource-constrained inference on Apple Silicon
- Base model for fine-tuning on specific domains
- Research and experimentation with Pre-LN architectures
**Not recommended for:**
- Production applications requiring factual accuracy
- Long-form content generation (>512 tokens)
- Instruction following or chat applications (not instruction-tuned)
## Ethical Considerations
This model was trained on FineWebEdu, which contains diverse web content. Users should:
- Be aware of potential biases in generated content
- Validate outputs for factual accuracy
- Not use for applications requiring high reliability
- Consider fine-tuning on domain-specific data for production use
## Citation

If you use this model, please cite:
```bibtex
@software{nanogpt_mlx_2025,
  author = {JackSu},
  title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
  year = {2025},
  url = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
}
```
## Additional Resources
- GitHub Repository: JackSuuu/nanoGPT-on-MLX
- MLX Framework: ml-explore/mlx
- Training Dataset: HuggingFaceFW/fineweb-edu
## License
MIT License - See repository for details.