---
language:
- en
license: mit
tags:
- text-generation
- mlx
- gpt
- pre-ln
datasets:
- HuggingFaceFW/fineweb-edu
metrics:
- perplexity
model-index:
- name: nanogpt-mlx-53m-finewebedu
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: FineWebEdu
      type: HuggingFaceFW/fineweb-edu
    metrics:
    - type: perplexity
      value: 690728
      name: Validation Perplexity
    - type: loss
      value: 0.758
      name: Training Loss
---

# NanoGPT MLX 53M (FineWebEdu)

A 53-million-parameter GPT model trained on FineWebEdu using Apple's MLX framework. The model uses a **Pre-LayerNorm (Pre-LN) transformer architecture** and is optimized for Apple Silicon.

## Model Details

- **Parameters:** 53M (52,990,464 total)
- **Architecture:** Pre-LN Transformer (8 layers, d_model = 384, 8 attention heads)
- **Context Length:** 512 tokens
- **Vocabulary:** 50,257 tokens (GPT-2 tokenizer)
- **Training Data:** FineWebEdu (10M tokens, educational web content)
- **Training Framework:** MLX (Apple Silicon optimized)
- **Hardware:** M2 Pro with 16GB memory
- **Checkpoint:** 35000 (includes knowledge distillation from GPT-OSS-20B)

### Architecture Highlights

This model uses a **Pre-LayerNorm** architecture, in which each sub-layer normalizes its input, unlike the Post-LN arrangement of the original Transformer (and GPT-1), which normalizes after each residual addition:

```python
# Pre-LN (this model)
x = x + attn(ln(x))
x = x + ff(ln(x))

# vs. Post-LN (original Transformer)
x = ln(x + attn(x))
x = ln(x + ff(x))
```

Pre-LN provides better training stability and is used in modern transformers (GPT-2, GPT-3, PaLM, LLaMA).
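To make the difference concrete, here is a minimal NumPy sketch (not the MLX implementation) of the two block layouts. The `attn` and `ff` arguments stand in for the attention and feed-forward sub-layers; only the placement of the normalization differs:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension; gain/bias omitted for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, attn, ff):
    # Pre-LN: each sub-layer sees a normalized input; the residual
    # stream itself is never normalized inside the block.
    x = x + attn(layer_norm(x))
    x = x + ff(layer_norm(x))
    return x

def post_ln_block(x, attn, ff):
    # Post-LN: normalization is applied *after* each residual addition,
    # so gradients must pass through every LayerNorm on the way down.
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ff(x))
    return x
```

One practical consequence: in a Pre-LN block the residual path is an identity map, so a freshly initialized (near-zero) sub-layer leaves the input almost unchanged, which is part of why Pre-LN trains more stably at depth.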
## Training Details

- **Dataset:** FineWebEdu (diverse educational web content)
- **Training Tokens:** 10M
- **Base Training:** 20,000 iterations (loss 0.758)
- **Knowledge Distillation:** 15,000 additional iterations with GPT-OSS-20B as teacher
- **Total Iterations:** 35,000
- **Batch Size:** 12
- **Learning Rate:** 3e-4 with cosine decay (base), 3e-5 (distillation)
- **Final Training Loss:** 3.46
- **Distillation Method:** 50% hard loss (ground truth) + 50% soft loss (teacher)

### Performance Benchmarks

Training and inference on M2 Pro (measured at checkpoint 20000):

```
📊 Model Size:  53.0M parameters
                202.1 MB (fp32), 101.1 MB (fp16)
⚡ Training:    27,355 tokens/sec (forward pass)
                13.36 batches/sec (batch=4, seq=512)
🎯 Inference:   169.9 tokens/sec
                ~0.59s per 100 tokens
💾 Memory:      843 MB activations (batch=4, seq=512)
```

**Note:** This checkpoint (35000) includes additional training with knowledge distillation.

## Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer (trust_remote_code is required for the custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/nanogpt-mlx-53m-finewebedu",
    trust_remote_code=True
)

# Generate text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    top_k=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

### Example Output

**Prompt:** "Once upon a time"

**Generated (checkpoint 35000 with distillation):**

```
Once upon a time: "the)." as in KDE, set by an article of the U and updated to the existing of a network. For requirements of the application to an individual to the data above above above above...
```

**Note:** This checkpoint shows characteristics of knowledge distillation training.
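The 50/50 distillation objective listed under Training Details can be sketched as follows. This is a NumPy illustration, not the training code; the softmax temperature `T` is an assumption (the card does not state the value used):

```python
import numpy as np

def log_softmax(z, axis=-1):
    # Numerically stable log-softmax.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, T=1.0):
    """alpha * hard CE against ground truth + (1 - alpha) * soft KL to the teacher.

    alpha=0.5 matches the 50/50 split above; T is a hypothetical
    temperature parameter, not taken from the model card.
    """
    # Hard loss: cross-entropy against ground-truth next-token ids.
    logp = log_softmax(student_logits)
    n = targets.shape[0]
    hard = -logp[np.arange(n), targets].mean()
    # Soft loss: KL(teacher || student) at temperature T,
    # scaled by T^2 to keep gradient magnitudes comparable.
    t_logp = log_softmax(teacher_logits / T)
    s_logp = log_softmax(student_logits / T)
    soft = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean() * T * T
    return alpha * hard + (1.0 - alpha) * soft
```

When the student already matches the teacher's distribution, the soft term vanishes and only the ground-truth cross-entropy remains, which is why distillation can be layered on top of a converged base checkpoint.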
The model has learned broader patterns from the teacher model (GPT-OSS-20B), though generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case.

## Model Architecture

```python
NanoGPTLMHeadModel(
  (transformer): NanoGPTModel(
    (token_embedding): Embedding(50257, 384)
    (position_embedding): Embedding(512, 384)
    (blocks): ModuleList(
      (0-7): 8 x NanoGPTBlock(
        (ln1): LayerNorm((384,), eps=1e-05)
        (attn): NanoGPTAttention(
          (qkv_proj): Linear(384, 1152)
          (out_proj): Linear(384, 384)
        )
        (ln2): LayerNorm((384,), eps=1e-05)
        (ff): FeedForward(
          (fc1): Linear(384, 1536)
          (fc2): Linear(1536, 384)
        )
      )
    )
    (ln_f): LayerNorm((384,), eps=1e-05)
  )
  (lm_head): Linear(384, 50257)
)
```

**Note:** `token_embedding` and `lm_head` weights are tied (shared), reducing the unique weight count from 53.0M (52,990,464, counted untied) to roughly 33.7M.

## Training Configuration

```python
{
    "vocab_size": 50257,
    "d_model": 384,
    "n_layers": 8,
    "n_heads": 8,
    "d_ff": 1536,
    "context_length": 512,
    "dropout": 0.1,
    "batch_size": 12,
    "learning_rate": 3e-4,
    "weight_decay": 0.1,
    "max_iters": 20000
}
```

## Limitations

- **Context length:** limited to 512 tokens
- **Domain:** trained on educational web content (FineWebEdu)
- **Size:** 53M parameters is small by modern LLM standards
- **Generation:** best suited to short-form content (stories, paragraphs)
- **No instruction tuning:** this is a base language model, not instruction-tuned

## Intended Use

**Primary use cases:**

- Educational demonstrations of transformer training
- Resource-constrained inference on Apple Silicon
- Base model for fine-tuning on specific domains
- Research and experimentation with Pre-LN architectures

**Not recommended for:**

- Production applications requiring factual accuracy
- Long-form content generation (>512 tokens)
- Instruction following or chat applications (not instruction-tuned)

## Ethical Considerations

This model was trained on FineWebEdu, which contains diverse web content.
Users should:

- Be aware of potential biases in generated content
- Validate outputs for factual accuracy
- Not use the model in applications requiring high reliability
- Consider fine-tuning on domain-specific data for production use

## Citation

If you use this model, please cite:

```bibtex
@software{nanogpt_mlx_2025,
  author = {JackSu},
  title  = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
  year   = {2025},
  url    = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
}
```

## Additional Resources

- **GitHub Repository:** [JackSuuu/nanoGPT-on-MLX](https://github.com/JackSuuu/nanoGPT-on-MLX)
- **MLX Framework:** [ml-explore/mlx](https://github.com/ml-explore/mlx)
- **Training Dataset:** [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)

## License

MIT License - see the repository for details.