---
language:
- en
license: mit
tags:
- text-generation
- pytorch
- gpt
- transformers
- pre-ln
- causal-lm
datasets:
- HuggingFaceFW/fineweb-edu
library_name: transformers
pipeline_tag: text-generation
metrics:
- perplexity
widget:
- text: "Once upon a time"
  example_title: "Story Beginning"
- text: "The capital of France is"
  example_title: "Factual Question"
- text: "In the field of machine learning,"
  example_title: "Technical Topic"
---

# NanoGPT 53M - Pre-LN Transformer

A 53-million-parameter GPT model trained from scratch on FineWebEdu educational content. This model implements a **Pre-LayerNorm (Pre-LN) transformer architecture** and is compatible with the Hugging Face Transformers library.

> **Model Format:** PyTorch (cross-platform compatible)
> **Training Framework:** Apple MLX (exported to PyTorch for universal compatibility)

## Model Details

### Architecture

- **Model Type:** GPT (decoder-only transformer)
- **Parameters:** 53M (52,990,464 total; 43M unique with weight tying)
- **Architecture Pattern:** Pre-LayerNorm (Pre-LN)
- **Layers:** 8 transformer blocks
- **Hidden Size:** 384
- **Attention Heads:** 8
- **Feedforward Dimension:** 1536
- **Context Length:** 512 tokens
- **Vocabulary Size:** 50,257 (GPT-2 tokenizer)

### Training

- **Framework:** Apple MLX (training), PyTorch (export)
- **Dataset:** FineWebEdu - 10M tokens of educational web content
- **Training Hardware:** Apple M2 Pro (16GB unified memory)
- **Checkpoint:** 35,000 iterations
- **Training Method:** Base pretraining (20K iters) + knowledge distillation (15K iters)
- **Teacher Model:** GPT-OSS-20B (via Groq API)

### Architecture Highlights

This model uses a **Pre-LayerNorm** architecture: each residual branch normalizes its input before the attention and feedforward sub-layers, unlike the Post-LN layout of the original Transformer:

```python
# Pre-LN (this model)
x = x + attn(ln(x))
x = x + ff(ln(x))

# vs Post-LN (original Transformer)
x = ln(x + attn(x))
x = ln(x + ff(x))
```

Pre-LN provides better training stability and is used in modern transformers (GPT-3, PaLM, LLaMA).

## Training Details

- **Dataset:** FineWebEdu (diverse educational web content)
- **Training Tokens:** 10M
- **Base Training:** 20,000 iterations (loss 0.758)
- **Knowledge Distillation:** 15,000 additional iterations with GPT-OSS-20B as teacher
- **Total Iterations:** 35,000
- **Batch Size:** 12
- **Learning Rate:** 3e-4 with cosine decay (base), 3e-5 (distillation)
- **Final Training Loss:** 3.46
- **Distillation Method:** 50% hard loss (ground truth) + 50% soft loss (teacher)

### Performance Benchmarks

Measured on Apple M2 Pro (16GB unified memory):

| Metric | Value |
|--------|-------|
| **Model Size** | 53.0M parameters |
| **Memory (fp32)** | 202.1 MB |
| **Memory (fp16)** | 101.1 MB |
| **Training Throughput** | 27,355 tokens/sec |
| **Batch Processing** | 13.36 batches/sec (batch=4, seq=512) |
| **Inference Speed** | 169.9 tokens/sec |
| **Generation Latency** | ~0.59s per 100 tokens |
| **Activation Memory** | 843 MB (batch=4, seq=512) |

> **Note:** Benchmarks were measured at checkpoint 20,000. This release (checkpoint 35,000) includes additional knowledge distillation training.
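### Distillation Objective (Illustrative Sketch)

The distillation method above is described only at a high level (50% hard loss on ground-truth tokens plus 50% soft loss against the teacher). The snippet below is a minimal PyTorch sketch of one common way to combine these two terms; the temperature value and the KL-divergence formulation of the soft loss are illustrative assumptions and are not taken from the released MLX training code.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, temperature=2.0, alpha=0.5):
    """Combine ground-truth cross-entropy with a teacher-matching term.

    Shapes: logits are (batch, seq, vocab); targets are (batch, seq).
    alpha=0.5 mirrors the 50% hard / 50% soft split described above.
    """
    vocab = student_logits.size(-1)

    # Hard loss: standard next-token cross-entropy against the ground-truth tokens
    hard = F.cross_entropy(student_logits.reshape(-1, vocab), targets.reshape(-1))

    # Soft loss: KL divergence between temperature-softened teacher and student
    # distributions, rescaled by T^2 (the temperature and KL form are assumptions)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft = F.kl_div(
        log_p_student.reshape(-1, vocab),
        p_teacher.reshape(-1, vocab),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * hard + (1.0 - alpha) * soft
```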
## Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (requires trust_remote_code for the custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/nanogpt-mlx-53m-finewebedu",
    trust_remote_code=True
)

# Generate text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    top_k=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

### Example Output

**Prompt:** "Once upon a time"

**Generated (checkpoint 35000 with distillation):**

```
Once upon a time: "the)." as in KDE, set by an article of the U and updated to the existing of a network. For requirements of the application to an individual to the data above above above above...
```

**Note:** This checkpoint shows characteristics of knowledge distillation training. The model has learned broader patterns from the teacher model (GPT-OSS-20B), though generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case.

## Model Architecture

```python
NanoGPTLMHeadModel(
  (transformer): NanoGPTModel(
    (token_embedding): Embedding(50257, 384)
    (position_embedding): Embedding(512, 384)
    (blocks): ModuleList(
      (0-7): 8 x NanoGPTBlock(
        (ln1): LayerNorm((384,), eps=1e-05)
        (attn): NanoGPTAttention(
          (qkv_proj): Linear(384, 1152)
          (out_proj): Linear(384, 384)
        )
        (ln2): LayerNorm((384,), eps=1e-05)
        (ff): FeedForward(
          (fc1): Linear(384, 1536)
          (fc2): Linear(1536, 384)
        )
      )
    )
    (ln_f): LayerNorm((384,), eps=1e-05)
  )
  (lm_head): Linear(384, 50257)
)
```

**Note:** `token_embedding` and `lm_head` weights are tied (shared), reducing the effective parameter count from 53M total to 43M unique weights.

## Training Configuration

```python
{
    "vocab_size": 50257,
    "d_model": 384,
    "n_layers": 8,
    "n_heads": 8,
    "d_ff": 1536,
    "context_length": 512,
    "dropout": 0.1,
    "batch_size": 12,
    "learning_rate": 3e-4,
    "weight_decay": 0.1,
    "max_iters": 20000
}
```

## Limitations

- **Context length:** Limited to 512 tokens
- **Domain:** Trained on educational web content (FineWebEdu)
- **Size:** 53M parameters is relatively small compared to modern LLMs
- **Generation:** Best for short-form content (stories, paragraphs)
- **No instruction tuning:** This is a base language model, not instruction-tuned

## Intended Use

**Primary use cases:**

- Educational demonstrations of transformer training
- Resource-constrained inference on Apple Silicon
- Base model for fine-tuning on specific domains (a minimal fine-tuning sketch follows at the end of this section)
- Research and experimentation with Pre-LN architectures

**Not recommended for:**

- Production applications requiring factual accuracy
- Long-form content generation (>512 tokens)
- Instruction following or chat applications (not instruction-tuned)
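Since this is a base model, domain adaptation amounts to ordinary causal-LM fine-tuning. The sketch below uses the Hugging Face `Trainer`; the corpus file name is a placeholder, and it assumes the custom `NanoGPTLMHeadModel` exposes the standard causal-LM interface (accepts `labels` and returns a loss), which should be verified against the model's remote code before relying on it.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "jacksuuuu/nanogpt-mlx-53m-finewebedu"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# The GPT-2 tokenizer has no pad token; reuse EOS so batches can be padded
tokenizer.pad_token = tokenizer.eos_token

# "my_domain_corpus.txt" is a placeholder for your own plain-text training data
dataset = load_dataset("text", data_files={"train": "my_domain_corpus.txt"})["train"]

def tokenize(batch):
    # Truncate to the model's 512-token context window
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
tokenized = tokenized.filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines

# mlm=False produces causal-LM labels (inputs shifted by one)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="nanogpt-53m-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=3e-5,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```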
## Ethical Considerations

This model was trained on FineWebEdu, which contains diverse web content. Users should:

- Be aware of potential biases in generated content
- Validate outputs for factual accuracy
- Not use it for applications requiring high reliability
- Consider fine-tuning on domain-specific data for production use

## Citation

If you use this model, please cite:

```bibtex
@software{nanogpt_mlx_2025,
  author = {JackSu},
  title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
  year = {2025},
  url = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
}
```

## Additional Resources

- **GitHub Repository:** [JackSuuu/nanoGPT-on-MLX](https://github.com/JackSuuu/nanoGPT-on-MLX)
- **MLX Framework:** [ml-explore/mlx](https://github.com/ml-explore/mlx)
- **Training Dataset:** [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)

## License

MIT License - see the repository for details.