---
language:
- en
license: mit
tags:
- text-generation
- mlx
- gpt
- pre-ln
datasets:
- HuggingFaceFW/fineweb-edu
metrics:
- perplexity
model-index:
- name: nanogpt-mlx-53m-finewebedu
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: FineWebEdu
      type: HuggingFaceFW/fineweb-edu
    metrics:
    - type: perplexity
      value: 690728
      name: Validation Perplexity
    - type: loss
      value: 0.758
      name: Training Loss
---

# NanoGPT MLX 53M (FineWebEdu)

A 53-million-parameter GPT model trained on FineWebEdu using Apple's MLX framework. The model uses a **Pre-LayerNorm (Pre-LN) transformer architecture** and is optimized for Apple Silicon.

## Model Details

- **Parameters:** 53M (52,990,464 total)
- **Architecture:** Pre-LN Transformer (8 layers, d_model = 384, 8 attention heads)
- **Context Length:** 512 tokens
- **Vocabulary:** 50,257 tokens (GPT-2 tokenizer)
- **Training Data:** FineWebEdu (10M tokens, educational web content)
- **Training Framework:** MLX (Apple Silicon optimized)
- **Hardware:** M2 Pro with 16GB memory
- **Checkpoint:** 35000 (includes knowledge distillation from GPT-OSS-20B)

### Architecture Highlights

This model uses a **Pre-LayerNorm** architecture, in which each sub-layer normalizes its input, unlike the Post-LN arrangement of the original Transformer (and GPT-1), which normalizes after each residual addition:

```python
# Pre-LN (this model)
x = x + attn(ln(x))
x = x + ff(ln(x))

# vs. Post-LN (original Transformer)
x = ln(x + attn(x))
x = ln(x + ff(x))
```

Pre-LN provides better training stability and is used in modern transformers (GPT-2, GPT-3, PaLM, LLaMA).
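To make the difference concrete, here is a minimal NumPy sketch (not the MLX implementation) of the two block layouts. The `attn` and `ff` arguments stand in for the attention and feed-forward sub-layers; only the placement of the normalization differs:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) dimension; gain/bias omitted for brevity.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_ln_block(x, attn, ff):
    # Pre-LN: each sub-layer sees a normalized input; the residual
    # stream itself is never normalized inside the block.
    x = x + attn(layer_norm(x))
    x = x + ff(layer_norm(x))
    return x

def post_ln_block(x, attn, ff):
    # Post-LN: normalization is applied *after* each residual addition,
    # so gradients must pass through every LayerNorm on the way down.
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ff(x))
    return x
```

One practical consequence: in a Pre-LN block the residual path is an identity map, so a freshly initialized (near-zero) sub-layer leaves the input almost unchanged, which is part of why Pre-LN trains more stably at depth.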
## Training Details

- **Dataset:** FineWebEdu (diverse educational web content)
- **Training Tokens:** 10M
- **Base Training:** 20,000 iterations (loss 0.758)
- **Knowledge Distillation:** 15,000 additional iterations with GPT-OSS-20B as teacher
- **Total Iterations:** 35,000
- **Batch Size:** 12
- **Learning Rate:** 3e-4 with cosine decay (base), 3e-5 (distillation)
- **Final Training Loss:** 3.46
- **Distillation Method:** 50% hard loss (ground truth) + 50% soft loss (teacher)

### Performance Benchmarks

Training and inference on M2 Pro (measured at checkpoint 20000):

```
📊 Model Size:  53.0M parameters
                202.1 MB (fp32), 101.1 MB (fp16)
⚡ Training:    27,355 tokens/sec (forward pass)
                13.36 batches/sec (batch=4, seq=512)
🎯 Inference:   169.9 tokens/sec
                ~0.59s per 100 tokens
💾 Memory:      843 MB activations (batch=4, seq=512)
```

**Note:** This checkpoint (35000) includes additional training with knowledge distillation.

## Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer (trust_remote_code is required for the custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/nanogpt-mlx-53m-finewebedu",
    trust_remote_code=True
)

# Generate text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    top_k=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

### Example Output

**Prompt:** "Once upon a time"

**Generated (checkpoint 35000 with distillation):**

```
Once upon a time: "the)." as in KDE, set by an article of the U and updated to the existing of a network. For requirements of the application to an individual to the data above above above above...
```

**Note:** This checkpoint shows characteristics of knowledge distillation training.
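The 50/50 distillation objective listed under Training Details can be sketched as follows. This is a NumPy illustration, not the training code; the softmax temperature `T` is an assumption (the card does not state the value used):

```python
import numpy as np

def log_softmax(z, axis=-1):
    # Numerically stable log-softmax.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, T=1.0):
    """alpha * hard CE against ground truth + (1 - alpha) * soft KL to the teacher.

    alpha=0.5 matches the 50/50 split above; T is a hypothetical
    temperature parameter, not taken from the model card.
    """
    # Hard loss: cross-entropy against ground-truth next-token ids.
    logp = log_softmax(student_logits)
    n = targets.shape[0]
    hard = -logp[np.arange(n), targets].mean()
    # Soft loss: KL(teacher || student) at temperature T,
    # scaled by T^2 to keep gradient magnitudes comparable.
    t_logp = log_softmax(teacher_logits / T)
    s_logp = log_softmax(student_logits / T)
    soft = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean() * T * T
    return alpha * hard + (1.0 - alpha) * soft
```

When the student already matches the teacher's distribution, the soft term vanishes and only the ground-truth cross-entropy remains, which is why distillation can be layered on top of a converged base checkpoint.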
The model has learned broader patterns from the teacher model (GPT-OSS-20B), though generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case.

## Model Architecture

```python
NanoGPTLMHeadModel(
  (transformer): NanoGPTModel(
    (token_embedding): Embedding(50257, 384)
    (position_embedding): Embedding(512, 384)
    (blocks): ModuleList(
      (0-7): 8 x NanoGPTBlock(
        (ln1): LayerNorm((384,), eps=1e-05)
        (attn): NanoGPTAttention(
          (qkv_proj): Linear(384, 1152)
          (out_proj): Linear(384, 384)
        )
        (ln2): LayerNorm((384,), eps=1e-05)
        (ff): FeedForward(
          (fc1): Linear(384, 1536)
          (fc2): Linear(1536, 384)
        )
      )
    )
    (ln_f): LayerNorm((384,), eps=1e-05)
  )
  (lm_head): Linear(384, 50257)
)
```

**Note:** `token_embedding` and `lm_head` weights are tied (shared), reducing the unique weight count from 53.0M (52,990,464, counted untied) to roughly 33.7M.

## Training Configuration

```python
{
    "vocab_size": 50257,
    "d_model": 384,
    "n_layers": 8,
    "n_heads": 8,
    "d_ff": 1536,
    "context_length": 512,
    "dropout": 0.1,
    "batch_size": 12,
    "learning_rate": 3e-4,
    "weight_decay": 0.1,
    "max_iters": 20000
}
```

## Limitations

- **Context length:** limited to 512 tokens
- **Domain:** trained on educational web content (FineWebEdu)
- **Size:** 53M parameters is small by modern LLM standards
- **Generation:** best suited to short-form content (stories, paragraphs)
- **No instruction tuning:** this is a base language model, not instruction-tuned

## Intended Use

**Primary use cases:**

- Educational demonstrations of transformer training
- Resource-constrained inference on Apple Silicon
- Base model for fine-tuning on specific domains
- Research and experimentation with Pre-LN architectures

**Not recommended for:**

- Production applications requiring factual accuracy
- Long-form content generation (>512 tokens)
- Instruction following or chat applications (not instruction-tuned)

## Ethical Considerations

This model was trained on FineWebEdu, which contains diverse web content.
Users should:

- Be aware of potential biases in generated content
- Validate outputs for factual accuracy
- Not use the model in applications requiring high reliability
- Consider fine-tuning on domain-specific data for production use

## Citation

If you use this model, please cite:

```bibtex
@software{nanogpt_mlx_2025,
  author = {JackSu},
  title  = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
  year   = {2025},
  url    = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
}
```

## Additional Resources

- **GitHub Repository:** [JackSuuu/nanoGPT-on-MLX](https://github.com/JackSuuu/nanoGPT-on-MLX)
- **MLX Framework:** [ml-explore/mlx](https://github.com/ml-explore/mlx)
- **Training Dataset:** [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)

## License

MIT License - see the repository for details.