---
language:
- en
license: mit
tags:
- text-generation
- pytorch
- gpt
- transformers
- pre-ln
- causal-lm
datasets:
- roneneldan/TinyStories
library_name: transformers
pipeline_tag: text-generation
metrics:
- perplexity
widget:
- text: "Once upon a time"
  example_title: "Story Beginning"
- text: "The capital of France is"
  example_title: "Factual Question"
- text: "In the field of machine learning,"
  example_title: "Technical Topic"
---

# NanoGPT 53M - Pre-LN Transformer

A 53-million-parameter GPT model trained from scratch on the TinyStories dataset. It implements a **Pre-LayerNorm (Pre-LN) transformer architecture** and demonstrates efficient training on Apple Silicon with the MLX framework.

> **Model Format:** PyTorch (cross-platform compatible)  
> **Training Framework:** Apple MLX (exported to PyTorch for universal compatibility)  
> **Best for:** Educational demonstrations, research, and fine-tuning on specific domains

## Model Details

### Architecture
- **Model Type:** GPT (Decoder-only Transformer)
- **Parameters:** 53M (52,990,464 total; about 33.7M unique with weight tying)
- **Architecture Pattern:** Pre-LayerNorm (Pre-LN)
- **Layers:** 8 transformer blocks
- **Hidden Size:** 384
- **Attention Heads:** 8
- **Feedforward Dimension:** 1536
- **Context Length:** 512 tokens
- **Vocabulary Size:** 50,257 (GPT-2 tokenizer)

### Training
- **Framework:** Apple MLX (training), PyTorch (export)
- **Dataset:** TinyStories - Simple children's stories for language learning
- **Training Hardware:** Apple M2 Pro (16GB unified memory)
- **Checkpoint:** 20000 iterations
- **Training Method:** Base pretraining from scratch

### Architecture Highlights

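This model uses a **Pre-LayerNorm** architecture: each sub-layer normalizes its own input, and the residual path is left untouched. This differs from the Post-LN layout of the original Transformer and GPT-1:

```python
# Pre-LN (this model)
x = x + attn(ln1(x))
x = x + ff(ln2(x))

# vs Post-LN (original Transformer, GPT-1)
x = ln1(x + attn(x))
x = ln2(x + ff(x))
```

Pre-LN generally trains more stably than Post-LN, especially in deeper stacks, and pre-normalization is the layout used by modern transformers (GPT-2, GPT-3, PaLM, LLaMA).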
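The difference between the two layouts can be seen on a single vector. The following is a minimal sketch in plain Python, not the model's actual code: `toy_sublayer` is a hypothetical stand-in for attention or feed-forward, and `layer_norm` omits the learned scale and shift.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or feed-forward: a fixed elementwise map."""
    return [0.5 * v + 1.0 for v in x]

def pre_ln_step(x, sublayer):
    # Normalize only the sub-layer input; the residual path carries x unchanged.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_ln_step(x, sublayer):
    # Add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [10.0, -4.0, 2.5, 0.5]
pre = pre_ln_step(x, toy_sublayer)
post = post_ln_step(x, toy_sublayer)
print("pre-LN :", pre)
print("post-LN:", post)
```

In the Pre-LN result the original scale of `x` survives on the residual path, while the Post-LN result is renormalized at every step. This identity path is the usual explanation for why gradients flow more easily through deep Pre-LN stacks.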

## Training Details

- **Dataset:** TinyStories (simple children's stories)
- **Training Tokens:** ~2M
- **Total Iterations:** 20,000
- **Batch Size:** 12 sequences/batch
- **Sequence Length:** 512 tokens
- **Learning Rate:** 3e-4 with cosine decay schedule
- **Optimizer:** AdamW (β1=0.9, β2=0.95, weight_decay=0.1)
- **Final Training Loss:** 0.7583
- **Training Time:** ~4 hours on Apple M2 Pro
- **Gradient Accumulation:** None (direct updates)
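The list above gives a peak learning rate of 3e-4 with cosine decay over 20,000 iterations, but does not state a warmup length or a minimum learning rate. The sketch below shows one common form of such a schedule. `MIN_LR` and `WARMUP_ITERS` are assumptions chosen for illustration, not values taken from the training run.

```python
import math

MAX_LR = 3e-4        # peak learning rate, from the training details
MAX_ITERS = 20_000   # total iterations, from the training details
MIN_LR = 3e-5        # ASSUMPTION: floor at 10% of the peak
WARMUP_ITERS = 200   # ASSUMPTION: short linear warmup

def lr_at(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to MIN_LR."""
    if it < WARMUP_ITERS:
        return MAX_LR * (it + 1) / WARMUP_ITERS
    if it >= MAX_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return MIN_LR + coeff * (MAX_LR - MIN_LR)

for it in (0, WARMUP_ITERS, MAX_ITERS // 2, MAX_ITERS):
    print(f"iter {it:>6}: lr = {lr_at(it):.2e}")
```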

### Performance Benchmarks

Measured on Apple M2 Pro (16GB unified memory):

| Metric | Value |
|--------|-------|
| **Model Size** | 53.0M parameters |
| **Memory (fp32)** | 202.1 MB |
| **Memory (fp16)** | 101.1 MB |
| **Training Throughput** | 27,355 tokens/sec |
| **Batch Processing** | 13.36 batches/sec (batch=4, seq=512) |
| **Inference Speed** | 169.9 tokens/sec |
| **Generation Latency** | ~0.59s per 100 tokens |
| **Activation Memory** | 843 MB (batch=4, seq=512) |

> **Note:** All benchmarks measured at checkpoint 20000 (this release).
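Several rows in the table follow from one another, so they can be cross-checked with simple arithmetic. The sketch below assumes that "MB" in the table means MiB (2^20 bytes) and that the weight memory rows count parameters only. Under those assumptions the derived values agree with the table to within rounding.

```python
N_PARAMS = 52_990_464          # total parameter count from the Architecture section
TOKENS_PER_SEC_INFER = 169.9   # inference speed from the table
BATCHES_PER_SEC = 13.36        # batch processing rate from the table
BATCH, SEQ = 4, 512            # benchmark batch shape from the table
MIB = 1024 ** 2                # ASSUMPTION: the table's "MB" is 2**20 bytes

def weight_memory_mib(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in MiB."""
    return n_params * bytes_per_param / MIB

fp32_mib = weight_memory_mib(N_PARAMS, 4)            # 4 bytes per fp32 weight
fp16_mib = weight_memory_mib(N_PARAMS, 2)            # 2 bytes per fp16 weight
latency_100 = 100 / TOKENS_PER_SEC_INFER             # seconds per 100 generated tokens
train_tok_per_sec = BATCHES_PER_SEC * BATCH * SEQ    # tokens per second during training

print(f"fp32 weights: {fp32_mib:.1f} MiB")
print(f"fp16 weights: {fp16_mib:.1f} MiB")
print(f"latency per 100 tokens: {latency_100:.2f} s")
print(f"training throughput: {train_tok_per_sec:,.0f} tokens/sec")
```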

## Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (requires trust_remote_code for custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/tinystories")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/tinystories",
    trust_remote_code=True
)

# Generate text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    top_k=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

### Example Output

**Prompt:** "Once upon a time"

**Generated:**
```
Once upon a time, the boy named Lily and his dog named Max went for a walk. 
They ran and ran, but they kept each and got very tired. Suddenly the way, 
Max saw something shiny on the ground. He pointed the shiny to his owner and 
explained, "What does this?"

Max meowed and said, "I don't sign, Max. The sign is too small and it's 
important to learn."
```

**Note:** The model produces short, story-like text in the style of its training data. As the sample shows, the output is often ungrammatical or inconsistent because of the model's small size (53M parameters), but it reflects the narrative structure and vocabulary of the TinyStories dataset.

## Model Architecture

```python
NanoGPTLMHeadModel(
  (transformer): NanoGPTModel(
    (token_embedding): Embedding(50257, 384)
    (position_embedding): Embedding(512, 384)
    (blocks): ModuleList(
      (0-7): 8 x NanoGPTBlock(
        (ln1): LayerNorm((384,), eps=1e-05)
        (attn): NanoGPTAttention(
          (qkv_proj): Linear(384, 1152)
          (out_proj): Linear(384, 384)
        )
        (ln2): LayerNorm((384,), eps=1e-05)
        (ff): FeedForward(
          (fc1): Linear(384, 1536)
          (fc2): Linear(1536, 384)
        )
      )
    )
    (ln_f): LayerNorm((384,), eps=1e-05)
  )
  (lm_head): Linear(384, 50257)
)
```

**Note:** `token_embedding` and `lm_head` weights are tied (shared), so the 50,257 × 384 matrix (about 19.3M parameters) is stored only once. This reduces the 53M total to about 33.7M unique weights.
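The parameter counts can be tallied directly from the module shapes printed above. This sketch assumes that every `Linear` layer except `lm_head` has a bias term and that each `LayerNorm` has a weight and a bias. Under those assumptions the tally reproduces the 52,990,464 total exactly, and subtracting the shared `lm_head` matrix gives the number of unique weights.

```python
VOCAB, D_MODEL, N_LAYERS, D_FF, CONTEXT = 50257, 384, 8, 1536, 512

def linear(n_in, n_out, bias=True):
    """Parameter count of a Linear layer (ASSUMPTION: bias unless stated otherwise)."""
    return n_in * n_out + (n_out if bias else 0)

def layer_norm(dim):
    """Parameter count of a LayerNorm: one weight and one bias per feature."""
    return 2 * dim

token_embedding = VOCAB * D_MODEL
position_embedding = CONTEXT * D_MODEL
block = (
    layer_norm(D_MODEL)               # ln1
    + linear(D_MODEL, 3 * D_MODEL)    # qkv_proj
    + linear(D_MODEL, D_MODEL)        # out_proj
    + layer_norm(D_MODEL)             # ln2
    + linear(D_MODEL, D_FF)           # fc1
    + linear(D_FF, D_MODEL)           # fc2
)
lm_head = linear(D_MODEL, VOCAB, bias=False)  # tied to token_embedding

total = (token_embedding + position_embedding
         + N_LAYERS * block + layer_norm(D_MODEL) + lm_head)
unique = total - lm_head  # the tied matrix is stored only once

print(f"per block: {block:,}")
print(f"total:     {total:,}")
print(f"unique:    {unique:,}")
```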

## Training Configuration

```python
{
  "vocab_size": 50257,
  "d_model": 384,
  "n_layers": 8,
  "n_heads": 8,
  "d_ff": 1536,
  "context_length": 512,
  "dropout": 0.1,
  "batch_size": 12,
  "learning_rate": 3e-4,
  "weight_decay": 0.1,
  "max_iters": 20000
}
```
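A few derived quantities follow from this configuration. The sketch below wraps the same values in a dataclass. The class name `TrainConfig` and its properties are hypothetical and are not part of the released code. `tokens_processed` assumes every step uses a full batch of full-length sequences and counts repeated passes over the data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the configuration values listed above."""
    vocab_size: int = 50257
    d_model: int = 384
    n_layers: int = 8
    n_heads: int = 8
    d_ff: int = 1536
    context_length: int = 512
    dropout: float = 0.1
    batch_size: int = 12
    learning_rate: float = 3e-4
    weight_decay: float = 0.1
    max_iters: int = 20000

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def head_dim(self):
        return self.d_model // self.n_heads

    @property
    def tokens_per_step(self):
        return self.batch_size * self.context_length

    @property
    def tokens_processed(self):
        # Tokens seen over the whole run, including repeats across epochs.
        return self.tokens_per_step * self.max_iters

cfg = TrainConfig()
print(cfg.head_dim, cfg.tokens_per_step, cfg.tokens_processed)
```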

## Limitations

- **Context length:** Limited to 512 tokens (can't process longer documents)
- **Domain:** Trained only on simple children's stories (TinyStories); expect poor results on other kinds of text
- **Model size:** 53M parameters - significantly smaller than modern LLMs (1B+)
- **Generation quality:** Produces story-like narratives, but with frequent grammatical and logical errors
- **Factual accuracy:** Limited by small model size and training data
- **No instruction tuning:** Base language model - cannot follow instructions or engage in dialogue
- **Training data:** Only a few million tokens (modern models use trillions)
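Because of the 512-token context limit, longer inputs have to be cropped or split before they reach the model. The helpers below are a minimal sketch of both options, operating on plain lists of token IDs. The function names and the default stride of 256 are illustrative choices and are not part of this repository.

```python
CONTEXT_LENGTH = 512  # context length from the model details

def crop_to_context(token_ids, context_length=CONTEXT_LENGTH):
    """Keep only the most recent tokens that fit in the context window."""
    return token_ids[-context_length:]

def sliding_windows(token_ids, context_length=CONTEXT_LENGTH, stride=256):
    """Split a long sequence into overlapping windows of at most context_length tokens."""
    if not 0 < stride <= context_length:
        raise ValueError("stride must be in the range 1..context_length")
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + context_length])
        if start + context_length >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows

ids = list(range(1200))  # stand-in for tokenizer output
cropped = crop_to_context(ids)
windows = sliding_windows(ids)
print(len(cropped), [len(w) for w in windows])
```

Cropping suits generation, where only the most recent context matters. Overlapping windows suit evaluation over a long document, at the cost of scoring some tokens more than once.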

## Intended Use

**Primary use cases:**
- Educational demonstrations of transformer training
- Resource-constrained inference on Apple Silicon
- Base model for fine-tuning on specific domains
- Research and experimentation with Pre-LN architectures

**Not recommended for:**
- Production applications requiring factual accuracy
- Long-form content generation (>512 tokens)
- Instruction following or chat applications (not instruction-tuned)

## Ethical Considerations

This model was trained on TinyStories, a dataset of simple, synthetically generated children's stories. Users should:
- Be aware of potential biases in generated content
- Validate outputs for factual accuracy
- Not use for applications requiring high reliability
- Consider fine-tuning on domain-specific data for production use

## Citation

If you use this model, please cite:

```bibtex
@software{nanogpt_mlx_2025,
  author = {JackSu},
  title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
  year = {2025},
  url = {https://huggingface.co/jacksuuuu/tinystories}
}
```

## Additional Resources

- **GitHub Repository:** [JackSuuu/nanoGPT-on-MLX](https://github.com/JackSuuu/nanoGPT-on-MLX)
- **MLX Framework:** [ml-explore/mlx](https://github.com/ml-explore/mlx)
- **Training Dataset:** [roneneldan/TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories)

## License

MIT License - See repository for details.