---
language:
  - en
license: mit
tags:
  - text-generation
  - mlx
  - gpt
  - pre-ln
datasets:
  - HuggingFaceFW/fineweb-edu
metrics:
  - perplexity
model-index:
  - name: nanogpt-mlx-53m-finewebedu
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: FineWebEdu
          type: HuggingFaceFW/fineweb-edu
        metrics:
          - type: perplexity
            value: 690728
            name: Validation Perplexity
          - type: loss
            value: 0.758
            name: Training Loss
---

NanoGPT MLX 53M (FineWebEdu)

A 53-million parameter GPT model trained on FineWebEdu using Apple's MLX framework. This model features a Pre-LayerNorm (Pre-LN) transformer architecture optimized for Apple Silicon.

Model Details

  • Parameters: 53M (52,990,464 total)
  • Architecture: Pre-LN Transformer (8 layers, d_model = 384, 8 attention heads)
  • Context Length: 512 tokens
  • Vocabulary: 50,257 tokens (GPT-2 tokenizer)
  • Training Data: FineWebEdu (10M tokens, educational web content)
  • Training Framework: MLX (Apple Silicon optimized)
  • Hardware: M2 Pro with 16GB memory
  • Checkpoint: 35000 (includes knowledge distillation from GPT-OSS-20B)

Architecture Highlights

This model uses a Pre-LayerNorm architecture, in which LayerNorm is applied before each sublayer rather than after it (Post-LN, as in the original Transformer and GPT-1):

# Pre-LN (this model)
x = x + attn(ln(x))
x = x + ff(ln(x))

# vs Post-LN (original Transformer / GPT-1)
x = ln(x + attn(x))
x = ln(x + ff(x))

Pre-LN provides better training stability and is the arrangement used in GPT-2 and most later transformers (GPT-3, PaLM, LLaMA).
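
As a concrete illustration, a Pre-LN block can be written in a few lines with MLX's mlx.nn layers. The sketch below is illustrative only (the class and layer names are not the repo's own code) and assumes mlx.nn's MultiHeadAttention, LayerNorm, Sequential and GELU:

import mlx.core as mx
import mlx.nn as nn

class PreLNBlock(nn.Module):
    """One Pre-LN transformer block: LayerNorm runs before each sublayer,
    and the residual connection wraps around the normalized sublayer."""

    def __init__(self, d_model=384, n_heads=8, d_ff=1536):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def __call__(self, x, mask=None):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, mask)   # x = x + attn(ln(x))
        x = x + self.ff(self.ln2(x))       # x = x + ff(ln(x))
        return x

# Quick shape check with a causal mask over a 512-token context
x = mx.random.normal((1, 512, 384))
mask = nn.MultiHeadAttention.create_additive_causal_mask(512)
print(PreLNBlock()(x, mask).shape)  # (1, 512, 384)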

Training Details

  • Dataset: FineWebEdu (diverse educational web content)
  • Training Tokens: 10M
  • Base Training: 20,000 iterations (loss 0.758)
  • Knowledge Distillation: 15,000 additional iterations with GPT-OSS-20B as teacher
  • Total Iterations: 35,000
  • Batch Size: 12
  • Learning Rate: 3e-4 with cosine decay (base), 3e-5 (distillation)
  • Final Training Loss: 3.46
  • Distillation Method: 50% hard loss (ground truth) + 50% soft loss (teacher); see the sketch below
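
A minimal sketch of that 50/50 objective, written with MLX loss functions. The function name and the absence of a softmax temperature are assumptions; only the equal hard/soft weighting comes from the setup above:

import mlx.nn as nn

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """alpha weights the soft (teacher) term; 0.5 matches the 50/50 split above."""
    # Hard loss: cross-entropy against the ground-truth next tokens.
    hard = nn.losses.cross_entropy(student_logits, targets, reduction="mean")

    # Soft loss: KL divergence from the teacher's distribution to the student's.
    # kl_div_loss expects log-probabilities for both arguments.
    soft = nn.losses.kl_div_loss(
        nn.log_softmax(student_logits, axis=-1),
        nn.log_softmax(teacher_logits, axis=-1),
        reduction="mean",
    )
    return (1 - alpha) * hard + alpha * soft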

Performance Benchmarks

Training and inference on M2 Pro (measured at checkpoint 20000):

📊 Model Size:      53.0M parameters
                   202.1 MB (fp32), 101.1 MB (fp16)

⚡ Training:        27,355 tokens/sec (forward pass)
                   13.36 batches/sec (batch=4, seq=512)

🎯 Inference:       169.9 tokens/sec
                   ~0.59s per 100 tokens

💾 Memory:          843 MB activations (batch=4, seq=512)

Note: This checkpoint (35000) includes additional training with knowledge distillation.
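
A rough way to reproduce a tokens/sec figure on your own machine through the Hugging Face interface (model and tokenizer loaded as in the Usage section below; note this times transformers generation rather than the MLX path used for the numbers above):

import time

inputs = tokenizer("Once upon a time", return_tensors="pt")
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=100, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
elapsed = time.perf_counter() - start
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")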

Usage

Basic Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (requires trust_remote_code for custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/nanogpt-mlx-53m-finewebedu",
    trust_remote_code=True
)

# Generate text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    top_k=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Example Output

Prompt: "Once upon a time"

Generated (Checkpoint 35000 with distillation):

Once upon a time: "the)." as in KDE, set by an article of the U and 
updated to the existing of a network. For requirements of the application 
to an individual to the data above above above above...

Note: This checkpoint shows characteristics of knowledge distillation training. The model has learned broader patterns from the teacher model (GPT-OSS-20B), though generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case.

Model Architecture

NanoGPTLMHeadModel(
  (transformer): NanoGPTModel(
    (token_embedding): Embedding(50257, 384)
    (position_embedding): Embedding(512, 384)
    (blocks): ModuleList(
      (0-7): 8 x NanoGPTBlock(
        (ln1): LayerNorm((384,), eps=1e-05)
        (attn): NanoGPTAttention(
          (qkv_proj): Linear(384, 1152)
          (out_proj): Linear(384, 384)
        )
        (ln2): LayerNorm((384,), eps=1e-05)
        (ff): FeedForward(
          (fc1): Linear(384, 1536)
          (fc2): Linear(1536, 384)
        )
      )
    )
    (ln_f): LayerNorm((384,), eps=1e-05)
  )
  (lm_head): Linear(384, 50257)
)

Note: token_embedding and lm_head weights are tied (shared), so the 53.0M total counts the 50,257 × 384 ≈ 19.3M embedding matrix twice; the model has roughly 33.7M unique weights.
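
To check the tying through the Hugging Face interface (assuming the custom model class implements the standard embedding accessors):

# Compare storage of the input embedding and the output projection;
# they point at the same tensor when the weights are tied.
emb = model.get_input_embeddings().weight
head = model.get_output_embeddings().weight
print(emb.data_ptr() == head.data_ptr())  # True if tied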

Training Configuration

{
  "vocab_size": 50257,
  "d_model": 384,
  "n_layers": 8,
  "n_heads": 8,
  "d_ff": 1536,
  "context_length": 512,
  "dropout": 0.1,
  "batch_size": 12,
  "learning_rate": 3e-4,
  "weight_decay": 0.1,
  "max_iters": 20000
}
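
As a sanity check, the 52,990,464 parameter total from Model Details can be reproduced from this config, assuming biased linear layers inside the blocks and a bias-free lm_head:

cfg = {"vocab_size": 50257, "d_model": 384, "n_layers": 8,
       "n_heads": 8, "d_ff": 1536, "context_length": 512}

d, v, f = cfg["d_model"], cfg["vocab_size"], cfg["d_ff"]
embeddings = v * d + cfg["context_length"] * d                 # token + position
attn = (d * 3 * d + 3 * d) + (d * d + d)                       # qkv_proj + out_proj
mlp = (d * f + f) + (f * d + d)                                # fc1 + fc2
block = attn + mlp + 2 * 2 * d                                 # + ln1, ln2 (scale and bias)
total = embeddings + cfg["n_layers"] * block + 2 * d + v * d   # + ln_f + lm_head
print(f"{total:,}")  # 52,990,464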

Limitations

  • Context length: Limited to 512 tokens
  • Domain: Trained on educational web content (FineWebEdu)
  • Size: 53M parameters is relatively small compared to modern LLMs
  • Generation: Best for short-form content (stories, paragraphs)
  • No instruction tuning: This is a base language model, not instruction-tuned

Intended Use

Primary use cases:

  • Educational demonstrations of transformer training
  • Resource-constrained inference on Apple Silicon
  • Base model for fine-tuning on specific domains
  • Research and experimentation with Pre-LN architectures

Not recommended for:

  • Production applications requiring factual accuracy
  • Long-form content generation (>512 tokens)
  • Instruction following or chat applications (not instruction-tuned)

Ethical Considerations

This model was trained on FineWebEdu, which contains diverse web content. Users should:

  • Be aware of potential biases in generated content
  • Validate outputs for factual accuracy
  • Not use for applications requiring high reliability
  • Consider fine-tuning on domain-specific data for production use

Citation

If you use this model, please cite:

@software{nanogpt_mlx_2025,
  author = {JackSu},
  title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
  year = {2025},
  url = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
}

License

MIT License - See repository for details.