---
language:
- en
license: mit
tags:
- text-generation
- pytorch
- gpt
- transformers
- pre-ln
- causal-lm
datasets:
- HuggingFaceFW/fineweb-edu
library_name: transformers
pipeline_tag: text-generation
metrics:
- perplexity
widget:
- text: "Once upon a time"
  example_title: "Story Beginning"
- text: "The capital of France is"
  example_title: "Factual Question"
- text: "In the field of machine learning,"
  example_title: "Technical Topic"
---
# NanoGPT 53M - Pre-LN Transformer

A 53-million-parameter GPT model trained from scratch on FineWebEdu educational content. The model implements a Pre-LayerNorm (Pre-LN) transformer architecture and is compatible with the Hugging Face Transformers library.

**Model Format:** PyTorch (cross-platform compatible)
**Training Framework:** Apple MLX (exported to PyTorch for universal compatibility)
## Model Details

### Architecture
- Model Type: GPT (Decoder-only Transformer)
- Parameters: 53M (52,990,464 total; ~33.7M unique with weight tying)
- Architecture Pattern: Pre-LayerNorm (Pre-LN)
- Layers: 8 transformer blocks
- Hidden Size: 384
- Attention Heads: 8
- Feedforward Dimension: 1536
- Context Length: 512 tokens
- Vocabulary Size: 50,257 (GPT-2 tokenizer)
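
These values can also be read back from the hub checkpoint itself. A quick inspection sketch follows; attribute names depend on the custom config class shipped with the repository, so treat the output as indicative:

```python
from transformers import AutoConfig

# Load the packaged configuration (trust_remote_code is required for the
# custom architecture); printed field names come from the repo's config class.
config = AutoConfig.from_pretrained(
    "jacksuuuu/nanogpt-mlx-53m-finewebedu", trust_remote_code=True
)
print(config)
```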
### Training
- Framework: Apple MLX (training), PyTorch (export)
- Dataset: FineWebEdu - 10M tokens of educational web content
- Training Hardware: Apple M2 Pro (16GB unified memory)
- Checkpoint: 35000 iterations
- Training Method: Base pretraining (20K iters) + Knowledge Distillation (15K iters)
- Teacher Model: GPT-OSS-20B (via Groq API)
## Architecture Highlights

This model uses a Pre-LayerNorm architecture, in contrast to the Post-LN arrangement of the original Transformer:
```python
# Pre-LN (this model)
x = x + attn(ln(x))
x = x + ff(ln(x))

# vs. Post-LN (original Transformer)
x = ln(x + attn(x))
x = ln(x + ff(x))
```
Pre-LN provides better training stability and is used in modern transformers (GPT-3, PaLM, LLaMA).
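
For reference, here is a minimal PyTorch sketch of a Pre-LN block following the pattern above. The class and module layout are illustrative, not the packaged `NanoGPTBlock` implementation; the default dimensions mirror this model's configuration:

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Illustrative Pre-LN transformer block: LayerNorm is applied *before*
    each sub-layer, and the residual path stays un-normalized."""

    def __init__(self, d_model=384, n_heads=8, d_ff=1536, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x, attn_mask=None):
        # x = x + attn(ln(x)): normalize first, then attend, then add the residual
        h = self.ln1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + h
        # x = x + ff(ln(x))
        x = x + self.ff(self.ln2(x))
        return x
```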
## Training Details
- Dataset: FineWebEdu (diverse educational web content)
- Training Tokens: 10M
- Base Training: 20,000 iterations (loss 0.758)
- Knowledge Distillation: 15,000 additional iterations with GPT-OSS-20B as teacher
- Total Iterations: 35,000
- Batch Size: 12
- Learning Rate: 3e-4 with cosine decay (base), 3e-5 (distillation)
- Final Training Loss: 3.46
- Distillation Method: 50% hard loss (ground truth) + 50% soft loss (teacher); see the sketch below
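
A rough PyTorch-style sketch of the 50/50 distillation objective described above. The actual training loop ran in MLX, and the temperature value here is an assumption:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, temperature=1.0):
    """Equal mix of cross-entropy against ground truth ("hard") and KL divergence
    against the teacher distribution ("soft"). alpha=0.5 reproduces the 50/50 split."""
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), targets.view(-1)
    )
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft
```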
## Performance Benchmarks
Measured on Apple M2 Pro (16GB unified memory):
| Metric | Value |
|---|---|
| Model Size | 53.0M parameters |
| Memory (fp32) | 202.1 MB |
| Memory (fp16) | 101.1 MB |
| Training Throughput | 27,355 tokens/sec |
| Batch Processing | 13.36 batches/sec (batch=4, seq=512) |
| Inference Speed | 169.9 tokens/sec |
| Generation Latency | ~0.59s per 100 tokens |
| Activation Memory | 843 MB (batch=4, seq=512) |
**Note:** Benchmarks were measured at checkpoint 20,000. This release (checkpoint 35,000) includes additional knowledge distillation training.
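
The fp32/fp16 memory figures follow directly from the parameter count; a quick sanity check:

```python
# Weight memory from the parameter count (values in MiB);
# small buffers and framework overhead are ignored.
n_params = 52_990_464
print(f"fp32: {n_params * 4 / 2**20:.1f} MiB")  # ~202.1
print(f"fp16: {n_params * 2 / 2**20:.1f} MiB")  # ~101.1
```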
## Usage

### Basic Text Generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer (trust_remote_code is required for the custom architecture)
tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
model = AutoModelForCausalLM.from_pretrained(
    "jacksuuuu/nanogpt-mlx-53m-finewebedu",
    trust_remote_code=True
)

# Generate text
prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=100,
    temperature=0.8,
    top_k=50,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
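
Since perplexity is the reported metric, a minimal evaluation sketch is shown below, reusing `model` and `tokenizer` from above. It assumes the custom model class follows the usual `AutoModelForCausalLM` convention of returning a loss when `labels` are provided:

```python
import torch

# Perplexity on a single snippet: exp of the mean token-level cross-entropy.
text = "The mitochondria is the powerhouse of the cell."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(input_ids=enc["input_ids"], labels=enc["input_ids"])
perplexity = torch.exp(out.loss)
print(f"Perplexity: {perplexity.item():.2f}")
```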
### Example Output

**Prompt:** "Once upon a time"

**Generated (checkpoint 35,000, with distillation):**

```
Once upon a time: "the)." as in KDE, set by an article of the U and
updated to the existing of a network. For requirements of the application
to an individual to the data above above above above...
```
**Note:** This checkpoint reflects knowledge distillation training: the model has picked up broader patterns from the teacher (GPT-OSS-20B), but generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case (a sketch follows below).
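
As a starting point for such fine-tuning, here is a sketch using the Hugging Face `Trainer`. The dataset (`roneneldan/TinyStories`) and hyperparameters are illustrative choices, not part of this model's training recipe, and the sketch assumes the custom model class returns a loss from `labels` like standard Transformers models:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# The GPT-2 tokenizer has no pad token; reuse EOS for padding.
tokenizer.pad_token = tokenizer.eos_token

# Example corpus only -- swap in your own data.
dataset = load_dataset("roneneldan/TinyStories", split="train[:1%]")

def tokenize(batch):
    # Truncate to the model's 512-token context length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="nanogpt-finetuned",
        per_device_train_batch_size=8,
        num_train_epochs=1,
        learning_rate=3e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```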
## Model Architecture

```
NanoGPTLMHeadModel(
  (transformer): NanoGPTModel(
    (token_embedding): Embedding(50257, 384)
    (position_embedding): Embedding(512, 384)
    (blocks): ModuleList(
      (0-7): 8 x NanoGPTBlock(
        (ln1): LayerNorm((384,), eps=1e-05)
        (attn): NanoGPTAttention(
          (qkv_proj): Linear(384, 1152)
          (out_proj): Linear(384, 384)
        )
        (ln2): LayerNorm((384,), eps=1e-05)
        (ff): FeedForward(
          (fc1): Linear(384, 1536)
          (fc2): Linear(1536, 384)
        )
      )
    )
    (ln_f): LayerNorm((384,), eps=1e-05)
  )
  (lm_head): Linear(384, 50257)
)
```
**Note:** token_embedding and lm_head weights are tied (shared), reducing the unique parameter count from 53.0M to roughly 33.7M. The tying can be checked directly, as shown below.
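
A quick check of the weight tying from the loaded model, using the module names from the printout above:

```python
# The LM head should share storage with the token embedding when weights are tied.
emb = model.transformer.token_embedding.weight
head = model.lm_head.weight
print(emb.data_ptr() == head.data_ptr())  # True when the weights are tied
print(tuple(emb.shape))                   # (50257, 384)
```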
## Training Configuration

```json
{
  "vocab_size": 50257,
  "d_model": 384,
  "n_layers": 8,
  "n_heads": 8,
  "d_ff": 1536,
  "context_length": 512,
  "dropout": 0.1,
  "batch_size": 12,
  "learning_rate": 3e-4,
  "weight_decay": 0.1,
  "max_iters": 20000
}
```
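
As a sanity check, the reported total of 52,990,464 parameters can be reproduced from this configuration, assuming biases on every linear layer and no bias on the tied `lm_head`:

```python
# Back-of-the-envelope parameter count from the configuration above.
V, d, n_layers, d_ff, ctx = 50257, 384, 8, 1536, 512

embeddings = V * d + ctx * d                    # token + position embeddings
attn = (d * 3 * d + 3 * d) + (d * d + d)        # qkv_proj + out_proj (with biases)
ff = (d * d_ff + d_ff) + (d_ff * d + d)         # fc1 + fc2 (with biases)
norms = 2 * 2 * d                               # ln1 + ln2 (weight and bias each)
block = attn + ff + norms

total = embeddings + n_layers * block + 2 * d + V * d  # + ln_f + lm_head (no bias)
print(total)            # 52990464
print(total - V * d)    # 33691776 unique parameters with weight tying
```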
## Limitations
- Context length: Limited to 512 tokens
- Domain: Trained on educational web content (FineWebEdu)
- Size: 53M parameters is relatively small compared to modern LLMs
- Generation: Best for short-form content (stories, paragraphs)
- No instruction tuning: This is a base language model, not instruction-tuned
## Intended Use

**Primary use cases:**
- Educational demonstrations of transformer training
- Resource-constrained inference on Apple Silicon
- Base model for fine-tuning on specific domains
- Research and experimentation with Pre-LN architectures
**Not recommended for:**
- Production applications requiring factual accuracy
- Long-form content generation (>512 tokens)
- Instruction following or chat applications (not instruction-tuned)
## Ethical Considerations
This model was trained on FineWebEdu, which contains diverse web content. Users should:
- Be aware of potential biases in generated content
- Validate outputs for factual accuracy
- Not use for applications requiring high reliability
- Consider fine-tuning on domain-specific data for production use
## Citation

If you use this model, please cite:
```bibtex
@software{nanogpt_mlx_2025,
  author = {JackSu},
  title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
  year = {2025},
  url = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
}
```
## Additional Resources
- GitHub Repository: JackSuuu/nanoGPT-on-MLX
- MLX Framework: ml-explore/mlx
- Training Dataset: HuggingFaceFW/fineweb-edu
## License
MIT License - See repository for details.