	---
	language:
	- en
	license: mit
	tags:
	- text-generation
	- mlx
	- gpt
	- pre-ln
	datasets:
	- HuggingFaceFW/fineweb-edu
	metrics:
	- perplexity
	model-index:
	- name: nanogpt-mlx-53m-finewebedu
	  results:
	  - task:
	      type: text-generation
	      name: Text Generation
	    dataset:
	      name: FineWebEdu
	      type: HuggingFaceFW/fineweb-edu
	    metrics:
	    - type: perplexity
	      value: 690728
	      name: Validation Perplexity
	    - type: loss
	      value: 0.758
	      name: Training Loss
	---

	# NanoGPT MLX 53M (FineWebEdu)

	A 53-million-parameter GPT model trained on FineWebEdu using Apple's MLX framework. It uses a Pre-LayerNorm (Pre-LN) transformer architecture and was trained on Apple Silicon (M2 Pro).

	## Model Details

	- Parameters: 53M (52,990,464 total)
	- Architecture: Pre-LN Transformer (8 layers, 384d model, 8 attention heads)
	- Context Length: 512 tokens
	- Vocabulary: 50,257 tokens (GPT-2 tokenizer)
	- Training Data: FineWebEdu (10M tokens, educational web content)
	- Training Framework: MLX (Apple Silicon optimized)
	- Hardware: M2 Pro with 16GB memory
	- Checkpoint: 35000 (includes knowledge distillation from GPT-OSS-20B)

	### Architecture Highlights

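	This model uses a Pre-LayerNorm (Pre-LN) architecture, in which each sub-layer normalizes its input. This differs from the Post-LN layout of the original Transformer and GPT-1, which normalizes after the residual addition. GPT-2 itself is also Pre-LN.

	```python
	# Pre-LN (this model): normalize the input of each sub-layer
	x = x + attn(ln1(x))
	x = x + ff(ln2(x))

	# vs Post-LN (original Transformer, GPT-1): normalize after the residual add
	x = ln1(x + attn(x))
	x = ln2(x + ff(x))
	```

	Pre-LN generally trains more stably than Post-LN, especially as depth grows, and is the layout used by GPT-2, GPT-3, PaLM, and LLaMA.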
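
The difference between the two layouts can be made concrete with a small runnable toy. This is only an illustrative sketch: `toy_sublayer` is a hypothetical stand-in for attention or feed-forward, and `layer_norm` omits the learned scale and shift that a real LayerNorm has. It shows that Pre-LN leaves the residual stream untouched, while Post-LN renormalizes it at every step.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or feed-forward (any vector -> vector map)."""
    return [0.5 * v + 0.1 for v in x]

def pre_ln_step(x):
    # Only the branch input is normalized; the skip path carries x through unchanged.
    branch = toy_sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, branch)]

def post_ln_step(x):
    # The residual sum itself is normalized, so the skip path is rescaled every layer.
    summed = [a + b for a, b in zip(x, toy_sublayer(x))]
    return layer_norm(summed)

x = [10.0, 20.0, 30.0, 40.0]
pre = pre_ln_step(x)    # still centered near the input's mean of 25
post = post_ln_step(x)  # re-centered to mean 0 by the final normalization

print("pre-LN mean: ", sum(pre) / len(pre))
print("post-LN mean:", sum(post) / len(post))
```

Because the Pre-LN skip path is an identity, gradients reach early layers without passing through a normalization at every block, which is the usual explanation for its stability.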

	## Training Details

	- Dataset: FineWebEdu (diverse educational web content)
	- Training Tokens: 10M
	- Base Training: 20,000 iterations (loss 0.758)
	- Knowledge Distillation: 15,000 additional iterations with GPT-OSS-20B as teacher
	- Total Iterations: 35,000
	- Batch Size: 12
	- Learning Rate: 3e-4 with cosine decay (base), 3e-5 (distillation)
	- Final Training Loss: 3.46
	- Distillation Method: 50% hard loss (ground truth) + 50% soft loss (teacher)
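
The 50/50 split between hard and soft loss can be sketched for a single token position as follows. The card does not say how the soft loss is computed, so this sketch assumes a common formulation: cross-entropy against the ground-truth token for the hard part, and KL divergence from teacher to student over a shared vocabulary for the soft part. The function names, the `temperature` parameter, and the example logits are all illustrative assumptions, not the training code.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target, alpha=0.5, temperature=1.0):
    """alpha * hard cross-entropy + (1 - alpha) * KL(teacher || student), one position."""
    # Hard loss: negative log-likelihood of the ground-truth token.
    hard = -math.log(softmax(student_logits)[target])

    # Soft loss: KL divergence between temperature-softened distributions (assumed form).
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(
        t * (math.log(t) - math.log(s))
        for t, s in zip(p_teacher, p_student)
        if t > 0
    )
    return alpha * hard + (1 - alpha) * soft

student = [2.0, 0.5, -1.0]   # made-up logits over a 3-token vocabulary
teacher = [3.0, 0.0, -2.0]
loss = distillation_loss(student, teacher, target=0)
matched = distillation_loss(student, student, target=0)  # soft term vanishes
print(loss, matched)
```

With `alpha=0.5` the two terms are weighted equally, matching the split listed above. Some implementations also scale the soft term by the squared temperature; that is omitted here because the temperature defaults to 1.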

	### Performance Benchmarks

	Training and inference on M2 Pro (measured at checkpoint 20000):

	```
	📊 Model Size: 53.0M parameters
	    202.1 MB (fp32), 101.1 MB (fp16)

	⚡ Training: 27,355 tokens/sec (forward pass)
	    13.36 batches/sec (batch=4, seq=512)

	🎯 Inference: 169.9 tokens/sec
	    ~0.59s per 100 tokens

	💾 Memory: 843 MB activations (batch=4, seq=512)
	```

	Note: These benchmarks were measured at checkpoint 20000. The released checkpoint (35000) adds 15,000 iterations of knowledge distillation.
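
The size and throughput figures can be cross-checked with a few lines of arithmetic. The sketch below assumes the reported "MB" values are binary megabytes (1024² bytes), which is the reading that reproduces them. The training throughput it derives differs slightly from the reported 27,355 tokens/sec because 13.36 batches/sec is itself rounded.

```python
PARAMS = 52_990_464           # total parameter count from Model Details
MIB = 1024 ** 2               # assumption: reported "MB" are binary megabytes

fp32_mb = PARAMS * 4 / MIB    # 4 bytes per fp32 weight
fp16_mb = PARAMS * 2 / MIB    # 2 bytes per fp16 weight

batches_per_sec, batch, seq = 13.36, 4, 512
train_tokens_per_sec = batches_per_sec * batch * seq

inference_tokens_per_sec = 169.9
seconds_per_100_tokens = 100 / inference_tokens_per_sec

print(f"{fp32_mb:.1f} MB fp32, {fp16_mb:.1f} MB fp16")
print(f"{train_tokens_per_sec:,.0f} training tokens/sec")
print(f"{seconds_per_100_tokens:.2f} s per 100 generated tokens")
```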

	## Usage

	### Basic Text Generation

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Load model and tokenizer (requires trust_remote_code for custom architecture)
	tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
	model = AutoModelForCausalLM.from_pretrained(
	    "jacksuuuu/nanogpt-mlx-53m-finewebedu",
	    trust_remote_code=True,
	)

	# Generate text
	prompt = "Once upon a time"
	inputs = tokenizer(prompt, return_tensors="pt")
	outputs = model.generate(
	    **inputs,
	    max_length=100,
	    temperature=0.8,
	    top_k=50,
	    do_sample=True,
	    pad_token_id=tokenizer.eos_token_id,
	)

	text = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(text)
	```

	### Example Output

	Prompt: "Once upon a time"

	Generated (Checkpoint 35000 with distillation):
	```
	Once upon a time: "the)." as in KDE, set by an article of the U and
	updated to the existing of a network. For requirements of the application
	to an individual to the data above above above above...
	```

	Note: This checkpoint reflects the knowledge-distillation phase, in which the model was exposed to broader patterns from the teacher model (GPT-OSS-20B). Generation quality varies: as the sample above shows, output can lose coherence and fall into repetition. For more coherent story generation, consider fine-tuning on data for your specific use case.

	## Model Architecture

	```python
	NanoGPTLMHeadModel(
	  (transformer): NanoGPTModel(
	    (token_embedding): Embedding(50257, 384)
	    (position_embedding): Embedding(512, 384)
	    (blocks): ModuleList(
	      (0-7): 8 x NanoGPTBlock(
	        (ln1): LayerNorm((384,), eps=1e-05)
	        (attn): NanoGPTAttention(
	          (qkv_proj): Linear(384, 1152)
	          (out_proj): Linear(384, 384)
	        )
	        (ln2): LayerNorm((384,), eps=1e-05)
	        (ff): FeedForward(
	          (fc1): Linear(384, 1536)
	          (fc2): Linear(1536, 384)
	        )
	      )
	    )
	    (ln_f): LayerNorm((384,), eps=1e-05)
	  )
	  (lm_head): Linear(384, 50257)
	)
	```

	Note: `token_embedding` and `lm_head` weights are tied (shared). The 50257 × 384 matrix (about 19.3M weights) is therefore stored once, which reduces the 53.0M total to about 33.7M unique weights.
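
The parameter totals can be reproduced from the module listing above. The sketch below assumes bias terms on every `Linear` layer except `lm_head`, and a weight plus bias vector in each `LayerNorm`. These assumptions are not stated in the card, but they are the ones that reproduce the reported 52,990,464 exactly. The helper names are illustrative.

```python
def linear(n_in, n_out, bias=True):
    """Parameter count of a dense layer (assumed to carry a bias unless noted)."""
    return n_in * n_out + (n_out if bias else 0)

def layer_norm(d):
    """Parameter count of LayerNorm: one weight and one bias vector."""
    return 2 * d

vocab, ctx, d_model, d_ff, n_layers = 50257, 512, 384, 1536, 8

embeddings = vocab * d_model + ctx * d_model     # token + position tables
block = (
    layer_norm(d_model)                          # ln1
    + linear(d_model, 3 * d_model)               # qkv_proj
    + linear(d_model, d_model)                   # out_proj
    + layer_norm(d_model)                        # ln2
    + linear(d_model, d_ff)                      # fc1
    + linear(d_ff, d_model)                      # fc2
)
lm_head = linear(d_model, vocab, bias=False)     # assumed bias-free output projection

total = embeddings + n_layers * block + layer_norm(d_model) + lm_head
unique = total - lm_head                         # lm_head reuses the token embedding

print(f"total:  {total:,}")    # total:  52,990,464
print(f"unique: {unique:,}")   # unique: 33,691,776
```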

	## Training Configuration

	```python
	{
	    "vocab_size": 50257,
	    "d_model": 384,
	    "n_layers": 8,
	    "n_heads": 8,
	    "d_ff": 1536,
	    "context_length": 512,
	    "dropout": 0.1,
	    "batch_size": 12,
	    "learning_rate": 3e-4,
	    "weight_decay": 0.1,
	    "max_iters": 20000
	}
	```
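
To illustrate how these settings interact, here is a minimal sketch of a cosine-decay learning-rate schedule together with the token budget implied by the batch size. The card gives only the peak rate (3e-4) and the iteration count, so the floor (`min_lr`) and the optional warmup are assumptions. The token arithmetic further assumes one batch of 12 full-length sequences per iteration with no gradient accumulation.

```python
import math

def cosine_lr(step, max_lr=3e-4, max_iters=20000, min_lr=0.0, warmup_iters=0):
    """Cosine decay from max_lr to min_lr; min_lr and warmup_iters are assumed values."""
    if warmup_iters and step < warmup_iters:
        return max_lr * (step + 1) / warmup_iters
    progress = min(1.0, (step - warmup_iters) / max(1, max_iters - warmup_iters))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 5000, 10000, 15000, 20000):
    print(step, f"{cosine_lr(step):.2e}")

# Token budget, assuming every iteration consumes batch_size * context_length tokens.
tokens_per_iter = 12 * 512
tokens_seen = tokens_per_iter * 20000
passes_over_data = tokens_seen / 10_000_000   # dataset size from Training Details
print(f"{tokens_seen:,} tokens seen, about {passes_over_data:.1f} passes over 10M tokens")
```

Under these assumptions the base run sees each training token roughly a dozen times, which is worth keeping in mind when comparing training loss with validation metrics.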

	## Limitations

	- Context length: Limited to 512 tokens
	- Domain: Trained on educational web content (FineWebEdu)
	- Size: 53M parameters is relatively small compared to modern LLMs
	- Generation: Best for short-form content (stories, paragraphs)
	- No instruction tuning: This is a base language model, not instruction-tuned
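
Because the position embedding table has only 512 entries, the prompt plus the generated tokens must fit in that window. One way to guard against overflow is to trim the prompt before generation, as in this sketch. The helper `fit_to_context` is hypothetical and not part of the model's code; it works on plain lists of token ids so that it stays independent of any tokenizer.

```python
CONTEXT_LENGTH = 512  # size of the model's position embedding table

def fit_to_context(prompt_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    budget = context_length - max_new_tokens   # room left for the prompt
    return prompt_ids[-budget:]

prompt_ids = list(range(600))                  # stand-in for tokenizer output
trimmed = fit_to_context(prompt_ids, max_new_tokens=100)
print(len(trimmed), trimmed[0], trimmed[-1])   # 412 188 599
```

Keeping the tail of the prompt rather than the head preserves the text closest to where generation continues. A different policy may suit prompts whose key instructions appear at the start.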

	## Intended Use

	Primary use cases:
	- Educational demonstrations of transformer training
	- Resource-constrained inference on Apple Silicon
	- Base model for fine-tuning on specific domains
	- Research and experimentation with Pre-LN architectures

	Not recommended for:
	- Production applications requiring factual accuracy
	- Long-form content generation (>512 tokens)
	- Instruction following or chat applications (not instruction-tuned)

	## Ethical Considerations

	This model was trained on FineWebEdu, which contains diverse web content. Users should:
	- Be aware of potential biases in generated content
	- Validate outputs for factual accuracy
	- Not use for applications requiring high reliability
	- Consider fine-tuning on domain-specific data for production use

	## Citation

	If you use this model, please cite:

	```bibtex
	@software{nanogpt_mlx_2025,
	author = {JackSu},
	title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
	year = {2025},
	url = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
	}
	```

	## Additional Resources

	- GitHub Repository: [JackSuuu/nanoGPT-on-MLX](https://github.com/JackSuuu/nanoGPT-on-MLX)
	- MLX Framework: [ml-explore/mlx](https://github.com/ml-explore/mlx)
	- Training Dataset: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)

	## License

	MIT License - See repository for details.