Update model card: professional format, remove MLX version reference

c11d047 verified 5 months ago

7.19 kB

	---
	language:
	- en
	license: mit
	tags:
	- text-generation
	- pytorch
	- gpt
	- transformers
	- pre-ln
	- causal-lm
	datasets:
	- HuggingFaceFW/fineweb-edu
	library_name: transformers
	pipeline_tag: text-generation
	metrics:
	- perplexity
	widget:
	- text: "Once upon a time"
	example_title: "Story Beginning"
	- text: "The capital of France is"
	example_title: "Factual Question"
	- text: "In the field of machine learning,"
	example_title: "Technical Topic"
	---

	# NanoGPT 53M - Pre-LN Transformer

	A 53-million parameter GPT model trained from scratch on FineWebEdu educational content. This model implements a Pre-LayerNorm (Pre-LN) transformer architecture, compatible with HuggingFace Transformers library.

	> Model Format: PyTorch (cross-platform compatible)
	> Training Framework: Apple MLX (exported to PyTorch for universal compatibility)

	## Model Details

	### Architecture
	- Model Type: GPT (Decoder-only Transformer)
	- Parameters: 53M (52,990,464 total, 43M unique with weight tying)
	- Architecture Pattern: Pre-LayerNorm (Pre-LN)
	- Layers: 8 transformer blocks
	- Hidden Size: 384
	- Attention Heads: 8
	- Feedforward Dimension: 1536
	- Context Length: 512 tokens
	- Vocabulary Size: 50,257 (GPT-2 tokenizer)

	### Training
	- Framework: Apple MLX (training), PyTorch (export)
	- Dataset: FineWebEdu - 10M tokens of educational web content
	- Training Hardware: Apple M2 Pro (16GB unified memory)
	- Checkpoint: 35000 iterations
	- Training Method: Base pretraining (20K iters) + Knowledge Distillation (15K iters)
	- Teacher Model: GPT-OSS-20B (via Groq API)

	### Architecture Highlights

	This model uses Pre-LayerNorm architecture, different from standard GPT-2's Post-LN:

	```python
	# Pre-LN (this model)
	x = x + attn(ln(x))
	x = x + ff(ln(x))

	# vs Post-LN (standard GPT-2)
	x = ln(x + attn(x))
	x = ln(x + ff(x))
	```

	Pre-LN provides better training stability and is used in modern transformers (GPT-3, PaLM, LLaMA).

	## Training Details

	- Dataset: FineWebEdu (diverse educational web content)
	- Training Tokens: 10M
	- Base Training: 20,000 iterations (loss 0.758)
	- Knowledge Distillation: 15,000 additional iterations with GPT-OSS-20B as teacher
	- Total Iterations: 35,000
	- Batch Size: 12
	- Learning Rate: 3e-4 with cosine decay (base), 3e-5 (distillation)
	- Final Training Loss: 3.46
	- Distillation Method: 50% hard loss (ground truth) + 50% soft loss (teacher)

	### Performance Benchmarks

	Measured on Apple M2 Pro (16GB unified memory):

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Model Size \| 53.0M parameters \|
	\| Memory (fp32) \| 202.1 MB \|
	\| Memory (fp16) \| 101.1 MB \|
	\| Training Throughput \| 27,355 tokens/sec \|
	\| Batch Processing \| 13.36 batches/sec (batch=4, seq=512) \|
	\| Inference Speed \| 169.9 tokens/sec \|
	\| Generation Latency \| ~0.59s per 100 tokens \|
	\| Activation Memory \| 843 MB (batch=4, seq=512) \|

	> Note: Benchmarks measured at checkpoint 20000. This release (checkpoint 35000) includes additional knowledge distillation training.

	## Usage

	### Basic Text Generation

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	# Load model and tokenizer (requires trust_remote_code for custom architecture)
	tokenizer = AutoTokenizer.from_pretrained("jacksuuuu/nanogpt-mlx-53m-finewebedu")
	model = AutoModelForCausalLM.from_pretrained(
	"jacksuuuu/nanogpt-mlx-53m-finewebedu",
	trust_remote_code=True
	)

	# Generate text
	prompt = "Once upon a time"
	inputs = tokenizer(prompt, return_tensors="pt")
	outputs = model.generate(
	**inputs,
	max_length=100,
	temperature=0.8,
	top_k=50,
	do_sample=True,
	pad_token_id=tokenizer.eos_token_id
	)

	text = tokenizer.decode(outputs[0], skip_special_tokens=True)
	print(text)
	```

	### Example Output

	Prompt: "Once upon a time"

	Generated (Checkpoint 35000 with distillation):
	```
	Once upon a time: "the)." as in KDE, set by an article of the U and
	updated to the existing of a network. For requirements of the application
	to an individual to the data above above above above...
	```

	Note: This checkpoint shows characteristics of knowledge distillation training. The model has learned broader patterns from the teacher model (GPT-OSS-20B), though generation quality varies. For more coherent story generation, consider fine-tuning on your specific use case.

	## Model Architecture

	```python
	NanoGPTLMHeadModel(
	(transformer): NanoGPTModel(
	(token_embedding): Embedding(50257, 384)
	(position_embedding): Embedding(512, 384)
	(blocks): ModuleList(
	(0-7): 8 x NanoGPTBlock(
	(ln1): LayerNorm((384,), eps=1e-05)
	(attn): NanoGPTAttention(
	(qkv_proj): Linear(384, 1152)
	(out_proj): Linear(384, 384)
	)
	(ln2): LayerNorm((384,), eps=1e-05)
	(ff): FeedForward(
	(fc1): Linear(384, 1536)
	(fc2): Linear(1536, 384)
	)
	)
	)
	(ln_f): LayerNorm((384,), eps=1e-05)
	)
	(lm_head): Linear(384, 50257)
	)
	```

	Note: `token_embedding` and `lm_head` weights are tied (shared), reducing effective parameters from 53M to 43M unique weights.

	## Training Configuration

	```python
	{
	"vocab_size": 50257,
	"d_model": 384,
	"n_layers": 8,
	"n_heads": 8,
	"d_ff": 1536,
	"context_length": 512,
	"dropout": 0.1,
	"batch_size": 12,
	"learning_rate": 3e-4,
	"weight_decay": 0.1,
	"max_iters": 20000
	}
	```

	## Limitations

	- Context length: Limited to 512 tokens
	- Domain: Trained on educational web content (FineWebEdu)
	- Size: 53M parameters is relatively small compared to modern LLMs
	- Generation: Best for short-form content (stories, paragraphs)
	- No instruction tuning: This is a base language model, not instruction-tuned

	## Intended Use

	Primary use cases:
	- Educational demonstrations of transformer training
	- Resource-constrained inference on Apple Silicon
	- Base model for fine-tuning on specific domains
	- Research and experimentation with Pre-LN architectures

	Not recommended for:
	- Production applications requiring factual accuracy
	- Long-form content generation (>512 tokens)
	- Instruction following or chat applications (not instruction-tuned)

	## Ethical Considerations

	This model was trained on FineWebEdu, which contains diverse web content. Users should:
	- Be aware of potential biases in generated content
	- Validate outputs for factual accuracy
	- Not use for applications requiring high reliability
	- Consider fine-tuning on domain-specific data for production use

	## Citation

	If you use this model, please cite:

	```bibtex
	@software{nanogpt_mlx_2025,
	author = {JackSu},
	title = {NanoGPT MLX: 53M Parameter Pre-LN Transformer},
	year = {2025},
	url = {https://huggingface.co/jacksuuuu/nanogpt-mlx-53m-finewebedu}
	}
	```

	## Additional Resources

	- GitHub Repository: [JackSuuu/nanoGPT-on-MLX](https://github.com/JackSuuu/nanoGPT-on-MLX)
	- MLX Framework: [ml-explore/mlx](https://github.com/ml-explore/mlx)
	- Training Dataset: [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)

	## License

	MIT License - See repository for details.