This is an RL-trained version of nanochat-d34-finetuned, fine-tuned using GRPO (Group Relative Policy Optimization) on GSM8K math problems.
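As a rough illustration of the "group relative" idea in GRPO: several completions are sampled for the same prompt, each gets a scalar reward (for GSM8K, typically 1 if the final answer is correct and 0 otherwise), and each completion's advantage is its reward measured against the group's mean. The sketch below is not taken from the nanochat training code; the function name, the `normalize_std` switch, and the `eps` value are assumptions made for illustration, and whether this model's training divided by the standard deviation is not stated here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-6):
    """Return one advantage per sampled completion of a single prompt.

    Hypothetical helper for illustration only.
    rewards: list of scalar rewards, one per completion in the group.
    normalize_std: if True, also divide by the group's standard deviation.
    """
    mu = mean(rewards)
    # Center each reward on the group mean, so no learned value baseline is needed.
    centered = [r - mu for r in rewards]
    if not normalize_std:
        return centered
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion got the same reward.
    return [c / (sigma + eps) for c in centered]


# Four sampled answers to one problem: two correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct completions end up with positive advantages and incorrect ones with negative advantages. A group in which every completion receives the same reward yields all-zero advantages and therefore contributes no learning signal.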
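RL training raised GSM8K accuracy from 0.1327 (SFT) to 0.2305, a relative gain of 73.6%. ARC-Easy, ARC-Challenge, and MMLU stayed within about 1% of their SFT scores and ChatCORE rose slightly, but HumanEval dropped from 0.1037 to 0.0671 (-35.3%):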
| Metric | MID | SFT | RL | Change (SFT→RL) |
|---|---|---|---|---|
| GSM8K | 0.1137 | 0.1327 | 0.2305 | +73.6% |
| ARC-Easy | 0.6961 | 0.7210 | 0.7130 | -1.1% |
| ARC-Challenge | 0.5367 | 0.5418 | 0.5375 | -0.8% |
| MMLU | 0.4229 | 0.4304 | 0.4256 | -1.1% |
| HumanEval | 0.1098 | 0.1037 | 0.0671 | -35.3% |
| SpellingBee | - | - | 0.9922 | N/A |
| ChatCORE | 0.4045 | 0.4157 | 0.4208 | +1.2% |
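The change column appears to be the relative change of the RL score with respect to the SFT score. A minimal sketch that recomputes it from the rounded values in the table follows; the function and variable names are illustrative. Because the inputs are already rounded to four decimals, a recomputed figure can differ from the table in the last digit.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (SFT, RL) scores copied from the table above.
scores = {
    "GSM8K": (0.1327, 0.2305),
    "ARC-Easy": (0.7210, 0.7130),
    "ARC-Challenge": (0.5418, 0.5375),
    "MMLU": (0.4304, 0.4256),
    "HumanEval": (0.1037, 0.0671),
    "ChatCORE": (0.4157, 0.4208),
}

for name, (sft, rl) in scores.items():
    print(f"{name}: {relative_change(sft, rl):+.1f}%")
```

SpellingBee is omitted because the table has no SFT score to compare against.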
├── tokenizer/
│   ├── tokenizer.pkl           # Tokenizer
│   └── token_bytes.pt          # Token byte mappings
├── chatrl_checkpoints/d34/     # RL checkpoint
│   ├── model_000466.pt         # Final model weights
│   └── meta_000466.json        # Training metadata
├── report/                     # Evaluation reports
│   └── report.md
└── logs/                       # Training logs
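After downloading the repository, a quick check that the files listed above are present can save a failed load later. This is a minimal sketch using only the standard library; `missing_files` and the `repo_dir` argument are hypothetical names, and the relative paths are taken from the layout above.

```python
from pathlib import Path

# Relative paths taken from the file layout above.
EXPECTED_FILES = [
    "tokenizer/tokenizer.pkl",
    "tokenizer/token_bytes.pt",
    "chatrl_checkpoints/d34/model_000466.pt",
    "chatrl_checkpoints/d34/meta_000466.json",
    "report/report.md",
]


def missing_files(repo_dir):
    """Return the expected relative paths that do not exist under repo_dir."""
    root = Path(repo_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# Assumes the repository was downloaded into the current directory.
for rel in missing_files("."):
    print(f"missing: {rel}")
```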
MIT License (same as nanochat)
@misc{nanochat,
author = {Andrej Karpathy},
title = {nanochat: The best ChatGPT that $100 can buy},
year = {2025},
publisher = {GitHub},
url = {https://github.com/karpathy/nanochat}
}
Base model: karpathy/nanochat-d34