nanochat-d34-rl

This is an RL-trained version of pankajmathur/nanochat-d34-finetuned, trained with GRPO (Group Relative Policy Optimization) on GSM8K math problems.

Model Description

Base Model: karpathy/nanochat-d34 (2.2B parameters)
SFT Model: pankajmathur/nanochat-d34-finetuned
Architecture: GPT-style transformer with depth=34
Training Pipeline: Pre-training → Mid-training → SFT → RL (GRPO)
Hardware: 8x NVIDIA A100-SXM4-80GB GPUs

Key Achievement: GSM8K +73.6% Relative Improvement (SFT→RL)

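RL training substantially improved GSM8K math accuracy. ARC-Easy, ARC-Challenge, MMLU, and ChatCORE stayed within about 1% of their SFT scores, while HumanEval dropped from 0.1037 to 0.0671. Changes are relative to the SFT score: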

Metric	MID	SFT	RL	Change (SFT→RL)
GSM8K	0.1137	0.1327	0.2305	+73.6%
ARC-Easy	0.6961	0.7210	0.7130	-1.1%
ARC-Challenge	0.5367	0.5418	0.5375	-0.8%
MMLU	0.4229	0.4304	0.4256	-1.1%
HumanEval	0.1098	0.1037	0.0671	-35.3%
SpellingBee	-	-	0.9922	N/A
ChatCORE	0.4045	0.4157	0.4208	+1.2%
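
The Change column is a relative change with respect to the SFT score. A minimal sketch of that arithmetic follows; the helper name `relative_change` is hypothetical, and because it works from the four-decimal scores in the table, the result can differ from a listed figure in the last digit.

```python
def relative_change(before: float, after: float) -> float:
    """Return the percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# (SFT, RL) scores copied from the table above.
scores = {
    "GSM8K": (0.1327, 0.2305),
    "ARC-Easy": (0.7210, 0.7130),
    "ARC-Challenge": (0.5418, 0.5375),
    "MMLU": (0.4304, 0.4256),
    "HumanEval": (0.1037, 0.0671),
    "ChatCORE": (0.4157, 0.4208),
}

for name, (sft, rl) in scores.items():
    print(f"{name}: {relative_change(sft, rl):+.1f}%")
```

In absolute terms the GSM8K gain is 0.2305 - 0.1327 = 0.0978, or about 9.8 percentage points.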

Training Details

RL Configuration (GRPO)

Run: d34_rl
Source: SFT checkpoint
dtype: bfloat16
device_batch_size: 4
examples_per_step: 16
num_samples: 16
max_new_tokens: 256
temperature: 1.0
top_k: 50
Learning Rates:
- unembedding_lr: 0.0040
- embedding_lr: 0.2000
- matrix_lr: 0.0200
weight_decay: 0.0
num_epochs: 1
Total Steps: 467
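
To illustrate how the group settings above interact, here is a minimal sketch of a group-relative advantage. It assumes each problem gets `num_samples` completions and that the advantage is the reward minus the group mean; this card does not state the exact formula used, and GRPO is often described with an additional division by the group standard deviation. The reward values and the function name are hypothetical, and the 7,473-problem size of the GSM8K training split is an assumption about which split was used.

```python
from statistics import mean

NUM_SAMPLES = 16        # num_samples from the config above
EXAMPLES_PER_STEP = 16  # examples_per_step from the config above
GSM8K_TRAIN_SIZE = 7473  # assumption: the GSM8K training split was used


def group_relative_advantages(rewards):
    """Center each reward on the mean reward of its own group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical rewards for one problem: 4 of 16 completions are correct.
rewards = [1.0] * 4 + [0.0] * (NUM_SAMPLES - 4)
advantages = group_relative_advantages(rewards)

# Completions generated per optimizer step under the assumption above.
rollouts_per_step = EXAMPLES_PER_STEP * NUM_SAMPLES

# Steps in one epoch if every step consumes EXAMPLES_PER_STEP problems.
steps_per_epoch = GSM8K_TRAIN_SIZE // EXAMPLES_PER_STEP

print(advantages[0], advantages[-1], rollouts_per_step, steps_per_epoch)
```

Under these assumptions, one epoch works out to 467 steps, which is consistent with the Total Steps value listed above.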

Training Metrics (Final)

Pass@1: 0.2300
Pass@2: 0.2750
Pass@3: 0.3275
Pass@4: 0.3675
Average Reward: ~0.28
Average Sequence Length: ~178 tokens
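
The card does not say how Pass@k was computed. One common definition, sketched below, counts a problem as solved if any of its first k sampled completions is correct; the function name and the outcome data are hypothetical and are not taken from the training logs.

```python
def pass_at_k(outcomes, k):
    """Fraction of problems with at least one correct completion among the first k."""
    solved = sum(1 for samples in outcomes if any(samples[:k]))
    return solved / len(outcomes)


# Hypothetical outcomes: one row per problem, one boolean per sampled completion.
outcomes = [
    [True, False, False, False],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, True, False],
]

for k in range(1, 5):
    print(f"Pass@{k}: {pass_at_k(outcomes, k):.4f}")
```

With this definition Pass@k can only stay flat or rise as k grows, which matches the trend in the metrics above.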

Repository Structure

├── tokenizer/
│   ├── tokenizer.pkl          # Tokenizer
│   └── token_bytes.pt         # Token byte mappings
├── chatrl_checkpoints/d34/    # RL checkpoint
│   ├── model_000466.pt        # Final model weights
│   └── meta_000466.json       # Training metadata
├── report/                    # Evaluation reports
│   └── report.md
└── logs/                      # Training logs
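
A minimal sketch for locating the checkpoint files in a local copy of this repository, assuming the layout shown above and a zero-padded six-digit step in the file names. The helper names are hypothetical, and loading the weights themselves is left to the nanochat code.

```python
import json
from pathlib import Path


def checkpoint_paths(repo_root, step, depth_tag="d34"):
    """Return (model_path, meta_path) for a given training step."""
    ckpt_dir = Path(repo_root) / "chatrl_checkpoints" / depth_tag
    return ckpt_dir / f"model_{step:06d}.pt", ckpt_dir / f"meta_{step:06d}.json"


def load_meta(meta_path):
    """Read the training metadata JSON stored next to the weights."""
    with open(meta_path, "r", encoding="utf-8") as f:
        return json.load(f)


# Step 466 is the final checkpoint listed in the tree above.
model_path, meta_path = checkpoint_paths(".", 466)
print(model_path)
print(meta_path)
```

If the repository has been downloaded locally, `load_meta(meta_path)` returns the metadata as a Python object; its fields are not documented in this card.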

WandB Training Run

Full Report

Related Models

Base: karpathy/nanochat-d34 - Pre-trained base model
SFT: pankajmathur/nanochat-d34-finetuned - Mid-training + SFT checkpoint

License

MIT License (same as nanochat)

Acknowledgments

Andrej Karpathy for the nanochat framework and pre-trained base model
The nanochat community

@misc{nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The best ChatGPT that $100 can buy},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}

