# Qwen3 From Scratch GRPO Checkpoints
This repository contains GRPO training checkpoints for the `rasbt/qwen3-from-scratch` model from chapters 6 and 7 of *Build a Reasoning Model (From Scratch)*. The files are raw PyTorch `state_dict` checkpoints intended for use with the `reasoning_from_scratch` package.
## Available Checkpoints
- `grpo_original_no_kl`: the original GRPO algorithm without a KL term, from chapter 6
- `7_3_plus_tracking`: chapter 7 tracking variant from `7_3_plus_tracking.py`
- `7_4_plus_clip_ratio`: clipped policy-ratio variant from `7_4_plus_clip_ratio.py`
- `7_5_plus_kl`: KL-regularized variant from `7_5_plus_kl.py`
- `7_6_plus_format_reward`: format-reward variant from `7_6_plus_format_reward.py`
## Usage Example
For the chapter 6 no-KL checkpoint, download it via:
```python
from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

download_qwen3_grpo_checkpoints(
    grpo_type="no_kl",
    step="00050",
    out_dir="qwen3",
)
```
For a chapter 7 checkpoint, pass the corresponding `grpo_type` (here, the clip-ratio variant):
```python
from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

download_qwen3_grpo_checkpoints(
    grpo_type="clip_ratio",
    step="00050",
    out_dir="qwen3",
)
```
Once downloaded, you can load a checkpoint and stream text as follows:
```python
from pathlib import Path

import torch

from reasoning_from_scratch.ch02 import (
    get_device,
    generate_text_basic_stream_cache,
)
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Model,
    Qwen3Tokenizer,
    QWEN_CONFIG_06_B,
)

device = get_device()
local_dir = Path("qwen3")

# Download the base tokenizer only; the model weights come from the
# GRPO checkpoint downloaded above.
download_qwen3_small(kind="base", tokenizer_only=True, out_dir=local_dir)
tokenizer = Qwen3Tokenizer(tokenizer_file_path=local_dir / "tokenizer-base.json")

# Instantiate the 0.6B model and load the GRPO checkpoint weights.
model = Qwen3Model(QWEN_CONFIG_06_B)
state_dict = torch.load(
    local_dir / "qwen3-0.6B-rlvr-grpo-step00050.pth",
    map_location=device,
)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

prompt = "Solve: If x + 7 = 19, what is x?"
input_ids = torch.tensor(tokenizer.encode(prompt), device=device).unsqueeze(0)

# Stream generated tokens one at a time, decoding each as it arrives.
for token in generate_text_basic_stream_cache(
    model=model,
    token_ids=input_ids,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
):
    token_id = token.squeeze(0).item()
    print(tokenizer.decode([token_id]), end="", flush=True)
print()
```
## Notes
- The chapter 6 checkpoints are based on the original no-KL GRPO setup described in chapter 6.
- The chapter 7 checkpoints correspond to the improved GRPO variants described in chapter 7 and the supplementary scripts in `ch07/03_rlvr_grpo_scripts_advanced`.
- The `7_6_plus_format_reward` checkpoints are based on the reasoning model and are meant to be used with the reasoning tokenizer.
## Download Helper Reference
These are the supported `grpo_type` values for `download_qwen3_grpo_checkpoints(...)`:
- Section 6.2 no-KL:

  ```python
  download_qwen3_grpo_checkpoints(
      grpo_type="no_kl",
      step="00050",
      out_dir="qwen3",
  )
  ```

  Available chapter 6 saved steps: `00050`, `00100`, `00500`, `01000`, `01500`, `03000`, `05000`, `09000`.
- Section 7.3 tracking:

  ```python
  download_qwen3_grpo_checkpoints(
      grpo_type="tracking",
      step="00050",
      out_dir="qwen3",
  )
  ```

- Section 7.4 clip-ratio:

  ```python
  download_qwen3_grpo_checkpoints(
      grpo_type="clip_ratio",
      step="00050",
      out_dir="qwen3",
  )
  ```

- Section 7.5 KL:

  ```python
  download_qwen3_grpo_checkpoints(
      grpo_type="kl",
      step="00050",
      out_dir="qwen3",
  )
  ```

- Section 7.6 format-reward:

  ```python
  download_qwen3_grpo_checkpoints(
      grpo_type="format_reward",
      step="00050",
      out_dir="qwen3",
  )
  ```
For the chapter 7 variants, replace `step="00050"` with any other available saved step: `00100`, `00150`, `00200`, `00250`, `00300`, `00350`, `00400`, `00450`, or `00500`.