Qwen3 From Scratch GRPO Checkpoints

This repository contains GRPO training checkpoints for the rasbt/qwen3-from-scratch model from chapters 6 and 7 of Build a Reasoning Model (From Scratch).

These files are raw PyTorch state_dict checkpoints intended for use with the reasoning_from_scratch package.

 


Usage Example

For the chapter 6 no-KL checkpoint, you can download it via:

from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

download_qwen3_grpo_checkpoints(
    grpo_type="no_kl",
    step="00050",
    out_dir="qwen3",
)

For a chapter 7 checkpoint (here, the section 7.4 clip-ratio variant), the call is analogous:

from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

download_qwen3_grpo_checkpoints(
    grpo_type="clip_ratio",
    step="00050",
    out_dir="qwen3",
)

Once downloaded, you can load a checkpoint and stream text as follows:

from pathlib import Path
import torch

from reasoning_from_scratch.ch02 import (
    get_device,
    generate_text_basic_stream_cache,
)
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Model,
    Qwen3Tokenizer,
    QWEN_CONFIG_06_B,
)

device = get_device()
local_dir = Path("qwen3")

# Download only the tokenizer for the base model; the model weights
# come from the GRPO checkpoint instead
download_qwen3_small(kind="base", tokenizer_only=True, out_dir=local_dir)

tokenizer = Qwen3Tokenizer(tokenizer_file_path=local_dir / "tokenizer-base.json")

# Instantiate the 0.6B architecture and load the GRPO checkpoint weights
model = Qwen3Model(QWEN_CONFIG_06_B)
state_dict = torch.load(
    local_dir / "qwen3-0.6B-rlvr-grpo-step00050.pth",
    map_location=device,
)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

prompt = "Solve: If x + 7 = 19, what is x?"
# Encode the prompt and add a batch dimension
input_ids = torch.tensor(tokenizer.encode(prompt), device=device).unsqueeze(0)

# Stream generated tokens one at a time, decoding and printing each
for token in generate_text_basic_stream_cache(
    model=model,
    token_ids=input_ids,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
):
    token_id = token.squeeze(0).item()
    print(tokenizer.decode([token_id]), end="", flush=True)
print()

 

Notes

  • The chapter 6 checkpoints come from the original no-KL GRPO setup described in that chapter.
  • The chapter 7 checkpoints correspond to the improved GRPO variants described in chapter 7 and the supplementary scripts in ch07/03_rlvr_grpo_scripts_advanced.
  • The 7_6_plus_format_reward checkpoints are based on the reasoning model and are meant to be used with the reasoning tokenizer rather than the base tokenizer shown in the usage example above.
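A sketch of that tokenizer pairing is shown below. Note that the kind="reasoning" argument and the tokenizer-reasoning.json filename are assumptions extrapolated from the kind="base" call in the usage example above, so verify them against the reasoning_from_scratch package before relying on them:

```python
from pathlib import Path

local_dir = Path("qwen3")
# ASSUMPTION: the reasoning tokenizer file follows the tokenizer-base.json
# naming pattern from the usage example above
tokenizer_path = local_dir / "tokenizer-reasoning.json"

try:
    from reasoning_from_scratch.qwen3 import download_qwen3_small, Qwen3Tokenizer

    # ASSUMPTION: mirrors the kind="base" call in the usage example
    download_qwen3_small(kind="reasoning", tokenizer_only=True, out_dir=local_dir)
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)
except ImportError:
    print("reasoning_from_scratch not installed; see the usage example above")
```

The rest of the loading and generation code is unchanged; only the tokenizer file differs.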

 

Download Helper Reference

These are the supported grpo_type values for download_qwen3_grpo_checkpoints(...):

  • Section 6.2 no-KL:
download_qwen3_grpo_checkpoints(
    grpo_type="no_kl",
    step="00050",
    out_dir="qwen3",
)

Available chapter 6 saved steps: 00050, 00100, 00500, 01000, 01500, 03000, 05000, 09000.

  • Section 7.3 tracking:
download_qwen3_grpo_checkpoints(
    grpo_type="tracking",
    step="00050",
    out_dir="qwen3",
)
  • Section 7.4 clip-ratio:
download_qwen3_grpo_checkpoints(
    grpo_type="clip_ratio",
    step="00050",
    out_dir="qwen3",
)
  • Section 7.5 KL:
download_qwen3_grpo_checkpoints(
    grpo_type="kl",
    step="00050",
    out_dir="qwen3",
)
  • Section 7.6 format-reward:
download_qwen3_grpo_checkpoints(
    grpo_type="format_reward",
    step="00050",
    out_dir="qwen3",
)

For the chapter 7 variants, replace step="00050" with any other available saved step such as 00100, 00150, 00200, 00250, 00300, 00350, 00400, 00450, or 00500.
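Because the chapter 7 checkpoints are saved every 50 steps from 00050 through 00500, the downloads can also be scripted. The sketch below builds the zero-padded step strings and fetches every clip-ratio checkpoint; the loop itself is just one way to batch the documented calls, not part of the package:

```python
# Chapter 7 checkpoints are saved every 50 steps, from 00050 through 00500.
# The step argument is a zero-padded, five-digit string, e.g. 50 -> "00050".
steps = [f"{s:05d}" for s in range(50, 501, 50)]

try:
    from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

    for step in steps:
        download_qwen3_grpo_checkpoints(
            grpo_type="clip_ratio",
            step=step,
            out_dir="qwen3",
        )
except ImportError:
    print("reasoning_from_scratch not installed")
```

The same loop works for the other chapter 7 grpo_type values; the chapter 6 no_kl variant uses a different, irregular step schedule (see above).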
