Qwen3 From Scratch GRPO Checkpoints

This repository contains GRPO training checkpoints for the rasbt/qwen3-from-scratch model from chapters 6 and 7 of Build a Reasoning Model (From Scratch).

These files are raw PyTorch state_dict checkpoints intended for use with the reasoning_from_scratch package.

 


Usage Example

For the chapter 6 no-KL checkpoint, you can download it via:

from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

download_qwen3_grpo_checkpoints(
    grpo_type="no_kl",
    step="00050",
    out_dir="qwen3",
)

For a chapter 7 checkpoint (here, the section 7.4 clip-ratio variant), the call is analogous:

from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

download_qwen3_grpo_checkpoints(
    grpo_type="clip_ratio",
    step="00050",
    out_dir="qwen3",
)

Once downloaded, you can load a checkpoint and stream text as follows:

from pathlib import Path
import torch

from reasoning_from_scratch.ch02 import (
    get_device,
    generate_text_basic_stream_cache,
)
from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Model,
    Qwen3Tokenizer,
    QWEN_CONFIG_06_B,
)

device = get_device()
local_dir = Path("qwen3")

# Download only the tokenizer for the base model; the model weights
# come from the GRPO checkpoint instead
download_qwen3_small(kind="base", tokenizer_only=True, out_dir=local_dir)

tokenizer = Qwen3Tokenizer(tokenizer_file_path=local_dir / "tokenizer-base.json")

# Instantiate the 0.6B architecture and load the GRPO checkpoint weights
model = Qwen3Model(QWEN_CONFIG_06_B)
state_dict = torch.load(
    local_dir / "qwen3-0.6B-rlvr-grpo-step00050.pth",
    map_location=device,
)
model.load_state_dict(state_dict)
model.to(device)
model.eval()

prompt = "Solve: If x + 7 = 19, what is x?"
# Encode the prompt and add a batch dimension
input_ids = torch.tensor(tokenizer.encode(prompt), device=device).unsqueeze(0)

# Stream generated tokens one at a time, decoding and printing each
for token in generate_text_basic_stream_cache(
    model=model,
    token_ids=input_ids,
    max_new_tokens=256,
    eos_token_id=tokenizer.eos_token_id,
):
    token_id = token.squeeze(0).item()
    print(tokenizer.decode([token_id]), end="", flush=True)
print()

 

Notes

  • The chapter 6 checkpoints come from the original no-KL GRPO setup described in that chapter.
  • The chapter 7 checkpoints correspond to the improved GRPO variants described in chapter 7 and the supplementary scripts in ch07/03_rlvr_grpo_scripts_advanced.
  • The 7_6_plus_format_reward checkpoints are based on the reasoning model and are meant to be used with the reasoning tokenizer rather than the base tokenizer shown in the usage example above.
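A sketch of that tokenizer pairing is shown below. Note that the kind="reasoning" argument and the tokenizer-reasoning.json filename are assumptions extrapolated from the kind="base" call in the usage example above, so verify them against the reasoning_from_scratch package before relying on them:

```python
from pathlib import Path

local_dir = Path("qwen3")
# ASSUMPTION: the reasoning tokenizer file follows the tokenizer-base.json
# naming pattern from the usage example above
tokenizer_path = local_dir / "tokenizer-reasoning.json"

try:
    from reasoning_from_scratch.qwen3 import download_qwen3_small, Qwen3Tokenizer

    # ASSUMPTION: mirrors the kind="base" call in the usage example
    download_qwen3_small(kind="reasoning", tokenizer_only=True, out_dir=local_dir)
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)
except ImportError:
    print("reasoning_from_scratch not installed; see the usage example above")
```

The rest of the loading and generation code is unchanged; only the tokenizer file differs.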

 

Download Helper Reference

These are the supported grpo_type values for download_qwen3_grpo_checkpoints(...):

  • Section 6.2 no-KL:
download_qwen3_grpo_checkpoints(
    grpo_type="no_kl",
    step="00050",
    out_dir="qwen3",
)

Available chapter 6 saved steps: 00050, 00100, 00500, 01000, 01500, 03000, 05000, 09000.

  • Section 7.3 tracking:
download_qwen3_grpo_checkpoints(
    grpo_type="tracking",
    step="00050",
    out_dir="qwen3",
)
  • Section 7.4 clip-ratio:
download_qwen3_grpo_checkpoints(
    grpo_type="clip_ratio",
    step="00050",
    out_dir="qwen3",
)
  • Section 7.5 KL:
download_qwen3_grpo_checkpoints(
    grpo_type="kl",
    step="00050",
    out_dir="qwen3",
)
  • Section 7.6 format-reward:
download_qwen3_grpo_checkpoints(
    grpo_type="format_reward",
    step="00050",
    out_dir="qwen3",
)

For the chapter 7 variants, replace step="00050" with any other available saved step such as 00100, 00150, 00200, 00250, 00300, 00350, 00400, 00450, or 00500.
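Because the chapter 7 checkpoints are saved every 50 steps from 00050 through 00500, the downloads can also be scripted. The sketch below builds the zero-padded step strings and fetches every clip-ratio checkpoint; the loop itself is just one way to batch the documented calls, not part of the package:

```python
# Chapter 7 checkpoints are saved every 50 steps, from 00050 through 00500.
# The step argument is a zero-padded, five-digit string, e.g. 50 -> "00050".
steps = [f"{s:05d}" for s in range(50, 501, 50)]

try:
    from reasoning_from_scratch.qwen3 import download_qwen3_grpo_checkpoints

    for step in steps:
        download_qwen3_grpo_checkpoints(
            grpo_type="clip_ratio",
            step=step,
            out_dir="qwen3",
        )
except ImportError:
    print("reasoning_from_scratch not installed")
```

The same loop works for the other chapter 7 grpo_type values; the chapter 6 no_kl variant uses a different, irregular step schedule (see above).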
