RL Length Penalty Checkpoints

Unified collection of RL length penalty experiment checkpoints for math reasoning compression. These experiments train language models to produce shorter correct solutions using a sigmoid-based relative length penalty in the reward function.

Attribution

This work adapts the method from "Training Language Models to Reason Efficiently" (arXiv:2502.04463). Code adapted from Zanette-Labs/efficient-reasoning.

Models

| Model | Base model | Alpha values |
|---|---|---|
| Qwen3-4B | Qwen/Qwen3-4B | 0.0, 0.05, 0.2, 0.4, 0.6 (round 2) |
| Qwen3-8B | Qwen/Qwen3-8B | 0.4 |
| Nemotron-Nano-8B | nvidia/Llama-3.1-Nemotron-Nano-8B-v1 | 0.0, 0.2, 0.4 |

Reward Function

For each prompt group (n=8 rollouts):

  • Correct answers: reward = accuracy × (1 - alpha × sigmoid(z)), where z = (len - mean_len) / (std_len + 1e-7) and mean_len/std_len are computed over the correct responses in the group
  • Incorrect answers: reward = 0

Higher alpha → stronger length penalty. Alpha=0.0 is the baseline (no length penalty, pure accuracy reward).
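The group-relative reward above can be sketched as a small NumPy function. This is an illustration of the formula, not the released code: the function and argument names are mine, and accuracy is taken as binary (1 for correct, 0 for incorrect).

```python
import numpy as np

def group_rewards(lengths, correct, alpha=0.4):
    """Sigmoid-based relative length penalty within one prompt group (n=8 rollouts).

    Illustrative sketch only; names are hypothetical. Correct responses are
    scored against the length statistics of the correct responses in the group.
    """
    lengths = np.asarray(lengths, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    rewards = np.zeros(len(lengths))
    if correct.any():
        corr_len = lengths[correct]
        # z-score of each response length relative to correct responses only
        z = (lengths - corr_len.mean()) / (corr_len.std() + 1e-7)
        sig = 1.0 / (1.0 + np.exp(-z))  # sigmoid(z)
        # binary accuracy = 1 for correct answers; incorrect stay at 0
        rewards[correct] = 1.0 * (1.0 - alpha * sig[correct])
    return rewards
```

With alpha=0.0 every correct answer gets reward 1.0 (the baseline); with alpha>0, shorter correct answers in a group earn strictly higher reward than longer ones.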

Training Setup

| Parameter | Value |
|---|---|
| Algorithm | RLOO (GRPO-style) |
| Framework | Verl 0.6.1 |
| LoRA | rank=32, alpha=64, all-linear targets |
| Optimizer | AdamW, LR=2e-5, constant schedule (10 warmup steps), weight_decay=0 |
| Batch | 24 prompts (6/GPU × 4 GPUs), mini-batch=12, micro-batch=1/GPU |
| Rollout | 8 samples per prompt, temperature=0.6, top-p=1.0 |
| Context | max_prompt=512, max_response=16384, max_model=16896 |
| KL | coef=0.001, type=low_var_kl |
| Clip ratio | 0.2 |
| PPO epochs | 1 |
| Gradient checkpointing | enabled |
| Precision | BF16, FSDP2 |
| Checkpoints | every 10 steps |
| Total steps | 100 (2,400 training examples from the 3,200-problem dataset) |
| vLLM | 0.10.2, sync mode, prefix caching, GPU mem=0.80 |

Dataset

3,200 math problems from daman1209arora/compression_dataset (2,400 used in 100 training steps at batch size 24).

Hardware

4× NVIDIA B200/H200 GPUs on RunPod, single node.

Repo Structure

β”œβ”€β”€ qwen3-4B/
β”‚   β”œβ”€β”€ alpha-0.0/
β”‚   β”‚   β”œβ”€β”€ config.md          # Run-specific configuration
β”‚   β”‚   β”œβ”€β”€ checkpoints/
β”‚   β”‚   β”‚   β”œβ”€β”€ global_step_10/   # LoRA adapter files (safetensors, config, etc.)
β”‚   β”‚   β”‚   β”œβ”€β”€ global_step_20/
β”‚   β”‚   β”‚   └── ...               # up to global_step_100
β”‚   β”‚   └── rollouts/
β”‚   β”‚       β”œβ”€β”€ 1.jsonl           # Rollout data per step
β”‚   β”‚       └── ...               # up to 100.jsonl
β”‚   β”œβ”€β”€ alpha-0.05/
β”‚   β”œβ”€β”€ alpha-0.2/
β”‚   β”œβ”€β”€ alpha-0.4/
β”‚   └── alpha-0.6-round2/
β”œβ”€β”€ qwen3-8B/
β”‚   └── alpha-0.4/
└── nemotron-nano-8B/
    β”œβ”€β”€ alpha-0.0/
    β”œβ”€β”€ alpha-0.2/
    └── alpha-0.4/

Each alpha directory contains:

  • config.md β€” Run-specific configuration details (WandB ID, hyperparameters)
  • checkpoints/ β€” LoRA adapter checkpoints saved every 10 steps
  • rollouts/ β€” JSONL rollout data (model generations + rewards) per training step
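Each rollouts/ file holds one training step's generations and rewards as JSON lines, so per-step statistics (e.g. mean response length, to track compression over training) can be aggregated with a few lines of Python. A minimal sketch, assuming a per-record `response_length` field; that field name is a guess, so inspect a rollout file for the actual schema before relying on it:

```python
import json

def mean_response_length(jsonl_path, length_key="response_length"):
    """Average response length across one step's rollout file.

    `length_key` is a hypothetical field name; adjust to the real schema.
    """
    total, n = 0, 0
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            total += json.loads(line)[length_key]
            n += 1
    return total / n if n else 0.0
```

Running this over 1.jsonl through 100.jsonl for each alpha directory gives a length-vs-step curve per penalty strength.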

Notes

  • Qwen3-4B alpha-0.6-round2 is a second-round run initialized from the merged alpha-0.4 step-100 checkpoint, with the KL penalty disabled. It is capped at step 80 because RL collapse was observed from step 90 onward.
  • All other runs include checkpoints up to step 100.
  • Alpha=0.0 runs serve as baselines (standard accuracy reward, no length penalty).
  • Thinking mode (enable_thinking=True) is used for all Qwen3 runs.