RL Length Penalty Checkpoints

Unified collection of RL length penalty experiment checkpoints for math reasoning compression. These experiments train language models to produce shorter correct solutions using a sigmoid-based relative length penalty in the reward function.

Attribution

This work adapts the method from "Training Language Models to Reason Efficiently" (arXiv:2502.04463). Code adapted from Zanette-Labs/efficient-reasoning.

Models

| Model | Base model | Alpha values |
|---|---|---|
| Qwen3-4B | Qwen/Qwen3-4B | 0.0, 0.05, 0.2, 0.4, 0.6 (round 2) |
| Qwen3-8B | Qwen/Qwen3-8B | 0.4 |
| Nemotron-Nano-8B | nvidia/Llama-3.1-Nemotron-Nano-8B-v1 | 0.0, 0.2, 0.4 |

Reward Function

For each prompt group (n=8 rollouts):

  • Correct answers: reward = accuracy × (1 - alpha × sigmoid(z)), where z = (len - mean_len) / (std_len + 1e-7) and mean_len/std_len are computed over the correct responses in the group
  • Incorrect answers: reward = 0

Higher alpha → stronger length penalty. Alpha=0.0 is the baseline (no length penalty, pure accuracy reward).
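The group-relative reward above can be sketched as a small NumPy function. This is an illustration of the formula, not the released code: the function and argument names are mine, and accuracy is taken as binary (1 for correct, 0 for incorrect).

```python
import numpy as np

def group_rewards(lengths, correct, alpha=0.4):
    """Sigmoid-based relative length penalty within one prompt group (n=8 rollouts).

    Illustrative sketch only; names are hypothetical. Correct responses are
    scored against the length statistics of the correct responses in the group.
    """
    lengths = np.asarray(lengths, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    rewards = np.zeros(len(lengths))
    if correct.any():
        corr_len = lengths[correct]
        # z-score of each response length relative to correct responses only
        z = (lengths - corr_len.mean()) / (corr_len.std() + 1e-7)
        sig = 1.0 / (1.0 + np.exp(-z))  # sigmoid(z)
        # binary accuracy = 1 for correct answers; incorrect stay at 0
        rewards[correct] = 1.0 * (1.0 - alpha * sig[correct])
    return rewards
```

With alpha=0.0 every correct answer gets reward 1.0 (the baseline); with alpha>0, shorter correct answers in a group earn strictly higher reward than longer ones.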

Training Setup

| Parameter | Value |
|---|---|
| Algorithm | RLOO (GRPO-style) |
| Framework | Verl 0.6.1 |
| LoRA | rank=32, alpha=64, all-linear targets |
| Optimizer | AdamW, LR=2e-5, constant schedule (10 warmup steps), weight_decay=0 |
| Batch | 24 prompts (6/GPU × 4 GPUs), mini-batch=12, micro-batch=1/GPU |
| Rollout | 8 samples per prompt, temperature=0.6, top-p=1.0 |
| Context | max_prompt=512, max_response=16384, max_model=16896 |
| KL | coef=0.001, type=low_var_kl |
| Clip ratio | 0.2 |
| PPO epochs | 1 |
| Gradient checkpointing | enabled |
| Precision | BF16, FSDP2 |
| Checkpoints | every 10 steps |
| Total steps | 100 (2,400 training examples from the 3,200-problem dataset) |
| vLLM | 0.10.2, sync mode, prefix caching, GPU mem=0.80 |

Dataset

3,200 math problems from daman1209arora/compression_dataset (2,400 used in 100 training steps at batch size 24).

Hardware

4× NVIDIA B200/H200 GPUs on RunPod, single node.

Repo Structure

β”œβ”€β”€ qwen3-4B/
β”‚   β”œβ”€β”€ alpha-0.0/
β”‚   β”‚   β”œβ”€β”€ config.md          # Run-specific configuration
β”‚   β”‚   β”œβ”€β”€ checkpoints/
β”‚   β”‚   β”‚   β”œβ”€β”€ global_step_10/   # LoRA adapter files (safetensors, config, etc.)
β”‚   β”‚   β”‚   β”œβ”€β”€ global_step_20/
β”‚   β”‚   β”‚   └── ...               # up to global_step_100
β”‚   β”‚   └── rollouts/
β”‚   β”‚       β”œβ”€β”€ 1.jsonl           # Rollout data per step
β”‚   β”‚       └── ...               # up to 100.jsonl
β”‚   β”œβ”€β”€ alpha-0.05/
β”‚   β”œβ”€β”€ alpha-0.2/
β”‚   β”œβ”€β”€ alpha-0.4/
β”‚   └── alpha-0.6-round2/
β”œβ”€β”€ qwen3-8B/
β”‚   └── alpha-0.4/
└── nemotron-nano-8B/
    β”œβ”€β”€ alpha-0.0/
    β”œβ”€β”€ alpha-0.2/
    └── alpha-0.4/

Each alpha directory contains:

  • config.md β€” Run-specific configuration details (WandB ID, hyperparameters)
  • checkpoints/ β€” LoRA adapter checkpoints saved every 10 steps
  • rollouts/ β€” JSONL rollout data (model generations + rewards) per training step
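Each rollouts/ file holds one training step's generations and rewards as JSON lines, so per-step statistics (e.g. mean response length, to track compression over training) can be aggregated with a few lines of Python. A minimal sketch, assuming a per-record `response_length` field; that field name is a guess, so inspect a rollout file for the actual schema before relying on it:

```python
import json

def mean_response_length(jsonl_path, length_key="response_length"):
    """Average response length across one step's rollout file.

    `length_key` is a hypothetical field name; adjust to the real schema.
    """
    total, n = 0, 0
    with open(jsonl_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            total += json.loads(line)[length_key]
            n += 1
    return total / n if n else 0.0
```

Running this over 1.jsonl through 100.jsonl for each alpha directory gives a length-vs-step curve per penalty strength.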

Notes

  • Qwen3-4B alpha-0.6-round2 is a second-round run initialized from the merged alpha-0.4 step-100 checkpoint, with the KL penalty disabled. It is capped at step 80 because RL collapse was observed from step 90 onward.
  • All other runs include checkpoints up to step 100.
  • Alpha=0.0 runs serve as baselines (standard accuracy reward, no length penalty).
  • Thinking mode (enable_thinking=True) is used for all Qwen3 runs.