Training Language Models to Reason Efficiently
Paper: arXiv:2502.04463
Unified collection of RL length penalty experiment checkpoints for math reasoning compression. These experiments train language models to produce shorter correct solutions using a sigmoid-based relative length penalty in the reward function.
This work adapts the method from "Training Language Models to Reason Efficiently" (arXiv:2502.04463). Code adapted from Zanette-Labs/efficient-reasoning.
| Model | Base | Alpha Values |
|---|---|---|
| Qwen3-4B | Qwen/Qwen3-4B | 0.0, 0.05, 0.2, 0.4, 0.6 (round 2) |
| Qwen3-8B | Qwen/Qwen3-8B | 0.4 |
| Nemotron-Nano-8B | nvidia/Llama-3.1-Nemotron-Nano-8B-v1 | 0.0, 0.2, 0.4 |
For each prompt group (n=8 rollouts):
- Correct response: `reward = accuracy × (1 - alpha × sigmoid(z))`, where `z = (len - mean_len) / (std_len + 1e-7)` is computed over the correct responses in the group
- Incorrect response: `reward = 0`

Higher alpha → stronger length penalty. Alpha=0.0 is the baseline (no length penalty, pure accuracy reward).
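A minimal sketch of this per-group reward, assuming the z-score statistics are taken over correct responses only (the function name is illustrative, not from the released code):

```python
import math

def group_rewards(lengths, correct, alpha):
    """Length-penalized rewards for one prompt group (n = 8 rollouts).

    lengths: token length of each rollout's response
    correct: whether each rollout's answer is correct
    alpha:   penalty strength (0.0 = pure accuracy reward)
    """
    correct_lens = [l for l, c in zip(lengths, correct) if c]
    if not correct_lens:
        return [0.0] * len(lengths)  # no correct response in the group
    mean_len = sum(correct_lens) / len(correct_lens)
    var = sum((l - mean_len) ** 2 for l in correct_lens) / len(correct_lens)
    std_len = math.sqrt(var)
    rewards = []
    for l, c in zip(lengths, correct):
        if not c:
            rewards.append(0.0)  # incorrect response: reward = 0
        else:
            z = (l - mean_len) / (std_len + 1e-7)
            rewards.append(1.0 - alpha / (1.0 + math.exp(-z)))  # 1 - alpha*sigmoid(z)
    return rewards
```

With alpha = 0 every correct response scores 1.0; with alpha > 0, shorter correct responses in the group earn strictly higher reward than longer ones.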
| Parameter | Value |
|---|---|
| Algorithm | RLOO (GRPO-style) |
| Framework | Verl 0.6.1 |
| LoRA | rank=32, alpha=64, all-linear targets |
| Optimizer | AdamW, LR=2e-5, constant schedule (10 warmup steps), weight_decay=0 |
| Batch | 24 prompts (6/GPU × 4 GPUs), mini-batch=12, micro-batch=1/GPU |
| Rollout | 8 samples per prompt, temperature=0.6, top-p=1.0 |
| Context | max_prompt=512, max_response=16384, max_model=16896 |
| KL | coef=0.001, type=low_var_kl |
| Clip ratio | 0.2 |
| PPO epochs | 1 |
| Gradient checkpointing | enabled |
| Precision | BF16, FSDP2 |
| Checkpoints | every 10 steps |
| Total steps | 100 (2,400 training examples from 3,200 problem dataset) |
| vLLM | 0.10.2, sync mode, prefix caching, GPU mem=0.80 |
3,200 math problems from daman1209arora/compression_dataset (2,400 used in 100 training steps at batch size 24).
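The 2,400 figure follows directly from the batch settings above:

```python
# Data budget implied by the training configuration.
prompts_per_step = 24       # 6 prompts/GPU x 4 GPUs
total_steps = 100
rollouts_per_prompt = 8

prompts_seen = prompts_per_step * total_steps      # 2,400 of the 3,200 problems
generations = prompts_seen * rollouts_per_prompt   # total sampled responses
print(prompts_seen, generations)  # 2400 19200
```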
4× NVIDIA B200/H200 GPUs on RunPod, single node.
```
├── qwen3-4B/
│   ├── alpha-0.0/
│   │   ├── config.md              # Run-specific configuration
│   │   ├── checkpoints/
│   │   │   ├── global_step_10/    # LoRA adapter files (safetensors, config, etc.)
│   │   │   ├── global_step_20/
│   │   │   └── ...                # up to global_step_100
│   │   └── rollouts/
│   │       ├── 1.jsonl            # Rollout data per step
│   │       └── ...                # up to 100.jsonl
│   ├── alpha-0.05/
│   ├── alpha-0.2/
│   ├── alpha-0.4/
│   └── alpha-0.6-round2/
├── qwen3-8B/
│   └── alpha-0.4/
└── nemotron-nano-8B/
    ├── alpha-0.0/
    ├── alpha-0.2/
    └── alpha-0.4/
```
Each alpha directory contains:
- `config.md` – Run-specific configuration details (WandB ID, hyperparameters)
- `checkpoints/` – LoRA adapter checkpoints saved every 10 steps
- `rollouts/` – JSONL rollout data (model generations + rewards) per training step

Thinking mode (`enable_thinking=True`) is used for all Qwen3 runs.
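As an illustrative sketch of working with the rollout files, the snippet below parses JSONL records and computes a per-step mean reward. The field names (`prompt`, `response`, `reward`) and the sample data are assumptions for demonstration, not the repo's confirmed schema; the checkpoint directories themselves are standard LoRA adapters loadable with `peft.PeftModel.from_pretrained` on top of the corresponding base model.

```python
import io
import json

# Hypothetical rollout records: the field names ("prompt", "response",
# "reward") are illustrative assumptions, not the confirmed JSONL schema.
fake_step_file = io.StringIO(
    '{"prompt": "2+2?", "response": "4", "reward": 1.0}\n'
    '{"prompt": "2+2?", "response": "22", "reward": 0.0}\n'
)

records = [json.loads(line) for line in fake_step_file]
mean_reward = sum(r["reward"] for r in records) / len(records)
mean_len = sum(len(r["response"]) for r in records) / len(records)
print(mean_reward, mean_len)  # 0.5 1.5
```

Tracking mean reward and mean response length across `1.jsonl` ... `100.jsonl` is the natural way to see the length penalty compressing solutions over training.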