meta-r2 / docs /configuration.md
github-actions[bot]
Deploy Space snapshot
ddbc1ba

Configuration

Source file: scripts/train_trl.py


GRPOConfig via _make_grpo_config()

train_trl.py builds all GRPOConfig instances through _make_grpo_config(**kwargs), which silently drops kwargs not present in the installed TRL version's GRPOConfig dataclass. This makes the script compatible across TRL 1.x minor versions where fields differ.

Common parameters

Parameter Single-step default Episodic default Notes
per_device_train_batch_size 4 4
gradient_accumulation_steps 4 4 Effective batch = 16
num_generations 4 4 GRPO group size; must divide batch size
max_prompt_length 2048 2048 CLI --max-prompt-length
max_completion_length 224 auto (min 512, max 1024) Episodic: min(1024, max(512, 256 Γ— horizon))
temperature 0.7 0.9 Higher episodic temp for EOS exploration
warmup_ratio 0.05 0.05
save_strategy "steps" "steps"
save_steps 5 5
save_total_limit 3 3
logging_steps 5 5
report_to "tensorboard" if available, else "none" same

Learning rate schedule

Single-step curriculum (stages 1–5): 8e-6 β†’ 5e-6 β†’ 3e-6 β†’ 2e-6 β†’ 1e-6

Episodic curriculum (stages 1–2 default): 3e-6 (stage 1), 2e-6 (stage 2), 1e-6 (stage 3+)

bf16 / fp16 flags

Set via _dtype_flags(model) which inspects the actual loaded model parameter dtypes:

  • bfloat16 model β†’ (bf16=True, fp16=False) β€” clean, no GradScaler
  • float16 model (Unsloth 4-bit) β†’ (bf16=False, fp16=False) β€” no AMP, Unsloth handles precision internally
  • No GPU β†’ (False, False)

Never hard-code these flags β€” always use _dtype_flags(model).

Unsloth-specific

config.unsloth_num_chunks = -1

Set after GRPOConfig construction. The field may not exist in non-Unsloth builds; setting it has no effect if Unsloth isn't installed.


Single-step vs episodic

Aspect Single-step (train_curriculum) Episodic (train_episodic_curriculum)
Trainer class GRPOTrainer LifeStackGRPOTrainer
Dataset function generate_dataset() generate_episodic_dataset()
Completion format Single JSON object {"actions": [...]}
Primary reward signal reward_task_success_fn (env simulation) reward_episode_return_fn (trajectory, weight 2.0)
JSON masking No Yes (LifeStackGRPOTrainer._prepare_inputs)
reward_compact_fn Not used (removed) Not used (removed in v4)

Reward weights

Stage 1 (single-step warm-up):

reward_weights = [1.0, 1.5, 1.0]
# [reward_format_fn, reward_clean_eos_fn, reward_route_target_fn]

Stages 2–5 (single-step full signal):

reward_weights = [1.0, 1.25, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.25, 0.5]
# [format, clean_eos, route_target, plausibility, task_success,
#  milestone, replan, reasoning, human_feedback, longterm]

Episodic (v4):

reward_weights = [1.0, 0.5, 0.5, 2.0]
# [episode_format, clean_eos, episode_plausibility, episode_return]

Related files

  • scripts/train_trl.py β€” all config construction
  • docs/train_trl.md β€” full training reference
  • docs/training_guide.md β€” end-to-end guide