# Configuration

**Source file:** `scripts/train_trl.py`

---

## `GRPOConfig` via `_make_grpo_config()`

`train_trl.py` builds every `GRPOConfig` instance through `_make_grpo_config(**kwargs)`, which silently drops any kwarg that is not a field of the installed TRL version's `GRPOConfig` dataclass. This keeps the script compatible across TRL 1.x minor versions whose fields differ.

### Common parameters

| Parameter | Single-step default | Episodic default | Notes |
|-----------|---------------------|------------------|-------|
| `per_device_train_batch_size` | 4 | 4 | |
| `gradient_accumulation_steps` | 4 | 4 | Effective batch = 4 × 4 = 16 |
| `num_generations` | 4 | 4 | GRPO group size; must divide batch size |
| `max_prompt_length` | 2048 | 2048 | CLI `--max-prompt-length` |
| `max_completion_length` | 224 | auto (min 512, max 1024) | Episodic: `min(1024, max(512, 256 × horizon))` |
| `temperature` | 0.7 | 0.9 | Higher episodic temperature for EOS exploration |
| `warmup_ratio` | 0.05 | 0.05 | |
| `save_strategy` | `"steps"` | `"steps"` | |
| `save_steps` | 5 | 5 | |
| `save_total_limit` | 3 | 3 | |
| `logging_steps` | 5 | 5 | |
| `report_to` | `"tensorboard"` if available, else `"none"` | same | |

### Learning rate schedule

Single-step curriculum (stages 1–5): `8e-6 → 5e-6 → 3e-6 → 2e-6 → 1e-6`

Episodic curriculum (stages 1–2 by default): `3e-6 (stage 1), 2e-6 (stage 2), 1e-6 (stage 3+)`

### `bf16` / `fp16` flags

Both flags are set via `_dtype_flags(model)`, which inspects the dtypes of the loaded model's parameters:

- bfloat16 model → `(bf16=True, fp16=False)`: clean, no GradScaler
- float16 model (Unsloth 4-bit) → `(bf16=False, fp16=False)`: no AMP, Unsloth handles precision internally
- No GPU → `(False, False)`

Never hard-code these flags; always use `_dtype_flags(model)`.

### Unsloth-specific

`unsloth_num_chunks` is set on the config after `GRPOConfig` construction. The field may not exist in non-Unsloth builds; setting it has no effect if Unsloth isn't installed.

```python
config.unsloth_num_chunks = -1
```
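
The kwarg filtering described under `_make_grpo_config()` and the episodic `max_completion_length` formula from the parameter table can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `train_trl.py`: `StandInConfig`, `make_config`, and `episodic_max_completion_length` are hypothetical names, and the stand-in dataclass replaces `GRPOConfig` so the example runs without TRL installed. Passing a kwarg the config class does not declare is expected to be dropped rather than raise a `TypeError`.

```python
from dataclasses import dataclass, fields


@dataclass
class StandInConfig:
    """Hypothetical stand-in for GRPOConfig (assumption, not the TRL class)."""
    num_generations: int = 4
    temperature: float = 0.7


def make_config(config_cls, **kwargs):
    """Build config_cls, silently dropping kwargs it does not declare."""
    # Only init-able dataclass fields are valid constructor arguments.
    accepted = {f.name for f in fields(config_cls) if f.init}
    kept = {name: value for name, value in kwargs.items() if name in accepted}
    return config_cls(**kept)


def episodic_max_completion_length(horizon: int) -> int:
    """Apply the table's rule: 256 tokens per step, clamped to [512, 1024]."""
    return min(1024, max(512, 256 * horizon))


# The unknown kwarg is discarded; the known one overrides the default.
cfg = make_config(
    StandInConfig,
    temperature=0.9,
    field_from_a_newer_release=True,  # hypothetical field, not in StandInConfig
)
completion_budget = episodic_max_completion_length(horizon=3)
```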

---

## Single-step vs episodic

| Aspect | Single-step (`train_curriculum`) | Episodic (`train_episodic_curriculum`) |
|--------|----------------------------------|----------------------------------------|
| Trainer class | `GRPOTrainer` | `LifeStackGRPOTrainer` |
| Dataset function | `generate_dataset()` | `generate_episodic_dataset()` |
| Completion format | Single JSON object | `{"actions": [...]}` |
| Primary reward signal | `reward_task_success_fn` (env simulation) | `reward_episode_return_fn` (trajectory, weight 2.0) |
| JSON masking | No | Yes (`LifeStackGRPOTrainer._prepare_inputs`) |
| `reward_compact_fn` | Not used (removed) | Not used (removed in v4) |

---

## Reward weights

Stage 1 (single-step warm-up):

```python
reward_weights = [1.0, 1.5, 1.0]
# [reward_format_fn, reward_clean_eos_fn, reward_route_target_fn]
```

Stages 2–5 (single-step full signal):

```python
reward_weights = [1.0, 1.25, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.25, 0.5]
# [format, clean_eos, route_target, plausibility, task_success,
#  milestone, replan, reasoning, human_feedback, longterm]
```

Episodic (v4):

```python
reward_weights = [1.0, 0.5, 0.5, 2.0]
# [episode_format, clean_eos, episode_plausibility, episode_return]
```

---

## Related files

- `scripts/train_trl.py`: all config construction
- `docs/train_trl.md`: full training reference
- `docs/training_guide.md`: end-to-end guide