	# Configuration

	Source file: `scripts/train_trl.py`

	---

	## `GRPOConfig` via `_make_grpo_config()`

	`train_trl.py` builds every `GRPOConfig` instance through `_make_grpo_config(**kwargs)`, which silently drops any kwarg that is not a field of the installed TRL version's `GRPOConfig` dataclass. This keeps the script compatible across TRL 1.x minor versions whose fields differ.
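The filtering idea can be sketched as follows. This is a minimal illustration, not the code in `scripts/train_trl.py`: `FakeGRPOConfig` and `make_config` are hypothetical stand-ins so the example runs without TRL installed, and the real helper may differ in detail.

```python
from dataclasses import dataclass, fields


@dataclass
class FakeGRPOConfig:
    """Hypothetical stand-in for trl.GRPOConfig, used only for illustration."""
    learning_rate: float = 1e-6
    num_generations: int = 4
    temperature: float = 0.7


def make_config(config_cls, **kwargs):
    """Build config_cls from kwargs, dropping keys that are not init fields."""
    # Collect the field names the dataclass constructor accepts.
    supported = {f.name for f in fields(config_cls) if f.init}
    # Keep only supported kwargs; unknown ones are dropped without warning.
    accepted = {k: v for k, v in kwargs.items() if k in supported}
    return config_cls(**accepted)


# The unknown key is ignored instead of raising TypeError.
cfg = make_config(FakeGRPOConfig, temperature=0.9, not_a_real_field=123)
```

Dropping silently trades safety for portability: a misspelled parameter name is discarded just like an unsupported one, so it is worth double-checking spelling when adding new kwargs.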

	### Common parameters

	| Parameter | Single-step default | Episodic default | Notes |
	|-----------|---------------------|------------------|-------|
	| `per_device_train_batch_size` | 4 | 4 | |
	| `gradient_accumulation_steps` | 4 | 4 | Effective batch = 16 |
	| `num_generations` | 4 | 4 | GRPO group size; must divide batch size |
	| `max_prompt_length` | 2048 | 2048 | CLI `--max-prompt-length` |
	| `max_completion_length` | 224 | auto (min 512, max 1024) | Episodic: `min(1024, max(512, 256 × horizon))` |
	| `temperature` | 0.7 | 0.9 | Higher episodic temp for EOS exploration |
	| `warmup_ratio` | 0.05 | 0.05 | |
	| `save_strategy` | `"steps"` | `"steps"` | |
	| `save_steps` | 5 | 5 | |
	| `save_total_limit` | 3 | 3 | |
	| `logging_steps` | 5 | 5 | |
	| `report_to` | `"tensorboard"` if available, else `"none"` | same | |
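The two derived values in the table (effective batch and episodic `max_completion_length`) can be checked with a short sketch. The function names below are hypothetical; only the formulas and defaults come from the table.

```python
def effective_batch_size(per_device: int = 4, grad_accum: int = 4) -> int:
    """Per-device effective batch: batch size times accumulation steps."""
    return per_device * grad_accum


def episodic_max_completion_length(horizon: int) -> int:
    """Table formula: 256 tokens per step, clamped to the range [512, 1024]."""
    return min(1024, max(512, 256 * horizon))


batch = effective_batch_size()
# The GRPO group size (num_generations = 4) must divide the batch evenly.
group_divides_batch = batch % 4 == 0
lengths = [episodic_max_completion_length(h) for h in (1, 2, 3, 4, 8)]
```

With these defaults, horizons of 1 and 2 both hit the 512 floor, and any horizon of 4 or more is capped at 1024.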

	### Learning rate schedule

	Single-step curriculum (stages 1–5): `8e-6 → 5e-6 → 3e-6 → 2e-6 → 1e-6`

	Episodic curriculum (stages 1–2 default): `3e-6 (stage 1), 2e-6 (stage 2), 1e-6 (stage 3+)`
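As a sketch, the two schedules can be expressed as stage lookups. The helper names are hypothetical and the script may store these values differently; the rates themselves are the ones listed above.

```python
SINGLE_STEP_LRS = [8e-6, 5e-6, 3e-6, 2e-6, 1e-6]  # stages 1 to 5
EPISODIC_LRS = [3e-6, 2e-6, 1e-6]                 # stage 1, stage 2, stage 3+


def single_step_lr(stage: int) -> float:
    """Learning rate for a single-step curriculum stage (1-based)."""
    if not 1 <= stage <= len(SINGLE_STEP_LRS):
        raise ValueError(f"single-step stage must be 1..5, got {stage}")
    return SINGLE_STEP_LRS[stage - 1]


def episodic_lr(stage: int) -> float:
    """Learning rate for an episodic stage; stage 3 and later share 1e-6."""
    if stage < 1:
        raise ValueError(f"episodic stage must be >= 1, got {stage}")
    return EPISODIC_LRS[min(stage, len(EPISODIC_LRS)) - 1]
```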

	### `bf16` / `fp16` flags

	Set via `_dtype_flags(model)`, which inspects the dtypes of the loaded model's parameters:

	- bfloat16 model → `(bf16=True, fp16=False)`: clean, no GradScaler
	- float16 model (Unsloth 4-bit) → `(bf16=False, fp16=False)`: no AMP, Unsloth handles precision internally
	- No GPU → `(bf16=False, fp16=False)`

	Never hard-code these flags — always use `_dtype_flags(model)`.
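The decision rule above can be sketched without PyTorch by passing the dtype name and GPU availability explicitly. This is an illustration only: the real `_dtype_flags(model)` takes the model itself, `dtype_flags` is a hypothetical name, and the handling of dtypes not listed above is an assumption.

```python
from typing import Tuple


def dtype_flags(param_dtype: str, has_gpu: bool) -> Tuple[bool, bool]:
    """Return (bf16, fp16) following the three documented cases."""
    if not has_gpu:
        # No GPU: mixed precision is disabled entirely.
        return (False, False)
    if param_dtype == "bfloat16":
        # bfloat16 weights: enable bf16, which needs no GradScaler.
        return (True, False)
    # float16 (Unsloth 4-bit): AMP stays off, Unsloth handles precision.
    # Assumption: any other dtype is treated the same way in this sketch.
    return (False, False)
```

Note that `fp16` is `False` in every documented case; only `bf16` ever switches on.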

	### Unsloth-specific

	```python
	config.unsloth_num_chunks = -1
	```

	Set after `GRPOConfig` construction. The field may not exist in non-Unsloth builds; setting it has no effect if Unsloth isn't installed.

	---

	## Single-step vs episodic

	| Aspect | Single-step (`train_curriculum`) | Episodic (`train_episodic_curriculum`) |
	|--------|----------------------------------|----------------------------------------|
	| Trainer class | `GRPOTrainer` | `LifeStackGRPOTrainer` |
	| Dataset function | `generate_dataset()` | `generate_episodic_dataset()` |
	| Completion format | Single JSON object | `{"actions": [...]}` |
	| Primary reward signal | `reward_task_success_fn` (env simulation) | `reward_episode_return_fn` (trajectory, weight 2.0) |
	| JSON masking | No | Yes (`LifeStackGRPOTrainer._prepare_inputs`) |
	| `reward_compact_fn` | Not used (removed) | Not used (removed in v4) |

	---

	## Reward weights

	Stage 1 (single-step warm-up):
	```python
	reward_weights = [1.0, 1.5, 1.0]
	# [reward_format_fn, reward_clean_eos_fn, reward_route_target_fn]
	```

	Stages 2–5 (single-step full signal):
	```python
	reward_weights = [1.0, 1.25, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 0.25, 0.5]
	# [format, clean_eos, route_target, plausibility, task_success,
	# milestone, replan, reasoning, human_feedback, longterm]
	```

	Episodic (v4):
	```python
	reward_weights = [1.0, 0.5, 0.5, 2.0]
	# [episode_format, clean_eos, episode_plausibility, episode_return]
	```
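To show how the weights line up with the reward functions, here is a minimal sketch that assumes the trainer combines per-function rewards as a weighted sum, one weight per function in list order. `combine_rewards` is a hypothetical helper and the example scores are made up for illustration.

```python
# Episodic (v4) reward names paired with their weights, in list order.
EPISODIC_REWARDS = [
    ("episode_format", 1.0),
    ("clean_eos", 0.5),
    ("episode_plausibility", 0.5),
    ("episode_return", 2.0),
]


def combine_rewards(scores, weights):
    """Weighted sum of per-function scores for a single completion."""
    if len(scores) != len(weights):
        raise ValueError("need exactly one weight per reward function")
    return sum(s * w for s, w in zip(scores, weights))


weights = [w for _, w in EPISODIC_REWARDS]
# Made-up scores: good format, clean EOS, implausible, half of max return.
total = combine_rewards([1.0, 1.0, 0.0, 0.5], weights)
```

The length check matters in practice: adding or removing a reward function without updating `reward_weights` leaves the two lists misaligned.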

	---

	## Related files

	- `scripts/train_trl.py` — all config construction
	- `docs/train_trl.md` — full training reference
	- `docs/training_guide.md` — end-to-end guide