
	# Reward Backend Guide

	This guide explains how to:

	- select a reward backend at runtime,
	- understand backend capability differences,
	- integrate a custom reward backend.

	## 1) Runtime backend selection

	Set the backend via an environment variable or a Hydra override.

	### Environment variable (recommended in scripts)

	```bash
	export EVOLVE_REWARD_BACKEND=vlac
	# or
	export EVOLVE_REWARD_BACKEND=robodopamine
	```

	Training scripts forward this value as the Hydra override:

	```text
	+actor_rollout_ref.rollout.reward_backend=<backend>
	```
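
The forwarding step can be pictured with a minimal sketch. The helper name `reward_backend_override` and the choice to emit no override when the variable is unset are assumptions made for illustration; the actual training scripts may handle this differently.

```python
import os
from typing import Mapping, Optional

# Override key as written in this guide.
OVERRIDE_KEY = "actor_rollout_ref.rollout.reward_backend"


def reward_backend_override(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Hydra override for the selected backend, or None if unset."""
    backend = env.get("EVOLVE_REWARD_BACKEND", "").strip()
    if not backend:
        # Assumption: with no variable set, leave the configured default alone.
        return None
    return f"+{OVERRIDE_KEY}={backend}"


print(reward_backend_override({"EVOLVE_REWARD_BACKEND": "vlac"}))
```

The returned string can be appended to the training command's argument list as-is.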

	### Direct override

	```bash
	python scripts/train_libero_10-sft_full-ttt.py \
	+actor_rollout_ref.rollout.reward_backend=vlac
	```

	## 2) Capability model

	Backends are integrated with a capability contract:

	- required: `progress`
	- optional: `pairwise`
	- optional: `done`

	Current matrix:

	| Backend | progress | pairwise | done |
	|---|---|---|---|
	| `vlac` | yes | yes | optional |
	| `robodopamine` | yes | no | no |

	`robodopamine` requires external Robo-Dopamine code (`GRMInference`). Set:

	```bash
	export ROBODOPAMINE_PATH=/path/to/Robo-Dopamine
	```

	or install Robo-Dopamine as an importable package in the active environment.
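
One way a path variable like this could be honored is sketched below. The helper `add_robodopamine_to_path` is hypothetical and only shows the general `sys.path` technique; it is not taken from the repository, and it does not import any Robo-Dopamine module.

```python
import os
import sys
from typing import Mapping


def add_robodopamine_to_path(env: Mapping[str, str] = os.environ) -> bool:
    """Prepend ROBODOPAMINE_PATH to sys.path if it names an existing directory.

    Returns True when the directory is importable from, False otherwise.
    """
    root = env.get("ROBODOPAMINE_PATH", "")
    if not root or not os.path.isdir(root):
        return False
    if root not in sys.path:
        # Prepend so the external checkout wins over same-named packages.
        sys.path.insert(0, root)
    return True
```

If the function returns False, the backend would have to rely on Robo-Dopamine being installed as a package in the active environment.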

	Fallback policy:

	- if `pairwise` is unsupported, the pairwise reward branch is disabled,
	- termination is still derived from the progress threshold.
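
The contract and fallback policy can be sketched as follows. This `RewardBackendCapabilities` is a stand-in whose field names simply mirror the contract above; the real class may be defined differently. The threshold value of 95.0 is a hypothetical placeholder, not a documented default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardBackendCapabilities:
    # Stand-in for the real class; fields mirror the contract in this guide.
    progress: bool = True
    pairwise: bool = False
    done: bool = False


def use_pairwise_branch(caps: RewardBackendCapabilities) -> bool:
    """The pairwise reward branch runs only if the backend declares support."""
    return caps.pairwise


def is_terminated(progress_value: float, threshold: float = 95.0) -> bool:
    """Termination derived from progress alone (threshold is hypothetical)."""
    return progress_value >= threshold


progress_only = RewardBackendCapabilities(progress=True, pairwise=False)
print(use_pairwise_branch(progress_only), is_terminated(97.0))
```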

	## 3) Custom backend integration

	### Step 1: Implement adapter

	Create a backend class under `verl/utils/reward_backends/` with:

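	```python
	# Declared capabilities (progress / pairwise / done).
	capabilities = RewardBackendCapabilities(...)

	def compute_trajectory_values(...):
	    # Returns value_list and critic_list, described below.
	    ...

	def pairwise_critic(...):
	    # Pairwise critic (optional capability).
	    ...
	```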

	`compute_trajectory_values` must return:

	- `value_list`: progress values (0-100 scale expected by current rollout path),
	- `critic_list`: pairwise/incremental critic list (may be empty if unsupported).
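
To make the return contract concrete, here is a toy progress-only backend. The class name, the `frames` parameter, and the linear progress rule are all invented for illustration; the real method signature is elided in this guide and is not reproduced here.

```python
from typing import List, Sequence, Tuple


class LinearProgressBackend:
    """Toy backend: progress grows linearly with the frame index."""

    def compute_trajectory_values(
        self, frames: Sequence[object]
    ) -> Tuple[List[float], List[float]]:
        n = len(frames)
        if n == 0:
            return [], []
        if n == 1:
            return [100.0], []
        # Progress on the 0-100 scale expected by the rollout path.
        value_list = [100.0 * i / (n - 1) for i in range(n)]
        # Pairwise is unsupported here, so the critic list stays empty.
        critic_list: List[float] = []
        return value_list, critic_list


values, critics = LinearProgressBackend().compute_trajectory_values(range(5))
print(values, critics)
```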

	### Step 2: Register backend in factory

	Edit `verl/utils/reward_backend_factory.py`:

	- add capability entry to `_CAP_MAP`,
	- add construction branch in `build_reward_backend_from_config(...)`.
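
The two edits might look roughly like the sketch below. The shape of `_CAP_MAP`, the `reward_backend` config key, and the `MyBackend` class are assumptions; check the actual factory module before copying anything, since its structure is not shown in this guide.

```python
from typing import Any, Dict, Mapping

# Assumed shape: backend name -> declared capabilities.
_CAP_MAP: Dict[str, Dict[str, bool]] = {
    "vlac": {"progress": True, "pairwise": True},
    "robodopamine": {"progress": True, "pairwise": False},
    "my_backend": {"progress": True, "pairwise": False},  # new entry
}


class MyBackend:
    """Hypothetical custom backend."""

    def __init__(self, config: Mapping[str, Any]) -> None:
        self.config = config


def build_reward_backend_from_config(config: Mapping[str, Any]) -> Any:
    name = config.get("reward_backend")
    if name not in _CAP_MAP:
        raise ValueError(f"unknown reward backend: {name!r}")
    if name == "my_backend":  # new construction branch
        return MyBackend(config)
    # Construction of the existing backends is omitted from this sketch.
    raise NotImplementedError(f"construction of {name!r} not shown here")
```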

	### Step 3: Configure and run smoke check

	```bash
	python scripts/train_libero_10-sft_full-ttt.py \
	+actor_rollout_ref.rollout.reward_backend=<your_backend>
	```

	Verify:

	- rollout initializes,
	- progress reward is non-empty,
	- pairwise branch behavior matches declared capabilities.
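
The last two checks can also be run offline against a backend's outputs. The helper below is a hypothetical sketch: it assumes you can obtain `value_list` and `critic_list` from one trajectory, and that a pairwise-capable backend returns a non-empty critic list.

```python
from typing import List, Sequence


def check_backend_output(
    value_list: Sequence[float],
    critic_list: Sequence[float],
    supports_pairwise: bool,
) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    if not value_list:
        problems.append("progress reward is empty")
    elif any(not 0.0 <= v <= 100.0 for v in value_list):
        problems.append("progress values fall outside the 0-100 scale")
    if supports_pairwise and not critic_list:
        problems.append("pairwise declared but critic_list is empty")
    if not supports_pairwise and critic_list:
        problems.append("pairwise not declared but critic_list is non-empty")
    return problems


print(check_backend_output([0.0, 40.0, 100.0], [], supports_pairwise=False))
```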

	## 4) Notes

	- `vlac` remains the reference backend for paper-faithful behavior.
	- custom backend integration should preserve algorithm invariants listed in `ALGORITHM_INVARIANTS.md`.