# Reward Backend Guide
This guide explains how to:
- select a reward backend at runtime,
- understand the capability differences between backends,
- integrate a custom reward backend.
## 1) Runtime backend selection
Set the backend through an environment variable or a Hydra override.
### Environment variable (recommended in scripts)
```bash
export EVOLVE_REWARD_BACKEND=vlac
# or
export EVOLVE_REWARD_BACKEND=robodopamine
```
The training scripts forward this value as the Hydra override:
```text
+actor_rollout_ref.rollout.reward_backend=<backend>
```
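The mapping from the environment variable to the override can be sketched in Python as follows. This is a minimal illustration, not the launcher's actual code: the helper name `reward_backend_override`, the `vlac` default, and the validation against a fixed list are all assumptions.

```python
import os

# Backends named in this guide; a custom backend would need to be added here.
KNOWN_BACKENDS = ("vlac", "robodopamine")


def reward_backend_override(env=None, default="vlac"):
    """Build the Hydra override string from EVOLVE_REWARD_BACKEND.

    The default value and the validation step are assumptions of this sketch.
    """
    env = os.environ if env is None else env
    backend = env.get("EVOLVE_REWARD_BACKEND", default).strip()
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"unknown reward backend: {backend!r}")
    # The leading "+" appends a key that is absent from the base config.
    return f"+actor_rollout_ref.rollout.reward_backend={backend}"
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.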
### Direct override
```bash
python scripts/train_libero_10-sft_full-ttt.py \
    +actor_rollout_ref.rollout.reward_backend=vlac
```
## 2) Capability model
Backends are integrated through a capability contract:
- required: `progress`
- optional: `pairwise`
- optional: `done`
Current matrix:
| Backend | progress | pairwise | done |
|---|---|---|---|
| `vlac` | yes | yes | optional |
| `robodopamine` | yes | no | no |
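One way to picture the contract and the matrix is a small frozen dataclass. The field layout below is hypothetical (the real `RewardBackendCapabilities` may be defined differently), and `done` for `vlac` is modeled as `True` purely for illustration, since the matrix lists it as optional.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardBackendCapabilities:
    """Hypothetical field layout; the real class may differ."""

    progress: bool = True   # required by the contract
    pairwise: bool = False  # optional
    done: bool = False      # optional

    def __post_init__(self):
        # Enforce the one required capability.
        if not self.progress:
            raise ValueError("every backend must support `progress`")


# Mirrors the matrix above (vlac `done` is an assumption, see lead-in).
CAPABILITIES = {
    "vlac": RewardBackendCapabilities(progress=True, pairwise=True, done=True),
    "robodopamine": RewardBackendCapabilities(progress=True, pairwise=False, done=False),
}
```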
`robodopamine` requires external Robo-Dopamine code (`GRMInference`). Set:
```bash
export ROBODOPAMINE_PATH=/path/to/Robo-Dopamine
```
Alternatively, install Robo-Dopamine as an importable package in the active environment.
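Both options amount to making the Robo-Dopamine code importable. A pre-flight check could look like the sketch below. The helper `ensure_importable` is hypothetical, and the guide does not say which module exposes `GRMInference`, so the top-level module name is left as a parameter.

```python
import importlib.util
import os
import sys


def ensure_importable(module_name, env_var="ROBODOPAMINE_PATH", env=None):
    """Return True if the top-level `module_name` can be found.

    If the environment variable points at an existing directory, that
    directory is put on sys.path first (an assumption about how the
    path would be consumed).
    """
    env = os.environ if env is None else env
    root = env.get(env_var)
    if root and os.path.isdir(root) and root not in sys.path:
        sys.path.insert(0, root)
    # find_spec returns None when no finder can locate the module.
    return importlib.util.find_spec(module_name) is not None
```

Running such a check before training starts surfaces a missing dependency early instead of during rollout initialization.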
Fallback policy:
- if a backend does not support `pairwise`, the pairwise reward branch is disabled.
- termination is still derived from the progress threshold.
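The fallback policy can be sketched as a small pure function. The name `apply_fallback` and the threshold of `95.0` are illustrative placeholders, not values taken from the codebase.

```python
def apply_fallback(value_list, critic_list, supports_pairwise, done_threshold=95.0):
    """Return (progress, pairwise_or_None, done_flags) under the fallback policy.

    `done_threshold` is a placeholder on the 0-100 progress scale.
    """
    # The pairwise branch is disabled when the backend does not support it.
    pairwise = list(critic_list) if supports_pairwise else None
    # Termination is derived from progress crossing the threshold.
    done = [v >= done_threshold for v in value_list]
    return list(value_list), pairwise, done
```

For example, `apply_fallback([10.0, 96.0], [], supports_pairwise=False)` yields no pairwise signal and marks only the second step as done.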
## 3) Custom backend integration
### Step 1: Implement adapter
Create a backend class under `verl/utils/reward_backends/` with:
```python
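class MyRewardBackend:  # placeholder class name
    # Declare which capabilities this backend supports.
    capabilities = RewardBackendCapabilities(...)

    def compute_trajectory_values(...):
        ...

    def pairwise_critic(...):
        ...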
```
`compute_trajectory_values` must return:
- `value_list`: progress values (on the 0-100 scale that the current rollout path expects),
- `critic_list`: pairwise/incremental critic values (may be empty if the backend does not support `pairwise`).
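As an illustration of the return contract, here is a toy backend whose progress grows linearly with the frame index. The class name, the `frames` argument, and returning the two lists as a tuple are assumptions of this sketch; the real method signature is not shown in this guide.

```python
class LinearProgressBackend:
    """Toy backend: progress grows linearly with the frame index."""

    supports_pairwise = False  # stand-in for a capabilities declaration

    def compute_trajectory_values(self, frames):
        n = len(frames)
        if n == 0:
            return [], []
        # Progress values on the 0-100 scale.
        value_list = [100.0 * (i + 1) / n for i in range(n)]
        # Empty because this toy backend does not support pairwise.
        critic_list = []
        return value_list, critic_list
```

A real adapter would replace the linear ramp with a model call but keep the same output shape.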
### Step 2: Register backend in factory
Edit `verl/utils/reward_backend_factory.py`:
- add a capability entry to `_CAP_MAP`,
- add a construction branch in `build_reward_backend_from_config(...)`.
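The two registration edits might look roughly like this. The sketch only reuses the names `_CAP_MAP` and `build_reward_backend_from_config` from the factory; the map's value type, the `config` argument, the `reward_backend` key, and `MyBackend` are all hypothetical stand-ins.

```python
_CAP_MAP = {
    "vlac": {"progress": True, "pairwise": True},
    "robodopamine": {"progress": True, "pairwise": False},
    "my_backend": {"progress": True, "pairwise": False},  # new capability entry
}


class MyBackend:
    """Placeholder for the adapter written in Step 1."""

    def __init__(self, capabilities):
        self.capabilities = capabilities


def build_reward_backend_from_config(config):
    """Simplified stand-in for the real factory function."""
    name = config.get("reward_backend", "vlac")
    if name not in _CAP_MAP:
        raise ValueError(f"no capability entry for backend {name!r}")
    if name == "my_backend":  # new construction branch
        return MyBackend(_CAP_MAP[name])
    # Existing backends are constructed elsewhere; omitted in this sketch.
    raise NotImplementedError(f"construction of {name!r} is not shown here")
```

Keeping the capability entry and the construction branch in sync avoids a backend that builds successfully but advertises the wrong capabilities.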
### Step 3: Configure and run a smoke check
```bash
python scripts/train_libero_10-sft_full-ttt.py \
    +actor_rollout_ref.rollout.reward_backend=<your_backend>
```
Verify:
- the rollout initializes,
- the progress reward is non-empty,
- the pairwise branch behaves as the declared capabilities say it should.
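The last two checks can be automated against a backend's output. The helper below is a hypothetical sketch, assuming the `value_list` / `critic_list` outputs described in Step 1 and a boolean `supports_pairwise` flag taken from the declared capabilities.

```python
def check_backend_output(value_list, critic_list, supports_pairwise):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not value_list:
        problems.append("progress reward is empty")
    elif not all(0.0 <= v <= 100.0 for v in value_list):
        problems.append("progress values fall outside the 0-100 scale")
    # Pairwise output should agree with the declared capability.
    if supports_pairwise and not critic_list:
        problems.append("pairwise declared but critic_list is empty")
    if not supports_pairwise and critic_list:
        problems.append("pairwise not declared but critic_list is non-empty")
    return problems
```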
## 4) Notes
- `vlac` remains the reference backend for paper-faithful behavior.
- custom backend integrations should preserve the algorithm invariants listed in `ALGORITHM_INVARIANTS.md`.