# Reward Backend Guide

This guide explains how to:

- select a reward backend at runtime,
- understand backend capability differences,
- integrate a custom reward backend.

## 1) Runtime backend selection

Set the backend with an environment variable or a Hydra override.

### Environment variable (recommended in scripts)

```bash
export EVOLVE_REWARD_BACKEND=vlac
# or
export EVOLVE_REWARD_BACKEND=robodopamine
```

The training scripts pass this value on as the Hydra override:

```text
+actor_rollout_ref.rollout.reward_backend=<backend>
```

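As a rough sketch of that hand-off, a launcher could read the variable and build the override argument as shown here. The helper name `reward_backend_override` is hypothetical, and omitting the override when the variable is unset is an assumption rather than the scripts' documented behavior.

```python
import os
from typing import Mapping, Optional


def reward_backend_override(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Build the Hydra override for the reward backend, or None if unset.

    Hypothetical helper: the real training scripts may do this differently.
    """
    backend = env.get("EVOLVE_REWARD_BACKEND", "").strip()
    if not backend:
        # Assumption: with no variable set, no override is appended.
        return None
    return f"+actor_rollout_ref.rollout.reward_backend={backend}"
```

With `EVOLVE_REWARD_BACKEND=vlac` in the environment, the helper returns `+actor_rollout_ref.rollout.reward_backend=vlac`, which matches the form used in the direct override.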

### Direct override

```bash
python scripts/train_libero_10-sft_full-ttt.py \
    +actor_rollout_ref.rollout.reward_backend=vlac
```

## 2) Capability model

Each backend is integrated through a capability contract:

- required: `progress`
- optional: `pairwise`
- optional: `done`

Current matrix:

| Backend | progress | pairwise | done |
|---|---|---|---|
| `vlac` | yes | yes | optional |
| `robodopamine` | yes | no | no |

The `robodopamine` backend requires the external Robo-Dopamine code (`GRMInference`). Either set:

```bash
export ROBODOPAMINE_PATH=/path/to/Robo-Dopamine
```

or install Robo-Dopamine as an importable package in the active environment.

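One way such a lookup can work, as a sketch only: put `ROBODOPAMINE_PATH` on `sys.path` when it is set, then ask the import system whether a module can be found. The function name `ensure_importable` is hypothetical, and the module name passed to it is a placeholder, since this guide does not state which module exposes `GRMInference`.

```python
import importlib
import importlib.util
import os
import sys


def ensure_importable(module_name: str, env_var: str = "ROBODOPAMINE_PATH") -> bool:
    """Return True if the top-level module `module_name` can be found.

    Hypothetical helper; the backend's real lookup logic may differ.
    """
    root = os.environ.get(env_var)
    if root and os.path.isdir(root) and root not in sys.path:
        # Make the external checkout visible to the import system.
        sys.path.insert(0, root)
        importlib.invalidate_caches()
    # find_spec returns None when a top-level module is missing.
    return importlib.util.find_spec(module_name) is not None
```

A check like this lets a launcher fail early with a clear message instead of hitting an `ImportError` in the middle of rollout setup.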

Fallback policy:

- If a backend does not support `pairwise`, the pairwise reward branch is disabled.
- Termination remains derived from the progress threshold.

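The fallback rules can be pictured with a small sketch. `RewardBackendCapabilities` is named in this guide, but its fields are not shown, so the three boolean fields below are an assumption taken from the matrix. The threshold value and the `>=` comparison are placeholders as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardBackendCapabilities:
    """Stand-in for the real class; field names are assumed from the matrix."""
    progress: bool = True
    pairwise: bool = False
    done: bool = False


def use_pairwise_branch(caps: RewardBackendCapabilities) -> bool:
    # The pairwise reward branch runs only if the backend declares support.
    return caps.pairwise


def is_terminated(progress: float, threshold: float = 95.0) -> bool:
    # Termination is derived from progress on the 0-100 scale.
    # The default threshold of 95.0 is an arbitrary placeholder.
    return progress >= threshold


without_pairwise = RewardBackendCapabilities(progress=True, pairwise=False)
with_pairwise = RewardBackendCapabilities(progress=True, pairwise=True)
```

Here `use_pairwise_branch(without_pairwise)` is `False`, so only the progress path contributes, while termination is decided from progress in both cases.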

## 3) Custom backend integration

### Step 1: Implement adapter

Create a backend class under `verl/utils/reward_backends/` with:

```python
# Declares which of progress / pairwise / done this backend supports.
capabilities = RewardBackendCapabilities(...)

# Returns value_list and critic_list for a trajectory.
def compute_trajectory_values(...):
    ...

# Backs the optional `pairwise` capability.
def pairwise_critic(...):
    ...
```

`compute_trajectory_values` must return:

- `value_list`: progress values (the current rollout path expects a 0-100 scale),
- `critic_list`: the pairwise/incremental critic list (may be empty if unsupported).

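To make the return contract concrete, here is a toy backend that only illustrates the shape of the two lists. The class name `LinearProgressBackend`, the single `frames` argument, and returning the lists as a tuple are all assumptions; the real method signature is elided in this guide.

```python
from typing import List, Sequence, Tuple


class LinearProgressBackend:
    """Toy backend: progress grows linearly with the frame index."""

    def compute_trajectory_values(
        self, frames: Sequence[object]
    ) -> Tuple[List[float], List[float]]:
        n = len(frames)
        # Progress on the 0-100 scale; the last frame maps to 100.
        value_list = [100.0 * (i + 1) / n for i in range(n)]
        # This toy backend does not support pairwise, so the list stays empty.
        critic_list: List[float] = []
        return value_list, critic_list


backend = LinearProgressBackend()
values, critics = backend.compute_trajectory_values(["f0", "f1", "f2", "f3"])
```

For four frames this yields progress values of 25, 50, 75, and 100 with an empty critic list. A real backend would replace the linear ramp with model inference.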

### Step 2: Register backend in factory

Edit `verl/utils/reward_backend_factory.py`:

- add a capability entry to `_CAP_MAP`,
- add a construction branch in `build_reward_backend_from_config(...)`.

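The two edits might look roughly like the following. The real `_CAP_MAP` and `build_reward_backend_from_config(...)` are not shown in this guide, so the dictionary layout, the `reward_backend` config key, the `Caps` stand-in, and the `MyBackend` class are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping


@dataclass(frozen=True)
class Caps:
    """Stand-in for RewardBackendCapabilities (field names assumed)."""
    progress: bool = True
    pairwise: bool = False
    done: bool = False


class MyBackend:
    """Hypothetical custom backend."""

    def __init__(self, config: Mapping[str, Any]) -> None:
        self.config = config


# Edit 1: declare what the new backend supports.
_CAP_MAP: Dict[str, Caps] = {
    "my_backend": Caps(progress=True, pairwise=False, done=False),
}


def build_reward_backend_from_config(config: Mapping[str, Any]) -> MyBackend:
    """Assumed shape only: look up capabilities, then construct by name."""
    name = config["reward_backend"]  # assumed config key
    if name not in _CAP_MAP:
        raise ValueError(f"unknown reward backend: {name!r}")
    # Edit 2: construction branch for the new backend.
    if name == "my_backend":
        backend = MyBackend(config)
        backend.capabilities = _CAP_MAP[name]
        return backend
    raise ValueError(f"no construction branch for {name!r}")
```

Keeping the capability entry and the construction branch under the same key means a backend name that is registered in only one of the two places fails loudly at build time.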

### Step 3: Configure and run smoke check

```bash
python scripts/train_libero_10-sft_full-ttt.py \
    +actor_rollout_ref.rollout.reward_backend=<your_backend>
```

Verify that:

- the rollout initializes,
- the progress reward is non-empty,
- the pairwise branch behavior matches the declared capabilities.

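Part of this checklist can be automated on a single backend output. The sketch below uses a hypothetical helper, `check_backend_output`. Treating an empty `critic_list` from a backend that declares `pairwise` as an error is an assumption inferred from the return contract, not a documented rule.

```python
from typing import Sequence


def check_backend_output(
    value_list: Sequence[float],
    critic_list: Sequence[float],
    supports_pairwise: bool,
) -> None:
    """Raise ValueError if the output contradicts the expected contract."""
    if len(value_list) == 0:
        raise ValueError("progress reward is empty")
    if any(v < 0.0 or v > 100.0 for v in value_list):
        raise ValueError("progress value outside the 0-100 scale")
    if supports_pairwise and len(critic_list) == 0:
        # Assumption: a backend that declares pairwise should emit critic values.
        raise ValueError("pairwise declared but critic_list is empty")
```

Calling it on the lists returned by `compute_trajectory_values` before a full training run gives a quick signal that the adapter and its declared capabilities agree.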

## 4) Notes

- `vlac` remains the reference backend for paper-faithful behavior.
- Custom backend integrations should preserve the algorithm invariants listed in `ALGORITHM_INVARIANTS.md`.