# Reward Backend Guide This guide explains how to: - select reward backend at runtime, - understand backend capability differences, - integrate a custom reward backend. ## 1) Runtime backend selection Set backend via environment or Hydra override. ### Environment variable (recommended in scripts) ```bash export EVOLVE_REWARD_BACKEND=vlac # or export EVOLVE_REWARD_BACKEND=robodopamine ``` Training scripts pass this value to: ```text +actor_rollout_ref.rollout.reward_backend= ``` ### Direct override ```bash python scripts/train_libero_10-sft_full-ttt.py \ +actor_rollout_ref.rollout.reward_backend=vlac ``` ## 2) Capability model Backends are integrated with a capability contract: - required: `progress` - optional: `pairwise` - optional: `done` Current matrix: | Backend | progress | pairwise | done | |---|---|---|---| | `vlac` | yes | yes | optional | | `robodopamine` | yes | no | no | `robodopamine` requires external Robo-Dopamine code (`GRMInference`). Set: ```bash export ROBODOPAMINE_PATH=/path/to/Robo-Dopamine ``` or install Robo-Dopamine as an importable package in the active environment. Fallback policy: - if `pairwise` unsupported, pairwise reward branch is disabled. - termination remains derived from progress threshold. ## 3) Custom backend integration ### Step 1: Implement adapter Create a backend class under `verl/utils/reward_backends/` with: ```python capabilities = RewardBackendCapabilities(...) def compute_trajectory_values(...): ... def pairwise_critic(...): ... ``` `compute_trajectory_values` must return: - `value_list`: progress values (0-100 scale expected by current rollout path), - `critic_list`: pairwise/incremental critic list (may be empty if unsupported). ### Step 2: Register backend in factory Edit `verl/utils/reward_backend_factory.py`: - add capability entry to `_CAP_MAP`, - add construction branch in `build_reward_backend_from_config(...)`. ### Step 3: Configure and run smoke check ```bash python scripts/train_libero_10-sft_full-ttt.py \ +actor_rollout_ref.rollout.reward_backend= ``` Verify: - rollout initializes, - progress reward is non-empty, - pairwise branch behavior matches declared capabilities. ## 4) Notes - `vlac` remains the reference backend for paper-faithful behavior. - custom backend integration should preserve algorithm invariants listed in `ALGORITHM_INVARIANTS.md`.