# Reward Backend Guide

This guide explains how to:

- select a reward backend at runtime,
- understand backend capability differences,
- integrate a custom reward backend.

## 1) Runtime backend selection

Set the backend through an environment variable or a Hydra override.

### Environment variable (recommended in scripts)

```bash
export EVOLVE_REWARD_BACKEND=vlac
# or
export EVOLVE_REWARD_BACKEND=robodopamine
```

Training scripts forward this value as the Hydra override:

```text
+actor_rollout_ref.rollout.reward_backend=<backend>
```
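As a rough sketch of that hand-off, a launcher could read the variable and build the override string itself. The helper name `build_backend_override` and the fallback to `vlac` when the variable is unset are assumptions made for illustration; they are not taken from the training scripts.

```python
import os

ENV_VAR = "EVOLVE_REWARD_BACKEND"
OVERRIDE_KEY = "+actor_rollout_ref.rollout.reward_backend"


def build_backend_override(default: str = "vlac") -> str:
    """Return the Hydra override for the selected reward backend.

    Hypothetical helper: falling back to `vlac` is an assumption.
    """
    # Treat an unset or blank variable the same way.
    backend = os.environ.get(ENV_VAR, "").strip() or default
    return f"{OVERRIDE_KEY}={backend}"
```

The returned string can be appended to the command line of the training script in place of a hand-written override.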

### Direct override

```bash
python scripts/train_libero_10-sft_full-ttt.py \
  +actor_rollout_ref.rollout.reward_backend=vlac
```

## 2) Capability model

Backends are integrated with a capability contract:

- required: `progress`
- optional: `pairwise`
- optional: `done`

Current matrix:

| Backend | progress | pairwise | done |
|---|---|---|---|
| `vlac` | yes | yes | optional |
| `robodopamine` | yes | no | no |

The `robodopamine` backend requires the external Robo-Dopamine code (`GRMInference`). Either set:

```bash
export ROBODOPAMINE_PATH=/path/to/Robo-Dopamine
```

or install Robo-Dopamine as an importable package in the active environment.
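One way the path variable could be honoured is to put it on `sys.path` before the backend is imported. This is a minimal sketch under that assumption; `ensure_robodopamine_on_path` is a hypothetical helper, and the actual lookup in the repository may differ.

```python
import os
import sys


def ensure_robodopamine_on_path() -> bool:
    """Prepend ROBODOPAMINE_PATH to sys.path when it names a directory.

    Hypothetical helper. Returns True if the directory is on sys.path
    afterwards, False if the variable is unset or not a directory.
    """
    root = os.environ.get("ROBODOPAMINE_PATH", "")
    if not root or not os.path.isdir(root):
        return False
    if root not in sys.path:
        sys.path.insert(0, root)
    return True
```

Returning a flag instead of raising lets the caller fall back to a regular package import when the variable is not set.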

Fallback policy:

- if a backend does not support `pairwise`, the pairwise reward branch is disabled.
- termination remains derived from the progress threshold.
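The fallback policy can be illustrated roughly as follows. The `Capabilities` dataclass is a stand-in for `RewardBackendCapabilities` with assumed field names, both function names are hypothetical, and comparing the latest progress value with `>=` is an assumption. The threshold is left as a required argument because its value is not stated here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Capabilities:
    """Stand-in for RewardBackendCapabilities; field names are assumed."""
    progress: bool = True
    pairwise: bool = False
    done: bool = False


def use_pairwise_branch(caps: Capabilities) -> bool:
    """The pairwise reward branch runs only if the backend declares it."""
    return caps.pairwise


def is_terminated(value_list: List[float], threshold: float) -> bool:
    """Derive termination from the latest progress value (0-100 scale)."""
    return bool(value_list) and value_list[-1] >= threshold
```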

## 3) Custom backend integration

### Step 1: Implement adapter

Create a backend class under `verl/utils/reward_backends/` with:

```python
capabilities = RewardBackendCapabilities(...)

def compute_trajectory_values(...):
    ...

def pairwise_critic(...):
    ...
```

`compute_trajectory_values` must return:

- `value_list`: progress values (0-100 scale expected by current rollout path),
- `critic_list`: pairwise/incremental critic list (may be empty if unsupported).
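A toy adapter that satisfies this return contract might look like the sketch below. The `RewardBackendCapabilities` class here is a local stand-in with assumed fields, the `frames` and `task` parameters are assumptions about the signature, and `ConstantSlopeBackend` is a hypothetical name. A real backend would replace the linear ramp with model inference.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RewardBackendCapabilities:
    """Local stand-in; the real class lives in the repository."""
    progress: bool = True
    pairwise: bool = False
    done: bool = False


class ConstantSlopeBackend:
    """Toy backend: progress grows linearly with the frame index."""

    capabilities = RewardBackendCapabilities(progress=True, pairwise=False)

    def compute_trajectory_values(
        self, frames: Sequence[object], task: str
    ) -> Tuple[List[float], List[float]]:
        n = len(frames)
        if n == 0:
            return [], []
        # Progress on the 0-100 scale; the last frame maps to 100.
        value_list = [100.0 * (i + 1) / n for i in range(n)]
        # Empty because this backend does not declare pairwise support.
        critic_list: List[float] = []
        return value_list, critic_list

    def pairwise_critic(self, *args, **kwargs):
        raise NotImplementedError("pairwise is not supported by this backend")
```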

### Step 2: Register backend in factory

Edit `verl/utils/reward_backend_factory.py`:

- add capability entry to `_CAP_MAP`,
- add construction branch in `build_reward_backend_from_config(...)`.
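The two registration edits could take roughly the following shape. Only the names `_CAP_MAP` and `build_reward_backend_from_config` come from this guide; the dictionary layout, the `reward_backend` config key, and the `MyBackend` class are assumptions, so the real factory should be checked before copying this pattern.

```python
from typing import Any, Dict, Mapping

# Stand-in for the factory's capability table (layout is assumed).
_CAP_MAP: Dict[str, Dict[str, bool]] = {
    "my_backend": {"progress": True, "pairwise": False, "done": False},
}


class MyBackend:
    """Hypothetical custom backend that only stores its config."""

    def __init__(self, config: Mapping[str, Any]) -> None:
        self.config = config


def build_reward_backend_from_config(config: Mapping[str, Any]) -> Any:
    """Sketch of a construction branch keyed on the backend name."""
    name = config.get("reward_backend")
    if name not in _CAP_MAP:
        raise ValueError(f"unknown reward backend: {name!r}")
    if name == "my_backend":
        return MyBackend(config)
    raise ValueError(f"no construction branch for backend: {name!r}")
```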

### Step 3: Configure and run a smoke check

```bash
python scripts/train_libero_10-sft_full-ttt.py \
  +actor_rollout_ref.rollout.reward_backend=<your_backend>
```

Verify:

- rollout initializes,
- progress reward is non-empty,
- pairwise branch behavior matches declared capabilities.
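The last two checks can also be scripted against a backend's output. In this sketch `check_backend_output` is a hypothetical helper, and treating a non-empty `critic_list` from a backend without pairwise support as a problem is an assumption. It does not cover rollout initialization, which still has to be confirmed from the training run.

```python
from typing import List, Sequence


def check_backend_output(
    value_list: Sequence[float],
    critic_list: Sequence[float],
    supports_pairwise: bool,
) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems: List[str] = []
    if len(value_list) == 0:
        problems.append("progress reward is empty")
    if any(not (0.0 <= v <= 100.0) for v in value_list):
        problems.append("progress value outside the 0-100 scale")
    if not supports_pairwise and len(critic_list) > 0:
        problems.append("critic_list is non-empty but pairwise is not declared")
    return problems
```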

## 4) Notes

- `vlac` remains the reference backend for paper-faithful behavior.
- custom backend integrations should preserve the algorithm invariants listed in `ALGORITHM_INVARIANTS.md`.