# Reward Model as a Service Guide

This guide explains the reward backend architecture, using VLAC as the reference service, and how it integrates with EVOLVE-VLA training.

---

## Overview

EVOLVE-VLA uses a progress-based reward as the core signal for rollout training.
In this release, VLAC is the reference backend; other backends can be integrated through the same workflow.

### Capability Contract

- **Required**
  - `progress`: backend must provide trajectory progress estimates.
- **Optional**
  - `pairwise`: backend may provide pairwise critic signal.
  - `done`: backend may provide direct done prediction (otherwise derived from progress threshold).
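
The fallback for `done` can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `resolve_done` and its arguments are hypothetical, and the default of `0.95` simply mirrors the `VLAC_DONE_THRESHOLD` default listed under Key Parameters.

```python
from typing import Optional, Sequence


def resolve_done(
    progress: Sequence[float],
    backend_done: Optional[bool] = None,
    threshold: float = 0.95,
) -> bool:
    """Prefer the backend's own done flag; otherwise threshold the latest progress."""
    if backend_done is not None:
        # The backend implements the optional `done` capability.
        return backend_done
    if len(progress) == 0:
        # No progress estimate yet, so the trajectory cannot be finished.
        return False
    # Assumption: progress values are normalized to [0, 1].
    return progress[-1] >= threshold
```

A backend without `done` support, such as `robodopamine` in the table below, would always take the threshold path in this sketch.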

Current backend status:

| Backend | progress | pairwise | done |
|---|---|---|---|
| `vlac` | yes | yes | optional |
| `robodopamine` | yes | no | no |

For backend selection and custom backend integration, see `REWARD_BACKEND_GUIDE.md`.

### What VLAC Does

1. **Progress Estimation**: Quantifies how much closer an agent has moved toward task completion
2. **Termination Detection**: Determines when a trajectory should end based on progress
3. **Dense Rewards**: Provides frame-by-frame feedback for RL optimization

### Why a Separate Service?

- **GPU Memory**: VLAC requires 20-30GB of GPU memory, so it runs on GPUs separate from the training workers
- **Load Balancing**: Multiple service instances handle concurrent requests from distributed training
- **Flexibility**: Easy to scale independently of training infrastructure

---

## Architecture Design

### Service-Client Model

```
┌─────────────────────────────────────────────────────────────────┐
│                      RL Training Cluster                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │   ...   │
│  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴─────────────┴─────────────┘               │
│                     │ HTTP/JSON                                 │
└─────────────────────┼───────────────────────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        │    Load Balancing         │
        │  (Round-robin by worker)  │
        └─────────────┬─────────────┘
                      │
        ┌─────────────┴─────────────┐
        │                           │
  ┌─────▼─────┐              ┌─────▼─────┐
  │   VLAC    │              │   VLAC    │
  │ Service 1 │     ...      │ Service 8 │
  │  :8111    │              │  :8118    │
  │  GPU 0    │              │  GPU 7    │
  └───────────┘              └───────────┘
```

### Key Design Decisions

1. **HTTP API**: Simple, language-agnostic communication
2. **Single Process per GPU**: Each service instance owns one GPU
3. **Stateless Services**: No session management, pure request-response
4. **Automatic Load Balancing**: Workers round-robin across available services
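
As a rough sketch of round-robin assignment by worker, each worker could derive its service URL from its rank. The helper name `service_url_for_worker` and the modulo mapping are assumptions made for illustration; the only details taken from this guide are the base URL, the consecutive ports, and the count of 8 services.

```python
from urllib.parse import urlsplit, urlunsplit


def service_url_for_worker(
    worker_rank: int,
    base_url: str = "http://localhost:8111",
    num_services: int = 8,
) -> str:
    """Map a worker rank onto one of the consecutive service ports."""
    parts = urlsplit(base_url)
    # Assumption: service i listens on base port + i, as in the diagram above.
    port = parts.port + (worker_rank % num_services)
    netloc = f"{parts.hostname}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Worker 0 talks to :8111, worker 7 to :8118, and worker 8 wraps back to :8111.
urls = [service_url_for_worker(rank) for rank in range(9)]
```

Because the services are stateless, a fixed rank-to-service mapping like this needs no coordination between workers.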

---

## Mini-Batching and Performance

### Internal Batching

VLAC service automatically batches requests to optimize GPU utilization:

```python
# User sends a trajectory with 100 frames
response = vlac_client.compute_trajectory_values(
    frames=[frame_0, frame_1, ..., frame_99],  # 100 frames
    batch_size=10  # Suggested batch size; the service caps each GPU batch at 8 frames
)

# Service internally:
# 1. Chunks 100 frames into batches of ≤8 frames
# 2. Processes each batch on GPU
# 3. Aggregates results
# 4. Returns single response with all 100 values
```

**Why Batch Size ≤ 8?**
- Optimal GPU memory utilization for 448×448 images
- Balances throughput and memory usage
- Prevents OOM on 20-30GB GPU memory budget
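
The chunk-then-aggregate behavior described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the service code: `chunk`, `score_trajectory`, and the `score_batch` callable are hypothetical names, and the dummy scorer stands in for GPU inference.

```python
from typing import Callable, List, Sequence

MAX_BATCH = 8  # Upper bound on frames per GPU batch, as stated in this guide.


def chunk(frames: Sequence, size: int = MAX_BATCH) -> List[Sequence]:
    """Split a trajectory into consecutive batches of at most `size` frames."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def score_trajectory(frames: Sequence, score_batch: Callable[[Sequence], List[float]]) -> List[float]:
    """Score each batch separately, then concatenate into one value per frame."""
    values: List[float] = []
    for batch in chunk(frames):
        values.extend(score_batch(batch))
    return values


frames = list(range(100))  # Stand-in for 100 image frames.
batches = chunk(frames)    # 12 full batches of 8 plus one batch of 4.
values = score_trajectory(frames, lambda batch: [0.0 for _ in batch])
```

For a 100-frame trajectory this yields 13 inference passes, which is worth keeping in mind when reading the per-batch latency figures below.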

### Request Processing Pipeline

```
HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448)
                                                  ↓
                                            Batch Inference
                                                  ↓
                                          Result Aggregation
                                                  ↓
                            JSON Response ← Value Computation
```
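
To make the JSON and Base64 steps concrete, here is a small round-trip sketch using only the standard library. The payload field names (`frames`, `batch_size`) are hypothetical and are not taken from the VLAC service schema; check the service code for the real request format.

```python
import base64
import json
from typing import List


def encode_request(frame_bytes: List[bytes], batch_size: int = 8) -> str:
    """Client side: Base64-encode raw image bytes so they fit in a JSON body."""
    payload = {
        "frames": [base64.b64encode(b).decode("ascii") for b in frame_bytes],  # hypothetical field
        "batch_size": batch_size,  # hypothetical field
    }
    return json.dumps(payload)


def decode_request(body: str) -> List[bytes]:
    """Service side: parse the JSON body and recover the raw image bytes."""
    payload = json.loads(body)
    return [base64.b64decode(s) for s in payload["frames"]]


raw_frames = [b"\x89PNG-fake-0", b"\x89PNG-fake-1"]  # Placeholder bytes, not real images.
body = encode_request(raw_frames)
round_tripped = decode_request(body)
```

Base64 inflates the payload by roughly a third, which is one reason image decoding shows up as a separate item in the latency breakdown.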

**Latency Breakdown**:
- Image decoding/resizing: ~50-100ms
- GPU inference (batch of 8): ~200-400ms
- JSON serialization: ~10-20ms
- **Total**: ~300-800ms per request

### Scaling with Multiple Services

**Single Service** (1 GPU):
- Handles ~1-3 requests/second
- Becomes a bottleneck with more than 4 parallel workers

**Multiple Services** (8 GPUs):
- Handles ~8-24 requests/second
- Supports 16-32 parallel workers
- Linear scaling with GPU count
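
The multi-service figures follow from multiplying the single-service range by the number of services. A tiny sketch of that arithmetic, with the hypothetical helper `aggregate_throughput` and the assumption that scaling stays linear:

```python
from typing import Tuple


def aggregate_throughput(
    num_services: int,
    per_service_rps: Tuple[float, float] = (1.0, 3.0),
) -> Tuple[float, float]:
    """Estimate the (low, high) requests/second range, assuming linear scaling."""
    low, high = per_service_rps
    return (num_services * low, num_services * high)


eight_gpu_range = aggregate_throughput(8)  # (8.0, 24.0) requests/second
```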

---

## Key Parameters

### Service Configuration

| Parameter | Default | Description | Impact |
|-----------|---------|-------------|--------|
| `--port` | `8111` | Base port for service | Each service uses consecutive ports (8111, 8112, ...) |
| `--gpu-ids` | `"0"` | GPUs to use | One service per GPU |
| `--ckpt-path` | `checkpoints/VLAC` | Model checkpoint path | Must point to valid VLAC weights |

### Training Integration

| Parameter | Default | Description | Impact |
|-----------|---------|-------------|--------|
| `VLAC_SERVICE_URL` | `http://localhost:8111` | Base URL of VLAC service | Must match service host |
| `VLAC_SERVICE_NUM` | `8` | Number of service instances | For load balancing |
| `VLAC_DONE_THRESHOLD` | `0.95` | Completion confidence threshold | Higher = stricter termination |
| `VLAC_OFFSET_CALL` | `16` | Frames between progress checks | Higher = fewer VLAC calls |
| `VLAC_START_STEP_CALL` | `64` | When to start checking | Skip early exploration phase |
| `USE_DENSE_REWARD` | `True` | Use cumulative progress as reward | Enable for dense feedback |
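
One way to gather these settings in code is a small config object with the table's defaults. This sketch assumes the values are supplied as environment variables, which this guide does not state explicitly; `VlacConfig` and `load_config` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VlacConfig:
    service_url: str
    service_num: int
    done_threshold: float
    offset_call: int
    start_step_call: int
    use_dense_reward: bool


def load_config(env: Mapping[str, str] = os.environ) -> VlacConfig:
    """Read training-integration settings, falling back to the documented defaults."""
    return VlacConfig(
        service_url=env.get("VLAC_SERVICE_URL", "http://localhost:8111"),
        service_num=int(env.get("VLAC_SERVICE_NUM", "8")),
        done_threshold=float(env.get("VLAC_DONE_THRESHOLD", "0.95")),
        offset_call=int(env.get("VLAC_OFFSET_CALL", "16")),
        start_step_call=int(env.get("VLAC_START_STEP_CALL", "64")),
        # Assumption: truthy strings are "1", "true", or "yes" (case-insensitive).
        use_dense_reward=env.get("USE_DENSE_REWARD", "True").lower() in ("1", "true", "yes"),
    )


defaults = load_config({})
dev = load_config({"VLAC_OFFSET_CALL": "32", "VLAC_SERVICE_NUM": "4"})
```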

### Parameter Tuning Guidelines

**For Long-Horizon Tasks** (e.g., LIBERO-Long):
```python
VLAC_DONE_THRESHOLD = 0.95       # Standard threshold
VLAC_OFFSET_CALL = 16            # Check every 16 steps
VLAC_START_STEP_CALL = 64        # Start after initial exploration
USE_PROGRESSIVE_MAX_STEP = True  # Enable progressive horizon
```

**For Short, Precise Tasks**:
```python
VLAC_DONE_THRESHOLD = 0.98       # Stricter threshold
VLAC_OFFSET_CALL = 8             # More frequent checks
VLAC_START_STEP_CALL = 32        # Start earlier
```

**For Faster Training (Development)**:
```python
VLAC_OFFSET_CALL = 32            # Fewer VLAC calls
VLAC_SERVICE_NUM = 4             # Fewer services
```
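
To see how `VLAC_OFFSET_CALL` and `VLAC_START_STEP_CALL` trade off against the number of VLAC calls, the sketch below counts progress checks per episode. The scheduling rule (first check at the start step, then one every `offset` steps) and the 512-step episode length are assumptions for illustration, not the trainer's exact logic.

```python
def should_query(step: int, start_step: int = 64, offset: int = 16) -> bool:
    """Assumed rule: check at `start_step`, then every `offset` steps after it."""
    return step >= start_step and (step - start_step) % offset == 0


def count_queries(max_steps: int, start_step: int = 64, offset: int = 16) -> int:
    """Number of progress checks in an episode of `max_steps` steps."""
    return sum(should_query(s, start_step, offset) for s in range(max_steps))


EPISODE_STEPS = 512  # Hypothetical episode length.
long_horizon = count_queries(EPISODE_STEPS, start_step=64, offset=16)
short_precise = count_queries(EPISODE_STEPS, start_step=32, offset=8)
fast_dev = count_queries(EPISODE_STEPS, start_step=64, offset=32)
```

Under these assumptions the long-horizon preset issues 28 checks per episode, the short-task preset 60, and the development preset 14, so doubling the offset halves the load on the reward services.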

---

## Usage Guide

### Starting VLAC Service

**Single Service** (for debugging):
```bash
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/vlac_service.py --port 8111 --gpu-ids 0
```

**Multiple Services** (for training):
```bash
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/launch_vlac_servers.py --base-port 8111
```

This launches 8 services on ports 8111-8118, one per GPU.
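
Before starting training, it can help to confirm that every expected port is accepting connections. The following is a minimal sketch using plain TCP checks; `expected_ports` and `reachable_ports` are hypothetical helpers, and a successful connection only shows that a process is listening, not that the model has finished loading.

```python
import socket
from typing import Iterable, List


def expected_ports(base_port: int = 8111, num_services: int = 8) -> List[int]:
    """Consecutive ports, one per service, starting at the base port."""
    return list(range(base_port, base_port + num_services))


def reachable_ports(host: str, ports: Iterable[int], timeout: float = 1.0) -> List[int]:
    """Return the subset of ports that accept a TCP connection."""
    up: List[int] = []
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                up.append(port)
        except OSError:
            # Connection refused or timed out: treat the service as not ready.
            pass
    return up


if __name__ == "__main__":
    ports = expected_ports()
    missing = sorted(set(ports) - set(reachable_ports("localhost", ports)))
    print("all services reachable" if not missing else f"not reachable: {missing}")
```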