# Reward Model as a Service Guide
This guide explains the reward backend architecture (with VLAC as the reference service) and how it integrates with EVOLVE-VLA training.
---
## Overview
EVOLVE-VLA uses a progress-based reward as the core signal for rollout training.
In this release, VLAC is the reference backend; other backends can be integrated through the same workflow.
### Capability Contract
- **Required**
  - `progress`: backend must provide trajectory progress estimates.
- **Optional**
  - `pairwise`: backend may provide a pairwise critic signal.
  - `done`: backend may provide a direct done prediction (otherwise derived from a progress threshold).
Current backend status:
| Backend | progress | pairwise | done |
|---|---|---|---|
| `vlac` | yes | yes | optional |
| `robodopamine` | yes | no | no |
For backend selection and custom backend integration, see `REWARD_BACKEND_GUIDE.md`.
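The capability contract can be pictured as a small Python interface. This is only a sketch: the class names `RewardBackend` and `ConstantBackend`, the method signatures, and the assumption that progress lies in [0, 1] are hypothetical and are not taken from the project's code. The authoritative interface is whatever `REWARD_BACKEND_GUIDE.md` describes.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence


class RewardBackend(ABC):
    """Hypothetical interface mirroring the capability contract."""

    @abstractmethod
    def progress(self, frames: Sequence[object], task: str) -> List[float]:
        """Required: one progress estimate per frame (assumed to lie in [0, 1])."""

    def pairwise(self, frame_a: object, frame_b: object, task: str) -> Optional[float]:
        """Optional: pairwise critic signal. None means 'not supported'."""
        return None

    def done(self, frames: Sequence[object], task: str, threshold: float = 0.95) -> bool:
        """Optional: by default, derive done from the latest progress value."""
        values = self.progress(frames, task)
        return bool(values) and values[-1] >= threshold


class ConstantBackend(RewardBackend):
    """Toy progress-only backend that replays fixed values (for illustration)."""

    def __init__(self, values: Sequence[float]) -> None:
        self._values = list(values)

    def progress(self, frames: Sequence[object], task: str) -> List[float]:
        return self._values[: len(frames)]
```

A backend that only implements `progress` still yields a usable `done` signal through the threshold fallback, which matches the "otherwise derived from progress threshold" rule in the contract.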
### What VLAC Does
1. **Progress Estimation**: Quantifies how much closer an agent has moved toward task completion
2. **Termination Detection**: Determines when a trajectory should end based on progress
3. **Dense Rewards**: Provides frame-by-frame feedback for RL optimization
### Why a Separate Service?
- **GPU Memory**: VLAC requires 20-30GB of GPU memory, so it runs on GPUs separate from the training workers
- **Load Balancing**: Multiple service instances handle concurrent requests from distributed training
- **Flexibility**: Easy to scale independently of training infrastructure
---
## Architecture Design
### Service-Client Model
```
┌─────────────────────────────────────────────────────────────────┐
│                       RL Training Cluster                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │  ...    │
│  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴─────────────┴─────────────┘               │
│                     │ HTTP/JSON                                 │
└─────────────────────┼───────────────────────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        │      Load Balancing       │
        │  (Round-robin by worker)  │
        └─────────────┬─────────────┘
                      │
        ┌─────────────┴─────────────┐
        │                           │
  ┌─────▼─────┐               ┌─────▼─────┐
  │   VLAC    │               │   VLAC    │
  │ Service 1 │      ...      │ Service 8 │
  │   :8111   │               │   :8118   │
  │   GPU 0   │               │   GPU 7   │
  └───────────┘               └───────────┘
```
### Key Design Decisions
1. **HTTP API**: Simple, language-agnostic communication
2. **Single Process per GPU**: Each service instance owns one GPU
3. **Stateless Services**: No session management, pure request-response
4. **Automatic Load Balancing**: Workers round-robin across available services
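As an illustration of round-robin assignment by worker, the sketch below maps a worker index to one of the consecutive service ports. The function name `service_url_for_worker` and the modulo rule are assumptions made for this example; the actual client may select services differently.

```python
def service_url_for_worker(
    worker_id: int,
    base_url: str = "http://localhost:8111",
    num_services: int = 8,
) -> str:
    """Map a worker index to a service URL on consecutive ports (hypothetical)."""
    if num_services < 1:
        raise ValueError("num_services must be at least 1")
    # Split "http://host:port" at the last colon to isolate the base port.
    host, _, port = base_url.rpartition(":")
    return f"{host}:{int(port) + worker_id % num_services}"
```

With the defaults, worker 0 maps to port 8111, worker 7 to port 8118, and worker 8 wraps around to port 8111 again. Because the services are stateless, any worker can be pointed at any instance without session affinity.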
---
## Mini-Batching and Performance
### Internal Batching
The VLAC service automatically batches requests to optimize GPU utilization:
```python
# User sends trajectory with 100 frames
response = vlac_client.compute_trajectory_values(
    frames=[frame_0, frame_1, ..., frame_99],  # 100 frames
    batch_size=10,  # Suggested batch size
)
# Service internally:
# 1. Chunks 100 frames into batches of ≤8 frames
# 2. Processes each batch on GPU
# 3. Aggregates results
# 4. Returns single response with all 100 values
```
**Why Batch Size ≤ 8?**
- Optimal GPU memory utilization for 448×448 images
- Balances throughput and memory usage
- Balances throughput and memory usage
- Prevents OOM on 20-30GB GPU memory budget
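The chunk-and-aggregate behavior described above can be sketched in a few lines. This is not the service's code: `chunk_frames`, `score_trajectory`, the `score_batch` callback, and the `MAX_BATCH` constant are hypothetical names, and the cap of 8 is taken from the text above.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 8  # cap described in this guide; the constant name is an assumption


def chunk_frames(frames: Sequence[T], requested: int = MAX_BATCH) -> Iterator[List[T]]:
    """Yield consecutive batches, never larger than MAX_BATCH frames."""
    size = max(1, min(requested, MAX_BATCH))
    for start in range(0, len(frames), size):
        yield list(frames[start:start + size])


def score_trajectory(
    frames: Sequence[T],
    score_batch: Callable[[List[T]], List[float]],
    requested: int = MAX_BATCH,
) -> List[float]:
    """Score each batch and concatenate results in the original frame order."""
    values: List[float] = []
    for batch in chunk_frames(frames, requested):
        values.extend(score_batch(batch))
    return values
```

Under these assumptions a 100-frame trajectory with a requested batch size of 10 is split into 13 batches (twelve of 8 frames and one of 4), and the caller still receives a single list of 100 values.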
### Request Processing Pipeline
```
HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448)
                                                     ↓
                                              Batch Inference
                                                     ↓
                                            Result Aggregation
                                                     ↓
                          JSON Response ← Value Computation
```
**Latency Breakdown**:
- Image decoding/resizing: ~50-100ms
- GPU inference (batch of 8): ~200-400ms
- JSON serialization: ~10-20ms
- **Total**: ~300-800ms per request
### Scaling with Multiple Services
**Single Service** (1 GPU):
- Handles ~1-3 requests/second
- Bottleneck for >4 parallel workers
**Multiple Services** (8 GPUs):
- Handles ~8-24 requests/second
- Supports 16-32 parallel workers
- Linear scaling with GPU count
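The scaling figures follow from multiplying the per-service rate by the number of services. The helper below makes that arithmetic explicit; `aggregate_throughput` is a hypothetical name, and the linear-scaling assumption comes from the list above rather than from a measurement.

```python
from typing import Tuple


def aggregate_throughput(
    num_services: int,
    per_service_rps: Tuple[float, float] = (1.0, 3.0),
) -> Tuple[float, float]:
    """Estimate the (low, high) requests/second range, assuming linear scaling."""
    low, high = per_service_rps
    return (num_services * low, num_services * high)
```

For example, `aggregate_throughput(8)` gives the 8-24 requests/second range quoted for 8 GPUs, and `aggregate_throughput(4)` gives 4-12 for a four-service development setup.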
---
## Key Parameters
### Service Configuration
| Parameter | Default | Description | Impact |
|-----------|---------|-------------|--------|
| `--port` | `8111` | Base port for service | Each service uses consecutive ports (8111, 8112, ...) |
| `--gpu-ids` | `"0"` | GPUs to use | One service per GPU |
| `--ckpt-path` | `checkpoints/VLAC` | Model checkpoint path | Must point to valid VLAC weights |
### Training Integration
| Parameter | Default | Description | Impact |
|-----------|---------|-------------|--------|
| `VLAC_SERVICE_URL` | `http://localhost:8111` | Base URL of VLAC service | Must match service host |
| `VLAC_SERVICE_NUM` | `8` | Number of service instances | For load balancing |
| `VLAC_DONE_THRESHOLD` | `0.95` | Completion confidence threshold | Higher = stricter termination |
| `VLAC_OFFSET_CALL` | `16` | Frames between progress checks | Higher = fewer VLAC calls |
| `VLAC_START_STEP_CALL` | `64` | When to start checking | Skip early exploration phase |
| `USE_DENSE_REWARD` | `True` | Use accumulative progress as reward | Enable for dense feedback |
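To see how `VLAC_START_STEP_CALL` and `VLAC_OFFSET_CALL` interact, the sketch below decides which rollout steps trigger a progress check and counts the resulting calls. The function names, the rule that the first check lands exactly on the start step, and the 512-step horizon in the example are assumptions for illustration; the training code may align checks differently.

```python
def should_query_progress(step: int, start_step_call: int = 64, offset_call: int = 16) -> bool:
    """True when a progress check is due at this rollout step (assumed schedule)."""
    return step >= start_step_call and (step - start_step_call) % offset_call == 0


def count_progress_calls(max_steps: int, start_step_call: int = 64, offset_call: int = 16) -> int:
    """Number of progress checks in one rollout of max_steps steps."""
    return sum(
        should_query_progress(step, start_step_call, offset_call)
        for step in range(max_steps)
    )


def is_done(progress: float, done_threshold: float = 0.95) -> bool:
    """Derive termination from progress, as VLAC_DONE_THRESHOLD describes."""
    return progress >= done_threshold
```

Under these assumptions a 512-step rollout with the default settings issues 28 checks, and doubling `VLAC_OFFSET_CALL` to 32 halves that to 14, which is why a larger offset reduces load on the reward service.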
### Parameter Tuning Guidelines
**For Long-Horizon Tasks** (e.g., LIBERO-Long):
```python
VLAC_DONE_THRESHOLD = 0.95 # Standard threshold
VLAC_OFFSET_CALL = 16 # Check every 16 steps
VLAC_START_STEP_CALL = 64 # Start after initial exploration
USE_PROGRESSIVE_MAX_STEP = True # Enable progressive horizon
```
**For Short, Precise Tasks**:
```python
VLAC_DONE_THRESHOLD = 0.98 # Stricter threshold
VLAC_OFFSET_CALL = 8 # More frequent checks
VLAC_START_STEP_CALL = 32 # Start earlier
```
**For Faster Training (Development)**:
```python
VLAC_OFFSET_CALL = 32 # Fewer VLAC calls
VLAC_SERVICE_NUM = 4 # Fewer services
```
---
## Usage Guide
### Starting VLAC Service
**Single Service** (for debugging):
```bash
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/vlac_service.py --port 8111 --gpu-ids 0
```
**Multiple Services** (for training):
```bash
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/launch_vlac_servers.py --base-port 8111
```
This launches 8 services on ports 8111-8118, one per GPU.
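Before starting training, it can help to confirm that every expected port is accepting connections. The sketch below uses only the standard library and checks TCP reachability, not whether the model has finished loading; `expected_ports`, `port_is_open`, and `service_status` are hypothetical helper names, not part of the repository.

```python
import socket
from typing import Dict, List


def expected_ports(base_port: int = 8111, num_services: int = 8) -> List[int]:
    """Consecutive ports, one per service instance."""
    return [base_port + i for i in range(num_services)]


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_status(host: str = "localhost", base_port: int = 8111,
                   num_services: int = 8) -> Dict[int, bool]:
    """Map each expected port to whether it currently accepts connections."""
    return {port: port_is_open(host, port) for port in expected_ports(base_port, num_services)}
```

Calling `service_status()` after the launch script has had time to load the checkpoints returns a dictionary keyed by ports 8111 through 8118; any `False` entry points at an instance that did not come up.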