# Reward Model as a Service Guide
This guide explains the reward backend architecture (with VLAC as the reference service) and how it integrates with EVOLVE-VLA training.
---
## Overview
EVOLVE-VLA uses a progress-based reward as the core signal for rollout training.
In this release, VLAC is the reference backend, and other backends can be integrated via the same workflow.
### Capability Contract
- **Required**
  - `progress`: backend must provide trajectory progress estimates.
- **Optional**
  - `pairwise`: backend may provide a pairwise critic signal.
  - `done`: backend may provide a direct done prediction (otherwise derived from a progress threshold).
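
The contract above can be sketched as a small interface. This is only an illustration, not the project's actual classes: `RewardBackend`, `FakeProgressOnlyBackend`, and `resolve_done` are hypothetical names, and the fallback assumes that "derived from progress threshold" means comparing the latest progress value against a threshold.

```python
from typing import List, Optional, Sequence


class RewardBackend:
    """Hypothetical base class: `progress` is required, `done` is optional."""

    def progress(self, frames: Sequence[object]) -> List[float]:
        raise NotImplementedError("every backend must implement progress()")

    def done(self, frames: Sequence[object]) -> Optional[bool]:
        # Optional capability: None means "no direct done prediction".
        return None


def resolve_done(backend: RewardBackend, frames: Sequence[object],
                 threshold: float = 0.95) -> bool:
    """Use the backend's own done signal if present, else threshold progress."""
    direct = backend.done(frames)
    if direct is not None:
        return direct
    values = backend.progress(frames)
    # Assumption: the last progress value decides termination.
    return bool(values) and values[-1] >= threshold


class FakeProgressOnlyBackend(RewardBackend):
    """Toy backend for illustration: progress grows linearly with frame index."""

    def progress(self, frames: Sequence[object]) -> List[float]:
        n = len(frames)
        return [(i + 1) / n for i in range(n)]
```

A progress-only backend (like `robodopamine` in the table below) would then get its done signal from the threshold path, while a backend that overrides `done` bypasses it.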
Current backend status:
| Backend | progress | pairwise | done |
|---|---|---|---|
| `vlac` | yes | yes | optional |
| `robodopamine` | yes | no | no |
For backend selection and custom backend integration, see `REWARD_BACKEND_GUIDE.md`.
### What VLAC Does
1. **Progress Estimation**: Quantifies how much closer an agent has moved toward task completion
2. **Termination Detection**: Determines when a trajectory should end based on progress
3. **Dense Rewards**: Provides frame-by-frame feedback for RL optimization
### Why a Separate Service?
- **GPU Memory**: VLAC requires 20-30 GB of GPU memory, so it runs on GPUs separate from the training workers
- **Load Balancing**: Multiple service instances handle concurrent requests from distributed training
- **Flexibility**: Easy to scale independently of training infrastructure
---
## Architecture Design
### Service-Client Model
```
┌─────────────────────────────────────────────────────────────────┐
│                       RL Training Cluster                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │  ...    │
│  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴─────────────┴─────────────┘               │
│                     │ HTTP/JSON                                 │
└─────────────────────┼───────────────────────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        │      Load Balancing       │
        │  (Round-robin by worker)  │
        └─────────────┬─────────────┘
                      │
        ┌─────────────┴─────────────┐
        │                           │
  ┌─────▼─────┐               ┌─────▼─────┐
  │   VLAC    │               │   VLAC    │
  │ Service 1 │      ...      │ Service 8 │
  │   :8111   │               │   :8118   │
  │   GPU 0   │               │   GPU 7   │
  └───────────┘               └───────────┘
```
### Key Design Decisions
1. **HTTP API**: Simple, language-agnostic communication
2. **Single Process per GPU**: Each service instance owns one GPU
3. **Stateless Services**: No session management, pure request-response
4. **Automatic Load Balancing**: Workers round-robin across available services
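
As a rough sketch of how decisions 2 and 4 fit together, the snippet below expands one base URL into per-GPU service URLs on consecutive ports and assigns each worker to a service by rank. The function names and the `rank % N` policy are assumptions made for illustration; the actual client may balance differently.

```python
from urllib.parse import urlsplit, urlunsplit


def service_urls(base_url: str, num_services: int) -> list[str]:
    """Expand a base URL into one URL per service on consecutive ports."""
    parts = urlsplit(base_url)
    if parts.port is None:
        raise ValueError("base_url must include an explicit port")
    return [
        urlunsplit(parts._replace(netloc=f"{parts.hostname}:{parts.port + i}"))
        for i in range(num_services)
    ]


def url_for_worker(worker_rank: int, urls: list[str]) -> str:
    """Assumed policy: worker k always talks to service k mod N."""
    return urls[worker_rank % len(urls)]


if __name__ == "__main__":
    urls = service_urls("http://localhost:8111", 8)
    for rank in range(10):
        print(rank, url_for_worker(rank, urls))
```

Because the services are stateless, any worker can be pointed at any service; a fixed rank-based mapping simply keeps the load even without coordination.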
---
## Mini-Batching and Performance
### Internal Batching
The VLAC service automatically splits the frames of each request into mini-batches to optimize GPU utilization:
```python
# Client sends one trajectory with 100 frames (frame_0 ... frame_99 are placeholders)
response = vlac_client.compute_trajectory_values(
    frames=[frame_0, frame_1, ..., frame_99],  # 100 frames
    batch_size=10,  # Suggested batch size; the service caps each batch at 8
)
# Service internally:
# 1. Chunks the 100 frames into batches of ≤ 8 frames
# 2. Processes each batch on the GPU
# 3. Aggregates the results
# 4. Returns a single response with all 100 values
```
**Why Batch Size ≤ 8?**
- Optimal GPU memory utilization for 448×448 images
- Balances throughput and memory usage
- Prevents OOM within the 20-30 GB GPU memory budget
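
The chunking step can be sketched as follows. This is a minimal illustration under stated assumptions: `chunk` and `score_trajectory` are hypothetical helpers, `score_batch` stands in for the GPU forward pass, and the cap of 8 is taken from this guide rather than from the service code.

```python
from typing import Callable, Iterator, List, Sequence

MAX_BATCH = 8  # cap stated in this guide


def chunk(frames: Sequence, max_batch: int = MAX_BATCH) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `max_batch` frames."""
    for start in range(0, len(frames), max_batch):
        yield frames[start:start + max_batch]


def score_trajectory(frames: Sequence,
                     score_batch: Callable[[Sequence], List[float]],
                     requested_batch_size: int = 10) -> List[float]:
    """Score every frame, never sending more than MAX_BATCH frames at once."""
    batch_size = max(1, min(requested_batch_size, MAX_BATCH))
    values: List[float] = []
    for batch in chunk(frames, batch_size):
        values.extend(score_batch(batch))  # stand-in for GPU inference
    return values


if __name__ == "__main__":
    frames = list(range(100))
    print([len(b) for b in chunk(frames)])  # twelve batches of 8, then one of 4
```

With 100 frames this gives 13 batches (12 full batches plus a final batch of 4), and the caller still receives one value per frame.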
### Request Processing Pipeline
```
HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448)
                                                       ↓
                                                Batch Inference
                                                       ↓
                                              Result Aggregation
                                                       ↓
                               JSON Response ← Value Computation
```
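
The JSON and Base64 stages of this pipeline can be illustrated with the standard library alone. The payload layout here is an assumption: the field name `frames` and both helper functions are hypothetical, and the real service schema may differ. Image resizing and inference are left out.

```python
import base64
import json


def encode_request(frames: list[bytes]) -> str:
    """Client side: Base64-encode raw image bytes and wrap them in JSON."""
    payload = {
        # "frames" is a hypothetical field name, not the confirmed schema.
        "frames": [base64.b64encode(f).decode("ascii") for f in frames],
    }
    return json.dumps(payload)


def decode_request(body: str) -> list[bytes]:
    """Service side: parse JSON, then Base64-decode each frame."""
    payload = json.loads(body)
    return [base64.b64decode(s) for s in payload["frames"]]


if __name__ == "__main__":
    raw = [b"\x89PNG-fake-bytes-0", b"\x89PNG-fake-bytes-1"]
    body = encode_request(raw)
    assert decode_request(body) == raw
    print(len(body), "bytes of JSON for", len(raw), "frames")
```

Base64 inflates each frame by roughly a third, which is one reason decoding shows up as a separate cost in the latency breakdown.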
**Latency Breakdown**:
- Image decoding/resizing: ~50-100ms
- GPU inference (batch of 8): ~200-400ms
- JSON serialization: ~10-20ms
- **Total**: ~300-800ms per request
### Scaling with Multiple Services
**Single Service** (1 GPU):
- Handles ~1-3 requests/second
- Becomes a bottleneck with more than 4 parallel workers
**Multiple Services** (8 GPUs):
- Handles ~8-24 requests/second
- Supports 16-32 parallel workers
- Linear scaling with GPU count
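
The scaling figures above follow from simple arithmetic, sketched here. Both functions are hypothetical helpers; the per-service rate of 1-3 requests/second and the ratio of 2-4 workers per service are read off the numbers in this section, not measured.

```python
import math


def aggregate_throughput(num_services: int,
                         per_service_rps: tuple[float, float] = (1.0, 3.0)
                         ) -> tuple[float, float]:
    """Assume linear scaling: total rate = services x per-service rate."""
    low, high = per_service_rps
    return num_services * low, num_services * high


def services_needed(num_workers: int, workers_per_service: int = 4) -> int:
    """Services required if each one can feed `workers_per_service` workers."""
    return math.ceil(num_workers / workers_per_service)


if __name__ == "__main__":
    print(aggregate_throughput(8))   # range for 8 GPUs
    print(services_needed(32))       # optimistic ratio: 4 workers per service
    print(services_needed(16, 2))    # conservative ratio: 2 workers per service
```

Under these assumptions, 8 services cover 16 workers at the conservative ratio and 32 at the optimistic one, which matches the 16-32 range quoted above.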
---
## Key Parameters
### Service Configuration
| Parameter | Default | Description | Impact |
|-----------|---------|-------------|--------|
| `--port` | `8111` | Base port for service | Each service uses consecutive ports (8111, 8112, ...) |
| `--gpu-ids` | `"0"` | GPUs to use | One service per GPU |
| `--ckpt-path` | `checkpoints/VLAC` | Model checkpoint path | Must point to valid VLAC weights |
### Training Integration
| Parameter | Default | Description | Impact |
|-----------|---------|-------------|--------|
| `VLAC_SERVICE_URL` | `http://localhost:8111` | Base URL of VLAC service | Must match service host |
| `VLAC_SERVICE_NUM` | `8` | Number of service instances | For load balancing |
| `VLAC_DONE_THRESHOLD` | `0.95` | Completion confidence threshold | Higher = stricter termination |
| `VLAC_OFFSET_CALL` | `16` | Frames between progress checks | Higher = fewer VLAC calls |
| `VLAC_START_STEP_CALL` | `64` | When to start checking | Skip early exploration phase |
| `USE_DENSE_REWARD` | `True` | Use cumulative progress as reward | Enable for dense feedback |
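
To make the interaction between `VLAC_START_STEP_CALL`, `VLAC_OFFSET_CALL`, and `VLAC_DONE_THRESHOLD` concrete, here is a minimal sketch of a query schedule. It assumes checks begin at the start step and repeat every `offset` steps after it; the training code may align the schedule differently, and `should_query` and `is_done` are hypothetical names.

```python
def should_query(step: int, start_step: int = 64, offset: int = 16) -> bool:
    """Assumed schedule: query at start_step, start_step + offset, ..."""
    if step < start_step:
        return False  # skip the early exploration phase
    return (step - start_step) % offset == 0


def is_done(progress: float, threshold: float = 0.95) -> bool:
    """Terminate once the estimated progress reaches the threshold."""
    return progress >= threshold


if __name__ == "__main__":
    query_steps = [s for s in range(100) if should_query(s)]
    print(query_steps)  # steps at which the service would be called
```

With the defaults, a 100-step rollout triggers three progress checks (steps 64, 80, and 96); doubling the offset to 32 would cut that to two.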
### Parameter Tuning Guidelines
**For Long-Horizon Tasks** (e.g., LIBERO-Long):
```python
VLAC_DONE_THRESHOLD = 0.95 # Standard threshold
VLAC_OFFSET_CALL = 16 # Check every 16 steps
VLAC_START_STEP_CALL = 64 # Start after initial exploration
USE_PROGRESSIVE_MAX_STEP = True # Enable progressive horizon
```
**For Short, Precise Tasks**:
```python
VLAC_DONE_THRESHOLD = 0.98 # Stricter threshold
VLAC_OFFSET_CALL = 8 # More frequent checks
VLAC_START_STEP_CALL = 32 # Start earlier
```
**For Faster Training (Development)**:
```python
VLAC_OFFSET_CALL = 32 # Fewer VLAC calls
VLAC_SERVICE_NUM = 4 # Fewer services
```
---
## Usage Guide
### Starting VLAC Service
**Single Service** (for debugging):
```bash
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/vlac_service.py --port 8111 --gpu-ids 0
```
**Multiple Services** (for training):
```bash
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/launch_vlac_servers.py --base-port 8111
```
This launches 8 services on ports 8111-8118, one per GPU.
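
Before starting training, it can help to confirm that every expected port is accepting connections. The sketch below uses only a TCP connect from the standard library; it does not call any VLAC endpoint, so a successful check shows the process is listening, not that the model has finished loading. The function names, host, and port range are assumptions based on the defaults in this guide.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def listening_ports(host: str = "localhost", base_port: int = 8111,
                    count: int = 8) -> list[int]:
    """List which of the consecutive service ports accept connections."""
    return [p for p in range(base_port, base_port + count)
            if port_is_open(host, p)]


if __name__ == "__main__":
    up = listening_ports()
    print(f"{len(up)}/8 services listening: {up}")
```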