# Reward Model as a Service Guide

This guide explains the reward backend architecture, using VLAC as the reference service, and how it integrates with EVOLVE-VLA training.

---

## Overview

EVOLVE-VLA uses a progress-based reward as the core signal for rollout training.
In this release, VLAC is the reference backend, and other backends can be integrated via the same workflow.

### Capability Contract

- **Required**
  - `progress`: backend must provide trajectory progress estimates.
- **Optional**
  - `pairwise`: backend may provide pairwise critic signal.
  - `done`: backend may provide direct done prediction (otherwise derived from progress threshold).

Current backend status:

| Backend | progress | pairwise | done |
|---|---|---|---|
| `vlac` | yes | yes | optional |
| `robodopamine` | yes | no | no |
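
The `done` fallback in the contract can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `derive_done`, the assumption that progress values lie in [0, 1], and the choice to threshold the most recent estimate are all assumptions. The 0.95 default mirrors `VLAC_DONE_THRESHOLD`.

```python
from typing import Optional, Sequence


def derive_done(
    progress: Sequence[float],
    threshold: float = 0.95,
    backend_done: Optional[bool] = None,
) -> bool:
    """Resolve a done flag for a trajectory (hypothetical helper).

    If the backend supplies its own done prediction, use it as-is.
    Otherwise fall back to thresholding the latest progress estimate.
    """
    if backend_done is not None:
        return backend_done
    if not progress:
        # No progress estimate yet, so the trajectory cannot be done.
        return False
    return progress[-1] >= threshold
```

A backend without a `done` capability (such as `robodopamine` in the table above) would always take the threshold path.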

For backend selection and custom backend integration, see `REWARD_BACKEND_GUIDE.md`.

### What VLAC Does

1. **Progress Estimation**: Quantifies how much closer an agent has moved toward task completion
2. **Termination Detection**: Determines when a trajectory should end based on progress
3. **Dense Rewards**: Provides frame-by-frame feedback for RL optimization

### Why a Separate Service?

- **GPU Memory**: VLAC needs 20-30GB of GPU memory, so it runs on GPUs separate from the training workers
- **Load Balancing**: Multiple service instances handle concurrent requests from distributed training
- **Flexibility**: Easy to scale independently of training infrastructure

---

## Architecture Design

### Service-Client Model

```
┌─────────────────────────────────────────────────────────────────┐
│                      RL Training Cluster                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │   ...   │
│  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴─────────────┴─────────────┘               │
│                     │ HTTP/JSON                                 │
└─────────────────────┼───────────────────────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        │    Load Balancing         │
        │  (Round-robin by worker)  │
        └─────────────┬─────────────┘
                      │
        ┌─────────────┴─────────────┐
        │                           │
  ┌─────▼─────┐              ┌─────▼─────┐
  │   VLAC    │              │   VLAC    │
  │ Service 1 │     ...      │ Service 8 │
  │  :8111    │              │  :8118    │
  │  GPU 0    │              │  GPU 7    │
  └───────────┘              └───────────┘
```

### Key Design Decisions

1. **HTTP API**: Simple, language-agnostic communication
2. **Single Process per GPU**: Each service instance owns one GPU
3. **Stateless Services**: No session management, pure request-response
4. **Automatic Load Balancing**: Workers round-robin across available services
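
One way to realize "round-robin by worker" with consecutive ports is a static modulo mapping from worker index to service port. The sketch below assumes exactly that; the helper name `service_url_for_worker` is hypothetical, and the actual client may rotate per request instead of per worker.

```python
from urllib.parse import urlsplit


def service_url_for_worker(
    worker_id: int,
    base_url: str = "http://localhost:8111",
    num_services: int = 8,
) -> str:
    """Map a worker index to one service URL (assumed modulo scheme)."""
    if num_services < 1:
        raise ValueError("num_services must be at least 1")
    parts = urlsplit(base_url)
    if parts.port is None:
        raise ValueError("base_url must include an explicit port")
    # Services listen on consecutive ports starting at the base port.
    port = parts.port + (worker_id % num_services)
    return f"{parts.scheme}://{parts.hostname}:{port}"


for worker_id in (0, 7, 8):
    print(worker_id, service_url_for_worker(worker_id))
# 0 http://localhost:8111
# 7 http://localhost:8118
# 8 http://localhost:8111
```

Because the services are stateless, any worker can talk to any instance, so the mapping only needs to spread load evenly.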

---

## Mini-Batching and Performance

### Internal Batching

VLAC service automatically batches requests to optimize GPU utilization:

```python
# User sends trajectory with 100 frames
response = vlac_client.compute_trajectory_values(
    frames=[frame_0, frame_1, ..., frame_99],  # 100 frames
    batch_size=10  # Suggested batch size; the service still caps each GPU batch at 8 frames
)

# Service internally:
# 1. Chunks 100 frames into batches of ≤8 frames
# 2. Processes each batch on GPU
# 3. Aggregates results
# 4. Returns single response with all 100 values
```

**Why Batch Size ≤ 8?**
- Optimal GPU memory utilization for 448×448 images
- Balances throughput and memory usage
- Prevents OOM on 20-30GB GPU memory budget
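
The chunk-and-aggregate steps described above can be sketched in a few lines. This is an illustration under assumptions: `iter_batches`, `compute_values`, and the `score_batch` callback (a stand-in for the GPU forward pass) are hypothetical names, not the service's internals.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 8  # cap taken from the text above


def iter_batches(frames: Sequence[T], max_batch: int = MAX_BATCH) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `max_batch` frames."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    for start in range(0, len(frames), max_batch):
        yield frames[start:start + max_batch]


def compute_values(
    frames: Sequence[T],
    score_batch: Callable[[Sequence[T]], List[float]],
    max_batch: int = MAX_BATCH,
) -> List[float]:
    """Score every batch and concatenate results in frame order."""
    values: List[float] = []
    for batch in iter_batches(frames, max_batch):
        values.extend(score_batch(batch))
    return values


frames = list(range(100))  # placeholder frames
print(len(list(iter_batches(frames))))  # 13 batches: 12 of 8 frames, 1 of 4
```

A 100-frame trajectory therefore costs 13 GPU passes, which is worth keeping in mind when reading the per-batch latency figures.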

### Request Processing Pipeline

```
HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448)
                                                  ↓
                                            Batch Inference
                                                  ↓
                                          Result Aggregation
                                                  ↓
                            JSON Response ← Value Computation
```

**Latency Breakdown**:
- Image decoding/resizing: ~50-100ms
- GPU inference (batch of 8): ~200-400ms
- JSON serialization: ~10-20ms
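- **Total**: ~300-800ms per request (the stages above sum to ~260-520ms for a single batch of 8; requests spanning several batches take longer)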

### Scaling with Multiple Services

**Single Service** (1 GPU):
- Handles ~1-3 requests/second
- Becomes a bottleneck with more than 4 parallel workers

**Multiple Services** (8 GPUs):
- Handles ~8-24 requests/second
- Supports 16-32 parallel workers
- Linear scaling with GPU count
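
The throughput figures follow from simple arithmetic, sketched below. It assumes each service handles one request at a time and that instances do not contend with each other; both function names are hypothetical.

```python
from typing import Tuple


def rate_from_latency_ms(latency_ms: float) -> float:
    """Requests per second for one service handling requests one at a time."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return 1000.0 / latency_ms


def aggregate_throughput(
    num_services: int,
    per_service: Tuple[float, float] = (1.0, 3.0),
) -> Tuple[float, float]:
    """Scale a (low, high) per-service rate linearly with service count."""
    low, high = per_service
    return (num_services * low, num_services * high)


print(rate_from_latency_ms(800))  # 1.25 requests/second at the slow end
print(aggregate_throughput(8))    # (8.0, 24.0)
```

The ~300-800ms latency range corresponds to roughly 1-3 requests/second per service, and multiplying by 8 services gives the 8-24 requests/second quoted above.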

---

## Key Parameters

### Service Configuration

| Parameter | Default | Description | Impact |
|-----------|---------|-------------|--------|
| `--port` | `8111` | Base port for service | Each service uses consecutive ports (8111, 8112, ...) |
| `--gpu-ids` | `"0"` | GPUs to use | One service per GPU |
| `--ckpt-path` | `checkpoints/VLAC` | Model checkpoint path | Must point to valid VLAC weights |

### Training Integration

| Parameter | Default | Description | Impact |
|-----------|---------|-------------|--------|
| `VLAC_SERVICE_URL` | `http://localhost:8111` | Base URL of VLAC service | Must match the host and base port of the running service |
| `VLAC_SERVICE_NUM` | `8` | Number of service instances | Workers are load-balanced across this many instances |
| `VLAC_DONE_THRESHOLD` | `0.95` | Completion confidence threshold | Higher = stricter termination |
| `VLAC_OFFSET_CALL` | `16` | Frames between progress checks | Higher = fewer VLAC calls |
| `VLAC_START_STEP_CALL` | `64` | When to start checking | Skip early exploration phase |
| `USE_DENSE_REWARD` | `True` | Use accumulative progress as reward | Enable for dense feedback |
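
To make the interaction of `VLAC_START_STEP_CALL` and `VLAC_OFFSET_CALL` concrete, here is one plausible reading of the schedule. It assumes the first check happens exactly at the start step and every `offset` steps afterwards; the training code may count steps differently, and `should_query_reward` is a hypothetical name.

```python
def should_query_reward(step: int, start_step: int = 64, offset: int = 16) -> bool:
    """Return True on rollout steps where the reward service would be queried.

    Assumed schedule: start_step, start_step + offset, start_step + 2 * offset, ...
    """
    if offset < 1:
        raise ValueError("offset must be at least 1")
    return step >= start_step and (step - start_step) % offset == 0


query_steps = [s for s in range(100) if should_query_reward(s)]
print(query_steps)  # [64, 80, 96]
```

With the defaults, the first 64 steps of a rollout generate no reward requests at all.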

### Parameter Tuning Guidelines

**For Long-Horizon Tasks** (e.g., LIBERO-Long):
```python
VLAC_DONE_THRESHOLD = 0.95       # Standard threshold
VLAC_OFFSET_CALL = 16            # Check every 16 steps
VLAC_START_STEP_CALL = 64        # Start after initial exploration
USE_PROGRESSIVE_MAX_STEP = True  # Enable progressive horizon
```

**For Short, Precise Tasks**:
```python
VLAC_DONE_THRESHOLD = 0.98       # Stricter threshold
VLAC_OFFSET_CALL = 8             # More frequent checks
VLAC_START_STEP_CALL = 32        # Start earlier
```

**For Faster Training (Development)**:
```python
VLAC_OFFSET_CALL = 32            # Fewer VLAC calls
VLAC_SERVICE_NUM = 4             # Fewer services
```
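
A rough way to compare these presets is to count reward-service calls per episode. The sketch below assumes checks occur at the start step and every `offset` steps after it, and uses a hypothetical 512-step episode; the development preset keeps the default start step of 64 because it does not override it.

```python
PRESETS = {
    "long_horizon": {"start_step": 64, "offset": 16},
    "short_precise": {"start_step": 32, "offset": 8},
    "fast_dev": {"start_step": 64, "offset": 32},  # start_step left at default
}

EPISODE_STEPS = 512  # hypothetical episode length, for illustration only


def reward_calls_per_episode(episode_steps: int, start_step: int, offset: int) -> int:
    """Count steps in [start_step, episode_steps) spaced `offset` apart."""
    return len(range(start_step, episode_steps, offset))


for name, cfg in PRESETS.items():
    print(name, reward_calls_per_episode(EPISODE_STEPS, **cfg))
# long_horizon 28
# short_precise 60
# fast_dev 14
```

Under these assumptions, the development preset halves the reward traffic of the long-horizon preset, which is why it tolerates fewer service instances.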

---

## Usage Guide

### Starting VLAC Service

**Single Service** (for debugging):
```bash
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/vlac_service.py --port 8111 --gpu-ids 0
```

**Multiple Services** (for training):
```bash
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/launch_vlac_servers.py --base-port 8111
```

This launches 8 services on ports 8111-8118, one per GPU.
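
Before starting training, it can help to confirm that every expected port is accepting connections. The sketch below uses only a plain TCP connect from the standard library, since this guide does not describe a health endpoint; the function names are hypothetical, and an open port does not guarantee the model has finished loading.

```python
import socket
from typing import List


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_ports(
    host: str = "localhost",
    base_port: int = 8111,
    num_services: int = 8,
    timeout: float = 1.0,
) -> List[int]:
    """List ports in [base_port, base_port + num_services) that refuse connections."""
    return [
        port
        for port in range(base_port, base_port + num_services)
        if not port_is_open(host, port, timeout)
    ]
```

Calling `unreachable_ports()` with the defaults checks ports 8111-8118 on localhost; an empty list means all eight services are listening.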