Reward Model as a Service Guide
This guide explains the reward backend architecture (with VLAC as the reference service) and how it integrates with EVOLVE-VLA training.
Overview
EVOLVE-VLA uses a progress-based reward as the core signal for rollout training. In this release, VLAC is the reference backend; other backends can be integrated through the same workflow.
Capability Contract
- Required
  - progress: backend must provide trajectory progress estimates.
- Optional
  - pairwise: backend may provide pairwise critic signal.
  - done: backend may provide direct done prediction (otherwise derived from progress threshold).
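The fallback for `done` can be sketched as follows. This is a minimal illustration rather than the project's implementation: `resolve_done` and its arguments are hypothetical names, progress is assumed to lie in [0, 1], and the 0.95 default mirrors the `VLAC_DONE_THRESHOLD` default listed under Training Integration.

```python
from typing import Optional, Sequence


def resolve_done(
    progress: Sequence[float],
    backend_done: Optional[bool] = None,
    threshold: float = 0.95,
) -> bool:
    """Return the termination flag for a trajectory.

    Hypothetical helper: prefers a backend's direct `done` prediction when
    one is supplied, otherwise derives it from the latest progress estimate.
    """
    if backend_done is not None:
        return backend_done
    if not progress:
        return False  # no estimates yet, so the trajectory cannot be done
    return progress[-1] >= threshold


print(resolve_done([0.2, 0.6, 0.97]))                      # derived from progress
print(resolve_done([0.2, 0.6, 0.97], backend_done=False))  # backend prediction wins
```

A backend that only implements `progress` would always take the threshold path, which is why `done` can stay optional in the contract.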
Current backend status:
| Backend | progress | pairwise | done |
|---|---|---|---|
| vlac | yes | yes | optional |
| robodopamine | yes | no | no |
For backend selection and custom backend integration, see REWARD_BACKEND_GUIDE.md.
What VLAC Does
- Progress Estimation: Quantifies how much closer an agent has moved toward task completion
- Termination Detection: Determines when a trajectory should end based on progress
- Dense Rewards: Provides frame-by-frame feedback for RL optimization
Why a Separate Service?
- GPU Memory: VLAC requires 20-30GB of GPU memory, so it runs on GPUs separate from the training workers
- Load Balancing: Multiple service instances handle concurrent requests from distributed training
- Flexibility: Easy to scale independently of training infrastructure
Architecture Design
Service-Client Model
┌─────────────────────────────────────────────────────────────────┐
│                       RL Training Cluster                       │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │  ...    │
│  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴──────┬──────┴─────────────┘               │
│                            │ HTTP/JSON                          │
└────────────────────────────┼────────────────────────────────────┘
                             │
               ┌─────────────┴─────────────┐
               │      Load Balancing       │
               │  (Round-robin by worker)  │
               └─────────────┬─────────────┘
                             │
               ┌─────────────┴─────────────┐
               │                           │
         ┌─────▼─────┐               ┌─────▼─────┐
         │   VLAC    │               │   VLAC    │
         │ Service 1 │      ...      │ Service 8 │
         │   :8111   │               │   :8118   │
         │   GPU 0   │               │   GPU 7   │
         └───────────┘               └───────────┘
Key Design Decisions
- HTTP API: Simple, language-agnostic communication
- Single Process per GPU: Each service instance owns one GPU
- Stateless Services: No session management, pure request-response
- Automatic Load Balancing: Workers round-robin across available services
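Round-robin assignment by worker can be sketched as below. This is an assumption-laden illustration: `service_url_for_worker` is a hypothetical helper, and it assumes instances listen on consecutive ports starting at the port in the base URL and that a worker picks its instance with `worker_id % num_services`. The actual client may select services differently.

```python
from urllib.parse import urlsplit, urlunsplit


def service_url_for_worker(worker_id: int, base_url: str, num_services: int) -> str:
    """Map a worker to one service instance (hypothetical helper).

    Assumes consecutive ports starting at the port in `base_url`, with
    workers assigned by `worker_id % num_services`.
    """
    parts = urlsplit(base_url)
    if parts.port is None:
        raise ValueError("base_url must include an explicit port")
    port = parts.port + worker_id % num_services
    netloc = f"{parts.hostname}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


for worker_id in (0, 1, 8, 9):
    print(worker_id, service_url_for_worker(worker_id, "http://localhost:8111", 8))
```

Because the services are stateless, a worker needs no session affinity; any deterministic mapping that spreads workers evenly would serve the same purpose.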
Mini-Batching and Performance
Internal Batching
The VLAC service automatically batches requests to optimize GPU utilization:
# User sends a trajectory with 100 frames
response = vlac_client.compute_trajectory_values(
    frames=[frame_0, frame_1, ..., frame_99],  # 100 frames
    batch_size=10,  # Suggested batch size; the service still chunks to ≤8 internally
)
# Service internally:
# 1. Chunks 100 frames into batches of ≤8 frames
# 2. Processes each batch on GPU
# 3. Aggregates results
# 4. Returns single response with all 100 values
Why Batch Size ≤ 8?
- Optimal GPU memory utilization for 448×448 images
- Balances throughput and memory usage
- Prevents OOM on 20-30GB GPU memory budget
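The chunking step can be illustrated with a short sketch. The names `chunk_frames` and `MAX_BATCH` are hypothetical and not taken from the service code; only the cap of 8 frames per batch comes from the description above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 8  # cap described above; the real service constant may be named differently


def chunk_frames(frames: Sequence[T], max_batch: int = MAX_BATCH) -> Iterator[List[T]]:
    """Yield consecutive slices of at most `max_batch` frames."""
    if max_batch < 1:
        raise ValueError("max_batch must be positive")
    for start in range(0, len(frames), max_batch):
        yield list(frames[start:start + max_batch])


frames = list(range(100))  # stand-ins for 100 image frames
batches = list(chunk_frames(frames))
print(len(batches), len(batches[-1]))  # 13 batches, the last holding 4 frames
```

For the 100-frame example this gives twelve full batches plus one partial batch, so the GPU never sees more than 8 frames at once regardless of the client's suggested batch size.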
Request Processing Pipeline
HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448)
                                                      ↓
                                               Batch Inference
                                                      ↓
                                             Result Aggregation
                                                      ↓
                              JSON Response ← Value Computation
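The first stages of this pipeline (JSON parse and Base64 decode) can be sketched with the standard library alone. The `"frames"` field name and both helper functions are assumptions made for illustration; the real request schema is not specified here, and image resizing is omitted because it depends on the imaging library the service uses.

```python
import base64
import json
from typing import List


def encode_frames(raw_frames: List[bytes]) -> str:
    """Client side: wrap encoded image bytes in a JSON body (field name is hypothetical)."""
    payload = {"frames": [base64.b64encode(f).decode("ascii") for f in raw_frames]}
    return json.dumps(payload)


def decode_frames(body: str) -> List[bytes]:
    """Service side: JSON parse, then Base64 decode each frame."""
    payload = json.loads(body)
    return [base64.b64decode(s) for s in payload["frames"]]


raw = [b"\x89PNG-frame-0", b"\x89PNG-frame-1"]  # placeholder bytes, not real images
body = encode_frames(raw)
print(decode_frames(body) == raw)  # the round trip preserves the bytes
```

Base64 inflates the payload by roughly a third, which is one reason decoding shows up as a separate line in the latency breakdown.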
Latency Breakdown:
- Image decoding/resizing: ~50-100ms
- GPU inference (batch of 8): ~200-400ms
- JSON serialization: ~10-20ms
- Total: ~300-800ms per request
Scaling with Multiple Services
Single Service (1 GPU):
- Handles ~1-3 requests/second
- Becomes a bottleneck with more than 4 parallel workers
Multiple Services (8 GPUs):
- Handles ~8-24 requests/second
- Supports 16-32 parallel workers
- Linear scaling with GPU count
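The scaling figures above follow from simple multiplication, sketched here under the stated linear-scaling assumption. Both helpers are hypothetical, and the 2-4 workers-per-service ratio is inferred from the 16-32 worker figure for 8 services rather than measured.

```python
from typing import Tuple

Range = Tuple[float, float]


def aggregate_throughput(per_service_rps: Range, num_services: int) -> Range:
    """Scale a (low, high) requests/second range linearly with instance count."""
    low, high = per_service_rps
    return (low * num_services, high * num_services)


def supported_workers(num_services: int, workers_per_service: Range = (2, 4)) -> Range:
    """Estimate how many rollout workers a deployment can serve.

    The 2-4 workers-per-service ratio is inferred from the 16-32 figure for
    8 services; treat it as a rule of thumb, not a measured limit.
    """
    low, high = workers_per_service
    return (low * num_services, high * num_services)


print(aggregate_throughput((1, 3), 8))  # (8, 24)
print(supported_workers(8))             # (16, 32)
```

Real deployments may fall short of the linear estimate if the host CPU, network, or image decoding becomes the limiting factor.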
Key Parameters
Service Configuration
| Parameter | Default | Description | Impact |
|---|---|---|---|
| --port | 8111 | Base port for service | Each service uses consecutive ports (8111, 8112, ...) |
| --gpu-ids | "0" | GPUs to use | One service per GPU |
| --ckpt-path | checkpoints/VLAC | Model checkpoint path | Must point to valid VLAC weights |
Training Integration
| Parameter | Default | Description | Impact |
|---|---|---|---|
| VLAC_SERVICE_URL | http://localhost:8111 | Base URL of VLAC service | Must match service host |
| VLAC_SERVICE_NUM | 8 | Number of service instances | For load balancing |
| VLAC_DONE_THRESHOLD | 0.95 | Completion confidence threshold | Higher = stricter termination |
| VLAC_OFFSET_CALL | 16 | Frames between progress checks | Higher = fewer VLAC calls |
| VLAC_START_STEP_CALL | 64 | When to start checking | Skip early exploration phase |
| USE_DENSE_REWARD | True | Use accumulative progress as reward | Enable for dense feedback |
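How `VLAC_START_STEP_CALL` and `VLAC_OFFSET_CALL` interact can be sketched as a simple schedule. The semantics here are an assumption: the first check is taken to fire exactly at the start step and then every `offset` steps after it. `should_query` is a hypothetical helper, not a function from the training code.

```python
def should_query(step: int, start_step: int = 64, offset: int = 16) -> bool:
    """Return True on steps where a progress check would fire.

    Assumed semantics: first check at `start_step`, then every `offset` steps.
    """
    return step >= start_step and (step - start_step) % offset == 0


default_checks = [s for s in range(0, 129) if should_query(s)]
sparse_checks = [s for s in range(0, 129) if should_query(s, offset=32)]
print(default_checks)  # [64, 80, 96, 112, 128]
print(sparse_checks)   # [64, 96, 128]
```

Over the same 128 steps, doubling the offset from 16 to 32 cuts the number of service calls from five to three, which is the trade-off the Impact column describes.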
Parameter Tuning Guidelines
For Long-Horizon Tasks (e.g., LIBERO-Long):
VLAC_DONE_THRESHOLD = 0.95 # Standard threshold
VLAC_OFFSET_CALL = 16 # Check every 16 steps
VLAC_START_STEP_CALL = 64 # Start after initial exploration
USE_PROGRESSIVE_MAX_STEP = True # Enable progressive horizon
For Short, Precise Tasks:
VLAC_DONE_THRESHOLD = 0.98 # Stricter threshold
VLAC_OFFSET_CALL = 8 # More frequent checks
VLAC_START_STEP_CALL = 32 # Start earlier
For Faster Training (Development):
VLAC_OFFSET_CALL = 32 # Fewer VLAC calls
VLAC_SERVICE_NUM = 4 # Fewer services
Usage Guide
Starting VLAC Service
Single Service (for debugging):
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/vlac_service.py --port 8111 --gpu-ids 0
Multiple Services (for training):
conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/launch_vlac_servers.py --base-port 8111
This launches 8 services on ports 8111-8118, one per GPU.
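Before starting training, it can help to confirm that all eight ports are accepting connections. The sketch below uses only the standard library; the helper names are hypothetical, and a successful TCP connection shows only that a process is listening, not that the VLAC checkpoint has finished loading.

```python
import socket
from typing import Dict, List


def expected_ports(base_port: int = 8111, num_services: int = 8) -> List[int]:
    """Consecutive ports, one per service instance."""
    return [base_port + i for i in range(num_services)]


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> Dict[int, bool]:
    """Probe every expected port and report which ones are listening."""
    return {port: port_is_open(host, port) for port in expected_ports()}


if __name__ == "__main__":
    for port, up in check_services().items():
        print(f"{port}: {'listening' if up else 'not reachable'}")
```

If any port reports as not reachable, check that the corresponding GPU is visible and that the checkpoint path is valid before launching rollout workers.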