Reward Model as a Service Guide

This guide explains the reward backend architecture, using VLAC as the reference service, and how it integrates with EVOLVE-VLA training.


Overview

EVOLVE-VLA uses a progress-based reward as the core signal for rollout training. In this release, VLAC is the reference backend; other backends can be integrated through the same workflow.

Capability Contract

  • Required
    • progress: backend must provide trajectory progress estimates.
  • Optional
    • pairwise: backend may provide pairwise critic signal.
    • done: backend may provide direct done prediction (otherwise derived from progress threshold).
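The fallback for `done` described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `BackendOutput` and `resolve_done` are hypothetical names, progress is assumed to lie in [0, 1], and the 0.95 default simply mirrors the `VLAC_DONE_THRESHOLD` default listed in this guide.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BackendOutput:
    """Hypothetical container for what a reward backend returns."""
    progress: Sequence[float]          # required: per-frame progress estimates
    pairwise: Optional[float] = None   # optional: pairwise critic signal
    done: Optional[bool] = None        # optional: direct done prediction


def resolve_done(output: BackendOutput, threshold: float = 0.95) -> bool:
    """Prefer the backend's own done flag; otherwise threshold the latest progress."""
    if not output.progress:
        raise ValueError("progress is required by the capability contract")
    if output.done is not None:
        return output.done
    return output.progress[-1] >= threshold
```

A backend that only reports progress (such as one with `done` marked "no" in the status table) would take the threshold branch.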

Current backend status:

| Backend | progress | pairwise | done |
|---|---|---|---|
| vlac | yes | yes | optional |
| robodopamine | yes | no | no |

For backend selection and custom backend integration, see REWARD_BACKEND_GUIDE.md.

What VLAC Does

  1. Progress Estimation: Quantifies how much closer an agent has moved toward task completion
  2. Termination Detection: Determines when a trajectory should end based on progress
  3. Dense Rewards: Provides frame-by-frame feedback for RL optimization
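One plausible way to turn progress estimates into dense per-step feedback is to reward the change in progress between consecutive checks. The sketch below is an assumption made for illustration only; it is not necessarily how VLAC or EVOLVE-VLA computes rewards, and `dense_rewards` is a hypothetical helper.

```python
from typing import List, Sequence


def dense_rewards(progress: Sequence[float]) -> List[float]:
    """Convert cumulative progress estimates into per-step rewards (successive differences)."""
    rewards: List[float] = []
    previous = 0.0  # assumption: progress is 0 before the first observed frame
    for value in progress:
        rewards.append(value - previous)
        previous = value
    return rewards
```

With this formulation the rewards sum to the final progress value, so a stalled trajectory earns zero reward and a regressing one earns a negative reward.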

Why a Separate Service?

  • GPU Memory: VLAC requires 20-30 GB of GPU memory, so it runs on GPUs separate from the training workers
  • Load Balancing: Multiple service instances handle concurrent requests from distributed training
  • Flexibility: The reward service scales independently of the training infrastructure

Architecture Design

Service-Client Model

┌─────────────────────────────────────────────────────────────────┐
│                      RL Training Cluster                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │   ...   │
│  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴─────────────┴─────────────┘               │
│                     │ HTTP/JSON                                 │
└─────────────────────┼───────────────────────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        │    Load Balancing         │
        │  (Round-robin by worker)  │
        └─────────────┬─────────────┘
                      │
        ┌─────────────┴─────────────┐
        │                           │
  ┌─────▼─────┐              ┌─────▼─────┐
  │   VLAC    │              │   VLAC    │
  │ Service 1 │     ...      │ Service 8 │
  │  :8111    │              │  :8118    │
  │  GPU 0    │              │  GPU 7    │
  └───────────┘              └───────────┘

Key Design Decisions

  1. HTTP API: Simple, language-agnostic communication
  2. Single Process per GPU: Each service instance owns one GPU
  3. Stateless Services: No session management, pure request-response
  4. Automatic Load Balancing: Workers round-robin across available services
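The round-robin assignment could look roughly like the sketch below, assuming each worker knows its own index and that services listen on consecutive ports starting at the base URL's port. The function name `service_url_for_worker` is hypothetical, and the actual client may select services differently.

```python
from urllib.parse import urlsplit, urlunsplit


def service_url_for_worker(base_url: str, num_services: int, worker_index: int) -> str:
    """Map a worker to one service instance by round-robin over consecutive ports."""
    if num_services < 1:
        raise ValueError("num_services must be at least 1")
    parts = urlsplit(base_url)
    if parts.port is None:
        raise ValueError("base_url must include an explicit port")
    # Worker i talks to service (i mod num_services), which listens on base port + offset.
    port = parts.port + (worker_index % num_services)
    netloc = f"{parts.hostname}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, with a base URL of `http://localhost:8111` and 8 services, worker 9 would be routed to port 8112. Because the services are stateless, any worker can be remapped to any instance without coordination.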

Mini-Batching and Performance

Internal Batching

The VLAC service automatically batches requests to optimize GPU utilization:

# User sends trajectory with 100 frames
response = vlac_client.compute_trajectory_values(
    frames=[frame_0, frame_1, ..., frame_99],  # 100 frames
    batch_size=10  # Suggested batch size; the service caps each batch at 8 frames
)

# Service internally:
# 1. Chunks 100 frames into batches of ≤8 frames
# 2. Processes each batch on GPU
# 3. Aggregates results
# 4. Returns single response with all 100 values

Why Batch Size ≤ 8?

  • Optimal GPU memory utilization for 448×448 images
  • Balances throughput and memory usage
  • Prevents out-of-memory (OOM) errors within the 20-30 GB GPU memory budget
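The internal chunking step can be illustrated with a short sketch. It assumes the service clamps the client's suggested batch size to a cap of 8 frames; `chunk_frames` and `MAX_BATCH` are hypothetical names, not identifiers from the VLAC code.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 8  # assumed cap, matching the batch-size limit described above


def chunk_frames(frames: Sequence[T], requested: int = 10, cap: int = MAX_BATCH) -> Iterator[List[T]]:
    """Yield consecutive batches no larger than min(requested, cap)."""
    size = max(1, min(requested, cap))
    for start in range(0, len(frames), size):
        yield list(frames[start:start + size])
```

Under these assumptions, a 100-frame trajectory is split into 13 batches: twelve of 8 frames and a final one of 4.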

Request Processing Pipeline

HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448)
                                                  ↓
                                            Batch Inference
                                                  ↓
                                          Result Aggregation
                                                  ↓
                            JSON Response ← Value Computation

Latency Breakdown:

  • Image decoding/resizing: ~50-100ms
  • GPU inference (batch of 8): ~200-400ms
  • JSON serialization: ~10-20ms
  • Total: ~300-800ms per request
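The JSON and Base64 stages of the pipeline can be sketched with the standard library alone. The payload layout here, in particular the `"frames"` key, is a hypothetical choice for illustration; the real request schema is defined by the service and may differ. Image resizing and inference are omitted.

```python
import base64
import json
from typing import List


def encode_frames(frames: List[bytes]) -> str:
    """Client side: Base64-encode raw image bytes and wrap them in a JSON body."""
    payload = {"frames": [base64.b64encode(f).decode("ascii") for f in frames]}  # hypothetical key
    return json.dumps(payload)


def decode_frames(body: str) -> List[bytes]:
    """Service side: parse the JSON body and recover the raw image bytes."""
    payload = json.loads(body)
    return [base64.b64decode(item, validate=True) for item in payload["frames"]]
```

Base64 inflates each image by roughly one third, which is part of why decoding shows up as a measurable share of the per-request latency.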

Scaling with Multiple Services

Single Service (1 GPU):

  • Handles ~1-3 requests/second
  • Bottleneck for >4 parallel workers

Multiple Services (8 GPUs):

  • Handles ~8-24 requests/second
  • Supports 16-32 parallel workers
  • Linear scaling with GPU count
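The scaling figures above follow from simple multiplication if throughput is assumed to grow linearly with the number of services. The sketch below reproduces that arithmetic; `estimate_capacity` is a hypothetical helper, and the per-service rate and workers-per-service values are taken from the single-service figures quoted in this section rather than from measurements.

```python
from typing import Dict, Tuple


def estimate_capacity(
    num_services: int,
    per_service_rps: Tuple[float, float] = (1.0, 3.0),  # single-service range quoted above
    workers_per_service: int = 4,                        # one service bottlenecks beyond 4 workers
) -> Dict[str, float]:
    """Estimate aggregate throughput and worker capacity, assuming linear scaling."""
    low, high = per_service_rps
    return {
        "min_rps": num_services * low,
        "max_rps": num_services * high,
        "max_workers": num_services * workers_per_service,
    }
```

For 8 services this gives 8 to 24 requests per second and an upper bound of 32 workers, in line with the numbers listed above.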

Key Parameters

Service Configuration

| Parameter | Default | Description | Impact |
|---|---|---|---|
| --port | 8111 | Base port for service | Each service uses consecutive ports (8111, 8112, ...) |
| --gpu-ids | "0" | GPUs to use | One service per GPU |
| --ckpt-path | checkpoints/VLAC | Model checkpoint path | Must point to valid VLAC weights |

Training Integration

| Parameter | Default | Description | Impact |
|---|---|---|---|
| VLAC_SERVICE_URL | http://localhost:8111 | Base URL of VLAC service | Must match service host |
| VLAC_SERVICE_NUM | 8 | Number of service instances | For load balancing |
| VLAC_DONE_THRESHOLD | 0.95 | Completion confidence threshold | Higher = stricter termination |
| VLAC_OFFSET_CALL | 16 | Frames between progress checks | Higher = fewer VLAC calls |
| VLAC_START_STEP_CALL | 64 | When to start checking | Skip early exploration phase |
| USE_DENSE_REWARD | True | Use accumulative progress as reward | Enable for dense feedback |
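To see how `VLAC_START_STEP_CALL` and `VLAC_OFFSET_CALL` interact, the sketch below lists the rollout steps at which a progress check would fire. It assumes the first check happens exactly at the start step and then repeats every offset steps; the real training loop may count steps differently, and `progress_check_steps` is a hypothetical helper.

```python
from typing import List


def progress_check_steps(max_steps: int, start_step: int = 64, offset: int = 16) -> List[int]:
    """Return the steps (up to max_steps inclusive) at which the reward service is queried."""
    if offset < 1:
        raise ValueError("offset must be at least 1")
    return list(range(start_step, max_steps + 1, offset))
```

With the defaults, a 128-step rollout triggers checks at steps 64, 80, 96, 112, and 128, so five service calls per episode. Doubling the offset roughly halves the number of calls.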

Parameter Tuning Guidelines

For Long-Horizon Tasks (e.g., LIBERO-Long):

VLAC_DONE_THRESHOLD = 0.95       # Standard threshold
VLAC_OFFSET_CALL = 16            # Check every 16 steps
VLAC_START_STEP_CALL = 64        # Start after initial exploration
USE_PROGRESSIVE_MAX_STEP = True  # Enable progressive horizon

For Short, Precise Tasks:

VLAC_DONE_THRESHOLD = 0.98       # Stricter threshold
VLAC_OFFSET_CALL = 8             # More frequent checks
VLAC_START_STEP_CALL = 32        # Start earlier

For Faster Training (Development):

VLAC_OFFSET_CALL = 32            # Fewer VLAC calls
VLAC_SERVICE_NUM = 4             # Fewer services

Usage Guide

Starting VLAC Service

Single Service (for debugging):

conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/vlac_service.py --port 8111 --gpu-ids 0

Multiple Services (for training):

conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/launch_vlac_servers.py --base-port 8111

This launches 8 services on ports 8111-8118, one per GPU.
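Before starting training, it can help to confirm that every service port is accepting connections. The following sketch only tests TCP reachability using the standard library; it does not call any VLAC endpoint, and `expected_ports` and `check_services` are hypothetical helpers rather than part of the release.

```python
import socket
from typing import Dict, Iterable, List


def expected_ports(base_port: int, num_services: int) -> List[int]:
    """Consecutive ports used by the services, starting at the base port."""
    return [base_port + i for i in range(num_services)]


def check_services(host: str, ports: Iterable[int], timeout: float = 1.0) -> Dict[int, bool]:
    """Return, for each port, whether a TCP connection could be opened."""
    status: Dict[int, bool] = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[port] = True
        except OSError:
            status[port] = False
    return status
```

A call such as `check_services("localhost", expected_ports(8111, 8))` returns a mapping from port to reachability; any `False` entry suggests that instance failed to start or is still loading its checkpoint.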