Reward Model as a Service Guide

This guide explains the reward backend architecture (with VLAC as the reference service) and how it integrates with EVOLVE-VLA training.

Overview

EVOLVE-VLA uses a progress-based reward as the core signal for rollout training. In this release, VLAC is the reference backend; other backends can be integrated through the same workflow.

Capability Contract

Required:
- progress: the backend must provide trajectory progress estimates.

Optional:
- pairwise: the backend may provide a pairwise critic signal.
- done: the backend may provide a direct done prediction (otherwise, done is derived from a progress threshold).
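
The contract above can be sketched as a small interface with a fallback for `done`. This is a minimal illustration, not the project's actual API: the `RewardBackend` class, its method signatures, and the `is_done` helper are hypothetical names, progress values are assumed to lie in [0, 1], and the optional `pairwise` capability is left out for brevity.

```python
from typing import Optional, Protocol, Sequence


class RewardBackend(Protocol):
    """Hypothetical interface mirroring the capability contract."""

    def progress(self, frames: Sequence[bytes], task: str) -> list[float]:
        """Required: one progress estimate per frame."""
        ...

    def done(self, frames: Sequence[bytes], task: str) -> Optional[bool]:
        """Optional: direct done prediction, or None if unsupported."""
        ...


def is_done(backend: RewardBackend, frames: Sequence[bytes], task: str,
            threshold: float = 0.95) -> bool:
    """Use the backend's done prediction if present, else threshold progress."""
    direct = backend.done(frames, task)
    if direct is not None:
        return direct
    values = backend.progress(frames, task)
    # Fallback: compare the latest progress estimate with the threshold.
    return bool(values) and values[-1] >= threshold


class ProgressOnlyBackend:
    """Toy backend that replays fixed progress values (illustration only)."""

    def __init__(self, values: Sequence[float]) -> None:
        self._values = list(values)

    def progress(self, frames: Sequence[bytes], task: str) -> list[float]:
        return self._values[:len(frames)]

    def done(self, frames: Sequence[bytes], task: str) -> Optional[bool]:
        return None  # no direct done prediction
```

A progress-only backend therefore still yields a done signal, which matches the "derived from progress threshold" note in the contract.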

Current backend status:

| Backend | progress | pairwise | done |
| --- | --- | --- | --- |
| `vlac` | yes | yes | optional |
| `robodopamine` | yes | no | no |

For backend selection and custom backend integration, see REWARD_BACKEND_GUIDE.md.

What VLAC Does

- Progress Estimation: Quantifies how far the agent has moved toward task completion
- Termination Detection: Determines when a trajectory should end, based on progress
- Dense Rewards: Provides frame-by-frame feedback for RL optimization
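
One way to read "dense rewards" is as the per-frame change in estimated progress. The sketch below shows that interpretation only; the exact reward formula used by EVOLVE-VLA is not specified in this guide, and `dense_rewards` is a hypothetical helper that assumes progress starts at 0.

```python
def dense_rewards(progress: list[float]) -> list[float]:
    """Per-frame reward as the change in progress since the previous frame."""
    rewards = []
    previous = 0.0  # assumption: the episode starts with zero progress
    for value in progress:
        rewards.append(value - previous)
        previous = value
    return rewards


# The rewards telescope: their sum equals the final progress estimate.
example = dense_rewards([0.1, 0.3, 0.25])
```

With this definition, a frame where progress regresses (0.3 to 0.25 above) receives a negative reward.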

Why a Separate Service?

- GPU Memory: VLAC requires 20-30GB of GPU memory, so it runs separately from the training workers
- Load Balancing: Multiple service instances handle concurrent requests from distributed training
- Flexibility: The service scales independently of the training infrastructure

Architecture Design

Service-Client Model

┌─────────────────────────────────────────────────────────────────┐
│                      RL Training Cluster                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │   ...   │
│  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
│       │             │             │             │               │
│       └─────────────┴─────────────┴─────────────┘               │
│                     │ HTTP/JSON                                 │
└─────────────────────┼───────────────────────────────────────────┘
                      │
        ┌─────────────┴─────────────┐
        │    Load Balancing         │
        │  (Round-robin by worker)  │
        └─────────────┬─────────────┘
                      │
        ┌─────────────┴─────────────┐
        │                           │
  ┌─────▼─────┐              ┌─────▼─────┐
  │   VLAC    │              │   VLAC    │
  │ Service 1 │     ...      │ Service 8 │
  │  :8111    │              │  :8118    │
  │  GPU 0    │              │  GPU 7    │
  └───────────┘              └───────────┘

Key Design Decisions

- HTTP API: Simple, language-agnostic communication
- Single Process per GPU: Each service instance owns one GPU
- Stateless Services: No session management, pure request-response
- Automatic Load Balancing: Workers round-robin across available services
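
Because services are stateless and listen on consecutive ports, a worker can pick its service from its own rank. The following is a sketch under assumptions: the diagram says "round-robin by worker" but does not give the mapping, so the modulo rule and the `service_url_for_worker` helper are illustrative rather than taken from the codebase.

```python
from urllib.parse import urlsplit, urlunsplit


def service_url_for_worker(base_url: str, service_num: int, worker_rank: int) -> str:
    """Map a worker rank to one of `service_num` services on consecutive ports."""
    parts = urlsplit(base_url)
    if parts.port is None:
        raise ValueError("base_url must include an explicit port")
    # Assumed rule: worker i talks to service (i mod service_num).
    port = parts.port + (worker_rank % service_num)
    netloc = f"{parts.hostname}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, with a base URL of `http://localhost:8111` and 8 services, ranks 0 and 8 would both map to port 8111, and rank 7 to port 8118.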

Mini-Batching and Performance

Internal Batching

VLAC service automatically batches requests to optimize GPU utilization:

# User sends a trajectory with 100 frames
response = vlac_client.compute_trajectory_values(
    frames=[frame_0, frame_1, ..., frame_99],  # 100 frames
    batch_size=10  # Suggested batch size; the service caps each batch at 8 frames
)

# Service internally:
# 1. Chunks 100 frames into batches of ≤8 frames
# 2. Processes each batch on GPU
# 3. Aggregates results
# 4. Returns single response with all 100 values

Why Batch Size ≤ 8?

- Makes good use of GPU memory for 448×448 images
- Balances throughput and memory usage
- Prevents out-of-memory (OOM) errors within the 20-30GB GPU memory budget
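
The chunk-process-aggregate steps described above can be sketched as follows. This is an illustration, not the service's implementation: `iter_batches`, `run_in_batches`, and the `infer` callback are hypothetical names, and the cap of 8 is taken from this section.

```python
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 8  # cap stated in this guide


def iter_batches(frames: Sequence[T], requested: int = MAX_BATCH) -> Iterator[Sequence[T]]:
    """Yield consecutive chunks no larger than the service-side cap."""
    size = max(1, min(requested, MAX_BATCH))
    for start in range(0, len(frames), size):
        yield frames[start:start + size]


def run_in_batches(frames: Sequence[T],
                   infer: Callable[[Sequence[T]], list[float]],
                   requested: int = MAX_BATCH) -> list[float]:
    """Run `infer` on each chunk and concatenate results in frame order."""
    values: list[float] = []
    for batch in iter_batches(frames, requested):
        values.extend(infer(batch))
    return values
```

A 100-frame trajectory with a requested batch size of 10 is split into 13 chunks here: twelve of 8 frames and one of 4.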

Request Processing Pipeline

HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448)
                                                  ↓
                                            Batch Inference
                                                  ↓
                                          Result Aggregation
                                                  ↓
                            JSON Response ← Value Computation

Latency Breakdown:

- Image decoding/resizing: ~50-100ms
- GPU inference (batch of 8): ~200-400ms
- JSON serialization: ~10-20ms
- Total: ~300-800ms per request
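
The JSON and Base64 stages of the pipeline can be illustrated with a standard-library round trip. The payload field names (`task`, `frames`) are assumptions made for this sketch, since the actual request schema is not shown in this guide. The resize step is omitted because it needs an image library.

```python
import base64
import json


def encode_request(frames: list[bytes], task: str) -> str:
    """Client side: Base64-encode raw image bytes and wrap them in JSON."""
    payload = {
        "task": task,  # hypothetical field name
        "frames": [base64.b64encode(f).decode("ascii") for f in frames],  # hypothetical field name
    }
    return json.dumps(payload)


def decode_request(body: str) -> tuple[str, list[bytes]]:
    """Service side: parse JSON and recover the original image bytes."""
    payload = json.loads(body)
    frames = [base64.b64decode(s) for s in payload["frames"]]
    return payload["task"], frames
```

Base64 inflates the payload by roughly one third, which is part of why decoding shows up in the latency breakdown.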

Scaling with Multiple Services

Single Service (1 GPU):

- Handles ~1-3 requests/second
- Becomes a bottleneck with more than 4 parallel workers

Multiple Services (8 GPUs):

- Handles ~8-24 requests/second
- Supports 16-32 parallel workers
- Throughput scales linearly with GPU count
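
The scaling figures above follow from multiplying the per-service rate by the number of services. The sketch below reproduces that arithmetic; the helper names are hypothetical and the 1-3 requests/second range is the estimate quoted in this section, not a measurement.

```python
import math


def throughput_range(num_services: int, low: float = 1.0, high: float = 3.0) -> tuple[float, float]:
    """Aggregate requests/second, assuming linear scaling across services."""
    return (num_services * low, num_services * high)


def services_needed(target_rps: float, per_service_rps: float = 1.0) -> int:
    """Smallest number of services that meets a target request rate."""
    return max(1, math.ceil(target_rps / per_service_rps))
```

Using the conservative rate of 1 request/second per service gives a safer estimate when sizing a deployment.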

Key Parameters

Service Configuration

| Parameter | Default | Description | Impact |
| --- | --- | --- | --- |
| `--port` | `8111` | Base port for service | Each service uses consecutive ports (8111, 8112, ...) |
| `--gpu-ids` | `"0"` | GPUs to use | One service per GPU |
| `--ckpt-path` | `checkpoints/VLAC` | Model checkpoint path | Must point to valid VLAC weights |
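
The "consecutive ports" and "one service per GPU" rules can be pictured as a simple mapping. This sketch assumes `--gpu-ids` accepts a comma-separated list such as `"0,1,2"`; that format and the `plan_services` helper are assumptions, not confirmed by this guide.

```python
def plan_services(gpu_ids: str, base_port: int = 8111) -> list[tuple[int, int]]:
    """Return (gpu_id, port) pairs, one service per GPU on consecutive ports."""
    # Assumed format: comma-separated GPU indices, e.g. "0,1,2".
    ids = [int(token) for token in gpu_ids.split(",") if token.strip()]
    return [(gpu, base_port + index) for index, gpu in enumerate(ids)]
```

With eight GPUs and the default base port, the last pair would be GPU 7 on port 8118, matching the architecture diagram.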

Training Integration

| Parameter | Default | Description | Impact |
| --- | --- | --- | --- |
| `VLAC_SERVICE_URL` | `http://localhost:8111` | Base URL of VLAC service | Must match service host |
| `VLAC_SERVICE_NUM` | `8` | Number of service instances | For load balancing |
| `VLAC_DONE_THRESHOLD` | `0.95` | Completion confidence threshold | Higher = stricter termination |
| `VLAC_OFFSET_CALL` | `16` | Frames between progress checks | Higher = fewer VLAC calls |
| `VLAC_START_STEP_CALL` | `64` | When to start checking | Skip early exploration phase |
| `USE_DENSE_REWARD` | `True` | Use accumulative progress as reward | Enable for dense feedback |
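
To show how `VLAC_START_STEP_CALL` and `VLAC_OFFSET_CALL` interact, the sketch below lists the rollout steps at which a progress check would fire. The exact alignment (first check at the start step, then every `offset` steps) is an assumption, and both helper names are hypothetical.

```python
def should_query(step: int, start_step: int = 64, offset: int = 16) -> bool:
    """True when a progress check is due at this rollout step (assumed rule)."""
    return step >= start_step and (step - start_step) % offset == 0


def query_steps(max_steps: int, start_step: int = 64, offset: int = 16) -> list[int]:
    """All steps in [0, max_steps) at which the reward service would be called."""
    return [s for s in range(max_steps) if should_query(s, start_step, offset)]
```

Under this rule, a 128-step rollout with the defaults triggers 4 checks, while a start step of 32 with an offset of 8 triggers 12, which illustrates the cost of more frequent checks.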

Parameter Tuning Guidelines

For Long-Horizon Tasks (e.g., LIBERO-Long):

VLAC_DONE_THRESHOLD = 0.95       # Standard threshold
VLAC_OFFSET_CALL = 16            # Check every 16 steps
VLAC_START_STEP_CALL = 64        # Start after initial exploration
USE_PROGRESSIVE_MAX_STEP = True  # Enable progressive horizon

For Short, Precise Tasks:

VLAC_DONE_THRESHOLD = 0.98       # Stricter threshold
VLAC_OFFSET_CALL = 8             # More frequent checks
VLAC_START_STEP_CALL = 32        # Start earlier

For Faster Training (Development):

VLAC_OFFSET_CALL = 32            # Fewer VLAC calls
VLAC_SERVICE_NUM = 4             # Fewer services

Usage Guide

Starting VLAC Service

Single Service (for debugging):

conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/vlac_service.py --port 8111 --gpu-ids 0

Multiple Services (for training):

conda activate vlac
export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
python reward_model/launch_vlac_servers.py --base-port 8111

This launches 8 services on ports 8111-8118, one per GPU.
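
Before starting training, it can help to confirm that every expected port accepts connections. The sketch below only tests TCP reachability with the standard library; no health-check endpoint is documented in this guide, so none is assumed, and the helper names are hypothetical.

```python
import socket


def reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_ports(host: str = "localhost", base_port: int = 8111,
                      count: int = 8, timeout: float = 1.0) -> list[int]:
    """List the ports in [base_port, base_port + count) that refuse connections."""
    return [port for port in range(base_port, base_port + count)
            if not reachable(host, port, timeout)]
```

An empty result from `unreachable_ports()` means all eight ports are listening. It does not confirm that the model has finished loading on each GPU.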