	# Reward Model as a Service Guide

	This guide explains the reward backend architecture, using VLAC as the reference service, and how it integrates with EVOLVE-VLA training.

	---

	## Overview

	EVOLVE-VLA uses a progress-based reward as the core signal for rollout training.
	In this release, VLAC is the reference backend; other backends can be integrated through the same workflow.

	### Capability Contract

	- Required
	  - `progress`: backend must provide trajectory progress estimates.
	- Optional
	  - `pairwise`: backend may provide pairwise critic signal.
	  - `done`: backend may provide direct done prediction (otherwise derived from progress threshold).

	Current backend status:

	| Backend | progress | pairwise | done |
	|---|---|---|---|
	| `vlac` | yes | yes | optional |
	| `robodopamine` | yes | no | no |

	For backend selection and custom backend integration, see `REWARD_BACKEND_GUIDE.md`.
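
The capability contract above can be pictured as a small interface. The sketch below is illustrative only: `RewardBackend`, `ProgressOnlyBackend`, and the method signatures are hypothetical names, not taken from the EVOLVE-VLA codebase, and progress values are assumed to lie in [0, 1]. It shows one way `done` could fall back to a progress threshold when a backend offers no direct done prediction.

```python
from abc import ABC, abstractmethod
from typing import List, Optional, Sequence


class RewardBackend(ABC):
    """Hypothetical interface mirroring the capability contract."""

    @abstractmethod
    def progress(self, frames: Sequence[object], task: str) -> List[float]:
        """Required: one progress estimate per frame (assumed in [0, 1])."""

    def pairwise(self, frame_a: object, frame_b: object, task: str) -> Optional[float]:
        """Optional: pairwise critic signal; None means unsupported."""
        return None

    def done(self, frames: Sequence[object], task: str, threshold: float = 0.95) -> bool:
        """Optional: default derives done from the latest progress value."""
        values = self.progress(frames, task)
        return bool(values) and values[-1] >= threshold


class ProgressOnlyBackend(RewardBackend):
    """Toy backend for illustration: progress grows linearly with frame index."""

    def progress(self, frames: Sequence[object], task: str) -> List[float]:
        n = len(frames)
        return [(i + 1) / n for i in range(n)]
```

A backend that only implements `progress` (like `robodopamine` in the table) would then still answer `done` through the threshold fallback.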

	### What VLAC Does

	1. Progress Estimation: Quantifies how much closer an agent has moved toward task completion
	2. Termination Detection: Determines when a trajectory should end based on progress
	3. Dense Rewards: Provides frame-by-frame feedback for RL optimization

	### Why a Separate Service?

	- GPU Memory: VLAC needs 20-30GB of GPU memory, which stays separate from the training workers
	- Load Balancing: Multiple service instances handle concurrent requests from distributed training
	- Flexibility: The service scales independently of the training infrastructure

	---

	## Architecture Design

	### Service-Client Model

	```
	┌─────────────────────────────────────────────────────────────────┐
	│                       RL Training Cluster                       │
	│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │
	│  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │   ...   │
	│  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │
	│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │
	│       │             │             │             │               │
	│       └─────────────┼─────────────┴─────────────┘               │
	│                     │ HTTP/JSON                                 │
	└─────────────────────┼───────────────────────────────────────────┘
	                      │
	        ┌─────────────┴─────────────┐
	        │      Load Balancing       │
	        │  (Round-robin by worker)  │
	        └─────────────┬─────────────┘
	                      │
	        ┌─────────────┴─────────────┐
	        │                           │
	  ┌─────▼─────┐               ┌─────▼─────┐
	  │   VLAC    │               │   VLAC    │
	  │ Service 1 │      ...      │ Service 8 │
	  │   :8111   │               │   :8118   │
	  │   GPU 0   │               │   GPU 7   │
	  └───────────┘               └───────────┘
	```

	### Key Design Decisions

	1. HTTP API: Simple, language-agnostic communication
	2. Single Process per GPU: Each service instance owns one GPU
	3. Stateless Services: No session management, pure request-response
	4. Automatic Load Balancing: Workers round-robin across available services
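
To make "round-robin by worker" concrete, here is a minimal sketch of how a worker could pick its service instance. It assumes that instances listen on consecutive ports starting at the base URL's port and that assignment is a simple modulo over the worker index; the function name `service_url_for_worker` is hypothetical, and the actual client may distribute requests differently.

```python
def service_url_for_worker(worker_id: int,
                           base_url: str = "http://localhost:8111",
                           num_services: int = 8) -> str:
    """Map a worker index to one service URL (assumed modulo scheme)."""
    # Split "http://host:8111" into "http://host" and "8111".
    host, _, base_port = base_url.rpartition(":")
    port = int(base_port) + worker_id % num_services
    return f"{host}:{port}"


# Workers 0..7 land on ports 8111..8118; worker 8 wraps back to 8111.
urls = [service_url_for_worker(w) for w in range(9)]
```

Because the services are stateless, any worker can talk to any instance, so a fixed mapping like this needs no coordination between workers.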

	---

	## Mini-Batching and Performance

	### Internal Batching

	The VLAC service automatically batches requests to optimize GPU utilization:

	```python
	# The client sends a trajectory with 100 frames
	# (frame_0 ... frame_99 stand for already-loaded images)
	response = vlac_client.compute_trajectory_values(
	    frames=[frame_0, frame_1, ..., frame_99],  # 100 frames
	    batch_size=10,  # Suggested batch size; the service still chunks to ≤8
	)

	# Service internally:
	# 1. Chunks 100 frames into batches of ≤8 frames
	# 2. Processes each batch on GPU
	# 3. Aggregates results
	# 4. Returns single response with all 100 values
	```

	Why Batch Size ≤ 8?
	- Optimal GPU memory utilization for 448×448 images
	- Balances throughput and memory usage
	- Prevents OOM on 20-30GB GPU memory budget
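
The chunk-then-aggregate behavior described above can be sketched in a few lines. This is not the service's actual code: `chunk_frames`, `compute_values`, and the `score_batch` callback are hypothetical names, and the sketch assumes the requested `batch_size` is simply capped at 8.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
MAX_BATCH = 8  # upper bound on batch size stated in this guide


def chunk_frames(frames: Sequence[T], batch_size: int = 10) -> List[List[T]]:
    """Split frames into consecutive batches of at most MAX_BATCH items."""
    size = max(1, min(batch_size, MAX_BATCH))
    return [list(frames[i:i + size]) for i in range(0, len(frames), size)]


def compute_values(frames: Sequence[T],
                   score_batch: Callable[[List[T]], List[float]],
                   batch_size: int = 10) -> List[float]:
    """Score each batch and aggregate into one value per input frame."""
    values: List[float] = []
    for batch in chunk_frames(frames, batch_size):
        values.extend(score_batch(batch))
    return values
```

With 100 frames and a suggested batch size of 10, this yields 13 batches (twelve of 8 frames and one of 4) and a single list of 100 values.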

	### Request Processing Pipeline

	```
	HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448)
	                                                       ↓
	                                                Batch Inference
	                                                       ↓
	                                              Result Aggregation
	                                                       ↓
	                               JSON Response ← Value Computation
	```
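
The first stages of this pipeline (JSON parse and Base64 decode) can be illustrated with the standard library alone. The payload layout below is an assumption made for the example: the field names `task` and `frames` and both helper functions are hypothetical, and the real request schema of the VLAC service may differ. Image resizing is omitted because it needs an imaging library.

```python
import base64
import json
from typing import List


def build_payload(frames: List[bytes], task: str) -> str:
    """Client side: base64-encode raw image bytes into a JSON body."""
    encoded = [base64.b64encode(f).decode("ascii") for f in frames]
    # "task" and "frames" are illustrative field names, not the real schema.
    return json.dumps({"task": task, "frames": encoded})


def parse_payload(body: str) -> List[bytes]:
    """Service side: JSON parse followed by base64 decode."""
    data = json.loads(body)
    return [base64.b64decode(s) for s in data["frames"]]


# Round trip with placeholder bytes standing in for encoded images.
body = build_payload([b"\x89PNG-frame-0", b"\x89PNG-frame-1"], "pick up the cup")
decoded = parse_payload(body)
```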

	Latency Breakdown:
	- Image decoding/resizing: ~50-100ms
	- GPU inference (batch of 8): ~200-400ms
	- JSON serialization: ~10-20ms
	- Total: ~300-800ms per request

	### Scaling with Multiple Services

	Single Service (1 GPU):
	- Handles ~1-3 requests/second
	- Bottleneck for >4 parallel workers

	Multiple Services (8 GPUs):
	- Handles ~8-24 requests/second
	- Supports 16-32 parallel workers
	- Linear scaling with GPU count
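
The throughput figures follow roughly from the latency range given earlier. The sketch below is back-of-the-envelope arithmetic, assuming each service processes one request at a time and that instances do not interfere with each other; `requests_per_second` is a hypothetical helper.

```python
def requests_per_second(latency_s: float, num_services: int = 1) -> float:
    """Upper-bound throughput if each service handles one request at a time."""
    return num_services / latency_s


# Per-request latency range from the breakdown above: 0.3 s to 0.8 s.
single = (requests_per_second(0.8), requests_per_second(0.3))
eight = (requests_per_second(0.8, 8), requests_per_second(0.3, 8))
print(f"1 service:  {single[0]:.1f}-{single[1]:.1f} req/s")
print(f"8 services: {eight[0]:.1f}-{eight[1]:.1f} req/s")
```

These estimates are of the same order as the ~1-3 and ~8-24 requests/second quoted above; real throughput also depends on network overhead and request size.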

	---

	## Key Parameters

	### Service Configuration

	| Parameter | Default | Description | Impact |
	|-----------|---------|-------------|--------|
	| `--port` | `8111` | Base port for service | Each service uses consecutive ports (8111, 8112, ...) |
	| `--gpu-ids` | `"0"` | GPUs to use | One service per GPU |
	| `--ckpt-path` | `checkpoints/VLAC` | Model checkpoint path | Must point to valid VLAC weights |

	### Training Integration

	| Parameter | Default | Description | Impact |
	|-----------|---------|-------------|--------|
	| `VLAC_SERVICE_URL` | `http://localhost:8111` | Base URL of VLAC service | Must match service host |
	| `VLAC_SERVICE_NUM` | `8` | Number of service instances | For load balancing |
	| `VLAC_DONE_THRESHOLD` | `0.95` | Completion confidence threshold | Higher = stricter termination |
	| `VLAC_OFFSET_CALL` | `16` | Frames between progress checks | Higher = fewer VLAC calls |
	| `VLAC_START_STEP_CALL` | `64` | When to start checking | Skip early exploration phase |
	| `USE_DENSE_REWARD` | `True` | Use accumulative progress as reward | Enable for dense feedback |
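
To see how `VLAC_START_STEP_CALL`, `VLAC_OFFSET_CALL`, and `VLAC_DONE_THRESHOLD` interact, the sketch below lists the steps at which a rollout would query VLAC. It assumes the first check happens exactly at the start step and then repeats every `offset` steps; the training code may count steps differently, and both function names are hypothetical.

```python
from typing import List


def vlac_check_steps(episode_len: int, start_step: int = 64, offset: int = 16) -> List[int]:
    """Steps at which progress would be queried under the assumed schedule."""
    return list(range(start_step, episode_len + 1, offset))


def is_done(progress: float, threshold: float = 0.95) -> bool:
    """Terminate once estimated progress reaches the threshold."""
    return progress >= threshold


default_checks = vlac_check_steps(128)            # defaults from the table
sparse_checks = vlac_check_steps(128, offset=32)  # larger offset, fewer calls
```

For a 128-step episode the defaults give 5 checks, while an offset of 32 gives 3, which is the "higher = fewer VLAC calls" trade-off in the table.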

	### Parameter Tuning Guidelines

	For Long-Horizon Tasks (e.g., LIBERO-Long):
	```python
	VLAC_DONE_THRESHOLD = 0.95 # Standard threshold
	VLAC_OFFSET_CALL = 16 # Check every 16 steps
	VLAC_START_STEP_CALL = 64 # Start after initial exploration
	USE_PROGRESSIVE_MAX_STEP = True # Enable progressive horizon
	```

	For Short, Precise Tasks:
	```python
	VLAC_DONE_THRESHOLD = 0.98 # Stricter threshold
	VLAC_OFFSET_CALL = 8 # More frequent checks
	VLAC_START_STEP_CALL = 32 # Start earlier
	```

	For Faster Training (Development):
	```python
	VLAC_OFFSET_CALL = 32 # Fewer VLAC calls
	VLAC_SERVICE_NUM = 4 # Fewer services
	```
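
One way to keep these presets side by side is a small settings object with the table defaults and per-scenario overrides. `VlacSettings` and the preset names are hypothetical and only mirror the values listed in this guide; the project's own configuration mechanism may look different.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VlacSettings:
    """Hypothetical container for the training-integration parameters."""
    service_url: str = "http://localhost:8111"
    service_num: int = 8
    done_threshold: float = 0.95
    offset_call: int = 16
    start_step_call: int = 64
    use_dense_reward: bool = True


DEFAULTS = VlacSettings()
# Short, precise tasks: stricter threshold, earlier and more frequent checks.
SHORT_PRECISE = replace(DEFAULTS, done_threshold=0.98, offset_call=8, start_step_call=32)
# Faster development runs: fewer VLAC calls and fewer services.
FAST_DEV = replace(DEFAULTS, offset_call=32, service_num=4)
```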

	---

	## Usage Guide

	### Starting VLAC Service

	Single Service (for debugging):
	```bash
	conda activate vlac
	export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
	python reward_model/vlac_service.py --port 8111 --gpu-ids 0
	```

	Multiple Services (for training):
	```bash
	conda activate vlac
	export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC
	python reward_model/launch_vlac_servers.py --base-port 8111
	```

	This launches 8 services on ports 8111-8118, one per GPU.
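
Before starting training, it can help to confirm that every port is accepting connections. The sketch below uses only the standard library and assumes the services listen on consecutive ports from the base port; `check_services` is a hypothetical helper, and a successful TCP connection only shows that the process is listening, not that the model has finished loading.

```python
import socket
from typing import Dict


def check_services(host: str = "localhost", base_port: int = 8111,
                   num_services: int = 8, timeout: float = 1.0) -> Dict[int, bool]:
    """Return {port: reachable} for each expected service port."""
    status: Dict[int, bool] = {}
    for port in range(base_port, base_port + num_services):
        try:
            # A completed TCP handshake means something is listening on the port.
            with socket.create_connection((host, port), timeout=timeout):
                status[port] = True
        except OSError:
            status[port] = False
    return status


if __name__ == "__main__":
    for port, ok in check_services().items():
        print(f"port {port}: {'up' if ok else 'down'}")
```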