| # Reward Model as a Service Guide |
|
|
| This guide explains the reward backend architecture (with VLAC as the reference service) and how it integrates with EVOLVE-VLA training. |
|
|
| --- |
|
|
| ## Overview |
|
|
| EVOLVE-VLA uses a progress-based reward as the core signal for rollout training. |
| In this release, VLAC is the reference backend, and other backends can be integrated via the same workflow. |
|
|
| ### Capability Contract |
|
|
| - **Required** |
| - `progress`: backend must provide trajectory progress estimates. |
| - **Optional** |
| - `pairwise`: backend may provide pairwise critic signal. |
| - `done`: backend may provide direct done prediction (otherwise derived from progress threshold). |
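
The fallback in the contract above (a backend without a `done` output has `done` derived from a progress threshold) can be sketched as follows. This is a minimal illustration, not the project's implementation: `BackendCapabilities` and `resolve_done` are hypothetical names, and progress is assumed to be a value in the range 0 to 1.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BackendCapabilities:
    """Hypothetical capability record mirroring the contract above."""
    progress: bool = True   # required
    pairwise: bool = False  # optional
    done: bool = False      # optional


def resolve_done(progress: Sequence[float],
                 threshold: float,
                 backend_done: Optional[bool] = None) -> bool:
    """Prefer the backend's own done prediction; otherwise threshold
    the most recent progress estimate (assumed to lie in [0, 1])."""
    if backend_done is not None:
        return backend_done
    return bool(progress) and progress[-1] >= threshold


# Capability rows taken from the backend status table in this guide;
# "optional" for VLAC's done output is modeled here as not advertised.
VLAC = BackendCapabilities(progress=True, pairwise=True, done=False)
ROBODOPAMINE = BackendCapabilities(progress=True, pairwise=False, done=False)
```

With this convention a backend that only reports progress, such as `robodopamine`, still yields a termination signal once its latest estimate crosses the threshold.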
|
|
| Current backend status: |
|
|
| | Backend | progress | pairwise | done | |
| |---|---|---|---| |
| | `vlac` | yes | yes | optional | |
| | `robodopamine` | yes | no | no | |
|
|
| For backend selection and custom backend integration, see `REWARD_BACKEND_GUIDE.md`. |
|
|
| ### What VLAC Does |
|
|
| 1. **Progress Estimation**: Quantifies how much closer an agent has moved toward task completion |
| 2. **Termination Detection**: Determines when a trajectory should end based on progress |
| 3. **Dense Rewards**: Provides frame-by-frame feedback for RL optimization |
|
|
| ### Why a Separate Service? |
|
|
| - **GPU Memory**: VLAC requires 20-30GB of GPU memory, so it runs on GPUs separate from the training workers |
| - **Load Balancing**: Multiple service instances handle concurrent requests from distributed training |
| - **Flexibility**: The service scales independently of the training infrastructure |
|
|
| --- |
|
|
| ## Architecture Design |
|
|
| ### Service-Client Model |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────────────┐ |
| │                       RL Training Cluster                       │ |
| │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐         │ |
| │  │ Worker 1 │  │ Worker 2 │  │ Worker 3 │  │ Worker 4 │ ...     │ |
| │  │ (rollout)│  │ (rollout)│  │ (rollout)│  │ (rollout)│         │ |
| │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘         │ |
| │       │             │             │             │               │ |
| │       └─────────────┴──────┬──────┴─────────────┘               │ |
| │                            │ HTTP/JSON                          │ |
| └────────────────────────────┼────────────────────────────────────┘ |
|                              │ |
|                ┌─────────────┴─────────────┐ |
|                │      Load Balancing       │ |
|                │  (Round-robin by worker)  │ |
|                └─────────────┬─────────────┘ |
|                              │ |
|                ┌─────────────┴─────────────┐ |
|                │                           │ |
|          ┌─────┴─────┐               ┌─────┴─────┐ |
|          │   VLAC    │               │   VLAC    │ |
|          │ Service 1 │      ...      │ Service 8 │ |
|          │   :8111   │               │   :8118   │ |
|          │   GPU 0   │               │   GPU 7   │ |
|          └───────────┘               └───────────┘ |
| ``` |
|
|
| ### Key Design Decisions |
|
|
| 1. **HTTP API**: Simple, language-agnostic communication |
| 2. **Single Process per GPU**: Each service instance owns one GPU |
| 3. **Stateless Services**: No session management, pure request-response |
| 4. **Automatic Load Balancing**: Workers round-robin across available services |
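
One way to realize "round-robin by worker" is to map each worker index onto one of the consecutive service ports. The sketch below assumes that mapping is `worker_id % service_num` added to the base port; the function name `service_url_for_worker` is hypothetical and the actual client may assign services differently.

```python
from urllib.parse import urlsplit, urlunsplit


def service_url_for_worker(base_url: str, service_num: int, worker_id: int) -> str:
    """Pick a service URL for a worker by cycling over consecutive ports.

    Assumes services listen on base_port, base_port + 1, ... on one host.
    """
    if service_num < 1:
        raise ValueError("service_num must be at least 1")
    parts = urlsplit(base_url)
    if parts.port is None:
        raise ValueError("base_url must include an explicit port")
    port = parts.port + (worker_id % service_num)
    netloc = f"{parts.hostname}:{port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    for worker in range(10):
        print(worker, service_url_for_worker("http://localhost:8111", 8, worker))
```

Because the services are stateless, any worker can be pointed at any instance; a fixed per-worker assignment like this simply avoids the need for a separate proxy.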
|
|
| --- |
|
|
| ## Mini-Batching and Performance |
|
|
| ### Internal Batching |
|
|
| The VLAC service automatically splits each request into mini-batches to optimize GPU utilization: |
|
|
| ```python |
| # User sends a trajectory with 100 frames |
| response = vlac_client.compute_trajectory_values( |
|     frames=[frame_0, frame_1, ..., frame_99],  # 100 frames |
|     batch_size=10,  # Suggested batch size; the service processes at most 8 frames per batch |
| ) |
|
| # Service internally: |
| # 1. Chunks 100 frames into batches of ≤8 frames |
| # 2. Processes each batch on GPU |
| # 3. Aggregates results |
| # 4. Returns single response with all 100 values |
| ``` |
|
|
| **Why Batch Size ≤ 8?** |
| - Optimal GPU memory utilization for 448×448 images |
| - Balances throughput and memory usage |
| - Prevents OOM on 20-30GB GPU memory budget |
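
The chunk-then-aggregate behavior described above can be sketched in a few lines. This is an illustration under stated assumptions: `chunk` and `values_for_trajectory` are hypothetical helpers, `infer_batch` stands in for the GPU forward pass, and the cap of 8 is taken from the text.

```python
from typing import Callable, List, Sequence

MAX_BATCH = 8  # cap stated in this guide


def chunk(frames: Sequence, max_batch: int = MAX_BATCH) -> List[Sequence]:
    """Split frames into consecutive batches of at most max_batch items."""
    return [frames[i:i + max_batch] for i in range(0, len(frames), max_batch)]


def values_for_trajectory(frames: Sequence,
                          infer_batch: Callable[[Sequence], List[float]],
                          requested_batch_size: int = 10) -> List[float]:
    """Run inference batch by batch and return one value per frame, in order."""
    batch_size = max(1, min(requested_batch_size, MAX_BATCH))  # clamp the suggestion
    values: List[float] = []
    for batch in chunk(frames, batch_size):
        values.extend(infer_batch(batch))
    return values


if __name__ == "__main__":
    fake_frames = list(range(100))
    batches = chunk(fake_frames)
    print(len(batches), len(batches[-1]))  # 13 batches, the last holds 4 frames
```

A 100-frame trajectory therefore costs 13 forward passes rather than one, which is worth keeping in mind when reading the per-batch latency figures.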
|
|
| ### Request Processing Pipeline |
|
|
| ``` |
| HTTP Request → JSON Parse → Base64 Decode → Image Resize (448×448) |
|                                                        ↓ |
|                                                 Batch Inference |
|                                                        ↓ |
|                                               Result Aggregation |
|                                                        ↓ |
|                                JSON Response ← Value Computation |
| ``` |
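
The first stages of this pipeline (JSON parse and Base64 decode) can be illustrated with the standard library alone. The request schema is not specified in this guide, so the `"frames"` key below is purely hypothetical, and the resize step is omitted because it would need an imaging library.

```python
import base64
import json
from typing import List


def encode_request(frames: List[bytes]) -> str:
    """Client side: Base64-encode raw image bytes into a JSON body.

    The "frames" key is an assumed field name, not the documented schema.
    """
    payload = {"frames": [base64.b64encode(f).decode("ascii") for f in frames]}
    return json.dumps(payload)


def decode_request(body: str) -> List[bytes]:
    """Server side: parse JSON and recover the original image bytes."""
    payload = json.loads(body)
    return [base64.b64decode(s, validate=True) for s in payload["frames"]]


if __name__ == "__main__":
    raw = [b"\x89PNG fake frame 0", b"\x89PNG fake frame 1"]
    assert decode_request(encode_request(raw)) == raw
```

Base64 inflates the payload by roughly a third, which is one reason decoding appears as its own item in the latency breakdown.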
|
|
| **Latency Breakdown**: |
| - Image decoding/resizing: ~50-100ms |
| - GPU inference (batch of 8): ~200-400ms |
| - JSON serialization: ~10-20ms |
| - **Total**: ~300-800ms per request |
|
|
| ### Scaling with Multiple Services |
|
|
| **Single Service** (1 GPU): |
| - Handles ~1-3 requests/second |
| - Becomes a bottleneck with more than 4 parallel workers |
|
|
| **Multiple Services** (8 GPUs): |
| - Handles ~8-24 requests/second |
| - Supports 16-32 parallel workers |
| - Approximately linear scaling with GPU count |
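
The throughput figures above follow from the per-request latency if each service handles one request at a time. The sketch below makes that arithmetic explicit; the sequential-processing assumption and the function name `throughput_range` are illustrative, and the latency bounds are the 300-800ms quoted in the latency breakdown.

```python
from typing import Tuple


def throughput_range(num_services: int,
                     latency_s: Tuple[float, float] = (0.3, 0.8)) -> Tuple[float, float]:
    """Return (low, high) requests/second, assuming each service processes
    requests sequentially and load is spread evenly across services."""
    fastest, slowest = latency_s
    return num_services / slowest, num_services / fastest


if __name__ == "__main__":
    for n in (1, 8):
        low, high = throughput_range(n)
        print(f"{n} service(s): {low:.1f} to {high:.1f} requests/second")
```

The estimate is of the same order as the figures quoted above; real throughput also depends on trajectory length, since longer requests span several batches.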
|
|
| --- |
|
|
| ## Key Parameters |
|
|
| ### Service Configuration |
|
|
| | Parameter | Default | Description | Impact | |
| |-----------|---------|-------------|--------| |
| | `--port` | `8111` | Base port for service | Each service uses consecutive ports (8111, 8112, ...) | |
| | `--gpu-ids` | `"0"` | GPUs to use | One service per GPU | |
| | `--ckpt-path` | `checkpoints/VLAC` | Model checkpoint path | Must point to valid VLAC weights | |
|
|
| ### Training Integration |
|
|
| | Parameter | Default | Description | Impact | |
| |-----------|---------|-------------|--------| |
| | `VLAC_SERVICE_URL` | `http://localhost:8111` | Base URL of VLAC service | Must match service host | |
| | `VLAC_SERVICE_NUM` | `8` | Number of service instances | For load balancing | |
| | `VLAC_DONE_THRESHOLD` | `0.95` | Progress threshold above which a trajectory is treated as done | Higher = stricter termination | |
| | `VLAC_OFFSET_CALL` | `16` | Steps between progress checks | Higher = fewer VLAC calls | |
| | `VLAC_START_STEP_CALL` | `64` | Step at which progress checks begin | Skips the early exploration phase | |
| | `USE_DENSE_REWARD` | `True` | Use accumulative progress as reward | Enable for dense feedback | |
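
To keep these settings in one place, a training script could collect them in a small typed object. The sketch below assumes the parameters are supplied as environment variables, which this guide does not confirm; `VlacSettings` and `load_settings` are hypothetical names, and the defaults are copied from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VlacSettings:
    service_url: str = "http://localhost:8111"
    service_num: int = 8
    done_threshold: float = 0.95
    offset_call: int = 16
    start_step_call: int = 64
    use_dense_reward: bool = True


def load_settings(env: Mapping[str, str] = os.environ) -> VlacSettings:
    """Read settings from a mapping (assumed: environment variables),
    falling back to the defaults listed in the table."""
    d = VlacSettings()
    settings = VlacSettings(
        service_url=env.get("VLAC_SERVICE_URL", d.service_url),
        service_num=int(env.get("VLAC_SERVICE_NUM", d.service_num)),
        done_threshold=float(env.get("VLAC_DONE_THRESHOLD", d.done_threshold)),
        offset_call=int(env.get("VLAC_OFFSET_CALL", d.offset_call)),
        start_step_call=int(env.get("VLAC_START_STEP_CALL", d.start_step_call)),
        use_dense_reward=str(env.get("USE_DENSE_REWARD", d.use_dense_reward)).lower()
        in ("1", "true", "yes"),
    )
    if settings.service_num < 1 or settings.offset_call < 1:
        raise ValueError("VLAC_SERVICE_NUM and VLAC_OFFSET_CALL must be at least 1")
    return settings
```

Validating the values once at startup catches a zero offset or service count before any rollout begins.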
|
|
| ### Parameter Tuning Guidelines |
|
|
| **For Long-Horizon Tasks** (e.g., LIBERO-Long): |
| ```python |
| VLAC_DONE_THRESHOLD = 0.95 # Standard threshold |
| VLAC_OFFSET_CALL = 16 # Check every 16 steps |
| VLAC_START_STEP_CALL = 64 # Start after initial exploration |
| USE_PROGRESSIVE_MAX_STEP = True # Enable progressive horizon |
| ``` |
|
|
| **For Short, Precise Tasks**: |
| ```python |
| VLAC_DONE_THRESHOLD = 0.98 # Stricter threshold |
| VLAC_OFFSET_CALL = 8 # More frequent checks |
| VLAC_START_STEP_CALL = 32 # Start earlier |
| ``` |
|
|
| **For Faster Training (Development)**: |
| ```python |
| VLAC_OFFSET_CALL = 32 # Fewer VLAC calls |
| VLAC_SERVICE_NUM = 4 # Fewer services |
| ``` |
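
The effect of these presets on reward-service load can be estimated by counting how many progress checks one episode triggers. The sketch assumes the first check happens at `VLAC_START_STEP_CALL` and then every `VLAC_OFFSET_CALL` steps; the exact schedule in the training code may differ, and the 512-step episode length is a hypothetical value chosen for illustration.

```python
from typing import List


def query_steps(max_steps: int, start_step_call: int, offset_call: int) -> List[int]:
    """Steps at which the reward service would be queried, assuming the first
    check is at start_step_call and checks repeat every offset_call steps."""
    if offset_call < 1:
        raise ValueError("offset_call must be at least 1")
    return list(range(start_step_call, max_steps + 1, offset_call))


if __name__ == "__main__":
    presets = {
        "long-horizon": (64, 16),
        "short-precise": (32, 8),
        "fast-dev": (64, 32),  # start step left at its default of 64
    }
    for name, (start, offset) in presets.items():
        calls = len(query_steps(512, start, offset))
        print(f"{name}: {calls} calls per 512-step episode")
    # long-horizon: 29, short-precise: 61, fast-dev: 15
```

Under these assumptions the short-task preset roughly doubles the number of service calls per episode compared with the long-horizon preset, while the development preset halves it.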
|
|
| --- |
|
|
| ## Usage Guide |
|
|
| ### Starting VLAC Service |
|
|
| **Single Service** (for debugging): |
| ```bash |
| conda activate vlac |
| export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC |
| python reward_model/vlac_service.py --port 8111 --gpu-ids 0 |
| ``` |
|
|
| **Multiple Services** (for training): |
| ```bash |
| conda activate vlac |
| export VLAC_CKPT_PATH=/path/to/checkpoints/VLAC |
| python reward_model/launch_vlac_servers.py --base-port 8111 |
| ``` |
|
|
| This launches 8 services on ports 8111-8118, one per GPU. |
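
Before starting training it can help to confirm that every port is accepting connections. The sketch below only checks that a TCP connection succeeds; it does not verify that the model has finished loading, and the helper names are hypothetical rather than part of the launch scripts.

```python
import socket
from typing import Dict, List


def service_ports(base_port: int = 8111, num_services: int = 8) -> List[int]:
    """Consecutive ports starting at base_port, one per service."""
    return [base_port + i for i in range(num_services)]


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost",
                   base_port: int = 8111,
                   num_services: int = 8) -> Dict[int, bool]:
    """Map each expected service port to whether it accepts connections."""
    return {p: port_is_open(host, p) for p in service_ports(base_port, num_services)}


if __name__ == "__main__":
    for port, ok in check_services().items():
        print(port, "up" if ok else "DOWN")
```

A port reported as down usually means the corresponding process failed to start or is still loading weights, so rerunning the check after a short wait is reasonable.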
|
|