# Installation Guide
This guide provides step-by-step instructions for setting up EVOLVE-VLA. We maintain **two separate conda environments**:
1. **`evolve-vla`**: For RL training (verl framework, OpenVLA, LIBERO)
2. **`vlac`**: For VLAC reward model service
---
## System Requirements
- **OS**: Linux (Ubuntu 20.04/22.04 recommended)
- **GPU**: NVIDIA GPU with CUDA 12.1 support. Recommended: H100 80GB for distributed training
- **CUDA**: 12.1
- **Python**: 3.10
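
Before creating the environments, it can help to confirm that the basic tools are on the `PATH`. The following is a minimal pre-flight sketch, not part of the official setup; the helper name `check_cmd` is hypothetical, and the script only reports what it finds without changing anything.

```bash
# Pre-flight sketch: report whether each required command is available.
# check_cmd is a hypothetical helper, not part of EVOLVE-VLA.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: ok"
  else
    echo "$1: missing"
  fi
}

check_cmd conda
check_cmd git
check_cmd nvidia-smi

# If the NVIDIA driver is installed, list GPU name, memory, and driver version.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader \
    || echo "nvidia-smi is installed but could not query the GPU"
fi
```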
---
## Environment 1: RL Training (evolve-vla)
**Important**: Follow the exact order below to avoid dependency conflicts.
```bash
# Create conda environment
conda create -n evolve-vla python=3.10 -y
conda activate evolve-vla
# Pin pip and setuptools to known-good versions (critical for LIBERO installation)
pip install setuptools==78.1.1 pip==23.0
# Install verl framework
cd /path/to/EVOLVE-VLA
pip install --no-deps -e verl/
# Install OpenVLA-OFT (will install its own dependencies including torch)
cd /path/to/workspace
git clone https://github.com/moojink/openvla-oft.git
cd openvla-oft
pip install -e .
# Install LIBERO benchmark
cd /path/to/workspace
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
# libero_requirements.txt ships with openvla-oft, not LIBERO
pip install -r ../openvla-oft/experiments/robot/libero/libero_requirements.txt
# Install additional tools
pip install packaging ninja
pip install git+https://github.com/NICTA/pyairports.git
# CRITICAL: Reinstall correct PyTorch version (OpenVLA-OFT/LIBERO may have installed different versions)
pip uninstall -y torch torchvision torchaudio
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu121
# Install customized transformers for OpenVLA
pip install transformers@git+https://github.com/moojink/transformers-openvla-oft.git
# Install Flash Attention
pip uninstall -y flash_attn
pip install flash-attn==2.5.5 --no-build-isolation --no-cache-dir
# Install remaining dependencies
pip install tensordict==0.9.0 click==8.2.1
pip install "ray[default]==2.9.0"
pip install wandb # For experiment tracking
# Install MuJoCo rendering dependencies
conda install -c conda-forge -y libegl-devel libstdcxx-ng
# System packages (requires sudo, mainly for simulation rendering)
sudo apt-get install -y libosmesa6 libosmesa6-dev
sudo apt-get install -y libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
```
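
Because several steps above can silently replace pinned packages, a quick version check after installation may save debugging time later. The sketch below assumes the `evolve-vla` environment is active and that `python -m pip` belongs to it; `check_pin` is a hypothetical helper, and the match is by prefix so that a local build tag such as `2.2.0+cu121` still passes.

```bash
# Compare an installed package version against the pin used in this guide.
# $1 = package name, $2 = expected version (prefix match).
# check_pin is a hypothetical helper, not part of EVOLVE-VLA.
check_pin() {
  found="$(python -m pip show "$1" 2>/dev/null | awk -F': ' '/^Version:/ {print $2}' || true)"
  case "$found" in
    "")    echo "$1: not installed (expected $2)" ;;
    "$2"*) echo "$1: $found ok" ;;
    *)     echo "$1: $found MISMATCH (expected $2)" ;;
  esac
}

check_pin torch 2.2.0
check_pin torchvision 0.17.0
check_pin torchaudio 2.2.0
check_pin flash-attn 2.5.5
check_pin tensordict 0.9.0
check_pin ray 2.9.0
check_pin click 8.2.1
```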
---
## Environment 2: Reward Model Service (vlac)
```bash
# Create conda environment
conda create -n vlac python=3.10 -y
conda activate vlac
# Install VLAC dependencies first (before PyTorch)
pip install ms-swift==3.3 transformers==4.51.0 peft==0.15.2
pip install opencv-python loguru timm
# Install PyTorch
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu121
# Install Flash Attention
pip install packaging ninja
pip install flash-attn==2.5.5 --no-build-isolation --no-cache-dir
# Download VLAC checkpoint
cd /path/to/EVOLVE-VLA
mkdir -p checkpoints/VLAC
# Download checkpoint from HuggingFace: https://huggingface.co/InternRobotics/VLAC
# Set VLAC checkpoint path for service startup
export VLAC_CKPT_PATH=/path/to/EVOLVE-VLA/checkpoints/VLAC
```
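
The block above leaves the actual download to the reader. The sketch below checks whether `VLAC_CKPT_PATH` already points to a non-empty directory and, if not, prints one possible download command. Using `huggingface-cli` (from the `huggingface_hub` package) is an assumption here; the guide does not prescribe a download tool, and `ckpt_ready` is a hypothetical helper.

```bash
# Return success only if the directory exists and contains at least one entry.
# ckpt_ready is a hypothetical helper, not part of EVOLVE-VLA.
ckpt_ready() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Fall back to the placeholder path from this guide if the variable is unset.
VLAC_CKPT_PATH="${VLAC_CKPT_PATH:-/path/to/EVOLVE-VLA/checkpoints/VLAC}"

if ckpt_ready "$VLAC_CKPT_PATH"; then
  echo "VLAC checkpoint found at $VLAC_CKPT_PATH"
else
  echo "No checkpoint at $VLAC_CKPT_PATH. One way to download it (assumes huggingface_hub is installed):"
  echo "  huggingface-cli download InternRobotics/VLAC --local-dir $VLAC_CKPT_PATH"
fi
```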
---
## Ray Cluster Setup (Optional, for Multi-Node Training)
Ray handles distributed training across multiple nodes.
**If you're training on a single node, you can skip this section**: the training script initializes Ray automatically.
For multi-node distributed training (recommended for reproducing paper results):
**On Head Node (Machine 1):**
```bash
# Activate environment
conda activate evolve-vla
# Start Ray head
MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa ray start --head --port=6379
# The command output includes the head node address (IP:port) that workers connect to
```
**On Worker Nodes (Machine 2, 3, ...):**
```bash
# Activate environment
conda activate evolve-vla
# Connect to head node (replace <HEAD_IP> with actual IP from above)
MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa ray start --address='<HEAD_IP>:6379'
# Example:
# MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa ray start --address='10.124.104.163:6379'
```
**Verify Cluster:**
```bash
# On any node
ray status
```
You should see all nodes with their CPU/GPU resources.
**Stopping Ray:**
```bash
ray stop # Stop Ray on the current node
ray stop --force # Force-kill Ray processes (SIGKILL) if a normal stop hangs
```
---
## Next Steps
After successful installation:
1. **Setup VLAC Service**: Follow [README Quick Start](../README.md#-quick-start)
2. **Set training environment variables**:
   - `EVOLVE_SFT_CHECKPOINT`
   - `EVOLVE_OUTPUT_DIR`
   - `EVOLVE_ALIGN_JSON`
3. **Run reproduction checklist**: see [REPRODUCTION.md](REPRODUCTION.md)
4. **Run training**: check [Quick Start](../README.md#-quick-start) in main README
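
For step 2, the variables can be exported in the shell that launches training. The sketch below uses placeholder paths only; what each variable must point to is not specified in this guide, so the values are assumptions to replace with the real locations from the README or REPRODUCTION.md.

```bash
# Placeholder values (assumptions): replace each path with the real location.
export EVOLVE_SFT_CHECKPOINT=/path/to/sft_checkpoint
export EVOLVE_OUTPUT_DIR=/path/to/output
export EVOLVE_ALIGN_JSON=/path/to/align.json

# Report any of the three variables that is unset or empty in this shell.
for var in EVOLVE_SFT_CHECKPOINT EVOLVE_OUTPUT_DIR EVOLVE_ALIGN_JSON; do
  eval "val=\${$var:-}"
  if [ -z "$val" ]; then
    echo "missing: $var"
  fi
done
```

Running only the loop in a fresh shell is a quick way to confirm the variables were not lost, since `export` applies only to the current shell and its children.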