
Installation Guide

This guide provides step-by-step instructions for setting up EVOLVE-VLA. We maintain two separate conda environments:

  1. evolve-vla: For RL training (verl framework, OpenVLA, LIBERO)
  2. vlac: For VLAC reward model service

System Requirements

  • OS: Linux (Ubuntu 20.04/22.04 recommended)
  • GPU: NVIDIA GPU with CUDA 12.1 support. Recommended: H100 80GB for distributed training
  • CUDA: 12.1
  • Python: 3.10
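
Before creating the environments, it can help to confirm that the basic tools are present. The following is a minimal sketch, not part of the official setup; `check_cmd` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
    fi
}

check_cmd conda        # needed to create both environments
check_cmd git          # needed to clone OpenVLA-OFT and LIBERO
check_cmd nvidia-smi   # ships with the NVIDIA driver

# If the driver is present, list GPU model, memory, and driver version.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv \
        || echo "nvidia-smi is installed but could not query a GPU"
fi
```

Any line reported as `missing` should be resolved before continuing with the environment setup.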

Environment 1: RL Training (evolve-vla)

Important: Follow the exact order below to avoid dependency conflicts.

# Create conda environment
conda create -n evolve-vla python=3.10 -y
conda activate evolve-vla

# Pin pip and setuptools to known-good versions (critical for LIBERO installation)
pip install setuptools==78.1.1 pip==23.0

# Install verl framework
cd /path/to/EVOLVE-VLA
pip install --no-deps -e verl/

# Install OpenVLA-OFT (will install its own dependencies including torch)
cd /path/to/workspace
git clone https://github.com/moojink/openvla-oft.git
cd openvla-oft
pip install -e .

# Install LIBERO benchmark
cd /path/to/workspace
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO
pip install -e .
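# The LIBERO requirements file ships with the openvla-oft repo, not with LIBERO itself
pip install -r /path/to/workspace/openvla-oft/experiments/robot/libero/libero_requirements.txt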

# Install additional tools
pip install packaging ninja
pip install git+https://github.com/NICTA/pyairports.git

# CRITICAL: Reinstall correct PyTorch version (OpenVLA-OFT/LIBERO may have installed different versions)
pip uninstall -y torch torchvision torchaudio
pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu121

# Install customized transformers for OpenVLA
pip install transformers@git+https://github.com/moojink/transformers-openvla-oft.git

# Install Flash Attention
pip uninstall -y flash_attn
pip install flash-attn==2.5.5 --no-build-isolation --no-cache-dir

# Install remaining dependencies
pip install tensordict==0.9.0 click==8.2.1
pip install "ray[default]==2.9.0"
pip install wandb  # For experiment tracking

# Install MuJoCo rendering dependencies
conda install -c conda-forge -y libegl-devel libstdcxx-ng

# System packages (requires sudo, mainly for simulation rendering)
sudo apt install -y libosmesa6 libosmesa6-dev
sudo apt-get install -y libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
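
Because several steps above can silently replace PyTorch or related packages, a quick check of the final versions may save debugging time later. The sketch below compares what pip reports against the pins used in this guide; it assumes the `evolve-vla` environment is active, and `pkg_version` is a hypothetical helper, not something provided by the repository.

```shell
# Hypothetical helper: print the installed version of a pip package,
# or "not installed" if pip does not know it.
pkg_version() {
    v=$(pip show "$1" 2>/dev/null | awk '/^Version:/ {print $2}') || true
    echo "${v:-not installed}"
}

# Compare installed versions against the pins used in this guide.
# Prefix matching tolerates local suffixes such as "2.2.0+cu121".
while read -r pkg expected; do
    actual=$(pkg_version "$pkg")
    case "$actual" in
        "$expected"*) echo "ok:       $pkg $actual" ;;
        *)            echo "MISMATCH: $pkg is '$actual', expected $expected" ;;
    esac
done <<'EOF'
torch 2.2.0
torchvision 0.17.0
torchaudio 2.2.0
flash-attn 2.5.5
tensordict 0.9.0
ray 2.9.0
EOF
```

A `MISMATCH` on `torch` usually means the reinstall step was skipped or ran before a later package pulled in a different version; rerunning that step is the first thing to try.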

Environment 2: Reward Model Service (vlac)

# Create conda environment
conda create -n vlac python=3.10 -y
conda activate vlac

# Install VLAC dependencies first (before PyTorch)
pip install ms-swift==3.3 transformers==4.51.0 peft==0.15.2
pip install opencv-python loguru timm

# Install PyTorch
pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
    --index-url https://download.pytorch.org/whl/cu121

# Install Flash Attention
pip install packaging ninja
pip install flash-attn==2.5.5 --no-build-isolation --no-cache-dir

# Download VLAC checkpoint
cd /path/to/EVOLVE-VLA
mkdir -p checkpoints/VLAC
# Download checkpoint from HuggingFace: https://huggingface.co/InternRobotics/VLAC

# Set VLAC checkpoint path for service startup
export VLAC_CKPT_PATH=/path/to/EVOLVE-VLA/checkpoints/VLAC
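
The guide leaves the download method open. As a sanity check before starting the service, the sketch below verifies that the checkpoint directory exists and is non-empty. `check_vlac_ckpt` is a hypothetical helper, and the commented `huggingface-cli` command is one possible way to fetch the files, assuming the `huggingface_hub` CLI is installed in the environment.

```shell
# Hypothetical helper: confirm the checkpoint directory exists and is non-empty.
# Uses the first argument if given, otherwise VLAC_CKPT_PATH.
check_vlac_ckpt() {
    dir="${1:-${VLAC_CKPT_PATH:-}}"
    if [ -z "$dir" ]; then
        echo "VLAC_CKPT_PATH is not set"; return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir"; return 1
    fi
    if [ -z "$(ls -A "$dir")" ]; then
        echo "directory is empty: $dir"; return 1
    fi
    echo "ok: $dir"
}

# One possible way to fetch the files (assumes the huggingface_hub CLI is installed):
#   huggingface-cli download InternRobotics/VLAC --local-dir checkpoints/VLAC

check_vlac_ckpt || echo "Fix the checkpoint path before starting the VLAC service."
```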

Ray Cluster Setup (Optional, for Multi-Node Training)

Ray is used for distributed training across multiple nodes. If you are training on a single node, you can skip this section; the training script initializes Ray automatically.

For multi-node distributed training (recommended for reproducing paper results):

On Head Node (Machine 1):

# Activate environment
conda activate evolve-vla

# Start Ray head
MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa ray start --head --port=6379

# The command output includes the head node address (IP:port); note it for the worker nodes

On Worker Nodes (Machine 2, 3, ...):

# Activate environment
conda activate evolve-vla

# Connect to head node (replace <HEAD_IP> with actual IP from above)
MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa ray start --address='<HEAD_IP>:6379'

# Example:
# ray start --address='10.124.104.163:6379'
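
If a worker fails to join, a reasonable first check is whether the head node's port is reachable from that worker. The sketch below relies on bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command; `port_open` is a hypothetical helper, and the `HEAD_IP` default is only a placeholder.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp pseudo-device and the coreutils `timeout` command.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

HEAD_IP="${HEAD_IP:-127.0.0.1}"   # placeholder; replace with the real head node IP
if port_open "$HEAD_IP" 6379; then
    echo "head node reachable at $HEAD_IP:6379"
else
    echo "cannot reach $HEAD_IP:6379 (check the IP, firewall rules, and that the head is running)"
fi
```

This only tests the port passed to `ray start --head`; Ray opens additional ports between nodes, so a successful check here does not rule out every firewall problem.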

Verify Cluster:

# On any node
ray status

You should see all nodes with their CPU/GPU resources.

Stopping Ray:

ray stop  # Stop Ray on current node
ray stop --force  # Force-kill Ray processes (SIGKILL) if a normal stop hangs

Next Steps

After successful installation:

  1. Set up the VLAC service: follow the Quick Start in the main README
  2. Set training environment variables:
    • EVOLVE_SFT_CHECKPOINT
    • EVOLVE_OUTPUT_DIR
    • EVOLVE_ALIGN_JSON
  3. Run reproduction checklist: see REPRODUCTION.md
  4. Run training: check Quick Start in main README
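
The training environment variables from step 2 can be set and checked along these lines. The paths are placeholders in the same style as the rest of this guide, and `require_env` is a hypothetical helper; the actual expected values are described in the main README and REPRODUCTION.md.

```shell
# Placeholder values; replace with real paths for your setup.
export EVOLVE_SFT_CHECKPOINT=/path/to/sft_checkpoint
export EVOLVE_OUTPUT_DIR=/path/to/output_dir
export EVOLVE_ALIGN_JSON=/path/to/align.json

# Hypothetical helper: fail if any named variable is unset or empty.
require_env() {
    status=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "unset: $name"
            status=1
        else
            echo "$name=$value"
        fi
    done
    return "$status"
}

require_env EVOLVE_SFT_CHECKPOINT EVOLVE_OUTPUT_DIR EVOLVE_ALIGN_JSON
```

Running `require_env` at the top of a launch script makes a missing variable fail early instead of partway through training.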