	# Installation Guide

	This guide provides step-by-step instructions for setting up EVOLVE-VLA. We maintain two separate conda environments:

	1. `evolve-vla`: For RL training (verl framework, OpenVLA, LIBERO)
	2. `vlac`: For VLAC reward model service

	---

	## System Requirements


	- OS: Linux (Ubuntu 20.04/22.04 recommended)
	- GPU: NVIDIA GPU with CUDA 12.1 support. Recommended: H100 80GB for distributed training
	- CUDA: 12.1
	- Python: 3.10
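
Before creating the environments, it can help to confirm that the basic tools are present. The following is a minimal sketch and not part of the EVOLVE-VLA repository: the `preflight` function name is hypothetical, and it only reports what it finds rather than enforcing the versions listed above.

```bash
# Hypothetical helper: report the OS and whether the expected tools are on PATH.
preflight() {
  echo "OS: $(uname -s) $(uname -r)"
  for tool in conda nvidia-smi nvcc; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found at $(command -v "$tool")"
    else
      echo "$tool: NOT FOUND"
    fi
  done
  # List GPUs if the driver is usable; tolerate failure on machines without one.
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader \
      || echo "nvidia-smi is installed but could not query a GPU"
  fi
  return 0
}

preflight
```

A missing `nvcc` is not necessarily an error; it likely matters only if a package such as `flash-attn` has to compile from source on your machine.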

	---

	## Environment 1: RL Training (evolve-vla)

	Important: Follow the exact order below to avoid dependency conflicts.

	```bash
	# Create conda environment
	conda create -n evolve-vla python=3.10 -y
	conda activate evolve-vla

	# Pin pip and setuptools to these versions (critical for LIBERO installation)
	pip install setuptools==78.1.1 pip==23.0

	# Install verl framework
	cd /path/to/EVOLVE-VLA
	pip install --no-deps -e verl/

	# Install OpenVLA-OFT (will install its own dependencies including torch)
	cd /path/to/workspace
	git clone https://github.com/moojink/openvla-oft.git
	cd openvla-oft
	pip install -e .

	# Install LIBERO benchmark
	cd /path/to/workspace
	git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
	cd LIBERO
	pip install -e .
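	# This requirements file ships with openvla-oft, not LIBERO, so reference it from the sibling clone
	pip install -r ../openvla-oft/experiments/robot/libero/libero_requirements.txt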

	# Install additional tools
	pip install packaging ninja
	pip install git+https://github.com/NICTA/pyairports.git

	# CRITICAL: Reinstall correct PyTorch version (OpenVLA-OFT/LIBERO may have installed different versions)
	pip uninstall -y torch torchvision torchaudio
	pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
	--index-url https://download.pytorch.org/whl/cu121

	# Install customized transformers for OpenVLA
	pip install transformers@git+https://github.com/moojink/transformers-openvla-oft.git

	# Install Flash Attention
	pip uninstall -y flash_attn
	pip install flash-attn==2.5.5 --no-build-isolation --no-cache-dir

	# Install remaining dependencies
	pip install tensordict==0.9.0 click==8.2.1
	pip install "ray[default]==2.9.0"
	pip install wandb # For experiment tracking

	# Install MuJoCo rendering dependencies
	conda install -c conda-forge -y libegl-devel libstdcxx-ng

	# System packages (requires sudo, mainly for simulation rendering)
	sudo apt-get install -y libosmesa6 libosmesa6-dev
	sudo apt-get install -y libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev
	```
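
Because several steps above can silently replace pinned packages, a quick check of the final versions may save debugging time. This is a sketch under assumptions: `version_matches` and `check_pkg` are hypothetical helper names, the expected versions are copied from the commands above, and the script should be run with the `evolve-vla` environment active.

```bash
# Hypothetical helper: compare versions, ignoring a local suffix such as "+cu121".
version_matches() {
  [ "${1%%+*}" = "$2" ]
}

# Hypothetical helper: print OK / MISMATCH / MISSING for one installed distribution.
# $1 = distribution name, $2 = expected version
check_pkg() {
  found=$(python3 -c 'import sys; from importlib.metadata import version; print(version(sys.argv[1]))' "$1" 2>/dev/null) || found=""
  if [ -z "$found" ]; then
    echo "MISSING   $1 (expected $2)"
    return 1
  fi
  if version_matches "$found" "$2"; then
    echo "OK        $1 $found"
  else
    echo "MISMATCH  $1 $found (expected $2)"
    return 1
  fi
}

# Versions taken from the install commands in this guide.
failures=0
while read -r name expected; do
  check_pkg "$name" "$expected" || failures=$((failures + 1))
done <<'EOF'
torch 2.2.0
torchvision 0.17.0
torchaudio 2.2.0
flash-attn 2.5.5
tensordict 0.9.0
ray 2.9.0
click 8.2.1
EOF
echo "$failures package(s) need attention"
```

If `torch` shows a mismatch, rerunning the "Reinstall correct PyTorch version" step is the likely fix, since later installs may have pulled in a different build.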

	---

	## Environment 2: Reward Model Service (vlac)

	```bash
	# Create conda environment
	conda create -n vlac python=3.10 -y
	conda activate vlac

	# Install VLAC dependencies first (before PyTorch)
	pip install ms-swift==3.3 transformers==4.51.0 peft==0.15.2
	pip install opencv-python loguru timm

	# Install PyTorch
	pip3 install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 \
	--index-url https://download.pytorch.org/whl/cu121

	# Install Flash Attention
	pip install packaging ninja
	pip install flash-attn==2.5.5 --no-build-isolation --no-cache-dir

	# Download VLAC checkpoint
	cd /path/to/EVOLVE-VLA
	mkdir -p checkpoints/VLAC
	# Download checkpoint from HuggingFace: https://huggingface.co/InternRobotics/VLAC

	# Set VLAC checkpoint path for service startup
	export VLAC_CKPT_PATH=/path/to/EVOLVE-VLA/checkpoints/VLAC
	```
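
The download step above is left as a manual action, so it is worth confirming that the checkpoint directory is populated before starting the service. The sketch below assumes nothing about the checkpoint's file layout: the hypothetical `ckpt_dir_ready` helper only checks that the directory exists and is not empty. The commented `huggingface-cli` command is one possible way to fetch the files and assumes the `huggingface_hub` CLI is available in the `vlac` environment.

```bash
# Hypothetical helper: succeed when $1 is an existing, non-empty directory.
ckpt_dir_ready() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# One possible download command (assumption: huggingface_hub's CLI is installed):
#   huggingface-cli download InternRobotics/VLAC --local-dir "$VLAC_CKPT_PATH"

if ckpt_dir_ready "${VLAC_CKPT_PATH:-}"; then
  echo "VLAC checkpoint directory looks populated: $VLAC_CKPT_PATH"
else
  echo "VLAC_CKPT_PATH is unset, missing, or empty: '${VLAC_CKPT_PATH:-}'"
fi
```

This check does not validate the checkpoint contents; it only catches the common case of an unset variable or a skipped download.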

	---

	## Ray Cluster Setup (Optional, for Multi-Node Training)

	Ray is used for distributed training across multiple nodes.
	If you are training on a single node, you can skip this section; the training script initializes Ray automatically.

	For multi-node distributed training (recommended for reproducing the paper results), start Ray on the head node first, then connect each worker node to it.

	On Head Node (Machine 1):

	```bash
	# Activate environment
	conda activate evolve-vla

	# Start Ray head
	MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa ray start --head --port=6379

	# The command output shows the head node address; use it as <HEAD_IP> on the workers
	```

	On Worker Nodes (Machine 2, 3, ...):

	```bash
	# Activate environment
	conda activate evolve-vla

	# Connect to head node (replace <HEAD_IP> with actual IP from above)
	MUJOCO_GL=osmesa PYOPENGL_PLATFORM=osmesa ray start --address='<HEAD_IP>:6379'

	# Example (prefix with the same MUJOCO_GL/PYOPENGL_PLATFORM variables as above):
	# ray start --address='10.124.104.163:6379'
	```
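
If a worker fails to join, a useful first check is whether the head node's port is reachable from that worker. The snippet below is a sketch: `port_open` is a hypothetical helper that relies on bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command, and `HEAD_IP` is a placeholder you must set yourself.

```bash
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
# $1 = host, $2 = port
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

HEAD_IP="${HEAD_IP:-127.0.0.1}"   # placeholder; set this to the real head node IP
if port_open "$HEAD_IP" 6379; then
  echo "Head node reachable at $HEAD_IP:6379"
else
  echo "Cannot reach $HEAD_IP:6379; check the IP, firewall rules, and that 'ray start --head' is running"
fi
```

A successful connection only shows that something is listening on the port; `ray status` remains the way to confirm that the worker actually joined the cluster.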

	Verify Cluster:

	```bash
	# On any node
	ray status
	```

	You should see all nodes with their CPU/GPU resources.

	Stopping Ray:

	```bash
	ray stop # Stop Ray on current node
	ray stop --force # Force-kill Ray processes (SIGKILL) if a normal stop hangs
	```

	---

	## Next Steps

	After successful installation:

	1. Set up the VLAC service: follow the [README Quick Start](../README.md#-quick-start).
	2. Set the training environment variables:
	   - `EVOLVE_SFT_CHECKPOINT`
	   - `EVOLVE_OUTPUT_DIR`
	   - `EVOLVE_ALIGN_JSON`
	3. Run the reproduction checklist: see [REPRODUCTION.md](REPRODUCTION.md).
	4. Run training: see the [Quick Start](../README.md#-quick-start) section of the main README.
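
The training variables listed in step 2 can be exported and checked in one place before launching a run. This is a minimal sketch: `require_env` is a hypothetical helper, and the paths are placeholders rather than real defaults, so substitute values appropriate to your setup.

```bash
# Hypothetical helper: fail if any named variable is unset, empty, or not exported.
require_env() {
  missing=0
  for name in "$@"; do
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Placeholder values; replace with your own paths.
export EVOLVE_SFT_CHECKPOINT=/path/to/sft_checkpoint
export EVOLVE_OUTPUT_DIR=/path/to/output
export EVOLVE_ALIGN_JSON=/path/to/align.json

require_env EVOLVE_SFT_CHECKPOINT EVOLVE_OUTPUT_DIR EVOLVE_ALIGN_JSON \
  && echo "training variables are set"
```

Because the helper uses `printenv`, it also catches variables that were assigned but never exported, which child processes such as the training script would not see.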