# Pi0.5 Fine-tuning for SO-101
Fine-tune Physical Intelligence's Pi0.5 on the SO-101 ball-in-cup task.
## Overview
| Item | Value |
|---|---|
| Base Model | Pi0.5 (gs://openpi-assets/checkpoints/pi05_base) |
| Dataset | abdul004/so101_ball_in_cup_v5 (72 episodes) |
| GPU Required | A100 80GB (~$1.50/hr on Vast.ai) |
| Training Time | ~2-3 hours for 5K steps |
## Files in This Directory

```
pi0_so101/
├── README.md              # This file
├── so101_policy.py        # Input/output transforms
├── so101_config.py        # Config template (reference)
├── test_config_local.py   # Integration test (run locally before cloud)
└── sync_checkpoints.py    # HF Hub checkpoint sync (run on Vast.ai)
```

HuggingFace Package: `abdul004/pi0_so101_config`
- `so101_openpi_patch.tar.gz` - Auto-installer for OpenPi
## Step-by-Step Setup on Vast.ai

### 1. Rent GPU Instance

On Vast.ai, search for:
- GPU: A100 80GB or H100 SXM (80GB required for Pi0.5)
- Disk: 120GB+ (Critical: set the disk slider to at least 120GB)
- Image: `nvidia/cuda:12.4.1-devel-ubuntu22.04` (or a similar CUDA/PyTorch image)

Add your SSH key to Vast.ai:

```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM/UHcQWuaetxjvxqz4cqofaWeLakkwWOnVN4ffUevfU abdul@Abduls-Mac-mini.local
```
### 2. Vast.ai Onstart Script (Automatic Setup)

Paste this into the "Onstart Script" box when renting:

```bash
#!/bin/bash
set -e  # Exit on any error

echo "========================================"
echo "Pi0.5 SO-101 Setup Starting..."
echo "========================================"

# 1. Install system dependencies
apt-get update && apt-get install -y ffmpeg tmux git curl

# 2. Set up environment variables FIRST (critical for disk space)
export PATH="/root/.local/bin:$PATH"
export UV_CACHE_DIR="/root/.uv_cache"
export TMPDIR="/root/.tmp"
export HF_HOME="/root/.cache/huggingface"
mkdir -p $UV_CACHE_DIR $TMPDIR $HF_HOME

# Persist to bashrc for SSH sessions
cat >> ~/.bashrc << 'BASHRC_EOF'
export PATH="/root/.local/bin:$PATH"
export UV_CACHE_DIR="/root/.uv_cache"
export TMPDIR="/root/.tmp"
export HF_HOME="/root/.cache/huggingface"
BASHRC_EOF

# 3. Install the uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# 4. Clone OpenPi
cd /root
git clone --recurse-submodules https://github.com/Physical-Intelligence/openpi.git
cd openpi

# 5. Install OpenPi dependencies (skip LFS for speed)
GIT_LFS_SKIP_SMUDGE=1 uv sync

# =========================================================
# CRITICAL: Upgrade LeRobot to 0.4.x (fixes dataset format)
# OpenPi pins an old version that can't read v3.0 datasets
# =========================================================
uv pip install "lerobot>=0.4.0"

# 6. Verify LeRobot version (must be 0.4.x)
uv run python -c "import lerobot; v=lerobot.__version__; print(f'LeRobot: {v}'); assert v.startswith('0.4'), f'ERROR: Need 0.4.x, got {v}'"

# 7. Log in to HuggingFace (replace with your token!)
uv run python3 -c "from huggingface_hub import login; login(token='YOUR_HF_TOKEN_HERE')"

# 8. Install the SO-101 config patch
curl -L https://huggingface.co/abdul004/pi0_so101_config/resolve/main/so101_openpi_patch.tar.gz | tar -xz
chmod +x install.sh
./install.sh

# 9. Verify the config is registered
uv run python -c "from openpi.training.config import _CONFIGS_DICT; assert 'pi05_so101' in _CONFIGS_DICT, 'Config not found!'; print('✅ pi05_so101 config registered!')"

echo ""
echo "========================================"
echo "✅ Pi0.5 + SO-101 Ready!"
echo "========================================"
echo ""
echo "SSH in and run:"
echo "  cd ~/openpi"
echo "  uv run scripts/compute_norm_stats.py --config-name pi05_so101"
echo "  XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_so101 --exp-name=ball_in_cup"
```

⚠️ IMPORTANT: Replace `YOUR_HF_TOKEN_HERE` with your actual HuggingFace token before renting!
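If you keep the onstart script in a local file, one way to avoid editing the token in by hand is a quick substitution from an environment variable. A minimal sketch (`onstart_demo.sh` and `HF_TOKEN` are illustrative names, not part of this repo):

```shell
# Demo: inject an HF token held in an env var into a copy of the onstart script.
# (The file path and HF_TOKEN variable name are illustrative.)
export HF_TOKEN="hf_xxx_your_token"
printf 'login(token="YOUR_HF_TOKEN_HERE")\n' > /tmp/onstart_demo.sh
sed -i "s/YOUR_HF_TOKEN_HERE/$HF_TOKEN/" /tmp/onstart_demo.sh
cat /tmp/onstart_demo.sh
```

Note that GNU sed's `-i` is assumed here; on macOS you'd need `sed -i ''`.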
### 3. SSH and Start Training

Once the instance is "Running" and the onstart script has finished:

```bash
ssh root@<VAST_IP> -p <PORT>
cd ~/openpi

# 1. Compute normalization stats (~2 mins)
uv run scripts/compute_norm_stats.py --config-name pi05_so101

# 2. Start checkpoint sync (window 1)
tmux new -s sync
python3 sync_checkpoints.py
# (Ctrl+B, D to detach)

# 3. Start training (window 2)
tmux new -s training
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_so101 --exp-name=ball_in_cup
```
### 4. Verify Installation

```bash
# Check the config is registered
uv run scripts/train.py --help | grep pi05_so101
```

This should show `pi05_so101` in the available configs.
### 5. Compute Normalization Stats

```bash
uv run scripts/compute_norm_stats.py --config-name pi05_so101
```
### 6. Start Checkpoint Sync (for Spot Instances) - DO THIS FIRST!

⚠️ IMPORTANT: Run this in a tmux window BEFORE training starts! This ensures checkpoints are backed up if your spot instance gets preempted.

```bash
tmux new -s sync
python sync_checkpoints.py
# Ctrl+B, D to detach
```

The script will:
- Watch for new checkpoints (saved every 1000 steps)
- Upload each to `abdul004/pi05_so101_checkpoint` on HF Hub
- Clean up old local checkpoints to save disk
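`sync_checkpoints.py` ships with the HuggingFace package, but a minimal sketch of the watch-and-upload loop it describes could look like this. The repo ID comes from above; the checkpoint directory layout, polling interval, and cleanup policy are assumptions:

```python
import shutil
import time
from pathlib import Path

# Where train.py writes checkpoint steps (assumed layout) and the mirror repo
CKPT_DIR = Path("checkpoints/pi05_so101/ball_in_cup")
REPO_ID = "abdul004/pi05_so101_checkpoint"


def find_new_steps(ckpt_dir: Path, uploaded: set) -> list:
    """Return checkpoint step numbers on disk that haven't been uploaded yet."""
    steps = [int(p.name) for p in ckpt_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    return sorted(s for s in steps if s not in uploaded)


def sync_loop(poll_seconds: int = 60, keep_local: int = 1) -> None:
    # Imported here so the rest of the sketch runs without huggingface_hub installed
    from huggingface_hub import HfApi

    api = HfApi()
    uploaded: set = set()
    while True:
        for step in find_new_steps(CKPT_DIR, uploaded):
            api.upload_folder(
                folder_path=str(CKPT_DIR / str(step)),
                path_in_repo=f"checkpoints/{step}",
                repo_id=REPO_ID,
            )
            uploaded.add(step)
        # Drop all but the newest local checkpoints to save disk
        for step in sorted(uploaded)[:-keep_local]:
            shutil.rmtree(CKPT_DIR / str(step), ignore_errors=True)
        time.sleep(poll_seconds)
```

The actual script's behavior may differ in details; treat this as a description of the idea, not a drop-in replacement.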
### 7. Train (in a second tmux window)

```bash
tmux new -s training
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_so101 --exp-name=ball_in_cup
# Ctrl+B, D to detach
```

Training progress is logged to the console and to Weights & Biases.
Checkpoints are saved at steps 1000, 2000, 3000, 4000, and 5000 (final).
### 8. Resume Training (if the spot instance died)

```bash
# Download the checkpoint from HF Hub
huggingface-cli download abdul004/pi05_so101_checkpoint --local-dir ./resume_ckpt

# Find the latest checkpoint
ls ./resume_ckpt/checkpoints/

# Resume with the --resume flag
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_so101 \
  --exp-name=ball_in_cup \
  --resume=true
```
### 9. Download Final Checkpoint

After training completes:

```bash
# Option 1: From HF Hub (if sync was running)
huggingface-cli download abdul004/pi05_so101_checkpoint --local-dir ./pi05_checkpoint

# Option 2: Directly from Vast.ai
scp -r vast_instance:openpi/checkpoints/pi05_so101/ball_in_cup/5000 ./pi05_checkpoint
```

Checkpoint structure (JAX/Orbax format):

```
5000/
├── params/       # Model weights
├── train_state/  # Optimizer state, step counter
└── assets/       # Normalization stats
```
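After downloading, it's worth sanity-checking that all three subdirectories survived the transfer, especially if a spot instance died mid-upload. A small sketch (the expected layout is taken from the tree above; the function name is mine):

```python
from pathlib import Path

# Subdirectories a complete checkpoint should contain, per the layout above
EXPECTED = ("params", "train_state", "assets")


def missing_parts(checkpoint_dir: str) -> list:
    """Return the names of expected checkpoint subdirectories that are absent."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED if not (root / name).is_dir()]
```

If `missing_parts("./pi05_checkpoint/5000")` returns anything, re-download that step before trying to resume or run inference from it.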
## Inference on Robot
(Coming soon - need to adapt LeRobot inference script)
## Key Adaptations from LeKiwi

| Aspect | LeKiwi | SO-101 |
|---|---|---|
| Action dim | 9 | 6 |
| Cameras | 3 (top, wrist, front) | 2 (overhead, wrist) |
| Camera keys | `observation.images.top` | `observation.images.overhead` |
| Delta mask | `make_bool_mask(5, -4)` | `make_bool_mask(5, -1)` |
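For context on the delta mask row: an illustrative reimplementation of `make_bool_mask`'s behavior, assuming OpenPi's convention that positive counts emit `True` entries (delta-encoded dims) and negative counts emit `False` entries (absolute dims):

```python
def make_bool_mask(*dims: int) -> tuple:
    """Build a boolean mask: each positive count emits that many True entries,
    each negative count emits that many False entries."""
    mask = []
    for d in dims:
        mask.extend([d > 0] * abs(d))
    return tuple(mask)
```

For SO-101, `make_bool_mask(5, -1)` would mark the 5 arm joints for delta actions while leaving the last dimension (the gripper) absolute.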
## Troubleshooting

### "ForwardCompatibilityError: 3.0 format" or "KeyError: chunk_index"

Root cause: OpenPi pins an old LeRobot that can't read your dataset.

Fix: Upgrade LeRobot:

```bash
cd ~/openpi
uv pip install "lerobot>=0.4.0"
uv run python -c "import lerobot; print(f'LeRobot: {lerobot.__version__}')"  # Must show 0.4.x
```
### "Config 'pi05_so101' not found"

Root cause: install.sh didn't run or failed silently.

Fix: Re-run the patch:

```bash
cd ~/openpi
curl -L https://huggingface.co/abdul004/pi0_so101_config/resolve/main/so101_openpi_patch.tar.gz | tar -xz
./install.sh
```
### "FileNotFoundError: .../info.json"

Root cause: HF_HOME is pointing to the wrong directory.

Fix: Set the correct path:

```bash
export HF_HOME="/root/.cache/huggingface"
mkdir -p $HF_HOME
```
### Out of Memory

Set the memory fraction higher:

```bash
XLA_PYTHON_CLIENT_MEM_FRACTION=0.95 uv run scripts/train.py ...
```
### Dataset Not Found

Make sure you're logged into HuggingFace:

```bash
uv run huggingface-cli login
```
### Missing Norm Stats

Run compute_norm_stats.py before training:

```bash
uv run scripts/compute_norm_stats.py --config-name pi05_so101
```
### Disk Space Issues (uv sync fails)

Root cause: The default disk is too small, or the caches point to the wrong location.

Fix: Ensure a 120GB+ disk and correct cache paths:

```bash
export UV_CACHE_DIR="/root/.uv_cache"
export TMPDIR="/root/.tmp"
mkdir -p $UV_CACHE_DIR $TMPDIR
```