
Pi0.5 Fine-tuning for SO-101

Fine-tune Physical Intelligence's Pi0.5 on the SO-101 ball-in-cup task.

Overview

| Item | Value |
|---|---|
| Base Model | Pi0.5 (`gs://openpi-assets/checkpoints/pi05_base`) |
| Dataset | `abdul004/so101_ball_in_cup_v5` (72 episodes) |
| GPU Required | A100 80GB (~$1.50/hr on Vast.ai) |
| Training Time | ~2-3 hours for 5K steps |

Files in This Directory

pi0_so101/
β”œβ”€β”€ README.md              # This file
β”œβ”€β”€ so101_policy.py        # Input/output transforms
β”œβ”€β”€ so101_config.py        # Config template (reference)
β”œβ”€β”€ test_config_local.py   # Integration test (run locally before cloud)
└── sync_checkpoints.py    # HF Hub checkpoint sync (run on Vast.ai)

HuggingFace Package: abdul004/pi0_so101_config

  • so101_openpi_patch.tar.gz - Auto-installer for OpenPi

Step-by-Step Setup on Vast.ai

1. Rent GPU Instance

On Vast.ai, search for:

  • GPU: A100 80GB or H100 SXM (80GB required for Pi0.5)
  • Disk: 120GB+ (Critical: set slider to 120GB minimum)
  • Image: nvidia/cuda:12.4.1-devel-ubuntu22.04 (or similar PyTorch image)

Add your SSH Key to Vast.ai:

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIM/UHcQWuaetxjvxqz4cqofaWeLakkwWOnVN4ffUevfU abdul@Abduls-Mac-mini.local

2. Vast.ai Onstart Script (Automatic Setup)

Paste this into the "Onstart Script" box when renting:

#!/bin/bash
set -e  # Exit on any error

echo "========================================"
echo "Pi0.5 SO-101 Setup Starting..."
echo "========================================"

# 1. Install system dependencies
apt-get update && apt-get install -y ffmpeg tmux git curl

# 2. Setup environment variables FIRST (critical for disk space)
export PATH="/root/.local/bin:$PATH"
export UV_CACHE_DIR="/root/.uv_cache"
export TMPDIR="/root/.tmp"
export HF_HOME="/root/.cache/huggingface"
mkdir -p $UV_CACHE_DIR $TMPDIR $HF_HOME

# Persist to bashrc for SSH sessions
cat >> ~/.bashrc << 'BASHRC_EOF'
export PATH="/root/.local/bin:$PATH"
export UV_CACHE_DIR="/root/.uv_cache"
export TMPDIR="/root/.tmp"
export HF_HOME="/root/.cache/huggingface"
BASHRC_EOF

# 3. Install UV package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

# 4. Clone OpenPi
cd /root
git clone --recurse-submodules https://github.com/Physical-Intelligence/openpi.git
cd openpi

# 5. Install OpenPi dependencies (skip LFS for speed)
GIT_LFS_SKIP_SMUDGE=1 uv sync

# =========================================================
# CRITICAL: Upgrade LeRobot to 0.4.x (fixes dataset format)
# OpenPi pins an old version that can't read v3.0 datasets
# =========================================================
uv pip install "lerobot>=0.4.0"

# 6. Verify LeRobot version (must be 0.4.x)
uv run python -c "import lerobot; v=lerobot.__version__; print(f'LeRobot: {v}'); assert v.startswith('0.4'), f'ERROR: Need 0.4.x, got {v}'"

# 7. Login to HuggingFace (replace with your token!)
uv run python3 -c "from huggingface_hub import login; login(token='YOUR_HF_TOKEN_HERE')"

# 8. Install SO-101 config patch
curl -L https://huggingface.co/abdul004/pi0_so101_config/resolve/main/so101_openpi_patch.tar.gz | tar -xz
chmod +x install.sh
./install.sh

# 9. Verify config is registered
uv run python -c "from openpi.training.config import _CONFIGS_DICT; assert 'pi05_so101' in _CONFIGS_DICT, 'Config not found!'; print('βœ… pi05_so101 config registered!')"

echo ""
echo "========================================"
echo "βœ… Pi0.5 + SO-101 Ready!"
echo "========================================"
echo ""
echo "SSH in and run:"
echo "  cd ~/openpi"
echo "  uv run scripts/compute_norm_stats.py --config-name pi05_so101"
echo "  XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_so101 --exp-name=ball_in_cup"

⚠️ IMPORTANT: Replace YOUR_HF_TOKEN_HERE with your actual HuggingFace token before renting!
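Rather than pasting the token into the onstart script (where it lingers in the instance's config), you can read it from the environment. A minimal sketch, assuming you export a variable named `HF_TOKEN` on the instance (the variable name and helper functions here are illustrative, not part of OpenPi):

```python
import os

def get_hf_token() -> str:
    """Read the HuggingFace token from the environment instead of hardcoding it."""
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError("Set HF_TOKEN in the instance environment before the onstart script runs")
    return token

def hf_login() -> None:
    # Imported lazily so the token helper above works even before `uv sync` finishes.
    from huggingface_hub import login
    login(token=get_hf_token())
```

This keeps the secret out of the pasted script; only the `export HF_TOKEN=...` line needs to carry it.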

3. SSH and Start Training

Once the instance shows "Running" and the onstart script has finished:

ssh root@<VAST_IP> -p <PORT>
cd ~/openpi

# 1. Compute normalization stats (~2 mins)
uv run scripts/compute_norm_stats.py --config-name pi05_so101

# 2. Start checkpoint sync (Window 1)
#    (copy sync_checkpoints.py from this repo to the instance first, e.g. via scp)
tmux new -s sync
python3 sync_checkpoints.py
# (Ctrl+B, then D to detach)

# 3. Start Training (Window 2)
tmux new -s training
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_so101 --exp-name=ball_in_cup

4. Verify Installation

# Check config is registered
uv run scripts/train.py --help | grep pi05_so101

The output should list pi05_so101 among the available configs.

5. Compute Normalization Stats

uv run scripts/compute_norm_stats.py --config-name pi05_so101

6. Start Checkpoint Sync (for Spot Instances) - DO THIS FIRST!

⚠️ IMPORTANT: Run this in a tmux window BEFORE training starts! This ensures checkpoints are backed up if your spot instance gets preempted.

tmux new -s sync
python sync_checkpoints.py
# Ctrl+B, D to detach

The script will:

  • Watch for new checkpoints (saved every 1000 steps)
  • Upload each to abdul004/pi05_so101_checkpoint on HF Hub
  • Clean up old local checkpoints to save disk
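The actual contents of `sync_checkpoints.py` aren't reproduced here, but the watch/upload/prune loop it describes can be sketched as follows. The `HfApi.upload_folder` call is the real `huggingface_hub` API; the directory layout, `KEEP_LOCAL` policy, and function names are assumptions:

```python
from pathlib import Path

KEEP_LOCAL = 2  # assumed policy: keep only the newest N checkpoints on disk

def step_dirs(root: Path) -> list[Path]:
    """Checkpoint directories named by step number, oldest first."""
    return sorted((d for d in root.iterdir() if d.is_dir() and d.name.isdigit()),
                  key=lambda d: int(d.name))

def upload(ckpt: Path) -> None:
    """Push one checkpoint folder to the Hub (lazy import keeps the rest testable offline)."""
    from huggingface_hub import HfApi
    HfApi().upload_folder(folder_path=str(ckpt),
                          repo_id="abdul004/pi05_so101_checkpoint",
                          path_in_repo=f"checkpoints/{ckpt.name}")

def sync_once(root: Path, seen: set[str], push=upload) -> list[str]:
    """Upload checkpoints not seen before; return the newly synced step names."""
    synced = []
    for d in step_dirs(root):
        if d.name not in seen:
            push(d)
            seen.add(d.name)
            synced.append(d.name)
    # A real script would now delete step_dirs(root)[:-KEEP_LOCAL] after
    # successful uploads; that is destructive, so it's omitted from this sketch.
    return synced
```

A real runner would call `sync_once` inside a `while True: ...; time.sleep(60)` loop in the tmux window.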

7. Train (in second tmux window)

tmux new -s training
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_so101 --exp-name=ball_in_cup
# Ctrl+B, D to detach

Training progress is logged to the console and to Weights & Biases.

Checkpoints saved at: 1000, 2000, 3000, 4000, 5000 (final)

8. Resume Training (if spot instance died)

# Download checkpoint from HF Hub
huggingface-cli download abdul004/pi05_so101_checkpoint --local-dir ./resume_ckpt

# Find latest checkpoint
ls ./resume_ckpt/checkpoints/

# Resume with --resume flag
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py pi05_so101 \
    --exp-name=ball_in_cup \
    --resume=true

9. Download Final Checkpoint

After training completes:

# Option 1: From HF Hub (if sync was running)
huggingface-cli download abdul004/pi05_so101_checkpoint --local-dir ./pi05_checkpoint

# Option 2: Direct from the Vast.ai instance (same IP and port as for SSH)
scp -P <PORT> -r root@<VAST_IP>:openpi/checkpoints/pi05_so101/ball_in_cup/5000 ./pi05_checkpoint

Checkpoint structure (JAX/Orbax format):

5000/
β”œβ”€β”€ params/         # Model weights
β”œβ”€β”€ train_state/    # Optimizer state, step counter
└── assets/         # Normalization stats

Inference on Robot

(Coming soon: the LeRobot inference script still needs to be adapted.)

Key Adaptations from LeKiwi

| Aspect | LeKiwi | SO-101 |
|---|---|---|
| Action dim | 9 | 6 |
| Cameras | 3 (top, wrist, front) | 2 (overhead, wrist) |
| Camera keys | `observation.images.top` | `observation.images.overhead` |
| Delta mask | `make_bool_mask(5, -4)` | `make_bool_mask(5, -1)` |
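The delta mask controls, per action dimension, whether actions are delta-encoded (relative to the current state) or absolute. A sketch of what `make_bool_mask` appears to do, based on its use here (the real helper lives in `openpi.transforms`; this reimplementation is for illustration only): a positive argument contributes that many `True` entries, a negative one that many `False` entries.

```python
def make_bool_mask(*dims: int) -> tuple[bool, ...]:
    """Expand run lengths into a boolean mask: +n -> n Trues, -n -> n Falses."""
    mask: list[bool] = []
    for n in dims:
        if n == 0:
            raise ValueError("run lengths must be non-zero")
        mask.extend([n > 0] * abs(n))
    return tuple(mask)

# SO-101: delta actions for the 5 arm joints, absolute for the 1 gripper dim.
assert make_bool_mask(5, -1) == (True, True, True, True, True, False)
```

Under this reading, the LeKiwi mask `make_bool_mask(5, -4)` keeps its last four dims (gripper plus base) absolute, while SO-101 only needs the gripper absolute.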

Troubleshooting

"ForwardCompatibilityError: 3.0 format" or "KeyError: chunk_index"

Root cause: OpenPi pins an old LeRobot that can't read your dataset.

Fix: Upgrade LeRobot:

cd ~/openpi
uv pip install "lerobot>=0.4.0"
uv run python -c "import lerobot; print(f'LeRobot: {lerobot.__version__}')"  # Must show 0.4.x

"Config 'pi05_so101' not found"

Root cause: The install.sh didn't run or failed silently.

Fix: Re-run the patch:

cd ~/openpi
curl -L https://huggingface.co/abdul004/pi0_so101_config/resolve/main/so101_openpi_patch.tar.gz | tar -xz
./install.sh

"FileNotFoundError: .../info.json"

Root cause: HF_HOME points to the wrong directory, so the cached dataset (and its info.json) can't be found.

Fix: Set the correct path:

export HF_HOME="/root/.cache/huggingface"
mkdir -p $HF_HOME

Out of Memory

Raise the fraction of GPU memory JAX is allowed to pre-allocate:

XLA_PYTHON_CLIENT_MEM_FRACTION=0.95 uv run scripts/train.py ...

Dataset Not Found

Make sure you're logged into HuggingFace:

uv run huggingface-cli login

Missing Norm Stats

Run compute_norm_stats.py before training:

uv run scripts/compute_norm_stats.py --config-name pi05_so101

Disk Space Issues (uv sync fails)

Root cause: The default disk allocation is too small, or the caches point to the wrong location.

Fix: Ensure 120GB+ disk and correct paths:

export UV_CACHE_DIR="/root/.uv_cache"
export TMPDIR="/root/.tmp"
mkdir -p $UV_CACHE_DIR $TMPDIR
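To confirm the instance actually has the headroom recommended above (the 120GB figure from the disk-slider advice), a stdlib one-off check works; `free_gib` is an illustrative helper, not part of any tooling here:

```python
import shutil

def free_gib(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30

# On the Vast.ai instance, check the volume holding the caches, e.g.:
# free_gib("/root") should comfortably exceed the ~120 GiB needed.
print(f"free: {free_gib('/'):.1f} GiB")
```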