# Pi0.5 Action Expert Adaptation for SO-100/101 Multi-Task Manipulation

## Abstract

We adapt Physical Intelligence's Pi0.5 vision-language-action model to the SO-100/101 low-cost robotic arm by finetuning only the action expert (693M parameters) while keeping the 3B-parameter vision-language backbone frozen. The training data is a curated subset of 10,155 episodes spanning 215 distinct manipulation tasks, drawn from 376 community-contributed datasets in the HuggingFaceVLA/community_dataset_v3 collection. This work demonstrates that a pretrained VLA can be adapted to a new embodiment using diverse community teleoperation data without any task-specific data collection, producing a single model capable of following arbitrary natural language manipulation instructions on the SO-100/101 platform.

## 1. Motivation

The SO-100 and SO-101 are affordable 6-DOF robotic arms ($114 USD) built around STS3215 servo motors, designed by TheRobotStudio and widely adopted through the LeRobot project. Despite their popularity for hobbyist and research use, no generalist vision-language-action model exists for these arms. Existing policies trained on SO-100 data are single-task — trained on one specific manipulation behavior like "pick up the red cube and place it in the box" — and cannot generalize to novel instructions.

Pi0.5, released by Physical Intelligence in April 2025, is a 4B-parameter VLA with demonstrated open-world generalization. It was pretrained on cross-embodiment data from professional research robots (Franka, UR5, ALOHA, and others) but has never seen an SO-100. The question this project addresses: can we teach Pi0.5 to control an SO-100 while preserving its ability to follow diverse language instructions?

## 2. Background: Embodiment Adaptation in VLAs

Adapting a VLA to a new robot is fundamentally different from finetuning it for a single task.
Single-task finetuning teaches the model one specific behavior; embodiment adaptation teaches it how a new robot body moves, while retaining the generalist capabilities learned during pretraining.

We surveyed the approaches used across the field for this problem. Octo (2024) recommends full model finetuning with ~100 demonstrations and a replacement diffusion head for new action spaces. OpenVLA-OFT uses LoRA adapters and showed that finetuning technique matters more than pretraining data composition. X-VLA introduces learnable "soft prompts" (32 embeddings, 0.04% of parameters) per embodiment. NVIDIA's GR00T N1.5 freezes the VLM entirely and trains only the action head via LoRA, demonstrating successful SO-101 adaptation with as few as 30 episodes. The Pi0.5 paper itself introduced "Knowledge Insulation" — a training recipe where the VLM backbone is trained on discretized action tokens while a separate action expert learns continuous motor commands, with gradient flow blocked from expert to backbone.

A critical finding across all approaches is that task diversity in the adaptation data prevents memorization. Physical Intelligence's own GitHub issue #768 documents that finetuning Pi0.5 on 250 episodes of a single task caused complete memorization — the model stopped following language instructions and simply replayed trajectories. Diverse multi-task data forces the model to learn the actual sensorimotor mapping rather than memorizing specific trajectories.

Our approach follows the GR00T N1.5 pattern most closely: freeze the VLM backbone, finetune the action expert and its associated projections. This is implemented via Pi0.5's built-in `train_expert_only` flag, which freezes the PaliGemma VLM and trains the Gemma-300M action expert, the action input/output projections, and the time embedding MLPs — totaling 693M trainable parameters out of 3.6B.

## 3. Dataset Construction

### 3.1 Source Data

The training data comes from HuggingFaceVLA/community_dataset_v3, a curated collection of 791 robotics datasets contributed by 235 individuals worldwide. The collection spans 46 robot embodiments, though the SO-100 family (so100, so101, so100_follower, so101_follower) accounts for 575 of the 791 datasets and 29,677 of the approximately 50,000 total episodes.

Each dataset within the collection is a self-contained LeRobot v2.1 dataset consisting of per-episode parquet files (containing timestamped joint states and actions at 30Hz), per-episode MP4 video files for each camera, and JSON metadata describing the task, robot configuration, and episode statistics. The datasets were originally collected via teleoperation using a leader-follower arm setup.

### 3.2 Filtering Pipeline

We apply a multi-stage filtering pipeline implemented in `build_index.py`. The pipeline processes the 791 datasets and produces a verified training index.

**Schema matching.** We require exactly the feature set `{action, observation.state, observation.images.image, observation.images.image2, timestamp, frame_index, episode_index, index, task_index}`. This selects datasets with exactly two cameras and the standard 6-DOF state/action representation. Datasets with 1, 3, or 4 cameras are excluded because the Pi0.5 preprocessor maps camera features by name and inconsistent camera counts would require per-dataset handling. Of the 575 SO-100/101 datasets, 427 match this exact schema.

**Resolution and framerate.** We require 480x640 resolution at 30fps. This is the dominant configuration (92% of SO-100/101 datasets use it). Non-standard resolutions (1080x1920, 720x1280, 360x640) are excluded rather than resized, as resizing introduces interpolation artifacts in the training signal and the standard-resolution subset is already large enough.
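Together, the schema and resolution checks amount to a predicate over each dataset's metadata. The sketch below assumes the LeRobot v2.1 `meta/info.json` layout (a `features` dict keyed by feature name, an `fps` field, and a per-camera `shape`); the helper names are illustrative rather than the exact `build_index.py` internals:

```python
# Sketch of the schema/resolution/fps filter (field names assume the
# LeRobot v2.1 meta/info.json layout; helper names are illustrative).
import json
from pathlib import Path

REQUIRED_FEATURES = {
    "action", "observation.state",
    "observation.images.image", "observation.images.image2",
    "timestamp", "frame_index", "episode_index", "index", "task_index",
}

def passes_filters(info: dict) -> bool:
    """Return True if a dataset's info.json matches the required schema."""
    if set(info["features"]) != REQUIRED_FEATURES:
        return False                      # wrong camera count or feature set
    if info["fps"] != 30:
        return False                      # non-standard framerate
    for key in ("observation.images.image", "observation.images.image2"):
        shape = info["features"][key]["shape"]  # (height, width, channels)
        if tuple(shape[:2]) != (480, 640):
            return False                  # non-standard resolution
    return True

def scan(root: str) -> list[Path]:
    """Collect dataset directories under `root` whose info.json passes all filters."""
    kept = []
    for info_path in Path(root).rglob("meta/info.json"):
        info = json.loads(info_path.read_text())
        if passes_filters(info):
            kept.append(info_path.parent.parent)
    return kept
```

Checking metadata first keeps this stage cheap; the expensive per-file verification below only runs on datasets that survive it.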
**Integrity verification.** For every episode in every passing dataset, we read the actual parquet file and record the true row count. We also verify that both video files (`observation.images.image/episode_NNNNNN.mp4` and `observation.images.image2/episode_NNNNNN.mp4`) exist on disk. This step is critical — we discovered that some datasets in the community collection have metadata reporting more frames than the parquet files actually contain. Using metadata-reported frame counts without verification caused out-of-bounds index errors during training. After verification, 381 datasets with 14,620 episodes pass all filters.

**Episode length filtering.** Episodes shorter than 150 frames (5 seconds at 30fps) are likely recording errors or incomplete demonstrations. Episodes longer than 1800 frames (60 seconds) are statistical outliers — only 0.2% of episodes exceed this threshold — and may confuse the action chunking mechanism, which predicts 50-step action sequences. Both extremes are excluded.

### 3.3 Task Distribution and Balancing

The raw task distribution is heavily skewed. The most common task ("Move the chess piece from red to blue") accounts for 1,916 episodes from a single contributor. One contributor (VoicAndrei) contributed 1,776 episodes across 9 kitchen tasks. Another contributor (shuohsuan) contributed 540 episodes of "grasp the red cube" spread across 54 small datasets. The four colored-duck-grasping tasks from Loki0929 account for 3,200 episodes total. Without balancing, these dominant tasks would constitute roughly 40% of training data, biasing the action expert toward a small number of motor patterns and undermining the task diversity that prevents memorization.

We apply two caps. First, a per-task cap of 200 episodes: for any unique task string with more than 200 episodes, we randomly sample 200 (with seed 42 for reproducibility) and discard the rest. This affected 8 tasks.
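The seeded per-task sampling can be sketched as a group-and-sample pass. The episode record fields and the helper name below are illustrative, not the exact `build_index.py` internals:

```python
# Sketch of the seeded per-task cap (record fields and helper name are
# illustrative, not the exact build_index.py internals).
import random
from collections import defaultdict

def cap_per_task(episodes: list[dict], cap: int = 200, seed: int = 42) -> list[dict]:
    """Keep at most `cap` episodes per unique task string, sampled reproducibly."""
    by_task = defaultdict(list)
    for ep in episodes:
        by_task[ep["task"]].append(ep)
    rng = random.Random(seed)           # fixed seed so reruns produce the same index
    kept = []
    for group in by_task.values():
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept
```

The same group-and-sample operation, keyed on contributor instead of task string, implements the second cap.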
Second, a per-contributor cap of 200 episodes: for any contributor with more than 200 episodes remaining after the task cap, we randomly sample 200. This affected 16 contributors and had a larger impact, reducing the dataset from 13,043 to 10,155 episodes.

The cap values of 200 were chosen based on the SmolVLA scaling study (arXiv:2512.11921), which showed that SO-100 finetuning success rates plateau around 200 episodes for a given task (20 episodes → 18%, 50 → 45%, 100 → 68%, 200 → 76%). Beyond 200, additional episodes of the same task provide diminishing returns while displacing episodes of rarer tasks.

### 3.4 Final Dataset Statistics

The filtered and balanced dataset contains 376 datasets from 186 contributors, comprising 10,155 episodes with 5,431,807 total frames (approximately 50.3 hours at 30fps). There are 215 unique task descriptions spanning 23 semantic categories. Each episode consists of a parquet file with per-frame 6-dimensional joint states and actions, and two MP4 video files (one per camera) at 480x640 resolution.

The 23 task categories, ordered by episode count after balancing, are: cube pick-and-place (71 unique task strings), block stacking and manipulation (61), box manipulation (50), plate/dish/tray transfer (48), cup/bottle/container tasks (26), pen/marker/writing tool tasks (19), toy/duck/figurine tasks (17), kitchen/cabinet/lid tasks (14), food manipulation (14), drawing/writing tasks (12), ball/round object tasks (9), clothing folding (9), pushing/sliding/granule herding (8), sorting/organizing (8), peg/hole insertion (8), open/close mechanism tasks (7), bag/storage tasks (7), fruit/produce tasks (7), Jenga/kapla/building tasks (5), cleaning/wiping tasks (5), dice/games/chess (4), Tower of Hanoi (3), and tape/adhesive tasks (3).

### 3.5 Normalization Statistics

We compute per-dimension mean and standard deviation for both `observation.state` and `action` features across all 5.4M frames in the filtered dataset (`compute_stats.py`).
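`compute_stats.py` uses Welford's online algorithm (noted in the Section 7 file listing), which accumulates mean and variance in a single pass without holding 5.4M frames in memory. A minimal standalone version of the per-dimension accumulator:

```python
# Minimal per-dimension Welford accumulator; a standalone sketch of the
# online mean/variance scheme, not the compute_stats.py script itself.
import numpy as np

class Welford:
    def __init__(self, dim: int):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # running sum of squared deviations from the mean

    def update(self, x: np.ndarray) -> None:
        """Fold one frame's vector (e.g. a 6-D state or action) into the statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> np.ndarray:
        return np.sqrt(self.m2 / self.n)  # population std, used for normalization
```

Unlike the naive sum/sum-of-squares approach, this stays numerically stable over millions of frames.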
These statistics are used by the Pi0.5 preprocessor to normalize inputs to a consistent scale during training. We use MEAN_STD normalization rather than QUANTILES because the latter requires precomputed quantile statistics (q01, q99) that are not present in the v2.1 dataset format. The SkieyFly and Sakits research groups have demonstrated successful Pi0.5 finetuning using MEAN_STD normalization, confirming its adequacy.

The computed statistics (stored in `norm_stats.json`) show that joint positions span different ranges — shoulder_pan has std ~27 while gripper has std ~17, reflecting the different physical ranges of each joint. Action statistics are similar to state statistics, consistent with the fact that actions in teleoperation data represent target joint positions rather than velocity commands.

## 4. Dataset Implementation

A key design decision is that we do not merge, convert, or copy the community dataset. The 261GB of filtered data remains in its original v2.1 directory structure exactly as cloned from HuggingFace. Instead, we implement a custom PyTorch Dataset class (`so100_dataset.py`) that reads directly from the v2.1 parquet and video files on disk.

The `SO100Dataset` class is initialized with three paths: the root directory of the community dataset, the filtered index JSON, and the normalization statistics JSON. On construction, it builds a flat frame-level index — for each episode in the filtered index, it creates entries for frames 0 through N-50 (where N is the verified parquet row count and 50 is the Pi0.5 action chunk size). This yields 4,931,373 total training samples.
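A sketch of that frame-level index construction (the episode record field is illustrative; the real class reads verified counts from `filtered_index.json`):

```python
# Sketch of the flat frame-level index built at dataset construction time
# (record fields are illustrative, not the exact SO100Dataset internals).
CHUNK = 50  # Pi0.5 action chunk size

def build_frame_index(episodes: list[dict]) -> list[tuple[int, int]]:
    """Map a flat sample index to (episode number, start frame).

    An episode with N verified frames contributes start frames 0..N-CHUNK,
    so every sample has a full 50-step action chunk available.
    """
    index = []
    for ep_id, ep in enumerate(episodes):
        n = ep["num_frames"]            # verified parquet row count, not metadata
        for start in range(n - CHUNK + 1):
            index.append((ep_id, start))
    return index
```

`__len__` is then just the length of this list, and `__getitem__` looks up its `(episode, start)` pair in O(1).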
When `__getitem__` is called for a given sample index, the class loads the corresponding parquet file (with LRU caching of up to 200 DataFrames to amortize I/O), extracts the state vector and 50-step action chunk starting from the target frame, decodes the video frame at the matching timestamp from both camera MP4 files, and returns a dictionary matching the format that LeRobot's training pipeline expects. Video decoding uses PyAV (with torchcodec as an alternative backend).

The class also provides a `.meta` adapter object that exposes the `stats`, `camera_keys`, `features`, `fps`, and other properties that LeRobot's `lerobot_train.py` training script accesses on the dataset. This allows our custom dataset to be used as a drop-in replacement for LeRobotDataset without modifying the training loop.

Integration with LeRobot's training script is achieved through a small patch to `datasets/factory.py`: when the `--dataset.repo_id` argument begins with the prefix `so100:`, the factory returns an `SO100Dataset` instance instead of attempting to load a HuggingFace dataset. The remainder of the repo_id encodes the three required paths separated by colons.

## 5. Training Configuration

### 5.1 Model and Optimizer

We load the `lerobot/pi05_base` checkpoint — Pi0.5's pretrained weights ported to PyTorch by HuggingFace. With `train_expert_only=true`, the PaliGemma VLM is frozen and set to eval mode. The trainable components are the Gemma-300M action expert, the action input projection (linear, action_dim → expert_width), the action output projection (linear, expert_width → action_dim), and the time embedding MLPs. This totals 693,422,112 trainable parameters out of 3,616,757,520 total.

The optimizer is AdamW with learning rate 2.5e-5, betas (0.9, 0.95), weight decay 0.01, epsilon 1e-8, and gradient clipping at norm 1.0. These are the Pi0.5 defaults established by Physical Intelligence.
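In PyTorch form, that optimizer configuration looks like the following sketch (a configuration illustration with the values hard-coded; LeRobot builds this from the policy config):

```python
# Sketch of the optimizer setup described above (illustrative; LeRobot
# constructs this from the policy config rather than hard-coded values).
import torch

def make_optimizer(trainable_params) -> torch.optim.AdamW:
    return torch.optim.AdamW(
        trainable_params,
        lr=2.5e-5,          # peak learning rate
        betas=(0.9, 0.95),
        weight_decay=0.01,
        eps=1e-8,
    )

# Each step, gradients are clipped to norm 1.0 before optimizer.step():
#   torch.nn.utils.clip_grad_norm_(trainable_params, max_norm=1.0)
```

Only the 693M trainable parameters are passed in; the frozen PaliGemma weights never enter the optimizer state, which also keeps AdamW's moment buffers small.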
The learning rate schedule is cosine decay with linear warmup: 1000 warmup steps ramping from near-zero to 2.5e-5, followed by cosine decay over 15000 total steps down to 2.5e-6 (10x below peak).

### 5.2 Camera Name Mapping

The community datasets use generic camera names (`observation.images.image` and `observation.images.image2`) while the Pi0.5 base checkpoint was pretrained with specific camera names (`observation.images.base_0_rgb`, `observation.images.left_wrist_0_rgb`, `observation.images.right_wrist_0_rgb`). We use LeRobot's `--rename_map` argument to map `image` → `base_0_rgb` and `image2` → `left_wrist_0_rgb`. The third expected camera (`right_wrist_0_rgb`) is left empty — Pi0.5 handles absent cameras through its empty_cameras mechanism.

### 5.3 Training Procedure

Each training sample consists of one frame's camera images (480x640, resized to 224x224 by the preprocessor), the 6-dimensional joint state vector (padded to 32 dimensions), and a 50-step action chunk (each step 6-dimensional, padded to 32). The Pi0.5 preprocessor tokenizes the task string into the format "Task: {task}, State: {discretized_state};\nAction: " using the PaliGemma tokenizer, normalizes state and action using our precomputed mean/std statistics, and applies ImageNet normalization to images.

The target training run is 15,000 steps with effective batch size 256 (32 per GPU on 8x H100 80GB). This represents approximately 0.8 epochs through the 4.9M training samples. Checkpoints are saved every 500 steps (30 checkpoints total) and uploaded to HuggingFace Hub, providing both fine-grained checkpoint selection and protection against instance preemption.

### 5.4 Local Scale-Up Validation

Before committing to cloud GPU resources, we validate the training pipeline locally on an RTX 3090 (24GB VRAM) with batch size 4.
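To make the warmup numbers reported below easy to check, the schedule from 5.1 can be written as a standalone function. This is a sketch under two assumptions not pinned down by the configuration: warmup ramps linearly from exactly zero, and the cosine phase spans the remaining 14,000 steps:

```python
# Standalone sketch of the warmup + cosine schedule (assumes warmup from
# exactly zero and a cosine phase over the remaining steps; LeRobot's
# scheduler may differ in these details).
import math

PEAK_LR = 2.5e-5
MIN_LR = 2.5e-6      # 10x below peak
WARMUP_STEPS = 1000
DECAY_STEPS = 15000

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS            # linear warmup from 0
    progress = (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)                       # clamp after decay ends
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress)) # 1 → 0 over the decay
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine
```

Under these assumptions, `lr_at(100)` is 10% of peak and `lr_at(80)` is about 2e-6, in line with the validation figures below.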
We added an `early_stop_steps` parameter to LeRobot's training script that exits the training loop after a specified number of steps while preserving the full LR schedule shape — this avoids the auto-scaling behavior that compresses the cosine schedule when `steps < decay_steps`, ensuring the local test accurately reflects the first N steps of the full cloud run.

Initial validation (100 steps with the real 15k-step LR schedule) showed loss starting at 0.408 during early warmup (LR ~1.6e-7) and declining to ~0.290 by step 80 (LR ~1.9e-6), with the expected noise from batch size 4. The LR at step 100 was still only 10% of peak, confirming that the warmup phase behaves correctly and that meaningful learning will occur after step 1000 when the LR reaches its peak.

## 6. Reproducing This Work

### 6.1 Prerequisites

A machine with a CUDA GPU (RTX 3090 or better for local testing), the LeRobot framework (v0.4.4), and approximately 300GB of disk space for the dataset.

### 6.2 Step 1: Clone the Community Dataset

```
git clone https://huggingface.co/datasets/HuggingFaceVLA/community_dataset_v3 /path/to/community_dataset_v3
```

This is approximately 690GB. Only 261GB is used by our filtered subset, but the HuggingFace git clone mechanism does not support partial downloads. On a cloud instance, the selective download script (`download_filtered.py`) downloads only the required 261GB using HuggingFace's `snapshot_download` with `allow_patterns`.

### 6.3 Step 2: Clone This Repository

```
huggingface-cli download StrongRoboticsLab/pi05-so100-diverse --local-dir pi05-so100-diverse
```

### 6.4 Step 3: Build the Dataset Index

```
cd pi05-so100-diverse
python build_index.py --data-root /path/to/community_dataset_v3
```

This scans all datasets, applies filters, reads every parquet file to verify frame counts, checks video file existence, applies per-task and per-contributor caps, and writes `filtered_index.json`. Takes approximately 2 minutes.
### 6.5 Step 4: Compute Normalization Statistics

```
python compute_stats.py --data-root /path/to/community_dataset_v3
```

Reads all parquets in the filtered index and computes per-dimension mean and standard deviation for state and action features. Takes approximately 10 seconds.

### 6.6 Step 5: Apply LeRobot Patches

Two modifications to the LeRobot source are required:

In `src/lerobot/datasets/factory.py`, add the SO100Dataset handler before the existing `isinstance(cfg.dataset.repo_id, str)` check in `make_dataset()`. The handler checks for the `so100:` prefix and returns an `SO100Dataset` instance.

In `src/lerobot/configs/train.py`, add the `early_stop_steps: int | None = None` field to `TrainPipelineConfig`. In `src/lerobot/scripts/lerobot_train.py`, add the early stop check after `step += 1` in the training loop.

The exact diffs are available in the lerobot fork at the commit referenced by this project.

### 6.7 Step 6: Run Training

For local validation:

```
cd pi05-so100-diverse
PYTHONPATH=$(pwd):$PYTHONPATH python -m lerobot.scripts.lerobot_train \
  --dataset.repo_id="so100:/path/to/community_dataset_v3:filtered_index.json:norm_stats.json" \
  --policy.path=lerobot/pi05_base \
  --policy.train_expert_only=true \
  --policy.dtype=bfloat16 \
  --policy.gradient_checkpointing=true \
  --policy.push_to_hub=false \
  --policy.normalization_mapping='{"VISUAL": "IDENTITY", "STATE": "MEAN_STD", "ACTION": "MEAN_STD"}' \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=15000 \
  --rename_map='{"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}' \
  --batch_size=4 \
  --steps=15000 \
  --early_stop_steps=1000 \
  --save_freq=500 \
  --log_freq=50 \
  --num_workers=2 \
  --output_dir=outputs/local_test \
  --job_name=local_test \
  --save_checkpoint=true
```

For cloud training on 8x H100:

```
accelerate launch --multi_gpu --num_processes 8 \
  -m lerobot.scripts.lerobot_train \
  --dataset.repo_id="so100:/data/community_dataset_v3:filtered_index.json:norm_stats.json" \
  --policy.path=lerobot/pi05_base \
  --policy.train_expert_only=true \
  --policy.dtype=bfloat16 \
  --policy.gradient_checkpointing=true \
  --policy.compile_model=true \
  --policy.push_to_hub=true \
  --policy.repo_id=StrongRoboticsLab/pi05-so100-diverse \
  --policy.normalization_mapping='{"VISUAL": "IDENTITY", "STATE": "MEAN_STD", "ACTION": "MEAN_STD"}' \
  --policy.scheduler_warmup_steps=1000 \
  --policy.scheduler_decay_steps=15000 \
  --rename_map='{"observation.images.image": "observation.images.base_0_rgb", "observation.images.image2": "observation.images.left_wrist_0_rgb"}' \
  --batch_size=32 \
  --steps=15000 \
  --save_freq=500 \
  --log_freq=50 \
  --num_workers=4 \
  --wandb.enable=true \
  --wandb.project=pi05-so100-diverse \
  --output_dir=outputs/cloud_run \
  --job_name=pi05_so100_diverse
```

## 7. Repository Contents

| File | Description |
|------|-------------|
| `build_index.py` | Dataset filtering pipeline. Scans community_dataset_v3, applies schema/resolution/fps/length filters, verifies every parquet and video file, applies per-task and per-contributor caps. Outputs `filtered_index.json`. |
| `compute_stats.py` | Computes mean/std normalization statistics across all filtered episodes using Welford's online algorithm. Outputs `norm_stats.json`. |
| `so100_dataset.py` | PyTorch Dataset implementation that reads v2.1 community dataset files directly from disk. Includes parquet caching, video frame decoding (PyAV/torchcodec), and a `.meta` adapter for LeRobot compatibility. |
| `filtered_index.json` | The verified training index. Contains 10,155 episode entries, each with dataset path, episode index, task string, task index, and verified frame count. Also contains the complete task list (215 entries) and list of 376 datasets used. |
| `norm_stats.json` | Per-dimension mean and standard deviation for `observation.state` (6D) and `action` (6D), computed from 5.4M frames. |
| `README.md` | This document. |

## 8. Acknowledgments

This work builds on the community_dataset_v3 collection curated by HuggingFaceVLA, containing data from 235 contributors. The filtering and curation methodology draws on the SmolVLA pretraining recipe (Lacaze et al., 2025). The Pi0.5 model and Knowledge Insulation training approach are from Physical Intelligence (Black et al., 2025). The LeRobot framework is maintained by HuggingFace.

## License

Apache 2.0, consistent with the source data (community_dataset_v3), the base model (Pi0.5 via OpenPI), and the LeRobot framework.