folding_pi05 — π0.5 fine-tune for towel folding

A π0.5 vision-language-action policy fine-tuned for autonomous towel folding on a 6-DoF SO-101 follower arm with a single wrist camera. It is a strong alternative to the diffusion-transformer policy larsvandorp/folding_dit.

The repo root holds the step-9000 checkpoint (our best), so from_pretrained("larsvandorp/folding_pi05") loads it directly.

Notes

Full fine-tune from lerobot/pi05_base (PaliGemma backbone + action expert, ~3B params), vision encoder unfrozen.
NVIDIA GPU only: the model is too heavy to run at 30 Hz on Mac MPS.
π0.5 instantiates PaliGemma at load time, so the Hugging Face account running it must have accepted the access terms at https://huggingface.co/google/paligemma-3b-pt-224.
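
Because the PaliGemma repo is gated, a rollout started without Hugging Face credentials will likely fail when the policy loads. The sketch below is a preflight check, assuming the usual huggingface_hub conventions: a token in the HF_TOKEN environment variable, or a token file under HF_HOME (default ~/.cache/huggingface/token). The function name check_hf_token is hypothetical and not part of this repo or of lerobot. It only confirms that a token is present locally; it cannot tell whether the access terms were accepted.

```shell
# Hypothetical helper: report whether Hugging Face credentials exist locally.
# It does NOT verify that the PaliGemma access terms were accepted.
check_hf_token() {
  token_file="${HF_HOME:-$HOME/.cache/huggingface}/token"
  if [ -n "${HF_TOKEN:-}" ]; then
    echo "token found in HF_TOKEN"
  elif [ -s "$token_file" ]; then
    echo "token found in $token_file"
  else
    echo "no token found; log in before running the rollout" >&2
    return 1
  fi
}
```

Running check_hf_token before the rollout command gives an early, readable failure instead of an error in the middle of model loading.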

Run it

uv venv --python 3.12 .venv
GIT_LFS_SKIP_SMUDGE=1 uv pip install --python .venv/bin/python \
  "lerobot[pi] @ git+https://github.com/LarsvanDorp/lerobot.git@dinov3"

.venv/bin/lerobot-rollout \
  --strategy.type=base \
  --robot.type=so101_follower --robot.port=/dev/ttyACM0 --robot.id=my_follower \
  --robot.cameras="{wrist: {type: opencv, index_or_path: <cam-index>, width: 800, height: 600, fps: 30, fourcc: MJPG}}" \
  --policy.path=larsvandorp/folding_pi05 \
  --policy.device=cuda --inference.type=sync \
  --task="fold the towel" --duration=60

Note the fourcc: MJPG in the camera config (needed on the lab Linux PC). We run without --interpolation_multiplier.
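
To see whether a camera advertises MJPG before setting fourcc, one option is to inspect the format list printed by v4l2-ctl (from the v4l-utils package). This is a small sketch that assumes a Linux host; /dev/video0 is a placeholder device path, and supports_mjpg is a hypothetical helper, not part of lerobot.

```shell
# Hypothetical helper: succeed if the format listing read from stdin mentions MJPG.
supports_mjpg() {
  grep -q "MJPG"
}

# Usage on a Linux host with v4l-utils installed (device path is a placeholder):
#   v4l2-ctl --device=/dev/video0 --list-formats-ext | supports_mjpg && echo "MJPG available"
```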

Training data

larsvandorp/magic_soup — the filtered SO-101 towel-folding set (bad episodes removed: high mean |Δa|, or no fold in the last frame).
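
The mean |Δa| criterion is not defined further here. One plausible reading, stated as an assumption rather than the exact formula used for filtering, is the average absolute change between consecutive action vectors of an episode, where a_t is the action at frame t, T is the number of frames, and D is the action dimension (these symbols are introduced only for illustration):

```latex
\overline{|\Delta a|} \;=\; \frac{1}{(T-1)\,D} \sum_{t=1}^{T-1} \sum_{d=1}^{D} \left| a_{t+1,d} - a_{t,d} \right|
```

Under this reading, a high value marks jerky or erratic teleoperation, which would make the episode a candidate for removal.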


Safetensors: 4B params; tensor types F32 and BF16
