towel-folding-pi05

A pi0.5 vision-language-action policy fine-tuned to fold a towel with the SO-101 follower arm and a single wrist-mounted camera. Trained on 97 teleop demonstrations (~26 k frames after trimming) recorded with the LeRobot framework.

Hardware this expects

  • Robot: SO-101 follower (5-DOF + gripper).
  • Camera: one wrist-mounted RGB camera, dataset key observation.images.wrist, captured at 1280 × 720, 30 fps, MJPG. Other resolutions/keys can work but require preprocessing tweaks; a quick capture check is sketched after this list.
  • Task language prompt: "Fold towel" (the only string the policy was trained on).
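
To verify the camera delivers the expected format before wiring it into LeRobot, a quick check with plain OpenCV (a sketch, assuming the camera sits at index 0):

import cv2

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*"MJPG"))
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
cap.set(cv2.CAP_PROP_FPS, 30)
ok, frame = cap.read()
# Expect True, (720, 1280, 3), ~30.0 if the camera matches the training setup.
print(ok, frame.shape if ok else None, cap.get(cv2.CAP_PROP_FPS))
cap.release()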

Quick install

The policy uses a patched fork of transformers that LeRobot bundles. Install LeRobot from main with the [pi] extra; do not use pip install lerobot, since the PyPI release is months behind and is v2-only.

python -m venv .venv && source .venv/bin/activate && pip install --upgrade pip && pip install "lerobot[pi]@git+https://github.com/huggingface/lerobot.git"
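
A quick import check that the [pi] extra landed (this is the same class path used below):

# Should import cleanly if the install worked; a failure here usually means
# the [pi] extra (and its patched transformers fork) is missing.
from lerobot.policies.pi05.modeling_pi05 import PI05Policy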

You also need access to the PaliGemma backbone:

  1. Accept the gated license at https://huggingface.co/google/paligemma-3b-pt-224 (one click, on the same HF account whose token you'll use).
  2. Generate a read token at https://huggingface.co/settings/tokens.
  3. hf auth login --token "hf_yourtoken" (or huggingface-cli login on older versions).
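
To confirm the token and the gated access took effect before the first download (a sketch; auth_check needs a recent huggingface_hub):

from huggingface_hub import whoami, auth_check

print(whoami()["name"])                   # the account that accepted the license
auth_check("google/paligemma-3b-pt-224")  # raises GatedRepoError if access is missing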

Loading the policy

from lerobot.policies.pi05.modeling_pi05 import PI05Policy

policy = PI05Policy.from_pretrained("ChinLR/towel-folding-pi05")
policy.eval()
policy.to("cuda")  # or "mps" / "cpu"

This pulls all 7 files in the repo:

  • model.safetensors — fine-tuned weights (9.35 GB).
  • config.json, train_config.json — architecture and training config.
  • policy_preprocessor.json + *_normalizer_processor.safetensors — input pipeline (image resize/normalize, state/action normalization with q01/q99 stats from the training dataset).
  • policy_postprocessor.json + *_unnormalizer_processor.safetensors — output pipeline (un-normalize predicted actions back to SO-101 joint-space units).

Loading any of the above outside from_pretrained is not recommended — the pre/post-processors are baked into how the model was trained, and skipping them produces garbage actions.
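
For intuition, quantile normalization with q01/q99 stats maps the central 98 % of the training distribution into a fixed range, roughly like this (a schematic sketch, not the exact LeRobot implementation):

import numpy as np

def quantile_normalize(x, q01, q99):
    # Values inside the training distribution land in roughly [-1, 1].
    return 2.0 * (x - q01) / (q99 - q01) - 1.0

def quantile_unnormalize(x_norm, q01, q99):
    # Inverse map: model outputs back to SO-101 joint-space units.
    return (x_norm + 1.0) / 2.0 * (q99 - q01) + q01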

Running inference on the SO-101

A minimal control loop — adapt as needed for your stack. Run at 30 Hz to match the training distribution.

import time

import torch
from lerobot.policies.pi05.modeling_pi05 import PI05Policy
from lerobot.robots.so101_follower import SO101Follower, SO101FollowerConfig
from lerobot.cameras.opencv import OpenCVCameraConfig

policy = PI05Policy.from_pretrained("ChinLR/towel-folding-pi05").eval().to("cuda")

robot = SO101Follower(SO101FollowerConfig(
    port="/dev/tty.usbmodemXXXX",       # your follower port
    id="followerbot",
    cameras={
        "wrist": OpenCVCameraConfig(
            index_or_path=0,
            width=1280, height=720, fps=30, fourcc="MJPG",
        ),
    },
))
robot.connect()

task = "Fold towel"
control_period = 1 / 30  # hold the loop at the 30 Hz training frequency

try:
    while True:
        t0 = time.perf_counter()
        obs = robot.get_observation()
        # obs is a dict of torch tensors keyed by what the dataset used:
        #   observation.state            -> shape (6,)        joint positions
        #   observation.images.wrist     -> shape (3, H, W)   uint8 RGB
        with torch.inference_mode():
            action = policy.select_action({**obs, "task": task})  # shape (6,)
        robot.send_action(action.cpu().numpy())
        # Sleep off whatever remains of the ~33 ms budget to keep 30 Hz.
        time.sleep(max(0.0, control_period - (time.perf_counter() - t0)))
finally:
    robot.disconnect()

policy.select_action predicts a chunk of future actions internally and returns one per call. Reset the action queue between episodes:

policy.reset()
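
For multi-episode runs, something like the following keeps episodes independent (run_episode and num_episodes are hypothetical stand-ins for your own wrapper around the control loop above):

for episode in range(num_episodes):
    policy.reset()                      # flush queued actions from the previous episode
    run_episode(robot, policy, task)    # hypothetical: one pass of the loop above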

Training details

  • Base model: lerobot/pi05_base (Physical Intelligence pi0.5, ~4 B params, PaliGemma vision-language tower + action expert).
  • Dataset: 97 episodes filtered from 152 recorded demos using a manual fold_score ≥ 3 quality grade, then trimmed to remove idle lead-in / tail (~26 k useful frames).
  • Optimizer: AdamW with the default LeRobot pi0.5 LR schedule.
  • Compute: single A100, batch size 4, bfloat16, gradient checkpointing on.
  • Schedule: 30 000 steps (~5 h 41 min on A100), ~0.63 s/step.
  • Loss trajectory: 0.343 (step 0) → 0.017 (step 30 000), no divergence.
  • Reproducibility: all hyperparameters in train_config.json (fetchable as shown below).
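
To pull just the training config without downloading the full checkpoint, a sketch using the standard hf_hub_download:

import json
from huggingface_hub import hf_hub_download

# Downloads only train_config.json (a few KB), not the 9.35 GB weights.
path = hf_hub_download("ChinLR/towel-folding-pi05", "train_config.json")
with open(path) as f:
    cfg = json.load(f)
print(json.dumps(cfg, indent=2))  # full hyperparameter dump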

Limitations

  • Single task: only knows "Fold towel".
  • Single camera view: trained only with a wrist-cam; will not generalize to a side view.
  • Trained on one specific towel under one lighting condition — expect degradation with very different fabrics, sizes, or lighting.
  • No closed-loop recovery training: large disturbances mid-episode may put the policy in out-of-distribution (OOD) states.
  • Not RL-tuned, not safety-bounded — wrap with appropriate joint/torque limits for your hardware; one way is sketched below.
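
One lightweight way to bound the policy's output before it reaches the motors (a sketch with placeholder limits; substitute the real safe ranges for your SO-101 build):

import numpy as np

# Placeholder per-joint bounds -- NOT calibrated values, measure your own.
JOINT_LOW = np.array([-1.5, -1.5, -1.5, -1.5, -1.5, 0.0])
JOINT_HIGH = np.array([1.5, 1.5, 1.5, 1.5, 1.5, 1.0])

def clamp_action(action):
    # Clip each of the 6 predicted joint targets into the safe envelope.
    return np.clip(action, JOINT_LOW, JOINT_HIGH)

Call it right before robot.send_action in the control loop above.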

Acknowledgements

  • Physical Intelligence for pi0.5.
  • HuggingFace LeRobot team for the base checkpoint, training stack, and SO-101 driver.
  • ETH Zürich Euler cluster for compute.