Pi0.5-PickOrange — π0.5 PyTorch expert-only FT (⚠️ negative result)

⚠️ This is a deliberately published failure: a documented negative result, released as an educational counterexample. The 20-round strict benchmark scored 1/60 oranges (1.7%), last place on STRICT_LEADERBOARD and 15× worse than SmolVLA on the same task. It is published to anchor the "why π0.5 doesn't learn LeIsaac PickOrange" claim to a real checkpoint, so that later researchers can reproduce or refute it.

🔗 Project repos

🎥 The failure, on video

Real screen capture of the π0.5 expert-FT ckpt on LeIsaac PickOrange: the arm keeps moving for the full 180s but places 0/3 oranges. This is not a bug; it is the genuine behavior of a policy that "cannot see the oranges" under the SigLIP@224 vision bottleneck, in direct contrast to the models that actually succeed (GR00T-N1.7 / ACT) below.

TL;DR

| Item | Value |
|---|---|
| Task | SO-101 PickOrange: a single arm picks up 3 oranges in turn and places them on a plate |
| Dataset | LightwheelAI/leisaac-pick-orange (60 demos, 30Hz) |
| Architecture | π0.5 = PaliGemma-2B VLM (frozen) + Gemma-300M action expert (trainable) + flow-matching |
| Trainable params | 693M (gemma_expert layers 425M + lm_head 263M + norm 3M) |
| Recipe | train_expert_only=true, freeze_vision_encoder=true, bf16, lr=2.5e-5, chunk=50, batch=1 + grad_accum=8, 10k steps |
| Vision input | SigLIP @ 224×224 (hardcoded by PaliGemma; main suspect) |
| Strict benchmark | 1/60 oranges (1.7%): 20 rounds × 3 ep × 1 orange/ep, ckpt-2000 |
| σ (5-round) | 0.50 / 15 (3.3%); worst-case (μ-1σ) = -0.25 / 15 |
| Leaderboard rank | 6/6 (last place), 15× below SmolVLA |
| Inference latency | ~108 ms / chunk (50-step flow matching, RTX 4090) |
| GPU hours | ~3.5 h on RTX Pro 6000 (bf16, ZeRO-2 offload) |

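Why publish a failed model

Negative results usually end up in a drawer, but they matter as much as positive ones:

  1. Pin down the hypothesis: later researchers can load this ckpt and directly check whether this recipe really fails on this dataset, instead of repeatedly falling into the same pit.
  2. Isolate variables: the training-side dataloader / preprocessor / postprocessor / camera mapping / freeze configuration all work (4 infrastructure bugs fixed), so the failure is not infra noise but a real architecture-vs-task signal.
  3. Quantify the "occasional one orange": the initial 3-round run scored 2/9 and looked promising, but the 20-round result of 1/60 shows that it was only a Bernoulli outlier (p≈1.7%).

This ckpt lets others verify the failure mode without re-spending the GPU hours.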
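To put a rough number on the "Bernoulli outlier" point in item 3, here is a minimal sketch. It assumes every orange is an independent trial with the 20-round rate p = 1/60, which is a simplification (oranges within one episode are probably correlated). It also assumes that 13 checkpoints were each given one 3-round (9-orange) run, as described in the checkpoint sweep section. The helper name `prob_at_least` is illustrative and not part of any project code.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: per-orange success rate taken from the 20-round result (1/60).
p_hat = 1 / 60

# Chance that a single 9-orange run shows 2 or more placed oranges.
p_single = prob_at_least(2, 9, p_hat)

# Chance that at least one of 13 independently evaluated checkpoints does so.
n_ckpts = 13
p_any = 1 - (1 - p_single) ** n_ckpts

print(f"P(>=2 of 9 | p=1/60)           = {p_single:.4f}")
print(f"P(any of {n_ckpts} ckpts hits >=2/9) = {p_any:.4f}")
```

The second number is the relevant one when the best of several checkpoints is picked for a closer look: selecting the maximum over many short runs inflates the apparent success rate, which is why the longer 20-round run is the figure to trust.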

Root cause (main suspect, 80% confidence)

The SigLIP vision encoder in PaliGemma-2B is hardcoded to 224×224 input, while LeIsaac renders natively at 640×480. After the 2.86× downscale an orange is only 10–17 px across, at most about one SigLIP patch (14 px).

Compared with the models that work on the same task:

| Model | Vision encoder | Input res | Orange size after resize | Result |
|---|---|---|---|---|
| GR00T-N1.7 | Eagle-2 ViT | 448 | 22-34 px (1.5–2.4 patch) | 68.3% ✅ |
| SmolVLA | SigLIP | 512 | 24-40 px (1.7–2.9 patch) | 25.0% ✅ |
| π0.5 (this) | SigLIP | 224 | 10-17 px (≤1 patch) | 1.7% ❌ |

→ The oranges are almost invisible in the vision tokens. PaliGemma's SigLIP is hardcoded to 224×224, and after downscaling LeIsaac's native 640×480 the oranges shrink to roughly one SigLIP patch. With the whole PaliGemma frozen and only the action expert trained, no amount of expert-only training can recover information already lost at the vision encoder.
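The pixel arithmetic behind this argument can be checked with a short sketch. It takes the post-resize orange sizes from the comparison table as given and divides them by a 14 px patch; applying the same 14 px patch size to all three encoders is an assumption carried over from the table, and `patches_spanned` is a hypothetical helper name.

```python
PATCH_PX = 14  # SigLIP patch size quoted in the root-cause paragraph

# (model, input resolution, orange size range in px after resize), from the table
rows = [
    ("GR00T-N1.7", 448, (22, 34)),
    ("SmolVLA", 512, (24, 40)),
    ("π0.5", 224, (10, 17)),
]


def patches_spanned(size_px: float, patch_px: int = PATCH_PX) -> float:
    """Number of patch widths covered by an object size_px pixels across."""
    return size_px / patch_px


# Width-based downscale factor from LeIsaac's native 640 px to 224 px.
downscale = 640 / 224
print(f"downscale 640 -> 224: {downscale:.2f}x")

for name, res, (lo, hi) in rows:
    print(
        f"{name:<11} @ {res}px: "
        f"{patches_spanned(lo):.1f} to {patches_spanned(hi):.1f} patch widths"
    )
```

An object that spans about one patch width contributes to only a handful of vision tokens at best, which is the sense in which the orange is "nearly invisible" to the frozen encoder.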

Training recipe

# training entry
bash LeIsaac/scripts/training/pi05_pt/train.sh

# key flags
--policy.train_expert_only=true       # freeze PaliGemma, train only gemma_expert
--policy.freeze_vision_encoder=true   # explicit redundant lock
--policy.gradient_checkpointing=true  # 24GB VRAM under bf16
--policy.dtype=bfloat16
--policy.chunk_size=50
--policy.n_action_steps=50
--policy.max_state_dim=32
--policy.max_action_dim=32
--policy.optimizer_lr=2.5e-5
--steps=10000  --save_freq=1000  --batch_size=1

Camera rename (LeIsaac 2-cam → π0.5 3-cam, missing left_wrist auto-padded inside modeling_pi05.py:1195):

rename_map = {
    "observation.images.front":  "observation.images.base_0_rgb",
    "observation.images.wrist":  "observation.images.right_wrist_0_rgb",
}
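As an illustration of what this mapping does to an observation, here is a minimal sketch that renames the two LeIsaac camera keys and leaves everything else untouched. It is not the mechanism LeRobot or LeIsaac uses internally; `rename_observation_keys` and the sample observation are hypothetical, and the padding of the missing left-wrist camera is left to the model code as described above.

```python
RENAME_MAP = {
    "observation.images.front": "observation.images.base_0_rgb",
    "observation.images.wrist": "observation.images.right_wrist_0_rgb",
}


def rename_observation_keys(obs: dict, rename_map: dict = RENAME_MAP) -> dict:
    """Return a copy of obs with camera keys renamed; other keys pass through."""
    return {rename_map.get(key, key): value for key, value in obs.items()}


# Hypothetical observation with placeholder values instead of image tensors.
sample_obs = {
    "observation.images.front": "front_frame",
    "observation.images.wrist": "wrist_frame",
    "observation.state": [0.0] * 6,
}

renamed = rename_observation_keys(sample_obs)
print(sorted(renamed))
```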

Reproduce

from lerobot.policies.pi05 import PI05Policy
policy = PI05Policy.from_pretrained("wsagi/Pi0.5-PickOrange")
# Then plug into the LeIsaac Isaac Sim eval pipeline:
#   scripts/benchmark/run_one_strict.sh

20-round strict benchmark (distribution, 20 rounds × 3 episodes):

| P(placed=0) | P(placed=1) | P(placed=2) | P(placed=3) | E(🍊)/ep |
|---|---|---|---|---|
| 95% (57/60) | 5% (3/60) | 0% | 0% | 0.05 |

19 of 20 rounds scored 0/3; 1 round scored 1/3 (Episode 8: placed=[F, T, F]). This is a Bernoulli noise distribution with no task-completion signal.

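Checkpoints evaluated

The 10k-step run saved a checkpoint every 1k steps; 13 ckpts (500/1k/1.5k/.../10k) were all compared with a 3-round eval. Every one scored 0/9 or 1/9, with no sign of monotonic convergence. ckpt-2000 is the one that reached 2/9 in a 3-round run (the highest); over 20 rounds it regressed to 1/60, confirming that the result was a noise outlier, not a signal.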

When (not) to use

Do not use in production: a 1.7% success rate has no task-completion value.

✅ Can be used as:

  • A baseline reference for π0.5 on tasks bottlenecked by a low-resolution VLM
  • A reproduction ckpt for the failure case of the "freeze VLM + train expert only" recipe
  • A fixture for validating the π0.5 wire protocol of the LeIsaac eval pipeline

Alternatives (better on same task)

These are the models that actually place the oranges on the plate on the same task; go here if you want to see success:

| Model | Strict | Where |
|---|---|---|
| 🥇 GR00T-N1.7 (self-trained) | 68.3% (2.05/3) | wsagi/GR00T-N1.7-PickOrange |
| 🥈 ACT (self, h=70) | 43.3% (1.30/3) | wsagi/ACT-PickOrange |
| 🥉 SmolVLA (self-trained) | 25.0% | wsagi (pending release) |
| Diffusion Policy DDIM | stochastic 3/3 | wsagi/DiffusionPolicy-PickOrange |

License & Attribution

  • Apache-2.0
  • Base model: lerobot/pi05_base (Physical Intelligence × LeRobot)
  • Dataset: LightwheelAI/leisaac-pick-orange
  • Trained on RTX Pro 6000 96GB
  • Evaluated in Isaac Sim 5.1 + LeIsaac