license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
- pi05
- openpi
- lerobot
- so101
- leisaac
- pick-orange
- isaac-sim
- flow-matching
- vla
- negative-result
datasets:
- LightwheelAI/leisaac-pick-orange
language:
- en
Pi0.5-PickOrange — π0.5 PyTorch expert-only FT (⚠️ negative result)
⚠️ This is a deliberately published failure: a documented, educational negative result. The 20-round strict benchmark scored 1/60 oranges (1.7%), last place on STRICT_LEADERBOARD and 15× worse than SmolVLA on the same task. It is published to anchor the question of why π0.5 fails to learn LeIsaac PickOrange with a real checkpoint, so that later researchers can reproduce or refute the result.
🔗 Project repos:
- vitorcen/isaaclab-experience: Isaac Lab + LeIsaac multi-policy comparison (parent project)
- vitorcen/LeIsaac-Training: LeIsaac fork (training scripts + design docs)
- Full negative report (HTML):
pi05_pytorch_expert_ft_negative.html
🎥 The failure, on video
Real screen capture of the π0.5 expert-FT checkpoint on LeIsaac PickOrange: the arm keeps moving for the full 180 s but places 0/3 oranges. This is not a bug; it is the genuine behavior under the SigLIP@224 vision bottleneck, where the policy effectively cannot see the oranges. Compare against the models that actually succeed (GR00T-N1.7 / ACT) below.
TL;DR
| Item | Value |
|---|---|
| Task | SO-101 PickOrange: a single arm picks up 3 oranges one by one and places them on a plate |
| Dataset | LightwheelAI/leisaac-pick-orange (60 demos, 30 Hz) |
| Architecture | π0.5 = PaliGemma-2B VLM (frozen) + Gemma-300M action expert (trainable) + flow-matching |
| Trainable params | 693M (gemma_expert layers 425M + lm_head 263M + norm 3M) |
| Recipe | train_expert_only=true, freeze_vision_encoder=true, bf16, lr=2.5e-5, chunk=50, batch=1 + grad_accum=8, 10k steps |
| Vision input | SigLIP @ 224×224 (hardcoded in PaliGemma; main suspect) |
| Strict benchmark | 1/60 oranges (1.7%) — 20 rounds × 3 ep × 1 orange/ep, ckpt-2000 |
| σ(5-round) | 0.50 / 15 (3.3%) — worst-case (μ-1σ) = -0.25 / 15 |
| Leaderboard rank | 6/6 (last place), 15× below SmolVLA |
| Inference latency | ~108 ms / chunk (50-step flow matching, RTX 4090) |
| GPU hours | ~3.5 h on RTX Pro 6000 (bf16, ZeRO-2 offload) |
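The σ(5-round) row is consistent with treating each episode as an independent Bernoulli trial at the measured 1/60 rate. The sketch below is a minimal check under that assumption; the independence assumption and the helper name `binomial_summary` are illustrative and are not taken from the eval scripts. A 5-round window covers 5 × 3 = 15 episodes.

```python
import math


def binomial_summary(p: float, n: int) -> tuple[float, float]:
    """Mean and standard deviation of the success count in n independent trials."""
    mean = n * p
    sigma = math.sqrt(n * p * (1.0 - p))
    return mean, sigma


p_hat = 1 / 60      # success rate from the 20-round strict benchmark
episodes = 5 * 3    # 5 rounds x 3 episodes per round

mean, sigma = binomial_summary(p_hat, episodes)
print(f"mean         = {mean:.2f} / {episodes}")
print(f"sigma        = {sigma:.2f} / {episodes}")
print(f"mean - sigma = {mean - sigma:.2f} / {episodes}")
```

Under these assumptions the mean is 0.25, σ is about 0.50, and μ − 1σ is about −0.25 out of 15, which matches the table. The negative lower bound only says that the one-sigma band already includes zero successes.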
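Why publish a failed model
In research, negative results usually end up in a drawer, yet they are as valuable as successes:
- Pin down the hypothesis: later researchers can load this checkpoint and directly verify whether this recipe really fails on this dataset, instead of repeatedly falling into the same pit.
- Isolate variables: the training-side dataloader, preprocessor, postprocessor, camera mapping, and freeze configuration have all been debugged (4 infrastructure bugs fixed), so the failure is not infra noise but a real architecture-vs-task signal.
- Quantify the "occasional one": an early 3-round run scored 2/9 and looked promising, but the 20-round 1/60 result shows it was only a Bernoulli outlier (p≈1.7%).
This checkpoint lets others verify the failure mode without re-spending the GPU hours.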
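The "Bernoulli outlier" reading can be checked with a short binomial tail computation. This is a sketch under stated assumptions: every episode is an independent trial at the measured rate p = 1/60, all 13 swept checkpoints share that rate, and `binom_tail` is a hypothetical helper rather than part of the benchmark code.

```python
from math import comb


def binom_tail(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p_hat = 1 / 60   # rate measured by the 20-round strict benchmark
n_eval = 9       # one 3-round eval = 3 rounds x 3 episodes
n_ckpts = 13     # checkpoints swept, as stated in this card

# Chance that a single 3-round eval shows 2 or more successes.
single = binom_tail(n_eval, p_hat, 2)
# Chance that at least one of the swept checkpoints shows 2 or more.
any_ckpt = 1 - (1 - single) ** n_ckpts

print(f"P(>=2 of 9) for one eval:        {single:.4f}")
print(f"P(>=2 of 9) in any of {n_ckpts} evals: {any_ckpt:.4f}")
```

With these inputs a single 3-round eval reaches 2/9 just under 1% of the time, but picking the best of 13 such evals raises that to roughly 11%. Selecting the top checkpoint from a sweep is therefore enough to produce an apparent 2/9 without any real learning signal.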
Root cause (main suspect, 80% confidence)
PaliGemma-2B's SigLIP vision encoder is hardcoded to 224×224 input, while LeIsaac renders natively at 640×480. After the 2.86× downscale an orange is only 10–17 px across, roughly one SigLIP patch (14 px) or less.
Compare with the models that work on the same task:
| Model | Vision encoder | Input res | Orange size after resize | Result |
|---|---|---|---|---|
| GR00T-N1.7 | Eagle-2 ViT | 448 | 22-34 px (1.5–2.4 patch) | 68.3% ✅ |
| SmolVLA | SigLIP | 512 | 24-40 px (1.7–2.9 patch) | 25.0% ✅ |
| π0.5 (this) | SigLIP | 224 | 10-17 px (≤1 patch) | 1.7% ❌ |
→ The oranges are nearly invisible in the vision tokens. PaliGemma's SigLIP is hardcoded to 224×224, and after downscaling LeIsaac's native 640×480 the oranges shrink to about one SigLIP patch. With the whole PaliGemma frozen and only the action expert trained, no amount of expert-only training can recover information already lost at the vision encoder.
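The patch counts in the comparison table can be approximated with simple scaling arithmetic. The sketch below assumes that frames are resized uniformly by width (640 px to the encoder input), that all three encoders use 14 px patches as the table's ratios imply, and that the native orange diameter is about 28.6–48.6 px, a range back-computed from the 10–17 px figure at 224. None of these values are measured here, and `orange_in_patches` is an illustrative helper.

```python
PATCH_PX = 14          # patch size implied by the patch counts in this card
NATIVE_WIDTH = 640     # LeIsaac native frame width

# Assumed native orange diameters, back-computed from 10-17 px at 224 input.
NATIVE_ORANGE_PX = (28.6, 48.6)


def orange_in_patches(native_px: float, native_width: int, input_res: int) -> float:
    """Orange diameter in patch units after resizing the frame width to input_res."""
    scale = input_res / native_width
    return native_px * scale / PATCH_PX


for name, res in [("GR00T-N1.7", 448), ("SmolVLA", 512), ("pi0.5", 224)]:
    lo, hi = (orange_in_patches(d, NATIVE_WIDTH, res) for d in NATIVE_ORANGE_PX)
    print(f"{name:11s} {res:4d} px  {lo:.1f}-{hi:.1f} patches")
```

The output lands close to the table's ranges, with small differences from rounding and from how each pipeline actually resizes or pads the frame. The point it illustrates is the ordering: only the 224 px input leaves the orange at about one patch wide.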
Training recipe
# Training entry point
bash LeIsaac/scripts/training/pi05_pt/train.sh
# Key flags
--policy.train_expert_only=true # freeze PaliGemma, train only gemma_expert
--policy.freeze_vision_encoder=true # explicit redundant lock
--policy.gradient_checkpointing=true # 24GB VRAM under bf16
--policy.dtype=bfloat16
--policy.chunk_size=50
--policy.n_action_steps=50
--policy.max_state_dim=32
--policy.max_action_dim=32
--policy.optimizer_lr=2.5e-5
--steps=10000 --save_freq=1000 --batch_size=1
Camera rename (LeIsaac 2-cam → π0.5 3-cam, missing left_wrist auto-padded inside modeling_pi05.py:1195):
rename_map = {
"observation.images.front": "observation.images.base_0_rgb",
"observation.images.wrist": "observation.images.right_wrist_0_rgb",
}
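To illustrate what the rename does to a batch, here is a minimal sketch that applies the mapping to an observation dictionary. `rename_observation_keys` is a hypothetical helper written for this example, not a LeRobot function, and the frame and state values are placeholders. Padding of the missing left wrist camera is not shown, since the card states that it happens inside the model code.

```python
rename_map = {
    "observation.images.front": "observation.images.base_0_rgb",
    "observation.images.wrist": "observation.images.right_wrist_0_rgb",
}


def rename_observation_keys(obs: dict, mapping: dict) -> dict:
    """Return a copy of obs with keys renamed per mapping; other keys pass through."""
    return {mapping.get(key, key): value for key, value in obs.items()}


# Placeholder observation; real entries would be image tensors and a state vector.
obs = {
    "observation.images.front": "front_frame",
    "observation.images.wrist": "wrist_frame",
    "observation.state": [0.0] * 6,
}

renamed = rename_observation_keys(obs, rename_map)
print(sorted(renamed))
```

Keys that are absent from the mapping, such as `observation.state`, are left unchanged, so the helper is safe to run on a full batch.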
Reproduce
from lerobot.policies.pi05 import PI05Policy
policy = PI05Policy.from_pretrained("wsagi/Pi0.5-PickOrange")
# Then plug into the LeIsaac Isaac Sim eval pipeline:
# scripts/benchmark/run_one_strict.sh
20-round strict benchmark (distribution, 20 rounds × 3 episodes):
| P(placed=0) | P(placed=1) | P(placed=2) | P(placed=3) | E(🍊)/ep |
|---|---|---|---|---|
| 95% (57/60) | 5% (3/60) | 0% | 0% | 0.05 |
19 of 20 rounds scored 0/3 and 1 round scored 1/3 (Episode 8: placed=[F, T, F]). The distribution looks like Bernoulli noise, with no task-completion signal.
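Checkpoints evaluated
The 10k-step run saved a checkpoint every 1k steps, and 13 checkpoints (500/1k/1.5k/.../10k) were all compared with 3-round evals = 1/60 oranges across 13 ckpts. Every checkpoint scored 0/9 or 1/9, with no sign of monotonic convergence. ckpt-2000 is the one that reached 2/9 in a 3-round eval (the highest); over 20 rounds it regressed to 1/60, confirming a noise outlier rather than a signal.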
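When (not) to use
❌ Do not use in production: a 1.7% success rate has no task-completion value.
✅ It can serve as:
- A baseline reference for π0.5 on tasks bottlenecked by a low-resolution VLM
- A reproduction checkpoint for the failure case of the "freeze VLM + train expert only" recipe
- A fixture for validating the π0.5 wire protocol in the LeIsaac eval pipeline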
Alternatives (better on same task)
These models actually place the oranges on the plate for the same task; go here if you want to see success:
| Model | Strict | Where |
|---|---|---|
| 🥇 GR00T-N1.7 (self-trained) | 68.3% (2.05/3) | wsagi/GR00T-N1.7-PickOrange |
| 🥈 ACT (self, h=70) | 43.3% (1.30/3) | wsagi/ACT-PickOrange |
| 🥉 SmolVLA (self-trained) | 25.0% | wsagi (pending release) |
| Diffusion Policy DDIM | stochastic 3/3 | wsagi/DiffusionPolicy-PickOrange |
License & Attribution
- Apache-2.0
- Base model: lerobot/pi05_base (Physical Intelligence × LeRobot)
- Dataset: LightwheelAI/leisaac-pick-orange
- Trained on RTX Pro 6000 96GB
- Evaluated in Isaac Sim 5.1 + LeIsaac