SmolVLA-PickOrange

A SmolVLA policy fine-tuned without LoRA (full-parameter, starting from lerobot/smolvla_base) on the LeIsaac SO-101 PickOrange task. main = step-15000, the best checkpoint of a self-trained sweep.

🔗 Project repos:

vitorcen/isaaclab-experience: Isaac Lab + LeIsaac multi-policy comparison (parent project), including a 7-baseline benchmark
vitorcen/LeIsaac-Training: LeIsaac fork (training scripts + design docs)

About the name: config.type=smolvla (the LeRobot v1 SmolVLA implementation), with HuggingFaceTB/SmolVLM2-500M-Video-Instruct (SmolVLM2) as the backbone. LeRobot itself names the policy smolvla rather than smolvla2, so this repo keeps the name SmolVLA-PickOrange.

TL;DR

Task: Pick up the orange and place it on the plate. A single SO-101 arm picks up 3 oranges one after another and places them on the plate.
Dataset: LightwheelAI/leisaac-pick-orange, 60 teleoperated demonstration episodes, 30 fps, dual-cam 480×640.
Architecture: SmolVLA v1 (450M), SmolVLM2-500M-Video-Instruct backbone + Action Expert, chunk_size=50.
Training: full-parameter fine-tuning (no LoRA), batch=8 / lr=1e-4 / 30k steps in total; the model is clearly overfit by 30k. **main = step-15000 (sweep best)**.
Eval (Isaac Sim 5.1, 5 rounds × 3 oranges = 15 oranges, post-fix placement check):
- 2/5 strict rounds, 8/15 oranges (53%), 133s avg ← 15k @ h=50
- See the leaderboard in the vitorcen/isaaclab-experience README for details

Checkpoint branches

Branch	Step	env rounds	oranges	avg s	Notes
`main`	15000	2/5	8/15 (53%)	133s	sweep best ⭐
`ckpt-20k`	20000	0/5	6/15 (40%)	180s	(will be uploaded if needed)
`ckpt-25k`	25000	1/5	5/15 (33%)	160s	(will be uploaded if needed)
`ckpt-30k`	30000	0/5	4/15 (27%)	180s	overfit; the old main was moved to this branch
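
Each row above is a Git branch of the model repo, so a specific checkpoint can be fetched by passing the branch name as the revision. A minimal sketch, assuming the `huggingface_hub` package is installed and that the branch has actually been uploaded (the `ckpt-20k` and `ckpt-25k` rows are marked as uploaded only if needed); `branch_for_step` and `download_checkpoint` are hypothetical helper names, not part of this repo:

```python
def branch_for_step(step: int, main_step: int = 15000) -> str:
    """Map a training step to the branch naming used in the table above."""
    if step == main_step:
        return "main"
    if step <= 0 or step % 1000 != 0:
        raise ValueError(f"expected a positive multiple of 1000 steps, got {step}")
    return f"ckpt-{step // 1000}k"


def download_checkpoint(step: int, repo_id: str = "wsagi/SmolVLA-PickOrange") -> str:
    """Download one checkpoint branch and return the local snapshot directory."""
    # Imported lazily so the branch-name helper works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, revision=branch_for_step(step))
```

For example, `download_checkpoint(30000)` would request the `ckpt-30k` branch, and `download_checkpoint(15000)` would request `main`.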

The sweep uses h=50 (= train chunk_size): 5 rounds × 5 checkpoints (75 orange attempts in total) on Isaac Sim 5.1, with a single RTX 4090.

Ckpt sweep curve

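15k is the best point: training longer starts to overfit the small 60-episode dataset, while stopping too early (10k or below) has not yet learned the full long-horizon pick-place-pick-place sequence.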

oranges/15
  9 |
  8 |    ⭐ 15k
  7 |
  6 |        ●  20k
  5 |            ●  25k
  4 |                ●  30k
  3 |
  2 |
  1 |●  10k
  0 +----------------------
    10  15  20  25  30  k step
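
The sweep numbers can be checked with a few lines of arithmetic. This sketch re-enters the orange counts from the table and plot above (out of 15 per checkpoint) and picks the checkpoint with the highest count; the variable names are illustrative only.

```python
# Oranges placed out of 15 (5 rounds x 3 oranges), copied from the sweep above.
ORANGES_PER_CKPT = 15
sweep = {"10k": 1, "15k": 8, "20k": 6, "25k": 5, "30k": 4}

# Success rate per checkpoint, as a percentage rounded to the nearest integer.
rates = {ckpt: round(100 * n / ORANGES_PER_CKPT) for ckpt, n in sweep.items()}

# The best checkpoint is the one with the most oranges placed.
best = max(sweep, key=sweep.get)

for ckpt, n in sweep.items():
    marker = " <- best" if ckpt == best else ""
    print(f"{ckpt}: {n}/{ORANGES_PER_CKPT} ({rates[ckpt]}%){marker}")
```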

Inference configuration

# 1. Start the LeRobot async policy server
bash server/start_server.sh --lerobot-only

# 2. Run the LeIsaac PickOrange eval
# Export the settings first: prefix assignments on the same command line
# would not be visible to the "$VAR" expansions in the arguments below.
export POLICY_CHECKPOINT=wsagi/SmolVLA-PickOrange
export ACTION_HORIZON=50
export EVAL_ROUNDS=5 EPISODE_LENGTH=120 MAX_ROUND_WALL_S=180
export PROMPT="Pick up the orange and put it in the plate"
conda run -n isaaclab python LeIsaac/scripts/evaluation/policy_inference.py \
    --task=LeIsaac-SO101-PickOrange-v0 \
    --policy_type=lerobot-smolvla \
    --policy_port=8080 \
    --policy_checkpoint_path="$POLICY_CHECKPOINT" \
    --policy_action_horizon="$ACTION_HORIZON" \
    --eval_rounds="$EVAL_ROUNDS" --episode_length_s="$EPISODE_LENGTH" \
    --max_round_wall_s="$MAX_ROUND_WALL_S" \
    --policy_language_instruction="$PROMPT" \
    --device=cuda --enable_cameras

**Key inference parameters (per scripts/benchmark/baselines_action_horizon.tsv)**:

action_horizon=50 (= train chunk_size; h=40 measured slightly weaker)
Use branch main for the best checkpoint, or ckpt-30k / any uploaded ckpt-Nk for the corresponding training stage.
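
How `action_horizon` relates to `chunk_size` can be illustrated with a toy receding-horizon loop: the policy predicts a chunk of `chunk_size` actions, the client executes the first `action_horizon` of them, then queries the policy again. This is a simplified sketch with a stub policy, not the LeIsaac client or the LeRobot async server; `stub_policy` and `run_episode` are hypothetical names.

```python
from typing import Callable, List


def stub_policy(step: int, chunk_size: int = 50) -> List[int]:
    """Stand-in for the policy: returns a chunk of placeholder actions."""
    return [step + i for i in range(chunk_size)]


def run_episode(
    policy: Callable[[int], List[int]],
    total_steps: int,
    action_horizon: int,
) -> int:
    """Execute `total_steps` actions and return how many policy queries were needed."""
    if action_horizon < 1:
        raise ValueError("action_horizon must be at least 1")
    queries = 0
    step = 0
    while step < total_steps:
        chunk = policy(step)  # one inference call yields a whole chunk
        queries += 1
        if action_horizon > len(chunk):
            raise ValueError("action_horizon cannot exceed the predicted chunk length")
        # Execute only the first `action_horizon` actions, then re-plan.
        for _action in chunk[:action_horizon]:
            if step >= total_steps:
                break
            step += 1  # a real client would send the action to the simulator here
    return queries


# With a 50-action chunk, h=50 uses every predicted action; h=40 discards the last 10.
print(run_episode(stub_policy, total_steps=200, action_horizon=50))  # 4
print(run_episode(stub_policy, total_steps=200, action_horizon=40))  # 5
```

A shorter horizon re-plans more often at the cost of more inference calls; per the benchmark file cited above, h=40 measured slightly weaker than h=50 on this task.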

Training recipe

Item	Value
Dataset	`LightwheelAI/leisaac-pick-orange` (60 ep, dual-cam 480×640 RGB + 6 DOF state, 30 Hz)
Policy	`smolvla` (LeRobot implementation)
Backbone	`HuggingFaceTB/SmolVLM2-500M-Video-Instruct` + Action Expert
`chunk_size` / `n_action_steps`	50 / 50
Batch size	8 (full-param, no LoRA)
Optimizer	AdamW, lr=1e-4
Steps	30000 (~14h on 4090) → main = 15000 (sweep best)
`video_backend`	`pyav` (torchcodec segfaults on long runs)
Image augmentation	None
Train expert only	False (full-parameter)

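🚨 Key fix for a schema-free base: before training, prepare_base.sh must be used to strip the input_features / empty_cameras that ship with lerobot/smolvla_base (the default camera1/2/3 @ 256×256 would contaminate the fine-tuning path). Otherwise the schema is misaligned during training, and forward either raises a KeyError or the model silently trains badly.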
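
The idea behind that fix can be sketched as a small config edit. The contents of prepare_base.sh are not shown here, so the following is only an illustration under stated assumptions: that the base checkpoint ships a config with `input_features` and `empty_cameras` keys, and that "stripping" means clearing them so the fine-tuning dataset's own features are used. The function name and the example values are hypothetical.

```python
import copy
import json


def strip_base_schema(config: dict) -> dict:
    """Return a copy of a policy config with the pretrained input schema cleared.

    Assumption: clearing `input_features` and zeroing `empty_cameras` is what
    "schema-free" means here; the real prepare_base.sh may do more or differ.
    """
    stripped = copy.deepcopy(config)
    stripped["input_features"] = {}
    stripped["empty_cameras"] = 0
    return stripped


# Illustrative config fragment (placeholder names and values, not the real file).
base_config = {
    "type": "smolvla",
    "chunk_size": 50,
    "empty_cameras": 1,
    "input_features": {
        "camera1": {"type": "VISUAL", "shape": [3, 256, 256]},
        "camera2": {"type": "VISUAL", "shape": [3, 256, 256]},
        "camera3": {"type": "VISUAL", "shape": [3, 256, 256]},
    },
}

print(json.dumps(strip_base_schema(base_config), indent=2))
```

Working on a deep copy keeps the original base config untouched, which makes it easy to diff the two before starting a long training run.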

Eval history

Version	env rounds	oranges	avg s	Notes
30k h=50 (old leaderboard)	1/3	5/9 (56%)	355s	sticky-OR + 3-round (old buggy counting)
30k h=50 (post-fix 5-round)	0/5	4/15 (27%)	180s	true 5-round + pre-step snapshot
15k h=50 (post-fix 5-round)	2/5	8/15 (53%)	133s	sweep best, current main ⭐