GR00T-N1.6-3B-PickOrange (self-trained, ckpt-6500)

An NVIDIA GR00T N1.6 policy (Eagle 2.5 VLM + cross-attention DiT action head, ~3B params), fine-tuned from nvidia/GR00T-N1.6-3B for the LeIsaac SO-101 PickOrange task.

🔗 Project repos:

vitorcen/isaaclab-experience: Isaac Lab + LeIsaac multi-policy comparison (parent project)
vitorcen/LeIsaac-Training: LeIsaac fork (training scripts + design docs)

Highlights

ckpt-6500: 3/3 oranges placed, robot returned to rest pose — env reports success

ckpt-3500 (earlier checkpoint, kept on the ckpt-3500 branch for reference): the policy has not yet learned the placement; the orange is dropped off the edge

TL;DR

Task: SO-101 single-arm picks 3 oranges sequentially and places each in a plate (LeIsaac PickOrange).
Architecture: GR00T N1.6 — Eagle 2.5 VLM (frozen) + cross-attention DiT action head (trainable). chunk_size=50, n_action_steps=16, 4-step rectified-flow denoising.
Training: 6500 steps (best checkpoint of an 8000-step run), global batch 16 (micro-batch 2 × grad_accum 8), adafactor, bf16, gradient_checkpointing with use_reentrant=False.
Hardware: single RTX 4090 24GB (with DISABLE_ADDMM_CUDA_LT=1, watchdog auto-resume on intermittent CUDA assert).
🏆 Benchmark-aligned eval (3 rounds, 120 s of sim time per round, 180 s wall-clock cap per round) vs LeIsaac leaderboard:

Model	Strict rounds	Oranges placed
hi-space N1.6 (public SOTA)	2/3	6/9
ACT	1/3	6/9
X-VLA best	0/3	4/9
🏆 This ckpt-6500	2/3	8/9 ⭐
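
The "4-step rectified-flow denoising" in the architecture line above can be illustrated with a toy fixed-step Euler integrator. This is only a sketch under stated assumptions, not the Isaac-GR00T implementation: it assumes the convention that t=0 is pure noise and t=1 is the clean action, and `toy_velocity`, `TARGET`, and the 6-dimensional action are hypothetical stand-ins for the DiT action head and its output.

```python
import random


def euler_denoise(velocity_fn, noise, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 6-dimensional action used as the "clean" sample for this toy example.
TARGET = [0.1, -0.3, 0.5, 0.0, 0.2, 1.0]


def toy_velocity(x, t):
    """Ideal straight-line field pointing at TARGET (stand-in for the learned DiT head)."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(TARGET, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
action = euler_denoise(toy_velocity, noise, num_steps=4)
```

With a perfectly straight velocity field, four Euler steps land on the target up to floating-point error. A learned field is only approximately straight, so the step count trades inference latency against integration error.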

Architecture / training recipe

base_model              nvidia/GR00T-N1.6-3B
tune_llm                False
tune_visual             False
tune_projector          True
tune_diffusion_model    True
tune_top_llm_layers     4 (default, kept)
backbone_trainable_params_fp32   False     ← 4090 squeeze
optim                   adafactor          ← 4090 squeeze
gradient_checkpointing  True (use_reentrant=False, custom monkey-patch)
bf16                    True
DISABLE_ADDMM_CUDA_LT   1                  ← workaround for torch 2.7.1 cublasLt bf16 bug
global_batch_size       16
gradient_accumulation_steps   8            ← per-step micro-batch = 2
max_steps               8000 (best ckpt at step 6500)
save_steps              100 (with custom keep-multiples-of-500 prune callback)
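
The keep-multiples-of-500 pruning noted on the save_steps line could be sketched as a pure selection function. This is a hypothetical illustration, not the callback from the training repo: the function name is invented, and keeping the most recent checkpoint (so a resume loses at most 100 steps) is an assumption.

```python
def checkpoints_to_prune(steps, keep_every=500):
    """Return the checkpoint steps that may be deleted.

    Keeps every multiple of keep_every, plus the latest checkpoint
    (assumption: the latest one is needed for crash recovery).
    """
    if not steps:
        return []
    latest = max(steps)
    return sorted(s for s in steps if s % keep_every != 0 and s != latest)


saved = [4800, 4900, 5000, 5100, 5200]
doomed = checkpoints_to_prune(saved)
```

In this example, 5000 survives as a multiple of 500 and 5200 survives as the latest checkpoint. A real callback would then map each step in `doomed` to its checkpoint directory and delete it.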

Training notes / known issues

4090 24GB is the hard limit: full-parameter fine-tuning of N1.6 on 24GB requires every memory hack stacked: bf16 + grad-ckpt with use_reentrant=False + adafactor + backbone_trainable_params_fp32=False + DISABLE_ADDMM_CUDA_LT=1. Dropping any one of these leads to either OOM or RuntimeError: d.is_cuda() INTERNAL ASSERT FAILED at CUDAGuardImpl.h:34.
A random CUDA assert still occurs every ~500-700 steps despite the patches. We wrap training in a watchdog that auto-resumes from the latest checkpoint after each crash; net throughput is ~70% of a crash-free run.
Score variance: per-checkpoint quality oscillates wildly (e.g. ckpt-5000 scored 16/18 in one 6-round eval, while ckpt-5500 scored 0/18 in the next). We attribute this to running the optimization at the absolute memory edge: gradients and optimizer states may quantize inconsistently. The 8/9 result here comes from a single benchmark-aligned 3-round run; expect ±20% noise on any individual run.
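
The watchdog described above can be approximated by a small restart loop. This is a minimal sketch, not the script used for this run: `train.sh` is a hypothetical launcher name, and it assumes the training script resumes from the latest checkpoint on its own when restarted.

```python
import subprocess
import time


def run_with_watchdog(launch, max_restarts=50, backoff_s=0.0):
    """Call launch() until it returns exit code 0; return the number of restarts used."""
    restarts = 0
    while True:
        code = launch()
        if code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"still failing after {restarts} restarts (exit {code})")
        restarts += 1
        time.sleep(backoff_s)  # give the GPU driver a moment before relaunching


def launch_training():
    """Hypothetical launcher: the script itself must resume from the latest checkpoint."""
    return subprocess.call(["bash", "train.sh"])


# Dry run with simulated exit codes: two crashes, then a clean finish.
fake_exit_codes = iter([134, 134, 0])
restarts = run_with_watchdog(lambda: next(fake_exit_codes))
```

For a real run, pass `launch_training` instead of the simulated launcher. The restart cap keeps a deterministic failure, such as a bad config, from looping forever.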

Inference

Use Isaac-GR00T's run_gr00t_server.py directly:

cd /path/to/Isaac-GR00T
uv run --extra=gpu python gr00t/eval/run_gr00t_server.py \
    --embodiment-tag NEW_EMBODIMENT \
    --model-path wsagi/GR00T-N1.6-PickOrange \
    --host 0.0.0.0 --port 5555

Then on the Isaac Sim eval side (LeIsaac):

POLICY_PORT=5555 \
ACTION_HORIZON=16 \
EVAL_ROUNDS=3 EPISODE_LENGTH=120 MAX_ROUND_WALL_S=180 \
PROMPT="Pick up the orange and put it in the plate" \
bash server/eval_gr00t.sh
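
One plausible reading of ACTION_HORIZON=16 with chunk_size=50 is receding-horizon execution: the policy predicts a 50-step chunk, the client executes only the first 16 actions, then queries again with a fresh observation. The sketch below illustrates that loop under this assumption. `toy_policy` and `toy_env` are hypothetical stand-ins, not the Isaac-GR00T or LeIsaac client API.

```python
CHUNK_SIZE = 50  # actions predicted per policy call (from the model card)


def run_receding_horizon(predict_chunk, step_env, obs, total_steps, action_horizon=16):
    """Execute the first action_horizon actions of each predicted chunk, then re-query."""
    policy_calls = 0
    executed = 0
    while executed < total_steps:
        chunk = predict_chunk(obs)
        policy_calls += 1
        for action in chunk[:action_horizon]:
            obs = step_env(action)
            executed += 1
            if executed >= total_steps:
                break
    return policy_calls, executed


def toy_policy(obs):
    """Stand-in policy: returns a chunk of zero actions for a 6-dimensional arm."""
    return [[0.0] * 6 for _ in range(CHUNK_SIZE)]


def toy_env(action):
    """Stand-in environment step: returns a dummy observation."""
    return {"dummy": True}


calls, steps = run_receding_horizon(toy_policy, toy_env, obs={}, total_steps=100)
```

A shorter horizon re-plans more often, at the cost of more policy calls per episode. Here 100 environment steps need 7 calls.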

Branches

branch	step	benchmark (3-round)	notes
main	6500	2/3 strict, 8/9 oranges, 115s avg	best
`ckpt-3500`	3500	0/3, 2/9, 180s	first transition out of destruction phase
`ckpt-5000`	5000	0/3, 4/9, 180s	strong 6-round (16/18) but volatile under 3-round
`ckpt-7000`	7000	1/3, 6/9, 146s	secondary peak