pi05-rft-behavior1k

A Pi0.5 vision-language-action policy adapted for long-horizon BEHAVIOR-1K household manipulation (2025 BEHAVIOR Challenge, NeurIPS 2025).

This checkpoint is the ft_ckpt2 model (training step 9999), a task-specialised fine-tune used for four tasks: picking_up_trash, picking_up_toys, tidying_bedroom, and collecting_childrens_toys (task IDs 1, 7, 18, and 21).

📦 Code & full pipeline: https://github.com/Sunliu36/Behavior1kChallenge_Solution_by_SHAWN
🧪 Sibling method (point-cloud SFT): https://github.com/Sunliu36/Behavior1KChallenge_minor_Solution_by_SHAWN

What it is

An orbax checkpoint with params/ + assets/ (per-timestamp normalisation stats), directly loadable by the policy server in the project repo:

uv run scripts/serve_b1k.py policy:checkpoint \
    --policy.config pi_behavior_b1k_fast \
    --policy.dir <path-to-this-checkpoint>

The optimizer state (train_state/) is intentionally not included — only the inference-ready weights and normalisation assets are published.
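
Before pointing the policy server at a downloaded copy, it can help to confirm that the directory has the layout described above. The following is a minimal sketch that checks only for the params/ and assets/ sub-directories named in this card; the helper names missing_subdirs and has_optimizer_state are hypothetical and are not part of the project repo.

```python
from pathlib import Path

# Sub-directories this card says the published checkpoint contains.
REQUIRED_SUBDIRS = ("params", "assets")


def missing_subdirs(checkpoint_dir):
    """Return the required sub-directories that are absent (hypothetical helper)."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_SUBDIRS if not (root / name).is_dir()]


def has_optimizer_state(checkpoint_dir):
    """True if a train_state/ directory exists; it is not expected in this release."""
    return (Path(checkpoint_dir) / "train_state").is_dir()
```

An empty list from missing_subdirs(path) means both expected sub-directories are present; the check says nothing about whether their contents are complete.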

Fine-tuning data

This checkpoint was produced by continuing training from the base Pi0.5 checkpoint (checkpoint_2) on human teleoperation demonstrations drawn from the IliaLarchenko/behavior_224_rgb LeRobot dataset, filtered to the four target tasks (head camera, 224×224 RGB):

Task ID	Task	Episodes	Frames	≈ Hours @30 fps
1	picking_up_trash	200	1,053,550	9.8
7	picking_up_toys	200	3,778,110	35.0
18	tidying_bedroom	200	2,207,489	20.4
21	collecting_childrens_toys	200	3,837,265	35.5
Total		800	10,876,414	≈ 100.7

In total, the fine-tune used 800 demonstration episodes (200 per task): roughly 10.9 M frames, or about 100 hours of teleoperation at 30 fps, head camera only.
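
The hours column follows from the frame counts at 30 fps. As a quick check, the conversion can be reproduced with the figures from the table; the variable and function names below are illustrative only.

```python
FPS = 30  # frame rate stated in the table header

# Frame counts copied from the table above.
frames_per_task = {
    "picking_up_trash": 1_053_550,
    "picking_up_toys": 3_778_110,
    "tidying_bedroom": 2_207_489,
    "collecting_childrens_toys": 3_837_265,
}


def frames_to_hours(frames, fps=FPS):
    """Convert a frame count to hours of footage at the given frame rate."""
    return frames / fps / 3600


total_frames = sum(frames_per_task.values())
hours_per_task = {
    task: round(frames_to_hours(n), 1) for task, n in frames_per_task.items()
}
total_hours = round(frames_to_hours(total_frames), 1)

print(total_frames)    # 10876414
print(hours_per_task)  # 9.8, 35.0, 20.4 and 35.5 hours respectively
print(total_hours)     # 100.7
```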

Note: this particular checkpoint is a supervised fine-tune on human demonstrations. The broader project also explores rejection-sampling fine-tuning (RFT) with pose-perturbed rollouts — see the GitHub repo for that pipeline.

Training setup

Setting	Value
Backbone	Pi0.5 (PaliGemma VLM + flow-matching action expert), task embeddings (no text)
Init from	base `checkpoint_2` params
Frozen	PaliGemma LLM backbone + vision backbone (only action-specific params train)
Steps	10,000 (this checkpoint = step 9999)
Batch size	8
Optimizer	cosine-decay LR schedule, 200 warmup steps, peak LR 5e-5 decaying to 5e-6
Action space	Δ-joint, 30-step horizon, 32-dim
Aux losses	FAST tokens (0.05), subtask/stage prediction (0.1), correlation-aware flow noise (β=0.5)
Camera	head (224×224 RGB)
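
To make the learning-rate row concrete, here is a minimal sketch of a warmup plus cosine-decay schedule using the values from the table. It assumes a linear warmup starting from zero and a decay that reaches the final value at step 10,000; the card does not state either detail, so the actual training schedule may differ.

```python
import math

PEAK_LR = 5e-5        # peak learning rate from the table
END_LR = 5e-6         # final learning rate from the table
WARMUP_STEPS = 200    # warmup length from the table
TOTAL_STEPS = 10_000  # total training steps from the table


def learning_rate(step):
    """Learning rate at a given step (assumed schedule shape, not the repo's code)."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 up to the peak value.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return END_LR + (PEAK_LR - END_LR) * cosine
```

Under these assumptions the rate peaks at step 200 and falls smoothly to 5e-6 by the end of training, so the published step-9999 checkpoint was trained almost entirely in the decay phase.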

Attribution

Built on the open-source Pi0.5 backbone (Physical Intelligence) and the Robot Learning Collective / IliaLarchenko BEHAVIOR-1K solution, with post-training ideas from the Comet report. Full credit and references in the project README.

Citation

@techreport{liu2026behavior1k_rft,
  author      = {Shao-Yang Liu},
  title       = {Adapting Vision-Language-Action Models for BEHAVIOR-1K Household Tasks},
  institution = {National Tsing Hua University},
  year        = {2026},
  email       = {shawnliu@gapp.nthu.edu.tw}
}

