pi05-rft-behavior1k
A Pi0.5 vision-language-action policy adapted for long-horizon BEHAVIOR-1K household manipulation (2025 BEHAVIOR Challenge, NeurIPS 2025).
This checkpoint is the ft_ckpt2 model (training step 9999), a task-specialised
fine-tune used for four tasks: picking_up_trash, picking_up_toys, tidying_bedroom,
and collecting_childrens_toys (task IDs 1, 7, 18, and 21).
- 📦 Code & full pipeline: https://github.com/Sunliu36/Behavior1kChallenge_Solution_by_SHAWN
- 🧪 Sibling method (point-cloud SFT): https://github.com/Sunliu36/Behavior1KChallenge_minor_Solution_by_SHAWN
What it is
An Orbax checkpoint containing params/ (model weights) and assets/
(per-timestamp normalisation stats). It can be loaded directly by the policy server in the
project repo:

    uv run scripts/serve_b1k.py policy:checkpoint \
        --policy.config pi_behavior_b1k_fast \
        --policy.dir <path-to-this-checkpoint>

The optimizer state (train_state/) is intentionally omitted; only the
inference-ready weights and normalisation assets are published.
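
Before starting the server, it can help to confirm that a downloaded copy has the layout described above. The following is a minimal sketch, not part of the project repo: the helper name `inspect_checkpoint` is hypothetical, and it only checks that the directories exist, not that Orbax can restore their contents. The demo at the bottom builds a throwaway directory so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

# Sub-directories named in this README.
REQUIRED = ("params", "assets")   # expected in the published checkpoint
NOT_PUBLISHED = ("train_state",)  # optimizer state, intentionally left out


def inspect_checkpoint(ckpt_dir):
    """Return (found, missing) for the expected checkpoint sub-directories.

    found   maps each known sub-directory name to True/False.
    missing lists the required sub-directories that are absent.
    """
    root = Path(ckpt_dir)
    found = {name: (root / name).is_dir() for name in REQUIRED + NOT_PUBLISHED}
    missing = [name for name in REQUIRED if not found[name]]
    return found, missing


# Demo on a temporary directory that mimics the published layout.
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED:
        (Path(tmp) / name).mkdir()
    found, missing = inspect_checkpoint(tmp)
    print("found:", found)
    print("missing required:", missing)
```

To check a real download, call `inspect_checkpoint` with the same path that is passed to `--policy.dir`; an empty `missing` list means both required directories are present.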
Fine-tuning data
This checkpoint was produced by continuing training from the base Pi0.5
checkpoint (checkpoint_2) on human teleoperation demonstrations drawn from the
IliaLarchenko/behavior_224_rgb
LeRobot dataset, filtered to the four target tasks (head camera, 224×224 RGB):
| Task ID | Task | Episodes | Frames | ≈ Hours @30 fps |
|---|---|---|---|---|
| 1 | picking_up_trash | 200 | 1,053,550 | 9.8 |
| 7 | picking_up_toys | 200 | 3,778,110 | 35.0 |
| 18 | tidying_bedroom | 200 | 2,207,489 | 20.4 |
| 21 | collecting_childrens_toys | 200 | 3,837,265 | 35.5 |
| Total | | 800 | 10,876,414 | 100.7 |
In total, the fine-tune used 800 demonstration episodes (200 per task): about 10.9 M frames, or roughly 100.7 hours of teleoperation at 30 fps, from the head camera only.
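
The hours column follows directly from the frame counts and the 30 fps frame rate (hours = frames / 30 / 3600). The short script below reproduces that arithmetic from the numbers in the table; the variable and function names are illustrative only.

```python
FPS = 30  # frame rate stated in the table header

# Frame counts copied from the table above.
frames_per_task = {
    "picking_up_trash": 1_053_550,
    "picking_up_toys": 3_778_110,
    "tidying_bedroom": 2_207_489,
    "collecting_childrens_toys": 3_837_265,
}


def frames_to_hours(frames, fps=FPS):
    """Convert a frame count to hours of footage at the given frame rate."""
    return frames / fps / 3600


total_frames = sum(frames_per_task.values())

for task, frames in frames_per_task.items():
    print(f"{task:28s} {frames:>10,d} frames  {frames_to_hours(frames):6.1f} h")
print(f"{'total':28s} {total_frames:>10,d} frames  {frames_to_hours(total_frames):6.1f} h")
```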
Note: this particular checkpoint is a supervised fine-tune on human demonstrations. The broader project also explores rejection-sampling fine-tuning (RFT) with pose-perturbed rollouts — see the GitHub repo for that pipeline.
Training setup
| Setting | Value |
|---|---|
| Backbone | Pi0.5 (PaliGemma VLM + flow-matching action expert), task embeddings (no text) |
| Init from | base checkpoint_2 params |
| Frozen | PaliGemma LLM backbone + vision backbone (only action-specific params train) |
| Steps | 10,000 (this checkpoint = step 9999) |
| Batch size | 8 |
| LR schedule | cosine decay with 200 warmup steps, peak LR 5e-5 decaying to 5e-6 |
| Action space | Δ-joint, 30-step horizon, 32-dim |
| Aux losses | FAST tokens (0.05), subtask/stage prediction (0.1), correlation-aware flow noise (β=0.5) |
| Camera | head (224×224 RGB) |
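
For readers who want to see the shape of the learning-rate schedule listed in the table, here is a minimal sketch in plain Python. It takes the stated values (200 warmup steps, peak 5e-5, final 5e-6, 10,000 steps) and makes two assumptions that the table does not confirm: warmup is linear starting from zero, and the cosine decay spans all remaining steps. The training code in the project repo may differ in both details.

```python
import math

# Values from the training-setup table.
PEAK_LR = 5e-5
END_LR = 5e-6
WARMUP_STEPS = 200
TOTAL_STEPS = 10_000


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to END_LR.

    Assumptions (not confirmed by the source): warmup starts at 0 and the
    decay runs from WARMUP_STEPS to TOTAL_STEPS.
    """
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return END_LR + (PEAK_LR - END_LR) * cosine


# Print the schedule at a few representative steps.
for step in (0, 100, 200, 5_100, 9_999):
    print(f"step {step:>5d}: lr = {learning_rate(step):.3e}")
```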
Attribution
Built on the open-source Pi0.5 backbone (Physical Intelligence) and the Robot Learning Collective / IliaLarchenko BEHAVIOR-1K solution, with post-training ideas from the Comet report. Full credit and references in the project README.
Citation
@techreport{liu2026behavior1k_rft,
author = {Shao-Yang Liu},
title = {Adapting Vision-Language-Action Models for BEHAVIOR-1K Household Tasks},
institution = {National Tsing Hua University},
year = {2026},
email = {shawnliu@gapp.nthu.edu.tw}
}