pi05-rft-behavior1k

A Pi0.5 vision-language-action policy adapted for long-horizon BEHAVIOR-1K household manipulation (2025 BEHAVIOR Challenge, NeurIPS 2025).

This checkpoint is the ft_ckpt2 model (training step 9999), a task-specialised fine-tune used for four tasks: picking_up_trash, picking_up_toys, tidying_bedroom, and collecting_childrens_toys (task IDs 1, 7, 18, and 21).

What it is

An orbax checkpoint with params/ + assets/ (per-timestamp normalisation stats), directly loadable by the policy server in the project repo:

uv run scripts/serve_b1k.py policy:checkpoint \
    --policy.config pi_behavior_b1k_fast \
    --policy.dir <path-to-this-checkpoint>

The optimizer state (train_state/) is intentionally not included — only the inference-ready weights and normalisation assets are published.
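Before starting the server, it can help to confirm that a downloaded copy has the layout described above. The following is a minimal sketch only: the helper name `check_checkpoint_layout` is hypothetical and not part of the project repo, and it checks directory names without validating the orbax contents.

```python
from pathlib import Path
import tempfile


def check_checkpoint_layout(ckpt_dir):
    """Return a list of layout problems; an empty list means the layout looks right.

    Assumption: the published checkpoint contains params/ and assets/ at its
    top level, as stated in this model card. train_state/ is expected to be absent.
    """
    root = Path(ckpt_dir)
    problems = []
    for name in ("params", "assets"):
        if not (root / name).is_dir():
            problems.append(f"missing directory: {name}/")
    return problems


if __name__ == "__main__":
    # Demonstration on a throwaway directory that mimics the published layout.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "params").mkdir()
        (Path(tmp) / "assets").mkdir()
        print(check_checkpoint_layout(tmp))
```

An empty list means both directories are present; the absence of train_state/ is not reported because it is intentional for this release.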

Fine-tuning data

This checkpoint was produced by continuing training from the base Pi0.5 checkpoint (checkpoint_2) on human teleoperation demonstrations drawn from the IliaLarchenko/behavior_224_rgb LeRobot dataset, filtered to the four target tasks (head camera, 224×224 RGB):

| Task ID | Task | Episodes | Frames | ≈ Hours @ 30 fps |
| --- | --- | --- | --- | --- |
| 1 | picking_up_trash | 200 | 1,053,550 | 9.8 |
| 7 | picking_up_toys | 200 | 3,778,110 | 35.0 |
| 18 | tidying_bedroom | 200 | 2,207,489 | 20.4 |
| 21 | collecting_childrens_toys | 200 | 3,837,265 | 35.5 |
| Total | | 800 | 10,876,414 | ≈ 100.7 |

In total, the fine-tune used 800 demonstration episodes (200 per task): roughly 10.9 M frames, or about 100 hours of teleoperation at 30 fps, from the head camera only.
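The hour figures follow directly from the frame counts at 30 fps. A short sketch of that arithmetic, using only the numbers listed above (the variable and function names are illustrative):

```python
FPS = 30  # frame rate stated in this model card

# Frame counts per task, copied from the table above.
frames = {
    "picking_up_trash": 1_053_550,
    "picking_up_toys": 3_778_110,
    "tidying_bedroom": 2_207_489,
    "collecting_childrens_toys": 3_837_265,
}


def frames_to_hours(n_frames, fps=FPS):
    """Convert a frame count to hours of recording at the given frame rate."""
    return n_frames / fps / 3600


total_frames = sum(frames.values())

for task, n in frames.items():
    print(f"{task}: {frames_to_hours(n):.1f} h")
print(f"total: {total_frames:,} frames, {frames_to_hours(total_frames):.1f} h")
```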

Note: this particular checkpoint is a supervised fine-tune on human demonstrations. The broader project also explores rejection-sampling fine-tuning (RFT) with pose-perturbed rollouts — see the GitHub repo for that pipeline.

Training setup

| Setting | Value |
| --- | --- |
| Backbone | Pi0.5 (PaliGemma VLM + flow-matching action expert), task embeddings (no text) |
| Init from | base checkpoint_2 params |
| Frozen | PaliGemma LLM backbone + vision backbone (only action-specific params train) |
| Steps | 10,000 (this checkpoint = step 9999) |
| Batch size | 8 |
| Optimizer | cosine-decay LR schedule, 200 warmup steps, peak LR 5e-5 decaying to 5e-6 |
| Action space | Δ-joint, 30-step horizon, 32-dim |
| Aux losses | FAST tokens (0.05), subtask/stage prediction (0.1), correlation-aware flow noise (β=0.5) |
| Camera | head (224×224 RGB) |
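To make the learning-rate settings concrete, here is a rough sketch of a warmup plus cosine-decay schedule using the values listed above. Several details are assumptions rather than facts from the training code: that warmup is linear and starts from zero, that "200" counts steps, and that the decay spans the remaining steps up to 10,000. The function name `learning_rate` is illustrative.

```python
import math

PEAK_LR = 5e-5        # peak learning rate from the table
END_LR = 5e-6         # final learning rate from the table
WARMUP_STEPS = 200    # assumption: warmup is measured in optimizer steps
TOTAL_STEPS = 10_000  # total training steps from the table


def learning_rate(step):
    """Learning rate at a given step under the assumed schedule shape."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 up to the peak value.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    # Cosine factor goes from 1 at the peak to 0 at the end of training.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return END_LR + (PEAK_LR - END_LR) * cosine


if __name__ == "__main__":
    for step in (0, 100, 200, 5_100, 9_999):
        print(step, f"{learning_rate(step):.3e}")
```

Under these assumptions the rate reaches 5e-5 at step 200 and is very close to 5e-6 by step 9999, the step of this checkpoint.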

Attribution

Built on the open-source Pi0.5 backbone (Physical Intelligence) and the Robot Learning Collective / IliaLarchenko BEHAVIOR-1K solution, with post-training ideas from the Comet report. Full credit and references in the project README.

Citation

@techreport{liu2026behavior1k_rft,
  author      = {Shao-Yang Liu},
  title       = {Adapting Vision-Language-Action Models for BEHAVIOR-1K Household Tasks},
  institution = {National Tsing Hua University},
  year        = {2026},
  email       = {shawnliu@gapp.nthu.edu.tw}
}