
pi0.5 Packed Multi-Arm OpenPI Artifacts

This repo packages the full local artifact set for packed-action-head studies on pi0.5 across TWIN handover and TWIN dual-push, including:

  • all finished checkpoints under openpi/checkpoints/
  • the modified openpi/ training and evaluation code
  • train/eval logs and structured metric tables
  • reproducibility manifests and environment snapshots

Three runs are included:

  1. an initial 2K baseline-vs-parallel comparison
  2. a longer 10K follow-up on the same packed setup
  3. a 5K dual-push 128 screening study on the same packed path

This update also adds a split-action-expert bring-up bundle for the packed TWIN path, covering:

  • exact single-to-split warm-start checkpoints for split_independent and split_communicating
  • invariant checks for the new split architecture
  • detached real-data smoke and 20-step training runs on lsnu/twin_dual_push_128_train
  • the code changes that introduce the new split-expert action path

Experiment setup

  • Handover train/val: lsnu/twin_handover_256_train, lsnu/twin_handover_256_val
  • Dual-push train/val: lsnu/twin_dual_push_128_train, lsnu/twin_dual_push_128_val
  • Hardware: 4x H100 80GB
  • Precision: bfloat16
  • Semantic packed layout: [L8, 0x8, R8, 0x8]
  • Active action-loss dims: [0:8] and [16:24]
  • Masked padded dims: [8:16] and [24:32]
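The packed layout above can be sketched as a boolean action-loss mask; this is an illustrative stand-in (names and helper are hypothetical, not the repo's actual API):

```python
import numpy as np

# Sketch of the semantic packed layout [L8, 0x8, R8, 0x8]:
# left-arm dims 0:8 and right-arm dims 16:24 carry loss, while the
# zero-padded dims 8:16 and 24:32 are masked out.
ACTION_DIM = 32
ACTIVE_SLICES = (slice(0, 8), slice(16, 24))

def packed_action_loss_mask(action_dim: int = ACTION_DIM) -> np.ndarray:
    """Boolean mask that is True only on the active action-loss dims."""
    mask = np.zeros(action_dim, dtype=bool)
    for s in ACTIVE_SLICES:
        mask[s] = True
    return mask

mask = packed_action_loss_mask()
print(int(mask.sum()))  # 16 active dims out of 32
```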

Headline results

Teacher-forced masked validation loss:

| Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
|---|---|---|---|---|---|
| Packed baseline | 0.035776 | 0.061130 | 0.041595 | 0.027324 | 0.022345 |
| Packed parallel | 0.035680 | 0.059715 | 0.039947 | 0.027340 | 0.022168 |

Sample-based eval on the fixed 10K final validation subset:

| Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
|---|---|---|---|---|
| Packed baseline | 0.029935 | 0.030294 | 2:13:40 | 35.23 GB |
| Packed parallel | 0.029277 | 0.030241 | 2:20:51 | 35.27 GB |

The long run still shows a very small parallel edge on teacher-forced validation loss by 10K, while the sample-based eval is essentially a tie.
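A minimal sketch of a masked MAE in the spirit of the sample-based eval above, assuming the packed dim layout from the experiment setup (the function name is illustrative, not the repo's actual API):

```python
import numpy as np

def masked_mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error restricted to the active packed action dims."""
    mask = np.zeros(pred.shape[-1], dtype=bool)
    mask[0:8] = True    # left-arm active dims
    mask[16:24] = True  # right-arm active dims
    return float(np.abs(pred - target)[..., mask].mean())

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 10, 32))   # (batch, horizon, action_dim)
target = pred.copy()
target[..., 8:16] += 1.0              # perturb only padded dims
print(masked_mae(pred, target))       # 0.0 -- padded dims are ignored
```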

Dual-push 128 screening results:

| Model | 1K val loss | 2K val loss | 5K val loss | 5K 4-step MAE | 5K 10-step MAE | Train runtime |
|---|---|---|---|---|---|---|
| Packed baseline | 0.095597 | 0.083194 | 0.055958 | 0.056830 | 0.058973 | 1:05:25 |
| Packed parallel | 0.093704 | 0.082729 | 0.055242 | 0.054630 | 0.056627 | 1:00:33 |

The dual-push screening run shows a small but consistent parallel edge at 1K, 2K, and 5K on both teacher-forced validation loss and fixed-subset sample MAE.

Warm-start note

The packed parallel warm-start uses the slice/fuse mapping implemented in openpi/scripts/init_parallel_pi05_from_single_pytorch.py, but the added step-0 numerical checks show that the warm-started model is not exactly identical to the single-expert model end-to-end on a real batch:

  • handover 10K: input_projection_max_abs_diff = 0.00122881, masked_loss_abs_diff = 0.00398052
  • dual-push 5K: input_projection_max_abs_diff = 0.00099802, masked_loss_abs_diff = 0.08580410
  • both checks report warmstart_equivalent = False

So this repo should be read as a matched warm-start study, not as a bitwise-identical step-0 control.
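An illustrative step-0 check in the spirit of check_parallel_warmstart_equivalence.py: compare reference and warm-started forward passes on the same batch and report max-abs diffs against a tolerance. All names and the tolerance are hypothetical, not the script's actual interface:

```python
import numpy as np

def warmstart_report(ref_out: np.ndarray, new_out: np.ndarray,
                     ref_loss: float, new_loss: float,
                     atol: float = 1e-6) -> dict:
    """Summarize step-0 numerical agreement between two models."""
    max_abs_diff = float(np.max(np.abs(ref_out - new_out)))
    loss_abs_diff = abs(ref_loss - new_loss)
    return {
        "input_projection_max_abs_diff": max_abs_diff,
        "masked_loss_abs_diff": loss_abs_diff,
        "warmstart_equivalent": max_abs_diff <= atol and loss_abs_diff <= atol,
    }

# Plugging in the handover-10K diffs reported above yields False:
report = warmstart_report(np.array([0.0]), np.array([0.00122881]),
                          0.0, 0.00398052)
print(report["warmstart_equivalent"])  # False
```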

Split-Expert Bring-Up (2026-03-10)

The repo now also contains a true split-action-expert implementation alongside the earlier packed head-only factorization. The new config flag, action_expert_mode, takes one of four values:

  • shared
  • head_only_parallel
  • split_independent
  • split_communicating
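The four mode values can be sketched as an enum with a helper that distinguishes the split modes; this is an illustrative stand-in, not the repo's actual config classes:

```python
from enum import Enum

class ActionExpertMode(str, Enum):
    SHARED = "shared"
    HEAD_ONLY_PARALLEL = "head_only_parallel"
    SPLIT_INDEPENDENT = "split_independent"
    SPLIT_COMMUNICATING = "split_communicating"

def uses_split_experts(mode: str) -> bool:
    """True for the two modes that instantiate separate left/right experts."""
    return ActionExpertMode(mode) in (
        ActionExpertMode.SPLIT_INDEPENDENT,
        ActionExpertMode.SPLIT_COMMUNICATING,
    )

print(uses_split_experts("split_independent"))   # True
print(uses_split_experts("head_only_parallel"))  # False
```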

Key bring-up results:

  • the split warm-start copies the original single gemma_expert into exact copies for the left and right expert branches in both split modes
  • split_independent passes the branch-local invariants:
    • identical left/right inputs produce identical suffix outputs
    • perturbing right-arm inputs leaves left-arm outputs unchanged, and vice versa
  • both split modes pass detached real-data training on packed TWIN dual-push:
    • 3-step real-data smoke run with checkpoint save
    • 20-step real-data training run with checkpoint save
  • the communicating model emits nonzero cross-arm attention diagnostics and remains finite through the real-data 20-step run
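The branch-local invariants can be illustrated with a toy stand-in in the spirit of check_split_expert_invariants.py: with split_independent experts, perturbing right-arm inputs must leave left-arm outputs unchanged. The linear "experts" below are hypothetical, not repo code:

```python
import numpy as np

rng = np.random.default_rng(0)
w_left = rng.normal(size=(8, 8))
w_right = w_left.copy()  # split warm-start copies the single expert into both branches

def split_independent_forward(x: np.ndarray) -> np.ndarray:
    """Toy per-arm forward pass: each branch sees only its own dims."""
    out = np.zeros_like(x)
    out[..., 0:8] = x[..., 0:8] @ w_left
    out[..., 16:24] = x[..., 16:24] @ w_right
    return out

x = rng.normal(size=(2, 32))
base = split_independent_forward(x)

# Perturb only right-arm dims; left-arm outputs must be unchanged.
x_pert = x.copy()
x_pert[..., 16:24] += 1.0
pert = split_independent_forward(x_pert)
print(np.allclose(base[..., 0:8], pert[..., 0:8]))    # True
print(np.allclose(base[..., 16:24], pert[..., 16:24]))  # False
```

The symmetric check (perturbing left-arm inputs) and the identical-input check follow the same pattern.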

New bring-up artifact bundle:

  • artifacts/twin_split_expert_bringup_20260310/
    • split warm-start checkpoints
    • invariant-check outputs
    • reproducibility commands
    • summary README for the split-expert bring-up

Repo layout

  • openpi/
    • modified source and scripts used for training/eval
    • copied norm-stats assets for the packed configs
    • full 2K, 10K, and dual-push 5K checkpoint trees
  • artifacts/twin_handover_packed_parallelization_20260309/
    • initial 2K study bundle
  • artifacts/twin_handover_packed_parallelization_10k_20260309/
    • 10K follow-up bundle with metrics, logs, repro manifests, and environment snapshot
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/
    • dual-push 128 screening bundle with metrics, logs, repro manifests, and environment snapshot
  • artifacts/twin_split_expert_bringup_20260310/
    • split-expert warm-start checkpoints, sanity checks, and bring-up repro commands
  • artifacts/pi05_base_params/
    • staged base parameter snapshot used during JAX-to-PyTorch conversion

Key files

  • Full report: REPORT.md
  • 2K summary: artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json
  • 10K summary: artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json
  • 10K comparison table: artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv
  • dual-push 5K summary: artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json
  • dual-push 5K teacher-forced table: artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv
  • dual-push 5K sample eval table: artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv
  • dual-push 5K environment snapshot: artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/
  • split-expert bring-up summary: artifacts/twin_split_expert_bringup_20260310/README.md
  • split-expert repro commands: artifacts/twin_split_expert_bringup_20260310/repro/commands_bringup.sh
  • split-expert invariant check outputs: artifacts/twin_split_expert_bringup_20260310/sanity_checks/
  • split-expert real-data logs: openpi/run_logs/split_independent_real_smoke3_r2.log, openpi/run_logs/split_communicating_real_smoke3.log, openpi/run_logs/split_independent_real_train20.log, openpi/run_logs/split_communicating_real_train20.log
  • split-expert real-data checkpoints: openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_independent_pytorch_5k/, openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_communicating_pytorch_5k/
  • 10K repro commands: artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh
  • 10K changed-file manifest: artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
  • 10K environment snapshot: artifacts/twin_handover_packed_parallelization_10k_20260309/environment/

Main changed files

Initial 2K + 10K study logic lives primarily in:

  • openpi/src/openpi/transforms.py
  • openpi/src/openpi/training/config.py
  • openpi/src/openpi/training/data_loader.py
  • openpi/src/openpi/models/model.py
  • openpi/src/openpi/models/tokenizer.py
  • openpi/src/openpi/models_pytorch/pi0_pytorch.py
  • openpi/scripts/train_pytorch.py
  • openpi/scripts/eval_twin_val_loss_pytorch.py
  • openpi/scripts/init_parallel_pi05_from_single_pytorch.py
  • openpi/scripts/inspect_twin_packed_batch.py
  • openpi/scripts/check_parallel_warmstart_equivalence.py
  • openpi/scripts/check_split_expert_invariants.py
  • openpi/scripts/run_twin_handover_packed_followup.sh
  • openpi/scripts/run_twin_handover_packed_10k.sh
  • openpi/scripts/run_twin_dual_push_128_packed_5k.sh

The per-file rationale is recorded in:

  • artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt
  • artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt