# pi0.5 Packed Multi-Arm OpenPI Artifacts

This repo packages the full local artifact set for packed-action-head studies on `pi0.5` across TWIN handover and TWIN dual-push, including:

- all finished checkpoints under `openpi/checkpoints/`
- the modified `openpi/` training and evaluation code
- train/eval logs and structured metric tables
- reproducibility manifests and environment snapshots

Four runs are included:

1. an initial `2K` baseline-vs-parallel comparison
2. a longer `10K` follow-up on the same packed setup
3. a `5K` dual-push `128` screening study on the same packed path
4. a `2K` dual-push `128` four-way step comparison across `shared`, `head_only_parallel`, `split_independent`, and `split_communicating`

This update also adds a split-action-expert bring-up bundle for the packed TWIN path, covering:

- exact single-to-split warm-start checkpoints for `split_independent` and `split_communicating`
- invariant checks for the new split architecture
- detached real-data smoke and `20`-step training runs on `lsnu/twin_dual_push_128_train`
- the code changes that introduce the new split-expert action path

## Experiment setup

- Handover train/val: `lsnu/twin_handover_256_train`, `lsnu/twin_handover_256_val`
- Dual-push train/val: `lsnu/twin_dual_push_128_train`, `lsnu/twin_dual_push_128_val`
- Hardware: `4x H100 80GB`
- Precision: `bfloat16`
- Semantic packed layout: `[L8, 0x8, R8, 0x8]`
- Active action-loss dims: `[0:8]` and `[16:24]`
- Masked padded dims: `[8:16]` and `[24:32]`

## Headline results

Teacher-forced masked validation loss:

| Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
| --- | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.035776` | `0.061130` | `0.041595` | `0.027324` | `0.022345` |
| Packed parallel | `0.035680` | `0.059715` | `0.039947` | `0.027340` | `0.022168` |

Sample-based eval on the fixed `10K` final validation subset:

| Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.029935` | `0.030294` | `2:13:40` | `35.23GB` |
| Packed parallel | `0.029277` | `0.030241` | `2:20:51` | `35.27GB` |

The long run still shows a very small parallel edge on teacher-forced validation loss by `10K`, while the sample-based eval is essentially a tie.

Dual-push `128` screening results:

| Model | 1K val loss | 2K val loss | 5K val loss | 5K 4-step MAE | 5K 10-step MAE | Train runtime |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.095597` | `0.083194` | `0.055958` | `0.056830` | `0.058973` | `1:05:25` |
| Packed parallel | `0.093704` | `0.082729` | `0.055242` | `0.054630` | `0.056627` | `1:00:33` |

The dual-push screening run shows a small but consistent parallel edge at `1K`, `2K`, and `5K` on both teacher-forced validation loss and fixed-subset sample MAE.

Dual-push `128` four-way `2K` step-comparison raw results follow.

Step-0 teacher-forced masked validation loss:

| Model | Step-0 val loss | Step-0 left/right imbalance |
| --- | ---: | ---: |
| Shared | `1.084735` | `0.505345` |
| Head-only parallel | `1.082985` | `0.501182` |
| Split independent | `1.328262` | `0.448843` |
| Split communicating | `1.783048` | `0.671085` |

Step-2000 teacher-forced masked validation loss:

| Model | Step-2000 val loss | Step-2000 left/right imbalance |
| --- | ---: | ---: |
| Shared | `0.055329` | `0.069564` |
| Head-only parallel | `0.055297` | `0.069380` |
| Split independent | `0.063537` | `0.092029` |
| Split communicating | `0.059952` | `0.080435` |

Step-2000 sample masked MAE:

| Model | 1-step MAE | 4-step MAE | 16-step MAE |
| --- | ---: | ---: | ---: |
| Shared | `0.087330` | `0.078164` | `0.085222` |
| Head-only parallel | `0.086764` | `0.078301` | `0.085272` |
| Split independent | `0.079100` | `0.070436` | `0.075281` |
| Split communicating | `0.078618` | `0.071087` | `0.075570` |

Full raw tables for the `0/100/500/2000` sweep live in:

- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`

## Warm-start note

The packed parallel warm-start uses the slice/fuse mapping implemented in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`, but the added step-0 numerical checks show it is not exactly identical end-to-end on a real batch:

- handover `10K`: `input_projection_max_abs_diff = 0.00122881`, `masked_loss_abs_diff = 0.00398052`
- dual-push `5K`: `input_projection_max_abs_diff = 0.00099802`, `masked_loss_abs_diff = 0.08580410`
- both checks report `warmstart_equivalent = False`

So this repo should be read as a matched warm-start study, not as a bitwise-identical step-0 control.

## Split-Expert Bring-Up (`2026-03-10`)

The repo now contains a true split-action-expert implementation in addition to the earlier packed head-only factorization.
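The branch-local invariants checked for `split_independent` (identical left/right inputs give identical suffix outputs; perturbing one arm's inputs leaves the other arm's outputs unchanged) can be illustrated with a toy sketch. The per-arm weight matrices and the `split_forward` helper below are hypothetical stand-ins, not the repo's model code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-ins for the two split expert branches: one weight
# matrix per arm. The warm-start copies the single expert into both branches.
W_single = rng.standard_normal((8, 8))
W_left = W_single.copy()
W_right = W_single.copy()

def split_forward(packed):
    """packed: [batch, 16] laid out as [left 8 | right 8]; each arm's slice
    only ever flows through its own branch."""
    left_out = packed[:, :8] @ W_left
    right_out = packed[:, 8:] @ W_right
    return np.concatenate([left_out, right_out], axis=-1)

x = rng.standard_normal((4, 16))
y = split_forward(x)

# Invariant 1: identical left/right inputs give identical branch outputs,
# since both branches were warm-started from the same single expert.
x_same = np.concatenate([x[:, :8], x[:, :8]], axis=-1)
y_same = split_forward(x_same)
assert np.array_equal(y_same[:, :8], y_same[:, 8:])

# Invariant 2: perturbing right-arm inputs leaves left-arm outputs unchanged.
x_pert = x.copy()
x_pert[:, 8:] += rng.standard_normal((4, 8))
assert np.array_equal(split_forward(x_pert)[:, :8], y[:, :8])
```

A `split_communicating` model would intentionally violate invariant 2, since cross-arm attention couples the branches; the real checks live in `openpi/scripts/check_split_expert_invariants.py`.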
The new config flag is `action_expert_mode` with:

- `shared`
- `head_only_parallel`
- `split_independent`
- `split_communicating`

Key bring-up results:

- the split warm-start copies the original single `gemma_expert` into exact left/right expert branches for both split modes
- `split_independent` passes the branch-local invariants:
  - identical left/right inputs produce identical suffix outputs
  - perturbing right-arm inputs leaves left-arm outputs unchanged, and vice versa
- both split modes pass detached real-data training on packed TWIN dual-push:
  - `3`-step real-data smoke run with checkpoint save
  - `20`-step real-data training run with checkpoint save
- the communicating model emits nonzero cross-arm attention diagnostics and remains finite through the real-data `20`-step run

New bring-up artifact bundle:

- `artifacts/twin_split_expert_bringup_20260310/`
  - split warm-start checkpoints
  - invariant-check outputs
  - reproducibility commands
  - summary README for the split-expert bring-up

## Repo layout

- `openpi/`
  - modified source and scripts used for training/eval
  - copied norm-stats assets for the packed configs
  - full `2K`, `10K`, and dual-push `5K` checkpoint trees
- `artifacts/twin_handover_packed_parallelization_20260309/`
  - initial `2K` study bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/`
  - `10K` follow-up bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/`
  - dual-push `128` screening bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
  - dual-push `128` four-way `2K` step-comparison bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
  - small preflight/debug snapshot from the interrupted bring-up path; useful for debugging the runner, not the canonical result bundle
- `artifacts/twin_split_expert_bringup_20260310/`
  - split-expert bring-up bundle committed with summary README, repro commands, detached run logs, and sanity checks
- `openpi/run_logs/`
  - raw local split bring-up logs kept for completeness; the canonical copies for the finalized bring-up record live under `artifacts/twin_split_expert_bringup_20260310/run_logs/`
- `openpi/scripts/upload_stepcmp_bundle_to_hf.py`
  - the committed high-throughput HF uploader for the step-comparison bundle and retained checkpoints; it uses `huggingface_hub.HfApi.upload_large_folder(...)`
- `artifacts/pi05_base_params/`
  - staged base parameter snapshot used during JAX-to-PyTorch conversion

## Committed artifact note

For this update, the committed artifact payloads are:

- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
  - the official finalized `4`-model dual-push `2K` step-comparison bundle
- `artifacts/twin_split_expert_bringup_20260310/`
  - the split-expert bring-up bundle used as the sanity and warm-start reference
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
  - a small debug-only environment snapshot from the failed/resumed bring-up sequence

The debug bundle is intentionally committed only as runner diagnostics. The canonical study outputs are the non-`_debug` step-comparison bundle plus the split bring-up bundle.

## Future commit/upload workflow

When adding new experiment results to this repo:

- keep the canonical bundle under `artifacts//` and only retain the checkpoint steps that are scientifically required under `openpi/checkpoints/`
- before claiming the repo is fully committed, audit ignored artifact paths explicitly:
  - `git ls-files --others -i --exclude-standard --directory -- openpi/checkpoints artifacts openpi/run_logs run_logs`
- if a result is intentionally kept in an ignored path such as `openpi/checkpoints/` or `openpi/run_logs/`, force-add it explicitly with `git add --sparse -f ...`
- use `openpi/scripts/upload_stepcmp_bundle_to_hf.py` for large HF uploads; it uses `huggingface_hub.HfApi.upload_large_folder(...)` and is the preferred path for checkpoint-heavy updates
- never hardcode HF credentials in scripts, logs, or READMEs; keep the credential in `HF_TOKEN` or load it from `HF_TOKEN_FILE`, and check for literal `hf_...` strings before committing

## Key files

- Full report: `REPORT.md`
- `2K` summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `10K` summary: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `10K` comparison table: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- dual-push `5K` summary: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- dual-push `5K` teacher-forced table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- dual-push `5K` sample eval table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- dual-push `5K` environment snapshot: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
- dual-push `2K` step-comparison summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/summary.json`
- dual-push `2K` step-comparison README: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/README.md`
- dual-push `2K` teacher-forced table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- dual-push `2K` sample eval table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- dual-push `2K` training summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
- split-expert bring-up summary: `artifacts/twin_split_expert_bringup_20260310/README.md`
- split-expert repro commands: `artifacts/twin_split_expert_bringup_20260310/repro/commands_bringup.sh`
- split-expert invariant check outputs: `artifacts/twin_split_expert_bringup_20260310/sanity_checks/`
- split-expert real-data logs: `openpi/run_logs/split_independent_real_smoke3_r2.log`, `openpi/run_logs/split_communicating_real_smoke3.log`, `openpi/run_logs/split_independent_real_train20.log`, `openpi/run_logs/split_communicating_real_train20.log`
- split-expert real-data checkpoints: `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_independent_pytorch_5k/`, `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_communicating_pytorch_5k/`
- `10K` repro commands: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `10K` changed-file manifest: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `10K` environment snapshot: `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`

## Main changed files

Initial `2K` + `10K` study logic lives primarily in:

- `openpi/src/openpi/transforms.py`
- `openpi/src/openpi/training/config.py`
- `openpi/src/openpi/training/data_loader.py`
- `openpi/src/openpi/models/model.py`
- `openpi/src/openpi/models/tokenizer.py`
- `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
- `openpi/scripts/train_pytorch.py`
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
- `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
- `openpi/scripts/inspect_twin_packed_batch.py`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
- `openpi/scripts/check_split_expert_invariants.py`
- `openpi/scripts/run_twin_handover_packed_followup.sh`
- `openpi/scripts/run_twin_handover_packed_10k.sh`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`

The per-file rationale is recorded in:

- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
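For orientation, the semantic packed layout `[L8, 0x8, R8, 0x8]` and the active/masked action-loss dims from the "Experiment setup" section can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the repo's actual transform or loss code:

```python
import numpy as np

PACKED_DIM = 32
# Packed layout [L8, 0x8, R8, 0x8]: left-arm action dims at [0:8],
# right-arm action dims at [16:24], zero padding at [8:16] and [24:32].
ACTIVE_SLICES = (slice(0, 8), slice(16, 24))

def packed_action_mask():
    """Boolean mask over the 32 packed dims; True = contributes to the loss."""
    mask = np.zeros(PACKED_DIM, dtype=bool)
    for s in ACTIVE_SLICES:
        mask[s] = True
    return mask

def masked_mae(pred, target, mask):
    """MAE over active dims only; padded dims [8:16] and [24:32] are ignored,
    matching the masked metrics reported in the tables above."""
    return np.abs(pred - target)[..., mask].mean()

mask = packed_action_mask()
assert mask.sum() == 16  # 8 left-arm + 8 right-arm active dims
```

Anything written into the padded dims has no effect on `masked_mae`, which is why the reported losses and MAEs are described as "masked" throughout.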