# pi0.5 Packed Multi-Arm OpenPI Artifacts
This repo packages the full local artifact set for packed-action-head studies on `pi0.5` across TWIN handover and TWIN dual-push, including:
- all finished checkpoints under `openpi/checkpoints/`
- the modified `openpi/` training and evaluation code
- train/eval logs and structured metric tables
- reproducibility manifests and environment snapshots
Four runs are included:
1. an initial `2K` baseline-vs-parallel comparison on TWIN handover
2. a longer `10K` follow-up on the same packed handover setup
3. a `5K` dual-push `128` screening study on the same packed path
4. a `2K` dual-push `128` four-way step comparison across `shared`, `head_only_parallel`, `split_independent`, and `split_communicating`
This update also adds a split-action-expert bring-up bundle for the packed TWIN path, covering:
- exact single-to-split warm-start checkpoints for `split_independent` and `split_communicating`
- invariant checks for the new split architecture
- detached real-data smoke and `20`-step training runs on `lsnu/twin_dual_push_128_train`
- the code changes that introduce the new split-expert action path
## Experiment setup
- Handover train/val: `lsnu/twin_handover_256_train`, `lsnu/twin_handover_256_val`
- Dual-push train/val: `lsnu/twin_dual_push_128_train`, `lsnu/twin_dual_push_128_val`
- Hardware: `4x H100 80GB`
- Precision: `bfloat16`
- Semantic packed layout: `[L8, 0x8, R8, 0x8]`
- Active action-loss dims: `[0:8]` and `[16:24]`
- Masked padded dims: `[8:16]` and `[24:32]`
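The packed layout above can be sketched as a boolean loss mask over the 32 action dims (a minimal NumPy illustration; `masked_action_loss` is a hypothetical helper, not code from this repo):

```python
import numpy as np

# Sketch of the [L8, 0x8, R8, 0x8] packed layout: only the left-arm dims
# [0:8] and right-arm dims [16:24] contribute to the action loss; the
# zero-padded dims [8:16] and [24:32] are masked out.
ACTION_DIM = 32
loss_mask = np.zeros(ACTION_DIM, dtype=bool)
loss_mask[0:8] = True    # left-arm slice (L8)
loss_mask[16:24] = True  # right-arm slice (R8)

def masked_action_loss(pred, target, mask=loss_mask):
    """Mean absolute error over the active (unmasked) action dims only."""
    return np.abs(pred - target)[..., mask].mean()
```

Anything written into the padded slices is invisible to this loss, which is what makes the masked-MAE numbers below comparable across the packed variants.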
## Headline results
Teacher-forced masked validation loss:

| Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
| --- | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.035776` | `0.061130` | `0.041595` | `0.027324` | `0.022345` |
| Packed parallel | `0.035680` | `0.059715` | `0.039947` | `0.027340` | `0.022168` |

Sample-based eval on the fixed `10K` final validation subset:

| Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.029935` | `0.030294` | `2:13:40` | `35.23GB` |
| Packed parallel | `0.029277` | `0.030241` | `2:20:51` | `35.27GB` |

The long run still shows a very small parallel edge on teacher-forced validation loss by `10K`, while the sample-based eval is essentially a tie.
Dual-push `128` screening results:

| Model | 1K val loss | 2K val loss | 5K val loss | 5K 4-step MAE | 5K 10-step MAE | Train runtime |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.095597` | `0.083194` | `0.055958` | `0.056830` | `0.058973` | `1:05:25` |
| Packed parallel | `0.093704` | `0.082729` | `0.055242` | `0.054630` | `0.056627` | `1:00:33` |

The dual-push screening run shows a small but consistent parallel edge at `1K`, `2K`, and `5K` on both teacher-forced validation loss and fixed-subset sample MAE.
Dual-push `128` four-way `2K` step-comparison raw results.

Step-0 teacher-forced masked validation loss:

| Model | Step-0 val loss | Step-0 left/right imbalance |
| --- | ---: | ---: |
| Shared | `1.084735` | `0.505345` |
| Head-only parallel | `1.082985` | `0.501182` |
| Split independent | `1.328262` | `0.448843` |
| Split communicating | `1.783048` | `0.671085` |

Step-2000 teacher-forced masked validation loss:

| Model | Step-2000 val loss | Step-2000 left/right imbalance |
| --- | ---: | ---: |
| Shared | `0.055329` | `0.069564` |
| Head-only parallel | `0.055297` | `0.069380` |
| Split independent | `0.063537` | `0.092029` |
| Split communicating | `0.059952` | `0.080435` |

Step-2000 sample masked MAE:

| Model | 1-step MAE | 4-step MAE | 16-step MAE |
| --- | ---: | ---: | ---: |
| Shared | `0.087330` | `0.078164` | `0.085222` |
| Head-only parallel | `0.086764` | `0.078301` | `0.085272` |
| Split independent | `0.079100` | `0.070436` | `0.075281` |
| Split communicating | `0.078618` | `0.071087` | `0.075570` |

Full raw tables for the `0/100/500/2000` sweep live in:
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
## Warm-start note
The packed parallel warm-start uses the slice/fuse mapping implemented in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`, but the added step-0 numerical checks show it is not exactly identical end-to-end on a real batch:
- handover `10K`: `input_projection_max_abs_diff = 0.00122881`, `masked_loss_abs_diff = 0.00398052`
- dual-push `5K`: `input_projection_max_abs_diff = 0.00099802`, `masked_loss_abs_diff = 0.08580410`
- both checks report `warmstart_equivalent = False`
So this repo should be read as a matched warm-start study, not as a bitwise-identical step-0 control.
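The shape of that step-0 check can be sketched as follows (a hedged reimplementation sketch, not the repo's `check_parallel_warmstart_equivalence.py`; the function and argument names are illustrative):

```python
import numpy as np

def warmstart_report(proj_single, proj_parallel, loss_single, loss_parallel,
                     atol=1e-6):
    """Compare input-projection outputs and masked losses of the single-expert
    model and the warm-started parallel model on the same real batch."""
    proj_diff = float(np.max(np.abs(proj_single - proj_parallel)))
    loss_diff = abs(loss_single - loss_parallel)
    return {
        "input_projection_max_abs_diff": proj_diff,
        "masked_loss_abs_diff": loss_diff,
        "warmstart_equivalent": proj_diff <= atol and loss_diff <= atol,
    }
```

Diffs on the order of `1e-3` in the projection, as reported above, are enough to flip `warmstart_equivalent` to `False` under any tight tolerance, which is why the runs are read as matched warm starts rather than exact controls.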
## Split-Expert Bring-Up (`2026-03-10`)
The repo now contains a true split-action-expert implementation in addition to the earlier packed head-only factorization. The new config flag is `action_expert_mode` with four values:
- `shared`
- `head_only_parallel`
- `split_independent`
- `split_communicating`
Key bring-up results:
- the split warm-start copies the original single `gemma_expert` into exact left/right expert branches for both split modes
- `split_independent` passes the branch-local invariants:
  - identical left/right inputs produce identical suffix outputs
  - perturbing right-arm inputs leaves left-arm outputs unchanged, and vice versa
- both split modes pass detached real-data training on packed TWIN dual-push:
  - a `3`-step real-data smoke run with checkpoint save
  - a `20`-step real-data training run with checkpoint save
- the communicating model emits nonzero cross-arm attention diagnostics and stays finite through the real-data `20`-step run
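The branch-local invariants can be illustrated with a toy split (purely illustrative weights and shapes; this is not the repo's `check_split_expert_invariants.py`):

```python
import numpy as np

# Toy split_independent: each arm's branch depends only on its own inputs
# and its own weights, so cross-arm perturbations cannot leak.
rng = np.random.default_rng(0)
W_left = rng.normal(size=(8, 8))
W_right = rng.normal(size=(8, 8))

def split_independent(x_left, x_right):
    return x_left @ W_left, x_right @ W_right

x_l = rng.normal(size=8)
x_r = rng.normal(size=8)
y_l, y_r = split_independent(x_l, x_r)

# Perturbing the right-arm input must leave the left-arm output unchanged.
y_l2, y_r2 = split_independent(x_l, x_r + 1.0)
assert np.allclose(y_l, y_l2)        # left branch unaffected
assert not np.allclose(y_r, y_r2)    # right branch responds
```

The communicating mode deliberately breaks this isolation via cross-arm attention, which is why its invariant suite checks for nonzero cross-arm diagnostics instead.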
New bring-up artifact bundle:
- `artifacts/twin_split_expert_bringup_20260310/`
  - split warm-start checkpoints
  - invariant-check outputs
  - reproducibility commands
  - summary README for the split-expert bring-up
## Repo layout
- `openpi/`
  - modified source and scripts used for training/eval
  - copied norm-stats assets for the packed configs
  - full `2K`, `10K`, and dual-push `5K` checkpoint trees
- `artifacts/twin_handover_packed_parallelization_20260309/`
  - initial `2K` study bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/`
  - `10K` follow-up bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/`
  - dual-push `128` screening bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
  - dual-push `128` four-way `2K` step-comparison bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
  - small preflight/debug snapshot from the interrupted bring-up path; useful for debugging the runner, not the canonical result bundle
- `artifacts/twin_split_expert_bringup_20260310/`
  - split-expert bring-up bundle committed with summary README, repro commands, detached run logs, and sanity checks
## Committed artifact note
For this update, the committed artifact payloads are:
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
  - the official finalized `4`-model dual-push `2K` step-comparison bundle
- `artifacts/twin_split_expert_bringup_20260310/`
  - the split-expert bring-up bundle used as the sanity and warm-start reference
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
  - a small debug-only environment snapshot from the failed/resumed bring-up sequence
The debug bundle is intentionally committed only as runner diagnostics. The canonical study outputs are the non-`_debug` step-comparison bundle plus the split bring-up bundle.

Also committed:
- `openpi/run_logs/`
  - raw local split bring-up logs kept for completeness; the canonical copies for the finalized bring-up record live under `artifacts/twin_split_expert_bringup_20260310/run_logs/`
- `openpi/scripts/upload_stepcmp_bundle_to_hf.py`
  - the committed high-throughput HF uploader for the step-comparison bundle and retained checkpoints; it uses `huggingface_hub.HfApi.upload_large_folder(...)`
- `artifacts/pi05_base_params/`
  - staged base parameter snapshot used during JAX-to-PyTorch conversion
## Future commit/upload workflow
When adding new experiment results to this repo:
- keep the canonical bundle under `artifacts/<study_name>/` and retain under `openpi/checkpoints/` only the checkpoint steps that are scientifically required
- before claiming the repo is fully committed, audit ignored artifact paths explicitly:
  - `git ls-files --others -i --exclude-standard --directory -- openpi/checkpoints artifacts openpi/run_logs run_logs`
- if a result is intentionally kept in an ignored path such as `openpi/checkpoints/` or `openpi/run_logs/`, force-add it explicitly with `git add --sparse -f ...`
- use `openpi/scripts/upload_stepcmp_bundle_to_hf.py` for large HF uploads; it uses `huggingface_hub.HfApi.upload_large_folder(...)` and is the preferred path for checkpoint-heavy updates
- never hardcode HF credentials in scripts, logs, or READMEs; keep the credential in `HF_TOKEN` or load it from `HF_TOKEN_FILE`, and check for literal `hf_...` strings before committing
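The last point can be automated with a simple pre-commit scan (a sketch assuming HF access tokens carry the literal `hf_` prefix; the regex and file filter are illustrative, not this repo's tooling):

```python
import re
from pathlib import Path

# Flag token-like literals (hf_ prefix plus a long alphanumeric tail) in
# common text files before committing.
TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{16,}\b")
TEXT_SUFFIXES = {".py", ".sh", ".md", ".log", ".txt", ".json", ".yaml"}

def find_token_leaks(root="."):
    """Return (path, line_number) pairs where a token-like literal appears."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in TEXT_SUFFIXES:
            lines = path.read_text(errors="ignore").splitlines()
            for lineno, line in enumerate(lines, start=1):
                if TOKEN_RE.search(line):
                    hits.append((str(path), lineno))
    return hits
```

Running a scan like this over the repo root before each artifact commit makes the "no literal `hf_...` strings" rule checkable rather than aspirational.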
## Key files
- Full report: `REPORT.md`
- `2K` summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `10K` summary: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `10K` comparison table: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- dual-push `5K` summary: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- dual-push `5K` teacher-forced table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- dual-push `5K` sample eval table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- dual-push `5K` environment snapshot: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
- dual-push `2K` step-comparison summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/summary.json`
- dual-push `2K` step-comparison README: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/README.md`
- dual-push `2K` teacher-forced table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- dual-push `2K` sample eval table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- dual-push `2K` training summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
- split-expert bring-up summary: `artifacts/twin_split_expert_bringup_20260310/README.md`
- split-expert repro commands: `artifacts/twin_split_expert_bringup_20260310/repro/commands_bringup.sh`
- split-expert invariant check outputs: `artifacts/twin_split_expert_bringup_20260310/sanity_checks/`
- split-expert real-data logs: `openpi/run_logs/split_independent_real_smoke3_r2.log`, `openpi/run_logs/split_communicating_real_smoke3.log`, `openpi/run_logs/split_independent_real_train20.log`, `openpi/run_logs/split_communicating_real_train20.log`
- split-expert real-data checkpoints: `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_independent_pytorch_5k/`, `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_communicating_pytorch_5k/`
- `10K` repro commands: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `10K` changed-file manifest: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `10K` environment snapshot: `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
## Main changed files
Initial `2K` + `10K` study logic lives primarily in:
- `openpi/src/openpi/transforms.py`
- `openpi/src/openpi/training/config.py`
- `openpi/src/openpi/training/data_loader.py`
- `openpi/src/openpi/models/model.py`
- `openpi/src/openpi/models/tokenizer.py`
- `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
- `openpi/scripts/train_pytorch.py`
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
- `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
- `openpi/scripts/inspect_twin_packed_batch.py`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
- `openpi/scripts/check_split_expert_invariants.py`
- `openpi/scripts/run_twin_handover_packed_followup.sh`
- `openpi/scripts/run_twin_handover_packed_10k.sh`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
The per-file rationale is recorded in:
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`