# pi0.5 Packed Multi-Arm OpenPI Artifacts
This repo packages the full local artifact set for packed-action-head studies on `pi0.5` across TWIN handover and TWIN dual-push, including:
- all finished checkpoints under `openpi/checkpoints/`
- the modified `openpi/` training and evaluation code
- train/eval logs and structured metric tables
- reproducibility manifests and environment snapshots
Four runs are included:
1. an initial `2K` baseline-vs-parallel comparison
2. a longer `10K` follow-up on the same packed setup
3. a `5K` dual-push `128` screening study on the same packed path
4. a `2K` dual-push `128` four-way step comparison across `shared`, `head_only_parallel`, `split_independent`, and `split_communicating`
This update also adds a split-action-expert bring-up bundle for the packed TWIN path, covering:
- exact single-to-split warm-start checkpoints for `split_independent` and `split_communicating`
- invariant checks for the new split architecture
- detached real-data smoke and `20`-step training runs on `lsnu/twin_dual_push_128_train`
- the code changes that introduce the new split-expert action path
## Experiment setup
- Handover train/val: `lsnu/twin_handover_256_train`, `lsnu/twin_handover_256_val`
- Dual-push train/val: `lsnu/twin_dual_push_128_train`, `lsnu/twin_dual_push_128_val`
- Hardware: `4x H100 80GB`
- Precision: `bfloat16`
- Semantic packed layout: `[L8, 0x8, R8, 0x8]`
- Active action-loss dims: `[0:8]` and `[16:24]`
- Masked padded dims: `[8:16]` and `[24:32]`
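The masked-loss bookkeeping implied by this layout can be sketched as follows. This is a minimal illustration only; the helper names are hypothetical and not the repo's actual API:

```python
import numpy as np

# Packed layout [L8, 0x8, R8, 0x8]: 32 action dims, with loss computed
# only on the active left [0:8] and right [16:24] slices.
def packed_action_loss_mask(dim: int = 32) -> np.ndarray:
    mask = np.zeros(dim, dtype=bool)
    mask[0:8] = True    # left-arm action dims (active)
    mask[16:24] = True  # right-arm action dims (active)
    return mask         # [8:16] and [24:32] stay masked (padding)

def masked_mae(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    # Mean absolute error over active dims only, in the spirit of the
    # "masked MAE" metrics reported below.
    err = np.abs(pred - target)[..., mask]
    return float(err.mean())
```

The same mask is what keeps the padded dims from contributing to either the training loss or the reported validation numbers.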
## Headline results
Teacher-forced masked validation loss:
| Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
| --- | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.035776` | `0.061130` | `0.041595` | `0.027324` | `0.022345` |
| Packed parallel | `0.035680` | `0.059715` | `0.039947` | `0.027340` | `0.022168` |
Sample-based eval on the fixed `10K` final validation subset:
| Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.029935` | `0.030294` | `2:13:40` | `35.23GB` |
| Packed parallel | `0.029277` | `0.030241` | `2:20:51` | `35.27GB` |
The long run still shows a very small parallel edge on teacher-forced validation loss by `10K`, while the sample-based eval is essentially a tie.
Dual-push `128` screening results:
| Model | 1K val loss | 2K val loss | 5K val loss | 5K 4-step MAE | 5K 10-step MAE | Train runtime |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.095597` | `0.083194` | `0.055958` | `0.056830` | `0.058973` | `1:05:25` |
| Packed parallel | `0.093704` | `0.082729` | `0.055242` | `0.054630` | `0.056627` | `1:00:33` |
The dual-push screening run shows a small but consistent parallel edge at `1K`, `2K`, and `5K` on both teacher-forced validation loss and fixed-subset sample MAE.
Dual-push `128` four-way `2K` step comparison raw results:
Step-0 teacher-forced masked validation loss:
| Model | Step-0 val loss | Step-0 left/right imbalance |
| --- | ---: | ---: |
| Shared | `1.084735` | `0.505345` |
| Head-only parallel | `1.082985` | `0.501182` |
| Split independent | `1.328262` | `0.448843` |
| Split communicating | `1.783048` | `0.671085` |
Step-2000 teacher-forced masked validation loss:
| Model | Step-2000 val loss | Step-2000 left/right imbalance |
| --- | ---: | ---: |
| Shared | `0.055329` | `0.069564` |
| Head-only parallel | `0.055297` | `0.069380` |
| Split independent | `0.063537` | `0.092029` |
| Split communicating | `0.059952` | `0.080435` |
Step-2000 sample masked MAE:
| Model | 1-step MAE | 4-step MAE | 16-step MAE |
| --- | ---: | ---: | ---: |
| Shared | `0.087330` | `0.078164` | `0.085222` |
| Head-only parallel | `0.086764` | `0.078301` | `0.085272` |
| Split independent | `0.079100` | `0.070436` | `0.075281` |
| Split communicating | `0.078618` | `0.071087` | `0.075570` |
Full raw tables for the `0/100/500/2000` sweep live in:
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
## Warm-start note
The packed parallel warm-start uses the slice/fuse mapping implemented in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`, but the step-0 numerical checks added for this update show that the warm start is not exactly equivalent end-to-end on a real batch:
- handover `10K`: `input_projection_max_abs_diff = 0.00122881`, `masked_loss_abs_diff = 0.00398052`
- dual-push `5K`: `input_projection_max_abs_diff = 0.00099802`, `masked_loss_abs_diff = 0.08580410`
- both checks report `warmstart_equivalent = False`
So this repo should be read as a matched warm-start study, not as a bitwise-identical step-0 control.
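The shape of that step-0 check can be illustrated with a small sketch (the helper below is hypothetical; the actual logic lives in `openpi/scripts/check_parallel_warmstart_equivalence.py`): compare the two models' input projections and masked losses on one real batch against a tolerance, rather than assuming bitwise equality.

```python
import numpy as np

# Hypothetical stand-in for the step-0 equivalence check: given the
# input projections and masked losses from the single model and its
# warm-started parallel counterpart on the same real batch, report the
# diffs and a tolerance-based verdict.
def check_warmstart_equivalence(single_proj: np.ndarray,
                                parallel_proj: np.ndarray,
                                single_loss: float,
                                parallel_loss: float,
                                atol: float = 1e-6) -> dict:
    proj_diff = float(np.max(np.abs(single_proj - parallel_proj)))
    loss_diff = abs(single_loss - parallel_loss)
    return {
        "input_projection_max_abs_diff": proj_diff,
        "masked_loss_abs_diff": loss_diff,
        "warmstart_equivalent": proj_diff <= atol and loss_diff <= atol,
    }
```

Under any reasonable `atol`, the diffs quoted above are large enough that both runs correctly report `warmstart_equivalent = False`.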
## Split-Expert Bring-Up (`2026-03-10`)
The repo now contains a true split-action-expert implementation in addition to the earlier packed head-only factorization. The new config flag is `action_expert_mode`, with four values:
- `shared`
- `head_only_parallel`
- `split_independent`
- `split_communicating`
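As a minimal illustration of how such a flag can be validated at construction time (the dataclass below is a hypothetical sketch, not the repo's actual config class):

```python
from dataclasses import dataclass

# The four modes named in the text; the config class itself is illustrative.
ACTION_EXPERT_MODES = (
    "shared",
    "head_only_parallel",
    "split_independent",
    "split_communicating",
)

@dataclass
class ActionExpertConfig:
    action_expert_mode: str = "shared"

    def __post_init__(self) -> None:
        # Fail fast on typos rather than silently falling back to shared.
        if self.action_expert_mode not in ACTION_EXPERT_MODES:
            raise ValueError(f"unknown action_expert_mode: {self.action_expert_mode!r}")
```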
Key bring-up results:
- the split warm-start copies the original single `gemma_expert` weights exactly into the left and right expert branches for both split modes
- `split_independent` passes the branch-local invariants:
- identical left/right inputs produce identical suffix outputs
- perturbing right-arm inputs leaves left-arm outputs unchanged, and vice versa
- both split modes pass detached real-data training on packed TWIN dual-push:
- `3`-step real-data smoke run with checkpoint save
- `20`-step real-data training run with checkpoint save
- the communicating model emits nonzero cross-arm attention diagnostics and remains finite through the real-data `20`-step run
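The branch-locality invariant above can be sketched as a perturbation test. `expert_fn` here is a hypothetical stand-in for the split suffix forward pass, returning `(left_out, right_out)`:

```python
import numpy as np

# With a truly independent split expert, perturbing the right-arm
# inputs must leave the left-arm outputs unchanged (and symmetrically
# for the other arm). Returns True if the invariant holds.
def check_branch_locality(expert_fn, left_in, right_in, seed: int = 0) -> bool:
    rng = np.random.default_rng(seed)
    left_out, _ = expert_fn(left_in, right_in)
    left_out_perturbed, _ = expert_fn(
        left_in, right_in + rng.normal(size=right_in.shape)
    )
    return bool(np.array_equal(left_out, left_out_perturbed))
```

A `split_communicating` model is expected to fail this check by design, since cross-arm attention deliberately couples the branches.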
New bring-up artifact bundle:
- `artifacts/twin_split_expert_bringup_20260310/`
- split warm-start checkpoints
- invariant-check outputs
- reproducibility commands
- summary README for the split-expert bring-up
## Repo layout
- `openpi/`
- modified source and scripts used for training/eval
- copied norm-stats assets for the packed configs
- full `2K`, `10K`, and dual-push `5K` checkpoint trees
- `artifacts/twin_handover_packed_parallelization_20260309/`
- initial `2K` study bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/`
- `10K` follow-up bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/`
- dual-push `128` screening bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
- dual-push `128` four-way `2K` step-comparison bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
- small preflight/debug snapshot from the interrupted bring-up path; useful for debugging the runner, not the canonical result bundle
- `artifacts/twin_split_expert_bringup_20260310/`
- split-expert bring-up bundle committed with summary README, repro commands, detached run logs, and sanity checks
- `openpi/run_logs/`
- raw local split bring-up logs kept for completeness; the canonical copies for the finalized bring-up record live under `artifacts/twin_split_expert_bringup_20260310/run_logs/`
- `openpi/scripts/upload_stepcmp_bundle_to_hf.py`
- the committed high-throughput HF uploader for the step-comparison bundle and retained checkpoints; it uses `huggingface_hub.HfApi.upload_large_folder(...)`
- `artifacts/pi05_base_params/`
- staged base parameter snapshot used during JAX-to-PyTorch conversion
## Committed artifact note
For this update, the committed artifact payloads are:
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
- the official finalized `4`-model dual-push `2K` step-comparison bundle
- `artifacts/twin_split_expert_bringup_20260310/`
- the split-expert bring-up bundle used as the sanity and warm-start reference
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
- a small debug-only environment snapshot from the failed/resumed bring-up sequence
The debug bundle is intentionally committed only as runner diagnostics. The canonical study outputs are the non-`_debug` step-comparison bundle plus the split bring-up bundle.
## Future commit/upload workflow
When adding new experiment results to this repo:
- keep the canonical bundle under `artifacts/<study_name>/` and only retain the checkpoint steps that are scientifically required under `openpi/checkpoints/`
- before claiming the repo is fully committed, audit ignored artifact paths explicitly:
- `git ls-files --others -i --exclude-standard --directory -- openpi/checkpoints artifacts openpi/run_logs run_logs`
- if a result is intentionally kept in an ignored path such as `openpi/checkpoints/` or `openpi/run_logs/`, force-add it explicitly with `git add --sparse -f ...`
- use `openpi/scripts/upload_stepcmp_bundle_to_hf.py` for large HF uploads; it uses `huggingface_hub.HfApi.upload_large_folder(...)` and is the preferred path for checkpoint-heavy updates
- never hardcode HF credentials in scripts, logs, or READMEs; keep the credential in `HF_TOKEN` or load it from `HF_TOKEN_FILE`, and check for literal `hf_...` strings before committing
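The credential rule above can be sketched as a small token resolver (a hypothetical helper; the committed uploader's actual logic may differ), whose result would then be passed along to `HfApi(token=...).upload_large_folder(...)`:

```python
import os

def resolve_hf_token(env=None) -> str:
    # Prefer HF_TOKEN; fall back to reading HF_TOKEN_FILE; never accept
    # a literal token baked into the source tree.
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token and env.get("HF_TOKEN_FILE"):
        with open(env["HF_TOKEN_FILE"]) as f:
            token = f.read().strip()
    if not token:
        raise RuntimeError("set HF_TOKEN or HF_TOKEN_FILE before uploading")
    return token
```

Keeping the lookup in one place also makes the pre-commit scan for literal `hf_...` strings easy to trust: any match outside test fixtures is a leak.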
## Key files
- Full report: `REPORT.md`
- `2K` summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `10K` summary: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `10K` comparison table: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- dual-push `5K` summary: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- dual-push `5K` teacher-forced table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- dual-push `5K` sample eval table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- dual-push `5K` environment snapshot: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
- dual-push `2K` step-comparison summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/summary.json`
- dual-push `2K` step-comparison README: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/README.md`
- dual-push `2K` teacher-forced table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- dual-push `2K` sample eval table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- dual-push `2K` training summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
- split-expert bring-up summary: `artifacts/twin_split_expert_bringup_20260310/README.md`
- split-expert repro commands: `artifacts/twin_split_expert_bringup_20260310/repro/commands_bringup.sh`
- split-expert invariant check outputs: `artifacts/twin_split_expert_bringup_20260310/sanity_checks/`
- split-expert real-data logs: `openpi/run_logs/split_independent_real_smoke3_r2.log`, `openpi/run_logs/split_communicating_real_smoke3.log`, `openpi/run_logs/split_independent_real_train20.log`, `openpi/run_logs/split_communicating_real_train20.log`
- split-expert real-data checkpoints: `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_independent_pytorch_5k/`, `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_communicating_pytorch_5k/`
- `10K` repro commands: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `10K` changed-file manifest: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `10K` environment snapshot: `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
## Main changed files
Initial `2K` + `10K` study logic lives primarily in:
- `openpi/src/openpi/transforms.py`
- `openpi/src/openpi/training/config.py`
- `openpi/src/openpi/training/data_loader.py`
- `openpi/src/openpi/models/model.py`
- `openpi/src/openpi/models/tokenizer.py`
- `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
- `openpi/scripts/train_pytorch.py`
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
- `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
- `openpi/scripts/inspect_twin_packed_batch.py`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
- `openpi/scripts/check_split_expert_invariants.py`
- `openpi/scripts/run_twin_handover_packed_followup.sh`
- `openpi/scripts/run_twin_handover_packed_10k.sh`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
The per-file rationale is recorded in:
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`