# pi0.5 Packed Multi-Arm OpenPI Artifacts
This repo packages the full local artifact set for packed-action-head studies on `pi0.5` across TWIN handover and TWIN dual-push, including:
- all finished checkpoints under `openpi/checkpoints/`
- the modified `openpi/` training and evaluation code
- train/eval logs and structured metric tables
- reproducibility manifests and environment snapshots
Four runs are included:
1. an initial `2K` baseline-vs-parallel comparison
2. a longer `10K` follow-up on the same packed setup
3. a `5K` dual-push `128` screening study on the same packed path
4. a `2K` dual-push `128` four-way step comparison across `shared`, `head_only_parallel`, `split_independent`, and `split_communicating`
This update also adds a split-action-expert bring-up bundle for the packed TWIN path, covering:
- exact single-to-split warm-start checkpoints for `split_independent` and `split_communicating`
- invariant checks for the new split architecture
- detached real-data smoke and `20`-step training runs on `lsnu/twin_dual_push_128_train`
- the code changes that introduce the new split-expert action path
## Experiment setup
- Handover train/val: `lsnu/twin_handover_256_train`, `lsnu/twin_handover_256_val`
- Dual-push train/val: `lsnu/twin_dual_push_128_train`, `lsnu/twin_dual_push_128_val`
- Hardware: `4x H100 80GB`
- Precision: `bfloat16`
- Semantic packed layout: `[L8, 0x8, R8, 0x8]`
- Active action-loss dims: `[0:8]` and `[16:24]`
- Masked padded dims: `[8:16]` and `[24:32]`
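The packed layout and loss masking above can be sketched as follows; this is a minimal NumPy illustration of the dim bookkeeping, with hypothetical helper names (`action_loss_mask`, `masked_mae`), not the repo's actual loss code.

```python
import numpy as np

# Packed 32-dim layout: [L8, 0x8, R8, 0x8].
# Left-arm dims [0:8] and right-arm dims [16:24] carry the action loss;
# the zero-padded dims [8:16] and [24:32] are masked out.
PACKED_DIM = 32
ACTIVE_SLICES = (slice(0, 8), slice(16, 24))

def action_loss_mask(dim: int = PACKED_DIM) -> np.ndarray:
    """Boolean mask selecting the dims that contribute to the action loss."""
    mask = np.zeros(dim, dtype=bool)
    for s in ACTIVE_SLICES:
        mask[s] = True
    return mask

def masked_mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error over active dims only (last axis = packed dim)."""
    mask = action_loss_mask(pred.shape[-1])
    return float(np.abs(pred - target)[..., mask].mean())
```

With this masking, arbitrary values in the padded dims cannot move the reported loss, which is what makes the teacher-forced numbers below comparable across models.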
## Headline results
Teacher-forced masked validation loss:
| Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
| --- | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.035776` | `0.061130` | `0.041595` | `0.027324` | `0.022345` |
| Packed parallel | `0.035680` | `0.059715` | `0.039947` | `0.027340` | `0.022168` |
Sample-based eval on the fixed `10K` final validation subset:
| Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.029935` | `0.030294` | `2:13:40` | `35.23GB` |
| Packed parallel | `0.029277` | `0.030241` | `2:20:51` | `35.27GB` |
The long run still shows a very small parallel edge on teacher-forced validation loss by `10K`, while the sample-based eval is essentially a tie.
Dual-push `128` screening results:
| Model | 1K val loss | 2K val loss | 5K val loss | 5K 4-step MAE | 5K 10-step MAE | Train runtime |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.095597` | `0.083194` | `0.055958` | `0.056830` | `0.058973` | `1:05:25` |
| Packed parallel | `0.093704` | `0.082729` | `0.055242` | `0.054630` | `0.056627` | `1:00:33` |
The dual-push screening run shows a small but consistent parallel edge at `1K`, `2K`, and `5K` on both teacher-forced validation loss and fixed-subset sample MAE.
Dual-push `128` four-way `2K` step comparison raw results:
Step-0 teacher-forced masked validation loss:
| Model | Step-0 val loss | Step-0 left/right imbalance |
| --- | ---: | ---: |
| Shared | `1.084735` | `0.505345` |
| Head-only parallel | `1.082985` | `0.501182` |
| Split independent | `1.328262` | `0.448843` |
| Split communicating | `1.783048` | `0.671085` |
Step-2000 teacher-forced masked validation loss:
| Model | Step-2000 val loss | Step-2000 left/right imbalance |
| --- | ---: | ---: |
| Shared | `0.055329` | `0.069564` |
| Head-only parallel | `0.055297` | `0.069380` |
| Split independent | `0.063537` | `0.092029` |
| Split communicating | `0.059952` | `0.080435` |
Step-2000 sample masked MAE:
| Model | 1-step MAE | 4-step MAE | 16-step MAE |
| --- | ---: | ---: | ---: |
| Shared | `0.087330` | `0.078164` | `0.085222` |
| Head-only parallel | `0.086764` | `0.078301` | `0.085272` |
| Split independent | `0.079100` | `0.070436` | `0.075281` |
| Split communicating | `0.078618` | `0.071087` | `0.075570` |
Full raw tables for the `0/100/500/2000` sweep live in:
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
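A long-format sweep table like the ones above can be pivoted into a model-by-step matrix for quick comparison. The column names (`model`, `step`, `masked_val_loss`) are assumptions; check the actual CSV headers before relying on them.

```python
import io
import pandas as pd

def load_step_sweep(csv_file) -> pd.DataFrame:
    """Pivot a long-format eval table into a model x step loss matrix."""
    df = pd.read_csv(csv_file)
    return df.pivot(index="model", columns="step", values="masked_val_loss")

# Example with an in-memory stand-in for the real CSV:
demo = io.StringIO(
    "model,step,masked_val_loss\n"
    "shared,0,1.084735\n"
    "shared,2000,0.055329\n"
    "split_independent,0,1.328262\n"
    "split_independent,2000,0.063537\n"
)
table = load_step_sweep(demo)
```

For the real files, pass the paths listed above (e.g. `metrics/teacher_forced_eval_table.csv`) instead of the in-memory demo.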
## Warm-start note
The packed parallel warm-start uses the slice/fuse mapping implemented in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`, but the added step-0 numerical checks show it is not exactly identical end-to-end on a real batch:
- handover `10K`: `input_projection_max_abs_diff = 0.00122881`, `masked_loss_abs_diff = 0.00398052`
- dual-push `5K`: `input_projection_max_abs_diff = 0.00099802`, `masked_loss_abs_diff = 0.08580410`
- both checks report `warmstart_equivalent = False`
So this repo should be read as a matched warm-start study, not as a bitwise-identical step-0 control.
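The shape of the step-0 check can be sketched with NumPy stand-ins for the two models; the real check lives in `openpi/scripts/check_parallel_warmstart_equivalence.py`, and the tolerance here is illustrative, not the script's actual threshold.

```python
import numpy as np

ATOL = 1e-5  # illustrative tolerance, not the script's actual threshold

def warmstart_report(single_out: np.ndarray, parallel_out: np.ndarray,
                     single_loss: float, parallel_loss: float) -> dict:
    """Compare projections and masked loss from the donor and warm-started model
    on the same real batch, mirroring the diagnostics quoted above."""
    proj_diff = float(np.max(np.abs(single_out - parallel_out)))
    loss_diff = abs(single_loss - parallel_loss)
    return {
        "input_projection_max_abs_diff": proj_diff,
        "masked_loss_abs_diff": loss_diff,
        "warmstart_equivalent": proj_diff <= ATOL and loss_diff <= ATOL,
    }
```

Diffs on the order of `1e-3` (as in the handover and dual-push checks) fail this kind of tolerance, hence `warmstart_equivalent = False`.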
## Split-Expert Bring-Up (`2026-03-10`)
The repo now contains a true split-action-expert implementation in addition to the earlier packed head-only factorization. The new config flag is `action_expert_mode`, with four values:
- `shared`
- `head_only_parallel`
- `split_independent`
- `split_communicating`
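A minimal sketch of validating the flag; the `Literal` alias and helper name are illustrative, not the repo's actual config types.

```python
from typing import Literal

# The four supported values of the `action_expert_mode` config flag.
ActionExpertMode = Literal[
    "shared", "head_only_parallel", "split_independent", "split_communicating"
]

SPLIT_MODES = {"split_independent", "split_communicating"}

def uses_split_experts(mode: str) -> bool:
    """True when the config selects a left/right split action expert."""
    if mode not in ("shared", "head_only_parallel", *SPLIT_MODES):
        raise ValueError(f"unknown action_expert_mode: {mode!r}")
    return mode in SPLIT_MODES
```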
Key bring-up results:
- the split warm-start copies the original single `gemma_expert` into exact left/right expert branches for both split modes
- `split_independent` passes the branch-local invariants:
- identical left/right inputs produce identical suffix outputs
- perturbing right-arm inputs leaves left-arm outputs unchanged, and vice versa
- both split modes pass detached real-data training on packed TWIN dual-push:
- `3`-step real-data smoke run with checkpoint save
- `20`-step real-data training run with checkpoint save
- the communicating model emits nonzero cross-arm attention diagnostics and remains finite through the real-data `20`-step run
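The two `split_independent` invariants can be illustrated with a toy stand-in: the warm-start copies one expert into both branches (identical weights), and each branch sees only its own arm's input, so perturbing one arm cannot change the other's output. The real checks live in `openpi/scripts/check_split_expert_invariants.py`.

```python
import numpy as np

# Toy stand-in for the split_independent branches. The warm-start copies the
# same single expert into both branches, so both start with identical weights.
rng = np.random.default_rng(0)
W_LEFT = rng.standard_normal((8, 8))
W_RIGHT = W_LEFT.copy()

def split_forward(left_in, right_in):
    # Each branch sees only its own arm's input.
    return left_in @ W_LEFT, right_in @ W_RIGHT

x = rng.standard_normal(8)
left_a, right_a = split_forward(x, x)
identical_outputs = np.allclose(left_a, right_a)  # invariant 1

left_b, _ = split_forward(x, x + 1.0)             # perturb right input only
left_unaffected = np.allclose(left_a, left_b)     # invariant 2
```

For `split_communicating`, invariant 2 is expected to fail by design, since cross-arm attention lets each branch see the other arm.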
New bring-up artifact bundle:
- `artifacts/twin_split_expert_bringup_20260310/`
- split warm-start checkpoints
- invariant-check outputs
- reproducibility commands
- summary README for the split-expert bring-up
## Repo layout
- `openpi/`
- modified source and scripts used for training/eval
- copied norm-stats assets for the packed configs
- full `2K`, `10K`, and dual-push `5K` checkpoint trees
- `artifacts/twin_handover_packed_parallelization_20260309/`
- initial `2K` study bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/`
- `10K` follow-up bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/`
- dual-push `128` screening bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
- dual-push `128` four-way `2K` step-comparison bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
- small preflight/debug snapshot from the interrupted bring-up path; useful for debugging the runner, not the canonical result bundle
- `artifacts/twin_split_expert_bringup_20260310/`
- split-expert bring-up bundle committed with summary README, repro commands, detached run logs, and sanity checks
## Committed artifact note
For this update, the committed artifact payloads are:
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
- the official finalized `4`-model dual-push `2K` step-comparison bundle
- `artifacts/twin_split_expert_bringup_20260310/`
- the split-expert bring-up bundle used as the sanity and warm-start reference
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
- a small debug-only environment snapshot from the failed/resumed bring-up sequence
- `openpi/run_logs/`
- raw local split bring-up logs kept for completeness; the canonical copies for the finalized bring-up record live under `artifacts/twin_split_expert_bringup_20260310/run_logs/`
- `openpi/scripts/upload_stepcmp_bundle_to_hf.py`
- the committed high-throughput HF uploader for the step-comparison bundle and retained checkpoints; it uses `huggingface_hub.HfApi.upload_large_folder(...)`
- `artifacts/pi05_base_params/`
- staged base parameter snapshot used during JAX-to-PyTorch conversion

The debug bundle is intentionally committed only as runner diagnostics. The canonical study outputs are the non-`_debug` step-comparison bundle plus the split bring-up bundle.
## Future commit/upload workflow
When adding new experiment results to this repo:
- keep the canonical bundle under `artifacts/<study_name>/` and only retain the checkpoint steps that are scientifically required under `openpi/checkpoints/`
- before claiming the repo is fully committed, audit ignored artifact paths explicitly:
- `git ls-files --others -i --exclude-standard --directory -- openpi/checkpoints artifacts openpi/run_logs run_logs`
- if a result is intentionally kept in an ignored path such as `openpi/checkpoints/` or `openpi/run_logs/`, force-add it explicitly with `git add --sparse -f ...`
- use `openpi/scripts/upload_stepcmp_bundle_to_hf.py` for large HF uploads; it uses `huggingface_hub.HfApi.upload_large_folder(...)` and is the preferred path for checkpoint-heavy updates
- never hardcode HF credentials in scripts, logs, or READMEs; keep the credential in `HF_TOKEN` or load it from `HF_TOKEN_FILE`, and check for literal `hf_...` strings before committing
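The literal-token check in the last bullet can be sketched as a simple scan; the exact pattern here is an assumption (HF tokens start with `hf_`), and `find_token_leaks` is a hypothetical helper, not a committed script.

```python
import re

# Illustrative pre-commit scan for hardcoded HF tokens. Adjust the minimum
# length if it flags too little or too much; env-var names like HF_TOKEN
# (uppercase, no secret value) are intentionally not matched.
TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")

def find_token_leaks(text: str) -> list[str]:
    """Return any hf_...-style literals found in the given file contents."""
    return TOKEN_RE.findall(text)
```

Running this over staged file contents before committing catches the `hf_...` strings the workflow above warns about.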
## Key files
- Full report: `REPORT.md`
- `2K` summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `10K` summary: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `10K` comparison table: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- dual-push `5K` summary: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- dual-push `5K` teacher-forced table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- dual-push `5K` sample eval table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- dual-push `5K` environment snapshot: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
- dual-push `2K` step-comparison summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/summary.json`
- dual-push `2K` step-comparison README: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/README.md`
- dual-push `2K` teacher-forced table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- dual-push `2K` sample eval table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- dual-push `2K` training summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
- split-expert bring-up summary: `artifacts/twin_split_expert_bringup_20260310/README.md`
- split-expert repro commands: `artifacts/twin_split_expert_bringup_20260310/repro/commands_bringup.sh`
- split-expert invariant check outputs: `artifacts/twin_split_expert_bringup_20260310/sanity_checks/`
- split-expert real-data logs: `openpi/run_logs/split_independent_real_smoke3_r2.log`, `openpi/run_logs/split_communicating_real_smoke3.log`, `openpi/run_logs/split_independent_real_train20.log`, `openpi/run_logs/split_communicating_real_train20.log`
- split-expert real-data checkpoints: `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_independent_pytorch_5k/`, `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_communicating_pytorch_5k/`
- `10K` repro commands: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `10K` changed-file manifest: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `10K` environment snapshot: `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
## Main changed files
Initial `2K` + `10K` study logic lives primarily in:
- `openpi/src/openpi/transforms.py`
- `openpi/src/openpi/training/config.py`
- `openpi/src/openpi/training/data_loader.py`
- `openpi/src/openpi/models/model.py`
- `openpi/src/openpi/models/tokenizer.py`
- `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
- `openpi/scripts/train_pytorch.py`
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
- `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
- `openpi/scripts/inspect_twin_packed_batch.py`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
- `openpi/scripts/check_split_expert_invariants.py`
- `openpi/scripts/run_twin_handover_packed_followup.sh`
- `openpi/scripts/run_twin_handover_packed_10k.sh`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
The per-file rationale is recorded in:
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`