# pi0.5 Packed Multi-Arm OpenPI Artifacts

This repo packages the full local artifact set for packed-action-head studies on `pi0.5` across TWIN handover and TWIN dual-push, including:

- all finished checkpoints under `openpi/checkpoints/`
- the modified `openpi/` training and evaluation code
- train/eval logs and structured metric tables
- reproducibility manifests and environment snapshots

Four runs are included:

1. an initial `2K` baseline-vs-parallel comparison
2. a longer `10K` follow-up on the same packed setup
3. a `5K` dual-push `128` screening study on the same packed path
4. a `2K` dual-push `128` four-way step comparison across `shared`, `head_only_parallel`, `split_independent`, and `split_communicating`

This update also adds a split-action-expert bring-up bundle for the packed TWIN path, covering:

- exact single-to-split warm-start checkpoints for `split_independent` and `split_communicating`
- invariant checks for the new split architecture
- detached real-data smoke and `20`-step training runs on `lsnu/twin_dual_push_128_train`
- the code changes that introduce the new split-expert action path

## Experiment setup

- Handover train/val: `lsnu/twin_handover_256_train`, `lsnu/twin_handover_256_val`
- Dual-push train/val: `lsnu/twin_dual_push_128_train`, `lsnu/twin_dual_push_128_val`
- Hardware: `4x H100 80GB`
- Precision: `bfloat16`
- Semantic packed layout: `[L8, 0x8, R8, 0x8]`
- Active action-loss dims: `[0:8]` and `[16:24]`
- Masked padded dims: `[8:16]` and `[24:32]`
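
The packed layout above can be sketched as a boolean loss mask over the 32 action dims. This is an illustrative reconstruction from the slices listed here, not the repo's actual loss code; the function names are hypothetical.

```python
import numpy as np

# Hypothetical sketch of the packed [L8, 0x8, R8, 0x8] action layout:
# left-arm dims 0:8 and right-arm dims 16:24 carry loss, while the
# zero-padded dims 8:16 and 24:32 are masked out of the training signal.
ACTION_DIM = 32
ACTIVE_SLICES = [(0, 8), (16, 24)]

def make_action_loss_mask() -> np.ndarray:
    mask = np.zeros(ACTION_DIM, dtype=bool)
    for start, stop in ACTIVE_SLICES:
        mask[start:stop] = True
    return mask

def masked_mse(pred: np.ndarray, target: np.ndarray) -> float:
    # Average squared error over active dims only, so padded dims
    # can never contribute to the loss.
    mask = make_action_loss_mask()
    err = (pred - target) ** 2
    return float(err[..., mask].mean())
```

With this layout, exactly 16 of the 32 dims are active, and any disagreement inside the padded slices leaves the loss unchanged.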

## Headline results

Teacher-forced masked validation loss:

| Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
| --- | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.035776` | `0.061130` | `0.041595` | `0.027324` | `0.022345` |
| Packed parallel | `0.035680` | `0.059715` | `0.039947` | `0.027340` | `0.022168` |

Sample-based eval on the fixed `10K` final validation subset:

| Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.029935` | `0.030294` | `2:13:40` | `35.23GB` |
| Packed parallel | `0.029277` | `0.030241` | `2:20:51` | `35.27GB` |

The long run still shows a very small parallel edge on teacher-forced validation loss by `10K`, while the sample-based eval is essentially a tie.

Dual-push `128` screening results:

| Model | 1K val loss | 2K val loss | 5K val loss | 5K 4-step MAE | 5K 10-step MAE | Train runtime |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.095597` | `0.083194` | `0.055958` | `0.056830` | `0.058973` | `1:05:25` |
| Packed parallel | `0.093704` | `0.082729` | `0.055242` | `0.054630` | `0.056627` | `1:00:33` |

The dual-push screening run shows a small but consistent parallel edge at `1K`, `2K`, and `5K` on both teacher-forced validation loss and fixed-subset sample MAE.

Dual-push `128` four-way `2K` step comparison raw results:

Step-0 teacher-forced masked validation loss:

| Model | Step-0 val loss | Step-0 left/right imbalance |
| --- | ---: | ---: |
| Shared | `1.084735` | `0.505345` |
| Head-only parallel | `1.082985` | `0.501182` |
| Split independent | `1.328262` | `0.448843` |
| Split communicating | `1.783048` | `0.671085` |

Step-2000 teacher-forced masked validation loss:

| Model | Step-2000 val loss | Step-2000 left/right imbalance |
| --- | ---: | ---: |
| Shared | `0.055329` | `0.069564` |
| Head-only parallel | `0.055297` | `0.069380` |
| Split independent | `0.063537` | `0.092029` |
| Split communicating | `0.059952` | `0.080435` |

Step-2000 sample masked MAE:

| Model | 1-step MAE | 4-step MAE | 16-step MAE |
| --- | ---: | ---: | ---: |
| Shared | `0.087330` | `0.078164` | `0.085222` |
| Head-only parallel | `0.086764` | `0.078301` | `0.085272` |
| Split independent | `0.079100` | `0.070436` | `0.075281` |
| Split communicating | `0.078618` | `0.071087` | `0.075570` |

Full raw tables for the `0/100/500/2000` sweep live in:

- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`

## Warm-start note

The packed parallel warm-start uses the slice/fuse mapping implemented in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`, but the added step-0 numerical checks show it is not exactly identical end-to-end on a real batch:

- handover `10K`: `input_projection_max_abs_diff = 0.00122881`, `masked_loss_abs_diff = 0.00398052`
- dual-push `5K`: `input_projection_max_abs_diff = 0.00099802`, `masked_loss_abs_diff = 0.08580410`
- both checks report `warmstart_equivalent = False`

So this repo should be read as a matched warm-start study, not as a bitwise-identical step-0 control.
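
The shape of the step-0 check can be sketched as follows. This is a minimal stand-in for what `check_parallel_warmstart_equivalence.py` reports, with an assumed tolerance and assumed function names, not the script itself.

```python
import numpy as np

# Hypothetical step-0 equivalence check: run the single-expert reference
# and the slice/fused parallel model on the same real batch, then compare
# intermediate activations and the masked loss against a tolerance.
TOL = 1e-6  # assumed; the repo's actual threshold may differ

def warmstart_report(single_proj: np.ndarray, parallel_proj: np.ndarray,
                     single_loss: float, parallel_loss: float) -> dict:
    proj_diff = float(np.max(np.abs(single_proj - parallel_proj)))
    loss_diff = abs(single_loss - parallel_loss)
    return {
        "input_projection_max_abs_diff": proj_diff,
        "masked_loss_abs_diff": loss_diff,
        "warmstart_equivalent": proj_diff <= TOL and loss_diff <= TOL,
    }
```

Diffs on the order of `1e-3`, as in the handover and dual-push numbers above, land well outside any reasonable tolerance, which is why both checks report `warmstart_equivalent = False`.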

## Split-Expert Bring-Up (`2026-03-10`)

The repo now contains a true split-action-expert implementation in addition to the earlier packed head-only factorization. The new config flag is `action_expert_mode`, with four modes:

- `shared`
- `head_only_parallel`
- `split_independent`
- `split_communicating`
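
A minimal sketch of how such a flag might be typed is below; the real field lives in openpi's training config, and the class and property names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Literal

# The four action_expert_mode values described above, as a typed flag.
ActionExpertMode = Literal[
    "shared", "head_only_parallel", "split_independent", "split_communicating"
]

@dataclass(frozen=True)
class ActionExpertConfig:
    mode: ActionExpertMode = "shared"

    @property
    def uses_split_experts(self) -> bool:
        # Only the two split modes instantiate separate left/right branches.
        return self.mode in ("split_independent", "split_communicating")
```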

Key bring-up results:

- the split warm-start copies the original single `gemma_expert` into exact left/right expert branches for both split modes
- `split_independent` passes the branch-local invariants:
  - identical left/right inputs produce identical suffix outputs
  - perturbing right-arm inputs leaves left-arm outputs unchanged, and vice versa
- both split modes pass detached real-data training on packed TWIN dual-push:
  - `3`-step real-data smoke run with checkpoint save
  - `20`-step real-data training run with checkpoint save
- the communicating model emits nonzero cross-arm attention diagnostics and remains finite through the real-data `20`-step run
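
The branch-local invariants above can be illustrated with a toy forward pass. This is not `check_split_expert_invariants.py`; the weights are illustrative stand-ins, with the warm-start copy modeled by giving both branches identical parameters.

```python
import numpy as np

# Toy sketch of the split_independent invariants: each branch sees only
# its own arm's input, and the warm-start copies one expert into both
# branches (modeled here as identical weight matrices).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W_LEFT, W_RIGHT = W.copy(), W.copy()

def split_independent_forward(left_in, right_in):
    # No cross-arm terms: each output depends on one branch only.
    return np.tanh(left_in @ W_LEFT), np.tanh(right_in @ W_RIGHT)

x = rng.standard_normal(8)
y = rng.standard_normal(8)

# Invariant 1: identical left/right inputs give identical branch outputs.
l_out, r_out = split_independent_forward(x, x)
assert np.array_equal(l_out, r_out)

# Invariant 2: perturbing the right-arm input leaves the left-arm output
# bit-identical (and vice versa by symmetry).
l0, _ = split_independent_forward(x, y)
l1, _ = split_independent_forward(x, y + 1.0)
assert np.array_equal(l0, l1)
```

A `split_communicating` model would intentionally fail invariant 2, since its cross-arm attention lets each branch respond to the other arm's inputs.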

New bring-up artifact bundle:

- `artifacts/twin_split_expert_bringup_20260310/`
  - split warm-start checkpoints
  - invariant-check outputs
  - reproducibility commands
  - summary README for the split-expert bring-up

## Repo layout

- `openpi/`
  - modified source and scripts used for training/eval
  - copied norm-stats assets for the packed configs
  - full `2K`, `10K`, and dual-push `5K` checkpoint trees
- `artifacts/twin_handover_packed_parallelization_20260309/`
  - initial `2K` study bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/`
  - `10K` follow-up bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/`
  - dual-push `128` screening bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
  - dual-push `128` four-way `2K` step-comparison bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
  - small preflight/debug snapshot from the interrupted bring-up path; useful for debugging the runner, not the canonical result bundle
- `artifacts/twin_split_expert_bringup_20260310/`
  - split-expert bring-up bundle committed with summary README, repro commands, detached run logs, and sanity checks

## Committed artifact note

For this update, the committed artifact payloads are:

- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
  - the official finalized `4`-model dual-push `2K` step-comparison bundle
- `artifacts/twin_split_expert_bringup_20260310/`
  - the split-expert bring-up bundle used as the sanity and warm-start reference
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
  - a small debug-only environment snapshot from the failed/resumed bring-up sequence

The debug bundle is intentionally committed only as runner diagnostics. The canonical study outputs are the non-`_debug` step-comparison bundle plus the split bring-up bundle.

Other committed paths:

- `openpi/run_logs/`
  - raw local split bring-up logs kept for completeness; the canonical copies for the finalized bring-up record live under `artifacts/twin_split_expert_bringup_20260310/run_logs/`
- `openpi/scripts/upload_stepcmp_bundle_to_hf.py`
  - the committed high-throughput HF uploader for the step-comparison bundle and retained checkpoints; it uses `huggingface_hub.HfApi.upload_large_folder(...)`
- `artifacts/pi05_base_params/`
  - staged base parameter snapshot used during JAX-to-PyTorch conversion

## Future commit/upload workflow

When adding new experiment results to this repo:

- keep the canonical bundle under `artifacts/<study_name>/` and only retain the checkpoint steps that are scientifically required under `openpi/checkpoints/`
- before claiming the repo is fully committed, audit ignored artifact paths explicitly:
  - `git ls-files --others -i --exclude-standard --directory -- openpi/checkpoints artifacts openpi/run_logs run_logs`
- if a result is intentionally kept in an ignored path such as `openpi/checkpoints/` or `openpi/run_logs/`, force-add it explicitly with `git add --sparse -f ...`
- use `openpi/scripts/upload_stepcmp_bundle_to_hf.py` for large HF uploads; it uses `huggingface_hub.HfApi.upload_large_folder(...)` and is the preferred path for checkpoint-heavy updates
- never hardcode HF credentials in scripts, logs, or READMEs; keep the credential in `HF_TOKEN` or load it from `HF_TOKEN_FILE`, and check for literal `hf_...` strings before committing
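
The credential scan and ignored-path audit above can be sketched as a small shell check. To keep it self-contained it scans a throwaway directory here; in the repo you would scan the staged files, and the `hf_` token pattern is only an approximation of real token formats.

```shell
# Sketch of the pre-commit credential scan from the checklist above.
set -eu
tmp=$(mktemp -d)
printf 'token = "hf_%s"\n' "$(printf 'A%.0s' $(seq 1 34))" > "$tmp/leaky.py"
printf 'import os\ntoken = os.environ["HF_TOKEN"]\n' > "$tmp/clean.py"

# Fail loudly if any file contains a literal hf_... credential.
hits=$(grep -rEl 'hf_[A-Za-z0-9]{30,}' "$tmp" || true)
if [ -n "$hits" ]; then
  echo "credential-like string found in:"
  echo "$hits"
fi

# In the real repo, also audit ignored artifact paths before declaring
# the update fully committed:
#   git ls-files --others -i --exclude-standard --directory -- \
#     openpi/checkpoints artifacts openpi/run_logs run_logs
rm -rf "$tmp"
```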

## Key files

- Full report: `REPORT.md`
- `2K` summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `10K` summary: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `10K` comparison table: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- dual-push `5K` summary: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- dual-push `5K` teacher-forced table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- dual-push `5K` sample eval table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- dual-push `5K` environment snapshot: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
- dual-push `2K` step-comparison summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/summary.json`
- dual-push `2K` step-comparison README: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/README.md`
- dual-push `2K` teacher-forced table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- dual-push `2K` sample eval table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- dual-push `2K` training summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
- split-expert bring-up summary: `artifacts/twin_split_expert_bringup_20260310/README.md`
- split-expert repro commands: `artifacts/twin_split_expert_bringup_20260310/repro/commands_bringup.sh`
- split-expert invariant check outputs: `artifacts/twin_split_expert_bringup_20260310/sanity_checks/`
- split-expert real-data logs: `openpi/run_logs/split_independent_real_smoke3_r2.log`, `openpi/run_logs/split_communicating_real_smoke3.log`, `openpi/run_logs/split_independent_real_train20.log`, `openpi/run_logs/split_communicating_real_train20.log`
- split-expert real-data checkpoints: `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_independent_pytorch_5k/`, `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_communicating_pytorch_5k/`
- `10K` repro commands: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `10K` changed-file manifest: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `10K` environment snapshot: `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`

## Main changed files

Initial `2K` + `10K` study logic lives primarily in:

- `openpi/src/openpi/transforms.py`
- `openpi/src/openpi/training/config.py`
- `openpi/src/openpi/training/data_loader.py`
- `openpi/src/openpi/models/model.py`
- `openpi/src/openpi/models/tokenizer.py`
- `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
- `openpi/scripts/train_pytorch.py`
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
- `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
- `openpi/scripts/inspect_twin_packed_batch.py`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
- `openpi/scripts/check_split_expert_invariants.py`
- `openpi/scripts/run_twin_handover_packed_followup.sh`
- `openpi/scripts/run_twin_handover_packed_10k.sh`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`

The per-file rationale is recorded in:

- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`