lsnu committed on
Commit 4cc9180 · verified · 1 Parent(s): ab4982c

Upload 10k report docs

Files changed (2):
  1. README.md +74 -35
  2. REPORT.md +179 -258
README.md CHANGED
@@ -1,58 +1,97 @@
 # pi0.5 Packed Multi-Arm OpenPI Artifacts

- This repo packages a finished initial comparison between:

- 1. a packed single-head `pi0.5` baseline
- 2. a packed parallel-head `pi0.5` model with an exact packed warm-start from the single-head checkpoint

- The study was run from the checked-out `openpi/` tree on `4x H100 80GB` with `bfloat16`, `2000` optimizer steps per model, verbose startup/debug logging, fixed validation passes, and no raw data reconversion.

- ## Dataset and packing

 - Train repo: `lsnu/twin_handover_256_train`
 - Val repo: `lsnu/twin_handover_256_val`
- - Original TWIN layout: `[L8, R8]`
- - Packed model layout used for both models: `[L8, 0x8, R8, 0x8]`
- - Action-loss mask: active dims `[0:8]` and `[16:24]`, padded dims masked out
- - Public `16`-dim norm stats were reused; they were not recomputed

 ## Headline results

- | Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
 | --- | ---: | ---: | ---: | ---: |
- | Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23 GB` |
- | Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27 GB` |

- The two models tracked closely. In this short run, the packed parallel head finished with a small edge on validation loss while staying within the same memory envelope.

- ## Repo contents

 - `openpi/`
- - modified training/eval code
- - config and transform changes
- - copied norm-stats assets for the new packed configs
- - smoke and main-run checkpoints under `openpi/checkpoints/`
 - `artifacts/twin_handover_packed_parallelization_20260309/`
- - `bootstrap_checkpoints/`: single-head PyTorch bootstrap and exact packed parallel warm-start
- - `metrics/`: JSON and CSV summaries
- - `run_logs/`: smoke, train, eval, and follow-up logs
- - `sanity_checks/`: packed-batch inspection output
- - `environment/`: system, GPU, package, HF-tooling, and workspace snapshots
- - `repro/`: changed-file list, checkpoint locations, and rerun commands
 - `artifacts/pi05_base_params/`
- - staged base JAX parameter snapshot used for PyTorch conversion

- ## Key artifact paths

 - Full report: `REPORT.md`
- - Reproduction commands: `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
- - Metrics summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- - Train loss table: `artifacts/twin_handover_packed_parallelization_20260309/metrics/train_loss_table.csv`
- - Val loss table: `artifacts/twin_handover_packed_parallelization_20260309/metrics/val_loss_table.csv`
- - Environment snapshot: `artifacts/twin_handover_packed_parallelization_20260309/environment/`

- ## Notes

- - The packed parallel warm-start is exact by construction from the implemented slice/fuse mapping.
- - Weight loading on both main runs reported `missing=0` and `unexpected=0`.
- - The packaged tree intentionally records reproducibility snapshots instead of uploading transient cache state.
 # pi0.5 Packed Multi-Arm OpenPI Artifacts

+ This repo packages the full local artifact set for the TWIN handover packed-action-head study on `pi0.5`, including:

+ - all finished checkpoints under `openpi/checkpoints/`
+ - the modified `openpi/` training and evaluation code
+ - train/eval logs and structured metric tables
+ - reproducibility manifests and environment snapshots

+ Two runs are included:

+ 1. an initial `2K` baseline-vs-parallel comparison
+ 2. a longer `10K` follow-up on the same packed setup
+
+ ## Experiment setup

 - Train repo: `lsnu/twin_handover_256_train`
 - Val repo: `lsnu/twin_handover_256_val`
+ - Hardware: `4x H100 80GB`
+ - Precision: `bfloat16`
+ - Semantic packed layout: `[L8, 0x8, R8, 0x8]`
+ - Active action-loss dims: `[0:8]` and `[16:24]`
+ - Masked padded dims: `[8:16]` and `[24:32]`
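The action-loss mask implied by the last three bullets can be sketched directly. A minimal numpy illustration (the `masked_action_loss` helper and its squared-error reduction are illustrative stand-ins, not the repo's actual training loss):

```python
import numpy as np

# Packed layout [L8, 0x8, R8, 0x8]: only the real arm dims carry loss.
mask = np.zeros(32)
mask[0:8] = 1.0    # left arm: active
mask[16:24] = 1.0  # right arm: active
# dims [8:16] and [24:32] stay 0.0: padded, masked out

def masked_action_loss(pred, target):
    """Mean squared error over active dims only (pred/target: batch, horizon, 32)."""
    sq_err = ((pred - target) ** 2) * mask
    return sq_err.sum() / (mask.sum() * pred.shape[0] * pred.shape[1])
```

Because padded entries are multiplied by `0.0`, arbitrary values in dims `[8:16]` and `[24:32]` cannot move the loss.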
 ## Headline results

+ Teacher-forced masked validation loss:
+
+ | Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
+ | --- | ---: | ---: | ---: | ---: | ---: |
+ | Packed baseline | `0.035776` | `0.061130` | `0.041595` | `0.027324` | `0.022345` |
+ | Packed parallel | `0.035680` | `0.059715` | `0.039947` | `0.027340` | `0.022168` |
+
+ Sample-based eval on the fixed `10K` final validation subset:
+
+ | Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
 | --- | ---: | ---: | ---: | ---: |
+ | Packed baseline | `0.029935` | `0.030294` | `2:13:40` | `35.23 GB` |
+ | Packed parallel | `0.029277` | `0.030241` | `2:20:51` | `35.27 GB` |
+
+ The long run still shows a very small parallel edge on teacher-forced validation loss by `10K`, while the sample-based eval is essentially a tie.
+
+ ## Warm-start note

+ The packed parallel warm-start uses the slice/fuse mapping implemented in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`, but the added step-0 numerical check shows it is not exactly identical end-to-end on a real batch:

+ - `input_projection_max_abs_diff = 0.00122881`
+ - `masked_loss_abs_diff = 0.00398052`
+ - `warmstart_equivalent = False`
+
+ This repo should therefore be read as a matched warm-start study, not as a bitwise-identical step-0 control.
+
+ ## Repo layout

 - `openpi/`
+ - modified source and scripts used for training/eval
+ - copied norm-stats assets for the packed configs
+ - full `2K` and `10K` checkpoint trees
 - `artifacts/twin_handover_packed_parallelization_20260309/`
+ - initial `2K` study bundle
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/`
+ - `10K` follow-up bundle with metrics, logs, repro manifests, and environment snapshot
 - `artifacts/pi05_base_params/`
+ - staged base parameter snapshot used during JAX-to-PyTorch conversion

+ ## Key files

 - Full report: `REPORT.md`
+ - `2K` summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
+ - `10K` summary: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
+ - `10K` comparison table: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
+ - `10K` repro commands: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
+ - `10K` changed-file manifest: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
+ - `10K` environment snapshot: `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
+
+ ## Main changed files
+
+ The initial `2K` and `10K` study logic lives primarily in:
+
+ - `openpi/src/openpi/transforms.py`
+ - `openpi/src/openpi/training/config.py`
+ - `openpi/src/openpi/training/data_loader.py`
+ - `openpi/src/openpi/models/model.py`
+ - `openpi/src/openpi/models/tokenizer.py`
+ - `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
+ - `openpi/scripts/train_pytorch.py`
+ - `openpi/scripts/eval_twin_val_loss_pytorch.py`
+ - `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
+ - `openpi/scripts/inspect_twin_packed_batch.py`
+ - `openpi/scripts/check_parallel_warmstart_equivalence.py`
+ - `openpi/scripts/run_twin_handover_packed_followup.sh`
+ - `openpi/scripts/run_twin_handover_packed_10k.sh`

+ The per-file rationale is recorded in:

+ - `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
REPORT.md CHANGED
@@ -1,347 +1,268 @@
 # Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover

- ## Objective

- Run the minimum scientifically meaningful comparison between:

- 1. a packed single-head `pi0.5` baseline
- 2. a packed parallel-head `pi0.5` model

- Both models were fine-tuned on the same converted public TWIN handover dataset with the same training schedule:

- - train: `lsnu/twin_handover_256_train`
- - val: `lsnu/twin_handover_256_val`
- - hardware: `4x H100 80GB`
- - precision: `bfloat16`
- - global batch size: `16`
- - optimizer steps per model: `2000`
- - save interval: `250`
- - log interval: `10`

- ## Data layout and packing

- The TWIN converted state/action layout is `16` dims in `[L8, R8]`, where each arm is `7` joints plus gripper. The generic `pi0.5` path right-pads to `32` dims, which does not preserve a semantic left/right split for a naive parallel-head setup.

- To keep the experiment minimal and still semantically correct:
-
- - existing public `16`-dim norm stats were reused
- - semantic packing happened after normalization in model transforms
- - both models consumed the same packed `32`-dim layout:

 ```text
 [L8, R8] -> [L8, 0x8, R8, 0x8]
 ```

- - the action loss was masked so only the real arm dims contributed:

- ```text
- active dims: [0:8] and [16:24]
- masked dims: [8:16] and [24:32]
- ```

- The packed-batch sanity check confirmed exact zero padding:

- - `state_padded_zero_count: 16 / 16`
- - `actions_padded_zero_count: 256 / 256`
- - `state_padded_exact_zero: True`
- - `actions_padded_exact_zero: True`

- Reference log:

- - `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`

- ## Code changes tied to files

- The experiment-specific changes are summarized below.

- - `openpi/src/openpi/transforms.py`
- - added `PackPerArmBlocks` and `UnpackPerArmBlocks` for semantic TWIN packed training
 - `openpi/src/openpi/training/config.py`
- - added packed TWIN model-transform path
- - added `action_loss_mask`
- - added `pi05_twin_handover_256_packed_baseline_pytorch_2k`
- - added `pi05_twin_handover_256_packed_parallel_pytorch_2k`
- - `openpi/src/openpi/training/data_loader.py`
- - added `set_epoch`
- - improved local dataset mirror handling and loader startup behavior
- - `openpi/src/openpi/models/model.py`
- - made `pi0_pytorch` import lazy
- - `openpi/src/openpi/models/tokenizer.py`
- - made `AutoProcessor` import lazy
- - `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
- - disabled unconditional `sample_actions` `torch.compile` by default
 - `openpi/scripts/train_pytorch.py`
- - added startup prints
- - added masked action-loss reduction
- - added first-steps debug prints and periodic runtime/memory logging
- - hardened DDP/checkpoint startup
 - `openpi/scripts/eval_twin_val_loss_pytorch.py`
- - added masked validation-loss evaluation with fixed-batch execution
- - `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
- - added exact packed parallel warm-start initialization
- - `openpi/scripts/inspect_twin_packed_batch.py`
- - added packed-batch inspection and zero-padding verification
- - `openpi/scripts/run_twin_handover_packed_followup.sh`
- - added detached follow-up automation for the remaining train/eval stages
- - `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_2k/lsnu/twin_handover_256_train/norm_stats.json`
- - copied the existing handover train norm stats for the packed baseline config
- - `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_2k/lsnu/twin_handover_256_train/norm_stats.json`
- - copied the existing handover train norm stats for the packed parallel config

- Reference file list:

- - `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

- ## Commands run

- The exact rerun command list is saved in:

- - `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`

- The executed flow was:

- 1. packed-batch inspection
- 2. base `pi0.5` JAX-to-PyTorch conversion
- 3. exact packed parallel warm-start initialization from the single-head PyTorch checkpoint
- 4. packed baseline training for `2000` steps
- 5. baseline val at `1000`
- 6. baseline val at `2000`
- 7. packed parallel training for `2000` steps
- 8. parallel val at `1000`
- 9. parallel val at `2000`

- The parallel training and its validation passes were chained through a detached follow-up runner.

- Reference logs:

- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/twin_handover_followup.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_baseline_2k.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_parallel_2k.log`

- ## Startup sanity checks

- ### Norm stats

- The copied norm-stats files were loaded successfully and reported:

- - keys: `['actions', 'state']`
- - `state_mean_len=16`
- - `state_std_len=16`
- - `actions_mean_len=16`
- - `actions_std_len=16`

- Reference:

- - `artifacts/twin_handover_packed_parallelization_20260309/metrics/norm_stats_verification.txt`

- ### Baseline startup summary

- Rank-0 startup logging for the packed baseline recorded:

- ```text
- Resolved config name: pi05_twin_handover_256_packed_baseline_pytorch_2k
- Dataset repo_id: lsnu/twin_handover_256_train
- Norm-stats summary: {'keys': ['actions', 'state'], 'state_mean_len': 16, 'state_std_len': 16, 'actions_mean_len': 16, 'actions_std_len': 16}
- Checkpoint source path: /workspace/checkpoints/pi05_base_single_pytorch
- Model type: baseline
- Packed transforms active: True
- Batch size: local=4, global=16
- Action-loss mask: (1.0 x8, 0.0 x8, 1.0 x8, 0.0 x8)
- Weight loading missing key count: 0
- Weight loading unexpected key count: 0
- ```

- The first debug steps also showed:

- - `observation.state shape=(4, 32)`
- - `actions shape=(4, 16, 32)`
- - `state_nonzero_counts_8d_blocks=[32, 0, 32, 0]`
- - `action_nonzero_counts_8d_blocks=[512, 0, 512, 0]`
- - masked padded dims stayed exactly zero in the batch

- ### Parallel startup summary

- Rank-0 startup logging for the packed parallel run recorded:

- ```text
- Resolved config name: pi05_twin_handover_256_packed_parallel_pytorch_2k
- Dataset repo_id: lsnu/twin_handover_256_train
- Norm-stats summary: {'keys': ['actions', 'state'], 'state_mean_len': 16, 'state_std_len': 16, 'actions_mean_len': 16, 'actions_std_len': 16}
- Checkpoint source path: /workspace/checkpoints/pi05_base_parallel_packed_from_single
- Model type: parallel
- Packed transforms active: True
- Batch size: local=4, global=16
- Action-loss mask: (1.0 x8, 0.0 x8, 1.0 x8, 0.0 x8)
- Weight loading missing key count: 0
- Weight loading unexpected key count: 0
- ```

- The first debug steps matched the expected packed layout:

- - `observation.state shape=(4, 32)`
- - `actions shape=(4, 16, 32)`
- - `state_nonzero_counts_8d_blocks=[32, 0, 32, 0]`
- - `action_nonzero_counts_8d_blocks=[512, 0, 512, 0]`

- ### Smoke tests

- All required smoke tests passed before the main runs:

- 1. `debug_pi05_multiarm_pytorch_smoke`
- 2. packed-batch inspection on `lsnu/twin_handover_256_train`
- 3. packed baseline TWIN smoke on `4` GPUs for `20` steps
- 4. packed parallel TWIN smoke on `4` GPUs for `20` steps

- Smoke logs are stored in:

- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/smoke_handover_packed_baseline_20k.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/smoke_handover_packed_baseline_20l.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/smoke_handover_packed_parallel_20a.log`

- ## Warm-start note

- The packed parallel warm-start was implemented as an exact slice/fuse mapping from the single-head PyTorch checkpoint:

- - input side: split single-head input projection by packed arm blocks
- - fuse side: initialize `arm_token_fuse.weight` as `[I I]`
- - output side: split single-head output projection rows by packed arm blocks

- This was exact by construction for the implemented mapping and both the warm-start checkpoint creation and main-run loading succeeded without missing or unexpected keys.

- What was not done:

- - no separate numerical equivalence test was run that compared step-0 forward outputs between the single-head and parallel-head models on the same batch

- Bootstrap checkpoints:

- - `/workspace/checkpoints/pi05_base_single_pytorch`
- - `/workspace/checkpoints/pi05_base_parallel_packed_from_single`

- Copies are also staged under:

- - `artifacts/twin_handover_packed_parallelization_20260309/bootstrap_checkpoints/`

- ## Results

- ### Training loss snapshots

- | Model | Step 250 | Step 500 | Step 1000 | Step 1500 | Step 2000 |
- | --- | ---: | ---: | ---: | ---: | ---: |
- | Baseline loss | `0.1975` | `0.0606` | `0.0245` | `0.0155` | `0.0391` |
- | Baseline smoothed | `0.1166` | `0.0554` | `0.0387` | `0.0331` | `0.0278` |
- | Parallel loss | `0.1894` | `0.0633` | `0.0214` | `0.0155` | `0.0326` |
- | Parallel smoothed | `0.1153` | `0.0565` | `0.0392` | `0.0331` | `0.0270` |

- ### Validation loss

- | Model | Checkpoint | Batches | Mean val loss | Std val loss |
- | --- | ---: | ---: | ---: | ---: |
- | Baseline | `1000` | `50` | `0.052885` | `0.032533` |
- | Baseline | `2000` | `100` | `0.035776` | `0.027648` |
- | Parallel | `1000` | `50` | `0.051214` | `0.028985` |
- | Parallel | `2000` | `100` | `0.035680` | `0.026077` |

- ### Runtime and memory

- | Item | Value |
- | --- | --- |
- | Pipeline wallclock from baseline launch to final val | `01:32:29` |
- | Detached follow-up runner wallclock | `01:17:47` |
- | Baseline train runtime | `33:27` |
- | Parallel train runtime | `30:38` |
- | Baseline val @ 1000 | `00:05:14` |
- | Baseline val @ 2000 | `00:05:19` |
- | Parallel val @ 1000 | `00:03:23` |
- | Parallel val @ 2000 | `00:03:33` |
- | Peak baseline VRAM | `35.23 GB` |
- | Peak parallel VRAM | `35.27 GB` |

- ### Interpretation

- For this short `2000`-step TWIN handover run, the packed baseline and packed parallel-head models behaved very similarly. The packed parallel-head model ended slightly lower on both validation checkpoints while staying in the same memory range and training cleanly under the same schedule.

- This should be treated as an initial profiling run, not a final benchmark claim.

- Reference metrics:

- - `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- - `artifacts/twin_handover_packed_parallelization_20260309/metrics/train_loss_table.csv`
- - `artifacts/twin_handover_packed_parallelization_20260309/metrics/val_loss_table.csv`

- ## Checkpoints and logs

- ### Main-run checkpoints

- - Baseline step `1000`:
- - `/workspace/openpi/checkpoints/pi05_twin_handover_256_packed_baseline_pytorch_2k/handover_packed_baseline_2k/1000`
- - Baseline step `2000`:
- - `/workspace/openpi/checkpoints/pi05_twin_handover_256_packed_baseline_pytorch_2k/handover_packed_baseline_2k/2000`
- - Parallel step `1000`:
- - `/workspace/openpi/checkpoints/pi05_twin_handover_256_packed_parallel_pytorch_2k/handover_packed_parallel_2k/1000`
- - Parallel step `2000`:
- - `/workspace/openpi/checkpoints/pi05_twin_handover_256_packed_parallel_pytorch_2k/handover_packed_parallel_2k/2000`

- The full checkpoint trees, including smoke checkpoints and intermediate saves every `250` steps, are under:

- - `openpi/checkpoints/pi05_twin_handover_256_packed_baseline_pytorch_2k/`
- - `openpi/checkpoints/pi05_twin_handover_256_packed_parallel_pytorch_2k/`

- ### Bootstrap checkpoints

- - `artifacts/twin_handover_packed_parallelization_20260309/bootstrap_checkpoints/pi05_base_single_pytorch/`
- - `artifacts/twin_handover_packed_parallelization_20260309/bootstrap_checkpoints/pi05_base_parallel_packed_from_single/`

- ### Logs

- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_baseline_2k.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_baseline_2k_val_1000.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_baseline_2k_val_2000.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_parallel_2k.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_parallel_2k_val_1000.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_parallel_2k_val_2000.log`

- ## Environment and provenance snapshot

- Environment snapshots are stored in:

- - `artifacts/twin_handover_packed_parallelization_20260309/environment/system_info.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/gpu_info.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/python_env.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/pip_freeze.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/hf_env.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/selected_env_vars.json`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/workspace_snapshot.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/openpi_source_snapshot.txt`

- OpenPI source provenance:

- - packaged `openpi/` tree does not contain a live `.git` directory
- - source clone snapshot recorded in `openpi_source_snapshot.txt`
- - source commit: `aa91438c0c130dcef4ccf378a56f4cf4cffc1310`

- ## Acceptance criteria status

- 1. Packed-batch inspection showed raw `16`-dim `[L8, R8]` and packed `32`-dim `[L8, 0x8, R8, 0x8]`: `PASS`
- 2. Both smoke tests passed on `4` GPUs with finite loss: `PASS`
- 3. Baseline run started from `/workspace/checkpoints/pi05_base_single_pytorch`: `PASS`
- 4. Parallel run started from `/workspace/checkpoints/pi05_base_parallel_packed_from_single`: `PASS`
- 5. Masked loss was active and padded dims were excluded: `PASS`
- 6. DDP ran without shape/key mismatches: `PASS`
- 7. Quick val was run at step `1000` for both models: `PASS`
- 8. Final val was run at step `2000` for both models: `PASS`
- 9. Both main runs finished under the `10`-hour cap: `PASS`
- 10. Final bundle includes code, checkpoints, logs, metrics, and environment snapshot: `PASS`

- ## Final inventory

- The artifact bundle at repo root contains:

- - all modified training/eval code under `openpi/`
- - all baseline and parallel checkpoints under `openpi/checkpoints/`
- - both bootstrap checkpoints under `artifacts/.../bootstrap_checkpoints/`
- - all train/eval/smoke logs under `artifacts/.../run_logs/`
- - metrics tables and summary JSON under `artifacts/.../metrics/`
- - reproducibility files under `artifacts/.../repro/`
- - environment and provenance snapshot under `artifacts/.../environment/`

- This is a complete rerunnable package for the initial TWIN handover packed action-head parallelization study.
 # Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover

+ ## Scope

+ This repo now contains two completed studies on the same packed TWIN handover setup:

+ 1. the initial `2K` baseline-vs-parallel comparison
+ 2. the longer `10K` follow-up with richer diagnostics

+ Both runs used:

+ - train repo `lsnu/twin_handover_256_train`
+ - val repo `lsnu/twin_handover_256_val`
+ - `4x H100 80GB`
+ - `bfloat16`
+ - packed semantic layout `[L8, 0x8, R8, 0x8]`
+ - active action-loss dims `[0:8]` and `[16:24]`
+ - masked dims `[8:16]` and `[24:32]`

+ Existing public `16`-dim norm stats were reused. No raw-data reconversion was done.

+ ## Data packing and masking

+ The TWIN converted state/action layout is `[L8, R8]`, where each arm is `7` joints plus gripper. The packed transform path added for these runs preserves the left/right semantics inside a `32`-dim model input:

 ```text
 [L8, R8] -> [L8, 0x8, R8, 0x8]
 ```

+ The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:

+ - `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
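The packing step can be sketched in a few lines of numpy; this is an illustration of the mapping, not the repo's actual `PackPerArmBlocks` implementation:

```python
import numpy as np

def pack_per_arm(x16):
    """Map [L8, R8] (16 dims) to [L8, 0x8, R8, 0x8] (32 dims)."""
    packed = np.zeros(x16.shape[:-1] + (32,), dtype=x16.dtype)
    packed[..., 0:8] = x16[..., 0:8]     # left arm: 7 joints + gripper
    packed[..., 16:24] = x16[..., 8:16]  # right arm: 7 joints + gripper
    return packed                        # blocks [8:16] and [24:32] stay exact zeros
```

Keeping each arm at a fixed offset is what lets a parallel head consume a well-defined per-arm block, and the exact zeros in the padded blocks are what the inspection logs verify.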
+ ## Files changed or created

+ ### Initial `2K` study

+ The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached `2K` runner plumbing. The exact per-file list is in:

+ - `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

+ ### `10K` follow-up additions

+ The follow-up changed or added:

 - `openpi/src/openpi/training/config.py`
+ - added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
+ - added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
 - `openpi/scripts/train_pytorch.py`
+ - added periodic per-module gradient buckets for baseline and parallel models
 - `openpi/scripts/eval_twin_val_loss_pytorch.py`
+ - added left/right arm losses
+ - added joint vs gripper losses
+ - added left/right imbalance
+ - added a small deterministic `sample_actions` eval at `num_steps={4,10}`
+ - `openpi/scripts/check_parallel_warmstart_equivalence.py`
+ - added step-0 baseline-vs-parallel numerical check
+ - `openpi/scripts/run_twin_handover_packed_10k.sh`
+ - added detached `10K` train/eval chain
+ - `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
+ - copied existing norm stats for the `10K` baseline config
+ - `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
+ - copied existing norm stats for the `10K` parallel config
+ - `README.md`
+ - updated repo landing page to cover both studies
+ - `REPORT.md`
+ - updated full report to cover both studies

+ The exact `10K` changed-file manifest is in:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`

+ ## Commands and run flow

+ The exact `10K` rerun commands are stored in:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`

+ The `10K` execution flow was:

+ 1. run the warm-start equivalence check
+ 2. baseline `10K` train
+ 3. baseline evals at `1000`, `2000`, `5000`, `10000`
+ 4. parallel `10K` train
+ 5. parallel evals at `1000`, `2000`, `5000`, `10000`

+ The detached runner was:

+ - `openpi/scripts/run_twin_handover_packed_10k.sh`
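The chain itself is just sequential stages, each blocking until the previous one finishes. A hypothetical python sketch of that shape (stage names and the `echo` placeholder commands are illustrative; the real commands are in `run_twin_handover_packed_10k.sh` and `repro/commands_reproduce.sh`):

```python
import subprocess

# Placeholder stages mirroring the flow above: train, then evals per checkpoint.
STAGES = [("baseline_train", ["echo", "train baseline 10k"])]
STAGES += [(f"baseline_eval_{s}", ["echo", f"eval @{s}"]) for s in (1000, 2000, 5000, 10000)]

def run_chain(stages):
    """Run stages in order, stopping at the first failure."""
    codes = {}
    for name, cmd in stages:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        codes[name] = proc.returncode
        if proc.returncode != 0:
            break  # do not start evals if training died
    return codes
```

In practice the real runner is launched detached (e.g. via `nohup ... &`) so the multi-hour pipeline survives the shell session; only the chaining logic is shown here.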
+ Main logs:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log`

+ ## Startup sanity checks

+ Both `10K` runs loaded cleanly with:

+ - packed transforms active
+ - correct `32`-dim packed state/action tensors
+ - mask active on `[0:8]` and `[16:24]`
+ - exact zeros preserved in masked padded blocks
+ - `missing=0`
+ - `unexpected=0`

+ Reference startup summary:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt`

+ Checkpoint sources:

+ - baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
+ - parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`

+ ## Warm-start equivalence check

+ The `10K` study added an explicit step-0 numerical check:

+ - `input_projection_max_abs_diff = 0.00122881`
+ - `input_projection_mean_abs_diff = 0.00015435`
+ - `baseline_masked_loss = 1.00531137`
+ - `parallel_masked_loss = 1.00929189`
+ - `masked_loss_abs_diff = 0.00398052`
+ - `warmstart_equivalent = False`

+ Reference:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log`

+ Interpretation:

+ - the slice/fuse initialization is aligned by construction
+ - it is not numerically exact end-to-end on the same batch
+ - this weakens a strict “identical function at step 0” claim
+ - it does not invalidate the comparison as a matched warm-start study
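The "aligned by construction" point can be made concrete for the purely linear part of the mapping: slicing a single input projection by packed arm blocks and summing the per-arm results via an `[I I]` fuse reproduces the single-head linear map exactly. A numpy sketch (matrix names like `w_single` are hypothetical; the real mapping lives in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
w_single = rng.normal(size=(d_model, 32))  # single-head input projection over packed dims

x = rng.normal(size=32)  # packed vector [L8, 0x8, R8, 0x8]
x[8:16] = 0.0
x[24:32] = 0.0

# Slice the projection by packed arm blocks...
w_left, w_right = w_single[:, 0:16], w_single[:, 16:32]
# ...and fuse the per-arm tokens with an [I I] weight, i.e. an elementwise sum.
fused = w_left @ x[0:16] + w_right @ x[16:32]

assert np.allclose(fused, w_single @ x)  # exact for the isolated linear map
```

This is why the initialization is aligned by construction for the linear pieces; the recorded nonzero diffs indicate that once reduced-precision arithmetic and the surrounding network are involved, end-to-end equivalence on a real batch does not follow.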
+ ## Results

+ ### Initial `2K` study

+ Teacher-forced validation loss:

+ | Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
+ | --- | ---: | ---: | ---: | ---: |
+ | Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23 GB` |
+ | Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27 GB` |

+ ### `10K` train loss snapshots

+ Rank-0 train loss at checkpoint steps:

+ | Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
+ | --- | ---: | ---: | ---: | ---: |
+ | Baseline loss | `0.0228` | `0.0376` | `0.0202` | `0.0141` |
+ | Baseline smoothed | `0.0476` | `0.0273` | `0.0226` | `0.0172` |
+ | Parallel loss | `0.0211` | `0.0368` | `0.0212` | `0.0140` |
+ | Parallel smoothed | `0.0461` | `0.0259` | `0.0225` | `0.0169` |

+ Structured source:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`

+ ### `10K` teacher-forced validation

+ | Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
+ | --- | ---: | ---: | ---: |
+ | `1000` | `0.061130` | `0.059715` | `-0.001415` |
+ | `2000` | `0.041595` | `0.039947` | `-0.001648` |
+ | `5000` | `0.027324` | `0.027340` | `+0.000016` |
+ | `10000` | `0.022345` | `0.022168` | `-0.000177` |

+ The longer run still shows only a very small gap; the models remain extremely close.

+ ### `10K` final arm/joint/gripper breakdown

+ Teacher-forced validation at `10000`:

+ | Metric | Baseline | Parallel |
+ | --- | ---: | ---: |
+ | Mean val loss | `0.022345` | `0.022168` |
+ | Left arm loss | `0.029659` | `0.030184` |
+ | Right arm loss | `0.015031` | `0.014151` |
+ | Left joint loss | `0.031507` | `0.032356` |
+ | Left gripper loss | `0.016725` | `0.014984` |
+ | Right joint loss | `0.015776` | `0.014888` |
+ | Right gripper loss | `0.009818` | `0.008996` |
+ | Left/right imbalance | `0.034067` | `0.033825` |

+ Interpretation:

+ - the parallel model’s small final advantage is mostly on the right-arm side
+ - the baseline is slightly better on left-arm joint loss
+ - the parallel model is slightly better on both grippers and on right-joint loss
+ - imbalance is nearly unchanged
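The breakdown rows map onto fixed slices of the packed layout. A numpy sketch of how such per-block metrics can be computed (block names and the MAE reduction are illustrative; the repo's actual metric definitions are in `openpi/scripts/eval_twin_val_loss_pytorch.py`):

```python
import numpy as np

# Fixed semantic slices of the packed 32-dim layout [L8, 0x8, R8, 0x8];
# each arm block is 7 joint dims followed by 1 gripper dim.
BLOCKS = {
    "left_joints": slice(0, 7),
    "left_gripper": slice(7, 8),
    "right_joints": slice(16, 23),
    "right_gripper": slice(23, 24),
}

def per_block_mae(pred, target):
    """Mean absolute error per semantic block (pred/target: batch, horizon, 32)."""
    return {name: float(np.abs(pred[..., s] - target[..., s]).mean())
            for name, s in BLOCKS.items()}
```

Aggregates like "left arm loss" follow by combining the joint and gripper slices of one arm; the exact left/right imbalance definition used in the tables is recorded in the eval script and is not reproduced here.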
+ ### `10K` sample-based eval

+ Final fixed-subset sample eval at `10000`:

+ | Metric | Baseline | Parallel |
+ | --- | ---: | ---: |
+ | 4-step masked MAE | `0.029935` | `0.029277` |
+ | 10-step masked MAE | `0.030294` | `0.030241` |
+ | 4-step left/right imbalance MAE | `0.033733` | `0.031629` |
+ | 10-step left/right imbalance MAE | `0.034582` | `0.032456` |

+ Interpretation:

+ - sample-based quality is effectively tied by the end
+ - the teacher-forced gap does not widen into a large inference-time separation

+ Structured sources:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`

+ ### Runtime and memory

+ | Stage | Duration |
+ | --- | ---: |
+ | Baseline train | `2:13:40` |
+ | Baseline eval sweep | `0:24:24` |
+ | Parallel train | `2:20:51` |
+ | Parallel eval sweep | `0:43:54` |
+ | Full `10K` pipeline | `5:48:33` |

+ Peak VRAM:

+ - baseline: `35.23 GB`
+ - parallel: `35.27 GB`

+ Reference:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv`

+ ## Artifact locations

+ ### `2K` bundle

+ - `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
+ - `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
+ - `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

+ ### `10K` bundle

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`

+ ## Bottom line

+ The `10K` follow-up suggests the `2K` near-tie was not hiding a large later divergence.

+ - teacher-forced validation ends with a small parallel edge
+ - sample-based eval is essentially tied
+ - left/right imbalance does not materially change
+ - the main difference remains subtle rather than dramatic

+ Overall, the packed parallel head looks competitive and slightly favorable on the masked teacher-forced objective, but the current evidence does not show a large practical separation at inference time on this task.