# Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push

## Scope

This repo now contains three completed studies:

1. the initial `2K` baseline-vs-parallel comparison
2. the longer `10K` follow-up with richer diagnostics
3. a `5K` dual-push `128` screening run on the same packed path

The handover runs used:

- train repo `lsnu/twin_handover_256_train`
- val repo `lsnu/twin_handover_256_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`

Existing public `16`-dim norm stats were reused. No raw-data reconversion was done.

The dual-push screening run used:

- train repo `lsnu/twin_dual_push_128_train`
- val repo `lsnu/twin_dual_push_128_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
- recomputed norm stats for the dual-push `128` train split

## Data packing and masking

The TWIN converted state/action layout is `[L8, R8]`, where each arm is `7` joints plus gripper. The packed transform path added for these runs preserves the left/right semantics inside a `32`-dim model input, where `0x8` denotes a block of eight zero-padded dims (the masked dims `[8:16]` and `[24:32]`):

```text
[L8, R8] -> [L8, 0x8, R8, 0x8]
```

The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:

- `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
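The packing and masking described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the repo's transform: `pack_twin`, `ACTIVE_MASK`, and `masked_mse` are hypothetical names, and the squared-error form of the masked loss is assumed only to show how the mask restricts the average to the active dims.

```python
# Illustrative sketch only; names are hypothetical, not identifiers from the repo.
ARM_DIMS = 8  # 7 joints + 1 gripper per arm
PAD_DIMS = 8  # zero block that follows each arm in the packed layout

def pack_twin(vec16):
    """Map a 16-dim [L8, R8] vector to the 32-dim [L8, 0x8, R8, 0x8] layout."""
    if len(vec16) != 2 * ARM_DIMS:
        raise ValueError("expected a 16-dim [L8, R8] vector")
    left, right = list(vec16[:ARM_DIMS]), list(vec16[ARM_DIMS:])
    zeros = [0.0] * PAD_DIMS
    return left + zeros + right + zeros

# Active action-loss dims are [0:8] and [16:24]; [8:16] and [24:32] are masked.
ACTIVE_MASK = [True] * 8 + [False] * 8 + [True] * 8 + [False] * 8

def masked_mse(pred32, target32, mask=ACTIVE_MASK):
    """Assumed loss form: mean squared error over the active dims only."""
    errs = [(p - t) ** 2 for p, t, m in zip(pred32, target32, mask) if m]
    return sum(errs) / len(errs)

packed = pack_twin([float(i) for i in range(1, 17)])
padding_is_zero = all(v == 0.0 for v, m in zip(packed, ACTIVE_MASK) if not m)
```

With this layout, any prediction error in the padded blocks contributes nothing to the loss, which is why the batch inspection only needs to confirm that the padded target values are exactly zero.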

## Files changed or created

### Initial `2K` study

The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached `2K` runner plumbing. The exact per-file list is in:

- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

### `10K` follow-up additions

The follow-up changed or added:

- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
  - added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
- `openpi/scripts/train_pytorch.py`
  - added periodic per-module gradient buckets for baseline and parallel models
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
  - added left/right arm losses
  - added joint vs gripper losses
  - added left/right imbalance
  - added small deterministic `sample_actions` eval at `num_steps={4,10}`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
  - added step-0 baseline-vs-parallel numerical check
- `openpi/scripts/run_twin_handover_packed_10k.sh`
  - added detached `10K` train/eval chain
- `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the `10K` baseline config
- `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the `10K` parallel config
- `README.md`
  - updated repo landing page to cover both studies
- `REPORT.md`
  - updated full report to cover both studies

The exact `10K` changed-file manifest is:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`

### Dual-push `5K` screening additions

The dual-push screening run added or updated:

- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_dual_push_128_packed_baseline_pytorch_5k`
  - added `pi05_twin_dual_push_128_packed_parallel_pytorch_5k`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
  - added detached dual-push `5K` baseline->eval sweep->parallel->eval sweep runner
- `openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push `128` train norm stats for the packed baseline config
- `openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push `128` train norm stats for the packed parallel config
- `README.md`
  - updated landing page to cover the dual-push screening study
- `REPORT.md`
  - updated full report to include dual-push setup, results, and artifact locations

The exact dual-push changed-file manifest is:

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`

## Commands and run flow

The exact `10K` rerun commands are stored in:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`

The `10K` execution flow was:

1. run the warm-start equivalence check
2. baseline `10K` train
3. baseline evals at `1000`, `2000`, `5000`, `10000`
4. parallel `10K` train
5. parallel evals at `1000`, `2000`, `5000`, `10000`

The detached runner was:

- `openpi/scripts/run_twin_handover_packed_10k.sh`

Main logs:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log`

## Startup sanity checks

Both `10K` runs loaded cleanly with:

- packed transforms active
- correct `32`-dim packed state/action tensors
- mask active on `[0:8]` and `[16:24]`
- exact zeros preserved in masked padded blocks
- `missing=0`
- `unexpected=0`

Reference startup summary:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt`

Checkpoint sources:

- baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
- parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`

## Warm-start equivalence check

The `10K` study added an explicit step-0 numerical check:

- `input_projection_max_abs_diff = 0.00122881`
- `input_projection_mean_abs_diff = 0.00015435`
- `baseline_masked_loss = 1.00531137`
- `parallel_masked_loss = 1.00929189`
- `masked_loss_abs_diff = 0.00398052`
- `warmstart_equivalent = False`

Reference:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log`

Interpretation:

- the slice/fuse initialization is aligned by construction
- it is not numerically exact end-to-end on the same batch
- this weakens a strict “identical function at step 0” claim
- it does not invalidate the comparison as a matched warm-start study

Dual-push `5K` warm-start check:

- `input_projection_max_abs_diff = 0.00099802`
- `input_projection_mean_abs_diff = 0.00010568`
- `baseline_masked_loss = 1.43506372`
- `parallel_masked_loss = 1.52086782`
- `masked_loss_abs_diff = 0.08580410`
- `warmstart_equivalent = False`

Reference:

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log`
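The reported `masked_loss_abs_diff` values follow directly from the two masked losses in each check. The sketch below reproduces that arithmetic; `warmstart_report` and `ASSUMED_TOL` are hypothetical, and the tolerance in particular is an assumption, since the threshold used by `check_parallel_warmstart_equivalence.py` is not stated in this report.

```python
# Illustrative sketch; the tolerance is an assumed value, not the script's threshold.
ASSUMED_TOL = 1e-6

def warmstart_report(baseline_masked_loss, parallel_masked_loss, tol=ASSUMED_TOL):
    """Compare step-0 masked losses of the two models on the same batch."""
    diff = abs(parallel_masked_loss - baseline_masked_loss)
    return {
        "masked_loss_abs_diff": round(diff, 8),
        "warmstart_equivalent": diff <= tol,
    }

# Loss values copied from the two checks reported above.
handover_10k = warmstart_report(1.00531137, 1.00929189)
dual_push_5k = warmstart_report(1.43506372, 1.52086782)
```

Both checks fail equivalence under any tolerance tighter than the handover difference, and the dual-push step-0 gap is more than an order of magnitude larger than the handover gap.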

## Results

### Initial `2K` study

Teacher-forced validation loss:

| Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23GB` |
| Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27GB` |

### `10K` train checkpoints

Rank-0 train snapshots:

| Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
| --- | ---: | ---: | ---: | ---: |
| Baseline loss | `0.0228` | `0.0376` | `0.0202` | `0.0141` |
| Baseline smoothed | `0.0476` | `0.0273` | `0.0226` | `0.0172` |
| Parallel loss | `0.0211` | `0.0368` | `0.0212` | `0.0140` |
| Parallel smoothed | `0.0461` | `0.0259` | `0.0225` | `0.0169` |

Structured source:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`

### `10K` teacher-forced validation

| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.061130` | `0.059715` | `-0.001415` |
| `2000` | `0.041595` | `0.039947` | `-0.001648` |
| `5000` | `0.027324` | `0.027340` | `+0.000016` |
| `10000` | `0.022345` | `0.022168` | `-0.000177` |

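The longer run still shows only a very small gap. At `10000` the parallel model leads by `0.000177`, which is under 1% of the baseline loss, and the sign of the delta briefly flips at `5000`. The models remain extremely close.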
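The delta column can be recomputed from the two loss columns, and expressing it relative to the baseline makes the size of the gap easier to judge. A minimal sketch using only the values in the table above (the variable names are illustrative):

```python
# Teacher-forced validation losses copied from the 10K table: step -> (baseline, parallel).
VAL_LOSS = {
    1000: (0.061130, 0.059715),
    2000: (0.041595, 0.039947),
    5000: (0.027324, 0.027340),
    10000: (0.022345, 0.022168),
}

deltas = {}
relative_pct = {}
for step, (baseline, parallel) in VAL_LOSS.items():
    delta = parallel - baseline           # negative means parallel is better
    deltas[step] = round(delta, 6)
    relative_pct[step] = 100.0 * delta / baseline

for step in VAL_LOSS:
    print(f"{step:>6}  delta={deltas[step]:+.6f}  relative={relative_pct[step]:+.2f}%")
```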

### `10K` final arm/joint/gripper breakdown

Teacher-forced validation at `10000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.022345` | `0.022168` |
| Left arm loss | `0.029659` | `0.030184` |
| Right arm loss | `0.015031` | `0.014151` |
| Left joint loss | `0.031507` | `0.032356` |
| Left gripper loss | `0.016725` | `0.014984` |
| Right joint loss | `0.015776` | `0.014888` |
| Right gripper loss | `0.009818` | `0.008996` |
| Left/right imbalance | `0.034067` | `0.033825` |

Interpretation:

- the parallel model’s small final advantage is mostly on the right-arm side
- the baseline is slightly better on left-arm joint loss
- the parallel model is slightly better on both grippers and on right-joint loss
- imbalance is nearly unchanged
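The rows of the breakdown table are numerically consistent with one another under a simple reading: each arm loss is a uniform mean over `7` joint dims plus `1` gripper dim, and the mean val loss is the average of the two arm losses. That relationship is inferred from the reported numbers, not confirmed from the eval script, so the sketch below is only a consistency check (`arm_from_parts` and `max_residual` are hypothetical helpers).

```python
# Values copied from the 10000-step teacher-forced breakdown table.
REPORTED = {
    "baseline": {"mean": 0.022345, "left": 0.029659, "right": 0.015031,
                 "left_joint": 0.031507, "left_grip": 0.016725,
                 "right_joint": 0.015776, "right_grip": 0.009818},
    "parallel": {"mean": 0.022168, "left": 0.030184, "right": 0.014151,
                 "left_joint": 0.032356, "left_grip": 0.014984,
                 "right_joint": 0.014888, "right_grip": 0.008996},
}
N_JOINTS = 7  # joints per arm, from the [L8, R8] layout

def arm_from_parts(joint_loss, gripper_loss):
    # Assumption: arm loss is a uniform mean over 7 joint dims and 1 gripper dim.
    return (N_JOINTS * joint_loss + gripper_loss) / (N_JOINTS + 1)

def max_residual(m):
    """Largest mismatch between reported values and the assumed relationships."""
    left = arm_from_parts(m["left_joint"], m["left_grip"])
    right = arm_from_parts(m["right_joint"], m["right_grip"])
    mean = (m["left"] + m["right"]) / 2
    return max(abs(left - m["left"]), abs(right - m["right"]), abs(mean - m["mean"]))

residuals = {name: max_residual(m) for name, m in REPORTED.items()}
```

Because the joint dims carry seven eighths of each arm's weight under this reading, the baseline's lead on left-joint loss outweighs the parallel model's left-gripper gain in the left-arm total.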

### `10K` sample-based eval

Final fixed-subset sample eval at `10000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| 4-step masked MAE | `0.029935` | `0.029277` |
| 10-step masked MAE | `0.030294` | `0.030241` |
| 4-step left/right imbalance MAE | `0.033733` | `0.031629` |
| 10-step left/right imbalance MAE | `0.034582` | `0.032456` |

Interpretation:

- sample-based quality is effectively tied by the end
- the teacher-forced gap does not widen into a large inference-time separation

Structured sources:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`

### Runtime and memory

| Stage | Duration |
| --- | ---: |
| Baseline train | `2:13:40` |
| Baseline eval sweep | `0:24:24` |
| Parallel train | `2:20:51` |
| Parallel eval sweep | `0:43:54` |
| Full `10K` pipeline | `5:48:33` |

Peak VRAM:

- baseline: `35.23GB`
- parallel: `35.27GB`

Reference:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv`
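The four itemized stages do not add up to the full pipeline duration, so some wall-clock time is spent outside them. A small sketch of that arithmetic, using only the durations in the table above (`to_seconds` and `to_hms` are illustrative helpers):

```python
def to_seconds(hms):
    """Parse an H:MM:SS string into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def to_hms(total_seconds):
    """Format seconds back into H:MM:SS."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

# Durations copied from the 10K runtime table.
STAGES = {
    "baseline_train": "2:13:40",
    "baseline_eval_sweep": "0:24:24",
    "parallel_train": "2:20:51",
    "parallel_eval_sweep": "0:43:54",
}
FULL_PIPELINE = "5:48:33"

itemized = sum(to_seconds(d) for d in STAGES.values())
unitemized = to_seconds(FULL_PIPELINE) - itemized
print(to_hms(itemized), to_hms(unitemized))
```

The remainder is time not attributed to any of the four stages in the table; the report does not break it down further.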

## Dual-push `128` screening results

### Teacher-forced validation

| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.095597` | `0.093704` | `-0.001893` |
| `2000` | `0.083194` | `0.082729` | `-0.000465` |
| `5000` | `0.055958` | `0.055242` | `-0.000716` |

The screening signal is small but consistently favors the packed parallel model, which has the lower validation loss at all three checkpoints.

### Dual-push arm breakdown

Teacher-forced validation at `5000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.055958` | `0.055242` |
| Left arm loss | `0.017725` | `0.017044` |
| Right arm loss | `0.094191` | `0.093439` |
| Left joint loss | `0.017577` | `0.017052` |
| Left gripper loss | `0.018765` | `0.016992` |
| Right joint loss | `0.103576` | `0.102856` |
| Right gripper loss | `0.028502` | `0.027523` |
| Left/right imbalance | `0.080993` | `0.081011` |

Interpretation:

- the small parallel advantage is visible on both arms
- the right arm remains much harder than the left on this task
- left/right imbalance is essentially unchanged

### Dual-push sample-based eval

Fixed-subset sample eval:

| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
| --- | --- | ---: | ---: |
| `1000` | baseline | `0.103199` | `0.108652` |
| `1000` | parallel | `0.101439` | `0.106874` |
| `2000` | baseline | `0.069732` | `0.074413` |
| `2000` | parallel | `0.069053` | `0.073501` |
| `5000` | baseline | `0.056830` | `0.058973` |
| `5000` | parallel | `0.054630` | `0.056627` |

Interpretation:

- the parallel model is also slightly better on fixed-subset inference-style eval
- unlike handover, the positive signal stays visible at `5K`
- the margin is still small enough that this remains a screening result, not a paper-final claim
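The claim that the parallel model is ahead at every checkpoint can be checked mechanically from the sample-eval table. The sketch below pairs the baseline and parallel rows per checkpoint and tests the sign of each difference; the data structure and names are illustrative, and the values are copied from the table above.

```python
# step -> model -> (4-step masked MAE, 10-step masked MAE)
SAMPLE_EVAL = {
    1000: {"baseline": (0.103199, 0.108652), "parallel": (0.101439, 0.106874)},
    2000: {"baseline": (0.069732, 0.074413), "parallel": (0.069053, 0.073501)},
    5000: {"baseline": (0.056830, 0.058973), "parallel": (0.054630, 0.056627)},
}

def mae_deltas(row):
    """Return (4-step, 10-step) parallel-minus-baseline differences."""
    return tuple(round(p - b, 6) for b, p in zip(row["baseline"], row["parallel"]))

per_checkpoint = {step: mae_deltas(row) for step, row in SAMPLE_EVAL.items()}

# True only if parallel has lower MAE for both step counts at every checkpoint.
parallel_better_everywhere = all(
    delta < 0 for pair in per_checkpoint.values() for delta in pair
)
```

A sign check like this says nothing about significance; with a single seed per model it only confirms that the direction of the difference is the same across checkpoints.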

### Dual-push runtime and memory

| Stage | Duration |
| --- | ---: |
| Baseline train | `1:05:25` |
| Baseline eval sweep | `0:14:34` |
| Parallel train | `1:00:33` |
| Parallel eval sweep | `0:14:39` |
| Full dual-push pipeline | `2:35:11` |

Peak VRAM:

- baseline: `35.23GB`
- parallel: `35.27GB`

## Artifact locations

### `2K` bundle

- `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

### `10K` bundle

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`

### Dual-push `5K` bundle

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`

## Bottom line

The `10K` follow-up suggests the `2K` near-tie was not hiding a large later divergence.

- teacher-forced validation ends with a small parallel edge
- sample-based eval is essentially tied
- left/right imbalance does not materially change
- the main difference remains subtle rather than dramatic

The dual-push screening run adds a second signal:

- the packed parallel model is slightly better at `1K`, `2K`, and `5K`
- the same small advantage appears on both teacher-forced and sample-based eval
- the effect is still modest, but it is cleaner and more consistent than on handover

So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.