Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push
Scope
This repo now contains three completed studies:
- the initial 2K baseline-vs-parallel comparison
- the longer 10K follow-up with richer diagnostics
- a 5K dual-push 128 screening run on the same packed path
The handover runs used:
- train repo `lsnu/twin_handover_256_train`
- val repo `lsnu/twin_handover_256_val`
- 4x H100 80GB, bfloat16
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
Existing public 16-dim norm stats were reused. No raw-data reconversion was done.
The dual-push screening run used:
- train repo `lsnu/twin_dual_push_128_train`
- val repo `lsnu/twin_dual_push_128_val`
- 4x H100 80GB, bfloat16
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
- recomputed norm stats for the dual-push 128 train split
Data packing and masking
The TWIN converted state/action layout is [L8, R8], where each arm is 7 joints plus gripper. The packed transform path added for these runs preserves the left/right semantics inside a 32-dim model input:
[L8, R8] -> [L8, 0x8, R8, 0x8]
The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:
artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log
artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log
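The packed layout and loss mask described above can be sketched as follows. This is a minimal illustration; the function and constant names are hypothetical, not the repo's actual transform API:

```python
import numpy as np

# Illustrative sketch of the [L8, R8] -> [L8, 0x8, R8, 0x8] packing;
# names here are hypothetical, not the repo's actual transform API.
LEFT = slice(0, 8)          # active left-arm block
LEFT_PAD = slice(8, 16)     # zero padding, masked out of the loss
RIGHT = slice(16, 24)       # active right-arm block
RIGHT_PAD = slice(24, 32)   # zero padding, masked out of the loss

def pack_twin(vec16: np.ndarray) -> np.ndarray:
    """Map a converted [L8, R8] vector to the packed 32-dim layout."""
    packed = np.zeros(vec16.shape[:-1] + (32,), dtype=vec16.dtype)
    packed[..., LEFT] = vec16[..., :8]
    packed[..., RIGHT] = vec16[..., 8:]
    return packed

def action_loss_mask() -> np.ndarray:
    """Boolean mask selecting only the active dims [0:8] and [16:24]."""
    mask = np.zeros(32, dtype=bool)
    mask[LEFT] = True
    mask[RIGHT] = True
    return mask
```

The batch inspection referenced above amounts to asserting that `packed[..., LEFT_PAD]` and `packed[..., RIGHT_PAD]` are exactly zero.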
Files changed or created
Initial 2K study
The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached 2K runner plumbing. The exact per-file list is in:
artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt
10K follow-up additions
The follow-up changed or added:
- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
  - added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
- `openpi/scripts/train_pytorch.py`
  - added periodic per-module gradient buckets for baseline and parallel models
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
  - added left/right arm losses
  - added joint vs gripper losses
  - added left/right imbalance
  - added a small deterministic `sample_actions` eval at `num_steps={4,10}`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
  - added a step-0 baseline-vs-parallel numerical check
- `openpi/scripts/run_twin_handover_packed_10k.sh`
  - added the detached 10K train/eval chain
- `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the 10K baseline config
- `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the 10K parallel config
- `README.md`
  - updated the repo landing page to cover both studies
- `REPORT.md`
  - updated the full report to cover both studies
The exact 10K changed-file manifest is:
artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
Dual-push 5K screening additions
The dual-push screening run added or updated:
- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_dual_push_128_packed_baseline_pytorch_5k`
  - added `pi05_twin_dual_push_128_packed_parallel_pytorch_5k`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
  - added the detached dual-push 5K baseline -> eval sweep -> parallel -> eval sweep runner
- `openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push 128 train norm stats for the packed baseline config
- `openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push 128 train norm stats for the packed parallel config
- `README.md`
  - updated the landing page to cover the dual-push screening study
- `REPORT.md`
  - updated the full report to include dual-push setup, results, and artifact locations
The exact dual-push changed-file manifest is:
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt
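The recomputed norm stats amount to per-dimension statistics over the new train split. A minimal sketch, assuming a simple mean/std schema; the repo's actual `norm_stats.json` format may differ:

```python
import json
import numpy as np

def compute_norm_stats(actions: np.ndarray) -> dict:
    """Per-dimension mean/std over a (num_samples, action_dim) train split.

    The key names below are illustrative and may not match the repo's
    norm_stats.json schema.
    """
    return {
        "mean": actions.mean(axis=0).tolist(),
        "std": (actions.std(axis=0) + 1e-6).tolist(),  # epsilon avoids div-by-zero
    }

if __name__ == "__main__":
    # Demo on fake 16-dim actions standing in for the dual-push 128 split.
    rng = np.random.default_rng(0)
    fake_actions = rng.normal(size=(1000, 16))
    print(json.dumps({"actions": compute_norm_stats(fake_actions)})[:60], "...")
```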
Commands and run flow
The exact 10K rerun commands are stored in:
artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh
The 10K execution flow was:
- run the warm-start equivalence check
- baseline 10K train
- baseline evals at 1000, 2000, 5000, 10000
- parallel 10K train
- parallel evals at 1000, 2000, 5000, 10000
The detached runner was:
openpi/scripts/run_twin_handover_packed_10k.sh
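The detached pattern that runner relies on, launching the multi-hour chain in its own session so it survives the terminal, can be sketched in Python as follows. `launch_detached` is a hypothetical helper; the real commands are in `repro/commands_reproduce.sh`:

```python
import subprocess

def launch_detached(cmd: list, log_path: str) -> int:
    """Launch cmd detached from the controlling terminal, appending
    stdout/stderr to log_path, and return the child's pid."""
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # survives hangup of the launching shell
        )
    return proc.pid

if __name__ == "__main__":
    # Placeholder invocation of the report's runner script.
    pid = launch_detached(
        ["bash", "openpi/scripts/run_twin_handover_packed_10k.sh"],
        "handover_packed_10k_followup.log",
    )
    print("detached pid", pid)
```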
Main logs:
artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log
artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log
artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log
Startup sanity checks
Both 10K runs loaded cleanly with:
- packed transforms active
- correct 32-dim packed state/action tensors
- mask active on `[0:8]` and `[16:24]`
- exact zeros preserved in the masked padded blocks
- checkpoint load with `missing=0`, `unexpected=0`
Reference startup summary:
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt
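The startup checks above can be expressed as plain assertions. These helpers are illustrative, not the repo's actual API; the key-count check mirrors the `missing`/`unexpected` report of a strict-free state-dict load:

```python
import numpy as np

def check_checkpoint_keys(model_keys, ckpt_keys) -> dict:
    """Mirror a load_state_dict report: both counts must be zero."""
    missing = sorted(set(model_keys) - set(ckpt_keys))
    unexpected = sorted(set(ckpt_keys) - set(model_keys))
    return {"missing": len(missing), "unexpected": len(unexpected)}

def check_packed_batch(actions: np.ndarray) -> None:
    """Verify the packed-layout invariants on one inspected batch."""
    assert actions.shape[-1] == 32, "expected packed 32-dim actions"
    # masked padded blocks must be exactly zero, not merely small
    assert np.all(actions[..., 8:16] == 0)
    assert np.all(actions[..., 24:32] == 0)
```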
Checkpoint sources:
- baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
- parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`
Warm-start equivalence check
The 10K study added an explicit step-0 numerical check:
input_projection_max_abs_diff = 0.00122881
input_projection_mean_abs_diff = 0.00015435
baseline_masked_loss = 1.00531137
parallel_masked_loss = 1.00929189
masked_loss_abs_diff = 0.00398052
warmstart_equivalent = False
Reference:
artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log
Interpretation:
- the slice/fuse initialization is aligned by construction
- it is not numerically exact end-to-end on the same batch
- this weakens a strict “identical function at step 0” claim
- it does not invalidate the comparison as a matched warm-start study
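A minimal sketch of how such step-0 diagnostics can be computed, assuming hypothetical names and a tolerance; the real check lives in `openpi/scripts/check_parallel_warmstart_equivalence.py`:

```python
import numpy as np

def warmstart_diagnostics(base_proj, par_proj, base_loss, par_loss, tol=1e-4):
    """Compare baseline vs parallel at step 0 on the same batch.

    base_proj/par_proj: input-projection outputs; base_loss/par_loss:
    masked losses. The tol value here is an illustrative assumption.
    """
    diff = np.abs(np.asarray(base_proj) - np.asarray(par_proj))
    loss_diff = abs(base_loss - par_loss)
    return {
        "input_projection_max_abs_diff": float(diff.max()),
        "input_projection_mean_abs_diff": float(diff.mean()),
        "masked_loss_abs_diff": loss_diff,
        # aligned by construction, but not necessarily numerically exact
        "warmstart_equivalent": bool(diff.max() <= tol and loss_diff <= tol),
    }
```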
Dual-push 5K warm-start check:
input_projection_max_abs_diff = 0.00099802
input_projection_mean_abs_diff = 0.00010568
baseline_masked_loss = 1.43506372
parallel_masked_loss = 1.52086782
masked_loss_abs_diff = 0.08580410
warmstart_equivalent = False
Reference:
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log
Results
Initial 2K study
Teacher-forced validation loss:
| Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
|---|---|---|---|---|
| Packed baseline | 0.052885 | 0.035776 | 33:27 | 35.23 GB |
| Packed parallel | 0.051214 | 0.035680 | 30:38 | 35.27 GB |
10K train checkpoints
Rank-0 train snapshots:
| Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
|---|---|---|---|---|
| Baseline loss | 0.0228 | 0.0376 | 0.0202 | 0.0141 |
| Baseline smoothed | 0.0476 | 0.0273 | 0.0226 | 0.0172 |
| Parallel loss | 0.0211 | 0.0368 | 0.0212 | 0.0140 |
| Parallel smoothed | 0.0461 | 0.0259 | 0.0225 | 0.0169 |
Structured source:
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv
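The "smoothed" rows are presumably a running average over the noisy per-step train loss. As one common choice, an exponential moving average can be computed like this; the exact smoothing the repo uses is an assumption here:

```python
def ema_smooth(values, beta=0.99):
    """Exponential moving average of a loss curve.

    This is an assumed reconstruction of the 'smoothed' rows, not a
    confirmed detail of the repo's logging.
    """
    smoothed, avg = [], None
    for v in values:
        avg = v if avg is None else beta * avg + (1 - beta) * v
        smoothed.append(avg)
    return smoothed
```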
10K teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
|---|---|---|---|
| 1000 | 0.061130 | 0.059715 | -0.001415 |
| 2000 | 0.041595 | 0.039947 | -0.001648 |
| 5000 | 0.027324 | 0.027340 | +0.000016 |
| 10000 | 0.022345 | 0.022168 | -0.000177 |
Even at 10K steps the gap stays very small; the two models remain extremely close throughout.
10K final arm/joint/gripper breakdown
Teacher-forced validation at 10000:
| Metric | Baseline | Parallel |
|---|---|---|
| Mean val loss | 0.022345 | 0.022168 |
| Left arm loss | 0.029659 | 0.030184 |
| Right arm loss | 0.015031 | 0.014151 |
| Left joint loss | 0.031507 | 0.032356 |
| Left gripper loss | 0.016725 | 0.014984 |
| Right joint loss | 0.015776 | 0.014888 |
| Right gripper loss | 0.009818 | 0.008996 |
| Left/right imbalance | 0.034067 | 0.033825 |
Interpretation:
- the parallel model’s small final advantage is mostly on the right-arm side
- the baseline is slightly better on left-arm joint loss
- the parallel model is slightly better on both grippers and on right-joint loss
- imbalance is nearly unchanged
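One plausible definition of the left/right imbalance metric is the mean over samples of the absolute gap between per-sample left-block and right-block errors. This is an assumption for illustration only; the eval script defines the actual formula:

```python
import numpy as np

# Assumed reconstruction of the left/right imbalance metric; the repo's
# eval_twin_val_loss_pytorch.py is the authoritative definition.
def lr_imbalance(err: np.ndarray) -> float:
    """err: (num_samples, 32) per-dim absolute errors in the packed layout."""
    left = err[:, 0:8].mean(axis=-1)     # active left-arm block
    right = err[:, 16:24].mean(axis=-1)  # active right-arm block
    return float(np.abs(left - right).mean())
```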
10K sample-based eval
Final fixed-subset sample eval at 10000:
| Metric | Baseline | Parallel |
|---|---|---|
| 4-step masked MAE | 0.029935 | 0.029277 |
| 10-step masked MAE | 0.030294 | 0.030241 |
| 4-step left/right imbalance MAE | 0.033733 | 0.031629 |
| 10-step left/right imbalance MAE | 0.034582 | 0.032456 |
Interpretation:
- sample-based quality is effectively tied by the end
- the teacher-forced gap does not widen into a large inference-time separation
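The masked-MAE numbers above restrict the error to the active dims. A hedged sketch of that computation, with illustrative names for the sampling plumbing:

```python
import numpy as np

# Active dims of the packed 32-dim layout: [0:8] and [16:24].
ACTIVE = np.r_[0:8, 16:24]

def masked_mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error over the active dims only."""
    return float(np.abs(pred[..., ACTIVE] - target[..., ACTIVE]).mean())

def eval_sampled(model_sample, batch, num_steps: int) -> float:
    # model_sample stands in for the model's sample_actions call at the
    # given num_steps (4 or 10 in the report); names are illustrative.
    pred = model_sample(batch["obs"], num_steps=num_steps)
    return masked_mae(pred, batch["actions"])
```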
Structured sources:
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv
Runtime and memory
| Stage | Duration |
|---|---|
| Baseline train | 2:13:40 |
| Baseline eval sweep | 0:24:24 |
| Parallel train | 2:20:51 |
| Parallel eval sweep | 0:43:54 |
| Full 10K pipeline | 5:48:33 |
Peak VRAM:
- baseline: 35.23 GB
- parallel: 35.27 GB
Reference:
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv
Dual-push 128 screening results
Teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
|---|---|---|---|
| 1000 | 0.095597 | 0.093704 | -0.001893 |
| 2000 | 0.083194 | 0.082729 | -0.000465 |
| 5000 | 0.055958 | 0.055242 | -0.000716 |
The screening signal is small but consistently positive for the packed parallel model at all three checkpoints.
Dual-push arm breakdown
Teacher-forced validation at 5000:
| Metric | Baseline | Parallel |
|---|---|---|
| Mean val loss | 0.055958 | 0.055242 |
| Left arm loss | 0.017725 | 0.017044 |
| Right arm loss | 0.094191 | 0.093439 |
| Left joint loss | 0.017577 | 0.017052 |
| Left gripper loss | 0.018765 | 0.016992 |
| Right joint loss | 0.103576 | 0.102856 |
| Right gripper loss | 0.028502 | 0.027523 |
| Left/right imbalance | 0.080993 | 0.081011 |
Interpretation:
- the small parallel advantage is visible on both arms
- the right arm remains much harder than the left on this task
- left/right imbalance is essentially unchanged
Dual-push sample-based eval
Fixed-subset sample eval:
| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
|---|---|---|---|
| 1000 | baseline | 0.103199 | 0.108652 |
| 1000 | parallel | 0.101439 | 0.106874 |
| 2000 | baseline | 0.069732 | 0.074413 |
| 2000 | parallel | 0.069053 | 0.073501 |
| 5000 | baseline | 0.056830 | 0.058973 |
| 5000 | parallel | 0.054630 | 0.056627 |
Interpretation:
- the parallel model is also slightly better on fixed-subset inference-style eval
- unlike handover, the positive signal stays visible at 5K
- the margin is still small enough that this remains a screening result, not a paper-final claim
Dual-push runtime and memory
| Stage | Duration |
|---|---|
| Baseline train | 1:05:25 |
| Baseline eval sweep | 0:14:34 |
| Parallel train | 1:00:33 |
| Parallel eval sweep | 0:14:39 |
| Full dual-push pipeline | 2:35:11 |
Peak VRAM:
- baseline: 35.23 GB
- parallel: 35.27 GB
Artifact locations
2K bundle
artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json
artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh
artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt
10K bundle
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh
artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
artifacts/twin_handover_packed_parallelization_10k_20260309/environment/
Dual-push 5K bundle
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/
Bottom line
The 10K follow-up suggests the 2K near-tie was not hiding a large later divergence.
- teacher-forced validation ends with a small parallel edge
- sample-based eval is essentially tied
- left/right imbalance does not materially change
- the main difference remains subtle rather than dramatic
The dual-push screening run adds a second signal:
- the packed parallel model is slightly better at 1K, 2K, and 5K
- the same small advantage appears on both teacher-forced and sample-based eval
- the effect is still modest, but it is cleaner and more consistent than handover
So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.