# Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push
## Scope
This repo now contains three completed studies:
1. the initial `2K` baseline-vs-parallel comparison
2. the longer `10K` follow-up with richer diagnostics
3. a `5K` dual-push `128` screening run on the same packed path
The handover runs used:
- train repo `lsnu/twin_handover_256_train`
- val repo `lsnu/twin_handover_256_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
Existing public `16`-dim norm stats were reused. No raw-data reconversion was done.
The dual-push screening run used:
- train repo `lsnu/twin_dual_push_128_train`
- val repo `lsnu/twin_dual_push_128_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
- recomputed norm stats for the dual-push `128` train split
## Data packing and masking
The TWIN-converted state/action layout is `[L8, R8]`, where each arm is `7` joints plus a gripper. The packed transform path added for these runs preserves the left/right semantics inside a `32`-dim model input:
```text
[L8, R8] -> [L8, 0x8, R8, 0x8]
```
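The transform and its loss mask can be sketched as below. This is a minimal illustration of the layout only; the function names are hypothetical and this is not the actual openpi transform code:

```python
import numpy as np

ARM_DIM = 8      # 7 joints + 1 gripper per arm
PACKED_DIM = 32  # [L8, 0x8, R8, 0x8]

def pack_actions(actions_16: np.ndarray) -> np.ndarray:
    """Pack a [..., 16] = [L8, R8] array into [..., 32] = [L8, 0x8, R8, 0x8]."""
    packed = np.zeros(actions_16.shape[:-1] + (PACKED_DIM,), dtype=actions_16.dtype)
    packed[..., 0:8] = actions_16[..., 0:8]     # left arm -> dims [0:8]
    packed[..., 16:24] = actions_16[..., 8:16]  # right arm -> dims [16:24]
    return packed                               # dims [8:16] and [24:32] stay 0

def action_loss_mask() -> np.ndarray:
    """Supervise only dims [0:8] and [16:24]; padded blocks carry no loss."""
    mask = np.zeros(PACKED_DIM, dtype=bool)
    mask[0:8] = True
    mask[16:24] = True
    return mask
```

Inverting the pack (taking dims `[0:8]` and `[16:24]`) recovers the original `16`-dim layout exactly, which is what the batch inspection below verifies.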
The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:
- `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
## Files changed or created
### Initial `2K` study
The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached `2K` runner plumbing. The exact per-file list is in:
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
### `10K` follow-up additions
The follow-up changed or added:
- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
  - added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
- `openpi/scripts/train_pytorch.py`
  - added periodic per-module gradient buckets for baseline and parallel models
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
  - added left/right arm losses
  - added joint vs gripper losses
  - added left/right imbalance
  - added small deterministic `sample_actions` eval at `num_steps={4,10}`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
  - added step-0 baseline-vs-parallel numerical check
- `openpi/scripts/run_twin_handover_packed_10k.sh`
  - added detached `10K` train/eval chain
- `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the `10K` baseline config
- `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the `10K` parallel config
- `README.md`
  - updated repo landing page to cover both studies
- `REPORT.md`
  - updated full report to cover both studies
The exact `10K` changed-file manifest is:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
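The per-module gradient buckets mentioned above can be sketched roughly as follows; the bucketing key (top-level module name) and the logging cadence are illustrative assumptions, not the script's exact implementation:

```python
from collections import defaultdict

import torch

def grad_norm_buckets(model: torch.nn.Module) -> dict:
    """Aggregate parameter gradient L2 norms by top-level module name."""
    sq = defaultdict(float)
    for name, p in model.named_parameters():
        if p.grad is not None:
            bucket = name.split(".", 1)[0]  # e.g. "action_head", "backbone"
            sq[bucket] += p.grad.detach().float().pow(2).sum().item()
    return {k: v ** 0.5 for k, v in sq.items()}

# usage sketch: log after backward(), before optimizer.step()
# if step % log_every == 0:
#     print(step, grad_norm_buckets(model))
```

Bucketing at the top-level name is enough to compare, say, action-head vs backbone gradient magnitudes between the baseline and parallel models without logging every parameter.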
### Dual-push `5K` screening additions
The dual-push screening run added or updated:
- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_dual_push_128_packed_baseline_pytorch_5k`
  - added `pi05_twin_dual_push_128_packed_parallel_pytorch_5k`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
  - added the detached dual-push `5K` runner chaining baseline train -> baseline eval sweep -> parallel train -> parallel eval sweep
- `openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push `128` train norm stats for the packed baseline config
- `openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push `128` train norm stats for the packed parallel config
- `README.md`
  - updated landing page to cover the dual-push screening study
- `REPORT.md`
  - updated full report to include dual-push setup, results, and artifact locations
The exact dual-push changed-file manifest is:
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
## Commands and run flow
The exact `10K` rerun commands are stored in:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
The `10K` execution flow was:
1. run the warm-start equivalence check
2. baseline `10K` train
3. baseline evals at `1000`, `2000`, `5000`, `10000`
4. parallel `10K` train
5. parallel evals at `1000`, `2000`, `5000`, `10000`
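The chained flow above can be sketched as a simple sequential driver. The entry points are the repo's script paths, but the command-line flags here are illustrative assumptions, not the runner's actual invocations:

```python
import subprocess

EVAL_STEPS = [1000, 2000, 5000, 10000]

def run(cmd: list) -> None:
    """Run one stage; check=True aborts the whole chain on the first failure."""
    subprocess.run(cmd, check=True)

def train_then_eval(config: str) -> None:
    """Train one config, then sweep evals over its saved checkpoints."""
    run(["python", "openpi/scripts/train_pytorch.py", "--config", config])
    for step in EVAL_STEPS:
        run(["python", "openpi/scripts/eval_twin_val_loss_pytorch.py",
             "--config", config, "--step", str(step)])

# step 1: warm-start equivalence check (its arguments are elided here)
# run(["python", "openpi/scripts/check_parallel_warmstart_equivalence.py"])
# steps 2-5: baseline first, then parallel
# train_then_eval("pi05_twin_handover_256_packed_baseline_pytorch_10k")
# train_then_eval("pi05_twin_handover_256_packed_parallel_pytorch_10k")
```

Running the whole chain detached (e.g. under `nohup`) is what makes the multi-hour pipeline survivable on a shared box, which is what the shell runner below provides.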
The detached runner was:
- `openpi/scripts/run_twin_handover_packed_10k.sh`
Main logs:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log`
## Startup sanity checks
Both `10K` runs loaded cleanly with:
- packed transforms active
- correct `32`-dim packed state/action tensors
- action-loss mask active on dims `[0:8]` and `[16:24]` (padding dims `[8:16]` and `[24:32]` excluded)
- exact zeros preserved in masked padded blocks
- `missing=0`
- `unexpected=0`
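The `missing=0` / `unexpected=0` condition corresponds to a clean non-strict state-dict load. A hedged sketch of how such a check is commonly written in PyTorch (this is not the exact openpi loader):

```python
import torch

def load_and_verify(model: torch.nn.Module, ckpt_path: str) -> None:
    """Load a checkpoint non-strictly and fail loudly if any keys mismatch."""
    state = torch.load(ckpt_path, map_location="cpu")
    result = model.load_state_dict(state, strict=False)
    # strict=False returns the mismatches instead of raising; assert both empty
    assert not result.missing_keys, f"missing={result.missing_keys}"
    assert not result.unexpected_keys, f"unexpected={result.unexpected_keys}"
```

Using `strict=False` plus explicit assertions gives the same guarantee as `strict=True` while letting the startup summary report the key lists on failure.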
Reference startup summary:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt`
Checkpoint sources:
- baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
- parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`
## Warm-start equivalence check
The `10K` study added an explicit step-0 numerical check:
- `input_projection_max_abs_diff = 0.00122881`
- `input_projection_mean_abs_diff = 0.00015435`
- `baseline_masked_loss = 1.00531137`
- `parallel_masked_loss = 1.00929189`
- `masked_loss_abs_diff = 0.00398052`
- `warmstart_equivalent = False`
Reference:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log`
Interpretation:
- the slice/fuse initialization is aligned by construction
- it is not numerically exact end-to-end on the same batch
- this weakens a strict “identical function at step 0” claim
- it does not invalidate the comparison as a matched warm-start study
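The shape of such a step-0 check can be sketched as below; the tensor names, tolerance, and interface are illustrative assumptions, not the actual `check_parallel_warmstart_equivalence.py` code:

```python
import torch

@torch.no_grad()
def warmstart_report(w_base: torch.Tensor, w_par: torch.Tensor,
                     loss_base: float, loss_par: float,
                     tol: float = 1e-5) -> dict:
    """Compare warm-started projection weights and same-batch masked losses."""
    diff = (w_base - w_par).abs()
    report = {
        "input_projection_max_abs_diff": diff.max().item(),
        "input_projection_mean_abs_diff": diff.mean().item(),
        "masked_loss_abs_diff": abs(loss_base - loss_par),
    }
    # "equivalent" only if both weights and end-to-end loss agree within tol
    report["warmstart_equivalent"] = (
        report["input_projection_max_abs_diff"] < tol
        and report["masked_loss_abs_diff"] < tol
    )
    return report
```

With max weight diffs around `1e-3`, any reasonable tolerance reports `warmstart_equivalent = False`, matching the logged outcome: aligned by construction, but not bitwise identical.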
Dual-push `5K` warm-start check:
- `input_projection_max_abs_diff = 0.00099802`
- `input_projection_mean_abs_diff = 0.00010568`
- `baseline_masked_loss = 1.43506372`
- `parallel_masked_loss = 1.52086782`
- `masked_loss_abs_diff = 0.08580410`
- `warmstart_equivalent = False`
Reference:
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log`
## Results
### Initial `2K` study
Teacher-forced validation loss:
| Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23GB` |
| Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27GB` |
### `10K` train checkpoints
Rank-0 train snapshots:
| Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
| --- | ---: | ---: | ---: | ---: |
| Baseline loss | `0.0228` | `0.0376` | `0.0202` | `0.0141` |
| Baseline smoothed | `0.0476` | `0.0273` | `0.0226` | `0.0172` |
| Parallel loss | `0.0211` | `0.0368` | `0.0212` | `0.0140` |
| Parallel smoothed | `0.0461` | `0.0259` | `0.0225` | `0.0169` |
Structured source:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
### `10K` teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.061130` | `0.059715` | `-0.001415` |
| `2000` | `0.041595` | `0.039947` | `-0.001648` |
| `5000` | `0.027324` | `0.027340` | `+0.000016` |
| `10000` | `0.022345` | `0.022168` | `-0.000177` |
The longer run still shows only a very small gap: the final delta is `-0.000177`, under `1%` of the absolute loss, and the sign even flips at `5000`. The models remain extremely close.
### `10K` final arm/joint/gripper breakdown
Teacher-forced validation at `10000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.022345` | `0.022168` |
| Left arm loss | `0.029659` | `0.030184` |
| Right arm loss | `0.015031` | `0.014151` |
| Left joint loss | `0.031507` | `0.032356` |
| Left gripper loss | `0.016725` | `0.014984` |
| Right joint loss | `0.015776` | `0.014888` |
| Right gripper loss | `0.009818` | `0.008996` |
| Left/right imbalance | `0.034067` | `0.033825` |
Interpretation:
- the parallel model’s small final advantage is mostly on the right-arm side
- the baseline is slightly better on left-arm joint loss
- the parallel model is slightly better on both grippers and on right-joint loss
- imbalance is nearly unchanged
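Given the packed layout, the joint/gripper breakdown follows from slicing the per-dimension loss; a sketch of just the slice structure (names are illustrative, and the exact definition of the imbalance metric is not restated here):

```python
import torch

# Packed layout: [L8, 0x8, R8, 0x8]; each 8-dim arm block is 7 joints + 1 gripper.
SLICES = {
    "left_joint": slice(0, 7),    "left_gripper": slice(7, 8),
    "right_joint": slice(16, 23), "right_gripper": slice(23, 24),
}

def arm_breakdown(per_dim_loss: torch.Tensor) -> dict:
    """per_dim_loss: [..., 32] per-dimension loss; returns per-slice means."""
    out = {k: per_dim_loss[..., s].mean().item() for k, s in SLICES.items()}
    out["left_arm"] = per_dim_loss[..., 0:8].mean().item()
    out["right_arm"] = per_dim_loss[..., 16:24].mean().item()
    return out
```

Because the padded dims `[8:16]` and `[24:32]` never enter any slice, the breakdown is unaffected by the padding blocks.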
### `10K` sample-based eval
Final fixed-subset sample eval at `10000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| 4-step masked MAE | `0.029935` | `0.029277` |
| 10-step masked MAE | `0.030294` | `0.030241` |
| 4-step left/right imbalance MAE | `0.033733` | `0.031629` |
| 10-step left/right imbalance MAE | `0.034582` | `0.032456` |
Interpretation:
- sample-based quality is effectively tied by the end
- the teacher-forced gap does not widen into a large inference-time separation
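The masked MAE used here can be sketched as below; this is an illustrative reconstruction from the layout, not the eval script itself (which additionally fixes the subset and sampling seed for determinism):

```python
import torch

def masked_mae(pred: torch.Tensor, target: torch.Tensor,
               mask: torch.Tensor) -> float:
    """MAE over active dims only. mask is a boolean [32] vector with dims
    [0:8] and [16:24] set, so the zero-padded blocks never enter the metric."""
    err = (pred - target).abs() * mask               # zero out padded dims
    n_active = mask.sum() * pred[..., 0].numel()     # active dims x leading elems
    return (err.sum() / n_active).item()
```

Normalizing by the count of active elements (rather than all `32` dims) keeps the metric comparable to a plain `16`-dim MAE.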
Structured sources:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
### Runtime and memory
| Stage | Duration |
| --- | ---: |
| Baseline train | `2:13:40` |
| Baseline eval sweep | `0:24:24` |
| Parallel train | `2:20:51` |
| Parallel eval sweep | `0:43:54` |
| Full `10K` pipeline | `5:48:33` |
Peak VRAM:
- baseline: `35.23GB`
- parallel: `35.27GB`
Reference:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv`
## Dual-push `128` screening results
### Teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.095597` | `0.093704` | `-0.001893` |
| `2000` | `0.083194` | `0.082729` | `-0.000465` |
| `5000` | `0.055958` | `0.055242` | `-0.000716` |
The screening signal is small but consistently favors the packed parallel model: the delta is negative at all three checkpoints.
### Dual-push arm breakdown
Teacher-forced validation at `5000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.055958` | `0.055242` |
| Left arm loss | `0.017725` | `0.017044` |
| Right arm loss | `0.094191` | `0.093439` |
| Left joint loss | `0.017577` | `0.017052` |
| Left gripper loss | `0.018765` | `0.016992` |
| Right joint loss | `0.103576` | `0.102856` |
| Right gripper loss | `0.028502` | `0.027523` |
| Left/right imbalance | `0.080993` | `0.081011` |
Interpretation:
- the small parallel advantage is visible on both arms
- the right arm remains much harder than the left on this task
- left/right imbalance is essentially unchanged
### Dual-push sample-based eval
Fixed-subset sample eval:
| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
| --- | --- | ---: | ---: |
| `1000` | baseline | `0.103199` | `0.108652` |
| `1000` | parallel | `0.101439` | `0.106874` |
| `2000` | baseline | `0.069732` | `0.074413` |
| `2000` | parallel | `0.069053` | `0.073501` |
| `5000` | baseline | `0.056830` | `0.058973` |
| `5000` | parallel | `0.054630` | `0.056627` |
Interpretation:
- the parallel model is also slightly better on fixed-subset inference-style eval
- unlike handover, the positive signal stays visible at `5K`
- the margin is still small enough that this remains a screening result, not a paper-final claim
### Dual-push runtime and memory
| Stage | Duration |
| --- | ---: |
| Baseline train | `1:05:25` |
| Baseline eval sweep | `0:14:34` |
| Parallel train | `1:00:33` |
| Parallel eval sweep | `0:14:39` |
| Full dual-push pipeline | `2:35:11` |
Peak VRAM:
- baseline: `35.23GB`
- parallel: `35.27GB`
## Artifact locations
### `2K` bundle
- `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
### `10K` bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
### Dual-push `5K` bundle
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
## Bottom line
The `10K` follow-up suggests the `2K` near-tie was not hiding a large later divergence.
- teacher-forced validation ends with a small parallel edge
- sample-based eval is essentially tied
- left/right imbalance does not materially change
- the main difference remains subtle rather than dramatic
The dual-push screening run adds a second signal:
- the packed parallel model is slightly better at `1K`, `2K`, and `5K`
- the same small advantage appears on both teacher-forced and sample-based eval
- the effect is still modest, but it is cleaner and more consistent than the handover result
So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.