# Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push

## Scope

This repo now contains three completed studies:

1. the initial `2K` baseline-vs-parallel comparison
2. the longer `10K` follow-up with richer diagnostics
3. a `5K` dual-push `128` screening run on the same packed path

The handover runs used:

- train repo `lsnu/twin_handover_256_train`
- val repo `lsnu/twin_handover_256_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`

Existing public `16`-dim norm stats were reused. No raw-data reconversion was done.

The dual-push screening run used:

- train repo `lsnu/twin_dual_push_128_train`
- val repo `lsnu/twin_dual_push_128_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
- recomputed norm stats for the dual-push `128` train split

## Data packing and masking

The TWIN converted state/action layout is `[L8, R8]`, where each arm is `7` joints plus a gripper. The packed transform path added for these runs preserves the left/right semantics inside a `32`-dim model input:

```text
[L8, R8] -> [L8, 0x8, R8, 0x8]
```

Batch inspection confirmed exact zero padding in the masked blocks. Reference logs:

- `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`

## Files changed or created

### Initial `2K` study

The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached `2K` runner plumbing.
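As a reference for the packing and masking described above, here is a minimal sketch of the `[L8, R8] -> [L8, 0x8, R8, 0x8]` transform and the active-dim loss mask. The helper names (`pack_lr`, `LOSS_MASK`, `masked_mse`) and the plain-MSE loss form are illustrative assumptions, not the repo's actual implementation:

```python
import numpy as np

def pack_lr(actions_16: np.ndarray) -> np.ndarray:
    """Pack a [..., 16] tensor laid out as [L8, R8] into [L8, 0x8, R8, 0x8]."""
    *batch, dim = actions_16.shape
    assert dim == 16, "expected [L8, R8] input"
    packed = np.zeros((*batch, 32), dtype=actions_16.dtype)
    packed[..., 0:8] = actions_16[..., 0:8]     # left arm: 7 joints + gripper
    packed[..., 16:24] = actions_16[..., 8:16]  # right arm: 7 joints + gripper
    return packed                               # dims [8:16] and [24:32] stay exact zeros

# Action-loss mask: active on [0:8] and [16:24], off on the padded blocks.
LOSS_MASK = np.zeros(32, dtype=bool)
LOSS_MASK[0:8] = True
LOSS_MASK[16:24] = True

def masked_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error restricted to the active dims."""
    err = (pred - target) ** 2
    return float(err[..., LOSS_MASK].mean())
```

The actual training objective in a pi0.5-style model is a flow-matching loss rather than plain MSE; only the dim-masking pattern is the point here.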
The exact per-file list is in:

- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

### `10K` follow-up additions

The follow-up changed or added:

- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
  - added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
- `openpi/scripts/train_pytorch.py`
  - added periodic per-module gradient buckets for baseline and parallel models
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
  - added left/right arm losses
  - added joint vs gripper losses
  - added left/right imbalance
  - added small deterministic `sample_actions` eval at `num_steps={4,10}`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
  - added step-0 baseline-vs-parallel numerical check
- `openpi/scripts/run_twin_handover_packed_10k.sh`
  - added detached `10K` train/eval chain
- `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the `10K` baseline config
- `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the `10K` parallel config
- `README.md`
  - updated repo landing page to cover both studies
- `REPORT.md`
  - updated full report to cover both studies

The exact `10K` changed-file manifest is:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`

### Dual-push `5K` screening additions

The dual-push screening run added or updated:

- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_dual_push_128_packed_baseline_pytorch_5k`
  - added `pi05_twin_dual_push_128_packed_parallel_pytorch_5k`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
  - added detached dual-push `5K` baseline -> eval sweep -> parallel -> eval sweep runner
- `openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push `128` train
    norm stats for the packed baseline config
- `openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push `128` train norm stats for the packed parallel config
- `README.md`
  - updated landing page to cover the dual-push screening study
- `REPORT.md`
  - updated full report to include dual-push setup, results, and artifact locations

The exact dual-push changed-file manifest is:

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`

## Commands and run flow

The exact `10K` rerun commands are stored in:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`

The `10K` execution flow was:

1. run the warm-start equivalence check
2. baseline `10K` train
3. baseline evals at `1000`, `2000`, `5000`, `10000`
4. parallel `10K` train
5. parallel evals at `1000`, `2000`, `5000`, `10000`

The detached runner was:

- `openpi/scripts/run_twin_handover_packed_10k.sh`

Main logs:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log`

## Startup sanity checks

Both `10K` runs loaded cleanly with:

- packed transforms active
- correct `32`-dim packed state/action tensors
- mask active on `[0:8]` and `[16:24]`
- exact zeros preserved in masked padded blocks
- `missing=0`
- `unexpected=0`

Reference startup summary:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt`

Checkpoint sources:

- baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
- parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`

## Warm-start equivalence check

The `10K` study added an explicit step-0 numerical check:

- `input_projection_max_abs_diff = 0.00122881`
- `input_projection_mean_abs_diff = 0.00015435`
- `baseline_masked_loss = 1.00531137`
- `parallel_masked_loss = 1.00929189`
- `masked_loss_abs_diff = 0.00398052`
- `warmstart_equivalent = False`

Reference:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log`

Interpretation:

- the slice/fuse initialization is aligned by construction
- it is not numerically exact end-to-end on the same batch
- this weakens a strict "identical function at step 0" claim
- it does not invalidate the comparison as a matched warm-start study

Dual-push `5K` warm-start check:

- `input_projection_max_abs_diff = 0.00099802`
- `input_projection_mean_abs_diff = 0.00010568`
- `baseline_masked_loss = 1.43506372`
- `parallel_masked_loss = 1.52086782`
- `masked_loss_abs_diff = 0.08580410`
- `warmstart_equivalent = False`

Reference:

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log`

## Results

### Initial `2K` study

Teacher-forced validation loss:

| Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23GB` |
| Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27GB` |

### `10K` train checkpoints

Rank-0 train snapshots:

| Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
| --- | ---: | ---: | ---: | ---: |
| Baseline loss | `0.0228` | `0.0376` | `0.0202` | `0.0141` |
| Baseline smoothed | `0.0476` | `0.0273` | `0.0226` | `0.0172` |
| Parallel loss | `0.0211` | `0.0368` | `0.0212` | `0.0140` |
| Parallel smoothed | `0.0461` | `0.0259` | `0.0225` | `0.0169` |

Structured source:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`

### `10K` teacher-forced validation

| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.061130` | `0.059715` | `-0.001415` |
| `2000` | `0.041595` | `0.039947` | `-0.001648` |
| `5000` | `0.027324` | `0.027340` | `+0.000016` |
| `10000` | `0.022345` | `0.022168` | `-0.000177` |

The longer run still shows only a very small gap. The models remain extremely close.

### `10K` final arm/joint/gripper breakdown

Teacher-forced validation at `10000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.022345` | `0.022168` |
| Left arm loss | `0.029659` | `0.030184` |
| Right arm loss | `0.015031` | `0.014151` |
| Left joint loss | `0.031507` | `0.032356` |
| Left gripper loss | `0.016725` | `0.014984` |
| Right joint loss | `0.015776` | `0.014888` |
| Right gripper loss | `0.009818` | `0.008996` |
| Left/right imbalance | `0.034067` | `0.033825` |

Interpretation:

- the parallel model's small final advantage is mostly on the right-arm side
- the baseline is slightly better on left-arm joint loss
- the parallel model is slightly better on both grippers and on right-joint loss
- imbalance is nearly unchanged

### `10K` sample-based eval

Final fixed-subset sample eval at `10000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| 4-step masked MAE | `0.029935` | `0.029277` |
| 10-step masked MAE | `0.030294` | `0.030241` |
| 4-step left/right imbalance MAE | `0.033733` | `0.031629` |
| 10-step left/right imbalance MAE | `0.034582` | `0.032456` |

Interpretation:

- sample-based quality is effectively tied by the end
- the teacher-forced gap does not widen into a large inference-time separation

Structured sources:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`

### Runtime and memory

| Stage | Duration |
| --- | ---: |
| Baseline train | `2:13:40` |
| Baseline eval sweep | `0:24:24` |
| Parallel train | `2:20:51` |
| Parallel eval sweep | `0:43:54` |
| Full `10K` pipeline | `5:48:33` |

Peak VRAM:

- baseline: `35.23GB`
- parallel: `35.27GB`

Reference:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv`

## Dual-push `128` screening results

### Teacher-forced validation

| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.095597` | `0.093704` | `-0.001893` |
| `2000` | `0.083194` | `0.082729` | `-0.000465` |
| `5000` | `0.055958` | `0.055242` | `-0.000716` |

The screening signal is small but consistently positive for the packed parallel model at all three checkpoints.

### Dual-push arm breakdown

Teacher-forced validation at `5000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.055958` | `0.055242` |
| Left arm loss | `0.017725` | `0.017044` |
| Right arm loss | `0.094191` | `0.093439` |
| Left joint loss | `0.017577` | `0.017052` |
| Left gripper loss | `0.018765` | `0.016992` |
| Right joint loss | `0.103576` | `0.102856` |
| Right gripper loss | `0.028502` | `0.027523` |
| Left/right imbalance | `0.080993` | `0.081011` |

Interpretation:

- the small parallel advantage is visible on both arms
- the right arm remains much harder than the left on this task
- left/right imbalance is essentially unchanged

### Dual-push sample-based eval

Fixed-subset sample eval:

| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
| --- | --- | ---: | ---: |
| `1000` | baseline | `0.103199` | `0.108652` |
| `1000` | parallel | `0.101439` | `0.106874` |
| `2000` | baseline | `0.069732` | `0.074413` |
| `2000` | parallel | `0.069053` | `0.073501` |
| `5000` | baseline | `0.056830` | `0.058973` |
| `5000` | parallel | `0.054630` | `0.056627` |

Interpretation:

- the parallel model is also slightly better on fixed-subset inference-style eval
- unlike handover, the positive signal stays visible at `5K`
- the margin is still small enough that this remains a screening result, not a paper-final
  claim

### Dual-push runtime and memory

| Stage | Duration |
| --- | ---: |
| Baseline train | `1:05:25` |
| Baseline eval sweep | `0:14:34` |
| Parallel train | `1:00:33` |
| Parallel eval sweep | `0:14:39` |
| Full dual-push pipeline | `2:35:11` |

Peak VRAM:

- baseline: `35.23GB`
- parallel: `35.27GB`

## Artifact locations

### `2K` bundle

- `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

### `10K` bundle

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`

### Dual-push `5K` bundle

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`

## Bottom line

The `10K` follow-up suggests the `2K` near-tie was not hiding a large later divergence:

- teacher-forced validation ends with a small parallel edge
- sample-based eval is essentially tied
- left/right imbalance does not materially change
- the main difference remains subtle rather than dramatic

The dual-push screening run adds a second signal:

- the packed parallel model is slightly better at `1K`, `2K`, and `5K`
- the same small advantage appears on both teacher-forced and sample-based eval
- the effect is still modest, but it is cleaner and more consistent than handover

So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.