
Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push

Scope

This repo now contains three completed studies:

  1. the initial 2K baseline-vs-parallel comparison
  2. the longer 10K follow-up with richer diagnostics
  3. a 5K dual-push 128 screening run on the same packed path

The handover runs used:

  • train repo lsnu/twin_handover_256_train
  • val repo lsnu/twin_handover_256_val
  • 4x H100 80GB
  • bfloat16
  • packed semantic layout [L8, 0x8, R8, 0x8]
  • active action-loss dims [0:8] and [16:24]
  • masked dims [8:16] and [24:32]
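
The active/masked split above amounts to averaging a per-dimension loss over the active slices only. The following is a minimal sketch under the assumption of a squared-error per-dimension loss; the helper names build_mask and masked_mse and the plain-list inputs are illustrative and are not taken from the repo.

```python
# Active slices from the run description; the masked blocks [8:16] and [24:32] are skipped.
ACTIVE_SLICES = [(0, 8), (16, 24)]
PACKED_DIM = 32


def build_mask(dim=PACKED_DIM, slices=ACTIVE_SLICES):
    """Return a 0/1 mask of length `dim` that is 1 on active action dims."""
    mask = [0.0] * dim
    for start, stop in slices:
        for i in range(start, stop):
            mask[i] = 1.0
    return mask


def masked_mse(pred, target, mask):
    """Squared error averaged over active dims only (hypothetical helper)."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / sum(mask)


mask = build_mask()
pred = [0.5] * PACKED_DIM
target = [0.0] * PACKED_DIM

# An arbitrarily large error placed in a masked dim does not change the loss.
noisy_pred = list(pred)
noisy_pred[8] = 100.0

clean_loss = masked_mse(pred, target, mask)
noisy_loss = masked_mse(noisy_pred, target, mask)
print(clean_loss, noisy_loss)
```

Dividing by the number of active dims (16) rather than by 32 keeps the loss scale comparable to an unpadded 16-dim model.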

Existing public 16-dim norm stats were reused. No raw-data reconversion was done.

The dual-push screening run used:

  • train repo lsnu/twin_dual_push_128_train
  • val repo lsnu/twin_dual_push_128_val
  • 4x H100 80GB
  • bfloat16
  • packed semantic layout [L8, 0x8, R8, 0x8]
  • active action-loss dims [0:8] and [16:24]
  • masked dims [8:16] and [24:32]
  • recomputed norm stats for the dual-push 128 train split

Data packing and masking

The TWIN converted state/action layout is [L8, R8], where each arm is 7 joints plus gripper. The packed transform path added for these runs preserves the left/right semantics inside a 32-dim model input:

[L8, R8] -> [L8, 0x8, R8, 0x8]

The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:

  • artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log
  • artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log
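
The packing step and the zero-padding check can be sketched as follows. This is an illustration of the layout described above, not the repo's transform code; pack_twin and padding_is_exact_zero are hypothetical names.

```python
ARM_DIM = 8  # 7 joints + gripper per arm


def pack_twin(vec16):
    """Map [L8, R8] to [L8, 0x8, R8, 0x8] (hypothetical helper name)."""
    if len(vec16) != 2 * ARM_DIM:
        raise ValueError("expected a 16-dim [L8, R8] vector")
    left, right = vec16[:ARM_DIM], vec16[ARM_DIM:]
    pad = [0.0] * ARM_DIM
    return list(left) + pad + list(right) + pad


def padding_is_exact_zero(vec32):
    """Check that both padded blocks [8:16] and [24:32] are exactly zero."""
    return all(v == 0.0 for v in vec32[8:16] + vec32[24:32])


raw = [float(i + 1) for i in range(16)]  # toy [L8, R8] vector
packed = pack_twin(raw)
print(len(packed), padding_is_exact_zero(packed))
```

The comparison uses exact equality on purpose: the batch inspection is about exact zeros, not values that are merely close to zero.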

Files changed or created

Initial 2K study

The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached 2K runner plumbing. The exact per-file list is in:

  • artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt

10K follow-up additions

The follow-up changed or added:

  • openpi/src/openpi/training/config.py
    • added pi05_twin_handover_256_packed_baseline_pytorch_10k
    • added pi05_twin_handover_256_packed_parallel_pytorch_10k
  • openpi/scripts/train_pytorch.py
    • added periodic per-module gradient buckets for baseline and parallel models
  • openpi/scripts/eval_twin_val_loss_pytorch.py
    • added left/right arm losses
    • added joint vs gripper losses
    • added left/right imbalance
    • added small deterministic sample_actions eval at num_steps={4,10}
  • openpi/scripts/check_parallel_warmstart_equivalence.py
    • added step-0 baseline-vs-parallel numerical check
  • openpi/scripts/run_twin_handover_packed_10k.sh
    • added detached 10K train/eval chain
  • openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json
    • copied existing norm stats for the 10K baseline config
  • openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json
    • copied existing norm stats for the 10K parallel config
  • README.md
    • updated repo landing page to cover both studies
  • REPORT.md
    • updated full report to cover both studies

The exact 10K changed-file manifest is:

  • artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
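
Per-module gradient buckets of the kind listed for train_pytorch.py can be built by grouping per-parameter gradient norms by name prefix. The sketch below is an assumption about the general approach, not the script's actual code: the bucket prefixes and parameter names are hypothetical, and plain floats stand in for each parameter's gradient L2 norm.

```python
import math
from collections import defaultdict


def bucket_grad_norms(param_grad_norms, prefixes):
    """Combine per-parameter L2 grad norms into per-bucket L2 norms.

    param_grad_norms: dict of parameter name -> L2 norm of its gradient.
    prefixes: bucket prefixes; the first matching prefix wins, else "other".
    """
    sq_sums = defaultdict(float)
    for name, norm in param_grad_norms.items():
        bucket = next((p for p in prefixes if name.startswith(p)), "other")
        sq_sums[bucket] += norm ** 2
    # The L2 norm of a concatenation is the sqrt of the summed squared norms.
    return {bucket: math.sqrt(total) for bucket, total in sq_sums.items()}


# Hypothetical parameter names, for illustration only.
norms = {
    "action_in_proj.weight": 3.0,
    "action_in_proj.bias": 4.0,
    "action_out_proj.weight": 1.0,
    "backbone.layer0.weight": 2.0,
}
buckets = bucket_grad_norms(norms, ["action_in_proj", "action_out_proj"])
print(buckets)
```

Logging the same buckets for the baseline and parallel models makes it possible to compare where gradient mass sits without dumping every parameter.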

Dual-push 5K screening additions

The dual-push screening run added or updated:

  • openpi/src/openpi/training/config.py
    • added pi05_twin_dual_push_128_packed_baseline_pytorch_5k
    • added pi05_twin_dual_push_128_packed_parallel_pytorch_5k
  • openpi/scripts/run_twin_dual_push_128_packed_5k.sh
    • added detached dual-push 5K baseline->eval sweep->parallel->eval sweep runner
  • openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json
    • computed dual-push 128 train norm stats for the packed baseline config
  • openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json
    • computed dual-push 128 train norm stats for the packed parallel config
  • README.md
    • updated landing page to cover the dual-push screening study
  • REPORT.md
    • updated full report to include dual-push setup, results, and artifact locations

The exact dual-push changed-file manifest is:

  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt

Commands and run flow

The exact 10K rerun commands are stored in:

  • artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh

The 10K execution flow was:

  1. run the warm-start equivalence check
  2. baseline 10K train
  3. baseline evals at 1000, 2000, 5000, 10000
  4. parallel 10K train
  5. parallel evals at 1000, 2000, 5000, 10000
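
The ordering above can be written down as a small stage plan. This is only a sketch of the sequence; the stage labels are illustrative and do not correspond to actual commands in run_twin_handover_packed_10k.sh.

```python
EVAL_STEPS = [1000, 2000, 5000, 10000]


def build_stage_plan(eval_steps=EVAL_STEPS):
    """Return ordered stage labels for the 10K chain (labels are illustrative)."""
    plan = ["warmstart_equivalence_check"]
    for model in ("baseline", "parallel"):
        # Each model trains to 10K first, then its checkpoints are evaluated.
        plan.append(f"{model}_train_10k")
        plan.extend(f"{model}_eval_step_{step}" for step in eval_steps)
    return plan


plan = build_stage_plan()
for index, stage in enumerate(plan, start=1):
    print(index, stage)
```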

The detached runner was:

  • openpi/scripts/run_twin_handover_packed_10k.sh

Main logs:

  • artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log
  • artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log
  • artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log

Startup sanity checks

Both 10K runs loaded cleanly with:

  • packed transforms active
  • correct 32-dim packed state/action tensors
  • mask active on [0:8] and [16:24]
  • exact zeros preserved in masked padded blocks
  • missing=0
  • unexpected=0
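
Reading missing=0 and unexpected=0 as counts of missing and unexpected checkpoint keys (an assumption about these log fields), the check is a set difference between the model's parameter names and the checkpoint's keys. PyTorch reports both lists when a state dict is loaded; the sketch below reproduces the comparison without PyTorch, using hypothetical key names.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists for a checkpoint load.

    missing: keys the model expects but the checkpoint lacks.
    unexpected: keys the checkpoint has but the model does not use.
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical key names, for illustration only.
model_keys = ["action_in_proj.weight", "action_in_proj.bias", "action_out_proj.weight"]

clean_missing, clean_unexpected = compare_keys(model_keys, model_keys)
stale_missing, stale_unexpected = compare_keys(
    model_keys, ["action_in_proj.weight", "old_head.weight"]
)
print(len(clean_missing), len(clean_unexpected))
print(stale_missing, stale_unexpected)
```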

Reference startup summary:

  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt

Checkpoint sources:

  • baseline: /workspace/checkpoints/pi05_base_single_pytorch
  • parallel: /workspace/checkpoints/pi05_base_parallel_packed_from_single

Warm-start equivalence check

The 10K study added an explicit step-0 numerical check:

  • input_projection_max_abs_diff = 0.00122881
  • input_projection_mean_abs_diff = 0.00015435
  • baseline_masked_loss = 1.00531137
  • parallel_masked_loss = 1.00929189
  • masked_loss_abs_diff = 0.00398052
  • warmstart_equivalent = False

Reference:

  • artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log

Interpretation:

  • the slice/fuse initialization is aligned by construction
  • it is not numerically exact end-to-end on the same batch
  • this weakens a strict “identical function at step 0” claim
  • it does not invalidate the comparison as a matched warm-start study
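
The reported fields can be derived from two sets of step-0 outputs as sketched below. The tolerance atol is a hypothetical threshold, since the report does not state the one used by check_parallel_warmstart_equivalence.py, and the projection values are toy numbers. Only the two masked losses are taken from the 10K log above.

```python
def warmstart_report(baseline_proj, parallel_proj, baseline_loss, parallel_loss, atol=1e-6):
    """Summarize step-0 agreement between two models on the same batch."""
    diffs = [abs(a - b) for a, b in zip(baseline_proj, parallel_proj)]
    max_diff = max(diffs)
    loss_diff = abs(baseline_loss - parallel_loss)
    return {
        "input_projection_max_abs_diff": max_diff,
        "input_projection_mean_abs_diff": sum(diffs) / len(diffs),
        "masked_loss_abs_diff": loss_diff,
        # Equivalent only if both the projections and the loss agree within atol.
        "warmstart_equivalent": max_diff <= atol and loss_diff <= atol,
    }


report = warmstart_report(
    baseline_proj=[0.10, -0.20, 0.30],   # toy values
    parallel_proj=[0.10, -0.199, 0.302],  # toy values
    baseline_loss=1.00531137,             # from the 10K check above
    parallel_loss=1.00929189,             # from the 10K check above
)
for key, value in report.items():
    print(key, value)
```

With any tolerance tighter than the observed differences, the flag comes out False, which matches the interpretation that the two initializations are aligned but not bit-identical.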

Dual-push 5K warm-start check (same metrics; the step-0 masked-loss gap is larger here than on handover, 0.0858 vs 0.0040):

  • input_projection_max_abs_diff = 0.00099802
  • input_projection_mean_abs_diff = 0.00010568
  • baseline_masked_loss = 1.43506372
  • parallel_masked_loss = 1.52086782
  • masked_loss_abs_diff = 0.08580410
  • warmstart_equivalent = False

Reference:

  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log

Results

Initial 2K study

Teacher-forced validation loss:

| Model | Val @ 1000 | Val @ 2000 | Train runtime (mm:ss) | Peak VRAM |
| --- | --- | --- | --- | --- |
| Packed baseline | 0.052885 | 0.035776 | 33:27 | 35.23GB |
| Packed parallel | 0.051214 | 0.035680 | 30:38 | 35.27GB |

10K train checkpoints

Rank-0 train snapshots:

| Series | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
| --- | --- | --- | --- | --- |
| Baseline loss | 0.0228 | 0.0376 | 0.0202 | 0.0141 |
| Baseline smoothed | 0.0476 | 0.0273 | 0.0226 | 0.0172 |
| Parallel loss | 0.0211 | 0.0368 | 0.0212 | 0.0140 |
| Parallel smoothed | 0.0461 | 0.0259 | 0.0225 | 0.0169 |

Structured source:

  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv

10K teacher-forced validation

| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | --- | --- | --- |
| 1000 | 0.061130 | 0.059715 | -0.001415 |
| 2000 | 0.041595 | 0.039947 | -0.001648 |
| 5000 | 0.027324 | 0.027340 | +0.000016 |
| 10000 | 0.022345 | 0.022168 | -0.000177 |

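The longer run still shows only a very small gap: the absolute difference never exceeds 0.0017, and the sign briefly flips in the baseline's favor at 5000. The models remain extremely close.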
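
The delta column can be recomputed directly from the two loss columns. The values below are copied from the 10K teacher-forced validation table; a negative delta means the parallel model has the lower loss.

```python
# checkpoint -> (baseline val loss, parallel val loss), from the table above
VAL_LOSS = {
    1000: (0.061130, 0.059715),
    2000: (0.041595, 0.039947),
    5000: (0.027324, 0.027340),
    10000: (0.022345, 0.022168),
}

# Delta is parallel minus baseline, rounded to the table's 6 decimals.
deltas = {step: round(par - base, 6) for step, (base, par) in VAL_LOSS.items()}
largest_gap = max(abs(d) for d in deltas.values())
parallel_wins = [step for step, d in deltas.items() if d < 0]

for step, delta in deltas.items():
    print(f"{step:>5}  delta={delta:+.6f}")
print("largest |delta|:", largest_gap)
print("checkpoints where parallel is lower:", parallel_wins)
```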

10K final arm/joint/gripper breakdown

Teacher-forced validation at 10000:

| Metric | Baseline | Parallel |
| --- | --- | --- |
| Mean val loss | 0.022345 | 0.022168 |
| Left arm loss | 0.029659 | 0.030184 |
| Right arm loss | 0.015031 | 0.014151 |
| Left joint loss | 0.031507 | 0.032356 |
| Left gripper loss | 0.016725 | 0.014984 |
| Right joint loss | 0.015776 | 0.014888 |
| Right gripper loss | 0.009818 | 0.008996 |
| Left/right imbalance | 0.034067 | 0.033825 |

Interpretation:

  • the parallel model’s small final advantage is mostly on the right-arm side
  • the baseline is slightly better on left-arm joint loss
  • the parallel model is slightly better on both grippers and on right-joint loss
  • imbalance is nearly unchanged
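
The reported breakdown is numerically consistent with each arm loss being the mean over that arm's 8 dims (7 joints plus gripper) and the mean validation loss being the average of the two arm losses. This is an observation about the numbers in the table, not a statement about how eval_twin_val_loss_pytorch.py is written. The check below uses a tolerance that only allows for the table's 6-decimal rounding.

```python
REPORTED = {
    "baseline": {
        "mean": 0.022345, "left_arm": 0.029659, "right_arm": 0.015031,
        "left_joint": 0.031507, "left_gripper": 0.016725,
        "right_joint": 0.015776, "right_gripper": 0.009818,
    },
    "parallel": {
        "mean": 0.022168, "left_arm": 0.030184, "right_arm": 0.014151,
        "left_joint": 0.032356, "left_gripper": 0.014984,
        "right_joint": 0.014888, "right_gripper": 0.008996,
    },
}


def arm_loss(joint_loss, gripper_loss, n_joints=7):
    """Mean over one arm's dims: n_joints joint dims plus one gripper dim."""
    return (n_joints * joint_loss + gripper_loss) / (n_joints + 1)


def is_consistent(row, tol=2e-6):
    """True if arm and mean losses match the joint/gripper values within rounding."""
    left = arm_loss(row["left_joint"], row["left_gripper"])
    right = arm_loss(row["right_joint"], row["right_gripper"])
    mean = (row["left_arm"] + row["right_arm"]) / 2
    return (
        abs(left - row["left_arm"]) <= tol
        and abs(right - row["right_arm"]) <= tol
        and abs(mean - row["mean"]) <= tol
    )


consistency = {name: is_consistent(row) for name, row in REPORTED.items()}
print(consistency)
```

Because joints carry 7 of the 8 dims per arm, a change in joint loss moves the arm loss much more than the same change in gripper loss, which is why the baseline still leads on the left arm despite the parallel model's better left gripper.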

10K sample-based eval

Final fixed-subset sample eval at 10000:

| Metric | Baseline | Parallel |
| --- | --- | --- |
| 4-step masked MAE | 0.029935 | 0.029277 |
| 10-step masked MAE | 0.030294 | 0.030241 |
| 4-step left/right imbalance MAE | 0.033733 | 0.031629 |
| 10-step left/right imbalance MAE | 0.034582 | 0.032456 |

Interpretation:

  • sample-based quality is effectively tied by the end
  • the teacher-forced gap does not widen into a large inference-time separation
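
A masked MAE of this kind can be read as the mean absolute error between sampled and ground-truth actions over the 16 active dims only. The sketch below works under that assumption; the function name and toy inputs are illustrative, and the sampling step itself (sample_actions at num_steps 4 or 10) is not reproduced here.

```python
# Active action dims in the packed layout: [0:8] and [16:24].
ACTIVE = list(range(0, 8)) + list(range(16, 24))


def masked_mae(pred_actions, true_actions, active=ACTIVE):
    """Mean absolute error over active dims; inputs are lists of 32-dim vectors."""
    errors = [
        abs(pred[i] - true[i])
        for pred, true in zip(pred_actions, true_actions)
        for i in active
    ]
    return sum(errors) / len(errors)


# Toy two-step action chunk: errors of 0.5 and 0.25 on every dim.
true_chunk = [[0.0] * 32, [0.0] * 32]
pred_chunk = [[0.5] * 32, [0.25] * 32]
pred_chunk[0][8] = 9.0  # large error in a masked dim, ignored by the metric

mae = masked_mae(pred_chunk, true_chunk)
print(mae)
```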

Structured sources:

  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv
  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv
  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv

Runtime and memory

| Stage | Duration (h:mm:ss) |
| --- | --- |
| Baseline train | 2:13:40 |
| Baseline eval sweep | 0:24:24 |
| Parallel train | 2:20:51 |
| Parallel eval sweep | 0:43:54 |
| Full 10K pipeline | 5:48:33 |

Peak VRAM:

  • baseline: 35.23GB
  • parallel: 35.27GB

Reference:

  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv

Dual-push 128 screening results

Teacher-forced validation

| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | --- | --- | --- |
| 1000 | 0.095597 | 0.093704 | -0.001893 |
| 2000 | 0.083194 | 0.082729 | -0.000465 |
| 5000 | 0.055958 | 0.055242 | -0.000716 |

The screening signal is small but consistently favors the packed parallel model: its validation loss is lower than the baseline's at all three checkpoints.

Dual-push arm breakdown

Teacher-forced validation at 5000:

| Metric | Baseline | Parallel |
| --- | --- | --- |
| Mean val loss | 0.055958 | 0.055242 |
| Left arm loss | 0.017725 | 0.017044 |
| Right arm loss | 0.094191 | 0.093439 |
| Left joint loss | 0.017577 | 0.017052 |
| Left gripper loss | 0.018765 | 0.016992 |
| Right joint loss | 0.103576 | 0.102856 |
| Right gripper loss | 0.028502 | 0.027523 |
| Left/right imbalance | 0.080993 | 0.081011 |

Interpretation:

  • the small parallel advantage is visible on both arms
  • the right arm remains much harder than the left on this task
  • left/right imbalance is essentially unchanged

Dual-push sample-based eval

Fixed-subset sample eval:

| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
| --- | --- | --- | --- |
| 1000 | baseline | 0.103199 | 0.108652 |
| 1000 | parallel | 0.101439 | 0.106874 |
| 2000 | baseline | 0.069732 | 0.074413 |
| 2000 | parallel | 0.069053 | 0.073501 |
| 5000 | baseline | 0.056830 | 0.058973 |
| 5000 | parallel | 0.054630 | 0.056627 |

Interpretation:

  • the parallel model is also slightly better on fixed-subset inference-style eval
  • unlike handover, the positive signal stays visible at 5K
  • the margin is still small enough that this remains a screening result, not a paper-final claim
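
To put the margin in proportion, the sample-eval numbers can be converted to relative differences against the baseline. The values are copied from the fixed-subset table above; the helper name is illustrative.

```python
# checkpoint -> model -> (4-step masked MAE, 10-step masked MAE)
SAMPLE_EVAL = {
    1000: {"baseline": (0.103199, 0.108652), "parallel": (0.101439, 0.106874)},
    2000: {"baseline": (0.069732, 0.074413), "parallel": (0.069053, 0.073501)},
    5000: {"baseline": (0.056830, 0.058973), "parallel": (0.054630, 0.056627)},
}


def relative_margin(baseline, parallel):
    """Signed relative difference; negative means the parallel model is lower."""
    return (parallel - baseline) / baseline


margins = {
    step: tuple(
        relative_margin(b, p)
        for b, p in zip(rows["baseline"], rows["parallel"])
    )
    for step, rows in SAMPLE_EVAL.items()
}

for step, (four_step, ten_step) in margins.items():
    print(f"{step:>5}  4-step={four_step:+.2%}  10-step={ten_step:+.2%}")
```

Every margin is negative and stays below 4% of the baseline MAE in magnitude, which is in line with treating this as a screening signal.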

Dual-push runtime and memory

| Stage | Duration (h:mm:ss) |
| --- | --- |
| Baseline train | 1:05:25 |
| Baseline eval sweep | 0:14:34 |
| Parallel train | 1:00:33 |
| Parallel eval sweep | 0:14:39 |
| Full dual-push pipeline | 2:35:11 |
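
The stage durations in both runtime tables can be summed and compared with the reported pipeline totals. The durations below are copied from the tables; the helper names are illustrative.

```python
def to_seconds(hms):
    """Parse an h:mm:ss string into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Format seconds as h:mm:ss."""
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Baseline train, baseline eval sweep, parallel train, parallel eval sweep.
DUAL_PUSH_STAGES = ["1:05:25", "0:14:34", "1:00:33", "0:14:39"]
HANDOVER_10K_STAGES = ["2:13:40", "0:24:24", "2:20:51", "0:43:54"]

dual_push_sum = to_hms(sum(to_seconds(s) for s in DUAL_PUSH_STAGES))
handover_sum = to_hms(sum(to_seconds(s) for s in HANDOVER_10K_STAGES))
handover_gap = to_hms(
    to_seconds("5:48:33") - sum(to_seconds(s) for s in HANDOVER_10K_STAGES)
)

print("dual-push stages sum:", dual_push_sum)
print("handover 10K stages sum:", handover_sum)
print("handover 10K unattributed time:", handover_gap)
```

The four dual-push stages add up exactly to the reported 2:35:11. The four handover 10K stages add up to 5:42:49, so the reported 5:48:33 includes 0:05:44 that the table does not attribute to any of those four stages.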

Peak VRAM:

  • baseline: 35.23GB
  • parallel: 35.27GB

Artifact locations

2K bundle

  • artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json
  • artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh
  • artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt

10K bundle

  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json
  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv
  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv
  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv
  • artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv
  • artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh
  • artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
  • artifacts/twin_handover_packed_parallelization_10k_20260309/environment/

Dual-push 5K bundle

  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt
  • artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/

Bottom line

The 10K follow-up suggests the 2K near-tie was not hiding a large later divergence.

  • teacher-forced validation ends with a small parallel edge
  • sample-based eval is essentially tied
  • left/right imbalance does not materially change
  • the main difference remains subtle rather than dramatic

The dual-push screening run adds a second signal:

  • the packed parallel model is slightly better at 1K, 2K, and 5K
  • the same small advantage appears on both teacher-forced and sample-based eval
  • the effect is still modest, but it is cleaner and more consistent than handover

So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.