Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push
Scope
This repo now contains three completed studies:
- the initial 2K baseline-vs-parallel comparison
- the longer 10K follow-up with richer diagnostics
- a 5K dual-push 128 screening run on the same packed path
The handover runs used:
- train repo `lsnu/twin_handover_256_train`
- val repo `lsnu/twin_handover_256_val`
- 4x H100 80GB, bfloat16
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
Existing public 16-dim norm stats were reused. No raw-data reconversion was done.
The dual-push screening run used:
- train repo `lsnu/twin_dual_push_128_train`
- val repo `lsnu/twin_dual_push_128_val`
- 4x H100 80GB, bfloat16
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
- recomputed norm stats for the dual-push 128 train split
Data packing and masking
The TWIN converted state/action layout is [L8, R8], where each arm is 7 joints plus gripper. The packed transform path added for these runs preserves the left/right semantics inside a 32-dim model input:
[L8, R8] -> [L8, 0x8, R8, 0x8]
The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:
artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log
artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log
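The packed layout and loss mask described above can be sketched as follows. This is a minimal illustration; the function and constant names are hypothetical, not the repo's actual transform API:

```python
import numpy as np

# Illustrative sketch of the [L8, R8] -> [L8, 0x8, R8, 0x8] packing;
# names here are hypothetical, not the repo's actual transform API.
LEFT = slice(0, 8)          # active left-arm block
LEFT_PAD = slice(8, 16)     # zero padding, masked out of the loss
RIGHT = slice(16, 24)       # active right-arm block
RIGHT_PAD = slice(24, 32)   # zero padding, masked out of the loss

def pack_twin(vec16: np.ndarray) -> np.ndarray:
    """Map a converted [L8, R8] vector to the packed 32-dim layout."""
    packed = np.zeros(vec16.shape[:-1] + (32,), dtype=vec16.dtype)
    packed[..., LEFT] = vec16[..., :8]
    packed[..., RIGHT] = vec16[..., 8:]
    return packed

def action_loss_mask() -> np.ndarray:
    """Boolean mask selecting only the active dims [0:8] and [16:24]."""
    mask = np.zeros(32, dtype=bool)
    mask[LEFT] = True
    mask[RIGHT] = True
    return mask
```

The batch inspection referenced above amounts to asserting that `packed[..., LEFT_PAD]` and `packed[..., RIGHT_PAD]` are exactly zero.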
Files changed or created
Initial 2K study
The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached 2K runner plumbing. The exact per-file list is in:
artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt
10K follow-up additions
The follow-up changed or added:
- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
  - added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
- `openpi/scripts/train_pytorch.py`
  - added periodic per-module gradient buckets for baseline and parallel models
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
  - added left/right arm losses
  - added joint vs gripper losses
  - added left/right imbalance
  - added a small deterministic `sample_actions` eval at `num_steps={4,10}`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
  - added a step-0 baseline-vs-parallel numerical check
- `openpi/scripts/run_twin_handover_packed_10k.sh`
  - added the detached 10K train/eval chain
- `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the 10K baseline config
- `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the 10K parallel config
- `README.md`
  - updated the repo landing page to cover both studies
- `REPORT.md`
  - updated the full report to cover both studies
The exact 10K changed-file manifest is:
artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
Dual-push 5K screening additions
The dual-push screening run added or updated:
- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_dual_push_128_packed_baseline_pytorch_5k`
  - added `pi05_twin_dual_push_128_packed_parallel_pytorch_5k`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
  - added the detached dual-push 5K baseline -> eval sweep -> parallel -> eval sweep runner
- `openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push 128 train norm stats for the packed baseline config
- `openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push 128 train norm stats for the packed parallel config
- `README.md`
  - updated the landing page to cover the dual-push screening study
- `REPORT.md`
  - updated the full report to include dual-push setup, results, and artifact locations
The exact dual-push changed-file manifest is:
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt
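The recomputed norm stats amount to per-dimension statistics over the new train split. A minimal sketch, assuming a simple mean/std schema; the repo's actual `norm_stats.json` format may differ:

```python
import json
import numpy as np

def compute_norm_stats(actions: np.ndarray) -> dict:
    """Per-dimension mean/std over a (num_samples, action_dim) train split.

    The key names below are illustrative and may not match the repo's
    norm_stats.json schema.
    """
    return {
        "mean": actions.mean(axis=0).tolist(),
        "std": (actions.std(axis=0) + 1e-6).tolist(),  # epsilon avoids div-by-zero
    }

if __name__ == "__main__":
    # Demo on fake 16-dim actions standing in for the dual-push 128 split.
    rng = np.random.default_rng(0)
    fake_actions = rng.normal(size=(1000, 16))
    print(json.dumps({"actions": compute_norm_stats(fake_actions)})[:60], "...")
```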
Commands and run flow
The exact 10K rerun commands are stored in:
artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh
The 10K execution flow was:
- run the warm-start equivalence check
- baseline 10K train
- baseline evals at 1000, 2000, 5000, 10000
- parallel 10K train
- parallel evals at 1000, 2000, 5000, 10000
The detached runner was:
openpi/scripts/run_twin_handover_packed_10k.sh
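The detached pattern that runner relies on, launching the multi-hour chain in its own session so it survives the terminal, can be sketched in Python as follows. `launch_detached` is a hypothetical helper; the real commands are in `repro/commands_reproduce.sh`:

```python
import subprocess

def launch_detached(cmd: list, log_path: str) -> int:
    """Launch cmd detached from the controlling terminal, appending
    stdout/stderr to log_path, and return the child's pid."""
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # survives hangup of the launching shell
        )
    return proc.pid

if __name__ == "__main__":
    # Placeholder invocation of the report's runner script.
    pid = launch_detached(
        ["bash", "openpi/scripts/run_twin_handover_packed_10k.sh"],
        "handover_packed_10k_followup.log",
    )
    print("detached pid", pid)
```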
Main logs:
artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log
artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log
artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log
Startup sanity checks
Both 10K runs loaded cleanly with:
- packed transforms active
- correct 32-dim packed state/action tensors
- mask active on `[0:8]` and `[16:24]`
- exact zeros preserved in the masked padded blocks
- checkpoint load with `missing=0`, `unexpected=0`
Reference startup summary:
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt
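The startup checks above can be expressed as plain assertions. These helpers are illustrative, not the repo's actual API; the key-count check mirrors the `missing`/`unexpected` report of a strict-free state-dict load:

```python
import numpy as np

def check_checkpoint_keys(model_keys, ckpt_keys) -> dict:
    """Mirror a load_state_dict report: both counts must be zero."""
    missing = sorted(set(model_keys) - set(ckpt_keys))
    unexpected = sorted(set(ckpt_keys) - set(model_keys))
    return {"missing": len(missing), "unexpected": len(unexpected)}

def check_packed_batch(actions: np.ndarray) -> None:
    """Verify the packed-layout invariants on one inspected batch."""
    assert actions.shape[-1] == 32, "expected packed 32-dim actions"
    # masked padded blocks must be exactly zero, not merely small
    assert np.all(actions[..., 8:16] == 0)
    assert np.all(actions[..., 24:32] == 0)
```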
Checkpoint sources:
- baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
- parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`
Warm-start equivalence check
The 10K study added an explicit step-0 numerical check:
input_projection_max_abs_diff = 0.00122881
input_projection_mean_abs_diff = 0.00015435
baseline_masked_loss = 1.00531137
parallel_masked_loss = 1.00929189
masked_loss_abs_diff = 0.00398052
warmstart_equivalent = False
Reference:
artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log
Interpretation:
- the slice/fuse initialization is aligned by construction
- it is not numerically exact end-to-end on the same batch
- this weakens a strict “identical function at step 0” claim
- it does not invalidate the comparison as a matched warm-start study
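A minimal sketch of how such step-0 diagnostics can be computed, assuming hypothetical names and a tolerance; the real check lives in `openpi/scripts/check_parallel_warmstart_equivalence.py`:

```python
import numpy as np

def warmstart_diagnostics(base_proj, par_proj, base_loss, par_loss, tol=1e-4):
    """Compare baseline vs parallel at step 0 on the same batch.

    base_proj/par_proj: input-projection outputs; base_loss/par_loss:
    masked losses. The tol value here is an illustrative assumption.
    """
    diff = np.abs(np.asarray(base_proj) - np.asarray(par_proj))
    loss_diff = abs(base_loss - par_loss)
    return {
        "input_projection_max_abs_diff": float(diff.max()),
        "input_projection_mean_abs_diff": float(diff.mean()),
        "masked_loss_abs_diff": loss_diff,
        # aligned by construction, but not necessarily numerically exact
        "warmstart_equivalent": bool(diff.max() <= tol and loss_diff <= tol),
    }
```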
Dual-push 5K warm-start check:
input_projection_max_abs_diff = 0.00099802
input_projection_mean_abs_diff = 0.00010568
baseline_masked_loss = 1.43506372
parallel_masked_loss = 1.52086782
masked_loss_abs_diff = 0.08580410
warmstart_equivalent = False
Reference:
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log
Results
Initial 2K study
Teacher-forced validation loss:
| Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
|---|---|---|---|---|
| Packed baseline | 0.052885 | 0.035776 | 33:27 | 35.23 GB |
| Packed parallel | 0.051214 | 0.035680 | 30:38 | 35.27 GB |
10K train checkpoints
Rank-0 train snapshots:
| Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
|---|---|---|---|---|
| Baseline loss | 0.0228 | 0.0376 | 0.0202 | 0.0141 |
| Baseline smoothed | 0.0476 | 0.0273 | 0.0226 | 0.0172 |
| Parallel loss | 0.0211 | 0.0368 | 0.0212 | 0.0140 |
| Parallel smoothed | 0.0461 | 0.0259 | 0.0225 | 0.0169 |
Structured source:
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv
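The "smoothed" rows are presumably a running average over the noisy per-step train loss. As one common choice, an exponential moving average can be computed like this; the exact smoothing the repo uses is an assumption here:

```python
def ema_smooth(values, beta=0.99):
    """Exponential moving average of a loss curve.

    This is an assumed reconstruction of the 'smoothed' rows, not a
    confirmed detail of the repo's logging.
    """
    smoothed, avg = [], None
    for v in values:
        avg = v if avg is None else beta * avg + (1 - beta) * v
        smoothed.append(avg)
    return smoothed
```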
10K teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
|---|---|---|---|
| 1000 | 0.061130 | 0.059715 | -0.001415 |
| 2000 | 0.041595 | 0.039947 | -0.001648 |
| 5000 | 0.027324 | 0.027340 | +0.000016 |
| 10000 | 0.022345 | 0.022168 | -0.000177 |
Even at 10K steps the gap stays very small; the two models remain extremely close throughout.
10K final arm/joint/gripper breakdown
Teacher-forced validation at 10000:
| Metric | Baseline | Parallel |
|---|---|---|
| Mean val loss | 0.022345 | 0.022168 |
| Left arm loss | 0.029659 | 0.030184 |
| Right arm loss | 0.015031 | 0.014151 |
| Left joint loss | 0.031507 | 0.032356 |
| Left gripper loss | 0.016725 | 0.014984 |
| Right joint loss | 0.015776 | 0.014888 |
| Right gripper loss | 0.009818 | 0.008996 |
| Left/right imbalance | 0.034067 | 0.033825 |
Interpretation:
- the parallel model’s small final advantage is mostly on the right-arm side
- the baseline is slightly better on left-arm joint loss
- the parallel model is slightly better on both grippers and on right-joint loss
- imbalance is nearly unchanged
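One plausible definition of the left/right imbalance metric is the mean over samples of the absolute gap between per-sample left-block and right-block errors. This is an assumption for illustration only; the eval script defines the actual formula:

```python
import numpy as np

# Assumed reconstruction of the left/right imbalance metric; the repo's
# eval_twin_val_loss_pytorch.py is the authoritative definition.
def lr_imbalance(err: np.ndarray) -> float:
    """err: (num_samples, 32) per-dim absolute errors in the packed layout."""
    left = err[:, 0:8].mean(axis=-1)     # active left-arm block
    right = err[:, 16:24].mean(axis=-1)  # active right-arm block
    return float(np.abs(left - right).mean())
```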
10K sample-based eval
Final fixed-subset sample eval at 10000:
| Metric | Baseline | Parallel |
|---|---|---|
| 4-step masked MAE | 0.029935 | 0.029277 |
| 10-step masked MAE | 0.030294 | 0.030241 |
| 4-step left/right imbalance MAE | 0.033733 | 0.031629 |
| 10-step left/right imbalance MAE | 0.034582 | 0.032456 |
Interpretation:
- sample-based quality is effectively tied by the end
- the teacher-forced gap does not widen into a large inference-time separation
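The masked-MAE numbers above restrict the error to the active dims. A hedged sketch of that computation, with illustrative names for the sampling plumbing:

```python
import numpy as np

# Active dims of the packed 32-dim layout: [0:8] and [16:24].
ACTIVE = np.r_[0:8, 16:24]

def masked_mae(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error over the active dims only."""
    return float(np.abs(pred[..., ACTIVE] - target[..., ACTIVE]).mean())

def eval_sampled(model_sample, batch, num_steps: int) -> float:
    # model_sample stands in for the model's sample_actions call at the
    # given num_steps (4 or 10 in the report); names are illustrative.
    pred = model_sample(batch["obs"], num_steps=num_steps)
    return masked_mae(pred, batch["actions"])
```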
Structured sources:
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv
Runtime and memory
| Stage | Duration |
|---|---|
| Baseline train | 2:13:40 |
| Baseline eval sweep | 0:24:24 |
| Parallel train | 2:20:51 |
| Parallel eval sweep | 0:43:54 |
| Full 10K pipeline | 5:48:33 |
Peak VRAM:
- baseline: 35.23 GB
- parallel: 35.27 GB
Reference:
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv
Dual-push 128 screening results
Teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
|---|---|---|---|
| 1000 | 0.095597 | 0.093704 | -0.001893 |
| 2000 | 0.083194 | 0.082729 | -0.000465 |
| 5000 | 0.055958 | 0.055242 | -0.000716 |
The screening signal is small but consistently positive for the packed parallel model at all three checkpoints.
Dual-push arm breakdown
Teacher-forced validation at 5000:
| Metric | Baseline | Parallel |
|---|---|---|
| Mean val loss | 0.055958 | 0.055242 |
| Left arm loss | 0.017725 | 0.017044 |
| Right arm loss | 0.094191 | 0.093439 |
| Left joint loss | 0.017577 | 0.017052 |
| Left gripper loss | 0.018765 | 0.016992 |
| Right joint loss | 0.103576 | 0.102856 |
| Right gripper loss | 0.028502 | 0.027523 |
| Left/right imbalance | 0.080993 | 0.081011 |
Interpretation:
- the small parallel advantage is visible on both arms
- the right arm remains much harder than the left on this task
- left/right imbalance is essentially unchanged
Dual-push sample-based eval
Fixed-subset sample eval:
| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
|---|---|---|---|
| 1000 | baseline | 0.103199 | 0.108652 |
| 1000 | parallel | 0.101439 | 0.106874 |
| 2000 | baseline | 0.069732 | 0.074413 |
| 2000 | parallel | 0.069053 | 0.073501 |
| 5000 | baseline | 0.056830 | 0.058973 |
| 5000 | parallel | 0.054630 | 0.056627 |
Interpretation:
- the parallel model is also slightly better on fixed-subset inference-style eval
- unlike handover, the positive signal stays visible at 5K
- the margin is still small enough that this remains a screening result, not a paper-final claim
Dual-push runtime and memory
| Stage | Duration |
|---|---|
| Baseline train | 1:05:25 |
| Baseline eval sweep | 0:14:34 |
| Parallel train | 1:00:33 |
| Parallel eval sweep | 0:14:39 |
| Full dual-push pipeline | 2:35:11 |
Peak VRAM:
- baseline: 35.23 GB
- parallel: 35.27 GB
Artifact locations
2K bundle
artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json
artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh
artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt
10K bundle
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv
artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh
artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt
artifacts/twin_handover_packed_parallelization_10k_20260309/environment/
Dual-push 5K bundle
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt
artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/
Bottom line
The 10K follow-up suggests the 2K near-tie was not hiding a large later divergence.
- teacher-forced validation ends with a small parallel edge
- sample-based eval is essentially tied
- left/right imbalance does not materially change
- the main difference remains subtle rather than dramatic
The dual-push screening run adds a second signal:
- the packed parallel model is slightly better at 1K, 2K, and 5K
- the same small advantage appears on both teacher-forced and sample-based eval
- the effect is still modest, but it is cleaner and more consistent than handover
So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.