# Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push
## Scope
This repo now contains three completed studies:
1. the initial `2K` baseline-vs-parallel comparison
2. the longer `10K` follow-up with richer diagnostics
3. a `5K` dual-push `128` screening run on the same packed path
The handover runs used:
- train repo `lsnu/twin_handover_256_train`
- val repo `lsnu/twin_handover_256_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
Existing public `16`-dim norm stats were reused. No raw-data reconversion was done.
The dual-push screening run used:
- train repo `lsnu/twin_dual_push_128_train`
- val repo `lsnu/twin_dual_push_128_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
- recomputed norm stats for the dual-push `128` train split
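The recomputed stats in the last bullet amount to per-dim statistics over the train-split actions. A minimal sketch, assuming a simple mean/std schema (`compute_norm_stats` is a hypothetical helper; the real openpi `norm_stats.json` may carry additional fields such as quantiles):

```python
import json
import numpy as np

def compute_norm_stats(actions: np.ndarray) -> dict:
    """Per-dim mean/std over a (num_samples, action_dim) train split.
    Sketch only; the on-disk norm_stats.json schema may differ."""
    return {
        "actions": {
            "mean": actions.mean(axis=0).tolist(),
            "std": actions.std(axis=0).tolist(),
        }
    }

# e.g. json.dumps(compute_norm_stats(train_actions), indent=2)
```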
## Data packing and masking
The TWIN-converted state/action layout is `[L8, R8]`, where each arm contributes `7` joints plus a gripper. The packed transform path added for these runs preserves the left/right semantics inside a `32`-dim model input:
```text
[L8, R8] -> [L8, 0x8, R8, 0x8]
```
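The transform can be sketched as follows (a minimal single-vector version; `pack_twin` is a hypothetical name, and the real transform operates on batched state and action tensors):

```python
import numpy as np

def pack_twin(x16: np.ndarray) -> np.ndarray:
    """Pack a 16-dim [L8, R8] vector into the 32-dim [L8, 0x8, R8, 0x8] layout."""
    packed = np.zeros(x16.shape[:-1] + (32,), dtype=x16.dtype)
    packed[..., 0:8] = x16[..., 0:8]     # left arm: 7 joints + gripper
    packed[..., 16:24] = x16[..., 8:16]  # right arm: 7 joints + gripper
    return packed
```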
The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:
- `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
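The corresponding loss masking keeps only the active dims. A minimal sketch, assuming a plain per-dim MSE as a stand-in for the actual action-head loss:

```python
import numpy as np

# Active/masked dims for the packed [L8, 0x8, R8, 0x8] layout.
ACTIVE = np.zeros(32, dtype=bool)
ACTIVE[0:8] = True    # left arm
ACTIVE[16:24] = True  # right arm

def masked_action_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """MSE restricted to the active dims; the padded blocks [8:16] and
    [24:32] contribute nothing to the loss."""
    err = (pred - target) ** 2
    return float(err[..., ACTIVE].mean())
```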
## Files changed or created
### Initial `2K` study
The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached `2K` runner plumbing. The exact per-file list is in:
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
### `10K` follow-up additions
The follow-up changed or added:
- `openpi/src/openpi/training/config.py`
- added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
- added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
- `openpi/scripts/train_pytorch.py`
- added periodic per-module gradient buckets for baseline and parallel models
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
- added left/right arm losses
- added joint vs gripper losses
- added left/right imbalance
  - added a small deterministic `sample_actions` eval at `num_steps={4,10}`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
- added step-0 baseline-vs-parallel numerical check
- `openpi/scripts/run_twin_handover_packed_10k.sh`
- added detached `10K` train/eval chain
- `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
- copied existing norm stats for the `10K` baseline config
- `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
- copied existing norm stats for the `10K` parallel config
- `README.md`
- updated repo landing page to cover both studies
- `REPORT.md`
- updated full report to cover both studies
The exact `10K` changed-file manifest is:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
### Dual-push `5K` screening additions
The dual-push screening run added or updated:
- `openpi/src/openpi/training/config.py`
- added `pi05_twin_dual_push_128_packed_baseline_pytorch_5k`
- added `pi05_twin_dual_push_128_packed_parallel_pytorch_5k`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
  - added the detached dual-push `5K` runner (baseline train -> baseline eval sweep -> parallel train -> parallel eval sweep)
- `openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
- computed dual-push `128` train norm stats for the packed baseline config
- `openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
- computed dual-push `128` train norm stats for the packed parallel config
- `README.md`
- updated landing page to cover the dual-push screening study
- `REPORT.md`
- updated full report to include dual-push setup, results, and artifact locations
The exact dual-push changed-file manifest is:
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
## Commands and run flow
The exact `10K` rerun commands are stored in:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
The `10K` execution flow was:
1. run the warm-start equivalence check
2. baseline `10K` train
3. baseline evals at `1000`, `2000`, `5000`, `10000`
4. parallel `10K` train
5. parallel evals at `1000`, `2000`, `5000`, `10000`
The detached runner was:
- `openpi/scripts/run_twin_handover_packed_10k.sh`
Main logs:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log`
## Startup sanity checks
Both `10K` runs loaded cleanly with:
- packed transforms active
- correct `32`-dim packed state/action tensors
- mask active on `[0:8]` and `[16:24]`
- exact zeros preserved in masked padded blocks
- `missing=0`
- `unexpected=0`
Reference startup summary:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt`
Checkpoint sources:
- baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
- parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`
## Warm-start equivalence check
The `10K` study added an explicit step-0 numerical check:
- `input_projection_max_abs_diff = 0.00122881`
- `input_projection_mean_abs_diff = 0.00015435`
- `baseline_masked_loss = 1.00531137`
- `parallel_masked_loss = 1.00929189`
- `masked_loss_abs_diff = 0.00398052`
- `warmstart_equivalent = False`
Reference:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log`
Interpretation:
- the slice/fuse initialization is aligned by construction
- it is not numerically exact end-to-end on the same batch
- this weakens a strict “identical function at step 0” claim
- it does not invalidate the comparison as a matched warm-start study
Dual-push `5K` warm-start check:
- `input_projection_max_abs_diff = 0.00099802`
- `input_projection_mean_abs_diff = 0.00010568`
- `baseline_masked_loss = 1.43506372`
- `parallel_masked_loss = 1.52086782`
- `masked_loss_abs_diff = 0.08580410`
- `warmstart_equivalent = False`
Reference:
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log`
## Results
### Initial `2K` study
Teacher-forced validation loss:
| Model | Val @ 1000 | Val @ 2000 | Train runtime (mm:ss) | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23GB` |
| Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27GB` |
### `10K` train checkpoints
Rank-0 train snapshots:
| Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
| --- | ---: | ---: | ---: | ---: |
| Baseline loss | `0.0228` | `0.0376` | `0.0202` | `0.0141` |
| Baseline smoothed | `0.0476` | `0.0273` | `0.0226` | `0.0172` |
| Parallel loss | `0.0211` | `0.0368` | `0.0212` | `0.0140` |
| Parallel smoothed | `0.0461` | `0.0259` | `0.0225` | `0.0169` |
Structured source:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
### `10K` teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.061130` | `0.059715` | `-0.001415` |
| `2000` | `0.041595` | `0.039947` | `-0.001648` |
| `5000` | `0.027324` | `0.027340` | `+0.000016` |
| `10000` | `0.022345` | `0.022168` | `-0.000177` |
The longer run still shows only a very small gap. The models remain extremely close.
### `10K` final arm/joint/gripper breakdown
Teacher-forced validation at `10000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.022345` | `0.022168` |
| Left arm loss | `0.029659` | `0.030184` |
| Right arm loss | `0.015031` | `0.014151` |
| Left joint loss | `0.031507` | `0.032356` |
| Left gripper loss | `0.016725` | `0.014984` |
| Right joint loss | `0.015776` | `0.014888` |
| Right gripper loss | `0.009818` | `0.008996` |
| Left/right imbalance | `0.034067` | `0.033825` |
Interpretation:
- the parallel model’s small final advantage is mostly on the right-arm side
- the baseline is slightly better on left-arm joint loss
- the parallel model is slightly better on both grippers and on right-joint loss
- imbalance is nearly unchanged
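A breakdown like this can be computed from the packed layout roughly as follows, assuming dim `7` of each arm block is the gripper and a per-dim squared-error reduction (the exact reduction in `eval_twin_val_loss_pytorch.py` may differ; the imbalance metric is omitted because its definition is not spelled out here):

```python
import numpy as np

def arm_breakdown(pred: np.ndarray, target: np.ndarray) -> dict:
    """Per-group squared error over the packed [L8, 0x8, R8, 0x8] layout."""
    err = (pred - target) ** 2
    left, right = err[..., 0:8], err[..., 16:24]
    return {
        "left_arm": float(left.mean()),
        "right_arm": float(right.mean()),
        "left_joint": float(left[..., 0:7].mean()),
        "left_gripper": float(left[..., 7].mean()),
        "right_joint": float(right[..., 0:7].mean()),
        "right_gripper": float(right[..., 7].mean()),
    }
```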
### `10K` sample-based eval
Final fixed-subset sample eval at `10000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| 4-step masked MAE | `0.029935` | `0.029277` |
| 10-step masked MAE | `0.030294` | `0.030241` |
| 4-step left/right imbalance MAE | `0.033733` | `0.031629` |
| 10-step left/right imbalance MAE | `0.034582` | `0.032456` |
Interpretation:
- sample-based quality is effectively tied by the end
- the teacher-forced gap does not widen into a large inference-time separation
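The masked MAE in this table can be sketched as an absolute-error mean over the active packed dims (a simplification of the eval script's reduction, which also fixes the evaluated subset):

```python
import numpy as np

# Active dims of the packed [L8, 0x8, R8, 0x8] layout.
ACTIVE_DIMS = np.r_[0:8, 16:24]

def masked_mae(sampled: np.ndarray, target: np.ndarray) -> float:
    """MAE over the active packed dims only, averaged over all leading axes."""
    return float(np.abs(sampled - target)[..., ACTIVE_DIMS].mean())
```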
Structured sources:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
### Runtime and memory
| Stage | Duration (h:mm:ss) |
| --- | ---: |
| Baseline train | `2:13:40` |
| Baseline eval sweep | `0:24:24` |
| Parallel train | `2:20:51` |
| Parallel eval sweep | `0:43:54` |
| Full `10K` pipeline | `5:48:33` |
Peak VRAM:
- baseline: `35.23GB`
- parallel: `35.27GB`
Reference:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv`
## Dual-push `128` screening results
### Teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.095597` | `0.093704` | `-0.001893` |
| `2000` | `0.083194` | `0.082729` | `-0.000465` |
| `5000` | `0.055958` | `0.055242` | `-0.000716` |
The screening signal is small but consistently favors the packed parallel model at all three checkpoints.
### Dual-push arm breakdown
Teacher-forced validation at `5000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.055958` | `0.055242` |
| Left arm loss | `0.017725` | `0.017044` |
| Right arm loss | `0.094191` | `0.093439` |
| Left joint loss | `0.017577` | `0.017052` |
| Left gripper loss | `0.018765` | `0.016992` |
| Right joint loss | `0.103576` | `0.102856` |
| Right gripper loss | `0.028502` | `0.027523` |
| Left/right imbalance | `0.080993` | `0.081011` |
Interpretation:
- the small parallel advantage is visible on both arms
- the right arm remains much harder than the left on this task
- left/right imbalance is essentially unchanged
### Dual-push sample-based eval
Fixed-subset sample eval:
| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
| --- | --- | ---: | ---: |
| `1000` | baseline | `0.103199` | `0.108652` |
| `1000` | parallel | `0.101439` | `0.106874` |
| `2000` | baseline | `0.069732` | `0.074413` |
| `2000` | parallel | `0.069053` | `0.073501` |
| `5000` | baseline | `0.056830` | `0.058973` |
| `5000` | parallel | `0.054630` | `0.056627` |
Interpretation:
- the parallel model is also slightly better on fixed-subset inference-style eval
- unlike handover, the positive signal stays visible at `5K`
- the margin is still small enough that this remains a screening result, not a paper-final claim
### Dual-push runtime and memory
| Stage | Duration (h:mm:ss) |
| --- | ---: |
| Baseline train | `1:05:25` |
| Baseline eval sweep | `0:14:34` |
| Parallel train | `1:00:33` |
| Parallel eval sweep | `0:14:39` |
| Full dual-push pipeline | `2:35:11` |
Peak VRAM:
- baseline: `35.23GB`
- parallel: `35.27GB`
## Artifact locations
### `2K` bundle
- `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
### `10K` bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
### Dual-push `5K` bundle
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
## Bottom line
The `10K` follow-up suggests the `2K` near-tie was not hiding a large later divergence.
- teacher-forced validation ends with a small parallel edge
- sample-based eval is essentially tied
- left/right imbalance does not materially change
- the main difference remains subtle rather than dramatic
The dual-push screening run adds a second signal:
- the packed parallel model is slightly better at `1K`, `2K`, and `5K`
- the same small advantage appears on both teacher-forced and sample-based eval
- the effect is still modest, but it is cleaner and more consistent than handover
So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.