# Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push
## Scope
This repo now contains three completed studies:
1. the initial `2K` baseline-vs-parallel comparison
2. the longer `10K` follow-up with richer diagnostics
3. a `5K` dual-push `128` screening run on the same packed path
The handover runs used:
- train repo `lsnu/twin_handover_256_train`
- val repo `lsnu/twin_handover_256_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
Existing public `16`-dim norm stats were reused. No raw-data reconversion was done.
The dual-push screening run used:
- train repo `lsnu/twin_dual_push_128_train`
- val repo `lsnu/twin_dual_push_128_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
- recomputed norm stats for the dual-push `128` train split
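The recomputed stats in the last bullet amount to per-dim statistics over the train-split actions. A minimal sketch, assuming a simple mean/std schema (`compute_norm_stats` is a hypothetical helper; the real openpi `norm_stats.json` may carry additional fields such as quantiles):

```python
import json
import numpy as np

def compute_norm_stats(actions: np.ndarray) -> dict:
    """Per-dim mean/std over a (num_samples, action_dim) train split.
    Sketch only; the on-disk norm_stats.json schema may differ."""
    return {
        "actions": {
            "mean": actions.mean(axis=0).tolist(),
            "std": actions.std(axis=0).tolist(),
        }
    }

# e.g. json.dumps(compute_norm_stats(train_actions), indent=2)
```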
## Data packing and masking
The TWIN-converted state/action layout is `[L8, R8]`, where each arm contributes `7` joints plus a gripper. The packed transform path added for these runs preserves the left/right semantics inside a `32`-dim model input:
```text
[L8, R8] -> [L8, 0x8, R8, 0x8]
```
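The transform can be sketched as follows (a minimal single-vector version; `pack_twin` is a hypothetical name, and the real transform operates on batched state and action tensors):

```python
import numpy as np

def pack_twin(x16: np.ndarray) -> np.ndarray:
    """Pack a 16-dim [L8, R8] vector into the 32-dim [L8, 0x8, R8, 0x8] layout."""
    packed = np.zeros(x16.shape[:-1] + (32,), dtype=x16.dtype)
    packed[..., 0:8] = x16[..., 0:8]     # left arm: 7 joints + gripper
    packed[..., 16:24] = x16[..., 8:16]  # right arm: 7 joints + gripper
    return packed
```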
The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:
- `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
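The corresponding loss masking keeps only the active dims. A minimal sketch, assuming a plain per-dim MSE as a stand-in for the actual action-head loss:

```python
import numpy as np

# Active/masked dims for the packed [L8, 0x8, R8, 0x8] layout.
ACTIVE = np.zeros(32, dtype=bool)
ACTIVE[0:8] = True    # left arm
ACTIVE[16:24] = True  # right arm

def masked_action_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """MSE restricted to the active dims; the padded blocks [8:16] and
    [24:32] contribute nothing to the loss."""
    err = (pred - target) ** 2
    return float(err[..., ACTIVE].mean())
```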
## Files changed or created
### Initial `2K` study
The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached `2K` runner plumbing. The exact per-file list is in:
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
### `10K` follow-up additions
The follow-up changed or added:
- `openpi/src/openpi/training/config.py`
- added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
- added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
- `openpi/scripts/train_pytorch.py`
- added periodic per-module gradient buckets for baseline and parallel models
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
- added left/right arm losses
- added joint vs gripper losses
- added left/right imbalance
  - added a small deterministic `sample_actions` eval at `num_steps={4,10}`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
- added step-0 baseline-vs-parallel numerical check
- `openpi/scripts/run_twin_handover_packed_10k.sh`
- added detached `10K` train/eval chain
- `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
- copied existing norm stats for the `10K` baseline config
- `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
- copied existing norm stats for the `10K` parallel config
- `README.md`
- updated repo landing page to cover both studies
- `REPORT.md`
- updated full report to cover both studies
The exact `10K` changed-file manifest is:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
### Dual-push `5K` screening additions
The dual-push screening run added or updated:
- `openpi/src/openpi/training/config.py`
- added `pi05_twin_dual_push_128_packed_baseline_pytorch_5k`
- added `pi05_twin_dual_push_128_packed_parallel_pytorch_5k`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
  - added the detached dual-push `5K` runner (baseline train -> baseline eval sweep -> parallel train -> parallel eval sweep)
- `openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
- computed dual-push `128` train norm stats for the packed baseline config
- `openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
- computed dual-push `128` train norm stats for the packed parallel config
- `README.md`
- updated landing page to cover the dual-push screening study
- `REPORT.md`
- updated full report to include dual-push setup, results, and artifact locations
The exact dual-push changed-file manifest is:
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
## Commands and run flow
The exact `10K` rerun commands are stored in:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
The `10K` execution flow was:
1. run the warm-start equivalence check
2. baseline `10K` train
3. baseline evals at `1000`, `2000`, `5000`, `10000`
4. parallel `10K` train
5. parallel evals at `1000`, `2000`, `5000`, `10000`
The detached runner was:
- `openpi/scripts/run_twin_handover_packed_10k.sh`
Main logs:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log`
## Startup sanity checks
Both `10K` runs loaded cleanly with:
- packed transforms active
- correct `32`-dim packed state/action tensors
- mask active on `[0:8]` and `[16:24]`
- exact zeros preserved in masked padded blocks
- `missing=0`
- `unexpected=0`
Reference startup summary:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt`
Checkpoint sources:
- baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
- parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`
## Warm-start equivalence check
The `10K` study added an explicit step-0 numerical check:
- `input_projection_max_abs_diff = 0.00122881`
- `input_projection_mean_abs_diff = 0.00015435`
- `baseline_masked_loss = 1.00531137`
- `parallel_masked_loss = 1.00929189`
- `masked_loss_abs_diff = 0.00398052`
- `warmstart_equivalent = False`
Reference:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log`
Interpretation:
- the slice/fuse initialization is aligned by construction
- it is not numerically exact end-to-end on the same batch
- this weakens a strict “identical function at step 0” claim
- it does not invalidate the comparison as a matched warm-start study
Dual-push `5K` warm-start check:
- `input_projection_max_abs_diff = 0.00099802`
- `input_projection_mean_abs_diff = 0.00010568`
- `baseline_masked_loss = 1.43506372`
- `parallel_masked_loss = 1.52086782`
- `masked_loss_abs_diff = 0.08580410`
- `warmstart_equivalent = False`
Reference:
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log`
## Results
### Initial `2K` study
Teacher-forced validation loss:
| Model | Val @ 1000 | Val @ 2000 | Train runtime (mm:ss) | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23GB` |
| Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27GB` |
### `10K` train checkpoints
Rank-0 train snapshots:
| Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
| --- | ---: | ---: | ---: | ---: |
| Baseline loss | `0.0228` | `0.0376` | `0.0202` | `0.0141` |
| Baseline smoothed | `0.0476` | `0.0273` | `0.0226` | `0.0172` |
| Parallel loss | `0.0211` | `0.0368` | `0.0212` | `0.0140` |
| Parallel smoothed | `0.0461` | `0.0259` | `0.0225` | `0.0169` |
Structured source:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
### `10K` teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.061130` | `0.059715` | `-0.001415` |
| `2000` | `0.041595` | `0.039947` | `-0.001648` |
| `5000` | `0.027324` | `0.027340` | `+0.000016` |
| `10000` | `0.022345` | `0.022168` | `-0.000177` |
The longer run still shows only a very small gap. The models remain extremely close.
### `10K` final arm/joint/gripper breakdown
Teacher-forced validation at `10000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.022345` | `0.022168` |
| Left arm loss | `0.029659` | `0.030184` |
| Right arm loss | `0.015031` | `0.014151` |
| Left joint loss | `0.031507` | `0.032356` |
| Left gripper loss | `0.016725` | `0.014984` |
| Right joint loss | `0.015776` | `0.014888` |
| Right gripper loss | `0.009818` | `0.008996` |
| Left/right imbalance | `0.034067` | `0.033825` |
Interpretation:
- the parallel model’s small final advantage is mostly on the right-arm side
- the baseline is slightly better on left-arm joint loss
- the parallel model is slightly better on both grippers and on right-joint loss
- imbalance is nearly unchanged
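A breakdown like this can be computed from the packed layout roughly as follows, assuming dim `7` of each arm block is the gripper and a per-dim squared-error reduction (the exact reduction in `eval_twin_val_loss_pytorch.py` may differ; the imbalance metric is omitted because its definition is not spelled out here):

```python
import numpy as np

def arm_breakdown(pred: np.ndarray, target: np.ndarray) -> dict:
    """Per-group squared error over the packed [L8, 0x8, R8, 0x8] layout."""
    err = (pred - target) ** 2
    left, right = err[..., 0:8], err[..., 16:24]
    return {
        "left_arm": float(left.mean()),
        "right_arm": float(right.mean()),
        "left_joint": float(left[..., 0:7].mean()),
        "left_gripper": float(left[..., 7].mean()),
        "right_joint": float(right[..., 0:7].mean()),
        "right_gripper": float(right[..., 7].mean()),
    }
```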
### `10K` sample-based eval
Final fixed-subset sample eval at `10000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| 4-step masked MAE | `0.029935` | `0.029277` |
| 10-step masked MAE | `0.030294` | `0.030241` |
| 4-step left/right imbalance MAE | `0.033733` | `0.031629` |
| 10-step left/right imbalance MAE | `0.034582` | `0.032456` |
Interpretation:
- sample-based quality is effectively tied by the end
- the teacher-forced gap does not widen into a large inference-time separation
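The masked MAE in this table can be sketched as an absolute-error mean over the active packed dims (a simplification of the eval script's reduction, which also fixes the evaluated subset):

```python
import numpy as np

# Active dims of the packed [L8, 0x8, R8, 0x8] layout.
ACTIVE_DIMS = np.r_[0:8, 16:24]

def masked_mae(sampled: np.ndarray, target: np.ndarray) -> float:
    """MAE over the active packed dims only, averaged over all leading axes."""
    return float(np.abs(sampled - target)[..., ACTIVE_DIMS].mean())
```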
Structured sources:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
### Runtime and memory
| Stage | Duration (h:mm:ss) |
| --- | ---: |
| Baseline train | `2:13:40` |
| Baseline eval sweep | `0:24:24` |
| Parallel train | `2:20:51` |
| Parallel eval sweep | `0:43:54` |
| Full `10K` pipeline | `5:48:33` |
Peak VRAM:
- baseline: `35.23GB`
- parallel: `35.27GB`
Reference:
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv`
## Dual-push `128` screening results
### Teacher-forced validation
| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.095597` | `0.093704` | `-0.001893` |
| `2000` | `0.083194` | `0.082729` | `-0.000465` |
| `5000` | `0.055958` | `0.055242` | `-0.000716` |
The screening signal is small but consistently favors the packed parallel model at all three checkpoints.
### Dual-push arm breakdown
Teacher-forced validation at `5000`:
| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.055958` | `0.055242` |
| Left arm loss | `0.017725` | `0.017044` |
| Right arm loss | `0.094191` | `0.093439` |
| Left joint loss | `0.017577` | `0.017052` |
| Left gripper loss | `0.018765` | `0.016992` |
| Right joint loss | `0.103576` | `0.102856` |
| Right gripper loss | `0.028502` | `0.027523` |
| Left/right imbalance | `0.080993` | `0.081011` |
Interpretation:
- the small parallel advantage is visible on both arms
- the right arm remains much harder than the left on this task
- left/right imbalance is essentially unchanged
### Dual-push sample-based eval
Fixed-subset sample eval:
| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
| --- | --- | ---: | ---: |
| `1000` | baseline | `0.103199` | `0.108652` |
| `1000` | parallel | `0.101439` | `0.106874` |
| `2000` | baseline | `0.069732` | `0.074413` |
| `2000` | parallel | `0.069053` | `0.073501` |
| `5000` | baseline | `0.056830` | `0.058973` |
| `5000` | parallel | `0.054630` | `0.056627` |
Interpretation:
- the parallel model is also slightly better on fixed-subset inference-style eval
- unlike handover, the positive signal stays visible at `5K`
- the margin is still small enough that this remains a screening result, not a paper-final claim
### Dual-push runtime and memory
| Stage | Duration (h:mm:ss) |
| --- | ---: |
| Baseline train | `1:05:25` |
| Baseline eval sweep | `0:14:34` |
| Parallel train | `1:00:33` |
| Parallel eval sweep | `0:14:39` |
| Full dual-push pipeline | `2:35:11` |
Peak VRAM:
- baseline: `35.23GB`
- parallel: `35.27GB`
## Artifact locations
### `2K` bundle
- `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
### `10K` bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
### Dual-push `5K` bundle
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
## Bottom line
The `10K` follow-up suggests the `2K` near-tie was not hiding a large later divergence.
- teacher-forced validation ends with a small parallel edge
- sample-based eval is essentially tied
- left/right imbalance does not materially change
- the main difference remains subtle rather than dramatic
The dual-push screening run adds a second signal:
- the packed parallel model is slightly better at `1K`, `2K`, and `5K`
- the same small advantage appears on both teacher-forced and sample-based eval
- the effect is still modest, but it is cleaner and more consistent than handover
So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.