
# Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover and Dual Push

## Scope

This repo now contains three completed studies:

1. the initial `2K` baseline-vs-parallel comparison
2. the longer `10K` follow-up with richer diagnostics
3. a `5K` dual-push `128` screening run on the same packed path

The handover runs used:

- train repo `lsnu/twin_handover_256_train`
- val repo `lsnu/twin_handover_256_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`

Existing public `16`-dim norm stats were reused. No raw-data reconversion was done.

The dual-push screening run used:

- train repo `lsnu/twin_dual_push_128_train`
- val repo `lsnu/twin_dual_push_128_val`
- `4x H100 80GB`
- `bfloat16`
- packed semantic layout `[L8, 0x8, R8, 0x8]`
- active action-loss dims `[0:8]` and `[16:24]`
- masked dims `[8:16]` and `[24:32]`
- recomputed norm stats for the dual-push `128` train split
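
The active and masked dims listed above imply a loss that averages only over `[0:8]` and `[16:24]`. The following is a minimal sketch of that idea, not the repo's implementation: the squared-error form and the name `masked_mse` are assumptions made for illustration.

```python
# Active action-loss dims in the packed 32-dim layout (from the lists above).
ACTIVE_SLICES = [(0, 8), (16, 24)]


def masked_mse(pred, target):
    """Mean squared error over active dims only; masked dims are ignored."""
    total, count = 0.0, 0
    for start, stop in ACTIVE_SLICES:
        for p, t in zip(pred[start:stop], target[start:stop]):
            total += (p - t) ** 2
            count += 1
    return total / count


target = [0.0] * 32
pred = [0.5] * 32   # error of 0.5 on every dim
pred[8] = 100.0     # large error in a masked dim: must not affect the loss
loss = masked_mse(pred, target)
print(loss)
```

Because the average divides by the 16 active dims rather than all 32, the loss scale stays comparable to an unpacked `16`-dim setup.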

## Data packing and masking

The TWIN converted state/action layout is `[L8, R8]`, where each arm is `7` joints plus gripper. The packed transform path added for these runs preserves the left/right semantics inside a `32`-dim model input:

```text
[L8, R8] -> [L8, 0x8, R8, 0x8]
```
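
As an illustration of the mapping above, here is a minimal pure-Python sketch. The function names `pack_twin` and `active_mask` are hypothetical and are not taken from the repo's transform code.

```python
ARM_DIM = 8      # 7 joints + 1 gripper per arm
PACKED_DIM = 32  # packed model input width used in these runs


def pack_twin(vec16):
    """Map a 16-dim [L8, R8] vector to the 32-dim [L8, 0x8, R8, 0x8] layout."""
    if len(vec16) != 2 * ARM_DIM:
        raise ValueError(f"expected {2 * ARM_DIM} dims, got {len(vec16)}")
    left, right = vec16[:ARM_DIM], vec16[ARM_DIM:]
    zeros = [0.0] * ARM_DIM
    return list(left) + zeros + list(right) + zeros


def active_mask():
    """1.0 on the active dims [0:8] and [16:24], 0.0 on the padded dims."""
    on, off = [1.0] * ARM_DIM, [0.0] * ARM_DIM
    return on + off + on + off


packed = pack_twin([float(i) for i in range(1, 17)])
mask = active_mask()
```

The left arm stays at `[0:8]` and the right arm moves to `[16:24]`, so each arm occupies a fixed block regardless of the other arm's content.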

The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:

- `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
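
A check of this kind can be sketched as follows. This is an assumed reconstruction of what "exact zero padding" means, not the contents of the inspection script; `nonzero_masked_entries` is a hypothetical helper.

```python
# Padded (masked) dims in the packed 32-dim layout.
MASKED_SLICES = [(8, 16), (24, 32)]


def nonzero_masked_entries(batch):
    """List (row, dim, value) for every masked entry that is not exactly 0.0."""
    return [
        (r, d, row[d])
        for r, row in enumerate(batch)
        for start, stop in MASKED_SLICES
        for d in range(start, stop)
        if row[d] != 0.0
    ]


clean_row = [1.0] * 8 + [0.0] * 8 + [2.0] * 8 + [0.0] * 8
leaky_row = list(clean_row)
leaky_row[9] = 1e-7  # a tiny leak still counts: the check is exact, not approximate

violations = nonzero_masked_entries([clean_row, leaky_row])
print(violations)
```

An exact comparison is used on purpose: normalization applied after packing could shift padded zeros by a small offset, and a tolerance-based check would hide that.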

## Files changed or created

### Initial `2K` study

The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached `2K` runner plumbing. The exact per-file list is in:

- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

### `10K` follow-up additions

The follow-up changed or added:

- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
  - added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
- `openpi/scripts/train_pytorch.py`
  - added periodic per-module gradient buckets for baseline and parallel models
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
  - added left/right arm losses
  - added joint vs gripper losses
  - added left/right imbalance
  - added small deterministic `sample_actions` eval at `num_steps={4,10}`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
  - added step-0 baseline-vs-parallel numerical check
- `openpi/scripts/run_twin_handover_packed_10k.sh`
  - added detached `10K` train/eval chain
- `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the `10K` baseline config
- `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
  - copied existing norm stats for the `10K` parallel config
- `README.md`
  - updated repo landing page to cover both studies
- `REPORT.md`
  - updated full report to cover both studies

The exact `10K` changed-file manifest is:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
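
The per-module gradient buckets mentioned for `train_pytorch.py` can be pictured with a small sketch. It assumes that a bucket is keyed by the leading component of a parameter name and that the logged quantity is an L2 norm; both are assumptions, and the module names and the `every` interval below are placeholders rather than values from the training script.

```python
import math
from collections import defaultdict


def bucket_of(param_name, depth=1):
    """Bucket key = first `depth` dotted components of the parameter name."""
    return ".".join(param_name.split(".")[:depth])


def grad_norm_buckets(named_grads, depth=1):
    """L2 gradient norm per bucket from (name, flat_gradient) pairs."""
    squared = defaultdict(float)
    for name, grad in named_grads:
        squared[bucket_of(name, depth)] += sum(g * g for g in grad)
    return {key: math.sqrt(value) for key, value in squared.items()}


def should_log(step, every=500):
    """Periodic logging: only emit buckets every `every` steps (placeholder value)."""
    return step % every == 0


# Placeholder module names and gradients, for illustration only.
named_grads = [
    ("module_a.weight", [3.0, 4.0]),
    ("module_a.bias", [0.0]),
    ("module_b.weight", [1.0, 2.0, 2.0]),
]
norms = grad_norm_buckets(named_grads)
```

Grouping by module keeps the log compact while still showing whether the baseline and parallel action heads receive gradients of similar magnitude.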

### Dual-push `5K` screening additions

The dual-push screening run added or updated:

- `openpi/src/openpi/training/config.py`
  - added `pi05_twin_dual_push_128_packed_baseline_pytorch_5k`
  - added `pi05_twin_dual_push_128_packed_parallel_pytorch_5k`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
  - added detached dual-push `5K` baseline->eval sweep->parallel->eval sweep runner
- `openpi/assets/pi05_twin_dual_push_128_packed_baseline_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push `128` train norm stats for the packed baseline config
- `openpi/assets/pi05_twin_dual_push_128_packed_parallel_pytorch_5k/lsnu/twin_dual_push_128_train/norm_stats.json`
  - computed dual-push `128` train norm stats for the packed parallel config
- `README.md`
  - updated landing page to cover the dual-push screening study
- `REPORT.md`
  - updated full report to include dual-push setup, results, and artifact locations

The exact dual-push changed-file manifest is:

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`

## Commands and run flow

The exact `10K` rerun commands are stored in:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`

The `10K` execution flow was:

1. run the warm-start equivalence check
2. baseline `10K` train
3. baseline evals at `1000`, `2000`, `5000`, `10000`
4. parallel `10K` train
5. parallel evals at `1000`, `2000`, `5000`, `10000`

The detached runner was:

- `openpi/scripts/run_twin_handover_packed_10k.sh`

Main logs:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log`

## Startup sanity checks

Both `10K` runs loaded cleanly with:

- packed transforms active
- correct `32`-dim packed state/action tensors
- mask active on `[0:8]` and `[16:24]`
- exact zeros preserved in masked padded blocks
- `missing=0`
- `unexpected=0`

Reference startup summary:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt`

Checkpoint sources:

- baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
- parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`

## Warm-start equivalence check

The `10K` study added an explicit step-0 numerical check:

- `input_projection_max_abs_diff = 0.00122881`
- `input_projection_mean_abs_diff = 0.00015435`
- `baseline_masked_loss = 1.00531137`
- `parallel_masked_loss = 1.00929189`
- `masked_loss_abs_diff = 0.00398052`
- `warmstart_equivalent = False`

Reference:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log`

Interpretation:

- the slice/fuse initialization is aligned by construction
- it is not numerically exact end-to-end on the same batch
- this weakens a strict “identical function at step 0” claim
- it does not invalidate the comparison as a matched warm-start study
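
The reported quantities can be reproduced from their definitions with a short sketch. The tolerance `tol` and the function `warmstart_equivalent` are assumptions for illustration; the check script's actual threshold is not stated in this report.

```python
def abs_diff_stats(a, b):
    """Max and mean absolute difference between two equal-length sequences."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return max(diffs), sum(diffs) / len(diffs)


def warmstart_equivalent(baseline_loss, parallel_loss, max_abs_diff, tol=1e-6):
    """Strict check: the loss gap and the projection gap must both be within tol.

    `tol` is a hypothetical threshold chosen for this sketch.
    """
    return abs(baseline_loss - parallel_loss) <= tol and max_abs_diff <= tol


# Values reported above for the handover 10K check.
baseline_loss, parallel_loss = 1.00531137, 1.00929189
loss_gap = abs(parallel_loss - baseline_loss)
is_equivalent = warmstart_equivalent(baseline_loss, parallel_loss, 0.00122881)
print(f"masked_loss_abs_diff = {loss_gap:.8f}, warmstart_equivalent = {is_equivalent}")
```

Under any strict tolerance the handover gap of about `0.004` fails the check, which matches the reported `warmstart_equivalent = False`.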

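Dual-push `5K` warm-start check (the masked-loss gap is roughly `21x` the handover gap, so the step-0 caveat above applies more strongly here):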

- `input_projection_max_abs_diff = 0.00099802`
- `input_projection_mean_abs_diff = 0.00010568`
- `baseline_masked_loss = 1.43506372`
- `parallel_masked_loss = 1.52086782`
- `masked_loss_abs_diff = 0.08580410`
- `warmstart_equivalent = False`

Reference:

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/sanity_checks/warmstart_dual_push_128_5k.log`

## Results

### Initial `2K` study

Teacher-forced validation loss:

| Model | Val @ 1000 | Val @ 2000 | Train runtime (mm:ss) | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23GB` |
| Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27GB` |

### `10K` train checkpoints

Rank-0 train snapshots:

| Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
| --- | ---: | ---: | ---: | ---: |
| Baseline loss | `0.0228` | `0.0376` | `0.0202` | `0.0141` |
| Baseline smoothed | `0.0476` | `0.0273` | `0.0226` | `0.0172` |
| Parallel loss | `0.0211` | `0.0368` | `0.0212` | `0.0140` |
| Parallel smoothed | `0.0461` | `0.0259` | `0.0225` | `0.0169` |

Structured source:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`

### `10K` teacher-forced validation

| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.061130` | `0.059715` | `-0.001415` |
| `2000` | `0.041595` | `0.039947` | `-0.001648` |
| `5000` | `0.027324` | `0.027340` | `+0.000016` |
| `10000` | `0.022345` | `0.022168` | `-0.000177` |

The longer run still shows only a very small gap: the final delta of `-0.000177` is under `1%` of the baseline loss, and the sign flips at `5000`. The models remain extremely close.
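
The delta column can be recomputed from the baseline and parallel columns, along with the gap relative to the baseline loss. This is a small arithmetic sketch over the table values; the relative-gap column is an addition for illustration and does not appear in the stored CSVs.

```python
# (baseline, parallel) teacher-forced validation loss per checkpoint, from the table.
VAL_LOSS = {
    1000: (0.061130, 0.059715),
    2000: (0.041595, 0.039947),
    5000: (0.027324, 0.027340),
    10000: (0.022345, 0.022168),
}


def deltas(table):
    """Return {step: (parallel - baseline, delta as a percent of baseline)}."""
    out = {}
    for step, (baseline, parallel) in table.items():
        delta = parallel - baseline
        out[step] = (round(delta, 6), round(100.0 * delta / baseline, 2))
    return out


result = deltas(VAL_LOSS)
for step, (delta, pct) in result.items():
    print(f"{step:>5}  delta={delta:+.6f}  relative={pct:+.2f}%")
```

A negative delta means the parallel model has the lower loss.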

### `10K` final arm/joint/gripper breakdown

Teacher-forced validation at `10000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.022345` | `0.022168` |
| Left arm loss | `0.029659` | `0.030184` |
| Right arm loss | `0.015031` | `0.014151` |
| Left joint loss | `0.031507` | `0.032356` |
| Left gripper loss | `0.016725` | `0.014984` |
| Right joint loss | `0.015776` | `0.014888` |
| Right gripper loss | `0.009818` | `0.008996` |
| Left/right imbalance | `0.034067` | `0.033825` |

Interpretation:

- the parallel model’s small final advantage comes from the right arm; its left-arm loss is slightly higher than the baseline’s
- the baseline is slightly better on left-arm joint loss
- the parallel model is slightly better on both grippers and on right-joint loss
- imbalance is nearly unchanged
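
The breakdown rows are mutually consistent if each arm loss is the mean over that arm's `8` active dims. The sketch below checks this for the baseline column. It assumes the gripper is the last dim of each arm block, and it does not reproduce the imbalance row because its exact definition is not stated here.

```python
def mean(values):
    return sum(values) / len(values)


def breakdown(per_dim_loss):
    """Split 32 per-dim losses (packed layout) into arm, joint, and gripper means."""
    left, right = per_dim_loss[0:8], per_dim_loss[16:24]
    return {
        "left_arm": mean(left),
        "right_arm": mean(right),
        "left_joint": mean(left[:7]),    # assumed: dims 0..6 are joints
        "left_gripper": left[7],         # assumed: dim 7 is the gripper
        "right_joint": mean(right[:7]),
        "right_gripper": right[7],
        "mean": mean(left + right),
    }


# Per-dim losses rebuilt from the baseline joint and gripper rows of the table.
per_dim = (
    [0.031507] * 7 + [0.016725] + [0.0] * 8
    + [0.015776] * 7 + [0.009818] + [0.0] * 8
)
stats = breakdown(per_dim)
print({name: round(value, 6) for name, value in stats.items()})
```

The recomputed left-arm, right-arm, and mean losses agree with the reported `0.029659`, `0.015031`, and `0.022345` to within rounding of the inputs.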

### `10K` sample-based eval

Final fixed-subset sample eval at `10000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| 4-step masked MAE | `0.029935` | `0.029277` |
| 10-step masked MAE | `0.030294` | `0.030241` |
| 4-step left/right imbalance MAE | `0.033733` | `0.031629` |
| 10-step left/right imbalance MAE | `0.034582` | `0.032456` |

Interpretation:

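- sample-based quality is effectively tied by the end: masked MAE differs by less than `0.001` at both step counts, with the parallel model slightly lower on imbalance MAE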
- the teacher-forced gap does not widen into a large inference-time separation

Structured sources:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`

### Runtime and memory

| Stage | Duration (h:mm:ss) |
| --- | ---: |
| Baseline train | `2:13:40` |
| Baseline eval sweep | `0:24:24` |
| Parallel train | `2:20:51` |
| Parallel eval sweep | `0:43:54` |
| Full `10K` pipeline | `5:48:33` |

Peak VRAM:

- baseline: `35.23GB`
- parallel: `35.27GB`

Reference:

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv`

## Dual-push `128` screening results

### Teacher-forced validation

| Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
| --- | ---: | ---: | ---: |
| `1000` | `0.095597` | `0.093704` | `-0.001893` |
| `2000` | `0.083194` | `0.082729` | `-0.000465` |
| `5000` | `0.055958` | `0.055242` | `-0.000716` |

The screening signal is small but consistently favors the packed parallel model, which has the lower validation loss at all three checkpoints.

### Dual-push arm breakdown

Teacher-forced validation at `5000`:

| Metric | Baseline | Parallel |
| --- | ---: | ---: |
| Mean val loss | `0.055958` | `0.055242` |
| Left arm loss | `0.017725` | `0.017044` |
| Right arm loss | `0.094191` | `0.093439` |
| Left joint loss | `0.017577` | `0.017052` |
| Left gripper loss | `0.018765` | `0.016992` |
| Right joint loss | `0.103576` | `0.102856` |
| Right gripper loss | `0.028502` | `0.027523` |
| Left/right imbalance | `0.080993` | `0.081011` |

Interpretation:

- the small parallel advantage is visible on both arms
- the right arm remains much harder than the left on this task
- left/right imbalance is essentially unchanged

### Dual-push sample-based eval

Fixed-subset sample eval:

| Checkpoint | Model | 4-step masked MAE | 10-step masked MAE |
| --- | --- | ---: | ---: |
| `1000` | baseline | `0.103199` | `0.108652` |
| `1000` | parallel | `0.101439` | `0.106874` |
| `2000` | baseline | `0.069732` | `0.074413` |
| `2000` | parallel | `0.069053` | `0.073501` |
| `5000` | baseline | `0.056830` | `0.058973` |
| `5000` | parallel | `0.054630` | `0.056627` |

Interpretation:

- the parallel model is also slightly better on fixed-subset inference-style eval
- unlike handover, where the teacher-forced gap closed at `5000`, the parallel advantage stays visible at `5K`
- the margin is still small enough that this remains a screening result, not a paper-final claim

### Dual-push runtime and memory

| Stage | Duration (h:mm:ss) |
| --- | ---: |
| Baseline train | `1:05:25` |
| Baseline eval sweep | `0:14:34` |
| Parallel train | `1:00:33` |
| Parallel eval sweep | `0:14:39` |
| Full dual-push pipeline | `2:35:11` |

Peak VRAM:

- baseline: `35.23GB`
- parallel: `35.27GB`
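
The four dual-push stage durations sum exactly to the reported full-pipeline time. A small sketch of that check, using only the values from the runtime table above:

```python
def to_seconds(hms):
    """Parse an h:mm:ss string into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def to_hms(total_seconds):
    """Format seconds as h:mm:ss."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# Stage durations from the dual-push runtime table.
STAGES = {
    "baseline_train": "1:05:25",
    "baseline_eval": "0:14:34",
    "parallel_train": "1:00:33",
    "parallel_eval": "0:14:39",
}

total = to_hms(sum(to_seconds(value) for value in STAGES.values()))
print(total)  # 2:35:11
```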

## Artifact locations

### `2K` bundle

- `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

### `10K` bundle

- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`

### Dual-push `5K` bundle

- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/train_loss_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/runtime_table.csv`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/commands_reproduce.sh`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/checkpoint_locations.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`

## Bottom line

The `10K` follow-up suggests the `2K` near-tie was not hiding a large later divergence.

- teacher-forced validation ends with a small parallel edge
- sample-based eval is essentially tied
- left/right imbalance does not materially change
- the main difference remains subtle rather than dramatic

The dual-push screening run adds a second signal:

- the packed parallel model is slightly better at `1K`, `2K`, and `5K`
- the same small advantage appears on both teacher-forced and sample-based eval
- the effect is still modest, but it is cleaner and more consistent than on handover

So the current repo state supports a narrow next-step conclusion: packed parallelization remains subtle on handover, but dual-push is a better candidate task for the next seed/scale confirmation.