lsnu committed on
Commit 4cc9180 · verified · 1 Parent(s): ab4982c

Upload 10k report docs

Files changed (2):
  1. README.md +74 -35
  2. REPORT.md +179 -258
README.md CHANGED
@@ -1,58 +1,97 @@
 # pi0.5 Packed Multi-Arm OpenPI Artifacts

- This repo packages a finished initial comparison between:

- 1. a packed single-head `pi0.5` baseline
- 2. a packed parallel-head `pi0.5` model with an exact packed warm-start from the single-head checkpoint

- The study was run from the checked-out `openpi/` tree on `4x H100 80GB` with `bfloat16`, `2000` optimizer steps per model, verbose startup/debug logging, fixed validation passes, and no raw data reconversion.

- ## Dataset and packing

 - Train repo: `lsnu/twin_handover_256_train`
 - Val repo: `lsnu/twin_handover_256_val`
- - Original TWIN layout: `[L8, R8]`
- - Packed model layout used for both models: `[L8, 0x8, R8, 0x8]`
- - Action-loss mask: active dims `[0:8]` and `[16:24]`, padded dims masked out
- - Public `16`-dim norm stats were reused; they were not recomputed

 ## Headline results

- | Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
 | --- | ---: | ---: | ---: | ---: |
- | Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23 GB` |
- | Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27 GB` |

- The two models tracked closely. In this short run, the packed parallel head finished with a small edge on validation loss while staying within the same memory envelope.

- ## Repo contents

 - `openpi/`
- - modified training/eval code
- - config and transform changes
- - copied norm-stats assets for the new packed configs
- - smoke and main-run checkpoints under `openpi/checkpoints/`
 - `artifacts/twin_handover_packed_parallelization_20260309/`
- - `bootstrap_checkpoints/`: single-head PyTorch bootstrap and exact packed parallel warm-start
- - `metrics/`: JSON and CSV summaries
- - `run_logs/`: smoke, train, eval, and follow-up logs
- - `sanity_checks/`: packed-batch inspection output
- - `environment/`: system, GPU, package, HF-tooling, and workspace snapshots
- - `repro/`: changed-file list, checkpoint locations, and rerun commands
 - `artifacts/pi05_base_params/`
- - staged base JAX parameter snapshot used for PyTorch conversion

- ## Key artifact paths

 - Full report: `REPORT.md`
- - Reproduction commands: `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
- - Metrics summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- - Train loss table: `artifacts/twin_handover_packed_parallelization_20260309/metrics/train_loss_table.csv`
- - Val loss table: `artifacts/twin_handover_packed_parallelization_20260309/metrics/val_loss_table.csv`
- - Environment snapshot: `artifacts/twin_handover_packed_parallelization_20260309/environment/`

- ## Notes

- - The packed parallel warm-start is exact by construction from the implemented slice/fuse mapping.
- - Weight loading on both main runs reported `missing=0` and `unexpected=0`.
- - The packaged tree intentionally records reproducibility snapshots instead of uploading transient cache state.
 # pi0.5 Packed Multi-Arm OpenPI Artifacts

+ This repo packages the full local artifact set for the TWIN handover packed-action-head study on `pi0.5`, including:

+ - all finished checkpoints under `openpi/checkpoints/`
+ - the modified `openpi/` training and evaluation code
+ - train/eval logs and structured metric tables
+ - reproducibility manifests and environment snapshots

+ Two runs are included:

+ 1. an initial `2K` baseline-vs-parallel comparison
+ 2. a longer `10K` follow-up on the same packed setup
+
+ ## Experiment setup

 - Train repo: `lsnu/twin_handover_256_train`
 - Val repo: `lsnu/twin_handover_256_val`
+ - Hardware: `4x H100 80GB`
+ - Precision: `bfloat16`
+ - Semantic packed layout: `[L8, 0x8, R8, 0x8]`
+ - Active action-loss dims: `[0:8]` and `[16:24]`
+ - Masked padded dims: `[8:16]` and `[24:32]`
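The action-loss mask implied by the last three bullets can be sketched directly. A minimal numpy illustration (the `masked_action_loss` helper and its squared-error reduction are illustrative stand-ins, not the repo's actual training loss):

```python
import numpy as np

# Packed layout [L8, 0x8, R8, 0x8]: only the real arm dims carry loss.
mask = np.zeros(32)
mask[0:8] = 1.0    # left arm: active
mask[16:24] = 1.0  # right arm: active
# dims [8:16] and [24:32] stay 0.0: padded, masked out

def masked_action_loss(pred, target):
    """Mean squared error over active dims only (pred/target: batch, horizon, 32)."""
    sq_err = ((pred - target) ** 2) * mask
    return sq_err.sum() / (mask.sum() * pred.shape[0] * pred.shape[1])
```

Because padded entries are multiplied by `0.0`, arbitrary values in dims `[8:16]` and `[24:32]` cannot move the loss.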
 ## Headline results

+ Teacher-forced masked validation loss:
+
+ | Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
+ | --- | ---: | ---: | ---: | ---: | ---: |
+ | Packed baseline | `0.035776` | `0.061130` | `0.041595` | `0.027324` | `0.022345` |
+ | Packed parallel | `0.035680` | `0.059715` | `0.039947` | `0.027340` | `0.022168` |
+
+ Sample-based eval on the fixed `10K` final validation subset:
+
+ | Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
 | --- | ---: | ---: | ---: | ---: |
+ | Packed baseline | `0.029935` | `0.030294` | `2:13:40` | `35.23 GB` |
+ | Packed parallel | `0.029277` | `0.030241` | `2:20:51` | `35.27 GB` |
+
+ The long run still shows a very small parallel edge on teacher-forced validation loss by `10K`, while the sample-based eval is essentially a tie.
+
+ ## Warm-start note

+ The packed parallel warm-start uses the slice/fuse mapping implemented in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`, but the added step-0 numerical check shows it is not exactly identical end-to-end on a real batch:

+ - `input_projection_max_abs_diff = 0.00122881`
+ - `masked_loss_abs_diff = 0.00398052`
+ - `warmstart_equivalent = False`
+
+ This repo should therefore be read as a matched warm-start study, not as a bitwise-identical step-0 control.
+
+ ## Repo layout

 - `openpi/`
+ - modified source and scripts used for training/eval
+ - copied norm-stats assets for the packed configs
+ - full `2K` and `10K` checkpoint trees
 - `artifacts/twin_handover_packed_parallelization_20260309/`
+ - initial `2K` study bundle
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/`
+ - `10K` follow-up bundle with metrics, logs, repro manifests, and environment snapshot
 - `artifacts/pi05_base_params/`
+ - staged base parameter snapshot used during JAX-to-PyTorch conversion

+ ## Key files

 - Full report: `REPORT.md`
+ - `2K` summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
+ - `10K` summary: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
+ - `10K` comparison table: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
+ - `10K` repro commands: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
+ - `10K` changed-file manifest: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
+ - `10K` environment snapshot: `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
+
+ ## Main changed files
+
+ The initial `2K` and `10K` study logic lives primarily in:
+
+ - `openpi/src/openpi/transforms.py`
+ - `openpi/src/openpi/training/config.py`
+ - `openpi/src/openpi/training/data_loader.py`
+ - `openpi/src/openpi/models/model.py`
+ - `openpi/src/openpi/models/tokenizer.py`
+ - `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
+ - `openpi/scripts/train_pytorch.py`
+ - `openpi/scripts/eval_twin_val_loss_pytorch.py`
+ - `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
+ - `openpi/scripts/inspect_twin_packed_batch.py`
+ - `openpi/scripts/check_parallel_warmstart_equivalence.py`
+ - `openpi/scripts/run_twin_handover_packed_followup.sh`
+ - `openpi/scripts/run_twin_handover_packed_10k.sh`

+ The per-file rationale is recorded in:

+ - `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
REPORT.md CHANGED
@@ -1,347 +1,268 @@
 # Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover

- ## Objective

- Run the minimum scientifically meaningful comparison between:

- 1. a packed single-head `pi0.5` baseline
- 2. a packed parallel-head `pi0.5` model

- Both models were fine-tuned on the same converted public TWIN handover dataset with the same training schedule:

- - train: `lsnu/twin_handover_256_train`
- - val: `lsnu/twin_handover_256_val`
- - hardware: `4x H100 80GB`
- - precision: `bfloat16`
- - global batch size: `16`
- - optimizer steps per model: `2000`
- - save interval: `250`
- - log interval: `10`

- ## Data layout and packing

- The TWIN converted state/action layout is `16` dims in `[L8, R8]`, where each arm is `7` joints plus gripper. The generic `pi0.5` path right-pads to `32` dims, which does not preserve a semantic left/right split for a naive parallel-head setup.

- To keep the experiment minimal and still semantically correct:
-
- - existing public `16`-dim norm stats were reused
- - semantic packing happened after normalization in model transforms
- - both models consumed the same packed `32`-dim layout:

 ```text
 [L8, R8] -> [L8, 0x8, R8, 0x8]
 ```

- - the action loss was masked so only the real arm dims contributed:

- ```text
- active dims: [0:8] and [16:24]
- masked dims: [8:16] and [24:32]
- ```

- The packed-batch sanity check confirmed exact zero padding:

- - `state_padded_zero_count: 16 / 16`
- - `actions_padded_zero_count: 256 / 256`
- - `state_padded_exact_zero: True`
- - `actions_padded_exact_zero: True`

- Reference log:

- - `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`

- ## Code changes tied to files

- The experiment-specific changes are summarized below.

- - `openpi/src/openpi/transforms.py`
- - added `PackPerArmBlocks` and `UnpackPerArmBlocks` for semantic TWIN packed training
 - `openpi/src/openpi/training/config.py`
- - added packed TWIN model-transform path
- - added `action_loss_mask`
- - added `pi05_twin_handover_256_packed_baseline_pytorch_2k`
- - added `pi05_twin_handover_256_packed_parallel_pytorch_2k`
- - `openpi/src/openpi/training/data_loader.py`
- - added `set_epoch`
- - improved local dataset mirror handling and loader startup behavior
- - `openpi/src/openpi/models/model.py`
- - made `pi0_pytorch` import lazy
- - `openpi/src/openpi/models/tokenizer.py`
- - made `AutoProcessor` import lazy
- - `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
- - disabled unconditional `sample_actions` `torch.compile` by default
 - `openpi/scripts/train_pytorch.py`
- - added startup prints
- - added masked action-loss reduction
- - added first-steps debug prints and periodic runtime/memory logging
- - hardened DDP/checkpoint startup
 - `openpi/scripts/eval_twin_val_loss_pytorch.py`
- - added masked validation-loss evaluation with fixed-batch execution
- - `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
- - added exact packed parallel warm-start initialization
- - `openpi/scripts/inspect_twin_packed_batch.py`
- - added packed-batch inspection and zero-padding verification
- - `openpi/scripts/run_twin_handover_packed_followup.sh`
- - added detached follow-up automation for the remaining train/eval stages
- - `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_2k/lsnu/twin_handover_256_train/norm_stats.json`
- - copied the existing handover train norm stats for the packed baseline config
- - `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_2k/lsnu/twin_handover_256_train/norm_stats.json`
- - copied the existing handover train norm stats for the packed parallel config

- Reference file list:

- - `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

- ## Commands run

- The exact rerun command list is saved in:

- - `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`

- The executed flow was:

- 1. packed-batch inspection
- 2. base `pi0.5` JAX-to-PyTorch conversion
- 3. exact packed parallel warm-start initialization from the single-head PyTorch checkpoint
- 4. packed baseline training for `2000` steps
- 5. baseline val at `1000`
- 6. baseline val at `2000`
- 7. packed parallel training for `2000` steps
- 8. parallel val at `1000`
- 9. parallel val at `2000`

- The parallel training and its validation passes were chained through a detached follow-up runner.

- Reference logs:

- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/twin_handover_followup.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_baseline_2k.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_parallel_2k.log`

- ## Startup sanity checks

- ### Norm stats

- The copied norm-stats files were loaded successfully and reported:

- - keys: `['actions', 'state']`
- - `state_mean_len=16`
- - `state_std_len=16`
- - `actions_mean_len=16`
- - `actions_std_len=16`

- Reference:

- - `artifacts/twin_handover_packed_parallelization_20260309/metrics/norm_stats_verification.txt`

- ### Baseline startup summary

- Rank-0 startup logging for the packed baseline recorded:

- ```text
- Resolved config name: pi05_twin_handover_256_packed_baseline_pytorch_2k
- Dataset repo_id: lsnu/twin_handover_256_train
- Norm-stats summary: {'keys': ['actions', 'state'], 'state_mean_len': 16, 'state_std_len': 16, 'actions_mean_len': 16, 'actions_std_len': 16}
- Checkpoint source path: /workspace/checkpoints/pi05_base_single_pytorch
- Model type: baseline
- Packed transforms active: True
- Batch size: local=4, global=16
- Action-loss mask: (1.0 x8, 0.0 x8, 1.0 x8, 0.0 x8)
- Weight loading missing key count: 0
- Weight loading unexpected key count: 0
- ```

- The first debug steps also showed:

- - `observation.state shape=(4, 32)`
- - `actions shape=(4, 16, 32)`
- - `state_nonzero_counts_8d_blocks=[32, 0, 32, 0]`
- - `action_nonzero_counts_8d_blocks=[512, 0, 512, 0]`
- - masked padded dims stayed exactly zero in the batch

- ### Parallel startup summary

- Rank-0 startup logging for the packed parallel run recorded:

- ```text
- Resolved config name: pi05_twin_handover_256_packed_parallel_pytorch_2k
- Dataset repo_id: lsnu/twin_handover_256_train
- Norm-stats summary: {'keys': ['actions', 'state'], 'state_mean_len': 16, 'state_std_len': 16, 'actions_mean_len': 16, 'actions_std_len': 16}
- Checkpoint source path: /workspace/checkpoints/pi05_base_parallel_packed_from_single
- Model type: parallel
- Packed transforms active: True
- Batch size: local=4, global=16
- Action-loss mask: (1.0 x8, 0.0 x8, 1.0 x8, 0.0 x8)
- Weight loading missing key count: 0
- Weight loading unexpected key count: 0
- ```

- The first debug steps matched the expected packed layout:

- - `observation.state shape=(4, 32)`
- - `actions shape=(4, 16, 32)`
- - `state_nonzero_counts_8d_blocks=[32, 0, 32, 0]`
- - `action_nonzero_counts_8d_blocks=[512, 0, 512, 0]`

- ### Smoke tests

- All required smoke tests passed before the main runs:

- 1. `debug_pi05_multiarm_pytorch_smoke`
- 2. packed-batch inspection on `lsnu/twin_handover_256_train`
- 3. packed baseline TWIN smoke on `4` GPUs for `20` steps
- 4. packed parallel TWIN smoke on `4` GPUs for `20` steps

- Smoke logs are stored in:

- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/smoke_handover_packed_baseline_20k.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/smoke_handover_packed_baseline_20l.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/smoke_handover_packed_parallel_20a.log`

- ## Warm-start note

- The packed parallel warm-start was implemented as an exact slice/fuse mapping from the single-head PyTorch checkpoint:

- - input side: split single-head input projection by packed arm blocks
- - fuse side: initialize `arm_token_fuse.weight` as `[I I]`
- - output side: split single-head output projection rows by packed arm blocks

- This was exact by construction for the implemented mapping and both the warm-start checkpoint creation and main-run loading succeeded without missing or unexpected keys.

- What was not done:

- - no separate numerical equivalence test was run that compared step-0 forward outputs between the single-head and parallel-head models on the same batch

- Bootstrap checkpoints:

- - `/workspace/checkpoints/pi05_base_single_pytorch`
- - `/workspace/checkpoints/pi05_base_parallel_packed_from_single`

- Copies are also staged under:

- - `artifacts/twin_handover_packed_parallelization_20260309/bootstrap_checkpoints/`

- ## Results

- ### Training loss snapshots

- | Model | Step 250 | Step 500 | Step 1000 | Step 1500 | Step 2000 |
- | --- | ---: | ---: | ---: | ---: | ---: |
- | Baseline loss | `0.1975` | `0.0606` | `0.0245` | `0.0155` | `0.0391` |
- | Baseline smoothed | `0.1166` | `0.0554` | `0.0387` | `0.0331` | `0.0278` |
- | Parallel loss | `0.1894` | `0.0633` | `0.0214` | `0.0155` | `0.0326` |
- | Parallel smoothed | `0.1153` | `0.0565` | `0.0392` | `0.0331` | `0.0270` |

- ### Validation loss

- | Model | Checkpoint | Batches | Mean val loss | Std val loss |
- | --- | ---: | ---: | ---: | ---: |
- | Baseline | `1000` | `50` | `0.052885` | `0.032533` |
- | Baseline | `2000` | `100` | `0.035776` | `0.027648` |
- | Parallel | `1000` | `50` | `0.051214` | `0.028985` |
- | Parallel | `2000` | `100` | `0.035680` | `0.026077` |

- ### Runtime and memory

- | Item | Value |
- | --- | --- |
- | Pipeline wallclock from baseline launch to final val | `01:32:29` |
- | Detached follow-up runner wallclock | `01:17:47` |
- | Baseline train runtime | `33:27` |
- | Parallel train runtime | `30:38` |
- | Baseline val @ 1000 | `00:05:14` |
- | Baseline val @ 2000 | `00:05:19` |
- | Parallel val @ 1000 | `00:03:23` |
- | Parallel val @ 2000 | `00:03:33` |
- | Peak baseline VRAM | `35.23 GB` |
- | Peak parallel VRAM | `35.27 GB` |

- ### Interpretation

- For this short `2000`-step TWIN handover run, the packed baseline and packed parallel-head models behaved very similarly. The packed parallel-head model ended slightly lower on both validation checkpoints while staying in the same memory range and training cleanly under the same schedule.

- This should be treated as an initial profiling run, not a final benchmark claim.

- Reference metrics:

- - `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- - `artifacts/twin_handover_packed_parallelization_20260309/metrics/train_loss_table.csv`
- - `artifacts/twin_handover_packed_parallelization_20260309/metrics/val_loss_table.csv`

- ## Checkpoints and logs

- ### Main-run checkpoints

- - Baseline step `1000`:
- - `/workspace/openpi/checkpoints/pi05_twin_handover_256_packed_baseline_pytorch_2k/handover_packed_baseline_2k/1000`
- - Baseline step `2000`:
- - `/workspace/openpi/checkpoints/pi05_twin_handover_256_packed_baseline_pytorch_2k/handover_packed_baseline_2k/2000`
- - Parallel step `1000`:
- - `/workspace/openpi/checkpoints/pi05_twin_handover_256_packed_parallel_pytorch_2k/handover_packed_parallel_2k/1000`
- - Parallel step `2000`:
- - `/workspace/openpi/checkpoints/pi05_twin_handover_256_packed_parallel_pytorch_2k/handover_packed_parallel_2k/2000`

- The full checkpoint trees, including smoke checkpoints and intermediate saves every `250` steps, are under:

- - `openpi/checkpoints/pi05_twin_handover_256_packed_baseline_pytorch_2k/`
- - `openpi/checkpoints/pi05_twin_handover_256_packed_parallel_pytorch_2k/`

- ### Bootstrap checkpoints

- - `artifacts/twin_handover_packed_parallelization_20260309/bootstrap_checkpoints/pi05_base_single_pytorch/`
- - `artifacts/twin_handover_packed_parallelization_20260309/bootstrap_checkpoints/pi05_base_parallel_packed_from_single/`

- ### Logs

- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_baseline_2k.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_baseline_2k_val_1000.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_baseline_2k_val_2000.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_parallel_2k.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_parallel_2k_val_1000.log`
- - `artifacts/twin_handover_packed_parallelization_20260309/run_logs/handover_packed_parallel_2k_val_2000.log`

- ## Environment and provenance snapshot

- Environment snapshots are stored in:

- - `artifacts/twin_handover_packed_parallelization_20260309/environment/system_info.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/gpu_info.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/python_env.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/pip_freeze.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/hf_env.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/selected_env_vars.json`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/workspace_snapshot.txt`
- - `artifacts/twin_handover_packed_parallelization_20260309/environment/openpi_source_snapshot.txt`

- OpenPI source provenance:

- - packaged `openpi/` tree does not contain a live `.git` directory
- - source clone snapshot recorded in `openpi_source_snapshot.txt`
- - source commit: `aa91438c0c130dcef4ccf378a56f4cf4cffc1310`

- ## Acceptance criteria status

- 1. Packed-batch inspection showed raw `16`-dim `[L8, R8]` and packed `32`-dim `[L8, 0x8, R8, 0x8]`: `PASS`
- 2. Both smoke tests passed on `4` GPUs with finite loss: `PASS`
- 3. Baseline run started from `/workspace/checkpoints/pi05_base_single_pytorch`: `PASS`
- 4. Parallel run started from `/workspace/checkpoints/pi05_base_parallel_packed_from_single`: `PASS`
- 5. Masked loss was active and padded dims were excluded: `PASS`
- 6. DDP ran without shape/key mismatches: `PASS`
- 7. Quick val was run at step `1000` for both models: `PASS`
- 8. Final val was run at step `2000` for both models: `PASS`
- 9. Both main runs finished under the `10`-hour cap: `PASS`
- 10. Final bundle includes code, checkpoints, logs, metrics, and environment snapshot: `PASS`

- ## Final inventory

- The artifact bundle at repo root contains:

- - all modified training/eval code under `openpi/`
- - all baseline and parallel checkpoints under `openpi/checkpoints/`
- - both bootstrap checkpoints under `artifacts/.../bootstrap_checkpoints/`
- - all train/eval/smoke logs under `artifacts/.../run_logs/`
- - metrics tables and summary JSON under `artifacts/.../metrics/`
- - reproducibility files under `artifacts/.../repro/`
- - environment and provenance snapshot under `artifacts/.../environment/`

- This is a complete rerunnable package for the initial TWIN handover packed action-head parallelization study.
 # Report: pi0.5 Packed Action-Head Parallelization on TWIN Handover

+ ## Scope

+ This repo now contains two completed studies on the same packed TWIN handover setup:

+ 1. the initial `2K` baseline-vs-parallel comparison
+ 2. the longer `10K` follow-up with richer diagnostics

+ Both runs used:

+ - train repo `lsnu/twin_handover_256_train`
+ - val repo `lsnu/twin_handover_256_val`
+ - `4x H100 80GB`
+ - `bfloat16`
+ - packed semantic layout `[L8, 0x8, R8, 0x8]`
+ - active action-loss dims `[0:8]` and `[16:24]`
+ - masked dims `[8:16]` and `[24:32]`

+ Existing public `16`-dim norm stats were reused. No raw-data reconversion was done.

+ ## Data packing and masking

+ The TWIN converted state/action layout is `[L8, R8]`, where each arm is `7` joints plus gripper. The packed transform path added for these runs preserves the left/right semantics inside a `32`-dim model input:

 ```text
 [L8, R8] -> [L8, 0x8, R8, 0x8]
 ```

+ The batch inspection confirmed exact zero padding in the masked blocks. Reference logs:

+ - `artifacts/twin_handover_packed_parallelization_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/inspect_twin_packed_batch_handover_train.log`
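The packing step can be sketched in a few lines of numpy; this is an illustration of the mapping, not the repo's actual `PackPerArmBlocks` implementation:

```python
import numpy as np

def pack_per_arm(x16):
    """Map [L8, R8] (16 dims) to [L8, 0x8, R8, 0x8] (32 dims)."""
    packed = np.zeros(x16.shape[:-1] + (32,), dtype=x16.dtype)
    packed[..., 0:8] = x16[..., 0:8]     # left arm: 7 joints + gripper
    packed[..., 16:24] = x16[..., 8:16]  # right arm: 7 joints + gripper
    return packed                        # blocks [8:16] and [24:32] stay exact zeros
```

Keeping each arm at a fixed offset is what lets a parallel head consume a well-defined per-arm block, and the exact zeros in the padded blocks are what the inspection logs verify.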
+ ## Files changed or created

+ ### Initial `2K` study

+ The initial study added the packed TWIN path, masked loss, warm-start init, inspection script, and detached `2K` runner plumbing. The exact per-file list is in:

+ - `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

+ ### `10K` follow-up additions

+ The follow-up changed or added:

 - `openpi/src/openpi/training/config.py`
+ - added `pi05_twin_handover_256_packed_baseline_pytorch_10k`
+ - added `pi05_twin_handover_256_packed_parallel_pytorch_10k`
 - `openpi/scripts/train_pytorch.py`
+ - added periodic per-module gradient buckets for baseline and parallel models
 - `openpi/scripts/eval_twin_val_loss_pytorch.py`
+ - added left/right arm losses
+ - added joint vs gripper losses
+ - added left/right imbalance
+ - added a small deterministic `sample_actions` eval at `num_steps={4,10}`
+ - `openpi/scripts/check_parallel_warmstart_equivalence.py`
+ - added step-0 baseline-vs-parallel numerical check
+ - `openpi/scripts/run_twin_handover_packed_10k.sh`
+ - added detached `10K` train/eval chain
+ - `openpi/assets/pi05_twin_handover_256_packed_baseline_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
+ - copied existing norm stats for the `10K` baseline config
+ - `openpi/assets/pi05_twin_handover_256_packed_parallel_pytorch_10k/lsnu/twin_handover_256_train/norm_stats.json`
+ - copied existing norm stats for the `10K` parallel config
+ - `README.md`
+ - updated repo landing page to cover both studies
+ - `REPORT.md`
+ - updated full report to cover both studies

+ The exact `10K` changed-file manifest is in:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`

+ ## Commands and run flow

+ The exact `10K` rerun commands are stored in:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`

+ The `10K` execution flow was:

+ 1. run the warm-start equivalence check
+ 2. baseline `10K` train
+ 3. baseline evals at `1000`, `2000`, `5000`, `10000`
+ 4. parallel `10K` train
+ 5. parallel evals at `1000`, `2000`, `5000`, `10000`

+ The detached runner was:

+ - `openpi/scripts/run_twin_handover_packed_10k.sh`
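The chain itself is just sequential stages, each blocking until the previous one finishes. A hypothetical python sketch of that shape (stage names and the `echo` placeholder commands are illustrative; the real commands are in `run_twin_handover_packed_10k.sh` and `repro/commands_reproduce.sh`):

```python
import subprocess

# Placeholder stages mirroring the flow above: train, then evals per checkpoint.
STAGES = [("baseline_train", ["echo", "train baseline 10k"])]
STAGES += [(f"baseline_eval_{s}", ["echo", f"eval @{s}"]) for s in (1000, 2000, 5000, 10000)]

def run_chain(stages):
    """Run stages in order, stopping at the first failure."""
    codes = {}
    for name, cmd in stages:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        codes[name] = proc.returncode
        if proc.returncode != 0:
            break  # do not start evals if training died
    return codes
```

In practice the real runner is launched detached (e.g. via `nohup ... &`) so the multi-hour pipeline survives the shell session; only the chaining logic is shown here.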
+ Main logs:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_10k_followup.log`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_baseline_10k.log`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/run_logs/handover_packed_parallel_10k.log`

+ ## Startup sanity checks

+ Both `10K` runs loaded cleanly with:

+ - packed transforms active
+ - correct `32`-dim packed state/action tensors
+ - mask active on `[0:8]` and `[16:24]`
+ - exact zeros preserved in masked padded blocks
+ - `missing=0`
+ - `unexpected=0`

+ Reference startup summary:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/startup_summaries.txt`

+ Checkpoint sources:

+ - baseline: `/workspace/checkpoints/pi05_base_single_pytorch`
+ - parallel: `/workspace/checkpoints/pi05_base_parallel_packed_from_single`

+ ## Warm-start equivalence check

+ The `10K` study added an explicit step-0 numerical check:

+ - `input_projection_max_abs_diff = 0.00122881`
+ - `input_projection_mean_abs_diff = 0.00015435`
+ - `baseline_masked_loss = 1.00531137`
+ - `parallel_masked_loss = 1.00929189`
+ - `masked_loss_abs_diff = 0.00398052`
+ - `warmstart_equivalent = False`

+ Reference:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/sanity_checks/warmstart_equivalence_10k.log`

+ Interpretation:

+ - the slice/fuse initialization is aligned by construction
+ - it is not numerically exact end-to-end on the same batch
+ - this weakens a strict “identical function at step 0” claim
+ - it does not invalidate the comparison as a matched warm-start study
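The "aligned by construction" point can be made concrete for the purely linear part of the mapping: slicing a single input projection by packed arm blocks and summing the per-arm results via an `[I I]` fuse reproduces the single-head linear map exactly. A numpy sketch (matrix names like `w_single` are hypothetical; the real mapping lives in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
w_single = rng.normal(size=(d_model, 32))  # single-head input projection over packed dims

x = rng.normal(size=32)  # packed vector [L8, 0x8, R8, 0x8]
x[8:16] = 0.0
x[24:32] = 0.0

# Slice the projection by packed arm blocks...
w_left, w_right = w_single[:, 0:16], w_single[:, 16:32]
# ...and fuse the per-arm tokens with an [I I] weight, i.e. an elementwise sum.
fused = w_left @ x[0:16] + w_right @ x[16:32]

assert np.allclose(fused, w_single @ x)  # exact for the isolated linear map
```

This is why the initialization is aligned by construction for the linear pieces; the recorded nonzero diffs indicate that once reduced-precision arithmetic and the surrounding network are involved, end-to-end equivalence on a real batch does not follow.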
+ ## Results

+ ### Initial `2K` study

+ Teacher-forced validation loss:

+ | Model | Val @ 1000 | Val @ 2000 | Train runtime | Peak VRAM |
+ | --- | ---: | ---: | ---: | ---: |
+ | Packed baseline | `0.052885` | `0.035776` | `33:27` | `35.23 GB` |
+ | Packed parallel | `0.051214` | `0.035680` | `30:38` | `35.27 GB` |

+ ### `10K` train loss snapshots

+ Rank-0 train loss at checkpoint steps:

+ | Model | Step 1000 | Step 2000 | Step 5000 | Step 10000 |
+ | --- | ---: | ---: | ---: | ---: |
+ | Baseline loss | `0.0228` | `0.0376` | `0.0202` | `0.0141` |
+ | Baseline smoothed | `0.0476` | `0.0273` | `0.0226` | `0.0172` |
+ | Parallel loss | `0.0211` | `0.0368` | `0.0212` | `0.0140` |
+ | Parallel smoothed | `0.0461` | `0.0259` | `0.0225` | `0.0169` |

+ Structured source:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`

+ ### `10K` teacher-forced validation

+ | Checkpoint | Baseline | Parallel | Delta (parallel - baseline) |
+ | --- | ---: | ---: | ---: |
+ | `1000` | `0.061130` | `0.059715` | `-0.001415` |
+ | `2000` | `0.041595` | `0.039947` | `-0.001648` |
+ | `5000` | `0.027324` | `0.027340` | `+0.000016` |
+ | `10000` | `0.022345` | `0.022168` | `-0.000177` |

+ The longer run still shows only a very small gap; the models remain extremely close.

+ ### `10K` final arm/joint/gripper breakdown

+ Teacher-forced validation at `10000`:

+ | Metric | Baseline | Parallel |
+ | --- | ---: | ---: |
+ | Mean val loss | `0.022345` | `0.022168` |
+ | Left arm loss | `0.029659` | `0.030184` |
+ | Right arm loss | `0.015031` | `0.014151` |
+ | Left joint loss | `0.031507` | `0.032356` |
+ | Left gripper loss | `0.016725` | `0.014984` |
+ | Right joint loss | `0.015776` | `0.014888` |
+ | Right gripper loss | `0.009818` | `0.008996` |
+ | Left/right imbalance | `0.034067` | `0.033825` |

+ Interpretation:

+ - the parallel model’s small final advantage is mostly on the right-arm side
+ - the baseline is slightly better on left-arm joint loss
+ - the parallel model is slightly better on both grippers and on right-joint loss
+ - imbalance is nearly unchanged
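The breakdown rows map onto fixed slices of the packed layout. A numpy sketch of how such per-block metrics can be computed (block names and the MAE reduction are illustrative; the repo's actual metric definitions are in `openpi/scripts/eval_twin_val_loss_pytorch.py`):

```python
import numpy as np

# Fixed semantic slices of the packed 32-dim layout [L8, 0x8, R8, 0x8];
# each arm block is 7 joint dims followed by 1 gripper dim.
BLOCKS = {
    "left_joints": slice(0, 7),
    "left_gripper": slice(7, 8),
    "right_joints": slice(16, 23),
    "right_gripper": slice(23, 24),
}

def per_block_mae(pred, target):
    """Mean absolute error per semantic block (pred/target: batch, horizon, 32)."""
    return {name: float(np.abs(pred[..., s] - target[..., s]).mean())
            for name, s in BLOCKS.items()}
```

Aggregates like "left arm loss" follow by combining the joint and gripper slices of one arm; the exact left/right imbalance definition used in the tables is recorded in the eval script and is not reproduced here.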
+ ### `10K` sample-based eval

+ Final fixed-subset sample eval at `10000`:

+ | Metric | Baseline | Parallel |
+ | --- | ---: | ---: |
+ | 4-step masked MAE | `0.029935` | `0.029277` |
+ | 10-step masked MAE | `0.030294` | `0.030241` |
+ | 4-step left/right imbalance MAE | `0.033733` | `0.031629` |
+ | 10-step left/right imbalance MAE | `0.034582` | `0.032456` |

+ Interpretation:

+ - sample-based quality is effectively tied by the end
+ - the teacher-forced gap does not widen into a large inference-time separation

+ Structured sources:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`

+ ### Runtime and memory

+ | Stage | Duration |
+ | --- | ---: |
+ | Baseline train | `2:13:40` |
+ | Baseline eval sweep | `0:24:24` |
+ | Parallel train | `2:20:51` |
+ | Parallel eval sweep | `0:43:54` |
+ | Full `10K` pipeline | `5:48:33` |

+ Peak VRAM:

+ - baseline: `35.23 GB`
+ - parallel: `35.27 GB`

+ Reference:

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/runtime_table.csv`

+ ## Artifact locations

+ ### `2K` bundle

+ - `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
+ - `artifacts/twin_handover_packed_parallelization_20260309/repro/commands_reproduce.sh`
+ - `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`

+ ### `10K` bundle

+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/train_loss_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/teacher_forced_eval_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/sample_eval_table.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
+ - `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`

+ ## Bottom line

+ The `10K` follow-up suggests the `2K` near-tie was not hiding a large later divergence.

+ - teacher-forced validation ends with a small parallel edge
+ - sample-based eval is essentially tied
+ - left/right imbalance does not materially change
+ - the main difference remains subtle rather than dramatic

+ Overall, the packed parallel head looks competitive and slightly favorable on the masked teacher-forced objective, but the current evidence does not show a large practical separation at inference time on this task.