# pi0.5 Packed Multi-Arm OpenPI Artifacts
This repo packages the full local artifact set for packed-action-head studies on `pi0.5` across TWIN handover and TWIN dual-push, including:
- all finished checkpoints under `openpi/checkpoints/`
- the modified `openpi/` training and evaluation code
- train/eval logs and structured metric tables
- reproducibility manifests and environment snapshots
Four runs are included:
1. an initial `2K` baseline-vs-parallel comparison
2. a longer `10K` follow-up on the same packed setup
3. a `5K` dual-push `128` screening study on the same packed path
4. a `2K` dual-push `128` four-way step comparison across `shared`, `head_only_parallel`, `split_independent`, and `split_communicating`
This update also adds a split-action-expert bring-up bundle for the packed TWIN path, covering:
- exact single-to-split warm-start checkpoints for `split_independent` and `split_communicating`
- invariant checks for the new split architecture
- detached real-data smoke and `20`-step training runs on `lsnu/twin_dual_push_128_train`
- the code changes that introduce the new split-expert action path
## Experiment setup
- Handover train/val: `lsnu/twin_handover_256_train`, `lsnu/twin_handover_256_val`
- Dual-push train/val: `lsnu/twin_dual_push_128_train`, `lsnu/twin_dual_push_128_val`
- Hardware: `4x H100 80GB`
- Precision: `bfloat16`
- Semantic packed layout: `[L8, 0x8, R8, 0x8]`
- Active action-loss dims: `[0:8]` and `[16:24]`
- Masked padded dims: `[8:16]` and `[24:32]`
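The masked-loss bookkeeping implied by this layout can be sketched as follows. This is a minimal illustration only; the helper names are hypothetical and not the repo's actual API:

```python
import numpy as np

# Packed layout [L8, 0x8, R8, 0x8]: 32 action dims, with loss computed
# only on the active left [0:8] and right [16:24] slices.
def packed_action_loss_mask(dim: int = 32) -> np.ndarray:
    mask = np.zeros(dim, dtype=bool)
    mask[0:8] = True    # left-arm action dims (active)
    mask[16:24] = True  # right-arm action dims (active)
    return mask         # [8:16] and [24:32] stay masked (padding)

def masked_mae(pred: np.ndarray, target: np.ndarray, mask: np.ndarray) -> float:
    # Mean absolute error over active dims only, in the spirit of the
    # "masked MAE" metrics reported below.
    err = np.abs(pred - target)[..., mask]
    return float(err.mean())
```

The same mask is what keeps the padded dims from contributing to either the training loss or the reported validation numbers.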
## Headline results
Teacher-forced masked validation loss:
| Model | 2K @ final | 10K @ 1K | 10K @ 2K | 10K @ 5K | 10K @ 10K |
| --- | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.035776` | `0.061130` | `0.041595` | `0.027324` | `0.022345` |
| Packed parallel | `0.035680` | `0.059715` | `0.039947` | `0.027340` | `0.022168` |
Sample-based eval on the fixed `10K` final validation subset:
| Model | 4-step masked MAE | 10-step masked MAE | Train runtime | Peak VRAM |
| --- | ---: | ---: | ---: | ---: |
| Packed baseline | `0.029935` | `0.030294` | `2:13:40` | `35.23GB` |
| Packed parallel | `0.029277` | `0.030241` | `2:20:51` | `35.27GB` |
The long run still shows a very small parallel edge on teacher-forced validation loss by `10K`, while the sample-based eval is essentially a tie.
Dual-push `128` screening results:
| Model | 1K val loss | 2K val loss | 5K val loss | 5K 4-step MAE | 5K 10-step MAE | Train runtime |
| --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Packed baseline | `0.095597` | `0.083194` | `0.055958` | `0.056830` | `0.058973` | `1:05:25` |
| Packed parallel | `0.093704` | `0.082729` | `0.055242` | `0.054630` | `0.056627` | `1:00:33` |
The dual-push screening run shows a small but consistent parallel edge at `1K`, `2K`, and `5K` on both teacher-forced validation loss and fixed-subset sample MAE.
Dual-push `128` four-way `2K` step comparison raw results:
Step-0 teacher-forced masked validation loss:
| Model | Step-0 val loss | Step-0 left/right imbalance |
| --- | ---: | ---: |
| Shared | `1.084735` | `0.505345` |
| Head-only parallel | `1.082985` | `0.501182` |
| Split independent | `1.328262` | `0.448843` |
| Split communicating | `1.783048` | `0.671085` |
Step-2000 teacher-forced masked validation loss:
| Model | Step-2000 val loss | Step-2000 left/right imbalance |
| --- | ---: | ---: |
| Shared | `0.055329` | `0.069564` |
| Head-only parallel | `0.055297` | `0.069380` |
| Split independent | `0.063537` | `0.092029` |
| Split communicating | `0.059952` | `0.080435` |
Step-2000 sample masked MAE:
| Model | 1-step MAE | 4-step MAE | 16-step MAE |
| --- | ---: | ---: | ---: |
| Shared | `0.087330` | `0.078164` | `0.085222` |
| Head-only parallel | `0.086764` | `0.078301` | `0.085272` |
| Split independent | `0.079100` | `0.070436` | `0.075281` |
| Split communicating | `0.078618` | `0.071087` | `0.075570` |
Full raw tables for the `0/100/500/2000` sweep live in:
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
## Warm-start note
The packed parallel warm-start uses the slice/fuse mapping implemented in `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`, but the step-0 numerical checks added for this update show that the warm start is not exactly equivalent end-to-end on a real batch:
- handover `10K`: `input_projection_max_abs_diff = 0.00122881`, `masked_loss_abs_diff = 0.00398052`
- dual-push `5K`: `input_projection_max_abs_diff = 0.00099802`, `masked_loss_abs_diff = 0.08580410`
- both checks report `warmstart_equivalent = False`
So this repo should be read as a matched warm-start study, not as a bitwise-identical step-0 control.
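The shape of that step-0 check can be illustrated with a small sketch (the helper below is hypothetical; the actual logic lives in `openpi/scripts/check_parallel_warmstart_equivalence.py`): compare the two models' input projections and masked losses on one real batch against a tolerance, rather than assuming bitwise equality.

```python
import numpy as np

# Hypothetical stand-in for the step-0 equivalence check: given the
# input projections and masked losses from the single model and its
# warm-started parallel counterpart on the same real batch, report the
# diffs and a tolerance-based verdict.
def check_warmstart_equivalence(single_proj: np.ndarray,
                                parallel_proj: np.ndarray,
                                single_loss: float,
                                parallel_loss: float,
                                atol: float = 1e-6) -> dict:
    proj_diff = float(np.max(np.abs(single_proj - parallel_proj)))
    loss_diff = abs(single_loss - parallel_loss)
    return {
        "input_projection_max_abs_diff": proj_diff,
        "masked_loss_abs_diff": loss_diff,
        "warmstart_equivalent": proj_diff <= atol and loss_diff <= atol,
    }
```

Under any reasonable `atol`, the diffs quoted above are large enough that both runs correctly report `warmstart_equivalent = False`.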
## Split-Expert Bring-Up (`2026-03-10`)
The repo now contains a true split-action-expert implementation in addition to the earlier packed head-only factorization. The new config flag is `action_expert_mode`, with four values:
- `shared`
- `head_only_parallel`
- `split_independent`
- `split_communicating`
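As a minimal illustration of how such a flag can be validated at construction time (the dataclass below is a hypothetical sketch, not the repo's actual config class):

```python
from dataclasses import dataclass

# The four modes named in the text; the config class itself is illustrative.
ACTION_EXPERT_MODES = (
    "shared",
    "head_only_parallel",
    "split_independent",
    "split_communicating",
)

@dataclass
class ActionExpertConfig:
    action_expert_mode: str = "shared"

    def __post_init__(self) -> None:
        # Fail fast on typos rather than silently falling back to shared.
        if self.action_expert_mode not in ACTION_EXPERT_MODES:
            raise ValueError(f"unknown action_expert_mode: {self.action_expert_mode!r}")
```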
Key bring-up results:
- the split warm-start copies the original single `gemma_expert` weights exactly into the left and right expert branches for both split modes
- `split_independent` passes the branch-local invariants:
- identical left/right inputs produce identical suffix outputs
- perturbing right-arm inputs leaves left-arm outputs unchanged, and vice versa
- both split modes pass detached real-data training on packed TWIN dual-push:
- `3`-step real-data smoke run with checkpoint save
- `20`-step real-data training run with checkpoint save
- the communicating model emits nonzero cross-arm attention diagnostics and remains finite through the real-data `20`-step run
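The branch-locality invariant above can be sketched as a perturbation test. `expert_fn` here is a hypothetical stand-in for the split suffix forward pass, returning `(left_out, right_out)`:

```python
import numpy as np

# With a truly independent split expert, perturbing the right-arm
# inputs must leave the left-arm outputs unchanged (and symmetrically
# for the other arm). Returns True if the invariant holds.
def check_branch_locality(expert_fn, left_in, right_in, seed: int = 0) -> bool:
    rng = np.random.default_rng(seed)
    left_out, _ = expert_fn(left_in, right_in)
    left_out_perturbed, _ = expert_fn(
        left_in, right_in + rng.normal(size=right_in.shape)
    )
    return bool(np.array_equal(left_out, left_out_perturbed))
```

A `split_communicating` model is expected to fail this check by design, since cross-arm attention deliberately couples the branches.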
New bring-up artifact bundle:
- `artifacts/twin_split_expert_bringup_20260310/`
- split warm-start checkpoints
- invariant-check outputs
- reproducibility commands
- summary README for the split-expert bring-up
## Repo layout
- `openpi/`
- modified source and scripts used for training/eval
- copied norm-stats assets for the packed configs
- full `2K`, `10K`, and dual-push `5K` checkpoint trees
- `artifacts/twin_handover_packed_parallelization_20260309/`
- initial `2K` study bundle
- `artifacts/twin_handover_packed_parallelization_10k_20260309/`
- `10K` follow-up bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/`
- dual-push `128` screening bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
- dual-push `128` four-way `2K` step-comparison bundle with metrics, logs, repro manifests, and environment snapshot
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
- small preflight/debug snapshot from the interrupted bring-up path; useful for debugging the runner, not the canonical result bundle
- `artifacts/twin_split_expert_bringup_20260310/`
- split-expert bring-up bundle committed with summary README, repro commands, detached run logs, and sanity checks
- `openpi/run_logs/`
- raw local split bring-up logs kept for completeness; the canonical copies for the finalized bring-up record live under `artifacts/twin_split_expert_bringup_20260310/run_logs/`
- `openpi/scripts/upload_stepcmp_bundle_to_hf.py`
- the committed high-throughput HF uploader for the step-comparison bundle and retained checkpoints; it uses `huggingface_hub.HfApi.upload_large_folder(...)`
- `artifacts/pi05_base_params/`
- staged base parameter snapshot used during JAX-to-PyTorch conversion
## Committed artifact note
For this update, the committed artifact payloads are:
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311/`
- the official finalized `4`-model dual-push `2K` step-comparison bundle
- `artifacts/twin_split_expert_bringup_20260310/`
- the split-expert bring-up bundle used as the sanity and warm-start reference
- `artifacts/twin_dual_push_128_stepcmp_2k_20260311_debug/`
- a small debug-only environment snapshot from the failed/resumed bring-up sequence
The debug bundle is intentionally committed only as runner diagnostics. The canonical study outputs are the non-`_debug` step-comparison bundle plus the split bring-up bundle.
## Future commit/upload workflow
When adding new experiment results to this repo:
- keep the canonical bundle under `artifacts/<study_name>/` and only retain the checkpoint steps that are scientifically required under `openpi/checkpoints/`
- before claiming the repo is fully committed, audit ignored artifact paths explicitly:
- `git ls-files --others -i --exclude-standard --directory -- openpi/checkpoints artifacts openpi/run_logs run_logs`
- if a result is intentionally kept in an ignored path such as `openpi/checkpoints/` or `openpi/run_logs/`, force-add it explicitly with `git add --sparse -f ...`
- use `openpi/scripts/upload_stepcmp_bundle_to_hf.py` for large HF uploads; it uses `huggingface_hub.HfApi.upload_large_folder(...)` and is the preferred path for checkpoint-heavy updates
- never hardcode HF credentials in scripts, logs, or READMEs; keep the credential in `HF_TOKEN` or load it from `HF_TOKEN_FILE`, and check for literal `hf_...` strings before committing
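The credential rule above can be sketched as a small token resolver (a hypothetical helper; the committed uploader's actual logic may differ), whose result would then be passed along to `HfApi(token=...).upload_large_folder(...)`:

```python
import os

def resolve_hf_token(env=None) -> str:
    # Prefer HF_TOKEN; fall back to reading HF_TOKEN_FILE; never accept
    # a literal token baked into the source tree.
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token and env.get("HF_TOKEN_FILE"):
        with open(env["HF_TOKEN_FILE"]) as f:
            token = f.read().strip()
    if not token:
        raise RuntimeError("set HF_TOKEN or HF_TOKEN_FILE before uploading")
    return token
```

Keeping the lookup in one place also makes the pre-commit scan for literal `hf_...` strings easy to trust: any match outside test fixtures is a leak.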
## Key files
- Full report: `REPORT.md`
- `2K` summary: `artifacts/twin_handover_packed_parallelization_20260309/metrics/summary.json`
- `10K` summary: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/summary.json`
- `10K` comparison table: `artifacts/twin_handover_packed_parallelization_10k_20260309/metrics/comparison_2k_vs_10k.csv`
- dual-push `5K` summary: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/summary.json`
- dual-push `5K` teacher-forced table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/teacher_forced_eval_table.csv`
- dual-push `5K` sample eval table: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/metrics/sample_eval_table.csv`
- dual-push `5K` environment snapshot: `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/environment/`
- dual-push `2K` step-comparison summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/summary.json`
- dual-push `2K` step-comparison README: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/README.md`
- dual-push `2K` teacher-forced table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/teacher_forced_eval_table.csv`
- dual-push `2K` sample eval table: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/sample_eval_table.csv`
- dual-push `2K` training summary: `artifacts/twin_dual_push_128_stepcmp_2k_20260311/metrics/training_summary.csv`
- split-expert bring-up summary: `artifacts/twin_split_expert_bringup_20260310/README.md`
- split-expert repro commands: `artifacts/twin_split_expert_bringup_20260310/repro/commands_bringup.sh`
- split-expert invariant check outputs: `artifacts/twin_split_expert_bringup_20260310/sanity_checks/`
- split-expert real-data logs: `openpi/run_logs/split_independent_real_smoke3_r2.log`, `openpi/run_logs/split_communicating_real_smoke3.log`, `openpi/run_logs/split_independent_real_train20.log`, `openpi/run_logs/split_communicating_real_train20.log`
- split-expert real-data checkpoints: `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_independent_pytorch_5k/`, `openpi/checkpoints/pi05_twin_dual_push_128_packed_split_expert_communicating_pytorch_5k/`
- `10K` repro commands: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/commands_reproduce.sh`
- `10K` changed-file manifest: `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `10K` environment snapshot: `artifacts/twin_handover_packed_parallelization_10k_20260309/environment/`
## Main changed files
Initial `2K` + `10K` study logic lives primarily in:
- `openpi/src/openpi/transforms.py`
- `openpi/src/openpi/training/config.py`
- `openpi/src/openpi/training/data_loader.py`
- `openpi/src/openpi/models/model.py`
- `openpi/src/openpi/models/tokenizer.py`
- `openpi/src/openpi/models_pytorch/pi0_pytorch.py`
- `openpi/scripts/train_pytorch.py`
- `openpi/scripts/eval_twin_val_loss_pytorch.py`
- `openpi/scripts/init_parallel_pi05_from_single_pytorch.py`
- `openpi/scripts/inspect_twin_packed_batch.py`
- `openpi/scripts/check_parallel_warmstart_equivalence.py`
- `openpi/scripts/check_split_expert_invariants.py`
- `openpi/scripts/run_twin_handover_packed_followup.sh`
- `openpi/scripts/run_twin_handover_packed_10k.sh`
- `openpi/scripts/run_twin_dual_push_128_packed_5k.sh`
The per-file rationale is recorded in:
- `artifacts/twin_handover_packed_parallelization_20260309/repro/changed_files.txt`
- `artifacts/twin_handover_packed_parallelization_10k_20260309/repro/changed_files.txt`
- `artifacts/twin_dual_push_128_packed_parallelization_5k_20260310/repro/changed_files.txt`