# Modification Summary (branch `0502_mp_process`)

Snapshot of every edit in this session, grouped by theme. Use this as a
code-review checklist and as a record of what changed *behaviorally* in the
training, evaluation, and data-prep pipelines.

## High-level themes

1. **Multi-process dataset builder + per-build diagnostics** — cleaning &
   replay summaries written to `meta/`, with per-speed integrated-motion
   error reporting.
2. **Action-norm profiler** for data-driven `clean_*_eps` calibration.
3. **Speed-integration ablation framework** — three model-side strategies
   (`text` / `modulation` / `soft_prompt`) selectable via a single config
   field, plus a P-length sweep on the soft-prompt arm.
4. **Ablation orchestration** — `scripts/run_ablations.py` (end-to-end) and
   `scripts/build_ablation_datasets.py` (data-prep only) with build /
   norm-stats dedup.
5. **8-GPU LIBERO evaluation** — partitioned eval driver that runs
   `libero_spatial / goal / object` plus a 5-way split of `libero_10`,
   tracks per-episode step counts, and reports per-suite + global rollups.
6. **`train_pytorch.py` cleanup** — removed silent NFS hang (wandb sample
   image block), removed hardcoded WANDB API key, made per-speed wandb
   breakdown config-driven.
7. **Four latent bugs fixed** that were silently degrading training or
   evaluation; see §7.

## 1. Data processing

### New files

| File | Purpose |
|---|---|
| `scripts/build_libero_speed_dataset_mp.py` | Multi-process build. Per-source-episode workers via `ProcessPoolExecutor` (`spawn`); main process does a sequential fix-up to fill the global `index` column. Same output schema as the single-process variant. CLI adds `--num-workers`. |
| `scripts/profile_action_norms.py` | Profile `‖action[:, :3]‖` and `‖action[:, 3:6]‖` distributions on the source dataset. Prints percentiles + threshold table; suggests P1 / P5 values for `--clean-transl-eps` / `--clean-rot-eps`. Optional JSON output. |
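
The worker-pool-plus-fix-up pattern described for the multi-process build can be sketched roughly as follows. This is a minimal illustration, not the script itself: `process_episode`, `assign_global_index`, and `build` are hypothetical names, and the per-episode work is a stand-in that only fabricates frame rows.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor


def process_episode(ep_idx: int) -> list[dict]:
    """Hypothetical worker: returns frame rows WITHOUT the global `index`."""
    n_frames = 3 + ep_idx  # stand-in for the real per-episode transform
    return [{"episode_index": ep_idx, "frame_index": i} for i in range(n_frames)]


def assign_global_index(per_episode_rows: list[list[dict]]) -> list[dict]:
    """Sequential fix-up in the main process: fill a contiguous global index."""
    rows: list[dict] = []
    for ep_rows in per_episode_rows:
        for row in ep_rows:
            rows.append({**row, "index": len(rows)})
    return rows


def build(episode_ids: list[int], num_workers: int) -> list[dict]:
    if num_workers <= 1:
        results = [process_episode(e) for e in episode_ids]
    else:
        ctx = mp.get_context("spawn")  # fresh interpreters, no forked state
        with ProcessPoolExecutor(max_workers=num_workers, mp_context=ctx) as pool:
            # map() yields results in input order, so the fix-up is deterministic
            results = list(pool.map(process_episode, episode_ids))
    return assign_global_index(results)
```

With `spawn`, the worker function has to be importable from a module, so a real script would call `build(...)` under an `if __name__ == "__main__":` guard.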

### Modified files

| File | Change |
|---|---|
| `scripts/build_libero_speed_dataset.py` | New `_aggregate_cleaning_stats` (deduped by `source_episode_index`) and `_aggregate_replay_metrics` (per target speed). Writes `meta/cleaning_summary.json` and `meta/replay_summary.json`; prints both at end of build. |
| `src/various_speed/core.py` | `transform_episode` metrics now include `cleaned_any_frames`, `cleaned_both_frames`, plus the four ratios (`*_translation_ratio`, `*_rotation_ratio`, `*_any_ratio`, `*_both_ratio`). |
| `scripts/compute_norm_stats.py` | `main()` accepts `--repo-id` and `--asset-id` overrides. Uses `dataclasses.replace` on frozen `TrainConfig` / nested `LeRobotVariousSpeedLiberoDataConfig` / `AssetsConfig` to apply overrides without registering one TrainConfig per ablation. |
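
The nested `dataclasses.replace` override can be illustrated with stand-in classes. The classes below are simplified placeholders, and the attribute names `data`, `assets`, `repo_id`, and `asset_id` are assumptions about how the real frozen configs are laid out.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class AssetsConfig:  # stand-in for the real AssetsConfig
    asset_id: Optional[str] = None


@dataclasses.dataclass(frozen=True)
class DataConfig:  # stand-in for LeRobotVariousSpeedLiberoDataConfig
    repo_id: str = "default/repo"
    assets: AssetsConfig = AssetsConfig()


@dataclasses.dataclass(frozen=True)
class TrainConfig:  # stand-in for the real TrainConfig
    name: str
    data: DataConfig = DataConfig()


def apply_overrides(cfg: TrainConfig, repo_id: Optional[str] = None,
                    asset_id: Optional[str] = None) -> TrainConfig:
    """Return a copy of cfg with overrides applied; the original is untouched."""
    data = cfg.data
    if asset_id is not None:
        assets = dataclasses.replace(data.assets, asset_id=asset_id)
        data = dataclasses.replace(data, assets=assets)
    if repo_id is not None:
        data = dataclasses.replace(data, repo_id=repo_id)
    return dataclasses.replace(cfg, data=data)
```

Because every level is frozen, each override rebuilds the path from the changed leaf up to the root, which is why one registered config can serve all ablations.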

## 2. `train_pytorch.py` cleanup

- Removed the wandb sample-image logging block. It created a second
  DataLoader and fetched 256 samples on the first batch, hanging silently
  for many minutes on NFS-backed datasets. Loss / lr / grad-norm wandb
  logging is unaffected.
- Removed the hardcoded `WANDB_API_KEY` env-var assignment (security: a real
  key was committed to the repo). Auth now uses the standard wandb
  resolution order. **The leaked key is in git history; rotate it on
  wandb**.
- Removed the `"EMA is not supported for PyTorch training"` log line
  (noise).
- `speed_specs` (per-speed wandb loss breakdown) and the `avg_flow_metrics`
  key list are now derived from `config.eval_speed_set` instead of the
  hardcoded `0p5 / 1p0 / 2p0`. (Only fires when `observation.flow_control is not None`.)
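
As an illustration of deriving the per-speed keys from the config instead of hardcoding them, a sketch along these lines reproduces the `0p5 / 1p0 / 2p0` tags. The helper names and the `loss_speed_` prefix are hypothetical; the actual key format in `train_pytorch.py` may differ.

```python
def speed_tag(speed: float) -> str:
    """Turn 0.5 into '0p5', 1.0 into '1p0', and so on."""
    return str(float(speed)).replace(".", "p")


def per_speed_metric_keys(eval_speed_set, prefix: str = "loss_speed_") -> list[str]:
    """One metric key per configured speed (prefix is a made-up example)."""
    return [f"{prefix}{speed_tag(s)}" for s in eval_speed_set]
```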

## 3. Speed-integration ablation: model + config

### New TrainConfig field (`src/openpi/training/config.py`)

```python
eval_speed_set: tuple[float, ...] = (0.5, 1.0, 2.0)
```

Drives the per-speed wandb breakdown in `train_pytorch.py`.

### New LeRobotVariousSpeedLiberoDataConfig field

```python
speed_integration: Literal["text", "modulation", "soft_prompt", "auto"] = "auto"
```

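A high-level switch. The default `auto` preserves the legacy behavior of
the existing LIBERO speed configs; the explicit values are: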

| value | behavior | requirement |
|---|---|---|
| `text` | adds `SpeedConditionedPrompt` to data transforms | none |
| `modulation` | model reads raw `observation.speed` → MLP → adaRMS in action expert | `Pi0Config.speed_modulation=True` |
| `soft_prompt` | learns K × P tokens (K = number of anchor speeds) and inserts the P tokens of the nearest anchor between vision and instruction | `Pi0Config.soft_prompt_p >= 1` and `soft_prompt_speeds` non-empty |
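
The requirement column can be read as a consistency check between the data config and the model config. The sketch below is one hypothetical way to express it: `ModelCfg` is a stand-in for the relevant `Pi0Config` fields, and raising `ValueError` is an assumption, not necessarily how the repo enforces the requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCfg:  # stand-in for the relevant Pi0Config fields
    speed_modulation: bool = False
    soft_prompt_speeds: tuple = ()
    soft_prompt_p: int = 0


def check_speed_integration(mode: str, model: ModelCfg) -> None:
    """Raise if the chosen mode's model-side requirement is not met."""
    if mode not in ("text", "modulation", "soft_prompt", "auto"):
        raise ValueError(f"unknown speed_integration: {mode!r}")
    if mode == "modulation" and not model.speed_modulation:
        raise ValueError("modulation requires speed_modulation=True")
    if mode == "soft_prompt" and (model.soft_prompt_p < 1 or not model.soft_prompt_speeds):
        raise ValueError("soft_prompt requires soft_prompt_p >= 1 and anchor speeds")
```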

### New Pi0Config fields (`src/openpi/models/pi0_config.py`)

```python
speed_modulation: bool = False
soft_prompt_speeds: tuple[float, ...] = ()
soft_prompt_p: int = 0
```

`flow_control_dim` was removed. `inputs_spec` now declares
`speed=ShapeDtypeStruct([B, 1], float32)` whenever modulation OR
soft_prompt is enabled.

### Observation schema (`src/openpi/models/model.py`)

Added `speed: at.Float[ArrayT, "*b 1"] | None = None`. Both `from_dict`
and `preprocess_observation` (JAX) propagate it.

### PI0Pytorch surgery (`src/openpi/models_pytorch/pi0_pytorch.py`)

- `__init__`:
  - When `speed_modulation=True`: registers `speed_mod_mlp_in/out` and
    `speed_condition_mlp_in/out` (replaces the old `flow_control_*` /
    `flow_condition_*` MLPs). Reads raw `observation.speed`
    (shape `(B, 1)`); no log transform is applied.
  - When `soft_prompt_p > 0`: registers `soft_prompt_tokens: nn.Parameter`
    of shape `(K, P, paligemma_width)` with `N(0, 0.02)` init, plus a
    non-persistent buffer `soft_prompt_anchors: tensor(K,)`.
- `_preprocess_observation`: returns a 6-tuple
  `(images, image_masks, lang_tokens, lang_masks, state, speed)`.
- `embed_prefix`: accepts `speed=None`; when soft_prompt is enabled,
  computes `argmin |speed − anchors|` per batch element and inserts
  `(B, P, hidden)` tokens between image and language tokens with full
  attention. OOD speeds fall back to the nearest training anchor.
- `embed_suffix`: accepts `speed=None`; when modulation is enabled,
  pushes raw speed through `speed_mod_mlp` and fuses with the timestep
  embedding via `speed_condition_mlp`.
- `forward`, `sample_actions`, and `denoise_step` plumb `speed` through.
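
The nearest-anchor selection in `embed_prefix` can be sketched with NumPy arrays standing in for the torch tensors. Function names are hypothetical, shapes follow the description above, and attention masks are left out.

```python
import numpy as np


def select_soft_prompt(speed: np.ndarray, anchors: np.ndarray,
                       tokens: np.ndarray) -> np.ndarray:
    """speed: (B, 1), anchors: (K,), tokens: (K, P, H) -> (B, P, H)."""
    # |speed - anchors| broadcasts to (B, K); argmin picks the nearest anchor
    nearest = np.abs(speed - anchors[None, :]).argmin(axis=1)
    return tokens[nearest]


def build_prefix(image_emb: np.ndarray, soft_emb: np.ndarray,
                 lang_emb: np.ndarray) -> np.ndarray:
    """Insert the soft-prompt tokens between image and language tokens."""
    return np.concatenate([image_emb, soft_emb, lang_emb], axis=1)


anchors = np.array([0.75, 1.0, 1.25, 1.5])
K, P, H = len(anchors), 2, 4
# Fill anchor k's tokens with the value k so the selection is easy to inspect
tokens = np.stack([np.full((P, H), float(k)) for k in range(K)])
speed = np.array([[0.5], [1.3], [3.0]])  # 0.5 and 3.0 lie outside the anchors
soft = select_soft_prompt(speed, anchors, tokens)
prefix = build_prefix(np.zeros((3, 5, H)), soft, np.zeros((3, 6, H)))
```

The two out-of-range speeds land on the first and last anchor respectively, which is the OOD fallback described above.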

JAX `Pi0` (`src/openpi/models/pi0.py`) was renamed the same way for
consistency (`flow_control_dim → speed_modulation`, `flow_control_mlp_*
→ speed_mod_mlp_*`, `flow_condition_mlp_* → speed_condition_mlp_*`) and
now reads `obs.speed`.

### Policy passthrough (`src/openpi/policies/libero_policy.py`)

`LiberoInputs` now passes `data["speed"]` through, alongside the existing
`flow_control` passthrough.

### Smoke tests (`tests/test_soft_prompt_smoke.py`)

Lightweight tests for `Pi0Config` field acceptance and the argmin
nearest-neighbor logic. The full end-to-end forward-pass test requires
PaliGemma weights and is gated to a manual GPU run.

## 4. Ablation orchestration

### `scripts/run_ablations.py` (new)

End-to-end orchestrator: build → norm-stats → train, per ablation.

- `Ablation` dataclass: `name`, `speeds`, `speed_integration`,
  `extra_train_args`, `shared_norm_key`.
- 12 default ablations:
  - **Speed-set sweep** (5): `g1_baseline`, `g2_coarse`, `g3a_step025`,
    `g4_narrow`, `g5_extreme`. All use `text` integration.
  - **Speed-integration sweep** (3): `speedint_text`,
    `speedint_modulation` (enables `Pi0Config.speed_modulation`, which
    replaces the pre-rename `--model.flow-control-dim=1` flag), and
    `softprompt_p8` (reused from the P-sweep).
  - **Soft-prompt P-length sweep** (5): `softprompt_p{1,4,8,16,32}`. All
    declare `shared_norm_key="softprompt_shared"` so they reuse one
    `norm_stats.json`.
- For the default 12-ablation table, dedup gives **5 builds, 8 norm-stats,
  12 train runs**.
- `--only`, `--skip-build`, `--skip-norm-stats`, `--skip-train`,
  `--dry-run` for scoped runs.
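
The dedup arithmetic can be sketched on a subset of the table. Two assumptions are made here: builds are keyed by the speed set, and norm-stats are keyed by `shared_norm_key` when set, otherwise by the ablation name. Only speed sets stated in this document are used, with the integration and P-sweep arms inferred to share the `g4_narrow` speeds; the other `g*` entries are omitted rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ablation:
    name: str
    speeds: tuple
    speed_integration: str
    extra_train_args: tuple = ()
    shared_norm_key: Optional[str] = None


NARROW = (0.75, 1.0, 1.25, 1.5)
SUBSET = [
    Ablation("g2_coarse", (0.5, 1.0, 1.5, 2.0), "text"),
    Ablation("g4_narrow", NARROW, "text"),
    Ablation("speedint_text", NARROW, "text"),
    Ablation("speedint_modulation", NARROW, "modulation"),
    *[Ablation(f"softprompt_p{p}", NARROW, "soft_prompt",
               shared_norm_key="softprompt_shared") for p in (1, 4, 8, 16, 32)],
]


def plan(ablations) -> tuple:
    """Return (number of builds, number of norm-stats runs, number of train runs)."""
    builds = {a.speeds for a in ablations}  # assumed build dedup key
    norms = {a.shared_norm_key or a.name for a in ablations}  # assumed norm key
    return len(builds), len(norms), len(ablations)
```

On this 9-entry subset the plan is 2 builds, 5 norm-stats runs, and 9 train runs; adding the three remaining speed-set ablations, each with its own speed set, gives the 5 / 8 / 12 quoted above.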

### `scripts/build_ablation_datasets.py` (new)

Thin wrapper that runs the data-prep stage only. It imports the same
`ABLATIONS` table from `run_ablations.py`, applies the same build dedup,
and prints a summary mapping ablation names to dataset paths before exiting.

## 5. 8-GPU LIBERO evaluation

### New files

| File | Purpose |
|---|---|
| `scripts/eval_libero_speed.py` | Single-GPU LIBERO eval client: connects to a websocket policy server, runs rollouts on a chosen suite or task-id subset, sends `speed` and `speed_label` in the observation element, records per-episode `success` / `steps` / `task_id`, prints per-rank summary, and writes a JSON. |
| `scripts/eval_libero_8gpu.sh` | Driver: dispatches 8 parallel `eval_libero_speed.py` clients. Partition is **GPU 0 → spatial / 1 → goal / 2 → object** (full suites) and **GPU 3-7 → libero_10 split into pairs of tasks**, balancing wall-clock. After all 8 finish, auto-aggregates per-suite and global rollups (success rate, mean steps for successes, mean steps overall). |
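
The GPU partition can be written down as a small table generator. The driver itself is a shell script, so this Python sketch only illustrates the layout; `make_partition` is a hypothetical helper, and the output tags follow the file names listed under Output.

```python
def make_partition(num_long_tasks: int = 10, tasks_per_rank: int = 2) -> dict:
    """Map GPU id -> (suite, task_ids or None for the full suite, output tag)."""
    plan = {
        0: ("libero_spatial", None, "spatial"),
        1: ("libero_goal", None, "goal"),
        2: ("libero_object", None, "object"),
    }
    gpu = 3
    for start in range(0, num_long_tasks, tasks_per_rank):
        ids = list(range(start, start + tasks_per_rank))
        plan[gpu] = ("libero_10", ids, "long_t" + "_".join(map(str, ids)))
        gpu += 1
    return plan
```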

### Per-episode tracking

`eval_libero_speed.py` records the **policy steps actually executed**
(excluding the `num_steps_wait` warmup), so:

- successes terminate early when `env.step` returns `done=True` and report
  the true step count;
- failures hit `max_steps` for the suite (220 / 280 / 300 / 520 / 400 for
  spatial / object / goal / 10 / 90 respectively).

The gap between `mean_steps_success` and `mean_steps_all` gives a quick
read on the failure rate: every failure runs to the suite's `max_steps`
cap, so `mean_steps_all` rises sharply as failures accumulate.
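
A minimal sketch of how the two step statistics relate, using made-up episode records. The field names `success` and `steps` match the per-episode record described above; `summarize` is a hypothetical helper.

```python
def summarize(episodes: list[dict]) -> dict:
    """Compute success rate and step statistics from per-episode records."""
    if not episodes:
        raise ValueError("no episodes to summarize")
    success_steps = [e["steps"] for e in episodes if e["success"]]
    return {
        "success_rate": len(success_steps) / len(episodes),
        "mean_steps_success": (sum(success_steps) / len(success_steps)
                               if success_steps else None),
        "mean_steps_all": sum(e["steps"] for e in episodes) / len(episodes),
    }


# Illustrative data: two early successes, two failures at the spatial cap (220)
demo = [
    {"success": True, "steps": 100},
    {"success": True, "steps": 120},
    {"success": False, "steps": 220},
    {"success": False, "steps": 220},
]
stats = summarize(demo)
```

Here `mean_steps_success` is 110 while `mean_steps_all` is 165; the 55-step gap comes entirely from the two capped failures.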

### Output

```
results/libero_eval_<speed>x_<ts>/
  spatial_<speed>x.json
  goal_<speed>x.json
  object_<speed>x.json
  long_t0_1_<speed>x.json   long_t2_3_<speed>x.json   ...   long_t8_9_<speed>x.json
  logs/<...>.log
  videos/<...>/<rollout>.mp4
```

Each per-rank JSON contains a `summary` block (success rate, step
statistics, summary line) and a per-episode list. The driver's final
output is a per-suite rollup and a global line.

## 6. Documentation

- `VARIOUS_SPEED_README.md` — added §2 (action-norm profiling) and §4
  (multi-process build + cleaning/replay summary outputs); §8 notes the
  wandb-image-logging removal.
- `README_ablation.md` (new) — full ablation workflow doc, including:
  the four sweep tables, build/norm dedup behavior, the 8-GPU evaluation
  workflow, and the soft-prompt implementation notes.
- `modification_summary.md` (this file).
- `VARIOUS_SPEED_CURRENT_PIPELINE.md` — deleted (superseded).

## 7. Bug fixes worth flagging

### 7.1 `flow_control` was being silently dropped

`src/openpi/models_pytorch/preprocessing_pytorch.py` constructed
`SimpleProcessedObservation` without copying `flow_control`. Result: in
PyTorch training, `observation.flow_control` was always `None`, so the
action expert's modulation MLP always received zeros.

**Implication**: any prior PyTorch run with the (now-removed)
`pi05_libero_various_speed_all_flow_prompt` config did **not** actually
use modulation — it was equivalent to the text-only path. After the
later refactor, the `flow_control` field was eliminated entirely; the
modulation path now reads `observation.speed` directly. The new
`pi05_libero_various_speed_all_modulation` config replaces it.

Fix: pass `speed` through to `SimpleProcessedObservation` (the
`flow_control` field has since been removed).

### 7.2 JAX `preprocess_observation` did not pass `speed`

`src/openpi/models/model.py:preprocess_observation` (JAX path) didn't
propagate the new `speed` field. Even though the JAX trainer is not used
for the soft_prompt sweep, the field should round-trip cleanly to keep
both backends consistent. Fixed.

### 7.3 `--model.soft-prompt-speeds` CLI syntax

`scripts/run_ablations.py` initially emitted
`--model.soft-prompt-speeds=0.75,1,1.25,1.5` (comma-joined). Tyro parses
`tuple[float, ...]` from space-separated argv elements (matching
`--eval-speed-set` style). Fixed: emit the flag and each value as
separate argv elements.
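
The corrected emission amounts to building the argv list element by element rather than joining values with commas. A small sketch, where `tuple_flag` is a hypothetical helper name:

```python
def tuple_flag(flag: str, values) -> list[str]:
    """Emit a flag followed by one argv element per value."""
    # The 'g' format drops trailing zeros, so 1.0 is emitted as '1'
    return [flag, *(f"{v:g}" for v in values)]


argv = tuple_flag("--model.soft-prompt-speeds", (0.75, 1.0, 1.25, 1.5))
```

Passing the resulting list straight to `subprocess.run` (without `shell=True`) keeps each value as its own argument.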

### 7.4 Hardcoded WANDB API key in `init_wandb`

A live key was hardcoded and unconditionally written to
`os.environ["WANDB_API_KEY"]`, overriding each user's own credentials and
attributing all runs to one account. Removed; wandb now uses its standard
auth resolution order. **The committed key is exposed in git history;
revoke and rotate**.

## 8. Behavioral changes you should be aware of

- **`speed_integration` defaults to `"auto"`**, which preserves legacy
  behavior of the existing 3 LIBERO speed configs in `config.py`. New
  ablation configs should set `speed_integration` explicitly.
- **The legacy `pi05_libero_various_speed_all_flow_prompt` and
  `pi05_libero_various_speed_all_flow_noprompt` configs were removed**
  (replaced by `pi05_libero_various_speed_all_modulation`). Because of the
  §7.1 bug, neither config actually applied modulation (the `flow_prompt`
  variant was effectively text-only), so checkpoints trained under those
  names do not have the modulation behavior they appeared to have.
- **wandb no longer logs sample camera images on first batch**. If you
  relied on that for debugging data inputs, run
  `scripts/visualize_speed_dataset.py` separately.
- **Per-build `meta/cleaning_summary.json` and `meta/replay_summary.json`**
  are new artifacts. Existing downstream consumers should ignore unknown
  meta files; verify if you have custom tooling that reads `meta/*.json`.
- **`g2_coarse` and `g4_narrow` speeds were updated mid-session**:
  - `g2_coarse`: `[0.5, 1.0, 2.0]` → `[0.5, 1.0, 1.5, 2.0]`
  - `g4_narrow`: `[0.75, 1.0, 1.25]` → `[0.75, 1.0, 1.25, 1.5]`
  - `g4_narrow` now shares its dataset with the entire speed-integration
    sweep, so the runner builds it only once.

## 9. Files added / modified / deleted

```
# New
A  README_ablation.md
A  modification_summary.md
A  scripts/build_ablation_datasets.py
A  scripts/build_libero_speed_dataset_mp.py
A  scripts/eval_libero_8gpu.sh
A  scripts/eval_libero_speed.py
A  scripts/profile_action_norms.py
A  scripts/run_ablations.py
A  tests/test_soft_prompt_smoke.py

# Modified
M  src/openpi/models/model.py
M  src/openpi/models/pi0.py
M  src/openpi/models/pi0_config.py
M  src/openpi/models_pytorch/pi0_pytorch.py
M  src/openpi/models_pytorch/preprocessing_pytorch.py
M  src/openpi/policies/libero_policy.py
M  src/openpi/transforms.py
M  src/openpi/training/config.py
M  src/various_speed/core.py
M  scripts/build_libero_speed_dataset.py
M  scripts/compute_norm_stats.py
M  scripts/train_pytorch.py
M  VARIOUS_SPEED_README.md

# Deleted
D  VARIOUS_SPEED_CURRENT_PIPELINE.md
```

## 10. Verification still needed (manual, on GPU host)

1. `uv run pytest tests/test_soft_prompt_smoke.py -v` — config validation
   and nearest-neighbor logic. CPU-only, fast.
2. Single-batch forward pass of `PI0Pytorch` with soft_prompt enabled
   (see docstring on
   `tests/test_soft_prompt_smoke.py::test_full_forward_pass_manual_only`).
3. `uv run python scripts/run_ablations.py ... --dry-run` — visually
   confirm the printed CLI commands look correct, especially that
   `--model.soft-prompt-speeds 0.75 1 1.25 1.5` is space-separated.
4. ~50-step smoke run of `softprompt_p8` on 1 GPU to confirm the model
   trains without shape / mask / dtype errors.
5. `profile_action_norms.py` on the source dataset, then update
   `--clean-transl-eps` / `--clean-rot-eps` in build commands to
   data-driven values before kicking off the full sweep.
6. `eval_libero_8gpu.sh` end-to-end with a single trained checkpoint:
   start with `SPEED=1.0` (in-distribution) to confirm 8-rank
   coordination works, then iterate over OOD speeds.
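
For step 5, the percentile-based calibration can be sketched as below. This is a NumPy illustration, not the profiler script: `suggest_clean_eps` is a hypothetical name, and the only layout assumed is the one stated in §1 (translation in `action[:, :3]`, rotation in `action[:, 3:6]`).

```python
import numpy as np


def suggest_clean_eps(actions: np.ndarray, percentiles=(1, 5)) -> dict:
    """Per-frame norm percentiles as candidate cleaning thresholds."""
    transl = np.linalg.norm(actions[:, :3], axis=1)
    rot = np.linalg.norm(actions[:, 3:6], axis=1)
    return {
        "transl": {f"p{q}": float(np.percentile(transl, q)) for q in percentiles},
        "rot": {f"p{q}": float(np.percentile(rot, q)) for q in percentiles},
    }


# Synthetic data: norms spread evenly so the percentiles are easy to verify
actions = np.zeros((101, 7))
actions[:, 0] = np.linspace(0.0, 1.0, 101)  # translation norm in [0, 1]
actions[:, 3] = np.linspace(0.0, 2.0, 101)  # rotation norm in [0, 2]
suggested = suggest_clean_eps(actions)
```

The P1 or P5 value for each group would then be passed as `--clean-transl-eps` / `--clean-rot-eps`.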