Document regime runbook and schedule provenance

Files changed (2)

docs/openwebtext10k_streaming_report.md +20 -0
docs/plan.md +550 -8

docs/openwebtext10k_streaming_report.md

@@ -16,6 +16,26 @@ baselines.
 - `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl`
 ## Condition Ranking By Final Loss
 | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |

 - `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl`
+## Condition Provenance
+The `anchor_decay` label means the dropout value is chosen from explicit
+prefix-token anchors. It does not by itself imply that the schedule came from
+the coefficient formula.
+| Condition | Provenance | Dropout path | Interpretation |
+|---|---|---|---|
+| `openwebtext10k_interaction` | coefficient-derived schedule | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` | Main OpenWebText10K formula-derived schedule. This is the condition that tests the regime-specific interaction coefficient hypothesis. |
+| `hold_30_then_decay` | heuristic schedule-search ablation | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It caps the initial dropout at `0.30`, holds it for the two smallest stream prefixes, then lowers dropout aggressively to release model capacity. |
+| `mild_30_to_08` | heuristic schedule-search ablation | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It tests whether a smoother decay from `0.30` to a moderate final dropout is competitive. |
+| `fitted_l16_static_law` | older fitted/static-law schedule | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` | Retained as a comparison to the earlier overly aggressive fitted schedule; it is not the current interaction formula schedule. |
+| `static_dropout_*` | static baseline | constant | Fixed dropout used at every stream prefix. |
+The two heuristic schedules should be treated as ablations, not as independent
+evidence that the coefficient formula generated their exact paths. Their role is
+to show that the shape of the decay matters and that reasonable hand-designed
+decays can also beat weak static choices. The main formula claim for this
+regime should be based on `openwebtext10k_interaction`.
 ## Condition Ranking By Final Loss
 | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |

docs/plan.md

@@ -277,6 +277,520 @@ Use this order for every regime.
 7. Immediately backtest the new regime against all previous regimes.
 8. Only then run expensive streaming validation in the new regime.
 ## Current Regime Ledger
 | Regime | Status | Role |
@@ -357,8 +871,9 @@ Paired final-loss result:
 | `smooth_low` | 4/5, with the one miss only `+0.0003` |
 The immediate risk is no longer seed count for TinyStories or OpenWebText10K.
-The main remaining risk is external validity beyond two tested regimes. The
-current defensible claim is:
 ```text
 Formula-derived dropout schedules track the moving useful dropout region and
@@ -371,9 +886,9 @@ The stronger claim:
 Formula-derived dropout decay beats the best static dropout.
 ```
-is supported at `n=5` in both the TinyStories and OpenWebText10K streaming
-setups, with interaction decay beating the per-seed best static baseline in all
-five seeds in both regimes.
 Latest OpenWebText10K 5-seed streaming final-loss table:
@@ -388,6 +903,33 @@ Latest OpenWebText10K 5-seed streaming final-loss table:
 | static `0.02` | 4.5358 | 0.0091 |
 | static `0.00` | 4.5943 | 0.0216 |
 Paired final-loss result:
 | Decay schedule | Paired wins vs best static |
@@ -468,9 +1010,9 @@ the same MPS-only, five-seed validation standard.
 ## Next Training After Current Gate
-No MPS training should launch until the two completed five-seed streaming
-reports are read together. Since OpenWebText10K seed count is no longer the
-limiting issue, use a third held-out regime for the next validation step:
 ```text
 completed: TinyStories 5-seed streaming report

 7. Immediately backtest the new regime against all previous regimes.
 8. Only then run expensive streaming validation in the new regime.
+## New Regime Script Runbook
+Use this exact command sequence for any new regime. Replace placeholders such as
+`<regime>`, `<CORPUS_OR_PARQUET_PATH>`, `<MODEL_SPEC>`, and `<TIMESTAMP>` with
+concrete values before launching. Do not skip from calibration directly to
+streaming: the schedule must be frozen from the coefficient fit before the
+streaming run starts.
+This section is intentionally verbose. Its purpose is to make future regimes
+auditable: an external reader should be able to tell what each script did, what
+file it produced, and which decision gate came next.
+### New Regime Step 0: MPS Preflight
+Run this before any torch training command:
+```bash
+.venv/bin/python -c "import torch; print({'mps_built': torch.backends.mps.is_built(), 'mps_available': torch.backends.mps.is_available(), 'cuda_available': torch.cuda.is_available()}); raise SystemExit(0 if torch.backends.mps.is_available() else 1)"
+```
+What this does:
+| Check | Meaning |
+|---|---|
+| `mps_built` | PyTorch was compiled with Apple MPS support |
+| `mps_available` | this machine can actually run MPS now |
+| `cuda_available` | reported for diagnostics only; CUDA should not be used for this project |
+Decision rule:
+```text
+if mps_available is false: stop and report
+if cuda_available is true: still do not use CUDA
+```
+Also check for duplicate experiment processes before launching a long run. This
+is not part of the coefficient method, but it prevents corrupted timing and
+resource comparisons.
+### New Regime Step 1: Static Dropout Calibration Screen
+Run:
+```bash
+.venv/bin/python scripts/run_experiments.py \
+  --mode screen_static \
+  --corpus <CORPUS_OR_PARQUET_PATH> \
+  --text-column <TEXT_COLUMN_IF_PARQUET> \
+  --cache-dir .cache/dropout_decay_<regime> \
+  --output-dir runs/<regime>_static_screen \
+  --models <M1=layersxheadsxdim> <M2=layersxheadsxdim> <M3=layersxheadsxdim> \
+  --seeds 1 2 \
+  --token-limits <U1> <U2> <U3> <U4> \
+  --dropout-rates 0 0.02 0.04 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
+  --steps <STATIC_STEPS> \
+  --batch-size <BATCH> \
+  --block-size <BLOCK> \
+  --eval-batches <EVAL_BATCHES> \
+  --train-eval-batches <TRAIN_EVAL_BATCHES> \
+  --trace-eval-batches <TRACE_EVAL_BATCHES> \
+  --vocab-size <VOCAB_SIZE> \
+  --val-tokens <VAL_TOKENS> \
+  --lr <LR> \
+  --weight-decay <WEIGHT_DECAY> \
+  --grad-clip 1.0 \
+  --screen-early-stop
+```
+What this script run does:
+`scripts/run_experiments.py --mode screen_static` trains a grid of static
+dropout models. It does not test the final decay hypothesis. It estimates the
+best static dropout rate for each calibration cell:
+```text
+cell = (model parameter count P, prefix/unique tokens U, sampled tokens C)
+```
+For each cell, the script evaluates a fixed dropout grid and writes the
+validation curve. The curve is used later to extract the target dropout `p*`.
+Expected outputs under `runs/<regime>_static_screen/screen_static/<TIMESTAMP>/`:
+| File | Use |
+|---|---|
+| `metrics.jsonl` | per-run raw metrics; includes token limit, model, seed, losses, and tokens seen |
+| `model_selection.csv` | per-cell static dropout curve and selected best dropout |
+| `summary.csv` / `summary.json` | compact aggregate summary |
+| `trace.jsonl` | lower-frequency trace for diagnostics |
+| `RESULT_SUMMARY.md` | human-readable first-pass summary |
+Why this is needed:
+The coefficient formula is not fitted from streaming outcomes. It is fitted
+from static dropout optima. This separation is essential: calibration estimates
+where useful regularization sits; streaming validation tests whether following
+that moving estimate helps.
+Recommended cheap calibration:
+| Dimension | Default |
+|---|---|
+| models | at least 3 model sizes if testing coefficient generality |
+| token prefixes | at least 4 prefixes |
+| seeds | 1-2 for calibration, 5 only for final streaming validation |
+| dropout grid | include low, middle, and high values so the optimum can be bracketed |
+Decision rule:
+```text
+continue if most cells have a bracketed or near-bracketed optimum
+refine if many best dropouts sit at the edge of the grid
+stop and inspect if validation curves are flat/noisy enough that p* is unstable
+```
+### New Regime Step 2: Fit First-Order Base Coefficients
+Run:
+```bash
+.venv/bin/python scripts/fit_dropout_coefficients.py \
+  --run-dirs runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
+  --output-dir runs/coefficient_calibration/<regime>_base \
+  --target quad \
+  --weighting heuristic \
+  --feature-set base \
+  --min-rate 0.0 \
+  --max-rate 0.30
+```
+What this script run does:
+`scripts/fit_dropout_coefficients.py` reads `model_selection.csv` and
+`metrics.jsonl` from the static screen. It converts each calibration cell into:
+```text
+x = log10(P / U)
+y = log10(C / U)
+target = observed useful static dropout p*
+```
+With `--feature-set base`, it fits the first-order ablation:
+```text
+p* ~= A*x + B*y + C0
+```
+With `--target quad`, the target `p*` is the local quadratic minimum around the
+best dropout grid point when the curve is bracketed. If the curve is not
+bracketed, the script falls back to the grid best and marks the cell as weaker
+evidence.
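The `quad` target extraction described above can be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/fit_dropout_coefficients.py`: the function name `quad_target` and the three-point parabola fit are assumptions about one reasonable way to compute a local quadratic minimum with a grid-best fallback.

```python
import numpy as np


def quad_target(rates, losses):
    """Return (p_star, bracketed) for one calibration cell.

    Hypothetical helper: fits a parabola through the best grid point and
    its two neighbours, and falls back to the grid best otherwise.
    """
    rates = np.asarray(rates, dtype=float)
    losses = np.asarray(losses, dtype=float)
    order = np.argsort(rates)
    rates, losses = rates[order], losses[order]

    best = int(np.argmin(losses))
    # Not bracketed: the best point sits on the edge of the tested grid.
    if best == 0 or best == len(rates) - 1:
        return float(rates[best]), False

    # Exact parabola through three points: loss ~= a*p^2 + b*p + c.
    a, b, _ = np.polyfit(rates[best - 1:best + 2], losses[best - 1:best + 2], 2)
    if a <= 0:
        # Flat or concave neighbourhood: no usable minimum, use grid best.
        return float(rates[best]), False

    # Keep the vertex between the two neighbouring grid points.
    vertex = float(np.clip(-b / (2 * a), rates[best - 1], rates[best + 1]))
    return vertex, True
```

A caller could use the returned `bracketed` flag to downweight the cell, in the spirit of the heuristic weighting described next.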
+With `--weighting heuristic`, the fit downweights cells that are less reliable:
+| Cell condition | Why it is weaker |
+|---|---|
+| boundary optimum | true optimum may be outside the tested dropout grid |
+| not bracketed | local quadratic minimum is less trustworthy |
+| very flat curve | many dropout rates perform nearly the same |
+| noisy best loss | target dropout is less stable |
+Expected outputs under `runs/coefficient_calibration/<regime>_base/`:
+| File | Use |
+|---|---|
+| `coefficients.json` | fitted `A`, `B`, `C0`, metrics, and cross-validation scores |
+| `fit_diagnostics.md` | readable coefficient table, formula, fit metrics, and cell residuals |
+| `calibration_cells.csv` | one row per fitted cell with target, prediction, residual, and flags |
+| `next_dropout_suggestions.csv` | dropout rates to add if a cell needs refinement |
+Why this is needed:
+The base model is the simplest pressure-law hypothesis. It is the ablation that
+tells reviewers whether the interaction term is actually necessary.
+Decision rule:
+```text
+if base MAE and held-out errors are already low: keep it as a strong ablation
+if base has biased residuals or higher MAE: compare against interaction next
+```
+### New Regime Step 3: Fit Interaction Coefficients
+Run:
+```bash
+.venv/bin/python scripts/fit_dropout_coefficients.py \
+  --run-dirs runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
+  --output-dir runs/coefficient_calibration/<regime>_interaction \
+  --target quad \
+  --weighting heuristic \
+  --feature-set interaction \
+  --min-rate 0.0 \
+  --max-rate 0.30
+```
+What this script run does:
+This repeats the same target extraction and weighted least-squares fitting, but
+uses the interaction pressure law:
+```text
+p* ~= A*x + B*y + D*x*y + C0
+```
+The extra term `D*x*y` lets model/data pressure and sampled-token pressure
+interact. Empirically, this has mattered because dropout pressure is not always
+additive: the useful effect of seeing more cumulative sampled tokens can depend
+on how oversized the model is relative to the available unique data.
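As a rough illustration of the interaction fit, the sketch below solves the weighted least-squares problem for `A`, `B`, `D`, and `C0` with NumPy. The function name `fit_interaction`, its argument layout, and the returned dictionary keys are hypothetical; the real script may organize the fit and its outputs differently.

```python
import numpy as np


def fit_interaction(P, U, C, p_star, weights):
    """Weighted least-squares fit of p* ~= A*x + B*y + D*x*y + C0.

    Hypothetical helper: P, U, C, p_star, and weights are equal-length
    sequences with one entry per calibration cell.
    """
    P, U, C = (np.asarray(v, dtype=float) for v in (P, U, C))
    x = np.log10(P / U)  # model/data pressure
    y = np.log10(C / U)  # sampled-token pressure

    # Design matrix columns: x, y, interaction x*y, intercept.
    X = np.column_stack([x, y, x * y, np.ones_like(x)])

    # Scaling rows by sqrt(weight) turns ordinary least squares into WLS.
    sw = np.sqrt(np.asarray(weights, dtype=float))
    target = np.asarray(p_star, dtype=float)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], target * sw, rcond=None)

    A, B, D, C0 = (float(c) for c in coef)
    return {"A": A, "B": B, "D": D, "C0": C0}
```

Dropping the `x * y` column from the design matrix gives the base form from Step 2, which is what makes the two fits directly comparable.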
+Expected outputs are the same as Step 2, but under:
+```text
+runs/coefficient_calibration/<regime>_interaction/
+```
+Decision rule:
+```text
+promote interaction if it lowers MAE/RMSE, improves leave-prefix/leave-model
+validation, and does not create obvious residual bias
+```
+Do not promote the interaction form merely because it has more parameters. The
+paper needs the base-vs-interaction comparison to show that the extra term buys
+predictive accuracy, not just in-sample flexibility.
+### New Regime Step 4: Optional Static Refinement
+Only run this if `fit_diagnostics.md` or `next_dropout_suggestions.csv` shows
+that important cells are weakly identified.
+Run:
+```bash
+.venv/bin/python scripts/run_experiments.py \
+  --mode screen_static \
+  --resume-from runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
+  --use-cached-data \
+  --cache-dir .cache/dropout_decay_<regime> \
+  --output-dir runs/<regime>_static_refined \
+  --models <ONLY_AFFECTED_MODELS> \
+  --seeds 1 2 \
+  --token-limits <ONLY_AFFECTED_PREFIXES> \
+  --dropout-rates <SUGGESTED_RATES> \
+  --steps <STATIC_STEPS> \
+  --batch-size <BATCH> \
+  --block-size <BLOCK> \
+  --eval-batches <EVAL_BATCHES> \
+  --train-eval-batches <TRAIN_EVAL_BATCHES> \
+  --trace-eval-batches <TRACE_EVAL_BATCHES> \
+  --vocab-size <VOCAB_SIZE> \
+  --val-tokens <VAL_TOKENS> \
+  --lr <LR> \
+  --weight-decay <WEIGHT_DECAY> \
+  --grad-clip 1.0
+```
+What this script run does:
+This adds only missing static dropout points. It should not rerun the full grid.
+`--resume-from` lets the experiment skip rows already completed in the original
+static screen. `--use-cached-data` reuses the cached tokenizer and token arrays
+so refinement is measuring dropout/model behavior, not data preprocessing
+differences.
+When to use it:
+| Trigger | Refinement action |
+|---|---|
+| best dropout is at grid edge | add rates beyond or near that edge if allowed |
+| curve is too coarse near optimum | add rates around the local best |
+| static curve is flat | add seeds or eval batches before changing the formula |
+After refinement, rerun Steps 2 and 3 with all relevant run dirs. At minimum,
+rerun the promoted feature family. If the paper will compare base versus
+interaction after refinement, rerun both.
+```bash
+.venv/bin/python scripts/fit_dropout_coefficients.py \
+  --run-dirs \
+    runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
+    runs/<regime>_static_refined/screen_static/<TIMESTAMP> \
+  --output-dir runs/coefficient_calibration/<regime>_interaction_refined \
+  --target quad \
+  --weighting heuristic \
+  --feature-set interaction \
+  --min-rate 0.0 \
+  --max-rate 0.30
+```
+Decision rule:
+```text
+refinement is complete when the promoted coefficient fit has acceptable MAE,
+held-out errors, and no obvious residual direction across P/U or C/U
+```
+### New Regime Step 5: Generate Frozen Streaming Anchors
+Run:
+```bash
+.venv/bin/python scripts/make_streaming_anchors.py \
+  --coefficients-json <PROMOTED_COEFFICIENTS_JSON> \
+  --name <regime>_interaction \
+  --parameters <WINNER_MODEL_PARAM_COUNT> \
+  --stream-token-caps <U1> <U2> <U3> <U4> <U5> \
+  --stage-steps <STAGE_STEPS> \
+  --batch-size <BATCH> \
+  --block-size <BLOCK> \
+  --min-rate 0.02 \
+  --max-rate 0.65 \
+  --precision 3
+```
+What this script run does:
+`scripts/make_streaming_anchors.py` turns `coefficients.json` into the exact
+dropout schedule used by `locked_stream`. For each stream prefix, it computes:
+```text
+P = chosen model parameter count
+U_t = stream prefix tokens at stage t
+C_t = cumulative sampled optimizer tokens through stage t
+x_t = log10(P / U_t)
+y_t = log10(C_t / U_t)
+p_t = clamp(p_min, p_max, A*x_t + B*y_t + D*x_t*y_t + C0)
+```
+The script prints two things:
+1. a JSON diagnostic table with raw and clipped dropout values
+2. a final one-line anchor spec, for example:
+```text
+<regime>_interaction:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020
+```
+That final line is copied into the next command as `--anchor-decays`.
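The anchor computation and the one-line spec format can be sketched as below. This is an illustration under stated assumptions, not the contents of `scripts/make_streaming_anchors.py`: the function name `make_anchors`, the coefficient dictionary keys, and the rule that every stage adds `stage_steps * batch_size * block_size` sampled tokens to `C_t` are all assumptions.

```python
import math


def make_anchors(name, coef, params, stream_caps, stage_steps,
                 batch_size, block_size, p_min=0.02, p_max=0.65, precision=3):
    """Build an anchor spec string such as 'name:250000=0.300,...'.

    Hypothetical helper. Assumes each stage runs the same number of
    optimizer steps, so cumulative sampled tokens grow linearly by stage.
    """
    tokens_per_stage = stage_steps * batch_size * block_size
    parts = []
    for stage, u_t in enumerate(stream_caps, start=1):
        c_t = stage * tokens_per_stage   # cumulative sampled tokens (assumed)
        x_t = math.log10(params / u_t)   # log10(P / U_t)
        y_t = math.log10(c_t / u_t)      # log10(C_t / U_t)
        raw = (coef["A"] * x_t + coef["B"] * y_t
               + coef.get("D", 0.0) * x_t * y_t + coef["C0"])
        p_t = min(p_max, max(p_min, raw))  # clip to [p_min, p_max]
        parts.append(f"{u_t}={p_t:.{precision}f}")
    return f"{name}:" + ",".join(parts)
```

Comparing the raw value with the clipped value stage by stage is a quick way to spot the pathological case where every anchor sits at `p_min` or `p_max`.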
+`<PROMOTED_COEFFICIENTS_JSON>` should point to the coefficient file selected by
+the coefficient gate. In a clean first pass, this is usually:
+```text
+runs/coefficient_calibration/<regime>_interaction/coefficients.json
+```
+If optional refinement was needed and accepted, use the refined coefficient
+file instead:
+```text
+runs/coefficient_calibration/<regime>_interaction_refined/coefficients.json
+```
+Decision rule:
+```text
+freeze this anchor spec before streaming starts
+do not edit the schedule after looking at streaming validation losses
+```
+If the anchor schedule looks pathological before training, such as all values
+clipping at `p_min` or `p_max`, inspect the coefficient fit and calibration
+cells before launching streaming.
+### New Regime Step 6: Five-Seed Locked Streaming Validation
+Run:
+```bash
+.venv/bin/python scripts/run_experiments.py \
+  --mode locked_stream \
+  --use-cached-data \
+  --cache-dir .cache/dropout_decay_<regime> \
+  --output-dir runs/<regime>_<model>_streaming_validation_5seed \
+  --models <WINNER_MODEL_NAME=layersxheadsxdim> \
+  --seeds 1 2 3 4 5 \
+  --stream-token-caps <U1> <U2> <U3> <U4> <U5> \
+  --dropout-rates 0 0.02 0.04 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
+  --anchor-decays <FROZEN_ANCHOR_SPEC_FROM_STEP_5> \
+  --stage-steps <STAGE_STEPS> \
+  --batch-size <BATCH> \
+  --block-size <BLOCK> \
+  --eval-batches <EVAL_BATCHES> \
+  --train-eval-batches <TRAIN_EVAL_BATCHES> \
+  --trace-eval-batches <TRACE_EVAL_BATCHES> \
+  --log-every 250 \
+  --vocab-size <VOCAB_SIZE> \
+  --val-tokens <VAL_TOKENS> \
+  --lr <LR> \
+  --weight-decay <WEIGHT_DECAY> \
+  --grad-clip 1.0
+```
+What this script run does:
+`locked_stream` is the paper-grade test. It simulates a stream by increasing
+the available prefix tokens over stages. For each seed, it trains:
+| Condition type | Meaning |
+|---|---|
+| static dropout baselines | same dropout at every stream stage |
+| anchor decay schedule | frozen coefficient-derived dropout at each stream stage |
+The static baselines must be broad enough to make the comparison fair. The
+claim is not that decay beats weak static choices; the claim is that it can beat
+the best static dropout available in the tested grid.
+Expected outputs under
+`runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/`:
+| File | Use |
+|---|---|
+| `metrics.jsonl` | raw row-level results for each condition, seed, and prefix |
+| `summary.csv` / `summary.json` | aggregate condition and stage summaries |
+| `trace.jsonl` | progress traces for diagnostic plotting |
+| `config.json` | exact run configuration |
+| `RESULT_SUMMARY.md` | built-in readable summary |
+Primary evaluation metrics:
+```text
+final validation loss at largest prefix
+mean trajectory validation loss
+stage-wise validation loss
+paired seed delta versus the best static baseline
+rank consistency across seeds
+```
+Decision rule:
+```text
+strong pass: decay has best mean final loss and beats best static in most or all
+paired seeds
+weak pass: decay ties best static while avoiding bad early/late static choices
+fail: decay loses to a simple static baseline in most paired seeds or wins early
+only by sacrificing final loss
+```
+### New Regime Step 7: Summarize Streaming Validation
+Run:
+```bash
+.venv/bin/python scripts/summarize_streaming_multiseed.py \
+  --metrics runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/metrics.jsonl \
+  --output-dir runs/<regime>_streaming_report/<model>_validation_5seed \
+  --report docs/<regime>_streaming_report.md \
+  --title "<Regime Name> Streaming Validation" \
+  --date <YYYY-MM-DD> \
+  --context "<regime/model/token/step description>" \
+  --conditions <regime>_interaction static_dropout_0.1 static_dropout_0.08 static_dropout_0.06 static_dropout_0.14 static_dropout_0.18 static_dropout_0.2 static_dropout_0.04 static_dropout_0.02 static_dropout_0 static_dropout_0.26 static_dropout_0.3
+```
+What this script run does:
+`scripts/summarize_streaming_multiseed.py` performs no training. It reads the
+saved `metrics.jsonl` file and writes standardized artifacts comparable across
+regimes.
+Expected outputs:
+| File | Use |
+|---|---|
+| `docs/<regime>_streaming_report.md` | human-readable regime report for paper discussion |
+| `condition_summary.csv` | condition ranking by final validation loss |
+| `stage_summary.csv` | stage-wise trajectory table |
+| `paired_final_deltas.csv` | per-seed final-loss comparison against the best static baseline |
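+The most important table is `paired_final_deltas.csv`. A mean win is useful, but
+paired seed wins are stronger evidence: each comparison is made within the same
+seed, which reduces the concern that a difference comes from a lucky
+initialization rather than from the schedule.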
+Decision rule:
+```text
+if the decay schedule wins 5/5 paired seeds: promote regime to strong evidence
+if it wins 3-4/5: inspect effect size, variance, and trajectory tradeoff
+if it wins 0-2/5: treat as a failed regime or schedule and do not bury it
+```
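The paired-seed decision rule above can be sketched as follows. This is an illustrative sketch, not the logic in `scripts/summarize_streaming_multiseed.py`: the input layout (`{condition: {seed: final_loss}}`), the function names, and the choice to compare against the per-seed best static baseline are assumptions.

```python
def paired_wins(final_loss, decay_name):
    """Count seeds where the decay schedule beats the best static baseline.

    Hypothetical helper. final_loss maps condition name to a dict of
    {seed: final validation loss}. The best static is chosen per seed.
    """
    statics = [c for c in final_loss if c.startswith("static_dropout_")]
    deltas = {}
    for seed in sorted(final_loss[decay_name]):
        best_static = min(final_loss[c][seed] for c in statics)
        # Negative delta means the decay schedule has lower final loss.
        deltas[seed] = final_loss[decay_name][seed] - best_static
    wins = sum(1 for d in deltas.values() if d < 0)
    return wins, deltas


def verdict(wins):
    """Map a win count out of five paired seeds to the decision rule."""
    if wins == 5:
        return "promote regime to strong evidence"
    if wins >= 3:
        return "inspect effect size, variance, and trajectory tradeoff"
    return "treat as a failed regime or schedule"
```

Keeping the per-seed deltas alongside the win count makes it easy to report how narrow any miss was.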
+### New Regime Step 8: Smoke Check And Commit
+Run:
+```bash
+.venv/bin/python -m py_compile \
+  scripts/run_experiments.py \
+  scripts/fit_dropout_coefficients.py \
+  scripts/make_streaming_anchors.py \
+  scripts/summarize_streaming_multiseed.py
+```
+What this script run does:
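+This is a code integrity check. It does not validate the scientific result, but
+it catches syntax errors in the scripts required to reproduce the regime.
+`py_compile` only compiles each file to bytecode and does not execute it, so
+missing imports and other runtime errors are not detected at this step.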
+After the smoke check, update this `docs/plan.md` ledger and commit:
+```text
+docs/<regime>_streaming_report.md
+runs/<regime>_streaming_report/<model>_validation_5seed/
+runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/
+runs/coefficient_calibration/<regime>_interaction/
+```
+Do not commit temporary checkpoints or external corpus files unless they are
+small, intentionally versioned, and needed for reproducibility.
 ## Current Regime Ledger
 | Regime | Status | Role |
 | `smooth_low` | 4/5, with the one miss only `+0.0003` |
 The immediate risk is no longer seed count for TinyStories or OpenWebText10K.
+The main remaining risk is external validity beyond the three tested text
+regimes and robustness across controlled architecture or token-budget changes.
+The current defensible claim is:
 ```text
 Formula-derived dropout schedules track the moving useful dropout region and
 Formula-derived dropout decay beats the best static dropout.
 ```
+is supported at `n=5` in TinyStories, OpenWebText10K, and WikiText-103. The
+strongest schedule in each of the three regimes beats the per-seed best static
+baseline in all five seeds.
 Latest OpenWebText10K 5-seed streaming final-loss table:
 | static `0.02` | 4.5358 | 0.0091 |
 | static `0.00` | 4.5943 | 0.0216 |
+OpenWebText10K condition provenance:
+| Condition | Provenance | How to interpret it |
+|---|---|---|
+| `openwebtext10k_interaction` | coefficient-derived interaction schedule | main OpenWebText10K formula hypothesis test |
+| `hold_30_then_decay` | heuristic schedule-search ablation | manually specified after exploratory single-seed OpenWebText10K schedule search; not generated from coefficients |
+| `mild_30_to_08` | heuristic schedule-search ablation | manually specified after exploratory single-seed OpenWebText10K schedule search; not generated from coefficients |
+| `fitted_l16_static_law` | older fitted/static-law schedule | retained as a comparison to the earlier aggressive fitted path |
+| static conditions | fixed dropout baselines | same dropout at every stream prefix |
+The heuristic OpenWebText10K schedules were chosen from failure analysis, not
+from the final coefficient formula. The older `fitted_l16_static_law` path
+started too high (`0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02`). Static dropout
+`0.30` looked useful early but was worse at the final 4M-token stage, and
+static dropout `0.14` was the strongest static endpoint. These observations
+motivated two manual ablations:
+```text
+hold_30_then_decay = 0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02
+mild_30_to_08      = 0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08
+```
+These ablations support the broader mechanism that stream-dependent dropout can
+matter, but they should not be used as evidence that the coefficient formula
+generated those exact schedules. The formula claim for OpenWebText10K should be
+based on `openwebtext10k_interaction`.
 Paired final-loss result:
 | Decay schedule | Paired wins vs best static |
 ## Next Training After Current Gate
+No MPS training should launch until the three completed five-seed streaming
+reports are read together. Since a third held-out text regime is no longer the
+limiting issue, use the next run only for a narrowed robustness test:
 ```text
 completed: TinyStories 5-seed streaming report