| # Cross-Regime Hypothesis Testing Plan |
|
|
| Date started: 2026-05-30 |
|
|
| This is the standing protocol for testing the dropout-pressure hypothesis across |
| regimes. Use this file before launching new experiments so formula changes are |
| backtested against existing results first, instead of repeatedly rerunning |
| expensive training. |
|
|
| For the detailed explanation of how coefficients are derived and how the |
| formula is tested, see [formula_coefficient_methodology.md](formula_coefficient_methodology.md). |
|
|
| Current operating decision: static coefficient backtests are internal gates; |
| final evidence should be streaming multi-seed validation reports per regime. |
| When a new regime or formula variant appears, fit it against existing saved |
| results first, then decide whether new MPS experiments are actually needed. |
|
|
| ## Research Hypothesis |
|
|
| For a fixed training regime, the useful dropout rate is governed by pressure |
| from model size, available unique data, and cumulative sampled training tokens. |
| As a stream grows, this pressure changes, so a formula-derived dropout schedule |
| can track the moving useful regularization region better than a hand-picked |
| static dropout. |
|
|
| The current candidate formula family is: |
|
|
| ```text |
| p_t = clamp(p_min, p_max, |
|     A * log10(P / U_t) |
|     + B * log10(C_t / U_t) |
|     + D * log10(P / U_t) * log10(C_t / U_t) |
|     + C0) |
| ``` |
|
|
| The simpler first-order ablation is: |
|
|
| ```text |
| p_t = clamp(p_min, p_max, |
|     A * log10(P / U_t) |
|     + B * log10(C_t / U_t) |
|     + C0) |
| ``` |
|
|
| Where: |
|
|
| | Symbol | Meaning | |
| |---|---| |
| | `P` | model parameter count | |
| | `U_t` | unique tokens available at stream stage `t` | |
| | `C_t` | cumulative sampled training tokens consumed by the optimizer by stage `t` | |
| | `p_t` | active dropout rate at stage `t` | |
| | `A` | model/data pressure coefficient | |
| | `B` | sampled-token pressure coefficient | |
| | `D` | interaction coefficient between model pressure and sampled-token pressure | |
| | `C0` | regime offset | |
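The two formulas above can be read as one function in which `D = 0` recovers the first-order ablation. The following is a minimal sketch, not the project's implementation: the function names and the default `p_min`/`p_max` values are assumptions chosen for illustration, and the coefficients in the usage line are hypothetical rather than fitted.

```python
import math


def clamp(lo, hi, value):
    """Clamp value into [lo, hi], using the argument order from this plan."""
    return max(lo, min(hi, value))


def dropout_rate(P, U_t, C_t, A, B, C0, D=0.0, p_min=0.0, p_max=0.30):
    """Evaluate the candidate formula for one stream stage.

    D=0.0 gives the first-order ablation; a nonzero D gives the interaction
    form. The p_min and p_max defaults are illustrative assumptions.
    """
    x = math.log10(P / U_t)    # model/data pressure
    y = math.log10(C_t / U_t)  # sampled-token pressure
    return clamp(p_min, p_max, A * x + B * y + D * x * y + C0)


# Hypothetical coefficients, for illustration only (not fitted values).
print(dropout_rate(P=5e6, U_t=1e6, C_t=4e6, A=0.05, B=-0.02, C0=0.10, D=0.03))
```

When `P`, `U_t`, and `C_t` are all equal, both log terms vanish and the formula returns the clamped regime offset `C0`, which is a quick sanity check for any implementation.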
|
|
| ## Regime Definition |
|
|
| A regime is the full experimental environment in which coefficients are assumed |
| to stay valid: |
|
|
| ```text |
| architecture family |
| + tokenizer |
| + corpus family |
| + optimizer and learning-rate protocol |
| + dropout placement and semantics |
| + streaming protocol |
| + evaluation distribution |
| ``` |
|
|
| Within a regime, `P`, `U_t`, and `C_t` are formula inputs. Changing those values |
| should not require new coefficients. Refit coefficients when the corpus, |
| tokenizer, architecture class, optimizer recipe, dropout semantics, streaming |
| protocol, or validation distribution changes. |
|
|
| ## Non-Negotiable Rules |
|
|
| 1. Run torch training experiments on MPS only. If MPS is unavailable, stop and |
| report it. |
| 2. Before launching any new MPS training, backtest the current formula family on |
| all relevant existing saved results. |
| 3. Do not rerun a regime merely because the formula changed. Refit and backtest |
| offline first. |
| 4. Treat coefficient fitting and streaming validation as different claims: |
| coefficient fitting estimates useful static dropout; streaming validation |
| tests whether those estimates form a good path-dependent schedule. |
| 5. Keep exploratory one-seed results separate from paper-grade multi-seed |
| results. |
|
|
| ## Backtest-First Workflow |
|
|
| Run this workflow whenever a formula family changes or a new regime is added. |
|
|
| ### Step 1: Freeze Candidate Formula Families |
|
|
| Define the exact formula families being tested before looking at new training |
| results: |
|
|
| | Name | Formula | Purpose | |
| |---|---|---| |
| | `base_abc` | `A*x + B*y + C0` | first-order pressure law ablation | |
| | `interaction` | `A*x + B*y + D*x*y + C0` | current main candidate | |
| | optional higher-order | quadratic or corpus terms | only if simpler forms fail | |
|
|
| Where: |
|
|
| ```text |
| x = log10(P / U_t) |
| y = log10(C_t / U_t) |
| ``` |
|
|
| ### Step 2: Inventory Existing Results |
|
|
| Before training anything new, enumerate saved runs by regime and decide which |
| ones can be used for offline fitting or validation. |
|
|
| For each result source, record: |
|
|
| | Field | Required detail | |
| |---|---| |
| | regime name | short stable label | |
| | run path | directory containing `summary.csv`, `metrics.jsonl`, or equivalent | |
| | model specs | model names and parameter counts | |
| | token prefixes | unique-token limits used | |
| | sampled tokens | steps * batch size * block size, or equivalent | |
| | dropout grid | rates tested | |
| | seeds | seed count | |
| | target extraction | grid best, quadratic optimum, or boundary-marked optimum | |
| | quality flags | bracketed, boundary optimum, flat curve, noisy curve | |
|
|
| ### Step 3: Fit Within Each Regime |
|
|
| For every regime separately: |
|
|
| 1. Extract the observed static optimum for each `(P, U, C)` cell. |
| 2. Fit `base_abc`. |
| 3. Fit `interaction`. |
| 4. Optionally fit a higher-order variant only if the first two fail. |
| 5. Report coefficient values, RMSE, MAE, and residual direction. |
|
|
| Use boundary optima carefully. Keep them in the report, but downweight or flag |
| them if the static dropout curve is not bracketed. |
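The fitting steps above amount to a weighted least-squares problem over `(x, y, p*)` cells. The sketch below is one way to do that with the standard library only, assuming each cell has already been reduced to `x = log10(P / U)`, `y = log10(C / U)`, and an observed optimum; it is not the code in `scripts/fit_dropout_coefficients.py`. All function names are hypothetical and the demo cells are synthetic.

```python
def solve_linear(M, b):
    """Solve M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design: need more distinct (x, y) cells")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def features(x, y, family):
    """Feature row for one cell; coefficient order is A, B, (D), C0."""
    if family == "base_abc":
        return [x, y, 1.0]
    if family == "interaction":
        return [x, y, x * y, 1.0]
    raise ValueError(f"unknown family: {family}")


def fit(cells, family, weights=None):
    """Weighted least squares via the normal equations (F^T W F) beta = F^T W p."""
    weights = weights or [1.0] * len(cells)
    rows = [features(x, y, family) for x, y, _ in cells]
    k = len(rows[0])
    M = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(k)]
         for i in range(k)]
    b = [sum(w * r[i] * cell[2] for w, r, cell in zip(weights, rows, cells))
         for i in range(k)]
    return solve_linear(M, b)


def mae(cells, beta, family):
    """Mean absolute error of the fitted law over the given cells."""
    errors = [abs(sum(f * c for f, c in zip(features(x, y, family), beta)) - p)
              for x, y, p in cells]
    return sum(errors) / len(errors)


# Synthetic cells generated from made-up coefficients, for illustration only.
demo = [(x, y, 0.05 * x - 0.02 * y + 0.03 * x * y + 0.10)
        for x in (-1.0, 0.0, 1.0) for y in (0.0, 1.0, 2.0)]
for family in ("base_abc", "interaction"):
    beta = fit(demo, family)
    print(family, [round(v, 4) for v in beta], round(mae(demo, beta, family), 4))
```

Passing a smaller weight for boundary or unbracketed cells is one way to downweight them while still keeping them in the report.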
|
|
| ### Step 4: Validate Without New Training |
|
|
| For every regime with enough cells, run: |
|
|
| | Validation | Meaning | |
| |---|---| |
| | leave-model-out | fit on some model sizes, test held-out model size | |
| | leave-prefix-out | fit on some unique-token prefixes, test held-out prefix | |
| | leave-source-out | fit on one run source, test another run source | |
| | cross-regime transfer | fit on one regime, test another regime | |
|
|
| Expected result: |
|
|
| ```text |
| within-regime fit should be good; |
| cross-regime raw coefficient transfer may be weaker; |
| formula structure should still explain why coefficients differ. |
| ``` |
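The leave-model-out, leave-prefix-out, and leave-source-out checks above share one pattern: group the cells by a key, hold one group out, fit on the rest, and score the held-out group. A minimal sketch of that pattern follows. The cell dictionary layout, the `p_star` key, and the function name are assumptions, and the mean predictor in the usage example is a stand-in for the real pressure-law fit.

```python
from collections import defaultdict


def leave_one_group_out(cells, group_key, fit_fn, predict_fn):
    """Return {held-out group: MAE on that group} for one grouping key."""
    groups = defaultdict(list)
    for cell in cells:
        groups[cell[group_key]].append(cell)
    scores = {}
    for held_out, test_cells in groups.items():
        train = [c for c in cells if c[group_key] != held_out]
        model = fit_fn(train)
        errors = [abs(predict_fn(model, c) - c["p_star"]) for c in test_cells]
        scores[held_out] = sum(errors) / len(errors)
    return scores


# Stand-in model: predict the mean training optimum (illustration only).
def fit_mean(train):
    return sum(c["p_star"] for c in train) / len(train)


def predict_mean(model, cell):
    return model


# Synthetic cells, not project results.
cells = [
    {"model": "M1", "prefix": 250_000, "p_star": 0.10},
    {"model": "M1", "prefix": 500_000, "p_star": 0.20},
    {"model": "M2", "prefix": 250_000, "p_star": 0.30},
]
print(leave_one_group_out(cells, "model", fit_mean, predict_mean))
print(leave_one_group_out(cells, "prefix", fit_mean, predict_mean))
```

Using the same helper with `group_key` set to a regime label gives the cross-regime transfer row of the table.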
|
|
| ### Step 5: Backtest Streaming Runs Already on Disk |
|
|
| For existing streaming runs, do not refit on the streaming outcome first. |
| Instead: |
|
|
| 1. Generate schedule values from the frozen coefficients. |
| 2. Compare them to the tested decay/static conditions already present. |
| 3. Report stage-wise and final-loss deltas versus the best static baseline. |
| 4. Mark whether the formula schedule wins, ties, or loses. |
|
|
| This separates two questions: |
|
|
| ```text |
| Can the formula estimate static useful dropout? |
| Can the static estimate be used directly as a streaming schedule? |
| ``` |
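The stage-wise comparison in Step 5 can be sketched as below, assuming per-stage validation losses are already loaded for the formula schedule and for each static baseline. The function name, the input layout, the choice of "best static" as lowest final-stage loss, and the tie tolerance are all assumptions made for this illustration.

```python
def backtest_schedule(schedule_losses, static_losses, tie_tol=0.001):
    """Compare a frozen schedule against the best static baseline.

    schedule_losses: validation loss per stage for the formula schedule.
    static_losses:   {dropout rate: [validation loss per stage]}.
    tie_tol:         hypothetical tolerance for calling the final loss a tie.
    """
    # Assumption: the best static baseline is the one with the lowest final loss.
    best_rate = min(static_losses, key=lambda rate: static_losses[rate][-1])
    best = static_losses[best_rate]
    stage_deltas = [s - b for s, b in zip(schedule_losses, best)]
    final_delta = stage_deltas[-1]  # negative means the schedule is better
    if final_delta < -tie_tol:
        verdict = "wins"
    elif final_delta > tie_tol:
        verdict = "loses"
    else:
        verdict = "ties"
    return {
        "best_static_rate": best_rate,
        "stage_deltas": stage_deltas,
        "final_delta": final_delta,
        "verdict": verdict,
    }


# Synthetic losses, for illustration only.
report = backtest_schedule(
    schedule_losses=[3.00, 2.50],
    static_losses={0.10: [3.10, 2.60], 0.20: [2.90, 2.70]},
)
print(report)
```

Keeping the stage deltas alongside the verdict makes it visible when a schedule wins the final stage only after losing earlier ones, or the reverse.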
|
|
| ## Decision Gates |
|
|
| ### Coefficient Gate |
|
|
| Promote a formula family for a regime only if it satisfies: |
|
|
| | Criterion | Target | |
| |---|---| |
| | within-regime MAE | preferably under `0.05` dropout | |
| | leave-model MAE | preferably under `0.05` dropout | |
| | leave-prefix MAE | preferably under `0.05` dropout | |
| | residual bias | no systematic over/under prediction across `P/U` or `C/U` | |
| | interpretability | coefficients have a defensible pressure-law explanation | |
|
|
| If `base_abc` fails but `interaction` passes, present `base_abc` as the |
| first-order law and `interaction` as the necessary second-order correction. |
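The numeric parts of the coefficient gate can be checked mechanically. The sketch below assumes the three MAE values and per-cell residuals are already computed; it tests residual bias as a least-squares slope of residuals against `x` and `y`, which is only one possible reading of "no systematic over/under prediction". The `slope_tol` threshold and all names are hypothetical, and the interpretability criterion still needs human judgment.

```python
def residual_slope(values, residuals):
    """Least-squares slope of residuals against one feature (x or y)."""
    n = len(values)
    mean_v = sum(values) / n
    mean_r = sum(residuals) / n
    var = sum((v - mean_v) ** 2 for v in values)
    if var == 0:
        return 0.0
    cov = sum((v - mean_v) * (r - mean_r) for v, r in zip(values, residuals))
    return cov / var


def coefficient_gate(within_mae, leave_model_mae, leave_prefix_mae,
                     xs, ys, residuals, mae_target=0.05, slope_tol=0.02):
    """Return (passed, per-criterion checks) for the numeric gate criteria.

    mae_target follows the 0.05 target in the table above; slope_tol is a
    hypothetical threshold for flagging a residual trend.
    """
    checks = {
        "within_regime_mae": within_mae < mae_target,
        "leave_model_mae": leave_model_mae < mae_target,
        "leave_prefix_mae": leave_prefix_mae < mae_target,
        "no_trend_vs_x": abs(residual_slope(xs, residuals)) < slope_tol,
        "no_trend_vs_y": abs(residual_slope(ys, residuals)) < slope_tol,
    }
    return all(checks.values()), checks


# Synthetic inputs, for illustration only.
passed, checks = coefficient_gate(
    0.02, 0.03, 0.04,
    xs=[0.0, 1.0, 2.0, 3.0], ys=[0.0, 0.0, 1.0, 1.0],
    residuals=[0.01, -0.01, 0.01, -0.01],
)
print(passed, checks)
```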
|
|
| ### Streaming Gate |
|
|
| Run paper-grade streaming only after the coefficient gate passes. |
|
|
| A decay schedule passes strongly if: |
|
|
| ```text |
| mean final validation loss beats the best static baseline across seeds |
| and the win appears in most paired seed comparisons. |
| ``` |
|
|
| A decay schedule passes weakly if: |
|
|
| ```text |
| it ties the best hand-picked static dropout while avoiding bad static choices |
| across stream stages. |
| ``` |
|
|
| It fails if: |
|
|
| ```text |
| it loses to a simple static baseline in most seeds |
| or improves early loss only by sacrificing final loss. |
| ``` |
|
|
| ## Failure Handling |
|
|
| Use this decision tree before adding new experiments: |
|
|
| ```text |
| Formula changed or new regime added |
|   | |
|   v |
| Backtest on all existing saved results |
|   | |
|   v |
| Within-regime static fit passes? |
|   | |
|   no -> inspect target extraction, boundary cells, feature family |
|   | |
|   yes |
|   v |
| Held-out static validation passes? |
|   | |
|   no -> test interaction or limited higher-order correction offline |
|   | |
|   yes |
|   v |
| Existing streaming backtest passes or ties? |
|   | |
|   no -> consider streaming-specific transform before new training |
|   | |
|   yes |
|   v |
| Launch narrowed multi-seed streaming validation |
| ``` |
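The decision tree above maps three pass/fail checks to a single next action. A small sketch of that mapping follows; the function and argument names are hypothetical, and the returned strings are copied from the tree.

```python
def next_action(within_fit_ok, held_out_ok, streaming_backtest_ok):
    """Return the next step from the decision tree, checking gates in order."""
    if not within_fit_ok:
        return "inspect target extraction, boundary cells, feature family"
    if not held_out_ok:
        return "test interaction or limited higher-order correction offline"
    if not streaming_backtest_ok:
        return "consider streaming-specific transform before new training"
    return "launch narrowed multi-seed streaming validation"


print(next_action(True, True, False))
```

The gates are ordered, so a failed earlier check short-circuits the later ones: new training is only reached when all three offline checks pass.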
|
|
| If a pass condition fails, do not immediately launch a larger sweep. First |
| decide whether the failure is due to: |
|
|
| | Failure type | Response | |
| |---|---| |
| | unbracketed static optimum | add only the missing dropout-side points | |
| | flat/noisy curve | increase seeds or eval batches before changing formula | |
| | bad held-out prefix | add pressure interaction or revise `C/U` treatment | |
| | bad held-out model | inspect parameter-count scaling and architecture invariance | |
| | streaming loses despite static fit | fit a static-to-streaming transform, then backtest | |
|
|
| ## Standard Experiment Order |
|
|
| Use this order for every regime. |
|
|
| 1. Backtest current formula family on all existing data. |
| 2. Fit coefficients within each existing regime. |
| 3. Produce cross-regime coefficient and error table. |
| 4. Decide whether the formula family is stable enough. |
| 5. If stable, run narrowed multi-seed streaming in the best current regime. |
| 6. If still stable, add a new regime with minimal calibration. |
| 7. Immediately backtest the new regime against all previous regimes. |
| 8. Only then run expensive streaming validation in the new regime. |
|
|
| ## New Regime Script Runbook |
|
|
| Use this exact command sequence for any new regime. Replace placeholders such as |
| `<regime>`, `<CORPUS_OR_PARQUET_PATH>`, `<MODEL_SPEC>`, and `<TIMESTAMP>` with |
| concrete values before launching. Do not skip from calibration directly to |
| streaming: the schedule must be frozen from the coefficient fit before the |
| streaming run starts. |
|
|
| This section is intentionally verbose. Its purpose is to make future regimes |
| auditable: an external reader should be able to tell what each script did, what |
| file it produced, and which decision gate came next. |
|
|
| ### New Regime Step 0: MPS Preflight |
|
|
| Run this before any torch training command: |
|
|
| ```bash |
| .venv/bin/python -c "import torch; print({'mps_built': torch.backends.mps.is_built(), 'mps_available': torch.backends.mps.is_available(), 'cuda_available': torch.cuda.is_available()}); raise SystemExit(0 if torch.backends.mps.is_available() else 1)" |
| ``` |
|
|
| What this does: |
|
|
| | Check | Meaning | |
| |---|---| |
| | `mps_built` | PyTorch was compiled with Apple MPS support | |
| | `mps_available` | this machine can actually run MPS now | |
| | `cuda_available` | should not be used for this project | |
|
|
| Decision rule: |
|
|
| ```text |
| if mps_available is false: stop and report |
| if cuda_available is true: still do not use CUDA |
| ``` |
|
|
| Also check for duplicate experiment processes before launching a long run. This |
| is not part of the coefficient method, but it prevents concurrent runs from |
| corrupting timing and resource comparisons. |
|
|
| ### New Regime Step 1: Static Dropout Calibration Screen |
|
|
| Run: |
|
|
| ```bash |
| .venv/bin/python scripts/run_experiments.py \ |
| --mode screen_static \ |
| --corpus <CORPUS_OR_PARQUET_PATH> \ |
| --text-column <TEXT_COLUMN_IF_PARQUET> \ |
| --cache-dir .cache/dropout_decay_<regime> \ |
| --output-dir runs/<regime>_static_screen \ |
| --models <M1=layersxheadsxdim> <M2=layersxheadsxdim> <M3=layersxheadsxdim> \ |
| --seeds 1 2 \ |
| --token-limits <U1> <U2> <U3> <U4> \ |
| --dropout-rates 0 0.02 0.04 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \ |
| --steps <STATIC_STEPS> \ |
| --batch-size <BATCH> \ |
| --block-size <BLOCK> \ |
| --eval-batches <EVAL_BATCHES> \ |
| --train-eval-batches <TRAIN_EVAL_BATCHES> \ |
| --trace-eval-batches <TRACE_EVAL_BATCHES> \ |
| --vocab-size <VOCAB_SIZE> \ |
| --val-tokens <VAL_TOKENS> \ |
| --lr <LR> \ |
| --weight-decay <WEIGHT_DECAY> \ |
| --grad-clip 1.0 \ |
| --screen-early-stop |
| ``` |
|
|
| What this script run does: |
|
|
| `scripts/run_experiments.py --mode screen_static` trains a grid of static |
| dropout models. It does not test the final decay hypothesis. It estimates the |
| best static dropout rate for each calibration cell: |
|
|
| ```text |
| cell = (model parameter count P, prefix/unique tokens U, sampled tokens C) |
| ``` |
|
|
| For each cell, the script evaluates a fixed dropout grid and writes the |
| validation curve. The curve is used later to extract the target dropout `p*`. |
|
|
| Expected outputs under `runs/<regime>_static_screen/screen_static/<TIMESTAMP>/`: |
|
|
| | File | Use | |
| |---|---| |
| | `metrics.jsonl` | per-run raw metrics; includes token limit, model, seed, losses, and tokens seen | |
| | `model_selection.csv` | per-cell static dropout curve and selected best dropout | |
| | `summary.csv` / `summary.json` | compact aggregate summary | |
| | `trace.jsonl` | lower-frequency trace for diagnostics | |
| | `RESULT_SUMMARY.md` | human-readable first-pass summary | |
|
|
| Why this is needed: |
|
|
| The coefficient formula is not fitted from streaming outcomes. It is fitted |
| from static dropout optima. This separation is essential: calibration estimates |
| where useful regularization sits; streaming validation tests whether following |
| that moving estimate helps. |
|
|
| Recommended cheap calibration: |
|
|
| | Dimension | Default | |
| |---|---| |
| | models | at least 3 model sizes if testing coefficient generality | |
| | token prefixes | at least 4 prefixes | |
| | seeds | 1-2 for calibration, 5 only for final streaming validation | |
| | dropout grid | include low, middle, and high values so the optimum can be bracketed | |
|
|
| Decision rule: |
|
|
| ```text |
| continue if most cells have a bracketed or near-bracketed optimum |
| refine if many best dropouts sit at the edge of the grid |
| stop and inspect if validation curves are flat/noisy enough that p* is unstable |
| ``` |
|
|
| ### New Regime Step 2: Fit First-Order Base Coefficients |
|
|
| Run: |
|
|
| ```bash |
| .venv/bin/python scripts/fit_dropout_coefficients.py \ |
| --run-dirs runs/<regime>_static_screen/screen_static/<TIMESTAMP> \ |
| --output-dir runs/coefficient_calibration/<regime>_base \ |
| --target quad \ |
| --weighting heuristic \ |
| --feature-set base \ |
| --min-rate 0.0 \ |
| --max-rate 0.30 |
| ``` |
|
|
| What this script run does: |
|
|
| `scripts/fit_dropout_coefficients.py` reads `model_selection.csv` and |
| `metrics.jsonl` from the static screen. It converts each calibration cell into: |
|
|
| ```text |
| x = log10(P / U) |
| y = log10(C / U) |
| target = observed useful static dropout p* |
| ``` |
|
|
| With `--feature-set base`, it fits the first-order ablation: |
|
|
| ```text |
| p* ~= A*x + B*y + C0 |
| ``` |
|
|
| With `--target quad`, the target `p*` is the local quadratic minimum around the |
| best dropout grid point when the curve is bracketed. If the curve is not |
| bracketed, the script falls back to the grid best and marks the cell as weaker |
| evidence. |
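The `--target quad` behavior described above can be illustrated with a three-point parabola through the best grid point and its two neighbors. This is a sketch under that assumption: the actual script may fit the local quadratic differently, and the function name and return shape are hypothetical.

```python
def quadratic_target(rates, losses):
    """Return (p_star, bracketed) for one static dropout curve.

    rates must be sorted ascending. When the best grid point has a neighbor on
    both sides, p_star is the vertex of the parabola through those three
    points; otherwise the grid best is returned and flagged as unbracketed.
    """
    i = min(range(len(rates)), key=lambda k: losses[k])
    if i == 0 or i == len(rates) - 1:
        return rates[i], False  # boundary optimum: weaker evidence
    x0, x1, x2 = rates[i - 1], rates[i], rates[i + 1]
    y0, y1, y2 = losses[i - 1], losses[i], losses[i + 1]
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0:
        return x1, True  # all three losses equal: keep the grid point
    vertex = x1 - 0.5 * num / den
    # Guard against numerical drift outside the bracketing neighbors.
    return max(x0, min(x2, vertex)), True


# Synthetic curve with a true minimum at 0.10, for illustration only.
rates = [0.0, 0.06, 0.08, 0.14, 0.30]
losses = [(r - 0.10) ** 2 for r in rates]
print(quadratic_target(rates, losses))
```

The grid in this example does not contain `0.10`, so the interpolated target differs from the grid best of `0.08`; that gap is what the quadratic target is meant to recover.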
|
|
| With `--weighting heuristic`, the fit downweights cells that are less reliable: |
|
|
| | Cell condition | Why it is weaker | |
| |---|---| |
| | boundary optimum | true optimum may be outside the tested dropout grid | |
| | not bracketed | local quadratic minimum is less trustworthy | |
| | very flat curve | many dropout rates perform nearly the same | |
| | noisy best loss | target dropout is less stable | |
|
|
| Expected outputs under `runs/coefficient_calibration/<regime>_base/`: |
|
|
| | File | Use | |
| |---|---| |
| | `coefficients.json` | fitted `A`, `B`, `C0`, metrics, and cross-validation scores | |
| | `fit_diagnostics.md` | readable coefficient table, formula, fit metrics, and cell residuals | |
| | `calibration_cells.csv` | one row per fitted cell with target, prediction, residual, and flags | |
| | `next_dropout_suggestions.csv` | dropout rates to add if a cell needs refinement | |
|
|
| Why this is needed: |
|
|
| The base model is the simplest pressure-law hypothesis. It is the ablation that |
| tells reviewers whether the interaction term is actually necessary. |
|
|
| Decision rule: |
|
|
| ```text |
| if base MAE and held-out errors are already low: keep it as a strong ablation |
| if base has biased residuals or higher MAE: compare against interaction next |
| ``` |
|
|
| ### New Regime Step 3: Fit Interaction Coefficients |
|
|
| Run: |
|
|
| ```bash |
| .venv/bin/python scripts/fit_dropout_coefficients.py \ |
| --run-dirs runs/<regime>_static_screen/screen_static/<TIMESTAMP> \ |
| --output-dir runs/coefficient_calibration/<regime>_interaction \ |
| --target quad \ |
| --weighting heuristic \ |
| --feature-set interaction \ |
| --min-rate 0.0 \ |
| --max-rate 0.30 |
| ``` |
|
|
| What this script run does: |
|
|
| This repeats the same target extraction and weighted least-squares fitting, but |
| uses the interaction pressure law: |
|
|
| ```text |
| p* ~= A*x + B*y + D*x*y + C0 |
| ``` |
|
|
| The extra term `D*x*y` lets model/data pressure and sampled-token pressure |
| interact. Empirically, this has mattered because dropout pressure is not always |
| additive: the useful effect of seeing more cumulative sampled tokens can depend |
| on how oversized the model is relative to the available unique data. |
|
|
| Expected outputs are the same as Step 2, but under: |
|
|
| ```text |
| runs/coefficient_calibration/<regime>_interaction/ |
| ``` |
|
|
| Decision rule: |
|
|
| ```text |
| promote interaction if it lowers MAE/RMSE, improves leave-prefix/leave-model |
| validation, and does not create obvious residual bias |
| ``` |
|
|
| Do not promote the interaction form merely because it has more parameters. The |
| paper needs the base-vs-interaction comparison to show that the extra term buys |
| predictive accuracy, not just in-sample flexibility. |
|
|
| ### New Regime Step 4: Optional Static Refinement |
|
|
| Only run this if `fit_diagnostics.md` or `next_dropout_suggestions.csv` shows |
| that important cells are weakly identified. |
|
|
| Run: |
|
|
| ```bash |
| .venv/bin/python scripts/run_experiments.py \ |
| --mode screen_static \ |
| --resume-from runs/<regime>_static_screen/screen_static/<TIMESTAMP> \ |
| --use-cached-data \ |
| --cache-dir .cache/dropout_decay_<regime> \ |
| --output-dir runs/<regime>_static_refined \ |
| --models <ONLY_AFFECTED_MODELS> \ |
| --seeds 1 2 \ |
| --token-limits <ONLY_AFFECTED_PREFIXES> \ |
| --dropout-rates <SUGGESTED_RATES> \ |
| --steps <STATIC_STEPS> \ |
| --batch-size <BATCH> \ |
| --block-size <BLOCK> \ |
| --eval-batches <EVAL_BATCHES> \ |
| --train-eval-batches <TRAIN_EVAL_BATCHES> \ |
| --trace-eval-batches <TRACE_EVAL_BATCHES> \ |
| --vocab-size <VOCAB_SIZE> \ |
| --val-tokens <VAL_TOKENS> \ |
| --lr <LR> \ |
| --weight-decay <WEIGHT_DECAY> \ |
| --grad-clip 1.0 |
| ``` |
|
|
| What this script run does: |
|
|
| This adds only missing static dropout points. It should not rerun the full grid. |
| `--resume-from` lets the experiment skip rows already completed in the original |
| static screen. `--use-cached-data` reuses the cached tokenizer and token arrays |
| so refinement is measuring dropout/model behavior, not data preprocessing |
| differences. |
|
|
| When to use it: |
|
|
| | Trigger | Refinement action | |
| |---|---| |
| | best dropout is at grid edge | add rates beyond or near that edge if allowed | |
| | curve is too coarse near optimum | add rates around the local best | |
| | static curve is flat | add seeds or eval batches before changing the formula | |
|
|
| After refinement, rerun Steps 2 and 3 with all relevant run dirs. At minimum, |
| rerun the promoted feature family. If the paper will compare base versus |
| interaction after refinement, rerun both. |
|
|
| ```bash |
| .venv/bin/python scripts/fit_dropout_coefficients.py \ |
| --run-dirs \ |
| runs/<regime>_static_screen/screen_static/<TIMESTAMP> \ |
| runs/<regime>_static_refined/screen_static/<TIMESTAMP> \ |
| --output-dir runs/coefficient_calibration/<regime>_interaction_refined \ |
| --target quad \ |
| --weighting heuristic \ |
| --feature-set interaction \ |
| --min-rate 0.0 \ |
| --max-rate 0.30 |
| ``` |
|
|
| Decision rule: |
|
|
| ```text |
| refinement is complete when the promoted coefficient fit has acceptable MAE, |
| held-out errors, and no obvious residual direction across P/U or C/U |
| ``` |
|
|
| ### New Regime Step 5: Generate Frozen Streaming Anchors |
|
|
| Run: |
|
|
| ```bash |
| .venv/bin/python scripts/make_streaming_anchors.py \ |
| --coefficients-json <PROMOTED_COEFFICIENTS_JSON> \ |
| --name <regime>_interaction \ |
| --parameters <WINNER_MODEL_PARAM_COUNT> \ |
| --stream-token-caps <U1> <U2> <U3> <U4> <U5> \ |
| --stage-steps <STAGE_STEPS> \ |
| --batch-size <BATCH> \ |
| --block-size <BLOCK> \ |
| --min-rate 0.02 \ |
| --max-rate 0.65 \ |
| --precision 3 |
| ``` |
|
|
| What this script run does: |
|
|
| `scripts/make_streaming_anchors.py` turns `coefficients.json` into the exact |
| dropout schedule used by `locked_stream`. For each stream prefix, it computes: |
|
|
| ```text |
| P = chosen model parameter count |
| U_t = stream prefix tokens at stage t |
| C_t = cumulative sampled optimizer tokens through stage t |
| x_t = log10(P / U_t) |
| y_t = log10(C_t / U_t) |
| p_t = clamp(p_min, p_max, A*x_t + B*y_t + D*x_t*y_t + C0) |
| ``` |
|
|
| The script prints two things: |
|
|
| 1. a JSON diagnostic table with raw and clipped dropout values |
| 2. a final one-line anchor spec, for example: |
|
|
| ```text |
| <regime>_interaction:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 |
| ``` |
|
|
| That final line is copied into the next command as `--anchor-decays`. |
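The schedule computation and the one-line anchor spec described above can be sketched as follows. This is not the code in `scripts/make_streaming_anchors.py`. It assumes every stage runs the same number of optimizer steps, so `C_t` grows by `stage_steps * batch_size * block_size` per stage, and that the coefficient dictionary uses the keys `A`, `B`, `D`, and `C0`; both are assumptions, as is the function name.

```python
import math


def make_anchor_spec(name, coeffs, P, token_caps, stage_steps, batch_size,
                     block_size, p_min=0.02, p_max=0.65, precision=3):
    """Build a `name:U=p,...` anchor spec from frozen coefficients."""
    tokens_per_stage = stage_steps * batch_size * block_size
    parts = []
    for stage, U_t in enumerate(token_caps, start=1):
        C_t = stage * tokens_per_stage          # cumulative sampled tokens (assumed)
        x_t = math.log10(P / U_t)
        y_t = math.log10(C_t / U_t)
        raw = (coeffs["A"] * x_t + coeffs["B"] * y_t
               + coeffs.get("D", 0.0) * x_t * y_t + coeffs["C0"])
        p_t = max(p_min, min(p_max, raw))       # clip into [p_min, p_max]
        parts.append(f"{U_t}={p_t:.{precision}f}")
    return f"{name}:" + ",".join(parts)


# Hypothetical coefficients and model size, for illustration only.
print(make_anchor_spec(
    "demo_interaction",
    {"A": 0.05, "B": -0.02, "D": 0.03, "C0": 0.10},
    P=5_000_000,
    token_caps=[250_000, 500_000, 1_000_000],
    stage_steps=1000, batch_size=16, block_size=128,
))
```

Comparing the raw value with the clipped one at each stage, as the real script's diagnostic table does, is a cheap way to spot a schedule that saturates at `p_min` or `p_max`.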
|
|
| `<PROMOTED_COEFFICIENTS_JSON>` should point to the coefficient file selected by |
| the coefficient gate. In a clean first pass, this is usually: |
|
|
| ```text |
| runs/coefficient_calibration/<regime>_interaction/coefficients.json |
| ``` |
|
|
| If optional refinement was needed and accepted, use the refined coefficient |
| file instead: |
|
|
| ```text |
| runs/coefficient_calibration/<regime>_interaction_refined/coefficients.json |
| ``` |
|
|
| Decision rule: |
|
|
| ```text |
| freeze this anchor spec before streaming starts |
| do not edit the schedule after looking at streaming validation losses |
| ``` |
|
|
| If the anchor schedule looks pathological before training, such as all values |
| clipping at `p_min` or `p_max`, inspect the coefficient fit and calibration |
| cells before launching streaming. |
|
|
| ### New Regime Step 6: Five-Seed Locked Streaming Validation |
|
|
| Run: |
|
|
| ```bash |
| .venv/bin/python scripts/run_experiments.py \ |
| --mode locked_stream \ |
| --use-cached-data \ |
| --cache-dir .cache/dropout_decay_<regime> \ |
| --output-dir runs/<regime>_<model>_streaming_validation_5seed \ |
| --models <WINNER_MODEL_NAME=layersxheadsxdim> \ |
| --seeds 1 2 3 4 5 \ |
| --stream-token-caps <U1> <U2> <U3> <U4> <U5> \ |
| --dropout-rates 0 0.02 0.04 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \ |
| --anchor-decays <FROZEN_ANCHOR_SPEC_FROM_STEP_5> \ |
| --stage-steps <STAGE_STEPS> \ |
| --batch-size <BATCH> \ |
| --block-size <BLOCK> \ |
| --eval-batches <EVAL_BATCHES> \ |
| --train-eval-batches <TRAIN_EVAL_BATCHES> \ |
| --trace-eval-batches <TRACE_EVAL_BATCHES> \ |
| --log-every 250 \ |
| --vocab-size <VOCAB_SIZE> \ |
| --val-tokens <VAL_TOKENS> \ |
| --lr <LR> \ |
| --weight-decay <WEIGHT_DECAY> \ |
| --grad-clip 1.0 |
| ``` |
|
|
| What this script run does: |
|
|
| `locked_stream` is the paper-grade test. It simulates a stream by increasing |
| the available prefix tokens over stages. For each seed, it trains: |
|
|
| | Condition type | Meaning | |
| |---|---| |
| | static dropout baselines | same dropout at every stream stage | |
| | anchor decay schedule | frozen coefficient-derived dropout at each stream stage | |
|
|
| The static baselines must be broad enough to make the comparison fair. The |
| claim is not that decay beats weak static choices; the claim is that it can beat |
| the best static dropout available in the tested grid. |
|
|
| Expected outputs under |
| `runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/`: |
|
|
| | File | Use | |
| |---|---| |
| | `metrics.jsonl` | raw row-level results for each condition, seed, and prefix | |
| | `summary.csv` / `summary.json` | aggregate condition and stage summaries | |
| | `trace.jsonl` | progress traces for diagnostic plotting | |
| | `config.json` | exact run configuration | |
| | `RESULT_SUMMARY.md` | built-in readable summary | |
|
|
| Primary evaluation metrics: |
|
|
| ```text |
| final validation loss at largest prefix |
| mean trajectory validation loss |
| stage-wise validation loss |
| paired seed delta versus the best static baseline |
| rank consistency across seeds |
| ``` |
|
|
| Decision rule: |
|
|
| ```text |
| strong pass: decay has best mean final loss and beats best static in most or all |
| paired seeds |
| |
| weak pass: decay ties best static while avoiding bad early/late static choices |
| |
| fail: decay loses to a simple static baseline in most paired seeds or wins early |
| only by sacrificing final loss |
| ``` |
|
|
| ### New Regime Step 7: Summarize Streaming Validation |
|
|
| Run: |
|
|
| ```bash |
| .venv/bin/python scripts/summarize_streaming_multiseed.py \ |
| --metrics runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/metrics.jsonl \ |
| --output-dir runs/<regime>_streaming_report/<model>_validation_5seed \ |
| --report docs/<regime>_streaming_report.md \ |
| --title "<Regime Name> Streaming Validation" \ |
| --date <YYYY-MM-DD> \ |
| --context "<regime/model/token/step description>" \ |
| --conditions <regime>_interaction static_dropout_0.1 static_dropout_0.08 static_dropout_0.06 static_dropout_0.14 static_dropout_0.18 static_dropout_0.2 static_dropout_0.04 static_dropout_0.02 static_dropout_0 static_dropout_0.26 static_dropout_0.3 |
| ``` |
|
|
| What this script run does: |
|
|
| `scripts/summarize_streaming_multiseed.py` performs no training. It reads the |
| saved `metrics.jsonl` file and writes standardized artifacts comparable across |
| regimes. |
|
|
| Expected outputs: |
|
|
| | File | Use | |
| |---|---| |
| | `docs/<regime>_streaming_report.md` | human-readable regime report for paper discussion | |
| | `condition_summary.csv` | condition ranking by final validation loss | |
| | `stage_summary.csv` | stage-wise trajectory table | |
| | `paired_final_deltas.csv` | per-seed final-loss comparison against the best static baseline | |
|
|
| The most important table is `paired_final_deltas.csv`. A mean win is useful, but |
| paired seed wins are stronger: each comparison holds the seed fixed, so the |
| result is less likely to reflect a few favorable initializations. |
|
|
| Decision rule: |
|
|
| ```text |
| if the decay schedule wins 5/5 paired seeds: promote regime to strong evidence |
| if it wins 3-4/5: inspect effect size, variance, and trajectory tradeoff |
| if it wins 0-2/5: treat as a failed regime or schedule and do not bury it |
| ``` |
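The paired comparison behind this decision rule can be sketched as below, assuming final validation losses have been loaded into a `{condition: {seed: loss}}` mapping. The function name and the definition of "best static" as the static condition with the lowest mean final loss are assumptions; the `static_dropout_` prefix follows the condition names in the command above. The thresholds generalize the 5-seed rule to "all seeds", "a majority", and "otherwise".

```python
from statistics import mean


def paired_verdict(final_loss, decay_name, static_prefix="static_dropout_"):
    """Count paired seed wins for one decay schedule versus the best static."""
    statics = {name: by_seed for name, by_seed in final_loss.items()
               if name.startswith(static_prefix)}
    # Assumption: best static = lowest mean final loss across seeds.
    best_static = min(statics, key=lambda name: mean(statics[name].values()))
    seeds = sorted(final_loss[decay_name])
    deltas = {s: final_loss[decay_name][s] - statics[best_static][s] for s in seeds}
    wins = sum(d < 0 for d in deltas.values())  # negative delta = decay wins
    if wins == len(seeds):
        verdict = "promote regime to strong evidence"
    elif wins > len(seeds) / 2:
        verdict = "inspect effect size, variance, and trajectory tradeoff"
    else:
        verdict = "treat as a failed regime or schedule"
    return {"best_static": best_static, "deltas": deltas,
            "wins": wins, "verdict": verdict}


# Synthetic losses, for illustration only.
losses = {
    "demo_interaction": {1: 2.00, 2: 2.10, 3: 2.00, 4: 2.20, 5: 2.10},
    "static_dropout_0.1": {1: 2.10, 2: 2.20, 3: 2.10, 4: 2.30, 5: 2.20},
    "static_dropout_0.3": {1: 2.50, 2: 2.50, 3: 2.50, 4: 2.50, 5: 2.50},
}
print(paired_verdict(losses, "demo_interaction"))
```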
|
|
| ### New Regime Step 8: Smoke Check And Commit |
|
|
| Run: |
|
|
| ```bash |
| .venv/bin/python -m py_compile \ |
| scripts/run_experiments.py \ |
| scripts/fit_dropout_coefficients.py \ |
| scripts/make_streaming_anchors.py \ |
| scripts/summarize_streaming_multiseed.py |
| ``` |
|
|
| What this script run does: |
|
|
| This is a code integrity check. It does not validate the scientific result, and |
| because `py_compile` only byte-compiles each file without running it, it catches |
| syntax errors in the scripts required to reproduce the regime but not import or |
| runtime errors. |
|
|
| After the smoke check, update this `docs/plan.md` ledger and commit: |
|
|
| ```text |
| docs/<regime>_streaming_report.md |
| runs/<regime>_streaming_report/<model>_validation_5seed/ |
| runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/ |
| runs/coefficient_calibration/<regime>_interaction/ |
| ``` |
|
|
| Do not commit temporary checkpoints or scratch corpora. Curated source corpora |
| and token caches that are intentionally part of reproducibility should live |
| under `data/` and `.cache/dropout_decay*` with explicit provenance. |
|
|
| ## Current Regime Ledger |
|
|
| | Regime | Status | Role | |
| |---|---|---| |
| | OpenWebText10K static/coefficient regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary | |
| | TinyStories static/coefficient regime | active | main coefficient evidence | |
| | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons | |
| | OpenWebText10K streaming regime | 5-seed clean validation complete | OpenWebText10K interaction decay beats best static in 5/5 paired final-loss comparisons | |
| | WikiText-103 streaming regime | 5-seed validation complete | formula-derived L12 decay beats best static in 5/5 paired final-loss comparisons | |
|
|
| ## Current Formula Status |
|
|
| The latest TinyStories analysis supports the interaction form more strongly |
| than the first-order ABC form: |
|
|
| ```text |
| p_t = clamp(p_min, p_max, |
|     A*x_t + B*y_t + D*x_t*y_t + C0) |
| |
| x_t = log10(P / U_t) |
| y_t = log10(C_t / U_t) |
| ``` |
|
|
| For the current TinyStories regime, the latest fitted coefficients are: |
|
|
| ```text |
| A = -0.089261 |
| B = -0.129754 |
| D = 0.255069 |
| C0 = 0.081525 |
| ``` |
|
|
| Concrete TinyStories-regime formula: |
|
|
| ```text |
| p_t = clamp(p_min, p_max, |
|     -0.089261 * log10(P / U_t) |
|     - 0.129754 * log10(C_t / U_t) |
|     + 0.255069 * log10(P / U_t) * log10(C_t / U_t) |
|     + 0.081525) |
| ``` |
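One way to read the interaction term in the formula above is to ask where the unclamped rate stops responding to each input. Because the raw value is `A*x + B*y + D*x*y + C0`, its slope in `y` is `B + D*x` and its slope in `x` is `A + D*y`. The sketch below evaluates those crossover points for the TinyStories coefficients listed in this section. It is a reading aid derived from the formula, not an additional experimental result, and the variable names are hypothetical.

```python
# TinyStories-regime coefficients copied from this section.
A, B, D, C0 = -0.089261, -0.129754, 0.255069, 0.081525


def raw_rate(x, y):
    """Unclamped formula value for x = log10(P / U_t), y = log10(C_t / U_t)."""
    return A * x + B * y + D * x * y + C0


# Slope in y is B + D*x, which changes sign at x = -B / D.
x_star = -B / D
# Slope in x is A + D*y, which changes sign at y = -A / D.
y_star = -A / D

print("x where sampled-token pressure has no effect:", round(x_star, 4))
print("P/U at that point:", round(10 ** x_star, 3))
print("y where model/data pressure has no effect:", round(y_star, 4))
print("C/U at that point:", round(10 ** y_star, 3))
```

With these coefficients, `x_star` is about `0.51`: below it, more cumulative sampled tokens lower the predicted dropout, and above it they raise it. The clamp to `[p_min, p_max]` is ignored here, so this only describes the unclipped part of the schedule.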
|
|
| Use these only as the current TinyStories-regime coefficients. They are not |
| assumed to transfer numerically to the OpenWebText10K regime or any future |
| corpus regime. The cross-regime claim we are testing is that the pressure-law |
| structure transfers, while coefficients may be regime-specific. |
|
|
| ## Current Evidence Summary |
|
|
| | Evidence item | Current reading | |
| |---|---| |
| | OpenWebText10K static/coefficient regime | backtest complete; interaction MAE `0.0148` on OpenWebText10K+5M versus base ABC MAE `0.0389` | |
| | TinyStories static optima | interaction form fits static dropout optima better than base ABC | |
| | TinyStories held-out prefix | supports pressure dependence on unique tokens | |
| | TinyStories held-out model | supports pressure dependence on model size | |
| | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons | |
| | OpenWebText10K streaming, 5 seeds | interaction decay has best mean final loss; top decay schedules beat best static in 5/5 paired comparisons | |
| | WikiText-103 streaming, 5 seeds | formula-derived L12 decay has best mean final loss; beats best static in 5/5 paired comparisons | |
| | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients | |
|
|
| Latest TinyStories 5-seed streaming final-loss table: |
|
|
| | Condition | Mean final 4M validation loss | Std | |
| |---|---:|---:| |
| | `interaction` decay | 2.5311 | 0.0213 | |
| | `smooth_low` decay | 2.5321 | 0.0203 | |
| | `baseabc` decay | 2.5357 | 0.0175 | |
| | static `0.08` | 2.5444 | 0.0211 | |
| | static `0.12` | 2.5477 | 0.0178 | |
| | static `0.18` | 2.5644 | 0.0182 | |
|
|
| Paired final-loss result: |
|
|
| | Decay schedule | Paired wins vs best static | |
| |---|---:| |
| | `interaction` | 5/5 | |
| | `baseabc` | 5/5 | |
| `smooth_low` | 4/5; the one miss is a delta of only `+0.0003` |
|
|
| Seed count is no longer the immediate risk for TinyStories, OpenWebText10K, or |
| WikiText-103. The main remaining risk is external validity beyond the three |
| tested text regimes, along with robustness under controlled architecture or |
| token-budget changes. The current defensible claim is: |
|
|
| ```text |
| Formula-derived dropout schedules track the moving useful dropout region and |
| avoid poor static dropout choices as stream scale changes. |
| ``` |
|
|
| The stronger claim: |
|
|
| ```text |
| Formula-derived dropout decay beats the best static dropout. |
| ``` |
|
|
| is supported at `n=5` in TinyStories, OpenWebText10K, and WikiText-103. The |
| strongest schedule in each of the three regimes beats the per-seed best static |
| baseline in all five seeds. |
|
|
| Latest OpenWebText10K 5-seed streaming final-loss table: |
|
|
| | Condition | Mean final 4M validation loss | Std | |
| |---|---:|---:| |
| | `openwebtext10k_interaction` decay | 4.3981 | 0.0095 | |
| | `hold_30_then_decay` | 4.4052 | 0.0112 | |
| | `mild_30_to_08` | 4.4073 | 0.0085 | |
| | `fitted_l16_static_law` | 4.4124 | 0.0084 | |
| | static `0.14` | 4.4455 | 0.0120 | |
| | static `0.30` | 4.4668 | 0.0141 | |
| | static `0.02` | 4.5358 | 0.0091 | |
| | static `0.00` | 4.5943 | 0.0216 | |
|
|
| OpenWebText10K condition provenance: |
|
|
| | Condition | Provenance | How to interpret it | |
| |---|---|---| |
| | `openwebtext10k_interaction` | coefficient-derived interaction schedule | main OpenWebText10K formula hypothesis test | |
| | `hold_30_then_decay` | heuristic schedule-search ablation | manually specified after exploratory single-seed OpenWebText10K schedule search; not generated from coefficients | |
| | `mild_30_to_08` | heuristic schedule-search ablation | manually specified after exploratory single-seed OpenWebText10K schedule search; not generated from coefficients | |
| | `fitted_l16_static_law` | older fitted/static-law schedule | retained as a comparison to the earlier aggressive fitted path | |
| | static conditions | fixed dropout baselines | same dropout at every stream prefix | |
|
|
| The heuristic OpenWebText10K schedules were chosen from failure analysis, not |
| from the final coefficient formula. The older `fitted_l16_static_law` path |
| started too high (`0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02`). Static dropout |
| `0.30` looked useful early but was worse at the final 4M-token stage, and |
| static dropout `0.14` was the strongest static endpoint. These observations |
| motivated two manual ablations: |
|
|
| ```text |
| hold_30_then_decay = 0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02 |
| mild_30_to_08 = 0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08 |
| ``` |
|
|
| These ablations support the broader mechanism that stream-dependent dropout can |
| matter, but they should not be used as evidence that the coefficient formula |
| generated those exact schedules. The formula claim for OpenWebText10K should be |
| based on `openwebtext10k_interaction`. |
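
For bookkeeping, the stage-wise paths quoted in this section can be held as plain tuples, with a static baseline being the same value repeated at every stream prefix. This is a sketch under assumptions: the helper names and the 0-based stage index are hypothetical, and how the training loop actually consumes a schedule is not specified here.

```python
# Stage-wise dropout paths copied from this section (five stream stages each).
SCHEDULES = {
    "fitted_l16_static_law": (0.60, 0.40, 0.30, 0.14, 0.02),
    "hold_30_then_decay": (0.30, 0.30, 0.20, 0.10, 0.02),
    "mild_30_to_08": (0.30, 0.24, 0.18, 0.12, 0.08),
}
NUM_STAGES = 5


def static_schedule(p, num_stages=NUM_STAGES):
    """A static baseline: the same dropout at every stream prefix."""
    return (p,) * num_stages


def dropout_at(schedule, stage):
    """Return the dropout rate for a 0-based stage index (assumed convention)."""
    if not 0 <= stage < len(schedule):
        raise IndexError(f"stage {stage} outside schedule of length {len(schedule)}")
    return schedule[stage]


def is_non_increasing(schedule):
    """True if dropout never rises from one stage to the next."""
    return all(a >= b for a, b in zip(schedule, schedule[1:]))


for name, path in SCHEDULES.items():
    print(name, path, "non-increasing:", is_non_increasing(path))
print("static 0.14:", static_schedule(0.14))
```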
|
|
| OpenWebText10K paired final-loss result: |
|
|
| | Decay schedule | Paired wins vs best static | |
| |---|---:| |
| | `openwebtext10k_interaction` | 5/5 | |
| | `hold_30_then_decay` | 5/5 | |
| | `mild_30_to_08` | 5/5 | |
| | `fitted_l16_static_law` | 5/5 | |
|
|
| The best static baseline in the clean OpenWebText10K run is static dropout |
| `0.14`. The interaction schedule improves mean final validation loss by about |
| `0.0473` and wins every paired seed comparison. This promotes OpenWebText10K |
| from exploratory support to a second multi-seed streaming validation regime. |
|
|
| Latest WikiText-103 5-seed streaming final-loss table: |
|
|
| | Condition | Mean final 4M validation loss | Std | |
| |---|---:|---:| |
| | `wikitext103_formula_l12` decay | 4.0808 | 0.0195 | |
| | `wikitext103_probe_blend` decay | 4.0961 | 0.0145 | |
| | `wikitext103_low_decay` decay | 4.1020 | 0.0166 | |
| | static `0.10` | 4.1105 | 0.0188 | |
| | static `0.08` | 4.1116 | 0.0186 | |
| | static `0.06` | 4.1197 | 0.0082 | |
| | static `0.14` | 4.1221 | 0.0155 | |
| | static `0.18` | 4.1304 | 0.0130 | |
| | static `0.04` | 4.1331 | 0.0227 | |
| | static `0.20` | 4.1394 | 0.0167 | |
| | static `0.02` | 4.1459 | 0.0165 | |
| | static `0.26` | 4.1784 | 0.0145 | |
| | static `0.00` | 4.1835 | 0.0165 | |
| | static `0.30` | 4.1946 | 0.0141 | |
|
|
| WikiText-103 paired final-loss result: |
|
|
| | Decay schedule | Paired wins vs best static | |
| |---|---:| |
| | `wikitext103_formula_l12` | 5/5 | |
| | `wikitext103_probe_blend` | 4/5 | |
| | `wikitext103_low_decay` | 4/5 | |
|
|
| The best static baseline in the clean WikiText-103 run is static dropout |
| `0.10` by mean final loss. The formula-derived L12 decay improves mean final |
| validation loss by about `0.0297` and wins every paired seed comparison. This |
| promotes WikiText-103 to a third multi-seed streaming validation regime. |
|
|
| ## Completed Static Backtest Gate |
|
|
| The first offline coefficient backtest is complete. It is retained as a |
| supporting artifact, not as final proof: |
|
|
| ```text |
| runs/coefficient_calibration/cross_regime_backtest/ |
| ``` |
|
|
| Main reading: |
|
|
| ```text |
| the interaction pressure-law structure is supported in both the OpenWebText10K |
| regime and the current TinyStories regime, but coefficient values are |
| regime-specific. |
| ``` |
|
|
| Do not claim universal numeric coefficients. For final paper evidence, use |
| streaming multi-seed reports for each regime. |
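
A static backtest of this kind reduces to comparing formula-predicted dropout against the empirically best static dropout for each saved configuration. The sketch below assumes that the MAE figures are computed this way. The function name and all numbers are made up for illustration and are not taken from the backtest artifacts.

```python
def mean_absolute_error(predicted, observed):
    """Mean absolute gap between predicted and observed dropout optima."""
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Hypothetical static optima and formula predictions, one entry per configuration.
observed_optima = [0.30, 0.20, 0.12, 0.08]
interaction_pred = [0.28, 0.21, 0.13, 0.07]
base_abc_pred = [0.24, 0.24, 0.16, 0.04]

mae_interaction = mean_absolute_error(interaction_pred, observed_optima)
mae_base_abc = mean_absolute_error(base_abc_pred, observed_optima)
print(f"interaction MAE: {mae_interaction:.4f}")
print(f"base ABC MAE:    {mae_base_abc:.4f}")
```

Because this reuses saved results, a formula variant can be screened this way before any new training run is considered.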
|
|
| ## Immediate Next Action |
|
|
| Reconcile the TinyStories, OpenWebText10K, and WikiText-103 five-seed streaming |
| reports into the paper outline. The strongest current claim is now supported in |
| three regimes: formula-derived or regime-fitted decay schedules beat the best |
| static dropout baseline in paired five-seed final-loss comparisons. |
|
|
| The next empirical weakness is no longer a missing third text regime. The next |
| useful strengthening step is to test robustness across a controlled architecture |
| or token-budget change inside one established corpus regime, while preserving |
| the same MPS-only, five-seed validation standard. |
|
|
| ## Next Training After Current Gate |
|
|
| No MPS training should launch until the three completed five-seed streaming |
| reports are read together. Since a third held-out text regime is no longer the |
| limiting issue, use the next run only for a narrowed robustness test: |
|
|
| ```text |
| completed: TinyStories 5-seed streaming report |
| completed: OpenWebText10K 5-seed clean streaming report |
| completed: WikiText-103 5-seed clean streaming report |
| next: reconcile three-regime evidence into the paper, then choose one narrowed |
| robustness test |
| avoid: broad new sweep before three-regime report reconciliation |
| ``` |
|
|
| Evaluate with paired seed comparisons: |
|
|
| ```text |
| final 4M validation loss |
| mean trajectory validation loss |
| stage-wise validation loss |
| decay minus best-static delta per seed |
| rank consistency across seeds |
| ``` |
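
The per-seed delta and paired-win count from the list above can be computed as in the following sketch. It assumes the best static baseline is chosen once, by mean final loss across seeds. If the reports instead pick the best static per seed, the selection step would change. All loss values and condition names are hypothetical placeholders, not results from the runs.

```python
from statistics import mean


def paired_summary(decay_losses, static_losses_by_name):
    """Compare one decay schedule with the best static baseline, seed by seed.

    decay_losses          : final validation loss per seed for the decay schedule
    static_losses_by_name : {condition name: final validation loss per seed}
    Seeds must be listed in the same order in every sequence.
    """
    n = len(decay_losses)
    if any(len(v) != n for v in static_losses_by_name.values()):
        raise ValueError("every condition needs one loss per seed")
    # Assumption: "best static" means lowest mean final loss across seeds.
    best_name = min(static_losses_by_name, key=lambda k: mean(static_losses_by_name[k]))
    best_static = static_losses_by_name[best_name]
    # Decay minus best-static delta per seed; negative means the decay schedule wins.
    deltas = [d - s for d, s in zip(decay_losses, best_static)]
    return {
        "best_static": best_name,
        "deltas": deltas,
        "wins": sum(delta < 0 for delta in deltas),
        "n": n,
        "mean_delta": mean(deltas),
    }


# Hypothetical per-seed final losses (five seeds).
decay = [2.50, 2.52, 2.55, 2.53, 2.51]
statics = {
    "static_0.08": [2.52, 2.53, 2.56, 2.54, 2.53],
    "static_0.12": [2.53, 2.54, 2.55, 2.55, 2.54],
}
summary = paired_summary(decay, statics)
print(f"{summary['wins']}/{summary['n']} paired wins vs {summary['best_static']}")
```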
|
|
| Because TinyStories, OpenWebText10K, and WikiText-103 decays win across paired |
| seeds, promote the cross-regime streaming claim to "supported in three regimes." |
| Do not yet claim universal numeric coefficients. The current defensible |
| paper-level claim is that the pressure-law structure and regime-specific fitting |
| procedure can produce dropout schedules that beat the best static dropout |
| baseline across multiple text regimes. |
|
|
| Latest streaming reports and run directories: |
|
|
| ```text |
| docs/tinystories_streaming_report.md |
| docs/openwebtext10k_streaming_report.md |
| docs/wikitext103_streaming_report.md |
| runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/ |
| runs/openwebtext10k_streaming_report/l16_updated_formula_clean_5seed/ |
| runs/wikitext103_streaming_report/l12_validation_5seed/ |
| ``` |
|
|