Mandeep Sidhu committed on
Commit · b5daf7c
Parent(s): cf52b0e
Document regime runbook and schedule provenance
- docs/openwebtext10k_streaming_report.md +20 -0
- docs/plan.md +550 -8
docs/openwebtext10k_streaming_report.md
CHANGED
|
@@ -16,6 +16,26 @@ baselines.
|
|
| 16 |
|
| 17 |
- `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl`
|
| 18 |
|
| 19 |
## Condition Ranking By Final Loss
|
| 20 |
|
| 21 |
| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
|
|
|
|
| 16 |
|
| 17 |
- `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl`
|
| 18 |
|
| 19 |
+
## Condition Provenance
|
| 20 |
+
|
| 21 |
+
The `anchor_decay` label means the dropout value is chosen from explicit
|
| 22 |
+
prefix-token anchors. It does not by itself imply that the schedule came from
|
| 23 |
+
the coefficient formula.
|
| 24 |
+
|
| 25 |
+
| Condition | Provenance | Dropout path | Interpretation |
|
| 26 |
+
|---|---|---|---|
|
| 27 |
+
| `openwebtext10k_interaction` | coefficient-derived schedule | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` | Main OpenWebText10K formula-derived schedule. This is the condition that tests the regime-specific interaction coefficient hypothesis. |
|
| 28 |
+
| `hold_30_then_decay` | heuristic schedule-search ablation | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It caps the initial dropout at `0.30`, holds it for the two smallest stream prefixes, then releases capacity aggressively. |
|
| 29 |
+
| `mild_30_to_08` | heuristic schedule-search ablation | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It tests whether a smoother decay from `0.30` to a moderate final dropout is competitive. |
|
| 30 |
+
| `fitted_l16_static_law` | older fitted/static-law schedule | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` | Retained as a comparison to the earlier overly aggressive fitted schedule; it is not the current interaction formula schedule. |
|
| 31 |
+
| `static_dropout_*` | static baseline | constant | Fixed dropout used at every stream prefix. |
|
| 32 |
+
|
| 33 |
+
The two heuristic schedules should be treated as ablations, not as independent
|
| 34 |
+
evidence that the coefficient formula generated their exact paths. Their role is
|
| 35 |
+
to show that the shape of the decay matters and that reasonable hand-designed
|
| 36 |
+
decays can also beat weak static choices. The main formula claim for this
|
| 37 |
+
regime should be based on `openwebtext10k_interaction`.
|
| 38 |
+
|
| 39 |
## Condition Ranking By Final Loss
|
| 40 |
|
| 41 |
| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
|
docs/plan.md
CHANGED
|
@@ -277,6 +277,520 @@ Use this order for every regime.
|
|
| 277 |
7. Immediately backtest the new regime against all previous regimes.
|
| 278 |
8. Only then run expensive streaming validation in the new regime.
|
| 279 |
|
| 280 |
## Current Regime Ledger
|
| 281 |
|
| 282 |
| Regime | Status | Role |
|
|
@@ -357,8 +871,9 @@ Paired final-loss result:
|
|
| 357 |
| `smooth_low` | 4/5, with the one miss only `+0.0003` |
|
| 358 |
|
| 359 |
The immediate risk is no longer seed count for TinyStories or OpenWebText10K.
|
| 360 |
-
The main remaining risk is external validity beyond
|
| 361 |
-
|
|
|
|
| 362 |
|
| 363 |
```text
|
| 364 |
Formula-derived dropout schedules track the moving useful dropout region and
|
|
@@ -371,9 +886,9 @@ The stronger claim:
|
|
| 371 |
Formula-derived dropout decay beats the best static dropout.
|
| 372 |
```
|
| 373 |
|
| 374 |
-
is supported at `n=5` in
|
| 375 |
-
|
| 376 |
-
|
| 377 |
|
| 378 |
Latest OpenWebText10K 5-seed streaming final-loss table:
|
| 379 |
|
|
@@ -388,6 +903,33 @@ Latest OpenWebText10K 5-seed streaming final-loss table:
|
|
| 388 |
| static `0.02` | 4.5358 | 0.0091 |
|
| 389 |
| static `0.00` | 4.5943 | 0.0216 |
|
| 390 |
|
| 391 |
Paired final-loss result:
|
| 392 |
|
| 393 |
| Decay schedule | Paired wins vs best static |
|
|
@@ -468,9 +1010,9 @@ the same MPS-only, five-seed validation standard.
|
|
| 468 |
|
| 469 |
## Next Training After Current Gate
|
| 470 |
|
| 471 |
-
No MPS training should launch until the
|
| 472 |
-
reports are read together. Since
|
| 473 |
-
limiting issue, use
|
| 474 |
|
| 475 |
```text
|
| 476 |
completed: TinyStories 5-seed streaming report
|
|
|
|
| 277 |
7. Immediately backtest the new regime against all previous regimes.
|
| 278 |
8. Only then run expensive streaming validation in the new regime.
|
| 279 |
|
| 280 |
+
## New Regime Script Runbook
|
| 281 |
+
|
| 282 |
+
Use this exact command sequence for any new regime. Replace placeholders such as
|
| 283 |
+
`<regime>`, `<CORPUS_OR_PARQUET_PATH>`, `<MODEL_SPEC>`, and `<TIMESTAMP>` with
|
| 284 |
+
concrete values before launching. Do not skip from calibration directly to
|
| 285 |
+
streaming: the schedule must be frozen from the coefficient fit before the
|
| 286 |
+
streaming run starts.
|
| 287 |
+
|
| 288 |
+
This section is intentionally verbose. Its purpose is to make future regimes
|
| 289 |
+
auditable: an external reader should be able to tell what each script did, what
|
| 290 |
+
file it produced, and which decision gate came next.
|
| 291 |
+
|
| 292 |
+
### New Regime Step 0: MPS Preflight
|
| 293 |
+
|
| 294 |
+
Run this before any torch training command:
|
| 295 |
+
|
| 296 |
+
```bash
|
| 297 |
+
.venv/bin/python -c "import torch; print({'mps_built': torch.backends.mps.is_built(), 'mps_available': torch.backends.mps.is_available(), 'cuda_available': torch.cuda.is_available()}); raise SystemExit(0 if torch.backends.mps.is_available() else 1)"
|
| 298 |
+
```
|
| 299 |
+
|
| 300 |
+
What this does:
|
| 301 |
+
|
| 302 |
+
| Check | Meaning |
|
| 303 |
+
|---|---|
|
| 304 |
+
| `mps_built` | PyTorch was compiled with Apple MPS support |
|
| 305 |
+
| `mps_available` | this machine can actually run MPS now |
|
| 306 |
+
| `cuda_available` | CUDA is visible to PyTorch; it should not be used for this project |
|
| 307 |
+
|
| 308 |
+
Decision rule:
|
| 309 |
+
|
| 310 |
+
```text
|
| 311 |
+
if mps_available is false: stop and report
|
| 312 |
+
if cuda_available is true: still do not use CUDA
|
| 313 |
+
```
|
| 314 |
+
|
| 315 |
+
Also check for duplicate experiment processes before launching a long run. This
|
| 316 |
+
is not part of the coefficient method, but it prevents corrupted timing/resource
|
| 317 |
+
comparisons.
|
| 318 |
+
|
| 319 |
+
### New Regime Step 1: Static Dropout Calibration Screen
|
| 320 |
+
|
| 321 |
+
Run:
|
| 322 |
+
|
| 323 |
+
```bash
|
| 324 |
+
.venv/bin/python scripts/run_experiments.py \
|
| 325 |
+
--mode screen_static \
|
| 326 |
+
--corpus <CORPUS_OR_PARQUET_PATH> \
|
| 327 |
+
--text-column <TEXT_COLUMN_IF_PARQUET> \
|
| 328 |
+
--cache-dir .cache/dropout_decay_<regime> \
|
| 329 |
+
--output-dir runs/<regime>_static_screen \
|
| 330 |
+
--models <M1=layersxheadsxdim> <M2=layersxheadsxdim> <M3=layersxheadsxdim> \
|
| 331 |
+
--seeds 1 2 \
|
| 332 |
+
--token-limits <U1> <U2> <U3> <U4> \
|
| 333 |
+
--dropout-rates 0 0.02 0.04 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
|
| 334 |
+
--steps <STATIC_STEPS> \
|
| 335 |
+
--batch-size <BATCH> \
|
| 336 |
+
--block-size <BLOCK> \
|
| 337 |
+
--eval-batches <EVAL_BATCHES> \
|
| 338 |
+
--train-eval-batches <TRAIN_EVAL_BATCHES> \
|
| 339 |
+
--trace-eval-batches <TRACE_EVAL_BATCHES> \
|
| 340 |
+
--vocab-size <VOCAB_SIZE> \
|
| 341 |
+
--val-tokens <VAL_TOKENS> \
|
| 342 |
+
--lr <LR> \
|
| 343 |
+
--weight-decay <WEIGHT_DECAY> \
|
| 344 |
+
--grad-clip 1.0 \
|
| 345 |
+
--screen-early-stop
|
| 346 |
+
```
|
| 347 |
+
|
| 348 |
+
What this script run does:
|
| 349 |
+
|
| 350 |
+
`scripts/run_experiments.py --mode screen_static` trains a grid of static
|
| 351 |
+
dropout models. It does not test the final decay hypothesis. It estimates the
|
| 352 |
+
best static dropout rate for each calibration cell:
|
| 353 |
+
|
| 354 |
+
```text
|
| 355 |
+
cell = (model parameter count P, prefix/unique tokens U, sampled tokens C)
|
| 356 |
+
```
|
| 357 |
+
|
| 358 |
+
For each cell, the script evaluates a fixed dropout grid and writes the
|
| 359 |
+
validation curve. The curve is used later to extract the target dropout `p*`.
|
| 360 |
+
|
| 361 |
+
Expected outputs under `runs/<regime>_static_screen/screen_static/<TIMESTAMP>/`:
|
| 362 |
+
|
| 363 |
+
| File | Use |
|
| 364 |
+
|---|---|
|
| 365 |
+
| `metrics.jsonl` | per-run raw metrics; includes token limit, model, seed, losses, and tokens seen |
|
| 366 |
+
| `model_selection.csv` | per-cell static dropout curve and selected best dropout |
|
| 367 |
+
| `summary.csv` / `summary.json` | compact aggregate summary |
|
| 368 |
+
| `trace.jsonl` | lower-frequency trace for diagnostics |
|
| 369 |
+
| `RESULT_SUMMARY.md` | human-readable first-pass summary |
|
| 370 |
+
|
| 371 |
+
Why this is needed:
|
| 372 |
+
|
| 373 |
+
The coefficient formula is not fitted from streaming outcomes. It is fitted
|
| 374 |
+
from static dropout optima. This separation is essential: calibration estimates
|
| 375 |
+
where useful regularization sits; streaming validation tests whether following
|
| 376 |
+
that moving estimate helps.
|
| 377 |
+
|
| 378 |
+
Recommended cheap calibration:
|
| 379 |
+
|
| 380 |
+
| Dimension | Default |
|
| 381 |
+
|---|---|
|
| 382 |
+
| models | at least 3 model sizes if testing coefficient generality |
|
| 383 |
+
| token prefixes | at least 4 prefixes |
|
| 384 |
+
| seeds | 1-2 for calibration, 5 only for final streaming validation |
|
| 385 |
+
| dropout grid | include low, middle, and high values so the optimum can be bracketed |
|
| 386 |
+
|
| 387 |
+
Decision rule:
|
| 388 |
+
|
| 389 |
+
```text
|
| 390 |
+
continue if most cells have a bracketed or near-bracketed optimum
|
| 391 |
+
refine if many best dropouts sit at the edge of the grid
|
| 392 |
+
stop and inspect if validation curves are flat/noisy enough that p* is unstable
|
| 393 |
+
```
|
| 394 |
+
|
| 395 |
+
### New Regime Step 2: Fit First-Order Base Coefficients
|
| 396 |
+
|
| 397 |
+
Run:
|
| 398 |
+
|
| 399 |
+
```bash
|
| 400 |
+
.venv/bin/python scripts/fit_dropout_coefficients.py \
|
| 401 |
+
--run-dirs runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
|
| 402 |
+
--output-dir runs/coefficient_calibration/<regime>_base \
|
| 403 |
+
--target quad \
|
| 404 |
+
--weighting heuristic \
|
| 405 |
+
--feature-set base \
|
| 406 |
+
--min-rate 0.0 \
|
| 407 |
+
--max-rate 0.30
|
| 408 |
+
```
|
| 409 |
+
|
| 410 |
+
What this script run does:
|
| 411 |
+
|
| 412 |
+
`scripts/fit_dropout_coefficients.py` reads `model_selection.csv` and
|
| 413 |
+
`metrics.jsonl` from the static screen. It converts each calibration cell into:
|
| 414 |
+
|
| 415 |
+
```text
|
| 416 |
+
x = log10(P / U)
|
| 417 |
+
y = log10(C / U)
|
| 418 |
+
target = observed useful static dropout p*
|
| 419 |
+
```
|
| 420 |
+
|
| 421 |
+
With `--feature-set base`, it fits the first-order ablation:
|
| 422 |
+
|
| 423 |
+
```text
|
| 424 |
+
p* ~= A*x + B*y + C0
|
| 425 |
+
```
|
| 426 |
+
|
| 427 |
+
With `--target quad`, the target `p*` is the local quadratic minimum around the
|
| 428 |
+
best dropout grid point when the curve is bracketed. If the curve is not
|
| 429 |
+
bracketed, the script falls back to the grid best and marks the cell as weaker
|
| 430 |
+
evidence.
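
The `--target quad` extraction described above can be sketched as a three-point parabola through the best grid point and its two neighbors. This is a minimal illustration, not the code in `scripts/fit_dropout_coefficients.py`: the function name `quad_target` and its return shape are hypothetical, and the real script may use a different local fit.

```python
def quad_target(rates, losses):
    """Return (p_star, bracketed) for one calibration cell.

    Hypothetical helper: fits a parabola through the best grid point and its
    two neighbors, and falls back to the grid best when the optimum is not
    bracketed.
    """
    # Sort the dropout grid so neighbors are adjacent in rate.
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    xs = [rates[i] for i in order]
    ys = [losses[i] for i in order]
    best = min(range(len(xs)), key=lambda i: ys[i])

    # Boundary optimum: one neighbor is missing, so the minimum is not bracketed.
    if best == 0 or best == len(xs) - 1:
        return xs[best], False

    x0, x1, x2 = xs[best - 1], xs[best], xs[best + 1]
    y0, y1, y2 = ys[best - 1], ys[best], ys[best + 1]
    a, b = x1 - x0, x1 - x2
    num = a * a * (y1 - y2) - b * b * (y1 - y0)
    den = a * (y1 - y2) - b * (y1 - y0)

    # Perfectly flat neighborhood: the parabola is degenerate, keep the grid best.
    if den == 0.0:
        return x1, False

    # Vertex of the parabola through the three points.
    return x1 - 0.5 * num / den, True
```

The returned flag corresponds to the "bracketed" notion above; a caller could use it to mark the cell as weaker evidence.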
|
| 431 |
+
|
| 432 |
+
With `--weighting heuristic`, the fit downweights cells that are less reliable:
|
| 433 |
+
|
| 434 |
+
| Cell condition | Why it is weaker |
|
| 435 |
+
|---|---|
|
| 436 |
+
| boundary optimum | true optimum may be outside the tested dropout grid |
|
| 437 |
+
| not bracketed | local quadratic minimum is less trustworthy |
|
| 438 |
+
| very flat curve | many dropout rates perform nearly the same |
|
| 439 |
+
| noisy best loss | target dropout is less stable |
|
| 440 |
+
|
| 441 |
+
Expected outputs under `runs/coefficient_calibration/<regime>_base/`:
|
| 442 |
+
|
| 443 |
+
| File | Use |
|
| 444 |
+
|---|---|
|
| 445 |
+
| `coefficients.json` | fitted `A`, `B`, `C0`, metrics, and cross-validation scores |
|
| 446 |
+
| `fit_diagnostics.md` | readable coefficient table, formula, fit metrics, and cell residuals |
|
| 447 |
+
| `calibration_cells.csv` | one row per fitted cell with target, prediction, residual, and flags |
|
| 448 |
+
| `next_dropout_suggestions.csv` | dropout rates to add if a cell needs refinement |
|
| 449 |
+
|
| 450 |
+
Why this is needed:
|
| 451 |
+
|
| 452 |
+
The base model is the simplest pressure-law hypothesis. It is the ablation that
|
| 453 |
+
tells reviewers whether the interaction term is actually necessary.
|
| 454 |
+
|
| 455 |
+
Decision rule:
|
| 456 |
+
|
| 457 |
+
```text
|
| 458 |
+
if base MAE and held-out errors are already low: keep it as a strong ablation
|
| 459 |
+
if base has biased residuals or higher MAE: compare against interaction next
|
| 460 |
+
```
|
| 461 |
+
|
| 462 |
+
### New Regime Step 3: Fit Interaction Coefficients
|
| 463 |
+
|
| 464 |
+
Run:
|
| 465 |
+
|
| 466 |
+
```bash
|
| 467 |
+
.venv/bin/python scripts/fit_dropout_coefficients.py \
|
| 468 |
+
--run-dirs runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
|
| 469 |
+
--output-dir runs/coefficient_calibration/<regime>_interaction \
|
| 470 |
+
--target quad \
|
| 471 |
+
--weighting heuristic \
|
| 472 |
+
--feature-set interaction \
|
| 473 |
+
--min-rate 0.0 \
|
| 474 |
+
--max-rate 0.30
|
| 475 |
+
```
|
| 476 |
+
|
| 477 |
+
What this script run does:
|
| 478 |
+
|
| 479 |
+
This repeats the same target extraction and weighted least-squares fitting, but
|
| 480 |
+
uses the interaction pressure law:
|
| 481 |
+
|
| 482 |
+
```text
|
| 483 |
+
p* ~= A*x + B*y + D*x*y + C0
|
| 484 |
+
```
|
| 485 |
+
|
| 486 |
+
The extra term `D*x*y` lets model/data pressure and sampled-token pressure
|
| 487 |
+
interact. Empirically, this has mattered because dropout pressure is not always
|
| 488 |
+
additive: the useful effect of seeing more cumulative sampled tokens can depend
|
| 489 |
+
on how oversized the model is relative to the available unique data.
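
As a rough illustration of the weighted least-squares step, the sketch below fits `A`, `B`, `D`, and `C0` from cells given as `(x, y, p_star, weight)` tuples by solving the weighted normal equations in plain Python. The function name, the tuple layout, and the weights are assumptions for illustration; the actual script and the rules behind `--weighting heuristic` may differ.

```python
def fit_interaction(cells):
    """Weighted least squares for p* ~= A*x + B*y + D*x*y + C0.

    cells: iterable of (x, y, p_star, weight) tuples (hypothetical layout).
    Returns (A, B, D, C0).
    """
    n = 4
    # Accumulate the weighted normal equations (F^T W F) beta = F^T W p.
    lhs = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for x, y, p_star, w in cells:
        f = (x, y, x * y, 1.0)
        for i in range(n):
            rhs[i] += w * f[i] * p_star
            for j in range(n):
                lhs[i][j] += w * f[i] * f[j]

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(lhs[r][col]))
        if abs(lhs[pivot][col]) < 1e-12:
            raise ValueError("degenerate design: need more distinct (x, y) cells")
        lhs[col], lhs[pivot] = lhs[pivot], lhs[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(col + 1, n):
            factor = lhs[r][col] / lhs[col][col]
            rhs[r] -= factor * rhs[col]
            for c in range(col, n):
                lhs[r][c] -= factor * lhs[col][c]

    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(lhs[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (rhs[i] - tail) / lhs[i][i]
    return tuple(beta)
```

Dropping the `x * y` feature from `f` (and reducing `n` to 3) gives the base form from Step 2, which keeps the base-versus-interaction comparison on the same cells and weights.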
|
| 490 |
+
|
| 491 |
+
Expected outputs are the same as Step 2, but under:
|
| 492 |
+
|
| 493 |
+
```text
|
| 494 |
+
runs/coefficient_calibration/<regime>_interaction/
|
| 495 |
+
```
|
| 496 |
+
|
| 497 |
+
Decision rule:
|
| 498 |
+
|
| 499 |
+
```text
|
| 500 |
+
promote interaction if it lowers MAE/RMSE, improves leave-prefix/leave-model
|
| 501 |
+
validation, and does not create obvious residual bias
|
| 502 |
+
```
|
| 503 |
+
|
| 504 |
+
Do not promote the interaction form merely because it has more parameters. The
|
| 505 |
+
paper needs the base-vs-interaction comparison to show that the extra term buys
|
| 506 |
+
predictive accuracy, not just in-sample flexibility.
|
| 507 |
+
|
| 508 |
+
### New Regime Step 4: Optional Static Refinement
|
| 509 |
+
|
| 510 |
+
Only run this if `fit_diagnostics.md` or `next_dropout_suggestions.csv` shows
|
| 511 |
+
that important cells are weakly identified.
|
| 512 |
+
|
| 513 |
+
Run:
|
| 514 |
+
|
| 515 |
+
```bash
|
| 516 |
+
.venv/bin/python scripts/run_experiments.py \
|
| 517 |
+
--mode screen_static \
|
| 518 |
+
--resume-from runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
|
| 519 |
+
--use-cached-data \
|
| 520 |
+
--cache-dir .cache/dropout_decay_<regime> \
|
| 521 |
+
--output-dir runs/<regime>_static_refined \
|
| 522 |
+
--models <ONLY_AFFECTED_MODELS> \
|
| 523 |
+
--seeds 1 2 \
|
| 524 |
+
--token-limits <ONLY_AFFECTED_PREFIXES> \
|
| 525 |
+
--dropout-rates <SUGGESTED_RATES> \
|
| 526 |
+
--steps <STATIC_STEPS> \
|
| 527 |
+
--batch-size <BATCH> \
|
| 528 |
+
--block-size <BLOCK> \
|
| 529 |
+
--eval-batches <EVAL_BATCHES> \
|
| 530 |
+
--train-eval-batches <TRAIN_EVAL_BATCHES> \
|
| 531 |
+
--trace-eval-batches <TRACE_EVAL_BATCHES> \
|
| 532 |
+
--vocab-size <VOCAB_SIZE> \
|
| 533 |
+
--val-tokens <VAL_TOKENS> \
|
| 534 |
+
--lr <LR> \
|
| 535 |
+
--weight-decay <WEIGHT_DECAY> \
|
| 536 |
+
--grad-clip 1.0
|
| 537 |
+
```
|
| 538 |
+
|
| 539 |
+
What this script run does:
|
| 540 |
+
|
| 541 |
+
This adds only missing static dropout points. It should not rerun the full grid.
|
| 542 |
+
`--resume-from` lets the experiment skip rows already completed in the original
|
| 543 |
+
static screen. `--use-cached-data` reuses the cached tokenizer and token arrays
|
| 544 |
+
so refinement is measuring dropout/model behavior, not data preprocessing
|
| 545 |
+
differences.
|
| 546 |
+
|
| 547 |
+
When to use it:
|
| 548 |
+
|
| 549 |
+
| Trigger | Refinement action |
|
| 550 |
+
|---|---|
|
| 551 |
+
| best dropout is at grid edge | add rates beyond or near that edge if allowed |
|
| 552 |
+
| curve is too coarse near optimum | add rates around the local best |
|
| 553 |
+
| static curve is flat | add seeds or eval batches before changing the formula |
|
| 554 |
+
|
| 555 |
+
After refinement, rerun Steps 2 and 3 with all relevant run dirs. At minimum,
|
| 556 |
+
rerun the promoted feature family. If the paper will compare base versus
|
| 557 |
+
interaction after refinement, rerun both.
|
| 558 |
+
|
| 559 |
+
```bash
|
| 560 |
+
.venv/bin/python scripts/fit_dropout_coefficients.py \
|
| 561 |
+
--run-dirs \
|
| 562 |
+
runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
|
| 563 |
+
runs/<regime>_static_refined/screen_static/<TIMESTAMP> \
|
| 564 |
+
--output-dir runs/coefficient_calibration/<regime>_interaction_refined \
|
| 565 |
+
--target quad \
|
| 566 |
+
--weighting heuristic \
|
| 567 |
+
--feature-set interaction \
|
| 568 |
+
--min-rate 0.0 \
|
| 569 |
+
--max-rate 0.30
|
| 570 |
+
```
|
| 571 |
+
|
| 572 |
+
Decision rule:
|
| 573 |
+
|
| 574 |
+
```text
|
| 575 |
+
refinement is complete when the promoted coefficient fit has acceptable MAE,
|
| 576 |
+
held-out errors, and no obvious residual direction across P/U or C/U
|
| 577 |
+
```
|
| 578 |
+
|
| 579 |
+
### New Regime Step 5: Generate Frozen Streaming Anchors
|
| 580 |
+
|
| 581 |
+
Run:
|
| 582 |
+
|
| 583 |
+
```bash
|
| 584 |
+
.venv/bin/python scripts/make_streaming_anchors.py \
|
| 585 |
+
--coefficients-json <PROMOTED_COEFFICIENTS_JSON> \
|
| 586 |
+
--name <regime>_interaction \
|
| 587 |
+
--parameters <WINNER_MODEL_PARAM_COUNT> \
|
| 588 |
+
--stream-token-caps <U1> <U2> <U3> <U4> <U5> \
|
| 589 |
+
--stage-steps <STAGE_STEPS> \
|
| 590 |
+
--batch-size <BATCH> \
|
| 591 |
+
--block-size <BLOCK> \
|
| 592 |
+
--min-rate 0.02 \
|
| 593 |
+
--max-rate 0.65 \
|
| 594 |
+
--precision 3
|
| 595 |
+
```
|
| 596 |
+
|
| 597 |
+
What this script run does:
|
| 598 |
+
|
| 599 |
+
`scripts/make_streaming_anchors.py` turns `coefficients.json` into the exact
|
| 600 |
+
dropout schedule used by `locked_stream`. For each stream prefix, it computes:
|
| 601 |
+
|
| 602 |
+
```text
|
| 603 |
+
P = chosen model parameter count
|
| 604 |
+
U_t = stream prefix tokens at stage t
|
| 605 |
+
C_t = cumulative sampled optimizer tokens through stage t
|
| 606 |
+
x_t = log10(P / U_t)
|
| 607 |
+
y_t = log10(C_t / U_t)
|
| 608 |
+
p_t = clamp(p_min, p_max, A*x_t + B*y_t + D*x_t*y_t + C0)
|
| 609 |
+
```
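
The per-stage computation above can be sketched in Python as follows. This is a hedged illustration rather than the contents of `scripts/make_streaming_anchors.py`: the function name, the coefficient key names (`A`, `B`, `D`, `C0`), and the assumption that every stage samples `stage_steps * batch_size * block_size` optimizer tokens are all hypothetical.

```python
import math


def make_anchor_spec(name, coeffs, parameters, token_caps, stage_steps,
                     batch_size, block_size, min_rate=0.02, max_rate=0.65,
                     precision=3):
    """Build a one-line anchor spec such as 'name:250000=0.300,...'.

    Assumes each stage samples stage_steps * batch_size * block_size tokens,
    so C_t grows linearly with the stage index (an illustrative assumption).
    """
    tokens_per_stage = stage_steps * batch_size * block_size
    anchors = []
    for stage, cap in enumerate(token_caps, start=1):
        cumulative = stage * tokens_per_stage      # C_t
        x = math.log10(parameters / cap)           # x_t = log10(P / U_t)
        y = math.log10(cumulative / cap)           # y_t = log10(C_t / U_t)
        raw = (coeffs["A"] * x + coeffs["B"] * y
               + coeffs.get("D", 0.0) * x * y + coeffs["C0"])
        # Clamp the raw prediction into [min_rate, max_rate].
        clipped = max(min_rate, min(max_rate, raw))
        anchors.append(f"{cap}={clipped:.{precision}f}")
    return f"{name}:" + ",".join(anchors)
```

Comparing `raw` with `clipped` for each stage is a quick way to spot a schedule that sits entirely at one of the clamp bounds.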
|
| 610 |
+
|
| 611 |
+
The script prints two things:
|
| 612 |
+
|
| 613 |
+
1. a JSON diagnostic table with raw and clipped dropout values
|
| 614 |
+
2. a final one-line anchor spec, for example:
|
| 615 |
+
|
| 616 |
+
```text
|
| 617 |
+
<regime>_interaction:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020
|
| 618 |
+
```
|
| 619 |
+
|
| 620 |
+
That final line is copied into the next command as `--anchor-decays`.
|
| 621 |
+
|
| 622 |
+
`<PROMOTED_COEFFICIENTS_JSON>` should point to the coefficient file selected by
|
| 623 |
+
the coefficient gate. In a clean first pass, this is usually:
|
| 624 |
+
|
| 625 |
+
```text
|
| 626 |
+
runs/coefficient_calibration/<regime>_interaction/coefficients.json
|
| 627 |
+
```
|
| 628 |
+
|
| 629 |
+
If optional refinement was needed and accepted, use the refined coefficient
|
| 630 |
+
file instead:
|
| 631 |
+
|
| 632 |
+
```text
|
| 633 |
+
runs/coefficient_calibration/<regime>_interaction_refined/coefficients.json
|
| 634 |
+
```
|
| 635 |
+
|
| 636 |
+
Decision rule:
|
| 637 |
+
|
| 638 |
+
```text
|
| 639 |
+
freeze this anchor spec before streaming starts
|
| 640 |
+
do not edit the schedule after looking at streaming validation losses
|
| 641 |
+
```
|
| 642 |
+
|
| 643 |
+
If the anchor schedule looks pathological before training, such as all values
|
| 644 |
+
clipping at `p_min` or `p_max`, inspect the coefficient fit and calibration
|
| 645 |
+
cells before launching streaming.
|
| 646 |
+
|
| 647 |
+
### New Regime Step 6: Five-Seed Locked Streaming Validation
|
| 648 |
+
|
| 649 |
+
Run:
|
| 650 |
+
|
| 651 |
+
```bash
|
| 652 |
+
.venv/bin/python scripts/run_experiments.py \
|
| 653 |
+
--mode locked_stream \
|
| 654 |
+
--use-cached-data \
|
| 655 |
+
--cache-dir .cache/dropout_decay_<regime> \
|
| 656 |
+
--output-dir runs/<regime>_<model>_streaming_validation_5seed \
|
| 657 |
+
--models <WINNER_MODEL_NAME=layersxheadsxdim> \
|
| 658 |
+
--seeds 1 2 3 4 5 \
|
| 659 |
+
--stream-token-caps <U1> <U2> <U3> <U4> <U5> \
|
| 660 |
+
--dropout-rates 0 0.02 0.04 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
|
| 661 |
+
--anchor-decays <FROZEN_ANCHOR_SPEC_FROM_STEP_5> \
|
| 662 |
+
--stage-steps <STAGE_STEPS> \
|
| 663 |
+
--batch-size <BATCH> \
|
| 664 |
+
--block-size <BLOCK> \
|
| 665 |
+
--eval-batches <EVAL_BATCHES> \
|
| 666 |
+
--train-eval-batches <TRAIN_EVAL_BATCHES> \
|
| 667 |
+
--trace-eval-batches <TRACE_EVAL_BATCHES> \
|
| 668 |
+
--log-every 250 \
|
| 669 |
+
--vocab-size <VOCAB_SIZE> \
|
| 670 |
+
--val-tokens <VAL_TOKENS> \
|
| 671 |
+
--lr <LR> \
|
| 672 |
+
--weight-decay <WEIGHT_DECAY> \
|
| 673 |
+
--grad-clip 1.0
|
| 674 |
+
```
|
| 675 |
+
|
| 676 |
+
What this script run does:
|
| 677 |
+
|
| 678 |
+
`locked_stream` is the paper-grade test. It simulates a stream by increasing
|
| 679 |
+
the available prefix tokens over stages. For each seed, it trains:
|
| 680 |
+
|
| 681 |
+
| Condition type | Meaning |
|
| 682 |
+
|---|---|
|
| 683 |
+
| static dropout baselines | same dropout at every stream stage |
|
| 684 |
+
| anchor decay schedule | frozen coefficient-derived dropout at each stream stage |
|
| 685 |
+
|
| 686 |
+
The static baselines must be broad enough to make the comparison fair. The
|
| 687 |
+
claim is not that decay beats weak static choices; the claim is that it can beat
|
| 688 |
+
the best static dropout available in the tested grid.
|
| 689 |
+
|
| 690 |
+
Expected outputs under
|
| 691 |
+
`runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/`:
|
| 692 |
+
|
| 693 |
+
| File | Use |
|
| 694 |
+
|---|---|
|
| 695 |
+
| `metrics.jsonl` | raw row-level results for each condition, seed, and prefix |
|
| 696 |
+
| `summary.csv` / `summary.json` | aggregate condition and stage summaries |
|
| 697 |
+
| `trace.jsonl` | progress traces for diagnostic plotting |
|
| 698 |
+
| `config.json` | exact run configuration |
|
| 699 |
+
| `RESULT_SUMMARY.md` | built-in readable summary |
|
| 700 |
+
|
| 701 |
+
Primary evaluation metrics:
|
| 702 |
+
|
| 703 |
+
```text
|
| 704 |
+
final validation loss at largest prefix
|
| 705 |
+
mean trajectory validation loss
|
| 706 |
+
stage-wise validation loss
|
| 707 |
+
paired seed delta versus the best static baseline
|
| 708 |
+
rank consistency across seeds
|
| 709 |
+
```
|
| 710 |
+
|
| 711 |
+
Decision rule:
|
| 712 |
+
|
| 713 |
+
```text
|
| 714 |
+
strong pass: decay has best mean final loss and beats best static in most or all
|
| 715 |
+
paired seeds
|
| 716 |
+
|
| 717 |
+
weak pass: decay ties best static while avoiding bad early/late static choices
|
| 718 |
+
|
| 719 |
+
fail: decay loses to a simple static baseline in most paired seeds or wins early
|
| 720 |
+
only by sacrificing final loss
|
| 721 |
+
```
|
| 722 |
+
|
| 723 |
+
### New Regime Step 7: Summarize Streaming Validation
|
| 724 |
+
|
| 725 |
+
Run:
|
| 726 |
+
|
| 727 |
+
```bash
|
| 728 |
+
.venv/bin/python scripts/summarize_streaming_multiseed.py \
|
| 729 |
+
--metrics runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/metrics.jsonl \
|
| 730 |
+
--output-dir runs/<regime>_streaming_report/<model>_validation_5seed \
|
| 731 |
+
--report docs/<regime>_streaming_report.md \
|
| 732 |
+
--title "<Regime Name> Streaming Validation" \
|
| 733 |
+
--date <YYYY-MM-DD> \
|
| 734 |
+
--context "<regime/model/token/step description>" \
|
| 735 |
+
--conditions <regime>_interaction static_dropout_0.1 static_dropout_0.08 static_dropout_0.06 static_dropout_0.14 static_dropout_0.18 static_dropout_0.2 static_dropout_0.04 static_dropout_0.02 static_dropout_0 static_dropout_0.26 static_dropout_0.3
|
| 736 |
+
```
|
| 737 |
+
|
| 738 |
+
What this script run does:
|
| 739 |
+
|
| 740 |
+
`scripts/summarize_streaming_multiseed.py` performs no training. It reads the
|
| 741 |
+
saved `metrics.jsonl` file and writes standardized artifacts comparable across
|
| 742 |
+
regimes.
|
| 743 |
+
|
| 744 |
+
Expected outputs:
|
| 745 |
+
|
| 746 |
+
| File | Use |
|
| 747 |
+
|---|---|
|
| 748 |
+
| `docs/<regime>_streaming_report.md` | human-readable regime report for paper discussion |
|
| 749 |
+
| `condition_summary.csv` | condition ranking by final validation loss |
|
| 750 |
+
| `stage_summary.csv` | stage-wise trajectory table |
|
| 751 |
+
| `paired_final_deltas.csv` | per-seed final-loss comparison against the best static baseline |
|
| 752 |
+
|
| 753 |
+
The most important table is `paired_final_deltas.csv`. A mean win is useful, but
|
| 754 |
+
paired seed wins are stronger because they reduce initialization-bias concerns.
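
A minimal sketch of the paired comparison, assuming final losses are available as a nested mapping `{condition: {seed: final_val_loss}}` and that "best static" means the lowest static final loss for that same seed. The function name and the input layout are hypothetical; `scripts/summarize_streaming_multiseed.py` may organize this differently.

```python
def paired_final_deltas(final_loss, decay_name):
    """Compare one decay schedule with the per-seed best static baseline.

    final_loss: {condition: {seed: final validation loss}} (hypothetical layout).
    Returns ({seed: decay_loss - best_static_loss}, number_of_paired_wins).
    A negative delta means the decay schedule won for that seed.
    """
    statics = {
        cond: by_seed
        for cond, by_seed in final_loss.items()
        if cond.startswith("static_dropout_")
    }
    deltas = {}
    for seed, decay_loss in final_loss[decay_name].items():
        # Best static baseline for this seed only, so the pairing is seed-matched.
        best_static = min(by_seed[seed] for by_seed in statics.values())
        deltas[seed] = decay_loss - best_static
    wins = sum(1 for delta in deltas.values() if delta < 0)
    return deltas, wins
```

The win count returned here is the quantity the decision rule below refers to.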
|
| 755 |
+
|
| 756 |
+
Decision rule:
|
| 757 |
+
|
| 758 |
+
```text
|
| 759 |
+
if the decay schedule wins 5/5 paired seeds: promote regime to strong evidence
|
| 760 |
+
if it wins 3-4/5: inspect effect size, variance, and trajectory tradeoff
|
| 761 |
+
if it wins 0-2/5: treat as a failed regime or schedule and do not bury it
|
| 762 |
+
```
|
| 763 |
+
|
| 764 |
+
### New Regime Step 8: Smoke Check And Commit
|
| 765 |
+
|
| 766 |
+
Run:
|
| 767 |
+
|
| 768 |
+
```bash
|
| 769 |
+
.venv/bin/python -m py_compile \
|
| 770 |
+
scripts/run_experiments.py \
|
| 771 |
+
scripts/fit_dropout_coefficients.py \
|
| 772 |
+
scripts/make_streaming_anchors.py \
|
| 773 |
+
scripts/summarize_streaming_multiseed.py
|
| 774 |
+
```
|
| 775 |
+
|
| 776 |
+
What this script run does:
|
| 777 |
+
|
| 778 |
+
This is a code integrity check. It does not validate the scientific result, but
|
| 779 |
+
it catches syntax errors (`py_compile` does not execute imports) in the scripts required to reproduce the
|
| 780 |
+
regime.
|
| 781 |
+
|
| 782 |
+
After the smoke check, update this `docs/plan.md` ledger and commit:
|
| 783 |
+
|
| 784 |
+
```text
|
| 785 |
+
docs/<regime>_streaming_report.md
|
| 786 |
+
runs/<regime>_streaming_report/<model>_validation_5seed/
|
| 787 |
+
runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/
|
| 788 |
+
runs/coefficient_calibration/<regime>_interaction/
|
| 789 |
+
```
|
| 790 |
+
|
| 791 |
+
Do not commit temporary checkpoints or external corpus files unless they are
|
| 792 |
+
small, intentionally versioned, and needed for reproducibility.
|
| 793 |
+
|
| 794 |
## Current Regime Ledger
|
| 795 |
|
| 796 |
| Regime | Status | Role |
|
|
|
|
| 871 |
| `smooth_low` | 4/5, with the one miss only `+0.0003` |
|
| 872 |
|
| 873 |
The immediate risk is no longer seed count for TinyStories or OpenWebText10K.
|
| 874 |
+
The main remaining risk is external validity beyond the three tested text
|
| 875 |
+
regimes and robustness across controlled architecture or token-budget changes.
|
| 876 |
+
The current defensible claim is:
|
| 877 |
|
| 878 |
```text
|
| 879 |
Formula-derived dropout schedules track the moving useful dropout region and
|
|
|
|
| 886 |
Formula-derived dropout decay beats the best static dropout.
|
| 887 |
```
|
| 888 |
|
| 889 |
+
is supported at `n=5` in TinyStories, OpenWebText10K, and WikiText-103. The
|
| 890 |
+
strongest schedule in each of the three regimes beats the per-seed best static
|
| 891 |
+
baseline in all five seeds.
|
| 892 |
|
| 893 |
Latest OpenWebText10K 5-seed streaming final-loss table:
|
| 894 |
|
|
|
|
| 903 |
| static `0.02` | 4.5358 | 0.0091 |
|
| 904 |
| static `0.00` | 4.5943 | 0.0216 |
|
| 905 |
|
| 906 |
+
OpenWebText10K condition provenance:
|
| 907 |
+
|
| 908 |
+
| Condition | Provenance | How to interpret it |
|
| 909 |
+
|---|---|---|
|
| 910 |
+
| `openwebtext10k_interaction` | coefficient-derived interaction schedule | main OpenWebText10K formula hypothesis test |
|
| 911 |
+
| `hold_30_then_decay` | heuristic schedule-search ablation | manually specified after exploratory single-seed OpenWebText10K schedule search; not generated from coefficients |
|
| 912 |
+
| `mild_30_to_08` | heuristic schedule-search ablation | manually specified after exploratory single-seed OpenWebText10K schedule search; not generated from coefficients |
|
| 913 |
+
| `fitted_l16_static_law` | older fitted/static-law schedule | retained as a comparison to the earlier aggressive fitted path |
|
| 914 |
+
| static conditions | fixed dropout baselines | same dropout at every stream prefix |
|
| 915 |
+
|
| 916 |
+
The heuristic OpenWebText10K schedules were chosen from failure analysis, not
|
| 917 |
+
from the final coefficient formula. The older `fitted_l16_static_law` path
|
| 918 |
+
started too high (`0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02`), while static
|
| 919 |
+
dropout `0.30` looked useful early but worse at the final 4M-token stage and
|
| 920 |
+
static dropout `0.14` was the strongest static endpoint. This motivated two
|
| 921 |
+
manual ablations:
|
| 922 |
+
|
| 923 |
+
```text
|
| 924 |
+
hold_30_then_decay = 0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02
|
| 925 |
+
mild_30_to_08 = 0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08
|
| 926 |
+
```
|
| 927 |
+
|
| 928 |
+
These ablations support the broader mechanism that stream-dependent dropout can
|
| 929 |
+
matter, but they should not be used as evidence that the coefficient formula
|
| 930 |
+
generated those exact schedules. The formula claim for OpenWebText10K should be
|
| 931 |
+
based on `openwebtext10k_interaction`.
|
| 932 |
+
|
| 933 |
Paired final-loss result:
|
| 934 |
|
| 935 |
| Decay schedule | Paired wins vs best static |
|
|
|
|
| 1010 |
|
| 1011 |
## Next Training After Current Gate
|
| 1012 |
|
| 1013 |
+
No MPS training should launch until the three completed five-seed streaming
|
| 1014 |
+
reports are read together. Since a third held-out text regime is no longer the
|
| 1015 |
+
limiting issue, use the next run only for a narrowed robustness test:
|
| 1016 |
|
| 1017 |
```text
|
| 1018 |
completed: TinyStories 5-seed streaming report
|