Mandeep Sidhu committed on
Commit b5daf7c · 1 Parent(s): cf52b0e

Document regime runbook and schedule provenance

docs/openwebtext10k_streaming_report.md CHANGED
@@ -16,6 +16,26 @@ baselines.
16
 
17
  - `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl`
18
 
19
  ## Condition Ranking By Final Loss
20
 
21
  | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
 
16
 
17
  - `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl`
18
 
19
+ ## Condition Provenance
20
+
21
+ The `anchor_decay` label means the dropout value is chosen from explicit
22
+ prefix-token anchors. It does not by itself imply that the schedule came from
23
+ the coefficient formula.
24
+
25
+ | Condition | Provenance | Dropout path | Interpretation |
26
+ |---|---|---|---|
27
+ | `openwebtext10k_interaction` | coefficient-derived schedule | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` | Main OpenWebText10K formula-derived schedule. This is the condition that tests the regime-specific interaction coefficient hypothesis. |
28
+ | `hold_30_then_decay` | heuristic schedule-search ablation | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It caps the initial dropout at `0.30`, holds it for the two smallest stream prefixes, then releases capacity aggressively. |
29
+ | `mild_30_to_08` | heuristic schedule-search ablation | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It tests whether a smoother decay from `0.30` to a moderate final dropout is competitive. |
30
+ | `fitted_l16_static_law` | older fitted/static-law schedule | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` | Retained as a comparison to the earlier overly aggressive fitted schedule; it is not the current interaction formula schedule. |
31
+ | `static_dropout_*` | static baseline | constant | Fixed dropout used at every stream prefix. |
32
+
33
+ The two heuristic schedules should be treated as ablations, not as independent
34
+ evidence that the coefficient formula generated their exact paths. Their role is
35
+ to show that the shape of the decay matters and that reasonable hand-designed
36
+ decays can also beat weak static choices. The main formula claim for this
37
+ regime should be based on `openwebtext10k_interaction`.
38
+
39
  ## Condition Ranking By Final Loss
40
 
41
  | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
docs/plan.md CHANGED
@@ -277,6 +277,520 @@ Use this order for every regime.
277
  7. Immediately backtest the new regime against all previous regimes.
278
  8. Only then run expensive streaming validation in the new regime.
279
 
280
  ## Current Regime Ledger
281
 
282
  | Regime | Status | Role |
@@ -357,8 +871,9 @@ Paired final-loss result:
357
  | `smooth_low` | 4/5, with the one miss only `+0.0003` |
358
 
359
  The immediate risk is no longer seed count for TinyStories or OpenWebText10K.
360
- The main remaining risk is external validity beyond two tested regimes. The
361
- current defensible claim is:
362
 
363
  ```text
364
  Formula-derived dropout schedules track the moving useful dropout region and
@@ -371,9 +886,9 @@ The stronger claim:
371
  Formula-derived dropout decay beats the best static dropout.
372
  ```
373
 
374
- is supported at `n=5` in both the TinyStories and OpenWebText10K streaming
375
- setups, with interaction decay beating the per-seed best static baseline in all
376
- five seeds in both regimes.
377
 
378
  Latest OpenWebText10K 5-seed streaming final-loss table:
379
 
@@ -388,6 +903,33 @@ Latest OpenWebText10K 5-seed streaming final-loss table:
388
  | static `0.02` | 4.5358 | 0.0091 |
389
  | static `0.00` | 4.5943 | 0.0216 |
390
 
391
  Paired final-loss result:
392
 
393
  | Decay schedule | Paired wins vs best static |
@@ -468,9 +1010,9 @@ the same MPS-only, five-seed validation standard.
468
 
469
  ## Next Training After Current Gate
470
 
471
- No MPS training should launch until the two completed five-seed streaming
472
- reports are read together. Since OpenWebText10K seed count is no longer the
473
- limiting issue, use a third held-out regime for the next validation step:
474
 
475
  ```text
476
  completed: TinyStories 5-seed streaming report
 
277
  7. Immediately backtest the new regime against all previous regimes.
278
  8. Only then run expensive streaming validation in the new regime.
279
 
280
+ ## New Regime Script Runbook
281
+
282
+ Use this exact command sequence for any new regime. Replace placeholders such as
283
+ `<regime>`, `<CORPUS_OR_PARQUET_PATH>`, `<MODEL_SPEC>`, and `<TIMESTAMP>` with
284
+ concrete values before launching. Do not skip from calibration directly to
285
+ streaming: the schedule must be frozen from the coefficient fit before the
286
+ streaming run starts.
287
+
288
+ This section is intentionally verbose. Its purpose is to make future regimes
289
+ auditable: an external reader should be able to tell what each script did, what
290
+ file it produced, and which decision gate came next.
291
+
292
+ ### New Regime Step 0: MPS Preflight
293
+
294
+ Run this before any torch training command:
295
+
296
+ ```bash
297
+ .venv/bin/python -c "import torch; print({'mps_built': torch.backends.mps.is_built(), 'mps_available': torch.backends.mps.is_available(), 'cuda_available': torch.cuda.is_available()}); raise SystemExit(0 if torch.backends.mps.is_available() else 1)"
298
+ ```
299
+
300
+ What this does:
301
+
302
+ | Check | Meaning |
303
+ |---|---|
304
+ | `mps_built` | PyTorch was compiled with Apple MPS support |
305
+ | `mps_available` | this machine can actually run MPS now |
306
+ | `cuda_available` | CUDA is visible to PyTorch; it should not be used for this project |
307
+
308
+ Decision rule:
309
+
310
+ ```text
311
+ if mps_available is false: stop and report
312
+ if cuda_available is true: still do not use CUDA
313
+ ```
314
+
315
+ Also check for duplicate experiment processes before launching a long run. This
316
+ is not part of the coefficient method, but it prevents corrupt timing/resource
317
+ comparisons.
318
+
319
+ ### New Regime Step 1: Static Dropout Calibration Screen
320
+
321
+ Run:
322
+
323
+ ```bash
324
+ .venv/bin/python scripts/run_experiments.py \
325
+ --mode screen_static \
326
+ --corpus <CORPUS_OR_PARQUET_PATH> \
327
+ --text-column <TEXT_COLUMN_IF_PARQUET> \
328
+ --cache-dir .cache/dropout_decay_<regime> \
329
+ --output-dir runs/<regime>_static_screen \
330
+ --models <M1=layersxheadsxdim> <M2=layersxheadsxdim> <M3=layersxheadsxdim> \
331
+ --seeds 1 2 \
332
+ --token-limits <U1> <U2> <U3> <U4> \
333
+ --dropout-rates 0 0.02 0.04 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
334
+ --steps <STATIC_STEPS> \
335
+ --batch-size <BATCH> \
336
+ --block-size <BLOCK> \
337
+ --eval-batches <EVAL_BATCHES> \
338
+ --train-eval-batches <TRAIN_EVAL_BATCHES> \
339
+ --trace-eval-batches <TRACE_EVAL_BATCHES> \
340
+ --vocab-size <VOCAB_SIZE> \
341
+ --val-tokens <VAL_TOKENS> \
342
+ --lr <LR> \
343
+ --weight-decay <WEIGHT_DECAY> \
344
+ --grad-clip 1.0 \
345
+ --screen-early-stop
346
+ ```
347
+
348
+ What this script run does:
349
+
350
+ `scripts/run_experiments.py --mode screen_static` trains a grid of static
351
+ dropout models. It does not test the final decay hypothesis. It estimates the
352
+ best static dropout rate for each calibration cell:
353
+
354
+ ```text
355
+ cell = (model parameter count P, prefix/unique tokens U, sampled tokens C)
356
+ ```
357
+
358
+ For each cell, the script evaluates a fixed dropout grid and writes the
359
+ validation curve. The curve is used later to extract the target dropout `p*`.
360
+
361
+ Expected outputs under `runs/<regime>_static_screen/screen_static/<TIMESTAMP>/`:
362
+
363
+ | File | Use |
364
+ |---|---|
365
+ | `metrics.jsonl` | per-run raw metrics; includes token limit, model, seed, losses, and tokens seen |
366
+ | `model_selection.csv` | per-cell static dropout curve and selected best dropout |
367
+ | `summary.csv` / `summary.json` | compact aggregate summary |
368
+ | `trace.jsonl` | lower-frequency trace for diagnostics |
369
+ | `RESULT_SUMMARY.md` | human-readable first-pass summary |
370
+
371
+ Why this is needed:
372
+
373
+ The coefficient formula is not fitted from streaming outcomes. It is fitted
374
+ from static dropout optima. This separation is essential: calibration estimates
375
+ where useful regularization sits; streaming validation tests whether following
376
+ that moving estimate helps.
377
+
378
+ Recommended cheap calibration:
379
+
380
+ | Dimension | Default |
381
+ |---|---|
382
+ | models | at least 3 model sizes if testing coefficient generality |
383
+ | token prefixes | at least 4 prefixes |
384
+ | seeds | 1-2 for calibration, 5 only for final streaming validation |
385
+ | dropout grid | include low, middle, and high values so the optimum can be bracketed |
386
+
387
+ Decision rule:
388
+
389
+ ```text
390
+ continue if most cells have a bracketed or near-bracketed optimum
391
+ refine if many best dropouts sit at the edge of the grid
392
+ stop and inspect if validation curves are flat/noisy enough that p* is unstable
393
+ ```
394
+
395
+ ### New Regime Step 2: Fit First-Order Base Coefficients
396
+
397
+ Run:
398
+
399
+ ```bash
400
+ .venv/bin/python scripts/fit_dropout_coefficients.py \
401
+ --run-dirs runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
402
+ --output-dir runs/coefficient_calibration/<regime>_base \
403
+ --target quad \
404
+ --weighting heuristic \
405
+ --feature-set base \
406
+ --min-rate 0.0 \
407
+ --max-rate 0.30
408
+ ```
409
+
410
+ What this script run does:
411
+
412
+ `scripts/fit_dropout_coefficients.py` reads `model_selection.csv` and
413
+ `metrics.jsonl` from the static screen. It converts each calibration cell into:
414
+
415
+ ```text
416
+ x = log10(P / U)
417
+ y = log10(C / U)
418
+ target = observed useful static dropout p*
419
+ ```
420
+
421
+ With `--feature-set base`, it fits the first-order ablation:
422
+
423
+ ```text
424
+ p* ~= A*x + B*y + C0
425
+ ```
426
+
427
+ With `--target quad`, the target `p*` is the local quadratic minimum around the
428
+ best dropout grid point when the curve is bracketed. If the curve is not
429
+ bracketed, the script falls back to the grid best and marks the cell as weaker
430
+ evidence.
431
+
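The `--target quad` behavior described above can be sketched as a three-point parabola fit around the best grid point. This is a minimal illustration, not the code in `scripts/fit_dropout_coefficients.py`: the function name `quad_target`, the three-point window, and the rule of clipping the vertex to the neighboring grid points are all assumptions made for the example.

```python
def quad_target(rates, losses):
    """Return (p_star, bracketed) for one static dropout curve.

    Assumes `rates` is sorted ascending with distinct values and that
    `losses[i]` is the validation loss measured at `rates[i]`.
    """
    best = min(range(len(rates)), key=lambda i: losses[i])

    # Boundary optimum: the curve is not bracketed, so fall back to the
    # grid best and flag the cell as weaker evidence.
    if best == 0 or best == len(rates) - 1:
        return rates[best], False

    x0, x1, x2 = rates[best - 1:best + 2]
    y0, y1, y2 = losses[best - 1:best + 2]

    # Coefficients of the parabola a*x^2 + b*x + c through the three points.
    denom = (x0 - x1) * (x0 - x2) * (x1 - x2)
    a = (x2 * (y1 - y0) + x1 * (y0 - y2) + x0 * (y2 - y1)) / denom
    b = (x2 ** 2 * (y0 - y1) + x1 ** 2 * (y2 - y0) + x0 ** 2 * (y1 - y2)) / denom

    # A flat or concave local fit has no usable minimum (hypothetical rule).
    if a <= 0:
        return rates[best], False

    vertex = -b / (2 * a)
    # Keep the estimate inside the bracketing grid points (assumption).
    return min(max(vertex, x0), x2), True
```

A cell whose second return value is `False` would be the kind of cell that the heuristic weighting downweights.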
432
+ With `--weighting heuristic`, the fit downweights cells that are less reliable:
433
+
434
+ | Cell condition | Why it is weaker |
435
+ |---|---|
436
+ | boundary optimum | true optimum may be outside the tested dropout grid |
437
+ | not bracketed | local quadratic minimum is less trustworthy |
438
+ | very flat curve | many dropout rates perform nearly the same |
439
+ | noisy best loss | target dropout is less stable |
440
+
441
+ Expected outputs under `runs/coefficient_calibration/<regime>_base/`:
442
+
443
+ | File | Use |
444
+ |---|---|
445
+ | `coefficients.json` | fitted `A`, `B`, `C0`, metrics, and cross-validation scores |
446
+ | `fit_diagnostics.md` | readable coefficient table, formula, fit metrics, and cell residuals |
447
+ | `calibration_cells.csv` | one row per fitted cell with target, prediction, residual, and flags |
448
+ | `next_dropout_suggestions.csv` | dropout rates to add if a cell needs refinement |
449
+
450
+ Why this is needed:
451
+
452
+ The base model is the simplest pressure-law hypothesis. It is the ablation that
453
+ tells reviewers whether the interaction term is actually necessary.
454
+
455
+ Decision rule:
456
+
457
+ ```text
458
+ if base MAE and held-out errors are already low: keep it as a strong ablation
459
+ if base has biased residuals or higher MAE: compare against interaction next
460
+ ```
461
+
462
+ ### New Regime Step 3: Fit Interaction Coefficients
463
+
464
+ Run:
465
+
466
+ ```bash
467
+ .venv/bin/python scripts/fit_dropout_coefficients.py \
468
+ --run-dirs runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
469
+ --output-dir runs/coefficient_calibration/<regime>_interaction \
470
+ --target quad \
471
+ --weighting heuristic \
472
+ --feature-set interaction \
473
+ --min-rate 0.0 \
474
+ --max-rate 0.30
475
+ ```
476
+
477
+ What this script run does:
478
+
479
+ This repeats the same target extraction and weighted least-squares fitting, but
480
+ uses the interaction pressure law:
481
+
482
+ ```text
483
+ p* ~= A*x + B*y + D*x*y + C0
484
+ ```
485
+
486
+ The extra term `D*x*y` lets model/data pressure and sampled-token pressure
487
+ interact. Empirically, this has mattered because dropout pressure is not always
488
+ additive: the useful effect of seeing more cumulative sampled tokens can depend
489
+ on how oversized the model is relative to the available unique data.
490
+
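The base and interaction fits can be illustrated with a small weighted least-squares sketch. It uses only the feature definitions and formulas given above; the function names (`features`, `solve`, `fit_wls`), the normal-equation solver, and the caller-supplied weights are assumptions for illustration and are not taken from `scripts/fit_dropout_coefficients.py`.

```python
import math


def features(P, U, C, interaction=True):
    """Map a calibration cell (P, U, C) to the row [x, y, (x*y), 1]."""
    x = math.log10(P / U)
    y = math.log10(C / U)
    row = [x, y]
    if interaction:
        row.append(x * y)
    row.append(1.0)  # intercept column for C0
    return row


def solve(M, v):
    """Solve M @ beta = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design: add more distinct cells")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_wls(cells, targets, weights, interaction=True):
    """Weighted least squares via (X^T W X) beta = X^T W t.

    Returns [A, B, D, C0] with the interaction term, or [A, B, C0] without.
    """
    X = [features(P, U, C, interaction) for P, U, C in cells]
    k = len(X[0])
    M = [[sum(w * r[i] * r[j] for r, w in zip(X, weights)) for j in range(k)]
         for i in range(k)]
    v = [sum(w * r[i] * t for r, t, w in zip(X, targets, weights))
         for i in range(k)]
    return solve(M, v)
```

Calling `fit_wls(..., interaction=False)` and `fit_wls(..., interaction=True)` on the same cells gives the two coefficient sets that the base-versus-interaction comparison needs. The comparison should still be made on held-out error rather than in-sample fit.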
491
+ Expected outputs are the same as Step 2, but under:
492
+
493
+ ```text
494
+ runs/coefficient_calibration/<regime>_interaction/
495
+ ```
496
+
497
+ Decision rule:
498
+
499
+ ```text
500
+ promote interaction if it lowers MAE/RMSE, improves leave-prefix/leave-model
501
+ validation, and does not create obvious residual bias
502
+ ```
503
+
504
+ Do not promote the interaction form merely because it has more parameters. The
505
+ paper needs the base-vs-interaction comparison to show that the extra term buys
506
+ predictive accuracy, not just in-sample flexibility.
507
+
508
+ ### New Regime Step 4: Optional Static Refinement
509
+
510
+ Only run this if `fit_diagnostics.md` or `next_dropout_suggestions.csv` shows
511
+ that important cells are weakly identified.
512
+
513
+ Run:
514
+
515
+ ```bash
516
+ .venv/bin/python scripts/run_experiments.py \
517
+ --mode screen_static \
518
+ --resume-from runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
519
+ --use-cached-data \
520
+ --cache-dir .cache/dropout_decay_<regime> \
521
+ --output-dir runs/<regime>_static_refined \
522
+ --models <ONLY_AFFECTED_MODELS> \
523
+ --seeds 1 2 \
524
+ --token-limits <ONLY_AFFECTED_PREFIXES> \
525
+ --dropout-rates <SUGGESTED_RATES> \
526
+ --steps <STATIC_STEPS> \
527
+ --batch-size <BATCH> \
528
+ --block-size <BLOCK> \
529
+ --eval-batches <EVAL_BATCHES> \
530
+ --train-eval-batches <TRAIN_EVAL_BATCHES> \
531
+ --trace-eval-batches <TRACE_EVAL_BATCHES> \
532
+ --vocab-size <VOCAB_SIZE> \
533
+ --val-tokens <VAL_TOKENS> \
534
+ --lr <LR> \
535
+ --weight-decay <WEIGHT_DECAY> \
536
+ --grad-clip 1.0
537
+ ```
538
+
539
+ What this script run does:
540
+
541
+ This adds only missing static dropout points. It should not rerun the full grid.
542
+ `--resume-from` lets the experiment skip rows already completed in the original
543
+ static screen. `--use-cached-data` reuses the cached tokenizer and token arrays
544
+ so refinement is measuring dropout/model behavior, not data preprocessing
545
+ differences.
546
+
547
+ When to use it:
548
+
549
+ | Trigger | Refinement action |
550
+ |---|---|
551
+ | best dropout is at grid edge | add rates beyond or near that edge if allowed |
552
+ | curve is too coarse near optimum | add rates around the local best |
553
+ | static curve is flat | add seeds or eval batches before changing the formula |
554
+
555
+ After refinement, rerun Steps 2 and 3 with all relevant run dirs. At minimum,
556
+ rerun the promoted feature family. If the paper will compare base versus
557
+ interaction after refinement, rerun both.
558
+
559
+ ```bash
560
+ .venv/bin/python scripts/fit_dropout_coefficients.py \
561
+ --run-dirs \
562
+ runs/<regime>_static_screen/screen_static/<TIMESTAMP> \
563
+ runs/<regime>_static_refined/screen_static/<TIMESTAMP> \
564
+ --output-dir runs/coefficient_calibration/<regime>_interaction_refined \
565
+ --target quad \
566
+ --weighting heuristic \
567
+ --feature-set interaction \
568
+ --min-rate 0.0 \
569
+ --max-rate 0.30
570
+ ```
571
+
572
+ Decision rule:
573
+
574
+ ```text
575
+ refinement is complete when the promoted coefficient fit has acceptable MAE,
576
+ held-out errors, and no obvious residual direction across P/U or C/U
577
+ ```
578
+
579
+ ### New Regime Step 5: Generate Frozen Streaming Anchors
580
+
581
+ Run:
582
+
583
+ ```bash
584
+ .venv/bin/python scripts/make_streaming_anchors.py \
585
+ --coefficients-json <PROMOTED_COEFFICIENTS_JSON> \
586
+ --name <regime>_interaction \
587
+ --parameters <WINNER_MODEL_PARAM_COUNT> \
588
+ --stream-token-caps <U1> <U2> <U3> <U4> <U5> \
589
+ --stage-steps <STAGE_STEPS> \
590
+ --batch-size <BATCH> \
591
+ --block-size <BLOCK> \
592
+ --min-rate 0.02 \
593
+ --max-rate 0.65 \
594
+ --precision 3
595
+ ```
596
+
597
+ What this script run does:
598
+
599
+ `scripts/make_streaming_anchors.py` turns `coefficients.json` into the exact
600
+ dropout schedule used by `locked_stream`. For each stream prefix, it computes:
601
+
602
+ ```text
603
+ P = chosen model parameter count
604
+ U_t = stream prefix tokens at stage t
605
+ C_t = cumulative sampled optimizer tokens through stage t
606
+ x_t = log10(P / U_t)
607
+ y_t = log10(C_t / U_t)
608
+ p_t = min(p_max, max(p_min, A*x_t + B*y_t + D*x_t*y_t + C0))
609
+ ```
610
+
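The per-stage computation above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of `scripts/make_streaming_anchors.py`: the function name `make_anchor_spec`, the coefficient dictionary keys (`A`, `B`, `D`, `C0`), and the rule that each stage adds `stage_steps * batch_size * block_size` sampled tokens to `C_t` are all hypothetical.

```python
import math


def make_anchor_spec(name, coeffs, params, token_caps, stage_steps,
                     batch_size, block_size, p_min=0.02, p_max=0.65,
                     precision=3):
    """Build a one-line anchor spec of the form name:U1=p1,U2=p2,..."""
    # Assumption: every stage samples the same number of optimizer tokens.
    tokens_per_stage = stage_steps * batch_size * block_size
    parts = []
    for stage, u_t in enumerate(token_caps, start=1):
        c_t = stage * tokens_per_stage  # cumulative sampled tokens
        x_t = math.log10(params / u_t)
        y_t = math.log10(c_t / u_t)
        raw = (coeffs["A"] * x_t + coeffs["B"] * y_t
               + coeffs.get("D", 0.0) * x_t * y_t + coeffs["C0"])
        p_t = min(p_max, max(p_min, raw))  # clip into [p_min, p_max]
        parts.append(f"{u_t}={p_t:.{precision}f}")
    return f"{name}:" + ",".join(parts)
```

Comparing the raw and clipped values per stage is a quick way to spot the pathological case where every anchor sits at `p_min` or `p_max`.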
611
+ The script prints two things:
612
+
613
+ 1. a JSON diagnostic table with raw and clipped dropout values
614
+ 2. a final one-line anchor spec, for example:
615
+
616
+ ```text
617
+ <regime>_interaction:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020
618
+ ```
619
+
620
+ That final line is copied into the next command as `--anchor-decays`.
621
+
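As a reading aid, an anchor spec in the format shown above can be parsed back into a lookup table. The sketch below is hypothetical: the function names are invented for illustration, and the piecewise-constant lookup (use the largest anchor at or below the current prefix size) is an assumption about how `locked_stream` might interpret the anchors, not a description of `scripts/run_experiments.py`.

```python
def parse_anchor_spec(spec):
    """Split 'name:tokens=rate,tokens=rate,...' into (name, {tokens: rate})."""
    name, _, body = spec.partition(":")
    anchors = {}
    for item in body.split(","):
        tokens, _, rate = item.partition("=")
        anchors[int(tokens)] = float(rate)
    return name, anchors


def dropout_for_prefix(anchors, prefix_tokens):
    """Return the rate of the largest anchor <= prefix_tokens (assumption).

    Prefixes smaller than the first anchor reuse the first anchor's rate.
    """
    caps = sorted(anchors)
    chosen = caps[0]
    for cap in caps:
        if cap <= prefix_tokens:
            chosen = cap
    return anchors[chosen]
```

Parsing the frozen spec and printing the table before launching is a cheap check that the string passed to `--anchor-decays` matches the schedule that was reviewed.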
622
+ `<PROMOTED_COEFFICIENTS_JSON>` should point to the coefficient file selected by
623
+ the coefficient gate. In a clean first pass, this is usually:
624
+
625
+ ```text
626
+ runs/coefficient_calibration/<regime>_interaction/coefficients.json
627
+ ```
628
+
629
+ If optional refinement was needed and accepted, use the refined coefficient
630
+ file instead:
631
+
632
+ ```text
633
+ runs/coefficient_calibration/<regime>_interaction_refined/coefficients.json
634
+ ```
635
+
636
+ Decision rule:
637
+
638
+ ```text
639
+ freeze this anchor spec before streaming starts
640
+ do not edit the schedule after looking at streaming validation losses
641
+ ```
642
+
643
+ If the anchor schedule looks pathological before training, such as all values
644
+ clipping at `p_min` or `p_max`, inspect the coefficient fit and calibration
645
+ cells before launching streaming.
646
+
647
+ ### New Regime Step 6: Five-Seed Locked Streaming Validation
648
+
649
+ Run:
650
+
651
+ ```bash
652
+ .venv/bin/python scripts/run_experiments.py \
653
+ --mode locked_stream \
654
+ --use-cached-data \
655
+ --cache-dir .cache/dropout_decay_<regime> \
656
+ --output-dir runs/<regime>_<model>_streaming_validation_5seed \
657
+ --models <WINNER_MODEL_NAME=layersxheadsxdim> \
658
+ --seeds 1 2 3 4 5 \
659
+ --stream-token-caps <U1> <U2> <U3> <U4> <U5> \
660
+ --dropout-rates 0 0.02 0.04 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
661
+ --anchor-decays <FROZEN_ANCHOR_SPEC_FROM_STEP_5> \
662
+ --stage-steps <STAGE_STEPS> \
663
+ --batch-size <BATCH> \
664
+ --block-size <BLOCK> \
665
+ --eval-batches <EVAL_BATCHES> \
666
+ --train-eval-batches <TRAIN_EVAL_BATCHES> \
667
+ --trace-eval-batches <TRACE_EVAL_BATCHES> \
668
+ --log-every 250 \
669
+ --vocab-size <VOCAB_SIZE> \
670
+ --val-tokens <VAL_TOKENS> \
671
+ --lr <LR> \
672
+ --weight-decay <WEIGHT_DECAY> \
673
+ --grad-clip 1.0
674
+ ```
675
+
676
+ What this script run does:
677
+
678
+ `locked_stream` is the paper-grade test. It simulates a stream by increasing
679
+ the available prefix tokens over stages. For each seed, it trains:
680
+
681
+ | Condition type | Meaning |
682
+ |---|---|
683
+ | static dropout baselines | same dropout at every stream stage |
684
+ | anchor decay schedule | frozen coefficient-derived dropout at each stream stage |
685
+
686
+ The static baselines must be broad enough to make the comparison fair. The
687
+ claim is not that decay beats weak static choices; the claim is that it can beat
688
+ the best static dropout available in the tested grid.
689
+
690
+ Expected outputs under
691
+ `runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/`:
692
+
693
+ | File | Use |
694
+ |---|---|
695
+ | `metrics.jsonl` | raw row-level results for each condition, seed, and prefix |
696
+ | `summary.csv` / `summary.json` | aggregate condition and stage summaries |
697
+ | `trace.jsonl` | progress traces for diagnostic plotting |
698
+ | `config.json` | exact run configuration |
699
+ | `RESULT_SUMMARY.md` | built-in readable summary |
700
+
701
+ Primary evaluation metrics:
702
+
703
+ ```text
704
+ final validation loss at largest prefix
705
+ mean trajectory validation loss
706
+ stage-wise validation loss
707
+ paired seed delta versus the best static baseline
708
+ rank consistency across seeds
709
+ ```
710
+
711
+ Decision rule:
712
+
713
+ ```text
714
+ strong pass: decay has best mean final loss and beats best static in most or all
715
+ paired seeds
716
+
717
+ weak pass: decay ties best static while avoiding bad early/late static choices
718
+
719
+ fail: decay loses to a simple static baseline in most paired seeds or wins early
720
+ only by sacrificing final loss
721
+ ```
722
+
723
+ ### New Regime Step 7: Summarize Streaming Validation
724
+
725
+ Run:
726
+
727
+ ```bash
728
+ .venv/bin/python scripts/summarize_streaming_multiseed.py \
729
+ --metrics runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/metrics.jsonl \
730
+ --output-dir runs/<regime>_streaming_report/<model>_validation_5seed \
731
+ --report docs/<regime>_streaming_report.md \
732
+ --title "<Regime Name> Streaming Validation" \
733
+ --date <YYYY-MM-DD> \
734
+ --context "<regime/model/token/step description>" \
735
+ --conditions <regime>_interaction static_dropout_0.1 static_dropout_0.08 static_dropout_0.06 static_dropout_0.14 static_dropout_0.18 static_dropout_0.2 static_dropout_0.04 static_dropout_0.02 static_dropout_0 static_dropout_0.26 static_dropout_0.3
736
+ ```
737
+
738
+ What this script run does:
739
+
740
+ `scripts/summarize_streaming_multiseed.py` performs no training. It reads the
741
+ saved `metrics.jsonl` file and writes standardized artifacts comparable across
742
+ regimes.
743
+
744
+ Expected outputs:
745
+
746
+ | File | Use |
747
+ |---|---|
748
+ | `docs/<regime>_streaming_report.md` | human-readable regime report for paper discussion |
749
+ | `condition_summary.csv` | condition ranking by final validation loss |
750
+ | `stage_summary.csv` | stage-wise trajectory table |
751
+ | `paired_final_deltas.csv` | per-seed final-loss comparison against the best static baseline |
752
+
753
+ The most important table is `paired_final_deltas.csv`. A mean win is useful, but
754
+ paired seed wins are stronger because each comparison holds the seed fixed, which reduces initialization-bias concerns.
755
+
756
+ Decision rule:
757
+
758
+ ```text
759
+ if the decay schedule wins 5/5 paired seeds: promote regime to strong evidence
760
+ if it wins 3-4/5: inspect effect size, variance, and trajectory tradeoff
761
+ if it wins 0-2/5: treat as a failed regime or schedule and do not bury it
762
+ ```
763
+
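The paired comparison and the decision rule above can be sketched as follows. This is a minimal illustration, not the logic of `scripts/summarize_streaming_multiseed.py`: the input layout (`{condition: {seed: final_loss}}`), the function names, and the choice to compare against the best static baseline separately for each seed are assumptions. The thresholds assume exactly five seeds, as in the rule above.

```python
def paired_final_deltas(final_loss, decay_name):
    """Return {seed: (best_static_name, delta)}.

    delta = decay loss minus the best static loss for that seed,
    so a negative delta means the decay schedule won that seed.
    """
    statics = {c: v for c, v in final_loss.items()
               if c.startswith("static_dropout_")}
    deltas = {}
    for seed, decay_loss in final_loss[decay_name].items():
        best = min(statics, key=lambda c: statics[c][seed])
        deltas[seed] = (best, decay_loss - statics[best][seed])
    return deltas


def classify(deltas):
    """Apply the 5/5, 3-4/5, 0-2/5 decision rule to paired deltas."""
    wins = sum(1 for _, d in deltas.values() if d < 0)
    if wins == len(deltas):
        return "strong evidence"
    if wins >= 3:
        return "inspect effect size, variance, and trajectory tradeoff"
    return "failed regime or schedule"
```

Reporting the per-seed deltas alongside the label keeps a narrow 3/5 result from being read the same way as a clean 5/5.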
764
+ ### New Regime Step 8: Smoke Check And Commit
765
+
766
+ Run:
767
+
768
+ ```bash
769
+ .venv/bin/python -m py_compile \
770
+ scripts/run_experiments.py \
771
+ scripts/fit_dropout_coefficients.py \
772
+ scripts/make_streaming_anchors.py \
773
+ scripts/summarize_streaming_multiseed.py
774
+ ```
775
+
776
+ What this script run does:
777
+
778
+ This is a code integrity check. It does not validate the scientific result, but
779
+ it catches syntax errors (but not import or runtime errors) in the scripts required to reproduce the
780
+ regime.
781
+
782
+ After the smoke check, update this `docs/plan.md` ledger and commit:
783
+
784
+ ```text
785
+ docs/<regime>_streaming_report.md
786
+ runs/<regime>_streaming_report/<model>_validation_5seed/
787
+ runs/<regime>_<model>_streaming_validation_5seed/locked_stream/<TIMESTAMP>/
788
+ runs/coefficient_calibration/<regime>_interaction/
789
+ ```
790
+
791
+ Do not commit temporary checkpoints or external corpus files unless they are
792
+ small, intentionally versioned, and needed for reproducibility.
793
+
794
  ## Current Regime Ledger
795
 
796
  | Regime | Status | Role |
 
871
  | `smooth_low` | 4/5, with the one miss only `+0.0003` |
872
 
873
  The immediate risk is no longer seed count for TinyStories or OpenWebText10K.
874
+ The main remaining risk is external validity beyond the three tested text
875
+ regimes and robustness across controlled architecture or token-budget changes.
876
+ The current defensible claim is:
877
 
878
  ```text
879
  Formula-derived dropout schedules track the moving useful dropout region and
 
886
  Formula-derived dropout decay beats the best static dropout.
887
  ```
888
 
889
+ is supported at `n=5` in TinyStories, OpenWebText10K, and WikiText-103. The
890
+ strongest schedule in each of the three regimes beats the per-seed best static
891
+ baseline in all five seeds.
892
 
893
  Latest OpenWebText10K 5-seed streaming final-loss table:
894
 
 
903
  | static `0.02` | 4.5358 | 0.0091 |
904
  | static `0.00` | 4.5943 | 0.0216 |
905
 
906
+ OpenWebText10K condition provenance:
907
+
908
+ | Condition | Provenance | How to interpret it |
909
+ |---|---|---|
910
+ | `openwebtext10k_interaction` | coefficient-derived interaction schedule | main OpenWebText10K formula hypothesis test |
911
+ | `hold_30_then_decay` | heuristic schedule-search ablation | manually specified after exploratory single-seed OpenWebText10K schedule search; not generated from coefficients |
912
+ | `mild_30_to_08` | heuristic schedule-search ablation | manually specified after exploratory single-seed OpenWebText10K schedule search; not generated from coefficients |
913
+ | `fitted_l16_static_law` | older fitted/static-law schedule | retained as a comparison to the earlier aggressive fitted path |
914
+ | static conditions | fixed dropout baselines | same dropout at every stream prefix |
915
+
916
+ The heuristic OpenWebText10K schedules were chosen from failure analysis, not
917
+ from the final coefficient formula. The older `fitted_l16_static_law` path
918
+ started too high (`0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02`), while static
919
+ dropout `0.30` looked useful early but worse at the final 4M-token stage and
920
+ static dropout `0.14` was the strongest static endpoint. This motivated two
921
+ manual ablations:
922
+
923
+ ```text
924
+ hold_30_then_decay = 0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02
925
+ mild_30_to_08 = 0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08
926
+ ```
927
+
928
+ These ablations support the broader mechanism that stream-dependent dropout can
929
+ matter, but they should not be used as evidence that the coefficient formula
930
+ generated those exact schedules. The formula claim for OpenWebText10K should be
931
+ based on `openwebtext10k_interaction`.
932
+
933
  Paired final-loss result:
934
 
935
  | Decay schedule | Paired wins vs best static |
 
1010
 
1011
  ## Next Training After Current Gate
1012
 
1013
+ No MPS training should launch until the three completed five-seed streaming
1014
+ reports are read together. Since a third held-out text regime is no longer the
1015
+ limiting issue, use the next run only for a narrowed robustness test:
1016
 
1017
  ```text
1018
  completed: TinyStories 5-seed streaming report