Add previous local streaming report

Files changed (7)

docs/plan.md +18 -11
docs/previous_regime_streaming_report.md +152 -0
docs/streaming_multiseed_validation_report.md +7 -8
runs/previous_local_streaming_report/l16_multiseed_confirm/condition_summary.csv +8 -0
runs/previous_local_streaming_report/l16_multiseed_confirm/paired_final_deltas.csv +22 -0
runs/previous_local_streaming_report/l16_multiseed_confirm/stage_summary.csv +36 -0
scripts/summarize_streaming_multiseed.py +45 -12

docs/plan.md CHANGED

@@ -284,7 +284,7 @@ Use this order for every regime.
 | original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
 | TinyStories static/coefficient regime | active | main coefficient evidence |
 | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
-| original/local streaming regime | pending report | summarize saved streaming runs before launching any additional training |
 | next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
 ## Current Formula Status
@@ -333,6 +333,7 @@ structure transfers, while coefficients may be regime-specific.
 | TinyStories held-out prefix | supports pressure dependence on unique tokens |
 | TinyStories held-out model | supports pressure dependence on model size |
 | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
 | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
 Latest TinyStories 5-seed streaming final-loss table:
@@ -394,21 +395,25 @@ streaming multi-seed reports for each regime.
 ## Immediate Next Action
-Build the original/local streaming report from saved runs. Do not launch a
-broad new regime sweep until the previous/local report is reconciled against
-the TinyStories five-seed result.
 ## Next Training After Current Gate
-No MPS training should launch before the previous/local streaming report is
-generated from existing saved runs. If that report lacks enough coverage for a
-clean claim, the next MPS run should be narrowly scoped to only the missing
-conditions or seeds:
 ```text
-preferred first step: no training, saved-run report only
-possible follow-up: fill missing previous/local streaming cells
-avoid: broad new regime sweep before the report audit
 ```
 Evaluate with paired seed comparisons:
@@ -430,5 +435,7 @@ Latest streaming report:
 ```text
 docs/streaming_multiseed_validation_report.md
 runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
 ```

 | original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
 | TinyStories static/coefficient regime | active | main coefficient evidence |
 | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
+| original/local streaming regime | 3-seed saved-run report complete | previous/local decay schedules beat best static in 3/3 paired final-loss comparisons |
 | next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
 ## Current Formula Status
 | TinyStories held-out prefix | supports pressure dependence on unique tokens |
 | TinyStories held-out model | supports pressure dependence on model size |
 | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
+| previous/local streaming, 3 seeds | hold-30 decay has best mean final loss; top decay schedules beat best static in 3/3 paired comparisons |
 | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
 Latest TinyStories 5-seed streaming final-loss table:
 ## Immediate Next Action
+Reconcile the TinyStories five-seed report and previous/local three-seed
+report into the paper outline. Decide whether the previous/local regime needs a
+targeted seed-4/5 extension, or whether a third held-out regime is the better
+use of MPS time.
 ## Next Training After Current Gate
+No MPS training should launch until the two completed streaming reports are
+read together. If previous/local seed count is the limiting issue, the next run
+should be narrowly scoped to only the missing seed-4/5 previous/local
+conditions. If external validity is the limiting issue, use a third held-out
+regime instead:
 ```text
+completed: TinyStories 5-seed streaming report
+completed: previous/local 3-seed saved-run streaming report
+possible follow-up A: previous/local seed-4/5 extension
+possible follow-up B: third held-out regime
+avoid: broad new sweep before choosing A vs B
 ```
 Evaluate with paired seed comparisons:
 ```text
 docs/streaming_multiseed_validation_report.md
+docs/previous_regime_streaming_report.md
 runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
+runs/previous_local_streaming_report/l16_multiseed_confirm/
 ```

docs/previous_regime_streaming_report.md ADDED

	@@ -0,0 +1,152 @@

+# Previous/Local Regime Streaming Validation
+Date: 2026-05-30
+This report combines 3 random seeds (1, 2, 3) from saved streaming runs.
+No additional training is performed by this script; it reads saved
+`metrics.jsonl` files.
+Regime: original/local saved streaming setup with L16_H8_D384, 31,457,280 parameters, five prefixes from 250k to 4M tokens, and 1,000 optimizer steps per stage. This report uses the existing three-seed confirmation run only; earlier single-seed search/refinement runs are treated as exploratory support, not as the primary proof table.
+## Sources
+- `runs/stream_multiseed_confirm/locked_stream/20260526-203116/metrics.jsonl`
+## Condition Ranking By Final Loss
+| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
+|---|---|---:|---:|---:|---:|---:|---:|---|
+| `hold_30_then_decay` | `anchor_decay` | 3 | 4.8503 | 0.0017 | 4.4060 | 0.0118 | 0.3530 | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` |
+| `mild_30_to_08` | `anchor_decay` | 3 | 4.8504 | 0.0018 | 4.4075 | 0.0078 | 0.3307 | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` |
+| `fitted_l16_static_law` | `anchor_decay` | 3 | 4.9527 | 0.0052 | 4.4159 | 0.0042 | 0.3144 | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` |
+| `static_dropout_0.14` | `static` | 3 | 4.9043 | 0.0119 | 4.4459 | 0.0128 | 0.3205 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` |
+| `static_dropout_0.3` | `static` | 3 | 4.8764 | 0.0014 | 4.4693 | 0.0081 | 0.2327 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` |
+| `static_dropout_0.02` | `static` | 3 | 5.1544 | 0.0091 | 4.5405 | 0.0061 | 0.4747 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` |
+| `static_dropout_0` | `static` | 3 | 5.2422 | 0.0015 | 4.5905 | 0.0192 | 0.5464 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` |
+## Paired Final-Loss Deltas
+Negative `delta_vs_best_static` means the condition beat the best static
+baseline for that seed.
+| Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
+|---:|---|---:|---|---:|---:|
+| 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 |
+| 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 |
+| 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 |
+| 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 |
+| 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 |
+| 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.0985 |
+| 1 | `static_dropout_0` | 4.5703 | `static_dropout_0.14` | 4.4418 | +0.1286 |
+| 2 | `hold_30_then_decay` | 4.4068 | `static_dropout_0.14` | 4.4603 | -0.0535 |
+| 2 | `mild_30_to_08` | 4.4080 | `static_dropout_0.14` | 4.4603 | -0.0523 |
+| 2 | `fitted_l16_static_law` | 4.4136 | `static_dropout_0.14` | 4.4603 | -0.0467 |
+| 2 | `static_dropout_0.14` | 4.4603 | `static_dropout_0.14` | 4.4603 | +0.0000 |
+| 2 | `static_dropout_0.3` | 4.4719 | `static_dropout_0.14` | 4.4603 | +0.0116 |
+| 2 | `static_dropout_0.02` | 4.5466 | `static_dropout_0.14` | 4.4603 | +0.0863 |
+| 2 | `static_dropout_0` | 4.6085 | `static_dropout_0.14` | 4.4603 | +0.1482 |
+| 3 | `hold_30_then_decay` | 4.4174 | `static_dropout_0.14` | 4.4357 | -0.0183 |
+| 3 | `mild_30_to_08` | 4.4151 | `static_dropout_0.14` | 4.4357 | -0.0206 |
+| 3 | `fitted_l16_static_law` | 4.4134 | `static_dropout_0.14` | 4.4357 | -0.0223 |
+| 3 | `static_dropout_0.14` | 4.4357 | `static_dropout_0.14` | 4.4357 | +0.0000 |
+| 3 | `static_dropout_0.3` | 4.4758 | `static_dropout_0.14` | 4.4357 | +0.0401 |
+| 3 | `static_dropout_0.02` | 4.5345 | `static_dropout_0.14` | 4.4357 | +0.0988 |
+| 3 | `static_dropout_0` | 4.5926 | `static_dropout_0.14` | 4.4357 | +0.1569 |
+## Stage Trajectory
+| Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
+|---:|---:|---|---:|---:|---:|---:|---:|---:|
+| 0 | 250,000 | `mild_30_to_08` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
+| 0 | 250,000 | `static_dropout_0.3` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
+| 0 | 250,000 | `hold_30_then_decay` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
+| 0 | 250,000 | `static_dropout_0.14` | 0.140 | 3 | 5.4707 | 0.0281 | 4.0325 | 1.4383 |
+| 0 | 250,000 | `static_dropout_0.02` | 0.020 | 3 | 5.7452 | 0.0319 | 3.5394 | 2.2057 |
+| 0 | 250,000 | `fitted_l16_static_law` | 0.600 | 3 | 5.7847 | 0.0108 | 5.1677 | 0.6170 |
+| 0 | 250,000 | `static_dropout_0` | 0.000 | 3 | 5.8283 | 0.0158 | 3.4498 | 2.3785 |
+| 1 | 500,000 | `mild_30_to_08` | 0.240 | 3 | 5.0573 | 0.0197 | 4.0209 | 1.0364 |
+| 1 | 500,000 | `static_dropout_0.3` | 0.300 | 3 | 5.0643 | 0.0216 | 4.1251 | 0.9392 |
+| 1 | 500,000 | `hold_30_then_decay` | 0.300 | 3 | 5.0643 | 0.0216 | 4.1251 | 0.9392 |
+| 1 | 500,000 | `fitted_l16_static_law` | 0.400 | 3 | 5.1479 | 0.0127 | 4.4501 | 0.6978 |
+| 1 | 500,000 | `static_dropout_0.14` | 0.140 | 3 | 5.1493 | 0.0097 | 3.7036 | 1.4457 |
+| 1 | 500,000 | `static_dropout_0.02` | 0.020 | 3 | 5.5605 | 0.0148 | 3.1103 | 2.4502 |
+| 1 | 500,000 | `static_dropout_0` | 0.000 | 3 | 5.6920 | 0.0452 | 2.9511 | 2.7409 |
+| 2 | 1,000,000 | `hold_30_then_decay` | 0.200 | 3 | 4.7695 | 0.0164 | 4.0408 | 0.7287 |
+| 2 | 1,000,000 | `mild_30_to_08` | 0.180 | 3 | 4.7717 | 0.0162 | 3.9925 | 0.7793 |
+| 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.7927 | 0.0173 | 4.1535 | 0.6392 |
+| 2 | 1,000,000 | `fitted_l16_static_law` | 0.300 | 3 | 4.8273 | 0.0096 | 4.2699 | 0.5573 |
+| 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.8466 | 0.0278 | 3.8815 | 0.9651 |
+| 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 3 | 5.1459 | 0.0294 | 3.4641 | 1.6818 |
+| 2 | 1,000,000 | `static_dropout_0` | 0.000 | 3 | 5.2484 | 0.0091 | 3.3281 | 1.9203 |
+| 3 | 2,000,000 | `hold_30_then_decay` | 0.100 | 3 | 4.5655 | 0.0060 | 4.0390 | 0.5265 |
+| 3 | 2,000,000 | `mild_30_to_08` | 0.120 | 3 | 4.5691 | 0.0072 | 4.0380 | 0.5312 |
+| 3 | 2,000,000 | `fitted_l16_static_law` | 0.140 | 3 | 4.5879 | 0.0086 | 4.1457 | 0.4422 |
+| 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.6088 | 0.0069 | 4.0454 | 0.5634 |
+| 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.6094 | 0.0059 | 4.2125 | 0.3968 |
+| 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 3 | 4.7799 | 0.0219 | 3.8344 | 0.9455 |
+| 3 | 2,000,000 | `static_dropout_0` | 0.000 | 3 | 4.8517 | 0.0153 | 3.7761 | 1.0757 |
+| 4 | 4,000,000 | `hold_30_then_decay` | 0.020 | 3 | 4.4060 | 0.0118 | 4.0530 | 0.3530 |
+| 4 | 4,000,000 | `mild_30_to_08` | 0.080 | 3 | 4.4075 | 0.0078 | 4.0768 | 0.3307 |
+| 4 | 4,000,000 | `fitted_l16_static_law` | 0.020 | 3 | 4.4159 | 0.0042 | 4.1015 | 0.3144 |
+| 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.4459 | 0.0128 | 4.1254 | 0.3205 |
+| 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.4693 | 0.0081 | 4.2365 | 0.2327 |
+| 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 3 | 4.5405 | 0.0061 | 4.0657 | 0.4747 |
+| 4 | 4,000,000 | `static_dropout_0` | 0.000 | 3 | 4.5905 | 0.0192 | 4.0441 | 0.5464 |
+## Interpretation
+- `hold_30_then_decay` has the best 3-seed mean final validation loss: 4.4060 +/- 0.0118.
+- The second-best final condition is `mild_30_to_08` at 4.4075 +/- 0.0078.
+- The best static baseline by mean final loss is `static_dropout_0.14` at 4.4459 +/- 0.0128.
+- `hold_30_then_decay` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0183.
+- `mild_30_to_08` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0206.
+- `fitted_l16_static_law` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0211.
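+- The best first-stage condition is `mild_30_to_08` at prefix 250,000 with mean validation loss 5.4463, effectively tied with `hold_30_then_decay` and `static_dropout_0.3`, which use the same stage-0 dropout of 0.30; compare this with the final ranking before claiming a schedule is uniformly better.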
+- This is a saved-run streaming validation artifact. Treat it as strong
+  evidence only when the tested conditions, seeds, static baselines, and
+  stream protocol match the claim being made.
+## Supporting Exploratory Runs
+The primary proof table above is the three-seed confirmation run:
+```text
+runs/stream_multiseed_confirm/locked_stream/20260526-203116/
+```
+Earlier single-seed runs are useful for interpreting how the schedule was
+selected, but they are not counted as multi-seed proof:
+| Supporting run | Role | Main reading |
+|---|---|---|
+| `runs/stream_schedule_search/locked_stream/20260526-171537/` | schedule search | decay schedules starting near `0.30` and ending near `0.02` to `0.08` beat static `0.14` and `0.30` at the final 4M prefix |
+| `runs/stream_schedule_refinement/locked_stream/20260526-184506/` | endpoint and curvature refinement | several `hold_30` variants ended tightly around `4.394`, while `hold_24_then_decay` was weaker at `4.4214`, suggesting the initial dropout should not be reduced too aggressively in this regime |
+| `runs/formula_l16_exact_multiseed/locked_stream/20260527-123806/` | coefficient-derived schedule check | `pressure_formula_l16_floor02` reached `4.4059 +/- 0.0042` over three seeds versus static `0.14` at `4.4459 +/- 0.0128` |
+## Research Reading
+This previous/local regime supports the same qualitative claim as the
+TinyStories five-seed validation: a static dropout that is reasonable at one
+stream scale is not necessarily optimal as the data prefix grows. In this
+regime, the useful path keeps dropout high early (`0.30`) and then lowers it
+as unique tokens and sampled tokens increase.
+The strongest previous/local evidence is:
+| Claim | Evidence |
+|---|---|
+| decay beats best static final loss | `hold_30_then_decay` beats the per-seed best static baseline in `3/3` seeds |
+| endpoint is not uniquely fixed | `mild_30_to_08` is nearly tied with `hold_30_then_decay` |
+| too-low early dropout is harmful | static `0.02` and `0.00` are much worse throughout the stream |
+| too-high static dropout underuses later data | static `0.30` wins no final paired comparison despite being strong early |
+| coefficient-derived schedules are viable | `fitted_l16_static_law` and `pressure_formula_l16_floor02` both beat static `0.14` in the saved three-seed comparisons |
+Limitations:
+1. This report uses three seeds (`n=3`), not the five (`n=5`) in the TinyStories validation.
+2. The schedules were refined inside this local regime, so this is not a
+   clean held-out-regime proof of universal coefficients.
+3. Despite these limits, the report still supports the cross-regime mechanism
+   because the direction of the effect matches TinyStories: sufficiently high
+   initial regularization prevents early overfitting, and lowering dropout
+   later improves final validation loss relative to holding one static value
+   fixed.
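
As a reading aid for the schedules discussed in the report, here is a minimal sketch of a piecewise-constant dropout lookup by prefix size. The prefix sizes and dropout paths are copied from the report's tables. The `dropout_for_prefix` helper and its rule (use the latest stage whose prefix does not exceed the current prefix) are assumptions for illustration, since the training loop that applies these paths is not part of this diff.

```python
from bisect import bisect_right

# Prefix sizes (tokens) for stages 0..4, from the Stage Trajectory table.
PREFIX_TOKENS = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]

# Dropout paths copied from the Condition Ranking table.
DROPOUT_PATHS = {
    "hold_30_then_decay": [0.30, 0.30, 0.20, 0.10, 0.02],
    "mild_30_to_08": [0.30, 0.24, 0.18, 0.12, 0.08],
    "static_dropout_0.14": [0.14] * 5,
}


def dropout_for_prefix(condition: str, prefix_tokens: int) -> float:
    """Hypothetical helper: return the dropout for the latest stage whose
    prefix size is <= prefix_tokens (stage 0 if below the first prefix)."""
    stage = bisect_right(PREFIX_TOKENS, prefix_tokens) - 1
    stage = max(stage, 0)
    return DROPOUT_PATHS[condition][stage]


if __name__ == "__main__":
    for tokens in PREFIX_TOKENS:
        print(tokens, dropout_for_prefix("hold_30_then_decay", tokens))
```

Keeping the paths as plain lists makes it easy to compare a decay schedule against a static baseline stage by stage.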

docs/streaming_multiseed_validation_report.md CHANGED

@@ -6,6 +6,8 @@ This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
 No additional training is performed by this script; it reads saved
 `metrics.jsonl` files.
 ## Sources
 - `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
@@ -93,15 +95,12 @@ baseline for that seed.
 ## Interpretation
 - `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213.
 - The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.
-- `smooth_low` is very close to `interaction`, suggesting the exact anchor
-  values may not be uniquely required as long as the schedule follows the
-  same pressure range.
 - `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
 - `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
 - `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
-- Static `0.12` can win early stages, but holding it fixed loses at the
-  final 4M stage.
-- This is now the TinyStories paper-grade validation gate for this narrowed
-  setup: five seeds, paired seed comparisons, and static baselines selected
-  from the same stream protocol.

 No additional training is performed by this script; it reads saved
 `metrics.jsonl` files.
+Regime: TinyStories BPE streaming validation with L12_H8_D320, 17,367,040 parameters, four prefixes from 500k to 4M tokens, and 2,000 optimizer steps per stage.
 ## Sources
 - `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
 ## Interpretation
 - `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213.
+- The second-best final condition is `smooth_low` at 2.5321 +/- 0.0203.
 - The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.
 - `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
 - `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
 - `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
+- The best first-stage condition is `static_dropout_0.12` at prefix 500,000 with mean validation loss 3.2226; compare this with the final ranking before claiming a schedule is uniformly better.
+- This is a saved-run streaming validation artifact. Treat it as strong
+  evidence only when the tested conditions, seeds, static baselines, and
+  stream protocol match the claim being made.

runs/previous_local_streaming_report/l16_multiseed_confirm/condition_summary.csv ADDED

	@@ -0,0 +1,8 @@

+condition,kind,n,mean_trajectory_val,std_trajectory_val,mean_final_val,std_final_val,mean_final_gap,std_final_gap,dropout_path
+hold_30_then_decay,anchor_decay,3,4.85031243065993,0.001654499467210519,4.406020574271679,0.011754833839775457,0.3529989644885063,0.015994346590220903,0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02
+mild_30_to_08,anchor_decay,3,4.8503826280434925,0.0017992301945451007,4.407524392008781,0.007800724825508926,0.33071232338746387,0.012432927596429534,0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08
+fitted_l16_static_law,anchor_decay,3,4.952731303373973,0.0052436768034393386,4.415892139077187,0.004158092833266487,0.31436355660359067,0.010011530967558104,0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02
+static_dropout_0.14,static,3,4.904260951777299,0.011925494764473046,4.445927885671456,0.01281389472329066,0.32051239907741547,0.005114602595178522,0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14
+static_dropout_0.3,static,3,4.876375656326612,0.0014325425566632558,4.469294945398967,0.008116710580691494,0.23274652659893036,0.009938004189833373,0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30
+static_dropout_0.02,static,3,5.154374482234319,0.009061613413273495,4.540457583963871,0.006064728711460762,0.47473999857902527,0.005772200910659103,0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02
+static_dropout_0,static,3,5.242185181876024,0.00150415496846462,4.59050024797519,0.01917389367658262,0.5464497481783231,0.027464276868201288,0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00
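
The `+/-` values in the summary can be sanity-checked from the per-seed final losses. The sketch below recomputes the mean and spread for `hold_30_then_decay` from the three values in `paired_final_deltas.csv`. It assumes the summary uses the sample standard deviation (divisor `n - 1`); the script's own aggregation code is not shown in this diff, so that is an inference from the numbers rather than a confirmed detail.

```python
import statistics

# Per-seed final validation losses for hold_30_then_decay (seeds 1, 2, 3),
# copied from paired_final_deltas.csv in this PR.
hold_30_final = [4.393905431032181, 4.406777806580067, 4.417378485202789]

mean_final = statistics.mean(hold_30_final)
std_final = statistics.stdev(hold_30_final)    # sample std, divisor n - 1
pstd_final = statistics.pstdev(hold_30_final)  # population std, divisor n

print(f"mean={mean_final:.4f} sample_std={std_final:.4f} population_std={pstd_final:.4f}")
```

With these inputs the sample standard deviation agrees with the reported `std_final_val` for this condition, while the population version comes out smaller. At `n=3` the two differ noticeably, so it is worth stating which one a paper table uses.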

runs/previous_local_streaming_report/l16_multiseed_confirm/paired_final_deltas.csv ADDED

	@@ -0,0 +1,22 @@

+seed,condition,final_val,best_static_condition,best_static_final_val,delta_vs_best_static
+1,hold_30_then_decay,4.393905431032181,static_dropout_0.14,4.441768206655979,-0.04786277562379837
+1,mild_30_to_08,4.399484351277351,static_dropout_0.14,4.441768206655979,-0.04228385537862778
+1,fitted_l16_static_law,4.420692302286625,static_dropout_0.14,4.441768206655979,-0.021075904369354248
+1,static_dropout_0.14,4.441768206655979,static_dropout_0.14,4.441768206655979,0.0
+1,static_dropout_0.3,4.460189968347549,static_dropout_0.14,4.441768206655979,0.018421761691570282
+1,static_dropout_0.02,4.540249735116959,static_dropout_0.14,4.441768206655979,0.09848152846097946
+1,static_dropout_0,4.5703444480896,static_dropout_0.14,4.441768206655979,0.12857624143362045
+2,hold_30_then_decay,4.406777806580067,static_dropout_0.14,4.460304826498032,-0.053527019917964935
+2,mild_30_to_08,4.408027365803719,static_dropout_0.14,4.460304826498032,-0.05227746069431305
+2,fitted_l16_static_law,4.41358458250761,static_dropout_0.14,4.460304826498032,-0.046720243990421295
+2,static_dropout_0.14,4.460304826498032,static_dropout_0.14,4.460304826498032,0.0
+2,static_dropout_0.3,4.47192245721817,static_dropout_0.14,4.460304826498032,0.01161763072013855
+2,static_dropout_0.02,4.546623565256596,static_dropout_0.14,4.460304826498032,0.086318738758564
+2,static_dropout_0,4.608511999249458,static_dropout_0.14,4.460304826498032,0.1482071727514267
+3,hold_30_then_decay,4.417378485202789,static_dropout_0.14,4.435710623860359,-0.018332138657569885
+3,mild_30_to_08,4.415061458945274,static_dropout_0.14,4.435710623860359,-0.02064916491508484
+3,fitted_l16_static_law,4.4133995324373245,static_dropout_0.14,4.435710623860359,-0.022311091423034668
+3,static_dropout_0.14,4.435710623860359,static_dropout_0.14,4.435710623860359,0.0
+3,static_dropout_0.3,4.47577241063118,static_dropout_0.14,4.435710623860359,0.04006178677082062
+3,static_dropout_0.02,4.534499451518059,static_dropout_0.14,4.435710623860359,0.09878882765769958
+3,static_dropout_0,4.5926442965865135,static_dropout_0.14,4.435710623860359,0.15693367272615433
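
For readers checking the paired table, the following sketch shows one way the per-seed deltas could be derived: pick the best static condition within each seed, then subtract its final loss from every condition in that seed. The `paired_deltas` function is a hypothetical reconstruction, not the script's actual code. The `static_` prefix test mirrors the one visible in `write_report`, and the input values are the rounded numbers from the report table.

```python
from collections import defaultdict


def paired_deltas(rows):
    """Hypothetical helper. rows: iterable of (seed, condition, final_val).

    Returns (seed, condition, best_static_condition, delta) tuples, where
    delta = final_val - best static final_val within the same seed.
    """
    by_seed = defaultdict(dict)
    for seed, condition, final_val in rows:
        by_seed[seed][condition] = final_val

    out = []
    for seed in sorted(by_seed):
        vals = by_seed[seed]
        statics = {c: v for c, v in vals.items() if c.startswith("static_")}
        best_static = min(statics, key=statics.get)
        for condition, final_val in vals.items():
            out.append((seed, condition, best_static, final_val - statics[best_static]))
    return out


# Rounded values from the Paired Final-Loss Deltas table (subset of conditions).
rows = [
    (1, "hold_30_then_decay", 4.3939), (1, "static_dropout_0.14", 4.4418), (1, "static_dropout_0.3", 4.4602),
    (2, "hold_30_then_decay", 4.4068), (2, "static_dropout_0.14", 4.4603), (2, "static_dropout_0.3", 4.4719),
    (3, "hold_30_then_decay", 4.4174), (3, "static_dropout_0.14", 4.4357), (3, "static_dropout_0.3", 4.4758),
]

deltas = paired_deltas(rows)
hold = [d for _, cond, _, d in deltas if cond == "hold_30_then_decay"]
wins = sum(d < 0 for d in hold)   # seeds where the schedule beats best static
worst = max(hold)                 # least favorable (closest to zero or positive) delta
print(f"wins={wins}/{len(hold)} worst_delta={worst:+.4f}")
```

Selecting the best static baseline per seed, rather than once on the mean, is the stricter comparison: the schedule has to beat whichever static value happened to do best on that seed.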

runs/previous_local_streaming_report/l16_multiseed_confirm/stage_summary.csv ADDED

	@@ -0,0 +1,36 @@

+condition,stage,token_limit,dropout,n,mean_val,std_val,mean_train,std_train,mean_gap,std_gap
+hold_30_then_decay,0,250000,0.3,3,5.446260958909988,0.019125165089046693,4.445705617467563,0.034358782454698235,1.000555341442426,0.024155559429602118
+hold_30_then_decay,1,500000,0.3,3,5.064259976148605,0.021637732528368203,4.125050333638986,0.04522623234945776,0.9392096425096194,0.05214362791131757
+hold_30_then_decay,2,1000000,0.20000000000000004,3,4.769520824154218,0.01635006458114404,4.040781140327454,0.030883572207078522,0.7287396838267645,0.01586184848529449
+hold_30_then_decay,3,2000000,0.10000000000000002,3,4.565499819815159,0.005976341537445996,4.039049598077933,0.013075486007728024,0.5264502217372259,0.018983448452741372
+hold_30_then_decay,4,4000000,0.02,3,4.406020574271679,0.011754833839775457,4.053021609783173,0.005869636087162683,0.3529989644885063,0.015994346590220903
+mild_30_to_08,0,250000,0.3,3,5.4462608098983765,0.019125068401667725,4.445705557862918,0.034358764649810906,1.0005552520354588,0.02415560068535275
+mild_30_to_08,1,500000,0.24,3,5.0572623411814375,0.019719838039851723,4.020891976853211,0.04693412607863078,1.0363703643282254,0.05164875161403775
+mild_30_to_08,2,1000000,0.18000000000000002,3,4.771725835899512,0.016156229486385284,3.992456587652365,0.03048971998308727,0.7792692482471466,0.015220576762990419
+mild_30_to_08,3,2000000,0.12,3,4.569139761229356,0.0072146469245784745,4.03795708467563,0.013373284097680214,0.5311826765537262,0.020578724869085317
+mild_30_to_08,4,4000000,0.08,3,4.407524392008781,0.007800724825508926,4.076812068621318,0.005051302799236397,0.33071232338746387,0.012432927596429534
+fitted_l16_static_law,0,250000,0.6,3,5.784741741915544,0.010830950668144494,5.167703871925672,0.03352670524735539,0.617037869989872,0.02282459187460017
+fitted_l16_static_law,1,500000,0.4000000000000001,3,5.147908595701058,0.012709997007779845,4.450103844205539,0.031679739247628576,0.6978047514955202,0.027979523555407795
+fitted_l16_static_law,2,1000000,0.3,3,4.827251675228278,0.009557452344926845,4.269914664328098,0.02965806771390336,0.5573370109001795,0.020671084306813296
+fitted_l16_static_law,3,2000000,0.14,3,4.587862364947796,0.00860798597712663,4.145651715497176,0.020112725612223863,0.44221064945062,0.02244568312577686
+fitted_l16_static_law,4,4000000,0.02,3,4.415892139077187,0.004158092833266487,4.101528582473596,0.012667297837999282,0.31436355660359067,0.010011530967558104
+static_dropout_0.14,0,250000,0.14,3,5.470722645521164,0.02808420889994966,4.032453775405884,0.02467190243959671,1.4382688701152802,0.03724958360607134
+static_dropout_0.14,1,500000,0.14,3,5.149252874155839,0.009721292660607455,3.7035952607790628,0.03283222913980165,1.4456576133767765,0.03202443758399359
+static_dropout_0.14,2,1000000,0.14,3,4.846569702029228,0.027835384007538343,3.8814875607689223,0.03663314954997834,0.965082141260306,0.022778798858926228
+static_dropout_0.14,3,2000000,0.14,3,4.608831651508808,0.006911091820519436,4.045392190416654,0.02868106552021608,0.5634394610921541,0.02219500109725294
+static_dropout_0.14,4,4000000,0.14,3,4.445927885671456,0.01281389472329066,4.125415486594041,0.008814570521429348,0.32051239907741547,0.005114602595178522
+static_dropout_0.3,0,250000,0.3,3,5.446260929107666,0.019125310935332103,4.4457056671381,0.034358902052248425,1.0005552619695663,0.024155461141722852
+static_dropout_0.3,1,500000,0.3,3,5.064259819686413,0.021637796833646396,4.125050216913223,0.04522603378014792,0.9392096027731895,0.05214326197407805
+static_dropout_0.3,2,1000000,0.3,3,4.7926972309748335,0.017311084846289997,4.1534921278556185,0.030420766811043106,0.6392051031192144,0.014300245473945872
+static_dropout_0.3,3,2000000,0.3,3,4.60936535646518,0.005911969954977541,4.212546601891518,0.011776857890288749,0.39681875457366306,0.017503466982826142
+static_dropout_0.3,4,4000000,0.3,3,4.469294945398967,0.008116710580691494,4.236548418800036,0.003898783020435021,0.23274652659893036,0.009938004189833373
+static_dropout_0.02,0,250000,0.02,3,5.745150377353032,0.03186573719752147,3.5394225865602493,0.01043658554282977,2.2057277907927832,0.04182545755928344
+static_dropout_0.02,1,500000,0.02,3,5.56048562626044,0.014806452057977726,3.110301854709784,0.02923601075152012,2.4501837715506554,0.04238209743234919
+static_dropout_0.02,2,1000000,0.02,3,5.145867633322875,0.02939265744253425,3.4641073818008103,0.04880776844576436,1.6817602515220642,0.04157387316099468
+static_dropout_0.02,3,2000000,0.02,3,4.779911190271378,0.021934652780817913,3.834411238630613,0.04218285619197449,0.9454999516407648,0.027020864935500585
+static_dropout_0.02,4,4000000,0.02,3,4.540457583963871,0.006064728711460762,4.065717585384846,0.007518099672343167,0.47473999857902527,0.005772200910659103
+static_dropout_0,0,250000,0.0,3,5.828276579578717,0.01576268602511141,3.4497722735007605,0.023904614106222872,2.378504306077957,0.03224386346161905
+static_dropout_0,1,500000,0.0,3,5.6920087188482285,0.04522491867870689,2.951105666657289,0.05869100244970825,2.7409030521909394,0.08369333869665757
+static_dropout_0,2,1000000,0.0,3,5.24841162810723,0.009065582711747499,3.3280878563721976,0.041678604603156234,1.9203237717350323,0.03261336000414208
+static_dropout_0,3,2000000,0.0,3,4.851728734870751,0.015303064070863357,3.7760739227135978,0.03677045479243139,1.075654812157154,0.029026197035937777
+static_dropout_0,4,4000000,0.0,3,4.59050024797519,0.01917389367658262,4.044050499796867,0.023224885505203383,0.5464497481783231,0.027464276868201288
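
One detail in the stage-0 rows above: the three conditions that start at dropout 0.30 have mean validation losses that differ only around the seventh decimal place, so a plain `min` picks a winner on floating-point noise. The sketch below reports all conditions within a tolerance of the best value instead. The `best_with_ties` helper and the `1e-6` tolerance are assumptions for illustration and are not part of the script in this PR.

```python
def best_with_ties(rows, key="mean_val", tol=1e-6):
    """Hypothetical helper: return the sorted names of all conditions whose
    `key` value is within `tol` of the minimum."""
    best = min(row[key] for row in rows)
    return sorted(row["condition"] for row in rows if row[key] - best <= tol)


# Stage-0 rows copied from stage_summary.csv (condition and mean_val only).
stage0 = [
    {"condition": "hold_30_then_decay", "mean_val": 5.446260958909988},
    {"condition": "mild_30_to_08", "mean_val": 5.4462608098983765},
    {"condition": "static_dropout_0.3", "mean_val": 5.446260929107666},
    {"condition": "static_dropout_0.14", "mean_val": 5.470722645521164},
    {"condition": "fitted_l16_static_law", "mean_val": 5.784741741915544},
]

tied = best_with_ties(stage0)
print(tied)
```

Reporting the tied set avoids implying that one 0.30-start schedule is better than another at a stage where they are configured identically.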

scripts/summarize_streaming_multiseed.py CHANGED

@@ -191,12 +191,18 @@ def write_report(
     stage_rows: list[dict],
     paired_rows: list[dict],
     metrics_paths: list[Path],
 ) -> None:
     seed_ids = sorted({int(row["seed"]) for row in paired_rows})
     seed_count = len(seed_ids)
     best_row = condition_rows[0]
     static_rows = [row for row in condition_rows if row["condition"].startswith("static_")]
     best_static_row = min(static_rows, key=lambda row: row["mean_final_val"])
     paired_win_lines = []
     for row in condition_rows:
@@ -219,18 +225,25 @@ def write_report(
         )
     lines = [
-        "# TinyStories Multi-Seed Streaming Validation",
         "",
-        "Date: 2026-05-30",
         "",
         f"This report combines {seed_count} random seeds "
         f"({', '.join(str(seed) for seed in seed_ids)}) from saved streaming runs.",
         "No additional training is performed by this script; it reads saved",
         "`metrics.jsonl` files.",
         "",
         "## Sources",
         "",
-    ]
     for path_item in metrics_paths:
         lines.append(f"- `{path_item}`")
@@ -294,19 +307,27 @@ def write_report(
             f"- `{best_row['condition']}` has the best {seed_count}-seed mean final "
             f"validation loss: {fmt(best_row['mean_final_val'])} +/- "
             f"{fmt(best_row['std_final_val'])}.",
             f"- The best static baseline by mean final loss is "
             f"`{best_static_row['condition']}` at "
             f"{fmt(best_static_row['mean_final_val'])} +/- "
             f"{fmt(best_static_row['std_final_val'])}.",
-            "- `smooth_low` is very close to `interaction`, suggesting the exact anchor",
-            "  values may not be uniquely required as long as the schedule follows the",
-            "  same pressure range.",
             *paired_win_lines,
-            "- Static `0.12` can win early stages, but holding it fixed loses at the",
-            "  final 4M stage.",
-            "- This is now the TinyStories paper-grade validation gate for this narrowed",
-            "  setup: five seeds, paired seed comparisons, and static baselines selected",
-            "  from the same stream protocol.",
         ]
     )
     path.write_text("\n".join(lines) + "\n", encoding="utf-8")
@@ -318,6 +339,9 @@ def build_parser() -> argparse.ArgumentParser:
     parser.add_argument("--output-dir", type=Path, required=True)
     parser.add_argument("--report", type=Path, required=True)
     parser.add_argument("--conditions", nargs="+", default=DEFAULT_CONDITIONS)
     return parser
@@ -371,7 +395,16 @@ def main() -> None:
             "delta_vs_best_static",
         ],
     )
-    write_report(args.report, condition_rows, stage_rows, paired_rows, args.metrics)
     print(
         json.dumps(
             {

     stage_rows: list[dict],
     paired_rows: list[dict],
     metrics_paths: list[Path],
+    title: str,
+    date: str,
+    context: str,
 ) -> None:
     seed_ids = sorted({int(row["seed"]) for row in paired_rows})
     seed_count = len(seed_ids)
     best_row = condition_rows[0]
+    second_row = condition_rows[1] if len(condition_rows) > 1 else None
     static_rows = [row for row in condition_rows if row["condition"].startswith("static_")]
     best_static_row = min(static_rows, key=lambda row: row["mean_final_val"])
+    first_stage_rows = [row for row in stage_rows if int(row["stage"]) == 0]
+    best_first_stage = min(first_stage_rows, key=lambda row: row["mean_val"])
     paired_win_lines = []
     for row in condition_rows:
         )
     lines = [
+        f"# {title}",
         "",
+        f"Date: {date}",
         "",
         f"This report combines {seed_count} random seeds "
         f"({', '.join(str(seed) for seed in seed_ids)}) from saved streaming runs.",
         "No additional training is performed by this script; it reads saved",
         "`metrics.jsonl` files.",
         "",
+    ]
+    if context:
+        lines.extend([context, ""])
+    lines.extend(
+        [
         "## Sources",
         "",
+        ]
+    )
     for path_item in metrics_paths:
         lines.append(f"- `{path_item}`")
             f"- `{best_row['condition']}` has the best {seed_count}-seed mean final "
             f"validation loss: {fmt(best_row['mean_final_val'])} +/- "
             f"{fmt(best_row['std_final_val'])}.",
+            *(
+                [
+                    f"- The second-best final condition is `{second_row['condition']}` at "
+                    f"{fmt(second_row['mean_final_val'])} +/- "
+                    f"{fmt(second_row['std_final_val'])}."
+                ]
+                if second_row is not None
+                else []
+            ),
             f"- The best static baseline by mean final loss is "
             f"`{best_static_row['condition']}` at "
             f"{fmt(best_static_row['mean_final_val'])} +/- "
             f"{fmt(best_static_row['std_final_val'])}.",
             *paired_win_lines,
+            f"- The best first-stage condition is `{best_first_stage['condition']}` "
+            f"at prefix {best_first_stage['token_limit']:,} with mean validation "
+            f"loss {fmt(best_first_stage['mean_val'])}; compare this with the final "
+            "ranking before claiming a schedule is uniformly better.",
+            "- This is a saved-run streaming validation artifact. Treat it as strong",
+            "  evidence only when the tested conditions, seeds, static baselines, and",
+            "  stream protocol match the claim being made.",
         ]
     )
     path.write_text("\n".join(lines) + "\n", encoding="utf-8")
     parser.add_argument("--output-dir", type=Path, required=True)
     parser.add_argument("--report", type=Path, required=True)
     parser.add_argument("--conditions", nargs="+", default=DEFAULT_CONDITIONS)
+    parser.add_argument("--title", default="TinyStories Multi-Seed Streaming Validation")
+    parser.add_argument("--date", default="2026-05-30")
+    parser.add_argument("--context", default="")
     return parser
             "delta_vs_best_static",
         ],
     )
+    write_report(
+        args.report,
+        condition_rows,
+        stage_rows,
+        paired_rows,
+        args.metrics,
+        args.title,
+        args.date,
+        args.context,
+    )
     print(
         json.dumps(
             {
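
The parameterized header introduced by this change can be sketched in isolation as follows. This is a simplified illustration and not the script itself: `header_lines` is a hypothetical helper that omits the fixed intro sentences, and defaulting `--date` to today's date is a suggested alternative to the hard-coded default in the diff, not something the PR does.

```python
import argparse
import datetime


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--title", default="TinyStories Multi-Seed Streaming Validation")
    # Assumption: use the run date so regenerated reports are not stamped
    # with a stale hard-coded date.
    parser.add_argument("--date", default=datetime.date.today().isoformat())
    parser.add_argument("--context", default="")
    return parser


def header_lines(title: str, date: str, context: str) -> list[str]:
    """Hypothetical helper: title, date, optional context paragraph, Sources heading."""
    lines = [f"# {title}", "", f"Date: {date}", ""]
    if context:
        lines.extend([context, ""])
    lines.extend(["## Sources", ""])
    return lines


args = build_parser().parse_args(
    ["--title", "Previous/Local Regime Streaming Validation", "--date", "2026-05-30"]
)
header = header_lines(args.title, args.date, args.context)
print("\n".join(header))
```

Leaving `--context` empty by default keeps existing invocations producing the same header layout, with the regime paragraph appearing only when it is supplied.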