Mandeep Sidhu committed
Commit 1c065aa · 1 Parent(s): 2f2776e

Add previous local streaming report

docs/plan.md CHANGED
@@ -284,7 +284,7 @@ Use this order for every regime.
284
  | original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
285
  | TinyStories static/coefficient regime | active | main coefficient evidence |
286
  | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
287
- | original/local streaming regime | pending report | summarize saved streaming runs before launching any additional training |
288
  | next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
289
 
290
  ## Current Formula Status
@@ -333,6 +333,7 @@ structure transfers, while coefficients may be regime-specific.
333
  | TinyStories held-out prefix | supports pressure dependence on unique tokens |
334
  | TinyStories held-out model | supports pressure dependence on model size |
335
  | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
 
336
  | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
337
 
338
  Latest TinyStories 5-seed streaming final-loss table:
@@ -394,21 +395,25 @@ streaming multi-seed reports for each regime.
394
 
395
  ## Immediate Next Action
396
 
397
- Build the original/local streaming report from saved runs. Do not launch a
398
- broad new regime sweep until the previous/local report is reconciled against
399
- the TinyStories five-seed result.
 
400
 
401
  ## Next Training After Current Gate
402
 
403
- No MPS training should launch before the previous/local streaming report is
404
- generated from existing saved runs. If that report lacks enough coverage for a
405
- clean claim, the next MPS run should be narrowly scoped to only the missing
406
- conditions or seeds:
 
407
 
408
  ```text
409
- preferred first step: no training, saved-run report only
410
- possible follow-up: fill missing previous/local streaming cells
411
- avoid: broad new regime sweep before the report audit
 
 
412
  ```
413
 
414
  Evaluate with paired seed comparisons:
@@ -430,5 +435,7 @@ Latest streaming report:
430
 
431
  ```text
432
  docs/streaming_multiseed_validation_report.md
 
433
  runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
 
434
  ```
 
284
  | original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
285
  | TinyStories static/coefficient regime | active | main coefficient evidence |
286
  | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
287
+ | original/local streaming regime | 3-seed saved-run report complete | previous/local decay schedules beat best static in 3/3 paired final-loss comparisons |
288
  | next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
289
 
290
  ## Current Formula Status
 
333
  | TinyStories held-out prefix | supports pressure dependence on unique tokens |
334
  | TinyStories held-out model | supports pressure dependence on model size |
335
  | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
336
+ | previous/local streaming, 3 seeds | hold-30 decay has best mean final loss; top decay schedules beat best static in 3/3 paired comparisons |
337
  | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
338
 
339
  Latest TinyStories 5-seed streaming final-loss table:
 
395
 
396
  ## Immediate Next Action
397
 
398
+ Reconcile the TinyStories five-seed report and previous/local three-seed
399
+ report into the paper outline. Decide whether the previous/local regime needs a
400
+ targeted seed-4/5 extension, or whether the next better use of MPS time is a
401
+ third held-out regime.
402
 
403
  ## Next Training After Current Gate
404
 
405
+ No MPS training should launch until the two completed streaming reports are
406
+ read together. If previous/local seed count is the limiting issue, the next run
407
+ should be narrowly scoped to only the missing seed-4/5 previous/local
408
+ conditions. If external validity is the limiting issue, use a third held-out
409
+ regime instead:
410
 
411
  ```text
412
+ completed: TinyStories 5-seed streaming report
413
+ completed: previous/local 3-seed saved-run streaming report
414
+ possible follow-up A: previous/local seed-4/5 extension
415
+ possible follow-up B: third held-out regime
416
+ avoid: broad new sweep before choosing A vs B
417
  ```
418
 
419
  Evaluate with paired seed comparisons:
 
435
 
436
  ```text
437
  docs/streaming_multiseed_validation_report.md
438
+ docs/previous_regime_streaming_report.md
439
  runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
440
+ runs/previous_local_streaming_report/l16_multiseed_confirm/
441
  ```
docs/previous_regime_streaming_report.md ADDED
@@ -0,0 +1,152 @@
1
+ # Previous/Local Regime Streaming Validation
2
+
3
+ Date: 2026-05-30
4
+
5
+ This report combines 3 random seeds (1, 2, 3) from saved streaming runs.
6
+ No additional training is performed by this script; it reads saved
7
+ `metrics.jsonl` files.
8
+
9
+ Regime: original/local saved streaming setup with L16_H8_D384, 31,457,280 parameters, five prefixes from 250k to 4M tokens, and 1,000 optimizer steps per stage. This report uses the existing three-seed confirmation run only; earlier single-seed search/refinement runs are treated as exploratory support, not as the primary proof table.
10
+
11
+ ## Sources
12
+
13
+ - `runs/stream_multiseed_confirm/locked_stream/20260526-203116/metrics.jsonl`
14
+
15
+ ## Condition Ranking By Final Loss
16
+
17
+ | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
18
+ |---|---|---:|---:|---:|---:|---:|---:|---|
19
+ | `hold_30_then_decay` | `anchor_decay` | 3 | 4.8503 | 0.0017 | 4.4060 | 0.0118 | 0.3530 | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` |
20
+ | `mild_30_to_08` | `anchor_decay` | 3 | 4.8504 | 0.0018 | 4.4075 | 0.0078 | 0.3307 | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` |
21
+ | `fitted_l16_static_law` | `anchor_decay` | 3 | 4.9527 | 0.0052 | 4.4159 | 0.0042 | 0.3144 | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` |
22
+ | `static_dropout_0.14` | `static` | 3 | 4.9043 | 0.0119 | 4.4459 | 0.0128 | 0.3205 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` |
23
+ | `static_dropout_0.3` | `static` | 3 | 4.8764 | 0.0014 | 4.4693 | 0.0081 | 0.2327 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` |
24
+ | `static_dropout_0.02` | `static` | 3 | 5.1544 | 0.0091 | 4.5405 | 0.0061 | 0.4747 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` |
25
+ | `static_dropout_0` | `static` | 3 | 5.2422 | 0.0015 | 4.5905 | 0.0192 | 0.5464 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` |
26
+
27
+ ## Paired Final-Loss Deltas
28
+
29
+ Negative `delta_vs_best_static` means the condition beat the best static
30
+ baseline for that seed.
31
+
32
+ | Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
33
+ |---:|---|---:|---|---:|---:|
34
+ | 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 |
35
+ | 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 |
36
+ | 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 |
37
+ | 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 |
38
+ | 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 |
39
+ | 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.0985 |
40
+ | 1 | `static_dropout_0` | 4.5703 | `static_dropout_0.14` | 4.4418 | +0.1286 |
41
+ | 2 | `hold_30_then_decay` | 4.4068 | `static_dropout_0.14` | 4.4603 | -0.0535 |
42
+ | 2 | `mild_30_to_08` | 4.4080 | `static_dropout_0.14` | 4.4603 | -0.0523 |
43
+ | 2 | `fitted_l16_static_law` | 4.4136 | `static_dropout_0.14` | 4.4603 | -0.0467 |
44
+ | 2 | `static_dropout_0.14` | 4.4603 | `static_dropout_0.14` | 4.4603 | +0.0000 |
45
+ | 2 | `static_dropout_0.3` | 4.4719 | `static_dropout_0.14` | 4.4603 | +0.0116 |
46
+ | 2 | `static_dropout_0.02` | 4.5466 | `static_dropout_0.14` | 4.4603 | +0.0863 |
47
+ | 2 | `static_dropout_0` | 4.6085 | `static_dropout_0.14` | 4.4603 | +0.1482 |
48
+ | 3 | `hold_30_then_decay` | 4.4174 | `static_dropout_0.14` | 4.4357 | -0.0183 |
49
+ | 3 | `mild_30_to_08` | 4.4151 | `static_dropout_0.14` | 4.4357 | -0.0206 |
50
+ | 3 | `fitted_l16_static_law` | 4.4134 | `static_dropout_0.14` | 4.4357 | -0.0223 |
51
+ | 3 | `static_dropout_0.14` | 4.4357 | `static_dropout_0.14` | 4.4357 | +0.0000 |
52
+ | 3 | `static_dropout_0.3` | 4.4758 | `static_dropout_0.14` | 4.4357 | +0.0401 |
53
+ | 3 | `static_dropout_0.02` | 4.5345 | `static_dropout_0.14` | 4.4357 | +0.0988 |
54
+ | 3 | `static_dropout_0` | 4.5926 | `static_dropout_0.14` | 4.4357 | +0.1569 |
55
+
56
+ ## Stage Trajectory
57
+
58
+ | Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
59
+ |---:|---:|---|---:|---:|---:|---:|---:|---:|
60
+ | 0 | 250,000 | `mild_30_to_08` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
61
+ | 0 | 250,000 | `static_dropout_0.3` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
62
+ | 0 | 250,000 | `hold_30_then_decay` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
63
+ | 0 | 250,000 | `static_dropout_0.14` | 0.140 | 3 | 5.4707 | 0.0281 | 4.0325 | 1.4383 |
64
+ | 0 | 250,000 | `static_dropout_0.02` | 0.020 | 3 | 5.7452 | 0.0319 | 3.5394 | 2.2057 |
65
+ | 0 | 250,000 | `fitted_l16_static_law` | 0.600 | 3 | 5.7847 | 0.0108 | 5.1677 | 0.6170 |
66
+ | 0 | 250,000 | `static_dropout_0` | 0.000 | 3 | 5.8283 | 0.0158 | 3.4498 | 2.3785 |
67
+ | 1 | 500,000 | `mild_30_to_08` | 0.240 | 3 | 5.0573 | 0.0197 | 4.0209 | 1.0364 |
68
+ | 1 | 500,000 | `static_dropout_0.3` | 0.300 | 3 | 5.0643 | 0.0216 | 4.1251 | 0.9392 |
69
+ | 1 | 500,000 | `hold_30_then_decay` | 0.300 | 3 | 5.0643 | 0.0216 | 4.1251 | 0.9392 |
70
+ | 1 | 500,000 | `fitted_l16_static_law` | 0.400 | 3 | 5.1479 | 0.0127 | 4.4501 | 0.6978 |
71
+ | 1 | 500,000 | `static_dropout_0.14` | 0.140 | 3 | 5.1493 | 0.0097 | 3.7036 | 1.4457 |
72
+ | 1 | 500,000 | `static_dropout_0.02` | 0.020 | 3 | 5.5605 | 0.0148 | 3.1103 | 2.4502 |
73
+ | 1 | 500,000 | `static_dropout_0` | 0.000 | 3 | 5.6920 | 0.0452 | 2.9511 | 2.7409 |
74
+ | 2 | 1,000,000 | `hold_30_then_decay` | 0.200 | 3 | 4.7695 | 0.0164 | 4.0408 | 0.7287 |
75
+ | 2 | 1,000,000 | `mild_30_to_08` | 0.180 | 3 | 4.7717 | 0.0162 | 3.9925 | 0.7793 |
76
+ | 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.7927 | 0.0173 | 4.1535 | 0.6392 |
77
+ | 2 | 1,000,000 | `fitted_l16_static_law` | 0.300 | 3 | 4.8273 | 0.0096 | 4.2699 | 0.5573 |
78
+ | 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.8466 | 0.0278 | 3.8815 | 0.9651 |
79
+ | 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 3 | 5.1459 | 0.0294 | 3.4641 | 1.6818 |
80
+ | 2 | 1,000,000 | `static_dropout_0` | 0.000 | 3 | 5.2484 | 0.0091 | 3.3281 | 1.9203 |
81
+ | 3 | 2,000,000 | `hold_30_then_decay` | 0.100 | 3 | 4.5655 | 0.0060 | 4.0390 | 0.5265 |
82
+ | 3 | 2,000,000 | `mild_30_to_08` | 0.120 | 3 | 4.5691 | 0.0072 | 4.0380 | 0.5312 |
83
+ | 3 | 2,000,000 | `fitted_l16_static_law` | 0.140 | 3 | 4.5879 | 0.0086 | 4.1457 | 0.4422 |
84
+ | 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.6088 | 0.0069 | 4.0454 | 0.5634 |
85
+ | 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.6094 | 0.0059 | 4.2125 | 0.3968 |
86
+ | 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 3 | 4.7799 | 0.0219 | 3.8344 | 0.9455 |
87
+ | 3 | 2,000,000 | `static_dropout_0` | 0.000 | 3 | 4.8517 | 0.0153 | 3.7761 | 1.0757 |
88
+ | 4 | 4,000,000 | `hold_30_then_decay` | 0.020 | 3 | 4.4060 | 0.0118 | 4.0530 | 0.3530 |
89
+ | 4 | 4,000,000 | `mild_30_to_08` | 0.080 | 3 | 4.4075 | 0.0078 | 4.0768 | 0.3307 |
90
+ | 4 | 4,000,000 | `fitted_l16_static_law` | 0.020 | 3 | 4.4159 | 0.0042 | 4.1015 | 0.3144 |
91
+ | 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.4459 | 0.0128 | 4.1254 | 0.3205 |
92
+ | 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.4693 | 0.0081 | 4.2365 | 0.2327 |
93
+ | 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 3 | 4.5405 | 0.0061 | 4.0657 | 0.4747 |
94
+ | 4 | 4,000,000 | `static_dropout_0` | 0.000 | 3 | 4.5905 | 0.0192 | 4.0441 | 0.5464 |
95
+
96
+ ## Interpretation
97
+
98
+ - `hold_30_then_decay` has the best 3-seed mean final validation loss: 4.4060 +/- 0.0118.
99
+ - The second-best final condition is `mild_30_to_08` at 4.4075 +/- 0.0078.
100
+ - The best static baseline by mean final loss is `static_dropout_0.14` at 4.4459 +/- 0.0128.
101
+ - `hold_30_then_decay` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0183.
102
+ - `mild_30_to_08` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0206.
103
+ - `fitted_l16_static_law` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0211.
104
+ - The best first-stage condition is `mild_30_to_08` at prefix 250,000 with mean validation loss 5.4463; compare this with the final ranking before claiming a schedule is uniformly better.
105
+ - This is a saved-run streaming validation artifact. Treat it as strong
106
+ evidence only when the tested conditions, seeds, static baselines, and
107
+ stream protocol match the claim being made.
108
+
109
+ ## Supporting Exploratory Runs
110
+
111
+ The primary proof table above is the three-seed confirmation run:
112
+
113
+ ```text
114
+ runs/stream_multiseed_confirm/locked_stream/20260526-203116/
115
+ ```
116
+
117
+ Earlier single-seed runs are useful for interpreting how the schedule was
118
+ selected, but they are not counted as multi-seed proof:
119
+
120
+ | Supporting run | Role | Main reading |
121
+ |---|---|---|
122
+ | `runs/stream_schedule_search/locked_stream/20260526-171537/` | schedule search | decay schedules starting near `0.30` and ending near `0.02` to `0.08` beat static `0.14` and `0.30` at the final 4M prefix |
123
+ | `runs/stream_schedule_refinement/locked_stream/20260526-184506/` | endpoint and curvature refinement | several `hold_30` variants ended tightly around `4.394`, while `hold_24_then_decay` was weaker at `4.4214`, suggesting the initial dropout should not be reduced too aggressively in this regime |
124
+ | `runs/formula_l16_exact_multiseed/locked_stream/20260527-123806/` | coefficient-derived schedule check | `pressure_formula_l16_floor02` reached `4.4059 +/- 0.0042` over three seeds versus static `0.14` at `4.4459 +/- 0.0128` |
125
+
126
+ ## Research Reading
127
+
128
+ This previous/local regime supports the same qualitative claim as the
129
+ TinyStories five-seed validation: a static dropout that is reasonable at one
130
+ stream scale is not necessarily optimal as the data prefix grows. In this
131
+ regime, the useful path keeps dropout high early (`0.30`) and then lowers it
132
+ as unique tokens and sampled tokens increase.
133
+
134
+ The strongest previous/local evidence is:
135
+
136
+ | Claim | Evidence |
137
+ |---|---|
138
+ | decay beats best static final loss | `hold_30_then_decay` beats the per-seed best static baseline in `3/3` seeds |
139
+ | endpoint is not uniquely fixed | `mild_30_to_08` is nearly tied with `hold_30_then_decay` |
140
+ | too-low early dropout is harmful | static `0.02` and `0.00` are much worse throughout the stream |
141
+ | too-high static dropout underuses later data | static `0.30` wins no final paired comparison despite being strong early |
142
+ | coefficient-derived schedules are viable | `fitted_l16_static_law` and `pressure_formula_l16_floor02` both beat static `0.14` in the saved three-seed comparisons |
143
+
144
+ Limitations:
145
+
146
+ 1. This report is `n=3`, not `n=5`.
147
+ 2. The schedules were refined inside this local regime, so this is not a
148
+ clean held-out-regime proof of universal coefficients.
149
+ 3. The report still supports the cross-regime mechanism because the direction
150
+ of the effect matches TinyStories: high enough initial regularization
151
+ prevents early overfit, and lowering dropout later improves final validation
152
+ loss versus holding one static value fixed.
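
The `Dropout path` strings in the report pair one dropout value with each of the five stream prefixes. The sketch below shows one way to turn such a string into a per-stage lookup. The helper names `parse_dropout_path` and `stage_schedule` are hypothetical and are not part of the commit; the prefix sizes are taken from the Stage Trajectory table.

```python
# Prefix sizes (tokens) for the five stream stages, from the Stage Trajectory table.
PREFIX_TOKENS = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def parse_dropout_path(path: str) -> list[float]:
    """Split an 'a -> b -> c' dropout path string into floats."""
    return [float(part) for part in path.split("->")]


def stage_schedule(path: str) -> dict[int, float]:
    """Map each prefix size to the dropout used at that stage."""
    values = parse_dropout_path(path)
    if len(values) != len(PREFIX_TOKENS):
        raise ValueError("dropout path length must match the number of stages")
    return dict(zip(PREFIX_TOKENS, values))


# Path string for hold_30_then_decay, copied from the condition ranking table.
schedule = stage_schedule("0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02")
for tokens, dropout in schedule.items():
    print(f"{tokens:>9,} tokens: dropout {dropout:.2f}")
```

Read this way, `hold_30_then_decay` keeps dropout at `0.30` through the 500k prefix and only starts lowering it from the 1M prefix onward.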
docs/streaming_multiseed_validation_report.md CHANGED
@@ -6,6 +6,8 @@ This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
6
  No additional training is performed by this script; it reads saved
7
  `metrics.jsonl` files.
8
 
 
 
9
  ## Sources
10
 
11
  - `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
@@ -93,15 +95,12 @@ baseline for that seed.
93
  ## Interpretation
94
 
95
  - `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213.
 
96
  - The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.
97
- - `smooth_low` is very close to `interaction`, suggesting the exact anchor
98
- values may not be uniquely required as long as the schedule follows the
99
- same pressure range.
100
  - `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
101
  - `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
102
  - `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
103
- - Static `0.12` can win early stages, but holding it fixed loses at the
104
- final 4M stage.
105
- - This is now the TinyStories paper-grade validation gate for this narrowed
106
- setup: five seeds, paired seed comparisons, and static baselines selected
107
- from the same stream protocol.
 
6
  No additional training is performed by this script; it reads saved
7
  `metrics.jsonl` files.
8
 
9
+ Regime: TinyStories BPE streaming validation with L12_H8_D320, 17,367,040 parameters, four prefixes from 500k to 4M tokens, and 2,000 optimizer steps per stage.
10
+
11
  ## Sources
12
 
13
  - `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
 
95
  ## Interpretation
96
 
97
  - `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213.
98
+ - The second-best final condition is `smooth_low` at 2.5321 +/- 0.0203.
99
  - The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.
 
 
 
100
  - `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
101
  - `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
102
  - `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
103
+ - The best first-stage condition is `static_dropout_0.12` at prefix 500,000 with mean validation loss 3.2226; compare this with the final ranking before claiming a schedule is uniformly better.
104
+ - This is a saved-run streaming validation artifact. Treat it as strong
105
+ evidence only when the tested conditions, seeds, static baselines, and
106
+ stream protocol match the claim being made.
 
runs/previous_local_streaming_report/l16_multiseed_confirm/condition_summary.csv ADDED
@@ -0,0 +1,8 @@
1
+ condition,kind,n,mean_trajectory_val,std_trajectory_val,mean_final_val,std_final_val,mean_final_gap,std_final_gap,dropout_path
2
+ hold_30_then_decay,anchor_decay,3,4.85031243065993,0.001654499467210519,4.406020574271679,0.011754833839775457,0.3529989644885063,0.015994346590220903,0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02
3
+ mild_30_to_08,anchor_decay,3,4.8503826280434925,0.0017992301945451007,4.407524392008781,0.007800724825508926,0.33071232338746387,0.012432927596429534,0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08
4
+ fitted_l16_static_law,anchor_decay,3,4.952731303373973,0.0052436768034393386,4.415892139077187,0.004158092833266487,0.31436355660359067,0.010011530967558104,0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02
5
+ static_dropout_0.14,static,3,4.904260951777299,0.011925494764473046,4.445927885671456,0.01281389472329066,0.32051239907741547,0.005114602595178522,0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14
6
+ static_dropout_0.3,static,3,4.876375656326612,0.0014325425566632558,4.469294945398967,0.008116710580691494,0.23274652659893036,0.009938004189833373,0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30
7
+ static_dropout_0.02,static,3,5.154374482234319,0.009061613413273495,4.540457583963871,0.006064728711460762,0.47473999857902527,0.005772200910659103,0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02
8
+ static_dropout_0,static,3,5.242185181876024,0.00150415496846462,4.59050024797519,0.01917389367658262,0.5464497481783231,0.027464276868201288,0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00
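
As a cross-check on `condition_summary.csv`, the sketch below recomputes the `hold_30_then_decay` final-loss mean and spread from the three per-seed values listed in `paired_final_deltas.csv`. It assumes the summary uses the sample standard deviation (n - 1 denominator). The commit does not state this, but the assumption appears to reproduce the reported `std_final_val`. The function name `summarize_final_vals` is hypothetical and not part of `scripts/summarize_streaming_multiseed.py`.

```python
import statistics

# Per-seed final validation losses for hold_30_then_decay (seeds 1, 2, 3),
# copied from paired_final_deltas.csv in this commit.
FINAL_VALS = [4.393905431032181, 4.406777806580067, 4.417378485202789]


def summarize_final_vals(values: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of the values."""
    # statistics.stdev uses the n - 1 denominator (assumed to match the script).
    return statistics.mean(values), statistics.stdev(values)


mean_final, std_final = summarize_final_vals(FINAL_VALS)
print(f"{mean_final:.4f} +/- {std_final:.4f}")
```

With only three seeds the sample and population standard deviations differ noticeably, so it is worth knowing which one a `+/-` figure refers to before comparing it with the five-seed TinyStories report.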
runs/previous_local_streaming_report/l16_multiseed_confirm/paired_final_deltas.csv ADDED
@@ -0,0 +1,22 @@
1
+ seed,condition,final_val,best_static_condition,best_static_final_val,delta_vs_best_static
2
+ 1,hold_30_then_decay,4.393905431032181,static_dropout_0.14,4.441768206655979,-0.04786277562379837
3
+ 1,mild_30_to_08,4.399484351277351,static_dropout_0.14,4.441768206655979,-0.04228385537862778
4
+ 1,fitted_l16_static_law,4.420692302286625,static_dropout_0.14,4.441768206655979,-0.021075904369354248
5
+ 1,static_dropout_0.14,4.441768206655979,static_dropout_0.14,4.441768206655979,0.0
6
+ 1,static_dropout_0.3,4.460189968347549,static_dropout_0.14,4.441768206655979,0.018421761691570282
7
+ 1,static_dropout_0.02,4.540249735116959,static_dropout_0.14,4.441768206655979,0.09848152846097946
8
+ 1,static_dropout_0,4.5703444480896,static_dropout_0.14,4.441768206655979,0.12857624143362045
9
+ 2,hold_30_then_decay,4.406777806580067,static_dropout_0.14,4.460304826498032,-0.053527019917964935
10
+ 2,mild_30_to_08,4.408027365803719,static_dropout_0.14,4.460304826498032,-0.05227746069431305
11
+ 2,fitted_l16_static_law,4.41358458250761,static_dropout_0.14,4.460304826498032,-0.046720243990421295
12
+ 2,static_dropout_0.14,4.460304826498032,static_dropout_0.14,4.460304826498032,0.0
13
+ 2,static_dropout_0.3,4.47192245721817,static_dropout_0.14,4.460304826498032,0.01161763072013855
14
+ 2,static_dropout_0.02,4.546623565256596,static_dropout_0.14,4.460304826498032,0.086318738758564
15
+ 2,static_dropout_0,4.608511999249458,static_dropout_0.14,4.460304826498032,0.1482071727514267
16
+ 3,hold_30_then_decay,4.417378485202789,static_dropout_0.14,4.435710623860359,-0.018332138657569885
17
+ 3,mild_30_to_08,4.415061458945274,static_dropout_0.14,4.435710623860359,-0.02064916491508484
18
+ 3,fitted_l16_static_law,4.4133995324373245,static_dropout_0.14,4.435710623860359,-0.022311091423034668
19
+ 3,static_dropout_0.14,4.435710623860359,static_dropout_0.14,4.435710623860359,0.0
20
+ 3,static_dropout_0.3,4.47577241063118,static_dropout_0.14,4.435710623860359,0.04006178677082062
21
+ 3,static_dropout_0.02,4.534499451518059,static_dropout_0.14,4.435710623860359,0.09878882765769958
22
+ 3,static_dropout_0,4.5926442965865135,static_dropout_0.14,4.435710623860359,0.15693367272615433
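
The `delta_vs_best_static` column can be reproduced per seed by picking the static condition with the lowest final loss and subtracting it from every condition. The sketch below does this for seed 3 only, using values copied from the CSV above. The function name `paired_deltas` is hypothetical; the `static_` prefix test mirrors the one used in `write_report`.

```python
# Seed-3 final validation losses, copied from paired_final_deltas.csv.
SEED3_FINAL_VALS = {
    "hold_30_then_decay": 4.417378485202789,
    "mild_30_to_08": 4.415061458945274,
    "fitted_l16_static_law": 4.4133995324373245,
    "static_dropout_0.14": 4.435710623860359,
    "static_dropout_0.3": 4.47577241063118,
    "static_dropout_0.02": 4.534499451518059,
    "static_dropout_0": 4.5926442965865135,
}


def paired_deltas(final_vals: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Return the best static condition and each condition's delta against it."""
    static_vals = {k: v for k, v in final_vals.items() if k.startswith("static_")}
    best_static = min(static_vals, key=static_vals.get)
    baseline = static_vals[best_static]
    # Negative delta means the condition beat the best static baseline.
    return best_static, {k: v - baseline for k, v in final_vals.items()}


best_static, deltas = paired_deltas(SEED3_FINAL_VALS)
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:24s} {delta:+.4f}")
```

On seed 3 the order among the three decay schedules is reversed relative to the mean ranking (`fitted_l16_static_law` has the lowest final loss), which is one reason to read the paired per-seed deltas alongside the means.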
runs/previous_local_streaming_report/l16_multiseed_confirm/stage_summary.csv ADDED
@@ -0,0 +1,36 @@
1
+ condition,stage,token_limit,dropout,n,mean_val,std_val,mean_train,std_train,mean_gap,std_gap
2
+ hold_30_then_decay,0,250000,0.3,3,5.446260958909988,0.019125165089046693,4.445705617467563,0.034358782454698235,1.000555341442426,0.024155559429602118
3
+ hold_30_then_decay,1,500000,0.3,3,5.064259976148605,0.021637732528368203,4.125050333638986,0.04522623234945776,0.9392096425096194,0.05214362791131757
4
+ hold_30_then_decay,2,1000000,0.20000000000000004,3,4.769520824154218,0.01635006458114404,4.040781140327454,0.030883572207078522,0.7287396838267645,0.01586184848529449
5
+ hold_30_then_decay,3,2000000,0.10000000000000002,3,4.565499819815159,0.005976341537445996,4.039049598077933,0.013075486007728024,0.5264502217372259,0.018983448452741372
6
+ hold_30_then_decay,4,4000000,0.02,3,4.406020574271679,0.011754833839775457,4.053021609783173,0.005869636087162683,0.3529989644885063,0.015994346590220903
7
+ mild_30_to_08,0,250000,0.3,3,5.4462608098983765,0.019125068401667725,4.445705557862918,0.034358764649810906,1.0005552520354588,0.02415560068535275
8
+ mild_30_to_08,1,500000,0.24,3,5.0572623411814375,0.019719838039851723,4.020891976853211,0.04693412607863078,1.0363703643282254,0.05164875161403775
9
+ mild_30_to_08,2,1000000,0.18000000000000002,3,4.771725835899512,0.016156229486385284,3.992456587652365,0.03048971998308727,0.7792692482471466,0.015220576762990419
10
+ mild_30_to_08,3,2000000,0.12,3,4.569139761229356,0.0072146469245784745,4.03795708467563,0.013373284097680214,0.5311826765537262,0.020578724869085317
11
+ mild_30_to_08,4,4000000,0.08,3,4.407524392008781,0.007800724825508926,4.076812068621318,0.005051302799236397,0.33071232338746387,0.012432927596429534
12
+ fitted_l16_static_law,0,250000,0.6,3,5.784741741915544,0.010830950668144494,5.167703871925672,0.03352670524735539,0.617037869989872,0.02282459187460017
13
+ fitted_l16_static_law,1,500000,0.4000000000000001,3,5.147908595701058,0.012709997007779845,4.450103844205539,0.031679739247628576,0.6978047514955202,0.027979523555407795
14
+ fitted_l16_static_law,2,1000000,0.3,3,4.827251675228278,0.009557452344926845,4.269914664328098,0.02965806771390336,0.5573370109001795,0.020671084306813296
15
+ fitted_l16_static_law,3,2000000,0.14,3,4.587862364947796,0.00860798597712663,4.145651715497176,0.020112725612223863,0.44221064945062,0.02244568312577686
16
+ fitted_l16_static_law,4,4000000,0.02,3,4.415892139077187,0.004158092833266487,4.101528582473596,0.012667297837999282,0.31436355660359067,0.010011530967558104
17
+ static_dropout_0.14,0,250000,0.14,3,5.470722645521164,0.02808420889994966,4.032453775405884,0.02467190243959671,1.4382688701152802,0.03724958360607134
18
+ static_dropout_0.14,1,500000,0.14,3,5.149252874155839,0.009721292660607455,3.7035952607790628,0.03283222913980165,1.4456576133767765,0.03202443758399359
19
+ static_dropout_0.14,2,1000000,0.14,3,4.846569702029228,0.027835384007538343,3.8814875607689223,0.03663314954997834,0.965082141260306,0.022778798858926228
20
+ static_dropout_0.14,3,2000000,0.14,3,4.608831651508808,0.006911091820519436,4.045392190416654,0.02868106552021608,0.5634394610921541,0.02219500109725294
21
+ static_dropout_0.14,4,4000000,0.14,3,4.445927885671456,0.01281389472329066,4.125415486594041,0.008814570521429348,0.32051239907741547,0.005114602595178522
22
+ static_dropout_0.3,0,250000,0.3,3,5.446260929107666,0.019125310935332103,4.4457056671381,0.034358902052248425,1.0005552619695663,0.024155461141722852
23
+ static_dropout_0.3,1,500000,0.3,3,5.064259819686413,0.021637796833646396,4.125050216913223,0.04522603378014792,0.9392096027731895,0.05214326197407805
24
+ static_dropout_0.3,2,1000000,0.3,3,4.7926972309748335,0.017311084846289997,4.1534921278556185,0.030420766811043106,0.6392051031192144,0.014300245473945872
25
+ static_dropout_0.3,3,2000000,0.3,3,4.60936535646518,0.005911969954977541,4.212546601891518,0.011776857890288749,0.39681875457366306,0.017503466982826142
26
+ static_dropout_0.3,4,4000000,0.3,3,4.469294945398967,0.008116710580691494,4.236548418800036,0.003898783020435021,0.23274652659893036,0.009938004189833373
27
+ static_dropout_0.02,0,250000,0.02,3,5.745150377353032,0.03186573719752147,3.5394225865602493,0.01043658554282977,2.2057277907927832,0.04182545755928344
28
+ static_dropout_0.02,1,500000,0.02,3,5.56048562626044,0.014806452057977726,3.110301854709784,0.02923601075152012,2.4501837715506554,0.04238209743234919
29
+ static_dropout_0.02,2,1000000,0.02,3,5.145867633322875,0.02939265744253425,3.4641073818008103,0.04880776844576436,1.6817602515220642,0.04157387316099468
30
+ static_dropout_0.02,3,2000000,0.02,3,4.779911190271378,0.021934652780817913,3.834411238630613,0.04218285619197449,0.9454999516407648,0.027020864935500585
31
+ static_dropout_0.02,4,4000000,0.02,3,4.540457583963871,0.006064728711460762,4.065717585384846,0.007518099672343167,0.47473999857902527,0.005772200910659103
32
+ static_dropout_0,0,250000,0.0,3,5.828276579578717,0.01576268602511141,3.4497722735007605,0.023904614106222872,2.378504306077957,0.03224386346161905
33
+ static_dropout_0,1,500000,0.0,3,5.6920087188482285,0.04522491867870689,2.951105666657289,0.05869100244970825,2.7409030521909394,0.08369333869665757
34
+ static_dropout_0,2,1000000,0.0,3,5.24841162810723,0.009065582711747499,3.3280878563721976,0.041678604603156234,1.9203237717350323,0.03261336000414208
35
+ static_dropout_0,3,2000000,0.0,3,4.851728734870751,0.015303064070863357,3.7760739227135978,0.03677045479243139,1.075654812157154,0.029026197035937777
36
+ static_dropout_0,4,4000000,0.0,3,4.59050024797519,0.01917389367658262,4.044050499796867,0.023224885505203383,0.5464497481783231,0.027464276868201288
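
Two details of `stage_summary.csv` are easy to check by hand. First, `mean_gap` is consistent with mean validation loss minus mean training loss. Second, the three conditions that use dropout `0.30` at stage 0 differ only around the seventh decimal place, so the "best first-stage condition" picked by a plain `min` is effectively a tie. The sketch below illustrates both points on a subset of stage-0 rows; the tolerance `tol` is a hypothetical value chosen only for this illustration.

```python
import math

# Stage-0 rows (condition -> (mean_val, mean_train)), copied from stage_summary.csv.
STAGE0 = {
    "hold_30_then_decay": (5.446260958909988, 4.445705617467563),
    "mild_30_to_08": (5.4462608098983765, 4.445705557862918),
    "static_dropout_0.3": (5.446260929107666, 4.4457056671381),
    "static_dropout_0.14": (5.470722645521164, 4.032453775405884),
}

# Generalization gap per condition: validation loss minus training loss.
gaps = {name: val - train for name, (val, train) in STAGE0.items()}

# Conditions whose stage-0 validation loss is within `tol` of the minimum.
best_val = min(val for val, _ in STAGE0.values())
tol = 1e-6  # hypothetical tolerance, not taken from the script
tied = sorted(
    name
    for name, (val, _) in STAGE0.items()
    if math.isclose(val, best_val, abs_tol=tol)
)

print(tied)
print({name: round(gap, 4) for name, gap in gaps.items()})
```

A tie-aware check like this would report all three `0.30` conditions as jointly best at the 250k prefix, instead of singling out `mild_30_to_08`.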
scripts/summarize_streaming_multiseed.py CHANGED
@@ -191,12 +191,18 @@ def write_report(
191
  stage_rows: list[dict],
192
  paired_rows: list[dict],
193
  metrics_paths: list[Path],
 
 
 
194
  ) -> None:
195
  seed_ids = sorted({int(row["seed"]) for row in paired_rows})
196
  seed_count = len(seed_ids)
197
  best_row = condition_rows[0]
 
198
  static_rows = [row for row in condition_rows if row["condition"].startswith("static_")]
199
  best_static_row = min(static_rows, key=lambda row: row["mean_final_val"])
 
 
200
 
201
  paired_win_lines = []
202
  for row in condition_rows:
@@ -219,18 +225,25 @@ def write_report(
219
  )
220
 
221
  lines = [
222
- "# TinyStories Multi-Seed Streaming Validation",
223
  "",
224
- "Date: 2026-05-30",
225
  "",
226
  f"This report combines {seed_count} random seeds "
227
  f"({', '.join(str(seed) for seed in seed_ids)}) from saved streaming runs.",
228
  "No additional training is performed by this script; it reads saved",
229
  "`metrics.jsonl` files.",
230
  "",
 
 
 
 
 
 
231
  "## Sources",
232
  "",
233
- ]
 
234
  for path_item in metrics_paths:
235
  lines.append(f"- `{path_item}`")
236
 
@@ -294,19 +307,27 @@ def write_report(
294
  f"- `{best_row['condition']}` has the best {seed_count}-seed mean final "
295
  f"validation loss: {fmt(best_row['mean_final_val'])} +/- "
296
  f"{fmt(best_row['std_final_val'])}.",
 
 
 
 
 
 
 
 
 
297
  f"- The best static baseline by mean final loss is "
298
  f"`{best_static_row['condition']}` at "
299
  f"{fmt(best_static_row['mean_final_val'])} +/- "
300
  f"{fmt(best_static_row['std_final_val'])}.",
301
- "- `smooth_low` is very close to `interaction`, suggesting the exact anchor",
302
- " values may not be uniquely required as long as the schedule follows the",
303
- " same pressure range.",
304
  *paired_win_lines,
305
- "- Static `0.12` can win early stages, but holding it fixed loses at the",
306
- " final 4M stage.",
307
- "- This is now the TinyStories paper-grade validation gate for this narrowed",
308
- " setup: five seeds, paired seed comparisons, and static baselines selected",
309
- " from the same stream protocol.",
 
 
310
  ]
311
  )
312
  path.write_text("\n".join(lines) + "\n", encoding="utf-8")
@@ -318,6 +339,9 @@ def build_parser() -> argparse.ArgumentParser:
318
  parser.add_argument("--output-dir", type=Path, required=True)
319
  parser.add_argument("--report", type=Path, required=True)
320
  parser.add_argument("--conditions", nargs="+", default=DEFAULT_CONDITIONS)
 
 
 
321
  return parser
322
 
323
 
@@ -371,7 +395,16 @@ def main() -> None:
371
  "delta_vs_best_static",
372
  ],
373
  )
374
- write_report(args.report, condition_rows, stage_rows, paired_rows, args.metrics)
 
 
 
 
 
 
 
 
 
375
  print(
376
  json.dumps(
377
  {
 
191
  stage_rows: list[dict],
192
  paired_rows: list[dict],
193
  metrics_paths: list[Path],
194
+ title: str,
195
+ date: str,
196
+ context: str,
197
  ) -> None:
198
  seed_ids = sorted({int(row["seed"]) for row in paired_rows})
199
  seed_count = len(seed_ids)
200
  best_row = condition_rows[0]
201
+ second_row = condition_rows[1] if len(condition_rows) > 1 else None
202
  static_rows = [row for row in condition_rows if row["condition"].startswith("static_")]
203
  best_static_row = min(static_rows, key=lambda row: row["mean_final_val"])
204
+ first_stage_rows = [row for row in stage_rows if int(row["stage"]) == 0]
205
+ best_first_stage = min(first_stage_rows, key=lambda row: row["mean_val"])
206
 
207
  paired_win_lines = []
208
  for row in condition_rows:
 
225
  )
226
 
227
  lines = [
228
+ f"# {title}",
229
  "",
230
+ f"Date: {date}",
231
  "",
232
  f"This report combines {seed_count} random seeds "
233
  f"({', '.join(str(seed) for seed in seed_ids)}) from saved streaming runs.",
234
  "No additional training is performed by this script; it reads saved",
235
  "`metrics.jsonl` files.",
236
  "",
237
+ ]
238
+ if context:
239
+ lines.extend([context, ""])
240
+
241
+ lines.extend(
242
+ [
243
  "## Sources",
244
  "",
245
+ ]
246
+ )
247
  for path_item in metrics_paths:
248
  lines.append(f"- `{path_item}`")
249
 
 
307
  f"- `{best_row['condition']}` has the best {seed_count}-seed mean final "
308
  f"validation loss: {fmt(best_row['mean_final_val'])} +/- "
309
  f"{fmt(best_row['std_final_val'])}.",
310
+ *(
311
+ [
312
+ f"- The second-best final condition is `{second_row['condition']}` at "
313
+ f"{fmt(second_row['mean_final_val'])} +/- "
314
+ f"{fmt(second_row['std_final_val'])}."
315
+ ]
316
+ if second_row is not None
317
+ else []
318
+ ),
319
  f"- The best static baseline by mean final loss is "
320
  f"`{best_static_row['condition']}` at "
321
  f"{fmt(best_static_row['mean_final_val'])} +/- "
322
  f"{fmt(best_static_row['std_final_val'])}.",
 
 
 
323
  *paired_win_lines,
324
+ f"- The best first-stage condition is `{best_first_stage['condition']}` "
325
+ f"at prefix {best_first_stage['token_limit']:,} with mean validation "
326
+ f"loss {fmt(best_first_stage['mean_val'])}; compare this with the final "
327
+ "ranking before claiming a schedule is uniformly better.",
328
+ "- This is a saved-run streaming validation artifact. Treat it as strong",
329
+ " evidence only when the tested conditions, seeds, static baselines, and",
330
+ " stream protocol match the claim being made.",
331
  ]
332
  )
333
  path.write_text("\n".join(lines) + "\n", encoding="utf-8")
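
The main behavioral change in `write_report` is that the title, date, and an optional regime paragraph are now parameters. The sketch below isolates that header logic in a simplified form; `build_header` is a hypothetical helper, it omits the seed-count sentences, and the context string is a shortened stand-in for the real `--context` value.

```python
def build_header(title: str, date: str, context: str = "") -> list[str]:
    """Build the opening report lines; the context paragraph is optional."""
    lines = [f"# {title}", "", f"Date: {date}", ""]
    if context:
        # Only emit the regime paragraph when a non-empty context was given.
        lines.extend([context, ""])
    lines.extend(["## Sources", ""])
    return lines


with_context = build_header(
    "Previous/Local Regime Streaming Validation",
    "2026-05-30",
    "Regime: original/local saved streaming setup.",  # shortened for illustration
)
without_context = build_header(
    "TinyStories Multi-Seed Streaming Validation", "2026-05-30"
)
print("\n".join(with_context))
```

Because the empty-string default skips the paragraph entirely, existing invocations that do not pass `--context` keep the same section order as before.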
 
339
  parser.add_argument("--output-dir", type=Path, required=True)
340
  parser.add_argument("--report", type=Path, required=True)
341
  parser.add_argument("--conditions", nargs="+", default=DEFAULT_CONDITIONS)
342
+ parser.add_argument("--title", default="TinyStories Multi-Seed Streaming Validation")
343
+ parser.add_argument("--date", default="2026-05-30")
344
+ parser.add_argument("--context", default="")
345
  return parser
346
 
347
 
 
395
  "delta_vs_best_static",
396
  ],
397
  )
398
+ write_report(
399
+ args.report,
400
+ condition_rows,
401
+ stage_rows,
402
+ paired_rows,
403
+ args.metrics,
404
+ args.title,
405
+ args.date,
406
+ args.context,
407
+ )
408
  print(
409
  json.dumps(
410
  {