Mandeep Sidhu committed
Commit 1c065aa · Parent(s): 2f2776e
Add previous local streaming report
- docs/plan.md +18 -11
- docs/previous_regime_streaming_report.md +152 -0
- docs/streaming_multiseed_validation_report.md +7 -8
- runs/previous_local_streaming_report/l16_multiseed_confirm/condition_summary.csv +8 -0
- runs/previous_local_streaming_report/l16_multiseed_confirm/paired_final_deltas.csv +22 -0
- runs/previous_local_streaming_report/l16_multiseed_confirm/stage_summary.csv +36 -0
- scripts/summarize_streaming_multiseed.py +45 -12
docs/plan.md
CHANGED
|
@@ -284,7 +284,7 @@ Use this order for every regime.
|
|
| 284 |
| original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
|
| 285 |
| TinyStories static/coefficient regime | active | main coefficient evidence |
|
| 286 |
| TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
|
| 287 |
-
| original/local streaming regime |
|
| 288 |
| next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
|
| 289 |
|
| 290 |
## Current Formula Status
|
|
@@ -333,6 +333,7 @@ structure transfers, while coefficients may be regime-specific.
|
|
| 333 |
| TinyStories held-out prefix | supports pressure dependence on unique tokens |
|
| 334 |
| TinyStories held-out model | supports pressure dependence on model size |
|
| 335 |
| TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
|
|
|
|
| 336 |
| cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
|
| 337 |
|
| 338 |
Latest TinyStories 5-seed streaming final-loss table:
|
|
@@ -394,21 +395,25 @@ streaming multi-seed reports for each regime.
|
|
| 394 |
|
| 395 |
## Immediate Next Action
|
| 396 |
|
| 397 |
-
|
| 398 |
-
|
| 399 |
-
the
|
|
|
|
| 400 |
|
| 401 |
## Next Training After Current Gate
|
| 402 |
|
| 403 |
-
No MPS training should launch
|
| 404 |
-
|
| 405 |
-
|
| 406 |
-
conditions
|
|
|
|
| 407 |
|
| 408 |
```text
|
| 409 |
-
|
| 410 |
-
|
| 411 |
-
|
|
|
|
|
|
|
| 412 |
```
|
| 413 |
|
| 414 |
Evaluate with paired seed comparisons:
|
|
@@ -430,5 +435,7 @@ Latest streaming report:
|
|
| 430 |
|
| 431 |
```text
|
| 432 |
docs/streaming_multiseed_validation_report.md
|
|
|
|
| 433 |
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
|
|
|
|
| 434 |
```
|
|
|
|
| 284 |
| original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
|
| 285 |
| TinyStories static/coefficient regime | active | main coefficient evidence |
|
| 286 |
| TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
|
| 287 |
+
| original/local streaming regime | 3-seed saved-run report complete | previous/local decay schedules beat best static in 3/3 paired final-loss comparisons |
|
| 288 |
| next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
|
| 289 |
|
| 290 |
## Current Formula Status
|
|
|
|
| 333 |
| TinyStories held-out prefix | supports pressure dependence on unique tokens |
|
| 334 |
| TinyStories held-out model | supports pressure dependence on model size |
|
| 335 |
| TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
|
| 336 |
+
| previous/local streaming, 3 seeds | hold-30 decay has best mean final loss; top decay schedules beat best static in 3/3 paired comparisons |
|
| 337 |
| cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
|
| 338 |
|
| 339 |
Latest TinyStories 5-seed streaming final-loss table:
|
|
|
|
| 395 |
|
| 396 |
## Immediate Next Action
|
| 397 |
|
| 398 |
+
Reconcile the TinyStories five-seed report and previous/local three-seed
|
| 399 |
+
report into the paper outline. Decide whether the previous/local regime needs a
|
| 400 |
+
targeted seed-4/5 extension, or whether the next better use of MPS time is a
|
| 401 |
+
third held-out regime.
|
| 402 |
|
| 403 |
## Next Training After Current Gate
|
| 404 |
|
| 405 |
+
No MPS training should launch until the two completed streaming reports are
|
| 406 |
+
read together. If previous/local seed count is the limiting issue, the next run
|
| 407 |
+
should be narrowly scoped to only the missing seed-4/5 previous/local
|
| 408 |
+
conditions. If external validity is the limiting issue, use a third held-out
|
| 409 |
+
regime instead:
|
| 410 |
|
| 411 |
```text
|
| 412 |
+
completed: TinyStories 5-seed streaming report
|
| 413 |
+
completed: previous/local 3-seed saved-run streaming report
|
| 414 |
+
possible follow-up A: previous/local seed-4/5 extension
|
| 415 |
+
possible follow-up B: third held-out regime
|
| 416 |
+
avoid: broad new sweep before choosing A vs B
|
| 417 |
```
|
| 418 |
|
| 419 |
Evaluate with paired seed comparisons:
|
|
|
|
| 435 |
|
| 436 |
```text
|
| 437 |
docs/streaming_multiseed_validation_report.md
|
| 438 |
+
docs/previous_regime_streaming_report.md
|
| 439 |
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
|
| 440 |
+
runs/previous_local_streaming_report/l16_multiseed_confirm/
|
| 441 |
```
|
docs/previous_regime_streaming_report.md
ADDED
|
@@ -0,0 +1,152 @@
| 1 |
+
# Previous/Local Regime Streaming Validation
|
| 2 |
+
|
| 3 |
+
Date: 2026-05-30
|
| 4 |
+
|
| 5 |
+
This report combines 3 random seeds (1, 2, 3) from saved streaming runs.
|
| 6 |
+
No additional training is performed by this script; it reads saved
|
| 7 |
+
`metrics.jsonl` files.
|
| 8 |
+
|
| 9 |
+
Regime: original/local saved streaming setup with L16_H8_D384, 31,457,280 parameters, five prefixes from 250k to 4M tokens, and 1,000 optimizer steps per stage. This report uses the existing three-seed confirmation run only; earlier single-seed search/refinement runs are treated as exploratory support, not as the primary proof table.
|
| 10 |
+
|
| 11 |
+
## Sources
|
| 12 |
+
|
| 13 |
+
- `runs/stream_multiseed_confirm/locked_stream/20260526-203116/metrics.jsonl`
|
| 14 |
+
|
| 15 |
+
## Condition Ranking By Final Loss
|
| 16 |
+
|
| 17 |
+
| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
|
| 18 |
+
|---|---|---:|---:|---:|---:|---:|---:|---|
|
| 19 |
+
| `hold_30_then_decay` | `anchor_decay` | 3 | 4.8503 | 0.0017 | 4.4060 | 0.0118 | 0.3530 | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` |
|
| 20 |
+
| `mild_30_to_08` | `anchor_decay` | 3 | 4.8504 | 0.0018 | 4.4075 | 0.0078 | 0.3307 | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` |
|
| 21 |
+
| `fitted_l16_static_law` | `anchor_decay` | 3 | 4.9527 | 0.0052 | 4.4159 | 0.0042 | 0.3144 | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` |
|
| 22 |
+
| `static_dropout_0.14` | `static` | 3 | 4.9043 | 0.0119 | 4.4459 | 0.0128 | 0.3205 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` |
|
| 23 |
+
| `static_dropout_0.3` | `static` | 3 | 4.8764 | 0.0014 | 4.4693 | 0.0081 | 0.2327 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` |
|
| 24 |
+
| `static_dropout_0.02` | `static` | 3 | 5.1544 | 0.0091 | 4.5405 | 0.0061 | 0.4747 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` |
|
| 25 |
+
| `static_dropout_0` | `static` | 3 | 5.2422 | 0.0015 | 4.5905 | 0.0192 | 0.5464 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` |
|
| 26 |
+
|
| 27 |
+
## Paired Final-Loss Deltas
|
| 28 |
+
|
| 29 |
+
Negative `delta_vs_best_static` means the condition beat the best static
|
| 30 |
+
baseline for that seed.
|
| 31 |
+
|
| 32 |
+
| Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
|
| 33 |
+
|---:|---|---:|---|---:|---:|
|
| 34 |
+
| 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 |
|
| 35 |
+
| 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 |
|
| 36 |
+
| 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 |
|
| 37 |
+
| 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 |
|
| 38 |
+
| 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 |
|
| 39 |
+
| 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.0985 |
|
| 40 |
+
| 1 | `static_dropout_0` | 4.5703 | `static_dropout_0.14` | 4.4418 | +0.1286 |
|
| 41 |
+
| 2 | `hold_30_then_decay` | 4.4068 | `static_dropout_0.14` | 4.4603 | -0.0535 |
|
| 42 |
+
| 2 | `mild_30_to_08` | 4.4080 | `static_dropout_0.14` | 4.4603 | -0.0523 |
|
| 43 |
+
| 2 | `fitted_l16_static_law` | 4.4136 | `static_dropout_0.14` | 4.4603 | -0.0467 |
|
| 44 |
+
| 2 | `static_dropout_0.14` | 4.4603 | `static_dropout_0.14` | 4.4603 | +0.0000 |
|
| 45 |
+
| 2 | `static_dropout_0.3` | 4.4719 | `static_dropout_0.14` | 4.4603 | +0.0116 |
|
| 46 |
+
| 2 | `static_dropout_0.02` | 4.5466 | `static_dropout_0.14` | 4.4603 | +0.0863 |
|
| 47 |
+
| 2 | `static_dropout_0` | 4.6085 | `static_dropout_0.14` | 4.4603 | +0.1482 |
|
| 48 |
+
| 3 | `hold_30_then_decay` | 4.4174 | `static_dropout_0.14` | 4.4357 | -0.0183 |
|
| 49 |
+
| 3 | `mild_30_to_08` | 4.4151 | `static_dropout_0.14` | 4.4357 | -0.0206 |
|
| 50 |
+
| 3 | `fitted_l16_static_law` | 4.4134 | `static_dropout_0.14` | 4.4357 | -0.0223 |
|
| 51 |
+
| 3 | `static_dropout_0.14` | 4.4357 | `static_dropout_0.14` | 4.4357 | +0.0000 |
|
| 52 |
+
| 3 | `static_dropout_0.3` | 4.4758 | `static_dropout_0.14` | 4.4357 | +0.0401 |
|
| 53 |
+
| 3 | `static_dropout_0.02` | 4.5345 | `static_dropout_0.14` | 4.4357 | +0.0988 |
|
| 54 |
+
| 3 | `static_dropout_0` | 4.5926 | `static_dropout_0.14` | 4.4357 | +0.1569 |
|
| 55 |
+
|
| 56 |
+
## Stage Trajectory
|
| 57 |
+
|
| 58 |
+
| Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
|
| 59 |
+
|---:|---:|---|---:|---:|---:|---:|---:|---:|
|
| 60 |
+
| 0 | 250,000 | `mild_30_to_08` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
|
| 61 |
+
| 0 | 250,000 | `static_dropout_0.3` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
|
| 62 |
+
| 0 | 250,000 | `hold_30_then_decay` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
|
| 63 |
+
| 0 | 250,000 | `static_dropout_0.14` | 0.140 | 3 | 5.4707 | 0.0281 | 4.0325 | 1.4383 |
|
| 64 |
+
| 0 | 250,000 | `static_dropout_0.02` | 0.020 | 3 | 5.7452 | 0.0319 | 3.5394 | 2.2057 |
|
| 65 |
+
| 0 | 250,000 | `fitted_l16_static_law` | 0.600 | 3 | 5.7847 | 0.0108 | 5.1677 | 0.6170 |
|
| 66 |
+
| 0 | 250,000 | `static_dropout_0` | 0.000 | 3 | 5.8283 | 0.0158 | 3.4498 | 2.3785 |
|
| 67 |
+
| 1 | 500,000 | `mild_30_to_08` | 0.240 | 3 | 5.0573 | 0.0197 | 4.0209 | 1.0364 |
|
| 68 |
+
| 1 | 500,000 | `static_dropout_0.3` | 0.300 | 3 | 5.0643 | 0.0216 | 4.1251 | 0.9392 |
|
| 69 |
+
| 1 | 500,000 | `hold_30_then_decay` | 0.300 | 3 | 5.0643 | 0.0216 | 4.1251 | 0.9392 |
|
| 70 |
+
| 1 | 500,000 | `fitted_l16_static_law` | 0.400 | 3 | 5.1479 | 0.0127 | 4.4501 | 0.6978 |
|
| 71 |
+
| 1 | 500,000 | `static_dropout_0.14` | 0.140 | 3 | 5.1493 | 0.0097 | 3.7036 | 1.4457 |
|
| 72 |
+
| 1 | 500,000 | `static_dropout_0.02` | 0.020 | 3 | 5.5605 | 0.0148 | 3.1103 | 2.4502 |
|
| 73 |
+
| 1 | 500,000 | `static_dropout_0` | 0.000 | 3 | 5.6920 | 0.0452 | 2.9511 | 2.7409 |
|
| 74 |
+
| 2 | 1,000,000 | `hold_30_then_decay` | 0.200 | 3 | 4.7695 | 0.0164 | 4.0408 | 0.7287 |
|
| 75 |
+
| 2 | 1,000,000 | `mild_30_to_08` | 0.180 | 3 | 4.7717 | 0.0162 | 3.9925 | 0.7793 |
|
| 76 |
+
| 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.7927 | 0.0173 | 4.1535 | 0.6392 |
|
| 77 |
+
| 2 | 1,000,000 | `fitted_l16_static_law` | 0.300 | 3 | 4.8273 | 0.0096 | 4.2699 | 0.5573 |
|
| 78 |
+
| 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.8466 | 0.0278 | 3.8815 | 0.9651 |
|
| 79 |
+
| 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 3 | 5.1459 | 0.0294 | 3.4641 | 1.6818 |
|
| 80 |
+
| 2 | 1,000,000 | `static_dropout_0` | 0.000 | 3 | 5.2484 | 0.0091 | 3.3281 | 1.9203 |
|
| 81 |
+
| 3 | 2,000,000 | `hold_30_then_decay` | 0.100 | 3 | 4.5655 | 0.0060 | 4.0390 | 0.5265 |
|
| 82 |
+
| 3 | 2,000,000 | `mild_30_to_08` | 0.120 | 3 | 4.5691 | 0.0072 | 4.0380 | 0.5312 |
|
| 83 |
+
| 3 | 2,000,000 | `fitted_l16_static_law` | 0.140 | 3 | 4.5879 | 0.0086 | 4.1457 | 0.4422 |
|
| 84 |
+
| 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.6088 | 0.0069 | 4.0454 | 0.5634 |
|
| 85 |
+
| 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.6094 | 0.0059 | 4.2125 | 0.3968 |
|
| 86 |
+
| 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 3 | 4.7799 | 0.0219 | 3.8344 | 0.9455 |
|
| 87 |
+
| 3 | 2,000,000 | `static_dropout_0` | 0.000 | 3 | 4.8517 | 0.0153 | 3.7761 | 1.0757 |
|
| 88 |
+
| 4 | 4,000,000 | `hold_30_then_decay` | 0.020 | 3 | 4.4060 | 0.0118 | 4.0530 | 0.3530 |
|
| 89 |
+
| 4 | 4,000,000 | `mild_30_to_08` | 0.080 | 3 | 4.4075 | 0.0078 | 4.0768 | 0.3307 |
|
| 90 |
+
| 4 | 4,000,000 | `fitted_l16_static_law` | 0.020 | 3 | 4.4159 | 0.0042 | 4.1015 | 0.3144 |
|
| 91 |
+
| 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.4459 | 0.0128 | 4.1254 | 0.3205 |
|
| 92 |
+
| 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.4693 | 0.0081 | 4.2365 | 0.2327 |
|
| 93 |
+
| 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 3 | 4.5405 | 0.0061 | 4.0657 | 0.4747 |
|
| 94 |
+
| 4 | 4,000,000 | `static_dropout_0` | 0.000 | 3 | 4.5905 | 0.0192 | 4.0441 | 0.5464 |
|
| 95 |
+
|
| 96 |
+
## Interpretation
|
| 97 |
+
|
| 98 |
+
- `hold_30_then_decay` has the best 3-seed mean final validation loss: 4.4060 +/- 0.0118.
|
| 99 |
+
- The second-best final condition is `mild_30_to_08` at 4.4075 +/- 0.0078.
|
| 100 |
+
- The best static baseline by mean final loss is `static_dropout_0.14` at 4.4459 +/- 0.0128.
|
| 101 |
+
- `hold_30_then_decay` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0183.
|
| 102 |
+
- `mild_30_to_08` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0206.
|
| 103 |
+
- `fitted_l16_static_law` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0211.
|
| 104 |
+
- The best first-stage condition is `mild_30_to_08` at prefix 250,000 with mean validation loss 5.4463; compare this with the final ranking before claiming a schedule is uniformly better.
|
| 105 |
+
- This is a saved-run streaming validation artifact. Treat it as strong
|
| 106 |
+
evidence only when the tested conditions, seeds, static baselines, and
|
| 107 |
+
stream protocol match the claim being made.
|
| 108 |
+
|
| 109 |
+
## Supporting Exploratory Runs
|
| 110 |
+
|
| 111 |
+
The primary proof table above is the three-seed confirmation run:
|
| 112 |
+
|
| 113 |
+
```text
|
| 114 |
+
runs/stream_multiseed_confirm/locked_stream/20260526-203116/
|
| 115 |
+
```
|
| 116 |
+
|
| 117 |
+
Earlier single-seed runs are useful for interpreting how the schedule was
|
| 118 |
+
selected, but they are not counted as multi-seed proof:
|
| 119 |
+
|
| 120 |
+
| Supporting run | Role | Main reading |
|
| 121 |
+
|---|---|---|
|
| 122 |
+
| `runs/stream_schedule_search/locked_stream/20260526-171537/` | schedule search | decay schedules starting near `0.30` and ending near `0.02` to `0.08` beat static `0.14` and `0.30` at the final 4M prefix |
|
| 123 |
+
| `runs/stream_schedule_refinement/locked_stream/20260526-184506/` | endpoint and curvature refinement | several `hold_30` variants ended tightly around `4.394`, while `hold_24_then_decay` was weaker at `4.4214`, suggesting the initial dropout should not be reduced too aggressively in this regime |
|
| 124 |
+
| `runs/formula_l16_exact_multiseed/locked_stream/20260527-123806/` | coefficient-derived schedule check | `pressure_formula_l16_floor02` reached `4.4059 +/- 0.0042` over three seeds versus static `0.14` at `4.4459 +/- 0.0128` |
|
| 125 |
+
|
| 126 |
+
## Research Reading
|
| 127 |
+
|
| 128 |
+
This previous/local regime supports the same qualitative claim as the
|
| 129 |
+
TinyStories five-seed validation: a static dropout that is reasonable at one
|
| 130 |
+
stream scale is not necessarily optimal as the data prefix grows. In this
|
| 131 |
+
regime, the useful path keeps dropout high early (`0.30`) and then lowers it
|
| 132 |
+
as unique tokens and sampled tokens increase.
|
| 133 |
+
|
| 134 |
+
The strongest previous/local evidence is:
|
| 135 |
+
|
| 136 |
+
| Claim | Evidence |
|
| 137 |
+
|---|---|
|
| 138 |
+
| decay beats best static final loss | `hold_30_then_decay` beats the per-seed best static baseline in `3/3` seeds |
|
| 139 |
+
| endpoint is not uniquely fixed | `mild_30_to_08` is nearly tied with `hold_30_then_decay` |
|
| 140 |
+
| too-low early dropout is harmful | static `0.02` and `0.00` are much worse throughout the stream |
|
| 141 |
+
| too-high static dropout underuses later data | static `0.30` wins no final paired comparison despite being strong early |
|
| 142 |
+
| coefficient-derived schedules are viable | `fitted_l16_static_law` and `pressure_formula_l16_floor02` both beat static `0.14` in the saved three-seed comparisons |
|
| 143 |
+
|
| 144 |
+
Limitations:
|
| 145 |
+
|
| 146 |
+
1. This report is `n=3`, not `n=5`.
|
| 147 |
+
2. The schedules were refined inside this local regime, so this is not a
|
| 148 |
+
clean held-out-regime proof of universal coefficients.
|
| 149 |
+
3. The report still supports the cross-regime mechanism because the direction
|
| 150 |
+
of the effect matches TinyStories: high enough initial regularization
|
| 151 |
+
prevents early overfit, and lowering dropout later improves final validation
|
| 152 |
+
loss versus holding one static value fixed.
|
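Review note (not part of this commit): the `Dropout path` column in the report above can be read as a per-stage lookup keyed by prefix size. The sketch below assumes the five prefixes double from 250k to 4M tokens, as listed in the stage trajectory table. `PREFIX_TOKENS`, `parse_dropout_path`, and `dropout_for_prefix` are hypothetical names chosen for illustration, not functions from the repository.

```python
# Hypothetical helpers; the prefix sizes and the schedule string are copied
# from the report tables, everything else is an illustrative assumption.
PREFIX_TOKENS = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def parse_dropout_path(path):
    """Turn '0.30 -> 0.30 -> 0.20' into [0.30, 0.30, 0.20]."""
    return [float(part) for part in path.split("->")]


def dropout_for_prefix(path, prefix_tokens):
    """Look up the dropout used at the stage with the given prefix size."""
    values = parse_dropout_path(path)
    if len(values) != len(PREFIX_TOKENS):
        raise ValueError("dropout path must have one value per stage")
    return dict(zip(PREFIX_TOKENS, values))[prefix_tokens]


hold_30 = "0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02"
print(dropout_for_prefix(hold_30, 1_000_000))  # 0.2
```

Keying the lookup on prefix size rather than stage index keeps the schedule readable next to the `Prefix tokens` column, at the cost of hard-coding this regime's prefixes.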
docs/streaming_multiseed_validation_report.md
CHANGED
|
@@ -6,6 +6,8 @@ This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
|
|
| 6 |
No additional training is performed by this script; it reads saved
|
| 7 |
`metrics.jsonl` files.
|
| 8 |
|
|
|
|
|
|
|
| 9 |
## Sources
|
| 10 |
|
| 11 |
- `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
|
|
@@ -93,15 +95,12 @@ baseline for that seed.
|
|
| 93 |
## Interpretation
|
| 94 |
|
| 95 |
- `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213.
|
|
|
|
| 96 |
- The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.
|
| 97 |
-
- `smooth_low` is very close to `interaction`, suggesting the exact anchor
|
| 98 |
-
values may not be uniquely required as long as the schedule follows the
|
| 99 |
-
same pressure range.
|
| 100 |
- `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
|
| 101 |
- `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
|
| 102 |
- `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
|
| 103 |
-
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
from the same stream protocol.
|
|
|
|
| 6 |
No additional training is performed by this script; it reads saved
|
| 7 |
`metrics.jsonl` files.
|
| 8 |
|
| 9 |
+
Regime: TinyStories BPE streaming validation with L12_H8_D320, 17,367,040 parameters, four prefixes from 500k to 4M tokens, and 2,000 optimizer steps per stage.
|
| 10 |
+
|
| 11 |
## Sources
|
| 12 |
|
| 13 |
- `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
|
|
|
|
| 95 |
## Interpretation
|
| 96 |
|
| 97 |
- `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213.
|
| 98 |
+
- The second-best final condition is `smooth_low` at 2.5321 +/- 0.0203.
|
| 99 |
- The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.
|
|
|
|
|
|
|
|
|
|
| 100 |
- `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
|
| 101 |
- `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
|
| 102 |
- `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
|
| 103 |
+
- The best first-stage condition is `static_dropout_0.12` at prefix 500,000 with mean validation loss 3.2226; compare this with the final ranking before claiming a schedule is uniformly better.
|
| 104 |
+
- This is a saved-run streaming validation artifact. Treat it as strong
|
| 105 |
+
evidence only when the tested conditions, seeds, static baselines, and
|
| 106 |
+
stream protocol match the claim being made.
|
|
|
runs/previous_local_streaming_report/l16_multiseed_confirm/condition_summary.csv
ADDED
|
@@ -0,0 +1,8 @@
| 1 |
+
condition,kind,n,mean_trajectory_val,std_trajectory_val,mean_final_val,std_final_val,mean_final_gap,std_final_gap,dropout_path
|
| 2 |
+
hold_30_then_decay,anchor_decay,3,4.85031243065993,0.001654499467210519,4.406020574271679,0.011754833839775457,0.3529989644885063,0.015994346590220903,0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02
|
| 3 |
+
mild_30_to_08,anchor_decay,3,4.8503826280434925,0.0017992301945451007,4.407524392008781,0.007800724825508926,0.33071232338746387,0.012432927596429534,0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08
|
| 4 |
+
fitted_l16_static_law,anchor_decay,3,4.952731303373973,0.0052436768034393386,4.415892139077187,0.004158092833266487,0.31436355660359067,0.010011530967558104,0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02
|
| 5 |
+
static_dropout_0.14,static,3,4.904260951777299,0.011925494764473046,4.445927885671456,0.01281389472329066,0.32051239907741547,0.005114602595178522,0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14
|
| 6 |
+
static_dropout_0.3,static,3,4.876375656326612,0.0014325425566632558,4.469294945398967,0.008116710580691494,0.23274652659893036,0.009938004189833373,0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30
|
| 7 |
+
static_dropout_0.02,static,3,5.154374482234319,0.009061613413273495,4.540457583963871,0.006064728711460762,0.47473999857902527,0.005772200910659103,0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02
|
| 8 |
+
static_dropout_0,static,3,5.242185181876024,0.00150415496846462,4.59050024797519,0.01917389367658262,0.5464497481783231,0.027464276868201288,0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00
|
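Review note (not part of this commit): the `mean_final_val` and `std_final_val` columns for `hold_30_then_decay` are consistent with a plain mean and a sample standard deviation (n - 1 denominator) over the three per-seed final losses. This is a hand check under that assumption; the aggregation code in `scripts/summarize_streaming_multiseed.py` is not shown in this diff.

```python
import statistics

# Per-seed final validation losses for `hold_30_then_decay` (seeds 1, 2, 3),
# copied from paired_final_deltas.csv in this commit.
final_vals = [4.393905431032181, 4.406777806580067, 4.417378485202789]

mean_final = statistics.fmean(final_vals)
# statistics.stdev uses the n - 1 (sample) denominator; that the summary
# script does the same is an assumption based on the matching numbers.
std_final = statistics.stdev(final_vals)

print(f"{mean_final:.4f} +/- {std_final:.4f}")  # 4.4060 +/- 0.0118
```

With only three seeds the choice of denominator matters: a population standard deviation would come out smaller by a factor of sqrt(2/3).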
runs/previous_local_streaming_report/l16_multiseed_confirm/paired_final_deltas.csv
ADDED
|
@@ -0,0 +1,22 @@
| 1 |
+
seed,condition,final_val,best_static_condition,best_static_final_val,delta_vs_best_static
|
| 2 |
+
1,hold_30_then_decay,4.393905431032181,static_dropout_0.14,4.441768206655979,-0.04786277562379837
|
| 3 |
+
1,mild_30_to_08,4.399484351277351,static_dropout_0.14,4.441768206655979,-0.04228385537862778
|
| 4 |
+
1,fitted_l16_static_law,4.420692302286625,static_dropout_0.14,4.441768206655979,-0.021075904369354248
|
| 5 |
+
1,static_dropout_0.14,4.441768206655979,static_dropout_0.14,4.441768206655979,0.0
|
| 6 |
+
1,static_dropout_0.3,4.460189968347549,static_dropout_0.14,4.441768206655979,0.018421761691570282
|
| 7 |
+
1,static_dropout_0.02,4.540249735116959,static_dropout_0.14,4.441768206655979,0.09848152846097946
|
| 8 |
+
1,static_dropout_0,4.5703444480896,static_dropout_0.14,4.441768206655979,0.12857624143362045
|
| 9 |
+
2,hold_30_then_decay,4.406777806580067,static_dropout_0.14,4.460304826498032,-0.053527019917964935
|
| 10 |
+
2,mild_30_to_08,4.408027365803719,static_dropout_0.14,4.460304826498032,-0.05227746069431305
|
| 11 |
+
2,fitted_l16_static_law,4.41358458250761,static_dropout_0.14,4.460304826498032,-0.046720243990421295
|
| 12 |
+
2,static_dropout_0.14,4.460304826498032,static_dropout_0.14,4.460304826498032,0.0
|
| 13 |
+
2,static_dropout_0.3,4.47192245721817,static_dropout_0.14,4.460304826498032,0.01161763072013855
|
| 14 |
+
2,static_dropout_0.02,4.546623565256596,static_dropout_0.14,4.460304826498032,0.086318738758564
|
| 15 |
+
2,static_dropout_0,4.608511999249458,static_dropout_0.14,4.460304826498032,0.1482071727514267
|
| 16 |
+
3,hold_30_then_decay,4.417378485202789,static_dropout_0.14,4.435710623860359,-0.018332138657569885
|
| 17 |
+
3,mild_30_to_08,4.415061458945274,static_dropout_0.14,4.435710623860359,-0.02064916491508484
|
| 18 |
+
3,fitted_l16_static_law,4.4133995324373245,static_dropout_0.14,4.435710623860359,-0.022311091423034668
|
| 19 |
+
3,static_dropout_0.14,4.435710623860359,static_dropout_0.14,4.435710623860359,0.0
|
| 20 |
+
3,static_dropout_0.3,4.47577241063118,static_dropout_0.14,4.435710623860359,0.04006178677082062
|
| 21 |
+
3,static_dropout_0.02,4.534499451518059,static_dropout_0.14,4.435710623860359,0.09878882765769958
|
| 22 |
+
3,static_dropout_0,4.5926442965865135,static_dropout_0.14,4.435710623860359,0.15693367272615433
|
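Review note (not part of this commit): a minimal sketch of how the `delta_vs_best_static` column and the "3/3 seeds" win counts can be reproduced from per-seed final losses. The `startswith("static_")` filter mirrors the one visible in `write_report`. The function name `paired_deltas` and the trimmed, rounded `rows` list are assumptions made for illustration.

```python
from collections import defaultdict

# (seed, condition, final_val), rounded to 4 decimals from the paired table.
rows = [
    (1, "hold_30_then_decay", 4.3939),
    (1, "static_dropout_0.14", 4.4418),
    (1, "static_dropout_0.3", 4.4602),
    (2, "hold_30_then_decay", 4.4068),
    (2, "static_dropout_0.14", 4.4603),
    (2, "static_dropout_0.3", 4.4719),
    (3, "hold_30_then_decay", 4.4174),
    (3, "static_dropout_0.14", 4.4357),
    (3, "static_dropout_0.3", 4.4758),
]


def paired_deltas(rows, condition):
    """Map seed -> final loss of `condition` minus that seed's best static loss."""
    by_seed = defaultdict(dict)
    for seed, name, final_val in rows:
        by_seed[seed][name] = final_val
    deltas = {}
    for seed, vals in by_seed.items():
        best_static = min(v for name, v in vals.items() if name.startswith("static_"))
        deltas[seed] = vals[condition] - best_static
    return deltas


deltas = paired_deltas(rows, "hold_30_then_decay")
wins = sum(delta < 0 for delta in deltas.values())
worst = max(deltas.values())  # least favorable (largest) delta across seeds
print(f"{wins}/{len(deltas)} wins, worst delta {worst:+.4f}")  # 3/3 wins, worst delta -0.0183
```

Choosing the best static baseline separately for each seed is the stricter comparison: the schedule has to beat whichever static value happened to do best on that seed.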
runs/previous_local_streaming_report/l16_multiseed_confirm/stage_summary.csv
ADDED
|
@@ -0,0 +1,36 @@
| 1 |
+
condition,stage,token_limit,dropout,n,mean_val,std_val,mean_train,std_train,mean_gap,std_gap
|
| 2 |
+
hold_30_then_decay,0,250000,0.3,3,5.446260958909988,0.019125165089046693,4.445705617467563,0.034358782454698235,1.000555341442426,0.024155559429602118
|
| 3 |
+
hold_30_then_decay,1,500000,0.3,3,5.064259976148605,0.021637732528368203,4.125050333638986,0.04522623234945776,0.9392096425096194,0.05214362791131757
|
| 4 |
+
hold_30_then_decay,2,1000000,0.20000000000000004,3,4.769520824154218,0.01635006458114404,4.040781140327454,0.030883572207078522,0.7287396838267645,0.01586184848529449
|
| 5 |
+
hold_30_then_decay,3,2000000,0.10000000000000002,3,4.565499819815159,0.005976341537445996,4.039049598077933,0.013075486007728024,0.5264502217372259,0.018983448452741372
|
| 6 |
+
hold_30_then_decay,4,4000000,0.02,3,4.406020574271679,0.011754833839775457,4.053021609783173,0.005869636087162683,0.3529989644885063,0.015994346590220903
|
| 7 |
+
mild_30_to_08,0,250000,0.3,3,5.4462608098983765,0.019125068401667725,4.445705557862918,0.034358764649810906,1.0005552520354588,0.02415560068535275
|
| 8 |
+
mild_30_to_08,1,500000,0.24,3,5.0572623411814375,0.019719838039851723,4.020891976853211,0.04693412607863078,1.0363703643282254,0.05164875161403775
|
| 9 |
+
mild_30_to_08,2,1000000,0.18000000000000002,3,4.771725835899512,0.016156229486385284,3.992456587652365,0.03048971998308727,0.7792692482471466,0.015220576762990419
|
| 10 |
+
mild_30_to_08,3,2000000,0.12,3,4.569139761229356,0.0072146469245784745,4.03795708467563,0.013373284097680214,0.5311826765537262,0.020578724869085317
|
| 11 |
+
mild_30_to_08,4,4000000,0.08,3,4.407524392008781,0.007800724825508926,4.076812068621318,0.005051302799236397,0.33071232338746387,0.012432927596429534
|
| 12 |
+
fitted_l16_static_law,0,250000,0.6,3,5.784741741915544,0.010830950668144494,5.167703871925672,0.03352670524735539,0.617037869989872,0.02282459187460017
|
| 13 |
+
fitted_l16_static_law,1,500000,0.4000000000000001,3,5.147908595701058,0.012709997007779845,4.450103844205539,0.031679739247628576,0.6978047514955202,0.027979523555407795
|
| 14 |
+
fitted_l16_static_law,2,1000000,0.3,3,4.827251675228278,0.009557452344926845,4.269914664328098,0.02965806771390336,0.5573370109001795,0.020671084306813296
|
| 15 |
+
fitted_l16_static_law,3,2000000,0.14,3,4.587862364947796,0.00860798597712663,4.145651715497176,0.020112725612223863,0.44221064945062,0.02244568312577686
|
| 16 |
+
fitted_l16_static_law,4,4000000,0.02,3,4.415892139077187,0.004158092833266487,4.101528582473596,0.012667297837999282,0.31436355660359067,0.010011530967558104
|
| 17 |
+
static_dropout_0.14,0,250000,0.14,3,5.470722645521164,0.02808420889994966,4.032453775405884,0.02467190243959671,1.4382688701152802,0.03724958360607134
|
| 18 |
+
static_dropout_0.14,1,500000,0.14,3,5.149252874155839,0.009721292660607455,3.7035952607790628,0.03283222913980165,1.4456576133767765,0.03202443758399359
|
| 19 |
+
static_dropout_0.14,2,1000000,0.14,3,4.846569702029228,0.027835384007538343,3.8814875607689223,0.03663314954997834,0.965082141260306,0.022778798858926228
|
| 20 |
+
static_dropout_0.14,3,2000000,0.14,3,4.608831651508808,0.006911091820519436,4.045392190416654,0.02868106552021608,0.5634394610921541,0.02219500109725294
|
| 21 |
+
static_dropout_0.14,4,4000000,0.14,3,4.445927885671456,0.01281389472329066,4.125415486594041,0.008814570521429348,0.32051239907741547,0.005114602595178522
|
| 22 |
+
static_dropout_0.3,0,250000,0.3,3,5.446260929107666,0.019125310935332103,4.4457056671381,0.034358902052248425,1.0005552619695663,0.024155461141722852
|
| 23 |
+
static_dropout_0.3,1,500000,0.3,3,5.064259819686413,0.021637796833646396,4.125050216913223,0.04522603378014792,0.9392096027731895,0.05214326197407805
+static_dropout_0.3,2,1000000,0.3,3,4.7926972309748335,0.017311084846289997,4.1534921278556185,0.030420766811043106,0.6392051031192144,0.014300245473945872
+static_dropout_0.3,3,2000000,0.3,3,4.60936535646518,0.005911969954977541,4.212546601891518,0.011776857890288749,0.39681875457366306,0.017503466982826142
+static_dropout_0.3,4,4000000,0.3,3,4.469294945398967,0.008116710580691494,4.236548418800036,0.003898783020435021,0.23274652659893036,0.009938004189833373
+static_dropout_0.02,0,250000,0.02,3,5.745150377353032,0.03186573719752147,3.5394225865602493,0.01043658554282977,2.2057277907927832,0.04182545755928344
+static_dropout_0.02,1,500000,0.02,3,5.56048562626044,0.014806452057977726,3.110301854709784,0.02923601075152012,2.4501837715506554,0.04238209743234919
+static_dropout_0.02,2,1000000,0.02,3,5.145867633322875,0.02939265744253425,3.4641073818008103,0.04880776844576436,1.6817602515220642,0.04157387316099468
+static_dropout_0.02,3,2000000,0.02,3,4.779911190271378,0.021934652780817913,3.834411238630613,0.04218285619197449,0.9454999516407648,0.027020864935500585
+static_dropout_0.02,4,4000000,0.02,3,4.540457583963871,0.006064728711460762,4.065717585384846,0.007518099672343167,0.47473999857902527,0.005772200910659103
+static_dropout_0,0,250000,0.0,3,5.828276579578717,0.01576268602511141,3.4497722735007605,0.023904614106222872,2.378504306077957,0.03224386346161905
+static_dropout_0,1,500000,0.0,3,5.6920087188482285,0.04522491867870689,2.951105666657289,0.05869100244970825,2.7409030521909394,0.08369333869665757
+static_dropout_0,2,1000000,0.0,3,5.24841162810723,0.009065582711747499,3.3280878563721976,0.041678604603156234,1.9203237717350323,0.03261336000414208
+static_dropout_0,3,2000000,0.0,3,4.851728734870751,0.015303064070863357,3.7760739227135978,0.03677045479243139,1.075654812157154,0.029026197035937777
+static_dropout_0,4,4000000,0.0,3,4.59050024797519,0.01917389367658262,4.044050499796867,0.023224885505203383,0.5464497481783231,0.027464276868201288
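The rows above can be sanity-checked without rerunning any training. The sketch below is only an illustration: the CSV header row is not visible in this part of the diff, so the column names in `ASSUMED_COLUMNS` are assumptions (only `condition`, `stage`, `token_limit`, and `mean_val` appear as keys in `scripts/summarize_streaming_multiseed.py`). It parses three final-stage rows copied from above, checks that the tenth column equals the sixth minus the eighth, and picks the row with the lowest sixth column.

```python
import csv
import io

# ASSUMPTION: the real header row is not shown here; these names are guesses
# based on the keys used in the summary script and on the column arithmetic.
ASSUMED_COLUMNS = [
    "condition", "stage", "token_limit", "dropout", "n_seeds",
    "mean_val", "std_val", "mean_train", "std_train", "mean_gap", "std_gap",
]

# Three final-stage (stage 4) rows copied verbatim from the diff above.
SAMPLE = """\
static_dropout_0.3,4,4000000,0.3,3,4.469294945398967,0.008116710580691494,4.236548418800036,0.003898783020435021,0.23274652659893036,0.009938004189833373
static_dropout_0.02,4,4000000,0.02,3,4.540457583963871,0.006064728711460762,4.065717585384846,0.007518099672343167,0.47473999857902527,0.005772200910659103
static_dropout_0,4,4000000,0.0,3,4.59050024797519,0.01917389367658262,4.044050499796867,0.023224885505203383,0.5464497481783231,0.027464276868201288
"""


def load_stage_rows(text: str) -> list[dict]:
    """Parse header-less CSV text into dicts with numeric fields converted."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=ASSUMED_COLUMNS)
    rows = []
    for raw in reader:
        row = dict(raw)
        for key in ("stage", "token_limit", "n_seeds"):
            row[key] = int(row[key])
        for key in ASSUMED_COLUMNS[5:] + ["dropout"]:
            row[key] = float(row[key])
        rows.append(row)
    return rows


rows = load_stage_rows(SAMPLE)

# Largest disagreement between the stored gap and (val - train) across rows.
max_gap_error = max(
    abs((row["mean_val"] - row["mean_train"]) - row["mean_gap"]) for row in rows
)

# Lowest mean validation loss among the sampled static conditions.
best = min(rows, key=lambda row: row["mean_val"])
print(best["condition"], best["mean_val"], max_gap_error)
```

If the assumed column order is right, `max_gap_error` stays at floating-point noise level, which is a cheap check that a summary file was not shifted or truncated during export.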
scripts/summarize_streaming_multiseed.py
CHANGED
@@ -191,12 +191,18 @@ def write_report(
     stage_rows: list[dict],
     paired_rows: list[dict],
     metrics_paths: list[Path],
+    title: str,
+    date: str,
+    context: str,
 ) -> None:
     seed_ids = sorted({int(row["seed"]) for row in paired_rows})
     seed_count = len(seed_ids)
     best_row = condition_rows[0]
+    second_row = condition_rows[1] if len(condition_rows) > 1 else None
     static_rows = [row for row in condition_rows if row["condition"].startswith("static_")]
     best_static_row = min(static_rows, key=lambda row: row["mean_final_val"])
+    first_stage_rows = [row for row in stage_rows if int(row["stage"]) == 0]
+    best_first_stage = min(first_stage_rows, key=lambda row: row["mean_val"])
 
     paired_win_lines = []
     for row in condition_rows:
@@ -219,18 +225,25 @@ def write_report(
         )
 
     lines = [
-        "#
+        f"# {title}",
         "",
-        "Date:
+        f"Date: {date}",
         "",
         f"This report combines {seed_count} random seeds "
         f"({', '.join(str(seed) for seed in seed_ids)}) from saved streaming runs.",
         "No additional training is performed by this script; it reads saved",
         "`metrics.jsonl` files.",
         "",
+    ]
+    if context:
+        lines.extend([context, ""])
+
+    lines.extend(
+        [
         "## Sources",
         "",
-
+        ]
+    )
     for path_item in metrics_paths:
         lines.append(f"- `{path_item}`")
 
@@ -294,19 +307,27 @@ def write_report(
             f"- `{best_row['condition']}` has the best {seed_count}-seed mean final "
             f"validation loss: {fmt(best_row['mean_final_val'])} +/- "
             f"{fmt(best_row['std_final_val'])}.",
+            *(
+                [
+                    f"- The second-best final condition is `{second_row['condition']}` at "
+                    f"{fmt(second_row['mean_final_val'])} +/- "
+                    f"{fmt(second_row['std_final_val'])}."
+                ]
+                if second_row is not None
+                else []
+            ),
             f"- The best static baseline by mean final loss is "
             f"`{best_static_row['condition']}` at "
             f"{fmt(best_static_row['mean_final_val'])} +/- "
             f"{fmt(best_static_row['std_final_val'])}.",
-            "- `smooth_low` is very close to `interaction`, suggesting the exact anchor",
-            " values may not be uniquely required as long as the schedule follows the",
-            " same pressure range.",
             *paired_win_lines,
-            "-
-            "
-            "
-            "
-            "
+            f"- The best first-stage condition is `{best_first_stage['condition']}` "
+            f"at prefix {best_first_stage['token_limit']:,} with mean validation "
+            f"loss {fmt(best_first_stage['mean_val'])}; compare this with the final "
+            "ranking before claiming a schedule is uniformly better.",
+            "- This is a saved-run streaming validation artifact. Treat it as strong",
+            " evidence only when the tested conditions, seeds, static baselines, and",
+            " stream protocol match the claim being made.",
         ]
     )
     path.write_text("\n".join(lines) + "\n", encoding="utf-8")
@@ -318,6 +339,9 @@ def build_parser() -> argparse.ArgumentParser:
     parser.add_argument("--output-dir", type=Path, required=True)
     parser.add_argument("--report", type=Path, required=True)
     parser.add_argument("--conditions", nargs="+", default=DEFAULT_CONDITIONS)
+    parser.add_argument("--title", default="TinyStories Multi-Seed Streaming Validation")
+    parser.add_argument("--date", default="2026-05-30")
+    parser.add_argument("--context", default="")
     return parser
 
 
@@ -371,7 +395,16 @@ def main() -> None:
             "delta_vs_best_static",
         ],
     )
-    write_report(
+    write_report(
+        args.report,
+        condition_rows,
+        stage_rows,
+        paired_rows,
+        args.metrics,
+        args.title,
+        args.date,
+        args.context,
+    )
     print(
         json.dumps(
             {
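The change above makes the report header configurable and emits some lines only when the data supports them. The following is a minimal standalone sketch of those two patterns, not the script itself: `header_lines` and `ranking_lines` are hypothetical helper names, the `"Example Report"` title and context sentence are made up for illustration, and only the three flags and their defaults are taken from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Only the three metadata flags from the diff; the real parser has more.
    parser = argparse.ArgumentParser(description="Sketch of the report-metadata flags.")
    parser.add_argument("--title", default="TinyStories Multi-Seed Streaming Validation")
    parser.add_argument("--date", default="2026-05-30")
    parser.add_argument("--context", default="")
    return parser


def header_lines(title: str, date: str, context: str) -> list[str]:
    """Build the report header; the context paragraph is skipped when empty."""
    lines = [f"# {title}", "", f"Date: {date}", ""]
    if context:
        lines.extend([context, ""])
    lines.extend(["## Sources", ""])
    return lines


def ranking_lines(condition_rows: list[dict]) -> list[str]:
    """Emit a second-best bullet only when a second condition exists."""
    best = condition_rows[0]
    second = condition_rows[1] if len(condition_rows) > 1 else None
    return [
        f"- `{best['condition']}` has the best mean final validation loss.",
        # Unpacking an empty list adds nothing, so the bullet is optional.
        *(
            [f"- The second-best final condition is `{second['condition']}`."]
            if second is not None
            else []
        ),
    ]


# Hypothetical invocation: title and context values are illustrative only.
args = build_parser().parse_args(
    ["--title", "Example Report", "--context", "Hypothetical context line."]
)
with_context = header_lines(args.title, args.date, args.context)
without_context = header_lines("Example Report", args.date, "")
one_row = ranking_lines([{"condition": "interaction"}])
two_rows = ranking_lines([{"condition": "interaction"}, {"condition": "smooth_low"}])
print("\n".join(with_context + two_rows))
```

Keeping the flag defaults equal to the previous hard-coded title and date means existing invocations produce the same header, while a report for another regime only needs to pass `--title`, `--date`, and optionally `--context`.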