Mandeep Sidhu committed on
Commit
cf52b0e
·
1 Parent(s): dcae82e

Add WikiText-103 five-seed streaming validation

docs/plan.md CHANGED
@@ -285,7 +285,7 @@ Use this order for every regime.
285
  | TinyStories static/coefficient regime | active | main coefficient evidence |
286
  | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
287
  | OpenWebText10K streaming regime | 5-seed clean validation complete | OpenWebText10K interaction decay beats best static in 5/5 paired final-loss comparisons |
288
- | WikiText-103 streaming regime | pending | start only after TinyStories and OpenWebText10K streaming reports are reconciled |
289
 
290
  ## Current Formula Status
291
 
@@ -334,6 +334,7 @@ structure transfers, while coefficients may be regime-specific.
334
  | TinyStories held-out model | supports pressure dependence on model size |
335
  | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
336
  | OpenWebText10K streaming, 5 seeds | interaction decay has best mean final loss; top decay schedules beat best static in 5/5 paired comparisons |
 
337
  | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
338
 
339
  Latest TinyStories 5-seed streaming final-loss table:
@@ -401,6 +402,38 @@ The best static baseline in the clean OpenWebText10K run is static dropout
401
  `0.0473` and wins every paired seed comparison. This promotes OpenWebText10K
402
  from exploratory support to a second multi-seed streaming validation regime.
403
 
404
  ## Completed Static Backtest Gate
405
 
406
  The first offline coefficient backtest is complete. It is retained as supporting
@@ -423,11 +456,15 @@ streaming multi-seed reports for each regime.
423
 
424
  ## Immediate Next Action
425
 
426
- Reconcile the TinyStories five-seed report and OpenWebText10K five-seed report
427
- into the paper outline. The seed-count gap is now closed. The next empirical
428
- weakness is external validity, so the preferred next experiment is a third
429
- held-out regime with minimal coefficient calibration followed by narrowed
430
- multi-seed streaming validation.
 
 
 
 
431
 
432
  ## Next Training After Current Gate
433
 
@@ -438,8 +475,10 @@ limiting issue, use a third held-out regime for the next validation step:
438
  ```text
439
  completed: TinyStories 5-seed streaming report
440
  completed: OpenWebText10K 5-seed clean streaming report
441
- next: third held-out regime with minimal calibration
442
- avoid: broad new sweep before cross-regime report reconciliation
 
 
443
  ```
444
 
445
  Evaluate with paired seed comparisons:
@@ -452,17 +491,20 @@ decay minus best-static delta per seed
452
  rank consistency across seeds
453
  ```
454
 
455
- Because OpenWebText10K decay wins across paired seeds, promote the cross-regime
456
- streaming claim to "supported in two regimes." Do not yet claim universal
457
- numeric coefficients. The next claim to test is whether the pressure-law
458
- structure and regime-specific fitting procedure reproduce the win in a third
459
- held-out regime.
 
460
 
461
  Latest streaming report:
462
 
463
  ```text
464
  docs/tinystories_streaming_report.md
465
  docs/openwebtext10k_streaming_report.md
 
466
  runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
467
  runs/openwebtext10k_streaming_report/l16_updated_formula_clean_5seed/
 
468
  ```
 
285
  | TinyStories static/coefficient regime | active | main coefficient evidence |
286
  | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
287
  | OpenWebText10K streaming regime | 5-seed clean validation complete | OpenWebText10K interaction decay beats best static in 5/5 paired final-loss comparisons |
288
+ | WikiText-103 streaming regime | 5-seed validation complete | formula-derived L12 decay beats best static in 5/5 paired final-loss comparisons |
289
 
290
  ## Current Formula Status
291
 
 
334
  | TinyStories held-out model | supports pressure dependence on model size |
335
  | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
336
  | OpenWebText10K streaming, 5 seeds | interaction decay has best mean final loss; top decay schedules beat best static in 5/5 paired comparisons |
337
+ | WikiText-103 streaming, 5 seeds | formula-derived L12 decay has best mean final loss; beats best static in 5/5 paired comparisons |
338
  | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
339
 
340
  Latest TinyStories 5-seed streaming final-loss table:
 
402
  `0.0473` and wins every paired seed comparison. This promotes OpenWebText10K
403
  from exploratory support to a second multi-seed streaming validation regime.
404
 
405
+ Latest WikiText-103 5-seed streaming final-loss table:
406
+
407
+ | Condition | Mean final 4M validation loss | Std |
408
+ |---|---:|---:|
409
+ | `wikitext103_formula_l12` decay | 4.0808 | 0.0195 |
410
+ | `wikitext103_probe_blend` decay | 4.0961 | 0.0145 |
411
+ | `wikitext103_low_decay` decay | 4.1020 | 0.0166 |
412
+ | static `0.10` | 4.1105 | 0.0188 |
413
+ | static `0.08` | 4.1116 | 0.0186 |
414
+ | static `0.06` | 4.1197 | 0.0082 |
415
+ | static `0.14` | 4.1221 | 0.0155 |
416
+ | static `0.18` | 4.1304 | 0.0130 |
417
+ | static `0.04` | 4.1331 | 0.0227 |
418
+ | static `0.20` | 4.1394 | 0.0167 |
419
+ | static `0.02` | 4.1459 | 0.0165 |
420
+ | static `0.26` | 4.1784 | 0.0145 |
421
+ | static `0.00` | 4.1835 | 0.0165 |
422
+ | static `0.30` | 4.1946 | 0.0141 |
423
+
424
+ Paired final-loss result:
425
+
426
+ | Decay schedule | Paired wins vs best static |
427
+ |---|---:|
428
+ | `wikitext103_formula_l12` | 5/5 |
429
+ | `wikitext103_probe_blend` | 4/5 |
430
+ | `wikitext103_low_decay` | 4/5 |
431
+
432
+ The best static baseline in the clean WikiText-103 run is static dropout
433
+ `0.10` by mean final loss. The formula-derived L12 decay improves mean final
434
+ validation loss by about `0.0297` and beats the per-seed best static in all 5 seeds. This
435
+ promotes WikiText-103 to a third multi-seed streaming validation regime.
436
+
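A minimal sketch of the arithmetic behind the `0.0297` figure, using the per-seed final losses listed in `docs/wikitext103_streaming_report.md`. The variable names are illustrative and are not taken from the project's scripts. The sketch also computes the stricter margin against the per-seed best static baseline, which comes out smaller.

```python
from statistics import mean

# Per-seed final validation losses (seeds 1..5), copied from the report tables.
formula_l12 = [4.0623, 4.1123, 4.0763, 4.0845, 4.0686]
static_010 = [4.0807, 4.1320, 4.1115, 4.1096, 4.1186]  # best static by mean
best_static_per_seed = [4.0807, 4.1304, 4.1036, 4.1096, 4.1056]  # best static within each seed

# Margin against the single static setting with the best mean final loss.
gain_vs_mean_best = mean(static_010) - mean(formula_l12)

# Margin against whichever static setting was best in each individual seed.
gain_vs_per_seed_best = mean(
    b - f for b, f in zip(best_static_per_seed, formula_l12)
)

print(round(gain_vs_mean_best, 4))      # difference of means
print(round(gain_vs_per_seed_best, 4))  # mean of paired differences
```

The first number compares against one fixed baseline; the second lets the baseline pick its best setting per seed, so it is the more conservative of the two.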
437
  ## Completed Static Backtest Gate
438
 
439
  The first offline coefficient backtest is complete. It is retained as supporting
 
456
 
457
  ## Immediate Next Action
458
 
459
+ Reconcile the TinyStories, OpenWebText10K, and WikiText-103 five-seed streaming
460
+ reports into the paper outline. The strongest current claim is now supported in
461
+ three regimes: formula-derived or regime-fitted decay schedules beat the best
462
+ static dropout baseline in paired five-seed final-loss comparisons.
463
+
464
+ The next empirical weakness is no longer a missing third text regime. The next
465
+ useful strengthening step is to test robustness across a controlled architecture
466
+ or token-budget change inside one established corpus regime, while preserving
467
+ the same MPS-only, five-seed validation standard.
468
 
469
  ## Next Training After Current Gate
470
 
 
475
  ```text
476
  completed: TinyStories 5-seed streaming report
477
  completed: OpenWebText10K 5-seed clean streaming report
478
+ completed: WikiText-103 5-seed clean streaming report
479
+ next: reconcile three-regime evidence into the paper, then choose one narrowed
480
+ robustness test
481
+ avoid: broad new sweep before three-regime report reconciliation
482
  ```
483
 
484
  Evaluate with paired seed comparisons:
 
491
  rank consistency across seeds
492
  ```
493
 
494
+ Because TinyStories, OpenWebText10K, and WikiText-103 decays win across paired
495
+ seeds, promote the cross-regime streaming claim to "supported in three regimes."
496
+ Do not yet claim universal numeric coefficients. The current defensible
497
+ paper-level claim is that the pressure-law structure and regime-specific fitting
498
+ procedure can produce dropout schedules that beat the best static dropout
499
+ baseline across multiple text regimes.
500
 
501
  Latest streaming report:
502
 
503
  ```text
504
  docs/tinystories_streaming_report.md
505
  docs/openwebtext10k_streaming_report.md
506
+ docs/wikitext103_streaming_report.md
507
  runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
508
  runs/openwebtext10k_streaming_report/l16_updated_formula_clean_5seed/
509
+ runs/wikitext103_streaming_report/l12_validation_5seed/
510
  ```
docs/wikitext103_streaming_report.md ADDED
@@ -0,0 +1,198 @@
1
+ # WikiText-103 Streaming Validation
2
+
3
+ Date: 2026-05-31
4
+
5
+ This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
6
+ No additional training is performed by this script; it reads saved
7
+ `metrics.jsonl` files.
8
+
9
+ Regime: WikiText-103 cached-corpus streaming setup with an `L12_H8_D320` model (17,367,040 parameters), five prefixes from 250k to 4M tokens, and 1,000 optimizer steps per stage. This is a clean five-seed run covering three dropout decay schedules and eleven static dropout baselines from 0.00 through 0.30.
10
+
11
+ ## Sources
12
+
13
+ - `runs/wikitext103_l12_streaming_validation_5seed/locked_stream/20260531-093525/metrics.jsonl`
14
+
15
+ ## Condition Ranking By Final Loss
16
+
17
+ | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
18
+ |---|---|---:|---:|---:|---:|---:|---:|---|
19
+ | `wikitext103_formula_l12` | `anchor_decay` | 5 | 4.5711 | 0.0045 | 4.0808 | 0.0195 | 0.2817 | `0.30 -> 0.26 -> 0.18 -> 0.09 -> 0.02` |
20
+ | `wikitext103_probe_blend` | `anchor_decay` | 5 | 4.5635 | 0.0046 | 4.0961 | 0.0145 | 0.3287 | `0.19 -> 0.14 -> 0.09 -> 0.04 -> 0.01` |
21
+ | `wikitext103_low_decay` | `anchor_decay` | 5 | 4.5681 | 0.0073 | 4.1020 | 0.0166 | 0.3251 | `0.14 -> 0.14 -> 0.10 -> 0.06 -> 0.02` |
22
+ | `static_dropout_0.1` | `static` | 5 | 4.5836 | 0.0062 | 4.1105 | 0.0188 | 0.2687 | `0.10 -> 0.10 -> 0.10 -> 0.10 -> 0.10` |
23
+ | `static_dropout_0.08` | `static` | 5 | 4.5967 | 0.0073 | 4.1116 | 0.0186 | 0.2848 | `0.08 -> 0.08 -> 0.08 -> 0.08 -> 0.08` |
24
+ | `static_dropout_0.06` | `static` | 5 | 4.6186 | 0.0048 | 4.1197 | 0.0082 | 0.3131 | `0.06 -> 0.06 -> 0.06 -> 0.06 -> 0.06` |
25
+ | `static_dropout_0.14` | `static` | 5 | 4.5735 | 0.0077 | 4.1221 | 0.0155 | 0.2548 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` |
26
+ | `static_dropout_0.18` | `static` | 5 | 4.5756 | 0.0041 | 4.1304 | 0.0130 | 0.2289 | `0.18 -> 0.18 -> 0.18 -> 0.18 -> 0.18` |
27
+ | `static_dropout_0.04` | `static` | 5 | 4.6501 | 0.0077 | 4.1331 | 0.0227 | 0.3353 | `0.04 -> 0.04 -> 0.04 -> 0.04 -> 0.04` |
28
+ | `static_dropout_0.2` | `static` | 5 | 4.5794 | 0.0050 | 4.1394 | 0.0167 | 0.2239 | `0.20 -> 0.20 -> 0.20 -> 0.20 -> 0.20` |
29
+ | `static_dropout_0.02` | `static` | 5 | 4.6954 | 0.0086 | 4.1459 | 0.0165 | 0.3700 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` |
30
+ | `static_dropout_0.26` | `static` | 5 | 4.6063 | 0.0051 | 4.1784 | 0.0145 | 0.2008 | `0.26 -> 0.26 -> 0.26 -> 0.26 -> 0.26` |
31
+ | `static_dropout_0` | `static` | 5 | 4.7762 | 0.0109 | 4.1835 | 0.0165 | 0.4085 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` |
32
+ | `static_dropout_0.3` | `static` | 5 | 4.6253 | 0.0034 | 4.1946 | 0.0141 | 0.1819 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` |
33
+
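The `Mean final val` and `Std final val` columns can be checked against the per-seed values in the Paired Final-Loss Deltas section. The sketch below assumes the standard deviation is the sample (n-1) form. The report does not say which form it uses, but the numbers are consistent with that assumption.

```python
from statistics import mean, stdev

# Per-seed final validation losses for `wikitext103_formula_l12` (seeds 1..5),
# copied from the paired table in this report.
formula_l12_final = [4.0623, 4.1123, 4.0763, 4.0845, 4.0686]

mean_final = mean(formula_l12_final)
# statistics.stdev is the sample standard deviation (divides by n - 1).
std_final = stdev(formula_l12_final)

print(f"{mean_final:.4f} +/- {std_final:.4f}")
```

With only five seeds, the sample and population forms differ noticeably, so it is worth stating which one a table uses.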
34
+ ## Paired Final-Loss Deltas
35
+
36
+ Negative `delta_vs_best_static` means the condition beat the best static
37
+ baseline for that seed.
38
+
39
+ | Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
40
+ |---:|---|---:|---|---:|---:|
41
+ | 1 | `wikitext103_formula_l12` | 4.0623 | `static_dropout_0.1` | 4.0807 | -0.0184 |
42
+ | 1 | `wikitext103_probe_blend` | 4.0738 | `static_dropout_0.1` | 4.0807 | -0.0069 |
43
+ | 1 | `wikitext103_low_decay` | 4.0854 | `static_dropout_0.1` | 4.0807 | +0.0047 |
44
+ | 1 | `static_dropout_0.1` | 4.0807 | `static_dropout_0.1` | 4.0807 | +0.0000 |
45
+ | 1 | `static_dropout_0.08` | 4.0893 | `static_dropout_0.1` | 4.0807 | +0.0086 |
46
+ | 1 | `static_dropout_0.06` | 4.1112 | `static_dropout_0.1` | 4.0807 | +0.0305 |
47
+ | 1 | `static_dropout_0.14` | 4.1082 | `static_dropout_0.1` | 4.0807 | +0.0275 |
48
+ | 1 | `static_dropout_0.18` | 4.1108 | `static_dropout_0.1` | 4.0807 | +0.0301 |
49
+ | 1 | `static_dropout_0.2` | 4.1162 | `static_dropout_0.1` | 4.0807 | +0.0355 |
50
+ | 1 | `static_dropout_0.04` | 4.1031 | `static_dropout_0.1` | 4.0807 | +0.0224 |
51
+ | 1 | `static_dropout_0.02` | 4.1371 | `static_dropout_0.1` | 4.0807 | +0.0564 |
52
+ | 1 | `static_dropout_0` | 4.1600 | `static_dropout_0.1` | 4.0807 | +0.0793 |
53
+ | 1 | `static_dropout_0.26` | 4.1557 | `static_dropout_0.1` | 4.0807 | +0.0750 |
54
+ | 1 | `static_dropout_0.3` | 4.1802 | `static_dropout_0.1` | 4.0807 | +0.0994 |
55
+ | 2 | `wikitext103_formula_l12` | 4.1123 | `static_dropout_0.06` | 4.1304 | -0.0181 |
56
+ | 2 | `wikitext103_probe_blend` | 4.1113 | `static_dropout_0.06` | 4.1304 | -0.0191 |
57
+ | 2 | `wikitext103_low_decay` | 4.1291 | `static_dropout_0.06` | 4.1304 | -0.0013 |
58
+ | 2 | `static_dropout_0.1` | 4.1320 | `static_dropout_0.06` | 4.1304 | +0.0016 |
59
+ | 2 | `static_dropout_0.08` | 4.1374 | `static_dropout_0.06` | 4.1304 | +0.0071 |
60
+ | 2 | `static_dropout_0.06` | 4.1304 | `static_dropout_0.06` | 4.1304 | +0.0000 |
61
+ | 2 | `static_dropout_0.14` | 4.1476 | `static_dropout_0.06` | 4.1304 | +0.0172 |
62
+ | 2 | `static_dropout_0.18` | 4.1471 | `static_dropout_0.06` | 4.1304 | +0.0167 |
63
+ | 2 | `static_dropout_0.2` | 4.1633 | `static_dropout_0.06` | 4.1304 | +0.0329 |
64
+ | 2 | `static_dropout_0.04` | 4.1648 | `static_dropout_0.06` | 4.1304 | +0.0344 |
65
+ | 2 | `static_dropout_0.02` | 4.1746 | `static_dropout_0.06` | 4.1304 | +0.0442 |
66
+ | 2 | `static_dropout_0` | 4.2030 | `static_dropout_0.06` | 4.1304 | +0.0726 |
67
+ | 2 | `static_dropout_0.26` | 4.1961 | `static_dropout_0.06` | 4.1304 | +0.0658 |
68
+ | 2 | `static_dropout_0.3` | 4.2155 | `static_dropout_0.06` | 4.1304 | +0.0852 |
69
+ | 3 | `wikitext103_formula_l12` | 4.0763 | `static_dropout_0.08` | 4.1036 | -0.0272 |
70
+ | 3 | `wikitext103_probe_blend` | 4.0934 | `static_dropout_0.08` | 4.1036 | -0.0102 |
71
+ | 3 | `wikitext103_low_decay` | 4.1006 | `static_dropout_0.08` | 4.1036 | -0.0030 |
72
+ | 3 | `static_dropout_0.1` | 4.1115 | `static_dropout_0.08` | 4.1036 | +0.0079 |
73
+ | 3 | `static_dropout_0.08` | 4.1036 | `static_dropout_0.08` | 4.1036 | +0.0000 |
74
+ | 3 | `static_dropout_0.06` | 4.1127 | `static_dropout_0.08` | 4.1036 | +0.0092 |
75
+ | 3 | `static_dropout_0.14` | 4.1240 | `static_dropout_0.08` | 4.1036 | +0.0204 |
76
+ | 3 | `static_dropout_0.18` | 4.1285 | `static_dropout_0.08` | 4.1036 | +0.0250 |
77
+ | 3 | `static_dropout_0.2` | 4.1367 | `static_dropout_0.08` | 4.1036 | +0.0332 |
78
+ | 3 | `static_dropout_0.04` | 4.1246 | `static_dropout_0.08` | 4.1036 | +0.0211 |
79
+ | 3 | `static_dropout_0.02` | 4.1443 | `static_dropout_0.08` | 4.1036 | +0.0408 |
80
+ | 3 | `static_dropout_0` | 4.1758 | `static_dropout_0.08` | 4.1036 | +0.0722 |
81
+ | 3 | `static_dropout_0.26` | 4.1796 | `static_dropout_0.08` | 4.1036 | +0.0761 |
82
+ | 3 | `static_dropout_0.3` | 4.1926 | `static_dropout_0.08` | 4.1036 | +0.0890 |
83
+ | 4 | `wikitext103_formula_l12` | 4.0845 | `static_dropout_0.1` | 4.1096 | -0.0251 |
84
+ | 4 | `wikitext103_probe_blend` | 4.0954 | `static_dropout_0.1` | 4.1096 | -0.0141 |
85
+ | 4 | `wikitext103_low_decay` | 4.0928 | `static_dropout_0.1` | 4.1096 | -0.0167 |
86
+ | 4 | `static_dropout_0.1` | 4.1096 | `static_dropout_0.1` | 4.1096 | +0.0000 |
87
+ | 4 | `static_dropout_0.08` | 4.1223 | `static_dropout_0.1` | 4.1096 | +0.0127 |
88
+ | 4 | `static_dropout_0.06` | 4.1188 | `static_dropout_0.1` | 4.1096 | +0.0093 |
89
+ | 4 | `static_dropout_0.14` | 4.1117 | `static_dropout_0.1` | 4.1096 | +0.0021 |
90
+ | 4 | `static_dropout_0.18` | 4.1330 | `static_dropout_0.1` | 4.1096 | +0.0234 |
91
+ | 4 | `static_dropout_0.2` | 4.1388 | `static_dropout_0.1` | 4.1096 | +0.0292 |
92
+ | 4 | `static_dropout_0.04` | 4.1312 | `static_dropout_0.1` | 4.1096 | +0.0217 |
93
+ | 4 | `static_dropout_0.02` | 4.1387 | `static_dropout_0.1` | 4.1096 | +0.0291 |
94
+ | 4 | `static_dropout_0` | 4.1853 | `static_dropout_0.1` | 4.1096 | +0.0757 |
95
+ | 4 | `static_dropout_0.26` | 4.1782 | `static_dropout_0.1` | 4.1096 | +0.0686 |
96
+ | 4 | `static_dropout_0.3` | 4.2007 | `static_dropout_0.1` | 4.1096 | +0.0912 |
97
+ | 5 | `wikitext103_formula_l12` | 4.0686 | `static_dropout_0.08` | 4.1056 | -0.0370 |
98
+ | 5 | `wikitext103_probe_blend` | 4.1066 | `static_dropout_0.08` | 4.1056 | +0.0009 |
99
+ | 5 | `wikitext103_low_decay` | 4.1021 | `static_dropout_0.08` | 4.1056 | -0.0035 |
100
+ | 5 | `static_dropout_0.1` | 4.1186 | `static_dropout_0.08` | 4.1056 | +0.0129 |
101
+ | 5 | `static_dropout_0.08` | 4.1056 | `static_dropout_0.08` | 4.1056 | +0.0000 |
102
+ | 5 | `static_dropout_0.06` | 4.1253 | `static_dropout_0.08` | 4.1056 | +0.0197 |
103
+ | 5 | `static_dropout_0.14` | 4.1192 | `static_dropout_0.08` | 4.1056 | +0.0135 |
104
+ | 5 | `static_dropout_0.18` | 4.1325 | `static_dropout_0.08` | 4.1056 | +0.0269 |
105
+ | 5 | `static_dropout_0.2` | 4.1418 | `static_dropout_0.08` | 4.1056 | +0.0362 |
106
+ | 5 | `static_dropout_0.04` | 4.1419 | `static_dropout_0.08` | 4.1056 | +0.0363 |
107
+ | 5 | `static_dropout_0.02` | 4.1346 | `static_dropout_0.08` | 4.1056 | +0.0290 |
108
+ | 5 | `static_dropout_0` | 4.1934 | `static_dropout_0.08` | 4.1056 | +0.0878 |
109
+ | 5 | `static_dropout_0.26` | 4.1821 | `static_dropout_0.08` | 4.1056 | +0.0765 |
110
+ | 5 | `static_dropout_0.3` | 4.1841 | `static_dropout_0.08` | 4.1056 | +0.0785 |
111
+
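One way the `Delta vs best static` column could be computed is sketched below. This is an illustration, not the report script: the function name `paired_delta` and the nested-dict layout are hypothetical, and only the three lowest static baselines for seeds 1 and 2 are included to keep the example short.

```python
# Final validation loss per seed and condition (subset of the table above).
final_val = {
    1: {
        "wikitext103_formula_l12": 4.0623,
        "static_dropout_0.1": 4.0807,
        "static_dropout_0.08": 4.0893,
        "static_dropout_0.06": 4.1112,
    },
    2: {
        "wikitext103_formula_l12": 4.1123,
        "static_dropout_0.1": 4.1320,
        "static_dropout_0.08": 4.1374,
        "static_dropout_0.06": 4.1304,
    },
}


def paired_delta(per_seed, condition):
    """Return {seed: (best_static_name, condition_loss - best_static_loss)}."""
    out = {}
    for seed, losses in per_seed.items():
        # The best static baseline is chosen independently within each seed.
        statics = {k: v for k, v in losses.items() if k.startswith("static_dropout_")}
        best_name = min(statics, key=statics.get)
        out[seed] = (best_name, losses[condition] - statics[best_name])
    return out


deltas = paired_delta(final_val, "wikitext103_formula_l12")
for seed, (best_name, delta) in deltas.items():
    print(seed, best_name, f"{delta:+.4f}")
```

Selecting the baseline per seed makes the comparison harder for the decay schedule than comparing against a single fixed static setting.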
112
+ ## Stage Trajectory
113
+
114
+ | Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
115
+ |---:|---:|---|---:|---:|---:|---:|---:|---:|
116
+ | 0 | 250,000 | `static_dropout_0.18` | 0.180 | 5 | 5.1616 | 0.0150 | 3.9964 | 1.1652 |
117
+ | 0 | 250,000 | `wikitext103_low_decay` | 0.140 | 5 | 5.1635 | 0.0220 | 3.9051 | 1.2585 |
118
+ | 0 | 250,000 | `static_dropout_0.14` | 0.140 | 5 | 5.1635 | 0.0220 | 3.9051 | 1.2585 |
119
+ | 0 | 250,000 | `wikitext103_probe_blend` | 0.190 | 5 | 5.1659 | 0.0171 | 4.0201 | 1.1458 |
120
+ | 0 | 250,000 | `static_dropout_0.1` | 0.100 | 5 | 5.1699 | 0.0237 | 3.8219 | 1.3480 |
121
+ | 0 | 250,000 | `static_dropout_0.2` | 0.200 | 5 | 5.1701 | 0.0141 | 4.0363 | 1.1338 |
122
+ | 0 | 250,000 | `static_dropout_0.08` | 0.080 | 5 | 5.1894 | 0.0161 | 3.7619 | 1.4274 |
123
+ | 0 | 250,000 | `static_dropout_0.26` | 0.260 | 5 | 5.1940 | 0.0161 | 4.1496 | 1.0444 |
124
+ | 0 | 250,000 | `wikitext103_formula_l12` | 0.300 | 5 | 5.2148 | 0.0181 | 4.2131 | 1.0017 |
125
+ | 0 | 250,000 | `static_dropout_0.3` | 0.300 | 5 | 5.2148 | 0.0181 | 4.2131 | 1.0017 |
126
+ | 0 | 250,000 | `static_dropout_0.06` | 0.060 | 5 | 5.2154 | 0.0173 | 3.7128 | 1.5026 |
127
+ | 0 | 250,000 | `static_dropout_0.04` | 0.040 | 5 | 5.2378 | 0.0186 | 3.6441 | 1.5938 |
128
+ | 0 | 250,000 | `static_dropout_0.02` | 0.020 | 5 | 5.2750 | 0.0255 | 3.5725 | 1.7025 |
129
+ | 0 | 250,000 | `static_dropout_0` | 0.000 | 5 | 5.3403 | 0.0270 | 3.5230 | 1.8172 |
130
+ | 1 | 500,000 | `wikitext103_probe_blend` | 0.140 | 5 | 4.7872 | 0.0269 | 3.6846 | 1.1027 |
131
+ | 1 | 500,000 | `static_dropout_0.2` | 0.200 | 5 | 4.7873 | 0.0236 | 3.7914 | 0.9959 |
132
+ | 1 | 500,000 | `static_dropout_0.18` | 0.180 | 5 | 4.7946 | 0.0206 | 3.7572 | 1.0375 |
133
+ | 1 | 500,000 | `static_dropout_0.14` | 0.140 | 5 | 4.8001 | 0.0198 | 3.6650 | 1.1351 |
134
+ | 1 | 500,000 | `wikitext103_low_decay` | 0.140 | 5 | 4.8001 | 0.0198 | 3.6650 | 1.1351 |
135
+ | 1 | 500,000 | `wikitext103_formula_l12` | 0.260 | 5 | 4.8053 | 0.0278 | 3.9182 | 0.8871 |
136
+ | 1 | 500,000 | `static_dropout_0.26` | 0.260 | 5 | 4.8081 | 0.0216 | 3.9053 | 0.9028 |
137
+ | 1 | 500,000 | `static_dropout_0.3` | 0.300 | 5 | 4.8242 | 0.0296 | 3.9765 | 0.8476 |
138
+ | 1 | 500,000 | `static_dropout_0.1` | 0.100 | 5 | 4.8332 | 0.0258 | 3.5637 | 1.2695 |
139
+ | 1 | 500,000 | `static_dropout_0.08` | 0.080 | 5 | 4.8576 | 0.0239 | 3.5036 | 1.3540 |
140
+ | 1 | 500,000 | `static_dropout_0.06` | 0.060 | 5 | 4.8947 | 0.0213 | 3.4394 | 1.4552 |
141
+ | 1 | 500,000 | `static_dropout_0.04` | 0.040 | 5 | 4.9573 | 0.0250 | 3.3515 | 1.6058 |
142
+ | 1 | 500,000 | `static_dropout_0.02` | 0.020 | 5 | 5.0451 | 0.0169 | 3.2612 | 1.7839 |
143
+ | 1 | 500,000 | `static_dropout_0` | 0.000 | 5 | 5.1741 | 0.0252 | 3.1506 | 2.0235 |
144
+ | 2 | 1,000,000 | `wikitext103_formula_l12` | 0.180 | 5 | 4.4938 | 0.0147 | 3.8283 | 0.6655 |
145
+ | 2 | 1,000,000 | `wikitext103_probe_blend` | 0.090 | 5 | 4.4940 | 0.0159 | 3.6495 | 0.8445 |
146
+ | 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.5001 | 0.0148 | 3.7163 | 0.7838 |
147
+ | 2 | 1,000,000 | `wikitext103_low_decay` | 0.100 | 5 | 4.5013 | 0.0158 | 3.6607 | 0.8406 |
148
+ | 2 | 1,000,000 | `static_dropout_0.18` | 0.180 | 5 | 4.5023 | 0.0185 | 3.7866 | 0.7157 |
149
+ | 2 | 1,000,000 | `static_dropout_0.2` | 0.200 | 5 | 4.5060 | 0.0204 | 3.8148 | 0.6913 |
150
+ | 2 | 1,000,000 | `static_dropout_0.1` | 0.100 | 5 | 4.5186 | 0.0135 | 3.6524 | 0.8662 |
151
+ | 2 | 1,000,000 | `static_dropout_0.26` | 0.260 | 5 | 4.5262 | 0.0101 | 3.9071 | 0.6191 |
152
+ | 2 | 1,000,000 | `static_dropout_0.08` | 0.080 | 5 | 4.5326 | 0.0117 | 3.6000 | 0.9326 |
153
+ | 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.5462 | 0.0127 | 3.9708 | 0.5754 |
154
+ | 2 | 1,000,000 | `static_dropout_0.06` | 0.060 | 5 | 4.5574 | 0.0126 | 3.5554 | 1.0020 |
155
+ | 2 | 1,000,000 | `static_dropout_0.04` | 0.040 | 5 | 4.5959 | 0.0146 | 3.5030 | 1.0929 |
156
+ | 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.6558 | 0.0159 | 3.4324 | 1.2234 |
157
+ | 2 | 1,000,000 | `static_dropout_0` | 0.000 | 5 | 4.7661 | 0.0324 | 3.3658 | 1.4003 |
158
+ | 3 | 2,000,000 | `wikitext103_formula_l12` | 0.090 | 5 | 4.2607 | 0.0181 | 3.8089 | 0.4518 |
159
+ | 3 | 2,000,000 | `wikitext103_low_decay` | 0.060 | 5 | 4.2736 | 0.0228 | 3.7474 | 0.5261 |
160
+ | 3 | 2,000,000 | `wikitext103_probe_blend` | 0.040 | 5 | 4.2743 | 0.0174 | 3.7200 | 0.5543 |
161
+ | 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.2816 | 0.0214 | 3.8287 | 0.4529 |
162
+ | 3 | 2,000,000 | `static_dropout_0.1` | 0.100 | 5 | 4.2857 | 0.0190 | 3.7809 | 0.5048 |
163
+ | 3 | 2,000,000 | `static_dropout_0.18` | 0.180 | 5 | 4.2889 | 0.0195 | 3.8768 | 0.4121 |
164
+ | 3 | 2,000,000 | `static_dropout_0.08` | 0.080 | 5 | 4.2925 | 0.0166 | 3.7519 | 0.5406 |
165
+ | 3 | 2,000,000 | `static_dropout_0.2` | 0.200 | 5 | 4.2942 | 0.0160 | 3.8982 | 0.3960 |
166
+ | 3 | 2,000,000 | `static_dropout_0.06` | 0.060 | 5 | 4.3059 | 0.0182 | 3.7318 | 0.5741 |
167
+ | 3 | 2,000,000 | `static_dropout_0.26` | 0.260 | 5 | 4.3248 | 0.0179 | 3.9655 | 0.3593 |
168
+ | 3 | 2,000,000 | `static_dropout_0.04` | 0.040 | 5 | 4.3262 | 0.0165 | 3.7038 | 0.6225 |
169
+ | 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.3467 | 0.0159 | 4.0097 | 0.3370 |
170
+ | 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.3551 | 0.0265 | 3.6673 | 0.6878 |
171
+ | 3 | 2,000,000 | `static_dropout_0` | 0.000 | 5 | 4.4174 | 0.0188 | 3.6409 | 0.7765 |
172
+ | 4 | 4,000,000 | `wikitext103_formula_l12` | 0.020 | 5 | 4.0808 | 0.0195 | 3.7991 | 0.2817 |
173
+ | 4 | 4,000,000 | `wikitext103_probe_blend` | 0.010 | 5 | 4.0961 | 0.0145 | 3.7674 | 0.3287 |
174
+ | 4 | 4,000,000 | `wikitext103_low_decay` | 0.020 | 5 | 4.1020 | 0.0166 | 3.7769 | 0.3251 |
175
+ | 4 | 4,000,000 | `static_dropout_0.1` | 0.100 | 5 | 4.1105 | 0.0188 | 3.8417 | 0.2687 |
176
+ | 4 | 4,000,000 | `static_dropout_0.08` | 0.080 | 5 | 4.1116 | 0.0186 | 3.8268 | 0.2848 |
177
+ | 4 | 4,000,000 | `static_dropout_0.06` | 0.060 | 5 | 4.1197 | 0.0082 | 3.8066 | 0.3131 |
178
+ | 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.1221 | 0.0155 | 3.8674 | 0.2548 |
179
+ | 4 | 4,000,000 | `static_dropout_0.18` | 0.180 | 5 | 4.1304 | 0.0130 | 3.9015 | 0.2289 |
180
+ | 4 | 4,000,000 | `static_dropout_0.04` | 0.040 | 5 | 4.1331 | 0.0227 | 3.7978 | 0.3353 |
181
+ | 4 | 4,000,000 | `static_dropout_0.2` | 0.200 | 5 | 4.1394 | 0.0167 | 3.9155 | 0.2239 |
182
+ | 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.1459 | 0.0165 | 3.7759 | 0.3700 |
183
+ | 4 | 4,000,000 | `static_dropout_0.26` | 0.260 | 5 | 4.1784 | 0.0145 | 3.9775 | 0.2008 |
184
+ | 4 | 4,000,000 | `static_dropout_0` | 0.000 | 5 | 4.1835 | 0.0165 | 3.7750 | 0.4085 |
185
+ | 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.1946 | 0.0141 | 4.0127 | 0.1819 |
186
+
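The report does not define `Mean trajectory val` or `Mean gap`. The sketch below assumes that the trajectory value is the unweighted mean of the five per-stage mean validation losses and that the gap is validation loss minus training loss. Both assumptions reproduce the tabulated values for `wikitext103_formula_l12`.

```python
from statistics import mean

# Per-stage mean validation loss for `wikitext103_formula_l12`, stages 0..4,
# copied from the Stage Trajectory table.
stage_val = [5.2148, 4.8053, 4.4938, 4.2607, 4.0808]
final_train = 3.7991  # stage 4 mean training loss

# Assumed definition: unweighted average over stages.
trajectory_mean = mean(stage_val)
# Assumed definition: validation loss minus training loss at the final stage.
final_gap = stage_val[-1] - final_train

print(round(trajectory_mean, 4), round(final_gap, 4))
```

Because the trajectory mean weights early stages equally with the final stage, a schedule can rank well on final loss and still rank lower on trajectory mean.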
187
+ ## Interpretation
188
+
189
+ - `wikitext103_formula_l12` has the best 5-seed mean final validation loss: 4.0808 +/- 0.0195.
190
+ - The second-best final condition is `wikitext103_probe_blend` at 4.0961 +/- 0.0145.
191
+ - The best static baseline by mean final loss is `static_dropout_0.1` at 4.1105 +/- 0.0188.
192
+ - `wikitext103_formula_l12` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0181.
193
+ - `wikitext103_probe_blend` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0009.
194
+ - `wikitext103_low_decay` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0047.
195
+ - The best first-stage condition is `static_dropout_0.18` at prefix 250,000 with mean validation loss 5.1616; compare this with the final ranking before claiming a schedule is uniformly better.
196
+ - This is a saved-run streaming validation artifact. Treat it as strong
197
+ evidence only when the tested conditions, seeds, static baselines, and
198
+ stream protocol match the claim being made.
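The win counts and worst paired deltas quoted in the bullets above can be recomputed from the `Delta vs best static` column. A short sketch, with the deltas copied from the paired table and a hypothetical `summarize` helper:

```python
# Delta vs per-seed best static for seeds 1..5, copied from the paired table.
paired_deltas = {
    "wikitext103_formula_l12": [-0.0184, -0.0181, -0.0272, -0.0251, -0.0370],
    "wikitext103_probe_blend": [-0.0069, -0.0191, -0.0102, -0.0141, 0.0009],
    "wikitext103_low_decay": [0.0047, -0.0013, -0.0030, -0.0167, -0.0035],
}


def summarize(deltas):
    """Return (number of seeds with a negative delta, least favorable delta)."""
    wins = sum(1 for d in deltas if d < 0)
    worst = max(deltas)  # largest delta is the weakest result for the schedule
    return wins, worst


summary = {name: summarize(d) for name, d in paired_deltas.items()}
for name, (wins, worst) in summary.items():
    print(f"{name}: {wins}/5 wins, worst delta {worst:+.4f}")
```

Here "worst" means the delta closest to, or furthest above, zero, so a negative worst delta indicates the schedule won in every seed.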
runs/wikitext103_l12_streaming_validation_5seed/locked_stream/20260531-093525/RESULT_SUMMARY.md ADDED
@@ -0,0 +1,122 @@
1
+ # Locked Streaming Dropout Summary
2
+
3
+ Run directory: `runs/wikitext103_l12_streaming_validation_5seed/locked_stream/20260531-093525`
4
+
5
+ Model: `L12_H8_D320` causal Transformer, 17,367,040 parameters, 12 layers, 8 heads, 320 embedding dim.
6
+ Training per stage: 1,000 steps. Sampled tokens are cumulative in each stage row. Seeds present: 1, 2, 3, 4, 5.
7
+
8
+ ## Condition Ranking
9
+
10
+ | Condition | Kind | Final dropout | Mean trajectory val loss | Final val loss | Final gap | Dropout path |
11
+ |---|---|---:|---:|---:|---:|---|
12
+ | `wikitext103_probe_blend` | anchor_decay | 0.01 | 4.5635 | 4.0961 | 0.3287 | 0.19 -> 0.14 -> 0.09 -> 0.04 -> 0.01 |
13
+ | `wikitext103_low_decay` | anchor_decay | 0.02 | 4.5681 | 4.1020 | 0.3251 | 0.14 -> 0.14 -> 0.10 -> 0.06 -> 0.02 |
14
+ | `wikitext103_formula_l12` | anchor_decay | 0.02 | 4.5711 | 4.0808 | 0.2817 | 0.30 -> 0.26 -> 0.18 -> 0.09 -> 0.02 |
15
+ | `static_dropout_0.14` | static | 0.14 | 4.5735 | 4.1221 | 0.2548 | 0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14 |
16
+ | `static_dropout_0.18` | static | 0.18 | 4.5756 | 4.1304 | 0.2289 | 0.18 -> 0.18 -> 0.18 -> 0.18 -> 0.18 |
17
+ | `static_dropout_0.2` | static | 0.20 | 4.5794 | 4.1394 | 0.2239 | 0.20 -> 0.20 -> 0.20 -> 0.20 -> 0.20 |
18
+ | `static_dropout_0.1` | static | 0.10 | 4.5836 | 4.1105 | 0.2687 | 0.10 -> 0.10 -> 0.10 -> 0.10 -> 0.10 |
19
+ | `static_dropout_0.08` | static | 0.08 | 4.5967 | 4.1116 | 0.2848 | 0.08 -> 0.08 -> 0.08 -> 0.08 -> 0.08 |
20
+ | `static_dropout_0.26` | static | 0.26 | 4.6063 | 4.1784 | 0.2008 | 0.26 -> 0.26 -> 0.26 -> 0.26 -> 0.26 |
21
+ | `static_dropout_0.06` | static | 0.06 | 4.6186 | 4.1197 | 0.3131 | 0.06 -> 0.06 -> 0.06 -> 0.06 -> 0.06 |
22
+ | `static_dropout_0.3` | static | 0.30 | 4.6253 | 4.1946 | 0.1819 | 0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30 |
23
+ | `static_dropout_0.04` | static | 0.04 | 4.6501 | 4.1331 | 0.3353 | 0.04 -> 0.04 -> 0.04 -> 0.04 -> 0.04 |
24
+ | `static_dropout_0.02` | static | 0.02 | 4.6954 | 4.1459 | 0.3700 | 0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02 |
25
+ | `static_dropout_0` | static | 0.00 | 4.7762 | 4.1835 | 0.4085 | 0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00 |
26
+
27
+ ## Stage Trajectory
28
+
29
+ ### Stage 0: 250,000 Prefix Tokens
30
+
31
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
32
+ |---|---:|---:|---:|---:|---:|
33
+ | `static_dropout_0.18` | 0.18 | 5.1616 | 3.9964 | 1.1652 | 5 |
34
+ | `wikitext103_low_decay` | 0.14 | 5.1635 | 3.9051 | 1.2585 | 5 |
35
+ | `static_dropout_0.14` | 0.14 | 5.1635 | 3.9051 | 1.2585 | 5 |
36
+ | `wikitext103_probe_blend` | 0.19 | 5.1659 | 4.0201 | 1.1458 | 5 |
37
+ | `static_dropout_0.1` | 0.10 | 5.1699 | 3.8219 | 1.3480 | 5 |
38
+ | `static_dropout_0.2` | 0.20 | 5.1701 | 4.0363 | 1.1338 | 5 |
39
+ | `static_dropout_0.08` | 0.08 | 5.1894 | 3.7619 | 1.4274 | 5 |
40
+ | `static_dropout_0.26` | 0.26 | 5.1940 | 4.1496 | 1.0444 | 5 |
41
+ | `wikitext103_formula_l12` | 0.30 | 5.2148 | 4.2131 | 1.0017 | 5 |
42
+ | `static_dropout_0.3` | 0.30 | 5.2148 | 4.2131 | 1.0017 | 5 |
43
+ | `static_dropout_0.06` | 0.06 | 5.2154 | 3.7128 | 1.5026 | 5 |
44
+ | `static_dropout_0.04` | 0.04 | 5.2378 | 3.6441 | 1.5938 | 5 |
+ | `static_dropout_0.02` | 0.02 | 5.2750 | 3.5725 | 1.7025 | 5 |
+ | `static_dropout_0` | 0.00 | 5.3403 | 3.5230 | 1.8172 | 5 |
+
+ ### Stage 1: 500,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `wikitext103_probe_blend` | 0.14 | 4.7872 | 3.6846 | 1.1027 | 5 |
+ | `static_dropout_0.2` | 0.20 | 4.7873 | 3.7914 | 0.9959 | 5 |
+ | `static_dropout_0.18` | 0.18 | 4.7946 | 3.7572 | 1.0375 | 5 |
+ | `static_dropout_0.14` | 0.14 | 4.8001 | 3.6650 | 1.1351 | 5 |
+ | `wikitext103_low_decay` | 0.14 | 4.8001 | 3.6650 | 1.1351 | 5 |
+ | `wikitext103_formula_l12` | 0.26 | 4.8053 | 3.9182 | 0.8871 | 5 |
+ | `static_dropout_0.26` | 0.26 | 4.8081 | 3.9053 | 0.9028 | 5 |
+ | `static_dropout_0.3` | 0.30 | 4.8242 | 3.9765 | 0.8476 | 5 |
+ | `static_dropout_0.1` | 0.10 | 4.8332 | 3.5637 | 1.2695 | 5 |
+ | `static_dropout_0.08` | 0.08 | 4.8576 | 3.5036 | 1.3540 | 5 |
+ | `static_dropout_0.06` | 0.06 | 4.8947 | 3.4394 | 1.4552 | 5 |
+ | `static_dropout_0.04` | 0.04 | 4.9573 | 3.3515 | 1.6058 | 5 |
+ | `static_dropout_0.02` | 0.02 | 5.0451 | 3.2612 | 1.7839 | 5 |
+ | `static_dropout_0` | 0.00 | 5.1741 | 3.1506 | 2.0235 | 5 |
+
+ ### Stage 2: 1,000,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `wikitext103_formula_l12` | 0.18 | 4.4938 | 3.8283 | 0.6655 | 5 |
+ | `wikitext103_probe_blend` | 0.09 | 4.4940 | 3.6495 | 0.8445 | 5 |
+ | `static_dropout_0.14` | 0.14 | 4.5001 | 3.7163 | 0.7838 | 5 |
+ | `wikitext103_low_decay` | 0.10 | 4.5013 | 3.6607 | 0.8406 | 5 |
+ | `static_dropout_0.18` | 0.18 | 4.5023 | 3.7866 | 0.7157 | 5 |
+ | `static_dropout_0.2` | 0.20 | 4.5060 | 3.8148 | 0.6913 | 5 |
+ | `static_dropout_0.1` | 0.10 | 4.5186 | 3.6524 | 0.8662 | 5 |
+ | `static_dropout_0.26` | 0.26 | 4.5262 | 3.9071 | 0.6191 | 5 |
+ | `static_dropout_0.08` | 0.08 | 4.5326 | 3.6000 | 0.9326 | 5 |
+ | `static_dropout_0.3` | 0.30 | 4.5462 | 3.9708 | 0.5754 | 5 |
+ | `static_dropout_0.06` | 0.06 | 4.5574 | 3.5554 | 1.0020 | 5 |
+ | `static_dropout_0.04` | 0.04 | 4.5959 | 3.5030 | 1.0929 | 5 |
+ | `static_dropout_0.02` | 0.02 | 4.6558 | 3.4324 | 1.2234 | 5 |
+ | `static_dropout_0` | 0.00 | 4.7661 | 3.3658 | 1.4003 | 5 |
+
+ ### Stage 3: 2,000,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `wikitext103_formula_l12` | 0.09 | 4.2607 | 3.8089 | 0.4518 | 5 |
+ | `wikitext103_low_decay` | 0.06 | 4.2736 | 3.7474 | 0.5261 | 5 |
+ | `wikitext103_probe_blend` | 0.04 | 4.2743 | 3.7200 | 0.5543 | 5 |
+ | `static_dropout_0.14` | 0.14 | 4.2816 | 3.8287 | 0.4529 | 5 |
+ | `static_dropout_0.1` | 0.10 | 4.2857 | 3.7809 | 0.5048 | 5 |
+ | `static_dropout_0.18` | 0.18 | 4.2889 | 3.8768 | 0.4121 | 5 |
+ | `static_dropout_0.08` | 0.08 | 4.2925 | 3.7519 | 0.5406 | 5 |
+ | `static_dropout_0.2` | 0.20 | 4.2942 | 3.8982 | 0.3960 | 5 |
+ | `static_dropout_0.06` | 0.06 | 4.3059 | 3.7318 | 0.5741 | 5 |
+ | `static_dropout_0.26` | 0.26 | 4.3248 | 3.9655 | 0.3593 | 5 |
+ | `static_dropout_0.04` | 0.04 | 4.3262 | 3.7038 | 0.6225 | 5 |
+ | `static_dropout_0.3` | 0.30 | 4.3467 | 4.0097 | 0.3370 | 5 |
+ | `static_dropout_0.02` | 0.02 | 4.3551 | 3.6673 | 0.6878 | 5 |
+ | `static_dropout_0` | 0.00 | 4.4174 | 3.6409 | 0.7765 | 5 |
+
+ ### Stage 4: 4,000,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `wikitext103_formula_l12` | 0.02 | 4.0808 | 3.7991 | 0.2817 | 5 |
+ | `wikitext103_probe_blend` | 0.01 | 4.0961 | 3.7674 | 0.3287 | 5 |
+ | `wikitext103_low_decay` | 0.02 | 4.1020 | 3.7769 | 0.3251 | 5 |
+ | `static_dropout_0.1` | 0.10 | 4.1105 | 3.8417 | 0.2687 | 5 |
+ | `static_dropout_0.08` | 0.08 | 4.1116 | 3.8268 | 0.2848 | 5 |
+ | `static_dropout_0.06` | 0.06 | 4.1197 | 3.8066 | 0.3131 | 5 |
+ | `static_dropout_0.14` | 0.14 | 4.1221 | 3.8674 | 0.2548 | 5 |
+ | `static_dropout_0.18` | 0.18 | 4.1304 | 3.9015 | 0.2289 | 5 |
+ | `static_dropout_0.04` | 0.04 | 4.1331 | 3.7978 | 0.3353 | 5 |
+ | `static_dropout_0.2` | 0.20 | 4.1394 | 3.9155 | 0.2239 | 5 |
+ | `static_dropout_0.02` | 0.02 | 4.1459 | 3.7759 | 0.3700 | 5 |
+ | `static_dropout_0.26` | 0.26 | 4.1784 | 3.9775 | 0.2008 | 5 |
+ | `static_dropout_0` | 0.00 | 4.1835 | 3.7750 | 0.4085 | 5 |
+ | `static_dropout_0.3` | 0.30 | 4.1946 | 4.0127 | 0.1819 | 5 |
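The Stage 1 to Stage 4 tables can be reduced to one number per stage: the mean validation loss of the best static condition minus that of the best decay schedule. The sketch below copies the relevant table values by hand; the dictionary layout and the helper name `decay_margins` are illustrative assumptions, not part of the run code. These are differences of 5-seed means, not paired per-seed comparisons.

```python
# Mean validation losses (5 seeds each), copied from the Stage 1-4 tables.
# Layout (an assumption made for this sketch):
#   prefix tokens -> (best decay schedule, its loss, best static condition, its loss)
STAGE_BEST = {
    500_000: ("wikitext103_probe_blend", 4.7872, "static_dropout_0.2", 4.7873),
    1_000_000: ("wikitext103_formula_l12", 4.4938, "static_dropout_0.14", 4.5001),
    2_000_000: ("wikitext103_formula_l12", 4.2607, "static_dropout_0.14", 4.2816),
    4_000_000: ("wikitext103_formula_l12", 4.0808, "static_dropout_0.1", 4.1105),
}


def decay_margins(stage_best):
    """Return best-static minus best-decay mean val loss per stage (positive favors decay)."""
    return {
        tokens: round(static_loss - decay_loss, 4)
        for tokens, (_, decay_loss, _, static_loss) in stage_best.items()
    }


margins = decay_margins(STAGE_BEST)
for tokens, margin in margins.items():
    print(f"{tokens:>9,d} prefix tokens: best static - best decay = {margin:+.4f}")
```

The margin grows with the prefix size in these tables. The per-condition validation-loss standard deviations reported in `summary.csv` are on the order of 0.01 to 0.03, so the Stage 1 difference in particular is far smaller than the seed-to-seed spread.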
runs/wikitext103_l12_streaming_validation_5seed/locked_stream/20260531-093525/config.json ADDED
@@ -0,0 +1,199 @@
+ {
+   "args": {
+     "mode": "locked_stream",
+     "corpus": null,
+     "corpus_glob": null,
+     "text_column": "text",
+     "use_cached_data": true,
+     "output_dir": "runs/wikitext103_l12_streaming_validation_5seed",
+     "resume_from": null,
+     "cache_dir": ".cache/dropout_decay_wikitext103",
+     "models": [
+       "L12_H8_D320=12x8x320"
+     ],
+     "seeds": [
+       1,
+       2,
+       3,
+       4,
+       5
+     ],
+     "token_limits": [
+       5000000
+     ],
+     "stream_token_caps": [
+       250000,
+       500000,
+       1000000,
+       2000000,
+       4000000
+     ],
+     "val_tokens": 500000,
+     "allow_short_corpus": false,
+     "force_retokenize": false,
+     "vocab_size": 4096,
+     "tokenizer_train_chars": 10000000,
+     "block_size": 128,
+     "batch_size": 16,
+     "steps": 2000,
+     "stage_steps": 1000,
+     "dropout_rates": [
+       0.0,
+       0.02,
+       0.04,
+       0.06,
+       0.08,
+       0.1,
+       0.14,
+       0.18,
+       0.2,
+       0.26,
+       0.3
+     ],
+     "decays": [],
+     "anchor_decays": [
+       {
+         "name": "wikitext103_formula_l12",
+         "kind": "anchor_decay",
+         "initial": 0.3,
+         "final": 0.02,
+         "schedule": "log_prefix_anchor",
+         "decay_tokens": null,
+         "anchors": [
+           [
+             250000,
+             0.3
+           ],
+           [
+             500000,
+             0.26
+           ],
+           [
+             1000000,
+             0.18
+           ],
+           [
+             2000000,
+             0.09
+           ],
+           [
+             4000000,
+             0.02
+           ]
+         ]
+       },
+       {
+         "name": "wikitext103_probe_blend",
+         "kind": "anchor_decay",
+         "initial": 0.19,
+         "final": 0.01,
+         "schedule": "log_prefix_anchor",
+         "decay_tokens": null,
+         "anchors": [
+           [
+             250000,
+             0.19
+           ],
+           [
+             500000,
+             0.14
+           ],
+           [
+             1000000,
+             0.09
+           ],
+           [
+             2000000,
+             0.04
+           ],
+           [
+             4000000,
+             0.01
+           ]
+         ]
+       },
+       {
+         "name": "wikitext103_low_decay",
+         "kind": "anchor_decay",
+         "initial": 0.14,
+         "final": 0.02,
+         "schedule": "log_prefix_anchor",
+         "decay_tokens": null,
+         "anchors": [
+           [
+             250000,
+             0.14
+           ],
+           [
+             500000,
+             0.14
+           ],
+           [
+             1000000,
+             0.1
+           ],
+           [
+             2000000,
+             0.06
+           ],
+           [
+             4000000,
+             0.02
+           ]
+         ]
+       }
+     ],
+     "decay_tokens": null,
+     "eval_batches": 64,
+     "train_eval_batches": 32,
+     "trace_eval_batches": 8,
+     "eval_every": 0,
+     "log_every": 250,
+     "lr": 0.0003,
+     "weight_decay": 0.1,
+     "grad_clip": 1.0,
+     "plateau_delta": 0.01,
+     "target_min_dropout": 0.1,
+     "min_nonzero_margin": 0.01,
+     "min_high_dropout_margin": 0.03,
+     "screen_early_stop": false,
+     "screen_prune_patience": 3,
+     "screen_prune_min_delta": 0.01
+   },
+   "mode": "locked_stream",
+   "seeds": [
+     1,
+     2,
+     3,
+     4,
+     5
+   ],
+   "models": [
+     {
+       "model_name": "L12_H8_D320",
+       "n_layer": 12,
+       "n_head": 8,
+       "n_embd": 320
+     }
+   ],
+   "device": "mps",
+   "torch": "2.12.0",
+   "python": "3.11.15 (main, Mar 3 2026, 00:52:57) [Clang 21.0.0 (clang-2100.0.123.102)]",
+   "mps_available": true,
+   "attribution": "Derived from Andrej Karpathy's nanochat project (https://github.com/karpathy/nanochat), MIT License, Copyright (c) 2025 Andrej Karpathy.",
+   "tokenizer_path": ".cache/dropout_decay_wikitext103/tokenizer-v4096.json",
+   "encoded_path": ".cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy",
+   "train_tokens": 4500020,
+   "val_tokens": 500000,
+   "effective_token_limits": [
+     4500020
+   ],
+   "effective_stream_token_caps": [
+     250000,
+     500000,
+     1000000,
+     2000000,
+     4000000
+   ],
+   "resume_from": null
+ }
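Each `anchor_decays` entry in this config pairs a prefix-token count with a dropout rate and names the schedule `log_prefix_anchor`. The config does not show how that schedule is evaluated between anchors. The sketch below is one plausible reading, stated as an assumption: linear interpolation in the logarithm of the token count, clamped outside the anchor range. The function name `log_prefix_anchor_dropout` is hypothetical and is not taken from the run code.

```python
import math

# Anchors copied from the "wikitext103_formula_l12" entry in config.json.
FORMULA_L12_ANCHORS = [
    (250_000, 0.30),
    (500_000, 0.26),
    (1_000_000, 0.18),
    (2_000_000, 0.09),
    (4_000_000, 0.02),
]


def log_prefix_anchor_dropout(tokens, anchors):
    """ASSUMED semantics: interpolate dropout linearly in log(tokens) between anchors.

    Below the first anchor the first rate is returned; above the last anchor
    the last rate is returned.
    """
    anchors = sorted(anchors)
    if tokens <= anchors[0][0]:
        return anchors[0][1]
    if tokens >= anchors[-1][0]:
        return anchors[-1][1]
    for (t0, p0), (t1, p1) in zip(anchors, anchors[1:]):
        if t0 <= tokens <= t1:
            # Fraction of the way from t0 to t1 on a log scale.
            frac = (math.log(tokens) - math.log(t0)) / (math.log(t1) - math.log(t0))
            return p0 + frac * (p1 - p0)
    raise ValueError("unreachable for a non-empty, sorted anchor list")


for cap in (250_000, 500_000, 1_000_000, 2_000_000, 4_000_000):
    print(cap, log_prefix_anchor_dropout(cap, FORMULA_L12_ANCHORS))
```

Because every value in `stream_token_caps` coincides with an anchor, this reading returns the anchor rate at each stage, which agrees with the Dropout column reported for the decay conditions in the stage tables. The run data cannot distinguish this interpolation from any other rule that passes through the same anchors.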
runs/wikitext103_l12_streaming_validation_5seed/locked_stream/20260531-093525/metrics.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
runs/wikitext103_l12_streaming_validation_5seed/locked_stream/20260531-093525/summary.csv ADDED
@@ -0,0 +1,71 @@
+ run_mode,condition,condition_kind,stage,token_limit,model_name,n_layer,n_head,n_embd,parameters,dropout_initial,dropout_final,dropout_schedule,n,mean_train_eval_loss,std_train_eval_loss,mean_val_eval_loss,std_val_eval_loss,mean_generalization_gap,std_generalization_gap
+ locked_stream,static_dropout_0,static,0,250000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,5,3.5230478435754775,0.03634617773629946,5.340271946787834,0.02699193219713777,1.8172241032123566,0.05178790589031804
+ locked_stream,static_dropout_0.02,static,0,250000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,5,3.572465108335018,0.027061999609254494,5.274957031011581,0.025475383539799955,1.7024919226765634,0.03656967535619302
+ locked_stream,static_dropout_0.04,static,0,250000,L12_H8_D320,12,8,320,17367040,0.04,0.04,constant,5,3.6440532103180887,0.031242522590923742,5.237822580337524,0.018592151453448197,1.593769370019436,0.03528512909650908
+ locked_stream,static_dropout_0.06,static,0,250000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,5,3.712760145962238,0.018437630763482243,5.215402702987194,0.01734079554543704,1.5026425570249557,0.03154944432034942
+ locked_stream,static_dropout_0.08,static,0,250000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,5,3.761947725713253,0.017361993699706244,5.189354091882706,0.016114118388750917,1.4274063661694527,0.012218707428925264
+ locked_stream,static_dropout_0.1,static,0,250000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,5,3.8219083666801454,0.019248328397722874,5.169893845915794,0.023739040366881205,1.3479854792356492,0.03169483270954973
+ locked_stream,static_dropout_0.14,static,0,250000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,5,3.9050611019134522,0.013283976741307785,5.1635470792651175,0.021953339409384615,1.2584859773516655,0.025343683677925957
+ locked_stream,static_dropout_0.18,static,0,250000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,5,3.9964362382888794,0.01862825818296394,5.16162400841713,0.014987630048664583,1.16518777012825,0.02013674124988852
+ locked_stream,static_dropout_0.2,static,0,250000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,5,4.036272630095482,0.013668609797313208,5.170051643252373,0.014071496005973716,1.1337790131568908,0.013715020568700555
+ locked_stream,static_dropout_0.26,static,0,250000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,5,4.149615630507469,0.024675006962846267,5.194025552272796,0.016128282325428966,1.0444099217653275,0.022196467420881313
+ locked_stream,static_dropout_0.3,static,0,250000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,5,4.213146212697029,0.013218280901902867,5.214832927286625,0.018060234410376495,1.0016867145895958,0.02723842931339789
+ locked_stream,wikitext103_formula_l12,anchor_decay,0,250000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,5,4.213146224617958,0.013218263380840874,5.21483291387558,0.018060268226475495,1.0016866892576217,0.027238437205160136
+ locked_stream,wikitext103_low_decay,anchor_decay,0,250000,L12_H8_D320,12,8,320,17367040,0.14,0.02,log_prefix_anchor,5,3.9050611212849615,0.01328398150034886,5.163547059893608,0.02195330299872749,1.2584859386086464,0.025343680837745006
+ locked_stream,wikitext103_probe_blend,anchor_decay,0,250000,L12_H8_D320,12,8,320,17367040,0.19,0.01,log_prefix_anchor,5,4.020059056580067,0.016213485710936643,5.165890334546566,0.017075109484847858,1.1458312779664994,0.02016518864116932
+ locked_stream,static_dropout_0,static,1,500000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,5,3.1505795806646346,0.025699619025332174,5.174065832793713,0.025168511150111032,2.0234862521290777,0.0236535174506497
+ locked_stream,static_dropout_0.02,static,1,500000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,5,3.261230443418026,0.014168838694048941,5.045114178955555,0.016894569246234076,1.783883735537529,0.019840865900389654
+ locked_stream,static_dropout_0.04,static,1,500000,L12_H8_D320,12,8,320,17367040,0.04,0.04,constant,5,3.3515226930379867,0.029537749484816224,4.957330641150475,0.02503345339488141,1.6058079481124878,0.0202359151612378
+ locked_stream,static_dropout_0.06,static,1,500000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,5,3.4394222095608713,0.013418585935883633,4.894671627879143,0.021331675814859084,1.4552494183182716,0.01946001354297248
+ locked_stream,static_dropout_0.08,static,1,500000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,5,3.50356438010931,0.024856947856236954,4.857585413753986,0.023903772104159438,1.3540210336446763,0.0310369653543579
+ locked_stream,static_dropout_0.1,static,1,500000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,5,3.5636927232146265,0.024862277351108786,4.83315060287714,0.025840187319246186,1.2694578796625138,0.01835928668354016
+ locked_stream,static_dropout_0.14,static,1,500000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,5,3.6649743527173997,0.023120394246437293,4.800087215006352,0.019846558873074828,1.135112862288952,0.012980966896977213
+ locked_stream,static_dropout_0.18,static,1,500000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,5,3.7571836963295935,0.024732309568709827,4.794638857245445,0.02062163611688947,1.0374551609158515,0.02312552092553772
+ locked_stream,static_dropout_0.2,static,1,500000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,5,3.7914333969354628,0.018987144953080883,4.787301687896251,0.023601504941398347,0.9958682909607888,0.014572669955232898
+ locked_stream,static_dropout_0.26,static,1,500000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,5,3.905269515514374,0.02376505939443177,4.8081169500947,0.021576250073037834,0.9028474345803261,0.018700407757224247
+ locked_stream,static_dropout_0.3,static,1,500000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,5,3.976513123512268,0.010796527385978102,4.82416096329689,0.029627048936626245,0.8476478397846222,0.026893359245285242
+ locked_stream,wikitext103_formula_l12,anchor_decay,1,500000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,5,3.918210099637508,0.011987444359375697,4.80534844994545,0.027829930328897787,0.8871383503079414,0.02626272167699895
+ locked_stream,wikitext103_low_decay,anchor_decay,1,500000,L12_H8_D320,12,8,320,17367040,0.14,0.02,log_prefix_anchor,5,3.6649743929505347,0.023120354646606448,4.800087235867977,0.01984656043518159,1.1351128429174424,0.01298098541252223
+ locked_stream,wikitext103_probe_blend,anchor_decay,1,500000,L12_H8_D320,12,8,320,17367040,0.19,0.01,log_prefix_anchor,5,3.684571087360382,0.02397149156090502,4.7872447416186334,0.0268838649061173,1.1026736542582511,0.021535214146850536
+ locked_stream,static_dropout_0,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,5,3.3657630145549775,0.025194332688414405,4.766069588065148,0.03242686059194632,1.40030657351017,0.052237334934698405
+ locked_stream,static_dropout_0.02,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,5,3.4323942199349404,0.018124036007212796,4.655830132961273,0.01585338982667534,1.2234359130263328,0.024146051923035036
+ locked_stream,static_dropout_0.04,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.04,0.04,constant,5,3.502993068099022,0.017291674978225223,4.595940679311752,0.014619476464690098,1.0929476112127303,0.028916946275823
+ locked_stream,static_dropout_0.06,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,5,3.555374290049076,0.018874416206200166,4.557375983893872,0.012636266859277006,1.0020016938447953,0.030341255269329737
+ locked_stream,static_dropout_0.08,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,5,3.5999757796525955,0.009078845784580064,4.532606348395348,0.011746366682868653,0.9326305687427521,0.019403758262432406
+ locked_stream,static_dropout_0.1,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,5,3.652407945692539,0.022251919818600654,4.518611620366573,0.01351751181040844,0.8662036746740341,0.033766642849086155
+ locked_stream,static_dropout_0.14,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,5,3.716250878572464,0.02407671653248301,4.500055414438248,0.014819407090300617,0.7838045358657837,0.038101035301189086
+ locked_stream,static_dropout_0.18,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,5,3.786629955470562,0.015769997442310384,4.5023036181926726,0.018490102933264013,0.7156736627221107,0.03180040950921459
+ locked_stream,static_dropout_0.2,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,5,3.814779002964497,0.01587114098616348,4.506029562652111,0.02043622151331783,0.6912505596876144,0.03147408021666257
+ locked_stream,static_dropout_0.26,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,5,3.9071271896362303,0.02774032700763299,4.52620010226965,0.010149133513914643,0.619072912633419,0.035195638721522884
+ locked_stream,static_dropout_0.3,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,5,3.970817744731903,0.017137273586986676,4.546238152682781,0.012672630425565377,0.5754204079508781,0.02420747470801454
+ locked_stream,wikitext103_formula_l12,anchor_decay,2,1000000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,5,3.8283171117305757,0.017535628899982475,4.49377838075161,0.014674108647494532,0.6654612690210342,0.02707089996550965
+ locked_stream,wikitext103_low_decay,anchor_decay,2,1000000,L12_H8_D320,12,8,320,17367040,0.14,0.02,log_prefix_anchor,5,3.6606912568211554,0.023899445408052746,4.50133255571127,0.01578619433972172,0.8406412988901139,0.03886870429434243
+ locked_stream,wikitext103_probe_blend,anchor_decay,2,1000000,L12_H8_D320,12,8,320,17367040,0.19,0.01,log_prefix_anchor,5,3.6495467707514764,0.019490464893964263,4.49399798810482,0.015864095933133645,0.8444512173533439,0.03475725994195922
+ locked_stream,static_dropout_0,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,5,3.6409011498093604,0.014124132971310668,4.417352265119552,0.01876483626309062,0.7764511153101921,0.024211298020746064
+ locked_stream,static_dropout_0.02,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,5,3.6672953754663467,0.010966930251295475,4.355083760619164,0.026489436313039863,0.6877883851528168,0.026106538952549097
+ locked_stream,static_dropout_0.04,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.04,0.04,constant,5,3.703781445324421,0.01970997472996272,4.3262467920780185,0.016533384756725625,0.6224653467535972,0.02577206799431021
+ locked_stream,static_dropout_0.06,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,5,3.7317687764763834,0.017157479537739013,4.305900506675243,0.018234543669842573,0.5741317301988602,0.023633182905307998
+ locked_stream,static_dropout_0.08,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,5,3.7518652200698854,0.019831550751639896,4.292461846768856,0.016558424281915483,0.5405966266989708,0.034040566494281235
+ locked_stream,static_dropout_0.1,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,5,3.7808884128928186,0.01766528157534878,4.285703422129155,0.018994111876547076,0.5048150092363357,0.028072111875115907
+ locked_stream,static_dropout_0.14,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,5,3.828673002123833,0.004937581309037719,4.281563547253609,0.02138858846966249,0.452890545129776,0.01755457638984346
+ locked_stream,static_dropout_0.18,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,5,3.876796562969685,0.008357602777023191,4.288915157318115,0.019532110178580833,0.4121185943484306,0.013904819825939841
+ locked_stream,static_dropout_0.2,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,5,3.898232290148735,0.008893270752025355,4.294206875562668,0.016019909189912813,0.3959745854139328,0.01268490098888804
+ locked_stream,static_dropout_0.26,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,5,3.9654731526970863,0.018160558354021094,4.32477528527379,0.017871972287011016,0.359302132576704,0.018728321745075875
+ locked_stream,static_dropout_0.3,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,5,4.009651578962803,0.014639843511797184,4.34667577445507,0.015854348211857338,0.3370241954922676,0.019075525784314876
+ locked_stream,wikitext103_formula_l12,anchor_decay,3,2000000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,5,3.8088956013321877,0.016376701234484724,4.260725708305836,0.018098739272622868,0.45183010697364806,0.01960955973506382
+ locked_stream,wikitext103_low_decay,anchor_decay,3,2000000,L12_H8_D320,12,8,320,17367040,0.14,0.02,log_prefix_anchor,5,3.7474397838115694,0.003830485352807811,4.273568108677864,0.022770462958788148,0.5261283248662949,0.02051193157067845
+ locked_stream,wikitext103_probe_blend,anchor_decay,3,2000000,L12_H8_D320,12,8,320,17367040,0.19,0.01,log_prefix_anchor,5,3.720025636255741,0.012889615922098394,4.274319227784872,0.0173783813582558,0.5542935915291309,0.012746502538982045
+ locked_stream,static_dropout_0,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,5,3.7750062167644503,0.025453964920340872,4.183486974239349,0.01652781602088682,0.4084807574748993,0.03358628463770181
+ locked_stream,static_dropout_0.02,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,5,3.775918734073639,0.03360785862475549,4.145870254188776,0.016455755256661365,0.3699515201151371,0.0368559849747316
+ locked_stream,static_dropout_0.04,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.04,0.04,constant,5,3.7978432700037956,0.018504253509089505,4.1331265024840835,0.022679096058077455,0.3352832324802876,0.036956450676459744
+ locked_stream,static_dropout_0.06,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,5,3.806561966240406,0.022924307415498075,4.119682380557061,0.008179605355252681,0.3131204143166542,0.028667268842643943
+ locked_stream,static_dropout_0.08,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,5,3.8268220275640488,0.021203534366681782,4.111632210761309,0.01857117977689519,0.2848101831972599,0.03679859980086126
+ locked_stream,static_dropout_0.1,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,5,3.841722105443478,0.019884552697005082,4.1104619488120075,0.018806640195540816,0.2687398433685303,0.03219305563944893
+ locked_stream,static_dropout_0.14,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,5,3.8673695802688597,0.025363134860510252,4.122133788466454,0.015529540565590345,0.2547642081975937,0.0302072467927019
+ locked_stream,static_dropout_0.18,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,5,3.9014849200844766,0.018144303418082167,4.130411610752344,0.013005735782293457,0.22892669066786767,0.02966442514236336
+ locked_stream,static_dropout_0.2,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,5,3.9154964342713354,0.020116706552102667,4.1393771633505825,0.016742174044810716,0.22388072907924653,0.033233695598367155
+ locked_stream,static_dropout_0.26,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,5,3.9775424867868425,0.021977989750098077,4.178351600468159,0.01453281035529064,0.20080911368131638,0.03295020359909027
+ locked_stream,static_dropout_0.3,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,5,4.012722708284855,0.024177126067976527,4.194609892368317,0.014131822174274241,0.18188718408346177,0.031100561007876955
+ locked_stream,wikitext103_formula_l12,anchor_decay,4,4000000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,5,3.7991417959332465,0.028273745467366926,4.080797865241766,0.019458062276034267,0.2816560693085194,0.038849358858741
+ locked_stream,wikitext103_low_decay,anchor_decay,4,4000000,L12_H8_D320,12,8,320,17367040,0.14,0.02,log_prefix_anchor,5,3.776866267621517,0.02647416884246062,4.102013133466244,0.016554390946653275,0.3251468658447266,0.03314100249583806
+ locked_stream,wikitext103_probe_blend,anchor_decay,4,4000000,L12_H8_D320,12,8,320,17367040,0.19,0.01,log_prefix_anchor,5,3.7674024373292925,0.027214408565503653,4.096089626103639,0.014523730760550201,0.3286871887743473,0.038218075738052845
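Every row of `summary.csv` reports `parameters` as 17367040 for the `L12_H8_D320` model. That figure can be reproduced with simple arithmetic under an assumed layout: bias-free transformer blocks with a 4x MLP expansion, no learned normalization parameters, and separate (untied) input-embedding and output-head matrices. The run files do not state this layout; it is an assumption of this sketch, and `approx_param_count` is a hypothetical helper name.

```python
def approx_param_count(n_layer, n_embd, vocab_size):
    """Parameter count under the ASSUMED layout described in the lead-in.

    Per block: 4*d^2 for attention (q, k, v, output projections)
             + 8*d^2 for the MLP (d -> 4d -> d), all without biases.
    Embeddings: one vocab_size x d input table plus one untied output head.
    """
    per_block = 12 * n_embd * n_embd
    embeddings = 2 * vocab_size * n_embd
    return n_layer * per_block + embeddings


# n_layer, n_embd from the summary rows; vocab_size from config.json.
total = approx_param_count(n_layer=12, n_embd=320, vocab_size=4096)
print(total)  # 17367040
```

The exact match with the reported count is consistent with the assumed layout but does not prove it, since other parameter breakdowns could sum to the same total.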
runs/wikitext103_l12_streaming_validation_5seed/locked_stream/20260531-093525/summary.json ADDED
@@ -0,0 +1,1542 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "run_mode": "locked_stream",
4
+ "condition": "static_dropout_0",
5
+ "condition_kind": "static",
6
+ "stage": 0,
7
+ "token_limit": 250000,
8
+ "model_name": "L12_H8_D320",
9
+ "n_layer": 12,
10
+ "n_head": 8,
11
+ "n_embd": 320,
12
+ "parameters": 17367040,
13
+ "dropout_initial": 0.0,
14
+ "dropout_final": 0.0,
15
+ "dropout_schedule": "constant",
16
+ "n": 5,
17
+ "mean_train_eval_loss": 3.5230478435754775,
18
+ "std_train_eval_loss": 0.03634617773629946,
19
+ "mean_val_eval_loss": 5.340271946787834,
20
+ "std_val_eval_loss": 0.02699193219713777,
21
+ "mean_generalization_gap": 1.8172241032123566,
22
+ "std_generalization_gap": 0.05178790589031804
23
+ },
24
+ {
25
+ "run_mode": "locked_stream",
26
+ "condition": "static_dropout_0.02",
27
+ "condition_kind": "static",
28
+ "stage": 0,
29
+ "token_limit": 250000,
30
+ "model_name": "L12_H8_D320",
31
+ "n_layer": 12,
32
+ "n_head": 8,
33
+ "n_embd": 320,
34
+ "parameters": 17367040,
35
+ "dropout_initial": 0.02,
36
+ "dropout_final": 0.02,
37
+ "dropout_schedule": "constant",
38
+ "n": 5,
39
+ "mean_train_eval_loss": 3.572465108335018,
40
+ "std_train_eval_loss": 0.027061999609254494,
41
+ "mean_val_eval_loss": 5.274957031011581,
42
+ "std_val_eval_loss": 0.025475383539799955,
43
+ "mean_generalization_gap": 1.7024919226765634,
44
+ "std_generalization_gap": 0.03656967535619302
45
+ },
46
+ {
47
+ "run_mode": "locked_stream",
48
+ "condition": "static_dropout_0.04",
49
+ "condition_kind": "static",
50
+ "stage": 0,
51
+ "token_limit": 250000,
52
+ "model_name": "L12_H8_D320",
53
+ "n_layer": 12,
54
+ "n_head": 8,
55
+ "n_embd": 320,
56
+ "parameters": 17367040,
57
+ "dropout_initial": 0.04,
58
+ "dropout_final": 0.04,
59
+ "dropout_schedule": "constant",
60
+ "n": 5,
61
+ "mean_train_eval_loss": 3.6440532103180887,
62
+ "std_train_eval_loss": 0.031242522590923742,
63
+ "mean_val_eval_loss": 5.237822580337524,
64
+ "std_val_eval_loss": 0.018592151453448197,
65
+ "mean_generalization_gap": 1.593769370019436,
66
+ "std_generalization_gap": 0.03528512909650908
67
+ },
68
+ {
69
+ "run_mode": "locked_stream",
70
+ "condition": "static_dropout_0.06",
71
+ "condition_kind": "static",
72
+ "stage": 0,
73
+ "token_limit": 250000,
74
+ "model_name": "L12_H8_D320",
75
+ "n_layer": 12,
76
+ "n_head": 8,
77
+ "n_embd": 320,
78
+ "parameters": 17367040,
79
+ "dropout_initial": 0.06,
80
+ "dropout_final": 0.06,
81
+ "dropout_schedule": "constant",
82
+ "n": 5,
83
+ "mean_train_eval_loss": 3.712760145962238,
84
+ "std_train_eval_loss": 0.018437630763482243,
85
+ "mean_val_eval_loss": 5.215402702987194,
86
+ "std_val_eval_loss": 0.01734079554543704,
87
+ "mean_generalization_gap": 1.5026425570249557,
88
+ "std_generalization_gap": 0.03154944432034942
89
+ },
90
+ {
91
+ "run_mode": "locked_stream",
92
+ "condition": "static_dropout_0.08",
93
+ "condition_kind": "static",
94
+ "stage": 0,
95
+ "token_limit": 250000,
96
+ "model_name": "L12_H8_D320",
97
+ "n_layer": 12,
98
+ "n_head": 8,
99
+ "n_embd": 320,
100
+ "parameters": 17367040,
101
+ "dropout_initial": 0.08,
102
+ "dropout_final": 0.08,
103
+ "dropout_schedule": "constant",
104
+ "n": 5,
105
+ "mean_train_eval_loss": 3.761947725713253,
106
+ "std_train_eval_loss": 0.017361993699706244,
107
+ "mean_val_eval_loss": 5.189354091882706,
108
+ "std_val_eval_loss": 0.016114118388750917,
109
+ "mean_generalization_gap": 1.4274063661694527,
110
+ "std_generalization_gap": 0.012218707428925264
111
+ },
112
+ {
113
+ "run_mode": "locked_stream",
114
+ "condition": "static_dropout_0.1",
115
+ "condition_kind": "static",
116
+ "stage": 0,
117
+ "token_limit": 250000,
118
+ "model_name": "L12_H8_D320",
119
+ "n_layer": 12,
120
+ "n_head": 8,
121
+ "n_embd": 320,
122
+ "parameters": 17367040,
123
+ "dropout_initial": 0.1,
124
+ "dropout_final": 0.1,
125
+ "dropout_schedule": "constant",
126
+ "n": 5,
127
+ "mean_train_eval_loss": 3.8219083666801454,
128
+ "std_train_eval_loss": 0.019248328397722874,
129
+ "mean_val_eval_loss": 5.169893845915794,
130
+ "std_val_eval_loss": 0.023739040366881205,
131
+ "mean_generalization_gap": 1.3479854792356492,
132
+ "std_generalization_gap": 0.03169483270954973
133
+ },
134
+ {
135
+ "run_mode": "locked_stream",
136
+ "condition": "static_dropout_0.14",
137
+ "condition_kind": "static",
138
+ "stage": 0,
139
+ "token_limit": 250000,
140
+ "model_name": "L12_H8_D320",
141
+ "n_layer": 12,
142
+ "n_head": 8,
143
+ "n_embd": 320,
144
+ "parameters": 17367040,
145
+ "dropout_initial": 0.14,
146
+ "dropout_final": 0.14,
147
+ "dropout_schedule": "constant",
148
+ "n": 5,
149
+ "mean_train_eval_loss": 3.9050611019134522,
150
+ "std_train_eval_loss": 0.013283976741307785,
151
+ "mean_val_eval_loss": 5.1635470792651175,
152
+ "std_val_eval_loss": 0.021953339409384615,
153
+ "mean_generalization_gap": 1.2584859773516655,
154
+ "std_generalization_gap": 0.025343683677925957
155
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.18",
+ "condition_kind": "static",
+ "stage": 0,
+ "token_limit": 250000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.18,
+ "dropout_final": 0.18,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.9964362382888794,
+ "std_train_eval_loss": 0.01862825818296394,
+ "mean_val_eval_loss": 5.16162400841713,
+ "std_val_eval_loss": 0.014987630048664583,
+ "mean_generalization_gap": 1.16518777012825,
+ "std_generalization_gap": 0.02013674124988852
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.2",
+ "condition_kind": "static",
+ "stage": 0,
+ "token_limit": 250000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.2,
+ "dropout_final": 0.2,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 4.036272630095482,
+ "std_train_eval_loss": 0.013668609797313208,
+ "mean_val_eval_loss": 5.170051643252373,
+ "std_val_eval_loss": 0.014071496005973716,
+ "mean_generalization_gap": 1.1337790131568908,
+ "std_generalization_gap": 0.013715020568700555
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.26",
+ "condition_kind": "static",
+ "stage": 0,
+ "token_limit": 250000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.26,
+ "dropout_final": 0.26,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 4.149615630507469,
+ "std_train_eval_loss": 0.024675006962846267,
+ "mean_val_eval_loss": 5.194025552272796,
+ "std_val_eval_loss": 0.016128282325428966,
+ "mean_generalization_gap": 1.0444099217653275,
+ "std_generalization_gap": 0.022196467420881313
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.3",
+ "condition_kind": "static",
+ "stage": 0,
+ "token_limit": 250000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.3,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 4.213146212697029,
+ "std_train_eval_loss": 0.013218280901902867,
+ "mean_val_eval_loss": 5.214832927286625,
+ "std_val_eval_loss": 0.018060234410376495,
+ "mean_generalization_gap": 1.0016867145895958,
+ "std_generalization_gap": 0.02723842931339789
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_formula_l12",
+ "condition_kind": "anchor_decay",
+ "stage": 0,
+ "token_limit": 250000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.02,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 4.213146224617958,
+ "std_train_eval_loss": 0.013218263380840874,
+ "mean_val_eval_loss": 5.21483291387558,
+ "std_val_eval_loss": 0.018060268226475495,
+ "mean_generalization_gap": 1.0016866892576217,
+ "std_generalization_gap": 0.027238437205160136
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_low_decay",
+ "condition_kind": "anchor_decay",
+ "stage": 0,
+ "token_limit": 250000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.14,
+ "dropout_final": 0.02,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 3.9050611212849615,
+ "std_train_eval_loss": 0.01328398150034886,
+ "mean_val_eval_loss": 5.163547059893608,
+ "std_val_eval_loss": 0.02195330299872749,
+ "mean_generalization_gap": 1.2584859386086464,
+ "std_generalization_gap": 0.025343680837745006
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_probe_blend",
+ "condition_kind": "anchor_decay",
+ "stage": 0,
+ "token_limit": 250000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.19,
+ "dropout_final": 0.01,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 4.020059056580067,
+ "std_train_eval_loss": 0.016213485710936643,
+ "mean_val_eval_loss": 5.165890334546566,
+ "std_val_eval_loss": 0.017075109484847858,
+ "mean_generalization_gap": 1.1458312779664994,
+ "std_generalization_gap": 0.02016518864116932
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.0,
+ "dropout_final": 0.0,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.1505795806646346,
+ "std_train_eval_loss": 0.025699619025332174,
+ "mean_val_eval_loss": 5.174065832793713,
+ "std_val_eval_loss": 0.025168511150111032,
+ "mean_generalization_gap": 2.0234862521290777,
+ "std_generalization_gap": 0.0236535174506497
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.02",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.02,
+ "dropout_final": 0.02,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.261230443418026,
+ "std_train_eval_loss": 0.014168838694048941,
+ "mean_val_eval_loss": 5.045114178955555,
+ "std_val_eval_loss": 0.016894569246234076,
+ "mean_generalization_gap": 1.783883735537529,
+ "std_generalization_gap": 0.019840865900389654
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.04",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.04,
+ "dropout_final": 0.04,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.3515226930379867,
+ "std_train_eval_loss": 0.029537749484816224,
+ "mean_val_eval_loss": 4.957330641150475,
+ "std_val_eval_loss": 0.02503345339488141,
+ "mean_generalization_gap": 1.6058079481124878,
+ "std_generalization_gap": 0.0202359151612378
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.06",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.06,
+ "dropout_final": 0.06,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.4394222095608713,
+ "std_train_eval_loss": 0.013418585935883633,
+ "mean_val_eval_loss": 4.894671627879143,
+ "std_val_eval_loss": 0.021331675814859084,
+ "mean_generalization_gap": 1.4552494183182716,
+ "std_generalization_gap": 0.01946001354297248
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.08",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.08,
+ "dropout_final": 0.08,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.50356438010931,
+ "std_train_eval_loss": 0.024856947856236954,
+ "mean_val_eval_loss": 4.857585413753986,
+ "std_val_eval_loss": 0.023903772104159438,
+ "mean_generalization_gap": 1.3540210336446763,
+ "std_generalization_gap": 0.0310369653543579
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.1",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.1,
+ "dropout_final": 0.1,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.5636927232146265,
+ "std_train_eval_loss": 0.024862277351108786,
+ "mean_val_eval_loss": 4.83315060287714,
+ "std_val_eval_loss": 0.025840187319246186,
+ "mean_generalization_gap": 1.2694578796625138,
+ "std_generalization_gap": 0.01835928668354016
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.14",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.14,
+ "dropout_final": 0.14,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.6649743527173997,
+ "std_train_eval_loss": 0.023120394246437293,
+ "mean_val_eval_loss": 4.800087215006352,
+ "std_val_eval_loss": 0.019846558873074828,
+ "mean_generalization_gap": 1.135112862288952,
+ "std_generalization_gap": 0.012980966896977213
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.18",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.18,
+ "dropout_final": 0.18,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.7571836963295935,
+ "std_train_eval_loss": 0.024732309568709827,
+ "mean_val_eval_loss": 4.794638857245445,
+ "std_val_eval_loss": 0.02062163611688947,
+ "mean_generalization_gap": 1.0374551609158515,
+ "std_generalization_gap": 0.02312552092553772
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.2",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.2,
+ "dropout_final": 0.2,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.7914333969354628,
+ "std_train_eval_loss": 0.018987144953080883,
+ "mean_val_eval_loss": 4.787301687896251,
+ "std_val_eval_loss": 0.023601504941398347,
+ "mean_generalization_gap": 0.9958682909607888,
+ "std_generalization_gap": 0.014572669955232898
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.26",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.26,
+ "dropout_final": 0.26,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.905269515514374,
+ "std_train_eval_loss": 0.02376505939443177,
+ "mean_val_eval_loss": 4.8081169500947,
+ "std_val_eval_loss": 0.021576250073037834,
+ "mean_generalization_gap": 0.9028474345803261,
+ "std_generalization_gap": 0.018700407757224247
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.3",
+ "condition_kind": "static",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.3,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.976513123512268,
+ "std_train_eval_loss": 0.010796527385978102,
+ "mean_val_eval_loss": 4.82416096329689,
+ "std_val_eval_loss": 0.029627048936626245,
+ "mean_generalization_gap": 0.8476478397846222,
+ "std_generalization_gap": 0.026893359245285242
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_formula_l12",
+ "condition_kind": "anchor_decay",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.02,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 3.918210099637508,
+ "std_train_eval_loss": 0.011987444359375697,
+ "mean_val_eval_loss": 4.80534844994545,
+ "std_val_eval_loss": 0.027829930328897787,
+ "mean_generalization_gap": 0.8871383503079414,
+ "std_generalization_gap": 0.02626272167699895
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_low_decay",
+ "condition_kind": "anchor_decay",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.14,
+ "dropout_final": 0.02,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 3.6649743929505347,
+ "std_train_eval_loss": 0.023120354646606448,
+ "mean_val_eval_loss": 4.800087235867977,
+ "std_val_eval_loss": 0.01984656043518159,
+ "mean_generalization_gap": 1.1351128429174424,
+ "std_generalization_gap": 0.01298098541252223
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_probe_blend",
+ "condition_kind": "anchor_decay",
+ "stage": 1,
+ "token_limit": 500000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.19,
+ "dropout_final": 0.01,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 3.684571087360382,
+ "std_train_eval_loss": 0.02397149156090502,
+ "mean_val_eval_loss": 4.7872447416186334,
+ "std_val_eval_loss": 0.0268838649061173,
+ "mean_generalization_gap": 1.1026736542582511,
+ "std_generalization_gap": 0.021535214146850536
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.0,
+ "dropout_final": 0.0,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.3657630145549775,
+ "std_train_eval_loss": 0.025194332688414405,
+ "mean_val_eval_loss": 4.766069588065148,
+ "std_val_eval_loss": 0.03242686059194632,
+ "mean_generalization_gap": 1.40030657351017,
+ "std_generalization_gap": 0.052237334934698405
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.02",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.02,
+ "dropout_final": 0.02,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.4323942199349404,
+ "std_train_eval_loss": 0.018124036007212796,
+ "mean_val_eval_loss": 4.655830132961273,
+ "std_val_eval_loss": 0.01585338982667534,
+ "mean_generalization_gap": 1.2234359130263328,
+ "std_generalization_gap": 0.024146051923035036
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.04",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.04,
+ "dropout_final": 0.04,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.502993068099022,
+ "std_train_eval_loss": 0.017291674978225223,
+ "mean_val_eval_loss": 4.595940679311752,
+ "std_val_eval_loss": 0.014619476464690098,
+ "mean_generalization_gap": 1.0929476112127303,
+ "std_generalization_gap": 0.028916946275823
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.06",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.06,
+ "dropout_final": 0.06,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.555374290049076,
+ "std_train_eval_loss": 0.018874416206200166,
+ "mean_val_eval_loss": 4.557375983893872,
+ "std_val_eval_loss": 0.012636266859277006,
+ "mean_generalization_gap": 1.0020016938447953,
+ "std_generalization_gap": 0.030341255269329737
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.08",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.08,
+ "dropout_final": 0.08,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.5999757796525955,
+ "std_train_eval_loss": 0.009078845784580064,
+ "mean_val_eval_loss": 4.532606348395348,
+ "std_val_eval_loss": 0.011746366682868653,
+ "mean_generalization_gap": 0.9326305687427521,
+ "std_generalization_gap": 0.019403758262432406
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.1",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.1,
+ "dropout_final": 0.1,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.652407945692539,
+ "std_train_eval_loss": 0.022251919818600654,
+ "mean_val_eval_loss": 4.518611620366573,
+ "std_val_eval_loss": 0.01351751181040844,
+ "mean_generalization_gap": 0.8662036746740341,
+ "std_generalization_gap": 0.033766642849086155
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.14",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.14,
+ "dropout_final": 0.14,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.716250878572464,
+ "std_train_eval_loss": 0.02407671653248301,
+ "mean_val_eval_loss": 4.500055414438248,
+ "std_val_eval_loss": 0.014819407090300617,
+ "mean_generalization_gap": 0.7838045358657837,
+ "std_generalization_gap": 0.038101035301189086
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.18",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.18,
+ "dropout_final": 0.18,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.786629955470562,
+ "std_train_eval_loss": 0.015769997442310384,
+ "mean_val_eval_loss": 4.5023036181926726,
+ "std_val_eval_loss": 0.018490102933264013,
+ "mean_generalization_gap": 0.7156736627221107,
+ "std_generalization_gap": 0.03180040950921459
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.2",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.2,
+ "dropout_final": 0.2,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.814779002964497,
+ "std_train_eval_loss": 0.01587114098616348,
+ "mean_val_eval_loss": 4.506029562652111,
+ "std_val_eval_loss": 0.02043622151331783,
+ "mean_generalization_gap": 0.6912505596876144,
+ "std_generalization_gap": 0.03147408021666257
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.26",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.26,
+ "dropout_final": 0.26,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.9071271896362303,
+ "std_train_eval_loss": 0.02774032700763299,
+ "mean_val_eval_loss": 4.52620010226965,
+ "std_val_eval_loss": 0.010149133513914643,
+ "mean_generalization_gap": 0.619072912633419,
+ "std_generalization_gap": 0.035195638721522884
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.3",
+ "condition_kind": "static",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.3,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.970817744731903,
+ "std_train_eval_loss": 0.017137273586986676,
+ "mean_val_eval_loss": 4.546238152682781,
+ "std_val_eval_loss": 0.012672630425565377,
+ "mean_generalization_gap": 0.5754204079508781,
+ "std_generalization_gap": 0.02420747470801454
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_formula_l12",
+ "condition_kind": "anchor_decay",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.02,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 3.8283171117305757,
+ "std_train_eval_loss": 0.017535628899982475,
+ "mean_val_eval_loss": 4.49377838075161,
+ "std_val_eval_loss": 0.014674108647494532,
+ "mean_generalization_gap": 0.6654612690210342,
+ "std_generalization_gap": 0.02707089996550965
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_low_decay",
+ "condition_kind": "anchor_decay",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.14,
+ "dropout_final": 0.02,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 3.6606912568211554,
+ "std_train_eval_loss": 0.023899445408052746,
+ "mean_val_eval_loss": 4.50133255571127,
+ "std_val_eval_loss": 0.01578619433972172,
+ "mean_generalization_gap": 0.8406412988901139,
+ "std_generalization_gap": 0.03886870429434243
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "wikitext103_probe_blend",
+ "condition_kind": "anchor_decay",
+ "stage": 2,
+ "token_limit": 1000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.19,
+ "dropout_final": 0.01,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 5,
+ "mean_train_eval_loss": 3.6495467707514764,
+ "std_train_eval_loss": 0.019490464893964263,
+ "mean_val_eval_loss": 4.49399798810482,
+ "std_val_eval_loss": 0.015864095933133645,
+ "mean_generalization_gap": 0.8444512173533439,
+ "std_generalization_gap": 0.03475725994195922
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.0,
+ "dropout_final": 0.0,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.6409011498093604,
+ "std_train_eval_loss": 0.014124132971310668,
+ "mean_val_eval_loss": 4.417352265119552,
+ "std_val_eval_loss": 0.01876483626309062,
+ "mean_generalization_gap": 0.7764511153101921,
+ "std_generalization_gap": 0.024211298020746064
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.02",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.02,
+ "dropout_final": 0.02,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.6672953754663467,
+ "std_train_eval_loss": 0.010966930251295475,
+ "mean_val_eval_loss": 4.355083760619164,
+ "std_val_eval_loss": 0.026489436313039863,
+ "mean_generalization_gap": 0.6877883851528168,
+ "std_generalization_gap": 0.026106538952549097
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.04",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.04,
+ "dropout_final": 0.04,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.703781445324421,
+ "std_train_eval_loss": 0.01970997472996272,
+ "mean_val_eval_loss": 4.3262467920780185,
+ "std_val_eval_loss": 0.016533384756725625,
+ "mean_generalization_gap": 0.6224653467535972,
+ "std_generalization_gap": 0.02577206799431021
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.06",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.06,
+ "dropout_final": 0.06,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.7317687764763834,
+ "std_train_eval_loss": 0.017157479537739013,
+ "mean_val_eval_loss": 4.305900506675243,
+ "std_val_eval_loss": 0.018234543669842573,
+ "mean_generalization_gap": 0.5741317301988602,
+ "std_generalization_gap": 0.023633182905307998
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.08",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.08,
+ "dropout_final": 0.08,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.7518652200698854,
+ "std_train_eval_loss": 0.019831550751639896,
+ "mean_val_eval_loss": 4.292461846768856,
+ "std_val_eval_loss": 0.016558424281915483,
+ "mean_generalization_gap": 0.5405966266989708,
+ "std_generalization_gap": 0.034040566494281235
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.1",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.1,
+ "dropout_final": 0.1,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.7808884128928186,
+ "std_train_eval_loss": 0.01766528157534878,
+ "mean_val_eval_loss": 4.285703422129155,
+ "std_val_eval_loss": 0.018994111876547076,
+ "mean_generalization_gap": 0.5048150092363357,
+ "std_generalization_gap": 0.028072111875115907
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.14",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.14,
+ "dropout_final": 0.14,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.828673002123833,
+ "std_train_eval_loss": 0.004937581309037719,
+ "mean_val_eval_loss": 4.281563547253609,
+ "std_val_eval_loss": 0.02138858846966249,
+ "mean_generalization_gap": 0.452890545129776,
+ "std_generalization_gap": 0.01755457638984346
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.18",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.18,
+ "dropout_final": 0.18,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.876796562969685,
+ "std_train_eval_loss": 0.008357602777023191,
+ "mean_val_eval_loss": 4.288915157318115,
+ "std_val_eval_loss": 0.019532110178580833,
+ "mean_generalization_gap": 0.4121185943484306,
+ "std_generalization_gap": 0.013904819825939841
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.2",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.2,
+ "dropout_final": 0.2,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.898232290148735,
+ "std_train_eval_loss": 0.008893270752025355,
+ "mean_val_eval_loss": 4.294206875562668,
+ "std_val_eval_loss": 0.016019909189912813,
+ "mean_generalization_gap": 0.3959745854139328,
+ "std_generalization_gap": 0.01268490098888804
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.26",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.26,
+ "dropout_final": 0.26,
+ "dropout_schedule": "constant",
+ "n": 5,
+ "mean_train_eval_loss": 3.9654731526970863,
+ "std_train_eval_loss": 0.018160558354021094,
+ "mean_val_eval_loss": 4.32477528527379,
+ "std_val_eval_loss": 0.017871972287011016,
+ "mean_generalization_gap": 0.359302132576704,
+ "std_generalization_gap": 0.018728321745075875
+ },
1146
+ {
1147
+ "run_mode": "locked_stream",
1148
+ "condition": "static_dropout_0.3",
1149
+ "condition_kind": "static",
1150
+ "stage": 3,
1151
+ "token_limit": 2000000,
1152
+ "model_name": "L12_H8_D320",
1153
+ "n_layer": 12,
1154
+ "n_head": 8,
1155
+ "n_embd": 320,
1156
+ "parameters": 17367040,
1157
+ "dropout_initial": 0.3,
1158
+ "dropout_final": 0.3,
1159
+ "dropout_schedule": "constant",
1160
+ "n": 5,
1161
+ "mean_train_eval_loss": 4.009651578962803,
1162
+ "std_train_eval_loss": 0.014639843511797184,
1163
+ "mean_val_eval_loss": 4.34667577445507,
1164
+ "std_val_eval_loss": 0.015854348211857338,
1165
+ "mean_generalization_gap": 0.3370241954922676,
1166
+ "std_generalization_gap": 0.019075525784314876
1167
+ },
1168
+ {
1169
+ "run_mode": "locked_stream",
1170
+ "condition": "wikitext103_formula_l12",
1171
+ "condition_kind": "anchor_decay",
1172
+ "stage": 3,
1173
+ "token_limit": 2000000,
1174
+ "model_name": "L12_H8_D320",
1175
+ "n_layer": 12,
1176
+ "n_head": 8,
1177
+ "n_embd": 320,
1178
+ "parameters": 17367040,
1179
+ "dropout_initial": 0.3,
1180
+ "dropout_final": 0.02,
1181
+ "dropout_schedule": "log_prefix_anchor",
1182
+ "n": 5,
1183
+ "mean_train_eval_loss": 3.8088956013321877,
1184
+ "std_train_eval_loss": 0.016376701234484724,
1185
+ "mean_val_eval_loss": 4.260725708305836,
1186
+ "std_val_eval_loss": 0.018098739272622868,
1187
+ "mean_generalization_gap": 0.45183010697364806,
1188
+ "std_generalization_gap": 0.01960955973506382
1189
+ },
1190
+ {
1191
+ "run_mode": "locked_stream",
1192
+ "condition": "wikitext103_low_decay",
1193
+ "condition_kind": "anchor_decay",
1194
+ "stage": 3,
1195
+ "token_limit": 2000000,
1196
+ "model_name": "L12_H8_D320",
1197
+ "n_layer": 12,
1198
+ "n_head": 8,
1199
+ "n_embd": 320,
1200
+ "parameters": 17367040,
1201
+ "dropout_initial": 0.14,
1202
+ "dropout_final": 0.02,
1203
+ "dropout_schedule": "log_prefix_anchor",
1204
+ "n": 5,
1205
+ "mean_train_eval_loss": 3.7474397838115694,
1206
+ "std_train_eval_loss": 0.003830485352807811,
1207
+ "mean_val_eval_loss": 4.273568108677864,
1208
+ "std_val_eval_loss": 0.022770462958788148,
1209
+ "mean_generalization_gap": 0.5261283248662949,
1210
+ "std_generalization_gap": 0.02051193157067845
1211
+ },
1212
+ {
1213
+ "run_mode": "locked_stream",
1214
+ "condition": "wikitext103_probe_blend",
1215
+ "condition_kind": "anchor_decay",
1216
+ "stage": 3,
1217
+ "token_limit": 2000000,
1218
+ "model_name": "L12_H8_D320",
1219
+ "n_layer": 12,
1220
+ "n_head": 8,
1221
+ "n_embd": 320,
1222
+ "parameters": 17367040,
1223
+ "dropout_initial": 0.19,
1224
+ "dropout_final": 0.01,
1225
+ "dropout_schedule": "log_prefix_anchor",
1226
+ "n": 5,
1227
+ "mean_train_eval_loss": 3.720025636255741,
1228
+ "std_train_eval_loss": 0.012889615922098394,
1229
+ "mean_val_eval_loss": 4.274319227784872,
1230
+ "std_val_eval_loss": 0.0173783813582558,
1231
+ "mean_generalization_gap": 0.5542935915291309,
1232
+ "std_generalization_gap": 0.012746502538982045
1233
+ },
1234
+ {
1235
+ "run_mode": "locked_stream",
1236
+ "condition": "static_dropout_0",
1237
+ "condition_kind": "static",
1238
+ "stage": 4,
1239
+ "token_limit": 4000000,
1240
+ "model_name": "L12_H8_D320",
1241
+ "n_layer": 12,
1242
+ "n_head": 8,
1243
+ "n_embd": 320,
1244
+ "parameters": 17367040,
1245
+ "dropout_initial": 0.0,
1246
+ "dropout_final": 0.0,
1247
+ "dropout_schedule": "constant",
1248
+ "n": 5,
1249
+ "mean_train_eval_loss": 3.7750062167644503,
1250
+ "std_train_eval_loss": 0.025453964920340872,
1251
+ "mean_val_eval_loss": 4.183486974239349,
1252
+ "std_val_eval_loss": 0.01652781602088682,
1253
+ "mean_generalization_gap": 0.4084807574748993,
1254
+ "std_generalization_gap": 0.03358628463770181
1255
+ },
1256
+ {
1257
+ "run_mode": "locked_stream",
1258
+ "condition": "static_dropout_0.02",
1259
+ "condition_kind": "static",
1260
+ "stage": 4,
1261
+ "token_limit": 4000000,
1262
+ "model_name": "L12_H8_D320",
1263
+ "n_layer": 12,
1264
+ "n_head": 8,
1265
+ "n_embd": 320,
1266
+ "parameters": 17367040,
1267
+ "dropout_initial": 0.02,
1268
+ "dropout_final": 0.02,
1269
+ "dropout_schedule": "constant",
1270
+ "n": 5,
1271
+ "mean_train_eval_loss": 3.775918734073639,
1272
+ "std_train_eval_loss": 0.03360785862475549,
1273
+ "mean_val_eval_loss": 4.145870254188776,
1274
+ "std_val_eval_loss": 0.016455755256661365,
1275
+ "mean_generalization_gap": 0.3699515201151371,
1276
+ "std_generalization_gap": 0.0368559849747316
1277
+ },
1278
+ {
1279
+ "run_mode": "locked_stream",
1280
+ "condition": "static_dropout_0.04",
1281
+ "condition_kind": "static",
1282
+ "stage": 4,
1283
+ "token_limit": 4000000,
1284
+ "model_name": "L12_H8_D320",
1285
+ "n_layer": 12,
1286
+ "n_head": 8,
1287
+ "n_embd": 320,
1288
+ "parameters": 17367040,
1289
+ "dropout_initial": 0.04,
1290
+ "dropout_final": 0.04,
1291
+ "dropout_schedule": "constant",
1292
+ "n": 5,
1293
+ "mean_train_eval_loss": 3.7978432700037956,
1294
+ "std_train_eval_loss": 0.018504253509089505,
1295
+ "mean_val_eval_loss": 4.1331265024840835,
1296
+ "std_val_eval_loss": 0.022679096058077455,
1297
+ "mean_generalization_gap": 0.3352832324802876,
1298
+ "std_generalization_gap": 0.036956450676459744
1299
+ },
1300
+ {
1301
+ "run_mode": "locked_stream",
1302
+ "condition": "static_dropout_0.06",
1303
+ "condition_kind": "static",
1304
+ "stage": 4,
1305
+ "token_limit": 4000000,
1306
+ "model_name": "L12_H8_D320",
1307
+ "n_layer": 12,
1308
+ "n_head": 8,
1309
+ "n_embd": 320,
1310
+ "parameters": 17367040,
1311
+ "dropout_initial": 0.06,
1312
+ "dropout_final": 0.06,
1313
+ "dropout_schedule": "constant",
1314
+ "n": 5,
1315
+ "mean_train_eval_loss": 3.806561966240406,
1316
+ "std_train_eval_loss": 0.022924307415498075,
1317
+ "mean_val_eval_loss": 4.119682380557061,
1318
+ "std_val_eval_loss": 0.008179605355252681,
1319
+ "mean_generalization_gap": 0.3131204143166542,
1320
+ "std_generalization_gap": 0.028667268842643943
1321
+ },
1322
+ {
1323
+ "run_mode": "locked_stream",
1324
+ "condition": "static_dropout_0.08",
1325
+ "condition_kind": "static",
1326
+ "stage": 4,
1327
+ "token_limit": 4000000,
1328
+ "model_name": "L12_H8_D320",
1329
+ "n_layer": 12,
1330
+ "n_head": 8,
1331
+ "n_embd": 320,
1332
+ "parameters": 17367040,
1333
+ "dropout_initial": 0.08,
1334
+ "dropout_final": 0.08,
1335
+ "dropout_schedule": "constant",
1336
+ "n": 5,
1337
+ "mean_train_eval_loss": 3.8268220275640488,
1338
+ "std_train_eval_loss": 0.021203534366681782,
1339
+ "mean_val_eval_loss": 4.111632210761309,
1340
+ "std_val_eval_loss": 0.01857117977689519,
1341
+ "mean_generalization_gap": 0.2848101831972599,
1342
+ "std_generalization_gap": 0.03679859980086126
1343
+ },
1344
+ {
1345
+ "run_mode": "locked_stream",
1346
+ "condition": "static_dropout_0.1",
1347
+ "condition_kind": "static",
1348
+ "stage": 4,
1349
+ "token_limit": 4000000,
1350
+ "model_name": "L12_H8_D320",
1351
+ "n_layer": 12,
1352
+ "n_head": 8,
1353
+ "n_embd": 320,
1354
+ "parameters": 17367040,
1355
+ "dropout_initial": 0.1,
1356
+ "dropout_final": 0.1,
1357
+ "dropout_schedule": "constant",
1358
+ "n": 5,
1359
+ "mean_train_eval_loss": 3.841722105443478,
1360
+ "std_train_eval_loss": 0.019884552697005082,
1361
+ "mean_val_eval_loss": 4.1104619488120075,
1362
+ "std_val_eval_loss": 0.018806640195540816,
1363
+ "mean_generalization_gap": 0.2687398433685303,
1364
+ "std_generalization_gap": 0.03219305563944893
1365
+ },
1366
+ {
1367
+ "run_mode": "locked_stream",
1368
+ "condition": "static_dropout_0.14",
1369
+ "condition_kind": "static",
1370
+ "stage": 4,
1371
+ "token_limit": 4000000,
1372
+ "model_name": "L12_H8_D320",
1373
+ "n_layer": 12,
1374
+ "n_head": 8,
1375
+ "n_embd": 320,
1376
+ "parameters": 17367040,
1377
+ "dropout_initial": 0.14,
1378
+ "dropout_final": 0.14,
1379
+ "dropout_schedule": "constant",
1380
+ "n": 5,
1381
+ "mean_train_eval_loss": 3.8673695802688597,
1382
+ "std_train_eval_loss": 0.025363134860510252,
1383
+ "mean_val_eval_loss": 4.122133788466454,
1384
+ "std_val_eval_loss": 0.015529540565590345,
1385
+ "mean_generalization_gap": 0.2547642081975937,
1386
+ "std_generalization_gap": 0.0302072467927019
1387
+ },
1388
+ {
1389
+ "run_mode": "locked_stream",
1390
+ "condition": "static_dropout_0.18",
1391
+ "condition_kind": "static",
1392
+ "stage": 4,
1393
+ "token_limit": 4000000,
1394
+ "model_name": "L12_H8_D320",
1395
+ "n_layer": 12,
1396
+ "n_head": 8,
1397
+ "n_embd": 320,
1398
+ "parameters": 17367040,
1399
+ "dropout_initial": 0.18,
1400
+ "dropout_final": 0.18,
1401
+ "dropout_schedule": "constant",
1402
+ "n": 5,
1403
+ "mean_train_eval_loss": 3.9014849200844766,
1404
+ "std_train_eval_loss": 0.018144303418082167,
1405
+ "mean_val_eval_loss": 4.130411610752344,
1406
+ "std_val_eval_loss": 0.013005735782293457,
1407
+ "mean_generalization_gap": 0.22892669066786767,
1408
+ "std_generalization_gap": 0.02966442514236336
1409
+ },
1410
+ {
1411
+ "run_mode": "locked_stream",
1412
+ "condition": "static_dropout_0.2",
1413
+ "condition_kind": "static",
1414
+ "stage": 4,
1415
+ "token_limit": 4000000,
1416
+ "model_name": "L12_H8_D320",
1417
+ "n_layer": 12,
1418
+ "n_head": 8,
1419
+ "n_embd": 320,
1420
+ "parameters": 17367040,
1421
+ "dropout_initial": 0.2,
1422
+ "dropout_final": 0.2,
1423
+ "dropout_schedule": "constant",
1424
+ "n": 5,
1425
+ "mean_train_eval_loss": 3.9154964342713354,
1426
+ "std_train_eval_loss": 0.020116706552102667,
1427
+ "mean_val_eval_loss": 4.1393771633505825,
1428
+ "std_val_eval_loss": 0.016742174044810716,
1429
+ "mean_generalization_gap": 0.22388072907924653,
1430
+ "std_generalization_gap": 0.033233695598367155
1431
+ },
1432
+ {
1433
+ "run_mode": "locked_stream",
1434
+ "condition": "static_dropout_0.26",
1435
+ "condition_kind": "static",
1436
+ "stage": 4,
1437
+ "token_limit": 4000000,
1438
+ "model_name": "L12_H8_D320",
1439
+ "n_layer": 12,
1440
+ "n_head": 8,
1441
+ "n_embd": 320,
1442
+ "parameters": 17367040,
1443
+ "dropout_initial": 0.26,
1444
+ "dropout_final": 0.26,
1445
+ "dropout_schedule": "constant",
1446
+ "n": 5,
1447
+ "mean_train_eval_loss": 3.9775424867868425,
1448
+ "std_train_eval_loss": 0.021977989750098077,
1449
+ "mean_val_eval_loss": 4.178351600468159,
1450
+ "std_val_eval_loss": 0.01453281035529064,
1451
+ "mean_generalization_gap": 0.20080911368131638,
1452
+ "std_generalization_gap": 0.03295020359909027
1453
+ },
1454
+ {
1455
+ "run_mode": "locked_stream",
1456
+ "condition": "static_dropout_0.3",
1457
+ "condition_kind": "static",
1458
+ "stage": 4,
1459
+ "token_limit": 4000000,
1460
+ "model_name": "L12_H8_D320",
1461
+ "n_layer": 12,
1462
+ "n_head": 8,
1463
+ "n_embd": 320,
1464
+ "parameters": 17367040,
1465
+ "dropout_initial": 0.3,
1466
+ "dropout_final": 0.3,
1467
+ "dropout_schedule": "constant",
1468
+ "n": 5,
1469
+ "mean_train_eval_loss": 4.012722708284855,
1470
+ "std_train_eval_loss": 0.024177126067976527,
1471
+ "mean_val_eval_loss": 4.194609892368317,
1472
+ "std_val_eval_loss": 0.014131822174274241,
1473
+ "mean_generalization_gap": 0.18188718408346177,
1474
+ "std_generalization_gap": 0.031100561007876955
1475
+ },
1476
+ {
1477
+ "run_mode": "locked_stream",
1478
+ "condition": "wikitext103_formula_l12",
1479
+ "condition_kind": "anchor_decay",
1480
+ "stage": 4,
1481
+ "token_limit": 4000000,
1482
+ "model_name": "L12_H8_D320",
1483
+ "n_layer": 12,
1484
+ "n_head": 8,
1485
+ "n_embd": 320,
1486
+ "parameters": 17367040,
1487
+ "dropout_initial": 0.3,
1488
+ "dropout_final": 0.02,
1489
+ "dropout_schedule": "log_prefix_anchor",
1490
+ "n": 5,
1491
+ "mean_train_eval_loss": 3.7991417959332465,
1492
+ "std_train_eval_loss": 0.028273745467366926,
1493
+ "mean_val_eval_loss": 4.080797865241766,
1494
+ "std_val_eval_loss": 0.019458062276034267,
1495
+ "mean_generalization_gap": 0.2816560693085194,
1496
+ "std_generalization_gap": 0.038849358858741
1497
+ },
1498
+ {
1499
+ "run_mode": "locked_stream",
1500
+ "condition": "wikitext103_low_decay",
1501
+ "condition_kind": "anchor_decay",
1502
+ "stage": 4,
1503
+ "token_limit": 4000000,
1504
+ "model_name": "L12_H8_D320",
1505
+ "n_layer": 12,
1506
+ "n_head": 8,
1507
+ "n_embd": 320,
1508
+ "parameters": 17367040,
1509
+ "dropout_initial": 0.14,
1510
+ "dropout_final": 0.02,
1511
+ "dropout_schedule": "log_prefix_anchor",
1512
+ "n": 5,
1513
+ "mean_train_eval_loss": 3.776866267621517,
1514
+ "std_train_eval_loss": 0.02647416884246062,
1515
+ "mean_val_eval_loss": 4.102013133466244,
1516
+ "std_val_eval_loss": 0.016554390946653275,
1517
+ "mean_generalization_gap": 0.3251468658447266,
1518
+ "std_generalization_gap": 0.03314100249583806
1519
+ },
1520
+ {
1521
+ "run_mode": "locked_stream",
1522
+ "condition": "wikitext103_probe_blend",
1523
+ "condition_kind": "anchor_decay",
1524
+ "stage": 4,
1525
+ "token_limit": 4000000,
1526
+ "model_name": "L12_H8_D320",
1527
+ "n_layer": 12,
1528
+ "n_head": 8,
1529
+ "n_embd": 320,
1530
+ "parameters": 17367040,
1531
+ "dropout_initial": 0.19,
1532
+ "dropout_final": 0.01,
1533
+ "dropout_schedule": "log_prefix_anchor",
1534
+ "n": 5,
1535
+ "mean_train_eval_loss": 3.7674024373292925,
1536
+ "std_train_eval_loss": 0.027214408565503653,
1537
+ "mean_val_eval_loss": 4.096089626103639,
1538
+ "std_val_eval_loss": 0.014523730760550201,
1539
+ "mean_generalization_gap": 0.3286871887743473,
1540
+ "std_generalization_gap": 0.038218075738052845
1541
+ }
1542
+ ]
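As a sanity check on the summary records above, `mean_generalization_gap` appears to equal `mean_val_eval_loss - mean_train_eval_loss`. The sketch below verifies this on two stage-4 records copied from the JSON. The helper name `gap_matches` and the tolerance are illustrative assumptions, not part of the repository.

```python
import math

# Two stage-4 (token_limit 4000000) records copied from the summary JSON above.
records = [
    {
        "condition": "static_dropout_0.1",
        "mean_train_eval_loss": 3.841722105443478,
        "mean_val_eval_loss": 4.1104619488120075,
        "mean_generalization_gap": 0.2687398433685303,
    },
    {
        "condition": "wikitext103_formula_l12",
        "mean_train_eval_loss": 3.7991417959332465,
        "mean_val_eval_loss": 4.080797865241766,
        "mean_generalization_gap": 0.2816560693085194,
    },
]


def gap_matches(rec, tol=1e-9):
    """Hypothetical helper: check gap == val loss - train loss within tol."""
    recomputed = rec["mean_val_eval_loss"] - rec["mean_train_eval_loss"]
    return math.isclose(recomputed, rec["mean_generalization_gap"], abs_tol=tol)


all_match = all(gap_matches(r) for r in records)

# Difference in mean final validation loss: static 0.1 minus the formula schedule.
val_improvement = records[0]["mean_val_eval_loss"] - records[1]["mean_val_eval_loss"]
print(all_match, val_improvement)
```

Note that the decay schedule has the lower validation loss but the slightly larger gap, so the gap alone should not be used to rank conditions.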
runs/wikitext103_l12_streaming_validation_5seed/locked_stream/20260531-093525/trace.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
runs/wikitext103_streaming_report/l12_validation_5seed/condition_summary.csv ADDED
@@ -0,0 +1,15 @@
+ condition,kind,n,mean_trajectory_val,std_trajectory_val,mean_final_val,std_final_val,mean_final_gap,std_final_gap,dropout_path
+ wikitext103_formula_l12,anchor_decay,5,4.571096663624049,0.004493905823534624,4.080797865241766,0.019458062276034267,0.2816560693085194,0.038849358858741,0.30 -> 0.26 -> 0.18 -> 0.09 -> 0.02
+ wikitext103_probe_blend,anchor_decay,5,4.5635083836317065,0.0046223860048219215,4.096089626103639,0.014523730760550201,0.3286871887743473,0.038218075738052845,0.19 -> 0.14 -> 0.09 -> 0.04 -> 0.01
+ wikitext103_low_decay,anchor_decay,5,4.568109618723392,0.007251133716114802,4.102013133466244,0.016554390946653275,0.3251468658447266,0.03314100249583806,0.14 -> 0.14 -> 0.10 -> 0.06 -> 0.02
+ static_dropout_0.1,static,5,4.583564288020134,0.00624825224116908,4.1104619488120075,0.018806640195540816,0.2687398433685303,0.03219305563944893,0.10 -> 0.10 -> 0.10 -> 0.10 -> 0.10
+ static_dropout_0.08,static,5,4.596727982312441,0.007294540304515209,4.111632210761309,0.01857117977689519,0.2848101831972599,0.03679859980086126,0.08 -> 0.08 -> 0.08 -> 0.08 -> 0.08
+ static_dropout_0.06,static,5,4.618606640398502,0.004793067320536325,4.119682380557061,0.008179605355252681,0.3131204143166542,0.028667268842643943,0.06 -> 0.06 -> 0.06 -> 0.06 -> 0.06
+ static_dropout_0.14,static,5,4.573477408885956,0.007705324015227247,4.122133788466454,0.015529540565590345,0.2547642081975937,0.0302072467927019,0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14
+ static_dropout_0.18,static,5,4.575578650385141,0.004087884073210763,4.130411610752344,0.013005735782293457,0.22892669066786767,0.02966442514236336,0.18 -> 0.18 -> 0.18 -> 0.18 -> 0.18
+ static_dropout_0.04,static,5,4.6500934390723705,0.007686314739184788,4.1331265024840835,0.022679096058077455,0.3352832324802876,0.036956450676459744,0.04 -> 0.04 -> 0.04 -> 0.04 -> 0.04
+ static_dropout_0.2,static,5,4.579393386542797,0.004961083585861927,4.1393771633505825,0.016742174044810716,0.22388072907924653,0.033233695598367155,0.20 -> 0.20 -> 0.20 -> 0.20 -> 0.20
+ static_dropout_0.02,static,5,4.695371071547269,0.008590324433750853,4.145870254188776,0.016455755256661365,0.3699515201151371,0.0368559849747316,0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02
+ static_dropout_0.26,static,5,4.606293898075819,0.0051111340096229975,4.178351600468159,0.01453281035529064,0.20080911368131638,0.03295020359909027,0.26 -> 0.26 -> 0.26 -> 0.26 -> 0.26
+ static_dropout_0,static,5,4.776249321401119,0.010887492470564343,4.183486974239349,0.01652781602088682,0.4084807574748993,0.03358628463770181,0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00
+ static_dropout_0.3,static,5,4.625303542017937,0.003399857155492287,4.194609892368317,0.014131822174274241,0.18188718408346177,0.031100561007876955,0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30
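The `mean_final_val` and `std_final_val` columns can be cross-checked against the per-seed values in `paired_final_deltas.csv`. The sketch below does so for `wikitext103_formula_l12`. It assumes the reported standard deviation is the sample standard deviation (n - 1 in the denominator); the numbers are consistent with that reading, but the report script itself is not shown here.

```python
import statistics

# Per-seed final validation losses for wikitext103_formula_l12 (seeds 1-5),
# copied from paired_final_deltas.csv.
final_vals = [
    4.062299974262714,
    4.112260434776545,
    4.0763493329286575,
    4.0844879895448685,
    4.068591594696045,
]

# Values reported in condition_summary.csv for the same condition.
reported_mean = 4.080797865241766
reported_std = 0.019458062276034267

mean_final = statistics.mean(final_vals)
# statistics.stdev is the sample standard deviation (divides by n - 1).
std_final = statistics.stdev(final_vals)
# statistics.pstdev divides by n; shown only for contrast.
pstd_final = statistics.pstdev(final_vals)

print(f"mean={mean_final:.6f} sample_std={std_final:.6f} population_std={pstd_final:.6f}")
```

With only five seeds the two estimators differ noticeably, so it matters which one is quoted when comparing spreads across reports.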
runs/wikitext103_streaming_report/l12_validation_5seed/paired_final_deltas.csv ADDED
@@ -0,0 +1,71 @@
1
+ seed,condition,final_val,best_static_condition,best_static_final_val,delta_vs_best_static
2
+ 1,wikitext103_formula_l12,4.062299974262714,static_dropout_0.1,4.080718021839857,-0.018418047577142715
3
+ 1,wikitext103_probe_blend,4.07382857054472,static_dropout_0.1,4.080718021839857,-0.006889451295137405
4
+ 1,wikitext103_low_decay,4.085419669747353,static_dropout_0.1,4.080718021839857,0.004701647907495499
5
+ 1,static_dropout_0.1,4.080718021839857,static_dropout_0.1,4.080718021839857,0.0
6
+ 1,static_dropout_0.08,4.089278034865856,static_dropout_0.1,4.080718021839857,0.00856001302599907
7
+ 1,static_dropout_0.06,4.111171845346689,static_dropout_0.1,4.080718021839857,0.030453823506832123
8
+ 1,static_dropout_0.14,4.108215663582087,static_dropout_0.1,4.080718021839857,0.02749764174222946
9
+ 1,static_dropout_0.18,4.110849380493164,static_dropout_0.1,4.080718021839857,0.03013135865330696
10
+ 1,static_dropout_0.2,4.116236750036478,static_dropout_0.1,4.080718021839857,0.03551872819662094
11
+ 1,static_dropout_0.04,4.103101845830679,static_dropout_0.1,4.080718021839857,0.02238382399082184
12
+ 1,static_dropout_0.02,4.137094188481569,static_dropout_0.1,4.080718021839857,0.05637616664171219
13
+ 1,static_dropout_0,4.159992527216673,static_dropout_0.1,4.080718021839857,0.0792745053768158
14
+ 1,static_dropout_0.26,4.155716687440872,static_dropout_0.1,4.080718021839857,0.07499866560101509
15
+ 1,static_dropout_0.3,4.180164486169815,static_dropout_0.1,4.080718021839857,0.09944646432995796
16
+ 2,wikitext103_formula_l12,4.112260434776545,static_dropout_0.06,4.130373623222113,-0.018113188445568085
17
+ 2,wikitext103_probe_blend,4.111278761178255,static_dropout_0.06,4.130373623222113,-0.019094862043857574
18
+ 2,wikitext103_low_decay,4.129110310226679,static_dropout_0.06,4.130373623222113,-0.0012633129954338074
19
+ 2,static_dropout_0.1,4.131969079375267,static_dropout_0.06,4.130373623222113,0.0015954561531543732
20
+ 2,static_dropout_0.08,4.137437805533409,static_dropout_0.06,4.130373623222113,0.007064182311296463
21
+ 2,static_dropout_0.06,4.130373623222113,static_dropout_0.06,4.130373623222113,0.0
22
+ 2,static_dropout_0.14,4.147622540593147,static_dropout_0.06,4.130373623222113,0.017248917371034622
23
+ 2,static_dropout_0.18,4.147120479494333,static_dropout_0.06,4.130373623222113,0.01674685627222061
24
+ 2,static_dropout_0.2,4.163300380110741,static_dropout_0.06,4.130373623222113,0.032926756888628006
25
+ 2,static_dropout_0.04,4.164802756160498,static_dropout_0.06,4.130373623222113,0.03442913293838501
26
+ 2,static_dropout_0.02,4.174610733985901,static_dropout_0.06,4.130373623222113,0.04423711076378822
27
+ 2,static_dropout_0,4.202965669333935,static_dropout_0.06,4.130373623222113,0.07259204611182213
28
+ 2,static_dropout_0.26,4.1961489245295525,static_dropout_0.06,4.130373623222113,0.0657753013074398
29
+ 2,static_dropout_0.3,4.215538114309311,static_dropout_0.06,4.130373623222113,0.08516449108719826
30
+ 3,wikitext103_formula_l12,4.0763493329286575,static_dropout_0.08,4.1035647466778755,-0.027215413749217987
31
+ 3,wikitext103_probe_blend,4.093361176550388,static_dropout_0.08,4.1035647466778755,-0.010203570127487183
32
+ 3,wikitext103_low_decay,4.100566737353802,static_dropout_0.08,4.1035647466778755,-0.0029980093240737915
33
+ 3,static_dropout_0.1,4.111502721905708,static_dropout_0.08,4.1035647466778755,0.007937975227832794
34
+ 3,static_dropout_0.08,4.1035647466778755,static_dropout_0.08,4.1035647466778755,0.0
35
+ 3,static_dropout_0.06,4.112732540816069,static_dropout_0.08,4.1035647466778755,0.00916779413819313
36
+ 3,static_dropout_0.14,4.123979520052671,static_dropout_0.08,4.1035647466778755,0.020414773374795914
37
+ 3,static_dropout_0.18,4.128544177860022,static_dropout_0.08,4.1035647466778755,0.024979431182146072
38
+ 3,static_dropout_0.2,4.136738710105419,static_dropout_0.08,4.1035647466778755,0.03317396342754364
39
+ 3,static_dropout_0.04,4.124624028801918,static_dropout_0.08,4.1035647466778755,0.02105928212404251
40
+ 3,static_dropout_0.02,4.144321486353874,static_dropout_0.08,4.1035647466778755,0.04075673967599869
41
+ 3,static_dropout_0,4.175786443054676,static_dropout_0.08,4.1035647466778755,0.07222169637680054
42
+ 3,static_dropout_0.26,4.179637394845486,static_dropout_0.08,4.1035647466778755,0.07607264816761017
43
+ 3,static_dropout_0.3,4.192550577223301,static_dropout_0.08,4.1035647466778755,0.08898583054542542
44
+ 4,wikitext103_formula_l12,4.0844879895448685,static_dropout_0.1,4.109558492898941,-0.02507050335407257
45
+ 4,wikitext103_probe_blend,4.09542103484273,static_dropout_0.1,4.109558492898941,-0.014137458056211472
46
+ 4,wikitext103_low_decay,4.092821758240461,static_dropout_0.1,4.109558492898941,-0.01673673465847969
47
+ 4,static_dropout_0.1,4.109558492898941,static_dropout_0.1,4.109558492898941,0.0
48
+ 4,static_dropout_0.08,4.122252244502306,static_dropout_0.1,4.109558492898941,0.012693751603364944
49
+ 4,static_dropout_0.06,4.118809700012207,static_dropout_0.1,4.109558492898941,0.009251207113265991
50
+ 4,static_dropout_0.14,4.111692663282156,static_dropout_0.1,4.109558492898941,0.0021341703832149506
51
+ 4,static_dropout_0.18,4.132994774729013,static_dropout_0.1,4.109558492898941,0.023436281830072403
52
+ 4,static_dropout_0.2,4.13876885920763,static_dropout_0.1,4.109558492898941,0.029210366308689117
53
+ 4,static_dropout_0.04,4.131225533783436,static_dropout_0.1,4.109558492898941,0.02166704088449478
54
+ 4,static_dropout_0.02,4.1386831142008305,static_dropout_0.1,4.109558492898941,0.02912462130188942
55
+ 4,static_dropout_0,4.185262691229582,static_dropout_0.1,4.109558492898941,0.07570419833064079
56
+ 4,static_dropout_0.26,4.178157042711973,static_dropout_0.1,4.109558492898941,0.06859854981303215
57
+ 4,static_dropout_0.3,4.200709246098995,static_dropout_0.1,4.109558492898941,0.09115075320005417
58
+ 5,wikitext103_formula_l12,4.068591594696045,static_dropout_0.08,4.105628222227097,-0.037036627531051636
59
+ 5,wikitext103_probe_blend,4.106558587402105,static_dropout_0.08,4.105628222227097,0.0009303651750087738
60
+ 5,wikitext103_low_decay,4.102147191762924,static_dropout_0.08,4.105628222227097,-0.0034810304641723633
61
+ 5,static_dropout_0.1,4.118561428040266,static_dropout_0.08,4.105628222227097,0.01293320581316948
62
+ 5,static_dropout_0.08,4.105628222227097,static_dropout_0.08,4.105628222227097,0.0
63
+ 5,static_dropout_0.06,4.125324193388224,static_dropout_0.08,4.105628222227097,0.01969597116112709
64
+ 5,static_dropout_0.14,4.1191585548222065,static_dropout_0.08,4.105628222227097,0.01353033259510994
65
+ 5,static_dropout_0.18,4.132549241185188,static_dropout_0.08,4.105628222227097,0.026921018958091736
66
+ 5,static_dropout_0.2,4.141841117292643,static_dropout_0.08,4.105628222227097,0.036212895065546036
67
+ 5,static_dropout_0.04,4.141878347843885,static_dropout_0.08,4.105628222227097,0.036250125616788864
68
+ 5,static_dropout_0.02,4.134641747921705,static_dropout_0.08,4.105628222227097,0.02901352569460869
69
+ 5,static_dropout_0,4.193427540361881,static_dropout_0.08,4.105628222227097,0.0877993181347847
70
+ 5,static_dropout_0.26,4.18209795281291,static_dropout_0.08,4.105628222227097,0.07646973058581352
71
+ 5,static_dropout_0.3,4.184087038040161,static_dropout_0.08,4.105628222227097,0.07845881581306458
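The `delta_vs_best_static` column appears to be `final_val` minus the lowest static `final_val` within the same seed, so a negative delta means the condition beat that seed's best static baseline. The sketch below recomputes the deltas and per-condition win counts from an excerpt of the rows above (the three decay conditions plus the two lowest static rows per seed). The function name `paired_wins` and the prefix test for static conditions are illustrative assumptions, not the repository's report code.

```python
import csv
import io
from collections import defaultdict

# Excerpt of paired_final_deltas.csv: seed, condition, final_val only.
EXCERPT = """\
seed,condition,final_val
1,wikitext103_formula_l12,4.062299974262714
1,wikitext103_probe_blend,4.07382857054472
1,wikitext103_low_decay,4.085419669747353
1,static_dropout_0.1,4.080718021839857
1,static_dropout_0.08,4.089278034865856
2,wikitext103_formula_l12,4.112260434776545
2,wikitext103_probe_blend,4.111278761178255
2,wikitext103_low_decay,4.129110310226679
2,static_dropout_0.06,4.130373623222113
2,static_dropout_0.1,4.131969079375267
3,wikitext103_formula_l12,4.0763493329286575
3,wikitext103_probe_blend,4.093361176550388
3,wikitext103_low_decay,4.100566737353802
3,static_dropout_0.08,4.1035647466778755
3,static_dropout_0.1,4.111502721905708
4,wikitext103_formula_l12,4.0844879895448685
4,wikitext103_probe_blend,4.09542103484273
4,wikitext103_low_decay,4.092821758240461
4,static_dropout_0.1,4.109558492898941
4,static_dropout_0.14,4.111692663282156
5,wikitext103_formula_l12,4.068591594696045
5,wikitext103_probe_blend,4.106558587402105
5,wikitext103_low_decay,4.102147191762924
5,static_dropout_0.08,4.105628222227097
5,static_dropout_0.1,4.118561428040266
"""


def paired_wins(csv_text):
    """Return (wins per non-static condition, delta per (seed, condition))."""
    by_seed = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_seed[int(row["seed"])][row["condition"]] = float(row["final_val"])

    wins = defaultdict(int)
    deltas = {}
    for seed, finals in by_seed.items():
        # Assumption: static baselines are identified by their name prefix.
        statics = {c: v for c, v in finals.items() if c.startswith("static_dropout_")}
        best_static = min(statics.values())
        for cond, val in finals.items():
            if cond in statics:
                continue
            delta = val - best_static
            deltas[(seed, cond)] = delta
            wins[cond] += int(delta < 0)  # a win is a strictly lower final loss
    return dict(wins), deltas


wins, deltas = paired_wins(EXCERPT)
for cond, count in wins.items():
    print(f"{cond}: {count}/5 seeds below the per-seed best static")
```

Because the best static condition is chosen per seed (0.1, 0.06, or 0.08 here), this comparison is stricter than comparing against a single static dropout fixed in advance.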
runs/wikitext103_streaming_report/l12_validation_5seed/stage_summary.csv ADDED
@@ -0,0 +1,71 @@
1
+ condition,stage,token_limit,dropout,n,mean_val,std_val,mean_train,std_train,mean_gap,std_gap
2
+ wikitext103_formula_l12,0,250000,0.3,5,5.21483291387558,0.018060268226475495,4.213146224617958,0.013218263380840874,1.0016866892576217,0.027238437205160136
3
+ wikitext103_formula_l12,1,500000,0.26,5,4.80534844994545,0.027829930328897787,3.918210099637508,0.011987444359375697,0.8871383503079414,0.02626272167699895
4
+ wikitext103_formula_l12,2,1000000,0.18,5,4.49377838075161,0.014674108647494532,3.8283171117305757,0.017535628899982475,0.6654612690210342,0.02707089996550965
5
+ wikitext103_formula_l12,3,2000000,0.09,5,4.260725708305836,0.018098739272622868,3.8088956013321877,0.016376701234484724,0.45183010697364806,0.01960955973506382
6
+ wikitext103_formula_l12,4,4000000,0.02,5,4.080797865241766,0.019458062276034267,3.7991417959332465,0.028273745467366926,0.2816560693085194,0.038849358858741
7
+ wikitext103_probe_blend,0,250000,0.19,5,5.165890334546566,0.017075109484847858,4.020059056580067,0.016213485710936643,1.1458312779664994,0.02016518864116932
8
+ wikitext103_probe_blend,1,500000,0.14,5,4.7872447416186334,0.0268838649061173,3.684571087360382,0.02397149156090502,1.1026736542582511,0.021535214146850536
9
+ wikitext103_probe_blend,2,1000000,0.09,5,4.49399798810482,0.015864095933133645,3.6495467707514764,0.019490464893964263,0.8444512173533439,0.03475725994195922
10
+ wikitext103_probe_blend,3,2000000,0.04,5,4.274319227784872,0.0173783813582558,3.720025636255741,0.012889615922098394,0.5542935915291309,0.012746502538982045
11
+ wikitext103_probe_blend,4,4000000,0.01,5,4.096089626103639,0.014523730760550201,3.7674024373292925,0.027214408565503653,0.3286871887743473,0.038218075738052845
12
+ wikitext103_low_decay,0,250000,0.14,5,5.163547059893608,0.02195330299872749,3.9050611212849615,0.01328398150034886,1.2584859386086464,0.025343680837745006
13
+ wikitext103_low_decay,1,500000,0.14,5,4.800087235867977,0.01984656043518159,3.6649743929505347,0.023120354646606448,1.1351128429174424,0.01298098541252223
14
+ wikitext103_low_decay,2,1000000,0.1,5,4.50133255571127,0.01578619433972172,3.6606912568211554,0.023899445408052746,0.8406412988901139,0.03886870429434243
15
+ wikitext103_low_decay,3,2000000,0.06,5,4.273568108677864,0.022770462958788148,3.7474397838115694,0.003830485352807811,0.5261283248662949,0.02051193157067845
16
+ wikitext103_low_decay,4,4000000,0.02,5,4.102013133466244,0.016554390946653275,3.776866267621517,0.02647416884246062,0.3251468658447266,0.03314100249583806
17
+ static_dropout_0.1,0,250000,0.1,5,5.169893845915794,0.023739040366881205,3.8219083666801454,0.019248328397722874,1.3479854792356492,0.03169483270954973
18
+ static_dropout_0.1,1,500000,0.1,5,4.83315060287714,0.025840187319246186,3.5636927232146265,0.024862277351108786,1.2694578796625138,0.01835928668354016
19
+ static_dropout_0.1,2,1000000,0.1,5,4.518611620366573,0.01351751181040844,3.652407945692539,0.022251919818600654,0.8662036746740341,0.033766642849086155
20
+ static_dropout_0.1,3,2000000,0.1,5,4.285703422129155,0.018994111876547076,3.7808884128928186,0.01766528157534878,0.5048150092363357,0.028072111875115907
21
+ static_dropout_0.1,4,4000000,0.1,5,4.1104619488120075,0.018806640195540816,3.841722105443478,0.019884552697005082,0.2687398433685303,0.03219305563944893
22
+ static_dropout_0.08,0,250000,0.08,5,5.189354091882706,0.016114118388750917,3.761947725713253,0.017361993699706244,1.4274063661694527,0.012218707428925264
23
+ static_dropout_0.08,1,500000,0.08,5,4.857585413753986,0.023903772104159438,3.50356438010931,0.024856947856236954,1.3540210336446763,0.0310369653543579
24
+ static_dropout_0.08,2,1000000,0.08,5,4.532606348395348,0.011746366682868653,3.5999757796525955,0.009078845784580064,0.9326305687427521,0.019403758262432406
25
+ static_dropout_0.08,3,2000000,0.08,5,4.292461846768856,0.016558424281915483,3.7518652200698854,0.019831550751639896,0.5405966266989708,0.034040566494281235
26
+ static_dropout_0.08,4,4000000,0.08,5,4.111632210761309,0.01857117977689519,3.8268220275640488,0.021203534366681782,0.2848101831972599,0.03679859980086126
27
+ static_dropout_0.06,0,250000,0.06,5,5.215402702987194,0.01734079554543704,3.712760145962238,0.018437630763482243,1.5026425570249557,0.03154944432034942
28
+ static_dropout_0.06,1,500000,0.06,5,4.894671627879143,0.021331675814859084,3.4394222095608713,0.013418585935883633,1.4552494183182716,0.01946001354297248
29
+ static_dropout_0.06,2,1000000,0.06,5,4.557375983893872,0.012636266859277006,3.555374290049076,0.018874416206200166,1.0020016938447953,0.030341255269329737
30
+ static_dropout_0.06,3,2000000,0.06,5,4.305900506675243,0.018234543669842573,3.7317687764763834,0.017157479537739013,0.5741317301988602,0.023633182905307998
31
+ static_dropout_0.06,4,4000000,0.06,5,4.119682380557061,0.008179605355252681,3.806561966240406,0.022924307415498075,0.3131204143166542,0.028667268842643943
32
+ static_dropout_0.14,0,250000,0.14,5,5.1635470792651175,0.021953339409384615,3.9050611019134522,0.013283976741307785,1.2584859773516655,0.025343683677925957
33
+ static_dropout_0.14,1,500000,0.14,5,4.800087215006352,0.019846558873074828,3.6649743527173997,0.023120394246437293,1.135112862288952,0.012980966896977213
34
+ static_dropout_0.14,2,1000000,0.14,5,4.500055414438248,0.014819407090300617,3.716250878572464,0.02407671653248301,0.7838045358657837,0.038101035301189086
35
+ static_dropout_0.14,3,2000000,0.14,5,4.281563547253609,0.02138858846966249,3.828673002123833,0.004937581309037719,0.452890545129776,0.01755457638984346
36
+ static_dropout_0.14,4,4000000,0.14,5,4.122133788466454,0.015529540565590345,3.8673695802688597,0.025363134860510252,0.2547642081975937,0.0302072467927019
37
+ static_dropout_0.18,0,250000,0.18,5,5.16162400841713,0.014987630048664583,3.9964362382888794,0.01862825818296394,1.16518777012825,0.02013674124988852
38
+ static_dropout_0.18,1,500000,0.18,5,4.794638857245445,0.02062163611688947,3.7571836963295935,0.024732309568709827,1.0374551609158515,0.02312552092553772
39
+ static_dropout_0.18,2,1000000,0.18,5,4.5023036181926726,0.018490102933264013,3.786629955470562,0.015769997442310384,0.7156736627221107,0.03180040950921459
40
+ static_dropout_0.18,3,2000000,0.18,5,4.288915157318115,0.019532110178580833,3.876796562969685,0.008357602777023191,0.4121185943484306,0.013904819825939841
41
+ static_dropout_0.18,4,4000000,0.18,5,4.130411610752344,0.013005735782293457,3.9014849200844766,0.018144303418082167,0.22892669066786767,0.02966442514236336
42
+ static_dropout_0.2,0,250000,0.2,5,5.170051643252373,0.014071496005973716,4.036272630095482,0.013668609797313208,1.1337790131568908,0.013715020568700555
43
+ static_dropout_0.2,1,500000,0.2,5,4.787301687896251,0.023601504941398347,3.7914333969354628,0.018987144953080883,0.9958682909607888,0.014572669955232898
44
+ static_dropout_0.2,2,1000000,0.2,5,4.506029562652111,0.02043622151331783,3.814779002964497,0.01587114098616348,0.6912505596876144,0.03147408021666257
45
+ static_dropout_0.2,3,2000000,0.2,5,4.294206875562668,0.016019909189912813,3.898232290148735,0.008893270752025355,0.3959745854139328,0.01268490098888804
46
+ static_dropout_0.2,4,4000000,0.2,5,4.1393771633505825,0.016742174044810716,3.9154964342713354,0.020116706552102667,0.22388072907924653,0.033233695598367155
47
+ static_dropout_0.04,0,250000,0.04,5,5.237822580337524,0.018592151453448197,3.6440532103180887,0.031242522590923742,1.593769370019436,0.03528512909650908
48
+ static_dropout_0.04,1,500000,0.04,5,4.957330641150475,0.02503345339488141,3.3515226930379867,0.029537749484816224,1.6058079481124878,0.0202359151612378
49
+ static_dropout_0.04,2,1000000,0.04,5,4.595940679311752,0.014619476464690098,3.502993068099022,0.017291674978225223,1.0929476112127303,0.028916946275823
50
+ static_dropout_0.04,3,2000000,0.04,5,4.3262467920780185,0.016533384756725625,3.703781445324421,0.01970997472996272,0.6224653467535972,0.02577206799431021
+ static_dropout_0.04,4,4000000,0.04,5,4.1331265024840835,0.022679096058077455,3.7978432700037956,0.018504253509089505,0.3352832324802876,0.036956450676459744
+ static_dropout_0.02,0,250000,0.02,5,5.274957031011581,0.025475383539799955,3.572465108335018,0.027061999609254494,1.7024919226765634,0.03656967535619302
+ static_dropout_0.02,1,500000,0.02,5,5.045114178955555,0.016894569246234076,3.261230443418026,0.014168838694048941,1.783883735537529,0.019840865900389654
+ static_dropout_0.02,2,1000000,0.02,5,4.655830132961273,0.01585338982667534,3.4323942199349404,0.018124036007212796,1.2234359130263328,0.024146051923035036
+ static_dropout_0.02,3,2000000,0.02,5,4.355083760619164,0.026489436313039863,3.6672953754663467,0.010966930251295475,0.6877883851528168,0.026106538952549097
+ static_dropout_0.02,4,4000000,0.02,5,4.145870254188776,0.016455755256661365,3.775918734073639,0.03360785862475549,0.3699515201151371,0.0368559849747316
+ static_dropout_0,0,250000,0.0,5,5.340271946787834,0.02699193219713777,3.5230478435754775,0.03634617773629946,1.8172241032123566,0.05178790589031804
+ static_dropout_0,1,500000,0.0,5,5.174065832793713,0.025168511150111032,3.1505795806646346,0.025699619025332174,2.0234862521290777,0.0236535174506497
+ static_dropout_0,2,1000000,0.0,5,4.766069588065148,0.03242686059194632,3.3657630145549775,0.025194332688414405,1.40030657351017,0.052237334934698405
+ static_dropout_0,3,2000000,0.0,5,4.417352265119552,0.01876483626309062,3.6409011498093604,0.014124132971310668,0.7764511153101921,0.024211298020746064
+ static_dropout_0,4,4000000,0.0,5,4.183486974239349,0.01652781602088682,3.7750062167644503,0.025453964920340872,0.4084807574748993,0.03358628463770181
+ static_dropout_0.26,0,250000,0.26,5,5.194025552272796,0.016128282325428966,4.149615630507469,0.024675006962846267,1.0444099217653275,0.022196467420881313
+ static_dropout_0.26,1,500000,0.26,5,4.8081169500947,0.021576250073037834,3.905269515514374,0.02376505939443177,0.9028474345803261,0.018700407757224247
+ static_dropout_0.26,2,1000000,0.26,5,4.52620010226965,0.010149133513914643,3.9071271896362303,0.02774032700763299,0.619072912633419,0.035195638721522884
+ static_dropout_0.26,3,2000000,0.26,5,4.32477528527379,0.017871972287011016,3.9654731526970863,0.018160558354021094,0.359302132576704,0.018728321745075875
+ static_dropout_0.26,4,4000000,0.26,5,4.178351600468159,0.01453281035529064,3.9775424867868425,0.021977989750098077,0.20080911368131638,0.03295020359909027
+ static_dropout_0.3,0,250000,0.3,5,5.214832927286625,0.018060234410376495,4.213146212697029,0.013218280901902867,1.0016867145895958,0.02723842931339789
+ static_dropout_0.3,1,500000,0.3,5,4.82416096329689,0.029627048936626245,3.976513123512268,0.010796527385978102,0.8476478397846222,0.026893359245285242
+ static_dropout_0.3,2,1000000,0.3,5,4.546238152682781,0.012672630425565377,3.970817744731903,0.017137273586986676,0.5754204079508781,0.02420747470801454
+ static_dropout_0.3,3,2000000,0.3,5,4.34667577445507,0.015854348211857338,4.009651578962803,0.014639843511797184,0.3370241954922676,0.019075525784314876
+ static_dropout_0.3,4,4000000,0.3,5,4.194609892368317,0.014131822174274241,4.012722708284855,0.024177126067976527,0.18188718408346177,0.031100561007876955