Mandeep Sidhu committed on
Commit bf705c0 · 1 Parent(s): 1c065aa

Add clean previous local five-seed validation

docs/plan.md CHANGED
@@ -284,7 +284,7 @@ Use this order for every regime.
284
  | original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
285
  | TinyStories static/coefficient regime | active | main coefficient evidence |
286
  | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
287
- | original/local streaming regime | 3-seed saved-run report complete | previous/local decay schedules beat best static in 3/3 paired final-loss comparisons |
288
  | next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
289
 
290
  ## Current Formula Status
@@ -333,7 +333,7 @@ structure transfers, while coefficients may be regime-specific.
333
  | TinyStories held-out prefix | supports pressure dependence on unique tokens |
334
  | TinyStories held-out model | supports pressure dependence on model size |
335
  | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
336
- | previous/local streaming, 3 seeds | hold-30 decay has best mean final loss; top decay schedules beat best static in 3/3 paired comparisons |
337
  | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
338
 
339
  Latest TinyStories 5-seed streaming final-loss table:
@@ -355,9 +355,9 @@ Paired final-loss result:
355
  | `baseabc` | 5/5 |
356
  | `smooth_low` | 4/5, with the one miss only `+0.0003` |
357
 
358
- The immediate risk is no longer TinyStories seed count. The main remaining risk
359
- is external validity: the current strongest streaming result is one corpus and
360
- one narrowed model/optimizer regime. The current defensible claim is:
361
 
362
  ```text
363
  Formula-derived dropout schedules track the moving useful dropout region and
@@ -370,8 +370,36 @@ The stronger claim:
370
  Formula-derived dropout decay beats the best static dropout.
371
  ```
372
 
373
- is supported at `n=5` for this TinyStories setup, with interaction decay
374
- beating the per-seed best static baseline in all five seeds.
375
 
376
  ## Completed Static Backtest Gate
377
 
@@ -395,25 +423,23 @@ streaming multi-seed reports for each regime.
395
 
396
  ## Immediate Next Action
397
 
398
- Reconcile the TinyStories five-seed report and previous/local three-seed
399
- report into the paper outline. Decide whether the previous/local regime needs a
400
- targeted seed-4/5 extension, or whether the next better use of MPS time is a
401
- third held-out regime.
 
402
 
403
  ## Next Training After Current Gate
404
 
405
- No MPS training should launch until the two completed streaming reports are
406
- read together. If previous/local seed count is the limiting issue, the next run
407
- should be narrowly scoped to only the missing seed-4/5 previous/local
408
- conditions. If external validity is the limiting issue, use a third held-out
409
- regime instead:
410
 
411
  ```text
412
  completed: TinyStories 5-seed streaming report
413
- completed: previous/local 3-seed saved-run streaming report
414
- possible follow-up A: previous/local seed-4/5 extension
415
- possible follow-up B: third held-out regime
416
- avoid: broad new sweep before choosing A vs B
417
  ```
418
 
419
  Evaluate with paired seed comparisons:
@@ -426,10 +452,11 @@ decay minus best-static delta per seed
426
  rank consistency across seeds
427
  ```
428
 
429
- If previous/local decay wins across paired seeds, promote the cross-regime
430
- streaming claim. If it ties, claim competitive automatic scheduling rather than
431
- superiority outside TinyStories. If it loses, fit a streaming-specific
432
- correction offline before launching any broader experiment.
 
433
 
434
  Latest streaming report:
435
 
@@ -437,5 +464,5 @@ Latest streaming report:
437
  docs/streaming_multiseed_validation_report.md
438
  docs/previous_regime_streaming_report.md
439
  runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
440
- runs/previous_local_streaming_report/l16_multiseed_confirm/
441
  ```
 
284
  | original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
285
  | TinyStories static/coefficient regime | active | main coefficient evidence |
286
  | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
287
+ | original/local streaming regime | 5-seed clean validation complete | previous/local interaction decay beats best static in 5/5 paired final-loss comparisons |
288
  | next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
289
 
290
  ## Current Formula Status
 
333
  | TinyStories held-out prefix | supports pressure dependence on unique tokens |
334
  | TinyStories held-out model | supports pressure dependence on model size |
335
  | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
336
+ | previous/local streaming, 5 seeds | interaction decay has best mean final loss; top decay schedules beat best static in 5/5 paired comparisons |
337
  | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
338
 
339
  Latest TinyStories 5-seed streaming final-loss table:
 
355
  | `baseabc` | 5/5 |
356
  | `smooth_low` | 4/5, with the one miss only `+0.0003` |
357
 
358
+ The immediate risk is no longer seed count for TinyStories or previous/local.
359
+ The main remaining risk is external validity beyond two tested regimes. The
360
+ current defensible claim is:
361
 
362
  ```text
363
  Formula-derived dropout schedules track the moving useful dropout region and
 
370
  Formula-derived dropout decay beats the best static dropout.
371
  ```
372
 
373
+ is supported at `n=5` in both the TinyStories and previous/local streaming
374
+ setups, with interaction decay beating the per-seed best static baseline in all
375
+ five seeds in both regimes.
376
+
377
+ Latest previous/local 5-seed streaming final-loss table:
378
+
379
+ | Condition | Mean final 4M validation loss | Std |
380
+ |---|---:|---:|
381
+ | `prevlocal_interaction` decay | 4.3981 | 0.0095 |
382
+ | `hold_30_then_decay` | 4.4052 | 0.0112 |
383
+ | `mild_30_to_08` | 4.4073 | 0.0085 |
384
+ | `fitted_l16_static_law` | 4.4124 | 0.0084 |
385
+ | static `0.14` | 4.4455 | 0.0120 |
386
+ | static `0.30` | 4.4668 | 0.0141 |
387
+ | static `0.02` | 4.5358 | 0.0091 |
388
+ | static `0.00` | 4.5943 | 0.0216 |
389
+
390
+ Paired final-loss result:
391
+
392
+ | Decay schedule | Paired wins vs best static |
393
+ |---|---:|
394
+ | `prevlocal_interaction` | 5/5 |
395
+ | `hold_30_then_decay` | 5/5 |
396
+ | `mild_30_to_08` | 5/5 |
397
+ | `fitted_l16_static_law` | 5/5 |
398
+
399
+ The best static baseline in the clean previous/local run is static dropout
400
+ `0.14`. The interaction schedule improves mean final validation loss by about
401
+ `0.0473` and wins every paired seed comparison. This promotes previous/local
402
+ from exploratory support to a second multi-seed streaming validation regime.
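
The improvement quoted above can be re-derived from the mean final-loss table. The following is a minimal sketch, not the project's reporting script: the dictionary keys follow the condition names used in the streaming report, and the helper names `best_static` and `improvement_over_best_static` are hypothetical. Because the inputs are the 4-decimal table values, the result is about `0.0474`; the `0.0473` figure in the text corresponds to the unrounded means in `condition_summary.csv`.

```python
# Mean final 4M validation losses copied from the 5-seed table above
# (rounded to 4 decimals, so results can differ in the last digit
# from figures computed on the unrounded CSV values).
mean_final_val = {
    "prevlocal_interaction": 4.3981,
    "hold_30_then_decay": 4.4052,
    "mild_30_to_08": 4.4073,
    "fitted_l16_static_law": 4.4124,
    "static_dropout_0.14": 4.4455,
    "static_dropout_0.3": 4.4668,
    "static_dropout_0.02": 4.5358,
    "static_dropout_0": 4.5943,
}


def best_static(means):
    """Return the static condition with the lowest mean final loss."""
    static = {name: loss for name, loss in means.items() if name.startswith("static_")}
    return min(static, key=static.get)


def improvement_over_best_static(means, condition):
    """Positive values mean `condition` has lower loss than the best static."""
    return means[best_static(means)] - means[condition]


gain = improvement_over_best_static(mean_final_val, "prevlocal_interaction")
print(best_static(mean_final_val), round(gain, 4))
```
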
403
 
404
  ## Completed Static Backtest Gate
405
 
 
423
 
424
  ## Immediate Next Action
425
 
426
+ Reconcile the TinyStories five-seed report and previous/local five-seed report
427
+ into the paper outline. The seed-count gap is now closed. The next empirical
428
+ weakness is external validity, so the preferred next experiment is a third
429
+ held-out regime with minimal coefficient calibration followed by narrowed
430
+ multi-seed streaming validation.
431
 
432
  ## Next Training After Current Gate
433
 
434
+ No MPS training should launch until the two completed five-seed streaming
435
+ reports are read together. Since previous/local seed count is no longer the
436
+ limiting issue, use a third held-out regime for the next validation step:
 
 
437
 
438
  ```text
439
  completed: TinyStories 5-seed streaming report
440
+ completed: previous/local 5-seed clean streaming report
441
+ next: third held-out regime with minimal calibration
442
+ avoid: broad new sweep before cross-regime report reconciliation
 
443
  ```
444
 
445
  Evaluate with paired seed comparisons:
 
452
  rank consistency across seeds
453
  ```
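
One way to check rank consistency across seeds is to rank the conditions within each seed and look at how much each condition's rank moves. The sketch below is illustrative only: the helper `per_seed_ranks` is hypothetical, and the losses are the rounded per-seed values for four conditions from the previous/local paired final-loss table in `docs/previous_regime_streaming_report.md`.

```python
# Per-seed final validation losses for seeds 1-5 (rounded to 4 decimals),
# copied from the previous/local paired final-loss table. Only four
# conditions are included to keep the example short.
final_val = {
    "prevlocal_interaction": [4.4023, 4.4020, 4.4029, 4.3811, 4.4024],
    "hold_30_then_decay": [4.3939, 4.4068, 4.4174, 4.3936, 4.4145],
    "static_dropout_0.14": [4.4418, 4.4602, 4.4356, 4.4337, 4.4560],
    "static_dropout_0.3": [4.4602, 4.4719, 4.4758, 4.4455, 4.4805],
}


def per_seed_ranks(losses):
    """Return {condition: [rank per seed]}, where rank 1 is the lowest loss."""
    conditions = list(losses)
    n_seeds = len(next(iter(losses.values())))
    ranks = {name: [] for name in conditions}
    for seed_idx in range(n_seeds):
        ordered = sorted(conditions, key=lambda name: losses[name][seed_idx])
        for position, name in enumerate(ordered, start=1):
            ranks[name].append(position)
    return ranks


ranks = per_seed_ranks(final_val)
for name, seed_ranks in ranks.items():
    spread = max(seed_ranks) - min(seed_ranks)
    print(f"{name}: ranks={seed_ranks} spread={spread}")
```

A spread of zero means the condition holds the same rank in every seed. In this subset the two decay schedules swap places in one seed, while both static baselines stay below them throughout.
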
454
 
455
+ Because previous/local decay wins across paired seeds, promote the cross-regime
456
+ streaming claim to "supported in two regimes." Do not yet claim universal
457
+ numeric coefficients. The next claim to test is whether the pressure-law
458
+ structure and regime-specific fitting procedure reproduce the win in a third
459
+ held-out regime.
460
 
461
  Latest streaming report:
462
 
 
464
  docs/streaming_multiseed_validation_report.md
465
  docs/previous_regime_streaming_report.md
466
  runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
467
+ runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/
468
  ```
docs/previous_regime_streaming_report.md CHANGED
@@ -2,27 +2,28 @@
2
 
3
  Date: 2026-05-30
4
 
5
- This report combines 3 random seeds (1, 2, 3) from saved streaming runs.
6
  No additional training is performed by this script; it reads saved
7
  `metrics.jsonl` files.
8
 
9
- Regime: original/local saved streaming setup with L16_H8_D384, 31,457,280 parameters, five prefixes from 250k to 4M tokens, and 1,000 optimizer steps per stage. This report uses the existing three-seed confirmation run only; earlier single-seed search/refinement runs are treated as exploratory support, not as the primary proof table.
10
 
11
  ## Sources
12
 
13
- - `runs/stream_multiseed_confirm/locked_stream/20260526-203116/metrics.jsonl`
14
 
15
  ## Condition Ranking By Final Loss
16
 
17
  | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
18
  |---|---|---:|---:|---:|---:|---:|---:|---|
19
- | `hold_30_then_decay` | `anchor_decay` | 3 | 4.8503 | 0.0017 | 4.4060 | 0.0118 | 0.3530 | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` |
20
- | `mild_30_to_08` | `anchor_decay` | 3 | 4.8504 | 0.0018 | 4.4075 | 0.0078 | 0.3307 | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` |
21
- | `fitted_l16_static_law` | `anchor_decay` | 3 | 4.9527 | 0.0052 | 4.4159 | 0.0042 | 0.3144 | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` |
22
- | `static_dropout_0.14` | `static` | 3 | 4.9043 | 0.0119 | 4.4459 | 0.0128 | 0.3205 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` |
23
- | `static_dropout_0.3` | `static` | 3 | 4.8764 | 0.0014 | 4.4693 | 0.0081 | 0.2327 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` |
24
- | `static_dropout_0.02` | `static` | 3 | 5.1544 | 0.0091 | 4.5405 | 0.0061 | 0.4747 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` |
25
- | `static_dropout_0` | `static` | 3 | 5.2422 | 0.0015 | 4.5905 | 0.0192 | 0.5464 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` |
 
26
 
27
  ## Paired Final-Loss Deltas
28
 
@@ -31,122 +32,102 @@ baseline for that seed.
31
 
32
  | Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
33
  |---:|---|---:|---|---:|---:|
 
34
  | 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 |
35
  | 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 |
36
  | 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 |
37
  | 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 |
38
  | 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 |
39
- | 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.0985 |
40
- | 1 | `static_dropout_0` | 4.5703 | `static_dropout_0.14` | 4.4418 | +0.1286 |
41
- | 2 | `hold_30_then_decay` | 4.4068 | `static_dropout_0.14` | 4.4603 | -0.0535 |
42
- | 2 | `mild_30_to_08` | 4.4080 | `static_dropout_0.14` | 4.4603 | -0.0523 |
43
- | 2 | `fitted_l16_static_law` | 4.4136 | `static_dropout_0.14` | 4.4603 | -0.0467 |
44
- | 2 | `static_dropout_0.14` | 4.4603 | `static_dropout_0.14` | 4.4603 | +0.0000 |
45
- | 2 | `static_dropout_0.3` | 4.4719 | `static_dropout_0.14` | 4.4603 | +0.0116 |
46
- | 2 | `static_dropout_0.02` | 4.5466 | `static_dropout_0.14` | 4.4603 | +0.0863 |
47
- | 2 | `static_dropout_0` | 4.6085 | `static_dropout_0.14` | 4.4603 | +0.1482 |
48
- | 3 | `hold_30_then_decay` | 4.4174 | `static_dropout_0.14` | 4.4357 | -0.0183 |
49
- | 3 | `mild_30_to_08` | 4.4151 | `static_dropout_0.14` | 4.4357 | -0.0206 |
50
- | 3 | `fitted_l16_static_law` | 4.4134 | `static_dropout_0.14` | 4.4357 | -0.0223 |
51
- | 3 | `static_dropout_0.14` | 4.4357 | `static_dropout_0.14` | 4.4357 | +0.0000 |
52
- | 3 | `static_dropout_0.3` | 4.4758 | `static_dropout_0.14` | 4.4357 | +0.0401 |
53
- | 3 | `static_dropout_0.02` | 4.5345 | `static_dropout_0.14` | 4.4357 | +0.0988 |
54
- | 3 | `static_dropout_0` | 4.5926 | `static_dropout_0.14` | 4.4357 | +0.1569 |
55
 
56
  ## Stage Trajectory
57
 
58
  | Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
59
  |---:|---:|---|---:|---:|---:|---:|---:|---:|
60
- | 0 | 250,000 | `mild_30_to_08` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
61
- | 0 | 250,000 | `static_dropout_0.3` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
62
- | 0 | 250,000 | `hold_30_then_decay` | 0.300 | 3 | 5.4463 | 0.0191 | 4.4457 | 1.0006 |
63
- | 0 | 250,000 | `static_dropout_0.14` | 0.140 | 3 | 5.4707 | 0.0281 | 4.0325 | 1.4383 |
64
- | 0 | 250,000 | `static_dropout_0.02` | 0.020 | 3 | 5.7452 | 0.0319 | 3.5394 | 2.2057 |
65
- | 0 | 250,000 | `fitted_l16_static_law` | 0.600 | 3 | 5.7847 | 0.0108 | 5.1677 | 0.6170 |
66
- | 0 | 250,000 | `static_dropout_0` | 0.000 | 3 | 5.8283 | 0.0158 | 3.4498 | 2.3785 |
67
- | 1 | 500,000 | `mild_30_to_08` | 0.240 | 3 | 5.0573 | 0.0197 | 4.0209 | 1.0364 |
68
- | 1 | 500,000 | `static_dropout_0.3` | 0.300 | 3 | 5.0643 | 0.0216 | 4.1251 | 0.9392 |
69
- | 1 | 500,000 | `hold_30_then_decay` | 0.300 | 3 | 5.0643 | 0.0216 | 4.1251 | 0.9392 |
70
- | 1 | 500,000 | `fitted_l16_static_law` | 0.400 | 3 | 5.1479 | 0.0127 | 4.4501 | 0.6978 |
71
- | 1 | 500,000 | `static_dropout_0.14` | 0.140 | 3 | 5.1493 | 0.0097 | 3.7036 | 1.4457 |
72
- | 1 | 500,000 | `static_dropout_0.02` | 0.020 | 3 | 5.5605 | 0.0148 | 3.1103 | 2.4502 |
73
- | 1 | 500,000 | `static_dropout_0` | 0.000 | 3 | 5.6920 | 0.0452 | 2.9511 | 2.7409 |
74
- | 2 | 1,000,000 | `hold_30_then_decay` | 0.200 | 3 | 4.7695 | 0.0164 | 4.0408 | 0.7287 |
75
- | 2 | 1,000,000 | `mild_30_to_08` | 0.180 | 3 | 4.7717 | 0.0162 | 3.9925 | 0.7793 |
76
- | 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.7927 | 0.0173 | 4.1535 | 0.6392 |
77
- | 2 | 1,000,000 | `fitted_l16_static_law` | 0.300 | 3 | 4.8273 | 0.0096 | 4.2699 | 0.5573 |
78
- | 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.8466 | 0.0278 | 3.8815 | 0.9651 |
79
- | 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 3 | 5.1459 | 0.0294 | 3.4641 | 1.6818 |
80
- | 2 | 1,000,000 | `static_dropout_0` | 0.000 | 3 | 5.2484 | 0.0091 | 3.3281 | 1.9203 |
81
- | 3 | 2,000,000 | `hold_30_then_decay` | 0.100 | 3 | 4.5655 | 0.0060 | 4.0390 | 0.5265 |
82
- | 3 | 2,000,000 | `mild_30_to_08` | 0.120 | 3 | 4.5691 | 0.0072 | 4.0380 | 0.5312 |
83
- | 3 | 2,000,000 | `fitted_l16_static_law` | 0.140 | 3 | 4.5879 | 0.0086 | 4.1457 | 0.4422 |
84
- | 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.6088 | 0.0069 | 4.0454 | 0.5634 |
85
- | 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.6094 | 0.0059 | 4.2125 | 0.3968 |
86
- | 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 3 | 4.7799 | 0.0219 | 3.8344 | 0.9455 |
87
- | 3 | 2,000,000 | `static_dropout_0` | 0.000 | 3 | 4.8517 | 0.0153 | 3.7761 | 1.0757 |
88
- | 4 | 4,000,000 | `hold_30_then_decay` | 0.020 | 3 | 4.4060 | 0.0118 | 4.0530 | 0.3530 |
89
- | 4 | 4,000,000 | `mild_30_to_08` | 0.080 | 3 | 4.4075 | 0.0078 | 4.0768 | 0.3307 |
90
- | 4 | 4,000,000 | `fitted_l16_static_law` | 0.020 | 3 | 4.4159 | 0.0042 | 4.1015 | 0.3144 |
91
- | 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 3 | 4.4459 | 0.0128 | 4.1254 | 0.3205 |
92
- | 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 3 | 4.4693 | 0.0081 | 4.2365 | 0.2327 |
93
- | 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 3 | 4.5405 | 0.0061 | 4.0657 | 0.4747 |
94
- | 4 | 4,000,000 | `static_dropout_0` | 0.000 | 3 | 4.5905 | 0.0192 | 4.0441 | 0.5464 |
95
 
96
  ## Interpretation
97
 
98
- - `hold_30_then_decay` has the best 3-seed mean final validation loss: 4.4060 +/- 0.0118.
99
- - The second-best final condition is `mild_30_to_08` at 4.4075 +/- 0.0078.
100
- - The best static baseline by mean final loss is `static_dropout_0.14` at 4.4459 +/- 0.0128.
101
- - `hold_30_then_decay` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0183.
102
- - `mild_30_to_08` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0206.
103
- - `fitted_l16_static_law` beats the per-seed best static baseline in 3/3 seeds; worst paired delta is -0.0211.
104
- - The best first-stage condition is `mild_30_to_08` at prefix 250,000 with mean validation loss 5.4463; compare this with the final ranking before claiming a schedule is uniformly better.
 
105
  - This is a saved-run streaming validation artifact. Treat it as strong
106
  evidence only when the tested conditions, seeds, static baselines, and
107
  stream protocol match the claim being made.
108
-
109
- ## Supporting Exploratory Runs
110
-
111
- The primary proof table above is the three-seed confirmation run:
112
-
113
- ```text
114
- runs/stream_multiseed_confirm/locked_stream/20260526-203116/
115
- ```
116
-
117
- Earlier single-seed runs are useful for interpreting how the schedule was
118
- selected, but they are not counted as multi-seed proof:
119
-
120
- | Supporting run | Role | Main reading |
121
- |---|---|---|
122
- | `runs/stream_schedule_search/locked_stream/20260526-171537/` | schedule search | decay schedules starting near `0.30` and ending near `0.02` to `0.08` beat static `0.14` and `0.30` at the final 4M prefix |
123
- | `runs/stream_schedule_refinement/locked_stream/20260526-184506/` | endpoint and curvature refinement | several `hold_30` variants ended tightly around `4.394`, while `hold_24_then_decay` was weaker at `4.4214`, suggesting the initial dropout should not be reduced too aggressively in this regime |
124
- | `runs/formula_l16_exact_multiseed/locked_stream/20260527-123806/` | coefficient-derived schedule check | `pressure_formula_l16_floor02` reached `4.4059 +/- 0.0042` over three seeds versus static `0.14` at `4.4459 +/- 0.0128` |
125
-
126
- ## Research Reading
127
-
128
- This previous/local regime supports the same qualitative claim as the
129
- TinyStories five-seed validation: a static dropout that is reasonable at one
130
- stream scale is not necessarily optimal as the data prefix grows. In this
131
- regime, the useful path keeps dropout high early (`0.30`) and then lowers it
132
- as unique tokens and sampled tokens increase.
133
-
134
- The strongest previous/local evidence is:
135
-
136
- | Claim | Evidence |
137
- |---|---|
138
- | decay beats best static final loss | `hold_30_then_decay` beats the per-seed best static baseline in `3/3` seeds |
139
- | endpoint is not uniquely fixed | `mild_30_to_08` is nearly tied with `hold_30_then_decay` |
140
- | too-low early dropout is harmful | static `0.02` and `0.00` are much worse throughout the stream |
141
- | too-high static dropout underuses later data | static `0.30` wins no final paired comparison despite being strong early |
142
- | coefficient-derived schedules are viable | `fitted_l16_static_law` and `pressure_formula_l16_floor02` both beat static `0.14` in the saved three-seed comparisons |
143
-
144
- Limitations:
145
-
146
- 1. This report is `n=3`, not `n=5`.
147
- 2. The schedules were refined inside this local regime, so this is not a
148
- clean held-out-regime proof of universal coefficients.
149
- 3. The report still supports the cross-regime mechanism because the direction
150
- of the effect matches TinyStories: high enough initial regularization
151
- prevents early overfit, and lowering dropout later improves final validation
152
- loss versus holding one static value fixed.
 
2
 
3
  Date: 2026-05-30
4
 
5
+ This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
6
  No additional training is performed by this script; it reads saved
7
  `metrics.jsonl` files.
8
 
9
+ Regime: original/local saved streaming setup with L16_H8_D384, 31,457,280 parameters, five prefixes from 250k to 4M tokens, and 1,000 optimizer steps per stage. This is a clean five-seed run including the updated previous/local interaction formula schedule, empirical decay schedules, and static baselines.
10
 
11
  ## Sources
12
 
13
+ - `runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/metrics.jsonl`
14
 
15
  ## Condition Ranking By Final Loss
16
 
17
  | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
18
  |---|---|---:|---:|---:|---:|---:|---:|---|
19
+ | `prevlocal_interaction` | `anchor_decay` | 5 | 4.8609 | 0.0046 | 4.3981 | 0.0095 | 0.3177 | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` |
20
+ | `hold_30_then_decay` | `anchor_decay` | 5 | 4.8512 | 0.0017 | 4.4052 | 0.0112 | 0.3565 | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` |
21
+ | `mild_30_to_08` | `anchor_decay` | 5 | 4.8509 | 0.0015 | 4.4073 | 0.0085 | 0.3337 | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` |
22
+ | `fitted_l16_static_law` | `anchor_decay` | 5 | 4.9521 | 0.0039 | 4.4124 | 0.0084 | 0.3137 | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` |
23
+ | `static_dropout_0.14` | `static` | 5 | 4.9051 | 0.0088 | 4.4455 | 0.0120 | 0.3289 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` |
24
+ | `static_dropout_0.3` | `static` | 5 | 4.8767 | 0.0019 | 4.4668 | 0.0141 | 0.2349 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` |
25
+ | `static_dropout_0.02` | `static` | 5 | 5.1571 | 0.0097 | 4.5358 | 0.0091 | 0.4829 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` |
26
+ | `static_dropout_0` | `static` | 5 | 5.2511 | 0.0160 | 4.5943 | 0.0216 | 0.5529 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` |
27
 
28
  ## Paired Final-Loss Deltas
29
 
 
32
 
33
  | Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
34
  |---:|---|---:|---|---:|---:|
35
+ | 1 | `prevlocal_interaction` | 4.4023 | `static_dropout_0.14` | 4.4418 | -0.0394 |
36
  | 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 |
37
  | 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 |
38
  | 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 |
39
  | 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 |
40
  | 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 |
41
+ | 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.0984 |
42
+ | 1 | `static_dropout_0` | 4.5704 | `static_dropout_0.14` | 4.4418 | +0.1286 |
43
+ | 2 | `prevlocal_interaction` | 4.4020 | `static_dropout_0.14` | 4.4602 | -0.0583 |
44
+ | 2 | `hold_30_then_decay` | 4.4068 | `static_dropout_0.14` | 4.4602 | -0.0534 |
45
+ | 2 | `mild_30_to_08` | 4.4080 | `static_dropout_0.14` | 4.4602 | -0.0522 |
46
+ | 2 | `fitted_l16_static_law` | 4.4136 | `static_dropout_0.14` | 4.4602 | -0.0466 |
47
+ | 2 | `static_dropout_0.14` | 4.4602 | `static_dropout_0.14` | 4.4602 | +0.0000 |
48
+ | 2 | `static_dropout_0.3` | 4.4719 | `static_dropout_0.14` | 4.4602 | +0.0117 |
49
+ | 2 | `static_dropout_0.02` | 4.5466 | `static_dropout_0.14` | 4.4602 | +0.0864 |
50
+ | 2 | `static_dropout_0` | 4.6094 | `static_dropout_0.14` | 4.4602 | +0.1492 |
51
+ | 3 | `prevlocal_interaction` | 4.4029 | `static_dropout_0.14` | 4.4356 | -0.0328 |
52
+ | 3 | `hold_30_then_decay` | 4.4174 | `static_dropout_0.14` | 4.4356 | -0.0183 |
53
+ | 3 | `mild_30_to_08` | 4.4151 | `static_dropout_0.14` | 4.4356 | -0.0206 |
54
+ | 3 | `fitted_l16_static_law` | 4.4134 | `static_dropout_0.14` | 4.4356 | -0.0223 |
55
+ | 3 | `static_dropout_0.14` | 4.4356 | `static_dropout_0.14` | 4.4356 | +0.0000 |
56
+ | 3 | `static_dropout_0.3` | 4.4758 | `static_dropout_0.14` | 4.4356 | +0.0401 |
57
+ | 3 | `static_dropout_0.02` | 4.5345 | `static_dropout_0.14` | 4.4356 | +0.0988 |
58
+ | 3 | `static_dropout_0` | 4.5928 | `static_dropout_0.14` | 4.4356 | +0.1571 |
59
+ | 4 | `prevlocal_interaction` | 4.3811 | `static_dropout_0.14` | 4.4337 | -0.0526 |
60
+ | 4 | `hold_30_then_decay` | 4.3936 | `static_dropout_0.14` | 4.4337 | -0.0400 |
61
+ | 4 | `mild_30_to_08` | 4.3978 | `static_dropout_0.14` | 4.4337 | -0.0359 |
62
+ | 4 | `fitted_l16_static_law` | 4.3983 | `static_dropout_0.14` | 4.4337 | -0.0354 |
63
+ | 4 | `static_dropout_0.14` | 4.4337 | `static_dropout_0.14` | 4.4337 | +0.0000 |
64
+ | 4 | `static_dropout_0.3` | 4.4455 | `static_dropout_0.14` | 4.4337 | +0.0118 |
65
+ | 4 | `static_dropout_0.02` | 4.5220 | `static_dropout_0.14` | 4.4337 | +0.0883 |
66
+ | 4 | `static_dropout_0` | 4.5768 | `static_dropout_0.14` | 4.4337 | +0.1432 |
67
+ | 5 | `prevlocal_interaction` | 4.4024 | `static_dropout_0.14` | 4.4560 | -0.0536 |
68
+ | 5 | `hold_30_then_decay` | 4.4145 | `static_dropout_0.14` | 4.4560 | -0.0415 |
69
+ | 5 | `mild_30_to_08` | 4.4161 | `static_dropout_0.14` | 4.4560 | -0.0399 |
70
+ | 5 | `fitted_l16_static_law` | 4.4161 | `static_dropout_0.14` | 4.4560 | -0.0399 |
71
+ | 5 | `static_dropout_0.14` | 4.4560 | `static_dropout_0.14` | 4.4560 | +0.0000 |
72
+ | 5 | `static_dropout_0.3` | 4.4805 | `static_dropout_0.14` | 4.4560 | +0.0245 |
73
+ | 5 | `static_dropout_0.02` | 4.5355 | `static_dropout_0.14` | 4.4560 | +0.0796 |
74
+ | 5 | `static_dropout_0` | 4.6219 | `static_dropout_0.14` | 4.4560 | +0.1660 |
75
 
76
  ## Stage Trajectory
77
 
78
  | Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
79
  |---:|---:|---|---:|---:|---:|---:|---:|---:|
80
+ | 0 | 250,000 | `mild_30_to_08` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
81
+ | 0 | 250,000 | `hold_30_then_decay` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
82
+ | 0 | 250,000 | `static_dropout_0.3` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
83
+ | 0 | 250,000 | `static_dropout_0.14` | 0.140 | 5 | 5.4773 | 0.0224 | 4.0298 | 1.4475 |
84
+ | 0 | 250,000 | `prevlocal_interaction` | 0.385 | 5 | 5.4947 | 0.0109 | 4.6016 | 0.8930 |
85
+ | 0 | 250,000 | `static_dropout_0.02` | 0.020 | 5 | 5.7426 | 0.0242 | 3.5371 | 2.2055 |
86
+ | 0 | 250,000 | `fitted_l16_static_law` | 0.600 | 5 | 5.7842 | 0.0096 | 5.1640 | 0.6202 |
87
+ | 0 | 250,000 | `static_dropout_0` | 0.000 | 5 | 5.8330 | 0.0198 | 3.4443 | 2.3887 |
88
+ | 1 | 500,000 | `mild_30_to_08` | 0.240 | 5 | 5.0582 | 0.0159 | 4.0349 | 1.0233 |
89
+ | 1 | 500,000 | `static_dropout_0.3` | 0.300 | 5 | 5.0667 | 0.0173 | 4.1383 | 0.9284 |
90
+ | 1 | 500,000 | `hold_30_then_decay` | 0.300 | 5 | 5.0667 | 0.0173 | 4.1383 | 0.9284 |
91
+ | 1 | 500,000 | `prevlocal_interaction` | 0.319 | 5 | 5.0715 | 0.0118 | 4.2065 | 0.8650 |
92
+ | 1 | 500,000 | `static_dropout_0.14` | 0.140 | 5 | 5.1492 | 0.0070 | 3.7143 | 1.4349 |
93
+ | 1 | 500,000 | `fitted_l16_static_law` | 0.400 | 5 | 5.1507 | 0.0102 | 4.4632 | 0.6875 |
94
+ | 1 | 500,000 | `static_dropout_0.02` | 0.020 | 5 | 5.5754 | 0.0248 | 3.1246 | 2.4508 |
95
+ | 1 | 500,000 | `static_dropout_0` | 0.000 | 5 | 5.7175 | 0.0502 | 2.9583 | 2.7592 |
96
+ | 2 | 1,000,000 | `hold_30_then_decay` | 0.200 | 5 | 4.7757 | 0.0144 | 4.0378 | 0.7379 |
97
+ | 2 | 1,000,000 | `mild_30_to_08` | 0.180 | 5 | 4.7774 | 0.0138 | 3.9886 | 0.7888 |
98
+ | 2 | 1,000,000 | `prevlocal_interaction` | 0.227 | 5 | 4.7811 | 0.0084 | 4.0826 | 0.6984 |
99
+ | 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.7983 | 0.0144 | 4.1501 | 0.6481 |
100
+ | 2 | 1,000,000 | `fitted_l16_static_law` | 0.300 | 5 | 4.8326 | 0.0102 | 4.2632 | 0.5694 |
101
+ | 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.8490 | 0.0202 | 3.8712 | 0.9779 |
102
+ | 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 5 | 5.1470 | 0.0222 | 3.4615 | 1.6854 |
103
+ | 2 | 1,000,000 | `static_dropout_0` | 0.000 | 5 | 5.2637 | 0.0274 | 3.3260 | 1.9377 |
104
+ | 3 | 2,000,000 | `prevlocal_interaction` | 0.139 | 5 | 4.5590 | 0.0142 | 4.0802 | 0.4788 |
105
+ | 3 | 2,000,000 | `hold_30_then_decay` | 0.100 | 5 | 4.5599 | 0.0161 | 4.0445 | 0.5154 |
106
+ | 3 | 2,000,000 | `mild_30_to_08` | 0.120 | 5 | 4.5631 | 0.0155 | 4.0441 | 0.5190 |
107
+ | 3 | 2,000,000 | `fitted_l16_static_law` | 0.140 | 5 | 4.5806 | 0.0153 | 4.1471 | 0.4334 |
108
+ | 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.6035 | 0.0141 | 4.2150 | 0.3885 |
109
+ | 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.6048 | 0.0136 | 4.0399 | 0.5648 |
110
+ | 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.7847 | 0.0196 | 3.8405 | 0.9442 |
111
+ | 3 | 2,000,000 | `static_dropout_0` | 0.000 | 5 | 4.8472 | 0.0171 | 3.7786 | 1.0687 |
112
+ | 4 | 4,000,000 | `prevlocal_interaction` | 0.066 | 5 | 4.3981 | 0.0095 | 4.0805 | 0.3177 |
113
+ | 4 | 4,000,000 | `hold_30_then_decay` | 0.020 | 5 | 4.4052 | 0.0112 | 4.0488 | 0.3565 |
114
+ | 4 | 4,000,000 | `mild_30_to_08` | 0.080 | 5 | 4.4073 | 0.0085 | 4.0736 | 0.3337 |
115
+ | 4 | 4,000,000 | `fitted_l16_static_law` | 0.020 | 5 | 4.4124 | 0.0084 | 4.0987 | 0.3137 |
116
+ | 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.4455 | 0.0120 | 4.1165 | 0.3289 |
117
+ | 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.4668 | 0.0141 | 4.2319 | 0.2349 |
118
+ | 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.5358 | 0.0091 | 4.0529 | 0.4829 |
119
+ | 4 | 4,000,000 | `static_dropout_0` | 0.000 | 5 | 4.5943 | 0.0216 | 4.0414 | 0.5529 |
120
 
121
  ## Interpretation
122
 
123
+ - `prevlocal_interaction` has the best 5-seed mean final validation loss: 4.3981 +/- 0.0095.
124
+ - The second-best final condition is `hold_30_then_decay` at 4.4052 +/- 0.0112.
125
+ - The best static baseline by mean final loss is `static_dropout_0.14` at 4.4455 +/- 0.0120.
126
+ - `prevlocal_interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0328.
127
+ - `hold_30_then_decay` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0183.
128
+ - `mild_30_to_08` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0206.
129
+ - `fitted_l16_static_law` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0211.
130
+ - The best first-stage mean validation loss is 5.4483 at prefix 250,000, shared by `mild_30_to_08`, `hold_30_then_decay`, and `static_dropout_0.3` (all three use dropout 0.30 at stage 0); compare this with the final ranking before claiming a schedule is uniformly better.
131
  - This is a saved-run streaming validation artifact. Treat it as strong
132
  evidence only when the tested conditions, seeds, static baselines, and
133
  stream protocol match the claim being made.
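
The paired-win counts and worst paired deltas in the bullets above follow from a per-seed comparison against the best static baseline. This is a hedged sketch of that calculation, not the report script itself: `paired_summary` is a hypothetical helper, and the inputs are the rounded per-seed values from the paired table. Only the two strongest static baselines are included, since static `0.02` and `0.00` are worse in every seed. Rounded inputs can shift the last digit, so the worst delta here may differ slightly from the `-0.0328` reported above.

```python
# Per-seed final validation losses for seeds 1-5, copied from the
# paired final-loss table (rounded to 4 decimals).
final_val = {
    "prevlocal_interaction": [4.4023, 4.4020, 4.4029, 4.3811, 4.4024],
    "hold_30_then_decay": [4.3939, 4.4068, 4.4174, 4.3936, 4.4145],
    "static_dropout_0.14": [4.4418, 4.4602, 4.4356, 4.4337, 4.4560],
    "static_dropout_0.3": [4.4602, 4.4719, 4.4758, 4.4455, 4.4805],
}
STATIC = ["static_dropout_0.14", "static_dropout_0.3"]


def paired_summary(losses, condition, static_names):
    """Return (wins, n_seeds, worst_delta) versus the per-seed best static.

    A delta is condition loss minus best static loss, so negative is a win
    and the worst delta is the largest (least negative) value.
    """
    n_seeds = len(losses[condition])
    deltas = []
    for seed_idx in range(n_seeds):
        best_static_loss = min(losses[name][seed_idx] for name in static_names)
        deltas.append(losses[condition][seed_idx] - best_static_loss)
    wins = sum(delta < 0 for delta in deltas)
    return wins, n_seeds, max(deltas)


for name in ("prevlocal_interaction", "hold_30_then_decay"):
    wins, n_seeds, worst = paired_summary(final_val, name, STATIC)
    print(f"{name}: {wins}/{n_seeds} wins, worst delta {worst:+.4f}")
```
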
runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/condition_summary.csv ADDED
@@ -0,0 +1,9 @@
+ condition,kind,n,mean_trajectory_val,std_trajectory_val,mean_final_val,std_final_val,mean_final_gap,std_final_gap,dropout_path
+ prevlocal_interaction,anchor_decay,5,4.860862210392952,0.0046364279557658235,4.3981304407119755,0.009545784836147743,0.3176518455147743,0.007999498965173152,0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07
+ hold_30_then_decay,anchor_decay,5,4.851180048286915,0.0016753687570399134,4.405232906341553,0.011151070705538514,0.3564802721142769,0.01297330703929578,0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02
+ mild_30_to_08,anchor_decay,5,4.850860581099987,0.0014618995680224028,4.40728645324707,0.008502541215009067,0.3337064355611801,0.010359634321755684,0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08
+ fitted_l16_static_law,anchor_decay,5,4.952093484103679,0.0038574646544463683,4.412404176592827,0.00843791675235308,0.3137470245361328,0.007204760471400837,0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02
+ static_dropout_0.14,static,5,4.905146500468254,0.00876134360549518,4.44545366615057,0.012017216742245517,0.32894645929336547,0.01603071874172604,0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14
+ static_dropout_0.3,static,5,4.8767191568017,0.0019103599368448555,4.46677490323782,0.014064932048228269,0.23490906208753587,0.008922414622347311,0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30
+ static_dropout_0.02,static,5,5.157098578512668,0.009693091424804937,4.535757505893708,0.00908401354385357,0.48288719058036805,0.020126181497736668,0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02
+ static_dropout_0,static,5,5.251133863329888,0.016029529764030867,4.594272664189338,0.021638340853154137,0.5528693303465844,0.029132548047629703,0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00
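As a sanity check on how the `mean_final_val` and `std_final_val` columns relate to per-seed values, here is a small sketch using the five `prevlocal_interaction` final losses from `paired_final_deltas.csv` in this commit. It assumes the summary uses the sample standard deviation (n - 1 denominator); the reporting script is not shown in this diff, so that is an inference from the numbers, not a documented fact.

```python
import statistics

# Per-seed final validation losses for prevlocal_interaction (seeds 1..5),
# copied from paired_final_deltas.csv.
finals = [
    4.402347795665264,
    4.401971310377121,
    4.402896843850613,
    4.381064593791962,
    4.402371659874916,
]

mean_final = statistics.fmean(finals)
# Sample standard deviation (divides by n - 1). This is an assumption about
# the report; statistics.pstdev would give the population version instead.
std_final = statistics.stdev(finals)
```

Under that assumption the values agree with the `prevlocal_interaction` row above (4.3981 and 0.0095 after rounding).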
runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/paired_final_deltas.csv ADDED
@@ -0,0 +1,41 @@
+ seed,condition,final_val,best_static_condition,best_static_final_val,delta_vs_best_static
+ 1,prevlocal_interaction,4.402347795665264,static_dropout_0.14,4.4417688846588135,-0.03942108899354935
+ 1,hold_30_then_decay,4.393906190991402,static_dropout_0.14,4.4417688846588135,-0.047862693667411804
+ 1,mild_30_to_08,4.399485997855663,static_dropout_0.14,4.4417688846588135,-0.04228288680315018
+ 1,fitted_l16_static_law,4.420692957937717,static_dropout_0.14,4.4417688846588135,-0.02107592672109604
+ 1,static_dropout_0.14,4.4417688846588135,static_dropout_0.14,4.4417688846588135,0.0
+ 1,static_dropout_0.3,4.460195399820805,static_dropout_0.14,4.4417688846588135,0.01842651516199112
+ 1,static_dropout_0.02,4.5401855930686,static_dropout_0.14,4.4417688846588135,0.09841670840978622
+ 1,static_dropout_0,4.570374272763729,static_dropout_0.14,4.4417688846588135,0.12860538810491562
+ 2,prevlocal_interaction,4.401971310377121,static_dropout_0.14,4.460222490131855,-0.05825117975473404
+ 2,hold_30_then_decay,4.406779877841473,static_dropout_0.14,4.460222490131855,-0.053442612290382385
+ 2,mild_30_to_08,4.4080275148153305,static_dropout_0.14,4.460222490131855,-0.052194975316524506
+ 2,fitted_l16_static_law,4.41358345746994,static_dropout_0.14,4.460222490131855,-0.046639032661914825
+ 2,static_dropout_0.14,4.460222490131855,static_dropout_0.14,4.460222490131855,0.0
+ 2,static_dropout_0.3,4.4719239845871925,static_dropout_0.14,4.460222490131855,0.011701494455337524
+ 2,static_dropout_0.02,4.546629846096039,static_dropout_0.14,4.460222490131855,0.08640735596418381
+ 2,static_dropout_0,4.609437867999077,static_dropout_0.14,4.460222490131855,0.14921537786722183
+ 3,prevlocal_interaction,4.402896843850613,static_dropout_0.14,4.43564984947443,-0.032753005623817444
+ 3,hold_30_then_decay,4.417374566197395,static_dropout_0.14,4.43564984947443,-0.01827528327703476
+ 3,mild_30_to_08,4.415062002837658,static_dropout_0.14,4.43564984947443,-0.020587846636772156
+ 3,fitted_l16_static_law,4.413399815559387,static_dropout_0.14,4.43564984947443,-0.022250033915042877
+ 3,static_dropout_0.14,4.43564984947443,static_dropout_0.14,4.43564984947443,0.0
+ 3,static_dropout_0.3,4.475773207843304,static_dropout_0.14,4.43564984947443,0.040123358368873596
+ 3,static_dropout_0.02,4.534482300281525,static_dropout_0.14,4.43564984947443,0.09883245080709457
+ 3,static_dropout_0,4.592755533754826,static_dropout_0.14,4.43564984947443,0.1571056842803955
+ 4,prevlocal_interaction,4.381064593791962,static_dropout_0.14,4.433655060827732,-0.052590467035770416
+ 4,hold_30_then_decay,4.3936478942632675,static_dropout_0.14,4.433655060827732,-0.04000716656446457
+ 4,mild_30_to_08,4.397788874804974,static_dropout_0.14,4.433655060827732,-0.035866186022758484
+ 4,fitted_l16_static_law,4.398257076740265,static_dropout_0.14,4.433655060827732,-0.035397984087467194
+ 4,static_dropout_0.14,4.433655060827732,static_dropout_0.14,4.433655060827732,0.0
+ 4,static_dropout_0.3,4.445499815046787,static_dropout_0.14,4.433655060827732,0.011844754219055176
+ 4,static_dropout_0.02,4.52195218205452,static_dropout_0.14,4.433655060827732,0.08829712122678757
+ 4,static_dropout_0,4.576848782598972,static_dropout_0.14,4.433655060827732,0.14319372177124023
+ 5,prevlocal_interaction,4.402371659874916,static_dropout_0.14,4.455972045660019,-0.053600385785102844
+ 5,hold_30_then_decay,4.4144560024142265,static_dropout_0.14,4.455972045660019,-0.04151604324579239
+ 5,mild_30_to_08,4.416067875921726,static_dropout_0.14,4.455972045660019,-0.039904169738292694
+ 5,fitted_l16_static_law,4.4160875752568245,static_dropout_0.14,4.455972045660019,-0.03988447040319443
+ 5,static_dropout_0.14,4.455972045660019,static_dropout_0.14,4.455972045660019,0.0
+ 5,static_dropout_0.3,4.48048210889101,static_dropout_0.14,4.455972045660019,0.024510063230991364
+ 5,static_dropout_0.02,4.5355376079678535,static_dropout_0.14,4.455972045660019,0.07956556230783463
+ 5,static_dropout_0,4.62194686383009,static_dropout_0.14,4.455972045660019,0.16597481817007065
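The `delta_vs_best_static` column can be reproduced from per-seed final losses. The sketch below is illustrative only: `paired_deltas` is a hypothetical helper name, and the data covers just `prevlocal_interaction` and the two strongest static baselines, with values copied from the rows above.

```python
# Per-seed final validation losses, copied from paired_final_deltas.csv.
final_val = {
    1: {"prevlocal_interaction": 4.402347795665264,
        "static_dropout_0.14": 4.4417688846588135,
        "static_dropout_0.3": 4.460195399820805},
    2: {"prevlocal_interaction": 4.401971310377121,
        "static_dropout_0.14": 4.460222490131855,
        "static_dropout_0.3": 4.4719239845871925},
    3: {"prevlocal_interaction": 4.402896843850613,
        "static_dropout_0.14": 4.43564984947443,
        "static_dropout_0.3": 4.475773207843304},
    4: {"prevlocal_interaction": 4.381064593791962,
        "static_dropout_0.14": 4.433655060827732,
        "static_dropout_0.3": 4.445499815046787},
    5: {"prevlocal_interaction": 4.402371659874916,
        "static_dropout_0.14": 4.455972045660019,
        "static_dropout_0.3": 4.48048210889101},
}


def paired_deltas(final_val, condition, static_names):
    """Delta of `condition` against the best static baseline within each seed."""
    deltas = {}
    for seed, by_condition in final_val.items():
        best_static = min(by_condition[name] for name in static_names)
        deltas[seed] = by_condition[condition] - best_static
    return deltas


deltas = paired_deltas(
    final_val, "prevlocal_interaction", ["static_dropout_0.14", "static_dropout_0.3"]
)
wins = sum(delta < 0 for delta in deltas.values())  # negative delta = lower loss
worst_delta = max(deltas.values())  # least favourable seed for the schedule
```

Because the best static baseline is chosen per seed, the comparison stays paired: each schedule is measured against the strongest static run that shares its seed, not against a pooled mean.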
runs/previous_local_streaming_report/l16_updated_formula_clean_5seed/stage_summary.csv ADDED
@@ -0,0 +1,41 @@
+ condition,stage,token_limit,dropout,n,mean_val,std_val,mean_train,std_train,mean_gap,std_gap
+ prevlocal_interaction,0,250000,0.385,5,5.4946602284908295,0.01093302726132647,4.6016244173049925,0.026939774812057612,0.8930358111858367,0.016010040404479016
+ prevlocal_interaction,1,500000,0.319,5,5.071460470557213,0.01179360076463939,4.20646400153637,0.03245807641301778,0.8649964690208435,0.028479060454710485
+ prevlocal_interaction,2,1000000,0.227,5,4.781069016456604,0.008428247627355752,4.08262689858675,0.023420093535722695,0.698442117869854,0.028148054160246055
+ prevlocal_interaction,3,2000000,0.139,5,4.558990895748138,0.014177173687439483,4.080180557072163,0.022080406390158354,0.4788103386759758,0.021926835907345392
+ prevlocal_interaction,4,4000000,0.066,5,4.3981304407119755,0.009545784836147743,4.080478595197201,0.009311692964638253,0.3176518455147743,0.007999498965173152
+ hold_30_then_decay,0,250000,0.3,5,5.4483301296830176,0.013828501308583057,4.442901518940926,0.027340510763309508,1.0054286107420922,0.02219730946529904
+ hold_30_then_decay,1,500000,0.3,5,5.066737350821495,0.017273545737457947,4.1383186161518095,0.03875357135925004,0.9284187346696854,0.04002925354224623
+ hold_30_then_decay,2,1000000,0.2,5,4.775730343163014,0.014352387307903692,4.037793649733066,0.02368477035230831,0.7379366934299469,0.01882967372675974
+ hold_30_then_decay,3,2000000,0.1,5,4.559869511425495,0.016051317749301037,4.044496415555477,0.019708741233353012,0.515373095870018,0.020379012527272283
+ hold_30_then_decay,4,4000000,0.02,5,4.405232906341553,0.011151070705538514,4.0487526342272755,0.007824452256379268,0.3564802721142769,0.01297330703929578
+ mild_30_to_08,0,250000,0.3,5,5.448330116271973,0.013828516502701142,4.442901518940926,0.027340586099289153,1.005428597331047,0.022197359989343385
+ mild_30_to_08,1,500000,0.24,5,5.058184179663658,0.015882720199114145,4.034893324971199,0.04033083916799125,1.023290854692459,0.04098800520602419
+ mild_30_to_08,2,1000000,0.18,5,4.777442049980164,0.013845858727658497,3.9886452093720437,0.02349137402419598,0.7887968406081199,0.018652082916074838
+ mild_30_to_08,3,2000000,0.12,5,4.563060106337071,0.015509498762185112,4.044088624417782,0.020441976517745996,0.5189714819192887,0.022529376522631556
+ mild_30_to_08,4,4000000,0.08,5,4.40728645324707,0.008502541215009067,4.07358001768589,0.0063536190340169095,0.3337064355611801,0.010359634321755684
+ fitted_l16_static_law,0,250000,0.6,5,5.7842145070433615,0.009632183754286684,5.164006796479225,0.02748612153330559,0.6202077105641365,0.018181362630120823
+ fitted_l16_static_law,1,500000,0.4,5,5.150681225955486,0.010164023432481408,4.463223123550415,0.029267257511679485,0.6874581024050712,0.024496105147219012
+ fitted_l16_static_law,2,1000000,0.3,5,4.832601730525494,0.010169544124120607,4.263189716637134,0.02333674196202296,0.569412013888359,0.023004537548591726
+ fitted_l16_static_law,3,2000000,0.14,5,4.58056578040123,0.01532149630405117,4.14712455868721,0.01706029159496315,0.4334412217140198,0.019914111395845077
+ fitted_l16_static_law,4,4000000,0.02,5,4.412404176592827,0.00843791675235308,4.098657152056694,0.01111204513074185,0.3137470245361328,0.007204760471400837
+ static_dropout_0.14,0,250000,0.14,5,5.477323499321938,0.02236835486589015,4.029827673733235,0.018556819977249093,1.4474958255887032,0.03092474074054602
+ static_dropout_0.14,1,500000,0.14,5,5.149166536331177,0.007010026540791338,3.714307613670826,0.03238913748160129,1.4348589226603508,0.031243440426199517
+ static_dropout_0.14,2,1000000,0.14,5,4.849037018418312,0.020208736415348236,3.8711691960692405,0.02974306105040781,0.9778678223490715,0.023799818088071894
+ static_dropout_0.14,3,2000000,0.14,5,4.6047517821192745,0.013619996903704912,4.039909638464451,0.025550506633378975,0.5648421436548233,0.015970945478988943
+ static_dropout_0.14,4,4000000,0.14,5,4.44545366615057,0.012017216742245517,4.116507206857205,0.014037194709348206,0.32894645929336547,0.01603071874172604
+ static_dropout_0.3,0,250000,0.3,5,5.448330155014991,0.0138285316736341,4.442901518940926,0.027340553421349313,1.005428636074066,0.022197311058747782
+ static_dropout_0.3,1,500000,0.3,5,5.066737298667431,0.017273470277214743,4.138318654894829,0.03875368971811196,0.9284186437726021,0.04002925284584238
+ static_dropout_0.3,2,1000000,0.3,5,4.79825523942709,0.01441949497608529,4.150126910209655,0.023298256740585745,0.6481283292174339,0.017421801083541605
+ static_dropout_0.3,3,2000000,0.3,5,4.603498187661171,0.014129740963263297,4.2150133237242695,0.015000678307181381,0.38848486393690107,0.01687487069399014
+ static_dropout_0.3,4,4000000,0.3,5,4.46677490323782,0.014064932048228269,4.231865841150284,0.010414934638152858,0.23490906208753587,0.008922414622347311
+ static_dropout_0.02,0,250000,0.02,5,5.742638063430786,0.024161263410536992,3.537110958993435,0.008037117123073168,2.2055271044373512,0.030737843551395496
+ static_dropout_0.02,1,500000,0.02,5,5.575391733646393,0.024791398740622035,3.124619247019291,0.031814549489392455,2.450772486627102,0.030503049251572257
+ static_dropout_0.02,2,1000000,0.02,5,5.14697041362524,0.022233878343551068,3.4615398421883583,0.03992270092195685,1.685430571436882,0.04092951267469098
+ static_dropout_0.02,3,2000000,0.02,5,4.784735175967216,0.019582585992709827,3.840523959696293,0.03097454466954304,0.9442112162709236,0.02147121638277758
+ static_dropout_0.02,4,4000000,0.02,5,4.535757505893708,0.00908401354385357,4.052870315313339,0.02163703576438587,0.48288719058036805,0.020126181497736668
+ static_dropout_0,0,250000,0.0,5,5.8329681470990185,0.019809207037273006,3.4442542552948,0.022358399496724347,2.388713891804218,0.038334133145443657
+ static_dropout_0,1,500000,0.0,5,5.717529235780239,0.05024223752386389,2.958326259255409,0.044781060309162554,2.75920297652483,0.07102439530887096
+ static_dropout_0,2,1000000,0.0,5,5.26366505920887,0.027353946222948587,3.3260142356157303,0.03607156293344983,1.9376508235931396,0.03553067411055354
+ static_dropout_0,3,2000000,0.0,5,4.847234210371971,0.0170992476167825,3.778580814599991,0.03536285448761605,1.0686533957719804,0.025604091638377884
+ static_dropout_0,4,4000000,0.0,5,4.594272664189338,0.021638340853154137,4.041403333842754,0.017193152802814336,0.5528693303465844,0.029132548047629703
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/RESULT_SUMMARY.md ADDED
@@ -0,0 +1,86 @@
+ # Locked Streaming Dropout Summary
+
+ Run directory: `runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525`
+
+ Model: `L16_H8_D384` causal Transformer, 31,457,280 parameters, 16 layers, 8 heads, 384 embedding dim.
+ Training per stage: 1,000 steps. Sampled tokens are cumulative in each stage row. Seeds present: 1, 2, 3, 4, 5.
+
+ ## Condition Ranking
+
+ | Condition | Kind | Final dropout | Mean trajectory val loss | Final val loss | Final gap | Dropout path |
+ |---|---|---:|---:|---:|---:|---|
+ | `mild_30_to_08` | anchor_decay | 0.08 | 4.8509 | 4.4073 | 0.3337 | 0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08 |
+ | `hold_30_then_decay` | anchor_decay | 0.02 | 4.8512 | 4.4052 | 0.3565 | 0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02 |
+ | `prevlocal_interaction` | anchor_decay | 0.07 | 4.8609 | 4.3981 | 0.3177 | 0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07 |
+ | `static_dropout_0.3` | static | 0.30 | 4.8767 | 4.4668 | 0.2349 | 0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30 |
+ | `static_dropout_0.14` | static | 0.14 | 4.9051 | 4.4455 | 0.3289 | 0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14 |
+ | `fitted_l16_static_law` | anchor_decay | 0.02 | 4.9521 | 4.4124 | 0.3137 | 0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02 |
+ | `static_dropout_0.02` | static | 0.02 | 5.1571 | 4.5358 | 0.4829 | 0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02 |
+ | `static_dropout_0` | static | 0.00 | 5.2511 | 4.5943 | 0.5529 | 0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00 |
+
+ ## Stage Trajectory
+
+ ### Stage 0: 250,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `mild_30_to_08` | 0.30 | 5.4483 | 4.4429 | 1.0054 | 5 |
+ | `hold_30_then_decay` | 0.30 | 5.4483 | 4.4429 | 1.0054 | 5 |
+ | `static_dropout_0.3` | 0.30 | 5.4483 | 4.4429 | 1.0054 | 5 |
+ | `static_dropout_0.14` | 0.14 | 5.4773 | 4.0298 | 1.4475 | 5 |
+ | `prevlocal_interaction` | 0.39 | 5.4947 | 4.6016 | 0.8930 | 5 |
+ | `static_dropout_0.02` | 0.02 | 5.7426 | 3.5371 | 2.2055 | 5 |
+ | `fitted_l16_static_law` | 0.60 | 5.7842 | 5.1640 | 0.6202 | 5 |
+ | `static_dropout_0` | 0.00 | 5.8330 | 3.4443 | 2.3887 | 5 |
+
+ ### Stage 1: 500,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `mild_30_to_08` | 0.24 | 5.0582 | 4.0349 | 1.0233 | 5 |
+ | `static_dropout_0.3` | 0.30 | 5.0667 | 4.1383 | 0.9284 | 5 |
+ | `hold_30_then_decay` | 0.30 | 5.0667 | 4.1383 | 0.9284 | 5 |
+ | `prevlocal_interaction` | 0.32 | 5.0715 | 4.2065 | 0.8650 | 5 |
+ | `static_dropout_0.14` | 0.14 | 5.1492 | 3.7143 | 1.4349 | 5 |
+ | `fitted_l16_static_law` | 0.40 | 5.1507 | 4.4632 | 0.6875 | 5 |
+ | `static_dropout_0.02` | 0.02 | 5.5754 | 3.1246 | 2.4508 | 5 |
+ | `static_dropout_0` | 0.00 | 5.7175 | 2.9583 | 2.7592 | 5 |
+
+ ### Stage 2: 1,000,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `hold_30_then_decay` | 0.20 | 4.7757 | 4.0378 | 0.7379 | 5 |
+ | `mild_30_to_08` | 0.18 | 4.7774 | 3.9886 | 0.7888 | 5 |
+ | `prevlocal_interaction` | 0.23 | 4.7811 | 4.0826 | 0.6984 | 5 |
+ | `static_dropout_0.3` | 0.30 | 4.7983 | 4.1501 | 0.6481 | 5 |
+ | `fitted_l16_static_law` | 0.30 | 4.8326 | 4.2632 | 0.5694 | 5 |
+ | `static_dropout_0.14` | 0.14 | 4.8490 | 3.8712 | 0.9779 | 5 |
+ | `static_dropout_0.02` | 0.02 | 5.1470 | 3.4615 | 1.6854 | 5 |
+ | `static_dropout_0` | 0.00 | 5.2637 | 3.3260 | 1.9377 | 5 |
+
+ ### Stage 3: 2,000,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `prevlocal_interaction` | 0.14 | 4.5590 | 4.0802 | 0.4788 | 5 |
+ | `hold_30_then_decay` | 0.10 | 4.5599 | 4.0445 | 0.5154 | 5 |
+ | `mild_30_to_08` | 0.12 | 4.5631 | 4.0441 | 0.5190 | 5 |
+ | `fitted_l16_static_law` | 0.14 | 4.5806 | 4.1471 | 0.4334 | 5 |
+ | `static_dropout_0.3` | 0.30 | 4.6035 | 4.2150 | 0.3885 | 5 |
+ | `static_dropout_0.14` | 0.14 | 4.6048 | 4.0399 | 0.5648 | 5 |
+ | `static_dropout_0.02` | 0.02 | 4.7847 | 3.8405 | 0.9442 | 5 |
+ | `static_dropout_0` | 0.00 | 4.8472 | 3.7786 | 1.0687 | 5 |
+
+ ### Stage 4: 4,000,000 Prefix Tokens
+
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+ |---|---:|---:|---:|---:|---:|
+ | `prevlocal_interaction` | 0.07 | 4.3981 | 4.0805 | 0.3177 | 5 |
+ | `hold_30_then_decay` | 0.02 | 4.4052 | 4.0488 | 0.3565 | 5 |
+ | `mild_30_to_08` | 0.08 | 4.4073 | 4.0736 | 0.3337 | 5 |
+ | `fitted_l16_static_law` | 0.02 | 4.4124 | 4.0987 | 0.3137 | 5 |
+ | `static_dropout_0.14` | 0.14 | 4.4455 | 4.1165 | 0.3289 | 5 |
+ | `static_dropout_0.3` | 0.30 | 4.4668 | 4.2319 | 0.2349 | 5 |
+ | `static_dropout_0.02` | 0.02 | 4.5358 | 4.0529 | 0.4829 | 5 |
+ | `static_dropout_0` | 0.00 | 4.5943 | 4.0414 | 0.5529 | 5 |
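Two derived columns in this summary follow from the stage tables by simple arithmetic: the gap is validation loss minus train loss, and the "mean trajectory val loss" is consistent with the unweighted mean of the five stage validation losses. The sketch below checks both for `prevlocal_interaction` using the rounded table values, so the last digit can differ slightly from the report, which works from unrounded numbers.

```python
# (mean val loss, mean train loss) per stage for prevlocal_interaction,
# copied from the Stage 0..4 tables (rounded to four decimals).
stage_rows = [
    (5.4947, 4.6016),
    (5.0715, 4.2065),
    (4.7811, 4.0826),
    (4.5590, 4.0802),
    (4.3981, 4.0805),
]

# Generalization gap per stage: validation loss minus train loss.
gaps = [val - train for val, train in stage_rows]

# Unweighted mean over stages; matching the "Mean trajectory val loss" column
# is an observation from the numbers, not a documented definition.
trajectory_val = sum(val for val, _ in stage_rows) / len(stage_rows)
```

The equal weighting matters: early stages have much higher loss, so the trajectory ranking rewards schedules that are good early, which is why it orders the conditions differently from the final-loss column.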
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/config.json ADDED
@@ -0,0 +1,222 @@
+ {
+ "args": {
+ "mode": "locked_stream",
+ "corpus": null,
+ "corpus_glob": null,
+ "text_column": "text",
+ "use_cached_data": true,
+ "output_dir": "runs/previous_local_updated_formula_clean_l16",
+ "resume_from": null,
+ "cache_dir": ".cache/dropout_decay",
+ "models": [
+ "L16_H8_D384=16x8x384"
+ ],
+ "seeds": [
+ 1,
+ 2,
+ 3,
+ 4,
+ 5
+ ],
+ "token_limits": [
+ 5000000
+ ],
+ "stream_token_caps": [
+ 250000,
+ 500000,
+ 1000000,
+ 2000000,
+ 4000000
+ ],
+ "val_tokens": 500000,
+ "allow_short_corpus": false,
+ "force_retokenize": false,
+ "vocab_size": 4096,
+ "tokenizer_train_chars": 10000000,
+ "block_size": 128,
+ "batch_size": 16,
+ "steps": 2000,
+ "stage_steps": 1000,
+ "dropout_rates": [
+ 0.0,
+ 0.02,
+ 0.14,
+ 0.3
+ ],
+ "decays": [],
+ "anchor_decays": [
+ {
+ "name": "prevlocal_interaction",
+ "kind": "anchor_decay",
+ "initial": 0.385,
+ "final": 0.066,
+ "schedule": "log_prefix_anchor",
+ "decay_tokens": null,
+ "anchors": [
+ [
+ 250000,
+ 0.385
+ ],
+ [
+ 500000,
+ 0.319
+ ],
+ [
+ 1000000,
+ 0.227
+ ],
+ [
+ 2000000,
+ 0.139
+ ],
+ [
+ 4000000,
+ 0.066
+ ]
+ ]
+ },
+ {
+ "name": "hold_30_then_decay",
+ "kind": "anchor_decay",
+ "initial": 0.3,
+ "final": 0.02,
+ "schedule": "log_prefix_anchor",
+ "decay_tokens": null,
+ "anchors": [
+ [
+ 250000,
+ 0.3
+ ],
+ [
+ 500000,
+ 0.3
+ ],
+ [
+ 1000000,
+ 0.2
+ ],
+ [
+ 2000000,
+ 0.1
+ ],
+ [
+ 4000000,
+ 0.02
+ ]
+ ]
+ },
+ {
+ "name": "mild_30_to_08",
+ "kind": "anchor_decay",
+ "initial": 0.3,
+ "final": 0.08,
+ "schedule": "log_prefix_anchor",
+ "decay_tokens": null,
+ "anchors": [
+ [
+ 250000,
+ 0.3
+ ],
+ [
+ 500000,
+ 0.24
+ ],
+ [
+ 1000000,
+ 0.18
+ ],
+ [
+ 2000000,
+ 0.12
+ ],
+ [
+ 4000000,
+ 0.08
+ ]
+ ]
+ },
+ {
+ "name": "fitted_l16_static_law",
+ "kind": "anchor_decay",
+ "initial": 0.6,
+ "final": 0.02,
+ "schedule": "log_prefix_anchor",
+ "decay_tokens": null,
+ "anchors": [
+ [
+ 250000,
+ 0.6
+ ],
+ [
+ 500000,
+ 0.4
+ ],
+ [
+ 1000000,
+ 0.3
+ ],
+ [
+ 2000000,
+ 0.14
+ ],
+ [
+ 4000000,
+ 0.02
+ ]
+ ]
+ }
+ ],
+ "decay_tokens": null,
+ "eval_batches": 64,
+ "train_eval_batches": 32,
+ "trace_eval_batches": 8,
+ "eval_every": 0,
+ "log_every": 250,
+ "lr": 0.0003,
+ "weight_decay": 0.1,
+ "grad_clip": 1.0,
+ "plateau_delta": 0.01,
+ "target_min_dropout": 0.1,
+ "min_nonzero_margin": 0.01,
+ "min_high_dropout_margin": 0.03,
+ "screen_early_stop": false,
+ "screen_prune_patience": 3,
+ "screen_prune_min_delta": 0.01
+ },
+ "mode": "locked_stream",
+ "seeds": [
+ 1,
+ 2,
+ 3,
+ 4,
+ 5
+ ],
+ "models": [
+ {
+ "model_name": "L16_H8_D384",
+ "n_layer": 16,
+ "n_head": 8,
+ "n_embd": 384
+ }
+ ],
+ "device": "mps",
+ "torch": "2.12.0",
+ "python": "3.11.15 (main, Mar 3 2026, 00:52:57) [Clang 21.0.0 (clang-2100.0.123.102)]",
+ "mps_available": true,
+ "attribution": "Derived from Andrej Karpathy's nanochat project (https://github.com/karpathy/nanochat), MIT License, Copyright (c) 2025 Andrej Karpathy.",
+ "tokenizer_path": ".cache/dropout_decay/tokenizer-v4096.json",
+ "encoded_path": ".cache/dropout_decay/tokens-v4096-uint16.npy",
+ "train_tokens": 5000970,
+ "val_tokens": 500000,
+ "effective_token_limits": [
+ 5000000
+ ],
+ "effective_stream_token_caps": [
+ 250000,
+ 500000,
+ 1000000,
+ 2000000,
+ 4000000
+ ],
+ "resume_from": null
+ }
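The `anchors` lists in this config pair a prefix-token count with a dropout rate, and the schedule is named `log_prefix_anchor`. The training code is not part of this diff, so the sketch below is only one plausible reading of that name: piecewise-linear interpolation in log(tokens) between anchors, clamped at both ends. The function name `dropout_at` and the interpolation rule are assumptions; in this run every stage sits exactly on an anchor, so the reported stage dropouts do not depend on how values between anchors are filled in.

```python
import math

# Anchors for prevlocal_interaction, copied from config.json.
PREVLOCAL_ANCHORS = [
    (250_000, 0.385),
    (500_000, 0.319),
    (1_000_000, 0.227),
    (2_000_000, 0.139),
    (4_000_000, 0.066),
]


def dropout_at(tokens, anchors):
    """Hypothetical schedule: interpolate linearly in log(tokens) between anchors."""
    anchors = sorted(anchors)
    if tokens <= anchors[0][0]:
        return anchors[0][1]  # clamp before the first anchor
    if tokens >= anchors[-1][0]:
        return anchors[-1][1]  # clamp after the last anchor
    for (t0, p0), (t1, p1) in zip(anchors, anchors[1:]):
        if t0 <= tokens <= t1:
            weight = (math.log(tokens) - math.log(t0)) / (math.log(t1) - math.log(t0))
            return p0 + weight * (p1 - p0)


# Geometric midpoint of the first two anchors lands halfway between their rates.
midpoint_rate = dropout_at(math.sqrt(250_000 * 500_000), PREVLOCAL_ANCHORS)
```

Interpolating in log space treats each doubling of the prefix as one equal step, which matches the doubling pattern of `stream_token_caps`.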
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/metrics.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/summary.csv ADDED
@@ -0,0 +1,41 @@
+ run_mode,condition,condition_kind,stage,token_limit,model_name,n_layer,n_head,n_embd,parameters,dropout_initial,dropout_final,dropout_schedule,n,mean_train_eval_loss,std_train_eval_loss,mean_val_eval_loss,std_val_eval_loss,mean_generalization_gap,std_generalization_gap
+ locked_stream,fitted_l16_static_law,anchor_decay,0,250000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,5.164006796479225,0.02748612153330559,5.7842145070433615,0.009632183754286684,0.6202077105641365,0.018181362630120823
+ locked_stream,hold_30_then_decay,anchor_decay,0,250000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.442901518940926,0.027340510763309508,5.4483301296830176,0.013828501308583057,1.0054286107420922,0.02219730946529904
+ locked_stream,mild_30_to_08,anchor_decay,0,250000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,4.442901518940926,0.027340586099289153,5.448330116271973,0.013828516502701142,1.005428597331047,0.022197359989343385
+ locked_stream,prevlocal_interaction,anchor_decay,0,250000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.6016244173049925,0.026939774812057612,5.4946602284908295,0.01093302726132647,0.8930358111858367,0.016010040404479016
+ locked_stream,static_dropout_0,static,0,250000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,3.4442542552948,0.022358399496724347,5.8329681470990185,0.019809207037273006,2.388713891804218,0.038334133145443657
+ locked_stream,static_dropout_0.02,static,0,250000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,3.537110958993435,0.008037117123073168,5.742638063430786,0.024161263410536992,2.2055271044373512,0.030737843551395496
+ locked_stream,static_dropout_0.14,static,0,250000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,4.029827673733235,0.018556819977249093,5.477323499321938,0.02236835486589015,1.4474958255887032,0.03092474074054602
+ locked_stream,static_dropout_0.3,static,0,250000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.442901518940926,0.027340553421349313,5.448330155014991,0.0138285316736341,1.005428636074066,0.022197311058747782
+ locked_stream,fitted_l16_static_law,anchor_decay,1,500000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,4.463223123550415,0.029267257511679485,5.150681225955486,0.010164023432481408,0.6874581024050712,0.024496105147219012
+ locked_stream,hold_30_then_decay,anchor_decay,1,500000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.1383186161518095,0.03875357135925004,5.066737350821495,0.017273545737457947,0.9284187346696854,0.04002925354224623
+ locked_stream,mild_30_to_08,anchor_decay,1,500000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,4.034893324971199,0.04033083916799125,5.058184179663658,0.015882720199114145,1.023290854692459,0.04098800520602419
+ locked_stream,prevlocal_interaction,anchor_decay,1,500000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.20646400153637,0.03245807641301778,5.071460470557213,0.01179360076463939,0.8649964690208435,0.028479060454710485
+ locked_stream,static_dropout_0,static,1,500000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,2.958326259255409,0.044781060309162554,5.717529235780239,0.05024223752386389,2.75920297652483,0.07102439530887096
+ locked_stream,static_dropout_0.02,static,1,500000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,3.124619247019291,0.031814549489392455,5.575391733646393,0.024791398740622035,2.450772486627102,0.030503049251572257
+ locked_stream,static_dropout_0.14,static,1,500000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,3.714307613670826,0.03238913748160129,5.149166536331177,0.007010026540791338,1.4348589226603508,0.031243440426199517
+ locked_stream,static_dropout_0.3,static,1,500000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.138318654894829,0.03875368971811196,5.066737298667431,0.017273470277214743,0.9284186437726021,0.04002925284584238
+ locked_stream,fitted_l16_static_law,anchor_decay,2,1000000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,4.263189716637134,0.02333674196202296,4.832601730525494,0.010169544124120607,0.569412013888359,0.023004537548591726
+ locked_stream,hold_30_then_decay,anchor_decay,2,1000000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.037793649733066,0.02368477035230831,4.775730343163014,0.014352387307903692,0.7379366934299469,0.01882967372675974
+ locked_stream,mild_30_to_08,anchor_decay,2,1000000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,3.9886452093720437,0.02349137402419598,4.777442049980164,0.013845858727658497,0.7887968406081199,0.018652082916074838
+ locked_stream,prevlocal_interaction,anchor_decay,2,1000000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.08262689858675,0.023420093535722695,4.781069016456604,0.008428247627355752,0.698442117869854,0.028148054160246055
+ locked_stream,static_dropout_0,static,2,1000000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,3.3260142356157303,0.03607156293344983,5.26366505920887,0.027353946222948587,1.9376508235931396,0.03553067411055354
+ locked_stream,static_dropout_0.02,static,2,1000000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,3.4615398421883583,0.03992270092195685,5.14697041362524,0.022233878343551068,1.685430571436882,0.04092951267469098
+ locked_stream,static_dropout_0.14,static,2,1000000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,3.8711691960692405,0.02974306105040781,4.849037018418312,0.020208736415348236,0.9778678223490715,0.023799818088071894
+ locked_stream,static_dropout_0.3,static,2,1000000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.150126910209655,0.023298256740585745,4.79825523942709,0.01441949497608529,0.6481283292174339,0.017421801083541605
+ locked_stream,fitted_l16_static_law,anchor_decay,3,2000000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,4.14712455868721,0.01706029159496315,4.58056578040123,0.01532149630405117,0.4334412217140198,0.019914111395845077
+ locked_stream,hold_30_then_decay,anchor_decay,3,2000000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.044496415555477,0.019708741233353012,4.559869511425495,0.016051317749301037,0.515373095870018,0.020379012527272283
+ locked_stream,mild_30_to_08,anchor_decay,3,2000000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,4.044088624417782,0.020441976517745996,4.563060106337071,0.015509498762185112,0.5189714819192887,0.022529376522631556
+ locked_stream,prevlocal_interaction,anchor_decay,3,2000000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.080180557072163,0.022080406390158354,4.558990895748138,0.014177173687439483,0.4788103386759758,0.021926835907345392
+ locked_stream,static_dropout_0,static,3,2000000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,3.778580814599991,0.03536285448761605,4.847234210371971,0.0170992476167825,1.0686533957719804,0.025604091638377884
+ locked_stream,static_dropout_0.02,static,3,2000000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,3.840523959696293,0.03097454466954304,4.784735175967216,0.019582585992709827,0.9442112162709236,0.02147121638277758
+ locked_stream,static_dropout_0.14,static,3,2000000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,4.039909638464451,0.025550506633378975,4.6047517821192745,0.013619996903704912,0.5648421436548233,0.015970945478988943
+ locked_stream,static_dropout_0.3,static,3,2000000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.2150133237242695,0.015000678307181381,4.603498187661171,0.014129740963263297,0.38848486393690107,0.01687487069399014
+ locked_stream,fitted_l16_static_law,anchor_decay,4,4000000,L16_H8_D384,16,8,384,31457280,0.6,0.02,log_prefix_anchor,5,4.098657152056694,0.01111204513074185,4.412404176592827,0.00843791675235308,0.3137470245361328,0.007204760471400837
+ locked_stream,hold_30_then_decay,anchor_decay,4,4000000,L16_H8_D384,16,8,384,31457280,0.3,0.02,log_prefix_anchor,5,4.0487526342272755,0.007824452256379268,4.405232906341553,0.011151070705538514,0.3564802721142769,0.01297330703929578
+ locked_stream,mild_30_to_08,anchor_decay,4,4000000,L16_H8_D384,16,8,384,31457280,0.3,0.08,log_prefix_anchor,5,4.07358001768589,0.0063536190340169095,4.40728645324707,0.008502541215009067,0.3337064355611801,0.010359634321755684
+ locked_stream,prevlocal_interaction,anchor_decay,4,4000000,L16_H8_D384,16,8,384,31457280,0.385,0.066,log_prefix_anchor,5,4.080478595197201,0.009311692964638253,4.3981304407119755,0.009545784836147743,0.3176518455147743,0.007999498965173152
+ locked_stream,static_dropout_0,static,4,4000000,L16_H8_D384,16,8,384,31457280,0.0,0.0,constant,5,4.041403333842754,0.017193152802814336,4.594272664189338,0.021638340853154137,0.5528693303465844,0.029132548047629703
+ locked_stream,static_dropout_0.02,static,4,4000000,L16_H8_D384,16,8,384,31457280,0.02,0.02,constant,5,4.052870315313339,0.02163703576438587,4.535757505893708,0.00908401354385357,0.48288719058036805,0.020126181497736668
+ locked_stream,static_dropout_0.14,static,4,4000000,L16_H8_D384,16,8,384,31457280,0.14,0.14,constant,5,4.116507206857205,0.014037194709348206,4.44545366615057,0.012017216742245517,0.32894645929336547,0.01603071874172604
+ locked_stream,static_dropout_0.3,static,4,4000000,L16_H8_D384,16,8,384,31457280,0.3,0.3,constant,5,4.231865841150284,0.010414934638152858,4.46677490323782,0.014064932048228269,0.23490906208753587,0.008922414622347311
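The `parameters` column (31,457,280) can be cross-checked against the model shape with a rough count. The sketch below assumes a layout that the diff does not confirm: each block contributes 12·d² weights (4·d² for the attention projections plus 8·d² for an MLP with a 4x hidden width), there are no biases or learnable norm parameters, and the input embedding and output head are untied. The `vocab_size` of 4096 comes from `config.json`; `approx_param_count` is a hypothetical helper.

```python
def approx_param_count(n_layer, n_embd, vocab_size):
    """Rough Transformer weight count under the assumptions stated above."""
    per_block = 12 * n_embd * n_embd       # attention (4*d^2) + MLP (8*d^2)
    embeddings = 2 * vocab_size * n_embd   # untied token embedding and output head
    return n_layer * per_block + embeddings


l16_params = approx_param_count(n_layer=16, n_embd=384, vocab_size=4096)
```

For `L16_H8_D384` this gives 28,311,552 block weights plus 3,145,728 embedding weights, which sums to the reported total. The agreement suggests the layout assumption is reasonable, but it does not prove it.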
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/summary.json ADDED
@@ -0,0 +1,882 @@
+ [
+   {
+     "run_mode": "locked_stream",
+     "condition": "fitted_l16_static_law",
+     "condition_kind": "anchor_decay",
+     "stage": 0,
+     "token_limit": 250000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.6,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 5.164006796479225,
+     "std_train_eval_loss": 0.02748612153330559,
+     "mean_val_eval_loss": 5.7842145070433615,
+     "std_val_eval_loss": 0.009632183754286684,
+     "mean_generalization_gap": 0.6202077105641365,
+     "std_generalization_gap": 0.018181362630120823
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "hold_30_then_decay",
+     "condition_kind": "anchor_decay",
+     "stage": 0,
+     "token_limit": 250000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.442901518940926,
+     "std_train_eval_loss": 0.027340510763309508,
+     "mean_val_eval_loss": 5.4483301296830176,
+     "std_val_eval_loss": 0.013828501308583057,
+     "mean_generalization_gap": 1.0054286107420922,
+     "std_generalization_gap": 0.02219730946529904
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "mild_30_to_08",
+     "condition_kind": "anchor_decay",
+     "stage": 0,
+     "token_limit": 250000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.08,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.442901518940926,
+     "std_train_eval_loss": 0.027340586099289153,
+     "mean_val_eval_loss": 5.448330116271973,
+     "std_val_eval_loss": 0.013828516502701142,
+     "mean_generalization_gap": 1.005428597331047,
+     "std_generalization_gap": 0.022197359989343385
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "prevlocal_interaction",
+     "condition_kind": "anchor_decay",
+     "stage": 0,
+     "token_limit": 250000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.385,
+     "dropout_final": 0.066,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.6016244173049925,
+     "std_train_eval_loss": 0.026939774812057612,
+     "mean_val_eval_loss": 5.4946602284908295,
+     "std_val_eval_loss": 0.01093302726132647,
+     "mean_generalization_gap": 0.8930358111858367,
+     "std_generalization_gap": 0.016010040404479016
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0",
+     "condition_kind": "static",
+     "stage": 0,
+     "token_limit": 250000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.0,
+     "dropout_final": 0.0,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.4442542552948,
+     "std_train_eval_loss": 0.022358399496724347,
+     "mean_val_eval_loss": 5.8329681470990185,
+     "std_val_eval_loss": 0.019809207037273006,
+     "mean_generalization_gap": 2.388713891804218,
+     "std_generalization_gap": 0.038334133145443657
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.02",
+     "condition_kind": "static",
+     "stage": 0,
+     "token_limit": 250000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.02,
+     "dropout_final": 0.02,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.537110958993435,
+     "std_train_eval_loss": 0.008037117123073168,
+     "mean_val_eval_loss": 5.742638063430786,
+     "std_val_eval_loss": 0.024161263410536992,
+     "mean_generalization_gap": 2.2055271044373512,
+     "std_generalization_gap": 0.030737843551395496
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.14",
+     "condition_kind": "static",
+     "stage": 0,
+     "token_limit": 250000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.14,
+     "dropout_final": 0.14,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.029827673733235,
+     "std_train_eval_loss": 0.018556819977249093,
+     "mean_val_eval_loss": 5.477323499321938,
+     "std_val_eval_loss": 0.02236835486589015,
+     "mean_generalization_gap": 1.4474958255887032,
+     "std_generalization_gap": 0.03092474074054602
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.3",
+     "condition_kind": "static",
+     "stage": 0,
+     "token_limit": 250000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.3,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.442901518940926,
+     "std_train_eval_loss": 0.027340553421349313,
+     "mean_val_eval_loss": 5.448330155014991,
+     "std_val_eval_loss": 0.0138285316736341,
+     "mean_generalization_gap": 1.005428636074066,
+     "std_generalization_gap": 0.022197311058747782
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "fitted_l16_static_law",
+     "condition_kind": "anchor_decay",
+     "stage": 1,
+     "token_limit": 500000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.6,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.463223123550415,
+     "std_train_eval_loss": 0.029267257511679485,
+     "mean_val_eval_loss": 5.150681225955486,
+     "std_val_eval_loss": 0.010164023432481408,
+     "mean_generalization_gap": 0.6874581024050712,
+     "std_generalization_gap": 0.024496105147219012
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "hold_30_then_decay",
+     "condition_kind": "anchor_decay",
+     "stage": 1,
+     "token_limit": 500000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.1383186161518095,
+     "std_train_eval_loss": 0.03875357135925004,
+     "mean_val_eval_loss": 5.066737350821495,
+     "std_val_eval_loss": 0.017273545737457947,
+     "mean_generalization_gap": 0.9284187346696854,
+     "std_generalization_gap": 0.04002925354224623
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "mild_30_to_08",
+     "condition_kind": "anchor_decay",
+     "stage": 1,
+     "token_limit": 500000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.08,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.034893324971199,
+     "std_train_eval_loss": 0.04033083916799125,
+     "mean_val_eval_loss": 5.058184179663658,
+     "std_val_eval_loss": 0.015882720199114145,
+     "mean_generalization_gap": 1.023290854692459,
+     "std_generalization_gap": 0.04098800520602419
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "prevlocal_interaction",
+     "condition_kind": "anchor_decay",
+     "stage": 1,
+     "token_limit": 500000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.385,
+     "dropout_final": 0.066,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.20646400153637,
+     "std_train_eval_loss": 0.03245807641301778,
+     "mean_val_eval_loss": 5.071460470557213,
+     "std_val_eval_loss": 0.01179360076463939,
+     "mean_generalization_gap": 0.8649964690208435,
+     "std_generalization_gap": 0.028479060454710485
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0",
+     "condition_kind": "static",
+     "stage": 1,
+     "token_limit": 500000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.0,
+     "dropout_final": 0.0,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 2.958326259255409,
+     "std_train_eval_loss": 0.044781060309162554,
+     "mean_val_eval_loss": 5.717529235780239,
+     "std_val_eval_loss": 0.05024223752386389,
+     "mean_generalization_gap": 2.75920297652483,
+     "std_generalization_gap": 0.07102439530887096
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.02",
+     "condition_kind": "static",
+     "stage": 1,
+     "token_limit": 500000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.02,
+     "dropout_final": 0.02,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.124619247019291,
+     "std_train_eval_loss": 0.031814549489392455,
+     "mean_val_eval_loss": 5.575391733646393,
+     "std_val_eval_loss": 0.024791398740622035,
+     "mean_generalization_gap": 2.450772486627102,
+     "std_generalization_gap": 0.030503049251572257
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.14",
+     "condition_kind": "static",
+     "stage": 1,
+     "token_limit": 500000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.14,
+     "dropout_final": 0.14,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.714307613670826,
+     "std_train_eval_loss": 0.03238913748160129,
+     "mean_val_eval_loss": 5.149166536331177,
+     "std_val_eval_loss": 0.007010026540791338,
+     "mean_generalization_gap": 1.4348589226603508,
+     "std_generalization_gap": 0.031243440426199517
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.3",
+     "condition_kind": "static",
+     "stage": 1,
+     "token_limit": 500000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.3,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.138318654894829,
+     "std_train_eval_loss": 0.03875368971811196,
+     "mean_val_eval_loss": 5.066737298667431,
+     "std_val_eval_loss": 0.017273470277214743,
+     "mean_generalization_gap": 0.9284186437726021,
+     "std_generalization_gap": 0.04002925284584238
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "fitted_l16_static_law",
+     "condition_kind": "anchor_decay",
+     "stage": 2,
+     "token_limit": 1000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.6,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.263189716637134,
+     "std_train_eval_loss": 0.02333674196202296,
+     "mean_val_eval_loss": 4.832601730525494,
+     "std_val_eval_loss": 0.010169544124120607,
+     "mean_generalization_gap": 0.569412013888359,
+     "std_generalization_gap": 0.023004537548591726
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "hold_30_then_decay",
+     "condition_kind": "anchor_decay",
+     "stage": 2,
+     "token_limit": 1000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.037793649733066,
+     "std_train_eval_loss": 0.02368477035230831,
+     "mean_val_eval_loss": 4.775730343163014,
+     "std_val_eval_loss": 0.014352387307903692,
+     "mean_generalization_gap": 0.7379366934299469,
+     "std_generalization_gap": 0.01882967372675974
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "mild_30_to_08",
+     "condition_kind": "anchor_decay",
+     "stage": 2,
+     "token_limit": 1000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.08,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 3.9886452093720437,
+     "std_train_eval_loss": 0.02349137402419598,
+     "mean_val_eval_loss": 4.777442049980164,
+     "std_val_eval_loss": 0.013845858727658497,
+     "mean_generalization_gap": 0.7887968406081199,
+     "std_generalization_gap": 0.018652082916074838
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "prevlocal_interaction",
+     "condition_kind": "anchor_decay",
+     "stage": 2,
+     "token_limit": 1000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.385,
+     "dropout_final": 0.066,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.08262689858675,
+     "std_train_eval_loss": 0.023420093535722695,
+     "mean_val_eval_loss": 4.781069016456604,
+     "std_val_eval_loss": 0.008428247627355752,
+     "mean_generalization_gap": 0.698442117869854,
+     "std_generalization_gap": 0.028148054160246055
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0",
+     "condition_kind": "static",
+     "stage": 2,
+     "token_limit": 1000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.0,
+     "dropout_final": 0.0,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.3260142356157303,
+     "std_train_eval_loss": 0.03607156293344983,
+     "mean_val_eval_loss": 5.26366505920887,
+     "std_val_eval_loss": 0.027353946222948587,
+     "mean_generalization_gap": 1.9376508235931396,
+     "std_generalization_gap": 0.03553067411055354
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.02",
+     "condition_kind": "static",
+     "stage": 2,
+     "token_limit": 1000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.02,
+     "dropout_final": 0.02,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.4615398421883583,
+     "std_train_eval_loss": 0.03992270092195685,
+     "mean_val_eval_loss": 5.14697041362524,
+     "std_val_eval_loss": 0.022233878343551068,
+     "mean_generalization_gap": 1.685430571436882,
+     "std_generalization_gap": 0.04092951267469098
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.14",
+     "condition_kind": "static",
+     "stage": 2,
+     "token_limit": 1000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.14,
+     "dropout_final": 0.14,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.8711691960692405,
+     "std_train_eval_loss": 0.02974306105040781,
+     "mean_val_eval_loss": 4.849037018418312,
+     "std_val_eval_loss": 0.020208736415348236,
+     "mean_generalization_gap": 0.9778678223490715,
+     "std_generalization_gap": 0.023799818088071894
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.3",
+     "condition_kind": "static",
+     "stage": 2,
+     "token_limit": 1000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.3,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.150126910209655,
+     "std_train_eval_loss": 0.023298256740585745,
+     "mean_val_eval_loss": 4.79825523942709,
+     "std_val_eval_loss": 0.01441949497608529,
+     "mean_generalization_gap": 0.6481283292174339,
+     "std_generalization_gap": 0.017421801083541605
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "fitted_l16_static_law",
+     "condition_kind": "anchor_decay",
+     "stage": 3,
+     "token_limit": 2000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.6,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.14712455868721,
+     "std_train_eval_loss": 0.01706029159496315,
+     "mean_val_eval_loss": 4.58056578040123,
+     "std_val_eval_loss": 0.01532149630405117,
+     "mean_generalization_gap": 0.4334412217140198,
+     "std_generalization_gap": 0.019914111395845077
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "hold_30_then_decay",
+     "condition_kind": "anchor_decay",
+     "stage": 3,
+     "token_limit": 2000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.044496415555477,
+     "std_train_eval_loss": 0.019708741233353012,
+     "mean_val_eval_loss": 4.559869511425495,
+     "std_val_eval_loss": 0.016051317749301037,
+     "mean_generalization_gap": 0.515373095870018,
+     "std_generalization_gap": 0.020379012527272283
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "mild_30_to_08",
+     "condition_kind": "anchor_decay",
+     "stage": 3,
+     "token_limit": 2000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.08,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.044088624417782,
+     "std_train_eval_loss": 0.020441976517745996,
+     "mean_val_eval_loss": 4.563060106337071,
+     "std_val_eval_loss": 0.015509498762185112,
+     "mean_generalization_gap": 0.5189714819192887,
+     "std_generalization_gap": 0.022529376522631556
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "prevlocal_interaction",
+     "condition_kind": "anchor_decay",
+     "stage": 3,
+     "token_limit": 2000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.385,
+     "dropout_final": 0.066,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.080180557072163,
+     "std_train_eval_loss": 0.022080406390158354,
+     "mean_val_eval_loss": 4.558990895748138,
+     "std_val_eval_loss": 0.014177173687439483,
+     "mean_generalization_gap": 0.4788103386759758,
+     "std_generalization_gap": 0.021926835907345392
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0",
+     "condition_kind": "static",
+     "stage": 3,
+     "token_limit": 2000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.0,
+     "dropout_final": 0.0,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.778580814599991,
+     "std_train_eval_loss": 0.03536285448761605,
+     "mean_val_eval_loss": 4.847234210371971,
+     "std_val_eval_loss": 0.0170992476167825,
+     "mean_generalization_gap": 1.0686533957719804,
+     "std_generalization_gap": 0.025604091638377884
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.02",
+     "condition_kind": "static",
+     "stage": 3,
+     "token_limit": 2000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.02,
+     "dropout_final": 0.02,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 3.840523959696293,
+     "std_train_eval_loss": 0.03097454466954304,
+     "mean_val_eval_loss": 4.784735175967216,
+     "std_val_eval_loss": 0.019582585992709827,
+     "mean_generalization_gap": 0.9442112162709236,
+     "std_generalization_gap": 0.02147121638277758
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.14",
+     "condition_kind": "static",
+     "stage": 3,
+     "token_limit": 2000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.14,
+     "dropout_final": 0.14,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.039909638464451,
+     "std_train_eval_loss": 0.025550506633378975,
+     "mean_val_eval_loss": 4.6047517821192745,
+     "std_val_eval_loss": 0.013619996903704912,
+     "mean_generalization_gap": 0.5648421436548233,
+     "std_generalization_gap": 0.015970945478988943
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.3",
+     "condition_kind": "static",
+     "stage": 3,
+     "token_limit": 2000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.3,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.2150133237242695,
+     "std_train_eval_loss": 0.015000678307181381,
+     "mean_val_eval_loss": 4.603498187661171,
+     "std_val_eval_loss": 0.014129740963263297,
+     "mean_generalization_gap": 0.38848486393690107,
+     "std_generalization_gap": 0.01687487069399014
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "fitted_l16_static_law",
+     "condition_kind": "anchor_decay",
+     "stage": 4,
+     "token_limit": 4000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.6,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.098657152056694,
+     "std_train_eval_loss": 0.01111204513074185,
+     "mean_val_eval_loss": 4.412404176592827,
+     "std_val_eval_loss": 0.00843791675235308,
+     "mean_generalization_gap": 0.3137470245361328,
+     "std_generalization_gap": 0.007204760471400837
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "hold_30_then_decay",
+     "condition_kind": "anchor_decay",
+     "stage": 4,
+     "token_limit": 4000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.02,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.0487526342272755,
+     "std_train_eval_loss": 0.007824452256379268,
+     "mean_val_eval_loss": 4.405232906341553,
+     "std_val_eval_loss": 0.011151070705538514,
+     "mean_generalization_gap": 0.3564802721142769,
+     "std_generalization_gap": 0.01297330703929578
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "mild_30_to_08",
+     "condition_kind": "anchor_decay",
+     "stage": 4,
+     "token_limit": 4000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.08,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.07358001768589,
+     "std_train_eval_loss": 0.0063536190340169095,
+     "mean_val_eval_loss": 4.40728645324707,
+     "std_val_eval_loss": 0.008502541215009067,
+     "mean_generalization_gap": 0.3337064355611801,
+     "std_generalization_gap": 0.010359634321755684
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "prevlocal_interaction",
+     "condition_kind": "anchor_decay",
+     "stage": 4,
+     "token_limit": 4000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.385,
+     "dropout_final": 0.066,
+     "dropout_schedule": "log_prefix_anchor",
+     "n": 5,
+     "mean_train_eval_loss": 4.080478595197201,
+     "std_train_eval_loss": 0.009311692964638253,
+     "mean_val_eval_loss": 4.3981304407119755,
+     "std_val_eval_loss": 0.009545784836147743,
+     "mean_generalization_gap": 0.3176518455147743,
+     "std_generalization_gap": 0.007999498965173152
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0",
+     "condition_kind": "static",
+     "stage": 4,
+     "token_limit": 4000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.0,
+     "dropout_final": 0.0,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.041403333842754,
+     "std_train_eval_loss": 0.017193152802814336,
+     "mean_val_eval_loss": 4.594272664189338,
+     "std_val_eval_loss": 0.021638340853154137,
+     "mean_generalization_gap": 0.5528693303465844,
+     "std_generalization_gap": 0.029132548047629703
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.02",
+     "condition_kind": "static",
+     "stage": 4,
+     "token_limit": 4000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.02,
+     "dropout_final": 0.02,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.052870315313339,
+     "std_train_eval_loss": 0.02163703576438587,
+     "mean_val_eval_loss": 4.535757505893708,
+     "std_val_eval_loss": 0.00908401354385357,
+     "mean_generalization_gap": 0.48288719058036805,
+     "std_generalization_gap": 0.020126181497736668
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.14",
+     "condition_kind": "static",
+     "stage": 4,
+     "token_limit": 4000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.14,
+     "dropout_final": 0.14,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.116507206857205,
+     "std_train_eval_loss": 0.014037194709348206,
+     "mean_val_eval_loss": 4.44545366615057,
+     "std_val_eval_loss": 0.012017216742245517,
+     "mean_generalization_gap": 0.32894645929336547,
+     "std_generalization_gap": 0.01603071874172604
+   },
+   {
+     "run_mode": "locked_stream",
+     "condition": "static_dropout_0.3",
+     "condition_kind": "static",
+     "stage": 4,
+     "token_limit": 4000000,
+     "model_name": "L16_H8_D384",
+     "n_layer": 16,
+     "n_head": 8,
+     "n_embd": 384,
+     "parameters": 31457280,
+     "dropout_initial": 0.3,
+     "dropout_final": 0.3,
+     "dropout_schedule": "constant",
+     "n": 5,
+     "mean_train_eval_loss": 4.231865841150284,
+     "std_train_eval_loss": 0.010414934638152858,
+     "mean_val_eval_loss": 4.46677490323782,
+     "std_val_eval_loss": 0.014064932048228269,
+     "mean_generalization_gap": 0.23490906208753587,
+     "std_generalization_gap": 0.008922414622347311
+   }
+ ]
runs/previous_local_updated_formula_clean_l16/locked_stream/20260530-174525/trace.jsonl ADDED
The diff for this file is too large to render. See raw diff