Mandeep Sidhu committed on
Commit
2f2776e
·
1 Parent(s): 3550904

Add five-seed TinyStories streaming validation

docs/plan.md CHANGED
@@ -283,8 +283,9 @@ Use this order for every regime.
283
  |---|---|---|
284
  | original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
285
  | TinyStories static/coefficient regime | active | main coefficient evidence |
286
- | TinyStories streaming regime | 3-seed validation complete | schedule evidence supports decay over static final loss; expand to 5 seeds for paper-grade confidence |
287
- | next streaming regime | pending | start after deciding whether to expand TinyStories from 3 to 5 seeds |
 
288
 
289
  ## Current Formula Status
290
 
@@ -331,24 +332,31 @@ structure transfers, while coefficients may be regime-specific.
331
  | TinyStories static optima | interaction form fits static dropout optima better than base ABC |
332
  | TinyStories held-out prefix | supports pressure dependence on unique tokens |
333
  | TinyStories held-out model | supports pressure dependence on model size |
334
- | TinyStories streaming, 3 seeds | interaction has best mean final loss; decay schedules beat best static in paired final-loss comparisons |
335
  | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
336
 
337
- Latest TinyStories 3-seed streaming final-loss table:
338
 
339
  | Condition | Mean final 4M validation loss | Std |
340
  |---|---:|---:|
341
- | `interaction` decay | 2.5392 | 0.0020 |
342
- | `smooth_low` decay | 2.5405 | 0.0018 |
343
- | `baseabc` decay | 2.5418 | 0.0019 |
344
- | static `0.08` | 2.5511 | 0.0112 |
345
- | static `0.12` | 2.5541 | 0.0041 |
346
- | static `0.18` | 2.5690 | 0.0069 |
347
-
348
- The immediate risk is still overclaiming the streaming result because `n=3` is
349
- small, but the result is materially stronger than the single-seed run.
350
- The current defensible
351
- claim is:
352
 
353
  ```text
354
  Formula-derived dropout schedules track the moving useful dropout region and
@@ -361,8 +369,8 @@ The stronger claim:
361
  Formula-derived dropout decay beats the best static dropout.
362
  ```
363
 
364
- is supported at `n=3` for this TinyStories setup, but should be expanded to
365
- `n=5` before being treated as paper-grade.
366
 
367
  ## Completed Static Backtest Gate
368
 
@@ -386,26 +394,21 @@ streaming multi-seed reports for each regime.
386
 
387
  ## Immediate Next Action
388
 
389
- If we want paper-grade confidence for TinyStories, expand the same narrowed
390
- streaming validation from 3 seeds to 5 seeds by adding seeds `4` and `5`.
391
- Do not launch a broad new regime sweep before deciding whether this 5-seed
392
- confirmation is needed.
393
 
394
  ## Next Training After Current Gate
395
 
396
- The next MPS run, if approved, should be the same narrowed TinyStories streaming
397
- validation with seeds `4` and `5`:
 
 
398
 
399
  ```text
400
- model: L12_H8_D320
401
- seeds: 4 and 5
402
- conditions:
403
- interaction decay
404
- baseabc decay
405
- smooth_low decay
406
- static_dropout_0.08
407
- static_dropout_0.12
408
- static_dropout_0.18
409
  ```
410
 
411
  Evaluate with paired seed comparisons:
@@ -418,13 +421,14 @@ decay minus best-static delta per seed
418
  rank consistency across seeds
419
  ```
420
 
421
- If decay wins across paired seeds, promote the schedule claim. If it ties, claim
422
- competitive automatic scheduling rather than superiority. If it loses, fit a
423
- streaming-specific correction offline before launching any broader experiment.
 
424
 
425
  Latest streaming report:
426
 
427
  ```text
428
  docs/streaming_multiseed_validation_report.md
429
- runs/streaming_tinystories_multiseed_validation_l12/combined_3seed_summary/
430
  ```
 
283
  |---|---|---|
284
  | original/local saved regime | offline backtest complete | retrospective support for interaction pressure law; do not rerun unless necessary |
285
  | TinyStories static/coefficient regime | active | main coefficient evidence |
286
+ | TinyStories streaming regime | 5-seed validation complete | current main streaming evidence; interaction decay beats best static in 5/5 paired final-loss comparisons |
287
+ | original/local streaming regime | pending report | summarize saved streaming runs before launching any additional training |
288
+ | next new streaming regime | pending | start only after TinyStories and original/local streaming reports are reconciled |
289
 
290
  ## Current Formula Status
291
 
 
332
  | TinyStories static optima | interaction form fits static dropout optima better than base ABC |
333
  | TinyStories held-out prefix | supports pressure dependence on unique tokens |
334
  | TinyStories held-out model | supports pressure dependence on model size |
335
+ | TinyStories streaming, 5 seeds | interaction has best mean final loss; interaction beats best static in 5/5 paired final-loss comparisons |
336
  | cross-regime raw coefficient transfer | weaker than within-regime fit; supports regime-specific coefficients rather than universal numeric coefficients |
337
 
338
+ Latest TinyStories 5-seed streaming final-loss table:
339
 
340
  | Condition | Mean final 4M validation loss | Std |
341
  |---|---:|---:|
342
+ | `interaction` decay | 2.5311 | 0.0213 |
343
+ | `smooth_low` decay | 2.5321 | 0.0203 |
344
+ | `baseabc` decay | 2.5357 | 0.0175 |
345
+ | static `0.08` | 2.5444 | 0.0211 |
346
+ | static `0.12` | 2.5477 | 0.0178 |
347
+ | static `0.18` | 2.5644 | 0.0182 |
348
+
349
+ Paired final-loss result:
350
+
351
+ | Decay schedule | Paired wins vs best static |
352
+ |---|---:|
353
+ | `interaction` | 5/5 |
354
+ | `baseabc` | 5/5 |
355
+ | `smooth_low` | 4/5, with the one miss only `+0.0003` |
356
+
357
+ The immediate risk is no longer TinyStories seed count. The main remaining risk
358
+ is external validity: the current strongest streaming result is one corpus and
359
+ one narrowed model/optimizer regime. The current defensible claim is:
360
 
361
  ```text
362
  Formula-derived dropout schedules track the moving useful dropout region and
 
369
  Formula-derived dropout decay beats the best static dropout.
370
  ```
371
 
372
+ is supported at `n=5` for this TinyStories setup, with interaction decay
373
+ beating the per-seed best static baseline in all five seeds.
374
 
375
  ## Completed Static Backtest Gate
376
 
 
394
 
395
  ## Immediate Next Action
396
 
397
+ Build the original/local streaming report from saved runs. Do not launch a
398
+ broad new regime sweep until the previous/local report is reconciled against
399
+ the TinyStories five-seed result.
 
400
 
401
  ## Next Training After Current Gate
402
 
403
+ No MPS training should launch before the previous/local streaming report is
404
+ generated from existing saved runs. If that report lacks enough coverage for a
405
+ clean claim, the next MPS run should be narrowly scoped to only the missing
406
+ conditions or seeds:
407
 
408
  ```text
409
+ preferred first step: no training, saved-run report only
410
+ possible follow-up: fill missing previous/local streaming cells
411
+ avoid: broad new regime sweep before the report audit
412
  ```
413
 
414
  Evaluate with paired seed comparisons:
 
421
  rank consistency across seeds
422
  ```
423
 
424
+ If previous/local decay wins across paired seeds, promote the cross-regime
425
+ streaming claim. If it ties, claim competitive automatic scheduling rather than
426
+ superiority outside TinyStories. If it loses, fit a streaming-specific
427
+ correction offline before launching any broader experiment.
428
 
429
  Latest streaming report:
430
 
431
  ```text
432
  docs/streaming_multiseed_validation_report.md
433
+ runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/
434
  ```
docs/streaming_multiseed_validation_report.md CHANGED
@@ -2,25 +2,26 @@
2
 
3
  Date: 2026-05-30
4
 
5
- This report combines the original seed-1 streaming run with the new seeds
6
- 2 and 3 run. No additional training is performed by this script; it reads
7
- saved `metrics.jsonl` files.
8
 
9
  ## Sources
10
 
11
  - `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
12
  - `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-111523/metrics.jsonl`
 
13
 
14
  ## Condition Ranking By Final Loss
15
 
16
  | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
17
  |---|---|---:|---:|---:|---:|---:|---:|---|
18
- | `interaction` | `anchor_decay` | 3 | 2.8347 | 0.0034 | 2.5392 | 0.0020 | 0.2804 | `0.18 -> 0.14 -> 0.08 -> 0.04` |
19
- | `smooth_low` | `decay` | 3 | 2.8347 | 0.0033 | 2.5405 | 0.0018 | 0.2786 | `0.16 -> 0.11 -> 0.07 -> 0.05` |
20
- | `baseabc` | `anchor_decay` | 3 | 2.8490 | 0.0016 | 2.5418 | 0.0019 | 0.2815 | `0.25 -> 0.19 -> 0.10 -> 0.02` |
21
- | `static_dropout_0.08` | `static` | 3 | 2.8476 | 0.0048 | 2.5511 | 0.0112 | 0.2758 | `0.08 -> 0.08 -> 0.08 -> 0.08` |
22
- | `static_dropout_0.12` | `static` | 3 | 2.8369 | 0.0081 | 2.5541 | 0.0041 | 0.2443 | `0.12 -> 0.12 -> 0.12 -> 0.12` |
23
- | `static_dropout_0.18` | `static` | 3 | 2.8456 | 0.0046 | 2.5690 | 0.0069 | 0.2174 | `0.18 -> 0.18 -> 0.18 -> 0.18` |
24
 
25
  ## Paired Final-Loss Deltas
26
 
@@ -47,45 +48,60 @@ baseline for that seed.
47
  | 3 | `static_dropout_0.08` | 2.5478 | `static_dropout_0.08` | 2.5478 | +0.0000 |
48
  | 3 | `static_dropout_0.12` | 2.5510 | `static_dropout_0.08` | 2.5478 | +0.0033 |
49
  | 3 | `static_dropout_0.18` | 2.5667 | `static_dropout_0.08` | 2.5478 | +0.0189 |
 
50
 
51
  ## Stage Trajectory
52
 
53
  | Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
54
  |---:|---:|---|---:|---:|---:|---:|---:|---:|
55
- | 0 | 500,000 | `static_dropout_0.12` | 0.120 | 3 | 3.2193 | 0.0157 | 2.7021 | 0.5172 |
56
- | 0 | 500,000 | `static_dropout_0.18` | 0.180 | 3 | 3.2266 | 0.0052 | 2.8143 | 0.4123 |
57
- | 0 | 500,000 | `smooth_low` | 0.162 | 3 | 3.2267 | 0.0049 | 2.8030 | 0.4237 |
58
- | 0 | 500,000 | `interaction` | 0.184 | 3 | 3.2300 | 0.0049 | 2.8225 | 0.4075 |
59
- | 0 | 500,000 | `static_dropout_0.08` | 0.080 | 3 | 3.2304 | 0.0058 | 2.6364 | 0.5940 |
60
- | 0 | 500,000 | `baseabc` | 0.251 | 3 | 3.2659 | 0.0047 | 2.9229 | 0.3430 |
61
- | 1 | 1,000,000 | `static_dropout_0.12` | 0.120 | 3 | 2.8883 | 0.0128 | 2.4336 | 0.4548 |
62
- | 1 | 1,000,000 | `interaction` | 0.141 | 3 | 2.8900 | 0.0019 | 2.4909 | 0.3991 |
63
- | 1 | 1,000,000 | `smooth_low` | 0.115 | 3 | 2.8903 | 0.0012 | 2.4749 | 0.4154 |
64
- | 1 | 1,000,000 | `static_dropout_0.18` | 0.180 | 3 | 2.8923 | 0.0073 | 2.5390 | 0.3534 |
65
- | 1 | 1,000,000 | `baseabc` | 0.186 | 3 | 2.9021 | 0.0034 | 2.5676 | 0.3345 |
66
- | 1 | 1,000,000 | `static_dropout_0.08` | 0.080 | 3 | 2.9140 | 0.0041 | 2.3541 | 0.5599 |
67
- | 2 | 2,000,000 | `interaction` | 0.084 | 3 | 2.6794 | 0.0114 | 2.3421 | 0.3374 |
68
- | 2 | 2,000,000 | `smooth_low` | 0.067 | 3 | 2.6814 | 0.0113 | 2.3387 | 0.3427 |
69
- | 2 | 2,000,000 | `static_dropout_0.12` | 0.120 | 3 | 2.6857 | 0.0015 | 2.3639 | 0.3218 |
70
- | 2 | 2,000,000 | `baseabc` | 0.105 | 3 | 2.6864 | 0.0083 | 2.3886 | 0.2978 |
71
- | 2 | 2,000,000 | `static_dropout_0.18` | 0.180 | 3 | 2.6942 | 0.0034 | 2.4302 | 0.2641 |
72
- | 2 | 2,000,000 | `static_dropout_0.08` | 0.080 | 3 | 2.6948 | 0.0052 | 2.3069 | 0.3879 |
73
- | 3 | 4,000,000 | `interaction` | 0.045 | 3 | 2.5392 | 0.0020 | 2.2588 | 0.2804 |
74
- | 3 | 4,000,000 | `smooth_low` | 0.045 | 3 | 2.5405 | 0.0018 | 2.2619 | 0.2786 |
75
- | 3 | 4,000,000 | `baseabc` | 0.020 | 3 | 2.5418 | 0.0019 | 2.2603 | 0.2815 |
76
- | 3 | 4,000,000 | `static_dropout_0.08` | 0.080 | 3 | 2.5511 | 0.0112 | 2.2753 | 0.2758 |
77
- | 3 | 4,000,000 | `static_dropout_0.12` | 0.120 | 3 | 2.5541 | 0.0041 | 2.3098 | 0.2443 |
78
- | 3 | 4,000,000 | `static_dropout_0.18` | 0.180 | 3 | 2.5690 | 0.0069 | 2.3516 | 0.2174 |
79
 
80
  ## Interpretation
81
 
82
- - `interaction` has the best 3-seed mean final validation loss.
83
- - `smooth_low` is very close, suggesting the exact anchor values may not be
84
- uniquely required as long as the schedule follows the same pressure range.
85
- - All decay schedules beat the best static baseline on final loss in the
86
- paired seed comparisons, though seed 1 margins remain tiny.
 
 
 
87
  - Static `0.12` can win early stages, but holding it fixed loses at the
88
  final 4M stage.
89
- - This supports the schedule claim more strongly than the single-seed run,
90
- but seed count is still small. Expanding to 5 seeds is the paper-grade
91
- next check if we want stronger statistical confidence.
 
2
 
3
  Date: 2026-05-30
4
 
5
+ This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
6
+ No additional training is performed by this script; it reads saved
7
+ `metrics.jsonl` files.
8
 
9
  ## Sources
10
 
11
  - `runs/streaming_tinystories_interaction_schedule_l12/locked_stream/20260530-053831/metrics.jsonl`
12
  - `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-111523/metrics.jsonl`
13
+ - `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/metrics.jsonl`
14
 
15
  ## Condition Ranking By Final Loss
16
 
17
  | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
18
  |---|---|---:|---:|---:|---:|---:|---:|---|
19
+ | `interaction` | `anchor_decay` | 5 | 2.8309 | 0.0068 | 2.5311 | 0.0213 | 0.2626 | `0.18 -> 0.14 -> 0.08 -> 0.04` |
20
+ | `smooth_low` | `decay` | 5 | 2.8307 | 0.0069 | 2.5321 | 0.0203 | 0.2607 | `0.16 -> 0.11 -> 0.07 -> 0.05` |
21
+ | `baseabc` | `anchor_decay` | 5 | 2.8474 | 0.0028 | 2.5357 | 0.0175 | 0.2655 | `0.25 -> 0.19 -> 0.10 -> 0.02` |
22
+ | `static_dropout_0.08` | `static` | 5 | 2.8434 | 0.0072 | 2.5444 | 0.0211 | 0.2593 | `0.08 -> 0.08 -> 0.08 -> 0.08` |
23
+ | `static_dropout_0.12` | `static` | 5 | 2.8357 | 0.0061 | 2.5477 | 0.0178 | 0.2269 | `0.12 -> 0.12 -> 0.12 -> 0.12` |
24
+ | `static_dropout_0.18` | `static` | 5 | 2.8461 | 0.0047 | 2.5644 | 0.0182 | 0.2035 | `0.18 -> 0.18 -> 0.18 -> 0.18` |
25
 
26
  ## Paired Final-Loss Deltas
27
 
 
48
  | 3 | `static_dropout_0.08` | 2.5478 | `static_dropout_0.08` | 2.5478 | +0.0000 |
49
  | 3 | `static_dropout_0.12` | 2.5510 | `static_dropout_0.08` | 2.5478 | +0.0033 |
50
  | 3 | `static_dropout_0.18` | 2.5667 | `static_dropout_0.08` | 2.5478 | +0.0189 |
51
+ | 4 | `interaction` | 2.4932 | `static_dropout_0.08` | 2.5098 | -0.0166 |
52
+ | 4 | `baseabc` | 2.5049 | `static_dropout_0.08` | 2.5098 | -0.0049 |
53
+ | 4 | `smooth_low` | 2.4959 | `static_dropout_0.08` | 2.5098 | -0.0139 |
54
+ | 4 | `static_dropout_0.08` | 2.5098 | `static_dropout_0.08` | 2.5098 | +0.0000 |
55
+ | 4 | `static_dropout_0.12` | 2.5166 | `static_dropout_0.08` | 2.5098 | +0.0068 |
56
+ | 4 | `static_dropout_0.18` | 2.5343 | `static_dropout_0.08` | 2.5098 | +0.0244 |
57
+ | 5 | `interaction` | 2.5447 | `static_dropout_0.08` | 2.5588 | -0.0141 |
58
+ | 5 | `baseabc` | 2.5481 | `static_dropout_0.08` | 2.5588 | -0.0107 |
59
+ | 5 | `smooth_low` | 2.5428 | `static_dropout_0.08` | 2.5588 | -0.0159 |
60
+ | 5 | `static_dropout_0.08` | 2.5588 | `static_dropout_0.08` | 2.5588 | +0.0000 |
61
+ | 5 | `static_dropout_0.12` | 2.5595 | `static_dropout_0.08` | 2.5588 | +0.0008 |
62
+ | 5 | `static_dropout_0.18` | 2.5806 | `static_dropout_0.08` | 2.5588 | +0.0218 |
63
 
64
  ## Stage Trajectory
65
 
66
  | Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
67
  |---:|---:|---|---:|---:|---:|---:|---:|---:|
68
+ | 0 | 500,000 | `static_dropout_0.12` | 0.120 | 5 | 3.2226 | 0.0143 | 2.6968 | 0.5257 |
69
+ | 0 | 500,000 | `smooth_low` | 0.162 | 5 | 3.2287 | 0.0122 | 2.7909 | 0.4377 |
70
+ | 0 | 500,000 | `static_dropout_0.08` | 0.080 | 5 | 3.2304 | 0.0102 | 2.6173 | 0.6131 |
71
+ | 0 | 500,000 | `interaction` | 0.184 | 5 | 3.2326 | 0.0123 | 2.8108 | 0.4218 |
72
+ | 0 | 500,000 | `static_dropout_0.18` | 0.180 | 5 | 3.2349 | 0.0151 | 2.8056 | 0.4293 |
73
+ | 0 | 500,000 | `baseabc` | 0.251 | 5 | 3.2728 | 0.0102 | 2.9139 | 0.3588 |
74
+ | 1 | 1,000,000 | `interaction` | 0.141 | 5 | 2.8908 | 0.0027 | 2.4842 | 0.4065 |
75
+ | 1 | 1,000,000 | `smooth_low` | 0.115 | 5 | 2.8912 | 0.0018 | 2.4678 | 0.4234 |
76
+ | 1 | 1,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.8930 | 0.0121 | 2.4335 | 0.4595 |
77
+ | 1 | 1,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.8990 | 0.0106 | 2.5397 | 0.3593 |
78
+ | 1 | 1,000,000 | `baseabc` | 0.186 | 5 | 2.9041 | 0.0037 | 2.5659 | 0.3382 |
79
+ | 1 | 1,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.9132 | 0.0068 | 2.3531 | 0.5601 |
80
+ | 2 | 2,000,000 | `interaction` | 0.084 | 5 | 2.6690 | 0.0207 | 2.3392 | 0.3298 |
81
+ | 2 | 2,000,000 | `smooth_low` | 0.067 | 5 | 2.6708 | 0.0218 | 2.3360 | 0.3347 |
82
+ | 2 | 2,000,000 | `baseabc` | 0.105 | 5 | 2.6770 | 0.0186 | 2.3938 | 0.2833 |
83
+ | 2 | 2,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.6795 | 0.0163 | 2.3697 | 0.3098 |
84
+ | 2 | 2,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.6856 | 0.0161 | 2.3109 | 0.3747 |
85
+ | 2 | 2,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.6860 | 0.0159 | 2.4347 | 0.2513 |
86
+ | 3 | 4,000,000 | `interaction` | 0.045 | 5 | 2.5311 | 0.0213 | 2.2685 | 0.2626 |
87
+ | 3 | 4,000,000 | `smooth_low` | 0.045 | 5 | 2.5321 | 0.0203 | 2.2713 | 0.2607 |
88
+ | 3 | 4,000,000 | `baseabc` | 0.020 | 5 | 2.5357 | 0.0175 | 2.2702 | 0.2655 |
89
+ | 3 | 4,000,000 | `static_dropout_0.08` | 0.080 | 5 | 2.5444 | 0.0211 | 2.2851 | 0.2593 |
90
+ | 3 | 4,000,000 | `static_dropout_0.12` | 0.120 | 5 | 2.5477 | 0.0178 | 2.3208 | 0.2269 |
91
+ | 3 | 4,000,000 | `static_dropout_0.18` | 0.180 | 5 | 2.5644 | 0.0182 | 2.3609 | 0.2035 |
92
 
93
  ## Interpretation
94
 
95
+ - `interaction` has the best 5-seed mean final validation loss: 2.5311 +/- 0.0213.
96
+ - The best static baseline by mean final loss is `static_dropout_0.08` at 2.5444 +/- 0.0211.
97
+ - `smooth_low` is very close to `interaction`, suggesting the exact anchor
98
+ values may not be uniquely required as long as the schedule follows the
99
+ same pressure range.
100
+ - `interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0005.
101
+ - `smooth_low` beats the per-seed best static baseline in 4/5 seeds; worst paired delta is +0.0003.
102
+ - `baseabc` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0022.
103
  - Static `0.12` can win early stages, but holding it fixed loses at the
104
  final 4M stage.
105
+ - This is now the TinyStories paper-grade validation gate for this narrowed
106
+ setup: five seeds, paired seed comparisons, and static baselines selected
107
+ from the same stream protocol.
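The paired comparison described in the interpretation can be sketched as follows. This is a minimal illustration, not the script that produced the report: the `FINAL_VAL` table and the `paired_summary` helper are hypothetical names, and the losses are the per-seed final 4M validation losses from `paired_final_deltas.csv` in this commit, rounded to six decimals. For each seed it picks the static condition with the lowest final loss, then counts a decay schedule as a win when its delta against that baseline is negative.

```python
# Per-seed final 4M validation loss, rounded from paired_final_deltas.csv.
FINAL_VAL = {
    1: {"interaction": 2.541424, "baseabc": 2.539687, "smooth_low": 2.542263,
        "static_dropout_0.08": 2.541930, "static_dropout_0.12": 2.552574,
        "static_dropout_0.18": 2.563616},
    2: {"interaction": 2.537696, "baseabc": 2.543177, "smooth_low": 2.538622,
        "static_dropout_0.08": 2.563583, "static_dropout_0.12": 2.558807,
        "static_dropout_0.18": 2.576763},
    3: {"interaction": 2.538517, "baseabc": 2.542543, "smooth_low": 2.540663,
        "static_dropout_0.08": 2.547761, "static_dropout_0.12": 2.551049,
        "static_dropout_0.18": 2.566690},
    4: {"interaction": 2.493242, "baseabc": 2.504879, "smooth_low": 2.495888,
        "static_dropout_0.08": 2.509829, "static_dropout_0.12": 2.516623,
        "static_dropout_0.18": 2.534257},
    5: {"interaction": 2.544684, "baseabc": 2.548106, "smooth_low": 2.542848,
        "static_dropout_0.08": 2.558782, "static_dropout_0.12": 2.559536,
        "static_dropout_0.18": 2.580582},
}
DECAY = ("interaction", "baseabc", "smooth_low")


def paired_summary(final_val):
    """Count per-seed wins of each decay schedule against the best static run."""
    wins = {name: 0 for name in DECAY}
    worst = {name: float("-inf") for name in DECAY}
    best_static = {}
    for seed, losses in final_val.items():
        statics = {k: v for k, v in losses.items() if k.startswith("static_")}
        # The baseline is chosen per seed, so it can differ between seeds.
        best_name = min(statics, key=statics.get)
        best_static[seed] = best_name
        for name in DECAY:
            delta = losses[name] - statics[best_name]
            wins[name] += delta < 0          # negative delta means decay won
            worst[name] = max(worst[name], delta)
    return wins, worst, best_static


wins, worst, best_static = paired_summary(FINAL_VAL)
for name in DECAY:
    print(f"{name}: {wins[name]}/{len(FINAL_VAL)} wins, "
          f"worst delta {worst[name]:+.4f}")
```

Selecting the baseline per seed is the stricter comparison: in seed 2 the best static run is `static_dropout_0.12`, not the `static_dropout_0.08` condition that has the best mean.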
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/condition_summary.csv ADDED
@@ -0,0 +1,7 @@
1
+ condition,kind,n,mean_trajectory_val,std_trajectory_val,mean_final_val,std_final_val,mean_final_gap,std_final_gap,dropout_path
2
+ interaction,anchor_decay,5,2.830870787985623,0.006818420404928253,2.5311126589775084,0.021347338722582713,0.2625980645418167,0.024789124454166093,0.18 -> 0.14 -> 0.08 -> 0.04
3
+ smooth_low,decay,5,2.830661616846919,0.006865263070272619,2.532056810706854,0.020285070194807492,0.26073483750224113,0.024773434805768067,0.16 -> 0.11 -> 0.07 -> 0.05
4
+ baseabc,anchor_decay,5,2.847391692176461,0.0027872612012979406,2.5356785252690317,0.017481971768559337,0.2655265092849731,0.022078743750300757,0.25 -> 0.19 -> 0.10 -> 0.02
5
+ static_dropout_0.08,static,5,2.843405600450933,0.007235109596464878,2.5443769969046115,0.021138839821398296,0.2592696316540241,0.023068417622449672,0.08 -> 0.08 -> 0.08 -> 0.08
6
+ static_dropout_0.12,static,5,2.835688897036016,0.006067411735608185,2.54771768823266,0.017777870988576795,0.2268920622766018,0.024198331849328455,0.12 -> 0.12 -> 0.12 -> 0.12
7
+ static_dropout_0.18,static,5,2.84606368560344,0.00466160864781479,2.564381641894579,0.01822875837957205,0.20352096483111382,0.019991069212145,0.18 -> 0.18 -> 0.18 -> 0.18
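As a sanity check on the summary columns, the `interaction` row can be recomputed from the five per-seed final losses listed in `paired_final_deltas.csv`. The sketch below assumes that `std_final_val` is a sample standard deviation (divisor `n - 1`); the variable names are illustrative and are not taken from the reporting script.

```python
from statistics import mean, pstdev, stdev

# Per-seed final validation loss for `interaction`, seeds 1 to 5,
# copied from paired_final_deltas.csv.
interaction_final_val = [
    2.541424009948969,
    2.5376960188150406,
    2.5385167226195335,
    2.493242312222719,
    2.5446842312812805,
]

mean_final = mean(interaction_final_val)
sample_std = stdev(interaction_final_val)       # divisor n - 1
population_std = pstdev(interaction_final_val)  # divisor n, for comparison

print(f"mean={mean_final:.4f} sample_std={sample_std:.4f} "
      f"population_std={population_std:.4f}")
```

Under that assumption the sample standard deviation agrees with the reported `0.0213`, while the population form comes out smaller, so the `Std` columns are best read as `n - 1` estimates. Most of the spread comes from seed 4, whose final loss is about 0.04 below the other seeds across all conditions.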
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/paired_final_deltas.csv ADDED
@@ -0,0 +1,31 @@
1
+ seed,condition,final_val,best_static_condition,best_static_final_val,delta_vs_best_static
2
+ 1,interaction,2.541424009948969,static_dropout_0.08,2.54193027690053,-0.0005062669515609741
3
+ 1,baseabc,2.5396873727440834,static_dropout_0.08,2.54193027690053,-0.002242904156446457
4
+ 1,smooth_low,2.5422630608081818,static_dropout_0.08,2.54193027690053,0.00033278390765190125
5
+ 1,static_dropout_0.08,2.54193027690053,static_dropout_0.08,2.54193027690053,0.0
6
+ 1,static_dropout_0.12,2.552573699504137,static_dropout_0.08,2.54193027690053,0.010643422603607178
7
+ 1,static_dropout_0.18,2.563616167753935,static_dropout_0.08,2.54193027690053,0.021685890853405
8
+ 2,interaction,2.5376960188150406,static_dropout_0.12,2.558806713670492,-0.021110694855451584
9
+ 2,baseabc,2.5431769341230392,static_dropout_0.12,2.558806713670492,-0.015629779547452927
10
+ 2,smooth_low,2.5386215560138226,static_dropout_0.12,2.558806713670492,-0.020185157656669617
11
+ 2,static_dropout_0.08,2.563583254814148,static_dropout_0.12,2.558806713670492,0.004776541143655777
12
+ 2,static_dropout_0.12,2.558806713670492,static_dropout_0.12,2.558806713670492,0.0
13
+ 2,static_dropout_0.18,2.5767628997564316,static_dropout_0.12,2.558806713670492,0.017956186085939407
14
+ 3,interaction,2.5385167226195335,static_dropout_0.08,2.547760896384716,-0.009244173765182495
15
+ 3,baseabc,2.5425427742302418,static_dropout_0.08,2.547760896384716,-0.005218122154474258
16
+ 3,smooth_low,2.5406627915799618,static_dropout_0.08,2.547760896384716,-0.007098104804754257
17
+ 3,static_dropout_0.08,2.547760896384716,static_dropout_0.08,2.547760896384716,0.0
18
+ 3,static_dropout_0.12,2.5510490722954273,static_dropout_0.08,2.547760896384716,0.0032881759107112885
19
+ 3,static_dropout_0.18,2.566690094769001,static_dropout_0.08,2.547760896384716,0.018929198384284973
20
+ 4,interaction,2.493242312222719,static_dropout_0.08,2.5098287016153336,-0.016586389392614365
21
+ 4,baseabc,2.5048790462315083,static_dropout_0.08,2.5098287016153336,-0.004949655383825302
22
+ 4,smooth_low,2.495888389647007,static_dropout_0.08,2.5098287016153336,-0.013940311968326569
23
+ 4,static_dropout_0.08,2.5098287016153336,static_dropout_0.08,2.5098287016153336,0.0
24
+ 4,static_dropout_0.12,2.516622833907604,static_dropout_0.08,2.5098287016153336,0.00679413229227066
25
+ 4,static_dropout_0.18,2.5342570766806602,static_dropout_0.08,2.5098287016153336,0.02442837506532669
26
+ 5,interaction,2.5446842312812805,static_dropout_0.08,2.5587818548083305,-0.014097623527050018
27
+ 5,baseabc,2.548106499016285,static_dropout_0.08,2.5587818548083305,-0.010675355792045593
28
+ 5,smooth_low,2.5428482554852962,static_dropout_0.08,2.5587818548083305,-0.015933599323034286
29
+ 5,static_dropout_0.08,2.5587818548083305,static_dropout_0.08,2.5587818548083305,0.0
30
+ 5,static_dropout_0.12,2.5595361217856407,static_dropout_0.08,2.5587818548083305,0.0007542669773101807
31
+ 5,static_dropout_0.18,2.580581970512867,static_dropout_0.08,2.5587818548083305,0.021800115704536438
runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/stage_summary.csv ADDED
@@ -0,0 +1,25 @@
1
+ condition,stage,token_limit,dropout,n,mean_val,std_val,mean_train,std_train,mean_gap,std_gap
2
+ interaction,0,500000,0.184,5,3.232573334872723,0.012322249630223228,2.8108130276203154,0.01843696543963324,0.4217603072524071,0.02642271429216449
3
+ interaction,1,1000000,0.141,5,2.8907706700265408,0.00267614743609897,2.4842222586274145,0.010088488701185646,0.40654841139912606,0.010858718174861278
4
+ interaction,2,2000000,0.084,5,2.6690264880657195,0.020694473868817478,2.3392063215374947,0.013823027371981881,0.32982016652822493,0.01498506611092748
5
+ interaction,3,4000000,0.045,5,2.5311126589775084,0.021347338722582713,2.268514594435692,0.02281533749518331,0.2625980645418167,0.024789124454166093
6
+ baseabc,0,500000,0.251,5,3.2727632120251657,0.010179445899511374,2.913914993405342,0.014484851568583227,0.35884821861982347,0.023140322107556496
7
+ baseabc,1,1000000,0.186,5,2.904086685180664,0.0037183966108595805,2.5659306168556215,0.007757824816452149,0.33815606832504275,0.010478362783456022
8
+ baseabc,2,2000000,0.10500000000000001,5,2.677038346230984,0.018558286675126146,2.3937876164913177,0.01400362738363624,0.283250729739666,0.023973231722312253
9
+ baseabc,3,4000000,0.02,5,2.5356785252690317,0.017481971768559337,2.2701520159840585,0.022386508741524157,0.2655265092849731,0.022078743750300757
10
+ smooth_low,0,500000,0.16230079361664454,5,3.2286536514759065,0.012238699798719154,2.790942022204399,0.019096707542752278,0.43771162927150725,0.026763959097502593
11
+ smooth_low,1,1000000,0.11452606249945704,5,2.8911825358867644,0.0017684362872669639,2.467776434123516,0.010888431197392221,0.42340610176324844,0.011628309520454386
12
+ smooth_low,2,2000000,0.06673830013226953,5,2.6707534693181514,0.02176383141227104,2.336036388576031,0.01467458437647137,0.33471708074212075,0.01662120958844446
13
+ smooth_low,3,4000000,0.045000006515082035,5,2.532056810706854,0.020285070194807492,2.271321973204613,0.02045259624936235,0.26073483750224113,0.024773434805768067
14
+ static_dropout_0.08,0,500000,0.08,5,3.2304254487156867,0.010215096373828215,2.617284271121025,0.03148810449622768,0.6131411775946617,0.03563123820440458
15
+ static_dropout_0.08,1,1000000,0.08,5,2.9132092565298082,0.00683551263871941,2.3530609726905825,0.0062032351188941945,0.5601482838392258,0.0011661277729932713
16
+ static_dropout_0.08,2,2000000,0.08,5,2.6856106996536253,0.016060032597589886,2.310917650163174,0.009996090547415524,0.3746930494904518,0.01953395199808365
17
+ static_dropout_0.08,3,4000000,0.08,5,2.5443769969046115,0.021138839821398296,2.2851073652505876,0.025411287336708797,0.2592696316540241,0.023068417622449672
18
+ static_dropout_0.12,0,500000,0.12,5,3.2225750528275965,0.014316453554089874,2.6968257799744606,0.010457351991404112,0.5257492728531361,0.013561266528339002
19
+ static_dropout_0.12,1,1000000,0.12,5,2.893001724779606,0.012108793042410648,2.433453027904034,0.01094591291488611,0.4595486968755722,0.012902761746442865
20
+ static_dropout_0.12,2,2000000,0.12,5,2.6794611223042013,0.016337678176899628,2.369656093418598,0.01150856650994152,0.30980502888560296,0.022584218257239858
21
+ static_dropout_0.12,3,4000000,0.12,5,2.54771768823266,0.017777870988576795,2.3208256259560587,0.023844817077326764,0.2268920622766018,0.024198331849328455
22
+ static_dropout_0.18,0,500000,0.18,5,3.234873204678297,0.015085252989781608,2.8055711716413496,0.016932298003007645,0.42930203303694725,0.029596925527428923
23
+ static_dropout_0.18,1,1000000,0.18,5,2.899004338681698,0.010615437421382663,2.5396994188427926,0.0152169117514434,0.35930491983890533,0.01280367489475416
24
+ static_dropout_0.18,2,2000000,0.18,5,2.6859955571591856,0.015853498199607095,2.43468574732542,0.010811605842721716,0.25130980983376505,0.02032095250136152
25
+ static_dropout_0.18,3,4000000,0.18,5,2.564381641894579,0.01822875837957205,2.360860677063465,0.025011274732778335,0.20352096483111382,0.019991069212145
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/RESULT_SUMMARY.md ADDED
@@ -0,0 +1,63 @@
1
+ # Locked Streaming Dropout Summary
2
+
3
+ Run directory: `runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335`
4
+
5
+ Model: `L12_H8_D320` causal Transformer, 17,367,040 parameters, 12 layers, 8 heads, 320 embedding dim.
6
+ Training per stage: 2,000 steps. Sampled tokens are cumulative in each stage row. Seeds present: 4, 5.
7
+
8
+ ## Condition Ranking
9
+
10
+ | Condition | Kind | Final dropout | Mean trajectory val loss | Final val loss | Final gap | Dropout path |
11
+ |---|---|---:|---:|---:|---:|---|
12
+ | `smooth_low` | decay | 0.05 | 2.8245 | 2.5194 | 0.2339 | 0.16 -> 0.11 -> 0.07 -> 0.05 |
13
+ | `interaction` | anchor_decay | 0.04 | 2.8252 | 2.5190 | 0.2358 | 0.18 -> 0.14 -> 0.08 -> 0.04 |
14
+ | `static_dropout_0.12` | static | 0.12 | 2.8339 | 2.5381 | 0.2007 | 0.12 -> 0.12 -> 0.12 -> 0.12 |
15
+ | `static_dropout_0.08` | static | 0.08 | 2.8371 | 2.5343 | 0.2344 | 0.08 -> 0.08 -> 0.08 -> 0.08 |
16
+ | `baseabc` | anchor_decay | 0.02 | 2.8449 | 2.5265 | 0.2415 | 0.25 -> 0.19 -> 0.10 -> 0.02 |
17
+ | `static_dropout_0.18` | static | 0.18 | 2.8468 | 2.5574 | 0.1826 | 0.18 -> 0.18 -> 0.18 -> 0.18 |
18
+
19
+ ## Stage Trajectory
20
+
21
+ ### Stage 0: 500,000 Prefix Tokens
22
+
23
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
24
+ |---|---:|---:|---:|---:|---:|
25
+ | `static_dropout_0.12` | 0.12 | 3.2275 | 2.6889 | 0.5386 | 2 |
26
+ | `static_dropout_0.08` | 0.08 | 3.2304 | 2.5885 | 0.6419 | 2 |
27
+ | `smooth_low` | 0.16 | 3.2315 | 2.7729 | 0.4587 | 2 |
28
+ | `interaction` | 0.18 | 3.2364 | 2.7933 | 0.4431 | 2 |
29
+ | `static_dropout_0.18` | 0.18 | 3.2472 | 2.7925 | 0.4548 | 2 |
30
+ | `baseabc` | 0.25 | 3.2830 | 2.9004 | 0.3826 | 2 |
31
+
32
+ ### Stage 1: 1,000,000 Prefix Tokens
33
+
34
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
35
+ |---|---:|---:|---:|---:|---:|
36
+ | `interaction` | 0.14 | 2.8919 | 2.4742 | 0.4177 | 2 |
37
+ | `smooth_low` | 0.11 | 2.8925 | 2.4570 | 0.4355 | 2 |
38
+ | `static_dropout_0.12` | 0.12 | 2.9000 | 2.4333 | 0.4667 | 2 |
39
+ | `baseabc` | 0.19 | 2.9071 | 2.5635 | 0.3436 | 2 |
40
+ | `static_dropout_0.18` | 0.18 | 2.9090 | 2.5408 | 0.3682 | 2 |
41
+ | `static_dropout_0.08` | 0.08 | 2.9121 | 2.3515 | 0.5605 | 2 |
42
+
43
+ ### Stage 2: 2,000,000 Prefix Tokens
44
+
45
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
46
+ |---|---:|---:|---:|---:|---:|
47
+ | `interaction` | 0.08 | 2.6534 | 2.3349 | 0.3185 | 2 |
48
+ | `smooth_low` | 0.07 | 2.6547 | 2.3320 | 0.3228 | 2 |
49
+ | `baseabc` | 0.10 | 2.6630 | 2.4016 | 0.2614 | 2 |
50
+ | `static_dropout_0.12` | 0.12 | 2.6701 | 2.3782 | 0.2919 | 2 |
51
+ | `static_dropout_0.08` | 0.08 | 2.6718 | 2.3169 | 0.3549 | 2 |
52
+ | `static_dropout_0.18` | 0.18 | 2.6736 | 2.4414 | 0.2322 | 2 |
53
+
54
+ ### Stage 3: 4,000,000 Prefix Tokens
55
+
56
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
57
+ |---|---:|---:|---:|---:|---:|
58
+ | `interaction` | 0.04 | 2.5190 | 2.2831 | 0.2358 | 2 |
59
+ | `smooth_low` | 0.05 | 2.5194 | 2.2855 | 0.2339 | 2 |
60
+ | `baseabc` | 0.02 | 2.5265 | 2.2850 | 0.2415 | 2 |
61
+ | `static_dropout_0.08` | 0.08 | 2.5343 | 2.2999 | 0.2344 | 2 |
62
+ | `static_dropout_0.12` | 0.12 | 2.5381 | 2.3373 | 0.2007 | 2 |
63
+ | `static_dropout_0.18` | 0.18 | 2.5574 | 2.3748 | 0.1826 | 2 |
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/config.json ADDED
@@ -0,0 +1,155 @@
+ {
+   "args": {
+     "mode": "locked_stream",
+     "corpus": null,
+     "corpus_glob": null,
+     "text_column": "text",
+     "use_cached_data": true,
+     "output_dir": "runs/streaming_tinystories_multiseed_validation_l12",
+     "resume_from": null,
+     "cache_dir": ".cache/dropout_decay_tinystories",
+     "models": [
+       "L12_H8_D320=12x8x320"
+     ],
+     "seeds": [
+       4,
+       5
+     ],
+     "token_limits": [
+       5000000
+     ],
+     "stream_token_caps": [
+       500000,
+       1000000,
+       2000000,
+       4000000
+     ],
+     "val_tokens": 500000,
+     "allow_short_corpus": false,
+     "force_retokenize": false,
+     "vocab_size": 4096,
+     "tokenizer_train_chars": 10000000,
+     "block_size": 128,
+     "batch_size": 16,
+     "steps": 2000,
+     "stage_steps": 2000,
+     "dropout_rates": [
+       0.08,
+       0.12,
+       0.18
+     ],
+     "decays": [
+       {
+         "name": "smooth_low",
+         "kind": "decay",
+         "initial": 0.184,
+         "final": 0.045,
+         "schedule": "smoothstep",
+         "decay_tokens": null,
+         "anchors": []
+       }
+     ],
+     "anchor_decays": [
+       {
+         "name": "interaction",
+         "kind": "anchor_decay",
+         "initial": 0.184,
+         "final": 0.045,
+         "schedule": "log_prefix_anchor",
+         "decay_tokens": null,
+         "anchors": [
+           [
+             500000,
+             0.184
+           ],
+           [
+             1000000,
+             0.141
+           ],
+           [
+             2000000,
+             0.084
+           ],
+           [
+             4000000,
+             0.045
+           ]
+         ]
+       },
+       {
+         "name": "baseabc",
+         "kind": "anchor_decay",
+         "initial": 0.251,
+         "final": 0.02,
+         "schedule": "log_prefix_anchor",
+         "decay_tokens": null,
+         "anchors": [
+           [
+             500000,
+             0.251
+           ],
+           [
+             1000000,
+             0.186
+           ],
+           [
+             2000000,
+             0.105
+           ],
+           [
+             4000000,
+             0.02
+           ]
+         ]
+       }
+     ],
+     "decay_tokens": null,
+     "eval_batches": 64,
+     "train_eval_batches": 32,
+     "trace_eval_batches": 8,
+     "eval_every": 0,
+     "log_every": 1000,
+     "lr": 0.0003,
+     "weight_decay": 0.1,
+     "grad_clip": 1.0,
+     "plateau_delta": 0.01,
+     "target_min_dropout": 0.1,
+     "min_nonzero_margin": 0.01,
+     "min_high_dropout_margin": 0.03,
+     "screen_early_stop": false,
+     "screen_prune_patience": 3,
+     "screen_prune_min_delta": 0.01
+   },
+   "mode": "locked_stream",
+   "seeds": [
+     4,
+     5
+   ],
+   "models": [
+     {
+       "model_name": "L12_H8_D320",
+       "n_layer": 12,
+       "n_head": 8,
+       "n_embd": 320
+     }
+   ],
+   "device": "mps",
+   "torch": "2.12.0",
+   "python": "3.11.15 (main, Mar 3 2026, 00:52:57) [Clang 21.0.0 (clang-2100.0.123.102)]",
+   "mps_available": true,
+   "attribution": "Derived from Andrej Karpathy's nanochat project (https://github.com/karpathy/nanochat), MIT License, Copyright (c) 2025 Andrej Karpathy.",
+   "tokenizer_path": ".cache/dropout_decay_tinystories/tokenizer-v4096.json",
+   "encoded_path": ".cache/dropout_decay_tinystories/tokens-v4096-uint16.npy",
+   "train_tokens": 4500048,
+   "val_tokens": 500000,
+   "effective_token_limits": [
+     4500048
+   ],
+   "effective_stream_token_caps": [
+     500000,
+     1000000,
+     2000000,
+     4000000
+   ],
+   "resume_from": null
+ }
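The two schedule kinds named in this config can be sketched in Python. The smoothstep form below is an assumption, chosen because it reproduces the `dropout_active_final` values logged for `smooth_low` in `metrics.jsonl` when evaluated at the last zero-based global step of each stage (1999, 3999, 5999 out of 8000). The `log_prefix_anchor` interpolation is only a guess from the schedule name: the log confirms the anchor values at stage boundaries, not how the rate moves between them. Both function names are hypothetical and do not come from the repository.

```python
import math

# Values copied from config.json.
STEPS_PER_STAGE = 2000
STREAM_TOKEN_CAPS = [500_000, 1_000_000, 2_000_000, 4_000_000]
TOTAL_STEPS = STEPS_PER_STAGE * len(STREAM_TOKEN_CAPS)  # 8000

# Anchors of the "interaction" condition: (prefix tokens, dropout rate).
INTERACTION_ANCHORS = [
    (500_000, 0.184),
    (1_000_000, 0.141),
    (2_000_000, 0.084),
    (4_000_000, 0.045),
]


def smoothstep_dropout(step, initial=0.184, final=0.045,
                       total_steps=TOTAL_STEPS):
    """Assumed form: smoothstep over the zero-based global step fraction."""
    t = min(max(step / total_steps, 0.0), 1.0)
    s = 3.0 * t * t - 2.0 * t * t * t  # classic smoothstep polynomial
    return initial + (final - initial) * s


def log_anchor_dropout(prefix_tokens, anchors=INTERACTION_ANCHORS):
    """Hypothetical: piecewise-linear interpolation in log(prefix tokens).

    Clamps to the first and last anchor values outside the anchor range.
    """
    if prefix_tokens <= anchors[0][0]:
        return anchors[0][1]
    for (t0, d0), (t1, d1) in zip(anchors, anchors[1:]):
        if prefix_tokens <= t1:
            frac = ((math.log(prefix_tokens) - math.log(t0))
                    / (math.log(t1) - math.log(t0)))
            return d0 + (d1 - d0) * frac
    return anchors[-1][1]


# Dropout at the last update of stages 0, 1, and 2 for smooth_low.
stage_end_rates = [smoothstep_dropout(s) for s in (1999, 3999, 5999)]
print(stage_end_rates)
```

If the real `log_prefix_anchor` schedule holds the rate constant within a stage rather than interpolating, only `log_anchor_dropout` would need to change; the values at the four anchors stay the same either way.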
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/metrics.jsonl ADDED
@@ -0,0 +1,48 @@
+ {"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.184, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 180.73087000846863, "eval_loss": 3.2204562090337276, "generalization_gap": 0.422469187527895, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7979870215058327, "train_loss_last": 2.8813064098358154, "val_eval_loss": 3.2204562090337276}
+ {"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.141, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 196.36664009094238, "eval_loss": 2.8947813101112843, "generalization_gap": 0.41770828887820244, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.477073021233082, "train_loss_last": 2.6102116107940674, "val_eval_loss": 2.8947813101112843}
+ {"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.084, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 206.1359338760376, "eval_loss": 2.671290386468172, "generalization_gap": 0.3228793926537037, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3484109938144684, "train_loss_last": 2.4124107360839844, "val_eval_loss": 2.671290386468172}
+ {"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.045, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 202.7109558582306, "eval_loss": 2.493242312222719, "generalization_gap": 0.2353867031633854, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.257855609059334, "train_loss_last": 2.1464571952819824, "val_eval_loss": 2.493242312222719}
+ {"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.184, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 201.26832008361816, "eval_loss": 3.2523921839892864, "generalization_gap": 0.46378010138869286, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7886120826005936, "train_loss_last": 2.8814051151275635, "val_eval_loss": 3.2523921839892864}
+ {"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.141, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 195.89590692520142, "eval_loss": 2.888928048312664, "generalization_gap": 0.41768673807382584, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.471241310238838, "train_loss_last": 2.5935797691345215, "val_eval_loss": 2.888928048312664}
+ {"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.084, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.00955510139465, "eval_loss": 2.6355181634426117, "generalization_gap": 0.31410761922597885, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.321410544216633, "train_loss_last": 2.3252055644989014, "val_eval_loss": 2.6355181634426117}
+ {"condition": "interaction", "condition_kind": "anchor_decay", "dropout_active_final": 0.045, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 193.66337299346924, "eval_loss": 2.5446842312812805, "generalization_gap": 0.23629452288150787, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3083897083997726, "train_loss_last": 2.3208839893341064, "val_eval_loss": 2.5446842312812805}
+ {"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.251, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.9356129169464, "eval_loss": 3.279827632009983, "generalization_gap": 0.3836944177746773, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.896133214235306, "train_loss_last": 2.9102234840393066, "val_eval_loss": 3.279827632009983}
+ {"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.186, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.77453088760376, "eval_loss": 2.9080570228397846, "generalization_gap": 0.35393861308693886, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.5541184097528458, "train_loss_last": 2.793113946914673, "val_eval_loss": 2.9080570228397846}
+ {"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.105, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 193.48544478416443, "eval_loss": 2.6800981052219868, "generalization_gap": 0.27399464324116707, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.4061034619808197, "train_loss_last": 2.46707820892334, "val_eval_loss": 2.6800981052219868}
+ {"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 193.9006688594818, "eval_loss": 2.5048790462315083, "generalization_gap": 0.24500996246933937, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.259869083762169, "train_loss_last": 2.2745652198791504, "val_eval_loss": 2.5048790462315083}
+ {"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.251, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.33616995811462, "eval_loss": 3.2862001582980156, "generalization_gap": 0.38156063109636307, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.9046395272016525, "train_loss_last": 2.9626998901367188, "val_eval_loss": 3.2862001582980156}
+ {"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.186, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.12936902046204, "eval_loss": 2.9062237925827503, "generalization_gap": 0.33334336057305336, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.572880432009697, "train_loss_last": 2.6301145553588867, "val_eval_loss": 2.9062237925827503}
+ {"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.105, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 194.08817791938782, "eval_loss": 2.645930740982294, "generalization_gap": 0.24876827374100685, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3971624672412872, "train_loss_last": 2.5148327350616455, "val_eval_loss": 2.645930740982294}
+ {"condition": "baseabc", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.251, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 193.8023762702942, "eval_loss": 2.548106499016285, "generalization_gap": 0.2380063161253929, "model_config": {"block_size": 128, "dropout": 0.251, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.310100182890892, "train_loss_last": 2.243898868560791, "val_eval_loss": 2.548106499016285}
+ {"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.01875686645508, "eval_loss": 3.2171844728291035, "generalization_gap": 0.6240800507366657, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.5931044220924377, "train_loss_last": 2.6104087829589844, "val_eval_loss": 3.2171844728291035}
+ {"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.42746496200562, "eval_loss": 2.920697819441557, "generalization_gap": 0.5618984661996365, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.3587993532419205, "train_loss_last": 2.512050151824951, "val_eval_loss": 2.920697819441557}
+ {"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 193.96866512298584, "eval_loss": 2.684825126081705, "generalization_gap": 0.35957426205277443, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3252508640289307, "train_loss_last": 2.418401002883911, "val_eval_loss": 2.684825126081705}
+ {"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 193.8896188735962, "eval_loss": 2.5098287016153336, "generalization_gap": 0.23803159594535828, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.2717971056699753, "train_loss_last": 2.3379616737365723, "val_eval_loss": 2.5098287016153336}
+ {"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.1848509311676, "eval_loss": 3.243676133453846, "generalization_gap": 0.6597021743655205, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.5839739590883255, "train_loss_last": 2.5418384075164795, "val_eval_loss": 3.243676133453846}
+ {"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.46870708465576, "eval_loss": 2.9034036584198475, "generalization_gap": 0.5591761879622936, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.344227470457554, "train_loss_last": 2.497213363647461, "val_eval_loss": 2.9034036584198475}
+ {"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.38420414924622, "eval_loss": 2.6587292850017548, "generalization_gap": 0.3501850962638855, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3085441887378693, "train_loss_last": 2.4595203399658203, "val_eval_loss": 2.6587292850017548}
+ {"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 194.19127774238586, "eval_loss": 2.5587818548083305, "generalization_gap": 0.230830118060112, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3279517367482185, "train_loss_last": 2.345273494720459, "val_eval_loss": 2.5587818548083305}
+ {"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 194.00335597991943, "eval_loss": 3.2164776138961315, "generalization_gap": 0.5361459217965603, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.6803316920995712, "train_loss_last": 2.7694272994995117, "val_eval_loss": 3.2164776138961315}
+ {"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 194.01300406455994, "eval_loss": 2.9069128818809986, "generalization_gap": 0.46784964576363564, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.439063236117363, "train_loss_last": 2.5384514331817627, "val_eval_loss": 2.9069128818809986}
+ {"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 194.0231637954712, "eval_loss": 2.6897822842001915, "generalization_gap": 0.31032323837280273, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.3794590458273888, "train_loss_last": 2.393846035003662, "val_eval_loss": 2.6897822842001915}
+ {"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 194.00913906097412, "eval_loss": 2.516622833907604, "generalization_gap": 0.2051488682627678, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3114739656448364, "train_loss_last": 2.516744613647461, "val_eval_loss": 2.516622833907604}
+ {"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 193.97342491149902, "eval_loss": 3.238522443920374, "generalization_gap": 0.5409641526639462, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.6975582912564278, "train_loss_last": 2.941650629043579, "val_eval_loss": 3.238522443920374}
+ {"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 193.8040759563446, "eval_loss": 2.8931140787899494, "generalization_gap": 0.4656083397567272, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.427505739033222, "train_loss_last": 2.539917469024658, "val_eval_loss": 2.8931140787899494}
+ {"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 193.5177140235901, "eval_loss": 2.6504716500639915, "generalization_gap": 0.27343276143074036, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.377038888633251, "train_loss_last": 2.3899667263031006, "val_eval_loss": 2.6504716500639915}
+ {"condition": "static_dropout_0.12", "condition_kind": "static", "dropout_active_final": 0.12, "dropout_final": 0.12, "dropout_initial": 0.12, "dropout_schedule": "constant", "elapsed_sec": 193.79171109199524, "eval_loss": 2.5595361217856407, "generalization_gap": 0.19634868949651718, "model_config": {"block_size": 128, "dropout": 0.12, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3631874322891235, "train_loss_last": 2.3815793991088867, "val_eval_loss": 2.5595361217856407}
+ {"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.09354996681213, "eval_loss": 3.234104972332716, "generalization_gap": 0.44031351432204247, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7937914580106735, "train_loss_last": 2.9370627403259277, "val_eval_loss": 3.234104972332716}
+ {"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.1311731338501, "eval_loss": 2.9113698303699493, "generalization_gap": 0.3634902611374855, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.547879569232464, "train_loss_last": 2.7211272716522217, "val_eval_loss": 2.9113698303699493}
+ {"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.0225739479065, "eval_loss": 2.6890049539506435, "generalization_gap": 0.24288412556052208, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.4461208283901215, "train_loss_last": 2.6030375957489014, "val_eval_loss": 2.6890049539506435}
+ {"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 193.94474577903748, "eval_loss": 2.5342570766806602, "generalization_gap": 0.18950188159942627, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.344755195081234, "train_loss_last": 2.4995384216308594, "val_eval_loss": 2.5342570766806602}
+ {"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.22309184074402, "eval_loss": 3.260387759655714, "generalization_gap": 0.4692406989634037, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7911470606923103, "train_loss_last": 2.9028453826904297, "val_eval_loss": 3.260387759655714}
+ {"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.1104338169098, "eval_loss": 2.9066071063280106, "generalization_gap": 0.37287352979183197, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.5337335765361786, "train_loss_last": 2.83718204498291, "val_eval_loss": 2.9066071063280106}
+ {"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 194.03589725494385, "eval_loss": 2.658251740038395, "generalization_gap": 0.2214929312467575, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.4367588087916374, "train_loss_last": 2.6042351722717285, "val_eval_loss": 2.658251740038395}
+ {"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 193.68132185935974, "eval_loss": 2.580581970512867, "generalization_gap": 0.1757577732205391, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.404824197292328, "train_loss_last": 2.4918441772460938, "val_eval_loss": 2.580581970512867}
+ {"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.16230079361664454, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 201.204332113266, "eval_loss": 3.2153887301683426, "generalization_gap": 0.43687721341848373, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.778511516749859, "train_loss_last": 2.8499650955200195, "val_eval_loss": 3.2153887301683426}
+ {"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.11452606249945704, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 204.5595259666443, "eval_loss": 2.893922034651041, "generalization_gap": 0.4342481978237629, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.459673836827278, "train_loss_last": 2.584320068359375, "val_eval_loss": 2.893922034651041}
+ {"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.06673830013226953, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 201.91624069213867, "eval_loss": 2.6745377629995346, "generalization_gap": 0.32962220162153244, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.344915561378002, "train_loss_last": 2.395663022994995, "val_eval_loss": 2.6745377629995346}
+ {"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.045000006515082035, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 195.82384395599365, "eval_loss": 2.495888389647007, "generalization_gap": 0.23202652484178543, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 4, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.2638618648052216, "train_loss_last": 2.146873712539673, "val_eval_loss": 2.495888389647007}
+ {"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.16230079361664454, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 195.43609189987183, "eval_loss": 3.2477101795375347, "generalization_gap": 0.48043813183903694, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 0, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 2.7672720476984978, "train_loss_last": 2.8498988151550293, "val_eval_loss": 3.2477101795375347}
+ {"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.11452606249945704, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 196.09808206558228, "eval_loss": 2.89107983186841, "generalization_gap": 0.4366713650524616, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 1, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_eval_loss": 2.4544084668159485, "train_loss_last": 2.5650553703308105, "val_eval_loss": 2.89107983186841}
+ {"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.06673830013226953, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 196.0459849834442, "eval_loss": 2.6349585987627506, "generalization_gap": 0.31589408591389656, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 2, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_eval_loss": 2.319064512848854, "train_loss_last": 2.319474220275879, "val_eval_loss": 2.6349585987627506}
+ {"condition": "smooth_low", "condition_kind": "decay", "dropout_active_final": 0.045000006515082035, "dropout_final": 0.045, "dropout_initial": 0.184, "dropout_schedule": "smoothstep", "elapsed_sec": 196.2467851638794, "eval_loss": 2.5428482554852962, "generalization_gap": 0.23579202219843864, "model_config": {"block_size": 128, "dropout": 0.184, "n_embd": 320, "n_head": 8, "n_layer": 12, "vocab_size": 4096}, "model_name": "L12_H8_D320", "n_embd": 320, "n_head": 8, "n_layer": 12, "parameters": 17367040, "run_mode": "locked_stream", "seed": 5, "stage": 3, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_eval_loss": 2.3070562332868576, "train_loss_last": 2.3239519596099854, "val_eval_loss": 2.5428482554852962}
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/summary.csv ADDED
@@ -0,0 +1,25 @@
+ run_mode,condition,condition_kind,stage,token_limit,model_name,n_layer,n_head,n_embd,parameters,dropout_initial,dropout_final,dropout_schedule,n,mean_train_eval_loss,std_train_eval_loss,mean_val_eval_loss,std_val_eval_loss,mean_generalization_gap,std_generalization_gap
+ locked_stream,baseabc,anchor_decay,0,500000,L12_H8_D320,12,8,320,17367040,0.251,0.02,log_prefix_anchor,2,2.900386370718479,0.006014871581398837,3.2830138951539993,0.004506056551557341,0.38262752443552017,0.0015088150298414955
+ locked_stream,interaction,anchor_decay,0,500000,L12_H8_D320,12,8,320,17367040,0.184,0.045,log_prefix_anchor,2,2.793299552053213,0.006629082873104159,3.236424196511507,0.022582144454879362,0.4431246444582939,0.02921122732798352
+ locked_stream,smooth_low,decay,0,500000,L12_H8_D320,12,8,320,17367040,0.184,0.045,smoothstep,2,2.7728917822241783,0.007947504783153755,3.2315494548529387,0.022854716026733408,0.45865767262876034,0.030802220809887162
+ locked_stream,static_dropout_0.08,static,0,500000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,2,2.5885391905903816,0.0064562123055806634,3.2304303031414747,0.01873243287264808,0.6418911125510931,0.02518864517822874
+ locked_stream,static_dropout_0.12,static,0,500000,L12_H8_D320,12,8,320,17367040,0.12,0.12,constant,2,2.6889449916779995,0.012181045080595719,3.2275000289082527,0.015588048800246605,0.5385550372302532,0.0034070037196508854
+ locked_stream,static_dropout_0.18,static,0,500000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,2,2.792469259351492,0.001869871275966133,3.247246365994215,0.018584737144575744,0.4547771066427231,0.020454608420541878
+ locked_stream,baseabc,anchor_decay,1,1000000,L12_H8_D320,12,8,320,17367040,0.251,0.02,log_prefix_anchor,2,2.5634994208812714,0.013266753166592413,2.9071404077112675,0.0012962895462253123,0.3436409868299961,0.014563042712817725
+ locked_stream,interaction,anchor_decay,1,1000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,log_prefix_anchor,2,2.47415716573596,0.004123642389949808,2.891854679211974,0.004138881109864529,0.41769751347601414,1.5238719914720123e-05
+ locked_stream,smooth_low,decay,1,1000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,smoothstep,2,2.4570411518216133,0.003723178840467485,2.8925009332597256,0.0020097408611055986,0.43545978143811226,0.001713437979361886
+ locked_stream,static_dropout_0.08,static,1,1000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,2,2.351513411849737,0.010303877131481138,2.912050738930702,0.012228818533382818,0.560537327080965,0.00192494140190168
+ locked_stream,static_dropout_0.12,static,1,1000000,L12_H8_D320,12,8,320,17367040,0.12,0.12,constant,2,2.4332844875752926,0.008172384561739694,2.900013480335474,0.009757227237938778,0.4667289927601814,0.0015848426761990845
+ locked_stream,static_dropout_0.18,static,1,1000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,2,2.540806572884321,0.010002727362158672,2.90898846834898,0.0033677544669751154,0.36818189546465874,0.006634972895183557
+ locked_stream,baseabc,anchor_decay,2,2000000,L12_H8_D320,12,8,320,17367040,0.251,0.02,log_prefix_anchor,2,2.4016329646110535,0.006322238010876659,2.6630144231021404,0.024159974949157448,0.26138145849108696,0.017837736938280786
+ locked_stream,interaction,anchor_decay,2,2000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,log_prefix_anchor,2,2.3349107690155506,0.0190922010057151,2.653404274955392,0.0252947814794913,0.31849350593984127,0.006202580473776199
+ locked_stream,smooth_low,decay,2,2000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,smoothstep,2,2.331990037113428,0.018279451715743147,2.6547481808811426,0.027986695425526037,0.3227581437677145,0.00970724370978289
+ locked_stream,static_dropout_0.08,static,2,2000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,2,2.3168975263834,0.011813403389391254,2.67177720554173,0.01845254618839936,0.35487967915832996,0.006639142799008103
+ locked_stream,static_dropout_0.12,static,2,2000000,L12_H8_D320,12,8,320,17367040,0.12,0.12,constant,2,2.37824896723032,0.0017113095635120858,2.6701269671320915,0.027796815970450365,0.29187799990177155,0.02608550640693828
+ locked_stream,static_dropout_0.18,static,2,2000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,2,2.4414398185908794,0.0066199475436894235,2.6736283469945192,0.021745806100631468,0.2321885284036398,0.015125858556942045
+ locked_stream,baseabc,anchor_decay,3,4000000,L12_H8_D320,12,8,320,17367040,0.251,0.02,log_prefix_anchor,2,2.2849846333265305,0.03551875082037381,2.5264927726238966,0.030566424997536902,0.24150813929736614,0.004952325822836911
+ locked_stream,interaction,anchor_decay,3,4000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,log_prefix_anchor,2,2.2831226587295532,0.035733004324778946,2.518963271752,0.03637492980355821,0.23584061302244663,0.0006419254787792673
+ locked_stream,smooth_low,decay,3,4000000,L12_H8_D320,12,8,320,17367040,0.184,0.045,smoothstep,2,2.2854590490460396,0.030543030862435327,2.5193683225661516,0.033205639577864834,0.23390927352011204,0.0026626087154295068
+ locked_stream,static_dropout_0.08,static,3,4000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,2,2.299874421209097,0.039707320430454655,2.534305278211832,0.03461510658323205,0.23443085700273514,0.0050922138472226
+ locked_stream,static_dropout_0.12,static,3,4000000,L12_H8_D320,12,8,320,17367040,0.12,0.12,constant,2,2.33733069896698,0.03656694294283975,2.5380794778466225,0.030344276861570076,0.2007487788796425,0.006222666081269672
+ locked_stream,static_dropout_0.18,static,3,4000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,2,2.374789696186781,0.042475198802574214,2.5574195235967636,0.03275664656650025,0.18262982740998268,0.00971855223607397
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/summary.json ADDED
@@ -0,0 +1,530 @@
1
+ [
2
+ {
3
+ "run_mode": "locked_stream",
4
+ "condition": "baseabc",
5
+ "condition_kind": "anchor_decay",
6
+ "stage": 0,
7
+ "token_limit": 500000,
8
+ "model_name": "L12_H8_D320",
9
+ "n_layer": 12,
10
+ "n_head": 8,
11
+ "n_embd": 320,
12
+ "parameters": 17367040,
13
+ "dropout_initial": 0.251,
14
+ "dropout_final": 0.02,
15
+ "dropout_schedule": "log_prefix_anchor",
16
+ "n": 2,
17
+ "mean_train_eval_loss": 2.900386370718479,
18
+ "std_train_eval_loss": 0.006014871581398837,
19
+ "mean_val_eval_loss": 3.2830138951539993,
20
+ "std_val_eval_loss": 0.004506056551557341,
21
+ "mean_generalization_gap": 0.38262752443552017,
22
+ "std_generalization_gap": 0.0015088150298414955
23
+ },
24
+ {
25
+ "run_mode": "locked_stream",
26
+ "condition": "interaction",
27
+ "condition_kind": "anchor_decay",
28
+ "stage": 0,
29
+ "token_limit": 500000,
30
+ "model_name": "L12_H8_D320",
31
+ "n_layer": 12,
32
+ "n_head": 8,
33
+ "n_embd": 320,
34
+ "parameters": 17367040,
35
+ "dropout_initial": 0.184,
36
+ "dropout_final": 0.045,
37
+ "dropout_schedule": "log_prefix_anchor",
38
+ "n": 2,
39
+ "mean_train_eval_loss": 2.793299552053213,
40
+ "std_train_eval_loss": 0.006629082873104159,
41
+ "mean_val_eval_loss": 3.236424196511507,
42
+ "std_val_eval_loss": 0.022582144454879362,
43
+ "mean_generalization_gap": 0.4431246444582939,
44
+ "std_generalization_gap": 0.02921122732798352
45
+ },
46
+ {
47
+ "run_mode": "locked_stream",
48
+ "condition": "smooth_low",
49
+ "condition_kind": "decay",
50
+ "stage": 0,
51
+ "token_limit": 500000,
52
+ "model_name": "L12_H8_D320",
53
+ "n_layer": 12,
54
+ "n_head": 8,
55
+ "n_embd": 320,
56
+ "parameters": 17367040,
57
+ "dropout_initial": 0.184,
58
+ "dropout_final": 0.045,
59
+ "dropout_schedule": "smoothstep",
60
+ "n": 2,
61
+ "mean_train_eval_loss": 2.7728917822241783,
62
+ "std_train_eval_loss": 0.007947504783153755,
63
+ "mean_val_eval_loss": 3.2315494548529387,
64
+ "std_val_eval_loss": 0.022854716026733408,
65
+ "mean_generalization_gap": 0.45865767262876034,
66
+ "std_generalization_gap": 0.030802220809887162
67
+ },
68
+ {
69
+ "run_mode": "locked_stream",
70
+ "condition": "static_dropout_0.08",
71
+ "condition_kind": "static",
72
+ "stage": 0,
73
+ "token_limit": 500000,
74
+ "model_name": "L12_H8_D320",
75
+ "n_layer": 12,
76
+ "n_head": 8,
77
+ "n_embd": 320,
78
+ "parameters": 17367040,
79
+ "dropout_initial": 0.08,
80
+ "dropout_final": 0.08,
81
+ "dropout_schedule": "constant",
82
+ "n": 2,
83
+ "mean_train_eval_loss": 2.5885391905903816,
84
+ "std_train_eval_loss": 0.0064562123055806634,
85
+ "mean_val_eval_loss": 3.2304303031414747,
86
+ "std_val_eval_loss": 0.01873243287264808,
87
+ "mean_generalization_gap": 0.6418911125510931,
88
+ "std_generalization_gap": 0.02518864517822874
89
+ },
90
+ {
91
+ "run_mode": "locked_stream",
92
+ "condition": "static_dropout_0.12",
93
+ "condition_kind": "static",
94
+ "stage": 0,
95
+ "token_limit": 500000,
96
+ "model_name": "L12_H8_D320",
97
+ "n_layer": 12,
98
+ "n_head": 8,
99
+ "n_embd": 320,
100
+ "parameters": 17367040,
101
+ "dropout_initial": 0.12,
102
+ "dropout_final": 0.12,
103
+ "dropout_schedule": "constant",
104
+ "n": 2,
105
+ "mean_train_eval_loss": 2.6889449916779995,
106
+ "std_train_eval_loss": 0.012181045080595719,
107
+ "mean_val_eval_loss": 3.2275000289082527,
108
+ "std_val_eval_loss": 0.015588048800246605,
109
+ "mean_generalization_gap": 0.5385550372302532,
110
+ "std_generalization_gap": 0.0034070037196508854
111
+ },
112
+ {
113
+ "run_mode": "locked_stream",
114
+ "condition": "static_dropout_0.18",
115
+ "condition_kind": "static",
116
+ "stage": 0,
117
+ "token_limit": 500000,
118
+ "model_name": "L12_H8_D320",
119
+ "n_layer": 12,
120
+ "n_head": 8,
121
+ "n_embd": 320,
122
+ "parameters": 17367040,
123
+ "dropout_initial": 0.18,
124
+ "dropout_final": 0.18,
125
+ "dropout_schedule": "constant",
126
+ "n": 2,
127
+ "mean_train_eval_loss": 2.792469259351492,
128
+ "std_train_eval_loss": 0.001869871275966133,
129
+ "mean_val_eval_loss": 3.247246365994215,
130
+ "std_val_eval_loss": 0.018584737144575744,
131
+ "mean_generalization_gap": 0.4547771066427231,
132
+ "std_generalization_gap": 0.020454608420541878
133
+ },
134
+ {
135
+ "run_mode": "locked_stream",
136
+ "condition": "baseabc",
137
+ "condition_kind": "anchor_decay",
138
+ "stage": 1,
139
+ "token_limit": 1000000,
140
+ "model_name": "L12_H8_D320",
141
+ "n_layer": 12,
142
+ "n_head": 8,
143
+ "n_embd": 320,
144
+ "parameters": 17367040,
145
+ "dropout_initial": 0.251,
146
+ "dropout_final": 0.02,
147
+ "dropout_schedule": "log_prefix_anchor",
148
+ "n": 2,
149
+ "mean_train_eval_loss": 2.5634994208812714,
150
+ "std_train_eval_loss": 0.013266753166592413,
151
+ "mean_val_eval_loss": 2.9071404077112675,
152
+ "std_val_eval_loss": 0.0012962895462253123,
153
+ "mean_generalization_gap": 0.3436409868299961,
154
+ "std_generalization_gap": 0.014563042712817725
155
+ },
156
+ {
157
+ "run_mode": "locked_stream",
158
+ "condition": "interaction",
159
+ "condition_kind": "anchor_decay",
160
+ "stage": 1,
161
+ "token_limit": 1000000,
162
+ "model_name": "L12_H8_D320",
163
+ "n_layer": 12,
164
+ "n_head": 8,
165
+ "n_embd": 320,
166
+ "parameters": 17367040,
167
+ "dropout_initial": 0.184,
168
+ "dropout_final": 0.045,
169
+ "dropout_schedule": "log_prefix_anchor",
170
+ "n": 2,
171
+ "mean_train_eval_loss": 2.47415716573596,
172
+ "std_train_eval_loss": 0.004123642389949808,
173
+ "mean_val_eval_loss": 2.891854679211974,
174
+ "std_val_eval_loss": 0.004138881109864529,
175
+ "mean_generalization_gap": 0.41769751347601414,
176
+ "std_generalization_gap": 1.5238719914720123e-05
177
+ },
178
+ {
179
+ "run_mode": "locked_stream",
180
+ "condition": "smooth_low",
181
+ "condition_kind": "decay",
182
+ "stage": 1,
183
+ "token_limit": 1000000,
184
+ "model_name": "L12_H8_D320",
185
+ "n_layer": 12,
186
+ "n_head": 8,
187
+ "n_embd": 320,
188
+ "parameters": 17367040,
189
+ "dropout_initial": 0.184,
190
+ "dropout_final": 0.045,
191
+ "dropout_schedule": "smoothstep",
192
+ "n": 2,
193
+ "mean_train_eval_loss": 2.4570411518216133,
194
+ "std_train_eval_loss": 0.003723178840467485,
195
+ "mean_val_eval_loss": 2.8925009332597256,
196
+ "std_val_eval_loss": 0.0020097408611055986,
197
+ "mean_generalization_gap": 0.43545978143811226,
198
+ "std_generalization_gap": 0.001713437979361886
199
+ },
200
+ {
201
+ "run_mode": "locked_stream",
202
+ "condition": "static_dropout_0.08",
203
+ "condition_kind": "static",
204
+ "stage": 1,
205
+ "token_limit": 1000000,
206
+ "model_name": "L12_H8_D320",
207
+ "n_layer": 12,
208
+ "n_head": 8,
209
+ "n_embd": 320,
210
+ "parameters": 17367040,
211
+ "dropout_initial": 0.08,
212
+ "dropout_final": 0.08,
213
+ "dropout_schedule": "constant",
214
+ "n": 2,
215
+ "mean_train_eval_loss": 2.351513411849737,
216
+ "std_train_eval_loss": 0.010303877131481138,
217
+ "mean_val_eval_loss": 2.912050738930702,
218
+ "std_val_eval_loss": 0.012228818533382818,
219
+ "mean_generalization_gap": 0.560537327080965,
220
+ "std_generalization_gap": 0.00192494140190168
221
+ },
222
+ {
223
+ "run_mode": "locked_stream",
224
+ "condition": "static_dropout_0.12",
225
+ "condition_kind": "static",
226
+ "stage": 1,
227
+ "token_limit": 1000000,
228
+ "model_name": "L12_H8_D320",
229
+ "n_layer": 12,
230
+ "n_head": 8,
231
+ "n_embd": 320,
232
+ "parameters": 17367040,
233
+ "dropout_initial": 0.12,
234
+ "dropout_final": 0.12,
235
+ "dropout_schedule": "constant",
236
+ "n": 2,
237
+ "mean_train_eval_loss": 2.4332844875752926,
238
+ "std_train_eval_loss": 0.008172384561739694,
239
+ "mean_val_eval_loss": 2.900013480335474,
240
+ "std_val_eval_loss": 0.009757227237938778,
241
+ "mean_generalization_gap": 0.4667289927601814,
242
+ "std_generalization_gap": 0.0015848426761990845
243
+ },
244
+ {
245
+ "run_mode": "locked_stream",
246
+ "condition": "static_dropout_0.18",
247
+ "condition_kind": "static",
248
+ "stage": 1,
249
+ "token_limit": 1000000,
250
+ "model_name": "L12_H8_D320",
251
+ "n_layer": 12,
252
+ "n_head": 8,
253
+ "n_embd": 320,
254
+ "parameters": 17367040,
255
+ "dropout_initial": 0.18,
256
+ "dropout_final": 0.18,
257
+ "dropout_schedule": "constant",
258
+ "n": 2,
259
+ "mean_train_eval_loss": 2.540806572884321,
260
+ "std_train_eval_loss": 0.010002727362158672,
261
+ "mean_val_eval_loss": 2.90898846834898,
262
+ "std_val_eval_loss": 0.0033677544669751154,
263
+ "mean_generalization_gap": 0.36818189546465874,
264
+ "std_generalization_gap": 0.006634972895183557
265
+ },
266
+ {
267
+ "run_mode": "locked_stream",
268
+ "condition": "baseabc",
269
+ "condition_kind": "anchor_decay",
270
+ "stage": 2,
271
+ "token_limit": 2000000,
272
+ "model_name": "L12_H8_D320",
273
+ "n_layer": 12,
274
+ "n_head": 8,
275
+ "n_embd": 320,
276
+ "parameters": 17367040,
277
+ "dropout_initial": 0.251,
278
+ "dropout_final": 0.02,
279
+ "dropout_schedule": "log_prefix_anchor",
280
+ "n": 2,
281
+ "mean_train_eval_loss": 2.4016329646110535,
282
+ "std_train_eval_loss": 0.006322238010876659,
283
+ "mean_val_eval_loss": 2.6630144231021404,
284
+ "std_val_eval_loss": 0.024159974949157448,
285
+ "mean_generalization_gap": 0.26138145849108696,
286
+ "std_generalization_gap": 0.017837736938280786
287
+ },
288
+ {
289
+ "run_mode": "locked_stream",
290
+ "condition": "interaction",
291
+ "condition_kind": "anchor_decay",
292
+ "stage": 2,
293
+ "token_limit": 2000000,
294
+ "model_name": "L12_H8_D320",
295
+ "n_layer": 12,
296
+ "n_head": 8,
297
+ "n_embd": 320,
298
+ "parameters": 17367040,
299
+ "dropout_initial": 0.184,
300
+ "dropout_final": 0.045,
301
+ "dropout_schedule": "log_prefix_anchor",
302
+ "n": 2,
303
+ "mean_train_eval_loss": 2.3349107690155506,
304
+ "std_train_eval_loss": 0.0190922010057151,
305
+ "mean_val_eval_loss": 2.653404274955392,
306
+ "std_val_eval_loss": 0.0252947814794913,
307
+ "mean_generalization_gap": 0.31849350593984127,
308
+ "std_generalization_gap": 0.006202580473776199
309
+ },
310
+ {
311
+ "run_mode": "locked_stream",
312
+ "condition": "smooth_low",
313
+ "condition_kind": "decay",
314
+ "stage": 2,
315
+ "token_limit": 2000000,
316
+ "model_name": "L12_H8_D320",
317
+ "n_layer": 12,
318
+ "n_head": 8,
319
+ "n_embd": 320,
320
+ "parameters": 17367040,
321
+ "dropout_initial": 0.184,
322
+ "dropout_final": 0.045,
323
+ "dropout_schedule": "smoothstep",
324
+ "n": 2,
325
+ "mean_train_eval_loss": 2.331990037113428,
326
+ "std_train_eval_loss": 0.018279451715743147,
327
+ "mean_val_eval_loss": 2.6547481808811426,
328
+ "std_val_eval_loss": 0.027986695425526037,
329
+ "mean_generalization_gap": 0.3227581437677145,
330
+ "std_generalization_gap": 0.00970724370978289
331
+ },
332
+ {
333
+ "run_mode": "locked_stream",
334
+ "condition": "static_dropout_0.08",
335
+ "condition_kind": "static",
336
+ "stage": 2,
337
+ "token_limit": 2000000,
338
+ "model_name": "L12_H8_D320",
339
+ "n_layer": 12,
340
+ "n_head": 8,
341
+ "n_embd": 320,
342
+ "parameters": 17367040,
343
+ "dropout_initial": 0.08,
344
+ "dropout_final": 0.08,
345
+ "dropout_schedule": "constant",
346
+ "n": 2,
347
+ "mean_train_eval_loss": 2.3168975263834,
348
+ "std_train_eval_loss": 0.011813403389391254,
349
+ "mean_val_eval_loss": 2.67177720554173,
350
+ "std_val_eval_loss": 0.01845254618839936,
351
+ "mean_generalization_gap": 0.35487967915832996,
352
+ "std_generalization_gap": 0.006639142799008103
353
+ },
354
+ {
355
+ "run_mode": "locked_stream",
356
+ "condition": "static_dropout_0.12",
357
+ "condition_kind": "static",
358
+ "stage": 2,
359
+ "token_limit": 2000000,
360
+ "model_name": "L12_H8_D320",
361
+ "n_layer": 12,
362
+ "n_head": 8,
363
+ "n_embd": 320,
364
+ "parameters": 17367040,
365
+ "dropout_initial": 0.12,
366
+ "dropout_final": 0.12,
367
+ "dropout_schedule": "constant",
368
+ "n": 2,
369
+ "mean_train_eval_loss": 2.37824896723032,
370
+ "std_train_eval_loss": 0.0017113095635120858,
371
+ "mean_val_eval_loss": 2.6701269671320915,
372
+ "std_val_eval_loss": 0.027796815970450365,
373
+ "mean_generalization_gap": 0.29187799990177155,
374
+ "std_generalization_gap": 0.02608550640693828
375
+ },
376
+ {
377
+ "run_mode": "locked_stream",
378
+ "condition": "static_dropout_0.18",
379
+ "condition_kind": "static",
380
+ "stage": 2,
381
+ "token_limit": 2000000,
382
+ "model_name": "L12_H8_D320",
383
+ "n_layer": 12,
384
+ "n_head": 8,
385
+ "n_embd": 320,
386
+ "parameters": 17367040,
387
+ "dropout_initial": 0.18,
388
+ "dropout_final": 0.18,
389
+ "dropout_schedule": "constant",
390
+ "n": 2,
391
+ "mean_train_eval_loss": 2.4414398185908794,
392
+ "std_train_eval_loss": 0.0066199475436894235,
393
+ "mean_val_eval_loss": 2.6736283469945192,
394
+ "std_val_eval_loss": 0.021745806100631468,
395
+ "mean_generalization_gap": 0.2321885284036398,
396
+ "std_generalization_gap": 0.015125858556942045
397
+ },
398
+ {
399
+ "run_mode": "locked_stream",
400
+ "condition": "baseabc",
401
+ "condition_kind": "anchor_decay",
402
+ "stage": 3,
403
+ "token_limit": 4000000,
404
+ "model_name": "L12_H8_D320",
405
+ "n_layer": 12,
406
+ "n_head": 8,
407
+ "n_embd": 320,
408
+ "parameters": 17367040,
409
+ "dropout_initial": 0.251,
410
+ "dropout_final": 0.02,
411
+ "dropout_schedule": "log_prefix_anchor",
412
+ "n": 2,
413
+ "mean_train_eval_loss": 2.2849846333265305,
414
+ "std_train_eval_loss": 0.03551875082037381,
415
+ "mean_val_eval_loss": 2.5264927726238966,
416
+ "std_val_eval_loss": 0.030566424997536902,
417
+ "mean_generalization_gap": 0.24150813929736614,
418
+ "std_generalization_gap": 0.004952325822836911
419
+ },
420
+ {
421
+ "run_mode": "locked_stream",
422
+ "condition": "interaction",
423
+ "condition_kind": "anchor_decay",
424
+ "stage": 3,
425
+ "token_limit": 4000000,
426
+ "model_name": "L12_H8_D320",
427
+ "n_layer": 12,
428
+ "n_head": 8,
429
+ "n_embd": 320,
430
+ "parameters": 17367040,
431
+ "dropout_initial": 0.184,
432
+ "dropout_final": 0.045,
433
+ "dropout_schedule": "log_prefix_anchor",
434
+ "n": 2,
435
+ "mean_train_eval_loss": 2.2831226587295532,
436
+ "std_train_eval_loss": 0.035733004324778946,
437
+ "mean_val_eval_loss": 2.518963271752,
438
+ "std_val_eval_loss": 0.03637492980355821,
439
+ "mean_generalization_gap": 0.23584061302244663,
440
+ "std_generalization_gap": 0.0006419254787792673
441
+ },
442
+ {
443
+ "run_mode": "locked_stream",
444
+ "condition": "smooth_low",
445
+ "condition_kind": "decay",
446
+ "stage": 3,
447
+ "token_limit": 4000000,
448
+ "model_name": "L12_H8_D320",
449
+ "n_layer": 12,
450
+ "n_head": 8,
451
+ "n_embd": 320,
452
+ "parameters": 17367040,
453
+ "dropout_initial": 0.184,
454
+ "dropout_final": 0.045,
455
+ "dropout_schedule": "smoothstep",
456
+ "n": 2,
457
+ "mean_train_eval_loss": 2.2854590490460396,
458
+ "std_train_eval_loss": 0.030543030862435327,
459
+ "mean_val_eval_loss": 2.5193683225661516,
460
+ "std_val_eval_loss": 0.033205639577864834,
461
+ "mean_generalization_gap": 0.23390927352011204,
462
+ "std_generalization_gap": 0.0026626087154295068
463
+ },
464
+ {
465
+ "run_mode": "locked_stream",
466
+ "condition": "static_dropout_0.08",
467
+ "condition_kind": "static",
468
+ "stage": 3,
469
+ "token_limit": 4000000,
470
+ "model_name": "L12_H8_D320",
471
+ "n_layer": 12,
472
+ "n_head": 8,
473
+ "n_embd": 320,
474
+ "parameters": 17367040,
475
+ "dropout_initial": 0.08,
476
+ "dropout_final": 0.08,
477
+ "dropout_schedule": "constant",
478
+ "n": 2,
479
+ "mean_train_eval_loss": 2.299874421209097,
480
+ "std_train_eval_loss": 0.039707320430454655,
481
+ "mean_val_eval_loss": 2.534305278211832,
482
+ "std_val_eval_loss": 0.03461510658323205,
483
+ "mean_generalization_gap": 0.23443085700273514,
484
+ "std_generalization_gap": 0.0050922138472226
485
+ },
486
+ {
487
+ "run_mode": "locked_stream",
488
+ "condition": "static_dropout_0.12",
489
+ "condition_kind": "static",
490
+ "stage": 3,
491
+ "token_limit": 4000000,
492
+ "model_name": "L12_H8_D320",
493
+ "n_layer": 12,
494
+ "n_head": 8,
495
+ "n_embd": 320,
496
+ "parameters": 17367040,
497
+ "dropout_initial": 0.12,
498
+ "dropout_final": 0.12,
499
+ "dropout_schedule": "constant",
500
+ "n": 2,
501
+ "mean_train_eval_loss": 2.33733069896698,
502
+ "std_train_eval_loss": 0.03656694294283975,
503
+ "mean_val_eval_loss": 2.5380794778466225,
504
+ "std_val_eval_loss": 0.030344276861570076,
505
+ "mean_generalization_gap": 0.2007487788796425,
506
+ "std_generalization_gap": 0.006222666081269672
507
+ },
508
+ {
509
+ "run_mode": "locked_stream",
510
+ "condition": "static_dropout_0.18",
511
+ "condition_kind": "static",
512
+ "stage": 3,
513
+ "token_limit": 4000000,
514
+ "model_name": "L12_H8_D320",
515
+ "n_layer": 12,
516
+ "n_head": 8,
517
+ "n_embd": 320,
518
+ "parameters": 17367040,
519
+ "dropout_initial": 0.18,
520
+ "dropout_final": 0.18,
521
+ "dropout_schedule": "constant",
522
+ "n": 2,
523
+ "mean_train_eval_loss": 2.374789696186781,
524
+ "std_train_eval_loss": 0.042475198802574214,
525
+ "mean_val_eval_loss": 2.5574195235967636,
526
+ "std_val_eval_loss": 0.03275664656650025,
527
+ "mean_generalization_gap": 0.18262982740998268,
528
+ "std_generalization_gap": 0.00971855223607397
529
+ }
530
+ ]
runs/streaming_tinystories_multiseed_validation_l12/locked_stream/20260530-141335/trace.jsonl ADDED
@@ -0,0 +1,96 @@
+ {"condition": "interaction", "dropout": 0.184, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.3873605728149414}
+ {"condition": "interaction", "dropout": 0.184, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.8813064098358154}
+ {"condition": "interaction", "dropout": 0.141, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.771984100341797}
+ {"condition": "interaction", "dropout": 0.141, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.6102116107940674}
+ {"condition": "interaction", "dropout": 0.084, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.602119207382202}
+ {"condition": "interaction", "dropout": 0.084, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.4124107360839844}
7
+ {"condition": "interaction", "dropout": 0.045, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.5032730102539062}
8
+ {"condition": "interaction", "dropout": 0.045, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.1464571952819824}
9
+ {"condition": "interaction", "dropout": 0.184, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.293152093887329}
10
+ {"condition": "interaction", "dropout": 0.184, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.8814051151275635}
11
+ {"condition": "interaction", "dropout": 0.141, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7541747093200684}
12
+ {"condition": "interaction", "dropout": 0.141, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.5935797691345215}
13
+ {"condition": "interaction", "dropout": 0.084, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.7131600379943848}
14
+ {"condition": "interaction", "dropout": 0.084, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.3252055644989014}
15
+ {"condition": "interaction", "dropout": 0.045, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.4249722957611084}
16
+ {"condition": "interaction", "dropout": 0.045, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.3208839893341064}
17
+ {"condition": "baseabc", "dropout": 0.251, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.5486912727355957}
18
+ {"condition": "baseabc", "dropout": 0.251, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.9102234840393066}
19
+ {"condition": "baseabc", "dropout": 0.186, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.9149229526519775}
20
+ {"condition": "baseabc", "dropout": 0.186, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.793113946914673}
21
+ {"condition": "baseabc", "dropout": 0.105, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.658621072769165}
22
+ {"condition": "baseabc", "dropout": 0.105, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.46707820892334}
23
+ {"condition": "baseabc", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.2726128101348877}
24
+ {"condition": "baseabc", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.2745652198791504}
25
+ {"condition": "baseabc", "dropout": 0.251, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.539700508117676}
26
+ {"condition": "baseabc", "dropout": 0.251, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.9626998901367188}
27
+ {"condition": "baseabc", "dropout": 0.186, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.8122684955596924}
28
+ {"condition": "baseabc", "dropout": 0.186, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.6301145553588867}
29
+ {"condition": "baseabc", "dropout": 0.105, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.6069304943084717}
30
+ {"condition": "baseabc", "dropout": 0.105, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.5148327350616455}
31
+ {"condition": "baseabc", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.388144016265869}
32
+ {"condition": "baseabc", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.243898868560791}
33
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.2069337368011475}
34
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.6104087829589844}
35
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.686776876449585}
36
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.512050151824951}
37
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.2968358993530273}
38
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.418401002883911}
39
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.4829370975494385}
40
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.3379616737365723}
41
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.4543862342834473}
42
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.5418384075164795}
43
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.6846256256103516}
44
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.497213363647461}
45
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.4699549674987793}
46
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.4595203399658203}
47
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.4750654697418213}
48
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.345273494720459}
49
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.1532158851623535}
50
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.7694272994995117}
51
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7790844440460205}
52
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.5384514331817627}
53
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.5747876167297363}
54
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.393846035003662}
55
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.5318603515625}
56
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.516744613647461}
57
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.2838807106018066}
58
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.941650629043579}
59
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.8269617557525635}
60
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.539917469024658}
61
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.6944527626037598}
62
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.3899667263031006}
63
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.3812689781188965}
64
+ {"condition": "static_dropout_0.12", "dropout": 0.12, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.3815793991088867}
65
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.5123531818389893}
66
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.9370627403259277}
67
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.8317360877990723}
68
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.7211272716522217}
69
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.5763471126556396}
70
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.6030375957489014}
71
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.490279197692871}
72
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.4995384216308594}
73
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.4424760341644287}
74
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.9028453826904297}
75
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7941365242004395}
76
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.83718204498291}
77
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.747528314590454}
78
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.6042351722717285}
79
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.5449204444885254}
80
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.4918441772460938}
81
+ {"condition": "smooth_low", "dropout": 0.17803874120648827, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.380012273788452}
82
+ {"condition": "smooth_low", "dropout": 0.16230079361664454, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.8499650955200195}
83
+ {"condition": "smooth_low", "dropout": 0.1400439632143008, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7695858478546143}
84
+ {"condition": "smooth_low", "dropout": 0.11452606249945704, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.584320068359375}
85
+ {"condition": "smooth_low", "dropout": 0.0890049039721133, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.604490041732788}
86
+ {"condition": "smooth_low", "dropout": 0.06673830013226953, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.395663022994995}
87
+ {"condition": "smooth_low", "dropout": 0.05098406347992578, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.5218143463134766}
88
+ {"condition": "smooth_low", "dropout": 0.045000006515082035, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 4, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.146873712539673}
89
+ {"condition": "smooth_low", "dropout": 0.17803874120648827, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 1000, "steps": 2000, "token_limit": 500000, "tokens_seen": 2048000, "train_batch_loss": 3.2876977920532227}
90
+ {"condition": "smooth_low", "dropout": 0.16230079361664454, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 0, "step": 2000, "steps": 2000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.8498988151550293}
91
+ {"condition": "smooth_low", "dropout": 0.1400439632143008, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 1000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 2.7525501251220703}
92
+ {"condition": "smooth_low", "dropout": 0.11452606249945704, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 1, "step": 2000, "steps": 2000, "token_limit": 1000000, "tokens_seen": 8192000, "train_batch_loss": 2.5650553703308105}
93
+ {"condition": "smooth_low", "dropout": 0.0890049039721133, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 1000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 10240000, "train_batch_loss": 2.7159547805786133}
94
+ {"condition": "smooth_low", "dropout": 0.06673830013226953, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 2, "step": 2000, "steps": 2000, "token_limit": 2000000, "tokens_seen": 12288000, "train_batch_loss": 2.319474220275879}
95
+ {"condition": "smooth_low", "dropout": 0.05098406347992578, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 1000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 14336000, "train_batch_loss": 2.4393768310546875}
96
+ {"condition": "smooth_low", "dropout": 0.045000006515082035, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 5, "stage": 3, "step": 2000, "steps": 2000, "token_limit": 4000000, "tokens_seen": 16384000, "train_batch_loss": 2.3239519596099854}
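The trace records above can be reduced to one number per run. The following is a minimal sketch, not part of the commit: `final_stage_losses` is a hypothetical helper, and the sample lines are trimmed copies of records from this trace (only the fields the helper reads are kept). Note that `train_batch_loss` is a single-batch training loss, so it is noisier than the validation losses in `metrics.jsonl` and should not be used for the final comparison.

```python
import json


def final_stage_losses(jsonl_lines):
    """Map (condition, seed) -> train_batch_loss at the latest (stage, step) seen."""
    latest = {}
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the trace
        record = json.loads(raw)
        if record.get("event") != "train_step":
            continue  # ignore any other event types
        key = (record["condition"], record["seed"])
        position = (record["stage"], record["step"])
        # Keep only the record with the largest (stage, step) per run.
        if key not in latest or position > latest[key][0]:
            latest[key] = (position, record["train_batch_loss"])
    return {key: loss for key, (_, loss) in latest.items()}


# Trimmed copies of three records from the trace above.
sample = [
    '{"condition": "interaction", "event": "train_step", "seed": 4, '
    '"stage": 3, "step": 1000, "train_batch_loss": 2.5032730102539062}',
    '{"condition": "interaction", "event": "train_step", "seed": 4, '
    '"stage": 3, "step": 2000, "train_batch_loss": 2.1464571952819824}',
    '{"condition": "static_dropout_0.08", "event": "train_step", "seed": 4, '
    '"stage": 3, "step": 2000, "train_batch_loss": 2.3379616737365723}',
]
losses = final_stage_losses(sample)
```

To run it on the real file, pass an open file handle for `trace.jsonl` instead of `sample`; the helper only iterates over lines.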
scripts/summarize_streaming_multiseed.py CHANGED
@@ -192,14 +192,41 @@ def write_report(
     paired_rows: list[dict],
     metrics_paths: list[Path],
 ) -> None:
+    seed_ids = sorted({int(row["seed"]) for row in paired_rows})
+    seed_count = len(seed_ids)
+    best_row = condition_rows[0]
+    static_rows = [row for row in condition_rows if row["condition"].startswith("static_")]
+    best_static_row = min(static_rows, key=lambda row: row["mean_final_val"])
+
+    paired_win_lines = []
+    for row in condition_rows:
+        condition = row["condition"]
+        if condition.startswith("static_"):
+            continue
+        condition_deltas = [
+            item["delta_vs_best_static"]
+            for item in paired_rows
+            if item["condition"] == condition
+        ]
+        wins = sum(delta < 0 for delta in condition_deltas)
+        ties = sum(delta == 0 for delta in condition_deltas)
+        worst_delta = max(condition_deltas)
+        paired_win_lines.append(
+            f"- `{condition}` beats the per-seed best static baseline in "
+            f"{wins}/{seed_count} seeds"
+            + (f" with {ties} exact ties" if ties else "")
+            + f"; worst paired delta is {worst_delta:+.4f}."
+        )
+
     lines = [
         "# TinyStories Multi-Seed Streaming Validation",
         "",
         "Date: 2026-05-30",
         "",
-        "This report combines the original seed-1 streaming run with the new seeds",
-        "2 and 3 run. No additional training is performed by this script; it reads",
-        "saved `metrics.jsonl` files.",
+        f"This report combines {seed_count} random seeds "
+        f"({', '.join(str(seed) for seed in seed_ids)}) from saved streaming runs.",
+        "No additional training is performed by this script; it reads saved",
+        "`metrics.jsonl` files.",
         "",
         "## Sources",
         "",
@@ -264,16 +291,22 @@ def write_report(
             "",
             "## Interpretation",
             "",
-            "- `interaction` has the best 3-seed mean final validation loss.",
-            "- `smooth_low` is very close, suggesting the exact anchor values may not be",
-            " uniquely required as long as the schedule follows the same pressure range.",
-            "- All decay schedules beat the best static baseline on final loss in the",
-            " paired seed comparisons, though seed 1 margins remain tiny.",
+            f"- `{best_row['condition']}` has the best {seed_count}-seed mean final "
+            f"validation loss: {fmt(best_row['mean_final_val'])} +/- "
+            f"{fmt(best_row['std_final_val'])}.",
+            f"- The best static baseline by mean final loss is "
+            f"`{best_static_row['condition']}` at "
+            f"{fmt(best_static_row['mean_final_val'])} +/- "
+            f"{fmt(best_static_row['std_final_val'])}.",
+            "- `smooth_low` is very close to `interaction`, suggesting the exact anchor",
+            " values may not be uniquely required as long as the schedule follows the",
+            " same pressure range.",
+            *paired_win_lines,
             "- Static `0.12` can win early stages, but holding it fixed loses at the",
             " final 4M stage.",
-            "- This supports the schedule claim more strongly than the single-seed run,",
-            " but seed count is still small. Expanding to 5 seeds is the paper-grade",
-            " next check if we want stronger statistical confidence.",
+            "- This is now the TinyStories paper-grade validation gate for this narrowed",
+            " setup: five seeds, paired seed comparisons, and static baselines selected",
+            " from the same stream protocol.",
         ]
     )
     path.write_text("\n".join(lines) + "\n", encoding="utf-8")
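
The paired win counting added in this change can be exercised in isolation. The following is a rough sketch under stated assumptions, not the script's actual code: `paired_win_summary` is a hypothetical helper, the `delta_vs_best_static` values are made up for illustration (they are not results from these runs), and a negative delta is assumed to mean the schedule's final validation loss is lower than that seed's best static baseline. Unlike the diff, the sketch returns `None` when a condition has no paired rows, since `max()` raises `ValueError` on an empty list.

```python
def paired_win_summary(paired_rows, condition):
    """Count per-seed wins and ties for one condition against the best static baseline."""
    deltas = [
        row["delta_vs_best_static"]
        for row in paired_rows
        if row["condition"] == condition
    ]
    if not deltas:
        return None  # no paired rows for this condition
    return {
        "wins": sum(delta < 0 for delta in deltas),   # strictly better than static
        "ties": sum(delta == 0 for delta in deltas),  # exact ties only
        "seeds": len(deltas),
        "worst_delta": max(deltas),                   # least favourable seed
    }


# Made-up deltas for illustration only.
example_rows = [
    {"condition": "interaction", "seed": 1, "delta_vs_best_static": -0.001},
    {"condition": "interaction", "seed": 2, "delta_vs_best_static": -0.015},
    {"condition": "interaction", "seed": 3, "delta_vs_best_static": 0.0},
    {"condition": "baseabc", "seed": 1, "delta_vs_best_static": 0.002},
]
summary = paired_win_summary(example_rows, "interaction")
report_line = (
    f"- `interaction` beats the per-seed best static baseline in "
    f"{summary['wins']}/{summary['seeds']} seeds"
    + (f" with {summary['ties']} exact ties" if summary["ties"] else "")
    + f"; worst paired delta is {summary['worst_delta']:+.4f}."
)
```

Reporting the worst paired delta alongside the win count keeps a marginal seed visible even when the mean favours the schedule.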