Mandeep Sidhu committed on
Commit
9cd900a
·
1 Parent(s): e39c73c

Add WikiText corpus holdout results

REPRODUCING.md CHANGED
@@ -58,6 +58,19 @@ validation tokens: 500,000
58
  vocab size: 4,096
59
  ```
60
 
61
  ## Smoke Test
62
 
63
  This verifies cached-data loading without running a Torch experiment:
@@ -215,6 +228,49 @@ best static final validation: 4.4946 +/- 0.0087
215
  best mean trajectory: static 0.18, 4.9064 vs formula 4.9073
216
  ```
217
 
218
  ## Notes for Publication
219
 
220
  - Do not claim the formula is universal.
 
58
  vocab size: 4,096
59
  ```
60
 
61
+ The public WikiText-103 holdout can be rebuilt from source:
62
+
63
+ ```bash
64
+ .venv/bin/python scripts/prepare_wikitext103.py
65
+ ```
66
+
67
+ The pinned parquet source is verified as:
68
+
69
+ ```text
70
+ bytes: 156,700,942
71
+ sha256: 75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd
72
+ ```
73
+
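The pinned size and digest above can be checked locally before running anything expensive. The following is a minimal sketch, not part of the repository: `file_digest` and `matches_pin` are hypothetical helper names, and which file the pin refers to is an assumption (the `--corpus` path used elsewhere in this file, `data/wikitext103_raw/train-00001-of-00002.parquet`, is the most likely candidate).

```python
import hashlib
from pathlib import Path

# Values copied from the "pinned parquet source" block above.
EXPECTED_BYTES = 156_700_942
EXPECTED_SHA256 = "75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd"


def file_digest(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for the file at `path`, read in chunks."""
    hasher = hashlib.sha256()
    size = 0
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size, hasher.hexdigest()


def matches_pin(path, expected_bytes=EXPECTED_BYTES, expected_sha256=EXPECTED_SHA256):
    """True only if both the byte count and the SHA-256 digest match the pin."""
    size, digest = file_digest(path)
    return size == expected_bytes and digest == expected_sha256
```

Calling `matches_pin(...)` on the downloaded parquet file should return `True` if the local copy is the pinned one. Checking the byte count as well as the digest makes a truncated download easy to tell apart from a different file.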
74
  ## Smoke Test
75
 
76
  This verifies cached-data loading without running a Torch experiment:
 
228
  best mean trajectory: static 0.18, 4.9064 vs formula 4.9073
229
  ```
230
 
231
+ ## Reproduce WikiText-103 Corpus Holdout
232
+
233
+ Prepare the public corpus first:
234
+
235
+ ```bash
236
+ .venv/bin/python scripts/prepare_wikitext103.py
237
+ ```
238
+
239
+ Then run the frozen L12 formula against a broad static grid:
240
+
241
+ ```bash
242
+ PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
243
+ --mode locked_stream \
244
+ --corpus data/wikitext103_raw/train-00001-of-00002.parquet \
245
+ --cache-dir .cache/dropout_decay_wikitext103 \
246
+ --output-dir runs/corpus_holdout_wikitext103_l12 \
247
+ --models L12_H8_D320=12x8x320 \
248
+ --seeds 1 2 3 \
249
+ --stream-token-caps 250000 500000 1000000 2000000 4000000 \
250
+ --dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
251
+ --anchor-decays formula_l12_wikitext103:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
252
+ --stage-steps 1000 \
253
+ --batch-size 16 \
254
+ --block-size 128 \
255
+ --eval-batches 64 \
256
+ --train-eval-batches 32 \
257
+ --trace-eval-batches 8 \
258
+ --log-every 500 \
259
+ --vocab-size 4096 \
260
+ --val-tokens 500000 \
261
+ --lr 0.0003 \
262
+ --weight-decay 0.1 \
263
+ --grad-clip 1.0
264
+ ```
265
+
266
+ Completed reference result:
267
+
268
+ ```text
269
+ formula final validation: 4.0836 +/- 0.0258
270
+ best static final validation: 4.1081 +/- 0.0258
271
+ best mean trajectory: formula, 4.5728 vs static 0.18, 4.5759
272
+ ```
273
+
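The reference numbers above can be cross-checked against the run's `summary.csv`. The following is a minimal sketch, not part of the repository: `final_stage_ranking` is a hypothetical helper, and it assumes the column names shown in the `summary.csv` header added in this commit (`condition`, `stage`, `mean_val_eval_loss`, `std_val_eval_loss`).

```python
import csv
import io


def final_stage_ranking(csv_text):
    """Rank conditions by mean validation loss at the last stage in the CSV.

    Returns a list of (condition, mean_val_loss, std_val_loss), best first.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    last_stage = max(int(row["stage"]) for row in rows)
    final_rows = [row for row in rows if int(row["stage"]) == last_stage]
    final_rows.sort(key=lambda row: float(row["mean_val_eval_loss"]))
    return [
        (
            row["condition"],
            float(row["mean_val_eval_loss"]),
            float(row["std_val_eval_loss"]),
        )
        for row in final_rows
    ]
```

Passing the contents of `summary.csv` as a string should put the formula condition first and the best static rate second, which are the two rows quoted in the reference result.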
274
  ## Notes for Publication
275
 
276
  - Do not claim the formula is universal.
docs/dropout_decay_research_report_v2.md CHANGED
@@ -42,12 +42,13 @@ p = clamp(0.02, 0.65,
42
  ```
43
 
44
  Across the completed headline validation runs, the formula schedule wins
45
- `21/21` paired final-loss comparisons across five model sizes and two
46
- architecture-shape holdouts. The evidence supports final-validation improvement
47
- under this nanochat-style Transformer and expanding-prefix protocol. It does not
48
- yet establish a universal dropout law across datasets, architectures, or
49
- training scales, and the width-heavy holdout shows that the current formula can
50
- overestimate the best early-prefix dropout for some architecture shapes.
 
51
 
52
  ## System Under Test
53
 
@@ -413,7 +414,8 @@ Completed headline evidence:
413
  | Model-size validation | `15/15` paired final-loss wins |
414
  | Deep/narrow architecture holdout | `3/3` paired final-loss wins |
415
  | Width-heavy architecture holdout | `3/3` paired final-loss wins |
416
- | Combined paired final-loss comparisons | `21/21` wins |
 
417
  | Update-pressure direction | Supported on L12 |
418
  | Sampled-pressure coefficient | Supported on L12 |
419
  | High arbitrary initial dropout | Rejected |
@@ -448,6 +450,7 @@ below expose more of the completed run surface.
448
  | `l12_sample_pressure_ablation_053842` | L12 sampled-pressure coefficient ablation | 3 |
449
  | `deep_narrow_h8_112117` | Deep/narrow architecture-shape holdout | 3 |
450
  | `wide_h8_151721` | Width-heavy architecture-shape holdout | 3 |
 
451
 
452
  ### Static Screen Optima
453
 
@@ -637,6 +640,43 @@ loss because it decayed to low dropout at the largest prefixes. This suggests
637
  that final-loss transfer is real, but an architecture-shape term may be needed
638
  to avoid overestimating early dropout for wide models.
639
 
640
  ## Interpretation
641
 
642
  The most plausible mechanism is pressure tracking:
@@ -653,10 +693,11 @@ The most plausible mechanism is pressure tracking:
653
 
654
  This interpretation is consistent with the static screens, the model-size
655
  interpolation results, the update-pressure sweep, the sampled-pressure
656
- coefficient ablation, and the two architecture-shape holdouts. The width-heavy
657
- holdout adds an important refinement: parameter count alone does not fully
658
- describe architecture capacity, because the formula's early dropout was higher
659
- than the measured early-prefix static optimum for that shape.
 
660
 
661
  ## What This Report Does Not Prove
662
 
@@ -685,7 +726,7 @@ The strongest safe paper claim is:
685
  > In nanochat-style causal Transformers trained under an expanding-prefix
686
  > streaming protocol, a pressure-aware dropout schedule improves final
687
  > validation loss over fixed-dropout baselines across model sizes, update
688
- > pressures, and two architecture-shape holdouts.
689
 
690
  Claims to avoid:
691
 
@@ -698,11 +739,13 @@ Claims to avoid:
698
 
699
  The next experiments that would most strengthen a paper are:
700
 
701
- 1. Corpus/domain holdout: freeze the formula and run on a different text
702
- distribution. This is the largest missing generalization test.
703
- 2. Architecture-shape refinement: add a small feature such as depth/width ratio
704
- or embedding dimension to reduce early-dropout overestimation on wide models,
705
- then validate it on held-out shapes.
 
 
706
  3. L8 and L16 sampled-pressure ablations: repeat the `0x`, `0.5x`, `1.0x`, and
707
  `1.5x` coefficient ablation outside L12.
708
  4. Oracle schedule comparison: compare the formula against a stage-wise oracle
 
42
  ```
43
 
44
  Across the completed headline validation runs, the formula schedule wins
45
+ `24/24` paired final-loss comparisons across five model sizes, two
46
+ architecture-shape holdouts, and one public-corpus holdout. The evidence
47
+ supports final-validation improvement under this nanochat-style Transformer and
48
+ expanding-prefix protocol. It does not yet establish a universal dropout law
49
+ across datasets, architectures, or training scales. The width-heavy and
50
+ WikiText-103 holdouts both show that the current formula can overestimate the
51
+ best early-prefix dropout even when it wins final loss.
52
 
53
  ## System Under Test
54
 
 
414
  | Model-size validation | `15/15` paired final-loss wins |
415
  | Deep/narrow architecture holdout | `3/3` paired final-loss wins |
416
  | Width-heavy architecture holdout | `3/3` paired final-loss wins |
417
+ | WikiText-103 corpus holdout | `3/3` paired final-loss wins |
418
+ | Combined paired final-loss comparisons | `24/24` wins |
419
  | Update-pressure direction | Supported on L12 |
420
  | Sampled-pressure coefficient | Supported on L12 |
421
  | High arbitrary initial dropout | Rejected |
 
450
  | `l12_sample_pressure_ablation_053842` | L12 sampled-pressure coefficient ablation | 3 |
451
  | `deep_narrow_h8_112117` | Deep/narrow architecture-shape holdout | 3 |
452
  | `wide_h8_151721` | Width-heavy architecture-shape holdout | 3 |
453
+ | `wikitext103_l12_183213` | Public-corpus holdout on WikiText-103 raw | 3 |
454
 
455
  ### Static Screen Optima
456
 
 
640
  that final-loss transfer is real, but an architecture-shape term may be needed
641
  to avoid overestimating early dropout for wide models.
642
 
643
+ ### Corpus Holdout Final Controls
644
+
645
+ The first public-corpus holdout freezes the L12 formula and reruns the same
646
+ locked-stream protocol on WikiText-103 raw text, using a fresh local 4,096-token
647
+ BPE cache from the public Hugging Face parquet source. This tests whether the
648
+ schedule transfers beyond the original local cached corpus.
649
+
650
+ | Condition | N | Final val | Val std | Final train | Final gap |
651
+ |---|---:|---:|---:|---:|---:|
652
+ | formula | 3 | 4.0836 | 0.0258 | 3.8130 | 0.2707 |
653
+ | static 0.10 | 3 | 4.1081 | 0.0258 | 3.8531 | 0.2549 |
654
+ | static 0.08 | 3 | 4.1101 | 0.0247 | 3.8331 | 0.2769 |
655
+ | static 0.06 | 3 | 4.1181 | 0.0107 | 3.8177 | 0.3003 |
656
+ | static 0.14 | 3 | 4.1266 | 0.0198 | 3.8825 | 0.2441 |
657
+ | static 0.18 | 3 | 4.1288 | 0.0181 | 3.9090 | 0.2198 |
658
+ | static 0.20 | 3 | 4.1388 | 0.0236 | 3.9251 | 0.2136 |
659
+ | static 0.02 | 3 | 4.1520 | 0.0199 | 3.7968 | 0.3552 |
660
+ | static 0.26 | 3 | 4.1772 | 0.0203 | 3.9888 | 0.1883 |
661
+ | static 0.00 | 3 | 4.1796 | 0.0217 | 3.7873 | 0.3922 |
662
+ | static 0.30 | 3 | 4.1961 | 0.0179 | 4.0251 | 0.1710 |
663
+
664
+ Best static final loss varied by seed, but formula again beat the best static
665
+ condition in every paired final comparison:
666
+
667
+ ```text
668
+ seed 1: 4.0623 vs static 0.10 at 4.0807, delta -0.0184
669
+ seed 2: 4.1123 vs static 0.06 at 4.1304, delta -0.0181
670
+ seed 3: 4.0763 vs static 0.08 at 4.1036, delta -0.0272
671
+ ```
672
+
673
+ This is an important positive holdout because formula also had the best mean
674
+ trajectory loss, `4.5728`, versus best static `0.18` at `4.5759`. It still
675
+ exposes a weakness: the first two prefixes favored static rates around
676
+ `0.14-0.20`, while formula started at `0.30 -> 0.26`. The final win comes from
677
+ decaying below the best early static rate, not from perfectly predicting the
678
+ early optimum.
679
+
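The paired comparison above reduces to a per-seed subtraction. The following is a minimal sketch using only the rounded per-seed losses quoted in the text; `paired_deltas` and `count_wins` are hypothetical helper names, not functions from the repository.

```python
# Per-seed final validation losses copied from the text above.
# Each tuple: (seed, formula loss, best static rate for that seed, best static loss).
PAIRED_FINALS = [
    (1, 4.0623, 0.10, 4.0807),
    (2, 4.1123, 0.06, 4.1304),
    (3, 4.0763, 0.08, 4.1036),
]


def paired_deltas(pairs):
    """Return {seed: formula_loss - static_loss}; a negative delta is a formula win."""
    return {seed: formula - static for seed, formula, _rate, static in pairs}


def count_wins(pairs):
    """Count the seeds where the formula's final loss is strictly lower."""
    return sum(1 for delta in paired_deltas(pairs).values() if delta < 0)


if __name__ == "__main__":
    for seed, delta in paired_deltas(PAIRED_FINALS).items():
        print(f"seed {seed}: delta {delta:+.4f}")
    print(f"wins: {count_wins(PAIRED_FINALS)}/{len(PAIRED_FINALS)}")
```

Because the inputs here are already rounded to four decimals, a recomputed delta can differ from the listed one in the last digit. For seed 3 the rounded inputs give `-0.0273` rather than the listed `-0.0272`, presumably because the listed value was computed from unrounded losses.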
680
  ## Interpretation
681
 
682
  The most plausible mechanism is pressure tracking:
 
693
 
694
  This interpretation is consistent with the static screens, the model-size
695
  interpolation results, the update-pressure sweep, the sampled-pressure
696
+ coefficient ablation, the two architecture-shape holdouts, and the WikiText-103
697
+ corpus holdout. The holdouts add an important refinement: parameter count alone
698
+ does not fully describe capacity or data difficulty, because the formula's
699
+ early dropout was higher than the measured early-prefix static optimum for the
700
+ wide shape and for WikiText-103.
701
 
702
  ## What This Report Does Not Prove
703
 
 
726
  > In nanochat-style causal Transformers trained under an expanding-prefix
727
  > streaming protocol, a pressure-aware dropout schedule improves final
728
  > validation loss over fixed-dropout baselines across model sizes, update
729
+ > pressures, two architecture-shape holdouts, and one public-corpus holdout.
730
 
731
  Claims to avoid:
732
 
 
739
 
740
  The next experiments that would most strengthen a paper are:
741
 
742
+ 1. Second corpus/domain holdout: freeze the formula and run on another public
743
+ text distribution. This checks whether the WikiText-103 result is a single
744
+ favorable domain or a broader transfer result.
745
+ 2. Architecture/data-shape refinement: add a small feature such as
746
+ depth/width ratio, embedding dimension, or a dataset/tokenization statistic
747
+ to reduce early-dropout overestimation, then validate it on held-out shapes
748
+ and corpora.
749
  3. L8 and L16 sampled-pressure ablations: repeat the `0x`, `0.5x`, `1.0x`, and
750
  `1.5x` coefficient ablation outside L12.
751
  4. Oracle schedule comparison: compare the formula against a stage-wise oracle
runs/corpus_holdout_wikitext103_l12/locked_stream/20260528-183213/RESULT_SUMMARY.md ADDED
@@ -0,0 +1,104 @@
1
+ # Locked Streaming Dropout Summary
2
+
3
+ Run directory: `runs/corpus_holdout_wikitext103_l12/locked_stream/20260528-183213`
4
+
5
+ Model: `L12_H8_D320` causal Transformer, 17,367,040 parameters, 12 layers, 8 heads, 320 embedding dim.
6
+ Training per stage: 1,000 steps. Sampled tokens are cumulative in each stage row. Seeds present: 1, 2, 3.
7
+
8
+ ## Condition Ranking
9
+
10
+ | Condition | Kind | Final dropout | Mean trajectory val loss | Final val loss | Final gap | Dropout path |
11
+ |---|---|---:|---:|---:|---:|---|
12
+ | `formula_l12_wikitext103` | anchor_decay | 0.02 | 4.5728 | 4.0836 | 0.2707 | 0.30 -> 0.26 -> 0.18 -> 0.09 -> 0.02 |
13
+ | `static_dropout_0.18` | static | 0.18 | 4.5759 | 4.1288 | 0.2198 | 0.18 -> 0.18 -> 0.18 -> 0.18 -> 0.18 |
14
+ | `static_dropout_0.14` | static | 0.14 | 4.5775 | 4.1266 | 0.2441 | 0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14 |
15
+ | `static_dropout_0.2` | static | 0.20 | 4.5797 | 4.1388 | 0.2136 | 0.20 -> 0.20 -> 0.20 -> 0.20 -> 0.20 |
16
+ | `static_dropout_0.1` | static | 0.10 | 4.5844 | 4.1081 | 0.2549 | 0.10 -> 0.10 -> 0.10 -> 0.10 -> 0.10 |
17
+ | `static_dropout_0.08` | static | 0.08 | 4.5976 | 4.1101 | 0.2769 | 0.08 -> 0.08 -> 0.08 -> 0.08 -> 0.08 |
18
+ | `static_dropout_0.26` | static | 0.26 | 4.6081 | 4.1772 | 0.1883 | 0.26 -> 0.26 -> 0.26 -> 0.26 -> 0.26 |
19
+ | `static_dropout_0.06` | static | 0.06 | 4.6174 | 4.1181 | 0.3003 | 0.06 -> 0.06 -> 0.06 -> 0.06 -> 0.06 |
20
+ | `static_dropout_0.3` | static | 0.30 | 4.6266 | 4.1961 | 0.1710 | 0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30 |
21
+ | `static_dropout_0.02` | static | 0.02 | 4.7008 | 4.1520 | 0.3552 | 0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02 |
22
+ | `static_dropout_0` | static | 0.00 | 4.7735 | 4.1796 | 0.3922 | 0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00 |
23
+
24
+ ## Stage Trajectory
25
+
26
+ ### Stage 0: 250,000 Prefix Tokens
27
+
28
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
29
+ |---|---:|---:|---:|---:|---:|
30
+ | `static_dropout_0.18` | 0.18 | 5.1632 | 4.0077 | 1.1555 | 3 |
31
+ | `static_dropout_0.14` | 0.14 | 5.1693 | 3.9113 | 1.2579 | 3 |
32
+ | `static_dropout_0.2` | 0.20 | 5.1730 | 4.0440 | 1.1290 | 3 |
33
+ | `static_dropout_0.1` | 0.10 | 5.1741 | 3.8214 | 1.3527 | 3 |
34
+ | `static_dropout_0.08` | 0.08 | 5.1900 | 3.7643 | 1.4257 | 3 |
35
+ | `static_dropout_0.26` | 0.26 | 5.1977 | 4.1614 | 1.0363 | 3 |
36
+ | `static_dropout_0.06` | 0.06 | 5.2117 | 3.7253 | 1.4864 | 3 |
37
+ | `formula_l12_wikitext103` | 0.30 | 5.2139 | 4.2195 | 0.9944 | 3 |
38
+ | `static_dropout_0.3` | 0.30 | 5.2139 | 4.2195 | 0.9944 | 3 |
39
+ | `static_dropout_0.02` | 0.02 | 5.2779 | 3.5912 | 1.6867 | 3 |
40
+ | `static_dropout_0` | 0.00 | 5.3396 | 3.5193 | 1.8203 | 3 |
41
+
42
+ ### Stage 1: 500,000 Prefix Tokens
43
+
44
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
45
+ |---|---:|---:|---:|---:|---:|
46
+ | `static_dropout_0.2` | 0.20 | 4.7875 | 3.7917 | 0.9959 | 3 |
47
+ | `static_dropout_0.18` | 0.18 | 4.7958 | 3.7620 | 1.0338 | 3 |
48
+ | `static_dropout_0.14` | 0.14 | 4.8048 | 3.6746 | 1.1302 | 3 |
49
+ | `formula_l12_wikitext103` | 0.26 | 4.8070 | 3.9220 | 0.8851 | 3 |
50
+ | `static_dropout_0.26` | 0.26 | 4.8103 | 3.9032 | 0.9070 | 3 |
51
+ | `static_dropout_0.3` | 0.30 | 4.8266 | 3.9801 | 0.8465 | 3 |
52
+ | `static_dropout_0.1` | 0.10 | 4.8324 | 3.5595 | 1.2729 | 3 |
53
+ | `static_dropout_0.08` | 0.08 | 4.8634 | 3.4963 | 1.3671 | 3 |
54
+ | `static_dropout_0.06` | 0.06 | 4.8951 | 3.4347 | 1.4603 | 3 |
55
+ | `static_dropout_0.02` | 0.02 | 5.0488 | 3.2697 | 1.7791 | 3 |
56
+ | `static_dropout_0` | 0.00 | 5.1751 | 3.1573 | 2.0178 | 3 |
57
+
58
+ ### Stage 2: 1,000,000 Prefix Tokens
59
+
60
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
61
+ |---|---:|---:|---:|---:|---:|
62
+ | `formula_l12_wikitext103` | 0.18 | 4.4945 | 3.8363 | 0.6582 | 3 |
63
+ | `static_dropout_0.14` | 0.14 | 4.5009 | 3.7184 | 0.7825 | 3 |
64
+ | `static_dropout_0.2` | 0.20 | 4.5016 | 3.8116 | 0.6900 | 3 |
65
+ | `static_dropout_0.18` | 0.18 | 4.5027 | 3.7867 | 0.7160 | 3 |
66
+ | `static_dropout_0.1` | 0.10 | 4.5167 | 3.6543 | 0.8624 | 3 |
67
+ | `static_dropout_0.26` | 0.26 | 4.5273 | 3.9097 | 0.6175 | 3 |
68
+ | `static_dropout_0.08` | 0.08 | 4.5324 | 3.6026 | 0.9298 | 3 |
69
+ | `static_dropout_0.3` | 0.30 | 4.5468 | 3.9773 | 0.5695 | 3 |
70
+ | `static_dropout_0.06` | 0.06 | 4.5575 | 3.5541 | 1.0034 | 3 |
71
+ | `static_dropout_0.02` | 0.02 | 4.6572 | 3.4398 | 1.2173 | 3 |
72
+ | `static_dropout_0` | 0.00 | 4.7597 | 3.3638 | 1.3959 | 3 |
73
+
74
+ ### Stage 3: 2,000,000 Prefix Tokens
75
+
76
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
77
+ |---|---:|---:|---:|---:|---:|
78
+ | `formula_l12_wikitext103` | 0.09 | 4.2650 | 3.8133 | 0.4518 | 3 |
79
+ | `static_dropout_0.14` | 0.14 | 4.2859 | 3.8283 | 0.4575 | 3 |
80
+ | `static_dropout_0.18` | 0.18 | 4.2890 | 3.8752 | 0.4138 | 3 |
81
+ | `static_dropout_0.1` | 0.10 | 4.2908 | 3.7856 | 0.5053 | 3 |
82
+ | `static_dropout_0.08` | 0.08 | 4.2920 | 3.7518 | 0.5403 | 3 |
83
+ | `static_dropout_0.2` | 0.20 | 4.2976 | 3.8983 | 0.3993 | 3 |
84
+ | `static_dropout_0.06` | 0.06 | 4.3047 | 3.7288 | 0.5759 | 3 |
85
+ | `static_dropout_0.26` | 0.26 | 4.3278 | 3.9675 | 0.3604 | 3 |
86
+ | `static_dropout_0.3` | 0.30 | 4.3497 | 4.0121 | 0.3376 | 3 |
87
+ | `static_dropout_0.02` | 0.02 | 4.3680 | 3.6687 | 0.6993 | 3 |
88
+ | `static_dropout_0` | 0.00 | 4.4136 | 3.6351 | 0.7785 | 3 |
89
+
90
+ ### Stage 4: 4,000,000 Prefix Tokens
91
+
92
+ | Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
93
+ |---|---:|---:|---:|---:|---:|
94
+ | `formula_l12_wikitext103` | 0.02 | 4.0836 | 3.8130 | 0.2707 | 3 |
95
+ | `static_dropout_0.1` | 0.10 | 4.1081 | 3.8531 | 0.2549 | 3 |
96
+ | `static_dropout_0.08` | 0.08 | 4.1101 | 3.8331 | 0.2769 | 3 |
97
+ | `static_dropout_0.06` | 0.06 | 4.1181 | 3.8177 | 0.3003 | 3 |
98
+ | `static_dropout_0.14` | 0.14 | 4.1266 | 3.8825 | 0.2441 | 3 |
99
+ | `static_dropout_0.18` | 0.18 | 4.1288 | 3.9090 | 0.2198 | 3 |
100
+ | `static_dropout_0.2` | 0.20 | 4.1388 | 3.9251 | 0.2136 | 3 |
101
+ | `static_dropout_0.02` | 0.02 | 4.1520 | 3.7968 | 0.3552 | 3 |
102
+ | `static_dropout_0.26` | 0.26 | 4.1772 | 3.9888 | 0.1883 | 3 |
103
+ | `static_dropout_0` | 0.00 | 4.1796 | 3.7873 | 0.3922 | 3 |
104
+ | `static_dropout_0.3` | 0.30 | 4.1961 | 4.0251 | 0.1710 | 3 |
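The "Mean trajectory val loss" column in the condition ranking appears to be the unweighted mean of the five per-stage mean validation losses. This is inferred from the numbers in the stage tables rather than stated in the summary. The following minimal sketch reproduces it for three conditions. `mean_trajectory` and `rank_by_trajectory` are hypothetical helper names.

```python
# Per-stage mean validation losses (stages 0 to 4), copied from the stage tables above.
STAGE_MEAN_VAL = {
    "formula_l12_wikitext103": [5.2139, 4.8070, 4.4945, 4.2650, 4.0836],
    "static_dropout_0.18": [5.1632, 4.7958, 4.5027, 4.2890, 4.1288],
    "static_dropout_0.1": [5.1741, 4.8324, 4.5167, 4.2908, 4.1081],
}


def mean_trajectory(stage_losses):
    """Unweighted mean over stages (assumed definition of the trajectory metric)."""
    return sum(stage_losses) / len(stage_losses)


def rank_by_trajectory(table):
    """Return (condition, mean trajectory loss) pairs, lowest loss first."""
    scored = [(name, mean_trajectory(losses)) for name, losses in table.items()]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    for name, score in rank_by_trajectory(STAGE_MEAN_VAL):
        print(f"{name}: {score:.4f}")
```

Under this definition the formula's early-stage deficit (it ranks eighth at stage 0) is offset by its lead in stages 2 through 4, which is why it narrowly tops the trajectory ranking as well as the final-loss ranking.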
runs/corpus_holdout_wikitext103_l12/locked_stream/20260528-183213/config.json ADDED
@@ -0,0 +1,134 @@
1
+ {
2
+ "args": {
3
+ "mode": "locked_stream",
4
+ "corpus": null,
5
+ "corpus_glob": null,
6
+ "text_column": "text",
7
+ "use_cached_data": true,
8
+ "output_dir": "runs/corpus_holdout_wikitext103_l12",
9
+ "resume_from": null,
10
+ "cache_dir": ".cache/dropout_decay_wikitext103",
11
+ "models": [
12
+ "L12_H8_D320=12x8x320"
13
+ ],
14
+ "seeds": [
15
+ 1,
16
+ 2,
17
+ 3
18
+ ],
19
+ "token_limits": [
20
+ 5000000
21
+ ],
22
+ "stream_token_caps": [
23
+ 250000,
24
+ 500000,
25
+ 1000000,
26
+ 2000000,
27
+ 4000000
28
+ ],
29
+ "val_tokens": 500000,
30
+ "allow_short_corpus": false,
31
+ "force_retokenize": false,
32
+ "vocab_size": 4096,
33
+ "tokenizer_train_chars": 10000000,
34
+ "block_size": 128,
35
+ "batch_size": 16,
36
+ "steps": 2000,
37
+ "stage_steps": 1000,
38
+ "dropout_rates": [
39
+ 0.0,
40
+ 0.02,
41
+ 0.06,
42
+ 0.08,
43
+ 0.1,
44
+ 0.14,
45
+ 0.18,
46
+ 0.2,
47
+ 0.26,
48
+ 0.3
49
+ ],
50
+ "decays": [],
51
+ "anchor_decays": [
52
+ {
53
+ "name": "formula_l12_wikitext103",
54
+ "kind": "anchor_decay",
55
+ "initial": 0.3,
56
+ "final": 0.02,
57
+ "schedule": "log_prefix_anchor",
58
+ "decay_tokens": null,
59
+ "anchors": [
60
+ [
61
+ 250000,
62
+ 0.3
63
+ ],
64
+ [
65
+ 500000,
66
+ 0.26
67
+ ],
68
+ [
69
+ 1000000,
70
+ 0.18
71
+ ],
72
+ [
73
+ 2000000,
74
+ 0.09
75
+ ],
76
+ [
77
+ 4000000,
78
+ 0.02
79
+ ]
80
+ ]
81
+ }
82
+ ],
83
+ "decay_tokens": null,
84
+ "eval_batches": 64,
85
+ "train_eval_batches": 32,
86
+ "trace_eval_batches": 8,
87
+ "eval_every": 0,
88
+ "log_every": 500,
89
+ "lr": 0.0003,
90
+ "weight_decay": 0.1,
91
+ "grad_clip": 1.0,
92
+ "plateau_delta": 0.01,
93
+ "target_min_dropout": 0.1,
94
+ "min_nonzero_margin": 0.01,
95
+ "min_high_dropout_margin": 0.03,
96
+ "screen_early_stop": false,
97
+ "screen_prune_patience": 3,
98
+ "screen_prune_min_delta": 0.01
99
+ },
100
+ "mode": "locked_stream",
101
+ "seeds": [
102
+ 1,
103
+ 2,
104
+ 3
105
+ ],
106
+ "models": [
107
+ {
108
+ "model_name": "L12_H8_D320",
109
+ "n_layer": 12,
110
+ "n_head": 8,
111
+ "n_embd": 320
112
+ }
113
+ ],
114
+ "device": "mps",
115
+ "torch": "2.12.0",
116
+ "python": "3.11.15 (main, Mar 3 2026, 00:52:57) [Clang 21.0.0 (clang-2100.0.123.102)]",
117
+ "mps_available": true,
118
+ "attribution": "Derived from Andrej Karpathy's nanochat project (https://github.com/karpathy/nanochat), MIT License, Copyright (c) 2025 Andrej Karpathy.",
119
+ "tokenizer_path": ".cache/dropout_decay_wikitext103/tokenizer-v4096.json",
120
+ "encoded_path": ".cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy",
121
+ "train_tokens": 4500020,
122
+ "val_tokens": 500000,
123
+ "effective_token_limits": [
124
+ 4500020
125
+ ],
126
+ "effective_stream_token_caps": [
127
+ 250000,
128
+ 500000,
129
+ 1000000,
130
+ 2000000,
131
+ 4000000
132
+ ],
133
+ "resume_from": null
134
+ }
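The `anchor_decays` entry in this config corresponds to the `--anchor-decays` string `name:tokens=rate,...` passed on the command line in `REPRODUCING.md`. The following is a minimal sketch of that mapping, not the repository's actual parser: `parse_anchor_decay` is a hypothetical name. It assumes `initial` and `final` are simply the first and last anchor rates, which matches the values recorded in this run.

```python
def parse_anchor_decay(spec):
    """Parse 'name:tokens=rate,tokens=rate,...' into a config-like dict.

    Assumption: 'initial' and 'final' are the rates at the smallest and
    largest token anchors, as they are in this run's config.json.
    """
    name, _, body = spec.partition(":")
    if not name or not body:
        raise ValueError(f"expected 'name:tokens=rate,...', got {spec!r}")
    anchors = []
    for item in body.split(","):
        tokens, _, rate = item.partition("=")
        anchors.append((int(tokens), float(rate)))
    anchors.sort()  # order anchors by prefix token count
    return {
        "name": name,
        "kind": "anchor_decay",
        "initial": anchors[0][1],
        "final": anchors[-1][1],
        "anchors": anchors,
    }


if __name__ == "__main__":
    spec = (
        "formula_l12_wikitext103:"
        "250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020"
    )
    print(parse_anchor_decay(spec))
```

In this run every stage cap coincides with an anchor, so the dropout path is just the anchor rates in order. How the `log_prefix_anchor` schedule behaves between anchors is not shown in this commit, so the sketch does not model it.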
runs/corpus_holdout_wikitext103_l12/locked_stream/20260528-183213/metrics.jsonl ADDED
The diff for this file is too large to render. See raw diff
 
runs/corpus_holdout_wikitext103_l12/locked_stream/20260528-183213/summary.csv ADDED
@@ -0,0 +1,56 @@
1
+ run_mode,condition,condition_kind,stage,token_limit,model_name,n_layer,n_head,n_embd,parameters,dropout_initial,dropout_final,dropout_schedule,n,mean_train_eval_loss,std_train_eval_loss,mean_val_eval_loss,std_val_eval_loss,mean_generalization_gap,std_generalization_gap
2
+ locked_stream,formula_l12_wikitext103,anchor_decay,0,250000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,3,4.219510187705358,0.009400458871163125,5.213888607919216,0.02540367830871252,0.9943784202138582,0.034794492047624956
3
+ locked_stream,static_dropout_0,static,0,250000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,3,3.5193328907092414,0.04538277641898639,5.339607280989488,0.03803900159576214,1.8202743902802467,0.07017278062165234
4
+ locked_stream,static_dropout_0.02,static,0,250000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,3,3.591204417248567,0.009258206352807653,5.27785774320364,0.03450217928391973,1.6866533259550731,0.041632245818121416
5
+ locked_stream,static_dropout_0.06,static,0,250000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,3,3.725302261610826,0.008063637282408817,5.211690753698349,0.02339268745574712,1.4863884920875232,0.031438438863035205
6
+ locked_stream,static_dropout_0.08,static,0,250000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,3,3.764298364520073,0.01498956532264627,5.190002868572871,0.020721921300465723,1.425704504052798,0.014047879271355385
7
+ locked_stream,static_dropout_0.1,static,0,250000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,3,3.821372576057911,0.021364905351568726,5.17405295620362,0.02772884387617372,1.3526803801457088,0.04389058367879646
8
+ locked_stream,static_dropout_0.14,static,0,250000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,3,3.9113145818312964,0.013390634728517344,5.169258703788121,0.02886637125607666,1.2579441219568253,0.03490443394763099
9
+ locked_stream,static_dropout_0.18,static,0,250000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,3,4.007677458226681,0.011584990895915435,5.163212870558103,0.019303530325978743,1.155535412331422,0.021457887456797633
10
+ locked_stream,static_dropout_0.2,static,0,250000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,3,4.0440234914422035,0.006683199962460588,5.173003261288007,0.018770276566139252,1.1289797698458035,0.01557915612774489
11
+ locked_stream,static_dropout_0.26,static,0,250000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,3,4.161390465994676,0.017199456198519997,5.197733856737614,0.020026368980316463,1.0363433907429378,0.02452558695313914
12
+ locked_stream,static_dropout_0.3,static,0,250000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,3,4.21951021750768,0.00940042069294641,5.213888632754485,0.02540370559129739,0.9943784152468046,0.03479448069461329
13
+ locked_stream,formula_l12_wikitext103,anchor_decay,1,500000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,3,3.921963239709536,0.012701856161083026,4.807015274961789,0.03255901757339794,0.8850520352522532,0.020904354460357738
14
+ locked_stream,static_dropout_0,static,1,500000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,3,3.157259744902452,0.033266149836286125,5.17509716997544,0.027225613071693588,2.017837425072988,0.027271551103611697
15
+ locked_stream,static_dropout_0.02,static,1,500000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,3,3.2696756149331727,0.010229343900978308,5.048807606101036,0.015919759088441253,1.7791319911678631,0.015163975400676688
16
+ locked_stream,static_dropout_0.06,static,1,500000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,3,3.4347175483902297,0.01608227845233568,4.895065948367119,0.025796230814273288,1.4603483999768894,0.016219356742392815
17
+ locked_stream,static_dropout_0.08,static,1,500000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,3,3.49625497063001,0.02296095075813903,4.8633880987763405,0.03137654346429713,1.3671331281463306,0.02205679071854492
18
+ locked_stream,static_dropout_0.1,static,1,500000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,3,3.559451498091221,0.03242743348226214,4.832371026277542,0.03379083503310632,1.2729195281863213,0.004581876102180898
19
+ locked_stream,static_dropout_0.14,static,1,500000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,3,3.6746241599321365,0.026787166240812375,4.804830571015676,0.02283391501446146,1.1302064110835393,0.004574911720412448
20
+ locked_stream,static_dropout_0.18,static,1,500000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,3,3.761968764166037,0.030310143678321872,4.79575655857722,0.02839125459863536,1.0337877944111824,0.02395919557232544
21
+ locked_stream,static_dropout_0.2,static,1,500000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,3,3.7916674092411995,0.02457212816143504,4.78752597173055,0.03269980279988427,0.9958585624893507,0.010890773437236328
22
+ locked_stream,static_dropout_0.26,static,1,500000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,3,3.9032329618930817,0.03313463368600539,4.810270811120669,0.029030451515976875,0.9070378492275873,0.0218852812037738
23
+ locked_stream,static_dropout_0.3,static,1,500000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,3,3.9801244686047235,0.011984368688744857,4.826624286671479,0.034146701924452114,0.8464998180667559,0.023039413394205236
24
+ locked_stream,formula_l12_wikitext103,anchor_decay,2,1000000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,3,3.836283288896084,0.007165129109779279,4.4944643552104635,0.01869570507899432,0.6581810663143793,0.02324475775596518
25
+ locked_stream,static_dropout_0,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,3,3.363803490996361,0.03468523458495426,4.759654012819131,0.029874213126619024,1.3958505218227704,0.061691590097926664
26
+ locked_stream,static_dropout_0.02,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,3,3.439823806285858,0.005562641857737584,4.657156705856323,0.020918879251822793,1.217332899570465,0.015357844279503235
27
+ locked_stream,static_dropout_0.06,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,3,3.5540901521841683,0.017924779970058147,4.557474325100581,0.016603375950540906,1.0033841729164124,0.03385453586409587
28
+ locked_stream,static_dropout_0.08,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,3,3.60255632797877,0.00575740505926527,4.5323765849073725,0.009481504316560474,0.9298202569286028,0.012211546495961454
29
+ locked_stream,static_dropout_0.1,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,3,3.6542825972040496,0.030316068353894877,4.516653408606847,0.012694329764198367,0.8623708114027977,0.04203932410029853
30
+ locked_stream,static_dropout_0.14,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,3,3.7183619191249213,0.02332517298978488,4.5008573432763415,0.018900687447980586,0.7824954241514206,0.04221659685113837
31
+ locked_stream,static_dropout_0.18,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,3,3.786718524992466,0.012788547677963957,4.50268988062938,0.019498788384197212,0.7159713556369146,0.02737366667147904
32
+ locked_stream,static_dropout_0.2,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,3,3.811583066980044,0.016414324708578124,4.501583576202393,0.018885880819448517,0.6900005092223486,0.02848926010067624
33
+ locked_stream,static_dropout_0.26,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,3,3.909716250995795,0.029922261968460786,4.527257425089677,0.012874091579002427,0.6175411740938822,0.038926303819109775
34
+ locked_stream,static_dropout_0.3,static,2,1000000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,3,3.9773165384928384,0.009735095632867338,4.546837595601876,0.01674020951247983,0.569521057109038,0.02091747765619741
35
+ locked_stream,formula_l12_wikitext103,anchor_decay,3,2000000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,3,3.8132816379268966,0.021180137190910886,4.265036730716626,0.020952651949156013,0.4517550927897294,0.026508733332518825
36
+ locked_stream,static_dropout_0,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,3,3.6350866481661797,0.016285186896951624,4.413571459551652,0.02523774966453558,0.7784848113854727,0.03399640276399005
37
+ locked_stream,static_dropout_0.02,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,3,3.6686666384339333,0.015260894096187612,4.368014050026734,0.027178290223553957,0.6993474115928014,0.02886603382685387
38
+ locked_stream,static_dropout_0.06,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,3,3.7288275261720023,0.02214638532765685,4.304687723517418,0.025605319026104526,0.5758601973454157,0.03169069142172165
39
+ locked_stream,static_dropout_0.08,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,3,3.751750993231932,0.027325045963601875,4.292027606318395,0.022193268917965966,0.540276613086462,0.048123905643236306
40
+ locked_stream,static_dropout_0.1,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,3,3.785589481393496,0.023254215859456147,4.290843218564987,0.024816447573560912,0.505253737171491,0.03955542155547533
41
+ locked_stream,static_dropout_0.14,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,3,3.8283161322275796,0.006946470327794601,4.285862573732932,0.029060804176488816,0.4575464415053527,0.02311386935886081
42
+ locked_stream,static_dropout_0.18,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,3,3.875220308701197,0.011227992287915711,4.2889708404739695,0.027610989551016634,0.4137505317727725,0.019195364852840884
43
+ locked_stream,static_dropout_0.2,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,3,3.8982935870687165,0.011753289842086798,4.297594143698613,0.0215564773236604,0.39930055662989616,0.015284717362206547
44
+ locked_stream,static_dropout_0.26,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,3,3.9674767553806305,0.024908308844425655,4.327832326292992,0.024543452908526633,0.36035557091236115,0.02569548028047761
+ locked_stream,static_dropout_0.3,static,3,2000000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,3,4.012147528429826,0.02012763898563558,4.349700070917606,0.019107888304911173,0.33755254248778027,0.025128923759912217
+ locked_stream,formula_l12_wikitext103,anchor_decay,4,4000000,L12_H8_D320,12,8,320,17367040,0.3,0.02,log_prefix_anchor,3,3.812962586681048,0.025352240743704624,4.083636499941349,0.025765009042835362,0.27067391326030094,0.05009449170635578
+ locked_stream,static_dropout_0,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.0,0.0,constant,3,3.7873488490780196,0.026205633809150763,4.179581480721633,0.021736412927032305,0.39223263164361316,0.03552201232307707
+ locked_stream,static_dropout_0.02,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.02,0.02,constant,3,3.796787646909555,0.024529861961376253,4.152008826533954,0.019904634278940705,0.35522117962439853,0.04352816945908601
+ locked_stream,static_dropout_0.06,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.06,0.06,constant,3,3.817745270828406,0.022916843434376433,4.118092642476161,0.010664122574525757,0.3003473716477553,0.03183478268164145
+ locked_stream,static_dropout_0.08,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.08,0.08,constant,3,3.833147312204043,0.02624431695895608,4.110093618432681,0.024734680784555277,0.2769463062286377,0.04976013223808796
+ locked_stream,static_dropout_0.1,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.1,0.1,constant,3,3.85314512749513,0.017044693431674442,4.108063347637653,0.025798124596532718,0.25491822014252347,0.03599137702525591
+ locked_stream,static_dropout_0.14,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.14,0.14,constant,3,3.8825049375494323,0.020653196329518318,4.126605914284785,0.019834322361978202,0.24410097673535347,0.03709848360553261
+ locked_stream,static_dropout_0.18,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.18,0.18,constant,3,3.9089941332737603,0.02100733745422538,4.128837997714679,0.018137202414878383,0.21984386444091797,0.03802532083941853
+ locked_stream,static_dropout_0.2,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.2,0.2,constant,3,3.925126865506172,0.0200711438797493,4.138758639494578,0.02359678741043119,0.21363177398840585,0.04159847214281183
+ locked_stream,static_dropout_0.26,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.26,0.26,constant,3,3.98883589108785,0.021597891789016974,4.17716755097111,0.020328951540567387,0.18833165988326073,0.03975692522805749
+ locked_stream,static_dropout_0.3,static,4,4000000,L12_H8_D320,12,8,320,17367040,0.3,0.3,constant,3,4.025067411363125,0.021409684612983128,4.196084383875132,0.017949538707409805,0.17101697251200676,0.03846054936653385
runs/corpus_holdout_wikitext103_l12/locked_stream/20260528-183213/summary.json ADDED
@@ -0,0 +1,1212 @@
1
+ [
2
+ {
3
+ "run_mode": "locked_stream",
4
+ "condition": "formula_l12_wikitext103",
5
+ "condition_kind": "anchor_decay",
6
+ "stage": 0,
7
+ "token_limit": 250000,
8
+ "model_name": "L12_H8_D320",
9
+ "n_layer": 12,
10
+ "n_head": 8,
11
+ "n_embd": 320,
12
+ "parameters": 17367040,
13
+ "dropout_initial": 0.3,
14
+ "dropout_final": 0.02,
15
+ "dropout_schedule": "log_prefix_anchor",
16
+ "n": 3,
17
+ "mean_train_eval_loss": 4.219510187705358,
18
+ "std_train_eval_loss": 0.009400458871163125,
19
+ "mean_val_eval_loss": 5.213888607919216,
20
+ "std_val_eval_loss": 0.02540367830871252,
21
+ "mean_generalization_gap": 0.9943784202138582,
22
+ "std_generalization_gap": 0.034794492047624956
23
+ },
24
+ {
25
+ "run_mode": "locked_stream",
26
+ "condition": "static_dropout_0",
27
+ "condition_kind": "static",
28
+ "stage": 0,
29
+ "token_limit": 250000,
30
+ "model_name": "L12_H8_D320",
31
+ "n_layer": 12,
32
+ "n_head": 8,
33
+ "n_embd": 320,
34
+ "parameters": 17367040,
35
+ "dropout_initial": 0.0,
36
+ "dropout_final": 0.0,
37
+ "dropout_schedule": "constant",
38
+ "n": 3,
39
+ "mean_train_eval_loss": 3.5193328907092414,
40
+ "std_train_eval_loss": 0.04538277641898639,
41
+ "mean_val_eval_loss": 5.339607280989488,
42
+ "std_val_eval_loss": 0.03803900159576214,
43
+ "mean_generalization_gap": 1.8202743902802467,
44
+ "std_generalization_gap": 0.07017278062165234
45
+ },
46
+ {
47
+ "run_mode": "locked_stream",
48
+ "condition": "static_dropout_0.02",
49
+ "condition_kind": "static",
50
+ "stage": 0,
51
+ "token_limit": 250000,
52
+ "model_name": "L12_H8_D320",
53
+ "n_layer": 12,
54
+ "n_head": 8,
55
+ "n_embd": 320,
56
+ "parameters": 17367040,
57
+ "dropout_initial": 0.02,
58
+ "dropout_final": 0.02,
59
+ "dropout_schedule": "constant",
60
+ "n": 3,
61
+ "mean_train_eval_loss": 3.591204417248567,
62
+ "std_train_eval_loss": 0.009258206352807653,
63
+ "mean_val_eval_loss": 5.27785774320364,
64
+ "std_val_eval_loss": 0.03450217928391973,
65
+ "mean_generalization_gap": 1.6866533259550731,
66
+ "std_generalization_gap": 0.041632245818121416
67
+ },
68
+ {
69
+ "run_mode": "locked_stream",
70
+ "condition": "static_dropout_0.06",
71
+ "condition_kind": "static",
72
+ "stage": 0,
73
+ "token_limit": 250000,
74
+ "model_name": "L12_H8_D320",
75
+ "n_layer": 12,
76
+ "n_head": 8,
77
+ "n_embd": 320,
78
+ "parameters": 17367040,
79
+ "dropout_initial": 0.06,
80
+ "dropout_final": 0.06,
81
+ "dropout_schedule": "constant",
82
+ "n": 3,
83
+ "mean_train_eval_loss": 3.725302261610826,
84
+ "std_train_eval_loss": 0.008063637282408817,
85
+ "mean_val_eval_loss": 5.211690753698349,
86
+ "std_val_eval_loss": 0.02339268745574712,
87
+ "mean_generalization_gap": 1.4863884920875232,
88
+ "std_generalization_gap": 0.031438438863035205
89
+ },
90
+ {
91
+ "run_mode": "locked_stream",
92
+ "condition": "static_dropout_0.08",
93
+ "condition_kind": "static",
94
+ "stage": 0,
95
+ "token_limit": 250000,
96
+ "model_name": "L12_H8_D320",
97
+ "n_layer": 12,
98
+ "n_head": 8,
99
+ "n_embd": 320,
100
+ "parameters": 17367040,
101
+ "dropout_initial": 0.08,
102
+ "dropout_final": 0.08,
103
+ "dropout_schedule": "constant",
104
+ "n": 3,
105
+ "mean_train_eval_loss": 3.764298364520073,
106
+ "std_train_eval_loss": 0.01498956532264627,
107
+ "mean_val_eval_loss": 5.190002868572871,
108
+ "std_val_eval_loss": 0.020721921300465723,
109
+ "mean_generalization_gap": 1.425704504052798,
110
+ "std_generalization_gap": 0.014047879271355385
111
+ },
112
+ {
113
+ "run_mode": "locked_stream",
114
+ "condition": "static_dropout_0.1",
115
+ "condition_kind": "static",
116
+ "stage": 0,
117
+ "token_limit": 250000,
118
+ "model_name": "L12_H8_D320",
119
+ "n_layer": 12,
120
+ "n_head": 8,
121
+ "n_embd": 320,
122
+ "parameters": 17367040,
123
+ "dropout_initial": 0.1,
124
+ "dropout_final": 0.1,
125
+ "dropout_schedule": "constant",
126
+ "n": 3,
127
+ "mean_train_eval_loss": 3.821372576057911,
128
+ "std_train_eval_loss": 0.021364905351568726,
129
+ "mean_val_eval_loss": 5.17405295620362,
130
+ "std_val_eval_loss": 0.02772884387617372,
131
+ "mean_generalization_gap": 1.3526803801457088,
132
+ "std_generalization_gap": 0.04389058367879646
133
+ },
134
+ {
135
+ "run_mode": "locked_stream",
136
+ "condition": "static_dropout_0.14",
137
+ "condition_kind": "static",
138
+ "stage": 0,
139
+ "token_limit": 250000,
140
+ "model_name": "L12_H8_D320",
141
+ "n_layer": 12,
142
+ "n_head": 8,
143
+ "n_embd": 320,
144
+ "parameters": 17367040,
145
+ "dropout_initial": 0.14,
146
+ "dropout_final": 0.14,
147
+ "dropout_schedule": "constant",
148
+ "n": 3,
149
+ "mean_train_eval_loss": 3.9113145818312964,
150
+ "std_train_eval_loss": 0.013390634728517344,
151
+ "mean_val_eval_loss": 5.169258703788121,
152
+ "std_val_eval_loss": 0.02886637125607666,
153
+ "mean_generalization_gap": 1.2579441219568253,
154
+ "std_generalization_gap": 0.03490443394763099
155
+ },
156
+ {
157
+ "run_mode": "locked_stream",
158
+ "condition": "static_dropout_0.18",
159
+ "condition_kind": "static",
160
+ "stage": 0,
161
+ "token_limit": 250000,
162
+ "model_name": "L12_H8_D320",
163
+ "n_layer": 12,
164
+ "n_head": 8,
165
+ "n_embd": 320,
166
+ "parameters": 17367040,
167
+ "dropout_initial": 0.18,
168
+ "dropout_final": 0.18,
169
+ "dropout_schedule": "constant",
170
+ "n": 3,
171
+ "mean_train_eval_loss": 4.007677458226681,
172
+ "std_train_eval_loss": 0.011584990895915435,
173
+ "mean_val_eval_loss": 5.163212870558103,
174
+ "std_val_eval_loss": 0.019303530325978743,
175
+ "mean_generalization_gap": 1.155535412331422,
176
+ "std_generalization_gap": 0.021457887456797633
177
+ },
178
+ {
179
+ "run_mode": "locked_stream",
180
+ "condition": "static_dropout_0.2",
181
+ "condition_kind": "static",
182
+ "stage": 0,
183
+ "token_limit": 250000,
184
+ "model_name": "L12_H8_D320",
185
+ "n_layer": 12,
186
+ "n_head": 8,
187
+ "n_embd": 320,
188
+ "parameters": 17367040,
189
+ "dropout_initial": 0.2,
190
+ "dropout_final": 0.2,
191
+ "dropout_schedule": "constant",
192
+ "n": 3,
193
+ "mean_train_eval_loss": 4.0440234914422035,
194
+ "std_train_eval_loss": 0.006683199962460588,
195
+ "mean_val_eval_loss": 5.173003261288007,
196
+ "std_val_eval_loss": 0.018770276566139252,
197
+ "mean_generalization_gap": 1.1289797698458035,
198
+ "std_generalization_gap": 0.01557915612774489
199
+ },
200
+ {
201
+ "run_mode": "locked_stream",
202
+ "condition": "static_dropout_0.26",
203
+ "condition_kind": "static",
204
+ "stage": 0,
205
+ "token_limit": 250000,
206
+ "model_name": "L12_H8_D320",
207
+ "n_layer": 12,
208
+ "n_head": 8,
209
+ "n_embd": 320,
210
+ "parameters": 17367040,
211
+ "dropout_initial": 0.26,
212
+ "dropout_final": 0.26,
213
+ "dropout_schedule": "constant",
214
+ "n": 3,
215
+ "mean_train_eval_loss": 4.161390465994676,
216
+ "std_train_eval_loss": 0.017199456198519997,
217
+ "mean_val_eval_loss": 5.197733856737614,
218
+ "std_val_eval_loss": 0.020026368980316463,
219
+ "mean_generalization_gap": 1.0363433907429378,
220
+ "std_generalization_gap": 0.02452558695313914
221
+ },
222
+ {
223
+ "run_mode": "locked_stream",
224
+ "condition": "static_dropout_0.3",
225
+ "condition_kind": "static",
226
+ "stage": 0,
227
+ "token_limit": 250000,
228
+ "model_name": "L12_H8_D320",
229
+ "n_layer": 12,
230
+ "n_head": 8,
231
+ "n_embd": 320,
232
+ "parameters": 17367040,
233
+ "dropout_initial": 0.3,
234
+ "dropout_final": 0.3,
235
+ "dropout_schedule": "constant",
236
+ "n": 3,
237
+ "mean_train_eval_loss": 4.21951021750768,
238
+ "std_train_eval_loss": 0.00940042069294641,
239
+ "mean_val_eval_loss": 5.213888632754485,
240
+ "std_val_eval_loss": 0.02540370559129739,
241
+ "mean_generalization_gap": 0.9943784152468046,
242
+ "std_generalization_gap": 0.03479448069461329
243
+ },
244
+ {
245
+ "run_mode": "locked_stream",
246
+ "condition": "formula_l12_wikitext103",
247
+ "condition_kind": "anchor_decay",
248
+ "stage": 1,
249
+ "token_limit": 500000,
250
+ "model_name": "L12_H8_D320",
251
+ "n_layer": 12,
252
+ "n_head": 8,
253
+ "n_embd": 320,
254
+ "parameters": 17367040,
255
+ "dropout_initial": 0.3,
256
+ "dropout_final": 0.02,
257
+ "dropout_schedule": "log_prefix_anchor",
258
+ "n": 3,
259
+ "mean_train_eval_loss": 3.921963239709536,
260
+ "std_train_eval_loss": 0.012701856161083026,
261
+ "mean_val_eval_loss": 4.807015274961789,
262
+ "std_val_eval_loss": 0.03255901757339794,
263
+ "mean_generalization_gap": 0.8850520352522532,
264
+ "std_generalization_gap": 0.020904354460357738
265
+ },
266
+ {
267
+ "run_mode": "locked_stream",
268
+ "condition": "static_dropout_0",
269
+ "condition_kind": "static",
270
+ "stage": 1,
271
+ "token_limit": 500000,
272
+ "model_name": "L12_H8_D320",
273
+ "n_layer": 12,
274
+ "n_head": 8,
275
+ "n_embd": 320,
276
+ "parameters": 17367040,
277
+ "dropout_initial": 0.0,
278
+ "dropout_final": 0.0,
279
+ "dropout_schedule": "constant",
280
+ "n": 3,
281
+ "mean_train_eval_loss": 3.157259744902452,
282
+ "std_train_eval_loss": 0.033266149836286125,
283
+ "mean_val_eval_loss": 5.17509716997544,
284
+ "std_val_eval_loss": 0.027225613071693588,
285
+ "mean_generalization_gap": 2.017837425072988,
286
+ "std_generalization_gap": 0.027271551103611697
287
+ },
288
+ {
289
+ "run_mode": "locked_stream",
290
+ "condition": "static_dropout_0.02",
291
+ "condition_kind": "static",
292
+ "stage": 1,
293
+ "token_limit": 500000,
294
+ "model_name": "L12_H8_D320",
295
+ "n_layer": 12,
296
+ "n_head": 8,
297
+ "n_embd": 320,
298
+ "parameters": 17367040,
299
+ "dropout_initial": 0.02,
300
+ "dropout_final": 0.02,
301
+ "dropout_schedule": "constant",
302
+ "n": 3,
303
+ "mean_train_eval_loss": 3.2696756149331727,
304
+ "std_train_eval_loss": 0.010229343900978308,
305
+ "mean_val_eval_loss": 5.048807606101036,
306
+ "std_val_eval_loss": 0.015919759088441253,
307
+ "mean_generalization_gap": 1.7791319911678631,
308
+ "std_generalization_gap": 0.015163975400676688
309
+ },
310
+ {
311
+ "run_mode": "locked_stream",
312
+ "condition": "static_dropout_0.06",
313
+ "condition_kind": "static",
314
+ "stage": 1,
315
+ "token_limit": 500000,
316
+ "model_name": "L12_H8_D320",
317
+ "n_layer": 12,
318
+ "n_head": 8,
319
+ "n_embd": 320,
320
+ "parameters": 17367040,
321
+ "dropout_initial": 0.06,
322
+ "dropout_final": 0.06,
323
+ "dropout_schedule": "constant",
324
+ "n": 3,
325
+ "mean_train_eval_loss": 3.4347175483902297,
326
+ "std_train_eval_loss": 0.01608227845233568,
327
+ "mean_val_eval_loss": 4.895065948367119,
328
+ "std_val_eval_loss": 0.025796230814273288,
329
+ "mean_generalization_gap": 1.4603483999768894,
330
+ "std_generalization_gap": 0.016219356742392815
331
+ },
332
+ {
333
+ "run_mode": "locked_stream",
334
+ "condition": "static_dropout_0.08",
335
+ "condition_kind": "static",
336
+ "stage": 1,
337
+ "token_limit": 500000,
338
+ "model_name": "L12_H8_D320",
339
+ "n_layer": 12,
340
+ "n_head": 8,
341
+ "n_embd": 320,
342
+ "parameters": 17367040,
343
+ "dropout_initial": 0.08,
344
+ "dropout_final": 0.08,
345
+ "dropout_schedule": "constant",
346
+ "n": 3,
347
+ "mean_train_eval_loss": 3.49625497063001,
348
+ "std_train_eval_loss": 0.02296095075813903,
349
+ "mean_val_eval_loss": 4.8633880987763405,
350
+ "std_val_eval_loss": 0.03137654346429713,
351
+ "mean_generalization_gap": 1.3671331281463306,
352
+ "std_generalization_gap": 0.02205679071854492
353
+ },
354
+ {
355
+ "run_mode": "locked_stream",
356
+ "condition": "static_dropout_0.1",
357
+ "condition_kind": "static",
358
+ "stage": 1,
359
+ "token_limit": 500000,
360
+ "model_name": "L12_H8_D320",
361
+ "n_layer": 12,
362
+ "n_head": 8,
363
+ "n_embd": 320,
364
+ "parameters": 17367040,
365
+ "dropout_initial": 0.1,
366
+ "dropout_final": 0.1,
367
+ "dropout_schedule": "constant",
368
+ "n": 3,
369
+ "mean_train_eval_loss": 3.559451498091221,
370
+ "std_train_eval_loss": 0.03242743348226214,
371
+ "mean_val_eval_loss": 4.832371026277542,
372
+ "std_val_eval_loss": 0.03379083503310632,
373
+ "mean_generalization_gap": 1.2729195281863213,
374
+ "std_generalization_gap": 0.004581876102180898
375
+ },
376
+ {
377
+ "run_mode": "locked_stream",
378
+ "condition": "static_dropout_0.14",
379
+ "condition_kind": "static",
380
+ "stage": 1,
381
+ "token_limit": 500000,
382
+ "model_name": "L12_H8_D320",
383
+ "n_layer": 12,
384
+ "n_head": 8,
385
+ "n_embd": 320,
386
+ "parameters": 17367040,
387
+ "dropout_initial": 0.14,
388
+ "dropout_final": 0.14,
389
+ "dropout_schedule": "constant",
390
+ "n": 3,
391
+ "mean_train_eval_loss": 3.6746241599321365,
392
+ "std_train_eval_loss": 0.026787166240812375,
393
+ "mean_val_eval_loss": 4.804830571015676,
394
+ "std_val_eval_loss": 0.02283391501446146,
395
+ "mean_generalization_gap": 1.1302064110835393,
396
+ "std_generalization_gap": 0.004574911720412448
397
+ },
398
+ {
399
+ "run_mode": "locked_stream",
400
+ "condition": "static_dropout_0.18",
401
+ "condition_kind": "static",
402
+ "stage": 1,
403
+ "token_limit": 500000,
404
+ "model_name": "L12_H8_D320",
405
+ "n_layer": 12,
406
+ "n_head": 8,
407
+ "n_embd": 320,
408
+ "parameters": 17367040,
409
+ "dropout_initial": 0.18,
410
+ "dropout_final": 0.18,
411
+ "dropout_schedule": "constant",
412
+ "n": 3,
413
+ "mean_train_eval_loss": 3.761968764166037,
414
+ "std_train_eval_loss": 0.030310143678321872,
415
+ "mean_val_eval_loss": 4.79575655857722,
416
+ "std_val_eval_loss": 0.02839125459863536,
417
+ "mean_generalization_gap": 1.0337877944111824,
418
+ "std_generalization_gap": 0.02395919557232544
419
+ },
420
+ {
421
+ "run_mode": "locked_stream",
422
+ "condition": "static_dropout_0.2",
423
+ "condition_kind": "static",
424
+ "stage": 1,
425
+ "token_limit": 500000,
426
+ "model_name": "L12_H8_D320",
427
+ "n_layer": 12,
428
+ "n_head": 8,
429
+ "n_embd": 320,
430
+ "parameters": 17367040,
431
+ "dropout_initial": 0.2,
432
+ "dropout_final": 0.2,
433
+ "dropout_schedule": "constant",
434
+ "n": 3,
435
+ "mean_train_eval_loss": 3.7916674092411995,
436
+ "std_train_eval_loss": 0.02457212816143504,
437
+ "mean_val_eval_loss": 4.78752597173055,
438
+ "std_val_eval_loss": 0.03269980279988427,
439
+ "mean_generalization_gap": 0.9958585624893507,
440
+ "std_generalization_gap": 0.010890773437236328
441
+ },
442
+ {
443
+ "run_mode": "locked_stream",
444
+ "condition": "static_dropout_0.26",
445
+ "condition_kind": "static",
446
+ "stage": 1,
447
+ "token_limit": 500000,
448
+ "model_name": "L12_H8_D320",
449
+ "n_layer": 12,
450
+ "n_head": 8,
451
+ "n_embd": 320,
452
+ "parameters": 17367040,
453
+ "dropout_initial": 0.26,
454
+ "dropout_final": 0.26,
455
+ "dropout_schedule": "constant",
456
+ "n": 3,
457
+ "mean_train_eval_loss": 3.9032329618930817,
458
+ "std_train_eval_loss": 0.03313463368600539,
459
+ "mean_val_eval_loss": 4.810270811120669,
460
+ "std_val_eval_loss": 0.029030451515976875,
461
+ "mean_generalization_gap": 0.9070378492275873,
462
+ "std_generalization_gap": 0.0218852812037738
463
+ },
464
+ {
465
+ "run_mode": "locked_stream",
466
+ "condition": "static_dropout_0.3",
467
+ "condition_kind": "static",
468
+ "stage": 1,
469
+ "token_limit": 500000,
470
+ "model_name": "L12_H8_D320",
471
+ "n_layer": 12,
472
+ "n_head": 8,
473
+ "n_embd": 320,
474
+ "parameters": 17367040,
475
+ "dropout_initial": 0.3,
476
+ "dropout_final": 0.3,
477
+ "dropout_schedule": "constant",
478
+ "n": 3,
479
+ "mean_train_eval_loss": 3.9801244686047235,
480
+ "std_train_eval_loss": 0.011984368688744857,
481
+ "mean_val_eval_loss": 4.826624286671479,
482
+ "std_val_eval_loss": 0.034146701924452114,
483
+ "mean_generalization_gap": 0.8464998180667559,
484
+ "std_generalization_gap": 0.023039413394205236
485
+ },
486
+ {
487
+ "run_mode": "locked_stream",
488
+ "condition": "formula_l12_wikitext103",
489
+ "condition_kind": "anchor_decay",
490
+ "stage": 2,
491
+ "token_limit": 1000000,
492
+ "model_name": "L12_H8_D320",
493
+ "n_layer": 12,
494
+ "n_head": 8,
495
+ "n_embd": 320,
496
+ "parameters": 17367040,
497
+ "dropout_initial": 0.3,
498
+ "dropout_final": 0.02,
499
+ "dropout_schedule": "log_prefix_anchor",
500
+ "n": 3,
501
+ "mean_train_eval_loss": 3.836283288896084,
502
+ "std_train_eval_loss": 0.007165129109779279,
503
+ "mean_val_eval_loss": 4.4944643552104635,
504
+ "std_val_eval_loss": 0.01869570507899432,
505
+ "mean_generalization_gap": 0.6581810663143793,
506
+ "std_generalization_gap": 0.02324475775596518
507
+ },
508
+ {
509
+ "run_mode": "locked_stream",
510
+ "condition": "static_dropout_0",
511
+ "condition_kind": "static",
512
+ "stage": 2,
513
+ "token_limit": 1000000,
514
+ "model_name": "L12_H8_D320",
515
+ "n_layer": 12,
516
+ "n_head": 8,
517
+ "n_embd": 320,
518
+ "parameters": 17367040,
519
+ "dropout_initial": 0.0,
520
+ "dropout_final": 0.0,
521
+ "dropout_schedule": "constant",
522
+ "n": 3,
523
+ "mean_train_eval_loss": 3.363803490996361,
524
+ "std_train_eval_loss": 0.03468523458495426,
525
+ "mean_val_eval_loss": 4.759654012819131,
526
+ "std_val_eval_loss": 0.029874213126619024,
527
+ "mean_generalization_gap": 1.3958505218227704,
528
+ "std_generalization_gap": 0.061691590097926664
529
+ },
530
+ {
531
+ "run_mode": "locked_stream",
532
+ "condition": "static_dropout_0.02",
533
+ "condition_kind": "static",
534
+ "stage": 2,
535
+ "token_limit": 1000000,
536
+ "model_name": "L12_H8_D320",
537
+ "n_layer": 12,
538
+ "n_head": 8,
539
+ "n_embd": 320,
540
+ "parameters": 17367040,
541
+ "dropout_initial": 0.02,
542
+ "dropout_final": 0.02,
543
+ "dropout_schedule": "constant",
544
+ "n": 3,
545
+ "mean_train_eval_loss": 3.439823806285858,
546
+ "std_train_eval_loss": 0.005562641857737584,
547
+ "mean_val_eval_loss": 4.657156705856323,
548
+ "std_val_eval_loss": 0.020918879251822793,
549
+ "mean_generalization_gap": 1.217332899570465,
550
+ "std_generalization_gap": 0.015357844279503235
551
+ },
552
+ {
553
+ "run_mode": "locked_stream",
554
+ "condition": "static_dropout_0.06",
555
+ "condition_kind": "static",
556
+ "stage": 2,
557
+ "token_limit": 1000000,
558
+ "model_name": "L12_H8_D320",
559
+ "n_layer": 12,
560
+ "n_head": 8,
561
+ "n_embd": 320,
562
+ "parameters": 17367040,
563
+ "dropout_initial": 0.06,
564
+ "dropout_final": 0.06,
565
+ "dropout_schedule": "constant",
566
+ "n": 3,
567
+ "mean_train_eval_loss": 3.5540901521841683,
568
+ "std_train_eval_loss": 0.017924779970058147,
569
+ "mean_val_eval_loss": 4.557474325100581,
570
+ "std_val_eval_loss": 0.016603375950540906,
571
+ "mean_generalization_gap": 1.0033841729164124,
572
+ "std_generalization_gap": 0.03385453586409587
573
+ },
574
+ {
575
+ "run_mode": "locked_stream",
576
+ "condition": "static_dropout_0.08",
577
+ "condition_kind": "static",
578
+ "stage": 2,
579
+ "token_limit": 1000000,
580
+ "model_name": "L12_H8_D320",
581
+ "n_layer": 12,
582
+ "n_head": 8,
583
+ "n_embd": 320,
584
+ "parameters": 17367040,
585
+ "dropout_initial": 0.08,
586
+ "dropout_final": 0.08,
587
+ "dropout_schedule": "constant",
588
+ "n": 3,
589
+ "mean_train_eval_loss": 3.60255632797877,
590
+ "std_train_eval_loss": 0.00575740505926527,
591
+ "mean_val_eval_loss": 4.5323765849073725,
592
+ "std_val_eval_loss": 0.009481504316560474,
593
+ "mean_generalization_gap": 0.9298202569286028,
594
+ "std_generalization_gap": 0.012211546495961454
595
+ },
596
+ {
597
+ "run_mode": "locked_stream",
598
+ "condition": "static_dropout_0.1",
599
+ "condition_kind": "static",
600
+ "stage": 2,
601
+ "token_limit": 1000000,
602
+ "model_name": "L12_H8_D320",
603
+ "n_layer": 12,
604
+ "n_head": 8,
605
+ "n_embd": 320,
606
+ "parameters": 17367040,
607
+ "dropout_initial": 0.1,
608
+ "dropout_final": 0.1,
609
+ "dropout_schedule": "constant",
610
+ "n": 3,
611
+ "mean_train_eval_loss": 3.6542825972040496,
612
+ "std_train_eval_loss": 0.030316068353894877,
613
+ "mean_val_eval_loss": 4.516653408606847,
614
+ "std_val_eval_loss": 0.012694329764198367,
615
+ "mean_generalization_gap": 0.8623708114027977,
616
+ "std_generalization_gap": 0.04203932410029853
617
+ },
618
+ {
619
+ "run_mode": "locked_stream",
620
+ "condition": "static_dropout_0.14",
621
+ "condition_kind": "static",
622
+ "stage": 2,
623
+ "token_limit": 1000000,
624
+ "model_name": "L12_H8_D320",
625
+ "n_layer": 12,
626
+ "n_head": 8,
627
+ "n_embd": 320,
628
+ "parameters": 17367040,
629
+ "dropout_initial": 0.14,
630
+ "dropout_final": 0.14,
631
+ "dropout_schedule": "constant",
632
+ "n": 3,
633
+ "mean_train_eval_loss": 3.7183619191249213,
634
+ "std_train_eval_loss": 0.02332517298978488,
635
+ "mean_val_eval_loss": 4.5008573432763415,
636
+ "std_val_eval_loss": 0.018900687447980586,
637
+ "mean_generalization_gap": 0.7824954241514206,
638
+ "std_generalization_gap": 0.04221659685113837
639
+ },
640
+ {
641
+ "run_mode": "locked_stream",
642
+ "condition": "static_dropout_0.18",
643
+ "condition_kind": "static",
644
+ "stage": 2,
645
+ "token_limit": 1000000,
646
+ "model_name": "L12_H8_D320",
647
+ "n_layer": 12,
648
+ "n_head": 8,
649
+ "n_embd": 320,
650
+ "parameters": 17367040,
651
+ "dropout_initial": 0.18,
652
+ "dropout_final": 0.18,
653
+ "dropout_schedule": "constant",
654
+ "n": 3,
655
+ "mean_train_eval_loss": 3.786718524992466,
656
+ "std_train_eval_loss": 0.012788547677963957,
657
+ "mean_val_eval_loss": 4.50268988062938,
658
+ "std_val_eval_loss": 0.019498788384197212,
659
+ "mean_generalization_gap": 0.7159713556369146,
660
+ "std_generalization_gap": 0.02737366667147904
661
+ },
662
+ {
663
+ "run_mode": "locked_stream",
664
+ "condition": "static_dropout_0.2",
665
+ "condition_kind": "static",
666
+ "stage": 2,
667
+ "token_limit": 1000000,
668
+ "model_name": "L12_H8_D320",
669
+ "n_layer": 12,
670
+ "n_head": 8,
671
+ "n_embd": 320,
672
+ "parameters": 17367040,
673
+ "dropout_initial": 0.2,
674
+ "dropout_final": 0.2,
675
+ "dropout_schedule": "constant",
676
+ "n": 3,
677
+ "mean_train_eval_loss": 3.811583066980044,
678
+ "std_train_eval_loss": 0.016414324708578124,
679
+ "mean_val_eval_loss": 4.501583576202393,
680
+ "std_val_eval_loss": 0.018885880819448517,
681
+ "mean_generalization_gap": 0.6900005092223486,
682
+ "std_generalization_gap": 0.02848926010067624
683
+ },
684
+ {
685
+ "run_mode": "locked_stream",
686
+ "condition": "static_dropout_0.26",
687
+ "condition_kind": "static",
688
+ "stage": 2,
689
+ "token_limit": 1000000,
690
+ "model_name": "L12_H8_D320",
691
+ "n_layer": 12,
692
+ "n_head": 8,
693
+ "n_embd": 320,
694
+ "parameters": 17367040,
695
+ "dropout_initial": 0.26,
696
+ "dropout_final": 0.26,
697
+ "dropout_schedule": "constant",
698
+ "n": 3,
699
+ "mean_train_eval_loss": 3.909716250995795,
700
+ "std_train_eval_loss": 0.029922261968460786,
701
+ "mean_val_eval_loss": 4.527257425089677,
702
+ "std_val_eval_loss": 0.012874091579002427,
703
+ "mean_generalization_gap": 0.6175411740938822,
704
+ "std_generalization_gap": 0.038926303819109775
705
+ },
706
+ {
707
+ "run_mode": "locked_stream",
708
+ "condition": "static_dropout_0.3",
709
+ "condition_kind": "static",
710
+ "stage": 2,
711
+ "token_limit": 1000000,
712
+ "model_name": "L12_H8_D320",
713
+ "n_layer": 12,
714
+ "n_head": 8,
715
+ "n_embd": 320,
716
+ "parameters": 17367040,
717
+ "dropout_initial": 0.3,
718
+ "dropout_final": 0.3,
719
+ "dropout_schedule": "constant",
720
+ "n": 3,
721
+ "mean_train_eval_loss": 3.9773165384928384,
722
+ "std_train_eval_loss": 0.009735095632867338,
723
+ "mean_val_eval_loss": 4.546837595601876,
724
+ "std_val_eval_loss": 0.01674020951247983,
725
+ "mean_generalization_gap": 0.569521057109038,
726
+ "std_generalization_gap": 0.02091747765619741
727
+ },
728
+ {
729
+ "run_mode": "locked_stream",
730
+ "condition": "formula_l12_wikitext103",
731
+ "condition_kind": "anchor_decay",
732
+ "stage": 3,
733
+ "token_limit": 2000000,
734
+ "model_name": "L12_H8_D320",
735
+ "n_layer": 12,
736
+ "n_head": 8,
737
+ "n_embd": 320,
738
+ "parameters": 17367040,
739
+ "dropout_initial": 0.3,
740
+ "dropout_final": 0.02,
741
+ "dropout_schedule": "log_prefix_anchor",
742
+ "n": 3,
743
+ "mean_train_eval_loss": 3.8132816379268966,
744
+ "std_train_eval_loss": 0.021180137190910886,
745
+ "mean_val_eval_loss": 4.265036730716626,
746
+ "std_val_eval_loss": 0.020952651949156013,
747
+ "mean_generalization_gap": 0.4517550927897294,
748
+ "std_generalization_gap": 0.026508733332518825
749
+ },
750
+ {
751
+ "run_mode": "locked_stream",
752
+ "condition": "static_dropout_0",
753
+ "condition_kind": "static",
754
+ "stage": 3,
755
+ "token_limit": 2000000,
756
+ "model_name": "L12_H8_D320",
757
+ "n_layer": 12,
758
+ "n_head": 8,
759
+ "n_embd": 320,
760
+ "parameters": 17367040,
761
+ "dropout_initial": 0.0,
762
+ "dropout_final": 0.0,
763
+ "dropout_schedule": "constant",
764
+ "n": 3,
765
+ "mean_train_eval_loss": 3.6350866481661797,
766
+ "std_train_eval_loss": 0.016285186896951624,
767
+ "mean_val_eval_loss": 4.413571459551652,
768
+ "std_val_eval_loss": 0.02523774966453558,
769
+ "mean_generalization_gap": 0.7784848113854727,
770
+ "std_generalization_gap": 0.03399640276399005
771
+ },
772
+ {
773
+ "run_mode": "locked_stream",
774
+ "condition": "static_dropout_0.02",
775
+ "condition_kind": "static",
776
+ "stage": 3,
777
+ "token_limit": 2000000,
778
+ "model_name": "L12_H8_D320",
779
+ "n_layer": 12,
780
+ "n_head": 8,
781
+ "n_embd": 320,
782
+ "parameters": 17367040,
783
+ "dropout_initial": 0.02,
784
+ "dropout_final": 0.02,
785
+ "dropout_schedule": "constant",
786
+ "n": 3,
787
+ "mean_train_eval_loss": 3.6686666384339333,
788
+ "std_train_eval_loss": 0.015260894096187612,
789
+ "mean_val_eval_loss": 4.368014050026734,
790
+ "std_val_eval_loss": 0.027178290223553957,
791
+ "mean_generalization_gap": 0.6993474115928014,
792
+ "std_generalization_gap": 0.02886603382685387
793
+ },
794
+ {
795
+ "run_mode": "locked_stream",
796
+ "condition": "static_dropout_0.06",
797
+ "condition_kind": "static",
798
+ "stage": 3,
799
+ "token_limit": 2000000,
800
+ "model_name": "L12_H8_D320",
801
+ "n_layer": 12,
802
+ "n_head": 8,
803
+ "n_embd": 320,
804
+ "parameters": 17367040,
805
+ "dropout_initial": 0.06,
806
+ "dropout_final": 0.06,
807
+ "dropout_schedule": "constant",
808
+ "n": 3,
809
+ "mean_train_eval_loss": 3.7288275261720023,
810
+ "std_train_eval_loss": 0.02214638532765685,
811
+ "mean_val_eval_loss": 4.304687723517418,
812
+ "std_val_eval_loss": 0.025605319026104526,
813
+ "mean_generalization_gap": 0.5758601973454157,
814
+ "std_generalization_gap": 0.03169069142172165
815
+ },
816
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.08",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.08,
+ "dropout_final": 0.08,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.751750993231932,
+ "std_train_eval_loss": 0.027325045963601875,
+ "mean_val_eval_loss": 4.292027606318395,
+ "std_val_eval_loss": 0.022193268917965966,
+ "mean_generalization_gap": 0.540276613086462,
+ "std_generalization_gap": 0.048123905643236306
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.1",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.1,
+ "dropout_final": 0.1,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.785589481393496,
+ "std_train_eval_loss": 0.023254215859456147,
+ "mean_val_eval_loss": 4.290843218564987,
+ "std_val_eval_loss": 0.024816447573560912,
+ "mean_generalization_gap": 0.505253737171491,
+ "std_generalization_gap": 0.03955542155547533
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.14",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.14,
+ "dropout_final": 0.14,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.8283161322275796,
+ "std_train_eval_loss": 0.006946470327794601,
+ "mean_val_eval_loss": 4.285862573732932,
+ "std_val_eval_loss": 0.029060804176488816,
+ "mean_generalization_gap": 0.4575464415053527,
+ "std_generalization_gap": 0.02311386935886081
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.18",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.18,
+ "dropout_final": 0.18,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.875220308701197,
+ "std_train_eval_loss": 0.011227992287915711,
+ "mean_val_eval_loss": 4.2889708404739695,
+ "std_val_eval_loss": 0.027610989551016634,
+ "mean_generalization_gap": 0.4137505317727725,
+ "std_generalization_gap": 0.019195364852840884
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.2",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.2,
+ "dropout_final": 0.2,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.8982935870687165,
+ "std_train_eval_loss": 0.011753289842086798,
+ "mean_val_eval_loss": 4.297594143698613,
+ "std_val_eval_loss": 0.0215564773236604,
+ "mean_generalization_gap": 0.39930055662989616,
+ "std_generalization_gap": 0.015284717362206547
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.26",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.26,
+ "dropout_final": 0.26,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.9674767553806305,
+ "std_train_eval_loss": 0.024908308844425655,
+ "mean_val_eval_loss": 4.327832326292992,
+ "std_val_eval_loss": 0.024543452908526633,
+ "mean_generalization_gap": 0.36035557091236115,
+ "std_generalization_gap": 0.02569548028047761
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.3",
+ "condition_kind": "static",
+ "stage": 3,
+ "token_limit": 2000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.3,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 4.012147528429826,
+ "std_train_eval_loss": 0.02012763898563558,
+ "mean_val_eval_loss": 4.349700070917606,
+ "std_val_eval_loss": 0.019107888304911173,
+ "mean_generalization_gap": 0.33755254248778027,
+ "std_generalization_gap": 0.025128923759912217
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "formula_l12_wikitext103",
+ "condition_kind": "anchor_decay",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.02,
+ "dropout_schedule": "log_prefix_anchor",
+ "n": 3,
+ "mean_train_eval_loss": 3.812962586681048,
+ "std_train_eval_loss": 0.025352240743704624,
+ "mean_val_eval_loss": 4.083636499941349,
+ "std_val_eval_loss": 0.025765009042835362,
+ "mean_generalization_gap": 0.27067391326030094,
+ "std_generalization_gap": 0.05009449170635578
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.0,
+ "dropout_final": 0.0,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.7873488490780196,
+ "std_train_eval_loss": 0.026205633809150763,
+ "mean_val_eval_loss": 4.179581480721633,
+ "std_val_eval_loss": 0.021736412927032305,
+ "mean_generalization_gap": 0.39223263164361316,
+ "std_generalization_gap": 0.03552201232307707
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.02",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.02,
+ "dropout_final": 0.02,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.796787646909555,
+ "std_train_eval_loss": 0.024529861961376253,
+ "mean_val_eval_loss": 4.152008826533954,
+ "std_val_eval_loss": 0.019904634278940705,
+ "mean_generalization_gap": 0.35522117962439853,
+ "std_generalization_gap": 0.04352816945908601
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.06",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.06,
+ "dropout_final": 0.06,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.817745270828406,
+ "std_train_eval_loss": 0.022916843434376433,
+ "mean_val_eval_loss": 4.118092642476161,
+ "std_val_eval_loss": 0.010664122574525757,
+ "mean_generalization_gap": 0.3003473716477553,
+ "std_generalization_gap": 0.03183478268164145
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.08",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.08,
+ "dropout_final": 0.08,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.833147312204043,
+ "std_train_eval_loss": 0.02624431695895608,
+ "mean_val_eval_loss": 4.110093618432681,
+ "std_val_eval_loss": 0.024734680784555277,
+ "mean_generalization_gap": 0.2769463062286377,
+ "std_generalization_gap": 0.04976013223808796
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.1",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.1,
+ "dropout_final": 0.1,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.85314512749513,
+ "std_train_eval_loss": 0.017044693431674442,
+ "mean_val_eval_loss": 4.108063347637653,
+ "std_val_eval_loss": 0.025798124596532718,
+ "mean_generalization_gap": 0.25491822014252347,
+ "std_generalization_gap": 0.03599137702525591
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.14",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.14,
+ "dropout_final": 0.14,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.8825049375494323,
+ "std_train_eval_loss": 0.020653196329518318,
+ "mean_val_eval_loss": 4.126605914284785,
+ "std_val_eval_loss": 0.019834322361978202,
+ "mean_generalization_gap": 0.24410097673535347,
+ "std_generalization_gap": 0.03709848360553261
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.18",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.18,
+ "dropout_final": 0.18,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.9089941332737603,
+ "std_train_eval_loss": 0.02100733745422538,
+ "mean_val_eval_loss": 4.128837997714679,
+ "std_val_eval_loss": 0.018137202414878383,
+ "mean_generalization_gap": 0.21984386444091797,
+ "std_generalization_gap": 0.03802532083941853
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.2",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.2,
+ "dropout_final": 0.2,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.925126865506172,
+ "std_train_eval_loss": 0.0200711438797493,
+ "mean_val_eval_loss": 4.138758639494578,
+ "std_val_eval_loss": 0.02359678741043119,
+ "mean_generalization_gap": 0.21363177398840585,
+ "std_generalization_gap": 0.04159847214281183
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.26",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.26,
+ "dropout_final": 0.26,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 3.98883589108785,
+ "std_train_eval_loss": 0.021597891789016974,
+ "mean_val_eval_loss": 4.17716755097111,
+ "std_val_eval_loss": 0.020328951540567387,
+ "mean_generalization_gap": 0.18833165988326073,
+ "std_generalization_gap": 0.03975692522805749
+ },
+ {
+ "run_mode": "locked_stream",
+ "condition": "static_dropout_0.3",
+ "condition_kind": "static",
+ "stage": 4,
+ "token_limit": 4000000,
+ "model_name": "L12_H8_D320",
+ "n_layer": 12,
+ "n_head": 8,
+ "n_embd": 320,
+ "parameters": 17367040,
+ "dropout_initial": 0.3,
+ "dropout_final": 0.3,
+ "dropout_schedule": "constant",
+ "n": 3,
+ "mean_train_eval_loss": 4.025067411363125,
+ "std_train_eval_loss": 0.021409684612983128,
+ "mean_val_eval_loss": 4.196084383875132,
+ "std_val_eval_loss": 0.017949538707409805,
+ "mean_generalization_gap": 0.17101697251200676,
+ "std_generalization_gap": 0.03846054936653385
+ }
+ ]
runs/corpus_holdout_wikitext103_l12/locked_stream/20260528-183213/trace.jsonl ADDED
@@ -0,0 +1,330 @@
+ {"condition": "formula_l12_wikitext103", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.10919189453125}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.3267059326171875}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.150979042053223}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.038363456726074}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.993682861328125}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.025256633758545}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.09, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.174747943878174}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.09, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.870220184326172}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.9933338165283203}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.751016855239868}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.02793025970459}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.466760635375977}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.282154083251953}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.091670036315918}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.201129913330078}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.089809894561768}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.09, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.933499813079834}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.09, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.8695573806762695}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.0093183517456055}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.987300395965576}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.0945048332214355}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.291036605834961}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.300796985626221}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.205198287963867}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.238568305969238}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.775209426879883}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.09, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.786513328552246}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.09, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.115474700927734}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.8349270820617676}
+ {"condition": "formula_l12_wikitext103", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.794011116027832}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.586461067199707}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.4980170726776123}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.706185817718506}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.010756015777588}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.846466302871704}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.2843170166015625}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.748643398284912}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.6820337772369385}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.5873708724975586}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.545558452606201}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.423453330993652}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.617307186126709}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.6751105785369873}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 2.825775146484375}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.784053325653076}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.3288044929504395}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.545459747314453}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.469001293182373}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.6900553703308105}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.053542137145996}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.557775974273682}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.530925750732422}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.377972364425659}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.2348737716674805}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.526865005493164}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.300647497177124}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.4643568992614746}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.544159412384033}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.6731104850769043}
+ {"condition": "static_dropout_0", "dropout": 0.0, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.626936435699463}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.596406936645508}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.5398497581481934}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.487887144088745}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.1747987270355225}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.5900020599365234}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.4777474403381348}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.7301249504089355}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.790534019470215}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.079158782958984}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.661900520324707}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.517829418182373}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.5424184799194336}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.4532995223999023}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.393620252609253}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.6632490158081055}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.5351672172546387}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.7610929012298584}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.4371466636657715}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.9270126819610596}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.961052656173706}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.554469585418701}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.6152591705322266}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.6096372604370117}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.1891748905181885}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.523820161819458}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.5947442054748535}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.066094398498535}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.6390342712402344}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.002377510070801}
+ {"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.785369873046875}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.710752964019775}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.9541187286376953}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.050755977630615}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.6150665283203125}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.892446756362915}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.7684245109558105}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.7994470596313477}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.783784866333008}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.948446750640869}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.7330846786499023}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.726563453674316}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.838132381439209}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.9320147037506104}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.3768904209136963}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9188880920410156}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.764596939086914}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.872814655303955}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.8399839401245117}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.9642767906188965}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.817599296569824}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.6344380378723145}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.6540794372558594}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.642949104309082}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.537709951400757}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9372754096984863}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.012999534606934}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.8797926902770996}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.944365978240967}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.346187114715576}
+ {"condition": "static_dropout_0.06", "dropout": 0.06, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.762101888656616}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.694372177124023}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.008996963500977}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.710146427154541}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.538743495941162}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9808402061462402}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.636242389678955}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.011699676513672}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.6634960174560547}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.111110687255859}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.8889031410217285}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.579659461975098}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.9213414192199707}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.2153167724609375}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.624877691268921}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9966182708740234}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.9975638389587402}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.217199802398682}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.8261024951934814}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.059680938720703}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.039473056793213}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.775473117828369}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.8681042194366455}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.858368158340454}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.6299901008605957}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.762925148010254}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.398933172225952}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.915220260620117}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.801541805267334}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.091124534606934}
+ {"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.713860034942627}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.502629280090332}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.026566028594971}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.839831829071045}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.62257719039917}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.067403793334961}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.841139793395996}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.967017889022827}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.688873291015625}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.087314605712891}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.86462140083313}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.7392988204956055}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.893677234649658}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.870542526245117}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.8797426223754883}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9220707416534424}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.793069362640381}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.9290614128112793}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.9442577362060547}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.7859325408935547}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.145358085632324}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.739326477050781}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.062891960144043}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.149168014526367}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.7364869117736816}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.7814455032348633}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.7253265380859375}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.9168343544006348}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.791567325592041}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.778740406036377}
+ {"condition": "static_dropout_0.1", "dropout": 0.1, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.776935577392578}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.7961297035217285}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.8898675441741943}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.075655937194824}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.756572723388672}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.029130935668945}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.773857831954956}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.1858930587768555}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.8487486839294434}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.9703285694122314}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.8854053020477295}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.818635940551758}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.108892917633057}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.21990966796875}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.81386137008667}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.809004545211792}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.7727158069610596}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.847060441970825}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.098654747009277}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.098649024963379}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.016416549682617}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.850597381591797}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.025087356567383}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.976827621459961}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.8091471195220947}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.069204807281494}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.657958507537842}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 3.9937562942504883}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.972700357437134}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.0872650146484375}
+ {"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.936558246612549}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.747689247131348}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.080568313598633}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.200558662414551}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.979367971420288}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.182689666748047}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.7715864181518555}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.135184288024902}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.08739709854126}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.0746169090271}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.0843729972839355}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.756750106811523}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.258420467376709}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.110930919647217}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.029372215270996}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.872436046600342}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.009469509124756}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.128143310546875}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.8418898582458496}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.1407575607299805}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.095191955566406}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.976916790008545}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.044127464294434}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.275222301483154}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.9059031009674072}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.260745525360107}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.906364917755127}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.124902248382568}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.0458292961120605}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.184656143188477}
+ {"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.018331050872803}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.769725799560547}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.0768561363220215}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.300281047821045}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.8637545108795166}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.150639533996582}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.981992244720459}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.188844203948975}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.9941892623901367}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.077834129333496}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.117517471313477}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.81527042388916}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.054525375366211}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.37611198425293}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.8969826698303223}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.113924980163574}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.9721953868865967}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.022497177124023}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.092113494873047}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.1637492179870605}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.9015634059906006}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.844550132751465}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.1090593338012695}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 3.9657397270202637}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.7497055530548096}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.205092430114746}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.052699089050293}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.3501362800598145}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.8875317573547363}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 3.977046251296997}
+ {"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 3.952976703643799}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.029112815856934}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.155546188354492}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.2847490310668945}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.293614387512207}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.426083087921143}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.8696696758270264}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.210818290710449}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.7679266929626465}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.164118766784668}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.19427490234375}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.084468841552734}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.303183555603027}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.257471084594727}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.9919092655181885}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.075634956359863}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.116327285766602}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.130469799041748}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.175629615783691}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.153221130371094}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.002399444580078}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.839096546173096}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.228353500366211}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.0907487869262695}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.135189056396484}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.315141677856445}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.052082061767578}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.100449562072754}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.0784711837768555}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.183749198913574}
+ {"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.022092819213867}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.10919189453125}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.3267059326171875}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.192840576171875}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.094118118286133}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.15631628036499}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.1938676834106445}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.386846542358398}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.115250587463379}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.304389476776123}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.113180160522461}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.02793025970459}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.466761112213135}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.32354736328125}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.149529933929443}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.344442367553711}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.2312116622924805}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.160443305969238}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.132419109344482}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.286291599273682}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.308834552764893}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.0945048332214355}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.291036605834961}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.351649761199951}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.264379024505615}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.378478527069092}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.95513916015625}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.055840492248535}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.364241600036621}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.2040815353393555}
+ {"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "L12_H8_D320", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.1234211921691895}