Mandeep Sidhu committed on
Commit 555cf14 · 1 Parent(s): baafabf
Add width-heavy architecture holdout results
- README.md +2 -1
- REPRODUCING.md +10 -3
- docs/dropout_decay_research_report_v2.md +747 -0
- runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/RESULT_SUMMARY.md +86 -0
- runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/config.json +131 -0
- runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/metrics.jsonl +120 -0
- runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/summary.csv +41 -0
- runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/summary.json +882 -0
- runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/trace.jsonl +240 -0
README.md
CHANGED
|
@@ -67,7 +67,8 @@ Every run writes:
|
|
| 67 |
|
| 68 |
Old exploratory outputs are archived under `archive/`.
|
| 69 |
|
| 70 |
-
For exact headline reproduction, see `REPRODUCING.md`.
|
|
|
|
| 71 |
|
| 72 |
## Step 1: Cheap Static Screen
|
| 73 |
|
|
|
|
| 67 |
|
| 68 |
Old exploratory outputs are archived under `archive/`.
|
| 69 |
|
| 70 |
+
For exact headline reproduction, see `REPRODUCING.md`. For the current
|
| 71 |
+
engineer-facing research summary, see `docs/dropout_decay_research_report_v2.md`.
|
| 72 |
|
| 73 |
## Step 1: Cheap Static Screen
|
| 74 |
|
REPRODUCING.md
CHANGED
|
@@ -177,9 +177,10 @@ formula final validation: 4.5286 +/- 0.0118
|
|
| 177 |
best static final validation: 4.5564 +/- 0.0127
|
| 178 |
```
|
| 179 |
|
| 180 |
-
##
|
| 181 |
|
| 182 |
-
The
|
|
|
|
| 183 |
|
| 184 |
```bash
|
| 185 |
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
|
|
@@ -206,7 +207,13 @@ PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
|
|
| 206 |
--grad-clip 1.0
|
| 207 |
```
|
| 208 |
|
| 209 |
-
| 210 |
|
| 211 |
## Notes for Publication
|
| 212 |
|
|
|
|
| 177 |
best static final validation: 4.5564 +/- 0.0127
|
| 178 |
```
|
| 179 |
|
| 180 |
+
## Reproduce Width-Heavy Holdout
|
| 181 |
|
| 182 |
+
The width-heavy architecture holdout is the paired complement to the deep/narrow
|
| 183 |
+
holdout above:
|
| 184 |
|
| 185 |
```bash
|
| 186 |
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
|
|
|
|
| 207 |
--grad-clip 1.0
|
| 208 |
```
|
| 209 |
|
| 210 |
+
Completed reference result:
|
| 211 |
+
|
| 212 |
+
```text
|
| 213 |
+
formula final validation: 4.4658 +/- 0.0065
|
| 214 |
+
best static final validation: 4.4946 +/- 0.0087
|
| 215 |
+
best mean trajectory: static 0.18, 4.9064 vs formula 4.9073
|
| 216 |
+
```
|
| 217 |
|
| 218 |
## Notes for Publication
|
| 219 |
|
docs/dropout_decay_research_report_v2.md
ADDED
|
@@ -0,0 +1,747 @@
| 1 |
+
# Dropout Decay in Expanding-Stream Language Model Training
|
| 2 |
+
|
| 3 |
+
Date: 2026-05-28
|
| 4 |
+
|
| 5 |
+
This version is written for an AI/ML engineer reading the project for the first
|
| 6 |
+
time. It keeps the strongest empirical claims from the original report, but
|
| 7 |
+
adds the missing context needed to understand what was actually trained, what
|
| 8 |
+
the streaming setup means, how dropout schedules were applied, and which claims
|
| 9 |
+
are safe.
|
| 10 |
+
|
| 11 |
+
## Executive Summary
|
| 12 |
+
|
| 13 |
+
This project studies whether dropout should be scheduled from measurable
|
| 14 |
+
streaming-data pressure rather than held fixed during causal language-model
|
| 15 |
+
training.
|
| 16 |
+
|
| 17 |
+
The setting is simulated expanding-stream training. For each seed and condition,
|
| 18 |
+
one model and optimizer are trained continuously across five stream prefixes:
|
| 19 |
+
`250k`, `500k`, `1M`, `2M`, and `4M` unique training tokens. At each stage,
|
| 20 |
+
random 128-token windows are sampled from the currently available prefix. Early
|
| 21 |
+
stages therefore revisit the same tokens many times, while later stages expose
|
| 22 |
+
the same model to more unique data.
|
| 23 |
+
|
| 24 |
+
The original broad hypothesis was too simple:
|
| 25 |
+
|
| 26 |
+
> Start with very high dropout on a small stream prefix, then decay dropout as
|
| 27 |
+
> more stream data arrives.
|
| 28 |
+
|
| 29 |
+
The experiments rejected that version. A high-dropout schedule such as
|
| 30 |
+
`0.8 -> 0.1` was worse than static low dropout. The supported claim is narrower:
|
| 31 |
+
|
| 32 |
+
> When the best static dropout changes with stream prefix size, a pressure-aware
|
| 33 |
+
> prefix schedule can beat any single fixed dropout at final validation loss.
|
| 34 |
+
|
| 35 |
+
The current empirical law uses two pressure terms:
|
| 36 |
+
|
| 37 |
+
```text
|
| 38 |
+
p = clamp(0.02, 0.65,
|
| 39 |
+
0.154 * log10(params / unique_tokens)
|
| 40 |
+
+ 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
|
| 41 |
+
- 0.210)
|
| 42 |
+
```
|
| 43 |
+
|
| 44 |
+
Across the completed headline validation runs, the formula schedule wins
|
| 45 |
+
`21/21` paired final-loss comparisons across five model sizes and two
|
| 46 |
+
architecture-shape holdouts. The evidence supports final-validation improvement
|
| 47 |
+
under this nanochat-style Transformer and expanding-prefix protocol. It does not
|
| 48 |
+
yet establish a universal dropout law across datasets, architectures, or
|
| 49 |
+
training scales, and the width-heavy holdout shows that the current formula can
|
| 50 |
+
overestimate the best early-prefix dropout for some architecture shapes.
|
| 51 |
+
|
| 52 |
+
## System Under Test
|
| 53 |
+
|
| 54 |
+
The implementation is derived from Andrej Karpathy's `nanochat` project and
|
| 55 |
+
keeps only the pieces needed for controlled dropout experiments:
|
| 56 |
+
|
| 57 |
+
- BPE-style tokenizer with a 4,096-token vocabulary.
|
| 58 |
+
- Nanochat-style causal Transformer.
|
| 59 |
+
- RMSNorm, rotary attention, bias-free linear layers, and squared-ReLU MLPs.
|
| 60 |
+
- Dropout control over embedding dropout, attention dropout, residual dropout,
|
| 61 |
+
and MLP dropout.
|
| 62 |
+
- MPS-only Torch execution.
|
| 63 |
+
- Expanding-prefix training loops for simulated streaming.
|
| 64 |
+
|
| 65 |
+
The original nanochat MIT copyright and permission notice are retained in the
|
| 66 |
+
derived source files and project license.
|
| 67 |
+
|
| 68 |
+
### Model Family
|
| 69 |
+
|
| 70 |
+
The main model family changes depth and width while keeping the same basic
|
| 71 |
+
Transformer design:
|
| 72 |
+
|
| 73 |
+
| Name | Shape | Params | Role |
|
| 74 |
+
|---|---:|---:|---|
|
| 75 |
+
| `L8_H8_D256` | `8x8x256` | 8.39M | Small boundary case |
|
| 76 |
+
| `L10_H8_D288` | `10x8x288` | 12.31M | Interpolation check |
|
| 77 |
+
| `L12_H8_D320` | `12x8x320` | 17.37M | Main mid-scale validation |
|
| 78 |
+
| `L14_H8_D352` | `14x8x352` | 23.70M | Interpolation check |
|
| 79 |
+
| `L16_H8_D384` | `16x8x384` | 31.46M | Larger validation model |
|
| 80 |
+
| `deep_narrow_L18_H8_D256` | `18x8x256` | 16.25M | Architecture-shape holdout |
|
| 81 |
+
| `wide_L8_H8_D384` | `8x8x384` | 17.30M | Width-heavy architecture holdout |
|
| 82 |
+
|
| 83 |
+
The shape notation is `layers x heads x embedding dimension`.
|
| 84 |
+
|
| 85 |
+
### Training Configuration
|
| 86 |
+
|
| 87 |
+
Unless otherwise stated, headline runs use:
|
| 88 |
+
|
| 89 |
+
| Field | Value |
|
| 90 |
+
|---|---:|
|
| 91 |
+
| Device | `mps` only |
|
| 92 |
+
| Vocab size | `4096` |
|
| 93 |
+
| Block size | `128` |
|
| 94 |
+
| Batch size | `16` |
|
| 95 |
+
| Tokens per optimizer step | `2048` |
|
| 96 |
+
| Optimizer | AdamW |
|
| 97 |
+
| Learning rate | `0.0003` |
|
| 98 |
+
| Adam betas | `(0.9, 0.95)` |
|
| 99 |
+
| Weight decay | `0.1` |
|
| 100 |
+
| Gradient clipping | `1.0` |
|
| 101 |
+
| Validation batches | `64` |
|
| 102 |
+
| Train-eval batches | `32` |
|
| 103 |
+
| Seeds | usually `1, 2, 3` |
|
| 104 |
+
|
| 105 |
+
The runner refuses CPU and CUDA experiment execution. It also exits if
|
| 106 |
+
`PYTORCH_ENABLE_MPS_FALLBACK=1` is set, because fallback could silently run some
|
| 107 |
+
Torch operations on CPU.
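The device policy above can be sketched as a small guard. This is a minimal illustration, not the runner's actual code: the function name `check_device_policy` and its messages are hypothetical, and only the `mps` requirement and the `PYTORCH_ENABLE_MPS_FALLBACK` check come from the text.

```python
import os


def check_device_policy(device, env=None):
    """Exit unless the run is MPS-only with CPU fallback disabled.

    Hypothetical helper: the real runner's checks are not shown in this report.
    """
    env = os.environ if env is None else env
    if device != "mps":
        # CPU and CUDA execution are refused for experiment runs.
        raise SystemExit(f"refusing to run on device {device!r}; only 'mps' is allowed")
    if env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1":
        # Fallback could silently move unsupported ops to CPU.
        raise SystemExit("unset PYTORCH_ENABLE_MPS_FALLBACK before running experiments")


if __name__ == "__main__":
    check_device_policy("mps")
    print("device policy ok")
```

Passing the environment as an argument keeps the guard easy to test without touching the real process environment.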
|
| 108 |
+
|
| 109 |
+
## Data and Reproducibility Context
|
| 110 |
+
|
| 111 |
+
The completed runs use a project-local cache:
|
| 112 |
+
|
| 113 |
+
```text
|
| 114 |
+
.cache/dropout_decay/tokenizer-v4096.json
|
| 115 |
+
.cache/dropout_decay/tokens-v4096-uint16.npy
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
The completed run configs record:
|
| 119 |
+
|
| 120 |
+
```text
|
| 121 |
+
train tokens: 5,000,970
|
| 122 |
+
validation tokens: 500,000
|
| 123 |
+
vocab size: 4,096
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
The configs also show that the cache was built from a local parquet source named
|
| 127 |
+
`base_data_climbmix/shard_*.parquet`. The binary token cache is intentionally
|
| 128 |
+
not committed. That makes exact reproduction dependent on publishing or
|
| 129 |
+
reconstructing the cache. This is the largest reproducibility gap in the current
|
| 130 |
+
artifact.
|
| 131 |
+
|
| 132 |
+
The runner supports two data paths:
|
| 133 |
+
|
| 134 |
+
- `--use-cached-data --cache-dir .cache/dropout_decay`
|
| 135 |
+
- `--corpus` or `--corpus-glob` to rebuild the cache from raw text or parquet
|
| 136 |
+
|
| 137 |
+
For exact commands and environment setup, see `REPRODUCING.md`.
|
| 138 |
+
|
| 139 |
+
## What "Expanding Stream" Means Here
|
| 140 |
+
|
| 141 |
+
This project does not stream examples from an online service. It simulates a
|
| 142 |
+
stream by revealing a larger prefix of a fixed token array at each stage.
|
| 143 |
+
|
| 144 |
+
For a locked-stream run:
|
| 145 |
+
|
| 146 |
+
1. Create one model and optimizer for a given seed and condition.
|
| 147 |
+
2. Train stage 0 by sampling batches from the first `250k` training tokens.
|
| 148 |
+
3. Continue the same model and optimizer into stage 1, now sampling from the
|
| 149 |
+
first `500k` tokens.
|
| 150 |
+
4. Repeat for `1M`, `2M`, and `4M` prefixes.
|
| 151 |
+
5. Evaluate train and validation loss at the end of every stage.
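The five steps above can be sketched as a loop. This is an illustrative outline under stated assumptions, not the project's training code: `sample_batch`, `run_locked_stream`, `train_step`, and `evaluate` are hypothetical names, and the input/target pairs use the usual one-token shift for causal language modeling.

```python
import random


def sample_batch(tokens, prefix_len, batch_size=16, block_size=128, rng=random):
    """Draw random windows that lie entirely inside the first prefix_len tokens."""
    max_start = prefix_len - block_size - 1  # leave room for the shifted target
    starts = [rng.randint(0, max_start) for _ in range(batch_size)]
    # Each example is (inputs, targets), with targets shifted by one token.
    return [(tokens[s:s + block_size], tokens[s + 1:s + block_size + 1]) for s in starts]


def run_locked_stream(tokens, prefixes, stage_steps, train_step, evaluate, rng=random):
    """Train one model across growing prefixes and evaluate after each stage.

    train_step and evaluate are caller-supplied callables that close over the
    same model and optimizer, so state carries over between stages.
    """
    history = []
    for prefix_len in prefixes:
        for _ in range(stage_steps):
            train_step(sample_batch(tokens, prefix_len, rng=rng))
        history.append((prefix_len, evaluate()))
    return history
```

The only thing that changes between stages is `prefix_len`; the model, optimizer, and sampling rule stay the same.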
|
| 152 |
+
|
| 153 |
+
This distinction matters because early prefixes are sampled repeatedly. With
|
| 154 |
+
`1000` steps per stage, the model consumes:
|
| 155 |
+
|
| 156 |
+
```text
|
| 157 |
+
1000 steps * 16 examples/step * 128 tokens/example = 2,048,000 sampled tokens
|
| 158 |
+
```
|
| 159 |
+
|
| 160 |
+
At the `250k` prefix, that is about `8.19x` the available unique-token count
|
| 161 |
+
within the first stage alone. At the `4M` prefix, the repeated-sampling pressure
|
| 162 |
+
is much lower. This pressure change is the reason a fixed dropout value can be
|
| 163 |
+
suboptimal across the full stream trajectory.
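The arithmetic above extends to every prefix. The sketch below uses the headline settings from this report (1000 steps per stage, batch size 16, block size 128); the function name `sampling_pressure` is only illustrative.

```python
PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def sampling_pressure(prefixes=PREFIXES, stage_steps=1000, batch_size=16, block_size=128):
    """Return per-stage and cumulative sampled-to-unique token ratios."""
    tokens_per_stage = stage_steps * batch_size * block_size
    rows = []
    for stage, unique in enumerate(prefixes, start=1):
        cumulative = stage * tokens_per_stage  # planned total after this stage
        rows.append({
            "prefix": unique,
            "stage_ratio": tokens_per_stage / unique,
            "cumulative_ratio": cumulative / unique,
        })
    return rows


if __name__ == "__main__":
    for row in sampling_pressure():
        print(f"{row['prefix']:>9,}  stage {row['stage_ratio']:.3f}x  "
              f"cumulative {row['cumulative_ratio']:.3f}x")
```

The per-stage ratio falls from about 8.19x at the `250k` prefix to about 0.51x at the `4M` prefix, which is the pressure change the paragraph describes.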
|
| 164 |
+
|
| 165 |
+
## Dropout Schedule Semantics
|
| 166 |
+
|
| 167 |
+
There are two schedule types in the codebase:
|
| 168 |
+
|
| 169 |
+
- Static schedules keep one dropout value for all stages.
|
| 170 |
+
- Anchor schedules choose dropout from the current stream prefix.
|
| 171 |
+
|
| 172 |
+
The headline formula runs use anchor schedules. In those runs, the stage
|
| 173 |
+
prefixes are exactly the anchor points, so dropout is constant within each
|
| 174 |
+
stage:
|
| 175 |
+
|
| 176 |
+
```text
|
| 177 |
+
250k stage -> p_250k
|
| 178 |
+
500k stage -> p_500k
|
| 179 |
+
1M stage -> p_1M
|
| 180 |
+
2M stage -> p_2M
|
| 181 |
+
4M stage -> p_4M
|
| 182 |
+
```
|
| 183 |
+
|
| 184 |
+
The implementation supports log interpolation between anchors for intermediate
|
| 185 |
+
prefix sizes, but the main reported experiments evaluate exactly at the anchor
|
| 186 |
+
prefixes. Therefore, these experiments validate a prefix-aware dropout path, not
|
| 187 |
+
a continuously changing per-step dropout curve. The empirical formula is used to
|
| 188 |
+
precompute these anchor values, which are then passed to the runner as
|
| 189 |
+
`--anchor-decays`.
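A minimal sketch of an anchor lookup with log interpolation follows. The report does not show the runner's exact interpolation rule, so this assumes dropout is linear in the logarithm of prefix size between neighbouring anchors and held constant outside the anchor range; `anchor_dropout` is a hypothetical name.

```python
import math


def anchor_dropout(prefix_tokens, anchors):
    """Look up dropout for a prefix size from (prefix_tokens, dropout) anchors.

    Assumption: linear interpolation in log(prefix size), clamped at both ends.
    """
    anchors = sorted(anchors)
    if prefix_tokens <= anchors[0][0]:
        return anchors[0][1]
    if prefix_tokens >= anchors[-1][0]:
        return anchors[-1][1]
    for (n0, p0), (n1, p1) in zip(anchors, anchors[1:]):
        if n0 <= prefix_tokens <= n1:
            t = (math.log(prefix_tokens) - math.log(n0)) / (math.log(n1) - math.log(n0))
            return p0 + t * (p1 - p0)


# Anchor values for the L8 formula path reported in this document.
L8_ANCHORS = [(250_000, 0.252), (500_000, 0.206), (1_000_000, 0.129),
              (2_000_000, 0.038), (4_000_000, 0.020)]

if __name__ == "__main__":
    for n in (250_000, 350_000, 500_000, 3_000_000):
        print(n, round(anchor_dropout(n, L8_ANCHORS), 3))
```

Because the headline runs train only at the anchor prefixes, the interpolation branch is never exercised there; every stage reads its anchor value directly.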
|
| 190 |
+
|
| 191 |
+
## Metrics
|
| 192 |
+
|
| 193 |
+
The report uses four recurring metrics:
|
| 194 |
+
|
| 195 |
+
| Metric | Meaning |
|
| 196 |
+
|---|---|
|
| 197 |
+
| Final validation loss | Validation loss after the final `4M` prefix stage |
|
| 198 |
+
| Mean trajectory validation loss | Average validation loss across all prefix stages |
|
| 199 |
+
| Final train-validation gap | Final validation loss minus final train-eval loss |
|
| 200 |
+
| Paired final delta | Formula final validation loss minus the best static final validation loss for the same seed |
|
| 201 |
+
|
| 202 |
+
Lower validation loss is better. The gap is diagnostic, not the optimization
|
| 203 |
+
objective. A smaller gap can mean useful regularization, but it can also mean
|
| 204 |
+
underfitting if both train and validation losses are high.
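The paired final delta can be computed as sketched below. This assumes the best static control is the one with the lowest mean final validation loss across seeds and that the formula run is then compared to it seed by seed; the function name and the dictionary layout are illustrative, not the project's summary format.

```python
from statistics import mean


def paired_final_deltas(formula_final, static_final):
    """Return the best static control and per-seed deltas against it.

    formula_final: {seed: final validation loss}
    static_final:  {control name: {seed: final validation loss}}
    A negative delta means the formula schedule won for that seed.
    """
    best_static = min(static_final, key=lambda name: mean(static_final[name].values()))
    deltas = {
        seed: formula_final[seed] - static_final[best_static][seed]
        for seed in sorted(formula_final)
    }
    return best_static, deltas
```

Pairing by seed removes seed-to-seed variance that the two conditions share, which is why the report counts wins per seed rather than comparing means alone.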
|
| 205 |
+
|
| 206 |
+
## Initial Hypothesis and Correction
|
| 207 |
+
|
| 208 |
+
The first broad hypothesis was that very high initial dropout would protect the
|
| 209 |
+
model from overfitting small stream prefixes and could be decayed as more data
|
| 210 |
+
arrived.
|
| 211 |
+
|
| 212 |
+
Early 8.39M-parameter streaming runs rejected this version:
|
| 213 |
+
|
| 214 |
+
| Condition | 5M | 10M | 20M | 40M |
|
| 215 |
+
|---|---:|---:|---:|---:|
|
| 216 |
+
| High-dropout decay streaming | `6.9213` | `6.2689` | `5.4262` | `4.9090` |
|
| 217 |
+
| Static `0.1` dropout streaming | `5.6310` | `5.1018` | `4.8497` | `4.6743` |
|
| 218 |
+
| Static `0.8` dropout streaming | `6.9898` | `6.7637` | `6.4835` | `6.2390` |
|
| 219 |
+
|
| 220 |
+
The improvement over time was mostly the effect of seeing more stream data, not
|
| 221 |
+
evidence that high-dropout decay was a good schedule. This forced a more
|
| 222 |
+
careful experimental design:
|
| 223 |
+
|
| 224 |
+
1. First find regimes where static dropout has a real nonzero optimum.
|
| 225 |
+
2. Observe how that optimum moves as the stream prefix grows.
|
| 226 |
+
3. Test schedules that track the moving optimum instead of using arbitrary high
|
| 227 |
+
dropout.
|
| 228 |
+
|
| 229 |
+
## Static Dropout Screen
|
| 230 |
+
|
| 231 |
+
The key discovery from the static screen is that the best dropout depends on
|
| 232 |
+
both model scale and stream prefix size. Larger models and smaller prefixes need
|
| 233 |
+
more dropout. As the stream prefix grows, the best dropout moves downward.
|
| 234 |
+
|
| 235 |
+
| Model | Params | Prefix | Best static dropout | Validation loss | Zero-dropout penalty |
|
| 236 |
+
|---|---:|---:|---:|---:|---:|
|
| 237 |
+
| L16 | 31.46M | 2M | `0.14` | `4.4270` | `+0.1982` |
|
| 238 |
+
| L12 | 17.37M | 2M | `0.14` | `4.5088` | `+0.0866` |
|
| 239 |
+
| L8 | 8.39M | 2M | `0.08` | `4.6232` | `+0.0266` |
|
| 240 |
+
| L8 | 8.39M | 4M | `0.00` | `4.5136` | near zero |
|
| 241 |
+
|
| 242 |
+
This screen is not itself the final evidence, because much of it is
|
| 243 |
+
single-seed. Its role is to reveal the empirical shape: the static optimum is not a
|
| 244 |
+
constant. The locked-stream experiments then test whether tracking that shape
|
| 245 |
+
beats a single fixed dropout baseline.
|
| 246 |
+
|
| 247 |
+
## Empirical Formula
|
| 248 |
+
|
| 249 |
+
The current formula is:
|
| 250 |
+
|
| 251 |
+
```text
|
| 252 |
+
p = clamp(0.02, 0.65,
|
| 253 |
+
0.154 * log10(params / unique_tokens)
|
| 254 |
+
+ 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
|
| 255 |
+
- 0.210)
|
| 256 |
+
```
|
| 257 |
+
|
| 258 |
+
Terms:
|
| 259 |
+
|
| 260 |
+
- `params / unique_tokens` is a capacity-pressure proxy. Larger models on
|
| 261 |
+
smaller stream prefixes are more likely to memorize.
|
| 262 |
+
- `cumulative_sampled_tokens / unique_tokens` is an update-pressure proxy. More
|
| 263 |
+
repeated sampling from the same prefix increases overfitting pressure.
|
| 264 |
+
- `0.02` is an empirical floor. It avoids assuming exact zero dropout is always
|
| 265 |
+
optimal.
|
| 266 |
+
- `0.65` is a guardrail. The successful headline schedules are far below it.
|
| 267 |
+
|
| 268 |
+
The coefficients are empirical, not theoretical constants. The formula should
|
| 269 |
+
be read as a compact fitted schedule family for this protocol, not as a general
|
| 270 |
+
law of dropout.
|
| 271 |
+
|
| 272 |
+
For the standard `1000`-step protocol, the formula produces these paths:
|
| 273 |
+
|
| 274 |
+
```text
|
| 275 |
+
prefix tokens: 250k 500k 1M 2M 4M
|
| 276 |
+
cumulative sampled tokens: 2.048M 4.096M 6.144M 8.192M 10.240M
|
| 277 |
+
```
|
| 278 |
+
|
| 279 |
+
The cumulative sampled-token values are the planned totals after each stage.
|
| 280 |
+
They are used to compute the stage anchor dropouts below.
|
| 281 |
+
|
| 282 |
+
| Model | Params | Formula path |
|
| 283 |
+
|---|---:|---|
|
| 284 |
+
| L8 | 8.39M | `0.252 -> 0.206 -> 0.129 -> 0.038 -> 0.020` |
|
| 285 |
+
| L10 | 12.31M | `0.278 -> 0.232 -> 0.154 -> 0.064 -> 0.020` |
|
| 286 |
+
| L12 | 17.37M | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` |
|
| 287 |
+
| L14 | 23.70M | `0.322 -> 0.276 -> 0.198 -> 0.108 -> 0.020` |
|
| 288 |
+
| L16 | 31.46M | `0.341 -> 0.294 -> 0.217 -> 0.127 -> 0.030` |
|
| 289 |
+
| wide L8 | 17.30M | `0.301 -> 0.254 -> 0.177 -> 0.087 -> 0.020` |
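The formula and the planned cumulative totals can be combined into a short script. This is a sketch, not the project's generator: `pressure_dropout` and `formula_path` are hypothetical names, and the parameter counts are the rounded values from the table, so results are only expected to agree to about three decimals. With those inputs it reproduces the L8 and wide L8 rows.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def pressure_dropout(params, unique_tokens, cumulative_sampled_tokens,
                     floor=0.02, ceiling=0.65):
    """Evaluate the empirical formula and clamp it to [floor, ceiling]."""
    raw = (0.154 * math.log10(params / unique_tokens)
           + 0.249 * math.log10(cumulative_sampled_tokens / unique_tokens)
           - 0.210)
    return min(ceiling, max(floor, raw))


def formula_path(params, prefixes=PREFIXES, stage_steps=1000, tokens_per_step=2048):
    """Compute one anchor dropout per stage from planned cumulative totals."""
    path = []
    for stage, unique in enumerate(prefixes, start=1):
        cumulative = stage * stage_steps * tokens_per_step
        path.append(pressure_dropout(params, unique, cumulative))
    return path


if __name__ == "__main__":
    for name, params in [("L8", 8.39e6), ("wide L8", 17.30e6)]:
        print(name, " -> ".join(f"{p:.3f}" for p in formula_path(params)))
```

At the `4M` prefix the unclamped value is negative for both models, so the `0.02` floor determines the final-stage dropout.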
|
| 290 |
+
|
| 291 |
+
## Main Result: Model-Size Validation
|
| 292 |
+
|
| 293 |
+
The formula was tested across five model sizes from 8.39M to 31.46M parameters.
|
| 294 |
+
Each model used three seeds and was compared against fixed-dropout controls near
|
| 295 |
+
the expected optimum.
|
| 296 |
+
|
| 297 |
+
| Model | Params | Formula final val | Best static final val | Paired final deltas |
|
| 298 |
+
|---|---:|---:|---:|---:|
|
| 299 |
+
| L8 | 8.39M | `4.6094 +/- 0.0056` | `4.6242` | `-0.0102, -0.0160, -0.0182` |
|
| 300 |
+
| L10 | 12.31M | `4.5306 +/- 0.0094` | `4.5580` | `-0.0288, -0.0188, -0.0345` |
|
| 301 |
+
| L12 | 17.37M | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
|
| 302 |
+
| L14 | 23.70M | `4.4384 +/- 0.0087` | `4.4736` | `-0.0294, -0.0269, -0.0429` |
|
| 303 |
+
| L16 | 31.46M | `4.4059 +/- 0.0046` | `4.4459` | `-0.0411, -0.0512, -0.0279` |
|
| 304 |
+
|
| 305 |
+
The formula wins all `15/15` paired final-loss comparisons in this model-size
|
| 306 |
+
validation set.
|
| 307 |
+
|
| 308 |
+
The L8 case is the weakest positive result. It wins final validation loss, but
|
| 309 |
+
the static optimum is shallow and static `0.08` has better mean trajectory. The
|
| 310 |
+
larger models show clearer benefits.
|
| 311 |
+
|
| 312 |
+
## Why Schedule Shape Matters
|
| 313 |
+
|
| 314 |
+
L16 was used to debug the difference between "high dropout" and "right dropout."
|
| 315 |
+
An early fitted path that started too high, `0.60 -> 0.40 -> 0.30 -> 0.14 ->
|
| 316 |
+
0.02`, beat some static controls at the final prefix but had worse trajectory
|
| 317 |
+
loss. Moderate schedules around `0.30` were much better.
|
| 318 |
+
|
| 319 |
+
Three-seed L16 confirmation:
|
| 320 |
+
|
| 321 |
+
| Condition | Path | Final val | Final std | Mean trajectory val | Final gap |
|
| 322 |
+
|---|---|---:|---:|---:|---:|
|
| 323 |
+
| `hold_30_then_decay` | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` | `4.4060` | `0.0118` | `4.8503` | `0.3530` |
|
| 324 |
+
| `mild_30_to_08` | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` | `4.4075` | `0.0078` | `4.8504` | `0.3307` |
|
| 325 |
+
| `fitted_l16_static_law` | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` | `4.4159` | `0.0042` | `4.9527` | `0.3144` |
|
| 326 |
+
| `static_dropout_0.14` | constant | `4.4459` | `0.0128` | `4.9043` | `0.3205` |
|
| 327 |
+
| `static_dropout_0.30` | constant | `4.4693` | `0.0081` | `4.8764` | `0.2327` |
|
| 328 |
+
| `static_dropout_0.02` | constant | `4.5405` | `0.0061` | `5.1544` | `0.4747` |
|
| 329 |
+
| `static_dropout_0.00` | constant | `4.5905` | `0.0192` | `5.2422` | `0.5464` |
|
| 330 |
+
|
| 331 |
+
The lesson is that the winning schedule is not "very high dropout, then decay."
|
| 332 |
+
It is "start near the small-prefix optimum, then decay as the optimum moves
|
| 333 |
+
down."
|
| 334 |
+
|
| 335 |
+
## Update-Pressure Validation
|
| 336 |
+
|
| 337 |
+
Changing `stage_steps` changes how many sampled tokens the optimizer consumes at
|
| 338 |
+
each prefix. The formula predicts that more repeated sampling should require
|
| 339 |
+
more dropout.
|
| 340 |
+
|
| 341 |
+
L12 update-pressure sweep:
|
| 342 |
+
|
| 343 |
+
| Stage steps | Formula path | Mean trajectory val | Formula final val | Best static final val | Paired final deltas |
|
| 344 |
+
|---:|---|---:|---:|---:|---:|
|
| 345 |
+
| 500 | `0.226 -> 0.180 -> 0.102 -> 0.020 -> 0.020` | `5.1581` | `4.7138 +/- 0.0080` | `4.7321` | `-0.0152, -0.0147, -0.0249` |
|
| 346 |
+
| 1000 | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
|
| 347 |
+
| 2000 | `0.376 -> 0.330 -> 0.252 -> 0.162 -> 0.065` | `4.7841` | `4.3089 +/- 0.0116` | `4.3513` | `-0.0453, -0.0321, -0.0489` |
|
| 348 |
+
|
| 349 |
+
The formula wins final validation loss in all three update-pressure regimes.
|
| 350 |
+
At `2000` steps per prefix, it also wins mean trajectory loss. This supports the
|
| 351 |
+
direction of the sampled-token term.
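One way to read the sweep: in this protocol the planned cumulative sampled tokens scale linearly with `stage_steps` at every stage, so changing the step count shifts the unclamped formula value by the same amount at every anchor. The sketch below computes that shift; `dropout_shift` is an illustrative name, and the result applies before the `0.02` floor and `0.65` ceiling are enforced.

```python
import math

SAMPLED_COEFF = 0.249  # coefficient on log10(cumulative_sampled_tokens / unique_tokens)


def dropout_shift(step_multiplier):
    """Change in unclamped dropout when stage_steps is scaled by step_multiplier."""
    return SAMPLED_COEFF * math.log10(step_multiplier)


if __name__ == "__main__":
    for multiplier in (0.5, 2.0):
        print(f"stage_steps x{multiplier}: shift {dropout_shift(multiplier):+.3f}")
```

Doubling the steps per stage raises each unclamped anchor by roughly `0.075`, which is consistent with the first anchors in the table moving from `0.226` to about `0.30` to `0.376`.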
|
| 352 |
+
|
| 353 |
+
## Sampled-Pressure Coefficient Ablation
|
| 354 |
+
|
| 355 |
+
The sampled-pressure coefficient was ablated on L12 while holding model,
|
| 356 |
+
prefixes, and training budget fixed.
|
| 357 |
+
|
| 358 |
+
| Condition | Coefficient multiplier | Path | Mean trajectory val | Final val | Final std | Final gap |
|
| 359 |
+
|---|---:|---|---:|---:|---:|---:|
|
| 360 |
+
| `no_sample_pressure_l12` | 0x | `0.074 -> 0.027 -> 0.020 -> 0.020 -> 0.020` | `5.0282` | `4.5468` | `0.0011` | `0.3482` |
|
| 361 |
+
| `half_sample_pressure_l12` | 0.5x | `0.187 -> 0.141 -> 0.079 -> 0.020 -> 0.020` | `4.9260` | `4.5055` | `0.0046` | `0.3272` |
|
| 362 |
+
| `pressure_formula_floor02` | 1.0x | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812` | `0.0062` | `0.2825` |
|
| 363 |
+
| `high_sample_pressure_l12` | 1.5x | `0.415 -> 0.368 -> 0.275 -> 0.163 -> 0.041` | `4.9739` | `4.4959` | `0.0025` | `0.2418` |
|
| 364 |
+
|
| 365 |
+
The `1.0x` coefficient is best on final validation. The `1.5x` variant has the
|
| 366 |
+
smallest final train-validation gap but worse validation loss, which is a useful
|
| 367 |
+
warning: minimizing the gap is not the same as maximizing generalization.
|
| 368 |
+
|
| 369 |
+
## Architecture-Shape Holdout
|
| 370 |
+
|
| 371 |
+
A key question is whether parameter count alone is a reasonable capacity proxy.
|
| 372 |
+
The first architecture-shape holdout uses a deep/narrow 8-head model:
|
| 373 |
+
|
| 374 |
+
```text
|
| 375 |
+
18 layers, 8 heads, 256 embedding dim, 16.25M parameters
|
| 376 |
+
```
|
| 377 |
+
|
| 378 |
+
The formula path was generated from parameter count only:
|
| 379 |
+
|
| 380 |
+
```text
|
| 381 |
+
0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020
|
| 382 |
+
```
|
| 383 |
+
|
| 384 |
+
Results:
|
| 385 |
+
|
| 386 |
+
| Condition | Path | Mean trajectory val | Final val | Final std | Final gap |
|
| 387 |
+
|---|---|---:|---:|---:|---:|
|
| 388 |
+
| Formula | `0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020` | `4.9720` | `4.5286` | `0.0118` | `0.2418` |
|
| 389 |
+
| Static `0.02` | constant | `5.0730` | `4.5887` | `0.0067` | `0.2947` |
|
| 390 |
+
| Static `0.08` | constant | `4.9900` | `4.5607` | `0.0081` | `0.2447` |
|
| 391 |
+
| Static `0.14` | constant | `4.9633` | `4.5564` | `0.0127` | `0.2080` |
|
| 392 |
+
| Static `0.18` | constant | `4.9699` | `4.5710` | `0.0061` | `0.1950` |
|
| 393 |
+
| Static `0.20` | constant | `4.9799` | `4.5835` | `0.0199` | `0.1841` |
|
| 394 |
+
| Static `0.26` | constant | `5.0021` | `4.6096` | `0.0126` | `0.1602` |
|
| 395 |
+
| Static `0.30` | constant | `5.0341` | `4.6520` | `0.0024` | `0.1545` |
|
| 396 |
+
|
| 397 |
+
Best static was `0.14`. Formula beat it on every paired final seed:
|
| 398 |
+
|
| 399 |
+
```text
|
| 400 |
+
formula - best_static = -0.0270, -0.0317, -0.0248
|
| 401 |
+
```
|
| 402 |
+
|
| 403 |
+
This supports final-loss transfer for the deep/narrow shape. It is not a clean
|
| 404 |
+
trajectory win because static `0.14` had slightly better mean trajectory. The
|
| 405 |
+
safe claim is final-loss transfer, not universal trajectory dominance.
|
| 406 |
+
|
| 407 |
+
## Combined Evidence
|
| 408 |
+
|
| 409 |
+
Completed headline evidence:
|
| 410 |
+
|
| 411 |
+
| Evidence type | Result |
|
| 412 |
+
|---|---|
|
| 413 |
+
| Model-size validation | `15/15` paired final-loss wins |
|
| 414 |
+
| Deep/narrow architecture holdout | `3/3` paired final-loss wins |
|
| 415 |
+
| Width-heavy architecture holdout | `3/3` paired final-loss wins |
|
| 416 |
+
| Combined paired final-loss comparisons | `21/21` wins |
|
| 417 |
+
| Update-pressure direction | Supported on L12 |
|
| 418 |
+
| Sampled-pressure coefficient | Supported on L12 |
|
| 419 |
+
| High arbitrary initial dropout | Rejected |
|
| 420 |
+
|
| 421 |
+
The current evidence is strong for the refined hypothesis under this exact
|
| 422 |
+
protocol. It is not strong enough to claim a universal dropout law.
|
| 423 |
+
|
| 424 |
+
## Additional Experiment Tables
|
| 425 |
+
|
| 426 |
+
This section gives a denser empirical audit trail for the main claims. The
|
| 427 |
+
narrative sections above highlight only the most important rows; the tables
|
| 428 |
+
below expose more of the completed run surface.
|
| 429 |
+
|
| 430 |
+
### Completed Run Inventory
|
| 431 |
+
|
| 432 |
+
| Run ID | Role | Seeds |
|
| 433 |
+
|---|---|---:|
|
| 434 |
+
| `legacy_20260525` | Initial streaming controls and high-dropout failure case | mixed |
|
| 435 |
+
| `screen_static_133008` | Static dropout screen across L8, L12, and L16 | 1 |
|
| 436 |
+
| `l16_static_vs_decay_152414` | L16 single-seed static-vs-decay baseline | 1 |
|
| 437 |
+
| `l16_schedule_search_171537` | L16 single-seed schedule search | 1 |
|
| 438 |
+
| `l16_schedule_refine_184506` | L16 single-seed schedule refinement | 1 |
|
| 439 |
+
| `l16_multiseed_confirm_203116` | L16 three-seed schedule confirmation | 3 |
|
| 440 |
+
| `l12_single_seed_072432` | L12 seed-1 pressure-formula probe | 1 |
|
| 441 |
+
| `l12_followup_085421` | L12 seeds 2 and 3 follow-up for common conditions | 2 |
|
| 442 |
+
| `l8_boundary_104407` | L8 boundary model formula test | 3 |
|
| 443 |
+
| `l16_exact_formula_123806` | L16 exact formula-vs-static confirmation | 3 |
|
| 444 |
+
| `l10_interpolation_153920` | L10 model-size interpolation run | 3 |
|
| 445 |
+
| `l14_interpolation_182113` | L14 model-size interpolation run | 3 |
|
| 446 |
+
| `l12_stage_steps_500_231804` | L12 low-update-pressure validation | 3 |
|
| 447 |
+
| `l12_stage_steps_2000_004033` | L12 high-update-pressure validation | 3 |
|
| 448 |
+
| `l12_sample_pressure_ablation_053842` | L12 sampled-pressure coefficient ablation | 3 |
|
| 449 |
+
| `deep_narrow_h8_112117` | Deep/narrow architecture-shape holdout | 3 |
|
| 450 |
+
| `wide_h8_151721` | Width-heavy architecture-shape holdout | 3 |
|
| 451 |
+
|
| 452 |
+
### Static Screen Optima
|
| 453 |
+
|
| 454 |
+
The static screen was the main reason the research direction changed. It showed
|
| 455 |
+
that dropout optima move with both prefix size and model scale.
|
| 456 |
+
|
| 457 |
+
| Model | Params | Prefix tokens | Effective epochs | Best dropout | Val loss | Train loss | Gap |
|
| 458 |
+
|---|---:|---:|---:|---:|---:|---:|---:|
|
| 459 |
+
| `L8_H8_D256` | 8.39M | 250k | 40.96 | 0.40 | 5.4175 | 3.6411 | 1.7763 |
|
| 460 |
+
| `L8_H8_D256` | 8.39M | 500k | 20.48 | 0.20 | 5.0216 | 3.6979 | 1.3238 |
|
| 461 |
+
| `L8_H8_D256` | 8.39M | 1M | 10.24 | 0.14 | 4.7763 | 3.9900 | 0.7863 |
|
| 462 |
+
| `L8_H8_D256` | 8.39M | 2M | 5.12 | 0.08 | 4.6232 | 4.2158 | 0.4074 |
|
| 463 |
+
| `L8_H8_D256` | 8.39M | 4M | 2.56 | 0.00 | 4.5136 | 4.2515 | 0.2621 |
|
| 464 |
+
| `L12_H8_D320` | 17.37M | 250k | 40.96 | 0.50 | 5.4384 | 3.3720 | 2.0663 |
|
| 465 |
+
| `L12_H8_D320` | 17.37M | 500k | 20.48 | 0.40 | 4.9791 | 3.7358 | 1.2434 |
|
| 466 |
+
| `L12_H8_D320` | 17.37M | 1M | 10.24 | 0.20 | 4.6871 | 3.7160 | 0.9711 |
|
| 467 |
+
| `L12_H8_D320` | 17.37M | 2M | 5.12 | 0.14 | 4.5088 | 4.0218 | 0.4870 |
|
| 468 |
+
| `L12_H8_D320` | 17.37M | 4M | 2.56 | 0.02 | 4.3875 | 4.0300 | 0.3575 |
|
| 469 |
+
| `L16_H8_D384` | 31.46M | 250k | 40.96 | 0.60 | 5.5055 | 3.3185 | 2.1870 |
|
| 470 |
+
| `L16_H8_D384` | 31.46M | 500k | 20.48 | 0.40 | 4.9814 | 3.2797 | 1.7017 |
|
| 471 |
+
| `L16_H8_D384` | 31.46M | 1M | 10.24 | 0.30 | 4.6511 | 3.6295 | 1.0216 |
|
| 472 |
+
| `L16_H8_D384` | 31.46M | 2M | 5.12 | 0.14 | 4.4270 | 3.7761 | 0.6509 |
|
| 473 |
+
| `L16_H8_D384` | 31.46M | 4M | 2.56 | 0.02 | 4.2947 | 3.8547 | 0.4400 |
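The effective-epochs column is consistent with a fixed sampled-token budget divided by the prefix size. A minimal sketch, assuming a budget of 10.24M sampled tokens per screen cell; that budget is inferred from the table values and is not stated in the report:

```python
# Assumption: budget inferred from the table (40.96 epochs * 250k prefix tokens).
SAMPLED_TOKENS = 10_240_000

def effective_epochs(prefix_tokens, sampled_tokens=SAMPLED_TOKENS):
    """Average number of times each prefix token is sampled during training."""
    return sampled_tokens / prefix_tokens

for prefix in (250_000, 500_000, 1_000_000, 2_000_000, 4_000_000):
    print(f"{prefix:>9,d} prefix tokens -> {effective_epochs(prefix):.2f} effective epochs")
```

Under this assumption, halving the prefix doubles the number of passes, which is the repeated-sampling pressure the best-dropout column tracks.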
|
| 474 |
+
|
| 475 |
+
### Formula Trajectories Across Model Size
|
| 476 |
+
|
| 477 |
+
This table shows validation loss at every stream prefix for the formula
|
| 478 |
+
schedule, not only the final result.
|
| 479 |
+
|
| 480 |
+
| Model | Params | 250k | 500k | 1M | 2M | 4M |
|
| 481 |
+
|---|---:|---:|---:|---:|---:|---:|
|
| 482 |
+
| L8 | 8.39M | 5.6127 | 5.2183 | 4.9549 | 4.7543 | 4.6094 |
|
| 483 |
+
| L10 | 12.31M | 5.5603 | 5.1544 | 4.8885 | 4.6831 | 4.5306 |
|
| 484 |
+
| L12 | 17.37M | 5.5239 | 5.1258 | 4.8439 | 4.6383 | 4.4812 |
|
| 485 |
+
| L14 | 23.70M | 5.4849 | 5.0853 | 4.8105 | 4.5969 | 4.4384 |
|
| 486 |
+
| L16 | 31.46M | 5.4670 | 5.0597 | 4.7784 | 4.5699 | 4.4059 |
|
| 487 |
+
|
| 488 |
+
### Final Static-Control Rankings By Model
|
| 489 |
+
|
| 490 |
+
These tables show the final `4M` validation loss for the formula and all static
|
| 491 |
+
controls that were run in each model-size validation. They make clear that the
|
| 492 |
+
formula is not only beating a weak single baseline; it is being compared against
|
| 493 |
+
nearby static controls around the apparent optimum.
|
| 494 |
+
|
| 495 |
+
#### L8 Final Controls
|
| 496 |
+
|
| 497 |
+
| Condition | N | Final val | Val std | Final train | Final gap |
|
| 498 |
+
|---|---:|---:|---:|---:|---:|
|
| 499 |
+
| formula | 3 | 4.6094 | 0.0056 | 4.3977 | 0.2117 |
|
| 500 |
+
| static 0.08 | 3 | 4.6257 | 0.0007 | 4.4303 | 0.1954 |
|
| 501 |
+
| static 0.04 | 3 | 4.6281 | 0.0040 | 4.4116 | 0.2166 |
|
| 502 |
+
| static 0.02 | 3 | 4.6302 | 0.0086 | 4.3857 | 0.2445 |
|
| 503 |
+
| static 0.00 | 3 | 4.6464 | 0.0072 | 4.3789 | 0.2675 |
|
| 504 |
+
| static 0.13 | 3 | 4.6475 | 0.0083 | 4.4690 | 0.1784 |
|
| 505 |
+
| static 0.20 | 3 | 4.6833 | 0.0048 | 4.5289 | 0.1543 |
|
| 506 |
+
| static 0.25 | 3 | 4.7232 | 0.0032 | 4.5782 | 0.1450 |
|
| 507 |
+
| static 0.30 | 3 | 4.7666 | 0.0083 | 4.6333 | 0.1334 |
|
| 508 |
+
|
| 509 |
+
#### L10 Final Controls
|
| 510 |
+
|
| 511 |
+
| Condition | N | Final val | Val std | Final train | Final gap |
|
| 512 |
+
|---|---:|---:|---:|---:|---:|
|
| 513 |
+
| formula | 3 | 4.5306 | 0.0094 | 4.2816 | 0.2491 |
|
| 514 |
+
| static 0.06 | 3 | 4.5580 | 0.0033 | 4.2991 | 0.2588 |
|
| 515 |
+
| static 0.10 | 3 | 4.5618 | 0.0049 | 4.3319 | 0.2299 |
|
| 516 |
+
| static 0.08 | 3 | 4.5645 | 0.0015 | 4.3267 | 0.2378 |
|
| 517 |
+
| static 0.13 | 3 | 4.5725 | 0.0100 | 4.3582 | 0.2143 |
|
| 518 |
+
| static 0.16 | 3 | 4.5732 | 0.0073 | 4.3716 | 0.2017 |
|
| 519 |
+
| static 0.02 | 3 | 4.5835 | 0.0067 | 4.2847 | 0.2988 |
|
| 520 |
+
| static 0.20 | 3 | 4.5939 | 0.0099 | 4.4108 | 0.1830 |
|
| 521 |
+
| static 0.23 | 3 | 4.6069 | 0.0061 | 4.4318 | 0.1752 |
|
| 522 |
+
| static 0.28 | 3 | 4.6494 | 0.0063 | 4.4887 | 0.1607 |
|
| 523 |
+
|
| 524 |
+
#### L12 Final Controls
|
| 525 |
+
|
| 526 |
+
| Condition | N | Final val | Val std | Final train | Final gap |
|
| 527 |
+
|---|---:|---:|---:|---:|---:|
|
| 528 |
+
| formula | 3 | 4.4812 | 0.0062 | 4.1988 | 0.2825 |
|
| 529 |
+
| static 0.14 | 3 | 4.5183 | 0.0022 | 4.2741 | 0.2442 |
|
| 530 |
+
| static 0.20 | 3 | 4.5284 | 0.0075 | 4.3071 | 0.2213 |
|
| 531 |
+
| static 0.09 | 3 | 4.5291 | 0.0023 | 4.2545 | 0.2745 |
|
| 532 |
+
| static 0.18 | 3 | 4.5308 | 0.0069 | 4.3086 | 0.2222 |
|
| 533 |
+
| static 0.26 | 3 | 4.5581 | 0.0025 | 4.3624 | 0.1957 |
|
| 534 |
+
| static 0.02 | 1 | 4.5624 | 0.0000 | 4.2134 | 0.3491 |
|
| 535 |
+
| static 0.30 | 3 | 4.5817 | 0.0014 | 4.3991 | 0.1826 |
|
| 536 |
+
| static 0.00 | 1 | 4.6071 | 0.0000 | 4.1934 | 0.4137 |
|
| 537 |
+
|
| 538 |
+
#### L14 Final Controls
|
| 539 |
+
|
| 540 |
+
| Condition | N | Final val | Val std | Final train | Final gap |
|
| 541 |
+
|---|---:|---:|---:|---:|---:|
|
| 542 |
+
| formula | 3 | 4.4384 | 0.0087 | 4.1337 | 0.3046 |
|
| 543 |
+
| static 0.18 | 3 | 4.4736 | 0.0072 | 4.2166 | 0.2570 |
|
| 544 |
+
| static 0.14 | 3 | 4.4769 | 0.0113 | 4.1999 | 0.2770 |
|
| 545 |
+
| static 0.20 | 3 | 4.4777 | 0.0014 | 4.2243 | 0.2534 |
|
| 546 |
+
| static 0.10 | 3 | 4.4851 | 0.0039 | 4.1776 | 0.3075 |
|
| 547 |
+
| static 0.28 | 3 | 4.5056 | 0.0068 | 4.2989 | 0.2067 |
|
| 548 |
+
| static 0.02 | 3 | 4.5384 | 0.0113 | 4.1117 | 0.4267 |
|
| 549 |
+
| static 0.32 | 3 | 4.5390 | 0.0054 | 4.3325 | 0.2065 |
|
| 550 |
+
|
| 551 |
+
#### L16 Exact-Formula Final Controls
|
| 552 |
+
|
| 553 |
+
| Condition | N | Final val | Val std | Final train | Final gap |
|
| 554 |
+
|---|---:|---:|---:|---:|---:|
|
| 555 |
+
| formula | 3 | 4.4059 | 0.0046 | 4.0601 | 0.3457 |
|
| 556 |
+
| static 0.14 | 3 | 4.4459 | 0.0128 | 4.1254 | 0.3205 |
|
| 557 |
+
|
| 558 |
+
### Update-Pressure Final Rankings
|
| 559 |
+
|
| 560 |
+
For L12, changing `stage_steps` changes repeated-sampling pressure while keeping
|
| 561 |
+
the same stream prefixes. The rows below include the formula and the three best
|
| 562 |
+
static controls at the final prefix for each update-pressure regime.
|
| 563 |
+
|
| 564 |
+
| Stage steps | Condition | N | Final val | Val std | Final train | Final gap |
|
| 565 |
+
|---:|---|---:|---:|---:|---:|---:|
|
| 566 |
+
| 500 | formula | 3 | 4.7138 | 0.0080 | 4.5508 | 0.1631 |
|
| 567 |
+
| 500 | static 0.02 | 3 | 4.7321 | 0.0051 | 4.5468 | 0.1853 |
|
| 568 |
+
| 500 | static 0.06 | 3 | 4.7413 | 0.0024 | 4.5796 | 0.1617 |
|
| 569 |
+
| 500 | static 0.10 | 3 | 4.7514 | 0.0070 | 4.6030 | 0.1484 |
|
| 570 |
+
| 1000 | formula | 3 | 4.4812 | 0.0062 | 4.1988 | 0.2825 |
|
| 571 |
+
| 1000 | static 0.14 | 3 | 4.5183 | 0.0022 | 4.2741 | 0.2442 |
|
| 572 |
+
| 1000 | static 0.20 | 3 | 4.5284 | 0.0075 | 4.3071 | 0.2213 |
|
| 573 |
+
| 1000 | static 0.09 | 3 | 4.5291 | 0.0023 | 4.2545 | 0.2745 |
|
| 574 |
+
| 2000 | formula | 3 | 4.3089 | 0.0116 | 3.8949 | 0.4140 |
|
| 575 |
+
| 2000 | static 0.25 | 3 | 4.3513 | 0.0030 | 4.0249 | 0.3264 |
|
| 576 |
+
| 2000 | static 0.18 | 3 | 4.3557 | 0.0076 | 3.9884 | 0.3673 |
|
| 577 |
+
| 2000 | static 0.14 | 3 | 4.3622 | 0.0020 | 3.9608 | 0.4014 |
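The link between `stage_steps` and repeated-sampling pressure can be sketched as simple arithmetic. The batch size (16) and block size (128) below come from the width-heavy holdout `config.json`; that the L12 update-pressure runs used the same values is an assumption.

```python
BATCH_SIZE = 16    # assumption: same as the wide-holdout config.json
BLOCK_SIZE = 128   # assumption: same as the wide-holdout config.json
FINAL_PREFIX = 4_000_000

def tokens_per_stage(stage_steps, batch_size=BATCH_SIZE, block_size=BLOCK_SIZE):
    """Tokens sampled during one stage of the expanding-prefix stream."""
    return stage_steps * batch_size * block_size

def passes_over_prefix(stage_steps, prefix_tokens):
    """How many times, on average, one stage revisits each prefix token."""
    return tokens_per_stage(stage_steps) / prefix_tokens

for steps in (500, 1000, 2000):
    print(steps, tokens_per_stage(steps),
          round(passes_over_prefix(steps, FINAL_PREFIX), 3))
```

With 1,000 steps this gives 2,048,000 sampled tokens per stage, which matches the per-stage increment of `tokens_seen` in the wide-holdout `metrics.jsonl`.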
|
| 578 |
+
|
| 579 |
+
### Sampled-Pressure Ablation Trajectories
|
| 580 |
+
|
| 581 |
+
The table below shows the stage-by-stage validation path for the sampled-token
|
| 582 |
+
coefficient ablation on L12. The `1.0x` row is the main formula run included
|
| 583 |
+
above; the other rows are coefficient variants from the dedicated ablation run.
|
| 584 |
+
|
| 585 |
+
| Multiplier | Validation path across prefixes | Mean trajectory val | Final val | Final std | Final gap |
|
| 586 |
+
|---:|---|---:|---:|---:|---:|
|
| 587 |
+
| 0x | `5.5299 -> 5.3265 -> 5.0044 -> 4.7335 -> 4.5468` | 5.0282 | 4.5468 | 0.0011 | 0.3482 |
|
| 588 |
+
| 0.5x | `5.4776 -> 5.1221 -> 4.8600 -> 4.6647 -> 4.5055` | 4.9260 | 4.5055 | 0.0046 | 0.3272 |
|
| 589 |
+
| 1.0x | `5.5239 -> 5.1258 -> 4.8439 -> 4.6383 -> 4.4812` | 4.9226 | 4.4812 | 0.0062 | 0.2825 |
|
| 590 |
+
| 1.5x | `5.6181 -> 5.1838 -> 4.8996 -> 4.6723 -> 4.4959` | 4.9739 | 4.4959 | 0.0025 | 0.2418 |
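The mean-trajectory column appears to be the unweighted mean of the five per-prefix validation losses. A small sketch that recomputes it from the path strings in the table; the helper name `mean_trajectory` is illustrative and is not part of the repository:

```python
def mean_trajectory(path):
    """Unweighted mean of the stage losses in an 'a -> b -> c' path string."""
    values = [float(part) for part in path.split("->")]
    return sum(values) / len(values)

paths = {
    "0x":   "5.5299 -> 5.3265 -> 5.0044 -> 4.7335 -> 4.5468",
    "0.5x": "5.4776 -> 5.1221 -> 4.8600 -> 4.6647 -> 4.5055",
    "1.0x": "5.5239 -> 5.1258 -> 4.8439 -> 4.6383 -> 4.4812",
    "1.5x": "5.6181 -> 5.1838 -> 4.8996 -> 4.6723 -> 4.4959",
}

for multiplier, path in paths.items():
    print(multiplier, round(mean_trajectory(path), 4))
```

Because every prefix gets equal weight, a schedule can lead on this metric while trailing on final loss, and vice versa.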
|
| 591 |
+
|
| 592 |
+
### Architecture-Shape Holdout Final Controls
|
| 593 |
+
|
| 594 |
+
The deep/narrow holdout is important because it tests whether the pressure rule
|
| 595 |
+
transfers beyond the exact depth/width scaling family used to fit the model-size
|
| 596 |
+
trend.
|
| 597 |
+
|
| 598 |
+
| Condition | N | Final val | Val std | Final train | Final gap |
|
| 599 |
+
|---|---:|---:|---:|---:|---:|
|
| 600 |
+
| formula | 3 | 4.5286 | 0.0118 | 4.2869 | 0.2418 |
|
| 601 |
+
| static 0.14 | 3 | 4.5564 | 0.0127 | 4.3485 | 0.2080 |
|
| 602 |
+
| static 0.08 | 3 | 4.5607 | 0.0081 | 4.3160 | 0.2447 |
|
| 603 |
+
| static 0.18 | 3 | 4.5710 | 0.0061 | 4.3760 | 0.1950 |
|
| 604 |
+
| static 0.20 | 3 | 4.5835 | 0.0199 | 4.3995 | 0.1841 |
|
| 605 |
+
| static 0.02 | 3 | 4.5887 | 0.0067 | 4.2941 | 0.2947 |
|
| 606 |
+
| static 0.26 | 3 | 4.6096 | 0.0126 | 4.4494 | 0.1602 |
|
| 607 |
+
| static 0.30 | 3 | 4.6520 | 0.0024 | 4.4975 | 0.1545 |
|
| 608 |
+
|
| 609 |
+
The width-heavy holdout is the paired complement: it keeps parameter count near
|
| 610 |
+
the L12 scale but uses a conventional wider `8x8x384` shape instead of the
|
| 611 |
+
deep/narrow `18x8x256` shape.
|
| 612 |
+
|
| 613 |
+
| Condition | N | Final val | Val std | Final train | Final gap |
|
| 614 |
+
|---|---:|---:|---:|---:|---:|
|
| 615 |
+
| formula | 3 | 4.4658 | 0.0065 | 4.1514 | 0.3144 |
|
| 616 |
+
| static 0.14 | 3 | 4.4946 | 0.0087 | 4.2214 | 0.2732 |
|
| 617 |
+
| static 0.18 | 3 | 4.4968 | 0.0052 | 4.2423 | 0.2545 |
|
| 618 |
+
| static 0.08 | 3 | 4.4989 | 0.0043 | 4.1777 | 0.3211 |
|
| 619 |
+
| static 0.20 | 3 | 4.5175 | 0.0085 | 4.2791 | 0.2384 |
|
| 620 |
+
| static 0.26 | 3 | 4.5411 | 0.0046 | 4.3243 | 0.2169 |
|
| 621 |
+
| static 0.02 | 3 | 4.5426 | 0.0051 | 4.1493 | 0.3933 |
|
| 622 |
+
| static 0.30 | 3 | 4.5754 | 0.0064 | 4.3771 | 0.1984 |
|
| 623 |
+
|
| 624 |
+
Best static final loss varied by seed, but formula beat the best static
|
| 625 |
+
condition in every paired final comparison:
|
| 626 |
+
|
| 627 |
+
```text
|
| 628 |
+
seed 1: 4.4629 vs static 0.18 at 4.4924, delta -0.0295
|
| 629 |
+
seed 2: 4.4733 vs static 0.08 at 4.5015, delta -0.0282
|
| 630 |
+
seed 3: 4.4612 vs static 0.14 at 4.4852, delta -0.0241
|
| 631 |
+
```
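The per-seed comparison above can be rechecked from the rounded values. This sketch recomputes the paired deltas and the formula's mean and spread; it assumes the `Val std` column is a sample standard deviation (`ddof=1`), which is consistent with the reported `0.0065` but is not stated in the report. Because the inputs are rounded to four decimals, the last digit can differ from values computed on unrounded losses.

```python
from statistics import mean, stdev

# Per-seed final validation losses, copied from the text block above.
formula = {1: 4.4629, 2: 4.4733, 3: 4.4612}
best_static = {
    1: ("static 0.18", 4.4924),
    2: ("static 0.08", 4.5015),
    3: ("static 0.14", 4.4852),
}

# Paired delta: formula minus the best static condition for the same seed.
deltas = {seed: formula[seed] - best_static[seed][1] for seed in formula}

for seed, delta in deltas.items():
    print(f"seed {seed}: delta {delta:+.4f} vs {best_static[seed][0]}")

print(f"formula mean {mean(formula.values()):.4f}, "
      f"sample std {stdev(formula.values()):.4f}")
```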
|
| 632 |
+
|
| 633 |
+
The width-heavy result is not a clean win on every metric. Static `0.18` had the
|
| 634 |
+
best mean trajectory loss, `4.9064` versus formula `4.9073`, and the first two
|
| 635 |
+
prefixes favored static rates around `0.18-0.20`. The formula still won final
|
| 636 |
+
loss because it decayed to low dropout at the largest prefixes. This suggests
|
| 637 |
+
that final-loss transfer is real, but an architecture-shape term may be needed
|
| 638 |
+
to avoid overestimating early dropout for wide models.
|
| 639 |
+
|
| 640 |
+
## Interpretation
|
| 641 |
+
|
| 642 |
+
The most plausible mechanism is pressure tracking:
|
| 643 |
+
|
| 644 |
+
1. At small prefixes, the model sees many effective passes over the same unique
|
| 645 |
+
tokens. Low dropout overfits quickly.
|
| 646 |
+
2. Larger models amplify this because they have more capacity relative to the
|
| 647 |
+
available prefix.
|
| 648 |
+
3. As the prefix grows, repeated-sampling pressure falls and high dropout begins
|
| 649 |
+
to underfit.
|
| 650 |
+
4. A static dropout value must compromise across these regimes.
|
| 651 |
+
5. A prefix-aware schedule can use stronger early regularization and lower
|
| 652 |
+
later regularization without changing model architecture or optimizer.
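A prefix-aware schedule of this kind can be sketched as interpolation between anchor points in log-prefix space. The anchors below are the ones recorded in the width-heavy holdout `config.json`; the interpolation rule itself is an assumption about what the `log_prefix_anchor` schedule name means, and the function `dropout_at` is hypothetical rather than the repository's implementation. In the completed runs, dropout is only evaluated at the anchor prefixes themselves.

```python
import math

# (prefix tokens, dropout) anchors from the wide-holdout config.json.
ANCHORS = [
    (250_000, 0.301),
    (500_000, 0.254),
    (1_000_000, 0.177),
    (2_000_000, 0.087),
    (4_000_000, 0.02),
]

def dropout_at(prefix_tokens, anchors=ANCHORS):
    """Piecewise-linear dropout in log(prefix tokens), clamped at both ends."""
    if prefix_tokens <= anchors[0][0]:
        return anchors[0][1]
    if prefix_tokens >= anchors[-1][0]:
        return anchors[-1][1]
    for (t0, p0), (t1, p1) in zip(anchors, anchors[1:]):
        if t0 <= prefix_tokens <= t1:
            # Fraction of the way from t0 to t1 on a log scale.
            frac = (math.log(prefix_tokens) - math.log(t0)) / (math.log(t1) - math.log(t0))
            return p0 + frac * (p1 - p0)
    raise ValueError("anchors must be sorted by prefix size")

for tokens in (250_000, 500_000, 1_000_000, 2_000_000, 4_000_000):
    print(tokens, round(dropout_at(tokens), 3))
```

Interpolating in log space treats each doubling of the prefix as an equal step, which matches the doubling stream caps used in the protocol.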
|
| 653 |
+
|
| 654 |
+
This interpretation is consistent with the static screens, the model-size
|
| 655 |
+
interpolation results, the update-pressure sweep, the sampled-pressure
|
| 656 |
+
coefficient ablation, and the two architecture-shape holdouts. The width-heavy
|
| 657 |
+
holdout adds an important refinement: parameter count alone does not fully
|
| 658 |
+
describe architecture capacity, because the formula's early dropout was higher
|
| 659 |
+
than the measured early-prefix static optimum for that shape.
|
| 660 |
+
|
| 661 |
+
## What This Report Does Not Prove
|
| 662 |
+
|
| 663 |
+
The current evidence does not prove:
|
| 664 |
+
|
| 665 |
+
- The formula is universal across arbitrary datasets.
|
| 666 |
+
- Parameter count alone fully captures model capacity.
|
| 667 |
+
- The formula always wins mean trajectory loss.
|
| 668 |
+
- The `0.02` floor is theoretically optimal.
|
| 669 |
+
- The sampled-pressure coefficient is optimal for every model size.
|
| 670 |
+
- The result will scale unchanged to larger LMs, longer contexts, or different
|
| 671 |
+
tokenizers.
|
| 672 |
+
|
| 673 |
+
The current evidence does support:
|
| 674 |
+
|
| 675 |
+
- Static dropout optima move downward as stream prefix size grows.
|
| 676 |
+
- Larger models need more early dropout at small stream prefixes.
|
| 677 |
+
- Repeated sampling from the same prefix increases useful dropout.
|
| 678 |
+
- A pressure-aware schedule can beat the best single static dropout on final
|
| 679 |
+
validation loss in the completed protocol.
|
| 680 |
+
|
| 681 |
+
## Publication Framing
|
| 682 |
+
|
| 683 |
+
The strongest safe paper claim is:
|
| 684 |
+
|
| 685 |
+
> In nanochat-style causal Transformers trained under an expanding-prefix
|
| 686 |
+
> streaming protocol, a pressure-aware dropout schedule improves final
|
| 687 |
+
> validation loss over fixed-dropout baselines across model sizes, update
|
| 688 |
+
> pressures, and two architecture-shape holdouts.
|
| 689 |
+
|
| 690 |
+
Claims to avoid:
|
| 691 |
+
|
| 692 |
+
- "Dropout decay is generally beneficial."
|
| 693 |
+
- "Very high initial dropout is useful."
|
| 694 |
+
- "The formula predicts optimal dropout universally."
|
| 695 |
+
- "The formula dominates every trajectory metric."
|
| 696 |
+
|
| 697 |
+
## Remaining High-Value Experiments
|
| 698 |
+
|
| 699 |
+
The next experiments that would most strengthen a paper are:
|
| 700 |
+
|
| 701 |
+
1. Corpus/domain holdout: freeze the formula and run on a different text
|
| 702 |
+
distribution. This is the largest missing generalization test.
|
| 703 |
+
2. Architecture-shape refinement: add a small feature such as depth/width ratio
|
| 704 |
+
or embedding dimension to reduce early-dropout overestimation on wide models,
|
| 705 |
+
then validate it on held-out shapes.
|
| 706 |
+
3. L8 and L16 sampled-pressure ablations: repeat the `0x`, `0.5x`, `1.0x`, and
|
| 707 |
+
`1.5x` coefficient ablation outside L12.
|
| 708 |
+
4. Oracle schedule comparison: compare the formula against a stage-wise oracle
|
| 709 |
+
chosen from measured static optima. The formula does not need to beat the
|
| 710 |
+
oracle; it should approach it without using per-stage oracle knowledge.
|
| 711 |
+
5. Five-seed headline confirmation: reserve higher seed counts for the final
|
| 712 |
+
paper table, not every exploratory sweep.
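The oracle comparison in item 4 could start from a per-stage argmin over measured static losses. The sketch below uses the three best static rows per stage from the width-heavy holdout stage tables. It is only a rough proxy: each static run carried its own dropout history through earlier stages, so the stage-wise minima do not describe a schedule that was actually trained.

```python
# stage -> {static dropout rate: mean validation loss}, top three rows per stage
# copied from the wide-holdout RESULT_SUMMARY.md stage tables.
STATIC_VAL = {
    0: {0.18: 5.4529, 0.20: 5.4558, 0.14: 5.4600},
    1: {0.18: 5.1068, 0.20: 5.1129, 0.26: 5.1137},
    2: {0.18: 4.8359, 0.14: 4.8447, 0.20: 4.8486},
    3: {0.14: 4.6379, 0.18: 4.6397, 0.20: 4.6572},
    4: {0.14: 4.4946, 0.18: 4.4968, 0.08: 4.4989},
}

def oracle_path(static_val):
    """Dropout rate with the lowest measured validation loss at each stage."""
    return [min(rates, key=rates.get) for _, rates in sorted(static_val.items())]

print(oracle_path(STATIC_VAL))
```

Training that path as its own condition would give the reference point the item describes, without using per-stage knowledge inside the formula.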
|
| 713 |
+
|
| 714 |
+
## Reproduction Pointers
|
| 715 |
+
|
| 716 |
+
Important files:
|
| 717 |
+
|
| 718 |
+
| File | Purpose |
|
| 719 |
+
|---|---|
|
| 720 |
+
| `README.md` | Project overview and workflow |
|
| 721 |
+
| `REPRODUCING.md` | Exact reproduction commands and data-cache notes |
|
| 722 |
+
| `src/dropout_decay/experiment.py` | MPS-only runner, stream loop, metrics, summaries |
|
| 723 |
+
| `src/dropout_decay/model.py` | Nanochat-style Transformer and dynamic dropout |
|
| 724 |
+
| `src/dropout_decay/data.py` | Token cache loading and corpus encoding |
|
| 725 |
+
| `runs/*/summary.csv` | Aggregated metrics |
|
| 726 |
+
| `runs/*/metrics.jsonl` | Per-seed raw metrics |
|
| 727 |
+
| `runs/*/RESULT_SUMMARY.md` | Generated human-readable run summaries |
|
| 728 |
+
|
| 729 |
+
For a new reader, the most useful path through the artifacts is:
|
| 730 |
+
|
| 731 |
+
1. Read this report.
|
| 732 |
+
2. Read `REPRODUCING.md` for exact commands.
|
| 733 |
+
3. Inspect the corresponding `runs/.../config.json` for each headline table.
|
| 734 |
+
4. Verify paired deltas from `metrics.jsonl` or `summary.csv`.
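Step 4 can be sketched as a short script over `metrics.jsonl`. The field names `condition`, `seed`, `stage`, and `val_eval_loss` appear in the committed metrics rows; the helper `paired_final_deltas`, and the assumption that every non-formula condition in the file is a static control, are illustrative.

```python
import json
from collections import defaultdict

def paired_final_deltas(lines, formula_name, final_stage=4):
    """Per seed: (best control name, formula loss minus best control loss)."""
    formula = {}
    controls = defaultdict(dict)  # seed -> {condition: final val loss}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        if rec["stage"] != final_stage:
            continue
        if rec["condition"] == formula_name:
            formula[rec["seed"]] = rec["val_eval_loss"]
        else:
            # Assumption: all other conditions are static controls.
            controls[rec["seed"]][rec["condition"]] = rec["val_eval_loss"]
    deltas = {}
    for seed, loss in sorted(formula.items()):
        if not controls[seed]:
            continue
        name, best = min(controls[seed].items(), key=lambda kv: kv[1])
        deltas[seed] = (name, loss - best)
    return deltas
```

Passing an open file handle for a run's `metrics.jsonl` together with the formula condition name (for example `formula_wide_l8_h8`) should yield one negative delta per seed if the headline comparison holds.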
|
| 735 |
+
|
| 736 |
+
## Bottom Line
|
| 737 |
+
|
| 738 |
+
The result is not "dropout decay works." The result is more precise:
|
| 739 |
+
|
| 740 |
+
> In an expanding-prefix training regime, dropout should track pressure from
|
| 741 |
+
> model size, available unique tokens, and repeated sampling. A schedule that
|
| 742 |
+
> tracks that pressure can outperform any single fixed dropout rate at final
|
| 743 |
+
> validation loss.
|
| 744 |
+
|
| 745 |
+
That is already a credible empirical story. The main remaining work is claim
|
| 746 |
+
scope: corpus transfer, architecture-shape refinement, and a clearer separation
|
| 747 |
+
between formula fitting and formula validation.
|
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/RESULT_SUMMARY.md
ADDED
|
@@ -0,0 +1,86 @@
|
| 1 |
+
# Locked Streaming Dropout Summary
|
| 2 |
+
|
| 3 |
+
Run directory: `runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721`
|
| 4 |
+
|
| 5 |
+
Model: `wide_L8_H8_D384` causal Transformer, 17,301,504 parameters, 8 layers, 8 heads, 384 embedding dim.
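The reported parameter count can be reproduced with a simple accounting. This is a sketch under stated assumptions, not a reading of `model.py`: bias-free linear layers, parameter-free norms, no learned positional embedding, a 4x MLP, and untied input and output embeddings. With the vocabulary size of 4096 from `config.json`, it matches the count above and the sizes quoted for the L8, L12, and L16 models.

```python
def param_count(n_layer, n_embd, vocab_size=4096):
    """Approximate Transformer parameter count under the stated assumptions."""
    # Per layer: 4*d^2 for attention projections plus 8*d^2 for a 4x MLP.
    blocks = 12 * n_layer * n_embd ** 2
    # Assumption: separate (untied) token embedding and output head.
    embeddings = 2 * vocab_size * n_embd
    return blocks + embeddings

for name, (layers, width) in {
    "wide_L8_H8_D384": (8, 384),
    "L12_H8_D320": (12, 320),
    "deep/narrow 18x8x256": (18, 256),
}.items():
    print(f"{name}: {param_count(layers, width):,d}")
```

Under this accounting the wide `8x8x384` shape and `L12_H8_D320` differ by well under one percent in parameter count, which is what makes the shape holdout a comparison at roughly fixed size.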
|
| 6 |
+
Training per stage: 1,000 steps. Sampled tokens are cumulative in each stage row. Seeds present: 1, 2, 3.
|
| 7 |
+
|
| 8 |
+
## Condition Ranking
|
| 9 |
+
|
| 10 |
+
| Condition | Kind | Final dropout | Mean trajectory val loss | Final val loss | Final gap | Dropout path |
|
| 11 |
+
|---|---|---:|---:|---:|---:|---|
|
| 12 |
+
| `static_dropout_0.18` | static | 0.18 | 4.9064 | 4.4968 | 0.2545 | 0.18 -> 0.18 -> 0.18 -> 0.18 -> 0.18 |
|
| 13 |
+
| `formula_wide_l8_h8` | anchor_decay | 0.02 | 4.9073 | 4.4658 | 0.3144 | 0.30 -> 0.25 -> 0.18 -> 0.09 -> 0.02 |
|
| 14 |
+
| `static_dropout_0.14` | static | 0.14 | 4.9120 | 4.4946 | 0.2732 | 0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14 |
|
| 15 |
+
| `static_dropout_0.2` | static | 0.20 | 4.9184 | 4.5175 | 0.2384 | 0.20 -> 0.20 -> 0.20 -> 0.20 -> 0.20 |
|
| 16 |
+
| `static_dropout_0.26` | static | 0.26 | 4.9323 | 4.5411 | 0.2169 | 0.26 -> 0.26 -> 0.26 -> 0.26 -> 0.26 |
|
| 17 |
+
| `static_dropout_0.08` | static | 0.08 | 4.9576 | 4.4989 | 0.3211 | 0.08 -> 0.08 -> 0.08 -> 0.08 -> 0.08 |
|
| 18 |
+
| `static_dropout_0.3` | static | 0.30 | 4.9612 | 4.5754 | 0.1984 | 0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30 |
|
| 19 |
+
| `static_dropout_0.02` | static | 0.02 | 5.0798 | 4.5426 | 0.3933 | 0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02 |
|
| 20 |
+
|
| 21 |
+
## Stage Trajectory
|
| 22 |
+
|
| 23 |
+
### Stage 0: 250,000 Prefix Tokens
|
| 24 |
+
|
| 25 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 26 |
+
|---|---:|---:|---:|---:|---:|
|
| 27 |
+
| `static_dropout_0.18` | 0.18 | 5.4529 | 4.3258 | 1.1270 | 3 |
|
| 28 |
+
| `static_dropout_0.2` | 0.20 | 5.4558 | 4.3694 | 1.0864 | 3 |
|
| 29 |
+
| `static_dropout_0.14` | 0.14 | 5.4600 | 4.2174 | 1.2425 | 3 |
|
| 30 |
+
| `static_dropout_0.26` | 0.26 | 5.4691 | 4.4782 | 0.9909 | 3 |
|
| 31 |
+
| `static_dropout_0.3` | 0.30 | 5.4959 | 4.5701 | 0.9258 | 3 |
|
| 32 |
+
| `formula_wide_l8_h8` | 0.30 | 5.4974 | 4.5700 | 0.9274 | 3 |
|
| 33 |
+
| `static_dropout_0.08` | 0.08 | 5.5064 | 4.0660 | 1.4404 | 3 |
|
| 34 |
+
| `static_dropout_0.02` | 0.02 | 5.6313 | 3.8606 | 1.7707 | 3 |
|
| 35 |
+
|
| 36 |
+
### Stage 1: 500,000 Prefix Tokens
|
| 37 |
+
|
| 38 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 39 |
+
|---|---:|---:|---:|---:|---:|
|
| 40 |
+
| `static_dropout_0.18` | 0.18 | 5.1068 | 4.0540 | 1.0529 | 3 |
|
| 41 |
+
| `static_dropout_0.2` | 0.20 | 5.1129 | 4.1267 | 0.9862 | 3 |
|
| 42 |
+
| `static_dropout_0.26` | 0.26 | 5.1137 | 4.2308 | 0.8829 | 3 |
|
| 43 |
+
| `formula_wide_l8_h8` | 0.25 | 5.1168 | 4.2433 | 0.8736 | 3 |
|
| 44 |
+
| `static_dropout_0.14` | 0.14 | 5.1226 | 3.9473 | 1.1754 | 3 |
|
| 45 |
+
| `static_dropout_0.3` | 0.30 | 5.1355 | 4.3145 | 0.8210 | 3 |
|
| 46 |
+
| `static_dropout_0.08` | 0.08 | 5.2107 | 3.7604 | 1.4503 | 3 |
|
| 47 |
+
| `static_dropout_0.02` | 0.02 | 5.4235 | 3.5110 | 1.9124 | 3 |
|
| 48 |
+
|
| 49 |
+
### Stage 2: 1,000,000 Prefix Tokens
|
| 50 |
+
|
| 51 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 52 |
+
|---|---:|---:|---:|---:|---:|
|
| 53 |
+
| `formula_wide_l8_h8` | 0.18 | 4.8353 | 4.1781 | 0.6572 | 3 |
|
| 54 |
+
| `static_dropout_0.18` | 0.18 | 4.8359 | 4.1157 | 0.7202 | 3 |
|
| 55 |
+
| `static_dropout_0.14` | 0.14 | 4.8447 | 4.0450 | 0.7997 | 3 |
|
| 56 |
+
| `static_dropout_0.2` | 0.20 | 4.8486 | 4.1693 | 0.6793 | 3 |
|
| 57 |
+
| `static_dropout_0.26` | 0.26 | 4.8584 | 4.2648 | 0.5936 | 3 |
|
| 58 |
+
| `static_dropout_0.3` | 0.30 | 4.8883 | 4.3345 | 0.5538 | 3 |
|
| 59 |
+
| `static_dropout_0.08` | 0.08 | 4.9041 | 3.9234 | 0.9808 | 3 |
|
| 60 |
+
| `static_dropout_0.02` | 0.02 | 5.0517 | 3.7576 | 1.2941 | 3 |
|
| 61 |
+
|
| 62 |
+
### Stage 3: 2,000,000 Prefix Tokens
|
| 63 |
+
|
| 64 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 65 |
+
|---|---:|---:|---:|---:|---:|
|
| 66 |
+
| `formula_wide_l8_h8` | 0.09 | 4.6212 | 4.1631 | 0.4581 | 3 |
|
| 67 |
+
| `static_dropout_0.14` | 0.14 | 4.6379 | 4.1672 | 0.4707 | 3 |
|
| 68 |
+
| `static_dropout_0.18` | 0.18 | 4.6397 | 4.2109 | 0.4288 | 3 |
|
| 69 |
+
| `static_dropout_0.2` | 0.20 | 4.6572 | 4.2456 | 0.4116 | 3 |
|
| 70 |
+
| `static_dropout_0.08` | 0.08 | 4.6681 | 4.1108 | 0.5573 | 3 |
|
| 71 |
+
| `static_dropout_0.26` | 0.26 | 4.6792 | 4.3159 | 0.3633 | 3 |
|
| 72 |
+
| `static_dropout_0.3` | 0.30 | 4.7109 | 4.3668 | 0.3441 | 3 |
|
| 73 |
+
| `static_dropout_0.02` | 0.02 | 4.7498 | 4.0129 | 0.7370 | 3 |
|
| 74 |
+
|
| 75 |
+
### Stage 4: 4,000,000 Prefix Tokens
|
| 76 |
+
|
| 77 |
+
| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
|
| 78 |
+
|---|---:|---:|---:|---:|---:|
|
| 79 |
+
| `formula_wide_l8_h8` | 0.02 | 4.4658 | 4.1514 | 0.3144 | 3 |
|
| 80 |
+
| `static_dropout_0.14` | 0.14 | 4.4946 | 4.2214 | 0.2732 | 3 |
|
| 81 |
+
| `static_dropout_0.18` | 0.18 | 4.4968 | 4.2423 | 0.2545 | 3 |
|
| 82 |
+
| `static_dropout_0.08` | 0.08 | 4.4989 | 4.1777 | 0.3211 | 3 |
|
| 83 |
+
| `static_dropout_0.2` | 0.20 | 4.5175 | 4.2791 | 0.2384 | 3 |
|
| 84 |
+
| `static_dropout_0.26` | 0.26 | 4.5411 | 4.3243 | 0.2169 | 3 |
|
| 85 |
+
| `static_dropout_0.02` | 0.02 | 4.5426 | 4.1493 | 0.3933 | 3 |
|
| 86 |
+
| `static_dropout_0.3` | 0.30 | 4.5754 | 4.3771 | 0.1984 | 3 |
|
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/config.json
ADDED
|
@@ -0,0 +1,131 @@
|
| 1 |
+
{
|
| 2 |
+
"args": {
|
| 3 |
+
"mode": "locked_stream",
|
| 4 |
+
"corpus": null,
|
| 5 |
+
"corpus_glob": null,
|
| 6 |
+
"text_column": "text",
|
| 7 |
+
"use_cached_data": true,
|
| 8 |
+
"output_dir": "runs/architecture_shape_holdout_wide_h8",
|
| 9 |
+
"resume_from": null,
|
| 10 |
+
"cache_dir": ".cache/dropout_decay",
|
| 11 |
+
"models": [
|
| 12 |
+
"wide_L8_H8_D384=8x8x384"
|
| 13 |
+
],
|
| 14 |
+
"seeds": [
|
| 15 |
+
1,
|
| 16 |
+
2,
|
| 17 |
+
3
|
| 18 |
+
],
|
| 19 |
+
"token_limits": [
|
| 20 |
+
5000000
|
| 21 |
+
],
|
| 22 |
+
"stream_token_caps": [
|
| 23 |
+
250000,
|
| 24 |
+
500000,
|
| 25 |
+
1000000,
|
| 26 |
+
2000000,
|
| 27 |
+
4000000
|
| 28 |
+
],
|
| 29 |
+
"val_tokens": 500000,
|
| 30 |
+
"allow_short_corpus": false,
|
| 31 |
+
"force_retokenize": false,
|
| 32 |
+
"vocab_size": 4096,
|
| 33 |
+
"tokenizer_train_chars": 10000000,
|
| 34 |
+
"block_size": 128,
|
| 35 |
+
"batch_size": 16,
|
| 36 |
+
"steps": 2000,
|
| 37 |
+
"stage_steps": 1000,
|
| 38 |
+
"dropout_rates": [
|
| 39 |
+
0.02,
|
| 40 |
+
0.08,
|
| 41 |
+
0.14,
|
| 42 |
+
0.18,
|
| 43 |
+
0.2,
|
| 44 |
+
0.26,
|
| 45 |
+
0.3
|
| 46 |
+
],
|
| 47 |
+
"decays": [],
|
| 48 |
+
"anchor_decays": [
|
| 49 |
+
{
|
| 50 |
+
"name": "formula_wide_l8_h8",
|
| 51 |
+
"kind": "anchor_decay",
|
| 52 |
+
"initial": 0.301,
|
| 53 |
+
"final": 0.02,
|
| 54 |
+
"schedule": "log_prefix_anchor",
|
| 55 |
+
"decay_tokens": null,
|
| 56 |
+
"anchors": [
|
| 57 |
+
[
|
| 58 |
+
250000,
|
| 59 |
+
0.301
|
| 60 |
+
],
|
| 61 |
+
[
|
| 62 |
+
500000,
|
| 63 |
+
0.254
|
| 64 |
+
],
|
| 65 |
+
[
|
| 66 |
+
1000000,
|
| 67 |
+
0.177
|
| 68 |
+
],
|
| 69 |
+
[
|
| 70 |
+
2000000,
|
| 71 |
+
0.087
|
| 72 |
+
],
|
| 73 |
+
[
|
| 74 |
+
4000000,
|
| 75 |
+
0.02
|
| 76 |
+
]
|
| 77 |
+
]
|
| 78 |
+
}
|
| 79 |
+
],
|
| 80 |
+
"decay_tokens": null,
|
| 81 |
+
"eval_batches": 64,
|
| 82 |
+
"train_eval_batches": 32,
|
| 83 |
+
"trace_eval_batches": 8,
|
| 84 |
+
"eval_every": 0,
|
| 85 |
+
"log_every": 500,
|
| 86 |
+
"lr": 0.0003,
|
| 87 |
+
"weight_decay": 0.1,
|
| 88 |
+
"grad_clip": 1.0,
|
| 89 |
+
"plateau_delta": 0.01,
|
| 90 |
+
"target_min_dropout": 0.1,
|
| 91 |
+
"min_nonzero_margin": 0.01,
|
| 92 |
+
"min_high_dropout_margin": 0.03,
|
| 93 |
+
"screen_early_stop": false,
|
| 94 |
+
"screen_prune_patience": 3,
|
| 95 |
+
"screen_prune_min_delta": 0.01
|
| 96 |
+
},
|
| 97 |
+
"mode": "locked_stream",
|
| 98 |
+
"seeds": [
|
| 99 |
+
1,
|
| 100 |
+
2,
|
| 101 |
+
3
|
| 102 |
+
],
|
| 103 |
+
"models": [
|
| 104 |
+
{
|
| 105 |
+
"model_name": "wide_L8_H8_D384",
|
| 106 |
+
"n_layer": 8,
|
| 107 |
+
"n_head": 8,
|
| 108 |
+
"n_embd": 384
|
| 109 |
+
}
|
| 110 |
+
],
|
| 111 |
+
"device": "mps",
|
| 112 |
+
"torch": "2.12.0",
|
| 113 |
+
"python": "3.11.15 (main, Mar 3 2026, 00:52:57) [Clang 21.0.0 (clang-2100.0.123.102)]",
|
| 114 |
+
"mps_available": true,
|
| 115 |
+
"attribution": "Derived from Andrej Karpathy's nanochat project (https://github.com/karpathy/nanochat), MIT License, Copyright (c) 2025 Andrej Karpathy.",
|
| 116 |
+
"tokenizer_path": ".cache/dropout_decay/tokenizer-v4096.json",
|
| 117 |
+
"encoded_path": ".cache/dropout_decay/tokens-v4096-uint16.npy",
|
| 118 |
+
"train_tokens": 5000970,
|
| 119 |
+
"val_tokens": 500000,
|
| 120 |
+
"effective_token_limits": [
|
| 121 |
+
5000000
|
| 122 |
+
],
|
| 123 |
+
"effective_stream_token_caps": [
|
| 124 |
+
250000,
|
| 125 |
+
500000,
|
| 126 |
+
1000000,
|
| 127 |
+
2000000,
|
| 128 |
+
4000000
|
| 129 |
+
],
|
| 130 |
+
"resume_from": null
|
| 131 |
+
}
|
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/metrics.jsonl
ADDED
|
@@ -0,0 +1,120 @@
|
| 1 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.301, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 80.3044319152832, "eval_loss": 5.50944098085165, "generalization_gap": 0.9404481127858162, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.568992868065834, "train_loss_last": 4.461259841918945, "val_eval_loss": 5.50944098085165}
|
| 2 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.254, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 81.0927050113678, "eval_loss": 5.128874830901623, "generalization_gap": 0.9115267693996429, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.21734806150198, "train_loss_last": 4.339989185333252, "val_eval_loss": 5.128874830901623}
|
| 3 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.177, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 82.04068875312805, "eval_loss": 4.821432217955589, "generalization_gap": 0.6483669355511665, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.173065282404423, "train_loss_last": 4.370175838470459, "val_eval_loss": 4.821432217955589}
|
| 4 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.087, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 86.07953906059265, "eval_loss": 4.621216967701912, "generalization_gap": 0.454586386680603, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.166630581021309, "train_loss_last": 4.40201473236084, "val_eval_loss": 4.621216967701912}
|
| 5 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 89.61367917060852, "eval_loss": 4.4628773629665375, "generalization_gap": 0.3145897909998894, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.148287571966648, "train_loss_last": 4.2655863761901855, "val_eval_loss": 4.4628773629665375}
|
| 6 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.301, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 89.19484186172485, "eval_loss": 5.511528238654137, "generalization_gap": 0.9201603531837463, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.59136788547039, "train_loss_last": 4.6994476318359375, "val_eval_loss": 5.511528238654137}
|
| 7 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.254, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 88.02258896827698, "eval_loss": 5.102983042597771, "generalization_gap": 0.8699157163500786, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.233067326247692, "train_loss_last": 4.28831672668457, "val_eval_loss": 5.102983042597771}
|
| 8 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.177, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 87.586266040802, "eval_loss": 4.846672013401985, "generalization_gap": 0.647594541311264, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.199077472090721, "train_loss_last": 4.06334114074707, "val_eval_loss": 4.846672013401985}
|
| 9 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.087, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 87.69060587882996, "eval_loss": 4.626160830259323, "generalization_gap": 0.45169103145599365, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.1744697988033295, "train_loss_last": 4.392305374145508, "val_eval_loss": 4.626160830259323}
|
| 10 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 87.76093912124634, "eval_loss": 4.473251141607761, "generalization_gap": 0.3120561018586159, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.1611950397491455, "train_loss_last": 4.331225872039795, "val_eval_loss": 4.473251141607761}
|
| 11 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.301, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 87.97550106048584, "eval_loss": 5.471142560243607, "generalization_gap": 0.9215231537818909, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.549619406461716, "train_loss_last": 4.7740654945373535, "val_eval_loss": 5.471142560243607}
|
| 12 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.254, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 89.6904628276825, "eval_loss": 5.118690565228462, "generalization_gap": 0.8392644375562668, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.279426127672195, "train_loss_last": 4.1092705726623535, "val_eval_loss": 5.118690565228462}
|
| 13 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.177, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 91.43055891990662, "eval_loss": 4.837727375328541, "generalization_gap": 0.6756020113825798, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.162125363945961, "train_loss_last": 4.312147617340088, "val_eval_loss": 4.837727375328541}
|
| 14 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.087, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 91.60441994667053, "eval_loss": 4.6162411123514175, "generalization_gap": 0.46792464703321457, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.148316465318203, "train_loss_last": 4.083148002624512, "val_eval_loss": 4.6162411123514175}
|
| 15 |
+
{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 92.24732804298401, "eval_loss": 4.461184054613113, "generalization_gap": 0.31641608476638794, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.1447679698467255, "train_loss_last": 4.061500549316406, "val_eval_loss": 4.461184054613113}
|
| 16 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.50047516822815, "eval_loss": 5.617484986782074, "generalization_gap": 1.740433618426323, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 3.877051368355751, "train_loss_last": 3.831916570663452, "val_eval_loss": 5.617484986782074}
|
| 17 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.35584092140198, "eval_loss": 5.42622634023428, "generalization_gap": 1.9256957545876503, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.5005305856466293, "train_loss_last": 3.6908276081085205, "val_eval_loss": 5.42622634023428}
|
| 18 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 91.07978391647339, "eval_loss": 5.010412596166134, "generalization_gap": 1.254038155078888, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.756374441087246, "train_loss_last": 3.6737847328186035, "val_eval_loss": 5.010412596166134}
|
| 19 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.77991008758545, "eval_loss": 4.739814430475235, "generalization_gap": 0.7331958413124084, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.0066185891628265, "train_loss_last": 4.060349464416504, "val_eval_loss": 4.739814430475235}
|
| 20 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 91.44194889068604, "eval_loss": 4.5367100313305855, "generalization_gap": 0.404718317091465, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.1319917142391205, "train_loss_last": 4.323214530944824, "val_eval_loss": 4.5367100313305855}
|
| 21 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.89891386032104, "eval_loss": 5.631713815033436, "generalization_gap": 1.7653722912073135, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 3.8663415238261223, "train_loss_last": 3.846677303314209, "val_eval_loss": 5.631713815033436}
|
| 22 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 88.31412315368652, "eval_loss": 5.415093585848808, "generalization_gap": 1.9046627581119537, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.5104308277368546, "train_loss_last": 3.469115972518921, "val_eval_loss": 5.415093585848808}
|
| 23 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 88.07946801185608, "eval_loss": 5.082612656056881, "generalization_gap": 1.302198402583599, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.780414253473282, "train_loss_last": 3.9317455291748047, "val_eval_loss": 5.082612656056881}
|
| 24 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.06063985824585, "eval_loss": 4.753706589341164, "generalization_gap": 0.7191323041915894, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.034574285149574, "train_loss_last": 4.145359516143799, "val_eval_loss": 4.753706589341164}
|
| 25 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 92.71470427513123, "eval_loss": 4.546178944408894, "generalization_gap": 0.3828253224492073, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.163353621959686, "train_loss_last": 4.104133129119873, "val_eval_loss": 4.546178944408894}
|
| 26 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 92.13415598869324, "eval_loss": 5.644713446497917, "generalization_gap": 1.806356817483902, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 3.838356629014015, "train_loss_last": 3.7551283836364746, "val_eval_loss": 5.644713446497917}
|
| 27 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 89.27249383926392, "eval_loss": 5.429135553538799, "generalization_gap": 1.9069734960794449, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.5221620574593544, "train_loss_last": 3.522247314453125, "val_eval_loss": 5.429135553538799}
|
| 28 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 87.17945766448975, "eval_loss": 5.062139481306076, "generalization_gap": 1.3260410204529762, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.7360984608531, "train_loss_last": 3.8998873233795166, "val_eval_loss": 5.062139481306076}
|
| 29 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 85.28739190101624, "eval_loss": 4.756018981337547, "generalization_gap": 0.7586047351360321, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 3.997414246201515, "train_loss_last": 3.8002028465270996, "val_eval_loss": 4.756018981337547}
|
| 30 |
+
{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 85.18242502212524, "eval_loss": 4.544781379401684, "generalization_gap": 0.3922204002737999, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.152560979127884, "train_loss_last": 4.062686920166016, "val_eval_loss": 4.544781379401684}
|
| 31 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 85.12396001815796, "eval_loss": 5.508156895637512, "generalization_gap": 1.4284128621220589, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.079744033515453, "train_loss_last": 4.3121185302734375, "val_eval_loss": 5.508156895637512}
|
| 32 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 85.83875703811646, "eval_loss": 5.224022679030895, "generalization_gap": 1.4923789128661156, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.7316437661647797, "train_loss_last": 3.8869738578796387, "val_eval_loss": 5.224022679030895}
|
| 33 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.98465609550476, "eval_loss": 4.877769023180008, "generalization_gap": 0.9518706128001213, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.9258984103798866, "train_loss_last": 3.9108357429504395, "val_eval_loss": 4.877769023180008}
|
| 34 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.71924090385437, "eval_loss": 4.6648936197161674, "generalization_gap": 0.5523728355765343, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.112520784139633, "train_loss_last": 4.277665138244629, "val_eval_loss": 4.6648936197161674}
|
| 35 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.46567440032959, "eval_loss": 4.493896655738354, "generalization_gap": 0.3229532167315483, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.170943439006805, "train_loss_last": 4.201048374176025, "val_eval_loss": 4.493896655738354}
|
| 36 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.63766598701477, "eval_loss": 5.524788901209831, "generalization_gap": 1.4730282947421074, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.051760606467724, "train_loss_last": 4.1509013175964355, "val_eval_loss": 5.524788901209831}
|
| 37 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.58094906806946, "eval_loss": 5.200122766196728, "generalization_gap": 1.4515998288989067, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.748522937297821, "train_loss_last": 3.867222309112549, "val_eval_loss": 5.200122766196728}
|
| 38 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.7393548488617, "eval_loss": 4.9155143946409225, "generalization_gap": 0.9573525786399841, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.9581618160009384, "train_loss_last": 4.031224250793457, "val_eval_loss": 4.9155143946409225}
|
| 39 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.68595004081726, "eval_loss": 4.671512261033058, "generalization_gap": 0.5327281430363655, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.138784117996693, "train_loss_last": 4.2578840255737305, "val_eval_loss": 4.671512261033058}
|
| 40 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.58462691307068, "eval_loss": 4.501494288444519, "generalization_gap": 0.315948948264122, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.185545340180397, "train_loss_last": 4.138848304748535, "val_eval_loss": 4.501494288444519}
|
| 41 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.52455997467041, "eval_loss": 5.486165724694729, "generalization_gap": 1.4197945520281792, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.06637117266655, "train_loss_last": 4.213794708251953, "val_eval_loss": 5.486165724694729}
|
| 42 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.35219979286194, "eval_loss": 5.207873582839966, "generalization_gap": 1.4069878831505775, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.8008856996893883, "train_loss_last": 4.008660316467285, "val_eval_loss": 5.207873582839966}
|
| 43 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.39520287513733, "eval_loss": 4.919148109853268, "generalization_gap": 1.0330585837364197, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.886089526116848, "train_loss_last": 3.963630199432373, "val_eval_loss": 4.919148109853268}
|
| 44 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.12316012382507, "eval_loss": 4.667882397770882, "generalization_gap": 0.5868068486452103, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.081075549125671, "train_loss_last": 4.332503795623779, "val_eval_loss": 4.667882397770882}
|
| 45 |
+
{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 85.79110598564148, "eval_loss": 4.501279823482037, "generalization_gap": 0.32453126460313797, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.176748558878899, "train_loss_last": 4.332511901855469, "val_eval_loss": 4.501279823482037}
|
| 46 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 82.9429247379303, "eval_loss": 5.474037267267704, "generalization_gap": 1.2244683280587196, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.249568939208984, "train_loss_last": 4.2807817459106445, "val_eval_loss": 5.474037267267704}
|
| 47 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 83.7153468132019, "eval_loss": 5.1435349360108376, "generalization_gap": 1.219955712556839, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.9235792234539986, "train_loss_last": 4.343637943267822, "val_eval_loss": 5.1435349360108376}
|
| 48 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 85.14877796173096, "eval_loss": 4.822992421686649, "generalization_gap": 0.8021149635314941, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.020877458155155, "train_loss_last": 4.408194541931152, "val_eval_loss": 4.822992421686649}
|
| 49 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 86.28292083740234, "eval_loss": 4.645638138055801, "generalization_gap": 0.4661043509840965, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.179533787071705, "train_loss_last": 4.2894206047058105, "val_eval_loss": 4.645638138055801}
|
| 50 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 87.48458695411682, "eval_loss": 4.496338866651058, "generalization_gap": 0.2786053493618965, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.217733517289162, "train_loss_last": 4.3651604652404785, "val_eval_loss": 4.496338866651058}
|
| 51 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 88.62815809249878, "eval_loss": 5.471045903861523, "generalization_gap": 1.2540272250771523, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.21701867878437, "train_loss_last": 4.285619735717773, "val_eval_loss": 5.471045903861523}
|
| 52 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 89.17306399345398, "eval_loss": 5.10762532055378, "generalization_gap": 1.161530278623104, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.9460950419306755, "train_loss_last": 4.307913780212402, "val_eval_loss": 5.10762532055378}
|
| 53 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 89.68199706077576, "eval_loss": 4.867001831531525, "generalization_gap": 0.7607270255684853, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.106274805963039, "train_loss_last": 4.097533226013184, "val_eval_loss": 4.867001831531525}
|
| 54 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 96.21420216560364, "eval_loss": 4.635738044977188, "generalization_gap": 0.45351576060056686, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.182222284376621, "train_loss_last": 4.320278644561768, "val_eval_loss": 4.635738044977188}
|
| 55 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 98.51693487167358, "eval_loss": 4.502298936247826, "generalization_gap": 0.26660506427288055, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.235693871974945, "train_loss_last": 4.196667671203613, "val_eval_loss": 4.502298936247826}
|
| 56 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 99.1040849685669, "eval_loss": 5.434773683547974, "generalization_gap": 1.2490219175815582, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.185751765966415, "train_loss_last": 4.543873310089111, "val_eval_loss": 5.434773683547974}
|
| 57 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 101.41888475418091, "eval_loss": 5.116771973669529, "generalization_gap": 1.1446522697806358, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.972119703888893, "train_loss_last": 4.026454448699951, "val_eval_loss": 5.116771973669529}
|
| 58 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 102.92470812797546, "eval_loss": 4.844192124903202, "generalization_gap": 0.8362765908241272, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.007915534079075, "train_loss_last": 4.160324573516846, "val_eval_loss": 4.844192124903202}
|
| 59 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 102.17207908630371, "eval_loss": 4.632426172494888, "generalization_gap": 0.4925713837146759, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.139854788780212, "train_loss_last": 4.122490406036377, "val_eval_loss": 4.632426172494888}
|
| 60 |
+
{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 100.92274498939514, "eval_loss": 4.48524085432291, "generalization_gap": 0.2743789106607437, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.210861943662167, "train_loss_last": 4.3369245529174805, "val_eval_loss": 4.48524085432291}
|
| 61 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 99.96288299560547, "eval_loss": 5.459806442260742, "generalization_gap": 1.099911704659462, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.35989473760128, "train_loss_last": 4.235060214996338, "val_eval_loss": 5.459806442260742}
|
| 62 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 99.67736482620239, "eval_loss": 5.1199976950883865, "generalization_gap": 1.0895164832472801, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.030481211841106, "train_loss_last": 4.190820693969727, "val_eval_loss": 5.1199976950883865}
|
| 63 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 99.31055998802185, "eval_loss": 4.822036981582642, "generalization_gap": 0.7236514016985893, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.098385579884052, "train_loss_last": 4.60369348526001, "val_eval_loss": 4.822036981582642}
|
| 64 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 98.8000099658966, "eval_loss": 4.644418828189373, "generalization_gap": 0.41945642977952957, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.2249623984098434, "train_loss_last": 4.394259929656982, "val_eval_loss": 4.644418828189373}
|
| 65 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 98.64591407775879, "eval_loss": 4.492378875613213, "generalization_gap": 0.25095726549625397, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.241421610116959, "train_loss_last": 4.539054870605469, "val_eval_loss": 4.492378875613213}
|
| 66 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 98.27119994163513, "eval_loss": 5.469226226210594, "generalization_gap": 1.1356151103973389, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.333611115813255, "train_loss_last": 4.384670734405518, "val_eval_loss": 5.469226226210594}
|
| 67 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 98.35777807235718, "eval_loss": 5.092819690704346, "generalization_gap": 1.0521608665585518, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.040658824145794, "train_loss_last": 4.205634117126465, "val_eval_loss": 5.092819690704346}
|
| 68 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 91.07860922813416, "eval_loss": 4.853221297264099, "generalization_gap": 0.7027387022972107, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.150482594966888, "train_loss_last": 4.3069024085998535, "val_eval_loss": 4.853221297264099}
|
| 69 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.5945131778717, "eval_loss": 4.632607348263264, "generalization_gap": 0.4114653095602989, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.221142038702965, "train_loss_last": 4.197530746459961, "val_eval_loss": 4.632607348263264}
|
| 70 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.36042618751526, "eval_loss": 4.502480059862137, "generalization_gap": 0.2522999197244644, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.250180140137672, "train_loss_last": 4.284463405609131, "val_eval_loss": 4.502480059862137}
|
| 71 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.31073188781738, "eval_loss": 5.429519824683666, "generalization_gap": 1.1455372348427773, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.283982589840889, "train_loss_last": 4.598309516906738, "val_eval_loss": 5.429519824683666}
|
| 72 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.2884590625763, "eval_loss": 5.1077147498726845, "generalization_gap": 1.0169242396950722, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.090790510177612, "train_loss_last": 4.279513359069824, "val_eval_loss": 5.1077147498726845}
|
| 73 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.53543972969055, "eval_loss": 4.832363620400429, "generalization_gap": 0.7342051789164543, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.0981584414839745, "train_loss_last": 4.1803812980651855, "val_eval_loss": 4.832363620400429}
|
| 74 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 90.07991695404053, "eval_loss": 4.642039962112904, "generalization_gap": 0.45552774518728256, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.186512216925621, "train_loss_last": 4.316293716430664, "val_eval_loss": 4.642039962112904}
|
| 75 |
+
{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 90.63944125175476, "eval_loss": 4.495470866560936, "generalization_gap": 0.2602871060371399, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.235183760523796, "train_loss_last": 4.447503566741943, "val_eval_loss": 4.495470866560936}
|
| 76 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 91.06378698348999, "eval_loss": 5.451976537704468, "generalization_gap": 1.041361778974533, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.410614758729935, "train_loss_last": 4.446702003479004, "val_eval_loss": 5.451976537704468}
|
| 77 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.85460114479065, "eval_loss": 5.123210750520229, "generalization_gap": 1.0067783072590828, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.1164324432611465, "train_loss_last": 4.462212562561035, "val_eval_loss": 5.123210750520229}
|
| 78 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.4302430152893, "eval_loss": 4.834423378109932, "generalization_gap": 0.670300304889679, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.164123073220253, "train_loss_last": 4.44389009475708, "val_eval_loss": 4.834423378109932}
|
| 79 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.19487309455872, "eval_loss": 4.667039297521114, "generalization_gap": 0.4070756286382675, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.259963668882847, "train_loss_last": 4.521040439605713, "val_eval_loss": 4.667039297521114}
|
| 80 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.51846718788147, "eval_loss": 4.518057778477669, "generalization_gap": 0.23345574736595154, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.284602031111717, "train_loss_last": 4.3808183670043945, "val_eval_loss": 4.518057778477669}
|
| 81 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 91.02405881881714, "eval_loss": 5.476214177906513, "generalization_gap": 1.1110263615846634, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.36518781632185, "train_loss_last": 4.610562324523926, "val_eval_loss": 5.476214177906513}
|
| 82 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 91.02885913848877, "eval_loss": 5.097359970211983, "generalization_gap": 0.9864056929945946, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.110954277217388, "train_loss_last": 4.277379035949707, "val_eval_loss": 5.097359970211983}
|
| 83 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 91.20587420463562, "eval_loss": 4.862358041107655, "generalization_gap": 0.6659520193934441, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.1964060217142105, "train_loss_last": 4.454376220703125, "val_eval_loss": 4.862358041107655}
|
| 84 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.5145378112793, "eval_loss": 4.6511543318629265, "generalization_gap": 0.39532778412103653, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.25582654774189, "train_loss_last": 4.377513885498047, "val_eval_loss": 4.6511543318629265}
|
| 85 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.71747303009033, "eval_loss": 4.5257163643836975, "generalization_gap": 0.24465039372444153, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.281065970659256, "train_loss_last": 4.475866794586182, "val_eval_loss": 4.5257163643836975}
|
| 86 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.78583598136902, "eval_loss": 5.4391709342598915, "generalization_gap": 1.1067671701312065, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.332403764128685, "train_loss_last": 4.592159271240234, "val_eval_loss": 5.4391709342598915}
|
| 87 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.46544909477234, "eval_loss": 5.118095904588699, "generalization_gap": 0.965451754629612, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.152644149959087, "train_loss_last": 4.286029815673828, "val_eval_loss": 5.118095904588699}
|
| 88 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 89.8702449798584, "eval_loss": 4.848977982997894, "generalization_gap": 0.7016597390174866, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.147318243980408, "train_loss_last": 4.145584583282471, "val_eval_loss": 4.848977982997894}
|
| 89 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 89.51202011108398, "eval_loss": 4.653353154659271, "generalization_gap": 0.43230512738227844, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.221048027276993, "train_loss_last": 4.275428771972656, "val_eval_loss": 4.653353154659271}
|
| 90 |
+
{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 89.36881709098816, "eval_loss": 4.508818320930004, "generalization_gap": 0.2372145727276802, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.271603748202324, "train_loss_last": 4.303030967712402, "val_eval_loss": 4.508818320930004}
|
| 91 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.39422821998596, "eval_loss": 5.471387661993504, "generalization_gap": 0.9843398556113243, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.487047806382179, "train_loss_last": 4.588815689086914, "val_eval_loss": 5.471387661993504}
|
| 92 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.34041380882263, "eval_loss": 5.123171776533127, "generalization_gap": 0.9006927907466888, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.222478985786438, "train_loss_last": 4.4486284255981445, "val_eval_loss": 5.123171776533127}
|
| 93 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.33436322212219, "eval_loss": 4.8384788408875465, "generalization_gap": 0.5947165563702583, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.243762284517288, "train_loss_last": 4.541705131530762, "val_eval_loss": 4.8384788408875465}
|
| 94 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.38607597351074, "eval_loss": 4.680613316595554, "generalization_gap": 0.36394860595464706, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.316664710640907, "train_loss_last": 4.310101509094238, "val_eval_loss": 4.680613316595554}
|
| 95 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.34234309196472, "eval_loss": 4.545445613563061, "generalization_gap": 0.21612011641263962, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.329325497150421, "train_loss_last": 4.295492172241211, "val_eval_loss": 4.545445613563061}
|
| 96 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.37102317810059, "eval_loss": 5.4896402060985565, "generalization_gap": 1.0082881152629852, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.481352090835571, "train_loss_last": 4.5635986328125, "val_eval_loss": 5.4896402060985565}
|
| 97 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.80504488945007, "eval_loss": 5.100566066801548, "generalization_gap": 0.9002188816666603, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.200347185134888, "train_loss_last": 4.5332489013671875, "val_eval_loss": 5.100566066801548}
|
| 98 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 90.24263501167297, "eval_loss": 4.872108653187752, "generalization_gap": 0.5782715156674385, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.293837137520313, "train_loss_last": 4.513686656951904, "val_eval_loss": 4.872108653187752}
|
| 99 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 90.11540699005127, "eval_loss": 4.671582758426666, "generalization_gap": 0.3512818217277527, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.320300936698914, "train_loss_last": 4.520935535430908, "val_eval_loss": 4.671582758426666}
|
| 100 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.37596607208252, "eval_loss": 4.541754223406315, "generalization_gap": 0.21430686861276627, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.327447354793549, "train_loss_last": 4.469289302825928, "val_eval_loss": 4.541754223406315}
|
| 101 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.083731174469, "eval_loss": 5.44633674621582, "generalization_gap": 0.9801725596189499, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.46616418659687, "train_loss_last": 4.668099403381348, "val_eval_loss": 5.44633674621582}
|
| 102 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.23882603645325, "eval_loss": 5.117508083581924, "generalization_gap": 0.8479217141866684, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.269586369395256, "train_loss_last": 4.168695449829102, "val_eval_loss": 5.117508083581924}
|
| 103 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.65575408935547, "eval_loss": 4.864691182971001, "generalization_gap": 0.6078529059886932, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.256838276982307, "train_loss_last": 4.411451816558838, "val_eval_loss": 4.864691182971001}
|
| 104 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.62967777252197, "eval_loss": 4.685315124690533, "generalization_gap": 0.3747243508696556, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.310590773820877, "train_loss_last": 4.3449296951293945, "val_eval_loss": 4.685315124690533}
|
| 105 |
+
{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.2124891281128, "eval_loss": 4.536212712526321, "generalization_gap": 0.22022628784179688, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.3159864246845245, "train_loss_last": 4.2295427322387695, "val_eval_loss": 4.536212712526321}
|
| 106 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.50791621208191, "eval_loss": 5.500275187194347, "generalization_gap": 0.918338917195797, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.58193626999855, "train_loss_last": 4.801484107971191, "val_eval_loss": 5.500275187194347}
|
| 107 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 90.16439700126648, "eval_loss": 5.156496524810791, "generalization_gap": 0.8506038039922714, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.30589272081852, "train_loss_last": 4.409400939941406, "val_eval_loss": 5.156496524810791}
|
| 108 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 90.12380909919739, "eval_loss": 4.879989787936211, "generalization_gap": 0.5560016483068466, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.323988139629364, "train_loss_last": 4.452445030212402, "val_eval_loss": 4.879989787936211}
|
| 109 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.98156309127808, "eval_loss": 4.714666917920113, "generalization_gap": 0.33146432042121887, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.383202597498894, "train_loss_last": 4.41795539855957, "val_eval_loss": 4.714666917920113}
|
| 110 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.62339806556702, "eval_loss": 4.573916859924793, "generalization_gap": 0.1942192241549492, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.379697635769844, "train_loss_last": 4.556178569793701, "val_eval_loss": 4.573916859924793}
|
| 111 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.48192501068115, "eval_loss": 5.513812951743603, "generalization_gap": 0.9296739622950554, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.584138989448547, "train_loss_last": 4.781554222106934, "val_eval_loss": 5.513812951743603}
|
| 112 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.30899000167847, "eval_loss": 5.122172422707081, "generalization_gap": 0.8031972870230675, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.318975135684013, "train_loss_last": 4.599480628967285, "val_eval_loss": 5.122172422707081}
|
| 113 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.44387793540955, "eval_loss": 4.903594605624676, "generalization_gap": 0.534355454146862, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.369239151477814, "train_loss_last": 4.467323303222656, "val_eval_loss": 4.903594605624676}
|
| 114 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.30373406410217, "eval_loss": 4.708515301346779, "generalization_gap": 0.329326331615448, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.379188969731331, "train_loss_last": 4.398643970489502, "val_eval_loss": 4.708515301346779}
|
| 115 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 88.56540608406067, "eval_loss": 4.582499638199806, "generalization_gap": 0.19323347508907318, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.389266163110733, "train_loss_last": 4.4059600830078125, "val_eval_loss": 4.582499638199806}
|
| 116 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 88.06415605545044, "eval_loss": 5.473536089062691, "generalization_gap": 0.9294483512639999, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.544087737798691, "train_loss_last": 4.708844184875488, "val_eval_loss": 5.473536089062691}
|
| 117 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 87.83247828483582, "eval_loss": 5.127706632018089, "generalization_gap": 0.8091608434915543, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.318545788526535, "train_loss_last": 4.4110612869262695, "val_eval_loss": 5.127706632018089}
|
| 118 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 87.72863101959229, "eval_loss": 4.881375916302204, "generalization_gap": 0.5711916759610176, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.3101842403411865, "train_loss_last": 4.227794647216797, "val_eval_loss": 4.881375916302204}
|
| 119 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 87.52914309501648, "eval_loss": 4.709614671766758, "generalization_gap": 0.3716374859213829, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.337977185845375, "train_loss_last": 4.518902778625488, "val_eval_loss": 4.709614671766758}
|
| 120 |
+
{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 87.48567986488342, "eval_loss": 4.569913923740387, "generalization_gap": 0.20761683583259583, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.362297087907791, "train_loss_last": 4.421485900878906, "val_eval_loss": 4.569913923740387}
|
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/summary.csv
ADDED
|
@@ -0,0 +1,41 @@
| 1 |
+
run_mode,condition,condition_kind,stage,token_limit,model_name,n_layer,n_head,n_embd,parameters,dropout_initial,dropout_final,dropout_schedule,n,mean_train_eval_loss,std_train_eval_loss,mean_val_eval_loss,std_val_eval_loss,mean_generalization_gap,std_generalization_gap
|
| 2 |
+
locked_stream,formula_wide_l8_h8,anchor_decay,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.56999338666598,0.020892215128759443,5.497370593249798,0.022738105633801235,0.9273772065838178,0.011340226985102892
|
| 3 |
+
locked_stream,static_dropout_0.02,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,3.8605831737319627,0.01997973123033439,5.631304082771142,0.013618853293115287,1.7707209090391796,0.033285474730425785
|
| 4 |
+
locked_stream,static_dropout_0.08,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,4.065958604216576,0.013996274750030279,5.506370507180691,0.01937345681127113,1.4404119029641151,0.02857342430791552
|
| 5 |
+
locked_stream,static_dropout_0.14,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,4.217446461319923,0.03191073719912012,5.4599522848924,0.021856544511956278,1.2425058235724766,0.015820136190563192
|
| 6 |
+
locked_stream,static_dropout_0.18,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.325829481085141,0.03854969420994424,5.452850831051667,0.02074692690552652,1.127021349966526,0.023996078635556407
|
| 7 |
+
locked_stream,static_dropout_0.2,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.369402113060157,0.03927543942310581,5.455787216623624,0.01881333118653956,1.0863851035634677,0.03904945576752281
|
| 8 |
+
locked_stream,static_dropout_0.26,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.478188027938207,0.01079536309478258,5.469121538102627,0.021740489819011388,0.9909335101644198,0.015173276757753403
|
| 9 |
+
locked_stream,static_dropout_0.3,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.570054332415263,0.022514684546532972,5.495874742666881,0.02049583740381805,0.9258204102516174,0.006480144970797402
|
| 10 |
+
locked_stream,formula_wide_l8_h8,anchor_decay,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.243280505140622,0.032274654795716486,5.116849479575952,0.013043710081023051,0.8735689744353294,0.036269420616252504
|
| 11 |
+
locked_stream,static_dropout_0.02,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,3.5110411569476128,0.010828643474845237,5.423485159873962,0.007411461831348081,1.9124440029263496,0.011534364701109023
|
| 12 |
+
locked_stream,static_dropout_0.08,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,3.760350801050663,0.03610450263157525,5.210673009355863,0.012193401903637546,1.4503222083051999,0.04270984927105307
|
| 13 |
+
locked_stream,static_dropout_0.14,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,3.9472646564245224,0.024291367988665397,5.122644076744716,0.018661090977288925,1.175379420320193,0.039515834393275315
|
| 14 |
+
locked_stream,static_dropout_0.18,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.053976848721504,0.03228513900248589,5.106844045221806,0.013609907255985704,1.0528671965003014,0.03630127590699434
|
| 15 |
+
locked_stream,static_dropout_0.2,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.126676956812541,0.022654445827106268,5.112888875106971,0.013689433065470715,0.9862119182944298,0.020663957741316727
|
| 16 |
+
locked_stream,static_dropout_0.26,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.230804180105527,0.0353623783786987,5.113748642305533,0.01176242224841787,0.8829444622000059,0.03033151506697832
|
| 17 |
+
locked_stream,static_dropout_0.3,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.314471215009689,0.0074322948465704985,5.135458526511987,0.018428372079211774,0.8209873115022978,0.025821376260662585
|
| 18 |
+
locked_stream,formula_wide_l8_h8,anchor_decay,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.178089372813702,0.018981456409242363,4.835277202228705,0.012797043787857624,0.6571878294150034,0.015951825016803534
|
| 19 |
+
locked_stream,static_dropout_0.02,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,3.7576290518045425,0.022184519488321397,5.05172157784303,0.03721037081891111,1.2940925260384877,0.036679450397234824
|
| 20 |
+
locked_stream,static_dropout_0.08,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,3.9233832508325577,0.03610191494845503,4.904143842558066,0.022913408617663143,0.9807605917255083,0.04537425441423101
|
| 21 |
+
locked_stream,static_dropout_0.14,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,4.04502259939909,0.05344041051164446,4.844728792707126,0.022009612626586104,0.7997061933080355,0.037832338456288826
|
| 22 |
+
locked_stream,static_dropout_0.18,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.115675538778305,0.03014400883010748,4.835873966415723,0.015885757236925133,0.7201984276374181,0.01601490118944517
|
| 23 |
+
locked_stream,static_dropout_0.2,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.169282446304957,0.024947280987738584,4.84858646740516,0.013971446329696822,0.6793040211002032,0.01948231221870367
|
| 24 |
+
locked_stream,static_dropout_0.26,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.26481256633997,0.025972383292421113,4.858426225682099,0.017668569161382996,0.5936136593421301,0.014821502950969461
|
| 25 |
+
locked_stream,static_dropout_0.3,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.334470510482788,0.030891434190892946,4.888320103287697,0.013246250571144034,0.5538495928049088,0.018512166716729395
|
| 26 |
+
locked_stream,formula_wide_l8_h8,anchor_decay,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.163138948380947,0.013421730028855801,4.621206303437551,0.004959867552466137,0.45806735505660373,0.008658546315824916
|
| 27 |
+
locked_stream,static_dropout_0.02,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,4.012869040171306,0.019352473408129574,4.749846667051315,0.008764765668577094,0.73697762688001,0.020006114758379243
|
| 28 |
+
locked_stream,static_dropout_0.08,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,4.110793483753999,0.028893033853370435,4.6680960928400355,0.003314491274290642,0.5573026090860367,0.02737432374001718
|
| 29 |
+
locked_stream,static_dropout_0.14,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,4.1672036200761795,0.023722898945556545,4.637934118509293,0.006874304525877202,0.4707304984331131,0.019934551772559723
|
| 30 |
+
locked_stream,static_dropout_0.18,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.21087221801281,0.021182682238053856,4.63968871285518,0.006246922787675527,0.42881649484237033,0.0234751500917668
|
| 31 |
+
locked_stream,static_dropout_0.2,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.245612747967243,0.021374004532989366,4.657182261347771,0.008606949344244184,0.4115695133805275,0.018893841026177444
|
| 32 |
+
locked_stream,static_dropout_0.26,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.315852140386899,0.004905814773589955,4.679170399904251,0.006978966774156137,0.3633182595173518,0.011733969729614104
|
| 33 |
+
locked_stream,static_dropout_0.3,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.366789584358533,0.025032839092378575,4.710932297011216,0.003280655243995376,0.34414271265268326,0.0238351561114313
|
| 34 |
+
locked_stream,formula_wide_l8_h8,anchor_decay,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.15141686052084,0.00864907460577688,4.4657708530624705,0.006533212137650865,0.31435399254163104,0.0021895349788719713
|
| 35 |
+
locked_stream,static_dropout_0.02,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,4.149302105108897,0.015932906479109284,4.542556785047054,0.005111427760726518,0.3932546799381574,0.010983082646486758
|
| 36 |
+
locked_stream,static_dropout_0.08,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,4.177745779355367,0.0073518511940769,4.498890255888303,0.004325913851224281,0.3211444765329361,0.0045681171466252805
|
| 37 |
+
locked_stream,static_dropout_0.14,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,4.221429777642091,0.012821970261243815,4.494626219073932,0.008657044012355454,0.27319644143184024,0.006086902797186131
|
| 38 |
+
locked_stream,static_dropout_0.18,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.242261836926143,0.007533414644039237,4.4967766006787615,0.005175633970589464,0.25451476375261944,0.005043870704268704
|
| 39 |
+
locked_stream,static_dropout_0.2,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.279090583324432,0.006720524978756697,4.51753082126379,0.008461337427952602,0.23844023793935776,0.005697079794179711
|
| 40 |
+
locked_stream,static_dropout_0.26,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.324253092209498,0.007220470805599103,4.541137516498566,0.004647242294749092,0.2168844242890676,0.0030328214421115702
|
| 41 |
+
locked_stream,static_dropout_0.3,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.37708696226279,0.013672763672564944,4.575443473954995,0.006430238324620357,0.19835651169220606,0.008034807259295824
|
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/summary.json
ADDED
|
@@ -0,0 +1,882 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[
|
| 2 |
+
{
|
| 3 |
+
"run_mode": "locked_stream",
|
| 4 |
+
"condition": "formula_wide_l8_h8",
|
| 5 |
+
"condition_kind": "anchor_decay",
|
| 6 |
+
"stage": 0,
|
| 7 |
+
"token_limit": 250000,
|
| 8 |
+
"model_name": "wide_L8_H8_D384",
|
| 9 |
+
"n_layer": 8,
|
| 10 |
+
"n_head": 8,
|
| 11 |
+
"n_embd": 384,
|
| 12 |
+
"parameters": 17301504,
|
| 13 |
+
"dropout_initial": 0.301,
|
| 14 |
+
"dropout_final": 0.02,
|
| 15 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 16 |
+
"n": 3,
|
| 17 |
+
"mean_train_eval_loss": 4.56999338666598,
|
| 18 |
+
"std_train_eval_loss": 0.020892215128759443,
|
| 19 |
+
"mean_val_eval_loss": 5.497370593249798,
|
| 20 |
+
"std_val_eval_loss": 0.022738105633801235,
|
| 21 |
+
"mean_generalization_gap": 0.9273772065838178,
|
| 22 |
+
"std_generalization_gap": 0.011340226985102892
|
| 23 |
+
},
|
| 24 |
+
{
|
| 25 |
+
"run_mode": "locked_stream",
|
| 26 |
+
"condition": "static_dropout_0.02",
|
| 27 |
+
"condition_kind": "static",
|
| 28 |
+
"stage": 0,
|
| 29 |
+
"token_limit": 250000,
|
| 30 |
+
"model_name": "wide_L8_H8_D384",
|
| 31 |
+
"n_layer": 8,
|
| 32 |
+
"n_head": 8,
|
| 33 |
+
"n_embd": 384,
|
| 34 |
+
"parameters": 17301504,
|
| 35 |
+
"dropout_initial": 0.02,
|
| 36 |
+
"dropout_final": 0.02,
|
| 37 |
+
"dropout_schedule": "constant",
|
| 38 |
+
"n": 3,
|
| 39 |
+
"mean_train_eval_loss": 3.8605831737319627,
|
| 40 |
+
"std_train_eval_loss": 0.01997973123033439,
|
| 41 |
+
"mean_val_eval_loss": 5.631304082771142,
|
| 42 |
+
"std_val_eval_loss": 0.013618853293115287,
|
| 43 |
+
"mean_generalization_gap": 1.7707209090391796,
|
| 44 |
+
"std_generalization_gap": 0.033285474730425785
|
| 45 |
+
},
|
| 46 |
+
{
|
| 47 |
+
"run_mode": "locked_stream",
|
| 48 |
+
"condition": "static_dropout_0.08",
|
| 49 |
+
"condition_kind": "static",
|
| 50 |
+
"stage": 0,
|
| 51 |
+
"token_limit": 250000,
|
| 52 |
+
"model_name": "wide_L8_H8_D384",
|
| 53 |
+
"n_layer": 8,
|
| 54 |
+
"n_head": 8,
|
| 55 |
+
"n_embd": 384,
|
| 56 |
+
"parameters": 17301504,
|
| 57 |
+
"dropout_initial": 0.08,
|
| 58 |
+
"dropout_final": 0.08,
|
| 59 |
+
"dropout_schedule": "constant",
|
| 60 |
+
"n": 3,
|
| 61 |
+
"mean_train_eval_loss": 4.065958604216576,
|
| 62 |
+
"std_train_eval_loss": 0.013996274750030279,
|
| 63 |
+
"mean_val_eval_loss": 5.506370507180691,
|
| 64 |
+
"std_val_eval_loss": 0.01937345681127113,
|
| 65 |
+
"mean_generalization_gap": 1.4404119029641151,
|
| 66 |
+
"std_generalization_gap": 0.02857342430791552
|
| 67 |
+
},
|
| 68 |
+
{
|
| 69 |
+
"run_mode": "locked_stream",
|
| 70 |
+
"condition": "static_dropout_0.14",
|
| 71 |
+
"condition_kind": "static",
|
| 72 |
+
"stage": 0,
|
| 73 |
+
"token_limit": 250000,
|
| 74 |
+
"model_name": "wide_L8_H8_D384",
|
| 75 |
+
"n_layer": 8,
|
| 76 |
+
"n_head": 8,
|
| 77 |
+
"n_embd": 384,
|
| 78 |
+
"parameters": 17301504,
|
| 79 |
+
"dropout_initial": 0.14,
|
| 80 |
+
"dropout_final": 0.14,
|
| 81 |
+
"dropout_schedule": "constant",
|
| 82 |
+
"n": 3,
|
| 83 |
+
"mean_train_eval_loss": 4.217446461319923,
|
| 84 |
+
"std_train_eval_loss": 0.03191073719912012,
|
| 85 |
+
"mean_val_eval_loss": 5.4599522848924,
|
| 86 |
+
"std_val_eval_loss": 0.021856544511956278,
|
| 87 |
+
"mean_generalization_gap": 1.2425058235724766,
|
| 88 |
+
"std_generalization_gap": 0.015820136190563192
|
| 89 |
+
},
|
| 90 |
+
{
|
| 91 |
+
"run_mode": "locked_stream",
|
| 92 |
+
"condition": "static_dropout_0.18",
|
| 93 |
+
"condition_kind": "static",
|
| 94 |
+
"stage": 0,
|
| 95 |
+
"token_limit": 250000,
|
| 96 |
+
"model_name": "wide_L8_H8_D384",
|
| 97 |
+
"n_layer": 8,
|
| 98 |
+
"n_head": 8,
|
| 99 |
+
"n_embd": 384,
|
| 100 |
+
"parameters": 17301504,
|
| 101 |
+
"dropout_initial": 0.18,
|
| 102 |
+
"dropout_final": 0.18,
|
| 103 |
+
"dropout_schedule": "constant",
|
| 104 |
+
"n": 3,
|
| 105 |
+
"mean_train_eval_loss": 4.325829481085141,
|
| 106 |
+
"std_train_eval_loss": 0.03854969420994424,
|
| 107 |
+
"mean_val_eval_loss": 5.452850831051667,
|
| 108 |
+
"std_val_eval_loss": 0.02074692690552652,
|
| 109 |
+
"mean_generalization_gap": 1.127021349966526,
|
| 110 |
+
"std_generalization_gap": 0.023996078635556407
|
| 111 |
+
},
|
| 112 |
+
{
|
| 113 |
+
"run_mode": "locked_stream",
|
| 114 |
+
"condition": "static_dropout_0.2",
|
| 115 |
+
"condition_kind": "static",
|
| 116 |
+
"stage": 0,
|
| 117 |
+
"token_limit": 250000,
|
| 118 |
+
"model_name": "wide_L8_H8_D384",
|
| 119 |
+
"n_layer": 8,
|
| 120 |
+
"n_head": 8,
|
| 121 |
+
"n_embd": 384,
|
| 122 |
+
"parameters": 17301504,
|
| 123 |
+
"dropout_initial": 0.2,
|
| 124 |
+
"dropout_final": 0.2,
|
| 125 |
+
"dropout_schedule": "constant",
|
| 126 |
+
"n": 3,
|
| 127 |
+
"mean_train_eval_loss": 4.369402113060157,
|
| 128 |
+
"std_train_eval_loss": 0.03927543942310581,
|
| 129 |
+
"mean_val_eval_loss": 5.455787216623624,
|
| 130 |
+
"std_val_eval_loss": 0.01881333118653956,
|
| 131 |
+
"mean_generalization_gap": 1.0863851035634677,
|
| 132 |
+
"std_generalization_gap": 0.03904945576752281
|
| 133 |
+
},
|
| 134 |
+
{
|
| 135 |
+
"run_mode": "locked_stream",
|
| 136 |
+
"condition": "static_dropout_0.26",
|
| 137 |
+
"condition_kind": "static",
|
| 138 |
+
"stage": 0,
|
| 139 |
+
"token_limit": 250000,
|
| 140 |
+
"model_name": "wide_L8_H8_D384",
|
| 141 |
+
"n_layer": 8,
|
| 142 |
+
"n_head": 8,
|
| 143 |
+
"n_embd": 384,
|
| 144 |
+
"parameters": 17301504,
|
| 145 |
+
"dropout_initial": 0.26,
|
| 146 |
+
"dropout_final": 0.26,
|
| 147 |
+
"dropout_schedule": "constant",
|
| 148 |
+
"n": 3,
|
| 149 |
+
"mean_train_eval_loss": 4.478188027938207,
|
| 150 |
+
"std_train_eval_loss": 0.01079536309478258,
|
| 151 |
+
"mean_val_eval_loss": 5.469121538102627,
|
| 152 |
+
"std_val_eval_loss": 0.021740489819011388,
|
| 153 |
+
"mean_generalization_gap": 0.9909335101644198,
|
| 154 |
+
"std_generalization_gap": 0.015173276757753403
|
| 155 |
+
},
|
| 156 |
+
{
|
| 157 |
+
"run_mode": "locked_stream",
|
| 158 |
+
"condition": "static_dropout_0.3",
|
| 159 |
+
"condition_kind": "static",
|
| 160 |
+
"stage": 0,
|
| 161 |
+
"token_limit": 250000,
|
| 162 |
+
"model_name": "wide_L8_H8_D384",
|
| 163 |
+
"n_layer": 8,
|
| 164 |
+
"n_head": 8,
|
| 165 |
+
"n_embd": 384,
|
| 166 |
+
"parameters": 17301504,
|
| 167 |
+
"dropout_initial": 0.3,
|
| 168 |
+
"dropout_final": 0.3,
|
| 169 |
+
"dropout_schedule": "constant",
|
| 170 |
+
"n": 3,
|
| 171 |
+
"mean_train_eval_loss": 4.570054332415263,
|
| 172 |
+
"std_train_eval_loss": 0.022514684546532972,
|
| 173 |
+
"mean_val_eval_loss": 5.495874742666881,
|
| 174 |
+
"std_val_eval_loss": 0.02049583740381805,
|
| 175 |
+
"mean_generalization_gap": 0.9258204102516174,
|
| 176 |
+
"std_generalization_gap": 0.006480144970797402
|
| 177 |
+
},
|
| 178 |
+
{
|
| 179 |
+
"run_mode": "locked_stream",
|
| 180 |
+
"condition": "formula_wide_l8_h8",
|
| 181 |
+
"condition_kind": "anchor_decay",
|
| 182 |
+
"stage": 1,
|
| 183 |
+
"token_limit": 500000,
|
| 184 |
+
"model_name": "wide_L8_H8_D384",
|
| 185 |
+
"n_layer": 8,
|
| 186 |
+
"n_head": 8,
|
| 187 |
+
"n_embd": 384,
|
| 188 |
+
"parameters": 17301504,
|
| 189 |
+
"dropout_initial": 0.301,
|
| 190 |
+
"dropout_final": 0.02,
|
| 191 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 192 |
+
"n": 3,
|
| 193 |
+
"mean_train_eval_loss": 4.243280505140622,
|
| 194 |
+
"std_train_eval_loss": 0.032274654795716486,
|
| 195 |
+
"mean_val_eval_loss": 5.116849479575952,
|
| 196 |
+
"std_val_eval_loss": 0.013043710081023051,
|
| 197 |
+
"mean_generalization_gap": 0.8735689744353294,
|
| 198 |
+
"std_generalization_gap": 0.036269420616252504
|
| 199 |
+
},
|
| 200 |
+
{
|
| 201 |
+
"run_mode": "locked_stream",
|
| 202 |
+
"condition": "static_dropout_0.02",
|
| 203 |
+
"condition_kind": "static",
|
| 204 |
+
"stage": 1,
|
| 205 |
+
"token_limit": 500000,
|
| 206 |
+
"model_name": "wide_L8_H8_D384",
|
| 207 |
+
"n_layer": 8,
|
| 208 |
+
"n_head": 8,
|
| 209 |
+
"n_embd": 384,
|
| 210 |
+
"parameters": 17301504,
|
| 211 |
+
"dropout_initial": 0.02,
|
| 212 |
+
"dropout_final": 0.02,
|
| 213 |
+
"dropout_schedule": "constant",
|
| 214 |
+
"n": 3,
|
| 215 |
+
"mean_train_eval_loss": 3.5110411569476128,
|
| 216 |
+
"std_train_eval_loss": 0.010828643474845237,
|
| 217 |
+
"mean_val_eval_loss": 5.423485159873962,
|
| 218 |
+
"std_val_eval_loss": 0.007411461831348081,
|
| 219 |
+
"mean_generalization_gap": 1.9124440029263496,
|
| 220 |
+
"std_generalization_gap": 0.011534364701109023
|
| 221 |
+
},
|
| 222 |
+
{
|
| 223 |
+
"run_mode": "locked_stream",
|
| 224 |
+
"condition": "static_dropout_0.08",
|
| 225 |
+
"condition_kind": "static",
|
| 226 |
+
"stage": 1,
|
| 227 |
+
"token_limit": 500000,
|
| 228 |
+
"model_name": "wide_L8_H8_D384",
|
| 229 |
+
"n_layer": 8,
|
| 230 |
+
"n_head": 8,
|
| 231 |
+
"n_embd": 384,
|
| 232 |
+
"parameters": 17301504,
|
| 233 |
+
"dropout_initial": 0.08,
|
| 234 |
+
"dropout_final": 0.08,
|
| 235 |
+
"dropout_schedule": "constant",
|
| 236 |
+
"n": 3,
|
| 237 |
+
"mean_train_eval_loss": 3.760350801050663,
|
| 238 |
+
"std_train_eval_loss": 0.03610450263157525,
|
| 239 |
+
"mean_val_eval_loss": 5.210673009355863,
|
| 240 |
+
"std_val_eval_loss": 0.012193401903637546,
|
| 241 |
+
"mean_generalization_gap": 1.4503222083051999,
|
| 242 |
+
"std_generalization_gap": 0.04270984927105307
|
| 243 |
+
},
|
| 244 |
+
{
|
| 245 |
+
"run_mode": "locked_stream",
|
| 246 |
+
"condition": "static_dropout_0.14",
|
| 247 |
+
"condition_kind": "static",
|
| 248 |
+
"stage": 1,
|
| 249 |
+
"token_limit": 500000,
|
| 250 |
+
"model_name": "wide_L8_H8_D384",
|
| 251 |
+
"n_layer": 8,
|
| 252 |
+
"n_head": 8,
|
| 253 |
+
"n_embd": 384,
|
| 254 |
+
"parameters": 17301504,
|
| 255 |
+
"dropout_initial": 0.14,
|
| 256 |
+
"dropout_final": 0.14,
|
| 257 |
+
"dropout_schedule": "constant",
|
| 258 |
+
"n": 3,
|
| 259 |
+
"mean_train_eval_loss": 3.9472646564245224,
|
| 260 |
+
"std_train_eval_loss": 0.024291367988665397,
|
| 261 |
+
"mean_val_eval_loss": 5.122644076744716,
|
| 262 |
+
"std_val_eval_loss": 0.018661090977288925,
|
| 263 |
+
"mean_generalization_gap": 1.175379420320193,
|
| 264 |
+
"std_generalization_gap": 0.039515834393275315
|
| 265 |
+
},
|
| 266 |
+
{
|
| 267 |
+
"run_mode": "locked_stream",
|
| 268 |
+
"condition": "static_dropout_0.18",
|
| 269 |
+
"condition_kind": "static",
|
| 270 |
+
"stage": 1,
|
| 271 |
+
"token_limit": 500000,
|
| 272 |
+
"model_name": "wide_L8_H8_D384",
|
| 273 |
+
"n_layer": 8,
|
| 274 |
+
"n_head": 8,
|
| 275 |
+
"n_embd": 384,
|
| 276 |
+
"parameters": 17301504,
|
| 277 |
+
"dropout_initial": 0.18,
|
| 278 |
+
"dropout_final": 0.18,
|
| 279 |
+
"dropout_schedule": "constant",
|
| 280 |
+
"n": 3,
|
| 281 |
+
"mean_train_eval_loss": 4.053976848721504,
|
| 282 |
+
"std_train_eval_loss": 0.03228513900248589,
|
| 283 |
+
"mean_val_eval_loss": 5.106844045221806,
|
| 284 |
+
"std_val_eval_loss": 0.013609907255985704,
|
| 285 |
+
"mean_generalization_gap": 1.0528671965003014,
|
| 286 |
+
"std_generalization_gap": 0.03630127590699434
|
| 287 |
+
},
|
| 288 |
+
{
|
| 289 |
+
"run_mode": "locked_stream",
|
| 290 |
+
"condition": "static_dropout_0.2",
|
| 291 |
+
"condition_kind": "static",
|
| 292 |
+
"stage": 1,
|
| 293 |
+
"token_limit": 500000,
|
| 294 |
+
"model_name": "wide_L8_H8_D384",
|
| 295 |
+
"n_layer": 8,
|
| 296 |
+
"n_head": 8,
|
| 297 |
+
"n_embd": 384,
|
| 298 |
+
"parameters": 17301504,
|
| 299 |
+
"dropout_initial": 0.2,
|
| 300 |
+
"dropout_final": 0.2,
|
| 301 |
+
"dropout_schedule": "constant",
|
| 302 |
+
"n": 3,
|
| 303 |
+
"mean_train_eval_loss": 4.126676956812541,
|
| 304 |
+
"std_train_eval_loss": 0.022654445827106268,
|
| 305 |
+
"mean_val_eval_loss": 5.112888875106971,
|
| 306 |
+
"std_val_eval_loss": 0.013689433065470715,
|
| 307 |
+
"mean_generalization_gap": 0.9862119182944298,
|
| 308 |
+
"std_generalization_gap": 0.020663957741316727
|
| 309 |
+
},
|
| 310 |
+
{
|
| 311 |
+
"run_mode": "locked_stream",
|
| 312 |
+
"condition": "static_dropout_0.26",
|
| 313 |
+
"condition_kind": "static",
|
| 314 |
+
"stage": 1,
|
| 315 |
+
"token_limit": 500000,
|
| 316 |
+
"model_name": "wide_L8_H8_D384",
|
| 317 |
+
"n_layer": 8,
|
| 318 |
+
"n_head": 8,
|
| 319 |
+
"n_embd": 384,
|
| 320 |
+
"parameters": 17301504,
|
| 321 |
+
"dropout_initial": 0.26,
|
| 322 |
+
"dropout_final": 0.26,
|
| 323 |
+
"dropout_schedule": "constant",
|
| 324 |
+
"n": 3,
|
| 325 |
+
"mean_train_eval_loss": 4.230804180105527,
|
| 326 |
+
"std_train_eval_loss": 0.0353623783786987,
|
| 327 |
+
"mean_val_eval_loss": 5.113748642305533,
|
| 328 |
+
"std_val_eval_loss": 0.01176242224841787,
|
| 329 |
+
"mean_generalization_gap": 0.8829444622000059,
|
| 330 |
+
"std_generalization_gap": 0.03033151506697832
|
| 331 |
+
},
|
| 332 |
+
{
|
| 333 |
+
"run_mode": "locked_stream",
|
| 334 |
+
"condition": "static_dropout_0.3",
|
| 335 |
+
"condition_kind": "static",
|
| 336 |
+
"stage": 1,
|
| 337 |
+
"token_limit": 500000,
|
| 338 |
+
"model_name": "wide_L8_H8_D384",
|
| 339 |
+
"n_layer": 8,
|
| 340 |
+
"n_head": 8,
|
| 341 |
+
"n_embd": 384,
|
| 342 |
+
"parameters": 17301504,
|
| 343 |
+
"dropout_initial": 0.3,
|
| 344 |
+
"dropout_final": 0.3,
|
| 345 |
+
"dropout_schedule": "constant",
|
| 346 |
+
"n": 3,
|
| 347 |
+
"mean_train_eval_loss": 4.314471215009689,
|
| 348 |
+
"std_train_eval_loss": 0.0074322948465704985,
|
| 349 |
+
"mean_val_eval_loss": 5.135458526511987,
|
| 350 |
+
"std_val_eval_loss": 0.018428372079211774,
|
| 351 |
+
"mean_generalization_gap": 0.8209873115022978,
|
| 352 |
+
"std_generalization_gap": 0.025821376260662585
|
| 353 |
+
},
|
| 354 |
+
{
|
| 355 |
+
"run_mode": "locked_stream",
|
| 356 |
+
"condition": "formula_wide_l8_h8",
|
| 357 |
+
"condition_kind": "anchor_decay",
|
| 358 |
+
"stage": 2,
|
| 359 |
+
"token_limit": 1000000,
|
| 360 |
+
"model_name": "wide_L8_H8_D384",
|
| 361 |
+
"n_layer": 8,
|
| 362 |
+
"n_head": 8,
|
| 363 |
+
"n_embd": 384,
|
| 364 |
+
"parameters": 17301504,
|
| 365 |
+
"dropout_initial": 0.301,
|
| 366 |
+
"dropout_final": 0.02,
|
| 367 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 368 |
+
"n": 3,
|
| 369 |
+
"mean_train_eval_loss": 4.178089372813702,
|
| 370 |
+
"std_train_eval_loss": 0.018981456409242363,
|
| 371 |
+
"mean_val_eval_loss": 4.835277202228705,
|
| 372 |
+
"std_val_eval_loss": 0.012797043787857624,
|
| 373 |
+
"mean_generalization_gap": 0.6571878294150034,
|
| 374 |
+
"std_generalization_gap": 0.015951825016803534
|
| 375 |
+
},
|
| 376 |
+
{
|
| 377 |
+
"run_mode": "locked_stream",
|
| 378 |
+
"condition": "static_dropout_0.02",
|
| 379 |
+
"condition_kind": "static",
|
| 380 |
+
"stage": 2,
|
| 381 |
+
"token_limit": 1000000,
|
| 382 |
+
"model_name": "wide_L8_H8_D384",
|
| 383 |
+
"n_layer": 8,
|
| 384 |
+
"n_head": 8,
|
| 385 |
+
"n_embd": 384,
|
| 386 |
+
"parameters": 17301504,
|
| 387 |
+
"dropout_initial": 0.02,
|
| 388 |
+
"dropout_final": 0.02,
|
| 389 |
+
"dropout_schedule": "constant",
|
| 390 |
+
"n": 3,
|
| 391 |
+
"mean_train_eval_loss": 3.7576290518045425,
|
| 392 |
+
"std_train_eval_loss": 0.022184519488321397,
|
| 393 |
+
"mean_val_eval_loss": 5.05172157784303,
|
| 394 |
+
"std_val_eval_loss": 0.03721037081891111,
|
| 395 |
+
"mean_generalization_gap": 1.2940925260384877,
|
| 396 |
+
"std_generalization_gap": 0.036679450397234824
|
| 397 |
+
},
|
| 398 |
+
{
|
| 399 |
+
"run_mode": "locked_stream",
|
| 400 |
+
"condition": "static_dropout_0.08",
|
| 401 |
+
"condition_kind": "static",
|
| 402 |
+
"stage": 2,
|
| 403 |
+
"token_limit": 1000000,
|
| 404 |
+
"model_name": "wide_L8_H8_D384",
|
| 405 |
+
"n_layer": 8,
|
| 406 |
+
"n_head": 8,
|
| 407 |
+
"n_embd": 384,
|
| 408 |
+
"parameters": 17301504,
|
| 409 |
+
"dropout_initial": 0.08,
|
| 410 |
+
"dropout_final": 0.08,
|
| 411 |
+
"dropout_schedule": "constant",
|
| 412 |
+
"n": 3,
|
| 413 |
+
"mean_train_eval_loss": 3.9233832508325577,
|
| 414 |
+
"std_train_eval_loss": 0.03610191494845503,
|
| 415 |
+
"mean_val_eval_loss": 4.904143842558066,
|
| 416 |
+
"std_val_eval_loss": 0.022913408617663143,
|
| 417 |
+
"mean_generalization_gap": 0.9807605917255083,
|
| 418 |
+
"std_generalization_gap": 0.04537425441423101
|
| 419 |
+
},
|
| 420 |
+
{
|
| 421 |
+
"run_mode": "locked_stream",
|
| 422 |
+
"condition": "static_dropout_0.14",
|
| 423 |
+
"condition_kind": "static",
|
| 424 |
+
"stage": 2,
|
| 425 |
+
"token_limit": 1000000,
|
| 426 |
+
"model_name": "wide_L8_H8_D384",
|
| 427 |
+
"n_layer": 8,
|
| 428 |
+
"n_head": 8,
|
| 429 |
+
"n_embd": 384,
|
| 430 |
+
"parameters": 17301504,
|
| 431 |
+
"dropout_initial": 0.14,
|
| 432 |
+
"dropout_final": 0.14,
|
| 433 |
+
"dropout_schedule": "constant",
|
| 434 |
+
"n": 3,
|
| 435 |
+
"mean_train_eval_loss": 4.04502259939909,
|
| 436 |
+
"std_train_eval_loss": 0.05344041051164446,
|
| 437 |
+
"mean_val_eval_loss": 4.844728792707126,
|
| 438 |
+
"std_val_eval_loss": 0.022009612626586104,
|
| 439 |
+
"mean_generalization_gap": 0.7997061933080355,
|
| 440 |
+
"std_generalization_gap": 0.037832338456288826
|
| 441 |
+
},
|
| 442 |
+
{
|
| 443 |
+
"run_mode": "locked_stream",
|
| 444 |
+
"condition": "static_dropout_0.18",
|
| 445 |
+
"condition_kind": "static",
|
| 446 |
+
"stage": 2,
|
| 447 |
+
"token_limit": 1000000,
|
| 448 |
+
"model_name": "wide_L8_H8_D384",
|
| 449 |
+
"n_layer": 8,
|
| 450 |
+
"n_head": 8,
|
| 451 |
+
"n_embd": 384,
|
| 452 |
+
"parameters": 17301504,
|
| 453 |
+
"dropout_initial": 0.18,
|
| 454 |
+
"dropout_final": 0.18,
|
| 455 |
+
"dropout_schedule": "constant",
|
| 456 |
+
"n": 3,
|
| 457 |
+
"mean_train_eval_loss": 4.115675538778305,
|
| 458 |
+
"std_train_eval_loss": 0.03014400883010748,
|
| 459 |
+
"mean_val_eval_loss": 4.835873966415723,
|
| 460 |
+
"std_val_eval_loss": 0.015885757236925133,
|
| 461 |
+
"mean_generalization_gap": 0.7201984276374181,
|
| 462 |
+
"std_generalization_gap": 0.01601490118944517
|
| 463 |
+
},
|
| 464 |
+
{
|
| 465 |
+
"run_mode": "locked_stream",
|
| 466 |
+
"condition": "static_dropout_0.2",
|
| 467 |
+
"condition_kind": "static",
|
| 468 |
+
"stage": 2,
|
| 469 |
+
"token_limit": 1000000,
|
| 470 |
+
"model_name": "wide_L8_H8_D384",
|
| 471 |
+
"n_layer": 8,
|
| 472 |
+
"n_head": 8,
|
| 473 |
+
"n_embd": 384,
|
| 474 |
+
"parameters": 17301504,
|
| 475 |
+
"dropout_initial": 0.2,
|
| 476 |
+
"dropout_final": 0.2,
|
| 477 |
+
"dropout_schedule": "constant",
|
| 478 |
+
"n": 3,
|
| 479 |
+
"mean_train_eval_loss": 4.169282446304957,
|
| 480 |
+
"std_train_eval_loss": 0.024947280987738584,
|
| 481 |
+
"mean_val_eval_loss": 4.84858646740516,
|
| 482 |
+
"std_val_eval_loss": 0.013971446329696822,
|
| 483 |
+
"mean_generalization_gap": 0.6793040211002032,
|
| 484 |
+
"std_generalization_gap": 0.01948231221870367
|
| 485 |
+
},
|
| 486 |
+
{
|
| 487 |
+
"run_mode": "locked_stream",
|
| 488 |
+
"condition": "static_dropout_0.26",
|
| 489 |
+
"condition_kind": "static",
|
| 490 |
+
"stage": 2,
|
| 491 |
+
"token_limit": 1000000,
|
| 492 |
+
"model_name": "wide_L8_H8_D384",
|
| 493 |
+
"n_layer": 8,
|
| 494 |
+
"n_head": 8,
|
| 495 |
+
"n_embd": 384,
|
| 496 |
+
"parameters": 17301504,
|
| 497 |
+
"dropout_initial": 0.26,
|
| 498 |
+
"dropout_final": 0.26,
|
| 499 |
+
"dropout_schedule": "constant",
|
| 500 |
+
"n": 3,
|
| 501 |
+
"mean_train_eval_loss": 4.26481256633997,
|
| 502 |
+
"std_train_eval_loss": 0.025972383292421113,
|
| 503 |
+
"mean_val_eval_loss": 4.858426225682099,
|
| 504 |
+
"std_val_eval_loss": 0.017668569161382996,
|
| 505 |
+
"mean_generalization_gap": 0.5936136593421301,
|
| 506 |
+
"std_generalization_gap": 0.014821502950969461
|
| 507 |
+
},
|
| 508 |
+
{
|
| 509 |
+
"run_mode": "locked_stream",
|
| 510 |
+
"condition": "static_dropout_0.3",
|
| 511 |
+
"condition_kind": "static",
|
| 512 |
+
"stage": 2,
|
| 513 |
+
"token_limit": 1000000,
|
| 514 |
+
"model_name": "wide_L8_H8_D384",
|
| 515 |
+
"n_layer": 8,
|
| 516 |
+
"n_head": 8,
|
| 517 |
+
"n_embd": 384,
|
| 518 |
+
"parameters": 17301504,
|
| 519 |
+
"dropout_initial": 0.3,
|
| 520 |
+
"dropout_final": 0.3,
|
| 521 |
+
"dropout_schedule": "constant",
|
| 522 |
+
"n": 3,
|
| 523 |
+
"mean_train_eval_loss": 4.334470510482788,
|
| 524 |
+
"std_train_eval_loss": 0.030891434190892946,
|
| 525 |
+
"mean_val_eval_loss": 4.888320103287697,
|
| 526 |
+
"std_val_eval_loss": 0.013246250571144034,
|
| 527 |
+
"mean_generalization_gap": 0.5538495928049088,
|
| 528 |
+
"std_generalization_gap": 0.018512166716729395
|
| 529 |
+
},
|
| 530 |
+
{
|
| 531 |
+
"run_mode": "locked_stream",
|
| 532 |
+
"condition": "formula_wide_l8_h8",
|
| 533 |
+
"condition_kind": "anchor_decay",
|
| 534 |
+
"stage": 3,
|
| 535 |
+
"token_limit": 2000000,
|
| 536 |
+
"model_name": "wide_L8_H8_D384",
|
| 537 |
+
"n_layer": 8,
|
| 538 |
+
"n_head": 8,
|
| 539 |
+
"n_embd": 384,
|
| 540 |
+
"parameters": 17301504,
|
| 541 |
+
"dropout_initial": 0.301,
|
| 542 |
+
"dropout_final": 0.02,
|
| 543 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 544 |
+
"n": 3,
|
| 545 |
+
"mean_train_eval_loss": 4.163138948380947,
|
| 546 |
+
"std_train_eval_loss": 0.013421730028855801,
|
| 547 |
+
"mean_val_eval_loss": 4.621206303437551,
|
| 548 |
+
"std_val_eval_loss": 0.004959867552466137,
|
| 549 |
+
"mean_generalization_gap": 0.45806735505660373,
|
| 550 |
+
"std_generalization_gap": 0.008658546315824916
|
| 551 |
+
},
|
| 552 |
+
{
|
| 553 |
+
"run_mode": "locked_stream",
|
| 554 |
+
"condition": "static_dropout_0.02",
|
| 555 |
+
"condition_kind": "static",
|
| 556 |
+
"stage": 3,
|
| 557 |
+
"token_limit": 2000000,
|
| 558 |
+
"model_name": "wide_L8_H8_D384",
|
| 559 |
+
"n_layer": 8,
|
| 560 |
+
"n_head": 8,
|
| 561 |
+
"n_embd": 384,
|
| 562 |
+
"parameters": 17301504,
|
| 563 |
+
"dropout_initial": 0.02,
|
| 564 |
+
"dropout_final": 0.02,
|
| 565 |
+
"dropout_schedule": "constant",
|
| 566 |
+
"n": 3,
|
| 567 |
+
"mean_train_eval_loss": 4.012869040171306,
|
| 568 |
+
"std_train_eval_loss": 0.019352473408129574,
|
| 569 |
+
"mean_val_eval_loss": 4.749846667051315,
|
| 570 |
+
"std_val_eval_loss": 0.008764765668577094,
|
| 571 |
+
"mean_generalization_gap": 0.73697762688001,
|
| 572 |
+
"std_generalization_gap": 0.020006114758379243
|
| 573 |
+
},
|
| 574 |
+
{
|
| 575 |
+
"run_mode": "locked_stream",
|
| 576 |
+
"condition": "static_dropout_0.08",
|
| 577 |
+
"condition_kind": "static",
|
| 578 |
+
"stage": 3,
|
| 579 |
+
"token_limit": 2000000,
|
| 580 |
+
"model_name": "wide_L8_H8_D384",
|
| 581 |
+
"n_layer": 8,
|
| 582 |
+
"n_head": 8,
|
| 583 |
+
"n_embd": 384,
|
| 584 |
+
"parameters": 17301504,
|
| 585 |
+
"dropout_initial": 0.08,
|
| 586 |
+
"dropout_final": 0.08,
|
| 587 |
+
"dropout_schedule": "constant",
|
| 588 |
+
"n": 3,
|
| 589 |
+
"mean_train_eval_loss": 4.110793483753999,
|
| 590 |
+
"std_train_eval_loss": 0.028893033853370435,
|
| 591 |
+
"mean_val_eval_loss": 4.6680960928400355,
|
| 592 |
+
"std_val_eval_loss": 0.003314491274290642,
|
| 593 |
+
"mean_generalization_gap": 0.5573026090860367,
|
| 594 |
+
"std_generalization_gap": 0.02737432374001718
|
| 595 |
+
},
|
| 596 |
+
{
|
| 597 |
+
"run_mode": "locked_stream",
|
| 598 |
+
"condition": "static_dropout_0.14",
|
| 599 |
+
"condition_kind": "static",
|
| 600 |
+
"stage": 3,
|
| 601 |
+
"token_limit": 2000000,
|
| 602 |
+
"model_name": "wide_L8_H8_D384",
|
| 603 |
+
"n_layer": 8,
|
| 604 |
+
"n_head": 8,
|
| 605 |
+
"n_embd": 384,
|
| 606 |
+
"parameters": 17301504,
|
| 607 |
+
"dropout_initial": 0.14,
|
| 608 |
+
"dropout_final": 0.14,
|
| 609 |
+
"dropout_schedule": "constant",
|
| 610 |
+
"n": 3,
|
| 611 |
+
"mean_train_eval_loss": 4.1672036200761795,
|
| 612 |
+
"std_train_eval_loss": 0.023722898945556545,
|
| 613 |
+
"mean_val_eval_loss": 4.637934118509293,
|
| 614 |
+
"std_val_eval_loss": 0.006874304525877202,
|
| 615 |
+
"mean_generalization_gap": 0.4707304984331131,
|
| 616 |
+
"std_generalization_gap": 0.019934551772559723
|
| 617 |
+
},
|
| 618 |
+
{
|
| 619 |
+
"run_mode": "locked_stream",
|
| 620 |
+
"condition": "static_dropout_0.18",
|
| 621 |
+
"condition_kind": "static",
|
| 622 |
+
"stage": 3,
|
| 623 |
+
"token_limit": 2000000,
|
| 624 |
+
"model_name": "wide_L8_H8_D384",
|
| 625 |
+
"n_layer": 8,
|
| 626 |
+
"n_head": 8,
|
| 627 |
+
"n_embd": 384,
|
| 628 |
+
"parameters": 17301504,
|
| 629 |
+
"dropout_initial": 0.18,
|
| 630 |
+
"dropout_final": 0.18,
|
| 631 |
+
"dropout_schedule": "constant",
|
| 632 |
+
"n": 3,
|
| 633 |
+
"mean_train_eval_loss": 4.21087221801281,
|
| 634 |
+
"std_train_eval_loss": 0.021182682238053856,
|
| 635 |
+
"mean_val_eval_loss": 4.63968871285518,
|
| 636 |
+
"std_val_eval_loss": 0.006246922787675527,
|
| 637 |
+
"mean_generalization_gap": 0.42881649484237033,
|
| 638 |
+
"std_generalization_gap": 0.0234751500917668
|
| 639 |
+
},
|
| 640 |
+
{
|
| 641 |
+
"run_mode": "locked_stream",
|
| 642 |
+
"condition": "static_dropout_0.2",
|
| 643 |
+
"condition_kind": "static",
|
| 644 |
+
"stage": 3,
|
| 645 |
+
"token_limit": 2000000,
|
| 646 |
+
"model_name": "wide_L8_H8_D384",
|
| 647 |
+
"n_layer": 8,
|
| 648 |
+
"n_head": 8,
|
| 649 |
+
"n_embd": 384,
|
| 650 |
+
"parameters": 17301504,
|
| 651 |
+
"dropout_initial": 0.2,
|
| 652 |
+
"dropout_final": 0.2,
|
| 653 |
+
"dropout_schedule": "constant",
|
| 654 |
+
"n": 3,
|
| 655 |
+
"mean_train_eval_loss": 4.245612747967243,
|
| 656 |
+
"std_train_eval_loss": 0.021374004532989366,
|
| 657 |
+
"mean_val_eval_loss": 4.657182261347771,
|
| 658 |
+
"std_val_eval_loss": 0.008606949344244184,
|
| 659 |
+
"mean_generalization_gap": 0.4115695133805275,
|
| 660 |
+
"std_generalization_gap": 0.018893841026177444
|
| 661 |
+
},
|
| 662 |
+
{
|
| 663 |
+
"run_mode": "locked_stream",
|
| 664 |
+
"condition": "static_dropout_0.26",
|
| 665 |
+
"condition_kind": "static",
|
| 666 |
+
"stage": 3,
|
| 667 |
+
"token_limit": 2000000,
|
| 668 |
+
"model_name": "wide_L8_H8_D384",
|
| 669 |
+
"n_layer": 8,
|
| 670 |
+
"n_head": 8,
|
| 671 |
+
"n_embd": 384,
|
| 672 |
+
"parameters": 17301504,
|
| 673 |
+
"dropout_initial": 0.26,
|
| 674 |
+
"dropout_final": 0.26,
|
| 675 |
+
"dropout_schedule": "constant",
|
| 676 |
+
"n": 3,
|
| 677 |
+
"mean_train_eval_loss": 4.315852140386899,
|
| 678 |
+
"std_train_eval_loss": 0.004905814773589955,
|
| 679 |
+
"mean_val_eval_loss": 4.679170399904251,
|
| 680 |
+
"std_val_eval_loss": 0.006978966774156137,
|
| 681 |
+
"mean_generalization_gap": 0.3633182595173518,
|
| 682 |
+
"std_generalization_gap": 0.011733969729614104
|
| 683 |
+
},
|
| 684 |
+
{
|
| 685 |
+
"run_mode": "locked_stream",
|
| 686 |
+
"condition": "static_dropout_0.3",
|
| 687 |
+
"condition_kind": "static",
|
| 688 |
+
"stage": 3,
|
| 689 |
+
"token_limit": 2000000,
|
| 690 |
+
"model_name": "wide_L8_H8_D384",
|
| 691 |
+
"n_layer": 8,
|
| 692 |
+
"n_head": 8,
|
| 693 |
+
"n_embd": 384,
|
| 694 |
+
"parameters": 17301504,
|
| 695 |
+
"dropout_initial": 0.3,
|
| 696 |
+
"dropout_final": 0.3,
|
| 697 |
+
"dropout_schedule": "constant",
|
| 698 |
+
"n": 3,
|
| 699 |
+
"mean_train_eval_loss": 4.366789584358533,
|
| 700 |
+
"std_train_eval_loss": 0.025032839092378575,
|
| 701 |
+
"mean_val_eval_loss": 4.710932297011216,
|
| 702 |
+
"std_val_eval_loss": 0.003280655243995376,
|
| 703 |
+
"mean_generalization_gap": 0.34414271265268326,
|
| 704 |
+
"std_generalization_gap": 0.0238351561114313
|
| 705 |
+
},
|
| 706 |
+
{
|
| 707 |
+
"run_mode": "locked_stream",
|
| 708 |
+
"condition": "formula_wide_l8_h8",
|
| 709 |
+
"condition_kind": "anchor_decay",
|
| 710 |
+
"stage": 4,
|
| 711 |
+
"token_limit": 4000000,
|
| 712 |
+
"model_name": "wide_L8_H8_D384",
|
| 713 |
+
"n_layer": 8,
|
| 714 |
+
"n_head": 8,
|
| 715 |
+
"n_embd": 384,
|
| 716 |
+
"parameters": 17301504,
|
| 717 |
+
"dropout_initial": 0.301,
|
| 718 |
+
"dropout_final": 0.02,
|
| 719 |
+
"dropout_schedule": "log_prefix_anchor",
|
| 720 |
+
"n": 3,
|
| 721 |
+
"mean_train_eval_loss": 4.15141686052084,
|
| 722 |
+
"std_train_eval_loss": 0.00864907460577688,
|
| 723 |
+
"mean_val_eval_loss": 4.4657708530624705,
|
| 724 |
+
"std_val_eval_loss": 0.006533212137650865,
|
| 725 |
+
"mean_generalization_gap": 0.31435399254163104,
|
| 726 |
+
"std_generalization_gap": 0.0021895349788719713
|
| 727 |
+
},
|
| 728 |
+
{
|
| 729 |
+
"run_mode": "locked_stream",
|
| 730 |
+
"condition": "static_dropout_0.02",
|
| 731 |
+
"condition_kind": "static",
|
| 732 |
+
"stage": 4,
|
| 733 |
+
"token_limit": 4000000,
|
| 734 |
+
"model_name": "wide_L8_H8_D384",
|
| 735 |
+
"n_layer": 8,
|
| 736 |
+
"n_head": 8,
|
| 737 |
+
"n_embd": 384,
|
| 738 |
+
"parameters": 17301504,
|
| 739 |
+
"dropout_initial": 0.02,
|
| 740 |
+
"dropout_final": 0.02,
|
| 741 |
+
"dropout_schedule": "constant",
|
| 742 |
+
"n": 3,
|
| 743 |
+
"mean_train_eval_loss": 4.149302105108897,
|
| 744 |
+
"std_train_eval_loss": 0.015932906479109284,
|
| 745 |
+
"mean_val_eval_loss": 4.542556785047054,
|
| 746 |
+
"std_val_eval_loss": 0.005111427760726518,
|
| 747 |
+
"mean_generalization_gap": 0.3932546799381574,
|
| 748 |
+
"std_generalization_gap": 0.010983082646486758
|
| 749 |
+
},
|
| 750 |
+
{
|
| 751 |
+
"run_mode": "locked_stream",
|
| 752 |
+
"condition": "static_dropout_0.08",
|
| 753 |
+
"condition_kind": "static",
|
| 754 |
+
"stage": 4,
|
| 755 |
+
"token_limit": 4000000,
|
| 756 |
+
"model_name": "wide_L8_H8_D384",
|
| 757 |
+
"n_layer": 8,
|
| 758 |
+
"n_head": 8,
|
| 759 |
+
"n_embd": 384,
|
| 760 |
+
"parameters": 17301504,
|
| 761 |
+
"dropout_initial": 0.08,
|
| 762 |
+
"dropout_final": 0.08,
|
| 763 |
+
"dropout_schedule": "constant",
|
| 764 |
+
"n": 3,
|
| 765 |
+
"mean_train_eval_loss": 4.177745779355367,
|
| 766 |
+
"std_train_eval_loss": 0.0073518511940769,
|
| 767 |
+
"mean_val_eval_loss": 4.498890255888303,
|
| 768 |
+
"std_val_eval_loss": 0.004325913851224281,
|
| 769 |
+
"mean_generalization_gap": 0.3211444765329361,
|
| 770 |
+
"std_generalization_gap": 0.0045681171466252805
|
| 771 |
+
},
|
| 772 |
+
{
|
| 773 |
+
"run_mode": "locked_stream",
|
| 774 |
+
"condition": "static_dropout_0.14",
|
| 775 |
+
"condition_kind": "static",
|
| 776 |
+
"stage": 4,
|
| 777 |
+
"token_limit": 4000000,
|
| 778 |
+
"model_name": "wide_L8_H8_D384",
|
| 779 |
+
"n_layer": 8,
|
| 780 |
+
"n_head": 8,
|
| 781 |
+
"n_embd": 384,
|
| 782 |
+
"parameters": 17301504,
|
| 783 |
+
"dropout_initial": 0.14,
|
| 784 |
+
"dropout_final": 0.14,
|
| 785 |
+
"dropout_schedule": "constant",
|
| 786 |
+
"n": 3,
|
| 787 |
+
"mean_train_eval_loss": 4.221429777642091,
|
| 788 |
+
"std_train_eval_loss": 0.012821970261243815,
|
| 789 |
+
"mean_val_eval_loss": 4.494626219073932,
|
| 790 |
+
"std_val_eval_loss": 0.008657044012355454,
|
| 791 |
+
"mean_generalization_gap": 0.27319644143184024,
|
| 792 |
+
"std_generalization_gap": 0.006086902797186131
|
| 793 |
+
},
|
| 794 |
+
{
|
| 795 |
+
"run_mode": "locked_stream",
|
| 796 |
+
"condition": "static_dropout_0.18",
|
| 797 |
+
"condition_kind": "static",
|
| 798 |
+
"stage": 4,
|
| 799 |
+
"token_limit": 4000000,
|
| 800 |
+
"model_name": "wide_L8_H8_D384",
|
| 801 |
+
"n_layer": 8,
|
| 802 |
+
"n_head": 8,
|
| 803 |
+
"n_embd": 384,
|
| 804 |
+
"parameters": 17301504,
|
| 805 |
+
"dropout_initial": 0.18,
|
| 806 |
+
"dropout_final": 0.18,
|
| 807 |
+
"dropout_schedule": "constant",
|
| 808 |
+
"n": 3,
|
| 809 |
+
"mean_train_eval_loss": 4.242261836926143,
|
| 810 |
+
"std_train_eval_loss": 0.007533414644039237,
|
| 811 |
+
"mean_val_eval_loss": 4.4967766006787615,
|
| 812 |
+
"std_val_eval_loss": 0.005175633970589464,
|
| 813 |
+
"mean_generalization_gap": 0.25451476375261944,
|
| 814 |
+
"std_generalization_gap": 0.005043870704268704
|
| 815 |
+
},
|
| 816 |
+
{
|
| 817 |
+
"run_mode": "locked_stream",
|
| 818 |
+
"condition": "static_dropout_0.2",
|
| 819 |
+
"condition_kind": "static",
|
| 820 |
+
"stage": 4,
|
| 821 |
+
"token_limit": 4000000,
|
| 822 |
+
"model_name": "wide_L8_H8_D384",
|
| 823 |
+
"n_layer": 8,
|
| 824 |
+
"n_head": 8,
|
| 825 |
+
"n_embd": 384,
|
| 826 |
+
"parameters": 17301504,
|
| 827 |
+
"dropout_initial": 0.2,
|
| 828 |
+
"dropout_final": 0.2,
|
| 829 |
+
"dropout_schedule": "constant",
|
| 830 |
+
"n": 3,
|
| 831 |
+
"mean_train_eval_loss": 4.279090583324432,
|
| 832 |
+
"std_train_eval_loss": 0.006720524978756697,
|
| 833 |
+
"mean_val_eval_loss": 4.51753082126379,
|
| 834 |
+
"std_val_eval_loss": 0.008461337427952602,
|
| 835 |
+
"mean_generalization_gap": 0.23844023793935776,
|
| 836 |
+
"std_generalization_gap": 0.005697079794179711
|
| 837 |
+
},
|
| 838 |
+
{
|
| 839 |
+
"run_mode": "locked_stream",
|
| 840 |
+
"condition": "static_dropout_0.26",
|
| 841 |
+
"condition_kind": "static",
|
| 842 |
+
"stage": 4,
|
| 843 |
+
"token_limit": 4000000,
|
| 844 |
+
"model_name": "wide_L8_H8_D384",
|
| 845 |
+
"n_layer": 8,
|
| 846 |
+
"n_head": 8,
|
| 847 |
+
"n_embd": 384,
|
| 848 |
+
"parameters": 17301504,
|
| 849 |
+
"dropout_initial": 0.26,
|
| 850 |
+
"dropout_final": 0.26,
|
| 851 |
+
"dropout_schedule": "constant",
|
| 852 |
+
"n": 3,
|
| 853 |
+
"mean_train_eval_loss": 4.324253092209498,
|
| 854 |
+
"std_train_eval_loss": 0.007220470805599103,
|
| 855 |
+
"mean_val_eval_loss": 4.541137516498566,
|
| 856 |
+
"std_val_eval_loss": 0.004647242294749092,
|
| 857 |
+
"mean_generalization_gap": 0.2168844242890676,
|
| 858 |
+
"std_generalization_gap": 0.0030328214421115702
|
| 859 |
+
},
|
| 860 |
+
{
|
| 861 |
+
"run_mode": "locked_stream",
|
| 862 |
+
"condition": "static_dropout_0.3",
|
| 863 |
+
"condition_kind": "static",
|
| 864 |
+
"stage": 4,
|
| 865 |
+
"token_limit": 4000000,
|
| 866 |
+
"model_name": "wide_L8_H8_D384",
|
| 867 |
+
"n_layer": 8,
|
| 868 |
+
"n_head": 8,
|
| 869 |
+
"n_embd": 384,
|
| 870 |
+
"parameters": 17301504,
|
| 871 |
+
"dropout_initial": 0.3,
|
| 872 |
+
"dropout_final": 0.3,
|
| 873 |
+
"dropout_schedule": "constant",
|
| 874 |
+
"n": 3,
|
| 875 |
+
"mean_train_eval_loss": 4.37708696226279,
|
| 876 |
+
"std_train_eval_loss": 0.013672763672564944,
|
| 877 |
+
"mean_val_eval_loss": 4.575443473954995,
|
| 878 |
+
"std_val_eval_loss": 0.006430238324620357,
|
| 879 |
+
"mean_generalization_gap": 0.19835651169220606,
|
| 880 |
+
"std_generalization_gap": 0.008034807259295824
|
| 881 |
+
}
|
| 882 |
+
]
|
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/trace.jsonl
ADDED
|
@@ -0,0 +1,240 @@
|
| 1 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.435223579406738}
|
| 2 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.461259841918945}
|
| 3 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.799189567565918}
|
| 4 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.339989185333252}
|
| 5 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.445383071899414}
|
| 6 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.370175838470459}
|
| 7 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.557982921600342}
|
| 8 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.40201473236084}
|
| 9 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.392435550689697}
|
| 10 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.2655863761901855}
|
| 11 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.108766555786133}
|
| 12 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.6994476318359375}
|
| 13 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.655489921569824}
|
| 14 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.28831672668457}
|
| 15 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.475219249725342}
|
| 16 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.06334114074707}
|
| 17 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.254112243652344}
|
| 18 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.392305374145508}
|
| 19 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.22216796875}
|
| 20 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.331225872039795}
|
| 21 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.339109420776367}
|
| 22 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.7740654945373535}
|
| 23 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.636873245239258}
|
| 24 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.1092705726623535}
|
| 25 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.257920742034912}
|
| 26 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.312147617340088}
|
| 27 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.455212593078613}
|
| 28 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.083148002624512}
|
| 29 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.338054656982422}
|
| 30 |
+
{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.061500549316406}
|
| 31 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.998419761657715}
|
| 32 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.831916570663452}
|
| 33 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.281755447387695}
|
| 34 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.6908276081085205}
|
| 35 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9898693561553955}
|
| 36 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.6737847328186035}
|
| 37 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.473372459411621}
|
| 38 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.060349464416504}
|
| 39 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.2894182205200195}
|
| 40 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.323214530944824}
|
| 41 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.905608654022217}
|
| 42 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.846677303314209}
|
| 43 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.216981887817383}
|
| 44 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.469115972518921}
|
| 45 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.021525859832764}
|
| 46 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.9317455291748047}
|
| 47 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.007565975189209}
|
| 48 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.145359516143799}
|
| 49 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.317673206329346}
|
| 50 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.104133129119873}
|
| 51 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.906655788421631}
|
| 52 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.7551283836364746}
|
| 53 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.161432266235352}
|
| 54 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.522247314453125}
|
| 55 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9082343578338623}
|
| 56 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.8998873233795166}
|
| 57 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.303224086761475}
|
| 58 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.8002028465270996}
|
| 59 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.325773239135742}
|
| 60 |
+
{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.062686920166016}
|
| 61 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.93206787109375}
|
| 62 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.3121185302734375}
|
| 63 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.280060291290283}
|
| 64 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.8869738578796387}
|
| 65 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.202347755432129}
|
| 66 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.9108357429504395}
|
| 67 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.140951156616211}
|
| 68 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.277665138244629}
|
| 69 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.203189849853516}
|
| 70 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.201048374176025}
|
| 71 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.783326148986816}
|
| 72 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.1509013175964355}
|
| 73 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.0230302810668945}
|
| 74 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.867222309112549}
|
| 75 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.463627815246582}
|
| 76 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.031224250793457}
|
| 77 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.441164970397949}
|
| 78 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.2578840255737305}
|
| 79 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.3251543045043945}
|
| 80 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.138848304748535}
|
| 81 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.168774604797363}
|
| 82 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.213794708251953}
|
| 83 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.474993705749512}
|
| 84 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.008660316467285}
|
| 85 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.217497825622559}
|
| 86 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.963630199432373}
|
| 87 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.334596633911133}
|
| 88 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.332503795623779}
|
| 89 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.172324180603027}
|
| 90 |
+
{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.332511901855469}
|
| 91 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.201639652252197}
|
| 92 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.2807817459106445}
|
| 93 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.174047470092773}
|
| 94 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.343637943267822}
|
| 95 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.594155311584473}
|
| 96 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.408194541931152}
|
| 97 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.672398567199707}
|
| 98 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.2894206047058105}
|
| 99 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.316445827484131}
|
| 100 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.3651604652404785}
|
| 101 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.319988250732422}
|
| 102 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.285619735717773}
|
| 103 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.278521537780762}
|
| 104 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.307913780212402}
|
| 105 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.268133163452148}
|
| 106 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.097533226013184}
|
| 107 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.153548240661621}
|
| 108 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.320278644561768}
|
| 109 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.55961275100708}
|
| 110 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.196667671203613}
|
| 111 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.9843902587890625}
|
| 112 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.543873310089111}
|
| 113 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.703161239624023}
|
| 114 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.026454448699951}
|
| 115 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.307652473449707}
|
| 116 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.160324573516846}
|
| 117 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.454224109649658}
|
| 118 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.122490406036377}
|
| 119 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.278357028961182}
|
| 120 |
+
{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.3369245529174805}
|
| 121 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.15626335144043}
|
| 122 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.235060214996338}
|
| 123 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.561727523803711}
|
| 124 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.190820693969727}
|
| 125 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.500772476196289}
|
| 126 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.60369348526001}
|
| 127 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.401540756225586}
|
| 128 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.394259929656982}
|
| 129 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.282741546630859}
|
| 130 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.539054870605469}
|
| 131 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.222869873046875}
|
| 132 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.384670734405518}
|
| 133 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.495967864990234}
|
| 134 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.205634117126465}
|
| 135 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.476202964782715}
|
| 136 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.3069024085998535}
|
| 137 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.370694160461426}
|
| 138 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.197530746459961}
|
| 139 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.3027873039245605}
|
| 140 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.284463405609131}
|
| 141 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.250606060028076}
|
| 142 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.598309516906738}
|
| 143 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.485733509063721}
|
| 144 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.279513359069824}
|
| 145 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.5862016677856445}
|
| 146 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.1803812980651855}
|
| 147 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.476117134094238}
|
| 148 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.316293716430664}
|
| 149 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.311079978942871}
|
| 150 |
+
{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.447503566741943}
|
| 151 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.2886962890625}
|
| 152 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.446702003479004}
|
| 153 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.795304298400879}
|
| 154 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.462212562561035}
|
| 155 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.393544673919678}
|
| 156 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.44389009475708}
|
| 157 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.52379035949707}
|
| 158 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.521040439605713}
|
| 159 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.324027061462402}
|
| 160 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.3808183670043945}
|
| 161 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.325625419616699}
|
| 162 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.610562324523926}
|
| 163 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.611095905303955}
|
| 164 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.277379035949707}
|
| 165 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.572210311889648}
|
| 166 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.454376220703125}
|
| 167 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.59763240814209}
|
| 168 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.377513885498047}
|
| 169 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.411192893981934}
|
| 170 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.475866794586182}
|
| 171 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.290244102478027}
|
| 172 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.592159271240234}
|
| 173 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.338376998901367}
|
| 174 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.286029815673828}
|
| 175 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.403045177459717}
|
| 176 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.145584583282471}
|
| 177 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.335721969604492}
|
| 178 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.275428771972656}
|
| 179 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.567635536193848}
|
| 180 |
+
{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.303030967712402}
|
| 181 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.306872367858887}
|
| 182 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.588815689086914}
|
| 183 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.641493797302246}
|
| 184 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.4486284255981445}
|
| 185 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.442417621612549}
|
| 186 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.541705131530762}
|
| 187 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.413729190826416}
|
| 188 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.310101509094238}
|
| 189 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.322086334228516}
|
| 190 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.295492172241211}
|
| 191 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.243927001953125}
|
| 192 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.5635986328125}
|
| 193 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.689737319946289}
|
| 194 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.5332489013671875}
|
| 195 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.609074592590332}
|
| 196 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.513686656951904}
|
| 197 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.5128173828125}
|
| 198 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.520935535430908}
|
| 199 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.397536277770996}
|
| 200 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.469289302825928}
|
| 201 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.171591281890869}
|
| 202 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.668099403381348}
|
| 203 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.4507832527160645}
|
| 204 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.168695449829102}
|
| 205 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.338339805603027}
|
| 206 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.411451816558838}
|
| 207 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.387897491455078}
|
| 208 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.3449296951293945}
|
| 209 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.204782485961914}
|
| 210 |
+
{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.2295427322387695}
|
| 211 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.369487762451172}
|
| 212 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.801484107971191}
|
| 213 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.613731384277344}
|
| 214 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.409400939941406}
|
| 215 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.659245491027832}
|
| 216 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.452445030212402}
|
| 217 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.58389139175415}
|
| 218 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.41795539855957}
|
| 219 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.648049354553223}
|
| 220 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.556178569793701}
|
| 221 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.342921257019043}
|
| 222 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.781554222106934}
|
| 223 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.746016025543213}
|
| 224 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.599480628967285}
|
| 225 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.711769104003906}
|
| 226 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.467323303222656}
|
| 227 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.58270263671875}
|
| 228 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.398643970489502}
|
| 229 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.496487617492676}
|
| 230 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.4059600830078125}
|
| 231 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.600184440612793}
|
| 232 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.708844184875488}
|
| 233 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.56709098815918}
|
| 234 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.4110612869262695}
|
| 235 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.634671211242676}
|
| 236 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.227794647216797}
|
| 237 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.368139743804932}
|
| 238 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.518902778625488}
|
| 239 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.640765190124512}
|
| 240 |
+
{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.421485900878906}
|