Add width-heavy architecture holdout results

Files changed (9)

README.md +2 -1
REPRODUCING.md +10 -3
docs/dropout_decay_research_report_v2.md +747 -0
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/RESULT_SUMMARY.md +86 -0
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/config.json +131 -0
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/metrics.jsonl +120 -0
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/summary.csv +41 -0
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/summary.json +882 -0
runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/trace.jsonl +240 -0

README.md CHANGED

@@ -67,7 +67,8 @@ Every run writes:
 Old exploratory outputs are archived under `archive/`.
-For exact headline reproduction, see `REPRODUCING.md`.
 ## Step 1: Cheap Static Screen

 Old exploratory outputs are archived under `archive/`.
+For exact headline reproduction, see `REPRODUCING.md`. For the current
+engineer-facing research summary, see `docs/dropout_decay_research_report_v2.md`.
 ## Step 1: Cheap Static Screen

REPRODUCING.md CHANGED

@@ -177,9 +177,10 @@ formula final validation:     4.5286 +/- 0.0118
 best static final validation: 4.5564 +/- 0.0127
 ```
-## Next Unrun Holdout
-The next planned holdout is the width-heavy architecture test:
 ```bash
 PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
@@ -206,7 +207,13 @@ PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
   --grad-clip 1.0
 ```
-Expected runtime on the current MPS setup is about 2.5-3.5 hours.
 ## Notes for Publication

 best static final validation: 4.5564 +/- 0.0127
 ```
+## Reproduce Width-Heavy Holdout
+The width-heavy architecture holdout is the paired complement to the deep/narrow
+holdout above:
 ```bash
 PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
   --grad-clip 1.0
 ```
+Completed reference result:
+```text
+formula final validation:     4.4658 +/- 0.0065
+best static final validation: 4.4946 +/- 0.0087
+best mean trajectory:        static 0.18, 4.9064 vs formula 4.9073
+```
 ## Notes for Publication

docs/dropout_decay_research_report_v2.md ADDED

@@ -0,0 +1,747 @@

+# Dropout Decay in Expanding-Stream Language Model Training
+Date: 2026-05-28
+This version is written for an AI/ML engineer reading the project for the first
+time. It keeps the strongest empirical claims from the original report, but
+adds the missing context needed to understand what was actually trained, what
+the streaming setup means, how dropout schedules were applied, and which claims
+are safe.
+## Executive Summary
+This project studies whether dropout should be scheduled from measurable
+streaming-data pressure rather than held fixed during causal language-model
+training.
+The setting is simulated expanding-stream training. For each seed and condition,
+one model and optimizer are trained continuously across five stream prefixes:
+`250k`, `500k`, `1M`, `2M`, and `4M` unique training tokens. At each stage,
+random 128-token windows are sampled from the currently available prefix. Early
+stages therefore revisit the same tokens many times, while later stages expose
+the same model to more unique data.
+The original broad hypothesis was too simple:
+> Start with very high dropout on a small stream prefix, then decay dropout as
+> more stream data arrives.
+The experiments rejected that version. A high-dropout schedule such as
+`0.8 -> 0.1` was worse than static low dropout. The supported claim is narrower:
+> When the best static dropout changes with stream prefix size, a pressure-aware
+> prefix schedule can beat any single fixed dropout at final validation loss.
+The current empirical law uses two pressure terms:
+```text
+p = clamp(0.02, 0.65,
+          0.154 * log10(params / unique_tokens)
+        + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
+        - 0.210)
+```
+Across the completed headline validation runs, the formula schedule wins
+`21/21` paired final-loss comparisons across five model sizes and two
+architecture-shape holdouts. The evidence supports final-validation improvement
+under this nanochat-style Transformer and expanding-prefix protocol. It does not
+yet establish a universal dropout law across datasets, architectures, or
+training scales, and the width-heavy holdout shows that the current formula can
+overestimate the best early-prefix dropout for some architecture shapes.
+## System Under Test
+The implementation is derived from Andrej Karpathy's `nanochat` project and
+keeps only the pieces needed for controlled dropout experiments:
+- BPE-style tokenizer with a 4,096-token vocabulary.
+- Nanochat-style causal Transformer.
+- RMSNorm, rotary attention, bias-free linear layers, and squared-ReLU MLPs.
+- Dropout control over embedding dropout, attention dropout, residual dropout,
+  and MLP dropout.
+- MPS-only Torch execution.
+- Expanding-prefix training loops for simulated streaming.
+The original nanochat MIT copyright and permission notice are retained in the
+derived source files and project license.
+### Model Family
+The main model family changes depth and width while keeping the same basic
+Transformer design:
+| Name | Shape | Params | Role |
+|---|---:|---:|---|
+| `L8_H8_D256` | `8x8x256` | 8.39M | Small boundary case |
+| `L10_H8_D288` | `10x8x288` | 12.31M | Interpolation check |
+| `L12_H8_D320` | `12x8x320` | 17.37M | Main mid-scale validation |
+| `L14_H8_D352` | `14x8x352` | 23.70M | Interpolation check |
+| `L16_H8_D384` | `16x8x384` | 31.46M | Larger validation model |
+| `deep_narrow_L18_H8_D256` | `18x8x256` | 16.25M | Architecture-shape holdout |
+| `wide_L8_H8_D384` | `8x8x384` | 17.30M | Width-heavy architecture holdout |
+The shape notation is `layers x heads x embedding dimension`.
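The parameter counts in the table are consistent with a simple back-of-the-envelope estimate. The sketch below is not taken from the project source. It assumes untied input and output embeddings, a 4x MLP expansion, bias-free linear layers, and it ignores normalization parameters; `estimate_params` is a hypothetical helper name. Head count does not enter the estimate because it only partitions the embedding dimension.

```python
VOCAB_SIZE = 4096  # vocabulary size reported for the headline runs


def estimate_params(layers: int, d_model: int, vocab: int = VOCAB_SIZE) -> int:
    """Rough parameter count under the assumptions stated in the lead-in."""
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    mlp = 8 * d_model * d_model         # d -> 4d and 4d -> d, no biases (assumed)
    embeddings = 2 * vocab * d_model    # token embedding + untied LM head (assumed)
    return layers * (attention + mlp) + embeddings


# (layers, embedding dimension) for each model in the table above.
SHAPES = {
    "L8_H8_D256": (8, 256),
    "L10_H8_D288": (10, 288),
    "L12_H8_D320": (12, 320),
    "L14_H8_D352": (14, 352),
    "L16_H8_D384": (16, 384),
    "deep_narrow_L18_H8_D256": (18, 256),
    "wide_L8_H8_D384": (8, 384),
}

for name, (layers, d_model) in SHAPES.items():
    print(f"{name}: {estimate_params(layers, d_model) / 1e6:.2f}M")
```

Under these assumptions the estimate agrees with the `Params` column to two decimals, which also shows why the deep/narrow and width-heavy holdouts land near the L12 parameter count despite very different shapes.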
+### Training Configuration
+Unless otherwise stated, headline runs use:
+| Field | Value |
+|---|---:|
+| Device | `mps` only |
+| Vocab size | `4096` |
+| Block size | `128` |
+| Batch size | `16` |
+| Tokens per optimizer step | `2048` |
+| Optimizer | AdamW |
+| Learning rate | `0.0003` |
+| Adam betas | `(0.9, 0.95)` |
+| Weight decay | `0.1` |
+| Gradient clipping | `1.0` |
+| Validation batches | `64` |
+| Train-eval batches | `32` |
+| Seeds | usually `1, 2, 3` |
+The runner refuses CPU and CUDA experiment execution. It also exits if
+`PYTORCH_ENABLE_MPS_FALLBACK=1` is set, because fallback could silently run some
+Torch operations on CPU.
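A minimal sketch of the kind of guard described above. The function name `require_mps` and the messages are illustrative and are not the project's code; the real runner may additionally query Torch for MPS availability.

```python
import os
from typing import Mapping, Optional


def require_mps(device: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Reject non-MPS devices and the MPS CPU-fallback switch (hypothetical helper)."""
    env = os.environ if env is None else env
    if device != "mps":
        raise SystemExit(f"refusing to run on device {device!r}; only 'mps' is allowed")
    if env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1":
        # With fallback enabled, unsupported ops could silently execute on CPU.
        raise SystemExit("PYTORCH_ENABLE_MPS_FALLBACK=1 is set; unset it before running")
    return device


if __name__ == "__main__":
    print(require_mps("mps", env={}))
```

Passing the environment as an argument keeps the check easy to unit test without mutating `os.environ`.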
+## Data and Reproducibility Context
+The completed runs use a project-local cache:
+```text
+.cache/dropout_decay/tokenizer-v4096.json
+.cache/dropout_decay/tokens-v4096-uint16.npy
+```
+The completed run configs record:
+```text
+train tokens:      5,000,970
+validation tokens:   500,000
+vocab size:             4,096
+```
+The configs also show that the cache was built from a local parquet source named
+`base_data_climbmix/shard_*.parquet`. The binary token cache is intentionally
+not committed. That makes exact reproduction dependent on publishing or
+reconstructing the cache. This is the largest reproducibility gap in the current
+artifact.
+The runner supports two data paths:
+- `--use-cached-data --cache-dir .cache/dropout_decay`
+- `--corpus` or `--corpus-glob` to rebuild the cache from raw text or parquet
+For exact commands and environment setup, see `REPRODUCING.md`.
+## What "Expanding Stream" Means Here
+This project does not stream examples from an online service. It simulates a
+stream by revealing a larger prefix of a fixed token array at each stage.
+For a locked-stream run:
+1. Create one model and optimizer for a given seed and condition.
+2. Train stage 0 by sampling batches from the first `250k` training tokens.
+3. Continue the same model and optimizer into stage 1, now sampling from the
+   first `500k` tokens.
+4. Repeat for `1M`, `2M`, and `4M` prefixes.
+5. Evaluate train and validation loss at the end of every stage.
+This distinction matters because early prefixes are sampled repeatedly. With
+`1000` steps per stage, the model consumes:
+```text
+1000 steps * 16 examples/step * 128 tokens/example = 2,048,000 sampled tokens
+```
+At the `250k` prefix, that is about `8.19x` the available unique-token count
+within the first stage alone. At the `4M` prefix, the repeated-sampling pressure
+is much lower. This pressure change is the reason a fixed dropout value can be
+suboptimal across the full stream trajectory.
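The same arithmetic can be extended to every stage. The sketch below uses the configuration values from this report and computes two ratios per stage: sampled tokens within the stage divided by the prefix size, and cumulative sampled tokens divided by the prefix size. Function and variable names are illustrative.

```python
BATCH_SIZE = 16
BLOCK_SIZE = 128
STAGE_STEPS = 1000
PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def sampling_pressure(prefixes=PREFIXES, stage_steps=STAGE_STEPS):
    """Return (prefix, within-stage ratio, cumulative ratio) for each stage."""
    tokens_per_stage = stage_steps * BATCH_SIZE * BLOCK_SIZE
    rows = []
    for stage, prefix in enumerate(prefixes, start=1):
        within_stage = tokens_per_stage / prefix
        cumulative = stage * tokens_per_stage / prefix
        rows.append((prefix, within_stage, cumulative))
    return rows


for prefix, within_stage, cumulative in sampling_pressure():
    print(f"{prefix:>9,d}  within-stage {within_stage:6.3f}x  cumulative {cumulative:6.3f}x")
```

The within-stage ratio halves at every stage, from `8.192x` at `250k` down to `0.512x` at `4M`, which is the pressure change the paragraph above describes.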
+## Dropout Schedule Semantics
+There are two schedule types in the codebase:
+- Static schedules keep one dropout value for all stages.
+- Anchor schedules choose dropout from the current stream prefix.
+The headline formula runs use anchor schedules. In those runs, the stage
+prefixes are exactly the anchor points, so dropout is constant within each
+stage:
+```text
+250k stage -> p_250k
+500k stage -> p_500k
+1M stage   -> p_1M
+2M stage   -> p_2M
+4M stage   -> p_4M
+```
+The implementation supports log interpolation between anchors for intermediate
+prefix sizes, but the main reported experiments evaluate exactly at the anchor
+prefixes. Therefore, these experiments validate a prefix-aware dropout path, not
+a continuously changing per-step dropout curve. The empirical formula is used to
+precompute these anchor values, which are then passed to the runner as
+`--anchor-decays`.
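The project's interpolation code is not reproduced in this report, so the following is only a sketch of what log interpolation between anchors could look like. It assumes dropout is interpolated linearly in `log10(prefix tokens)` and held constant outside the anchor range; `anchor_dropout` is a hypothetical name, and the anchor values are the L8 formula path reported in this report.

```python
import math
from bisect import bisect_right

# (prefix tokens, dropout) anchors: the L8 formula path from this report.
L8_ANCHORS = [
    (250_000, 0.252),
    (500_000, 0.206),
    (1_000_000, 0.129),
    (2_000_000, 0.038),
    (4_000_000, 0.020),
]


def anchor_dropout(prefix_tokens: float, anchors=L8_ANCHORS) -> float:
    """Interpolate dropout linearly in log10(prefix); clamp outside the anchors."""
    xs = [math.log10(tokens) for tokens, _ in anchors]
    ps = [p for _, p in anchors]
    x = math.log10(prefix_tokens)
    if x <= xs[0]:
        return ps[0]
    if x >= xs[-1]:
        return ps[-1]
    i = bisect_right(xs, x) - 1          # index of the anchor at or below x
    frac = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ps[i] + frac * (ps[i + 1] - ps[i])


print(anchor_dropout(500_000))   # exactly on an anchor
print(anchor_dropout(750_000))   # between the 500k and 1M anchors
```

When every stage prefix coincides with an anchor, as in the headline runs, the interpolation branch is never exercised and each stage simply receives its anchor value.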
+## Metrics
+The report uses four recurring metrics:
+| Metric | Meaning |
+|---|---|
+| Final validation loss | Validation loss after the final `4M` prefix stage |
+| Mean trajectory validation loss | Average validation loss across all prefix stages |
+| Final train-validation gap | Final validation loss minus final train-eval loss |
+| Paired final delta | Formula final validation loss minus the best static final validation loss for the same seed |
+Lower validation loss is better. The gap is diagnostic, not the optimization
+objective. A smaller gap can mean useful regularization, but it can also mean
+underfitting if both train and validation losses are high.
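As a concrete illustration of these definitions, the sketch below recomputes three of the metrics from numbers reported in the tables of this report: the L12 formula validation path, the L12 formula final train-eval loss, and the per-seed width-heavy finals. The helper names are illustrative, not the project's API.

```python
from statistics import mean

# Values copied from tables in this report.
L12_FORMULA_VAL_BY_STAGE = [5.5239, 5.1258, 4.8439, 4.6383, 4.4812]
L12_FORMULA_FINAL_TRAIN = 4.1988
WIDE_FORMULA_FINAL_BY_SEED = [4.4629, 4.4733, 4.4612]
WIDE_BEST_STATIC_FINAL_BY_SEED = [4.4924, 4.5015, 4.4852]


def mean_trajectory(val_by_stage):
    """Average validation loss across all prefix stages."""
    return mean(val_by_stage)


def final_gap(final_val, final_train):
    """Final validation loss minus final train-eval loss."""
    return final_val - final_train


def paired_final_deltas(formula_by_seed, best_static_by_seed):
    """Per-seed formula final loss minus best static final loss."""
    return [f - s for f, s in zip(formula_by_seed, best_static_by_seed)]


print(round(mean_trajectory(L12_FORMULA_VAL_BY_STAGE), 4))
print(round(final_gap(L12_FORMULA_VAL_BY_STAGE[-1], L12_FORMULA_FINAL_TRAIN), 4))
print([round(d, 4) for d in paired_final_deltas(
    WIDE_FORMULA_FINAL_BY_SEED, WIDE_BEST_STATIC_FINAL_BY_SEED)])
```

Because the inputs here are already rounded to four decimals, recomputed values can differ from the reported ones in the last digit.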
+## Initial Hypothesis and Correction
+The first broad hypothesis was that very high initial dropout would protect the
+model from overfitting small stream prefixes and could be decayed as more data
+arrived.
+Early 8.39M-parameter streaming runs rejected this version:
+| Condition | 5M | 10M | 20M | 40M |
+|---|---:|---:|---:|---:|
+| High-dropout decay streaming | `6.9213` | `6.2689` | `5.4262` | `4.9090` |
+| Static `0.1` dropout streaming | `5.6310` | `5.1018` | `4.8497` | `4.6743` |
+| Static `0.8` dropout streaming | `6.9898` | `6.7637` | `6.4835` | `6.2390` |
+The improvement over time was mostly the effect of seeing more stream data, not
+evidence that high-dropout decay was a good schedule. This forced a more
+careful experimental design:
+1. First find regimes where static dropout has a real nonzero optimum.
+2. Observe how that optimum moves as the stream prefix grows.
+3. Test schedules that track the moving optimum instead of using arbitrary high
+   dropout.
+## Static Dropout Screen
+The key discovery from the static screen is that the best dropout depends on
+both model scale and stream prefix size. Larger models and smaller prefixes need
+more dropout. As the stream prefix grows, the best dropout moves downward.
+| Model | Params | Prefix | Best static dropout | Validation loss | Zero-dropout penalty |
+|---|---:|---:|---:|---:|---:|
+| L16 | 31.46M | 2M | `0.14` | `4.4270` | `+0.1982` |
+| L12 | 17.37M | 2M | `0.14` | `4.5088` | `+0.0866` |
+| L8 | 8.39M | 2M | `0.08` | `4.6232` | `+0.0266` |
+| L8 | 8.39M | 4M | `0.00` | best | near zero |
+This screen is not itself the final evidence, because much of it comes from
+single-seed runs. Its role is to reveal the empirical shape: the static optimum
+is not a constant. The locked-stream experiments then test whether tracking that
+shape beats a single fixed dropout baseline.
+## Empirical Formula
+The current formula is:
+```text
+p = clamp(0.02, 0.65,
+          0.154 * log10(params / unique_tokens)
+        + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
+        - 0.210)
+```
+Terms:
+- `params / unique_tokens` is a capacity-pressure proxy. Larger models on
+  smaller stream prefixes are more likely to memorize.
+- `cumulative_sampled_tokens / unique_tokens` is an update-pressure proxy. More
+  repeated sampling from the same prefix increases overfitting pressure.
+- `0.02` is an empirical floor. It avoids assuming exact zero dropout is always
+  optimal.
+- `0.65` is a guardrail. The successful headline schedules are far below it.
+The coefficients are empirical, not theoretical constants. The formula should
+be read as a compact fitted schedule family for this protocol, not as a general
+law of dropout.
+For the standard `1000`-step protocol, the formula inputs at each stage are:
+```text
+prefix tokens:              250k    500k      1M      2M       4M
+cumulative sampled tokens: 2.048M  4.096M  6.144M  8.192M  10.240M
+```
+The cumulative sampled-token values are the planned totals after each stage.
+They are used to compute the stage anchor dropouts below.
+| Model | Params | Formula path |
+|---|---:|---|
+| L8 | 8.39M | `0.252 -> 0.206 -> 0.129 -> 0.038 -> 0.020` |
+| L10 | 12.31M | `0.278 -> 0.232 -> 0.154 -> 0.064 -> 0.020` |
+| L12 | 17.37M | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` |
+| L14 | 23.70M | `0.322 -> 0.276 -> 0.198 -> 0.108 -> 0.020` |
+| L16 | 31.46M | `0.341 -> 0.294 -> 0.217 -> 0.127 -> 0.030` |
+| wide L8 | 17.30M | `0.301 -> 0.254 -> 0.177 -> 0.087 -> 0.020` |
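The formula paths in the table above can be recomputed from the stated formula and the planned cumulative sampled-token totals. The sketch below is a direct transcription of the formula, not the project's implementation; the function names are illustrative, and the parameter counts are the two-decimal values from the table, so last-digit differences are possible. The L12 row in particular looks as if it was rounded to two decimals, so a three-decimal recomputation will not match it exactly.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
TOKENS_PER_STEP = 16 * 128  # batch size * block size


def pressure_dropout(params, unique_tokens, cumulative_sampled_tokens,
                     floor=0.02, ceiling=0.65):
    """Clamped two-term pressure formula from this report."""
    raw = (0.154 * math.log10(params / unique_tokens)
           + 0.249 * math.log10(cumulative_sampled_tokens / unique_tokens)
           - 0.210)
    return min(ceiling, max(floor, raw))


def formula_path(params, prefixes=PREFIXES, stage_steps=1000):
    """Anchor dropout per stage, using planned cumulative totals after each stage."""
    path = []
    for stage, unique_tokens in enumerate(prefixes, start=1):
        cumulative = stage * stage_steps * TOKENS_PER_STEP
        path.append(round(pressure_dropout(params, unique_tokens, cumulative), 3))
    return path


for name, params in [("L8", 8.39e6), ("L16", 31.46e6), ("wide L8", 17.30e6)]:
    print(name, " -> ".join(f"{p:.3f}" for p in formula_path(params)))
```

With these inputs the sketch reproduces the L8, L16, and wide L8 rows of the table.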
+## Main Result: Model-Size Validation
+The formula was tested across five model sizes from 8.39M to 31.46M parameters.
+Each model used three seeds and was compared against fixed-dropout controls near
+the expected optimum.
+| Model | Params | Formula final val | Best static final val | Paired final deltas |
+|---|---:|---:|---:|---:|
+| L8 | 8.39M | `4.6094 +/- 0.0056` | `4.6242` | `-0.0102, -0.0160, -0.0182` |
+| L10 | 12.31M | `4.5306 +/- 0.0094` | `4.5580` | `-0.0288, -0.0188, -0.0345` |
+| L12 | 17.37M | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
+| L14 | 23.70M | `4.4384 +/- 0.0087` | `4.4736` | `-0.0294, -0.0269, -0.0429` |
+| L16 | 31.46M | `4.4059 +/- 0.0046` | `4.4459` | `-0.0411, -0.0512, -0.0279` |
+The formula wins all `15/15` paired final-loss comparisons in this model-size
+validation set.
+The L8 case is the weakest positive result. It wins final validation loss, but
+the static optimum is shallow and static `0.08` has a better mean trajectory
+loss. The larger models show clearer benefits.
+## Why Schedule Shape Matters
+L16 was used to debug the difference between "high dropout" and "right dropout."
+An early fitted path that started too high, `0.60 -> 0.40 -> 0.30 -> 0.14 ->
+0.02`, beat some static controls at the final prefix but had worse trajectory
+loss. Moderate schedules around `0.30` were much better.
+Three-seed L16 confirmation:
+| Condition | Path | Final val | Final std | Mean trajectory val | Final gap |
+|---|---|---:|---:|---:|---:|
+| `hold_30_then_decay` | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` | `4.4060` | `0.0118` | `4.8503` | `0.3530` |
+| `mild_30_to_08` | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` | `4.4075` | `0.0078` | `4.8504` | `0.3307` |
+| `fitted_l16_static_law` | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` | `4.4159` | `0.0042` | `4.9527` | `0.3144` |
+| `static_dropout_0.14` | constant | `4.4459` | `0.0128` | `4.9043` | `0.3205` |
+| `static_dropout_0.30` | constant | `4.4693` | `0.0081` | `4.8764` | `0.2327` |
+| `static_dropout_0.02` | constant | `4.5405` | `0.0061` | `5.1544` | `0.4747` |
+| `static_dropout_0.00` | constant | `4.5905` | `0.0192` | `5.2422` | `0.5464` |
+The lesson is that the winning schedule is not "very high dropout, then decay."
+It is "start near the small-prefix optimum, then decay as the optimum moves
+down."
+## Update-Pressure Validation
+Changing `stage_steps` changes how many sampled tokens the optimizer consumes at
+each prefix. The formula predicts that more repeated sampling should require
+more dropout.
+L12 update-pressure sweep:
+| Stage steps | Formula path | Mean trajectory val | Formula final val | Best static final val | Paired final deltas |
+|---:|---|---:|---:|---:|---:|
+| 500 | `0.226 -> 0.180 -> 0.102 -> 0.020 -> 0.020` | `5.1581` | `4.7138 +/- 0.0080` | `4.7321` | `-0.0152, -0.0147, -0.0249` |
+| 1000 | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
+| 2000 | `0.376 -> 0.330 -> 0.252 -> 0.162 -> 0.065` | `4.7841` | `4.3089 +/- 0.0116` | `4.3513` | `-0.0453, -0.0321, -0.0489` |
+The formula wins final validation loss in all three update-pressure regimes.
+At `2000` steps per prefix, it also wins mean trajectory loss. This supports the
+direction of the sampled-token term.
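The effect of `stage_steps` on the schedule can be seen by recomputing the formula with different cumulative sampled-token totals. This is a sketch under the same assumptions as the formula itself (planned cumulative totals after each stage, 2,048 tokens per step); `l12_path` is an illustrative name, and the L12 parameter count is the rounded `17.37M` from this report.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
L12_PARAMS = 17.37e6
TOKENS_PER_STEP = 2048


def l12_path(stage_steps):
    """Formula anchors for L12 when each prefix stage runs `stage_steps` steps."""
    path = []
    for stage, unique in enumerate(PREFIXES, start=1):
        cumulative = stage * stage_steps * TOKENS_PER_STEP
        raw = (0.154 * math.log10(L12_PARAMS / unique)
               + 0.249 * math.log10(cumulative / unique)
               - 0.210)
        path.append(round(min(0.65, max(0.02, raw)), 3))
    return path


for steps in (500, 2000):
    print(steps, " -> ".join(f"{p:.3f}" for p in l12_path(steps)))
```

Doubling `stage_steps` adds `0.249 * log10(2)`, about `0.075`, to every unclamped anchor, which matches the spacing between the `500`- and `2000`-step paths in the table.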
+## Sampled-Pressure Coefficient Ablation
+The sampled-pressure coefficient was ablated on L12 while holding model,
+prefixes, and training budget fixed.
+| Condition | Coefficient multiplier | Path | Mean trajectory val | Final val | Final std | Final gap |
+|---|---:|---|---:|---:|---:|---:|
+| `no_sample_pressure_l12` | 0x | `0.074 -> 0.027 -> 0.020 -> 0.020 -> 0.020` | `5.0282` | `4.5468` | `0.0011` | `0.3482` |
+| `half_sample_pressure_l12` | 0.5x | `0.187 -> 0.141 -> 0.079 -> 0.020 -> 0.020` | `4.9260` | `4.5055` | `0.0046` | `0.3272` |
+| `pressure_formula_floor02` | 1.0x | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812` | `0.0062` | `0.2825` |
+| `high_sample_pressure_l12` | 1.5x | `0.415 -> 0.368 -> 0.275 -> 0.163 -> 0.041` | `4.9739` | `4.4959` | `0.0025` | `0.2418` |
+The `1.0x` coefficient is best on final validation. The `1.5x` variant has the
+smallest final train-validation gap but worse validation loss, which is a useful
+warning: minimizing the gap is not the same as maximizing generalization.
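The ablation paths follow from scaling only the sampled-pressure coefficient. The sketch below assumes the multiplier is applied to the `0.249` term and nothing else, which is consistent with the paths in the table but is not copied from the project code; `ablated_path` is a hypothetical name.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
L12_PARAMS = 17.37e6
TOKENS_PER_STAGE = 1000 * 2048  # standard 1000-step protocol


def ablated_path(multiplier):
    """L12 anchors with the sampled-pressure coefficient scaled by `multiplier`."""
    path = []
    for stage, unique in enumerate(PREFIXES, start=1):
        capacity = 0.154 * math.log10(L12_PARAMS / unique)
        sampled = multiplier * 0.249 * math.log10(stage * TOKENS_PER_STAGE / unique)
        path.append(round(min(0.65, max(0.02, capacity + sampled - 0.210)), 3))
    return path


for multiplier in (0.0, 0.5, 1.5):
    print(f"{multiplier}x", " -> ".join(f"{p:.3f}" for p in ablated_path(multiplier)))
```

At `0x` the schedule depends only on the capacity term and reaches the floor by the `1M` prefix, which is why that variant behaves almost like a low static dropout.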
+## Architecture-Shape Holdout
+A key question is whether parameter count alone is a reasonable capacity proxy.
+The first architecture-shape holdout uses a deep/narrow 8-head model:
+```text
+18 layers, 8 heads, 256 embedding dim, 16.25M parameters
+```
+The formula path was generated from parameter count only:
+```text
+0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020
+```
+Results:
+| Condition | Path | Mean trajectory val | Final val | Final std | Final gap |
+|---|---|---:|---:|---:|---:|
+| Formula | `0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020` | `4.9720` | `4.5286` | `0.0118` | `0.2418` |
+| Static `0.02` | constant | `5.0730` | `4.5887` | `0.0067` | `0.2947` |
+| Static `0.08` | constant | `4.9900` | `4.5607` | `0.0081` | `0.2447` |
+| Static `0.14` | constant | `4.9633` | `4.5564` | `0.0127` | `0.2080` |
+| Static `0.18` | constant | `4.9699` | `4.5710` | `0.0061` | `0.1950` |
+| Static `0.20` | constant | `4.9799` | `4.5835` | `0.0199` | `0.1841` |
+| Static `0.26` | constant | `5.0021` | `4.6096` | `0.0126` | `0.1602` |
+| Static `0.30` | constant | `5.0341` | `4.6520` | `0.0024` | `0.1545` |
+The best static condition was `0.14`. The formula beat it in every paired per-seed final-loss comparison:
+```text
+formula - best_static = -0.0270, -0.0317, -0.0248
+```
+This supports final-loss transfer for the deep/narrow shape. It is not a clean
+trajectory win because static `0.14` had a slightly better mean trajectory loss
+(`4.9633` versus `4.9720` for the formula). The safe claim is final-loss
+transfer, not universal trajectory dominance.
+## Combined Evidence
+Completed headline evidence:
+| Evidence type | Result |
+|---|---|
+| Model-size validation | `15/15` paired final-loss wins |
+| Deep/narrow architecture holdout | `3/3` paired final-loss wins |
+| Width-heavy architecture holdout | `3/3` paired final-loss wins |
+| Combined paired final-loss comparisons | `21/21` wins |
+| Update-pressure direction | Supported on L12 |
+| Sampled-pressure coefficient | Supported on L12 |
+| High arbitrary initial dropout | Rejected |
+The current evidence is strong for the refined hypothesis under this exact
+protocol. It is not strong enough to claim a universal dropout law.
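The `21/21` tally can be checked directly from the paired final deltas listed in this report. The sketch below only counts signs and reports the narrowest margin; the dictionary keys are illustrative labels, and the values are copied from the model-size, deep/narrow, and width-heavy results.

```python
# Paired final deltas (formula minus best static, per seed) from this report.
PAIRED_FINAL_DELTAS = {
    "L8": [-0.0102, -0.0160, -0.0182],
    "L10": [-0.0288, -0.0188, -0.0345],
    "L12": [-0.0364, -0.0308, -0.0439],
    "L14": [-0.0294, -0.0269, -0.0429],
    "L16": [-0.0411, -0.0512, -0.0279],
    "deep_narrow": [-0.0270, -0.0317, -0.0248],
    "wide": [-0.0295, -0.0282, -0.0241],
}

all_deltas = [d for deltas in PAIRED_FINAL_DELTAS.values() for d in deltas]
wins = sum(d < 0 for d in all_deltas)        # negative delta = formula wins
narrowest = max(all_deltas)                  # closest comparison to a tie

print(f"{wins}/{len(all_deltas)} paired final-loss wins")
print(f"narrowest margin: {narrowest:+.4f}")
```

The narrowest margin comes from the L8 boundary model, in line with the earlier note that L8 is the weakest positive result.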
+## Additional Experiment Tables
+This section gives a denser empirical audit trail for the main claims. The
+narrative sections above highlight only the most important rows; the tables
+below expose more of the completed run surface.
+### Completed Run Inventory
+| Run ID | Role | Seeds |
+|---|---|---:|
+| `legacy_20260525` | Initial streaming controls and high-dropout failure case | mixed |
+| `screen_static_133008` | Static dropout screen across L8, L12, and L16 | 1 |
+| `l16_static_vs_decay_152414` | L16 single-seed static-vs-decay baseline | 1 |
+| `l16_schedule_search_171537` | L16 single-seed schedule search | 1 |
+| `l16_schedule_refine_184506` | L16 single-seed schedule refinement | 1 |
+| `l16_multiseed_confirm_203116` | L16 three-seed schedule confirmation | 3 |
+| `l12_single_seed_072432` | L12 seed-1 pressure-formula probe | 1 |
+| `l12_followup_085421` | L12 seeds 2 and 3 follow-up for common conditions | 2 |
+| `l8_boundary_104407` | L8 boundary model formula test | 3 |
+| `l16_exact_formula_123806` | L16 exact formula-vs-static confirmation | 3 |
+| `l10_interpolation_153920` | L10 model-size interpolation run | 3 |
+| `l14_interpolation_182113` | L14 model-size interpolation run | 3 |
+| `l12_stage_steps_500_231804` | L12 low-update-pressure validation | 3 |
+| `l12_stage_steps_2000_004033` | L12 high-update-pressure validation | 3 |
+| `l12_sample_pressure_ablation_053842` | L12 sampled-pressure coefficient ablation | 3 |
+| `deep_narrow_h8_112117` | Deep/narrow architecture-shape holdout | 3 |
+| `wide_h8_151721` | Width-heavy architecture-shape holdout | 3 |
+### Static Screen Optima
+The static screen was the main reason the research direction changed. It showed
+that dropout optima move with both prefix size and model scale.
+| Model | Params | Prefix tokens | Effective epochs | Best dropout | Val loss | Train loss | Gap |
+|---|---:|---:|---:|---:|---:|---:|---:|
+| `L8_H8_D256` | 8.39M | 250k | 40.96 | 0.40 | 5.4175 | 3.6411 | 1.7763 |
+| `L8_H8_D256` | 8.39M | 500k | 20.48 | 0.20 | 5.0216 | 3.6979 | 1.3238 |
+| `L8_H8_D256` | 8.39M | 1M | 10.24 | 0.14 | 4.7763 | 3.9900 | 0.7863 |
+| `L8_H8_D256` | 8.39M | 2M | 5.12 | 0.08 | 4.6232 | 4.2158 | 0.4074 |
+| `L8_H8_D256` | 8.39M | 4M | 2.56 | 0.00 | 4.5136 | 4.2515 | 0.2621 |
+| `L12_H8_D320` | 17.37M | 250k | 40.96 | 0.50 | 5.4384 | 3.3720 | 2.0663 |
+| `L12_H8_D320` | 17.37M | 500k | 20.48 | 0.40 | 4.9791 | 3.7358 | 1.2434 |
+| `L12_H8_D320` | 17.37M | 1M | 10.24 | 0.20 | 4.6871 | 3.7160 | 0.9711 |
+| `L12_H8_D320` | 17.37M | 2M | 5.12 | 0.14 | 4.5088 | 4.0218 | 0.4870 |
+| `L12_H8_D320` | 17.37M | 4M | 2.56 | 0.02 | 4.3875 | 4.0300 | 0.3575 |
+| `L16_H8_D384` | 31.46M | 250k | 40.96 | 0.60 | 5.5055 | 3.3185 | 2.1870 |
+| `L16_H8_D384` | 31.46M | 500k | 20.48 | 0.40 | 4.9814 | 3.2797 | 1.7017 |
+| `L16_H8_D384` | 31.46M | 1M | 10.24 | 0.30 | 4.6511 | 3.6295 | 1.0216 |
+| `L16_H8_D384` | 31.46M | 2M | 5.12 | 0.14 | 4.4270 | 3.7761 | 0.6509 |
+| `L16_H8_D384` | 31.46M | 4M | 2.56 | 0.02 | 4.2947 | 3.8547 | 0.4400 |
+### Formula Trajectories Across Model Size
+This table shows validation loss at every stream prefix for the formula
+schedule, not only the final result.
+| Model | Params | 250k | 500k | 1M | 2M | 4M |
+|---|---:|---:|---:|---:|---:|---:|
+| L8 | 8.39M | 5.6127 | 5.2183 | 4.9549 | 4.7543 | 4.6094 |
+| L10 | 12.31M | 5.5603 | 5.1544 | 4.8885 | 4.6831 | 4.5306 |
+| L12 | 17.37M | 5.5239 | 5.1258 | 4.8439 | 4.6383 | 4.4812 |
+| L14 | 23.70M | 5.4849 | 5.0853 | 4.8105 | 4.5969 | 4.4384 |
+| L16 | 31.46M | 5.4670 | 5.0597 | 4.7784 | 4.5699 | 4.4059 |
+### Final Static-Control Rankings By Model
+These tables show the final `4M` validation loss for the formula and all static
+controls that were run in each model-size validation. They make clear that the
+formula is not only beating a weak single baseline; it is being compared against
+nearby static controls around the apparent optimum.
+#### L8 Final Controls
+| Condition | N | Final val | Val std | Final train | Final gap |
+|---|---:|---:|---:|---:|---:|
+| formula | 3 | 4.6094 | 0.0056 | 4.3977 | 0.2117 |
+| static 0.08 | 3 | 4.6257 | 0.0007 | 4.4303 | 0.1954 |
+| static 0.04 | 3 | 4.6281 | 0.0040 | 4.4116 | 0.2166 |
+| static 0.02 | 3 | 4.6302 | 0.0086 | 4.3857 | 0.2445 |
+| static 0.00 | 3 | 4.6464 | 0.0072 | 4.3789 | 0.2675 |
+| static 0.13 | 3 | 4.6475 | 0.0083 | 4.4690 | 0.1784 |
+| static 0.20 | 3 | 4.6833 | 0.0048 | 4.5289 | 0.1543 |
+| static 0.25 | 3 | 4.7232 | 0.0032 | 4.5782 | 0.1450 |
+| static 0.30 | 3 | 4.7666 | 0.0083 | 4.6333 | 0.1334 |
+#### L10 Final Controls
+| Condition | N | Final val | Val std | Final train | Final gap |
+|---|---:|---:|---:|---:|---:|
+| formula | 3 | 4.5306 | 0.0094 | 4.2816 | 0.2491 |
+| static 0.06 | 3 | 4.5580 | 0.0033 | 4.2991 | 0.2588 |
+| static 0.10 | 3 | 4.5618 | 0.0049 | 4.3319 | 0.2299 |
+| static 0.08 | 3 | 4.5645 | 0.0015 | 4.3267 | 0.2378 |
+| static 0.13 | 3 | 4.5725 | 0.0100 | 4.3582 | 0.2143 |
+| static 0.16 | 3 | 4.5732 | 0.0073 | 4.3716 | 0.2017 |
+| static 0.02 | 3 | 4.5835 | 0.0067 | 4.2847 | 0.2988 |
+| static 0.20 | 3 | 4.5939 | 0.0099 | 4.4108 | 0.1830 |
+| static 0.23 | 3 | 4.6069 | 0.0061 | 4.4318 | 0.1752 |
+| static 0.28 | 3 | 4.6494 | 0.0063 | 4.4887 | 0.1607 |
+#### L12 Final Controls
+| Condition | N | Final val | Val std | Final train | Final gap |
+|---|---:|---:|---:|---:|---:|
+| formula | 3 | 4.4812 | 0.0062 | 4.1988 | 0.2825 |
+| static 0.14 | 3 | 4.5183 | 0.0022 | 4.2741 | 0.2442 |
+| static 0.20 | 3 | 4.5284 | 0.0075 | 4.3071 | 0.2213 |
+| static 0.09 | 3 | 4.5291 | 0.0023 | 4.2545 | 0.2745 |
+| static 0.18 | 3 | 4.5308 | 0.0069 | 4.3086 | 0.2222 |
+| static 0.26 | 3 | 4.5581 | 0.0025 | 4.3624 | 0.1957 |
+| static 0.02 | 1 | 4.5624 | 0.0000 | 4.2134 | 0.3491 |
+| static 0.30 | 3 | 4.5817 | 0.0014 | 4.3991 | 0.1826 |
+| static 0.00 | 1 | 4.6071 | 0.0000 | 4.1934 | 0.4137 |
+#### L14 Final Controls
+| Condition | N | Final val | Val std | Final train | Final gap |
+|---|---:|---:|---:|---:|---:|
+| formula | 3 | 4.4384 | 0.0087 | 4.1337 | 0.3046 |
+| static 0.18 | 3 | 4.4736 | 0.0072 | 4.2166 | 0.2570 |
+| static 0.14 | 3 | 4.4769 | 0.0113 | 4.1999 | 0.2770 |
+| static 0.20 | 3 | 4.4777 | 0.0014 | 4.2243 | 0.2534 |
+| static 0.10 | 3 | 4.4851 | 0.0039 | 4.1776 | 0.3075 |
+| static 0.28 | 3 | 4.5056 | 0.0068 | 4.2989 | 0.2067 |
+| static 0.02 | 3 | 4.5384 | 0.0113 | 4.1117 | 0.4267 |
+| static 0.32 | 3 | 4.5390 | 0.0054 | 4.3325 | 0.2065 |
+#### L16 Exact-Formula Final Controls
+| Condition | N | Final val | Val std | Final train | Final gap |
+|---|---:|---:|---:|---:|---:|
+| formula | 3 | 4.4059 | 0.0046 | 4.0601 | 0.3457 |
+| static 0.14 | 3 | 4.4459 | 0.0128 | 4.1254 | 0.3205 |
+### Update-Pressure Final Rankings
+For L12, changing `stage_steps` changes repeated-sampling pressure while keeping
+the same stream prefixes. The rows below include the formula and the three best
+static controls at the final prefix for each update-pressure regime.
+| Stage steps | Condition | N | Final val | Val std | Final train | Final gap |
+|---:|---|---:|---:|---:|---:|---:|
+| 500 | formula | 3 | 4.7138 | 0.0080 | 4.5508 | 0.1631 |
+| 500 | static 0.02 | 3 | 4.7321 | 0.0051 | 4.5468 | 0.1853 |
+| 500 | static 0.06 | 3 | 4.7413 | 0.0024 | 4.5796 | 0.1617 |
+| 500 | static 0.10 | 3 | 4.7514 | 0.0070 | 4.6030 | 0.1484 |
+| 1000 | formula | 3 | 4.4812 | 0.0062 | 4.1988 | 0.2825 |
+| 1000 | static 0.14 | 3 | 4.5183 | 0.0022 | 4.2741 | 0.2442 |
+| 1000 | static 0.20 | 3 | 4.5284 | 0.0075 | 4.3071 | 0.2213 |
+| 1000 | static 0.09 | 3 | 4.5291 | 0.0023 | 4.2545 | 0.2745 |
+| 2000 | formula | 3 | 4.3089 | 0.0116 | 3.8949 | 0.4140 |
+| 2000 | static 0.25 | 3 | 4.3513 | 0.0030 | 4.0249 | 0.3264 |
+| 2000 | static 0.18 | 3 | 4.3557 | 0.0076 | 3.9884 | 0.3673 |
+| 2000 | static 0.14 | 3 | 4.3622 | 0.0020 | 3.9608 | 0.4014 |
+### Sampled-Pressure Ablation Trajectories
+The table below shows the stage-by-stage validation path for the sampled-token
+coefficient ablation on L12. The `1.0x` row is the main formula run included
+above; the other rows are coefficient variants from the dedicated ablation run.
+| Multiplier | Validation path across prefixes | Mean trajectory val | Final val | Final std | Final gap |
+|---:|---|---:|---:|---:|---:|
+| 0x | `5.5299 -> 5.3265 -> 5.0044 -> 4.7335 -> 4.5468` | 5.0282 | 4.5468 | 0.0011 | 0.3482 |
+| 0.5x | `5.4776 -> 5.1221 -> 4.8600 -> 4.6647 -> 4.5055` | 4.9260 | 4.5055 | 0.0046 | 0.3272 |
+| 1.0x | `5.5239 -> 5.1258 -> 4.8439 -> 4.6383 -> 4.4812` | 4.9226 | 4.4812 | 0.0062 | 0.2825 |
+| 1.5x | `5.6181 -> 5.1838 -> 4.8996 -> 4.6723 -> 4.4959` | 4.9739 | 4.4959 | 0.0025 | 0.2418 |
+### Architecture-Shape Holdout Final Controls
+The deep/narrow holdout is important because it tests whether the pressure rule
+transfers beyond the exact depth/width scaling family used to fit the model-size
+trend.
+| Condition | N | Final val | Val std | Final train | Final gap |
+|---|---:|---:|---:|---:|---:|
+| formula | 3 | 4.5286 | 0.0118 | 4.2869 | 0.2418 |
+| static 0.14 | 3 | 4.5564 | 0.0127 | 4.3485 | 0.2080 |
+| static 0.08 | 3 | 4.5607 | 0.0081 | 4.3160 | 0.2447 |
+| static 0.18 | 3 | 4.5710 | 0.0061 | 4.3760 | 0.1950 |
+| static 0.20 | 3 | 4.5835 | 0.0199 | 4.3995 | 0.1841 |
+| static 0.02 | 3 | 4.5887 | 0.0067 | 4.2941 | 0.2947 |
+| static 0.26 | 3 | 4.6096 | 0.0126 | 4.4494 | 0.1602 |
+| static 0.30 | 3 | 4.6520 | 0.0024 | 4.4975 | 0.1545 |
+The width-heavy holdout is the paired complement: it keeps parameter count near
+the L12 scale but uses a conventional wider `8x8x384` shape instead of the
+deep/narrow `18x8x256` shape.
+| Condition | N | Final val | Val std | Final train | Final gap |
+|---|---:|---:|---:|---:|---:|
+| formula | 3 | 4.4658 | 0.0065 | 4.1514 | 0.3144 |
+| static 0.14 | 3 | 4.4946 | 0.0087 | 4.2214 | 0.2732 |
+| static 0.18 | 3 | 4.4968 | 0.0052 | 4.2423 | 0.2545 |
+| static 0.08 | 3 | 4.4989 | 0.0043 | 4.1777 | 0.3211 |
+| static 0.20 | 3 | 4.5175 | 0.0085 | 4.2791 | 0.2384 |
+| static 0.26 | 3 | 4.5411 | 0.0046 | 4.3243 | 0.2169 |
+| static 0.02 | 3 | 4.5426 | 0.0051 | 4.1493 | 0.3933 |
+| static 0.30 | 3 | 4.5754 | 0.0064 | 4.3771 | 0.1984 |
+The best static condition at the final prefix varied by seed, but the formula
+beat the per-seed best static condition in every paired final comparison:
+```text
+seed 1: 4.4629 vs static 0.18 at 4.4924, delta -0.0295
+seed 2: 4.4733 vs static 0.08 at 4.5015, delta -0.0282
+seed 3: 4.4612 vs static 0.14 at 4.4852, delta -0.0241
+```
+The width-heavy result is not a clean win on every metric. Static `0.18` had the
+best mean trajectory loss, `4.9064` versus formula `4.9073`, and the first two
+prefixes favored static rates around `0.18-0.20`. The formula still won final
+loss because it decayed to low dropout at the largest prefixes. This suggests
+that final-loss transfer is real, but an architecture-shape term may be needed
+to avoid overestimating early dropout for wide models.
+## Interpretation
+The most plausible mechanism is pressure tracking:
+1. At small prefixes, the model sees many effective passes over the same unique
+   tokens. Low dropout overfits quickly.
+2. Larger models amplify this because they have more capacity relative to the
+   available prefix.
+3. As the prefix grows, repeated-sampling pressure falls and high dropout begins
+   to underfit.
+4. A static dropout value must compromise across these regimes.
+5. A prefix-aware schedule can use stronger early regularization and lower
+   later regularization without changing model architecture or optimizer.
+This interpretation is consistent with the static screens, the model-size
+interpolation results, the update-pressure sweep, the sampled-pressure
+coefficient ablation, and the two architecture-shape holdouts. The width-heavy
+holdout adds an important refinement: parameter count alone does not fully
+describe architecture capacity, because the formula's early dropout was higher
+than the measured early-prefix static optimum for that shape.
+## What This Report Does Not Prove
+The current evidence does not prove:
+- The formula is universal across arbitrary datasets.
+- Parameter count alone fully captures model capacity.
+- The formula always wins mean trajectory loss.
+- The `0.02` floor is theoretically optimal.
+- The sampled-pressure coefficient is optimal for every model size.
+- The result will scale unchanged to larger LMs, longer contexts, or different
+  tokenizers.
+The current evidence does support:
+- Static dropout optima move downward as stream prefix size grows.
+- Larger models need more early dropout at small stream prefixes.
+- Repeated sampling from the same prefix increases useful dropout.
+- A pressure-aware schedule can beat the best single static dropout on final
+  validation loss in the completed protocol.
+## Publication Framing
+The strongest safe paper claim is:
+> In nanochat-style causal Transformers trained under an expanding-prefix
+> streaming protocol, a pressure-aware dropout schedule improves final
+> validation loss over fixed-dropout baselines across model sizes, update
+> pressures, and two architecture-shape holdouts.
+Claims to avoid:
+- "Dropout decay is generally beneficial."
+- "Very high initial dropout is useful."
+- "The formula predicts optimal dropout universally."
+- "The formula dominates every trajectory metric."
+## Remaining High-Value Experiments
+The next experiments that would most strengthen a paper are:
+1. Corpus/domain holdout: freeze the formula and run on a different text
+   distribution. This is the largest missing generalization test.
+2. Architecture-shape refinement: add a small feature such as depth/width ratio
+   or embedding dimension to reduce early-dropout overestimation on wide models,
+   then validate it on held-out shapes.
+3. L8 and L16 sampled-pressure ablations: repeat the `0x`, `0.5x`, `1.0x`, and
+   `1.5x` coefficient ablation outside L12.
+4. Oracle schedule comparison: compare the formula against a stage-wise oracle
+   chosen from measured static optima. The formula does not need to beat the
+   oracle; it should approach it without using per-stage oracle knowledge.
+5. Five-seed headline confirmation: reserve higher seed counts for the final
+   paper table, not every exploratory sweep.
+## Reproduction Pointers
+Important files:
+| File | Purpose |
+|---|---|
+| `README.md` | Project overview and workflow |
+| `REPRODUCING.md` | Exact reproduction commands and data-cache notes |
+| `src/dropout_decay/experiment.py` | MPS-only runner, stream loop, metrics, summaries |
+| `src/dropout_decay/model.py` | Nanochat-style Transformer and dynamic dropout |
+| `src/dropout_decay/data.py` | Token cache loading and corpus encoding |
+| `runs/*/summary.csv` | Aggregated metrics |
+| `runs/*/metrics.jsonl` | Per-seed raw metrics |
+| `runs/*/RESULT_SUMMARY.md` | Generated human-readable run summaries |
+For a new reader, the most useful path through the artifacts is:
+1. Read this report.
+2. Read `REPRODUCING.md` for exact commands.
+3. Inspect the corresponding `runs/.../config.json` for each headline table.
+4. Verify paired deltas from `metrics.jsonl` or `summary.csv`.
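Step 4 can be done with a few lines of standard-library Python. The sketch below relies only on fields that appear in `metrics.jsonl` (`condition`, `condition_kind`, `seed`, `stage`, `val_eval_loss`). The function name `paired_final_deltas`, the default formula condition name, and the choice to treat the highest stage as the final one are assumptions made for illustration; the example path is the width-heavy holdout run.

```python
import json
from collections import defaultdict


def paired_final_deltas(lines, formula="formula_wide_l8_h8"):
    """Per seed: formula final val loss minus the best static final val loss."""
    rows = [json.loads(line) for line in lines if line.strip()]
    final_stage = max(row["stage"] for row in rows)
    formula_loss = {}
    static_loss = defaultdict(dict)  # seed -> {condition name: val loss}
    for row in rows:
        if row["stage"] != final_stage:
            continue
        if row["condition"] == formula:
            formula_loss[row["seed"]] = row["val_eval_loss"]
        elif row["condition_kind"] == "static":
            static_loss[row["seed"]][row["condition"]] = row["val_eval_loss"]
    deltas = {}
    for seed, loss in sorted(formula_loss.items()):
        if not static_loss[seed]:
            continue  # no static baseline recorded for this seed
        best = min(static_loss[seed], key=static_loss[seed].get)
        deltas[seed] = (best, loss - static_loss[seed][best])
    return deltas


if __name__ == "__main__":
    path = ("runs/architecture_shape_holdout_wide_h8/locked_stream/"
            "20260528-151721/metrics.jsonl")
    try:
        with open(path) as handle:
            for seed, (best, delta) in paired_final_deltas(handle).items():
                print(f"seed {seed}: best static {best}, delta {delta:+.4f}")
    except FileNotFoundError:
        print(f"{path} not found; run from the repository root")
```

A negative delta means the formula finished below the best static condition for that seed. Pass a different `formula` name to check other runs.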
+## Bottom Line
+The result is not "dropout decay works." The result is more precise:
+> In an expanding-prefix training regime, dropout should track pressure from
+> model size, available unique tokens, and repeated sampling. A schedule that
+> tracks that pressure can outperform every fixed dropout rate tested on final
+> validation loss.
+That is already a credible empirical story. The main remaining work is claim
+scope: corpus transfer, architecture-shape refinement, and a clearer separation
+between formula fitting and formula validation.

runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/RESULT_SUMMARY.md ADDED

	@@ -0,0 +1,86 @@

+# Locked Streaming Dropout Summary
+Run directory: `runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721`
+Model: `wide_L8_H8_D384` causal Transformer, 17,301,504 parameters, 8 layers, 8 heads, 384 embedding dim.
+Training per stage: 1,000 steps. Sampled tokens are cumulative in each stage row. Seeds present: 1, 2, 3.
+## Condition Ranking
+| Condition | Kind | Final dropout | Mean trajectory val loss | Final val loss | Final gap | Dropout path |
+|---|---|---:|---:|---:|---:|---|
+| `static_dropout_0.18` | static | 0.18 | 4.9064 | 4.4968 | 0.2545 | 0.18 -> 0.18 -> 0.18 -> 0.18 -> 0.18 |
+| `formula_wide_l8_h8` | anchor_decay | 0.02 | 4.9073 | 4.4658 | 0.3144 | 0.30 -> 0.25 -> 0.18 -> 0.09 -> 0.02 |
+| `static_dropout_0.14` | static | 0.14 | 4.9120 | 4.4946 | 0.2732 | 0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14 |
+| `static_dropout_0.2` | static | 0.20 | 4.9184 | 4.5175 | 0.2384 | 0.20 -> 0.20 -> 0.20 -> 0.20 -> 0.20 |
+| `static_dropout_0.26` | static | 0.26 | 4.9323 | 4.5411 | 0.2169 | 0.26 -> 0.26 -> 0.26 -> 0.26 -> 0.26 |
+| `static_dropout_0.08` | static | 0.08 | 4.9576 | 4.4989 | 0.3211 | 0.08 -> 0.08 -> 0.08 -> 0.08 -> 0.08 |
+| `static_dropout_0.3` | static | 0.30 | 4.9612 | 4.5754 | 0.1984 | 0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30 |
+| `static_dropout_0.02` | static | 0.02 | 5.0798 | 4.5426 | 0.3933 | 0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02 |
+## Stage Trajectory
+### Stage 0: 250,000 Prefix Tokens
+| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+|---|---:|---:|---:|---:|---:|
+| `static_dropout_0.18` | 0.18 | 5.4529 | 4.3258 | 1.1270 | 3 |
+| `static_dropout_0.2` | 0.20 | 5.4558 | 4.3694 | 1.0864 | 3 |
+| `static_dropout_0.14` | 0.14 | 5.4600 | 4.2174 | 1.2425 | 3 |
+| `static_dropout_0.26` | 0.26 | 5.4691 | 4.4782 | 0.9909 | 3 |
+| `static_dropout_0.3` | 0.30 | 5.4959 | 4.5701 | 0.9258 | 3 |
+| `formula_wide_l8_h8` | 0.30 | 5.4974 | 4.5700 | 0.9274 | 3 |
+| `static_dropout_0.08` | 0.08 | 5.5064 | 4.0660 | 1.4404 | 3 |
+| `static_dropout_0.02` | 0.02 | 5.6313 | 3.8606 | 1.7707 | 3 |
+### Stage 1: 500,000 Prefix Tokens
+| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+|---|---:|---:|---:|---:|---:|
+| `static_dropout_0.18` | 0.18 | 5.1068 | 4.0540 | 1.0529 | 3 |
+| `static_dropout_0.2` | 0.20 | 5.1129 | 4.1267 | 0.9862 | 3 |
+| `static_dropout_0.26` | 0.26 | 5.1137 | 4.2308 | 0.8829 | 3 |
+| `formula_wide_l8_h8` | 0.25 | 5.1168 | 4.2433 | 0.8736 | 3 |
+| `static_dropout_0.14` | 0.14 | 5.1226 | 3.9473 | 1.1754 | 3 |
+| `static_dropout_0.3` | 0.30 | 5.1355 | 4.3145 | 0.8210 | 3 |
+| `static_dropout_0.08` | 0.08 | 5.2107 | 3.7604 | 1.4503 | 3 |
+| `static_dropout_0.02` | 0.02 | 5.4235 | 3.5110 | 1.9124 | 3 |
+### Stage 2: 1,000,000 Prefix Tokens
+| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+|---|---:|---:|---:|---:|---:|
+| `formula_wide_l8_h8` | 0.18 | 4.8353 | 4.1781 | 0.6572 | 3 |
+| `static_dropout_0.18` | 0.18 | 4.8359 | 4.1157 | 0.7202 | 3 |
+| `static_dropout_0.14` | 0.14 | 4.8447 | 4.0450 | 0.7997 | 3 |
+| `static_dropout_0.2` | 0.20 | 4.8486 | 4.1693 | 0.6793 | 3 |
+| `static_dropout_0.26` | 0.26 | 4.8584 | 4.2648 | 0.5936 | 3 |
+| `static_dropout_0.3` | 0.30 | 4.8883 | 4.3345 | 0.5538 | 3 |
+| `static_dropout_0.08` | 0.08 | 4.9041 | 3.9234 | 0.9808 | 3 |
+| `static_dropout_0.02` | 0.02 | 5.0517 | 3.7576 | 1.2941 | 3 |
+### Stage 3: 2,000,000 Prefix Tokens
+| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+|---|---:|---:|---:|---:|---:|
+| `formula_wide_l8_h8` | 0.09 | 4.6212 | 4.1631 | 0.4581 | 3 |
+| `static_dropout_0.14` | 0.14 | 4.6379 | 4.1672 | 0.4707 | 3 |
+| `static_dropout_0.18` | 0.18 | 4.6397 | 4.2109 | 0.4288 | 3 |
+| `static_dropout_0.2` | 0.20 | 4.6572 | 4.2456 | 0.4116 | 3 |
+| `static_dropout_0.08` | 0.08 | 4.6681 | 4.1108 | 0.5573 | 3 |
+| `static_dropout_0.26` | 0.26 | 4.6792 | 4.3159 | 0.3633 | 3 |
+| `static_dropout_0.3` | 0.30 | 4.7109 | 4.3668 | 0.3441 | 3 |
+| `static_dropout_0.02` | 0.02 | 4.7498 | 4.0129 | 0.7370 | 3 |
+### Stage 4: 4,000,000 Prefix Tokens
+| Condition | Dropout | Mean val loss | Mean train loss | Mean gap | N |
+|---|---:|---:|---:|---:|---:|
+| `formula_wide_l8_h8` | 0.02 | 4.4658 | 4.1514 | 0.3144 | 3 |
+| `static_dropout_0.14` | 0.14 | 4.4946 | 4.2214 | 0.2732 | 3 |
+| `static_dropout_0.18` | 0.18 | 4.4968 | 4.2423 | 0.2545 | 3 |
+| `static_dropout_0.08` | 0.08 | 4.4989 | 4.1777 | 0.3211 | 3 |
+| `static_dropout_0.2` | 0.20 | 4.5175 | 4.2791 | 0.2384 | 3 |
+| `static_dropout_0.26` | 0.26 | 4.5411 | 4.3243 | 0.2169 | 3 |
+| `static_dropout_0.02` | 0.02 | 4.5426 | 4.1493 | 0.3933 | 3 |
+| `static_dropout_0.3` | 0.30 | 4.5754 | 4.3771 | 0.1984 | 3 |

runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/config.json ADDED

	@@ -0,0 +1,131 @@

+{
+  "args": {
+    "mode": "locked_stream",
+    "corpus": null,
+    "corpus_glob": null,
+    "text_column": "text",
+    "use_cached_data": true,
+    "output_dir": "runs/architecture_shape_holdout_wide_h8",
+    "resume_from": null,
+    "cache_dir": ".cache/dropout_decay",
+    "models": [
+      "wide_L8_H8_D384=8x8x384"
+    ],
+    "seeds": [
+      1,
+      2,
+      3
+    ],
+    "token_limits": [
+      5000000
+    ],
+    "stream_token_caps": [
+      250000,
+      500000,
+      1000000,
+      2000000,
+      4000000
+    ],
+    "val_tokens": 500000,
+    "allow_short_corpus": false,
+    "force_retokenize": false,
+    "vocab_size": 4096,
+    "tokenizer_train_chars": 10000000,
+    "block_size": 128,
+    "batch_size": 16,
+    "steps": 2000,
+    "stage_steps": 1000,
+    "dropout_rates": [
+      0.02,
+      0.08,
+      0.14,
+      0.18,
+      0.2,
+      0.26,
+      0.3
+    ],
+    "decays": [],
+    "anchor_decays": [
+      {
+        "name": "formula_wide_l8_h8",
+        "kind": "anchor_decay",
+        "initial": 0.301,
+        "final": 0.02,
+        "schedule": "log_prefix_anchor",
+        "decay_tokens": null,
+        "anchors": [
+          [
+            250000,
+            0.301
+          ],
+          [
+            500000,
+            0.254
+          ],
+          [
+            1000000,
+            0.177
+          ],
+          [
+            2000000,
+            0.087
+          ],
+          [
+            4000000,
+            0.02
+          ]
+        ]
+      }
+    ],
+    "decay_tokens": null,
+    "eval_batches": 64,
+    "train_eval_batches": 32,
+    "trace_eval_batches": 8,
+    "eval_every": 0,
+    "log_every": 500,
+    "lr": 0.0003,
+    "weight_decay": 0.1,
+    "grad_clip": 1.0,
+    "plateau_delta": 0.01,
+    "target_min_dropout": 0.1,
+    "min_nonzero_margin": 0.01,
+    "min_high_dropout_margin": 0.03,
+    "screen_early_stop": false,
+    "screen_prune_patience": 3,
+    "screen_prune_min_delta": 0.01
+  },
+  "mode": "locked_stream",
+  "seeds": [
+    1,
+    2,
+    3
+  ],
+  "models": [
+    {
+      "model_name": "wide_L8_H8_D384",
+      "n_layer": 8,
+      "n_head": 8,
+      "n_embd": 384
+    }
+  ],
+  "device": "mps",
+  "torch": "2.12.0",
+  "python": "3.11.15 (main, Mar  3 2026, 00:52:57) [Clang 21.0.0 (clang-2100.0.123.102)]",
+  "mps_available": true,
+  "attribution": "Derived from Andrej Karpathy's nanochat project (https://github.com/karpathy/nanochat), MIT License, Copyright (c) 2025 Andrej Karpathy.",
+  "tokenizer_path": ".cache/dropout_decay/tokenizer-v4096.json",
+  "encoded_path": ".cache/dropout_decay/tokens-v4096-uint16.npy",
+  "train_tokens": 5000970,
+  "val_tokens": 500000,
+  "effective_token_limits": [
+    5000000
+  ],
+  "effective_stream_token_caps": [
+    250000,
+    500000,
+    1000000,
+    2000000,
+    4000000
+  ],
+  "resume_from": null
+}

runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/metrics.jsonl ADDED

	@@ -0,0 +1,120 @@

+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.301, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 80.3044319152832, "eval_loss": 5.50944098085165, "generalization_gap": 0.9404481127858162, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.568992868065834, "train_loss_last": 4.461259841918945, "val_eval_loss": 5.50944098085165}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.254, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 81.0927050113678, "eval_loss": 5.128874830901623, "generalization_gap": 0.9115267693996429, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.21734806150198, "train_loss_last": 4.339989185333252, "val_eval_loss": 5.128874830901623}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.177, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 82.04068875312805, "eval_loss": 4.821432217955589, "generalization_gap": 0.6483669355511665, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.173065282404423, "train_loss_last": 4.370175838470459, "val_eval_loss": 4.821432217955589}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.087, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 86.07953906059265, "eval_loss": 4.621216967701912, "generalization_gap": 0.454586386680603, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.166630581021309, "train_loss_last": 4.40201473236084, "val_eval_loss": 4.621216967701912}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 89.61367917060852, "eval_loss": 4.4628773629665375, "generalization_gap": 0.3145897909998894, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.148287571966648, "train_loss_last": 4.2655863761901855, "val_eval_loss": 4.4628773629665375}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.301, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 89.19484186172485, "eval_loss": 5.511528238654137, "generalization_gap": 0.9201603531837463, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.59136788547039, "train_loss_last": 4.6994476318359375, "val_eval_loss": 5.511528238654137}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.254, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 88.02258896827698, "eval_loss": 5.102983042597771, "generalization_gap": 0.8699157163500786, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.233067326247692, "train_loss_last": 4.28831672668457, "val_eval_loss": 5.102983042597771}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.177, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 87.586266040802, "eval_loss": 4.846672013401985, "generalization_gap": 0.647594541311264, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.199077472090721, "train_loss_last": 4.06334114074707, "val_eval_loss": 4.846672013401985}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.087, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 87.69060587882996, "eval_loss": 4.626160830259323, "generalization_gap": 0.45169103145599365, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.1744697988033295, "train_loss_last": 4.392305374145508, "val_eval_loss": 4.626160830259323}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 87.76093912124634, "eval_loss": 4.473251141607761, "generalization_gap": 0.3120561018586159, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.1611950397491455, "train_loss_last": 4.331225872039795, "val_eval_loss": 4.473251141607761}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.301, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 87.97550106048584, "eval_loss": 5.471142560243607, "generalization_gap": 0.9215231537818909, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.549619406461716, "train_loss_last": 4.7740654945373535, "val_eval_loss": 5.471142560243607}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.254, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 89.6904628276825, "eval_loss": 5.118690565228462, "generalization_gap": 0.8392644375562668, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.279426127672195, "train_loss_last": 4.1092705726623535, "val_eval_loss": 5.118690565228462}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.177, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 91.43055891990662, "eval_loss": 4.837727375328541, "generalization_gap": 0.6756020113825798, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.162125363945961, "train_loss_last": 4.312147617340088, "val_eval_loss": 4.837727375328541}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.087, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 91.60441994667053, "eval_loss": 4.6162411123514175, "generalization_gap": 0.46792464703321457, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.148316465318203, "train_loss_last": 4.083148002624512, "val_eval_loss": 4.6162411123514175}
+{"condition": "formula_wide_l8_h8", "condition_kind": "anchor_decay", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.301, "dropout_schedule": "log_prefix_anchor", "elapsed_sec": 92.24732804298401, "eval_loss": 4.461184054613113, "generalization_gap": 0.31641608476638794, "model_config": {"block_size": 128, "dropout": 0.301, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.1447679698467255, "train_loss_last": 4.061500549316406, "val_eval_loss": 4.461184054613113}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.50047516822815, "eval_loss": 5.617484986782074, "generalization_gap": 1.740433618426323, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 3.877051368355751, "train_loss_last": 3.831916570663452, "val_eval_loss": 5.617484986782074}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.35584092140198, "eval_loss": 5.42622634023428, "generalization_gap": 1.9256957545876503, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.5005305856466293, "train_loss_last": 3.6908276081085205, "val_eval_loss": 5.42622634023428}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 91.07978391647339, "eval_loss": 5.010412596166134, "generalization_gap": 1.254038155078888, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.756374441087246, "train_loss_last": 3.6737847328186035, "val_eval_loss": 5.010412596166134}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.77991008758545, "eval_loss": 4.739814430475235, "generalization_gap": 0.7331958413124084, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.0066185891628265, "train_loss_last": 4.060349464416504, "val_eval_loss": 4.739814430475235}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 91.44194889068604, "eval_loss": 4.5367100313305855, "generalization_gap": 0.404718317091465, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.1319917142391205, "train_loss_last": 4.323214530944824, "val_eval_loss": 4.5367100313305855}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.89891386032104, "eval_loss": 5.631713815033436, "generalization_gap": 1.7653722912073135, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 3.8663415238261223, "train_loss_last": 3.846677303314209, "val_eval_loss": 5.631713815033436}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 88.31412315368652, "eval_loss": 5.415093585848808, "generalization_gap": 1.9046627581119537, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.5104308277368546, "train_loss_last": 3.469115972518921, "val_eval_loss": 5.415093585848808}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 88.07946801185608, "eval_loss": 5.082612656056881, "generalization_gap": 1.302198402583599, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.780414253473282, "train_loss_last": 3.9317455291748047, "val_eval_loss": 5.082612656056881}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 90.06063985824585, "eval_loss": 4.753706589341164, "generalization_gap": 0.7191323041915894, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.034574285149574, "train_loss_last": 4.145359516143799, "val_eval_loss": 4.753706589341164}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 92.71470427513123, "eval_loss": 4.546178944408894, "generalization_gap": 0.3828253224492073, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.163353621959686, "train_loss_last": 4.104133129119873, "val_eval_loss": 4.546178944408894}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 92.13415598869324, "eval_loss": 5.644713446497917, "generalization_gap": 1.806356817483902, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 3.838356629014015, "train_loss_last": 3.7551283836364746, "val_eval_loss": 5.644713446497917}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 89.27249383926392, "eval_loss": 5.429135553538799, "generalization_gap": 1.9069734960794449, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.5221620574593544, "train_loss_last": 3.522247314453125, "val_eval_loss": 5.429135553538799}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 87.17945766448975, "eval_loss": 5.062139481306076, "generalization_gap": 1.3260410204529762, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.7360984608531, "train_loss_last": 3.8998873233795166, "val_eval_loss": 5.062139481306076}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 85.28739190101624, "eval_loss": 4.756018981337547, "generalization_gap": 0.7586047351360321, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 3.997414246201515, "train_loss_last": 3.8002028465270996, "val_eval_loss": 4.756018981337547}
+{"condition": "static_dropout_0.02", "condition_kind": "static", "dropout_active_final": 0.02, "dropout_final": 0.02, "dropout_initial": 0.02, "dropout_schedule": "constant", "elapsed_sec": 85.18242502212524, "eval_loss": 4.544781379401684, "generalization_gap": 0.3922204002737999, "model_config": {"block_size": 128, "dropout": 0.02, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.152560979127884, "train_loss_last": 4.062686920166016, "val_eval_loss": 4.544781379401684}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 85.12396001815796, "eval_loss": 5.508156895637512, "generalization_gap": 1.4284128621220589, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.079744033515453, "train_loss_last": 4.3121185302734375, "val_eval_loss": 5.508156895637512}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 85.83875703811646, "eval_loss": 5.224022679030895, "generalization_gap": 1.4923789128661156, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.7316437661647797, "train_loss_last": 3.8869738578796387, "val_eval_loss": 5.224022679030895}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.98465609550476, "eval_loss": 4.877769023180008, "generalization_gap": 0.9518706128001213, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.9258984103798866, "train_loss_last": 3.9108357429504395, "val_eval_loss": 4.877769023180008}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.71924090385437, "eval_loss": 4.6648936197161674, "generalization_gap": 0.5523728355765343, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.112520784139633, "train_loss_last": 4.277665138244629, "val_eval_loss": 4.6648936197161674}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.46567440032959, "eval_loss": 4.493896655738354, "generalization_gap": 0.3229532167315483, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.170943439006805, "train_loss_last": 4.201048374176025, "val_eval_loss": 4.493896655738354}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.63766598701477, "eval_loss": 5.524788901209831, "generalization_gap": 1.4730282947421074, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.051760606467724, "train_loss_last": 4.1509013175964355, "val_eval_loss": 5.524788901209831}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.58094906806946, "eval_loss": 5.200122766196728, "generalization_gap": 1.4515998288989067, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.748522937297821, "train_loss_last": 3.867222309112549, "val_eval_loss": 5.200122766196728}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.7393548488617, "eval_loss": 4.9155143946409225, "generalization_gap": 0.9573525786399841, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.9581618160009384, "train_loss_last": 4.031224250793457, "val_eval_loss": 4.9155143946409225}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.68595004081726, "eval_loss": 4.671512261033058, "generalization_gap": 0.5327281430363655, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.138784117996693, "train_loss_last": 4.2578840255737305, "val_eval_loss": 4.671512261033058}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.58462691307068, "eval_loss": 4.501494288444519, "generalization_gap": 0.315948948264122, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.185545340180397, "train_loss_last": 4.138848304748535, "val_eval_loss": 4.501494288444519}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.52455997467041, "eval_loss": 5.486165724694729, "generalization_gap": 1.4197945520281792, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.06637117266655, "train_loss_last": 4.213794708251953, "val_eval_loss": 5.486165724694729}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.35219979286194, "eval_loss": 5.207873582839966, "generalization_gap": 1.4069878831505775, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.8008856996893883, "train_loss_last": 4.008660316467285, "val_eval_loss": 5.207873582839966}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.39520287513733, "eval_loss": 4.919148109853268, "generalization_gap": 1.0330585837364197, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 3.886089526116848, "train_loss_last": 3.963630199432373, "val_eval_loss": 4.919148109853268}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 86.12316012382507, "eval_loss": 4.667882397770882, "generalization_gap": 0.5868068486452103, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.081075549125671, "train_loss_last": 4.332503795623779, "val_eval_loss": 4.667882397770882}
+{"condition": "static_dropout_0.08", "condition_kind": "static", "dropout_active_final": 0.08, "dropout_final": 0.08, "dropout_initial": 0.08, "dropout_schedule": "constant", "elapsed_sec": 85.79110598564148, "eval_loss": 4.501279823482037, "generalization_gap": 0.32453126460313797, "model_config": {"block_size": 128, "dropout": 0.08, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.176748558878899, "train_loss_last": 4.332511901855469, "val_eval_loss": 4.501279823482037}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 82.9429247379303, "eval_loss": 5.474037267267704, "generalization_gap": 1.2244683280587196, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.249568939208984, "train_loss_last": 4.2807817459106445, "val_eval_loss": 5.474037267267704}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 83.7153468132019, "eval_loss": 5.1435349360108376, "generalization_gap": 1.219955712556839, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.9235792234539986, "train_loss_last": 4.343637943267822, "val_eval_loss": 5.1435349360108376}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 85.14877796173096, "eval_loss": 4.822992421686649, "generalization_gap": 0.8021149635314941, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.020877458155155, "train_loss_last": 4.408194541931152, "val_eval_loss": 4.822992421686649}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 86.28292083740234, "eval_loss": 4.645638138055801, "generalization_gap": 0.4661043509840965, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.179533787071705, "train_loss_last": 4.2894206047058105, "val_eval_loss": 4.645638138055801}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 87.48458695411682, "eval_loss": 4.496338866651058, "generalization_gap": 0.2786053493618965, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.217733517289162, "train_loss_last": 4.3651604652404785, "val_eval_loss": 4.496338866651058}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 88.62815809249878, "eval_loss": 5.471045903861523, "generalization_gap": 1.2540272250771523, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.21701867878437, "train_loss_last": 4.285619735717773, "val_eval_loss": 5.471045903861523}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 89.17306399345398, "eval_loss": 5.10762532055378, "generalization_gap": 1.161530278623104, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.9460950419306755, "train_loss_last": 4.307913780212402, "val_eval_loss": 5.10762532055378}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 89.68199706077576, "eval_loss": 4.867001831531525, "generalization_gap": 0.7607270255684853, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.106274805963039, "train_loss_last": 4.097533226013184, "val_eval_loss": 4.867001831531525}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 96.21420216560364, "eval_loss": 4.635738044977188, "generalization_gap": 0.45351576060056686, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.182222284376621, "train_loss_last": 4.320278644561768, "val_eval_loss": 4.635738044977188}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 98.51693487167358, "eval_loss": 4.502298936247826, "generalization_gap": 0.26660506427288055, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.235693871974945, "train_loss_last": 4.196667671203613, "val_eval_loss": 4.502298936247826}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 99.1040849685669, "eval_loss": 5.434773683547974, "generalization_gap": 1.2490219175815582, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.185751765966415, "train_loss_last": 4.543873310089111, "val_eval_loss": 5.434773683547974}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 101.41888475418091, "eval_loss": 5.116771973669529, "generalization_gap": 1.1446522697806358, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 3.972119703888893, "train_loss_last": 4.026454448699951, "val_eval_loss": 5.116771973669529}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 102.92470812797546, "eval_loss": 4.844192124903202, "generalization_gap": 0.8362765908241272, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.007915534079075, "train_loss_last": 4.160324573516846, "val_eval_loss": 4.844192124903202}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 102.17207908630371, "eval_loss": 4.632426172494888, "generalization_gap": 0.4925713837146759, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.139854788780212, "train_loss_last": 4.122490406036377, "val_eval_loss": 4.632426172494888}
+{"condition": "static_dropout_0.14", "condition_kind": "static", "dropout_active_final": 0.14, "dropout_final": 0.14, "dropout_initial": 0.14, "dropout_schedule": "constant", "elapsed_sec": 100.92274498939514, "eval_loss": 4.48524085432291, "generalization_gap": 0.2743789106607437, "model_config": {"block_size": 128, "dropout": 0.14, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.210861943662167, "train_loss_last": 4.3369245529174805, "val_eval_loss": 4.48524085432291}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 99.96288299560547, "eval_loss": 5.459806442260742, "generalization_gap": 1.099911704659462, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.35989473760128, "train_loss_last": 4.235060214996338, "val_eval_loss": 5.459806442260742}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 99.67736482620239, "eval_loss": 5.1199976950883865, "generalization_gap": 1.0895164832472801, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.030481211841106, "train_loss_last": 4.190820693969727, "val_eval_loss": 5.1199976950883865}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 99.31055998802185, "eval_loss": 4.822036981582642, "generalization_gap": 0.7236514016985893, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.098385579884052, "train_loss_last": 4.60369348526001, "val_eval_loss": 4.822036981582642}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 98.8000099658966, "eval_loss": 4.644418828189373, "generalization_gap": 0.41945642977952957, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.2249623984098434, "train_loss_last": 4.394259929656982, "val_eval_loss": 4.644418828189373}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 98.64591407775879, "eval_loss": 4.492378875613213, "generalization_gap": 0.25095726549625397, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.241421610116959, "train_loss_last": 4.539054870605469, "val_eval_loss": 4.492378875613213}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 98.27119994163513, "eval_loss": 5.469226226210594, "generalization_gap": 1.1356151103973389, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.333611115813255, "train_loss_last": 4.384670734405518, "val_eval_loss": 5.469226226210594}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 98.35777807235718, "eval_loss": 5.092819690704346, "generalization_gap": 1.0521608665585518, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.040658824145794, "train_loss_last": 4.205634117126465, "val_eval_loss": 5.092819690704346}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 91.07860922813416, "eval_loss": 4.853221297264099, "generalization_gap": 0.7027387022972107, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.150482594966888, "train_loss_last": 4.3069024085998535, "val_eval_loss": 4.853221297264099}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.5945131778717, "eval_loss": 4.632607348263264, "generalization_gap": 0.4114653095602989, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.221142038702965, "train_loss_last": 4.197530746459961, "val_eval_loss": 4.632607348263264}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.36042618751526, "eval_loss": 4.502480059862137, "generalization_gap": 0.2522999197244644, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.250180140137672, "train_loss_last": 4.284463405609131, "val_eval_loss": 4.502480059862137}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.31073188781738, "eval_loss": 5.429519824683666, "generalization_gap": 1.1455372348427773, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.283982589840889, "train_loss_last": 4.598309516906738, "val_eval_loss": 5.429519824683666}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.2884590625763, "eval_loss": 5.1077147498726845, "generalization_gap": 1.0169242396950722, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.090790510177612, "train_loss_last": 4.279513359069824, "val_eval_loss": 5.1077147498726845}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 89.53543972969055, "eval_loss": 4.832363620400429, "generalization_gap": 0.7342051789164543, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.0981584414839745, "train_loss_last": 4.1803812980651855, "val_eval_loss": 4.832363620400429}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 90.07991695404053, "eval_loss": 4.642039962112904, "generalization_gap": 0.45552774518728256, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.186512216925621, "train_loss_last": 4.316293716430664, "val_eval_loss": 4.642039962112904}
+{"condition": "static_dropout_0.18", "condition_kind": "static", "dropout_active_final": 0.18, "dropout_final": 0.18, "dropout_initial": 0.18, "dropout_schedule": "constant", "elapsed_sec": 90.63944125175476, "eval_loss": 4.495470866560936, "generalization_gap": 0.2602871060371399, "model_config": {"block_size": 128, "dropout": 0.18, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.235183760523796, "train_loss_last": 4.447503566741943, "val_eval_loss": 4.495470866560936}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 91.06378698348999, "eval_loss": 5.451976537704468, "generalization_gap": 1.041361778974533, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.410614758729935, "train_loss_last": 4.446702003479004, "val_eval_loss": 5.451976537704468}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.85460114479065, "eval_loss": 5.123210750520229, "generalization_gap": 1.0067783072590828, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.1164324432611465, "train_loss_last": 4.462212562561035, "val_eval_loss": 5.123210750520229}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.4302430152893, "eval_loss": 4.834423378109932, "generalization_gap": 0.670300304889679, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.164123073220253, "train_loss_last": 4.44389009475708, "val_eval_loss": 4.834423378109932}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.19487309455872, "eval_loss": 4.667039297521114, "generalization_gap": 0.4070756286382675, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.259963668882847, "train_loss_last": 4.521040439605713, "val_eval_loss": 4.667039297521114}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.51846718788147, "eval_loss": 4.518057778477669, "generalization_gap": 0.23345574736595154, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.284602031111717, "train_loss_last": 4.3808183670043945, "val_eval_loss": 4.518057778477669}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 91.02405881881714, "eval_loss": 5.476214177906513, "generalization_gap": 1.1110263615846634, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.36518781632185, "train_loss_last": 4.610562324523926, "val_eval_loss": 5.476214177906513}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 91.02885913848877, "eval_loss": 5.097359970211983, "generalization_gap": 0.9864056929945946, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.110954277217388, "train_loss_last": 4.277379035949707, "val_eval_loss": 5.097359970211983}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 91.20587420463562, "eval_loss": 4.862358041107655, "generalization_gap": 0.6659520193934441, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.1964060217142105, "train_loss_last": 4.454376220703125, "val_eval_loss": 4.862358041107655}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.5145378112793, "eval_loss": 4.6511543318629265, "generalization_gap": 0.39532778412103653, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.25582654774189, "train_loss_last": 4.377513885498047, "val_eval_loss": 4.6511543318629265}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.71747303009033, "eval_loss": 4.5257163643836975, "generalization_gap": 0.24465039372444153, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.281065970659256, "train_loss_last": 4.475866794586182, "val_eval_loss": 4.5257163643836975}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.78583598136902, "eval_loss": 5.4391709342598915, "generalization_gap": 1.1067671701312065, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.332403764128685, "train_loss_last": 4.592159271240234, "val_eval_loss": 5.4391709342598915}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 90.46544909477234, "eval_loss": 5.118095904588699, "generalization_gap": 0.965451754629612, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.152644149959087, "train_loss_last": 4.286029815673828, "val_eval_loss": 5.118095904588699}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 89.8702449798584, "eval_loss": 4.848977982997894, "generalization_gap": 0.7016597390174866, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.147318243980408, "train_loss_last": 4.145584583282471, "val_eval_loss": 4.848977982997894}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 89.51202011108398, "eval_loss": 4.653353154659271, "generalization_gap": 0.43230512738227844, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.221048027276993, "train_loss_last": 4.275428771972656, "val_eval_loss": 4.653353154659271}
+{"condition": "static_dropout_0.2", "condition_kind": "static", "dropout_active_final": 0.2, "dropout_final": 0.2, "dropout_initial": 0.2, "dropout_schedule": "constant", "elapsed_sec": 89.36881709098816, "eval_loss": 4.508818320930004, "generalization_gap": 0.2372145727276802, "model_config": {"block_size": 128, "dropout": 0.2, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.271603748202324, "train_loss_last": 4.303030967712402, "val_eval_loss": 4.508818320930004}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.39422821998596, "eval_loss": 5.471387661993504, "generalization_gap": 0.9843398556113243, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.487047806382179, "train_loss_last": 4.588815689086914, "val_eval_loss": 5.471387661993504}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.34041380882263, "eval_loss": 5.123171776533127, "generalization_gap": 0.9006927907466888, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.222478985786438, "train_loss_last": 4.4486284255981445, "val_eval_loss": 5.123171776533127}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.33436322212219, "eval_loss": 4.8384788408875465, "generalization_gap": 0.5947165563702583, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.243762284517288, "train_loss_last": 4.541705131530762, "val_eval_loss": 4.8384788408875465}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.38607597351074, "eval_loss": 4.680613316595554, "generalization_gap": 0.36394860595464706, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.316664710640907, "train_loss_last": 4.310101509094238, "val_eval_loss": 4.680613316595554}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.34234309196472, "eval_loss": 4.545445613563061, "generalization_gap": 0.21612011641263962, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.329325497150421, "train_loss_last": 4.295492172241211, "val_eval_loss": 4.545445613563061}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.37102317810059, "eval_loss": 5.4896402060985565, "generalization_gap": 1.0082881152629852, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.481352090835571, "train_loss_last": 4.5635986328125, "val_eval_loss": 5.4896402060985565}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.80504488945007, "eval_loss": 5.100566066801548, "generalization_gap": 0.9002188816666603, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.200347185134888, "train_loss_last": 4.5332489013671875, "val_eval_loss": 5.100566066801548}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 90.24263501167297, "eval_loss": 4.872108653187752, "generalization_gap": 0.5782715156674385, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.293837137520313, "train_loss_last": 4.513686656951904, "val_eval_loss": 4.872108653187752}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 90.11540699005127, "eval_loss": 4.671582758426666, "generalization_gap": 0.3512818217277527, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.320300936698914, "train_loss_last": 4.520935535430908, "val_eval_loss": 4.671582758426666}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.37596607208252, "eval_loss": 4.541754223406315, "generalization_gap": 0.21430686861276627, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.327447354793549, "train_loss_last": 4.469289302825928, "val_eval_loss": 4.541754223406315}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.083731174469, "eval_loss": 5.44633674621582, "generalization_gap": 0.9801725596189499, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.46616418659687, "train_loss_last": 4.668099403381348, "val_eval_loss": 5.44633674621582}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.23882603645325, "eval_loss": 5.117508083581924, "generalization_gap": 0.8479217141866684, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.269586369395256, "train_loss_last": 4.168695449829102, "val_eval_loss": 5.117508083581924}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.65575408935547, "eval_loss": 4.864691182971001, "generalization_gap": 0.6078529059886932, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.256838276982307, "train_loss_last": 4.411451816558838, "val_eval_loss": 4.864691182971001}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.62967777252197, "eval_loss": 4.685315124690533, "generalization_gap": 0.3747243508696556, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.310590773820877, "train_loss_last": 4.3449296951293945, "val_eval_loss": 4.685315124690533}
+{"condition": "static_dropout_0.26", "condition_kind": "static", "dropout_active_final": 0.26, "dropout_final": 0.26, "dropout_initial": 0.26, "dropout_schedule": "constant", "elapsed_sec": 89.2124891281128, "eval_loss": 4.536212712526321, "generalization_gap": 0.22022628784179688, "model_config": {"block_size": 128, "dropout": 0.26, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.3159864246845245, "train_loss_last": 4.2295427322387695, "val_eval_loss": 4.536212712526321}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.50791621208191, "eval_loss": 5.500275187194347, "generalization_gap": 0.918338917195797, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.58193626999855, "train_loss_last": 4.801484107971191, "val_eval_loss": 5.500275187194347}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 90.16439700126648, "eval_loss": 5.156496524810791, "generalization_gap": 0.8506038039922714, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.30589272081852, "train_loss_last": 4.409400939941406, "val_eval_loss": 5.156496524810791}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 90.12380909919739, "eval_loss": 4.879989787936211, "generalization_gap": 0.5560016483068466, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.323988139629364, "train_loss_last": 4.452445030212402, "val_eval_loss": 4.879989787936211}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.98156309127808, "eval_loss": 4.714666917920113, "generalization_gap": 0.33146432042121887, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.383202597498894, "train_loss_last": 4.41795539855957, "val_eval_loss": 4.714666917920113}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.62339806556702, "eval_loss": 4.573916859924793, "generalization_gap": 0.1942192241549492, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 1, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.379697635769844, "train_loss_last": 4.556178569793701, "val_eval_loss": 4.573916859924793}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.48192501068115, "eval_loss": 5.513812951743603, "generalization_gap": 0.9296739622950554, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.584138989448547, "train_loss_last": 4.781554222106934, "val_eval_loss": 5.513812951743603}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.30899000167847, "eval_loss": 5.122172422707081, "generalization_gap": 0.8031972870230675, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.318975135684013, "train_loss_last": 4.599480628967285, "val_eval_loss": 5.122172422707081}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.44387793540955, "eval_loss": 4.903594605624676, "generalization_gap": 0.534355454146862, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.369239151477814, "train_loss_last": 4.467323303222656, "val_eval_loss": 4.903594605624676}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 89.30373406410217, "eval_loss": 4.708515301346779, "generalization_gap": 0.329326331615448, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.379188969731331, "train_loss_last": 4.398643970489502, "val_eval_loss": 4.708515301346779}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 88.56540608406067, "eval_loss": 4.582499638199806, "generalization_gap": 0.19323347508907318, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 2, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.389266163110733, "train_loss_last": 4.4059600830078125, "val_eval_loss": 4.582499638199806}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 88.06415605545044, "eval_loss": 5.473536089062691, "generalization_gap": 0.9294483512639999, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 0, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_eval_loss": 4.544087737798691, "train_loss_last": 4.708844184875488, "val_eval_loss": 5.473536089062691}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 87.83247828483582, "eval_loss": 5.127706632018089, "generalization_gap": 0.8091608434915543, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 1, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_eval_loss": 4.318545788526535, "train_loss_last": 4.4110612869262695, "val_eval_loss": 5.127706632018089}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 87.72863101959229, "eval_loss": 4.881375916302204, "generalization_gap": 0.5711916759610176, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 2, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_eval_loss": 4.3101842403411865, "train_loss_last": 4.227794647216797, "val_eval_loss": 4.881375916302204}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 87.52914309501648, "eval_loss": 4.709614671766758, "generalization_gap": 0.3716374859213829, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 3, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_eval_loss": 4.337977185845375, "train_loss_last": 4.518902778625488, "val_eval_loss": 4.709614671766758}
+{"condition": "static_dropout_0.3", "condition_kind": "static", "dropout_active_final": 0.3, "dropout_final": 0.3, "dropout_initial": 0.3, "dropout_schedule": "constant", "elapsed_sec": 87.48567986488342, "eval_loss": 4.569913923740387, "generalization_gap": 0.20761683583259583, "model_config": {"block_size": 128, "dropout": 0.3, "n_embd": 384, "n_head": 8, "n_layer": 8, "vocab_size": 4096}, "model_name": "wide_L8_H8_D384", "n_embd": 384, "n_head": 8, "n_layer": 8, "parameters": 17301504, "run_mode": "locked_stream", "seed": 3, "stage": 4, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_eval_loss": 4.362297087907791, "train_loss_last": 4.421485900878906, "val_eval_loss": 4.569913923740387}

runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/summary.csv ADDED

	@@ -0,0 +1,41 @@

+run_mode,condition,condition_kind,stage,token_limit,model_name,n_layer,n_head,n_embd,parameters,dropout_initial,dropout_final,dropout_schedule,n,mean_train_eval_loss,std_train_eval_loss,mean_val_eval_loss,std_val_eval_loss,mean_generalization_gap,std_generalization_gap
+locked_stream,formula_wide_l8_h8,anchor_decay,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.56999338666598,0.020892215128759443,5.497370593249798,0.022738105633801235,0.9273772065838178,0.011340226985102892
+locked_stream,static_dropout_0.02,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,3.8605831737319627,0.01997973123033439,5.631304082771142,0.013618853293115287,1.7707209090391796,0.033285474730425785
+locked_stream,static_dropout_0.08,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,4.065958604216576,0.013996274750030279,5.506370507180691,0.01937345681127113,1.4404119029641151,0.02857342430791552
+locked_stream,static_dropout_0.14,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,4.217446461319923,0.03191073719912012,5.4599522848924,0.021856544511956278,1.2425058235724766,0.015820136190563192
+locked_stream,static_dropout_0.18,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.325829481085141,0.03854969420994424,5.452850831051667,0.02074692690552652,1.127021349966526,0.023996078635556407
+locked_stream,static_dropout_0.2,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.369402113060157,0.03927543942310581,5.455787216623624,0.01881333118653956,1.0863851035634677,0.03904945576752281
+locked_stream,static_dropout_0.26,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.478188027938207,0.01079536309478258,5.469121538102627,0.021740489819011388,0.9909335101644198,0.015173276757753403
+locked_stream,static_dropout_0.3,static,0,250000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.570054332415263,0.022514684546532972,5.495874742666881,0.02049583740381805,0.9258204102516174,0.006480144970797402
+locked_stream,formula_wide_l8_h8,anchor_decay,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.243280505140622,0.032274654795716486,5.116849479575952,0.013043710081023051,0.8735689744353294,0.036269420616252504
+locked_stream,static_dropout_0.02,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,3.5110411569476128,0.010828643474845237,5.423485159873962,0.007411461831348081,1.9124440029263496,0.011534364701109023
+locked_stream,static_dropout_0.08,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,3.760350801050663,0.03610450263157525,5.210673009355863,0.012193401903637546,1.4503222083051999,0.04270984927105307
+locked_stream,static_dropout_0.14,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,3.9472646564245224,0.024291367988665397,5.122644076744716,0.018661090977288925,1.175379420320193,0.039515834393275315
+locked_stream,static_dropout_0.18,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.053976848721504,0.03228513900248589,5.106844045221806,0.013609907255985704,1.0528671965003014,0.03630127590699434
+locked_stream,static_dropout_0.2,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.126676956812541,0.022654445827106268,5.112888875106971,0.013689433065470715,0.9862119182944298,0.020663957741316727
+locked_stream,static_dropout_0.26,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.230804180105527,0.0353623783786987,5.113748642305533,0.01176242224841787,0.8829444622000059,0.03033151506697832
+locked_stream,static_dropout_0.3,static,1,500000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.314471215009689,0.0074322948465704985,5.135458526511987,0.018428372079211774,0.8209873115022978,0.025821376260662585
+locked_stream,formula_wide_l8_h8,anchor_decay,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.178089372813702,0.018981456409242363,4.835277202228705,0.012797043787857624,0.6571878294150034,0.015951825016803534
+locked_stream,static_dropout_0.02,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,3.7576290518045425,0.022184519488321397,5.05172157784303,0.03721037081891111,1.2940925260384877,0.036679450397234824
+locked_stream,static_dropout_0.08,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,3.9233832508325577,0.03610191494845503,4.904143842558066,0.022913408617663143,0.9807605917255083,0.04537425441423101
+locked_stream,static_dropout_0.14,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,4.04502259939909,0.05344041051164446,4.844728792707126,0.022009612626586104,0.7997061933080355,0.037832338456288826
+locked_stream,static_dropout_0.18,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.115675538778305,0.03014400883010748,4.835873966415723,0.015885757236925133,0.7201984276374181,0.01601490118944517
+locked_stream,static_dropout_0.2,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.169282446304957,0.024947280987738584,4.84858646740516,0.013971446329696822,0.6793040211002032,0.01948231221870367
+locked_stream,static_dropout_0.26,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.26481256633997,0.025972383292421113,4.858426225682099,0.017668569161382996,0.5936136593421301,0.014821502950969461
+locked_stream,static_dropout_0.3,static,2,1000000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.334470510482788,0.030891434190892946,4.888320103287697,0.013246250571144034,0.5538495928049088,0.018512166716729395
+locked_stream,formula_wide_l8_h8,anchor_decay,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.163138948380947,0.013421730028855801,4.621206303437551,0.004959867552466137,0.45806735505660373,0.008658546315824916
+locked_stream,static_dropout_0.02,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,4.012869040171306,0.019352473408129574,4.749846667051315,0.008764765668577094,0.73697762688001,0.020006114758379243
+locked_stream,static_dropout_0.08,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,4.110793483753999,0.028893033853370435,4.6680960928400355,0.003314491274290642,0.5573026090860367,0.02737432374001718
+locked_stream,static_dropout_0.14,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,4.1672036200761795,0.023722898945556545,4.637934118509293,0.006874304525877202,0.4707304984331131,0.019934551772559723
+locked_stream,static_dropout_0.18,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.21087221801281,0.021182682238053856,4.63968871285518,0.006246922787675527,0.42881649484237033,0.0234751500917668
+locked_stream,static_dropout_0.2,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.245612747967243,0.021374004532989366,4.657182261347771,0.008606949344244184,0.4115695133805275,0.018893841026177444
+locked_stream,static_dropout_0.26,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.315852140386899,0.004905814773589955,4.679170399904251,0.006978966774156137,0.3633182595173518,0.011733969729614104
+locked_stream,static_dropout_0.3,static,3,2000000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.366789584358533,0.025032839092378575,4.710932297011216,0.003280655243995376,0.34414271265268326,0.0238351561114313
+locked_stream,formula_wide_l8_h8,anchor_decay,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.301,0.02,log_prefix_anchor,3,4.15141686052084,0.00864907460577688,4.4657708530624705,0.006533212137650865,0.31435399254163104,0.0021895349788719713
+locked_stream,static_dropout_0.02,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.02,0.02,constant,3,4.149302105108897,0.015932906479109284,4.542556785047054,0.005111427760726518,0.3932546799381574,0.010983082646486758
+locked_stream,static_dropout_0.08,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.08,0.08,constant,3,4.177745779355367,0.0073518511940769,4.498890255888303,0.004325913851224281,0.3211444765329361,0.0045681171466252805
+locked_stream,static_dropout_0.14,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.14,0.14,constant,3,4.221429777642091,0.012821970261243815,4.494626219073932,0.008657044012355454,0.27319644143184024,0.006086902797186131
+locked_stream,static_dropout_0.18,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.18,0.18,constant,3,4.242261836926143,0.007533414644039237,4.4967766006787615,0.005175633970589464,0.25451476375261944,0.005043870704268704
+locked_stream,static_dropout_0.2,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.2,0.2,constant,3,4.279090583324432,0.006720524978756697,4.51753082126379,0.008461337427952602,0.23844023793935776,0.005697079794179711
+locked_stream,static_dropout_0.26,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.26,0.26,constant,3,4.324253092209498,0.007220470805599103,4.541137516498566,0.004647242294749092,0.2168844242890676,0.0030328214421115702
+locked_stream,static_dropout_0.3,static,4,4000000,wide_L8_H8_D384,8,8,384,17301504,0.3,0.3,constant,3,4.37708696226279,0.013672763672564944,4.575443473954995,0.006430238324620357,0.19835651169220606,0.008034807259295824

runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/summary.json ADDED

	@@ -0,0 +1,882 @@

+[
+  {
+    "run_mode": "locked_stream",
+    "condition": "formula_wide_l8_h8",
+    "condition_kind": "anchor_decay",
+    "stage": 0,
+    "token_limit": 250000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.301,
+    "dropout_final": 0.02,
+    "dropout_schedule": "log_prefix_anchor",
+    "n": 3,
+    "mean_train_eval_loss": 4.56999338666598,
+    "std_train_eval_loss": 0.020892215128759443,
+    "mean_val_eval_loss": 5.497370593249798,
+    "std_val_eval_loss": 0.022738105633801235,
+    "mean_generalization_gap": 0.9273772065838178,
+    "std_generalization_gap": 0.011340226985102892
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.02",
+    "condition_kind": "static",
+    "stage": 0,
+    "token_limit": 250000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.02,
+    "dropout_final": 0.02,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 3.8605831737319627,
+    "std_train_eval_loss": 0.01997973123033439,
+    "mean_val_eval_loss": 5.631304082771142,
+    "std_val_eval_loss": 0.013618853293115287,
+    "mean_generalization_gap": 1.7707209090391796,
+    "std_generalization_gap": 0.033285474730425785
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.08",
+    "condition_kind": "static",
+    "stage": 0,
+    "token_limit": 250000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.08,
+    "dropout_final": 0.08,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.065958604216576,
+    "std_train_eval_loss": 0.013996274750030279,
+    "mean_val_eval_loss": 5.506370507180691,
+    "std_val_eval_loss": 0.01937345681127113,
+    "mean_generalization_gap": 1.4404119029641151,
+    "std_generalization_gap": 0.02857342430791552
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.14",
+    "condition_kind": "static",
+    "stage": 0,
+    "token_limit": 250000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.14,
+    "dropout_final": 0.14,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.217446461319923,
+    "std_train_eval_loss": 0.03191073719912012,
+    "mean_val_eval_loss": 5.4599522848924,
+    "std_val_eval_loss": 0.021856544511956278,
+    "mean_generalization_gap": 1.2425058235724766,
+    "std_generalization_gap": 0.015820136190563192
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.18",
+    "condition_kind": "static",
+    "stage": 0,
+    "token_limit": 250000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.18,
+    "dropout_final": 0.18,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.325829481085141,
+    "std_train_eval_loss": 0.03854969420994424,
+    "mean_val_eval_loss": 5.452850831051667,
+    "std_val_eval_loss": 0.02074692690552652,
+    "mean_generalization_gap": 1.127021349966526,
+    "std_generalization_gap": 0.023996078635556407
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.2",
+    "condition_kind": "static",
+    "stage": 0,
+    "token_limit": 250000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.2,
+    "dropout_final": 0.2,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.369402113060157,
+    "std_train_eval_loss": 0.03927543942310581,
+    "mean_val_eval_loss": 5.455787216623624,
+    "std_val_eval_loss": 0.01881333118653956,
+    "mean_generalization_gap": 1.0863851035634677,
+    "std_generalization_gap": 0.03904945576752281
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.26",
+    "condition_kind": "static",
+    "stage": 0,
+    "token_limit": 250000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.26,
+    "dropout_final": 0.26,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.478188027938207,
+    "std_train_eval_loss": 0.01079536309478258,
+    "mean_val_eval_loss": 5.469121538102627,
+    "std_val_eval_loss": 0.021740489819011388,
+    "mean_generalization_gap": 0.9909335101644198,
+    "std_generalization_gap": 0.015173276757753403
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.3",
+    "condition_kind": "static",
+    "stage": 0,
+    "token_limit": 250000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.3,
+    "dropout_final": 0.3,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.570054332415263,
+    "std_train_eval_loss": 0.022514684546532972,
+    "mean_val_eval_loss": 5.495874742666881,
+    "std_val_eval_loss": 0.02049583740381805,
+    "mean_generalization_gap": 0.9258204102516174,
+    "std_generalization_gap": 0.006480144970797402
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "formula_wide_l8_h8",
+    "condition_kind": "anchor_decay",
+    "stage": 1,
+    "token_limit": 500000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.301,
+    "dropout_final": 0.02,
+    "dropout_schedule": "log_prefix_anchor",
+    "n": 3,
+    "mean_train_eval_loss": 4.243280505140622,
+    "std_train_eval_loss": 0.032274654795716486,
+    "mean_val_eval_loss": 5.116849479575952,
+    "std_val_eval_loss": 0.013043710081023051,
+    "mean_generalization_gap": 0.8735689744353294,
+    "std_generalization_gap": 0.036269420616252504
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.02",
+    "condition_kind": "static",
+    "stage": 1,
+    "token_limit": 500000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.02,
+    "dropout_final": 0.02,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 3.5110411569476128,
+    "std_train_eval_loss": 0.010828643474845237,
+    "mean_val_eval_loss": 5.423485159873962,
+    "std_val_eval_loss": 0.007411461831348081,
+    "mean_generalization_gap": 1.9124440029263496,
+    "std_generalization_gap": 0.011534364701109023
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.08",
+    "condition_kind": "static",
+    "stage": 1,
+    "token_limit": 500000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.08,
+    "dropout_final": 0.08,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 3.760350801050663,
+    "std_train_eval_loss": 0.03610450263157525,
+    "mean_val_eval_loss": 5.210673009355863,
+    "std_val_eval_loss": 0.012193401903637546,
+    "mean_generalization_gap": 1.4503222083051999,
+    "std_generalization_gap": 0.04270984927105307
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.14",
+    "condition_kind": "static",
+    "stage": 1,
+    "token_limit": 500000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.14,
+    "dropout_final": 0.14,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 3.9472646564245224,
+    "std_train_eval_loss": 0.024291367988665397,
+    "mean_val_eval_loss": 5.122644076744716,
+    "std_val_eval_loss": 0.018661090977288925,
+    "mean_generalization_gap": 1.175379420320193,
+    "std_generalization_gap": 0.039515834393275315
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.18",
+    "condition_kind": "static",
+    "stage": 1,
+    "token_limit": 500000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.18,
+    "dropout_final": 0.18,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.053976848721504,
+    "std_train_eval_loss": 0.03228513900248589,
+    "mean_val_eval_loss": 5.106844045221806,
+    "std_val_eval_loss": 0.013609907255985704,
+    "mean_generalization_gap": 1.0528671965003014,
+    "std_generalization_gap": 0.03630127590699434
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.2",
+    "condition_kind": "static",
+    "stage": 1,
+    "token_limit": 500000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.2,
+    "dropout_final": 0.2,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.126676956812541,
+    "std_train_eval_loss": 0.022654445827106268,
+    "mean_val_eval_loss": 5.112888875106971,
+    "std_val_eval_loss": 0.013689433065470715,
+    "mean_generalization_gap": 0.9862119182944298,
+    "std_generalization_gap": 0.020663957741316727
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.26",
+    "condition_kind": "static",
+    "stage": 1,
+    "token_limit": 500000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.26,
+    "dropout_final": 0.26,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.230804180105527,
+    "std_train_eval_loss": 0.0353623783786987,
+    "mean_val_eval_loss": 5.113748642305533,
+    "std_val_eval_loss": 0.01176242224841787,
+    "mean_generalization_gap": 0.8829444622000059,
+    "std_generalization_gap": 0.03033151506697832
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.3",
+    "condition_kind": "static",
+    "stage": 1,
+    "token_limit": 500000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.3,
+    "dropout_final": 0.3,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.314471215009689,
+    "std_train_eval_loss": 0.0074322948465704985,
+    "mean_val_eval_loss": 5.135458526511987,
+    "std_val_eval_loss": 0.018428372079211774,
+    "mean_generalization_gap": 0.8209873115022978,
+    "std_generalization_gap": 0.025821376260662585
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "formula_wide_l8_h8",
+    "condition_kind": "anchor_decay",
+    "stage": 2,
+    "token_limit": 1000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.301,
+    "dropout_final": 0.02,
+    "dropout_schedule": "log_prefix_anchor",
+    "n": 3,
+    "mean_train_eval_loss": 4.178089372813702,
+    "std_train_eval_loss": 0.018981456409242363,
+    "mean_val_eval_loss": 4.835277202228705,
+    "std_val_eval_loss": 0.012797043787857624,
+    "mean_generalization_gap": 0.6571878294150034,
+    "std_generalization_gap": 0.015951825016803534
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.02",
+    "condition_kind": "static",
+    "stage": 2,
+    "token_limit": 1000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.02,
+    "dropout_final": 0.02,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 3.7576290518045425,
+    "std_train_eval_loss": 0.022184519488321397,
+    "mean_val_eval_loss": 5.05172157784303,
+    "std_val_eval_loss": 0.03721037081891111,
+    "mean_generalization_gap": 1.2940925260384877,
+    "std_generalization_gap": 0.036679450397234824
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.08",
+    "condition_kind": "static",
+    "stage": 2,
+    "token_limit": 1000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.08,
+    "dropout_final": 0.08,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 3.9233832508325577,
+    "std_train_eval_loss": 0.03610191494845503,
+    "mean_val_eval_loss": 4.904143842558066,
+    "std_val_eval_loss": 0.022913408617663143,
+    "mean_generalization_gap": 0.9807605917255083,
+    "std_generalization_gap": 0.04537425441423101
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.14",
+    "condition_kind": "static",
+    "stage": 2,
+    "token_limit": 1000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.14,
+    "dropout_final": 0.14,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.04502259939909,
+    "std_train_eval_loss": 0.05344041051164446,
+    "mean_val_eval_loss": 4.844728792707126,
+    "std_val_eval_loss": 0.022009612626586104,
+    "mean_generalization_gap": 0.7997061933080355,
+    "std_generalization_gap": 0.037832338456288826
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.18",
+    "condition_kind": "static",
+    "stage": 2,
+    "token_limit": 1000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.18,
+    "dropout_final": 0.18,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.115675538778305,
+    "std_train_eval_loss": 0.03014400883010748,
+    "mean_val_eval_loss": 4.835873966415723,
+    "std_val_eval_loss": 0.015885757236925133,
+    "mean_generalization_gap": 0.7201984276374181,
+    "std_generalization_gap": 0.01601490118944517
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.2",
+    "condition_kind": "static",
+    "stage": 2,
+    "token_limit": 1000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.2,
+    "dropout_final": 0.2,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.169282446304957,
+    "std_train_eval_loss": 0.024947280987738584,
+    "mean_val_eval_loss": 4.84858646740516,
+    "std_val_eval_loss": 0.013971446329696822,
+    "mean_generalization_gap": 0.6793040211002032,
+    "std_generalization_gap": 0.01948231221870367
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.26",
+    "condition_kind": "static",
+    "stage": 2,
+    "token_limit": 1000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.26,
+    "dropout_final": 0.26,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.26481256633997,
+    "std_train_eval_loss": 0.025972383292421113,
+    "mean_val_eval_loss": 4.858426225682099,
+    "std_val_eval_loss": 0.017668569161382996,
+    "mean_generalization_gap": 0.5936136593421301,
+    "std_generalization_gap": 0.014821502950969461
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.3",
+    "condition_kind": "static",
+    "stage": 2,
+    "token_limit": 1000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.3,
+    "dropout_final": 0.3,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.334470510482788,
+    "std_train_eval_loss": 0.030891434190892946,
+    "mean_val_eval_loss": 4.888320103287697,
+    "std_val_eval_loss": 0.013246250571144034,
+    "mean_generalization_gap": 0.5538495928049088,
+    "std_generalization_gap": 0.018512166716729395
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "formula_wide_l8_h8",
+    "condition_kind": "anchor_decay",
+    "stage": 3,
+    "token_limit": 2000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.301,
+    "dropout_final": 0.02,
+    "dropout_schedule": "log_prefix_anchor",
+    "n": 3,
+    "mean_train_eval_loss": 4.163138948380947,
+    "std_train_eval_loss": 0.013421730028855801,
+    "mean_val_eval_loss": 4.621206303437551,
+    "std_val_eval_loss": 0.004959867552466137,
+    "mean_generalization_gap": 0.45806735505660373,
+    "std_generalization_gap": 0.008658546315824916
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.02",
+    "condition_kind": "static",
+    "stage": 3,
+    "token_limit": 2000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.02,
+    "dropout_final": 0.02,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.012869040171306,
+    "std_train_eval_loss": 0.019352473408129574,
+    "mean_val_eval_loss": 4.749846667051315,
+    "std_val_eval_loss": 0.008764765668577094,
+    "mean_generalization_gap": 0.73697762688001,
+    "std_generalization_gap": 0.020006114758379243
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.08",
+    "condition_kind": "static",
+    "stage": 3,
+    "token_limit": 2000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.08,
+    "dropout_final": 0.08,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.110793483753999,
+    "std_train_eval_loss": 0.028893033853370435,
+    "mean_val_eval_loss": 4.6680960928400355,
+    "std_val_eval_loss": 0.003314491274290642,
+    "mean_generalization_gap": 0.5573026090860367,
+    "std_generalization_gap": 0.02737432374001718
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.14",
+    "condition_kind": "static",
+    "stage": 3,
+    "token_limit": 2000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.14,
+    "dropout_final": 0.14,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.1672036200761795,
+    "std_train_eval_loss": 0.023722898945556545,
+    "mean_val_eval_loss": 4.637934118509293,
+    "std_val_eval_loss": 0.006874304525877202,
+    "mean_generalization_gap": 0.4707304984331131,
+    "std_generalization_gap": 0.019934551772559723
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.18",
+    "condition_kind": "static",
+    "stage": 3,
+    "token_limit": 2000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.18,
+    "dropout_final": 0.18,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.21087221801281,
+    "std_train_eval_loss": 0.021182682238053856,
+    "mean_val_eval_loss": 4.63968871285518,
+    "std_val_eval_loss": 0.006246922787675527,
+    "mean_generalization_gap": 0.42881649484237033,
+    "std_generalization_gap": 0.0234751500917668
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.2",
+    "condition_kind": "static",
+    "stage": 3,
+    "token_limit": 2000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.2,
+    "dropout_final": 0.2,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.245612747967243,
+    "std_train_eval_loss": 0.021374004532989366,
+    "mean_val_eval_loss": 4.657182261347771,
+    "std_val_eval_loss": 0.008606949344244184,
+    "mean_generalization_gap": 0.4115695133805275,
+    "std_generalization_gap": 0.018893841026177444
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.26",
+    "condition_kind": "static",
+    "stage": 3,
+    "token_limit": 2000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.26,
+    "dropout_final": 0.26,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.315852140386899,
+    "std_train_eval_loss": 0.004905814773589955,
+    "mean_val_eval_loss": 4.679170399904251,
+    "std_val_eval_loss": 0.006978966774156137,
+    "mean_generalization_gap": 0.3633182595173518,
+    "std_generalization_gap": 0.011733969729614104
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.3",
+    "condition_kind": "static",
+    "stage": 3,
+    "token_limit": 2000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.3,
+    "dropout_final": 0.3,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.366789584358533,
+    "std_train_eval_loss": 0.025032839092378575,
+    "mean_val_eval_loss": 4.710932297011216,
+    "std_val_eval_loss": 0.003280655243995376,
+    "mean_generalization_gap": 0.34414271265268326,
+    "std_generalization_gap": 0.0238351561114313
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "formula_wide_l8_h8",
+    "condition_kind": "anchor_decay",
+    "stage": 4,
+    "token_limit": 4000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.301,
+    "dropout_final": 0.02,
+    "dropout_schedule": "log_prefix_anchor",
+    "n": 3,
+    "mean_train_eval_loss": 4.15141686052084,
+    "std_train_eval_loss": 0.00864907460577688,
+    "mean_val_eval_loss": 4.4657708530624705,
+    "std_val_eval_loss": 0.006533212137650865,
+    "mean_generalization_gap": 0.31435399254163104,
+    "std_generalization_gap": 0.0021895349788719713
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.02",
+    "condition_kind": "static",
+    "stage": 4,
+    "token_limit": 4000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.02,
+    "dropout_final": 0.02,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.149302105108897,
+    "std_train_eval_loss": 0.015932906479109284,
+    "mean_val_eval_loss": 4.542556785047054,
+    "std_val_eval_loss": 0.005111427760726518,
+    "mean_generalization_gap": 0.3932546799381574,
+    "std_generalization_gap": 0.010983082646486758
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.08",
+    "condition_kind": "static",
+    "stage": 4,
+    "token_limit": 4000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.08,
+    "dropout_final": 0.08,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.177745779355367,
+    "std_train_eval_loss": 0.0073518511940769,
+    "mean_val_eval_loss": 4.498890255888303,
+    "std_val_eval_loss": 0.004325913851224281,
+    "mean_generalization_gap": 0.3211444765329361,
+    "std_generalization_gap": 0.0045681171466252805
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.14",
+    "condition_kind": "static",
+    "stage": 4,
+    "token_limit": 4000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.14,
+    "dropout_final": 0.14,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.221429777642091,
+    "std_train_eval_loss": 0.012821970261243815,
+    "mean_val_eval_loss": 4.494626219073932,
+    "std_val_eval_loss": 0.008657044012355454,
+    "mean_generalization_gap": 0.27319644143184024,
+    "std_generalization_gap": 0.006086902797186131
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.18",
+    "condition_kind": "static",
+    "stage": 4,
+    "token_limit": 4000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.18,
+    "dropout_final": 0.18,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.242261836926143,
+    "std_train_eval_loss": 0.007533414644039237,
+    "mean_val_eval_loss": 4.4967766006787615,
+    "std_val_eval_loss": 0.005175633970589464,
+    "mean_generalization_gap": 0.25451476375261944,
+    "std_generalization_gap": 0.005043870704268704
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.2",
+    "condition_kind": "static",
+    "stage": 4,
+    "token_limit": 4000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.2,
+    "dropout_final": 0.2,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.279090583324432,
+    "std_train_eval_loss": 0.006720524978756697,
+    "mean_val_eval_loss": 4.51753082126379,
+    "std_val_eval_loss": 0.008461337427952602,
+    "mean_generalization_gap": 0.23844023793935776,
+    "std_generalization_gap": 0.005697079794179711
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.26",
+    "condition_kind": "static",
+    "stage": 4,
+    "token_limit": 4000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.26,
+    "dropout_final": 0.26,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.324253092209498,
+    "std_train_eval_loss": 0.007220470805599103,
+    "mean_val_eval_loss": 4.541137516498566,
+    "std_val_eval_loss": 0.004647242294749092,
+    "mean_generalization_gap": 0.2168844242890676,
+    "std_generalization_gap": 0.0030328214421115702
+  },
+  {
+    "run_mode": "locked_stream",
+    "condition": "static_dropout_0.3",
+    "condition_kind": "static",
+    "stage": 4,
+    "token_limit": 4000000,
+    "model_name": "wide_L8_H8_D384",
+    "n_layer": 8,
+    "n_head": 8,
+    "n_embd": 384,
+    "parameters": 17301504,
+    "dropout_initial": 0.3,
+    "dropout_final": 0.3,
+    "dropout_schedule": "constant",
+    "n": 3,
+    "mean_train_eval_loss": 4.37708696226279,
+    "std_train_eval_loss": 0.013672763672564944,
+    "mean_val_eval_loss": 4.575443473954995,
+    "std_val_eval_loss": 0.006430238324620357,
+    "mean_generalization_gap": 0.19835651169220606,
+    "std_generalization_gap": 0.008034807259295824
+  }
+]

runs/architecture_shape_holdout_wide_h8/locked_stream/20260528-151721/trace.jsonl ADDED

@@ -0,0 +1,240 @@

+{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.435223579406738}
+{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.461259841918945}
+{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.799189567565918}
+{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.339989185333252}
+{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.445383071899414}
+{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.370175838470459}
+{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.557982921600342}
+{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.40201473236084}
+{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.392435550689697}
+{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.2655863761901855}
+{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.108766555786133}
+{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.6994476318359375}
+{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.655489921569824}
+{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.28831672668457}
+{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.475219249725342}
+{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.06334114074707}
+{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.254112243652344}
+{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.392305374145508}
+{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.22216796875}
+{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.331225872039795}
+{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.339109420776367}
+{"condition": "formula_wide_l8_h8", "dropout": 0.301, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.7740654945373535}
+{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.636873245239258}
+{"condition": "formula_wide_l8_h8", "dropout": 0.254, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.1092705726623535}
+{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.257920742034912}
+{"condition": "formula_wide_l8_h8", "dropout": 0.177, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.312147617340088}
+{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.455212593078613}
+{"condition": "formula_wide_l8_h8", "dropout": 0.087, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.083148002624512}
+{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.338054656982422}
+{"condition": "formula_wide_l8_h8", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.061500549316406}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.998419761657715}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.831916570663452}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.281755447387695}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.6908276081085205}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9898693561553955}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.6737847328186035}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.473372459411621}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.060349464416504}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.2894182205200195}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.323214530944824}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.905608654022217}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.846677303314209}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.216981887817383}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.469115972518921}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.021525859832764}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.9317455291748047}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.007565975189209}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.145359516143799}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.317673206329346}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.104133129119873}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.906655788421631}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 3.7551283836364746}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.161432266235352}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.522247314453125}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 3.9082343578338623}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.8998873233795166}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.303224086761475}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 3.8002028465270996}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.325773239135742}
+{"condition": "static_dropout_0.02", "dropout": 0.02, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.062686920166016}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.93206787109375}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.3121185302734375}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.280060291290283}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.8869738578796387}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.202347755432129}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.9108357429504395}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.140951156616211}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.277665138244629}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.203189849853516}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.201048374176025}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.783326148986816}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.1509013175964355}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.0230302810668945}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 3.867222309112549}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.463627815246582}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.031224250793457}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.441164970397949}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.2578840255737305}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.3251543045043945}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.138848304748535}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.168774604797363}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.213794708251953}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.474993705749512}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.008660316467285}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.217497825622559}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 3.963630199432373}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.334596633911133}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.332503795623779}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.172324180603027}
+{"condition": "static_dropout_0.08", "dropout": 0.08, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.332511901855469}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.201639652252197}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.2807817459106445}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.174047470092773}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.343637943267822}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.594155311584473}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.408194541931152}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.672398567199707}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.2894206047058105}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.316445827484131}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.3651604652404785}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.319988250732422}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.285619735717773}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.278521537780762}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.307913780212402}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.268133163452148}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.097533226013184}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.153548240661621}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.320278644561768}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.55961275100708}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.196667671203613}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 4.9843902587890625}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.543873310089111}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.703161239624023}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.026454448699951}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.307652473449707}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.160324573516846}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.454224109649658}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.122490406036377}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.278357028961182}
+{"condition": "static_dropout_0.14", "dropout": 0.14, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.3369245529174805}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.15626335144043}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.235060214996338}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.561727523803711}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.190820693969727}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.500772476196289}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.60369348526001}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.401540756225586}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.394259929656982}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.282741546630859}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.539054870605469}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.222869873046875}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.384670734405518}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.495967864990234}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.205634117126465}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.476202964782715}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.3069024085998535}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.370694160461426}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.197530746459961}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.3027873039245605}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.284463405609131}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.250606060028076}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.598309516906738}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.485733509063721}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.279513359069824}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.5862016677856445}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.1803812980651855}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.476117134094238}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.316293716430664}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.311079978942871}
+{"condition": "static_dropout_0.18", "dropout": 0.18, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.447503566741943}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.2886962890625}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.446702003479004}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.795304298400879}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.462212562561035}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.393544673919678}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.44389009475708}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.52379035949707}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.521040439605713}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.324027061462402}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.3808183670043945}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.325625419616699}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.610562324523926}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.611095905303955}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.277379035949707}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.572210311889648}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.454376220703125}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.59763240814209}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.377513885498047}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.411192893981934}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.475866794586182}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.290244102478027}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.592159271240234}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.338376998901367}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.286029815673828}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.403045177459717}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.145584583282471}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.335721969604492}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.275428771972656}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.567635536193848}
+{"condition": "static_dropout_0.2", "dropout": 0.2, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.303030967712402}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.306872367858887}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.588815689086914}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.641493797302246}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.4486284255981445}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.442417621612549}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.541705131530762}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.413729190826416}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.310101509094238}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.322086334228516}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.295492172241211}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.243927001953125}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.5635986328125}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.689737319946289}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.5332489013671875}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.609074592590332}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.513686656951904}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.5128173828125}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.520935535430908}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.397536277770996}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.469289302825928}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.171591281890869}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.668099403381348}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.4507832527160645}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.168695449829102}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.338339805603027}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.411451816558838}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.387897491455078}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.3449296951293945}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.204782485961914}
+{"condition": "static_dropout_0.26", "dropout": 0.26, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.2295427322387695}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.369487762451172}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.801484107971191}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.613731384277344}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.409400939941406}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.659245491027832}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.452445030212402}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.58389139175415}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.41795539855957}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.648049354553223}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 1, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.556178569793701}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.342921257019043}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.781554222106934}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.746016025543213}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.599480628967285}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.711769104003906}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.467323303222656}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.58270263671875}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.398643970489502}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.496487617492676}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 2, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.4059600830078125}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 500, "steps": 1000, "token_limit": 250000, "tokens_seen": 1024000, "train_batch_loss": 5.600184440612793}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 0, "step": 1000, "steps": 1000, "token_limit": 250000, "tokens_seen": 2048000, "train_batch_loss": 4.708844184875488}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 500, "steps": 1000, "token_limit": 500000, "tokens_seen": 3072000, "train_batch_loss": 4.56709098815918}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 1, "step": 1000, "steps": 1000, "token_limit": 500000, "tokens_seen": 4096000, "train_batch_loss": 4.4110612869262695}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 500, "steps": 1000, "token_limit": 1000000, "tokens_seen": 5120000, "train_batch_loss": 4.634671211242676}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 2, "step": 1000, "steps": 1000, "token_limit": 1000000, "tokens_seen": 6144000, "train_batch_loss": 4.227794647216797}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 500, "steps": 1000, "token_limit": 2000000, "tokens_seen": 7168000, "train_batch_loss": 4.368139743804932}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 3, "step": 1000, "steps": 1000, "token_limit": 2000000, "tokens_seen": 8192000, "train_batch_loss": 4.518902778625488}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 500, "steps": 1000, "token_limit": 4000000, "tokens_seen": 9216000, "train_batch_loss": 4.640765190124512}
+{"condition": "static_dropout_0.3", "dropout": 0.3, "event": "train_step", "model_name": "wide_L8_H8_D384", "run_mode": "locked_stream", "seed": 3, "stage": 4, "step": 1000, "steps": 1000, "token_limit": 4000000, "tokens_seen": 10240000, "train_batch_loss": 4.421485900878906}