| # OpenWebText10K Streaming Validation |
|
|
| Date: 2026-05-30 |
|
|
| This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs. |
| No additional training is performed by this script; it reads saved |
| `metrics.jsonl` files. |
|
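As a rough illustration of what reading saved metrics without retraining can look like, the sketch below parses JSON-lines records and collects final-stage validation losses per condition and seed. The record layout (`seed`, `condition`, `stage`, `val_loss`) and the helper `final_losses` are hypothetical; the actual `metrics.jsonl` schema is not shown in this report. The loss values in the sample are copied from the tables in this report.

```python
import io
import json
from collections import defaultdict

# Hypothetical record layout: the real metrics.jsonl schema is an assumption.
# Loss values are copied from the per-seed table in this report.
SAMPLE = io.StringIO(
    '{"seed": 1, "condition": "openwebtext10k_interaction", "stage": 4, "val_loss": 4.4023}\n'
    '{"seed": 2, "condition": "openwebtext10k_interaction", "stage": 4, "val_loss": 4.4020}\n'
    '{"seed": 1, "condition": "static_dropout_0.14", "stage": 4, "val_loss": 4.4418}\n'
    '{"seed": 2, "condition": "static_dropout_0.14", "stage": 4, "val_loss": 4.4602}\n'
)


def final_losses(lines, final_stage=4):
    """Map condition -> {seed: val_loss} for records at the final stage."""
    by_condition = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        if rec["stage"] == final_stage:
            by_condition[rec["condition"]][rec["seed"]] = rec["val_loss"]
    return dict(by_condition)


losses = final_losses(SAMPLE)
print(losses["openwebtext10k_interaction"])
```

For a real file, the same function could be given an open file handle in place of `SAMPLE`, after adjusting the field names to the actual schema.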
|
| Regime: OpenWebText10K cached-corpus streaming setup with L16_H8_D384, |
| 31,457,280 parameters, five stream prefixes (250k, 500k, 1M, 2M, and 4M |
| tokens), and 1,000 optimizer steps per stage. This is a clean five-seed run |
| including the OpenWebText10K interaction schedule, empirical decay schedules, |
| and static baselines. |
|
|
| ## Sources |
|
|
| - `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl` |
|
|
| ## Condition Provenance |
|
|
| The `anchor_decay` label means the dropout value is chosen from explicit |
| prefix-token anchors. It does not by itself imply that the schedule came from |
| the coefficient formula. |
|
|
| | Condition | Provenance | Dropout path | Interpretation | |
| |---|---|---|---| |
| | `openwebtext10k_interaction` | coefficient-derived schedule | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` | Main OpenWebText10K formula-derived schedule. This is the condition that tests the regime-specific interaction coefficient hypothesis. | |
| | `hold_30_then_decay` | heuristic schedule-search ablation | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It caps the initial dropout at `0.30`, holds it for the two smallest stream prefixes, then releases capacity aggressively. | |
| | `mild_30_to_08` | heuristic schedule-search ablation | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It tests whether a smoother decay from `0.30` to a moderate final dropout is competitive. | |
| | `fitted_l16_static_law` | older fitted/static-law schedule | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` | Retained as a comparison to the earlier overly aggressive fitted schedule; it is not the current interaction formula schedule. | |
| | `static_dropout_*` | static baseline | constant | Fixed dropout used at every stream prefix. | |
|
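A minimal sketch of an anchor-based schedule, using the `mild_30_to_08` path from the table above. The anchor positions and dropout values come from this report; the step-lookup rule (use the largest anchor at or below the current prefix, clamped at the first anchor) and the name `dropout_for_prefix` are assumptions, since the report does not state how values between anchors are resolved.

```python
from bisect import bisect_right

# (prefix_tokens, dropout) anchors for `mild_30_to_08`, taken from this report.
ANCHORS = [
    (250_000, 0.30),
    (500_000, 0.24),
    (1_000_000, 0.18),
    (2_000_000, 0.12),
    (4_000_000, 0.08),
]


def dropout_for_prefix(prefix_tokens, anchors=ANCHORS):
    """Return the dropout of the largest anchor <= prefix_tokens.

    Assumed rule: step lookup, clamped to the first anchor for small prefixes.
    """
    tokens = [t for t, _ in anchors]
    idx = max(bisect_right(tokens, prefix_tokens) - 1, 0)
    return anchors[idx][1]


path = [dropout_for_prefix(t) for t, _ in ANCHORS]
print(" -> ".join(f"{p:.2f}" for p in path))
```

At the five tested prefixes this lookup reproduces the listed path exactly, so the choice of interpolation rule does not affect the results reported here.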
|
| The two heuristic schedules should be treated as ablations, not as independent |
| evidence that the coefficient formula generated their exact paths. Their role is |
| to show that the shape of the decay matters and that reasonable hand-designed |
| decays can also beat weak static choices. The main formula claim for this |
| regime should be based on `openwebtext10k_interaction`. |
|
|
| ## Condition Ranking By Final Loss |
|
|
| | Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path | |
| |---|---|---:|---:|---:|---:|---:|---:|---| |
| | `openwebtext10k_interaction` | `anchor_decay` | 5 | 4.8609 | 0.0046 | 4.3981 | 0.0095 | 0.3177 | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` | |
| | `hold_30_then_decay` | `anchor_decay` | 5 | 4.8512 | 0.0017 | 4.4052 | 0.0112 | 0.3565 | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` | |
| | `mild_30_to_08` | `anchor_decay` | 5 | 4.8509 | 0.0015 | 4.4073 | 0.0085 | 0.3337 | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` | |
| | `fitted_l16_static_law` | `anchor_decay` | 5 | 4.9521 | 0.0039 | 4.4124 | 0.0084 | 0.3137 | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` | |
| | `static_dropout_0.14` | `static` | 5 | 4.9051 | 0.0088 | 4.4455 | 0.0120 | 0.3289 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` | |
| | `static_dropout_0.3` | `static` | 5 | 4.8767 | 0.0019 | 4.4668 | 0.0141 | 0.2349 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` | |
| | `static_dropout_0.02` | `static` | 5 | 5.1571 | 0.0097 | 4.5358 | 0.0091 | 0.4829 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` | |
| | `static_dropout_0` | `static` | 5 | 5.2511 | 0.0160 | 4.5943 | 0.0216 | 0.5529 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` | |
|
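The mean and standard deviation columns can be checked against the per-seed values listed in the Paired Final-Loss Deltas section. The sketch below does this for `openwebtext10k_interaction`. The reported `0.0095` is consistent with the sample standard deviation (divisor `N - 1`); that reading is inferred from the numbers, not stated by the report.

```python
import statistics

# Per-seed final validation losses for `openwebtext10k_interaction`
# (seeds 1 to 5), copied from the paired table in this report.
finals = [4.4023, 4.4020, 4.4029, 4.3811, 4.4024]

mean = statistics.mean(finals)
std = statistics.stdev(finals)  # sample standard deviation (N - 1 divisor)

print(f"{mean:.4f} +/- {std:.4f}")  # prints "4.3981 +/- 0.0095"
```

The population standard deviation (`statistics.pstdev`) would give about 0.0085 on the same values, which does not match the table.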
|
| ## Paired Final-Loss Deltas |
|
|
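| Negative `delta_vs_best_static` (the `Delta vs best static` column) means the |
| condition beat the best static baseline for that seed. The delta is the |
| condition's final validation loss minus the best static baseline's final |
| validation loss on the same seed; a few entries differ from the difference of |
| the displayed values by 0.0001, which is consistent with rounding after |
| subtraction. |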
|
|
| | Seed | Condition | Final val | Best static | Best static final val | Delta vs best static | |
| |---:|---|---:|---|---:|---:| |
| | 1 | `openwebtext10k_interaction` | 4.4023 | `static_dropout_0.14` | 4.4418 | -0.0394 | |
| | 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 | |
| | 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 | |
| | 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 | |
| | 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 | |
| | 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 | |
| | 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.0984 | |
| | 1 | `static_dropout_0` | 4.5704 | `static_dropout_0.14` | 4.4418 | +0.1286 | |
| | 2 | `openwebtext10k_interaction` | 4.4020 | `static_dropout_0.14` | 4.4602 | -0.0583 | |
| | 2 | `hold_30_then_decay` | 4.4068 | `static_dropout_0.14` | 4.4602 | -0.0534 | |
| | 2 | `mild_30_to_08` | 4.4080 | `static_dropout_0.14` | 4.4602 | -0.0522 | |
| | 2 | `fitted_l16_static_law` | 4.4136 | `static_dropout_0.14` | 4.4602 | -0.0466 | |
| | 2 | `static_dropout_0.14` | 4.4602 | `static_dropout_0.14` | 4.4602 | +0.0000 | |
| | 2 | `static_dropout_0.3` | 4.4719 | `static_dropout_0.14` | 4.4602 | +0.0117 | |
| | 2 | `static_dropout_0.02` | 4.5466 | `static_dropout_0.14` | 4.4602 | +0.0864 | |
| | 2 | `static_dropout_0` | 4.6094 | `static_dropout_0.14` | 4.4602 | +0.1492 | |
| | 3 | `openwebtext10k_interaction` | 4.4029 | `static_dropout_0.14` | 4.4356 | -0.0328 | |
| | 3 | `hold_30_then_decay` | 4.4174 | `static_dropout_0.14` | 4.4356 | -0.0183 | |
| | 3 | `mild_30_to_08` | 4.4151 | `static_dropout_0.14` | 4.4356 | -0.0206 | |
| | 3 | `fitted_l16_static_law` | 4.4134 | `static_dropout_0.14` | 4.4356 | -0.0223 | |
| | 3 | `static_dropout_0.14` | 4.4356 | `static_dropout_0.14` | 4.4356 | +0.0000 | |
| | 3 | `static_dropout_0.3` | 4.4758 | `static_dropout_0.14` | 4.4356 | +0.0401 | |
| | 3 | `static_dropout_0.02` | 4.5345 | `static_dropout_0.14` | 4.4356 | +0.0988 | |
| | 3 | `static_dropout_0` | 4.5928 | `static_dropout_0.14` | 4.4356 | +0.1571 | |
| | 4 | `openwebtext10k_interaction` | 4.3811 | `static_dropout_0.14` | 4.4337 | -0.0526 | |
| | 4 | `hold_30_then_decay` | 4.3936 | `static_dropout_0.14` | 4.4337 | -0.0400 | |
| | 4 | `mild_30_to_08` | 4.3978 | `static_dropout_0.14` | 4.4337 | -0.0359 | |
| | 4 | `fitted_l16_static_law` | 4.3983 | `static_dropout_0.14` | 4.4337 | -0.0354 | |
| | 4 | `static_dropout_0.14` | 4.4337 | `static_dropout_0.14` | 4.4337 | +0.0000 | |
| | 4 | `static_dropout_0.3` | 4.4455 | `static_dropout_0.14` | 4.4337 | +0.0118 | |
| | 4 | `static_dropout_0.02` | 4.5220 | `static_dropout_0.14` | 4.4337 | +0.0883 | |
| | 4 | `static_dropout_0` | 4.5768 | `static_dropout_0.14` | 4.4337 | +0.1432 | |
| | 5 | `openwebtext10k_interaction` | 4.4024 | `static_dropout_0.14` | 4.4560 | -0.0536 | |
| | 5 | `hold_30_then_decay` | 4.4145 | `static_dropout_0.14` | 4.4560 | -0.0415 | |
| | 5 | `mild_30_to_08` | 4.4161 | `static_dropout_0.14` | 4.4560 | -0.0399 | |
| | 5 | `fitted_l16_static_law` | 4.4161 | `static_dropout_0.14` | 4.4560 | -0.0399 | |
| | 5 | `static_dropout_0.14` | 4.4560 | `static_dropout_0.14` | 4.4560 | +0.0000 | |
| | 5 | `static_dropout_0.3` | 4.4805 | `static_dropout_0.14` | 4.4560 | +0.0245 | |
| | 5 | `static_dropout_0.02` | 4.5355 | `static_dropout_0.14` | 4.4560 | +0.0796 | |
| | 5 | `static_dropout_0` | 4.6219 | `static_dropout_0.14` | 4.4560 | +0.1660 | |
|
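The paired comparison in the table above can be sketched as follows: for each seed, take the lowest final validation loss among the static baselines and subtract it from the condition's final loss. The values are copied from the table; the helper name `paired_deltas` and the prefix-based selection of static conditions are illustrative assumptions, not the report script's actual code.

```python
# Final validation loss per seed (seeds 1 to 5), copied from the table above.
finals = {
    "openwebtext10k_interaction": [4.4023, 4.4020, 4.4029, 4.3811, 4.4024],
    "static_dropout_0.14": [4.4418, 4.4602, 4.4356, 4.4337, 4.4560],
    "static_dropout_0.3": [4.4602, 4.4719, 4.4758, 4.4455, 4.4805],
    "static_dropout_0.02": [4.5402, 4.5466, 4.5345, 4.5220, 4.5355],
    "static_dropout_0": [4.5704, 4.6094, 4.5928, 4.5768, 4.6219],
}


def paired_deltas(finals, condition, static_prefix="static_dropout_"):
    """Per-seed delta: condition loss minus the best (lowest) static loss."""
    statics = [v for k, v in finals.items() if k.startswith(static_prefix)]
    deltas = []
    for i, val in enumerate(finals[condition]):
        best_static = min(losses[i] for losses in statics)
        deltas.append(val - best_static)
    return deltas


deltas = paired_deltas(finals, "openwebtext10k_interaction")
wins = sum(d < 0 for d in deltas)  # seeds where the condition beat best static
worst = max(deltas)                # least negative delta across seeds
print(f"wins: {wins}/{len(deltas)}, worst paired delta: {worst:+.4f}")
```

Because the inputs here are already rounded to four decimals, the recomputed deltas can differ from the table by about 0.0001.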
|
| ## Stage Trajectory |
|
|
| | Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap | |
| |---:|---:|---|---:|---:|---:|---:|---:|---:| |
| | 0 | 250,000 | `mild_30_to_08` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 | |
| | 0 | 250,000 | `hold_30_then_decay` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 | |
| | 0 | 250,000 | `static_dropout_0.3` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 | |
| | 0 | 250,000 | `static_dropout_0.14` | 0.140 | 5 | 5.4773 | 0.0224 | 4.0298 | 1.4475 | |
| | 0 | 250,000 | `openwebtext10k_interaction` | 0.385 | 5 | 5.4947 | 0.0109 | 4.6016 | 0.8930 | |
| | 0 | 250,000 | `static_dropout_0.02` | 0.020 | 5 | 5.7426 | 0.0242 | 3.5371 | 2.2055 | |
| | 0 | 250,000 | `fitted_l16_static_law` | 0.600 | 5 | 5.7842 | 0.0096 | 5.1640 | 0.6202 | |
| | 0 | 250,000 | `static_dropout_0` | 0.000 | 5 | 5.8330 | 0.0198 | 3.4443 | 2.3887 | |
| | 1 | 500,000 | `mild_30_to_08` | 0.240 | 5 | 5.0582 | 0.0159 | 4.0349 | 1.0233 | |
| | 1 | 500,000 | `static_dropout_0.3` | 0.300 | 5 | 5.0667 | 0.0173 | 4.1383 | 0.9284 | |
| | 1 | 500,000 | `hold_30_then_decay` | 0.300 | 5 | 5.0667 | 0.0173 | 4.1383 | 0.9284 | |
| | 1 | 500,000 | `openwebtext10k_interaction` | 0.319 | 5 | 5.0715 | 0.0118 | 4.2065 | 0.8650 | |
| | 1 | 500,000 | `static_dropout_0.14` | 0.140 | 5 | 5.1492 | 0.0070 | 3.7143 | 1.4349 | |
| | 1 | 500,000 | `fitted_l16_static_law` | 0.400 | 5 | 5.1507 | 0.0102 | 4.4632 | 0.6875 | |
| | 1 | 500,000 | `static_dropout_0.02` | 0.020 | 5 | 5.5754 | 0.0248 | 3.1246 | 2.4508 | |
| | 1 | 500,000 | `static_dropout_0` | 0.000 | 5 | 5.7175 | 0.0502 | 2.9583 | 2.7592 | |
| | 2 | 1,000,000 | `hold_30_then_decay` | 0.200 | 5 | 4.7757 | 0.0144 | 4.0378 | 0.7379 | |
| | 2 | 1,000,000 | `mild_30_to_08` | 0.180 | 5 | 4.7774 | 0.0138 | 3.9886 | 0.7888 | |
| | 2 | 1,000,000 | `openwebtext10k_interaction` | 0.227 | 5 | 4.7811 | 0.0084 | 4.0826 | 0.6984 | |
| | 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.7983 | 0.0144 | 4.1501 | 0.6481 | |
| | 2 | 1,000,000 | `fitted_l16_static_law` | 0.300 | 5 | 4.8326 | 0.0102 | 4.2632 | 0.5694 | |
| | 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.8490 | 0.0202 | 3.8712 | 0.9779 | |
| | 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 5 | 5.1470 | 0.0222 | 3.4615 | 1.6854 | |
| | 2 | 1,000,000 | `static_dropout_0` | 0.000 | 5 | 5.2637 | 0.0274 | 3.3260 | 1.9377 | |
| | 3 | 2,000,000 | `openwebtext10k_interaction` | 0.139 | 5 | 4.5590 | 0.0142 | 4.0802 | 0.4788 | |
| | 3 | 2,000,000 | `hold_30_then_decay` | 0.100 | 5 | 4.5599 | 0.0161 | 4.0445 | 0.5154 | |
| | 3 | 2,000,000 | `mild_30_to_08` | 0.120 | 5 | 4.5631 | 0.0155 | 4.0441 | 0.5190 | |
| | 3 | 2,000,000 | `fitted_l16_static_law` | 0.140 | 5 | 4.5806 | 0.0153 | 4.1471 | 0.4334 | |
| | 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.6035 | 0.0141 | 4.2150 | 0.3885 | |
| | 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.6048 | 0.0136 | 4.0399 | 0.5648 | |
| | 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.7847 | 0.0196 | 3.8405 | 0.9442 | |
| | 3 | 2,000,000 | `static_dropout_0` | 0.000 | 5 | 4.8472 | 0.0171 | 3.7786 | 1.0687 | |
| | 4 | 4,000,000 | `openwebtext10k_interaction` | 0.066 | 5 | 4.3981 | 0.0095 | 4.0805 | 0.3177 | |
| | 4 | 4,000,000 | `hold_30_then_decay` | 0.020 | 5 | 4.4052 | 0.0112 | 4.0488 | 0.3565 | |
| | 4 | 4,000,000 | `mild_30_to_08` | 0.080 | 5 | 4.4073 | 0.0085 | 4.0736 | 0.3337 | |
| | 4 | 4,000,000 | `fitted_l16_static_law` | 0.020 | 5 | 4.4124 | 0.0084 | 4.0987 | 0.3137 | |
| | 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.4455 | 0.0120 | 4.1165 | 0.3289 | |
| | 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.4668 | 0.0141 | 4.2319 | 0.2349 | |
| | 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.5358 | 0.0091 | 4.0529 | 0.4829 | |
| | 4 | 4,000,000 | `static_dropout_0` | 0.000 | 5 | 4.5943 | 0.0216 | 4.0414 | 0.5529 | |
|
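The stage table also allows a check of two derived columns. In the numbers shown, `Mean gap` matches mean validation loss minus mean training loss, and the `Mean trajectory val` column of the ranking table matches the unweighted mean of the five stage-level validation means. Both readings are inferred from the values, not stated by the report. A sketch for `openwebtext10k_interaction`:

```python
# Stage-level means for `openwebtext10k_interaction` (stages 0 to 4),
# copied from the stage trajectory table above.
stage_val = [5.4947, 5.0715, 4.7811, 4.5590, 4.3981]
stage_train = [4.6016, 4.2065, 4.0826, 4.0802, 4.0805]

# Assumed definition: unweighted mean over stages.
trajectory_val = sum(stage_val) / len(stage_val)

# Assumed definition: generalization gap = validation loss - training loss.
gaps = [v - t for v, t in zip(stage_val, stage_train)]

print(f"trajectory val: {trajectory_val:.4f}")
print("gaps:", [round(g, 4) for g in gaps])
```

The recomputed gaps agree with the table to within 0.0001, the tolerance expected when working from rounded inputs.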
|
| ## Interpretation |
|
|
| - `openwebtext10k_interaction` has the best 5-seed mean final validation loss: 4.3981 +/- 0.0095 (mean +/- standard deviation across seeds, as in the ranking table). |
| - The second-best final condition is `hold_30_then_decay` at 4.4052 +/- 0.0112. |
| - The best static baseline by mean final loss is `static_dropout_0.14` at 4.4455 +/- 0.0120. |
| - `openwebtext10k_interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0328. |
| - `hold_30_then_decay` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0183. |
| - `mild_30_to_08` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0206. |
| - `fitted_l16_static_law` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0211. |
| - The lowest first-stage mean validation loss is 5.4483 at prefix 250,000, shared by `mild_30_to_08`, `hold_30_then_decay`, and `static_dropout_0.3`, which all use dropout 0.30 at that stage; compare this with the final ranking before claiming a schedule is uniformly better. |
| - This is a saved-run streaming validation artifact. Treat it as strong |
| evidence only when the tested conditions, seeds, static baselines, and |
| stream protocol match the claim being made. |
|
|