Mandeep Sidhu

	# OpenWebText10K Streaming Validation

	Date: 2026-05-30

	This report combines 5 random seeds (1, 2, 3, 4, 5) from saved streaming runs.
	The script that generates this report performs no additional training; it only
	reads saved `metrics.jsonl` files.
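A minimal sketch of this kind of read-only aggregation is shown below. The record fields (`seed`, `stage`, `condition`, `val_loss`) are assumptions made for illustration; the actual `metrics.jsonl` schema is not shown in this report. The two sample rows reuse the seed 1 and seed 2 final values reported for `openwebtext10k_interaction`.

```python
import json
from collections import defaultdict
from statistics import mean, stdev


def load_records(lines):
    """Parse JSONL text lines into dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def final_stage_summary(records):
    """Return {condition: (n, mean, sample std)} of val_loss at the last stage.

    Field names here are hypothetical, not the report's confirmed schema.
    """
    last_stage = max(r["stage"] for r in records)
    by_condition = defaultdict(list)
    for r in records:
        if r["stage"] == last_stage:
            by_condition[r["condition"]].append(r["val_loss"])
    return {
        name: (len(vals), mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for name, vals in by_condition.items()
    }


# Illustrative rows only; a real run would iterate over the saved file instead.
sample = [
    '{"seed": 1, "stage": 4, "condition": "openwebtext10k_interaction", "val_loss": 4.4023}',
    '{"seed": 2, "stage": 4, "condition": "openwebtext10k_interaction", "val_loss": 4.4020}',
]
summary = final_stage_summary(load_records(sample))
```

With a real file, the `sample` list would be replaced by the lines of the saved `metrics.jsonl`, for example via `open(path, encoding="utf-8")`.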

	Regime: OpenWebText10K cached-corpus streaming setup with L16_H8_D384,
	31,457,280 parameters, five prefixes from 250k to 4M tokens (250k, 500k, 1M,
	2M, 4M), and 1,000 optimizer steps per stage. This is a clean five-seed run
	covering the OpenWebText10K interaction schedule, two heuristic decay
	schedules, the older fitted schedule, and static baselines.

	## Sources

	- `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl`

	## Condition Provenance

	The `anchor_decay` label means the dropout value is chosen from explicit
	prefix-token anchors. It does not by itself imply that the schedule came from
	the coefficient formula.
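One way to picture an anchor-based schedule is a lookup from prefix-token count to dropout. The sketch below uses the `hold_30_then_decay` path and the five prefixes from this report. The step-style lookup (use the largest anchor at or below the current prefix) and the names `HOLD_30_THEN_DECAY` and `dropout_at` are assumptions for illustration; the actual implementation may, for instance, interpolate between anchors.

```python
from bisect import bisect_right

# (prefix tokens, dropout) anchors for `hold_30_then_decay`, as listed in this report.
HOLD_30_THEN_DECAY = [
    (250_000, 0.30),
    (500_000, 0.30),
    (1_000_000, 0.20),
    (2_000_000, 0.10),
    (4_000_000, 0.02),
]


def dropout_at(prefix_tokens, anchors):
    """Return the dropout of the largest anchor <= prefix_tokens.

    Assumed behaviour: a step function that clamps to the first anchor
    when prefix_tokens is below every anchor.
    """
    prefixes = [p for p, _ in anchors]
    idx = bisect_right(prefixes, prefix_tokens) - 1
    return anchors[max(idx, 0)][1]


at_one_million = dropout_at(1_000_000, HOLD_30_THEN_DECAY)
between_anchors = dropout_at(3_000_000, HOLD_30_THEN_DECAY)
below_first = dropout_at(100_000, HOLD_30_THEN_DECAY)
```

Under this assumed lookup, the same function serves every `anchor_decay` condition; only the anchor table changes, which is why the label says nothing about where the values came from.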

	\| Condition \| Provenance \| Dropout path \| Interpretation \|
	\|---\|---\|---\|---\|
	\| `openwebtext10k_interaction` \| coefficient-derived schedule \| `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` \| Main OpenWebText10K formula-derived schedule. This is the condition that tests the regime-specific interaction coefficient hypothesis. \|
	\| `hold_30_then_decay` \| heuristic schedule-search ablation \| `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` \| Manually specified after exploratory single-seed OpenWebText10K schedule search. It caps the initial dropout at `0.30`, holds it for the two smallest stream prefixes, then releases capacity aggressively. \|
	\| `mild_30_to_08` \| heuristic schedule-search ablation \| `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` \| Manually specified after exploratory single-seed OpenWebText10K schedule search. It tests whether a smoother decay from `0.30` to a moderate final dropout is competitive. \|
	\| `fitted_l16_static_law` \| older fitted/static-law schedule \| `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` \| Retained as a comparison to the earlier overly aggressive fitted schedule; it is not the current interaction formula schedule. \|
	\| `static_dropout_*` \| static baseline \| constant \| Fixed dropout used at every stream prefix. \|

	The two heuristic schedules should be treated as ablations, not as independent
	evidence that the coefficient formula generated their exact paths. Their role is
	to show that the shape of the decay matters and that reasonable hand-designed
	decays can also beat weak static choices. The main formula claim for this
	regime should be based on `openwebtext10k_interaction`.

	## Condition Ranking By Final Loss

	\| Condition \| Kind \| N \| Mean trajectory val \| Std trajectory val \| Mean final val \| Std final val \| Mean final gap \| Dropout path \|
	\|---\|---\|---:\|---:\|---:\|---:\|---:\|---:\|---\|
	\| `openwebtext10k_interaction` \| `anchor_decay` \| 5 \| 4.8609 \| 0.0046 \| 4.3981 \| 0.0095 \| 0.3177 \| `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` \|
	\| `hold_30_then_decay` \| `anchor_decay` \| 5 \| 4.8512 \| 0.0017 \| 4.4052 \| 0.0112 \| 0.3565 \| `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` \|
	\| `mild_30_to_08` \| `anchor_decay` \| 5 \| 4.8509 \| 0.0015 \| 4.4073 \| 0.0085 \| 0.3337 \| `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` \|
	\| `fitted_l16_static_law` \| `anchor_decay` \| 5 \| 4.9521 \| 0.0039 \| 4.4124 \| 0.0084 \| 0.3137 \| `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` \|
	\| `static_dropout_0.14` \| `static` \| 5 \| 4.9051 \| 0.0088 \| 4.4455 \| 0.0120 \| 0.3289 \| `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` \|
	\| `static_dropout_0.3` \| `static` \| 5 \| 4.8767 \| 0.0019 \| 4.4668 \| 0.0141 \| 0.2349 \| `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` \|
	\| `static_dropout_0.02` \| `static` \| 5 \| 5.1571 \| 0.0097 \| 4.5358 \| 0.0091 \| 0.4829 \| `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` \|
	\| `static_dropout_0` \| `static` \| 5 \| 5.2511 \| 0.0160 \| 4.5943 \| 0.0216 \| 0.5529 \| `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` \|
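The summary columns can be cross-checked against the other tables in this report. The sketch below recomputes three `openwebtext10k_interaction` entries from the rounded values printed here. It relies on two inferences that the report does not state explicitly: `Mean trajectory val` is the average of the five per-stage mean validation losses, and `Std final val` is the sample standard deviation (n - 1 denominator) across seeds.

```python
from statistics import mean, stdev

# Per-stage mean validation loss for `openwebtext10k_interaction` (stages 0-4).
stage_mean_val = [5.4947, 5.0715, 4.7811, 4.5590, 4.3981]

# Per-seed final validation loss for the same condition (seeds 1-5).
final_val_by_seed = [4.4023, 4.4020, 4.4029, 4.3811, 4.4024]

trajectory_val = mean(stage_mean_val)   # compare with "Mean trajectory val"
final_mean = mean(final_val_by_seed)    # compare with "Mean final val"
final_std = stdev(final_val_by_seed)    # sample std; compare with "Std final val"
```

`Std trajectory val` cannot be reproduced this way, because it needs per-seed trajectory means that are not printed in this report.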

	## Paired Final-Loss Deltas

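	`delta_vs_best_static` (column `Delta vs best static`) is a condition's final
	validation loss minus that of the best static baseline for the same seed. A
	negative value means the condition beat that baseline. In all five seeds the
	best static baseline is `static_dropout_0.14`.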

	\| Seed \| Condition \| Final val \| Best static \| Best static final val \| Delta vs best static \|
	\|---:\|---\|---:\|---\|---:\|---:\|
	\| 1 \| `openwebtext10k_interaction` \| 4.4023 \| `static_dropout_0.14` \| 4.4418 \| -0.0394 \|
	\| 1 \| `hold_30_then_decay` \| 4.3939 \| `static_dropout_0.14` \| 4.4418 \| -0.0479 \|
	\| 1 \| `mild_30_to_08` \| 4.3995 \| `static_dropout_0.14` \| 4.4418 \| -0.0423 \|
	\| 1 \| `fitted_l16_static_law` \| 4.4207 \| `static_dropout_0.14` \| 4.4418 \| -0.0211 \|
	\| 1 \| `static_dropout_0.14` \| 4.4418 \| `static_dropout_0.14` \| 4.4418 \| +0.0000 \|
	\| 1 \| `static_dropout_0.3` \| 4.4602 \| `static_dropout_0.14` \| 4.4418 \| +0.0184 \|
	\| 1 \| `static_dropout_0.02` \| 4.5402 \| `static_dropout_0.14` \| 4.4418 \| +0.0984 \|
	\| 1 \| `static_dropout_0` \| 4.5704 \| `static_dropout_0.14` \| 4.4418 \| +0.1286 \|
	\| 2 \| `openwebtext10k_interaction` \| 4.4020 \| `static_dropout_0.14` \| 4.4602 \| -0.0583 \|
	\| 2 \| `hold_30_then_decay` \| 4.4068 \| `static_dropout_0.14` \| 4.4602 \| -0.0534 \|
	\| 2 \| `mild_30_to_08` \| 4.4080 \| `static_dropout_0.14` \| 4.4602 \| -0.0522 \|
	\| 2 \| `fitted_l16_static_law` \| 4.4136 \| `static_dropout_0.14` \| 4.4602 \| -0.0466 \|
	\| 2 \| `static_dropout_0.14` \| 4.4602 \| `static_dropout_0.14` \| 4.4602 \| +0.0000 \|
	\| 2 \| `static_dropout_0.3` \| 4.4719 \| `static_dropout_0.14` \| 4.4602 \| +0.0117 \|
	\| 2 \| `static_dropout_0.02` \| 4.5466 \| `static_dropout_0.14` \| 4.4602 \| +0.0864 \|
	\| 2 \| `static_dropout_0` \| 4.6094 \| `static_dropout_0.14` \| 4.4602 \| +0.1492 \|
	\| 3 \| `openwebtext10k_interaction` \| 4.4029 \| `static_dropout_0.14` \| 4.4356 \| -0.0328 \|
	\| 3 \| `hold_30_then_decay` \| 4.4174 \| `static_dropout_0.14` \| 4.4356 \| -0.0183 \|
	\| 3 \| `mild_30_to_08` \| 4.4151 \| `static_dropout_0.14` \| 4.4356 \| -0.0206 \|
	\| 3 \| `fitted_l16_static_law` \| 4.4134 \| `static_dropout_0.14` \| 4.4356 \| -0.0223 \|
	\| 3 \| `static_dropout_0.14` \| 4.4356 \| `static_dropout_0.14` \| 4.4356 \| +0.0000 \|
	\| 3 \| `static_dropout_0.3` \| 4.4758 \| `static_dropout_0.14` \| 4.4356 \| +0.0401 \|
	\| 3 \| `static_dropout_0.02` \| 4.5345 \| `static_dropout_0.14` \| 4.4356 \| +0.0988 \|
	\| 3 \| `static_dropout_0` \| 4.5928 \| `static_dropout_0.14` \| 4.4356 \| +0.1571 \|
	\| 4 \| `openwebtext10k_interaction` \| 4.3811 \| `static_dropout_0.14` \| 4.4337 \| -0.0526 \|
	\| 4 \| `hold_30_then_decay` \| 4.3936 \| `static_dropout_0.14` \| 4.4337 \| -0.0400 \|
	\| 4 \| `mild_30_to_08` \| 4.3978 \| `static_dropout_0.14` \| 4.4337 \| -0.0359 \|
	\| 4 \| `fitted_l16_static_law` \| 4.3983 \| `static_dropout_0.14` \| 4.4337 \| -0.0354 \|
	\| 4 \| `static_dropout_0.14` \| 4.4337 \| `static_dropout_0.14` \| 4.4337 \| +0.0000 \|
	\| 4 \| `static_dropout_0.3` \| 4.4455 \| `static_dropout_0.14` \| 4.4337 \| +0.0118 \|
	\| 4 \| `static_dropout_0.02` \| 4.5220 \| `static_dropout_0.14` \| 4.4337 \| +0.0883 \|
	\| 4 \| `static_dropout_0` \| 4.5768 \| `static_dropout_0.14` \| 4.4337 \| +0.1432 \|
	\| 5 \| `openwebtext10k_interaction` \| 4.4024 \| `static_dropout_0.14` \| 4.4560 \| -0.0536 \|
	\| 5 \| `hold_30_then_decay` \| 4.4145 \| `static_dropout_0.14` \| 4.4560 \| -0.0415 \|
	\| 5 \| `mild_30_to_08` \| 4.4161 \| `static_dropout_0.14` \| 4.4560 \| -0.0399 \|
	\| 5 \| `fitted_l16_static_law` \| 4.4161 \| `static_dropout_0.14` \| 4.4560 \| -0.0399 \|
	\| 5 \| `static_dropout_0.14` \| 4.4560 \| `static_dropout_0.14` \| 4.4560 \| +0.0000 \|
	\| 5 \| `static_dropout_0.3` \| 4.4805 \| `static_dropout_0.14` \| 4.4560 \| +0.0245 \|
	\| 5 \| `static_dropout_0.02` \| 4.5355 \| `static_dropout_0.14` \| 4.4560 \| +0.0796 \|
	\| 5 \| `static_dropout_0` \| 4.6219 \| `static_dropout_0.14` \| 4.4560 \| +0.1660 \|
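The paired comparison can be sketched as follows, using a subset of the seed 1 and seed 3 rows above. The function names and the dictionary layout are hypothetical. Deltas recomputed from the rounded table values can differ from the printed ones in the last digit, presumably because the report subtracts unrounded losses.

```python
def paired_deltas(final_val, static_prefix="static_dropout_"):
    """Map seed -> (best static name, {condition: final val - best static val})."""
    result = {}
    for seed, by_condition in final_val.items():
        statics = {c: v for c, v in by_condition.items() if c.startswith(static_prefix)}
        best_static = min(statics, key=statics.get)  # lowest final loss among statics
        baseline = statics[best_static]
        result[seed] = (best_static, {c: v - baseline for c, v in by_condition.items()})
    return result


def wins_and_worst(deltas_by_seed, condition):
    """Count seeds with a negative delta and return the least favourable delta."""
    deltas = [d[condition] for _, d in deltas_by_seed.values()]
    return sum(x < 0 for x in deltas), max(deltas)


# Subset of the table above: seeds 1 and 3, four conditions.
final_val = {
    1: {"openwebtext10k_interaction": 4.4023, "hold_30_then_decay": 4.3939,
        "static_dropout_0.14": 4.4418, "static_dropout_0.3": 4.4602},
    3: {"openwebtext10k_interaction": 4.4029, "hold_30_then_decay": 4.4174,
        "static_dropout_0.14": 4.4356, "static_dropout_0.3": 4.4758},
}
deltas = paired_deltas(final_val)
wins, worst = wins_and_worst(deltas, "hold_30_then_decay")
```

Selecting the baseline per seed, rather than fixing one static condition in advance, makes the comparison stricter: each schedule has to beat whichever static value happened to do best on that seed.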

	## Stage Trajectory

	\| Stage \| Prefix tokens \| Condition \| Dropout \| N \| Mean val \| Std val \| Mean train \| Mean gap \|
	\|---:\|---:\|---\|---:\|---:\|---:\|---:\|---:\|---:\|
	\| 0 \| 250,000 \| `mild_30_to_08` \| 0.300 \| 5 \| 5.4483 \| 0.0138 \| 4.4429 \| 1.0054 \|
	\| 0 \| 250,000 \| `hold_30_then_decay` \| 0.300 \| 5 \| 5.4483 \| 0.0138 \| 4.4429 \| 1.0054 \|
	\| 0 \| 250,000 \| `static_dropout_0.3` \| 0.300 \| 5 \| 5.4483 \| 0.0138 \| 4.4429 \| 1.0054 \|
	\| 0 \| 250,000 \| `static_dropout_0.14` \| 0.140 \| 5 \| 5.4773 \| 0.0224 \| 4.0298 \| 1.4475 \|
	\| 0 \| 250,000 \| `openwebtext10k_interaction` \| 0.385 \| 5 \| 5.4947 \| 0.0109 \| 4.6016 \| 0.8930 \|
	\| 0 \| 250,000 \| `static_dropout_0.02` \| 0.020 \| 5 \| 5.7426 \| 0.0242 \| 3.5371 \| 2.2055 \|
	\| 0 \| 250,000 \| `fitted_l16_static_law` \| 0.600 \| 5 \| 5.7842 \| 0.0096 \| 5.1640 \| 0.6202 \|
	\| 0 \| 250,000 \| `static_dropout_0` \| 0.000 \| 5 \| 5.8330 \| 0.0198 \| 3.4443 \| 2.3887 \|
	\| 1 \| 500,000 \| `mild_30_to_08` \| 0.240 \| 5 \| 5.0582 \| 0.0159 \| 4.0349 \| 1.0233 \|
	\| 1 \| 500,000 \| `static_dropout_0.3` \| 0.300 \| 5 \| 5.0667 \| 0.0173 \| 4.1383 \| 0.9284 \|
	\| 1 \| 500,000 \| `hold_30_then_decay` \| 0.300 \| 5 \| 5.0667 \| 0.0173 \| 4.1383 \| 0.9284 \|
	\| 1 \| 500,000 \| `openwebtext10k_interaction` \| 0.319 \| 5 \| 5.0715 \| 0.0118 \| 4.2065 \| 0.8650 \|
	\| 1 \| 500,000 \| `static_dropout_0.14` \| 0.140 \| 5 \| 5.1492 \| 0.0070 \| 3.7143 \| 1.4349 \|
	\| 1 \| 500,000 \| `fitted_l16_static_law` \| 0.400 \| 5 \| 5.1507 \| 0.0102 \| 4.4632 \| 0.6875 \|
	\| 1 \| 500,000 \| `static_dropout_0.02` \| 0.020 \| 5 \| 5.5754 \| 0.0248 \| 3.1246 \| 2.4508 \|
	\| 1 \| 500,000 \| `static_dropout_0` \| 0.000 \| 5 \| 5.7175 \| 0.0502 \| 2.9583 \| 2.7592 \|
	\| 2 \| 1,000,000 \| `hold_30_then_decay` \| 0.200 \| 5 \| 4.7757 \| 0.0144 \| 4.0378 \| 0.7379 \|
	\| 2 \| 1,000,000 \| `mild_30_to_08` \| 0.180 \| 5 \| 4.7774 \| 0.0138 \| 3.9886 \| 0.7888 \|
	\| 2 \| 1,000,000 \| `openwebtext10k_interaction` \| 0.227 \| 5 \| 4.7811 \| 0.0084 \| 4.0826 \| 0.6984 \|
	\| 2 \| 1,000,000 \| `static_dropout_0.3` \| 0.300 \| 5 \| 4.7983 \| 0.0144 \| 4.1501 \| 0.6481 \|
	\| 2 \| 1,000,000 \| `fitted_l16_static_law` \| 0.300 \| 5 \| 4.8326 \| 0.0102 \| 4.2632 \| 0.5694 \|
	\| 2 \| 1,000,000 \| `static_dropout_0.14` \| 0.140 \| 5 \| 4.8490 \| 0.0202 \| 3.8712 \| 0.9779 \|
	\| 2 \| 1,000,000 \| `static_dropout_0.02` \| 0.020 \| 5 \| 5.1470 \| 0.0222 \| 3.4615 \| 1.6854 \|
	\| 2 \| 1,000,000 \| `static_dropout_0` \| 0.000 \| 5 \| 5.2637 \| 0.0274 \| 3.3260 \| 1.9377 \|
	\| 3 \| 2,000,000 \| `openwebtext10k_interaction` \| 0.139 \| 5 \| 4.5590 \| 0.0142 \| 4.0802 \| 0.4788 \|
	\| 3 \| 2,000,000 \| `hold_30_then_decay` \| 0.100 \| 5 \| 4.5599 \| 0.0161 \| 4.0445 \| 0.5154 \|
	\| 3 \| 2,000,000 \| `mild_30_to_08` \| 0.120 \| 5 \| 4.5631 \| 0.0155 \| 4.0441 \| 0.5190 \|
	\| 3 \| 2,000,000 \| `fitted_l16_static_law` \| 0.140 \| 5 \| 4.5806 \| 0.0153 \| 4.1471 \| 0.4334 \|
	\| 3 \| 2,000,000 \| `static_dropout_0.3` \| 0.300 \| 5 \| 4.6035 \| 0.0141 \| 4.2150 \| 0.3885 \|
	\| 3 \| 2,000,000 \| `static_dropout_0.14` \| 0.140 \| 5 \| 4.6048 \| 0.0136 \| 4.0399 \| 0.5648 \|
	\| 3 \| 2,000,000 \| `static_dropout_0.02` \| 0.020 \| 5 \| 4.7847 \| 0.0196 \| 3.8405 \| 0.9442 \|
	\| 3 \| 2,000,000 \| `static_dropout_0` \| 0.000 \| 5 \| 4.8472 \| 0.0171 \| 3.7786 \| 1.0687 \|
	\| 4 \| 4,000,000 \| `openwebtext10k_interaction` \| 0.066 \| 5 \| 4.3981 \| 0.0095 \| 4.0805 \| 0.3177 \|
	\| 4 \| 4,000,000 \| `hold_30_then_decay` \| 0.020 \| 5 \| 4.4052 \| 0.0112 \| 4.0488 \| 0.3565 \|
	\| 4 \| 4,000,000 \| `mild_30_to_08` \| 0.080 \| 5 \| 4.4073 \| 0.0085 \| 4.0736 \| 0.3337 \|
	\| 4 \| 4,000,000 \| `fitted_l16_static_law` \| 0.020 \| 5 \| 4.4124 \| 0.0084 \| 4.0987 \| 0.3137 \|
	\| 4 \| 4,000,000 \| `static_dropout_0.14` \| 0.140 \| 5 \| 4.4455 \| 0.0120 \| 4.1165 \| 0.3289 \|
	\| 4 \| 4,000,000 \| `static_dropout_0.3` \| 0.300 \| 5 \| 4.4668 \| 0.0141 \| 4.2319 \| 0.2349 \|
	\| 4 \| 4,000,000 \| `static_dropout_0.02` \| 0.020 \| 5 \| 4.5358 \| 0.0091 \| 4.0529 \| 0.4829 \|
	\| 4 \| 4,000,000 \| `static_dropout_0` \| 0.000 \| 5 \| 4.5943 \| 0.0216 \| 4.0414 \| 0.5529 \|
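The `Mean gap` column appears to be mean validation loss minus mean training loss; this is an inference from the numbers, not something the report states. The sketch below recomputes it for three stage 4 rows and illustrates that the smallest gap and the lowest validation loss need not belong to the same condition.

```python
# (mean val, mean train) at stage 4 for three conditions, copied from the table.
stage4 = {
    "openwebtext10k_interaction": (4.3981, 4.0805),
    "static_dropout_0.3": (4.4668, 4.2319),
    "static_dropout_0": (4.5943, 4.0414),
}

# Assumed definition: gap = validation loss - training loss.
gaps = {name: val - train for name, (val, train) in stage4.items()}

smallest_gap = min(gaps, key=gaps.get)
best_val = min(stage4, key=lambda name: stage4[name][0])
```

Recomputed gaps can differ from the printed column in the last digit (for example 0.3176 here against 0.3177 in the table), which is consistent with the report averaging unrounded per-seed values.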

	## Interpretation

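	- `openwebtext10k_interaction` has the best 5-seed mean final validation loss: 4.3981 +/- 0.0095 (mean +/- std across seeds).
	- The second-best final condition is `hold_30_then_decay` at 4.4052 +/- 0.0112. The 0.0071 difference in means is smaller than either condition's std, and `hold_30_then_decay` has the lower final loss in seed 1 (4.3939 vs 4.4023).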
	- The best static baseline by mean final loss is `static_dropout_0.14` at 4.4455 +/- 0.0120.
	- `openwebtext10k_interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0328.
	- `hold_30_then_decay` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0183.
	- `mild_30_to_08` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0206.
	- `fitted_l16_static_law` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0211.
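	- Three conditions tie for the best first-stage result at prefix 250,000 with mean validation loss 5.4483: `mild_30_to_08`, `hold_30_then_decay`, and `static_dropout_0.3`, all of which use dropout 0.30 at that stage. Compare this with the final ranking before claiming a schedule is uniformly better.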
	- This is a saved-run streaming validation artifact. Treat it as strong
	evidence only when the tested conditions, seeds, static baselines, and
	stream protocol match the claim being made.
