# OpenWebText10K Streaming Validation

Date: 2026-05-30

This report aggregates five random seeds (1, 2, 3, 4, 5) from saved streaming
runs. The script that generates it performs no additional training; it only
reads saved `metrics.jsonl` files.
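
As a rough illustration of that read-only aggregation, the sketch below loads a JSON Lines file and summarizes final-stage validation loss per condition. The field names `condition`, `stage`, and `val_loss` are assumptions made for this example; the actual schema of `metrics.jsonl` is not shown in this report.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean, stdev


def load_records(path):
    """Read one JSON object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize_final(records, final_stage=4):
    """Return {condition: (mean, sample std)} of final-stage validation loss.

    The keys "condition", "stage", and "val_loss" are hypothetical names.
    """
    by_condition = defaultdict(list)
    for rec in records:
        if rec["stage"] == final_stage:
            by_condition[rec["condition"]].append(rec["val_loss"])
    return {
        name: (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for name, vals in by_condition.items()
    }
```

A call such as `summarize_final(load_records("metrics.jsonl"))` would then yield one `(mean, std)` pair per condition, with one record per seed feeding each pair.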

Regime: OpenWebText10K cached-corpus streaming setup with L16_H8_D384,
31,457,280 parameters, five prefixes from 250k to 4M tokens, and 1,000
optimizer steps per stage. This is a clean five-seed run including the
OpenWebText10K interaction schedule, empirical decay schedules, and static
baselines.

## Sources

- `runs/openwebtext10k_l16_updated_formula_clean_5seed/locked_stream/20260530-174525/metrics.jsonl`

## Condition Provenance

The `anchor_decay` label means the dropout value is chosen from explicit
prefix-token anchors. It does not by itself imply that the schedule came from
the coefficient formula.
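
A minimal sketch of what an anchor-based lookup could look like, using the `hold_30_then_decay` path from the table below. The function name and the step-function behaviour between anchors are assumptions for illustration; this report only evaluates dropout at the five anchor prefixes themselves.

```python
# Prefix-token anchors for `hold_30_then_decay`, copied from its dropout path.
HOLD_30_THEN_DECAY = {
    250_000: 0.30,
    500_000: 0.30,
    1_000_000: 0.20,
    2_000_000: 0.10,
    4_000_000: 0.02,
}


def dropout_for_prefix(anchors, prefix_tokens):
    """Return the dropout of the largest anchor that is <= prefix_tokens.

    Holding the value between anchors is an assumption of this sketch.
    """
    eligible = [tokens for tokens in anchors if tokens <= prefix_tokens]
    if not eligible:
        raise ValueError(f"no anchor at or below {prefix_tokens} tokens")
    return anchors[max(eligible)]


print(dropout_for_prefix(HOLD_30_THEN_DECAY, 1_000_000))  # 0.2
```

The lookup depends only on the anchor table, which is why the label alone does not reveal whether the anchors came from a formula or were set by hand.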

| Condition | Provenance | Dropout path | Interpretation |
|---|---|---|---|
| `openwebtext10k_interaction` | coefficient-derived schedule | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` | Main OpenWebText10K formula-derived schedule. This is the condition that tests the regime-specific interaction coefficient hypothesis. |
| `hold_30_then_decay` | heuristic schedule-search ablation | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It caps the initial dropout at `0.30`, holds it for the two smallest stream prefixes, then releases capacity aggressively. |
| `mild_30_to_08` | heuristic schedule-search ablation | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` | Manually specified after exploratory single-seed OpenWebText10K schedule search. It tests whether a smoother decay from `0.30` to a moderate final dropout is competitive. |
| `fitted_l16_static_law` | older fitted/static-law schedule | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` | Retained as a comparison to the earlier overly aggressive fitted schedule; it is not the current interaction formula schedule. |
| `static_dropout_*` | static baseline | constant | Fixed dropout used at every stream prefix. |

The two heuristic schedules should be treated as ablations, not as independent
evidence that the coefficient formula generated their exact paths. Their role is
to show that the shape of the decay matters and that reasonable hand-designed
decays can also beat weak static choices. The main formula claim for this
regime should be based on `openwebtext10k_interaction`.

## Condition Ranking By Final Loss

| Condition | Kind | N | Mean trajectory val | Std trajectory val | Mean final val | Std final val | Mean final gap | Dropout path |
|---|---|---:|---:|---:|---:|---:|---:|---|
| `openwebtext10k_interaction` | `anchor_decay` | 5 | 4.8609 | 0.0046 | 4.3981 | 0.0095 | 0.3177 | `0.39 -> 0.32 -> 0.23 -> 0.14 -> 0.07` |
| `hold_30_then_decay` | `anchor_decay` | 5 | 4.8512 | 0.0017 | 4.4052 | 0.0112 | 0.3565 | `0.30 -> 0.30 -> 0.20 -> 0.10 -> 0.02` |
| `mild_30_to_08` | `anchor_decay` | 5 | 4.8509 | 0.0015 | 4.4073 | 0.0085 | 0.3337 | `0.30 -> 0.24 -> 0.18 -> 0.12 -> 0.08` |
| `fitted_l16_static_law` | `anchor_decay` | 5 | 4.9521 | 0.0039 | 4.4124 | 0.0084 | 0.3137 | `0.60 -> 0.40 -> 0.30 -> 0.14 -> 0.02` |
| `static_dropout_0.14` | `static` | 5 | 4.9051 | 0.0088 | 4.4455 | 0.0120 | 0.3289 | `0.14 -> 0.14 -> 0.14 -> 0.14 -> 0.14` |
| `static_dropout_0.3` | `static` | 5 | 4.8767 | 0.0019 | 4.4668 | 0.0141 | 0.2349 | `0.30 -> 0.30 -> 0.30 -> 0.30 -> 0.30` |
| `static_dropout_0.02` | `static` | 5 | 5.1571 | 0.0097 | 4.5358 | 0.0091 | 0.4829 | `0.02 -> 0.02 -> 0.02 -> 0.02 -> 0.02` |
| `static_dropout_0` | `static` | 5 | 5.2511 | 0.0160 | 4.5943 | 0.0216 | 0.5529 | `0.00 -> 0.00 -> 0.00 -> 0.00 -> 0.00` |
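
The final-loss columns can be checked against the per-seed values listed in the Paired Final-Loss Deltas table. The sketch below does this for two conditions. That the reported std is the sample standard deviation (divisor N - 1) is an inference from the numbers rather than something the report states.

```python
from statistics import mean, stdev

# Per-seed final validation losses (seeds 1 to 5), copied from the paired table.
final_val = {
    "openwebtext10k_interaction": [4.4023, 4.4020, 4.4029, 4.3811, 4.4024],
    "static_dropout_0.14": [4.4418, 4.4602, 4.4356, 4.4337, 4.4560],
}

# statistics.stdev uses the N - 1 divisor (sample standard deviation).
summary = {
    name: (round(mean(vals), 4), round(stdev(vals), 4))
    for name, vals in final_val.items()
}

for name, (avg, std) in summary.items():
    print(f"{name}: {avg:.4f} +/- {std:.4f}")
```

Both conditions reproduce the table entries (4.3981 and 0.0095; 4.4455 and 0.0120) when the N - 1 divisor is used.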

## Paired Final-Loss Deltas

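Negative `delta_vs_best_static` (the `Delta vs best static` column) means the
condition beat the best static baseline for that seed. The best static baseline
is selected per seed by final validation loss; in this run it is
`static_dropout_0.14` for all five seeds.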

| Seed | Condition | Final val | Best static | Best static final val | Delta vs best static |
|---:|---|---:|---|---:|---:|
| 1 | `openwebtext10k_interaction` | 4.4023 | `static_dropout_0.14` | 4.4418 | -0.0394 |
| 1 | `hold_30_then_decay` | 4.3939 | `static_dropout_0.14` | 4.4418 | -0.0479 |
| 1 | `mild_30_to_08` | 4.3995 | `static_dropout_0.14` | 4.4418 | -0.0423 |
| 1 | `fitted_l16_static_law` | 4.4207 | `static_dropout_0.14` | 4.4418 | -0.0211 |
| 1 | `static_dropout_0.14` | 4.4418 | `static_dropout_0.14` | 4.4418 | +0.0000 |
| 1 | `static_dropout_0.3` | 4.4602 | `static_dropout_0.14` | 4.4418 | +0.0184 |
| 1 | `static_dropout_0.02` | 4.5402 | `static_dropout_0.14` | 4.4418 | +0.0984 |
| 1 | `static_dropout_0` | 4.5704 | `static_dropout_0.14` | 4.4418 | +0.1286 |
| 2 | `openwebtext10k_interaction` | 4.4020 | `static_dropout_0.14` | 4.4602 | -0.0583 |
| 2 | `hold_30_then_decay` | 4.4068 | `static_dropout_0.14` | 4.4602 | -0.0534 |
| 2 | `mild_30_to_08` | 4.4080 | `static_dropout_0.14` | 4.4602 | -0.0522 |
| 2 | `fitted_l16_static_law` | 4.4136 | `static_dropout_0.14` | 4.4602 | -0.0466 |
| 2 | `static_dropout_0.14` | 4.4602 | `static_dropout_0.14` | 4.4602 | +0.0000 |
| 2 | `static_dropout_0.3` | 4.4719 | `static_dropout_0.14` | 4.4602 | +0.0117 |
| 2 | `static_dropout_0.02` | 4.5466 | `static_dropout_0.14` | 4.4602 | +0.0864 |
| 2 | `static_dropout_0` | 4.6094 | `static_dropout_0.14` | 4.4602 | +0.1492 |
| 3 | `openwebtext10k_interaction` | 4.4029 | `static_dropout_0.14` | 4.4356 | -0.0328 |
| 3 | `hold_30_then_decay` | 4.4174 | `static_dropout_0.14` | 4.4356 | -0.0183 |
| 3 | `mild_30_to_08` | 4.4151 | `static_dropout_0.14` | 4.4356 | -0.0206 |
| 3 | `fitted_l16_static_law` | 4.4134 | `static_dropout_0.14` | 4.4356 | -0.0223 |
| 3 | `static_dropout_0.14` | 4.4356 | `static_dropout_0.14` | 4.4356 | +0.0000 |
| 3 | `static_dropout_0.3` | 4.4758 | `static_dropout_0.14` | 4.4356 | +0.0401 |
| 3 | `static_dropout_0.02` | 4.5345 | `static_dropout_0.14` | 4.4356 | +0.0988 |
| 3 | `static_dropout_0` | 4.5928 | `static_dropout_0.14` | 4.4356 | +0.1571 |
| 4 | `openwebtext10k_interaction` | 4.3811 | `static_dropout_0.14` | 4.4337 | -0.0526 |
| 4 | `hold_30_then_decay` | 4.3936 | `static_dropout_0.14` | 4.4337 | -0.0400 |
| 4 | `mild_30_to_08` | 4.3978 | `static_dropout_0.14` | 4.4337 | -0.0359 |
| 4 | `fitted_l16_static_law` | 4.3983 | `static_dropout_0.14` | 4.4337 | -0.0354 |
| 4 | `static_dropout_0.14` | 4.4337 | `static_dropout_0.14` | 4.4337 | +0.0000 |
| 4 | `static_dropout_0.3` | 4.4455 | `static_dropout_0.14` | 4.4337 | +0.0118 |
| 4 | `static_dropout_0.02` | 4.5220 | `static_dropout_0.14` | 4.4337 | +0.0883 |
| 4 | `static_dropout_0` | 4.5768 | `static_dropout_0.14` | 4.4337 | +0.1432 |
| 5 | `openwebtext10k_interaction` | 4.4024 | `static_dropout_0.14` | 4.4560 | -0.0536 |
| 5 | `hold_30_then_decay` | 4.4145 | `static_dropout_0.14` | 4.4560 | -0.0415 |
| 5 | `mild_30_to_08` | 4.4161 | `static_dropout_0.14` | 4.4560 | -0.0399 |
| 5 | `fitted_l16_static_law` | 4.4161 | `static_dropout_0.14` | 4.4560 | -0.0399 |
| 5 | `static_dropout_0.14` | 4.4560 | `static_dropout_0.14` | 4.4560 | +0.0000 |
| 5 | `static_dropout_0.3` | 4.4805 | `static_dropout_0.14` | 4.4560 | +0.0245 |
| 5 | `static_dropout_0.02` | 4.5355 | `static_dropout_0.14` | 4.4560 | +0.0796 |
| 5 | `static_dropout_0` | 4.6219 | `static_dropout_0.14` | 4.4560 | +0.1660 |
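
The paired delta for one seed can be sketched as follows, using a subset of the seed 4 rows above. The function name and the `static_dropout_` prefix test are choices made for this example. Because the inputs here are the rounded table values, recomputed deltas for other rows can differ from the table in the last digit.

```python
# Final validation losses for seed 4, copied from the table above (subset).
seed4_final_val = {
    "openwebtext10k_interaction": 4.3811,
    "mild_30_to_08": 4.3978,
    "static_dropout_0.14": 4.4337,
    "static_dropout_0.3": 4.4455,
    "static_dropout_0.02": 4.5220,
    "static_dropout_0": 4.5768,
}


def delta_vs_best_static(final_val):
    """Return (best static name, {condition: final val - best static final val})."""
    statics = {
        name: loss
        for name, loss in final_val.items()
        if name.startswith("static_dropout_")
    }
    best_name = min(statics, key=statics.get)
    best_loss = statics[best_name]
    return best_name, {name: loss - best_loss for name, loss in final_val.items()}


best_name, deltas = delta_vs_best_static(seed4_final_val)
print(best_name)
print(f"{deltas['openwebtext10k_interaction']:+.4f}")
```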

## Stage Trajectory

| Stage | Prefix tokens | Condition | Dropout | N | Mean val | Std val | Mean train | Mean gap |
|---:|---:|---|---:|---:|---:|---:|---:|---:|
| 0 | 250,000 | `mild_30_to_08` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
| 0 | 250,000 | `hold_30_then_decay` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
| 0 | 250,000 | `static_dropout_0.3` | 0.300 | 5 | 5.4483 | 0.0138 | 4.4429 | 1.0054 |
| 0 | 250,000 | `static_dropout_0.14` | 0.140 | 5 | 5.4773 | 0.0224 | 4.0298 | 1.4475 |
| 0 | 250,000 | `openwebtext10k_interaction` | 0.385 | 5 | 5.4947 | 0.0109 | 4.6016 | 0.8930 |
| 0 | 250,000 | `static_dropout_0.02` | 0.020 | 5 | 5.7426 | 0.0242 | 3.5371 | 2.2055 |
| 0 | 250,000 | `fitted_l16_static_law` | 0.600 | 5 | 5.7842 | 0.0096 | 5.1640 | 0.6202 |
| 0 | 250,000 | `static_dropout_0` | 0.000 | 5 | 5.8330 | 0.0198 | 3.4443 | 2.3887 |
| 1 | 500,000 | `mild_30_to_08` | 0.240 | 5 | 5.0582 | 0.0159 | 4.0349 | 1.0233 |
| 1 | 500,000 | `static_dropout_0.3` | 0.300 | 5 | 5.0667 | 0.0173 | 4.1383 | 0.9284 |
| 1 | 500,000 | `hold_30_then_decay` | 0.300 | 5 | 5.0667 | 0.0173 | 4.1383 | 0.9284 |
| 1 | 500,000 | `openwebtext10k_interaction` | 0.319 | 5 | 5.0715 | 0.0118 | 4.2065 | 0.8650 |
| 1 | 500,000 | `static_dropout_0.14` | 0.140 | 5 | 5.1492 | 0.0070 | 3.7143 | 1.4349 |
| 1 | 500,000 | `fitted_l16_static_law` | 0.400 | 5 | 5.1507 | 0.0102 | 4.4632 | 0.6875 |
| 1 | 500,000 | `static_dropout_0.02` | 0.020 | 5 | 5.5754 | 0.0248 | 3.1246 | 2.4508 |
| 1 | 500,000 | `static_dropout_0` | 0.000 | 5 | 5.7175 | 0.0502 | 2.9583 | 2.7592 |
| 2 | 1,000,000 | `hold_30_then_decay` | 0.200 | 5 | 4.7757 | 0.0144 | 4.0378 | 0.7379 |
| 2 | 1,000,000 | `mild_30_to_08` | 0.180 | 5 | 4.7774 | 0.0138 | 3.9886 | 0.7888 |
| 2 | 1,000,000 | `openwebtext10k_interaction` | 0.227 | 5 | 4.7811 | 0.0084 | 4.0826 | 0.6984 |
| 2 | 1,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.7983 | 0.0144 | 4.1501 | 0.6481 |
| 2 | 1,000,000 | `fitted_l16_static_law` | 0.300 | 5 | 4.8326 | 0.0102 | 4.2632 | 0.5694 |
| 2 | 1,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.8490 | 0.0202 | 3.8712 | 0.9779 |
| 2 | 1,000,000 | `static_dropout_0.02` | 0.020 | 5 | 5.1470 | 0.0222 | 3.4615 | 1.6854 |
| 2 | 1,000,000 | `static_dropout_0` | 0.000 | 5 | 5.2637 | 0.0274 | 3.3260 | 1.9377 |
| 3 | 2,000,000 | `openwebtext10k_interaction` | 0.139 | 5 | 4.5590 | 0.0142 | 4.0802 | 0.4788 |
| 3 | 2,000,000 | `hold_30_then_decay` | 0.100 | 5 | 4.5599 | 0.0161 | 4.0445 | 0.5154 |
| 3 | 2,000,000 | `mild_30_to_08` | 0.120 | 5 | 4.5631 | 0.0155 | 4.0441 | 0.5190 |
| 3 | 2,000,000 | `fitted_l16_static_law` | 0.140 | 5 | 4.5806 | 0.0153 | 4.1471 | 0.4334 |
| 3 | 2,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.6035 | 0.0141 | 4.2150 | 0.3885 |
| 3 | 2,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.6048 | 0.0136 | 4.0399 | 0.5648 |
| 3 | 2,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.7847 | 0.0196 | 3.8405 | 0.9442 |
| 3 | 2,000,000 | `static_dropout_0` | 0.000 | 5 | 4.8472 | 0.0171 | 3.7786 | 1.0687 |
| 4 | 4,000,000 | `openwebtext10k_interaction` | 0.066 | 5 | 4.3981 | 0.0095 | 4.0805 | 0.3177 |
| 4 | 4,000,000 | `hold_30_then_decay` | 0.020 | 5 | 4.4052 | 0.0112 | 4.0488 | 0.3565 |
| 4 | 4,000,000 | `mild_30_to_08` | 0.080 | 5 | 4.4073 | 0.0085 | 4.0736 | 0.3337 |
| 4 | 4,000,000 | `fitted_l16_static_law` | 0.020 | 5 | 4.4124 | 0.0084 | 4.0987 | 0.3137 |
| 4 | 4,000,000 | `static_dropout_0.14` | 0.140 | 5 | 4.4455 | 0.0120 | 4.1165 | 0.3289 |
| 4 | 4,000,000 | `static_dropout_0.3` | 0.300 | 5 | 4.4668 | 0.0141 | 4.2319 | 0.2349 |
| 4 | 4,000,000 | `static_dropout_0.02` | 0.020 | 5 | 4.5358 | 0.0091 | 4.0529 | 0.4829 |
| 4 | 4,000,000 | `static_dropout_0` | 0.000 | 5 | 4.5943 | 0.0216 | 4.0414 | 0.5529 |
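
Two derived columns can be cross-checked from this table. The sketch below assumes that `Mean gap` is validation loss minus training loss and that `Mean trajectory val` in the ranking table is the average of the five per-stage mean validation losses. Both readings are inferred from the numbers, not stated by the report.

```python
from statistics import mean

# (mean val, mean train) per stage for `openwebtext10k_interaction`,
# copied from the Stage Trajectory table.
stages = [
    (5.4947, 4.6016),
    (5.0715, 4.2065),
    (4.7811, 4.0826),
    (4.5590, 4.0802),
    (4.3981, 4.0805),
]

# Average validation loss over the five stream prefixes.
trajectory_val = mean(val for val, _ in stages)

# Generalization gap per stage, taken as validation minus training loss.
gaps = [val - train for val, train in stages]

print(round(trajectory_val, 4))  # 4.8609
```

The recomputed gaps agree with the `Mean gap` column to within one unit in the fourth decimal place, which is consistent with the table values having been rounded after averaging.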

## Interpretation

- `openwebtext10k_interaction` has the best 5-seed mean final validation loss: 4.3981 +/- 0.0095 (mean +/- standard deviation across seeds, here and below).
- The second-best final condition is `hold_30_then_decay` at 4.4052 +/- 0.0112.
- The best static baseline by mean final loss is `static_dropout_0.14` at 4.4455 +/- 0.0120.
- `openwebtext10k_interaction` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0328.
- `hold_30_then_decay` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0183.
- `mild_30_to_08` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0206.
- `fitted_l16_static_law` beats the per-seed best static baseline in 5/5 seeds; worst paired delta is -0.0211.
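- The best first-stage result, at prefix 250,000, is a three-way tie at mean validation loss 5.4483: `mild_30_to_08`, `hold_30_then_decay`, and `static_dropout_0.3` all use dropout 0.30 at stage 0 and report identical metrics. `openwebtext10k_interaction` ranks fifth at that stage (5.4947), so compare the first stage with the final ranking before claiming a schedule is uniformly better.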
- This is a saved-run streaming validation artifact. Treat it as strong
  evidence only when the tested conditions, seeds, static baselines, and
  stream protocol match the claim being made.
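
The win counts and worst paired deltas quoted above can be recomputed from the Paired Final-Loss Deltas table. In this sketch, "worst" is taken to mean the delta closest to zero, that is, the smallest margin over the best static baseline. The helper name is illustrative.

```python
# Delta vs best static for seeds 1 to 5, copied from the paired table.
paired_deltas = {
    "openwebtext10k_interaction": [-0.0394, -0.0583, -0.0328, -0.0526, -0.0536],
    "hold_30_then_decay": [-0.0479, -0.0534, -0.0183, -0.0400, -0.0415],
    "mild_30_to_08": [-0.0423, -0.0522, -0.0206, -0.0359, -0.0399],
    "fitted_l16_static_law": [-0.0211, -0.0466, -0.0223, -0.0354, -0.0399],
}


def summarize(deltas):
    """Return (seeds won, seeds total, worst delta) for one condition."""
    wins = sum(d < 0 for d in deltas)
    return wins, len(deltas), max(deltas)


for name, deltas in paired_deltas.items():
    wins, total, worst = summarize(deltas)
    print(f"{name}: {wins}/{total} seeds, worst delta {worst:+.4f}")
```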