Mandeep Sidhu committed on
Commit
0e508c7
·
1 Parent(s): cecc0f6

Add standalone research report

Files changed (1)
  1. docs/dropout_decay_research_report.md +319 -0
docs/dropout_decay_research_report.md ADDED
@@ -0,0 +1,319 @@
# Dropout Decay in Expanding-Stream Language Model Training

Date: 2026-05-28

## Audience and Purpose

This report is written for an AI/ML engineer seeing the project for the first
time. It summarizes the research motivation, implementation setup, experimental
protocol, completed results, current evidence for the dropout formula, and the
remaining work needed before framing the result as a publishable paper.

The project studies dropout in a streaming-data regime. The central question is
whether a model can start with stronger regularization when the available stream
prefix is small, then reduce dropout as the stream grows, so that the model uses
more of its capacity without catastrophic overfitting.

## Codebase and Attribution

The implementation is derived from Andrej Karpathy's nanochat project and keeps
only the relevant core pieces:

- BPE-style text tokenization.
- A nanochat-style causal Transformer.
- Dynamic dropout control for attention, residual, MLP, and embedding dropout.
- MPS-only experiment execution.
- Streaming-style expanding-prefix training loops.

The original nanochat MIT copyright and permission notice are retained in
derived source files. The project documentation explicitly attributes the
foundation to Andrej Karpathy's nanochat.

## Initial Hypothesis and Correction

The original broad hypothesis was:

> Starting with very high dropout on a small initial dataset, then decaying
> dropout as more stream data arrives, lets a large model dynamically scale its
> effective capacity and avoid catastrophic overfitting.

The experiments rejected this version. A very high initial dropout such as
`0.8` was harmful. In early 8.39M-parameter streaming runs, static low dropout
beat the high-dropout decay schedule:

| Condition | 5M | 10M | 20M | 40M |
|---|---:|---:|---:|---:|
| High-dropout decay streaming | `6.9213` | `6.2689` | `5.4262` | `4.9090` |
| Static `0.1` dropout streaming | `5.6310` | `5.1018` | `4.8497` | `4.6743` |
| Static `0.8` dropout streaming | `6.9898` | `6.7637` | `6.4835` | `6.2390` |

The refined hypothesis is narrower and better supported:

> Prefix-aware dropout scheduling appears useful when the static dropout
> optimum changes with stream size. The schedule should start near the small
> prefix optimum and decay toward the large-prefix optimum, rather than using
> arbitrary high dropout.

## Experimental Setup

All training experiments run on MPS, the Apple Metal backend in PyTorch. The
local project instruction is strict: no CPU and no CUDA fallback for Torch
experiments.

The core streaming protocol is:

- Tokenizer vocabulary: `4096`.
- Block size: `128`.
- Batch size: `16`.
- Tokens sampled per training step: `2048` (batch size times block size,
  `16 * 128`).
- Stream prefixes: `250k`, `500k`, `1M`, `2M`, `4M` unique training tokens.
- Main schedule-validation stage length: `1000` steps per prefix.
- Validation tokens: `500k`.
- Seeds: generally `1, 2, 3` for full sweeps and validations.
- Static controls: fixed dropout values around the expected optimum.
- Dynamic condition: an anchor schedule with dropout set per stream prefix and
  log interpolation between prefix anchors.

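The anchor schedule in the last bullet can be sketched as follows. This is a minimal illustration, not the project's code: it assumes "log interpolation" means interpolating dropout linearly in the logarithm of the unique-token count, and the function name `interpolate_dropout` is hypothetical. The anchor values are the L8 formula path reported in the model-size validation table.

```python
import math

def interpolate_dropout(unique_tokens, anchors):
    """Interpolate dropout linearly in log(unique_tokens) between prefix anchors.

    `anchors` maps a prefix size (unique tokens) to the dropout set at that prefix.
    Outside the anchor range the nearest anchor value is held (an assumption).
    """
    points = sorted(anchors.items())
    if unique_tokens <= points[0][0]:
        return points[0][1]
    if unique_tokens >= points[-1][0]:
        return points[-1][1]
    for (n0, p0), (n1, p1) in zip(points, points[1:]):
        if n0 <= unique_tokens <= n1:
            # Fraction of the way from n0 to n1, measured on a log scale.
            t = math.log(unique_tokens / n0) / math.log(n1 / n0)
            return p0 + t * (p1 - p0)

# Anchors taken from the reported L8 formula path.
l8_anchors = {
    250_000: 0.252,
    500_000: 0.206,
    1_000_000: 0.129,
    2_000_000: 0.038,
    4_000_000: 0.020,
}

# Geometric midpoint between the first two prefixes (about 354k unique tokens).
midpoint = math.sqrt(250_000 * 500_000)
mid_dropout = interpolate_dropout(midpoint, l8_anchors)
print(round(mid_dropout, 4))
```

At the geometric midpoint of two anchors the log-scale fraction is one half, so the result is the average of the two anchor dropouts.
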
The important distinction is:

- **Unique prefix tokens**: how many distinct training tokens are currently
  available from the stream.
- **Sampled tokens**: how many token positions the optimizer has consumed
  through repeated random batches.
- **Update pressure**: repeated sampling relative to available prefix size,
  approximated by `cumulative_sampled_tokens / unique_tokens`.

When unique tokens are low and sampled tokens are high, the model sees the same
prefix repeatedly and overfitting pressure increases.

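As a worked example of these three quantities, the sketch below computes the update pressure at the end of each stage for the main protocol. It assumes the sampled-token counter accumulates across stages without resetting and is read at the end of each stage; the helper name `update_pressures` is hypothetical.

```python
BATCH_SIZE = 16
BLOCK_SIZE = 128
STAGE_STEPS = 1000
PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]

# Token positions consumed by one optimizer step.
tokens_per_step = BATCH_SIZE * BLOCK_SIZE

def update_pressures(prefixes, stage_steps, tokens_per_step):
    """Return cumulative_sampled_tokens / unique_tokens at the end of each stage."""
    result = []
    cumulative_sampled = 0
    for unique_tokens in prefixes:
        # Assumption: sampling accumulates across stages and never resets.
        cumulative_sampled += stage_steps * tokens_per_step
        result.append(cumulative_sampled / unique_tokens)
    return result

pressures = update_pressures(PREFIXES, STAGE_STEPS, tokens_per_step)
for unique_tokens, pressure in zip(PREFIXES, pressures):
    print(f"{unique_tokens:>9,d} unique tokens -> update pressure {pressure:.3f}")
```

Under this reading the pressure falls as the prefix grows, because each prefix doubling outpaces the fixed number of sampled tokens added per stage.
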
## Empirical Formula Under Test

The current formula is:

```text
p = clamp(0.02, 0.65,
          0.154 * log10(params / unique_tokens)
          + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
          - 0.210)
```

The terms represent:

- `params / unique_tokens`: capacity pressure. Larger models on smaller stream
  prefixes need more regularization.
- `cumulative_sampled_tokens / unique_tokens`: update pressure. More repeated
  training on the same prefix needs more regularization.
- `0.02`: empirical lower floor to avoid assuming exact zero dropout is always
  optimal.
- `0.65`: empirical upper guardrail; current successful schedules are far below
  this in the main validation runs.

The coefficients are empirical, not theoretical constants. They were fit from
observed static-dropout curves and then tested against interpolated model sizes,
update-pressure changes, coefficient ablations, and an architecture-shape
holdout.

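The formula translates directly into a few lines of Python. The sketch below is illustrative rather than the project's implementation: the function names are hypothetical, and it assumes the sampled-token count is cumulative across stages and evaluated at the end of each 1000-step stage with 2048 tokens per step.

```python
import math

def dropout_formula(params, unique_tokens, cumulative_sampled_tokens,
                    floor=0.02, ceiling=0.65):
    """Evaluate the empirical dropout formula and clamp it to [floor, ceiling]."""
    capacity_pressure = math.log10(params / unique_tokens)
    update_pressure = math.log10(cumulative_sampled_tokens / unique_tokens)
    raw = 0.154 * capacity_pressure + 0.249 * update_pressure - 0.210
    return min(ceiling, max(floor, raw))

def formula_path(params, prefixes, stage_steps=1000, tokens_per_step=2048):
    """Dropout value per stream prefix, one stage of training per prefix."""
    path = []
    sampled = 0
    for unique_tokens in prefixes:
        # Assumption: the counter accumulates across stages (no reset).
        sampled += stage_steps * tokens_per_step
        path.append(dropout_formula(params, unique_tokens, sampled))
    return path

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
l8_path = formula_path(8.39e6, PREFIXES)
print(" -> ".join(f"{p:.3f}" for p in l8_path))
```

Under these assumptions the 8.39M-parameter path agrees to three decimals with the L8 formula path listed in the model-size validation table, which suggests the cumulative, end-of-stage reading is the intended one.
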
## Static Dropout Screen

The first useful research result was that static dropout has a prefix-dependent
optimum. The optimum is not constant as stream data grows.

Key observations:

| Model | Params | Prefix | Best static dropout | Validation loss | Zero-dropout penalty |
|---|---:|---:|---:|---:|---:|
| L16 | 31.46M | 2M | `0.14` | `4.4270` | `+0.1982` |
| L12 | 17.37M | 2M | `0.14` | `4.5088` | `+0.0866` |
| L8 | 8.39M | 2M | `0.08` | `4.6232` | `+0.0266` |
| L8 | 8.39M | 4M | `0.0` | best | near zero |

This motivated a formula that tracks a moving optimum instead of comparing one
decay schedule to one arbitrary fixed dropout.

## Model-Size Formula Validation

The formula was tested across model sizes from 8.39M to 31.46M parameters. Each
run used 3 seeds and compared the formula schedule against static dropout
controls.

| Model | Params | Formula path | Formula final val | Best static final val | Paired final deltas (formula - best static) |
|---|---:|---|---:|---:|---:|
| L8 | 8.39M | `0.252 -> 0.206 -> 0.129 -> 0.038 -> 0.020` | `4.6094 +/- 0.0056` | `4.6242` | `-0.0102, -0.0160, -0.0182` |
| L10 | 12.31M | `0.278 -> 0.232 -> 0.154 -> 0.064 -> 0.020` | `4.5306 +/- 0.0094` | `4.5580` | `-0.0288, -0.0188, -0.0345` |
| L12 | 17.37M | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
| L14 | 23.70M | `0.322 -> 0.276 -> 0.198 -> 0.108 -> 0.020` | `4.4384 +/- 0.0087` | `4.4736` | `-0.0294, -0.0269, -0.0429` |
| L16 | 31.46M | `0.341 -> 0.294 -> 0.217 -> 0.127 -> 0.030` | `4.4059 +/- 0.0046` | `4.4459` | `-0.0411, -0.0512, -0.0279` |

The formula won all 15 paired final-loss comparisons across these five model
sizes, since every per-seed delta is negative.

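The win count can be recomputed from the table itself. The sketch below copies the per-seed deltas and the reported means from the rows above, counts negative deltas as formula wins, and compares the mean paired delta with the difference of the two reported means. It assumes the "Best static final val" column is a mean over the same seeds.

```python
# Per-seed deltas (formula minus best static), copied from the table.
paired_deltas = {
    "L8": [-0.0102, -0.0160, -0.0182],
    "L10": [-0.0288, -0.0188, -0.0345],
    "L12": [-0.0364, -0.0308, -0.0439],
    "L14": [-0.0294, -0.0269, -0.0429],
    "L16": [-0.0411, -0.0512, -0.0279],
}

# (formula final val, best static final val), copied from the table.
reported_means = {
    "L8": (4.6094, 4.6242),
    "L10": (4.5306, 4.5580),
    "L12": (4.4812, 4.5183),
    "L14": (4.4384, 4.4736),
    "L16": (4.4059, 4.4459),
}

total = sum(len(deltas) for deltas in paired_deltas.values())
wins = sum(d < 0 for deltas in paired_deltas.values() for d in deltas)
print(f"formula wins: {wins}/{total}")

mean_delta = {}
for model, deltas in paired_deltas.items():
    mean_delta[model] = sum(deltas) / len(deltas)
    formula_val, static_val = reported_means[model]
    diff_of_means = formula_val - static_val
    print(f"{model}: mean paired delta {mean_delta[model]:+.4f}, "
          f"difference of means {diff_of_means:+.4f}")
```

With the values as listed, four of the five rows agree to within about 0.0001. The L14 row differs by roughly 0.002, so that row may be worth rechecking against the raw logs.
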
## L16 Schedule Development

The L16 model was used to understand why schedule shape matters. An early
formula-like schedule that started too high was inferior on trajectory, even
though it beat some static controls at the final prefix. A moderate schedule
near `0.30` performed much better.

3-seed L16 confirmation:

| Condition | Final val | Final std | Mean trajectory val | Final gap |
|---|---:|---:|---:|---:|
| `hold_30_then_decay` | `4.4060` | `0.0118` | `4.8503` | `0.3530` |
| `mild_30_to_08` | `4.4075` | `0.0078` | `4.8504` | `0.3307` |
| `fitted_l16_static_law` | `4.4159` | `0.0042` | `4.9527` | `0.3144` |
| `static_dropout_0.14` | `4.4459` | `0.0128` | `4.9043` | `0.3205` |
| `static_dropout_0.30` | `4.4693` | `0.0081` | `4.8764` | `0.2327` |
| `static_dropout_0.02` | `4.5405` | `0.0061` | `5.1544` | `0.4747` |
| `static_dropout_0.0` | `4.5905` | `0.0192` | `5.2422` | `0.5464` |

This clarified that the winning schedule is not "high dropout, then decay." It
is "start near the small-prefix optimum, then decay as the optimum moves down."

## Update-Pressure Validation

Changing `stage_steps` changes how many sampled tokens are consumed per stream
prefix. The formula should increase dropout when repeated sampling pressure is
higher.

L12 update-pressure sweep:

| Stage steps | Formula path | Mean trajectory val | Formula final val | Best static final val | Paired final deltas (formula - best static) |
|---:|---|---:|---:|---:|---:|
| 500 | `0.226 -> 0.180 -> 0.102 -> 0.020 -> 0.020` | `5.1581` | `4.7138 +/- 0.0080` | `4.7321` | `-0.0152, -0.0147, -0.0249` |
| 1000 | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
| 2000 | `0.376 -> 0.330 -> 0.252 -> 0.162 -> 0.065` | `4.7841` | `4.3089 +/- 0.0116` | `4.3513` | `-0.0453, -0.0321, -0.0489` |

The formula won final loss in all three update-pressure regimes. At 2000
steps, it also won the mean trajectory, supporting the idea that repeated
sampling from the same prefix increases the appropriate dropout.

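The effect of `stage_steps` on the schedule can be reproduced with a short sweep. This is a sketch under stated assumptions: 2048 tokens per step, a sampled-token counter that accumulates across stages, and a hypothetical helper name `l12_path`.

```python
import math

L12_PARAMS = 17.37e6
PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
TOKENS_PER_STEP = 2048

def l12_path(stage_steps):
    """Formula dropout per prefix for the L12 model at a given stage length."""
    path = []
    sampled = 0
    for unique_tokens in PREFIXES:
        sampled += stage_steps * TOKENS_PER_STEP  # assumed cumulative counter
        raw = (0.154 * math.log10(L12_PARAMS / unique_tokens)
               + 0.249 * math.log10(sampled / unique_tokens)
               - 0.210)
        path.append(min(0.65, max(0.02, raw)))
    return path

sweep = {steps: l12_path(steps) for steps in (500, 1000, 2000)}
for steps, path in sweep.items():
    print(steps, " -> ".join(f"{p:.3f}" for p in path))
```

Under these assumptions the 500-step and 2000-step rows agree with the table to three decimals. The 1000-step row comes out within about 0.006 of the reported path, which appears to be rounded to two decimals.
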
## Sampled-Pressure Coefficient Ablation

The sampled-pressure coefficient was ablated on L12 while keeping model, stream
prefixes, and training budget fixed.

| Condition | Coefficient multiplier | Path | Mean trajectory val | Final val | Final std | Final gap |
|---|---:|---|---:|---:|---:|---:|
| `no_sample_pressure_l12` | 0x | `0.074 -> 0.027 -> 0.020 -> 0.020 -> 0.020` | `5.0282` | `4.5468` | `0.0011` | `0.3482` |
| `half_sample_pressure_l12` | 0.5x | `0.187 -> 0.141 -> 0.079 -> 0.020 -> 0.020` | `4.9260` | `4.5055` | `0.0046` | `0.3272` |
| `pressure_formula_floor02` | 1.0x | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812` | `0.0062` | `0.2825` |
| `high_sample_pressure_l12` | 1.5x | `0.415 -> 0.368 -> 0.275 -> 0.163 -> 0.041` | `4.9739` | `4.4959` | `0.0025` | `0.2418` |

The 1.0x coefficient was best on final validation. The 1.5x variant had the
smallest final gap but worse validation, showing that the objective is not
simply minimizing the train-validation gap. Too much dropout underfits.

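The ablated paths follow from scaling only the `0.249` coefficient. The sketch below shows one way to express that; the `multiplier` parameter and the function name `ablated_path` are hypothetical, and the same cumulative-sampling assumption (1000 steps per stage, 2048 tokens per step) applies.

```python
import math

L12_PARAMS = 17.37e6
PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
TOKENS_PER_STAGE = 1000 * 2048

def ablated_path(multiplier):
    """L12 formula path with the sampled-pressure coefficient scaled."""
    path = []
    sampled = 0
    for unique_tokens in PREFIXES:
        sampled += TOKENS_PER_STAGE  # assumed cumulative counter
        raw = (0.154 * math.log10(L12_PARAMS / unique_tokens)
               + multiplier * 0.249 * math.log10(sampled / unique_tokens)
               - 0.210)
        path.append(min(0.65, max(0.02, raw)))
    return path

ablation = {m: ablated_path(m) for m in (0.0, 0.5, 1.5)}
for m, path in ablation.items():
    print(f"{m}x", " -> ".join(f"{p:.3f}" for p in path))
```

With the multiplier at zero only the capacity term remains, which is why that path reaches the `0.02` floor by the third prefix.
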
## Architecture-Shape Holdout

A key question is whether parameter count alone is a reasonable capacity proxy.
To test this, a conventional 8-head deep/narrow model was run:

- Model: `18x8x256`.
- Parameters: 16.25M.
- FFN ratio: `4 * n_embd`, unchanged from the base architecture.
- Formula path from parameter count only:
  `0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020`.

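The reported parameter count can be approximated from the shape alone. The sketch below is a back-of-the-envelope estimate, not the project's counting code. It reads `18x8x256` as layers x heads x embedding width, and it assumes bias-free linear layers, no learned normalization parameters, a `4096`-token vocabulary, and separate (untied) input-embedding and output-head matrices. The function name is hypothetical.

```python
def approx_param_count(n_layer, n_embd, vocab_size=4096, ffn_mult=4):
    """Rough parameter count for a bias-free causal Transformer."""
    attention = 4 * n_embd * n_embd            # Q, K, V and output projections
    mlp = 2 * ffn_mult * n_embd * n_embd       # up and down projections
    embeddings = 2 * vocab_size * n_embd       # input embedding + untied output head (assumed)
    return n_layer * (attention + mlp) + embeddings

deep_narrow = approx_param_count(n_layer=18, n_embd=256)
print(f"18x8x256: {deep_narrow / 1e6:.2f}M parameters")
```

Under these assumptions the estimate is 16,252,928 parameters, consistent with the 16.25M listed above. The head count does not enter the estimate, because splitting the embedding width across heads leaves the projection sizes unchanged.
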
Results:

| Condition | Path | Mean trajectory val | Final val | Final std | Final gap |
|---|---|---:|---:|---:|---:|
| Formula | `0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020` | `4.9720` | `4.5286` | `0.0118` | `0.2418` |
| Static `0.02` | constant | `5.0730` | `4.5887` | `0.0067` | `0.2947` |
| Static `0.08` | constant | `4.9900` | `4.5607` | `0.0081` | `0.2447` |
| Static `0.14` | constant | `4.9633` | `4.5564` | `0.0127` | `0.2080` |
| Static `0.18` | constant | `4.9699` | `4.5710` | `0.0061` | `0.1950` |
| Static `0.20` | constant | `4.9799` | `4.5835` | `0.0199` | `0.1841` |
| Static `0.26` | constant | `5.0021` | `4.6096` | `0.0126` | `0.1602` |
| Static `0.30` | constant | `5.0341` | `4.6520` | `0.0024` | `0.1545` |

The best static setting was `0.14`. The formula beat it on all three paired
seeds at the final prefix:

```text
formula - best_static = -0.0270, -0.0317, -0.0248
```

This supports final-loss transfer across architecture shape. It is not a clean
trajectory win, because static `0.14` had a slightly better mean trajectory
(`4.9633` against `4.9720`). The safe claim is therefore final-loss transfer,
not universal trajectory dominance.

## Combined Evidence So Far

Across the completed formula tests:

- Model-size validation: 15/15 paired final-loss wins.
- Architecture-shape holdout: 3/3 paired final-loss wins.
- Combined completed paired final-loss comparisons: 18/18 formula wins.
- Update-pressure direction: supported.
- Sampled-pressure coefficient: supported on L12.
- High arbitrary initial dropout: rejected.

This is strong evidence for the refined hypothesis under the current
nanochat-style Transformer and expanding-prefix protocol.

## What the Results Do Not Yet Prove

The results are promising but should not be overstated.

The current evidence does not prove:

- The formula is universal across arbitrary datasets.
- Parameter count alone fully captures architecture capacity.
- The formula always wins integrated trajectory loss.
- The `0.02` floor is theoretically optimal.
- The sampled-pressure coefficient is optimal for every model size.

The current evidence does support:

- Static dropout optima move downward as stream prefix size grows.
- Larger models need more early dropout at small stream prefixes.
- Repeated sampling from the same prefix increases the useful dropout.
- A pressure-aware schedule can beat the best single static dropout on final
  validation loss.

## Publication Framing

The strongest safe paper claim is:

> In nanochat-style causal Transformers trained under expanding-prefix
> streaming, a pressure-aware dropout schedule improves final validation loss
> over fixed-dropout baselines across model sizes, update pressures, and one
> architecture-shape holdout.

The claim that should be avoided for now is:

> This formula universally predicts optimal dropout for all models and datasets.

## Remaining High-Value Experiments

The next experiments that would most strengthen a paper are:

1. **Width-heavy architecture holdout**:
   run a conventional `8x8x384` shape near the L12 parameter scale. This is the
   paired complement to the completed `18x8x256` deep/narrow holdout.

2. **Corpus/domain holdout**:
   freeze the formula and run on a different text distribution. This is the
   biggest missing generalization test.

3. **L8 and L16 sampled-pressure ablations**:
   repeat the `0x`, `0.5x`, `1.0x`, `1.5x` coefficient ablation outside L12.

4. **Oracle schedule comparison**:
   compare the formula against a stage-wise oracle chosen from measured static
   optima. The formula does not need to beat the oracle; it should approach it
   without using per-stage oracle knowledge.

5. **5-seed headline confirmation**:
   reserve 5-seed runs for the final paper table, not every exploratory sweep.

## Current Bottom Line

The hypothesis is holding up well after the refinement. The correct story is
not that dropout decay is inherently good. The correct story is that
dropout should track a measurable pressure regime created by model size,
available stream prefix size, and repeated sampling.

The completed evidence is already strong enough for a serious empirical paper
draft if framed carefully. The remaining work is about generalization and
claim scope, especially architecture-width transfer and corpus transfer.