Mandeep Sidhu committed on
Commit ·
0e508c7
1
Parent(s): cecc0f6
Add standalone research report
docs/dropout_decay_research_report.md
ADDED
# Dropout Decay in Expanding-Stream Language Model Training

Date: 2026-05-28

## Audience and Purpose

This report is written for an AI/ML engineer seeing the project for the first
time. It summarizes the research motivation, implementation setup, experimental
protocol, completed results, current evidence for the dropout formula, and the
remaining work needed before framing the result as a publishable paper.

The project studies dropout in a streaming-data regime. The central question is
whether a model can start with stronger regularization when the available stream
prefix is small, then reduce dropout as the stream grows, so that the model uses
more of its capacity without catastrophic overfitting.

## Codebase and Attribution

The implementation is derived from Andrej Karpathy's nanochat project and keeps
only the relevant core pieces:

- BPE-style text tokenization.
- A nanochat-style causal Transformer.
- Dynamic dropout control for attention, residual, MLP, and embedding dropout.
- MPS-only experiment execution.
- Streaming-style expanding-prefix training loops.

The original nanochat MIT copyright and permission notice are retained in
derived source files. The project documentation explicitly attributes the
foundation to Andrej Karpathy's nanochat.

## Initial Hypothesis and Correction

The original broad hypothesis was:

> Starting with very high dropout on a small initial dataset, then decaying
> dropout as more stream data arrives, lets a large model dynamically scale its
> effective capacity and avoid catastrophic overfitting.

The experiments rejected this version. A very high initial dropout such as
`0.8` was harmful. In early 8.39M-parameter streaming runs, static low dropout
beat the high-dropout decay schedule:

| Condition | 5M | 10M | 20M | 40M |
|---|---:|---:|---:|---:|
| High-dropout decay streaming | `6.9213` | `6.2689` | `5.4262` | `4.9090` |
| Static `0.1` dropout streaming | `5.6310` | `5.1018` | `4.8497` | `4.6743` |
| Static `0.8` dropout streaming | `6.9898` | `6.7637` | `6.4835` | `6.2390` |

The refined hypothesis is narrower and better supported:

> Prefix-aware dropout scheduling appears useful when the static dropout
> optimum changes with stream size. The schedule should start near the small
> prefix optimum and decay toward the large-prefix optimum, rather than using
> arbitrary high dropout.

## Experimental Setup

All training experiments use MPS. The local project instruction is strict: no
CPU and no CUDA fallback for Torch experiments.

The core streaming protocol is:

- Tokenizer vocabulary: `4096`.
- Block size: `128`.
- Batch size: `16`.
- Tokens sampled per training step: `2048` (`16 x 128`).
- Stream prefixes: `250k`, `500k`, `1M`, `2M`, `4M` unique training tokens.
- Main schedule-validation stage length: `1000` steps per prefix.
- Validation tokens: `500k`.
- Seeds: generally `1, 2, 3` for full sweeps and validations.
- Static controls: fixed dropout values around the expected optimum.
- Dynamic condition: an anchor schedule with dropout set per stream prefix and
  log interpolation between prefix anchors.

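The anchor schedule in the last bullet can be sketched as follows. This is a minimal sketch, not the project's code: it assumes that "log interpolation" means interpolating dropout linearly in `log10(unique_tokens)` between neighbouring anchors, and that values outside the anchor range are held at the nearest anchor. The function name `interpolate_dropout` and the `ANCHORS` list are hypothetical; the anchor values are the L8 formula path from the model-size validation table in this report.

```python
import math

# Hypothetical anchors: (unique prefix tokens, dropout).
# Values are the L8 formula path from the model-size validation table.
ANCHORS = [
    (250_000, 0.252),
    (500_000, 0.206),
    (1_000_000, 0.129),
    (2_000_000, 0.038),
    (4_000_000, 0.020),
]


def interpolate_dropout(unique_tokens, anchors=ANCHORS):
    """Interpolate dropout linearly in log10(unique_tokens) between anchors.

    Assumption: outside the anchor range the nearest anchor value is held.
    """
    if unique_tokens <= anchors[0][0]:
        return anchors[0][1]
    if unique_tokens >= anchors[-1][0]:
        return anchors[-1][1]
    for (n_lo, p_lo), (n_hi, p_hi) in zip(anchors, anchors[1:]):
        if n_lo <= unique_tokens <= n_hi:
            # Fraction of the way from n_lo to n_hi on a log scale.
            t = (math.log10(unique_tokens) - math.log10(n_lo)) / (
                math.log10(n_hi) - math.log10(n_lo)
            )
            return p_lo + t * (p_hi - p_lo)
    raise ValueError("anchors must be sorted by prefix size")


# The geometric midpoint of the first two anchors lands halfway between them.
midpoint = math.sqrt(250_000 * 500_000)
print(round(interpolate_dropout(midpoint), 3))  # 0.229
```

Interpolating on a log scale treats each doubling of the prefix as an equal step, which fits prefixes that grow geometrically from `250k` to `4M`.
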
The important distinction is:

- **Unique prefix tokens**: how many distinct training tokens are currently
  available from the stream.
- **Sampled tokens**: how many token positions the optimizer has consumed
  through repeated random batches.
- **Update pressure**: repeated sampling relative to available prefix size,
  approximated by `cumulative_sampled_tokens / unique_tokens`.

When unique tokens are low and sampled tokens are high, the model sees the same
prefix repeatedly and overfitting pressure increases.

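The update-pressure ratio can be computed directly from the protocol values above. The sketch below is illustrative only. It assumes that the sampled-token count accumulates across stream stages and is read at the end of each stage; the constant and function names are hypothetical.

```python
TOKENS_PER_STEP = 16 * 128  # batch size x block size = 2048
PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def update_pressure(stage_steps, prefixes=PREFIXES):
    """Return cumulative_sampled_tokens / unique_tokens at the end of each stage."""
    ratios = []
    cumulative_sampled = 0
    for unique_tokens in prefixes:
        # Each stage adds stage_steps optimizer steps on the current prefix.
        cumulative_sampled += stage_steps * TOKENS_PER_STEP
        ratios.append(cumulative_sampled / unique_tokens)
    return ratios


for prefix, ratio in zip(PREFIXES, update_pressure(stage_steps=1000)):
    print(f"{prefix:>9,d} unique tokens -> pressure {ratio:.3f}")
```

With `1000` steps per stage, the ratio falls from about `8.2` at the `250k` prefix to about `2.6` at `4M`, because the prefix doubles each stage while the sampled-token count grows only linearly.
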
## Empirical Formula Under Test

The current formula is:

```text
p = clamp(0.02, 0.65,
          0.154 * log10(params / unique_tokens)
          + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
          - 0.210)
```

The terms represent:

- `params / unique_tokens`: capacity pressure. Larger models on smaller stream
  prefixes need more regularization.
- `cumulative_sampled_tokens / unique_tokens`: update pressure. More repeated
  training on the same prefix needs more regularization.
- `0.02`: empirical lower floor to avoid assuming exact zero dropout is always
  optimal.
- `0.65`: empirical upper guardrail; current successful schedules are far below
  this in the main validation runs.

The coefficients are empirical, not theoretical constants. They were fit from
observed static-dropout curves and then tested against interpolated model sizes,
update-pressure changes, coefficient ablations, and an architecture-shape
holdout.

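A minimal Python sketch of the formula follows. It is an illustration, not the project's implementation: the function names are hypothetical, and it assumes `cumulative_sampled_tokens` accumulates across stages and is evaluated at the end of each `1000`-step stage at `2048` tokens per step. Under those assumptions it reproduces, to three decimals, the L8 formula path listed in the model-size validation table.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def formula_dropout(params, unique_tokens, cumulative_sampled_tokens,
                    floor=0.02, ceiling=0.65):
    """Evaluate the empirical dropout formula and clamp it to [floor, ceiling]."""
    raw = (
        0.154 * math.log10(params / unique_tokens)                        # capacity pressure
        + 0.249 * math.log10(cumulative_sampled_tokens / unique_tokens)   # update pressure
        - 0.210
    )
    return min(ceiling, max(floor, raw))


def formula_path(params, prefixes=PREFIXES, stage_steps=1000, tokens_per_step=2048):
    """Dropout per stream stage, assuming sampled tokens accumulate across stages."""
    path = []
    sampled = 0
    for unique_tokens in prefixes:
        sampled += stage_steps * tokens_per_step
        path.append(formula_dropout(params, unique_tokens, sampled))
    return path


l8_path = formula_path(8.39e6)
print([round(p, 3) for p in l8_path])  # [0.252, 0.206, 0.129, 0.038, 0.02]
```

The last stage shows the floor in action: the unclamped value for L8 at the `4M` prefix is negative, so the schedule returns `0.02`.
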
## Static Dropout Screen

The first useful research result was that static dropout has a prefix-dependent
optimum. The optimum is not constant as stream data grows.

Key observations:

| Model | Params | Prefix | Best static dropout | Validation loss | Zero-dropout penalty |
|---|---:|---:|---:|---:|---:|
| L16 | 31.46M | 2M | `0.14` | `4.4270` | `+0.1982` |
| L12 | 17.37M | 2M | `0.14` | `4.5088` | `+0.0866` |
| L8 | 8.39M | 2M | `0.08` | `4.6232` | `+0.0266` |
| L8 | 8.39M | 4M | `0.0` | best | near zero |

This motivated a formula that tracks a moving optimum instead of comparing one
decay schedule to one arbitrary fixed dropout.

## Model-Size Formula Validation

The formula was tested across model sizes from 8.39M to 31.46M parameters. Each
run used 3 seeds and compared the formula schedule against static dropout
controls.

| Model | Params | Formula path | Formula final val | Best static final val | Paired final deltas |
|---|---:|---|---:|---:|---:|
| L8 | 8.39M | `0.252 -> 0.206 -> 0.129 -> 0.038 -> 0.020` | `4.6094 +/- 0.0056` | `4.6242` | `-0.0102, -0.0160, -0.0182` |
| L10 | 12.31M | `0.278 -> 0.232 -> 0.154 -> 0.064 -> 0.020` | `4.5306 +/- 0.0094` | `4.5580` | `-0.0288, -0.0188, -0.0345` |
| L12 | 17.37M | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
| L14 | 23.70M | `0.322 -> 0.276 -> 0.198 -> 0.108 -> 0.020` | `4.4384 +/- 0.0087` | `4.4736` | `-0.0294, -0.0269, -0.0429` |
| L16 | 31.46M | `0.341 -> 0.294 -> 0.217 -> 0.127 -> 0.030` | `4.4059 +/- 0.0046` | `4.4459` | `-0.0411, -0.0512, -0.0279` |

The formula won all 15 paired final-loss comparisons across these five model
sizes.

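The paired comparison can be tallied directly from the table. The snippet below only re-counts the deltas listed above. It assumes each delta is the formula's final validation loss minus the best static control's loss for the same seed, so a negative value is a formula win. The names `PAIRED_DELTAS` and `summarize` are hypothetical.

```python
from statistics import mean

# Paired final deltas copied from the model-size validation table.
PAIRED_DELTAS = {
    "L8": [-0.0102, -0.0160, -0.0182],
    "L10": [-0.0288, -0.0188, -0.0345],
    "L12": [-0.0364, -0.0308, -0.0439],
    "L14": [-0.0294, -0.0269, -0.0429],
    "L16": [-0.0411, -0.0512, -0.0279],
}


def summarize(deltas_by_model):
    """Count formula wins (negative deltas) and average the delta per model."""
    summary = {}
    for model, deltas in deltas_by_model.items():
        wins = sum(1 for d in deltas if d < 0)
        summary[model] = (wins, len(deltas), mean(deltas))
    return summary


summary = summarize(PAIRED_DELTAS)
for model, (wins, total, avg) in summary.items():
    print(f"{model}: {wins}/{total} wins, mean delta {avg:+.4f}")

total_wins = sum(wins for wins, _, _ in summary.values())
total_pairs = sum(total for _, total, _ in summary.values())
print(f"total: {total_wins}/{total_pairs}")
```
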
## L16 Schedule Development

The L16 model was used to understand why schedule shape matters. An early
formula-like schedule that started too high was inferior on trajectory, even
though it beat some static controls at the final prefix. A moderate schedule
near `0.30` performed much better.

3-seed L16 confirmation:

| Condition | Final val | Final std | Mean trajectory val | Final gap |
|---|---:|---:|---:|---:|
| `hold_30_then_decay` | `4.4060` | `0.0118` | `4.8503` | `0.3530` |
| `mild_30_to_08` | `4.4075` | `0.0078` | `4.8504` | `0.3307` |
| `fitted_l16_static_law` | `4.4159` | `0.0042` | `4.9527` | `0.3144` |
| `static_dropout_0.14` | `4.4459` | `0.0128` | `4.9043` | `0.3205` |
| `static_dropout_0.30` | `4.4693` | `0.0081` | `4.8764` | `0.2327` |
| `static_dropout_0.02` | `4.5405` | `0.0061` | `5.1544` | `0.4747` |
| `static_dropout_0.0` | `4.5905` | `0.0192` | `5.2422` | `0.5464` |

This clarified that the winning schedule is not "high dropout, then decay." It
is "start near the small-prefix optimum, then decay as the optimum moves down."

## Update-Pressure Validation

Changing `stage_steps` changes how many sampled tokens are consumed per stream
prefix. The formula should increase dropout when repeated sampling pressure is
higher.

L12 update-pressure sweep:

| Stage steps | Formula path | Mean trajectory val | Formula final val | Best static final val | Paired final deltas |
|---:|---|---:|---:|---:|---:|
| 500 | `0.226 -> 0.180 -> 0.102 -> 0.020 -> 0.020` | `5.1581` | `4.7138 +/- 0.0080` | `4.7321` | `-0.0152, -0.0147, -0.0249` |
| 1000 | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
| 2000 | `0.376 -> 0.330 -> 0.252 -> 0.162 -> 0.065` | `4.7841` | `4.3089 +/- 0.0116` | `4.3513` | `-0.0453, -0.0321, -0.0489` |

The formula won final loss in all three update-pressure regimes. At 2000
steps, it also won the mean trajectory, supporting the idea that repeated
sampling from the same prefix increases the appropriate dropout.

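The spacing between the three formula paths follows from the formula itself. As a worked check, assume the cumulative sampled-token count scales in proportion to `stage_steps` and that neither clamp is active. Doubling the stage length then shifts every dropout value by a constant, where $S$ stands for `cumulative_sampled_tokens` and $U$ for `unique_tokens`:

```latex
\Delta p
  = 0.249 \log_{10}\!\left(\frac{2S}{U}\right) - 0.249 \log_{10}\!\left(\frac{S}{U}\right)
  = 0.249 \log_{10} 2
  \approx 0.075
```

This agrees with the first-stage values in the table, `0.226 -> 0.300 -> 0.376`, to within about `0.001`. The shift disappears wherever the `0.02` floor applies, as in the last two stages of the 500-step path.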
## Sampled-Pressure Coefficient Ablation

The sampled-pressure coefficient was ablated on L12 while keeping model, stream
prefixes, and training budget fixed.

| Condition | Coefficient multiplier | Path | Mean trajectory val | Final val | Final std | Final gap |
|---|---:|---|---:|---:|---:|---:|
| `no_sample_pressure_l12` | 0x | `0.074 -> 0.027 -> 0.020 -> 0.020 -> 0.020` | `5.0282` | `4.5468` | `0.0011` | `0.3482` |
| `half_sample_pressure_l12` | 0.5x | `0.187 -> 0.141 -> 0.079 -> 0.020 -> 0.020` | `4.9260` | `4.5055` | `0.0046` | `0.3272` |
| `pressure_formula_floor02` | 1.0x | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812` | `0.0062` | `0.2825` |
| `high_sample_pressure_l12` | 1.5x | `0.415 -> 0.368 -> 0.275 -> 0.163 -> 0.041` | `4.9739` | `4.4959` | `0.0025` | `0.2418` |

The 1.0x coefficient was best on final validation. The 1.5x variant had the
smallest final gap but worse validation, showing that the objective is not
simply minimizing the train-validation gap. Too much dropout underfits.

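The ablation can be expressed as a multiplier on the `0.249` sampled-pressure coefficient. The following is a hedged sketch with hypothetical function and argument names. It assumes sampled tokens accumulate across `1000`-step stages at `2048` tokens per step; with the L12 parameter count it gives first-stage values within `0.001` of the `0x`, `0.5x`, and `1.5x` rows above.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
TOKENS_PER_STEP = 2048


def ablated_path(params, multiplier, stage_steps=1000):
    """Formula path with the sampled-pressure coefficient scaled by `multiplier`."""
    path = []
    sampled = 0
    for unique_tokens in PREFIXES:
        sampled += stage_steps * TOKENS_PER_STEP
        raw = (
            0.154 * math.log10(params / unique_tokens)
            + multiplier * 0.249 * math.log10(sampled / unique_tokens)
            - 0.210
        )
        # Same floor and ceiling as the unablated formula.
        path.append(min(0.65, max(0.02, raw)))
    return path


for multiplier in (0.0, 0.5, 1.5):
    rounded = [round(p, 3) for p in ablated_path(17.37e6, multiplier)]
    print(f"{multiplier}x: {rounded}")
```
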
## Architecture-Shape Holdout

A key question is whether parameter count alone is a reasonable capacity proxy.
To test this, a conventional 8-head deep/narrow model was run:

- Model: `18x8x256`.
- Parameters: 16.25M.
- FFN ratio: `4 * n_embd`, unchanged from the base architecture.
- Formula path from parameter count only:
  `0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020`.

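For orientation, the 16.25M figure is consistent with a simple parameter count. The sketch below is an estimate under stated assumptions, not the project's counting code. It reads `18x8x256` as layers x heads x embedding width, and assumes bias-free linear layers, untied input and output embeddings, a vocabulary of `4096`, and no learned positional or normalization parameters. The function name `approx_params` is hypothetical.

```python
def approx_params(n_layer, n_embd, vocab_size=4096, ffn_mult=4, tied_embeddings=False):
    """Approximate parameter count for a bias-free decoder-only Transformer."""
    attention = 4 * n_embd * n_embd       # query, key, value, and output projections
    mlp = 2 * ffn_mult * n_embd * n_embd  # up and down projections, hidden = ffn_mult * n_embd
    embedding_tables = 1 if tied_embeddings else 2
    embeddings = embedding_tables * vocab_size * n_embd
    return n_layer * (attention + mlp) + embeddings


# The head count does not change the total when the head dimension is n_embd / n_head.
print(f"{approx_params(18, 256) / 1e6:.2f}M")  # 16.25M
```

Under the same assumptions, the `8x8x384` shape listed under the remaining experiments comes to about 17.30M parameters, close to the L12 scale of 17.37M.
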
Results:

| Condition | Path | Mean trajectory val | Final val | Final std | Final gap |
|---|---|---:|---:|---:|---:|
| Formula | `0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020` | `4.9720` | `4.5286` | `0.0118` | `0.2418` |
| Static `0.02` | constant | `5.0730` | `4.5887` | `0.0067` | `0.2947` |
| Static `0.08` | constant | `4.9900` | `4.5607` | `0.0081` | `0.2447` |
| Static `0.14` | constant | `4.9633` | `4.5564` | `0.0127` | `0.2080` |
| Static `0.18` | constant | `4.9699` | `4.5710` | `0.0061` | `0.1950` |
| Static `0.20` | constant | `4.9799` | `4.5835` | `0.0199` | `0.1841` |
| Static `0.26` | constant | `5.0021` | `4.6096` | `0.0126` | `0.1602` |
| Static `0.30` | constant | `5.0341` | `4.6520` | `0.0024` | `0.1545` |

The best static dropout was `0.14`. The formula beat it on final validation
loss for every paired seed:

```text
formula - best_static = -0.0270, -0.0317, -0.0248
```

This supports final-loss transfer across architecture shape. It is not a clean
trajectory win because static `0.14` had a slightly better mean trajectory. The
safe claim is therefore final-loss transfer, not universal trajectory
dominance.

## Combined Evidence So Far

Across the completed formula tests:

- Model-size validation: 15/15 paired final-loss wins.
- Architecture-shape holdout: 3/3 paired final-loss wins.
- Combined completed paired final-loss comparisons: 18/18 formula wins.
- Update-pressure direction: supported.
- Sampled-pressure coefficient: supported on L12.
- High arbitrary initial dropout: rejected.

This is strong evidence for the refined hypothesis under the current
nanochat-style Transformer and expanding-prefix protocol.

## What the Results Do Not Yet Prove

The results are promising but should not be overstated.

The current evidence does not prove:

- The formula is universal across arbitrary datasets.
- Parameter count alone fully captures architecture capacity.
- The formula always wins integrated trajectory loss.
- The `0.02` floor is theoretically optimal.
- The sampled-pressure coefficient is optimal for every model size.

The current evidence does support:

- Static dropout optima move downward as stream prefix size grows.
- Larger models need more early dropout at small stream prefixes.
- Repeated sampling from the same prefix increases the useful dropout.
- A pressure-aware schedule can beat the best single static dropout on final
  validation loss.

## Publication Framing

The strongest safe paper claim is:

> In nanochat-style causal Transformers trained under expanding-prefix
> streaming, a pressure-aware dropout schedule improves final validation loss
> over fixed-dropout baselines across model sizes, update pressures, and one
> architecture-shape holdout.

The claim that should be avoided for now is:

> This formula universally predicts optimal dropout for all models and datasets.

## Remaining High-Value Experiments

The next experiments that would most strengthen a paper are:

1. **Width-heavy architecture holdout**:
   run a conventional `8x8x384` shape near the L12 parameter scale. This is the
   paired complement to the completed `18x8x256` deep/narrow holdout.

2. **Corpus/domain holdout**:
   freeze the formula and run on a different text distribution. This is the
   biggest missing generalization test.

3. **L8 and L16 sampled-pressure ablations**:
   repeat the `0x`, `0.5x`, `1.0x`, `1.5x` coefficient ablation outside L12.

4. **Oracle schedule comparison**:
   compare the formula against a stage-wise oracle chosen from measured static
   optima. The formula does not need to beat the oracle; it should approach it
   without using per-stage oracle knowledge.

5. **5-seed headline confirmation**:
   reserve 5-seed runs for the final paper table, not every exploratory sweep.

## Current Bottom Line

The hypothesis is holding up well after the refinement. The correct story is
not that dropout decay is inherently good. The correct story is that
dropout should track a measurable pressure regime created by model size,
available stream prefix size, and repeated sampling.

The completed evidence is already strong enough for a serious empirical paper
draft if framed carefully. The remaining work is about generalization and
claim scope, especially architecture-width transfer and corpus transfer.