+# Dropout Decay in Expanding-Stream Language Model Training
+Date: 2026-05-28
+## Audience and Purpose
+This report is written for an AI/ML engineer seeing the project for the first
+time. It summarizes the research motivation, implementation setup, experimental
+protocol, completed results, current evidence for the dropout formula, and the
+remaining work needed before framing the result as a publishable paper.
+
+The project studies dropout in a streaming-data regime. The central question is
+whether a model can start with stronger regularization when the available stream
+prefix is small, then reduce dropout as the stream grows, so that the model uses
+more of its capacity without catastrophic overfitting.
+## Codebase and Attribution
+The implementation is derived from Andrej Karpathy's nanochat project and keeps
+only the relevant core pieces:
+- BPE-style text tokenization.
+- A nanochat-style causal Transformer.
+- Dynamic dropout control for attention, residual, MLP, and embedding dropout.
+- MPS-only experiment execution.
+- Streaming-style expanding-prefix training loops.
+
+The original nanochat MIT copyright and permission notices are retained in
+derived source files. The project documentation explicitly attributes the
+foundation to Andrej Karpathy's nanochat.
+## Initial Hypothesis and Correction
+The original broad hypothesis was:
+> Starting with very high dropout on a small initial dataset, then decaying
+> dropout as more stream data arrives, lets a large model dynamically scale its
+> effective capacity and avoid catastrophic overfitting.
+
+The experiments rejected this version. A very high initial dropout such as
+`0.8` was harmful. In early 8.39M-parameter streaming runs, static `0.1` dropout
+reached lower loss than the high-dropout decay schedule at every reported
+checkpoint:
+| Condition | 5M | 10M | 20M | 40M |
+|---|---:|---:|---:|---:|
+| High-dropout decay streaming | `6.9213` | `6.2689` | `5.4262` | `4.9090` |
+| Static `0.1` dropout streaming | `5.6310` | `5.1018` | `4.8497` | `4.6743` |
+| Static `0.8` dropout streaming | `6.9898` | `6.7637` | `6.4835` | `6.2390` |
+
+The refined hypothesis is narrower and better supported:
+> Prefix-aware dropout scheduling appears useful when the static dropout
+> optimum changes with stream size. The schedule should start near the small
+> prefix optimum and decay toward the large-prefix optimum, rather than using
+> arbitrary high dropout.
+## Experimental Setup
+All training experiments run on the PyTorch MPS backend. The local project
+instruction is strict: Torch experiments must not fall back to CPU or CUDA.
+The core streaming protocol is:
+- Tokenizer vocabulary: `4096`.
+- Block size: `128`.
+- Batch size: `16`.
+- Tokens sampled per training step: `2048` (batch size `16` * block size `128`).
+- Stream prefixes: `250k`, `500k`, `1M`, `2M`, `4M` unique training tokens.
+- Main schedule-validation stage length: `1000` steps per prefix.
+- Validation tokens: `500k`.
+- Seeds: generally `1, 2, 3` for full sweeps and validations.
+- Static controls: fixed dropout values around the expected optimum.
+- Dynamic condition: an anchor schedule with dropout set per stream prefix and
+  log interpolation between prefix anchors.
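The anchor schedule in the last bullet can be sketched as follows. This is a minimal sketch under an assumption the text leaves open: dropout is interpolated linearly in `log10(unique_tokens)` between neighbouring anchors. The function name `interpolate_dropout` is hypothetical, and the anchor values are the L8 formula path reported later in this report, used here only as example data.

```python
import bisect
import math

# (unique_tokens, dropout) anchors, sorted by unique_tokens.
# Example data: the L8 formula path reported later in this report.
ANCHORS = [
    (250_000, 0.252),
    (500_000, 0.206),
    (1_000_000, 0.129),
    (2_000_000, 0.038),
    (4_000_000, 0.020),
]


def interpolate_dropout(unique_tokens, anchors):
    """Dropout at a given prefix size, linear in log10(unique_tokens).

    Assumption: interpolation happens on the log of the prefix size;
    values outside the anchor range are held at the nearest anchor.
    """
    sizes = [size for size, _ in anchors]
    if unique_tokens <= sizes[0]:
        return anchors[0][1]
    if unique_tokens >= sizes[-1]:
        return anchors[-1][1]
    hi = bisect.bisect_right(sizes, unique_tokens)
    (n0, p0), (n1, p1) = anchors[hi - 1], anchors[hi]
    t = (math.log10(unique_tokens) - math.log10(n0)) / (math.log10(n1) - math.log10(n0))
    return p0 + t * (p1 - p0)


# Halfway between the first two anchors in log space (their geometric mean).
halfway = interpolate_dropout(math.sqrt(250_000 * 500_000), ANCHORS)
print(f"{halfway:.3f}")
```

At an anchor the function returns the anchor value itself, and the geometric mean of two anchor sizes returns the average of their dropout values. Interpolating the dropout value itself on a log scale would be a different design and would give slightly different intermediate values.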
+
+Three quantities need to be kept distinct:
+- **Unique prefix tokens**: how many distinct training tokens are currently
+  available from the stream.
+- **Sampled tokens**: how many token positions the optimizer has consumed
+  through repeated random batches.
+- **Update pressure**: repeated sampling relative to available prefix size,
+  approximated by `cumulative_sampled_tokens / unique_tokens`.
+
+When unique tokens are low and sampled tokens are high, the model sees the same
+prefix repeatedly and overfitting pressure increases.
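As a minimal sketch of this bookkeeping, the snippet below tabulates update pressure for each stage of the protocol listed above. It assumes that sampled tokens accumulate across stages rather than resetting at each new prefix, and that pressure is read at the end of each stage; the text does not state either point explicitly. The names `PREFIXES`, `STAGE_STEPS`, and `update_pressure` are illustrative, not taken from the codebase.

```python
# Protocol values from the list above.
PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]  # unique tokens
STAGE_STEPS = 1000       # optimizer steps per prefix
TOKENS_PER_STEP = 2048   # batch size 16 * block size 128


def update_pressure(stage_index, stage_steps=STAGE_STEPS):
    """Cumulative sampled tokens divided by unique tokens at the end of a stage.

    Assumption: the sampled-token counter keeps running across stages.
    """
    cumulative_sampled = (stage_index + 1) * stage_steps * TOKENS_PER_STEP
    return cumulative_sampled / PREFIXES[stage_index]


pressures = [update_pressure(i) for i in range(len(PREFIXES))]
for unique, pressure in zip(PREFIXES, pressures):
    print(f"{unique:>9,d} unique tokens -> update pressure {pressure:.3f}")
```

Under these assumptions, pressure falls from about 8.2 at the `250k` prefix to about 2.6 at the `4M` prefix, because the prefix doubles at each stage while the sampled-token count grows only linearly.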
+## Empirical Formula Under Test
+The current formula is:
+```text
+p = clamp(0.02, 0.65,
+          0.154 * log10(params / unique_tokens)
+        + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
+        - 0.210)
+```
+The terms represent:
+- `params / unique_tokens`: capacity pressure. Larger models on smaller stream
+  prefixes need more regularization.
+- `cumulative_sampled_tokens / unique_tokens`: update pressure. More repeated
+  training on the same prefix needs more regularization.
+- `0.02`: empirical lower floor to avoid assuming exact zero dropout is always
+  optimal.
+- `0.65`: empirical upper guardrail; current successful schedules are far below
+  this in the main validation runs.
+
+The coefficients are empirical, not theoretical constants. They were fit from
+observed static-dropout curves and then tested against interpolated model sizes,
+update-pressure changes, coefficient ablations, and an architecture-shape
+holdout.
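The formula can be written as a short function. The sketch below is an illustration rather than the project's code: the names `formula_dropout` and `formula_path` are hypothetical, and `formula_path` assumes that sampled tokens accumulate across stages at `stage_steps * 2048` tokens per stage.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]  # unique tokens


def formula_dropout(params, unique_tokens, cumulative_sampled_tokens,
                    floor=0.02, ceiling=0.65):
    """Dropout from capacity pressure and update pressure, clamped to [floor, ceiling]."""
    capacity_pressure = math.log10(params / unique_tokens)
    update_pressure = math.log10(cumulative_sampled_tokens / unique_tokens)
    raw = 0.154 * capacity_pressure + 0.249 * update_pressure - 0.210
    return min(ceiling, max(floor, raw))


def formula_path(params, stage_steps=1000, tokens_per_step=2048):
    """Dropout at each stream prefix.

    Assumption: sampled tokens accumulate across stages and are counted
    at the end of each stage.
    """
    path = []
    for stage, unique in enumerate(PREFIXES, start=1):
        sampled = stage * stage_steps * tokens_per_step
        path.append(formula_dropout(params, unique, sampled))
    return path


l8_path = formula_path(8.39e6)    # L8 parameter count from this report
l16_path = formula_path(31.46e6)  # L16 parameter count from this report
print(" -> ".join(f"{p:.3f}" for p in l8_path))
```

With the L8 parameter count, this sketch reproduces to three decimals the L8 path reported under Model-Size Formula Validation, which suggests the cumulative-token reading is at least consistent with the reported schedules.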
+## Static Dropout Screen
+The first useful research result was that static dropout has a prefix-dependent
+optimum. The optimum is not constant as stream data grows.
+Key observations:
+| Model | Params | Prefix | Best static dropout | Validation loss | Zero-dropout penalty |
+|---|---:|---:|---:|---:|---:|
+| L16 | 31.46M | 2M | `0.14` | `4.4270` | `+0.1982` |
+| L12 | 17.37M | 2M | `0.14` | `4.5088` | `+0.0866` |
+| L8 | 8.39M | 2M | `0.08` | `4.6232` | `+0.0266` |
+| L8 | 8.39M | 4M | `0.0` | best | near zero |
+
+This motivated a formula that tracks the moving optimum, rather than a
+comparison of one decay schedule against one arbitrary fixed dropout.
+## Model-Size Formula Validation
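+The formula was tested across model sizes from 8.39M to 31.46M parameters. Each
+model was run with 3 seeds, and the formula schedule was compared against static
+dropout controls. The formula path column lists the scheduled dropout at each of
+the five stream prefixes, from `250k` to `4M`. Paired final deltas are formula
+minus best static, per seed, so negative values are formula wins.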
+| Model | Params | Formula path | Formula final val | Best static final val | Paired final deltas |
+|---|---:|---|---:|---:|---:|
+| L8 | 8.39M | `0.252 -> 0.206 -> 0.129 -> 0.038 -> 0.020` | `4.6094 +/- 0.0056` | `4.6242` | `-0.0102, -0.0160, -0.0182` |
+| L10 | 12.31M | `0.278 -> 0.232 -> 0.154 -> 0.064 -> 0.020` | `4.5306 +/- 0.0094` | `4.5580` | `-0.0288, -0.0188, -0.0345` |
+| L12 | 17.37M | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
+| L14 | 23.70M | `0.322 -> 0.276 -> 0.198 -> 0.108 -> 0.020` | `4.4384 +/- 0.0087` | `4.4736` | `-0.0294, -0.0269, -0.0429` |
+| L16 | 31.46M | `0.341 -> 0.294 -> 0.217 -> 0.127 -> 0.030` | `4.4059 +/- 0.0046` | `4.4459` | `-0.0411, -0.0512, -0.0279` |
+
+The formula won all 15 paired final-loss comparisons across these five model
+sizes.
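The win count can be checked directly from the paired deltas in the table. The snippet below is a small sketch that copies those deltas verbatim; the variable names are illustrative.

```python
from statistics import mean

# Paired final deltas (formula minus best static), copied from the table above.
PAIRED_FINAL_DELTAS = {
    "L8": [-0.0102, -0.0160, -0.0182],
    "L10": [-0.0288, -0.0188, -0.0345],
    "L12": [-0.0364, -0.0308, -0.0439],
    "L14": [-0.0294, -0.0269, -0.0429],
    "L16": [-0.0411, -0.0512, -0.0279],
}

all_deltas = [d for deltas in PAIRED_FINAL_DELTAS.values() for d in deltas]
wins = sum(d < 0 for d in all_deltas)  # a negative delta is a formula win

print(f"formula wins: {wins}/{len(all_deltas)}")
for model, deltas in PAIRED_FINAL_DELTAS.items():
    print(f"{model}: mean delta {mean(deltas):+.4f}")
```

With three seeds per model, this is a win count rather than a significance test. The narrowest single-seed margin in the table is `-0.0102`, on L8.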
+## L16 Schedule Development
+The L16 model was used to understand why schedule shape matters. An early
+formula-like schedule that started too high had a worse mean trajectory loss,
+even though it beat some static controls at the final prefix. A moderate
+schedule starting near `0.30` performed much better.
+3-seed L16 confirmation:
+| Condition | Final val | Final std | Mean trajectory val | Final gap |
+|---|---:|---:|---:|---:|
+| `hold_30_then_decay` | `4.4060` | `0.0118` | `4.8503` | `0.3530` |
+| `mild_30_to_08` | `4.4075` | `0.0078` | `4.8504` | `0.3307` |
+| `fitted_l16_static_law` | `4.4159` | `0.0042` | `4.9527` | `0.3144` |
+| `static_dropout_0.14` | `4.4459` | `0.0128` | `4.9043` | `0.3205` |
+| `static_dropout_0.30` | `4.4693` | `0.0081` | `4.8764` | `0.2327` |
+| `static_dropout_0.02` | `4.5405` | `0.0061` | `5.1544` | `0.4747` |
+| `static_dropout_0.0` | `4.5905` | `0.0192` | `5.2422` | `0.5464` |
+
+This clarified that the winning schedule is not "high dropout, then decay." It
+is "start near the small-prefix optimum, then decay as the optimum moves down."
+## Update-Pressure Validation
+Changing `stage_steps` changes how many sampled tokens are consumed per stream
+prefix. The formula should increase dropout when repeated sampling pressure is
+higher.
+L12 update-pressure sweep:
+| Stage steps | Formula path | Mean trajectory val | Formula final val | Best static final val | Paired final deltas |
+|---:|---|---:|---:|---:|---:|
+| 500 | `0.226 -> 0.180 -> 0.102 -> 0.020 -> 0.020` | `5.1581` | `4.7138 +/- 0.0080` | `4.7321` | `-0.0152, -0.0147, -0.0249` |
+| 1000 | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812 +/- 0.0062` | `4.5183` | `-0.0364, -0.0308, -0.0439` |
+| 2000 | `0.376 -> 0.330 -> 0.252 -> 0.162 -> 0.065` | `4.7841` | `4.3089 +/- 0.0116` | `4.3513` | `-0.0453, -0.0321, -0.0489` |
+
+The formula won final loss in all three update-pressure regimes, with all nine
+paired deltas negative. At 2000 steps, it also won the mean trajectory. Both
+results are consistent with the idea that repeated sampling from the same prefix
+raises the appropriate dropout.
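To see how `stage_steps` enters the schedule, the sketch below recomputes the L12 formula path for the 500-step and 2000-step rows. It assumes that sampled tokens accumulate across stages at `stage_steps * 2048` tokens per stage; `path_for_stage_steps` is a hypothetical helper, not a function from the codebase.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]  # unique tokens
L12_PARAMS = 17.37e6
TOKENS_PER_STEP = 2048


def path_for_stage_steps(stage_steps, params=L12_PARAMS):
    """Formula path when each prefix is trained for `stage_steps` steps.

    Assumption: sampled tokens accumulate across stages.
    """
    path = []
    for stage, unique in enumerate(PREFIXES, start=1):
        sampled = stage * stage_steps * TOKENS_PER_STEP
        raw = (0.154 * math.log10(params / unique)
               + 0.249 * math.log10(sampled / unique)
               - 0.210)
        path.append(min(0.65, max(0.02, raw)))  # clamp to [0.02, 0.65]
    return path


paths = {steps: path_for_stage_steps(steps) for steps in (500, 2000)}
for steps, path in paths.items():
    print(steps, " -> ".join(f"{p:.3f}" for p in path))
```

Each doubling of `stage_steps` adds `0.249 * log10(2)`, about `0.075`, to the unclamped dropout at every prefix. That is the spacing between the first-stage values in the table. For 1000 steps, the same sketch gives about `0.301 -> 0.255 -> 0.177 -> 0.087 -> 0.020`, slightly different from the listed `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020`. The report does not explain that difference.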
+## Sampled-Pressure Coefficient Ablation
+The sampled-pressure coefficient was ablated on L12 while keeping model, stream
+prefixes, and training budget fixed.
+| Condition | Coefficient multiplier | Path | Mean trajectory val | Final val | Final std | Final gap |
+|---|---:|---|---:|---:|---:|---:|
+| `no_sample_pressure_l12` | 0x | `0.074 -> 0.027 -> 0.020 -> 0.020 -> 0.020` | `5.0282` | `4.5468` | `0.0011` | `0.3482` |
+| `half_sample_pressure_l12` | 0.5x | `0.187 -> 0.141 -> 0.079 -> 0.020 -> 0.020` | `4.9260` | `4.5055` | `0.0046` | `0.3272` |
+| `pressure_formula_floor02` | 1.0x | `0.300 -> 0.260 -> 0.180 -> 0.090 -> 0.020` | `4.9226` | `4.4812` | `0.0062` | `0.2825` |
+| `high_sample_pressure_l12` | 1.5x | `0.415 -> 0.368 -> 0.275 -> 0.163 -> 0.041` | `4.9739` | `4.4959` | `0.0025` | `0.2418` |
+
+The 1.0x coefficient gave the best final validation loss and the best mean
+trajectory. The 1.5x variant had the smallest final gap (`0.2418`) but worse
+final validation (`4.4959` versus `4.4812`), showing that the objective is not
+simply to minimize the train-validation gap. Too much dropout underfits.
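The ablated paths follow from scaling the `0.249` sampled-pressure coefficient while leaving the other terms unchanged. The sketch below illustrates this for the `0x`, `0.5x`, and `1.5x` conditions. It assumes the 1000-step protocol with sampled tokens accumulating across stages, and `ablated_path` is a hypothetical name.

```python
import math

PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]  # unique tokens
L12_PARAMS = 17.37e6


def ablated_path(multiplier, params=L12_PARAMS, stage_steps=1000, tokens_per_step=2048):
    """Formula path with the sampled-pressure coefficient scaled by `multiplier`."""
    sample_coeff = 0.249 * multiplier
    path = []
    for stage, unique in enumerate(PREFIXES, start=1):
        sampled = stage * stage_steps * tokens_per_step  # assumed cumulative
        raw = (0.154 * math.log10(params / unique)
               + sample_coeff * math.log10(sampled / unique)
               - 0.210)
        path.append(min(0.65, max(0.02, raw)))  # clamp to [0.02, 0.65]
    return path


ablation = {m: ablated_path(m) for m in (0.0, 0.5, 1.5)}
for multiplier, path in ablation.items():
    print(f"{multiplier}x", " -> ".join(f"{p:.3f}" for p in path))
```

With the multiplier at `0`, only the capacity term remains, so the schedule reaches the `0.02` floor by the third prefix.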
+## Architecture-Shape Holdout
+A key question is whether parameter count alone is a reasonable capacity proxy.
+To test this, a conventional 8-head deep/narrow model was run:
+- Model: `18x8x256`.
+- Parameters: 16.25M.
+- FFN ratio: `4 * n_embd`, unchanged from the base architecture.
+- Formula path from parameter count only:
+  `0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020`.
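For readers who want to sanity-check the parameter count that feeds the formula, here is a rough estimate. It rests on several assumptions the report does not confirm: `18x8x256` is read as layers x heads x `n_embd`, the vocabulary is the `4096`-token tokenizer from the setup, input and output embeddings are untied, and there are no bias, normalization, or learned positional parameters. The function `estimate_params` is hypothetical.

```python
def estimate_params(n_layer, n_embd, vocab_size=4096, ffn_ratio=4):
    """Rough weight count for a bias-free decoder-only Transformer.

    Assumptions (not confirmed by the report): untied input and output
    embeddings, no bias or normalization parameters, and no learned
    positional embeddings. The head count does not change the total.
    """
    attention = 4 * n_embd * n_embd          # Q, K, V, and output projections
    mlp = 2 * ffn_ratio * n_embd * n_embd    # up and down projections
    embeddings = 2 * vocab_size * n_embd     # token embedding plus output head
    return n_layer * (attention + mlp) + embeddings


deep_narrow = estimate_params(n_layer=18, n_embd=256)
width_heavy = estimate_params(n_layer=8, n_embd=384)
print(f"18x8x256: {deep_narrow / 1e6:.2f}M")
print(f"8x8x384:  {width_heavy / 1e6:.2f}M")
```

Under these assumptions the deep/narrow estimate is 16,252,928 weights, which rounds to the reported 16.25M. The `8x8x384` shape proposed under Remaining High-Value Experiments comes out near 17.30M, close to L12's 17.37M.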
+Results:
+| Condition | Path | Mean trajectory val | Final val | Final std | Final gap |
+|---|---|---:|---:|---:|---:|
+| Formula | `0.297 -> 0.250 -> 0.173 -> 0.083 -> 0.020` | `4.9720` | `4.5286` | `0.0118` | `0.2418` |
+| Static `0.02` | constant | `5.0730` | `4.5887` | `0.0067` | `0.2947` |
+| Static `0.08` | constant | `4.9900` | `4.5607` | `0.0081` | `0.2447` |
+| Static `0.14` | constant | `4.9633` | `4.5564` | `0.0127` | `0.2080` |
+| Static `0.18` | constant | `4.9699` | `4.5710` | `0.0061` | `0.1950` |
+| Static `0.20` | constant | `4.9799` | `4.5835` | `0.0199` | `0.1841` |
+| Static `0.26` | constant | `5.0021` | `4.6096` | `0.0126` | `0.1602` |
+| Static `0.30` | constant | `5.0341` | `4.6520` | `0.0024` | `0.1545` |
+
+The best static control was `0.14`, with a final validation loss of `4.5564`.
+The formula beat it on the final loss of every paired seed:
+```text
+formula - best_static = -0.0270, -0.0317, -0.0248
+```
+This supports final-loss transfer across architecture shape. It is not a clean
+trajectory win: static `0.14` (`4.9633`) and static `0.18` (`4.9699`) both had
+slightly better mean trajectory than the formula (`4.9720`). The safe claim is
+therefore final-loss transfer, not universal trajectory dominance.
+## Combined Evidence So Far
+Across the completed formula tests:
+- Model-size validation: 15/15 paired final-loss wins.
+- Architecture-shape holdout: 3/3 paired final-loss wins.
+- Combined completed paired final-loss comparisons: 18/18 formula wins.
+- Update-pressure direction: supported.
+- Sampled-pressure coefficient: supported on L12.
+- High arbitrary initial dropout: rejected.
+
+Together, these results are strong evidence for the refined hypothesis within
+the current nanochat-style Transformer and expanding-prefix protocol.
+## What the Results Do Not Yet Prove
+The results are promising but should not be overstated.
+The current evidence does not prove:
+- The formula is universal across arbitrary datasets.
+- Parameter count alone fully captures architecture capacity.
+- The formula always wins integrated trajectory loss.
+- The `0.02` floor is theoretically optimal.
+- The sampled-pressure coefficient is optimal for every model size.
+The current evidence does support:
+- Static dropout optima move downward as stream prefix size grows.
+- Larger models need more early dropout at small stream prefixes.
+- Repeated sampling from the same prefix increases the useful dropout.
+- A pressure-aware schedule can beat the best single static dropout on final
+  validation loss.
+## Publication Framing
+The strongest safe paper claim is:
+> In nanochat-style causal Transformers trained under expanding-prefix
+> streaming, a pressure-aware dropout schedule improves final validation loss
+> over fixed-dropout baselines across model sizes, update pressures, and one
+> architecture-shape holdout.
+
+The claim that should be avoided for now is:
+> This formula universally predicts optimal dropout for all models and datasets.
+## Remaining High-Value Experiments
+The next experiments that would most strengthen a paper are:
+1. **Width-heavy architecture holdout**:
+   run a conventional `8x8x384` shape near the L12 parameter scale. This is the
+   paired complement to the completed `18x8x256` deep/narrow holdout.
+2. **Corpus/domain holdout**:
+   freeze the formula and run on a different text distribution. This is the
+   biggest missing generalization test.
+3. **L8 and L16 sampled-pressure ablations**:
+   repeat the `0x`, `0.5x`, `1.0x`, `1.5x` coefficient ablation outside L12.
+4. **Oracle schedule comparison**:
+   compare the formula against a stage-wise oracle chosen from measured static
+   optima. The formula does not need to beat the oracle; it should approach it
+   without using per-stage oracle knowledge.
+5. **5-seed headline confirmation**:
+   reserve 5-seed runs for the final paper table, not every exploratory sweep.
+## Current Bottom Line
+The hypothesis is holding up well after the refinement. The correct story is
+not that dropout decay is inherently good, but that dropout should track a
+measurable pressure regime created by model size, available stream prefix size,
+and repeated sampling.
+The completed evidence is already strong enough for a serious empirical paper
+draft if framed carefully. The remaining work is about generalization and
+claim scope, especially architecture-width transfer and corpus transfer.