| # Formula and Coefficient Methodology |
|
|
| Date created: 2026-05-30 |
|
|
| This document explains how we use the dropout-pressure formula, how its |
| coefficients are derived from experiments, and how the resulting formula is |
| tested as a streaming dropout schedule. |
|
|
| ## Purpose |
|
|
| The goal is not to find one universal dropout value. |
|
|
| The goal is to learn a rule that maps the current training pressure to a useful |
| dropout rate: |
|
|
| ```text |
| model size + available unique data + cumulative sampled training |
|                                 | |
|                                 v |
|                        recommended dropout |
| ``` |
|
|
| In a streaming setting, available data and cumulative training change over time, |
| so the formula produces a sequence of dropout values rather than one fixed |
| dropout. |
|
|
| ## Formula Family |
|
|
| The current leading formula is the interaction pressure law: |
|
|
| ```text |
| p_t = clamp(p_min, p_max, |
|     A * log10(P / U_t) |
|     + B * log10(C_t / U_t) |
|     + D * log10(P / U_t) * log10(C_t / U_t) |
|     + C0) |
| ``` |
|
|
| The first-order ablation is: |
|
|
| ```text |
| p_t = clamp(p_min, p_max, |
|     A * log10(P / U_t) |
|     + B * log10(C_t / U_t) |
|     + C0) |
| ``` |
|
|
| Where: |
|
|
| | Symbol | Meaning | |
| |---|---| |
| | `P` | model parameter count | |
| | `U_t` | unique tokens available at stage `t` | |
| | `C_t` | cumulative sampled training tokens consumed by stage `t` | |
| | `p_t` | active dropout rate at stage `t` | |
| | `A` | coefficient for model/data pressure | |
| | `B` | coefficient for sampled-token pressure | |
| | `D` | interaction coefficient | |
| | `C0` | regime baseline offset | |
|
|
| The two pressure variables are: |
|
|
| ```text |
| x_t = log10(P / U_t) |
| y_t = log10(C_t / U_t) |
| ``` |
|
|
| So the interaction formula can be written compactly as: |
|
|
| ```text |
| p_t = clamp(p_min, p_max, A*x_t + B*y_t + D*x_t*y_t + C0) |
| ``` |
|
|
| ## Clamp |
|
|
| `clamp` keeps the formula output inside a valid dropout range: |
|
|
| ```text |
| clamp(p_min, p_max, z) = max(p_min, min(p_max, z)) |
| ``` |
|
|
| In the current schedule-generation experiments: |
|
|
| ```text |
| p_min = 0.02 |
| p_max = 0.65 |
| ``` |
|
|
| Static sweeps may still test `dropout=0.0` as an ablation. The clamp is mainly |
| used when turning fitted coefficients into a deployment/training schedule, so a |
| bad extrapolation cannot produce negative dropout or an unusably large dropout. |
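
As a minimal sketch, the clamp can be written directly from the definition above. The argument order `(p_min, p_max, z)` follows this document; the raw values in the loop are illustrative only, not outputs of any fitted formula.

```python
def clamp(p_min: float, p_max: float, z: float) -> float:
    """Clip z into [p_min, p_max], using the argument order from this document."""
    return max(p_min, min(p_max, z))


# Limits used in the current schedule-generation experiments.
P_MIN, P_MAX = 0.02, 0.65

# Illustrative raw formula outputs, including two bad extrapolations.
for raw in (-0.10, 0.12, 0.90):
    print(f"raw={raw:+.2f} -> dropout={clamp(P_MIN, P_MAX, raw):.2f}")
```

A negative extrapolation is lifted to `p_min`, an in-range value passes through unchanged, and an oversized value is capped at `p_max`.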
|
|
| ## Regime-Specific Coefficients |
|
|
| The coefficients are not assumed to be universal constants. |
|
|
| A regime is defined by: |
|
|
| ```text |
| architecture family |
| + tokenizer |
| + corpus family |
| + optimizer and learning-rate protocol |
| + dropout placement and semantics |
| + streaming protocol |
| + evaluation distribution |
| ``` |
|
|
| Inside one regime, the formula inputs `P`, `U_t`, and `C_t` should explain how |
| dropout changes. If the regime changes, the coefficients may need to be |
| refitted. |
|
|
| Current TinyStories-regime coefficients: |
|
|
| ```text |
| A = -0.089261 |
| B = -0.129754 |
| D = 0.255069 |
| C0 = 0.081525 |
| ``` |
|
|
| TinyStories-regime formula with the fitted coefficients substituted: |
|
|
| ```text |
| p_t = clamp(p_min, p_max, |
|     -0.089261 * log10(P / U_t) |
|     - 0.129754 * log10(C_t / U_t) |
|     + 0.255069 * log10(P / U_t) * log10(C_t / U_t) |
|     + 0.081525) |
| ``` |
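
To make the substituted formula concrete, the sketch below evaluates it at one pressure point. The function name `tinystories_dropout` is hypothetical; the `(P, U, C)` values are those of the L12 example cell described later in this document, and the clamp limits are the ones from the Clamp section.

```python
import math

# TinyStories-regime coefficients quoted above.
A, B, D, C0 = -0.089261, -0.129754, 0.255069, 0.081525
P_MIN, P_MAX = 0.02, 0.65


def tinystories_dropout(P: float, U: float, C: float) -> float:
    """Evaluate the interaction pressure law and clamp the result."""
    x = math.log10(P / U)  # model/data pressure
    y = math.log10(C / U)  # sampled-token pressure
    raw = A * x + B * y + D * x * y + C0
    return max(P_MIN, min(P_MAX, raw))


p = tinystories_dropout(P=17_367_040, U=1_000_000, C=10_240_000)
print(f"recommended dropout: {p:.3f}")
```

For this point the formula lands at roughly 0.16, well inside the clamp range, so the clamp has no effect here.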
|
|
| The research claim we are testing is: |
|
|
| ```text |
| the pressure-law structure transfers across regimes; |
| the coefficient values are calibrated per regime. |
| ``` |
|
|
| ## How Coefficients Are Derived |
|
|
| Coefficient fitting starts from static dropout sweeps. |
|
|
| A calibration cell is one fixed experimental setting whose best dropout we want |
| the formula to explain. |
|
|
| One cell is: |
|
|
| ```text |
| one model architecture and parameter count P |
| + one unique-token prefix U |
| + one sampled-token training budget C |
| + one validation setup |
| + a sweep over static dropout rates |
| ``` |
|
|
| Example cell: |
|
|
| ```text |
| model: L12_H8_D320 |
| parameters P: 17,367,040 |
| unique tokens U: 1,000,000 |
| sampled training tokens C: 10,240,000 |
| dropout rates tested: 0.00, 0.04, 0.08, 0.12, 0.18, 0.26 |
| ``` |
|
|
| The dropout sweep for that cell might look like: |
|
|
| ```text |
| 0.00 -> validation loss 3.1074 |
| 0.04 -> validation loss 2.9188 |
| 0.08 -> validation loss 2.8721 |
| 0.12 -> validation loss 2.8454 best |
| 0.18 -> validation loss 2.8623 |
| 0.26 -> validation loss 2.9006 |
| ``` |
|
|
| That cell contributes one supervised training row for the coefficient fit: |
|
|
| ```text |
| input: P, U, C |
| target: p_star ~= 0.12 |
| ``` |
|
|
| So "cell" means one row in the coefficient-fitting dataset. |
|
|
| For each calibration cell, we choose: |
|
|
| ```text |
| model P |
| unique-token prefix U |
| sampled training-token budget C |
| dropout grid |
| ``` |
|
|
| Then we train/evaluate the model at several fixed dropout rates and record the |
| validation loss curve: |
|
|
| ```text |
| dropout -> validation loss |
| ``` |
|
|
| Example shape: |
|
|
| ```text |
| 0.00 -> high loss |
| 0.04 -> lower loss |
| 0.08 -> lower loss |
| 0.12 -> best loss |
| 0.18 -> higher loss |
| 0.26 -> higher loss |
| ``` |
|
|
| This gives one target value for the cell: |
|
|
| ```text |
| p_star = observed useful static dropout for (P, U, C) |
| ``` |
|
|
| ### Target Extraction |
|
|
| The fitting script supports two target choices: |
|
|
| | Target | Meaning | |
| |---|---| |
| | grid best | dropout rate with the lowest observed validation loss | |
| | quadratic optimum | local parabolic minimum around the best grid point | |
|
|
| The quadratic target is preferred when the curve is bracketed: |
|
|
| ```text |
| left dropout has higher loss |
| middle dropout is best |
| right dropout has higher loss |
| ``` |
|
|
| If the best dropout is at the edge of the tested grid, the optimum is marked as |
| a boundary optimum. Boundary cells are useful but weaker evidence because the |
| true optimum may lie outside the tested rates. |
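
The two target choices can be sketched as follows. This is an assumption about how a local parabolic minimum could be computed (a three-point fit with `np.polyfit`), not a description of the fitting script; the function name `extract_target` and the returned labels are hypothetical. The sweep values are the example cell from earlier in this document.

```python
import numpy as np


def extract_target(dropouts, losses):
    """Return (p_star, kind), where kind is 'quadratic', 'boundary', or 'grid_best'."""
    d = np.asarray(dropouts, dtype=float)
    l = np.asarray(losses, dtype=float)
    i = int(np.argmin(l))
    # Boundary optimum: the best grid point has no neighbor on one side.
    if i == 0 or i == len(d) - 1:
        return float(d[i]), "boundary"
    # Bracketed: fit a parabola through the best point and its two neighbors.
    a, b, _ = np.polyfit(d[i - 1:i + 2], l[i - 1:i + 2], 2)
    if a <= 0:
        # Degenerate (flat) neighborhood: fall back to the grid best.
        return float(d[i]), "grid_best"
    return float(-b / (2.0 * a)), "quadratic"


dropouts = [0.00, 0.04, 0.08, 0.12, 0.18, 0.26]
losses = [3.1074, 2.9188, 2.8721, 2.8454, 2.8623, 2.9006]
p_star, kind = extract_target(dropouts, losses)
print(kind, round(p_star, 4))
```

For this sweep the parabola's vertex sits slightly to the right of the grid best `0.12`, because the loss rises more slowly toward `0.18` than toward `0.08`.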
|
|
| ### Feature Construction |
|
|
| For each cell, compute: |
|
|
| ```text |
| x = log10(P / U) |
| y = log10(C / U) |
| xy = x * y |
| ``` |
|
|
| Then fit: |
|
|
| ```text |
| p_star ~= A*x + B*y + D*xy + C0 |
| ``` |
|
|
| In plain language, the fitting step asks: |
|
|
| ```text |
| What values of A, B, D, and C0 make the formula's predicted dropout |
| as close as possible to the observed best dropout values across all cells? |
| ``` |
|
|
| Suppose we have many cells: |
|
|
| ```text |
| cell 1 observed best dropout: 0.12 |
| cell 2 observed best dropout: 0.18 |
| cell 3 observed best dropout: 0.08 |
| ... |
| ``` |
|
|
| For each cell, the formula predicts a dropout: |
|
|
| ```text |
| predicted_p_i = A*x_i + B*y_i + D*x_i*y_i + C0 |
| ``` |
|
|
| The error for that cell is: |
|
|
| ```text |
| error_i = predicted_p_i - observed_p_star_i |
| ``` |
|
|
| Ordinary least squares chooses the coefficients that minimize the sum of |
| squared errors: |
|
|
| ```text |
| minimize sum_i error_i^2 |
| ``` |
|
|
| We use squared error because large misses should matter more than tiny misses. |
| Minimizing it is the standard linear-regression objective and has a closed-form |
| solution. |
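
The feature construction and the least-squares fit can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the cell grid, the "true" coefficients, and the helper name `design_matrix` are made up, and the targets are generated without noise so the fit should recover the coefficients exactly.

```python
import itertools

import numpy as np


def design_matrix(P, U, C):
    """Columns: x, y, x*y, 1 (matching the order A, B, D, C0)."""
    x = np.log10(P / U)
    y = np.log10(C / U)
    return np.column_stack([x, y, x * y, np.ones_like(x)])


# Synthetic calibration grid (hypothetical values, not real experiments).
params = [2e6, 8e6, 17e6]
uniques = [5e5, 1e6, 4e6]
budgets = [5e6, 2e7]
cells = np.array(list(itertools.product(params, uniques, budgets)))
P, U, C = cells.T

# Made-up coefficients used only to generate noiseless targets.
true_theta = np.array([-0.09, -0.13, 0.25, 0.08])
X = design_matrix(P, U, C)
p_star = X @ true_theta

# Ordinary least squares: minimize sum_i error_i^2.
theta_hat, *_ = np.linalg.lstsq(X, p_star, rcond=None)
print(dict(zip(["A", "B", "D", "C0"], np.round(theta_hat, 6))))
```

With real sweeps the targets are noisy, so the recovered coefficients will not match any "true" values exactly; the point of the sketch is only the shape of the computation.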
|
|
| ### Why Some Cells Get Lower Weight |
|
|
| Not every observed `p_star` is equally reliable. Some dropout sweeps identify a |
| clear optimum; others only give a rough hint. The weighted fit keeps all cells, |
| but lets cleaner cells influence the coefficients more than uncertain cells. |
|
|
| Weighted least squares minimizes: |
|
|
| ```text |
| minimize sum_i w_i * error_i^2 |
| ``` |
|
|
| Where: |
|
|
| ```text |
| w_i = confidence weight for cell i |
| ``` |
|
|
| If a cell has weight `1.0`, it has full influence. If it has weight `0.3`, it |
| still contributes, but only weakly. |
|
|
| The current fitting script lowers a cell's weight in these cases: |
|
|
| | Condition | Meaning | Why it is less reliable | |
| |---|---|---| |
| | boundary optimum | the best tested dropout is the smallest or largest dropout in the grid | the real optimum may be outside the tested range | |
| | not bracketed | the best point does not have worse points on both sides | we cannot confidently fit a local parabola | |
| | very flat curve | many dropout rates have almost the same validation loss | the exact best dropout is weakly identified | |
| | noisy best loss | validation loss has high variance across seeds/eval batches | the selected best point may move with more samples | |
|
|
| Example boundary optimum: |
|
|
| ```text |
| dropout: 0.00 0.04 0.08 0.12 |
| loss: 3.20 3.05 2.96 2.90 |
| ``` |
|
|
| The best tested value is `0.12`, but the curve is still improving at the edge. |
| The true optimum might be `0.18` or `0.26`, so this cell should not dominate the |
| fit. |
|
|
| Example bracketed optimum: |
|
|
| ```text |
| dropout: 0.04 0.08 0.12 0.18 |
| loss: 2.92 2.87 2.85 2.86 |
| ``` |
|
|
| The best tested value is `0.12`, and both neighboring sides are worse. This is |
| a cleaner target because the bottom of the curve is visible. |
|
|
| Example flat curve: |
|
|
| ```text |
| dropout: 0.04 0.08 0.12 0.18 |
| loss: 2.851 2.849 2.850 2.852 |
| ``` |
|
|
| The grid best is `0.08`, but `0.04`, `0.12`, and `0.18` are within 0.003 of it. |
| The correct conclusion is a plateau, not a sharply known optimum. |
|
|
| So the weighting is not changing the observed results. It only tells the |
| coefficient fit how much confidence to place in each row. |
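
One way such a confidence weight could be assigned is sketched below. The multipliers (`0.3`, `0.5`) and the flatness tolerance are hypothetical placeholders, not the values used by the fitting script, and the noise condition is omitted because it needs per-seed data. A boundary optimum is also, by definition, not bracketed, so the sketch treats those two conditions as one.

```python
def cell_weight(losses, flat_tol=0.005, boundary_weight=0.3, flat_weight=0.5):
    """Heuristic confidence weight for one dropout sweep (hypothetical values)."""
    best = min(range(len(losses)), key=lambda i: losses[i])
    weight = 1.0
    # Boundary optimum: the best point is the first or last grid entry.
    if best == 0 or best == len(losses) - 1:
        weight *= boundary_weight
    # Very flat curve: the whole sweep spans less than flat_tol in loss.
    if max(losses) - min(losses) < flat_tol:
        weight *= flat_weight
    return weight


# The three example sweeps from this section.
print(cell_weight([3.20, 3.05, 2.96, 2.90]))      # boundary optimum
print(cell_weight([2.92, 2.87, 2.85, 2.86]))      # bracketed optimum
print(cell_weight([2.851, 2.849, 2.850, 2.852]))  # flat curve
```

The observed `p_star` for each cell is unchanged; only its influence on the fit differs.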
|
|
| The main implementation is: |
|
|
| ```text |
| scripts/fit_dropout_coefficients.py |
| ``` |
|
|
| Its main outputs are: |
|
|
| | Output | Purpose | |
| |---|---| |
| | `coefficients.json` | fitted coefficients and fit metrics | |
| | `calibration_cells.csv` | per-cell target, prediction, residual, pressure variables | |
| | `fit_diagnostics.md` | human-readable report | |
| | `next_dropout_suggestions.csv` | suggested extra dropout points if a curve needs refinement | |
|
|
| ### Solving For A, B, D, And C0 |
|
|
| After target extraction, every cell gives one equation: |
|
|
| ```text |
| p_star_i ~= A*x_i + B*y_i + D*x_i*y_i + C0 |
| ``` |
|
|
| Where: |
|
|
| ```text |
| x_i = log10(P_i / U_i) |
| y_i = log10(C_i / U_i) |
| ``` |
|
|
| For `n` cells, stack those equations into a matrix: |
|
|
| ```text |
| X = |
| [ |
| x_1 y_1 x_1*y_1 1 |
| x_2 y_2 x_2*y_2 1 |
| x_3 y_3 x_3*y_3 1 |
| ... |
| x_n y_n x_n*y_n 1 |
| ] |
| |
| theta = |
| [ |
| A |
| B |
| D |
| C0 |
| ] |
| |
| p = |
| [ |
| p_star_1 |
| p_star_2 |
| p_star_3 |
| ... |
| p_star_n |
| ] |
| ``` |
|
|
| The coefficient fit solves: |
|
|
| ```text |
| X * theta ~= p |
| ``` |
|
|
| In least-squares form: |
|
|
| ```text |
| theta_hat = argmin_theta ||X * theta - p||^2 |
| ``` |
|
|
| With heuristic weights, the objective becomes: |
|
|
| ```text |
| theta_hat = argmin_theta sum_i w_i * (A*x_i + B*y_i + D*x_i*y_i + C0 - p_star_i)^2 |
| ``` |
|
|
| Cells with clean bracketed dropout optima get higher weight. Boundary, flat, or |
| noisy cells get lower weight. |
|
|
| The implementation uses NumPy least squares: |
|
|
| ```text |
| coef, *_ = np.linalg.lstsq(X_weighted, p_weighted, rcond=None) |
| ``` |
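
The names `X_weighted` and `p_weighted` are not defined above. A common construction, assumed here rather than taken from the script, is to scale each row by `sqrt(w_i)`, which turns the weighted objective into an ordinary least-squares problem. The sketch uses synthetic data and cross-checks the result against the weighted normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12

# Synthetic pressure variables, targets, and weights (all hypothetical).
x = rng.uniform(0.0, 1.5, n)
y = rng.uniform(0.3, 1.3, n)
X = np.column_stack([x, y, x * y, np.ones(n)])
p = X @ np.array([-0.09, -0.13, 0.25, 0.08]) + rng.normal(0.0, 0.01, n)
w = rng.choice([1.0, 0.5, 0.3], size=n)

# Row scaling: sum_i w_i * e_i^2 == sum_i (sqrt(w_i) * e_i)^2.
sqrt_w = np.sqrt(w)
X_weighted = X * sqrt_w[:, None]
p_weighted = p * sqrt_w
coef, *_ = np.linalg.lstsq(X_weighted, p_weighted, rcond=None)

# Cross-check with the weighted normal equations (X^T W X) theta = X^T W p.
W = np.diag(w)
coef_normal = np.linalg.solve(X.T @ W @ X, X.T @ W @ p)
print(np.round(coef, 4))
```

Scaling by `w_i` instead of `sqrt(w_i)` would effectively square the weights, so the square root matters.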
|
|
| For the first-order ABC ablation, the matrix drops the interaction column: |
|
|
| ```text |
| X = |
| [ |
| x_1 y_1 1 |
| x_2 y_2 1 |
| ... |
| ] |
| |
| theta = |
| [ |
| A |
| B |
| C0 |
| ] |
| ``` |
|
|
| Then the fit solves: |
|
|
| ```text |
| p_star_i ~= A*x_i + B*y_i + C0 |
| ``` |
|
|
| ## What The Coefficients Mean |
|
|
| The coefficients are not magic constants: `A`, `B`, and `D` are slopes in the |
| pressure space, and `C0` is the intercept. |
|
|
| For the interaction formula: |
|
|
| ```text |
| p = A*x + B*y + D*x*y + C0 |
| ``` |
|
|
| `A` controls how dropout changes as model size grows relative to available |
| unique tokens. |
|
|
| `B` controls how dropout changes as cumulative sampled training grows relative |
| to available unique tokens. |
|
|
| `D` controls whether those two effects amplify or damp each other. |
|
|
| The interaction term was added because our TinyStories results showed that the |
| effect of repeated sampled training is not independent of model/data pressure. |
| The simple ABC formula underfit those changes. |
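
Because of the interaction term, the effective slope along each pressure axis depends on the other axis: differentiating `p = A*x + B*y + D*x*y + C0` gives `dp/dx = A + D*y` and `dp/dy = B + D*x`. The sketch below evaluates these for the TinyStories coefficients at the L12 example cell; the function name `effective_slopes` is hypothetical.

```python
import math

# TinyStories-regime coefficients from this document.
A, B, D, C0 = -0.089261, -0.129754, 0.255069, 0.081525


def effective_slopes(P: float, U: float, C: float):
    """Partial derivatives of the unclamped formula with respect to x and y."""
    x = math.log10(P / U)
    y = math.log10(C / U)
    return A + D * y, B + D * x


dp_dx, dp_dy = effective_slopes(P=17_367_040, U=1_000_000, C=10_240_000)
print(f"dp/dx = {dp_dx:.4f}, dp/dy = {dp_dy:.4f}")
```

At this point both effective slopes are positive even though `A` and `B` are negative on their own, which is why the coefficients should be read together rather than one at a time.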
|
|
| ## How We Validate Coefficients |
|
|
| After fitting coefficients, we do not immediately launch new training. |
|
|
| First we backtest offline against existing saved results. |
|
|
| ### Within-Regime Fit |
|
|
| Fit coefficients using cells from one regime and measure: |
|
|
| ```text |
| predicted dropout - observed target dropout |
| ``` |
|
|
| Report: |
|
|
| ```text |
| MAE |
| RMSE |
| bias |
| weighted MAE |
| weighted RMSE |
| ``` |
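
These metrics can be computed from the per-cell residuals as sketched below. The weighted variants are assumed to be weight-normalized averages, since the document does not define them; the function name `fit_metrics` and the sample numbers are hypothetical.

```python
import numpy as np


def fit_metrics(pred, target, weights=None):
    """Residual metrics for predicted vs. observed target dropout."""
    err = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    w = np.ones_like(err) if weights is None else np.asarray(weights, dtype=float)
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "bias": float(np.mean(err)),  # signed: positive means over-prediction
        "weighted_mae": float(np.average(np.abs(err), weights=w)),
        "weighted_rmse": float(np.sqrt(np.average(err ** 2, weights=w))),
    }


# Illustrative values only.
metrics = fit_metrics(
    pred=[0.10, 0.20, 0.05],
    target=[0.12, 0.18, 0.08],
    weights=[1.0, 1.0, 0.3],
)
print(metrics)
```

Bias keeps the sign of the residual, so a fit can have a small bias and a large MAE if over- and under-predictions cancel.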
|
|
| ### Held-Out Validation |
|
|
| When enough cells exist, run grouped validation: |
|
|
| | Validation | Test | |
| |---|---| |
| | leave-model-out | can the formula predict a held-out model size? | |
| | leave-prefix-out | can it predict a held-out unique-token prefix? | |
| | leave-source-out | can it predict cells from another run source? | |
|
|
| This tests whether the formula is learning a pressure relationship rather than |
| memorizing one grid. |
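
A grouped validation loop can be sketched as below: hold out every cell that shares a group label, fit on the rest, and score the held-out cells. The data is synthetic and noiseless, and the helper names are hypothetical; grouping by parameter count corresponds to leave-model-out, and grouping by `U` would give leave-prefix-out.

```python
import itertools

import numpy as np


def features(P, U, C):
    x = np.log10(P / U)
    y = np.log10(C / U)
    return np.column_stack([x, y, x * y, np.ones_like(x)])


def leave_group_out_mae(P, U, C, p_star, groups):
    """Fit without each group in turn and report MAE on the held-out group."""
    X = features(P, U, C)
    result = {}
    for g in np.unique(groups):
        held_out = groups == g
        theta, *_ = np.linalg.lstsq(X[~held_out], p_star[~held_out], rcond=None)
        result[float(g)] = float(np.mean(np.abs(X[held_out] @ theta - p_star[held_out])))
    return result


# Synthetic cells generated from made-up coefficients (no noise).
cells = np.array(list(itertools.product([2e6, 8e6, 17e6], [5e5, 1e6, 4e6], [5e6, 2e7])))
P, U, C = cells.T
p_star = features(P, U, C) @ np.array([-0.09, -0.13, 0.25, 0.08])

maes = leave_group_out_mae(P, U, C, p_star, groups=P)  # leave-model-out
print(maes)
```

On noiseless synthetic data the held-out error is essentially zero; on real sweeps, the gap between within-regime and held-out error is the quantity of interest.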
|
|
| ### Cross-Regime Backtest |
|
|
| For each saved regime: |
|
|
| 1. fit coefficients inside that regime; |
| 2. compare `base_abc` and `interaction`; |
| 3. test whether coefficients from one regime transfer numerically to another; |
| 4. decide whether the structure transfers but coefficients differ. |
|
|
| This is the next required step before new MPS training. |
|
|
| ## How The Formula Becomes A Decay Schedule |
|
|
| Static fitting gives a useful dropout estimate for one `(P, U, C)` point. |
|
|
| Streaming creates a sequence of points: |
|
|
| ```text |
| stage 0: P fixed, U_0 small, C_0 small |
| stage 1: P fixed, U_1 larger, C_1 larger |
| stage 2: P fixed, U_2 larger, C_2 larger |
| stage 3: P fixed, U_3 larger, C_3 larger |
| ``` |
|
|
| At each stage: |
|
|
| ```text |
| raw_p_t = A*x_t + B*y_t + D*x_t*y_t + C0 |
| p_t = clamp(p_min, p_max, raw_p_t) |
| ``` |
|
|
| The generated values become stage anchors: |
|
|
| ```text |
| U_0 -> p_0, U_1 -> p_1, U_2 -> p_2, U_3 -> p_3 |
| ``` |
|
|
| The helper script is: |
|
|
| ```text |
| scripts/make_streaming_anchors.py |
| ``` |
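
Independently of that script, the per-stage computation can be sketched as follows. The coefficients and clamp limits are the TinyStories values from this document; the cumulative token budgets `C_t` are hypothetical, since the document does not state them, and the function name `stage_anchors` is made up for illustration.

```python
import math

A, B, D, C0 = -0.089261, -0.129754, 0.255069, 0.081525
P_MIN, P_MAX = 0.02, 0.65


def stage_anchors(P, stages):
    """Map each (U_t, C_t) stage to a clamped dropout anchor."""
    anchors = []
    for U_t, C_t in stages:
        x_t = math.log10(P / U_t)
        y_t = math.log10(C_t / U_t)
        raw_p_t = A * x_t + B * y_t + D * x_t * y_t + C0
        anchors.append((U_t, max(P_MIN, min(P_MAX, raw_p_t))))
    return anchors


# (U_t, C_t) pairs; the C_t values are assumptions for illustration.
stages = [
    (500_000, 4_000_000),
    (1_000_000, 8_000_000),
    (2_000_000, 12_000_000),
    (4_000_000, 16_000_000),
]
anchors = stage_anchors(P=17_367_040, stages=stages)
for U_t, p_t in anchors:
    print(f"{U_t:>9,d} -> {p_t:.3f}")
```

With these inputs the anchors decrease from stage to stage as the unique-token prefix grows, which is the decay shape the streaming experiments test.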
|
|
| For the latest L12 TinyStories streaming setup, the interaction schedule was: |
|
|
| ```text |
| 500k -> 0.184 |
| 1M -> 0.141 |
| 2M -> 0.084 |
| 4M -> 0.045 |
| ``` |
|
|
| That schedule is then tested against static dropout baselines using |
| `locked_stream`. |
|
|
| ## How We Test The Decay Hypothesis |
|
|
| The decay hypothesis is not proven by fitting coefficients. |
|
|
| A good coefficient fit supports only this narrower claim: |
|
|
| ```text |
| the formula estimates useful static dropout for a given pressure point |
| ``` |
|
|
| The streaming experiment tests this stronger claim: |
|
|
| ```text |
| using those estimated dropout values as a schedule helps during streaming |
| ``` |
|
|
| For streaming validation, compare: |
|
|
| ```text |
| formula-derived decay schedules |
| static dropout baselines |
| schedule-shape controls |
| ``` |
|
|
| Measure: |
|
|
| | Metric | Why it matters | |
| |---|---| |
| | final validation loss | whether the model uses the largest stream effectively | |
| | mean trajectory validation loss | whether the full stream path is good | |
| | stage-wise validation loss | where each schedule wins or loses | |
| | train-validation gap | whether dropout is controlling overfit | |
| | paired seed deltas | whether wins survive initialization noise | |
|
|
| Current narrowed streaming comparison: |
|
|
| ```text |
| interaction decay |
| baseabc decay |
| smooth_low decay |
| static_dropout_0.08 |
| static_dropout_0.12 |
| static_dropout_0.18 |
| ``` |
|
|
| ## Current Multi-Seed Streaming Result |
|
|
| Latest TinyStories L12 3-seed final-loss result: |
|
|
| | Condition | Mean final 4M validation loss | Std | |
| |---|---:|---:| |
| | `interaction` decay | 2.5392 | 0.0020 | |
| | `smooth_low` decay | 2.5405 | 0.0018 | |
| | `baseabc` decay | 2.5418 | 0.0019 | |
| | static `0.08` | 2.5511 | 0.0112 | |
| | static `0.12` | 2.5541 | 0.0041 | |
| | static `0.18` | 2.5690 | 0.0069 | |
|
|
| Interpretation: |
|
|
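| ```text |
| interaction decay has the best 3-seed mean final loss; |
| its margin over the other decay schedules (0.0013 to 0.0026) is about the |
| size of the seed standard deviation (~0.002), so that ordering is tentative; |
| 5 seeds would make the TinyStories result more paper-grade. |
| ``` |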
|
|
| The final proof path should be streaming multi-seed validation reports per |
| regime. Static coefficient backtests are supporting gates, not final evidence. |
|
|
| ## Pass And Fail Conditions |
|
|
| ### Coefficient Pass |
|
|
| The coefficient formula passes a regime if: |
|
|
| ```text |
| within-regime MAE is low |
| held-out model/prefix error is low |
| residuals do not show obvious systematic bias |
| the fitted coefficients have a defensible interpretation |
| ``` |
|
|
| For our current scale, a useful target is: |
|
|
| ```text |
| dropout MAE below about 0.05 |
| ``` |
|
|
| ### Streaming Strong Pass |
|
|
| The schedule strongly passes if: |
|
|
| ```text |
| mean final validation loss beats the best static baseline across seeds |
| and most paired seeds favor the decay schedule |
| ``` |
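
The strong-pass rule can be checked mechanically from per-seed final losses, as sketched below. "Most paired seeds" is interpreted here as a strict majority, which is an assumption; the function name `strong_pass` is hypothetical, and the per-seed numbers are made up for illustration, not taken from the experiments.

```python
import numpy as np


def strong_pass(schedule_losses, static_losses_by_name):
    """Return (passed, best_static_name, paired_deltas) for one decay schedule."""
    sched = np.asarray(schedule_losses, dtype=float)
    means = {name: float(np.mean(v)) for name, v in static_losses_by_name.items()}
    best_name = min(means, key=means.get)
    # Paired deltas: same seed index in both runs; negative favors the schedule.
    deltas = sched - np.asarray(static_losses_by_name[best_name], dtype=float)
    beats_mean = float(np.mean(sched)) < means[best_name]
    most_seeds = int(np.sum(deltas < 0)) > len(deltas) / 2
    return beats_mean and most_seeds, best_name, deltas


# Hypothetical per-seed final validation losses (seed order must match).
passed, best_static, deltas = strong_pass(
    schedule_losses=[2.537, 2.540, 2.541],
    static_losses_by_name={
        "static_dropout_0.08": [2.540, 2.552, 2.561],
        "static_dropout_0.12": [2.550, 2.555, 2.558],
    },
)
print(passed, best_static, np.round(deltas, 3))
```

Pairing by seed removes initialization noise that both runs share, which is why the rule looks at per-seed deltas and not only at the means.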
|
|
| ### Streaming Weak Pass |
|
|
| The schedule weakly passes if: |
|
|
| ```text |
| it matches the best hand-picked static dropout |
| while avoiding clearly bad static dropout choices across stages |
| ``` |
|
|
| This is still scientifically useful because it means the formula can choose a |
| competitive schedule without manually searching a fixed dropout for every |
| stream size. |
|
|
| ### Streaming Fail |
|
|
| The schedule fails if: |
|
|
| ```text |
| it loses to a simple static baseline in most seeds |
| or improves early stages by sacrificing final-stage loss |
| ``` |
|
|
| If that happens, do not launch a larger sweep immediately. First fit a |
| static-to-streaming correction offline and backtest it on saved results. |
|
|
| ## Immediate Next Work |
|
|
| The active proof artifacts are: |
|
|
| ```text |
| docs/tinystories_streaming_report.md |
| runs/streaming_tinystories_multiseed_validation_l12/combined_5seed_summary/ |
| ``` |
|
|
| TinyStories has already been regenerated at `n=5`. The next paper-grade |
| streaming validation target is WikiText-103, after reconciling the TinyStories |
| and OpenWebText10K reports. |
|
|
| For any later regime, repeat the same pattern: first use static backtests to |
| choose coefficients, then create a streaming multi-seed validation report as the |
| end proof. |
|
|