Reproducing the Dropout Decay Experiments
This repository is intended to be runnable without checking out nanochat. The implementation is derived from nanochat and retains its MIT attribution, but runtime commands use this package, local cached data, and a local MPS-capable Python environment.
Requirements
- macOS with Apple Silicon MPS available.
- Python 3.10-3.12 recommended for PyTorch MPS wheels.
- A project-local virtual environment at .venv.
- MPS-capable PyTorch. CPU and CUDA runs are intentionally refused by the runner.
Create the environment:
python3.11 -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/python -m pip install -e .
Verify MPS:
.venv/bin/python - <<'PY'
import torch
print(torch.__version__)
print(torch.backends.mps.is_built(), torch.backends.mps.is_available())
PY
Both booleans must be True.
Data
The runner supports two modes:
- --use-cached-data --cache-dir .cache/dropout_decay to load an existing cache.
- --corpus / --corpus-glob to build a cache from raw text or parquet.
The repo includes the curated source corpora needed by the current regimes:
data/openwebtext10k/base_data_climbmix/shard_*.parquet
data/openwebtext10k/openwebtext10k.txt
data/tinystories/train-00000-of-00004.parquet
data/wikitext103_raw/train-00001-of-00002.parquet
The experiments in the current report used these repo-local caches:
.cache/dropout_decay/tokenizer-v4096.json
.cache/dropout_decay/tokens-v4096-uint16.npy
.cache/dropout_decay_tinystories/tokenizer-v4096.json
.cache/dropout_decay_tinystories/tokens-v4096-uint16.npy
.cache/dropout_decay_wikitext103/tokenizer-v4096.json
.cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy
These binary artifacts are intended to be published with the repository through the configured large-file storage rules.
The local OpenWebText-style cached split used by the completed OpenWebText10K runs contains:
train tokens: 5,000,970
validation tokens: 500,000
vocab size: 4,096
The public WikiText-103 holdout can be rebuilt from source:
.venv/bin/python scripts/prepare_wikitext103.py
The pinned parquet source is verified as:
bytes: 156,700,942
sha256: 75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd
The public TinyStories holdout can also be rebuilt from source:
.venv/bin/python scripts/prepare_tinystories.py
The pinned parquet source is verified as:
bytes: 248,731,111
sha256: 77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b
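The byte counts and digests above can be checked locally before building a cache. The following is a minimal sketch using only the Python standard library. The helper name verify_file is hypothetical and not part of this package, and the file paths assume that the pinned sources are the parquet files listed earlier under Data.

```python
import hashlib
from pathlib import Path


def verify_file(path, expected_bytes, expected_sha256, chunk_size=1 << 20):
    """Return True if the file size and SHA-256 digest both match."""
    path = Path(path)
    if path.stat().st_size != expected_bytes:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Stream in chunks so a large parquet file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()


# Expected values are copied from this section; the paths are an assumption.
PINNED = {
    "data/wikitext103_raw/train-00001-of-00002.parquet": (
        156_700_942,
        "75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd",
    ),
    "data/tinystories/train-00000-of-00004.parquet": (
        248_731_111,
        "77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b",
    ),
}

if __name__ == "__main__":
    for name, (size, sha) in PINNED.items():
        if not Path(name).exists():
            print(name, "missing")
        elif verify_file(name, size, sha):
            print(name, "OK")
        else:
            print(name, "MISMATCH")
```

The size check runs first because it is cheap and catches truncated downloads without hashing the whole file.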
Smoke Test
This verifies cached-data loading without running a Torch experiment:
PYTHONPATH=src .venv/bin/python - <<'PY'
from pathlib import Path
from dropout_decay.data import load_cached_splits
tok, splits = load_cached_splits(
    cache_dir=Path(".cache/dropout_decay"),
    vocab_size=4096,
    max_required_train_tokens=4_000_000,
    val_tokens=500_000,
    allow_short_corpus=False,
)
print(tok.vocab_size)
print(len(splits.train), len(splits.val))
PY
Expected:
4096
5000970 500000
Headline Formula
The tested formula is:
p = clamp(0.02, 0.65,
0.154 * log10(params / unique_tokens)
+ 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
- 0.210)
For the standard protocol:
- stream prefixes: 250000 500000 1000000 2000000 4000000
- stage steps: 1000
- batch size: 16
- block size: 128
- cumulative sampled tokens after stage i: i * 1000 * 16 * 128
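For reference, the formula and the standard protocol above can be written out as a short Python sketch. The function and constant names are illustrative rather than taken from this package. The sketch assumes that unique_tokens for a stage is that stage's stream prefix, and it leaves the parameter count as an input because this section does not state how the runner counts parameters.

```python
import math

# Standard protocol values from this section.
STAGE_STEPS = 1000
BATCH_SIZE = 16
BLOCK_SIZE = 128
STREAM_PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]


def cumulative_sampled_tokens(stage):
    """Tokens sampled through the end of the given 1-indexed stage."""
    return stage * STAGE_STEPS * BATCH_SIZE * BLOCK_SIZE


def pressure_dropout(params, unique_tokens, sampled_tokens, lo=0.02, hi=0.65):
    """Evaluate the headline formula and clamp the result to [lo, hi]."""
    raw = (
        0.154 * math.log10(params / unique_tokens)
        + 0.249 * math.log10(sampled_tokens / unique_tokens)
        - 0.210
    )
    return min(hi, max(lo, raw))


def schedule(params):
    """One dropout rate per stream prefix (assumes unique_tokens == prefix)."""
    return [
        pressure_dropout(params, prefix, cumulative_sampled_tokens(stage))
        for stage, prefix in enumerate(STREAM_PREFIXES, start=1)
    ]
```

Each stage samples 1000 * 16 * 128 = 2,048,000 tokens, so the fifth stage reaches 10,240,000 cumulative sampled tokens. Calling schedule(n) with a model's parameter count n returns five rates, one per prefix.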
Reproduce Model-Size Validation
Example L12 command:
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--output-dir runs/reproduce_l12_formula \
--models L12_H8_D320=12x8x320 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.09 0.14 0.18 0.20 0.26 0.30 \
--anchor-decays pressure_formula_l12:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
Completed reference result:
pressure formula final validation: 4.4812 +/- 0.0062
best static final validation: 4.5183
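The --anchor-decays value in the command above packs a schedule name and one dropout rate per stream prefix into a single string of the form name:cap=rate,cap=rate. The sketch below shows one way to split such a string apart and to check that the rates never increase as the prefix grows. Both helpers are hypothetical and written for illustration; the runner's own parser may behave differently.

```python
def parse_anchor_decay(spec):
    """Split 'name:cap=rate,cap=rate,...' into (name, {cap: rate})."""
    name, _, body = spec.partition(":")
    if not name or not body:
        raise ValueError(f"expected 'name:cap=rate,...', got {spec!r}")
    schedule = {}
    for item in body.split(","):
        cap, sep, rate = item.partition("=")
        if not sep:
            raise ValueError(f"expected 'cap=rate', got {item!r}")
        schedule[int(cap)] = float(rate)
    return name, schedule


def is_non_increasing(schedule):
    """True if the rate never rises as the token cap grows."""
    rates = [schedule[cap] for cap in sorted(schedule)]
    return all(a >= b for a, b in zip(rates, rates[1:]))


name, decays = parse_anchor_decay(
    "pressure_formula_l12:250000=0.300,500000=0.260,"
    "1000000=0.180,2000000=0.090,4000000=0.020"
)
print(name, sorted(decays), is_non_increasing(decays))
```

The token caps in the parsed schedule are expected to line up with the values passed to --stream-token-caps.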
Reproduce Architecture-Shape Holdout
Deep/narrow holdout:
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--output-dir runs/reproduce_arch_deep_narrow \
--models deep_narrow_L18_H8_D256=18x8x256 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
--anchor-decays formula_deep_narrow_l18_h8:250000=0.297,500000=0.250,1000000=0.173,2000000=0.083,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
Completed reference result:
formula final validation: 4.5286 +/- 0.0118
best static final validation: 4.5564 +/- 0.0127
Reproduce Width-Heavy Holdout
The width-heavy architecture holdout is the paired complement to the deep/narrow holdout above:
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--output-dir runs/architecture_shape_holdout_wide_h8 \
--models wide_L8_H8_D384=8x8x384 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
--anchor-decays formula_wide_l8_h8:250000=0.301,500000=0.254,1000000=0.177,2000000=0.087,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
Completed reference result:
formula final validation: 4.4658 +/- 0.0065
best static final validation: 4.4946 +/- 0.0087
best mean trajectory: static 0.18, 4.9064 vs formula 4.9073
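The pairing between the two shape holdouts can be sanity-checked with a rough parameter estimate. The sketch below assumes that a --models value reads as layers x heads x width, which the model names suggest, and uses the common 12 * layers * width^2 approximation for transformer block parameters (attention plus a 4x MLP, ignoring embeddings, norms, and biases). This is not the runner's parameter count, and both helpers are hypothetical.

```python
def parse_shape(spec):
    """Split 'name=LxHxD' into (name, layers, heads, width)."""
    name, _, shape = spec.rpartition("=")
    layers, heads, width = (int(part) for part in shape.split("x"))
    return name, layers, heads, width


def approx_block_params(layers, width):
    """Rough transformer block parameter count: 12 * layers * width^2."""
    return 12 * layers * width * width


for spec in (
    "L12_H8_D320=12x8x320",
    "deep_narrow_L18_H8_D256=18x8x256",
    "wide_L8_H8_D384=8x8x384",
):
    name, layers, heads, width = parse_shape(spec)
    print(f"{name}: {approx_block_params(layers, width):,}")
```

Under this approximation the deep/narrow and wide shapes both come out at 14,155,776 block parameters, and the 12x8x320 model at 14,745,600. Any remaining difference in total size would then come from width-dependent terms such as embeddings, which the estimate leaves out.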
Reproduce WikiText-103 Corpus Holdout
Prepare the public corpus first:
.venv/bin/python scripts/prepare_wikitext103.py
Then run the frozen L12 formula against a broad static grid:
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--corpus data/wikitext103_raw/train-00001-of-00002.parquet \
--cache-dir .cache/dropout_decay_wikitext103 \
--output-dir runs/corpus_holdout_wikitext103_l12 \
--models L12_H8_D320=12x8x320 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
--anchor-decays formula_l12_wikitext103:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
Completed reference result:
formula final validation: 4.0836 +/- 0.0258
best static final validation: 4.1081 +/- 0.0258
best mean trajectory: formula, 4.5728 vs static 0.18, 4.5759
Reproduce TinyStories Corpus Holdout
Prepare the public corpus first:
.venv/bin/python scripts/prepare_tinystories.py
Then run the frozen L12 formula against a broad static grid:
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--corpus data/tinystories/train-00000-of-00004.parquet \
--cache-dir .cache/dropout_decay_tinystories \
--output-dir runs/corpus_holdout_tinystories_l12 \
--models L12_H8_D320=12x8x320 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 0.40 0.50 \
--anchor-decays formula_l12_tinystories:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
Completed reference result:
frozen formula final validation: 2.7016 +/- 0.0212
best static final validation: 2.7009 +/- 0.0216
paired frozen-formula wins: 1/3
The TinyStories holdout exposed a useful boundary condition: the frozen pressure formula starts too high for this easier corpus. The focused low-decay follow-up can be reproduced with:
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay_tinystories \
--output-dir runs/corpus_holdout_tinystories_l12_lowdecay \
--models L12_H8_D320=12x8x320 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates \
--anchor-decays \
tinystories_low_decay_014_002:250000=0.140,500000=0.140,1000000=0.100,2000000=0.060,4000000=0.020 \
tinystories_low_decay_014_006:250000=0.140,500000=0.140,1000000=0.100,2000000=0.080,4000000=0.060 \
tinystories_monotone_oracle:250000=0.140,500000=0.140,1000000=0.140,2000000=0.080,4000000=0.080 \
tinystories_low_decay_010_006:250000=0.100,500000=0.100,1000000=0.080,2000000=0.080,4000000=0.060 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
Completed reference result:
best low-decay path: 0.14 -> 0.14 -> 0.10 -> 0.06 -> 0.02
best low-decay final validation: 2.6859 +/- 0.0286
best static final validation: 2.7009 +/- 0.0216
paired low-decay wins: 3/3
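The paired wins lines in this section compare two schedules seed by seed. A minimal sketch of such a tally follows, assuming that a win means a strictly lower final validation loss on the same seed. The function names and the dict-by-seed layout are assumptions, not the runner's output format, and no result values are reproduced here.

```python
def paired_wins(candidate, baseline):
    """Count shared seeds where the candidate's final loss is strictly lower.

    Both arguments map seed -> final validation loss.
    Returns (wins, number_of_shared_seeds).
    """
    shared = sorted(set(candidate) & set(baseline))
    if not shared:
        raise ValueError("no seeds in common")
    wins = sum(1 for seed in shared if candidate[seed] < baseline[seed])
    return wins, len(shared)


def format_wins(candidate, baseline):
    """Render the tally in the 'wins/total' style used in this section."""
    wins, total = paired_wins(candidate, baseline)
    return f"{wins}/{total}"
```

Pairing by seed keeps the comparison from being dominated by seed-to-seed variance, which matters when the two means differ by less than their spread, as in the frozen-formula TinyStories result.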
Notes for Publication
- Do not claim the formula is universal.
- The supported claim is final-validation improvement under this expanding-prefix protocol.
- PDFs are generated artifacts and are ignored by Git.
- The exact cached token file should be published through an appropriate binary artifact mechanism once dataset provenance is finalized.