# Reproducing the Dropout Decay Experiments

This repository is intended to be runnable without checking out nanochat. The implementation is derived from nanochat and retains its MIT attribution, but runtime commands use this package, local cached data, and a local MPS-capable Python environment.

## Requirements

- macOS with Apple Silicon MPS available.
- Python 3.10-3.12 recommended for PyTorch MPS wheels.
- A project-local virtual environment at `.venv`.
- MPS-capable PyTorch. CPU and CUDA runs are intentionally refused by the runner.

Create the environment:

```bash
python3.11 -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/python -m pip install -e .
```

Verify MPS:

```bash
.venv/bin/python - <<'PY'
import torch
print(torch.__version__)
print(torch.backends.mps.is_built(), torch.backends.mps.is_available())
PY
```

Both booleans must be `True`.

## Data

The runner supports two modes:

- `--use-cached-data --cache-dir .cache/dropout_decay`
- `--corpus` / `--corpus-glob` to build a cache from raw text or parquet.

The repo includes the curated source corpora needed by the current regimes:

```text
data/openwebtext10k/base_data_climbmix/shard_*.parquet
data/openwebtext10k/openwebtext10k.txt
data/tinystories/train-00000-of-00004.parquet
data/wikitext103_raw/train-00001-of-00002.parquet
```

The experiments in the current report used these repo-local caches:

```text
.cache/dropout_decay/tokenizer-v4096.json
.cache/dropout_decay/tokens-v4096-uint16.npy
.cache/dropout_decay_tinystories/tokenizer-v4096.json
.cache/dropout_decay_tinystories/tokens-v4096-uint16.npy
.cache/dropout_decay_wikitext103/tokenizer-v4096.json
.cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy
```

These binary artifacts are intended to be published with the repository through the configured large-file storage rules.
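Because the caches are binary artifacts delivered through large-file storage, it can be useful to confirm that each file is actually present and to record its size and SHA-256 digest. The following is a minimal sketch using only the Python standard library. The helper names `file_fingerprint` and `report_cache_files` are hypothetical and not part of this package; the path list simply mirrors the cache files named above.

```python
import hashlib
from pathlib import Path

# Cache files named in this document; adjust if your checkout differs.
EXPECTED_CACHE_FILES = [
    ".cache/dropout_decay/tokenizer-v4096.json",
    ".cache/dropout_decay/tokens-v4096-uint16.npy",
    ".cache/dropout_decay_tinystories/tokenizer-v4096.json",
    ".cache/dropout_decay_tinystories/tokens-v4096-uint16.npy",
    ".cache/dropout_decay_wikitext103/tokenizer-v4096.json",
    ".cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy",
]


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def report_cache_files(root="."):
    """Return one status line per expected cache file under root."""
    lines = []
    for rel in EXPECTED_CACHE_FILES:
        path = Path(root) / rel
        if path.is_file():
            size, sha = file_fingerprint(path)
            lines.append(f"{rel}: {size:,} bytes, sha256 {sha}")
        else:
            lines.append(f"{rel}: MISSING")
    return lines


if __name__ == "__main__":
    # Run from the repository root so the relative paths resolve.
    print("\n".join(report_cache_files()))
```

A cache file that is only a few hundred bytes long is likely an unfetched large-file pointer rather than the real artifact, so the reported size is worth a glance before starting a run.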
The local OpenWebText-style cached split used by the completed OpenWebText10K runs contains:

```text
train tokens: 5,000,970
validation tokens: 500,000
vocab size: 4,096
```

The public WikiText-103 holdout can be rebuilt from source:

```bash
.venv/bin/python scripts/prepare_wikitext103.py
```

The pinned parquet source is verified as:

```text
bytes: 156,700,942
sha256: 75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd
```

The public TinyStories holdout can also be rebuilt from source:

```bash
.venv/bin/python scripts/prepare_tinystories.py
```

The pinned parquet source is verified as:

```text
bytes: 248,731,111
sha256: 77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b
```

## Smoke Test

This verifies cached-data loading without running a Torch experiment:

```bash
PYTHONPATH=src .venv/bin/python - <<'PY'
from pathlib import Path
from dropout_decay.data import load_cached_splits

tok, splits = load_cached_splits(
    cache_dir=Path(".cache/dropout_decay"),
    vocab_size=4096,
    max_required_train_tokens=4_000_000,
    val_tokens=500_000,
    allow_short_corpus=False,
)
print(tok.vocab_size)
print(len(splits.train), len(splits.val))
PY
```

Expected:

```text
4096
5000970 500000
```

## Headline Formula

The tested formula is:

```text
p = clamp(0.02, 0.65,
          0.154 * log10(params / unique_tokens)
        + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
        - 0.210)
```

For the standard protocol:

- stream prefixes: `250000 500000 1000000 2000000 4000000`
- stage steps: `1000`
- batch size: `16`
- block size: `128`
- cumulative sampled tokens after stage `i`: `i * 1000 * 16 * 128`

## Reproduce Model-Size Validation

Example L12 command:

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --output-dir runs/reproduce_l12_formula \
  --models L12_H8_D320=12x8x320 \
  --seeds 1 2 3 \
  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
  --dropout-rates 0.09 0.14 0.18 0.20 0.26 0.30 \
  --anchor-decays pressure_formula_l12:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
  --stage-steps 1000 \
  --batch-size 16 \
  --block-size 128 \
  --eval-batches 64 \
  --train-eval-batches 32 \
  --trace-eval-batches 8 \
  --log-every 500 \
  --vocab-size 4096 \
  --val-tokens 500000 \
  --lr 0.0003 \
  --weight-decay 0.1 \
  --grad-clip 1.0
```

Completed reference result:

```text
pressure formula final validation: 4.4812 +/- 0.0062
best static final validation: 4.5183
```

## Reproduce Architecture-Shape Holdout

Deep/narrow holdout:

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --output-dir runs/reproduce_arch_deep_narrow \
  --models deep_narrow_L18_H8_D256=18x8x256 \
  --seeds 1 2 3 \
  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
  --dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
  --anchor-decays formula_deep_narrow_l18_h8:250000=0.297,500000=0.250,1000000=0.173,2000000=0.083,4000000=0.020 \
  --stage-steps 1000 \
  --batch-size 16 \
  --block-size 128 \
  --eval-batches 64 \
  --train-eval-batches 32 \
  --trace-eval-batches 8 \
  --log-every 500 \
  --vocab-size 4096 \
  --val-tokens 500000 \
  --lr 0.0003 \
  --weight-decay 0.1 \
  --grad-clip 1.0
```

Completed reference result:

```text
formula final validation: 4.5286 +/- 0.0118
best static final validation: 4.5564 +/- 0.0127
```

## Reproduce Width-Heavy Holdout

The width-heavy architecture holdout is the paired complement to the deep/narrow holdout above:

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --output-dir runs/architecture_shape_holdout_wide_h8 \
  --models wide_L8_H8_D384=8x8x384 \
  --seeds 1 2 3 \
  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
  --dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
  --anchor-decays formula_wide_l8_h8:250000=0.301,500000=0.254,1000000=0.177,2000000=0.087,4000000=0.020 \
  --stage-steps 1000 \
  --batch-size 16 \
  --block-size 128 \
  --eval-batches 64 \
  --train-eval-batches 32 \
  --trace-eval-batches 8 \
  --log-every 500 \
  --vocab-size 4096 \
  --val-tokens 500000 \
  --lr 0.0003 \
  --weight-decay 0.1 \
  --grad-clip 1.0
```

Completed reference result:

```text
formula final validation: 4.4658 +/- 0.0065
best static final validation: 4.4946 +/- 0.0087
best mean trajectory: static 0.18, 4.9064 vs formula 4.9073
```

## Reproduce WikiText-103 Corpus Holdout

Prepare the public corpus first:

```bash
.venv/bin/python scripts/prepare_wikitext103.py
```

Then run the frozen L12 formula against a broad static grid:

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --corpus data/wikitext103_raw/train-00001-of-00002.parquet \
  --cache-dir .cache/dropout_decay_wikitext103 \
  --output-dir runs/corpus_holdout_wikitext103_l12 \
  --models L12_H8_D320=12x8x320 \
  --seeds 1 2 3 \
  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
  --dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
  --anchor-decays formula_l12_wikitext103:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
  --stage-steps 1000 \
  --batch-size 16 \
  --block-size 128 \
  --eval-batches 64 \
  --train-eval-batches 32 \
  --trace-eval-batches 8 \
  --log-every 500 \
  --vocab-size 4096 \
  --val-tokens 500000 \
  --lr 0.0003 \
  --weight-decay 0.1 \
  --grad-clip 1.0
```

Completed reference result:

```text
formula final validation: 4.0836 +/- 0.0258
best static final validation: 4.1081 +/- 0.0258
best mean trajectory: formula, 4.5728 vs static 0.18, 4.5759
```

## Reproduce TinyStories Corpus Holdout

Prepare the public corpus first:

```bash
.venv/bin/python scripts/prepare_tinystories.py
```

Then run the frozen L12 formula against a broad static grid:

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --corpus data/tinystories/train-00000-of-00004.parquet \
  --cache-dir .cache/dropout_decay_tinystories \
  --output-dir runs/corpus_holdout_tinystories_l12 \
  --models L12_H8_D320=12x8x320 \
  --seeds 1 2 3 \
  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
  --dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 0.40 0.50 \
  --anchor-decays formula_l12_tinystories:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
  --stage-steps 1000 \
  --batch-size 16 \
  --block-size 128 \
  --eval-batches 64 \
  --train-eval-batches 32 \
  --trace-eval-batches 8 \
  --log-every 500 \
  --vocab-size 4096 \
  --val-tokens 500000 \
  --lr 0.0003 \
  --weight-decay 0.1 \
  --grad-clip 1.0
```

Completed reference result:

```text
frozen formula final validation: 2.7016 +/- 0.0212
best static final validation: 2.7009 +/- 0.0216
paired frozen-formula wins: 1/3
```

The TinyStories holdout exposed a useful boundary condition: the frozen pressure formula starts too high for this easier corpus. The focused low-decay follow-up can be reproduced with:

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --use-cached-data \
  --cache-dir .cache/dropout_decay_tinystories \
  --output-dir runs/corpus_holdout_tinystories_l12_lowdecay \
  --models L12_H8_D320=12x8x320 \
  --seeds 1 2 3 \
  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
  --dropout-rates \
  --anchor-decays \
    tinystories_low_decay_014_002:250000=0.140,500000=0.140,1000000=0.100,2000000=0.060,4000000=0.020 \
    tinystories_low_decay_014_006:250000=0.140,500000=0.140,1000000=0.100,2000000=0.080,4000000=0.060 \
    tinystories_monotone_oracle:250000=0.140,500000=0.140,1000000=0.140,2000000=0.080,4000000=0.080 \
    tinystories_low_decay_010_006:250000=0.100,500000=0.100,1000000=0.080,2000000=0.080,4000000=0.060 \
  --stage-steps 1000 \
  --batch-size 16 \
  --block-size 128 \
  --eval-batches 64 \
  --train-eval-batches 32 \
  --trace-eval-batches 8 \
  --log-every 500 \
  --vocab-size 4096 \
  --val-tokens 500000 \
  --lr 0.0003 \
  --weight-decay 0.1 \
  --grad-clip 1.0
```

Completed reference result:

```text
best low-decay path: 0.14 -> 0.14 -> 0.10 -> 0.06 -> 0.02
best low-decay final validation: 2.6859 +/- 0.0286
best static final validation: 2.7009 +/- 0.0216
paired low-decay wins: 3/3
```

## Notes for Publication

- Do not claim the formula is universal.
- The supported claim is final-validation improvement under this expanding-prefix protocol.
- PDFs are generated artifacts and are ignored by Git.
- The exact cached token file should be published through an appropriate binary artifact mechanism once dataset provenance is finalized.
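
As a cross-check on the headline formula and the standard-protocol token accounting described earlier, the following is a minimal Python sketch that evaluates the clamped expression stage by stage. The function names `dropout_rate` and `standard_schedule` are hypothetical and not part of the `dropout_decay` package. This document does not state model parameter counts, so the value used in the example is a placeholder that should be replaced with the real count for the model under test.

```python
import math


def dropout_rate(params, unique_tokens, cumulative_sampled_tokens, lo=0.02, hi=0.65):
    """Evaluate the headline formula and clamp the result to [lo, hi]."""
    raw = (
        0.154 * math.log10(params / unique_tokens)
        + 0.249 * math.log10(cumulative_sampled_tokens / unique_tokens)
        - 0.210
    )
    return min(hi, max(lo, raw))


def standard_schedule(
    params,
    prefixes=(250_000, 500_000, 1_000_000, 2_000_000, 4_000_000),
    stage_steps=1000,
    batch_size=16,
    block_size=128,
):
    """Return (unique_tokens, cumulative_sampled_tokens, p) for each stage.

    Stage i (1-based) has sampled i * stage_steps * batch_size * block_size
    tokens in total, and its stream prefix is taken as unique_tokens.
    """
    tokens_per_stage = stage_steps * batch_size * block_size
    schedule = []
    for stage, unique_tokens in enumerate(prefixes, start=1):
        cumulative = stage * tokens_per_stage
        schedule.append(
            (unique_tokens, cumulative, dropout_rate(params, unique_tokens, cumulative))
        )
    return schedule


if __name__ == "__main__":
    # Placeholder parameter count for illustration only; not a value from this report.
    HYPOTHETICAL_PARAMS = 10_000_000
    for unique_tokens, cumulative, p in standard_schedule(HYPOTHETICAL_PARAMS):
        print(f"prefix={unique_tokens:>9,} sampled={cumulative:>10,} p={p:.3f}")
```

The printed values can be compared against the `--anchor-decays` entries in the commands above once the actual parameter count is supplied. Both ratios shrink as the prefix grows, so the computed rate falls from stage to stage until it reaches the lower clamp.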