| # Reproducing the Dropout Decay Experiments |
|
|
| This repository is intended to be runnable without checking out nanochat. The |
| implementation is derived from nanochat and retains its MIT attribution, but |
| runtime commands use this package, local cached data, and a local MPS-capable |
| Python environment. |
|
|
| ## Requirements |
|
|
| - macOS with Apple Silicon MPS available. |
| - Python 3.10-3.12 recommended for PyTorch MPS wheels. |
| - A project-local virtual environment at `.venv`. |
| - MPS-capable PyTorch. CPU and CUDA runs are intentionally refused by the |
| runner. |
|
|
| Create the environment: |
|
|
| ```bash |
| python3.11 -m venv .venv |
| .venv/bin/python -m pip install --upgrade pip |
| .venv/bin/python -m pip install -e . |
| ``` |
|
|
| Verify MPS: |
|
|
| ```bash |
| .venv/bin/python - <<'PY' |
| import torch |
| print(torch.__version__) |
| print(torch.backends.mps.is_built(), torch.backends.mps.is_available()) |
| PY |
| ``` |
|
|
| Both booleans must be `True`. |
|
|
| ## Data |
|
|
| The runner supports two modes: |
|
|
| - `--use-cached-data --cache-dir .cache/dropout_decay` to load an existing |
| tokenizer and token cache. |
| - `--corpus` / `--corpus-glob` to build a cache from raw text or parquet. |
|
|
| The repo includes the curated source corpora needed by the current regimes: |
|
|
| ```text |
| data/openwebtext10k/base_data_climbmix/shard_*.parquet |
| data/openwebtext10k/openwebtext10k.txt |
| data/tinystories/train-00000-of-00004.parquet |
| data/wikitext103_raw/train-00001-of-00002.parquet |
| ``` |
|
|
| The experiments in the current report used these repo-local caches: |
|
|
| ```text |
| .cache/dropout_decay/tokenizer-v4096.json |
| .cache/dropout_decay/tokens-v4096-uint16.npy |
| .cache/dropout_decay_tinystories/tokenizer-v4096.json |
| .cache/dropout_decay_tinystories/tokens-v4096-uint16.npy |
| .cache/dropout_decay_wikitext103/tokenizer-v4096.json |
| .cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy |
| ``` |
|
|
| These binary artifacts are intended to be published with the repository through |
| the configured large-file storage rules. The local OpenWebText-style cached |
| split used by the completed OpenWebText10K runs contains: |
|
|
| ```text |
| train tokens: 5,000,970 |
| validation tokens: 500,000 |
| vocab size: 4,096 |
| ``` |
|
|
| The public WikiText-103 holdout can be rebuilt from source: |
|
|
| ```bash |
| .venv/bin/python scripts/prepare_wikitext103.py |
| ``` |
|
|
| The pinned WikiText-103 parquet source is verified as: |
|
|
| ```text |
| bytes: 156,700,942 |
| sha256: 75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd |
| ``` |
|
|
| The public TinyStories holdout can also be rebuilt from source: |
|
|
| ```bash |
| .venv/bin/python scripts/prepare_tinystories.py |
| ``` |
|
|
| The pinned TinyStories parquet source is verified as: |
|
|
| ```text |
| bytes: 248,731,111 |
| sha256: 77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b |
| ``` |
|
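
The pinned sizes and digests above can be checked locally with the standard library. The following is a minimal sketch, not part of the repository: the helper names `file_fingerprint` and `check_pinned` are hypothetical, and the pairing of each pinned value with a file path is an assumption based on the corpus listing earlier in this section.

```python
import hashlib
from pathlib import Path

# Pinned values copied from this document. Pairing each value with a path is
# an assumption; adjust the keys if the prepare scripts write elsewhere.
PINNED = {
    "data/wikitext103_raw/train-00001-of-00002.parquet": (
        156_700_942,
        "75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd",
    ),
    "data/tinystories/train-00000-of-00004.parquet": (
        248_731_111,
        "77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b",
    ),
}


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def check_pinned(pinned=PINNED):
    """Map each path to 'ok', 'missing', or 'mismatch'."""
    status = {}
    for path, expected in pinned.items():
        if not Path(path).is_file():
            status[path] = "missing"
        elif file_fingerprint(path) == tuple(expected):
            status[path] = "ok"
        else:
            status[path] = "mismatch"
    return status


if __name__ == "__main__":
    for path, state in check_pinned().items():
        print(f"{state:9s} {path}")
```

Reading in chunks keeps memory use flat for the larger parquet file. A `mismatch` means either the size or the digest differs from the pinned pair.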
|
| ## Smoke Test |
|
|
| This verifies cached-data loading without running a Torch experiment: |
|
|
| ```bash |
| PYTHONPATH=src .venv/bin/python - <<'PY' |
| from pathlib import Path |
| from dropout_decay.data import load_cached_splits |
| |
| tok, splits = load_cached_splits( |
|     cache_dir=Path(".cache/dropout_decay"), |
|     vocab_size=4096, |
|     max_required_train_tokens=4_000_000, |
|     val_tokens=500_000, |
|     allow_short_corpus=False, |
| ) |
| print(tok.vocab_size) |
| print(len(splits.train), len(splits.val)) |
| PY |
| ``` |
|
|
| Expected output: |
|
|
| ```text |
| 4096 |
| 5000970 500000 |
| ``` |
|
|
| ## Headline Formula |
|
|
| The tested formula is: |
|
|
| ```text |
| p = clamp(0.02, 0.65, |
| 0.154 * log10(params / unique_tokens) |
| + 0.249 * log10(cumulative_sampled_tokens / unique_tokens) |
| - 0.210) |
| ``` |
|
|
| The standard protocol uses: |
|
|
| - stream prefixes: `250000 500000 1000000 2000000 4000000` |
| - stage steps: `1000` |
| - batch size: `16` |
| - block size: `128` |
| - cumulative sampled tokens after stage `i`: `i * 1000 * 16 * 128`, that is |
| `i * 2048000` |
|
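
The formula and protocol above can be sketched in Python. This is a minimal sketch under stated assumptions, not the runner's implementation: it assumes `unique_tokens` equals the stream prefix size at each stage and that stages are numbered from 1. `HYPOTHETICAL_PARAMS` is a placeholder, because this document does not state parameter counts. All function and constant names are illustrative.

```python
import math

# Protocol constants copied from the list above.
STREAM_PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
STAGE_STEPS = 1000
BATCH_SIZE = 16
BLOCK_SIZE = 128
TOKENS_PER_STAGE = STAGE_STEPS * BATCH_SIZE * BLOCK_SIZE  # 2,048,000


def pressure_dropout(params, unique_tokens, cumulative_sampled_tokens,
                     lo=0.02, hi=0.65):
    """Evaluate the tested formula and clamp the result to [lo, hi]."""
    raw = (
        0.154 * math.log10(params / unique_tokens)
        + 0.249 * math.log10(cumulative_sampled_tokens / unique_tokens)
        - 0.210
    )
    return min(hi, max(lo, raw))


def schedule(params):
    """Return (unique_tokens, cumulative_tokens, p) for each stage.

    Assumption: the unique-token count at stage i is the i-th stream prefix.
    """
    rows = []
    for stage, unique in enumerate(STREAM_PREFIXES, start=1):
        cumulative = stage * TOKENS_PER_STAGE
        p = pressure_dropout(params, unique, cumulative)
        rows.append((unique, cumulative, round(p, 3)))
    return rows


if __name__ == "__main__":
    # Placeholder parameter count; replace with the model's real count.
    HYPOTHETICAL_PARAMS = 17_000_000
    for unique, cumulative, p in schedule(HYPOTHETICAL_PARAMS):
        print(f"unique={unique:>9,} cumulative={cumulative:>10,} p={p:.3f}")
```

With a real parameter count substituted, the printed rates can be compared against the `--anchor-decays` values used in the commands in this document. Those values may have been rounded or adjusted, so exact agreement is not guaranteed.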
|
| ## Reproduce Model-Size Validation |
|
|
| Example command for the L12 model (`L12_H8_D320`): |
|
|
| ```bash |
| PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \ |
| --mode locked_stream \ |
| --use-cached-data \ |
| --cache-dir .cache/dropout_decay \ |
| --output-dir runs/reproduce_l12_formula \ |
| --models L12_H8_D320=12x8x320 \ |
| --seeds 1 2 3 \ |
| --stream-token-caps 250000 500000 1000000 2000000 4000000 \ |
| --dropout-rates 0.09 0.14 0.18 0.20 0.26 0.30 \ |
| --anchor-decays pressure_formula_l12:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \ |
| --stage-steps 1000 \ |
| --batch-size 16 \ |
| --block-size 128 \ |
| --eval-batches 64 \ |
| --train-eval-batches 32 \ |
| --trace-eval-batches 8 \ |
| --log-every 500 \ |
| --vocab-size 4096 \ |
| --val-tokens 500000 \ |
| --lr 0.0003 \ |
| --weight-decay 0.1 \ |
| --grad-clip 1.0 |
| ``` |
|
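
The `--anchor-decays` value in the command above has the visible shape `name:cap=rate,cap=rate,...`. The helpers below are a hedged sketch for building and sanity-checking such strings before a long run. They mirror only that visible shape; the runner's own parser is not shown in this document, and the function names are hypothetical.

```python
def parse_anchor_decay(spec):
    """Split 'name:cap=rate,cap=rate,...' into (name, {cap: rate})."""
    name, _, body = spec.partition(":")
    if not name or not body:
        raise ValueError(f"expected 'name:cap=rate,...', got {spec!r}")
    schedule = {}
    for item in body.split(","):
        cap, _, rate = item.partition("=")
        schedule[int(cap)] = float(rate)
    return name, schedule


def format_anchor_decay(name, schedule):
    """Render a schedule back into the command-line form, three decimals."""
    pairs = ",".join(
        f"{cap}={rate:.3f}" for cap, rate in sorted(schedule.items())
    )
    return f"{name}:{pairs}"


def is_non_increasing(schedule):
    """True when the rate never rises as the token cap grows."""
    rates = [schedule[cap] for cap in sorted(schedule)]
    return all(a >= b for a, b in zip(rates, rates[1:]))


if __name__ == "__main__":
    spec = (
        "pressure_formula_l12:250000=0.300,500000=0.260,"
        "1000000=0.180,2000000=0.090,4000000=0.020"
    )
    name, schedule = parse_anchor_decay(spec)
    print(name, schedule, is_non_increasing(schedule))
```

A round trip through `parse_anchor_decay` and `format_anchor_decay` is a cheap way to catch a mistyped cap or a missing `=` before starting a run.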
|
| Completed reference result: |
|
|
| ```text |
| pressure formula final validation: 4.4812 +/- 0.0062 |
| best static final validation: 4.5183 |
| ``` |
|
|
| ## Reproduce Architecture-Shape Holdout |
|
|
| Deep/narrow holdout: |
|
|
| ```bash |
| PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \ |
| --mode locked_stream \ |
| --use-cached-data \ |
| --cache-dir .cache/dropout_decay \ |
| --output-dir runs/reproduce_arch_deep_narrow \ |
| --models deep_narrow_L18_H8_D256=18x8x256 \ |
| --seeds 1 2 3 \ |
| --stream-token-caps 250000 500000 1000000 2000000 4000000 \ |
| --dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \ |
| --anchor-decays formula_deep_narrow_l18_h8:250000=0.297,500000=0.250,1000000=0.173,2000000=0.083,4000000=0.020 \ |
| --stage-steps 1000 \ |
| --batch-size 16 \ |
| --block-size 128 \ |
| --eval-batches 64 \ |
| --train-eval-batches 32 \ |
| --trace-eval-batches 8 \ |
| --log-every 500 \ |
| --vocab-size 4096 \ |
| --val-tokens 500000 \ |
| --lr 0.0003 \ |
| --weight-decay 0.1 \ |
| --grad-clip 1.0 |
| ``` |
|
|
| Completed reference result: |
|
|
| ```text |
| formula final validation: 4.5286 +/- 0.0118 |
| best static final validation: 4.5564 +/- 0.0127 |
| ``` |
|
|
| ## Reproduce Width-Heavy Holdout |
|
|
| The width-heavy architecture holdout is the paired complement to the deep/narrow |
| holdout above: |
|
|
| ```bash |
| PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \ |
| --mode locked_stream \ |
| --use-cached-data \ |
| --cache-dir .cache/dropout_decay \ |
| --output-dir runs/architecture_shape_holdout_wide_h8 \ |
| --models wide_L8_H8_D384=8x8x384 \ |
| --seeds 1 2 3 \ |
| --stream-token-caps 250000 500000 1000000 2000000 4000000 \ |
| --dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \ |
| --anchor-decays formula_wide_l8_h8:250000=0.301,500000=0.254,1000000=0.177,2000000=0.087,4000000=0.020 \ |
| --stage-steps 1000 \ |
| --batch-size 16 \ |
| --block-size 128 \ |
| --eval-batches 64 \ |
| --train-eval-batches 32 \ |
| --trace-eval-batches 8 \ |
| --log-every 500 \ |
| --vocab-size 4096 \ |
| --val-tokens 500000 \ |
| --lr 0.0003 \ |
| --weight-decay 0.1 \ |
| --grad-clip 1.0 |
| ``` |
|
|
| Completed reference result: |
|
|
| ```text |
| formula final validation: 4.4658 +/- 0.0065 |
| best static final validation: 4.4946 +/- 0.0087 |
| best mean trajectory: static 0.18, 4.9064 vs formula 4.9073 |
| ``` |
|
|
| ## Reproduce WikiText-103 Corpus Holdout |
|
|
| Prepare the public corpus first: |
|
|
| ```bash |
| .venv/bin/python scripts/prepare_wikitext103.py |
| ``` |
|
|
| Then run the frozen L12 formula against a broad static grid: |
|
|
| ```bash |
| PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \ |
| --mode locked_stream \ |
| --corpus data/wikitext103_raw/train-00001-of-00002.parquet \ |
| --cache-dir .cache/dropout_decay_wikitext103 \ |
| --output-dir runs/corpus_holdout_wikitext103_l12 \ |
| --models L12_H8_D320=12x8x320 \ |
| --seeds 1 2 3 \ |
| --stream-token-caps 250000 500000 1000000 2000000 4000000 \ |
| --dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \ |
| --anchor-decays formula_l12_wikitext103:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \ |
| --stage-steps 1000 \ |
| --batch-size 16 \ |
| --block-size 128 \ |
| --eval-batches 64 \ |
| --train-eval-batches 32 \ |
| --trace-eval-batches 8 \ |
| --log-every 500 \ |
| --vocab-size 4096 \ |
| --val-tokens 500000 \ |
| --lr 0.0003 \ |
| --weight-decay 0.1 \ |
| --grad-clip 1.0 |
| ``` |
|
|
| Completed reference result: |
|
|
| ```text |
| formula final validation: 4.0836 +/- 0.0258 |
| best static final validation: 4.1081 +/- 0.0258 |
| best mean trajectory: formula, 4.5728 vs static 0.18, 4.5759 |
| ``` |
|
|
| ## Reproduce TinyStories Corpus Holdout |
|
|
| Prepare the public corpus first: |
|
|
| ```bash |
| .venv/bin/python scripts/prepare_tinystories.py |
| ``` |
|
|
| Then run the frozen L12 formula against a broad static grid: |
|
|
| ```bash |
| PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \ |
| --mode locked_stream \ |
| --corpus data/tinystories/train-00000-of-00004.parquet \ |
| --cache-dir .cache/dropout_decay_tinystories \ |
| --output-dir runs/corpus_holdout_tinystories_l12 \ |
| --models L12_H8_D320=12x8x320 \ |
| --seeds 1 2 3 \ |
| --stream-token-caps 250000 500000 1000000 2000000 4000000 \ |
| --dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 0.40 0.50 \ |
| --anchor-decays formula_l12_tinystories:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \ |
| --stage-steps 1000 \ |
| --batch-size 16 \ |
| --block-size 128 \ |
| --eval-batches 64 \ |
| --train-eval-batches 32 \ |
| --trace-eval-batches 8 \ |
| --log-every 500 \ |
| --vocab-size 4096 \ |
| --val-tokens 500000 \ |
| --lr 0.0003 \ |
| --weight-decay 0.1 \ |
| --grad-clip 1.0 |
| ``` |
|
|
| Completed reference result: |
|
|
| ```text |
| frozen formula final validation: 2.7016 +/- 0.0212 |
| best static final validation: 2.7009 +/- 0.0216 |
| paired frozen-formula wins: 1/3 |
| ``` |
|
|
| The TinyStories holdout exposed a useful boundary condition: the frozen |
| pressure formula starts with dropout rates that are too high for this easier |
| corpus. The focused low-decay follow-up can be reproduced with: |
|
|
| ```bash |
| PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \ |
| --mode locked_stream \ |
| --use-cached-data \ |
| --cache-dir .cache/dropout_decay_tinystories \ |
| --output-dir runs/corpus_holdout_tinystories_l12_lowdecay \ |
| --models L12_H8_D320=12x8x320 \ |
| --seeds 1 2 3 \ |
| --stream-token-caps 250000 500000 1000000 2000000 4000000 \ |
| --dropout-rates \ |
| --anchor-decays \ |
| tinystories_low_decay_014_002:250000=0.140,500000=0.140,1000000=0.100,2000000=0.060,4000000=0.020 \ |
| tinystories_low_decay_014_006:250000=0.140,500000=0.140,1000000=0.100,2000000=0.080,4000000=0.060 \ |
| tinystories_monotone_oracle:250000=0.140,500000=0.140,1000000=0.140,2000000=0.080,4000000=0.080 \ |
| tinystories_low_decay_010_006:250000=0.100,500000=0.100,1000000=0.080,2000000=0.080,4000000=0.060 \ |
| --stage-steps 1000 \ |
| --batch-size 16 \ |
| --block-size 128 \ |
| --eval-batches 64 \ |
| --train-eval-batches 32 \ |
| --trace-eval-batches 8 \ |
| --log-every 500 \ |
| --vocab-size 4096 \ |
| --val-tokens 500000 \ |
| --lr 0.0003 \ |
| --weight-decay 0.1 \ |
| --grad-clip 1.0 |
| ``` |
|
|
| Completed reference result: |
|
|
| ```text |
| best low-decay path: 0.14 -> 0.14 -> 0.10 -> 0.06 -> 0.02 |
| best low-decay final validation: 2.6859 +/- 0.0286 |
| best static final validation: 2.7009 +/- 0.0216 |
| paired low-decay wins: 3/3 |
| ``` |
|
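
The reference results quoted in this document can be compared side by side with a few lines of arithmetic. The sketch below copies the reported means and computes the margin between the best static rate and the decayed schedule. It assumes that a lower final validation value is better, and it ignores the reported `+/-` spreads, so it is a reading aid rather than a significance test. The names are illustrative.

```python
# (label, schedule final validation, best static final validation), copied
# from the "Completed reference result" blocks in this document.
REFERENCE_RESULTS = [
    ("L12 model-size validation", 4.4812, 4.5183),
    ("deep/narrow holdout", 4.5286, 4.5564),
    ("width-heavy holdout", 4.4658, 4.4946),
    ("WikiText-103 holdout", 4.0836, 4.1081),
    ("TinyStories frozen formula", 2.7016, 2.7009),
    ("TinyStories low-decay path", 2.6859, 2.7009),
]


def margins(results=REFERENCE_RESULTS):
    """Return (label, static - schedule); positive means the schedule is lower."""
    return [
        (label, round(static - sched, 4)) for label, sched, static in results
    ]


if __name__ == "__main__":
    for label, margin in margins():
        print(f"{label:28s} {margin:+.4f}")
```

The only negative margin is the frozen formula on TinyStories, which matches the boundary condition described above.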
|
| ## Notes for Publication |
|
|
| - Do not claim the formula is universal. |
| - The supported claim is final-validation improvement under this |
| expanding-prefix protocol. |
| - PDFs are generated artifacts and are ignored by Git. |
| - The exact cached token file should be published through an appropriate binary |
| artifact mechanism once dataset provenance is finalized. |
|
|