dropout-decay / REPRODUCING.md
Mandeep Sidhu
Make research artifacts self contained
618af58
# Reproducing the Dropout Decay Experiments
This repository is intended to be runnable without checking out nanochat. The
implementation is derived from nanochat and retains its MIT attribution, but
runtime commands use this package, local cached data, and a local MPS-capable
Python environment.
## Requirements
- macOS with Apple Silicon MPS available.
- Python 3.10-3.12 recommended for PyTorch MPS wheels.
- A project-local virtual environment at `.venv`.
- MPS-capable PyTorch. CPU and CUDA runs are intentionally refused by the
runner.
Create the environment:
```bash
python3.11 -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/python -m pip install -e .
```
Verify MPS:
```bash
.venv/bin/python - <<'PY'
import torch
print(torch.__version__)
print(torch.backends.mps.is_built(), torch.backends.mps.is_available())
PY
```
Both booleans must be `True`.
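If either value is `False`, the pair of flags narrows down the cause. The sketch below is a hypothetical helper (`explain_mps_status` is not part of this repository) that turns the two booleans into a short diagnosis. The `torch` import is guarded, so the helper itself has no dependencies.

```python
def explain_mps_status(is_built: bool, is_available: bool) -> str:
    """Map the two MPS flags to a short diagnosis (hypothetical helper)."""
    if not is_built:
        return "PyTorch was built without MPS support; install an MPS-enabled build."
    if not is_available:
        return ("PyTorch has MPS support, but no usable device was found; "
                "check for macOS 12.3 or newer and an MPS-enabled device.")
    return "MPS is built and available."


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        print(explain_mps_status(torch.backends.mps.is_built(),
                                 torch.backends.mps.is_available()))
```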
## Data
The runner supports two modes:
- `--use-cached-data --cache-dir .cache/dropout_decay`
- `--corpus` / `--corpus-glob` to build a cache from raw text or parquet.
The repo includes the curated source corpora needed by the current regimes:
```text
data/openwebtext10k/base_data_climbmix/shard_*.parquet
data/openwebtext10k/openwebtext10k.txt
data/tinystories/train-00000-of-00004.parquet
data/wikitext103_raw/train-00001-of-00002.parquet
```
The experiments in the current report used these repo-local caches:
```text
.cache/dropout_decay/tokenizer-v4096.json
.cache/dropout_decay/tokens-v4096-uint16.npy
.cache/dropout_decay_tinystories/tokenizer-v4096.json
.cache/dropout_decay_tinystories/tokens-v4096-uint16.npy
.cache/dropout_decay_wikitext103/tokenizer-v4096.json
.cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy
```
These binary artifacts are intended to be published with the repository through
the configured large-file storage rules. The local OpenWebText-style cached
split used by the completed OpenWebText10K runs contains:
```text
train tokens: 5,000,970
validation tokens: 500,000
vocab size: 4,096
```
The public WikiText-103 holdout can be rebuilt from source:
```bash
.venv/bin/python scripts/prepare_wikitext103.py
```
The pinned WikiText-103 parquet source should match this size and checksum:
```text
bytes: 156,700,942
sha256: 75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd
```
The public TinyStories holdout can also be rebuilt from source:
```bash
.venv/bin/python scripts/prepare_tinystories.py
```
The pinned TinyStories parquet source should match this size and checksum:
```text
bytes: 248,731,111
sha256: 77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b
```
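To check a local parquet file against the pinned sizes and digests listed in this section, a streaming SHA-256 is enough. This is a minimal sketch using only the standard library. `file_fingerprint` is a hypothetical helper, and pairing each pinned size and digest with the paths under `data/` is an assumption; the prepare scripts may name or store the files differently.

```python
import hashlib
from pathlib import Path

# Pinned values copied from this document. Pairing them with these paths is an
# assumption; the prepare scripts may store or name the files differently.
PINNED = {
    "data/wikitext103_raw/train-00001-of-00002.parquet": (
        156_700_942,
        "75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd",
    ),
    "data/tinystories/train-00000-of-00004.parquet": (
        248_731_111,
        "77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b",
    ),
}


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a file, reading it in chunks."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


if __name__ == "__main__":
    for name, expected in PINNED.items():
        if not Path(name).exists():
            print(f"missing: {name}")
            continue
        status = "ok" if file_fingerprint(name) == expected else "MISMATCH"
        print(f"{status}: {name}")
```

Reading in fixed-size chunks keeps memory use flat for files of this size.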
## Smoke Test
This verifies cached-data loading without running a Torch experiment:
```bash
PYTHONPATH=src .venv/bin/python - <<'PY'
from pathlib import Path
from dropout_decay.data import load_cached_splits
tok, splits = load_cached_splits(
    cache_dir=Path(".cache/dropout_decay"),
    vocab_size=4096,
    max_required_train_tokens=4_000_000,
    val_tokens=500_000,
    allow_short_corpus=False,
)
print(tok.vocab_size)
print(len(splits.train), len(splits.val))
PY
```
Expected:
```text
4096
5000970 500000
```
## Headline Formula
The tested formula is:
```text
p = clamp(0.02, 0.65,
          0.154 * log10(params / unique_tokens)
          + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
          - 0.210)
```
For the standard protocol:
- stream prefixes: `250000 500000 1000000 2000000 4000000`
- stage steps: `1000`
- batch size: `16`
- block size: `128`
- cumulative sampled tokens after stage `i`: `i * 1000 * 16 * 128`
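The formula and the protocol arithmetic above can be combined into a short schedule calculator. This is a sketch under stated assumptions: `pressure_dropout` and `schedule` are illustrative names, not functions from this package. `unique_tokens` is taken to be the stream prefix of the current stage. The parameter count is a placeholder, because this document does not say how `params` is counted.

```python
import math

STREAM_PREFIXES = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
STAGE_STEPS, BATCH_SIZE, BLOCK_SIZE = 1000, 16, 128


def pressure_dropout(params, unique_tokens, cumulative_sampled_tokens):
    """Evaluate the headline formula and clamp the result to [0.02, 0.65]."""
    raw = (0.154 * math.log10(params / unique_tokens)
           + 0.249 * math.log10(cumulative_sampled_tokens / unique_tokens)
           - 0.210)
    return min(0.65, max(0.02, raw))


def schedule(params):
    """Return (prefix, cumulative_tokens, dropout) per stage of the standard protocol."""
    rows = []
    for stage, prefix in enumerate(STREAM_PREFIXES, start=1):
        # Assumption: unique_tokens equals the stream prefix of this stage.
        cumulative = stage * STAGE_STEPS * BATCH_SIZE * BLOCK_SIZE
        rows.append((prefix, cumulative, pressure_dropout(params, prefix, cumulative)))
    return rows


if __name__ == "__main__":
    # 16_250_000 is a placeholder parameter count, not a measured value.
    for prefix, cumulative, p in schedule(16_250_000):
        print(f"prefix={prefix:>9,} cumulative={cumulative:>10,} p={p:.3f}")
```

Each stage adds `1000 * 16 * 128 = 2,048,000` sampled tokens, while the prefix doubles. Both log ratios therefore shrink from stage to stage, and the schedule decays until it reaches the 0.02 floor.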
## Reproduce Model-Size Validation
Example L12 command:
```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--output-dir runs/reproduce_l12_formula \
--models L12_H8_D320=12x8x320 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.09 0.14 0.18 0.20 0.26 0.30 \
--anchor-decays pressure_formula_l12:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
```
Completed reference result:
```text
pressure formula final validation: 4.4812 +/- 0.0062
best static final validation: 4.5183
```
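The `--anchor-decays` value in the command above packs a schedule name and per-prefix dropout rates into one string. The sketch below parses that `name:cap=rate,...` shape as inferred from the examples in this document. `parse_anchor_decay` is a hypothetical helper, and the runner's own parser may accept different or stricter syntax.

```python
def parse_anchor_decay(spec: str):
    """Split 'name:cap=rate,cap=rate,...' into (name, {cap: rate})."""
    name, _, body = spec.partition(":")
    if not name or not body:
        raise ValueError(f"expected 'name:cap=rate,...', got {spec!r}")
    rates = {}
    for item in body.split(","):
        cap, sep, rate = item.partition("=")
        if not sep:
            raise ValueError(f"expected 'cap=rate', got {item!r}")
        rates[int(cap)] = float(rate)
    return name, rates


if __name__ == "__main__":
    spec = ("pressure_formula_l12:250000=0.300,500000=0.260,"
            "1000000=0.180,2000000=0.090,4000000=0.020")
    name, rates = parse_anchor_decay(spec)
    print(name)
    for cap, rate in rates.items():
        print(f"  prefix {cap:>9,}: dropout {rate:.3f}")
```

Parsing the spec this way makes it easy to check that the caps in a schedule line up with the values passed to `--stream-token-caps`.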
## Reproduce Architecture-Shape Holdout
Deep/narrow holdout:
```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--output-dir runs/reproduce_arch_deep_narrow \
--models deep_narrow_L18_H8_D256=18x8x256 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
--anchor-decays formula_deep_narrow_l18_h8:250000=0.297,500000=0.250,1000000=0.173,2000000=0.083,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
```
Completed reference result:
```text
formula final validation: 4.5286 +/- 0.0118
best static final validation: 4.5564 +/- 0.0127
```
## Reproduce Width-Heavy Holdout
The width-heavy architecture holdout is the paired complement to the deep/narrow
holdout above:
```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay \
--output-dir runs/architecture_shape_holdout_wide_h8 \
--models wide_L8_H8_D384=8x8x384 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
--anchor-decays formula_wide_l8_h8:250000=0.301,500000=0.254,1000000=0.177,2000000=0.087,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
```
Completed reference result:
```text
formula final validation: 4.4658 +/- 0.0065
best static final validation: 4.4946 +/- 0.0087
best mean trajectory: static 0.18, 4.9064 vs formula 4.9073
```
## Reproduce WikiText-103 Corpus Holdout
Prepare the public corpus first:
```bash
.venv/bin/python scripts/prepare_wikitext103.py
```
Then run the frozen L12 formula against a broad static grid:
```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--corpus data/wikitext103_raw/train-00001-of-00002.parquet \
--cache-dir .cache/dropout_decay_wikitext103 \
--output-dir runs/corpus_holdout_wikitext103_l12 \
--models L12_H8_D320=12x8x320 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
--anchor-decays formula_l12_wikitext103:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
```
Completed reference result:
```text
formula final validation: 4.0836 +/- 0.0258
best static final validation: 4.1081 +/- 0.0258
best mean trajectory: formula, 4.5728 vs static 0.18, 4.5759
```
## Reproduce TinyStories Corpus Holdout
Prepare the public corpus first:
```bash
.venv/bin/python scripts/prepare_tinystories.py
```
Then run the frozen L12 formula against a broad static grid:
```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--corpus data/tinystories/train-00000-of-00004.parquet \
--cache-dir .cache/dropout_decay_tinystories \
--output-dir runs/corpus_holdout_tinystories_l12 \
--models L12_H8_D320=12x8x320 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 0.40 0.50 \
--anchor-decays formula_l12_tinystories:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
```
Completed reference result:
```text
frozen formula final validation: 2.7016 +/- 0.0212
best static final validation: 2.7009 +/- 0.0216
paired frozen-formula wins: 1/3
```
The TinyStories holdout exposed a useful boundary condition: the frozen
pressure formula starts too high for this easier corpus. The focused low-decay
follow-up can be reproduced with:
```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
--mode locked_stream \
--use-cached-data \
--cache-dir .cache/dropout_decay_tinystories \
--output-dir runs/corpus_holdout_tinystories_l12_lowdecay \
--models L12_H8_D320=12x8x320 \
--seeds 1 2 3 \
--stream-token-caps 250000 500000 1000000 2000000 4000000 \
--dropout-rates \
--anchor-decays \
tinystories_low_decay_014_002:250000=0.140,500000=0.140,1000000=0.100,2000000=0.060,4000000=0.020 \
tinystories_low_decay_014_006:250000=0.140,500000=0.140,1000000=0.100,2000000=0.080,4000000=0.060 \
tinystories_monotone_oracle:250000=0.140,500000=0.140,1000000=0.140,2000000=0.080,4000000=0.080 \
tinystories_low_decay_010_006:250000=0.100,500000=0.100,1000000=0.080,2000000=0.080,4000000=0.060 \
--stage-steps 1000 \
--batch-size 16 \
--block-size 128 \
--eval-batches 64 \
--train-eval-batches 32 \
--trace-eval-batches 8 \
--log-every 500 \
--vocab-size 4096 \
--val-tokens 500000 \
--lr 0.0003 \
--weight-decay 0.1 \
--grad-clip 1.0
```
Completed reference result:
```text
best low-decay path: 0.14 -> 0.14 -> 0.10 -> 0.06 -> 0.02
best low-decay final validation: 2.6859 +/- 0.0286
best static final validation: 2.7009 +/- 0.0216
paired low-decay wins: 3/3
```
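The reference results reported in this document can be compared side by side by subtracting each schedule's final validation value from the best static value of the same comparison. The snippet below is plain arithmetic over the quoted means. It ignores the reported spreads and the seed pairing, so it is a reading aid rather than a significance test, and the labels are shorthand chosen here.

```python
# (label, scheduled final validation, best static final validation),
# copied from the "Completed reference result" blocks in this document.
RESULTS = [
    ("L12 model-size validation", 4.4812, 4.5183),
    ("deep/narrow holdout", 4.5286, 4.5564),
    ("width-heavy holdout", 4.4658, 4.4946),
    ("WikiText-103, frozen formula", 4.0836, 4.1081),
    ("TinyStories, frozen formula", 2.7016, 2.7009),
    ("TinyStories, best low-decay path", 2.6859, 2.7009),
]


def gaps(results):
    """Return (label, static - scheduled); positive means the schedule's value is lower."""
    return [(label, round(static - scheduled, 4)) for label, scheduled, static in results]


if __name__ == "__main__":
    for label, gap in gaps(RESULTS):
        print(f"{label:<34} {gap:+.4f}")
```

Only the frozen formula on TinyStories shows a negative gap, which matches the boundary condition described above.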
## Notes for Publication
- Do not claim the formula is universal.
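- The supported claim is limited to final validation under this
expanding-prefix protocol: the formula schedule beats the best static dropout
rate in the OpenWebText10K and WikiText-103 runs above and roughly ties it on
TinyStories.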
- PDFs are generated artifacts and are ignored by Git.
- The exact cached token file should be published through an appropriate binary
artifact mechanism once dataset provenance is finalized.