Mandeep Sidhu

Make research artifacts self contained

	# Reproducing the Dropout Decay Experiments

	This repository is intended to run without a nanochat checkout. The
	implementation is derived from nanochat and retains its MIT attribution, but
	runtime commands use only this package, local cached data, and a local
	MPS-capable Python environment.

	## Requirements

	- macOS with Apple Silicon MPS available.
	- Python 3.10-3.12 recommended for PyTorch MPS wheels.
	- A project-local virtual environment at `.venv`.
	- MPS-capable PyTorch. CPU and CUDA runs are intentionally refused by the
	runner.

	Create the environment:

	```bash
	python3.11 -m venv .venv
	.venv/bin/python -m pip install --upgrade pip
	.venv/bin/python -m pip install -e .
	```

	Verify MPS:

	```bash
	.venv/bin/python - <<'PY'
	import torch
	print(torch.__version__)
	print(torch.backends.mps.is_built(), torch.backends.mps.is_available())
	PY
	```

	Both booleans must be `True`.

	## Data

	The runner supports two data modes:

	- `--use-cached-data --cache-dir .cache/dropout_decay` to load an existing
	tokenizer and token cache.
	- `--corpus` / `--corpus-glob` to build a cache from raw text or parquet.

	The repo includes the curated source corpora needed by the current regimes:

	```text
	data/openwebtext10k/base_data_climbmix/shard_*.parquet
	data/openwebtext10k/openwebtext10k.txt
	data/tinystories/train-00000-of-00004.parquet
	data/wikitext103_raw/train-00001-of-00002.parquet
	```

	The experiments in the current report used these repo-local caches:

	```text
	.cache/dropout_decay/tokenizer-v4096.json
	.cache/dropout_decay/tokens-v4096-uint16.npy
	.cache/dropout_decay_tinystories/tokenizer-v4096.json
	.cache/dropout_decay_tinystories/tokens-v4096-uint16.npy
	.cache/dropout_decay_wikitext103/tokenizer-v4096.json
	.cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy
	```

	These binary artifacts are intended to be published with the repository through
	the configured large-file storage rules. The local OpenWebText-style cached
	split used by the completed OpenWebText10K runs contains:

	```text
	train tokens: 5,000,970
	validation tokens: 500,000
	vocab size: 4,096
	```

	The public WikiText-103 holdout can be rebuilt from source:

	```bash
	.venv/bin/python scripts/prepare_wikitext103.py
	```

	The pinned parquet source is verified against this size and SHA-256 digest:

	```text
	bytes: 156,700,942
	sha256: 75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd
	```

	The public TinyStories holdout can also be rebuilt from source:

	```bash
	.venv/bin/python scripts/prepare_tinystories.py
	```

	The pinned parquet source is verified against this size and SHA-256 digest:

	```text
	bytes: 248,731,111
	sha256: 77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b
	```
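
A pinned size and digest like the two above can be checked with the Python standard library before running a preparation script. This is a minimal sketch, not the verification code the preparation scripts themselves use: the function name `verify_pinned_file` is hypothetical, and pairing each digest with the parquet path listed earlier in this section is an assumption.

```python
import hashlib
from pathlib import Path


def verify_pinned_file(path, expected_bytes, expected_sha256, chunk_size=1 << 20):
    """Return True if the file at `path` matches the pinned size and SHA-256."""
    path = Path(path)
    if path.stat().st_size != expected_bytes:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in chunks so a large parquet file is never fully held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()


# Sizes and digests copied from this document.
# ASSUMPTION: each pair belongs to the parquet path shown next to it.
PINNED = {
    "data/wikitext103_raw/train-00001-of-00002.parquet": (
        156_700_942,
        "75aa65dee9de2a7c10ba1808efd2408c3f4eb008104c3ccac47f8ed19300ebdd",
    ),
    "data/tinystories/train-00000-of-00004.parquet": (
        248_731_111,
        "77cf780cebe52b6e83e3a2ac84bc56d8059363113e41d17a023f1d8b2ed0fc0b",
    ),
}

if __name__ == "__main__":
    for name, (size, sha) in PINNED.items():
        if not Path(name).exists():
            print(name, "missing")
        elif verify_pinned_file(name, size, sha):
            print(name, "ok")
        else:
            print(name, "MISMATCH")
```

Checking the byte count first is a cheap early exit: a truncated download fails without hashing the whole file.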

	## Smoke Test

	This verifies cached-data loading without running a Torch experiment:

	```bash
	PYTHONPATH=src .venv/bin/python - <<'PY'
	from pathlib import Path
	from dropout_decay.data import load_cached_splits

	tok, splits = load_cached_splits(
	    cache_dir=Path(".cache/dropout_decay"),
	    vocab_size=4096,
	    max_required_train_tokens=4_000_000,
	    val_tokens=500_000,
	    allow_short_corpus=False,
	)
	print(tok.vocab_size)
	print(len(splits.train), len(splits.val))
	PY
	```

	Expected:

	```text
	4096
	5000970 500000
	```

	## Headline Formula

	The tested formula sets the dropout rate `p` to:

	```text
	p = clamp(0.02, 0.65,
	          0.154 * log10(params / unique_tokens)
	          + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
	          - 0.210)
	```

	For the standard protocol:

	- stream prefixes: `250000 500000 1000000 2000000 4000000`
	- stage steps: `1000`
	- batch size: `16`
	- block size: `128`
	- cumulative sampled tokens after stage `i`: `i * 1000 * 16 * 128`
	(2,048,000 tokens per stage)
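
The formula and the standard protocol can be combined into a per-stage schedule. The following is a minimal sketch rather than the runner's implementation. It assumes that `unique_tokens` at stage `i` equals the `i`-th stream prefix. The names `pressure_dropout` and `standard_schedule` are hypothetical, and `EXAMPLE_PARAMS` is a placeholder, not the parameter count of any model in this repository.

```python
import math

# Clamp bounds copied from the formula above.
P_MIN, P_MAX = 0.02, 0.65


def pressure_dropout(params, unique_tokens, cumulative_sampled_tokens):
    """Evaluate the clamped dropout formula for one stage."""
    raw = (
        0.154 * math.log10(params / unique_tokens)
        + 0.249 * math.log10(cumulative_sampled_tokens / unique_tokens)
        - 0.210
    )
    return min(P_MAX, max(P_MIN, raw))


def standard_schedule(
    params,
    prefixes=(250_000, 500_000, 1_000_000, 2_000_000, 4_000_000),
    stage_steps=1000,
    batch_size=16,
    block_size=128,
):
    """Return (stream_prefix, dropout_rate) pairs for the standard protocol."""
    tokens_per_stage = stage_steps * batch_size * block_size  # 2,048,000
    # ASSUMPTION: unique_tokens at stage i is the i-th stream prefix.
    return [
        (prefix, pressure_dropout(params, prefix, stage * tokens_per_stage))
        for stage, prefix in enumerate(prefixes, start=1)
    ]


if __name__ == "__main__":
    EXAMPLE_PARAMS = 10_000_000  # placeholder value, not a real model size
    for prefix, p in standard_schedule(EXAMPLE_PARAMS):
        print(f"{prefix:>9,d} unique tokens -> p = {p:.3f}")
```

Under these assumptions the rate falls as the prefix grows, because both log ratios shrink when `unique_tokens` doubles while the cumulative sampled tokens grow only linearly with the stage index.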

	## Reproduce Model-Size Validation

	Example command for the L12 model (`L12_H8_D320`, specified as `12x8x320`):

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode locked_stream \
	--use-cached-data \
	--cache-dir .cache/dropout_decay \
	--output-dir runs/reproduce_l12_formula \
	--models L12_H8_D320=12x8x320 \
	--seeds 1 2 3 \
	--stream-token-caps 250000 500000 1000000 2000000 4000000 \
	--dropout-rates 0.09 0.14 0.18 0.20 0.26 0.30 \
	--anchor-decays pressure_formula_l12:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
	--stage-steps 1000 \
	--batch-size 16 \
	--block-size 128 \
	--eval-batches 64 \
	--train-eval-batches 32 \
	--trace-eval-batches 8 \
	--log-every 500 \
	--vocab-size 4096 \
	--val-tokens 500000 \
	--lr 0.0003 \
	--weight-decay 0.1 \
	--grad-clip 1.0
	```

	Completed reference result:

	```text
	pressure formula final validation: 4.4812 +/- 0.0062
	best static final validation: 4.5183
	```
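
The `--anchor-decays` value in the command above pairs each stream token cap with a dropout rate, in the form `name:cap=rate,cap=rate,...`. The helpers below are a small sketch for building and reading such strings. They only mirror the layout visible in this document and are not the runner's own parser. Both function names are hypothetical.

```python
def format_anchor_decay(name, schedule):
    """Build `name:cap=rate,...` from (token_cap, dropout_rate) pairs."""
    # Rates are written with three decimals, matching the commands in this document.
    pairs = ",".join(f"{int(cap)}={rate:.3f}" for cap, rate in schedule)
    return f"{name}:{pairs}"


def parse_anchor_decay(spec):
    """Split `name:cap=rate,...` back into a name and (cap, rate) pairs."""
    name, _, body = spec.partition(":")
    schedule = []
    for item in body.split(","):
        cap, _, rate = item.partition("=")
        schedule.append((int(cap), float(rate)))
    return name, schedule


if __name__ == "__main__":
    # Values copied from the L12 command above.
    l12 = [
        (250_000, 0.300),
        (500_000, 0.260),
        (1_000_000, 0.180),
        (2_000_000, 0.090),
        (4_000_000, 0.020),
    ]
    print(format_anchor_decay("pressure_formula_l12", l12))
```

Generating the string from a list of pairs keeps the token caps aligned with `--stream-token-caps` when a schedule is edited by hand.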

	## Reproduce Architecture-Shape Holdout

	Deep/narrow holdout:

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode locked_stream \
	--use-cached-data \
	--cache-dir .cache/dropout_decay \
	--output-dir runs/reproduce_arch_deep_narrow \
	--models deep_narrow_L18_H8_D256=18x8x256 \
	--seeds 1 2 3 \
	--stream-token-caps 250000 500000 1000000 2000000 4000000 \
	--dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
	--anchor-decays formula_deep_narrow_l18_h8:250000=0.297,500000=0.250,1000000=0.173,2000000=0.083,4000000=0.020 \
	--stage-steps 1000 \
	--batch-size 16 \
	--block-size 128 \
	--eval-batches 64 \
	--train-eval-batches 32 \
	--trace-eval-batches 8 \
	--log-every 500 \
	--vocab-size 4096 \
	--val-tokens 500000 \
	--lr 0.0003 \
	--weight-decay 0.1 \
	--grad-clip 1.0
	```

	Completed reference result:

	```text
	formula final validation: 4.5286 +/- 0.0118
	best static final validation: 4.5564 +/- 0.0127
	```

	## Reproduce Width-Heavy Holdout

	The width-heavy architecture holdout is the paired complement to the deep/narrow
	holdout above:

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode locked_stream \
	--use-cached-data \
	--cache-dir .cache/dropout_decay \
	--output-dir runs/architecture_shape_holdout_wide_h8 \
	--models wide_L8_H8_D384=8x8x384 \
	--seeds 1 2 3 \
	--stream-token-caps 250000 500000 1000000 2000000 4000000 \
	--dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
	--anchor-decays formula_wide_l8_h8:250000=0.301,500000=0.254,1000000=0.177,2000000=0.087,4000000=0.020 \
	--stage-steps 1000 \
	--batch-size 16 \
	--block-size 128 \
	--eval-batches 64 \
	--train-eval-batches 32 \
	--trace-eval-batches 8 \
	--log-every 500 \
	--vocab-size 4096 \
	--val-tokens 500000 \
	--lr 0.0003 \
	--weight-decay 0.1 \
	--grad-clip 1.0
	```

	Completed reference result:

	```text
	formula final validation: 4.4658 +/- 0.0065
	best static final validation: 4.4946 +/- 0.0087
	best mean trajectory: static 0.18, 4.9064 vs formula 4.9073
	```

	## Reproduce WikiText-103 Corpus Holdout

	Prepare the public corpus first:

	```bash
	.venv/bin/python scripts/prepare_wikitext103.py
	```

	Then run the frozen L12 formula against a broad static grid:

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode locked_stream \
	--corpus data/wikitext103_raw/train-00001-of-00002.parquet \
	--cache-dir .cache/dropout_decay_wikitext103 \
	--output-dir runs/corpus_holdout_wikitext103_l12 \
	--models L12_H8_D320=12x8x320 \
	--seeds 1 2 3 \
	--stream-token-caps 250000 500000 1000000 2000000 4000000 \
	--dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 \
	--anchor-decays formula_l12_wikitext103:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
	--stage-steps 1000 \
	--batch-size 16 \
	--block-size 128 \
	--eval-batches 64 \
	--train-eval-batches 32 \
	--trace-eval-batches 8 \
	--log-every 500 \
	--vocab-size 4096 \
	--val-tokens 500000 \
	--lr 0.0003 \
	--weight-decay 0.1 \
	--grad-clip 1.0
	```

	Completed reference result:

	```text
	formula final validation: 4.0836 +/- 0.0258
	best static final validation: 4.1081 +/- 0.0258
	best mean trajectory: formula, 4.5728 vs static 0.18, 4.5759
	```

	## Reproduce TinyStories Corpus Holdout

	Prepare the public corpus first:

	```bash
	.venv/bin/python scripts/prepare_tinystories.py
	```

	Then run the frozen L12 formula against a broad static grid:

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode locked_stream \
	--corpus data/tinystories/train-00000-of-00004.parquet \
	--cache-dir .cache/dropout_decay_tinystories \
	--output-dir runs/corpus_holdout_tinystories_l12 \
	--models L12_H8_D320=12x8x320 \
	--seeds 1 2 3 \
	--stream-token-caps 250000 500000 1000000 2000000 4000000 \
	--dropout-rates 0.00 0.02 0.06 0.08 0.10 0.14 0.18 0.20 0.26 0.30 0.40 0.50 \
	--anchor-decays formula_l12_tinystories:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
	--stage-steps 1000 \
	--batch-size 16 \
	--block-size 128 \
	--eval-batches 64 \
	--train-eval-batches 32 \
	--trace-eval-batches 8 \
	--log-every 500 \
	--vocab-size 4096 \
	--val-tokens 500000 \
	--lr 0.0003 \
	--weight-decay 0.1 \
	--grad-clip 1.0
	```

	Completed reference result:

	```text
	frozen formula final validation: 2.7016 +/- 0.0212
	best static final validation: 2.7009 +/- 0.0216
	paired frozen-formula wins: 1/3
	```

	The TinyStories holdout exposed a useful boundary condition: the frozen
	pressure formula starts with too high a dropout rate for this easier corpus.
	The focused low-decay follow-up can be reproduced with:

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode locked_stream \
	--use-cached-data \
	--cache-dir .cache/dropout_decay_tinystories \
	--output-dir runs/corpus_holdout_tinystories_l12_lowdecay \
	--models L12_H8_D320=12x8x320 \
	--seeds 1 2 3 \
	--stream-token-caps 250000 500000 1000000 2000000 4000000 \
	--dropout-rates \
	--anchor-decays \
	  tinystories_low_decay_014_002:250000=0.140,500000=0.140,1000000=0.100,2000000=0.060,4000000=0.020 \
	  tinystories_low_decay_014_006:250000=0.140,500000=0.140,1000000=0.100,2000000=0.080,4000000=0.060 \
	  tinystories_monotone_oracle:250000=0.140,500000=0.140,1000000=0.140,2000000=0.080,4000000=0.080 \
	  tinystories_low_decay_010_006:250000=0.100,500000=0.100,1000000=0.080,2000000=0.080,4000000=0.060 \
	--stage-steps 1000 \
	--batch-size 16 \
	--block-size 128 \
	--eval-batches 64 \
	--train-eval-batches 32 \
	--trace-eval-batches 8 \
	--log-every 500 \
	--vocab-size 4096 \
	--val-tokens 500000 \
	--lr 0.0003 \
	--weight-decay 0.1 \
	--grad-clip 1.0
	```

	Completed reference result:

	```text
	best low-decay path: 0.14 -> 0.14 -> 0.10 -> 0.06 -> 0.02
	best low-decay final validation: 2.6859 +/- 0.0286
	best static final validation: 2.7009 +/- 0.0216
	paired low-decay wins: 3/3
	```
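
The reference results reported in this document can be compared side by side by computing, for each experiment, the margin between the best static run and the decayed schedule. The sketch below only restates numbers already given above. It treats a lower final validation value as better, as the comparisons in this document imply, and the helper name `margins` is hypothetical.

```python
# (experiment, schedule final validation, best static final validation),
# all copied from the reference results in this document.
REFERENCE_RESULTS = [
    ("L12 model-size validation", 4.4812, 4.5183),
    ("deep/narrow holdout", 4.5286, 4.5564),
    ("width-heavy holdout", 4.4658, 4.4946),
    ("WikiText-103 holdout", 4.0836, 4.1081),
    ("TinyStories frozen formula", 2.7016, 2.7009),
    ("TinyStories low-decay path", 2.6859, 2.7009),
]


def margins(results):
    """Return (experiment, static - schedule); positive means the schedule was lower."""
    return [
        (label, round(static - schedule, 4))
        for label, schedule, static in results
    ]


if __name__ == "__main__":
    for label, margin in margins(REFERENCE_RESULTS):
        print(f"{label:<28} {margin:+.4f}")
```

These are differences of means only. They say nothing about seed-level variance, which is why the paired win counts are reported separately.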

	## Notes for Publication

	- Do not claim the formula is universal.
	- The supported claim is final-validation improvement under this
	expanding-prefix protocol.
	- PDFs are generated artifacts and are ignored by Git.
	- The exact cached token file should be published through an appropriate binary
	artifact mechanism once dataset provenance is finalized.