Prepare reproducible research package

Files changed (3)

.gitignore +3 -0
README.md +16 -0
REPRODUCING.md +218 -0

.gitignore CHANGED

@@ -3,3 +3,6 @@ __pycache__/
 *.py[cod]
 .cache/
 *.npy
+*.pdf
+.venv/
+*.egg-info/

README.md CHANGED

@@ -1,3 +1,17 @@
+---
+license: mit
+language:
+  - en
+tags:
+  - dropout
+  - streaming
+  - language-modeling
+  - transformer
+  - mps
+  - reproducibility
+pretty_name: Dropout Decay Streaming Experiments
+---
 # Dropout Decay Streaming Experiments
 This project tests dropout decay only after first finding a model/data regime
@@ -53,6 +67,8 @@ Every run writes:
 Old exploratory outputs are archived under `archive/`.
+For exact headline reproduction, see `REPRODUCING.md`.
 ## Step 1: Cheap Static Screen
 Use one or two seeds. The output tells us, for each model, where the static

REPRODUCING.md ADDED

@@ -0,0 +1,218 @@

+# Reproducing the Dropout Decay Experiments
+This repository is intended to be runnable without checking out nanochat. The
+implementation is derived from nanochat and retains its MIT attribution, but
+runtime commands use this package, local cached data, and a local MPS-capable
+Python environment.
+## Requirements
+- macOS with Apple Silicon MPS available.
+- Python 3.10-3.12 recommended for PyTorch MPS wheels.
+- A project-local virtual environment at `.venv`.
+- MPS-capable PyTorch. CPU and CUDA runs are intentionally refused by the
+  runner.
+Create the environment:
+```bash
+python3.11 -m venv .venv
+.venv/bin/python -m pip install --upgrade pip
+.venv/bin/python -m pip install -e .
+```
+Verify MPS:
+```bash
+.venv/bin/python - <<'PY'
+import torch
+print(torch.__version__)
+print(torch.backends.mps.is_built(), torch.backends.mps.is_available())
+PY
+```
+Both booleans must be `True`.
+## Data
+The runner supports two modes:
+- `--use-cached-data --cache-dir .cache/dropout_decay`
+- `--corpus` / `--corpus-glob` to build a cache from raw text or parquet.
+The experiments in the current report used:
+```text
+.cache/dropout_decay/tokenizer-v4096.json
+.cache/dropout_decay/tokens-v4096-uint16.npy
+```
+The cached token file is deliberately ignored by Git until dataset provenance
+and binary hosting are finalized. For exact reproduction, place the two files
+above in `.cache/dropout_decay`. The local cached split used by the completed
+runs contains:
+```text
+train tokens: 5,000,970
+validation tokens: 500,000
+vocab size: 4,096
+```
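Because the two cache files are not tracked by Git, one way to confirm that a local copy matches another copy is to compare checksums. The sketch below is a suggestion and not part of the package: `sha256_of` and `cache_fingerprints` are hypothetical helper names, and no reference digests are published here, so the output is only useful for comparing two copies against each other.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_fingerprints(
    cache_dir=".cache/dropout_decay",
    names=("tokenizer-v4096.json", "tokens-v4096-uint16.npy"),
):
    """Map each expected cache file name to its digest, or None if it is missing."""
    root = Path(cache_dir)
    return {
        name: sha256_of(root / name) if (root / name).is_file() else None
        for name in names
    }


if __name__ == "__main__":
    for name, digest in cache_fingerprints().items():
        print(name, digest or "MISSING")
```

A `None` entry means the file is absent from the cache directory, in which case the `--use-cached-data` commands in this document should not be expected to work.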
+## Smoke Test
+This verifies cached-data loading without running a Torch experiment:
+```bash
+PYTHONPATH=src .venv/bin/python - <<'PY'
+from pathlib import Path
+from dropout_decay.data import load_cached_splits
+tok, splits = load_cached_splits(
+    cache_dir=Path(".cache/dropout_decay"),
+    vocab_size=4096,
+    max_required_train_tokens=4_000_000,
+    val_tokens=500_000,
+    allow_short_corpus=False,
+)
+print(tok.vocab_size)
+print(len(splits.train), len(splits.val))
+PY
+```
+Expected:
+```text
+4096
+5000970 500000
+```
+## Headline Formula
+The tested formula is:
+```text
+p = clamp(0.02, 0.65,
+          0.154 * log10(params / unique_tokens)
+        + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
+        - 0.210)
+```
+For the standard protocol:
+- stream prefixes: `250000 500000 1000000 2000000 4000000`
+- stage steps: `1000`
+- batch size: `16`
+- block size: `128`
+- cumulative sampled tokens after stage `i` (counting from 1): `i * 1000 * 16 * 128 = i * 2,048,000`
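The formula and the standard-protocol values above can be evaluated with a few lines of Python. This is a minimal sketch, not the runner's code. It assumes that `unique_tokens` at stage `i` equals the `i`-th stream prefix and that stages are counted from 1. The function names and the `params` placeholder are hypothetical, since the parameter counts of the models are not stated here.

```python
import math


def pressure_dropout(params, unique_tokens, cumulative_sampled_tokens,
                     lo=0.02, hi=0.65):
    """Evaluate the headline formula and clamp the result to [lo, hi]."""
    raw = (0.154 * math.log10(params / unique_tokens)
           + 0.249 * math.log10(cumulative_sampled_tokens / unique_tokens)
           - 0.210)
    return min(hi, max(lo, raw))


def standard_schedule(params,
                      prefixes=(250_000, 500_000, 1_000_000, 2_000_000, 4_000_000),
                      stage_steps=1000, batch_size=16, block_size=128):
    """Return {stream prefix: dropout rate} for the standard protocol.

    Assumption: the unique-token count at stage i is the i-th prefix.
    """
    tokens_per_stage = stage_steps * batch_size * block_size  # 2,048,000
    return {
        prefix: pressure_dropout(params, prefix, i * tokens_per_stage)
        for i, prefix in enumerate(prefixes, start=1)
    }


if __name__ == "__main__":
    # Placeholder parameter count, not the size of any model in this report.
    for prefix, rate in standard_schedule(params=10_000_000).items():
        print(f"{prefix:>9d} -> p = {rate:.3f}")
```

Passing the real parameter count of a model should give per-stage rates comparable to the values listed in that model's `--anchor-decays` argument, provided the assumptions above match what the runner does.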
+## Reproduce Model-Size Validation
+Example command for the `L12_H8_D320` model:
+```bash
+PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
+  --mode locked_stream \
+  --use-cached-data \
+  --cache-dir .cache/dropout_decay \
+  --output-dir runs/reproduce_l12_formula \
+  --models L12_H8_D320=12x8x320 \
+  --seeds 1 2 3 \
+  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
+  --dropout-rates 0.09 0.14 0.18 0.20 0.26 0.30 \
+  --anchor-decays pressure_formula_l12:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
+  --stage-steps 1000 \
+  --batch-size 16 \
+  --block-size 128 \
+  --eval-batches 64 \
+  --train-eval-batches 32 \
+  --trace-eval-batches 8 \
+  --log-every 500 \
+  --vocab-size 4096 \
+  --val-tokens 500000 \
+  --lr 0.0003 \
+  --weight-decay 0.1 \
+  --grad-clip 1.0
+```
+Completed reference result:
+```text
+pressure formula final validation: 4.4812 +/- 0.0062
+best static final validation:      4.5183
+```
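The `--anchor-decays` value in the command above packs a schedule name and a list of per-prefix dropout rates into one string. The sketch below shows one way such a string could be split for inspection or plotting. It assumes the `name:cap=rate,cap=rate,...` layout seen in the command. `parse_anchor_decay` is a hypothetical helper, and the runner's own parser may behave differently.

```python
def parse_anchor_decay(spec):
    """Split 'name:cap=rate,cap=rate,...' into (name, {cap: rate}).

    Assumed layout, inferred from the example commands only.
    """
    name, sep, body = spec.partition(":")
    if not sep or not name or not body:
        raise ValueError(f"expected 'name:cap=rate,...', got {spec!r}")
    schedule = {}
    for item in body.split(","):
        cap, eq, rate = item.partition("=")
        if not eq:
            raise ValueError(f"expected 'cap=rate', got {item!r}")
        schedule[int(cap)] = float(rate)  # token cap -> dropout rate
    return name, schedule


# The L12 schedule string copied from the command above.
name, schedule = parse_anchor_decay(
    "pressure_formula_l12:250000=0.300,500000=0.260,"
    "1000000=0.180,2000000=0.090,4000000=0.020"
)
print(name, schedule)
```

The parsed caps can then be checked against the `--stream-token-caps` list to confirm that every stage has a rate.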
+## Reproduce Architecture-Shape Holdout
+Deep/narrow holdout:
+```bash
+PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
+  --mode locked_stream \
+  --use-cached-data \
+  --cache-dir .cache/dropout_decay \
+  --output-dir runs/reproduce_arch_deep_narrow \
+  --models deep_narrow_L18_H8_D256=18x8x256 \
+  --seeds 1 2 3 \
+  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
+  --dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
+  --anchor-decays formula_deep_narrow_l18_h8:250000=0.297,500000=0.250,1000000=0.173,2000000=0.083,4000000=0.020 \
+  --stage-steps 1000 \
+  --batch-size 16 \
+  --block-size 128 \
+  --eval-batches 64 \
+  --train-eval-batches 32 \
+  --trace-eval-batches 8 \
+  --log-every 500 \
+  --vocab-size 4096 \
+  --val-tokens 500000 \
+  --lr 0.0003 \
+  --weight-decay 0.1 \
+  --grad-clip 1.0
+```
+Completed reference result:
+```text
+formula final validation:     4.5286 +/- 0.0118
+best static final validation: 4.5564 +/- 0.0127
+```
+## Next Unrun Holdout
+The next planned holdout is the width-heavy architecture test:
+```bash
+PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
+  --mode locked_stream \
+  --use-cached-data \
+  --cache-dir .cache/dropout_decay \
+  --output-dir runs/architecture_shape_holdout_wide_h8 \
+  --models wide_L8_H8_D384=8x8x384 \
+  --seeds 1 2 3 \
+  --stream-token-caps 250000 500000 1000000 2000000 4000000 \
+  --dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
+  --anchor-decays formula_wide_l8_h8:250000=0.301,500000=0.254,1000000=0.177,2000000=0.087,4000000=0.020 \
+  --stage-steps 1000 \
+  --batch-size 16 \
+  --block-size 128 \
+  --eval-batches 64 \
+  --train-eval-batches 32 \
+  --trace-eval-batches 8 \
+  --log-every 500 \
+  --vocab-size 4096 \
+  --val-tokens 500000 \
+  --lr 0.0003 \
+  --weight-decay 0.1 \
+  --grad-clip 1.0
+```
+Expected runtime on the current MPS setup is about 2.5-3.5 hours.
+## Notes for Publication
+- Do not claim the formula is universal.
+- The supported claim is final-validation improvement under this
+  expanding-prefix protocol.
+- PDFs are generated artifacts and are ignored by Git.
+- The exact cached token file should be published through an appropriate binary
+  artifact mechanism once dataset provenance is finalized.