---
license: mit
language:
- en
tags:
- dropout
- streaming
- language-modeling
- transformer
- mps
- reproducibility
pretty_name: Dropout Decay Streaming Experiments
---

# Dropout Decay Streaming Experiments

This project tests dropout decay only after first finding a model/data regime where static dropout has a real nonzero validation optimum.

The implementation is derived from Andrej Karpathy's `nanochat` repository: https://github.com/karpathy/nanochat. Only the core tokenizer ideas and foundational causal Transformer architecture are retained. Chat interfaces, deployment scripts, distributed training code, and inference services are not included. The original nanochat MIT copyright and permission notice are retained in derived source files and in `LICENSE`.

## Compliance

All Torch experiment runs are MPS-only. The runner exits before model creation if MPS is unavailable, if PyTorch was not built with MPS, or if `PYTORCH_ENABLE_MPS_FALLBACK=1` is set.

## Local Data and Environment

The project should not depend on another checkout of `nanochat` at runtime. Use the project-local package and either:

- `--use-cached-data --cache-dir .cache/dropout_decay` to reuse the local tokenizer and encoded token array; or
- `--corpus` / `--corpus-glob` to build a fresh local cache from a source corpus.

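
Before passing `--use-cached-data`, it can help to confirm that the cache directory actually holds both artifacts. The following is a minimal sketch, not part of the project: the helper names `cache_files` and `missing_cache_files` are hypothetical, the file naming follows the repo-local caches of this project (`tokenizer-v4096.json`, `tokens-v4096-uint16.npy`), and reading `4096` as a vocabulary size and `uint16` as the token dtype is an assumption.

```python
from pathlib import Path


def cache_files(cache_dir, vocab_size=4096, dtype="uint16"):
    """Return the two expected cache paths (naming is assumed from the README)."""
    root = Path(cache_dir)
    return {
        "tokenizer": root / f"tokenizer-v{vocab_size}.json",
        "tokens": root / f"tokens-v{vocab_size}-{dtype}.npy",
    }


def missing_cache_files(cache_dir, vocab_size=4096, dtype="uint16"):
    """List the names of expected cache artifacts that are not present on disk."""
    expected = cache_files(cache_dir, vocab_size, dtype)
    return [name for name, path in expected.items() if not path.is_file()]


if __name__ == "__main__":
    missing = missing_cache_files(".cache/dropout_decay")
    if missing:
        print("Cache incomplete, missing:", ", ".join(missing))
        print("Consider rebuilding with --corpus or --corpus-glob.")
    else:
        print("Cache looks complete; --use-cached-data should find both files.")
```

An empty result only means the files exist; it says nothing about whether the tokenizer and token array were built from the same corpus.
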

The curated repo-local data artifacts are:

- `data/openwebtext10k/base_data_climbmix/shard_*.parquet`
- `data/openwebtext10k/openwebtext10k.txt`
- `data/tinystories/train-00000-of-00004.parquet`
- `data/wikitext103_raw/train-00001-of-00002.parquet`

The existing repo-local caches are:

- `.cache/dropout_decay/tokenizer-v4096.json`
- `.cache/dropout_decay/tokens-v4096-uint16.npy`
- `.cache/dropout_decay_tinystories/tokenizer-v4096.json`
- `.cache/dropout_decay_tinystories/tokens-v4096-uint16.npy`
- `.cache/dropout_decay_wikitext103/tokenizer-v4096.json`
- `.cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy`

Use a project-local Python environment with MPS-capable PyTorch, for example `.venv/bin/python`. Attribution to nanochat remains in the source and docs, but experiment commands should not point into a separate nanochat repository.

## Workflow

1. Screen candidate model sizes with cheap static dropout sweeps.
2. Select candidate models whose validation curve has an interior nonzero dropout optimum.
3. Confirm the winner with a 3-seed static sweep.
4. Lock the model and run static-vs-decay streaming comparisons from scratch.

Every run writes:

- `config.json`: command, model specs, data paths, environment, attribution.
- `metrics.jsonl`: one row per seed/model/dropout/stage.
- `trace.jsonl`: optional training and intermediate evaluation trace.
- `summary.csv` / `summary.json`: mean/std train loss, validation loss, and gap.
- `model_selection.csv` / `model_selection.json`: static-sweep optimum and plateau diagnostics for screen and confirm runs.

Old exploratory outputs are archived under `archive/`.

For exact headline reproduction, see `REPRODUCING.md`.
For the current first-reader hypothesis document, see `docs/current_hypothesis_for_ml_engineers.md`.
For the denser run-by-run research summary, see `docs/dropout_decay_research_report_v2.md`.
For the corpus difficulty follow-up that motivated the next formula refinement, see `docs/corpus_difficulty_probe_20260529.md`.
For the first single-seed probe-calibrated streaming test, see `docs/probe_calibrated_stream_20260529.md`.
For the coefficient refit using only existing saved results, see `docs/coefficient_refit_existing_data_20260529.md`.
For a first-reader report on regime definition and coefficient interpretation, see `docs/regime_coefficient_report_20260529.md`.
For the live coefficient calibration plan, see `docs/coefficient_calibration_plan.md`.
For the current arXiv-style paper draft, see `paper/dropout_decay_pressure_law.tex`.

## Step 1: Cheap Static Screen

Use one or two seeds. The output tells us, for each model, where the static dropout curve bottoms out and which dropout range is within the configured plateau delta.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode screen_static \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models 8x8x256 12x8x384 16x8x384 \
  --seeds 1 2 \
  --token-limits 5000000 \
  --dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
  --steps 2000 \
  --eval-batches 64
```

## Step 2: Confirm Winner

After selecting a promising model, rerun the static dropout curve with exactly three seeds.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode confirm_static \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models winner=12x8x384 \
  --seeds 1 2 3 \
  --token-limits 5000000 \
  --dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
  --steps 2000 \
  --eval-batches 64
```

## Step 3: Locked Streaming Comparison

Only after the model is locked, compare static dropout and decay schedules from fresh initialization.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models winner=12x8x384 \
  --seeds 1 2 3 \
  --stream-token-caps 5000000 10000000 20000000 40000000 \
  --dropout-rates 0.0 0.10 0.14 0.20 \
  --decays decay_030_to_014:0.30:0.14:cosine decay_020_to_010:0.20:0.10:cosine \
  --stage-steps 1000 \
  --eval-batches 64
```
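
Each `--decays` value appears to follow a `name:start:end:shape` pattern. As a rough illustration of what such a spec could mean, the sketch below parses one and evaluates a standard half-cosine interpolation from the start rate to the end rate. This is an assumption, not the runner's actual code: `DecaySpec`, `parse_decay_spec`, and `dropout_at` are hypothetical names, and how `scripts/run_experiments.py` maps steps, stages, or token caps onto schedule progress is not stated here.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DecaySpec:
    """One decay schedule, assumed to be written as name:start:end:shape."""
    name: str
    start: float
    end: float
    shape: str


def parse_decay_spec(spec: str) -> DecaySpec:
    """Split a spec such as 'decay_030_to_014:0.30:0.14:cosine' into fields."""
    name, start, end, shape = spec.split(":")
    return DecaySpec(name, float(start), float(end), shape)


def dropout_at(spec: DecaySpec, progress: float) -> float:
    """Dropout rate at a given progress in [0, 1] (0 = first step, 1 = last).

    Only the 'cosine' shape from the README command is handled; the
    half-cosine formula is the common textbook form, assumed here.
    """
    if spec.shape != "cosine":
        raise ValueError(f"unsupported decay shape: {spec.shape!r}")
    t = min(max(progress, 0.0), 1.0)  # clamp out-of-range progress
    weight = 0.5 * (1.0 + math.cos(math.pi * t))  # 1 at t=0, 0 at t=1
    return spec.end + (spec.start - spec.end) * weight


if __name__ == "__main__":
    spec = parse_decay_spec("decay_030_to_014:0.30:0.14:cosine")
    for step in range(0, 1001, 250):
        print(step, round(dropout_at(spec, step / 1000), 4))
```

Under this reading, `decay_030_to_014` starts at dropout 0.30 and ends at 0.14, so its final rate matches the static `0.14` arm in `--dropout-rates`, and `decay_020_to_010` likewise ends at the static `0.10` arm.
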