---
license: mit
language:
- en
tags:
- dropout
- streaming
- language-modeling
- transformer
- mps
- reproducibility
pretty_name: Dropout Decay Streaming Experiments
---

# Dropout Decay Streaming Experiments

This project tests dropout decay only after first finding a model/data regime
where static dropout has a real nonzero validation optimum.

The implementation is derived from Andrej Karpathy's `nanochat` repository:
https://github.com/karpathy/nanochat. Only the core tokenizer ideas and
foundational causal Transformer architecture are retained. Chat interfaces,
deployment scripts, distributed training code, and inference services are not
included. The original nanochat MIT copyright and permission notice are retained
in derived source files and in `LICENSE`.

## Compliance

All Torch experiment runs are MPS-only. The runner exits before model creation if
MPS is unavailable, if PyTorch was not built with MPS, or if
`PYTORCH_ENABLE_MPS_FALLBACK=1` is set.
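
A guard of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the runner's actual code: the names `mps_refusal_reason` and `require_mps` are hypothetical, while `torch.backends.mps.is_built()` and `torch.backends.mps.is_available()` are standard PyTorch calls.

```python
import os


def mps_refusal_reason(is_built, is_available, env=None):
    """Return why an MPS-only run must abort, or None if it may proceed.

    Hypothetical helper: it mirrors the three exit conditions listed above.
    """
    env = os.environ if env is None else env
    if env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1":
        return "PYTORCH_ENABLE_MPS_FALLBACK=1 is set"
    if not is_built:
        return "this PyTorch build has no MPS support"
    if not is_available:
        return "MPS is not available on this machine"
    return None


def require_mps():
    """Exit before any model is created if the MPS-only policy is violated."""
    import torch  # imported lazily so the pure check above needs no PyTorch

    reason = mps_refusal_reason(
        torch.backends.mps.is_built(),
        torch.backends.mps.is_available(),
    )
    if reason is not None:
        raise SystemExit(f"Refusing to run: {reason}")
```

Keeping the decision logic separate from the PyTorch query makes the policy easy to unit-test on machines without MPS.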

## Local Data and Environment

The project should not depend on another checkout of `nanochat` at runtime. Use
the project-local package and either:

- `--use-cached-data --cache-dir .cache/dropout_decay` to reuse the local
  tokenizer and encoded token array; or
- `--corpus` / `--corpus-glob` to build a fresh local cache from a source corpus.

The curated repo-local data artifacts are:

- `data/openwebtext10k/base_data_climbmix/shard_*.parquet`
- `data/openwebtext10k/openwebtext10k.txt`
- `data/tinystories/train-00000-of-00004.parquet`
- `data/wikitext103_raw/train-00001-of-00002.parquet`

The existing repo-local caches are:

- `.cache/dropout_decay/tokenizer-v4096.json`
- `.cache/dropout_decay/tokens-v4096-uint16.npy`
- `.cache/dropout_decay_tinystories/tokenizer-v4096.json`
- `.cache/dropout_decay_tinystories/tokens-v4096-uint16.npy`
- `.cache/dropout_decay_wikitext103/tokenizer-v4096.json`
- `.cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy`
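
Before passing `--use-cached-data`, it can help to confirm that a cache directory holds both files. The sketch below only encodes the file-naming pattern visible in the list above. The helper names `cache_paths` and `missing_cache_files` are hypothetical, and the project's own loader may locate these files differently.

```python
from pathlib import Path


def cache_paths(cache_dir, vocab_size=4096, dtype="uint16"):
    """Build the expected tokenizer and token-array paths for one cache.

    Assumption: names follow the pattern seen in the listed caches,
    tokenizer-v<vocab>.json and tokens-v<vocab>-<dtype>.npy.
    """
    root = Path(cache_dir)
    return {
        "tokenizer": root / f"tokenizer-v{vocab_size}.json",
        "tokens": root / f"tokens-v{vocab_size}-{dtype}.npy",
    }


def missing_cache_files(cache_dir, vocab_size=4096, dtype="uint16"):
    """Return the expected cache files that do not exist on disk."""
    paths = cache_paths(cache_dir, vocab_size, dtype)
    return [path for path in paths.values() if not path.exists()]


if __name__ == "__main__":
    for cache in (
        ".cache/dropout_decay",
        ".cache/dropout_decay_tinystories",
        ".cache/dropout_decay_wikitext103",
    ):
        missing = missing_cache_files(cache)
        status = "ok" if not missing else f"missing {[p.name for p in missing]}"
        print(f"{cache}: {status}")
```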

Use a project-local Python environment with MPS-capable PyTorch, for example
`.venv/bin/python`. Attribution to nanochat remains in the source and docs, but
experiment commands should not point into a separate nanochat repository.

## Workflow

1. Screen candidate model sizes with cheap static dropout sweeps.
2. Select candidate models whose validation curve has an interior nonzero
   dropout optimum.
3. Confirm the winner with a 3-seed static sweep.
4. Lock the model and run static-vs-decay streaming comparisons from scratch.

Every run writes:

- `config.json`: command, model specs, data paths, environment, attribution.
- `metrics.jsonl`: one row per seed/model/dropout/stage.
- `trace.jsonl`: optional training and intermediate evaluation trace.
- `summary.csv` / `summary.json`: mean/std train loss, validation loss, and gap.
- `model_selection.csv` / `model_selection.json`: static-sweep optimum and
  plateau diagnostics for screen and confirm runs.
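
As a rough illustration of how the summary files relate to `metrics.jsonl`, the sketch below aggregates per-seed rows into mean/std values. The field names `dropout`, `train_loss`, and `val_loss` are assumptions, as is the definition of the gap as validation loss minus train loss. The real schema may differ, and a full summary would also group by model and stage.

```python
import json
import statistics
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def mean_std(values):
    """Mean and sample standard deviation (0.0 when only one value exists)."""
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return statistics.mean(values), std


def summarize(rows):
    """Group rows by dropout rate and aggregate losses across seeds."""
    by_dropout = defaultdict(list)
    for row in rows:
        by_dropout[row["dropout"]].append(row)  # hypothetical field name

    summary = {}
    for dropout in sorted(by_dropout):
        group = by_dropout[dropout]
        train = [r["train_loss"] for r in group]  # hypothetical field name
        val = [r["val_loss"] for r in group]  # hypothetical field name
        gap = [v - t for t, v in zip(train, val)]  # assumed: val minus train
        summary[dropout] = {
            "train_loss": mean_std(train),
            "val_loss": mean_std(val),
            "gap": mean_std(gap),
        }
    return summary
```

Calling `summarize(load_jsonl("metrics.jsonl"))` on a run directory would return one `(mean, std)` pair per metric for each dropout rate, provided the rows carry the assumed fields.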

Old exploratory outputs are archived under `archive/`.

Related documents:

- `REPRODUCING.md`: exact headline reproduction.
- `docs/current_hypothesis_for_ml_engineers.md`: the current first-reader
  hypothesis document.
- `docs/dropout_decay_research_report_v2.md`: the denser run-by-run research
  summary.
- `docs/corpus_difficulty_probe_20260529.md`: the corpus difficulty follow-up
  that motivated the next formula refinement.
- `docs/probe_calibrated_stream_20260529.md`: the first single-seed
  probe-calibrated streaming test.
- `docs/coefficient_refit_existing_data_20260529.md`: the coefficient refit
  using only existing saved results.
- `docs/regime_coefficient_report_20260529.md`: a first-reader report on regime
  definition and coefficient interpretation.
- `docs/coefficient_calibration_plan.md`: the live coefficient calibration plan.
- `paper/dropout_decay_pressure_law.tex`: the current arXiv-style paper draft.

## Step 1: Cheap Static Screen

Use one or two seeds. The output tells us, for each model, where the static
dropout curve bottoms out and which dropout range is within the configured
plateau delta.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode screen_static \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models 8x8x256 12x8x384 16x8x384 \
  --seeds 1 2 \
  --token-limits 5000000 \
  --dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
  --steps 2000 \
  --eval-batches 64
```
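
The screening diagnostics described above (where the curve bottoms out, and which rates fall within the plateau delta) can be sketched as follows. This is an illustration under assumptions, not the project's selection code: the function name `static_sweep_diagnostics`, the default `plateau_delta`, and the returned keys are all hypothetical.

```python
def static_sweep_diagnostics(val_loss_by_dropout, plateau_delta=0.01):
    """Locate the static-dropout optimum and the rates on its plateau.

    val_loss_by_dropout maps dropout rate -> mean validation loss.
    plateau_delta is the tolerated loss excess over the best rate
    (the default here is an arbitrary placeholder).
    """
    rates = sorted(val_loss_by_dropout)
    best_rate = min(rates, key=lambda rate: val_loss_by_dropout[rate])
    best_loss = val_loss_by_dropout[best_rate]

    # Rates whose loss is within plateau_delta of the optimum.
    plateau = [
        rate for rate in rates
        if val_loss_by_dropout[rate] - best_loss <= plateau_delta
    ]

    # "Interior nonzero" means the optimum is neither the smallest nor the
    # largest swept rate, and is not zero dropout.
    interior = best_rate not in (rates[0], rates[-1])
    return {
        "best_rate": best_rate,
        "best_loss": best_loss,
        "plateau_rates": plateau,
        "interior_nonzero_optimum": interior and best_rate > 0.0,
    }
```

The plateau is returned as a list of rates instead of a `(min, max)` pair because, with noisy curves, the qualifying rates need not be contiguous.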

## Step 2: Confirm Winner

After selecting a promising model, rerun the static dropout curve with exactly
three seeds.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode confirm_static \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models winner=12x8x384 \
  --seeds 1 2 3 \
  --token-limits 5000000 \
  --dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
  --steps 2000 \
  --eval-batches 64
```

## Step 3: Locked Streaming Comparison

Only after the model is locked, compare static dropout and decay schedules from
fresh initialization.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models winner=12x8x384 \
  --seeds 1 2 3 \
  --stream-token-caps 5000000 10000000 20000000 40000000 \
  --dropout-rates 0.0 0.10 0.14 0.20 \
  --decays decay_030_to_014:0.30:0.14:cosine decay_020_to_010:0.20:0.10:cosine \
  --stage-steps 1000 \
  --eval-batches 64
```
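
To make the `--decays` entries concrete, here is a minimal sketch of a cosine dropout schedule. It assumes each spec reads as `name:start:end:shape` and that the rate follows a half-cosine from `start` to `end` over training progress. Both the parsing and the exact curve are assumptions, and the names `parse_decay_spec` and `dropout_at` are hypothetical. The runner's own schedule may differ.

```python
import math


def parse_decay_spec(spec):
    """Split 'name:start:end:shape' into typed fields (assumed format)."""
    name, start, end, shape = spec.split(":")
    return name, float(start), float(end), shape


def dropout_at(progress, start, end, shape="cosine"):
    """Dropout rate at a training progress value in [0, 1].

    Cosine decay: equals `start` at progress 0 and `end` at progress 1,
    changing slowly near both ends and fastest in the middle.
    """
    if shape != "cosine":
        raise ValueError(f"unsupported decay shape: {shape}")
    t = min(max(progress, 0.0), 1.0)  # clamp progress into [0, 1]
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t))


if __name__ == "__main__":
    name, start, end, shape = parse_decay_spec("decay_030_to_014:0.30:0.14:cosine")
    for step in range(0, 1001, 250):
        rate = dropout_at(step / 1000, start, end, shape)
        print(f"{name} step={step} dropout={rate:.4f}")
```

Under this reading, `decay_030_to_014` starts at a dropout rate of 0.30, passes through 0.22 at the midpoint, and ends at 0.14.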