Mandeep Sidhu

	---
	license: mit
	language:
	- en
	tags:
	- dropout
	- streaming
	- language-modeling
	- transformer
	- mps
	- reproducibility
	pretty_name: Dropout Decay Streaming Experiments
	---

	# Dropout Decay Streaming Experiments

	This project tests dropout decay schedules only after first finding a
	model/data regime in which static dropout has a genuine validation-loss optimum
	at a nonzero rate.

	The implementation is derived from Andrej Karpathy's `nanochat` repository:
	https://github.com/karpathy/nanochat. Only the core tokenizer ideas and
	foundational causal Transformer architecture are retained. Chat interfaces,
	deployment scripts, distributed training code, and inference services are not
	included. The original nanochat MIT copyright and permission notice are retained
	in derived source files and in `LICENSE`.

	## Compliance

	All Torch experiment runs are MPS-only. The runner exits before model creation if
	MPS is unavailable, if PyTorch was not built with MPS, or if
	`PYTORCH_ENABLE_MPS_FALLBACK=1` is set.
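
The three exit conditions above can be sketched as a small guard. This is a minimal illustration, not the runner's actual code: the function names `mps_violation` and `require_mps` are hypothetical, and only `torch.backends.mps.is_built()` and `torch.backends.mps.is_available()` are real PyTorch calls.

```python
import os
from typing import Mapping, Optional


def mps_violation(is_built: bool, is_available: bool,
                  env: Mapping[str, str]) -> Optional[str]:
    """Return a reason string if the MPS-only policy is violated, else None."""
    # CPU fallback would silently run unsupported ops off-device, so reject it.
    if env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1":
        return "PYTORCH_ENABLE_MPS_FALLBACK=1 is set"
    if not is_built:
        return "PyTorch was not built with MPS"
    if not is_available:
        return "MPS is unavailable"
    return None


def require_mps() -> None:
    """Exit before any model is created unless the run is strictly MPS-only."""
    import torch  # imported lazily so the pure check above needs no PyTorch

    reason = mps_violation(
        torch.backends.mps.is_built(),
        torch.backends.mps.is_available(),
        os.environ,
    )
    if reason is not None:
        raise SystemExit(f"MPS-only policy violated: {reason}")
```

Keeping the decision in a pure function makes the policy easy to unit test on machines without an Apple GPU; `require_mps()` would be called once at startup, before any model is built.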

	## Local Data and Environment

	The project should not depend on another checkout of `nanochat` at runtime. Use
	the project-local package and either:

	- `--use-cached-data --cache-dir .cache/dropout_decay` to reuse the local
	tokenizer and encoded token array; or
	- `--corpus` / `--corpus-glob` to build a fresh local cache from a source corpus.

	The curated repo-local data artifacts are:

	- `data/openwebtext10k/base_data_climbmix/shard_*.parquet`
	- `data/openwebtext10k/openwebtext10k.txt`
	- `data/tinystories/train-00000-of-00004.parquet`
	- `data/wikitext103_raw/train-00001-of-00002.parquet`

	The existing repo-local caches are:

	- `.cache/dropout_decay/tokenizer-v4096.json`
	- `.cache/dropout_decay/tokens-v4096-uint16.npy`
	- `.cache/dropout_decay_tinystories/tokenizer-v4096.json`
	- `.cache/dropout_decay_tinystories/tokens-v4096-uint16.npy`
	- `.cache/dropout_decay_wikitext103/tokenizer-v4096.json`
	- `.cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy`
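
The cache listing above follows a regular naming pattern, so a pre-flight check can be sketched as below. The helper names are hypothetical, and reading `v4096` as the vocabulary size is an assumption inferred from the file names, not something the runner is known to expose.

```python
from pathlib import Path
from typing import List, Tuple


def cache_paths(cache_dir: str, vocab_size: int = 4096) -> Tuple[Path, Path]:
    """Build the expected tokenizer and token-array paths for one cache dir."""
    root = Path(cache_dir)
    # Assumption: the "v4096" component encodes the tokenizer vocabulary size.
    tokenizer = root / f"tokenizer-v{vocab_size}.json"
    tokens = root / f"tokens-v{vocab_size}-uint16.npy"
    return tokenizer, tokens


def missing_cache_files(cache_dir: str, vocab_size: int = 4096) -> List[str]:
    """List the expected cache files that are not present on disk."""
    return [str(p) for p in cache_paths(cache_dir, vocab_size) if not p.exists()]


if __name__ == "__main__":
    for name in ("dropout_decay", "dropout_decay_tinystories",
                 "dropout_decay_wikitext103"):
        print(name, missing_cache_files(f".cache/{name}") or "ok")
```

A check like this lets a `--use-cached-data` run fail early with a clear message instead of partway through setup.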

	Use a project-local Python environment with MPS-capable PyTorch, for example
	`.venv/bin/python`. Attribution to nanochat remains in the source and docs, but
	experiment commands should not point into a separate nanochat repository.

	## Workflow

	1. Screen candidate model sizes with cheap static dropout sweeps.
	2. Select candidate models whose validation curve has an interior nonzero
	dropout optimum.
	3. Confirm the winner with a 3-seed static sweep.
	4. Lock the model and run static-vs-decay streaming comparisons from scratch.
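
The selection rule in step 2 can be sketched as follows. This is an illustrative reading of "interior nonzero optimum": the function name is hypothetical and the losses in the usage example are made-up numbers, not results from this project.

```python
from typing import Dict, Optional


def interior_optimum(val_loss_by_rate: Dict[float, float]) -> Optional[float]:
    """Return the best dropout rate if it is nonzero and strictly inside
    the swept range; return None when the minimum sits on an edge."""
    rates = sorted(val_loss_by_rate)
    if len(rates) < 3:
        return None  # an interior point needs a neighbor on each side
    best = min(rates, key=lambda r: val_loss_by_rate[r])
    if best == rates[0] or best == rates[-1] or best <= 0.0:
        return None
    return best


if __name__ == "__main__":
    # Hypothetical mean validation losses, for illustration only.
    u_shaped = {0.0: 3.50, 0.05: 3.42, 0.10: 3.40, 0.20: 3.44, 0.50: 3.70}
    monotone = {0.0: 3.40, 0.10: 3.45, 0.20: 3.50}
    print(interior_optimum(u_shaped))   # minimum at an interior nonzero rate
    print(interior_optimum(monotone))   # minimum at 0.0, so no candidate
```

A model whose curve is monotone in dropout gives no regime in which a decay schedule could plausibly help, which is why it would be screened out here.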

	Every run writes:

	- `config.json`: command, model specs, data paths, environment, attribution.
	- `metrics.jsonl`: one row per seed/model/dropout/stage.
	- `trace.jsonl`: optional training and intermediate evaluation trace.
	- `summary.csv` / `summary.json`: mean/std train loss, validation loss, and gap.
	- `model_selection.csv` / `model_selection.json`: static-sweep optimum and
	plateau diagnostics for screen and confirm runs.
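
As a rough sketch of how `metrics.jsonl` rows could be reduced to the summary statistics listed above: the field names (`model`, `dropout`, `train_loss`, `val_loss`) and the definition of the gap as validation loss minus train loss are assumptions for illustration; the actual schema is whatever the runner writes.

```python
import json
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize(lines: Iterable[str]) -> Dict[Tuple[str, float], Dict[str, float]]:
    """Group JSONL rows by (model, dropout) and aggregate across seeds."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the file
        row = json.loads(line)
        groups[(row["model"], row["dropout"])].append(row)

    summary = {}
    for key, rows in groups.items():
        train = [r["train_loss"] for r in rows]
        val = [r["val_loss"] for r in rows]
        # Assumed definition: generalization gap = validation loss - train loss.
        gap = [v - t for t, v in zip(train, val)]
        summary[key] = {
            "n_seeds": len(rows),
            "train_mean": statistics.fmean(train),
            "val_mean": statistics.fmean(val),
            # Sample std is undefined for a single seed; report 0.0 instead.
            "val_std": statistics.stdev(val) if len(val) > 1 else 0.0,
            "gap_mean": statistics.fmean(gap),
        }
    return summary
```

Usage would be along the lines of `summarize(open("metrics.jsonl"))`, with the run directory path supplied by the caller.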

	Old exploratory outputs are archived under `archive/`.

	Related documents:

	- `REPRODUCING.md`: exact headline reproduction.
	- `docs/current_hypothesis_for_ml_engineers.md`: the current first-reader
	hypothesis document.
	- `docs/dropout_decay_research_report_v2.md`: the denser run-by-run research
	summary.
	- `docs/corpus_difficulty_probe_20260529.md`: the corpus difficulty follow-up
	that motivated the next formula refinement.
	- `docs/probe_calibrated_stream_20260529.md`: the first single-seed
	probe-calibrated streaming test.
	- `docs/coefficient_refit_existing_data_20260529.md`: the coefficient refit
	using only existing saved results.
	- `docs/regime_coefficient_report_20260529.md`: a first-reader report on regime
	definition and coefficient interpretation.
	- `docs/coefficient_calibration_plan.md`: the live coefficient calibration
	plan.
	- `paper/dropout_decay_pressure_law.tex`: the current arXiv-style paper draft.

	## Step 1: Cheap Static Screen

	Use one or two seeds. For each model, the output reports the dropout rate at
	which the static validation curve bottoms out and the range of rates whose loss
	falls within the configured plateau delta of that minimum.

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode screen_static \
	--use-cached-data \
	--cache-dir .cache/dropout_decay \
	--models 8x8x256 12x8x384 16x8x384 \
	--seeds 1 2 \
	--token-limits 5000000 \
	--dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
	--steps 2000 \
	--eval-batches 64
	```
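
The plateau diagnostic for a screen run can be sketched as below, assuming the plateau is the set of swept rates whose mean validation loss lies within `delta` of the minimum. The function name and the sample losses are hypothetical.

```python
from typing import Dict, List


def plateau_rates(val_loss_by_rate: Dict[float, float], delta: float) -> List[float]:
    """Return the sorted dropout rates whose loss is within delta of the best."""
    best = min(val_loss_by_rate.values())
    return sorted(r for r, loss in val_loss_by_rate.items() if loss - best <= delta)


if __name__ == "__main__":
    # Hypothetical screen results for one model, for illustration only.
    curve = {0.0: 3.50, 0.05: 3.42, 0.10: 3.40, 0.14: 3.41, 0.20: 3.44}
    print(plateau_rates(curve, delta=0.015))
```

A wide plateau suggests the optimum is poorly resolved at the screening budget, which is a reason to add seeds before treating any single rate as the winner.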

	## Step 2: Confirm Winner

	After selecting a promising model, rerun the static dropout curve with exactly
	three seeds.

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode confirm_static \
	--use-cached-data \
	--cache-dir .cache/dropout_decay \
	--models winner=12x8x384 \
	--seeds 1 2 3 \
	--token-limits 5000000 \
	--dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
	--steps 2000 \
	--eval-batches 64
	```
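
The `--models` values above appear in two forms, a bare shape (`12x8x384`) and a labeled one (`winner=12x8x384`). A parser for that syntax might look like the sketch below. The names are hypothetical, and because this README does not say what the three numbers mean, the sketch keeps them as an unlabeled tuple.

```python
from typing import NamedTuple, Tuple


class ModelSpec(NamedTuple):
    label: str
    dims: Tuple[int, ...]


def parse_model_spec(text: str) -> ModelSpec:
    """Parse 'label=AxBxC' or a bare 'AxBxC' into a label and three integers."""
    label, sep, shape = text.partition("=")
    if not sep:
        shape = label  # no explicit label: reuse the shape string as the label
    try:
        dims = tuple(int(part) for part in shape.split("x"))
    except ValueError:
        raise ValueError(f"non-integer field in model spec: {text!r}") from None
    if len(dims) != 3 or any(d <= 0 for d in dims):
        raise ValueError(f"expected three positive integers in: {text!r}")
    return ModelSpec(label, dims)


if __name__ == "__main__":
    print(parse_model_spec("winner=12x8x384"))
    print(parse_model_spec("8x8x256"))
```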

	## Step 3: Locked Streaming Comparison

	Only after the model is locked, compare static dropout and decay schedules from
	fresh initialization.

	```bash
	PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
	--mode locked_stream \
	--use-cached-data \
	--cache-dir .cache/dropout_decay \
	--models winner=12x8x384 \
	--seeds 1 2 3 \
	--stream-token-caps 5000000 10000000 20000000 40000000 \
	--dropout-rates 0.0 0.10 0.14 0.20 \
	--decays decay_030_to_014:0.30:0.14:cosine decay_020_to_010:0.20:0.10:cosine \
	--stage-steps 1000 \
	--eval-batches 64
	```
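
Each `--decays` value above has the shape `name:start:end:shape`. The sketch below parses that string and evaluates a standard half-cosine interpolation from `start` to `end`. The exact curve used by the runner is not shown in this README, so treat the formula, and the names `DecaySpec`, `parse_decay_spec`, and `cosine_dropout`, as assumptions.

```python
import math
from typing import NamedTuple


class DecaySpec(NamedTuple):
    name: str
    start: float
    end: float
    shape: str


def parse_decay_spec(text: str) -> DecaySpec:
    """Parse 'name:start:end:shape', e.g. 'decay_030_to_014:0.30:0.14:cosine'."""
    name, start, end, shape = text.split(":")
    return DecaySpec(name, float(start), float(end), shape)


def cosine_dropout(spec: DecaySpec, step: int, total_steps: int) -> float:
    """Dropout rate at `step`, moving from spec.start to spec.end on a half cosine."""
    # Clamp progress to [0, 1] so steps past the end hold the final rate.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return spec.end + 0.5 * (spec.start - spec.end) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    spec = parse_decay_spec("decay_030_to_014:0.30:0.14:cosine")
    for step in (0, 250, 500, 750, 1000):
        print(step, round(cosine_dropout(spec, step, total_steps=1000), 4))
```

Under this form the rate starts at 0.30, passes through the midpoint 0.22 halfway through, and reaches 0.14 at the final step, changing most slowly near both ends.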