---
license: mit
language:
  - en
tags:
  - dropout
  - streaming
  - language-modeling
  - transformer
  - mps
  - reproducibility
pretty_name: Dropout Decay Streaming Experiments
---

# Dropout Decay Streaming Experiments

This project tests dropout decay only in a model/data regime where static
dropout has already been shown to have a genuine validation-loss optimum at a
nonzero dropout rate.

The implementation is derived from Andrej Karpathy's `nanochat` repository:
https://github.com/karpathy/nanochat. Only the core tokenizer ideas and
foundational causal Transformer architecture are retained. Chat interfaces,
deployment scripts, distributed training code, and inference services are not
included. The original nanochat MIT copyright and permission notice are retained
in derived source files and in `LICENSE`.

## Compliance

All Torch experiment runs are MPS-only. The runner exits before model creation if
MPS is unavailable, if PyTorch was not built with MPS, or if
`PYTORCH_ENABLE_MPS_FALLBACK=1` is set.
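
The three exit conditions above can be sketched as a small preflight check. This is a minimal illustration, not the runner's actual code: the function names `mps_violations` and `require_mps` are hypothetical. Only `torch.backends.mps.is_built()`, `torch.backends.mps.is_available()`, and the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable come from PyTorch itself.

```python
import os
import sys


def mps_violations(is_built, is_available, env):
    """Return the reasons the MPS-only policy is violated (empty list if none).

    Hypothetical helper: kept free of torch imports so it can be unit tested.
    """
    problems = []
    if not is_built:
        problems.append("PyTorch was not built with MPS support")
    elif not is_available:
        problems.append("MPS is not available on this machine")
    if env.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1":
        problems.append("PYTORCH_ENABLE_MPS_FALLBACK=1 is set")
    return problems


def require_mps():
    """Exit before any model is created unless the run is strictly MPS-only."""
    import torch  # imported lazily so mps_violations stays testable without torch

    problems = mps_violations(
        torch.backends.mps.is_built(),
        torch.backends.mps.is_available(),
        os.environ,
    )
    if problems:
        sys.exit("MPS-only policy violated: " + "; ".join(problems))
    return torch.device("mps")
```

Calling `require_mps()` at the top of an entry point, before constructing the model, would reproduce the "exit before model creation" behavior described above.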

## Local Data and Environment

The project should not depend on another checkout of `nanochat` at runtime. Use
the project-local package and either:

- `--use-cached-data --cache-dir .cache/dropout_decay` to reuse the local
  tokenizer and encoded token array; or
- `--corpus` / `--corpus-glob` to build a fresh local cache from a source corpus.

The curated repo-local data artifacts are:

- `data/openwebtext10k/base_data_climbmix/shard_*.parquet`
- `data/openwebtext10k/openwebtext10k.txt`
- `data/tinystories/train-00000-of-00004.parquet`
- `data/wikitext103_raw/train-00001-of-00002.parquet`

The existing repo-local caches are:

- `.cache/dropout_decay/tokenizer-v4096.json`
- `.cache/dropout_decay/tokens-v4096-uint16.npy`
- `.cache/dropout_decay_tinystories/tokenizer-v4096.json`
- `.cache/dropout_decay_tinystories/tokens-v4096-uint16.npy`
- `.cache/dropout_decay_wikitext103/tokenizer-v4096.json`
- `.cache/dropout_decay_wikitext103/tokens-v4096-uint16.npy`

Use a project-local Python environment with MPS-capable PyTorch, for example
`.venv/bin/python`. Attribution to nanochat remains in the source and docs, but
experiment commands should not point into a separate nanochat repository.

## Workflow

1. Screen candidate model sizes with cheap static dropout sweeps.
2. Select candidate models whose validation curve has an interior nonzero
   dropout optimum.
3. Confirm the winner with a 3-seed static sweep.
4. Lock the model and run static-vs-decay streaming comparisons from scratch.

Every run writes:

- `config.json`: command, model specs, data paths, environment, attribution.
- `metrics.jsonl`: one row per seed/model/dropout/stage.
- `trace.jsonl`: optional training and intermediate evaluation trace.
- `summary.csv` / `summary.json`: mean/std train loss, validation loss, and gap.
- `model_selection.csv` / `model_selection.json`: static-sweep optimum and
  plateau diagnostics for screen and confirm runs.
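
As a rough illustration of how `metrics.jsonl` rows could be reduced to the mean/std figures in `summary.csv`, the sketch below groups rows by model and dropout rate. The field names `model`, `dropout`, `train_loss`, and `val_loss` are assumptions made for this example; check the actual rows before relying on them.

```python
import json
import statistics
from collections import defaultdict


def summarize(lines):
    """Aggregate JSONL metric rows into per-(model, dropout) mean/std records.

    Assumed (hypothetical) row fields: model, dropout, train_loss, val_loss.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        groups[(row["model"], row["dropout"])].append(row)

    summary = []
    for (model, dropout), rows in sorted(groups.items()):
        train = [r["train_loss"] for r in rows]
        val = [r["val_loss"] for r in rows]
        gap = [v - t for v, t in zip(val, train)]  # generalization gap per row
        summary.append({
            "model": model,
            "dropout": dropout,
            "n": len(rows),
            "train_mean": statistics.fmean(train),
            "val_mean": statistics.fmean(val),
            # sample std needs at least two rows; report 0.0 for a single seed
            "val_std": statistics.stdev(val) if len(val) > 1 else 0.0,
            "gap_mean": statistics.fmean(gap),
        })
    return summary
```

With a real run directory, passing an open file handle works directly, for example `summarize(open("metrics.jsonl"))`, since the function only iterates over lines.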

Old exploratory outputs are archived under `archive/`.

Further documentation:

- `REPRODUCING.md`: exact headline reproduction.
- `docs/current_hypothesis_for_ml_engineers.md`: the current first-reader
  hypothesis document.
- `docs/dropout_decay_research_report_v2.md`: the denser run-by-run research
  summary.
- `docs/corpus_difficulty_probe_20260529.md`: the corpus difficulty follow-up
  that motivated the next formula refinement.
- `docs/probe_calibrated_stream_20260529.md`: the first single-seed
  probe-calibrated streaming test.
- `docs/coefficient_refit_existing_data_20260529.md`: the coefficient refit
  using only existing saved results.
- `docs/regime_coefficient_report_20260529.md`: a first-reader report on regime
  definition and coefficient interpretation.
- `docs/coefficient_calibration_plan.md`: the live coefficient calibration plan.
- `paper/dropout_decay_pressure_law.tex`: the current arXiv-style paper draft.

## Step 1: Cheap Static Screen

Use one or two seeds. For each model, the output reports the dropout rate at
which the static validation curve reaches its minimum and the range of dropout
rates that fall within the configured plateau delta.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode screen_static \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models 8x8x256 12x8x384 16x8x384 \
  --seeds 1 2 \
  --token-limits 5000000 \
  --dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
  --steps 2000 \
  --eval-batches 64
```
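
The optimum and plateau diagnostics described for this step can be sketched as follows. This is an illustrative reading, not the project's selection code: it assumes the plateau is the set of dropout rates whose validation loss is within `plateau_delta` of the minimum, and that an "interior nonzero optimum" means the best rate is neither zero nor an endpoint of the sweep. The function name and default delta are hypothetical.

```python
def static_sweep_diagnostics(val_loss_by_dropout, plateau_delta=0.01):
    """Locate the static-dropout optimum and its plateau for one model.

    val_loss_by_dropout maps dropout rate -> mean validation loss.
    plateau_delta is an assumed tolerance, in loss units, around the minimum.
    """
    rates = sorted(val_loss_by_dropout)
    best_rate = min(rates, key=lambda r: val_loss_by_dropout[r])
    best_loss = val_loss_by_dropout[best_rate]

    # Rates whose loss is within plateau_delta of the best loss.
    plateau = [
        r for r in rates
        if val_loss_by_dropout[r] - best_loss <= plateau_delta
    ]

    # Interior nonzero optimum: strictly inside the swept range and not 0.0.
    interior = best_rate > 0.0 and rates[0] < best_rate < rates[-1]

    return {
        "best_dropout": best_rate,
        "best_val_loss": best_loss,
        "plateau_min": plateau[0],
        "plateau_max": plateau[-1],
        "interior_nonzero_optimum": interior,
    }


if __name__ == "__main__":
    # Made-up numbers purely to show the call shape, not experimental results.
    curve = {0.0: 3.50, 0.05: 3.42, 0.10: 3.40, 0.20: 3.405, 0.50: 3.60}
    print(static_sweep_diagnostics(curve))
```

A model whose curve is lowest at dropout 0.0, or at the largest swept rate, would report `interior_nonzero_optimum` as false and would not qualify under workflow step 2.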

## Step 2: Confirm Winner

After selecting a promising model, rerun the static dropout curve with exactly
three seeds.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode confirm_static \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models winner=12x8x384 \
  --seeds 1 2 3 \
  --token-limits 5000000 \
  --dropout-rates 0.0 0.02 0.05 0.08 0.10 0.14 0.20 0.30 0.50 \
  --steps 2000 \
  --eval-batches 64
```

## Step 3: Locked Streaming Comparison

Only after the model is locked, compare static dropout and decay schedules from
fresh initialization.

```bash
PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
  --mode locked_stream \
  --use-cached-data \
  --cache-dir .cache/dropout_decay \
  --models winner=12x8x384 \
  --seeds 1 2 3 \
  --stream-token-caps 5000000 10000000 20000000 40000000 \
  --dropout-rates 0.0 0.10 0.14 0.20 \
  --decays decay_030_to_014:0.30:0.14:cosine decay_020_to_010:0.20:0.10:cosine \
  --stage-steps 1000 \
  --eval-batches 64
```
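
The `--decays` values above appear to follow a `name:start:end:shape` pattern. The sketch below parses that pattern and evaluates a standard cosine interpolation from the start rate to the end rate. Both the field interpretation and the schedule formula are assumptions made for illustration, as are the names `DecaySpec`, `parse_decay`, and `dropout_at`. Whether the runner decays over the whole stream or within each stage is not stated here, so `total_steps` is left to the caller.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DecaySpec:
    name: str
    start: float  # assumed: dropout rate at the beginning of the schedule
    end: float    # assumed: dropout rate at the end of the schedule
    shape: str


def parse_decay(spec):
    """Parse a 'name:start:end:shape' string (assumed format)."""
    name, start, end, shape = spec.split(":")
    return DecaySpec(name, float(start), float(end), shape)


def dropout_at(spec, step, total_steps):
    """Cosine-interpolated dropout rate at a given step (illustrative formula)."""
    if spec.shape != "cosine":
        raise ValueError(f"unsupported shape in this sketch: {spec.shape}")
    # Clamp progress to [0, 1] so out-of-range steps hold the endpoint values.
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return spec.end + 0.5 * (spec.start - spec.end) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    spec = parse_decay("decay_030_to_014:0.30:0.14:cosine")
    for step in (0, 250, 500, 750, 1000):
        print(step, round(dropout_at(spec, step, 1000), 4))
```

Under this formula the rate starts at `start`, passes through the midpoint of the two rates halfway through, and ends at `end`, with the slowest change near both ends of the schedule.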