Mandeep Sidhu committed on
Commit baafabf · 1 Parent(s): 0e508c7

Prepare reproducible research package

Files changed (3)
  1. .gitignore +3 -0
  2. README.md +16 -0
  3. REPRODUCING.md +218 -0
.gitignore CHANGED
@@ -3,3 +3,6 @@ __pycache__/
  *.py[cod]
  .cache/
  *.npy
+ *.pdf
+ .venv/
+ *.egg-info/
README.md CHANGED
@@ -1,3 +1,17 @@
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - dropout
+ - streaming
+ - language-modeling
+ - transformer
+ - mps
+ - reproducibility
+ pretty_name: Dropout Decay Streaming Experiments
+ ---
+
  # Dropout Decay Streaming Experiments
 
  This project tests dropout decay only after first finding a model/data regime
@@ -53,6 +67,8 @@ Every run writes:
 
  Old exploratory outputs are archived under `archive/`.
 
+ For exact headline reproduction, see `REPRODUCING.md`.
+
  ## Step 1: Cheap Static Screen
 
  Use one or two seeds. The output tells us, for each model, where the static
REPRODUCING.md ADDED
@@ -0,0 +1,218 @@
+ # Reproducing the Dropout Decay Experiments
+
+ This repository is intended to be runnable without checking out nanochat. The
+ implementation is derived from nanochat and retains its MIT attribution, but
+ runtime commands use this package, local cached data, and a local MPS-capable
+ Python environment.
+
+ ## Requirements
+
+ - macOS with Apple Silicon MPS available.
+ - Python 3.10-3.12 recommended for PyTorch MPS wheels.
+ - A project-local virtual environment at `.venv`.
+ - MPS-capable PyTorch. CPU and CUDA runs are intentionally refused by the
+ runner.
+
+ Create the environment:
+
+ ```bash
+ python3.11 -m venv .venv
+ .venv/bin/python -m pip install --upgrade pip
+ .venv/bin/python -m pip install -e .
+ ```
+
+ Verify MPS:
+
+ ```bash
+ .venv/bin/python - <<'PY'
+ import torch
+ print(torch.__version__)
+ print(torch.backends.mps.is_built(), torch.backends.mps.is_available())
+ PY
+ ```
+
+ Both booleans must be `True`.
+
+ ## Data
+
+ The runner supports two modes:
+
+ - `--use-cached-data --cache-dir .cache/dropout_decay`
+ - `--corpus` / `--corpus-glob` to build a cache from raw text or parquet.
+
+ The experiments in the current report used:
+
+ ```text
+ .cache/dropout_decay/tokenizer-v4096.json
+ .cache/dropout_decay/tokens-v4096-uint16.npy
+ ```
+
+ The cached token file is deliberately ignored by Git until dataset provenance
+ and binary hosting are finalized. For exact reproduction, place the two files
+ above in `.cache/dropout_decay`. The local cached split used by the completed
+ runs contains:
+
+ ```text
+ train tokens: 5,000,970
+ validation tokens: 500,000
+ vocab size: 4,096
+ ```
+
+ ## Smoke Test
+
+ This verifies cached-data loading without running a Torch experiment:
+
+ ```bash
+ PYTHONPATH=src .venv/bin/python - <<'PY'
+ from pathlib import Path
+ from dropout_decay.data import load_cached_splits
+
+ tok, splits = load_cached_splits(
+     cache_dir=Path(".cache/dropout_decay"),
+     vocab_size=4096,
+     max_required_train_tokens=4_000_000,
+     val_tokens=500_000,
+     allow_short_corpus=False,
+ )
+ print(tok.vocab_size)
+ print(len(splits.train), len(splits.val))
+ PY
+ ```
+
+ Expected:
+
+ ```text
+ 4096
+ 5000970 500000
+ ```
+
+ ## Headline Formula
+
+ The tested formula is:
+
+ ```text
+ p = clamp(0.02, 0.65,
+           0.154 * log10(params / unique_tokens)
+           + 0.249 * log10(cumulative_sampled_tokens / unique_tokens)
+           - 0.210)
+ ```
+
+ For the standard protocol:
+
+ - stream prefixes: `250000 500000 1000000 2000000 4000000`
+ - stage steps: `1000`
+ - batch size: `16`
+ - block size: `128`
+ - cumulative sampled tokens after stage `i`: `i * 1000 * 16 * 128`
+
+ ## Reproduce Model-Size Validation
+
+ Example L12 command:
+
+ ```bash
+ PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
+ --mode locked_stream \
+ --use-cached-data \
+ --cache-dir .cache/dropout_decay \
+ --output-dir runs/reproduce_l12_formula \
+ --models L12_H8_D320=12x8x320 \
+ --seeds 1 2 3 \
+ --stream-token-caps 250000 500000 1000000 2000000 4000000 \
+ --dropout-rates 0.09 0.14 0.18 0.20 0.26 0.30 \
+ --anchor-decays pressure_formula_l12:250000=0.300,500000=0.260,1000000=0.180,2000000=0.090,4000000=0.020 \
+ --stage-steps 1000 \
+ --batch-size 16 \
+ --block-size 128 \
+ --eval-batches 64 \
+ --train-eval-batches 32 \
+ --trace-eval-batches 8 \
+ --log-every 500 \
+ --vocab-size 4096 \
+ --val-tokens 500000 \
+ --lr 0.0003 \
+ --weight-decay 0.1 \
+ --grad-clip 1.0
+ ```
+
+ Completed reference result:
+
+ ```text
+ pressure formula final validation: 4.4812 +/- 0.0062
+ best static final validation: 4.5183
+ ```
+
+ ## Reproduce Architecture-Shape Holdout
+
+ Deep/narrow holdout:
+
+ ```bash
+ PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
+ --mode locked_stream \
+ --use-cached-data \
+ --cache-dir .cache/dropout_decay \
+ --output-dir runs/reproduce_arch_deep_narrow \
+ --models deep_narrow_L18_H8_D256=18x8x256 \
+ --seeds 1 2 3 \
+ --stream-token-caps 250000 500000 1000000 2000000 4000000 \
+ --dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
+ --anchor-decays formula_deep_narrow_l18_h8:250000=0.297,500000=0.250,1000000=0.173,2000000=0.083,4000000=0.020 \
+ --stage-steps 1000 \
+ --batch-size 16 \
+ --block-size 128 \
+ --eval-batches 64 \
+ --train-eval-batches 32 \
+ --trace-eval-batches 8 \
+ --log-every 500 \
+ --vocab-size 4096 \
+ --val-tokens 500000 \
+ --lr 0.0003 \
+ --weight-decay 0.1 \
+ --grad-clip 1.0
+ ```
+
+ Completed reference result:
+
+ ```text
+ formula final validation: 4.5286 +/- 0.0118
+ best static final validation: 4.5564 +/- 0.0127
+ ```
+
+ ## Next Unrun Holdout
+
+ The next planned holdout is the width-heavy architecture test:
+
+ ```bash
+ PYTHONPATH=src .venv/bin/python scripts/run_experiments.py \
+ --mode locked_stream \
+ --use-cached-data \
+ --cache-dir .cache/dropout_decay \
+ --output-dir runs/architecture_shape_holdout_wide_h8 \
+ --models wide_L8_H8_D384=8x8x384 \
+ --seeds 1 2 3 \
+ --stream-token-caps 250000 500000 1000000 2000000 4000000 \
+ --dropout-rates 0.02 0.08 0.14 0.18 0.20 0.26 0.30 \
+ --anchor-decays formula_wide_l8_h8:250000=0.301,500000=0.254,1000000=0.177,2000000=0.087,4000000=0.020 \
+ --stage-steps 1000 \
+ --batch-size 16 \
+ --block-size 128 \
+ --eval-batches 64 \
+ --train-eval-batches 32 \
+ --trace-eval-batches 8 \
+ --log-every 500 \
+ --vocab-size 4096 \
+ --val-tokens 500000 \
+ --lr 0.0003 \
+ --weight-decay 0.1 \
+ --grad-clip 1.0
+ ```
+
+ Expected runtime on the current MPS setup is about 2.5-3.5 hours.
+
+ ## Notes for Publication
+
+ - Do not claim the formula is universal.
+ - The supported claim is final-validation improvement under this
+ expanding-prefix protocol.
+ - PDFs are generated artifacts and are ignored by Git.
+ - The exact cached token file should be published through an appropriate binary
+ artifact mechanism once dataset provenance is finalized.
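
Reviewer sketch (not part of this commit): the "Headline Formula" section of `REPRODUCING.md` gives the clamp expression and the standard protocol, but no code that evaluates it. The Python below is a minimal illustration of how a per-stage schedule could be computed and formatted in the `name:cap=rate,...` shape used by the `--anchor-decays` arguments above. It rests on assumptions the diff does not confirm: that `unique_tokens` at stage `i` equals the `i`-th stream prefix, and that `clamp(0.02, 0.65, x)` means lower bound, upper bound, value. The function names and the parameter count in the demo are hypothetical placeholders, not taken from the package.

```python
import math

# Coefficients and clamp bounds copied from the formula in REPRODUCING.md.
P_MIN, P_MAX = 0.02, 0.65
PARAM_COEF, REUSE_COEF, OFFSET = 0.154, 0.249, 0.210


def pressure_dropout(params, unique_tokens, cumulative_sampled_tokens):
    """Evaluate the clamped dropout formula for a single stage."""
    raw = (
        PARAM_COEF * math.log10(params / unique_tokens)
        + REUSE_COEF * math.log10(cumulative_sampled_tokens / unique_tokens)
        - OFFSET
    )
    return min(P_MAX, max(P_MIN, raw))


def formula_schedule(params, prefixes, stage_steps=1000, batch_size=16, block_size=128):
    """Return (prefix, dropout) pairs, one per stage.

    Assumption: unique_tokens at stage i is the i-th stream prefix, and the
    cumulative sampled tokens after stage i is i * stage_steps * batch_size * block_size.
    """
    tokens_per_stage = stage_steps * batch_size * block_size
    schedule = []
    for i, prefix in enumerate(prefixes, start=1):
        cumulative = i * tokens_per_stage
        schedule.append((prefix, pressure_dropout(params, prefix, cumulative)))
    return schedule


def format_anchor_decays(name, schedule):
    """Format a schedule in the name:cap=rate,... shape seen in the commands."""
    pairs = ",".join(f"{prefix}={p:.3f}" for prefix, p in schedule)
    return f"{name}:{pairs}"


if __name__ == "__main__":
    prefixes = [250_000, 500_000, 1_000_000, 2_000_000, 4_000_000]
    hypothetical_params = 16_000_000  # placeholder, not a measured model size
    schedule = formula_schedule(hypothetical_params, prefixes)
    print(format_anchor_decays("example_schedule", schedule))
```

The parameter counts behind the committed `--anchor-decays` values are not stated on this page, so the sketch takes `params` as an input rather than deriving it from a model spec such as `12x8x320`.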