OpenTransformer committed on
Commit 33e80c3 · verified · 1 Parent(s): bb066f1

Create README.md

Files changed (1)
  1. README.md +246 -0
README.md ADDED
@@ -0,0 +1,246 @@
# AGILLM2-fast-training · `5L.py`

Autoregressive (AR-only) single-file trainer/decoder using the Qwen3 tokenizer.

* **Repo:** [https://huggingface.co/OpenTransformer/AGILLM2-fast-training](https://huggingface.co/OpenTransformer/AGILLM2-fast-training)
* **Org:** [https://huggingface.co/OpenTransformer](https://huggingface.co/OpenTransformer)
* **Contact:** [OpenTransformers@proton.me](mailto:OpenTransformers@proton.me)

## Overview

`5L.py` is a single-file PyTorch training and inference script for language models with:

* **AR-only** training/decoding
* **Qwen3** tokenizer by default (override via `TOKENIZER_ID`)
* **Progressive block growth**, **AMP/FP8 autocast**, **OOM backoff**
* **Time-based checkpointing** only (monotonic, resume-safe)
* **Sampling controls:** top-k/top-p/min-p, greedy, repetition/presence/frequency penalties, no-repeat-ngrams
* **Chinchilla-style target token estimator** using all enabled params (core + AR head)

The goal is a **minimal surface area** with just enough production-oriented features to train quickly, resume safely, and decode reliably on commodity GPUs or cloud nodes.

## Features

* **Presets:** `small`, `smallx2`, `base`
* **Attention:** Low-rank MHA with ALiBi relative bias
* **Determinism helpers:** seed management, checkpoint metadata (RNG states)
* **Tokenizer safety:** adds `[PAD]` if missing; handles EOS fallbacks
* **Streaming data:** uses `datasets` streaming for large corpora

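The feature list names ALiBi relative bias. As a rough illustration of what that means, here is a minimal sketch of the standard ALiBi slope schedule and the causal bias for one head. The function names are hypothetical and the sketch only covers power-of-two head counts; it is not taken from `5L.py`, which may build its bias differently (for example for the 24-head `base` preset).

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Geometric ALiBi slope schedule for a power-of-two number of heads."""
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / n_heads)
    # Head i (0-based) gets slope start ** (i + 1).
    return [start ** (i + 1) for i in range(n_heads)]


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias for one head: -slope * distance to past keys, -inf for future keys."""
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

The bias is added to the attention scores before the softmax, so more distant keys are penalised linearly and no learned position embedding is needed.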
## Requirements

* Python 3.10+
* PyTorch 2.2+ (CUDA build if using NVIDIA GPUs)
* `transformers`, `datasets`, `tqdm`
* CUDA-capable GPU recommended; the script also runs CPU-only for smoke tests

Install:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121  # pick your CUDA/CPU wheel
pip install transformers datasets tqdm
```

## Quick start

### 1) Set tokenizer (optional)

The default is Qwen3:

```bash
export TOKENIZER_ID="Qwen/Qwen3-235B-A22B-Thinking-2507"
```

Or point it at any compatible tokenizer:

```bash
export TOKENIZER_ID="Qwen/Qwen2.5-7B"
```

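The override above is a plain environment lookup. A minimal sketch of how such a lookup might be written, assuming a blank value should be treated the same as an unset one (that behaviour and the helper name `resolve_tokenizer_id` are assumptions, not confirmed from `5L.py`):

```python
import os

# Default tokenizer ID as stated in this README.
DEFAULT_TOKENIZER_ID = "Qwen/Qwen3-235B-A22B-Thinking-2507"


def resolve_tokenizer_id(env=None) -> str:
    """Return TOKENIZER_ID from the environment, or the default when unset or blank."""
    env = os.environ if env is None else env
    value = env.get("TOKENIZER_ID", "").strip()
    return value or DEFAULT_TOKENIZER_ID
```

The resolved ID would then typically be passed to `transformers.AutoTokenizer.from_pretrained`.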
### 2) Train

Minimal example on SlimPajama (streaming):

```bash
python 5L.py train \
  --preset small \
  --source cerebras/SlimPajama-627B \
  --amp \
  --save_dir ckpts_joint \
  --save_every_sec 7200
```

Targets and steps:

```bash
# Let the script compute Chinchilla-style target tokens automatically
python 5L.py train --preset small --amp

# Or cap by steps
python 5L.py train --preset small --steps 20000 --amp
```

Warm start / resume:

```bash
# Warm-start from a prior final.pt (shape-safe copy of matching tensors)
python 5L.py train --preset small --warmstart_from ckpts_joint/final.pt

# Full resume (optimizer, scaler, seen tokens, timers)
python 5L.py train --resume ckpts_joint/step00050000.pt
```

Progressive block growth:

```bash
python 5L.py train \
  --preset small \
  --auto_grow \
  --grow_plan "576,640,768,896,1024" \
  --grow_every_steps 50000
```

FP8 fast path:

```bash
# Try FP8; if not supported, fall back to bf16
python 5L.py train --preset small --fp8-only --fp8-fallback
```

### 3) Inference

```bash
python 5L.py infer \
  --mode ar \
  --ckpt ckpts_joint/final.pt \
  --preset small \
  --prompt "Explain ALiBi in simple terms." \
  --max_new 120 \
  --top_p 0.9 --top_k 50 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3
```

Greedy decode:

```bash
python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
  --prompt "What is progressive block growth in training?" --greedy --max_new 80
```

FP8 during decode (if supported):

```bash
python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
  --prompt "Summarize transformer attention variants." --fp8-only --fp8-fallback
```

## Presets

```text
small   : d=512, layers=8,  heads=16, rank=64
smallx2 : d=512, layers=16, heads=16, rank=64
base    : d=768, layers=12, heads=24, rank=96
```

Pass `--x2` during training to double the layer count of an inferred previous config.

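The preset table maps naturally onto a small config structure. The sketch below mirrors the table; the `Preset` class and the `double_layers` helper are hypothetical names used for illustration, and `double_layers` only approximates what `--x2` is described as doing.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Preset:
    d: int       # model width
    layers: int  # number of transformer blocks
    heads: int   # attention heads
    rank: int    # low-rank attention rank


# Values copied from the preset table in this README.
PRESETS = {
    "small": Preset(d=512, layers=8, heads=16, rank=64),
    "smallx2": Preset(d=512, layers=16, heads=16, rank=64),
    "base": Preset(d=768, layers=12, heads=24, rank=96),
}


def double_layers(cfg: Preset) -> Preset:
    """Return a copy of cfg with twice as many layers and everything else unchanged."""
    return replace(cfg, layers=cfg.layers * 2)
```

Under this reading, doubling `small` yields exactly the `smallx2` shape, and every preset has a per-head width of `d / heads = 32`.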
## Checkpointing & Resume

* Checkpoints are **saved** only on a **time interval** (`--save_every_sec`, default 24h) to avoid step-based drift.
* `final.pt` includes: core, AR head, optimizer, AMP scaler, cfg, RNG states, and metadata.
* **Resume** with `--resume <path>` to restore optimizer/scaler/wall-clock cadence.
* **Warm start** only copies shape-matched tensors (safe if your topology changed).

Artifacts:

* `ckpts_joint/stepXXXXXXXX.pt`
* `ckpts_joint/latest.json` with canonical latest path and step

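To make the time-based cadence concrete, here is a minimal sketch of a saver that triggers on elapsed time rather than step count and records the latest checkpoint in `latest.json`. The class name, the injected `clock`, and the JSON keys `path` and `step` are assumptions for illustration; the actual file layout written by `5L.py` may differ.

```python
import json
import os
import time


class TimedCheckpointer:
    """Decides when to save based on elapsed time, not on step count."""

    def __init__(self, save_dir, save_every_sec=24 * 3600, clock=time.monotonic):
        self.save_dir = save_dir
        self.save_every_sec = save_every_sec
        self.clock = clock  # monotonic clock, immune to system time changes
        self.last_save = clock()

    def due(self) -> bool:
        """True once at least save_every_sec seconds have passed since the last save."""
        return self.clock() - self.last_save >= self.save_every_sec

    def record(self, step: int) -> str:
        """Update latest.json via an atomic rename and return the checkpoint path."""
        os.makedirs(self.save_dir, exist_ok=True)
        path = os.path.join(self.save_dir, f"step{step:08d}.pt")
        tmp = os.path.join(self.save_dir, "latest.json.tmp")
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump({"path": path, "step": step}, fh)
        os.replace(tmp, os.path.join(self.save_dir, "latest.json"))
        self.last_save = self.clock()
        return path
```

In a training loop the caller would check `due()` after each step, write the model state to the returned path (for example with `torch.save`), and rely on `latest.json` to find the most recent checkpoint on resume.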
## Data

Default streaming dataset:

* `cerebras/SlimPajama-627B` (train split, streaming enabled).

Replace `--source` with any `datasets`-compatible corpus that yields `{"text": ...}`.

EOS handling: if the tokenizer's `eos_token_id` is missing, the script uses `sep_token_id`; if a sample doesn't end with EOS, one is appended.

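The EOS rule above can be sketched in a few lines. The helper names are hypothetical, and raising an error when neither ID exists is an assumption; the README does not say what `5L.py` does in that case.

```python
def resolve_eos_id(eos_token_id, sep_token_id):
    """Prefer eos_token_id; fall back to sep_token_id when EOS is missing."""
    if eos_token_id is not None:
        return eos_token_id
    if sep_token_id is not None:
        return sep_token_id
    # Assumption: with neither ID available there is nothing sensible to append.
    raise ValueError("tokenizer defines neither eos_token_id nor sep_token_id")


def ensure_eos(ids, eos_id):
    """Return a copy of ids that is guaranteed to end with eos_id."""
    ids = list(ids)
    if not ids or ids[-1] != eos_id:
        ids.append(eos_id)
    return ids
```

Appending EOS to every sample gives the model an explicit document boundary when streamed samples are packed back to back.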
## Sampling controls

* `--temperature`, `--top_k`, `--top_p`, `--min_p`
* `--repetition_penalty`, `--presence_penalty`, `--frequency_penalty`, `--penalty_last_n`
* `--no_repeat_ngram_size`

Greedy mode (`--greedy`) overrides sampling.

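For readers unfamiliar with these knobs, the sketch below shows one common way temperature, top-k, top-p, min-p, and greedy decoding interact. It is written in plain Python for clarity and is an assumption about typical behaviour, not the code in `5L.py`; in particular, applying all three filters to the same unfiltered distribution and intersecting the results is a simplification, and the penalty flags are not covered.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_probs(probs, top_k=0, top_p=1.0, min_p=0.0):
    """Zero out tokens rejected by top-k, top-p (nucleus), or min-p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)
    if top_k > 0:
        keep &= set(order[:top_k])
    if top_p < 1.0:
        nucleus, mass = [], 0.0
        for i in order:  # smallest prefix whose cumulative mass reaches top_p
            nucleus.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        keep &= set(nucleus)
    if min_p > 0.0:
        floor = min_p * probs[order[0]]  # threshold relative to the most likely token
        keep &= {i for i in order if probs[i] >= floor}
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0,
           greedy=False, rng=random):
    """Pick the next token index; greedy ignores every sampling parameter."""
    if greedy:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = filter_probs(softmax(logits, temperature), top_k, top_p, min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

The most likely token always survives all three filters (for `min_p` up to 1), so the renormalisation never divides by zero.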
## FP8 / AMP

* `--fp8-only` attempts `float8_e4m3fn` autocast
* `--fp8-fallback` continues with bf16 if FP8 is unsupported
* Otherwise use `--amp` for bf16/fp16 autocast
* `torch.backends.cuda.matmul.allow_tf32=True` is enabled when available

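The flag combinations above amount to a small decision table. The following sketch spells it out with a hypothetical helper that returns the dtype name as a string. Two details are assumptions rather than documented behaviour: that `--fp8-only` without `--fp8-fallback` fails on unsupported hardware, and that `--amp` prefers bf16 over fp16 when both are available.

```python
def pick_autocast_dtype(fp8_only, fp8_fallback, amp, fp8_supported, bf16_supported):
    """Return the autocast dtype name to use, or None for full precision."""
    if fp8_only:
        if fp8_supported:
            return "float8_e4m3fn"
        if fp8_fallback:
            return "bfloat16"
        # Assumption: without the fallback flag, unsupported FP8 is a hard error.
        raise RuntimeError("FP8 requested but unsupported; pass --fp8-fallback to use bf16")
    if amp:
        # Assumption: bf16 is preferred when the device supports it.
        return "bfloat16" if bf16_supported else "float16"
    return None
```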
## OOM backoff & block growth

* On CUDA OOM, the script **halves** `BLOCK` (down to 128), empties cache, and retries the step.
* With `--auto_grow`, the script periodically attempts to **increase** `BLOCK` along your `--grow_plan`.

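The two behaviours above can be sketched as a retry loop plus a lookup into the growth plan. This is an illustration under stated assumptions, not the logic in `5L.py`: the helper names are hypothetical, `MemoryError` stands in for a CUDA out-of-memory error so the sketch runs without a GPU, and re-raising once the block is already at the floor is an assumption.

```python
MIN_BLOCK = 128  # floor named in this README


def run_step_with_backoff(step_fn, block,
                          is_oom=lambda exc: isinstance(exc, MemoryError)):
    """Call step_fn(block); on OOM halve block (never below MIN_BLOCK) and retry.

    Returns (result, block_actually_used).
    """
    while True:
        try:
            return step_fn(block), block
        except Exception as exc:
            if not is_oom(exc) or block <= MIN_BLOCK:
                raise  # not an OOM, or nothing left to shrink
            block = max(MIN_BLOCK, block // 2)
            # A real CUDA loop would also call torch.cuda.empty_cache() here.


def next_block(block, grow_plan):
    """Smallest plan entry larger than the current block, or block if none is left."""
    larger = sorted(b for b in grow_plan if b > block)
    return larger[0] if larger else block
```

A training loop might call `next_block` every `--grow_every_steps` steps and let `run_step_with_backoff` shrink the block again if the larger size does not fit.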
## Token targets (Chinchilla-style)

If `--target_tokens` is unspecified, the script computes `25 × (enabled parameters)` using **all** trainable params (core + AR head). This provides a rough target for total tokens to consume.

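As a worked example of the rule above, the sketch below computes the target and converts it to a step count. The function names and the `batch_size` parameter are hypothetical; the README does not document how `5L.py` sets its batch size or whether it derives steps this way.

```python
def target_tokens(n_params: int, tokens_per_param: int = 25) -> int:
    """Token budget as a fixed multiple of the trainable parameter count."""
    return tokens_per_param * n_params


def steps_for_target(target: int, batch_size: int, block: int) -> int:
    """Steps needed to consume `target` tokens at batch_size * block tokens per step."""
    tokens_per_step = batch_size * block
    return -(-target // tokens_per_step)  # ceiling division
```

For instance, a model with 40 million trainable parameters would get a target of one billion tokens under the 25× rule.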
## Repro tips

* Pin a specific tokenizer via `TOKENIZER_ID`
* Log your `--preset`, `--block`, and `--grow_plan`
* Keep `--save_every_sec` stable between resumes for monotonic cadence
* Record CUDA/cuDNN versions in your run logs for reproducibility

## Limitations

* AR-only trainer (no encoder-decoder, no multimodal)
* Low-rank MHA path; FlashAttention not included
* Single-GPU by default; multi-GPU DDP not wired in this file
* Safety/guardrails are out of scope here (this is a trainer, not a hosted chat product)

## Roadmap (planned)

* Optional DDP with NCCL/RCCL/HCCL backends
* FlashAttention path when available across vendors
* Export helpers (Safetensors, GGUF) for downstream serving

## License

* Code in this repo is intended to be released under a permissive license (e.g., Apache-2.0 or MIT).
  Add your chosen license file at the repo root and reflect it here.

## Responsible Use

* Ensure your dataset usage complies with its license and applicable laws.
* Models trained with this script can generate incorrect or biased outputs.
  Evaluate and align according to your deployment requirements.

## Citation

If this script or training pipeline helps your work, consider citing the repo:

```bibtex
@software{OpenTransformer_AGILLM2_fast_training_2025,
  title  = {AGILLM2-fast-training: Single-file AR-only trainer/decoder (5L.py)},
  author = {OpenTransformers},
  year   = {2025},
  url    = {https://huggingface.co/OpenTransformer/AGILLM2-fast-training}
}
```

---

**Support / Contracts**

We provide **custom development** and **end-to-end training** services (data prep → training → evaluation → deployment).

* Email: **[OpenTransformers@proton.me](mailto:OpenTransformers@proton.me)**
* Org page: [https://huggingface.co/OpenTransformer](https://huggingface.co/OpenTransformer)