# AGILLM2-fast-training · `5L.py`

Autoregressive (AR-only) single-file trainer/decoder using the Qwen3 tokenizer

**Repo:** [https://huggingface.co/OpenTransformer/AGILLM2-fast-training](https://huggingface.co/OpenTransformer/AGILLM2-fast-training)
**Org:** [https://huggingface.co/OpenTransformer](https://huggingface.co/OpenTransformer)
**Contact:** [OpenTransformers@proton.me](mailto:OpenTransformers@proton.me)

## Overview

`5L.py` is a single-file PyTorch training and inference script for language models with:

* **AR-only** training/decoding
* **Qwen3** tokenizer by default (override via `TOKENIZER_ID`)
* **Progressive block growth**, **AMP/FP8 autocast**, **OOM backoff**
* **Time-based checkpointing** only (monotonic, resume-safe)
* **Sampling controls:** top-k/top-p/min-p, greedy, repetition/presence/frequency penalties, no-repeat-ngrams
* **Chinchilla-style target token estimator** using all enabled params (core + AR head)

The goal is **minimal surface area** with production-lean features so you can train quickly, resume safely, and decode reliably on commodity GPUs or cloud nodes.

## Features

* **Presets:** `small`, `smallx2`, `base`
* **Attention:** Low-rank MHA with ALiBi relative bias (see the sketch after this list)
* **Determinism helpers:** seed management, checkpoint metadata (RNG states)
* **Tokenizer safety:** adds `[PAD]` if missing; handles EOS fallbacks
* **Streaming data:** uses `datasets` streaming for large corpora
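
For intuition, the low-rank attention with ALiBi bias can be pictured roughly as follows. This is an illustrative sketch with hypothetical function names and the common ALiBi slope schedule, not code lifted from `5L.py`:

```python
# Hypothetical sketch: per-head low-rank Q/K/V projections plus an ALiBi relative bias.
import torch
import torch.nn.functional as F

def alibi_slopes(n_heads: int) -> torch.Tensor:
    # Common ALiBi slope schedule for a power-of-two head count: 2^(-8*(i+1)/n).
    return torch.tensor([2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)])

def lowrank_alibi_attention(x, wq, wk, wv, wo, n_heads=16, rank=64):
    # x: (B, T, d); wq/wk/wv: (d, n_heads*rank); wo: (n_heads*rank, d)
    B, T, d = x.shape
    q = (x @ wq).view(B, T, n_heads, rank).transpose(1, 2)   # (B, H, T, r)
    k = (x @ wk).view(B, T, n_heads, rank).transpose(1, 2)
    v = (x @ wv).view(B, T, n_heads, rank).transpose(1, 2)
    scores = q @ k.transpose(-2, -1) / rank ** 0.5            # (B, H, T, T)
    # ALiBi: subtract slope * distance-to-key; then mask future positions (causal).
    pos = torch.arange(T)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)         # dist[i, j] = i - j for j <= i
    scores = scores + (-alibi_slopes(n_heads)[:, None, None] * dist)
    scores = scores.masked_fill(pos[None, :] > pos[:, None], float("-inf"))
    out = F.softmax(scores, dim=-1) @ v                       # (B, H, T, r)
    return out.transpose(1, 2).reshape(B, T, n_heads * rank) @ wo
```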

## Requirements

* Python 3.10+
* PyTorch 2.2+ (CUDA build if using NVIDIA GPUs)
* `transformers`, `datasets`, `tqdm`
* CUDA-capable GPU recommended; script also runs CPU-only for smoke tests

Install:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121  # pick your CUDA/CPU wheel
pip install transformers datasets tqdm
```

## Quick start

### 1) Set tokenizer (optional)

Default is Qwen3:

```bash
export TOKENIZER_ID="Qwen/Qwen3-235B-A22B-Thinking-2507"
```

Use any compatible tokenizer:

```bash
export TOKENIZER_ID="qwen/qwen2.5-7b"
```
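
The tokenizer-safety behavior listed under Features (adding `[PAD]` if missing, EOS fallbacks) amounts roughly to the sketch below; `load_tokenizer` is a hypothetical helper, not a function from `5L.py`:

```python
# Illustrative sketch of tokenizer selection with PAD/EOS safety (not the script's exact code).
import os
from transformers import AutoTokenizer

def load_tokenizer():
    tok_id = os.environ.get("TOKENIZER_ID", "Qwen/Qwen3-235B-A22B-Thinking-2507")
    tok = AutoTokenizer.from_pretrained(tok_id)
    if tok.pad_token is None:                      # add [PAD] if the tokenizer lacks one
        tok.add_special_tokens({"pad_token": "[PAD]"})
    eos_id = tok.eos_token_id or tok.sep_token_id  # fall back to SEP when EOS is missing
    return tok, eos_id
```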

### 2) Train

Minimal example on SlimPajama (streaming):

```bash
python 5L.py train \
  --preset small \
  --source cerebras/SlimPajama-627B \
  --amp \
  --save_dir ckpts_joint \
  --save_every_sec 7200
```

Targets and steps:

```bash
# Let script compute Chinchilla-style target tokens automatically
python 5L.py train --preset small --amp

# Or cap by steps
python 5L.py train --preset small --steps 20000 --amp
```

Warm start / resume:

```bash
# Warm-start from a prior final.pt (shape-safe copy of matching tensors)
python 5L.py train --preset small --warmstart_from ckpts_joint/final.pt

# Full resume (optimizer, scaler, seen tokens, timers)
python 5L.py train --resume ckpts_joint/step00050000.pt
```

Progressive block growth:

```bash
python 5L.py train \
  --preset small \
  --auto_grow \
  --grow_plan "576,640,768,896,1024" \
  --grow_every_steps 50000
```

FP8 fast path:

```bash
# Try FP8; if not supported, fall back to bf16
python 5L.py train --preset small --fp8-only --fp8-fallback
```

### 3) Inference

```bash
python 5L.py infer \
  --mode ar \
  --ckpt ckpts_joint/final.pt \
  --preset small \
  --prompt "Explain ALiBi in simple terms." \
  --max_new 120 \
  --top_p 0.9 --top_k 50 \
  --repetition_penalty 1.1 \
  --no_repeat_ngram_size 3
```

Greedy decode:

```bash
python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
  --prompt "What is progressive block growth in training?" --greedy --max_new 80
```

FP8 during decode (if supported):

```bash
python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
  --prompt "Summarize transformer attention variants." --fp8-only --fp8-fallback
```

## Presets

```text
small    : d=512, layers=8,  heads=16, rank=64
smallx2  : d=512, layers=16, heads=16, rank=64
base     : d=768, layers=12, heads=24, rank=96
```

Use `--x2` during training to double the layer count of an inferred previous config.

## Checkpointing & Resume

* **Saves** only by **time interval** (`--save_every_sec`, default 24h) to avoid step-based drift.
* `final.pt` includes: core, AR head, optimizer, AMP scaler, cfg, RNG states, and metadata.
* **Resume** with `--resume <path>` to restore optimizer/scaler/wall-clock cadence.
* **Warm start** only copies shape-matched tensors (safe if your topology changed).
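
The shape-safe warm start amounts to copying only name- and shape-matched tensors, roughly as in this sketch (hypothetical helper; assumes the checkpoint holds a plain or `{"model": ...}`-wrapped state dict):

```python
# Rough sketch of a shape-safe warm start (not 5L.py verbatim).
import torch

def warmstart(model: torch.nn.Module, ckpt_path: str) -> int:
    state = torch.load(ckpt_path, map_location="cpu")
    src = state.get("model", state)                 # tolerate raw or wrapped state dicts
    dst = model.state_dict()
    copied = {k: v for k, v in src.items() if k in dst and v.shape == dst[k].shape}
    dst.update(copied)
    model.load_state_dict(dst)
    return len(copied)                              # number of tensors actually copied
```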

Artifacts:

* `ckpts_joint/stepXXXXXXXX.pt`
* `ckpts_joint/latest.json` with canonical latest path and step

## Data

Default streaming dataset:

* `cerebras/SlimPajama-627B` (train split, streaming enabled).
  Replace `--source` with any `datasets`-compatible corpus that yields `{"text": ...}`.

EOS handling: if the tokenizer's `eos_token_id` is missing, `sep_token_id` is used instead; if a sample doesn't end with EOS, one is appended.
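
In code, that EOS policy looks roughly like this sketch (`encode_sample` is a hypothetical helper; `eos_id` is whichever of `eos_token_id`/`sep_token_id` is available):

```python
# Sketch of the EOS handling described above (hypothetical helper name).
def encode_sample(text: str, tok, eos_id: int) -> list[int]:
    ids = tok.encode(text, add_special_tokens=False)
    if not ids or ids[-1] != eos_id:   # append EOS if the sample doesn't end with it
        ids.append(eos_id)
    return ids
```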

## Sampling controls

* `--temperature`, `--top_k`, `--top_p`, `--min_p`
* `--repetition_penalty`, `--presence_penalty`, `--frequency_penalty`, `--penalty_last_n`
* `--no_repeat_ngram_size`

Greedy mode (`--greedy`) overrides sampling.
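
As a rough illustration of how a few of these flags combine on one step's logits (a hedged sketch with hypothetical helper names; the script additionally implements min-p, presence/frequency penalties, and no-repeat-ngrams):

```python
# Illustrative sketch of repetition penalty + top-k + top-p sampling on a 1-D logits vector.
import torch

def sample_next(logits, prev_ids, temperature=1.0, top_k=50, top_p=0.9, rep_penalty=1.1):
    logits = logits.clone()
    # Repetition penalty: damp logits of tokens already generated.
    for t in set(prev_ids):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    logits = logits / max(temperature, 1e-5)
    # Top-k: keep only the k most likely tokens.
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    # Top-p (nucleus): drop tokens whose preceding cumulative probability already exceeds top_p.
    probs = torch.softmax(logits, dim=-1)
    sorted_p, sorted_idx = torch.sort(probs, descending=True)
    cutoff = torch.cumsum(sorted_p, dim=-1) - sorted_p > top_p
    probs[sorted_idx[cutoff]] = 0.0
    probs = probs / probs.sum()
    return torch.multinomial(probs, 1).item()
```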

## FP8 / AMP

* `--fp8-only` attempts `float8_e4m3fn` autocast
* `--fp8-fallback` continues with bf16 if FP8 unsupported
* Otherwise use `--amp` for bf16/fp16 autocast
* `torch.backends.cuda.matmul.allow_tf32=True` is enabled when available
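
A minimal sketch of how the two FP8 flags might select the autocast dtype (hypothetical helper; the script's actual support probe may differ):

```python
# Sketch of FP8-or-bf16 dtype selection based on the flags above (not 5L.py verbatim).
import torch

def pick_amp_dtype(fp8_only: bool, fp8_fallback: bool) -> torch.dtype:
    fp8_supported = hasattr(torch, "float8_e4m3fn") and torch.cuda.is_available()
    if fp8_only and fp8_supported:
        return torch.float8_e4m3fn           # attempt the FP8 fast path
    if fp8_only and not fp8_fallback:
        raise RuntimeError("FP8 requested but not supported on this build/GPU")
    return torch.bfloat16                    # bf16 autocast otherwise (--amp)
```

The chosen dtype is then used for the autocast region around the forward pass, as the bullets above describe.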

## OOM backoff & block growth

* On CUDA OOM, the script **halves** `BLOCK` (down to 128), empties cache, and retries the step.
* With `--auto_grow`, the script periodically attempts to **increase** `BLOCK` along your `--grow_plan`.
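
Concretely, the backoff is the kind of retry loop sketched below; `step_fn` stands in for one training step at a given block size (hypothetical helper, not `5L.py` verbatim):

```python
# Sketch of the CUDA OOM backoff described above.
import torch

def run_step_with_backoff(step_fn, block: int, min_block: int = 128):
    while True:
        try:
            return step_fn(block), block             # step succeeded at this block size
        except torch.cuda.OutOfMemoryError:
            if block <= min_block:
                raise                                # nothing left to shrink
            torch.cuda.empty_cache()
            block = max(min_block, block // 2)       # halve the context block and retry
```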

## Token targets (Chinchilla-style)

If `--target_tokens` is unspecified, the script computes `25 × (enabled parameters)` using **all** trainable params (core + AR head). This provides a rough target for total tokens to consume.
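
As a sketch (hypothetical helper names):

```python
# Sketch of the 25-tokens-per-parameter target described above.
import torch

def target_tokens(core: torch.nn.Module, ar_head: torch.nn.Module) -> int:
    n_params = sum(p.numel() for m in (core, ar_head)
                   for p in m.parameters() if p.requires_grad)
    return 25 * n_params   # rough Chinchilla-style budget over all enabled params
```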

## Repro tips

* Pin a specific tokenizer via `TOKENIZER_ID`
* Log your `--preset`, `--block`, and `--grow_plan`
* Keep `save_every_sec` stable between resumes for monotonic cadence
* Record CUDA/cuDNN versions in your run logs for reproducibility

## Limitations

* AR-only trainer (no encoder-decoder, no multimodal)
* Low-rank MHA path; FlashAttention not included
* Single-GPU by default; multi-GPU DDP not wired in this file
* Safety/guardrails are out of scope here (this is a trainer, not a hosted chat product)

## Roadmap (planned)

* Optional DDP with NCCL/RCCL/HCCL backends
* FlashAttention path when available across vendors
* Export helpers (Safetensors, GGUF) for downstream serving

## Responsible Use

* Ensure your dataset usage complies with its license and applicable laws.
* Models trained with this script can generate incorrect or biased outputs.
  Evaluate and align according to your deployment requirements.

## Citation

If this script or training pipeline helps your work, consider citing the repo:

```bibtex
@software{OpenTransformer_AGILLM2_fast_training_2025,
  title   = {AGILLM2-fast-training: Single-file AR-only trainer/decoder (5L.py)},
  author  = {OpenTransformers},
  year    = {2025},
  url     = {https://huggingface.co/OpenTransformer/AGILLM2-fast-training}
}
```

---

**Support / Contracts**
We provide **custom development** and **end-to-end training** services (data prep → training → evaluation → deployment).
Email: **[OpenTransformers@proton.me](mailto:OpenTransformers@proton.me)**
Org page: [https://huggingface.co/OpenTransformer](https://huggingface.co/OpenTransformer)