# AGILLM2-fast-training · `5L.py`
Autoregressive (AR-only) single-file trainer/decoder using the Qwen3 tokenizer
**Repo:** [https://huggingface.co/OpenTransformer/AGILLM2-fast-training](https://huggingface.co/OpenTransformer/AGILLM2-fast-training)
**Org:** [https://huggingface.co/OpenTransformer](https://huggingface.co/OpenTransformer)
**Contact:** [OpenTransformers@proton.me](mailto:OpenTransformers@proton.me)
## Overview
`5L.py` is a self-contained, single-file PyTorch training and inference script for language models with:
* **AR-only** training/decoding
* **Qwen3** tokenizer by default (override via `TOKENIZER_ID`)
* **Progressive block growth**, **AMP/FP8 autocast**, **OOM backoff**
* **Time-based checkpointing** only (monotonic, resume-safe)
* **Sampling controls:** top-k/top-p/min-p, greedy, repetition/presence/frequency penalties, no-repeat-ngrams
* **Chinchilla-style target token estimator** using all enabled params (core + AR head)
The goal is **minimal surface area** with production-lean features so you can train quickly, resume safely, and decode reliably on commodity GPUs or cloud nodes.
## Features
* **Presets:** `small`, `smallx2`, `base`
* **Attention:** Low-rank MHA with ALiBi relative bias
* **Determinism helpers:** seed management, checkpoint metadata (RNG states)
* **Tokenizer safety:** adds `[PAD]` if missing; handles EOS fallbacks
* **Streaming data:** uses `datasets` streaming for large corpora
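The tokenizer-safety step can be sketched roughly as follows; the helper name is hypothetical and `5L.py`'s actual implementation may differ:

```python
def ensure_pad_token(tokenizer):
    """Add a [PAD] token if the tokenizer lacks one (hypothetical helper
    mirroring 5L.py's tokenizer-safety step)."""
    if getattr(tokenizer, "pad_token", None) is None:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
    return tokenizer
```

Note that adding a new special token grows the vocabulary, so the model's embedding table must be sized to match (e.g. via `model.resize_token_embeddings` in `transformers`-style models).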
## Requirements
* Python 3.10+
* PyTorch 2.2+ (CUDA build if using NVIDIA GPUs)
* `transformers`, `datasets`, `tqdm`
* CUDA-capable GPU recommended; script also runs CPU-only for smoke tests
Install:
```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121 # pick your CUDA/CPU wheel
pip install transformers datasets tqdm
```
## Quick start
### 1) Set tokenizer (optional)
Default is Qwen3:
```bash
export TOKENIZER_ID="Qwen/Qwen3-235B-A22B-Thinking-2507"
```
Use any compatible tokenizer:
```bash
export TOKENIZER_ID="Qwen/Qwen2.5-7B"
```
### 2) Train
Minimal example on SlimPajama (streaming):
```bash
python 5L.py train \
--preset small \
--source cerebras/SlimPajama-627B \
--amp \
--save_dir ckpts_joint \
--save_every_sec 7200
```
Targets and steps:
```bash
# Let script compute Chinchilla-style target tokens automatically
python 5L.py train --preset small --amp
# Or cap by steps
python 5L.py train --preset small --steps 20000 --amp
```
Warm start / resume:
```bash
# Warm-start from a prior final.pt (shape-safe copy of matching tensors)
python 5L.py train --preset small --warmstart_from ckpts_joint/final.pt
# Full resume (optimizer, scaler, seen tokens, timers)
python 5L.py train --resume ckpts_joint/step00050000.pt
```
Progressive block growth:
```bash
python 5L.py train \
--preset small \
--auto_grow \
--grow_plan "576,640,768,896,1024" \
--grow_every_steps 50000
```
FP8 fast path:
```bash
# Try FP8; if not supported, fall back to bf16
python 5L.py train --preset small --fp8-only --fp8-fallback
```
### 3) Inference
```bash
python 5L.py infer \
--mode ar \
--ckpt ckpts_joint/final.pt \
--preset small \
--prompt "Explain ALiBi in simple terms." \
--max_new 120 \
--top_p 0.9 --top_k 50 \
--repetition_penalty 1.1 \
--no_repeat_ngram_size 3
```
Greedy decode:
```bash
python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
--prompt "What is progressive block growth in training?" --greedy --max_new 80
```
FP8 during decode (if supported):
```bash
python 5L.py infer --mode ar --ckpt ckpts_joint/final.pt --preset small \
--prompt "Summarize transformer attention variants." --fp8-only --fp8-fallback
```
## Presets
```text
small : d=512, layers=8, heads=16, rank=64
smallx2 : d=512, layers=16, heads=16, rank=64
base : d=768, layers=12, heads=24, rank=96
```
Use `--x2` during training to double the layer count of an inferred previous config.
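For reference, the preset table maps onto something like the following dictionary (the names `PRESETS` and `apply_x2` are illustrative, not the script's actual identifiers):

```python
# Illustrative preset table; keys mirror the README's d/layers/heads/rank columns.
PRESETS = {
    "small":   {"d": 512, "layers": 8,  "heads": 16, "rank": 64},
    "smallx2": {"d": 512, "layers": 16, "heads": 16, "rank": 64},
    "base":    {"d": 768, "layers": 12, "heads": 24, "rank": 96},
}

def apply_x2(cfg):
    """--x2 doubles the layer count of the inferred previous config."""
    out = dict(cfg)
    out["layers"] *= 2
    return out
```

Doubling `small` this way reproduces the `smallx2` shape exactly.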
## Checkpointing & Resume
* **Saves** only by **time interval** (`--save_every_sec`, default 24h) to avoid step-based drift.
* `final.pt` includes: core, AR head, optimizer, AMP scaler, cfg, RNG states, and metadata.
* **Resume** with `--resume <path>` to restore optimizer/scaler/wall-clock cadence.
* **Warm start** only copies shape-matched tensors (safe if your topology changed).
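A shape-safe warm start amounts to merging only the tensors whose name and shape both match; a minimal sketch under that assumption (the helper name is hypothetical):

```python
import torch

def shape_safe_merge(own_state, ckpt_state):
    """Return own_state updated with every checkpoint tensor whose key
    and shape match; mismatched or missing tensors are skipped."""
    matched = {
        k: v for k, v in ckpt_state.items()
        if k in own_state and own_state[k].shape == v.shape
    }
    merged = dict(own_state)
    merged.update(matched)
    return merged, sorted(matched)
```

The merged dict can then be loaded with `model.load_state_dict(merged)`; returning the matched keys makes it easy to log what was actually copied.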
Artifacts:
* `ckpts_joint/stepXXXXXXXX.pt`
* `ckpts_joint/latest.json` with canonical latest path and step
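Time-based (rather than step-based) saving reduces to a monotonic-clock comparison; a minimal sketch of the cadence check (function name hypothetical, default mirroring the 24 h `--save_every_sec` default):

```python
import time

def should_save(last_save, every_sec=86400, now=None):
    """True once every_sec seconds of wall clock have elapsed since the
    last save; uses a monotonic clock so system-time jumps cannot skip
    or duplicate saves."""
    if now is None:
        now = time.monotonic()
    return now - last_save >= every_sec
```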
## Data
Default streaming dataset:
* `cerebras/SlimPajama-627B` (train split, streaming enabled).
Replace `--source` with any `datasets`-compatible corpus that yields `{"text": ...}`.
EOS handling: if the tokenizer's `eos_token_id` is missing, `sep_token_id` is used instead; if a sample doesn't end with EOS, one is appended.
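The EOS fallback described above can be sketched in pure Python (helper name hypothetical):

```python
def ensure_eos(ids, eos_id, sep_id=None):
    """Append a terminator if the sample lacks one; fall back to
    sep_token_id when the tokenizer has no eos_token_id."""
    tok = eos_id if eos_id is not None else sep_id
    if tok is None:
        raise ValueError("tokenizer has neither EOS nor SEP token")
    if not ids or ids[-1] != tok:
        return list(ids) + [tok]
    return list(ids)
```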
## Sampling controls
* `--temperature`, `--top_k`, `--top_p`, `--min_p`
* `--repetition_penalty`, `--presence_penalty`, `--frequency_penalty`, `--penalty_last_n`
* `--no_repeat_ngram_size`
Greedy mode (`--greedy`) overrides sampling.
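As an illustration of the nucleus (top-p) control listed above, here is a minimal filter over 1-D logits; this is a common textbook formulation, not necessarily the script's exact implementation:

```python
import torch

def top_p_filter(logits: torch.Tensor, top_p: float = 0.9) -> torch.Tensor:
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p; mask the rest to -inf (1-D logits, illustrative)."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum > top_p
    # Shift right so the first token that crosses the threshold is kept.
    remove[1:] = remove[:-1].clone()
    remove[0] = False
    filtered = logits.clone()
    filtered[sorted_idx[remove]] = float("-inf")
    return filtered
```

Sampling then proceeds from `torch.softmax(filtered, dim=-1)`; top-k and min-p filters compose the same way before the final softmax.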
## FP8 / AMP
* `--fp8-only` attempts `float8_e4m3fn` autocast
* `--fp8-fallback` continues with bf16 if FP8 unsupported
* Otherwise use `--amp` for bf16/fp16 autocast
* `torch.backends.cuda.matmul.allow_tf32=True` is enabled when available
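The flag interplay above roughly amounts to the following dtype selection (function name hypothetical; note that `torch.autocast` itself does not accept FP8 dtypes, so the script's real FP8 path necessarily differs in detail):

```python
import torch

def pick_compute_dtype(fp8_only=False, fp8_fallback=False):
    """Illustrative dtype selection mirroring --fp8-only/--fp8-fallback:
    try FP8, otherwise fall back to bf16 (or raise if no fallback)."""
    fp8 = getattr(torch, "float8_e4m3fn", None)  # present in torch >= 2.1
    if fp8_only:
        if fp8 is not None and torch.cuda.is_available():
            return fp8
        if not fp8_fallback:
            raise RuntimeError("FP8 requested but not supported on this build")
    return torch.bfloat16
```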
## OOM backoff & block growth
* On CUDA OOM, the script **halves** `BLOCK` (down to 128), empties cache, and retries the step.
* With `--auto_grow`, the script periodically attempts to **increase** `BLOCK` along your `--grow_plan`.
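Both behaviours reduce to simple block-size transitions; a sketch under the defaults stated above (function names hypothetical):

```python
def block_after_oom(block, floor=128):
    """Halve BLOCK on CUDA OOM, never dropping below the floor."""
    return max(block // 2, floor)

def block_after_grow(block, plan=(576, 640, 768, 896, 1024)):
    """With --auto_grow, step BLOCK up to the next entry of --grow_plan."""
    for candidate in plan:
        if candidate > block:
            return candidate
    return block
```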
## Token targets (Chinchilla-style)
If `--target_tokens` is unspecified, the script computes `25 × (enabled parameters)` using **all** trainable params (core + AR head). This provides a rough target for total tokens to consume.
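Counting enabled parameters and applying the 25x multiplier looks roughly like this (helper name hypothetical):

```python
def chinchilla_target_tokens(modules, ratio=25):
    """Target tokens = ratio x all trainable parameters across the
    given modules (core + AR head in 5L.py's case)."""
    n_params = sum(
        p.numel() for m in modules for p in m.parameters() if p.requires_grad
    )
    return ratio * n_params
```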
## Repro tips
* Pin a specific tokenizer via `TOKENIZER_ID`
* Log your `--preset`, `--block`, and `--grow_plan`
* Keep `save_every_sec` stable between resumes for monotonic cadence
* Record CUDA/cuDNN versions in your run logs for reproducibility
## Limitations
* AR-only trainer (no encoder-decoder, no multimodal)
* Low-rank MHA path; FlashAttention not included
* Single-GPU by default; multi-GPU DDP not wired in this file
* Safety/guardrails are out of scope here (this is a trainer, not a hosted chat product)
## Roadmap (planned)
* Optional DDP with NCCL/RCCL/HCCL backends
* FlashAttention path when available across vendors
* Export helpers (Safetensors, GGUF) for downstream serving
## Responsible Use
* Ensure your dataset usage complies with its license and applicable laws.
* Models trained with this script can generate incorrect or biased outputs; evaluate and align them according to your deployment requirements.
## Citation
If this script or training pipeline helps your work, consider citing the repo:
```bibtex
@software{OpenTransformer_AGILLM2_fast_training_2025,
  title  = {AGILLM2-fast-training: Single-file AR-only trainer/decoder (5L.py)},
  author = {OpenTransformers},
  year   = {2025},
  url    = {https://huggingface.co/OpenTransformer/AGILLM2-fast-training}
}
```
---
**Support / Contracts**
We provide **custom development** and **end-to-end training** services (data prep → training → evaluation → deployment).
Email: **[OpenTransformers@proton.me](mailto:OpenTransformers@proton.me)**
Org page: [https://huggingface.co/OpenTransformer](https://huggingface.co/OpenTransformer)