Add README with model details
README.md
# AGILLM-3 Large (698M)

**AR+SAT Joint Training** — A novel architecture that trains autoregressive (AR) and semi-autoregressive (SAT) heads simultaneously, enabling faster parallel inference.
## Model Details

| Parameter | Value |
|-----------|-------|
| Parameters | 698M |
| Architecture | Transformer with Expansion Rank |
| d_model | 1024 |
| Layers | 24 |
| Heads | 16 |
| Expansion Rank | 128 (2x ratio) |
| Tokenizer | DeepSeek-V3.2 (128,815 vocab) |
| Training Target | 35.76B tokens (51.2x Chinchilla) |
| Context Length | 1122 tokens |

## Training

```bash
# Minimal run (uses sane defaults)
python n.py train --preset large

# Resume from checkpoint
python n.py train --preset large --resume ckpts/latest.pt

# Inference
python n.py infer --mode ar --ckpt ckpts/pretrain_step00176907.pt --prompt "Hello" --max_new 100
```

## Defaults Baked In

- `--max_ckpts 3` — Auto-prune old checkpoints
- `--chilla_max_double True` — Double Chinchilla (51.2x tokens; see the arithmetic check after this list)
- `--after_sft_steps 80000` — 80K SFT steps with chat format
- Auto HF upload on each checkpoint save
## Hot Config

Edit `hot_config.json` mid-training without restart:

```json
{"save_every_sec": 43200, "pause_training": false}
```
## Files

- `n.py` — Main trainer with AR+SAT joint training
- `rotating_log.py` — Dual rotating log
- `hf_upload.py` — Checkpoint uploader (see the sketch after this list)
- `tokenizer/` — DeepSeek-V3.2 tokenizer
## License
Apache 2.0
## Author
OpenTransformers Ltd (UK Company #16940923)