# AGILLM-3 Large (698M)
**AR+SAT Joint Training** – a novel architecture that trains an autoregressive head and a semi-autoregressive head simultaneously, enabling faster parallel inference.
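
As a rough illustration of the idea (not the actual `n.py` implementation; the head layout, block size, and equal loss weighting are all assumptions), joint training sums an AR next-token loss and a SAT multi-token loss computed from the same hidden states:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointARSATHeads(nn.Module):
    """Hypothetical joint head: the AR head predicts the next token,
    the SAT head predicts a block of future tokens in parallel."""

    def __init__(self, d_model: int, vocab_size: int, sat_block: int = 4):
        super().__init__()
        self.ar_head = nn.Linear(d_model, vocab_size)
        # SAT head emits logits for `sat_block` future positions at once.
        self.sat_head = nn.Linear(d_model, vocab_size * sat_block)
        self.sat_block = sat_block
        self.vocab_size = vocab_size

    def forward(self, hidden: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); targets: (batch, seq) next-token ids
        B, T, _ = hidden.shape
        ar_loss = F.cross_entropy(
            self.ar_head(hidden).view(B * T, -1), targets.view(-1)
        )
        # SAT: each position predicts the next `sat_block` tokens; drop the
        # tail positions that have no full target block.
        k, usable = self.sat_block, T - self.sat_block
        sat_logits = self.sat_head(hidden[:, :usable]).view(B, usable, k, -1)
        sat_targets = torch.stack(
            [targets[:, i : i + usable] for i in range(k)], dim=2
        )
        sat_loss = F.cross_entropy(
            sat_logits.reshape(-1, self.vocab_size), sat_targets.reshape(-1)
        )
        return ar_loss + sat_loss  # equal weighting is an assumption
```

At inference time the SAT head can propose several tokens per forward pass, which is where the parallel speedup comes from.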
## Model Details
| Parameter | Value |
|-----------|-------|
| Parameters | 698M |
| Architecture | Transformer with Expansion Rank |
| d_model | 1024 |
| Layers | 24 |
| Heads | 16 |
| Expansion Rank | 128 (2x ratio) |
| Tokenizer | DeepSeek-V3.2 (128,815 vocab) |
| Training Target | 35.76B tokens (51.2x Chinchilla) |
| Context Length | 1122 tokens |
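
For reference, the table maps onto a configuration along these lines. This dataclass is purely illustrative; the field names are assumptions, not the actual structure in `n.py`:

```python
from dataclasses import dataclass

@dataclass
class LargePreset:
    """Illustrative mirror of the Model Details table (field names assumed)."""
    d_model: int = 1024
    n_layers: int = 24
    n_heads: int = 16                   # head_dim = 1024 / 16 = 64
    expansion_rank: int = 128
    vocab_size: int = 128_815           # DeepSeek-V3.2 tokenizer
    context_length: int = 1122
    train_tokens: int = 35_760_000_000  # 35.76B target
```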
## Training
```bash
# Minimal run (uses sane defaults)
python n.py train --preset large

# Resume from checkpoint
python n.py train --preset large --resume ckpts/latest.pt

# Inference
python n.py infer --mode ar --ckpt ckpts/pretrain_step00176907.pt --prompt "Hello" --max_new 100
```
## Defaults Baked In
- `--max_ckpts 3` – Auto-prune old checkpoints
- `--chilla_max_double True` – Double Chinchilla (51.2x tokens)
- `--after_sft_steps 80000` – 80K SFT steps with chat format
- Auto HF upload on each checkpoint save (pruning and upload are sketched together below)
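
A sketch of how `--max_ckpts` pruning and the upload hook could fit together. `HfApi.upload_file` is the real `huggingface_hub` call, but the function name, glob pattern, and repo id here are placeholders, not the code in `hf_upload.py`:

```python
from pathlib import Path
from huggingface_hub import HfApi

def save_hook(new_ckpt: Path, ckpt_dir: Path, max_ckpts: int = 3,
              repo_id: str = "your-org/agillm-3-large") -> None:  # repo_id is a placeholder
    """Keep only the newest `max_ckpts` checkpoints, then mirror the new one to HF."""
    ckpts = sorted(ckpt_dir.glob("pretrain_step*.pt"),
                   key=lambda p: p.stat().st_mtime)
    for old in ckpts[:-max_ckpts]:  # everything older than the newest `max_ckpts`
        old.unlink()
    HfApi().upload_file(
        path_or_fileobj=str(new_ckpt),
        path_in_repo=new_ckpt.name,
        repo_id=repo_id,
    )
```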
## Hot Config
Edit `hot_config.json` mid-training to change settings without a restart:
```json
{"save_every_sec": 43200, "pause_training": false}
```
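
One way such a reload can work is to re-read the file periodically inside the training loop; the polling sketch below is an assumption about the mechanism, not the actual code in `n.py`:

```python
import json
import time
from pathlib import Path

HOT_CONFIG = Path("hot_config.json")
DEFAULTS = {"save_every_sec": 43200, "pause_training": False}

def read_hot_config() -> dict:
    """Merge hot_config.json over the defaults; fall back to the defaults
    if the file is missing or half-written, so a bad edit never crashes."""
    try:
        return {**DEFAULTS, **json.loads(HOT_CONFIG.read_text())}
    except (OSError, json.JSONDecodeError):
        return dict(DEFAULTS)

# Inside the training loop (illustrative): honor the pause flag each step.
cfg = read_hot_config()
while cfg["pause_training"]:
    time.sleep(5)               # idle until the flag is flipped back
    cfg = read_hot_config()
```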
## Files
- `n.py` – Main trainer with AR+SAT joint training
- `rotating_log.py` – Dual rotating log (a sketch follows this list)
- `hf_upload.py` – Checkpoint uploader
- `tokenizer/` – DeepSeek-V3.2 tokenizer
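
A "dual rotating log" presumably means two size-capped files (for example, a verbose debug log plus a metrics log); the sketch below uses the standard-library `RotatingFileHandler`, and everything in it (file names, sizes, the two-logger split) is an assumption about `rotating_log.py`:

```python
import logging
from logging.handlers import RotatingFileHandler

def make_logger(name: str, path: str, level: int) -> logging.Logger:
    """One size-capped log file that rotates through 3 backups."""
    handler = RotatingFileHandler(path, maxBytes=10_000_000, backupCount=3)
    handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return logger

debug_log = make_logger("train.debug", "train_debug.log", logging.DEBUG)
metric_log = make_logger("train.metrics", "train_metrics.log", logging.INFO)
```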
## License
Apache 2.0
## Author
OpenTransformers Ltd (UK Company #16940923)