OpenTransformer committed on
Commit eb981c3 · verified · 1 Parent(s): 9bdce90

Add README with model details

Files changed (1)
  1. README.md +40 -32
README.md CHANGED
@@ -1,51 +1,59 @@
- ---
- license: mit
- tags:
- - pytorch
- - transformer
- - language-model
- - agillm
- - ar-sat
- - joint-training
- ---
-
  # AGILLM-3 Large (698M)

- A 698 million parameter language model trained using novel **AR+SAT joint training** - combining autoregressive and semi-autoregressive objectives in a single forward pass.

- ## Architecture

  | Parameter | Value |
  |-----------|-------|
  | d_model | 1024 |
- | layers | 24 |
- | heads | 16 |
- | rank | 128 |
- | total params | 698,389,088 |

  ## Training

- - **Dataset:** OpenWebText + WikiText
- - **Target:** 2.7M pretrain steps + 300k SFT steps
- - **Hardware:** RTX 3090 24GB
- - **Framework:** Custom PyTorch trainer

- ## Checkpoints

- Milestone checkpoints saved every 100k steps:
- - `checkpoints/step_100000.pt`
- - `checkpoints/step_200000.pt`
- - ... etc

- ## Research Hypothesis

- Joint AR+SAT training provides ~2x learning efficiency compared to isolated training. The SAT decoder's parallel prediction forces holistic understanding while AR maintains autoregressive generation capability.
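
As a concrete (but hypothetical) reading of the joint objective above, the sketch below runs a shared trunk once and computes both losses from the same hidden states: an AR head predicts the next token at each position, and a SAT head predicts a block of future tokens in parallel. All names (`trunk`, `ar_head`, `sat_head`, `BLOCK`) and the toy sizes are assumptions for illustration, not the actual `n.py` implementation.

```python
import torch
import torch.nn.functional as F

# Toy sizes; the real model uses d_model=1024 and a ~128k vocabulary.
BLOCK, VOCAB, D = 4, 1000, 64

trunk = torch.nn.Embedding(VOCAB, D)          # stand-in for the shared transformer trunk
ar_head = torch.nn.Linear(D, VOCAB)           # autoregressive head: next token only
sat_head = torch.nn.Linear(D, VOCAB * BLOCK)  # semi-autoregressive head: next BLOCK tokens

tokens = torch.randint(0, VOCAB, (2, 32))     # toy batch of token ids (batch, seq)
h = trunk(tokens[:, :-BLOCK])                 # one forward pass over the prefix positions

# AR loss: position i predicts token i+1.
ar_logits = ar_head(h)                                        # (batch, seq-BLOCK, VOCAB)
ar_loss = F.cross_entropy(ar_logits.flatten(0, 1),
                          tokens[:, 1:1 - BLOCK].flatten())

# SAT loss: position i predicts tokens i+1 .. i+BLOCK in parallel.
sat_logits = sat_head(h).view(*h.shape[:2], BLOCK, VOCAB)     # (batch, seq-BLOCK, BLOCK, VOCAB)
sat_targets = tokens[:, 1:].unfold(1, BLOCK, 1)               # sliding windows of BLOCK targets
sat_loss = F.cross_entropy(sat_logits.reshape(-1, VOCAB),
                           sat_targets.reshape(-1))

# Both objectives share the same forward pass; one backward updates everything.
(ar_loss + sat_loss).backward()
```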
 
- ## Author

- **OpenTransformers Ltd** (Company #16940923)
- Scott Edwards - Founder/Director

  ## License

- MIT

  # AGILLM-3 Large (698M)

+ **AR+SAT Joint Training**: a novel architecture that trains both autoregressive and semi-autoregressive heads simultaneously, enabling faster parallel inference.
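
"Faster parallel inference" here presumably means the SAT head emits several tokens per forward pass instead of one. The sketch below shows that decoding pattern with a dummy stand-in model; `DummySATModel`, its `block` size, and the greedy loop are illustrative assumptions rather than the interface exposed by this repo.

```python
import torch

class DummySATModel(torch.nn.Module):
    """Stand-in network whose SAT head scores the next `block` tokens at once."""
    def __init__(self, vocab=1000, d_model=64, block=4):
        super().__init__()
        self.block, self.vocab = block, vocab
        self.embed = torch.nn.Embedding(vocab, d_model)
        self.sat_head = torch.nn.Linear(d_model, vocab * block)

    def forward(self, ids):                       # ids: (batch, seq)
        h = self.embed(ids).mean(dim=1)           # toy pooling over the context
        return self.sat_head(h).view(-1, self.block, self.vocab)

@torch.no_grad()
def sat_generate(model, ids, max_new=16):
    """Greedy semi-autoregressive decoding: append `block` tokens per step."""
    start = ids.shape[1]
    while ids.shape[1] - start < max_new:
        block_logits = model(ids)                 # (batch, block, vocab) in ONE pass
        ids = torch.cat([ids, block_logits.argmax(dim=-1)], dim=1)
    return ids

out = sat_generate(DummySATModel(), torch.tensor([[1, 2, 3]]))
print(out.shape)  # (1, 3 + 16): four decoding steps instead of sixteen
```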
 
+ ## Model Details

  | Parameter | Value |
  |-----------|-------|
+ | Parameters | 698M |
+ | Architecture | Transformer with Expansion Rank |
  | d_model | 1024 |
+ | Layers | 24 |
+ | Heads | 16 |
+ | Expansion Rank | 128 (2x ratio) |
+ | Tokenizer | DeepSeek-V3.2 (128,815 vocab) |
+ | Training Target | 35.76B tokens (51.2x Chinchilla) |
+ | Context Length | 1122 tokens |
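
A quick check of the training-target row, assuming "51.2x Chinchilla" means 51.2 training tokens per parameter (i.e. double a ~25.6 tokens-per-parameter baseline, matching the `--chilla_max_double` default listed below); the 698,389,088 parameter count is taken from the original card:

```python
# Sanity-check the 35.76B-token target, assuming "51.2x Chinchilla" means
# 51.2 training tokens per parameter (this reading is an assumption).
params = 698_389_088                          # total parameter count from the original card
tokens_per_param = 51.2
target_tokens = params * tokens_per_param
print(f"{target_tokens / 1e9:.2f}B tokens")   # -> 35.76B tokens
```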
 
  ## Training

+ ```bash
+ # Minimal run (uses sane defaults)
+ python n.py train --preset large
+
+ # Resume from checkpoint
+ python n.py train --preset large --resume ckpts/latest.pt
+
+ # Inference
+ python n.py infer --mode ar --ckpt ckpts/pretrain_step00176907.pt --prompt "Hello" --max_new 100
+ ```

+ ## Defaults Baked In

+ - `--max_ckpts 3` — Auto-prune old checkpoints
+ - `--chilla_max_double True` — Double Chinchilla (51.2x tokens)
+ - `--after_sft_steps 80000` — 80K SFT steps with chat format
+ - Auto HF upload on each checkpoint save

+ ## Hot Config
+
+ Edit `hot_config.json` mid-training without restart:
+ ```json
+ {"save_every_sec": 43200, "pause_training": false}
+ ```
+
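
The card does not spell out how the hot reload works; one minimal way to honor such a file from inside a training loop is sketched below (re-reading it each step and merging over defaults). The function and fallback behaviour are assumptions based on the example keys above, not necessarily what `n.py` does.

```python
import json
import time
from pathlib import Path

DEFAULTS = {"save_every_sec": 43200, "pause_training": False}
HOT_CONFIG = Path("hot_config.json")

def read_hot_config():
    """Re-read hot_config.json, falling back to defaults if absent or mid-edit."""
    try:
        return {**DEFAULTS, **json.loads(HOT_CONFIG.read_text())}
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(DEFAULTS)

for step in range(1000):            # stand-in for the real training loop
    cfg = read_hot_config()         # picks up edits without restarting the process
    while cfg["pause_training"]:    # idle until the flag is flipped back to false
        time.sleep(10)
        cfg = read_hot_config()
    # ... one optimizer step here; checkpoint every cfg["save_every_sec"] seconds ...
```
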
+ ## Files

+ - `n.py` — Main trainer with AR+SAT joint training
+ - `rotating_log.py` — Dual rotating log
+ - `hf_upload.py` — Checkpoint uploader
+ - `tokenizer/` — DeepSeek-V3.2 tokenizer

  ## License

+ Apache 2.0
+
+ ## Author
+
+ OpenTransformers Ltd (UK Company #16940923)