Commit 61b9671 (verified) by OpenTransformer
1 Parent(s): cf7f479

Upload AGILLM3_technical_documentation.md with huggingface_hub

  1. AGILLM3_technical_documentation.md +468 -0
AGILLM3_technical_documentation.md ADDED
@@ -0,0 +1,468 @@
# AGILLM-3: Technical Documentation
## A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training

**Scott Bisset**
OpenTransformers Ltd
January 2026

---

## Abstract

This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via orthogonally initialized learned projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models; AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community.

---

## 1. Motivation

### 1.1 What This Is

AGILLM-3 is a research project exploring:

1. **Tuneable attention rank**: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension?

2. **Joint AR+SAT training**: Can a model learn both next-token prediction AND multi-token speculation simultaneously?

### 1.2 What This Isn't

This is not:
- A frontier model
- A competitor to GPT-4/Claude/Gemini
- A claim that small models can match large ones
- A business

AGI already exists. This is documentation, not disruption.

---

## 2. Architecture

### 2.1 Overview

```
Input tokens

Embedding (vocab → d)

[Block × L layers]
├── LayerNorm → TuneableAttentionMHA → +residual
└── LayerNorm → FFN (d → 4d → d) → +residual

Final LayerNorm

├── ARHead (next token prediction)
└── SATHead (multi-token speculation)
```

### 2.2 Tuneable Attention (The Novel Bit)

Standard multi-head attention computes:

```
Q = XWq, K = XWk, V = XWv
Attention = softmax(QKᵀ/√d_k) · V
```

Where Q, K have shape [batch, seq, heads, d_k].

**AGILLM-3's modification:**

```python
import torch
import torch.nn as nn

class TuneableAttentionMHA(nn.Module):
    def __init__(self, d: int, h: int, r: int):
        super().__init__()
        assert d % h == 0, "d must be divisible by the number of heads"
        self.h, self.d_k = h, d // h
        # r = rank (the tuneable parameter)
        self.U = nn.Parameter(torch.randn(self.d_k, r))
        nn.init.orthogonal_(self.U)

    def _proj_qk(self, x):
        # Split heads, then project through U:
        # [batch, seq, d] → [batch, heads, seq, d_k] @ [d_k, r] → [batch, heads, seq, r]
        B, N, _ = x.shape
        return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U
```

The attention computation becomes:

```
Q' = Q @ U    # [batch, heads, seq, r]
K' = K @ U    # [batch, heads, seq, r]
Attention = softmax(Q'K'ᵀ/√d_k) · V
```

**What this means:**

| Regime | Condition | Effect |
|--------|-----------|--------|
| Compression | r < d_k | Q-K similarity computed in lower-dim space |
| Identity | r = d_k | Equivalent to standard attention while U is orthogonal (U Uᵀ = I), e.g. at initialization |
| Expansion | r > d_k | Q-K similarity computed in higher-dim space |

The presets encode this as ratios:
- `nano_1x`: r = d_k (standard)
- `nano_3x`: r = 3 × d_k (expansion)
- `nano_12x`: r = 12 × d_k (heavy expansion)

**Hypothesis being tested:** Does expanding the Q-K interaction space improve attention quality? The orthogonal initialization makes U start as a rotation/reflection when r = d_k and as a semi-orthogonal map otherwise, so for r ≥ d_k no information is destroyed at initialization.

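Because the same U is applied to both Q and K, the scores can be written as Q (U Uᵀ) Kᵀ, so the d_k × d_k matrix U Uᵀ is what shapes the similarity. The following is a minimal numerical sketch of what that implies at initialization. It is not taken from `n.py`; the helper name `scores_via_u` and the toy sizes are assumptions chosen for illustration. With an orthogonal init, r ≥ d_k reproduces the standard scores exactly (differences can only appear once training moves U away from orthogonality), while r < d_k leaves U Uᵀ with rank r.

```python
import torch
import torch.nn as nn

def scores_via_u(q, k, u):
    """Unscaled attention scores after projecting q and k through the same u."""
    return (q @ u) @ (k @ u).transpose(-1, -2)

torch.manual_seed(0)
d_k = 16
q = torch.randn(2, 4, 5, d_k, dtype=torch.float64)  # [batch, heads, seq, d_k]
k = torch.randn(2, 4, 5, d_k, dtype=torch.float64)
standard = q @ k.transpose(-1, -2)

results = {}
for r in (8, 16, 48):  # compression, identity, expansion
    u = torch.empty(d_k, r, dtype=torch.float64)
    nn.init.orthogonal_(u)
    gram = u @ u.T  # the [d_k, d_k] matrix that actually shapes the scores
    results[r] = {
        "gram_rank": int(torch.linalg.matrix_rank(gram)),
        "matches_standard": torch.allclose(scores_via_u(q, k, u), standard, atol=1e-8),
    }
    print(r, results[r])
```
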
### 2.3 Positional Encoding: ALiBi

AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions:

```python
import torch

def alibi_bias(n_heads, n_tokens):
    # Each head gets a different slope: 2^(-8/n_heads), 2^(-16/n_heads), ...
    # Attention score penalized by distance: score -= slope * |i - j|
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(n_tokens)
    distance_matrix = (pos[:, None] - pos[None, :]).abs()  # [n_tokens, n_tokens]
    return -slopes[:, None, None] * distance_matrix        # [n_heads, n_tokens, n_tokens]
```

ALiBi was chosen for:
- Zero additional parameters
- Good length extrapolation
- Simplicity

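As a worked illustration of the slope schedule described in this section, the sketch below computes the per-head slopes for 8 heads (0.5, 0.25, ..., 1/256) and shows how the bias alone shapes a causal softmax when all content scores are equal. The toy sizes and variable names are assumptions for illustration, not values from the training code.

```python
import torch

n_heads, n_tokens = 8, 5
# Geometric slopes 2^(-8/n_heads), 2^(-16/n_heads), ...
slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])

pos = torch.arange(n_tokens)
distance = (pos[:, None] - pos[None, :]).abs().float()  # |i - j|
bias = -slopes[:, None, None] * distance                # [heads, query, key]

# With all content scores equal, the causal softmax depends only on the bias.
causal = torch.tril(torch.ones(n_tokens, n_tokens, dtype=torch.bool))
weights = torch.softmax(bias.masked_fill(~causal, float("-inf")), dim=-1)

print(slopes.tolist())  # per-head slopes
print(weights[0, -1])   # head 0, last query: weight grows toward the nearest key
```

Heads with steep slopes focus on nearby tokens, while heads with shallow slopes attend almost uniformly over the context.
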
### 2.4 Block Structure

Each transformer block:

```python
class Block(nn.Module):
    def forward(self, x, mask):
        # Pre-norm architecture
        x = x + self.mha(self.ln1(x), mask)
        x = x + self.ff(self.ln2(x))
        return x
```

FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d)

### 2.5 Model Configurations

From the presets in code:

| Preset | d_model | Layers | Heads | Rank | ~Params |
|--------|---------|--------|-------|------|---------|
| nano_3x | 64 | 2 | 4 | 48 | ~200K |
| micro_12x | 128 | 4 | 8 | 192 | ~2M |
| small | 512 | 8 | 16 | 64 | ~50M |
| base | 768 | 12 | 24 | 96 | ~125M |
| large | 1024 | 24 | 16 | 128 | ~698M |

The "large" preset at 698M parameters is the primary AGILLM-3 configuration.

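The preset names only spell out the expansion ratio for the small models, so it can help to derive it for the rest. The sketch below computes d_k = d_model / heads and rank / d_k from the table values; the function name `expansion_ratio` is an assumption for illustration. Under this arithmetic the unsuffixed presets come out at 2x (`small`, `large`) and 3x (`base`).

```python
# Subset of the presets listed in this section: (d_model, heads, rank)
presets = {
    "nano_3x":   (64, 4, 48),
    "micro_12x": (128, 8, 192),
    "small":     (512, 16, 64),
    "base":      (768, 24, 96),
    "large":     (1024, 16, 128),
}

def expansion_ratio(d_model, heads, rank):
    """Return (d_k, rank / d_k) for one preset."""
    d_k = d_model // heads
    return d_k, rank / d_k

ratios = {name: expansion_ratio(*cfg) for name, cfg in presets.items()}
for name, (d_k, ratio) in ratios.items():
    print(f"{name:10s} d_k={d_k:3d} rank/d_k={ratio:g}")
```
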
---

## 3. Joint AR+SAT Training

### 3.1 The Idea

Standard language models train only on next-token prediction (autoregressive, AR).

AGILLM-3 trains on BOTH:

1. **AR objective**: Predict token t+1 from tokens 1..t
2. **SAT objective**: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive)

### 3.2 Masking

**AR mask** (standard causal):
```
Position can attend to: itself and all previous positions
[1 0 0 0]
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]
```

**SAT mask** (block-wise):
```
SAT_BLOCK = 2
Positions in same block can attend to each other AND all previous blocks

Block 0: positions 0,1 can see each other
Block 1: positions 2,3 can see each other + block 0
etc.
```

```python
import torch

def sat_mask(n, block=2):
    idx = torch.arange(n)
    grp = idx // block
    # Rows are queries, columns are keys: allow the same block OR previous blocks
    allow = grp[:, None] >= grp[None, :]
    return torch.zeros(n, n).masked_fill(~allow, float("-inf"))
```

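The two masks are closely related: grouping positions into blocks of size 1 gives the standard causal mask, and larger blocks give the SAT mask. The sketch below makes that concrete; the function name `block_causal_mask` is an assumption for illustration and not a name from `n.py`.

```python
import torch

def block_causal_mask(n, block=1):
    """Additive mask: 0 where attention is allowed, -inf elsewhere."""
    grp = torch.arange(n) // block
    allow = grp[:, None] >= grp[None, :]  # rows are queries, columns are keys
    return torch.zeros(n, n).masked_fill(~allow, float("-inf"))

ar = block_causal_mask(4, block=1)   # standard causal mask
sat = block_causal_mask(4, block=2)  # SAT mask with SAT_BLOCK = 2

print((ar == 0).int())   # lower-triangular pattern
print((sat == 0).int())  # 2x2 blocks on the diagonal plus everything before them
```
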
### 3.3 Training Loop

Each batch:

```python
# Forward pass 1: AR
h_ar = core(ids, causal_mask(n))
logits_ar = ar_head(h_ar)[:, :-1]
loss_ar = cross_entropy(logits_ar, targets[:, 1:])

# Forward pass 2: SAT
h_sat = core(ids, sat_mask(n))
logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:])
loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1])

# Optional: gate loss (predict how many tokens to emit)
if gate is not None:
    loss_sat += 0.1 * cross_entropy(gate, emit_target)

loss = loss_ar + loss_sat
```

### 3.4 SAT Head with Gating

```python
class SATHead(nn.Module):
    def __init__(self, d, mode="var"):
        super().__init__()
        self.proj = nn.Linear(d, vocab)  # Token prediction
        self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens?
```

The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation.

### 3.5 Why Joint Training?

**Hypothesis:** Training both objectives together might:
1. Improve representation quality (multi-task learning)
2. Enable speculative decoding at inference (predict multiple tokens, verify with AR)
3. Learn confidence estimation via the gate

**Current status:** Experimental. No claims of improvement over AR-only.

---

## 4. Training Infrastructure

### 4.1 Data Pipeline

```python
def token_stream(ds_names, target_tokens, seed, ...):
    """
    Streaming token generator from HuggingFace datasets.
    - Supports multiple comma-separated datasets
    - Auto-rotates through sources
    - Handles chat format (messages key) or raw text
    - Appends EOS tokens
    """
```

Default pretraining sources (from code):
```
OpenTransformer/goddess-crawl
OpenTransformer/agillm-crawl-data
OpenTransformer/web-crawl-2026
OpenTransformer/web-crawl-clean-v2
OpenTransformer/scraped-web-data
OpenTransformer/turbo-crawl
OpenTransformer/sft-data-clean
OpenTransformer/web-crawl-v1
```

### 4.2 Optimizer Configuration

```python
from torch.optim import AdamW

opt = AdamW([
    {"params": core.parameters(), "lr": 5e-5},      # LR_CORE
    {"params": ar_head.parameters(), "lr": 2e-4},   # LR_HEAD
    {"params": sat_head.parameters(), "lr": 2e-4},  # LR_HEAD
])
```

The core and the two heads use separate learning rates: the heads train at 4× the core rate.

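The per-group learning rates can be checked on stand-in modules. This is a minimal sketch under the assumption that `core`, `ar_head` and `sat_head` are ordinary `nn.Module` objects; the tiny `nn.Linear` layers here are placeholders, not the real model.

```python
import torch.nn as nn
from torch.optim import AdamW

LR_CORE, LR_HEAD = 5e-5, 2e-4

# Stand-in modules; the real core and heads live in n.py.
core = nn.Linear(8, 8)
ar_head = nn.Linear(8, 32)
sat_head = nn.Linear(8, 32)

opt = AdamW([
    {"params": core.parameters(), "lr": LR_CORE},
    {"params": ar_head.parameters(), "lr": LR_HEAD},
    {"params": sat_head.parameters(), "lr": LR_HEAD},
])

group_lrs = [group["lr"] for group in opt.param_groups]
print(group_lrs)
```

One practical caveat: PyTorch optimizers reject a parameter that appears in more than one group, so with tied embedding and head weights the shared tensor has to be assigned to exactly one group.
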
### 4.3 Training Features

- **AMP**: Automatic mixed precision (bf16 if available, else fp16)
- **Gradient clipping**: max_norm=1.0
- **Label smoothing**: 0.1
- **Dropout**: 0.1 in attention
- **Checkpointing**: Configurable interval (default 24h), automatic pruning

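The first three features map onto standard PyTorch calls. The sketch below shows one way they could fit together in a single step; it uses a stand-in `nn.Linear` model, and the helper name `pick_amp_dtype` is an assumption for illustration rather than a function from `n.py`.

```python
from contextlib import nullcontext

import torch
import torch.nn as nn
import torch.nn.functional as F

def pick_amp_dtype():
    """Prefer bf16 when the GPU supports it, otherwise fall back to fp16."""
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 10).to(device)  # stand-in for core + head
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)

x = torch.randn(4, 16, device=device)
y = torch.randint(0, 10, (4,), device=device)

# Autocast only on GPU in this sketch; the CPU path runs in fp32.
# An fp16 run would normally also use a gradient scaler (omitted here).
amp_ctx = torch.autocast("cuda", dtype=pick_amp_dtype()) if device == "cuda" else nullcontext()
with amp_ctx:
    loss = F.cross_entropy(model(x), y, label_smoothing=0.1)

loss.backward()
# Returns the total gradient norm measured before clipping.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
opt.zero_grad()
```
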
### 4.4 Chinchilla Scaling

```python
ratio = 51.2 if args.chilla_max_double else 25
param_count = count_params(core, ar_h, sat_h)
target_tokens = int(ratio * param_count)
```

The default token budget is 25 tokens per parameter, a Chinchilla-style ratio; the optional `chilla_max_double` setting raises it to 51.2 tokens per parameter ("double Chinchilla").

For 698M params: ~17.5B tokens default, ~35.7B tokens with double.

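The token budgets quoted above follow directly from the ratio. A small sketch of the arithmetic, with the function name `target_tokens` and its `double` flag being assumptions for illustration:

```python
def target_tokens(param_count, double=False):
    """Token budget as tokens-per-parameter ratio times parameter count."""
    ratio = 51.2 if double else 25
    return int(ratio * param_count)

params = 698_000_000
default_budget = target_tokens(params)
double_budget = target_tokens(params, double=True)
print(f"{default_budget / 1e9:.2f}B tokens")  # 17.45B tokens
print(f"{double_budget / 1e9:.2f}B tokens")   # 35.74B tokens
```

For scale, the ~2.4B tokens reported in the training status section are roughly 14% of the default budget.
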
### 4.5 Hot Config

Runtime dataset switching without restart:

```python
# /workspace/hot_config.json
{"datasets": ["new_dataset_1", "new_dataset_2"]}
```

Trainer checks this file periodically and switches data sources.

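One simple way to implement this kind of periodic check is to re-read the file only when its modification time changes. The sketch below is an assumption about how such polling could work, not the trainer's actual code; the class name `HotConfig` and its `poll` method are hypothetical.

```python
import json
import os

class HotConfig:
    """Re-read a JSON file only when its modification time changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None

    def poll(self):
        """Return the new dataset list if the file changed, else None."""
        try:
            mtime = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return None
        if mtime == self._mtime:
            return None
        self._mtime = mtime
        try:
            with open(self.path) as f:
                return json.load(f).get("datasets")
        except (json.JSONDecodeError, AttributeError):
            return None  # ignore a half-written or malformed file
```

A training loop could call `poll()` every few hundred steps and rebuild its token stream whenever a list comes back.
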
### 4.6 Auto-Grow

Optional feature to increase the block size (context length) during training:

```bash
--auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000
```

Training starts with a smaller context and grows it as training stabilizes.

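To make the flag semantics concrete, here is a minimal sketch of a purely step-indexed schedule. It assumes the trainer advances one entry of the plan every `grow_every_steps` steps and then stays at the last size; the real trainer may apply additional stability checks, and the function name `block_for_step` is hypothetical.

```python
def block_for_step(step, grow_plan, grow_every_steps):
    """Pick the context length for a step from a comma-separated plan."""
    sizes = [int(s) for s in grow_plan.split(",")]
    stage = min(step // grow_every_steps, len(sizes) - 1)  # stay at the last size
    return sizes[stage]

plan = "576,640,768,896,1024,1122"
schedule = [block_for_step(s, plan, 50_000) for s in (0, 49_999, 50_000, 250_000, 2_000_000)]
print(schedule)  # [576, 576, 640, 1122, 1122]
```
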
---

## 5. Inference

### 5.1 AR Mode (Standard)

```bash
python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello"
```

Standard autoregressive generation with KV-cache.

### 5.2 SAT Mode (Speculative)

```bash
python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var
```

Generates SAT_BLOCK tokens at once, optionally using gate to choose stride.

### 5.3 Sampling Parameters

| Parameter | AR Default | SAT Default |
|-----------|------------|-------------|
| temperature | 0.7 | 0.5 |
| top_k | 0 | 30 |
| repetition_penalty | 1.3 | 2.0 |
| presence_penalty | 0.0 | 0.6 |
| frequency_penalty | 0.3 | 1.0 |
| penalty_last_n | 128 | 200 |

SAT mode uses more aggressive penalties to avoid repetition from parallel generation.

---

## 6. Weight Tying

Optional embedding-LM head weight tying:

```python
class ARHead(nn.Module):
    def __init__(self, d, tie_weights=False, embedding_weight=None):
        super().__init__()
        if tie_weights and embedding_weight is not None:
            self.proj = nn.Linear(d, vocab, bias=False)
            self.proj.weight = embedding_weight  # Share weights
```

Reduces parameters by ~vocab × d (significant for large vocab).

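The saving can be verified on toy sizes. In this sketch the vocabulary and width are placeholder values, not the real configuration; it relies on the fact that `nn.Embedding(vocab, d)` and `nn.Linear(d, vocab)` both store a weight of shape [vocab, d], which is what makes the tie possible.

```python
import torch.nn as nn

vocab, d = 1000, 64  # toy sizes, not the real configuration

emb = nn.Embedding(vocab, d)            # weight shape [vocab, d]
proj = nn.Linear(d, vocab, bias=False)  # weight shape [vocab, d]

untied = emb.weight.numel() + proj.weight.numel()
proj.weight = emb.weight                # tie: both modules share one Parameter

# Count each underlying tensor once
unique = {p.data_ptr(): p.numel() for p in list(emb.parameters()) + list(proj.parameters())}
tied = sum(unique.values())
print(untied - tied)                    # parameters saved: vocab * d = 64000
```
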
---

## 7. Current Training Status

As of January 2026:
- Step: 2.2M+
- Tokens seen: ~2.4B
- Preset: large (698M params)
- Training on vast.ai 3090
- Checkpoints every 6 hours

---

## 8. Observations and Notes

### 8.1 Expansion Ratio Effects

Early experiments suggest:
- 1x (standard): baseline behavior
- 3x-6x: slight improvement in attention patterns
- 12x+: diminishing returns, increased compute

Not rigorously benchmarked. Observations only.

### 8.2 AR vs AR+SAT

AR-only mode (`--ar_only`) is available for comparison. Joint training roughly doubles the forward passes per batch (one AR pass plus one SAT pass).

### 8.3 Known Issues

1. SAT inference quality lags AR (expected, since it is a harder task)
2. Gate accuracy is mediocre (it often just predicts "emit 2")
3. Memory usage is higher than for an equivalent AR-only model

---

## 9. Code Location

Primary file: `n.py`

Key classes:
- `TuneableAttentionMHA`: The modified attention
- `Block`: Transformer block
- `Encoder`: Full encoder stack
- `ARHead`, `SATHead`: Output heads
- `token_stream`: Data pipeline
- `_train_phase`: Training loop

---

## 10. License and Citation

Code released under MIT license.

If referencing this work:
```
@misc{agillm3,
  author = {Bisset, Scott},
  title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training},
  year = {2026},
  publisher = {OpenTransformers Ltd}
}
```

---

## Appendix A: Full Preset Table

```python
PRESETS = {
    "femto_1x": dict(d=16, layers=1, heads=1, rank=16),
    "femto_12x": dict(d=16, layers=1, heads=1, rank=192),
    "pico_1x": dict(d=32, layers=1, heads=2, rank=16),
    "pico_12x": dict(d=32, layers=1, heads=2, rank=192),
    "nano_1x": dict(d=64, layers=2, heads=4, rank=16),
    "nano_3x": dict(d=64, layers=2, heads=4, rank=48),
    "nano_12x": dict(d=64, layers=2, heads=4, rank=192),
    "micro_12x": dict(d=128, layers=4, heads=8, rank=192),
    "small": dict(d=512, layers=8, heads=16, rank=64),
    "base": dict(d=768, layers=12, heads=24, rank=96),
    "large": dict(d=1024, layers=24, heads=16, rank=128),
}
```

---

## Appendix B: Example Training Command

```bash
python n.py train \
  --preset large \
  --batch_size 4 \
  --block 1122 \
  --amp \
  --save_every_sec 21600 \
  --save_dir /workspace/ckpts_expansion \
  --max_ckpts 5 \
  --resume /workspace/ckpts_expansion
```

---

*Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM*