# AGILLM-3: Technical Documentation

## A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training

**Scott Bisset**
OpenTransformers Ltd
January 2026

---

## Abstract

This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via learned orthogonal projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models; AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community.

---

## 1. Motivation

### 1.1 What This Is

AGILLM-3 is a research project exploring:

1. **Tuneable attention rank**: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension?
2. **Joint AR+SAT training**: Can a model learn both next-token prediction and multi-token speculation simultaneously?

### 1.2 What This Isn't

This is not:

- A frontier model
- A competitor to GPT-4/Claude/Gemini
- A claim that small models can match large ones
- A business

AGI already exists. This is documentation, not disruption.

---

## 2. Architecture

### 2.1 Overview

```
Input tokens
  ↓
Embedding (vocab → d)
  ↓
[Block × L layers]
  ├── LayerNorm → TuneableAttentionMHA → +residual
  └── LayerNorm → FFN (d → 4d → d) → +residual
  ↓
Final LayerNorm
  ↓
  ├── ARHead (next token prediction)
  └── SATHead (multi-token speculation)
```

### 2.2 Tuneable Attention (The Novel Bit)

Standard multi-head attention computes:

```
Q = XWq, K = XWk, V = XWv
Attention = softmax(QKᵀ/√d_k) · V
```

where Q, K have shape [batch, seq, heads, d_k].
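For reference, the standard computation above can be sketched framework-free in numpy (shapes and names here are illustrative only, not taken from `n.py`):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: [batch, heads, seq, d_k]
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_k)  # [batch, heads, seq, seq]
    return softmax(scores) @ V                            # [batch, heads, seq, d_k]

rng = np.random.default_rng(0)
B, H, N, d_k = 2, 4, 8, 16
Q, K, V = (rng.standard_normal((B, H, N, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (2, 4, 8, 16)
```

The output keeps the input's per-head shape; AGILLM-3's change, described next, only alters the space in which the Q·K scores are computed.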
**AGILLM-3's modification:**

```python
class TuneableAttentionMHA(nn.Module):
    def __init__(self, d: int, h: int, r: int):
        # r = rank (the tuneable parameter)
        super().__init__()
        self.h, self.d_k = h, d // h
        self.U = nn.Parameter(torch.randn(self.d_k, r))
        nn.init.orthogonal_(self.U)

    def _proj_qk(self, x):
        # Project through U:
        # [batch, seq, heads, d_k] @ [d_k, r] → [batch, heads, seq, r]
        B, N, _ = x.shape
        return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U
```

The attention computation becomes:

```
Q' = Q @ U    # [batch, heads, seq, r]
K' = K @ U    # [batch, heads, seq, r]
Attention = softmax(Q'K'ᵀ/√d_k) · V
```

**What this means:**

| Regime | Condition | Effect |
|--------|-----------|--------|
| Compression | r < d_k | Q-K similarity computed in lower-dim space |
| Identity | r = d_k | Recovers standard attention for any orthogonal U (UUᵀ = I), e.g. at init |
| Expansion | r > d_k | Q-K similarity computed in higher-dim space |

The presets encode this as ratios:

- `nano_1x`: r = d_k (standard)
- `nano_3x`: r = 3 × d_k (expansion)
- `nano_12x`: r = 12 × d_k (heavy expansion)

**Hypothesis being tested:** Does expanding the Q-K interaction space improve attention quality? The orthogonal initialization ensures U starts as a norm-preserving map (a rotation/reflection when square), so no information is destroyed at initialization.

### 2.3 Positional Encoding: ALiBi

AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions:

```python
def alibi_bias(n_heads, n_tokens):
    # Each head gets a different slope.
    # Attention score penalized by distance: score -= slope * |i - j|
    slopes = [2 ** (-8 * i / n_heads) for i in range(1, n_heads + 1)]
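    # Worked example (added for illustration): n_heads = 8 gives
    # slopes = [2**-1, 2**-2, ..., 2**-8]; the head with the largest
    # slope (0.5) attends almost only locally, the smallest (~0.004)
    # attends nearly globally.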
    return -slopes * distance_matrix  # slopes broadcast over the |i - j| matrix
```

ALiBi was chosen for:

- Zero additional parameters
- Good length extrapolation
- Simplicity

### 2.4 Block Structure

Each transformer block:

```python
class Block(nn.Module):
    def forward(self, x, mask):
        # Pre-norm architecture
        x = x + self.mha(self.ln1(x), mask)
        x = x + self.ff(self.ln2(x))
        return x
```

The FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d)

### 2.5 Model Configurations

From the presets in code:

| Preset | d_model | Layers | Heads | Rank | ~Params |
|--------|---------|--------|-------|------|---------|
| nano_3x | 64 | 2 | 4 | 48 | ~200K |
| micro_12x | 128 | 4 | 8 | 192 | ~2M |
| small | 512 | 8 | 16 | 64 | ~50M |
| base | 768 | 12 | 24 | 96 | ~125M |
| large | 1024 | 24 | 16 | 128 | ~698M |

The "large" preset at 698M parameters is the primary AGILLM-3 configuration.

---

## 3. Joint AR+SAT Training

### 3.1 The Idea

Standard language models train only on next-token prediction (autoregressive, AR). AGILLM-3 trains on both:

1. **AR objective**: Predict token t+1 from tokens 1..t
2. **SAT objective**: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive)

### 3.2 Masking

**AR mask** (standard causal):

```
Position can attend to: all previous positions
[1 0 0 0]
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]
```

**SAT mask** (block-wise):

```
SAT_BLOCK = 2
Positions in the same block can attend to each other AND all previous blocks
Block 0: positions 0,1 can see each other
Block 1: positions 2,3 can see each other + block 0
etc.
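
Example (added for illustration): SAT_BLOCK = 2, 4 positions, 1 = may attend:
[1 1 0 0]
[1 1 0 0]
[1 1 1 1]
[1 1 1 1]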
```

```python
def sat_mask(n, block=2):
    idx = torch.arange(n)
    grp = idx // block
    # A query may attend to keys in the same block or any earlier block,
    # i.e. allow[q, k] = (grp[k] <= grp[q]).
    allow = grp.unsqueeze(0) <= grp.unsqueeze(1)
    return torch.where(allow, 0.0, float("-inf"))
```

### 3.3 Training Loop

Each batch:

```python
# Forward pass 1: AR
h_ar = core(ids, causal_mask(n))
logits_ar = ar_head(h_ar)[:, :-1]
loss_ar = cross_entropy(logits_ar, targets[:, 1:])

# Forward pass 2: SAT
h_sat = core(ids, sat_mask(n))
logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:])
loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1])

# Optional: gate loss (predict how many tokens to emit)
if gate is not None:
    loss_sat += 0.1 * cross_entropy(gate, emit_target)

loss = loss_ar + loss_sat
```

### 3.4 SAT Head with Gating

```python
class SATHead(nn.Module):
    def __init__(self, d, vocab, mode="var"):
        super().__init__()
        self.proj = nn.Linear(d, vocab)  # Token prediction
        self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens?
```

The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation.

### 3.5 Why Joint Training?

**Hypothesis:** Training both objectives together might:

1. Improve representation quality (multi-task learning)
2. Enable speculative decoding at inference (predict multiple tokens, verify with AR)
3. Learn confidence estimation via the gate

**Current status:** Experimental. No claims of improvement over AR-only.

---

## 4. Training Infrastructure

### 4.1 Data Pipeline

```python
def token_stream(ds_names, target_tokens, seed, ...):
    """
    Streaming token generator from HuggingFace datasets.
    - Supports multiple comma-separated datasets
    - Auto-rotates through sources
    - Handles chat format (messages key) or raw text
    - Appends EOS tokens
    """
```

Default pretraining sources (from code):

```
OpenTransformer/goddess-crawl
OpenTransformer/agillm-crawl-data
OpenTransformer/web-crawl-2026
OpenTransformer/web-crawl-clean-v2
OpenTransformer/scraped-web-data
OpenTransformer/turbo-crawl
OpenTransformer/sft-data-clean
OpenTransformer/web-crawl-v1
```

### 4.2 Optimizer Configuration

```python
opt = AdamW([
    {"params": core.parameters(),     "lr": 5e-5},  # LR_CORE
    {"params": ar_head.parameters(),  "lr": 2e-4},  # LR_HEAD
    {"params": sat_head.parameters(), "lr": 2e-4},
])
```

Separate learning rates are used for the core versus the heads.

### 4.3 Training Features

- **AMP**: Automatic mixed precision (bf16 if available, else fp16)
- **Gradient clipping**: max_norm=1.0
- **Label smoothing**: 0.1
- **Dropout**: 0.1 in attention
- **Checkpointing**: Configurable interval (default 24h), automatic pruning

### 4.4 Chinchilla Scaling

```python
ratio = 51.2 if args.chilla_max_double else 25
param_count = count_params(core, ar_h, sat_h)
target_tokens = int(ratio * param_count)
```

The default targets ~25 tokens per parameter (in the spirit of Chinchilla scaling); the optional 51.2× setting is "double Chinchilla". For 698M params this gives ~17.5B tokens by default and ~35.7B with the double setting.

### 4.5 Hot Config

Runtime dataset switching without restart:

```
# /workspace/hot_config.json
{"datasets": ["new_dataset_1", "new_dataset_2"]}
```

The trainer checks this file periodically and switches data sources.

### 4.6 Auto-Grow

Optional feature to increase block size during training:

```bash
--auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000
```

Training starts with a smaller context and grows it as training stabilizes.

---

## 5. Inference

### 5.1 AR Mode (Standard)

```bash
python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello"
```

Standard autoregressive generation with KV-cache.
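Conceptually, AR generation is a loop that feeds each chosen token back in. A minimal greedy sketch follows; the toy `logits_fn` stands in for the real model, nothing here is taken from `n.py`, and the KV-cache is omitted:

```python
import numpy as np

def greedy_decode(logits_fn, prompt_ids, max_new=5, eos_id=None):
    # logits_fn(ids) -> logit vector over the vocab for the NEXT token.
    ids = list(prompt_ids)
    for _ in range(max_new):
        nxt = int(np.argmax(logits_fn(ids)))  # greedy; sampling would go here
        ids.append(nxt)
        if nxt == eos_id:
            break
    return ids

# Toy "model" over a 10-token vocab: always prefers (last_id + 1) mod 10.
toy = lambda ids: np.eye(10)[(ids[-1] + 1) % 10]
print(greedy_decode(toy, [3], max_new=4))  # [3, 4, 5, 6, 7]
```

The real CLI additionally applies the temperature, top-k, and penalty parameters listed in Section 5.3 before picking each token.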
### 5.2 SAT Mode (Speculative)

```bash
python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var
```

Generates SAT_BLOCK tokens at once, optionally using the gate to choose the stride.

### 5.3 Sampling Parameters

| Parameter | AR Default | SAT Default |
|-----------|------------|-------------|
| temperature | 0.7 | 0.5 |
| top_k | 0 | 30 |
| repetition_penalty | 1.3 | 2.0 |
| presence_penalty | 0.0 | 0.6 |
| frequency_penalty | 0.3 | 1.0 |
| penalty_last_n | 128 | 200 |

SAT mode uses more aggressive penalties to counter the repetition that parallel generation tends to produce.

---

## 6. Weight Tying

Optional embedding-LM head weight tying:

```python
class ARHead(nn.Module):
    def __init__(self, d, vocab, tie_weights=False, embedding_weight=None):
        super().__init__()
        self.proj = nn.Linear(d, vocab, bias=False)
        if tie_weights and embedding_weight is not None:
            self.proj.weight = embedding_weight  # Share weights
```

Reduces parameters by ~vocab × d (significant for a large vocabulary).

---

## 7. Current Training Status

As of January 2026:

- Step: 2.2M+
- Tokens seen: ~2.4B
- Preset: large (698M params)
- Training on a vast.ai RTX 3090
- Checkpoints every 6 hours

---

## 8. Observations and Notes

### 8.1 Expansion Ratio Effects

Early experiments suggest:

- 1x (standard): baseline behavior
- 3x-6x: slight improvement in attention patterns
- 12x+: diminishing returns, increased compute

Not rigorously benchmarked; these are observations only.

### 8.2 AR vs AR+SAT

An AR-only mode (`--ar_only`) is available for comparison. Joint training roughly doubles the forward passes per batch.

### 8.3 Known Issues

1. SAT inference quality lags AR (expected: it is a harder task)
2. Gate accuracy is mediocre (it often just predicts "emit 2")
3. Memory usage is higher than an equivalent AR-only model

---

## 9. Code Location

Primary file: `n.py`

Key classes:

- `TuneableAttentionMHA`: The modified attention
- `Block`: Transformer block
- `Encoder`: Full encoder stack
- `ARHead`, `SATHead`: Output heads
- `token_stream`: Data pipeline
- `_train_phase`: Training loop

---

## 10. License and Citation

Code is released under the MIT license. If referencing this work:

```
@misc{agillm3,
  author = {Bisset, Scott},
  title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training},
  year = {2026},
  publisher = {OpenTransformers Ltd}
}
```

---

## Appendix A: Full Preset Table

```python
PRESETS = {
    "femto_1x":  dict(d=16,   layers=1,  heads=1,  rank=16),
    "femto_12x": dict(d=16,   layers=1,  heads=1,  rank=192),
    "pico_1x":   dict(d=32,   layers=1,  heads=2,  rank=16),
    "pico_12x":  dict(d=32,   layers=1,  heads=2,  rank=192),
    "nano_1x":   dict(d=64,   layers=2,  heads=4,  rank=16),
    "nano_3x":   dict(d=64,   layers=2,  heads=4,  rank=48),
    "nano_12x":  dict(d=64,   layers=2,  heads=4,  rank=192),
    "micro_12x": dict(d=128,  layers=4,  heads=8,  rank=192),
    "small":     dict(d=512,  layers=8,  heads=16, rank=64),
    "base":      dict(d=768,  layers=12, heads=24, rank=96),
    "large":     dict(d=1024, layers=24, heads=16, rank=128),
}
```

---

## Appendix B: Example Training Command

```bash
python n.py train \
  --preset large \
  --batch_size 4 \
  --block 1122 \
  --amp \
  --save_every_sec 21600 \
  --save_dir /workspace/ckpts_expansion \
  --max_ckpts 5 \
  --resume /workspace/ckpts_expansion
```

---

*Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM*
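
---

## Appendix C: Identity-Regime Sanity Check (illustrative)

A quick numpy check of the r = d_k "Identity" regime from Section 2.2: projecting Q and K through any square orthogonal U leaves the attention scores unchanged, since (QU)(KU)ᵀ = Q(UUᵀ)Kᵀ = QKᵀ. This sketch is added for illustration and is not code from `n.py`:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 16, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

# Random square orthogonal U via QR decomposition (U @ U.T == I).
U, _ = np.linalg.qr(rng.standard_normal((d_k, d_k)))

scores_std = Q @ K.T
scores_proj = (Q @ U) @ (K @ U).T  # = Q @ (U @ U.T) @ K.T = Q @ K.T

print(np.allclose(scores_std, scores_proj))  # True
```

Equivalence only holds while U stays orthogonal; once training updates U, the r = d_k configuration is a learned reparameterization rather than exactly standard attention.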