| # AGILLM-3: Technical Documentation | |
| ## A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training | |
| **Scott Bisset** | |
| OpenTransformers Ltd | |
| January 2026 | |
| --- | |
| ## Abstract | |
| This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via learned, orthogonally initialized projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models; AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community. | |
| --- | |
| ## 1. Motivation | |
| ### 1.1 What This Is | |
| AGILLM-3 is a research project exploring: | |
| 1. **Tuneable attention rank**: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension? | |
| 2. **Joint AR+SAT training**: Can a model learn both next-token prediction AND multi-token speculation simultaneously? | |
| ### 1.2 What This Isn't | |
| This is not: | |
| - A frontier model | |
| - A competitor to GPT-4/Claude/Gemini | |
| - A claim that small models can match large ones | |
| - A business | |
| AGI already exists. This is documentation, not disruption. | |
| --- | |
| ## 2. Architecture | |
| ### 2.1 Overview | |
| ``` | |
| Input tokens | |
| ↓ | |
| Embedding (vocab → d) | |
| ↓ | |
| [Block × L layers] | |
| ├── LayerNorm → TuneableAttentionMHA → +residual | |
| └── LayerNorm → FFN (d → 4d → d) → +residual | |
| ↓ | |
| Final LayerNorm | |
| ↓ | |
| ├── ARHead (next token prediction) | |
| └── SATHead (multi-token speculation) | |
| ``` | |
| ### 2.2 Tuneable Attention (The Novel Bit) | |
| Standard multi-head attention computes: | |
| ``` | |
| Q = XWq, K = XWk, V = XWv | |
| Attention = softmax(QKᵀ/√d_k) · V | |
| ``` | |
| Here Q and K have shape [batch, seq, heads, d_k] once split into heads. | |
| **AGILLM-3's modification:** | |
| ```python | |
| import torch | |
| import torch.nn as nn | |
| class TuneableAttentionMHA(nn.Module): | |
|     def __init__(self, d: int, h: int, r: int): | |
|         super().__init__() | |
|         self.h, self.d_k = h, d // h  # d_k = per-head dimension | |
|         # r = rank (the tuneable parameter) | |
|         self.U = nn.Parameter(torch.randn(self.d_k, r)) | |
|         nn.init.orthogonal_(self.U) | |
|     def _proj_qk(self, x): | |
|         # Project through U: [batch, heads, seq, d_k] @ [d_k, r] → [batch, heads, seq, r] | |
|         B, N, _ = x.shape  # x: [batch, seq, d] | |
|         return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U | |
| ``` | |
| The attention computation becomes: | |
| ``` | |
| Q' = Q @ U # [batch, heads, seq, r] | |
| K' = K @ U # [batch, heads, seq, r] | |
| Attention = softmax(Q'K'ᵀ/√d_k) · V | |
| ``` | |
| **What this means:** | |
| | Regime | Condition | Effect | | |
| |--------|-----------|--------| | |
| | Compression | r < d_k | Q-K similarity computed in lower-dim space | | |
| | Identity | r = d_k | Equivalent to standard attention for any orthogonal U (including U=I), since U Uᵀ = I | | |
| | Expansion | r > d_k | Q-K similarity computed in higher-dim space | | |
| The presets encode this as ratios: | |
| - `nano_1x`: r = d_k (standard) | |
| - `nano_3x`: r = 3 × d_k (expansion) | |
| - `nano_12x`: r = 12 × d_k (heavy expansion) | |
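| **Hypothesis being tested:** Does expanding the Q-K interaction space improve attention quality? With orthogonal initialization and r ≥ d_k, U has orthonormal rows (U Uᵀ = I), so Q'K'ᵀ = QKᵀ and training starts from standard attention scores without destroying information. For r < d_k, U Uᵀ is only a rank-r projection, so some Q-K information is discarded from the start. | |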
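The effect of orthogonal initialization in each regime can be checked numerically. The following is a minimal sketch, assuming `d_k = 16` and random stand-in tensors for Q and K; the helper name `qk_scores` is hypothetical and not taken from `n.py`. It compares projected scores against standard scores at initialization for a compression, identity, and expansion rank.

```python
import torch
import torch.nn as nn

def qk_scores(q, k, U=None):
    """Scaled Q-K scores, optionally projecting Q and K through U first."""
    d_k = q.shape[-1]
    if U is not None:
        q, k = q @ U, k @ U
    return (q @ k.transpose(-2, -1)) / d_k ** 0.5

torch.manual_seed(0)
d_k = 16
q = torch.randn(2, 4, 8, d_k)  # [batch, heads, seq, d_k], random stand-ins
k = torch.randn(2, 4, 8, d_k)
base = qk_scores(q, k)

results = {}
for r in (8, 16, 48):  # compression, identity, expansion
    U = torch.empty(d_k, r)
    nn.init.orthogonal_(U)
    # True when the projected scores match standard attention scores
    results[r] = torch.allclose(qk_scores(q, k, U), base, atol=1e-4)
    print(r, results[r])
```

Under these assumptions the scores match for `r = 16` and `r = 48` and differ for `r = 8`, so any benefit from expansion has to come from how U moves away from orthogonality during training.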
| ### 2.3 Positional Encoding: ALiBi | |
| AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions: | |
| ```python | |
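| import torch | |
| def alibi_bias(n_heads, n_tokens): | |
|     # Each head gets a different slope: 2^(-8/n_heads), 2^(-16/n_heads), ... | |
|     slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]) | |
|     # Attention score penalized by distance: score -= slope * |i - j| | |
|     pos = torch.arange(n_tokens) | |
|     distance_matrix = (pos[:, None] - pos[None, :]).abs() | |
|     # Result shape: [n_heads, n_tokens, n_tokens] | |
|     return -slopes[:, None, None] * distance_matrix | |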
| ``` | |
| ALiBi was chosen for: | |
| - Zero additional parameters | |
| - Good length extrapolation | |
| - Simplicity | |
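To make the slope schedule concrete, here is a small pure-Python sketch of the per-head slopes and the bias one query position receives. The helper names `alibi_slopes` and `alibi_row` are illustrative, not from `n.py`, and the slope formula is the one stated in the comment above.

```python
def alibi_slopes(n_heads):
    """Slopes 2^(-8/n), 2^(-16/n), ... for n heads."""
    return [2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_row(slope, query_pos, n_tokens):
    """Bias added to one query's scores: zero at its own position, more negative with distance."""
    return [-slope * abs(query_pos - j) for j in range(n_tokens)]

slopes = alibi_slopes(8)
row = alibi_row(slopes[0], query_pos=3, n_tokens=4)
print(slopes)
print(row)
```

For 8 heads the slopes are 1/2, 1/4, and so on down to 1/256, so early heads focus on nearby tokens while later heads penalize distance only weakly.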
| ### 2.4 Block Structure | |
| Each transformer block: | |
| ```python | |
| class Block(nn.Module): | |
|     def forward(self, x, mask): | |
|         # Pre-norm architecture | |
|         x = x + self.mha(self.ln1(x), mask) | |
|         x = x + self.ff(self.ln2(x)) | |
|         return x | |
| ``` | |
| FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d) | |
| ### 2.5 Model Configurations | |
| From the presets in code: | |
| | Preset | d_model | Layers | Heads | Rank | ~Params | | |
| |--------|---------|--------|-------|------|---------| | |
| | nano_3x | 64 | 2 | 4 | 48 | ~200K | | |
| | micro_12x | 128 | 4 | 8 | 192 | ~2M | | |
| | small | 512 | 8 | 16 | 64 | ~50M | | |
| | base | 768 | 12 | 24 | 96 | ~125M | | |
| | large | 1024 | 24 | 16 | 128 | ~698M | | |
| The "large" preset at 698M parameters is the primary AGILLM-3 configuration. | |
| --- | |
| ## 3. Joint AR+SAT Training | |
| ### 3.1 The Idea | |
| Standard language models train only on next-token prediction (autoregressive, AR). | |
| AGILLM-3 trains on BOTH: | |
| 1. **AR objective**: Predict token t+1 from tokens 1..t | |
| 2. **SAT objective**: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive) | |
| ### 3.2 Masking | |
| **AR mask** (standard causal): | |
| ``` | |
| Position can attend to: all previous positions | |
| [1 0 0 0] | |
| [1 1 0 0] | |
| [1 1 1 0] | |
| [1 1 1 1] | |
| ``` | |
| **SAT mask** (block-wise): | |
| ``` | |
| SAT_BLOCK = 2 | |
| Positions in same block can attend to each other AND all previous blocks | |
| Block 0: positions 0,1 can see each other | |
| Block 1: positions 2,3 can see each other + block 0 | |
| etc. | |
| ``` | |
| ```python | |
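| import torch | |
| def sat_mask(n, block=2): | |
|     idx = torch.arange(n) | |
|     grp = idx // block  # block index of each position | |
|     # allow[i, j]: query i may attend to key j in the same block OR a previous block | |
|     allow = grp[None, :] <= grp[:, None] | |
|     return torch.zeros(n, n).masked_fill(~allow, float("-inf")) | |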
| ``` | |
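As a quick check of the block-wise rule, the sketch below builds the mask for 4 positions with a block size of 2 and prints which keys each query can see. The function name `block_causal_mask` is hypothetical; it is a self-contained restatement of the rule described above, not the code from `n.py`.

```python
import torch

def block_causal_mask(n: int, block: int = 2) -> torch.Tensor:
    """Additive mask: 0 where query i may attend to key j, -inf elsewhere."""
    grp = torch.arange(n) // block          # block index per position
    allow = grp[None, :] <= grp[:, None]    # key block <= query block
    return torch.zeros(n, n).masked_fill(~allow, float("-inf"))

mask = block_causal_mask(4, block=2)
visible = (mask == 0).int().tolist()
print(visible)
# [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1]]
```

Compared with the causal mask, position 0 can now see position 1 and position 2 can see position 3, which is what lets a whole block be predicted together.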
| ### 3.3 Training Loop | |
| Each batch: | |
| ```python | |
| # Forward pass 1: AR | |
| h_ar = core(ids, causal_mask(n)) | |
| logits_ar = ar_head(h_ar)[:, :-1] | |
| loss_ar = cross_entropy(logits_ar, targets[:, 1:]) | |
| # Forward pass 2: SAT | |
| h_sat = core(ids, sat_mask(n)) | |
| logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:]) | |
| loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1]) | |
| # Optional: gate loss (predict how many tokens to emit) | |
| if gate is not None: | |
|     loss_sat += 0.1 * cross_entropy(gate, emit_target) | |
| loss = loss_ar + loss_sat | |
| ``` | |
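The loss combination above can be exercised in isolation. This is a minimal sketch with random stand-in tensors in place of the real head outputs; the toy sizes, the helper `seq_cross_entropy`, and the meaning of the gate classes (0 = emit one token, 1 = emit two) are assumptions for illustration. It mainly shows the flattening that `F.cross_entropy` needs for `[batch, seq, vocab]` logits.

```python
import torch
import torch.nn.functional as F

def seq_cross_entropy(logits, targets, label_smoothing=0.0):
    """Cross-entropy over [batch, seq, vocab] logits and [batch, seq] targets."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # -> [batch * seq, vocab]
        targets.reshape(-1),                  # -> [batch * seq]
        label_smoothing=label_smoothing,
    )

torch.manual_seed(0)
B, N, VOCAB, SAT_BLOCK = 2, 8, 50, 2          # toy sizes, not the real config
ids = torch.randint(0, VOCAB, (B, N))

# Random stand-ins for the head outputs of the two forward passes
logits_ar = torch.randn(B, N, VOCAB, requires_grad=True)
logits_sat = torch.randn(B, SAT_BLOCK, VOCAB, requires_grad=True)
gate = torch.randn(B, 2, requires_grad=True)
emit_target = torch.randint(0, 2, (B,))       # assumed: 0 = emit 1 token, 1 = emit 2

# AR: the logit at position t is scored against the token at t+1
loss_ar = seq_cross_entropy(logits_ar[:, :-1], ids[:, 1:])
loss_sat = seq_cross_entropy(logits_sat, ids[:, 1:SAT_BLOCK + 1])
loss_sat = loss_sat + 0.1 * F.cross_entropy(gate, emit_target)
loss = loss_ar + loss_sat
loss.backward()
print(float(loss))
```

Because the last AR position has no next token to predict, its logits receive no gradient in this sketch.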
| ### 3.4 SAT Head with Gating | |
| ```python | |
| class SATHead(nn.Module): | |
|     def __init__(self, d, mode="var"): | |
|         super().__init__() | |
|         self.proj = nn.Linear(d, vocab)  # Token prediction | |
|         self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens? | |
| ``` | |
| The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation. | |
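A rough sketch of how a gate could select the stride at inference follows. `ToySATHead` is a hypothetical stand-in, not the `SATHead` from `n.py`; reading the gate from the first position of the block and mapping class 0 to one token and class 1 to two tokens are both assumptions.

```python
import torch
import torch.nn as nn

class ToySATHead(nn.Module):
    """Hypothetical stand-in for SATHead: token logits plus a 2-way gate."""
    def __init__(self, d: int, vocab: int):
        super().__init__()
        self.proj = nn.Linear(d, vocab)  # token prediction per position
        self.gate = nn.Linear(d, 2)      # assumed: class 0 -> emit 1 token, class 1 -> emit 2

    def forward(self, h):
        # h: [batch, block, d]; the gate reads the first block position (assumption)
        return self.proj(h), self.gate(h[:, 0])

torch.manual_seed(0)
head = ToySATHead(d=32, vocab=100)
h = torch.randn(1, 2, 32)                        # hidden states for one block of 2
logits, gate_logits = head(h)
stride = int(gate_logits.argmax(dim=-1)) + 1     # 1 or 2 tokens to emit
tokens = logits[0, :stride].argmax(dim=-1).tolist()  # greedy pick, for the sketch only
print(stride, tokens)
```

With a gate like this, a low-confidence block falls back to emitting a single token, which is the variable-stride behavior described above.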
| ### 3.5 Why Joint Training? | |
| **Hypothesis:** Training both objectives together might: | |
| 1. Improve representation quality (multi-task learning) | |
| 2. Enable speculative decoding at inference (predict multiple tokens, verify with AR) | |
| 3. Learn confidence estimation via the gate | |
| **Current status:** Experimental. No claims of improvement over AR-only. | |
| --- | |
| ## 4. Training Infrastructure | |
| ### 4.1 Data Pipeline | |
| ```python | |
| def token_stream(ds_names, target_tokens, seed, ...): | |
|     """ | |
|     Streaming token generator from HuggingFace datasets. | |
|     - Supports multiple comma-separated datasets | |
|     - Auto-rotates through sources | |
|     - Handles chat format (messages key) or raw text | |
|     - Appends EOS tokens | |
|     """ | |
| ``` | |
| Default pretraining sources (from code): | |
| ``` | |
| OpenTransformer/goddess-crawl | |
| OpenTransformer/agillm-crawl-data | |
| OpenTransformer/web-crawl-2026 | |
| OpenTransformer/web-crawl-clean-v2 | |
| OpenTransformer/scraped-web-data | |
| OpenTransformer/turbo-crawl | |
| OpenTransformer/sft-data-clean | |
| OpenTransformer/web-crawl-v1 | |
| ``` | |
| ### 4.2 Optimizer Configuration | |
| ```python | |
| opt = AdamW([ | |
| {"params": core.parameters(), "lr": 5e-5}, # LR_CORE | |
| {"params": ar_head.parameters(), "lr": 2e-4}, # LR_HEAD | |
| {"params": sat_head.parameters(), "lr": 2e-4}, | |
| ]) | |
| ``` | |
| Separate learning rates for core vs heads. | |
| ### 4.3 Training Features | |
| - **AMP**: Automatic mixed precision (bf16 if available, else fp16) | |
| - **Gradient clipping**: max_norm=1.0 | |
| - **Label smoothing**: 0.1 | |
| - **Dropout**: 0.1 in attention | |
| - **Checkpointing**: Configurable interval (default 24h), automatic pruning | |
| ### 4.4 Chinchilla Scaling | |
| ```python | |
| ratio = 51.2 if args.chilla_max_double else 25 | |
| param_count = count_params(core, ar_h, sat_h) | |
| target_tokens = int(ratio * param_count) | |
| ``` | |
| By default the token budget is 25× the parameter count (a Chinchilla-style tokens-per-parameter ratio); setting `chilla_max_double` raises it to 51.2× for "double Chinchilla". | |
| For 698M params: ~17.5B tokens default, ~35.7B tokens with double. | |
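The budget arithmetic can be reproduced directly. This sketch wraps the ratio logic from the snippet above in a hypothetical helper `target_tokens` and uses the rounded figure of 698M parameters rather than an exact count.

```python
def target_tokens(param_count: int, chilla_max_double: bool = False) -> int:
    """Token budget from the tokens-per-parameter ratio described above."""
    ratio = 51.2 if chilla_max_double else 25
    return int(ratio * param_count)

params = 698_000_000  # rounded parameter count of the "large" preset
default_budget = target_tokens(params)
double_budget = target_tokens(params, chilla_max_double=True)
print(f"{default_budget / 1e9:.2f}B tokens")
print(f"{double_budget / 1e9:.2f}B tokens")
```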
| ### 4.5 Hot Config | |
| Runtime dataset switching without restart: | |
| ```python | |
| # /workspace/hot_config.json | |
| {"datasets": ["new_dataset_1", "new_dataset_2"]} | |
| ``` | |
| Trainer checks this file periodically and switches data sources. | |
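One way such a periodic check could work is sketched below. The function `read_hot_config` is hypothetical and the trainer's actual logic is not shown in this document; the sketch assumes that a missing or malformed file should leave the current datasets unchanged.

```python
import json
import os
import tempfile

def read_hot_config(path, current):
    """Return the dataset list from `path`, or `current` if the file is missing or invalid."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            cfg = json.load(f)
    except (OSError, json.JSONDecodeError):
        return current
    datasets = cfg.get("datasets") if isinstance(cfg, dict) else None
    if isinstance(datasets, list) and datasets and all(isinstance(d, str) for d in datasets):
        return datasets
    return current

# Usage with a temporary file standing in for /workspace/hot_config.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hot_config.json")
    active = ["OpenTransformer/web-crawl-v1"]
    unchanged = read_hot_config(path, active)   # file does not exist yet
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"datasets": ["new_dataset_1", "new_dataset_2"]}, f)
    switched = read_hot_config(path, active)
print(unchanged, switched)
```

Falling back to the current list on any read error means a half-written config file cannot interrupt a long run.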
| ### 4.6 Auto-Grow | |
| Optional feature to increase block size during training: | |
| ```bash | |
| --auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000 | |
| ``` | |
| Starts with smaller context, grows as training stabilizes. | |
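The schedule implied by these flags can be sketched as a lookup from step to block size. The helper `block_size_at` is hypothetical, and it assumes training starts at the first plan entry and advances one entry every `grow_every_steps` steps; the real trainer may also condition growth on stability.

```python
def block_size_at(step: int, grow_plan: str, grow_every_steps: int) -> int:
    """Block size at `step`, advancing one plan entry every `grow_every_steps`."""
    sizes = [int(s) for s in grow_plan.split(",")]
    stage = min(step // grow_every_steps, len(sizes) - 1)  # stay at the last size
    return sizes[stage]

plan = "576,640,768,896,1024,1122"
for step in (0, 49_999, 50_000, 120_000, 1_000_000):
    print(step, block_size_at(step, plan, 50_000))
```

Under these assumptions the full 1122-token context is reached after 250,000 steps.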
| --- | |
| ## 5. Inference | |
| ### 5.1 AR Mode (Standard) | |
| ```bash | |
| python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello" | |
| ``` | |
| Standard autoregressive generation with KV-cache. | |
| ### 5.2 SAT Mode (Speculative) | |
| ```bash | |
| python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var | |
| ``` | |
| Generates SAT_BLOCK tokens at once, optionally using the gate to choose the stride. | |
| ### 5.3 Sampling Parameters | |
| | Parameter | AR Default | SAT Default | | |
| |-----------|------------|-------------| | |
| | temperature | 0.7 | 0.5 | | |
| | top_k | 0 | 30 | | |
| | repetition_penalty | 1.3 | 2.0 | | |
| | presence_penalty | 0.0 | 0.6 | | |
| | frequency_penalty | 0.3 | 1.0 | | |
| | penalty_last_n | 128 | 200 | | |
| SAT mode uses more aggressive penalties to avoid repetition from parallel generation. | |
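To show how these parameters interact, here is a pure-Python sketch of penalized sampling. The exact formulas in `n.py` are not given in this document, so the sketch assumes common conventions: the repetition penalty divides positive logits and multiplies negative ones, the presence penalty is a flat subtraction for any seen token, and the frequency penalty scales with the token's count in the last `penalty_last_n` tokens. The function names are illustrative.

```python
import math
import random
from collections import Counter

def apply_penalties(logits, history, repetition_penalty=1.3,
                    presence_penalty=0.0, frequency_penalty=0.3, penalty_last_n=128):
    """Return a penalized copy of `logits` given recently generated token ids."""
    counts = Counter(history[-penalty_last_n:])
    out = list(logits)
    for tok, n in counts.items():
        # Repetition penalty: shrink positive logits, push negative ones lower
        out[tok] = out[tok] / repetition_penalty if out[tok] > 0 else out[tok] * repetition_penalty
        out[tok] -= presence_penalty + frequency_penalty * n
    return out

def sample(logits, temperature=0.7, top_k=0, rng=random):
    """Temperature plus optional top-k sampling; top_k=0 keeps the full vocabulary."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = ranked[:top_k] if top_k > 0 else ranked
    scaled = [logits[i] / temperature for i in keep]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(keep, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
history = [0, 0, 3]
ar = apply_penalties(logits, history)  # AR defaults from the table
sat = apply_penalties(logits, history, repetition_penalty=2.0,
                      presence_penalty=0.6, frequency_penalty=1.0, penalty_last_n=200)
choice = sample(sat, temperature=0.5, top_k=30, rng=random.Random(0))
print(ar, sat, choice)
```

With the SAT defaults, the twice-repeated token 0 drops from the highest logit to below the unseen tokens, while the AR defaults only narrow its lead.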
| --- | |
| ## 6. Weight Tying | |
| Optional embedding-LM head weight tying: | |
| ```python | |
| class ARHead(nn.Module): | |
|     def __init__(self, d, tie_weights=False, embedding_weight=None): | |
|         super().__init__() | |
|         if tie_weights and embedding_weight is not None: | |
|             self.proj = nn.Linear(d, vocab, bias=False) | |
|             self.proj.weight = embedding_weight  # Share the embedding's [vocab, d] weight | |
| ``` | |
| Reduces parameters by ~vocab × d (significant for large vocab). | |
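The saving can be confirmed by counting shared tensors only once. This sketch uses toy sizes (`vocab = 1000`, `d = 64`) rather than the real vocabulary, and the helper `count_unique_params` is hypothetical.

```python
import torch.nn as nn

def count_unique_params(*modules):
    """Count parameters once even when several modules share the same tensor."""
    seen = {}
    for m in modules:
        for p in m.parameters():
            seen[id(p)] = p.numel()
    return sum(seen.values())

vocab, d = 1000, 64                      # toy sizes, not the real vocabulary
emb = nn.Embedding(vocab, d)
untied = nn.Linear(d, vocab, bias=False)
tied = nn.Linear(d, vocab, bias=False)
tied.weight = emb.weight                 # both are [vocab, d], so shapes match

saved = count_unique_params(emb, untied) - count_unique_params(emb, tied)
print(saved)  # 64000 == vocab * d
```

Tying works without a transpose because `nn.Embedding` and `nn.Linear(d, vocab)` both store their weight as `[vocab, d]`.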
| --- | |
| ## 7. Current Training Status | |
| As of January 2026: | |
| - Step: 2.2M+ | |
| - Tokens seen: ~2.4B | |
| - Preset: large (698M params) | |
| - Training on vast.ai 3090 | |
| - Checkpoints every 6 hours | |
| --- | |
| ## 8. Observations and Notes | |
| ### 8.1 Expansion Ratio Effects | |
| Early experiments suggest: | |
| - 1x (standard): baseline behavior | |
| - 3x-6x: slight improvement in attention patterns | |
| - 12x+: diminishing returns, increased compute | |
| Not rigorously benchmarked. Observations only. | |
| ### 8.2 AR vs AR+SAT | |
| AR-only mode (`--ar_only`) is available for comparison. Joint training runs two forward passes per batch (one AR, one SAT), roughly doubling forward compute. | |
| ### 8.3 Known Issues | |
| 1. SAT inference quality lags AR (expected, since it is the harder task) | |
| 2. Gate accuracy mediocre (often just predicts "emit 2") | |
| 3. Memory usage higher than equivalent AR-only model | |
| --- | |
| ## 9. Code Location | |
| Primary file: `n.py` | |
| Key classes: | |
| - `TuneableAttentionMHA`: The modified attention | |
| - `Block`: Transformer block | |
| - `Encoder`: Full encoder stack | |
| - `ARHead`, `SATHead`: Output heads | |
| - `token_stream`: Data pipeline | |
| - `_train_phase`: Training loop | |
| --- | |
| ## 10. License and Citation | |
| Code released under MIT license. | |
| If referencing this work: | |
| ``` | |
| @misc{agillm3, | |
| author = {Bisset, Scott}, | |
| title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training}, | |
| year = {2026}, | |
| publisher = {OpenTransformers Ltd} | |
| } | |
| ``` | |
| --- | |
| ## Appendix A: Full Preset Table | |
| ```python | |
| PRESETS = { | |
| "femto_1x": dict(d=16, layers=1, heads=1, rank=16), | |
| "femto_12x": dict(d=16, layers=1, heads=1, rank=192), | |
| "pico_1x": dict(d=32, layers=1, heads=2, rank=16), | |
| "pico_12x": dict(d=32, layers=1, heads=2, rank=192), | |
| "nano_1x": dict(d=64, layers=2, heads=4, rank=16), | |
| "nano_3x": dict(d=64, layers=2, heads=4, rank=48), | |
| "nano_12x": dict(d=64, layers=2, heads=4, rank=192), | |
| "micro_12x": dict(d=128, layers=4, heads=8, rank=192), | |
| "small": dict(d=512, layers=8, heads=16, rank=64), | |
| "base": dict(d=768, layers=12, heads=24, rank=96), | |
| "large": dict(d=1024, layers=24, heads=16, rank=128), | |
| } | |
| ``` | |
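The expansion ratio of each preset follows from `rank / d_k` with `d_k = d / heads`. The sketch below computes it for the table above; the helper name `expansion_ratio` is illustrative.

```python
PRESETS = {
    "femto_1x":  dict(d=16, layers=1, heads=1, rank=16),
    "femto_12x": dict(d=16, layers=1, heads=1, rank=192),
    "pico_1x":   dict(d=32, layers=1, heads=2, rank=16),
    "pico_12x":  dict(d=32, layers=1, heads=2, rank=192),
    "nano_1x":   dict(d=64, layers=2, heads=4, rank=16),
    "nano_3x":   dict(d=64, layers=2, heads=4, rank=48),
    "nano_12x":  dict(d=64, layers=2, heads=4, rank=192),
    "micro_12x": dict(d=128, layers=4, heads=8, rank=192),
    "small":     dict(d=512, layers=8, heads=16, rank=64),
    "base":      dict(d=768, layers=12, heads=24, rank=96),
    "large":     dict(d=1024, layers=24, heads=16, rank=128),
}

def expansion_ratio(cfg):
    """Rank divided by the per-head dimension d_k = d // heads."""
    d_k = cfg["d"] // cfg["heads"]
    return cfg["rank"] / d_k

ratios = {name: expansion_ratio(cfg) for name, cfg in PRESETS.items()}
for name, ratio in ratios.items():
    print(f"{name:10s} d_k={PRESETS[name]['d'] // PRESETS[name]['heads']:3d} ratio={ratio:g}x")
```

By this calculation the unsuffixed presets are also expansions: `small` and `large` use 2× and `base` uses 3×.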
| --- | |
| ## Appendix B: Example Training Command | |
| ```bash | |
| python n.py train \ | |
| --preset large \ | |
| --batch_size 4 \ | |
| --block 1122 \ | |
| --amp \ | |
| --save_every_sec 21600 \ | |
| --save_dir /workspace/ckpts_expansion \ | |
| --max_ckpts 5 \ | |
| --resume /workspace/ckpts_expansion | |
| ``` | |
| --- | |
| *Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM* | |