# AGILLM-3: Technical Documentation
## A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training
**Scott Bisset**
OpenTransformers Ltd
January 2026
---
## Abstract
This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via learned orthogonal projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models; AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community.
---
## 1. Motivation
### 1.1 What This Is
AGILLM-3 is a research project exploring:
1. **Tuneable attention rank**: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension?
2. **Joint AR+SAT training**: Can a model learn both next-token prediction AND multi-token speculation simultaneously?
### 1.2 What This Isn't
This is not:
- A frontier model
- A competitor to GPT-4/Claude/Gemini
- A claim that small models can match large ones
- A business
AGI already exists. This is documentation, not disruption.
---
## 2. Architecture
### 2.1 Overview
```
Input tokens
    ↓
Embedding (vocab → d)
    ↓
[Block × L layers]
    ├── LayerNorm → TuneableAttentionMHA → +residual
    └── LayerNorm → FFN (d → 4d → d) → +residual
    ↓
Final LayerNorm
    ↓
    ├── ARHead (next token prediction)
    └── SATHead (multi-token speculation)
```
### 2.2 Tuneable Attention (The Novel Bit)
Standard multi-head attention computes:
```
Q = XWq, K = XWk, V = XWv
Attention = softmax(QKᵀ/√d_k) · V
```
Where Q, K have shape [batch, seq, heads, d_k].
**AGILLM-3's modification:**
```python
import torch
import torch.nn as nn

class TuneableAttentionMHA(nn.Module):
    def __init__(self, d: int, h: int, r: int):
        super().__init__()
        self.h, self.d_k = h, d // h  # heads, per-head dimension
        # r = rank (the tuneable parameter)
        self.U = nn.Parameter(torch.randn(self.d_k, r))
        nn.init.orthogonal_(self.U)

    def _proj_qk(self, x):
        # Split heads, then project through U:
        # [batch, heads, seq, d_k] @ [d_k, r] → [batch, heads, seq, r]
        B, N, _ = x.shape
        return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U
```
The attention computation becomes:
```
Q' = Q @ U  # [batch, heads, seq, r]
K' = K @ U  # [batch, heads, seq, r]
Attention = softmax(Q'K'ᵀ/√d_k) · V
```
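The shapes involved can be checked with a small sketch. This is an illustration under assumptions, not the code in `n.py`: the helper name `tuneable_attention`, the tensor sizes, and the random inputs are made up for the example, and masking, ALiBi, and dropout are left out.

```python
import math
import torch

def tuneable_attention(q, k, v, U):
    """q, k, v: [batch, heads, seq, d_k]; U: [d_k, r]. Hypothetical helper."""
    d_k = q.shape[-1]
    q_p = q @ U                                             # [batch, heads, seq, r]
    k_p = k @ U                                             # [batch, heads, seq, r]
    scores = q_p @ k_p.transpose(-2, -1) / math.sqrt(d_k)   # [batch, heads, seq, seq]
    return torch.softmax(scores, dim=-1) @ v                # [batch, heads, seq, d_k]

torch.manual_seed(0)
B, H, N, d_k, r = 2, 4, 5, 16, 48   # r = 3 * d_k, as in nano_3x
q, k, v = (torch.randn(B, H, N, d_k) for _ in range(3))
U = torch.empty(d_k, r)
torch.nn.init.orthogonal_(U)
out = tuneable_attention(q, k, v, U)
print(out.shape)
```

The rank r only changes the space in which the scores are computed. The output keeps the head dimension d_k because V is not projected through U.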
**What this means:**
| Regime | Condition | Effect |
|--------|-----------|--------|
| Compression | r < d_k | Q-K similarity computed in lower-dim space |
| Identity | r = d_k | Same dimensionality as standard attention; scores are unchanged while U is orthogonal (e.g. U=I) |
| Expansion | r > d_k | Q-K similarity computed in higher-dim space |
The presets encode this as ratios:
- `nano_1x`: r = d_k (standard)
- `nano_3x`: r = 3 × d_k (expansion)
- `nano_12x`: r = 12 × d_k (heavy expansion)
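**Hypothesis being tested:** Does expanding the Q-K interaction space improve attention quality? Orthogonal initialization makes U semi-orthogonal, so no information is destroyed at the start of training when r ≥ d_k: in that case UUᵀ = I and Q'K'ᵀ = QKᵀ at initialization (U is a rotation/reflection when r = d_k). When r < d_k, UUᵀ is a rank-r projection, so some information is discarded from the first step.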
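What orthogonal initialization implies for the scores can be checked numerically. The sketch below uses illustrative sizes and variable names that are not from `n.py`. It relies on `torch.nn.init.orthogonal_` producing a semi-orthogonal matrix for non-square shapes, so with r > d_k the projected scores should match the unprojected ones at initialization, while with r < d_k they should not.

```python
import torch

torch.manual_seed(0)
d_k = 16
q = torch.randn(5, d_k)
k = torch.randn(5, d_k)
baseline = q @ k.T                      # standard Q-K scores (unscaled)

def scores_with_rank(r):
    U = torch.empty(d_k, r)
    torch.nn.init.orthogonal_(U)        # semi-orthogonal when r != d_k
    return (q @ U) @ (k @ U).T

expand_err = (scores_with_rank(48) - baseline).abs().max().item()    # r > d_k
compress_err = (scores_with_rank(8) - baseline).abs().max().item()   # r < d_k
print(f"expansion error:   {expand_err:.2e}")
print(f"compression error: {compress_err:.2e}")
```

This only describes the starting point. Since U is a learned parameter, it is free to move away from orthogonality during training.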
### 2.3 Positional Encoding: ALiBi
AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions:
```python
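import torch

def alibi_bias(n_heads, n_tokens):
    # Each head gets a different slope: 2^(-8/n_heads), 2^(-16/n_heads), ...
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    # Attention score penalized by distance: score -= slope * |i - j|
    pos = torch.arange(n_tokens)
    distance_matrix = (pos[:, None] - pos[None, :]).abs()
    return -slopes[:, None, None] * distance_matrix  # [n_heads, n_tokens, n_tokens]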
```
ALiBi chosen for:
- Zero additional parameters
- Good length extrapolation
- Simplicity
### 2.4 Block Structure
Each transformer block:
```python
class Block(nn.Module):
    def forward(self, x, mask):
        # Pre-norm architecture
        x = x + self.mha(self.ln1(x), mask)
        x = x + self.ff(self.ln2(x))
        return x
```
FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d)
### 2.5 Model Configurations
From the presets in code:
| Preset | d_model | Layers | Heads | Rank | ~Params |
|--------|---------|--------|-------|------|---------|
| nano_3x | 64 | 2 | 4 | 48 | ~200K |
| micro_12x | 128 | 4 | 8 | 192 | ~2M |
| small | 512 | 8 | 16 | 64 | ~50M |
| base | 768 | 12 | 24 | 96 | ~125M |
| large | 1024 | 24 | 16 | 128 | ~698M |
The "large" preset at 698M parameters is the primary AGILLM-3 configuration.
---
## 3. Joint AR+SAT Training
### 3.1 The Idea
Standard language models train only on next-token prediction (autoregressive, AR).
AGILLM-3 trains on BOTH:
1. **AR objective**: Predict token t+1 from tokens 1..t
2. **SAT objective**: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive)
### 3.2 Masking
**AR mask** (standard causal):
```
Position can attend to: all previous positions
[1 0 0 0]
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]
```
**SAT mask** (block-wise):
```
SAT_BLOCK = 2
Positions in same block can attend to each other AND all previous blocks
Block 0: positions 0,1 can see each other
Block 1: positions 2,3 can see each other + block 0
etc.
```
```python
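import torch

def sat_mask(n, block=2):
    idx = torch.arange(n)
    grp = idx // block
    q_grp, k_grp = grp[:, None], grp[None, :]  # query block (rows), key block (columns)
    allow = (q_grp == k_grp) | (q_grp > k_grp)  # Same block OR previous blocks
    return torch.zeros(n, n).masked_fill(~allow, float("-inf"))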
```
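For comparison, the two attention patterns can be built and printed side by side. This is a sketch that writes 1 for "may attend" and 0 for "blocked" instead of the additive 0/-inf form; the helper names `causal_allow` and `sat_allow` are hypothetical.

```python
import torch

def causal_allow(n):
    # Query i may attend to key j when j <= i.
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

def sat_allow(n, block=2):
    # Query i may attend to key j when j's block is not later than i's block.
    grp = torch.arange(n) // block
    return grp[None, :] <= grp[:, None]

ar = causal_allow(4).int()
sat = sat_allow(4, block=2).int()
print(ar)
print(sat)
```

With `block=2` and four positions, the SAT pattern differs from the causal one only in that position 0 can see position 1 and position 2 can see position 3.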
### 3.3 Training Loop
Each batch:
```python
# Forward pass 1: AR
h_ar = core(ids, causal_mask(n))
logits_ar = ar_head(h_ar)[:, :-1]
loss_ar = cross_entropy(logits_ar, targets[:, 1:])
# Forward pass 2: SAT
h_sat = core(ids, sat_mask(n))
logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:])
loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1])
# Optional: gate loss (predict how many tokens to emit)
if gate is not None:
    loss_sat += 0.1 * cross_entropy(gate, emit_target)
loss = loss_ar + loss_sat
```
### 3.4 SAT Head with Gating
```python
class SATHead(nn.Module):
    def __init__(self, d, mode="var"):
        super().__init__()
        self.proj = nn.Linear(d, vocab)  # Token prediction
        self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens?
```
The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation.
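One way such a head could produce both outputs is sketched below. This is not the `SATHead` in `n.py`: the class name `SATHeadSketch`, the explicit `vocab` argument, the choice to read the gate from the first position of the block, the gate existing only in `"var"` mode, and the mapping of gate class 0/1 to a stride of 1/2 are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SATHeadSketch(nn.Module):
    """Illustrative only; not the SATHead in n.py."""

    def __init__(self, d, vocab, mode="var"):
        super().__init__()
        self.proj = nn.Linear(d, vocab)  # token logits for each block position
        # Assumption: the gate exists only in "var" (variable-stride) mode.
        self.gate = nn.Linear(d, 2) if mode == "var" else None

    def forward(self, h_block):
        # h_block: [batch, SAT_BLOCK, d]
        logits = self.proj(h_block)      # [batch, SAT_BLOCK, vocab]
        # Assumption: the gate reads the first position of the block.
        gate = self.gate(h_block[:, 0]) if self.gate is not None else None
        return logits, gate

torch.manual_seed(0)
head = SATHeadSketch(d=32, vocab=100)
logits, gate = head(torch.randn(3, 2, 32))
stride = gate.argmax(dim=-1) + 1         # assumed mapping: class 0 -> 1 token, class 1 -> 2 tokens
print(logits.shape, gate.shape, stride.tolist())
```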
### 3.5 Why Joint Training?
**Hypothesis:** Training both objectives together might:
1. Improve representation quality (multi-task learning)
2. Enable speculative decoding at inference (predict multiple tokens, verify with AR)
3. Learn confidence estimation via the gate
**Current status:** Experimental. No claims of improvement over AR-only.
---
## 4. Training Infrastructure
### 4.1 Data Pipeline
```python
def token_stream(ds_names, target_tokens, seed, ...):
    """
    Streaming token generator from HuggingFace datasets.
    - Supports multiple comma-separated datasets
    - Auto-rotates through sources
    - Handles chat format (messages key) or raw text
    - Appends EOS tokens
    """
```
Default pretraining sources (from code):
```
OpenTransformer/goddess-crawl
OpenTransformer/agillm-crawl-data
OpenTransformer/web-crawl-2026
OpenTransformer/web-crawl-clean-v2
OpenTransformer/scraped-web-data
OpenTransformer/turbo-crawl
OpenTransformer/sft-data-clean
OpenTransformer/web-crawl-v1
```
### 4.2 Optimizer Configuration
```python
opt = AdamW([
    {"params": core.parameters(), "lr": 5e-5},      # LR_CORE
    {"params": ar_head.parameters(), "lr": 2e-4},   # LR_HEAD
    {"params": sat_head.parameters(), "lr": 2e-4},
])
```
Separate learning rates for core vs heads.
### 4.3 Training Features
- **AMP**: Automatic mixed precision (bf16 if available, else fp16)
- **Gradient clipping**: max_norm=1.0
- **Label smoothing**: 0.1
- **Dropout**: 0.1 in attention
- **Checkpointing**: Configurable interval (default 24h), automatic pruning
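Two of these settings, label smoothing and gradient clipping, can be shown in a single toy training step. This is a minimal sketch, not the `_train_phase` loop: the tiny `nn.Sequential` model, the sizes, and the random batch are stand-ins, and AMP is left out so the example runs on CPU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d = 50, 16
model = nn.Sequential(nn.Embedding(vocab, d), nn.Linear(d, vocab))  # toy stand-in model
opt = torch.optim.AdamW(model.parameters(), lr=2e-4)

ids = torch.randint(0, vocab, (4, 8))    # [batch, seq]
logits = model(ids)[:, :-1]              # position t predicts token t+1
loss = F.cross_entropy(
    logits.reshape(-1, vocab), ids[:, 1:].reshape(-1), label_smoothing=0.1
)
opt.zero_grad()
loss.backward()
# clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
print(float(loss), float(grad_norm))
```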
### 4.4 Chinchilla Scaling
```python
ratio = 51.2 if args.chilla_max_double else 25
param_count = count_params(core, ar_h, sat_h)
target_tokens = int(ratio * param_count)
```
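The default targets about 25 tokens per parameter, slightly above the roughly 20 tokens per parameter usually quoted for Chinchilla-optimal training; setting `chilla_max_double` raises this to 51.2 tokens per parameter ("double Chinchilla").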
For 698M params: ~17.5B tokens default, ~35.7B tokens with double.
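The arithmetic behind those figures, as a short sketch. The function name `target_tokens` and the rounded parameter count are illustrative; the two ratios are the ones in the snippet above.

```python
def target_tokens(param_count: int, double: bool = False) -> int:
    # Same ratios as the trainer snippet: 25 tokens/param, or 51.2 when doubled.
    ratio = 51.2 if double else 25
    return int(ratio * param_count)

params = 698_000_000  # rounded parameter count of the "large" preset
default_b = target_tokens(params) / 1e9
double_b = target_tokens(params, double=True) / 1e9
print(f"default: {default_b:.2f}B tokens")
print(f"double:  {double_b:.2f}B tokens")
```

This gives about 17.45B and 35.74B tokens, which round to the ~17.5B and ~35.7B quoted above.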
### 4.5 Hot Config
Runtime dataset switching without restart:
```python
# /workspace/hot_config.json
{"datasets": ["new_dataset_1", "new_dataset_2"]}
```
Trainer checks this file periodically and switches data sources.
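A periodic check of this kind could look like the sketch below. It is not the trainer's actual logic: the function name `read_hot_datasets` and the fallback behaviour (keep the current datasets if the file is missing, unreadable, or malformed) are assumptions, and a temporary file stands in for `/workspace/hot_config.json`.

```python
import json
import os
import tempfile

def read_hot_datasets(path, current):
    """Return the dataset list from the hot-config file, or `current` if unavailable."""
    try:
        with open(path) as f:
            cfg = json.load(f)
    except (OSError, json.JSONDecodeError):
        return current  # missing or half-written file: keep training as before
    datasets = cfg.get("datasets") if isinstance(cfg, dict) else None
    if isinstance(datasets, list) and datasets:
        return datasets
    return current

# Demo with a temporary file standing in for the real hot-config path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hot_config.json")
    active = read_hot_datasets(path, ["old_dataset"])  # no file yet
    with open(path, "w") as f:
        json.dump({"datasets": ["new_dataset_1", "new_dataset_2"]}, f)
    switched = read_hot_datasets(path, active)
print(active, switched)
```

Falling back to the current list keeps a long run from crashing if the file is edited while the trainer is reading it.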
### 4.6 Auto-Grow
Optional feature to increase block size during training:
```bash
--auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000
```
Starts with smaller context, grows as training stabilizes.
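How such a plan might map training steps to block sizes is sketched below. The exact schedule in `n.py` is not documented here, so this assumes the simplest reading: training starts at the first entry, advances one entry every `grow_every_steps` steps, and stays at the last entry afterwards. The function name `block_size_at` is hypothetical.

```python
def block_size_at(step, grow_plan="576,640,768,896,1024,1122", grow_every_steps=50000):
    # Assumption: one plan entry per `grow_every_steps`, held at the final entry.
    sizes = [int(s) for s in grow_plan.split(",")]
    stage = min(step // grow_every_steps, len(sizes) - 1)
    return sizes[stage]

for step in (0, 49_999, 50_000, 260_000, 1_000_000):
    print(step, block_size_at(step))
```

Under this reading the full 1122-token context is reached after 250,000 steps.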
---
## 5. Inference
### 5.1 AR Mode (Standard)
```bash
python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello"
```
Standard autoregressive generation with KV-cache.
### 5.2 SAT Mode (Speculative)
```bash
python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var
```
Generates SAT_BLOCK tokens at once, optionally using gate to choose stride.
### 5.3 Sampling Parameters
| Parameter | AR Default | SAT Default |
|-----------|------------|-------------|
| temperature | 0.7 | 0.5 |
| top_k | 0 | 30 |
| repetition_penalty | 1.3 | 2.0 |
| presence_penalty | 0.0 | 0.6 |
| frequency_penalty | 0.3 | 1.0 |
| penalty_last_n | 128 | 200 |
SAT mode uses more aggressive penalties to avoid repetition from parallel generation.
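To make the role of these parameters concrete, here is a sketch of one common formulation of presence and frequency penalties over a recent window, followed by top-k selection. It is an assumption that `n.py` uses this formulation; the function names are hypothetical, logits are plain Python lists for simplicity, and `repetition_penalty` and temperature are not shown.

```python
from collections import Counter

def penalize(logits, history, presence_penalty=0.6, frequency_penalty=1.0, penalty_last_n=200):
    """Return a penalized copy of `logits` (list of floats indexed by token id)."""
    counts = Counter(history[-penalty_last_n:])  # only the recent window counts
    out = list(logits)
    for tok, c in counts.items():
        # Flat penalty for having appeared at all, plus a per-occurrence penalty.
        out[tok] -= presence_penalty + frequency_penalty * c
    return out

def top_k_ids(logits, k=30):
    # Token ids of the k largest logits, best first.
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

logits = [2.0, 1.5, 1.0, 0.5]
history = [0, 0, 1]  # token 0 seen twice, token 1 once
adjusted = penalize(logits, history)
print(adjusted, top_k_ids(adjusted, k=2))
```

In this toy case the two recently used tokens drop below the unused ones, so the top-2 candidates become tokens 2 and 3.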
---
## 6. Weight Tying
Optional embedding-LM head weight tying:
```python
class ARHead(nn.Module):
    def __init__(self, d, tie_weights=False, embedding_weight=None):
        super().__init__()
        if tie_weights and embedding_weight is not None:
            self.proj = nn.Linear(d, vocab, bias=False)
            self.proj.weight = embedding_weight  # Share weights
```
Reduces parameters by ~vocab × d (significant for large vocab).
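The saving can be confirmed on a toy example. The sizes below are illustrative and the `nn.ModuleDict` wrapper is only there to count parameters; the tying itself is the same assignment used in `ARHead`.

```python
import torch.nn as nn

vocab, d = 1000, 64                      # illustrative sizes
emb = nn.Embedding(vocab, d)             # weight shape [vocab, d]
proj = nn.Linear(d, vocab, bias=False)   # weight shape [vocab, d]
proj.weight = emb.weight                 # both modules now hold the same Parameter

model = nn.ModuleDict({"emb": emb, "proj": proj})
n_params = sum(p.numel() for p in model.parameters())  # shared parameters are counted once
print(n_params, vocab * d)
```

Without tying, the same two modules would hold 2 × vocab × d parameters.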
---
## 7. Current Training Status
As of January 2026:
- Step: 2.2M+
- Tokens seen: ~2.4B
- Preset: large (698M params)
- Training on vast.ai 3090
- Checkpoints every 6 hours
---
## 8. Observations and Notes
### 8.1 Expansion Ratio Effects
Early experiments suggest:
- 1x (standard): baseline behavior
- 3x-6x: slight improvement in attention patterns
- 12x+: diminishing returns, increased compute
Not rigorously benchmarked. Observations only.
### 8.2 AR vs AR+SAT
AR-only mode (`--ar_only`) available for comparison. Joint training adds ~2x forward passes per batch.
### 8.3 Known Issues
1. SAT inference quality lags AR (expected - harder task)
2. Gate accuracy mediocre (often just predicts "emit 2")
3. Memory usage higher than equivalent AR-only model
---
## 9. Code Location
Primary file: `n.py`
Key classes:
- `TuneableAttentionMHA`: The modified attention
- `Block`: Transformer block
- `Encoder`: Full encoder stack
- `ARHead`, `SATHead`: Output heads
- `token_stream`: Data pipeline
- `_train_phase`: Training loop
---
## 10. License and Citation
Code released under MIT license.
If referencing this work:
```
@misc{agillm3,
author = {Bisset, Scott},
title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training},
year = {2026},
publisher = {OpenTransformers Ltd}
}
```
---
## Appendix A: Full Preset Table
```python
PRESETS = {
    "femto_1x":  dict(d=16, layers=1, heads=1, rank=16),
    "femto_12x": dict(d=16, layers=1, heads=1, rank=192),
    "pico_1x":   dict(d=32, layers=1, heads=2, rank=16),
    "pico_12x":  dict(d=32, layers=1, heads=2, rank=192),
    "nano_1x":   dict(d=64, layers=2, heads=4, rank=16),
    "nano_3x":   dict(d=64, layers=2, heads=4, rank=48),
    "nano_12x":  dict(d=64, layers=2, heads=4, rank=192),
    "micro_12x": dict(d=128, layers=4, heads=8, rank=192),
    "small":     dict(d=512, layers=8, heads=16, rank=64),
    "base":      dict(d=768, layers=12, heads=24, rank=96),
    "large":     dict(d=1024, layers=24, heads=16, rank=128),
}
```
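The expansion ratio implied by each preset can be derived from the table. The sketch below assumes the per-head dimension is `d // heads`, which is consistent with the `_1x`, `_3x`, and `_12x` suffixes; the helper name `expansion_ratio` and the subset of presets are chosen for illustration.

```python
PRESETS_SUBSET = {
    "nano_1x":   dict(d=64, layers=2, heads=4, rank=16),
    "nano_3x":   dict(d=64, layers=2, heads=4, rank=48),
    "micro_12x": dict(d=128, layers=4, heads=8, rank=192),
    "small":     dict(d=512, layers=8, heads=16, rank=64),
    "base":      dict(d=768, layers=12, heads=24, rank=96),
    "large":     dict(d=1024, layers=24, heads=16, rank=128),
}

def expansion_ratio(cfg):
    d_k = cfg["d"] // cfg["heads"]   # assumed per-head dimension
    return cfg["rank"] / d_k

ratios = {name: expansion_ratio(cfg) for name, cfg in PRESETS_SUBSET.items()}
for name, ratio in ratios.items():
    print(f"{name:10s} r/d_k = {ratio:g}")
```

Under that assumption the unsuffixed presets also use expansion: `small` and `large` come out at 2x and `base` at 3x.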
---
## Appendix B: Example Training Command
```bash
python n.py train \
    --preset large \
    --batch_size 4 \
    --block 1122 \
    --amp \
    --save_every_sec 21600 \
    --save_dir /workspace/ckpts_expansion \
    --max_ckpts 5 \
    --resume /workspace/ckpts_expansion
```
---
*Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM*