OpenTransformer/SciPapers

OpenTransformer committed on Jan 26

Commit

61b9671

verified ·

1 Parent(s): cf7f479

Upload AGILLM3_technical_documentation.md with huggingface_hub

AGILLM3_technical_documentation.md +468 -0

AGILLM3_technical_documentation.md ADDED

	@@ -0,0 +1,468 @@

+# AGILLM-3: Technical Documentation
+## A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training
+**Scott Bisset**
+OpenTransformers Ltd
+January 2026
+---
+## Abstract
+This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via learned orthogonal projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models—AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community.
+---
+## 1. Motivation
+### 1.1 What This Is
+AGILLM-3 is a research project exploring:
+1. **Tuneable attention rank**: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension?
+2. **Joint AR+SAT training**: Can a model learn both next-token prediction AND multi-token speculation simultaneously?
+### 1.2 What This Isn't
+This is not:
+- A frontier model
+- A competitor to GPT-4/Claude/Gemini
+- A claim that small models can match large ones
+- A business
+AGI already exists. This is documentation, not disruption.
+---
+## 2. Architecture
+### 2.1 Overview
+```
+Input tokens
+    ↓
+Embedding (vocab → d)
+    ↓
+[Block × L layers]
+    ├── LayerNorm → TuneableAttentionMHA → +residual
+    └── LayerNorm → FFN (d → 4d → d) → +residual
+    ↓
+Final LayerNorm
+    ↓
+├── ARHead (next token prediction)
+└── SATHead (multi-token speculation)
+```
+### 2.2 Tuneable Attention (The Novel Bit)
+Standard multi-head attention computes:
+```
+Q = XWq,  K = XWk,  V = XWv
+Attention = softmax(QKᵀ/√d_k) · V
+```
+Where Q, K have shape [batch, seq, heads, d_k].
+**AGILLM-3's modification:**
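+```python
+import torch
+import torch.nn as nn
+class TuneableAttentionMHA(nn.Module):
+    def __init__(self, d: int, h: int, r: int):
+        super().__init__()
+        self.h, self.d_k = h, d // h  # number of heads, per-head dimension
+        # r = rank (the tuneable parameter)
+        self.U = nn.Parameter(torch.randn(self.d_k, r))
+        nn.init.orthogonal_(self.U)
+    def _proj_qk(self, x):
+        # x: [batch, seq, d]. Split into heads, then project through U:
+        # [batch, heads, seq, d_k] @ [d_k, r] → [batch, heads, seq, r]
+        B, N, _ = x.shape
+        return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U
+```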
+The attention computation becomes:
+```
+Q' = Q @ U    # [batch, heads, seq, r]
+K' = K @ U    # [batch, heads, seq, r]
+Attention = softmax(Q'K'ᵀ/√d_k) · V
+```
+**What this means:**
+| Regime | Condition | Effect |
+|--------|-----------|--------|
+| Compression | r < d_k | Q-K similarity computed in lower-dim space |
+| Identity | r = d_k | Equivalent to standard attention while U is orthogonal (UUᵀ = I), e.g. U = I |
+| Expansion | r > d_k | Q-K similarity computed in higher-dim space |
+The presets encode this as ratios:
+- `nano_1x`: r = d_k (standard)
+- `nano_3x`: r = 3 × d_k (expansion)
+- `nano_12x`: r = 12 × d_k (heavy expansion)
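+**Hypothesis being tested:** Does expanding the Q-K interaction space improve attention quality? With orthogonal initialization, U starts as a rotation/reflection when r = d_k and as a norm-preserving embedding when r > d_k, so no information is destroyed at initialization. When r < d_k, U starts as a projection onto r orthonormal directions and the remaining directions are discarded.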
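Because the formulas above apply the same U to both Q and K, the scores depend on U only through the product UUᵀ: Q'K'ᵀ = Q(UUᵀ)Kᵀ. The following is a minimal numerical check of what that implies at initialization, not code from `n.py`. The sizes, the helper names `qk_scores` and `init_U`, and the tolerance are illustrative assumptions.

```python
import torch
import torch.nn as nn

def qk_scores(q, k, U):
    """Unscaled attention scores after projecting Q and K through a shared U."""
    return (q @ U) @ (k @ U).transpose(-1, -2)

def init_U(d_k, r):
    """Orthogonally initialized [d_k, r] matrix, as in the class above."""
    U = torch.empty(d_k, r)
    nn.init.orthogonal_(U)
    return U

torch.manual_seed(0)
d_k, seq = 16, 5  # illustrative sizes
q, k = torch.randn(seq, d_k), torch.randn(seq, d_k)
standard = q @ k.T

U_expand = init_U(d_k, 3 * d_k)     # expansion regime, r = 48
U_compress = init_U(d_k, d_k // 4)  # compression regime, r = 4

# Expansion: rows of U are orthonormal, so U @ U.T = I and scores are unchanged
same_at_init = torch.allclose(qk_scores(q, k, U_expand), standard, atol=1e-4)
# Compression: U @ U.T is a projection whose rank equals r
rank_compress = torch.linalg.matrix_rank(U_compress @ U_compress.T).item()
print(same_at_init, rank_compress)
```

Under these assumptions an expanded model starts out computing the same scores as standard attention, so any effect of r > d_k would have to come from how U moves away from orthogonality during training. Note also that UUᵀ is a d_k × d_k matrix, so its rank cannot exceed d_k for any r.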
+### 2.3 Positional Encoding: ALiBi
+AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions:
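+```python
+import torch
+def alibi_bias(n_heads, n_tokens):
+    # Each head gets a different slope: 2^(-8/n_heads), 2^(-16/n_heads), ...
+    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
+    # Attention score penalized by distance: score -= slope * |i - j|
+    pos = torch.arange(n_tokens)
+    distance_matrix = (pos[:, None] - pos[None, :]).abs()
+    # Result has shape [n_heads, n_tokens, n_tokens]
+    return -slopes[:, None, None] * distance_matrix
+```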
+ALiBi was chosen for:
+- Zero additional parameters
+- Good length extrapolation
+- Simplicity
+### 2.4 Block Structure
+Each transformer block:
+```python
+class Block(nn.Module):
+    def forward(self, x, mask):
+        # Pre-norm architecture
+        x = x + self.mha(self.ln1(x), mask)
+        x = x + self.ff(self.ln2(x))
+        return x
+```
+FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d)
+### 2.5 Model Configurations
+From the presets in code:
+| Preset | d_model | Layers | Heads | Rank | ~Params |
+|--------|---------|--------|-------|------|---------|
+| nano_3x | 64 | 2 | 4 | 48 | ~200K |
+| micro_12x | 128 | 4 | 8 | 192 | ~2M |
+| small | 512 | 8 | 16 | 64 | ~50M |
+| base | 768 | 12 | 24 | 96 | ~125M |
+| large | 1024 | 24 | 16 | 128 | ~698M |
+The "large" preset at 698M parameters is the primary AGILLM-3 configuration.
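As a rough cross-check of the table, the weights held in the transformer blocks can be estimated from d_model and the layer count alone. This is a back-of-the-envelope sketch, not the parameter counter used in `n.py`. It assumes four d × d attention projections per block (Q, K, V and an output projection, the last of which is not shown in this document) plus the d → 4d → d FFN, and it ignores biases, LayerNorm, U, the embedding and both heads. The name `core_weight_params` is hypothetical.

```python
def core_weight_params(d: int, layers: int) -> int:
    """Approximate weight count of the transformer blocks only.

    Assumes four d x d attention projections and a d -> 4d -> d FFN per
    block; biases, LayerNorm, U, embeddings and heads are not counted.
    """
    attn = 4 * d * d
    ffn = 2 * d * (4 * d)
    return layers * (attn + ffn)

large_core = core_weight_params(d=1024, layers=24)
print(f"{large_core:,}")  # 301,989,888
```

Under these assumptions the blocks of `large` hold about 302M weights. The rest of the quoted total would then sit in the parts the sketch leaves out, mainly the embedding and the two output heads, whose size depends on the vocabulary size (not stated in this document).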
+---
+## 3. Joint AR+SAT Training
+### 3.1 The Idea
+Standard language models train only on next-token prediction (autoregressive, AR).
+AGILLM-3 trains on BOTH:
+1. **AR objective**: Predict token t+1 from tokens 1..t
+2. **SAT objective**: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive)
+### 3.2 Masking
+**AR mask** (standard causal):
+```
+Position can attend to: all previous positions
+[1 0 0 0]
+[1 1 0 0]
+[1 1 1 0]
+[1 1 1 1]
+```
+**SAT mask** (block-wise):
+```
+SAT_BLOCK = 2
+Positions in same block can attend to each other AND all previous blocks
+Block 0: positions 0,1 can see each other
+Block 1: positions 2,3 can see each other + block 0
+etc.
+```
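+```python
+import torch
+def sat_mask(n, block=2):
+    idx = torch.arange(n)
+    grp = (idx // block).unsqueeze(0)       # [1, n]: block index of each position
+    allow = (grp.T == grp) | (grp.T > grp)  # Same block OR previous blocks
+    # 0.0 where attention is allowed, -inf where it is blocked
+    return torch.zeros(n, n).masked_fill(~allow, float("-inf"))
+```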
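To make the block-wise pattern concrete, the sketch below prints the allowed positions in the same 0/1 layout used for the AR mask above. It is a plain-Python illustration of the rule "same block or an earlier block", with a hypothetical helper name `allowed_matrix`. It is not the tensor code used in training.

```python
def allowed_matrix(n: int, block: int) -> list[list[int]]:
    """1 where query position i may attend to key position j, else 0."""
    return [[1 if j // block <= i // block else 0 for j in range(n)]
            for i in range(n)]

for row in allowed_matrix(4, block=2):
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 1]
# [1, 1, 1, 1]
```

With `block=1` the same rule reduces to the standard causal mask, which is one way to see the SAT mask as a relaxation of it: each position may additionally see the later positions of its own block.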
+### 3.3 Training Loop
+Each batch:
+```python
+# Forward pass 1: AR
+h_ar = core(ids, causal_mask(n))
+logits_ar = ar_head(h_ar)[:, :-1]
+loss_ar = cross_entropy(logits_ar, targets[:, 1:])
+# Forward pass 2: SAT
+h_sat = core(ids, sat_mask(n))
+logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:])
+loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1])
+# Optional: gate loss (predict how many tokens to emit)
+if gate is not None:
+    loss_sat += 0.1 * cross_entropy(gate, emit_target)
+loss = loss_ar + loss_sat
+```
+### 3.4 SAT Head with Gating
+```python
+class SATHead(nn.Module):
+    def __init__(self, d, mode="var"):
+        super().__init__()
+        # vocab is the vocabulary size (defined outside this excerpt)
+        self.proj = nn.Linear(d, vocab)  # Token prediction
+        self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens?
+```
+The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation.
+### 3.5 Why Joint Training?
+**Hypothesis:** Training both objectives together might:
+1. Improve representation quality (multi-task learning)
+2. Enable speculative decoding at inference (predict multiple tokens, verify with AR)
+3. Learn confidence estimation via the gate
+**Current status:** Experimental. No claims of improvement over AR-only.
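The "predict multiple tokens, verify with AR" idea in point 2 can be sketched as a greedy acceptance rule. This document does not describe how `n.py` verifies SAT drafts, so the following is a generic scheme under stated assumptions: `accept_draft` and `ar_next` are hypothetical names, `ar_next` stands in for the AR head's greedy next-token choice, and a real implementation would typically score the whole draft in one batched AR pass instead of calling the model once per token.

```python
from typing import Callable, Sequence

def accept_draft(prefix: Sequence[int], draft: Sequence[int],
                 ar_next: Callable[[Sequence[int]], int]) -> list[int]:
    """Keep the longest draft prefix that greedy AR decoding agrees with.

    On the first disagreement the AR token is emitted instead of the
    drafted one, so at least one token is produced for a non-empty draft.
    """
    out: list[int] = []
    ctx = list(prefix)
    for tok in draft:
        expected = ar_next(ctx)  # what the AR head would have produced here
        out.append(expected)
        ctx.append(expected)
        if expected != tok:      # draft diverged: stop accepting
            break
    return out

toy_ar = lambda ctx: ctx[-1] + 1  # toy "model": next token is last token + 1
print(accept_draft([1, 2, 3], [4, 5], toy_ar))  # [4, 5]
print(accept_draft([1, 2, 3], [4, 9], toy_ar))  # [4, 5]
print(accept_draft([1, 2, 3], [7, 8], toy_ar))  # [4]
```

With this rule the output always matches what greedy AR decoding alone would have produced; the draft only changes how many tokens are confirmed per step.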
+---
+## 4. Training Infrastructure
+### 4.1 Data Pipeline
+```python
+def token_stream(ds_names, target_tokens, seed, ...):
+    """
+    Streaming token generator from HuggingFace datasets.
+    - Supports multiple comma-separated datasets
+    - Auto-rotates through sources
+    - Handles chat format (messages key) or raw text
+    - Appends EOS tokens
+    """
+```
+Default pretraining sources (from code):
+```
+OpenTransformer/goddess-crawl
+OpenTransformer/agillm-crawl-data
+OpenTransformer/web-crawl-2026
+OpenTransformer/web-crawl-clean-v2
+OpenTransformer/scraped-web-data
+OpenTransformer/turbo-crawl
+OpenTransformer/sft-data-clean
+OpenTransformer/web-crawl-v1
+```
+### 4.2 Optimizer Configuration
+```python
+opt = AdamW([
+    {"params": core.parameters(), "lr": 5e-5},   # LR_CORE
+    {"params": ar_head.parameters(), "lr": 2e-4}, # LR_HEAD
+    {"params": sat_head.parameters(), "lr": 2e-4},
+])
+```
+The core and the output heads use separate learning rates: 5e-5 for the core and 2e-4 for both heads.
+### 4.3 Training Features
+- **AMP**: Automatic mixed precision (bf16 if available, else fp16)
+- **Gradient clipping**: max_norm=1.0
+- **Label smoothing**: 0.1
+- **Dropout**: 0.1 in attention
+- **Checkpointing**: Configurable interval (default 24h), automatic pruning
+### 4.4 Chinchilla Scaling
+```python
+ratio = 51.2 if args.chilla_max_double else 25
+param_count = count_params(core, ar_h, sat_h)
+target_tokens = int(ratio * param_count)
+```
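+The default targets 25 tokens per parameter, somewhat above the roughly 20 tokens per parameter usually quoted as the Chinchilla-optimal ratio; setting `args.chilla_max_double` raises this to 51.2 tokens per parameter ("double Chinchilla").
+For 698M params: ~17.5B tokens by default, ~35.7B tokens with the doubled ratio.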
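The token budgets quoted above follow directly from the ratio. The sketch below reproduces the arithmetic and, as an illustration, relates it to the approximate token count reported in the training status section. The function name `target_tokens` and the rounded figures are assumptions for this example.

```python
def target_tokens(param_count: int, double: bool = False) -> int:
    """Token budget from the tokens-per-parameter ratio in the snippet above."""
    ratio = 51.2 if double else 25
    return int(ratio * param_count)

PARAMS = 698_000_000         # "large" preset, rounded
default_budget = target_tokens(PARAMS)
double_budget = target_tokens(PARAMS, double=True)
tokens_seen = 2_400_000_000  # approximate figure from the status section
progress = tokens_seen / default_budget

print(f"default: {default_budget / 1e9:.2f}B tokens")
print(f"double:  {double_budget / 1e9:.2f}B tokens")
print(f"progress toward default budget: {progress:.1%}")
```

On these rounded inputs the reported ~2.4B tokens correspond to a little under 14% of the default budget.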
+### 4.5 Hot Config
+Runtime dataset switching without restart:
+```python
+# /workspace/hot_config.json
+{"datasets": ["new_dataset_1", "new_dataset_2"]}
+```
+Trainer checks this file periodically and switches data sources.
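One simple way to implement such a periodic check is to compare the file's modification time and reload only when it changes. The sketch below is an assumption about how this could work, not the trainer's actual code. `HotConfigWatcher` and `poll` are hypothetical names, and the demo uses a temporary directory in place of `/workspace`.

```python
import json
import os
import tempfile
from typing import Optional

class HotConfigWatcher:
    """Re-reads a JSON config file only when its modification time changes."""

    def __init__(self, path: str):
        self.path = path
        self._mtime: Optional[float] = None

    def poll(self) -> Optional[list]:
        """Return the new dataset list if the file changed, else None."""
        try:
            mtime = os.path.getmtime(self.path)
        except OSError:
            return None  # file missing: keep the current datasets
        if mtime == self._mtime:
            return None  # unchanged since the last poll
        self._mtime = mtime
        try:
            with open(self.path) as f:
                return json.load(f).get("datasets")
        except (json.JSONDecodeError, AttributeError):
            return None  # half-written or malformed file: ignore this round

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hot_config.json")
    watcher = HotConfigWatcher(path)
    first = watcher.poll()   # file does not exist yet
    with open(path, "w") as f:
        json.dump({"datasets": ["new_dataset_1", "new_dataset_2"]}, f)
    second = watcher.poll()  # file appeared
    third = watcher.poll()   # unchanged since the last poll
print(first, second, third)
```

Treating a malformed file as "no change" keeps a half-written config from interrupting a long training run; the next successful write is picked up on a later poll.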
+### 4.6 Auto-Grow
+Optional feature to increase block size during training:
+```bash
+--auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000
+```
+Starts with smaller context, grows as training stabilizes.
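How a grow plan maps to a context length at a given step can be sketched as follows. This assumes training starts at the first plan entry and advances one entry every `grow_every_steps` steps, staying at the last entry afterwards; the actual schedule in `n.py` may differ, for example by starting from `--block`. The helper name `block_size_at` is hypothetical.

```python
def block_size_at(step: int, grow_plan: str, grow_every_steps: int) -> int:
    """Context length at a given step, assuming one plan entry per interval."""
    sizes = [int(s) for s in grow_plan.split(",")]
    stage = min(step // grow_every_steps, len(sizes) - 1)  # clamp at final size
    return sizes[stage]

plan = "576,640,768,896,1024,1122"
for step in (0, 50_000, 120_000, 400_000):
    print(step, block_size_at(step, plan, 50_000))
# 0 576
# 50000 640
# 120000 768
# 400000 1122
```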
+---
+## 5. Inference
+### 5.1 AR Mode (Standard)
+```bash
+python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello"
+```
+Standard autoregressive generation with KV-cache.
+### 5.2 SAT Mode (Speculative)
+```bash
+python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var
+```
+Generates SAT_BLOCK tokens at once, optionally using gate to choose stride.
+### 5.3 Sampling Parameters
+| Parameter | AR Default | SAT Default |
+|-----------|------------|-------------|
+| temperature | 0.7 | 0.5 |
+| top_k | 0 | 30 |
+| repetition_penalty | 1.3 | 2.0 |
+| presence_penalty | 0.0 | 0.6 |
+| frequency_penalty | 0.3 | 1.0 |
+| penalty_last_n | 128 | 200 |
+SAT mode uses more aggressive penalties to avoid repetition from parallel generation.
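This document does not define the three penalties, so the sketch below uses their common definitions as an assumption: a multiplicative repetition penalty that shrinks positive logits and pushes negative ones lower, a flat presence penalty for any token seen in the window, and a frequency penalty proportional to how often it was seen. `apply_penalties` is a hypothetical helper operating on a plain list of logits, with the AR defaults from the table.

```python
from collections import Counter

def apply_penalties(logits, recent_tokens, repetition_penalty=1.3,
                    presence_penalty=0.0, frequency_penalty=0.3,
                    penalty_last_n=128):
    """Return a penalized copy of logits (list of floats indexed by token id)."""
    out = list(logits)
    # Only the last penalty_last_n generated tokens are considered
    counts = Counter(recent_tokens[-penalty_last_n:])
    for tok, n in counts.items():
        # Repetition penalty: shrink positive logits, push negative ones lower
        if out[tok] > 0:
            out[tok] /= repetition_penalty
        else:
            out[tok] *= repetition_penalty
        # Presence is a flat cost; frequency scales with the occurrence count
        out[tok] -= presence_penalty + frequency_penalty * n
    return out

logits = [2.6, -1.0, 0.5]
penalized = apply_penalties(logits, recent_tokens=[0, 0, 1])
print([round(x, 3) for x in penalized])  # [1.4, -1.6, 0.5]
```

Token 2 never appeared in the window, so its logit is untouched, while token 0 is penalized more heavily than token 1 because it appeared twice.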
+---
+## 6. Weight Tying
+Optional embedding-LM head weight tying:
+```python
+class ARHead(nn.Module):
+    def __init__(self, d, tie_weights=False, embedding_weight=None):
+        super().__init__()
+        # vocab is the vocabulary size (defined outside this excerpt)
+        if tie_weights and embedding_weight is not None:
+            self.proj = nn.Linear(d, vocab, bias=False)
+            self.proj.weight = embedding_weight  # Share weights
+```
+Reduces parameters by ~vocab × d (significant for large vocab).
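The saving can be checked on a toy example by counting unique parameters before and after tying. The sizes below are illustrative, not the model's, and the snippet is a standalone sketch, separate from `ARHead`.

```python
import torch.nn as nn

vocab, d = 1000, 64  # illustrative sizes, not the model's
emb = nn.Embedding(vocab, d)
proj = nn.Linear(d, vocab, bias=False)  # weight shape [vocab, d], same as emb

untied = emb.weight.numel() + proj.weight.numel()
proj.weight = emb.weight  # tie: both modules now hold the same Parameter
# Count each distinct Parameter object once
unique = {id(p): p for p in (emb.weight, proj.weight)}
tied = sum(p.numel() for p in unique.values())

print(untied - tied == vocab * d)
```

Because both modules reference one `Parameter`, gradients from the embedding lookup and from the output projection accumulate into the same tensor.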
+---
+## 7. Current Training Status
+As of January 2026:
+- Step: 2.2M+
+- Tokens seen: ~2.4B
+- Preset: large (698M params)
+- Training on vast.ai 3090
+- Checkpoints every 6 hours
+---
+## 8. Observations and Notes
+### 8.1 Expansion Ratio Effects
+Early experiments suggest:
+- 1x (standard): baseline behavior
+- 3x-6x: slight improvement in attention patterns
+- 12x+: diminishing returns, increased compute
+Not rigorously benchmarked. Observations only.
+### 8.2 AR vs AR+SAT
+AR-only mode (`--ar_only`) is available for comparison. Joint training roughly doubles the forward passes per batch: one pass with the causal mask and one with the SAT mask.
+### 8.3 Known Issues
+1. SAT inference quality lags AR (expected - harder task)
+2. Gate accuracy mediocre (often just predicts "emit 2")
+3. Memory usage higher than equivalent AR-only model
+---
+## 9. Code Location
+Primary file: `n.py`
+Key classes:
+- `TuneableAttentionMHA`: The modified attention
+- `Block`: Transformer block
+- `Encoder`: Full encoder stack
+- `ARHead`, `SATHead`: Output heads
+- `token_stream`: Data pipeline
+- `_train_phase`: Training loop
+---
+## 10. License and Citation
+Code released under MIT license.
+If referencing this work:
+```
+@misc{agillm3,
+  author = {Bisset, Scott},
+  title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training},
+  year = {2026},
+  publisher = {OpenTransformers Ltd}
+}
+```
+---
+## Appendix A: Full Preset Table
+```python
+PRESETS = {
+    "femto_1x":  dict(d=16,   layers=1,  heads=1,  rank=16),
+    "femto_12x": dict(d=16,   layers=1,  heads=1,  rank=192),
+    "pico_1x":   dict(d=32,   layers=1,  heads=2,  rank=16),
+    "pico_12x":  dict(d=32,   layers=1,  heads=2,  rank=192),
+    "nano_1x":   dict(d=64,   layers=2,  heads=4,  rank=16),
+    "nano_3x":   dict(d=64,   layers=2,  heads=4,  rank=48),
+    "nano_12x":  dict(d=64,   layers=2,  heads=4,  rank=192),
+    "micro_12x": dict(d=128,  layers=4,  heads=8,  rank=192),
+    "small":     dict(d=512,  layers=8,  heads=16, rank=64),
+    "base":      dict(d=768,  layers=12, heads=24, rank=96),
+    "large":     dict(d=1024, layers=24, heads=16, rank=128),
+}
+```
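The expansion ratio of each preset is implied by the table: the per-head dimension is d divided by the number of heads, and the ratio is rank divided by that. The sketch below computes it for a subset of the presets copied from the table above, assuming d_k = d / heads; the helper name `expansion_ratio` is hypothetical.

```python
PRESETS = {
    "nano_1x":   dict(d=64,   layers=2,  heads=4,  rank=16),
    "nano_3x":   dict(d=64,   layers=2,  heads=4,  rank=48),
    "nano_12x":  dict(d=64,   layers=2,  heads=4,  rank=192),
    "micro_12x": dict(d=128,  layers=4,  heads=8,  rank=192),
    "small":     dict(d=512,  layers=8,  heads=16, rank=64),
    "base":      dict(d=768,  layers=12, heads=24, rank=96),
    "large":     dict(d=1024, layers=24, heads=16, rank=128),
}

def expansion_ratio(cfg: dict) -> float:
    """rank / d_k, with d_k assumed to be d // heads."""
    d_k = cfg["d"] // cfg["heads"]
    return cfg["rank"] / d_k

ratios = {name: expansion_ratio(cfg) for name, cfg in PRESETS.items()}
for name, ratio in ratios.items():
    print(f"{name:10s} {ratio:g}x")
```

On this reading the unsuffixed presets are also expansion configurations: `small` and `large` come out at 2x and `base` at 3x.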
+---
+## Appendix B: Example Training Command
+```bash
+python n.py train \
+    --preset large \
+    --batch_size 4 \
+    --block 1122 \
+    --amp \
+    --save_every_sec 21600 \
+    --save_dir /workspace/ckpts_expansion \
+    --max_ckpts 5 \
+    --resume /workspace/ckpts_expansion
+```
+---
+*Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM*