	# AGILLM-3: Technical Documentation
	## A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training

	Scott Bisset
	OpenTransformers Ltd
	January 2026

	---

	## Abstract

	This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via learned orthogonal projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models—AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community.

	---

	## 1. Motivation

	### 1.1 What This Is

	AGILLM-3 is a research project exploring:

	1. Tuneable attention rank: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension?

	2. Joint AR+SAT training: Can a model learn both next-token prediction AND multi-token speculation simultaneously?

	### 1.2 What This Isn't

	This is not:
	- A frontier model
	- A competitor to GPT-4/Claude/Gemini
	- A claim that small models can match large ones
	- A business

	AGI already exists. This is documentation, not disruption.

	---

	## 2. Architecture

	### 2.1 Overview

	```
	Input tokens
	↓
	Embedding (vocab → d)
	↓
	[Block × L layers]
	├── LayerNorm → TuneableAttentionMHA → +residual
	└── LayerNorm → FFN (d → 4d → d) → +residual
	↓
	Final LayerNorm
	↓
	├── ARHead (next token prediction)
	└── SATHead (multi-token speculation)
	```

	### 2.2 Tuneable Attention (The Novel Bit)

	Standard multi-head attention computes:

	```
	Q = XWq, K = XWk, V = XWv
	Attention = softmax(QKᵀ/√d_k) · V
	```

	Where Q, K have shape [batch, seq, heads, d_k].

	AGILLM-3's modification:

	```python
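	import torch
	import torch.nn as nn

	class TuneableAttentionMHA(nn.Module):
	    def __init__(self, d: int, h: int, r: int):
	        super().__init__()
	        self.h, self.d_k = h, d // h  # number of heads, per-head dimension
	        # r = rank (the tuneable parameter)
	        self.U = nn.Parameter(torch.randn(self.d_k, r))
	        nn.init.orthogonal_(self.U)

	    def _proj_qk(self, x):
	        # x: Q or K before the head split, shape [batch, seq, d]
	        # Split heads, then project through U:
	        # [batch, heads, seq, d_k] @ [d_k, r] → [batch, heads, seq, r]
	        B, N, _ = x.shape
	        return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U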
	```

	The attention computation becomes:

	```
	Q' = Q @ U # [batch, heads, seq, r]
	K' = K @ U # [batch, heads, seq, r]
	Attention = softmax(Q'K'ᵀ/√d_k) · V
	```

	What this means:

	| Regime | Condition | Effect |
	|--------|-----------|--------|
	| Compression | r < d_k | Q-K similarity computed in lower-dim space |
	| Identity | r = d_k | Equivalent to standard attention (if U=I) |
	| Expansion | r > d_k | Q-K similarity computed in higher-dim space |

	The presets encode this as ratios:
	- `nano_1x`: r = d_k (standard)
	- `nano_3x`: r = 3 × d_k (expansion)
	- `nano_12x`: r = 12 × d_k (heavy expansion)

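	Hypothesis being tested: Does expanding the Q-K interaction space improve attention quality? Orthogonal initialization keeps U from destroying information at the start of training: for r ≥ d_k, U has orthonormal rows (a rotation/reflection when r = d_k), so Q'K'ᵀ equals QKᵀ at initialization. For r < d_k, U has orthonormal columns and acts as a projection onto an r-dimensional subspace, so some information is discarded by design.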
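The effect of the orthogonal initialization on the three regimes can be checked numerically. The following is a minimal pure-Python sketch, not code from `n.py`: the helpers `make_u`, `orthonormal_rows`, and `qk_scores` are illustrative names, and Gram-Schmidt stands in for `nn.init.orthogonal_` under the assumption that both yield a semi-orthogonal matrix.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthonormal_rows(n_rows, n_cols, rng):
    """Gram-Schmidt on Gaussian vectors: n_rows orthonormal rows (n_rows <= n_cols)."""
    rows = []
    while len(rows) < n_rows:
        v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows

def make_u(d_k, r, rng):
    """Semi-orthogonal U of shape [d_k, r] (hypothetical stand-in for the init)."""
    if r >= d_k:
        return orthonormal_rows(d_k, r, rng)         # U @ U.T == I
    return transpose(orthonormal_rows(r, d_k, rng))  # U.T @ U == I

def qk_scores(q, k, u=None):
    """Unscaled scores Q K^T, optionally after projecting Q and K through U."""
    if u is not None:
        q, k = matmul(q, u), matmul(k, u)
    return matmul(q, transpose(k))

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(0)
d_k, seq = 4, 3
q = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(seq)]
k = [[rng.gauss(0.0, 1.0) for _ in range(d_k)] for _ in range(seq)]
base = qk_scores(q, k)

diff_expand = max_abs_diff(base, qk_scores(q, k, make_u(d_k, 12, rng)))   # r = 3 x d_k
diff_identity = max_abs_diff(base, qk_scores(q, k, make_u(d_k, 4, rng)))  # r = d_k
diff_compress = max_abs_diff(base, qk_scores(q, k, make_u(d_k, 2, rng)))  # r < d_k
print(diff_expand, diff_identity, diff_compress)
```

In this sketch the expansion and identity cases reproduce the standard scores up to rounding, while compression does not. If the same holds for the real initialization, any difference between an expanded model and standard attention comes from U moving away from orthogonality during training, not from the initial projection.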

	### 2.3 Positional Encoding: ALiBi

	AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions:

	```python
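	import torch

	def alibi_bias(n_heads, n_tokens):
	    # Each head gets a different slope: 2^(-8/n_heads), 2^(-16/n_heads), ...
	    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
	    # Attention score penalized by distance: score -= slope * |i - j|
	    pos = torch.arange(n_tokens)
	    distance_matrix = (pos[:, None] - pos[None, :]).abs()
	    return -slopes[:, None, None] * distance_matrix  # [n_heads, n_tokens, n_tokens]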
	```

	ALiBi chosen for:
	- Zero additional parameters
	- Good length extrapolation
	- Simplicity
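As a worked example of the slope schedule, the sketch below evaluates the slopes for 8 heads and the bias one query position receives. It uses plain Python lists; `alibi_slopes` and `alibi_bias_row` are illustrative helper names, not functions from `n.py`.

```python
def alibi_slopes(n_heads):
    """Geometric slope sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    return [2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)]

def alibi_bias_row(slope, query_pos, n_tokens):
    """Bias added to one query row: -slope * |i - j| for every key position j."""
    return [-slope * abs(query_pos - j) for j in range(n_tokens)]

slopes = alibi_slopes(8)  # halves from 2^-1 down to 2^-8
row = alibi_bias_row(slopes[0], query_pos=3, n_tokens=4)
print(slopes)
print(row)  # [-1.5, -1.0, -0.5, -0.0]
```

Heads with a large slope focus on nearby tokens, while heads with a small slope keep an almost flat view of the whole context.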

	### 2.4 Block Structure

	Each transformer block:

	```python
	class Block(nn.Module):
	    def forward(self, x, mask):
	        # Pre-norm architecture
	        x = x + self.mha(self.ln1(x), mask)
	        x = x + self.ff(self.ln2(x))
	        return x
	```

	FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d)

	### 2.5 Model Configurations

	From the presets in code:

	| Preset | d_model | Layers | Heads | Rank | ~Params |
	|--------|---------|--------|-------|------|---------|
	| nano_3x | 64 | 2 | 4 | 48 | ~200K |
	| micro_12x | 128 | 4 | 8 | 192 | ~2M |
	| small | 512 | 8 | 16 | 64 | ~50M |
	| base | 768 | 12 | 24 | 96 | ~125M |
	| large | 1024 | 24 | 16 | 128 | ~698M |

	The "large" preset at 698M parameters is the primary AGILLM-3 configuration.
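The table lists absolute ranks, so the expansion ratio of each preset is not obvious at a glance. A small sketch, assuming the per-head dimension is `d_model / heads` (an assumption, though it matches the `nano_1x`/`nano_3x`/`nano_12x` naming), derives the ratio `r / d_k` for each row:

```python
PRESET_ROWS = {
    # name: (d_model, heads, rank), copied from the configuration table
    "nano_3x": (64, 4, 48),
    "micro_12x": (128, 8, 192),
    "small": (512, 16, 64),
    "base": (768, 24, 96),
    "large": (1024, 16, 128),
}

def expansion_ratio(d_model, heads, rank):
    d_k = d_model // heads  # assumption: per-head dim is d_model / heads
    return rank / d_k

ratios = {name: expansion_ratio(*cfg) for name, cfg in PRESET_ROWS.items()}
for name, ratio in ratios.items():
    print(f"{name}: r / d_k = {ratio:g}")
```

Under that assumption the unsuffixed presets are not 1x models: `small` and `large` come out at 2x and `base` at 3x.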

	---

	## 3. Joint AR+SAT Training

	### 3.1 The Idea

	Standard language models train only on next-token prediction (autoregressive, AR).

	AGILLM-3 trains on BOTH:

	1. AR objective: Predict token t+1 from tokens 1..t
	2. SAT objective: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive)

	### 3.2 Masking

	AR mask (standard causal):
	```
	Position can attend to: all previous positions
	[1 0 0 0]
	[1 1 0 0]
	[1 1 1 0]
	[1 1 1 1]
	```

	SAT mask (block-wise):
	```
	SAT_BLOCK = 2
	Positions in same block can attend to each other AND all previous blocks

	Block 0: positions 0,1 can see each other
	Block 1: positions 2,3 can see each other + block 0
	etc.
	```

	```python
	def sat_mask(n, block=2):
	    idx = torch.arange(n)
	    grp = idx // block  # block index of each position
	    # Row = query, column = key: allow keys in the same block OR previous blocks
	    allow = grp[:, None] >= grp[None, :]
	    return torch.zeros(n, n).masked_fill(~allow, float("-inf"))
	```
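To make the difference between the two masks concrete, this sketch prints the block-wise pattern as a 0/1 matrix in the same style as the causal matrix shown earlier. It uses plain lists instead of tensors; `causal_allowed` and `sat_allowed` are illustrative names.

```python
def causal_allowed(n):
    """1 where query i (row) may attend to key j (column) under the AR mask."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def sat_allowed(n, block=2):
    """1 where key j lies in the same block as query i or in an earlier block."""
    return [[1 if j // block <= i // block else 0 for j in range(n)] for i in range(n)]

sat = sat_allowed(4, block=2)
for row in sat:
    print(row)
# [1, 1, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 1]
# [1, 1, 1, 1]
```

Compared with the causal mask, the only extra permissions are within a block: position 0 can see position 1, and position 2 can see position 3. With `block=1` the pattern reduces to the causal mask.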

	### 3.3 Training Loop

	Each batch:

	```python
	# Forward pass 1: AR
	h_ar = core(ids, causal_mask(n))
	logits_ar = ar_head(h_ar)[:, :-1]
	loss_ar = cross_entropy(logits_ar, targets[:, 1:])

	# Forward pass 2: SAT
	h_sat = core(ids, sat_mask(n))
	logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:])
	loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1])

	# Optional: gate loss (predict how many tokens to emit)
	if gate is not None:
	    loss_sat += 0.1 * cross_entropy(gate, emit_target)

	loss = loss_ar + loss_sat
	```

	### 3.4 SAT Head with Gating

	```python
	class SATHead(nn.Module):
	    def __init__(self, d, mode="var"):
	        super().__init__()
	        self.proj = nn.Linear(d, vocab)  # Token prediction
	        self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens?
	```

	The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation.
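One way the gate output could drive variable-stride emission is sketched below. The mapping from gate class to stride (class 0 means one token, class 1 means two) is an assumption for illustration, as are the helper names `choose_stride` and `emit`. The actual convention in `n.py` may differ.

```python
def choose_stride(gate_logits):
    """Assumed mapping: class 0 -> emit 1 token, class 1 -> emit 2 tokens."""
    best_class = max(range(len(gate_logits)), key=lambda i: gate_logits[i])
    return 1 + best_class

def emit(block_tokens, gate_logits=None):
    """Keep the first `stride` drafted tokens; without a gate, keep the whole block."""
    stride = len(block_tokens) if gate_logits is None else choose_stride(gate_logits)
    return block_tokens[:stride]

cautious = emit([17, 42], gate_logits=[2.0, -1.0])  # gate prefers class 0
confident = emit([17, 42], gate_logits=[0.1, 0.3])  # gate prefers class 1
fixed = emit([17, 42])                              # no gate: fixed stride
print(cautious, confident, fixed)  # [17] [17, 42] [17, 42]
```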

	### 3.5 Why Joint Training?

	Hypothesis: Training both objectives together might:
	1. Improve representation quality (multi-task learning)
	2. Enable speculative decoding at inference (predict multiple tokens, verify with AR)
	3. Learn confidence estimation via the gate

	Current status: Experimental. No claims of improvement over AR-only.
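The "predict multiple tokens, verify with AR" idea in point 2 can be sketched as a greedy draft-then-verify step. This is a generic illustration with toy stand-in models, not the inference path in `n.py`. `speculative_step`, `toy_draft_block`, and `toy_ar_next` are hypothetical names, and a real implementation would verify all drafted positions in one batched AR pass instead of calling the model per token.

```python
def speculative_step(prefix, draft_block, ar_next):
    """Accept drafted tokens while they match the AR model's greedy choice.

    draft_block(context) -> list of proposed tokens (the SAT side)
    ar_next(context)     -> single greedy next token (the AR side)
    """
    drafted = draft_block(list(prefix))
    accepted = []
    for tok in drafted:
        expected = ar_next(list(prefix) + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch, then stop
            break
        accepted.append(tok)
    if not accepted:  # empty draft: fall back to one plain AR step
        accepted.append(ar_next(list(prefix)))
    return accepted

def toy_ar_next(context):
    """Toy AR model: always continues with last token + 1 (mod 10)."""
    return (context[-1] + 1) % 10

def toy_draft_block(context):
    """Toy SAT head: first guess is right, second guess is wrong."""
    return [(context[-1] + 1) % 10, (context[-1] + 3) % 10]

out = speculative_step([1, 2, 3], toy_draft_block, toy_ar_next)
print(out)  # [4, 5]
```

The output always equals what greedy AR decoding would have produced, so draft quality affects only speed, not the generated text.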

	---

	## 4. Training Infrastructure

	### 4.1 Data Pipeline

	```python
	def token_stream(ds_names, target_tokens, seed, ...):
	    """
	    Streaming token generator from HuggingFace datasets.
	    - Supports multiple comma-separated datasets
	    - Auto-rotates through sources
	    - Handles chat format (messages key) or raw text
	    - Appends EOS tokens
	    """
	```

	Default pretraining sources (from code):
	```
	OpenTransformer/goddess-crawl
	OpenTransformer/agillm-crawl-data
	OpenTransformer/web-crawl-2026
	OpenTransformer/web-crawl-clean-v2
	OpenTransformer/scraped-web-data
	OpenTransformer/turbo-crawl
	OpenTransformer/sft-data-clean
	OpenTransformer/web-crawl-v1
	```

	### 4.2 Optimizer Configuration

	```python
	from torch.optim import AdamW

	opt = AdamW([
	    {"params": core.parameters(), "lr": 5e-5},      # LR_CORE
	    {"params": ar_head.parameters(), "lr": 2e-4},   # LR_HEAD
	    {"params": sat_head.parameters(), "lr": 2e-4},  # LR_HEAD
	])
	```

	Separate learning rates for core vs heads.

	### 4.3 Training Features

	- AMP: Automatic mixed precision (bf16 if available, else fp16)
	- Gradient clipping: max_norm=1.0
	- Label smoothing: 0.1
	- Dropout: 0.1 in attention
	- Checkpointing: Configurable interval (default 24h), automatic pruning
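For reference, label smoothing of 0.1 changes the loss as sketched below. The formula is the standard one (mix the one-hot target with a uniform distribution over the vocabulary); whether `n.py` applies it through the framework's built-in option or by hand is not stated here, and `smoothed_cross_entropy` is an illustrative name.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against (1 - s) * one_hot(target) + s / V."""
    log_p = log_softmax(logits)
    nll = -log_p[target]                # usual negative log-likelihood
    uniform = -sum(log_p) / len(log_p)  # loss against the uniform distribution
    return (1.0 - smoothing) * nll + smoothing * uniform

logits = [2.0, 0.5, -1.0, 0.0]
hard = smoothed_cross_entropy(logits, target=0, smoothing=0.0)
soft = smoothed_cross_entropy(logits, target=0, smoothing=0.1)
print(f"hard={hard:.4f} soft={soft:.4f}")
```

When the model is already confident in the correct token, the smoothed loss is higher than the plain loss, which discourages overconfident logits.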

	### 4.4 Chinchilla Scaling

	```python
	ratio = 51.2 if args.chilla_max_double else 25
	param_count = count_params(core, ar_h, sat_h)
	target_tokens = int(ratio * param_count)
	```

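	By default the token budget is about 25 tokens per parameter, following Chinchilla-style scaling; `chilla_max_double` raises this to 51.2 tokens per parameter ("double Chinchilla").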

	For 698M params: ~17.5B tokens default, ~35.7B tokens with double.
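The arithmetic behind those budgets, together with how far the run reported in Section 7 (about 2.4B tokens seen) has progressed, can be reproduced as follows. The parameter count is the rounded 698M figure, not an exact count from the code.

```python
def target_tokens(param_count, double=False):
    """Token budget using the ratios from the snippet above."""
    ratio = 51.2 if double else 25
    return int(ratio * param_count)

params = 698_000_000  # rounded figure quoted in this document
default_budget = target_tokens(params)
double_budget = target_tokens(params, double=True)
tokens_seen = 2.4e9   # approximate value from Section 7

print(f"default: {default_budget / 1e9:.2f}B tokens")          # 17.45B
print(f"double:  {double_budget / 1e9:.2f}B tokens")           # 35.74B
print(f"progress: {tokens_seen / default_budget:.1%} of default budget")  # 13.8%
```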

	### 4.5 Hot Config

	Runtime dataset switching without restart:

	```python
	# /workspace/hot_config.json
	{"datasets": ["new_dataset_1", "new_dataset_2"]}
	```

	Trainer checks this file periodically and switches data sources.
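A periodic check of this kind can be implemented by watching the file's modification time. The sketch below is a hypothetical illustration, not the trainer's actual code. `HotConfigWatcher` and `poll` are made-up names, and it assumes the file holds plain JSON with a `datasets` list (so without the `#` path comment shown in the example above).

```python
import json
import os
import tempfile

class HotConfigWatcher:
    """Re-reads a JSON config only when the file's modification time changes."""

    def __init__(self, path):
        self.path = path
        self._last_mtime = None

    def poll(self):
        """Return the new dataset list if the file changed, otherwise None."""
        try:
            mtime = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return None
        if mtime == self._last_mtime:
            return None
        self._last_mtime = mtime
        try:
            with open(self.path, encoding="utf-8") as f:
                cfg = json.load(f)
        except (OSError, json.JSONDecodeError):
            return None  # unreadable or half-written file: keep current datasets
        datasets = cfg.get("datasets") if isinstance(cfg, dict) else None
        if isinstance(datasets, list) and all(isinstance(d, str) for d in datasets):
            return datasets
        return None

# Demo against a temporary file instead of /workspace/hot_config.json
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hot_config.json")
    watcher = HotConfigWatcher(path)
    first = watcher.poll()   # file does not exist yet
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"datasets": ["new_dataset_1", "new_dataset_2"]}, f)
    second = watcher.poll()  # file appeared
    third = watcher.poll()   # unchanged since the last poll
print(first, second, third)
```

Calling `poll()` every few hundred steps keeps the overhead negligible, and ignoring malformed files means a partially written config cannot interrupt training.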

	### 4.6 Auto-Grow

	Optional feature to increase block size during training:

	```bash
	--auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000
	```

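	Training starts with a smaller context and moves to the next size in `--grow_plan` every `--grow_every_steps` steps, so the full context is only used once training has stabilized.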
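The schedule implied by these flags can be sketched as a lookup from step to block size. This assumes the first entry of the plan is the starting size and that the trainer advances one entry every `grow_every_steps` steps. Both are assumptions about the semantics, and `parse_grow_plan` and `block_size_at` are illustrative names.

```python
def parse_grow_plan(plan):
    """Turn "576,640,768" into [576, 640, 768]."""
    return [int(part) for part in plan.split(",") if part.strip()]

def block_size_at(step, plan, grow_every_steps):
    """Block size in effect at a given step (assumed: one plan entry per interval)."""
    stage = min(step // grow_every_steps, len(plan) - 1)
    return plan[stage]

plan = parse_grow_plan("576,640,768,896,1024,1122")
sizes = [block_size_at(s, plan, 50_000) for s in (0, 49_999, 50_000, 250_000, 2_200_000)]
print(sizes)  # [576, 576, 640, 1122, 1122]
```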

	---

	## 5. Inference

	### 5.1 AR Mode (Standard)

	```bash
	python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello"
	```

	Standard autoregressive generation with KV-cache.

	### 5.2 SAT Mode (Speculative)

	```bash
	python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var
	```

	Generates SAT_BLOCK tokens at once, optionally using gate to choose stride.

	### 5.3 Sampling Parameters

	| Parameter | AR Default | SAT Default |
	|-----------|------------|-------------|
	| temperature | 0.7 | 0.5 |
	| top_k | 0 | 30 |
	| repetition_penalty | 1.3 | 2.0 |
	| presence_penalty | 0.0 | 0.6 |
	| frequency_penalty | 0.3 | 1.0 |
	| penalty_last_n | 128 | 200 |

	SAT mode uses more aggressive penalties to avoid repetition from parallel generation.
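To show how the three penalties interact, here is a sketch using common definitions: a multiplicative repetition penalty on tokens seen in the last `penalty_last_n` positions, plus subtractive presence and frequency penalties. The exact order and formulas used by `n.py` are not documented here, so this combination is an assumption. The function defaults are the AR values from the table.

```python
from collections import Counter

def apply_penalties(logits, history, repetition_penalty=1.3, presence_penalty=0.0,
                    frequency_penalty=0.3, penalty_last_n=128):
    """Return a penalized copy of `logits` (a list indexed by token id)."""
    window = history[-penalty_last_n:] if penalty_last_n > 0 else []
    counts = Counter(window)
    out = list(logits)
    for tok, count in counts.items():
        if not 0 <= tok < len(out):
            continue
        x = out[tok]
        # Repetition penalty: shrink positive logits, push negative ones lower
        x = x / repetition_penalty if x > 0 else x * repetition_penalty
        # Presence penalty applies once; frequency penalty scales with the count
        x -= presence_penalty + frequency_penalty * count
        out[tok] = x
    return out

logits = [2.0, -1.0, 0.5]
history = [0, 0, 1]  # token 0 seen twice, token 1 once, token 2 never
sat_style = apply_penalties(logits, history, repetition_penalty=2.0,
                            presence_penalty=0.6, frequency_penalty=1.0,
                            penalty_last_n=200)
print(sat_style)  # approximately [-1.6, -3.6, 0.5]
```

With the SAT defaults, a token that already appeared twice drops from the top logit to well below an unseen token, which is the intended counterweight to repetition from parallel generation.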

	---

	## 6. Weight Tying

	Optional embedding-LM head weight tying:

	```python
	class ARHead(nn.Module):
	    def __init__(self, d, tie_weights=False, embedding_weight=None):
	        super().__init__()
	        if tie_weights and embedding_weight is not None:
	            self.proj = nn.Linear(d, vocab, bias=False)
	            self.proj.weight = embedding_weight  # Share weights with the embedding
	```

	Reduces parameters by ~vocab × d (significant for large vocab).

	---

	## 7. Current Training Status

	As of January 2026:
	- Step: 2.2M+
	- Tokens seen: ~2.4B
	- Preset: large (698M params)
	- Training on vast.ai 3090
	- Checkpoints every 6 hours

	---

	## 8. Observations and Notes

	### 8.1 Expansion Ratio Effects

	Early experiments suggest:
	- 1x (standard): baseline behavior
	- 3x-6x: slight improvement in attention patterns
	- 12x+: diminishing returns, increased compute

	Not rigorously benchmarked. Observations only.

	### 8.2 AR vs AR+SAT

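	AR-only mode (`--ar_only`) is available for comparison. Joint training runs two forward passes per batch (one with the AR mask, one with the SAT mask), so it roughly doubles forward compute relative to AR-only.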

	### 8.3 Known Issues

	1. SAT inference quality lags AR (expected - harder task)
	2. Gate accuracy mediocre (often just predicts "emit 2")
	3. Memory usage higher than equivalent AR-only model

	---

	## 9. Code Location

	Primary file: `n.py`

	Key classes:
	- `TuneableAttentionMHA`: The modified attention
	- `Block`: Transformer block
	- `Encoder`: Full encoder stack
	- `ARHead`, `SATHead`: Output heads
	- `token_stream`: Data pipeline
	- `_train_phase`: Training loop

	---

	## 10. License and Citation

	Code released under MIT license.

	If referencing this work:
	```
	@misc{agillm3,
	author = {Bisset, Scott},
	title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training},
	year = {2026},
	publisher = {OpenTransformers Ltd}
	}
	```

	---

	## Appendix A: Full Preset Table

	```python
	PRESETS = {
	"femto_1x": dict(d=16, layers=1, heads=1, rank=16),
	"femto_12x": dict(d=16, layers=1, heads=1, rank=192),
	"pico_1x": dict(d=32, layers=1, heads=2, rank=16),
	"pico_12x": dict(d=32, layers=1, heads=2, rank=192),
	"nano_1x": dict(d=64, layers=2, heads=4, rank=16),
	"nano_3x": dict(d=64, layers=2, heads=4, rank=48),
	"nano_12x": dict(d=64, layers=2, heads=4, rank=192),
	"micro_12x": dict(d=128, layers=4, heads=8, rank=192),
	"small": dict(d=512, layers=8, heads=16, rank=64),
	"base": dict(d=768, layers=12, heads=24, rank=96),
	"large": dict(d=1024, layers=24, heads=16, rank=128),
	}
	```

	---

	## Appendix B: Example Training Command

	```bash
	python n.py train \
	--preset large \
	--batch_size 4 \
	--block 1122 \
	--amp \
	--save_every_sec 21600 \
	--save_dir /workspace/ckpts_expansion \
	--max_ckpts 5 \
	--resume /workspace/ckpts_expansion
	```

	---

	Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM