# AGILLM-3: Technical Documentation
## A 698M Parameter Language Model with Tuneable Attention Rank and Joint AR+SAT Training
**Scott Bisset**
OpenTransformers Ltd
January 2026
---
## Abstract
This document provides complete technical documentation of AGILLM-3, a language model exploring two architectural variations: (1) tuneable attention rank via learned orthogonal projections, and (2) joint autoregressive and semi-autoregressive training. We make no claims of competing with frontier models; AGI exists in systems like Claude and GPT-4. This is documentation of independent research for reproducibility and potential future reference by the research community.
---
## 1. Motivation
### 1.1 What This Is
AGILLM-3 is a research project exploring:
1. **Tuneable attention rank**: What happens when Q and K are projected through an intermediate space of different dimensionality than the standard head dimension?
2. **Joint AR+SAT training**: Can a model learn both next-token prediction AND multi-token speculation simultaneously?
### 1.2 What This Isn't
This is not:
- A frontier model
- A competitor to GPT-4/Claude/Gemini
- A claim that small models can match large ones
- A business
AGI already exists. This is documentation, not disruption.
---
## 2. Architecture
### 2.1 Overview
```
Input tokens
  ↓
Embedding (vocab → d)
  ↓
[Block × L layers]
  ├── LayerNorm → TuneableAttentionMHA → +residual
  └── LayerNorm → FFN (d → 4d → d) → +residual
  ↓
Final LayerNorm
  ↓
  ├── ARHead (next token prediction)
  └── SATHead (multi-token speculation)
```
### 2.2 Tuneable Attention (The Novel Bit)
Standard multi-head attention computes:
```
Q = XWq,  K = XWk,  V = XWv
Attention = softmax(QKᵀ/√d_k) · V
```
Where Q, K have shape [batch, seq, heads, d_k].
**AGILLM-3's modification:**
```python
class TuneableAttentionMHA(nn.Module):
    def __init__(self, d: int, h: int, r: int):
        super().__init__()
        # r = rank (the tuneable parameter); d_k = per-head dimension
        self.h, self.d_k = h, d // h
        self.U = nn.Parameter(torch.randn(self.d_k, r))
        nn.init.orthogonal_(self.U)

    def _proj_qk(self, x):
        # Project through U: [batch, seq, heads, d_k] @ [d_k, r] -> [batch, heads, seq, r]
        B, N, _ = x.shape
        return x.view(B, N, self.h, self.d_k).transpose(1, 2) @ self.U
```
The attention computation becomes:
```
Q' = Q @ U   # [batch, heads, seq, r]
K' = K @ U   # [batch, heads, seq, r]
Attention = softmax(Q'K'ᵀ/√d_k) · V
```
**What this means:**
| Regime | Condition | Effect |
|--------|-----------|--------|
| Compression | r < d_k | Q-K similarity computed in lower-dim space |
| Identity | r = d_k | Equivalent to standard attention (if U=I) |
| Expansion | r > d_k | Q-K similarity computed in higher-dim space |
The presets encode this as ratios:
- `nano_1x`: r = d_k (standard)
- `nano_3x`: r = 3 × d_k (expansion)
- `nano_12x`: r = 12 × d_k (heavy expansion)
**Hypothesis being tested:** Does expanding the Q-K interaction space improve attention quality? The orthogonal initialization gives U orthonormal rows or columns at the start, so the projection begins as a rotation/embedding rather than a transformation that destroys information.
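A minimal self-contained sketch of this mechanism follows, assuming bias-free Q/K/V/output projections, no dropout, and an additive attention mask; class and variable names are illustrative and may differ from `n.py`.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TuneableRankAttentionSketch(nn.Module):
    """Illustrative tuneable-rank attention: Q and K are projected through U into rank r."""
    def __init__(self, d: int, h: int, r: int):
        super().__init__()
        assert d % h == 0
        self.h, self.d_k = h, d // h
        self.wq, self.wk, self.wv, self.wo = (nn.Linear(d, d, bias=False) for _ in range(4))
        self.U = nn.Parameter(torch.randn(self.d_k, r))
        nn.init.orthogonal_(self.U)  # start from an orthonormal projection

    def forward(self, x, mask=None):
        B, N, d = x.shape
        split = lambda t: t.view(B, N, self.h, self.d_k).transpose(1, 2)  # [B, h, N, d_k]
        q, k, v = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        q, k = q @ self.U, k @ self.U                          # project into the rank-r space
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:                                   # additive causal/SAT/ALiBi bias
            scores = scores + mask
        out = F.softmax(scores, dim=-1) @ v                    # [B, h, N, d_k]
        return self.wo(out.transpose(1, 2).reshape(B, N, d))
```

With r = d_k and U fixed to the identity, this reduces to standard scaled dot-product attention, which is the sense in which the rank is "tuneable".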
### 2.3 Positional Encoding: ALiBi
AGILLM-3 uses ALiBi (Attention with Linear Biases) rather than RoPE or learned positions:
```python
import torch

def alibi_bias(n_heads, n_tokens):
    # Each head gets a different slope: 2^(-8*i/n_heads) for i = 1..n_heads
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(n_tokens)
    # Attention score penalized by distance: score -= slope * |i - j|
    dist = (pos[None, :] - pos[:, None]).abs().float()
    return -slopes[:, None, None] * dist  # [n_heads, n_tokens, n_tokens]
```
ALiBi chosen for:
- Zero additional parameters
- Good length extrapolation
- Simplicity
### 2.4 Block Structure
Each transformer block:
```python
class Block(nn.Module):
def forward(self, x, mask):
# Pre-norm architecture
x = x + self.mha(self.ln1(x), mask)
x = x + self.ff(self.ln2(x))
return x
```
FFN is standard: Linear(d, 4d) → ReLU → Linear(4d, d)
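For reference, a minimal sketch of that FFN (module name illustrative, not necessarily the one in `n.py`):

```python
import torch.nn as nn

class FFNSketch(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        # Linear(d, 4d) -> ReLU -> Linear(4d, d), as described above
        self.net = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return self.net(x)
```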
### 2.5 Model Configurations
From the presets in code:
| Preset | d_model | Layers | Heads | Rank | ~Params |
|--------|---------|--------|-------|------|---------|
| nano_3x | 64 | 2 | 4 | 48 | ~200K |
| micro_12x | 128 | 4 | 8 | 192 | ~2M |
| small | 512 | 8 | 16 | 64 | ~50M |
| base | 768 | 12 | 24 | 96 | ~125M |
| large | 1024 | 24 | 16 | 128 | ~698M |
The "large" preset at 698M parameters is the primary AGILLM-3 configuration.
---
## 3. Joint AR+SAT Training
### 3.1 The Idea
Standard language models train only on next-token prediction (autoregressive, AR).
AGILLM-3 trains on BOTH:
1. **AR objective**: Predict token t+1 from tokens 1..t
2. **SAT objective**: Predict tokens t+1..t+k from tokens 1..t (semi-autoregressive)
### 3.2 Masking
**AR mask** (standard causal):
```
Position can attend to: all previous positions
[1 0 0 0]
[1 1 0 0]
[1 1 1 0]
[1 1 1 1]
```
**SAT mask** (block-wise):
```
SAT_BLOCK = 2
Positions in same block can attend to each other AND all previous blocks
Block 0: positions 0,1 can see each other
Block 1: positions 2,3 can see each other + block 0
etc.
```
```python
import torch

def sat_mask(n, block=2):
    idx = torch.arange(n)
    grp = idx // block
    # Position i may attend to position j if j is in the same block or an earlier block
    allow = grp[:, None] >= grp[None, :]
    return torch.where(allow, torch.zeros(n, n), torch.full((n, n), float("-inf")))
```
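For example, with the function above, `sat_mask(4, block=2)` lets positions 0 and 1 see only each other, while positions 2 and 3 additionally see block 0:

```python
>>> sat_mask(4, block=2)
tensor([[0., 0., -inf, -inf],
        [0., 0., -inf, -inf],
        [0., 0., 0., 0.],
        [0., 0., 0., 0.]])
```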
### 3.3 Training Loop
Each batch:
```python
# Forward pass 1: AR
h_ar = core(ids, causal_mask(n))
logits_ar = ar_head(h_ar)[:, :-1]
loss_ar = cross_entropy(logits_ar, targets[:, 1:])
# Forward pass 2: SAT
h_sat = core(ids, sat_mask(n))
logits_sat, gate = sat_head(h_sat[:, -SAT_BLOCK:])
loss_sat = cross_entropy(logits_sat, targets[:, 1:SAT_BLOCK+1])
# Optional: gate loss (predict how many tokens to emit)
if gate is not None:
loss_sat += 0.1 * cross_entropy(gate, emit_target)
loss = loss_ar + loss_sat
```
### 3.4 SAT Head with Gating
```python
class SATHead(nn.Module):
    def __init__(self, d, vocab, mode="var"):
        super().__init__()
        self.proj = nn.Linear(d, vocab)  # Token prediction
        self.gate = nn.Linear(d, 2)      # Emit 1 or 2 tokens?
```
The gate predicts whether to emit 1 or 2 tokens during inference, allowing variable-stride speculation.
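A hypothetical decode-time use of the gate (illustrative only, not the `n.py` logic): the gate's argmax selects how many of the speculated tokens to keep.

```python
import torch

gate_logits = torch.tensor([[0.2, 1.5]])  # [batch, 2], as produced by the SAT head
speculated = torch.tensor([[101, 7592]])  # SAT_BLOCK = 2 speculated token ids
stride = int(gate_logits.argmax(dim=-1).item()) + 1  # class 0 -> emit 1 token, class 1 -> emit 2
emitted = speculated[0, :stride]          # tokens actually appended this step
```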
### 3.5 Why Joint Training?
**Hypothesis:** Training both objectives together might:
1. Improve representation quality (multi-task learning)
2. Enable speculative decoding at inference (predict multiple tokens, verify with AR)
3. Learn confidence estimation via the gate
**Current status:** Experimental. No claims of improvement over AR-only.
---
## 4. Training Infrastructure
### 4.1 Data Pipeline
```python
def token_stream(ds_names, target_tokens, seed, ...):
"""
Streaming token generator from HuggingFace datasets.
- Supports multiple comma-separated datasets
- Auto-rotates through sources
- Handles chat format (messages key) or raw text
- Appends EOS tokens
"""
```
Default pretraining sources (from code):
```
OpenTransformer/goddess-crawl
OpenTransformer/agillm-crawl-data
OpenTransformer/web-crawl-2026
OpenTransformer/web-crawl-clean-v2
OpenTransformer/scraped-web-data
OpenTransformer/turbo-crawl
OpenTransformer/sft-data-clean
OpenTransformer/web-crawl-v1
```
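A minimal sketch of what such a streaming generator might look like (assumed tokenizer interface and field names; not the actual `token_stream` implementation):

```python
from datasets import load_dataset

def simple_token_stream(ds_names, tokenizer, eos_id):
    # Iterate sources in turn (the real pipeline auto-rotates), yielding token ids with EOS appended
    for name in ds_names:
        ds = load_dataset(name, split="train", streaming=True)
        for row in ds:
            # Raw text or chat-format rows (a "messages" list of {"role", "content"} dicts)
            text = row.get("text") or " ".join(m["content"] for m in row.get("messages", []))
            if text:
                yield from tokenizer.encode(text) + [eos_id]
```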
### 4.2 Optimizer Configuration
```python
opt = AdamW([
{"params": core.parameters(), "lr": 5e-5}, # LR_CORE
{"params": ar_head.parameters(), "lr": 2e-4}, # LR_HEAD
{"params": sat_head.parameters(), "lr": 2e-4},
])
```
Separate learning rates for core vs heads.
### 4.3 Training Features
- **AMP**: Automatic mixed precision (bf16 if available, else fp16)
- **Gradient clipping**: max_norm=1.0
- **Label smoothing**: 0.1
- **Dropout**: 0.1 in attention
- **Checkpointing**: Configurable interval (default 24h), automatic pruning
### 4.4 Chinchilla Scaling
```python
ratio = 51.2 if args.chilla_max_double else 25
param_count = count_params(core, ar_h, sat_h)
target_tokens = int(ratio * param_count)
```
Default follows ~25Γ Chinchilla ratio; optional 51.2Γ for "double Chinchilla".
For 698M params: ~17.5B tokens default, ~35.7B tokens with double.
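A quick arithmetic check of those budgets (plain Python, no project code assumed):

```python
param_count = 698_000_000
print(f"{25 * param_count:,}")       # 17,450,000,000  -> ~17.5B tokens (default)
print(f"{51.2 * param_count:,.0f}")  # 35,737,600,000  -> ~35.7B tokens (double)
```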
### 4.5 Hot Config
Runtime dataset switching without restart, via `/workspace/hot_config.json`:
```json
{"datasets": ["new_dataset_1", "new_dataset_2"]}
```
Trainer checks this file periodically and switches data sources.
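A minimal sketch of such a periodic check (path and key taken from the snippet above; function name illustrative):

```python
import json, os

def maybe_reload_datasets(current, path="/workspace/hot_config.json"):
    # Return the dataset list from the hot config if present, else keep the current one
    if not os.path.exists(path):
        return current
    with open(path) as f:
        cfg = json.load(f)
    return cfg.get("datasets", current)
```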
### 4.6 Auto-Grow
Optional feature to increase block size during training:
```bash
--auto_grow --grow_plan "576,640,768,896,1024,1122" --grow_every_steps 50000
```
Starts with smaller context, grows as training stabilizes.
---
## 5. Inference
### 5.1 AR Mode (Standard)
```bash
python n.py infer --mode ar --ckpt path/to/ckpt.pt --prompt "Hello"
```
Standard autoregressive generation with KV-cache.
### 5.2 SAT Mode (Speculative)
```bash
python n.py infer --mode sat --ckpt path/to/ckpt.pt --prompt "Hello" --var
```
Generates SAT_BLOCK tokens at once, optionally using gate to choose stride.
### 5.3 Sampling Parameters
| Parameter | AR Default | SAT Default |
|-----------|------------|-------------|
| temperature | 0.7 | 0.5 |
| top_k | 0 | 30 |
| repetition_penalty | 1.3 | 2.0 |
| presence_penalty | 0.0 | 0.6 |
| frequency_penalty | 0.3 | 1.0 |
| penalty_last_n | 128 | 200 |
SAT mode uses more aggressive penalties to avoid repetition from parallel generation.
---
## 6. Weight Tying
Optional embedding-LM head weight tying:
```python
class ARHead(nn.Module):
    def __init__(self, d, vocab, tie_weights=False, embedding_weight=None):
        super().__init__()
        self.proj = nn.Linear(d, vocab, bias=False)
        if tie_weights and embedding_weight is not None:
            self.proj.weight = embedding_weight  # Share weights with the input embedding
```
Reduces parameters by ~vocab × d (significant for large vocab).
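As a rough illustration with a hypothetical 50,304-token vocabulary at d = 1024, tying saves about 51.5M parameters.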
---
## 7. Current Training Status
As of January 2026:
- Step: 2.2M+
- Tokens seen: ~2.4B
- Preset: large (698M params)
- Training on vast.ai 3090
- Checkpoints every 6 hours
---
## 8. Observations and Notes
### 8.1 Expansion Ratio Effects
Early experiments suggest:
- 1x (standard): baseline behavior
- 3x-6x: slight improvement in attention patterns
- 12x+: diminishing returns, increased compute
Not rigorously benchmarked. Observations only.
### 8.2 AR vs AR+SAT
AR-only mode (`--ar_only`) is available for comparison. Joint training requires two forward passes per batch (one per mask), roughly doubling compute relative to AR-only.
### 8.3 Known Issues
1. SAT inference quality lags AR (expected - harder task)
2. Gate accuracy mediocre (often just predicts "emit 2")
3. Memory usage higher than equivalent AR-only model
---
## 9. Code Location
Primary file: `n.py`
Key classes:
- `TuneableAttentionMHA`: The modified attention
- `Block`: Transformer block
- `Encoder`: Full encoder stack
- `ARHead`, `SATHead`: Output heads
- `token_stream`: Data pipeline
- `_train_phase`: Training loop
---
## 10. License and Citation
Code released under MIT license.
If referencing this work:
```
@misc{agillm3,
author = {Bisset, Scott},
title = {AGILLM-3: Tuneable Attention Rank and Joint AR+SAT Training},
year = {2026},
publisher = {OpenTransformers Ltd}
}
```
---
## Appendix A: Full Preset Table
```python
PRESETS = {
"femto_1x": dict(d=16, layers=1, heads=1, rank=16),
"femto_12x": dict(d=16, layers=1, heads=1, rank=192),
"pico_1x": dict(d=32, layers=1, heads=2, rank=16),
"pico_12x": dict(d=32, layers=1, heads=2, rank=192),
"nano_1x": dict(d=64, layers=2, heads=4, rank=16),
"nano_3x": dict(d=64, layers=2, heads=4, rank=48),
"nano_12x": dict(d=64, layers=2, heads=4, rank=192),
"micro_12x": dict(d=128, layers=4, heads=8, rank=192),
"small": dict(d=512, layers=8, heads=16, rank=64),
"base": dict(d=768, layers=12, heads=24, rank=96),
"large": dict(d=1024, layers=24, heads=16, rank=128),
}
```
---
## Appendix B: Example Training Command
```bash
python n.py train \
--preset large \
--batch_size 4 \
--block 1122 \
--amp \
--save_every_sec 21600 \
--save_dir /workspace/ckpts_expansion \
--max_ckpts 5 \
--resume /workspace/ckpts_expansion
```
---
*Documentation current as of January 2026. Code at github.com/OpenTransformer/AGILLM*