# Architecture Guide

## Overview

AGIFORMER implements a novel hybrid architecture combining byte-level processing, linear attention, and iterative reasoning.

## Pipeline Flow

```
Input Bytes
    ↓
ByteLatentEncoder (with RoPE)
    ↓
HybridBlock × N (Linear Attention + Sliding Window)
    ↓
RecurrentReasoningBlock (System 2 - 3 steps)
    ↓
LocalAutoregressiveHead (GRU-based decoder)
    ↓
Output Bytes
```

---

## 1. ByteLatentEncoder

**File:** `src/models/encoder.py`

### Purpose

Converts raw byte sequences into latent patches with positional information.

### Architecture

- **Input:** `(Batch, Seq_Len)` bytes (0-255)
- **Embedding:** `nn.Embedding(256, d_model)`
- **Patching:** Reshape to `(Batch, Num_Patches, Patch_Size, d_model)`
- **RoPE:** Rotary Positional Embeddings for length generalization
- **Projection:** Linear layer to the final latent dimension
- **Output:** `(Batch, Num_Patches, d_model)`

### Key Design Decisions

- **Why RoPE?** Enables extrapolation to sequences longer than those seen during training.
- **Why Patching?** Reduces sequence length by a factor of `patch_size` (default: 4).

---

## 2. HybridBlock

**File:** `src/models/layers.py`

### Components

#### 2.1 LinearAttention

**Complexity:** $O(N)$ instead of $O(N^2)$

**Formula:**

```
Q = elu(Wq @ x) + 1.0 + ε
K = elu(Wk @ x) + 1.0 + ε
V = Wv @ x
Attention(Q, K, V) = (Q @ cumsum(K ⊗ V)) / (Q @ cumsum(K) + ε)
```

**Stability Fixes:**

- `elu(x) + 1.0 + 1e-4` ensures strict positivity (prevents division by zero)
- `Q` scaled by `sqrt(head_dim)` to control magnitude
- Layer norm on the output

#### 2.2 SlidingWindowAttention

**Complexity:** $O(N \times W)$, where $W$ is `window_size`

**Implementation:**

```python
scores = (Q @ K.T) / sqrt(d_k)
mask = causal_mask | window_mask         # blocks tokens outside the window
scores = scores.masked_fill(mask, -1e4)  # safe masking (avoids -inf NaNs)
attn = softmax(scores)
out = attn @ V
```

**Why Manual?** PyTorch's `scaled_dot_product_attention` was unstable with custom masks.

### Fusion
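As a rough illustration, here is a minimal sketch of how the two attention paths might be fused, assuming a learned sigmoid gate that mixes them before a shared residual connection. The class name `GatedFusion` and the gate layout are illustrative assumptions, not the exact contents of `src/models/layers.py`.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Mixes the linear-attention and sliding-window outputs with a learned gate."""

    def __init__(self, d_model: int):
        super().__init__()
        # Gate sees both paths and decides, per dimension, how to blend them.
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor, lin_out: torch.Tensor, win_out: torch.Tensor) -> torch.Tensor:
        # x, lin_out, win_out: (batch, seq, d_model)
        g = torch.sigmoid(self.gate(torch.cat([lin_out, win_out], dim=-1)))
        fused = g * lin_out + (1.0 - g) * win_out
        return x + fused  # residual connection around the fused attention
```

A plain sum or a concatenation followed by a projection would serve the same purpose; the key point is that both attention paths feed a single residual stream.

---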
## 3. RecurrentReasoningBlock

### Purpose

Iteratively refines the patch latents over a fixed number of "thinking" steps (System 2, default: 3).

### Key Design

- **Residual Connection:** Allows the model to skip thinking when it is not needed
- **Pre-Norm:** Stabilizes deep iteration

### Measured Activity

- **Latent Change:** Δz = 12.7 (Euclidean distance)
- **Gate Bias:** -0.0065 (near neutral)
- **Interpretation:** The model actively refines latents by ~56% per dimension

---

## 4. LocalAutoregressiveHead

**File:** `src/models/agiformer.py`

### Purpose

Decodes latent patches into byte sequences autoregressively.

### Architecture

#### Training Mode

```python
# Teacher forcing
inputs  = [SOS, target[0], target[1], ..., target[P-2]]
targets = [target[0], target[1], ..., target[P-1]]

emb     = ByteEmb(inputs)                 # (B*N, P, H)
context = LatentProj(latent).expand()     # (B*N, P, H)
rnn_in  = concat([emb, context], dim=-1)  # (B*N, P, 2H)
out, _  = GRU(rnn_in)
logits  = Linear(out)                     # (B*N, P, 256)
```

#### Inference Mode

```python
current = SOS
hidden = None
for i in range(patch_size):
    emb = ByteEmb(current)
    rnn_in = concat([emb, latent_context], dim=-1)
    out, hidden = GRU(rnn_in, hidden)
    logit = Linear(out)

    # Sampling
    if temperature > 0:
        next_byte = multinomial(softmax(logit / temperature))
    else:
        next_byte = argmax(logit)
    current = next_byte
```

### Key Design

- **Concatenation (not Addition):** Preserves signal strength
- **GRU State:** Carries information across steps within a patch
- **Temperature Sampling:** Breaks repetition loops

---

## Loss Function

**Training:** Cross-entropy on next-patch prediction

```python
loss = CrossEntropy(logits, targets)
BPC = loss / ln(2)  # bits per character
```

A minimal PyTorch sketch of this conversion appears at the end of this guide.

**Metric:** BPC (Bits Per Character) - lower is better

- Random baseline: 8.0 BPC
- Good model: < 1.5 BPC
- AGIFORMER: 2.26 BPC (undertrained but stable)

---

## Hyperparameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `d_model` | 512 | Balance capacity/speed |
| `n_layers` | 6 | Deep enough for complexity |
| `num_heads` | 8 | Standard for 512-D |
| `patch_size` | 4 | 4× compression |
| `window_size` | 128 | Local attention context |
| `thinking_steps` | 3 | System 2 iterations |
| `learning_rate` | 3e-4 | With warmup |
| `batch_size` | 4 | GPU memory limit |

---

## Numerical Stability

### Challenges & Solutions

1. **Linear Attention Division by Zero**
   - **Problem:** `elu(x) + 1.0` can approach 0 when x is very negative
   - **Solution:** `elu(x) + 1.0 + 1e-4` (strict positivity)

2. **SDPA Masking Instability**
   - **Problem:** NaN in `scaled_dot_product_attention` with bool masks
   - **Solution:** Manual attention with `-1e4` instead of `-inf`

3. **System 2 Explosion**
   - **Problem:** Iterative updates could amplify errors
   - **Solution:** Gated residuals + pre-norm + small init

4. **Gradient Clipping**
   - **Value:** 0.5 (aggressive)
   - **Reason:** Prevents spikes during early training

---

## Memory & Compute

**Training (Batch=4, Seq=1024):**

- GPU Memory: ~6 GB (T4)
- Time/Step: ~180 ms
- Total for 5000 steps: ~15 min

**Inference (Seq=200):**

- Latency: ~50 ms (greedy)
- Memory: ~2 GB

**Scaling:**

- Linear Attention: $O(N)$ time
- System 2: $O(k \times N)$, where $k$ = `thinking_steps`

---

## Comparison to Baselines

| Feature | AGIFORMER | GPT-2 | Mamba |
|---------|-----------|-------|-------|
| Tokenization | None (bytes) | BPE | BPE |
| Attention | Linear ($O(N)$) | Quadratic | N/A |
| Recurrence | System 2 Loop | None | SSM |
| BPC (enwik8) | 2.26 | ~1.1 | ~1.0 |
| Training Time | 15 min | Hours | Hours |

**Note:** The BPC gap is due to an undertrained model, not an architectural limit.

---

## Future Improvements

1. **Longer Training:** Target BPC < 1.5
2. **More Thinking Steps:** 3 → 5-7 for harder tasks
3. **Sparse Experts:** Route different "thinking modes"
4. **Memory Module:** External differentiable memory
5. **Multi-Modal:** Extend to image/audio bytes

---

## References

- **Linear Transformers:** Katharopoulos et al., 2020
- **RoPE:** Su et al., 2021
- **System 2 Deep Learning:** Bengio et al., 2019
- **Mamba:** Gu & Dao, 2023
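---

## Appendix: BPC Computation Sketch

For concreteness, a minimal sketch of the BPC conversion referenced in the Loss Function section, assuming flattened per-byte logits over 256 classes; the function name and tensor shapes are illustrative, not the project's actual training code.

```python
import math

import torch
import torch.nn.functional as F


def bits_per_character(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Convert natural-log cross-entropy into bits per character (byte)."""
    # logits:  (T, 256) unnormalized byte scores, T = B * N * P flattened positions
    # targets: (T,)     ground-truth byte ids in [0, 255]
    loss = F.cross_entropy(logits, targets)  # mean cross-entropy in nats
    return loss.item() / math.log(2)         # nats → bits, i.e. loss / ln(2)
```

A uniform distribution over 256 byte values gives the 8.0 BPC random baseline quoted above.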