
Deriving the Transformer from First Principles

Why This Architecture and No Other

Scott Bisset, Silicon Goddess
OpenTransformers Ltd
January 2026


Abstract

The Transformer architecture is typically presented as a fait accompli—a collection of design choices (attention, layer normalization, residual connections, feedforward blocks) that "work well empirically." This obscures the deeper question: why this architecture? We show that the Transformer can be derived from first principles by starting with four fundamental desiderata for sequence modeling and systematically resolving the constraints they impose. Attention emerges as the unique solution to content-based routing. Residual connections emerge from gradient flow requirements. Normalization emerges from training stability. The feedforward block emerges from expressivity requirements. Using Newton's fluxion notation throughout, we reveal the architecture not as arbitrary engineering but as the natural solution to a well-posed problem.


Part I: The Problem Statement

1. What We Want

1.1 The Sequence Modeling Task

We have:

  • Input: A sequence of tokens X = [x₁, x₂, ..., xₙ]
  • Output: A sequence of representations Y = [y₁, y₂, ..., yₙ]

Each output yᵢ should encode:

  1. The content of xᵢ
  2. Relevant context from other positions
  3. The position i itself

1.2 Four Fundamental Desiderata

D1. PARALLELISM: All positions should be computable simultaneously. Unlike RNNs, we cannot afford O(n) sequential steps.

D2. VARIABLE CONTEXT: Each position should dynamically select which other positions are relevant. Unlike CNNs, we cannot afford fixed receptive fields.

D3. TRAINABILITY: Gradients must flow from output to input without vanishing or exploding. Deep networks must remain trainable.

D4. EXPRESSIVITY: The function class must be rich enough to approximate arbitrary sequence-to-sequence mappings. We need universal approximation.

1.3 The Derivation Strategy

We will show that each component of the Transformer is the MINIMAL solution to these desiderata:

| Desideratum | Implies Component |
| --- | --- |
| D1 (Parallel) + D2 (Variable) | Self-Attention |
| D3 (Trainable) | Residual Connections |
| D3 (Trainable) | Layer Normalization |
| D4 (Expressive) | Feedforward Block |

Part II: Deriving Attention

2. The Routing Problem

2.1 What We Need

Each position i must:

  1. Query the sequence: "What information do I need?"
  2. Receive information from relevant positions
  3. Aggregate that information into its representation

2.2 Constraint: Parallelism (D1)

The routing mechanism must be expressible as matrix operations. No sequential dependencies between positions.

2.3 Constraint: Content-Based (D2)

The routing must depend on CONTENT, not just position. Position 5 might need position 2 in one sentence, position 7 in another.

2.4 The Derivation

Step 1: Represent the query.

Position i needs to express "what I'm looking for." Simplest parameterization: linear projection.

qᵢ = Wq · xᵢ

Step 2: Represent what each position offers.

Position j needs to express "what I have." Simplest parameterization: linear projection.

kⱼ = Wk · xⱼ

Step 3: Measure compatibility.

How well does position i's query match position j's key? Simplest symmetric measure: dot product.

sᵢⱼ = qᵢ · kⱼᵀ

Step 4: Convert to routing weights.

We need weights that:

  • Sum to 1 (conservation of information)
  • Are non-negative (no "negative information")
  • Are differentiable (for gradient flow)

Unique solution: softmax.

aᵢⱼ = softmax(sᵢⱼ) = exp(sᵢⱼ) / Σₖ exp(sᵢₖ)

Step 5: Aggregate information.

Position j's contribution to position i: Weighted by aᵢⱼ, content is a linear projection of xⱼ.

vⱼ = Wv · xⱼ
yᵢ = Σⱼ aᵢⱼ · vⱼ

2.5 We Have Derived Attention

Q = X · Wq
K = X · Wk  
V = X · Wv
Y = softmax(QKᵀ/√d) · V

This is the UNIQUE solution to:

  • Parallel computation (matrix operations)
  • Content-based routing (Q-K compatibility)
  • Differentiable (softmax)
  • Information conservation (weights sum to 1)
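The five steps above can be sketched directly in NumPy. This is a minimal illustrative single-head implementation; the weight shapes, random initialization, and the `softmax` helper are assumptions for the sketch, not part of the derivation:

```python
import numpy as np

def softmax(s, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention, following Steps 1-5."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # Steps 1, 2, 5: linear projections
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)             # Step 3: compatibility scores
    A = softmax(S, axis=-1)              # Step 4: routing weights, each row sums to 1
    return A @ V                         # Step 5: aggregate values

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Y = attention(X, Wq, Wk, Wv)
```

Each row of the softmax output is a probability distribution over positions — the "conservation of information" property from Step 4.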

2.6 The √d Scaling

Why divide by √d?

Fluxion analysis of softmax:

If the components of qᵢ and kⱼ are i.i.d. with mean 0 and variance 1,
then sᵢⱼ = qᵢ · kⱼᵀ is a sum of d independent products, so Var(sᵢⱼ) = d and sᵢⱼ ~ N(0, d) approximately.

Large variance → softmax saturates → gradients vanish.

L̇ˢᵢⱼ = aᵢⱼ · (L̇ᴬᵢⱼ - Σₖ aᵢₖ · L̇ᴬᵢₖ)

When A is nearly one-hot (saturated softmax), L̇ˢ ≈ 0.

Solution: Scale by √d to maintain unit variance.

sᵢⱼ = qᵢ · kⱼᵀ / √d ~ N(0, 1)

The scaling factor is not arbitrary—it's REQUIRED for gradient flow.
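A quick numerical check of the variance argument (an illustrative sketch, not from the original text — trial counts and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
trials = 20000

# Components of q and k drawn i.i.d. from N(0, 1), as in the analysis above.
q = rng.standard_normal((trials, d))
k = rng.standard_normal((trials, d))

raw = (q * k).sum(axis=1)       # unscaled dot products: variance grows with d
scaled = raw / np.sqrt(d)       # with the 1/sqrt(d) factor: unit variance

print(raw.var())                # ~ d
print(scaled.var())             # ~ 1
```

Without the scaling, scores of magnitude √d ≈ 23 push softmax deep into saturation; with it, scores stay O(1).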


3. Multi-Head Attention

3.1 The Limitation

Single attention head = single routing pattern. But different "types" of relevance exist:

  • Syntactic (subject-verb agreement)
  • Semantic (entity-attribute)
  • Positional (adjacent tokens)

3.2 The Solution: Parallel Heads

Run H independent attention mechanisms:

headₕ = Attention(X·Wqₕ, X·Wkₕ, X·Wvₕ)

Concatenate and project:

MultiHead(X) = Concat(head₁, ..., head_H) · Wₒ

3.3 Why This Works

Each head can specialize in different routing patterns. The output projection Wₒ learns to combine them.

3.4 Fluxion Analysis

Gradient flows through all heads in parallel:

L̇ˣ = Σₕ L̇ʰᵉᵃᵈₕ

No bottleneck—each head contributes independently to gradient.
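The head structure can be sketched as follows (a minimal NumPy illustration; splitting the projection matrices column-wise into heads of width d/H is one common convention, assumed here):

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X, Wq, Wk, Wv, Wo, H):
    """H independent attention heads, concatenated and mixed by Wo.

    Wq, Wk, Wv are (d, d), split column-wise into H heads of width d // H.
    """
    n, d = X.shape
    dh = d // H
    heads = []
    for h in range(H):
        sl = slice(h * dh, (h + 1) * dh)
        Q, K, V = X @ Wq[:, sl], X @ Wk[:, sl], X @ Wv[:, sl]
        A = softmax(Q @ K.T / np.sqrt(dh))   # each head has its own routing pattern
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(1)
n, d, H = 4, 16, 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) for _ in range(4))
Y = multi_head(X, Wq, Wk, Wv, Wo, H)
```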


Part III: Deriving Residual Connections

4. The Gradient Flow Problem

4.1 Deep Networks Fail

Consider L layers without residuals:

Y = fₗ(fₗ₋₁(...f₁(X)))

Gradient flow:

L̇ˣ = ∂fₗ/∂x · ∂fₗ₋₁/∂x · ... · ∂f₁/∂x · L̇ʸ

Product of L Jacobians. If each has norm slightly ≠ 1:

‖L̇ˣ‖ ~ ‖J‖ᴸ → 0 or ∞

Gradients vanish or explode exponentially with depth.
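The ‖J‖ᴸ effect can be seen numerically (an illustrative sketch: random Jacobians whose norms sit slightly below or above 1, multiplied over 50 layers):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 32, 50

norms = {}
for gain in (0.9, 1.1):
    # Product of L random layer Jacobians; gain controls whether each
    # factor's norm is slightly below or slightly above 1.
    J = np.eye(d)
    for _ in range(L):
        layer = rng.standard_normal((d, d)) * (gain / np.sqrt(d))
        J = layer @ J
    norms[gain] = np.linalg.norm(J)

print(norms[0.9])   # shrinks toward 0: vanishing gradients
print(norms[1.1])   # blows up: exploding gradients
```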

4.2 The Residual Solution

Add skip connections:

Y = X + f(X)

Gradient flow:

L̇ˣ = L̇ʸ + L̇ʸ · ∂f/∂x
     ↑
     Direct path!

Even if ∂f/∂x → 0, gradient flows through the identity.
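Repeating the previous depth experiment with residual blocks shows the direct path at work (a sketch: the branch Jacobian ∂f/∂x is made tiny, yet the end-to-end Jacobian stays near the identity):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 32, 50

# Each block's Jacobian is I + J_f; take J_f nearly zero (f(X) contributes little).
J = np.eye(d)
for _ in range(L):
    Jf = rng.standard_normal((d, d)) * 0.001   # near-vanishing branch Jacobian
    J = (np.eye(d) + Jf) @ J                   # residual: identity + branch

svals = np.linalg.svd(J, compute_uv=False)
print(svals.min(), svals.max())   # both near 1: no vanishing, no exploding
```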

4.3 Why Addition Specifically?

Alternatives considered:

Concatenation: Y = [X, f(X)]

  • Doubles dimension each layer
  • Not sustainable for deep networks

Multiplication: Y = X ⊙ f(X)

  • Gradient: L̇ˣ = L̇ʸ ⊙ f(X) + L̇ʸ ⊙ X · ∂f/∂x
  • If f(X) → 0, gradient vanishes
  • No direct path

Gating: Y = g(X) ⊙ X + (1-g(X)) ⊙ f(X)

  • Works (LSTM, GRU)
  • More parameters, more complexity
  • Addition is the minimal solution

4.4 Residual = Gradient Highway

            ┌──────────────────┐
            │   Direct path    │
            │   (identity)     │
     ┌──────┴──────┐    ┌──────┴──────┐
X ───┤             ├────┤             ├───→ Y
     │   f(X)      │    │   + (add)   │
     └─────────────┘    └─────────────┘
            │                  │
            │   Through f      │
            └──────────────────┘

Gradient can flow through EITHER path. Network can choose (via learning) which path to use.

4.5 Initialization Implication

At initialization, we want f(X) ≈ 0 so Y ≈ X.

This means deep network at init ≈ identity function. Stable starting point for optimization.

This is why GPT-style models often use:

output = x + scale * Attention(x)

with scale initialized small.


Part IV: Deriving Normalization

5. The Scale Problem

5.1 Without Normalization

Each layer's output can have arbitrary scale:

f₁(X) might have ‖output‖ ~ 100
f₂(input) might expect ‖input‖ ~ 1

Scale mismatch causes:

  • Attention softmax saturation
  • Activation function saturation
  • Gradient instability

5.2 The Constraint

We need: Consistent statistics at each layer's input.

5.3 Options

BatchNorm: Normalize across batch

  • Problem: Batch statistics unreliable at inference
  • Problem: Doesn't work for sequence models (each position needs different batch items)

LayerNorm: Normalize across features (per token)

  • No batch dependence
  • Each token normalized independently
  • Works at any batch size

5.4 Deriving LayerNorm

Requirement 1: Zero mean (center the distribution)

x̂ = x - μ   where μ = mean(x)

Requirement 2: Unit variance (control scale)

x̂ = (x - μ) / σ   where σ = std(x)

Requirement 3: Learnable scale/shift (restore expressivity)

y = γ · x̂ + β

Without γ and β, normalization constrains the representation. With them, the network can learn to undo normalization if needed.
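The three requirements compose into a few lines (a minimal NumPy sketch; the epsilon in the denominator is a standard numerical-stability assumption, not part of the derivation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Requirements 1-3: center, scale to unit variance, learnable affine."""
    mu = x.mean(axis=-1, keepdims=True)        # Requirement 1: zero mean
    sigma = x.std(axis=-1, keepdims=True)      # Requirement 2: unit variance
    x_hat = (x - mu) / (sigma + eps)
    return gamma * x_hat + beta                # Requirement 3: restore expressivity

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)) * 100 + 7     # arbitrary scale and offset per token
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=-1), y.std(axis=-1))         # each token: mean ~ 0, std ~ 1
```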

5.5 Fluxion Analysis: Why Normalization Stabilizes Training

Jacobian of LayerNorm:

∂y/∂x = (diag(γ)/σ) · (I - (1/d)·1·1ᵀ - (1/d)·x̂·x̂ᵀ)

The bracketed matrix is an orthogonal projection (1 ⊥ x̂ because x̂ has zero mean, and ‖x̂‖² = d), so its singular values are all 0 or 1. Hence the Jacobian has bounded singular values:

σₘₐₓ(∂y/∂x) ≤ maxᵢ|γᵢ| / σ
σₘᵢₙ(∂y/∂x) ≥ 0

Key insight: Normalization bounds the Jacobian spectrum. No single direction can have arbitrarily large gradient.
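This bound can be verified numerically with a finite-difference Jacobian (a sketch with γ = 1, β = 0; the step size is an assumption of the check):

```python
import numpy as np

def ln(x):
    # LayerNorm core with gamma = 1, beta = 0.
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal(d) * 3.0
sigma = x.std()

# Central-difference approximation of the Jacobian, column by column.
eps = 1e-5
J = np.zeros((d, d))
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    J[:, j] = (ln(x + e) - ln(x - e)) / (2 * eps)

svals = np.linalg.svd(J, compute_uv=False)
print(svals.max(), 1 / sigma)   # largest singular value vs the 1/sigma bound
```

The smallest singular values are (numerically) zero — those are the two projected-out directions, 1 and x̂.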

5.6 Pre-Norm vs Post-Norm

Post-Norm (original Transformer):

Y = LayerNorm(X + Attention(X))

Gradient must pass through LayerNorm.

Pre-Norm (modern default):

Y = X + Attention(LayerNorm(X))

Gradient has direct path bypassing LayerNorm.

Pre-Norm is more stable for very deep networks.


Part V: Deriving the Feedforward Block

6. The Expressivity Problem

6.1 Attention Is Not Enough

Self-attention is:

  • Linear in V (weighted sum)
  • Nonlinear only in routing (softmax)

Without a feedforward block, the network is nearly linear:

Attention(X) = softmax(XWqWkᵀXᵀ/√d) · XWv

The Wv projection is linear. For fixed attention weights, output is linear in X.

6.2 Universal Approximation Requirement (D4)

We need to approximate arbitrary functions. Attention provides dynamic routing but limited transformation.

6.3 The MLP Solution

Add a position-wise feedforward network:

FFN(x) = W₂ · σ(W₁ · x + b₁) + b₂

Why this structure?

Step 1: Project to higher dimension.

h = W₁ · x     (d → 4d typically)

Creates "features" the network can work with.

Step 2: Apply nonlinearity.

h = σ(h)       (ReLU, GELU, SiLU, etc.)

Breaks linearity. Essential for universal approximation.

Step 3: Project back.

y = W₂ · h     (4d → d)

Compress back to model dimension.
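The three steps in code (an illustrative NumPy sketch; the tanh approximation of GELU and the 0.1 weight scale are common-practice assumptions):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, widely used in practice.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise feedforward: expand (d -> 4d), nonlinearity, contract (4d -> d)."""
    h = x @ W1 + b1        # Step 1: project to higher dimension
    h = gelu(h)            # Step 2: break linearity
    return h @ W2 + b2     # Step 3: project back to model dimension

rng = np.random.default_rng(0)
d = 16
x = rng.standard_normal((4, d))
W1, b1 = rng.standard_normal((d, 4 * d)) * 0.1, np.zeros(4 * d)
W2, b2 = rng.standard_normal((4 * d, d)) * 0.1, np.zeros(d)
y = ffn(x, W1, b1, W2, b2)
```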

6.4 Why 4x Expansion?

Empirical finding: 4x expansion ratio works well.

Theoretical justification:

  • More expansion = more expressivity per layer
  • Less expansion = more parameters in attention
  • 4x is a sweet spot for compute/parameter balance

6.5 Fluxion Analysis

L̇ˣ = W₁ᵀ · (L̇ʰ ⊙ σ̇(h))
L̇ʷ¹ = (L̇ʰ ⊙ σ̇(h)) · xᵀ
L̇ʷ² = L̇ʸ · hᵀ

Gradient flows through:

  1. σ̇(h): The activation derivative
  2. W₁, W₂: The projections

Dead neurons (ReLU): If h < 0, σ̇(h) = 0, no gradient flows. Solution: GELU/SiLU have non-zero gradient everywhere.
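The dead-neuron contrast is easy to see numerically (a sketch; the GELU derivative is approximated by finite differences on the tanh form):

```python
import numpy as np

def relu_grad(h):
    # ReLU derivative: exactly zero for all negative pre-activations.
    return (h > 0).astype(float)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gelu_grad(h, eps=1e-6):
    # Central-difference approximation of GELU's derivative.
    return (gelu(h + eps) - gelu(h - eps)) / (2 * eps)

h = np.array([-3.0, -1.0, -0.1])   # negative pre-activations
print(relu_grad(h))                # all zeros: the neuron is "dead"
print(gelu_grad(h))                # small but non-zero: gradient still flows
```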


Part VI: Putting It Together

7. The Complete Transformer Block

7.1 The Architecture

Input: X

# Attention sub-block
X₁ = X + Attention(LayerNorm(X))

# Feedforward sub-block  
X₂ = X₁ + FFN(LayerNorm(X₁))

Output: X₂
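The two sub-blocks compose into a complete pre-norm block (a self-contained NumPy sketch combining the earlier pieces; weight scales and the single-sequence shapes are illustrative assumptions):

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Core normalization (gamma = 1, beta = 0 for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(X, Wq, Wk, Wv, Wo, W1, W2):
    """Pre-norm block: X1 = X + Attn(LN(X)); X2 = X1 + FFN(LN(X1))."""
    # Attention sub-block
    Xn = layer_norm(X)
    Q, K, V = Xn @ Wq, Xn @ Wk, Xn @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    X1 = X + (A @ V) @ Wo
    # Feedforward sub-block
    X2 = X1 + gelu(layer_norm(X1) @ W1) @ W2
    return X2

rng = np.random.default_rng(0)
n, d = 4, 16
X = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1
Y = transformer_block(X, Wq, Wk, Wv, Wo, W1, W2)
```

Stacking L such blocks gives the full Transformer.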

7.2 Why This Order?

LayerNorm → Attention → Residual → LayerNorm → FFN → Residual

Each component addresses a specific desideratum:

LayerNorm(X)           # Stabilize input scale (D3)
    ↓
Attention(·)           # Content-based routing (D1, D2)
    ↓
X + ·                  # Gradient highway (D3)
    ↓
LayerNorm(·)           # Stabilize again (D3)
    ↓
FFN(·)                 # Nonlinear transformation (D4)
    ↓
· + ·                  # Gradient highway (D3)

7.3 The Complete Forward Flow

For each block l = 1 to L:
    # Attention
    X̃ = LayerNorm(X)
    Q, K, V = X̃ · Wq, X̃ · Wk, X̃ · Wv
    A = softmax(QKᵀ/√d)
    X = X + A·V·Wₒ
    
    # FFN
    H = GELU(LayerNorm(X) · W₁)
    X = X + H · W₂

7.4 The Complete Backward Flow (Fluxions)

For each block l = L down to 1:
    # FFN backward
    L̇ᴴ = L̇ˣ · W₂ᵀ
    L̇ˣ = L̇ˣ + LayerNorm_backward((L̇ᴴ ⊙ GELU'(LayerNorm(X)·W₁)) · W₁ᵀ)
    
    # Attention backward
    G  = L̇ˣ · Wₒᵀ
    L̇ⱽ = Aᵀ · G
    L̇ᴬ = G · Vᵀ
    L̇ˢ = softmax_backward(L̇ᴬ)
    L̇Q = L̇ˢ · K / √d
    L̇K = L̇ˢᵀ · Q / √d
    L̇ˣ = L̇ˣ + LayerNorm_backward(L̇Q·Wqᵀ + L̇K·Wkᵀ + L̇ⱽ·Wvᵀ)

The key: L̇ˣ = L̇ˣ + ... at each step. Gradient accumulates through residual highways.


8. Why No Other Architecture?

8.1 Could We Remove Anything?

Remove attention:

  • Lose content-based routing (D2 violated)
  • Reduce to position-wise MLP

Remove residuals:

  • Gradient vanishing in deep networks (D3 violated)
  • Training becomes impossible past ~6 layers

Remove normalization:

  • Scale explosion/collapse (D3 violated)
  • Training unstable

Remove FFN:

  • Nearly linear network (D4 violated)
  • Cannot approximate complex functions

8.2 Could We Add Anything?

More attention per block:

  • Diminishing returns
  • Compute better spent on more blocks

Recurrence:

  • Violates parallelism (D1)
  • Slower training

Convolution:

  • Fixed receptive field violates D2
  • Attention subsumes convolution anyway

8.3 The Transformer Is Minimal

Each component is:

  1. Necessary (removing it violates a desideratum)
  2. Sufficient (adding more doesn't help much)
  3. Minimal (simplest form that works)

The architecture is not arbitrary—it's the unique minimal solution to the desiderata.


Part VII: Emergent Properties

9. Properties We Didn't Design For

9.1 In-Context Learning

We designed for sequence modeling. We got: ability to learn new tasks from examples in the prompt.

Why? Attention can route information from examples to queries. The network learns to "match patterns" dynamically.

9.2 Compositional Generalization

We designed for fixed-length sequences. We got: ability to compose learned concepts in new ways.

Why? Attention is content-based, not position-based. Learned Q-K patterns transfer to new positions.

9.3 Scaling Laws

We designed for expressivity. We got: predictable performance improvement with scale.

Why? More parameters = more capacity for Q-K-V patterns. Residuals ensure gradient flow even at huge depth. Loss decreases smoothly with compute.


10. The Fluxion Perspective: Computation as Flow

10.1 Forward Pass = Information Flow

Input embeddings → 
    Attention routes information between positions →
    FFN transforms information at each position →
    Output representations

Information FLOWS from input to output, dynamically routed by attention.

10.2 Backward Pass = Sensitivity Flow

Output gradients →
    FFN backward: which transformations mattered →
    Attention backward: which routes mattered →
    Input gradients

Sensitivity FLOWS from output to input, through the same routes.

10.3 Training = Shaping the Flow

Gradient descent adjusts:
    - Wq, Wk: Which routes to create
    - Wv, Wₒ: What to send through routes
    - W₁, W₂: How to transform at each position

Training shapes the flow patterns to minimize loss.

10.4 The Trained Network = A Flow System

A trained Transformer is a physical system where:

  • Tokens are sources of information
  • Attention creates dynamic channels
  • Information flows to where it's needed
  • Gradients reveal which flows matter

This is not metaphor—it's the literal computation.


Part VIII: Implications

11. For Architecture Design

11.1 Principled Modifications

To improve Transformers, we can:

  1. Better attention: Flash Attention (same math, better memory access)
  2. Better normalization: RMSNorm (simpler, equally effective)
  3. Better FFN: GLU variants (gated linear units, smoother gradients)
  4. Better positional encoding: RoPE (relative position in dot product)

Each modification preserves the core derivation while improving implementation.

11.2 What NOT to Do

Modifications that violate desiderata will fail:

  • Removing residuals (even "simplifying" them)
  • Making attention non-differentiable
  • Removing all nonlinearity

11.3 Scaling Strategy

The derivation suggests:

  • Scale depth (more blocks) with residual highways
  • Scale width (larger d) with normalization
  • Scale heads (more attention patterns) with parallel computation

All three maintain the core structure.


12. For Understanding Intelligence

12.1 The Transformer Didn't Come from Nowhere

We wanted:

  • Parallel computation
  • Dynamic routing
  • Trainable depth
  • Expressivity

We got the Transformer because it's the UNIQUE solution.

12.2 Could Biological Brains Be Similar?

Brains face similar constraints:

  • Parallel processing (neurons compute simultaneously)
  • Content-based routing (association, not fixed wiring)
  • Deep processing (many layers of abstraction)
  • Universal learning (arbitrary input-output mappings)

Perhaps attention-like mechanisms are convergent—any system solving these constraints discovers something similar.

12.3 Why Language Models Work

Language requires:

  • Variable-length context
  • Content-based relevance
  • Compositional meaning
  • Deep abstraction

These are EXACTLY the desiderata we started with. The Transformer is the natural architecture for language.


13. Conclusion

13.1 What We Showed

The Transformer architecture can be DERIVED, not just presented:

  1. Attention emerges from parallel + content-based routing
  2. Residuals emerge from gradient flow requirements
  3. Normalization emerges from scale stability
  4. FFN emerges from expressivity requirements

13.2 The Deeper Point

Good architectures aren't arbitrary collections of tricks. They're solutions to well-posed problems.

The Transformer solves:

"How do we build a parallel, dynamic, trainable, expressive sequence model?"

Understanding WHY it works lets us:

  • Modify it in a principled way
  • Scale it correctly
  • Know what NOT to change

13.3 The Fluxion Contribution

Newton's notation reveals the architecture as a FLOW SYSTEM:

  • Forward: information flows
  • Backward: sensitivity flows
  • Training: shaping flows

This isn't just pedagogy—it's the right way to think about neural computation.


References

  1. Vaswani et al. (2017). "Attention Is All You Need."
  2. He et al. (2016). "Deep Residual Learning for Image Recognition."
  3. Ba et al. (2016). "Layer Normalization."
  4. Cybenko (1989). "Approximation by Superpositions of a Sigmoidal Function."
  5. Newton, I. (1736). The Method of Fluxions.

Appendix A: Summary of Derivation

DESIDERATA:
D1. Parallelism         → Matrix operations
D2. Variable context    → Content-based routing
D3. Trainability        → Gradient highways + normalization
D4. Expressivity        → Nonlinear transformations

DERIVATION:
D1 + D2 → QKᵀ compatibility → softmax → weighted V sum = ATTENTION
D3 (gradient) → Y = X + f(X) = RESIDUAL CONNECTION
D3 (scale) → (X - μ)/σ · γ + β = LAYER NORMALIZATION
D4 → W₂ · σ(W₁ · x) = FEEDFORWARD BLOCK

COMPOSITION:
X → LN → Attention → +X → LN → FFN → +X = TRANSFORMER BLOCK
Stack L blocks = TRANSFORMER

Appendix B: The Four Desiderata as Constraints

| Desideratum | Constraint | Solution | Alternative | Why Alternative Fails |
| --- | --- | --- | --- | --- |
| D1: Parallel | O(1) depth | Matrix ops | RNN | O(n) sequential |
| D2: Dynamic | Content-based | Q·K similarity | CNN | Fixed receptive field |
| D3: Trainable | Gradient flows | Residual + Norm | None | Vanishing/exploding |
| D4: Expressive | Universal approx | MLP | Linear | Can't approximate |

Correspondence: scott@opentransformers.online


Word count: ~4,500
Time to write: One flow state afternoon
Notation: Pure Newtonian fluxions
Ambition level: Textbook-grade derivation from first principles