Transformer X-Ray: Attention Commitment Depth Across 6 Architectures
Overview
In my previous post I showed how attention geometry changes between correct and wrong predictions in Qwen2.5-7B. That was one model. This post scales the analysis to six models (two standard transformers, a sliding-window hybrid, a liquid MoE, and two sizes of the same family) using a direct llama.cpp integration that captures raw kq_soft_max tensors during inference.
The central question: does architecture or model size determine how deeply a model processes context before committing to an answer?
The answer is architecture. By a large margin.
Method
Tensor extraction via llama.cpp callback
llama.cpp exposes cb_eval, a per-tensor callback that fires as operations in the compute graph are evaluated. Flash Attention fuses the softmax into a single kernel, so the attention weights never exist as a separate tensor. Disabling it (LLAMA_FLASH_ATTN_TYPE_DISABLED) materializes the intermediate kq_soft_max-{layer} tensors so they can be captured:
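```cpp
// Disable Flash Attention so the softmax output exists as its own tensor.
ctx_params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_DISABLED;
// Register the capture callback; it is invoked per tensor during evaluation.
ctx_params.cb_eval = tensor_capture_callback;
```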
Each captured tensor has shape [n_kv, n_tokens, n_head], the full softmax attention weight matrix for one layer. The tensors are written to binary files, then parsed in Rust for analysis.
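To make the [n_kv, n_tokens, n_head] layout concrete, here is a minimal Python sketch of unpacking one captured tensor. It assumes a headerless dump of little-endian float32 values with n_kv as the fastest-varying dimension (the first dimension is contiguous in ggml). The function name `unpack_attention` and the headerless file layout are illustrative assumptions, not the format extract_tensors.cpp actually writes.

```python
import struct

def unpack_attention(raw: bytes, n_kv: int, n_tokens: int, n_head: int):
    """Return attn[head][token] as a list of n_kv softmax weights.

    Assumes `raw` holds n_kv * n_tokens * n_head little-endian float32
    values, with n_kv varying fastest (hypothetical headerless layout).
    """
    count = n_kv * n_tokens * n_head
    if len(raw) != 4 * count:
        raise ValueError(f"expected {4 * count} bytes, got {len(raw)}")
    flat = struct.unpack(f"<{count}f", raw)
    attn = []
    for h in range(n_head):
        rows = []
        for t in range(n_tokens):
            start = (h * n_tokens + t) * n_kv
            rows.append(list(flat[start:start + n_kv]))
        attn.append(rows)
    return attn

# Synthetic example: 1 head, 2 query tokens, 2 key positions.
demo_raw = struct.pack("<4f", 1.0, 0.0, 0.25, 0.75)
demo = unpack_attention(demo_raw, n_kv=2, n_tokens=2, n_head=1)
```

Each inner row is one query token's attention distribution over key positions, so a quick sanity check on a real dump is that every row sums to roughly 1.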
GPU note: tensor values are bit-identical between CPU and GPU inference — verified at 0.000% relative difference across all layers. All runs used GPU offload for speed.
Models tested
| Model | Layers | Heads | Architecture |
|---|---|---|---|
| Gemma-4-E2B Q4_0 | 35 | 8 (GQA, kv=1) | SWA hybrid — 28/35 layers use sliding window |
| LFM2.5-8B Q6_K | 6 | 32 | Liquid MoE — only 6 standard attention layers |
| Llama-3.2-1B Q4_0 | 16 | 32 | Standard transformer |
| Phi-3-mini Q4 | 32 | 32 | Standard transformer (Microsoft) |
| Qwen2.5-0.5B Q4_0 | 24 | 14 | GQA |
| Qwen2.5-7B Q4_K_M | 28 | 28 | GQA |
Prompts
40 factual completion prompts across 6 categories: capital_city, location_fact, memorized_phrase, entity_completion, science_fact, historical_year.
Each prompt has a wrong suffix — e.g. "The capital of France is" → wrong: " Berlin". Both variants are run through each model and the attention distributions are compared layer by layer using Jensen-Shannon divergence.
JS-divergence(layer) = divergence between attention distribution
on [neutral prompt] vs [prompt + wrong suffix]
High JS-divergence at layer N means: adding the wrong suffix disrupts attention most at layer N. That layer is where the model commits to context.
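The per-layer comparison can be sketched in a few lines of Python. This is a generic Jensen-Shannon divergence in bits, not the Rust implementation from the repository. It assumes both attention rows are defined over the same key positions, whereas the neutral and wrong-suffix prompts differ in length, and how the two are aligned is not specified here.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions.

    Assumes p and q have the same length and each sums to 1.
    The result lies in [0, 1]: 0 for identical inputs, 1 for disjoint ones.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same positions")
    # Mixture distribution M = (P + Q) / 2
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # KL(x || y); terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Synthetic rows, not measured data.
same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With base-2 logarithms the value is bounded by 1, which makes magnitudes comparable across layers and models.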
Results
1. Entropy trajectory — architecture fingerprint
Three distinct regimes:
- LFM2.5-8B (dots, 6 attention layers): highest entropy of all (0.826). When a model uses attention only 6 times across its full depth, those layers distribute attention broadly, while the non-attention liquid layers handle sequence mixing in between.
- Gemma-4-E2B: high entropy (0.810) but a continuous trajectory. SWA layers cannot attend to position 0 once it falls outside the sliding window, which forces more distributed attention than in standard transformers.
- Llama, Phi-3, Qwen: lower entropy (0.47–0.55), stronger position-0 sink. Standard dense attention converges toward the BOS token across all layers.
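For reference, one way to compute an entropy score of this kind is sketched below. It assumes the reported values are Shannon entropy normalized by log(n), so that 1.0 means uniform attention and 0.0 means all weight on a single position. The normalization is my assumption for illustration, and the aggregation over heads and tokens is left out.

```python
import math

def normalized_entropy(row):
    """Shannon entropy of one attention row, scaled to [0, 1].

    Dividing by log(n) is an assumed normalization; a row with a
    single position has no spread and is scored as 0.
    """
    n = len(row)
    if n < 2:
        return 0.0
    h = -sum(w * math.log(w) for w in row if w > 0)
    return h / math.log(n)

# Synthetic rows, not measured data.
uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])
sunk = normalized_entropy([1.0, 0.0, 0.0, 0.0])
```

A row dominated by the position-0 sink scores near 0, which is why a strong sink and low entropy tend to appear together.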
2. Attention sink strength
BOS token (position 0) sink strength reveals the same split:
| Model | Mean sink weight |
|---|---|
| Llama-3.2-1B | 0.835 — strongest |
| Phi-3-mini | 0.793 |
| Qwen2.5-0.5B | 0.711 |
| Qwen2.5-7B | 0.684 |
| Gemma-4-E2B | 0.593 |
| LFM2.5-8B | 0.445 — weakest |
LFM's attention layers don't need to be BOS sinks. The liquid layers maintain state across positions without a dedicated sink token.
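A minimal sketch of a sink-strength measure, assuming it is the attention weight on key position 0 averaged over all heads and query tokens of a layer. The function name `mean_sink_weight` and the choice to include every query token are illustrative assumptions.

```python
def mean_sink_weight(attn):
    """Average weight placed on key position 0.

    `attn[head][token]` is one query token's attention row. Averages
    over every head and every query token (assumed aggregation).
    """
    total = 0.0
    count = 0
    for head in attn:
        for row in head:
            total += row[0]
            count += 1
    if count == 0:
        raise ValueError("no attention rows to average")
    return total / count

# Synthetic layer: 1 head, 2 query tokens.
layer = [[[1.0, 0.0], [0.5, 0.5]]]
sink = mean_sink_weight(layer)
```

Under causal masking the first query token can only attend to position 0, so its row always contributes 1.0. On short prompts that inflates the mean, and excluding the first token is a reasonable variant.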
3. Commitment depth — where does attention commit?
This is the main result. Each bar shows at which layer the JS-divergence between neutral and wrong context peaks — the commitment depth.
Qwen (both sizes) commits at layer 0. Adding " Berlin" to "The capital of France is" disrupts attention immediately, in the first layer. The model pattern-matches factual associations in the embedding + first attention pass.
Gemma-4 commits at layer 6 for capitals, layer 14 for everything else. The SWA architecture forces deeper processing — facts are retrieved gradually across layers.
LFM commits at layers 6–18 — spread across its 6 available attention layers. With only 6 checkpoints, commitment happens progressively at each one.
Phi-3 is the outlier for science facts: commits at layer 31 (the last layer). Science facts ("DNA stands for deoxyribonucleic") require full-depth processing in Phi-3 — the wrong suffix isn't resolved until the final attention pass.
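Given per-layer JS-divergence values, the commitment depth as defined above is just the layer where the curve peaks. The sketch below takes a mapping from layer index to JSD so that hybrid models, where only some layers have attention, are handled the same way. The tie-breaking rule (earliest layer wins) and the sample values are assumptions.

```python
def commitment_depth(jsd_by_layer):
    """Return the layer index with the highest JS-divergence.

    `jsd_by_layer` maps layer index -> JSD for that layer. Ties go to
    the earliest layer (an arbitrary choice for this sketch).
    """
    if not jsd_by_layer:
        raise ValueError("need at least one layer")
    # sorted() visits layers in ascending order; max() keeps the first maximum.
    return max(sorted(jsd_by_layer), key=lambda layer: jsd_by_layer[layer])

# Synthetic curves, not measured data.
dense = {0: 0.40, 1: 0.10, 2: 0.05}
hybrid = {6: 0.02, 10: 0.03, 14: 0.05, 18: 0.04}
dense_depth = commitment_depth(dense)
hybrid_depth = commitment_depth(hybrid)
```

A single argmax hides flat or multi-peaked curves, so it is worth looking at the full curve whenever the top two layers are close.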
4. Size doesn't change commitment depth
Qwen2.5-0.5B and 7B produce nearly identical commitment curves — same peak layer, very similar JS-divergence magnitude (0.5B slightly higher, meaning it's more sensitive to wrong context).
Scale shifts the magnitude of disruption, not where it happens. The commitment depth is baked into the architecture and training, not the parameter count.
5. JS-divergence by layer — all models
historical_year shows the highest JS-divergence across all models (5–10× higher than other categories). A wrong suffix on a historical year, e.g. "WWII ended in" → wrong: " 1950", creates the strongest attention disruption, which I read as these facts being the hardest to inject wrong context into.
| Category | Strongest disruption |
|---|---|
| historical_year | Qwen2.5-0.5B: JSD=0.109 at layer 0 |
| science_fact | Phi-3-mini: JSD=0.030 at layer 31 |
| capital_city | Early (layer 6 or earlier) for all models |
Key Takeaways
Architecture >> size for commitment depth. Qwen2.5-7B commits at the same layer as Qwen2.5-0.5B. Gemma-4 commits 14 layers deeper regardless.
Hybrid architectures distribute commitment. LFM's liquid layers process context between attention checkpoints — commitment is smeared across the 6 available layers rather than spiking at one.
SWA prevents early sinking. Gemma-4's sliding window layers can't always reach position 0, resulting in higher entropy and later commitment compared to standard transformers of similar depth.
Phi-3 needs full depth for science. Science facts process through all 32 layers in Phi-3 before attention commits — the deepest commitment of any model/category pair in this dataset.
Historical facts are the hardest to corrupt. Every model shows peak JS-divergence on historical_year: these associations are most robustly encoded and most disrupted by wrong suffixes.
Code
All code is in geographdb-experiments:
- extract_tensors.cpp: llama.cpp tensor callback, captures kq_soft_max-{layer} to binary
- src/bin/xray_llama.rs: Rust: entropy, sink, JS-divergence analysis; compare subcommand
- scripts/run_xray_variants.sh: runs neutral + wrong variants, calls compare
- scripts/xray_article_plots.py: generates figures
Build:
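```bash
# Build the C++ extractor against llama.cpp
g++ -std=c++17 -O2 extract_tensors.cpp -I include -I ggml/include \
    -L /usr/lib -lllama -lggml-base -lggml -o extract_xray
# Build the Rust analyzer
cargo build --bin xray_llama
# Dump attention tensors for one prompt
./extract_xray model.gguf "The capital of France is" /tmp/xray_dump
# Analyze a single dump
cargo run --bin xray_llama -- /tmp/xray_dump
# Compare a neutral dump against a wrong-suffix dump
cargo run --bin xray_llama -- compare /tmp/neutral /tmp/wrong --id france_capital
```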
Data: 6 models × 40 prompts × 2 variants = 480 inference runs, all on a CPU+GPU hybrid setup (ROCm, AMD RX 7900 XT). Attention weights are GPU/CPU bit-identical (verified at 0.000% relative error).
What's Next
This analysis has two known limitations worth addressing directly.
40 prompts is statistically thin. Per-category that's 4–8 prompts — enough to show a signal, not enough to claim a stable distribution. Some findings (Phi-3 science_fact at layer 31, LFM commitment spread) rest on small samples. The next run will scale to 500–1000 prompts per category using automatically generated factual completions from Wikidata and common benchmarks (TriviaQA, NaturalQuestions), giving enough data to report confidence intervals on commitment depth.
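As one possible way to report those intervals, here is a percentile-bootstrap sketch over per-prompt commitment depths. The bootstrap is my suggestion rather than a method used in this post, and the function name, resample count, and input values are all illustrative.

```python
import random

def bootstrap_ci(depths, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean depth.

    `depths` is a list of per-prompt commitment depths (layer indices).
    A fixed seed keeps the sketch reproducible.
    """
    if not depths:
        raise ValueError("need at least one depth")
    rng = random.Random(seed)
    n = len(depths)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(depths, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic depths for one model/category pair, not measured data.
sample_depths = [14, 14, 6, 14, 12, 14, 6, 14]
ci_low, ci_high = bootstrap_ci(sample_depths)
```

Because commitment depth is a discrete layer index, the mean may not be the most natural statistic. The same resampling loop works for the median or for the share of prompts peaking at a given layer.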
Categories are hand-crafted. Six categories designed by hand don't cover the full space of factual knowledge types. The interesting distinction — pattern-matching facts (capitals, symbols) vs compositional facts (historical causality, scientific relationships) — is real but not cleanly separated in the current dataset. Next step: use an LLM to auto-generate prompts stratified by reasoning depth (single-hop vs multi-hop retrieval), then verify the commitment depth split holds.
Both improvements are buildable on the same infrastructure. The xray_prompt_pairs_v1.jsonl format and the run_xray_variants.sh pipeline are designed to scale — swapping in a larger prompt file requires no code changes.




