tmax-2b brain atlas activation census, OV-circuits, and capability fence (24 layers)

by juiceb0xc0de - opened 7 days ago

allenai/tmax-2b Brain Atlas — First Look Inside a Hybrid SSM-Mamba-Transformer

Cross-post: I ran a full GWIQ-style brain atlas on the smallest member of the tmax family. It's not a downstream benchmark, just a look at what the tensors are actually doing.

model: allenai/tmax-2b
atlas type: activation census + Sub-Zero brain atlas + OV-circuit SVD
corpus: 8,965 prompts
layers: 24
attention layers: 3, 7, 11, 15, 19, 23
hybrid layers: literally everything else
sacred (fully probed) layers: all 24
datasets: juiceb0xc0de/tmax-2b-atlas

What this is

TMax-2B is a hybrid SSM/Mamba/transformer architecture. Only one in four layers is a full multi-head attention block. The rest are linear-attention / SSM-style layers with components like linattn_qkv, linattn_z, linattn_out, gate, up, and mlp. I wanted to see whether a 2B hybrid could still build clean late-layer subspaces and surgical directions, or whether the architecture just makes interpretability noisy.

Short answer: it is surprisingly clean. The noisy part is mostly stuff I deliberately skipped; more on that near the end.

What was run

Activation census over 8,965 prompts.
Per-layer feature taxonomy for every component that tokenizes language.
Per-head analysis on the 6 full attention layers.
OV-circuit SVD (W_V @ W_O) on every head.
Logit-lens pass to see which internal directions predict output tokens.
Sub-Zero surgery pass with capability fence across code, math, reasoning, factual, and multilingual.

The shape of the thing

| Property | Value |
| --- | --- |
| Layers | 24 |
| Attention layers | 6 |
| KV heads | 2 |
| Head dim | 256 |
| d_model | 2048 (implied) |
| Hybrid components | `gate`, `up`, `mlp`, `linattn_out`, `linattn_qkv`, `linattn_z` |
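
For reference, the layer layout from the header and the table can be written down as a tiny sketch. The variable names are my own for illustration and are not taken from the model's config; the only facts used are the 24 layers and the listed attention layer indices.

```python
# Layout sketch: names are illustrative, not the model's actual config keys.
N_LAYERS = 24
ATTENTION_LAYERS = [3, 7, 11, 15, 19, 23]  # from the post header

# Every layer that is not a full attention block is treated as "hybrid".
layer_kinds = [
    "attention" if i in ATTENTION_LAYERS else "hybrid"
    for i in range(N_LAYERS)
]

# The listed indices are exactly every fourth layer, starting at index 3.
assert ATTENTION_LAYERS == [i for i in range(N_LAYERS) if i % 4 == 3]

print(layer_kinds.count("attention"), layer_kinds.count("hybrid"))
```

That gives 6 attention layers and 18 hybrid layers, which is the "one in four" ratio.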

What the numbers suggest

Attention is doing distributed computation, not copy-paste

OV-circuit spectral concentration is 0.060 with effective rank around 79. That is a broad signature, not a memorized token-to-token circuit. The QK and FC paths are a bit more structured (0.20 and 0.22), but the attention itself looks like weighted high-dimensional computation, not sparse lookup.
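
To make the two OV numbers concrete, here is a minimal sketch of how such statistics could be computed for one head. The exact definitions behind the atlas values are not spelled out in this post, so both are assumptions here: "spectral concentration" is taken as the top singular value's share of the squared spectrum, and "effective rank" as the exponential of the entropy of the normalized singular values. The function name and the random weights are illustrative only.

```python
import numpy as np

def ov_spectrum_stats(W_V, W_O):
    """Spectral stats for one head's OV circuit.

    W_V: (d_model, d_head) value projection.
    W_O: (d_head, d_model) output projection.
    Both definitions below are assumptions, not the atlas's documented ones.
    """
    W_OV = W_V @ W_O                            # (d_model, d_model), rank <= d_head
    s = np.linalg.svd(W_OV, compute_uv=False)   # singular values, descending

    energy = s ** 2
    concentration = energy[0] / energy.sum()    # assumed: top-1 energy share

    p = s / s.sum()
    p = p[p > 0]                                # drop exact zeros before the log
    effective_rank = float(np.exp(-(p * np.log(p)).sum()))  # assumed: exp(entropy)
    return float(concentration), effective_rank

# Toy example with random weights (not the real model's tensors).
rng = np.random.default_rng(0)
d_model, d_head = 64, 16
W_V = rng.standard_normal((d_model, d_head))
W_O = rng.standard_normal((d_head, d_model))
conc, erank = ov_spectrum_stats(W_V, W_O)
print(f"concentration={conc:.3f} effective_rank={erank:.1f}")
```

Under these definitions, a head that copies along a single direction would score a concentration near 1 and an effective rank near 1, while a flat spectrum pushes concentration down and effective rank up toward the head dimension.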

Features are broad, not hyperspecific

The taxonomy is dominated by partial_shared and broadly_shared, with very few specific_* directions. Most dimensions respond to many prompts rather than one weird niche trigger. This matches the "reasons in-network rather than indexing a compressed store" vibe I also saw in other dense reasoning models with classical components.

MLP gates talk to logits more than attention heads do

The strongest logit-lens directions are gate features in the middle-to-late layers. Layer 15 gate feature 1943 hits F-stat 591, and layer 20 gate feature 71 is around 552. In a dense transformer you might expect late attention to dominate the logit lens. Here the hybrid MLP gates are doing a lot of output vocabulary routing.
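
For readers unfamiliar with the F-stat, a minimal sketch of a one-way ANOVA F-statistic for a single feature follows. The post does not say which grouping the atlas uses, so the group labels here are an assumption (for example, prompt category or predicted-token class), and the toy numbers are made up purely to show the arithmetic.

```python
import numpy as np

def one_way_f(values, labels):
    """One-way ANOVA F-statistic for one feature's activations.

    values: activation of the feature on each prompt.
    labels: group id per prompt (the grouping is an assumption here).
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == g] for g in np.unique(labels)]
    k, n = len(groups), values.size

    grand_mean = values.mean()
    # Variation of group means around the grand mean.
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of samples around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy data: three groups of three prompts each.
acts = [1, 2, 3, 2, 3, 4, 5, 6, 7]
grp = [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(one_way_f(acts, grp))  # 13.0
```

A large F means the feature's mean activation differs across groups far more than it varies within a group, which is the sense in which a gate feature can be strongly predictive.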

The most surgical surprise is in layer 1

Classifier accuracy is solid across all 24 layers: 0.969–0.984, average 0.981. But the worst damage from removing a Sub-Zero direction is layer 1 linattn_in_proj_z axis 0, which fails the fence across all five capability domains with up to 0.603 damage, worst on multilingual.

In a normal transformer, layer 1 is preprocessing. In this hybrid, an early linear-attention projection is load-bearing universal machinery. Remove it and code, factual, math, reasoning, and multilingual all take a hit simultaneously.

Surgical headroom is decent

335 of 365 tested Sub-Zero axes pass the capability fence (91.8%). Average damage is only 0.029. So aside from that one early linattn direction, the network is pretty editable.

What Sub-Zero is actually measuring here

The Sub-Zero pass is not a generic "find all important directions" sweep. It specifically looks for directions that separate corporate style from authentic style, then uses DAS rotation and the capability fence to check whether removing those directions damages code, math, reasoning, factual, or multilingual ability. So the 365 tested axes are compliance/behavior candidate axes, not a census of every load-bearing direction in the model. The layer 1 linattn axis happens to be a compliance/behavior direction that is also load-bearing, which is why the fence rejects it.
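
The remove-then-check loop can be sketched as follows. This is a simplification under stated assumptions: it uses a plain orthogonal projection to remove a direction rather than the DAS rotation used in the actual pass, "damage" is assumed to be the relative drop in a per-domain score, and the `max_damage` threshold and all scores are hypothetical.

```python
import numpy as np

DOMAINS = ["code", "math", "reasoning", "factual", "multilingual"]

def ablate_direction(H, v):
    """Remove the component of activations H (n, d) along direction v (d,)."""
    u = v / np.linalg.norm(v)
    return H - np.outer(H @ u, u)

def fence_check(baseline, ablated, max_damage=0.05):
    """Pass only if no domain loses more than max_damage (hypothetical threshold).

    baseline / ablated: dict mapping domain -> positive score.
    Damage is assumed to be the relative drop, clipped at zero.
    """
    damage = {
        d: max(0.0, (baseline[d] - ablated[d]) / baseline[d])
        for d in baseline
    }
    return all(x <= max_damage for x in damage.values()), damage

# Toy check that the ablation really zeroes the chosen direction.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 8))
v = rng.standard_normal(8)
assert np.allclose(ablate_direction(H, v) @ v, 0.0)

# Made-up scores: one domain collapses, so the axis fails the fence.
baseline = {d: 0.80 for d in DOMAINS}
ablated = dict(baseline, multilingual=0.40)
passed, damage = fence_check(baseline, ablated)
print(passed, damage["multilingual"])
```

In this framing, an axis like the layer 1 one would be rejected because its damage exceeds the threshold in every domain, not just one.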

Compliance/behavior directions are only partly separated

Peak compliance/behavior SV fraction is around 10%. Style/behavior directions are still partially entangled with capability directions at this scale.

The stuff I deliberately skipped

The atlas does not probe hybrid SSM layers that do not tokenize language. Measuring activations on those components just produced noisy numbers with no real structure, so I left them out. I am working on a way to capture whatever those hybrid components are actually doing, but that is a future post.

Caveats

This is a 2B model. Broad activation and low specific-feature counts may partly be a capacity limit, not a deliberate design choice.
The layer 1 linattn damage is an outlier; most of the model is more forgiving.
No Qwen3.5-2B base comparison here; that needs a separate atlas.

Bottom line

TMax-2B yields a complete little brain atlas: stable classifiers, editable late layers, distributed attention, and MLP/SSM gates that route to output tokens. The hybrid architecture does not break interpretability, but it does move some of the load-bearing directions earlier than you would expect in a dense transformer.
