Space: Chris4K/Compression_Navigator


Commit 35acee3 (verified) · 1 parent: 92e1b3e
Chris4K committed 11 days ago

Create app.py

Files changed (1): app.py (added, +858 -0)

	@@ -0,0 +1,858 @@

+# =============================================================================
+# COMPRESSION NAVIGATOR  ·  extended + annotated edition
+# =============================================================================
+# An LLM is a lossy codec for text. Training compresses a corpus into weights;
+# a forward pass decompresses a continuation. These five tools let you watch
+# that decompression happen and poke at where facts physically live.
+#
+# The five tabs are not toys invented here - each one is a real mechanistic-
+# interpretability technique you'll find in papers:
+#
+#   1. Decompress      = LOGIT LENS            (nostalgebraist, 2020)
+#   2. Triangulate     = EMBEDDING NEIGHBOURS  (the geometry of the vocab)
+#   3. Re-route        = ACTIVATION STEERING   (ActAdd / repr. engineering)
+#   4. Diff            = CROSS-MODEL ALIGNMENT  (compare checkpoints by depth)
+#   5. Causal trace    = ACTIVATION PATCHING    (ROME, Meng et al., 2022)
+#
+# WHY THE GLASS-BOX MODELS MATTER
+# -------------------------------
+# On a real model (gpt2) you never know the ground truth, so you can't tell
+# whether a tool is *correct* or just producing plausible-looking output.
+# This file ships two models whose internals you fully specify, so you can
+# check each tool against a known answer:
+#
+#   "handmade"  - facts stored as a LOOKUP TABLE keyed on the prompt string.
+#                 The computation happens in a side channel (string match),
+#                 NOT in the residual stream. Lesson: such a model is almost
+#                 invisible to residual-stream interpretability. Logit lens
+#                 sees a sudden jump with no build-up; causal tracing finds
+#                 nothing, because corrupting activations doesn't touch the
+#                 string match. This is a real and underappreciated *limit*
+#                 of these methods.
+#
+#   "glassbox"  - facts stored the way real transformers are thought to store
+#                 them: as key->value writes into the RESIDUAL STREAM (Geva et
+#                 al., "Transformer Feed-Forward Layers Are Key-Value
+#                 Memories"; this is the kind of MLP layer ROME edits). Because
+#                 the fact flows through activations, ALL five tools light up
+#                 correctly - and you can verify they report the layer you
+#                 actually put the fact in. This is a unit-test harness for
+#                 interpretability code.
+#
+# Run order suggestion:  glassbox  ->  handmade  ->  gpt2
+#   glassbox shows what "correct" looks like; handmade shows a failure mode;
+#   gpt2 shows the fuzzy, distributed real thing.
+# =============================================================================
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import gradio as gr
+from transformers import AutoModelForCausalLM, AutoTokenizer
+DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+DTYPE = torch.float32
+MODELS = {}                 # name -> (model, tokenizer) cache
+STATE = {"name": None}      # currently loaded model name
+# =============================================================================
+# A tiny shared tokenizer for both glass-box models.
+# Case is CANONICALISED to lowercase everywhere (this fixes a real bug in the
+# original: "Paris" from a pinned fact and "paris" from the Markov table became
+# two different vocab entries, so the boosted token and the *tracked* token
+# silently diverged - every neighbour read cos=0.000 and every tracked prob 0).
+# =============================================================================
+class FakeBatchEncoding(dict):
+    def to(self, device):            # let callers do tok(...).to(DEVICE) safely
+        return self
+class SimpleTok:
+    """Whitespace tokenizer over a fixed vocab. Not 'fast' (no offset map)."""
+    is_fast = False
+    def __init__(self, stoi, itos):
+        self.stoi, self.itos = stoi, itos
+        self.eos_token_id = stoi["."]      # period doubles as end-of-sequence
+    def _ids(self, text):
+        words = text.lower().replace(".", " .").split()
+        return [self.stoi.get(w, self.stoi["<s>"]) for w in words]
+    def __call__(self, text, return_tensors=None, return_offsets_mapping=False):
+        ids = self._ids(text) or [self.stoi["<s>"]]
+        return FakeBatchEncoding(
+            input_ids=torch.tensor([ids]),
+            attention_mask=torch.ones(1, len(ids), dtype=torch.long),
+        )
+    def encode(self, text, add_special_tokens=False):
+        return self._ids(text)
+    def decode(self, ids, skip_special_tokens=False):
+        out = []
+        for i in ids:
+            w = self.itos.get(int(i), "?")
+            if skip_special_tokens and w in ("<pad>", "<s>"):
+                continue
+            out.append(w)
+        return " ".join(out)
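
The case bug described in the tokenizer banner can be reproduced in a few lines. This is a minimal sketch, not code from app.py: `build_vocab` and both word lists are hypothetical stand-ins for the pinned-fact and Markov tables.

```python
def build_vocab(words, fold_case):
    """Assign ids in first-seen order; optionally lowercase each word first."""
    stoi = {}
    for w in words:
        key = w.lower() if fold_case else w
        stoi.setdefault(key, len(stoi))
    return stoi

pinned_answers = ["Paris"]            # hypothetical capitalised pinned fact
markov_words = ["paris", "france"]    # hypothetical lowercase bigram entries

raw = build_vocab(pinned_answers + markov_words, fold_case=False)
folded = build_vocab(pinned_answers + markov_words, fold_case=True)

print(raw)     # "Paris" and "paris" end up as separate vocab entries
print(folded)  # one shared "paris" entry
```

Without folding, a logit boost applied to the `Paris` id never reaches the `paris` id that a tracker looks up, which is the divergence the banner describes.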
+class _Out:
+    """Mimics a HF CausalLMOutput: .logits and (optional) .hidden_states."""
+    def __init__(self, logits, hidden_states):
+        self.logits = logits
+        self.hidden_states = hidden_states
+def _greedy_generate(model, input_ids, max_new_tokens=20, pad_token_id=None, **_ignored):
+    """Minimal greedy decode so the steering tab works on the toy models too
+    (the originals had no .generate, so that tab crashed on 'handmade').
+    Extra HF-style kwargs (e.g. do_sample) are accepted and ignored."""
+    ids = input_ids
+    for _step in range(int(max_new_tokens)):
+        nxt = model(input_ids=ids).logits[0, -1].argmax().view(1, 1)
+        ids = torch.cat([ids, nxt], dim=1)
+        # callers pass eos as pad_token_id, so this stops once "." is emitted
+        if pad_token_id is not None and int(nxt.item()) == int(pad_token_id):
+            break
+    return ids
+# =============================================================================
+# MODEL 1 - "handmade": facts as a LOOKUP TABLE (the side-channel glass box)
+# -----------------------------------------------------------------------------
+# Embeddings are the identity matrix (each token is its own one-hot). The two
+# "layers" don't read the residual stream in a meaningful linear way:
+#   - MemoryBlock matches the *decoded prompt string* and boosts the answer.
+#   - MarkovBlock adds a hand-built bigram transition for the last token.
+# Because MemoryBlock keys on the prompt TEXT, not on activations, this is a
+# deliberate demonstration of a model that residual-stream interpretability
+# cannot see. Use it as the "what failure looks like" control.
+# =============================================================================
+PINNED = {                              # answers are lowercase now (bug fix)
+    "the capital of france is": " paris",
+    "the eiffel tower is in":   " paris",
+    "two plus two equals":      " four",
+}
+MARKOV = {
+    "<s>":    {"the": 3, "i": 2, "a": 1},
+    "the":    {"city": 2, "tower": 2, "answer": 1},
+    "i":      {"think": 2, "am": 1},
+    "a":      {"model": 2, "city": 1},
+    "city":   {"of": 3, "is": 1},
+    "of":     {"light": 2, "paris": 1},
+    "tower":  {"is": 3},
+    "is":     {"in": 2, "a": 1},
+    "in":     {"paris": 2, "france": 1},
+    "model":  {"is": 2},
+    "think":  {"the": 2},
+    "paris":  {".": 1},
+    "france": {".": 1},
+    "light":  {".": 1},
+    "four":   {".": 1},
+}
+def _build_handmade_vocab():
+    toks, seen = ["<pad>", "<s>", "."], {"<pad>", "<s>", "."}
+    def add(w):
+        if w not in seen:
+            toks.append(w); seen.add(w)
+    for v in PINNED.values():
+        add(v.strip())
+    for w, nxts in MARKOV.items():
+        add(w)
+        for x in nxts:
+            add(x)
+    for k in PINNED:
+        for w in k.split():
+            add(w)
+    return toks
+HM_VOCAB = _build_handmade_vocab()
+HM_STOI = {w: i for i, w in enumerate(HM_VOCAB)}
+HM_ITOS = {i: w for w, i in HM_STOI.items()}
+HM_V = len(HM_VOCAB)
+class _MemoryBlock(nn.Module):
+    """If the decoded prompt ends with a pinned key, slam the answer logit.
+    NOTE: this reads prompt_ids (the string), not x - that's the whole point."""
+    def forward(self, x, prompt_ids=None):
+        out = x.clone()
+        if prompt_ids is not None:
+            text = " ".join(HM_ITOS.get(int(i), "") for i in prompt_ids).strip()
+            for key, ans in PINNED.items():
+                if text.endswith(key):
+                    out[0, -1, HM_STOI[ans.strip()]] += 12.0
+        return (out,)
+class _MarkovBlock(nn.Module):
+    """Add a hand-built bigram transition row for the last token."""
+    def __init__(self):
+        super().__init__()
+        T = torch.zeros(HM_V, HM_V)
+        for w, nxts in MARKOV.items():
+            if w in HM_STOI:
+                tot = sum(nxts.values())
+                for x, wt in nxts.items():
+                    if x in HM_STOI:
+                        T[HM_STOI[w], HM_STOI[x]] = wt / tot
+        self.register_buffer("T", T)
+    def forward(self, x, prompt_ids=None):
+        out = x.clone()
+        if prompt_ids:
+            out[0, -1] += 4.0 * self.T[int(prompt_ids[-1])]
+        return (out,)
+class _HMTransformer(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.wte = nn.Embedding(HM_V, HM_V)
+        with torch.no_grad():
+            self.wte.weight.copy_(torch.eye(HM_V))          # one-hot embeddings
+        self.h = nn.ModuleList([_MemoryBlock(), _MarkovBlock()])
+        self.ln_f = nn.Identity()
+class HandmadeModel(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.transformer = _HMTransformer()
+        self.head = nn.Linear(HM_V, HM_V, bias=False)
+        with torch.no_grad():
+            self.head.weight.copy_(torch.eye(HM_V))         # identity unembed
+        self.tok = SimpleTok(HM_STOI, HM_ITOS)
+    def get_input_embeddings(self):  return self.transformer.wte
+    def get_output_embeddings(self): return self.head
+    def generate(self, input_ids=None, attention_mask=None, **kw):
+        return _greedy_generate(self, input_ids, **kw)
+    def forward(self, input_ids=None, attention_mask=None, output_hidden_states=False):
+        ids = input_ids[0].tolist()
+        x = self.transformer.wte(input_ids).float()
+        hs = [x]; h = x
+        for blk in self.transformer.h:
+            (h,) = blk(h, prompt_ids=ids); hs.append(h)
+        logits = self.head(self.transformer.ln_f(h))
+        return _Out(logits, tuple(hs) if output_hidden_states else None)
+# =============================================================================
+# MODEL 2 - "glassbox": facts as RESIDUAL-STREAM key->value writes
+# -----------------------------------------------------------------------------
+# This is the model the original was missing. It stores facts the way real
+# transformers do, so every tool works AND can be checked against ground truth.
+#
+# Vocab + structured embeddings (d=32). Country and its capital deliberately
+# SHARE an embedding dimension, so the neighbours tool finds real geometry
+# (paris is near france).
+#
+# Four layers:
+#   L0  subject site   : (identity here) the residual the trace will restore
+#   L1  pool/attention : copies subject signal from earlier positions -> last
+#   L2  fact MLP       : key(subject+relation) -> relu -> value(answer dir)   <- ROME edits this kind of layer
+#   L3  cleanup        : identity
+#
+# Ground truth you can verify:
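+#   - logit lens: p(answer) stays near its baseline until the fact MLP (block
+#     index 2) fires, then jumps. The lens table numbers rows from the
+#     embedding, so that block's output is the row labelled "L3". Compare with
+#     handmade (sudden, no build-up) and gpt2 (fuzzy, spread over many layers).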
+#   - causal trace: corrupting the subject and restoring layer by layer peaks
+#     at L0 - because L1's "attention" re-reads the restored subject. That is
+#     the ROME story: the causal site is an early layer at the SUBJECT token.
+#   - steering / neighbours: both operate on real directions, so both work.
+# =============================================================================
+GB_D = 32
+GB_TOKS = ["<pad>", "<s>", ".", "the", "capital", "of", "is", "in",
+           "france", "germany", "japan", "paris", "berlin", "tokyo"]
+GB_STOI = {w: i for i, w in enumerate(GB_TOKS)}
+GB_ITOS = {i: w for w, i in GB_STOI.items()}
+GB_V = len(GB_TOKS)
+GB_FACTS = [("france", "paris"), ("germany", "berlin"), ("japan", "tokyo")]
+def _build_gb_embeddings():
+    E = torch.zeros(GB_V, GB_D)
+    def setd(tok, pairs):
+        for d, v in pairs:
+            E[GB_STOI[tok], d] = v
+    # country/capital pairs share their first dim -> positive cosine (geometry!)
+    setd("france", [(0, 1.0), (1, 0.6), (20, 0.5)])
+    setd("paris",  [(0, 0.8), (2, 0.9), (21, 0.5)])
+    setd("germany",[(3, 1.0), (4, 0.6), (22, 0.5)])
+    setd("berlin", [(3, 0.8), (5, 0.9), (23, 0.5)])
+    setd("japan",  [(6, 1.0), (7, 0.6), (24, 0.5)])
+    setd("tokyo",  [(6, 0.8), (8, 0.9), (25, 0.5)])
+    setd("is",     [(9, 1.0), (26, 0.4)])                  # the relation marker
+    for i, t in enumerate(GB_TOKS):                        # give fillers an id
+        if E[i].abs().sum() == 0:
+            E[i, 10 + i % 6] = 1.0
+    return E / (E.norm(dim=-1, keepdim=True) + 1e-9)       # unit rows
+GB_E = _build_gb_embeddings()
+GB_SUBJ = torch.zeros(GB_D, GB_D)                          # projector onto subject dims 0..8
+for _d in range(9):
+    GB_SUBJ[_d, _d] = 1.0
+class _GBIdent(nn.Module):
+    def forward(self, x, prompt_ids=None):
+        return (x.clone(),)
+class _GBPool(nn.Module):
+    """Toy 'attention': sum the subject-projected residual of all earlier
+    positions into the last position. Corrupting the subject earlier shows up
+    here; restoring the subject BEFORE this layer is what makes the trace
+    recover - that is why the causal peak lands at L0, not L1."""
+    def forward(self, x, prompt_ids=None):
+        out = x.clone()
+        if x.shape[1] > 1:
+            pooled = (x[0, :-1] @ GB_SUBJ.T).sum(0)
+            out[0, -1] = out[0, -1] + 0.9 * pooled
+        return (out,)
+class _GBFactMLP(nn.Module):
+    """Geva-style key->value memory. W_in rows are (subject+relation) keys;
+    relu gates which fact fires; W_out columns are answer unembed directions.
+    This is structurally the exact layer ROME rewrites to edit a fact."""
+    def __init__(self):
+        super().__init__()
+        Win = torch.zeros(len(GB_FACTS), GB_D)
+        Wout = torch.zeros(GB_D, len(GB_FACTS))
+        rel = GB_E[GB_STOI["is"]]
+        for k, (s, a) in enumerate(GB_FACTS):
+            key = (GB_E[GB_STOI[s]] @ GB_SUBJ.T) * 0.9 + rel
+            Win[k] = key / key.norm()
+            Wout[:, k] = GB_E[GB_STOI[a]]                  # write answer direction
+        self.register_buffer("Win", Win)
+        self.register_buffer("Wout", Wout)
+        self.bias, self.gain = 0.85, 6.0                   # tuned: clean p~0.5, corrupt p~0.07
+    def forward(self, x, prompt_ids=None):
+        out = x.clone()
+        pre = F.relu(self.Win @ out[0, -1] - self.bias)
+        out[0, -1] = out[0, -1] + self.gain * (self.Wout @ pre)
+        return (out,)
+class _GBTransformer(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.wte = nn.Embedding(GB_V, GB_D)
+        with torch.no_grad():
+            self.wte.weight.copy_(GB_E)
+        self.h = nn.ModuleList([_GBIdent(), _GBPool(), _GBFactMLP(), _GBIdent()])
+        self.ln_f = nn.Identity()
+class GlassBoxModel(nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.transformer = _GBTransformer()
+        self.head = nn.Linear(GB_D, GB_V, bias=False)
+        with torch.no_grad():
+            self.head.weight.copy_(GB_E)                   # tied unembed
+        self.tok = SimpleTok(GB_STOI, GB_ITOS)
+    def get_input_embeddings(self):  return self.transformer.wte
+    def get_output_embeddings(self): return self.head
+    def generate(self, input_ids=None, attention_mask=None, **kw):
+        return _greedy_generate(self, input_ids, **kw)
+    def forward(self, input_ids=None, attention_mask=None, output_hidden_states=False):
+        ids = input_ids[0].tolist()
+        x = self.transformer.wte(input_ids).float()
+        hs = [x]; h = x
+        for blk in self.transformer.h:
+            (h,) = blk(h, prompt_ids=ids); hs.append(h)
+        logits = self.head(self.transformer.ln_f(h))
+        return _Out(logits, tuple(hs) if output_hidden_states else None)
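
The fact layer above computes `x + gain * Wout @ relu(Win @ x - bias)` at the last position. The same key->value pattern can be sketched without torch; the 3-d vectors, `bias` and `gain` below are made-up illustration values, not the ones used by `_GBFactMLP`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kv_memory(x, keys, values, bias, gain):
    """Return (x + gain * sum_k relu(key_k . x - bias) * value_k, activations)."""
    acts = [max(0.0, dot(k, x) - bias) for k in keys]
    out = list(x)
    for a, v in zip(acts, values):
        out = [o + gain * a * vi for o, vi in zip(out, v)]
    return out, acts

keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # what each memory slot matches
values = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]   # what each slot writes back
x = [1.0, 0.2, 0.0]                            # matches key 0 well, key 1 weakly

out, acts = kv_memory(x, keys, values, bias=0.5, gain=2.0)
print(acts)  # only the first slot clears the bias
print(out)   # the first slot's value has been written into dim 2
```

The bias is what makes the memory selective: a weak match stays below threshold and writes nothing.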
+# =============================================================================
+# REAL MODELS - resolve the architecture-specific module paths
+# =============================================================================
+def _resolve(model, paths):
+    for path in paths:
+        obj, ok = model, True
+        for part in path.split("."):
+            if hasattr(obj, part):
+                obj = getattr(obj, part)
+            else:
+                ok = False; break
+        if ok:
+            return obj
+    return None
+def get_blocks(model):
+    blocks = _resolve(model, ["transformer.h", "model.layers",
+                              "gpt_neox.layers", "model.decoder.layers"])
+    if blocks is None:
+        raise RuntimeError("Could not locate transformer blocks.")
+    return blocks
+def get_final_norm(model):
+    norm = _resolve(model, ["transformer.ln_f", "model.norm",
+                            "gpt_neox.final_layer_norm",
+                            "model.decoder.final_layer_norm"])
+    return norm if norm is not None else (lambda x: x)
+def get_head(model):
+    return model.get_output_embeddings()
+def get_handles(name):
+    if name not in MODELS:
+        if name == "handmade":
+            m = HandmadeModel().eval(); MODELS[name] = (m, m.tok)
+        elif name == "glassbox":
+            m = GlassBoxModel().eval(); MODELS[name] = (m, m.tok)
+        else:
+            tok = AutoTokenizer.from_pretrained(name)
+            model = AutoModelForCausalLM.from_pretrained(
+                name, torch_dtype=DTYPE).to(DEVICE).eval()
+            MODELS[name] = (model, tok)
+    return MODELS[name]
+def load_model(name):
+    name = name.strip()
+    model, _ = get_handles(name)
+    STATE["name"] = name
+    return "Loaded **%s** (%d layers)." % (name, len(get_blocks(model)))
+# =============================================================================
+# Shared readout: project every layer's last-token residual to a vocab dist.
+# =============================================================================
+@torch.no_grad()
+def layer_distributions(model, tok, prompt):
+    inputs = tok(prompt, return_tensors="pt").to(DEVICE)
+    out = model(**inputs, output_hidden_states=True)
+    hs = out.hidden_states
+    norm, head, n = get_final_norm(model), get_head(model), len(out.hidden_states)
+    dists = []
+    for i, layer_hs in enumerate(hs):
+        vec = layer_hs[0, -1].to(DTYPE)
+        # HF convention: the LAST hidden_states entry is already post-ln_f,
+        # so skip norm there; apply ln_f to intermediates (logit-lens style).
+        logits = head(vec) if i == n - 1 else head(norm(vec))
+        dists.append(("embed" if i == 0 else "L%d" % i, F.softmax(logits, dim=-1)))
+    return dists
+def _entropy_bits(probs):
+    p = probs.clamp_min(1e-12)
+    return float(-(p * p.log()).sum() / math.log(2))
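
As a sanity check on the entropy column, here is a stdlib-only sketch of the same quantity (Shannon entropy in bits, with probabilities clamped to avoid log(0)). The function name `entropy_bits` and the two test distributions are illustrative.

```python
import math

def entropy_bits(probs, eps=1e-12):
    """Shannon entropy in bits; probabilities are clamped at eps before the log."""
    clamped = [max(p, eps) for p in probs]
    return -sum(p * math.log2(p) for p in clamped)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally unsure over 4 tokens
one_hot = [1.0, 0.0, 0.0, 0.0]       # fully committed

print(entropy_bits(uniform))   # 2 bits
print(entropy_bits(one_hot))   # essentially 0 bits
```

A uniform distribution over N tokens gives log2(N) bits, so the column falls toward 0 as the model commits to one token.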
+# =============================================================================
+# TAB 1 - LOGIT LENS: watch the answer condense out of the residual stream
+# =============================================================================
+@torch.no_grad()
+def logit_lens(prompt, top_k, track):
+    if STATE["name"] is None:
+        return "Load a model first."
+    model, tok = get_handles(STATE["name"])
+    top_k = int(top_k)
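+    # Only the FIRST sub-token of `track` is followed. On BPE tokenizers (gpt2)
+    # a leading space changes the token, so " Paris" and "paris" can differ.
+    tids = tok.encode(track, add_special_tokens=False) if track.strip() else []
+    tid = tids[0] if tids else None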
+    dists = layer_distributions(model, tok, prompt)
+    header = "layer | top tokens (prob)                       | entropy" \
+             + ("   | p(%r)" % track if tid is not None else "")
+    lines = ["prompt: %r" % prompt, header, "-" * len(header)]
+    for label, probs in dists:
+        p, idx = probs.topk(top_k)
+        shown = "  ".join("%r:%.2f" % (tok.decode([t]).replace("\n", "\\n"), v)
+                          for t, v in zip(idx.tolist(), p.tolist()))
+        row = "%5s | %-40s | %4.1fb" % (label, shown, _entropy_bits(probs))
+        if tid is not None:
+            row += "   | %.3f" % probs[tid].item()
+        lines.append(row)
+    return "\n".join(lines)
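
The core of the logit lens is small: project each layer's residual through the unembedding and softmax it. The sketch below assumes a hypothetical 2-d residual stream and 3-token vocabulary, and skips the final-norm handling that `layer_distributions` performs.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def logit_lens_rows(residuals, unembed):
    """One probability row per layer: softmax(unembed @ residual)."""
    rows = []
    for h in residuals:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        rows.append(softmax(logits))
    return rows

unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]     # one row per vocab token
residuals = [[0.0, 0.0], [0.0, 0.5], [0.0, 4.0]]     # token 1's direction grows with depth

rows = logit_lens_rows(residuals, unembed)
for depth, probs in enumerate(rows):
    print(depth, [round(p, 3) for p in probs])
```

Row 0 is uniform because the residual is zero; by the last row token 1 dominates, which is the "condensing" pattern the tab displays.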
+# =============================================================================
+# TAB 2 - NEIGHBOURS: the geometry of the (un)embedding space
+# =============================================================================
+@torch.no_grad()
+def neighbors(word, top_k):
+    if STATE["name"] is None:
+        return "Load a model first."
+    model, tok = get_handles(STATE["name"])
+    top_k = int(top_k)
+    ids = tok.encode(word, add_special_tokens=False)
+    if not ids:
+        return "Could not tokenize %r." % word
+    tid = ids[0]
+    W = F.normalize(get_head(model).weight.to(DTYPE), dim=-1)
+    sims = W @ W[tid]
+    k = min(top_k + 1, sims.numel())      # topk raises if k exceeds the vocab size
+    vals, idx = sims.topk(k)
+    note = ""
+    if STATE["name"] == "handmade":
+        note = ("(handmade uses one-hot embeddings, so every token is "
+                "orthogonal -> all cosines are 0 by construction. This is the "
+                "tool telling the truth about a model with no vocab geometry.)\n")
+    lines = [note + "neighbours of %r:" % word]
+    for v, j in zip(vals.tolist(), idx.tolist()):
+        if j != tid:
+            lines.append("  %14r  cos=%.3f" % (tok.decode([j]), v))
+    return "\n".join(lines[: top_k + 1])
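
The glassbox geometry can be checked by hand. The sketch below recomputes the france/paris cosine from the raw values passed to `setd` in `_build_gb_embeddings`; representing the vectors as sparse dicts is a choice made here, not something app.py does.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {dim: value}."""
    dot = sum(v * b.get(d, 0.0) for d, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

france = {0: 1.0, 1: 0.6, 20: 0.5}   # values from setd("france", ...)
paris = {0: 0.8, 2: 0.9, 21: 0.5}    # values from setd("paris", ...)
berlin = {3: 0.8, 5: 0.9, 23: 0.5}   # values from setd("berlin", ...)

print(round(cosine(france, paris), 3))   # shared dim 0 gives a positive cosine
print(cosine(france, berlin))            # no shared dims
```

Only dimension 0 overlaps, so the cosine is 0.8 / (sqrt(1.61) * sqrt(1.70)), roughly 0.48, while france and berlin are exactly orthogonal. Normalising rows first, as the model does, does not change a cosine.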
+# =============================================================================
+# TAB 3 - STEERING: bend behaviour by adding a direction, no retraining
+# =============================================================================
+def _make_steer_hook(direction, alpha):
+    d = direction * alpha
+    def hook(module, inp, out):
+        if isinstance(out, tuple):
+            return (out[0] + d.to(out[0].dtype).to(out[0].device),) + out[1:]
+        return out + d.to(out.dtype).to(out.device)
+    return hook
+@torch.no_grad()
+def steer_generate(prompt, source, target, layer, alpha, max_new):
+    if STATE["name"] is None:
+        return "Load a model first.", ""
+    model, tok = get_handles(STATE["name"])
+    layer, max_new = int(layer), int(max_new)
+    emb = model.get_input_embeddings().weight
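+    def first_emb(w):
+        ids = tok.encode(w, add_special_tokens=False)
+        # fall back to zeros on emb's own device: the toy models stay on CPU
+        # even when DEVICE is "cuda", so device=DEVICE could mismatch
+        return emb[ids[0]] if ids else torch.zeros(emb.shape[-1], device=emb.device)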
+    direction = F.normalize((first_emb(target) - first_emb(source)).to(DTYPE), dim=-1)
+    inputs = tok(prompt, return_tensors="pt").to(DEVICE)
+    gk = dict(max_new_tokens=max_new, do_sample=False, pad_token_id=tok.eos_token_id)
+    base = tok.decode(model.generate(**inputs, **gk)[0], skip_special_tokens=True)
+    blocks = get_blocks(model)
+    layer = max(0, min(layer, len(blocks) - 1))
+    handle = blocks[layer].register_forward_hook(_make_steer_hook(direction, alpha))
+    try:
+        steered = tok.decode(model.generate(**inputs, **gk)[0], skip_special_tokens=True)
+    finally:
+        handle.remove()
+    return base, "steer %r -> %r @ L%d alpha=%s\n%s" % (source, target, layer, alpha, steered)
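
Stripped of hooks and generation, the steering step is one vector addition: a unit direction from `source` to `target`, scaled by `alpha`, added to the residual. The sketch below uses made-up 2-d vectors and a hypothetical `steer` helper.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0   # leave an all-zero vector unchanged
    return [x / n for x in v]

def steer(h, source_vec, target_vec, alpha):
    """Return (h + alpha * unit(target - source), unit direction)."""
    d = normalize([t - s for t, s in zip(target_vec, source_vec)])
    return [x + alpha * di for x, di in zip(h, d)], d

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

source_vec, target_vec = [1.0, 0.0], [0.0, 1.0]   # hypothetical token embeddings
h = [0.5, 0.5]                                    # hypothetical residual vector
alpha = 2.0

h_steered, d = steer(h, source_vec, target_vec, alpha)
print(dot(h, d), dot(h_steered, d))   # projection onto the direction before and after
```

Because the direction is unit length, the projection onto it rises by exactly `alpha`. In app.py the hook adds the same vector at every position of the chosen block's output.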
+# =============================================================================
+# TAB 4 - DIFF: compare two models on one prompt, aligned by relative depth
+# =============================================================================
+@torch.no_grad()
+def diff_models(name_a, name_b, prompt, target, top_k):
+    ma, ta = get_handles(name_a.strip())
+    mb, tb = get_handles(name_b.strip())
+    ida = ta.encode(target, add_special_tokens=False)
+    idb = tb.encode(target, add_special_tokens=False)
+    if not ida or not idb:
+        return "Could not tokenize target %r in both models." % target
+    ida, idb = ida[0], idb[0]
+    da = layer_distributions(ma, ta, prompt)
+    db = layer_distributions(mb, tb, prompt)
+    nA, nB = len(da) - 1, len(db) - 1
+    def top1(probs, tok):
+        v, i = probs.topk(1)
+        return "%r:%.2f" % (tok.decode([i.item()]), v.item())
+    lines = ["prompt: %r   target: %r" % (prompt, target),
+             "%18s | %16s %6s | %16s %6s | %7s"
+             % ("depth (A/B)", "A top1", "pA", "B top1", "pB", "dp")]
+    for i in range(nA + 1):
+        frac = (i / nA) if nA > 0 else 0.0
+        j = max(0, min(round(frac * nB), nB)) if nB > 0 else 0
+        la, pa = da[i]; lb, pb = db[j]
+        a_t, b_t = pa[ida].item(), pb[idb].item()
+        lines.append("%18s | %16s %6.3f | %16s %6.3f | %+7.3f"
+                     % ("%3.0f%% (%s/%s)" % (frac * 100, la, lb),
+                        top1(pa, ta), a_t, top1(pb, tb), b_t, b_t - a_t))
+    return "\n".join(lines)
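
The relative-depth alignment used by the diff table can be isolated as a small function. `align_depth` is a hypothetical name; the arithmetic mirrors the loop above, where `n_a` and `n_b` are the last row indices of each model.

```python
def align_depth(n_a, n_b):
    """Pair each row i of model A with the row of model B at the same relative depth."""
    pairs = []
    for i in range(n_a + 1):
        frac = (i / n_a) if n_a > 0 else 0.0
        j = max(0, min(round(frac * n_b), n_b)) if n_b > 0 else 0
        pairs.append((i, j))
    return pairs

print(align_depth(4, 12))   # a 4-block model against a 12-block model
print(align_depth(2, 1))    # exact halves round to the even neighbour
```

The first call pairs rows 0..4 with rows 0, 3, 6, 9, 12. Python's `round` rounds exact halves to the nearest even integer, so in the second call the midpoint row of A maps to row 0 of B rather than row 1.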
+# =============================================================================
+# TAB 5 - CAUSAL TRACE: corrupt the subject, restore each layer, find the site
+# -----------------------------------------------------------------------------
+# This is ROME's activation patching. We:
+#   1. record clean activations and clean p(target)
+#   2. add gaussian noise to the SUBJECT token embeddings -> corrupt p(target)
+#   3. for each layer L: run corrupted, but force layer L's residual back to
+#      the clean values at the subject positions. How much p(target) recovers
+#      tells you how causally important layer L is. The peak is "the site".
+# The glass-box gives a clean, verifiable peak; gpt2 gives a realistic band.
+# =============================================================================
+def _find_subject_positions(tok, input_ids, prompt, subject):
+    """Locate subject token positions, with a path for slow (non-fast) toks."""
+    seq_len = input_ids.shape[1]
+    if getattr(tok, "is_fast", False):
+        enc = tok(prompt, return_tensors="pt", return_offsets_mapping=True)
+        cs = prompt.find(subject)
+        if cs >= 0:
+            ce = cs + len(subject)
+            offs = enc["offset_mapping"][0].tolist()
+            pos = [i for i, (s, e) in enumerate(offs) if e > cs and s < ce]
+            if pos:
+                return [p for p in pos if p != seq_len - 1], ""
+    else:
+        sub_ids = tok.encode(subject, add_special_tokens=False)
+        seq = input_ids[0].tolist()
+        pos = [i for i, t in enumerate(seq) if t in sub_ids]
+        if pos:
+            return [p for p in pos if p != seq_len - 1], ""
+    fb = list(range(0, max(1, seq_len - 1)))[: max(1, seq_len // 2)]
+    return fb, "(subject not found; using fallback window)\n"
+@torch.no_grad()
+def causal_trace(prompt, subject, target, noise_scale, seed):
+    if STATE["name"] is None:
+        return "Load a model first."
+    model, tok = get_handles(STATE["name"])
+    seed, noise_scale = int(seed), float(noise_scale)
+    inputs = tok(prompt, return_tensors="pt").to(DEVICE)
+    input_ids = inputs["input_ids"]
+    positions, note = _find_subject_positions(tok, input_ids, prompt, subject)
+    if not positions:
+        return note + "No valid subject positions."
+    target_ids = tok.encode(target, add_special_tokens=False)
+    if not target_ids:
+        return "Could not tokenize target %r." % target
+    tid = target_ids[0]
+    out_clean = model(**inputs, output_hidden_states=True)
+    clean_hs = out_clean.hidden_states
+    clean_p = F.softmax(out_clean.logits[0, -1].to(DTYPE), dim=-1)[tid].item()
+    emb_module = model.get_input_embeddings()
+    std = emb_module.weight.std().item()
+    hidden = emb_module.weight.shape[-1]
+    torch.manual_seed(seed)
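+    # sample on the embedding's device: the toy models stay on CPU even when
+    # DEVICE is "cuda", and the hook adds this noise to the embedding output
+    noise = torch.randn(len(positions), hidden, device=emb_module.weight.device) * noise_scale * std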
+    def corrupt_hook(module, inp, out):
+        out = out.clone()
+        for k, p in enumerate(positions):
+            out[0, p] = out[0, p] + noise[k].to(out.dtype)
+        return out
+    h = emb_module.register_forward_hook(corrupt_hook)
+    try:
+        corrupt_p = F.softmax(model(**inputs).logits[0, -1].to(DTYPE), dim=-1)[tid].item()
+    finally:
+        h.remove()          # models are cached, so a leaked hook would corrupt later runs
+    blocks, rows = get_blocks(model), []
+    for l in range(len(blocks)):
+        clean_layer_hs = clean_hs[l + 1][0]
+        def restore_hook(module, inp, out, _clean=clean_layer_hs):
+            if isinstance(out, tuple):
+                h0 = out[0].clone()
+                for p in positions:
+                    h0[0, p] = _clean[p].to(h0.dtype)
+                return (h0,) + out[1:]
+            h0 = out.clone()
+            for p in positions:
+                h0[0, p] = _clean[p].to(h0.dtype)
+            return h0
+        h1 = emb_module.register_forward_hook(corrupt_hook)
+        h2 = blocks[l].register_forward_hook(restore_hook)
+        try:
+            p_r = F.softmax(model(**inputs).logits[0, -1].to(DTYPE), dim=-1)[tid].item()
+        finally:
+            h1.remove(); h2.remove()
+        rows.append((l, p_r))
+    denom = clean_p - corrupt_p
+    lines = [note + "prompt: %r" % prompt,
+             "subject: %r   target: %r" % (subject, target),
+             "clean p=%.3f   corrupt p=%.3f   noise=%sx std" % (clean_p, corrupt_p, noise_scale),
+             "", "%6s | %9s | %9s" % ("layer", "p(target)", "recovery")]
+    best_l, best_r = 0, -1e9
+    for l, p_r in rows:
+        rec = (p_r - corrupt_p) / denom if abs(denom) > 1e-6 else 0.0
+        if rec > best_r:
+            best_r, best_l = rec, l
+        lines.append("  L%-3d | %9.3f | %8.1f%%" % (l, p_r, rec * 100))
+    lines.append("")
+    if abs(denom) < 1e-6:
+        # recovery is undefined here, so don't announce a misleading "peak at L0"
+        lines.append("# no peak: corruption didn't move p(target). On 'handmade' this is "
+                     "EXPECTED - the fact lives in a string match, not activations.")
+    else:
+        lines.append("# peak at L%d (%.0f%% recovery) <- the causal site" % (best_l, best_r * 100))
+    return "\n".join(lines)
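
The recovery column is a normalised gap: how much of the clean-versus-corrupt difference one restored layer wins back. In the sketch below the clean and corrupt probabilities are the approximate glassbox figures quoted in the `_GBFactMLP` tuning comment; the per-block restored probabilities are hypothetical. Unlike app.py, which reports 0.0 when the gap vanishes, this version returns `None` to mark the ratio as undefined.

```python
def recovery(p_restored, p_clean, p_corrupt, eps=1e-6):
    """Fraction of the clean-vs-corrupt gap recovered by restoring one layer."""
    gap = p_clean - p_corrupt
    if abs(gap) <= eps:
        return None                      # corruption had no effect: undefined
    return (p_restored - p_corrupt) / gap

p_clean, p_corrupt = 0.50, 0.07          # approximate values from the source comment
restored = [0.50, 0.07, 0.07, 0.07]      # hypothetical p(target) after restoring each block

scores = [recovery(p, p_clean, p_corrupt) for p in restored]
peak = max(range(len(scores)), key=lambda l: scores[l])
print(scores, "peak at block", peak)
print(recovery(0.3, 0.3, 0.3))           # the 'handmade' situation: no gap to recover
```

A score of 1.0 means restoring that block alone undoes the corruption; a flat row of zeros after it means later blocks have already consumed the corrupted signal.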
+# =============================================================================
+# UI
+# =============================================================================
+INTRO = """
+# Compression Navigator
+**An LLM is a lossy codec for text.** Training compresses a corpus into weights;
+a forward pass decompresses a continuation. These five tools let you watch that
+decompression and find where facts physically live.
+Each tab is a real interpretability technique: **logit lens, embedding
+neighbours, activation steering, cross-model diff, and causal tracing (ROME).**
+### Three models, on purpose
+| name | how it stores facts | what it teaches |
+|---|---|---|
+| **`glassbox`** | key→value writes into the **residual stream** (like a real transformer / what ROME edits) | the tools **work and are verifiable** against ground truth you can read in the source |
+| **`handmade`** | a **lookup table** keyed on the prompt string (a side channel) | a model can be **invisible** to residual-stream interpretability — a real limitation |
+| **`gpt2`** | learned, fuzzy, **distributed** over many layers | what the real, messy thing looks like |
+**Suggested order:** load `glassbox` first (see "correct"), then `handmade`
+(see a failure mode), then `gpt2` (see reality). Type a name below and Load.
+"""
+with gr.Blocks(title="Compression Navigator") as demo:
+    gr.Markdown(INTRO)
+    with gr.Row():
+        model_name = gr.Textbox(value="glassbox", label="model name or HF id")
+        load_btn = gr.Button("Load", variant="primary")
+    load_status = gr.Markdown()
+    load_btn.click(load_model, inputs=model_name, outputs=load_status)
+    # ---- TAB 1 -------------------------------------------------------------
+    with gr.Tab("1 · Decompress (logit lens)"):
+        gr.Markdown("""
+### Logit lens: watch the answer condense, layer by layer
+
+**What it does:** takes the last-token residual at *every* layer and reads it
+through the unembedding, as if the model had to answer right there. You see the
+prediction form.
+
+**How to read it:** each row is a layer. Watch your tracked token's probability
+(right column) climb, and watch **entropy** (bits) fall as the model commits.
+
+**Ground truth to check:**
+
+- `glassbox`: `paris` is ~0 until **L2** (the fact-MLP), then jumps. Sharp and localised because you put it there.
+- `handmade`: the answer appears suddenly with no build-up (it's a lookup, not a computation).
+- `gpt2`: the answer accretes *gradually* across many middle/late layers. That smear is what "distributed representation" actually looks like.
+""")
+        ll_prompt = gr.Textbox(value="the capital of france is", label="prompt")
+        with gr.Row():
+            ll_k = gr.Slider(1, 10, value=3, step=1, label="top-k per layer")
+            ll_track = gr.Textbox(value="paris", label="track this token's prob")
+        ll_out = gr.Textbox(label="output", lines=18)
+        gr.Button("Run").click(logit_lens, [ll_prompt, ll_k, ll_track], ll_out)
+    # ---- TAB 2 -------------------------------------------------------------
+    with gr.Tab("2 · Triangulate (neighbours)"):
+        gr.Markdown("""
+### Neighbours: the geometry of the vocabulary
+
+**What it does:** ranks every vocabulary token by the cosine similarity between
+its unembedding row and the query word's row. Directions that point the same
+way are "near" in the model's compressed space.
+
+**How to read it:** high cosine = the model treats these tokens as related.
+
+**Ground truth to check:**
+
+- `glassbox`: `paris` is near `france` (cos ≈ 0.48) because the source deliberately makes a capital share a dimension with its country. Real geometry, by design.
+- `handmade`: the cosine between any two distinct tokens is 0. One-hot embeddings are mutually orthogonal, so there's no geometry at all. The tool is correctly reporting "nothing here."
+- `gpt2`: neighbours are messy but meaningful (casing variants, plurals, semantic kin).
+""")
+        nb_word = gr.Textbox(value="paris", label="word")
+        nb_k = gr.Slider(5, 25, value=10, step=1, label="top neighbours")
+        nb_out = gr.Textbox(label="output", lines=15)
+        gr.Button("Run").click(neighbors, [nb_word, nb_k], nb_out)
+    # ---- TAB 3 -------------------------------------------------------------
+    with gr.Tab("3 · Re-route (steering)"):
+        gr.Markdown("""
+### Steering: bend behaviour with a direction, no retraining
+
+**What it does:** builds the vector `emb(target) − emb(source)` and *adds* it to
+a layer's output during generation. The model drifts from `source` toward
+`target`. This is the cheap cousin of fine-tuning (ActAdd / representation
+engineering).
+
+**How to read it:** compare *baseline* vs *steered*. Raise **strength** until the
+output flips; too high and it turns to noise (you've knocked the residual off
+the manifold).
+
+**Tips:** on `gpt2` try `from: Paris  to: London` on the France prompt, layer
+0–4, strength 6–14. On `glassbox`/`handmade` the vocab is tiny, so steering is
+mostly a mechanics demo there; the real lesson lives on `gpt2`.
+""")
+        st_prompt = gr.Textbox(value="the capital of france is", label="prompt")
+        with gr.Row():
+            st_src = gr.Textbox(value="Paris", label="from")
+            st_tgt = gr.Textbox(value="London", label="to")
+        with gr.Row():
+            # 0-11 covers all 12 layers of gpt2.
+            st_layer = gr.Slider(0, 11, value=2, step=1, label="layer")
+            st_alpha = gr.Slider(0, 30, value=10, step=0.5, label="strength")
+            st_max = gr.Slider(8, 80, value=40, step=1, label="max new tokens")
+        st_base = gr.Textbox(label="baseline", lines=2)
+        st_out = gr.Textbox(label="steered", lines=3)
+        gr.Button("Run").click(steer_generate,
+                               [st_prompt, st_src, st_tgt, st_layer, st_alpha, st_max],
+                               [st_base, st_out])
+    # ---- TAB 4 -------------------------------------------------------------
+    with gr.Tab("4 · Diff (align by depth)"):
+        gr.Markdown("""
+### Diff: two models on one prompt, aligned by *relative* depth
+
+**What it does:** runs the logit lens on model A and model B and lines their
+layers up by percentage depth (0–100%), so you can compare a 2-layer toy with a
+12-layer gpt2 side by side. `dp` is `p_B − p_A` for the target token.
+
+**How to read it:** look at *where* on the depth axis each model commits to the
+target. A localised model commits at one depth; a distributed one ramps up.
+
+**Try:** A = `gpt2`, B = `glassbox`, target = `paris`. You'll see gpt2 ramp
+through the middle while glassbox snaps on at its fact layer: the same fact,
+two very different internal shapes.
+""")
+        with gr.Row():
+            df_a = gr.Textbox(value="gpt2", label="model A")
+            df_b = gr.Textbox(value="glassbox", label="model B")
+        df_prompt = gr.Textbox(value="the capital of france is", label="prompt")
+        df_target = gr.Textbox(value="paris", label="target token")
+        df_k = gr.Slider(1, 5, value=1, step=1, label="top-k (display)")
+        df_out = gr.Textbox(label="output", lines=16)
+        gr.Button("Run").click(diff_models,
+                               [df_a, df_b, df_prompt, df_target, df_k], df_out)
+    # ---- TAB 5 -------------------------------------------------------------
+    with gr.Tab("5 · Causal trace (ROME)"):
+        gr.Markdown("""
+### Causal trace: corrupt the subject, restore each layer, find the site
+
+**What it does:** activation patching (the causal tracing from Meng et al.'s
+ROME paper). It adds noise to the **subject** token's embedding, which breaks
+the prediction, then restores the clean activations one layer at a time and
+measures how much of the answer comes back. The layer that restores the most
+is where the fact is *causally* computed.
+
+**How to read it:** `recovery` ≈ 100% means "restoring this layer is enough" →
+the fact is read here. The peak line names the site.
+
+**Ground truth to check:**
+
+- `glassbox`: peak at **L0** (≈100%). The fact is read at the early subject site, because the L1 "attention" re-reads the restored subject. You know this is right because you wrote the mechanism.
+- `handmade`: `clean p` ≈ `corrupt p`, so recovery is meaningless. **Expected:** the fact is a string match, untouched by activation noise. This is the headline lesson: patching can't see lookup behaviour.
+- `gpt2`: a *band* of early-to-middle layers at the subject token lights up, consistent with the pattern reported in the ROME paper.
+""")
+        ct_prompt = gr.Textbox(value="the capital of france is", label="prompt")
+        ct_subject = gr.Textbox(value="france", label="subject to corrupt")
+        ct_target = gr.Textbox(value="paris", label="target token")
+        with gr.Row():
+            ct_noise = gr.Slider(0, 10, value=3, step=0.5, label="noise (× embed std)")
+            ct_seed = gr.Slider(0, 100, value=0, step=1, label="seed")
+        ct_out = gr.Textbox(label="output", lines=18)
+        gr.Button("Run").click(causal_trace,
+                               [ct_prompt, ct_subject, ct_target, ct_noise, ct_seed], ct_out)
+    gr.Markdown("""
+---
+### Where this goes next
+- **Edit loop (the VINDEX bridge):** trace → pick the layer → apply a ROME/MEMIT rank-1 edit to that MLP → re-run the logit lens to confirm the new fact took *and* nothing else moved. The glass-box is the unit test for that pipeline before you trust it on a real model.
+- **More glass-box facts / multi-hop:** add `"the currency of france is"` to force a second relation through the same subject, and watch the trace separate the two sites.
+- **Attention + MLP key-value inspection:** Geva-style "what does this neuron write to the vocab" and per-head attribution.
+- **Package as an HF Space** with this writeup as the README — it's a clean teaching artifact and a regression harness for interpretability code.
+""")
+    # Preload the default model on page load so the tabs work before "Load" is clicked.
+    demo.load(lambda: load_model("glassbox"), outputs=load_status)
+if __name__ == "__main__":
+    demo.launch()
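
The Tab 1 description (read each layer's last-token residual through the unembedding, then track one token's probability and the entropy in bits) can be sketched in isolation. What follows is a minimal NumPy illustration, not the `logit_lens` function from this file: `toy_logit_lens`, the four-token vocabulary, the one-hot unembedding, and the hand-written residuals are all assumptions made for the example. The final layer norm that a real model such as gpt2 applies before unembedding is left out for brevity.

```python
import numpy as np


def toy_logit_lens(hidden_states, unembed, vocab, track, k=3):
    """Read each layer's last-token residual through the unembedding.

    hidden_states: (n_layers, d) array, one residual vector per layer.
    unembed:       (len(vocab), d) array, one row per vocabulary token.
    Returns one tuple per layer: (layer, top-k tokens, p(track), entropy in bits).
    """
    tracked = vocab.index(track)
    rows = []
    for layer, h in enumerate(hidden_states):
        logits = unembed @ h
        logits = logits - logits.max()            # stabilise the softmax
        p = np.exp(logits) / np.exp(logits).sum()
        entropy_bits = float(-(p * np.log2(p + 1e-12)).sum())
        top = [vocab[i] for i in np.argsort(-p)[:k]]
        rows.append((layer, top, float(p[tracked]), entropy_bits))
    return rows


# Hypothetical vocabulary with one-hot unembedding rows (illustration only).
vocab = ["the", "france", "paris", "london"]
unembed = np.eye(4)

# Hypothetical residuals: nothing written for two layers, then a single
# "fact write" that adds a large component along the `paris` direction.
hidden_states = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 8.0, 0.0],
])

for layer, top, p_track, h_bits in toy_logit_lens(
    hidden_states, unembed, vocab, "paris", k=2
):
    print(f"L{layer}  top={top}  p(paris)={p_track:.3f}  entropy={h_bits:.2f} bits")
```

On these toy residuals the tracked probability sits at chance for the first two rows and jumps once the write lands, which mirrors the sharp `glassbox` pattern. A distributed model would show a gradual climb instead.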