Compression_Navigator / app.py (Chris4K, commit cbcdc9d "Update app.py", verified 8 days ago)


	# =============================================================================
	# COMPRESSION NAVIGATOR · extended + annotated edition
	# =============================================================================
	# An LLM is a lossy codec for text. Training compresses a corpus into weights;
	# a forward pass decompresses a continuation. These five tools let you watch
	# that decompression happen and poke at where facts physically live.
	#
	# The five tabs are not toys invented here - each one is a real mechanistic-
	# interpretability technique you'll find in papers:
	#
	# 1. Decompress = LOGIT LENS (nostalgebraist, 2020)
	# 2. Triangulate = EMBEDDING NEIGHBOURS (the geometry of the vocab)
	# 3. Re-route = ACTIVATION STEERING (ActAdd / repr. engineering)
	# 4. Diff = CROSS-MODEL ALIGNMENT (compare checkpoints by depth)
	# 5. Causal trace = ACTIVATION PATCHING (ROME, Meng et al., 2022)
	#
	# WHY THE GLASS-BOX MODELS MATTER
	# -------------------------------
	# On a real model (gpt2) you never know the ground truth, so you can't tell
	# whether a tool is correct or just producing plausible-looking output.
	# This file ships two models whose internals you fully specify, so you can
	# check each tool against a known answer:
	#
	# "handmade" - facts stored as a LOOKUP TABLE keyed on the prompt string.
	# The computation happens in a side channel (string match),
	# NOT in the residual stream. Lesson: such a model is almost
	# invisible to residual-stream interpretability. Logit lens
	# sees a sudden jump with no build-up; causal tracing finds
	# nothing, because corrupting activations doesn't touch the
	# string match. This is a real and underappreciated limit
	# of these methods.
	#
	# "glassbox" - facts stored the way real transformers store them: as
	# key->value writes into the RESIDUAL STREAM (Geva et al.'s
	# "MLPs are key-value memories", which is exactly what ROME
	# edits). Because the fact flows through activations, ALL five
	# tools light up correctly - and you can verify they report
	# the layer you actually put the fact in. This is a unit-test
	# harness for interpretability code.
	#
	# Run order suggestion: glassbox -> handmade -> gpt2
	# glassbox shows what "correct" looks like; handmade shows a failure mode;
	# gpt2 shows the fuzzy, distributed real thing.
	# =============================================================================

	import math
	import torch
	import torch.nn as nn
	import torch.nn.functional as F
	import gradio as gr
	from transformers import AutoModelForCausalLM, AutoTokenizer

	DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
	DTYPE = torch.float32
	MODELS = {} # name -> (model, tokenizer) cache
	STATE = {"name": None} # currently loaded model name


	# =============================================================================
	# A tiny shared tokenizer for both glass-box models.
	# Case is CANONICALISED to lowercase everywhere (this fixes a real bug in the
	# original: "Paris" from a pinned fact and "paris" from the Markov table became
	# two different vocab entries, so the boosted token and the tracked token
	# silently diverged - every neighbour read cos=0.000 and every tracked prob 0).
	# =============================================================================
	class FakeBatchEncoding(dict):
	def to(self, device): # let callers do tok(...).to(DEVICE) safely
	return self


	class SimpleTok:
	"""Whitespace tokenizer over a fixed vocab. Not 'fast' (no offset map)."""
	is_fast = False

	def __init__(self, stoi, itos):
	self.stoi, self.itos = stoi, itos
	self.eos_token_id = stoi["."] # period doubles as end-of-sequence

	def _ids(self, text):
	words = text.lower().replace(".", " .").split()
	return [self.stoi.get(w, self.stoi["<s>"]) for w in words]

	def __call__(self, text, return_tensors=None, return_offsets_mapping=False):
	ids = self._ids(text) or [self.stoi["<s>"]]
	return FakeBatchEncoding(
	input_ids=torch.tensor([ids]),
	attention_mask=torch.ones(1, len(ids), dtype=torch.long),
	)

	def encode(self, text, add_special_tokens=False):
	return self._ids(text)

	def decode(self, ids, skip_special_tokens=False):
	out = []
	for i in ids:
	w = self.itos.get(int(i), "?")
	if skip_special_tokens and w in ("<pad>", "<s>"):
	continue
	out.append(w)
	return " ".join(out)
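
	# Sketch (not part of the app): the tokenization rule above, reduced to a
	# standalone function. The mini-vocab `stoi` and the name `to_ids` are
	# hypothetical; the real vocab is built from PINNED / MARKOV or GB_TOKS.

```python
# Hypothetical mini-vocab, for illustration only.
stoi = {"<pad>": 0, "<s>": 1, ".": 2, "the": 3, "capital": 4, "paris": 5}


def to_ids(text):
    # Lowercase everything, split the period off as its own token,
    # and map any out-of-vocab word to <s> (the same fallback SimpleTok uses).
    words = text.lower().replace(".", " .").split()
    return [stoi.get(w, stoi["<s>"]) for w in words]


ids_upper = to_ids("The capital Paris.")
ids_lower = to_ids("the capital paris .")
ids_unknown = to_ids("the zebra")
print(ids_upper, ids_lower, ids_unknown)
```

	# Both spellings map to the same ids, which is the property the lowercase
	# canonicalisation is meant to guarantee.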


	class _Out:
	"""Mimics a HF CausalLMOutput: .logits and (optional) .hidden_states."""
	def __init__(self, logits, hidden_states):
	self.logits = logits
	self.hidden_states = hidden_states


	def _greedy_generate(model, input_ids, max_new_tokens=20, pad_token_id=None, **_):
	"""Minimal greedy decode so the steering tab works on the toy models too
	(the originals had no .generate, so that tab crashed on 'handmade')."""
	ids = input_ids
	for _ in range(int(max_new_tokens)):
	nxt = model(input_ids=ids).logits[0, -1].argmax().view(1, 1)
	ids = torch.cat([ids, nxt], dim=1)
	if pad_token_id is not None and int(nxt.item()) == int(pad_token_id):
	break
	return ids


	# =============================================================================
	# MODEL 1 - "handmade": facts as a LOOKUP TABLE (the side-channel glass box)
	# -----------------------------------------------------------------------------
	# Embeddings are the identity matrix (each token is its own one-hot). The two
	# "layers" don't read the residual stream in a meaningful linear way:
	# - MemoryBlock matches the decoded prompt string and boosts the answer.
	# - MarkovBlock adds a hand-built bigram transition for the last token.
	# Because MemoryBlock keys on the prompt TEXT, not on activations, this is a
	# deliberate demonstration of a model that residual-stream interpretability
	# cannot see. Use it as the "what failure looks like" control.
	# =============================================================================
	PINNED = { # answers are lowercase now (bug fix)
	"the capital of france is": " paris",
	"the eiffel tower is in": " paris",
	"two plus two equals": " four",
	}
	MARKOV = {
	"<s>": {"the": 3, "i": 2, "a": 1},
	"the": {"city": 2, "tower": 2, "answer": 1},
	"i": {"think": 2, "am": 1},
	"a": {"model": 2, "city": 1},
	"city": {"of": 3, "is": 1},
	"of": {"light": 2, "paris": 1},
	"tower": {"is": 3},
	"is": {"in": 2, "a": 1},
	"in": {"paris": 2, "france": 1},
	"model": {"is": 2},
	"think": {"the": 2},
	"paris": {".": 1},
	"france": {".": 1},
	"light": {".": 1},
	"four": {".": 1},
	}


	def _build_handmade_vocab():
	toks, seen = ["<pad>", "<s>", "."], {"<pad>", "<s>", "."}
	def add(w):
	if w not in seen:
	toks.append(w); seen.add(w)
	for v in PINNED.values():
	add(v.strip())
	for w, nxts in MARKOV.items():
	add(w)
	for x in nxts:
	add(x)
	for k in PINNED:
	for w in k.split():
	add(w)
	return toks


	HM_VOCAB = _build_handmade_vocab()
	HM_STOI = {w: i for i, w in enumerate(HM_VOCAB)}
	HM_ITOS = {i: w for w, i in HM_STOI.items()}
	HM_V = len(HM_VOCAB)


	class _MemoryBlock(nn.Module):
	"""If the decoded prompt ends with a pinned key, slam the answer logit.
	NOTE: this reads prompt_ids (the string), not x - that's the whole point."""
	def forward(self, x, prompt_ids=None):
	out = x.clone()
	if prompt_ids is not None:
	text = " ".join(HM_ITOS.get(int(i), "") for i in prompt_ids).strip()
	for key, ans in PINNED.items():
	if text.endswith(key):
	out[0, -1, HM_STOI[ans.strip()]] += 12.0
	return (out,)


	class _MarkovBlock(nn.Module):
	"""Add a hand-built bigram transition row for the last token."""
	def __init__(self):
	super().__init__()
	T = torch.zeros(HM_V, HM_V)
	for w, nxts in MARKOV.items():
	if w in HM_STOI:
	tot = sum(nxts.values())
	for x, wt in nxts.items():
	if x in HM_STOI:
	T[HM_STOI[w], HM_STOI[x]] = wt / tot
	self.register_buffer("T", T)

	def forward(self, x, prompt_ids=None):
	out = x.clone()
	if prompt_ids:
	out[0, -1] += 4.0 * self.T[int(prompt_ids[-1])]
	return (out,)
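
	# Sketch (not part of the app): how one row of the bigram table turns into
	# a logit boost. `MARKOV_DEMO` and `transition_row` are hypothetical names;
	# the 4.0 scale is the one _MarkovBlock uses above.

```python
# A two-entry subset of the MARKOV table, copied for illustration.
MARKOV_DEMO = {"is": {"in": 2, "a": 1}, "in": {"paris": 2, "france": 1}}


def transition_row(counts):
    # Normalise raw bigram counts so the row sums to 1.
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()}


row_is = transition_row(MARKOV_DEMO["is"])
# _MarkovBlock adds 4.0 * row to the last position's residual.
boost = {w: 4.0 * p for w, p in row_is.items()}
print(row_is, boost)
```

	# Since each row sums to 1, the Markov boost is at most 4.0 per token,
	# well below the 12.0 the memory block adds for a pinned answer.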


	class _HMTransformer(nn.Module):
	def __init__(self):
	super().__init__()
	self.wte = nn.Embedding(HM_V, HM_V)
	with torch.no_grad():
	self.wte.weight.copy_(torch.eye(HM_V)) # one-hot embeddings
	self.h = nn.ModuleList([_MemoryBlock(), _MarkovBlock()])
	self.ln_f = nn.Identity()


	class HandmadeModel(nn.Module):
	def __init__(self):
	super().__init__()
	self.transformer = _HMTransformer()
	self.head = nn.Linear(HM_V, HM_V, bias=False)
	with torch.no_grad():
	self.head.weight.copy_(torch.eye(HM_V)) # identity unembed
	self.tok = SimpleTok(HM_STOI, HM_ITOS)

	def get_input_embeddings(self): return self.transformer.wte
	def get_output_embeddings(self): return self.head
	def generate(self, input_ids=None, attention_mask=None, **kw):
	return _greedy_generate(self, input_ids, **kw)

	def forward(self, input_ids=None, attention_mask=None, output_hidden_states=False):
	ids = input_ids[0].tolist()
	x = self.transformer.wte(input_ids).float()
	hs = [x]; h = x
	for blk in self.transformer.h:
	(h,) = blk(h, prompt_ids=ids); hs.append(h)
	logits = self.head(self.transformer.ln_f(h))
	return _Out(logits, tuple(hs) if output_hidden_states else None)


	# =============================================================================
	# MODEL 2 - "glassbox": facts as RESIDUAL-STREAM key->value writes
	# -----------------------------------------------------------------------------
	# This is the model the original was missing. It stores facts the way real
	# transformers do, so every tool works AND can be checked against ground truth.
	#
	# Vocab + structured embeddings (d=32). Country and its capital deliberately
	# SHARE an embedding dimension, so the neighbours tool finds real geometry
	# (paris is near france).
	#
	# Four layers:
	# L0 subject site : (identity here) the residual the trace will restore
	# L1 pool/attention : copies subject signal from earlier positions -> last
	# L2 fact MLP : key(subject+relation) -> relu -> value(answer dir) <- ROME edits this kind of layer
	# L3 cleanup : identity
	#
	# Ground truth you can verify:
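	# - logit lens: the answer is INVISIBLE until the fact MLP (block L2) has
	# run, then appears. The lens table counts hidden states from "embed", so
	# that block's output is the row labelled L3. Compare with handmade
	# (sudden, no build-up) and gpt2 (fuzzy, spread over many layers).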
	# - causal trace: corrupting the subject and restoring layer by layer peaks
	# at L0 - because L1's "attention" re-reads the restored subject. That is
	# the ROME story: the causal site is an early layer at the SUBJECT token.
	# - steering / neighbours: both operate on real directions, so both work.
	# =============================================================================
	GB_D = 32
	GB_TOKS = ["<pad>", "<s>", ".", "the", "capital", "of", "is", "in",
	"france", "germany", "japan", "paris", "berlin", "tokyo",
	"london", "rome"] # spare answers so edits can hit a fresh target
	GB_STOI = {w: i for i, w in enumerate(GB_TOKS)}
	GB_ITOS = {i: w for w, i in GB_STOI.items()}
	GB_V = len(GB_TOKS)
	GB_FACTS = [("france", "paris"), ("germany", "berlin"), ("japan", "tokyo")]


	def _build_gb_embeddings():
	E = torch.zeros(GB_V, GB_D)
	def setd(tok, pairs):
	for d, v in pairs:
	E[GB_STOI[tok], d] = v
	# country/capital pairs share their first dim -> positive cosine (geometry!)
	setd("france", [(0, 1.0), (1, 0.6), (20, 0.5)])
	setd("paris", [(0, 0.8), (2, 0.9), (21, 0.5)])
	setd("germany",[(3, 1.0), (4, 0.6), (22, 0.5)])
	setd("berlin", [(3, 0.8), (5, 0.9), (23, 0.5)])
	setd("japan", [(6, 1.0), (7, 0.6), (24, 0.5)])
	setd("tokyo", [(6, 0.8), (8, 0.9), (25, 0.5)])
	setd("london", [(27, 1.0), (28, 0.5)]) # spare answers (own dirs)
	setd("rome", [(29, 1.0), (30, 0.5)])
	setd("is", [(9, 1.0), (26, 0.4)]) # the relation marker
	for i, t in enumerate(GB_TOKS): # give fillers an id
	if E[i].abs().sum() == 0:
	E[i, 10 + i % 6] = 1.0
	return E / (E.norm(dim=-1, keepdim=True) + 1e-9) # unit rows


	GB_E = _build_gb_embeddings()
	GB_SUBJ = torch.zeros(GB_D, GB_D) # projector onto subject dims 0..8
	for _d in range(9):
	GB_SUBJ[_d, _d] = 1.0
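
	# Sketch (not part of the app): why a shared first dimension gives a
	# country and its capital a positive cosine. Pure-Python re-creation of
	# three rows of the embedding table; `unit` and `cos` are hypothetical
	# helper names.

```python
import math


def unit(sparse, dim=32):
    # Build a dense vector from (dimension, value) pairs and scale it to length 1.
    v = [0.0] * dim
    for d, x in sparse:
        v[d] = x
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cos(a, b):
    # Inputs are unit vectors, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


france = unit([(0, 1.0), (1, 0.6), (20, 0.5)])
paris = unit([(0, 0.8), (2, 0.9), (21, 0.5)])
berlin = unit([(3, 0.8), (5, 0.9), (23, 0.5)])

cos_fp = cos(france, paris)    # share dim 0 -> positive
cos_fb = cos(france, berlin)   # no shared dim -> exactly zero
print(round(cos_fp, 3), cos_fb)
```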


	class _GBIdent(nn.Module):
	def forward(self, x, prompt_ids=None):
	return (x.clone(),)


	class _GBPool(nn.Module):
	"""Toy 'attention': sum the subject-projected residual of all earlier
	positions into the last position. Corrupting the subject earlier shows up
	here; restoring the subject BEFORE this layer is what makes the trace
	recover - that is why the causal peak lands at L0, not L1."""
	def forward(self, x, prompt_ids=None):
	out = x.clone()
	if x.shape[1] > 1:
	pooled = (x[0, :-1] @ GB_SUBJ.T).sum(0)
	out[0, -1] = out[0, -1] + 0.9 * pooled
	return (out,)


	class _GBFactMLP(nn.Module):
	"""Geva-style key->value memory. W_in rows are (subject+relation) keys;
	relu gates which fact fires; W_out columns are answer unembed directions.
	This is structurally the exact layer ROME rewrites to edit a fact."""
	def __init__(self):
	super().__init__()
	Win = torch.zeros(len(GB_FACTS), GB_D)
	Wout = torch.zeros(GB_D, len(GB_FACTS))
	rel = GB_E[GB_STOI["is"]]
	for k, (s, a) in enumerate(GB_FACTS):
	key = (GB_E[GB_STOI[s]] @ GB_SUBJ.T) * 0.9 + rel
	Win[k] = key / key.norm()
	Wout[:, k] = GB_E[GB_STOI[a]] # write answer direction
	self.register_buffer("Win", Win)
	self.register_buffer("Wout", Wout)
	self.register_buffer("Win0", Win.clone()) # pristine backups for reset
	self.register_buffer("Wout0", Wout.clone())
	self.bias, self.gain = 0.85, 6.0 # tuned: clean p~0.5, corrupt p~0.07

	def forward(self, x, prompt_ids=None):
	out = x.clone()
	pre = F.relu(self.Win @ out[0, -1] - self.bias)
	out[0, -1] = out[0, -1] + self.gain * (self.Wout @ pre)
	return (out,)
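
	# Sketch (not part of the app): the key->value memory in three dimensions.
	# The keys, values and the name `fact_mlp` are hypothetical; bias and gain
	# default to the values _GBFactMLP uses above.

```python
def relu(z):
    return max(0.0, z)


def fact_mlp(x, keys, values, bias=0.85, gain=6.0):
    # One activation per fact: how strongly the residual matches that fact's key.
    acts = [relu(sum(k_i * x_i for k_i, x_i in zip(k, x)) - bias) for k in keys]
    out = list(x)
    # Each firing fact writes gain * activation * value into the residual.
    for a, v in zip(acts, values):
        for d in range(len(out)):
            out[d] += gain * a * v[d]
    return out, acts


keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # two toy facts
values = [[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]   # what each fact writes

out_hit, acts_hit = fact_mlp([1.0, 0.0, 0.0], keys, values)    # matches key 0
out_miss, acts_miss = fact_mlp([0.5, 0.5, 0.0], keys, values)  # below the bias
print(acts_hit, out_hit, acts_miss, out_miss)
```

	# The bias acts as a threshold: a residual that only half-matches a key
	# leaves the stream untouched, so nothing is written for a weak subject.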


	class _GBTransformer(nn.Module):
	def __init__(self):
	super().__init__()
	self.wte = nn.Embedding(GB_V, GB_D)
	with torch.no_grad():
	self.wte.weight.copy_(GB_E)
	self.h = nn.ModuleList([_GBIdent(), _GBPool(), _GBFactMLP(), _GBIdent()])
	self.ln_f = nn.Identity()


	class GlassBoxModel(nn.Module):
	def __init__(self):
	super().__init__()
	self.transformer = _GBTransformer()
	self.head = nn.Linear(GB_D, GB_V, bias=False)
	with torch.no_grad():
	self.head.weight.copy_(GB_E) # tied unembed
	self.tok = SimpleTok(GB_STOI, GB_ITOS)

	# --- knowledge editing (ROME-style, exact on this key->value layer) -------
	@torch.no_grad()
	def edit_fact(self, subject, new_answer, method="rank1", strength=1.0):
	"""Rewrite the value a fact-MLP key maps to. Methods:
	rank1 / surgical - the minimal update: change only this fact's value.
	broadcast - DELIBERATELY sloppy: smear the delta across ALL
	facts, so the verifier has real collateral to catch.
	"""
	fm = self.transformer.h[2] # the FactMLP block
	subjects = [s for s, _ in GB_FACTS]
	if subject not in subjects:
	raise ValueError("unknown subject %r" % subject)
	if new_answer not in GB_STOI:
	raise ValueError("unknown answer token %r" % new_answer)
	k = subjects.index(subject)
	delta = (GB_E[GB_STOI[new_answer]] - fm.Wout0[:, k]) * float(strength)
	if method in ("rank1", "surgical"):
	fm.Wout[:, k] = fm.Wout0[:, k] + delta
	elif method == "broadcast":
	fm.Wout += delta.unsqueeze(1) # hits every fact
	else:
	raise ValueError("unknown method %r" % method)

	@torch.no_grad()
	def reset(self):
	fm = self.transformer.h[2]
	fm.Win.copy_(fm.Win0); fm.Wout.copy_(fm.Wout0)

	def get_input_embeddings(self): return self.transformer.wte
	def get_output_embeddings(self): return self.head
	def generate(self, input_ids=None, attention_mask=None, **kw):
	return _greedy_generate(self, input_ids, **kw)

	def forward(self, input_ids=None, attention_mask=None, output_hidden_states=False):
	ids = input_ids[0].tolist()
	x = self.transformer.wte(input_ids).float()
	hs = [x]; h = x
	for blk in self.transformer.h:
	(h,) = blk(h, prompt_ids=ids); hs.append(h)
	logits = self.head(self.transformer.ln_f(h))
	return _Out(logits, tuple(hs) if output_hidden_states else None)
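
	# Sketch (not part of the app): the difference between the "rank1" and
	# "broadcast" edit methods, on plain lists. `edit_values` is a hypothetical
	# stand-in for edit_fact; values[k] plays the role of column k of Wout.

```python
def edit_values(values, k, new_value, method="rank1"):
    # delta moves fact k's stored value onto the new answer direction.
    delta = [n - o for n, o in zip(new_value, values[k])]
    if method == "rank1":
        edited = [list(v) for v in values]
        edited[k] = [o + d for o, d in zip(values[k], delta)]  # only fact k moves
    elif method == "broadcast":
        edited = [[o + d for o, d in zip(v, delta)] for v in values]  # every fact moves
    else:
        raise ValueError("unknown method %r" % method)
    return edited


values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rank1 = edit_values(values, 0, [0.0, 0.0], "rank1")
broadcast = edit_values(values, 0, [0.0, 0.0], "broadcast")
print(rank1)
print(broadcast)
```

	# Both methods hit the target, but only rank1 leaves the other rows as
	# they were; that difference is what a specificity check can detect.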


	# =============================================================================
	# REAL MODELS - resolve the architecture-specific module paths
	# =============================================================================
	def _resolve(model, paths):
	for path in paths:
	obj, ok = model, True
	for part in path.split("."):
	if hasattr(obj, part):
	obj = getattr(obj, part)
	else:
	ok = False; break
	if ok:
	return obj
	return None
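
	# Sketch (not part of the app): the dotted-path lookup on stand-in objects.
	# `resolve` mirrors _resolve; the SimpleNamespace "models" are hypothetical
	# placeholders for real architectures.

```python
from types import SimpleNamespace


def resolve(root, paths):
    # Return the first attribute path that exists on root, else None.
    for path in paths:
        obj, ok = root, True
        for part in path.split("."):
            if hasattr(obj, part):
                obj = getattr(obj, part)
            else:
                ok = False
                break
        if ok:
            return obj
    return None


gpt2_like = SimpleNamespace(transformer=SimpleNamespace(h=["block0", "block1"]))
llama_like = SimpleNamespace(model=SimpleNamespace(layers=["layer0"]))
paths = ["transformer.h", "model.layers"]

print(resolve(gpt2_like, paths), resolve(llama_like, paths))
```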


	def get_blocks(model):
	blocks = _resolve(model, ["transformer.h", "model.layers",
	"gpt_neox.layers", "model.decoder.layers"])
	if blocks is None:
	raise RuntimeError("Could not locate transformer blocks.")
	return blocks


	def get_final_norm(model):
	norm = _resolve(model, ["transformer.ln_f", "model.norm",
	"gpt_neox.final_layer_norm",
	"model.decoder.final_layer_norm"])
	return norm if norm is not None else (lambda x: x)


	def get_head(model):
	return model.get_output_embeddings()


	def get_handles(name):
	if name not in MODELS:
	if name == "handmade":
	m = HandmadeModel().eval(); MODELS[name] = (m, m.tok)
	elif name == "glassbox":
	m = GlassBoxModel().eval(); MODELS[name] = (m, m.tok)
	else:
	tok = AutoTokenizer.from_pretrained(name)
	model = AutoModelForCausalLM.from_pretrained(
	name, torch_dtype=DTYPE).to(DEVICE).eval()
	MODELS[name] = (model, tok)
	return MODELS[name]


	def load_model(name):
	name = name.strip()
	model, _ = get_handles(name)
	STATE["name"] = name
	return "Loaded %s (%d layers)." % (name, len(get_blocks(model)))


	# =============================================================================
	# Shared readout: project every layer's last-token residual to a vocab dist.
	# =============================================================================
	@torch.no_grad()
	def layer_distributions(model, tok, prompt):
	inputs = tok(prompt, return_tensors="pt").to(DEVICE)
	out = model(**inputs, output_hidden_states=True)
	hs = out.hidden_states
	norm, head, n = get_final_norm(model), get_head(model), len(out.hidden_states)
	dists = []
	for i, layer_hs in enumerate(hs):
	vec = layer_hs[0, -1].to(DTYPE)
	# HF convention: the LAST hidden_states entry is already post-ln_f,
	# so skip norm there; apply ln_f to intermediates (logit-lens style).
	logits = head(vec) if i == n - 1 else head(norm(vec))
	dists.append(("embed" if i == 0 else "L%d" % i, F.softmax(logits, dim=-1)))
	return dists


	def _entropy_bits(probs):
	p = probs.clamp_min(1e-12)
	return float(-(p * p.log()).sum() / math.log(2))
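
	# Sketch (not part of the app): what the entropy column measures, without
	# torch. `entropy_bits` here is a hypothetical pure-Python equivalent.

```python
import math


def entropy_bits(probs):
    # Shannon entropy in bits; zero-probability entries contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)


uniform4 = entropy_bits([0.25, 0.25, 0.25, 0.25])  # maximally unsure over 4 tokens
peaked = entropy_bits([1.0, 0.0, 0.0, 0.0])        # fully committed to one token
print(uniform4, peaked)
```

	# A drop in this number from one layer to the next means the distribution
	# is committing to fewer candidates.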


	# =============================================================================
	# TAB 1 - LOGIT LENS: watch the answer condense out of the residual stream
	# =============================================================================
	@torch.no_grad()
	def logit_lens(prompt, top_k, track):
	    if STATE["name"] is None:
	        return "Load a model first."
	    model, tok = get_handles(STATE["name"])
	    top_k = int(top_k)
	    tids = tok.encode(track, add_special_tokens=False) if track.strip() else []
	    tid = tids[0] if tids else None
	    dists = layer_distributions(model, tok, prompt)
	    header = "layer | top tokens (prob) | entropy" \
	        + (" | p(%r)" % track if tid is not None else "")
	    lines = ["prompt: %r" % prompt, header, "-" * len(header)]
	    for label, probs in dists:
	        p, idx = probs.topk(top_k)
	        shown = " ".join("%r:%.2f" % (tok.decode([t]).replace("\n", "\\n"), v)
	                         for t, v in zip(idx.tolist(), p.tolist()))
	        row = "%5s | %-40s | %4.1fb" % (label, shown, _entropy_bits(probs))
	        if tid is not None:
	            row += " | %.3f" % probs[tid].item()
	        lines.append(row)
	    return "\n".join(lines)
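
	# Sketch (not part of the app): the logit-lens idea on a two-token vocab.
	# The hidden states and the unembed matrix are made-up numbers, chosen so
	# the "fact" is written at the last state; `lens` is a hypothetical name.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def lens(hidden_states, unembed):
    # Project every intermediate residual through the SAME unembedding,
    # as if the model had to answer at that depth.
    return [softmax([sum(w * h for w, h in zip(row, hs)) for row in unembed])
            for hs in hidden_states]


unembed = [[1.0, 0.0], [0.0, 1.0]]                    # rows: "is", "paris"
hidden_states = [[1.0, 0.0], [1.0, 0.0], [1.0, 4.0]]  # answer written at the last state
p_paris = [dist[1] for dist in lens(hidden_states, unembed)]
print([round(p, 3) for p in p_paris])
```

	# The tracked probability stays flat until the state where the answer
	# direction is added, then jumps; that step is what the table makes visible.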


	# =============================================================================
	# TAB 2 - NEIGHBOURS: the geometry of the (un)embedding space
	# =============================================================================
	@torch.no_grad()
	def neighbors(word, top_k):
	if STATE["name"] is None:
	return "Load a model first."
	model, tok = get_handles(STATE["name"])
	top_k = int(top_k)
	ids = tok.encode(word, add_special_tokens=False)
	if not ids:
	return "Could not tokenize %r." % word
	tid = ids[0]
	W = F.normalize(get_head(model).weight.to(DTYPE), dim=-1)
	sims = W @ W[tid]
	vals, idx = sims.topk(top_k + 1)
	note = ""
	if STATE["name"] == "handmade":
	note = ("(handmade uses one-hot embeddings, so every token is "
	"orthogonal -> all cosines are 0 by construction. This is the "
	"tool telling the truth about a model with no vocab geometry.)\n")
	lines = [note + "neighbours of %r:" % word]
	for v, j in zip(vals.tolist(), idx.tolist()):
	if j != tid:
	lines.append(" %14r cos=%.3f" % (tok.decode([j]), v))
	return "\n".join(lines[: top_k + 1])


	# =============================================================================
	# TAB 3 - STEERING: bend behaviour by adding a direction, no retraining
	# =============================================================================
	def _make_steer_hook(direction, alpha):
	d = direction * alpha
	def hook(module, inp, out):
	if isinstance(out, tuple):
	return (out[0] + d.to(out[0].dtype).to(out[0].device),) + out[1:]
	return out + d.to(out.dtype).to(out.device)
	return hook


	@torch.no_grad()
	def steer_generate(prompt, source, target, layer, alpha, max_new):
	    if STATE["name"] is None:
	        return "Load a model first.", ""
	    model, tok = get_handles(STATE["name"])
	    layer, max_new = int(layer), int(max_new)
	    emb = model.get_input_embeddings().weight
	    def first_emb(w):
	        ids = tok.encode(w, add_special_tokens=False)
	        # zeros_like keeps the fallback on the same device/dtype as the
	        # embeddings (the toy models are never moved to DEVICE).
	        return emb[ids[0]] if ids else torch.zeros_like(emb[0])
	    direction = F.normalize((first_emb(target) - first_emb(source)).to(DTYPE), dim=-1)
	    inputs = tok(prompt, return_tensors="pt").to(DEVICE)
	    gk = dict(max_new_tokens=max_new, do_sample=False, pad_token_id=tok.eos_token_id)
	    # Unpack both dicts: generate() expects input_ids / attention_mask and
	    # the decoding options as keyword arguments.
	    base = tok.decode(model.generate(**inputs, **gk)[0], skip_special_tokens=True)
	    blocks = get_blocks(model)
	    layer = max(0, min(layer, len(blocks) - 1))
	    handle = blocks[layer].register_forward_hook(_make_steer_hook(direction, alpha))
	    try:
	        steered = tok.decode(model.generate(**inputs, **gk)[0], skip_special_tokens=True)
	    finally:
	        handle.remove()
	    return base, "steer %r -> %r @ L%d alpha=%s\n%s" % (source, target, layer, alpha, steered)
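
	# Sketch (not part of the app): the arithmetic of activation steering on
	# plain lists. The 2-d embeddings are made up; `steer` is a hypothetical
	# name for what the forward hook does to a block's output.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def steer(residual, source_emb, target_emb, alpha):
    # Direction points from the source concept toward the target concept;
    # alpha sets how far along it the residual is pushed.
    direction = normalize([t - s for t, s in zip(target_emb, source_emb)])
    return [r + alpha * d for r, d in zip(residual, direction)]


steered = steer([0.0, 0.0], source_emb=[1.0, 0.0], target_emb=[0.0, 1.0], alpha=2.0)
print(steered)
```

	# Because the direction is unit length, alpha is directly the size of the
	# shift in residual space, which makes values comparable across word pairs.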


	# =============================================================================
	# TAB 4 - DIFF: compare two models on one prompt, aligned by relative depth
	# =============================================================================
	@torch.no_grad()
	def diff_models(name_a, name_b, prompt, target, top_k):
	    ma, ta = get_handles(name_a.strip())
	    mb, tb = get_handles(name_b.strip())
	    ida = ta.encode(target, add_special_tokens=False)
	    idb = tb.encode(target, add_special_tokens=False)
	    if not ida or not idb:
	        return "Could not tokenize target %r in both models." % target
	    ida, idb = ida[0], idb[0]
	    da = layer_distributions(ma, ta, prompt)
	    db = layer_distributions(mb, tb, prompt)
	    nA, nB = len(da) - 1, len(db) - 1
	    def top1(probs, tok):
	        v, i = probs.topk(1)
	        return "%r:%.2f" % (tok.decode([i.item()]), v.item())
	    lines = ["prompt: %r target: %r" % (prompt, target),
	             "%18s | %16s %6s | %16s %6s | %7s"
	             % ("depth (A/B)", "A top1", "pA", "B top1", "pB", "dp")]
	    for i in range(nA + 1):
	        frac = (i / nA) if nA > 0 else 0.0
	        j = max(0, min(round(frac * nB), nB)) if nB > 0 else 0
	        la, pa = da[i]; lb, pb = db[j]
	        a_t, b_t = pa[ida].item(), pb[idb].item()
	        lines.append("%18s | %16s %6.3f | %16s %6.3f | %+7.3f"
	                     % ("%3.0f%% (%s/%s)" % (frac * 100, la, lb),
	                        top1(pa, ta), a_t, top1(pb, tb), b_t, b_t - a_t))
	    return "\n".join(lines)
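
	# Sketch (not part of the app): how rows of a deep model are paired with
	# rows of a shallow one by relative depth. `align_depth` is a hypothetical
	# helper; 12 and 4 are the block counts of gpt2 and the glass-box.

```python
def align_depth(n_a, n_b):
    # For each hidden-state index i of model A (0..n_a), pick the index j of
    # model B (0..n_b) at the same fraction of total depth.
    pairs = []
    for i in range(n_a + 1):
        frac = (i / n_a) if n_a > 0 else 0.0
        j = max(0, min(round(frac * n_b), n_b)) if n_b > 0 else 0
        pairs.append((i, j))
    return pairs


pairs = align_depth(12, 4)
print(pairs)
```

	# Several rows of the deeper model map onto the same row of the shallower
	# one, so repeated B columns in the table are expected.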


	# =============================================================================
	# TAB 5 - CAUSAL TRACE: corrupt the subject, restore each layer, find the site
	# -----------------------------------------------------------------------------
	# This is ROME's activation patching. We:
	# 1. record clean activations and clean p(target)
	# 2. add gaussian noise to the SUBJECT token embeddings -> corrupt p(target)
	# 3. for each layer L: run corrupted, but force layer L's residual back to
	# the clean values at the subject positions. How much p(target) recovers
	# tells you how causally important layer L is. The peak is "the site".
	# The glass-box gives a clean, verifiable peak; gpt2 gives a realistic band.
	# =============================================================================
	def _find_subject_positions(tok, input_ids, prompt, subject):
	"""Locate subject token positions, with a path for slow (non-fast) toks."""
	seq_len = input_ids.shape[1]
	if getattr(tok, "is_fast", False):
	enc = tok(prompt, return_tensors="pt", return_offsets_mapping=True)
	cs = prompt.find(subject)
	if cs >= 0:
	ce = cs + len(subject)
	offs = enc["offset_mapping"][0].tolist()
	pos = [i for i, (s, e) in enumerate(offs) if e > cs and s < ce]
	if pos:
	return [p for p in pos if p != seq_len - 1], ""
	else:
	sub_ids = tok.encode(subject, add_special_tokens=False)
	seq = input_ids[0].tolist()
	pos = [i for i, t in enumerate(seq) if t in sub_ids]
	if pos:
	return [p for p in pos if p != seq_len - 1], ""
	fb = list(range(0, max(1, seq_len - 1)))[: max(1, seq_len // 2)]
	return fb, "(subject not found; using fallback window)\n"


	@torch.no_grad()
	def causal_trace(prompt, subject, target, noise_scale, seed):
	    if STATE["name"] is None:
	        return "Load a model first."
	    model, tok = get_handles(STATE["name"])
	    seed, noise_scale = int(seed), float(noise_scale)
	    inputs = tok(prompt, return_tensors="pt").to(DEVICE)
	    input_ids = inputs["input_ids"]
	    positions, note = _find_subject_positions(tok, input_ids, prompt, subject)
	    if not positions:
	        return note + "No valid subject positions."
	    target_ids = tok.encode(target, add_special_tokens=False)
	    if not target_ids:
	        return "Could not tokenize target %r." % target
	    tid = target_ids[0]

	    out_clean = model(**inputs, output_hidden_states=True)
	    clean_hs = out_clean.hidden_states
	    clean_p = F.softmax(out_clean.logits[0, -1].to(DTYPE), dim=-1)[tid].item()

	    emb_module = model.get_input_embeddings()
	    std = emb_module.weight.std().item()
	    hidden = emb_module.weight.shape[-1]
	    torch.manual_seed(seed)
	    # Create the noise on the embedding's own device: the toy models stay
	    # on CPU even when DEVICE is "cuda".
	    noise = torch.randn(len(positions), hidden,
	                        device=emb_module.weight.device) * noise_scale * std

	    def corrupt_hook(module, inp, out):
	        out = out.clone()
	        for k, p in enumerate(positions):
	            out[0, p] = out[0, p] + noise[k].to(out.dtype)
	        return out

	    h = emb_module.register_forward_hook(corrupt_hook)
	    try:
	        corrupt_p = F.softmax(model(**inputs).logits[0, -1].to(DTYPE), dim=-1)[tid].item()
	    finally:
	        h.remove()

	    blocks, rows = get_blocks(model), []
	    for l in range(len(blocks)):
	        clean_layer_hs = clean_hs[l + 1][0]  # hidden_states[0] is the embedding output
	        def restore_hook(module, inp, out, _clean=clean_layer_hs):
	            if isinstance(out, tuple):
	                h0 = out[0].clone()
	                for p in positions:
	                    h0[0, p] = _clean[p].to(h0.dtype)
	                return (h0,) + out[1:]
	            h0 = out.clone()
	            for p in positions:
	                h0[0, p] = _clean[p].to(h0.dtype)
	            return h0
	        h1 = emb_module.register_forward_hook(corrupt_hook)
	        h2 = blocks[l].register_forward_hook(restore_hook)
	        try:
	            p_r = F.softmax(model(**inputs).logits[0, -1].to(DTYPE), dim=-1)[tid].item()
	        finally:
	            h1.remove(); h2.remove()
	        rows.append((l, p_r))

	    denom = clean_p - corrupt_p
	    lines = [note + "prompt: %r" % prompt,
	             "subject: %r target: %r" % (subject, target),
	             "clean p=%.3f corrupt p=%.3f noise=%sx std" % (clean_p, corrupt_p, noise_scale),
	             "", "%6s | %9s | %9s" % ("layer", "p(target)", "recovery")]
	    best_l, best_r = 0, -1e9
	    for l, p_r in rows:
	        rec = (p_r - corrupt_p) / denom if abs(denom) > 1e-6 else 0.0
	        if rec > best_r:
	            best_r, best_l = rec, l
	        lines.append(" L%-3d | %9.3f | %8.1f%%" % (l, p_r, rec * 100))
	    lines.append("")
	    lines.append("# peak at L%d (%.0f%% recovery) <- the causal site" % (best_l, best_r * 100))
	    if abs(denom) < 1e-6:
	        lines.append("# (corruption didn't move p(target): on 'handmade' this is "
	                     "EXPECTED - the fact lives in a string match, not activations.)")
	    return "\n".join(lines)
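
	# Sketch (not part of the app): the recovery score on its own. The clean
	# and corrupt probabilities echo the tuning note on the fact MLP; the
	# per-layer restored values are hypothetical, as is the name `recovery`.

```python
def recovery(p_restored, p_clean, p_corrupt, eps=1e-6):
    # Fraction of the clean-vs-corrupt gap that restoring one layer wins back.
    # 0.0 = no help, 1.0 = full recovery; a degenerate gap counts as 0.
    denom = p_clean - p_corrupt
    return (p_restored - p_corrupt) / denom if abs(denom) > eps else 0.0


clean_p, corrupt_p = 0.50, 0.07
restored = {0: 0.50, 1: 0.07, 2: 0.07, 3: 0.07}   # hypothetical p(target) per layer
scores = {layer: recovery(p, clean_p, corrupt_p) for layer, p in restored.items()}
peak = max(scores, key=scores.get)
print(scores, "peak at L%d" % peak)
```

	# When corruption does not move p(target) at all, every score is 0 and the
	# reported "peak" carries no information; the code above flags that case.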


	# =============================================================================
	# EDIT LOOP + VERIFICATION HARNESS (the ROME sandbox)
	# -----------------------------------------------------------------------------
	# Apply a knowledge edit to the glass-box, then PROVE it was surgical:
	# efficacy - did the target fact change to the new answer?
	# specificity - did the OTHER facts stay exactly as they were? (locality)
	# fluency - did the output distribution stay sane (no entropy collapse)?
	# Because we own the ground truth, "nothing else broke" is checkable, not vibes.
	# An optional pass sends the before/after battery to an LLM judge (Claude,
	# Hugging Face Inference, or a local server) for an independent verdict.
	# =============================================================================
	GB_ANSWERS = ["paris", "berlin", "tokyo", "london", "rome"]


	@torch.no_grad()
	def _probe_battery(model, tok):
	"""Run every known fact + a neutral format probe; record what the model says."""
	rows = {}
	for country, orig in GB_FACTS:
	prompt = "the capital of %s is" % country
	probs = F.softmax(model(**tok(prompt, return_tensors="pt").to(DEVICE)
	).logits[0, -1].to(DTYPE), dim=-1)
	v, i = probs.topk(1)
	rows[country] = {
	"prompt": prompt, "orig": orig,
	"top1": tok.decode([i.item()]), "top1_p": v.item(),
	"p_orig": probs[GB_STOI[orig]].item(),
	"cand": {a: probs[GB_STOI[a]].item() for a in GB_ANSWERS},
	"entropy": _entropy_bits(probs),
	}
	return rows


	def _verdict(before, after, subject, new_answer, drift_thresh=0.05):
	eff = after[subject]["top1"] == new_answer
	collateral, max_drift = [], 0.0
	for c in before:
	if c == subject:
	continue
	d = abs(after[c]["p_orig"] - before[c]["p_orig"])
	max_drift = max(max_drift, d)
	if after[c]["top1"] != before[c]["top1"] or d > drift_thresh:
	collateral.append(c)
	ent_blowup = any(abs(after[c]["entropy"] - before[c]["entropy"]) > 0.8 for c in before)
	surgical = eff and not collateral and not ent_blowup
	return eff, collateral, max_drift, ent_blowup, surgical


	def edit_and_verify(subject, new_answer, method, strength, use_llm,
	                    anthropic_key, anthropic_model, hf_token, hf_model,
	                    local_url, local_model):
	    model, tok = get_handles("glassbox")
	    STATE["name"] = "glassbox"
	    # Strip once up front so the dictionary lookups below use the same key
	    # that edit_fact sees.
	    subject, new_answer = subject.strip(), new_answer.strip()
	    model.reset()
	    before = _probe_battery(model, tok)
	    try:
	        model.edit_fact(subject, new_answer, method, float(strength))
	    except ValueError as e:
	        return "Edit failed: %s\nValid subjects: france, germany, japan. " \
	               "Valid answers: %s" % (e, ", ".join(GB_ANSWERS))
	    after = _probe_battery(model, tok)
	    eff, collateral, max_drift, ent, surgical = _verdict(before, after, subject, new_answer)

	    L = ["EDIT: %s's capital -> %r (method=%s, strength=%s)" %
	         (subject, new_answer, method, strength), "",
	         "%-9s | %-22s | %-22s" % ("fact", "before (top1 / p_orig)", "after (top1 / p_orig)"),
	         "-" * 60]
	    for c in before:
	        b, a = before[c], after[c]
	        flag = " <- TARGET" if c == subject else (" <- COLLATERAL" if c in collateral else "")
	        L.append("%-9s | %-22s | %-22s%s" % (
	            c, "%s / %.2f" % (b["top1"], b["p_orig"]),
	            "%s / %.2f" % (a["top1"], a["p_orig"]), flag))
	    L += ["",
	          "efficacy : %s (target now says %r, p=%.2f)" %
	          ("PASS" if eff else "FAIL", after[subject]["top1"], after[subject]["top1_p"]),
	          "specificity : %s (max drift on other facts = %.3f%s)" %
	          ("PASS" if not collateral else "FAIL: " + ", ".join(collateral),
	           max_drift, "; entropy spike" if ent else ""),
	          "", "VERDICT: %s" % ("SURGICAL EDIT" if surgical else "COLLATERAL DAMAGE")]
	    L.append("(model is left in the edited state - inspect it in tabs 1-5, or hit Reset.)")

	    llm_report = ""
	    if use_llm:
	        providers = [
	            {"type": "anthropic", "key": anthropic_key, "model": anthropic_model},
	            {"type": "hf", "key": hf_token, "model": hf_model},
	            {"type": "local", "url": local_url, "model": local_model},
	        ]
	        llm_report = _llm_judge_chain(before, after, subject, new_answer, providers)
	        L += ["", "-" * 60, "INDEPENDENT LLM REVIEW:", llm_report]

	    report = "\n".join(L)
	    _log_session(subject, new_answer, method, strength, before, after,
	                 eff, collateral, max_drift, surgical, llm_report)
	    return report


	def reset_glassbox():
	model, _ = get_handles("glassbox")
	model.reset()
	return "Glass-box weights restored to pristine. Re-run any tab to confirm."


	# --- optional: real LLM calls to verify the edit, with a 3-tier fallback chain
	# Anthropic (Claude) -> Hugging Face Inference -> local OpenAI-compatible server
	# (e.g. LM Studio). Tries each in order; the first provider that's configured
	# AND reachable wins. This means you're never blocked on one vendor being down
	# or on not having an Anthropic key at all - your own RTX 5090 can be the judge.
	def _build_judge_prompt(before, after, subject, new_answer):
	import json
	payload = {c: {"prompt": before[c]["prompt"],
	"before_top1": before[c]["top1"], "before_p_orig": round(before[c]["p_orig"], 3),
	"after_top1": after[c]["top1"], "after_p_orig": round(after[c]["p_orig"], 3)}
	for c in before}
	sys = ("You audit knowledge edits to a small language model. The intended edit "
	"is: make %s's capital '%s'. Given before/after predictions for every "
	"known fact, decide if the edit was SURGICAL (target changed, all other "
	"facts unchanged) or caused COLLATERAL damage. Reply ONLY as JSON, no "
	'prose, no markdown fences: {"verdict":"surgical|collateral",'
	'"target_changed":bool,"damaged_facts":[...],"confidence":0-1,'
	'"reason":"one sentence"}.') % (subject, new_answer)
	return sys, json.dumps(payload)


	def _parse_verdict_json(text, provider_label):
	import json
	clean = text.strip().strip("`")
	if clean.lower().startswith("json"):
	clean = clean[4:].strip()
	start, end = clean.find("{"), clean.rfind("}")
	if start != -1 and end != -1:
	clean = clean[start:end + 1]
	v = json.loads(clean)
	return ("[%s] verdict=%s target_changed=%s confidence=%s\n damaged: %s\n reason: %s"
	% (provider_label, v.get("verdict"), v.get("target_changed"), v.get("confidence"),
	v.get("damaged_facts") or "none", v.get("reason")))


	def _try_anthropic(sys, user, cfg):
	import os, json
	key = (cfg.get("key") or "").strip() or os.environ.get("ANTHROPIC_API_KEY", "")
	if not key:
	return None, "anthropic: no key configured"
	body = {"model": (cfg.get("model") or "claude-sonnet-4-6").strip(),
	"max_tokens": 400, "system": sys, "messages": [{"role": "user", "content": user}]}
	try:
	try:
	import anthropic
	client = anthropic.Anthropic(api_key=key)
	msg = client.messages.create(**body)
	text = "".join(b.text for b in msg.content if getattr(b, "type", "") == "text")
	except ImportError:
	import urllib.request
	req = urllib.request.Request(
	"https://api.anthropic.com/v1/messages", data=json.dumps(body).encode(),
	headers={"x-api-key": key, "anthropic-version": "2023-06-01",
	"content-type": "application/json"})
	with urllib.request.urlopen(req, timeout=30) as r:
	data = json.loads(r.read())
	text = "".join(b.get("text", "") for b in data.get("content", [])
	if b.get("type") == "text")
	return _parse_verdict_json(text, "anthropic:" + body["model"]), None
	except Exception as e:
	return None, "anthropic failed: %s" % e


	def _try_hf(sys, user, cfg):
	token = (cfg.get("key") or "").strip()
	model = (cfg.get("model") or "Qwen/Qwen2.5-7B-Instruct").strip()
	if not token:
	import os
	token = os.environ.get("HF_TOKEN", "")
	if not token:
	return None, "hf: no token configured"
	try:
	from huggingface_hub import InferenceClient
	client = InferenceClient(model=model, token=token)
	resp = client.chat_completion(
	messages=[{"role": "system", "content": sys}, {"role": "user", "content": user}],
	max_tokens=400)
	text = resp.choices[0].message.content
	return _parse_verdict_json(text, "hf:" + model), None
	except Exception as e:
	return None, "hf failed: %s" % e


	def _try_local(sys, user, cfg):
	"""Any OpenAI-compatible /v1/chat/completions server - LM Studio, vLLM,
	Ollama (with its OpenAI shim), text-generation-webui, etc."""
	import json, urllib.request
	url = (cfg.get("url") or "").strip().rstrip("/")
	if not url:
	return None, "local: no URL configured"
	model = (cfg.get("model") or "local-model").strip()
	body = json.dumps({"model": model, "max_tokens": 400, "temperature": 0,
	"messages": [{"role": "system", "content": sys},
	{"role": "user", "content": user}]}).encode()
	try:
	req = urllib.request.Request(
	url + "/v1/chat/completions", data=body,
	headers={"content-type": "application/json"})
	with urllib.request.urlopen(req, timeout=20) as r:
	data = json.loads(r.read())
	text = data["choices"][0]["message"]["content"]
	return _parse_verdict_json(text, "local:" + model + "@" + url), None
	except Exception as e:
	return None, "local failed: %s" % e


	def _llm_judge_chain(before, after, subject, new_answer, providers):
	sys, user = _build_judge_prompt(before, after, subject, new_answer)
	dispatch = {"anthropic": _try_anthropic, "hf": _try_hf, "local": _try_local}
	skipped = []
	for cfg in providers:
	fn = dispatch.get(cfg["type"])
	if fn is None:
	continue
	result, err = fn(sys, user, cfg)
	if result is not None:
	note = ("" if not skipped else
	"(skipped: %s)\n" % "; ".join(skipped))
	return note + result
	skipped.append(err)
	return ("all providers unavailable:\n " + "\n ".join(skipped) +
	"\n(configure at least one: Anthropic key, HF token, or a local "
	"OpenAI-compatible server URL like http://192.168.188.25:1234)")


	# --- session log: every edit+verify run is appended here as a plain dict and
	# written out as JSON by export_session_log(), so you can download it, or paste
	# the markdown rendition straight into a later chat with Claude for review.
	SESSION_LOG = []


	def _log_session(subject, new_answer, method, strength, before, after,
	                 eff, collateral, max_drift, surgical, llm_report):
	    import datetime
	    # utcnow() is deprecated since Python 3.12; take an aware UTC timestamp
	    # and keep the trailing "Z" the log format already used.
	    ts = datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z")
	    SESSION_LOG.append({
	        "ts": ts,
	        "subject": subject, "new_answer": new_answer, "method": method,
	        "strength": strength, "efficacy_pass": bool(eff),
	        "collateral": collateral, "max_drift": round(max_drift, 4),
	        "verdict": "SURGICAL" if surgical else "COLLATERAL",
	        "before": {c: {"top1": before[c]["top1"], "p_orig": round(before[c]["p_orig"], 4)}
	                   for c in before},
	        "after": {c: {"top1": after[c]["top1"], "p_orig": round(after[c]["p_orig"], 4)}
	                  for c in after},
	        "llm_review": llm_report or None,
	    })


	def export_session_log():
	    import json, os
	    if not SESSION_LOG:
	        return None, "No edits run yet this session - nothing to export."
	    os.makedirs("/mnt/user-data/outputs", exist_ok=True)
	    path = "/mnt/user-data/outputs/edit_session_log.json"
	    # context managers so both files are flushed and closed before the path
	    # is handed to the download widget
	    with open(path, "w") as f:
	        json.dump(SESSION_LOG, f, indent=2)
	    # also a markdown rendition meant to be pasted straight into a chat
	    md = ["# Edit session log\n"]
	    for i, e in enumerate(SESSION_LOG, 1):
	        md.append("## Edit %d - %s (%s, %s, strength=%s)\n" %
	                  (i, e["verdict"], e["subject"] + "->" + e["new_answer"],
	                   e["method"], e["strength"]))
	        md.append("- efficacy: %s, max collateral drift: %.4f, damaged: %s" %
	                  ("pass" if e["efficacy_pass"] else "fail", e["max_drift"],
	                   e["collateral"] or "none"))
	        if e["llm_review"]:
	            md.append("- LLM review: " + e["llm_review"].replace("\n", " "))
	        md.append("")
	    md_path = "/mnt/user-data/outputs/edit_session_log.md"
	    with open(md_path, "w") as f:
	        f.write("\n".join(md))
	    return path, "Wrote %d edit(s) to %s and %s" % (len(SESSION_LOG), path, md_path)


	# =============================================================================
	# EXPORT + UPLOAD TO HUGGING FACE
	# -----------------------------------------------------------------------------
	# Save the glass-box as a self-contained, reloadable repo (weights + config +
	# vocab + a standalone modeling file + a model card), and optionally push it -
	# and/or this whole app as a Space - to the Hub.
	# =============================================================================
	_MODELING_PY = '''"""Standalone glass-box model - reload with no other files.

	from modeling_glassbox import load
	m, tok = load(".") # folder containing config/weights/vocab
	print(tok.decode(m.generate(tok("the capital of france is")["input_ids"])[0]))
	"""
	import json, torch, torch.nn as nn, torch.nn.functional as F
	from safetensors.torch import load_file

	def load(path="."):
	cfg = json.load(open(f"{path}/config.json"))
	stoi = json.load(open(f"{path}/vocab.json")); itos = {i: w for w, i in stoi.items()}
	D, V = cfg["d_model"], len(stoi); facts = [tuple(f) for f in cfg["facts"]]
	SUBJ = torch.zeros(D, D)
	for d in range(cfg["subject_dims"]): SUBJ[d, d] = 1.0

	class Tok:
	is_fast = False
	def __init__(s): s.eos_token_id = stoi["."]
	def _ids(s, t): return [stoi.get(w, stoi["<s>"]) for w in t.lower().replace(".", " .").split()] or [stoi["<s>"]]
	def __call__(s, t, **k):
	import torch as T; return {"input_ids": T.tensor([s._ids(t)])}
	def decode(s, ids, **k): return " ".join(itos.get(int(i), "?") for i in ids)
	class Ident(nn.Module):
	def forward(s, x): return (x.clone(),)
	class Pool(nn.Module):
	def forward(s, x):
	o = x.clone()
	if x.shape[1] > 1: o[0, -1] = o[0, -1] + 0.9 * (x[0, :-1] @ SUBJ.T).sum(0)
	return (o,)
	class FactMLP(nn.Module):
	def __init__(s):
	super().__init__()
	s.register_buffer("Win", torch.zeros(len(facts), D))
	s.register_buffer("Wout", torch.zeros(D, len(facts)))
	s.bias, s.gain = cfg["bias"], cfg["gain"]
	def forward(s, x):
	o = x.clone(); pre = F.relu(s.Win @ o[0, -1] - s.bias)
	o[0, -1] = o[0, -1] + s.gain * (s.Wout @ pre); return (o,)
	class T(nn.Module):
	def __init__(s):
	super().__init__(); s.wte = nn.Embedding(V, D)
	s.h = nn.ModuleList([Ident(), Pool(), FactMLP(), Ident()]); s.ln_f = nn.Identity()
	class GlassBox(nn.Module):
	def __init__(s):
	super().__init__(); s.transformer = T(); s.head = nn.Linear(D, V, bias=False)
	def get_input_embeddings(s): return s.transformer.wte
	def forward(s, input_ids=None, **k):
	x = s.transformer.wte(input_ids)
	for b in s.transformer.h: (x,) = b(x)
	class O: pass
	o = O(); o.logits = s.head(x); return o
	@torch.no_grad()
	def generate(s, input_ids=None, max_new_tokens=12, **k):
	ids = input_ids
	for _ in range(max_new_tokens):
	ids = torch.cat([ids, s(input_ids=ids).logits[0, -1].argmax().view(1, 1)], 1)
	return ids
	m = GlassBox().eval()
	sd = load_file(f"{path}/model.safetensors")
	m.load_state_dict({k: v for k, v in sd.items() if not k.endswith("0")}, strict=False)
	return m, Tok()
	'''


	def export_glassbox(outdir="glassbox_export"):
	import os, json
	from safetensors.torch import save_file
	os.makedirs(outdir, exist_ok=True)
	model, _ = get_handles("glassbox")
	sd = {k: v.contiguous() for k, v in model.state_dict().items()}
	save_file(sd, os.path.join(outdir, "model.safetensors"))
	json.dump({"model_type": "glassbox", "d_model": GB_D, "vocab_size": GB_V,
	"subject_dims": 9, "bias": model.transformer.h[2].bias,
	"gain": model.transformer.h[2].gain,
	"facts": [list(f) for f in GB_FACTS]},
	open(os.path.join(outdir, "config.json"), "w"), indent=2)
	json.dump(GB_STOI, open(os.path.join(outdir, "vocab.json"), "w"), indent=2)
	open(os.path.join(outdir, "modeling_glassbox.py"), "w").write(_MODELING_PY)
	open(os.path.join(outdir, "README.md"), "w").write(
	"---\nlicense: mit\ntags: [interpretability, glass-box, rome, toy-model]\n---\n\n"
	"# Glass-box interpretability model\n\n"
	"A tiny transformer-shaped model whose facts are stored as key->value "
	"writes into the residual stream, so logit-lens, activation steering and "
	"ROME causal tracing all reproduce the known ground truth. Built as a "
	"verification harness for interpretability code.\n\n"
	"```python\nfrom modeling_glassbox import load\n"
	"m, tok = load('.')\n"
	"print(tok.decode(m.generate(tok('the capital of france is')['input_ids'])[0]))\n```\n\n"
	"Facts: " + ", ".join("%s->%s" % f for f in GB_FACTS) + ".\n")
	return outdir


	def upload_to_hf(repo_id, token, what, app_path=__file__):
	"""Push the model and/or this app (as a Space) to the Hub."""
	import os
	try:
	from huggingface_hub import HfApi
	except ImportError:
	return "huggingface_hub not installed. `pip install huggingface_hub`."
	token = (token or "").strip() or os.environ.get("HF_TOKEN", "")
	if not token:
	return "No HF token. Paste a write token or set HF_TOKEN."
	repo_id = (repo_id or "").strip()
	if not repo_id:
	return "Enter a repo id like 'Chris4K/glassbox-interp'."
	api, logs = HfApi(token=token), []
	try:
	if what in ("model", "both"):
	d = export_glassbox()
	api.create_repo(repo_id, repo_type="model", exist_ok=True)
	api.upload_folder(folder_path=d, repo_id=repo_id, repo_type="model")
	logs.append("model -> https://huggingface.co/%s" % repo_id)
	if what in ("space", "both"):
	sid = repo_id + "-space" if what == "both" else repo_id
	api.create_repo(sid, repo_type="space", space_sdk="gradio", exist_ok=True)
	api.upload_file(path_or_fileobj=app_path, path_in_repo="app.py",
	repo_id=sid, repo_type="space")
	req = "torch\ntransformers\ngradio\nsafetensors\nhuggingface_hub\nanthropic\n"
	api.upload_file(path_or_fileobj=req.encode(), path_in_repo="requirements.txt",
	repo_id=sid, repo_type="space")
	logs.append("space -> https://huggingface.co/spaces/%s" % sid)
	return "Uploaded:\n " + "\n ".join(logs)
	except Exception as e:
	return "Upload failed: %s" % e


	# --- upload a REAL model (e.g. a VINDEX-edited Llama checkpoint), not the toy.
	# This does NOT load the model into memory (multi-GB Llama weights don't need
	# to round-trip through Python) - it just pushes whatever's already on disk.
	# Point it at the local folder produced by your save_pretrained()/VINDEX run:
	# expects the usual HF layout (config.json + .safetensors shards + tokenizer
	# files). Note: you need write access to the destination repo (your own
	# namespace or an org you can push to). A gate on the base model (e.g.
	# meta-llama/*) controls who can download that base model; it does not block
	# this upload, though the base model's license may still govern derivatives.
	def upload_local_checkpoint(local_dir, repo_id, token, private, commit_message):
	import os
	try:
	from huggingface_hub import HfApi
	except ImportError:
	return "huggingface_hub not installed. `pip install huggingface_hub`."
	local_dir = (local_dir or "").strip()
	repo_id = (repo_id or "").strip()
	if not local_dir or not os.path.isdir(local_dir):
	return "local_dir %r does not exist or is not a directory." % local_dir
	if not repo_id:
	return "Enter a repo id like 'Chris4K/vindex-llama3-edited'."
	token = (token or "").strip() or os.environ.get("HF_TOKEN", "")
	if not token:
	return "No HF token. Paste a write token or set HF_TOKEN."
	has_cfg = os.path.exists(os.path.join(local_dir, "config.json"))
	has_weights = any(f.endswith((".safetensors", ".bin"))
	for f in os.listdir(local_dir))
	warn = "" if (has_cfg and has_weights) else (
	"WARNING: folder is missing config.json or weight files - this may "
	"not be a loadable HF checkpoint. Uploading anyway.\n")
	api = HfApi(token=token)
	try:
	api.create_repo(repo_id, repo_type="model", private=bool(private), exist_ok=True)
	api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="model",
	commit_message=(commit_message or "upload checkpoint").strip())
	return (warn + "Uploaded %s -> https://huggingface.co/%s\n"
	"Files: %s" % (local_dir, repo_id, ", ".join(sorted(os.listdir(local_dir))[:12])))
	except Exception as e:
	return warn + "Upload failed: %s" % e


	# =============================================================================
	# UI
	# =============================================================================
	INTRO = """
	# Compression Navigator
	An LLM is a lossy codec for text. Training compresses a corpus into weights;
	a forward pass decompresses a continuation. These five tools let you watch that
	decompression and find where facts physically live.

	Each tab is a real interpretability technique: **logit lens, embedding
	neighbours, activation steering, cross-model diff, and causal tracing (ROME).**

	### Three models, on purpose
	| name | how it stores facts | what it teaches |
	|---|---|---|
	| `glassbox` | key→value writes into the residual stream (like a real transformer / what ROME edits) | the tools work and are verifiable against ground truth you can read in the source |
	| `handmade` | a lookup table keyed on the prompt string (a side channel) | a model can be invisible to residual-stream interpretability, a real limitation |
	| `gpt2` | learned, fuzzy, distributed over many layers | what the real, messy thing looks like |

	Suggested order: load `glassbox` first (see "correct"), then `handmade`
	(see a failure mode), then `gpt2` (see reality). Type a name below and Load.
	"""

	with gr.Blocks(title="Compression Navigator") as demo:
	gr.Markdown(INTRO)
	with gr.Row():
	model_name = gr.Textbox(value="glassbox", label="model name or HF id")
	load_btn = gr.Button("Load", variant="primary")
	load_status = gr.Markdown()
	load_btn.click(load_model, inputs=model_name, outputs=load_status)

	# ---- TAB 1 -------------------------------------------------------------
	with gr.Tab("1 · Decompress (logit lens)"):
	gr.Markdown("""
	### Logit lens — watch the answer condense, layer by layer
	What it does: takes the last-token residual at every layer and reads it
	through the unembedding, as if the model had to answer right there. You see the
	prediction form.

	How to read it: each row is a layer. Watch your tracked token's probability
	(right column) climb, and watch entropy (bits) fall as the model commits.

	Ground truth to check:
	- `glassbox` — `paris` is ~0 until L3 (the readout right after the fact-MLP), then jumps to ~0.51. Sharp and localised because you put it there.
	- `handmade` — the answer snaps to 1.00 at L1 with zero build-up (it's a lookup, not a computation).
	- `gpt2` — the answer accretes gradually across many middle/late layers. That smear is what "distributed representation" actually looks like.

	(Numbering note: the lens counts from the embedding, so `L1` is after the first block. The causal-trace tab counts blocks from `L0`. So the fact-MLP is lens-`L3` / trace-block-`L2`, and its causal site shows at trace-`L0`.)
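
In code: a minimal sketch of the readout described above, assuming a made-up 3-token vocab and hand-picked residuals. `W_U`, `residuals` and `lens_row` are illustrative names, not this app's internals.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def lens_row(W_U, h, track_id):
    # Read a residual vector through the unembedding as if it were the final layer.
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_U]
    probs = softmax(logits)
    return probs[track_id], entropy_bits(probs)

# Hypothetical 3-token vocab and 2-dim residual stream.
vocab = ["paris", "berlin", "tokyo"]
W_U = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]
# Made-up last-token residuals: empty for two layers, then the fact is written.
residuals = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0]]

rows = [lens_row(W_U, h, vocab.index("paris")) for h in residuals]
for layer, (p, bits) in enumerate(rows):
    print("L%d  p(paris)=%.3f  entropy=%.3f bits" % (layer, p, bits))
```

The tracked probability sits at chance until the residual changes, then rises while entropy falls, which is the pattern to look for in the output box.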
	""")
	ll_prompt = gr.Textbox(value="the capital of france is", label="prompt")
	with gr.Row():
	ll_k = gr.Slider(1, 10, value=3, step=1, label="top-k per layer")
	ll_track = gr.Textbox(value="paris", label="track this token's prob")
	ll_out = gr.Textbox(label="output", lines=18)
	gr.Button("Run").click(logit_lens, [ll_prompt, ll_k, ll_track], ll_out)

	# ---- TAB 2 -------------------------------------------------------------
	with gr.Tab("2 · Triangulate (neighbours)"):
	gr.Markdown("""
	### Neighbours — the geometry of the vocabulary
	What it does: ranks tokens by cosine similarity of their unembedding rows.
	Directions that point the same way are "near" in the model's compressed space.

	How to read it: high cosine = the model treats these tokens as related.

	Ground truth to check:
	- `glassbox` — `paris` is near `france` (cos ≈ 0.48): the source deliberately makes a capital share a dimension with its country. Real geometry, by design.
	- `handmade` — every cosine is 0. One-hot embeddings are mutually orthogonal, so there's no geometry at all. The tool is correctly reporting "nothing here."
	- `gpt2` — neighbours are messy but meaningful (casing variants, plurals, semantic kin).
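
In code: a small sketch of the ranking, assuming hand-written 3-dim rows. The `shared` and `one_hot` tables are made up to mirror the glassbox and handmade cases; they are not the app's real matrices.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def neighbours(rows, word, k):
    # Rank every other token by cosine similarity to the query row.
    q = rows[word]
    scored = [(cosine(q, vec), w) for w, vec in rows.items() if w != word]
    return sorted(scored, reverse=True)[:k]

# Hypothetical rows: paris shares its first dimension with france.
shared = {"france": [1.0, 0.0, 0.0], "paris": [1.0, 1.0, 0.0], "tokyo": [0.0, 0.0, 1.0]}
# One-hot rows: every pair is orthogonal.
one_hot = {"france": [1.0, 0.0, 0.0], "paris": [0.0, 1.0, 0.0], "tokyo": [0.0, 0.0, 1.0]}

top_shared = neighbours(shared, "paris", 2)
top_one_hot = neighbours(one_hot, "paris", 2)
print(top_shared)
print(top_one_hot)
```

With a shared dimension `france` ranks first; with one-hot rows every score is zero, so the ranking carries no information.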
	""")
	nb_word = gr.Textbox(value="paris", label="word")
	nb_k = gr.Slider(5, 25, value=10, step=1, label="top neighbours")
	nb_out = gr.Textbox(label="output", lines=15)
	gr.Button("Run").click(neighbors, [nb_word, nb_k], nb_out)

	# ---- TAB 3 -------------------------------------------------------------
	with gr.Tab("3 · Re-route (steering)"):
	gr.Markdown("""
	### Steering — bend behaviour with a direction, no retraining
	What it does: builds the vector `emb(target) − emb(source)` and adds it to
	a layer's output during generation. The model drifts from `source` toward
	`target`. This is the cheap cousin of fine-tuning (ActAdd / representation
	engineering).

	How to read it: compare baseline vs steered. Raise strength until the
	output flips; too high and it turns to noise (you've knocked the residual off
	the manifold).

	Tips: on `gpt2` try `from: Paris to: London` on the France prompt, layer
	0–4, strength 6–14. On `glassbox` it works cleanly too — `from: france
	to: japan` at layer 0, strength 8, flips the output from `paris` to `tokyo`
	(you're pushing the residual along the subject→subject direction the fact-MLP
	keys on).
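
In code: a toy sketch of the steering arithmetic, assuming a 2-dim residual and a single readout step. The real tab adds the vector to a layer's output during generation; this sketch only shows why the top token flips once strength is high enough. All names and numbers are illustrative.

```python
def steer(h, emb_source, emb_target, alpha):
    # Add alpha * (emb(target) - emb(source)) to a residual vector.
    return [x + alpha * (t - s) for x, s, t in zip(h, emb_source, emb_target)]

def top_token(W_U, vocab, h):
    logits = [sum(w * x for w, x in zip(row, h)) for row in W_U]
    return vocab[logits.index(max(logits))]

# Hypothetical toy: dimension 0 carries "france", dimension 1 carries "japan".
emb = {"france": [1.0, 0.0], "japan": [0.0, 1.0]}
vocab = ["paris", "tokyo"]
W_U = [[1.0, 0.0], [0.0, 1.0]]   # paris reads dim 0, tokyo reads dim 1

h = [2.0, 0.0]                   # residual for a prompt about france
baseline = top_token(W_U, vocab, h)
weak = top_token(W_U, vocab, steer(h, emb["france"], emb["japan"], 0.5))
strong = top_token(W_U, vocab, steer(h, emb["france"], emb["japan"], 2.0))
print(baseline, weak, strong)    # paris paris tokyo
```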
	""")
	st_prompt = gr.Textbox(value="the capital of france is", label="prompt")
	with gr.Row():
	st_src = gr.Textbox(value="Paris", label="from")
	st_tgt = gr.Textbox(value="London", label="to")
	with gr.Row():
	st_layer = gr.Slider(0, 11, value=2, step=1, label="layer")
	st_alpha = gr.Slider(0, 30, value=10, step=0.5, label="strength")
	st_max = gr.Slider(8, 80, value=40, step=1, label="max new tokens")
	st_base = gr.Textbox(label="baseline", lines=2)
	st_out = gr.Textbox(label="steered", lines=3)
	gr.Button("Run").click(steer_generate,
	[st_prompt, st_src, st_tgt, st_layer, st_alpha, st_max],
	[st_base, st_out])

	# ---- TAB 4 -------------------------------------------------------------
	with gr.Tab("4 · Diff (align by depth)"):
	gr.Markdown("""
	### Diff — two models on one prompt, aligned by relative depth
	What it does: runs the logit lens on model A and model B and lines their
	layers up by percentage depth (0–100%), so you can compare a shallow toy with a
	12-layer gpt2 side by side. `dp` is `p_B − p_A` for the target token.

	How to read it: look at where on the depth axis each model commits to the
	target. A localised model commits at one depth; a distributed one ramps up.

	Try: A = `gpt2`, B = `glassbox`, target = `paris`. You'll see gpt2 ramp
	through the middle while glassbox snaps on at its fact layer — the same fact,
	two very different internal shapes.
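
In code: one possible way to line layers up by relative depth, assuming each model has at least two lens rows and using nearest-layer matching on a shared grid. The probabilities are made up and `align_by_depth` is an illustrative helper, not the function behind this tab.

```python
def relative_depth(n_layers):
    # Row i of an n-row readout sits i / (n - 1) of the way through (0.0 to 1.0).
    return [i / (n_layers - 1) for i in range(n_layers)]

def align_by_depth(p_a, p_b, steps=5):
    # For each depth on a shared grid, pick the nearest layer in each model.
    da, db = relative_depth(len(p_a)), relative_depth(len(p_b))
    rows = []
    for s in range(steps):
        depth = s / (steps - 1)
        ia = min(range(len(p_a)), key=lambda i: abs(da[i] - depth))
        ib = min(range(len(p_b)), key=lambda i: abs(db[i] - depth))
        rows.append((round(100 * depth), ia, ib, p_b[ib] - p_a[ia]))
    return rows

# Made-up target-token probabilities per layer.
p_deep = [0.0, 0.0, 0.1, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9]   # ramps up
p_toy = [0.0, 0.0, 1.0]                                     # snaps on at the end

table = align_by_depth(p_deep, p_toy)
for pct, ia, ib, dp in table:
    print("%3d%%  A:L%d  B:L%d  dp=%+.2f" % (pct, ia, ib, dp))
```

Here `dp` stays negative through the middle, where the deep model has already started ramping, and turns positive only at the end, where the toy snaps on.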
	""")
	with gr.Row():
	df_a = gr.Textbox(value="gpt2", label="model A")
	df_b = gr.Textbox(value="glassbox", label="model B")
	df_prompt = gr.Textbox(value="the capital of france is", label="prompt")
	df_target = gr.Textbox(value="paris", label="target token")
	df_k = gr.Slider(1, 5, value=1, step=1, label="top-k (display)")
	df_out = gr.Textbox(label="output", lines=16)
	gr.Button("Run").click(diff_models,
	[df_a, df_b, df_prompt, df_target, df_k], df_out)

	# ---- TAB 5 -------------------------------------------------------------
	with gr.Tab("5 · Causal trace (ROME)"):
	gr.Markdown("""
	### Causal trace — corrupt the subject, restore each layer, find the site
	What it does: activation patching (Meng et al.'s ROME). It noises the
	subject token, which breaks the prediction, then restores one layer at a
	time and measures how much of the answer comes back. The layer that restores
	the most is where the fact is causally computed.

	How to read it: `recovery` ≈ 100% means "restoring this layer is enough" →
	the fact is read here. The peak line names the site.

	Ground truth to check:
	- `glassbox` — peak at L0 (≈100%). The fact is read at the early subject site, because the L1 "attention" re-reads the restored subject. You know this is right because you wrote the mechanism.
	- `handmade` — `clean p` ≈ `corrupt p`, so recovery is meaningless. Expected: the fact is a string match, untouched by activation noise. This is the headline lesson — patching can't see lookup behaviour.
	- `gpt2` — a band of early–middle layers at the subject token light up, exactly as in the ROME paper.
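
In code: a sketch of the recovery score, assuming it is the fraction of the clean-vs-corrupt gap that one restored layer wins back. The numbers are made up, and the `None` guard is an assumption about how to handle the case where corruption changes nothing.

```python
def recovery(p_clean, p_corrupt, p_restored, eps=1e-6):
    # Fraction of the clean-vs-corrupt gap that restoring one layer wins back.
    gap = p_clean - p_corrupt
    if abs(gap) < eps:
        return None   # corruption did nothing, so recovery is undefined
    return (p_restored - p_corrupt) / gap

def peak_layer(p_clean, p_corrupt, p_restored_by_layer):
    scores = [recovery(p_clean, p_corrupt, p) for p in p_restored_by_layer]
    if any(s is None for s in scores):
        return None, scores
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores

# Made-up numbers: restoring layer 0 brings the answer all the way back.
site, scores = peak_layer(0.51, 0.01, [0.51, 0.26, 0.01, 0.01])
print(site, [round(s, 2) for s in scores])

# Lookup-style model: noise never hurt the answer, so there is nothing to recover.
lookup_site, lookup_scores = peak_layer(1.0, 1.0, [1.0, 1.0])
print(lookup_site, lookup_scores)
```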
	""")
	ct_prompt = gr.Textbox(value="the capital of france is", label="prompt")
	ct_subject = gr.Textbox(value="france", label="subject to corrupt")
	ct_target = gr.Textbox(value="paris", label="target token")
	with gr.Row():
	ct_noise = gr.Slider(0, 10, value=3, step=0.5, label="noise (x embed std)")
	ct_seed = gr.Slider(0, 100, value=0, step=1, label="seed")
	ct_out = gr.Textbox(label="output", lines=18)
	gr.Button("Run").click(causal_trace,
	[ct_prompt, ct_subject, ct_target, ct_noise, ct_seed], ct_out)

	# ---- TAB 6 -------------------------------------------------------------
	with gr.Tab("6 · Edit + verify (ROME loop)"):
	gr.Markdown("""
	### Edit a fact, then prove nothing else broke
	What it does: rewrites the value one fact-MLP key maps to (the exact thing
	ROME/MEMIT do on real models — this is a literal `nn.Module` weight tensor,
	not a token or vocab change), then runs a verification battery over every
	known fact to measure efficacy (target changed), specificity (others
	untouched), and fluency (no entropy collapse).

	Two methods, on purpose:
	- `rank1` — the minimal, surgical update. Only the target fact moves → SURGICAL.
	- `broadcast` — a deliberately sloppy edit that smears the change across all facts → the harness catches the COLLATERAL DAMAGE. This proves the verifier actually works, not just reports "ok" by default.

	Independent LLM review, with a fallback chain — not locked to one vendor:
	tick the box and it tries, in order: Anthropic (Claude, if you give a key)
	→ Hugging Face Inference (any hosted chat model, if you give an HF token)
	→ your own local server (LM Studio / vLLM / Ollama's OpenAI shim — anything
	exposing `/v1/chat/completions`). The first one that's configured and
	reachable answers; the rest are skipped and noted. So your own RTX 5090 can
	be the judge with zero cloud calls if you just fill in the local URL.

	Subjects: `france`, `germany`, `japan`. Answers: `paris, berlin, tokyo, london, rome`.
	After editing, the model stays edited — go look at it in tabs 1–5 (the logit lens
	will show the new answer rising; the trace still localises to L0). Hit Reset
	to restore. Every run is appended to a session log you can download below and
	paste into a future chat for review.
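
In code: a simplified sketch of the two edit styles on a tiny key→value matrix, assuming each fact owns one column of the value matrix. This is not the app's `edit_fact`, and it leaves out the key-covariance solve that ROME uses on real models. Every name here is illustrative.

```python
def rank1_edit(Wout, fact_idx, new_value):
    # Replace one column: W' = W + (new_value - W[:, i]) e_i^T, a rank-one change.
    return [[new_value[r] if c == fact_idx else Wout[r][c] for c in range(len(Wout[r]))]
            for r in range(len(Wout))]

def broadcast_edit(Wout, fact_idx, new_value):
    # Deliberately sloppy: add the same delta to every column.
    delta = [new_value[r] - Wout[r][fact_idx] for r in range(len(Wout))]
    return [[Wout[r][c] + delta[r] for c in range(len(Wout[r]))] for r in range(len(Wout))]

def max_drift(before, after, skip):
    # Largest absolute change in any column other than the edited one.
    return max(abs(after[r][c] - before[r][c])
               for r in range(len(before)) for c in range(len(before[r])) if c != skip)

# Hypothetical value matrix: columns are facts (france, germany),
# rows are output directions (paris, berlin, london).
Wout = [[1.0, 0.0],
        [0.0, 1.0],
        [0.0, 0.0]]
london = [0.0, 0.0, 1.0]

surgical = rank1_edit(Wout, 0, london)
sloppy = broadcast_edit(Wout, 0, london)
print(max_drift(Wout, surgical, skip=0), max_drift(Wout, sloppy, skip=0))   # 0.0 1.0
```

Zero drift on the untouched column is what a specificity pass looks like; the nonzero drift from the broadcast edit is the kind of collateral the verifier is meant to flag.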
	""")
	with gr.Row():
	ed_subj = gr.Textbox(value="france", label="subject")
	ed_new = gr.Textbox(value="london", label="new answer")
	ed_method = gr.Radio(["rank1", "broadcast"], value="rank1", label="method")
	ed_strength = gr.Slider(0.2, 2.0, value=1.0, step=0.1, label="strength")
	ed_llm = gr.Checkbox(value=False, label="also run an independent LLM review")
	with gr.Accordion("LLM review providers (tried in this order)", open=False):
	with gr.Row():
	ed_a_model = gr.Textbox(value="claude-sonnet-4-6", label="1. Anthropic model")
	ed_a_key = gr.Textbox(value="", label="Anthropic API key", type="password")
	with gr.Row():
	ed_h_model = gr.Textbox(value="Qwen/Qwen2.5-7B-Instruct",
	label="2. HF Inference model")
	ed_h_key = gr.Textbox(value="", label="HF token", type="password")
	with gr.Row():
	ed_l_url = gr.Textbox(value="http://192.168.188.25:1234",
	label="3. Local server URL (LM Studio etc.)")
	ed_l_model = gr.Textbox(value="local-model", label="local model name")
	ed_out = gr.Textbox(label="edit + verification report", lines=24)
	with gr.Row():
	gr.Button("Edit & verify", variant="primary").click(
	edit_and_verify,
	[ed_subj, ed_new, ed_method, ed_strength, ed_llm,
	ed_a_key, ed_a_model, ed_h_key, ed_h_model, ed_l_url, ed_l_model],
	ed_out)
	gr.Button("Reset model").click(reset_glassbox, outputs=ed_out)
	gr.Markdown("Session log (every edit run above, appended):")
	with gr.Row():
	log_btn = gr.Button("Write session log to disk")
	log_file = gr.File(label="download")
	log_status = gr.Markdown()
	log_btn.click(export_session_log, outputs=[log_file, log_status])

	# ---- TAB 7 -------------------------------------------------------------
	with gr.Tab("7 · Export / Upload to HF"):
	gr.Markdown("""
	### Ship the toy glass-box
	Export writes a self-contained, reloadable repo: weights (`safetensors`),
	`config.json`, `vocab.json`, a standalone `modeling_glassbox.py` (reload with
	`from modeling_glassbox import load`), and a model card.

	Upload pushes it to the Hub. Choose `model`, `space` (this whole app,
	runnable), or `both`. Paste a write token (or set `HF_TOKEN`).
	""")
	with gr.Row():
	hf_repo = gr.Textbox(value="Chris4K/glassbox-interp", label="repo id")
	hf_what = gr.Radio(["model", "space", "both"], value="model", label="what to push")
	hf_token = gr.Textbox(value="", label="HF write token (optional)", type="password")
	hf_out = gr.Textbox(label="result", lines=6)
	with gr.Row():
	gr.Button("Export locally").click(
	lambda: "Exported to ./%s" % export_glassbox(), outputs=hf_out)
	gr.Button("Upload to HF", variant="primary").click(
	upload_to_hf, [hf_repo, hf_token, hf_what], hf_out)

	gr.Markdown("""
	---
	### Upload a REAL model — e.g. your VINDEX-edited Llama checkpoint
	This does not load the model into memory and does not assume any
	particular architecture — it just pushes whatever's already on disk at
	`local_dir` (the usual `save_pretrained()` layout: `config.json` +
	`*.safetensors` shards + tokenizer files) straight to a new repo. Large
	weights upload fine through `upload_folder`; for very large repos consider
	installing `hf_transfer` for faster throughput. If the base model is gated
	(e.g. `meta-llama/*`), that gate controls who can download the base model; it
	does not block this upload, which only needs write access to the destination repo.
	""")
	with gr.Row():
	rc_dir = gr.Textbox(value="", label="local checkpoint folder (on this machine)")
	rc_repo = gr.Textbox(value="", label="destination repo id, e.g. Chris4K/vindex-llama3-edited")
	with gr.Row():
	rc_token = gr.Textbox(value="", label="HF write token (optional)", type="password")
	rc_private = gr.Checkbox(value=True, label="private repo")
	rc_msg = gr.Textbox(value="upload edited checkpoint", label="commit message")
	rc_out = gr.Textbox(label="result", lines=6)
	gr.Button("Upload real checkpoint", variant="primary").click(
	upload_local_checkpoint, [rc_dir, rc_repo, rc_token, rc_private, rc_msg], rc_out)

	gr.Markdown("""
	---
	### Where this goes next
	- Closing the loop (what "self-improving" would actually require): right now a human picks every edit; the verifier just grades it. A real closed loop needs a policy that proposes edits on its own (e.g. scanning eval failures for wrong facts), auto-applies, and auto-commits only on a SURGICAL verdict, rolling back otherwise. The hard part — the verifier — already exists here; the proposal step doesn't yet.
	- A training-method angle worth taking seriously: instead of accept/reject after the fact, feed the specificity battery's drift score back as a regularizer during the edit computation (closer to elastic weight consolidation, or the null-space projection AlphaEdit-style methods use) so collateral is penalized while solving, not caught after.
	- Real-model MEMIT: the edit loop here is exact because the glass-box's fact layer is literally key→value. The same verify harness (efficacy / specificity / fluency + the multi-provider LLM judge) ports straight onto a gpt2/Llama MEMIT edit — the toy is the regression test you run first.
	- Multi-hop & paraphrase generalization: add `"the currency of france is"` so two relations share a subject, and have the LLM judge auto-generate paraphrase probes to test that an edit generalizes, not just memorizes the one prompt.
	- Attribution view: Geva-style "what does this neuron write to the vocab", per-head attention attribution.
	- It already ships: tab 7 pushes the toy model and this whole app (as a Space) to your Hub, or a real local checkpoint folder to its own repo.
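
In code: a speculative sketch of the closed loop from the first bullet, assuming a proposal list, an `apply_edit` callback and a `verify` callback. All three are hypothetical stand-ins; nothing like this exists in the app yet.

```python
import copy

def closed_loop(weights, proposals, apply_edit, verify):
    # Apply each proposed edit; keep it only if the verifier accepts it.
    log = []
    for edit in proposals:
        snapshot = copy.deepcopy(weights)      # rollback point, cheap for a toy
        apply_edit(weights, edit)
        if verify(snapshot, weights, edit):
            log.append((edit, "committed"))
        else:
            weights.clear()
            weights.update(snapshot)           # roll back in place
            log.append((edit, "rolled back"))
    return log

# Hypothetical stand-ins: weights are a fact table, an edit is (subject, answer, sloppy).
def apply_edit(w, edit):
    subject, answer, sloppy = edit
    for key in (list(w) if sloppy else [subject]):
        w[key] = answer

def verify(before, after, edit):
    subject, answer, _ = edit
    others_same = all(after[k] == before[k] for k in before if k != subject)
    return after[subject] == answer and others_same

facts = {"france": "paris", "germany": "berlin"}
log = closed_loop(facts, [("france", "london", False), ("germany", "rome", True)],
                  apply_edit, verify)
print(facts, log)
```

The sloppy second edit is rolled back because it also overwrote `france`, so only the surgical edit survives.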
	""")

	demo.load(lambda: load_model("glassbox"), outputs=load_status)

	if __name__ == "__main__":
	demo.launch()