# mechinterp-v1-ckpt800
A finetuned Trinity Nano (AfMoE, ~6B active params) that labels SAE features from activation data. Given a feature's stats, top tokens, similar features, and context examples, it outputs a short natural language label and confidence level.
Trained on 4,597 labeled features from a 5,120-feature SAE trained on Baguettotron.
This is checkpoint 800/1150 (loss ~1.68).
SAE explorer: https://lyramakesmusic.github.io/bread-slicer/
## Quality
The model was tested on 100 activations at temperature 0.3. Outputs were graded by Opus 4.6 subagents on the following scale:
- 7 = perfect — captures exactly what the feature does
- 6 = correct — right concept, minor wording differences
- 5 = good — right ballpark, slightly over/under-scoped but usable
- 4 = partial — gets the general area but misses important specifics
- 3 = vague — too broad, too narrow, or captures only a tangent
- 2 = bad — wrong interpretation, hallucinated specifics
- 1 = wrong — completely off, describes something else
Results are mixed for an early checkpoint: labels are generally accurate, but scoping and contextual precision are imperfect.
Overall: 4.78/7 mean, 5/7 median
| Score | Count | Meaning |
|---|---|---|
| 7 | 18 | perfect |
| 6 | 25 | correct |
| 5 | 14 | good |
| 4 | 17 | partial |
| 3 | 14 | vague |
| 2 | 10 | bad |
| 1 | 2 | wrong (`<think>` leaks) |
The model can be used as-is to get estimated labels, but outputs should pass through an external validator or agent to reach high accuracy thresholds. Further RL will likely be needed to improve contextual accuracy; this model is just the baseline.
Original data used for training and as ground truth for the grader was generated using google/gemini-3-flash-preview with reasoning-effort: high, given 25 turns of prefill in the same context alongside an expository system prompt. Manual inspection of the feature labels shows good quality, with confusion mostly limited to polysemantic, highly abstract, or meta features.
## Loading
Q8_0 GGUF, 6.1 GB. Load in LM Studio:

```shell
lms load lyraaaa-exp/mechinterp-v1-ckpt800-Q8_0-GGUF --gpu max -y
```
Uses ChatML template. No system prompt needed.
## Input format
Send a single user message containing `Label the following feature.` followed by the full feature data block. The model responds with a label and confidence tag.
### Prompt components
Each prompt has these sections in order:
**Stats line** — basic activation statistics for the feature:
- `density`: what fraction of tokens in the dataset activate this feature
- `max_act`: strongest activation value seen
- `mean_act`: average activation when it fires
- `fires`: total number of tokens that activated this feature
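These fields fall directly out of the feature's per-token activation records. A minimal sketch of how the stats line could be computed (the `stats_line` helper and its arguments are illustrative, not the real data pipeline):

```python
def stats_line(acts, total_tokens):
    """Compute the stats fields from per-token activation values.

    acts: activation values for every token where the feature fired (> 0).
    total_tokens: total number of tokens in the dataset.
    """
    fires = len(acts)
    density = fires / total_tokens
    max_act = max(acts) if acts else 0.0
    mean_act = sum(acts) / fires if fires else 0.0
    return (f"Stats: density={density:.4%}, max_act={max_act:.2f}, "
            f"mean_act={mean_act:.2f}, fires={fires}")

print(stats_line([2.0, 4.0, 6.0], 1000))
# -> Stats: density=0.3000%, max_act=6.00, mean_act=4.00, fires=3
```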
**Top tokens** — the 20 tokens this feature fires on most often, with counts. This is the single most informative signal for labeling. A feature with top tokens `" research"x487, " Research"x74, " investigation"x51` is probably about the concept of research.
**Similar features (labeled)** — other features in the SAE with high cosine similarity in embedding space, restricted to ones that already have labels. These give context about the neighborhood the feature lives in. Similarity values are cosine similarity scores (0-1). This section is only present if any similar features have labels.
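The similarity scores can be reproduced with a plain cosine similarity between two feature direction vectors; a minimal sketch (comparing SAE decoder rows is one common choice of "embedding space", but this card doesn't confirm which vectors were used):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two feature direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine_sim([1.0, 0.0, 1.0], [1.0, 1.0, 0.0]), 2))  # -> 0.5
```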
**Examples** — 8 context windows from the dataset showing where the feature fires hardest. Each example is a ~50-token window around the peak activation. Tokens are marked with activation strength:
| Markup | Meaning |
|---|---|
| `[[token]]` | Peak activation token (the token with the highest activation in this example) |
| `*token*` | Moderately activated token (activation > 30% of the feature's max) |
| `token` | Not activated or below threshold |
These examples are the core evidence for what the feature does. A feature might fire on `[[.]]` at the end of narrative sentences, or on `[[ baseline]]` in analytical contexts.
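The markup above is just a thresholding pass over per-token activations; a sketch assuming the activations are already aligned with the token strings (the `mark_tokens` helper is illustrative):

```python
def mark_tokens(tokens, acts, max_act):
    """Wrap tokens per the markup table: [[..]] for the peak token,
    *..* for tokens above 30% of the feature's max activation."""
    peak = max(range(len(acts)), key=lambda i: acts[i])
    out = []
    for i, (tok, act) in enumerate(zip(tokens, acts)):
        if i == peak and act > 0:
            out.append(f"[[{tok}]]")
        elif act > 0.3 * max_act:
            out.append(f"*{tok}*")
        else:
            out.append(tok)
    return "".join(out)

print(mark_tokens([" The", " end", "."], [0.0, 1.2, 3.5], 3.5))
# -> " The* end*[[.]]"
```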
## Response format
```
{short label}
[{confidence}]
```
Confidence is one of:
- `[confident]` — clear pattern, top tokens and examples agree (93% of training data)
- `[tentative]` — plausible interpretation but some ambiguity (1.4% of training data)
- `[dead]` — feature never fires or has no discernible pattern (5.5% of training data)
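Since the reply is just a label followed by a bracketed tag, parsing it is a few lines; a minimal sketch (the `parse_reply` helper is illustrative):

```python
import re

def parse_reply(text):
    """Split the model's reply into (label, confidence)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) > 1:
        m = re.fullmatch(r"\[(\w+)\]", lines[-1].strip())
        if m:
            return "\n".join(lines[:-1]), m.group(1)
    return "\n".join(lines), "unknown"

print(parse_reply("sentence-ending punctuation\n[confident]"))
# -> ('sentence-ending punctuation', 'confident')
```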
## Chat template tokens in feature data
Some SAE features fire on tokens inside chat template markup (`<think>`, `</think>`, `<|im_start|>`, etc.) because the SAE was trained on model completions that include these tokens. When these appear raw in the prompt's context examples, the tokenizer interprets them as control tokens rather than literal text, which can cause garbled output.
If you're generating feature data for this model, replace special tokens with bracket equivalents before sending:
| Raw token | Safe replacement |
|---|---|
| `<think>` | `[think]` |
| `</think>` | `[/think]` |
| `<\|im_start\|>` | `[im_start]` |
| `<\|im_end\|>` | `[im_end]` |
| `<\|endoftext\|>` | `[endoftext]` |
| `<\|pad\|>` | `[pad]` |
This preserves the information (you can still tell the feature fires on think-block boundaries) without confusing the tokenizer.
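A sketch of the replacement pass (the mapping mirrors the table above; the `sanitize` helper name is illustrative):

```python
# Raw chat-template tokens mapped to bracket-safe equivalents.
SAFE = {
    "<think>": "[think]",
    "</think>": "[/think]",
    "<|im_start|>": "[im_start]",
    "<|im_end|>": "[im_end]",
    "<|endoftext|>": "[endoftext]",
    "<|pad|>": "[pad]",
}

def sanitize(text):
    """Replace raw template tokens so the tokenizer treats them as literal text."""
    for raw, safe in SAFE.items():
        text = text.replace(raw, safe)
    return text

print(sanitize("fires on <|im_end|> then <think>"))
# -> fires on [im_end] then [think]
```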
## Example interp script
"""Batch-label SAE features using mechinterp-v1."""
import json
import requests
API = "http://localhost:1234/v1/chat/completions"
MODEL = "mechinterp-v1-ckpt800"
def label_feature(prompt_text):
resp = requests.post(API, json={
"model": MODEL,
"messages": [{"role": "user", "content": prompt_text}],
"max_tokens": 150,
"temperature": 0.3,
})
return resp.json()["choices"][0]["message"]["content"].strip()
def build_prompt(feat_id, stats, top_tokens, examples):
"""Build the feature prompt from components."""
lines = [f"Label the following feature.\n\n=== FEATURE {feat_id} ==="]
lines.append(f"Stats: density={stats['density']:.4%}, max_act={stats['max']:.2f}, "
f"mean_act={stats['mean']:.2f}, fires={stats['count']}")
tok_strs = [f'"{t["text"]}"x{t["count"]}' for t in top_tokens[:20]]
lines.append(f"Top tokens: {', '.join(tok_strs)}")
for i, ex in enumerate(examples[:8]):
lines.append(f"\nExample {i+1} (peak={ex['peak']:.2f}):")
lines.append(ex["context"])
return "\n".join(lines)
# -- main --
features = json.load(open("features_to_label.json"))
results = []
for feat in features:
prompt = build_prompt(feat["id"], feat["stats"], feat["top_tokens"], feat["examples"])
output = label_feature(prompt)
# parse label and confidence
parts = output.strip().split("\n\n")
label = parts[0] if parts else output
confidence = parts[1].strip("[]") if len(parts) > 1 else "unknown"
results.append({"feature": feat["id"], "label": label, "confidence": confidence})
print(f"#{feat['id']}: {label} [{confidence}]")
with open("labels_out.jsonl", "w") as f:
for r in results:
f.write(json.dumps(r) + "\n")
print(f"\nLabeled {len(results)} features -> labels_out.jsonl")
## Training details
- Base: arcee-ai/Trinity-Nano-Preview (AfMoE, 128 experts, ~6B active)
- Method: LoRA (rank 16) via unsloth + SFTTrainer
- Data: 4,597 single-turn examples, ChatML format, no system prompt
- Training: 2 epochs, bs=4, max_seq_length=2048, bf16, adamw_8bit
- Checkpoint: 800/1150 steps (loss ~1.68)
- Hardware: RTX 4090 24GB via WSL2
- Quantization: Q8_0 via llama.cpp (6.1 GB)
## Full examples
These are complete, untruncated inputs and outputs. Real prompts are this long — don't truncate them.
### Example 1: active feature, confident label
Input:
```
Label the following feature.

=== FEATURE 100 ===
Stats: density=0.1885%, max_act=256.59, mean_act=109.08, fires=3119
Top tokens: "."×1768, "
"×464, "?"×68, ";"×48, ","×89, ".""×33, ".)"×34, " ""×45, "'"×71, " but"×34, ".""×19, " and"×34, " But"×22, ":**"×33, ":"×21, "!"×13, " ""×15, " The"×37, " And"×11, ".""×6
Similar features (labeled): #4372 "sentence-ending punctuation in analytical or descriptive prose" (sim=0.46), #1994 "structural boundaries - periods, newlines, and list/reference markers" (sim=0.40), #1258 "subject pronouns or copula at the start of narrative sentences" (sim=0.37), #1114 "punctuation introducing direct speech or dialogue, especially quotes and colons after names" (sim=0.34), #4299 "narrative transitions and descriptive scene-setting" (sim=0.32), #157 "comma or conjunction separating items in a list or parallel phrases (multilingual)" (sim=0.30), #2541 "period before a sentence starting with a demonstrative or transition word" (sim=0.29), #3226 "elevated descriptive and narrative prose" (sim=0.28), #3528 "newline or punctuation separating lines in verse or structured lists" (sim=0.28), #1377 "structural delimiters in conceptual outlines and reasoning chains" (sim=0.26)

Example 1 (peak=256.59):
the traditional *besmertny*—the undying tunic—and felt its weight as a mantle of history[[.]] We were the last knights,* and* I was proud to be one of them*.*
*
*Then came the change, as

Example 2 (peak=250.20):
psychology was the crisp, clean dogma of behaviorism, a science being polished to a high sheen of statistical significance[[.]] Klug, a man whose tweed jacket seemed tailored from the very authority he exuded, was a high priest of this

Example 3 (peak=245.63):
' and 'pulmonary complications', a language of clinical detachment that shielded the patient from despair and the institution from panic[[.]]* But* the truth, which Klug knew with the certainty of a man who had stared down death in a thousand forms,

Example 4 (peak=241.38):
His hands, mapped with the fine scars of hot oil and caustic lye, worked not by will but by memory[[.]] He was not making soap today*;* he was presiding over its slow birth, a midwife to a hundred kilograms

Example 5 (peak=241.04):
ueerde een reeks onvermijdelijke scenario's, elk met een gruwelijke, klinische logica[[.]]
*
*De dagen die volgden, vervormden tot een ondragelijke, rekbare substantie*.* De routine

Example 6 (peak=234.71):
en la universidad, vio en ello el renacimiento de una amoralidad que desafiaba toda lógica[[.]] Yo, en cambio, sentí una clase de pavor más íntimo, la certeza de que aquella decisión no

Example 7 (peak=232.79):
true antiquarian, the sense of standing not upon the shoulders of giants, but at the very base of human endeavour itself[[.]]
*
*We began our work, not with the bluster of an excavation, but with the reverence of an unse
```
Output:
```
sentence-ending punctuation, especially periods in narrative prose
[confident]
```
### Example 2: semantic feature with subword activation
Input:
```
Label the following feature.

=== FEATURE 4016 ===
Stats: density=0.0904%, max_act=306.98, mean_act=88.23, fires=1496
Top tokens: " baseline"×151, "eline"×50, " Bas"×56, " base"×71, " reference"×53, " basic"×76, " background"×22, " threshold"×41, " target"×36, " benchmark"×12, " default"×16, " initial"×27, " quo"×20, " basal"×19, """×29, " starting"×16, " point"×18, "ing"×22, " anchor"×15, " standard"×20
Similar features (labeled): #1987 ""foundation", "base", "foundational", "underlying"" (sim=0.53), #2731 ""bas" word-initial string, multilingual (e.g., "basal", "baseline", "basierend")" (sim=0.45), #4375 ""standard", "standards", and "standardization" (multilingual)" (sim=0.40), #4547 ""initial", "initially", and multilingual equivalents (inicialmente, initialement, etc.)" (sim=0.36), #3266 ""precedent" and related terms (multilingual)" (sim=0.35), #2844 ""normal", "norm", and related terms across multiple languages (English, Dutch, French)" (sim=0.35), #4334 ""host", "source", "target", "input", "feedstock", and "substrate"" (sim=0.34), #3012 ""premise" and "premises" in logical or analytical evaluation" (sim=0.34), #3814 ""original", "initial", "native", and multilingual equivalents (e.g., "ursprünglich")" (sim=0.34), #323 ""framework", "platform", "paradigm" as conceptual or structural systems" (sim=0.33)

Example 1 (peak=306.98):
to entry?
- Product differentiation?
- Strategic interdependence?
- Market concentration?
Perfect competition[[ baseline]]:
- Many small firms
- Identical products
- Perfect information
- No barriers
- Price tak

Example 2 (peak=305.72):
other repatriation efforts failed?"
**Parsing the question.**
- "Succeed" = what[[ baseline]]? Survival? Integration? Governance? Economic viability?
- Comparative scope unclear. Which "other efforts"?
-

Example 3 (peak=303.92):
Innovation theory: Constraints → forced creativity → new solutions
Need to establish:
- Bas[[eline]]: What constraints exist in open data AI training?
- Mechanism: How do these constraints force innovation?
-

Example 4 (peak=301.36):
→ bundling into "special menu" vs "separate channels like normal TV"
? **What's the[[ baseline]] comparison?**
- Public channels typically: dedicated channel numbers
- Commercial channels: also have dedicated numbers
-

Example 5 (peak=295.52):
20 years → medium-term prediction
- Core mechanism: continued fusion of regional foods/ingredients
- Comparative[[ baseline]]: "never before" suggests exponential acceleration of current trends
- Outcome focus: global cuisine evolution
※ Translation

Example 6 (peak=292.65):
? Semantic ambiguity: "normal pasta" = wheat pasta? durum wheat? other grains?
? Comparative claim needs[[ baseline]] definition
"why do some people put cheese on top but not others" -
→ Cultural variation hypothesis
→

Example 7 (peak=291.58):
issue: "would happen" = predictive modeling. Complex.
### 2. Current Catholic Fertility Context
Bas[[eline]] data needed:
- Catholic birth rates globally: ~16-17/1000 vs Protestant ~13-14/1000 vs

Example 8 (peak=291.49):
"than normal people who don't have much money"
→ Comparative framework established
→ But "normal people" = undefined[[ baseline]]
### 2. Conceptual Framework
**Leisure Class Theory Context:**
Thorstein Veblen
```
Output:
"baseline", "reference", or "benchmark" for comparison
[confident]
### Example 3: dead feature
Input:
```
Label the following feature.

=== FEATURE 1644 ===
Stats: density=0.0000%, max_act=0.00, mean_act=0.00, fires=0
Top tokens: (none)
Similar features (labeled): #3422 ""marginal", "margin", and "marginalization" (economic, sociological, and spatial)" (sim=0.14), #2062 "outlining steps in a response strategy or framework" (sim=0.14), #4057 "subsequent term in a contrast or parallel structure" (sim=0.14), #4320 ""|" in "<|im_end|>"" (sim=0.14), #3832 "mathematical comparison and set inclusion symbols - ">", ">>", "⊂", "≈"" (sim=0.13), #2010 ""vs" comparing terms, concepts, or translations" (sim=0.13), #1367 "subword fragments in technical or analytical reasoning" (sim=0.12), #2737 "subsequent terms in coordinate pairs or structured lists" (sim=0.12), #2807 "dead" (sim=0.12), #770 ""ul" subword fragment" (sim=0.12)
```
Output:
```
dead
[dead]
```