mechinterp-v1-ckpt800

A finetuned Trinity Nano (AfMoE, ~6B active params) that labels SAE features from activation data. Given a feature's stats, top tokens, similar features, and context examples, it outputs a short natural language label and confidence level.

Trained on 4,597 labeled features from a 5,120-feature SAE trained on Baguettotron.

This is checkpoint 800/1150 (loss ~1.68).

SAE explorer: https://lyramakesmusic.github.io/bread-slicer/

Quality

The model was tested on 100 activations at temperature 0.3; its outputs were graded by opus 4.6 subagents on the following scale:

  • 7 = perfect — captures exactly what the feature does
  • 6 = correct — right concept, minor wording differences
  • 5 = good — right ballpark, slightly over/under-scoped but usable
  • 4 = partial — gets the general area but misses important specifics
  • 3 = vague — too broad, too narrow, or captures only a tangent
  • 2 = bad — wrong interpretation, hallucinated specifics
  • 1 = wrong — completely off, describes something else

Early results are mixed: labels are generally accurate, but context and scope are often imperfect:

Overall: 4.78/7 mean, 5/7 median

Score  Count  Meaning
7      18     perfect
6      25     correct
5      14     good
4      17     partial
3      14     vague
2      10     bad
1      2      wrong (leaks)
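The summary statistics follow directly from the score distribution above; a quick sanity check:

```python
# Score distribution from the table above: score -> count (100 graded labels).
counts = {7: 18, 6: 25, 5: 14, 4: 17, 3: 14, 2: 10, 1: 2}

scores = [s for s, n in counts.items() for _ in range(n)]
mean = sum(scores) / len(scores)

# Median of 100 items: average of the 50th and 51st values when sorted.
ordered = sorted(scores)
median = (ordered[49] + ordered[50]) / 2

print(f"mean={mean:.2f}, median={median:g}")  # mean=4.78, median=5
```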

The model can be used as-is to get rough labels, but outputs should pass through an external validator or agent to reach high accuracy thresholds. Further RL would likely improve contextual accuracy; this checkpoint is just the baseline.

The original data, used both for training and as ground truth for the grader, was generated with google/gemini-3-flash-preview at reasoning-effort: high, given 25 turns of prefill in the same context alongside an expository system prompt. Manual inspection of the feature labels shows good quality, with confusion mostly limited to polysemantic, highly abstract, or meta features.

Loading

Q8_0 GGUF, 6.1 GB. Load in LM Studio:

lms load lyraaaa-exp/mechinterp-v1-ckpt800-Q8_0-GGUF --gpu max -y

Uses ChatML template. No system prompt needed.

Input format

Send a single user message containing "Label the following feature." followed by the full feature data block. The model responds with a short label and a confidence tag.

Prompt components

Each prompt has these sections in order:

Stats line — basic activation statistics for the feature:

  • density: what fraction of tokens in the dataset activate this feature
  • max_act: strongest activation value seen
  • mean_act: average activation when it fires
  • fires: total number of tokens that activated this feature

Top tokens — the 20 tokens this feature fires on most often, with counts. This is the single most informative signal for labeling. A feature with top tokens " research"x487, " Research"x74, " investigation"x51 is probably about the concept of research.

Similar features (labeled) — other features in the SAE with high cosine similarity in embedding space, but only ones that already have labels. These give context about the neighborhood the feature lives in. Similarity values are cosine similarity scores (0-1). Only present if any similar features have labels.

Examples — 8 context windows from the dataset showing where the feature fires hardest. Each example is a ~50 token window around the peak activation. Tokens are marked with activation strength:

Markup     Meaning
[[token]]  peak activation token (the token with the highest activation in this example)
*token*    moderately activated token (activation > 30% of the feature's max)
token      not activated or below threshold

These examples are the core evidence for what the feature does. A feature might fire on [[.]] at the end of narrative sentences, or on [[ baseline]] in analytical contexts.
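If you are generating context examples yourself, the markup above can be produced with a sketch like this (function and variable names are illustrative; the 30% threshold matches the table):

```python
def render_example(tokens, acts, feature_max):
    """Mark the peak token with [[...]], tokens above 30% of the
    feature's max activation with *...*, and leave the rest plain."""
    peak = max(range(len(tokens)), key=lambda i: acts[i])
    out = []
    for i, (tok, act) in enumerate(zip(tokens, acts)):
        if i == peak and act > 0:
            out.append(f"[[{tok}]]")
        elif act > 0.3 * feature_max:
            out.append(f"*{tok}*")
        else:
            out.append(tok)
    return "".join(out)

print(render_example([" He", " was", " proud", "."],
                     [0.0, 0.0, 100.0, 250.0], 256.59))
# → " He was* proud*[[.]]"
```

Note that tokens keep their leading spaces inside the markers, matching the `* and*` style seen in the full examples below.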

Response format

{short label}

[{confidence}]

Confidence is one of:

  • [confident] — clear pattern, top tokens and examples agree (93% of training data)
  • [tentative] — plausible interpretation but some ambiguity (1.4% of training data)
  • [dead] — feature never fires or has no discernible pattern (5.5% of training data)
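The two-part response can be parsed and sanity-checked with a small sketch like this (hypothetical helper; the valid tags are the three listed above):

```python
VALID_TAGS = {"confident", "tentative", "dead"}

def parse_response(text):
    """Split model output into (label, confidence); reject malformed tags."""
    parts = text.strip().split("\n\n")
    label = parts[0].strip()
    tag = parts[1].strip().strip("[]") if len(parts) > 1 else ""
    if tag not in VALID_TAGS:
        raise ValueError(f"unexpected confidence tag: {tag!r}")
    return label, tag

print(parse_response("sentence-ending punctuation\n\n[confident]"))
# → ('sentence-ending punctuation', 'confident')
```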

Chat template tokens in feature data

Some SAE features fire on tokens inside chat template markup (<think>, </think>, <|im_start|>, etc.) because the SAE was trained on model completions that include these tokens. When these appear raw in the prompt's context examples, the tokenizer interprets them as control tokens rather than literal text, which can cause garbled output.

If you're generating feature data for this model, replace special tokens with bracket equivalents before sending:

Raw token        Safe replacement
<think>          [think]
</think>         [/think]
<|im_start|>     [im_start]
<|im_end|>       [im_end]
<|endoftext|>    [endoftext]
<|pad|>          [pad]

This preserves the information (you can still tell the feature fires on think-block boundaries) without confusing the tokenizer.
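A straightforward way to apply these replacements before building the prompt:

```python
SPECIAL_TOKEN_MAP = {
    "<think>": "[think]",
    "</think>": "[/think]",
    "<|im_start|>": "[im_start]",
    "<|im_end|>": "[im_end]",
    "<|endoftext|>": "[endoftext]",
    "<|pad|>": "[pad]",
}

def sanitize(text):
    """Replace raw chat-template tokens with bracket equivalents.

    None of the raw tokens is a substring of another, so a simple
    replacement loop is order-independent and safe.
    """
    for raw, safe in SPECIAL_TOKEN_MAP.items():
        text = text.replace(raw, safe)
    return text
```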

Example interp script

"""Batch-label SAE features using mechinterp-v1."""
import json
import requests

API = "http://localhost:1234/v1/chat/completions"
MODEL = "mechinterp-v1-ckpt800"

def label_feature(prompt_text):
    resp = requests.post(API, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt_text}],
        "max_tokens": 150,
        "temperature": 0.3,
    }, timeout=120)
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return resp.json()["choices"][0]["message"]["content"].strip()


def build_prompt(feat_id, stats, top_tokens, examples):
    """Build the feature prompt from components."""
    lines = [f"Label the following feature.\n\n=== FEATURE {feat_id} ==="]
    lines.append(f"Stats: density={stats['density']:.4%}, max_act={stats['max']:.2f}, "
                 f"mean_act={stats['mean']:.2f}, fires={stats['count']}")

    tok_strs = [f'"{t["text"]}"x{t["count"]}' for t in top_tokens[:20]]
    lines.append(f"Top tokens: {', '.join(tok_strs)}")

    for i, ex in enumerate(examples[:8]):
        lines.append(f"\nExample {i+1} (peak={ex['peak']:.2f}):")
        lines.append(ex["context"])

    return "\n".join(lines)


# -- main --
with open("features_to_label.json") as f:
    features = json.load(f)

results = []
for feat in features:
    prompt = build_prompt(feat["id"], feat["stats"], feat["top_tokens"], feat["examples"])
    output = label_feature(prompt)

    # parse label and confidence
    parts = output.strip().split("\n\n")
    label = parts[0] if parts else output
    confidence = parts[1].strip("[]") if len(parts) > 1 else "unknown"

    results.append({"feature": feat["id"], "label": label, "confidence": confidence})
    print(f"#{feat['id']}: {label} [{confidence}]")

with open("labels_out.jsonl", "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")

print(f"\nLabeled {len(results)} features -> labels_out.jsonl")

Training details

  • Base: arcee-ai/Trinity-Nano-Preview (AfMoE, 128 experts, ~6B active)
  • Method: LoRA (rank 16) via unsloth + SFTTrainer
  • Data: 4,597 singleturn examples, ChatML format, no system prompt
  • Training: 2 epochs, bs=4, max_seq_length=2048, bf16, adamw_8bit
  • Checkpoint: 800/1150 steps (loss ~1.68)
  • Hardware: RTX 4090 24GB via WSL2
  • Quantization: Q8_0 via llama.cpp (6.1 GB)

Full examples

These are complete, untruncated inputs and outputs. Real prompts are this long — don't truncate them.

Example 1: active feature, confident label

Input:

Label the following feature.

=== FEATURE 100 ===
Stats: density=0.1885%, max_act=256.59, mean_act=109.08, fires=3119
Top tokens: "."×1768, "
"×464, "?"×68, ";"×48, ","×89, ".""×33, ".)"×34, " ""×45, "'"×71, " but"×34, ".""×19, " and"×34, " But"×22, ":**"×33, ":"×21, "!"×13, " ""×15, " The"×37, " And"×11, ".""×6
Similar features (labeled): #4372 "sentence-ending punctuation in analytical or descriptive prose" (sim=0.46), #1994 "structural boundaries - periods, newlines, and list/reference markers" (sim=0.40), #1258 "subject pronouns or copula at the start of narrative sentences" (sim=0.37), #1114 "punctuation introducing direct speech or dialogue, especially quotes and colons after names" (sim=0.34), #4299 "narrative transitions and descriptive scene-setting" (sim=0.32), #157 "comma or conjunction separating items in a list or parallel phrases (multilingual)" (sim=0.30), #2541 "period before a sentence starting with a demonstrative or transition word" (sim=0.29), #3226 "elevated descriptive and narrative prose" (sim=0.28), #3528 "newline or punctuation separating lines in verse or structured lists" (sim=0.28), #1377 "structural delimiters in conceptual outlines and reasoning chains" (sim=0.26)

Example 1 (peak=256.59):
 the traditional *besmertny*—the undying tunic—and felt its weight as a mantle of history[[.]] We were the last knights,* and* I was proud to be one of them*.*
*
*Then came the change, as

Example 2 (peak=250.20):
 psychology was the crisp, clean dogma of behaviorism, a science being polished to a high sheen of statistical significance[[.]] Klug, a man whose tweed jacket seemed tailored from the very authority he exuded, was a high priest of this

Example 3 (peak=245.63):
' and 'pulmonary complications', a language of clinical detachment that shielded the patient from despair and the institution from panic[[.]]* But* the truth, which Klug knew with the certainty of a man who had stared down death in a thousand forms,

Example 4 (peak=241.38):
 His hands, mapped with the fine scars of hot oil and caustic lye, worked not by will but by memory[[.]] He was not making soap today*;* he was presiding over its slow birth, a midwife to a hundred kilograms

Example 5 (peak=241.04):
ueerde een reeks onvermijdelijke scenario's, elk met een gruwelijke, klinische logica[[.]]
*
*De dagen die volgden, vervormden tot een ondragelijke, rekbare substantie*.* De routine

Example 6 (peak=234.71):
 en la universidad, vio en ello el renacimiento de una amoralidad que desafiaba toda lógica[[.]] Yo, en cambio, sentí una clase de pavor más íntimo, la certeza de que aquella decisión no

Example 7 (peak=232.79):
 true antiquarian, the sense of standing not upon the shoulders of giants, but at the very base of human endeavour itself[[.]]
*
*We began our work, not with the bluster of an excavation, but with the reverence of an unse

Output:

sentence-ending punctuation, especially periods in narrative prose

[confident]

Example 2: semantic feature with subword activation

Input:

Label the following feature.

=== FEATURE 4016 ===
Stats: density=0.0904%, max_act=306.98, mean_act=88.23, fires=1496
Top tokens: " baseline"×151, "eline"×50, " Bas"×56, " base"×71, " reference"×53, " basic"×76, " background"×22, " threshold"×41, " target"×36, " benchmark"×12, " default"×16, " initial"×27, " quo"×20, " basal"×19, """×29, " starting"×16, " point"×18, "ing"×22, " anchor"×15, " standard"×20
Similar features (labeled): #1987 ""foundation", "base", "foundational", "underlying"" (sim=0.53), #2731 ""bas" word-initial string, multilingual (e.g., "basal", "baseline", "basierend")" (sim=0.45), #4375 ""standard", "standards", and "standardization" (multilingual)" (sim=0.40), #4547 ""initial", "initially", and multilingual equivalents (inicialmente, initialement, etc.)" (sim=0.36), #3266 ""precedent" and related terms (multilingual)" (sim=0.35), #2844 ""normal", "norm", and related terms across multiple languages (English, Dutch, French)" (sim=0.35), #4334 ""host", "source", "target", "input", "feedstock", and "substrate"" (sim=0.34), #3012 ""premise" and "premises" in logical or analytical evaluation" (sim=0.34), #3814 ""original", "initial", "native", and multilingual equivalents (e.g., "ursprünglich")" (sim=0.34), #323 ""framework", "platform", "paradigm" as conceptual or structural systems" (sim=0.33)

Example 1 (peak=306.98):
 to entry?
- Product differentiation?
- Strategic interdependence?
- Market concentration?

Perfect competition[[ baseline]]:
- Many small firms
- Identical products
- Perfect information
- No barriers
- Price tak

Example 2 (peak=305.72):
 other repatriation efforts failed?"

**Parsing the question.**
- "Succeed" = what[[ baseline]]? Survival? Integration? Governance? Economic viability?
- Comparative scope unclear. Which "other efforts"?
-

Example 3 (peak=303.92):

Innovation theory: Constraints → forced creativity → new solutions

Need to establish:
- Bas[[eline]]: What constraints exist in open data AI training?
- Mechanism: How do these constraints force innovation?
-

Example 4 (peak=301.36):
 → bundling into "special menu" vs "separate channels like normal TV"

? **What's the[[ baseline]] comparison?**
- Public channels typically: dedicated channel numbers
- Commercial channels: also have dedicated numbers
-

Example 5 (peak=295.52):
 20 years → medium-term prediction
- Core mechanism: continued fusion of regional foods/ingredients
- Comparative[[ baseline]]: "never before" suggests exponential acceleration of current trends
- Outcome focus: global cuisine evolution

※ Translation

Example 6 (peak=292.65):
? Semantic ambiguity: "normal pasta" = wheat pasta? durum wheat? other grains?
? Comparative claim needs[[ baseline]] definition

"why do some people put cheese on top but not others" -
→ Cultural variation hypothesis
→

Example 7 (peak=291.58):
 issue: "would happen" = predictive modeling. Complex.

### 2. Current Catholic Fertility Context

Bas[[eline]] data needed:
- Catholic birth rates globally: ~16-17/1000 vs Protestant ~13-14/1000 vs

Example 8 (peak=291.49):
"than normal people who don't have much money"
→ Comparative framework established
→ But "normal people" = undefined[[ baseline]]

### 2. Conceptual Framework

**Leisure Class Theory Context:**
Thorstein Veblen

Output:

"baseline", "reference", or "benchmark" for comparison

[confident]

Example 3: dead feature

Input:

Label the following feature.

=== FEATURE 1644 ===
Stats: density=0.0000%, max_act=0.00, mean_act=0.00, fires=0
Top tokens: (none)
Similar features (labeled): #3422 ""marginal", "margin", and "marginalization" (economic, sociological, and spatial)" (sim=0.14), #2062 "outlining steps in a response strategy or framework" (sim=0.14), #4057 "subsequent term in a contrast or parallel structure" (sim=0.14), #4320 ""|" in "<|im_end|>"" (sim=0.14), #3832 "mathematical comparison and set inclusion symbols - ">", ">>", "⊂", "≈"" (sim=0.13), #2010 ""vs" comparing terms, concepts, or translations" (sim=0.13), #1367 "subword fragments in technical or analytical reasoning" (sim=0.12), #2737 "subsequent terms in coordinate pairs or structured lists" (sim=0.12), #2807 "dead" (sim=0.12), #770 ""ul" subword fragment" (sim=0.12)

Output:

dead

[dead]