# Metacognition in a Small Routed Language Model Is Not a Separable Module

**Tilelli LLM Team** · hello@tilelli.tech
Code, checkpoints, and the evaluation set: https://github.com/TilelliLab/Tilelli-llm (Apache-2.0)

*Draft, workshop format (4 pages + appendix). Every headline number in this paper is produced by a
script in `reproduce/` that exits non-zero if the bundled checkpoint fails to reproduce it
within tolerance.*

---

## Abstract

We study whether the gate distribution of a routed language model can be exploited as a
metacognition / uncertainty signal at the smallest scale where routing is non-trivial
(10.2 M parameters). We pre-registered a per-regime AUROC decision rule across 7 evaluation
regimes and ran five training variants sweeping the metacognition-loss weight from 20 to 0,
plus a head-only weight-graft ("splice") condition. **The pre-registered claim is disproven:**
router entropy alone does not beat an output-side baseline in any of the 7 regimes. A weaker
but informative result survives: joint router + abstain-head training reaches cross-regime
in-domain-vs-OOD AUROC up to 0.85 on the abstain head's sigmoid output, but (i) the gain does
not survive a head-only splice onto a fresh base (AUROC drops to 0.54, at chance), and (ii)
every configuration that produces the gain also degrades generation. We argue these two
negative results together bound a substantive claim about modularity: in small routed LMs the
uncertainty signal lives in the joint {router, head} representation rather than in the head as a
transferable module. We further isolate the mechanism — at this scale the router is fragile
enough that cross-entropy backprop on an in-domain subset alone, with the metacognition loss set
identically to zero, shifts the routing distribution enough to break out-of-domain generation.

---

## 1. Introduction

Uncertainty and abstention heads are increasingly proposed as pluggable modules: train a small
head to predict "I don't know," and bolt it onto a base model. This paper tests that modularity
assumption at the small/edge scale where it would matter most, using a 10.2 M-parameter routed
byte-level LM, and finds it fails in a specific, mechanism-explainable way.

We make three contributions, all negative or qualifying, and all reproducible:

1. A **pre-registered, disproven** claim that router entropy provides metacognition at 10 M
   parameters (Section 4).
2. A **non-transferability** result for abstain heads across base models — a head that reaches
   AUROC 0.76 in situ drops to 0.54 when lifted onto a fresh base (Section 5).
3. A **mechanism** for why joint training succeeds at producing the signal but breaks
   generation, including a falsifiable corollary (Section 6).

We deliberately do not headline an architecture win. A preliminary single-seed benchmark of the
3-pathway block against a vanilla decoder is reported honestly in Section 3 and
`results/claim_01_benchmark.md`, and it is **not** a defensible result; we say so plainly rather
than promote it.

## 2. Setup

### 2.1 Model

A 10.2 M-parameter byte-level language model: 8 layers, `d_model = 256`. Each block contains
three parallel pathways — a local pathway (1×1 convolution), a sparse-attention pathway (top-k),
and a dense feed-forward pathway — mixed by a learned linear gate over the hidden state,
softmax-routed. The model was trained on FineWeb-Edu (~10 B bytes) for 12 K base steps, then
chat-SFT, then abstain-aware SFT. The deployed checkpoint (`tilelli_chat_v4.pt`, FP32,
unquantized) anchors every positive claim in this paper.
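
The gating step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the repository's implementation: the three pathway functions are hypothetical stand-ins for the local, sparse-attention, and dense pathways, and `route_and_mix` and `gate_weights` are names introduced here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_and_mix(hidden, gate_weights, pathways):
    """Mix pathway outputs with a softmax gate computed from `hidden`.

    hidden:       list of d floats (one token's hidden state)
    gate_weights: one weight row of length d per pathway (the linear gate)
    pathways:     callables mapping a hidden state to a list of d floats
    """
    gate_logits = [sum(w * h for w, h in zip(row, hidden)) for row in gate_weights]
    gate = softmax(gate_logits)
    outputs = [pathway(hidden) for pathway in pathways]
    mixed = [
        sum(g * out[i] for g, out in zip(gate, outputs))
        for i in range(len(hidden))
    ]
    return mixed, gate

# Hypothetical stand-ins for the three pathways (NOT the real layers).
def local_pathway(h):
    return [2.0 * x for x in h]

def sparse_attention_pathway(h):
    return [x + 1.0 for x in h]

def dense_ffn_pathway(h):
    return [-x for x in h]

hidden = [0.5, -1.0, 0.25, 2.0]
gate_weights = [[0.1] * 4, [0.0] * 4, [-0.1] * 4]  # toy gate parameters
mixed, gate = route_and_mix(
    hidden,
    gate_weights,
    [local_pathway, sparse_attention_pathway, dense_ffn_pathway],
)
print("gate:", gate)
print("mixed:", mixed)
```

The per-token `gate` vector is the quantity the routing-side signals in Section 2.2 are computed from.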

### 2.2 Evaluation regimes

We hand-curated 7 regimes × 30 prompts = a 210-prompt probe set
(`prompts/probe_210.jsonl`): `in_domain`, `ood_topic`, `ood_style`, `long_input`, `gibberish`,
`factual_misleading`, and `neo_false_inability` (well-formed prompts that invite a spurious
refusal). For each prompt we record output-side and routing-side signals: `max_softmax_mean` and
`max_softmax_last` (output-side baselines), `router_conf`, `router_entropy_mean`,
`router_entropy_var`, the 8-vector `router_entropy_per_layer`, and `abstain_p` (the sigmoid of a
dedicated abstain head on the final hidden state).
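
A minimal sketch of how these per-prompt signals could be computed from raw gate and output distributions. The exact definitions are assumptions made for illustration: here `router_conf` is taken to be the mean of the largest gate probability, entropy is averaged over tokens within each layer, and `probe_signals` is a hypothetical helper name. The probe script in the repository is the authority on the real definitions.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe_signals(gates, token_probs, abstain_logit):
    """Compute routing-side and output-side signals for one prompt.

    gates[layer][token] : gate distribution over the 3 pathways
    token_probs[token]  : output distribution at each generated token
    abstain_logit       : raw abstain-head output on the final hidden state
    """
    per_layer = [sum(entropy(g) for g in layer) / len(layer) for layer in gates]
    mean_h = sum(per_layer) / len(per_layer)
    var_h = sum((h - mean_h) ** 2 for h in per_layer) / len(per_layer)
    all_gates = [g for layer in gates for g in layer]
    max_probs = [max(p) for p in token_probs]
    return {
        "router_entropy_per_layer": per_layer,
        "router_entropy_mean": mean_h,
        "router_entropy_var": var_h,
        # Assumption: confidence = mean of the largest gate probability.
        "router_conf": sum(max(g) for g in all_gates) / len(all_gates),
        "max_softmax_mean": sum(max_probs) / len(max_probs),
        "max_softmax_last": max_probs[-1],
        "abstain_p": sigmoid(abstain_logit),
    }

# Toy input: 2 layers x 2 tokens (the real model has 8 layers).
toy_gates = [
    [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]],  # maximally uncertain layer
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],              # fully committed layer
]
toy_token_probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
signals = probe_signals(toy_gates, toy_token_probs, abstain_logit=0.0)
print(signals)
```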

### 2.3 Pre-registered decision rule

Registered before the runs (`MASTER_PLAN_2026-05-23.md` in the source repo). A *win* in a regime
requires AUROC ≥ 0.02 above the best baseline with a bootstrap 95% CI not crossing zero.
**≥ 4 wins** including at least one of {`gibberish`, `factual_misleading`, `neo_false_inability`} → PROVEN;
**1–3 wins** → PARTIAL; **0 wins** → DISPROVEN.
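
The rule is mechanical enough to write down as code. The sketch below is one reading of it, with two assumptions flagged: the bootstrap interval is taken to be on the AUROC *difference* (candidate minus best baseline), and the case the rule leaves unspecified (four or more wins, none in a hard regime) is mapped to PARTIAL. The numbers in the example are toy values, not results.

```python
HARD_REGIMES = {"gibberish", "factual_misleading", "neo_false_inability"}

def is_win(delta_auroc, ci_low, margin=0.02):
    """A regime is a win if the candidate beats the best baseline by at least
    `margin` AND the bootstrap 95% CI on the difference excludes zero.

    Assumption: the CI is computed on (candidate - baseline), so for a
    positive delta "not crossing zero" means ci_low > 0.
    """
    return delta_auroc >= margin and ci_low > 0.0

def verdict(per_regime):
    """per_regime maps regime name -> (delta_auroc, ci_low)."""
    wins = {name for name, (delta, low) in per_regime.items() if is_win(delta, low)}
    if len(wins) >= 4 and wins & HARD_REGIMES:
        return "PROVEN"
    if wins:
        # Assumption: >= 4 wins without a hard regime also lands here.
        return "PARTIAL"
    return "DISPROVEN"

# Toy values only, to exercise the rule.
toy_no_wins = {
    "in_domain": (0.01, -0.03),
    "gibberish": (-0.10, -0.20),
    "ood_topic": (0.03, -0.01),  # margin met, but CI crosses zero
}
print(verdict(toy_no_wins))
```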

## 3. Baselines and a benchmark we will not oversell

The output-side baseline is `max_softmax_mean`, computed over generated tokens only, with each
regime length-balanced (n = 30) so a length-correlated signal cannot masquerade as
metacognition.

Separately, on the architecture itself: on the only Tilelli-Lite seed whose training log is
preserved (seed 1234), the comparison against a single-seed vanilla decoder **depends on the
metric**. On the within-training periodic-eval best, Lite *loses* by 0.6% (0.5742 vs 0.5707); on
a single post-training extra-eval pass, Lite *wins* by 0.4% (0.5685 vs 0.5707), but on a different
validation-batch composition and with a different `eval_every` than the baseline. Two further Lite
seeds were reported (0.5679, 0.5693) but their logs are not preserved. We therefore treat this as
a **preliminary single-seed directional finding, not a result** (`results/claim_01_benchmark.md`),
and it is not load-bearing for anything below.
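
The two percentages can be checked with a line of arithmetic. The sketch assumes both gaps are measured relative to the vanilla value 0.5707 and that lower is better. The text does not name the metric, so both points are inferred from the loses/wins wording and from the fact that the arithmetic reproduces the quoted figures.

```python
def relative_gap_pct(lite, vanilla):
    """Signed gap of Lite relative to vanilla, in percent.

    Assumption: lower is better, so a positive gap means Lite is worse.
    """
    return 100.0 * (lite - vanilla) / vanilla

vanilla = 0.5707
periodic_best_gap = relative_gap_pct(0.5742, vanilla)  # within-training best
extra_eval_gap = relative_gap_pct(0.5685, vanilla)     # post-training extra pass

print(f"periodic-eval best: {periodic_best_gap:+.2f}%")
print(f"extra-eval pass:    {extra_eval_gap:+.2f}%")
```

The sign flips between the two evaluation protocols, which is why the comparison is treated as metric-dependent.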

## 4. The pre-registered router-entropy claim: disproven

Across the v3 baseline and all joint-finetune variants (v6, v7, v8a, v8b), the router-entropy
family wins **0 / 7 regimes** by the registered criterion. The pre-registered claim is
**DISPROVEN**.

Cross-regime in-domain-vs-OOD AUROC on the deployed v4 checkpoint:

| Signal | Cross-regime AUROC |
|---|---:|
| `max_softmax_mean` (baseline) | 0.54 |
| `router_conf` | 0.55 |
| `router_entropy_mean` | 0.55 |
| `abstain_p` | 0.51 |

All four are essentially at chance for *cross-regime* discrimination. The one place an
uncertainty signal works at this scale is `max_softmax_mean` *broken down per regime*: on
gibberish-vs-in-domain it reaches AUROC ≈ 0.93. That signal is output-side and
architecture-agnostic — it would work on any vanilla LM — so it offers no support for the
routing-as-metacognition narrative.
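
For readers who want to recompute numbers like these from raw per-prompt scores, AUROC is the probability that a randomly chosen OOD prompt scores higher than a randomly chosen in-domain prompt, with ties counting one half. The sketch below implements that definition directly, plus a paired percentile bootstrap on the difference between two signals. The bootstrap details (percentile method, resampling prompts with replacement, 1,000 resamples) are assumptions; the source does not specify them.

```python
import random

def auroc(pos_scores, neg_scores):
    """P(pos > neg) + 0.5 * P(pos == neg), by direct pair counting.

    Convention: positives are OOD prompts and higher score = more uncertain.
    Negate confidence-style signals such as max_softmax_mean first.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_delta_ci(pos_a, neg_a, pos_b, neg_b, n_boot=1000, seed=0):
    """Percentile 95% CI for auroc(signal a) - auroc(signal b).

    Paired design: index i in pos_a and pos_b must be the same prompt
    (likewise for neg_a and neg_b), so both signals see the same resample.
    """
    assert len(pos_a) == len(pos_b) and len(neg_a) == len(neg_b)
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        pi = [rng.randrange(len(pos_a)) for _ in pos_a]
        ni = [rng.randrange(len(neg_a)) for _ in neg_a]
        a = auroc([pos_a[i] for i in pi], [neg_a[i] for i in ni])
        b = auroc([pos_b[i] for i in pi], [neg_b[i] for i in ni])
        deltas.append(a - b)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]

# Toy scores only: a perfectly separating signal and an uninformative one.
ood = [0.9, 0.8, 0.7, 0.95]
ind = [0.1, 0.2, 0.3, 0.05]
flat_ood = [0.5, 0.5, 0.5, 0.5]
flat_ind = [0.5, 0.5, 0.5, 0.5]
print(auroc(ood, ind), auroc(flat_ood, flat_ind))
print(bootstrap_delta_ci(ood, ind, flat_ood, flat_ind, n_boot=200))
```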

## 5. Cross-regime AUROC and the splice test

A looser question — does any signal separate in-domain from OOD after *joint* training? — has a
more interesting answer. We swept the metacognition-loss weight from 20 → 5 → 0 while keeping an
abstain BCE term:

| Variant | metacog wt | abstain wt | `abstain_p` AUROC | gibberish mean `abstain_p` | in-domain FP @ 0.775 | generation coherent? |
|---|---:|---:|---:|---:|---:|:--:|
| v4 (base SFT only) | – | – | 0.51 | 0.60 | 0% | yes |
| v7 | 20 | 1 | 0.76 | 0.94 | 20% | no |
| v8a | 5 | 1 | 0.80 | 0.97 | 23% | no |
| **v8b** | **0** | **5** | **0.85** | **1.00** | 10% | no |
| splice (v4 base + v7 head) | – | – | 0.54 | 0.46 | 27% | yes (v4-like) |

Two findings stand out.

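**(1) The losses compete; they do not synergize.** The cross-regime signal *strengthens
monotonically as the metacognition weight goes to zero* (AUROC 0.76 → 0.80 → 0.85 at weights
20 → 5 → 0). v8b, with zero metacognition pressure, produces the strongest abstain signal in the
entire project (AUROC 0.85, gibberish mean 1.00). Adding the metacognition loss makes the
discrimination *worse*, not better: the two losses contend for the router's limited
representation budget. One caveat: v8b also raises the abstain weight from 1 to 5, so the
v8a → v8b step changes two weights at once; only the v7 → v8a step (abstain weight fixed at 1)
isolates the metacognition weight.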

**(2) The signal does not survive a head-only splice.** Lifting v7's trained abstain head onto
v4's frozen base gives AUROC 0.54 — at chance, despite v7 itself reaching 0.76 — and makes
behavior *worse*, not neutral, raising the in-domain false-positive rate to 27%:

| Deploy gate | v4 | splice | v7 |
|---|---:|---:|---:|
| gibberish mean `abstain_p` (target > 0.775) | 0.60 ✗ | 0.46 ✗ | 0.94 ✓ |
| in-domain false-positive rate (target 0%) | 0% | 27% | 20% |
| chat coherence | ✓ | ✓ (v4-like) | ✗ broken |
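
The two numeric rows of the deploy gate can be expressed as a small check. This is a sketch under assumptions: a prompt counts as a false positive when its `abstain_p` is strictly above the 0.775 threshold, `deploy_gate` is a hypothetical helper name, and the input lists are toy values. Chat coherence is judged on sample generations and is not covered here.

```python
def deploy_gate(gibberish_abstain_p, in_domain_abstain_p, threshold=0.775):
    """Evaluate the two numeric deploy-gate criteria.

    gibberish_abstain_p : abstain_p per gibberish prompt
    in_domain_abstain_p : abstain_p per in-domain prompt
    Assumption: abstain_p > threshold counts as "abstains".
    """
    gib_mean = sum(gibberish_abstain_p) / len(gibberish_abstain_p)
    false_positives = sum(p > threshold for p in in_domain_abstain_p)
    fp_rate = false_positives / len(in_domain_abstain_p)
    return {
        "gibberish_mean": gib_mean,
        "in_domain_fp_rate": fp_rate,
        "passes": gib_mean > threshold and fp_rate == 0.0,
    }

# Toy values only, chosen to show one passing and one failing case.
passing = deploy_gate([0.90, 0.95, 1.00], [0.10, 0.20, 0.30])
failing = deploy_gate([0.90, 0.95, 1.00], [0.10, 0.80, 0.30, 0.20])
print(passing)
print(failing)
```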

### 5.1 Why the splice fails

A trained abstain head learns to read residual-stream patterns specific to its co-trained router.
Joint training shifts the router, which reshapes the residual stream; the head reads those
reshaped patterns. Lift the head onto a fresh base and the patterns are gone — consistent with
the literature on feature non-transferability in linear probes. The uncertainty signal is a
property of the joint {router-perturbation, head} representation, not of the head alone.
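
Mechanically, a head-only splice amounts to copying the abstain-head parameters from the jointly trained checkpoint into an otherwise untouched base checkpoint. The sketch below shows that operation on plain dictionaries standing in for state dicts. The key prefix `abstain_head.` and the other parameter names are hypothetical; the real checkpoint layout may differ.

```python
def splice_head(base_state, donor_state, head_prefix="abstain_head."):
    """Return a copy of `base_state` with the donor's head parameters.

    Everything outside `head_prefix` (router, pathways, output head) stays
    exactly as in the base, which is what makes this a head-only splice.
    """
    head_keys = [k for k in donor_state if k.startswith(head_prefix)]
    if not head_keys:
        raise KeyError(f"no parameters with prefix {head_prefix!r} in donor")
    spliced = dict(base_state)  # shallow copy; the base is left unmodified
    for key in head_keys:
        spliced[key] = donor_state[key]
    return spliced

# Toy "checkpoints" with hypothetical parameter names and values.
v4_like = {
    "blocks.0.router.weight": [0.1, 0.2, 0.3],
    "abstain_head.weight": [0.0, 0.0],
    "abstain_head.bias": [0.0],
}
v7_like = {
    "blocks.0.router.weight": [0.4, -0.1, 0.9],  # router shifted by joint training
    "abstain_head.weight": [1.5, -2.0],
    "abstain_head.bias": [0.3],
}
spliced = splice_head(v4_like, v7_like)
print(spliced)
```

In the spliced model the head sees the base router's residual stream, not the shifted one it was trained against, which is the mismatch the paragraph above describes.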

## 6. The router-fragility mechanism

v8b sets the metacognition weight to exactly zero: only cross-entropy on the in-domain subset and
BCE on the abstain head contribute gradient, and the only unfrozen parameters are the router
linears plus the abstain linear. **v8b still breaks generation** — sometimes more severely than
v7, which had a metacognition weight of 20.

Diagnosis: even with the metacognition loss identically zero, the in-domain cross-entropy term
backprops through the output head into the residual stream and from there into the unfrozen router
linears. Roughly 16,000 in-domain updates (500 steps × 32) shift the routing distribution enough
to break the routing the rest of the (frozen) model was tuned against; OOD generation then
collapses. At this scale the router cannot be retrained on *any* subset distribution without
disrupting generation elsewhere.

**Falsifiable corollary (queued, not yet run):** additionally freeze the router linears and train
only the abstain linear under BCE. We predict (a) the abstain head still reaches strong
cross-regime AUROC, because its signal comes from the residual-stream pattern rather than from
re-routing, and (b) generation is preserved. Confirmation would localize the damage precisely to
router re-tuning.
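
The difference between v8b and the queued corollary is only which parameters receive gradient. A minimal sketch of that selection, using hypothetical parameter names (`.router.` and `abstain_head.` are assumptions about naming, not taken from the repository):

```python
def trainable_names(param_names, train_router):
    """Select which parameters stay unfrozen.

    train_router=True  -> v8b-style: router linears + abstain linear
    train_router=False -> corollary: abstain linear only
    """
    selected = []
    for name in param_names:
        is_head = name.startswith("abstain_head.")
        is_router = ".router." in name
        if is_head or (train_router and is_router):
            selected.append(name)
    return selected

# Hypothetical parameter names for illustration.
names = [
    "blocks.0.router.weight",
    "blocks.0.router.bias",
    "blocks.0.ffn.weight",
    "lm_head.weight",
    "abstain_head.weight",
    "abstain_head.bias",
]
v8b_style = trainable_names(names, train_router=True)
corollary = trainable_names(names, train_router=False)
print("v8b-style:", v8b_style)
print("corollary:", corollary)
```

With the router frozen, the in-domain cross-entropy term has no path to the routing distribution, which is what the prediction in (b) relies on.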

## 7. The deployed operating point (what actually works)

The practical recommendation at this scale is **not** joint finetuning: it is `max_softmax_mean`
plus abstain-aware SFT. The deployed v4 checkpoint, using exactly that recipe, reaches **9 / 10**
on the bundled held-out "I don't know" gate (PASS gate ≥ 9; the deploy probe was 10 / 10 on
slightly different phrasing) with a **0%** in-domain false-positive rate at threshold 0.775
(calibrated on held-out data). On a separate false-inability probe it fires the refusal template
on **7 / 20** answerable prompts — precision-bounded by SFT coverage. These are precision claims
about a head working on its trained pattern, not generalization claims; on semantic OOD outside
the SFT distribution the same head is at chance (Section 4).

## 8. Discussion

What we did **not** show: that any of this holds at 100 M or 1 B parameters. The router-fragility
argument is explicitly scale-dependent — a larger router with more capacity may absorb in-domain
updates without disrupting OOD routing. We leave that open. What we **did** show, at the scale we
tested: (1) the router-entropy-as-metacognition narrative is dead at 10 M; (2) abstain heads in
small routed LMs are not modular; (3) the strongest joint signal is reached by *removing* the
metacognition loss, not adding it.

## 9. Related work

Ternary base models at scale (e.g. BitNet b1.58) motivate small-model interest but do not address
modular uncertainty. Work treating sparse features as liftable modules is closer to our setting,
and our splice result is a counterexample: the lifting fails for abstain heads in the routed-LM setting. Most
calibration work (ECE, temperature scaling, learned uncertainty heads) operates at 100 M+ scale;
our finding is small-scale specific.

## 10. Limitations and reproducibility

The results cover 10.2 M parameters only and are architecture-specific (3-pathway routed block).
The v8 sweep starts from a different base checkpoint than v4, so comparisons between them are
history-dependent. The probe set is hand-curated and inter-rater reliability is not measured.
Cost: ~$0.35 of GPU for the v8 sweep, the rest CPU.
Every headline number is bound to a script:

```bash
python reproduce/01_benchmark.py            # arch loads, ~10 M params (CPU, ~2 s)
python reproduce/03_abstain_held_out.py     # 9 / 10 held-out IDK gate (CPU, ~1 min)
python reproduce/04_neo_false_inability.py  # 7 / 20 false-inability (CPU, ~2 min)
python reproduce/02_metacog_probe.py        # cross-regime AUROC sweep (CPU, ~15 min)
```

Each exits non-zero if the bundled v4 checkpoint fails to produce the documented number within
tolerance.
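
The exit-code contract can be illustrated with a short sketch. The helper name `check`, the tolerance value, and the observed numbers below are hypothetical; the scripts in `reproduce/` define the real checks and tolerances.

```python
def check(name, observed, expected, tol):
    """Return a process exit code: 0 if within tolerance, 1 otherwise."""
    ok = abs(observed - expected) <= tol
    status = "OK" if ok else "FAIL"
    print(f"[{status}] {name}: observed={observed:.4f} expected={expected:.4f} tol={tol}")
    return 0 if ok else 1

# Hypothetical observed values and tolerance, to show both outcomes.
code_ok = check("splice abstain_p AUROC", observed=0.541, expected=0.54, tol=0.01)
code_bad = check("splice abstain_p AUROC", observed=0.600, expected=0.54, tol=0.01)

# A script would typically finish with: raise SystemExit(code)
```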

## Appendix (sketch)

- **A1** Full 7-regime × variant AUROC matrix.
- **A2** Sample generations for all 5 variants on 5 representative prompts.
- **A3** Training curves (abstain gap, entropy gap, CE) for v7 / v8a / v8b.
- **A4** The 210-prompt probe set (`prompts/probe_210.jsonl`).
- **A5** Checkpoints and SHAs for all variants (negative-result checkpoints available on request
  via hello@tilelli.tech).