# Metacognition in a Small Routed Language Model Is Not a Separable Module
**Tilelli LLM Team** · hello@tilelli.tech
Code, checkpoints, and the evaluation set: https://github.com/TilelliLab/Tilelli-llm (Apache-2.0)
*Draft, workshop format (4 pages + appendix). Every number in this paper is produced by a
script in `reproduce/` that exits non-zero if the bundled checkpoint fails to reproduce it
within tolerance.*
---
## Abstract
We study whether the gate distribution of a routed language model can be exploited as a
metacognition / uncertainty signal at the smallest scale where routing is non-trivial
(10.2 M parameters). We pre-registered a per-regime AUROC decision rule across 7 evaluation
regimes and ran five training variants sweeping the metacognition-loss weight from 20 to 0,
plus a head-only weight-graft ("splice") condition. **The pre-registered claim is disproven:**
router entropy alone does not beat an output-side baseline in any of the 7 regimes. A weaker
but informative result survives: joint router + abstain-head training reaches cross-regime
in-domain-vs-OOD AUROC up to 0.85 on the abstain head's sigmoid output, but (i) the gain does
not survive a head-only splice onto a fresh base (AUROC drops to 0.54, at chance), and (ii)
every configuration that produces the gain also degrades generation. We argue these two
negative results together bound a substantive claim about modularity: in small routed LMs the
uncertainty signal lives in the joint {router, head} representation rather than in the head as a
transferable module. We further isolate the mechanism: at this scale the router is fragile
enough that cross-entropy backprop on an in-domain subset alone, with the metacognition loss set
identically to zero, shifts the routing distribution enough to break out-of-domain generation.
---
## 1. Introduction
Uncertainty and abstention heads are increasingly proposed as pluggable modules: train a small
head to predict "I don't know," and bolt it onto a base model. This paper tests that modularity
assumption at the small/edge scale where it would matter most, using a 10.2 M-parameter routed
byte-level LM, and finds it fails in a specific, mechanism-explainable way.
We make three contributions, all negative or qualifying, and all reproducible:
1. A **pre-registered, disproven** claim that router entropy provides metacognition at 10 M
parameters (Section 4).
2. A **non-transferability** result for abstain heads across base models: a head that reaches
AUROC 0.76 in situ drops to 0.54 when lifted onto a fresh base (Section 5).
3. A **mechanism** for why joint training succeeds at producing the signal but breaks
generation, including a falsifiable corollary (Section 6).
We deliberately do not headline an architecture win. A preliminary single-seed benchmark of the
3-pathway block against a vanilla decoder is reported honestly in Section 3 and
`results/claim_01_benchmark.md`, and it is **not** a defensible result; we say so plainly rather
than promote it.
## 2. Setup
### 2.1 Model
A 10.2 M-parameter byte-level language model: 8 layers, `d_model = 256`. Each block contains
three parallel pathways: a local pathway (1×1 convolution), a sparse-attention pathway (top-k),
and a dense feed-forward pathway. These are mixed by a learned linear gate over the hidden state,
softmax-routed. The model was trained on FineWeb-Edu (~10 B bytes) for 12 K base steps, then
chat-SFT, then abstain-aware SFT. The deployed checkpoint (`tilelli_chat_v4.pt`, FP32,
unquantized) anchors every positive claim in this paper.
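As a rough illustration of the gating described above, the sketch below mixes three pathway outputs with a softmax over gate logits, for a single hidden vector. The function names (`softmax`, `mix_pathways`) and the toy numbers are ours and are not taken from the repository; in the real block the logits would come from a learned linear layer over the hidden state.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_pathways(gate_logits, pathway_outputs):
    """Weight each pathway's output vector by its gate probability and sum.

    gate_logits: one logit per pathway (local, sparse-attention, dense FFN).
    pathway_outputs: one output vector per pathway, all of the same length.
    """
    gate = softmax(gate_logits)
    dim = len(pathway_outputs[0])
    mixed = [
        sum(g * out[i] for g, out in zip(gate, pathway_outputs))
        for i in range(dim)
    ]
    return gate, mixed

# Toy example (hypothetical values): equal logits give a uniform gate,
# so the mixed vector is the plain average of the three pathway outputs.
gate, mixed = mix_pathways(
    [0.0, 0.0, 0.0],
    [[3.0, 0.0], [0.0, 3.0], [3.0, 3.0]],
)
print(gate)
print(mixed)
```

The gate vector returned here is the distribution that the routing-side signals in Section 2.2 are computed from.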
### 2.2 Evaluation regimes
We hand-curated 7 regimes × 30 prompts = a 210-prompt probe set
(`prompts/probe_210.jsonl`): `in_domain`, `ood_topic`, `ood_style`, `long_input`, `gibberish`,
`factual_misleading`, and `neo_false_inability` (well-formed prompts that invite a spurious
refusal). For each prompt we record output-side and routing-side signals: `max_softmax_mean` and
`max_softmax_last` (output-side baselines), `router_conf`, `router_entropy_mean`,
`router_entropy_var`, the 8-vector `router_entropy_per_layer`, and `abstain_p` (the sigmoid of a
dedicated abstain head on the final hidden state).
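A minimal sketch of how the entropy signals and the output-side baseline could be computed from raw distributions follows. The aggregation order is an assumption on our part: entropy is averaged over tokens within each layer, then the mean and variance are taken across layers; the paper's scripts may aggregate differently. `router_conf` is left out because the text does not define it.

```python
import math
from statistics import mean, pvariance

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

def router_signals(gates):
    """gates[layer][token] is a gate distribution over the three pathways.

    Assumption: per-layer entropy is the token average, and the mean and
    (population) variance are then taken across layers.
    """
    per_layer = [mean(entropy(g) for g in layer) for layer in gates]
    return {
        "router_entropy_per_layer": per_layer,
        "router_entropy_mean": mean(per_layer),
        "router_entropy_var": pvariance(per_layer),
    }

def max_softmax_mean(token_probs):
    """Mean over generated tokens of the largest next-token probability."""
    return mean(max(p) for p in token_probs)

# Toy input (hypothetical): 2 layers x 2 tokens for the router,
# and 2 generated tokens over a 3-symbol vocabulary for the output side.
gates = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],      # layer 0: fully confident routing
    [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]],  # layer 1: uniform routing
]
signals = router_signals(gates)
print(signals)
print(max_softmax_mean([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]))
```

In the full model `router_entropy_per_layer` would have 8 entries, one per layer.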
### 2.3 Pre-registered decision rule
Registered before the runs (`MASTER_PLAN_2026-05-23.md` in the source repo). A *win* in a regime
requires AUROC ≥ 0.02 above the best baseline with a bootstrap 95% CI not crossing zero.
**≥ 4 wins** including at least one of {gibberish, factual-misleading, NEO} → PROVEN;
**1–3 wins** → PARTIAL; **0 wins** → DISPROVEN.
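The decision rule can be written down as a small function, sketched here under stated assumptions. We assume each regime is summarized by the AUROC difference against the best baseline and the lower end of its bootstrap 95% CI, and that the regime keys match the names in `prompts/probe_210.jsonl` (with NEO read as `neo_false_inability`). The rule as quoted does not label the case of four or more wins with none in the three key regimes, so the sketch returns `"UNSPECIFIED"` there rather than guessing.

```python
KEY_REGIMES = {"gibberish", "factual_misleading", "neo_false_inability"}

def is_win(delta_auroc, ci_low, margin=0.02):
    """A win: the signal beats the best baseline by at least `margin`
    and the bootstrap 95% CI of the difference stays above zero."""
    return delta_auroc >= margin and ci_low > 0.0

def verdict(per_regime):
    """per_regime maps regime name -> (delta_auroc, ci_low)."""
    wins = {r for r, (d, lo) in per_regime.items() if is_win(d, lo)}
    if len(wins) >= 4 and wins & KEY_REGIMES:
        return "PROVEN"
    if 1 <= len(wins) <= 3:
        return "PARTIAL"
    if not wins:
        return "DISPROVEN"
    # >= 4 wins but none in a key regime: not covered by the rule as quoted.
    return "UNSPECIFIED"

REGIMES = [
    "in_domain", "ood_topic", "ood_style", "long_input",
    "gibberish", "factual_misleading", "neo_false_inability",
]

# Hypothetical inputs: no regime clears the margin, so the verdict is DISPROVEN.
no_wins = {r: (0.0, -0.05) for r in REGIMES}
print(verdict(no_wins))
```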
## 3. Baselines and a benchmark we will not oversell
The output-side baseline is `max_softmax_mean`, computed over generated tokens only, with each
regime length-balanced (n = 30) so a length-correlated signal cannot masquerade as
metacognition.
Separately, on the architecture itself: on the only Tilelli-Lite seed whose training log is
preserved (seed 1234), the comparison against a single-seed vanilla decoder **depends on the
metric**. On the within-training periodic-eval best, Lite *loses* by 0.6% (0.5742 vs 0.5707); on
a single post-training extra-eval pass, Lite *wins* by 0.4% (0.5685), but on a different
validation-batch composition, with a different `eval_every` than the baseline. Two further Lite
seeds were reported (0.5679, 0.5693) but their logs are not preserved. We therefore treat this as
a **preliminary single-seed directional finding, not a result** (`results/claim_01_benchmark.md`),
and it is not load-bearing for anything below.
## 4. The pre-registered router-entropy claim: disproven
Across the v3 baseline and all joint-finetune variants (v6, v7, v8a, v8b), the router-entropy
family wins **0 / 7 regimes** by the registered criterion. The pre-registered claim is
**DISPROVEN**.
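For readers who want to see what the registered criterion involves numerically, here is a sketch of a paired bootstrap confidence interval for the AUROC difference between a candidate signal and a baseline. The resampling scheme is our assumption, not necessarily what `reproduce/02_metacog_probe.py` does: prompts are resampled with replacement within each class, and the same indices are reused for both signals so the comparison stays paired.

```python
import random

def auroc(pos, neg):
    """Probability that a positive scores above a negative (ties count 0.5)."""
    total = 0.0
    for p in pos:
        for n in neg:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos) * len(neg))

def bootstrap_delta_ci(sig_pos, sig_neg, base_pos, base_neg, n_boot=1000, seed=0):
    """95% percentile CI for AUROC(signal) - AUROC(baseline).

    sig_pos[i] and base_pos[i] must refer to the same prompt (likewise for neg).
    """
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        pi = [rng.randrange(len(sig_pos)) for _ in sig_pos]
        ni = [rng.randrange(len(sig_neg)) for _ in sig_neg]
        delta = (
            auroc([sig_pos[i] for i in pi], [sig_neg[i] for i in ni])
            - auroc([base_pos[i] for i in pi], [base_neg[i] for i in ni])
        )
        deltas.append(delta)
    deltas.sort()
    lo = deltas[int(0.025 * (n_boot - 1))]
    hi = deltas[int(0.975 * (n_boot - 1))]
    return lo, hi

# Toy scores (hypothetical): a perfectly separating signal against a constant baseline.
lo, hi = bootstrap_delta_ci(
    [0.9, 0.8, 0.7], [0.1, 0.2, 0.3],
    [0.5, 0.5, 0.5], [0.5, 0.5, 0.5],
    n_boot=200,
)
print(lo, hi)
```

Under this reading, a regime counts as a win only if the point difference is at least 0.02 and `lo` stays above zero.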
Cross-regime in-domain-vs-OOD AUROC on the deployed v4 checkpoint:
| Signal | Cross-regime AUROC |
|---|---:|
| `max_softmax_mean` (baseline) | 0.54 |
| `router_conf` | 0.55 |
| `router_entropy_mean` | 0.55 |
| `abstain_p` | 0.51 |
All four are essentially at chance for *cross-regime* discrimination. The one place an
uncertainty signal works at this scale is `max_softmax_mean` *broken down per regime*: on
gibberish-vs-in-domain it reaches AUROC ≈ 0.93. That signal is output-side and
architecture-agnostic (it would work on any vanilla LM), so it offers no support for the
routing-as-metacognition narrative.
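The gap between a near-chance cross-regime number and a strong per-regime number is easy to reproduce on toy data. The sketch below assumes that "cross-regime" means pooling all non-reference regimes into one OOD class; that reading, the helper names, and the scores are ours and are invented purely for illustration.

```python
def auroc(pos, neg):
    """Probability that a positive scores above a negative (ties count 0.5)."""
    total = 0.0
    for p in pos:
        for n in neg:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos) * len(neg))

def per_regime_and_pooled(scores, reference="in_domain"):
    """scores maps regime -> list of uncertainty scores (higher = more uncertain).

    Returns the AUROC of each regime against the reference regime, plus the
    AUROC of all non-reference regimes pooled together against the reference.
    """
    ref = scores[reference]
    others = {r: s for r, s in scores.items() if r != reference}
    per_regime = {r: auroc(s, ref) for r, s in others.items()}
    pooled = auroc([x for s in others.values() for x in s], ref)
    return per_regime, pooled

# Invented scores: one regime separates cleanly, the other two do not.
toy = {
    "in_domain": [0.4, 0.5, 0.6],
    "gibberish": [0.9, 0.8, 0.7],
    "ood_topic": [0.3, 0.5, 0.7],
    "ood_style": [0.2, 0.4, 0.5],
}
per_regime, pooled = per_regime_and_pooled(toy)
print(per_regime)
print(pooled)
```

On these toy numbers the gibberish regime separates perfectly while the pooled figure sits close to chance, which is the same qualitative pattern as the table above.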
## 5. Cross-regime AUROC and the splice test
A looser question (does any signal separate in-domain from OOD after *joint* training?) has a
more interesting answer. We swept the metacognition-loss weight from 20 → 5 → 0 while keeping an
abstain BCE term:
| Variant | metacog wt | abstain wt | `abstain_p` AUROC | gibberish mean `abstain_p` | in-domain FP @ 0.775 | generation coherent? |
|---|---:|---:|---:|---:|---:|:--:|
| v4 (base SFT only) | n/a | n/a | 0.51 | 0.60 | 0% | yes |
| v7 | 20 | 1 | 0.76 | 0.94 | 20% | no |
| v8a | 5 | 1 | 0.80 | 0.97 | 23% | no |
| **v8b** | **0** | **5** | **0.85** | **1.00** | 10% | no |
| splice (v4 base + v7 head) | n/a | n/a | 0.54 | 0.46 | 27% | yes (v4-like) |
Two findings stand out.
**(1) The losses compete; they do not synergize.** The cross-regime signal *strengthens
monotonically as the metacognition weight goes to zero*. v8b, with zero metacognition pressure,
produces the strongest abstain signal in the entire project (AUROC 0.85, gibberish mean 1.00).
Adding the metacognition loss makes the discrimination *worse*, not better: the two losses
contend for the router's limited representation budget.
**(2) The signal does not survive a head-only splice.** Lifting v7's trained abstain head onto
v4's frozen base gives AUROC 0.54 (at chance, despite v7 itself reaching 0.76) and makes
behavior *worse*, not neutral, raising the in-domain false-positive rate to 27%:
| Deploy gate | v4 | splice | v7 |
|---|---:|---:|---:|
| gibberish mean `abstain_p` (target > 0.775) | 0.60 ✗ | 0.46 ✗ | 0.94 ✓ |
| in-domain false-positive rate (target ≤ 0%) | 0% | 27% | 20% |
| chat coherence | ✓ | ✓ (v4-like) | ✗ broken |
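The two numeric rows of the deploy gate can be checked mechanically, as in the sketch below. It assumes a false positive is an in-domain prompt whose `abstain_p` strictly exceeds the 0.775 threshold; chat coherence is judged separately and is not modeled here. The function name and the toy inputs are hypothetical.

```python
from statistics import mean

THRESHOLD = 0.775  # operating threshold quoted in the text

def deploy_gate(gibberish_p, in_domain_p, threshold=THRESHOLD):
    """Evaluate the numeric deploy gates from per-prompt abstain_p values.

    gibberish_p: abstain_p for each gibberish prompt.
    in_domain_p: abstain_p for each in-domain prompt.
    """
    gib_mean = mean(gibberish_p)
    fp_rate = sum(p > threshold for p in in_domain_p) / len(in_domain_p)
    gib_ok = gib_mean > threshold
    fp_ok = fp_rate <= 0.0
    return {
        "gibberish_mean": gib_mean,
        "gibberish_ok": gib_ok,
        "in_domain_fp_rate": fp_rate,
        "fp_ok": fp_ok,
        "passes": gib_ok and fp_ok,
    }

# Hypothetical values: high abstain_p on gibberish, none above threshold in-domain.
print(deploy_gate([0.9, 0.95, 0.97], [0.1, 0.2, 0.3, 0.4]))
```

With 30 prompts per regime, a single in-domain prompt above threshold is already enough to fail the false-positive gate.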
### 5.1 Why the splice fails
A trained abstain head learns to read residual-stream patterns specific to its co-trained router.
Joint training shifts the router, which reshapes the residual stream; the head reads those
reshaped patterns. Lift the head onto a fresh base and the patterns are gone, consistent with
the literature on feature non-transferability in linear probes. The uncertainty signal is a
property of the joint {router-perturbation, head} representation, not of the head alone.
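Mechanically, a head-only splice amounts to copying the head's parameters from one checkpoint into another while leaving everything else alone. The sketch below shows that operation on plain dictionaries standing in for state dicts. The `abstain_head.` prefix and the other parameter names are hypothetical and are not taken from the repository.

```python
def splice_head(base_state, donor_state, head_prefix="abstain_head."):
    """Return a copy of base_state with the head parameters taken from the donor.

    Assumption: parameters live in a flat name -> value mapping (as in a
    PyTorch state dict) and the abstain head's parameters share `head_prefix`.
    """
    head_keys = [k for k in donor_state if k.startswith(head_prefix)]
    if not head_keys:
        raise KeyError(f"no parameters with prefix {head_prefix!r} in donor")
    missing = [k for k in head_keys if k not in base_state]
    if missing:
        raise KeyError(f"base is missing head parameters: {missing}")
    spliced = dict(base_state)  # shallow copy; real tensors would need cloning
    for k in head_keys:
        spliced[k] = donor_state[k]
    return spliced

# Toy state dicts; strings stand in for tensors.
v4 = {
    "blocks.0.router.weight": "v4-router",
    "abstain_head.weight": "v4-head",
    "abstain_head.bias": "v4-bias",
}
v7 = {
    "blocks.0.router.weight": "v7-router",
    "abstain_head.weight": "v7-head",
    "abstain_head.bias": "v7-bias",
}
spliced = splice_head(v4, v7)
print(spliced)
```

The spliced model keeps v4's router, so the head is reading a residual stream it was never trained on, which is the situation Section 5.1 describes.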
## 6. The router-fragility mechanism
v8b sets the metacognition weight to exactly zero: only cross-entropy on the in-domain subset and
BCE on the abstain head contribute gradient, and the only unfrozen parameters are the router
linears plus the abstain linear. **v8b still breaks generation**, sometimes more severely than
v7, which had a metacognition weight of 20.
Diagnosis: even with the metacognition loss identically zero, the in-domain cross-entropy term
backprops through the output head into the residual stream and from there into the unfrozen router
linears. Roughly 16,000 in-domain updates (500 steps × 32) shift the routing distribution enough
to break the routing the rest of the (frozen) model was tuned against; OOD generation then
collapses. At this scale the router cannot be retrained on *any* subset distribution without
disrupting generation elsewhere.
**Falsifiable corollary (queued, not yet run):** additionally freeze the router linears and train
only the abstain linear under BCE. We predict (a) the abstain head still reaches strong
cross-regime AUROC, because its signal comes from the residual-stream pattern rather than from
re-routing, and (b) generation is preserved. Confirmation would localize the damage precisely to
router re-tuning.
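The difference between the v8b configuration and the queued corollary comes down to which parameters receive gradient. The sketch below selects trainable parameters by name; the naming patterns (`.router.` for router linears, `abstain_head.` for the abstain linear) are hypothetical, and in a PyTorch training script the selected names would be the ones left with `requires_grad` enabled.

```python
def trainable_names(param_names, freeze_router):
    """Select which parameters receive gradient.

    Assumption: router linears contain '.router.' in their name and the
    abstain linear starts with 'abstain_head.'. Everything else stays frozen.
    """
    selected = []
    for name in param_names:
        is_router = ".router." in name
        is_head = name.startswith("abstain_head.")
        if is_head or (is_router and not freeze_router):
            selected.append(name)
    return selected

# Hypothetical parameter names for a two-block toy model.
names = [
    "embed.weight",
    "blocks.0.router.weight",
    "blocks.0.ffn.weight",
    "blocks.1.router.weight",
    "abstain_head.weight",
    "abstain_head.bias",
]
v8b_like = trainable_names(names, freeze_router=False)   # router linears + abstain linear
corollary = trainable_names(names, freeze_router=True)   # abstain linear only
print(v8b_like)
print(corollary)
```

In the corollary configuration no gradient reaches the router, so any remaining damage to generation could not be attributed to router re-tuning.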
## 7. The deployed operating point (what actually works)
The practical recommendation at this scale is **not** joint finetuning: it is `max_softmax_mean`
plus abstain-aware SFT. The deployed v4 checkpoint, using exactly that recipe, reaches **9 / 10**
on the bundled held-out "I don't know" gate (PASS gate ≥ 9; the deploy probe was 10 / 10 on
slightly different phrasing) with a **0%** in-domain false-positive rate at threshold 0.775
(calibrated on held-out data). On a separate false-inability probe it fires the refusal template
on **7 / 20** answerable prompts, precision-bounded by SFT coverage. These are precision claims
about a head working on its trained pattern, not generalization claims; on semantic OOD outside
the SFT distribution the same head is at chance (Section 4).
## 8. Discussion
What we did **not** show: that any of this holds at 100 M or 1 B parameters. The router-fragility
argument is explicitly scale-dependent: a larger router with more capacity may absorb in-domain
updates without disrupting OOD routing. We leave that open. What we **did** show, at the scale we
tested: (1) the router-entropy-as-metacognition narrative is dead at 10 M; (2) abstain heads in
small routed LMs are not modular; (3) the strongest joint signal is reached by *removing* the
metacognition loss, not adding it.
## 9. Related work
Ternary base models at scale (e.g. BitNet b1.58) motivate small-model interest but do not address
modular uncertainty. Work treating sparse features as liftable modules is closer to our setting;
we provide a counterexample, showing that the lifting fails for abstain heads in the routed-LM setting. Most
calibration work (ECE, temperature scaling, learned uncertainty heads) operates at 100 M+ scale;
our finding is small-scale specific.
## 10. Limitations and reproducibility
10.2 M parameters only; architecture-specific (3-pathway routed block). The v8 sweep uses one
base checkpoint and v4 another (history dependence). The probe set is hand-curated and
inter-rater reliability is not measured. Cost: ~$0.35 of GPU for the v8 sweep, the rest CPU.
Every headline number is bound to a script:
```bash
python reproduce/01_benchmark.py # arch loads, ~10 M params (CPU, ~2 s)
python reproduce/03_abstain_held_out.py # 9 / 10 held-out IDK gate (CPU, ~1 min)
python reproduce/04_neo_false_inability.py # 7 / 20 false-inability (CPU, ~2 min)
python reproduce/02_metacog_probe.py # cross-regime AUROC sweep (CPU, ~15 min)
```
Each exits non-zero if the bundled v4 checkpoint fails to produce the documented number within
tolerance.
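The exit-code contract can be sketched as below. This is not the code in `reproduce/`; the function names, the tolerance, and the placeholder numbers are ours. A real script would compute `measured` from the checkpoint and pass the returned code to `sys.exit`.

```python
def check(name, measured, documented, tol):
    """Return True when the measured value is within `tol` of the documented one."""
    ok = abs(measured - documented) <= tol
    status = "OK" if ok else "FAIL"
    print(f"[{status}] {name}: measured={measured:.4f} documented={documented:.4f} tol={tol}")
    return ok

def main(results):
    """results: iterable of (name, measured, documented, tol) tuples.

    Runs every check (no short-circuit, so all failures are reported) and
    returns a process exit code: 0 if all pass, 1 otherwise.
    """
    outcomes = [check(*r) for r in results]
    return 0 if all(outcomes) else 1

# Placeholder values; the tolerance here is hypothetical.
exit_code = main([("abstain_p cross-regime AUROC", 0.51, 0.51, 0.02)])
print(exit_code)
# In a script, finish with: raise SystemExit(exit_code)
```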
## Appendix (sketch)
- **A1** Full 7-regime × variant AUROC matrix.
- **A2** Sample generations for all 5 variants on 5 representative prompts.
- **A3** Training curves (abstain gap, entropy gap, CE) for v7 / v8a / v8b.
- **A4** The 210-prompt probe set (`prompts/probe_210.jsonl`).
- **A5** Checkpoints and SHAs for all variants (negative-result checkpoints available on request
via hello@tilelli.tech).