# Metacognition in a Small Routed Language Model Is Not a Separable Module

**Tilelli LLM Team** · hello@tilelli.tech
Code, checkpoints, and the evaluation set: https://github.com/TilelliLab/Tilelli-llm (Apache-2.0)

*Draft, workshop format (4 pages + appendix). Every number in this paper is produced by a
script in `reproduce/` that exits non-zero if the bundled checkpoint fails to reproduce it
within tolerance.*

---

## Abstract

We study whether the gate distribution of a routed language model can be exploited as a
metacognition / uncertainty signal at the smallest scale where routing is non-trivial
(10.2 M parameters). We pre-registered a per-regime AUROC decision rule across 7 evaluation
regimes and ran five training variants sweeping the metacognition-loss weight from 20 to 0,
plus a head-only weight-graft ("splice") condition. **The pre-registered claim is disproven:**
router entropy alone does not beat an output-side baseline in any of the 7 regimes. A weaker
but informative result survives: joint router + abstain-head training reaches cross-regime
in-domain-vs-OOD AUROC up to 0.85 on the abstain head's sigmoid output, but (i) the gain does
not survive a head-only splice onto a fresh base (AUROC drops to 0.54, at chance), and (ii)
every configuration that produces the gain also degrades generation. We argue these two
negative results together bound a substantive claim about modularity: in small routed LMs the
uncertainty signal lives in the joint {router, head} representation rather than in the head as a
transferable module. We further isolate the mechanism: at this scale the router is fragile
enough that cross-entropy backprop on an in-domain subset alone, with the metacognition loss set
identically to zero, shifts the routing distribution enough to break out-of-domain generation.

---

## 1. Introduction

Uncertainty and abstention heads are increasingly proposed as pluggable modules: train a small
head to predict "I don't know," and bolt it onto a base model. This paper tests that modularity
assumption at the small/edge scale where it would matter most, using a 10.2 M-parameter routed
byte-level LM, and finds it fails in a specific, mechanism-explainable way.

We make three contributions, all negative or qualifying, and all reproducible:

1. A **pre-registered, disproven** claim that router entropy provides metacognition at 10 M
   parameters (Section 4).
2. A **non-transferability** result for abstain heads across base models: a head that reaches
   AUROC 0.76 in situ drops to 0.54 when lifted onto a fresh base (Section 5).
3. A **mechanism** for why joint training succeeds at producing the signal but breaks
   generation, including a falsifiable corollary (Section 6).

We deliberately do not headline an architecture win. A preliminary single-seed benchmark of the
3-pathway block against a vanilla decoder is reported honestly in Section 3 and
`results/claim_01_benchmark.md`, and it is **not** a defensible result; we say so plainly rather
than promote it.

## 2. Setup

### 2.1 Model

A 10.2 M-parameter byte-level language model: 8 layers, `d_model = 256`. Each block contains
three parallel pathways: a local pathway (1×1 convolution), a sparse-attention pathway (top-k),
and a dense feed-forward pathway. Their outputs are mixed by a learned linear gate over the hidden state,
softmax-routed. The model was trained on FineWeb-Edu (~10 B bytes) for 12 K base steps, then
chat-SFT, then abstain-aware SFT. The deployed checkpoint (`tilelli_chat_v4.pt`, FP32,
unquantized) anchors every positive claim in this paper.
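
The gating step can be illustrated with a minimal plain-Python sketch. It assumes the gate produces one logit per pathway and that the block output is the softmax-weighted sum of the three pathway outputs; the function names and toy vectors are illustrative and are not taken from the repository.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mix_pathways(gate_logits, pathway_outputs):
    """Weight each pathway's output vector by its gate weight and sum them."""
    weights = softmax(gate_logits)
    dim = len(pathway_outputs[0])
    mixed = [0.0] * dim
    for w, out in zip(weights, pathway_outputs):
        for i in range(dim):
            mixed[i] += w * out[i]
    return weights, mixed

# Toy 4-dimensional outputs standing in for the local, sparse-attention,
# and dense feed-forward pathways (hypothetical values).
local_out = [1.0, 0.0, 0.0, 0.0]
sparse_out = [0.0, 1.0, 0.0, 0.0]
dense_out = [0.0, 0.0, 1.0, 0.0]

# gate_logits stands in for the output of the learned linear gate.
weights, mixed = mix_pathways([2.0, 0.5, 0.5], [local_out, sparse_out, dense_out])
```

The `weights` vector is the per-token gate distribution whose entropy is studied in the rest of the paper.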

### 2.2 Evaluation regimes

We hand-curated 7 regimes × 30 prompts = a 210-prompt probe set
(`prompts/probe_210.jsonl`): `in_domain`, `ood_topic`, `ood_style`, `long_input`, `gibberish`,
`factual_misleading`, and `neo_false_inability` (well-formed prompts that invite a spurious
refusal). For each prompt we record output-side and routing-side signals: `max_softmax_mean` and
`max_softmax_last` (output-side baselines), `router_conf`, `router_entropy_mean`,
`router_entropy_var`, the 8-vector `router_entropy_per_layer`, and `abstain_p` (the sigmoid of a
dedicated abstain head on the final hidden state).
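
As a rough sketch of how the routing-side and output-side signals could be computed from recorded distributions: the code below assumes entropy is measured in nats, that the per-layer value is a mean over tokens, and that `router_entropy_var` is the variance across layers. Those aggregation choices are assumptions for illustration; the probe script in the repository is authoritative.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one gate distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def router_entropy_signals(gates):
    """gates[layer][token] is a probability vector over the 3 pathways."""
    per_layer = [sum(entropy(g) for g in layer) / len(layer) for layer in gates]
    mean = sum(per_layer) / len(per_layer)
    var = sum((h - mean) ** 2 for h in per_layer) / len(per_layer)
    return {
        "router_entropy_per_layer": per_layer,
        "router_entropy_mean": mean,
        "router_entropy_var": var,
    }

def max_softmax_mean(token_probs):
    """Mean over generated tokens of the largest next-token probability."""
    return sum(max(p) for p in token_probs) / len(token_probs)

# Toy example with 2 layers and 2 tokens: one maximally uncertain layer,
# one fully decided layer.
uniform = [1 / 3, 1 / 3, 1 / 3]
one_hot = [1.0, 0.0, 0.0]
signals = router_entropy_signals([[uniform, uniform], [one_hot, one_hot]])
confidence = max_softmax_mean([[0.7, 0.2, 0.1], [0.5, 0.5]])
```

In the deployed model the per-layer list would have 8 entries, one per block.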

### 2.3 Pre-registered decision rule

The rule was registered before the runs (`MASTER_PLAN_2026-05-23.md` in the source repo). A *win* in a regime
requires an AUROC margin ≥ 0.02 over the best baseline, with a bootstrap 95% CI that does not cross zero.
**≥ 4 wins** including at least one of {`gibberish`, `factual_misleading`, `neo_false_inability`} → PROVEN;
**1–3 wins** → PARTIAL; **0 wins** → DISPROVEN.
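
The rule can be written down as a small function. This is a sketch under two assumptions: the bootstrap CI is taken on the AUROC difference (signal minus best baseline), and the case the registered rule leaves open (4 or more wins, none in a key regime) is treated as PARTIAL.

```python
KEY_REGIMES = {"gibberish", "factual_misleading", "neo_false_inability"}

def is_win(delta, ci_low, margin=0.02):
    """delta: signal AUROC minus best-baseline AUROC.
    ci_low: lower end of the bootstrap 95% CI on that difference (assumption)."""
    return delta >= margin and ci_low > 0.0

def verdict(results):
    """results maps regime name -> (delta, ci_low, ci_high)."""
    wins = {regime for regime, (delta, lo, _hi) in results.items() if is_win(delta, lo)}
    if len(wins) >= 4 and wins & KEY_REGIMES:
        return "PROVEN"
    if wins:
        # Includes the unregistered case of >= 4 wins without a key regime (assumption).
        return "PARTIAL"
    return "DISPROVEN"

# Hypothetical inputs: no regime clears the margin with a CI above zero.
no_wins = {r: (0.01, -0.04, 0.06) for r in
           ["in_domain", "ood_topic", "ood_style", "long_input",
            "gibberish", "factual_misleading", "neo_false_inability"]}
outcome = verdict(no_wins)
```

With zero wins in every regime, as reported in Section 4, the function returns `DISPROVEN`.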

## 3. Baselines and a benchmark we will not oversell

The output-side baseline is `max_softmax_mean`, computed over generated tokens only, with each
regime length-balanced (n = 30) so a length-correlated signal cannot masquerade as
metacognition.

Separately, on the architecture itself: on the only Tilelli-Lite seed whose training log is
preserved (seed 1234), the comparison against a single-seed vanilla decoder **depends on the
metric**. On the within-training periodic-eval best, Lite *loses* by 0.6% (0.5742 vs 0.5707); on
a single post-training extra-eval pass, Lite *wins* by 0.4% (0.5685), but on a different
validation-batch composition, with a different `eval_every` than the baseline. Two further Lite
seeds were reported (0.5679, 0.5693) but their logs are not preserved. We therefore treat this as
a **preliminary single-seed directional finding, not a result** (`results/claim_01_benchmark.md`),
and it is not load-bearing for anything below.
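
The two percentages can be checked directly. The sketch assumes the metric is lower-is-better and that both Lite numbers are compared against the same vanilla value of 0.5707, which is how the figures above read.

```python
def relative_gap_percent(lite, baseline):
    """Signed gap of Lite relative to the baseline, in percent (positive = Lite is worse)."""
    return 100.0 * (lite - baseline) / baseline

VANILLA = 0.5707  # single-seed vanilla decoder, from the text above

periodic_best_gap = round(relative_gap_percent(0.5742, VANILLA), 1)  # Lite loses
extra_eval_gap = round(relative_gap_percent(0.5685, VANILLA), 1)     # Lite wins

print(periodic_best_gap, extra_eval_gap)  # 0.6 -0.4
```

The sign flips between the two measurements, which is why the comparison is reported as metric-dependent rather than as a win.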

## 4. The pre-registered router-entropy claim: disproven

Across the v3 baseline and all joint-finetune variants (v6, v7, v8a, v8b), the router-entropy
family wins **0 / 7 regimes** by the registered criterion. The pre-registered claim is
**DISPROVEN**.

Cross-regime in-domain-vs-OOD AUROC on the deployed v4 checkpoint:

| Signal | Cross-regime AUROC |
|---|---:|
| `max_softmax_mean` (baseline) | 0.54 |
| `router_conf` | 0.55 |
| `router_entropy_mean` | 0.55 |
| `abstain_p` | 0.51 |

All four are essentially at chance for *cross-regime* discrimination. The one place an
uncertainty signal works at this scale is `max_softmax_mean` *broken down per regime*: on
gibberish-vs-in-domain it reaches AUROC ≈ 0.93. That signal is output-side and
architecture-agnostic (it would work on any vanilla LM), so it offers no support for the
routing-as-metacognition narrative.
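
For readers who want to recompute the table, AUROC and a percentile bootstrap interval can be sketched in plain Python. The pairwise definition below is the standard one (ties count half); the resampling scheme, the 1000 resamples, and the seed are assumptions, not the paper's documented settings. Scores must be oriented so that higher means "more likely OOD", so a confidence signal such as `max_softmax_mean` would be negated first.

```python
import random

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_ci(pos_scores, neg_scores, n_boot=1000, seed=0):
    """Percentile 95% interval from resampling each class with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        pos = [rng.choice(pos_scores) for _ in pos_scores]
        neg = [rng.choice(neg_scores) for _ in neg_scores]
        stats.append(auroc(pos, neg))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

# Hypothetical abstain_p values: OOD prompts are the positive class.
ood = [0.9, 0.8, 0.4]
in_domain = [0.1, 0.5, 0.3]
score = auroc(ood, in_domain)
low, high = bootstrap_ci(ood, in_domain)
```

An AUROC of 0.5 corresponds to chance, which is where all four signals in the table sit.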

## 5. Cross-regime AUROC and the splice test

A looser question (does any signal separate in-domain from OOD after *joint* training?) has a
more interesting answer. We swept the metacognition-loss weight from 20 → 5 → 0 while keeping an
abstain BCE term:

| Variant | metacog wt | abstain wt | `abstain_p` AUROC | gibberish mean `abstain_p` | in-domain FP @ 0.775 | generation coherent? |
|---|---:|---:|---:|---:|---:|:--:|
| v4 (base SFT only) | – | – | 0.51 | 0.60 | 0% | yes |
| v7 | 20 | 1 | 0.76 | 0.94 | 20% | no |
| v8a | 5 | 1 | 0.80 | 0.97 | 23% | no |
| **v8b** | **0** | **5** | **0.85** | **1.00** | 10% | no |
| splice (v4 base + v7 head) | – | – | 0.54 | 0.46 | 27% | yes (v4-like) |

Two findings stand out.

**(1) The losses compete; they do not synergize.** The cross-regime signal *strengthens
monotonically as the metacognition weight goes to zero*. v8b, with zero metacognition pressure,
produces the strongest abstain signal in the entire project (AUROC 0.85, gibberish mean 1.00).
Adding the metacognition loss makes the discrimination *worse*, not better: the two losses
contend for the router's limited representation budget.

**(2) The signal does not survive a head-only splice.** Lifting v7's trained abstain head onto
v4's frozen base gives AUROC 0.54 (at chance, despite v7 itself reaching 0.76) and makes
behavior *worse*, not neutral, raising the in-domain false-positive rate to 27%:

| Deploy gate | v4 | splice | v7 |
|---|---:|---:|---:|
| gibberish mean `abstain_p` (target > 0.775) | 0.60 ✗ | 0.46 ✗ | 0.94 ✓ |
| in-domain false-positive rate (target ≤ 0%) | 0% | 27% | 20% |
| chat coherence | ✓ | ✓ (v4-like) | ✗ broken |
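
The two numeric rows of the deploy gate can be computed as follows. This is a sketch that assumes a prompt counts as a false positive when its `abstain_p` exceeds the 0.775 threshold; the input lists are made-up values, not probe data.

```python
THRESHOLD = 0.775

def false_positive_rate(in_domain_abstain_p, threshold=THRESHOLD):
    """Fraction of in-domain prompts on which the abstain head fires."""
    fired = sum(1 for p in in_domain_abstain_p if p > threshold)
    return fired / len(in_domain_abstain_p)

def deploy_gate(gibberish_abstain_p, in_domain_abstain_p, threshold=THRESHOLD):
    """Evaluate both numeric gate rows for one checkpoint."""
    gib_mean = sum(gibberish_abstain_p) / len(gibberish_abstain_p)
    fp_rate = false_positive_rate(in_domain_abstain_p, threshold)
    return {
        "gibberish_mean": gib_mean,
        "gibberish_ok": gib_mean > threshold,  # target: mean above threshold
        "fp_rate": fp_rate,
        "fp_ok": fp_rate == 0.0,               # target: no in-domain firing
    }

# Hypothetical scores: the head fires on gibberish but also on 1 of 4 in-domain prompts.
report = deploy_gate([0.90, 0.95, 1.00], [0.10, 0.20, 0.80, 0.30])
```

A checkpoint passes the numeric part of the gate only when both `gibberish_ok` and `fp_ok` are true; chat coherence is judged separately.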

### 5.1 Why the splice fails

A trained abstain head learns to read residual-stream patterns specific to its co-trained router.
Joint training shifts the router, which reshapes the residual stream; the head reads those
reshaped patterns. Lift the head onto a fresh base and the patterns are gone, consistent with
the literature on feature non-transferability in linear probes. The uncertainty signal is a
property of the joint {router-perturbation, head} representation, not of the head alone.
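
Mechanically, a head-only splice amounts to copying the head's parameters into an otherwise untouched base checkpoint. The sketch below uses plain dictionaries in place of real checkpoints, and the key names (`abstain_head.`, `blocks.0.router.weight`) are hypothetical, chosen only for illustration.

```python
def splice_head(base_state, donor_state, head_prefix="abstain_head."):
    """Return a copy of base_state whose head entries come from donor_state."""
    spliced = dict(base_state)
    for name, value in donor_state.items():
        if name.startswith(head_prefix):
            spliced[name] = value
    return spliced

# Strings stand in for weight tensors so the provenance of each entry is visible.
v4_state = {
    "blocks.0.router.weight": "v4",
    "abstain_head.weight": "v4",
    "abstain_head.bias": "v4",
}
v7_state = {
    "blocks.0.router.weight": "v7",
    "abstain_head.weight": "v7",
    "abstain_head.bias": "v7",
}
spliced = splice_head(v4_state, v7_state)
```

The spliced model keeps v4's router, so the v7 head reads a residual stream it was never trained on, which is the situation the splice row in the table measures.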

## 6. The router-fragility mechanism

v8b sets the metacognition weight to exactly zero: only cross-entropy on the in-domain subset and
BCE on the abstain head contribute gradient, and the only unfrozen parameters are the router
linears plus the abstain linear. **v8b still breaks generation**, sometimes more severely than
v7, which had a metacognition weight of 20.

Diagnosis: even with the metacognition loss identically zero, the in-domain cross-entropy term
backprops through the output head into the residual stream and from there into the unfrozen router
linears. Roughly 16,000 in-domain updates (500 steps × 32) shift the routing distribution enough
to break the routing the rest of the (frozen) model was tuned against; OOD generation then
collapses. At this scale the router cannot be retrained on *any* subset distribution without
disrupting generation elsewhere.

**Falsifiable corollary (queued, not yet run):** additionally freeze the router linears and train
only the abstain linear under BCE. We predict (a) the abstain head still reaches strong
cross-regime AUROC, because its signal comes from the residual-stream pattern rather than from
re-routing, and (b) generation is preserved. Confirmation would localize the damage precisely to
router re-tuning.
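
The difference between the v8b configuration and the queued corollary is only which parameters receive gradient. A minimal sketch of that selection, using hypothetical parameter names (`blocks.N.router.weight`, `abstain_head.weight`) rather than the repository's actual ones:

```python
def trainable_parameters(names, freeze_router):
    """Select parameter names that stay unfrozen.
    freeze_router=False mirrors v8b; freeze_router=True mirrors the corollary."""
    keep = []
    for name in names:
        if name.startswith("abstain_head."):
            keep.append(name)
        elif ".router." in name and not freeze_router:
            keep.append(name)
    return keep

# Hypothetical parameter names for a 2-block toy model.
all_names = [
    "blocks.0.router.weight",
    "blocks.0.ffn.weight",
    "blocks.1.router.weight",
    "blocks.1.ffn.weight",
    "abstain_head.weight",
    "abstain_head.bias",
]
v8b_trainable = trainable_parameters(all_names, freeze_router=False)
corollary_trainable = trainable_parameters(all_names, freeze_router=True)
```

In the corollary run the routing distribution cannot move, so any change in generation quality could no longer be attributed to router re-tuning.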

## 7. The deployed operating point (what actually works)

The practical recommendation at this scale is **not** joint finetuning: it is `max_softmax_mean`
plus abstain-aware SFT. The deployed v4 checkpoint, using exactly that recipe, reaches **9 / 10**
on the bundled held-out "I don't know" gate (PASS gate ≥ 9; the deploy probe was 10 / 10 on
slightly different phrasing) with a **0%** in-domain false-positive rate at threshold 0.775
(calibrated on held-out data). On a separate false-inability probe it fires the refusal template
on **7 / 20** answerable prompts (precision-bounded by SFT coverage). These are precision claims
about a head working on its trained pattern, not generalization claims; on semantic OOD outside
the SFT distribution the same head is at chance (Section 4).

## 8. Discussion

What we did **not** show: that any of this holds at 100 M or 1 B parameters. The router-fragility
argument is explicitly scale-dependent: a larger router with more capacity may absorb in-domain
updates without disrupting OOD routing. We leave that open. What we **did** show, at the scale we
tested: (1) the router-entropy-as-metacognition narrative is dead at 10 M; (2) abstain heads in
small routed LMs are not modular; (3) the strongest joint signal is reached by *removing* the
metacognition loss, not adding it.

## 9. Related work

Ternary base models at scale (e.g. BitNet b1.58) motivate small-model interest but do not address
modular uncertainty. Work treating sparse features as liftable modules is closer to our question;
we provide a counterexample, showing that the lifting fails for abstain heads in the routed-LM setting. Most
calibration work (ECE, temperature scaling, learned uncertainty heads) operates at 100 M+ scale;
our finding is small-scale specific.

## 10. Limitations and reproducibility

10.2 M parameters only; architecture-specific (3-pathway routed block). The v8 sweep uses one
base checkpoint and v4 another (history dependence). The probe set is hand-curated and
inter-rater reliability is not measured. Cost: ~$0.35 of GPU for the v8 sweep, the rest CPU.
Every headline number is bound to a script:

```bash
python reproduce/01_benchmark.py            # arch loads, ~10 M params (CPU, ~2 s)
python reproduce/03_abstain_held_out.py     # 9 / 10 held-out IDK gate (CPU, ~1 min)
python reproduce/04_neo_false_inability.py  # 7 / 20 false-inability (CPU, ~2 min)
python reproduce/02_metacog_probe.py        # cross-regime AUROC sweep (CPU, ~15 min)
```

Each exits non-zero if the bundled v4 checkpoint fails to produce the documented number within
tolerance.
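
The exit-code contract can be sketched as a tolerance check. The documented values below are taken from the Section 4 table; the metric names and the 0.02 tolerance are assumptions for illustration, not the scripts' actual settings.

```python
# Documented values come from the paper; names and tolerances are assumptions.
DOCUMENTED = {
    "abstain_p_cross_regime_auroc": (0.51, 0.02),
    "max_softmax_mean_cross_regime_auroc": (0.54, 0.02),
}

def failed_checks(measured, documented=DOCUMENTED):
    """Return the names of metrics that are missing or outside tolerance."""
    failures = []
    for name, (expected, tolerance) in documented.items():
        value = measured.get(name)
        if value is None or abs(value - expected) > tolerance:
            failures.append(name)
    return failures

def exit_code(measured):
    """0 when every metric reproduces, 1 otherwise; a script would pass this to sys.exit."""
    return 1 if failed_checks(measured) else 0

# Hypothetical measurements from a fresh run of the probe.
status = exit_code({
    "abstain_p_cross_regime_auroc": 0.515,
    "max_softmax_mean_cross_regime_auroc": 0.54,
})
```

Returning a non-zero status on any failure lets the scripts be chained in CI so that a drifted checkpoint stops the pipeline.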

## Appendix (sketch)

- **A1** Full 7-regime × variant AUROC matrix.
- **A2** Sample generations for all 5 variants on 5 representative prompts.
- **A3** Training curves (abstain gap, entropy gap, CE) for v7 / v8a / v8b.
- **A4** The 210-prompt probe set (`prompts/probe_210.jsonl`).
- **A5** Checkpoints and SHAs for all variants (negative-result checkpoints available on request
  via hello@tilelli.tech).