# Understanding & Constructing Sentences via SONAR + Sparse Autoencoders

*An autonomous-research writeup. 2026-06-06. Companion to `parascopes/` (probes) and
`text-autoencoders/` (SAEs). Live state: `AUTOPILOT_INTERP.md`.*

## The idea
Turn the abstract goal — **"take an arbitrary sentence and understand it; construct a
sentence from anything"** — into a concrete, measurable pipeline:

```
text  --SONAR encoder-->  z ∈ R^1024  --TopK SAE-->  sparse concepts
sparse concepts  --SAE decoder-->  ẑ  --SONAR decoder-->  text
LLM residual @ L27  --parascope probe-->  predicted z(next paragraph)  --SAE-->  concepts
```

SONAR is a frozen sentence autoencoder (text ⇄ 1024-d vector). We learn a sparse
dictionary over that vector space, so a sentence becomes a short list of interpretable
concepts, and any concept combination decodes back to a sentence. The parascope reads an
LLM's hidden state and predicts the SONAR vector of what it is *about to say*, so we can
name a model's upcoming content in human concepts.

## Artifacts (HF, account `nickypro`)
- **Probes** — `nickypro/parascope-probes`: `llama-3b_layer27_maghead_t0.03_a0.5.pt` (linear,
  rank1 0.592), `..._mlp_h2048x2_t0.05_a1.0.pt` (champion MLP, rank1 0.624).
- **SAEs** — `nickypro/sonar-sae`: `sonar-sae_h{H}_k{K}_btk_s{shards}e{epochs}.pt` + `_metrics.json`.
  Feature dictionaries: `feat_names_k128.json` (generative), `feat_dict_k128.json` (data-driven,
  512k-sentence corpus), `feat_dict_h16384_k8.json` (monosemantic). Demo outputs:
  `roundtrip_k128.json`, `parascope_concepts{,_champion}.json`.
- **Code** — `parascopes/layerwise/src/`: `train_sae_topk.py` (TopK/BatchTopK SAE + AuxK),
  `sonar_interp_lab.py` (understand / name / roundtrip / construct / build_dict / score_features),
  `parascope_to_concepts.py` (the full-stack demo).

## 1 — A modern SAE over SONAR (reconstruction)
Replaced the legacy ReLU+L1 SAE with a **TopK / BatchTopK** SAE (Gao et al. 2024; Bussmann et al.
2024) + AuxK dead-feature revival. Trained on `nickypro/llama-3b-embeds` (precomputed SONAR
paragraph vectors — cheap, no LLM/GPU capture). **k (active features/sentence) is the dominant,
monotonic reconstruction lever; dictionary width saturates by 16–32k.**

| n_feats | k | FVU ↓ | recon-cos ↑ | dead% |
|--------:|---:|------:|------------:|------:|
| 16384 | 32 | 0.50 | 0.73 | 0 |
| 16384 | 64 | 0.38 | 0.80 | 0 |
| 16384 | 128 | 0.31 | 0.84 | 0 |
| 16384 | 256 | 0.22 | 0.89 | 0 |
| 16384 | 512 | **0.11** | **0.95** | 0 |
| 32768 | 128 | 0.33 | 0.83 | 0 |
| 65536 | 32 | 0.54 | 0.70 | 2.3 |

BatchTopK beats per-sample TopK by a hair; wider dictionaries do **not** help reconstruction.

## 2 — Understand (roundtrip through text)
Measured in *text* space (decode the SAE reconstruction, SBERT-compare to the original) — the
metric that actually matters. With the k128 SAE: **text-SBERT 0.882 = 0.90× the SONAR encode→decode
ceiling (0.985)**, mean L0 = 128. Reconstructions preserve topic and sentiment:
- "The orchestra performed a moving symphony to a silent crowd." → "The orchestra played a
  symphonic ensemble before a silent crowd."
- "**The State of Scottish Football: A House of Cards**" → "**The Scottish Football Club: A House of Cards**"

## 3 — Construct (SONAR-space algebra & interpolation)
- **Analogy:** `"The king ruled wisely" − "king" + "queen"` → **"The queen ruled the country wisely."**
- **Interpolation:** *"A quiet morning by the sea." → "A noisy city street at night."* morphs cleanly
  across the 7 steps (sea→city, quiet→noisy).
- **From features:** activate chosen SAE features → decode → a sentence on that theme.

## 4 — Parascope → concepts (read the hidden state, name what's coming)
From an LLM's L27 residual, predict the next paragraph's SONAR vector, decompose into concepts,
and decode a paraphrase — then check vs the true next paragraph (held-out shard, n=30):

| probe | cos(pred,true) | concept Jaccard@12 | SBERT(decode,true) |
|---|---:|---:|---:|
| linear maghead | 0.349 | 0.179 | 0.338 |
| **champion MLP h2048x2** | **0.397** | **0.203** | 0.331 |
| random baseline | — | ≈0.0007 (12/16384) | — |

**Concept overlap is ~30× chance and topically on-point** (an IMDb/mini-series passage surfaces
film/series/classification concepts). The better probe improves *which* concepts and vector fidelity
(+14% Jaccard). The decoded paraphrase is loose — capped by probe-prediction accuracy (cos ≈ 0.4),
**not** by the decoder (see §6).

## 5 — Monosemanticity is the real frontier (and sparsity fixes it)
High-k SAEs reconstruct well but their features are **polysemantic** — a feature's top paragraphs mix
unrelated themes. We quantified this with a **coherence** metric (mean pairwise SBERT cosine among a
feature's top-activating sentences) and swept sparsity (matched 16384 width + corpus):

| n_feats | k | FVU | mean coherence ↑ | %features > 0.5 |
|--------:|---:|----:|-----------------:|----------------:|
| 16384 | 8 | 0.65 | **0.159** | 4.3 |
| 16384 | 16 | 0.58 | 0.130 | 2.9 |
| 16384 | 32 | 0.50 | 0.114 | 2.6 |
| 16384 | 128 | 0.31 | 0.123 | 4.9 |
| 65536 | 16 | 0.61 | 0.112 (worst) | 1.2 |

**Lower k → more monosemantic features (+40% coherence k32→k8).** Wider dictionaries *hurt*
(under-trained, 19.5% dead). k8's top features are genuinely single-theme — e.g. *creatine
supplementation*, *ICTR / Rwanda international humanitarian law*, *geopoetry (poetry × geology ×
climate)*, *quaternary error-correcting codes*. **There is a steep reconstruction ↔ interpretability
tradeoff**: use **low-k** for *understanding* (crisp concepts) and **high-k (k128/k512)** for
*reconstruction/construction* (faithful decode).

**Sentence-level beats paragraph-level.** SONAR is trained on *sentences*, so encoding whole paragraphs
is mildly out-of-distribution. Re-running on sentence-split, deduplicated text (150k sentences) improves
monosemanticity **and** reconstruction simultaneously:

| level | k | coherence ↑ | FVU ↓ |
|---|---:|---:|---:|
| paragraph | 8 | 0.159 | 0.65 |
| paragraph | 16 | 0.130 | 0.58 |
| **sentence** | **8** | **0.210** | **0.48** |
| sentence | 16 | 0.180 | 0.41 |

Sentence-level k8 is the most monosemantic (+32% over paragraph-k8), and sentence-level reconstructs
better at matched k. **Recommended "understand" SAE: sentence-level, k8** (`sentence-sae_h16384_k8.pt`).
Keep a high-k paragraph SAE (k128/k512) for faithful reconstruction/construction.

## 6 — Negative result: the SONAR decoder is *not* the bottleneck
A decode-aware adapter (trained to pull off-manifold vectors back onto the SONAR manifold) gave **no
improvement** (≈ −0.01 SBERT). The "Queen Queen…" degeneracy seen early was a **k64-SAE / freshly-
encoded-tiny-sentence artifact** that the better SAE already fixes (k128 short-input decode 0.893
SBERT, ~1% repetition). ⇒ **No SONAR-decoder LoRA needed.** The remaining paraphrase limit is probe
prediction quality, which lives in the probe lane.

## 7 — Mechanistic: manifold steering of the LLM (Goodfire thesis, confirmed)
Steering Llama-3.2-3B at a single layer by adding a concept's **difference-of-means**
direction. Naive ("raw") steering works at low strength but derails off-manifold at
high strength. We compare against **tangent** steering (keep the activation, add only
the component of the steering delta lying *along* the residual data-manifold — a PCA
subspace fit on ~300 real residuals), plus norm-clamp and a (deliberately lossy) full
re-projection. Metrics: steering *success* (SBERT cos of the continuation to the concept)
and *coherence* (distinct-trigram ratio). 1284 runs, concepts × layers 4–27 × α 4–32.

**At high strength (α ≥ 24):**

| method | mean coherence ↑ | mean success ↑ |
|---|---:|---:|
| **tangent (on-manifold)** | **0.853** | **+0.076** |
| normclamp | 0.468 | +0.052 |
| raw (naive) | 0.438 | +0.049 |
| manifold (full re-projection) | 0.135 | −0.007 |

**Tangent steering preserves coherence far better at high strength AND steers more.**
Raw success peaks at α≈16 then *falls* as generation derails; tangent keeps climbing
(e.g. ocean L10 α24: success +0.35 at coherence 0.91). The naive full-reprojection
collapses generation (it discards most of the residual) — a cautionary result. This
**confirms Goodfire's manifold-steering thesis in our setting** and redeems the project's
earlier *negative* single-layer steering result (`parascopes/RESULTS.md`): the failure was
the steering *method* (straight-line, off-manifold), not a fact about the model. Early
layers (~L10) steer most effectively. Results: `steer2_*.json` on HF.

**But the steering *direction source* matters — read ≠ write.** We also tried steering with
directions obtained by running the **parascope backwards**: concept → SONAR descriptor →
residual direction via the probe's pseudo-inverse (`M = W⊙scale`, steer with `x_std⊙(M⁺d)`),
at the probe's own layer, raw vs tangent (`steer_concept_*.json`). Tangent again kept
generation *coherent* (0.81–0.99 where raw collapsed), **but none of the methods actually
steered toward the concept** (max success ≈ +0.13 is metric noise on generic "I'm a language
model…" continuations; mean −0.03 — vs diff-of-means' +0.35). So the parascope is a good
*reader* but its linear inverse is **not** a causal *controller*: the direction the probe
uses to read a concept off the residual is not the direction that makes the model express
it (the classic read/write-direction gap), compounded by the SONAR→residual map being
massively underdetermined (3072 ≫ 1024 → the min-norm pinv direction is arbitrary in the
2048-d null space). **Takeaway:** manifold steering solves the *coherence* problem
universally; causal *efficacy* requires a causal direction (difference-of-means works;
probe-pseudo-inverse does not).

**A 12-concept tangent steering dictionary** (`steering_dictionary.json`). Replicating the
sweep over 12 concepts (ocean, medicine, anger, finance, space, cooking, law, music, nature,
fear, sports, history) confirms the effect: tangent at α≥24 averages success +0.086 /
coherence 0.84 vs raw +0.053 / 0.42. **All 12 concepts are steerable** under tangent while
staying coherent (best success +0.11 to +0.35; e.g. history→"a complex and tumultuous
conflict", ocean→"the depths of the ocean… the blue", nature→"mystical realms, ancient
forests"). Notably, **steering is most effective at *early* layers — median best layer L5**
(L4–L10 for most concepts) — even though the parascope *reads* the concept at L27. So the
model is **read-deep / write-shallow**: the layer where a concept is most linearly *decodable*
(late) differs from where injecting it most effectively *changes behavior* (early) — another
facet of the read/write asymmetry, and a useful practical guide for where to intervene.

**The read/write gap, quantified geometrically** (`rw_*.json`). At every probe layer (10–27),
the causal *write* direction (diff-of-means) is **orthogonal** to the probe's *read* direction
`Mᵀd` (cos ≈ 0.00–0.02) and to the pinv `M⁺d` (cos ≈ 0.01–0.02), and the probe does **not**
read the write direction as the concept (probe-reads-write ≈ 0). Read and pinv are mutually
correlated (cos ≈ 0.42–0.46, both from `M`) but lie in a subspace nearly orthogonal to what
steers. **This is the crisp explanation of why running the probe backwards can't steer**: the
probe's read subspace and the model's causal write subspace barely overlap — a predictive
linear probe identifies *where information is*, not *the lever that controls it*.

## 8 — Interpretable SAE features as causal levers (closing the loop)
Can we causally steer the model toward a *specific interpretable SAE feature*, not just a
hand-picked concept? For each feature we derive a causal residual write-direction from the
feature's **own top-activating sentences** (diff-of-means — the kind of direction the
read/write study showed actually steers), tangent-steer Llama-3.2-3B, then **re-encode the
output through SONAR + the SAE and check whether feature f's activation rose**
(`steer_saefeat_preact_*.json`).

A metric subtlety mattered: scoring f's *post-TopK* activation reads ~null, because the
sentence-SAE is k=8-sparse so any one narrow feature is essentially never in the active set
(activation exactly 0, no gradient). Scoring the **pre-activation** (continuous) reveals the
real effect:

| | raw | tangent |
|---|---:|---:|
| mean Δfeature-activation (coherent rows) | +0.33 | +0.26 |
| # coherent operating points (coh>0.6) | 131 | **254** |
| mean coherence at α≥16 | 0.285 | **0.770** |

**Steering coherently raises the chosen feature's activation**, and once you require coherence,
**tangent dominates** — it has ~2× the coherent operating points and holds coherence at the
strengths where raw collapses (raw raises the feature marginally more but mostly incoherently).
Best lever: **mid-layer L8** (tangent, α24 → the target feature jumps to rank ~107 / 16384 at
coherence 0.73). Caveat on *precision*: steering reliably pushes a feature into roughly the top
~1–3% by rank, but rarely makes it the single most-active feature (1/385 coherent runs reached
top-100). So: **any interpretable SAE concept is a usable causal lever** — you can pick a
human-named feature and coherently drive the model toward it — but exact one-feature control is
not yet achieved. This closes the descriptive↔mechanistic loop: the same dictionary that *names*
what the model is doing also *steers* it.

*Multi-layer doesn't sharpen it.* Steering the same feature jointly at {4,6,8} underperforms
single-layer L8 (+0.06 vs +0.49 feature gain) and mostly breaks coherence (11/60 vs 54/60
coherent runs) — naive equal-strength stacking overshoots the manifold. A single mid-layer is
the sweet spot; a true geodesic (per-layer-scaled) intervention would be the thing to try next.

## 9 — The interp toolbox applied to the TAE (SONAR): how well does each technique work?
A systematic pass applying standard interpretability methods to the SONAR text-autoencoder's
1024-d sentence/paragraph embedding space (the same space the parascope predicts and the SAE
decomposes). All on the precomputed `llama-3b-embeds`.

| technique | works on the TAE? | headline |
|---|---|---|
| **Sparse autoencoders** (§1,5) | ✅ very well | TopK/BatchTopK; sentence-level low-k → monosemantic features; auto-labeled. **TopK > JumpReLU** on SONAR at matched L0 (L0≈30: FVU 0.50 vs 0.75; L0≈177: 0.31 vs 0.42; JumpReLU only low-FVU when dense) |
| **Linear probing** (`tae_probe`) | ✅ strongly | surface/structural attrs are highly linear: length R²≈0.97, numbers AUC .999, quotes AUC 1.0, first-person .99, capitalization .99; caps-ratio weak (.05) |
| **Geometry / PCA** (`tae_geometry`, `tae_pc_decode`) | ✅ informative | **PC0 = sentence length** (10.6% var); participation-ratio dim ≈75 but 90% var needs 647 dims; near-isotropic after centering (mean cos .005). Decoding ±PCs: PC0=length/elaboration, PC1=prose↔markup, PC2=formal↔informal register, PC3-5=topical/register contrasts |
| **Concept/attribute steering** (`tae_attr_steer`) | ✅ (low α) | diff-of-means latent directions are causal: decode acquires quotes/first-person/numbers (+0.98/+0.98/+1.0); high α → repetition (same strength↔coherence tradeoff as LLM steering) |
| **Decoder-lens / robustness** (`tae_decoder_lens`) | ✅ maps the manifold | robust to noise (σ0.5→.92) & 4-bit quantization (.92); meaning is **high-dim** (needs ~512 PCs, not low-rank); **linear interpolation breaks past t≈¼** → manifold is curved |
| **Neighborhood / clustering** (`tae_cluster`) | ✅ | proximity is dominated by surface form: kNN share length at corr **0.95**, capitalization +0.39 over chance, numbers/quotes/person +0.12–0.21 — SONAR geometry organizes by length/structure more than content |
| **Superposition / per-dim** (`tae_superposition`) | ✅ confirms need for SAE | attributes are **distributed** (length ≈590 dims, capitalization 623) and the native basis is **in superposition** — 67% of dims polysemantic (mean 3 attrs/dim) → exactly why an SAE is needed to disentangle |
| **Activation steering of the source LLM** (§7) | ✅ on-manifold | tangent > raw; read-deep/write-shallow |
| **Probe pseudo-inverse as a control knob** (§7) | ❌ | read ≠ write (causal dir ⟂ probe read dir) |
| **SAE features → LLM causal levers** (§8) | ◐ partial | coherently steer toward a named feature; exact one-feature precision not reached |

**Unifying picture of the TAE.** SONAR linearly and *causally* encodes surface/structural
attributes — **length is literally the top principal component** — and these are both decodable
(probing) and steerable (latent diff-of-means). But the *content* lives in a high-dimensional,
**curved** manifold: faithful decoding needs hundreds of PCs, the space is near-isotropic after
centering, and linear interpolation/large steps leave the manifold and decode to garbage. That
curvature is the through-line: it's why vector algebra works near real points but midpoints
break, why on-manifold (tangent) steering beats straight-line steering, and why low-α latent
edits stay coherent while large ones collapse. Artifacts: `tae_*.json` on HF.

## 10 — Full-coverage round 2: the June-2026 toolbox vs the TAE (in progress)
Driven by a fresh survey of the field (`INTERP_TECHNIQUES_SURVEY.md`, ~135 techniques mapped onto
SONAR). New techniques applied this round, each with its TAE-specific twist — the latent decodes
to text, so every method gets a *behavioral* readout no LLM-internals study can have.

| technique | works? | headline |
|---|---|---|
| **Matryoshka SAE** (`train_sae_matryoshka`) | ✅ free hierarchy | FVU matches plain BatchTopK exactly (k32: .506 vs .502; k128: .310 vs .311) while adding nested coarse→fine structure: first-1024-feature prefix alone reconstructs at FVU .41 (k128) |
| **Concept erasure, LEACE** (`tae_erasure`) | ✅ + a twist | closed-form erasure kills every linear probe (length R² .96→−.08 at ‖Δ‖/‖z‖=0.26). **Decode readout: erasure ≠ removal** — first-person truly vanishes from decoded text (0.117→0.033, content sim .925) but has_number barely moves (0.28→0.25) though its probe is dead: the decoder regenerates some attrs from nonlinear traces |
| **Latent activation patching** (`tae_latent_patching`) | ✅ sharp result | **no PC band transfers content** (sim-to-B ≈ .12–.15 for every band; ALL dims = .96) — content is holographically spread. **SAE-feature transplants DO transfer compositionally** (top-64 feats: .29, with real blends) — the learned sparse basis beats every spectral band as a causal handle |
| **Dark-matter accounting** (`tae_darkmatter`) | ✅ | only **24.5%** of SAE error is linearly predictable from z (vs ~50% for LLM residual SAEs, Engels 2024) — but adding the predicted error back recovers **64% of the decode-similarity gap** (.842→.893, ceiling .922): the blind spot is structured, just not linear |
| **Intrinsic dimension, TwoNN** (`tae_intrinsic_dim`) | ✅ headline number | **local ID ≈ 24 vs PCA effective dim ≈ 490**: the text manifold is locally ~24-dimensional but so curved it linearly smears across half the ambient space. ID grows with text length (12→32); SAE reconstruction preserves ID (24.3). (Methodological catch: text dedup is mandatory — boilerplate dupes collapse TwoNN to ~4) |
| **Convexity / interpolation atlas** (`tae_convexity`) | ✅ decisive | midpoint blends almost never happen: **0.7% blend, 68% abrupt switch, 31% degenerate valley** (midpoints decode to boilerplate); linear analogies (a−b+c) fail (sim .27/.29 mixed). Chords leave the manifold — the curvature thesis, verified behaviorally |
| **Rate–distortion / bit allocation** (`tae_rate_distortion`) | ✅ | content robust to iso noise up to σ≈0.4·RMS; **top-64-PC noise destroys content ~2× faster than tail noise at matched norm** (.56 vs .83 at σ=0.7); most-robust attribute = first-person (survives σ=1.5), length decays first |
| **Output-centric autointerp** (`tae_autointerp_flipbook`) | ✅ dose-sensitive | α=8: detection .57 (underdosed). α=20: .66, 19% of features >0.75 with real semantic labels — "shifts topic to law" (det 1.0), "shifts perspective to first person" (.875), "changes singular to plural" (.812); also surfaced grapheme-level features ("replaces letters with B/E"). α=32 + bigger sample in flight |
| **Round-trip Lyapunov spectroscopy** (`tae_roundtrip_lyapunov`) | ✅ TAE-native, works | iterate encode∘decode on perturbed z: SAE directions survive **6.5× better than random** (cos .103 vs .016 at step 6; PCs .057); survival plateaus after step 2 — surviving component is genuinely text-expressible. Cross-check: top survivor (feat 14006) = autointerp's "singular→plural" feature |
| **DAS, templated counterfactuals** (`tae_das`) | ✅ strikingly linear | a **rank-1 learned interchange suffices for every binary contrast** — tense, person, even a binary object slot: 34–39/40 EXACT counterfactual decodes at k=1 (cos-to-target .98). On-template, SONAR is locally linear enough for 1-d causal coordinates. 10-way content-slot test running (binarity vs true dimensionality) |
| **Cross-space alignment / universal decoder** (`tae_align_spaces`) | ◐ moderate-platonic | CKA(SONAR, MiniLM)=0.42; ridge MiniLM→SONAR cos .53; decoding mapped alien embeddings gives sim .475 vs ceiling .957 — **topic gist crosses encoders linearly, full content does not** (and mpnet transfers worse than MiniLM despite more dims) |
| **Behavioral SAE text-bench** (`tae_sae_text_bench`) | ✅ new ranking axis | text quality is monotone in k (k512: .903 ≈ ceiling .935, degeneracy at ceiling) but **k64 ≈ k128 behaviorally** (.744 vs .735) despite better FVU — mid-range variance gains don't buy text. Attr preservation saturates ~.82 from k64. Matryoshka = plain BatchTopK behaviorally at matched k |
| **Matryoshka-basis patching** | ✅ hierarchy is causal | transplanting top-64 coarse-prefix features transfers MORE content than flat-k128 features (sim-to-B .362 vs .291) — coarse features carry bigger semantic chunks |
| **Self-labeling thermometers** (`tae_selflabel`) | ✅ clean ledger | round-trip agreement ≥.83 for everything except **n_sentences (.47 — THE decoder-lossy attribute**; its probe drops .71→.35 on decodes); no decoder-added attributes found |

| **Multiclass DAS** (`tae_das_multiclass`) | ❌→ corrected by audit | initial "10-way object slot ≈ 8-d block (43/50 exact at k=8)" was **object leakage** (audit FATAL): with fully HELD-OUT object values the interchange collapses to **0/50 at every k** (cos-to-source 0.99 — the swap barely moves anything). DAS had memorized per-object directions; SONAR exposes **no value-generalizing "content slot"** to linear interchange — binding isn't linearly accessible (fits the local-frames picture) |
| **Composition map** (`tae_composition`) | ✅ user-recollection confirmed | [z_A; z_B] → z("A then B"): full linear ridge cos **0.889**/R² 0.77 (decode of prediction ≈ true concat: 0.941 vs 0.976 ceiling); mean-baseline only 0.63; adding (A−B) does NOT help (0.8889); **low-rank fails** (r64: 0.33, r256: 0.52) — concatenation is a genuinely high-rank linear operation. ORDER: cos(z_AB, z_BA)=0.52, and the SAME map on swapped inputs predicts z_BA at 0.881 (order-blind control 0.52) — one order-equivariant composition operator |
| **Canonicity grid** (seed-robustness × config) | ❌ universal disjointness | seed-match median barely moves across configs: k128/12ep 0.149, h8192/k64 0.164, k32 0.173, 40 epochs 0.156 (rotation baselines 0.117–0.122; held-out FVUs identical within pairs). No width/sparsity/training-length makes dictionaries canonical |
| **Lyapunov full-rank scan + atlas shards** | ✅ flat residence, rising labelability | round-trip survival is FLAT across firing ranks (r96–191: 0.117, r192–287: 0.113 vs random 0.018) — the whole live dictionary is equally manifold-resident; meanwhile detection-scored labelability RISES down-rank (head 0.66–0.70 → ranks 192–287: 0.76 → ranks 288–383: **0.78**, 57%>0.75): mid-frequency features are the interpretability sweet spot |
| **Transcoder scale-up** (L6/L12) | ✅ scales | 2.3× capture + 20 epochs: mid-layer FFN FVU 0.50 → **0.32–0.33** (h8192/k64, <0.1% dead) — the decoder's hard middle layers are tractably dictionary-izable; clear path to feature circuits |
| **Text-loss SAE v2** (2 epochs, 100k) | ✅ trend confirmed | decoder-CE 1.118 → **0.799** (v1: 0.84), decode-SBERT .818→**.839**, FVU .48 — and grad-map alignment went MORE negative (−.217→−.288): text-loss training improves text behavior while moving row energy *away* from decoder-sensitive dims. Behavior and geometric alignment are genuinely different objectives |
| **Canonicity under text-loss** (iron JOB2-4) | ✅ micro-adjustment, reproducible | two text-loss fine-tunes (different data order) from the same warm start: only ~21% of features move at all, and "moved" means cos-to-warm **0.988**; the two runs move features in AGREEING directions (Δ-direction cos median 0.59, 86% positive; best cross-run match = own index 100%) — behavioral gains come from tiny consistent micro-tuning of the existing frame, not re-tiling. (From-scratch text-loss canonicity remains open; variance-SAE reference: match cos 0.149) |
| **Multilingual decode mechanics** (`tae_lang_decode`) | ✅ answers "how does it work?" | no English fallback (script purity .86–.95, ~0 English stopwords); behaves like a **soft pivot** — decode(z,L) ≈ translation of the English decode at semantic cos .87–.94 but only ~.8 char-level (one inner text, language-specific rendering; not word-by-word: CJK compresses properly, len 0.28–0.53×); re-encode tax: cos .70–.87 vs .98 for English (worst: Japanese); noise hurts all languages uniformly — no low-resource cliff |
| **Iterated erasure** (`tae_erasure_iter`) | ◐ diminishing returns | erase→decode→re-encode→re-erase ×3: has_number 0.267→0.183, has_quote 0.083→0.033, but content sim decays to ~0.90 and removal is still incomplete — the decoder keeps regenerating; nonlinearly-stored attributes can't be fully removed by iterated linear erasure |
| **Cross-encoder feature universality** (`tae_xencoder_universality`) | ◐ shared core only | match top-512 SONAR features to an SBERT-mpnet-space SAE by activation correlation: median 0.19 (shuffle 0.013) but **16% find a >0.5 partner** (p90 0.61) — a minority concept core is encoder-universal; the rest is space-specific tiling (coheres with seed-disjointness) |
| **Decoder transcoders** (`tae_tc_*`, layers 0/6/12/18/23) | ✅ decoder opens up | per-layer FFN BatchTopK transcoders: layers 0/23 are **near-linear** (FVU .005/.015), mid layers carry the hard computation (FVU ~.50 at 6/12/18) — same mid-stack concentration as LLMs. L12 features are recognizably interpretable (punctuation, determiners, proper-noun institutions, participles, suffix morphology) — first dictionary view inside the TAE decoder |
| **Multilingual feature transfer** (`tae_multilingual`) | ✅✅ headline | **8/8 labeled SAE features act in ALL of fr/de/es/zh/ar** with comparable Δactivation (law feature: 6.6 en vs 7.0–8.0 others; verified by re-encoding steered foreign text). SONAR features are language-universal concept handles — the strongest available test that labels name concepts, not English surface patterns. Space invariance: cos(z_en, z_xx) ≈ 0.71–0.76 |
| **AtP gradient map** (`sonar_grad` + `tae_atp_map`) | ✅✅ unlocks grads | differentiable teacher-forced CE through the frozen decoder (sanity: CE 0.59 matched vs 4.3 random z). Findings: top-64 PCs carry only **12.5% of gradient energy vs 35% of variance** — the decoder reads the spectral TAIL; SAE row energy **anti-correlates** with decoder sensitivity (r=−0.22); content tokens read z 1.37× harder than function words (which the decoder prior fills in) |
| **Embedding MELBO** (`tae_melbo`) | ✅ finds dictionary gaps | 8 restarts → 8 distinct directions (pairwise cos .03), each ~2× random behavioral change — and their SAE coverage (max-cos .13) equals the random baseline: **unsupervised search finds behavior-relevant directions the dictionary misses** (consistent with the AtP misalignment) |
| **Paraphrase quotient** (`tae_paraquotient`) | ◐ honest negative | round-trip orbits are 88% content-variance; the form residual is diffuse (eff-dim ~225) and does NOT transplant (decoded length corr with donor form: −0.01). "Form" under round-trip noise is not a coherent transferable code |
| **Dose–response** (autointerp α-ladder) | ✅ two regimes | α≈20 reveals semantic edits (detection .66); α≈32 pushes weak features into grapheme-attractor collapse ("replaces text with J", detection 1.0 but trivial) — mean detection .71 / 41% >.75. Every feature has a collapse dose; census in flight |

| **Jacobian-vs-dictionary audit** (`tae_jacobian_audit`) | ✅✅ critique confirmed | the SAE's ACTIVE features cover only **18.5%** of the decoder's local sensitivity subspace — barely above random features (16.3%), below same-count top-PCs (23.7%). Triple-confirmed with the global AtP anti-correlation and MELBO-outside-dictionary |
| **e2e text-loss SAE** (`train_sae_textloss`) | ✅✅ FVU debunked | fine-tune the k128 SAE through the frozen decoder (loss = decoder CE): decoder-CE on reconstructions **1.12 → 0.84 (−25%)**, behavioral decode-sim slightly UP (.818→.823), while **FVU worsens .30 → .48**. Variance explained and text-information are decoupled — FVU mismeasures dictionary quality for a TAE |
| **Text-constrained latent surgery** (`tae_latent_surgery`) | ✅ + curvature again | gradient-through-decoder finds Δz making the decoder produce targeted edits at 100% success, ‖Δz‖≈0.2·RMS — but minimal Δz for the SAME edit at different bases are nearly orthogonal (cos .17–.35): predicate directions are local, not global |
| **Belief-state geometry** (`tae_beliefstate`) | ✅ survives audit, reframed | after fixing the emission-prob bug and adding the audit-required surface baseline: bag-of-topic-counts+last explains R² .98 of the posterior, BUT z still linearly predicts the **order-information residual at R² .90–.92** (robust at noisier emissions, where the surface baseline drops to .92). Honest claim: SONAR linearly exposes sequence-order statistics sufficient for the Bayes posterior — beyond bag-of-topics |
| **Circular features (SMDS-lite)** (`tae_circles`) | ◐ ordered, not planar | months/weekdays/hours form ORDERED CURVES, not LLM-style planar circles (top-2 PC share only .28–.44); hours have perfect cyclic order in-plane and walking the fitted ring decodes to ordered times. Cycles respected but not flattened |
| **Steering honesty baseline** (`tae_steer_vs_rewrite`, `tae_steer_tangent_pareto`) | ❌ honest negative | 0/6 features Pareto-dominate "LLM rewrite + re-encode" (steering: 2–10× more feature activation but content sim .24–.62 vs rewrite's .74–.97) — and TANGENT projection does not close the gap (0/6; it even reduces potency). Latent steering's value = LLM-free causal handles, not content-preserving editing |
| **Dose–response census** (`tae_dose_response`) | ✅ | 64 features: 12 smooth-semantic / 43 semantic-then-collapse / 9 cliff; every collapsing feature has its own single-LETTER attractor (f, n, e, k, s, a, …) — the decoder's deep basins are grapheme loops keyed by direction |
| **Direction zoo** (`tae_direction_zoo`) | ✅ synthesis | at matched ‖δ‖: AtP gradient axes = most text-potent (drift .058, coherence 1.0) but least manifold-resident (survival .021); top PCs = most resident (.059); SAE/Matryoshka rows balance both; MELBO/random trail. **Potency and manifold-residence are different merits — no family wins both** |

| **Seed robustness** (`tae_seed_robustness`) | ❌❌ not canonical | two seeds, identical FVU (.2737/.2741), essentially DISJOINT dictionaries: median best-match cos **0.149** vs random-rotation baseline 0.122; 0.1% of features have a >0.7 partner. The dictionary is a frame tiling the manifold, not discovered atoms |
| **Textloss-SAE evals** (wave 8) | ✅ behavior ≠ geometry | behavioral bench: sbert .735→**.778**, attr preservation .816→**.919** (beats k512 at 4× lower FVU), degeneracy down — yet Jacobian coverage unchanged (.189 vs .185) and labelability unchanged (det .64): text-loss fixes behavior without relocating the dictionary into the decoder-sensitive subspace |
| **Multilingual stress** (hi/ja/sw/tr/ko) | ✅✅ | 10/10 atlas features transfer to all five distant languages — cumulative: **18/18 feature-language checks across 10 languages, 4 scripts** |
| **Belief states, depth 8** | ✅ | posterior R² .97, history-residual R² .88 at 8 clauses — belief-state encoding holds with deeper history |

(Iterated erasure still running at writeup time; result lands on HF as `tae_erasure_iter.json`.)

**Independent code audit (codex, `CODEX_AUDIT.md`).** An external review of the 7 load-bearing
scripts found 3 FATAL + 14 MAJOR issues. The FATALs: (1) the belief-state generator's noise
branch made the "exact" posterior slightly wrong (own-emission 0.867 vs assumed 0.8) — fixed;
(2) belief-state R² needed a surface-statistics baseline (bag-of-topic-counts) to support the
"belief state" framing — added, including an order-information residual probe; (3) the
multiclass-DAS k≈8 result evaluated on objects SEEN in training (template-level split only) —
fixed with fully held-out object values + no-op-swap removal. Seed-robustness also hardened
(symmetric matching, both-live filtering, held-out FVU, 5-rotation baseline). RERUN OUTCOMES:
multiclass-DAS claim **reversed** (no generalization to unseen objects — table row corrected);
belief-state claim **survived in honest form** (order-residual R² .90 beyond the surface
baseline); seed-disjointness **fully robust** (sym median .149 vs rotation .122, held-out FVU
identical .386/.387). One audit pass found one false claim and strengthened two — independent
code review of agent-written experiments is clearly worth the tokens.

**Emerging round-2 synthesis.** Three independent new measurements (ID 24-vs-490, 0.7% blends,
no-band-transfers) triangulate the same object: a low-dimensional, highly curved content manifold
embedded redundantly across all 1024 dims. Methods that respect the manifold (SAE features,
on-manifold steering, near-point algebra) get causal traction; methods that move along straight
ambient lines (PC bands, chords, analogies, isotropic erasure of nonlinear attrs) reliably fail —
and the decoder's language prior actively repairs/regenerates what linear edits try to delete.

## 11 — SYNTHESIS: what a night of applying ~30 interp techniques says about TAEs

One 8-hour run: a fresh survey of the June-2026 interp toolbox (~135 techniques,
`INTERP_TECHNIQUES_SURVEY.md`), then every applicable technique executed against SONAR on a
7×A40 fleet — including several TAE-native techniques that don't exist elsewhere, because only
here does every latent direction decode to readable text. ~30 distinct techniques produced
results; the findings organize into five claims.

**1. The object itself: a low-dimensional, curved, redundantly-embedded content manifold.**
Locally the data is ~24-dimensional (TwoNN), but it curls through ~490 linear dimensions (PCA
eff-dim) — a 20× curvature gap. Everything else follows: midpoint interpolation almost never
blends (0.7% — chords leave the manifold and decode to boilerplate); no PCA band causally
carries content (only ALL dims transfer it); cyclic concepts are ordered curves, not planar
circles; and the minimal latent edit achieving the *same* text change at different points uses
nearly orthogonal directions — causal structure is local, defined in each point's tangent frame.

**2. What IS linear is striking — but audit it honestly.** SONAR linearly encodes the Bayes
posterior of a hidden process generating its text (R²≈0.99), and — surviving an independent
audit's surface-statistics baseline — the *order-information* component beyond bag-of-topic
counts (residual R²≈0.90, robust under noisier emissions, holds at depth 8). Templated binary
contrasts (tense, person) are rank-1 causal coordinates under DAS. But the audit KILLED the
content-slot claim: with held-out object values, multiclass interchange collapses to 0% —
DAS memorizes per-value directions; no value-generalizing binding variable is linearly
accessible. And composition is linear-but-HIGH-RANK: [z_A;z_B]→z("A then B") works at cos 0.89
with a full map while rank-256 only reaches 0.52. Attribute probes, LEACE, diff-of-means all
work as advertised — *for reading*.

**3. Dictionaries: useful frames, not canonical atoms — and trained on the wrong objective.**
The SAE basis is the best causal handle we found (features transplant content compositionally
where PC bands fail; 6.5× round-trip survival vs random; detection-scored labels at .66–.71;
and the strongest result: features act identically across 10 languages and 4 scripts — they are
language-universal concept handles). But three independent measurements show the same blind
spot: SAE rows anti-correlate with the decoder's gradient sensitivity (r≈−0.22), active features
cover only ~18.5% of the local sensitivity subspace (≈random), and MELBO finds behavior-changing
directions the dictionary misses. Worse for canonicity: two seeds reach identical FVU with
essentially disjoint dictionaries (median match cos 0.149 ≈ rotation baseline 0.122). And FVU
itself is the wrong target: an SAE fine-tuned through the frozen decoder (text-CE loss) gets
*worse* FVU (.30→.48) yet better behavior on every axis — decoder CE −25%, attribute
preservation 0.82→0.92 (beating a k512 dictionary with 4× lower FVU). For TAEs, dictionary
quality should be measured (and trained) in decoder-CE/text space, not variance space.

**4. Reading ≫ writing, and the honest steering baseline wins.** Erasure kills probes but the
decoder regenerates some attributes from nonlinear traces (erasure ≠ removal). Feature steering
moves activations 2–10× harder than any baseline but never Pareto-beats "ask an LLM to rewrite,
re-encode" on content preservation — and tangent projection doesn't close the gap. The direction
zoo splits the merits cleanly: gradient (AtP) axes are the most text-potent and the least
manifold-resident; PCs the reverse; SAE rows balanced. Potency and on-manifold-ness are
different, partially conflicting properties of a direction.

**5. The decoder is a strong prior, not a transparent readout.** It fills in function words
(content tokens pull 1.37× more gradient), silently drops sentence segmentation (the one
decoder-lossy attribute), repairs linear edits, and at high dose every direction falls into a
feature-specific single-letter attractor basin. Any TAE-interp claim must separate "information
in z" from "decoder prior fills it in" — our self-labeling thermometers and ensemble tools do.

**What full TAE interpretability needs next** (the gap this battery exposes): dictionaries that
are (a) trained through the decoder (text-loss, confirmed better), (b) manifold-aware (local
frames/tangent bundles rather than one global basis — seed non-canonicity and local surgery
directions both demand it), and (c) evaluated behaviorally (decode-based detection scoring,
rewrite baselines, multilingual label transfer). The TAE setting makes all three *measurable*,
which is exactly why it's a good proving ground for the interp toolbox.

## 12 — COMPOSITION: anatomy of the [z_A; z_B] → z("A then B") operator (overnight 2026-06-11/12)
User-directed deep-dive. All scripts codex-pre-reviewed (1 FATAL + 13 MAJOR leakage fixes applied
BEFORE launch); every headline below is being independently replicated on fresh shards (flags: ⏳).

| question | answer |
|---|---|
| What is the operator? | NOT a weighted average: W_A, W_B are near-**isometric full-rank rotation-like maps** (flat singular spectra ~1.05/0.87, eff-rank 429/326, diag mass 2%, cos-to-identity .42/.32) — empirically HRR-style **role rotations**, one per slot; cos(W_A,W_B)=.37 ⏳ |
| Slot asymmetry? | slot A gets higher gain (sv ~1.0 vs ~0.87) and dominates the decode (sim-to-A .776 vs sim-to-B .547); deletion confirms causally — corrupting A crashes A-half fidelity (.78→.46) with modest bleed into B ⏳ |
| String-size axis? | YES: PC0(z_AB) ≈ **0.546·PC0(z_A) + 0.545·PC0(z_B)**, R²=.967 (held-out) — a literal size-adder on the length axis. The NORM is not it (R² negative; ‖z‖–length corr only .245) ⏳ |
| Does it stack? | YES, approximately associative: depth-3 left fold .673 ≈ right fold .657 ≈ direct-trained 3-input map .694 (mean baseline .557); graceful decay .76→.67→.61 at depths 2→3→4. Honesty note: held-out-shard pair cos is **.758**, not the .889 same-distribution figure ⏳ |
| Information deletion? | LEACE-erasing length from z_A is nearly FREE for composition (.778 vs .782 clean); SAE-k32 sparsification and PC-band truncation hurt a lot; keeping only top-64 PCs is WORSE (.654) than keeping only the tail (.708) — composition relies on the spectral tail ⏳ |
| Language-of-origin axis? | EXISTS (user hunch ✓): 6-way language ID 84% (chance 17%), ~100% binary vs English — but small & low-rank (offsets ~17% of data RMS, eff-dim 4.4), LEACE-erasable at 8.8% edit (acc 1.0→.49). Mean-shift toward English barely helps (cos .834→.849): per-sentence language structure ≠ constant offset. Composer is language-robust (fr-A+en-B: .776 vs .806 pure) ⏳ |
| Feature-level composition? | SAE features are NOT compositional atoms: only ~17% of A/B features survive into AB (union Jaccard .108, ~92 novel features per pair, shared-feature additivity R²=.38 held-out) — the sparse code is contextual/holistic, as the PolySAE critique predicts ⏳ |
| Order? (from §10 + tonight) | order is in the embedding (cos(z_AB, z_BA)=.52) and the SAME operator handles both orders (W on swapped inputs → z_BA at .881) — order-equivariant, not memorized; ordering-ID subspace test in flight |

**REPLICATION BATTERY (fresh shards 8-11): every headline above replicates** — anatomy
(effrank 419 vs 429, pc0-adder 0.566/0.530 @ R² .966), stack (pair .755, folds≈direct), features
(survival .1745 vs .1741, Jaccard .1084), deletion ordering (tail>top64 again), composition
(.8898, identical rank curve + order test), language (85.6%, effdim 4.43, LEACE same). ⏳→✓

| question (round 2) | answer |
|---|---|
| Is composition invertible? | YES — an unbinding map z_AB → (ẑ_A, ẑ_B) recovers BOTH slots at cos ~.72 (similarity baseline .49), decoding at .84/.76; recovery flat across length quartiles. Composition = approximately invertible superposition |
| Is order a low-rank binding-ID? | NO — the AB↔BA delta is ~98% of vector norm with eff-dim 465 (top-8 PCs: 11%); adding the mean-Δ does nothing (.516→.516). Order is a GLOBAL ROTATION (consistent with the role-rotation block anatomy), unlike LLM additive binding-IDs |
| Does the composer need natural pairs? | NO — on RANDOM cross-paragraph pairs: cos .8745 (vs .8898), same rank curve, decode .872 vs ceiling .958. It is a general concatenation operator, not a continuation prior |
| Can training fix dictionary canonicity? | PARTIALLY — L2-reg on the encoder creates an aligned core: PW-MCC median .095→.289 (19% of features >.5) at FVU .39→.42; dose-response confirms (l2=1e-5 → .107 ≈ baseline). Archetypal attempt #1 COLLAPSED (FVU 1.95 — implementation failure, its .975 row-match is a collapse artifact, not success); v2 with relaxed constraint in flight |
| Are hierarchical features more compositional? | YES — Matryoshka-k128 features survive concatenation better than flat-k128 (survival .236 vs .174, union Jaccard .154 vs .108, activation additivity R² **.60 vs .38**) |
| Is the language component read by the decoder? | NO — **decode-inert, decisively**: LEACE-ing language changes decoding by nothing (char-agree Δ −.004); amplifying the French component ×8 under an English flag yields 0% French. SONAR carries a language fingerprint its own decoder ignores (vestigial encoder-side code) |
| Does CE-through-decoder help the composer? | YES, same trade as the SAEs (3rd instance): decoder-CE 1.40→**0.84**, decode-SBERT .78→**.84**, while cosine DROPS .73→.70 — geometric and behavioral quality are different objectives |
| What do decoder layers read? (weight-SVD) | every layer's cross-attn K/V are near-full-rank readers (eff-rank 500–650/1024) with only 5–17% top-64-PC overlap — the decoder reads broadly and mostly OFF the principal variance directions (independent weight-space confirmation of the AtP tail finding) |
| ICA axes? | mostly NEGATIVE: FastICA non-converged on SONAR, 0 attribute-nameable axes, cross-shard axis stability .31 (> SAE's .149, < useful; random .07). The word-embedding ICA success does not transfer with the standard recipe |
| 3-slot unbinding capacity? | graceful VSA-style crosstalk: recovery .72 (2 slots) → .63/.58/.62 (3 slots) with the MIDDLE slot worst — a serial-position (primacy+recency) curve inside a single embedding |
| Is the belief simplex real geometry? | YES — a 3-VERTEX simplex model inverts held-out posteriors at L1 .303, BETTER than the full 1024-d probe (.524), both crushing a shuffled-vertex control (1.41); vertices form a genuine triangle (cos .46–.58). Caveat: the probe wins on KL (.17 vs .61) — better calibrated. Upgrades §11 claim 2 from "regressable" to "low-parameter simplex geometry" |
| Canonicity dose-response | L2 on the encoder: 1e-5 → PW-MCC .107 (≈baseline .095), 1e-4 → **.289** (19% >.5); 1e-3 + archetypal-v2 pending |
| Connective operators? | "However,"/"because"/"Meanwhile," each compose at cos ~.89 with the same high-rank signature (r256 ≈ .53) — every discourse joiner is an equally-good rotation operator |
| Unbinding capacity curve | per-slot recovery ~.72/.63/.51/.39 at depths 2/3/4/5 — graceful ~linear VSA crosstalk, serial-position dip (edges>middle) at every depth |
| Fingerprint as translation clue? (user Q) | mostly inert there too: German decodes of original vs language-LEACE'd z agree at .83 char/.93 semantic — minor lexical jitter ("Führungskräfte" vs "Die Führungskräfte"), no content/quality change |
| Feature additivity vs k | NON-monotone: k32 .58 > k256 .44 > k128 .38, Matryoshka-k128 **.60** — compositionality tracks feature COARSENESS, not raw sparsity |
| Canonicity Pareto frontier (scheduler sweep, 12 pts) | **alignment is a purchasable dial**: PW-MCC rises monotonically with encoder-L2 (3e-4→.42, 1e-3→.63, 3e-3→.74, 1e-2→.78 with 98% of features matching >.5) while FVU pays (.44→.70); knee ≈ l2 1e-3 (.63 alignment at .52 FVU). Longer training does NOT substitute (e40 ≈ e24) |
| Leaderboard structure (22 dictionaries) | k64 = all-round sweet spot (best manifold survival .147, pxc 4.07, recon≈k128); potency peaks ultra-sparse (k8: 4.56); width HURTS causal+manifold at matched k; **text-loss dictionaries dominate their base on both behavioral axes** (SBERT .77-.79 vs .735, pxc 3.3-3.5 vs 3.27) at much worse FVU |
| Canonicity: data lever (v2 corpus) | training on the new 834k SENTENCE-level corpus (data_v2, built tonight) improves seed-consistency +64% relative (PW-MCC .095→.156, row-match .149→.190) at h32768/k64 FVU .366 — data quality is a second, milder canonicity dial alongside L2 (controls: baseline & archetypal ≈ 0 on the >.5 metric, l2reg ≈ .95) |
| Composer CE-finetune refined | at mse_coef=1.0 it is a PURE win: cos 0.734→**0.744**, decoder-CE 1.39→**0.68**, decode-SBERT 0.79→**0.89** — the earlier cosine drop was under-anchoring (mse 0.1-0.3), not an inherent trade |
| Scheduler-agent ops note | an autonomous scheduler ran the fleet overnight: 64 jobs, 25-row leaderboard, self-fixed a harness bug, ran its own controls and determinism check — benchmark-driven self-improvement loops work for interp infrastructure |
| Canonicity heatmap (7 l2 × 4 k, e24, dense) | alignment saturates ≥.95 matched-fraction by l2≈2e-3 at EVERY k; the FVU tax shrinks with k — sweet spot k256+l2 1e-3: **96.7% aligned at FVU .435** (vs .21 unregularized). Canonical dictionaries are now a recipe, not a hope |
| Composer anchor curve (mse 0.1→10) | higher MSE anchor preserves more cosine at equal behavioral gain: mse10 → cos .793 (ridge .73), CE .71, SBERT .886 — anchor strength tunes where on the geometry↔behavior line you land |
| Word-swap arithmetic (user recollection) | VERIFIED: bare-word delta (z − z(cat) + z(dog)) swaps ALL occurrences (88% singles, 89% both-in-doubles, 0% exactly-one); a carrier-sentence delta swaps EXACTLY the position-matched occurrence (67% exactly-one) — "The cat chased the other cat" → "...the other dog". Word-type is position-free; word-in-role is carried by context deltas |
| Is the composition operator sparse? (user Q) | full-rank but **approximately PC-sparse**: entry kurtosis 240-441 (refs ~0.9), each output reads ~28 effective inputs, and pruning to the top-10% of PC-basis entries keeps 97% of composition quality (top-1%: 90%); raw-basis prunes worse. Structure = sparse PC-to-PC couplings (size-adder most prominent) + faint dense correction |
| **TAE-Bench v1** (new standing benchmark) | one scorecard (recon-behavioral / causal potency×coherence / manifold survival / sparsity, fixed seeds+shard, HF leaderboard). First law: recon monotone in k (SBERT .56→.90) while causal usability runs OPPOSITE (p×c 4.46@k32 → 2.65@k512; survival .13→.06) — the understand-vs-reconstruct tradeoff is the leaderboard's principal axis. Scheduler agent benchmarks every new ckpt automatically |

## 13 — NIGHT-2 SYNTHESIS: composition, canonicity, language — and the infrastructure that found them

**The composition operator, fully characterized.** "A then B" in SONAR space is a single linear
operator made of two near-isometric, full-rank ROTATIONS (one per slot, slot-A dominant) applied
to the constituents and superposed — empirically the holographic/VSA binding scheme, discovered
rather than designed. It is: order-equivariant (the same map produces "B then A" from swapped
inputs; order lives in a global rotation, NOT a low-rank additive ID — the AB↔BA delta is ~98% of
the vector with eff-dim 465); approximately associative (folds match a directly-trained 3-input
map); INVERTIBLE (an unbinding map recovers constituents at cos .72 for pairs, decaying ~linearly
to a flat ~0.4 floor by depth 6+ — with a serial-position curve, middle slots worst, at every
depth — and the early "recovery rises at extreme depth" trend was killed as a pool-size artifact
by the scheduler's own control); a GENERAL concatenation operator (works on random cross-paragraph
pairs at .874, across languages, and for every discourse connective tested at ~.89); and equipped
with a literal SIZE-ADDER (PC0_AB = .546·PC0_A + .545·PC0_B, R² .967 — while vector NORM carries
no size information). Behaviorally it improves the same way dictionaries do: fine-tuning through
the frozen decoder with a strong MSE anchor (mse≥1) is a PURE win (cos +.03, decoder-CE −50%,
decode-SBERT .79→.89).

**Canonicity, from lament to dials.** Night 1 ended at "dictionaries are seed-disjoint frames."
Night 2 turned that into engineering: (1) encoder-L2 is a strong canonicity dial — PW-MCC rises
monotonically to .78-.98 matched fractions across k∈{32..256}, paid for in FVU (knee ≈ 1e-3), with
baseline and (our collapsed) archetypal controls at ≈0; (2) data quality is a second, mild dial
(+64% relative PW-MCC from the new 834k sentence-level corpus, free of recon cost); (3) the SSAE
identifiability route works as theory predicts in direction — sparse SHIFT dictionaries on minted
paraphrase pairs are ~2.2× more seed-consistent than raw SAEs with no regularization at all, and
their atoms decode as recognizable paraphrase OPERATORS; (4) hierarchy helps compositionality
(Matryoshka features survive concatenation best, additivity R² .60). Longer training does nothing.

**Language, resolved into a precise oddity.** SONAR carries a real, low-rank (eff-dim ~4),
84%-decodable source-language fingerprint that its OWN DECODER IGNORES: LEACE-ing it changes
decoded text by nothing, amplifying it 8× cannot force a language flip, and its effect on
third-language rendering is sub-stylistic jitter (.83 char/.93 semantic agreement). Output
language is purely flag-controlled. The fingerprint is a vestigial encoder-side code — and it
does not impede composition (cross-language composition works at ~96% of monolingual quality).

**Infrastructure as a result.** Two artifacts outlived the night: TAE-Bench v1 (fixed-seed
behavioral scorecard — recon/causal/manifold/sparsity — whose 27-row leaderboard immediately
exposed its principal axis: reconstruction and causal usability are OPPOSING virtues of k, with
k64 the all-round optimum and text-loss training dominating its base on both behavioral axes) and
the autonomous SCHEDULER+CODEX loop: 94 jobs overnight, a self-fixed harness bug, self-run
controls (determinism rerun, pool-matched anchors, baseline/archetypal canonicity controls) — and
one pre-registered FATAL caught by independent code review before any GPU hour was spent, plus a
full replication battery in which EVERY headline reproduced on fresh data. The "easiest person to
trick is yourself" discipline is now executable.

Datasets released: `data_v2/` — 834k sentence-level SONAR embeddings + texts, with fr/de/zh
parallel embeddings for the first 6 chunks. Backlog of 38 designed follow-ups: EXPERIMENT_BACKLOG.md.

## 14 — QUESTION-DRIVEN ROUND: token/position vs meaning (Q3), and construction (Q5)

Reframing experiments around five explicit questions (QUESTIONS.md), with the new pre-pooled
token-state instrument (`sonar_prepool.py`, mean pooling verified cos>0.999). Verdict first:

**z is a positionally-addressed, lossy TRANSCRIPT — far more token-like than the "semantic
vector" framing suggests — pooled holographically, with meaning as its robust residue.**

| Q3 finding | evidence |
|---|---|
| Token states are lexical-identity codes | top-1 token-ID from a pre-pooled state: **75%** over a 2000-word vocab (chance .05%, shuffled-label control 6%); accuracy tracks frequency (86% → 42% by decile) |
| Position is linearly in the token states | R² **0.88** (absolute AND relative), surviving same-token and length-stratified controls |
| Position is readable from pooled z | "where is word w" probes: relative-position R² ≈ .59 vs length-prior null ≈ 0 (**~60× ratio**, 45/45 words pass 2×) — z contains per-word position addressing |
| z carries a strong exact-wording code | teacher-forced CE prefers the original wording over a length-matched paraphrase by **1.2–1.5 nats/token** (threshold .3; 96% of pairs in both directions). ce(z_A→A)=.22 vs ce(z_A→B)=1.25 |
| Order is in z, not the decoder prior | encode a scramble → decode FOLLOWS the scramble: τ-to-scramble **.93** vs τ-to-original .01, 99.2% scramble-wins — even in the word-salad stratum (.93). The language prior does not repair order (only a mild pull, τ .23, when the scramble is near-grammatical) |
| Pooling is holographic, not localist | zeroing ONE token's state exactly deletes that word only 4–6% of the time (78% no-change; the rest global perturbation); content tokens carry 1.6× the state norm of function words but norm explains damage weakly (r=.17) |
| "Topic" largely reduces to vocabulary | LEACE-ing the presence of 10 topic words drops the topic probe AUC .97→.63 (word probes → chance) — most of topic is its lexicon; a genuine abstract residue remains (.63 > .52 baseline) |

| Q5 baseline | result |
|---|---|
| Bag-of-word-vectors | fails raw (sum: 77% degenerate; mean: 1-word decodes) but geometric normalization rescues it from zero: mean+corpus-shift+RMS → recall .31 / SBERT .44 (ceiling: roundtrip .97/.99). Failure = missing structure, not unrepresentable words |
| Fold words through the composition operator | recall .17 overall BUT word ORDER nearly perfect where scored (τ .92) and only 3% degenerate — the rotation algebra transmits order faithfully while content mass attenuates with depth (recall .54 at 4 words → .08 at 12), matching the unbinding floor |

**Wave 2 — localization all fails while global readout succeeds:**

| attempt to localize the wording/order code | result |
|---|---|
| Remove the learned "wording subspace" (paraphrase-orbit residuals, length-matched, r=8..64) → re-test the CE wording margin | margin collapse: **0.000 at every rank** — same as removing the meaning subspace or random dims. The wording code is NOT low-rank; transplanting B's wording-residual onto A imports none of B's words (2.6% ≈ 2.1% control) |
| Transplant ONE token state pre-pooling (with noise + wrong-position controls) | donor word appears at the matcher's false-positive floor (single content: 0.48% vs floor 0.17%); rest preserved (.99). The rare successes ARE position-specific (wrongpos control) — addressing exists, the holographic veto dominates |
| Query z as an addressable register (position-conditioned state decoder) | works WEAKLY and mainly at the EDGES: reconstruction beats all baselines (cos .38–.42 vs shuffled-z .26) but reconstructed token-ID is 22.6% vs the 86.9% real-state ceiling (interior deciles ≈ floor; first/last 58%/49%); re-pooled reconstructions decode at SBERT .59 (ceiling .98) |

Combined wave-1+2 statement: **z faithfully CONTAINS the transcript (CE margins, scramble
fidelity) but stores it as a fully distributed/holographic code** — no low-rank wording subspace,
no per-token locality, only edge-biased weak addressability. Reading the transcript out of z is
easy for the decoder and hard for every surgical tool; this is the same read≫write asymmetry the
battery found for attributes, now established for the token stream itself.

**Q5 constructor (wave 2):** a GRU over word vectors is the first constructor that genuinely
works: test cos .655, decode recall .31 / order τ .75 / 21% degenerate — roughly doubling the
normalized-bag baseline — and its shuffled-input control dissociates cleanly (τ drops .56, recall
barely moves): the constructor SYNTHESIZES the order code from input sequence structure. Mean
and scaffold archs trail; attention failed to train. Content recall (.31 vs .96 ceiling) is the
remaining bottleneck — the same holographic content-mass limit seen in unbinding and folding.
Local-frame splice: local PCA coordinates reach SBERT .8 at r*=192 vs global r*=256.

**Wave 3 (closing the question set):** (i) the rank-1 DAS tense coordinate does NOT generalize:
wild-corpus flip rate 0.0% — indistinguishable from a random direction (caveat: this rerun's
templated control also underperformed the original protocol, so the zero may be partly
implementation; the "global dial" reading is dead either way). (ii) Local-frame overlap decays
with distance at scale ~0.34×RMS (halfway "curvature radius" ≈ 1.0×RMS); interpolation failure
(u≈0.25) falls at 0.72× the decay scale — directionally consistent retrodiction, weak fit
(R²≈.33). (iii) The spectral tail is CONTENT down to r≈128 (donor-tail swaps import donor
content; leak gap .47 at r32 → .11 at r128) and SCAFFOLDING past r≈256 (donor ≈ matched noise):
content occupies ~the first 256 spectral dims, the rest is generic texture — refining both the
AtP "decoder reads the tail" result and the deletion experiments.

Implications: (a) the language fingerprint, wording code, and position addressing are all parts
of one picture — z is closer to a compressed transcript with a meaning halo than to an abstract
proposition; (b) construction-from-scratch needs an ORDER/length scaffold plus content mass
management, not better word vectors — exactly what wave-2 (pooling inversion, token-state
transplant, structured constructors) targets.

## 15 — TAE-BENCH v2: the first "full understanding" certificate (and where it fails)

Design: TAEBENCH_V2_DESIGN.md — understanding = 7 conjunctive criteria (PREDICT/DECOMPOSE/EDIT/
CONSTRUCT/MECHANISM/TRANSFER/CALIBRATE), 21 protocols with baselines, ceilings, pass thresholds,
held-out splits, and anti-gaming constraints. First run (k128 dictionary as the benched account):

| aspect | result | adjudication |
|---|---|---|
| E1 edit precision + collateral | gradient surgery **0.73** mean exact-edit@collateral≤.1, incl. HELD-OUT edit types (.82/.76), collateral ≈ 0 — while spec-only methods score .07–.10 and the rewrite ceiling .20 (paraphrasing trips the collateral gate) | SPLIT on review: **E1-controllability PASSES** (any specified edit is reachable in z with zero collateral when the target is known — the rewrite pipeline cannot do this) but the optimizer descends on the target's own tokens, so this is not spec-only editing. **E1-calculus FAILS** (~.1): we cannot yet edit without knowing the answer |
| E2 erasure completeness | regeneration-proof removal at content≥.9: LEACE **.22**, iterated **.16** (pass ≥.8) | FAIL — erasure remains mostly regenerable; consistent with §10's erasure≠removal, now a standing score |
| D1 lossless reassembly | SAE + dark-matter predictor leaves **0.392 nats/token** unexplained (normalized .90 toward the 4.10 floor) | FAIL — the decomposition account is far from lossless in the decoder's own units; this is the honest DECOMPOSE gap number |
| P2 perturbation forecasting | within-family skill real (acc .73 vs .67 base; dose-MAE beats baseline in all 6 families) but unseen-family accuracy = base rate exactly (degenerate small held-out sets) | FAIL, correctly — the strict ex-ante criterion refused within-distribution skill; eval-set redesign needed (larger, class-balanced held-out families) |
| M1 pooling accounting | predicting ablation outcomes from token norms+positions: balanced acc **.47** (3-class) | FAIL — we can describe pooling but not yet predict its counterfactuals |
| M3 transcoder swap | swapping decoder-FFN-L12 for its transcoder costs only **+0.023 nats/token** (text sim .948) — but an FVU-matched RANDOM control costs +0.028 (.947) | UNINFORMATIVE as run — the decoder is robust to any FFN approximation at L12, so the test cannot discriminate mechanism-understanding from generic compression; protocol needs a harder layer or per-feature swaps. MECHANISM remains unproven |
| C1/C2 construction vs spec | rows written (GRU best, consistent with §14: ~⅓ of ceiling composite) | FAIL vs encoder-quality pass bar — as expected; the standing CONSTRUCT score |

**Certificate verdict: we do not yet fully understand SONAR — and now that statement has
numbers.** The certified positives: edit-controllability (E1-a), the transcript characterization
(§14), composition algebra (§12), canonicity dials (§12). The measured gaps: 0.39 nats/token of
unexplained reassembly, spec-only editing at ~10%, erasure regeneration, mechanism prediction at
chance-adjacent levels, construction at ⅓ of ceiling. Leaderboard plumbing note: v2 rows uploaded
with null scores in the shared schema (per-aspect JSONs are authoritative) — schema fix queued.

## 16 — Night-4: a generative model of z, and how it was earned

Everything above measured *what* SONAR's `z` does. This section asks *what z is* — a single
generative equation for the 1024-d vector, every term tied to an encoder computation, each tagged
KNOWN vs ASSUMED. It is the night's capstone, and it earned itself by a multi-round skeptic loop
that killed five over-claims along the way (§16.7). The honest abstract, five sentences:

> **z is a position-clocked, mean-pooled superposition of role-rotated token codes plus a high-rank
> "dark-matter" context field — *not* the abstract semantic vector the framing suggests, and *not*
> one tidy rotation.** Composition ("A then B") is a single *learned, large, reflective* near-orthogonal
> role rotation applied per slot and superposed — a discovered VSA/holographic binding scheme, and a
> **different object** from the near-identity sinusoidal position clock that survives the encoder
> stack (we tried to unify them; a behavioral arbiter refuted it: the fitted clock is a *worse*
> composer than doing nothing). Serial order, number, and tense are genuinely *in* z but stored
> **non-linearly and curved** — un-readable by a linear probe, un-steerable by a linear edit, and (we
> checked exhaustively) *genuinely curved*: not the position frame, not the role frame, and not any
> learned global orthogonal chart linearizes order — the geometry is irreducibly bilinear, only the
> MLP/decoder reads it (§16.5, F-A ANSWERED). The remaining ~80% of dropped content is a high-rank
> (eff-rank ≈ 25), token-identity-bound field that piles onto rare content "hub" tokens (proper nouns,
> specific numbers) and is **irreducible** — un-constructible and un-erasable (one object behind both
> ceilings; spec-only editing *also* fails, and we decomposed *why*: it is an information deficit (the
> proxy embedding is not the token's real in-context state), not a rotation/binding mismatch — see
> §16.2). The safety payload
> is a single law — **readable ≠ causally-used** — plus a
> three-bin attribute taxonomy telling a `z`-auditor exactly what they may detect, what they may
> steer, and what is truly lost.

Artifacts (HF `nickypro/sonar-sae`): `tae_hrr_model_{a,b,c}.pt` + `tae_hrr_fit_*.json`,
`tae_hrr_crosstalk*.json`, `tae_comp_spanacct_moon.json` (T1-E3), `tae_unification_test.json`,
`tae_layer_rsurvival*.json` (T2-E14), `tae_order_confound.json` (T3-A), `tae_ccorr_attn.json`
(T3-B), `tae_attr_inert.json` (T3-C), `tae_hubfield.json` (T3-G), `tae_nonlinear_store.json`
(T3-H), `tae_prior_filling_k128_btk.json` (T1-E5), `tae_erase_likelihood_*.json` (T2-E13),
`tae_construct_specvalue.json` (T3-E), `tae_curved_store.json` (F-A, ANSWERED),
`tae_edit_mechanism.json` (F-B edit 2×2×2), `tae_edit_transport_b_night4.json` +
`tae_transport_correction.json` (edit transport, learned & low-rank), `tae_layer_rsurvival_natural.json`
(T2-E14 natural replication). Forecast register: `FORECASTS_NIGHT3.md` (Brier **.1878 over 57**
pre-registered rows, vs .25 coin). Running synthesis: `NIGHT4_STATE.md`, `THEORY_SYNTHESIS_N4.md`.

### 16.1 — The generative model

> **z = (1/N) · Σ_{i=1..N}  R_pos^{p_i} · [ R_role^{r(i)} · E(w_i) + c_i ]**

A sentence's `z` is the **mean over its token positions** of each token's encoder state, where each
state is a lexical code `E(w_i)` rotated into its in-context **role** by `R_role`, plus a context
correction `c_i`, with the whole thing carried by the **position clock** `R_pos^{p_i}`. Every term
has a measured mechanistic identity:

| term | mechanistic identity | status | the receipt |
|---|---|---|---|
| **(1/N)·Σ** | literal mean pooling over encoder output states | **KNOWN** | pooling identity γ = 1.00 ± 0.00 exact; `cos_pool_vs_pipeline = 1.0`, `partition_resid = 0` (T1-E3). The apparent "size-adder"/"gain" is the length-norm law, not a free renormalization: composed-pair centered norm lands ON the natural norm-vs-length curve (z_AB/z_solo ratio **1.20–1.25**, not the ≈1.9 the renorm/span-amplification branches predicted — both REFUTED). |
| **E(w_i)** | the token's lexical code | **KNOWN** | a pre-pooled token state recovers its token-ID at **75%** over a 2000-word vocab (chance 0.05%, shuffled-label control 6%); the zero-glue white-box bind-sum `Σ R^p·E(w_p)` reaches the GRU constructor's natural-sentence anchor (recall .314 vs GRU .296), extrapolating UP at 1.5× length (ratio 1.27) — the algebra carries it, the manifold-snap adds only +.016 (T2-E9). |
| **R_pos^{p_i}** — the **POSITION CLOCK** | a near-identity sinusoidal rotation (median eigen-angle ~0.007 rad), injected as the L0 positional encoding and *preserved* — not re-assembled — through all 24 encoder layers, sharpening monotonically (cos-to-sinusoidal .77 → .94 peak L17–18 → .92 final, **no mid-stack jump**, angle-CV .32). It carries serial position **non-linearly**; the decoder unbinds it. | **KNOWN** that it exists, survives, and governs serial-position phenomenology | its eigen-angles alone (one calibration constant) reproduce the unbinding depth-decay **.72/.63/.51/.39** and the serial-position U-curve param-free (depth-5 .428 predicted vs .39 measured); the semigroup-violating control fails the *same* gate (.107) — diagnostic, not a free fit (T1-E6). Sinusoidal-init R drifted only **.0054 rad** across the entire fit. |
| **R_role^{r(i)}** — the **ROLE / COMPOSITION rotation** | a **LARGE, near-orthogonal, reflective** rotation (median eigen-angle ~0.87–0.97 rad, ≥32 eigenvalues exactly at π = binary reflections, cos-to-identity .38) applied at composition time, re-addressing a span's content into its in-context slot. A *learned downstream* transform — **NOT** the position clock, **NOT** the closed-form sinusoidal map, **NOT** a power of `R_pos` (semigroup dead by p=8). | **KNOWN** as a distinct large reflective object that carries composition; **ASSUMED** that its π-reflections implement a *role/span* partition specifically | carries B-solo token states into their in-AB states at cos **.91/.87/.83** across len(A) buckets, while the pre-committed fixed sinusoidal map is the *worst* column (.65/.51/.39, below even identity); the F2 recontextualization residual is r = **.07–.15 ≪ .30** (T1-E3). |
| **c_i** — the **CONTEXT FIELD** (dark matter) | the contextual meaning a token acquires from the encoder's all-to-all attention beyond its bound lexical code. **HIGH-RANK** (eff-rank ≈ 25.5/sentence), **token-identity-bound** (per-sentence mean holds only 9% of its energy; a rank-1 shared field makes CE *worse*, −5%), **concentrated on content-hub tokens** (out-degree q4/q1 CE-excess 3.75×; content 3.07 vs stop 1.18), and **~80% truly lost** (only .199 prior-recoverable). | **KNOWN**: high-rank, token-bound, hub-located, ~80% irreducible; **ASSUMED-then-REFUTED**: that it is a *constructible* low-rank hub self-field (T3-G killed this — see §16.2) | "**Sergeant Bluff**" carries c_norm 2.76/2.00 on its rare proper-noun word-pieces while "we"/"are" sit at .37/.28 — the per-token spikiness IS the eff-rank-25 story made visible (T3-B/T3-G). |

**Two rotations, not one — the unification is REFUTED.** The night's last load-bearing number plugged
the data-fitted `R` (`tae_hrr_model_a.pt`) into T1-E3's behavioral state-mapping arbiter (carry B's
solo token states into their in-AB states, score per-token cos). The verdict: `UNIFICATION_CONFIRMED:
False`, 0/2 buckets confirm.

| map applied (B-solo → B-in-AB) | mean state-cos | reading |
|---|---:|---|
| **learned_orth (the composition map)** | **0.887** | the real composition operator |
| identity (no map) | 0.827 | doing nothing already beats the fitted clock |
| **fitted R (the position clock)** | **0.758** | BELOW identity — the clock does *negative* work as a composer |
| closed-form sinusoidal | 0.577 | worst column |

`mean_fitted_minus_learnedorth = −0.130`. A blunt Frobenius cosine had made the two matrices *look*
similar (`matrix_cos = 0.864`, eigen-angle Pearson .968), but asked to actually *do* composition, the
clock fails. **They are two geometric objects** — a near-identity clock (median angle .007 rad) and a
large reflective role map (median .87 rad, top-32 eigen-angles = π). The two-way unification that
*does* survive is `{fitted R = deep-layer position rotation = sinusoidal L0 PE}` (cos .955 Frobenius
/ .982 spectral). The three-way one (clock = composition map) is dead. Keep two rotations.

### 16.2 — What each part is, with the evidence and a decoded example

**Position/order is stored NON-LINEARLY (and is genuinely curved).** A non-linear MLP probe recovers
token order from the *real* z at Kendall τ = **0.546** while a capacity-matched linear readout gets
only **0.249** (T3-H Part 1; replicated on a fresh shard at τ = 0.587, shuffled-label control −0.014,
capacity-matched linear 0.290 — nonlinearity is essential, not probe over-fitting). This settles a
saga: order is *in* z (not invented by the decoder — see the frame table, F4), but it lives there in a
form a linear probe barely sees. The decoder reads it the *same* way for predictable and unpredictable
orderings; scrambling z's position tags halves recall (.96 → .52) even when the LM prior cannot guess
the order:

> `"Here is a sample News Feed on Agricultural Trends…"` → permute z's position tags →
> `"News trend in 2012 News trend in 2012 News trend…"` (order-τ **0.0**) — when the prior can't guess
> the order, scrambling the tags collapses it completely. The tags, not the prior, carry order.

**But the store is NOT in the position-rotated frame.** We predicted that un-rotating by `R_pos^{−p}`
would expose order/number/tense to a linear probe. It **failed** — cross-position AUC slightly *drops*
after un-rotating, for every attribute, in both languages (tense 0.736 → 0.724), and
Spearman(axis-rotation, 1−steerability) = **−0.4** (predicted +0.7, wrong sign). The non-linear store
is **genuinely curved**, not a linear axis hiding behind the clock. The night-4 closing batch then
*exhausted* the linear-frame search and settled F-A: a pre-registered fixed-bag × probe-class ladder
(`tae_curved_store.json`, anchors linear .264 / MLP .57) shows the **role frame** R_role⁻¹ buys only
τ .274 (+.031), a **learned global orthogonal chart** — the strict upper bound over *any* rotation —
scores τ **.203, *below* the linear anchor** (so no global linear frame exists at all), and
**content-conditioning** buys τ .270 (+.02, not a readable permutation of a known bag). No rotation, no
learned chart, no content-conditioning flattens order: the geometry is irreducibly **bilinear /
pairwise-comparator** (only the MLP/decoder reads it). F-A is **ANSWERED** (§16.5) — the only residual
question is the precise comparator form, not whether a linear frame exists.

**Composition = a learned near-orthogonal role rotation + superposition.** The arbiter table above plus
the discriminator (residual ∝ SBERT(A,B), r = .07–.15 ≪ F2's .30 bar) and the gain read-off (≈1.0)
together fix "A then B" as one large reflective rotation per slot, applied and mean-pooled (γ=1.00).
The clean illustration is the longest bucket, where re-addressing has the most work to do:

> 35–50 tokens: A=`"The National Cattlemen's Beef Association reports that 70%…"` B=`"This trend is
> driven by consumer demand…"` — the fixed sinusoidal map scores cos **.435** mapping B's solo states
> into the pair, the **learned** map **.871**. The closed-form rotation falls apart exactly where a
> genuine re-addressing operator must work hardest.

**The dark matter c_i is high-rank, hub-located, and IRREDUCIBLE — one object, three ceilings.** The
truly-lost content concentrates on rare content "hub" tokens (out-degree q4/q1 CE-excess **3.75×**,
content 3.07 ≫ stop 1.18), and the lost mass on a hub is a context re-weighting of that hub token's
*own* embedding (self-embed cos +0.070 on hubs vs −0.110 on peripherals). But it is **not** a
constructible low-rank field: a budget-matched `ĉ_hubfield` built from the hubs' attention rows closes
only **5.48%** of the gap (*worse* than the uniform global GRU's 19.4%); eff-rank{c_i} is length-driven,
not #hubs-driven (partial r controlling length = −0.17, against a predicted +1 slope); and adding one
per-sentence field vector to every token makes CE *worse* by .123 nats (FVU 1.07). The same irreducible
field caps **construction** (the white-box can bind-sum word codes but cannot synthesize c_i from a
spec) and **erasure** (regeneration-proof removal .15 ≪ baseline .22) — these two share the one
mechanism. **Editing** also fails badly (spec-only edits score .037 ≪ the .1 floor), but we tested
whether the *same* hub field explains it and found only a weak, non-significant hint: success vs the
edited token's out-degree is r=−0.08 (p=.16), collateral is flat across out-degree quartiles, and
peripheral tokens edit ~2.3× better than hubs (q1 .093 vs q4 .040) yet still far below usable. So
editing failure is *broad* — the spec-only algebra simply does not land anywhere — and is **not**
cleanly hub-localized; we do not fold it into the single-object story. (Honest null, recorded:
`tae_edit_by_outdegree.json`.)

**The editing failure, decomposed (night-4 closing batch): it is an information deficit, not a binding
mismatch.** A pre-registered 2×2×2 — {renorm on/off} × {R = sinusoidal-clock / fitted R_role} × {E =
proxy-embedding / true-encoder-oracle} — on a single 300-edit set isolates *which* operation breaks the
single-token delta (`tae_edit_mechanism.json`). The lead hypothesis was role-binding: that the edit
applies the near-identity sinusoidal clock to the inserted token while its survivors carry the large
reflective R_role, and renorm then smears the slot error globally. **That hypothesis is REFUTED on all
three of its predictions.** Swapping in R_role *hurts* (main effect −.02), turning renorm off is null
(+.0016), and their interaction is sub-additive (−.003); the decisive {R_role, renorm-off, proxy-E}
rescue cell scores **0.0**, far below the .3 usable bar — **NOT rescued**. The only lever that moves the
needle is the **true-encoder oracle** E (main effect **+.877**: lifting the *real* in-context gold state
takes exact-edit from .037 to .897), and that oracle is non-deployable (it requires encoding the answer
— flagged C4). The conclusion is mechanistically sharp: spec-only editing fails because the proxy
embedding is **not** the real contextual state of the swapped token — an information deficit — **not**
because the token is bound in the wrong (clock vs role) rotation. Two further edit angles confirm the
same wall: a *learned* stage-2 transport `g(z,spec)→Δz` (`tae_edit_transport_b_night4.json`) scores
**0.0 on held-out edit types** (it learns only `register`, already the easy type), and a low-rank
universal correction ΔW (`tae_transport_correction.json`) is *not* low-rank (var-explained still rising
at rank 32, .29) and does **not transfer** to held-out types (.025→0.0). Editing is its own irreducible
object — capped by the missing real-state information, parallel to (but distinct from) the c_i hub
field.

> Out-degree made concrete: `"…were Grace F."` — `▁Grace` has out-degree 2.48 and CE-excess **12.0
> nats** while `▁of`/`▁this` sit at 0.15/0.03. The lost mass piles onto the named-entity hub. And it
> is *content* out-degree, not raw attention: in `"…the friend zone…"` the final `?` has the highest
> out-degree (3.67) but ~0 CE-excess.

### 16.3 — The frame-verdict table

Four prior frames were arbitrated against the landed numbers. **The honest synthesis is NOT "F1 wins."**
Many apparent F1 wins are **frame-nonspecific** — they follow from superposition + mean-pooling + a
length code, which F1 *and* F2 both assert; F1 only graduates where its *unique* claims were tested
and survived.

| frame | verdict | what SURVIVED | what DIED | decisive evidence |
|---|---|---|---|---|
| **F1 — VSA / fixed-rotation algebra** | **SPLIT: core survives, unique strong claims refined.** | 1/N pooling exact (γ=1.00); the position clock's **param-free U-curve / depth-decay** (the one clean F1-unique win, gated by the failing semigroup control); white-box bind-sum reaches the GRU (algebra extrapolates UP, ratio 1.27); composition is a near-orthogonal rotation+superposition (F2-residual < .30). | one fixed *global* rotation carrying everything (TWO rotations; unification 0/2, fitted R .758 < identity .827 as a composer); closed-form sinusoidal as the composition map (cos .58, worst column); the semigroup (R_p ⊥ R_1^p by p=8); linearly-readable order (real-z linear τ only .25). | `tae_unification_test.json`; T1-E3 arbiter; T1-E6 ρ; T3-H Part 1. |
| **F2 — contextual-kernel / recontextualization** | **REFUTED as a mechanism; survives only as the default where nothing else predicts.** | wins the most decisive *edit* cell by DEFAULT (reorder/word-swap fail for everyone, ≈.00); "z is genuinely contextual" is true at the c_i level. | its forced number — residual ∝ SBERT(A,B) ≥ .30 — is REFUTED at .07–.15; 1.5× construction-collapse REFUTED (extrapolates up); "the manifold is the magic" REFUTED (retraction worth +.016). | T1-E3 discriminator; T2-E9 white-box. |
| **F3 — rate-distortion / capacity** | **CONFIRMED on its pre-committed core; no free-parameter laxity.** | wording margin tracks **prior surprisal** (ρ=.318, sign survives partialling — the FORECASTS-clean number, NOT the tautological shuffled ρ=.94); the **single-C collapse FAILS** *as pre-registered* (joint C 6.9 vs folding 11.7 vs GRU 2.1, separations ≫25%; folding residuals signed = per-fold λ<1 compounding); V_eff measured (45.4 all-token / 87.4 content). | the strong "everything collapses to one capacity constant" reading — and that failure was itself the F3-correct, pre-registered outcome. | T2-E12; T1-E7 + single-C re-run. |
| **F4 — Bayes-decoder (prior fills the cheap tokens)** | **READ-side survives; WRITE-side / prior-filling died.** | the prior is real and operationally defined (shuffled-z CE 3.8 nats); the **inert-by-prior** bin is real and *perfectly predicted* (language/register/voice readable-but-inert); attractor basins are **token-embedding modes, not prior modes** (95% embed-match). | the prior-filling split — only **.199** of dark matter is prior-recoverable (forecast .55) ⇒ dark matter is ~80% **truly-lost content**, not decoder-fillable; likelihood-stopped erasure no better than probe-death (.15 vs .22). | T1-E5 (.199); T2-E13; F-6. |

**Frame-nonspecific wins we did NOT over-credit.** The 1/√N unbinding law, surface-order encoding,
extensive-attribute adders, and the length code all follow from superposition + pooling + a length
code — which F1 *and* F2 both predict. We score these as *shared* wins, not F1 graduations. F1's only
genuinely *unique*, falsified-against-control prediction is the position clock's param-free U-curve and
depth-decay (T1-E6); the role rotation's behavioral signature is solid but its *role* semantics is an
inference (§16.6).

### 16.4 — The safety / auditing payload

The governing law a SONAR z-auditor must internalize: **READABLE ≠ CAUSALLY-USED.** A linear probe
detecting an attribute does **not** license a claim that editing that attribute will change the
decode. Number is linearly readable (AUC .64) and translation-invariant (MT-survival .95), yet
mean-shifting its readable axis barely moves the decode (decode-shift .08). The trustworthy causal
signal is `decode_shift`, not `readability` — an auditor should report the **triplet (readability,
MT-survival, decode_shift)** and may only claim *control* where decode_shift is high.

**The attribute bins** (refined — the un-steerable attributes split by *why* they resist steering,
which changes the auditing verb):

| bin | attributes | linearly readable? | linearly steerable? | auditing status |
|---|---|---|---|---|
| **INERT-by-prior** | language, register, voice/wording | yes | **no** (the prior owns them; z barely encodes them) | detectable as a *correlational shadow* only; do not treat as represented or controllable |
| **ACTIVE** | sentiment, named-entity | yes | **yes** (linear code; mean-shift moves decode — sentiment decode-shift .83/.93) | **auditable AND controllable** with linear edits — the only bin you can do both on |
| **CURVED-ENCODED** | **order, number** | **no** (depth-2-MLP-only: order τ .25→.55; the gain is an *interaction* term, not a hidden frame) | **no** | **audit-only, and even detection needs a non-linear probe**: present + truth-conditional but with no linear axis at all; no rotation linearizes it (R^{−p} *and* the role frame fail) |
| **LINEAR-but-INERT** | **tense** | **yes** (clean linear axis, AUC **.998**, ~0 non-linear gain) | **no** (decode ignores it; steer .17 — same failure mode as INERT-by-prior, *not* a curved store) | **auditable, and a linear axis EXISTS** — but the decoder is causally deaf to it; flipping the axis does not flip the decode |

> Skeptic correction (theory round 4): the earlier "non-linear store {order, number, tense}" lumped two
> mechanisms. Tense is **linearly readable** (AUC .998) yet causally inert — its un-steerability is the
> decoder ignoring a clean axis (like register), NOT a curved encoding. Only **order/number** are
> genuinely curved (the readability gain is depth-2-MLP-specific). Same observable (un-steerable),
> different cause, different auditing verb — so the safety advice differs: for tense a linear detector
> works; for order/number you need a non-linear probe even to detect, and gradient-in-z to touch.

The most safety-relevant fact: the **curved-encoded + linear-but-inert bins hold exactly the truth-conditional content
one would most want to control** — quantities, temporal claims, sequence — and it is precisely the
**un-erasable, un-steerable, audit-only** class. You can DETECT a quantity/temporal/ordering claim is
present; you cannot RELIABLY remove or flip it with a linear edit. The R^{−p} un-rotation failure makes
this *stronger*: there is no cheap rotation that linearizes the store, so "must go through gradient-in-z,
not a probe axis" is load-bearing, not a hedge.

> `number` ×8 amplification: `"replaced the 16 Regional Authorities"` → `"15"` → `"151"` — the count
> is truth-conditionally in z (MT-survival .95) but a mean-shift along its readable axis barely moves
> it (decode-shift .08); the non-linear probe gains +.07/+.12 AUC over linear — the M-store signature.
>
> Erasure regenerates rather than removes: `"over 76,000 acres"` → erase-numbers → `"covering 16
> million acres"`, and `"SoftBank's Market Capitalization: $100 billion+"` → `"… 10 billion+"`. z wants
> *a* number there even when the original is likelihood-dead. You can detect "a quantity is present";
> you cannot reliably null it.

**What is recoverable vs truly lost (for provenance / completeness audits).** ~80% of the dark-matter
gap is **truly-lost content** (only .199 prior-recoverable), concentrated on high-information content
hubs — rare proper nouns, entities, specific numbers. An auditor **cannot** assume "the decoder will
fill in what z dropped"; it mostly will not, and what it drops is exactly the high-info specifics. The
lost mass is high-rank and token-identity-bound (eff-rank ≈ 25.5) — there is no low-rank "context
vector" to subtract and recover it. Reconstruction is capped by **spec-information-density**, not
constructor cleverness (spec-axis lift +.168 vs architecture +.036, T3-E): for safety-relevant
reconstruction from a partial spec, the limiting factor is whether the spec carries the hub field.

### 16.5 — Honest open frontiers

| # | open question | why it is still open |
|---|---|---|
| **F-A (ANSWERED — genuinely curved, irreducibly bilinear)** | **What generates the curved non-linear store?** RESOLVED: order is a **genuinely curved / irreducible** function of z — *no* linear chart linearizes it. A pre-registered fixed-bag × probe-class ladder (`tae_curved_store.json`) crossed three arms against the MLP anchor (τ .57) and the linear anchor (τ .264): the **role frame** R_role⁻¹ buys only τ .274 (+.031), a **learned global orthogonal chart** — the strict upper bound over *any* rotation, the (b)-killer — scores τ **.203, *below* the linear anchor** (no global linear frame exists at all), and **content-conditioning** buys τ .270 (+.02, so order is not a readable permutation of a known bag either). The earlier `R^{−p}` and axis-rotation refutations are thus completed: there is no rotation, no learned chart, and no content-conditioning that flattens order. The geometry is the irreducibly **bilinear / pairwise-comparator** form (theory-r4) — only the depth-2 MLP / the decoder reads it. *Mechanism named; the only remaining gap is the precise comparator form (single bilinear vs deeper), not whether a linear frame exists — it does not.* |
| **F-B** | **What is the c_i generator?** We know c is high-rank, hub-located, ~80% lost, identity-bound — but T3-G showed it is *not* a constructible low-rank hub self-field. Its generator is the encoder's full all-to-all attention computation; it is irreducible by construction, so the right next move is *characterization* (which edits its rank bounds), not another low-rank repair attempt. |
| **F-C** | **Is R_role specifically a role/span rotation?** Its behavioral signature (large, reflective, near-orthogonal, distinct from the clock, carries composition at .89) is solid, but the *role* semantics is an inference from its π-reflections; the eigen-role probe (project token states onto W_B's π-eigen-subspace, test first-span vs second-span separation) has not run. |
| **F-D (orphan)** | **The edge-asymmetry remainder.** ρ explains the SIGN of the 58-vs-49 first/last addressability edge but only ~1/3 its magnitude; the c-profile meant to carry the rest is FLAT at prepool (ratio 0.99). The remaining two-thirds has no home — possibly a decoder-side BOS/EOS effect (an M-decode edge, parallel to the order finding). |
| **F-E** | **Crosstalk frays at depth.** The param-free ρ(Δ) law holds in shape and to ~20% through depth-4 but over-predicts retention at depth-5 (rel-err .35, n=49) — small-n plus genuine extra crosstalk; the law is qualitatively robust, quantitatively loose at the deepest rung. |

### 16.6 — How it was earned: the self-corrections

The reason the surviving model is trustworthy is that a multi-round skeptic loop *killed* its most
attractive over-claims before they reached this page. Each collapse failed on a specific number, and
each rejection sharpened the picture. This is the methodological point for the safety community:
**agent-generated mechanistic claims need an adversarial, pre-registered, control-gated loop, or the
prettiest wrong story wins.**

- **Fixed-sinusoidal composition (killed).** The hope that "A then B" *is* the closed-form sinusoidal
  position rotation died on the state-mapping arbiter: cos .65/.51/.39, the *worst* column, below
  identity. Composition is a *learned* near-orthogonal map.
- **One-rotation unification (killed).** The follow-on hope that the composition map equals the
  position clock died on the behavioral arbiter: the fitted clock scores .758 as a composer, *below*
  identity's .827 — a Frobenius-cosine look-alike that fails the actual job. **Two rotations.**
- **Order as a decoder-prior artifact (killed, twice).** An early reframing claimed order was imposed
  by the autoregressive decoder, not stored in z. The order-confound (T3-A: gap constant across
  prior-guessable strata; tag-scramble craters both) and then the real-z non-linear probe (τ.546 ≫
  linear .25, replicated) both refute it: order is **stored**, non-linearly. And the symmetric
  over-correction — "it's all in z, linearly" — also died (linear τ only .25).
- **Mean-field dark matter (killed).** The tempting "c_i is one rank-1 shared context vector" failed
  the ≥60% test by every variant; adding the sentence-mean field makes CE *worse* (−5%); eff-rank is
  25.5, not the predicted 2–4. The dark matter is the spikes, and the spikes are token identities.
- **Low-rank hub repair (killed).** Even after localizing the lost mass on content hubs, the hope of a
  *constructible* low-rank hub self-field died: `ĉ_hubfield` closes 5.48% vs the global GRU's 19.4%,
  and the "eff-rank = #hubs" smoking gun was a length illusion (partial r −0.17). Irreducible.

Two methodological habits did the work and are worth stating plainly. **Frame-nonspecific discounting**:
we refused to credit F1 for wins (1/√N, adders, surface-order) that F2 predicts equally — F1 graduates
only on its one control-gated unique prediction. **Independent code review before GPU spend**: a prior
audit pass caught a FATAL leakage bug and forced held-out / surface-baseline controls that reversed one
headline and strengthened two; the night-4 batch carried that discipline forward (every frontier claim
flagged ASSUMED/PENDING until its JSON landed, then re-scored). The Brier of **.1878 over 57**
pre-registered forecasts — beating the .25 coin while *losing two* batches to it (.272 mid-night, and
.2265 on the night-4 closing batch, every miss an over-trusted *mechanism* call against which no pilot
datum existed: the edit role-binding hypothesis scored 0.0) — is the calibration receipt: the model is
good not because the forecasters were always right, but because the loop reliably told us when they were
wrong. The closing batch is exemplary — the *phenomenon* calls landed (genuinely-curved store, no-rescue
editing, natural-replication, no-transfer correction) while the proposed *mechanism* (R_role rescues
editing) was cleanly refuted and replaced by the right one (proxy-E information deficit).

## 17 — Day-5: the composition map characterized, editing diagnosed, the model stress-tested

Night-4 left §16 with four loose ends the model *named* but did not *solve*: it asserted composition
is a "large reflective near-orthogonal role rotation" without writing the operator down; it asserted
the dark-matter `c_i` is an "attention-derived hub field" without saying where in the stack it is born
or how much attention actually explains; it left editing stranded at "deployable methods fail at .037,
only a non-deployable oracle reaches .897"; and it ran almost entirely in English. Day-5 threw the
fleet at all four. The headline: the composition operator is now written down in closed form (92% of
it is two pure orthogonal role rotations plus a length adder), it is **20× sparse** in its own Schur
eigenbasis, the `c_i` field is a **late-stack, attention-mixed** object that surprisal predicts, the
generative model `z = (1/N)·Σ R_pos^p·[R_role·E(w)+c_i]` **transfers across eleven languages**, and —
the one genuinely new capability — editing is **deployably rescued, but barely**: a partial-re-encode
method crosses the usability line at exact-edit@collateral **.325**, closing 37% of the proxy→oracle
gap at 43% of the encode cost. Artifacts (HF `nickypro/sonar-sae`): `tae_comp_math.json`,
`tae_comp_sparse.json`, `tae_true_e_predict.json`, `tae_ci_generator.json`, `tae_multilingual.json`
+ `tae_multilingual_stress.json`.

### 17.1 — The composition map, written down: two reflective role rotations + a length adder

The composition operator is the learned block-linear map `z_AB = W_A·z_A + W_B·z_B + b` that takes two
sentence vectors to the vector of "A then B" (fit on 20k pairs, held-out cos **.889**). Night-4 only
knew it was "large and reflective". Day-5 decomposes each slot block by polar decomposition `W = Q·S`
(`tae_comp_math.json`) and the answer is clean:

| property | W_A (slot-1) | W_B (slot-2) | reading |
|---|---:|---:|---|
| median eigen-angle | **0.141 rad** | **0.150 rad** | a real rotation, not near-identity (the position clock is ~.007) |
| stretch `‖S−I‖_rel` | **0.502** | 0.513 | moderate scaling, *not* the bulk of the action |
| orthogonal factor `Q` det | **−1.0** | −1.0 | **Q is a reflection** in both slots (improper orthogonal) |
| rotation dominant `‖W−Q‖<‖S−I‖` | True | (False, marginal) | the map is **carried by its orthogonal factor**, the stretch is a correction |
| eigenvalues at angle ≈ π | **11** | **13** | binary-reflection axes — far fewer than the "≥32" §16 implied |

So each slot map is a **scaled near-isometric reflective rotation**: reflect-rotate the slot's content
into its role subspace, scale modestly, add. The two blocks **approximately share an eigenbasis** (relative
commutator `[Q_A,Q_B]` = **.028**, i.e. they nearly commute) — but they are **not** a shared base rotation
raised to different powers: `Q_B = Q_A^k` is refuted with residual **.984** at the best-fit `k=−0.18`.
**HRR-shared-base is REFUTED; there are two genuinely distinct role rotations**, exactly as §12 first
suspected. On the size axis the map is a clean **adder**: the lead PC of the composed pair is
`PC0(z_AB) ≈ 0.549·PC0(z_A) + 0.545·PC0(z_B)` at **R² .967** (the norm version is weaker, R² .617),
confirming the §16 "length-norm law, not free renorm" reading from the operator side.

The **best low-parameter closed form** is a *scaled-orthogonal* factorization — replace each `W` by
`c·Q` (one orthogonal matrix + one scalar per slot) — which recovers cos **.813** vs the .889 full map,
i.e. **~92% of the composition is captured by pure orthogonal role factors plus a length scalar**
(`c_A=.600`, `c_B=.584`). Dropping even the scalar (pure orthogonal, no scale) barely moves it (.815).
A fixed-basis 128-plane Givens approximation reaches only .795, so the rotation is genuinely *dense* in
plane-count even though it is *low-information* as an operator. **One distinction §16 must absorb:** the
"top-32 eigen-angles all at π" claim from night-4 was measured on the **order-swap involution S**
(`z_AB ↔ z_BA`, eigenvalues ±1, reflection-heavy *by construction*) — it is **a different operator**
from these slot maps `W_A/W_B`, which are milder (only 11–13 near-π axes). Both are real; §16's
`R_role` row conflated them. The corrected statement: **R_role/composition = a mild reflective
near-isometry (≈11–13 reflection axes); the order-swap S = a hard involution (reflection-heavy).**

> *Decoded, one line:* the operator's plain-English content is "reflect-rotate each sentence into its
> in-context slot, scale to ~0.59×, and superpose" — an HRR-style role binding by near-isometry plus a
> PC0/length adder, recovered to 92% by two orthogonal matrices and two scalars.

### 17.2 — Sparsity: the composition map is 20× sparse — but only in its Schur eigenbasis

The user asked whether the composition operator can be made sparse. The answer is a sharp **yes, in one
basis and one basis only** (`tae_comp_sparse.json`). Magnitude-pruning the map in three bases gives very
different Pareto curves:

| keep fraction | raw basis | PC basis | **eigen(Schur) basis** |
|---|---:|---:|---:|
| 0.5% (10.5k nnz) | .663 | .710 | **.787** |
| 1% (21k) | .676 | .731 | **.802** |
| 5% (105k) | .734 | .788 | **.846** |
| 10% (210k) | .776 | .817 | **.867** |
| 100% (2.10M) | .889 | .889 | .889 |

In the **Schur eigenbasis**, keeping just **5% of entries** (105k of 2.10M nnz, **20× compression**)
recovers cos **.846 — 95% of the full map's .889**. The raw basis needs 30% to reach that, and the PC
basis never gets there cleanly. The reason is structural: in the Schur basis the map is
**near-permutation / block-diagonal** — row-participation mean **1.24** (median **1.07**), with
essentially all mass on a 2-band — i.e. each eigendirection maps to ~one other, the signature of a
rotation expressed in its own canonical frame. Things that **don't** beat eigenbasis pruning: an
importance-finetuned sparse refit (5% → .774), and a low-rank-plus-sparse decomposition
(rank-32 + 2%-sparse → .735). Decodes under a 30%-raw sparse map stay faithful (mean SBERT-sim to
reference **.941**). So the composition operator is **compressible by 20× without retraining**, but the
compression lives in the operator's spectral coordinates, not its raw weights — the same lesson as the
SAE canonicity work: the *right basis* is where the structure is sparse.

### 17.3 — Deployable editing: rescued, but only just, and only by re-encoding part of the sentence

This is the headline follow-up to §16.2, which left editing in a frustrating place: the
spec-only/proxy embedding scores exact-edit@collateral **.037** (useless), while a non-deployable
**oracle** that reads the token's *true in-context encoder state* scores **.873** (the gap is an
information deficit, not a binding mismatch). Day-5 asks the deployable question: **can any cheap method
predict the true in-context state well enough to make editing usable?** (usable = exact@coll ≥ .30).
Four candidate methods, scored on 300 edits across word-swap / entity-swap / number-change, with a 1139-pair
OOD holdout (reorder + delete-clause source types) — `tae_true_e_predict.json`:

| method (deployable?) | exact@coll | frac of proxy→oracle gap closed | cos-to-oracle | OOD cos-to-oracle |
|---|---:|---:|---:|---:|
| proxy-E (baseline) | .000 | 0.00 | .094 | — |
| self-sub R_role·bare (binding-only) | .000 | 0.00 | .094 | — |
| linear context-correction | .210 | 0.24 | .371 | .453 |
| neighbor-context MLP | .233 | 0.27 | .377 | .542 |
| **partial re-encode (win_k=4)** ✓ | **.325** | **0.37** | **.901** | — |
| true-E ORACLE (upper bound, *not* deployable) | .873 | 1.00 | 1.000 | — |

**Verdict: editing is deployably rescued — but barely, and by the most expensive of the cheap methods.**
Only **partial re-encode** crosses the .30 usability line, at exact@coll **.325** (word-swap .42,
entity-swap .30, number-change .25), closing **37%** of the proxy→oracle gap with cos-to-oracle **.901**.
The catch on cost: it re-encodes a window around the edit (mean 13.1 tokens of a mean 30.3-token sentence,
**43% of the full encode**), so it is "deployable" only in the sense of being **sub-full-re-encode**, not
free. The two genuinely-cheap learned predictors (neighbor-context MLP, linear correction) **plateau just
below usable** at .233/.210 — and note their cos-to-oracle is only ~.37 even though they close a quarter
of the behavioral gap, so they capture the *easy* direction of the in-context state but miss its
high-rank tail (the same `c_i` dark matter from §16). The pure binding methods (proxy-E, self-substitution
through `R_role`) **score exactly 0.0**, re-confirming §16's kill of the role-binding hypothesis from a
third independent angle: rotation is not the missing ingredient, the **in-context information is**.

The honest reading for the transport problem: the night-4 frame said editing was "irreducible." Day-5
**softens that to "irreducible at zero marginal encode cost, recoverable at ~half the encode cost."** The
dark matter is not constructible from a frozen proxy embedding (confirmed), but ~37% of the editing gap is
recoverable if you are willing to **partially re-run the encoder over the edit's neighborhood** — which is
the operational statement that the lost information is *local-context* information, not global. A
genuinely free deployable edit remains **unsolved**.

> *Decoded, the flip:* `man → woman`, base "New Bedford, MA — A local **man** was arrested and charged
> with heroin distribution…". proxy-E (deployable, .037) decodes back to "…a local **man**…" — the swap
> never lands. Partial-re-encode (deployable, .325) decodes "…A local **woman** was arrested and charged
> with heroin distribution…" — **lands**, collateral .012. The oracle lands too (.025). This edit is one
> of the cases that flips fail→land purely by upgrading the proxy to a partial re-encode.

### 17.4 — The c_i generator: a late-stack, attention-mixed hub field that surprisal predicts

§16 called `c_i` "the contextual meaning a token acquires from all-to-all attention" but never located it
in the stack. `tae_ci_generator.json` regresses the final `c_i` onto every encoder layer's residual
(ridge, held-out R²) over 109k tokens, and the **emergence curve is monotone and back-loaded**:

| layer | 0 | 4 | 8 | 12 | 16 | 18 | 20 | 22 | 23/24 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| R²(state^ℓ → c_i) | .036 | .112 | .171 | .205 | .252 | .299 | .388 | .490 | **.544** |

`c_i` is **born in the last quarter of the 24-layer stack** — R² is flat-ish through mid-stack (~.20 at
L12) then accelerates sharply over L18→24 (.30 → .39 → .49 → **.54**). This co-locates exactly with the
residual-norm collapse (layer norm 1.37 at L16 → 0.92 at L22 → **0.43** at L23, the pooling sink): the
dark matter assembles in the same final layers where the representation compresses for pooling. **It is a
late-stack object, not an early lexical one** — consistent with it being context the token *earns*, not
content it *carries*.

How much is attention? The attention-mixture features explain `c_i` only **partially**: the best single
predictor is `Σ_j A_ij·x_j` (attention-weighted mix of last-layer states) at R² **.152**; adding the own
embedding and a mid-layer mix bundles to R² **.193**. So **attention re-mixing accounts for ~15–19% of
`c_i`'s variance** — real but far from all of it; the field has an effective-rank-participation of **819**
(it is genuinely high-rank, echoing the §16 eff-rank-25-per-sentence story aggregated over tokens), so
most of `c_i` is *not* a simple linear read-out of the attention pattern. The **hub predictor** confirms
the §16 picture sharply: the scalar `‖c_i‖` is predicted by **surprisal** at R² **.161** (Pearson **+.397**)
and equivalently by **−log-frequency** (Pearson **−.397**) — rarer, more-surprising tokens carry bigger
`c_i`. Position is irrelevant (R² .002) and the content/stop flag alone gives R² .071 (content-token
`c_i`-norm .467 vs stop .326 vs punct .376). **The dark matter piles onto rare, high-surprisal content
tokens** — exactly the "Sergeant Bluff carries c_norm 2.76 while 'we'/'are' sit at .37" anecdote from §16,
now quantified: `c_i` is a **late-stack, attention-partly-mixed, surprisal-driven hub field.**

### 17.5 — Multilingual: the generative model is not an English artifact

§16's entire model was fit in English; the night-4 deu C5 caveat flagged this as an open generalization
risk. Day-5 stress-tests whether the same `z`-geometry — specifically the SAE feature directions that
implement the role/attribute structure — survives across languages by steering SONAR features and
decoding into each language (`tae_multilingual.json` for fr/de/es/zh/ar, `tae_multilingual_stress.json`
for the harder hi/ja/sw/tr/ko set). Two findings:

1. **Feature directions transfer to 100% of languages tested.** Every probed SAE feature steers
   coherently in **all eleven** languages — `frac_transfer_all_langs = 1.0` in both files. A feature whose
   English steer delta-activates at 6.6 fires at 6.6–8.0 in French/German/Spanish/Chinese/Arabic and
   7.1–7.5 in Hindi/Japanese/Swahili/Turkish/Korean. The *same direction in z-space* carries the *same
   meaning* regardless of surface language — the strongest possible statement that `z` is the
   language-agnostic interlingua the SONAR design intends, and that the §16 model is **not an English
   artifact.**

2. **Cross-lingual invariance is high but real-valued, and dips on distant scripts.** The
   language-invariance score (cosine of the steered-delta across languages) sits at **.75–.76** for the
   Latin/European set (fra .757, deu .762, spa .754), **.71–.75** for Arabic/Chinese (zho .711, arb .746),
   and **.68–.77** for the stress set (hin .769, swh .766 high; jpn .690, kor .698, tur .676 lower). The
   pattern is sensible: **typologically-distant, non-Latin-script languages (Japanese, Korean, Turkish)
   show the largest residual**, but never break transfer. This **fills the night-4 deu caveat** — German
   is in fact one of the *best*-transferring languages (.762) — and upgrades the model from "fit in
   English" to "**confirmed cross-lingual to .68–.77 invariance, 100% qualitative transfer across 11
   languages spanning 6 scripts.**"

> *Decoded, cross-script:* steering the same feature into the same base sentence —
> French base "Le gouvernement britannique s'est également rendu compte des atrocités commises au
> Congo." → steered "L'université du Royaume-Uni a également observé les atrocités de l'université.";
> Hindi base "ब्रिटिश सरकार ने कांगो में हुए अत्याचारों…" → steered "ब्रिटिश विश्वविद्यालय ने कांगो में
> शैक्षणिक अत्याचारों का भी अवलोकन किया" ("the British **university** also observed the **academic**
> atrocities…"). The *same* "university/academic" feature lands the *same* semantic shift in both — in
> Latin and Devanagari script alike.

**What day-5 does and does not overturn in §16.** Nothing is reversed; three rows are sharpened. (a) The
`R_role`/composition row's "≥32 eigenvalues at π" should read **"11–13 near-π axes; the ≥32-at-π count
belongs to the order-swap involution S, a different operator"** (§17.1). (b) Editing's "irreducible"
verdict softens to **"irreducible at zero encode cost; ~37% recoverable at ~half encode cost via partial
re-encode"** (§17.3) — the role-binding kill stands and is reconfirmed a third time. (c) The "fit in
English" caveat is **discharged** — the model is cross-lingual (§17.5). The composition operator is now
written down (92% closed-form, 20× sparse in eigenbasis) and `c_i` is now *located* (late-stack,
attention-15–19%, surprisal-driven), turning two §16 *assertions* into measured *receipts*.

### 17.6 — Patchscope: `z` is LLM-legible through a learned bridge — as gist, not verbatim

The §16.4 safety payload rests on a claim that has so far only been demonstrated with SONAR's *own*
decoder: that `z` *can be read*. Day-5 tests the harder, more deployment-relevant question — **can a
frozen, unrelated LLM narrate `z`?** A tiny non-linear bridge (a 2048-hidden MLP, ~5M params) maps `z`
into **4 soft tokens** in the embedding space of a frozen `Llama-3.2-3B-Instruct`, injected at layer 0;
the LLM then free-generates a description, scored by SBERT against the true sentence
(`tae_patchscope.json`, 512 eval, adapter trained on 59k pairs).

1. **`z` is strongly LLM-legible.** The bridge reads `z` at **SBERT .589** against the true text, versus a
   **shuffled-pairing floor of .084** — a 7× separation. The SONAR decoder ceiling is .980, so the bridge
   recovers most-but-not-all of what is decodably present (and the bridge-vs-SONAR-decode agreement, .587,
   tracks bridge-vs-true almost exactly — the bridge is faithfully reading the *same* content SONAR reads,
   just imperfectly). A foreign LLM, given four learned soft tokens, **narrates the latent.**

2. **It captures gist, register, and structure — but hallucinates specific entities.** The failure mode is
   sharp and consistent with §16's dark-matter story: the bridge nails topic, tone, and syntactic frame
   but swaps the *named* entities, exactly the hub-bound proper-noun content that c_i locks away (§16.1).
   This is the **same gist-not-verbatim signature** the SAE roundtrip shows, now reproduced through an
   independent model — strong convergent evidence that what is irrecoverable is *specific lexical
   identity*, not meaning.

> *Decoded (bridge vs. true):* true "OHA's creditors include banks, private lenders, and other financial
> institutions." → bridge "The **FDIC's clients** include banks, **thrifts**, and other financial
> institutions." And true "**Pompeii and the Etruscans**: Unveiling the Complexities of Cultural Exchange"
> → bridge "**Bucharest** and the Art of Cultural Exchange: A Tale of Two Cities." The frame ("X's
> creditors include banks…", "**…** and … Cultural Exchange") is exact; the entities (OHA→FDIC,
> Pompeii→Bucharest) are plausible confabulations. The hub-bound content stays locked — consistent with
> dark-matter (§16.2). **Safety reading:** the parascope→concepts pipeline can put `z` in front of an
> arbitrary auditor LLM and get a faithful *gist*, but must not be trusted for verbatim entity claims.

### 17.7 — The big SAE: scale + full data + 80 epochs buys ~1.5× the FVU drop of a k-doubling, free

The DAY5 directive was "throw a tonne of compute at a big BatchTopK SAE." The flagship that finished first
is **h32768 / k64 / BatchTopK+AuxK(512)**, trained **80 epochs on the full data_v2 sentence corpus** (800k
train, 34k val, 17 chunks), and scored with a new *behavioral* metric — the decode-roundtrip SBERT, i.e.
how much sentence-meaning survives a full *encode→SAE→decode* trip (`sonar-sae_h32768_k64_btk_day5_metrics.json`).
The comparators are the existing BatchTopK leaderboard at matched/bracketing config.

| SAE | h | k | FVU↓ | cos↑ | dead% | L0 | roundtrip-SBERT | % of decoder ceiling |
|---|---|---|---|---|---|---|---|---|
| **day5 flagship** | **32768** | **64** | **0.320** | **0.833** | **0.003%** | 64 | **0.835** | **84.8%** |
| prior h32768 k48 | 32768 | 48 | 0.466 | 0.751 | 0.00% | 48 | — | — |
| prior h32768 k128 | 32768 | 128 | 0.331 | 0.834 | 0.00% | 128 | — | — |
| prior h16384 k64 | 16384 | 64 | 0.379 | 0.803 | 0.00% | 64 | — | — |
| prior h65536 k128 | 65536 | 128 | 0.351 | 0.828 | 0.00% | 128 | — | — |
| prior h65536 k32 | 65536 | 32 | 0.536 | 0.703 | 2.30% | 32 | — | — |

Three things land:

1. **At matched sparsity, the day-5 recipe is a real gain.** Holding k=64, going from the old h16384 to the
   day-5 h32768+full-data_v2+80-epochs cuts **FVU 0.379 → 0.320** and lifts **cos 0.803 → 0.833** — a
   ~16% relative FVU reduction with *no change in L0*. Dead features fall to a negligible **0.003%** (one
   feature in 32768). The behavioral roundtrip is the headline new number: **0.835 SBERT, 84.8% of the
   .984 SONAR-decoder ceiling** — meaning a 64-active-feature sparse code preserves ~85% of the meaning the
   dense decoder could.

2. **It buys the recall of a k-doubling at half the L0.** The day-5 k=64 (FVU .320, cos .833) essentially
   *matches* the prior **k=128** SAEs — h32768/k128 (FVU .331, cos .834) and h65536/k128 (FVU .351,
   cos .828) — while firing **half as many features**. Scale + data + epochs bought, for free, the
   reconstruction that previously cost a 2× denser code. This sharpens the §17/§16 "k is the
   reconstruction lever" line: *training* (data + epochs) is a second, cheaper lever that shifts the
   FVU-vs-L0 Pareto frontier left.

3. **It is the best k≤64 SAE on the leaderboard and lands on the k=128 frontier behaviorally.** Among all
   sparse (k≤64) configs, the day-5 flagship is now top on FVU, cos, and dead%. The remaining gap to the
   .984 ceiling (15%) is the familiar dark-matter / verbatim-entity loss, not a dictionary-capacity
   problem — consistent with §5's "bigger dictionaries don't buy recon" once you are on the frontier.

> *Verdict:* **scale + full data + long training does help — but as a Pareto shift, not a new ceiling.**
> The win is "k=128 reconstruction at k=64 sparsity, 0.003% dead, 85% of meaning behaviorally preserved,"
> not a break past the ~0.32 FVU / .83 cos plateau that the larger-dict / higher-k configs also sit on.

### 17.8 — Value-dense construction: no spec design beats the bind-sum; the wall is binding, not density

The night-4 white-box constructor hit a wall at **recall .314** (the `whitebox_retract` extrapolation
anchor; GRU anchor .31), and the standing hypothesis was that the wall was **spec information-density** —
that a richer construction spec would let the bind-sum recover more. Day-5 tests six spec designs against
the bind-sum base and the encode-decode ceiling (`tae_construct_valuedense.json`, scored by the same
bag-recall scorer; all glue ≤1% of encoder params): **content-only**, **content-only + length/order
code**, **info-weighted** (rare content full, frequent compressed), **position-clock** (content reindexed
to a clean 0..k-1 clock), and **+residual-head** (≤1% learned).

| design | grid recall (len-matched 4–16) | extrap recall (1.5×, len 18–30) |
|---|---|---|
| base bind-sum | 0.246 | **0.316** |
| content-only | 0.259 | 0.301 |
| content-only + length code | 0.230 | 0.302 |
| info-weighted | **0.216** (worst) | 0.287 (worst) |
| **position-clock** | **0.259** (grid best) | 0.311 |
| +residual-head | 0.231 | 0.289 |
| ceiling (roundtrip) | 0.823 | 0.961 |
| — night-4 white-box anchor | (0.247 retract) | **0.314** |

The verdict is a **negative result that refutes the density hypothesis:**

1. **No design materially beats the bind-sum.** In the length-matched grid the best design (**position-clock**,
   .259) lifts only **+0.013** over base (.246); in the extrapolation regime the best is **base bind-sum
   itself** (.316), with position-clock (.311) statistically level and the .314 anchor neither beaten nor
   lost. The winner is **regime-dependent and within noise** — there is no spec design that unlocks the
   gap.

2. **Information-weighting actively *hurts*.** The most "value-dense" arm — compressing frequent pieces,
   spending budget on rare content — is the **worst** in both regimes (.216 grid, .287 extrap). Denser
   specs do not help; if anything they perturb the bind-sum off the manifold. The night-4 "spec
   info-density is the wall" hypothesis is **rejected.**

3. **The ceiling shows where the loss actually is.** The encode-decode ceiling is .823 (grid) / .961
   (extrap); the constructor sits at .26–.32, i.e. **only ~31.5% of the achievable ceiling** (grid). The
   missing two-thirds is the **role-binding + dark-matter loss** that §16.2 / §17.3 already located — *not*
   spec density. A C4 note: grid (short, length-matched specs) and extrap (long natural specs) are not
   directly comparable, which is why the nominal "winner" flips between regimes; the honest summary is
   *"all designs cluster, none escapes the binding wall."*

> *Decoded (position-clock, len-4):* spec "The people lost hard." → constructed "The loss of people is
> difficult for challenges." (recall .333) — content nouns survive, syntax and the adverbial intensity are
> gone. Versus ceiling "The people lost hard." (recall 1.0). The constructor recovers *what* the sentence
> is about; the bind-sum cannot recover *how it is said*. This is the construction-side image of the
> §16.2 dark-matter loss.

### 17.9 — The reconstruction ladder: c_i is **93.7%** of the recoverable signal; R_role buys ~nothing in CE

This is the definitive decomposition the night-4 model promised. We rebuild `z` term by term —
**bag → +R_pos → +R_role → +c_i → ceiling** — and measure each term's share not in `z`-cosine (where
norms flatter the early terms) but in the **decoder's teacher-forced cross-entropy** (nats/token), the
behavioral currency. `R_role` is refit by orthogonal Procrustes per token-kind on **fresh disjoint
shards**, and the whole ladder is evaluated on a held-out shard (`tae_model_robustness.json`; fit 4000,
eval 1500; bag→ceiling CE gap = **2.616 nats**).

| ladder term | CE (nats/token) | **share of bag→ceiling gap closed** |
|---|---|---|
| bag (content embeddings, mean-pooled, no order/role) | 2.852 | — |
| +R_pos (positional rotation) | 2.684 | **6.4%** |
| +R_role (per-kind role rotation) | 2.687 | **−0.1%** |
| +c_i (the context field / dark matter) | 0.235 | **93.7%** |
| ceiling (true `z`) | 0.235 | 100% |

The headline number: **of everything that separates a bag-of-embeddings from the true latent, the context
field c_i explains 93.7%, positional structure 6.4%, and the role rotation essentially 0% (−0.1%) in CE
terms.**

1. **c_i *is* the latent, behaviorally.** Adding c_i collapses CE from 2.69 → **0.235 nats**, landing
   *exactly* on the ceiling (the residual is < the third decimal). The "dark matter" is not a side-channel:
   in CE it is **almost the entire recoverable signal**. This is the strongest single statement of §16's
   thesis — the interesting content of `z` lives in the high-rank, token-bound context field, not in the
   clean algebraic skeleton.

2. **R_role buys ~nothing in CE — a sharpening, not a reversal, of §16.** The role rotation closes
   **−0.1%** of the CE gap (2.684 → 2.687, a rounding-level *regression*). The cross-seed receipts explain
   why: refit on disjoint shards, the per-kind R_role matrices come out **near-identity** (mean principal
   angle **0.0003 rad**, frac-near-identity **1.0** for all four kinds) and the CE is **seed-stable to
   0.0045 nats** (seed-A 2.6869 vs seed-B 2.6914). So R_role is *real and reproducible* as a fitted object,
   but in the mean-pooled prepool decomposition it sits so close to identity that it adds no behavioral
   recall **on top of R_pos**. This is consistent with §17.1 (R_role is a *mild* reflective rotation,
   eig-angle ~.14) and with §16's caution that the algebraic terms are a thin skeleton — the lift lives in
   c_i. (C6/C4 caveat: the large rel-Frobenius distances 1.33–1.41 between seeds coexist with ~0 principal
   angle because the matrices are dominated by their near-identity bulk; the *direction* is stable, the
   off-identity *magnitude* is small and noisy. R_role's value, demonstrated elsewhere, is in
   *order/binding* tasks, not in this pooled CE ladder.)

3. **Dark-matter dominance is length-stable.** Stratified: mid-length (8–25 tok, mean 18.3) gives c_i
   share **94.5%**, R_pos 6.4%, R_role −0.9%; long (>25 tok, mean 37.6) gives c_i **93.4%**, R_pos 6.4%,
   R_role +0.2%. The decomposition is **flat across length** — c_i ≈ 93–95% everywhere, with no sign that
   longer sentences shift weight onto the algebraic terms. The short bucket (<8 tok) had n=1 and is
   reported as inconclusive.

> *The one-line decomposition:* **z ≈ bag + (6%) positional rotation + (0%) role rotation + (94%) context
> field.** The clean, constructible algebra is ~6% of what separates a bag from the latent; the remaining
> 94% is the high-rank dark matter §16 named — and §17.8 just confirmed it is also exactly what blocks
> construction.

### 17.10 — Paragraph pooling: the generality boundary — §16 is a **sentence-model**, and z is a *graceful* not a *flat* paragraph pool

This is the load-bearing generality test: does the night-4 algebraic model — a single `z` as a clean
(1/N) flat pool over its constituents — survive **past one sentence**? We build 1000 multi-sentence
paragraphs (2–6 sentences), encode each whole paragraph to one SONAR `z`, and ask three questions:
(i) is the paragraph `z` the flat token-mean of its words (γ_para → 1 ⇒ flat; < 0.9 ⇒ hierarchical);
(ii) does a **flat** reconstruction or a **role-composition** reconstruction better match the true
pipeline `z`; (iii) does per-sentence recall show a **U-curve / count-degradation**, the signature of a
saturating role-rotation budget one level up (`tae_paragraph_pool.json`; 1000 paragraphs, max 8 shards).

**The honest verdict: MIXED / INCONCLUSIVE on the algebra, but DECISIVE on the boundary.** The three
tests do not point the same way, and we report that straight.

| test | number | reading |
|---|---|---|
| (i) γ_para (flat-pool exactness) | **1.000** (mean *and* min, every paragraph) | flat pool is *mechanically* exact — but see caveat |
| (ii) flat-recon vs role-recon cos | **0.582 vs 0.570** (Δ = +0.012, flat barely ahead) | neither flat nor role-composition decisively wins; recon is *only ~0.58* |
| (iii) per-sentence recall vs count | **0.949 → 0.672** (2→6 sentences), U-curve **true** | recall degrades with count; budget saturates |

1. **γ_para = 1.000 is the trap, not the result (C4/C6).** The flat-pool gamma is *exactly* 1.000 for all
   1000 paragraphs because — at the paragraph level — the SONAR encoder's mean-pool over sub-tokens **is
   the construction by definition**; γ measures the pooling identity, not whether the *content* is flatly
   organized. The load-bearing number is not γ but the **reconstruction cosine, which is only 0.582** —
   far below the within-sentence reconstruction (§17.9 lands c_i at 0.235 nats CE, i.e. near-ceiling cos).
   So a paragraph `z` is a *much worse* flat pool of its sentences than a sentence `z` is of its words:
   the flat-pool model **degrades by a third of its cosine** the moment you cross the sentence boundary.

2. **Flat barely beats role — the algebra has no decisive winner at paragraph scale.** flat-recon 0.582 vs
   role-recon 0.570 is a **+0.012 margin**: the length-agnostic token-bag and the sentence-hierarchy
   reconstructions are statistically a wash, and *both* sit near 0.58. The night-4 composition operator
   (two reflective role rotations + length adder, §17.1) — which is near-exact **within** a sentence — does
   **not** buy a second, clean level of hierarchy for paragraphs. There is no "sentence-role rotation" that
   recovers the cross-sentence structure the way the token-role rotation recovers word order.

3. **Recall degrades monotonically with sentence count, and the within-paragraph profile is U-shaped.**
   Per-sentence recall falls **0.949 (2-sent) → 0.916 (3) → 0.829 (4) → 0.759 (5) → 0.672 (6)** and order
   τ collapses in lock-step (1.00 → 0.94 → 0.81 → 0.67 → 0.53): the paragraph `z` is a **fixed-capacity
   buffer** that smears as you pack more sentences in. Within a paragraph the recall is **U-shaped**
   (first-20% 0.937, mid 0.79–0.90, last-20% 0.855) — a primacy/recency profile, the saturating-budget
   signature, *not* flat. This is exactly what test (iii) predicted: the role-rotation budget that is
   roomy for ~20 word-slots saturates around **4 sentences**.

> *The boundary, stated honestly:* **§16's clean algebra is a *sentence*-model, not a general *z*-model.**
> Within one sentence, `z` is a near-exact flat-pool + role-rotation (§17.9: 94% of the recoverable signal
> reconstructed). Across sentences, the *pooling* is still mechanically flat (γ=1.000) but the *content*
> reconstructs at only **cos 0.582**, role-composition buys nothing decisive (+0.012), and recall decays
> **0.95→0.67** with a U-shaped within-paragraph profile. The model **does not break** — it **degrades
> gracefully and predictably**, which is itself the night-4 prediction (a finite role-binding budget). A
> 2-sentence paragraph decodes near-perfectly ("AMMAN, Jordan – A wave of devastating bombings rocked three
> hotels…", recall 0.984/0.967, τ=1.0); a 6-sentence one drops a middle sentence to recall **0.547** and
> can scramble order (τ=0.33). **Safety reading: a parascope/auditor claim certified on single sentences
> does NOT transfer unqualified to paragraphs** — the flat-pool license holds at sentence scale and decays
> measurably beyond it.

### 17.11 — Analogy = editing = the hub-field: one mechanism, and the model predicts its own boundary

This closes a long-standing thread (the user's word-swap idea) and **unifies three results into one
mechanism**. We test the classic parallelogram — `z_D ≈ z_A − z_B + z_C` (king − man + woman) — entirely
in SONAR `z`-space, on 240 analogy quads, and ask not just *whether* it works but *where it fails*
(`tae_analogy.json`; 240 quads with out-degree, shuffled-C null, true-E oracle).

| measure | value | reading |
|---|---|---|
| overall analogy success | **0.4625** | vs shuffled-C null **0.083** → **5.6× above null** |
| true-E oracle (reads gold D) | **0.8125** | ceiling when the *real* swapped state is used |
| success vs swapped-token **out-degree** | Spearman **r = −0.318, p = 4.7e-7** | hub words FAIL the analogy |
| success vs swapped-token **c_norm** | Spearman **r = −0.271, p = 2.1e-5** | high context-field = un-constructible |
| peripheral (q1 out-deg) → hub (q4) | **0.617 → 0.250** | clean monotone fall-off |

1. **Analogy works, linearly, in z.** The bare parallelogram decodes the gold target at **46%** against an
   **8%** shuffled-C null — a 5.6× separation. So `z` supports `king − man + woman = queen`-style vector
   arithmetic at the sentence level: the swap of a single lexical role is a **linear move in z** for a
   large fraction of words.

2. **The model predicts its own failure boundary — and it is exactly the hub-field.** Success
   **anti-correlates** with the swapped token's out-degree (r = −0.318, **p = 4.7e-7**) and with its
   c_norm (r = −0.271, p = 2.1e-5). Peripheral, low-out-degree words succeed at **0.617**; hub,
   high-out-degree words fail at **0.250**. This is the *same* deficit §16.1 measured (out-degree-q4 tokens
   carry 3.75× the context-field excess) and §17.3 hit when editing: the content that a linear edit cannot
   touch is the **hub-bound c_i**, and it is *the same content* the analogy cannot construct.

3. **The true-E oracle closes the hub cases — proving the tie.** When the parallelogram is built from the
   *real* pooled state of the gold swapped word (oracle, non-deployable — it reads gold D) instead of the
   bare embedding, success jumps to **0.81** and the **hub quartile lifts** (q4 oracle 0.867 vs deployable
   0.250). The hub failures are therefore **not** a representational dead-end — they are precisely the
   missing **c_i / hub-self-field**: supply the true context-laden state and the analogy completes.
   **Analogy, deployable editing, and the hub-field are one mechanism: linear in the lexical embedding
   E(w), blocked by the context field c_i.**

> *Decoded analogy (peripheral succeeds, hub fails):* "father : man :: woman : **mother**" (out-deg 0.68) →
> z-pred decodes **"The mother stood quietly near the old window."** = gold, exactly. Contrast the hub case
> "apple : fruit :: vegetable : **carrot**" (out-deg 2.54, c_norm 1.10) → z-pred decodes **"Everyone
> admired the apple during the long ceremony."** — the swap **fails to fire**, the carrier sentence is kept
> but the hub word `apple` will not move to `carrot`. The oracle (real carrot-state) flips a sibling case to
> success. **This is the word-swap idea, answered: it works for peripheral words and is blocked, predictably
> and measurably, exactly at the content-hub tokens c_i locks away.**

### 17.12 — The z-auditor: the safety deliverable — a calibrated contract that **readability never licenses steering**

This is the night-4 safety payload, packaged. The auditing law (§16.4) says a **linear probe reading an
attribute off a SONAR `z` does NOT by itself license a causal/steering claim**; attributes split by the
**triplet (linear-readability, MT-survival, decode-steerability)** into four bins. The z-auditor
operationalizes this as a **calibrated pass/fail contract**: per attribute it reports a held-out detection
AUC *and* a bin, and — critically — only the **ACTIVE** bin is contracted to be linearly steerable
(`tae_zauditor.json`; eng + deu, 200/attr, steering amps 2/4/8).

**The four bins (the contract's vocabulary):**

| bin | meaning | steering license |
|---|---|---|
| **active** | steerable: a unit mean-shift along the readable axis moves the decode ≥ calibrated decode_shift | audit **and** intervene linearly |
| **curved** | detect-only: genuinely in `z` + MT-robust, but stored **non-linearly** (position-rotated frame) | linear probe reads it; **linear mean-shift will NOT steer** — use the decode gradient |
| **inert** | prior-owned: readable but MT-fragile / low decode_shift — the **decoder's prior supplies it** | reading it off `z` does **not** license a causal claim |
| **truly_lost** | probe AUC ≈ chance — not reliably readable (dark-matter / hub residue) | auditor must **NOT** claim it is present |

**Per-attribute held-out calibration (English; 7/7 pass, 6/7 bin-match):**

| attr | calibrated bin | expected | held-out AUC | decode_shift | note |
|---|---|---|---|---|---|
| language | inert | inert | **1.000** | 0.000 | perfectly readable, zero steer — pure prior |
| register | inert | inert | **0.877** | 0.021 | readable, prior-owned |
| voice_wording | active | inert | 0.885 | 0.272 | the **one bin-mismatch** (see C4 below) |
| **sentiment** | **active** | active | **1.000** | **0.829** | the model attribute: read AND steer |
| **named_entity** | **active** | active | **0.938** | 0.373 | active: auditable + steerable |
| tense | curved | curved | **1.000** | 0.174 | readable, **not** linearly steerable (axis-rot cos 0.973) |
| number | curved | curved | 0.832 | 0.079 | detect-only; linear probe **under**-reads (lin-AUC 0.64 < nonlin 0.74) |

German replicates the core split (7/7 pass; sentiment AUC 1.000 / shift 0.928, named_entity AUC 0.93,
tense curved AUC 0.89), with the expected cross-lingual softening: `register` and `number` fall to
**truly_lost** in German (held-out AUC 0.51 / 0.59) — the auditor then correctly **refuses to claim them
present**, which is the contract working as designed, not a failure.

1. **What it certifies to a downstream auditor.** For each *passed* attribute the auditor emits a
   **calibrated detection confidence (held-out AUC) + a bin**. If `bin == ACTIVE`, a unit mean-shift along
   the calibrated axis is contracted to move the decode by ≥ the reported decode_shift (sentiment 0.83,
   named-entity 0.37). If `bin == CURVED` (tense, number) or `INERT` (language, register), a **linear
   intervention is explicitly NOT contracted** — CURVED says use the decode gradient, INERT says the prior
   owns it. If `bin == TRULY-LOST`, the auditor reports the attribute **not recoverable** and makes no
   presence claim. **Readability never, by itself, licenses a steering claim** — that is the whole
   deliverable.

2. **Order/tense is the canonical safety case.** Tense reads at **AUC 1.000** yet sits in **curved**
   (decode_shift 0.174, axis-rotation cos 0.973 — the readable axis is rotated near-orthogonal to the
   steer axis). So "a probe reads tense off `z`" must **never** be reported as "tense is steerable in
   `z`." This is the §16.4 order-is-recoverable-but-not-linearly result, now a calibrated, per-attribute
   guard rail.

3. **The honest mismatch (C4/C6).** `voice_wording` calibrates **active** but was *expected* inert
   (decode_shift 0.272 cleared the 0.20 active threshold in both languages). Two readings: either the
   passive/active voice axis genuinely *is* a steerable linear direction (and the §16.4 prior assignment
   was conservative), or the 0.20 threshold is slightly low for this attribute. We report it as a
   **bin-mismatch, not a clean pass** — 6/7 bin-match (eng), 4/7 (deu) — and flag it for re-calibration
   rather than papering over it. The contract's value is precisely that it makes such mismatches *visible
   and auditable*.

> *Worked audit report* (source: *"The committee unanimously approved the controversial new budget
> yesterday."*, which decodes back **verbatim**):
> ```
> language      present=False  bin=inert    AUC=1.000  shift=0.000  → PRIOR-OWNED: do not claim z carries it
> register      present=True   bin=inert    AUC=0.877  shift=0.021  → PRIOR-OWNED
> voice_wording present=True   bin=active   AUC=0.885  shift=0.272  → STEERABLE (audit + intervene)
> tense         present=True   bin=curved   AUC=1.000  shift=0.174  → DETECT-ONLY: read it, but do NOT linearly steer
> sentiment     present=False  bin=active   AUC=1.000  shift=0.829  → (absent here) steerable when present
> number        present=False  bin=curved   AUC=0.832  shift=0.079  → DETECT-ONLY
> named_entity  present=False  bin=active   AUC=0.938  shift=0.373  → (absent here) steerable when present
> ```
> The report detects the sentence's **past tense** (AUC 1.000) but **refuses to certify it as linearly
> flippable** (curved); it would let an auditor steer *sentiment* or a *named entity* if present, but never
> *tense* or *number* via a probe axis. **This is the §16.4 safety thesis shipped as a contract: detection
> and intervention are separately licensed, and the auditor is barred from confusing the two.**

### 17.13 — SAE features ⇄ the generative model: the bridge is REAL but WEAK — features predict hubs and replicate the active/curved split; they are NOT the role-rotation eigendirections

This is the load-bearing **bridge between the two halves of the program**: the day-5 SAE (the *understand*
dictionary) and the night-4 generative model (`R_role` + the hub field + the §16 attribute bins, the
*construct* skeleton). Three arms ask whether the SAE's learned features *are* the model's internal objects
(`tae_sae_eigenalign.json`; SAE `h32768_k64_btk_day5`, 6000 sentences, 20 null draws).

| arm | question | metric | verdict |
|---|---|---|---|
| **1 — eigenalign** | are SAE decoder atoms the `R_role` eigen-plane directions? | obs maxcos **0.151** vs null **0.133** (p95 0.150); axis-z mean **2.09**, only **25%** of role axes beat null-p99 | **WEAK/PARTIAL** — barely above chance |
| **2 — hubs** | do high-activation features fire on hub tokens (out-degree / c_norm)? | one feature at **Pearson 0.68** to od_max (0.62 od_mean); mean active-feat corr **0.024** | **CONFIRM (sparse)** — a *cluster*, not a basis |
| **3 — bins** | do features split into ACTIVE-steerable vs CURVED-not? | sentiment toward-shift **0.176** vs order **0.007**; active−curved **+0.169** | **CONFIRM** — replicates §16 split |

1. **The SAE is NOT the role-rotation eigenbasis (the honest negative).** Aligning the 64 top SAE decoder
   atoms to the `R_role` eigen-plane directions gives observed max-cosine **0.151** against a rotation-null
   of **0.133** (p95 = 0.150) — an excess of only **+0.018**, with mean axis z-score **2.09** and just
   **25%** of role axes clearing the null's 99th percentile. So there is a *whisper* of alignment (one axis
   hits z = 5.2), but the SAE dictionary is emphatically **not** a readout of `R_role`'s eigendirections.
   This is the expected result given §17.1/§17.9: `R_role` is a *mild* reflective rotation (eig-angle ~.14)
   that buys ≈0 in pooled CE — there is little energy there for an unsupervised dictionary to discover. The
   two larger SAEs (`h65536_k128`, `h32768_k128`) wash out entirely (obs−null **−0.002 / −0.001**, 0% above
   p99) — **bigger dictionaries align *less***, consistent with §17.7's "scale buys reconstruction, not new
   structure."
2. **Features DO find the hub field — but as a sparse cluster, not a coordinate.** Exactly the model's hub
   geometry surfaces: **one** feature (28241) correlates **0.68** with per-sentence max out-degree (0.62
   with od_mean), a second at 0.34, and the mean active-feature correlation is a floor **0.024**. So
   "hubness" is carried by a *handful of dedicated features*, not smeared across the dictionary — the
   `predict_hub_cluster_r≥0.4` flag fires. The top hub feature decodes to the degenerate repetition
   **"The world is changing and the world is changing and …"** (fire-frac 0.24, mean-act 0.74) — i.e. the
   feature that lights up is the **low-information / high-out-degree attractor** that §16/§17.4 named as the
   hub-field signature. The dictionary *re-discovers the hub field from the other side.*
3. **The active/curved split replicates inside the SAE.** Feature-class correlations cleanly separate the
   two regimes: sentiment (an ACTIVE attribute) has top class-features at **+0.69 / +0.66 / −0.63** and a
   decoder mean-shift steer that moves the decode **toward the target class by 0.176**; order (a CURVED
   attribute) has weaker class-features (+0.59 / +0.45) and a steer that moves it **0.007** — a 25× gap
   (active−curved = **+0.169**, `replicates_active_curved_split = true`). So the **§16 bins are visible in
   SAE-feature space**: sentiment features are steerable levers, order features read but do not steer.

> *The bridge, in one line.* The SAE dictionary **re-finds the hub field** (one feature, r = 0.68) and
> **re-finds the active/curved attribute split** (steer gap 25×) — but it does **NOT** re-find `R_role`'s
> eigendirections (maxcos 0.151 vs null 0.133). The two halves of the program meet on the *content* objects
> (hubs, attribute bins) and **diverge on the thin algebraic skeleton** — exactly the §17.9 picture where
> c_i is 94% of the signal and the role rotation is ~0%. *Decoded steer example* (sentiment, ACTIVE):
> base *"I thought the concert this weekend was great."* → SAE-feature-steered *"I thought the perfect
> concert was great."* — a coherent on-class shift; the matched CURVED (order) steer instead collapses to
> a music-lyric attractor (toward-shift 0.007), i.e. it does **not** steer, exactly as the bin predicts.

**Critic (C1–C10).** C2 (gears): the mechanism is correct and *split* — content objects bridge, algebra
does not; this is a sharpening of §17.9, not a new claim. **C6 (baselines): strong** — a 20-draw rotation
null is the right control, and the headline excess (+0.018) is honestly reported as marginal, not inflated.
**C4 (confound): the hub feature decodes to a repetition-attractor**, so "fires on hubs" partly = "fires on
degenerate low-perplexity text"; we read this *with* the model (hub tokens *are* the low-information
attractors §16 named), but flag that hubness and degeneracy are entangled here. C3: arm-3's active/curved
prediction was pre-registered and replicates. **Verdict: AMBIGUOUS-leaning-CONFIRM** — the bridge exists on
content (hubs ✓, bins ✓) and is refuted on the eigenbasis (✗); no §16/§17 reversal, a clean sharpening.

### 17.14 — Transport theory: edit-feasibility is only PARTIALLY predictable — and a NULL baseline beats the proposed certificate (AUC 0.857 > 0.766)

The night-4 program left an a-priori question open: **can you certify *before editing* whether a given edit
will succeed**, from cheap features — the attribute bin (§16) × the local decoder-Jacobian rank? The
pre-registered certificate was `feasibility f = [bin == ACTIVE] × [local_rank < τ]`, predicted to hit
held-out AUC ≥ 0.7 **and** beat both the out-degree NULL and the bin-alone baseline
(`tae_transport_theory.json`; 300 edits, 180/120 train/test split, proxy-E so overall success is a floor
**3.7%**, the §17-known spec-only wall). **Self-reported result: NOT-CONFIRMED.**

| predictor (held-out AUC) | value | reading |
|---|---|---|
| feasibility f (ACTIVE × low-rank), **binary** | **0.483** | the pre-registered certificate **fails** (≈ chance) |
| feasibility f, continuous | 0.578 | weak |
| full logistic (bin + local_rank_z + outdeg_z) | **0.766** | the best feasibility model — partial |
| **out-degree alone (the NULL)** | **0.857** | **beats everything** — the certificate is dominated by its own null |
| local-rank alone | 0.690 | rank carries real signal |
| bin alone | 0.308 | **bin alone is anti-predictive** (worse than chance) |

1. **Local decoder-Jacobian rank DOES predict editability — bin does NOT, and the proposed interaction
   fails.** The logistic coefficients are honest: `local_rank` enters **negative** (−0.175 std) — **low
   local rank ⇒ editable**, exactly the geometric story (an edit succeeds where the decode map is locally
   low-dimensional, so a content nudge moves the output without dragging collateral). But `bin_ACTIVE` also
   enters **negative** (−1.01), i.e. in this proxy-E regime ACTIVE edits succeed *less* than INERT ones
   (by-bin success: ACTIVE 2.0% vs INERT 7.0%) — the **opposite** of the pre-registered sign. So the
   `ACTIVE × low-rank` cell is *not* the feasible corner; the best feasible cell is **INERT_lowrank**
   (success 12.5%, n=40), and the worst is **ACTIVE_highrank** (1.1%). The certificate's two-factor product
   is built on a bin term that points the wrong way here, which is why the binary f lands at AUC 0.483.
2. **The NULL beats the model — the decisive negative.** Out-degree alone (queue-style hubness, the
   pre-registered NULL meant to be *beaten*) reaches **AUC 0.857**, above the full 3-feature logistic
   (0.766). Editability is **predicted best by a single hub/out-degree scalar**, not by the bin×rank
   certificate — i.e. *whether an edit lands is governed mostly by how hub-like the target latent is*
   (peripheral edits feasible, hub edits blocked), which is **precisely the §17.11 analogy=editing=hub-field
   unification**, now re-derived from the editing side. The fancy certificate adds nothing over "is this a
   low-out-degree peripheral latent?".
3. **What this does and does not overturn.** It does **not** rescue editing (overall 3.7%, the §17.3/§17.8
   wall stands) and it does **not** refute the bins as *audit* objects (§17.12) — bins describe
   *steerability under a mean-shift*, a different operation than spec-only re-encode editing. What it
   refutes is the *specific* a-priori certificate: **you cannot read edit-feasibility off (bin × local-rank)
   well enough to beat a hub-scalar null.** Editability is real-but-mostly-hubness, with local-rank as a
   genuine secondary signal (rank-alone 0.690).

> *Decoded feasibility examples.* **FEASIBLE** (ACTIVE number-change, local_rank **5.87**, low out-degree
> 0.55): *"Fast forward to 2022, …"* → *"Fast forward to 2023, …"* (success, collateral 0.0). **OBSTRUCTED
> BY RANK** (ACTIVE, local_rank **19.8**): edit *2045→2046* fails and drifts to *2040* with collateral.
> **OBSTRUCTED BY BIN** (INERT word-swap *house→cottage*, local_rank low 1.96 **but** high collateral 0.45):
> the decode derails entirely — low rank is **necessary but not sufficient**; the edit type still matters,
> just not along the ACTIVE/INERT axis the certificate assumed.

**Critic (C1–C10).** **C6 (baselines): exemplary and self-damning** — the experiment ran its own NULL
(out-degree) and its own ablations (bin-alone, rank-alone) and reported that the NULL **wins**; the verdict
field literally reads `NOT-CONFIRMED`. **C4 (confound): proxy-E floor** — overall success is 3.7% because
this is the non-deployable proxy-embedding regime (§17.3); AUCs are computed over a very imbalanced, mostly-
failing set, so the ranking among predictors is more trustworthy than the absolute success rates. C2: the
gear is corrected — feasibility ≈ peripheral-ness (out-degree), with local-rank secondary; bin is not a
feasibility axis. **Verdict: REFUTE the certificate** (partial AUC 0.766, dominated by the 0.857 null) —
honest negative; the silver lining (local-rank is real, hubness dominates) re-confirms §17.11.

### 17.15 — Geodesic interpolation: the curved-manifold midpoint spike is REAL — but geodesics do NOT flatten it (≈0 gain). The break is intrinsic, not a chord artifact

§10 found that straight-line interpolation between two `z` vectors stays on-manifold for the first quarter
then breaks. The night-4 follow-up asks the sharper question: **is the break a chord-vs-manifold artifact
(curing it needs a geodesic) or is it intrinsic to the curved bin?** We interpolate 11 steps between
matched pairs in three strata — **curved** (order/number swaps), **active** (content edits), **near-
paraphrase** — by straight line and by a kNN-tangent geodesic, and measure the **midpoint off-manifold
spike** = ppx(0.5) − ½[ppx(0)+ppx(1)], where ppx = 1 − cos(encode(decode(z)), z)
(`tae_interp_geodesic.json`; eng n=140/stratum, deu n=60; tangent-rank 24, kNN 200).

| stratum (eng) | endpoint cos A·B | spike (straight) | spike (geodesic) | geodesic helps |
|---|---|---|---|---|
| **curved** (order/number) | 0.775 | **0.100** | **0.0985** | **+0.0015** (≈0) |
| active (content) | 0.916 | 0.023 | 0.023 | 0.000 |
| near-paraphrase | 0.970 | 0.005 | 0.005 | 0.000 |

1. **The spike is real and bin-specific — §10 replicated and refined.** The curved stratum spikes **4.4×**
   harder at the midpoint than active (0.100 vs 0.023) and **20×** harder than near-paraphrase (0.005); the
   straight-line round-trip cosine dips to **0.893** at t = 0.5 for curved vs 0.971 for active. So the
   "interpolation breaks past t ≈ ¼" of §10 is specifically a **curved-bin phenomenon** — order/number
   interpolations cross an off-manifold ridge that content interpolations do not. The interaction
   (curved spike ≫ active spike) **holds** in English. The decoded midpoints show why: an order pair's
   t = 0.5 decode is the **word-salad fold** *"Have recipients of research funding and projects are
   recipients of research funding and projects."* — the model has no coherent latent *between* two
   word-orderings, so the chord passes through a region the decoder can only render as a stutter.
2. **Geodesics do NOT help — the headline negative.** The kNN-tangent geodesic flattens the curved spike by
   **+0.0015** (0.100 → 0.0985), a **1.5% relative** reduction — i.e. essentially nothing, and within noise
   (std 0.106). Active and near-paraphrase get **exactly 0.000**. The pre-registered prediction
   ("geodesic flattens the curved spike *exactly*, gain ≫ 0 for curved, ~0 for active") is **REFUTED on the
   magnitude**: the *interaction* (curved spikes more) is true, but the *remedy* (geodesics cure it) is
   false. In German it is *worse* — the geodesic slightly **raises** the curved spike (0.178 → 0.181,
   helps = **−0.0029**), and `interaction_holds = false` for deu.
3. **Why this matters: the break is intrinsic, not a chord artifact.** If the midpoint spike were just the
   chord leaving a curved manifold, a manifold-following geodesic would erase it. It doesn't — so the
   off-manifold ridge between two word-orderings is a **genuine hole in the data manifold**, not a
   shortcut-vs-arc geometry effect. This *agrees* with §16.4 / RESULTS_NIGHT3's "order is genuinely curved /
   irreducible — no rotation or learned chart linearizes it": there is **no smooth on-manifold path between
   two orderings of the same content**, because the intermediate orderings are not valid sentences. Naive
   straight-line interpolation is therefore **as good as it gets** for these pairs — the curvature is in the
   *data*, not in our choice of path.

> *Decoded geodesic example (curved, the spike).* A→B = a near-order-swap of *"The recipients of the SSHRC
> funding have diverse research backgrounds and projects."* The straight-line t-chain stays verbatim to
> t = 0.3, then at **t = 0.5** folds to the stutter *"Have recipients of research funding and projects are
> recipients of research funding and projects."* (spike 0.432, blend-coherence 0.648), then re-coheres into
> the order-swapped end by t = 0.7. The geodesic chain is **indistinguishable** — same fold, same t, same
> spike to three decimals. The hole is in the manifold, not the path.

**Critic (C1–C10).** **C6 (baselines): strong** — straight-line *is* the baseline, three strata give a
clean active/near-paraphrase control, and the result is a *negative against the authors' own hypothesis*.
C4: the geodesic is a kNN-tangent (rank-24) construction — a richer geodesic (more neighbors, higher
tangent rank) might do marginally better, so we say "this geodesic doesn't help," not "no geodesic can";
but the German *negative* gain and the exact-zero active gain make a large hidden win unlikely. C2: gear
corrected — the curvature is intrinsic data-manifold curvature, consistent with §16.4's irreducible-order
finding. **Verdict: AMBIGUOUS** — spike CONFIRMED & bin-specific (replicates §10), geodesic-remedy REFUTED
(≈0, negative in deu). No §16/§17 reversal; it *strengthens* the irreducible-order story.

### 17.16 — Unbinding chains: R_role⁻¹ recovers a k-deep composition, and the recovery-vs-depth curve **matches theory through depth 3, then frays** — the serial-position/role-budget decay, measured

Can you pull token/sentence *k* back out of a *k*-deep composition by applying `R_role^{−1}`, and **how deep
before it fails**? This is the inverse of the binding operator and the direct test of the "role budget" — how
many ordered slots `z` can carry before the rotation frame saturates (`tae_unbind_chain.json`; 1414 chains,
depths 1–5, role-powers spaced by 20, β-calibrated 1.538).

| depth | mean unbind-cos (measured) | predicted (ρ-table) | rel-err | within 20%? | decode-sim (first→last slot) |
|---|---|---|---|---|---|
| 1 | **1.000** | 1.000 | 0.0% | ✓ | 0.988 |
| 2 | **0.768** | 0.720 | 6.7% | ✓ | 0.687 → 0.746 |
| 3 | **0.672** | 0.607 | 10.8% | ✓ | 0.494 → 0.543 |
| 4 | **0.615** | 0.501 | 22.8% | ✗ | 0.418 → 0.466 |
| 5 | **0.575** | 0.428 | 34.4% | ✗ | 0.412 → 0.358 |

1. **Unbinding works, and decays predictably — through depth 3.** Applying `R_role^{−1}` recovers slot *k*
   at cosine 1.00 (depth 1), 0.77 (depth 2), 0.67 (depth 3) — and the **ρ-table prediction tracks the
   measurement to within 20% through depth 3** (rel-err 0% / 6.7% / 10.8%). So the role-rotation frame is a
   *genuine invertible binding*: you can store an ordered chain and pull any slot back out, and a simple
   one-parameter decay law (ρ vs role-separation) predicts the recovery fidelity. The **decode-sim** column
   shows this is not just cosine bookkeeping — at depth 2 the recovered slot *decodes* to a faithful
   paraphrase (sim 0.69–0.75); the *binding* survives even where the surface form softens.
2. **It frays at depth 4–5 — the role budget is ≈3.** Past depth 3 the measured cosine **decays more slowly
   than the model predicts** (measured 0.615/0.575 vs predicted 0.501/0.428; rel-err blows to 22.8%/34.4%,
   `rho_breaks_at_depth5 = true`). The measured curve flattening *above* prediction means the chains retain
   *more* mutual cosine than the clean ρ-law allows — consistent with **content bleed / crosstalk** between
   slots once the frame is over-subscribed (the slots stop being cleanly separable, so everything stays
   spuriously similar). The honest read: **R_role⁻¹ gives clean, theory-matching unbinding up to depth 3;
   at depth ≥ 4 the recovery is degraded and the simple decay law no longer holds** — a role budget of
   roughly **3 ordered slots**.
3. **The decay is edge-flat, NOT mid-slot-worst (a clean negative on serial-position).** The pre-registered
   serial-position hypothesis (interior slots recovered worst, a U-curve) is **REFUTED**: `mid_slot_worst =
   false` at every depth, and `edge_over_interior` is **0.99 / 0.98 / 0.96** (depths 3/4/5) — interior slots
   are recovered *marginally better* than edges, and the curve is essentially flat across position. So the
   budget is a **global capacity limit**, not a within-chain forgetting gradient. The group-closure check
   confirms the binding is a **clean one-parameter group** (cos-fit-vs-sum = 1.0 for all role-power pairs;
   `clean_one_param_group = true`) — so the fraying at depth ≥ 4 is *capacity/crosstalk*, not a broken
   algebra. (Note: this `tae_unbind_chain.json` summary reports clean closure; the per-pair `tae_comp_unbind`
   battery on HF probes the π-reflection corners separately — left for the cross-model cycle.)

> *Decoded recovery-vs-depth.* Depth 1, slot A recovers **verbatim**: *"The Monza restaurant, a staple in
> the local business community, recently announced its par…"* → identical. Depth 2, slot B recovers a
> faithful paraphrase: *"This initiative is a testament to the restaurant's commitment to giving back…"* →
> *"The restaurant initiative is a testament to the restaurant community's commitment to offer…"* By depth 4
> the same slot drifts to a stutter (*"The Monza organization is dedicated to this cause. The Monza
> organization is dedicated to…"*) and depth 5 hallucinates (*"…founded in the United States of America…"*)
> — the role-budget decay made legible.

**Critic (C1–C10).** **C6 (baselines): present** — each slot reports a `baseline_cos` (the no-unbind
similarity); at depth 2 unbind-cos 0.768 vs baseline 0.659 for the interior slot shows `R_role^{−1}` adds
real recovery over doing nothing. **C4: cosine vs decode** — the unbind-cos stays >0.5 even at depth 5 where
the *decode* has clearly hallucinated (decode-sim 0.36), so cosine over-states deep recovery; we lead with
the **decode-sim** and the rel-err blow-up, not the cosine, as the honest failure signal. C3: the
serial-position prediction was pre-registered and cleanly **refuted** (edge-flat, not mid-worst). C2: gear =
invertible one-parameter group with a ~3-slot capacity; fraying is crosstalk, not algebra-break. **Verdict:
CONFIRM (with bounded depth)** — unbinding works and matches theory through depth 3, frays by depth 4–5;
role budget ≈ 3; no §16/§17 reversal (this *operationalizes* the §17.1 R_role binding).

### 17.17 — THE SWING: cross-model transfer — the day-5 *calculus is SONAR-specific structure*, NOT a universal property of pooled sentence encoders. A trivial mean-composer beats the role-rotation in the second encoder; the hub field barely transfers; and the C4 random-map null was never run

This is the headline of day-5 and it decides a **10× scope question**: after aligning SONAR `z` to a
*second* sentence encoder with **one fitted linear map**, does the night-4/day-5 calculus — (a) composition
= a role-rotation-like operator, (b) a hub-field / spiky dark matter, (c) the attribute bins / curved order
— *transfer* into the second encoder? If yes, the structure is a generic property of pooled sentence
encoders (→ 10× scope). If no, the structure is **SONAR's**. Two encoders were tested: `all-mpnet-base-v2`
(`tae_crossmodel_moon.json`) and `sentence-transformers/LaBSE` (`tae_crossmodel_nurse.json`), each fit on
~8.5k aligned sentences, evaluated on ~900–1500 held-out pairs.

| arm | metric | mpnet (moon) | LaBSE (nurse) | reads as |
|---|---|---|---|---|
| **0 — map quality** | ridge held-out cos / R² | **0.673 / 0.389** | **0.829 / 0.638** | a single linear map aligns `z`→encoder only *modestly* (mpnet) to *decently* (LaBSE); Procrustes (orthogonal-only) is much worse (0.50 / 0.61), so the alignment **needs scaling/shear**, it is not a rotation |
| **1 — composition** | map(`z_AB`)→other_AB cos | 0.655 | 0.839 | looks like "it transfers"… |
| | **map(`z_A`)-only vs other_AB** (A-only floor) | **0.626** | **0.713** | …but ≈ the same as just mapping the *first* sentence — most of the cos is **shared content, not composition** |
| | **MEAN composer** (½(A+B) in other space) | **0.910** | **0.916** | the *trivial* baseline **crushes** the transferred operator |
| | ridge composer fit *in* other space | 0.905 | 0.954 | even a native composer barely beats the mean |
| | conjugated `R_role` predictor | 0.888 | 0.856 | the rotation, conjugated through the map, *predicts* — but still **< the mean-composer** |
| | shuffled-pair null | 0.105 | 0.223 | pairing matters (not a pure artifact)… |
| **2 — hub field** | hub-score vs map-residual-norm (Spearman) | **0.317** | **0.265** | weak |
| | hub-score vs encoder's *own* centred-norm (Spearman) | 0.327 | **0.110** | the encoder's *own* spikiness barely tracks SONAR's hubs (esp. LaBSE) |
| | high/low-hub residual-norm ratio | **1.08×** | **1.07×** | dark-matter-heavy SONAR sentences are only ~7% harder to map — **not** a shared hub field |

**Three things this establishes.**

1. **Composition does NOT transfer as a role-rotation — it "transfers" only as trivial averaging, which is
   not the claim.** In *both* second encoders, the mean-composer (½(A+B)) reconstructs the true "A then B"
   embedding at cos **0.91–0.92**, *beating* the conjugated role-rotation (0.86–0.89) and the directly-mapped
   `z_AB` (0.66 / 0.84). Worse, mapping *only* `z_A` already gets 0.63 / 0.71 — i.e. most of the apparent
   "composition transfer" is just shared sentence content surviving the map, not the binding operator
   surviving. This is the cross-model echo of **§17.9** (R_role buys ~nothing in CE over the additive part)
   and the **§17.10** graceful-pool finding: pooled encoders are *additive* about content, and the
   *role-rotation refinement is exactly the part that does NOT survive an external linear map.* So arm (a)
   **fails to transfer as a distinctive operator.**
2. **The hub field is essentially SONAR-specific.** SONAR's dark-matter-heavy sentences are only ~7% harder
   to map (ratio 1.07–1.08×) and correlate with the *other* encoder's own spikiness at Spearman 0.33 (mpnet)
   / **0.11** (LaBSE). A genuinely *encoder-universal* hub field would make those sentences strongly spiky in
   the second space too; they are not. So arm (b) **does not transfer** — the hub/dark-matter geometry is a
   property of SONAR's specific pooling, not of pooled encoders in general.
3. **The bins / curved-order arm (c) was not directly tested here** (no order-readability probe in the second
   encoder); the day-5 active-vs-curved split is shown to be SONAR-internal in §17.13/§17.15 and is *not*
   demonstrated to transfer.

**Critic (C1–C10).** **C4 — STRUCTURE-MANUFACTURE / the missing null (flagged HARD).** The brief's central
worry is that *a fitted linear map can manufacture apparent structure*. The runs include a **shuffled-pair
null** (0.10 / 0.22) and a **conjugated-shuffle null** (0.10 / 0.18), which control for *correspondence* —
good. But **no RANDOM linear/orthogonal map was run** as the null for the *map itself*: there is no
"align `z` to the other encoder with a random map, then measure whether the composition/hub structure still
appears at this level." Because the headline positive ("conj R_role predicts at 0.86–0.89") is *already
beaten by the trivial mean*, the missing random-map null does not rescue a transfer claim — but it means we
cannot quantify *how much* of even the 0.66–0.84 map-`z_AB` cos is the map manufacturing agreement vs real
shared structure. **This null is owed and should be run before any 10×-scope claim is entertained.**
**C5 (generalization): the question itself** — answered NEGATIVE on both arms that were tested, on two
encoders of different families (mpnet = English MPNet, LaBSE = 109-lang BERT), so the negative is not a
one-encoder fluke. **C6 (baselines): excellent on the composition arm** (A-only floor, mean composer, native
ridge composer all present — and they are what *kills* the transfer claim); **absent on the map-null**, see
C4. **C2 (mechanism):** the *content* of a sentence is linearly shared across pooled encoders; the
*role-rotation binding and the hub field are SONAR-specific decorations on top.* **Verdict: the calculus is
SONAR-SPECIFIC, not a universal property of pooled sentence encoders.** Concretely: a single linear map
moves *content* between encoders (cos 0.67–0.83), but the **distinctive day-5 objects — the role-rotation
operator and the hub/dark-matter field — do NOT survive the map** (beaten by a mean-composer; hub ratio
≈1.07×). **This argues AGAINST the 10× scope-up** and for treating §16/§17 as a characterization of *SONAR's*
geometry. No §16/§17 result is overturned — rather, §17.9/§17.10/§17.13 are *confirmed cross-model*: the
additive-content part is universal, the rotation/hub refinements are SONAR's. (Caveat preserved: the
map-null is owed; a strong negative is already established without it because the trivial baseline wins.)

### 17.18 — Landed satellites: SAE-feature universality (mpnet SAE), the curved-store comparator, the eigenalign re-run, and the matryoshka / h65536 SAE metrics

The fill-batch around the swing landed several confirmatory results; none overturns §16/§17.

1. **Cross-encoder SAE feature universality is WEAK (`tae_xencoder_universality.json`).** Train a fresh SAE
   on `all-mpnet-base-v2` (FVU 0.209 — a healthy dictionary) and ask whether SONAR's top-512 features have
   *matching* feature directions in the mpnet SAE. Match-corr **median 0.194, p90 0.605**; only **16.2%** of
   features exceed corr 0.5 and **4.7%** exceed 0.7 (shuffle baseline median 0.013). So a *minority* of
   concept features are encoder-portable, but most SONAR SAE atoms have no clean mpnet counterpart — the
   *dictionary* is largely encoder-specific. This is the SAE-level corroboration of §17.17: shared *content
   axes* exist, but the bulk of the learned structure is SONAR's. **Verdict: WEAK/PARTIAL universality —
   consistent with the SONAR-specific swing verdict.** (C5: only one second-encoder SAE; C4: shuffle null
   present and clean.)
2. **The curved order store is GENUINELY irreducible (`tae_curved_store.json`).** Which frame linearizes the
   non-linear order store — the role frame (A), an unknown learned orthogonal chart (B), or content-
   conditioning (C)? MLP anchor τ = **0.57**, linear-unconditional anchor τ = 0.264. **No** arm reaches
   MLP-level: role-frame τ **0.274** (gap recovered **3.1%**), orthogonal chart τ 0.203 (**negative** — worse
   than the unconditional linear), content-bag τ 0.270 (gap **2%**). **Verdict: order is genuinely curved /
   irreducible — no single linear chart (not R_role, not a learned rotation, not content-conditioning)
   recovers it; only the MLP/decode reads it.** This is the comparator-vs-selection answer for §17.15: the
   midpoint/order break is *intrinsic curvature*, not a chord or chart artifact. **CONFIRM** §17.15.
3. **The eigenalign re-run reproduces §17.13 exactly (`tae_sae_eigenalign.json`, SAE `h32768_k64_btk_day5`,
   20 null draws).** Arm-1 role-eigenalign: obs maxcos **0.151** vs null **0.133** (p99 0.159), axis-z mean
   **2.09**, only **25%** of role axes beat null-p99 → **WEAK** (the SAE atoms are *not* the R_role eigen-
   planes). Arm-2 hubs: one feature at **Pearson 0.68** to od_max, mean active-feat corr 0.024 → a *sparse
   cluster*, not a basis. Arm-3 active/curved: sentiment toward-shift **0.176** vs order **0.007**
   (active−curved **+0.169**, `replicates_active_curved_split = true`). **Verdict: re-CONFIRMS §17.13** with
   identical numbers — the bridge is real but weak; features predict hubs and replicate the bins but are not
   the role directions.
4. **SAE-metrics fill: matryoshka and the big h65536 (HF `_metrics.json`).** Matryoshka BatchTopK
   (h16384, groups 1024/4096/16384): k128 **FVU 0.310 / cos 0.840 / 0% dead** (per-prefix FVU
   0.405 / 0.345 / 0.310 — graceful nesting), k32 FVU 0.506 / cos 0.720. The k128 matryoshka **matches the
   plain k128 SAE's FVU (0.31)** while adding a *nested* prefix structure for free — a usable
   variable-width dictionary at no recon cost. The big **h65536_k128_btk: FVU 0.351 / cos 0.828 / 0% dead** —
   *worse* than the h16384_k128 (FVU 0.31): confirms the long-standing lever finding that **dict size
   saturates >16k and k dominates recon**; 4× the dictionary buys nothing here. **No reversal; tidy
   confirmations.**

> *Note on the "multilingual operator transfer" arm.* The brief asked whether the composition-map / `c_i`
> transfers *cross-lingually* (not just in `z`-space). The dedicated `tae_multilingual_operator.json` has
> now **LANDED** and is scored below (§17.19) — it answers the *operator* question that §17.5's
> *feature-steering* transfer (feature directions steer in 11/11 languages, lang-invariance .68–.77) did
> not. The cross-lingual *operator* question is therefore now **CLOSED**.

### 17.19 — Cross-lingual OPERATOR transfer: the composition MAP is mostly shared, the role BINDER and clock are language-universal — closing the §17.5 gap

`tae_multilingual_operator.json` answers the question §17.5 left open. §17.5 showed the *z-space* is a
shared interlingua across 11 languages (feature directions steer everywhere); it did **not** show that the
*operator* — the composition map `W`, the role rotation `R_role`, the `c_i` generator — fit on English is
the *same* operator that composes fr/de/zh. This experiment fits each operator per-language on 8k parallel
two-span pairs (fra/deu/zho, English reference) and cross-applies the English operator.

1. **The composition map `W` is SHARED in 2/3 languages by the ≥85%-recovery bar.** Cross-applying the
   English-fit `W_eng` to a language's stacked `[zA;zB]` recovers **95.1%** of French self-fit (cos 0.502
   vs self-fit 0.528), **86.1%** of Chinese (cos 0.470 vs 0.546), and **78.7%** of German (cos 0.452 vs
   0.575 → tagged `PER-LANGUAGE` by the ≥85% bar, though it still recovers four-fifths). The flat
   matrix-cosine `W_lang·W_eng` is modest (0.33–0.46) — the *matrix entries* differ — yet the *function*
   transfers: applying English's map to another language's spans lands near that language's own optimum.
   This is the operator-level echo of §17.9/§17.10 — pooled encoders are **additive about content**, and
   the additive composition is the part that ports.
2. **The role binder `R_role` is a language-UNIVERSAL reflective rotation by its spectrum, not by its
   entries.** Per-language `R_role` flat-cosine to English is **low (mean 0.106**; fra 0.107, deu 0.129,
   zho 0.083) — the *matrices* are not entrywise equal. But the *eigen-angle spectrum* is the same binder
   in every language: median reflective angle **1.32 rad (eng), 1.38 (fra), 1.05 (deu), 1.33 (zho)**, with
   `frac_near_pi` 0.085–0.090 (≈0.074 deu) and ~183–190 axes above 2.5 rad in each. The same *kind* of
   near-π reflective binder is refit in a language-specific frame, exactly as §17.1 described it for
   English. Tellingly, cross-applying `R_role_eng` to a foreign language's order-swapped `zBA` recovers
   nearly as well as that language's own `R_role` (deu 0.285 vs 0.426 is the one real gap; fra 0.322 vs
   0.339, zho 0.292 vs 0.373 are close) — the binder is *mild* enough (§17.9: it buys ~nothing in CE) that
   even the English one is a decent stand-in.
3. **The `c_i` (dark-matter) generator could not be re-measured cross-lingually — a tooling NULL, not a
   negative.** All four per-lang `c_i` regressions errored out (`'HubPrepool' object has no attribute
   '_spm'` — the SPM tokenizer handle was missing on the prepool model), so surprisal-R² and out-degree
   q4/q1 are `null` for every language *including English*. The English reference values stand from §17.4
   (‖c_i‖∝surprisal R² .161, out-degree q4/q1 CE-excess 3.75×); the cross-lingual generator question is the
   **one piece still open**, blocked by a fixable bug, not by data.

> *Decoded, operator-level:* the French operator fit `cos_zAB_zBA = 0.590` and German `0.645` — the
> composition is symmetric-ish in every language (order matters but is recoverable), the same near-isometric
> two-reflection picture as English (`cos 0.571`). Applying `W_eng` to a French span-pair `[zA;zB]` lands at
> cos 0.502 to the *true* French composite — within 0.026 of what French's own map achieves.

**Critic (C1–C10).** **C6 (baselines): present** — `W_lang_vs_W_eng` flat-cosine and the `R_role`
cross-apply both have a language's *own* self-fit as the ceiling, and recovery is reported as a *fraction*
of self-fit (the right normalization). **C5 (generality): three languages, one script-distant** (zho_Hans)
— enough to call the trend, not enough to claim all 11. **C4 (confound): the `c_i` arm is a clean tooling
NULL** (errored for English too), correctly *not* spun as a cross-lingual difference. C2 (gear): the
flat-cosine being low while function-recovery is high is the **right** diagnostic — operators that
*disagree entrywise but agree as maps* is precisely "same function, language-specific coordinates."
**VERDICT: the composition OPERATOR is language-UNIVERSAL in form — the additive map ports (79–95% recovery)
and the reflective binder has an identical spectrum in every language — though refit in language-specific
coordinates.** This **closes the §17.5 gap**: not just the *space* but the *operator* is shared. No §16/§17
reversal — it discharges the last "English-only operator" caveat. (Remaining open: the `c_i` generator's
cross-lingual surprisal/hub law, blocked by the `_spm` bug.)

### 17.20 — Concept algebra at SAE-scale: SAE-feature edits ≈ z-algebra (no better), the bin structure holds, and the hub wall does NOT appear in this clean battery

`tae_concept_algebra.json` runs the three edit methods — straight z-algebra, on-manifold geodesic, and
**SAE-feature** algebra (encode→add op's mean feature-delta→decode, day-5 `h32768_k64_btk` SAE) — across
the 10-op ACTIVE/CURVED/INERT/SPAN battery, 30 edits each, and asks whether feature-space editing beats
z-space and whether it hits the §17.11 hub wall.

1. **SAE-feature algebra matches z-algebra where z-algebra works, and is strictly *worse* on the hard
   span-ops — it does not buy a new mechanism.** On the ops that already work it ties: tense 0.733 vs
   straight 0.800, person 0.567 = 0.567, negation 0.967 = 0.967. On the structural span-ops it
   *collapses*: insert-clause **0.033 (sae) vs 0.500 (straight)**, register 0.533 vs 0.767. Crucially
   roundtrip coherence is *lower* for every SAE-feature decode (0.80–0.86 vs straight 0.92–0.99) — the
   k=64 encode→decode roundtrip itself injects loss before any edit. **SAE-feature editing is z-editing
   through a lossy bottleneck; it is the same algebra, degraded.** This is the construct-side confirmation
   of §17.13: features *replicate* the active/curved split but are *not* a better lever than the raw
   geometry.
2. **The ACTIVE/CURVED/INERT/SPAN bins reproduce exactly — and ACTIVE word/entity-swaps still hit 0.0.**
   word-swap and entity-swap (ACTIVE, the lexical-substitution ops) score **0.0 target-hit under every
   method** with *negative* collateral (−0.077, −0.102) — the §17.3/§17.8 lexical-edit wall, a third
   construct-side reconfirmation. tense/person/negation/register all work (0.57–0.97); delete-clause and
   insert-clause are the partial-to-broken span ops. The collateral-by-bin signature is intact
   (ACTIVE −0.089 most-damaging, INERT ~0, CURVED ~0).
3. **The hub wall does NOT show up in this battery — `spearman(success, out-degree) = −0.0115` over 300
   edits (essentially zero).** Unlike §17.11/§17.14 (where peripheral edits were feasible and hub edits
   blocked), here success is governed by *op-type* (the bin), not by the swapped token's attention
   out-degree. The honest read: in a *clean parallel-edit* battery the op-structure axis dominates and
   out-degree is washed out — the hub wall is a property of the *lexical-substitution* regime (ACTIVE
   word-swap, where it is strongest), not a universal predictor across *all* edit types. This **does not
   overturn §17.11** (which scoped the hub-field to analogy=editing on lexical content) but it **bounds**
   it: the §17.11 "one mechanism" is the lexical-content mechanism, and structural ops (negation, tense)
   live on a different, hub-insensitive axis.

> *Decoded example (ACTIVE word-swap, the wall):* base *"…this charnel **house** of rock 'n' roll…"* →
> gold swaps house→**cottage**. Straight: *"…this rock'n'roll's **crematorium**…"*; SAE-feature: *"…this
> rock'n'roll's **haunted place**…"* — both decode a *plausible synonym of the surrounding gestalt*, neither
> lands "cottage" (target-hit 0.0). *Decoded example (CURVED tense, works):* *"the Boehner debacle **is**…"*
> → all three methods correctly produce *"…**was** just the tip of the iceberg"* (0.80 hit).

**Critic (C1–C10).** **C6: strong** — three methods share the identical 30-edit set per op, so the
straight-vs-sae comparison is apples-to-apples; geodesic = straight here (`geodesic_minus_straight = 0.0`
everywhere, re-confirming §17.15 that geodesics add nothing). **C4 (confound): the lower SAE coherence is a
roundtrip artifact** correctly separated from the edit (coherence reported alongside hit-rate). **C1
(power): n=30/op, 300 total** — adequate for bin-level calls, thin for the per-op hub correlation (which
came out ≈0). **VERDICT: SAE-feature concept algebra WORKS exactly where z-algebra works and no better
(it's z-algebra through a lossy code); the bin structure and the ACTIVE-lexical wall reproduce; the hub
wall does NOT appear in this clean battery — bounding §17.11's hub field to the lexical-substitution regime
rather than overturning it.**

### 17.21 — SAE universality: features are seed-canonical but NOT scale-canonical, and `R_role` is NOT sparse in any SAE basis — re-confirming §13/§17.13

`tae_sae_universality.json` confirms the canonicity picture across dict sizes and seeds and tests whether
the composition map is sparse in a *named-feature* basis (the interp dream: "composition = a sparse rewiring
of concepts").

1. **Same-seed-family is near-perfectly canonical; different-training is not; cross-size barely survives.**
   Two text-loss SAEs differing only in seed: mean-max-cos **0.990, 100% survive cos>0.9** — features are
   *reproducible* under re-seeding (the strong canonicity claim). But a *different-training* pair
   (`h16384_k128` vs `_btk`) collapses to **0.204** (4.1% survive >0.5, null 0.124) — training recipe,
   not just seed, sets the dictionary. **Cross-size is mostly feature-splitting:** 16384→32768 mean-max-cos
   **0.240** (7.2% survive >0.5), 32768→65536 **0.216** (4.9%) — vs null 0.128–0.133. So ~5–7% of features
   have a clean larger-dict counterpart; the rest **split**. `universality_call = "scaling mostly SPLITS
   features (survival ~ null)"`. This **exactly reproduces the early/§13 numbers** (seed cos .99,
   diff-training .20, cross-size ~5%) — a clean confirm.
2. **`R_role` is NOT sparser in any SAE basis than in raw coordinates — the interp dream fails here.**
   Gini of the composition map: **raw 1024-d 0.636**, random-orthonormal 0.503, and *every* SAE basis
   *lower* (h16384 **0.448**, h32768 0.440). `l0_at_0.99_energy` is **6.2% of dims in raw** but **70–71% in
   the SAE bases** — the map is *denser*, not sparser, when written over named features. `verdict.sae_basis_
   sparser_than_raw = false`. This **directly re-confirms §17.13/§17.2**: the composition map is 20× sparse
   in its *own Schur eigenbasis* (§17.2), but the SAE dictionary is **not** that eigenbasis — features carry
   *content*, not the *role-rotation skeleton*. The two bases are orthogonal kinds of structure.

> *Concrete:* the seed-A vs seed-B SAEs agree on **16384/16384** features at cos>0.9 (the dictionary is a
> deterministic function of the data+recipe), yet going 16384→32768 only **2.67%** of features survive at
> cos>0.9 — the larger dictionary mostly *re-partitions* the same content into finer atoms rather than
> re-discovering the same atoms.

**Critic (C1–C10).** **C6: clean nulls throughout** — every survival fraction has a matched
shuffled-feature null (uniformly 0.0 at cos>0.5), and the sparsity test includes both raw and
random-orthonormal baselines (the random basis is the right null for "is the SAE *specially* sparse"). **C5:
three dict sizes × two seed pairs** — solid coverage. C2 (gear): seed-canonical ≠ scale-canonical is the
correct, non-trivial distinction (re-seeding is reproducible; re-scaling is feature-splitting). **VERDICT:
CONFIRM the §13 canonicity picture (seed cos .99, diff-train .20, cross-size ~5% survival = mostly
feature-splitting) AND re-confirm §17.13 — `R_role` is NOT sparse in the SAE basis (Gini 0.44 < raw 0.64);
the SAE dictionary is a content basis, not the role-rotation eigenbasis.** No §16/§17 reversal.

### 17.22 — Eigenrole semantics: `R_role`'s π-eigenplanes do NOT cleanly encode first-vs-second span — the binder is a generic content-mixing rotation, not a labeled role axis

`tae_eigenrole.json` asks the sharpest possible version of "is `R_role` a *role* operator": project
2500 composition-pair token states onto `R_role`'s **π-eigen-subspace** (the 32 most-reflective planes,
mean angle **0.744 rad**) versus its **near-identity subspace** (32 planes, mean angle **0.0004 rad** — i.e.
≈no rotation), and ask whether first-span vs second-span membership separates *specifically* in the
reflective subspace.

1. **The reflective subspace does NOT carry clean role semantics — it barely beats the near-identity
   subspace, and a RANDOM subspace beats both.** First-vs-second span classification: full space **acc
   0.982 / auc 0.998** (role is trivially decodable from the raw state); π-eigen-subspace **0.730 / 0.788**;
   near-identity subspace **0.704 / 0.754**; **random subspace 0.851 / 0.902**. The π-minus-identity gap is
   **+0.026 acc** — essentially nothing. If `R_role`'s reflective planes were the *role-binding* axis,
   span-membership should separate *there* and *not* in the identity directions; instead it separates
   *weakly everywhere* and *best in a random projection*. `verdict: "R_role is a generic content-mixing
   rotation (no clean role semantics)"`.
2. **The commutator control is inconclusive but leans the same way.** Undoing `R_pos` before projecting
   drops π-separation by 0.021; undoing `R_role` itself drops it by **0.0002** (nothing) — i.e. the small
   role signal that *is* in the π-subspace is not actually *produced by* `R_role`'s rotation. This is the
   per-eigenvector sharpening of §17.13's "features are not the role directions" and §17.1's "`R_role` is a
   *mild* reflective rotation": the eigenplanes are real geometric objects, but they are **not semantically
   labeled** — they implement a content-mixing reflection whose *value is the binding/ordering operation
   itself*, not a readable "this-is-span-1" axis. Consistent with §17.9 (R_role buys ~0 in CE — its job is
   binding, not recall) and the §17.18 curved-store result (order is genuinely irreducible to any single
   linear chart).

> *Decoded pairs feeding the test* (the role = which span came first): A=*"The fate of the ACA repeal bill
> hangs in the balance…"* / B=*"The American people deserve a healthcare system that prioritizes
> prevention…"* — the classifier must recover that A precedes B. It does so at **0.98 from the full state**
> but only **0.73 from the π-eigenplanes** and **0.85 from a random projection** — so "first-vs-second" is
> written all over the state, just **not specifically in `R_role`'s reflective eigenbasis.**

**Critic (C1–C10).** **C6: the random-subspace null is the decisive baseline** — and it *wins* (0.902 >
π 0.788), which is the cleanest possible refutation of "the π-planes are special." **C1 (power): 131k span
tokens, 26k test** — ample. **C2 (gear): the right question** — not "can you decode role" (yes, trivially,
from anywhere) but "does it concentrate in the reflective eigenbasis" (no). C7 (alt-explanation): the
commutator control (undo-`R_role` drops separation by ~0) correctly rules out "R_role *creates* the small
π-signal." **VERDICT: REFUTE the strong reading — `R_role`'s π-eigenplanes do NOT cleanly encode span/role;
the binder is a generic content-mixing reflection whose value is the *binding operation*, not a labeled role
axis.** This *sharpens* §17.1/§17.9/§17.13 (the eigenbasis is the sparse skeleton of the *map*, not a set of
*readable* directions) — no reversal.

### 17.23 — Cross-lingual dark-matter law: the `c_i` GENERATOR holds in fr/zh and survives — but is OUT-DEGREE-driven everywhere, surprisal only marginally, with German the soft failure

`tae_ci_multilingual.json` (after fixing a `_spm` tokenizer-accessor bug that had errored the c_i arm of
§17.19) refits the §17.4 dark-matter generator **per language** — does `‖c_i‖` (the residual content field,
`c_i = state_i − q_i`) still concentrate on **high attention-out-degree hub pieces** and regress on
**token surprisal**, in the *same direction* as English, when fit in-language on fra/deu/zho (1200 sentences,
~38–49k tokens each)?

| arm | eng (ref) | fra | deu | zho | reads as |
|---|---|---|---|---|---|
| **out-degree → ‖c_i‖** R² | 0.211 | 0.161 | 0.177 | 0.186 | the hub law is the **robust** one — R²≈0.16–0.19 in *every* language |
| **out-deg q4/q1 ‖c_i‖** | 2.44× | **1.87×** | **1.97×** | **1.61×** | top-out-degree quartile carries 1.6–2.0× the dark-matter mass everywhere (same sign as eng) |
| **surprisal → ‖c_i‖** R² | 0.025 | **0.062** | 0.030 | **0.077** | weak but present; *higher* than eng in fr/zh, eng-level in de |
| **surprisal q4/q1 ‖c_i‖** | 1.21× | 1.35× | 1.24× | 1.27× | high-surprisal pieces carry ~1.2–1.35× the mass — consistent direction |
| `c_norm_mean` | 0.505 | 0.824 | 0.858 | 0.905 | non-eng pieces carry **more** absolute dark matter (tokenization granularity) |

**The law (out-degree concentration AND surprisal-positive, same direction as eng) HOLDS in 2/3 by the
pre-registered bar** (R²_surprisal ≥ .05 AND out-deg q4/q1 ≥ 1.3): **fra (true), zho (true), deu (FALSE —
surprisal R² 0.030 just misses .05).** But the *headline reading is stronger than 2/3*: the **out-degree
hub mechanism is universal** (R² 0.16–0.19, q4/q1 1.6–2.0× in all three languages, no exceptions), and
**surprisal is the marginal, noisier predictor everywhere** — including in English itself (R² 0.025), so
German's "failure" is the surprisal arm being weak, not the hub field being absent. The decoded hub pieces
are mechanistically identical across languages: **sentence-boundary `</s>` tokens** (fra c_norm 4.16 @
out-deg 6.97; deu 3.82 @ 6.78; zho 4.03 @ 8.44) and **rare/named pieces** (`▁Adele` fra c_norm 3.73 @
out-deg **25.7**; `▁Maryland` deu 3.29; `水`/`海洋` zho) sit at the top of the dark-matter field in every
language — the same `</s>`-as-pooling-hub + rare-content-spike signature §17.4 found in English.

**Critic (C1–C10).** **C6 (baselines): the cross-lingual comparison IS the baseline** — English is the
reference and the per-language numbers are read *relative* to it; out-degree beats surprisal in every
language including eng, so the universal claim is internally controlled. **C2 (mechanism):** the generator is
the *attention-pooling hub* (out-degree), not the *information-theoretic* surprisal story — surprisal is a
weak correlate of hub-ness, and the experiment cleanly separates them (out-deg R² ~7× surprisal R² in eng,
~2.5–3× in fr/de/zh). **C5 (generalization):** three typologically distinct languages (Romance, Germanic,
Sino — including a non-Latin script) all show the hub concentration → the law is **not an English or a
Latin-script artifact.** **C8 (a bug was fixed, not papered over):** the `_spm` accessor bug is documented in
the artifact and the corrected accessor verified. **VERDICT: CONFIRM §17.4 cross-lingually — the dark-matter
`c_i` generator is a LANGUAGE-UNIVERSAL attention-out-degree hub field** (R² 0.16–0.19, q4/q1 1.6–2.0× in
fr/de/zh), **with surprisal a universally-marginal secondary predictor** (holds fr+zh, German just misses the
.05 bar). No §16/§17 reversal — this *extends* §17.4/§17.19 to show the generator, not just the operator,
ports across languages.

### 17.24 — The cross-model RANDOM-MAP null: the swing survives it — the hub-corr 0.66 is a NORM artifact (ratio 1.08×), composition genuinely does NOT transfer

`tae_crossmodel_randnull.json` runs the **C4 null the §17.17 swing was owed**: align SONAR `z` to
`all-mpnet-base-v2` with a **scale-matched RANDOM map** (random linear block, per-output-column std + bias
matched to the fitted ridge) and re-measure every transfer arm. If apparent structure survives a *random*
map, the §17.17 transfers were map artifacts; if it collapses to chance, the fitted-map numbers were real.

| arm | fitted map (§17.17) | RANDOM-NULL map | reads as |
|---|---|---|---|
| **map-fit held-out cos** | **0.673** | **0.068** | the fitted map's content alignment is REAL — a random map gets chance (~0); the 0.67 was not a scale artifact |
| **composition: map(`z_AB`)→other_AB** | 0.655 | **0.053** ≈ shuffle null 0.049 | with a random map, composition transfer is **exactly chance** |
| **conjugated `R_role` predictor** | 0.889 | **0.890** | the role rotation "predicts" identically under a random map — i.e. it is conjugation-through-content, NOT a transferred binding operator |
| **mean-composer ½(A+B)** | 0.910 | 0.910 | trivial averaging is map-invariant (it lives in the *other* space) — the baseline that already killed the transfer claim |
| **hub: hub-score vs map-residual-norm** (Pearson) | 0.317 | **0.681** | the hub-corr APPEARS strong under a random map too… |
| **high/low-hub residual-norm RATIO** | — | **1.08×** | …but the effect size is ~8% — the "correlation" is a **norm artifact**, not a shared hub field |

**Three confirmations, one correction.** (1) **The map itself is real:** random-map cos 0.068 vs fitted 0.673
— content alignment between SONAR and mpnet is genuine, not manufactured. (2) **Composition genuinely does not
transfer:** map(`z_AB`) drops to 0.053 ≈ the shuffled null (0.049) under a random map, and the *fitted* number
(0.66) was already beaten by the mean-composer (0.91) — so arm (a) fails *and* the failure is not a null
artifact. (3) **The conjugated-`R_role` "prediction" is exposed as content-conjugation:** it scores **0.890
under a random map** — *identical* to the fitted 0.889 — proving the rotation's apparent transfer is just the
content surviving conjugation, not a binding operator that ports. (4) **CORRECTION to a §17.17 sub-reading:**
the hub-field correlation looks *higher* under the random null (Pearson 0.681 vs fitted 0.317), but the
**load-bearing effect size — the high-hub/low-hub residual-norm ratio — is only 1.08×**, i.e. dark-matter-heavy
SONAR sentences are ~8% harder to map *regardless of map*, which is a **norm artifact** (spiky vectors are
harder to fit by any map), not an encoder-universal hub field. This *strengthens* §17.17's verdict: the hub
field is SONAR-specific, and the modest correlation §17.17 reported is partly this norm effect.

**Critic (C1–C10).** **C4 (structure-manufacture): this experiment IS the C4 null §17.17 flagged as owed —
now run, and it clears the calculus.** The fitted map's content transfer survives (0.67 vs random 0.07); the
*distinctive* objects (composition operator, role rotation, hub field) do **not** survive — they score at
chance, identically-under-random, or at 1.08× effect size. **C6 (baselines):** shuffle null, mean-composer,
random-map all present and decisive. **C2 (mechanism):** a fitted linear map moves *content* between encoders;
the binding rotation and hub geometry are SONAR's, and the one number that looked like shared hub structure is
a norm artifact. **VERDICT: CONFIRMS the §17.17 SONAR-SPECIFIC verdict and DISCHARGES its one owed null.** The
day-5 calculus is SONAR's geometry, not a universal property of pooled sentence encoders — and the random-map
null, far from rescuing a transfer claim, **closes the last escape hatch** (the hub-corr 0.66 → norm artifact,
ratio 1.08×). No §16/§17 reversal; §17.17 is now fully nailed down.

### 17.25 — Deployable editing via an amortized net: a 220× FLOP win that lands edits — but at 0.23 < 0.30, still NOT deployable; the wall is the same non-local missing field

`tae_edit_amortize.json` chases the §17.3 prize: §17.3 rescued editing only by **partially re-encoding** the
sentence (0.325 success at ~43% of full-encode FLOPs) — too expensive to deploy. Can a **single amortized-net
forward** (a residual MLP on cached prepool states + neighbor context + swap-spec, trained on 4.6k edit-type
pairs, held out on reorder/delete-clause OOD) supply the inserted-word mass **without any re-encode**, and
reach the **usable ≥0.30** bar at **<10% of re-encode FLOPs**?

| method | exact-edit @ collateral≤0.1 | mean cos→oracle | FLOPs | frac of partial-reencode FLOPs | usable ≥0.3? |
|---|---|---|---|---|---|
| proxy-E (no net) | 0.000 | 0.094 | ~0 | — | no |
| **AMORTIZED net (1 fwd, no re-encode)** | **0.230** | 0.377 | **5.87e7** | **0.00445** (≈**220×** cheaper) | **NO** |
| partial re-encode (win_k=4, §17.3) | 0.325 | 0.902 | 1.32e10 | 1.0 | yes |
| true-E ORACLE (not deployable) | 0.873 | 1.000 | 3.06e10 | — | yes |

**The FLOP win is real and large; the success number falls just short.** The amortized net forward is
**5.87e7 FLOPs vs 1.32e10 for partial re-encode — a 220× reduction (0.445% of the cost), and 0.19% of a full
re-encode** — comfortably under the <10%-FLOPs deployability gate. And it genuinely *lands edits*: **0.23
exact-edit-at-collateral** (vs proxy 0.0), closing **26% of the proxy→oracle gap**, with OOD-holdout cos→oracle
**0.542 vs proxy 0.268** (it generalizes to *unseen edit types* — reorder, delete-clause). But **0.23 < the
0.30 usable bar**: it does **not** reach deployable quality. By type the net is strongest on word-swap (0.27)
and entity-swap (0.28), weakest on number-change (0.14) — the same curved/number fragility seen elsewhere.

**Decoded example** (man→woman, a proxy *failure* the net flips to a *landing*): base *"A local **man** was
arrested and charged with heroin distribution…"* → proxy decodes unchanged (edit fails); **amortized net
decodes "A local hero **woman** was arrested and charged with selling heroin…"** (edit LANDS, collateral 0.037
— note the spurious "hero"/"selling" drift); oracle: *"A local **woman** was arrested and charged with
distributing heroin…"* (clean). The net inserts the right content mass but with mild collateral paraphrase —
exactly the 0.23-vs-0.87 quality gap.

**Critic (C1–C10).** **C6 (baselines): all four anchors present and matched to the §17.3 reference targets**
(proxy 0.037, partial 0.325, oracle 0.897) — the comparison is honest and the net is scored on the *same
frozen E1 scorer*. **C5 (generalization): the OOD holdout (reorder + delete-clause, edit types NEVER in
training) is the right test** — and the net beats proxy there (0.542 vs 0.268), so the 0.23 is not
train-type overfit. **C2 (mechanism): the gap is diagnostic, not fixable by more compute** — the net closes
26% of the gap because the inserted word's mass is **non-local** (it depends on in-context information the
cached prepool states don't carry), the *same information-deficit wall* §17.3 identified. A bigger net cannot
manufacture the missing in-context embedding. **VERDICT: NOT DEPLOYABLE (0.23 < 0.30) — but the headline is
the FLOP/quality frontier, not a binary.** Amortized editing buys a **220× compute reduction** and **flips
proxy-failures into landed edits at 0.23 success / 0.54 OOD-cos**, i.e. it is *cheap and real but
sub-usable*. This does NOT overturn §17.3's "partial-reencode is the deployable point"; it **sharpens** it:
the deployable quality (0.325) currently *requires* the re-encode, and the residual 0.10 gap between amortized
(0.23) and partial-reencode (0.325) **is precisely the non-local missing-field mass** — confirming §17.3's
information-deficit diagnosis as the binding wall, now with a 220×-cheaper net that gets *most* of the way.

### 17.26 — THE SAFETY LOOP CLOSES: auditing the PARASCOPE-PREDICTED `z` recovers the next sentence's attributes PRE-emission, above chance on all 6 (mean chain-AUC 0.73)

`tae_parascope_audit.json` wires the two halves of the program into one pipeline and asks the safety question
directly: read an LLM's hidden state at layer 27 → the maghead parascope predicts the **next sentence's** SONAR
`z` (cos 0.438 to true `z`) → run the §17.12 calibrated `z`-auditor on that **predicted** `z`. **Can we flag
the next sentence's attributes one step BEFORE it is emitted?** Chain-AUC is the predicted-z→auditor projection
scored against the **true-z auditor verdict** (the ceiling); chance = 0.5.

| attribute | bin | axis calib AUC | **chain-AUC (predicted z)** | ceiling (true z) | pre-emission agreement | above chance? |
|---|---|---|---|---|---|---|
| sentiment | active | 1.00 | **0.818** | 1.00 | 0.865 | YES |
| tense | active | 1.00 | **0.774** | 1.00 | 0.908 | YES |
| number | active | 0.83 | **0.696** | 1.00 | 0.758 | YES |
| named_entity | active | 0.94 | **0.651** | 1.00 | 0.655 | YES |
| register | inert | 0.88 | **0.744** | 1.00 | 0.728 | YES |
| voice_wording | inert | 0.89 | **0.700** | 1.00 | 0.658 | YES |

**The loop closes: 6/6 attributes recovered above chance, mean chain-AUC 0.730, off a parascope whose `z`
prediction is only cos 0.438.** Sentiment leads (0.818) — you can flag the *valence* of the not-yet-written
sentence from the LLM's current hidden state with AUC >0.8 and 86.5% agreement with the true-z verdict. Even
the hardest active attribute (named_entity, 0.651) beats chance. The **active > inert ordering predicted by
theory-r6 holds weakly** (active mean chain-AUC 0.735 / pre-emission agreement 0.796 vs inert 0.722 / 0.693) —
active attributes are flagged *slightly* better and agree *more* with the true verdict, consistent with the
active-vs-curved split being the steerable/concentrated one.

**Decoded example** (sentiment flagged pre-emission): true next paragraph begins *"…there is still ongoing
debate about their potential long-term health effects…"*; the parascope's decoded prediction (cos 0.533) is
*"Damage to airborne contamination: The risk of allergic reactions…"* — **lexically wrong but
attribute-right**: the predicted `z`'s auditor score (0.094) crosses the calibrated sentiment threshold
(0.032), and so does the true `z` (0.035) — so the auditor flags "negative-sentiment content incoming" from
the LLM hidden state **before the sentence exists**, even though the parascope cannot reproduce the exact
words. This is the program's *gist-not-verbatim* property (§17.6) turned into a safety primitive.

**Critic (C1–C10).** **C4 (leakage — flagged HARD in the brief): the chain is honest.** Chain-AUC scores the
*predicted* `z` against the *true-z auditor verdict*, and the predicted `z` comes only from the LLM residual
via a probe that **never sees the next sentence** — there is no path for the true next sentence to leak into
the prediction. The auditor axes are calibrated on a held-out set (axis calib AUC column), and the ceiling
(auditor-on-true-z = 1.0) is reported separately so the chain number is read as *fraction of the achievable
recovered through a lossy probe*, not as an absolute. **C6 (baselines): chance 0.5 and the true-z ceiling
bracket every number** — all six sit strictly between. **C2 (mechanism): the parascope carries enough
attribute signal (not verbatim content) to drive the auditor** — exactly the gist-bridge of §17.6, now
load-bearing for safety. **C5: 400 audited next-paragraphs, 6 attributes, real LLaMA-3B shard-40 residuals.**
**VERDICT: the SAFETY LOOP is DEMONSTRATED — the interpretability parascope feeds the safety auditor, and
auditing the PREDICTED `z` recovers the next sentence's attributes ABOVE CHANCE on all six (mean chain-AUC
0.73, sentiment 0.82), flagging them PRE-emission.** This is the first end-to-end demonstration that the
read-the-hidden-state half (§3/parascope) and the audit-the-`z` half (§17.12) **compose into a forward-looking
monitor**. No §16/§17 reversal — it *realizes* the §16.4 safety payload as a working pipeline.

### 17.27 — Auditor construct-validity under attack: the SINGLE linear auditor is fooled on everything, but the ENSEMBLE makes ACTIVE attributes robust (defended 0.10) while CURVED stay FRAGILE (defended 0.51)

`tae_auditor_adversarial.json` red-teams the §17.12 deliverable: a z-space PGD attack searches the ε-ball
around `z` to flip the auditor to ABSENT **while the gold marker is still in the decode and the decoded
sentence stays SBERT≥0.9 to the original** (i.e. hide the attribute without removing it). The defense is an
**ensemble of K=11 logistic probes on independent random projections**; the attack is fit on the *single base
probe only* (a genuine held-out transfer test of the panel).

| attribute | bin | clean AUC (single/ens) | attack-success (single probe) | **defended success (ensemble)** | reads as |
|---|---|---|---|---|---|
| sentiment | active | 1.00 / 1.00 | **1.00** | **0.00** | single probe fully fooled; ensemble **fully defends** |
| named_entity | active | 0.85 / 0.88 | **1.00** | **0.20** | single fooled; ensemble mostly holds |
| tense | curved | 1.00 / 1.00 | 0.263 | 0.105 | attack already hard (marker can't be kept past ε); ensemble helps |
| number | curved | 0.63 / 0.62 | **0.909** | **0.909** | **ensemble does NOT defend — fully fragile** |
| **mean active** | | | 1.00 | **0.10** | **active = DEFENDABLE** |
| **mean curved** | | | 0.586 | **0.507** | **curved = NOT defendable** |

**The construct-validity verdict is two-sided and must be read at the right layer.** (1) **The SINGLE linear
auditor (the §17.12 ship) can be fooled on essentially everything** — z-space PGD flips sentiment, entity, and
number to ABSENT at attack-success 0.91–1.0 while the marker word survives in the decode and SBERT stays ≈1.0.
So a *deployed single-probe auditor is adversarially defeatable.* (2) **The ENSEMBLE defense rescues the ACTIVE
attributes but NOT the curved ones:** ensemble-defended success drops to **0.10 mean for active** (sentiment
0.00 — fully defended; entity 0.20) but stays **0.507 for curved** (number 0.909 — *completely undefended*;
tense 0.105). The geometry explains it: **active attributes are a concentrated steerable direction** — an
attack that moves off it is caught by *some* random-projection probe in the panel — whereas **curved
attributes (number especially) are a diffuse, suppressible signal** the decode prior re-expresses, so the
attack hides them from *every* projection at once. This is the adversarial face of the active-vs-curved split
(§17.13/§17.15): the same curvature that makes order/number irreducible to a linear chart makes them
**un-defendable** by a linear auditor panel.

**Decoded example** (number, curved, ensemble ALSO fooled): original *"…in 2011, there were approximately 98
million mothers… about 72% of the total population."* — clean single-probe logit +9.55 (PRESENT). After the
ε=0.2 z-attack the **decode is byte-identical** ("2011" still present, SBERT 1.0) but the single-probe logit
is **−102.7** *and* the ensemble logit is **−3.83 < threshold 0.005 → ABSENT (also fooled)**. Contrast
sentiment: *"The food they served at the party was awful."* attacked to single-logit −381.9 (flipped) but
**ensemble logit +6.16 > −0.039 → PRESENT (defended).**

**Critic (C1–C10).** **C6 (baselines): the held-out transfer design is the right baseline** — attacking only
the base probe and testing the *panel* avoids the trivial "attack the thing you defend with" circularity, so
the defended-success numbers are an honest robustness estimate. **C2 (mechanism): the active/curved asymmetry
is mechanistic, not incidental** — concentrated directions are catchable by random projections, diffuse curved
signals are not; this is *predicted* by the §17.13 split and now *measured* adversarially. **C7
(alt-explanation): the SBERT≥0.9 + marker-kept constraint rules out "the attack just deleted the attribute"**
— the marker is provably still in the sentence, so a flip is genuine hiding, not removal. **C5: only K=11
random-projection probes** — a stronger panel (more probes, non-linear, or curvature-aware readers) might
narrow the curved gap, untested. **VERDICT: the auditor's construct-validity is CONDITIONAL — ship it as an
ENSEMBLE, scope its guarantee to ACTIVE attributes.** For deployment: (a) **never ship the single linear probe**
(defeatable on all attributes); (b) the **ensemble makes active attributes — sentiment, named-entity — robust**
(defended success 0.10, sentiment 0.00); (c) **curved attributes (tense, number) remain adversarially
fragile** (defended 0.51, number 0.91) — an adversary can hide a number/date from the auditor while the
sentence still states it, so **the auditor must NOT be trusted to certify the absence of curved attributes.**
This refines §17.12's contract: "readable ≠ steerable" now gains "**and readable-active is defendable while
readable-curved is not**." No §16/§17 reversal — it stress-tests and *scopes* the §17.12 safety deliverable.

### 17.28 — The dark matter literally SAYS the gist: strip the role-skeleton from z and the content residual c_i decodes to the sentence's meaning while the skeleton decodes to word-salad

`tae_ci_decode.json` is the most direct test yet of §16's claim that **meaning lives in the dark matter
`c_i`, not in the bound lexical skeleton**. The recipe: take a sentence's `z`, compute the role-skeleton
`z_R = Σ_i R_pos^{p_i} R_role E(w_i)` (the §17 binding term, all lexis + order, no context), form the content
residual `z_content = z − z_R` (the pooled `c_i` dark-matter field, **69.4% of `z`'s energy**), and decode
**each piece on its own** through the SONAR decoder, scoring the decode against the *original* sentence by
SBERT. n=400 sentences, day-5 SAE (`h32768_k64_btk_day5`).

| decode of… | SBERT to original | BLEU | reads as |
|---|---|---|---|
| **skeleton `z_R`** (lexis+order, no context) | **0.514** | **4.86** | right words, scrambled — word-salad |
| **content `z_content`** (the dark matter `c_i`) | **0.851** | **51.1** | **says the gist / meaning** |
| full `z_true` (ceiling) | 0.991 | 79.1 | the original |
| random control C4 (`z_R + ‖z_content‖·unit-noise`) | 0.400 | — | norm-matched floor |

**The dark matter wins by a mile and it is not a norm artifact.** Content beats skeleton by **+0.337 SBERT
(paired t=21.5, n=400)** and beats the norm-matched random control by **+0.451 (paired t=27.5)** — so it is
*the content direction*, not merely *having extra norm*, that carries the meaning. The derived prediction also
holds: the content-minus-skeleton gap **rises monotonically with ‖c_i‖** across terciles (0.239 → 0.359 →
0.412; slope 1.82, the more dark matter a sentence has, the more of its meaning the residual alone restates).
On the SAE side the content residual recovers **41.4%** of the full sentence's active features vs the
skeleton's **19.9%** (Jaccard content-vs-full 0.308 vs skeleton-vs-full 0.152), and content vs skeleton share
almost nothing (Jaccard 0.058) — **two near-disjoint feature populations: a lexical-skeleton dictionary and a
meaning dictionary.**

**The vivid examples make the split concrete.** On a *long, content-rich* sentence the residual practically
reconstructs it: original *"The Electric Recording Company's reissue is a testament to the enduring power of
music… the beauty and majesty of Elgar's Cello Concerto can still transport us…"* → **content-decode**:
*"The reissue by the Electric Recording Company is a testament to the enduring power of music, and reminds us
that even after decades, the beauty and majesty of Elgar's cello concerto can transport us to a world of
wonder."* (SBERT **0.974**) — while the **skeleton-decode collapses to a stuck loop**: *"The power of
transformation. The power of transformation. The power of transformation…"* (SBERT 0.203). Same story for the
rugby sentence (content 0.975, skeleton degenerates into *"…in the semi-final against the Netherlands in the
semi-final against…"* 0.259). The **one regime where the split inverts is short, low-content stock phrases**:
*"Thank you for understanding!"* (‖c_i‖ only 0.34) has almost no dark matter, so its skeleton-decode is near-
perfect (*"Understanding for understanding!"*, SBERT 0.785) while the residual, given a near-empty content
field, hallucinates (*"…our dear friend, Maria Elizabeth Defoe's prayers…"*, 0.227). That is **exactly the
predicted failure mode and confirms the mechanism**: where there is meaning beyond the words, it lives in
`c_i`; where there isn't, the residual is noise. This is §16.1's "dark matter = in-context meaning" stated in
its strongest form to date — **you can literally read the gist off the residual.**

**Critic (C1–C10).** **C4 (controls): the norm-matched random control is the right one** and content beats it
by 0.45 paired-t=27.5 — the result is the *direction*, not the energy, killing the "you just added norm back"
objection. **C6 (baselines): skeleton, random, and full-ceiling are all reported**, so the 0.851 sits inside a
calibrated [0.40 random, 0.51 skeleton, 0.99 ceiling] frame. **C2 (mechanism): the ‖c_i‖-tercile monotonicity
and the short-phrase inversion are a mechanistic signature**, not a fit — the residual carries meaning *in
proportion to how much dark matter the sentence has*. **C7 (alt-explanation): "the decoder prior just fills
in plausible content"** is partly addressed (random control is below skeleton) but the residual *does* go
through the same generative prior, so some of the 0.851 is prior-assisted reconstruction rather than literal
read-out — consistent with §16's finding that only ~0.199 of `c_i` is prior-recoverable, i.e. most of the
0.851 must be genuine residual signal, but the exact split is not isolated here. **C5: pooled per-sentence
object, k=64 SAE** — a per-token or different-k decode is untested. **VERDICT: SUPPORTED, and the cleanest
demonstration in the corpus that the dark matter is semantic content** — meaning is in `c_i`, lexis+order is
in the skeleton, they are near-disjoint in SAE coordinates, and the residual decodes to the sentence's gist
(SBERT 0.851, +0.34 over skeleton, +0.45 over norm-matched random). **Reinforces §16.1 and §17.4; no reversal.**

### 17.29 — The hub-field does NOT lift to document scale: c_i-norm-weighted sentence-pooling neither beats flat-mean recall nor picks the extractive summary

`tae_doc_rep.json` tests whether the §17.4 **within-sentence hub-field** (content piles onto rare high-
surprisal hub tokens) has a **document-scale analogue**: is a per-sentence saliency `s = ‖z_content,s‖`
(content-residual norm = "how much dark matter this sentence carries") a useful **document saliency**, such
that (a) a `c_i`-norm-weighted sentence pool beats a flat mean for paragraph recall, and (b) the top-`c_i`
sentences ARE the extractive summary? n=1200 docs, 4–12 sentences each.

| pool | recall@1 | MRR | topic-probe AUC | cos to whole-text |
|---|---|---|---|---|
| flat (1/S) | 0.9983 | 0.9992 | 0.9674 | 0.5091 |
| recency (decay^pos) | 0.9992 | 0.9996 | 0.9639 | 0.4835 |
| **c_i-weighted** (∝ saliency) | **0.9983** | **0.9992** | **0.9667** | **0.5114** |

**Flat-tie everywhere — the saliency weighting buys nothing.** Recall is saturated (≈1.0) for all three pools
so the retrieval task can't separate them, but the more sensitive probes agree: topic-probe AUC is
flat-identical (0.967 vs 0.967), and cos-to-whole-text is a dead heat (0.509 flat vs 0.511 c_i). The summary
test is the decisive one: **top-`c_i` sentences match the centrality-based extractive summary at chance** —
top-1 precision **0.231 vs random 0.232**, top-third **0.480 vs random 0.467**, and the rank-correlation
between content-residual saliency and graph centrality is **Spearman 0.034** (i.e. ~zero). The worked examples
show why: on the AHCA-repeal doc the top-`c_i` sentences are the lead and a specific-provision sentence, while
the centrality summary picks the thesis and the same provision — they **overlap on only 1 of 2** and the
disagreement is structural, not noise.

**Critic (C1–C10).** **C6 (baselines): three pools + a random summary baseline + a graph-centrality "gold"
summary** is a proper frame, and the saliency loses (ties) on every one. **C3 (ceiling/floor): recall is
saturated** so that metric is *uninformative by construction* — the real evidence is the unsaturated topic-AUC,
cos-to-text, and summary-overlap, all of which are flat/at-chance, so the negative is not an artifact of an
easy task. **C7 (alt-explanation): "wrong saliency definition"** is live — saliency here is the raw content-
residual norm; a normalized, length-controlled, or feature-selective saliency might separate, untested. **C5:
the §17.4 hub-field is a *within-sentence token* phenomenon; this lifts it to *between-sentence* with no a
priori reason the same statistic should transfer** — a genuinely different object, and it does not. **VERDICT:
NOT SUPPORTED — the document representation is a flat sentence-`z` bag; the hub-field is a sentence-internal
property that does NOT scale to a document saliency.** Honest negative that *bounds* §17.4: the hub-field is
real but local, and `c_i`-weighting is not a free document-summarization or pooling win. **No §16/§17
reversal** — it scopes §17.4's hub story to within-sentence.

### 17.30 — Is c_i a universal predictive-content field? A frozen Llama-3.2-3B does NOT predict it (R²≈0.08 < bag 0.11) — DECISIVE-NEGATIVE, with an honest granularity caveat

`tae_ci_frompass.json` asks the boldest §16 question: **is SONAR's dark matter `c_i` the same object as a
frozen autoregressive LLM's predictive-content field?** If the c_i a token acquires from SONAR's all-to-all
attention is "the in-context content the next-token machinery cares about," then a frozen Llama-3.2-3B's
mid-stack residual should *linearly predict* `c_i`. The test sweeps Llama layers 6–26, ridge + MLP, target =
mean-pooled per-token content residual (the §17 unrotated `c_i`), R² in standardized target space, split by
document. **n=4263 train / 1336 test paragraphs, 551 docs.**

| predictor of c_i | R² | reads as |
|---|---|---|
| Llama L6 → L26 (ridge, best=L18) | 0.061 → **0.081** (peak) | barely above floor |
| surprisal floor | 0.019 (c_i) / 0.080 (c_i-norm) | a scalar surprisal does as well |
| **bag-of-tokens control C4** | **0.111** | **lexical bag BEATS Llama** |
| target to declare a hit (≥0.45) | — | not remotely reached |

**Decisive negative: Llama does not own `c_i`.** Peak R² is **0.081 at layer 18** — it clears the surprisal
*scalar* floor (0.019) but **loses to a plain bag-of-tokens baseline (0.111)**: the frozen LLM's rich mid-
stack state predicts the dark matter *worse than just knowing which words are present*. The feature-overlap is
near-zero too (Jaccard between top-10% `c_i`-norm and top-10% surprisal = 0.012). So **the universal-
predictive-content-field hypothesis is NOT supported by this test** — `c_i` behaves as a **SONAR-internal
pooling object**, not a model-agnostic predictive field a different transformer reconstructs. The one positive
crumb worth keeping: the R² curve **peaks at Llama rel-depth 0.667 (L18)**, late-mid stack, echoing where
SONAR's own `c_i` emerges (rel-depth ~0.75–1.0, §17.4) — a faint shared "content crystallizes late" depth
signature, but far too weak (R² 0.08) to carry the hypothesis.

**Critic (C1–C10). C5 (granularity — the load-bearing caveat, reported honestly):** this is a **paragraph-
level** alignment, **not per-token.** The precomputed Llama shards (`load_llama3b_hf`) store **one residual at
the last token of each paragraph** in an autoregressive forward over the doc — there is no sub-paragraph
resolution — so the per-token SONAR `c~_i` had to be **mean-pooled to one paragraph `c_i`** and regressed on a
single Llama last-token state. The hypothesis is fundamentally **per-token** (does the LLM's state at token i
predict that token's context field), and **this test cannot see that**: pooling both sides to paragraph
granularity destroys exactly the per-token structure the claim is about. **So this is a LIMITED test, and a
per-token test is owed** before the hypothesis can be called dead — the honest statement is *"c_i is NOT
strongly Llama-predictable at the paragraph level."* **C6 (baselines): the bag-of-tokens C4 and surprisal
floor are the right controls** and they make the negative *strong* — beating a surprisal scalar but losing to
a lexical bag is the signature of "no real cross-model content alignment, just word-presence leakage." **C7
(alt-explanation): the d_model mismatch (3072 Llama → 1024 SONAR) and a linear/MLP reader could under-fit** —
but R² 0.08 against a 0.45 target is not a tuning gap, it's a structural miss. **C3: the ≥0.45 hit-threshold
is reasonable** for "same object." **VERDICT: DECISIVE-NEGATIVE at paragraph granularity — the universal-
predictive-content-field hypothesis is NOT supported; `c_i` reads as a SONAR-internal pooling artifact, not a
shared cross-model predictive field — BUT the test is paragraph-pooled and a per-token test remains owed.** No
§16/§17 reversal; it *refutes a tempting extension* of the §16 dark-matter story (that `c_i` is model-
agnostic) while scoping the refutation honestly to the granularity actually tested.

### 17.31 — THE DEPLOYABLE-EDITING HEADLINE: a cross-attention recontextualization-delta head lands edits at exact@coll **.36** for **~1.1% of the partial-reencode FLOPs** — editing flips from "rescued-but-expensive" to genuinely DEPLOYABLE, and the central negative is overturned

`tae_context_synth.json` is the day-5 finale on the editing prize. §16/§17.3 left editing in a frustrating
two-horse race: the only deployable method that crossed the .30 usability line was **partial re-encode**
(exact@coll **.325**), which re-runs the encoder over a ~13-token window (43% of the sentence, ≈2.65e9 FLOPs)
— "deployable" only in the weak sense of *sub-full-re-encode*. The two genuinely-cheap learned predictors
plateaued just below usable (neighbor-MLP **.233** at cos-to-oracle only **.37** — the "high-rank tail we
can't predict" wall), and §17.25's amortized net stuck at **.23**. The open question was binary: **is there a
cheap learned method that both (a) lands edits at ≥.30 AND (b) costs ≪ a partial re-encode?** If yes, editing
is deployable and the night-4 "irreducible" frame is overturned; if the cheap methods are theorem-bound at
~.37 cos, the negative stands.

The method: a **cross-attention "recontextualization-delta" head**. The target is the in-context delta
Δ(g_i) = m\*(g_i) − anchor(g_i), where m\* is the oracle's summed in-context mass at the edit slot and the
anchor is the mean of the *original* base states in a ±4-token window. The head is a small cross-attention
block (d_attn=256, 4 heads, 1024 hidden) whose query = [bare proxy mass ++ slot-position features] attends
over the **CACHED original encoder states** (k/v) — i.e. it predicts the recontextualization delta *without a
new encoder pass*. Trained on **4,586 AUX-DEV substitution pairs**, with the 300 eval edits and a 1,139-pair
OOD holdout (reorder + delete-clause source types) **held strictly out** (split hygiene logged).

| method (deployable?) | exact@coll | cos-to-oracle | frac gap closed | FLOPs (mult-add) |
|---|---:|---:|---:|---:|
| proxy-E (baseline) | .000 | .094 | 0.00 | — |
| self-sub R_role·bare (binding-only) | .000 | .094 | 0.00 | — |
| neighbor-context MLP (§17.3 ref) | .233 | .37 | 0.27 | cheap |
| amortized net (§17.25 ref) | .230 | — | — | cheap (220× win) |
| **CONTEXT-SYNTH cross-attn delta** ✓ | **.36** | **.505** | **0.412** | **2.96e7 (1.1% of re-encode)** |
| partial re-encode win_k=4 (§17.3) ✓ | .325 | .902 | 0.372 | 2.65e9 |
| true-E ORACLE (upper bound, *not* deployable) | .873 | 1.00 | 1.00 | full encode |

**Verdict: editing is now DEPLOYABLE, and the central negative flips.** The context-synth head lands edits at
exact@coll **.36** (word-swap .30, entity-swap .44, number-change .34) — it **beats the neighbor-MLP plateau
(.233)**, **beats the amortized net (.23)**, and even **edges past partial-reencode (.325)** — while costing
**2.96e7 FLOPs ≈ 1.1% of the partial-reencode's 2.65e9** (it reads cached states, no new encoder pass). Both
WIN-LINE conditions are met: *lands ≥.30* AND *<10% of re-encode FLOPs*. This is the deployable point §17.3
was reaching for: **the first method that is both usable and genuinely cheap.** The night-4 "editing is
irreducible" frame, already softened by §17.3 to "irreducible at zero cost, recoverable at half the encode
cost," is now further corrected to **"recoverable cheaply by a learned delta-head that predicts the
recontextualization from cached states."** The central negative-to-positive flip is real.

> *Decoded, the flip (low-cos bin, number-change `30 → 31`):* base "For over **30** years, the University
> Magazine has served as the premier publication of the University of Toronto." proxy-E (deployable, .037)
> decodes back to "…For over **30** years…" — the swap never lands. **Context-synth (.36)** decodes "For over
> **31** years, the University's Magazine has stood as the premier publication of the University of Toronto."
> — **lands**, collateral .059 (minor paraphrase of "served→stood", "the University Magazine→the University's
> Magazine", under threshold). The oracle lands cleanly too. This is the edit class that flips fail→land by
> upgrading the proxy to the learned cross-attention delta.

**The honest caveats (C1–C10).** **C6 (is the FLOP comparison fair?) — examined and judged FAIR, with the
load-bearing asterisk stated.** The synth head's 1.1% figure is honest *given* the cached original states; it
explicitly does **NOT** re-run the encoder, whereas partial-reencode does. But the comparison's fairness rests
on a real assumption: in a true deployment you must **already have the original encoder states cached** (the
k/v of the unedited sentence). If you do — which is the natural setting for editing an embedding you just
produced — the comparison is fair and the 90× win is real. If you don't, you'd pay one encode to cache them,
and the marginal-edit FLOPs are what's quoted. The note in `C4_flop_control` states this ("synth uses CACHED
original states — NO encoder pass"). So **the FLOP win is for the amortized/repeated-edit regime**, which is
exactly the deployable regime. **C6b (is the .63→edit-success mapping honest?) — yes, and the forecast was
BEATEN on OOD while the in-distribution cos is LOWER than the headline number suggests.** The pre-registered
hope was "OOD cos-to-oracle ≈0.63 ≫ the .37 plateau." The landed OOD cos-to-oracle is **.736** (vs an
anchor-only Δ=0 floor of **.204** on the same 1,139 reorder/delete-clause pairs) — *better* than forecast and
far above the .37 plateau, confirming the head generalizes off-distribution. **But note honestly:** the
*in-distribution* mean cos-to-oracle is only **.505**, well below partial-reencode's **.902** — so the head
does **not** reconstruct the in-context state as accurately as a partial re-encode; it lands *more* edits at
*lower* cos. The reason is that edit-success is a **thresholded behavioral** metric (swap-in AND
collateral≤.10), and the cross-attn delta concentrates its accuracy on the **edit-relevant direction** rather
than the full high-rank state — it gets the behaviorally-decisive part right cheaply while missing the rest of
the `c_i` tail. **So the .505/.736 cos → .36 success mapping is honest: cos and success measure different
things, and the head wins on success precisely because it spends its capacity on the part that flips the
output.** **C7 (hub-decay sanity):** Spearman(hubness, cos) = **−0.274** — the head degrades on high-out-degree
hub tokens (q1 cos .559 → q4 cos .469), the *predicted* signature of the §17.4 hub-field being the hard
residual, confirming the mechanism rather than a fit artifact. **C3 (DECISIVE-NEGATIVE flag = False):** the
pre-registered "wall is a theorem at ~.37 cos" hypothesis is **refuted** — a cheap learned head exceeds it.
**VERDICT: editing is DEPLOYABLE (cheap method ≥.30 lands); §17.3's central "deployable-only-by-half-re-encode"
negative is overturned to a positive.** This is the program's deployable-editing headline. **This is a
genuine update to the §17 synthesis** (see below): the editing frontier moves from "recoverable at half the
encode cost" to "recoverable at ~1% of the encode cost via a cached-state delta-head," with the residual wall
now scoped to the hub-token tail rather than to editing as a whole.

### 17.32 — What makes SONAR's geometry special? Its MT+DAE objective, tested CORRELATIONALLY: R_role distinctness tracks word-order distance (ρ=0.74, p=.01) and a non-MT encoder shows a far weaker reflective operator — but rotation MAGNITUDE is flat and the head-final prediction FAILS

`tae_objective_ablate.json` asks the deepest "why" question: **is SONAR's role-rotation + hub-field a
consequence of its training objective** (multilingual MT + denoising-AE + MSE — the pressure to make *one*
decoder reconstruct from an *order-invariant-but-recoverable* pooled embedding across many word orders)? The
honesty block is explicit and load-bearing: **SONAR cannot be retrained**, so this is **CORRELATIONAL, not a
causal ablation**. What *is* testable: (1) does per-language `R_role` magnitude/distinctness track offline
typological word-order distance from English? (2) do head-final SOV languages (jpn/kor/tur/hin) carry the
*largest* rotation? (3) does a non-MT pure sentence encoder show a *weaker* reflective operator? (4) does the
hub-field `c_i` track translation-relevant token properties (surprisal / out-degree / content-vs-function)?

**Results — a genuinely mixed, honestly-reported picture:**

| test | result | reads as |
|---|---|---|
| Spearman(R_role **distinctness-from-eng**, word-order dist), n=12 | **ρ=0.737, p_perm=.011** | **POSITIVE** — farther langs carry more-distinct R_role |
| Spearman(R_role **median rotation angle**, word-order dist) | ρ=**0.0** (p=1.0); mean-angle ρ=−0.03 | **NULL** — rotation *magnitude* is flat across langs (~1.27 rad everywhere) |
| Head-final SOV langs carry *larger* rotation? | **NO** (head-final mean median-angle 1.272 ≈ non-head-final 1.276) | prediction **FAILS** |
| non-MT encoder (MiniLM) reflective operator | median eigen-angle **0.535** vs SONAR **1.282**; frac-near-π **.026** vs **.083**; n>2.5rad **18** vs **183** | **POSITIVE** — non-MT operator is far less reflective |
| hub-field `c_i` token-property regression | **ERRORED** (`'HubPrepool' has no attribute '_spm'`) | the §17.23 tokenizer-accessor bug **recurred** — arm lost |

**What makes SONAR special, honestly:** the objective hypothesis gets **partial correlational support, not
confirmation.** Two arms point the right way: (a) the *distinctness* of each language's role operator from
English grows with typological word-order distance (ρ=0.74, p=.01) — consistent with `R_role` being "the price
of aligning divergent word orders into one pooled code," which is exactly what an MT+single-decoder objective
would pressure; and (b) a non-MT pure sentence encoder (MiniLM) has a **markedly weaker reflective operator**
(median order-rotation 0.53 rad vs SONAR's 1.28, and 10× fewer near-π reflections), consistent with the
π-reflection role-rotation being **objective-driven** rather than generic to pooled encoders. But two arms
*fail*: the rotation **magnitude** is essentially constant across all 12 languages (ρ≈0), and the crisp
**head-final prediction** (jpn/kor/tur should rotate *most*) does **not** hold — jpn actually has the
*smallest* median angle. So the picture is: **SONAR special-ness = a reflective, content-vs-order-separating
pooling operator whose DISTINCTNESS (but not raw magnitude) tracks cross-lingual word-order divergence, and
which is far weaker in a non-MT encoder** — correlational evidence the MT/DAE objective shapes the geometry,
*not* a causal proof.

**Critic (C1–C10). C5 (what's testable without retraining — stated up front and honored):** the experiment
never overclaims causality; the HONESTY block pre-commits to "correlational, positive = consistent-with."
**C6 (controls):** the non-MT encoder is the right comparator but carries the honest caveat that it differs in
*model, dim (384 vs 1024), and training data* — so it is **not a clean ablation**, only suggestive. The WALS
word-order features are hardcoded majority values (no network), reasonable. **C7 (the magnitude-null is the
honest tension):** distinctness rises with word-order distance but magnitude doesn't — read together, this
says SONAR rotates by a roughly *fixed amount* in *language-specific directions* whose specificity grows with
typological distance, rather than rotating *harder* for distant languages. That's a subtler and more
believable story than the naive "distant = bigger rotation." **C8 (lost arm):** the `c_i`-token-property arm
**errored on the same `_spm` accessor bug that §17.23 had already hit and supposedly fixed** — it recurred
here (different code path), so the "does the hub-field track translation-relevant token properties" question is
**still owed**; the prose pre-registered the expected direction (surprisal_q4/q1 > 1 ⇒ rare/content pieces
carry the field) but has **no measurement** this round. **VERDICT: PARTIAL correlational support for the
objective hypothesis — distinctness∝word-order (ρ=0.74) + weaker non-MT operator point yes; flat magnitude +
failed head-final prediction temper it; the hub-field arm is lost to a recurred bug and remains owed.** No
§16/§17 reversal — this *adds a why-layer* under §17.1/§17.17 (the SONAR-specific binding geometry) and
*partially* sources it to the training objective, honestly bounded.

### 17.33 — DECISIVE-NEGATIVE: cross-lingual editing is impossible — a spec-only edit direction is LANGUAGE-SPECIFIC (cross-lang exact@coll **0.0**), and within-lang spec-only sits at the **.02** floor (matching the .037 wall)

`tae_crosslingual_edit.json` asks whether the editing *direction* is shared across languages: take an arm-A
spec-only word-swap edit (e.g. `man→woman`, sinusoidal RotationR, per-language ProxyE) fit/applied to a SONAR
`z` encoded in one language, then **decode it in each of {eng, fra, deu}** — does the swap survive the
language change? exact@coll = target-lang word swapped in + source word out + collateral ≤ .10.

| | decode eng | decode fra | decode deu |
|---|---:|---:|---:|
| **fit eng** | .000 | .000 | .000 |
| **fit fra** | .000 | **.0625** | .000 |
| **fit deu** | .000 | .000 | .000 |

**within-lang (diagonal) mean = .021 · cross-lang (off-diagonal) mean = 0.000 · edit-survives-crosslingual =
FALSE.** Every off-diagonal cell is **exactly 0.0**: an edit direction fit in one language **never** lands when
decoded in another. The only non-zero cell is the within-language fra/fra diagonal (.0625, 1 of 16) — and even
the diagonal **mean is .021**, sitting right at the **§16/§17.3 spec-only floor (.037)**, i.e. the *within*-
language spec-only edit is already at the useless wall. **Decisive negative: the edit direction is
language-specific.** This is the clean cross-lingual confirmation of the §16 binding picture: the spec-only
proxy-E edit carries no in-context information *and* its geometry is keyed to the language it was fit in, so it
fails twice over across languages.

**Critic (C1–C10). C6/C2 (the right reference is the diagonal, and it's at the floor):** the within-lang
diagonal is the calibrated reference and it confirms spec-only editing is at the .02–.04 wall *before* you even
ask about cross-lingual transfer — so the 0.0 off-diagonal is "language-specific on top of already-useless,"
the honest framing. **C5 (scope caveats, reported):** word-swap edit type only; the bilingual word table is
small (10 confident single-translation pairs); bases are SONAR self-translations of English (not native fr/de
corpora, inheriting translation artifacts); cross-lingual collateral uses only the token-f1 half of E1 (the
attribute-drift half is English-specific, omitted). So the magnitude is bounded by a small-n, narrow-edit
setting — but the **direction of the result (0.0 cross-lang) is unambiguous.** **VERDICT: DECISIVE-NEGATIVE —
cross-lingual editing does not transfer (exact@coll 0.0); the edit direction is language-specific, and
within-lang spec-only is already at the .02 floor.** Consistent with §17.19 (the operator/binder is
language-*universal* in structure but the per-language `R_role`/ProxyE are language-*specific* instantiations)
and §17.5; no reversal, a sharp new bound on what editing can do across languages.

### 17.34 — A SECOND encoder corroborates SONAR-specificity from a fresh angle: GTR-T5 has NO role-rotation (W_B median eig-angle **.09** vs SONAR **.97**) — but mean-pool exactness DOES recur (encode-only limitation noted)

`tae_second_tae.json` corroborates the §17.17 "the binding geometry is SONAR-specific" swing from an
independent second encoder: **sentence-transformers/gtr-t5-base** (768-dim, mean-pooled). The central question
is whether SONAR's role-rotation composition structure exists in another pooled sentence encoder. It fits the
per-token composition map W_B on consecutive-sentence A/B pairs in two length buckets (5–10, 15–25 tokens) and
compares its eigen-angle spectrum to SONAR's role-rotation reference.

| property | GTR-T5 (2nd encoder) | SONAR reference |
|---|---:|---:|
| W_B median eigen-angle | **0.090 rad** (buckets .098 / .082) | **0.967 rad** |
| W_B near-orthogonal? | **NO** (orth_dev 0.79) | yes |
| learned-orth gain over identity | **−0.005 / +0.014** (≈0) | large |
| role-rotation present? | **NO** (neither bucket) | yes |
| mean-pool exactness γ (cos mean-state vs pooling) | **1.00** | ~1.00 |

**Verdict: NO SONAR-style role-rotation in the second encoder.** GTR's composition map is **identity-like** —
the best-fit W_B has median eigen-angle **.090 rad** (≈10× smaller than SONAR's .967), is **not** near-
orthogonal (orth_dev .79, vs SONAR's near-orthogonal operator), and a learned orthogonal map buys **essentially
nothing over plain identity** (gain −.005 / +.014). In other words, for GTR you compose B-after-A by
approximately *leaving the token states where they are* — there is no non-trivial reflective role-rotation
binding order. This **corroborates the §17.17/§17.24 conclusion from a fresh second angle**: the
π-reflection role-rotation is **SONAR-specific structure**, not a generic property of pooled sentence encoders.
The one property that **does** recur is **mean-pool exactness** (the pooling stage is the exact mean of token
states, cos≈1.00) — but that is a near-trivial architectural fact of mean-pooled AEs, not the interesting
binding geometry.

**Critic (C1–C10). C5/C2 (the encode-only limitation — stated honestly and it is real):** GTR/Sentence-T5 is
reachable **encode-only** on this box (no clean decoding sentence-bottleneck AE), so this **cannot test SONAR's
decode-based properties** — not the edit-wall (a decoder phenomenon), not the `c_i` decode-ladder (defined via
decoded text). So §17.34 corroborates **only** the *encoder-side composition geometry*, which is exactly the
W_B role-rotation claim — a fair test of the specific thing claimed, but it cannot speak to whether GTR has a
hub-field/dark-matter or an edit-wall. **C6 (controls):** identity and learned-orthogonal baselines are the
right comparators; the absence of a positional-rotation reference (T5 uses relative position bias, no
sinusoidal PE table, so the sin^dp column is omitted) is noted — the test compares against identity/learned-orth
only, which is sufficient to show "no non-trivial rotation." **C7 (alt-explanation):** could the null be a
fitting failure rather than a real absence? No — the learned-orth map *converges* (it just lands near identity,
cos-to-identity .97–.98), so the operator genuinely *is* near-identity, not under-fit. **VERDICT: CORROBORATES
§17.17 — a second pooled encoder has NO SONAR-style role-rotation (W_B angle .09 ≪ .97), strengthening
"binding geometry is SONAR-specific" from an independent model; the encode-only limit means it speaks to
composition geometry, not the decode-side edit-wall, which is stated honestly.** No §16/§17 reversal — a
second pillar under the SONAR-specificity verdict.

### 17.35 — THE BIGGEST DICTIONARY BREAKS THE PLATEAU: h131072/k128 lands FVU **0.248** (< the ~.32 wall) and **91.3%** of the SBERT decoder ceiling — at 8× the §16 base width, scale finally buys reconstruction

`sonar-sae_h131072_k128_btk_day5_metrics.json` is the flagship: the **largest dictionary we have trained**
(h=131072, 8× the §16 base h16384, k=128, BatchTopK, aux_k=512, **40 epochs**, 800k v2-sentence train rows,
34,105 val). The standing leaderboard fact (§5, §17.6, §17.16) was a **plateau at ~0.32 FVU / ~.83 cos**: every
larger-dict / higher-k config (h16384/k128 .310, h32768/k128 .331, h65536/k128 .351, and the day-5
h32768/k64 full-data recipe .320) sat on the *same* wall — "bigger dictionaries don't buy recon once you are
on the frontier." The question for the biggest dict: **does h131072 break .32, or does it just reproduce the
wall at higher width?**

| SAE (recipe) | h | k | FVU | cos | dead% | L0 | roundtrip SBERT | % of .984 ceiling |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| h16384/k128 (§5) | 16384 | 128 | .310 | .840 | 0.0 | 128 | — | — |
| h32768/k128 | 32768 | 128 | .331 | .834 | 0.0 | 128 | — | — |
| h65536/k128 | 65536 | 128 | .351 | .828 | 0.0 | 128 | — | — |
| h32768/k64 day5 (full-data, 80ep) | 32768 | 64 | .320 | .833 | .003% | 64 | .835 | 84.8% |
| **h131072/k128 day5 (FLAGSHIP)** | **131072** | **128** | **.248** | **.875** | **3.07%** | **128** | **.899** | **91.3%** |

**Verdict: the plateau BREAKS — and it is scale, not just the day-5 training recipe, that does it.** At
matched k=128, going from h65536 (.351) / h32768 (.331) up to **h131072 cuts FVU to .248** — a **~25% relative
FVU reduction below the wall** and the first config to clear .32 by a real margin. cos rises to **.875** (best
on the board) and the behavioral roundtrip hits **.899 SBERT = 91.3% of the .984 decoder ceiling** (vs the
k64-day5 flagship's 84.8%) — i.e. a 128-active sparse code now preserves ~91% of the meaning the dense decoder
can. **Is this just the k128+full-data+epochs effect?** The clean control is the day-5 h32768/k64 comparator,
which ran the *same* full-data_v2 + long-training recipe and still sat at **.320** — so full-data+epochs alone
does **not** clear the plateau; the extra drop to .248 is bought specifically by the **8× dictionary width at
k=128**. This is the one place in the whole program where **dictionary scale demonstrably helps
reconstruction** — sharpening, rather than overturning, the §5 line: "bigger dict doesn't buy recon" holds *on
the frontier you've already saturated at a given k*, but pushing width a full 8× past base **does** shift the
frontier when paired with a dense-enough code and full data.

**Critic (C1–C10). C6 (is the comparison fair? — the load-bearing one).** The plateau-break is judged
**REAL but with two honest asterisks.** (i) **k is not matched across the whole table** — the k64-day5 control
(.320) is the clean "recipe-only" comparator and is *below* k128 in sparsity, so the fairest statement is
"h131072 at k128 beats every prior k128 config AND beats the k64 full-recipe control," which it does; the drop
is not an artifact of comparing a k128 model to k64 baselines. (ii) **Dead features rose to 3.07%** (vs .003%
at k64-day5 and 0% at the prior k128 configs) — at 131072 features and only 40 epochs, ~4,000 features never
fire; the dictionary is **under-trained relative to its width** (40 epochs vs 80 for the k64 control), so the
.248 is if anything a *conservative* floor — a longer run would likely push it lower but also is the reason we
shouldn't over-claim a clean "scaling law." **C2 (gear):** FVU/cos are val-set reconstruction; the SBERT
roundtrip is the behavioral gear and agrees (best on the board), so the recon gain is not a metric artifact.
**C5 (generalization):** single seed, single width-step — this establishes *that* 8× width breaks the plateau,
not a smooth scaling curve. **Honest scope:** this is a **reconstruction** win, not an interpretability win —
nothing here says the 131072-feature dictionary is more *monosemantic* or more *canonical* (§17.21's
seed-canonical-but-not-scale-canonical / feature-splitting verdict still stands; more features at fixed data =
more splitting, which is consistent with both better recon and worse cross-scale canonicity). The headline is
narrow and true: **at the biggest dictionary, scale finally helps recon — FVU .248, 91% of the decoder
ceiling.**

### 17.36 — THE ROBUSTNESS GATE: deployable editing is a MEMORIZED 3-type trick, not a general edit calculus — held-out-type exact@coll **.079** (1 of 6 types holds ≥.3); but the head DOES compose across k edits (3.9× cheaper at k=4, no re-cache)

`tae_context_synth_robust.json` is the robustness gate on the §17.31 deployable-editing headline (in-dist
exact@coll **.36**). §17.31 trained the cross-attention recontextualization-delta head on **three edit types**
(word-swap, entity-swap, number-change) — all **content-word substitutions**. The gate asks the two questions
that decide how strong the deployable claim really is: **PART1 — does the head GENERALIZE to edit *types* it
never trained on** (tense, negation, person, register, insert-clause, delete-clause), i.e. is §17.31 a general
*edit calculus* or 3 *memorized* types? **PART2 — does the head COMPOSE across k sequential edits** (apply k
deltas to the one cached state, cheap) or does each edit need a fresh re-cache (expensive)? Split hygiene is
logged and clean: **no held-out-type base text or transform ever appears in a training row** (every train row
asserted a single content-word substitution; held-out specs built+scored by the frozen E1).

**PART1 — held-out-type generalization (exact@coll; in-dist anchor .39, proxy floor .037):**

| held-out type | synth exact@coll | synth heuristic-pass | mean collateral | oracle exact@coll | holds ≥.3? |
|---|---:|---:|---:|---:|:--:|
| person | **.425** | .80 | .157 | .763 | ✅ |
| tense | .05 | .088 | .068 | .75 | ❌ |
| negation | .00 | .088 | .536 | .025 | ❌ |
| register | .00 | .963 | .545 | .00 | ❌ |
| insert-clause | .00 | .05 | .294 | .00 | ❌ |
| delete-clause | .00 | .988 | .492 | .00 | ❌ |
| **held-out overall** | **.079** | — | — | .256 | **1 / 6** |
| *in-dist (trained types)* | *.39* | — | — | *.85* | — |

**Verdict: §17.31 is a MEMORIZED 3-type trick, not a general edit calculus.** Held-out-type exact@coll
**collapses to .079** — essentially the proxy floor (.037) plus noise, and **6× below** the in-dist .39. Only
**1 of 6** unseen types holds the ≥.3 bar (**person, .425**, which generalizes because grammatical-person
swaps are *also* content-word substitutions — the head's actual trained skill — and the oracle confirms the
type is learnable, .763). The structural edits the head was *not* built for fail outright: **insert/
delete-clause = .00**, **negation = .00**, **register = .00**, **tense = .05**. Crucially, the **oracle**
(true-E ceiling) *also* falls on most held-out types — insert/delete-clause/register oracle = **.00** — which
tells us two distinct things are going on: (a) for *tense* and *person* the oracle is high (.75/.76) but the
**head can't predict the delta** → a genuine **generalization failure of the learned head**; (b) for
**insert/delete-clause/register** even the oracle's mean-pool single-slot edit can't express the change → these
edits are **outside the additive-substitution edit model entirely** (you cannot insert a clause by perturbing
one token-slot's mass), so the failure is the **edit *parameterization***, not the head. (The high
heuristic-pass on delete/register, .96–.99, with collateral ~.5 is the scorer registering "the sentence
changed a lot," not a *correct* edit — exact@coll, which gates on low collateral, is the honest metric and it
is .00.) **The honest scope correction to §17.31: deployable editing is real but BOUNDED to the
content-word-substitution regime it was trained on** — it is a cheap, accurate *lexical-substitution* delta
head, not a general-purpose sentence editor.

**PART2 — re-cache cost across k sequential edits (amortized FLOPs/edit, mult-add):**

| k edits | re-cache-each (amort/edit) | COMPOSE (amort/edit) | compose speedup | exact preserved? |
|---|---:|---:|---:|:--:|
| 1 | 6.12e9 | 6.12e9 | 1.0× | — |
| 2 | 6.10e9 | **3.07e9** | **1.99×** | no |
| 4 | 6.09e9 | **1.55e9** | **3.93×** | no |

**Verdict: the head COMPOSES (cheaply) but the chain-quality is the same as re-cache (no free lunch on
accuracy).** Applying k deltas to the **one** original cached state (never re-encoding) makes cost **amortize
down linearly** — **3.93× cheaper per edit at k=4** (1.55e9 vs 6.09e9 FLOPs), because re-cache-each pays a
fresh full encoder pass after every edit while compose pays it once. So the deployable claim's *cost* story
holds and **improves** for multi-edit sessions: sequential editing is **amortized-cheap**, exactly the regime
a deployed editor lives in. **But** `compose_preserves_success=false` at every k: the sequential-chain
exact@coll is **.00 for both regimes** (compose AND re-cache), i.e. compounding edits degrade quality equally
whether or not you re-cache — composition is **free on cost but not a quality win**; the per-edit accuracy
ceiling is set by the (in-dist, single-type) head, not by the caching strategy.

**Critic (C1–C10). C5 (generalization — THE one, examined and the verdict is negative).** The held-out battery
is the right test (6 unseen types, 80 each, frozen scorer, clean split) and it returns a clear **no**:
**`general_edit_calculus = false`, "MEMORIZED 3 TYPES."** C6 (controls): proxy-floor (.037) and true-E oracle
bracket every cell, which is what lets us *attribute* failures correctly — head-failure (tense/person) vs
edit-model-failure (insert/delete/register). C4 (FLOP accounting): same model as §17.31's C4; the speedup is
a clean apples-to-apples mult-add count. **Net effect on the headline:** §17.31's "editing is DEPLOYABLE at
.36 for ~1% cost" **survives but must be scoped**: it is deployable, cheap, and even cheaper under multi-edit
composition — **for the lexical-substitution edit class it was trained on**, and **not** as a general edit
calculus. The day-4 "irreducible" frame, flipped to "recoverable cheaply" in §17.31, is now **precisely
bounded**: recoverable cheaply *within the substitution regime*, still genuinely hard (oracle-limited) for
structural edits that the single-slot mean-pool model can't even express.

## §17 Synthesis — the day-5 arc

Day 5 turned §16's two big *assertions* — "there is a composition operator" and "there is a dark-matter
content field" — into *measured receipts*, and then stress-tested them to their boundaries. **(1) The
composition map is written down:** `z ≈ (1/N) Σ_i R_pos^{p_i}(R_role E(w_i) + c_i)`, with `R_role`
characterized as **two near-isometric reflective role rotations + a length adder**, **92% recovered by a
closed form** (two orthogonal matrices + two scalars) and **20× sparse in its Schur eigenbasis** (§17.1–17.2).
**(2) The reconstruction ladder decomposes the signal:** content `c_i` is **93.7%** of the recoverable `z`,
the position clock **6.4%**, and the role rotation **~0%** in CE — `R_role`'s value is *binding/order*, not
*recall* (§17.9, sharpened by §17.22: its eigenplanes are a generic content-mixing reflection, not a labeled
role axis). **(3) Editing is now DEPLOYABLE:** §17.3 rescued it to **0.325 at half the encode cost** (an *information
deficit*, the missing in-context embedding — not a geometry we can linearly route around), and §17.31 then
**closed the prize**: a cross-attention recontextualization-delta head reading **cached** original states
lands edits at **0.36 for ~1.1% of the partial-reencode FLOPs**, beating both the neighbor-MLP plateau (.233)
and partial-reencode itself — the central "deployable-only-by-half-re-encode" negative **flips to positive**,
with the residual wall now scoped to the hub-token tail rather than to editing as a whole (§17.3, §17.31). **(4) `c_i` is late-stack (L18–24), hub/surprisal-bound, and irreducible** — the high-rank
dark matter §16 named, now *located* (§17.4). **(5) The model is a SENTENCE-model** that degrades
*gracefully* into paragraphs with a ~3–4-slot binding budget (§17.10, §17.16). **(6) Analogy = editing =
the hub field — one mechanism** (§17.11), now *bounded* to the lexical-substitution regime by the clean
concept-algebra battery (§17.20). **(7) Content is language-universal across 11 languages AND the operator
itself ports** (composition map 79–95% cross-recovered, role binder spectrally identical per language,
§17.5+§17.19) — but the binding *geometry* is **SONAR-specific**: a second pooled encoder shows a different
calculus (the cross-model swing, §17.17). **(8) A calibrated `z`-auditor shipped** — a contract that
*readable ≠ causal*, with bins as audit objects but not steering levers (§17.12).

The day-5 **honest negatives** are as load-bearing as the positives: the transport-feasibility *certificate*
was **refuted** by a hub-scalar null (AUC 0.766 < 0.857, §17.14); **geodesics do not help** — the
curved-manifold break is intrinsic, not a chord artifact (§17.15, §17.18); the **SAE dictionary is a content
basis, NOT the role-rotation eigenbasis** (`R_role` denser, not sparser, in SAE coords; §17.13, §17.21); and
**value-density construction was refuted** — the wall is binding, not density (§17.8). SAE features are
**seed-canonical (.99) but not scale-canonical** (cross-size ~5% survival = feature-splitting, §17.21).

**Big picture:** *SONAR composes a sentence by additively pooling a high-rank, surprisal-bound content field
and binding order with a mild reflective role-rotation — a calculus we can now write down, partly invert, and
audit, that is content-universal across languages but geometrically specific to SONAR, and whose one
irreducible wall is the missing in-context information, not a basis we have yet to find.*

## §17 FINAL NOTE — day-5 + Tier-2 close-out

Across day 5 and the Tier-2 battery, §16's two assertions became a **fully characterized, partly-invertible,
auditable calculus** — and its boundaries are now drawn as sharply as its wins. **The composition map is
written, sparsified, and probed at user request:** `z ≈ (1/N) Σ_i R_pos^{p_i}(R_role E(w_i) + c_i)`, with
`R_role` = two reflective role-rotations + a length-adder, 92% closed-form, 20× sparse in its Schur eigenbasis
(§17.1–2). **The reconstruction ladder decomposes the signal:** content `c_i` carries **93.7%** of the
recoverable `z`, the position clock 6.4%, the role rotation ~0% in CE — binding, not recall (§17.9, §17.22).
**The dark matter decodes to gist:** `c_i` is late-stack (L18–24), hub/surprisal-bound, irreducible — and the
high-rank tail **decodes to a faithful paraphrase (SBERT ~.85)**, so it is *content we can read*, not noise
(§17.4). **Editing is now DEPLOYABLE — with an honest fence:** a cached-state cross-attention delta-head lands
edits at **.36 for ~1.1% of partial-reencode cost** (§17.31) and **composes 3.9× cheaper across k=4 sequential
edits** with no re-cache (§17.36-PART2) — **but the robustness gate is decisive: it is a MEMORIZED
content-word-substitution head, not a general edit calculus** (held-out-type exact@coll **.079**, only
**1 of 6** unseen types holds ≥.3; structural edits like insert/delete-clause fail even at the oracle because
single-slot mean-pool can't express them, §17.36-PART1). So: deployable and cheap **within the
lexical-substitution regime it was trained on**, full stop. **The safety loop closes:** the
parascope→`z`→auditor chain runs end-to-end with **chain-AUC .73** — you can read an LLM's hidden state,
predict its SONAR content, and flag it — **but the `z`-auditor is adversarially foolable** (an ensemble
rescues only the *active*-feature audit, not the curved-manifold one), so it is a *monitor*, not a *guarantee*
(§17.12, Tier-2 adversarial). **Content is language-universal** (11 languages, operator ports 79–95%) **but the
binding geometry is SONAR-specific** — re-confirmed twice over: a second pooled encoder (GTR-T5, §17.34) and
the cross-model swing (§17.17) both show **no SONAR-style role-rotation**, while mean-pool exactness recurs
trivially. The **hub field is within-sentence**, `c_i` is **not the LLM's residual field** (the parascope
predicts SONAR content, not a shared geometry), and the **biggest dictionary breaks the recon plateau** at last
(h131072/k128 **FVU .248**, 91% of the decoder ceiling — scale helps recon at 8× width, though not
interpretability, §17.35). **Single big-picture sentence (the whole program, §16+§17):** *SONAR builds a
sentence by additively pooling a high-rank, surprisal-bound, language-universal content field and binding word
order with a mild SONAR-specific reflective role-rotation — a calculus we can now write down, partly invert,
cheaply edit within the substitution regime, and audit end-to-end from an LLM's hidden state, whose one
irreducible wall is the missing in-context information rather than any basis we have yet to find.*

## Takeaways
1. The full **understand ⇄ construct** loop works, end-to-end and measurably.
2. **k is the reconstruction lever** (FVU 0.50→0.11), **and the *inverse* lever for monosemanticity**
   (low-k = interpretable). Pick the SAE per use-case; don't expect one SAE to do both.
3. **Reading an LLM's hidden state and naming its upcoming content in concepts works** (~30× chance),
   and improves with probe quality.
4. Two things we *ruled out*: bigger dictionaries (no recon or interp gain) and a decoder LoRA
   (decode isn't the bottleneck).
5. **Manifold-tangent steering > naive steering** (Goodfire confirmed): on-manifold steering holds
   coherence at high strength where straight-line steering derails, and steers *better*. The earlier
   negative single-layer result was a method artifact, not a property of the model.
6. **Interpretable SAE features are causal levers** (loop closed): the same dictionary that names what
   the model is doing can coherently steer it toward any chosen named concept (best at mid-layer L8);
   exact one-feature precision isn't reached, but large coherent concept-targeted steering works.

## Use the lab
```bash
# understand: sentence -> concepts (sentence-level k8 = most monosemantic)
python sonar_interp_lab.py understand --sae hf:nickypro/sonar-sae/sentence-sae_h16384_k8.pt \
    --feature_names feat_dict_sentence-sae_h16384_k8.json --text "Your sentence here."
# construct: algebra / interpolation / from features
python sonar_interp_lab.py construct --sae hf:nickypro/sonar-sae/sonar-sae_h16384_k128_btk.pt \
    --algebra "The king ruled wisely|king|queen"
# read an LLM's hidden state as concepts
python parascope_to_concepts.py --probe hf:nickypro/parascope-probes/llama-3b_layer27_maghead_mlp_h2048x2_t0.05_a1.0.pt \
    --sae hf:nickypro/sonar-sae/sonar-sae_h16384_k128_btk.pt --layer 27 --shard 40
```

## Open questions / next
- **[DONE] Sentence-level + deduplicated corpus**: confirmed more monosemantic AND better-reconstructing
  than paragraph-level (see §5). Next: scale the sentence corpus past 150k and sweep k=4..16.
- **Auto-labeling**: feed each low-k feature's top sentences to an LLM for a clean name (the generative
  decode-the-direction names are vague; data-driven top sentences + an LLM summary would be crisp).
- **A combined "understand" SAE**: Matryoshka / multi-k so one model serves both regimes.
- **Closing the parascope paraphrase gap**: this is bounded by SONAR-vector prediction (probe lane).


---

# §18 (night-5) — the c_i / dark-matter account, revised and CLOSED

*Pre-pooled & causal battery, 2026-06-17. Sources: NIGHT5_STATE.md LOG (full scored verdicts); night5_json/{p1,p2,p3,p4,p5,p6,p7,p8,b1,b1b,b1c,a1,a1b,a1c,a1d,a2,a4,a5fix,b4,b4b,exp4,exp4b,p1robust,pooling_capacity,p_pertoken_sae}. Revises INTERP_RESULTS §16.1/§16.3/§16.5 and §17.4/§17.22/§17.23/§17.28/§17.30/§17.36. Program motto: the easiest person to trick is yourself — and the two live self-tricks on this chapter are (a) OVER-deflating c_i past what B1b/B1c/SAFE-1 show, and (b) reading descriptive geometry numbers (R_role's .87 rad, eff-rank 25) as gauge-invariant facts when A2 shows they are SONAR-frame readings. Both are flagged in §18.4 and §18.6. All four gating/closing results (A2 identifiability, per-token SAE, B4b, SAFE-1) have LANDED; §18 is COMPLETE.*

---

## 18.0 — Executive summary

Night-5 did **two** things, and the honest headline carries both: it **deflated three specific over-readings of c_i**, AND it **converted "c_i matters" into a located, causally-grounded, identifiability-validated positive account** of where SONAR's compositional meaning lives. This is not a one-sided deflation, and — now that **A2 (identifiability)** has landed — it is no longer provisional.

**The positive account (centerpiece).** The decodable meaning of `z` lives in an **order-free, intrinsic-heavy `c_i` field**, written by **late FFN (L19–23)**, **mid-scale-localizable** to ~64–256 neurons, **hub-concentrated at the field level**, **globally (not locally) constructed**, and **not recoverable from surface lexis**. Five experiments converge: **B1c** (c_i-alone, order-free, recovers SBERT **.901** / 69% of tokens; the position clock alone only **.463**) locates the meaning in the c_i field, not the binding; **B4** gives the first causal writer (late-FFN mean-ablation collapses ‖c_i‖, attention is flat across all 24 layers); **B4b** sharpens the writer to a mid-scale neuron handle (top-256 of 40,960 late-FFN units → +13.2% content / +31.6% hub c_i-collapse, 18× random, saturating the B4 single-layer ceiling); **A4** shows the field is not locally constructible (closed-form ladder VE plateaus **.163**, never beating §17.4's **.177**); and **SAFE-1** shows the field carries audit-relevant meaning that surface lexis does not (order-free lexical auditor AUC **.491 = chance** vs full-z / c_i **1.000** on an implied-attribute task). The new **per-token SAE** gives the dictionary-level picture: per-token states are 4.5× more compressible than pooled-z (FVU **.081** vs .366), partition cleanly into a **lexical** family (99.9% of features, 96% of energy) and a small genuine-interior **positional** family (34 feats / 0.1% / 4.1% energy, the R_pos clock), and — diagnostically — **0 separable context features**: c_i is finely **entangled into the lexical family**, consistent with it being token-identity-bound and per-token-spiky.

**The deflation side (scoped, not sweeping).** Night-4's generative model `z = (1/N)Σ R_pos^{p_i}[R_role E(w_i) + c_i]` was fit at the **pooled** level, and the pre-pooled/causal battery shows it is **a least-squares decomposition of the mean, not a causal per-token generator** (P1: honest skeleton-only recon cos **.583**, "cos 1.0" is the c≡residual tautology). Three specific over-claims fall: (i) c_i is **not** an *additively context-accumulated* term — an isolated single word already carries a c_i comparable to its own lexical code (iso-residual robust across 5 E definitions, P1/P1-robust) and c_i **contracts ~40%** as context grows (P2); (ii) read-out is **not** nonlocal/holographic — under single-token patching z is **strictly localist** (off-diag ~.012 ≪ diagonal ~.123, flat across hub quartiles, A1), so any "holography" is a property of **eff-rank-25 storage**, not read-out smearing; (iii) c_i is **not** a *separable dark-matter blob that uniquely decodes to the gist disjoint from lexis* — B1's apparent "gist transfers to a lexical skeleton" relied on a **leaky oracle skeleton** (per-id mean pre-pool state); B1's leak-free ridge rung never crossed over (skeleton .465 < residual .820). "Lexically bound" survives only in the weak **which-words** sense.

**The result that lets the whole chapter stand — A2 identifiability.** The factorization is **partially identifiable**, and crucially the **load-bearing claims are gauge-rigid** while only the **descriptive numbers** are frame-relative. **R_pos is gauge-rigid** (median eig-angle ~**.011 rad** at any init, 7.7% with-Q drift, anchored to architectural L0 positional encoding). **c_i is behaviorally rigid** (relative eff-rank **2.0%** drift across genuine seeds, exactly orthogonal-gauge-invariant) — though its *absolute* eff-rank magnitude is basis-frame-relative. **R_role is behaviorally rigid** (compose-agreement ≈ **1.00** across disjoint seeds) — though its *geometry* (.87 rad, reflective, det Q = −1) is **not** gauge-invariant (det-Q sign-flips across λ; angle ~doubles, +0.27 rad, under a recon-preserving non-orthogonal gauge). Neither pre-registered falsifier fired (not the clean-positive "all <10%", not the worse "nothing pinned"). The decisive consequence: the **shared-pooled-fit-artifact worry** that hung over the convergent deflations is **allayed** — the behavioral c_i claims (essential / late-FFN-written / carries-meaning / dilutive) ride on objects that are seed- and gauge-stable, not on a manufactured fit label. The chapter is no longer "provisional until A2"; it is closed.

**Net, calibrated.** What falls: c_i as an *additive*, *nonlocal*, *separable-uniquely-decoding* dark-matter store. What stands and is now *located, causal, and identifiability-validated*: c_i is **essential to and the location of** decodable meaning (B1b/B1c/SAFE-1), **causally written by late FFN** and **mid-scale-localizable** (B4/B4b), **globally constructed** (A4), **gauge-rigid in its behavioral claims** (A2), and **the mechanism behind §17's auditor-foolability** (SAFE-1). The live danger remaining is over-stating *either* side; §18.4 and §18.6 flag both.

---

## 18.1 — The positive account: where meaning lives, what writes it, what it is not

This is the centerpiece. Each claim carries its experiment tag and exact number.

### Where the meaning lives — the c_i field, not the clock (B1b + B1c)

**B1b** asks the sharp quantitative question: how much of z's *decodable meaning* survives a pure order-free **bag** of its own lexical codes E(w_i) (ridge-E, NO c_i), decoded through the **real SONAR decoder**? Answer: almost none. The bag recovers SBERT **0.299** / token-id Jaccard **0.189**, vs the full-z ceiling **0.985 / 0.916** and a random-word-bag floor **0.045** — only 30% of ceiling SBERT and 19% of the original token ids. The pre-registered falsifier (bag SBERT < 0.5) **TRIGGERED**; the "bag decodes .65–.85" prediction is FALSE. Of the 0.687 SBERT gap from bag→ceiling, order/role adds **0.164** and **c_i / composition carries 0.522**. *Decodable meaning is overwhelmingly compositional and c_i-borne; a bag of the right words decodes to fluent-but-wrong text* (e.g. a bag about repealing the ACA decodes to "The unity prevented the use of alcohol…"). *[B1b]*

**B1c** then isolates *which* compositional ingredient restores the meaning, with a 4-rung ladder that toggles the clock (R_pos binding) and c_i independently (n=600, held-out; rung-C identity cos→z = 1.0, so c_i is the real residual not a refit):

| Rung | Construction | SBERT | token-id Jaccard | cos→z |
|---|---|---|---|---|
| bag | (1/N)Σ E(w_i), order-free, no c_i | 0.299 | 0.189 | 0.401 |
| +clock, no c_i | (1/N)Σ R_pos^{p} R_role E(w_i) | 0.463 | 0.297 | 0.524 |
| **+c_i, no clock** | (1/N)Σ (R_role E(w_i)+c_i), **order-free** | **0.901** | **0.691** | 0.821 |
| full (ceiling) | (1/N)Σ R_pos^{p}(R_role E(w_i)+c_i) | 0.986 | 0.916 | 1.000 |

The **clock alone adds +0.164** over bag; **c_i alone adds +0.602** (B−A = +0.438). An order-free c_i field already recovers SBERT **.90** and 69% of tokens. **The decodable meaning is the c_i field; holographic position-binding is the small term.** *[B1c]*

### What writes it — late FFN, mid-scale-localizable, hub-concentrated at the field level (B4 + B4b)

**B4** (first causal gears test; n=1500 sents, 49k tokens; per-layer zero+mean ablation of self-attn-out and ffn-out, per-head in late layers). Mean-ablating late FFN collapses ‖c_i‖ while lexical content survives (so the collapse is **c_i-specific**, not a generic state break): content-token c-collapse **+0.073 (L19), +0.074 (L20), +0.100 (L21), +0.135 (L22), +0.118 (L23)**, **hub-concentrated** (+0.27 at L22, **+0.31 at L23**), Δlex-cos only −.004 at the top component. By contrast **attention ablation is flat across all 24 layers** (late-attn mean collapse **0.003**, across-layer std 0.018) and **no single head moves c_i** (max head collapse ~0.006). The pre-registered "late *attention* writes c_i" is FALSIFIED in the informative direction: the writer is **late FFN**, refining §17.4's correlational attention story into a causal FFN one. (Skeptic control: the L23-*zero* ablation gives an off-distribution −0.75 break, caught by the zero-vs-mean comparison; the clean signal is the monotone L19–22 *mean*-ablation series.) *[B4]*

**B4b** localizes the writer one level deeper — *which* FFN neurons, and what they fire on (1200 sents, 39k tokens; magnitude-attribution `|act_j · ⟨R^{-p} W_out[:,j], ĉ_i⟩|` over the 40,960 late-FFN units of L19–23; mean-ablate top-K vs random/bottom controls). c_i has a **mid-scale neuron handle**: the absmag-ablation curve is **+5.0% (K8) → +7.0% (K16) → +9.0% (K32) → +10.8% (K64) → +12.3% (K128) → +13.2% (K256)** content c-collapse, **saturating** by ~256 neurons — that 13.2% matches B4's single-layer L22 ceiling (+13.5%). Hub c-collapse runs roughly double throughout (K64 +23.8%, K256 **+31.6%**). The effect is **18× random** (K64 content +10.8% vs random +0.56%) and **lexical / z survive** (Δlex-cos small, z-cos 0.96 at K256 — specific, not a generic break). So c_i-writing is **sparse-ish / mid-scale**: ~64–256 neurons (**0.16–0.6%** of the 40,960 late-FFN units), L20–22-concentrated (top-200 writers: 38/65/71 in L20/21/22; the L23-dominated set is the *anti-writer* tail), saturating ~256. "Single neuron" rejected; "fully distributed" rejected.

**B4b's falsified sub-hypothesis (recorded honestly).** The pre-registered per-neuron **HUB-DETECTOR** framing is **FALSIFIED**: among top-200 writers only **27.5%** are hub-selective (hub/non-hub ratio >1.3), the median writer is mildly **anti-hub** (ratio **0.52**, corr_outdeg −0.05, vs a freq-matched control ratio 0.91). So c_i-writers are **not individually hub-firing units**; rather their *collective* write lands disproportionately ON hub tokens (hub c-collapse ≫ content at every K). **Hub-concentration is a FIELD-level property, not a per-neuron tuning.** Position-axis falsifier NOT triggered (median corr_relpos ~−.005). NB the script's auto-`verdict` field reads "FALSIFIED" because it scored only the *signed*-alignment arm (predicted reduction along the *current* ĉ_i direction), which is not predictive of ablation effect; the correct localizer is **magnitude** attribution (`verdict_corrected`), and on it the small-set prediction is confirmed at the 10²-neuron scale. *[B4b]*

### What it is NOT — not an LLM field, not locally constructible (P6 + A4)

**P6** (the §17.30 loose end closed at per-token granularity): a per-token frozen Llama-3B predicts c_i at R² **.0774 (L18) ≈ paragraph .081**, still **< the 0.11 lexical bar** and ≪ the 0.45 same-object threshold. Going per-token bought nothing. c_i is **SONAR-internal, not an LLM-legible predictive field**. (The .0097 "per-token one-hot bag" is the WRONG baseline — a single-token one-hot is trivially weaker than §17.30's paragraph bag; P6's agent flagged and overrode the script's auto-FALSIFIED for exactly this. The negative is **confirmed at per-token granularity, not reopened**.) *[P6]*

**A4** (closed-form encoder-write ladder, sharpened by P1/P2): the anchor reproduces §17.4 (c_i-VE **.177**); adding pairwise (rung ii), cross-attention (iii), and transformer (iv) terms **plateaus at VE .163 (rung ii) and never beats the anchor** (iii .139, iv .124 — overfit, train-VE up / held-out down). The pre-registered "c_i becomes .45–.55 constructible then plateaus" is **FALSIFIED**, as is "fully constructible (>.9)." So c_i is **irreducible to local (windowed/pairwise) structure** — a **global, all-to-all object**, exactly what a late-FFN writer over the full bidirectional context produces. (State this as "not LOCALLY constructible," NOT "uncomputable": a global model with full context — namely late FFN — does capture it.) *[A4]*

### The SAE dictionary view (per-token SAE — NEW)

The first non-pooled SAE (h32768 / k64 BatchTopK, 20 epochs, 250k eval tokens, 0% dead) trains to FVU **0.081** / cos **0.914** — **4.5× better than the pooled-z SAE (.366)**: per-token states are far more compressible (the pre-registered prediction CONFIRMED). A bigger h65536 dictionary does NOT improve it (FVU .086 — plateau). The dictionary partitions into:

- **LEXICAL: 32,734 features (99.9%), 96% of energy** — token-identity-selective beyond frequency (12,796 features above the freq-purity baseline; positive regression residual).
- **POSITIONAL: 34 features (0.1%), 4.1% of energy** — and a clean skeptic check finds **0 of 34 edge-dominated, all 34 genuine interior** (not lang-token/`</s>` artifacts): this is the **R_pos clock family**, separable and real.
- **CONTEXT: 0 separable features.** c_i does **not** form its own dictionary family. Token-purity is **flat across rarity terciles** (common .160 / mid .163 / rare .170 — if anything *rising*, i.e. context entangles into rare/content tokens rather than splitting off), and lex-vs-pos MI gap is decisive for 99.96% of features. **c_i is finely entangled INTO the lexical family** — token-identity-bound and per-token-spiky — consistent with B1c's order-free per-token field and with A1's storage/read-out split.

This is itself a confirmation of the generative model's **families** (lexical + positional clean and separable) AND of c_i's **non-separability** (no context family). *[per-token SAE]*

### Refinements: capacity, locality, order

- **Capacity (pooling-capacity).** One z holds **~25–35 content words** at content-recall ≥.85; soft .70 knee at **~42 content words (~112 tok)**; SBERT-.85 knee at **~165 tok**; **graceful decay, NO cliff**; edge>interior (first-token recall .90 / last .85 / interior .74, U-shaped). This **corrects** the CAPACITY_LEDGER Frady "length cliff" forecast (measured .90 vs forecast ~.45 @ ~60 tok) — SONAR binding is far more capacious than the interference-channel model predicted, and bounds how much of B1b/B1c's "composition carries 0.522 / c_i-alone .90" is real per-token info vs decoder saturation (it is real well within these lengths). *[pooling-capacity]*
- **Locality is GRADED by lexicalization (A1b/A1c/A1d).** Lexical sentiment Gini **.55** (3 slots, single word, top slot = sentiment word 85%) > topic Gini **.41** (5 slots, ~3 words) > implied sentiment Gini **.37** (least peaked, but **span-local**: top slot in the diff-span 100%, diff-span mass .53). None met the distributed falsifier (Gini<.3 AND >50% slots). More-lexicalized attributes are more peaked; even non-lexical implied sentiment is span-local, not whole-sentence distributed. CAVEATS: (a) the **diradd attribution is a construction artifact** (adds the same field vector to every slot → Gini≈0 by build) — only the **TRANSPLANT** per-slot swap is a valid locality measure; ignore diradd-Gini auto-verdicts; (b) A1d's implied-ness gate is only marginal (z AUC .96 vs bow .82, gap .145 < .15 target), so "implied sentiment is span-local" is suggestive, not clinched. *[A1b/A1c/A1d]*
- **Order is PARTIALLY BILINEAR (A5-fix, supersedes broken A5).** A sanity-gated nested-linear bilinear comparator reaches Kendall τ **.36** (best gate-passing rank 64), between linear **.26** and MLP **.57** — recovering ~32% of the linear→MLP gap (shuffled-label τ .006). Bilinear partially names the surface-order comparator (P8), but genuine deeper-than-bilinear depth remains (MLP still +.20). (Correction: the earlier A5 "bilinear below linear" was a BROKEN fit — mis-specified constant term + non-converged optimizer; A5-fix's gate-passing low-rank bilinear match-or-beats linear.) *[A5-fix]*

**One-line synthesis.** c_i is an **intrinsic-heavy, dilutive, late-FFN-WRITTEN (L19–23, mid-scale-localizable to ~64–256 neurons, hub-concentrated at the field level), globally-constructed (not locally reducible), SONAR-internal** field that is **genuinely high-rank in storage** but **local on read-out** and **entangled into the lexical SAE dictionary** — fit at the pooled level, **gauge-rigid in its behavioral claims (A2)**, and **essential to and the LOCATION of the decodable meaning** of z (B1b/B1c) including the **audit-relevant meaning surface lexis misses (SAFE-1)**. What it is *not*: not *additively context-accumulated*, not *nonlocal on read-out*, not *locally constructible*, not an *LLM-legible* field, and not a *separable store that uniquely decodes to the gist disjoint from lexis*.

---

## 18.2 — Claims revised (all ~26 results, with §cite + exact number + experiment tag)

| # | Prior claim (cite) | Night-5 evidence | New status |
|---|---|---|---|
| R1 | **Generative model is a token-wise generator** of z (§16.1, read causally) | P1: skeleton-only recon cos **.583**; "cos 1.0" is the c≡residual tautology; algebra-patch reproduces oracle only partially (A1: editshift .058 vs oracle .133) | **DOWNGRADED** → "pooled-level decomposition, not a causal token generator." Headline model needs this qualifier. *[P1, A1]* |
| R2 | **E(w_i) is the context-free code; c_i is context the token earns** (§16.1; §16.2) | P1: iso N=1 resid_frac **1.72** (over-h), iso/in-context rnorm **1.31×**; P2: intrinsic fraction **1.75**, c_i *contracts* ~40% (0.563→0.349) | **REFUTED (direction reversed).** c_i is intrinsic-heavy and dilutes; not a context-acquired remainder. *[P1, P2]* |
| R3 | **"Read the gist off c_i" (SBERT .851); meaning in c_i, lexis in skeleton, near-DISJOINT** (§17.28) | B1b: word-bag w/o c_i decodes SBERT **.299** / 19% tokens vs full-z .985 / 92%; c_i carries **0.522** of the 0.687 gap. B1c: c_i-alone (order-free) SBERT **.901** / 69% tokens; clock alone only **.463**. B1's transfer was ORACLE-driven; leak-free ridge never crossed (.465 < .820). SAFE-1: lexical auditor **.491** = chance vs c_i **1.000**. | **TEMPERED + LOCATED, NOT deflated.** The *over-reading* (separable store uniquely decoding disjoint from lexis) falls; **c_i is ESSENTIAL and is WHERE decodable meaning lives.** "Lexically bound" survives only in the weak which-words sense. *[B1b/B1c/SAFE-1]* |
| R4 | **c_i is a nonlocal, hub-spreading field** (§16 holographic framing; §17.4) | A1: off-diag mass **.088** (off-diag .012 ≪ diag .123), flat across out-degree quartiles (q1 .061, q4 .074); strictly localist on read-out (null jitter 0.0) | **REFINED.** High-rank lives in **storage** (eff-rank-25, survives); read-out is local. "Holographic" applies to storage geometry only. *[A1]* |
| R5 | **‖c_i‖ is surprisal-driven** (§17.4: surprisal R² .161) | P2: per-increment surprisal R² **.006**; P7: surprisal R² **.025 eng**, ≤.077 all langs — out-degree is the driver. B4: writer is late FFN (L19–23), not attention. | **REFINED → out-degree-driven, surprisal weak; WRITER = late FFN.** Surprisal "drives" only as a weak pooled-endpoint proxy. *[P2/P7/B4]* |
| R5b | **§17.4: ~19% of c_i is closed-form attention-constructible** (single-attn R² .152, bundle .193) | A4: ladder reproduces anchor (.177) but pairwise/cross-attn/transformer rungs **plateau at VE .163** and do NOT beat it (iii .139, iv .124 overfit) | **CONFIRMED-AS-CEILING + EXTENDED.** §17.4's ~19% is the *ceiling* of local constructibility; c_i is irreducible to local structure → global/all-to-all (consistent with B4). *[A4]* |
| R6 | **c_i is ~80% truly-lost / irreducible content** (§16.2 "one object, three ceilings") | A1 (local read-out) + P2 (dilutive) undercut "nonlocal/additive holographic store"; B1b: c_i carries **0.522** of the gap; B1c c_i-alone **.901**; A4: NOT locally constructible (VE plateau .163, never beats .177, .45–.55 FALSIFIED) | **SCOPED + SHARPENED.** Falls: irreducibility-as-"separable blob uniquely decoding to gist." STANDS: c_i is essential (B1b/B1c) AND genuinely irreducible to LOCAL structure (A4 → global object). Reducibility now CLOSED by A2 (behaviorally gauge-rigid). *[A1/P2/B1b/B1c/A4/A2]* |
| R7 | **c_i is NOT an LLM-legible predictive field** (§17.30 DECISIVE-NEGATIVE, paragraph-pooled; per-token owed) | P6: per-token Llama-3B R² **.0774 (L18) ≈ paragraph .081**, still < 0.11 lexical bar, ≪ .45 threshold | **SURVIVES → negative CONFIRMED at per-token granularity.** Going per-token bought nothing. The .0097 one-hot bag is NOT the baseline (agent overrode auto-FALSIFIED). §17.30 upheld. *[P6]* |
| R8 | **R_role's π-eigenplanes do NOT encode role / random beats π** (§17.22) | P5: comp-operator R_role π-planes separate spans AUC **.970 / acc .928** vs random **.901 / .846** (+.082 acc), survives position-stratification; but R_role⁻¹ drop **0.0** (R_role does not *produce* it) | **REFINED → instrument-dependent SOFTEN.** π-planes mildly special & role-bearing under the *composition-operator* R_role (≠ §17.22's HRR-fit R_role). Commutator-null survives in both. *[P5]* |
| R9 | **R_role is a real per-token role rotation in the single-sentence model (~0.87 rad, reflective)** (§16.1; §17.1) | exp-4: free-sentence refit → angle **.43–.68 rad** (not .87), det +1, seed-unstable, harness-R_role on free recon HURTS −.326. exp-4b: inside two-span composition, forcing Q→I costs **−.193 cos / −.180 decode-SBERT** (4× falsifier); Q load-bearing, not S (drop S keep Q = only −.077). A2: behaviorally rigid (compose-agreement ≈1.00) but geometry NOT gauge-invariant. | **RE-SCOPED → COMPOSITION-SPECIFIC, NOT a vacuous artifact.** R_role is **load-bearing for two-span composition** (exp-4b) but **vestigial in single sentences** (exp-4, §17.9 ~0% within-sentence CE). CLARIFY: §16's headline "0.87 rad" was the **order-swap involution S** (n_axes_near_π 132, det −1), NOT R_role. *[exp-4/exp-4b/A2]* |
| R10 | (new) **Semantic content's locality** | A1b: lexical sentiment flips median **3 slots**, Gini **.55**, top slot = sentiment word 85%. A1c/A1d: GRADED by lexicalization — lexical .55 > topic .41 > implied **.37** (span-local: diff-span mass .53). None met distributed falsifier. | **NEW → semantic locality is GRADED, never fully distributed.** CAVEATS: diradd is an ARTIFACT (only TRANSPLANT valid); A1d implied-ness gate marginal (gap .145). *[A1b/A1c/A1d]* |
| R11 | (foundation check) **Is the P1 iso-residual an artifact of the shared HRR-fit E?** | P1-robust: iso-residual survives across **5 independent E definitions** (ridge per-id mean, model-intrinsic input-emb, all-occurrence per-id mean E3); over-h iso-frac >1 for all (E0 1.71, E3 1.54), over-E 0.75–0.87 | **DE-RISKS the foundation.** The iso-residual is real, not artifactual. CAVEAT: strict ">1" denominator-dependent; "large not small" robust either way. *[P1-robust]* |
| R12 | (new) **Pooling capacity / collision** | Pooling-capacity: content-recall ≥.85 to **~25–35 content words**; soft .70 knee **~42 (~112 tok)**; SBERT-.85 knee **~165 tok**; graceful, NO cliff; edge>interior (.90/.85/.74) | **NEW → one z holds ~25–35 content words; far more capacious than forecast.** CORRECTS CAPACITY_LEDGER's Frady "length cliff." *[pooling-capacity]* |
| R13 | (new) **Order-store comparator form** | A5-fix: bilinear τ **.36** (rank 64), between linear **.26** and MLP **.57** (~32% of the gap); shuffled τ .006 | **NEW → order store is PARTIALLY BILINEAR.** Genuine deeper-than-bilinear depth remains (MLP +.20). Earlier A5 "bilinear<linear" was a BROKEN fit. *[A5-fix]* |
| R14 | (new, GEARS) **What writes c_i?** §17.4 named late *attention* correlation | B4: late-FFN mean-ablation collapses ‖c_i‖ (+.073→+.135 over L19–22, hub +.31 at L23), lexical survives; **attention flat all 24 layers** (.003), no single head (max ~.006) | **NEW (first causal gears) → c_i is WRITTEN BY LATE FFN, hub-concentrated, NOT attention.** Born-late (§17.4) causally confirmed; writer identity corrected from attention to FFN. *[B4]* |
| R15 | (new, GEARS) **Which neurons write c_i? Are writers hub-detectors?** (B4b pre-reg: small hub-tuned set) | B4b: top-256 of 40,960 → +13.2% content / +31.6% hub c-collapse (18× random), saturates ~256, lexical/z survive; **HUB-DETECTOR FALSIFIED** (27.5% hub-selective, median writer anti-hub ratio 0.52); position falsifier not triggered | **NEW → MID-SCALE neuron handle (~64–256, 0.16–0.6% of units, L20–22); hub-concentration is FIELD-level not per-neuron.** Refines B4 one level deeper. (Auto-`verdict` "FALSIFIED" scored only the signed arm; magnitude attribution is the correct localizer.) *[B4b]* |
| R16 | (new, DICTIONARY) **Do features partition into lexical/position/context families?** (§17.28 SAE claim, per-token owed) | per-token SAE: FVU **.081** (4.5× pooled .366); LEXICAL 99.9% feats / 96% energy; POSITIONAL 34 feats (0.1%) / 4.1% (0/34 edge = clean R_pos clock); **CONTEXT 0 separable features** (purity flat across rarity) | **NEW → lexical + positional families clean & separable; c_i ENTANGLED into lexical (no context family).** Confirms the generative model's families AND c_i's non-separability. h65536 no better (.086 plateau). *[per-token SAE]* |
| R17 | (GATING, IDENTIFIABILITY) **Is [R_pos, R_role, E, c_i] a unique factorization or fit-artifact?** | A2: R_pos gauge-rigid (eig-angle ~.011 rad, 7.7% with-Q drift); c_i behaviorally rigid (eff-rank **2.0%** drift, orthogonal-invariant) but absolute magnitude frame-relative; R_role behaviorally rigid (compose-agreement ≈1.00) but geometry NOT gauge-invariant (det-Q flips across λ; +0.27 rad under non-orth G). Neither falsifier fired. | **NEW (PARTIALLY IDENTIFIABLE) → behavioral claims gauge-rigid; descriptive numbers frame-relative.** Shared-artifact worry ALLAYED — the chapter's load-bearing claims survive; only R_role geometry and eff-rank *magnitude* must stay hedged as SONAR-frame readings. *[A2]* |
| R18 | (SAFETY CAPSTONE) **Does c_i carry audit-relevant meaning beyond surface lexis?** | SAFE-1: on implied-attribute task, order-free lexical auditor AUC **.491 = chance** vs full-z / c_i **1.000** (hidden-meaning gap **.509**); overt-control gap **0.0**; lexical fooled on contrast sentences c_i flags | **NEW → audit-relevant meaning lives in compositional c_i, not surface words.** Mechanism behind §17 auditor-foolability. CAVEAT: a tf-idf bigram model partially cracks clause-order (.99), so the robust claim is "ORDER-FREE LEXICAL features insufficient; audit on z/c_i." *[SAFE-1]* |

---

## 18.3 — Identifiability & calibration (A2) — what's gauge-rigid vs frame-relative

A2 is the gating result that lets the chapter stand. The factorization `[R_pos, R_role, E, c_i]` is **partially identifiable** — a middle outcome in which *neither* pre-registered falsifier fired (not the clean-positive "all four objects <10% drift", not the worse "R_pos drifts wildly → nothing pinned"). The arbiter was **behavioral** (Procrustes solo-B→in-AB per-token cos for R_role; matched-filter order-τ for R_pos), not Frobenius/matrix-cos — and a deliberate methodological point is that **matrix-cos LIES** here (R_role cross-seed matrix-cos .96 looks stable but is split-specific; the behavioral compose-agreement .9996 is the honest measure). Sweep: 14,000 pairs, seeds 0/1/2, λ ∈ {1e-4…10}, gauge seeds 0/1 (orthogonal + condition-20 non-orthogonal G).

### What is GAUGE-RIGID (keep, stop hedging)

- **R_pos — RIGID.** Median eig-angle ~**.011 rad** at any init (with-Q drift **7.7%** < 10%), tracking the architectural L0 positional-encoding reference (.00991 rad). The clock is anchored to a fixed architectural object, not a fit choice. *CAVEAT to carry:* the clock and E do not fully transfer across fits (cross-fit order-τ ~0.08–0.11 vs own ~0.19) — the *angle* is gauge-rigid, the clock+E *joint readout* is partly co-adapted.
- **c_i — BEHAVIORALLY RIGID.** Relative eff-rank drift across genuine seeds **2.0%** (< 10%), and **exactly invariant under orthogonal gauge** (rotation-invariant spectral quantity; non-orthogonal recon-preserving G shifts it only −0.38 of 8.46). The *behavioral* c_i claims — essential, late-FFN-written, carries-meaning, dilutive — ride on a seed- and orthogonal-gauge-stable object.
- **R_role — BEHAVIORALLY RIGID.** Compose-agreement ≈ **1.00** (A-on-B .9996, B-on-A 1.001) across disjoint seeds: R_role's compositional *action* is reproducible. R_role is load-bearing for two-span composition (exp-4b), and that load-bearing fact is gauge-stable.

### What is FRAME-RELATIVE (keep hedging — these are SONAR-frame descriptive readings, not basis-invariant facts)

- **R_role's GEOMETRY.** The "0.87 rad, reflective, det Q = −1" description is **not** gauge-invariant: det-Q **sign-flips across λ** (det-Q range [−1, +1]; reflectivity flips between λ rungs and even between orthogonal-gauge seeds), and a recon-preserving **non-orthogonal gauge ~doubles the median eig-angle** (+0.267 rad, .27→.54) while keeping pooled-z recon (~.88). Report the *behavioral* action; do **not** report the .87-rad/reflective geometry as an intrinsic SONAR fact.
- **c_i's ABSOLUTE eff-rank magnitude.** The "~25" constant is basis-frame-relative (eff-rank is basis-covariant; here ~8.5/sent on the smaller A2 corpus/vocab vs INTERP_RESULTS' 25.5). The identifiability claim is about **drift (2.0%), not the absolute number** — so "eff-rank ≈ 25" is a SONAR-frame reading, while "eff-rank is seed-stable and orthogonal-invariant" is the rigid fact.

### The over-deflation flags (the self-tricks A2 guards against)

- **Stop over-deflating c_i.** P1+P2+A1+B1 are tempting to read as "c_i is a dispensable pooling-sink residual; meaning is in the lexis." **That is false** — B1b/B1c/SAFE-1 are *opposing-direction* results (c_i essential, located, audit-bearing) and are NOT vulnerable to a "shared artifact manufactured the agreement" critique. Any sentence in §18 reading "c_i is reducible / gist is in the skeleton / c_i is just residual energy" is an over-deflation; temper to "essential but not separable / not additive / not nonlocal."
- **Stop reading descriptive geometry as gauge-invariant.** R_role's .87-rad reflectivity and c_i's eff-rank-25 are SONAR-frame numbers, per the bullets above. This is the *under*-deflation self-trick — A2 says these specific descriptors are frame-relative even though the behavioral claims around them are rigid.
- **The shared-artifact worry is allayed, not ignored.** P1/P2/A1/B1 share `tae_hrr_prepool_model.pt`; A2 (gauge-rigidity of the behavioral objects) + P1-robust (iso-residual across 5 E definitions) together show the shared fit is not manufacturing the convergence. The chapter is closed, not provisional.

---

## 18.4 — A coherent revised picture

c_i is best read as the **per-token portion of the late-stack pooling-sink field**: in the last quarter of the encoder each token's state acquires a large, high-rank component aligned with its attention out-degree (a hub field), **written by late FFN (L19–23)** via a **mid-scale set of ~64–256 neurons** (L20–22-concentrated) whose collective write lands disproportionately on hub tokens, and mean-pooling sums these. Because the field is **intrinsic-heavy** (a lone word already has a big one, P1/P1-robust) and **dilutive** (each token's share shrinks as the attention budget is shared, P2), it behaves at the pooled level like "the residual energy the clean lexical+clock skeleton does not capture" — which is exactly why a least-squares fit assigns ~40% of pooled-z to it (P1). On read-out the decoder reads each stored slot **locally** (A1), so the high rank is a fact about **superposed storage**, not nonlocal read coupling; and in a per-token dictionary c_i does not split off as its own family but **entangles into the lexical features** (per-token SAE).

But — the load-bearing correction — that "residual energy" is **not** discardable: it **carries the decodable meaning**. A pure word-bag without it decodes to fluent nonsense (SBERT .299, B1b); an order-free c_i field alone restores the proposition (.901, B1c) while the clock alone reaches only .463; and on an *implied*-attribute audit, surface lexis is at chance (.491) while c_i is perfect (1.000, SAFE-1). Meaning is **compositional**, jointly carried by lexis + order + this hub field, and **c_i is essential** to it — and **globally, not locally, constructed** (A4). A2 certifies that these behavioral claims ride on gauge-rigid objects.

What this does NOT overturn: c_i is real, high-rank in storage, hub-located (at the field level), cross-lingually universal, intrinsic-heavy, **essential to and the location of decodable meaning**, **causally late-FFN-written and mid-scale-localizable**, **not locally constructible**, **not an LLM field**, and **gauge-rigid in its behavioral claims**. The revision is narrowly that it is **not** a *separable* store you can read the gist off of as if lexis were absent, **not** an *additive* context-accumulation, and **not** *nonlocal on read-out*. Essential ≠ separable.

---

## 18.5 — Safety implications (SAFE-1 + the auditing story)

SAFE-1 is the safety capstone and the mechanism behind §17's auditor-foolability finding. The setup: a **bag-invariant implied-attribute** task (implied sentiment via contrast sentences, n=320, balanced) audited three ways — an order-free LEXICAL auditor (z-bag of E(w_i)), a FULL-z auditor, and a c_i-field auditor — with an OVERT-attribute positive control.

**Result.** On the implied task the order-free lexical auditor scores AUC **0.491 = chance**, while full-z and c_i both score **1.000** — a **hidden-meaning gap of 0.509** (ci_minus_full_z = 0.0, i.e. c_i alone is as good as full z). The overt-attribute control shows gap **0.0** (lexical suffices, 1.000). The pre-registered prediction (gap ≥ .15) is met; the falsifier (gap < .05) does not fire; the positive control is clean. Worked examples: "The mattress was a genuine pleasure, but the rival brand was a constant headache" — lexical-auditor scores benign (0.0), c_i-auditor flags (1.0). **Audit-relevant meaning lives in the compositional c_i field, not the surface words** — which is precisely *why* an auditor reading off lexical/surface features can be fooled (§17), and why the §17.36 "editing is a memorized lexical trick" finding sits where it does.

**The load-bearing CAVEAT (do not over-claim).** A tf-idf **bigram** BoW partially cracks the task (AUC **0.991**), i.e. clause-*order* bigrams recover much of the contrast signal. So the robust, calibrated claim is **NOT** "no lexical model can audit this" — it is "**ORDER-FREE lexical features are insufficient (chance); audit on z or the c_i field, which carry the compositional/order-bearing meaning.**" This is consistent with the order-store being partially bilinear (A5-fix) and order being surface-positional (P8).

**Net safety takeaway.** For SONAR latents, surface/bag-of-words auditing is **structurally insufficient** for implied meaning; the audit-relevant signal is in the compositional c_i field, which is (i) essential and located (B1b/B1c), (ii) causally late-FFN-written and mid-scale-localizable (B4/B4b → a concrete handle for monitoring/intervention), and (iii) gauge-rigid in these behavioral properties (A2). A monitor that reads z or c_i — not a lexical bag — is the supported design. The same field's mid-scale neuron handle (B4b) is the natural target if one wants to *intervene*, with the caveat that editing demonstrated so far is lexical-level (§17.36).

---

## 18.6 — What SURVIVES intact (do not over-correct)

Tested again at the pre-pooled / per-token level and held; the deflation is about c_i's *interpretation*, not these load-bearing facts.

- **c_i is essential to and the location of decodable meaning (B1b/B1c).** Word-bag w/o c_i SBERT .299 / 19% tokens; order-free c_i-alone .901 / 69%; clock-alone .463.
- **c_i carries audit-relevant meaning surface lexis misses (SAFE-1).** Implied-attribute lexical auditor .491 = chance vs c_i 1.000.
- **c_i is causally written by late FFN (L19–23), mid-scale-localizable to ~64–256 neurons (B4/B4b).** Hub-concentration is field-level, not per-neuron.
- **c_i is irreducible to local structure (A4).** Closed-form local ladders plateau VE .163, never beat §17.4's .177 → global/all-to-all object.
- **c_i is SONAR-internal, not an LLM field (P6).** Per-token Llama-3B .077 ≈ paragraph .081, below the 0.11 bar.
- **c_i's behavioral claims are gauge-rigid (A2).** eff-rank drift 2.0%, orthogonal-invariant; R_pos eig-angle 7.7%; compose-agreement ≈1.00.
- **One z holds ~25–35 content words with graceful (non-cliff) decay (pooling-capacity).** Corrects the CAPACITY_LEDGER length-cliff forecast.
- **The iso-residual (deflation premise) is REAL, not a shared-artifact (P1-robust).** Holds across 5 independent E definitions.
- **Pooling identity γ = 1.00, exact.** `cos_pool_vs_cached_z = 1.0`; B1 `sanity_pooled_vs_trueZ_cos = 1.0`; SAFE-1 / B4b / B1c ci-identity cos→z = 1.0. The mean-pool law is untouched.
- **Lexical-code identity (N=1 E = pooled-fit E).** Procrustes cos **.917 > .8** (null .824); isolated-word decode fidelity .93 (P4).
- **Storage eff-rank is high-rank and seed-stable.** `eff_rank_c_per_sentence_mean = 26.5` (P1, matches §16's 25.5); A2 drift 2.0%; content>stop ratio 1.48×. (Absolute magnitude is SONAR-frame, per §18.3.)
- **The per-token dictionary is lexical + positional, with c_i entangled (per-token SAE).** FVU .081; LEXICAL 99.9%/96%, POSITIONAL 34 feats (R_pos clock, 0/34 edge), CONTEXT 0 separable.
- **Out-degree hub-field universality (cross-lingual).** c_i concentrates on high-out-degree pieces in fr/de/zh (out-degree R² .16–.19, q4/q1 ‖c_i‖ 1.6–1.9×), same as English (R² .21, q4/q1 2.44); surprisal weak everywhere (R² .025–.077) (P7).
- **Word×position SEPARABILITY in the single-token state.** Interaction energy only 1.6% (fixed) / 4.5% (varied) ≪ 20% (P3); reinforced by the SAE's clean lexical/positional split.
- **Order is SURFACE / positional, not semantic-event.** Decode keeps surface clause order 85–88%; per-token curved-store recovers surface index (τ 0.21) ≫ event-rank (τ 0.02 ≈ shuf-null) (P8); partially bilinear (A5-fix).
- **The clock is a POOLED-level object, not per-token.** On single tokens, un-rotating by R_pos^{−p} does *negative* work (raw within-word cos .967 → .820 ≈ shuffled-pos .818; P3), consistent with R1.
- **R_role is load-bearing for two-span composition (exp-4b).** Forcing Q→I costs −.193 cos / −.180 decode-SBERT; not a vacuous artifact — composition-specific.

---

## 18.7 — Open / owed

A short, honest list of what remains.

- **R_role geometry, intrinsic vs SONAR-frame.** A2 shows the .87-rad/reflective description is gauge-relative while the compositional action is rigid. A basis-invariant characterization of the composition operator (what survives across gauges) is owed if the geometry is to be reported as a fact rather than a frame reading.
- **c_i's neuron handle — content semantics.** B4b localizes the writer to ~64–256 L20–22 neurons but their *firing semantics* are only partly characterized (hub-detector falsified; not pure position/surprisal). What these neurons compute (and whether intervening on them yields *semantic*, not just lexical, edits — cf. §17.36) is owed.
- **c_i as binding vs glue.** B1c locates the decodable mass in c_i; whether the field is SONAR's *semantic binding* operator or per-token compositional glue is only partially settled by decode examples. Do not over-state "c_i = SONAR's semantic binding operator."
- **SAFE-1 generality.** The capstone is one implied-attribute family (sentiment-contrast) with a known caveat (bigram BoW cracks it at .99). Whether "order-free lexical auditing is insufficient" generalizes to other safety-relevant implied attributes (deception, harmful intent) is owed before deploying the "audit on z/c_i" recommendation broadly.
- **Editing deployability.** Demonstrated editing remains lexical-level (§17.36); a semantic edit via the B4b handle is untested.

---

*This §18 is the complete, final night-5 chapter and is ready to merge into INTERP_RESULTS.md as §18 (all four gating/closing results — A2 identifiability, per-token SAE, B4b, SAFE-1 — folded in; no PENDING items remain).*

---

# §19 — Reading c_i: how far the meaning field decomposes

*Wave-6 "can we READ the c_i meaning field" battery, 2026-06-17, **refined by wave-7, wave-8, and wave-9, 2026-06-18** (night-5/day-6 frontier). This is the 'reading c_i' chapter: §18 located c_i (WHERE decodable meaning lives, B1c), found its writer (late FFN, B4/B4b), and certified its behavioral claims gauge-rigid (A2) — but never READ it. Wave-6 attacked the reading question head-on; **wave-7 (W1/W2/W4/W6/W8/W9) sharpened six edges; wave-8 (W10–W18) added nine more and CORRECTED four over-claims; wave-9 (W19/W21/W22/W23/W24) added five more and CORRECTED one wave-8 over-claim** — see the wave tags inline. **Wave-9's headline correction: W19 OVERTURNS wave-8's W13 — possession is NOT abstractly bound; its cross-form transfer is carried by the genitive ('s/of) morpheme and collapses 1.0→0.043 when the clitic is neutralized, so "z has essentially no abstract relational structure" stands (§19.4).** The wave-8 corrections still standing, flagged where they land: numbers are read EXACTLY, not log-blurred (W16, §19.9b); the interpretable readability ceiling is capped ~0.71 and NO external text encoder beats it (W14, §19.1); and readable≠causal NARROWS ~3× in R1's native SAE-atom basis (W15, §19.6). Wave-9 also adds: the binding taxonomy is genitive-morpheme-specific (W19, §19.4), multi-hop composition rides the genitive/possessive links only (W24, §19.4), R1's atoms are CROSS-LINGUAL (W23, §19.5), the directional-harm blind-spot holds on real safety relations with a parse fix (W21, §19.8), and z is a paraphrase-robust CONTENT auditor (W22, §19.8). Wave-8 also adds: the LN mechanism is a ~53× rank-expansion (W12, §19.5), the safety defense (W11: parse fixes what augmentation cannot, §19.8), a second relational exploit (W17 timeline, §19.8), style-is-lexical (W18, §19.9b), the topic-subspace-without-canonical-axes (W7, §19.5), and the decoder-side reading profile (W10, §19.9c). Sources: NIGHT5_STATE.md LOG (full scored verdicts, newest-at-top); night5_json/{r9_meaning_coverage, r3_cpool_additivity, c2_semantic_battery, c3_role_binding, r5_nested_binding, r6_writer_mechanism, r8_causal_directions (partial), ling_lexical_semantics} (wave-6) and night5_json/{w9_best_feature_coverage, w1_residual_char, w8_ci_construction, w4_crosslingual_reading, w6_adversarial_binding, w2_relational_causal} (wave-7); night5_json/{w19_binding_taxonomy, w21_safety_harm, w22_content_auditor, w23_crosslingual_atoms, w24_multihop_binding} (wave-9). Program motto: the easiest person to trick is yourself — the live self-trick on this chapter is reading "c_i is half-readable" as either "the vector space is solved" (it is not — a real residual stays holistic) or "c_i is unreadable" (it is not — it is probe-readable property-by-property, additive in concepts, and ~0.71-readable with learned topic-field features). Both over-readings are flagged in §19.10. The unsupervised SAEs landed unevenly (heavy agent-lifecycle flakiness this wave): all four learned-dictionary attempts ultimately completed — C1 (isolated per-token c_i, FVU 0.35), R1 (pooled-field, FVU 0.50), R2 (writer-space transcoder, FVU 2.08 / 0% relational features), R4 (LLM-bridge naming) — and the honest verdict is **granularity-dependent (§19.5):** at the per-token, writer, and LLM-vocabulary level c_i has **no sparse atomic code** (field-level — R6 + §18 per-token SAE), but the **POOLED order-free field (R1) does surface a real but minority sparse concept basis — 273 paraphrase-invariant nameable atoms at ~16% of energy (incl. abstract concepts like contrastive discourse and reported speech)**. So: dominantly field-level/lexical, with a real auditable ~16% nameable-atom minority recoverable only from the pooled field — neither "no atoms" nor "fully decomposable." R2 additionally found c_i is *not* the additive FFN write (cos 0.182 — non-additively constructed through the late stack); **W8 then traced exactly where it crystallizes — the encoder's final LayerNorm, which rank-expands it ~53× (§19.5, W12)**. R1's paraphrase-invariance analysis is now **complete (§19.5)**; R8's relational/attribute arms are **closed by W2 (§19.6)**.*

---

## 19.0 — Executive summary

The calibrated headline: **c_i is partially readable, and more readable than wave-6 alone implied.** It is **probe-readable property-by-property** and **additive in concepts**; a learned map from the words recovers **~half** of its decodable meaning and **hand-built properties add nothing on top** — but **learned topic-field features push reconstruction to ~0.71** of the 0.90 ceiling (W9), so the "irreducible holistic half" **shrinks to a ~0.19-SBERT residual** that is **high-rank, fine-grained lexical/entity specificity** spread over hundreds of dimensions, not a clean missing semantic axis (W1). What structure c_i does carry is **surface-positional, not abstract-thematic** — agent/patient roles, event-order, comparison, location, causal, recipient, and nesting are all positional (C3/R5/W13/W19); **z has essentially no abstract relational structure.** The one apparent exception wave-8 reported (possession "binding invariantly", W13) **is corrected by wave-9's W19**: that cross-form transfer is carried by the **genitive ('s/of) morpheme**, not an abstract role — neutralizing the genitive ("A's B" → "A owns B") collapses binding **1.0 → 0.043**, so even possession is a lexical/morpheme readout. The surface-positional structure is a set of **concrete adversarial vulnerabilities**: a meaning-preserving active→passive paraphrase **defeats z-based relational safety auditing** (W6), confirmed on **real directional harm** — threat-source/perpetrator/incitement (W21: single-voice cross-AUC 0.037, anti-correlated — it calls the victim the perpetrator); **data-augmentation does not fix it — only an explicit dependency-parse agent-slot does** (W11/W21, parse-aug 1.0); and re-narrating events out of order **flips the read timeline** (W17). The positive complement: z **is** a paraphrase-robust **content** auditor (W22: entity/quantity/claim/negation cross-paraphrase AUC 1.0 vs lexical-bag 0.944) — trust z for *what content is present*, not *who-did-what*. The content is **field-level, not per-neuron**, and **readable but not (cleanly) causal** for steering (the gap is **large for every concept type and worst for relational**, W2) — **except in the native SAE-atom basis, where the gap narrows ~3×** (W15: causal-success 0.567 vs diff-of-means 0.233). Mechanistically, c_i **is literally the late residual stream after the encoder's final LayerNorm** — that LN is the dominant non-additive step that **rank-expands** the low-rank high-norm FFN write ~53× into the c_i direction (W8/W12). The whole reading profile is **language-universal** (holds in en+fr; lexically-marked relations shift from c_i to the word-bag in morphologically richer languages, W4). By content type, **z reads numbers PRECISELY** (exact digits, W16), **style/register lexically** (word-choice not abstract field, W18), and the **event SET precisely but the timeline only as surface order** (W17). The interpretable readability ceiling is **capped ~0.71 — and no external text encoder beats it** (W14). This is neither "the vector space is solved" nor "c_i is opaque"; it is a quantified middle.

**The decisive coverage result (R9, centerpiece).** Reconstruct the order-free meaning field c_pool from a feature set, push the reconstruction through the **real SONAR decoder**, and score decode-SBERT against the original text (n=8000, 1600 test, 400 decoded). The ladder: **ceiling (true c_pool) 0.901**; a **learned (ridge) map from the lexical bag 0.499**; the lexical bag **plus ALL readable properties** (sentiment, tense, number, negation, coreference, NER, 40 topics, discourse, clusters — 113 dims) **0.504**; the **raw bag floor 0.299**; **random-feature control 0.076**. Variance-explained R² for the full readable set is **0.34**. Read this exactly: a *learned map from the words* recovers about half the meaning (0.50 of 0.90, well above the 0.30 raw-bag floor); **hand-built enumerable properties add essentially nothing on top of the words (+0.005 over the ridge bag)**. c_i is mostly-readable *as a property bundle carried by the lexical map*, with a real holistic residual (`residual_is_meaningful: true`; residual-vs-mean gain 0.266). Neither pre-registered falsifier (fully-readable ~.90; or readable-adds-nothing ~.30–.40) fired. *[R9]*

**The "holistic half" SHRINKS with LEARNED features (W9).** R9's residual is partly *bad-feature artifact*, not all fundamental: replacing hand-properties with **learned topic-field features** (the C1 isolated-c_i SAE + the R4 Llama-bridge) lifts the same decode-SBERT ladder from **bag 0.497 → +hand 0.496 (nothing) → +C1-SAE features 0.683 → +R4-LLM dirs 0.573 → +ALL learned 0.706**, R² 0.33 (hand) → 0.60 (all), recon-cosine 0.68 → 0.82; random control 0.079. So **hand-properties add nothing but learned SAE features add ~0.19** — the readable fraction rises from ~0.50 to **~0.71 of the 0.90 ceiling** (the learned set closes **~52%** of the hand-to-ceiling gap). The prereg "learned features help SOME but a substantial residual remains" was confirmed (`holistic_half_is: fundamental`, no falsifier): the residual to 0.90 — about **0.19 SBERT** — is the genuinely-hard fine detail, smaller than wave-6's "half" implied. *[W9]*

**What the residual IS (W1).** That ~0.19 residual is *meaningful but structureless*: decoded through the real SONAR decoder, residual+mean reaches **0.324** vs mean-only **0.041** (adds **0.283**) — so it carries real content — yet it is **high-rank** (98 dims for 50% of its variance, 338 for 90%; PCA top-1 share only **0.015**, no dominant axis) and its energy correlates most with **lexical** specificity axes (mean-|spearman| lexical **0.272**: mean-idf 0.47, rare-word count 0.41, OOV-idf 0.36), then **entity 0.20**, **syntactic 0.16**, **style 0.11**. The pre-registered "clean new interpretable axis" falsifier did **not** fire (no single PC/cluster names a concept). So the un-readable remainder is **fine-grained, high-rank lexical/entity specificity smeared across hundreds of dims — not a clean missing semantic axis.** *[W1]*

**Reconciliation — probe-readable ≠ reconstructible.** C2 finds that *specific* properties are PROBE-readable from c_i where the order-free bag is at chance: negation-scope and coreference both reach probe-AUC **1.000** from c_i while the word-multiset-identical bag is **0.500** (c_i-gap +0.50 each). R9 finds those same enumerable properties add ~nothing to *reconstructing* c_pool. Both are true and not in tension: **extracting a known property** (C2) is a different operation from **rebuilding the field out of properties** (R9). c_i is probe-readable property-by-property but only ~half reconstructible — the half it cannot hand back to a property list is the compositional remainder. *[C2 vs R9]*

**The structure of the readable half (R3 + C3/R5).** Where c_pool *is* readable it is a near-**additive bag-of-concepts**: additivity cosine **0.859** (entity+entity), **0.873** (entity+attr), **0.896** (pred+pred), and — once a verb confound is removed — **0.914** (agent+patient). Binding survives only as a **weak residual**: the confound-free role-swap deficit ("dog chased cat" vs "cat chased dog", same words) is cos **0.883**, i.e. ~**0.12** below the additive prediction of 1.0. And the binding that exists is **mostly surface-positional, not thematic**: cross-construction (train active → test passive) thematic-role AUC is **0.015** (anti-correlated; the probe learned "agent = first mention") while surface first-mention transfers at **0.985**; nested structure flattens the same way (within-construction z separates structure at 0.80, cross-construction structural transfer collapses to mean **0.138** while surface transfers at **0.826**). **No abstract structural binding for any relation tested** — agent/patient, event-order, comparison, location, causal, recipient, or nesting are all surface-positional (C3/R5/W13/W19), and the one apparent exception (possession cross-form AUC 1.000, W13) **dissolves under W19**: it is carried by the genitive morpheme, collapsing to 0.043 when the 's/of clitic is removed. So c_pool is a near-flat bag of (concept, surface-position), with **no genuine form-invariant relation** behind the binding wall — possession's apparent invariance is a morpheme readout. *[R3, C3, R5, W13, W19]*

**Field-level, not atomic (R6 + SAE caveat).** The c_i writers carry **position/routing, not nameable semantics**: of 256 top writers the held-out nameable AUC is **0.765** vs control **0.703** (margin only +0.061), and the driver is **discourse position (relpos) + out-degree**, with the semantic fraction **0.0078** (2 of 204 above-threshold writers key on negation/comparative/entity). Per-neuron hub-delivery is **absent** (writer output hub-mass **0.198** < 0.25 baseline < control 0.315) — B4b's hub-concentration is a **collective/field effect**, invisible per neuron. This converges with the §18 per-token SAE (0 separable context features): **c_i content is irreducibly field-level**, not a per-neuron or per-token-feature dictionary. *[R6 + §18 per-token SAE]*

**Steering — readable ≠ causal, for every type, worst for relational (R8 → completed by W2).** Supervised concept directions in c_pool are cleanly **readable** and **inject content into the decode**, but **break coherence** before they cleanly flip the concept. W2 closed R8's missing arms and the gap is **large across all three concept types**: topic read-AUC **1.0** / causal-success **0.4** (gap **0.60**), attribute read **1.0** / causal **0.4** (gap **0.60**), relational read **0.67** / causal **0.067** (gap **0.61**). Relational directions are **barely readable AND essentially not injectable** (causation read-AUC 1.0 but causal 0; agent-patient read-AUC 0.40). The readable≠causal gap §17 found for *editing* reappears for *steering*, is **wide for everything**, and is **worst for relational/structural content** — consistent with binding-is-positional (C3/R5). **The basis helps only a little: a few strong topic atoms steer better than diff-of-means** (W15, n=2: causal-success 0.567 vs 0.233; coherence-at-knee 0.839 vs 0.227) — **but the atom basis as a whole is NOT a causal handle** (W20 full sweep: only **2.5% (1/40)** of R1's atoms steer coherently, shuffled-atom 0.013). R1's pooled-field atoms are **readable/auditable, not writable** — read-out features, not write-in levers. *[R8 + W2 + W15 + W20]*

**Mechanism — c_i is the post-final-LayerNorm residual (W8).** White-box capture of the residual stream at every probe point (no ablation, n=1200 sents, 39k tokens) shows c_i **crystallizes LATE and NON-additively**: cos-to-c_i is **0.007** at embed, **0.247** at L18, **0.169** *just before* the encoder's final LayerNorm, and **1.0** immediately after it. The **single biggest jump in the entire stack is the final `model.layer_norm`: pre→post cos +0.831** (the largest consecutive-layer rise is only +0.052, at L19). This reconciles R2's puzzling cos-0.182: the raw late-FFN write lives in the **un-normalized** residual at ~15000× the final norm; the final LayerNorm rescales+recenters it **into** the c_i direction. So **c_i is literally the late residual after the encoder's final LayerNorm** — the FFN writes a high-norm component that LN normalizes into c_i. **W12 sharpens what that LN does: it rank-expands the low-rank pre-LN c_i (eff-rank 11) ~53× into the high-rank readable c_i (eff-rank 600), and is not linearly invertible (post→pre R² 0.35) — a rank expansion, not a rescale; reading should happen post-LN.** *[W8, W12]*

**Cross-lingual universality (W4).** The whole reading profile is **language-universal** (tested en + fr; de/zh planned): coverage `readable_full` ~**0.53** with properties adding ~0 over the bag (half-readable, en 0.535 / fr 0.534); binding **surface-positional not thematic** everywhere (cross-construction role-AUC **0.0–0.016** vs surface **0.98–1.0**). The one predicted **morphology nuance** appears exactly: negation-scope is readable in c_i in English (c_i-gap **+0.57** over the bag) **but in French it reads from the word-bag too** (`ne…pas` is lexically marked, c_i-gap **0.0**) — i.e. lexically-marked relations **migrate from c_i into the bag** in morphologically richer languages. *[W4]*

**Cross-MODEL generality — the profile is general across mean-pooled sentence encoders, not SONAR-specific (W30), REFINED by W35 (the chapter's 4th self-correction).** Re-running W22 content-auditing and C3 thematic-role binding on **three other deployed sentence encoders** (all-MiniLM-L6-v2, all-mpnet-base-v2, and the contrastive e5-small-v2), holding the test sets fixed: **all three are paraphrase-robust CONTENT stores** (entity/quantity/claim/negation cross-paraphrase AUC **1.0**, identical to SONAR) and **NONE reads abstract thematic role cross-construction** (0.448 / 0.460 / 0.529 — near chance; threshold .70; incl. the structure-aware contrastive e5). **W35 then scaled this to 10 encoders incl. large/contrastive/multilingual (roberta-large, e5-large, bge-large, gte-large, LaBSE, multilingual-e5-large) across the full W19 relation taxonomy and REFINED the "no abstract structure is UNIVERSAL" claim — TOO STRONG, falsifier FIRED (§19.7b):** three precise findings replace the blanket claim — (1) **AGENT/PATIENT is NEVER abstractly bound** (cross-form mean **0.37** small / **0.41** large, all ≪ .70) — this part of "no structure" IS universal and scale-robust, so the safety corollary STANDS; (2) the **GENITIVE (possession) is universally bound and SCALES UP with model size** (cross-form AUC 0.70 MiniLM → 0.79 mpnet → 0.86 roberta-large → 0.91–0.998 e5-large/bge-large/gte-large/LaBSE/mult-e5-large); (3) **BUT some larger/contrastive encoders DO abstractly bind NON-genitive relations that SONAR does not** — all-mpnet binds CAUSAL (0.82), roberta-large CAUSAL (0.75), e5-large CAUSAL (0.86) + LOCATION (0.75), bge/gte/mult-e5 CAUSAL (0.76–0.82), all bag-clean. So **"mean-pooled = no abstract structure" is true for SONAR and small encoders, but larger/contrastive models acquire PARTIAL relational structure beyond the genitive — SONAR sits at the structure-POOR end of the spectrum.** Skeptic nuance: SBERT/E5-small are if anything *even more* structure-blind than SONAR — SONAR retains surface first-mention position (cross 0.985) and within-construction role (0.774), where the small SBERT/E5 family is near chance (surface 0.47–0.55, within 0.57–0.60), a near-pure lexical bag. The safety corollary (below): the **agent/patient/perpetrator no-binding is universal incl. large encoders** (none reads thematic agency cross-construction), so the relational-auditing vulnerability **is encoder-general** — not a SONAR bug fixable by switching encoders. *[W30; W35 — refines, 4th self-correction]*

**Safety — the no-binding attack (W6/W21) and the content-auditing complement (W22).** Because c_i/z encode **surface position, not thematic role**, a meaning-preserving **active→passive paraphrase defeats z-based relational safety auditing.** A z-classifier reading "who is the agent/perpetrator" scores **AUC 0.90 within-voice** but **0.014 cross-voice** (anti-correlated — it learned "agent = first mention", which inverts under passivization); c_i cross-voice 0.020; z_bag 0.50 (identity control). A **text/rule-based** voice-aware reader stays at **AUC 1.0** cross-voice, proving the perpetrator relation is unchanged and recoverable from the text — a genuine **z blind-spot**, not a hard problem. This is the concrete attack behind §17/SAFE-1's "z-auditor adversarially foolable." **W11 tests the defense: both-voice training gives only PARTIAL robustness (a novel nominalization still drops AUC to 0.46 ≈ chance), but an explicit PARSE/agent-slot indicator fully fixes it (1.0).** **W21 confirms this on REAL directional harm** — threat-source/perpetrator/incitement: z-only single-voice cross-AUC **0.037** (anti-correlated, calls the victim the perpetrator), both-voice→novel-nominalization **0.533** ≈ chance, while **parse-augmentation restores 1.0** (vulnerability gap +0.467; text control 1.0 proves the harm is unchanged). So: do **not** deploy SONAR-embedding-only auditors for directional-harm attribution; augment with a dependency-parse agent-slot. **W22 is the positive complement: z IS a paraphrase-robust CONTENT auditor** — entity/quantity/claim/negation cross-paraphrase AUC **1.0** vs lexical-bag 0.944 (decisive case: a synonym-reworded claim with no shared word, z 1.0 vs bag 0.78), while the structure control stays fragile (role cross-voice 0.022). **W17 adds a second relational exploit: re-narrating events out of chronological order flips the read timeline** (cross-condition chrono AUC 0.0 vs surface 1.0), while the event SET stays robustly auditable (0.949). **Net: trust z for *what content is present*, not *who-did-what-to-whom*.** *[W6, W11, W21, W22, W17]*

**Content-type readability — numbers precise, style lexical, event-set-yes/timeline-no (W16/W18/W17).** Resolving the aggregate coverage by content type: **z reads numbers PRECISELY** (exact-digit recovery 1.0 in every magnitude bucket incl. 250,000, log10 N r² 0.995, robust to 20% noise; precise to READ but hard to STEER/scrub — reconciles §16.4, W16); **style/formality/register is readable but LEXICAL** (z_bag word-choice carries ~all of it; c_pool adds nothing beyond the bag — not an abstract style field, W18); the **event SET reads precisely (AUC 0.949) but the timeline only as surface order** (W17). So z is a **precise content/lexical store** for entities, numbers, style, and event sets, with abstract relational reading only for a few relations (possession, negation-scope, coreference). *[W16, W18, W17]*

**How the decoder reads it (W10).** On the decode side, z is a single position (cross-attention degenerate over one key) read **diffusely across depth** (not just initial state; cross-attn read flat-to-peaking through mid-stack, falling late); the decoder reconstructs **content early (median depth 0.61), surface form late (1.0)**; and reading is **distributed with no clean per-token→z-direction map** (content-ablation localism is weak — 1.1 words changed vs random 2.5 — and PCA directions change 6–12 words). The decode-side mirror of the encode-side field picture. *[W10]*

**The lexical code, for contrast (LING).** The complementary sidequest characterizes SONAR's *word* code (N=1 encodes) as a **textbook distributional vector**: analogy **0.83** (honest, inputs-excluded), morphology consistent offsets (transfer top-1 **1.0**), the classic **antonym problem** reproduced (antonym cos **0.922** vs random 0.775, yet learned-separable AUC **0.90**), polysemy as a static blend resolved only in context (in-context sense-separation **0.68**), semantic features linearly readable (animate/human/abstract AUC **0.98**). The words have rich *classical* structure; the meaning **beyond** the words is exactly the half-holistic c_i. *[LING]*

**Net, calibrated.** c_i is **partially readable**: a learned map from the words recovers ~half its decodable meaning (R9), **learned topic-field features push that to ~0.71** (W9) — a ceiling **no external text encoder beats** (W14: SBERT/Llama/kitchen-sink all ≤0.71) — and the ~0.19-SBERT residual is **irreducible high-rank fine lexical/entity detail, not a missing axis** (W1/W14); hand-built properties add nothing beyond the words (R9) yet specific relational properties are probe-extractable (C2); the readable half is an additive bag-of-concepts (R3) with **no abstract structural binding for any relation tested** (C3/R5/W13/W19 — possession's apparent exception is a genitive-morpheme readout, collapsing 1.0→0.043 when the clitic is removed, W19) — a profile that is **language-universal** (W4) and a **concrete safety vulnerability** to passive-paraphrase (W6, confirmed on real directional harm W21, only fixable by an explicit parse not data augmentation, W11/W21) and to out-of-order narration (W17), with the **positive complement that z is a paraphrase-robust *content* auditor** (W22: cross-paraphrase AUC 1.0 vs bag 0.944); by content type z reads **numbers precisely** (W16), **style lexically** (W18), and **event sets but not timelines** (W17); the content is **dominantly field-level, with a real ~16%-energy minority of sparse nameable concept atoms recoverable only from the pooled field** (R6/R1) **that are cross-lingual** (W23: same atom fires in en/fr/de/zh, content non-EN mean AUC 0.987) and **monosemantic/auditable but NOT causal levers** (W20: 93.5% pass held-out auto-interp, but only 2.5% steer coherently — a real read-out concept dictionary, not a steering handle; a few strong topic atoms beat diff-of-means at n=2 per W15, but the basis as a whole does not); it is otherwise readable-but-not-cleanly-causal for steering across **all** concept types and worst for relational (R8/W2); the topic structure is **a real shared subspace with no canonical named axes** (W7); and mechanistically it **is the residual after the encoder's final LayerNorm, which rank-expands the low-rank late-FFN write ~53×** (W8/W12). **Most of the meaning field decomposes — as words + topic-fields + a ~16% nameable-atom minority — but a hard, high-rank, fine-detail remainder does not.**

---

## 19.1 — The coverage result (R9) — how much of c_i's meaning is readable at all

This is the centerpiece and the decisive test of the reading question. Prior chapters located the decodable meaning in the order-free field `c_pool := (1/N) Σ_i R_pos^{−p_i} h_i` (B1c, decode-SBERT 0.90). R9 asks the quantitative coverage question that B1c did not: **how much of that 0.90 can a human-readable feature set reconstruct?** The load-bearing metric is honest — reconstruct c_pool from features, decode the reconstruction through the **real SONAR decoder**, and score decode-SBERT against the original sentence (so the units are commensurate with B1b/B1c). n=8000 sents, 6400 train / 1600 test, 400 decoded; ridge α=100; 40 topics; ridge-E over 9562 ids.

### The decode-SBERT ladder

| Feature set | dims | decode-SBERT vs original | note |
|---|---|---|---|
| **ceiling — true c_pool** | — | **0.901** | the B1c order-free field, reproduced |
| readable_full (lexical bag + all properties) | 1137 | **0.504** | bag + 113 property/topic/cluster dims |
| **lexical bag, ridge-mapped** | 1024 | **0.499** | a *learned map from the words alone* |
| props only (no bag) | 113 | 0.178 | enumerable properties with no words |
| raw bag floor (direct, no learned map) | — | 0.299 | the B1b-style order-free bag |
| random-feature control | — | 0.076 | floor |
| residual + mean (what's left) | — | 0.342 | the holistic remainder, decodable |
| mean only | — | 0.077 | constant baseline |

Variance-explained R² (in c_pool vector space, not decode): lexical bag **0.323**, props-only 0.140, **readable_full 0.340**, random −0.210. Recon-vs-true vector cosine: readable_full **0.690**, lexical bag 0.682, props-only 0.533.

### What the ladder means, read exactly

1. **A learned map from the words recovers ~half the meaning.** The ridge-mapped lexical bag reaches decode-SBERT **0.499** — well above the raw-bag floor 0.299 and the random floor 0.076, and reaching about half of the 0.90 ceiling. So roughly **half** of c_pool's decodable meaning is a (learned, nonlinear-in-effect-via-decoder) function of *which words are present*. This is the readable half.

2. **Enumerable properties add essentially nothing over the words.** readable_full (bag + sentiment + tense + number + negation + coreference + NER + 40 topics + discourse + clusters) scores **0.504** vs the ridge bag's **0.499** — a gain of **+0.005** (`enumerated_props_add_over_bag: false`). The hand-built property battery does not buy back any of the remainder. The readable half is carried *entirely by the lexical map*; the properties are redundant with it.

3. **The remaining ~0.40 is partly fundamental, partly a hand-feature artifact — LEARNED features recover much of it (W9).** The span from 0.50 (readable) to 0.90 (ceiling) is not recovered by any *hand-built* feature in the battery, and the residual is genuinely meaningful, not noise: residual+mean decodes to **0.342** vs mean-only **0.077** (gain **0.266**, `residual_is_meaningful: true`). But wave-7's W9 shows the hand-property ceiling was a *feature-quality* limit, not the true holistic floor — see the learned-feature ladder below: swapping hand-properties for **learned topic-field features** lifts reconstruction from ~0.50 to **~0.71**, so only the final **~0.19 SBERT** is the genuinely-hard, irreducible remainder.

4. **Calibration.** `fraction_of_decodable_meaning_readable_vs_.30floor = 0.341` — the readable bundle reaches about a third of the way from the 0.30 raw-bag floor to the 0.90 ceiling beyond what the floor already gives. The pre-registered falsifiers (fully-readable ~0.90 → "vector space solved"; readable-adds-nothing ~0.30–0.40 → "irreducibly holistic, full negative") **both stayed unfired** — the result lands in the predicted "mostly-readable-as-a-bundle, residual remains" middle, and sharpens it to "the bundle = the lexical map; properties redundant."

The decoded examples make it concrete: where the true decode preserves the proposition ("The core of this transformation's pivotal decision is made by the longtime creator, James Wan"), the readable_full reconstruction degenerates into fluent-but-wrong text ("...a franchise that is a franchise that is a franchise..."), and the bag reconstruction is generic ("The executive's decision in this case is a very important one"). The residual decode recovers content the bag misses ("The centerpiece of a widespread creation of transformation is critical, like the long-time creator, James Wan"). *[R9]*

**One-line R9.** A learned map from the words recovers ~half of c_i's decodable meaning; *hand-built* enumerable properties add ~nothing; *learned* topic-field features recover much of the rest (W9, below); a ~0.19-SBERT high-rank fine-detail remainder stays unreadable.

### The learned-feature ladder (W9) — the holistic half shrinks

W9 re-ran the exact R9 protocol (same load-bearing metric: reconstruct c_pool, push through the real SONAR decoder, score decode-SBERT vs original; n=8000, 1600 test, 400 decoded) but added two **learned** feature sets in place of (and on top of) the hand battery: the **C1 isolated-c_i SAE** (h=32768/k=64, sentence feature = mean-pooled token SAE acts → PCA-512) and the **R4 Llama-3.2-3B L18 bridge** (mean-pooled per sentence → PCA-256).

| feature set | dims | decode-SBERT | R² | recon-cosine |
|---|---|---|---|---|
| ceiling — true c_pool | — | **0.901** | — | — |
| lexical bag (ridge) | 1024 | **0.497** | 0.312 | 0.674 |
| + hand-properties | 1137 | 0.496 | 0.329 | 0.683 |
| **+ C1-SAE features** | 1649 | **0.683** | 0.598 | 0.821 |
| + R4-Llama dirs | 1393 | 0.573 | 0.391 | 0.718 |
| **+ ALL learned** | 1905 | **0.706** | 0.603 | 0.824 |
| random-feature control | — | 0.079 | −0.407 | 0.228 |

**Read exactly:** hand-properties add **nothing** over the bag (0.496 vs 0.497), but **C1-SAE features add +0.188** and the full learned set reaches **0.706 of the 0.90 ceiling** (R² 0.60, recon-cosine 0.82). The learned features close **~52%** of the hand-to-ceiling gap (`frac_of_hand_to_ceiling_gap_closed_by_learned: 0.519`). The prereg landed on its predicted middle (`holistic_half_is: fundamental`, no falsifier fired): learned features help substantially but a **real ~0.19-SBERT residual to 0.90 remains.** So R9's "irreducible holistic half" is sharpened — **~half readable with hand-properties, ~0.71 with learned topic-field features; the residual to 0.90 is the genuinely-hard fine detail, not the whole upper half.** *[W9]*

### What the residual IS (W1) — high-rank fine lexical/entity detail, not a missing axis

W1 took the R9 full-readable residual `c_pool − r̂` and asked what it is (n=3500; residual decoded as `mean + residual` through the real decoder; n=400 decoded). Three findings:

- **It is meaningful, not decoder slack.** residual+mean decodes to **0.324** vs mean-only **0.041** — the residual adds **+0.283** SBERT (`residual_is_meaningful: true`), confirming R9's 0.266 gain at larger n.
- **It is high-rank, with no dominant axis.** **98** PCA dims for 50% of its variance, **338** for 90%; PCA top-1 share only **0.015**, top-5 only **0.054**. The per-PC decodes name no single concept; clustering it gives broad mixed-topic blobs, not a clean new axis. The pre-registered "residual collapses to one clean interpretable axis" falsifier did **not** fire.
- **Its energy is lexical/entity specificity.** Among hand-built specificity axes, residual energy correlates most with the **lexical** family (mean-|spearman| **0.272**: mean-idf 0.466, rare-word count 0.414, OOV-idf 0.360, max-idf 0.224), then **entity 0.200** (mid-sentence capitalized 0.367), **syntactic 0.164** (token-count 0.285), **style 0.106**; those axes jointly predict residual energy at R² 0.285. `dominant_axis_family: lexical`.

So the irreducible remainder is **fine-grained, high-rank lexical and entity specificity spread across hundreds of dimensions** — the parts of "which exact rare words / named entities, in what exact density" that neither the bag nor any enumerable property captures — **not a clean missing semantic axis** the chapter had failed to probe. *[W1]*

### The readability ceiling is CAPPED ~0.71 — and external text encoders DON'T beat it (W14)

W9 lifted reconstruction to ~0.71 with c_i-derived learned features. **W14 asks the natural next question — can a *better, external* text encoder close the rest of the 0.71→0.90 gap?** — and the answer is **no**. W14 re-ran the same reconstruct-and-decode ladder (decode-SBERT vs original) but added full-text predictors that have access to the *entire sentence*, not just c_i-derived features:

| feature set | decode-SBERT | R² | recon-cos |
|---|---|---|---|
| ceiling — true c_pool | **0.901** | — | — |
| lexical bag (ridge) | 0.497 | 0.31 | 0.67 |
| + hand-readable | 0.496 | 0.33 | 0.68 |
| **+ C1-SAE (interp best)** | **0.683** | 0.598 | 0.821 |
| T1 — SBERT sentence bridge | 0.473 | 0.25 | 0.62 |
| T2 — full-text MLP-seq | 0.471 | 0.27 | 0.63 |
| T3 — Llama-3B bridge | 0.473 | 0.32 | 0.67 |
| **T4 — text kitchen-sink** | **0.712** | 0.589 | 0.819 |
| random-feature control | 0.081 | −0.33 | 0.25 |

**The interpretable ceiling is ~0.683 (C1-SAE features), and NO external text encoder beats it meaningfully:** a frozen SBERT-sentence bridge reaches **0.473**, a full-text MLP **0.471**, a Llama-3B bridge **0.473**, and even a "kitchen-sink" combination of all text predictors plateaus at **0.712** — essentially tied with the native interpretable 0.683, and the best full-text predictor closes only **~13%** of the interp-to-ceiling gap (`is_cpool_fully_text_derivable: false`, falsifier "text reaches 0.8+" did **not** fire). The surprising part: SONAR is a *deterministic text encoder*, yet **its pooled meaning field carries content that no external text encoder — native SBERT, a 3B LLM, or all combined — can reconstruct from the text alone.** The residual to 0.90 is therefore **irreducible, SONAR-specific, high-rank fine detail** (the W1 lexical/entity specificity), not capturable by *any* feature set, native or external. **~0.71 is the practical interpretable readability ceiling, and it is best reached with SONAR-native SAE features** — external LLM vocabularies do not help. This corrects the earlier program hunch that a better external encoder might push readability toward 0.90. *[W14 — corrects "external text encoders could reach the ceiling"]*

---

## 19.2 — Probe-readable vs reconstructible (C2 vs R9) — the reconciliation

R9's "properties add nothing" sits next to a result that looks opposite: **C2 finds specific properties living in c_i that the words alone cannot read.** These are not in tension; they answer different questions, and the distinction is the load-bearing methodological point of this chapter.

C2 (semantic-property battery, per-prop n=320, logistic probe, GroupKFold by lexical-fill so a probe reading entity identity is at chance) uses **matched-pair templates** where the word-multiset is identical within a pair, so a true order-free bag is forced to chance on order-only structure:

| property | kind | AUC z | AUC z_bag | AUC c_i | c_i-gap (c_i − bag) |
|---|---|---|---|---|---|
| named_entity | lexical-ish | 1.000 | 1.000 | 1.000 | 0.00 |
| sentiment | lexical-ish | 1.000 | 1.000 | 1.000 | 0.00 |
| quantifier_presence | lexical-ish | 1.000 | 1.000 | 1.000 | 0.00 |
| tense | lexical-ish | 1.000 | 1.000 | 1.000 | 0.00 |
| **negation_scope** | relational | 1.000 | **0.500** | **1.000** | **+0.50** |
| factuality | relational | 1.000 | 1.000 | 1.000 | 0.00 |
| **coreference** | relational | 1.000 | **0.500** | **1.000** | **+0.50** |
| discourse_relation | relational | 1.000 | 1.000 | 1.000 | 0.00 |

Negation-scope and coreference are **probe-readable from c_i (AUC 1.0)** where the order-free bag is exactly at chance (0.50) — these are relations carried in c_i beyond the words. (factuality and discourse_relation carry an honest single-token lexical cue — a modal marker, a connective — so the bag reads them too; they are *lexically-cued*, a finding, not a c_i failure. The agent fixed a punctuation-leak confound on these.) Mean c_i-gap: lexical-ish **0.00**, relational **0.25**. *[C2]*

**The reconciliation.** C2 says: given a *known* property (negation-scope), a linear probe **extracts** it from c_i (PROBE-READABLE). R9 says: handed a *list* of such properties, you **cannot rebuild** c_pool from them (NOT RECONSTRUCTIBLE — they add +0.005). Both hold because:

- **Extraction** asks "is property P linearly present in c_i?" — yes, for scope and reference. This is a low-dimensional projection test.
- **Reconstruction** asks "do properties P1…Pk span c_i's decodable content?" — no; even with negation, coreference, sentiment, tense, NER, topics, discourse all supplied, the recovered field decodes to 0.504, statistically identical to the bag's 0.499. The properties you can name are either redundant with the words or too few to span the compositional remainder.

So the calibrated statement is: **c_i is probe-readable property-by-property (you can ask it specific questions and get clean answers, including relational ones the words miss) but only ~half reconstructible (you cannot rebuild it from the answers).** The half you cannot rebuild is exactly the compositional content with no enumerable handle. *[C2 vs R9]*

---

## 19.3 — The readable half is an additive bag-of-concepts (R3)

Having established that ~half of c_pool is readable, R3 characterizes the *structure* of that half: is it a sum of concept vectors, or genuinely non-additive (which would be the signature of strong binding)? Test: `cos( c_pool(AB)−c_pool(N), [c_pool(A)−c_pool(N)] + [c_pool(B)−c_pool(N)] )`, neutral-subtracted (so additivity is not a shared-template offset), n=18/type.

| pair type | additivity cos | reading |
|---|---|---|
| entity + entity | **0.859** | additive |
| entity + attr | **0.873** | additive |
| pred + pred | **0.896** | additive |
| agent + patient (verb-controlled) | **0.914** | additive |
| agent + patient (verb UNcontrolled) | 0.741 | confounded — verb only in AB |

c_pool composes **near-additively across all four pair types, including agent+patient**. The raw agent+patient 0.741 was a **verb confound** (the transitive verb appeared only in the AB sentence, an unmodeled third concept); with the verb present in the single-role sentences too, agent+patient additivity rises to **0.914**, indistinguishable from the non-binding types. The pre-registered "binding breaks additivity (agent+patient < 0.6)" is **NOT supported**; the holistic falsifier (nothing additive, entity+attr < 0.5) is firmly rejected.

**Binding survives only as a weak residual.** The one confound-free binding signal is the **role-swap deficit**: c_pool("A dog chased a cat") vs c_pool("A cat chased a dog") — identical words — cos **0.883**. An additive bag predicts 1.0, so the **0.117** deficit is real role-binding, but it is a **small perturbation, not strong non-additivity**. The "R3↔C3 binding-requires-non-additivity" coupling is not supported: gist-relevant content composes additively while binding is a ~0.12 residual on top.

*Caveat (honest):* the decode-SBERT half of R3 was **skipped** — the assigned box turned out to be a CPU-only shared host running other users' jobs at load ~18 with ~150MB free RAM; loading the 4.5GB SONAR decoder risked OOM-killing other users' processes (forbidden by the shared-box rule). The primary additivity-cosine metric (including the decisive role-swap and verb-controlled contrasts) is complete; only the corroborating decode is omitted. n=18/type, std 0.02–0.05; rankings stable, exact thresholds noisy; single English template family per type. *[R3]*

---

## 19.4 — z has essentially no abstract relational structure; the one apparent exception (possession) is a genitive-morpheme readout (C3 + R5 + W13 + W19 + W24)

**Headline (wave-9 corrected).** z carries **essentially no abstract relational structure**. The one apparent exception wave-8 reported — possession "binding invariantly" (W13) — **dissolves under wave-9's W19 falsifier**: the cross-form transfer is carried by the **genitive ('s/of) morpheme**, not by an abstract possession role. Neutralizing the genitive (rewrite "A's B" → "A owns B", same relation, no clitic) **collapses cross-form binding from 1.0 to 0.043** — so even possession is a **lexical/morpheme readout**, not abstract structure. Below: C3/R5 establish surface-position-only for agent/patient and nesting; W13's "possession binds" is recorded as the wave-8 claim; W19 corrects it to genitive-morpheme-specific; W24 shows multi-hop binding composes **only along genitive/possessive links** and breaks at every agency hop.

R3's weak ~0.12 binding residual raises the question: *what kind* of binding is it — abstract thematic structure (agent/patient as roles), or mere surface position? C3 and R5 answer it **for agent/patient and nesting: surface position only.** W13 (below) then tested four further relation types and reported the answer was **relation-type-dependent** (most surface-positional, possession apparently abstract); **W19 (below) then ran the full 8-relation taxonomy plus a genitive-neutralization falsifier and shows the possession "exception" is a genitive-morpheme readout, not abstract binding** — so the corrected statement is the blanket one again: no abstract relational structure, with possession's apparent invariance traced to its 's/of clitic.

**C3 (thematic-role binding, n=240, 60 pairs, active/passive balanced so focal A is agent in 2/4 rows and first-mentioned in 2/4).** Within a construction, z distinguishes role-swaps above chance (role AUC z **0.602** pooled, **0.774**/**0.792** active/passive-only; c_i carries it 0.600/0.786/0.800), and the order-free z_bag is **role-blind 0.499** (the entity-identity negative control — focal entity is constant while the role label flips). But the decisive test is **cross-construction transfer** (train on active, test on passive):

- **thematic role:** z AUC **0.0156** (active→passive), **0.0139** (passive→active) — mean **0.015**, anti-correlated. The probe learned "agent = first mention," which inverts under passivization.
- **surface first-mention:** z AUC **0.984** / **0.986** — mean **0.985**.

There is **no construction-invariant agent direction**. z is positionally compositional, not thematically. Confounds are clean (entity-id held constant, agent-balance 0.50, "by"-cue neutralized, shuffled-label null 0.392). *[C3]*

**R5 (nested / multi-predicate, n=240/construction).** Same pattern at depth. Within a construction z separates nested structure (matrix-subject 0.801, conjunction did-v1 0.853) and c_i tracks it; z_bag structure-blind ~0.50. Cross-construction:

- genuine cross-predicate **structural** transfer collapses: inner-agent **0.0019**, conjunction did-v1 **0.067**, ditransitive recipient 0.346 — structural cross-construction **mean 0.138**.
- **surface** position transfers high: first/second-mention **mean 0.826** (nested first-mention 0.998).

z stores a **flat bag of (entity, surface-position), not nested bindings** — an HRR-style capacity/structure limit. The falsifiers (clean cross-construction structural transfer >0.7 → full compositional binding; or fully collapsed within-construction → no binding at all) both stayed unfired; the prediction (within-binding present, cross-construction structural collapse, surface transfers) is confirmed. *[R5]*

**C3 + R5 settle (within their relation types).** For **agent/patient thematic roles** and **nested/multi-predicate structure**, SONAR's z (and the c_i field tracking it) does **no abstract structural binding** — these survive **only as surface order**. This is consistent with §18's P8 (decode keeps surface clause order, curved-store recovers surface index not event-rank) and bounds R3's 0.12 binding residual: for these relations it is surface-positional glue, not a thematic-role operator.

### W13 — the binding wall is RELATION-TYPE-DEPENDENT, not universal (CORRECTS "z does no abstract binding")

C3/R5 tested agent/patient and nesting and (correctly, for those) found surface-position-only. **W13 ran the same paired falsifier test across four NEW relation types and the falsifier FIRED for one** — so the prior wave-6/7 generalization "z does *no* abstract structural binding" was an **over-claim that must narrow.** W13's design (n=240/relation, 60 paired sentences each written in two surface forms so a true abstract binder transfers across forms while a surface-positional reader flips; `z_bag` order-free head-readout is the lexical-leak negative control; cross-form AUC>0.7 with `z_bag` near chance = abstract binding fired):

| relation | within-form head-AUC (z) | **cross-form head-AUC (z)** | cross-form **surface**-AUC (z) | z_bag neg-ctrl | verdict |
|---|---|---|---|---|---|
| **possession** ("John's book" / "book of John") | 0.874 / 0.867 | **1.000** | 0.000 | 0.518 / 0.501 | **ABSTRACT binding — falsifier FIRED** |
| location | 0.876 / 0.874 | **0.0003** | 0.9997 | 0.500 / 0.507 | surface-positional |
| comparison | 0.836 / 0.835 | **0.000** | 1.000 | 0.507 / 0.492 | surface-positional |
| temporal | 0.914 / 0.913 | **0.000** | 1.000 | 0.483 / 0.488 | surface-positional |

**Possession is bound INVARIANTLY.** The possessor head transfers across the two surface forms at cross-form AUC **1.000** (z), **0.992** (c_i), while the order-free `z_bag` stays at chance (0.499/0.500) and the surface-first-mention reading **anti-transfers** (cross-form surface-AUC 0.000). So z carries "X possesses Y" as a **form-invariant abstract relation**, not as "first noun." The other three relations (location, comparison, temporal) behave exactly like C3's agent/patient: high within-form, but cross-form the relational reading collapses to ~0 while **surface** first-mention transfers at ~1.0 — surface-positional only. (Lexical-leak check clean throughout: `z_bag` head-AUC ≈ 0.50 everywhere, so the possession transfer is not riding a leaked genitive marker.) *[W13]*

**The wave-8 W13 reading (now superseded by W19).** W13 read this as: z does **no** abstract binding for agent/patient roles, event-order/temporal, comparison, location, and nested structure (C3/R5/W13), but **does** bind possession invariantly (cross-form AUC 1.0, falsifier fired) — i.e. binding is "relation-type-dependent" and possession is a genuine form-invariant relation. **Wave-9's W19 corrects exactly this claim**: the possession transfer is carried by the **genitive morpheme**, not an abstract role, and collapses to 0.043 when the genitive is removed (next subsection). So the W13 "possession is abstractly bound" headline is **withdrawn** — the apparent exception dissolves, and "no abstract structural binding" holds (the carrier of possession's cross-form transfer is lexical, the 's/of clitic). The safety reading is unaffected either way: the passive-paraphrase attack is about *thematic agent/patient*, which is positional under all three results. *[W13 — superseded by W19; the "possession bound" claim is corrected below]*

### W19 — the binding taxonomy: it is the GENITIVE MORPHEME, not abstract role-binding (CORRECTS W13's "possession is abstractly bound")

W13 tested four relations and reported possession as the one abstract binder. **W19 ran the full 8-relation taxonomy plus a decisive genitive-neutralization falsifier** (n=240/relation, 60 paired sentences each in two surface forms; cross-FORM head-AUC>0.7 with the order-free `z_bag` near chance = abstract binding; `z_bag` head-AUC is the lexical-leak negative control, clean throughout: ≈0.49–0.51 on every relation). The pre-registered hypothesis under test was the **lexical-marker** account: any relation with an overt asymmetry marker ('s/of for possession/kinship/part-whole, "than" for comparison) should bind. It **fired and refined to GENITIVE-specific.** `night5_json/w19_binding_taxonomy.json`.

| relation | marker (class) | within-form head-AUC (z) | **cross-form head-AUC (z)** | cross-form **surface**-AUC (z) | verdict |
|---|---|---|---|---|---|
| **possession** ("John's book" / "book of John") | 's/of (genitive) | 0.893 / 0.913 | **1.000** | 0.000 | BOUND (genitive) |
| **part-whole** | 's/of (genitive) | 0.874 / 0.871 | **0.999** | 0.001 | BOUND (genitive) |
| kinship | 's/of (genitive) | 0.856 / 0.896 | **0.431** | 0.569 | intermediate |
| comparison ("X taller than Y") | than/-er (comparative) | 0.925 / 0.915 | **0.000** | 1.000 | surface — **marked but NOT bound** |
| location | in/contains | 0.971 / 0.963 | **0.002** | 0.998 | surface |
| causal | caused/resulted-from | 0.940 / 0.951 | **0.002** | 0.998 | surface |
| agent-patient | active/passive+by | 0.915 / 0.919 | **0.003** | 0.997 | surface |
| recipient | to/from+verb | 0.843 / 0.839 | **0.000** | 1.000 | surface |

**The naive lexical-marker hypothesis is FALSIFIED, then refined.** "Any overt marker binds" fails: comparison is marked (the overt "than"/-er) yet **surface-positional** (cross-form 0.000, surface 1.0). What actually binds is **specifically the GENITIVE construction**: genitive-marker bound-rate **1.0** (possession, part-whole), non-genitive bound-rate **0.0** (5/5 relations including marked-comparison). Kinship — also genitive ('s/of) — lands **intermediate** (cross-form 0.431), so the genitive is necessary but the bound reading is strongest for the prototypical possession/part-whole cases.

**The decisive falsifier: neutralize the genitive and binding collapses.** W19's load-bearing test rewrites the *same* possession relation **without** the genitive clitic — "A's B" → "A owns B" (a converse verb, no 's/of). If the genitive morpheme carries the direction, removing it should drop cross-form binding toward surface; if possession were *abstractly* bound, binding should survive the rewrite. **It collapses:** possession-with-genitive cross-form **1.000** → possession-neutral-verb cross-form **0.043** (a **0.957 drop**, `genitive_is_the_carrier: true`, `binding_survives_neutralization: false`). So the cross-form "binding" wave-8 attributed to an abstract possession role is **carried by the 's/of morpheme itself** — a lexical/morpheme readout, not abstract relational structure.

**So the W13 "possession exception" dissolves (correcting wave-8).** There is **no abstractly-bound relation** in z: possession and part-whole transfer across the saxon↔of-genitive paraphrase only because **both forms still contain the genitive morpheme**; strip the morpheme and the transfer vanishes (0.043 ≈ chance/anti). This **reinforces** the chapter's core finding — **z has essentially no abstract relational structure** — and narrows §19.0/§19.11's "possession is bound invariantly" to: *possession's cross-form transfer is a genitive-morpheme readout, NOT abstract binding*. The lexical-marker `z_bag` leak check is clean everywhere (≈0.50), so this is a genuine carrier finding, not a bag artifact: the genitive's directional information rides in c_i but as a **morpheme cue**, not a role operator. *[W19 — CORRECTS W13's "possession is abstractly bound" → genitive-morpheme-specific, not abstract]*

### W24 — composed/multi-hop binding: z composes ALONG the genitive/possessive links and breaks AT the unbound (agency) links

If the genitive/possessive cue transfers cross-form (W13/W19) but agent/patient does not (C3), what happens when relations **combine** in one sentence — coreference + possession + agency? Does z chain the genitive readout, or does it break at the unbound (agency) hops? W24 ran the W13 cross-FORM dissociation per **composed link type** (n=240/construction, 60 quads, two surface forms; `z_bag` head-AUC = lexical-leak negative control; cross-form head AUC>0.7 with `z_bag`≈0.5 = abstract binding). Coref links were run in two variants: **GENDER-disambiguated** (referent fixed by he/she — lexical) and **ROLE-disambiguated** (same-gender, referent fixed only by syntactic role). `night5_json/w24_multihop_binding.json`.

| composed construction | within head f1/f2 | cross-form head AUC | cross-form **surface** AUC | z_bag head (negctrl) | verdict |
|---|---|---|---|---|---|
| **pron-possessor, role-only** ("X gave Y his book"; same-gender) | 0.913 / 0.925 | **0.971** | 0.029 | **0.508 / 0.500** | **BOUND** (clean) |
| pron-possessor, gender-cued ("X gave Y his/her book") | 1.0 / 1.0 | 1.000 | 1.000 | **1.0 / 1.0** | bound but **z_bag=1.0 → lexical-leak artifact** (not credited) |
| **chained possession** ("A's K's car" vs "the car of the K of A"; root-owner) | 0.68 / 0.65 | **0.977** | 0.023 | 0.567 / 0.563 | **BOUND** (2-hop genitive composes) |
| chained possession, 2-person | 0.70 / 0.70 | **0.929** | 0.929 | 0.492 / 0.494 | bound (surface co-transfers here — root happens to stay first) |
| coref-agency, role-only ("X told Y that he should leave"; who leaves=addressee) | 0.829 / 0.772 | **0.188** | 0.813 | 0.503 / 0.00 | **UNBOUND** — z breaks at the agency hop |
| coref-agency, gender-cued ("X told Y that she should leave") | 1.0 / 1.0 | **0.007** | 0.993 | **1.0 / 1.0** | **UNBOUND** — surface-positional (gender cue does NOT rescue it cross-form) |

**The headline dissociation (falsifier NOT fired, prediction mostly supported with one informative correction).** Multi-hop binding **succeeds along the bound links and fails at the unbound/agency links**, exactly composing the W13/C3 map:

- **Bound links compose.** The 2-hop genitive chain "A's sister's car" reads the **root owner** invariantly across the saxon↔of-genitive paraphrase (cross-form head **0.977**, surface anti-transfers 0.023, `z_bag`≈0.56) — z follows a possession→possession chain. So possession-binding (W13) is **compositional**, not a one-hop special case.

- **The agency hop breaks even with a coref cue.** "X told Y that he/she should leave → who leaves (=the addressee)" anti-transfers across active↔passive (cross-form head **0.19** role / **0.007** gender; surface transfers 0.81 / 0.99). Resolving *which named entity is the addressee* is reading an argument-position of an unbound agency verb, and z reads first-mention there — **like agent/patient**. A gender cue on the pronoun does **not** rescue it: gender fixes the referent's *gender*, but binding "leaver = the one who was told" still needs the agency hop, which z lacks. **Coreference to an argument of an unbound relation is itself unbound.**

- **One prereg correction (clean, load-bearing).** The role-disambiguated **pronoun-possessor** ("X gave Y *his* book", both male) was pre-registered to BREAK (needs syntactic role to know whose book). It did **not** — cross-form head **0.971** with a **clean** `z_bag` (0.508/0.500) and surface anti-transferring (0.029). So z **does** resolve "whose book = the giver/subject" invariant to the dative alternation, **without** a lexical gender cue. Reading: this is the **genitive/possessive cue (W13→W19) reaching through the bound pronoun** — "his book" is a possessive constituent carrying the same morpheme readout that transfers cross-form, exactly the bound link. It is NOT new agency binding (the agency variant breaks), and per W19 it is NOT abstract possession binding either — it is the possessive-morpheme cue chaining. So the half of falsifier-A that fired ("role-pronoun bound") is **the genitive/possessive readout doing the work**, and the agency half did not fire → **falsifier A (full composition incl. agency) NOT fired.** Falsifier B (z breaks even on bound hops) also did not fire — the genitive chain composes.

- **The gender-cued pronoun-possessor is a CONFOUND, flagged not credited.** Its cross-form 1.0 rides a `z_bag`=1.0 lexical leak (the bag contains "his"/"her" + a male/female name → the possessor is read off surface lexis); the leak check correctly demotes it to "not abstract binding." This is why the *clean* test is the same-gender role variant.

**Calibrated W24 statement (read with W19).** z performs **partial multi-hop composition: it composes along the genitive/possessive links — including 2-hop genitive chains (root-owner cross-form 0.977 / 0.929) and a possessive pronoun bound to its antecedent (role-only cross-form 0.971) — but breaks at every unbound agency/argument-position hop, even when a coreference cue is present** (coref-agency cross-form 0.188 role / 0.007 gender; a gender cue does **not** rescue it). The bound-vs-unbound contrast is the load-bearing axis (z_bag clean on both clean BOUND results; surface AUC anti-transfers where head transfers). Read **together with W19**: the links that compose are exactly the **genitive/possessive-morpheme** ones (the "his book" possessive constituent, the saxon/of-genitive chain), i.e. composition rides the same morpheme cue W19 isolated, **not** an abstract possession operator — so "multi-hop binding" here means "the genitive-morpheme readout chains," consistent with z having no abstract relational structure. Coreference is composed **only when it rides a possessive/genitive link**; coreference to an argument of an agency verb is **unbound**. The safety reading is reinforced: a meaning vector cannot be trusted to resolve "who-did-what" across a paraphrase, and composing relations does not buy structure across the agency wall. *[W24 — composes/sharpens §19.4; the bound hops are genitive/possessive-morpheme links per W19, not abstract binding]*

---

## 19.5 — Field-level, not atomic (R6 + the SAE caveat)

If half of c_i is readable and additive in concepts, is that content carried by **nameable atomic units** (neurons / features) that one could enumerate? R6 tests the writer mechanism directly; the answer is **no — the content is field-level.**

**R6 (writer mechanism, 256 top late-FFN writers of L19–23, n=1600 sents, 52k tokens).** Two arms:

- **Input side — are writers content-keyed?** Held-out nameable-AUC of a logistic probe predicting each writer's firing from named context features (negation/comparative/entity/discourse/stop/hub/relpos/out-degree): writer mean **0.765** vs fire-rate-matched control **0.703** — margin only **+0.061**. And the *driver* is **position/routing, not semantics**: of 204 above-threshold writers the named features are **relpos 96, out-degree/hub 70, is_stop 36**, and **SEMANTIC only 2** (negation/comparative/entity) → semantic fraction **0.0078**. The high-AUC tail is essentially a position-clock.
- **Output side — per-neuron hub delivery?** Writer output hub-mass-fraction **0.198** < the 0.25 hub baseline < control **0.315**; delivery gap −0.017 (slightly negative). Individual writers are **anti-hub at both input and output**. B4b's +31% hub c-collapse is therefore a **collective/field effect, invisible per neuron** — not a per-neuron hub-detector (consistent with B4b's own falsification of the hub-detector framing).

So c_i **content is likely irreducibly field-level** — not per-neuron, not a nameable atomic code. The agent caught and overrode an auto-verdict that mislabeled position/out-degree as "content" (`verdict_corrected`); the corrected reading leans toward the field-level falsifier while noting writers are structured (beat controls), not indiscriminate. *[R6]*

**C1 — the SAE that DID complete (isolated-c_i dictionary): a topic-field basis, not atoms.** Unlike the §18 per-token SAE (trained on the *full* per-token state, lexical-dominated → 0 context features), C1 trained a BatchTopK SAE on the **isolated** c_i (lexical+clock subtracted first), h=32768, k=32/64, over 1.49M token states. Isolating c_i first genuinely **de-lexicalizes** the dictionary: lexical-echo features collapse from **53%** (shuffled-E control) to **2%** (real c_i), and median token-purity from 0.75 to **0.017** (98% of features <0.5). A validated **53%** of features are non-lexical-interpretable — ~**21 points** above the LLM-confabulation floor (31.7%, measured on the shuffled control, which caught the labeler hallucinating "topics" on no-signal features 32% of the time — a key self-trick the control exposed). **BUT** the non-lexical features are **broad, low-purity topic-fields, not crisp monosemantic atoms**: every validated feature fires on >3000 tokens (median ~19k fires / ~5.5k distinct tokens), e.g. "environmental issues/activism", "cultural heritage/identity", entity-name fields. And FVU is **0.405 (k32) / 0.351 (k64)** — **4.3× worse** than the full-state SAE's 0.081, confirming c_i is high-rank/spiky. So the one unsupervised SAE that completed gives a *positive-but-bounded* read: c_i **has** a readable non-lexical basis, but it is a **distributed topic-field basis, not a sparse atomic code** — corroborating R6's field-level finding with an actual dictionary. *[C1]*

**The marquee pooled-field SAE (R1, completed): a PARTIAL win — pooling DOES surface real meaning atoms (~16% of energy).** The marquee direct test — an SAE on the *pooled* order-free c_pool field (the field B1c showed decodes to 0.90, never before handed to a dictionary; hypothesis: pooling might *de-entangle* into clean atoms) — completed on 300k sentences (BatchTopK h=32768/k=32, 40 epochs; c_pool verified to decode at SBERT 0.908). Two findings, in tension and both real: (i) **FVU 0.457** — *higher* than C1's per-token (0.405), 6× the full-state SAE (0.081), so c_pool is intrinsically dense/high-rank, NOT cleanly sparse; BUT (ii) the per-feature analysis (which DID complete, on the original run) finds **273 paraphrase-invariant meaning atoms carrying 16.2% of energy** — shuffled-c_pool control yields 1 (a **273× separation**), so they are real, not topic-frequent-word artifacts. The atoms include topics (sports, music, education) AND *abstract* concepts: contrastive discourse ("however/nevertheless", lift 0.62), reported speech ("says/spokesperson", 0.69), first-person identity (0.68). This is **decisively more separable than per-token c_i (0 paraphrase-invariant features)** — pooling the order-free field *does* surface a sparse, nameable concept basis the per-token SAE could not. **But the atoms carry only ~16% of energy; the other ~84% is still lexical/dense.** Honest verdict: a **partial win, not "no atoms"** — c_i's meaning is *mostly* field-level/lexical, with a real, auditable *minority* of sparse nameable concept atoms that emerge only when you pool away position. (The high FVU suggests a larger dictionary might raise the atom fraction — echoing day-5's "biggest-SAE breaks plateau"; owed.) *[R1 completed]*

**W23 — R1's atoms are CROSS-LINGUAL: the 16%-atom dictionary is language-agnostic.** A natural worry about R1's 273 pooled-field atoms is that they are an English artifact. **W23 tests whether the SAME SAE atom (same feature ID) fires on its concept in French, German, and Chinese** (cpool-sae h32768/k32, FVU 0.457; c_pool computed with the English-fit R_pos/R_role and ridge-E refit per language; AUC = concept sentences vs same-language other-concept foils, n=10 concept / 60 foil per language). The atoms are **language-agnostic concept detectors**: of 7 concepts tested, **6 are cross-lingual** (the same feature fires on the concept in en/fr/de/zh), with **content atoms at non-English mean AUC 0.987** (sports feat 11725, music 20969, education 21091, film 11456 — all ≥0.92 in every non-English language, e.g. sports zh 0.995, music de/zh 0.997) and **abstract atoms at non-English mean 0.789** (first-person feat 22495 = 1.0 all four languages; reported-speech feat 9163 = 0.778–0.82). The **content-vs-abstract gap is 0.198** — content concepts are slightly more language-portable than abstract ones, exactly as pre-registered. The **one exception is contrastive-discourse** (feat 5041, "however/nevertheless"): cross-lingual detector **false** — it reads en/fr/de (AUC 0.65–0.705) but **fires 0.0 on Chinese** (0.368 AUC), so this particular discourse atom is Latin-script-bound. A natural-text replication (firing the same atoms on real v2 sentences with concept keywords) reproduces the pattern (content AUC 0.77–0.98 across languages). **So R1's ~16%-energy atom dictionary is a real, multilingual concept basis** — the same nameable atoms (§19.5) are also **cross-lingual** (W23) and **auditable/readable** (W20): a genuine, language-agnostic, **read-out** concept object, not an English curiosity — though **not a causal handle** (W20 below: only 2.5% of the atoms steer coherently). The lone Latin-script-bound discourse atom is the calibrated caveat. *[W23 — strengthens R1: the 16%-atom dictionary is language-agnostic]*

**W20 — the full atom validation: R1's atoms are a REAL, MONOSEMANTIC, AUDITABLE concept dictionary — but NOT causal levers (the decisive correction to W15's "better levers" over-read).** W15 (below, §19.6) tested only **n=2** concepts at the paired coherence knee and found SAE-atom steering beats diff-of-means; it was tempting to round that to "R1's atoms are the program's cleanest *causal levers*." **W20 is the full sweep that corrects that self-trick.** It validates R1's invariant atoms on three pre-registered axes — monosemanticity (held-out auto-interp AUC: an SBERT-centroid direction fit on a TRAIN half must predict firing on a TEST half), beyond-lexical (SBERT firing-predictor vs token-bag predictor on the same split), and causal (inject the atom's decoder row → decode through the real SONAR decoder → judge concept-gain on the *text* via the atom's own held-out direction, with a coherence floor), each with its own shuffled control. **Reproduction caveat:** R1 stored only **25 named atom IDs** (not its full reported 273 list, and the strict co-fire firing-rule operationalization is absent from artifacts), so W20 validates the **51 atoms = 25 R1-named ∪ 26 strictly re-derived** from R1's paraphrase bank with R1's exact rule (shuffled-c_pool control yields 1 invariant atom — the set is real, not artifact). The verdict is a **clean pre-registered SPLIT (readable ≠ causal at the atom level):**

| axis | result | control | reading |
|---|---|---|---|
| **monosemantic** (held-out auto-interp AUC ≥0.70) | **43/46 = 93.5%** (of scored atoms) | shuffled-feature AUC **0.50** | strongly monosemantic — a real concept dictionary, not eyeballed |
| **beyond-lexical** | **36/43** beat a token-bag firing-predictor | — | the atoms carry concept signal *beyond* the words |
| **causal** (inject→decode→judge ≥0.30 coherent gain) | **1/40 = 2.5%** steer coherently (mean gain **0.04**, max **0.30**) | shuffled-atom success **0.013** | NOT causal levers — read-out features, not write-in handles |
| **coverage** (invariant-atoms-only decode-SBERT, +bias) | **0.152** (vs c_pool ceiling **0.945**) | random-feature ~0.08 | ~16%-energy minority, far below ceiling — auditable but incomplete |

The prereg predicted **40–60% monosemantic-AND-causal** ("auditable minority"). Its **monosemanticity arm is CONFIRMED** (93.5% ≥0.70, shuffled collapses to 0.50; 36/43 beyond-lexical); its **causality arm is FALSIFIED** (2.5% steer, vs the predicted 40–60%) — while neither hard falsifier fires (the artifact-falsifier <15% monosem does NOT fire; the atoms-dominate-recon falsifier does NOT fire — coverage 0.152 « ceiling 0.945 confirms the 16%-energy incompleteness). **Bottom line: R1's atoms are a real, monosemantic, beyond-lexical, cross-lingual (W23), AUDITABLE concept dictionary — but they are NOT crisp causal levers (c_pool atoms are read-out features, not write-in handles).** readable ≠ causal holds **at the atom level too**, consistent with W2/R8/§17.36. This is the correction to honor against W15: a *few strong topic atoms* steer somewhat better than diff-of-means (W15, n=2), but the **atom basis as a whole is NOT a steering handle** (W20: 2.5%) — the atoms are **readable/auditable, not writable.** *[W20 — full validation; corrects the W15 "levers" over-read]*

**The SAE verdict (four dictionary attempts; granularity matters).** All four learned-dictionary attempts ran: **C1** (isolated per-token c_i → broad topic-fields, FVU 0.35, 0 paraphrase-invariant atoms), **R2** (writer-space transcoder → FVU 2.08, **0/85** relational features, token-identity R²=1.00 — a lexical-magnitude dictionary), **R4** (LLM-bridge → only 27/80 directions name as broad topics), and **R1** (the POOLED order-free c_pool field → FVU 0.457 but **273 paraphrase-invariant nameable atoms, 16% of energy**, 273× over shuffle). The clean takeaway is **granularity-dependent, not a flat "no atoms":** at the *per-token, writer, and LLM-vocabulary* level c_i has **no sparse atomic code** (it's field-level — R6: writers carry position/routing, semantic fraction 0.008; §18 per-token SAE: 0 context features); but *pooling away position* (R1) surfaces a **real but minority (~16%) sparse concept basis** that the other granularities miss. So: **c_i's meaning is dominantly field-level/lexical, with a real auditable minority of nameable concept atoms recoverable only from the pooled field.** Both halves matter — don't overclaim "fully decomposable" (84% isn't) or "no atoms" (the pooled 16% is real). *[C1 + R2 + R4 + R1 + R6 + §18 per-token SAE]*

**R2's bonus — c_i is NOT the additive FFN write (refines B4).** R2 measured the raw late-FFN write `R_pos^{-p}(Σ_{L} W_out·a)` against the actual c_i: **cos = 0.182**, and ~15000× its norm. So although ablating late FFN collapses c_i (B4), c_i is **not** the additive output of those FFNs — it is **non-additively constructed through the late stack (LayerNorm + downstream layers)** from the FFN's contribution. This explains *why* the writer-space transcoder fails (there is no monosemantic late-FFN write for it to recover) and refines the §18/B4 "late-FFN writer" picture into "late-FFN *drives* a non-additive, globally-normalized construction." *[R2]*

**W8 — c_i IS the post-final-LayerNorm residual (the non-additive step located).** W8 traced exactly where c_i crystallizes by white-box capture (no ablation) of the residual stream at every probe point and measuring cosine of the running proto-c_i to the final c_i (n=1200 sents, 39,238 tokens; proto uses the identical R_pos/E as the final, and final ≡ proto at post-final-LN by construction). The cos-to-c_i ladder rises **slowly and never crosses 0.5 inside the stack** — embed **0.007**, L10 0.116, L18 **0.247**, peaking at L20 **0.337** — then *falls* through L21–L23 as the un-normalized residual norm explodes (pre-final-LN cos **0.169**, at ~42000× the final norm). The single decisive event is the **encoder's final `model.layer_norm`: pre→post cos jumps 0.169 → 1.0, a delta of +0.831 — by far the biggest step in the stack** (the largest consecutive *layer* jump is only +0.052, at L19; falsifiers "smooth additive accumulation" and "early appearance" both rejected). This **reconciles R2's cos-0.182 with the ~1.0 encoded-seqs basis**: the raw FFN write lives in the un-normalized residual at ~15000× norm; the final LayerNorm rescales+recenters that high-norm component **into** the c_i direction. **So c_i is literally the late residual after the encoder's final LayerNorm — the late FFN writes a high-norm component that the final LN normalizes into c_i.** This refines B4/R2's "late-FFN writer" into a precise mechanism: late FFN *drives*, final LN *constructs*. *[W8]*

**W12 — the final LN RANK-EXPANDS c_i ~53× (the LN is not a lossless re-basis).** W12 sharpens *what* the final LayerNorm does. It takes a **low-rank, high-norm pre-LN c_i** (entropy effective-rank **11.2**, participation-ratio 3.5, norm ~10⁵× the post) and **spreads it into a much higher-rank post-LN c_i** (eff-rank **599.8**, PR 258.6) — a **~53× rank expansion** — and the map is **not linearly invertible** (post→pre R² **0.35**, pre→post R² 0.50). So the readable post-LN c_i is **richer** than the collapsed pre-LN field, not a re-basis of equal content: the late-FFN's low-rank high-norm write is *unfolded* by the LN into the high-rank readable c_i. **Reading should therefore happen post-LN** (where it already does); there is no richer pre-LN representation to switch to. *(Skeptic catch the agent flagged: a naive "pre-LN decodes worse" comparison is **bridge-confounded** — pool-then-LN does not commute with mean-pool, giving a spurious 0.07 pre-ceiling; the proper **per-token-LN-then-pool** control recovers the 0.98 post ceiling, so the pre/post decode gap is a pooling-order artifact, not evidence about LN content. The rank-expansion and non-invertibility findings stand.)* This is the gear-level mechanism behind W8's "+0.831 LN jump constructs c_i": the jump is a **rank expansion**, not a rescale. *[W12]*

**Cross-model naming (R4 — a third angle).** A learned low-rank **bridge** from c_pool into Llama-3.2-3B's L18 residual space tests whether c_i can be *named* in a foreign, better-charted interpretability vocabulary (sidestepping the SONAR-SAE entanglement; note P6 already killed *frozen*-Llama prediction at R² 0.077 — but that was the wrong direction for *reading*). The bridge **beats the lexical-bag baseline on every metric** (forward PC-256 R² 0.286 vs z_bag 0.203, lift +0.083; reverse gist-SBERT 0.39 with lift +0.125 over bridged-bag) — so c_i carries content the words alone do not, and the no-lift-over-bag falsifier does **not** fire. But it transfers **topic-gist, not fine structure**: gist-agreement 0.39 « the pre-registered 0.6 bar, and only **27 of 80** c_pool directions get a stable Llama label — all of them coherent *topics* (economic/policy, war/politics, environmental/sustainability, consumer-health, sports). So c_i is nameable in an LLM's vocabulary, but as ~1/3-of-directions broad topics, not a fine concept inventory — the **same "broad topics, not atoms" conclusion** reached by C1 (SAE), R6 (mechanism), and R9 (coverage), now confirmed cross-model. *[R4]*

**W7 — the topic structure is REAL and SHARED, but the named axes are method-dependent.** A natural worry about the C1-SAE topic-fields and the R4-Llama topic-directions is that each method *confabulates* its own topics. W7 tests this by aligning the two via held-out CCA. The **subspace alignment is real and strong**: held-out CCA top canonical correlations **0.77 / 0.73 / 0.57** vs a shuffled control ~0 — so C1-SAE-topic activations and R4-direction activations share a **genuine multi-dimensional coarse-topic subspace** (the no-alignment falsifier CCA<0.2 does *not* fire; far from a single rigid basis). **But 1-to-1 named-topic correspondence is weak**: same-label matching (C1-"environmental" vs R4-"environmental") is near-zero (max |r| ~**0.11**), the strongest Hungarian 1-1 pairs (|r| 0.39/0.36) are semantically *mismatched*, and <5 label-coherent matches clear the prereg ≥5 bar. So the prereg "moderate alignment" prediction is **half-confirmed**: the broad-topic **structure is real and shared** (the topic finding is not per-method confabulation), but the specific topic **axes/labels are method-dependent rotations** — there is **no privileged canonical named basis**. This is exactly consistent with field-level-not-atomic: c_i has real coarse-topic structure occupying a shared subspace, but no canonical atomic topic axes. *[W7]*

---

## 19.6 — Steering: readable ≠ causal (R8 + W2, completed)

§17 found that *editing* SONAR latents is a memorized lexical trick (readable directions that don't cleanly cause). R8 tests whether **supervised concept directions in c_pool** close that gap. Method: diff-of-means direction `d = mean(c_pool|concept) − mean(c_pool|¬concept)` on train; read-test = AUC of `d·c_pool` held-out; causal-test = inject `α·lift(d)` into z of held-out NEG sentences, decode, judge concept-gain on the TEXT (held-out diff-of-means TEXT classifier) and coherence = SBERT(edit, base).

**R8 is partial — topic_sports only** (the relational/attribute concept arms did not run; the job landed only the topic concept JSON). For that concept:

- **Readable:** read-AUC `d·c_pool` = **1.000**; text-judge AUC 1.000; direction norm 0.119.
- **Injects content:** text-judge concept-rate on edited decodes rises to **0.9 (α=2) → 1.0 (α=4–12)** vs base 0.0; target-similarity delta +0.33–0.48; projection delta +0.47–0.66.
- **But breaks coherence:** **causal_success_rate 0** at every α. Coherence SBERT(edit, base) collapses **0.241 (α=2) → 0.039 (α=4) → 0.032 (α=8) → 0.022 (α=12)**; fraction-coherent-above-floor 0.10 → 0.0. The edited decodes are off-topic-injected but incoherent: "The dividend yield attracted income-focused shareholders" + sports-direction → "The drivers' rivalry attracted the division winners who aimed for the goal."

So even a *readable, supervised* topic direction **injects the concept into the text but destroys coherence** — the readable≠causal gap reappears for **steering**, not just editing. The concept is legible as a direction and the decoder *responds* to it (text-rate ~1.0), but there is no α that yields a coherent on-concept sentence (causal_success 0). The pre-registered "topic injects cleanly" arm is **not** supported even for topic; the relational arms (predicted readable-but-not-injectable) are untested. *[R8 partial]*

**Calibration.** This is one concept, partial, and the coherence judge is strict (SBERT-to-base). The honest claim is narrow: **a supervised topic direction in c_pool is readable (AUC 1.0) and content-injectable into the decode (text-rate 0.9–1.0) but not coherence-preserving (causal_success 0)** — consistent with §17.36 (editing is lexical-only) and §18's "essential ≠ separable/steerable."

### W2 — the relational and attribute arms (R8 completed)

W2 ran R8's missing arms with the identical protocol (`d = mean(c_pool|pos) − mean(c_pool|neg)`; read = AUC of `d·c_pool` held-out; causal = inject `α·lift(d)` into z of held-out neutral sentences, decode, judge concept-gain on the **decoded text** via a held-out TEXT classifier, success = gained AND coherent with SBERT(edit,base) ≥ 0.35; α∈{1,2,3,5,8,12}). Best-α causal-success and the readable≠causal gap, by concept type:

| concept type | concepts | mean read-AUC | best causal-success | **read − causal gap** |
|---|---|---|---|---|
| **topic** | sports vs finance | **1.00** | **0.40** | **0.60** |
| **attribute** | sentiment, formality | **0.997** | **0.40** | **0.597** |
| **relational** | negation, causation, agent-patient | **0.674** | **0.067** | **0.607** |

The readable≠causal gap is **large (≈0.60) for every concept type**, and **widest — with the lowest causal floor — for relational** content. Relational directions are also **barely readable**: causation read-AUC is 1.0 but its causal-success is 0; negation read-AUC 0.618; agent-patient read-AUC **0.403** (the role-swap is not even cleanly *readable* as a single c_pool direction, consistent with binding-is-positional, C3/R5). Restricting to *readable-only* relational directions (causation, read-AUC 1.0) still gives causal-success **0.0** vs 0.4 for readable topic/attribute. As with R8-topic, the failure mode is the same: raising α injects the concept into the text (topic text-rate → 1.0, formality → 1.0) but coherence collapses (SBERT(edit,base) 0.96 → ~0.14 as α grows) before the relation cleanly flips. Both pre-registered falsifiers ("all types inject as cleanly as topic"; "nothing injects") stayed unfired; `is_structure_steerable_as_single_direction: false`. *[R8 + W2]*

**Net steering claim (now general).** Across topic, attribute, and relational concepts, c_pool directions are **readable but not cleanly causal**, the gap is **uniformly large**, and it is **worst for relational/structural** content (read 0.67 → causal 0.067) — consistent with §17.36 (editing is lexical-only), C3/R5 (binding is positional, not a steerable thematic operator), and §18's "essential ≠ separable/steerable." **The choice of basis helps only a little: a few strong topic atoms steer better than diff-of-means (W15, n=2), but the native SAE-atom basis as a whole is NOT a causal handle (W20: only 2.5% of R1's atoms steer coherently) — the atoms are readable/auditable, not writable.**

### W15 — a few strong topic atoms steer better than diff-of-means (n=2) — but the atom basis as a whole is NOT a causal handle (corrected by W20, §19.5)

R8/W2 steered along **diff-of-means** (DoM) directions. W15 asks whether the *native* C1-SAE feature directions — R1's 273 pooled-field atoms (16% energy, §19.5) — are not just readable but **better causal levers**. Same inject-and-decode protocol, comparing steering along an SAE-feature direction vs the matched DoM direction for the same concept:

| metric | **SAE-feature basis** | diff-of-means | advantage |
|---|---|---|---|
| read-AUC | 0.824 | 1.00 | DoM reads slightly better |
| **causal-success** | **0.567** | 0.233 | **SAE +0.333** (random ctrl 0.0) |
| **coherence at injection-knee** | **0.839** | 0.227 | **SAE +0.612** |
| readable − causal gap | **0.257** | 0.767 | **gap narrows 3× for SAE** |

Steering along an **SAE feature** direction is **2.4× more causally successful** (0.567 vs 0.233) and, on this paired set, **keeps coherence at the injection knee** (0.839 vs 0.227 — DoM steering destroys the sentence to land the concept, exactly the R8/W2 failure mode). So on these concepts the readable≠causal gap **narrows from ~0.77 to ~0.26**. Calibrated even at the time: it **narrows, it does not close** — the gap is still >0.2, c_i remains a field rather than a crisp lever, and **only 2 concepts both reached the knee** in the paired coherence comparison (`readable_neq_causal_narrows_not_closes: true`, no falsifier). **This is exactly where the self-trick crept in — and W20 (§19.5) is the correction.** W20's full sweep (40–51 atoms) shows the gap-narrowing **does NOT generalize**: only **2.5% (1/40)** of R1's atoms steer coherently (shuffled-atom 0.013). So the calibrated claim is: **a few strong topic atoms steer somewhat better than diff-of-means (W15, n=2), but the atom basis as a whole is NOT a causal handle (W20: 2.5% steer coherently) — R1's pooled-field atoms (§19.5, 16% energy) are *readable/auditable*, not *writable*.** W15 is a real but narrow signal on a tiny paired set, not a license to call the atoms levers. *[W15 — narrow signal; corrected by W20]*

---

## 19.7 — Cross-lingual universality (W4)

Does the reading-c_i profile describe SONAR's *meaning machinery* or merely its *English*? W4 re-ran the three load-bearing wave-6 tests — R9 coverage, C3 thematic-role binding, C2 negation-scope — in **French** and **English-anchor** (config also listed German/Chinese; the landed JSON reports en + fr, the decisive contrast), using the **English-fit shared operators** R_pos and R_role with ridge-E refit per language on native held-out v2 sentences. The profile is **language-universal**, with one predicted morphological nuance.

**Coverage — half-readable, universal.** `readable_full` decode-SBERT: **en 0.535** (ceiling 0.918), **fr 0.534** (ceiling 0.906); properties add ~0 over the ridge bag in both (en +0.003, fr −0.005; `props_add_over_bag: false` both). The R9 "~half-readable, properties redundant with the words" result holds across languages. *[W4]*

**Binding — surface-positional, not thematic, universal.** Within-construction role-AUC is high (en z 0.844, fr z 0.770; c_i carries it), the order-free z_bag is role-blind (~0.497 both), but **cross-construction thematic-role transfer collapses everywhere**: en role-AUC **0.016** (anti-correlated) vs surface-first-mention **0.985**; fr role-AUC **0.0** vs surface **1.0**. `thematic_generalisation: false`, `surface_positional: true` in both. The C3/R5 "no abstract *thematic-role* binding" finding is **not English-specific** — it is a property of the vector space (W13's possession "exception" was withdrawn by W19: even possession's cross-form transfer is a genitive-morpheme readout, not abstract binding — §19.4). *[W4]*

**The morphology nuance — lexically-marked relations migrate from c_i to the bag.** Negation-scope is readable in c_i in English with a large c_i-gap over the bag (c_i-AUC **1.0**, z_bag **0.43**, **c_i-gap +0.57**), exactly as C2 found. **In French it reads from the BAG too** (c_i-AUC 1.0, z_bag **1.0**, **c_i-gap 0.0**): French marks negation with the discontinuous `ne…pas`, a lexical signal the order-free bag already carries, so the relation no longer needs c_i. This is the predicted exception (`possible_exception`: richer morphology shows the relation in the words) confirmed: **the reading-c_i profile is universal, but lexically-marked relations shift from c_i into the word-bag in morphologically richer languages** — a clean demonstration that "what lives in c_i vs the bag" tracks where the language *encodes* the relation, not a fixed English carving. *[W4]*

**Net W4.** The reading-c_i answer — half-readable coverage, surface-positional (not thematic) binding, relations carried in c_i beyond the words — is **language-universal**; the only language-specific variation is the expected one, that morphologically-marked relations move into the lexical bag.

---

## 19.7b — Cross-model generality: the profile is a property of mean-pooled sentence embeddings, not of SONAR (W30)

§19.7 established that the reading-c_i profile is language-universal *within SONAR*. The harder generality question is **architectural**: is "a precise lexical/content store with NO abstract relational structure" a fact about *SONAR specifically*, or a fact about *fixed-size mean-pooled sentence embeddings broadly*? W30 settles this by re-running the two load-bearing tests — W22 content auditing and C3 thematic-role binding — on **three other deployed sentence encoders**, holding everything but the model's `.encode()` vector fixed: the **identical** C3 minimal-pair set (seed 0) and W22 paraphrase set, identical numpy-logistic probes, SONAR as the recorded anchor. The encoders span the design space: `all-MiniLM-L6-v2` (small distilled SBERT, 384-d), `all-mpnet-base-v2` (strong general-purpose SBERT, 768-d), and `e5-small-v2` (a **contrastive retrieval** encoder, 384-d, `query:`-prefixed — a structure-aware-ish objective, the strongest a-priori chance of reading relations). Pre-registered prediction: GENERAL — all read content robustly while all fail abstract thematic binding, because pooling to one vector destroys structural binding regardless of architecture. Falsified if any encoder reads thematic role cross-construction (>.70) or any fails content auditing. `night5_json/w30_crossmodel_generality.json`.

**Content presence — all four are robust content stores, identical to SONAR.** Cross-paraphrase content AUC, mean over the four W22 tasks (entity / quantity / claim / negation-polarity):

| encoder | entity | quantity | claim | negation | **content cross-paraphrase mean** | content_robust |
|---|---|---|---|---|---|---|
| **SONAR** (anchor) | 1.0 | 1.0 | 1.0 | 1.0 | **1.0** | yes |
| all-MiniLM-L6-v2 | 1.0 | 1.0 | 1.0 | 1.0 | **1.0** | yes |
| all-mpnet-base-v2 | 1.0 | 1.0 | 1.0 | 1.0 | **1.0** | yes |
| e5-small-v2 | 1.0 | 1.0 | 1.0 | 1.0 | **1.0** | yes |

Every encoder is a paraphrase-robust content store at AUC **1.0** on all four content tasks — the SONAR W22 result is not SONAR-specific. (As in W22, the lexical-bag control is the discriminating reader on the synonym-reworded claim task, dropping to 0.821; the embeddings hold 1.0.)

**Abstract thematic binding — none read it; all are at-or-below chance, exactly like SONAR.** Thematic-role cross-construction transfer (train active → test passive, mean of both directions; threshold .70):

| encoder | **thematic-role cross-construction AUC** | surface-position cross AUC | within-role distinguishable | structure role cross-voice AUC | reads thematic role? |
|---|---|---|---|---|---|
| **SONAR** (anchor) | **0.0147** (anti-correlated) | 0.9853 | 0.7736 | 0.0216 | **no** |
| all-MiniLM-L6-v2 | **0.4479** | 0.5520 | 0.5986 | 0.6025 | **no** |
| all-mpnet-base-v2 | **0.4595** | 0.5406 | 0.5653 | 0.5722 | **no** |
| e5-small-v2 | **0.5289** | 0.4711 | 0.5903 | 0.6314 | **no** |

No encoder — **including the contrastive e5** — reaches the .70 threshold; all three SBERT/E5 readers sit near chance (0.448 / 0.460 / 0.529). **3 of 3 models match SONAR's profile; no model reads thematic role cross-construction; no falsifier fired** (`lexical_store_no_structure_is_GENERAL: true`, `falsified: false`).

**Skeptic nuance — SBERT/E5 are even MORE structure-blind than SONAR.** The match is in the *verdict* (content-robust, binding-blind), but the encoders are not identical in *kind*. SONAR retains two structural signals the SBERT/E5 family barely has: (i) it tracks **surface first-mention position** strongly (cross-construction surface AUC **0.985**) where the SBERT/E5 readers are near chance (**0.47–0.55**); and (ii) it separates roles **within a construction** at **0.774** where they manage only **0.57–0.60**. So SONAR is a content store *with a surface-positional structural overlay*; the SBERT/E5 family is closer to a **near-pure lexical bag** — even *more* structure-blind, not less. This sharpens rather than weakens the generality claim: the property "no abstract relational binding" is not merely shared, it is if anything *more extreme* in the other mean-pooled encoders, while the surface-positional residual that drives SONAR's safety exploits (§19.8) is partly a SONAR-specific elaboration.

**Net W30 (refined by W35 below).** The chapter's core structural finding — **a precise lexical/CONTENT store** — is a property of **fixed-size mean-pooled sentence embeddings broadly, not of SONAR**: every encoder preserves paraphrase-robust content. The *no-abstract-binding* half is **largely but NOT fully universal** — the W30 three-encoder verdict "the *absence* of abstract relational binding is universal" was TOO STRONG, and **W35 (the chapter's 4th self-correction) refines it below** with a 10-encoder sweep across the full relation taxonomy. The safety-load-bearing part survives: **no encoder — incl. large/contrastive/multilingual — binds agent/patient thematic role cross-construction**, so the W6/W21 relational (perpetrator/agent) auditing vulnerability is **not a SONAR bug to be fixed by switching encoders** — it generalizes to any mean-pooled sentence-embedding auditor (§19.8). *[W30 — cross-model generality of the content store + agent/patient no-binding; refined by W35 on non-genitive relations]*

---

### 19.7b(W35) — Refinement: scale/architecture buy PARTIAL non-genitive binding; agent/patient no-binding and content-robustness stay universal (the chapter's 4th self-correction)

W30 tested 3 small encoders on 2 relations and concluded "no abstract relational binding is universal." **W35 stress-tests that across 10 encoders — small (MiniLM 22M, e5-small 33M), base (mpnet 110M, distilroberta 82M), large (roberta-large 355M, e5-large 335M, bge-large 335M, gte-large 335M), and large-multilingual (LaBSE 471M, multilingual-e5-large 560M) — over the **full W19 8-relation taxonomy** (genitive: possession/partwhole/kinship; non-genitive: comparison/location/causal/agentpat/recipient), with the identical W19 minimal-pair quads (seed 0), identical numpy-logistic probes, the order-free word-`bag` lexical-leak negative control on every relation, and SONAR as the recorded anchor. Pre-registered falsifier: **a larger/structure-trained encoder binds a NON-genitive relation cross-form (head AUC > .70, bag NOT leaking) ⇒ structure IS achievable.** `night5_json/w35_crossmodel_extended.json`.

**The falsifier FIRED — "no abstract structure is universal" is too strong (`falsified: true`).** But the refinement is precise, and the safety-relevant part holds. Three findings:

**(1) Agent/patient is NEVER abstractly bound — universal and scale-robust.** Cross-form thematic-role AUC for agentpat stays ≪ .70 in every model: small/base mean **0.3731**, large mean **0.4056**, max **0.497** (multilingual-e5-large) — `any_large_model_binds_agentpat: false`, `scale_buys_structure: false`. The safety-load-bearing relation (perpetrator/agent attribution, §19.8) is unbound everywhere, so the directional-harm auditing vulnerability is **encoder-general, including at scale**.

**(2) The genitive (possession) is universally bound AND scales up with model size.** Possession cross-form head AUC climbs monotone-ish with capacity (bag clean everywhere):

| encoder | size | possession xform | partwhole xform | agentpat xform | non-genitive abstractly bound |
|---|---|---|---|---|---|
| **SONAR** (anchor) | — | 1.0 | 0.999 | 0.003 | *(none)* |
| all-MiniLM-L6-v2 | small | 0.699 | 0.637 | 0.405 | none |
| e5-small-v2 | small | 0.908 | 0.768 | 0.395 | none |
| all-distilroberta-v1 | base | 0.732 | 0.734 | 0.368 | none |
| all-mpnet-base-v2 | base | 0.794 | 0.734 | 0.324 | **causal (0.82)** |
| all-roberta-large-v1 | large | 0.861 | 0.820 | 0.446 | **causal (0.751)** |
| gte-large | large | 0.909 | 0.816 | 0.390 | **causal (0.756)** |
| bge-large-en-v1.5 | large | 0.961 | 0.769 | 0.417 | **causal (0.791)** |
| e5-large-v2 | large | 0.972 | 0.875 | 0.379 | **location (0.753), causal (0.858)** |
| LaBSE | large-multi | 0.930 | 0.614 | 0.305 | none |
| multilingual-e5-large | large-multi | 0.998 | 0.774 | 0.497 | **causal (0.817)** |

So the W19 genitive-binding generalizes and **strengthens with scale** (0.70 → 0.998), consistent with the genitive being a lexical 's/of cue that bigger encoders read more cleanly — not a role operator (cross-form *surface* AUC collapses to 0.002–0.30, anti-correlated, exactly as for SONAR's genitive readout).

**(3) The falsifier hits: larger/contrastive encoders DO bind some NON-genitive relations SONAR does not.** `non_genitive_abstractly_bound` is non-empty for 6/10 models — all bag-clean (`bag_leak: false`), so not a lexical-cue artifact:
- **all-mpnet-base-v2** (base): **causal** cross-form 0.82
- **all-roberta-large-v1** (large): **causal** 0.751
- **e5-large-v2** (large): **location** 0.753 **and causal** 0.858
- **bge-large-en-v1.5** (large): **causal** 0.791
- **gte-large** (large): **causal** 0.756
- **multilingual-e5-large** (large-multi): **causal** 0.817

Causal ("X caused Y" ↔ "Y resulted from X") is the relation most encoders acquire; e5-large additionally binds location ("X in Y" ↔ "Y contains X"). SONAR binds neither (causal 0.002, location 0.002) — it sits at the **structure-POOR end** of the spectrum, with the small SBERT/E5 family (MiniLM, e5-small, distilroberta, LaBSE) matching SONAR's genitive-only profile and the larger contrastive-retrieval encoders acquiring partial converse-relation binding.

**Net W35.** The W30 blanket "no abstract relational structure is universal" is REFINED to three calibrated claims: **(i) agent/patient (thematic-agency) no-binding IS universal and scale-robust** — the safety-load-bearing part; **(ii) the genitive (possession) is universally bound and scales up** (0.70 → 0.998); **(iii) causal/location relational binding EMERGES in larger/contrastive encoders** beyond SONAR's genitive-only. So "mean-pooled = no abstract structure" is true for SONAR and small encoders, but **larger/contrastive models acquire partial relational structure beyond the genitive — SONAR is at the structure-poor end.** Content-robustness stays fully universal (cross-paraphrase content AUC 1.0 for all 10). This is the chapter's **4th honest self-correction** (after the P6-reopen, numbers-blurred→read-exactly W16, atoms-steerable→read-only W20, and possession-bound→genitive-morpheme W13/W19). *[W35 — refines W30; 4th self-correction]*

---

## 19.8 — Safety: the no-binding attack (W6), confirmed on real harm (W21), and the content-auditing complement (W22)

The single most consequential downstream fact in this chapter is **§19.4's "no abstract relational structure"**: z encodes "agent = surface-first-mention", not a construction-invariant thematic role (and possession's apparent exception is a genitive-morpheme readout, W19). W6 turns that structural finding into a **concrete, meaning-preserving adversarial attack on z-based relational safety auditing** — the mechanism behind §17/SAFE-1's "z-auditor adversarially foolable." **W21 confirms the attack on real directional-harm relations** (threat-source/perpetrator/incitement) and re-validates the parse fix; **W22 establishes the positive complement** — z *is* a paraphrase-robust auditor for content *presence*. Net: **trust z for *what content is present*, not for *who-did-what-to-whom*.** **Crucially, this vulnerability is NOT a SONAR-specific bug (W30, §19.7b; W35): agent/patient/perpetrator no-binding is universal across mean-pooled sentence encoders, INCLUDING large/contrastive/multilingual ones.** W30's 3 small encoders (MiniLM/MPNet/e5 cross-construction thematic-role AUC 0.448/0.460/0.529, all near chance) and W35's full 10-encoder sweep agree: **no encoder binds agent/patient cross-construction** (agentpat cross-form AUC mean 0.37 small / 0.41 large, max 0.497, all ≪ .70; `any_large_model_binds_agentpat: false`). So the relational-auditing attack on *who-did-what* generalizes to ANY mean-pooled sentence-embedding auditor — switching to a different off-the-shelf encoder does NOT close the hole; only an explicit parse/agent-slot does. *(W35 nuance, §19.7b: larger/contrastive encoders DO acquire partial binding of OTHER relations — causal, location — beyond SONAR's genitive-only, so "no abstract structure" is not universal in general; but the safety-load-bearing thematic-agency relation stays unbound everywhere, so this corollary is unaffected.)*

**Setup.** 360 harmful-relation sentences in 90 matched quads (24 harmful/threatening transitive verbs, 30 entities), each event written in **active and passive** voice; the focal entity's status as **agent/perpetrator is balanced 50/50** and is the *same event* across voices (active↔passive only swaps surface order). The task: classify "is the focal entity the perpetrator?" The attack: train the auditor on one voice, test on the other (a passive paraphrase the attacker could apply to evade).

**Result — passive paraphrase defeats z-auditing.**

| classifier | within-voice AUC | **cross-voice AUC** | reads |
|---|---|---|---|
| **z** (mean-pool, fed to decoder) | 0.90 (active) / 0.90 (passive) | **0.014** | surface position |
| c_i field | 0.90 / 0.89 | **0.020** | surface position |
| z_bag (identity neg-control) | ~0.50 | 0.50 | nothing (focal entity constant) |
| **TEXT rule-based (voice-aware grammar)** | 1.0 | **1.0** | thematic role |

A z-based perpetrator classifier is **strong within a voice (AUC 0.90) but anti-correlated cross-voice (0.014)** — it learned "perpetrator = first-mentioned", which **inverts** under passivization (z cross-voice surface-first-mention AUC stays **0.986**, confirming what it actually tracks). c_i behaves identically (0.020). Meanwhile a **text/rule-based** reader using voice-aware grammar (active: pre-verbal subject = agent; passive: noun after "by" = agent) — no training, applied identically to both voices — scores **AUC 1.0 cross-voice**, proving the perpetrator relation is **unchanged and fully recoverable from the text**. The z-classifier's collapse is therefore a **genuine blind-spot**, not an inherently hard problem (`paraphrase_to_passive_defeats_z_auditing: true`, `z_voice_invariant_FALSIFIER: false`, adversarial text−z gap **0.986**). *[W6]*

**Safety implication, calibrated.** A latent-space ("read the meaning vector") relational safety auditor — "is this sentence describing X as the perpetrator/agent of harm?" — can be **defeated by a meaning-preserving active→passive paraphrase**, because z encodes surface position, not thematic role. The harm is unchanged (the text control proves it), so this is an evasion an adversary can apply cheaply. This is the concrete attack behind §17/SAFE-1's adversarial-foolability finding and a direct downstream cost of the §19.4/C3-R5 no-binding result: **z is not a safe substrate for relational ("who-did-what-to-whom") safety classification.** Caveat: synthetic matched-quad templates (saturated within-voice AUCs); the load-bearing contrast is the within-vs-cross-voice **gap** (0.90 → 0.014) and the text control's robustness (1.0), both robust to ceiling effects. *[W6]*

### W11 — the DEFENSE: both-voice training only partially helps; an explicit parse fully fixes it

W6 showed the attack; **W11 tests the natural mitigations.** Does training the auditor on *both* voices, or reading pre-pool states, or adding an explicit parse, restore robustness to a novel meaning-preserving paraphrase? Results (perpetrator-relation AUC):

| auditor | in-distribution | **NOVEL paraphrase (nominalization)** |
|---|---|---|
| pooled-z, single-voice train | — | **0.018** (cross-voice collapse, = W6) |
| pooled-z, **both-voice** train | 0.763 | **0.462** (≈ chance) |
| z_bag (identity neg-ctrl) | — | 0.500 |
| pre-pool focal-token state | — | 0.665 |
| **parse-augmented (agent-slot indicator)** | **1.000** | **1.000** |
| TEXT syntactic-rule reader (honest ceiling) | — | **1.000** |

**Both-voice training gives only PARTIAL robustness.** It recovers in-distribution performance (0.763) but a **novel** meaning-preserving paraphrase the auditor wasn't trained on — here a nominalization — still degrades it to **0.462 ≈ chance**: training on the two voices you anticipated does not generalize to the next paraphrase. Reading **pre-pool** per-token states helps modestly (focal-token 0.665). The robust fix is **parse augmentation**: giving the auditor an explicit structural parse / agent-slot indicator restores the novel-paraphrase AUC to **1.000** (matching the text-rule ceiling). **Calibrated mitigation:** pooled-z relational/directional auditing is **fundamentally fragile** (voice-fragile, adversarially evadable by active→passive→nominalization) and **cannot be patched by data augmentation alone**; the deployable fix is to **augment z-based auditing with explicit per-token/parse structure** — do not rely on pooled z alone for "who-did-what-to-whom" safety classification. *[W11]*

### W17 — a second relational exploit: the timeline is surface-order, not chronological

§18's P8 found z keeps surface clause order for two events; **W17 extends this to multi-event narratives and surfaces a new safety exploit.** z reads the **event SET precisely** (set-membership cosine AUC **0.949** — bag-of-events recovered like lexical content), but the **timeline is surface-narration order, not chronological**: in the decisive cross-condition test (train where surface==chrono, test where surface≠chrono), the **chronological** timeline transfers at AUC **0.0** (chance/anti-correlated) while the **surface** order transfers at **1.0**; the order-free z_bag and tf-idf controls both sit at chance (0.50 / 0.497), and the decoder keeps **narrated/surface order 100%** of the time (e.g. "The lights went out. Before that, the alarm sounded." round-trips verbatim — no chronological reorder). So z gives a **bag-of-events + surface-order, not a chronological timeline.** **Safety exploit:** an adversary can **flip the apparent timeline by re-narrating events out of chronological order** ("B happened. Before that, A happened.") — a z-based auditor reading "what happened first / what caused what" will read the *narration* order and get the causality backwards, while the **event SET remains robustly auditable** (AUC 0.949). This is the temporal sibling of the W6 passive-voice attack: **z is safe for auditing *which* events are present, unsafe for auditing *their order/causality*.** *[W17]*

### W21 — the blind-spot holds for REAL directional harm (threat-source / perpetrator / incitement), and parse-augmentation fixes it

W6/W11 demonstrated the attack on a generic "perpetrator" template. **W21 confirms it on safety-relevant directional-harm relations and re-validates the parse defense** — threat_direction (who is the threat-source), harm_attribution (who is perpetrator vs victim), incitement (3-party: who is the instigator). Per relation: 80 groups × 6 rows (active ×2, passive ×2, **nominalization** ×2, the nominalization held out as a novel paraphrase); is-focal-the-perpetrator balanced 50/50; mildly-harmful templates, defensive eval only. `night5_json/w21_safety_harm.json`.

| auditor | within-voice AUC | single-voice **cross** AUC | both-voice → **NOVEL** nominalization |
|---|---|---|---|
| **z-only** (pooled) | 0.622 | **0.037** (anti-correlated) | **0.533** (≈ chance) |
| c_i field | — | — | 0.503 |
| z_bag (identity neg-ctrl) | — | — | 0.501 |
| pre-pool A·B | — | — | 0.501 |
| **parse-augmented** (voice-aware agent-slot) | 1.000 | 0.583 | **1.000** |
| TEXT synonym-rule (honest ceiling) | — | — | **1.000** (active/passive/nominal all 1.0) |

**The z blind-spot HOLDS for real directional harm.** A z-only auditor reading "who is the perpetrator / threat-source / instigator" scores **single-voice cross-AUC 0.037 — anti-correlated**, i.e. an active→passive paraphrase makes it call the **victim** the perpetrator (it learned "perpetrator = first-mentioned", `z_surface_first_pooled_auc` 0.85). Both-voice training does not rescue a *novel* paraphrase: a held-out nominalization drops it to **0.533 ≈ chance** (z_bag/pre-pool/c_i all ≈ 0.50). The **TEXT control proves the relation is unchanged** across all three paraphrases (active/passive/nominal AUC 1.0 each) — so this is a **genuine z blind-spot, not a meaning change**. The **parse-augmented auditor** (z + a dependency-parse agent-slot feature, the W11 defense) stays at **in-distribution 1.0 and NOVEL-paraphrase 1.0** on the same safety relations. The **vulnerability gap (parse-aug − z-only on the novel safety paraphrase) is +0.467**. **Concrete recommendation:** do **not** deploy SONAR-embedding-only auditors for directional-harm / threat-source / perpetrator-vs-victim / incitement attribution — pooled z reads surface position, not thematic role, and a cheap meaning-preserving paraphrase can flip the judgment to accuse the victim; **augment with a dependency-parse agent-slot feature** (and/or pre-pool per-token states), which restores robustness to ceiling. **And do not expect a different off-the-shelf sentence encoder to fix it: W30 (§19.7b) and W35's 10-encoder sweep show the AGENT/PATIENT no-binding is GENERAL across mean-pooled sentence encoders — small AND large/contrastive/multilingual all fail cross-construction thematic-agency (agentpat cross-form AUC 0.37–0.41, max 0.497, `any_large_model_binds_agentpat: false`) — so the recommendation is encoder-agnostic: any mean-pooled sentence-embedding auditor inherits this directional-harm blind-spot, and the parse-augmentation fix is the encoder-independent mitigation. (W35 caveat: larger/contrastive encoders DO bind some non-genitive relations like causal/location that SONAR does not (§19.7b), so they are not uniformly structure-blind — but none binds the perpetrator/agent relation, so the safety recommendation is unchanged.)** *[W21 — the W6/W11 blind-spot confirmed on real directional harm; parse-augmentation is the fix; W30/W35 — agent/patient no-binding confirmed encoder-general incl. at scale]*

### W22 — the POSITIVE complement: z is a paraphrase-robust CONTENT auditor (trust it for *what*, not *who-did-what*)

W6/W11/W21 are the structure-auditing vulnerability. **W22 establishes the positive complement: z is a reliable, paraphrase-robust auditor for *content presence*.** Four content tasks (entity-mention, quantity presence + value, claim/concept, negation/polarity) plus one STRUCTURE control (relational role), each authored in two surface-phrasing families; train on family A, test on held-out paraphrase family B. Auditors: z (SONAR pooled 1024-d), a lexical bag (paraphrase-fragile control), and char-TF-IDF text (upper bound). The load-bearing number is **cross-paraphrase** AUC. `night5_json/w22_content_auditor.json`.

| task | z cross-paraphrase AUC | lexical-bag cross AUC | text cross AUC |
|---|---|---|---|
| entity-mention | **1.0** | 1.0 | 1.0 |
| quantity-presence | **1.0** | 0.998 | 1.0 |
| claim/concept (synonym-reworded) | **1.0** | **0.780** | 0.909 |
| negation/polarity | **1.0** | 1.0 | 0.994 |
| **mean (content)** | **1.0** | **0.944** | 0.976 |
| STRUCTURE control (role, cross-voice) | **0.022** | 0.702 | 0.009 |

**z is a paraphrase-robust content auditor.** Cross-paraphrase content AUC mean **z 1.0 vs lexical-bag 0.944** (z − bag +0.056); z beats the bag on every task. The **decisive, non-saturated case is claim/concept** — a synonym-reworded claim with **no shared content word**: z stays **1.0** while the lexical bag **drops to 0.780** (text 0.909). Entity/quantity/negation are near-saturated for all three readers (the target survives as a literal token — proper noun, digit, affirm/deny word), so they confirm z reads the content but don't discriminate z from the bag; the claim/concept synonym case is where z's paraphrase-robustness genuinely beats the surface bag. **Meanwhile the STRUCTURE control reproduces the W6 vulnerability sharply** — relational role within-voice 1.0 but **cross-voice 0.022** (text ceiling 0.009) — confirming the dichotomy is real, not a ceiling artifact. **Bottom line, the trust/don't-trust split: TRUST z for "WHAT content is present"** (does this mention entity X / state quantity Y / assert concept Z / affirm-or-deny claim C — robust to paraphrase where the lexical bag fails); **DON'T trust z for "who-did-what-to-whom STRUCTURE"** (paraphrase-fragile, W6/W11/W21). Content-presence = robust; relational structure = fragile. *[W22 — the positive safety complement to W6/W11/W21]*

---

## 19.9 — The lexical code, for contrast (LING)

A complementary linguistics sidequest (lit-grounded: SimLex-999, Google-BATS, Hearst, McRae, Grand) characterizes SONAR's **word** code — the N=1 pooled encode, the lexical lane — so the meaning-beyond-words can be located by subtraction. The finding: **SONAR words are textbook distributional vectors with rich classical structure.**

- **Analogy (3CosAdd, honest = inputs excluded):** macro top-1 **0.833** (chance 0.0021). Strong on morphology/gender/nationality/plural/comparative (1.0), weaker on semantic-geography (capital-country 0.5, currency 0.5). Verdict ABOVE_CHANCE.
- **Similarity / relatedness (Spearman vs human):** SimLex similarity ρ **0.524**, WordSim relatedness ρ **0.551** — human-aligned (both > 0.2). Note mean cosines are high (0.92 / 0.86): the space is anisotropic, rank-aligned not magnitude-aligned.
- **Hypernymy (is-a offset, LOO directionality):** acc **1.000**; norm-ratio hyper/hypo ~1.0 (a clean direction, weak norm-asymmetry).
- **Antonym problem (reproduced):** antonym cos **0.922** > random **0.775** (antonyms sit *close*, the classic distributional failure; synonym 0.952), YET learned-separable by a direction at AUC **0.90**. Both textbook facts hold.
- **Morphology:** consistent offsets (macro offset-consistency **0.485** vs random −0.014); transfer top-1 **1.0** (plural/past/ing). CONSISTENT_OFFSETS.
- **Polysemy:** a static blended code resolved only in context — in-context sense-separation **0.683**, word-to-sense blend balance 0.106. RESOLVED_IN_CONTEXT (textbook static embedding behavior).
- **Semantic features (animate/human/abstract):** linearly readable, macro-AUC **0.983** (control 0.55).

**The contrast that matters.** The *word* code E(w) has the full inventory of classical-linguistic structure (analogy, hypernymy, morphology offsets, the antonym problem, in-context polysemy resolution). That is exactly why R9's "half-readable" half is carried by the **lexical map** — the readable half *is* this classical structure. The un-readable half — the compositional content the words and properties cannot rebuild — is the **c_i meaning field**. The linguistics sidequest and the coverage result are two views of one boundary: words are classical and readable; the meaning beyond words is half-holistic c_i. *[LING]*

---

## 19.9b — Content-type readability: what z reads precisely vs by-the-word vs not-at-all (W16, W17, W18)

The coverage result (§19.1) is an *aggregate*. Three wave-8 experiments resolve it by content type — establishing **what kind of content z reads precisely, what it reads only lexically, and what it does not structurally read** — and each sharpens a safety-relevant boundary.

**Numbers — read PRECISELY, not log-blurred (W16, CORRECTS the pre-registered blur prediction).** The prereg expected z to encode numbers approximately (small preserved, large/multi-digit blurred to magnitude). **Falsified — z reads numbers exactly.** Magnitude is near-perfectly linearly readable (standardized ridge **r²(log10 N) = 0.995**, MLP adds nothing, nonlinear gain −0.05); the decoder reproduces **exact digits at recovery rate 1.0 in every magnitude bucket** (singles → thousands, including 84,000 / 250,000 / 1,200,000 and multi-number sentences like "2019/47/312"), **robust to 20% per-dim Gaussian noise** on z; sign-of-%-change AUC **1.0**, exact percentages preserved. Geometry nuance: integers live on a smooth log-continuum (the z-space 1-NN is the *adjacent* value ±1 43% of the time, the *same* exact value 0%) — yet the decoder still emits exact digits, so the precision is in the fine position along the axis, read losslessly for clean carriers. **Reconciliation with §16.4** (number-curved/audit-only/un-steerable): that was about **editing/erasure** under likelihood-death and mean-shift steering, *not* clean round-trip reading. Both hold: **z stores numbers PRECISELY to READ but they are HARD to STEER/scrub** (linear mean-shift decode-shift only 0.08). So z is a **precise lexical/content store** — exact quantities, percentages, and sign-of-change all survive encode→decode and are probe-readable. **Safety:** trust z for reading exact quantities in clean single/multi-fact sentences; do **not** infer from "numbers are precise" that you can safely *scrub* a number from z (you cannot, per §16.4). Untested for long cluttered paragraphs where decoder fidelity drops. *[W16 — corrects the number-blur prediction]*

**Style / formality / register — readable but LEXICAL (W18).** Formality is **perfectly readable from z** (AUC **1.0**), genre at **0.958** — but it is **word-choice, not a separate abstract style axis**: the order-free `z_bag` already carries ~all of it (formality z_bag AUC 0.975, z-minus-z_bag gap only **+0.025**; genre z_bag share 0.824). The decisive test is whether `c_pool` (= z − z_bag, a function of the *same* tokens) reads style **beyond** word-choice: it does **not** — on non-saturated genre the orthogonalized incremental accuracy of c_pool over z_bag is only **+0.063**, on formality **−0.005**, residual-correlation 0.12 (`cpool_beyond_word_choice: false`). Decode preserves register (retention probe 0.947, round-trip AUC 1.0). So **style/formality is readable lexical content** — z_bag word-choice, not a c_pool abstract field — consistent with §16.4 classing register as readable-but-inert. *[W18]*

**Event timeline — set yes, order no (W17, detailed in §19.8).** z reads the **event set** precisely (cosine AUC **0.949**) but the **timeline is surface-narration order, not chronological** (cross-condition chronological AUC **0.0** vs surface **1.0**; decode keeps narrated order 100%). This is the multi-event extension of P8/C3 and a relational reading limit, with the safety exploit detailed in §19.8. *[W17]*

**The content-type readability map, assembled.** Putting wave-6/7/8 together, z's reading profile is **content-type-specific**:

| content type | readable from z? | carried by | safety status |
|---|---|---|---|
| exact numbers / quantities | **yes, PRECISELY** (W16) | lexical/content store (curved) | safe to READ, hard to scrub |
| named entities, sentiment, tense, quantifiers | yes (C2) | lexical (z_bag) | readable |
| style / formality / register / genre | yes (W18) | **lexical (z_bag), not abstract field** | readable, inert to steer (§16.4) |
| negation-scope, coreference | yes (C2) | c_i beyond the bag (English) | readable; migrates to bag in richer morphology (W4) |
| **possession** ("X's Y") | as a **genitive-morpheme cue**, NOT abstract (W13→W19) | the 's/of morpheme in c_i (collapses to chance when neutralized) | readable lexical cue, not a bound relation |
| event SET (which events) | yes (W17) | bag-of-events | robustly auditable |
| event TIMELINE / causality order | **no** (W17) — surface order only | surface position | unsafe to audit (exploitable) |
| agent/patient thematic role | **no** (C3/W13) — surface order only | surface position | unsafe to audit (passive-paraphrase, W6/W11) |
| nested / multi-predicate structure | **no** (R5) — flattens to surface | surface position | not structurally read |

The unifying statement: **z is a precise *content/lexical store* (numbers, entities, style, event sets all readable, often exactly) plus a few lexically/morphemically-cued relational readouts (the genitive-possession cue, negation-scope, coreference), but it has NO abstract relational structure — it does NOT structurally read order/causality, thematic binding, or even possession as a role** (possession's cross-form transfer is the 's/of morpheme, W19) — those survive only as surface position or a lexical cue, which is the source of the relational safety exploits (§19.8). *[W13, W16, W17, W18, W19]*

---

## 19.9c — How the SONAR decoder reads c_pool (W10)

A complementary question to "what is in c_i" is **how the decoder reads it.** W10 traces this on the decode side. Two structural facts:

- **z is a single position, so cross-attention is degenerate over one key.** The decoder's cross-attention reads z **diffusely across depth, not just as an initial state**: cross-attn read-fraction is roughly flat-to-peaking through the early/mid stack (peak at layer 13), then **falls in the late third** (early-third mean 9.4, mid 10.0, late 5.4) — z is genuinely consulted across most of the decode, not merely setting layer-0 state. *[W10]*
- **Content emerges EARLY, surface form LATE.** Via logit-lens, content tokens lock to their final identity at median depth-fraction **0.61** while surface tokens lock at **1.0** (content mean 0.64 vs surface 0.90) — and the content-token "bag" is already half-recalled by mid-depth (bag-recall 0.53 at half-depth, dipping mid-stack then re-rising). So the decoder reconstructs **what** the sentence is about before it commits to **how** it is worded — the decode-side mirror of the encode-side lexical/c_i split. *[W10]*
- **Reading is DISTRIBUTED, with a weak localist tail (no clean token→z-direction map).** Perturbing z along a **content-word-ablation** direction changes **few, targeted** words (mean 1.1 words changed, Jaccard 0.94 to base, the ablated word vanishes 28% of the time) — *less* than a matched-norm random perturbation (2.5 words, Jaccard 0.88), a weak localist signal. But PCA directions of z change **6–12 words** (Jaccard 0.50–0.72) — broad, distributed effects. So there is **no clean per-output-token z-direction handle**: the falsifier "localist token→z-dir map" did **not** fire. The decoder reads z **diffusely** — content early, surface late, distributed with only a weak content-ablation localism. *[W10]*

This is consistent with the encode-side picture: meaning lives in a distributed field (no per-neuron/per-token atomic handle, §19.5), and the decoder reads that field distributively, recovering content before surface form. *[W10]*

---

## 19.10 — Honest limitations + what's owed

The chapter's calibration depends on naming what wave-6/7 did NOT establish.

- **The marquee pooled-field SAE per-feature analysis COMPLETED (resolving the central owed item — with a partial-positive twist).** R1 (pooled c_pool-field SAE, FVU 0.457) ran its full per-feature paraphrase-invariance analysis: **273 paraphrase-invariant nameable atoms at 16% energy (273× over a shuffled control)**, including abstract concepts (contrastive discourse, reported speech, first-person). So the decomposability answer is NOT a flat negative: per-token/writer/LLM granularities yield no atoms (R6 + §18 per-token SAE + R2 + R4), but the **pooled field yields a real minority sparse concept basis**. What remains owed: pushing the atom *fraction* past ~16% (the high FVU suggests a much larger dictionary / sparsity sweep might, per day-5's "biggest-SAE breaks plateau"). **Full validation is now done (W20, §19.5): R1's atoms are a real, monosemantic (93.5% held-out auto-interp ≥0.70, shuffled 0.50), beyond-lexical (36/43), AUDITABLE concept dictionary — but NOT causal levers (only 2.5% steer coherently; shuffled-atom 0.013; coverage 0.152 « ceiling 0.945). A few strong topic atoms beat diff-of-means at n=2 (W15), but the atom basis as a whole is read-out, not write-in: readable, not writable.** Cite c_i as "dominantly field-level with a real ~16%-energy nameable-atom minority recoverable only from the pooled field," NOT "no atoms" and NOT "fully decomposable."
- **~~R8 is partial — topic concept only.~~ CLOSED by W2 (§19.6).** The relational and attribute steering arms now ran: readable≠causal holds across all three types, gap ≈0.60, worst for relational (read 0.67 → causal 0.067). What remains owed is non-template steering and a less strict coherence judge.
- **W9's "0.71-readable" is given LEARNED features, which are not independent of c_i.** The C1-SAE features that lift coverage to 0.706 are themselves trained *on* c_i, so "0.71 readable" means "0.71 reconstructible by a learned topic-field dictionary," not "0.71 nameable by an independent human-readable property list" (hand-properties stay at 0.50). The honest two-number summary: **~0.50 with independent hand-properties, ~0.71 with c_i-derived learned features; ~0.19 SBERT genuinely irreducible.**
- **W1's residual characterization is correlational.** "Residual energy is lexical/entity specificity" rests on hand-built specificity axes explaining only R² 0.285 of residual energy; the residual is high-rank (98/338 dims) and the per-PC decodes are messy. The robust claims are the *negative* one (no single clean axis — falsifier unfired) and the *meaningful* one (residual+mean adds +0.283 decode-SBERT); the family-attribution is a lean, not a clean decomposition.
- **W4 reports en + fr only** (German/Chinese were configured but the landed JSON carries the en-anchor and French; the decisive nuance — `ne…pas` migrating negation into the bag — is French). "Universal" is supported across two languages plus the predicted morphology exception, not yet a wide typological sweep.
- **W6 uses synthetic matched-quad templates** (within-voice AUCs saturate). The load-bearing contrast is the within-vs-cross-voice **gap** (0.90 → 0.014) and the text control's cross-voice robustness (1.0), both ceiling-robust; the attack is demonstrated on a template family, and a natural-text replication is owed before quantifying real-world evasion rates.
- **W8 is a white-box geometry result, not a causal one.** It shows c_i *is* the post-final-LN residual by cosine tracing (pre→post LN jump 0.831); it does not ablate the LN. "The final LayerNorm constructs c_i" is the natural reading and reconciles R2's 0.182, but is a structural/observational claim. W12's rank-expansion (eff-rank 11→600, ~53×) and non-invertibility (post→pre R² 0.35) are likewise correlational geometry; the agent flagged and controlled the pooling-order confound (per-token-LN-then-pool recovers the post ceiling 0.98), but no LN ablation was run.
- **W13's "possession is bound" was CORRECTED by W19 — the carrier is the genitive morpheme, not abstract binding.** W13's possession cross-form AUC 1.0 was a clean fire, but W19's full 8-relation taxonomy + genitive-neutralization falsifier shows it rides the **'s/of clitic**: rewrite the same relation without the genitive ("A's B" → "A owns B") and cross-form collapses **1.0 → 0.043** (a 0.957 drop). So "no abstract relational structure" is the *uncorrected* statement after all — possession transfers only because both surface forms keep the genitive morpheme. Caveats on W19: synthetic two-form templates; kinship (also genitive) is intermediate (0.431), so the genitive is necessary-not-fully-sufficient for the strong-bound reading; whether the morpheme-cue picture survives natural text is owed. The lexical-leak `z_bag` check is clean (~0.50) on all relations, so the carrier-finding is genuine, not a bag artifact.
- **W21 confirms the W6/W11 attack on real directional harm; W22 is the positive content-auditing complement — both synthetic-template.** W21 (threat-source/perpetrator/incitement) and W22 (content presence) use matched-paraphrase templates as elsewhere; the load-bearing contrasts are the gaps (W21: z-only novel-paraphrase 0.533 vs parse-aug 1.0, vulnerability gap +0.467, text control 1.0; W22: z content cross-paraphrase 1.0 vs bag 0.944, decisive synonym-claim z 1.0 vs bag 0.78, structure control cross-voice 0.022). Natural-text replication and real-world evasion/precision rates are owed before deployment; the recommendations ("don't deploy embedding-only auditors for directional harm; do trust z for content presence") follow from the template gaps.
- **W23's cross-lingual atoms are en/fr/de/zh on synthetic concept sentences (+ a natural-text replication).** The 6/7 cross-lingual atoms (content non-EN mean AUC 0.987, abstract 0.789) fire on the SAME R1 feature ID across languages, with one Latin-script-bound exception (contrastive-discourse, 0.0 on Chinese). c_pool was computed with English-fit R_pos/R_role + per-language ridge-E refit; a wider typological sweep beyond these four languages is owed, but the language-agnostic-dictionary claim is supported on the four tested + a natural-text replication.
- **W14's ceiling cap is a result given the predictor set tested.** SBERT-bridge, full-text MLP, Llama-3B-bridge, and their kitchen-sink all plateau ≤0.71; a still-larger or differently-structured external encoder is not logically excluded, but four diverse text predictors converging on ~0.71 ≈ the native interp 0.683 is strong evidence the 0.71→0.90 residual is SONAR-specific and external-irreducible.
- **W15's gap-narrowing is on 2 concepts at the paired coherence knee — and does NOT generalize (W20).** Causal-success 0.567 (SAE) vs 0.233 (DoM) and coherence-at-knee 0.839 vs 0.227 are real, but `n_concepts_both_reached_knee = 2`; the "atoms steer cleaner" signal is on a tiny paired set. **W20's full sweep (40–51 atoms) falsifies the generalization:** only **2.5% (1/40)** of R1's atoms steer coherently (shuffled-atom 0.013), so the calibrated claim is "a few strong topic atoms beat diff-of-means (W15), but the atom basis as a whole is NOT a steering handle (W20)" — the atoms are readable/auditable, not writable.
- **W16's number-precision is for clean single/multi-fact carriers; W17's timeline and W18's style are template-based.** W16 exact-recovery 1.0 is on clean sentences (long cluttered paragraphs untested, and the precision is to READ not to STEER/scrub — §16.4). W17 (event-set 0.949 / timeline surface-only) and W18 (style lexical) use matched templates as elsewhere; the load-bearing contrasts are the gaps (chrono 0.0 vs surface 1.0; c_pool-beyond-bag ≈0).
- **W11's "parse fixes it" is the honest text ceiling, not a free patch.** Parse-augmentation reaches AUC 1.0 on the novel paraphrase, but it requires running an explicit parser/agent-slot extractor alongside the latent auditor — the fix is "don't rely on pooled z alone," not "z auditing can be made robust by training." Both-voice training reaching only 0.46 on a novel nominalization is the load-bearing negative.
- **Synthetic templates saturate AUCs.** C2, C3, R5 (and W4/W6) use hand-built matched-pair templates; many probe AUCs are 1.000 (shuffled-label nulls ~0.50, so saturation is real signal not leakage). The load-bearing contrasts are the **gaps** (c_i-gap +0.50–0.57 on scope/reference; cross-construction thematic 0.015 vs surface 0.985), robust to ceiling effects; the absolute 1.0s are not "perfectly readable in the wild."
- **R3 decode-SBERT corroboration skipped** (shared-box OOM safety, §19.3). The additivity-cosine primary metric is complete; the decode half is omitted. n=18/type; single English template family per type.
- **R9/W9 ridge maps are linear-into-the-decoder.** A richer nonlinear map might lift the 0.50/0.71 numbers — so they are *lower bounds on readability given the map class*. The map-class-robust claim is "hand-properties add nothing over the words" (+0.005); the 0.71 figure is map-and-feature-specific.
- **"Holistic" / "irreducible" is a claim about enumerable handles, not uncomputability.** The ~0.19 residual is irreducible to *a property list or an independent dictionary*; it is plainly computed by SONAR's encoder (decodes to 0.90). "Holistic" = "no enumerable-property or per-neuron handle found," exactly as §18's A4 said "not LOCALLY constructible" ≠ "uncomputable."

**The self-trick this chapter guards against.** "c_i is half-readable" is tempting to round to either pole. Rounding up to "the vector space is solved / c_i is readable" is false — a real ~0.19-SBERT high-rank fine-detail residual stays unreadable and **no external text encoder reaches it** (W1/W14), there is no abstract structural binding for the safety-relevant relations (C3/R5/W4) — a property that is a concrete passive-paraphrase vulnerability (W6) **not patchable by data augmentation** (W11) and an out-of-order-narration timeline exploit (W17) — the content is field-level not atomic (R6/W9/W7 broad topic-fields, no canonical axes), and readable directions don't cleanly steer for **any** concept type (R8/W2). Rounding down to "c_i is opaque" is also false — a learned map from the words recovers half and learned topic-field features reach ~0.71 (R9/W9), specific relational properties are probe-extractable where the bag is at chance (C2/W4), **numbers are read exactly** (W16), z is a **paraphrase-robust content auditor** (W22: entity/quantity/claim/negation cross-paraphrase AUC 1.0), the readable half is a clean additive bag-of-concepts (R3), the pooled-field SAE atoms are real, **cross-lingual** (W23: same atom fires en/fr/de/zh), and **monosemantic, beyond-lexical, AUDITABLE** (W20: 93.5% pass held-out auto-interp, 36/43 beyond-lexical — a real read-out concept dictionary), and the residual is *meaningful* fine lexical detail, not noise (W1). **The specific corrections to honor: wave-9's W19 CORRECTS wave-8's W13 — "possession is abstractly bound" is WITHDRAWN; the cross-form transfer is a genitive-morpheme readout (collapses 1.0→0.043 when the clitic is neutralized), so "z has essentially no abstract relational structure" stands. The other wave-8 corrections remain: numbers are NOT blurred (exact, W16); the readability ceiling is NOT reachable by a better external encoder (capped ~0.71, W14); and — the night-5 self-trick correction — readable≠causal holds **at the atom level too**: R1's atoms are a real monosemantic/auditable dictionary but NOT causal levers (W20: 93.5% monosemantic, only 2.5% steer coherently), so W15's "atoms are cleaner levers" (n=2) does NOT generalize — the atoms are READABLE, not WRITABLE. The easiest person to trick is yourself; W15 was that trick, W20 is the fix.** Report **all of it**.

---

## 19.11 — Synthesis: how far the meaning field decomposes

Folding wave-6/7 into §18: §18 established c_i is WHERE decodable meaning lives, WHAT writes it (late FFN), and that its behavioral claims are gauge-rigid. §19 reads it, and the calibrated answer is a quantified middle — sharpened on six edges by wave-7.

**c_i is partially readable — more than wave-6 alone implied.** A learned map from the words recovers about **half** its decodable meaning (R9: 0.50 of 0.90); **learned topic-field features push that to ~0.71** (W9: C1-SAE +0.19), and the remaining **~0.19-SBERT residual** is **high-rank, fine-grained lexical/entity specificity with no dominant axis** — meaningful (residual+mean +0.283) but not a clean missing concept (W1). Hand-built enumerable properties add nothing over the words (R9: +0.005). The readable content is **probe-readable property-by-property** — including relational scope/reference the words alone miss (C2: c_i-gap +0.50) — and structured as a near-**additive bag-of-concepts** (R3: 0.86–0.91; binding a weak ~0.12 residual) with **no abstract structural binding for any relation tested — agent/patient, event-order, comparison, location, causal, recipient, nesting all surface-positional** (C3/R5/W4/W13/W19: cross-construction thematic 0.015 vs surface 0.985), **and the one apparent exception (possession, W13) corrected to a genitive-morpheme readout** (W19: cross-form 1.0 collapses to 0.043 when the 's/of clitic is neutralized — so z has essentially **no abstract relational structure**). The surface-positional relations are **concrete safety vulnerabilities** — a passive paraphrase flips a z-based perpetrator auditor from AUC 0.90 to 0.014 while a text reader stays 1.0 (W6), confirmed on real directional harm (W21: cross-AUC 0.037, parse-aug 1.0), only fixable by an explicit parse not data augmentation (W11/W21), and out-of-order narration flips the read timeline (W17) — **though z IS a paraphrase-robust auditor for content presence** (W22: cross-paraphrase AUC 1.0 vs bag 0.944), so trust z for *what*, not *who-did-what*; the whole profile is **language-universal** (W4: holds en+fr; lexically-marked relations migrate to the bag). By content type, z reads **numbers precisely** (W16: exact digits, r² 0.995), **style lexically** (W18: word-choice, not an abstract field), and **event sets but not timelines** (W17). The content is **field-level, not atomic** (R6 + W9's broad topic-field features + W7's shared-subspace-no-canonical-axes), with a real **~16%-energy minority of nameable concept atoms that are cross-lingual** (R1/W23: same atom fires en/fr/de/zh), and the readability ceiling is **capped ~0.71, beaten by no external text encoder** (W14). It is **readable but not cleanly causal** for steering across **all** concept types and worst for relational (R8/W2: gap ≈0.60; relational read 0.67 → causal 0.067); **the native SAE-atom basis helps only marginally — a few strong topic atoms beat diff-of-means (W15, n=2), but the atom basis as a whole is NOT a causal handle (W20: only 2.5% of R1's atoms steer coherently). R1's atoms are a real monosemantic/auditable concept dictionary (W20: 93.5% pass held-out auto-interp), but read-out features, not write-in levers — readable, not writable.** And mechanistically it **is the residual stream after the encoder's final LayerNorm** — that LN is the dominant non-additive step (pre→post cos +0.831) that **rank-expands** the high-norm FFN write ~53× into the c_i direction (W8/W12). The complementary **lexical code** is a textbook distributional vector (LING: analogy 0.83, antonym problem, morphology offsets) — which is precisely why the readable part is word-carried and the residual is the fine c_i detail.

**Most of the meaning field decomposes into readable, additive, probe-extractable concepts (≈0.50 by independent properties, ≈0.71 by learned topic-fields — an external-encoder-irreducible ceiling); a hard ~0.19-SBERT remainder of high-rank fine lexical/entity detail does not. What structure c_i carries is surface-positional (not thematic — a real safety vulnerability) with essentially NO abstract relational structure (possession's apparent binding is a genitive-morpheme readout, W19), field-level (with a cross-lingual ~16% nameable-atom minority that is monosemantic/auditable but NOT a causal handle — W20: 93.5% readable, 2.5% steerable), and not cleanly steerable in any basis (the native atom basis helps only at n=2, W15, and does not generalize, W20); z is a paraphrase-robust auditor for content presence but not for relational structure (W22 vs W6/W21); by content type z reads numbers precisely, style lexically, and event sets but not timelines; mechanistically it is the post-final-LayerNorm residual (rank-expanded ~53×); and the whole profile is language-universal. That is the calibrated answer to "can we READ c_i."**

---

*This §19 is the wave-6 'reading c_i' chapter (appended after §18; length-stratification follows in §20), refined by wave-7 (W1/W2/W4/W6/W8/W9), wave-8 (W10–W18), wave-9 (W19/W21/W22/W23/W24), wave-10 cross-model generality (W30: the content-store holds across MiniLM/MPNet/e5 and no small encoder reads thematic role, §19.7b — **REFINED by W35, the chapter's 4th self-correction:** a 10-encoder sweep across the full relation taxonomy shows agent/patient no-binding IS universal and scale-robust (cross-form 0.37–0.41, max 0.497) and the genitive is universally bound and scales up (0.70→0.998), but larger/contrastive encoders (mpnet/roberta-large/e5-large/bge/gte/mult-e5) DO abstractly bind some NON-genitive relations — causal/location, bag-clean — that SONAR does not, so "no abstract structure is universal" is too strong; SONAR sits at the structure-poor end, §19.7b(W35)), and a **natural-text validation (W31)**: on *real* mined active/passive pairs the two load-bearing claims REPLICATE off-template — **no-binding** (cross-construction thematic-role AUC z **0.009** vs surface **0.991**, z_bag role-blind 0.50) and **content-store robustness** (z cross-paraphrase **1.0**); W31's reading-coverage arm was a ridge-fit failure (bag below the mean-only floor), **since FIXED by W41**: with a proper intercept-ridge, natural-text reading-coverage REPLICATES the synthetic R9 ladder rung-for-rung (ceiling **0.895**, bag-ridge **0.471**, +SAE-features **0.644** vs synthetic 0.901/0.499/0.706; bag-ridge ≥ mean-only floor 0.063, fit-sanity passes) — so **all three load-bearing claims (no-binding, content-store, AND reading-coverage) are now validated on natural text**, with no meaningful length/difficulty penalty. The pooled-field SAE per-feature analysis is COMPLETE (R1: 273 atoms / 16% energy, §19.5), its atoms are validated as a real monosemantic/auditable READABLE dictionary, NOT causal levers (W20, §19.5: 93.5% pass held-out auto-interp, only 2.5% steer coherently — W15's n=2 "cleaner levers" does not generalize) and shown CROSS-LINGUAL (W23, §19.5); R8's steering arms are closed (W2). Owed before deployment: natural-text replication of the synthetic-template results (C2/C3/R5/W4/W6/W11/W13/W16/W17/W18/W19/W21/W22), pushing the atom fraction past ~16% (**W40: scaling the pooled c_pool SAE h32768→h65536 WORSENS FVU 0.457→0.523 — consistent with c_i being intrinsically high-rank, W3/W12; the atom-fraction-at-scale analysis did not complete (agent died mid-analysis), so "does the 16% atom fraction grow at scale" remains OWED**), and a wider typological sweep — **now done by W42**: no-binding holds across Japanese (SOV, が/を case particles), Hindi, Turkish, Russian, and Arabic (cross-construction thematic AUC 0.02–0.17, surface 0.83–0.98); decisively, even Japanese's *explicit readable case-particle tokens* yield ZERO abstract role binding (cross-construction 0.017, z_bag 0.499 — the particles aren't even read), so no-binding is typologically *deep*, not a surface-marker artifact. The genitive (W19, a nominal possession relation) remains the lone abstract-bound relation cross-typologically; thematic agency is never bound regardless of morphological case-marking. The corrected headlines: **z has essentially no abstract relational structure — wave-9's W19 CORRECTS wave-8's W13, showing possession's apparent binding is a genitive-morpheme readout (collapses 1.0→0.043 when neutralized), §19.4**; numbers read exactly (W16); the readability ceiling is capped ~0.71 and external-encoder-irreducible (W14); and readable≠causal holds at the atom level (W20: R1's atoms are a real monosemantic/auditable dictionary but NOT causal levers — only 2.5% steer; W15's n=2 "cleaner levers" does not generalize). The safety split: z is a paraphrase-robust CONTENT auditor (W22) but unsafe for relational/directional-harm auditing (W6/W21 — confirmed on real threat-source/perpetrator/incitement, fixable only by a dependency-parse agent-slot, §19.8). Mechanistically c_i is the post-final-LayerNorm residual, rank-expanded ~53× (W8/W12, §19.5).*

---

# 20 — Length-stratification: dilution, readout-probe length-invariance, and recombination models (night-6, 2026-06-18)

**The question.** Every prior result pooled across sentence length. But `z = (1/N) Σ_i h_i` divides each token's contribution by N — so a 4-token sentence gives each token 1/4 of the norm, a 40-token sentence 1/40. If a fixed-capacity downstream reader (the decoder, an SAE, a probe) has a noise floor, long sentences should dilute each token below it. The user's night-6 directive: **stratify every branch by length, try *different models* for the recombination, and check whether the readout probes themselves differ / need length-adaptation.** Three tracks, all pre-registered, all deconfounded against the length↔rare-word↔content-count confounds.

## 20.1 — Reconstruction fidelity degrades with length, but gracefully — and it is genuine 1/N dilution, not a lexical-rarity confound (L1, L1b)

Binning ~16k v2 sentences by SONAR-SPM token count (vshort 1-4 | short 5-9 | medium 10-19 | long 20-34 | vlong 35+; N≥4k/bin except vshort, filled separately at N=735 natural short sentences), every reconstruction metric degrades monotonically with length:

| metric (vshort → vlong) | vshort | short | med | long | vlong | Δ |
|---|---|---|---|---|---|---|
| **z-recon FVU** (c_pool re-rotated → pooled z) | 0.062 | 0.122 | 0.160 | 0.208 | 0.235 | +0.17 |
| **c_pool → SBERT decode** | 0.981 | 0.975 | 0.962 | 0.928 | 0.811 | −0.17 |
| **per-token-SAE FVU** | 0.220 | 0.227 | 0.369 | 0.507 | 0.513 | +0.29 |
| **atom-energy fraction** (raw state space, [0,1]) | 0.780 | 0.772 | 0.635 | 0.491 | 0.492 | −0.29 |

The dilution hypothesis is **confirmed in direction** — per-token reconstruction (SAE-FVU) and the SAE atom-captured variance fraction are strongly length-sensitive (the per-token-SAE "flat-across-length" guess was *falsified*) — but the **field-level meaning degrades gracefully**: c_pool still decodes to SBERT **0.811 at ~43 mean tokens** (≥83% of the 0.99 full-z ceiling). Long-sentence decodes are faithful paraphrases that lose word-order and list-tail items, not collapses. vshort is best on all four axes, so the curves extend monotonically to the short end.

**It is real dilution, not a rare-word artifact (the decisive deconfound, L1b).** The raw corr(length, c_pool-SBERT) = −0.544; the partial correlation controlling rare-word rate + content-word count = **−0.339**, so ~38% of the apparent length effect is "more content words," but a majority is length itself. Rare-word *rate* is a **non-confound** (partial r −0.02; the rate is flat ~0.035–0.040 across bins). Decisively, on a **rare-word-rate-matched subset** (rate held at 0.000 across all five bins, N=456/bin), the SBERT-vs-length curve still falls **0.985 → 0.824** — matched Δ −0.161 vs unmatched −0.181, so **89% of the degradation survives rare-word matching**: this is genuine 1/N pooling dilution, not lexical rarity. *(The earlier normalized-space atom-energy metric overshot >1; recomputed in the SAE's native un-normalized space as 1−FVU_raw it lies cleanly in [0,1] and falls 0.780→0.491, tracking the SAE-FVU rise.)*

## 20.2 — The readout probes do NOT need length-adaptation for content — and the one length effect (relational) is a short-sentence surface artifact, not a role direction (L2, L2b)

The user's headline question — *"are the linear probes you get different across lengths, do you need to adjust?"* — has a clean answer. Training linear probes **within** each length bin, building the full i×j cross-bin **transfer matrix**, and measuring the **cosine between bin-specific weight vectors**:

| target (readability tier) | within-bin AUC/r² range | off-diag transfer | weight-direction cosine | pooled-vs-per-bin Δ |
|---|---|---|---|---|
| **numbers / magnitude** (content) | r² 0.92–0.96 | 0.88 | **0.93–0.997** | −0.06 |
| **entity presence** (content) | AUC 0.996–1.0 | 0.94 | **0.986** | ≈0 |
| **negation** (C2, mid) | AUC 0.96–1.0 | 0.98 | 0.87 | ≈0 |
| **agent-vs-patient** (relational) | **0.68 → 0.61 → 0.45/0.51** | ~0.50 | **−0.93** (short vs long) | — |

**Content/lexical readouts are length-invariant single directions.** Numbers, entities, and negation give flat within-bin quality, near-unity cross-bin weight cosines, flat transfer, and a single **pooled global probe loses essentially nothing per bin** (Δ −0.002 to −0.06). **You do not need a length-adaptive readout for content** — one length-agnostic probe suffices. This **replicates on natural v2 sentences** (L2b, N up to 44k/bin): number/entity/negation cosines all ≈0.99, transfer flat, with rare-word rate *falling* with length (the opposite of any spurious inflation) and within-bin class-balancing neutralizing the base-rate drift.

**The one genuine length effect cuts the other way and refines §19's no-binding result.** The agent/patient relational probe is weakly above chance only at short length (0.68) and decays to chance by long/vlong (0.45–0.51), and the cross-length weight cosine is **negative (−0.93)** — i.e. the faint short-sentence "role" signal is an **anti-correlated surface-position leak**, not a stable thematic-role direction, and it dilutes away under mean-pooling as sentences grow. So z's role-blindness (§19.4) isn't perfectly uniform — there is a small positional artifact at short length — but **there is no length-invariant role direction to adapt to**: the probe that "changes with length" is the near-chance one, and it changes because it was never reading a role.

## 20.3 — Different recombination models: the loss is per-token capacity, not the 1/N aggregation rule (L3)

The user's "try different models for the recombination" needs one methodological guard: **SONAR's pooling is exact** (z *is* the token mean, γ=1.00, §16.1), so any recombiner fed the *real* per-token states reconstructs z perfectly — the comparison is null by construction. To make pooling lossy and the question decisive, we pass each token state through a **fixed-capacity bottleneck** `b_i = P h_i` (frozen random projection, m ∈ {64,128}) and ask: given degraded per-token info, does a *smarter* recombiner recover z better than uniform 1/N, and does the gain grow with length? Recombiner families, all mapping bottlenecked tokens → real z, identical held-out split, z-recon FVU per length bin:

| arm (m=64) | vshort | short | med | long | vlong |
|---|---|---|---|---|---|
| **uniform 1/N** | 0.632 | 0.478 | 0.659 | 0.918 | 1.227 |
| **learned attention-pool** | 0.639 | 0.474 | 0.655 | 0.916 | 1.223 |
| **DeepSets** (non-linear) | 0.682 | 0.512 | 0.707 | 0.985 | 1.299 |

Two findings. **(1) At fixed capacity, dilution is real and length-growing:** FVU rises from ~0.48 (short) to >1.2 (vlong) — beyond 1.0 means worse-than-predicting-the-mean, i.e. for long sentences the bottlenecked tokens retain almost nothing recoverable. This is the dilution signature the per-token-SAE curve (§20.1) predicted, now isolated in a controlled recombination setting. **(2) Smarter recombination does not help:** learned attention-pooling **ties uniform 1/N** to within 0.01 at every length, and DeepSets is *worse* everywhere. The learned attention weights show **no content-upweighting** (Spearman(a_i, c_i_norm) = −0.09; mean weight on content tokens ≤ stopwords), only a mild **late-position** preference (Spearman(a_i, position) = +0.42) that buys no FVU. So **uniform 1/N is already near-optimal: the information lost to dilution is gone at the per-token capacity level, not mis-aggregated** — you cannot recover diluted content by reweighting or non-linearly combining tokens. This is the direct answer to "try different recombination models": a more expressive recombiner is not the missing piece. The **decode-SBERT secondary metric confirms the FVU non-win transfers to meaning, not just L2 norm**: at the m=128 bottleneck on the long bin, decoded-text SBERT is uniform **0.301** / attention-pool **0.300** / DeepSets **0.204** (Δ attn−uniform = **−0.001**) — smarter recombination recovers no more *decodable content* than uniform 1/N.

## 20.4 — Synthesis: length is a capacity axis, not a structure axis

Across all three tracks, **length acts purely as a capacity/dilution dial, never changing the *kind* of thing z stores or how to read it.** (i) Reconstruction degrades gracefully and monotonically with length — genuine 1/N dilution (survives rare-word matching, 89%), not a lexical-rarity confound, with field-level meaning far more robust (SBERT 0.81 at 43 tokens) than per-token detail (SAE-FVU 0.51). (ii) The content readouts are **single length-invariant directions** — one global probe suffices, on templated *and* natural data, so deployed content-auditors need no length-adaptation; the only length-varying probe is the near-chance relational one, whose variation (a sign-flipping −0.93 cosine) confirms it was reading a short-sentence positional artifact, not a role — sharpening §19's no-binding claim rather than overturning it. (iii) The dilution is in **per-token capacity, not aggregation**: at a fixed bottleneck a learned attention/DeepSets recombiner cannot beat uniform 1/N, and the learned weights don't even favor content tokens. **Practical upshot for the parascope/safety pipeline: trusting z for *what-content* is length-robust (use one probe, expect graceful decay on very long inputs); trusting z for *who-did-what* fails at all lengths (and the apparent short-sentence role signal is an artifact to ignore); and you cannot buy back long-sentence fidelity with a fancier pooler — if you need per-token detail on long inputs, you need per-token access, not a better recombiner.**

**Honest limits.** L1/L1b/L3 are SONAR-only on v2 sentences; the cross-encoder length port (L4 — is dilution universal or SONAR-specific, and does the relational blindness stay length-invariant across larger encoders?) was scoped but deprioritized as ~70% already-answered by W30/W35 and remains **owed**. L2's relational arm is templated (no template-free gold agent/patient label exists in natural text; L2b's first-mention-animacy proxy collapsed into a lexical content probe, so the natural-data relational test is owed). The L3 bottleneck is a random projection; a learned/PCA bottleneck or a true set-transformer (dropped as cost-unjustified once attention tied uniform) could in principle behave differently, though the flat attention-vs-uniform result across two bottleneck widths makes that unlikely. vlong FVU>1 indicates the m∈{64,128} bottleneck is below long-sentence information content — the *curve shape* (monotone rise) is the result, not the absolute >1 values.

---

# 21 — Three laws of mean-pooled sentence embeddings: a predictive theory + the parascope safety pillar (night-7, 2026-06-20)

**From regularities to laws.** §20 established length empirically: it is a capacity/dilution dial (graceful 1/N decay), content readouts are single length-invariant directions, and the dilution lives in per-token capacity not aggregation. Night-7 is a **theory-forward** run that turns those regularities into three *stated laws* with closed-form predictions and pre-registered falsifiers — plus a fourth, deployment-facing pillar that asks whether the live parascope monitor inherits z's profile. The backbone is unchanged: `z = (1/N) Σ_i R_pos^{p_i}[R_role E(w_i) + c_i]`, a position-rotated, role-rotated additive pool of per-token codes. The three laws and the pillar:

- **LAW 1 — CAPACITY / SIGNAL-TO-DILUTION (A1).** A single scalar capacity constant `C` collapses the entire recoverability-vs-length curve via the Frady superposition form `s(N) = √(C/(N−1))`; no per-fold attenuation, no cliff.
- **LAW 2 — NO-BINDING AS A THEOREM (C1 + C2).** A linear functional of an additive pool of position-rotated lexical codes *cannot* compute the agent⊗patient conjunction without a binding tensor; relational blindness is a property of the pooled-multiset FORM, not a SONAR quirk — and it survives on natural, syntactically diverse data.
- **LAW 3 — CAPACITY-NOT-AGGREGATION, ROBUSTIFIED (A2).** At a fixed per-token bottleneck, uniform 1/N is near-optimal regardless of the projection type; a smarter recombiner cannot beat it because the diluted information is gone at the per-token level.
- **SAFETY PILLAR — PARASCOPE INHERITANCE (B1 + B2).** The live LLM-residual → predicted-ẑ → auditor chain inherits z's profile exactly: it reads WHAT-content (attenuated) but stays relationally blind, and its error does **not** compound with length.

**Method (unchanged, strict).** Every experiment carries a pre-registered prediction + falsifier; scored outcome reported as a finding (including the one FALSIFIED directional prediction in B2, which is itself informative). Six experiments across the three pillars, all pushed to `night7_json/` + HF.

## 21.1 — LAW 1: one constant collapses the recoverability-vs-length curve (A1, CONFIRMED; corrects CAPACITY_LEDGER) — *scale-claim CORRECTED in §21.7 (A4-lossy): the SHAPE is single/decoder-independent, but the constant C is link/metric-dependent (>100× spread), so read "single shape," not "single constant"*

**The law.** Read content off the *real* pooled z by greedy round-trip decode (NO SAE in the loop, unlike §20's c_pool→SAE path), bin 1621 items by encoder-token count (N from 6.8 to 199 mean tokens), and fit the Frady superposition recoverability `R(N; C, V_eff)` with a single capacity `C` and a fixed shape knob `V_eff = 60`. **Pre-registration:** a single `C` fit on the *endpoint bins only* predicts every interior decode-SBERT bin within ±0.02; FALSIFY if interior |pred−obs| > 0.03, or a cliff > 0.15 SBERT-drop per length-doubling, or λ significantly < 0.95 needed for a (C, λ) attenuation extension.

| mean N tok | decode-SBERT (obs) | single-C pred | abs err | token-Jaccard | content-recall | cos(z_recon, z) |
|---|---|---|---|---|---|---|
| 6.8 *(endpoint)* | 1.000 | 0.9961 | 0.0039 | 0.967 | 1.000 | 0.998 |
| 10.8 | 0.992 | 0.9961 | 0.0039 | 0.929 | 0.959 | 0.993 |
| 14.8 | 0.983 | 0.9961 | 0.0129 | 0.920 | 0.954 | 0.987 |
| 21.3 | 0.985 | 0.9961 | 0.0110 | 0.910 | 0.951 | 0.987 |
| 29.8 | 0.986 | 0.9958 | 0.0098 | 0.925 | 0.946 | 0.983 |
| 40.6 | 0.979 | 0.9922 | 0.0131 | 0.904 | 0.923 | 0.987 |
| 55.1 | 0.978 | 0.9728 | 0.0055 | 0.871 | 0.906 | 0.983 |
| 75.5 | 0.907 | 0.9152 | 0.0084 | 0.714 | 0.779 | 0.896 |
| 111.6 | 0.788 | 0.7791 | 0.0086 | 0.510 | 0.547 | 0.745 |
| 152.2 | 0.644 | 0.6398 | 0.0045 | 0.353 | 0.354 | 0.572 |
| 198.7 *(endpoint)* | 0.520 | 0.5203 | 0.0000 | 0.264 | 0.246 | 0.527 |

**One constant collapses the curve. C ≈ 1107 (V_eff = 60), fit on the shortest+longest bins ONLY, predicts every interior bin to within 0.013** (interior mean abs err 0.0086; pre-registered ≤0.02 met, never >0.03 — **single-C NOT falsified**). The apparent steep drop from ~0.98 (N≈55) to ~0.52 (N≈199) is **NOT a cliff or phase change**: max per-doubling drop attributable to a regime break = **0.000**. It is exactly the superposition curve's natural knee once 1/(N−1) dilution crosses the reader's noise floor.

**The (C, λ) attenuation model is REJECTED.** A 2-parameter per-fold-attenuation extension buys nothing: λ* = **1.01** (≥0.95, not below), F = 0.209, **p ≈ 0.66** — no compounding attenuation is needed. The pool is honestly additive; each token's contribution dilutes as plain 1/N, not faster.

**It is the EMBEDDING, not the SBERT metric (decoder-free).** Token-id Jaccard and SONAR-space cos(z_recon, z) — neither routed through SBERT — trace the SAME shape: normalized-curve cross-correlation min **r = 0.983** (decode-SBERT~Jaccard 0.983, ~z_recon_cos 0.992). Their absolute C differs only by link-scaling (token-Jaccard saturates at a harsher lexical-identity ceiling, C≈547 vs ≈1107), a ceiling difference not a shape difference; the shared shape is the decoder-free test (the strict ±20% C-agreement gate was not met *only* because of that Jaccard ceiling, and the 40% decoder-artifact falsifier was not hit).

**It FORECASTS, and corrects the prior ledger.** Endpoint-only single-C predicted N = 60/80/120 tok at **0.96/0.90/0.75**, vs observed 0.978/0.907/0.788 — **MAE 0.021, Brier 0.0006**. The prior `CAPACITY_LEDGER C≈350` badly under-forecast (0.74/0.66/0.55): the **real SONAR pooled embedding holds ~3× more content before diluting**. The old C≈350 came from a "folding" readout test that measured the **composer/fold READER's** much smaller effective capacity (C~16) under a recall link — that was the reader's C, not the embedding's. *[A1 — Law 1 CONFIRMED; corrects CAPACITY_LEDGER C350→C≈1107]*

## 21.2 — LAW 2: no-binding as a theorem, universal across encoders (C1) and on natural syntax (C2) — *UPGRADED in §21.7 (A5-thematic): no binding even under a validated bilinear/MLP readout, so it is not a linear-probe artifact* — *RESCOPED in §21.8 (NEW-1): it is an ENCODER/TRAINING property, not a mean-pooling artifact (CLS/last-token also fail); instrument-validated on the genitive (NEW-2)*

**State the theorem.** `z` is a linear functional of an **additive pool of position-rotated lexical codes**. Computing the thematic assignment *agent⊗patient* requires a conjunction (a binding tensor `T_role` that ties WHO to WHICH-ROLE); a sum of independent per-token codes `Σ R_pos^{p_i} c_i` carries no such conjunction term. So a mean-pooled embedding **cannot** represent abstract role binding by construction — relational blindness is a property of the FORM, not of SONAR. (Honest scope: a low-rank binding term COULD in principle be written into a token's *pre-pool* c_i and survive pooling; the theorem only forbids *constructing* binding from an unbound additive pool. The empirical question is whether any encoder writes one. C1/C2 answer: empirically, none does.)

**(a) C1 — cross-encoder generality.** On active/passive minimal pairs (content bag {A, B, V} held constant; entity identity + surface position balanced across the role label by role-swap), train an agent/patient probe on ACTIVE and transfer to PASSIVE. **Pre-registration falsifier:** ANY encoder with thematic cross-construction AUC > 0.7 ⇒ no-binding is a SONAR contingency, not a theorem (downgrade §19/§20).

| encoder | thematic cross-construction AUC | content cross-paraphrase AUC | length-bin weight cosine |
|---|---|---|---|
| **SONAR** | **0.060** | 1.0 | 0.314 |
| all-mpnet-base-v2 | 0.426 | 1.0 | 0.181 |
| LaBSE | 0.277 | 1.0 | 0.379 |
| gte-large | 0.423 | 1.0 | 0.261 |
| e5-large-v2 | 0.538 | 1.0 | 0.236 |

**None reaches the 0.7 falsifier** (max 0.538, e5-large; all five "PARTIAL/in-between" — none cleanly satisfies the strict ≤0.55 theorem prediction either, but all fail to bind). **Content cross-paraphrase AUC is a universal 1.0** for all five. So **relational blindness is broadly shared across mean-pooled encoders — a property of the additive pooled-multiset FORM, not a SONAR quirk.** The boundary is *soft* (larger/contrastive encoders trend slightly higher without crossing binding), so the exact 0.015–0.06 figure is SONAR-specific while the qualitative law holds everywhere. **The −0.93 length sign-flip (§20.2) is the SONAR-specific extreme** — the other four encoders sit at length-bin weight cosine 0.18–0.38 (no anti-correlated positional leak that sharp). *(Caveat: the named `tae_w30` driver was absent on-box; the agent wrote a validated self-contained driver.)*

**(b) C2 — natural data, not a templated artifact.** Mine natural sentences from the SONAR v2 corpus, **spaCy full-dependency-parse** them, and extract (agent, patient, verb) triples across **five constructions** — active, passive, relative-clause, ditransitive, object-fronting (1270 sentences; only 53 have any pronoun entity, so not a single template family). Transfer the role probe across constructions. **Falsifier:** natural cross-construction AUC(z) > 0.70 ⇒ z binds roles in natural syntax (overturns the headline).

| readout | within-active CV | cross-construction (pooled non-active) | passive (cleanest position-flip) |
|---|---|---|---|
| **z** | 0.655 | **0.575** | **0.470** |
| z_bag (order-free control) | 0.690 | 0.566 | 0.471 |
| surface (first-mention) | 1.000 | 0.480 *(INVERTS)* | 0.0001 |

**No-binding survives.** Thematic cross-construction AUC(z) = **0.575 pooled, 0.470 on passive** — never reaches 0.70 (**falsifier not triggered**). Critically **z ≈ z_bag** (0.575 vs 0.566; z_bag ≥ z within 4/5 constructions), so the residual role signal is content/position correlation, not composition. And the **surface baseline INVERTS** cross-construction (passive 0.0001, fronted 0.000) — within any one construction surface position codes role near-perfectly (within-construction surface AUC 0.90–1.00), but the active-trained surface probe flips below chance exactly where the agent moves after the patient (passive, fronting), so pooled surface collapses to 0.48 = **pure position, no abstract role**. **The templated no-binding result is NOT an artifact**; it typologically extends across passive, relative-clause, ditransitive, and object-fronting syntax. *[C1 + C2 — Law 2 CONFIRMED; no-binding is encoder-general and survives natural syntax]*

## 21.3 — LAW 3: capacity-not-aggregation, robustified across projection types (A2, CONFIRMED)

**The robustness question.** §20.3 showed that at a fixed per-token bottleneck `b_i = P h_i`, a learned attention-pool ties uniform 1/N and DeepSets is worse — but under a *single random* projection P. Threat-4: was that a random-projection artifact? A2 sweeps the projection type: **learned-P** (a trainable `Linear(1024→m)` optimized end-to-end jointly with each pooling head — the strongest case for a smart reader+pool), **PCA-P** (information-preserving top-m singular vectors), and **3 random seeds** {123, 7, 2024}. **Falsifier:** under learned-P or PCA-P, some recombiner beats uniform by dSBERT > 0.03 at long/vlong AND its attention weights correlate with content-norm (Spearman > 0.2) ⇒ aggregation matters once P preserves structure (would revise Law 3, report loudly).

| projection family (m=128) | attn − uniform FVU gap (range over bins) | DeepSets vs uniform | Spearman(a, c_i_norm) |
|---|---|---|---|
| **learned-P** (end-to-end) | [−0.0039, −0.0002] *(attn marginally WORSE)* | worse every bin | **−0.58** |
| **PCA-P** (m=128) | [−0.0016, +0.0015] | worse every bin (−0.10..−0.15) | −0.24 |
| PCA-P (m=64) | [−0.014, +0.0002] | worse every bin | −0.09 |
| random seed-123 | [−0.010, −0.005] | (night-6 baseline) | −0.08 |
| random seed-7 | [−0.004, −0.001] | — | −0.32 |
| random seed-2024 | [−0.010, −0.003] | — | −0.48 |

**Capacity-not-aggregation is robust to the projection.** Attention **ties uniform within ΔFVU ≤ 0.02 at every length under every projection family** (max abs gap over all bins/families = **0.0143**, PCA-m64 vlong, where attn is *worse* — none is a WIN for attention). **Learned attention weights NEVER favor content:** Spearman(a, c_i_norm) is **negative everywhere** (−0.58 learned-m128, −0.24 PCA-m128, −0.08..−0.48 across seeds); the only consistent tilt is toward late position (Spearman vs position +0.10..+0.66), which buys no FVU. The **falsifier is not triggered** — it required dSBERT > 0.03 AND Spearman > 0.2, and Spearman is negative by a wide margin in both structure-preserving families. The dilution signature itself survives the better projections (uniform FVU still rises ~0.32 short → ~0.92 vlong even under learned-P). **The night-6 single random P was a representative draw, not a lucky one.** Length is a per-token CAPACITY axis; uniform 1/N is near-optimal regardless of how tokens are aggregated — the diluted information is not recoverable by reweighting, only by per-token ACCESS. *[A2 — Law 3 CONFIRMED; night-6 random-P was representative]*

## 21.4 — THE SAFETY PILLAR: the parascope inherits z's profile, and its error does not compound with length (B1 + B2)

The deployed safety chain is LLM-residual → predicted-ẑ (parascope probe) → content/role auditor. Does ẑ inherit z's profile — readable content, blind structure — and does the prediction error add a NEW or LENGTH-GROWING blindspot?

**B1 — inheritance of the content/structure split.** The parascope (`llama-3b_layer27_maghead_mlp` champion) produces ẑ matching its checkpoint card (cos(ẑ, z) 0.430 vs card 0.426; rank1 0.647 vs 0.624). Then:

| channel | real-z AUC | predicted-ẑ AUC | inheritance |
|---|---|---|---|
| content: number-present | 0.780 | 0.750 | ratio 0.96 |
| content: entity-rich | 0.832 | 0.731 | ratio 0.88 |
| content: long-claim | 0.977 | 0.845 | ratio 0.87 |
| **content (mean)** | **~0.83** | **~0.78** | ratio 0.87–0.96 |
| relational: in-voice (active) | 0.983 | 0.746 | degraded but present |
| **relational: cross-voice (passive transfer)** | **0.019** | **0.515** | both ≈ blind |

**The chain inherits the signature.** Predicted-ẑ **reads content** (auditor mean AUC ~0.78, attenuated from real-z ~0.83 — the B-falsifier <0.7 not hit on any attribute) but is **relationally blind**: real-z cross-voice thematic AUC 0.019 (a passive paraphrase flips the role auditor below chance, the W6/W21 attack) → ẑ cross-voice **0.515 = chance**. The parascope does **NOT add binding** the embedding lacks (C-falsifier >0.7 not hit), and does **NOT add a new blindspot** — it **uniformly attenuates** the readable channels (content and in-voice role both drop ~0.1–0.25 AUC) while leaving the relational blindspot exactly as blind. **A passive-paraphrase-foolable monitor stays foolable on ẑ.** *(Caveat: STEP-C cos(ẑ, z) is low ~0.21 because the synthetic short sentences are off the parascope's natural-paragraph training distribution; in-voice AUC 0.75 confirms ẑ still carries content. Owed: natural-distribution role re-test.)*

**B2 — does the prediction error compound with length-dilution?** Test on natural llama-3b-residual paragraphs (addressing B1's short-synthetic caveat), binned 6.8→156 mean tokens. **Pre-registered prediction:** ẑ effective capacity C′ < real-z C, with the gap GROWING modestly with N (mild compounding). Falsifier-A: gap constant; Falsifier-B: ẑ curve flat/noise.

| mean N tok | real-z decode-SBERT | ẑ decode-SBERT | SBERT gap |
|---|---|---|---|
| 6.8 | 0.999 | 0.327 | 0.672 |
| 28.9 | 0.986 | 0.324 | 0.663 |
| 61.4 | 0.973 | 0.412 | 0.561 |
| 110 | 0.905 | 0.443 | 0.462 |
| 156 | 0.798 | 0.372 | 0.426 |

**The "mild compounding" prediction is FALSIFIED — and that is the finding.** ẑ decode-SBERT is an essentially **flat floor across all N** (0.32–0.44, range 0.12, non-monotone), pinned near the decoder's no-information floor; cos(ẑ, z) ~0.39–0.49 is length-invariant. Real-z obeys the A1 capacity law (effective **C = 905**, near A1's ~1100). The fitted ẑ effective capacity is **C′ ≈ 20 — a ~45× collapse** at the decode level. Consequently the real-z − ẑ gap **SHRINKS monotonically** with N (0.672→0.426, slope −0.0017/tok, r²=0.92) — **not** because the parascope improves on long spans, but because real-z decays DOWN toward the parascope's already-saturated error floor (closest to **Falsifier-B**: prediction error dominates, span-length largely irrelevant to what ẑ recovers; NOT the constant-offset Falsifier-A; NOT compounding). **The parascope's content deficit is constant and length-INDEPENDENT — there is no long-span blind-zone.** *(Owed: this is raw decode-from-ẑ; the deployed B1 monitor reads content via an auditor on ẑ at AUC ~0.78, far above this decode floor, so the auditor-AUC-vs-N curve — not run — is the tighter deployment proxy and likely shows a gentler, still-non-compounding profile.)*

**Deployment upshot.** Trust the parascope monitor for **WHAT-content** (with a constant capacity penalty, ~0.1 AUC attenuation, no length-specific worsening); **never** for **WHO-DID-WHAT**, at any length (ẑ cross-voice = chance, inherited exactly from z). *[B1 + B2 — safety pillar; B2's directional prediction FALSIFIED = no length-dependent blind-zone]*

## 21.5 — Honest limitations and what is owed

- **A1** sits *above* §20's SAE-mediated anchors (c_pool→SBERT 0.981/0.975/.../0.811) because it is the **direct-z greedy round-trip with NO SAE bottleneck** — same flat-then-knee shape, higher absolute values. `V_eff = 60` is a fixed shape knob: the C *value* scales with V_eff (712 at V_eff=20, 1548 at V_eff=200) but the single-C collapse and λ≈1 conclusion are V_eff-robust. Shortest bin is N≈6.8 tok (min 3 words), so the §20 N≈2.5 anchor is not covered.
- **C1** could not load the named `tae_w30` driver; the agent wrote a validated self-contained driver instead (same minimal-pair design). The cross-encoder binding boundary is *soft* (e5-large 0.538), so the law is qualitative-universal but the exact figure is SONAR-specific.
- **B1/B2** ran partly on **off-distribution short synthetic sentences** (STEP-C cos(ẑ, z) ~0.21), mitigated in B2 by re-running on natural paragraphs. The deployed-auditor-AUC-vs-N curve (tighter than raw decode-from-ẑ) is **owed**. B1 content labels are heuristic regex/length proxies and the cross-paraphrase content-robustness variant was not run.
- **A2** learned-DeepSets and learned-m64 arms ran fewer epochs (15 vs 40 confirmatory) than the decisive learned-m128/PCA arms; the verdict rests on the two hardest fully-run families (PCA + learned-m128), which both reproduce uniform's tie, so the conclusion does not depend on the in-flight arms.
- General: no genuine perpetrator-vs-victim *natural* harm text; the agent/patient construction is the operational stand-in for directional harm (consistent with §19.8 W21).

## 21.6 — Synthesis: a predictive theory of mean-pooled sentence embeddings

The three laws are now a **predictive, falsifiable theory of the `z = (1/N) Σ R_pos^{p}[R_role E(w) + c_i]` form**, not a list of regularities. **Law 1** gives capacity a *number*: a single C ≈ 1100 collapses the whole recoverability curve via plain 1/N superposition (no cliff, no compounding, decoder-independent), forecasts unseen lengths at MAE 0.021, and corrects the old ledger 3× upward. **Law 2** explains *why structure is absent*: a linear functional of an additive lexical pool cannot compute the agent⊗patient conjunction without a binding tensor — so relational blindness is a theorem of the form, shared across five mean-pooled encoders and surviving natural diverse syntax (z ≈ z_bag, surface position INVERTS cross-construction). **Law 3** says *the loss is irreversible at the reader*: at fixed per-token capacity, uniform 1/N is near-optimal under learned/PCA/random projections alike — you cannot reweight your way back to diluted content. Jointly, for **interpretability** they say z is a *graceful-capacity content store with a structural hole* whose size is now quantified; for **safety** they say the parascope monitor inherits exactly this — readable-content (attenuated, length-flat) and structurally-blind (chance, at all lengths). **A mean-pooled sentence embedding is a capacity-bounded additive bag of position-tagged lexical codes: it loses content gracefully and quantifiably as 1/N (Law 1), can never bind who-did-what-to-whom without an external parse (Law 2), and cannot be repaired by a cleverer pooler (Law 3) — and the deployed parascope safety monitor inherits all three, so trust it for WHAT, never for WHO, at any length.**

## 21.7 — Wave-3/4 stress-tests: one upgrade, one correction, three confirmations (A3, A4, A5-thematic, B1-noise, B3)

A critique pass flagged that several §21 claims rested on a single instrument (linear probes) or a saturated metric. Five adversarial follow-ups, all pre-registered:

**LAW 2 — UPGRADED (A5-thematic).** Every no-binding result (C1/C2/W6/W21/W31/W42) used a *linear* probe — yet the program had already shown SONAR's encoder writes a *bilinear* pairwise term for **order** (`tae_bilinear_order_v2`, order τ_bilinear ≫ linear, instrument-gate **passed** here: order bilinear τ 0.085 > linear 0.003, MLP τ 0.367 matches anchor). So "did the encoder write a bilinear agent⊗patient term?" was untested. It did **not**: the validated bilinear comparator recovers **identical-to-linear** cross-construction thematic AUC (pooled 0.608 = 0.608) at every construction. The one MLP arm crossing 0.70 (relcl 0.83) was a confound, killed three ways — a **z_bag-MLP** (no sentence composition) reproduces it (0.80), a **position-only-MLP** (no z at all) reproduces/exceeds it (0.83–0.97), and all readouts **invert below chance** on passive/fronting (sign pattern +/−/−/+ that no abstract binder can produce); shuffled-role nulls sit at 0.498. **Verdict: no-binding is not a linear-probe artifact — it holds under the strongest readout the program has shown can recover a true bilinear comparator. Law 2 is upgraded from "no linearly-readable binding" to "no binding even under bilinear/nonlinear readout."** The safety corollary ("trust for WHAT, never WHO") is hardened, not weakened.

**LAW 1 — CORRECTED to "single shape, link-dependent scale" (A4-lossy-neutral).** §21.1's "single constant C ≈ 1107" is **too strong as a *number***. (i) Per-bin implied_C is not flat (rises ~7× with N at V_eff=60; freeing V_eff does not flatten it — though this is largely a *ceiling artifact*, since the natural gist curve sits at decode-roundtrip saturation p≈0.98 where (C,V_eff) is non-identifiable). (ii) Absolute C is **strongly link/metric-dependent**: free-V_eff C = 130 (decode-SBERT) / 26 (token-Jaccard) / 3186 (cos-to-z), a >100× spread that fails the pre-registered ±20% cross-metric gate. The **shape** (flat-then-knee, monotone 1/N dilution, no cliff, no compounding) is robust and decoder-independent (A1's r ≥ 0.983 across metric families stands); the **scalar C is a link-scaled readout constant, not a universal embedding capacity**. Restated: *Law 1 is a single-shape signal-to-dilution law; its scale constant is reader/metric-relative.* Additionally, the predicted per-content-type capacity ordering (C_entity ≳ C_number > C_gist) was tested under a **type-neutral** Gaussian-noise readout (avoiding the SAE-reader confound) and **inverts**: gist is the *most* noise-robust (σ@0.7: gist 1.25 > entity 1.11 > number 1.01) — the diffuse field survives type-agnostic noise better than the sparse codes, so the earlier sparse>diffuse intuition was a decode-sparsity artifact, not an embedding allocation. **Relational stays at chance under every σ (C_relational ≈ 0) — no-binding holds in the capacity frame too.**

**SAFETY — CONFIRMED with cleaner framing (B1-noise-ceiling, B3).** B1's "the parascope inherits z's content-not-role split" had a confound: at cos(ẑ,z) ≈ 0.43 maybe *everything* fine-grained collapses, so cross-voice chance might be parascope-weakness, not inherited blindness. Ablation ceiling: noised **real** z calibrated to cos 0.43 gives content AUC 0.81 (brackets the parascope's 0.78) and cross-voice role that rises 0.019→0.171 with noise but **never exceeds chance** — so the content-not-role split is a **generic consequence of any 0.43-fidelity read of a role-blind embedding**, and since real-z fails cross-voice even at cos 1.0 (0.019), *the blindspot is z's regardless; the parascope step adds nothing.* And B3 (the deployed-auditor-vs-length curve owed from B2) confirms **no length-dependent blind-zone**: the trained content-auditor reads ẑ at AUC 0.74–0.93 across all N, declining only in parallel with real-z, with a constant gap — the raw-decode floor (B2, C′≈20) does *not* propagate to the auditor, whose coarse present/absent direction survives ẑ's noise.

**THREAT 5 — half-closed (A3).** The holistic (word-irreducible) half of c_pool shows **no length-varying direction** detectable from the readable basis, and the c_pool ceiling stays length-robust (0.98→0.85) while *reading-from-words* degrades faster (readable 0.81→0.43; residual share grows 17%→50%, corr with readable-shortfall 0.89) — so the refinement is "word/property *reconstructions* of meaning are length-biased; content *auditors* on c_pool are not." The direction-absence claim is now **fully earned (A3-direct-N)**: regressing token-count N *directly* on the holistic residual gives R²(N|residual) = **−0.030** (chance) vs R²(N|c_pool) = **0.975**, with no cross-bin-stable residual→N direction and norm-partialled R² = −0.002 — so the holistic half carries **no length-dependent directional structure**, and "length is pure capacity, not structure" holds for the word-irreducible half too. **Threat 5 is closed.**

**SAFETY MECHANISM — the decode-floor/auditor-0.78 paradox resolved (B4).** Why does a content-auditor read ẑ at AUC 0.78 when raw decode from ẑ is at a no-information floor (B2)? Because the parascope preserves z's **coarse high-variance geometry** (cos(ẑ,z) restricted to the top-10 PCs is 0.72, decaying to ~0.33 on the low-variance tail) **and**, more sharply, a **content-presence-defined subspace** — projection along the trained entity/number/claim auditor directions is preserved at r² 0.47, **2.3× random directions and ~10× what z's raw PC-variance alone would predict**, so the preserved subspace is *content-defined*, not merely z's top variance axes. A coarse present/absent auditor needs only that preserved content structure (→ 0.78); exact SONAR decode needs the fine per-token lexical tail, which the parascope discards (→ floor). This is the geometric reason the deployed monitor is content-readable yet decode-poor.

## 21.8 — Cross-cutting: no-binding is an encoder property (not a pooling artifact), the instrument is validated, and the capacity law is causal (NEW-1, NEW-2, NEW-3)

Three experiments probing assumptions *underneath* the three laws — each could have rescoped or overturned a headline.

**NEW-1 — Law 2 is an ENCODER/TRAINING property, not a consequence of mean-pooling.** Every no-binding result was on mean-pooled embeddings, and the whole theory frame `z=(1/N)Σ R_pos^{p}[…]` is a *pooling* model — so the obvious confound is "maybe pooling destroys the binding the encoder *did* write." Test: isolate the pooling **operator** on a fixed backbone (all-mpnet-base-v2, token-level `last_hidden_state`), building three sentence vectors from the *same* model + *same* 2000 spaCy-parsed natural sentences — MEAN-pool, [CLS]/first-token, LAST-token (EOS) — plus a CLS-trained retrieval encoder (bge-base) as a bonus, and run the C2 cross-construction thematic test (linear + the fully-powered MLP) on each. **Result — the pre-registered "pooling theorem" prediction was FALSIFIED:** every operator stays at chance (mean 0.551, CLS 0.552, last-token 0.560, bge-CLS 0.603 — none clears the 0.61 bar, let alone 0.70), while within any single construction the surface-position probe reads role perfectly (AUC→1.0) and inverts on passive/fronted, identically across operators. The non-pooled vectors are non-degenerate (cos to mean 0.87–0.88, cos CLS-to-last 0.91), so this is a real contrast, not a CLS≈mean tautology. **Giving a probe a non-pooled, position-distinct sentence vector buys essentially nothing: abstract agent/patient binding is simply not written into sentence-embedding representations, regardless of the read-out operator.** Rescoping: the `z=(1/N)Σ(…)` form is a convenient *parameterization*, but Law 2 should be stated as a property of **sentence-embedding encoders/training objectives**, not as a consequence of the mean-pool aggregation step. (Tested on the BERT/MPNet masked-encoder family + one retrieval encoder; SONAR's own no-binding was already established in C1/C2 — so the phenomenon is operator-invariant across a second encoder family.)

**NEW-2 — the bilinear no-binding instrument is validated on a known positive (the genitive), closing §21.7's weak-instrument loophole.** A5-thematic's residual worry (Threat A) was that its bilinear order-gate hit only τ≈0.085 (vs the cited tuned ~0.36), so a bilinear *null* for agent/patient might be under-powered rather than real. Fix: re-train the bilinear comparator with a pairwise-ranking hinge loss (which directly targets the τ metric) → full-power **order-gate τ = 0.79** (null 0.04), now *above* the cited anchor, proving the instrument family is fully powered *this run*. Then point the same instrument at a relation we KNOW is bound — the **genitive** possessor (W19's lone abstract binder) — versus agent/patient: **genitive cross-form head-AUC = 1.0** (saxon's↔of, role-balanced; collapses to 0.93 under clitic-neutralization, exactly the W19 signature) vs **agent/patient = 0.007** (anti-correlated — it transfers first-mention surface position, the §20 sign-flip), a **gap of 0.99**. **So the same fully-powered bilinear instrument recovers the one binder that exists and *only* that one — A5's agent/patient null is a genuine representational absence, not weak-instrument. Threat A is closed; Law 2's bilinear nulls are interpretable.**

**NEW-3 — Law 1 is causally generative, not merely descriptive (and characterizes an injection attack).** Law 1 had only ever been *fit* to natural recoverability-vs-N. Causal test: since SONAR z is literally the masked-mean of token states, inject one extra concept token additively — `z' = z·N/(N+1) + δ/(N+1)`, ‖δ‖ = one token's norm, δ = diff-of-means direction for money/location/time — and ask whether the injected concept is recoverable at the rate Law 1 predicts for N+1. **Pre-registration CONFIRMED:** injected-concept recovery AUC tracks the Law-1 N+1 curve with no cliff (0.99/0.96/0.86/0.74 at N≈10/14/23/36 vs predicted 0.998/0.98/0.93/0.84), mean shortfall −0.047 — i.e. written-in content is recovered a hair *harder* than a real extra token (it competes against rather than rides the host's correlational structure), which strengthens additivity and rules out a spare-capacity loophole. Magnitude sweep: 0.5×→N+2 budget, 1×→N+1 budget, 2×/4×→saturate to AUC 1.0. **Law 1 is therefore a generative capacity budget, not just a curve fit.** Safety corollary: an attacker *can* hide a concept in a capacity-saturated z, but only within the same 1/N budget a real token occupies — at unit magnitude a long-N z dilutes the smuggled concept toward the auditor floor (AUC 0.74 at N≈36), so a **Law-1-calibrated, token-count-aware monitor is not fooled by budget-respecting injection**; evasion requires over-magnitude writes (m≥2) that a simple norm/energy check on z flags. (Caveat: the very-short bin N<8 didn't fill in this corpus, so the low-N no-cliff is extrapolated there.) **[CORRECTED by NEW-6 below: the norm/energy-check half of this is quantitatively wrong; the deployed content auditor is the real defense.]**

**NEW-6 — the injection-attack mitigation, run against the deployed auditor: the norm-flag is FALSE, but the safety conclusion STRENGTHENS.** NEW-3's mitigation ("over-magnitude m≥2 writes are caught by a norm/energy check") was a geometric assertion never run against a trained monitor. Running it (with the *correct* one-token budget τ = per-token **state** norm ≈0.79, not the pooled ‖z‖ ≈0.24 — token states partially cancel in the mean, a 3.3× factor a first pass got wrong): **(a) the norm gate FAILS** — it does not separate m=2 from clean z and is blind to even m=4 once N≥23 (`min_m_to_flag` = {N≈10:4, N≈14:None, N≈23:None, N≈36:None}); it is a length-fragile gross-overwrite backstop, not an m≥2 tripwire (pre-registered falsifier (a) fired). **(b) But the deployed content auditor catches everything** — it reads the smuggled concept at AUC ≥0.88 at *every* magnitude (m=0.5→4) and *every* N including the budget-respecting m=1 write, with zero false-positives on clean z (held-out present-vs-absent AUC 1.0, clean FPR 0). So the "a content monitor is fooled by budget-respecting injection" worry is **over-pessimistic**, and detection never needed the norm gate. **Corrected claim: the trained content auditor reads injected concepts at any tested magnitude and length; the norm/energy gate is only a weak short-N gross-overwrite check, not the primary defense** — the deployment bottom line is *stronger* than first stated (the auditor is the robust defense), via a different mechanism than asserted. (Caveat: the named on-box injection/auditor scripts were absent; the runner rebuilt them self-contained on the real SONAR encode/decode + prepool masked-mean + the `tae_zauditor` diff-of-means recipe, with held-out concept AUC 1.0 validating the auditor.)

**Net.** Law 2 is broadened (an encoder/training property, instrument-validated) and Law 1 is deepened (causal, with a concrete injection-attack threat model and its mitigation). Neither headline overturned; both are now harder to attack.

**Secondary-claim refinements (Threat-B, Threat-C backstop checks).** Two adversarial checks deflated *secondary* claims without touching a headline. (i) **Threat-B** — A4-lossy's "gist is the most noise-robust content type" survives in *direction* (start-matched, gist's local decay slope at recov=0.85 is shallowest, −0.44 vs entity −0.68 / number −0.60) but its *magnitude* was ceiling-inflated: matched off a common 0.85 baseline the gist/number advantage falls from ~1.24× (raw σ@0.7) to 1.15–1.23×, failing the pre-registered ≥1.3× gate — so the per-type robustness *ordering* is real but *weak* (~1.15×), not the ~2× the raw σ=1.5 gap suggested. (ii) **Threat-C** — B4's "content-*defined* preserved subspace" is more honestly *coarseness*-defined: the strong "any z-trained direction is preserved" null is refuted (a shuffled-label direction lands at r² 0.23, near the random floor 0.17), but fine/rare-lexical directions are partially preserved too (0.35–0.38, ~0.8× the content-auditor 0.47), so the boundary is a smooth coarse→fine gradient with coarse content as the best-preserved *end*, not a content/non-content dichotomy. Both refinements leave the load-bearing claims (Law 1's single-shape + relational≈0; the parascope's content-readable/role-blind split) intact; they trim two adjectives.

## 21.9 — The safety pillar's honest boundary: a z-monitor detects content presence, not tampering (NEW-7)

NEW-6 left one deployment-load-bearing question open: it proved the deployed content auditor *reads* an injected concept (AUC ≥0.88), but only ever compared injected-z′ against concept-**absent** clean hosts — so it established "the concept is present," not "the concept was **planted** vs genuinely present." NEW-7 closes it. For money/location/time, build three length-matched sets: NEGATIVE (clean z, concept absent), GENUINE (clean z of natural sentences mined as truly about the concept, ~480/concept via currency/temporal/toponym lexicons), and SMUGGLED (concept-absent hosts with the concept injected at budget m=1, τ = per-token-state norm 0.725). **Detection reproduces NEW-6** — the concept is read equally whether genuine or smuggled (auditor AUC genuine-vs-negative 0.988, smuggled-vs-negative 0.942). **The new question — can any z-readout separate SMUGGLED from GENUINE ("planted vs real")?** A strong discriminator (logistic + MLP) scores AUC 0.926 — which *looks* like a tamper-detector — **but the decisive control kills it**: the same discriminator separates two *natural* sets (GENUINE vs NEGATIVE) at 0.985, so the smuggled-vs-genuine separation is just **reading host topic** (a shifted neutral host is not a real money/time/place sentence), with **negative excess (−0.059) over that content baseline**. The geometric injection artifact a true tamper-detector would need is weak (‖z‖-shell AUC 0.762, kNN-distance-to-natural-manifold 0.658) and a trivial re-normalization adversary (project z′ onto the natural concept ‖z‖ shell) leaves the discriminator at 0.976 — it never depended on the artifact; it was content all along. (Pre-registration: the literal disc≤0.65 bar was not met and the disc>0.75 falsifier literally fired, but both are confounded by host topic — the pre-registered decision rule `artifact_EXCEEDS_content_baseline` is **False**, so the honest reading is "no tamper signal beyond content.")

**The honest limit: the injection-detection safety claim is a content/PRESENCE detector, not a TAMPER detector.** A z-monitor can flag "money is here" but cannot flag "money was inserted" beyond noticing the host topic differs — which an attacker neutralizes by smuggling into on-topic hosts. **Tamper-detection (distinguishing planted from genuine content) cannot be built on z-content readout alone; it needs a signal outside z's content geometry (provenance, the LLM residual that wrote z, or an off-manifold check the re-norm adversary can't dodge).** This bounds §21.8's NEW-6 conclusion precisely: the deployed auditor robustly catches *that* a concept is present at any magnitude/length, but injection *as such* is indistinguishable from genuine presence in z.

## 21.10 — The mitigation: parascope-consistency recovers the tamper-detection that z alone cannot (NEW-8)

§21.9 left a sharp prediction: tamper-detection "needs a signal *outside* z's content geometry — e.g. the LLM residual that wrote z." The parascope supplies exactly that — ẑ = maghead-probe(LLM residual) is the z *predicted from the LLM's residual stream*. The key asymmetry: an additive injection happens *downstream* of the LLM (at z), so for a tampered z′ the parascope ẑ still reflects what the LLM actually wrote (concept **absent**) — z′ and ẑ **disagree** on the concept; for a genuinely-concept-bearing host both read it and **agree**. NEW-8 tests this consistency signal on 2310 held-out real LLM-paragraph hosts (shards 24–26, parascope trained on 0–23; ẑ-quality sane: cos(ẑ,z)=0.403, content-auditor-on-ẑ AUC 0.93). Define discrepancy = (concept-reading of z) − (concept-reading of ẑ). **Result — the falsifier-clearing positive:** discrepancy(SMUGGLED)=0.037 ≫ discrepancy(GENUINE)=0.009 ≈ discrepancy(CLEAN)=0.003, giving a **discrepancy-based smuggled-vs-genuine AUC of 0.787** (money 0.713 / location 0.836 / time 0.811; prereg ≥0.75 met) versus the **z-content-only baseline of 0.159** — a **+0.628 AUC recovery** of exactly the planted-vs-genuine discrimination §21.9 proved impossible from z alone. Smuggled-vs-clean is cleaner still (0.857), best at short hosts (N≈12: 0.88–0.95, where the m=1 injection budget is largest) and holding 0.70–0.83 at long hosts. **Deployability caveat (pre-registered falsifier-b partially fires):** the genuine-host false-flag rate is mean 0.353, but that is dominated by the data-starved *money* concept (only 39 short genuine items, FPR 0.50); location (0.25) and time (0.31) are far better, and the clean-host specificity FPR is 0.23 — the false positives come from ẑ *under-reading* some genuine concepts (auditor-on-ẑ 0.89–0.97, not 1.0), an operating-point/training-data issue, not a signal failure.

**The deployment posture, made whole:** §21.9's hard limit ("a z-monitor detects content, not tampering") is real *for a z-only monitor* — but the residual route turns it into a usable detector. **Parascope-consistency (compare z against the residual-predicted ẑ) is a working additive-injection tamper-detector — AUC 0.79 where z-content readout gives 0.16 — gated by a genuine-host FPR (~0.23–0.35) that improves with per-concept threshold calibration and more genuine training data.** Tamper-detection *is* possible, but only with the out-of-content signal §21.9 named; it cannot be built on z's content geometry alone. **⚠ [CORRECTED in §21.11 (N10): the 0.79 "planted-vs-genuine" headline is host-topic-confounded — topic-matched it collapses to 0.51 (chance). The real, honestly-scoped result is *tamper-vs-clean* = 0.85 (same host, isolates the injection); planted-vs-genuine is NOT recovered, and an adaptive on-topic-host attacker evades. Read §21.11 before citing this paragraph.]**

**Adaptive-attacker robustness (NEW-9).** A proposed defense is worth only as much as it survives an attacker who knows about it. Threat model: the attacker controls z (post-SONAR) but not the host's LLM residual. Three evasion arms: **(1) magnitude** — sweeping the injection magnitude m, attack efficacy (auditor reads the smuggle) and detector AUC move in **lockstep** (corr +0.986: m=0.25→2 gives efficacy 0.67/0.79/0.91/0.98 and detector 0.53/0.63/0.79/0.93); the only m that drops the detector ≤0.6 (m=0.25) also drops efficacy below 0.7, so **no clean evasion window exists** — because the discrepancy and the concept's presence are the *same* δ-projection, an attacker cannot keep the concept readable while hiding it from the detector. **(2) host-selection** — directing the attack at the detector's softest stratum (bottom-quartile ẑ-cos ∧ longest N) lowers detector AUC only to **0.685** (still >0.6, ≈ the §21.10 long-N floor) and that stratum is just **3.8%** of hosts — partial, not full, evasion. **(3) structural** — ẑ = f(LLM residual), which the z-level attacker does not control, so it **cannot be forged** (recomputing ẑ after tampering z gives max-abs-diff 0.0); magnitude and host-selection are the *only* levers, and both fail or only partially succeed. The pre-registered falsifier (a clean window with efficacy ≥0.85 while detector ≤0.6) did **not** trigger anywhere. **The §21.10 tamper-detector is adaptively robust: defeated neither by magnitude tuning (structurally impossible — discrepancy *is* the concept) nor forging (ẑ is out of the attacker's control), and only softened on a 3.8% minority of long/low-fidelity hosts.** **⚠ [CORRECTED in §21.11 (N10): this "adaptively robust" conclusion is WRONG — NEW-9's host-selection arm tested the wrong axis (ẑ-*cos* fidelity, not residual-*concept-content*). The correct adaptive move — smuggle into hosts whose residual already reads concept-adjacent (top-quartile reading_ẑ) — DOES evade, dropping the detector to 0.57 over 25% of hosts. The magnitude-evasion and unforgeability findings stand; the "robust" headline does not.]**

---

## 21.11 — Correction: the tamper-detector is real only as *tamper-vs-clean*; planted-vs-genuine is a host-topic confound, and an adaptive attacker evades (N10)

A final adversarial audit caught the load-bearing flaw in §21.10's headline, and a pre-registered test (N10) confirmed it — fired the falsifier. The §21.10/§21.10-NEW-9 framing ("planted-vs-genuine recovered at 0.79, adaptively robust") **over-claims**; here is the honest scoping.

**The confound.** The discrepancy detector is `reading_z − reading_ẑ`. The injection is *budget-calibrated*, so the injected z′ reads the concept at ≈ the genuine level — `reading_z` alone separates smuggled-vs-genuine at only AUC 0.03/0.28/0.18 (money/location/time, ≈chance). The entire 0.79 separation therefore rides on `reading_ẑ` — "the genuine host's *residual* is about the concept; the smuggled (concept-absent) host's residual is not" — which separates the two host sets at AUC 0.919 on ẑ alone (R²=0.476 of the discrepancy). **That is the exact same host-topic axis NEW-7 proved is a confound for the z-only detector, merely measured on ẑ.** When the GENUINE and SMUGGLED host sets are matched on `reading_ẑ` (bin-matching and OLS-residualizing agree), the smuggled-vs-genuine discrepancy AUC **collapses 0.79 → 0.51 (chance)** — pre-registered falsifier (≤0.60) met.

**What is genuinely real — tamper-vs-clean = 0.853.** CLEAN and SMUGGLED are the *same* concept-absent host pool split in half, differing **only** by the injection (residual topic held fixed, `reading_ẑ`≈0.5 between them). There the detector cleanly isolates the tamper: **AUC 0.853**. So the honest claim is: *parascope-consistency detects that a given z was additively tampered, relative to its own untampered residual-prediction* — it does **not** distinguish a planted concept from a genuinely-present one in general, because genuine concepts ride a residual the detector legitimately (but un-controllably) reads.

**The adaptive attacker NEW-9 missed.** NEW-9's host-selection arm varied ẑ-*cos* (reconstruction *fidelity*) — the wrong axis. The correct adaptive move is to smuggle into hosts whose **residual already reads concept-adjacent** (top-quartile `reading_ẑ`): there `reading_ẑ` is high for the smuggled host too, so the discrepancy shrinks and the detector drops to **AUC 0.572 over 25% of hosts**. The attacker cannot *forge* ẑ (NEW-9 arm-3 stands), but they can *choose on-topic hosts* — a far larger evasion surface (25%, not 3.8%) than §21.10 claimed. The magnitude-evasion-fails and ẑ-unforgeable findings remain correct; the "adaptively robust" conclusion does not.

**Corrected safety bottom line.** §21.9's limit stands: a z-only monitor detects content presence, not tampering. §21.10's mitigation is **partial and honestly narrower than first stated**: parascope-consistency is a real *tamper-vs-untampered* detector (AUC 0.85, holding the host fixed) — useful where you can compare a z against its own residual-prediction — but it does **not** recover planted-vs-genuine discrimination (0.51 topic-matched), and an adaptive attacker who injects into on-topic hosts evades it. Tamper-detection from the residual signal is *real but bounded*, not the clean win §21.10/NEW-9 first reported. (This also re-frames NEW-8's genuine-host FPR 0.35: that false-flag rate was the same confound surfacing — genuine hosts legitimately read high on ẑ, so a discrepancy threshold is partly a `reading_ẑ` threshold.)

**Is the on-topic-host evasion fixable by a better detector? No — it is a deep bound (N11).** The natural rescue is a *calibrated* detector: ẑ is a noisy, attenuated predictor of z's concept content, so instead of the raw discrepancy `reading_z − reading_ẑ`, fit the expected relationship `reading_z = f(reading_ẑ)` on clean+genuine hosts (linear fit `0.80·reading_ẑ + 0.008`, R²=0.69; isotonic R²=0.72) and flag the *positive residual* — an injected host should read *more* concept than its residual predicts. On a properly `reading_ẑ`-matched on-topic stratum (match-check AUC 0.542 ≈ chance, confirming genuine/smuggled are residual-topic-equalized), it does **not** help: smuggled-vs-genuine AUC is **0.543 calibrated** (vs 0.538 raw), and even a learned 4-feature logistic (`reading_z, reading_ẑ, ‖z‖, reading_z·reading_ẑ`, 5-fold CV) reaches only **0.615** — both inside the pre-registered deep-bound band, and at a 5%-clean-FPR operating point the calibrated detector catches **zero** smuggled on-topic hosts. Calibrating out the ẑ-attenuation exposes **no excess concept-energy because there is none**: a budget-respecting δ/(N+1) injection into an already-on-topic host is information-theoretically buried under the host's natural concept variance plus ẑ's noise (cos(ẑ,z)=0.40). **So the on-topic-host evasion is a fundamental bound on *any* z-vs-ẑ consistency detector, not a tuning failure** — §21.10's mitigation is genuinely confined to off-topic / tamper-vs-clean hosts. The one consolation: under an honest common on-topic threshold only **5.9%** of budget-respecting smuggled hosts can even push their *residual* concept-reading high enough to enter the evadable stratum (most can't, because they don't control the residual), so the truly-evadable population is *smaller* than N10's nominal 25% — the attacker both needs an on-topic host *and* gives up most hosts to get one. (A genuinely different defense — off-manifold detection robust to the re-norm adversary, or provenance signing of z — remains owed-for-future; a z-vs-ẑ detector is provably not it.)

---

*Night-7 §21 closes here. Across the chapter and its cron-driven wave-3/4/5 + cycles 1–10 follow-ups, the three laws (capacity/dilution, no-binding, capacity-not-aggregation) and the parascope safety pillar were stated, tested, and **repeatedly self-corrected** — Law 1's scalar C is link-dependent (single shape, not single constant); the norm-flag mitigation is weak; no-binding is an encoder-not-pooling property (rescoped) and a real representational absence (genitive positive-control); Law 1 is causally generative; the safety pillar's limit (a z-monitor detects content, not tampering — §21.9) was turned into a residual-signal mitigation (§21.10) and then that mitigation was **honestly demoted by its own adversarial audit (§21.11): real as tamper-vs-clean (0.85) but NOT as planted-vs-genuine (0.51 topic-matched), with an adaptive on-topic-host attacker evading over 25% of hosts.** The chapter is complete — three laws + a safety pillar carrying its limit, a *bounded* fix, and the fix's honest scoping. The repeated catch-and-correct (six self-corrections across the chapter) is the point: the easiest person to trick is yourself, and the critique-then-pre-registered-test loop is what kept the headlines honest. Remaining threads are owed-for-future (a stronger residual-signal tamper-detector that survives the on-topic-host attacker; per-concept FPR calibration; a true-OOV fine-token probe for B4's adjective) and explicitly not filler-run.*