---
title: "Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging"
status: working draft
date: 2026-06-20
backbones: [MedDINOv3 ViT-B/16 (CT-3M), DINOv2-base]
---

# Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging

## Abstract

We study label-free token pruning for medical imaging built on frozen self-supervised vision transformers. Our central object is a **label-free lesion subspace**: a geometric region of a frozen ViT's patch-token feature space, estimated without any lesion labels, in which lesion tokens are locally rare and distinctive. Three findings organize the paper.

(1) **Where to look.** The lesion-localizable signal in frozen SSL ViTs lives in **mid-layer**, not final-layer, features: on lung CT, token-level lesion AUROC rises from 0.565 (final block) to **0.871** (block 3).

(2) **A label-free localizer that generalizes.** A simple density estimate over a held-out token bank localizes lesions without labels across anatomies (lung 0.87, pancreas 0.88, kidney 0.82 CT) and **across modalities and backbones**: 0.73 on breast ultrasound with DINOv2, where attention saliency collapses to chance. Pruning tokens by subspace membership improves small-lesion recall over attention-saliency pruning by 14–28 points on lung CT and breast ultrasound (with smaller gains on kidney CT), and admits a per-image **conformal retention certificate** (empirical coverage 0.978 ≥ nominal 0.90) and a **lesion-routed adaptive depth** that reduces FLOPs by 1.6× at 98% small-lesion sensitivity.

(3) **A negative result with a transferable mechanism.** We set out to gate pruning with a *coverage constraint*: a floor on the effective rank (RankMe / coding rate) of the lesion subspace spanned by retained tokens, controlled by an interpretable dual variable. This **fails**: at a matched budget of 0.25, the coverage-constrained pruner retains 0.22 of small lesions, versus 0.82 for plain membership ranking.
The mechanism generalizes past our method: **rank-based coverage objectives reward diverse subspace *spanning*, whereas rare small-region pathology requires *concentration* on a few high-membership tokens.** Effective-rank coverage is therefore structurally mismatched to rare-lesion retention, which is a warning for the increasingly common use of RankMe-flavored objectives in medical SSL.

## 1. Introduction

Token pruning makes vision transformers cheaper, but in medical imaging the failure mode that matters is dropping the pathology. A tiny lung nodule or microcalcification occupies a handful of patches; a pruner optimized for throughput or generic saliency can discard exactly those. We ask a narrower, label-free question: **can a frozen SSL backbone tell us, without any labels, which tokens carry diagnostic signal, well enough to prune around them, certify the result, and adapt compute?** Our answer is a *label-free lesion subspace* and the operations built on it.

We deliberately also report what did **not** work. Our original hypothesis was that pruning should be a *constrained optimization*: minimize tokens subject to a floor on lesion-subspace coverage, with an interpretable dual as the controller. That hypothesis is wrong, and wrong for an instructive reason that we make precise. We treat the negative result as a first-class result.

**Contributions.**

1. A mid-layer localization finding: lesion signal in frozen SSL ViTs is mid-layer, not final.
2. A label-free lesion subspace that localizes lesions across anatomy, modality, and backbone.
3. Subspace-membership pruning that beats saliency pruning on small-lesion miss-rate, with a conformal retention certificate and lesion-routed depth.
4. A negative result with a transferable mechanism: rank-based coverage objectives fail for rare-lesion retention.

## 2. Method

### 2.1 Setup

We work with a frozen backbone and its patch-token features `Z(x) = {z_1,...,z_n}`, `z_i ∈ R^d`.
For CT we use **MedDINOv3 ViT-B/16 (CT-3M)**; for ultrasound, **DINOv2-base** (modality-agnostic), establishing that the method is not backbone-specific. We extract **mid-layer** tokens (Sec. 4.1).

### 2.2 Label-free lesion subspace

We estimate, without labels, the region of feature space carrying diagnostic signal.

- **Construction A (density).** Lesions are rare, so lesion tokens lie in locally sparse regions. Estimate token density via k-NN distance to a held-out token bank; the lesion-membership score is the mean k-NN distance (low density ⇒ high score). The candidate subspace `L(x)` is spanned by the low-density tokens.
- **Construction B (residual).** Fit a low-rank normal-tissue subspace `U` by PCA on the bank; lesion-relevant tokens have high residual `‖(I-UU^T)z‖`.

Both are label-free. The held-out CT token bank holds 2.1M mid-layer tokens.

### 2.3 Membership pruning, certificate, routing (what ships)

- **Lesion-subspace membership pruning.** Retain the top-k tokens by membership score.
- **Conformal retention certificate.** With split conformal on a calibration set, emit per image a distribution-free lower bound on the *fraction of lesion mass retained* under membership pruning: `P(Y(x) ≥ guaranteed) ≥ 1-α`. (This certifies lesion retention under the shipping policy, not any internal coverage statistic.)
- **Lesion-routed depth.** Route tokens by membership at a mid block: high-membership tokens continue through full depth; the rest exit early.

### 2.4 The coverage constraint (the hypothesis we falsify)

We define a coverage functional `C(S;x) = effrank(P_L Z_S)` (RankMe form; coding-rate surrogate to avoid SVD backprop) and pose pruning as `min_m Σ m_i s.t. C*(x) - C(S;x) ≤ ε`, with Lagrangian dual `μ` learned by dual ascent and a Gumbel straight-through mask. Section 6 shows why this underperforms the simple membership rule of Sec. 2.3.

## 3. Experimental protocol (gated falsification)

Each claim is a gate with an explicit metric, comparator, threshold (calibrated in a locked Phase-1b step against the saliency/random baselines), and statistical test (DeLong for AUROC; paired bootstrap n=2000 for recall; Spearman with permutation for coupling). Masks are **evaluation-only**; no label touches subspace construction (enforced by a CI label-leak test).

Datasets: LIDC-IDRI (lung CT), KiTS23 (kidney CT), MSD Task03 Liver, MSD Task07 Pancreas, BUSI (breast ultrasound). All compute ran as Hugging Face Jobs.

## 4. Results: the label-free localizer

### 4.1 Lesion signal lives mid-layer (Finding 1)

Token-level lesion AUROC by depth (LIDC, density-A; Fig. 1):

| layer | final (12) | block 6 | block 4 | block 3 |
|---|---|---|---|---|
| AUROC | 0.565 | 0.769 | 0.865 | **0.871** |

Final-layer features are tuned for the global self-distillation objective; the dense local lesion signal sits mid/early. We fix block 3 (MedDINOv3) as the operating layer; for DINOv2 the optimum is block 8. The optimum is backbone-dependent, but in both cases it lies before the final block. The curve is multi-seed stable (peak 0.866 ± 0.010, n=3), the operating layer is selectable without labels (tail-gap selector regret 0.006), and the depth-erosion holds across objectives (see the companion mechanism study).

### 4.2 Cross-anatomy, cross-modality, cross-backbone localization (Finding 2)

Density-A token-level lesion AUROC, with attention-saliency as the label-free comparator:

| dataset (modality, backbone) | density-A | attention | random |
|---|---|---|---|
| LIDC lung CT (MedDINOv3) | **0.871** | 0.767 | 0.51 |
| MSD pancreas CT (MedDINOv3) | 0.876 | 0.920 | 0.49 |
| KiTS23 kidney CT (MedDINOv3) | 0.823 | 0.823 | 0.50 |
| MSD liver CT (MedDINOv3) | 0.670 | 0.756 | 0.50 |
| BUSI breast US (DINOv2) | **0.733** | 0.492 | 0.50 |

The subspace localizes lesions without labels across very different anatomies, two modalities, and two backbones.
On ultrasound, attention is at chance; the geometric subspace is the *only* label-free signal we tested that works there.

### 4.3 Precondition and characterized failure

The method's value tracks **whether feature density localizes the lesion**, not the modality. Liver (0.67) is the characterized failure: low-contrast tumors in heterogeneous parenchyma are not locally rare in feature space. Liver is the *mirror image* of ultrasound: on liver, attention (0.756) is the better localizer; on ultrasound, it collapses (0.49). A density+attention hybrid does **not** rescue liver (0.713, between the two; the weak density signal drags down the better attention signal). Deployment rule: use the subspace where density-AUROC clears the floor, else fall back to attention.

## 5. Results: pruning, certificate, routing

### 5.1 Membership pruning beats saliency pruning (Finding 3)

Gain in small-lesion recall at matched token budget, membership pruning vs attention-saliency pruning (paired bootstrap CI excludes 0 throughout):

| dataset | budget 0.25 | budget 0.5 |
|---|---|---|
| LIDC lung CT | +27.6 pts | +15.8 pts (89% miss reduction) |
| KiTS23 kidney CT | +7.4 pts (40% miss reduction) | +1.6 pts (91% miss reduction) |
| BUSI breast US | +13.8 pts | +19.0 pts |

Pancreas ties: tumors are large and salient (attention already 0.92), the safe regime where pruning is not a clinical risk. The gain is largest exactly where saliency fails (subtle lesions; ultrasound).

### 5.2 Conformal retention certificate

Multi-split split-conformal (50 resamples, pooled n=4352, α=0.1): empirical coverage **0.978 ≥ 0.90**, so the per-image guarantee holds empirically. The certificate honestly exposes a budget-versus-guarantee tradeoff: ~100% guaranteed lesion retention at budget 0.5, and at 0.25 it correctly reports that the hardest ~10% of small-lesion cases cannot be guaranteed.
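
The split-conformal step behind such a certificate can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline that produced the numbers above: it assumes a hypothetical per-image, label-free estimate of retained lesion mass (`predicted`), a calibration set on which the true retained mass was measured with evaluation masks (`observed`), and a one-sided residual score. The function names and the toy calibration values are invented for the example.

```python
import math


def conformal_offset(predicted, observed, alpha=0.1):
    """One-sided split-conformal offset from a calibration set.

    predicted[i]: hypothetical label-free estimate of retained lesion mass.
    observed[i]:  retained lesion mass measured with the evaluation mask.
    """
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need equally sized, non-empty calibration lists")
    # Nonconformity score: how far the estimate overshoots the truth.
    scores = sorted(p - y for p, y in zip(predicted, observed))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return math.inf  # too few calibration images for this alpha
    return scores[rank - 1]


def retention_lower_bound(predicted_retention, offset):
    """Per-image lower bound on retained lesion mass, clipped to [0, 1]."""
    return min(1.0, max(0.0, predicted_retention - offset))


if __name__ == "__main__":
    # Toy calibration data, made up for illustration only.
    cal_pred = [0.95, 0.90, 0.99, 0.85, 0.97, 0.92, 0.88, 0.96, 0.93, 0.91,
                0.98, 0.89, 0.94, 0.87, 0.90, 0.95, 0.93, 0.96, 0.92, 0.99]
    cal_obs = [0.93, 0.91, 0.95, 0.80, 0.97, 0.90, 0.70, 0.96, 0.92, 0.88,
               0.98, 0.85, 0.94, 0.86, 0.75, 0.95, 0.90, 0.94, 0.92, 0.97]
    q = conformal_offset(cal_pred, cal_obs, alpha=0.1)
    print("conformal offset:", q)
    print("certified retention for a new image:", retention_lower_bound(0.94, q))
```

Under exchangeability of calibration and test images, subtracting the offset gives a bound that holds with probability at least `1 - alpha`. When the calibration set is too small for the requested `alpha`, the sketch falls back to the trivial bound of 0 rather than overstating the guarantee.
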

### 5.3 Lesion-routed depth

Routing depth by membership yields **1.6× FLOP reduction at 98.2% small-lesion sensitivity** and dominates saliency routing at every retention level (saliency never reaches equal sensitivity at any FLOP saving). A volumetric two-level (slice+token) economy gives a further ~2× at a documented sensitivity cost (a tunable deployment knob).

## 6. The negative result: rank-based coverage fails for rare pathology (Finding 4)

### 6.1 The ablation

Small-lesion retention for three pruning strategies at matched budget:

| budget | saliency | subspace-only (membership top-k) | subspace + coverage floor |
|---|---|---|---|
| 0.25 | 0.521 | **0.817** | **0.219** |
| 0.50 | 0.827 | **0.981** | **0.460** |

Subspace-only beats saliency (+29.6 / +15.4 pts). The coverage floor is **far worse** than subspace-only (−0.60 / −0.52, CI excludes 0). The constraint does not add value; it removes it.

### 6.2 Mechanism (transferable)

`C(S)=effrank(P_L Z_S)` is maximized by a retained set that **diversely spans** the subspace's directions. The decisive property of a lesion is that it is **rare**: a handful of tokens out of ~196. A set-level rank/coverage objective is therefore *insensitive* to it: a few tokens cannot materially raise the retained set's effective rank, so the objective spends the budget on abundant background directions and drops the lesion. This is a **rarity** mechanism, not an internal-geometry one, and we checked: measured at the operating layer, lesion tokens are *not* low-rank relative to background (pooled effective rank 339 vs 307; participation ratio 18.9 vs 13.9; within-image internal-rank/m ≈ equal). Lesion tokens are in fact diverse; the set-coverage objective is blind to them anyway because they are few. (The synthetic law of the companion paper reaches the same failure via a genuinely low-rank signal; real lesions reach it via rarity. These are two routes to one principle.)
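
The insensitivity argument can be probed on synthetic data. The sketch below is a toy illustration, not the paper's experiment: it computes a RankMe-style effective rank (exponential of the entropy of the L1-normalized singular values) for two retained sets of equal size, one that keeps a small lesion cluster and one that drops it. The projection `P_L` is taken to be the identity, and the token counts, feature dimension, lesion offset, and all function names are assumptions made for the example.

```python
import numpy as np


def effective_rank(tokens, eps=1e-12):
    """RankMe-style effective rank: exp of the entropy of the
    L1-normalized singular values of the token matrix (rows = tokens)."""
    s = np.linalg.svd(np.asarray(tokens, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps) + eps
    return float(np.exp(-np.sum(p * np.log(p))))


def lesion_fraction_retained(keep_idx, lesion_idx):
    """Fraction of lesion tokens that survive pruning."""
    return float(np.isin(lesion_idx, keep_idx).mean())


def compare_retained_sets(seed=0, n_background=193, n_lesion=3, dim=32, budget=49):
    """Toy image: isotropic background plus a tight, rare lesion cluster."""
    rng = np.random.default_rng(seed)
    background = rng.normal(size=(n_background, dim))
    center = 3.0 * rng.normal(size=dim)  # hypothetical lesion location
    lesion = center + 0.1 * rng.normal(size=(n_lesion, dim))
    tokens = np.vstack([background, lesion])
    lesion_idx = np.arange(n_background, n_background + n_lesion)

    order = rng.permutation(n_background)
    # Set A keeps the lesion tokens and fills the budget with background.
    keep_with = np.concatenate([lesion_idx, order[: budget - n_lesion]])
    # Set B spends the whole budget on background tokens.
    keep_without = order[:budget]
    return {
        "effrank_with_lesion": effective_rank(tokens[keep_with]),
        "effrank_without_lesion": effective_rank(tokens[keep_without]),
        "lesion_kept_with": lesion_fraction_retained(keep_with, lesion_idx),
        "lesion_kept_without": lesion_fraction_retained(keep_without, lesion_idx),
    }


if __name__ == "__main__":
    for name, value in compare_retained_sets().items():
        print(f"{name}: {value:.3f}")
```

In a setting like this one would expect the two effective ranks to be close, while lesion retention is 1.0 for the first set and 0.0 for the second by construction. A rank-based objective therefore has little reason to prefer the set that keeps the lesion.
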

Rank coverage rewards the entropy of the retained *set* spectrum; lesion retention rewards mass on the top membership tokens; these diverge whenever the critical signal is a **rare** cluster, of any internal rank. **For rare-pathology tasks, prefer concentration objectives (energy / membership mass) over rank/spanning objectives (RankMe, coding rate, MCR²).**

### 6.3 Convergent evidence

Three independent lines reach the same verdict: (a) the ablation above; (b) the Gate-2 faithfulness test: under a random-pruning protocol, coverage-drop predicts detection-drop no better than attention-drop (Spearman 0.480 vs 0.479; difference CI includes 0; both capped near 0.48 by small-lesion combinatorics, not by faithfulness); (c) the difficulty-adaptive budget never emerges: aggregate coverage is nearly identical on lesion-positive and lesion-negative slices (250.4 vs 247.2), since 1–3 patches cannot move an aggregate over ~196 tokens. The coverage *constraint machinery* (dual, floor) is intact and stable as an optimizer (dual μ stabilizes; the floor is satisfied on 99% of cases); it simply optimizes the wrong quantity.

## 7. Related work and positioning

Prior work on label-free or medical token pruning (AFFMAE, PrATo/MedPruner, RankMe as a diagnostic, WERank) either treats rank as a monitor rather than a target, prunes by attention or labels, or is non-medical. We contribute: (i) the mid-layer localization finding; (ii) a label-free lesion subspace that transfers across modality and backbone; (iii) a conformal retention certificate; and (iv) a mechanistic negative result on rank-based coverage objectives. Note that (iv) cleanly separates this work from a companion representation-coverage probe study: if such a probe reads final-layer features, Finding 1 says it reads the wrong layer, so the two results reinforce rather than overlap.

**Relation to label-free FM adaptation (FINO; Gardès et al., 2026).** Concurrent work adapts vision foundation models to scientific domains *without task labels* by guiding a self-supervised objective with **metadata**, training the backbone. Our work is orthogonal and complementary: we keep the backbone **frozen**, use **no metadata or labels**, and contribute a geometric *analysis* (where the signal lives), a token *economy* (membership pruning, routed depth), a retention *certificate*, and a *law* on objective choice, none of which an adaptation method addresses. The two compose: our probe can be run on a FINO-adapted backbone to test whether metadata-guided adaptation preserves the mid-layer concentration subspace at depth and improves rare-signal separability (a question our training-free steering result, Sec. 6 / companion study, shows cannot be solved without a training signal). Their result is also indirect support for our mechanism: counteracting depth-globalization of informative local factors is plausibly part of why metadata guidance helps.

## 8. Limitations

- The method helps only where feature density localizes the lesion (liver is the characterized failure); a deployment check on density-AUROC is required, with attention fallback.
- Faithfulness of coverage as a proxy is moderate, not tight, and not better than saliency under the random-pruning protocol.
- Pretraining-time application is untested (inference-time/fine-tuning only); the conformal guarantee assumes exchangeable calibration/test data.

## 9. Conclusion

The contribution is the **label-free lesion subspace**, a mid-layer geometry that localizes lesions without labels across modality and backbone, together with membership pruning, a conformal retention certificate, and lesion-routed depth. The coverage-constrained optimization we began with is reported as a clean negative whose mechanism (rank rewards spanning, rare pathology needs concentration) is a transferable caution for medical SSL.

---

### Appendix A: Gate ledger (locked Phase-1b thresholds)

| Gate | Verdict | Key number |
|---|---|---|
| 0 reproducibility | PASS | frozen load, Δ=0, 2.1M-token bank |
| 1 subspace validity | PASS | density-A 0.871, +0.105 vs attention |
| 2 faithfulness | guard PASS; not superior | coverage 0.480 vs saliency 0.479 (tied) |
| 3 membership pruning > saliency | PASS | LIDC, KiTS23, BUSI (CT + ultrasound) |
| 4 coverage floor | NEGATIVE | floor 0.22 vs subspace 0.82 @0.25 |
| 5 invariance | FALLBACK | inference-time |
| 6 conformal retention cert. | PASS | empirical 0.978 ≥ 0.90 |
| 6 lesion-routed depth | PASS | 1.6× FLOPs @ 98% sensitivity |
| 6 volumetric | PARTIAL | ~2× at 82% lesion mass (tunable) |

### Appendix B: Reproducibility

All experiments ran as Hugging Face Jobs (MedDINOv3 `ricklisz123/MedDINOv3-ViTB-16-CT-3M`, DINOv2 `facebook/dinov2-base`). Artifacts (token banks, materialized masks, per-gate metrics) are in the `processed/covtoken/` bucket; per-gate decision records are in `covtoken/gate_reports/`; locked thresholds are in `covtoken/configs/thresholds.lock.json`.
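
As a reproducibility aid, the density-A membership score (Sec. 2.2) and top-k membership pruning (Sec. 2.3) can be sketched in a few lines. This is a brute-force, standard-library illustration on toy 2-D points, not the code behind the reported numbers. The function names, the Euclidean metric, the default `k=5`, and the `ceil(budget * n)` rounding are assumptions, since the text above does not fix them; a bank of 2.1M tokens would likely need an approximate nearest-neighbor index rather than this exhaustive search.

```python
import heapq
import math


def membership_scores(tokens, bank, k=5):
    """Mean Euclidean distance from each token to its k nearest bank tokens.

    A larger score means lower local density, i.e. higher lesion membership.
    """
    if k < 1 or k > len(bank):
        raise ValueError("k must be between 1 and the bank size")
    scores = []
    for z in tokens:
        nearest = heapq.nsmallest(k, (math.dist(z, b) for b in bank))
        scores.append(sum(nearest) / k)
    return scores


def prune_by_membership(scores, budget):
    """Indices of the top ceil(budget * n) tokens by score, in original order."""
    n_keep = max(1, math.ceil(budget * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


if __name__ == "__main__":
    # Toy 2-D "features", made up for illustration: a small bank of
    # normal-tissue tokens and three tokens from one image.
    bank = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
    tokens = [(0.5, 0.5), (10.0, 10.0), (0.0, 0.0)]
    scores = membership_scores(tokens, bank, k=2)
    print("membership scores:", scores)
    print("kept at budget 0.5:", prune_by_membership(scores, budget=0.5))
```

The token far from the bank receives the highest score and is the first to be retained, which is the behavior the density construction relies on.
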