| --- |
| title: "Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging" |
| status: working draft |
| date: 2026-06-20 |
| backbones: [MedDINOv3 ViT-B/16 (CT-3M), DINOv2-base] |
| --- |
| |
| # Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging |
|
|
| ## Abstract |
|
|
| We study label-free token pruning for medical imaging built on frozen self-supervised vision |
| transformers. Our central object is a **label-free lesion subspace**: a geometric region of a |
| frozen ViT's patch-token feature space, estimated without any lesion labels, in which lesion |
| tokens are locally rare/distinctive. Three findings organize the paper. (1) **Where to look.** |
| The lesion-localizable signal in frozen SSL ViTs lives in **mid-layer**, not final-layer, |
| features: on lung CT, token-level lesion AUROC rises from 0.565 (final block) to **0.871** (block |
| 3). (2) **A label-free localizer that generalizes.** A simple density estimate over a held-out |
| token bank localizes lesions without labels across anatomies (lung 0.87, pancreas 0.88, kidney |
| 0.82 CT) and **across modalities and backbones** — 0.73 on breast ultrasound with DINOv2, where |
| attention saliency collapses to chance. Pruning tokens by subspace membership beats attention- |
| saliency pruning on small-lesion miss-rate by +14–28 points across CT and ultrasound, and admits |
| a per-image **conformal retention certificate** (empirical coverage 0.978 ≥ nominal 0.90) and a |
| **lesion-routed adaptive depth** that cuts 1.6× FLOPs at 98% small-lesion sensitivity. (3) **A |
| negative result with a transferable mechanism.** We set out to gate pruning with a *coverage |
| constraint* — a floor on the effective rank (RankMe / coding rate) of the lesion subspace spanned |
| by retained tokens, controlled by an interpretable dual variable. This **fails**: at matched |
| budget the coverage-constrained pruner retains 0.22 vs 0.82 of small lesions versus plain |
| membership ranking. The mechanism generalizes past our method: **rank-based coverage objectives |
| reward diverse subspace *spanning*, whereas rare small-region pathology requires *concentration* |
| on a few high-membership tokens.** Effective-rank coverage is therefore structurally mismatched |
| to rare-lesion retention — a warning for the increasingly common use of RankMe-flavored |
| objectives in medical SSL. |
|
|
| ## 1. Introduction |
|
|
| Token pruning makes vision transformers cheaper, but in medical imaging the failure mode that |
| matters is dropping the pathology. A tiny lung nodule or microcalcification occupies a handful of |
| patches; a pruner optimized for throughput or generic saliency can discard exactly those. |
|
|
| We ask a narrower, label-free question: **can a frozen SSL backbone tell us, without any labels, |
| which tokens carry diagnostic signal — well enough to prune around them, certify the result, and |
| adapt compute?** Our answer is a *label-free lesion subspace* and the operations built on it. |
|
|
| We deliberately also report what did **not** work. Our original hypothesis was that pruning should |
| be a *constrained optimization* — minimize tokens subject to a floor on lesion-subspace coverage, |
| with an interpretable dual as the controller. That hypothesis is wrong, and wrong for an |
| instructive reason we make precise. We treat the negative as a first-class result. |
|
|
| **Contributions.** |
| 1. A mid-layer localization finding: lesion signal in frozen SSL ViTs is mid-layer, not final. |
| 2. A label-free lesion subspace that localizes lesions across anatomy, modality, and backbone. |
| 3. Subspace-membership pruning that beats saliency pruning on small-lesion miss-rate, with a |
| conformal retention certificate and lesion-routed depth. |
| 4. A negative result with a transferable mechanism: rank-based coverage objectives fail for |
| rare-lesion retention. |
|
|
| ## 2. Method |
|
|
| ### 2.1 Setup |
| Frozen backbone, patch-token features `Z(x) = {z_1,...,z_n}`, `z_i ∈ R^d`. For CT we use |
| **MedDINOv3 ViT-B/16 (CT-3M)**; for ultrasound, **DINOv2-base** (modality-agnostic), establishing |
| that the method is not backbone-specific. We extract **mid-layer** tokens (Sec. 4.1). |
|
|
| ### 2.2 Label-free lesion subspace |
| We estimate, without labels, the region of feature space carrying diagnostic signal. |
| - **Construction A (density).** Lesions are rare, so lesion tokens lie in locally sparse regions. |
| Estimate token density via k-NN distance to a held-out token bank; the lesion-membership score |
| is the mean k-NN distance (low density ⇒ high score). The candidate subspace `L(x)` is spanned |
| by the low-density tokens. |
| - **Construction B (residual).** Fit a low-rank normal-tissue subspace `U` by PCA on the bank; |
| lesion-relevant tokens have high residual `‖(I-UU^T)z‖`. |
| Both are label-free. The held-out CT token bank holds 2.1M mid-layer tokens. |
|
|
| ### 2.3 Membership pruning, certificate, routing (what ships) |
| - **Lesion-subspace membership pruning.** Retain the top-k tokens by membership score. |
| - **Conformal retention certificate.** With split conformal on a calibration set, emit per image a |
| distribution-free lower bound on the *fraction of lesion mass retained* under membership pruning: |
| `P(Y(x) ≥ guaranteed) ≥ 1-α`. (Certifies lesion retention under the shipping policy, not any |
| internal coverage statistic.) |
| - **Lesion-routed depth.** Route tokens by membership at a mid block: high-membership tokens |
| continue through full depth; the rest exit early. |
|
|
| ### 2.4 The coverage constraint (the hypothesis we falsify) |
| We define a coverage functional `C(S;x) = effrank(P_L Z_S)` (RankMe form; coding-rate surrogate to |
| avoid SVD backprop) and pose pruning as `min_m Σ m_i s.t. C*(x) - C(S;x) ≤ ε`, with Lagrangian |
| dual `μ` learned by dual ascent and a Gumbel straight-through mask. Section 5 shows why this |
| underperforms the simple membership rule of Sec. 2.3. |
|
|
| ## 3. Experimental protocol (gated falsification) |
|
|
| Each claim is a gate with an explicit metric, comparator, threshold (calibrated in a locked |
| Phase-1b step against the saliency/random baselines), and statistical test (DeLong for AUROC; |
| paired bootstrap n=2000 for recall; Spearman with permutation for coupling). Masks are |
| **evaluation-only**; no label touches subspace construction (enforced by a CI label-leak test). |
| Datasets: LIDC-IDRI (lung CT), KiTS23 (kidney CT), MSD Task03 Liver, MSD Task07 Pancreas, BUSI |
| (breast ultrasound). All compute ran as Hugging Face Jobs. |
|
|
| ## 4. Results: the label-free localizer |
|
|
| ### 4.1 Lesion signal lives mid-layer (Finding 1) |
| Token-level lesion AUROC by depth (LIDC, density-A; Fig. 1): |
|
|
| | layer | final (12) | block 6 | block 4 | block 3 | |
| |---|---|---|---|---| |
| | AUROC | 0.565 | 0.769 | 0.865 | **0.871** | |
|
|
| Final-layer features are tuned for the global self-distillation objective; the dense local lesion |
| signal sits mid/early. We fix block 3 (MedDINOv3) as the operating layer; for DINOv2 the optimum |
| is block 8 — backbone-dependent, but always mid/late, never final. The curve is multi-seed stable |
| (peak 0.866 ± 0.010, n=3), the operating layer is selectable without labels (tail-gap selector |
| regret 0.006), and the depth-erosion holds across objectives — see the companion mechanism study. |
|
|
| ### 4.2 Cross-anatomy, cross-modality, cross-backbone localization (Finding 2) |
| density-A token-level lesion AUROC, with attention-saliency as the label-free comparator: |
|
|
| | dataset (modality, backbone) | density-A | attention | random | |
| |---|---|---|---| |
| | LIDC lung CT (MedDINOv3) | **0.871** | 0.767 | 0.51 | |
| | MSD pancreas CT (MedDINOv3) | 0.876 | 0.920 | 0.49 | |
| | KiTS23 kidney CT (MedDINOv3) | 0.823 | 0.823 | 0.50 | |
| | MSD liver CT (MedDINOv3) | 0.670 | 0.756 | 0.50 | |
| | BUSI breast US (DINOv2) | **0.733** | 0.492 | 0.50 | |
|
|
| The subspace localizes lesions without labels across very different anatomies, two modalities, and |
| two backbones. On ultrasound, attention is at chance — the geometric subspace is the *only* label- |
| free signal that works. |
|
|
| ### 4.3 Precondition and characterized failure |
| The method's value tracks **whether feature density localizes the lesion**, not the modality. |
| Liver (0.67) is the characterized failure: low-contrast tumors in heterogeneous parenchyma are not |
| locally rare in feature space. Liver is the *mirror image* of ultrasound — on liver attention |
| (0.756) is the better localizer, on ultrasound it collapses (0.49). A density+attention hybrid |
| does **not** rescue liver (0.713, between the two; the weak density signal drags down better |
| attention). Deployment rule: use the subspace where density-AUROC clears the floor, else fall back |
| to attention. |
|
|
| ## 5. Results: pruning, certificate, routing |
|
|
| ### 5.1 Membership pruning beats saliency pruning (Finding 3) |
| Small-lesion recall at matched token budget, membership pruning vs attention-saliency pruning |
| (paired bootstrap CI excludes 0 throughout): |
|
|
| | dataset | budget 0.25 | budget 0.5 | |
| |---|---|---| |
| | LIDC lung CT | +27.6 pts | +15.8 pts (89% miss-red) | |
| | KiTS23 kidney CT | +7.4 pts (40% miss-red) | +1.6 pts (91% miss-red) | |
| | BUSI breast US | +13.8 pts | +19.0 pts | |
|
|
| Pancreas ties — tumors are large and salient (attention already 0.92), the safe regime where |
| pruning is not a clinical risk. The gain is largest exactly where saliency fails (subtle lesions; |
| ultrasound). |
|
|
| ### 5.2 Conformal retention certificate |
| Multi-split split-conformal (50 resamples, pooled n=4352, α=0.1): empirical coverage **0.978 ≥ |
| 0.90** — the per-image guarantee is valid. The certificate honestly exposes a budget↔guarantee |
| tradeoff: ~100% guaranteed lesion retention at budget 0.5, and at 0.25 it correctly reports that |
| the hardest ~10% of small-lesion cases cannot be guaranteed. |
|
|
| ### 5.3 Lesion-routed depth |
| Routing depth by membership yields **1.6× FLOP reduction at 98.2% small-lesion sensitivity** and |
| dominates saliency routing at every retention (saliency never reaches equal sensitivity at any |
| FLOP saving). A volumetric two-level (slice+token) economy gives a further ~2× at a documented |
| sensitivity cost (tunable deployment knob). |
|
|
| ## 6. The negative result: rank-based coverage fails for rare pathology (Finding 4) |
|
|
| ### 6.1 The ablation |
| Three pruning strategies, small lesions, matched budget: |
|
|
| | budget | saliency | subspace-only (membership top-k) | subspace + coverage floor | |
| |---|---|---|---| |
| | 0.25 | 0.521 | **0.817** | **0.219** | |
| | 0.50 | 0.827 | **0.981** | **0.460** | |
|
|
| Subspace-only beats saliency (+29.6 / +15.4 pts). The coverage floor is **far worse** than |
| subspace-only (−0.60 / −0.52, CI excludes 0). The constraint does not add value — it removes it. |
|
|
| ### 6.2 Mechanism (transferable) |
| `C(S)=effrank(P_L Z_S)` is maximized by a retained set that **diversely spans** the subspace's |
| directions. The decisive property of a lesion is that it is **rare** — a handful of tokens out of |
| ~196. A set-level rank/coverage objective is therefore *insensitive* to it: a few tokens cannot |
| materially raise the retained set's effective rank, so the objective spends the budget on abundant |
| background directions and drops the lesion. This is a **rarity** mechanism, not an internal-geometry |
| one — and we checked: measured at the operating layer, lesion tokens are *not* low-rank relative to |
| background (pooled effective rank 339 vs 307; participation ratio 18.9 vs 13.9; within-image |
| internal-rank/m ≈ equal). Lesion tokens are in fact diverse; the set-coverage objective is blind to |
| them anyway because they are few. (The synthetic law of the companion paper reaches the same failure |
| via a genuinely low-rank signal; real lesions reach it via rarity — two routes to one principle.) |
| Rank coverage rewards the entropy of the retained *set* spectrum; lesion retention rewards mass on |
| the top membership tokens; these diverge whenever the critical signal is a **rare** cluster, of any |
| internal rank. **For rare-pathology tasks, prefer concentration objectives (energy / membership |
| mass) over rank/spanning objectives (RankMe, coding rate, MCR2).** |
|
|
| ### 6.3 Convergent evidence |
| Three independent lines reach the same verdict: (a) the ablation above; (b) principled Gate-2 |
| faithfulness — under a random-pruning protocol, coverage-drop predicts detection-drop no better |
| than attention-drop (Spearman 0.480 vs 0.479; difference CI includes 0; both capped near 0.48 by |
| small-lesion combinatorics, not by faithfulness); (c) the difficulty-adaptive budget never |
| emerges — aggregate coverage is identical on lesion-positive vs -negative slices (250.4 vs 247.2), |
| since 1–3 patches cannot move an aggregate over ~196 tokens. The coverage *constraint machinery* |
| (dual, floor) is intact and stable as an optimizer (dual μ stabilizes; the floor is satisfied on |
| 99% of cases) — it simply optimizes the wrong quantity. |
|
|
| ## 7. Related work and positioning |
|
|
| Prior label-free / medical token pruning (AFFMAE, PrATo/MedPruner, RankMe as a diagnostic, WERank) |
| either treats rank as a monitor rather than a target, prunes by attention/labels, or is non- |
| medical. We contribute: (i) the mid-layer localization finding; (ii) a label-free lesion subspace |
| that transfers across modality and backbone; (iii) a conformal retention certificate; and (iv) a |
| mechanistic negative result on rank-based coverage objectives. Note (iv) gives clean separation |
| from a companion representation-coverage probe study: if such a probe reads final-layer features, |
| Finding 1 says it reads the wrong layer — the two results reinforce rather than overlap. |
|
|
| **Relation to label-free FM adaptation (FINO; Gardès et al., 2026).** Concurrent work adapts vision |
| foundation models to scientific domains *without task labels* by guiding a self-supervised objective |
| with **metadata**, training the backbone. Our work is orthogonal and complementary: we keep the |
| backbone **frozen**, use **no metadata or labels**, and contribute a geometric *analysis* (where the |
| signal lives), a token *economy* (membership pruning, routed depth), a retention *certificate*, and |
| a *law* on objective choice — none of which an adaptation method addresses. The two compose: our |
| probe can be run on a FINO-adapted backbone to test whether metadata-guided adaptation preserves the |
| mid-layer concentration subspace at depth and improves rare-signal separability (a question our |
| training-free steering result, Sec. 6 / companion study, shows cannot be solved without a training |
| signal). Their result is also indirect support for our mechanism: counteracting depth-globalization |
| of informative local factors is plausibly part of why metadata guidance helps. |
|
|
| ## 8. Limitations |
|
|
| - The method helps only where feature density localizes the lesion (liver = characterized |
| failure); a deployment check on density-AUROC is required, with attention fallback. |
| - Faithfulness of coverage as a proxy is moderate, not tight, and not better than saliency under |
| the random-pruning protocol. |
| - Pretraining-time application is untested (inference-time/fine-tuning only); the conformal |
| guarantee assumes exchangeable calibration/test data. |
|
|
| ## 9. Conclusion |
|
|
| The contribution is the **label-free lesion subspace** — a mid-layer geometry that localizes |
| lesions without labels across modality and backbone — together with membership pruning, a conformal |
| retention certificate, and lesion-routed depth. The coverage-constrained optimization we began with |
| is reported as a clean negative whose mechanism (rank rewards spanning, rare pathology needs |
| concentration) is a transferable caution for medical SSL. |
|
|
| --- |
|
|
| ### Appendix A — Gate ledger (locked Phase-1b thresholds) |
|
|
| | Gate | Verdict | Key number | |
| |---|---|---| |
| | 0 reproducibility | PASS | frozen load, Δ=0, 2.1M-token bank | |
| | 1 subspace validity | PASS | density-A 0.871, +0.105 vs attention | |
| | 2 faithfulness | guard PASS; not superior | coverage 0.480 vs saliency 0.479 (tied) | |
| | 3 membership pruning > saliency | PASS | LIDC, KiTS23, BUSI (CT + ultrasound) | |
| | 4 coverage floor | NEGATIVE | floor 0.22 vs subspace 0.82 @0.25 | |
| | 5 invariance | FALLBACK | inference-time | |
| | 6 conformal retention cert. | PASS | empirical 0.978 ≥ 0.90 | |
| | 6 lesion-routed depth | PASS | 1.6× FLOPs @ 98% sensitivity | |
| | 6 volumetric | PARTIAL | ~2× at 82% lesion mass (tunable) | |
|
|
| ### Appendix B — Reproducibility |
| All experiments ran as Hugging Face Jobs (MedDINOv3 `ricklisz123/MedDINOv3-ViTB-16-CT-3M`, |
| DINOv2 `facebook/dinov2-base`). Artifacts (token banks, materialized masks, per-gate metrics) in |
| the `processed/covtoken/` bucket; per-gate decision records in `covtoken/gate_reports/`; locked |
| thresholds in `covtoken/configs/thresholds.lock.json`. |
|
|