title: >-
Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical
Medical Imaging
status: working draft
date: 2026-06-20T00:00:00.000Z
backbones:
- MedDINOv3 ViT-B/16 (CT-3M)
- DINOv2-base
Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging
Abstract
We study label-free token pruning for medical imaging built on frozen self-supervised vision transformers. Our central object is a label-free lesion subspace: a geometric region of a frozen ViT's patch-token feature space, estimated without any lesion labels, in which lesion tokens are locally rare and distinctive. Four findings organize the paper. (1) Where to look. The lesion-localizable signal in frozen SSL ViTs lives in mid-layer, not final-layer, features: on lung CT, token-level lesion AUROC rises from 0.565 (final block) to 0.871 (block 3). (2) A label-free localizer that generalizes. A simple density estimate over a held-out token bank localizes lesions without labels across anatomies (CT: lung 0.87, pancreas 0.88, kidney 0.82) and across modalities and backbones: it reaches 0.73 on breast ultrasound with DINOv2, where attention saliency collapses to chance. (3) Membership pruning beats saliency pruning. Pruning tokens by subspace membership lowers the small-lesion miss rate by 14–28 points relative to attention-saliency pruning across CT and ultrasound. It also admits a per-image conformal retention certificate (empirical coverage 0.978 ≥ nominal 0.90) and a lesion-routed adaptive depth that reduces FLOPs by 1.6× at 98% small-lesion sensitivity. (4) A negative result with a transferable mechanism. We set out to gate pruning with a coverage constraint: a floor on the effective rank (RankMe / coding rate) of the lesion subspace spanned by retained tokens, controlled by an interpretable dual variable. This fails: at a matched budget of 0.25, the coverage-constrained pruner retains 0.22 of small lesions, versus 0.82 for plain membership ranking. The mechanism generalizes past our method: rank-based coverage objectives reward diverse subspace spanning, whereas rare small-region pathology requires concentration on a few high-membership tokens. Effective-rank coverage is therefore structurally mismatched to rare-lesion retention, a warning for the increasingly common use of RankMe-flavored objectives in medical SSL.
1. Introduction
Token pruning makes vision transformers cheaper, but in medical imaging the failure mode that matters is dropping the pathology. A tiny lung nodule or microcalcification occupies a handful of patches; a pruner optimized for throughput or generic saliency can discard exactly those.
We ask a narrower, label-free question: can a frozen SSL backbone tell us, without any labels, which tokens carry diagnostic signal — well enough to prune around them, certify the result, and adapt compute? Our answer is a label-free lesion subspace and the operations built on it.
We deliberately also report what did not work. Our original hypothesis was that pruning should be a constrained optimization — minimize tokens subject to a floor on lesion-subspace coverage, with an interpretable dual as the controller. That hypothesis is wrong, and wrong for an instructive reason we make precise. We treat the negative as a first-class result.
Contributions.
- A mid-layer localization finding: lesion signal in frozen SSL ViTs is mid-layer, not final.
- A label-free lesion subspace that localizes lesions across anatomy, modality, and backbone.
- Subspace-membership pruning that beats saliency pruning on small-lesion miss-rate, with a conformal retention certificate and lesion-routed depth.
- A negative result with a transferable mechanism: rank-based coverage objectives fail for rare-lesion retention.
2. Method
2.1 Setup
Frozen backbone, patch-token features Z(x) = {z_1,...,z_n}, z_i ∈ R^d. For CT we use
MedDINOv3 ViT-B/16 (CT-3M); for ultrasound, DINOv2-base (modality-agnostic), establishing
that the method is not backbone-specific. We extract mid-layer tokens (Sec. 4.1).
2.2 Label-free lesion subspace
We estimate, without labels, the region of feature space carrying diagnostic signal.
- Construction A (density). Lesions are rare, so lesion tokens lie in locally sparse regions.
Estimate token density via k-NN distance to a held-out token bank; the lesion-membership score
is the mean k-NN distance (low density ⇒ high score). The candidate subspace L(x) is spanned by
the low-density tokens.
- Construction B (residual). Fit a low-rank normal-tissue subspace U by PCA on the bank;
lesion-relevant tokens have high residual ‖(I-UU^T)z‖.

Both are label-free. The held-out CT token bank holds 2.1M mid-layer tokens.
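The two constructions can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, `k`, `rank`, the Euclidean metric, the brute-force neighbor search, and the mean-centering before PCA are all choices made here, and the synthetic bank is a stand-in for the real token bank.

```python
import numpy as np

def knn_density_score(tokens, bank, k=5):
    """Construction A sketch: mean Euclidean distance to the k nearest bank tokens."""
    # Pairwise squared distances, shape (n_tokens, n_bank).
    d2 = (
        (tokens ** 2).sum(axis=1, keepdims=True)
        - 2.0 * tokens @ bank.T
        + (bank ** 2).sum(axis=1)
    )
    d2 = np.maximum(d2, 0.0)  # guard against tiny negatives from round-off
    # After partitioning, the first k columns hold the k smallest distances.
    nearest = np.partition(d2, k - 1, axis=1)[:, :k]
    return np.sqrt(nearest).mean(axis=1)  # low density => high score

def pca_residual_score(tokens, bank, rank=4):
    """Construction B sketch: norm of the component outside the bank's top PCA subspace."""
    mean = bank.mean(axis=0)  # centering is an assumption of this sketch
    _, _, vt = np.linalg.svd(bank - mean, full_matrices=False)
    U = vt[:rank].T                              # (d, rank), orthonormal columns
    centered = tokens - mean
    residual = centered - (centered @ U) @ U.T   # (I - U U^T) z
    return np.linalg.norm(residual, axis=1)

# Synthetic stand-in: "normal" tokens live near a 4-dim subspace of R^16.
rng = np.random.default_rng(0)
basis = rng.normal(size=(4, 16))
bank = rng.normal(size=(500, 4)) @ basis + 0.01 * rng.normal(size=(500, 16))
normal_tokens = rng.normal(size=(10, 4)) @ basis + 0.01 * rng.normal(size=(10, 16))
odd_tokens = 5.0 * rng.normal(size=(2, 16))      # off-subspace, locally rare
tokens = np.vstack([normal_tokens, odd_tokens])

score_a = knn_density_score(tokens, bank, k=5)
score_b = pca_residual_score(tokens, bank, rank=4)
print("density scores :", np.round(score_a, 2))
print("residual scores:", np.round(score_b, 2))
```

In this toy setting the last two tokens should receive the highest scores under both constructions. At the scale of a multi-million-token bank, the brute-force distance matrix would likely be replaced by an approximate nearest-neighbor index.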
2.3 Membership pruning, certificate, routing (what ships)
- Lesion-subspace membership pruning. Retain the top-k tokens by membership score.
- Conformal retention certificate. With split conformal on a calibration set, emit per image a
distribution-free lower bound on the fraction of lesion mass retained under membership pruning:
P(Y(x) ≥ guaranteed) ≥ 1-α, where Y(x) is the retained lesion-mass fraction and "guaranteed" is the emitted bound. (Certifies lesion retention under the shipping policy, not any internal coverage statistic.)
- Lesion-routed depth. Route tokens by membership at a mid block: high-membership tokens continue through full depth; the rest exit early.
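As a rough sketch of the first item, top-k membership pruning reduces to a sort over per-token scores. The helper names and the rounding convention (keep the ceiling of budget × n tokens) are assumptions made here; the second helper needs a lesion mask and is therefore evaluation-only, in line with the protocol.

```python
import numpy as np

def membership_prune(scores, budget):
    """Boolean keep-mask retaining the ceil(budget * n) highest-scoring tokens."""
    n = scores.shape[0]
    k = max(1, int(np.ceil(budget * n)))
    keep = np.zeros(n, dtype=bool)
    keep[np.argsort(scores)[-k:]] = True  # indices of the k largest scores
    return keep

def lesion_mass_retained(keep, lesion_mask):
    """Fraction of lesion tokens that survive pruning (evaluation only)."""
    total = lesion_mask.sum()
    if total == 0:
        return 1.0  # nothing to lose on a lesion-free image
    return float((keep & lesion_mask).sum() / total)
```

With 196 tokens and a budget of 0.25, `membership_prune` keeps 49 tokens; a three-token lesion is fully retained only if all three rank in that top 49.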
2.4 The coverage constraint (the hypothesis we falsify)
We define a coverage functional C(S;x) = effrank(P_L Z_S) (RankMe form; coding-rate surrogate to
avoid SVD backprop) and pose pruning as min_m Σ m_i s.t. C*(x) - C(S;x) ≤ ε, with Lagrangian
dual μ learned by dual ascent and a Gumbel straight-through mask. Section 6 shows why this
underperforms the simple membership rule of Sec. 2.3.
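For concreteness, the quantities named above can be sketched as follows. This assumes the common entropy-of-singular-values form of effective rank (as used by RankMe), the log-det coding rate from the MCR² literature, and a generic projected dual-ascent update. The function names, `eps` defaults, and learning rate are illustrative, and `coverage_target` simply stands for the value written C*(x) above.

```python
import numpy as np

def effective_rank(Z, eps=1e-12):
    """exp(entropy) of the normalized singular values of Z (rows = tokens)."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > 0]                      # drop exact zeros before taking logs
    return float(np.exp(-np.sum(p * np.log(p))))

def coding_rate(Z, eps=0.5):
    """0.5 * logdet(I + d / (n * eps^2) * Z^T Z), a smooth rank surrogate."""
    n, d = Z.shape
    gram = Z.T @ Z
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * gram)
    return 0.5 * float(logdet)

def dual_ascent_step(mu, coverage_target, coverage_kept, slack, lr=0.1):
    """One projected dual-ascent update on the coverage constraint."""
    violation = coverage_target - coverage_kept - slack  # > 0 means violated
    return max(0.0, mu + lr * violation)                 # keep mu non-negative
```

The dual variable grows while the retained set falls short of the coverage floor and decays toward zero once the constraint is slack, which is what makes it readable as a controller.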
3. Experimental protocol (gated falsification)
Each claim is a gate with an explicit metric, comparator, threshold (calibrated in a locked Phase-1b step against the saliency/random baselines), and statistical test (DeLong for AUROC; paired bootstrap n=2000 for recall; Spearman with permutation for coupling). Masks are evaluation-only; no label touches subspace construction (enforced by a CI label-leak test). Datasets: LIDC-IDRI (lung CT), KiTS23 (kidney CT), MSD Task03 Liver, MSD Task07 Pancreas, BUSI (breast ultrasound). All compute ran as Hugging Face Jobs.
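The paired bootstrap used for recall comparisons can be sketched as below. This is a generic percentile bootstrap over per-lesion detection indicators; the function name, the percentile interval, and the per-lesion pairing unit are assumptions here, since the protocol above only fixes n=2000 resamples.

```python
import numpy as np

def paired_bootstrap_ci(hits_a, hits_b, n_boot=2000, alpha=0.05, seed=0):
    """Mean recall difference (a - b) with a percentile CI from paired resampling."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(hits_a, dtype=float) - np.asarray(hits_b, dtype=float)
    # Resample lesion indices with replacement; pairing is preserved because
    # the same index selects both methods' outcomes.
    idx = rng.integers(0, diff.size, size=(n_boot, diff.size))
    means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)
```

A gate of the form "CI excludes 0" then corresponds to checking that the returned lower bound is positive.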
4. Results: the label-free localizer
4.1 Lesion signal lives mid-layer (Finding 1)
Token-level lesion AUROC by depth (LIDC, density-A; Fig. 1):
| layer | final (12) | block 6 | block 4 | block 3 |
|---|---|---|---|---|
| AUROC | 0.565 | 0.769 | 0.865 | 0.871 |
Final-layer features are tuned for the global self-distillation objective; the dense local lesion signal sits mid/early. We fix block 3 (MedDINOv3) as the operating layer; for DINOv2 the optimum is block 8. The best layer is backbone-dependent, but it is always an intermediate block, never the final one. The curve is multi-seed stable (peak 0.866 ± 0.010, n=3), the operating layer is selectable without labels (tail-gap selector regret 0.006), and the depth-erosion holds across objectives; see the companion mechanism study.
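Token-level AUROC here is the probability that a randomly chosen lesion token outscores a randomly chosen non-lesion token. A minimal sketch of that computation follows; the function name is hypothetical, and the pairwise formulation is chosen for clarity rather than speed.

```python
import numpy as np

def token_auroc(scores, is_lesion):
    """AUROC as the fraction of (lesion, background) pairs ranked correctly."""
    scores = np.asarray(scores, dtype=float)
    is_lesion = np.asarray(is_lesion, dtype=bool)
    pos = scores[is_lesion]
    neg = scores[~is_lesion]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)   # ties count as half-correct

print(token_auroc([0.1, 0.4, 0.35, 0.8], [False, False, True, True]))  # prints 0.75
```

Sweeping this over the blocks of a frozen backbone, with masks used for evaluation only, gives a depth curve of the kind tabulated above.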
4.2 Cross-anatomy, cross-modality, cross-backbone localization (Finding 2)
Token-level lesion AUROC for density-A, with attention saliency as the label-free comparator:
| dataset (modality, backbone) | density-A | attention | random |
|---|---|---|---|
| LIDC lung CT (MedDINOv3) | 0.871 | 0.767 | 0.51 |
| MSD pancreas CT (MedDINOv3) | 0.876 | 0.920 | 0.49 |
| KiTS23 kidney CT (MedDINOv3) | 0.823 | 0.823 | 0.50 |
| MSD liver CT (MedDINOv3) | 0.670 | 0.756 | 0.50 |
| BUSI breast US (DINOv2) | 0.733 | 0.492 | 0.50 |
The subspace localizes lesions without labels across very different anatomies, two modalities, and two backbones. On ultrasound, attention is at chance, so the geometric subspace is the only label-free signal that works.
4.3 Precondition and characterized failure
The method's value tracks whether feature density localizes the lesion, not the modality. Liver (0.67) is the characterized failure: low-contrast tumors in heterogeneous parenchyma are not locally rare in feature space. Liver is the mirror image of ultrasound: on liver, attention (0.756) is the better localizer; on ultrasound it collapses (0.49). A density+attention hybrid does not rescue liver (0.713, between the two; the weak density signal drags down the stronger attention signal). Deployment rule: use the subspace where density-AUROC clears the floor; otherwise fall back to attention.
5. Results: pruning, certificate, routing
5.1 Membership pruning beats saliency pruning (Finding 3)
Gain in small-lesion recall at matched token budget, membership pruning minus attention-saliency pruning, in percentage points (pts); "miss-red" is the relative reduction in miss rate. The paired bootstrap CI excludes 0 throughout:
| dataset | budget 0.25 | budget 0.5 |
|---|---|---|
| LIDC lung CT | +27.6 pts | +15.8 pts (89% miss-red) |
| KiTS23 kidney CT | +7.4 pts (40% miss-red) | +1.6 pts (91% miss-red) |
| BUSI breast US | +13.8 pts | +19.0 pts |
Pancreas ties: its tumors are large and salient (attention AUROC is already 0.92), which is the safe regime where pruning is not a clinical risk. The gain is largest exactly where saliency fails (subtle lesions; ultrasound).
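The "miss-red" figures in the table can be read as the relative reduction in miss rate, where miss rate is one minus recall. That reading is an interpretation, but it reproduces the LIDC budget-0.5 entry when applied to the recall values reported in the Sec. 6.1 ablation (0.827 for saliency, 0.981 for membership). The helper name below is hypothetical.

```python
def miss_rate_reduction(recall_baseline, recall_method):
    """Relative drop in miss rate (1 - recall) when moving from baseline to method."""
    miss_base = 1.0 - recall_baseline
    miss_new = 1.0 - recall_method
    return (miss_base - miss_new) / miss_base

# LIDC, budget 0.5, using the recall values from the Sec. 6.1 ablation table.
print(round(100 * miss_rate_reduction(0.827, 0.981)))  # prints 89
```

This is why a modest gain in points can still be a large relative reduction: when the baseline already misses few lesions, recovering most of them removes most of the remaining misses.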
5.2 Conformal retention certificate
Multi-split split-conformal (50 resamples, pooled n=4352, α=0.1): empirical coverage 0.978 ≥ 0.90 — the per-image guarantee is valid. The certificate honestly exposes a budget↔guarantee tradeoff: ~100% guaranteed lesion retention at budget 0.5, and at 0.25 it correctly reports that the hardest ~10% of small-lesion cases cannot be guaranteed.
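One way to obtain a per-image lower bound of this kind is one-sided split conformal on the over-prediction residual of some per-image retention estimate. The sketch below assumes such an estimate exists (`pred`, hypothetical; the text above does not specify how it is produced) and shows only the calibration step. Function names are illustrative.

```python
import numpy as np

def conformal_offset(pred_cal, true_cal, alpha=0.1):
    """Offset q such that P(Y >= pred - q) >= 1 - alpha under exchangeability."""
    resid = np.asarray(pred_cal, dtype=float) - np.asarray(true_cal, dtype=float)
    n = resid.size
    # 1-indexed order statistic ceil((1 - alpha)(n + 1)); the small epsilon
    # protects against floating-point overshoot in the product.
    rank = int(np.ceil((1 - alpha) * (n + 1) - 1e-9))
    if rank > n:
        return float("inf")   # too few calibration points for this alpha
    return float(np.sort(resid)[rank - 1])

def certified_lower_bound(pred, q):
    """Per-image guaranteed retention, clipped to the valid range [0, 1]."""
    return np.clip(np.asarray(pred, dtype=float) - q, 0.0, 1.0)
```

Empirical coverage is then the fraction of held-out images whose true retained fraction is at least the certified bound. The guarantee is marginal over images and relies on calibration and test data being exchangeable.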
5.3 Lesion-routed depth
Routing depth by membership yields a 1.6× FLOP reduction at 98.2% small-lesion sensitivity and dominates saliency routing at every retention level (saliency never reaches equal sensitivity at any FLOP saving). A volumetric two-level (slice+token) economy gives a further ~2× at a documented sensitivity cost (tunable deployment knob).
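A back-of-the-envelope model helps relate routing choices to FLOP savings. The sketch below assumes cost is linear in the number of tokens per block, which ignores the quadratic attention term, and the routing block and continuing fraction in the example are illustrative values, not the configuration behind the reported 1.6×.

```python
def routed_speedup(n_blocks, route_block, frac_continue):
    """Speedup when only a fraction of tokens continues past the routing block.

    Assumes per-block cost proportional to token count (a simplification).
    """
    # All tokens pay for the first `route_block` blocks; only the retained
    # fraction pays for the remaining ones.
    cost = route_block + frac_continue * (n_blocks - route_block)
    return n_blocks / cost

# Illustrative: 12 blocks, routing after block 3, half of the tokens continue.
print(routed_speedup(12, 3, 0.5))  # prints 1.6
```

Under this simplified model, earlier routing blocks and smaller continuing fractions both raise the speedup, at the risk of exiting lesion tokens early.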
6. The negative result: rank-based coverage fails for rare pathology (Finding 4)
6.1 The ablation
Three pruning strategies, small lesions, matched budget:
| budget | saliency | subspace-only (membership top-k) | subspace + coverage floor |
|---|---|---|---|
| 0.25 | 0.521 | 0.817 | 0.219 |
| 0.50 | 0.827 | 0.981 | 0.460 |
Subspace-only beats saliency (+29.6 / +15.4 pts). The coverage floor is far worse than subspace-only (−0.60 / −0.52, CI excludes 0). The constraint does not add value — it removes it.
6.2 Mechanism (transferable)
C(S)=effrank(P_L Z_S) is maximized by a retained set that diversely spans the subspace's
directions. The decisive property of a lesion is that it is rare — a handful of tokens out of
~196. A set-level rank/coverage objective is therefore insensitive to it: a few tokens cannot
materially raise the retained set's effective rank, so the objective spends the budget on abundant
background directions and drops the lesion. This is a rarity mechanism, not an internal-geometry
one — and we checked: measured at the operating layer, lesion tokens are not low-rank relative to
background (pooled effective rank 339 vs 307; participation ratio 18.9 vs 13.9; within-image
internal-rank/m ≈ equal). Lesion tokens are in fact diverse; the set-coverage objective is blind to
them anyway because they are few. (The synthetic law of the companion paper reaches the same failure
via a genuinely low-rank signal; real lesions reach it via rarity — two routes to one principle.)
Rank coverage rewards the entropy of the retained set spectrum; lesion retention rewards mass on
the top membership tokens; these diverge whenever the critical signal is a rare cluster, of any
internal rank. For rare-pathology tasks, prefer concentration objectives (energy / membership
mass) over rank/spanning objectives (RankMe, coding rate, MCR²).
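The rarity argument can be illustrated on synthetic data. In the toy below, 3 of 196 tokens are shifted along one direction to mimic a rare, distinctive cluster; dropping them barely moves the effective rank of the retained set, even though lesion retention falls to zero. All sizes, the Gaussian features, and the shift magnitude are assumptions made for illustration, and this is not a reproduction of the measurements above.

```python
import numpy as np

def effective_rank(Z):
    """exp(entropy) of the normalized singular-value spectrum."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
n_tokens, dim, n_lesion = 196, 64, 3
Z = rng.normal(size=(n_tokens, dim))          # diverse background tokens
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
Z[:n_lesion] += 4.0 * direction               # rare, distinctive "lesion" tokens

full = effective_rank(Z)
without_lesion = effective_rank(Z[n_lesion:]) # retained set drops every lesion token
relative_change = abs(full - without_lesion) / full

print(f"effective rank with lesion tokens   : {full:.2f}")
print(f"effective rank without lesion tokens: {without_lesion:.2f}")
print(f"relative change: {relative_change:.3%}; lesion tokens retained: 0 of {n_lesion}")
```

A set-level objective that sees almost no penalty for this deletion has no reason to spend budget on the lesion tokens, whereas a membership-mass objective would register the loss directly.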
6.3 Convergent evidence
Three independent lines reach the same verdict: (a) the ablation above; (b) principled Gate-2 faithfulness: under a random-pruning protocol, coverage-drop predicts detection-drop no better than attention-drop (Spearman 0.480 vs 0.479; difference CI includes 0; both capped near 0.48 by small-lesion combinatorics, not by faithfulness); (c) the difficulty-adaptive budget never emerges: aggregate coverage is nearly identical on lesion-positive and lesion-negative slices (250.4 vs 247.2), since 1–3 patches cannot move an aggregate over ~196 tokens. The coverage-constraint machinery (dual, floor) is intact and stable as an optimizer (dual μ stabilizes; the floor is satisfied on 99% of cases); it simply optimizes the wrong quantity.
7. Related work and positioning
Prior label-free / medical token pruning (AFFMAE, PrATo/MedPruner, RankMe as a diagnostic, WERank) either treats rank as a monitor rather than a target, prunes by attention/labels, or is non-medical. We contribute: (i) the mid-layer localization finding; (ii) a label-free lesion subspace that transfers across modality and backbone; (iii) a conformal retention certificate; and (iv) a mechanistic negative result on rank-based coverage objectives. Note that (iv) cleanly separates this work from a companion representation-coverage probe study: if such a probe reads final-layer features, Finding 1 says it reads the wrong layer, so the two results reinforce rather than overlap.
Relation to label-free FM adaptation (FINO; Gardès et al., 2026). Concurrent work adapts vision foundation models to scientific domains without task labels by guiding a self-supervised objective with metadata, training the backbone. Our work is orthogonal and complementary: we keep the backbone frozen, use no metadata or labels, and contribute a geometric analysis (where the signal lives), a token economy (membership pruning, routed depth), a retention certificate, and a law on objective choice — none of which an adaptation method addresses. The two compose: our probe can be run on a FINO-adapted backbone to test whether metadata-guided adaptation preserves the mid-layer concentration subspace at depth and improves rare-signal separability (a question our training-free steering result, Sec. 6 / companion study, shows cannot be solved without a training signal). Their result is also indirect support for our mechanism: counteracting depth-globalization of informative local factors is plausibly part of why metadata guidance helps.
8. Limitations
- The method helps only where feature density localizes the lesion (liver = characterized failure); a deployment check on density-AUROC is required, with attention fallback.
- Faithfulness of coverage as a proxy is moderate, not tight, and not better than saliency under the random-pruning protocol.
- Pretraining-time application is untested (inference-time/fine-tuning only); the conformal guarantee assumes exchangeable calibration/test data.
9. Conclusion
The contribution is the label-free lesion subspace — a mid-layer geometry that localizes lesions without labels across modality and backbone — together with membership pruning, a conformal retention certificate, and lesion-routed depth. The coverage-constrained optimization we began with is reported as a clean negative whose mechanism (rank rewards spanning, rare pathology needs concentration) is a transferable caution for medical SSL.
Appendix A — Gate ledger (locked Phase-1b thresholds)
| Gate | Verdict | Key number |
|---|---|---|
| 0 reproducibility | PASS | frozen load, Δ=0, 2.1M-token bank |
| 1 subspace validity | PASS | density-A 0.871, +0.105 vs attention |
| 2 faithfulness | guard PASS; not superior | coverage 0.480 vs saliency 0.479 (tied) |
| 3 membership pruning > saliency | PASS | LIDC, KiTS23, BUSI (CT + ultrasound) |
| 4 coverage floor | NEGATIVE | floor 0.22 vs subspace 0.82 @0.25 |
| 5 invariance | FALLBACK | inference-time |
| 6 conformal retention cert. | PASS | empirical 0.978 ≥ 0.90 |
| 6 lesion-routed depth | PASS | 1.6× FLOPs @ 98% sensitivity |
| 6 volumetric | PARTIAL | ~2× at 82% lesion mass (tunable) |
Appendix B — Reproducibility
All experiments ran as Hugging Face Jobs (MedDINOv3 ricklisz123/MedDINOv3-ViTB-16-CT-3M,
DINOv2 facebook/dinov2-base). Artifacts (token banks, materialized masks, per-gate metrics) in
the processed/covtoken/ bucket; per-gate decision records in covtoken/gate_reports/; locked
thresholds in covtoken/configs/thresholds.lock.json.