
title: >-
  Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical
  Medical Imaging
status: working draft
date: 2026-06-20T00:00:00.000Z
backbones:
  - MedDINOv3 ViT-B/16 (CT-3M)
  - DINOv2-base

Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging

Abstract

We study label-free token pruning for medical imaging built on frozen self-supervised vision transformers. Our central object is a label-free lesion subspace: a geometric region of a frozen ViT's patch-token feature space, estimated without any lesion labels, in which lesion tokens are locally rare and distinctive. Four findings organize the paper. (1) Where to look. The lesion-localizable signal in frozen SSL ViTs lives in mid-layer, not final-layer, features: on lung CT, token-level lesion AUROC rises from 0.565 (final block) to 0.871 (block 3). (2) A label-free localizer that generalizes. A simple density estimate over a held-out token bank localizes lesions without labels across anatomies (CT: lung 0.87, pancreas 0.88, kidney 0.82) and across modalities and backbones: it reaches 0.73 on breast ultrasound with DINOv2, where attention saliency collapses to chance. (3) Pruning that protects small lesions. Pruning tokens by subspace membership beats attention-saliency pruning on small-lesion miss-rate by +14–28 points on lung CT and breast ultrasound (with smaller gains on kidney CT), and admits a per-image conformal retention certificate (empirical coverage 0.978 ≥ nominal 0.90) and a lesion-routed adaptive depth that cuts FLOPs 1.6× at 98% small-lesion sensitivity. (4) A negative result with a transferable mechanism. We set out to gate pruning with a coverage constraint: a floor on the effective rank (RankMe / coding rate) of the lesion subspace spanned by retained tokens, controlled by an interpretable dual variable. This fails: at a matched budget of 0.25, the coverage-constrained pruner retains 0.22 of small lesions versus 0.82 for plain membership ranking. The mechanism generalizes beyond our method: rank-based coverage objectives reward diverse subspace spanning, whereas rare small-region pathology requires concentration on a few high-membership tokens. Effective-rank coverage is therefore structurally mismatched to rare-lesion retention, a warning for the increasingly common use of RankMe-flavored objectives in medical SSL.

1. Introduction

Token pruning makes vision transformers cheaper, but in medical imaging the failure mode that matters is dropping the pathology. A tiny lung nodule or microcalcification occupies a handful of patches; a pruner optimized for throughput or generic saliency can discard exactly those.

We ask a narrower, label-free question: can a frozen SSL backbone tell us, without any labels, which tokens carry diagnostic signal — well enough to prune around them, certify the result, and adapt compute? Our answer is a label-free lesion subspace and the operations built on it.

We deliberately also report what did not work. Our original hypothesis was that pruning should be a constrained optimization — minimize tokens subject to a floor on lesion-subspace coverage, with an interpretable dual as the controller. That hypothesis is wrong, and wrong for an instructive reason we make precise. We treat the negative as a first-class result.

Contributions.

A mid-layer localization finding: lesion signal in frozen SSL ViTs is mid-layer, not final.
A label-free lesion subspace that localizes lesions across anatomy, modality, and backbone.
Subspace-membership pruning that beats saliency pruning on small-lesion miss-rate, with a conformal retention certificate and lesion-routed depth.
A negative result with a transferable mechanism: rank-based coverage objectives fail for rare-lesion retention.

2. Method

2.1 Setup

We use a frozen backbone and work with its patch-token features Z(x) = {z_1,...,z_n}, z_i ∈ R^d. For CT we use MedDINOv3 ViT-B/16 (CT-3M); for ultrasound, DINOv2-base (modality-agnostic), which shows that the method is not tied to one backbone. We extract mid-layer tokens (Sec. 4.1).

2.2 Label-free lesion subspace

We estimate, without labels, the region of feature space carrying diagnostic signal.

Construction A (density). Lesions are rare, so lesion tokens lie in locally sparse regions. Estimate token density via k-NN distance to a held-out token bank; the lesion-membership score is the mean k-NN distance (low density ⇒ high score). The candidate subspace L(x) is spanned by the low-density tokens.
Construction B (residual). Fit a low-rank normal-tissue subspace U by PCA on the bank; lesion-relevant tokens have high residual ‖(I-UU^T)z‖.

Both constructions are label-free. The held-out CT token bank holds 2.1M mid-layer tokens.
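
The two scores can be illustrated with a minimal sketch. It assumes Euclidean distance, a tiny in-memory bank, and an orthonormal basis for the normal-tissue subspace that has already been fitted (the PCA fit itself is omitted). The function names, the value of k, and the toy data are illustrative assumptions, not the implementation used in the experiments; a bank of 2.1M tokens would need a vectorized or approximate nearest-neighbor search.

```python
import math

def knn_density_score(token, bank, k=3):
    """Construction A: mean Euclidean distance to the k nearest bank tokens.

    A larger value means a sparser neighborhood, i.e. a higher membership score.
    """
    if not 1 <= k <= len(bank):
        raise ValueError("k must be between 1 and len(bank)")
    nearest = sorted(math.dist(token, b) for b in bank)[:k]
    return sum(nearest) / k

def residual_score(token, basis):
    """Construction B: norm of (I - U U^T) z for orthonormal vectors in `basis`."""
    residual = list(token)
    for u in basis:
        coeff = sum(t * ui for t, ui in zip(token, u))
        residual = [r - coeff * ui for r, ui in zip(residual, u)]
    return math.sqrt(sum(r * r for r in residual))

# Toy 2-D bank: a dense grid standing in for normal-tissue tokens (assumption).
bank = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
common_score = knn_density_score((0.25, 0.25), bank)  # inside the dense grid
rare_score = knn_density_score((3.0, 3.0), bank)      # far from every bank token

# Toy normal-tissue subspace: the first coordinate axis (assumption).
off_subspace = residual_score((3.0, 4.0), [(1.0, 0.0)])
```

In this toy setting the isolated token receives the larger density score, and the residual score keeps only the component orthogonal to the fitted subspace.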

2.3 Membership pruning, certificate, routing (what ships)

Lesion-subspace membership pruning. Retain the top-k tokens by membership score.
Conformal retention certificate. With split conformal on a calibration set, emit per image a distribution-free lower bound ℓ(x) on the fraction of lesion mass Y(x) retained under membership pruning, such that P(Y(x) ≥ ℓ(x)) ≥ 1-α. (This certifies lesion retention under the shipping policy, not any internal coverage statistic.)
Lesion-routed depth. Route tokens by membership at a mid block: high-membership tokens continue through full depth; the rest exit early.
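
A minimal sketch of the pruning and routing rules, assuming membership scores are already computed per token. The function names, the rounding of the budget to a token count, and the use of a fixed routing threshold are illustrative assumptions; the scores below are hypothetical.

```python
import math

def membership_prune(scores, budget):
    """Keep the top ceil(budget * n) tokens by membership score; return sorted indices."""
    n = len(scores)
    k = max(1, math.ceil(budget * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def route_by_membership(scores, threshold):
    """Split token indices into (full_depth, early_exit) at a membership threshold."""
    full_depth = [i for i, s in enumerate(scores) if s >= threshold]
    early_exit = [i for i, s in enumerate(scores) if s < threshold]
    return full_depth, early_exit

scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.4, 0.15]  # hypothetical membership scores
kept = membership_prune(scores, budget=0.25)
full_depth, early_exit = route_by_membership(scores, threshold=0.35)
```

Pruning discards the low-membership tokens outright, whereas routing keeps every token but lets the low-membership ones skip the remaining blocks.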

2.4 The coverage constraint (the hypothesis we falsify)

We define a coverage functional C(S;x) = effrank(P_L Z_S) (RankMe form, with a coding-rate surrogate to avoid SVD backprop) and pose pruning as min_m Σ m_i s.t. C*(x) - C(S;x) ≤ ε, with the Lagrangian dual μ learned by dual ascent and a Gumbel straight-through mask. Section 6 shows why this underperforms the simple membership rule of Sec. 2.3.
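
For orientation, the sketch below computes a RankMe-style effective rank from a given singular-value spectrum and applies one textbook projected dual-ascent update for a constraint of the form gap ≤ ε. It is a sketch under stated assumptions: the singular values are supplied directly (no SVD or coding-rate surrogate is shown), and the step size and numeric values are hypothetical.

```python
import math

def effective_rank(singular_values):
    """RankMe-style effective rank: exp of the entropy of L1-normalized singular values."""
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    entropy = -sum((s / total) * math.log(s / total) for s in positive)
    return math.exp(entropy)

def dual_ascent_step(mu, coverage_gap, eps, step_size):
    """Projected dual ascent: raise mu while the gap exceeds eps, and keep mu >= 0."""
    return max(0.0, mu + step_size * (coverage_gap - eps))

flat = effective_rank([1.0, 1.0, 1.0, 1.0])    # evenly spread spectrum
peaked = effective_rank([1.0, 0.0, 0.0, 0.0])  # a single dominant direction
mu_next = dual_ascent_step(mu=0.5, coverage_gap=0.3, eps=0.1, step_size=1.0)
```

An evenly spread spectrum attains the maximum effective rank (the number of directions), while a single dominant direction gives an effective rank of one.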

3. Experimental protocol (gated falsification)

Each claim is a gate with an explicit metric, comparator, threshold (calibrated in a locked Phase-1b step against the saliency/random baselines), and statistical test (DeLong for AUROC; paired bootstrap n=2000 for recall; Spearman with permutation for coupling). Masks are evaluation-only; no label touches subspace construction (enforced by a CI label-leak test). Datasets: LIDC-IDRI (lung CT), KiTS23 (kidney CT), MSD Task03 Liver, MSD Task07 Pancreas, BUSI (breast ultrasound). All compute ran as Hugging Face Jobs.
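
As an illustration of the recall test, the following is a minimal sketch of a paired percentile bootstrap over cases. The per-case hit vectors are hypothetical toy data, and the function name, seed, and 95% level are assumptions; only the resample count of 2000 comes from the protocol above.

```python
import random

def paired_bootstrap_ci(hits_a, hits_b, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for mean(hits_a) - mean(hits_b), resampling cases jointly."""
    if len(hits_a) != len(hits_b):
        raise ValueError("paired inputs must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-case detections (1 = small lesion retained) for two pruners.
hits_membership = [1] * 80 + [0] * 20
hits_saliency = [1] * 50 + [0] * 50
ci_low, ci_high = paired_bootstrap_ci(hits_membership, hits_saliency)
```

Resampling the per-case differences, rather than each method separately, preserves the pairing between the two pruners on the same images.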

4. Results: the label-free localizer

4.1 Lesion signal lives mid-layer (Finding 1)

Token-level lesion AUROC by depth (LIDC, density-A; Fig. 1):

layer	final (12)	block 6	block 4	block 3
AUROC	0.565	0.769	0.865	0.871

Final-layer features are tuned for the global self-distillation objective; the dense local lesion signal sits in intermediate blocks. We fix block 3 (MedDINOv3) as the operating layer; for DINOv2 the optimum is block 8. The best block is backbone-dependent, but it is always an intermediate block, never the final one. The curve is multi-seed stable (peak 0.866 ± 0.010, n=3), the operating layer is selectable without labels (tail-gap selector regret 0.006), and the depth-erosion holds across objectives; see the companion mechanism study.
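
Token-level AUROC here means the probability that a lesion token outscores a background token. A minimal sketch of that metric, assuming per-token membership scores and evaluation-only binary mask labels, is given below; the pairwise form is quadratic and only meant for illustration, and the function name is an assumption.

```python
def token_auroc(scores, is_lesion):
    """P(lesion-token score > background-token score), with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, is_lesion) if y]
    neg = [s for s, y in zip(scores, is_lesion) if not y]
    if not pos or not neg:
        raise ValueError("need at least one lesion and one background token")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four tokens under two labelings.
perfect = token_auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
partial = token_auroc([0.9, 0.8, 0.2, 0.1], [1, 0, 1, 0])
```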

4.2 Cross-anatomy, cross-modality, cross-backbone localization (Finding 2)

density-A token-level lesion AUROC, with attention-saliency as the label-free comparator:

dataset (modality, backbone)	density-A	attention	random
LIDC lung CT (MedDINOv3)	0.871	0.767	0.51
MSD pancreas CT (MedDINOv3)	0.876	0.920	0.49
KiTS23 kidney CT (MedDINOv3)	0.823	0.823	0.50
MSD liver CT (MedDINOv3)	0.670	0.756	0.50
BUSI breast US (DINOv2)	0.733	0.492	0.50

The subspace localizes lesions without labels across very different anatomies, two modalities, and two backbones. On ultrasound, attention is at chance, so the geometric subspace is the only label-free signal that works.

4.3 Precondition and characterized failure

The method's value tracks whether feature density localizes the lesion, not the modality. Liver (0.67) is the characterized failure: low-contrast tumors in heterogeneous parenchyma are not locally rare in feature space. Liver is the mirror image of ultrasound: on liver, attention (0.756) is the better localizer; on ultrasound it collapses (0.49). A density+attention hybrid does not rescue liver (0.713, between the two; the weak density signal drags down the stronger attention signal). Deployment rule: use the subspace where density-AUROC clears the floor, else fall back to attention.
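
The deployment rule can be written as a one-line check. In this sketch the AUROC values are the density-A numbers reported in Sec. 4.2, while the floor of 0.70 is a hypothetical placeholder (the text does not state the floor value) and the function name is illustrative.

```python
def choose_localizer(density_auroc, floor):
    """Apply the stated rule: subspace if density-AUROC clears the floor, else attention."""
    return "subspace" if density_auroc >= floor else "attention"

# Density-A AUROCs from Sec. 4.2; the floor value below is a hypothetical placeholder.
reported = {"lung": 0.871, "pancreas": 0.876, "kidney": 0.823,
            "liver": 0.670, "breast_us": 0.733}
choices = {name: choose_localizer(auroc, floor=0.70) for name, auroc in reported.items()}
```

With this placeholder floor, only liver falls back to attention.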

5. Results: pruning, certificate, routing

5.1 Membership pruning beats saliency pruning (Finding 3)

Small-lesion recall at matched token budget, membership pruning vs attention-saliency pruning (paired bootstrap CI excludes 0 throughout):

dataset	budget 0.25	budget 0.5
LIDC lung CT	+27.6 pts	+15.8 pts (89% miss-red)
KiTS23 kidney CT	+7.4 pts (40% miss-red)	+1.6 pts (91% miss-red)
BUSI breast US	+13.8 pts	+19.0 pts

Pancreas ties: tumors are large and salient (attention AUROC already 0.92), which is the safe regime where pruning is not a clinical risk. The gain is largest exactly where saliency fails (subtle lesions; ultrasound).

5.2 Conformal retention certificate

Multi-split split-conformal (50 resamples, pooled n=4352, α=0.1): empirical coverage 0.978 ≥ 0.90 — the per-image guarantee is valid. The certificate honestly exposes a budget↔guarantee tradeoff: ~100% guaranteed lesion retention at budget 0.5, and at 0.25 it correctly reports that the hardest ~10% of small-lesion cases cannot be guaranteed.
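
A minimal sketch of a one-sided split-conformal lower bound follows. It assumes a per-image predicted retention is available (a hypothetical quantity; the text does not specify how the per-image bound is formed) and uses the standard conformal quantile of the calibration scores, predicted minus observed. Names and numbers are illustrative, and validity relies on calibration and test images being exchangeable.

```python
import math

def conformal_offset(cal_predicted, cal_observed, alpha):
    """One-sided split-conformal quantile of the scores (predicted - observed)."""
    scores = sorted(p - y for p, y in zip(cal_predicted, cal_observed))
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))  # rank of the conformal quantile
    return scores[k - 1] if k <= n else math.inf

def retention_lower_bound(predicted, offset):
    """Per-image lower bound on the retained lesion fraction, clipped to [0, 1]."""
    return min(1.0, max(0.0, predicted - offset))

# Hypothetical calibration set: predictions of 1.0, observed retention 1.00, 0.99, ..., 0.82.
cal_predicted = [1.0] * 19
cal_observed = [1.0 - 0.01 * i for i in range(19)]
offset = conformal_offset(cal_predicted, cal_observed, alpha=0.1)
bound = retention_lower_bound(0.95, offset)
```

When the calibration set is too small for the requested α, the offset becomes infinite and the bound degrades to the trivial value 0, which mirrors the budget and guarantee tradeoff described above.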

5.3 Lesion-routed depth

Routing depth by membership yields a 1.6× FLOP reduction at 98.2% small-lesion sensitivity and dominates saliency routing at every retention level (saliency never reaches equal sensitivity at any FLOP saving). A volumetric two-level (slice+token) economy gives a further ~2× at a documented sensitivity cost (tunable deployment knob).

6. The negative result: rank-based coverage fails for rare pathology (Finding 4)

6.1 The ablation

Three pruning strategies, small lesions, matched budget:

budget	saliency	subspace-only (membership top-k)	subspace + coverage floor
0.25	0.521	0.817	0.219
0.50	0.827	0.981	0.460

Subspace-only beats saliency (+29.6 / +15.4 pts). The coverage floor is far worse than subspace-only (−0.60 / −0.52, CI excludes 0). The constraint does not add value — it removes it.
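
The quoted differences follow directly from the ablation table; the short check below recomputes them from the tabulated values (the dictionary keys are illustrative names).

```python
# Small-lesion retention from the ablation table (budget -> strategy -> value).
ablation = {
    0.25: {"saliency": 0.521, "subspace_only": 0.817, "coverage_floor": 0.219},
    0.50: {"saliency": 0.827, "subspace_only": 0.981, "coverage_floor": 0.460},
}

# Gain of subspace-only over saliency, in percentage points.
gain_over_saliency_pts = {
    b: round(100 * (row["subspace_only"] - row["saliency"]), 1)
    for b, row in ablation.items()
}
# Loss of the coverage floor relative to subspace-only, as a fraction.
floor_vs_subspace = {
    b: round(row["coverage_floor"] - row["subspace_only"], 2)
    for b, row in ablation.items()
}
```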

6.2 Mechanism (transferable)

C(S)=effrank(P_L Z_S) is maximized by a retained set that diversely spans the subspace's directions. The decisive property of a lesion is that it is rare — a handful of tokens out of ~196. A set-level rank/coverage objective is therefore insensitive to it: a few tokens cannot materially raise the retained set's effective rank, so the objective spends the budget on abundant background directions and drops the lesion. This is a rarity mechanism, not an internal-geometry one — and we checked: measured at the operating layer, lesion tokens are not low-rank relative to background (pooled effective rank 339 vs 307; participation ratio 18.9 vs 13.9; within-image internal-rank/m ≈ equal). Lesion tokens are in fact diverse; the set-coverage objective is blind to them anyway because they are few. (The synthetic law of the companion paper reaches the same failure via a genuinely low-rank signal; real lesions reach it via rarity — two routes to one principle.) Rank coverage rewards the entropy of the retained set spectrum; lesion retention rewards mass on the top membership tokens; these diverge whenever the critical signal is a rare cluster, of any internal rank. For rare-pathology tasks, prefer concentration objectives (energy / membership mass) over rank/spanning objectives (RankMe, coding rate, MCR2).
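
The rarity mechanism can be seen in a toy setting. The sketch below assumes every token is a unit vector along one of several orthogonal axes, so the singular values of a retained set are the square roots of the per-axis token counts; the three lesion tokens are diverse (one axis each) but few. The axis names, the membership scores of 1.0 and 0.1, and the budget of 49 tokens (0.25 × 196) are hypothetical choices for illustration, not measured quantities.

```python
import math
from collections import Counter

def effective_rank(singular_values):
    """exp of the entropy of the L1-normalized singular values (RankMe form)."""
    total = sum(singular_values)
    return math.exp(-sum((s / total) * math.log(s / total) for s in singular_values))

def toy_effective_rank(axes):
    """Tokens are unit vectors on orthogonal axes, so singular values are sqrt(counts)."""
    return effective_rank([math.sqrt(c) for c in Counter(axes).values()])

def membership_mass(axes):
    """Hypothetical membership scores: 1.0 for lesion tokens, 0.1 for background."""
    return sum(1.0 if a.startswith("lesion") else 0.1 for a in axes)

BUDGET = 49  # 0.25 * 196 tokens
lesion_axes = ["lesion_0", "lesion_1", "lesion_2"]   # few but diverse
background_axes = [f"bg_{i}" for i in range(64)]     # abundant directions

drops_lesion = background_axes[:BUDGET]                    # 49 background tokens
keeps_lesion = lesion_axes + background_axes[:BUDGET - 3]  # 3 lesion + 46 background

rank_drop, rank_keep = toy_effective_rank(drops_lesion), toy_effective_rank(keeps_lesion)
mass_drop, mass_keep = membership_mass(drops_lesion), membership_mass(keeps_lesion)
```

In this toy, both retained sets span 49 directions and tie on effective rank, so a rank objective has no reason to keep the lesion, whereas membership mass clearly prefers the set that retains it.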

6.3 Convergent evidence

Three independent lines reach the same verdict: (a) the ablation above; (b) principled Gate-2 faithfulness: under a random-pruning protocol, coverage-drop predicts detection-drop no better than attention-drop (Spearman 0.480 vs 0.479; difference CI includes 0; both capped near 0.48 by small-lesion combinatorics, not by faithfulness); (c) the difficulty-adaptive budget never emerges: aggregate coverage is nearly identical on lesion-positive vs lesion-negative slices (250.4 vs 247.2), since 1–3 patches cannot move an aggregate over ~196 tokens. The coverage constraint machinery (dual, floor) is intact and stable as an optimizer (dual μ stabilizes; the floor is satisfied on 99% of cases); it simply optimizes the wrong quantity.

7. Related work and positioning

Prior label-free / medical token pruning (AFFMAE, PrATo/MedPruner, RankMe as a diagnostic, WERank) either treats rank as a monitor rather than a target, prunes by attention/labels, or is non-medical. We contribute: (i) the mid-layer localization finding; (ii) a label-free lesion subspace that transfers across modality and backbone; (iii) a conformal retention certificate; and (iv) a mechanistic negative result on rank-based coverage objectives. Note that (iv) gives clean separation from a companion representation-coverage probe study: if such a probe reads final-layer features, Finding 1 says it reads the wrong layer, so the two results reinforce rather than overlap.

Relation to label-free FM adaptation (FINO; Gardès et al., 2026). Concurrent work adapts vision foundation models to scientific domains without task labels by guiding a self-supervised objective with metadata, training the backbone. Our work is orthogonal and complementary: we keep the backbone frozen, use no metadata or labels, and contribute a geometric analysis (where the signal lives), a token economy (membership pruning, routed depth), a retention certificate, and a law on objective choice — none of which an adaptation method addresses. The two compose: our probe can be run on a FINO-adapted backbone to test whether metadata-guided adaptation preserves the mid-layer concentration subspace at depth and improves rare-signal separability (a question our training-free steering result, Sec. 6 / companion study, shows cannot be solved without a training signal). Their result is also indirect support for our mechanism: counteracting depth-globalization of informative local factors is plausibly part of why metadata guidance helps.

8. Limitations

The method helps only where feature density localizes the lesion (liver = characterized failure); a deployment check on density-AUROC is required, with attention fallback.
Faithfulness of coverage as a proxy is moderate, not tight, and not better than saliency under the random-pruning protocol.
Pretraining-time application is untested (inference-time/fine-tuning only); the conformal guarantee assumes exchangeable calibration/test data.

9. Conclusion

The contribution is the label-free lesion subspace — a mid-layer geometry that localizes lesions without labels across modality and backbone — together with membership pruning, a conformal retention certificate, and lesion-routed depth. The coverage-constrained optimization we began with is reported as a clean negative whose mechanism (rank rewards spanning, rare pathology needs concentration) is a transferable caution for medical SSL.

Appendix A — Gate ledger (locked Phase-1b thresholds)

Gate	Verdict	Key number
0 reproducibility	PASS	frozen load, Δ=0, 2.1M-token bank
1 subspace validity	PASS	density-A 0.871, +0.105 vs attention
2 faithfulness	guard PASS; not superior	coverage 0.480 vs saliency 0.479 (tied)
3 membership pruning > saliency	PASS	LIDC, KiTS23, BUSI (CT + ultrasound)
4 coverage floor	NEGATIVE	floor 0.22 vs subspace 0.82 @0.25
5 invariance	FALLBACK	inference-time
6 conformal retention cert.	PASS	empirical 0.978 ≥ 0.90
6 lesion-routed depth	PASS	1.6× FLOPs @ 98% sensitivity
6 volumetric	PARTIAL	~2× at 82% lesion mass (tunable)

Appendix B — Reproducibility

All experiments ran as Hugging Face Jobs (MedDINOv3 ricklisz123/MedDINOv3-ViTB-16-CT-3M, DINOv2 facebook/dinov2-base). Artifacts (token banks, materialized masks, per-gate metrics) in the processed/covtoken/ bucket; per-gate decision records in covtoken/gate_reports/; locked thresholds in covtoken/configs/thresholds.lock.json.