---
title: "Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging"
status: working draft
date: 2026-06-20
backbones: [MedDINOv3 ViT-B/16 (CT-3M), DINOv2-base]
---
# Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging
## Abstract
We study label-free token pruning for medical imaging built on frozen self-supervised vision
transformers. Our central object is a **label-free lesion subspace**: a geometric region of a
frozen ViT's patch-token feature space, estimated without any lesion labels, in which lesion
tokens are locally rare/distinctive. Three findings organize the paper. (1) **Where to look.**
The lesion-localizable signal in frozen SSL ViTs lives in **mid-layer**, not final-layer,
features: on lung CT, token-level lesion AUROC rises from 0.565 (final block) to **0.871** (block
3). (2) **A label-free localizer that generalizes.** A simple density estimate over a held-out
token bank localizes lesions without labels across anatomies (lung 0.87, pancreas 0.88, kidney
0.82 CT) and **across modalities and backbones** — 0.73 on breast ultrasound with DINOv2, where
attention saliency collapses to chance. Pruning tokens by subspace membership improves small-lesion
recall over attention-saliency pruning by 14–28 points on lung CT and breast ultrasound, and admits
a per-image **conformal retention certificate** (empirical coverage 0.978 ≥ nominal 0.90) and a
**lesion-routed adaptive depth** that cuts FLOPs by 1.6× at 98% small-lesion sensitivity. (3) **A
negative result with a transferable mechanism.** We set out to gate pruning with a *coverage
constraint* — a floor on the effective rank (RankMe / coding rate) of the lesion subspace spanned
by retained tokens, controlled by an interpretable dual variable. This **fails**: at a matched
budget of 0.25, the coverage-constrained pruner retains 0.22 of small lesions, versus 0.82 for plain
membership ranking. The mechanism generalizes past our method: **rank-based coverage objectives
reward diverse subspace *spanning*, whereas rare small-region pathology requires *concentration*
on a few high-membership tokens.** Effective-rank coverage is therefore structurally mismatched
to rare-lesion retention — a warning for the increasingly common use of RankMe-flavored
objectives in medical SSL.
## 1. Introduction
Token pruning makes vision transformers cheaper, but in medical imaging the failure mode that
matters is dropping the pathology. A tiny lung nodule or microcalcification occupies a handful of
patches; a pruner optimized for throughput or generic saliency can discard exactly those.
We ask a narrower, label-free question: **can a frozen SSL backbone tell us, without any labels,
which tokens carry diagnostic signal — well enough to prune around them, certify the result, and
adapt compute?** Our answer is a *label-free lesion subspace* and the operations built on it.
We deliberately also report what did **not** work. Our original hypothesis was that pruning should
be a *constrained optimization* — minimize tokens subject to a floor on lesion-subspace coverage,
with an interpretable dual as the controller. That hypothesis is wrong, and wrong for an
instructive reason we make precise. We treat the negative as a first-class result.
**Contributions.**
1. A mid-layer localization finding: lesion signal in frozen SSL ViTs is mid-layer, not final.
2. A label-free lesion subspace that localizes lesions across anatomy, modality, and backbone.
3. Subspace-membership pruning that beats saliency pruning on small-lesion miss-rate, with a
conformal retention certificate and lesion-routed depth.
4. A negative result with a transferable mechanism: rank-based coverage objectives fail for
rare-lesion retention.
## 2. Method
### 2.1 Setup
We use a frozen backbone and its patch-token features `Z(x) = {z_1,...,z_n}`, `z_i ∈ R^d`. For CT
we use **MedDINOv3 ViT-B/16 (CT-3M)**; for ultrasound, **DINOv2-base** (modality-agnostic), so that
the results can test whether the method is backbone-specific. We extract **mid-layer** tokens
(Sec. 4.1).
### 2.2 Label-free lesion subspace
We estimate, without labels, the region of feature space carrying diagnostic signal.
- **Construction A (density).** Lesions are rare, so lesion tokens lie in locally sparse regions.
Estimate token density via k-NN distance to a held-out token bank; the lesion-membership score
is the mean k-NN distance (low density ⇒ high score). The candidate subspace `L(x)` is spanned
by the low-density tokens.
- **Construction B (residual).** Fit a low-rank normal-tissue subspace `U` by PCA on the bank;
lesion-relevant tokens have high residual `‖(I-UU^T)z‖`.
Both are label-free. The held-out CT token bank holds 2.1M mid-layer tokens.
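A minimal sketch of the two constructions on synthetic data, under stated assumptions: the neighbour count `k`, the PCA rank `rank`, Euclidean distance, and the toy "normal" and "lesion" tokens are all illustrative choices, not the settings used in the experiments.

```python
import numpy as np

def knn_density_score(tokens, bank, k=5):
    """Construction A: mean Euclidean distance to the k nearest bank tokens.

    A larger score means a sparser neighbourhood (low density => high score).
    """
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b, clipped at 0.
    sq = (
        (tokens ** 2).sum(axis=1, keepdims=True)
        + (bank ** 2).sum(axis=1)[None, :]
        - 2.0 * tokens @ bank.T
    )
    dist = np.sqrt(np.clip(sq, 0.0, None))
    # The first k columns after partitioning hold the k smallest distances.
    nearest = np.partition(dist, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)

def pca_residual_score(tokens, bank, rank=8):
    """Construction B: norm of the part of each token outside a PCA subspace of the bank."""
    mean = bank.mean(axis=0)
    # Right singular vectors of the centred bank are its principal directions.
    _, _, vt = np.linalg.svd(bank - mean, full_matrices=False)
    U = vt[:rank].T                       # (d, rank), orthonormal columns
    centred = tokens - mean
    residual = centred - (centred @ U) @ U.T
    return np.linalg.norm(residual, axis=1)

# Synthetic stand-ins (assumption): normal tissue lives near an 8-dim subspace,
# lesion tokens lie far outside it.
rng = np.random.default_rng(0)
d, r = 32, 8
basis, _ = np.linalg.qr(rng.normal(size=(d, r)))          # orthonormal columns
bank = rng.normal(size=(500, r)) @ basis.T + 0.05 * rng.normal(size=(500, d))
normal = rng.normal(size=(20, r)) @ basis.T + 0.05 * rng.normal(size=(20, d))
lesion = 10.0 * rng.normal(size=(3, d))
tokens = np.vstack([normal, lesion])

score_a = knn_density_score(tokens, bank, k=5)
score_b = pca_residual_score(tokens, bank, rank=8)
```

In this toy setting both scores rank the three synthetic lesion tokens above every normal token. The full pairwise-distance matrix is only practical for a small bank; a 2.1M-token bank would need a chunked or approximate nearest-neighbour search.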
### 2.3 Membership pruning, certificate, routing (what ships)
- **Lesion-subspace membership pruning.** Retain the top-k tokens by membership score.
- **Conformal retention certificate.** With split conformal on a calibration set, emit per image a
distribution-free lower bound `ℓ(x)` on the *fraction of lesion mass retained* `Y(x)` under
membership pruning: `P(Y(x) ≥ ℓ(x)) ≥ 1-α`. (Certifies lesion retention under the shipping policy,
not any internal coverage statistic.)
- **Lesion-routed depth.** Route tokens by membership at a mid block: high-membership tokens
continue through full depth; the rest exit early.
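The pruning and routing rules above can be sketched as follows. This assumes that "budget" means the fraction of tokens retained and that routing only needs two index sets; the function names are hypothetical and the block-wise forward pass is not shown.

```python
import numpy as np

def topk_keep_mask(scores, budget):
    """Boolean mask keeping the ceil(budget * n) highest-scoring tokens."""
    n = scores.shape[0]
    k = max(1, int(np.ceil(budget * n)))
    keep = np.zeros(n, dtype=bool)
    # argsort is ascending, so the last k indices are the top-k scores.
    keep[np.argsort(scores, kind="stable")[-k:]] = True
    return keep

def route_by_membership(scores, budget):
    """Split token indices into (full_depth, early_exit) by membership rank."""
    keep = topk_keep_mask(scores, budget)
    idx = np.arange(scores.shape[0])
    return idx[keep], idx[~keep]

# Toy membership scores for 8 tokens (made up).
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.05, 0.7, 0.4])
keep = topk_keep_mask(scores, budget=0.25)                 # keeps 2 of 8 tokens
full_depth, early_exit = route_by_membership(scores, budget=0.5)
```

Pruning and routing share the same ranking: pruning discards the low-membership tokens, while routing lets them exit early and keeps their mid-block features.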
### 2.4 The coverage constraint (the hypothesis we falsify)
We define a coverage functional `C(S;x) = effrank(P_L Z_S)` (RankMe form; coding-rate surrogate to
avoid SVD backprop), where `S = {i : m_i = 1}` is the retained set, and pose pruning as
`min_m Σ_i m_i s.t. C*(x) - C(S;x) ≤ ε`, with Lagrangian dual `μ` learned by dual ascent and a
Gumbel straight-through mask. Section 6 shows why this underperforms the simple membership rule of
Sec. 2.3.
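For reference, the two quantities named here are usually written as below. This assumes the standard RankMe and MCR² definitions; the exact normalization used in the experiments may differ. The symbols `M` (rows are the retained tokens projected onto the lesion subspace) and `\delta` (the coding-rate distortion parameter, renamed to avoid a clash with the slack `ε`) are notation introduced only for this note.

```latex
M = P_L Z_S \in \mathbb{R}^{|S| \times d}, \qquad
p_k = \frac{\sigma_k(M)}{\sum_j \sigma_j(M)}, \qquad
\operatorname{effrank}(M) = \exp\Big(-\sum_k p_k \log p_k\Big)

R_\delta(M) = \tfrac{1}{2} \log\det\Big(I_d + \frac{d}{|S|\,\delta^2}\, M^\top M\Big)
```

Both quantities grow when the retained tokens spread their energy over many directions. Both are functions of the spectrum of the retained set as a whole, with no term tied to any individual token.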
## 3. Experimental protocol (gated falsification)
Each claim is a gate with an explicit metric, comparator, threshold (calibrated in a locked
Phase-1b step against the saliency/random baselines), and statistical test (DeLong for AUROC;
paired bootstrap n=2000 for recall; Spearman with permutation for coupling). Masks are
**evaluation-only**; no label touches subspace construction (enforced by a CI label-leak test).
Datasets: LIDC-IDRI (lung CT), KiTS23 (kidney CT), MSD Task03 Liver, MSD Task07 Pancreas, BUSI
(breast ultrasound). All compute ran as Hugging Face Jobs.
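A minimal sketch of the paired bootstrap for a recall difference, assuming a percentile interval and per-case 0/1 recall; the data, the function name, and the interval type are illustrative assumptions, and only the resample count (2000) comes from the protocol above.

```python
import random

def paired_bootstrap_ci(recall_a, recall_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(recall_a - recall_b), resampling cases jointly."""
    assert len(recall_a) == len(recall_b)
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(recall_a, recall_b)]
    n = len(diffs)
    boot_means = []
    for _ in range(n_boot):
        # Resampling per-case differences keeps the two arms paired.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi

# Toy per-case small-lesion recall (1 = retained, 0 = missed); values are made up.
membership = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
saliency   = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1]
mean_diff, lo, hi = paired_bootstrap_ci(membership, saliency)
```

A gate of the form "CI excludes 0" then reduces to checking `lo > 0` for the method-minus-comparator difference.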
## 4. Results: the label-free localizer
### 4.1 Lesion signal lives mid-layer (Finding 1)
Token-level lesion AUROC by depth (LIDC, density-A; Fig. 1):
| layer | final (12) | block 6 | block 4 | block 3 |
|---|---|---|---|---|
| AUROC | 0.565 | 0.769 | 0.865 | **0.871** |
Final-layer features are tuned for the global self-distillation objective; the dense local lesion
signal sits mid/early. We fix block 3 (MedDINOv3) as the operating layer; for DINOv2 the optimum
is block 8. The best block is backbone-dependent, but always intermediate, never final. The curve
is multi-seed stable (peak 0.866 ± 0.010, n=3), the operating layer is selectable without labels
(tail-gap selector regret 0.006), and the depth-erosion holds across objectives (see the companion
mechanism study).
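The evaluation-side quantity in the table, token-level AUROC per layer, can be sketched as below. This is a small pairwise implementation for illustration; it uses evaluation masks and is not the label-free tail-gap selector, whose definition is not given here. The scores and layer keys are made up.

```python
def token_auroc(scores, is_lesion):
    """AUROC as P(score_lesion > score_background), ties counted as 1/2."""
    pos = [s for s, y in zip(scores, is_lesion) if y]
    neg = [s for s, y in zip(scores, is_lesion) if not y]
    if not pos or not neg:
        raise ValueError("need at least one lesion and one background token")
    wins = 0.0
    for p in pos:
        for q in neg:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos) * len(neg))

def best_layer(scores_by_layer, is_lesion):
    """Return (layer, auroc) for the layer with the highest token-level AUROC."""
    aurocs = {layer: token_auroc(s, is_lesion) for layer, s in scores_by_layer.items()}
    layer = max(aurocs, key=aurocs.get)
    return layer, aurocs[layer]

is_lesion = [0, 0, 0, 0, 1, 1]
scores_by_layer = {
    3:  [0.1, 0.2, 0.3, 0.4, 0.9, 0.8],   # lesion tokens rank highest
    12: [0.5, 0.9, 0.1, 0.6, 0.7, 0.2],   # mixed ranking
}
layer, auroc = best_layer(scores_by_layer, is_lesion)
```

The pairwise loop costs O(P·N) per image, which is fine for a few hundred tokens; a rank-based formula would be preferable when pooling tokens across a dataset.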
### 4.2 Cross-anatomy, cross-modality, cross-backbone localization (Finding 2)
Token-level lesion AUROC for density-A, with attention saliency as the label-free comparator:
| dataset (modality, backbone) | density-A | attention | random |
|---|---|---|---|
| LIDC lung CT (MedDINOv3) | **0.871** | 0.767 | 0.51 |
| MSD pancreas CT (MedDINOv3) | 0.876 | 0.920 | 0.49 |
| KiTS23 kidney CT (MedDINOv3) | 0.823 | 0.823 | 0.50 |
| MSD liver CT (MedDINOv3) | 0.670 | 0.756 | 0.50 |
| BUSI breast US (DINOv2) | **0.733** | 0.492 | 0.50 |
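The subspace localizes lesions without labels across very different anatomies, two modalities, and
two backbones, although it does not always beat attention: attention is ahead on pancreas and
liver, and the two tie on kidney. On ultrasound, attention is at chance, and the geometric subspace
is the *only* label-free signal tested that works.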
### 4.3 Precondition and characterized failure
The method's value tracks **whether feature density localizes the lesion**, not the modality.
Liver (0.67) is the characterized failure: low-contrast tumors in heterogeneous parenchyma are not
locally rare in feature space. Liver is the *mirror image* of ultrasound: on liver, attention
(0.756) is the better localizer; on ultrasound it collapses (0.49). A density+attention hybrid
does **not** rescue liver (0.713, between the two; the weak density signal drags down the stronger
attention signal). Deployment rule: use the subspace where density-AUROC clears a preset floor,
else fall back to attention.
## 5. Results: pruning, certificate, routing
### 5.1 Membership pruning beats saliency pruning (Finding 3)
Gain in small-lesion recall at matched token budget, membership pruning minus attention-saliency
pruning, in percentage points; "miss-red" is the relative reduction in miss rate (paired bootstrap
CI excludes 0 throughout):
| dataset | budget 0.25 | budget 0.5 |
|---|---|---|
| LIDC lung CT | +27.6 pts | +15.8 pts (89% miss-red) |
| KiTS23 kidney CT | +7.4 pts (40% miss-red) | +1.6 pts (91% miss-red) |
| BUSI breast US | +13.8 pts | +19.0 pts |
Pancreas ties — tumors are large and salient (attention already 0.92), the safe regime where
pruning is not a clinical risk. The gain is largest exactly where saliency fails (subtle lesions;
ultrasound).
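As a worked check on how recall gains and miss-rate reductions relate, the sketch below assumes "miss-red" means the relative reduction in miss rate (1 − recall). Applied to the Sec. 6.1 recalls at budget 0.5 (0.827 for saliency, 0.981 for membership), it gives about 89%, which matches the figure in the LIDC row, so that reading seems plausible. The function name is hypothetical.

```python
def miss_rate_reduction(recall_baseline, recall_method):
    """Relative reduction in miss rate when moving from baseline to method."""
    miss_base = 1.0 - recall_baseline
    miss_new = 1.0 - recall_method
    if miss_base <= 0.0:
        raise ValueError("baseline misses nothing; reduction is undefined")
    return 1.0 - miss_new / miss_base

# Sec. 6.1 recalls at budget 0.5: saliency 0.827, membership top-k 0.981.
reduction = miss_rate_reduction(0.827, 0.981)
gain_pts = 100.0 * (0.981 - 0.827)
```

The same absolute gain can correspond to very different relative reductions: a gain of a point or two removes most of the misses when the baseline already retains nearly everything.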
### 5.2 Conformal retention certificate
Multi-split split-conformal (50 resamples, pooled n=4352, α=0.1): empirical coverage **0.978 ≥
0.90** — the per-image guarantee is valid. The certificate honestly exposes a budget↔guarantee
tradeoff: ~100% guaranteed lesion retention at budget 0.5, and at 0.25 it correctly reports that
the hardest ~10% of small-lesion cases cannot be guaranteed.
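One common way to obtain such a per-image lower bound is to conformalize a predictor of retained lesion fraction. The sketch below assumes a hypothetical label-free predictor `pred` and the score `pred - true`; the actual score function behind the certificate is not specified here, and the calibration numbers are made up.

```python
import math

def conformal_offset(pred_cal, true_cal, alpha=0.1):
    """Split-conformal quantile of over-prediction scores s_i = pred_i - true_i."""
    scores = sorted(p - t for p, t in zip(pred_cal, true_cal))
    n = len(scores)
    k = math.ceil((1 - alpha) * (n + 1))    # finite-sample corrected rank
    if k > n:
        return math.inf                     # too few calibration points for this alpha
    return scores[k - 1]

def retention_lower_bound(pred, offset):
    """Per-image lower bound on retained lesion fraction, clipped to [0, 1]."""
    return min(1.0, max(0.0, pred - offset))

# Made-up calibration data: predicted vs. mask-measured retained lesion fraction.
pred_cal = [0.95, 0.90, 0.85, 0.99, 0.70, 0.92, 0.88, 0.97, 0.80, 0.93,
            0.91, 0.89, 0.96, 0.75, 0.94, 0.98, 0.87, 0.90, 0.86, 0.95]
true_cal = [0.93, 0.91, 0.80, 0.99, 0.60, 0.90, 0.88, 0.95, 0.82, 0.90,
            0.90, 0.85, 0.96, 0.70, 0.95, 0.97, 0.80, 0.89, 0.86, 0.90]
offset = conformal_offset(pred_cal, true_cal, alpha=0.1)
bound = retention_lower_bound(0.92, offset)
```

Under exchangeable calibration and test data, the true retained fraction falls below `bound` with probability at most `alpha`. The guarantee is marginal over draws of the calibration set and test image, which is why the empirical check pools many resamples.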
### 5.3 Lesion-routed depth
Routing depth by membership yields a **1.6× FLOP reduction at 98.2% small-lesion sensitivity** and
dominates saliency routing at every retention level (saliency never reaches equal sensitivity at
any FLOP saving). A volumetric two-level (slice+token) economy gives a further ~2× at a documented
sensitivity cost (tunable deployment knob).
## 6. The negative result: rank-based coverage fails for rare pathology (Finding 4)
### 6.1 The ablation
Three pruning strategies, small lesions, matched budget:
| budget | saliency | subspace-only (membership top-k) | subspace + coverage floor |
|---|---|---|---|
| 0.25 | 0.521 | **0.817** | **0.219** |
| 0.50 | 0.827 | **0.981** | **0.460** |
Subspace-only beats saliency (+29.6 / +15.4 pts). The coverage floor is **far worse** than
subspace-only (−0.60 / −0.52, CI excludes 0). The constraint does not add value — it removes it.
### 6.2 Mechanism (transferable)
`C(S)=effrank(P_L Z_S)` is maximized by a retained set that **diversely spans** the subspace's
directions. The decisive property of a lesion is that it is **rare** — a handful of tokens out of
~196. A set-level rank/coverage objective is therefore *insensitive* to it: a few tokens cannot
materially raise the retained set's effective rank, so the objective spends the budget on abundant
background directions and drops the lesion. This is a **rarity** mechanism, not an internal-geometry
one — and we checked: measured at the operating layer, lesion tokens are *not* low-rank relative to
background (pooled effective rank 339 vs 307; participation ratio 18.9 vs 13.9; within-image
internal-rank/m ≈ equal). Lesion tokens are in fact diverse; the set-coverage objective is blind to
them anyway because they are few. (The synthetic law of the companion paper reaches the same failure
via a genuinely low-rank signal; real lesions reach it via rarity — two routes to one principle.)
Rank coverage rewards the entropy of the retained *set* spectrum; lesion retention rewards mass on
the top membership tokens; these diverge whenever the critical signal is a **rare** cluster, of any
internal rank. **For rare-pathology tasks, prefer concentration objectives (energy / membership
mass) over rank/spanning objectives (RankMe, coding rate, MCR2).**
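The rarity mechanism can be illustrated on synthetic tokens. In the sketch below, three "lesion" tokens are as diverse as the background but few in number. Two retained sets of equal size, one keeping the lesion and one dropping it, then have nearly the same effective rank but clearly different membership mass. All data, sizes, and scores are synthetic assumptions, and the effective rank is computed on raw tokens rather than on a projected subspace.

```python
import numpy as np

def effective_rank(tokens, eps=1e-12):
    """exp(entropy) of the L1-normalised singular values (RankMe form)."""
    s = np.linalg.svd(tokens, compute_uv=False)
    p = s / (s.sum() + eps) + eps
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
n, d, n_lesion, budget = 196, 64, 3, 49

background = rng.normal(size=(n - n_lesion, d))
# Lesion tokens: isotropic (diverse) like background, only rare.
lesion = 1.5 * rng.normal(size=(n_lesion, d))
tokens = np.vstack([background, lesion])
# Synthetic membership: lesion tokens score highest.
membership = np.concatenate(
    [rng.uniform(0.0, 0.5, n - n_lesion), np.full(n_lesion, 1.0)]
)

order = np.argsort(membership)                       # ascending
topk = order[-budget:]                               # keeps the 3 lesion tokens
no_lesion = order[-(budget + n_lesion):-n_lesion]    # same size, lesion dropped

rank_with = effective_rank(tokens[topk])
rank_without = effective_rank(tokens[no_lesion])
mass_with = float(membership[topk].sum())
mass_without = float(membership[no_lesion].sum())

lesion_idx = np.arange(n - n_lesion, n)
lesion_kept_topk = int(np.isin(lesion_idx, topk).sum())
lesion_kept_alt = int(np.isin(lesion_idx, no_lesion).sum())
```

A rank objective has little reason to prefer one set over the other, while membership mass separates them, which is the concentration-versus-spanning distinction stated above.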
### 6.3 Convergent evidence
Three independent lines reach the same verdict: (a) the ablation above; (b) principled Gate-2
faithfulness — under a random-pruning protocol, coverage-drop predicts detection-drop no better
than attention-drop (Spearman 0.480 vs 0.479; difference CI includes 0; both capped near 0.48 by
small-lesion combinatorics, not by faithfulness); (c) the difficulty-adaptive budget never
emerges: aggregate coverage is nearly identical on lesion-positive and lesion-negative slices
(250.4 vs 247.2), since 1–3 patches cannot move an aggregate over ~196 tokens. The coverage
*constraint machinery*
(dual, floor) is intact and stable as an optimizer (dual μ stabilizes; the floor is satisfied on
99% of cases) — it simply optimizes the wrong quantity.
## 7. Related work and positioning
Prior label-free / medical token pruning (AFFMAE, PrATo/MedPruner, RankMe as a diagnostic, WERank)
either treats rank as a monitor rather than a target, prunes by attention/labels, or is non-
medical. We contribute: (i) the mid-layer localization finding; (ii) a label-free lesion subspace
that transfers across modality and backbone; (iii) a conformal retention certificate; and (iv) a
mechanistic negative result on rank-based coverage objectives. Note (iv) gives clean separation
from a companion representation-coverage probe study: if such a probe reads final-layer features,
Finding 1 says it reads the wrong layer — the two results reinforce rather than overlap.
**Relation to label-free FM adaptation (FINO; Gardès et al., 2026).** Concurrent work adapts vision
foundation models to scientific domains *without task labels* by guiding a self-supervised objective
with **metadata**, training the backbone. Our work is orthogonal and complementary: we keep the
backbone **frozen**, use **no metadata or labels**, and contribute a geometric *analysis* (where the
signal lives), a token *economy* (membership pruning, routed depth), a retention *certificate*, and
a *law* on objective choice — none of which an adaptation method addresses. The two compose: our
probe can be run on a FINO-adapted backbone to test whether metadata-guided adaptation preserves the
mid-layer concentration subspace at depth and improves rare-signal separability (a question our
training-free steering result, Sec. 6 / companion study, shows cannot be solved without a training
signal). Their result is also indirect support for our mechanism: counteracting depth-globalization
of informative local factors is plausibly part of why metadata guidance helps.
## 8. Limitations
- The method helps only where feature density localizes the lesion (liver = characterized
failure); a deployment check on density-AUROC is required, with attention fallback.
- Faithfulness of coverage as a proxy is moderate, not tight, and not better than saliency under
the random-pruning protocol.
- Pretraining-time application is untested (inference-time/fine-tuning only); the conformal
guarantee assumes exchangeable calibration/test data.
## 9. Conclusion
The contribution is the **label-free lesion subspace** — a mid-layer geometry that localizes
lesions without labels across modality and backbone — together with membership pruning, a conformal
retention certificate, and lesion-routed depth. The coverage-constrained optimization we began with
is reported as a clean negative whose mechanism (rank rewards spanning, rare pathology needs
concentration) is a transferable caution for medical SSL.
---
### Appendix A — Gate ledger (locked Phase-1b thresholds)
| Gate | Verdict | Key number |
|---|---|---|
| 0 reproducibility | PASS | frozen load, Δ=0, 2.1M-token bank |
| 1 subspace validity | PASS | density-A 0.871, +0.105 vs attention |
| 2 faithfulness | guard PASS; not superior | coverage 0.480 vs saliency 0.479 (tied) |
| 3 membership pruning > saliency | PASS | LIDC, KiTS23, BUSI (CT + ultrasound) |
| 4 coverage floor | NEGATIVE | floor 0.22 vs subspace 0.82 @0.25 |
| 5 invariance | FALLBACK | inference-time |
| 6 conformal retention cert. | PASS | empirical 0.978 ≥ 0.90 |
| 6 lesion-routed depth | PASS | 1.6× FLOPs @ 98% sensitivity |
| 6 volumetric | PARTIAL | ~2× at 82% lesion mass (tunable) |
### Appendix B — Reproducibility
All experiments ran as Hugging Face Jobs (MedDINOv3 `ricklisz123/MedDINOv3-ViTB-16-CT-3M`,
DINOv2 `facebook/dinov2-base`). Artifacts (token banks, materialized masks, per-gate metrics) in
the `processed/covtoken/` bucket; per-gate decision records in `covtoken/gate_reports/`; locked
thresholds in `covtoken/configs/thresholds.lock.json`.