---
title: "Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging"
status: working draft
date: 2026-06-20
backbones: [MedDINOv3 ViT-B/16 (CT-3M), DINOv2-base]
---
# Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging
## Abstract
We study label-free token pruning for medical imaging built on frozen self-supervised vision
transformers. Our central object is a **label-free lesion subspace**: a geometric region of a
frozen ViT's patch-token feature space, estimated without any lesion labels, in which lesion
tokens are locally rare/distinctive. Four findings organize the paper. (1) **Where to look.**
The lesion-localizable signal in frozen SSL ViTs lives in **mid-layer**, not final-layer,
features: on lung CT, token-level lesion AUROC rises from 0.565 (final block) to **0.871** (block
3). (2) **A label-free localizer that generalizes.** A simple density estimate over a held-out
token bank localizes lesions without labels across anatomies (lung 0.87, pancreas 0.88, kidney
0.82 CT) and **across modalities and backbones**: 0.73 on breast ultrasound with DINOv2, where
attention saliency collapses to chance. (3) **Pruning that protects small lesions.** Pruning
tokens by subspace membership beats attention-saliency pruning on small-lesion recall by 14–28
points on lung CT and breast ultrasound (with smaller gains on kidney CT), and admits
a per-image **conformal retention certificate** (empirical coverage 0.978 ≥ nominal 0.90) and a
**lesion-routed adaptive depth** that cuts FLOPs by 1.6× at 98% small-lesion sensitivity. (4) **A
negative result with a transferable mechanism.** We set out to gate pruning with a *coverage
constraint*: a floor on the effective rank (RankMe / coding rate) of the lesion subspace spanned
by retained tokens, controlled by an interpretable dual variable. This **fails**: at a matched
budget of 0.25, the coverage-constrained pruner retains 0.22 of small lesions, versus 0.82 for plain
membership ranking. The mechanism generalizes past our method: **rank-based coverage objectives
reward diverse subspace *spanning*, whereas rare small-region pathology requires *concentration*
on a few high-membership tokens.** Effective-rank coverage is therefore structurally mismatched
to rare-lesion retention — a warning for the increasingly common use of RankMe-flavored
objectives in medical SSL.
## 1. Introduction
Token pruning makes vision transformers cheaper, but in medical imaging the failure mode that
matters is dropping the pathology. A tiny lung nodule or microcalcification occupies a handful of
patches; a pruner optimized for throughput or generic saliency can discard exactly those.
We ask a narrower, label-free question: **can a frozen SSL backbone tell us, without any labels,
which tokens carry diagnostic signal — well enough to prune around them, certify the result, and
adapt compute?** Our answer is a *label-free lesion subspace* and the operations built on it.
We deliberately also report what did **not** work. Our original hypothesis was that pruning should
be a *constrained optimization* — minimize tokens subject to a floor on lesion-subspace coverage,
with an interpretable dual as the controller. That hypothesis is wrong, and wrong for an
instructive reason we make precise. We treat the negative as a first-class result.
**Contributions.**
1. A mid-layer localization finding: lesion signal in frozen SSL ViTs is mid-layer, not final.
2. A label-free lesion subspace that localizes lesions across anatomy, modality, and backbone.
3. Subspace-membership pruning that beats saliency pruning on small-lesion miss-rate, with a
conformal retention certificate and lesion-routed depth.
4. A negative result with a transferable mechanism: rank-based coverage objectives fail for
rare-lesion retention.
## 2. Method
### 2.1 Setup
We keep the backbone frozen and write its patch-token features as `Z(x) = {z_1,...,z_n}`,
`z_i ∈ R^d`. For CT we use **MedDINOv3 ViT-B/16 (CT-3M)**; for ultrasound we use **DINOv2-base**
(modality-agnostic), which shows that the method is not backbone-specific. We extract
**mid-layer** tokens (Sec. 4.1).
### 2.2 Label-free lesion subspace
We estimate, without labels, the region of feature space carrying diagnostic signal.
- **Construction A (density).** Lesions are rare, so lesion tokens lie in locally sparse regions.
Estimate token density via k-NN distance to a held-out token bank; the lesion-membership score
is the mean k-NN distance (low density ⇒ high score). The candidate subspace `L(x)` is spanned
by the low-density tokens.
- **Construction B (residual).** Fit a low-rank normal-tissue subspace `U` by PCA on the bank;
lesion-relevant tokens have high residual `‖(I-UU^T)z‖`.
Both are label-free. The held-out CT token bank holds 2.1M mid-layer tokens.
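
The two constructions can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the pipeline used for the reported numbers: the function names, `k=5`, `rank=4`, the bank-mean centering in Construction B, and the synthetic Gaussian data are all hypothetical choices, and a brute-force distance matrix stands in for whatever index a 2.1M-token bank would need in practice.

```python
import numpy as np

def knn_density_score(tokens, bank, k=5):
    """Construction A: mean Euclidean distance to the k nearest bank tokens.

    Low density (large distance) gives a high lesion-membership score.
    """
    # Squared distances via ||a||^2 + ||b||^2 - 2 a.b, clipped at 0 for round-off.
    d2 = (
        (tokens ** 2).sum(axis=1, keepdims=True)
        + (bank ** 2).sum(axis=1)[None, :]
        - 2.0 * tokens @ bank.T
    )
    d2 = np.maximum(d2, 0.0)
    # After partitioning at index k-1, the first k columns hold the k smallest values.
    nearest = np.partition(d2, k - 1, axis=1)[:, :k]
    return np.sqrt(nearest).mean(axis=1)

def pca_residual_score(tokens, bank, rank=4):
    """Construction B: norm of the residual outside a rank-`rank` PCA subspace.

    Assumption: tokens are centered with the bank mean before projection.
    """
    mean = bank.mean(axis=0)
    _, _, vt = np.linalg.svd(bank - mean, full_matrices=False)
    U = vt[:rank].T                      # (d, rank) orthonormal basis
    centered = tokens - mean
    residual = centered - (centered @ U) @ U.T
    return np.linalg.norm(residual, axis=1)

# Synthetic stand-in data: "normal" tokens live in a 4-dim subspace of R^16,
# and two "rare" tokens are pushed far off that subspace.
rng = np.random.default_rng(0)
d = 16
basis = rng.normal(size=(4, d))
bank = rng.normal(size=(500, 4)) @ basis
normal = rng.normal(size=(10, 4)) @ basis
rare = normal[:2] + 30.0 * rng.normal(size=(2, d))
tokens = np.vstack([normal, rare])

score_a = knn_density_score(tokens, bank, k=5)
score_b = pca_residual_score(tokens, bank, rank=4)
```

In this toy setup both scores rank the two off-subspace tokens above every in-subspace token, which is the behavior the constructions rely on.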
### 2.3 Membership pruning, certificate, routing (what ships)
- **Lesion-subspace membership pruning.** Retain the top-k tokens by membership score.
- **Conformal retention certificate.** With split conformal on a calibration set, emit per image a
distribution-free lower bound `ℓ(x)` on the *fraction of lesion mass retained* `Y(x)` under
membership pruning: `P(Y(x) ≥ ℓ(x)) ≥ 1-α`. (This certifies lesion retention under the shipping
policy, not any internal coverage statistic.)
- **Lesion-routed depth.** Route tokens by membership at a mid block: high-membership tokens
continue through full depth; the rest exit early.
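
As a rough sketch of the membership rule, the snippet below builds a top-k keep mask from membership scores and measures the fraction of lesion tokens it retains. The helper names, the `ceil` rounding of the budget, and the convention of returning 1.0 for lesion-free images are assumptions made for illustration; the same boolean mask could serve as the route/exit decision for lesion-routed depth.

```python
import numpy as np

def topk_keep_mask(scores, budget):
    """Boolean mask keeping the ceil(budget * n) highest-scoring tokens."""
    n = scores.shape[0]
    k = max(1, int(np.ceil(budget * n)))
    keep = np.zeros(n, dtype=bool)
    keep[np.argsort(scores)[-k:]] = True   # argsort is ascending, so take the tail
    return keep

def lesion_mass_retained(keep, lesion_mask):
    """Fraction of lesion tokens kept (evaluation only; uses the mask labels)."""
    total = lesion_mask.sum()
    # Assumed convention: an image with no lesion tokens counts as fully retained.
    return float((keep & lesion_mask).sum() / total) if total > 0 else 1.0

# Eight tokens, budget 0.25 -> keep the two highest membership scores.
scores = np.array([0.10, 0.90, 0.20, 0.80, 0.30, 0.05, 0.40, 0.15])
keep = topk_keep_mask(scores, budget=0.25)

lesion_hit = np.zeros(8, dtype=bool)
lesion_hit[[1, 3]] = True                  # both lesion tokens score highest
lesion_partial = np.zeros(8, dtype=bool)
lesion_partial[[1, 6]] = True              # one lesion token falls below the cut

retained_hit = lesion_mass_retained(keep, lesion_hit)
retained_partial = lesion_mass_retained(keep, lesion_partial)
```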
### 2.4 The coverage constraint (the hypothesis we falsify)
We define a coverage functional `C(S;x) = effrank(P_L Z_S)` (RankMe form; coding-rate surrogate to
avoid SVD backprop) and pose pruning as `min_m Σ m_i s.t. C*(x) - C(S;x) ≤ ε`, with Lagrangian
dual `μ` learned by dual ascent and a Gumbel straight-through mask. Section 6 shows why this
underperforms the simple membership rule of Sec. 2.3.
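
For reference, the standard forms of the two quantities named above are written out below. The draft does not state the exact variants used, so these are the textbook RankMe and MCR²-style definitions, given as an assumption; `M`, `m`, `\epsilon_0`, and `\delta` are notation introduced here (chosen to avoid clashing with the constraint slack `ε`).

```latex
% Assumed notation: M = P_L Z_S \in \mathbb{R}^{m \times d} stacks the m retained tokens
% after projection onto the lesion subspace; \sigma_k(M) are its singular values.
p_k = \frac{\sigma_k(M)}{\sum_{j} \sigma_j(M)} + \epsilon_0,
\qquad
\operatorname{effrank}(M) = \exp\!\Big(-\sum_{k} p_k \log p_k\Big)

% Coding-rate surrogate (MCR^2 form) with distortion \delta;
% differentiable through \log\det, so no SVD backprop is needed.
R_\delta(M) = \tfrac{1}{2}\,\log\det\!\Big(I_d + \frac{d}{m\,\delta^{2}}\, M^{\top} M\Big)
```

Both quantities grow when the retained tokens spread energy evenly over many directions, which is the property Section 6 examines.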
## 3. Experimental protocol (gated falsification)
Each claim is a gate with an explicit metric, comparator, threshold (calibrated in a locked
Phase-1b step against the saliency/random baselines), and statistical test (DeLong for AUROC;
paired bootstrap n=2000 for recall; Spearman with permutation for coupling). Masks are
**evaluation-only**; no label touches subspace construction (enforced by a CI label-leak test).
Datasets: LIDC-IDRI (lung CT), KiTS23 (kidney CT), MSD Task03 Liver, MSD Task07 Pancreas, BUSI
(breast ultrasound). All compute ran as Hugging Face Jobs.
## 4. Results: the label-free localizer
### 4.1 Lesion signal lives mid-layer (Finding 1)
Token-level lesion AUROC by depth (LIDC, density-A; Fig. 1):
| layer | final (12) | block 6 | block 4 | block 3 |
|---|---|---|---|---|
| AUROC | 0.565 | 0.769 | 0.865 | **0.871** |
Final-layer features are tuned for the global self-distillation objective; the dense local lesion
signal sits mid/early. We fix block 3 (MedDINOv3) as the operating layer; for DINOv2 the optimum
is block 8. The best block is backbone-dependent, but it is always an intermediate block, never
the final one. The curve is multi-seed stable (peak 0.866 ± 0.010, n=3), the operating layer is
selectable without labels (tail-gap selector regret 0.006), and the depth-erosion holds across
objectives (see the companion mechanism study).
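
The metric in the table is a token-level AUROC: each patch token gets a membership score, and the evaluation-only mask says whether the token overlaps a lesion. A minimal sketch of that computation via the Mann-Whitney rank statistic follows; the function name and the toy inputs are illustrative, and the paper's exact token-labeling rule is not specified here.

```python
import numpy as np

def token_auroc(scores, labels):
    """AUROC from the Mann-Whitney U statistic, with average ranks for ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i = 0
    while i < len(scores):
        # Extend j to the end of the current group of tied scores.
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0   # average 1-based rank
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))

# Toy check: one of the four (lesion, background) pairs is mis-ordered.
auc = token_auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # 0.75
```

Running this once per block, with scores computed from that block's tokens, reproduces the shape of the depth sweep; labels enter only at this evaluation step.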
### 4.2 Cross-anatomy, cross-modality, cross-backbone localization (Finding 2)
Density-A token-level lesion AUROC, with attention saliency as the label-free comparator:
| dataset (modality, backbone) | density-A | attention | random |
|---|---|---|---|
| LIDC lung CT (MedDINOv3) | **0.871** | 0.767 | 0.51 |
| MSD pancreas CT (MedDINOv3) | 0.876 | 0.920 | 0.49 |
| KiTS23 kidney CT (MedDINOv3) | 0.823 | 0.823 | 0.50 |
| MSD liver CT (MedDINOv3) | 0.670 | 0.756 | 0.50 |
| BUSI breast US (DINOv2) | **0.733** | 0.492 | 0.50 |
The subspace localizes lesions without labels across very different anatomies, two modalities, and
two backbones. On ultrasound, attention is at chance, so the geometric subspace is the *only*
label-free signal that works.
### 4.3 Precondition and characterized failure
The method's value tracks **whether feature density localizes the lesion**, not the modality.
Liver (0.67) is the characterized failure: low-contrast tumors in heterogeneous parenchyma are not
locally rare in feature space. Liver is the *mirror image* of ultrasound: on liver, attention
(0.756) is the better localizer; on ultrasound it collapses (0.49). A density+attention hybrid
does **not** rescue liver (0.713, between the two; the weak density signal drags down the
stronger attention signal). Deployment rule: use the subspace where density-AUROC clears the
floor, else fall back to attention.
## 5. Results: pruning, certificate, routing
### 5.1 Membership pruning beats saliency pruning (Finding 3)
Small-lesion recall gain at matched token budget, membership pruning minus attention-saliency
pruning (paired bootstrap CI excludes 0 throughout; "miss-red" is the relative reduction in
small-lesion miss rate):
| dataset | budget 0.25 | budget 0.5 |
|---|---|---|
| LIDC lung CT | +27.6 pts | +15.8 pts (89% miss-red) |
| KiTS23 kidney CT | +7.4 pts (40% miss-red) | +1.6 pts (91% miss-red) |
| BUSI breast US | +13.8 pts | +19.0 pts |
Pancreas ties — tumors are large and salient (attention already 0.92), the safe regime where
pruning is not a clinical risk. The gain is largest exactly where saliency fails (subtle lesions;
ultrasound).
### 5.2 Conformal retention certificate
Multi-split split-conformal (50 resamples, pooled n=4352, α=0.1): empirical coverage **0.978 ≥
0.90** — the per-image guarantee is valid. The certificate honestly exposes a budget↔guarantee
tradeoff: ~100% guaranteed lesion retention at budget 0.5, and at 0.25 it correctly reports that
the hardest ~10% of small-lesion cases cannot be guaranteed.
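
The simplest form of such a certificate can be sketched as follows. This is an assumption-laden illustration, not the implementation behind the 0.978 figure: it computes a single marginal lower bound from calibration retention values (the per-image version would condition on a per-image score), the function names are hypothetical, and the beta-distributed retention values are synthetic.

```python
import numpy as np

def conformal_lower_bound(calib_retained, alpha=0.1):
    """Split-conformal lower bound q with P(Y_test >= q) >= 1 - alpha.

    Valid under exchangeability of calibration and test images: q is the
    k-th smallest calibration value with k = floor(alpha * (n + 1)).
    """
    y = np.sort(np.asarray(calib_retained, dtype=float))
    n = len(y)
    k = int(np.floor(alpha * (n + 1)))
    if k < 1:
        return 0.0   # too few calibration points: only the trivial bound holds
    return float(y[k - 1])

def empirical_coverage(test_retained, bound):
    """Fraction of test images whose retained lesion mass meets the bound."""
    return float(np.mean(np.asarray(test_retained, dtype=float) >= bound))

# Synthetic retained-lesion-mass fractions in (0, 1), split into calibration/test.
rng = np.random.default_rng(0)
retained = rng.beta(8, 2, size=3000)
calib, test = retained[:1000], retained[1000:]

bound = conformal_lower_bound(calib, alpha=0.1)
coverage = empirical_coverage(test, bound)
```

Repeating the split many times and pooling the test outcomes gives a multi-split estimate of empirical coverage, which should sit at or above `1 - alpha` when the exchangeability assumption holds.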
### 5.3 Lesion-routed depth
Routing depth by membership yields **1.6× FLOP reduction at 98.2% small-lesion sensitivity** and
dominates saliency routing at every retention (saliency never reaches equal sensitivity at any
FLOP saving). A volumetric two-level (slice+token) economy gives a further ~2× at a documented
sensitivity cost (tunable deployment knob).
## 6. The negative result: rank-based coverage fails for rare pathology (Finding 4)
### 6.1 The ablation
Three pruning strategies, small lesions, matched budget:
| budget | saliency | subspace-only (membership top-k) | subspace + coverage floor |
|---|---|---|---|
| 0.25 | 0.521 | **0.817** | **0.219** |
| 0.50 | 0.827 | **0.981** | **0.460** |
Subspace-only beats saliency (+29.6 / +15.4 pts). The coverage floor is **far worse** than
subspace-only (−0.60 / −0.52, CI excludes 0). The constraint does not add value — it removes it.
### 6.2 Mechanism (transferable)
`C(S)=effrank(P_L Z_S)` is maximized by a retained set that **diversely spans** the subspace's
directions. The decisive property of a lesion is that it is **rare** — a handful of tokens out of
~196. A set-level rank/coverage objective is therefore *insensitive* to it: a few tokens cannot
materially raise the retained set's effective rank, so the objective spends the budget on abundant
background directions and drops the lesion. This is a **rarity** mechanism, not an internal-geometry
one — and we checked: measured at the operating layer, lesion tokens are *not* low-rank relative to
background (pooled effective rank 339 vs 307; participation ratio 18.9 vs 13.9; within-image
internal-rank/m ≈ equal). Lesion tokens are in fact diverse; the set-coverage objective is blind to
them anyway because they are few. (The synthetic law of the companion paper reaches the same failure
via a genuinely low-rank signal; real lesions reach it via rarity — two routes to one principle.)
Rank coverage rewards the entropy of the retained *set* spectrum; lesion retention rewards mass on
the top membership tokens; these diverge whenever the critical signal is a **rare** cluster, of any
internal rank. **For rare-pathology tasks, prefer concentration objectives (energy / membership
mass) over rank/spanning objectives (RankMe, coding rate, MCR2).**
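
The insensitivity argument can be checked on a toy example. The sketch below uses synthetic isotropic Gaussian tokens (not the paper's features), omits the projection `P_L`, and picks `d = 64`, a shift of 2.0, and 3 lesion tokens purely for illustration. It compares the RankMe-style effective rank of two retained sets of 49 tokens (budget 0.25 of 196) that differ only in whether the 3 rare tokens are kept.

```python
import numpy as np

def effective_rank(Z, eps=1e-7):
    """RankMe-style effective rank: exp of the entropy of normalized singular values."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum() + eps
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
n_tokens, d, n_lesion, k = 196, 64, 3, 49       # k = 0.25 * 196 retained tokens

background = rng.normal(size=(n_tokens - n_lesion, d))
shift = np.zeros(d)
shift[0] = 2.0                                   # rare tokens sit off-center
lesion = rng.normal(size=(n_lesion, d)) + shift

shared = background[: k - n_lesion]              # 46 tokens common to both sets
keeps_lesion = np.vstack([shared, lesion])
drops_lesion = np.vstack([shared, background[k - n_lesion : k]])

rank_keep = effective_rank(keeps_lesion)
rank_drop = effective_rank(drops_lesion)
rel_gap = abs(rank_keep - rank_drop) / rank_drop

print(f"effrank keeping lesion:  {rank_keep:.2f}")
print(f"effrank dropping lesion: {rank_drop:.2f}")
print(f"relative gap:            {rel_gap:.4f}")
```

Lesion retention is 1.0 for the first set and 0.0 for the second, yet the set-level rank score differs only marginally, so an optimizer driven by that score has essentially no incentive to keep the rare tokens.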
### 6.3 Convergent evidence
Three independent lines reach the same verdict: (a) the ablation above; (b) principled Gate-2
faithfulness — under a random-pruning protocol, coverage-drop predicts detection-drop no better
than attention-drop (Spearman 0.480 vs 0.479; difference CI includes 0; both capped near 0.48 by
small-lesion combinatorics, not by faithfulness); (c) the difficulty-adaptive budget never
emerges: aggregate coverage is nearly identical on lesion-positive and lesion-negative slices (250.4 vs 247.2),
since 1–3 patches cannot move an aggregate over ~196 tokens. The coverage *constraint machinery*
(dual, floor) is intact and stable as an optimizer (dual μ stabilizes; the floor is satisfied on
99% of cases) — it simply optimizes the wrong quantity.
## 7. Related work and positioning
Prior label-free / medical token pruning (AFFMAE, PrATo/MedPruner, RankMe as a diagnostic, WERank)
either treats rank as a monitor rather than a target, prunes by attention/labels, or is
non-medical. We contribute: (i) the mid-layer localization finding; (ii) a label-free lesion subspace
that transfers across modality and backbone; (iii) a conformal retention certificate; and (iv) a
mechanistic negative result on rank-based coverage objectives. Note (iv) gives clean separation
from a companion representation-coverage probe study: if such a probe reads final-layer features,
Finding 1 says it reads the wrong layer — the two results reinforce rather than overlap.
**Relation to label-free FM adaptation (FINO; Gardès et al., 2026).** Concurrent work adapts vision
foundation models to scientific domains *without task labels* by guiding a self-supervised objective
with **metadata**, training the backbone. Our work is orthogonal and complementary: we keep the
backbone **frozen**, use **no metadata or labels**, and contribute a geometric *analysis* (where the
signal lives), a token *economy* (membership pruning, routed depth), a retention *certificate*, and
a *law* on objective choice — none of which an adaptation method addresses. The two compose: our
probe can be run on a FINO-adapted backbone to test whether metadata-guided adaptation preserves the
mid-layer concentration subspace at depth and improves rare-signal separability (a question our
training-free steering result, Sec. 6 / companion study, shows cannot be solved without a training
signal). Their result is also indirect support for our mechanism: counteracting depth-globalization
of informative local factors is plausibly part of why metadata guidance helps.
## 8. Limitations
- The method helps only where feature density localizes the lesion (liver = characterized
failure); a deployment check on density-AUROC is required, with attention fallback.
- Faithfulness of coverage as a proxy is moderate, not tight, and not better than saliency under
the random-pruning protocol.
- Pretraining-time application is untested (inference-time/fine-tuning only); the conformal
guarantee assumes exchangeable calibration/test data.
## 9. Conclusion
The contribution is the **label-free lesion subspace** — a mid-layer geometry that localizes
lesions without labels across modality and backbone — together with membership pruning, a conformal
retention certificate, and lesion-routed depth. The coverage-constrained optimization we began with
is reported as a clean negative whose mechanism (rank rewards spanning, rare pathology needs
concentration) is a transferable caution for medical SSL.
---
### Appendix A — Gate ledger (locked Phase-1b thresholds)
| Gate | Verdict | Key number |
|---|---|---|
| 0 reproducibility | PASS | frozen load, Δ=0, 2.1M-token bank |
| 1 subspace validity | PASS | density-A 0.871, +0.105 vs attention |
| 2 faithfulness | guard PASS; not superior | coverage 0.480 vs saliency 0.479 (tied) |
| 3 membership pruning > saliency | PASS | LIDC, KiTS23, BUSI (CT + ultrasound) |
| 4 coverage floor | NEGATIVE | floor 0.22 vs subspace 0.82 @0.25 |
| 5 invariance | FALLBACK | inference-time |
| 6 conformal retention cert. | PASS | empirical 0.978 ≥ 0.90 |
| 6 lesion-routed depth | PASS | 1.6× FLOPs @ 98% sensitivity |
| 6 volumetric | PARTIAL | ~2× at 82% lesion mass (tunable) |
### Appendix B — Reproducibility
All experiments ran as Hugging Face Jobs (MedDINOv3 `ricklisz123/MedDINOv3-ViTB-16-CT-3M`,
DINOv2 `facebook/dinov2-base`). Artifacts (token banks, materialized masks, per-gate metrics) in
the `processed/covtoken/` bucket; per-gate decision records in `covtoken/gate_reports/`; locked
thresholds in `covtoken/configs/thresholds.lock.json`.