	---
	title: "Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging"
	status: working draft
	date: 2026-06-20
	backbones: [MedDINOv3 ViT-B/16 (CT-3M), DINOv2-base]
	---

	# Where Lesions Live: Label-Free Mid-Layer Lesion Subspaces for Token-Economical Medical Imaging

	## Abstract

	We study label-free token pruning for medical imaging built on frozen self-supervised vision
	transformers. Our central object is a label-free lesion subspace: a geometric region of a
	frozen ViT's patch-token feature space, estimated without any lesion labels, in which lesion
	tokens are locally rare/distinctive. Three findings organize the paper. (1) Where to look.
	The lesion-localizable signal in frozen SSL ViTs lives in mid-layer, not final-layer,
	features: on lung CT, token-level lesion AUROC rises from 0.565 (final block) to 0.871 (block
	3). (2) A label-free localizer that generalizes. A simple density estimate over a held-out
	token bank localizes lesions without labels across anatomies (lung 0.87, pancreas 0.88, kidney
	0.82 CT) and across modalities and backbones — 0.73 on breast ultrasound with DINOv2, where
	attention saliency collapses to chance. Pruning tokens by subspace membership improves
	small-lesion recall over attention-saliency pruning by 14–28 points on lung CT and breast
	ultrasound (smaller gains on kidney CT), and admits a per-image conformal retention certificate
	(empirical coverage 0.978 ≥ nominal 0.90) and a lesion-routed adaptive depth that reduces FLOPs
	by 1.6× at 98% small-lesion sensitivity. (3) **A
	negative result with a transferable mechanism.** We set out to gate pruning with a *coverage
	constraint* — a floor on the effective rank (RankMe / coding rate) of the lesion subspace spanned
	by retained tokens, controlled by an interpretable dual variable. This fails: at matched
	budget (0.25) the coverage-constrained pruner retains 0.22 of small lesions, versus 0.82 for plain
	membership ranking. The mechanism generalizes past our method: **rank-based coverage objectives
	reward diverse subspace spanning, whereas rare small-region pathology requires concentration
	on a few high-membership tokens.** Effective-rank coverage is therefore structurally mismatched
	to rare-lesion retention — a warning for the increasingly common use of RankMe-flavored
	objectives in medical SSL.

	## 1. Introduction

	Token pruning makes vision transformers cheaper, but in medical imaging the failure mode that
	matters is dropping the pathology. A tiny lung nodule or microcalcification occupies a handful of
	patches; a pruner optimized for throughput or generic saliency can discard exactly those.

	We ask a narrower, label-free question: **can a frozen SSL backbone tell us, without any labels,
	which tokens carry diagnostic signal — well enough to prune around them, certify the result, and
	adapt compute?** Our answer is a label-free lesion subspace and the operations built on it.

	We deliberately also report what did not work. Our original hypothesis was that pruning should
	be a constrained optimization — minimize tokens subject to a floor on lesion-subspace coverage,
	with an interpretable dual as the controller. That hypothesis is wrong, and wrong for an
	instructive reason we make precise. We treat the negative as a first-class result.

	Contributions.
	1. A mid-layer localization finding: lesion signal in frozen SSL ViTs is mid-layer, not final.
	2. A label-free lesion subspace that localizes lesions across anatomy, modality, and backbone.
	3. Subspace-membership pruning that beats saliency pruning on small-lesion miss-rate, with a
	conformal retention certificate and lesion-routed depth.
	4. A negative result with a transferable mechanism: rank-based coverage objectives fail for
	rare-lesion retention.

	## 2. Method

	### 2.1 Setup
	Frozen backbone, patch-token features `Z(x) = {z_1,...,z_n}`, `z_i ∈ R^d`. For CT we use
	MedDINOv3 ViT-B/16 (CT-3M); for ultrasound, DINOv2-base (modality-agnostic), which lets us test
	that the method is not backbone-specific. We extract mid-layer tokens at a per-backbone operating
	block (Sec. 4.1).

	### 2.2 Label-free lesion subspace
	We estimate, without labels, the region of feature space carrying diagnostic signal.
	- Construction A (density). Lesions are rare, so lesion tokens lie in locally sparse regions.
	Estimate token density via k-NN distance to a held-out token bank; the lesion-membership score
	is the mean k-NN distance (low density ⇒ high score). The candidate subspace `L(x)` is spanned
	by the low-density tokens.
	- Construction B (residual). Fit a low-rank normal-tissue subspace `U` by PCA on the bank;
	lesion-relevant tokens have high residual `‖(I-UU^T)z‖`.
	Both are label-free. The held-out CT token bank holds 2.1M mid-layer tokens.
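
The two constructions can be sketched in a few lines of NumPy. This is a minimal illustration, not the pipeline used for the reported numbers: the function names, the values of `k` and `rank`, the brute-force distance computation, and the mean-centering before the PCA residual are all assumptions made for the example.

```python
import numpy as np

def knn_density_score(tokens, bank, k=5):
    """Construction A sketch: mean distance to the k nearest bank tokens."""
    # Pairwise squared Euclidean distances, shape (n_tokens, n_bank).
    d2 = (
        (tokens ** 2).sum(axis=1, keepdims=True)
        + (bank ** 2).sum(axis=1)[None, :]
        - 2.0 * tokens @ bank.T
    )
    d2 = np.maximum(d2, 0.0)  # clip tiny negatives caused by round-off
    nearest = np.partition(d2, k - 1, axis=1)[:, :k]  # k smallest per row
    return np.sqrt(nearest).mean(axis=1)  # low density => high score

def pca_residual_score(tokens, bank, rank=8):
    """Construction B sketch: norm of the component outside the PCA subspace U."""
    mean = bank.mean(axis=0)
    _, _, vt = np.linalg.svd(bank - mean, full_matrices=False)
    U = vt[:rank].T  # (d, rank), orthonormal columns
    centered = tokens - mean  # centering is an assumption of this sketch
    residual = centered - (centered @ U) @ U.T  # (I - U U^T) z, row-wise
    return np.linalg.norm(residual, axis=1)
```

Brute-force distances are only practical for small banks; a bank on the order of millions of tokens would presumably need a chunked or approximate nearest-neighbor search.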

	### 2.3 Membership pruning, certificate, routing (what ships)
	- Lesion-subspace membership pruning. Retain the top-k tokens by membership score.
	- Conformal retention certificate. With split conformal on a calibration set, emit per image a
	distribution-free lower bound `g(x)` on the fraction of lesion mass `Y(x)` retained under
	membership pruning: `P(Y(x) ≥ g(x)) ≥ 1-α`. (This certifies lesion retention under the shipping
	policy, not any internal coverage statistic.)
	- Lesion-routed depth. Route tokens by membership at a mid block: high-membership tokens
	continue through full depth; the rest exit early.
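
A minimal sketch of the membership rule, assuming a per-token score vector is already available. The helper names are hypothetical, and "lesion mass retained" is taken here to mean the fraction of lesion tokens kept, which is one possible reading of the quantity certified above.

```python
import numpy as np

def membership_topk_mask(scores, budget):
    """Keep the top-k tokens by membership score, with k = round(budget * n)."""
    n = scores.shape[0]
    k = max(1, int(round(budget * n)))
    keep = np.argsort(-scores, kind="stable")[:k]  # indices of the k highest scores
    mask = np.zeros(n, dtype=bool)
    mask[keep] = True
    return mask

def lesion_mass_retained(mask, lesion):
    """Fraction of lesion tokens that survive pruning (evaluation only, needs masks)."""
    total = lesion.sum()
    return float((mask & lesion).sum() / total) if total > 0 else float("nan")
```

The same boolean mask could serve as the routing decision for lesion-routed depth: tokens with `mask[i] == True` continue, the rest exit early.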

	### 2.4 The coverage constraint (the hypothesis we falsify)
	We define a coverage functional `C(S;x) = effrank(P_L Z_S)` (RankMe form; coding-rate surrogate to
	avoid SVD backprop) and pose pruning as `min_m Σ m_i s.t. C*(x) - C(S;x) ≤ ε`, with Lagrangian
	dual `μ` learned by dual ascent and a Gumbel straight-through mask. Section 6 shows why this
	underperforms the simple membership rule of Sec. 2.3.
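
For reference, the objects named above can be written out explicitly. The effective rank and coding rate below are the standard RankMe and rate-distortion forms (RankMe additionally adds a small constant to each `p_k` for numerical stability). Reading `C*(x)` as the coverage of the unpruned token set, and the symbols `δ` (distortion) and `η` (dual step size), are assumptions not fixed by this draft.

```latex
% Effective rank (RankMe form) of the projected retained tokens, A = P_L Z_S \in \mathbb{R}^{d \times |S|}
\operatorname{effrank}(A) = \exp\!\Big(-\sum_{k} p_k \log p_k\Big),
\qquad p_k = \frac{\sigma_k(A)}{\sum_{j} \sigma_j(A)}

% Coding-rate surrogate with distortion \delta
R_\delta(A) = \tfrac{1}{2} \log\det\!\Big(I_d + \frac{d}{|S|\,\delta^2}\, A A^\top\Big)

% Constrained pruning with mask m \in \{0,1\}^n and S = \{ i : m_i = 1 \}
\min_{m} \sum_{i=1}^{n} m_i
\quad \text{s.t.} \quad C^*(x) - C(S;x) \le \varepsilon

% Lagrangian and projected dual-ascent step
\mathcal{L}(m,\mu) = \sum_{i=1}^{n} m_i + \mu\,\big(C^*(x) - C(S;x) - \varepsilon\big),
\qquad \mu \leftarrow \max\!\big(0,\; \mu + \eta\,(C^*(x) - C(S;x) - \varepsilon)\big)
```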

	## 3. Experimental protocol (gated falsification)

	Each claim is a gate with an explicit metric, comparator, threshold (calibrated in a locked
	Phase-1b step against the saliency/random baselines), and statistical test (DeLong for AUROC;
	paired bootstrap n=2000 for recall; Spearman with permutation for coupling). Masks are
	evaluation-only; no label touches subspace construction (enforced by a CI label-leak test).
	Datasets: LIDC-IDRI (lung CT), KiTS23 (kidney CT), MSD Task03 Liver, MSD Task07 Pancreas, BUSI
	(breast ultrasound). All compute ran as Hugging Face Jobs.
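
The paired bootstrap for a recall difference can be sketched as follows, assuming per-lesion binary hit indicators for the two pruners on the same cases. The function name, the percentile interval, and the default `alpha` are illustrative choices; only the resample count (2000) comes from the protocol.

```python
import numpy as np

def paired_bootstrap_diff(hit_a, hit_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for recall(a) - recall(b), resampling cases jointly."""
    hit_a = np.asarray(hit_a, dtype=float)
    hit_b = np.asarray(hit_b, dtype=float)
    rng = np.random.default_rng(seed)
    n = hit_a.shape[0]
    # The same resampled indices are applied to both methods (paired design).
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = hit_a[idx].mean(axis=1) - hit_b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(hit_a.mean() - hit_b.mean()), float(lo), float(hi)
```

A gate of the form "CI excludes 0" then corresponds to checking `lo > 0` (or `hi < 0`) on the returned interval.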

	## 4. Results: the label-free localizer

	### 4.1 Lesion signal lives mid-layer (Finding 1)
	Token-level lesion AUROC by depth (LIDC, density-A; Fig. 1):

	| layer | final (12) | block 6 | block 4 | block 3 |
	|---|---|---|---|---|
	| AUROC | 0.565 | 0.769 | 0.865 | 0.871 |

	Final-layer features are tuned for the global self-distillation objective; the dense local lesion
	signal sits in intermediate blocks. We fix block 3 (MedDINOv3) as the operating layer; for DINOv2
	the optimum is block 8. The best block is backbone-dependent but always intermediate, never final.
	The curve is stable across seeds (peak 0.866 ± 0.010, n=3), the operating layer is selectable
	without labels (tail-gap selector regret 0.006), and the depth erosion holds across objectives
	(see the companion mechanism study).
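
Token-level AUROC can be computed from membership scores and evaluation-only masks with the rank (Mann-Whitney) formulation. The sketch below is one such implementation, with tied scores given average ranks; the function name is hypothetical and the draft does not state which AUROC routine was actually used.

```python
import numpy as np

def token_auroc(scores, is_lesion):
    """AUROC of scores against boolean lesion labels via the Mann-Whitney U statistic."""
    scores = np.asarray(scores, dtype=float).ravel()
    is_lesion = np.asarray(is_lesion, dtype=bool).ravel()
    n_pos = int(is_lesion.sum())
    n_neg = is_lesion.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # undefined without both classes
    # 1-based average ranks, so tied scores share the same rank.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    avg_rank = last_rank - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    u_stat = ranks[is_lesion].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u_stat / (n_pos * n_neg))
```

Running this once per block on the same tokens gives a depth curve of the kind tabulated above.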

	### 4.2 Cross-anatomy, cross-modality, cross-backbone localization (Finding 2)
	density-A token-level lesion AUROC, with attention-saliency as the label-free comparator:

	| dataset (modality, backbone) | density-A | attention | random |
	|---|---|---|---|
	| LIDC lung CT (MedDINOv3) | 0.871 | 0.767 | 0.51 |
	| MSD pancreas CT (MedDINOv3) | 0.876 | 0.920 | 0.49 |
	| KiTS23 kidney CT (MedDINOv3) | 0.823 | 0.823 | 0.50 |
	| MSD liver CT (MedDINOv3) | 0.670 | 0.756 | 0.50 |
	| BUSI breast US (DINOv2) | 0.733 | 0.492 | 0.50 |

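	The subspace localizes lesions without labels across distinct anatomies, two modalities, and two
	backbones. Attention is competitive or better on pancreas and liver and ties on kidney; density-A
	leads on lung CT and ultrasound. On ultrasound, attention is at chance, so the geometric subspace
	is the only label-free signal that works there.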

	### 4.3 Precondition and characterized failure
	The method's value tracks whether feature density localizes the lesion, not the modality.
	Liver (0.67) is the characterized failure: low-contrast tumors in heterogeneous parenchyma are not
	locally rare in feature space. Liver is the mirror image of ultrasound — on liver attention
	(0.756) is the better localizer, on ultrasound it collapses (0.49). A density+attention hybrid
	does not rescue liver (0.713, between the two; the weak density signal drags down the stronger
	attention signal). Deployment rule: use the subspace where density-AUROC clears the floor, else
	fall back to attention.

	## 5. Results: pruning, certificate, routing

	### 5.1 Membership pruning beats saliency pruning (Finding 3)
	Small-lesion recall gain of membership pruning over attention-saliency pruning at matched token
	budget, in percentage points; parenthesized values give the relative reduction in miss rate
	where reported (paired bootstrap CI excludes 0 throughout):

	| dataset | Δ recall @ budget 0.25 | Δ recall @ budget 0.5 |
	|---|---|---|
	| LIDC lung CT | +27.6 pts | +15.8 pts (89% miss reduction) |
	| KiTS23 kidney CT | +7.4 pts (40% miss reduction) | +1.6 pts (91% miss reduction) |
	| BUSI breast US | +13.8 pts | +19.0 pts |

	Pancreas ties — tumors are large and salient (attention already 0.92), the safe regime where
	pruning is not a clinical risk. The gain is largest exactly where saliency fails (subtle lesions;
	ultrasound).

	### 5.2 Conformal retention certificate
	Multi-split split-conformal (50 resamples, pooled n=4352, α=0.1): empirical coverage **0.978 ≥
	0.90** — the per-image guarantee is valid. The certificate honestly exposes a budget↔guarantee
	tradeoff: ~100% guaranteed lesion retention at budget 0.5, and at 0.25 it correctly reports that
	the hardest ~10% of small-lesion cases cannot be guaranteed.
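
One way to obtain a per-image lower bound of this kind is split conformal on one-sided residuals. The sketch below assumes a label-free per-image proxy for retained lesion mass (hypothetical; the draft does not specify the predictor) and calibrates an offset so that `proxy - offset` undershoots the true retention with probability at least `1-α` under exchangeability.

```python
import numpy as np

def conformal_offset(proxy_cal, y_cal, alpha=0.1):
    """Calibrate q so that P(Y >= proxy - q) >= 1 - alpha for an exchangeable test image."""
    # Nonconformity score: how far the proxy overestimates the true retention.
    s = np.sort(np.asarray(proxy_cal, dtype=float) - np.asarray(y_cal, dtype=float))
    n = s.size
    k = int(np.ceil((1 - alpha) * (n + 1)))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration images for this alpha
    return float(s[k - 1])

def retention_lower_bound(proxy_test, offset):
    """Per-image guaranteed retention, clipped to the valid range [0, 1]."""
    return np.clip(np.asarray(proxy_test, dtype=float) - offset, 0.0, 1.0)
```

With a constant proxy this reduces to a single dataset-level bound; a more informative proxy is what makes the certificate vary per image. Empirical coverage is then the fraction of held-out images whose true retention meets or exceeds the emitted bound.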

	### 5.3 Lesion-routed depth
	Routing depth by membership yields a 1.6× FLOP reduction at 98.2% small-lesion sensitivity and
	dominates saliency routing at every retention level (saliency never reaches equal sensitivity at
	any FLOP saving). A volumetric two-level (slice+token) economy gives a further ~2× at a documented
	sensitivity cost (a tunable deployment knob).
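
As a rough back-of-envelope model of routed depth, assume per-block cost is linear in the number of tokens (ignoring attention's quadratic term) and that a fixed fraction of tokens continues past the routing block. This is a simplification for intuition only; the function name and the example inputs are hypothetical and are not the configuration behind the reported 1.6×.

```python
def routed_depth_speedup(n_blocks, exit_block, keep_fraction):
    """Speedup of early-exit routing under a linear per-token cost model."""
    if not (0 < exit_block <= n_blocks):
        raise ValueError("exit_block must satisfy 0 < exit_block <= n_blocks")
    if not (0.0 <= keep_fraction <= 1.0):
        raise ValueError("keep_fraction must lie in [0, 1]")
    # All tokens pay for the first exit_block blocks; only kept tokens pay for the rest.
    relative_cost = (exit_block + keep_fraction * (n_blocks - exit_block)) / n_blocks
    return 1.0 / relative_cost

# Hypothetical example: 12 blocks, route at block 4, keep a quarter of the tokens.
print(routed_depth_speedup(n_blocks=12, exit_block=4, keep_fraction=0.25))
```

Because attention cost grows faster than linearly in token count, the true saving for a given keep fraction would differ from this estimate.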

	## 6. The negative result: rank-based coverage fails for rare pathology (Finding 4)

	### 6.1 The ablation
	Small-lesion recall for three pruning strategies at matched budget:

	| budget | saliency | subspace-only (membership top-k) | subspace + coverage floor |
	|---|---|---|---|
	| 0.25 | 0.521 | 0.817 | 0.219 |
	| 0.50 | 0.827 | 0.981 | 0.460 |

	Subspace-only beats saliency (+29.6 / +15.4 pts at budgets 0.25 / 0.50). The coverage floor is
	far worse than subspace-only (−59.8 / −52.1 pts; CI excludes 0). The constraint does not add
	value; it removes it.
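
As a worked check on the table above, the relative miss reduction can be derived from the recalls, assuming "miss reduction" means the relative drop in miss rate (1 − recall) from saliency to subspace-only pruning. That reading is ours; the draft does not define the term.

```latex
\text{miss reduction} = 1 - \frac{1 - r_{\text{subspace}}}{1 - r_{\text{saliency}}}

% budget 0.50: recalls 0.981 vs 0.827
1 - \frac{1 - 0.981}{1 - 0.827} = 1 - \frac{0.019}{0.173} \approx 0.89

% budget 0.25: recalls 0.817 vs 0.521
1 - \frac{1 - 0.817}{1 - 0.521} = 1 - \frac{0.183}{0.479} \approx 0.62
```

Under that reading, the budget-0.50 value agrees with the 89% figure quoted for lung CT in Sec. 5.1.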

	### 6.2 Mechanism (transferable)
	`C(S)=effrank(P_L Z_S)` is maximized by a retained set that diversely spans the subspace's
	directions. The decisive property of a lesion is that it is rare — a handful of tokens out of
	~196. A set-level rank/coverage objective is therefore insensitive to it: a few tokens cannot
	materially raise the retained set's effective rank, so the objective spends the budget on abundant
	background directions and drops the lesion. This is a rarity mechanism, not an internal-geometry
	one — and we checked: measured at the operating layer, lesion tokens are not low-rank relative to
	background (pooled effective rank 339 vs 307; participation ratio 18.9 vs 13.9; within-image
	internal-rank/m ≈ equal). Lesion tokens are in fact diverse; the set-coverage objective is blind to
	them anyway because they are few. (The synthetic law of the companion paper reaches the same failure
	via a genuinely low-rank signal; real lesions reach it via rarity — two routes to one principle.)
	Rank coverage rewards the entropy of the retained set spectrum; lesion retention rewards mass on
	the top membership tokens; these diverge whenever the critical signal is a rare cluster, of any
	internal rank. **For rare-pathology tasks, prefer concentration objectives (energy / membership
	mass) over rank/spanning objectives (RankMe, coding rate, MCR²).**
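
The insensitivity argument can be illustrated on synthetic data. The sketch below uses random Gaussian tokens, not the paper's features, and omits the projection `P_L`; the dimension, token counts, and shift size are arbitrary choices. It compares the effective rank of a 49-token retained set that includes three rare, shifted tokens with one that spends the same three slots on background.

```python
import numpy as np

def effective_rank(Z):
    """RankMe-style effective rank: exp of the entropy of normalized singular values."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop exact zeros so the log is defined
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d = 64                                   # hypothetical feature dimension
background = rng.normal(size=(196, d))   # synthetic "background" tokens
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
lesion = rng.normal(size=(3, d)) + 3.0 * direction  # 3 rare, shifted tokens

base = background[:46]
with_lesion = np.vstack([base, lesion])                 # 49 tokens, lesion kept
with_background = np.vstack([base, background[46:49]])  # 49 tokens, lesion dropped

er_lesion = effective_rank(with_lesion)
er_background = effective_rank(with_background)
print(f"effective rank, lesion kept:    {er_lesion:.2f}")
print(f"effective rank, lesion dropped: {er_background:.2f}")
```

If the two values come out close, as expected for this setup, a set-level rank objective has no incentive to spend budget on the three rare tokens, whereas a per-token membership ranking scores them individually.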

	### 6.3 Convergent evidence
	Three independent lines reach the same verdict: (a) the ablation above; (b) principled Gate-2
	faithfulness — under a random-pruning protocol, coverage-drop predicts detection-drop no better
	than attention-drop (Spearman 0.480 vs 0.479; difference CI includes 0; both capped near 0.48 by
	small-lesion combinatorics, not by faithfulness); (c) the difficulty-adaptive budget never
	emerges: aggregate coverage is nearly identical on lesion-positive and lesion-negative slices
	(250.4 vs 247.2), since 1–3 patches cannot move an aggregate over ~196 tokens. The coverage
	constraint machinery
	(dual, floor) is intact and stable as an optimizer (dual μ stabilizes; the floor is satisfied on
	99% of cases) — it simply optimizes the wrong quantity.
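
The Spearman-with-permutation test used for the coupling in (b) can be sketched as below, assuming paired per-case values such as coverage-drop and detection-drop. The function names, the two-sided p-value, and the default number of permutations are illustrative assumptions.

```python
import numpy as np

def _average_ranks(x):
    """1-based ranks with ties assigned their average rank."""
    x = np.asarray(x, dtype=float).ravel()
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)
    return (last_rank - (counts - 1) / 2.0)[inverse]

def spearman_permutation(x, y, n_perm=2000, seed=0):
    """Spearman rho with a two-sided permutation p-value."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    rho = float(np.corrcoef(rx, ry)[0, 1])  # Pearson correlation of the ranks
    rng = np.random.default_rng(seed)
    # Null distribution: shuffle one variable's ranks to break the pairing.
    null = np.array(
        [np.corrcoef(rx, rng.permutation(ry))[0, 1] for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(np.abs(null) >= abs(rho))) / (n_perm + 1)
    return rho, float(p_value)
```

Comparing two such correlations, as in the 0.480 vs 0.479 comparison, would additionally need an interval on their difference, for instance from a paired bootstrap over cases.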

	## 7. Related work and positioning

	Prior label-free / medical token pruning (AFFMAE, PrATo/MedPruner, RankMe as a diagnostic, WERank)
	either treats rank as a monitor rather than a target, prunes by attention/labels, or is non-
	medical. We contribute: (i) the mid-layer localization finding; (ii) a label-free lesion subspace
	that transfers across modality and backbone; (iii) a conformal retention certificate; and (iv) a
	mechanistic negative result on rank-based coverage objectives. Note (iv) gives clean separation
	from a companion representation-coverage probe study: if such a probe reads final-layer features,
	Finding 1 says it reads the wrong layer — the two results reinforce rather than overlap.

	Relation to label-free FM adaptation (FINO; Gardès et al., 2026). Concurrent work adapts vision
	foundation models to scientific domains without task labels by guiding a self-supervised objective
	with metadata, training the backbone. Our work is orthogonal and complementary: we keep the
	backbone frozen, use no metadata or labels, and contribute a geometric analysis (where the
	signal lives), a token economy (membership pruning, routed depth), a retention certificate, and
	a law on objective choice — none of which an adaptation method addresses. The two compose: our
	probe can be run on a FINO-adapted backbone to test whether metadata-guided adaptation preserves the
	mid-layer concentration subspace at depth and improves rare-signal separability (a question our
	training-free steering result, Sec. 6 / companion study, shows cannot be solved without a training
	signal). Their result is also indirect support for our mechanism: counteracting depth-globalization
	of informative local factors is plausibly part of why metadata guidance helps.

	## 8. Limitations

	- The method helps only where feature density localizes the lesion (liver = characterized
	failure); a deployment check on density-AUROC is required, with attention fallback.
	- Faithfulness of coverage as a proxy is moderate, not tight, and not better than saliency under
	the random-pruning protocol.
	- Pretraining-time application is untested (inference-time/fine-tuning only); the conformal
	guarantee assumes exchangeable calibration/test data.

	## 9. Conclusion

	The contribution is the label-free lesion subspace — a mid-layer geometry that localizes
	lesions without labels across modality and backbone — together with membership pruning, a conformal
	retention certificate, and lesion-routed depth. The coverage-constrained optimization we began with
	is reported as a clean negative whose mechanism (rank rewards spanning, rare pathology needs
	concentration) is a transferable caution for medical SSL.

	---

	### Appendix A — Gate ledger (locked Phase-1b thresholds)

	| Gate | Verdict | Key number |
	|---|---|---|
	| 0 reproducibility | PASS | frozen load, Δ=0, 2.1M-token bank |
	| 1 subspace validity | PASS | density-A 0.871, +0.105 vs attention |
	| 2 faithfulness | guard PASS; not superior | coverage 0.480 vs saliency 0.479 (tied) |
	| 3 membership pruning > saliency | PASS | LIDC, KiTS23, BUSI (CT + ultrasound) |
	| 4 coverage floor | NEGATIVE | floor 0.22 vs subspace 0.82 @0.25 |
	| 5 invariance | FALLBACK | inference-time |
	| 6 conformal retention cert. | PASS | empirical 0.978 ≥ 0.90 |
	| 6 lesion-routed depth | PASS | 1.6× FLOPs @ 98% sensitivity |
	| 6 volumetric | PARTIAL | ~2× at 82% lesion mass (tunable) |

	### Appendix B — Reproducibility
	All experiments ran as Hugging Face Jobs (MedDINOv3 `ricklisz123/MedDINOv3-ViTB-16-CT-3M`,
	DINOv2 `facebook/dinov2-base`). Artifacts (token banks, materialized masks, per-gate metrics) in
	the `processed/covtoken/` bucket; per-gate decision records in `covtoken/gate_reports/`; locked
	thresholds in `covtoken/configs/thresholds.lock.json`.