# Spec 1.5: MedNeXt-L kernel5 + SkeletonRecall surface predictor
Open-source surface segmentation model for carbonized Herculaneum / Vesuvius scrolls. Beats bruniss's production d058 baseline by +0.27 absolute high_compressed IoU (+66% relative) on held-out Scroll-1 cubes, addressing the failure mode in ScrollPrize/villa#191 where d058 merges adjacent compressed sheets.
Full writeup, reproduction scripts, and held-out evaluation pipeline: https://github.com/ciscoriordan/mednext-vs-umamba-scroll
## Headline
| stratum (Scroll-1 instance-labels-harmonized cubes) | n | mean d058 high_compressed IoU | mean Spec 1.5 high_compressed IoU | mean Δ |
|---|---|---|---|---|
| truly held-out (z < 960, no Dataset059 training overlap) | 2 | 0.404 | 0.671 | +0.267 |
| partial overlap (z = 768) | 2 | 0.492 | 0.639 | +0.147 |
| in-training-z range | 2 | 0.387 | 0.509 | +0.122 |
| all six cubes | 6 | 0.427 | 0.606 | +0.179 |
Spec 1.5 wins on 6 of 6 cubes; biggest gap on the truly-held-out z=0 cube (+0.32). Numbers are best per-bucket-threshold IoU (calibration-fair: d058 calibrates around T=0.10, Spec 1.5 around T=0.20-0.25).
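The "best per-bucket-threshold IoU" protocol can be sketched in a few lines (a minimal sketch: the function name and threshold grid are illustrative, not the repo's actual evaluation code):

```python
import numpy as np

def best_threshold_iou(prob, target, thresholds=np.arange(0.05, 0.55, 0.05)):
    """Calibration-fair IoU: sweep binarization thresholds and keep the best.

    Each model is scored at its own best operating point (d058 near T=0.10,
    Spec 1.5 near T=0.20-0.25) rather than at one shared fixed threshold.
    """
    best = 0.0
    for t in thresholds:
        pred = prob >= t                      # binarize at this threshold
        union = np.logical_or(pred, target).sum()
        if union:                             # skip empty-union degenerate case
            best = max(best, np.logical_and(pred, target).sum() / union)
    return best
```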
## What's in this repo
| path | what it is |
|---|---|
| `kernel5_skelrec_dataset059_ep33/` | Spec 1.5 (the win): MedNeXt-L kernel5 + SkeletonRecall + Dataset059, epoch-33 best-generalizing snapshot. Contains `checkpoint_best.pth`, `plans.json`, `dataset.json`, `nnUNetTrainerSkeletonRecall_MedNeXtL_kernel5.py`, `progress.png`, and the full training log. |
| `kernel5_stage_a_lowlr/` | MedNeXt-L kernel5 with plain Dice+CE (no SkeletonRecall). Stage A of UpKern (lr=5e-4). Baseline. |
| `kernel5_stage_b_highlr_continuation/` | MedNeXt-L kernel5 with plain Dice+CE, warm-start + high LR. Baseline. |
| `checkpoint_kernel3_best.model` + `.pkl` | MedNeXt-L kernel3 with plain Dice+CE (UpKern stage 1; weights upsampled to kernel5 in Spec 1.5). Baseline. |
| `plans.pkl` | nnUNet v1 plans (legacy, for the kernel3 / kernel5 plain Dice+CE baselines). |
The Spec 1.5 checkpoint is the one to use for production inference; the others are baselines preserved for the architecture-vs-loss ablation analysis.
## How the win was isolated
Plain Dice+CE on MedNeXt-L (any kernel size, any LR schedule, any init) lands at 0.29-0.32 high_compressed IoU on the held-out z=0 cube, all worse than d058 (0.346). Same architecture with bruniss's SkeletonRecall loss (Dice + Skeleton-Recall + CE) jumps to 0.666. The architecture and the loss both contribute, and they stack:
| model | high_compressed IoU on cube_00000 (held-out) |
|---|---|
| d058 (ResEnc-L + SkelRec, production) | 0.346 |
| MedNeXt-L kernel3 + plain Dice+CE | 0.320 |
| MedNeXt-L kernel5 + plain Dice+CE (best variant) | 0.306 |
| MedNeXt-L kernel5 + SkeletonRecall (Spec 1.5) | 0.666 |
So: replacing Dice+CE with SkeletonRecall is the lever for issue #191's failure mode. A stronger architecture compounds with it cleanly.
## Loading the Spec 1.5 checkpoint
The trainer is a drop-in for the bruniss/VC-Surface-Models nnUNet v2 fork. Drop `nnUNetTrainerSkeletonRecall_MedNeXtL_kernel5.py` into the fork's trainers directory and run inference via the standard nnUNet v2 path, or use the production-inference script from the parent repo:
```bash
git clone https://github.com/ciscoriordan/mednext-vs-umamba-scroll
cd mednext-vs-umamba-scroll

huggingface-cli download ciscoriordan/mednext-l-scroll-surface \
  kernel5_skelrec_dataset059_ep33/checkpoint_best.pth \
  kernel5_skelrec_dataset059_ep33/plans.json \
  --local-dir checkpoints/spec1_5_skelrec_mednext_kernel5_ep33_BEST/

# OME-Zarr / plain-Zarr scroll volume in:
python scripts/inference/run_on_zarr.py \
  --volume /path/to/scroll1.volume.zarr \
  --ckpt checkpoints/spec1_5_skelrec_mednext_kernel5_ep33_BEST/checkpoint_best.pth \
  --out /path/to/scroll1.spec1_5_prob.zarr
```
Bbox-subset inference and `.npy`-cube back-compat are documented in the parent README's Quickstart.
## Inputs and outputs
- Input: 3D CT cube, single-channel `uint8`, isotropic 1x1x1 voxel spacing. Patch size 128^3 with 0.5 overlap.
- Output: 2-class softmax (background, surface). Take the foreground probability, or threshold around T=0.20-0.25 for a binary mask.
- CT normalization: clip to `plans.json`'s foreground percentiles (p005=44, p995=236), then standardize by (x - 129.10) / 42.19.
## Training data
bruniss's Dataset059_s1_s4_s5_patches_frangiedt — 1754 volume/label pairs from Scrolls 1, 4, and 5 with Frangi-DT-enhanced surface labels. Same preprocessing and patch size as bruniss's d058 baseline; only the architecture and the loss function differ.
## Caveat worth flagging
nnUNet v2's pseudo-dice metric under SkeletonRecall is computed against the skeleton targets, not the binary surface, so it reads roughly half as high as what plain Dice+CE training would print. At epoch 24 the pseudo-dice was 0.33 (unimpressive on its face) while the actual held-out high_compressed IoU was already 0.626 (already crushing d058). Anyone re-running this should probe held-out IoU during training rather than track pseudo-dice.
Also: nnUNet v2's `checkpoint_best.pth` gets overwritten as the in-training EMA improves, even when held-out IoU is degrading. Snapshot checkpoints every 5-10 epochs under unique names; don't trust `checkpoint_best.pth` alone. Trajectory and caveats: docs/results.md#training-trajectory--overfit-recovery.
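A minimal way to implement the snapshotting advice (hypothetical helper, not part of the repo; hook it into your training loop or a periodic watcher however suits your setup):

```python
import os
import shutil

def snapshot_best(run_dir: str, epoch: int, every: int = 5):
    """Copy checkpoint_best.pth to a uniquely named file every `every` epochs,
    so snapshots survive nnUNet v2 later overwriting checkpoint_best.pth.
    Returns the snapshot path, or None if nothing was copied."""
    if epoch % every != 0:
        return None
    src = os.path.join(run_dir, "checkpoint_best.pth")
    if not os.path.exists(src):
        return None
    dst = os.path.join(run_dir, f"checkpoint_ep{epoch:04d}.pth")
    shutil.copy2(src, dst)
    return dst
```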
## License
MIT for code/weights. Underlying training data (Dataset059) carries bruniss's license; we don't redistribute it.
## Citation
```bibtex
@inproceedings{roy2023mednext,
  title={MedNeXt: Transformer-driven scaling of ConvNets for medical image segmentation},
  author={Roy, Saikat and Koehler, Gregor and Ulrich, Constantin and Baumgartner, Michael and Petersen, Jens and Isensee, Fabian and J{\"a}ger, Paul F and Maier-Hein, Klaus H},
  booktitle={MICCAI},
  year={2023}
}

@misc{spec1_5_2026,
  title={Spec 1.5: MedNeXt-L kernel5 + SkeletonRecall scroll surface predictor},
  author={Riordan, Francisco},
  year={2026},
  howpublished={GitHub: ciscoriordan/mednext-vs-umamba-scroll, HuggingFace: ciscoriordan/mednext-l-scroll-surface},
}
```
This work depends on bruniss's Dataset059 and VC-Surface-Models nnUNet v2 fork, the SkeletonRecall paper (Mehta et al.), the nnUNet framework (Isensee et al.), and the open Vesuvius Challenge data infrastructure.
