Spec 1.5: MedNeXt-L kernel5 + SkeletonRecall surface predictor

Open-source surface segmentation model for carbonized Herculaneum / Vesuvius scrolls. Beats bruniss's production d058 baseline by +0.27 high_compressed IoU absolute (+66% relative) on held-out Scroll-1 cubes, addressing the failure mode in ScrollPrize/villa#191 where d058 merges adjacent compressed sheets.

Full writeup, reproduction scripts, and held-out evaluation pipeline: https://github.com/ciscoriordan/mednext-vs-umamba-scroll

Issue #191 failure mode: d058 merges adjacent sheets in compressed regions; Spec 1.5 keeps them separate

Headline

| stratum (Scroll-1 instance-labels-harmonized cubes) | n | mean d058 high_compressed IoU | mean Spec 1.5 high_compressed IoU | mean Δ |
|---|---|---|---|---|
| truly held-out (z < 960, no Dataset059 training overlap) | 2 | 0.404 | 0.671 | +0.267 |
| partial overlap (z = 768) | 2 | 0.492 | 0.639 | +0.147 |
| in-training-z range | 2 | 0.387 | 0.509 | +0.122 |
| all six cubes | 6 | 0.427 | 0.606 | +0.179 |

Spec 1.5 wins on 6 of 6 cubes; biggest gap on the truly-held-out z=0 cube (+0.32). Numbers are best per-bucket-threshold IoU (calibration-fair: d058 calibrates around T=0.10, Spec 1.5 around T=0.20-0.25).
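For clarity, the per-bucket-threshold protocol can be sketched as follows (illustrative helper, not the repo's actual evaluation code; function names are my own):

```python
import numpy as np

def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union of two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union else 0.0

def best_threshold_iou(prob: np.ndarray, gt: np.ndarray,
                       thresholds=np.arange(0.05, 0.55, 0.05)):
    """Sweep binarization thresholds on a probability map and return the
    (threshold, IoU) pair that scores best -- one sweep per model, so each
    model is judged at its own calibration point."""
    scores = [(float(t), iou(prob >= t, gt > 0)) for t in thresholds]
    return max(scores, key=lambda s: s[1])
```

Judging each model at its own best threshold is what makes the comparison calibration-fair: scoring both at one fixed T would penalize whichever model calibrates away from it.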

What's in this repo

| path | what it is |
|---|---|
| `kernel5_skelrec_dataset059_ep33/` | Spec 1.5 (the win): MedNeXt-L kernel5 + SkeletonRecall + Dataset059, epoch-33 best-generalizing snapshot. Contains `checkpoint_best.pth`, `plans.json`, `dataset.json`, `nnUNetTrainerSkeletonRecall_MedNeXtL_kernel5.py`, `progress.png`, and the full training log. |
| `kernel5_stage_a_lowlr/` | MedNeXt-L kernel5 with plain Dice+CE (no SkelRec). Stage A of UpKern (lr=5e-4). Baseline. |
| `kernel5_stage_b_highlr_continuation/` | MedNeXt-L kernel5 with plain Dice+CE, warm-start + high LR. Baseline. |
| `checkpoint_kernel3_best.model` + `.pkl` | MedNeXt-L kernel3 with plain Dice+CE (UpKern stage 1, weight-upsampled to kernel5 in Spec 1.5). Baseline. |
| `plans.pkl` | nnUNet v1 plans (legacy, for the kernel3 / kernel5 plain-Dice+CE baselines). |

The Spec 1.5 ckpt is the one to use for production inference; the others are baselines preserved for the architecture-vs-loss ablation analysis.

How the win was isolated

Plain Dice+CE on MedNeXt-L (any kernel size, any LR schedule, any init) lands at 0.29-0.32 high_compressed IoU on the held-out z=0 cube, all worse than d058 (0.346). Same architecture with bruniss's SkeletonRecall loss (Dice + Skeleton-Recall + CE) jumps to 0.666. The architecture and the loss both contribute, and they stack:

| model | high_compressed IoU on cube_00000 (held-out) |
|---|---|
| d058 (ResEnc-L + SkelRec, production) | 0.346 |
| MedNeXt-L kernel3 + plain Dice+CE | 0.320 |
| MedNeXt-L kernel5 + plain Dice+CE (best variant) | 0.306 |
| MedNeXt-L kernel5 + SkeletonRecall (Spec 1.5) | 0.666 |

So: replacing Dice+CE with SkeletonRecall is the lever for issue #191's failure mode. A stronger architecture compounds with it cleanly.
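For readers unfamiliar with the loss, here is a minimal numpy sketch of a Dice + Skeleton-Recall + CE composite (illustrative only -- the checkpoint was trained with bruniss's nnUNet v2 implementation; the weights, epsilons, and function names below are assumptions):

```python
import numpy as np

def dice_loss(prob, gt, eps=1e-6):
    """Soft Dice loss on foreground probabilities."""
    inter = (prob * gt).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + gt.sum() + eps)

def skeleton_recall_loss(prob, skel, eps=1e-6):
    """Soft recall of predicted foreground over a precomputed GT skeleton.
    Penalizes dropped sheet centerlines even when Dice barely moves --
    the thin-structure signal that keeps compressed sheets separate."""
    return 1.0 - ((prob * skel).sum() + eps) / (skel.sum() + eps)

def composite_loss(prob, gt, skel, w_dice=1.0, w_skel=1.0, w_ce=1.0):
    """Dice + Skeleton-Recall + binary cross-entropy."""
    ce = -(gt * np.log(prob + 1e-6)
           + (1.0 - gt) * np.log(1.0 - prob + 1e-6)).mean()
    return (w_dice * dice_loss(prob, gt)
            + w_skel * skeleton_recall_loss(prob, skel)
            + w_ce * ce)
```

The design intuition: Dice and CE reward bulk overlap, so a prediction that fuses two touching sheets still scores well on them; the skeleton-recall term specifically punishes any missed centerline voxel, which is what forces the separator between compressed sheets to survive.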

Loading the Spec 1.5 checkpoint

The trainer is a drop-in for the bruniss/VC-Surface-Models nnUNet v2 fork. Drop nnUNetTrainerSkeletonRecall_MedNeXtL_kernel5.py into the trainers directory and run inference with the standard nnUNet v2 path, or use the production-inference script from the parent repo:

```bash
git clone https://github.com/ciscoriordan/mednext-vs-umamba-scroll
cd mednext-vs-umamba-scroll
huggingface-cli download ciscoriordan/mednext-l-scroll-surface \
    kernel5_skelrec_dataset059_ep33/checkpoint_best.pth \
    kernel5_skelrec_dataset059_ep33/plans.json \
    --local-dir checkpoints/spec1_5_skelrec_mednext_kernel5_ep33_BEST/

# OME-Zarr / plain-Zarr scroll volume in:
python scripts/inference/run_on_zarr.py \
    --volume /path/to/scroll1.volume.zarr \
    --ckpt   checkpoints/spec1_5_skelrec_mednext_kernel5_ep33_BEST/checkpoint_best.pth \
    --out    /path/to/scroll1.spec1_5_prob.zarr
```

Bbox-subset and .npy cube back-compat documented in the parent README's Quickstart.
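Inference proceeds in sliding windows (128³ patches, 0.5 overlap; see Inputs and outputs). A sketch of how such tile offsets could be laid out along one axis (hypothetical helper, not the actual script's code):

```python
def tile_starts(dim: int, patch: int = 128, overlap: float = 0.5) -> list:
    """Start offsets along one axis for sliding-window tiles of size
    `patch` with the given fractional overlap; the final tile is shifted
    to end flush with the volume edge so no voxels are skipped."""
    step = max(1, int(patch * (1.0 - overlap)))
    starts = list(range(0, max(dim - patch, 0) + 1, step))
    if dim > patch and starts[-1] != dim - patch:
        starts.append(dim - patch)
    return starts
```

For a 256-voxel axis this yields starts at 0, 64, and 128; overlapping predictions are then typically averaged (often with a Gaussian weight toward patch centers, as nnUNet does).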

Inputs and outputs

  • Input: 3D CT cube, single channel uint8, isotropic 1x1x1 voxel spacing. Patch size 128^3 with 0.5 overlap.
  • Output: 2-class softmax (background, surface). Take fg probability or threshold around T=0.20-0.25 for binary mask.
  • CT normalization: clip to plans.json's foreground percentiles (p005=44, p995=236), then standardize by (x - 129.10) / 42.19.
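The normalization step is easy to apply outside the nnUNet pipeline; a sketch using the constants quoted above from plans.json (verify them against your downloaded plans file before relying on them):

```python
import numpy as np

# Constants quoted from the shipped plans.json (see bullet above);
# double-check against your own copy.
CLIP_LO, CLIP_HI = 44.0, 236.0   # foreground p0.5 / p99.5 intensities
MEAN, STD = 129.10, 42.19

def normalize_ct(cube_uint8: np.ndarray) -> np.ndarray:
    """nnUNet-style CT normalization: clip to foreground percentiles,
    then standardize with the dataset mean and std."""
    x = cube_uint8.astype(np.float32)
    x = np.clip(x, CLIP_LO, CLIP_HI)
    return (x - MEAN) / STD
```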

Training data

bruniss's Dataset059_s1_s4_s5_patches_frangiedt — 1754 volume/label pairs from Scrolls 1, 4, 5 with Frangi-DT-enhanced surface labels. Same architecture config and patch size as bruniss's d058 baseline; only the loss function and architecture differ.

Caveat worth flagging

nnUNet v2's pseudo-dice metric under SkeletonRecall is computed against the skeleton targets, not the binary surface, so it prints at roughly half the value plain Dice+CE training would show. At epoch 24 the pseudo-dice was 0.33 (looks unimpressive) while the actual held-out high_compressed IoU was already 0.626 (already well ahead of d058). Anyone re-running this needs held-out IoU probes during training, not pseudo-dice tracking.

Also: nnUNet v2's checkpoint_best.pth gets overwritten whenever the in-training EMA improves, even while held-out IoU is degrading. Snapshot checkpoints every 5-10 epochs under unique names; don't trust checkpoint_best.pth alone. Trajectory and caveats: docs/results.md#training-trajectory--overfit-recovery.

License

MIT for code/weights. Underlying training data (Dataset059) carries bruniss's license; we don't redistribute it.

Citation

@inproceedings{roy2023mednext,
  title={MedNeXt: Transformer-driven scaling of ConvNets for medical image segmentation},
  author={Roy, Saikat and Koehler, Gregor and Ulrich, Constantin and Baumgartner, Michael and Petersen, Jens and Isensee, Fabian and J{\"a}ger, Paul F and Maier-Hein, Klaus H},
  booktitle={MICCAI},
  year={2023}
}

@misc{spec1_5_2026,
  title={Spec 1.5: MedNeXt-L kernel5 + SkeletonRecall scroll surface predictor},
  author={Riordan, Francisco},
  year={2026},
  howpublished={GitHub: ciscoriordan/mednext-vs-umamba-scroll, HuggingFace: ciscoriordan/mednext-l-scroll-surface},
}

This work depends on bruniss's Dataset059 and VC-Surface-Models nnUNet v2 fork, the Skeleton Recall loss paper (Kirchhoff et al.), the nnUNet framework (Isensee et al.), and the open Vesuvius Challenge data infrastructure.
