# Spec 1.5: MedNeXt-L kernel5 + SkeletonRecall surface predictor
Open-source surface segmentation model for carbonized Herculaneum / Vesuvius scrolls. Beats bruniss's production d058 baseline by +0.27 absolute high_compressed IoU (+66% relative) on held-out Scroll-1 cubes, addressing the failure mode in ScrollPrize/villa#191 where d058 merges adjacent compressed sheets.
Full writeup, reproduction scripts, and held-out evaluation pipeline: https://github.com/ciscoriordan/mednext-vs-umamba-scroll
## Headline
| stratum (Scroll-1 instance-labels-harmonized cubes) | n | mean d058 high_compressed IoU | mean Spec 1.5 high_compressed IoU | mean Δ |
|---|---|---|---|---|
| truly held-out (z < 960, no Dataset059 training overlap) | 2 | 0.404 | 0.671 | +0.267 |
| partial overlap (z = 768) | 2 | 0.492 | 0.639 | +0.147 |
| in-training-z range | 2 | 0.387 | 0.509 | +0.122 |
| all six cubes | 6 | 0.427 | 0.606 | +0.179 |
Spec 1.5 wins on 6 of 6 cubes; biggest gap on the truly-held-out z=0 cube (+0.32). Numbers are best per-bucket-threshold IoU (calibration-fair: d058 calibrates around T=0.10, Spec 1.5 around T=0.20-0.25).
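The "best per-bucket-threshold IoU" protocol can be sketched in a few lines (a minimal sketch: the function name and threshold grid are illustrative, not the repo's actual evaluation code):

```python
import numpy as np

def best_threshold_iou(prob, target, thresholds=np.arange(0.05, 0.55, 0.05)):
    """Calibration-fair IoU: sweep binarization thresholds and keep the best.

    Each model is scored at its own best operating point (d058 near T=0.10,
    Spec 1.5 near T=0.20-0.25) rather than at one shared fixed threshold.
    """
    best = 0.0
    for t in thresholds:
        pred = prob >= t                      # binarize at this threshold
        union = np.logical_or(pred, target).sum()
        if union:                             # skip empty-union degenerate case
            best = max(best, np.logical_and(pred, target).sum() / union)
    return best
```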
## What's in this repo
| path | what it is |
|---|---|
| `kernel5_skelrec_dataset059_ep33/` | Spec 1.5 (the win): MedNeXt-L kernel5 + SkeletonRecall + Dataset059, epoch-33 best-generalizing snapshot. Contains `checkpoint_best.pth`, `plans.json`, `dataset.json`, `nnUNetTrainerSkeletonRecall_MedNeXtL_kernel5.py`, `progress.png`, and the full training log. |
| `kernel5_stage_a_lowlr/` | MedNeXt-L kernel5 with plain Dice+CE (no SkeletonRecall). Stage A of UpKern (lr=5e-4). Baseline. |
| `kernel5_stage_b_highlr_continuation/` | MedNeXt-L kernel5 with plain Dice+CE, warm-start + high LR. Baseline. |
| `checkpoint_kernel3_best.model` + `.pkl` | MedNeXt-L kernel3 with plain Dice+CE (UpKern stage 1; weights upsampled to kernel5 in Spec 1.5). Baseline. |
| `plans.pkl` | nnUNet v1 plans (legacy, for the kernel3 / kernel5 plain Dice+CE baselines). |
The Spec 1.5 checkpoint is the one to use for production inference; the others are baselines preserved for the architecture-vs-loss ablation analysis.
## How the win was isolated
Plain Dice+CE on MedNeXt-L (any kernel size, any LR schedule, any init) lands at 0.29-0.32 high_compressed IoU on the held-out z=0 cube, all worse than d058 (0.346). Same architecture with bruniss's SkeletonRecall loss (Dice + Skeleton-Recall + CE) jumps to 0.666. The architecture and the loss both contribute, and they stack:
| model | high_compressed IoU on cube_00000 (held-out) |
|---|---|
| d058 (ResEnc-L + SkelRec, production) | 0.346 |
| MedNeXt-L kernel3 + plain Dice+CE | 0.320 |
| MedNeXt-L kernel5 + plain Dice+CE (best variant) | 0.306 |
| MedNeXt-L kernel5 + SkeletonRecall (Spec 1.5) | 0.666 |
So: replacing Dice+CE with SkeletonRecall is the lever for issue #191's failure mode. A stronger architecture compounds with it cleanly.
## Loading the Spec 1.5 checkpoint
The trainer is a drop-in for the bruniss/VC-Surface-Models nnUNet v2 fork. Drop `nnUNetTrainerSkeletonRecall_MedNeXtL_kernel5.py` into the fork's trainers directory and run inference via the standard nnUNet v2 path, or use the production-inference script from the parent repo:
```bash
git clone https://github.com/ciscoriordan/mednext-vs-umamba-scroll
cd mednext-vs-umamba-scroll

huggingface-cli download ciscoriordan/mednext-l-scroll-surface \
  kernel5_skelrec_dataset059_ep33/checkpoint_best.pth \
  kernel5_skelrec_dataset059_ep33/plans.json \
  --local-dir checkpoints/spec1_5_skelrec_mednext_kernel5_ep33_BEST/

# OME-Zarr / plain-Zarr scroll volume in:
python scripts/inference/run_on_zarr.py \
  --volume /path/to/scroll1.volume.zarr \
  --ckpt checkpoints/spec1_5_skelrec_mednext_kernel5_ep33_BEST/checkpoint_best.pth \
  --out /path/to/scroll1.spec1_5_prob.zarr
```
Bbox-subset inference and `.npy`-cube back-compat are documented in the parent README's Quickstart.
## Inputs and outputs
- Input: 3D CT cube, single-channel `uint8`, isotropic 1x1x1 voxel spacing. Patch size 128^3 with 0.5 overlap.
- Output: 2-class softmax (background, surface). Take the foreground probability, or threshold around T=0.20-0.25 for a binary mask.
- CT normalization: clip to `plans.json`'s foreground percentiles (p005=44, p995=236), then standardize by (x - 129.10) / 42.19.
## Training data
bruniss's Dataset059_s1_s4_s5_patches_frangiedt — 1754 volume/label pairs from Scrolls 1, 4, and 5 with Frangi-DT-enhanced surface labels. Same preprocessing and patch size as bruniss's d058 baseline; only the architecture and the loss function differ.
## Caveat worth flagging
nnUNet v2's pseudo-dice metric under SkeletonRecall is computed against the skeleton targets, not the binary surface, so it reads roughly half as high as what plain Dice+CE training would print. At epoch 24 the pseudo-dice was 0.33 (unimpressive on its face) while the actual held-out high_compressed IoU was already 0.626 (already crushing d058). Anyone re-running this should probe held-out IoU during training rather than track pseudo-dice.
Also: nnUNet v2's `checkpoint_best.pth` gets overwritten as the in-training EMA improves, even when held-out IoU is degrading. Snapshot checkpoints every 5-10 epochs under unique names; don't trust `checkpoint_best.pth` alone. Trajectory and caveats: docs/results.md#training-trajectory--overfit-recovery.
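A minimal way to implement the snapshotting advice (hypothetical helper, not part of the repo; hook it into your training loop or a periodic watcher however suits your setup):

```python
import os
import shutil

def snapshot_best(run_dir: str, epoch: int, every: int = 5):
    """Copy checkpoint_best.pth to a uniquely named file every `every` epochs,
    so snapshots survive nnUNet v2 later overwriting checkpoint_best.pth.
    Returns the snapshot path, or None if nothing was copied."""
    if epoch % every != 0:
        return None
    src = os.path.join(run_dir, "checkpoint_best.pth")
    if not os.path.exists(src):
        return None
    dst = os.path.join(run_dir, f"checkpoint_ep{epoch:04d}.pth")
    shutil.copy2(src, dst)
    return dst
```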
## License
MIT for code/weights. Underlying training data (Dataset059) carries bruniss's license; we don't redistribute it.
## Citation
```bibtex
@inproceedings{roy2023mednext,
  title={MedNeXt: Transformer-driven scaling of ConvNets for medical image segmentation},
  author={Roy, Saikat and Koehler, Gregor and Ulrich, Constantin and Baumgartner, Michael and Petersen, Jens and Isensee, Fabian and J{\"a}ger, Paul F and Maier-Hein, Klaus H},
  booktitle={MICCAI},
  year={2023}
}

@misc{spec1_5_2026,
  title={Spec 1.5: MedNeXt-L kernel5 + SkeletonRecall scroll surface predictor},
  author={Riordan, Francisco},
  year={2026},
  howpublished={GitHub: ciscoriordan/mednext-vs-umamba-scroll, HuggingFace: ciscoriordan/mednext-l-scroll-surface},
}
```
This work depends on bruniss's Dataset059 and VC-Surface-Models nnUNet v2 fork, the SkeletonRecall paper (Mehta et al.), the nnUNet framework (Isensee et al.), and the open Vesuvius Challenge data infrastructure.
