# Scribble Segmentation Ensemble
Binary foreground/background segmentation from sparse user scribbles. Honest cross-validated mean IoU of 0.842 on the PASCAL VOC scribble subset, trained from scratch with no pretrained weights. The pipeline combines two five-fold ensembles of small U-Nets averaged with multi-scale test-time augmentation, then calibrated by a per-image threshold model.
## Results
| Method | Mean IoU | Background IoU | Foreground IoU |
|---|---|---|---|
| Per-image K-NN baseline (k=11) | 0.499 | 0.637 | 0.361 |
| First U-Net (no CutMix) | 0.788 | 0.900 | 0.676 |
| U-Net trained with CutMix augmentation | 0.819 | 0.913 | 0.724 |
| Pair of CutMix U-Nets (different random seeds), averaged | 0.832 | 0.919 | 0.743 |
| CutMix U-Net averaged with a U-Net trained on pseudo-labels (this release) | 0.842 | 0.925 | 0.760 |
The progression reads top to bottom. Switching from per-image K-NN to a single globally-trained U-Net is the largest jump because the U-Net learns from every pixel of the 228 ground-truth masks instead of just the sparse scribbles. CutMix augmentation gives the next bump by spatially recombining training examples, which matters on a small dataset like this one. Averaging two seed twins removes some of the variance between any single model's mistakes. The final step replaces one seed twin with a U-Net that also saw 454 unlabeled test images with predicted labels from the previous ensemble. The pseudo-labels are noisy (around 17% wrong on average) but the extra visual diversity wins by about 0.01 mIoU.
For context: the original course leaderboard had 28 teams; this release would place in the top four. The winning team reached 0.868.
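The per-class numbers in the table are standard intersection-over-union, computed separately for background and foreground and then averaged. A minimal sketch of the metric (the helper name is illustrative, not from the repo):

```python
import numpy as np

def iou_per_class(pred, gt):
    """Background IoU, foreground IoU, and their mean for binary masks.

    pred, gt: numpy arrays of 0/1 with the same shape.
    """
    ious = []
    for cls in (0, 1):  # 0 = background, 1 = foreground
        p = pred == cls
        g = gt == cls
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 1.0)
    bg_iou, fg_iou = ious
    return bg_iou, fg_iou, (bg_iou + fg_iou) / 2
```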
## Quick start
The two essential scripts (`predict_ensemble.py` for inference and `train_global_unet.py` for training) ship inside this repo along with the weights, so the model is runnable without any external code.
```shell
hf download Enorenio/scribble-segmentation --local-dir release
cd release
pip install torch numpy pillow opencv-python scipy
```
Inference expects images at `dataset/test1/images/*.jpg` and matching scribbles at `dataset/test1/scribbles/*.png` (values 0=bg, 1=fg, 255=unlabeled). It also needs a reference palette file at `dataset/train/ground_truth/<any>.png` to colorize the output (any PASCAL VOC palette PNG works).
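The scribble encoding above maps directly onto the two one-hot channels the network consumes. A sketch (the helper name is hypothetical):

```python
import numpy as np

def scribble_to_channels(s):
    """Decode a scribble array (0=bg, 1=fg, 255=unlabeled) into the two
    one-hot channels stacked alongside the RGB image at inference time."""
    bg = (s == 0).astype(np.float32)
    fg = (s == 1).astype(np.float32)
    return np.stack([bg, fg])  # shape (2, H, W); unlabeled pixels are zero in both
```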
```shell
python predict_ensemble.py \
    --ckpt-dirs runs_v4:64:44 runs_v7_pseudo:64:47 \
    --gpu 0
```
Predictions land at `dataset/test1/predictions/*.png`.
The interactive demo at https://enorenio.github.io/scribble-seg-demo/ shows side-by-side predictions for every method on all 682 train and test images, plus an analysis of the five universally hardest cases.
## What is in this repo

Two sets of five-fold checkpoints plus a tiny threshold model:
| Path | Contents |
|---|---|
| `runs_v4/fold_{0..4}/best.pth` | Five seed-twin U-Nets, each trained with CutMix augmentation on the 228 labeled training images. |
| `runs_v7_pseudo/fold_{0..4}/best.pth` | Five U-Nets trained on those same images plus 454 unlabeled test images, using an earlier ensemble's predictions as pseudo ground truth. |
| `threshold_predictor.json` | Five-feature linear model that picks the optimal binary cutoff per image, fit on out-of-fold ensemble probabilities. |
At inference time the ten checkpoints all predict on the input, each at three scales (0.7, 1.0, 1.3) and with horizontal flip. Their probabilities are averaged, the per-image threshold is applied, morphological cleanup runs over the result, and any pixel inside a user scribble is hard-snapped to its given label.
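Minus the multi-scale resizing and morphological cleanup, that pipeline can be sketched as follows, assuming each model maps an image to a per-pixel foreground probability map (function and argument names are illustrative):

```python
import numpy as np

def ensemble_predict(models, image, scribble, threshold=0.5):
    """Sketch of inference: average probabilities over models and
    horizontal flips, threshold, then hard-snap scribbled pixels.
    image: (H, W, 3) array; scribble: (H, W) with 0=bg, 1=fg, 255=unlabeled."""
    probs = []
    for model in models:
        probs.append(model(image))
        probs.append(model(image[:, ::-1])[:, ::-1])  # horizontal-flip TTA
    prob = np.mean(probs, axis=0)
    pred = (prob >= threshold).astype(np.uint8)
    pred[scribble == 1] = 1   # foreground scribbles always win
    pred[scribble == 0] = 0   # background scribbles always win
    return pred
```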
## Model details
The architecture is a small U-Net, roughly 30 million parameters per checkpoint, with 64 base channels and standard encoder-decoder skip connections. Inputs are five channels: three RGB and two one-hot scribble channels (one marks background scribbles, the other foreground scribbles). Output is a per-pixel foreground probability after a sigmoid.
Training loss combines binary cross-entropy with soft Dice at equal weights. The optimizer is AdamW at 1e-3 with cosine annealing, batch size 6, 150 epochs per fold, image size 384x512, on a single NVIDIA A40. Augmentation includes horizontal flip, random affine (rotation up to 12 degrees, scale 0.85 to 1.2), color jitter, scribble dropout, and CutMix at probability 0.4.
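The loss described above is the equal-weight sum of binary cross-entropy and soft Dice. In numpy form (the repo trains in PyTorch; this only sketches the math):

```python
import numpy as np

def bce_soft_dice(prob, target, eps=1e-6):
    """Equal-weight sum of binary cross-entropy and soft Dice loss.

    prob:   predicted foreground probabilities in (0, 1)
    target: binary ground-truth mask
    """
    p = np.clip(prob, eps, 1 - eps)
    bce = -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))
    inter = (p * target).sum()
    dice = (2 * inter + eps) / (p.sum() + target.sum() + eps)
    return bce + (1 - dice)  # near 0 for a perfect prediction
```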
The two ensembles differ only in training data. The first sees the 228 labeled images. The second adds 454 unlabeled test images with predicted labels from a previous CutMix ensemble. That roughly triples the visual diversity at the cost of label noise, and the trade favored diversity by about 0.01 mIoU.
## Strengths and weaknesses
Works well when foreground and background differ clearly in color: a red car on a white wall, a dark animal against bright grass, a sofa filling most of the frame.
Three kinds of cases break it. Low-contrast figure-ground, like a black cat on a dark couch, where neither the model nor the supervising scribbles can resolve where the object ends. Cluttered scenes where many objects look like the target, like a bicycle frame surrounded by other metal parts in a junkyard. Thin or articulated structures where parts of one object look disconnected, like the spokes and frame segments of a bicycle. The "Hardest 5" tab in the demo walks through specific examples of each.
## Limitations
- **Binary only.** This model predicts foreground vs. background, not multi-class semantic segmentation.
- **Scribbles required.** Two of the five input channels carry the user's scribbles. The network was trained to expect them, so passing zeros there degrades quality noticeably.
- **Trained from scratch.** The original course rules forbade pretrained encoders. With a pretrained backbone the same pipeline would likely gain five to ten mIoU points.
- **PASCAL VOC domain.** Training images are natural indoor and outdoor scenes from PASCAL VOC. Out-of-distribution images (medical, aerial, microscopy) need retraining or domain adaptation.
## Citation
```bibtex
@misc{morshnev2025scribbleseg,
  author = {Aleksey Morshnev},
  title  = {Scribble Segmentation Ensemble},
  year   = {2025}
}
```
## License

MIT for the model weights and inference code. The PASCAL VOC dataset has its own license.
## Evaluation results

- Mean IoU (5-fold out-of-fold) on the PASCAL VOC scribble subset (228 train, 226 test1, 228 test2): 0.842 (self-reported)
- Background IoU on the PASCAL VOC scribble subset: 0.925 (self-reported)
- Foreground IoU on the PASCAL VOC scribble subset: 0.760 (self-reported)