---
license: mit
library_name: pytorch
tags:
- image-segmentation
- scribble-supervised
- pascal-voc
- u-net
- ensemble
pipeline_tag: image-segmentation
datasets:
- pascal-voc
metrics:
- miou
model-index:
- name: scribble-segmentation
  results:
  - task:
      type: image-segmentation
      name: Scribble-supervised binary segmentation
    dataset:
      type: pascal-voc
      name: PASCAL VOC scribble subset (228 train, 226 test1, 228 test2)
    metrics:
    - type: miou
      value: 0.842
      name: Mean IoU (5-fold out-of-fold)
    - type: bg_iou
      value: 0.925
      name: Background IoU
    - type: fg_iou
      value: 0.760
      name: Foreground IoU
---

# Scribble Segmentation Ensemble

Binary foreground/background segmentation from sparse user scribbles. The reported mean IoU of 0.842 on the PASCAL VOC scribble subset is cross-validated (5-fold out-of-fold), and every model is trained from scratch with no pretrained weights. The pipeline combines two five-fold ensembles of small U-Nets, averages them with multi-scale test-time augmentation, and then calibrates the result with a per-image threshold model.
## Results

| Method | Mean IoU | Background IoU | Foreground IoU |
|---|---|---|---|
| Per-image K-NN baseline (k=11) | 0.499 | 0.637 | 0.361 |
| First U-Net (no CutMix) | 0.788 | 0.900 | 0.676 |
| U-Net trained with CutMix augmentation | 0.819 | 0.913 | 0.724 |
| Pair of CutMix U-Nets (different random seeds), averaged | 0.832 | 0.919 | 0.743 |
| **CutMix U-Net averaged with a U-Net trained on pseudo-labels (this release)** | **0.842** | **0.925** | **0.760** |

The progression reads top to bottom. Switching from per-image K-NN to a single globally trained U-Net is the largest jump because the U-Net learns from every pixel of the 228 ground-truth masks instead of just the sparse scribbles. CutMix augmentation gives the next bump by spatially recombining training examples, which matters on a dataset this small. Averaging two seed twins removes some of the variance in any single model's mistakes. The final step replaces one seed twin with a U-Net that also saw 454 unlabeled test images with predicted labels from the previous ensemble. The pseudo-labels are noisy (around 17% wrong on average), but the extra visual diversity wins by about 0.01 mIoU.
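CutMix for segmentation pastes a random rectangle from one training sample into another and cuts the mask with the same box, so labels stay aligned with pixels (unlike classification CutMix, which mixes labels proportionally). A minimal sketch of the idea — the function name and the Beta-sampled box size are illustrative, not taken from this repo's code:

```python
import numpy as np

def cutmix_pair(img_a, mask_a, img_b, mask_b, rng, alpha=1.0):
    """Paste a random rectangle from sample B into sample A,
    applying the identical box to both image and mask."""
    h, w = img_a.shape[:2]
    lam = rng.beta(alpha, alpha)              # fraction of A kept
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    img, mask = img_a.copy(), mask_a.copy()
    img[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]   # same box for image...
    mask[y0:y1, x0:x1] = mask_b[y0:y1, x0:x1] # ...and for the mask
    return img, mask
```

On a 228-image dataset this kind of spatial recombination is one of the few ways to manufacture genuinely new-looking training scenes.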

For context: the original course leaderboard had 28 teams. This release would place in the top four; the winning team reached 0.868.

## Quick start

The inference script (`predict_ensemble.py`) and training script (`train_global_unet.py`) ship inside this repo along with the weights, so the model is runnable without any external code.

```bash
hf download Enorenio/scribble-segmentation --local-dir release
cd release
pip install torch numpy pillow opencv-python scipy
```

Inference expects images at `dataset/test1/images/*.jpg` and matching scribbles at `dataset/test1/scribbles/*.png` (values 0=bg, 1=fg, 255=unlabeled). It also needs a reference palette file at `dataset/train/ground_truth/<any>.png` to colorize the output (any PASCAL VOC palette PNG works).
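The scribble encoding above decodes into the two one-hot channels the network consumes with a few lines of NumPy. A sketch (the helper name is hypothetical):

```python
import numpy as np
from PIL import Image

def load_scribble_channels(path):
    """Turn a scribble PNG (0=bg, 1=fg, 255=unlabeled) into two
    one-hot channels; unlabeled pixels are zero in both."""
    s = np.array(Image.open(path))
    bg = (s == 0).astype(np.float32)   # background scribble channel
    fg = (s == 1).astype(np.float32)   # foreground scribble channel
    return np.stack([bg, fg], axis=0)  # shape (2, H, W)
```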

```bash
python predict_ensemble.py \
    --ckpt-dirs runs_v4:64:44 runs_v7_pseudo:64:47 \
    --gpu 0
```

Predictions land at `dataset/test1/predictions/*.png`.

The interactive demo at https://enorenio.github.io/scribble-seg-demo/ shows side-by-side predictions for every method on all 682 train and test images, plus an analysis of the five universally hardest cases.
## What is in this repo

Two sets of five-fold checkpoints plus a tiny threshold model:

| Path | Contents |
|---|---|
| `runs_v4/fold_{0..4}/best.pth` | Five seed-twin U-Nets, each trained with CutMix augmentation on the 228 labeled training images. |
| `runs_v7_pseudo/fold_{0..4}/best.pth` | Five U-Nets trained on those same images plus 454 unlabeled test images, using an earlier ensemble's predictions as pseudo ground truth. |
| `threshold_predictor.json` | Five-feature linear model that picks the optimal binary cutoff per image, fit on out-of-fold ensemble probabilities. |
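The exact features of the threshold model are defined in the JSON alongside its weights; the sketch below assumes a plain `{"weights": [...], "bias": ...}` layout and uses five illustrative probability-map statistics as stand-ins:

```python
import json
import numpy as np

def predict_threshold(prob_map, model_path="threshold_predictor.json"):
    """Linear model over statistics of the averaged probability map,
    producing a per-image binarization cutoff. Feature choices and
    the clipping range here are assumptions, not the repo's values."""
    feats = np.array([
        prob_map.mean(),
        prob_map.std(),
        np.median(prob_map),
        (prob_map > 0.5).mean(),       # naive foreground fraction
        np.quantile(prob_map, 0.9),
    ])
    with open(model_path) as f:
        m = json.load(f)
    t = float(np.dot(m["weights"], feats) + m["bias"])
    return float(np.clip(t, 0.05, 0.95))  # keep the cutoff in a sane range
```

A fixed 0.5 cutoff is the obvious alternative; a learned per-image cutoff helps most on images whose foreground fraction is far from the training average.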

At inference time all ten checkpoints predict on the input, each at three scales (0.7, 1.0, 1.3) and with horizontal flip. Their probabilities are averaged, the per-image threshold is applied, morphological cleanup runs over the result, and any pixel inside a user scribble is hard-snapped to its given label.
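The multi-scale, flipped averaging step can be sketched as follows. This is a simplified version (the real `predict_ensemble.py` also handles thresholding, morphology, and scribble snapping); `models` stands in for the ten loaded checkpoints:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tta_predict(models, x, scales=(0.7, 1.0, 1.3)):
    """Average sigmoid probabilities over every checkpoint, three
    scales, and a horizontal flip: 2 * len(scales) views per model.
    `x` is a (B, 5, H, W) tensor of RGB + scribble channels."""
    _, _, h, w = x.shape
    acc, n = torch.zeros(x.shape[0], 1, h, w), 0
    for net in models:
        for s in scales:
            xs = F.interpolate(x, scale_factor=s, mode="bilinear",
                               align_corners=False)
            for flip in (False, True):
                xi = torch.flip(xs, dims=[3]) if flip else xs
                p = torch.sigmoid(net(xi))
                if flip:                       # undo the flip on the output
                    p = torch.flip(p, dims=[3])
                # resize probabilities back to the input resolution
                p = F.interpolate(p, size=(h, w), mode="bilinear",
                                  align_corners=False)
                acc += p
                n += 1
    return acc / n
```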

## Model details

The architecture is a small U-Net, roughly 30 million parameters per checkpoint, with 64 base channels and standard encoder-decoder skip connections. Inputs have five channels: three RGB and two one-hot scribble channels (one marks background scribbles, the other foreground scribbles). Output is per-pixel foreground probability after a sigmoid.

Training loss combines binary cross-entropy with soft Dice at equal weights. The optimizer is AdamW at 1e-3 with cosine annealing, batch size 6, 150 epochs per fold, image size 384x512, on a single NVIDIA A40. Augmentation includes horizontal flip, random affine (rotation up to 12 degrees, scale 0.85 to 1.2), color jitter, scribble dropout, and CutMix at probability 0.4.
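The equal-weight BCE + soft Dice objective fits in a few lines. A sketch (the smoothing constant `eps` is an assumption):

```python
import torch

def bce_dice_loss(logits, target, eps=1.0):
    """Equal-weight sum of binary cross-entropy and soft Dice loss
    over (B, 1, H, W) logits and binary targets."""
    bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, target)
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2, 3))
    union = p.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2 * inter + eps) / (union + eps)   # soft Dice per image
    return bce + (1 - dice).mean()
```

BCE pushes every pixel independently; the Dice term directly optimizes region overlap, which counters the class imbalance between large backgrounds and small foreground objects.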

The two ensembles differ only in training data. The first sees the 228 labeled images. The second adds 454 unlabeled test images with predicted labels from a previous CutMix ensemble. That roughly triples the visual diversity at the cost of label noise, and the trade favored diversity by about 0.01 mIoU.
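The pseudo-labeling round amounts to writing hard masks for the unlabeled split, then retraining on the union of real and pseudo masks. A hypothetical sketch of the first step (directory layout, helper name, and the `predict_fn` callable are all assumptions):

```python
import numpy as np
from PIL import Image
from pathlib import Path

def make_pseudo_labels(predict_fn, image_dir, out_dir, threshold=0.5):
    """Write hard pseudo-masks for every unlabeled image so a new
    model can train on them as if they were ground truth.
    `predict_fn` maps an image path to an (H, W) probability map."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(image_dir).glob("*.jpg")):
        prob = predict_fn(img_path)                    # floats in [0, 1]
        mask = (prob > threshold).astype(np.uint8)     # hard 0/1 labels
        Image.fromarray(mask, mode="L").save(out / (img_path.stem + ".png"))
```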

## Strengths and weaknesses

Works well when foreground and background differ clearly in color: a red car on a white wall, a dark animal against bright grass, a sofa filling most of the frame.

Three kinds of cases break it:

- **Low-contrast figure-ground**, like a black cat on a dark couch, where neither the model nor the supervising scribbles can resolve where the object ends.
- **Cluttered scenes** where many objects look like the target, like a bicycle frame surrounded by other metal parts in a junkyard.
- **Thin or articulated structures** where parts of one object look disconnected, like the spokes and frame segments of a bicycle.

The "Hardest 5" tab in the demo walks through specific examples of each.

## Limitations

**Binary only.** This model predicts foreground vs. background, not multi-class semantic segmentation.

**Scribbles required.** Two of the five input channels carry the user's scribbles. The network was trained to expect them, so passing zeros there degrades quality noticeably.

**Trained from scratch.** The original course rules forbade pretrained encoders. With a pretrained backbone, the same pipeline would likely add five to ten mIoU points.

**PASCAL VOC domain.** Training images are natural indoor and outdoor scenes from PASCAL VOC. Out-of-distribution images (medical, aerial, microscopy) need retraining or domain adaptation.

## Citation

```bibtex
@misc{morshnev2025scribbleseg,
  author = {Aleksey Morshnev},
  title  = {Scribble Segmentation Ensemble},
  year   = {2025}
}
```

## License

MIT for the model weights and inference code. The PASCAL VOC dataset has its own license.