---
license: mit
library_name: pytorch
tags:
  - image-segmentation
  - scribble-supervised
  - pascal-voc
  - u-net
  - ensemble
pipeline_tag: image-segmentation
datasets:
  - pascal-voc
metrics:
  - miou
model-index:
  - name: scribble-segmentation
    results:
      - task:
          type: image-segmentation
          name: Scribble-supervised binary segmentation
        dataset:
          type: pascal-voc
          name: PASCAL VOC scribble subset (228 train, 226 test1, 228 test2)
        metrics:
          - type: miou
            value: 0.842
            name: Mean IoU (5-fold out-of-fold)
          - type: bg_iou
            value: 0.925
            name: Background IoU
          - type: fg_iou
            value: 0.760
            name: Foreground IoU
---

# Scribble Segmentation Ensemble

Binary foreground/background segmentation from sparse user scribbles. Honest cross-validated mean IoU of 0.842 on the PASCAL VOC scribble subset, trained from scratch with no pretrained weights. The pipeline averages two five-fold ensembles of small U-Nets under multi-scale test-time augmentation, then binarizes each probability map with a learned per-image threshold.

## Results

| Method | Mean IoU | Background IoU | Foreground IoU |
|---|---|---|---|
| Per-image K-NN baseline (k=11) | 0.499 | 0.637 | 0.361 |
| First U-Net (no CutMix) | 0.788 | 0.900 | 0.676 |
| U-Net trained with CutMix augmentation | 0.819 | 0.913 | 0.724 |
| Pair of CutMix U-Nets (different random seeds), averaged | 0.832 | 0.919 | 0.743 |
| **CutMix U-Net averaged with a U-Net trained on pseudo-labels (this release)** | **0.842** | **0.925** | **0.760** |

The progression reads top to bottom. Switching from the per-image K-NN to a single globally trained U-Net is the largest jump because the U-Net learns from every pixel of the 228 ground-truth masks instead of just the sparse scribbles. CutMix augmentation gives the next bump by spatially recombining training examples, which matters on a dataset this small. Averaging two seed twins smooths out some of the variance in any single model's mistakes. The final step replaces one seed twin with a U-Net that also saw 454 unlabeled test images carrying predicted labels from the previous ensemble. Those pseudo-labels are noisy (around 17% wrong on average), but the extra visual diversity wins by about 0.01 mIoU.
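
For intuition, here is a minimal CutMix sketch for segmentation batches: a rectangle from a shuffled batch partner is pasted into each image, and the mask region travels with it, so no label interpolation is needed. This is an illustrative reconstruction, not the repo's code; the tensor shapes and the Beta prior are assumptions.

```python
import torch

def cutmix(images, masks, alpha=1.0):
    """Paste one random rectangle from a shuffled batch partner into each
    image, copying the matching mask region along with it.

    images: (B, C, H, W) float tensor; masks: (B, H, W) label tensor.
    Hedged sketch -- the repo's own CutMix may differ in details.
    """
    B, _, H, W = images.shape
    perm = torch.randperm(B)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()

    # One box per batch whose area is roughly (1 - lam) of the image.
    cut_h = int(H * (1 - lam) ** 0.5)
    cut_w = int(W * (1 - lam) ** 0.5)
    cy = torch.randint(0, H, (1,)).item()
    cx = torch.randint(0, W, (1,)).item()
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, H)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, W)

    images[:, :, y0:y1, x0:x1] = images[perm, :, y0:y1, x0:x1]
    masks[:, y0:y1, x0:x1] = masks[perm, y0:y1, x0:x1]
    return images, masks
```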

For context: the original course leaderboard had 28 teams. This release would place in the top four. The winning team reached 0.868.

## Quick start

The two essential scripts (`predict_ensemble.py` and `train_global_unet.py`) ship inside this repo along with the weights, so the model is runnable without any external code.

```bash
hf download Enorenio/scribble-segmentation --local-dir release
cd release
pip install torch numpy pillow opencv-python scipy
```

Inference expects images at `dataset/test1/images/*.jpg` and matching scribbles at `dataset/test1/scribbles/*.png` (values 0=bg, 1=fg, 255=unlabeled). It also needs a reference palette file at `dataset/train/ground_truth/<any>.png` to colorize the output (any PASCAL VOC palette PNG works).

```bash
python predict_ensemble.py \
    --ckpt-dirs runs_v4:64:44 runs_v7_pseudo:64:47 \
    --gpu 0
```

Predictions land at `dataset/test1/predictions/*.png`.
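
To sanity-check the outputs, the snippet below reads a predicted palette PNG back as class indices and scores it against a ground-truth mask. It is a hedged sketch: the file name is hypothetical, and it assumes both PNGs use palette indices 0 = background, 1 = foreground.

```python
import numpy as np
from PIL import Image

def binary_iou(pred_path, gt_path):
    """Mean of background and foreground IoU for one image pair."""
    pred = np.array(Image.open(pred_path))  # palette PNG -> index array
    gt = np.array(Image.open(gt_path))
    ious = []
    for cls in (0, 1):  # 0 = background, 1 = foreground
        inter = np.logical_and(pred == cls, gt == cls).sum()
        union = np.logical_or(pred == cls, gt == cls).sum()
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)

# Hypothetical paths, for illustration only:
# binary_iou("dataset/test1/predictions/2007_000042.png",
#            "dataset/test1/ground_truth/2007_000042.png")
```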

The interactive demo at https://enorenio.github.io/scribble-seg-demo/ shows side-by-side predictions for every method on all 682 train and test images, plus an analysis of the five universally hardest cases.

## What is in this repo

Two five-checkpoint ensembles (one checkpoint per fold) plus a tiny threshold model:

| Path | Contents |
|---|---|
| `runs_v4/fold_{0..4}/best.pth` | Five seed-twin U-Nets, each trained with CutMix augmentation on the 228 labeled training images. |
| `runs_v7_pseudo/fold_{0..4}/best.pth` | Five U-Nets trained on those same images plus 454 unlabeled test images, using an earlier ensemble's predictions as pseudo ground truth. |
| `threshold_predictor.json` | Five-feature linear model that picks the optimal binary cutoff per image, fit on out-of-fold ensemble probabilities. |

At inference time the ten checkpoints all predict on the input, each at three scales (0.7, 1.0, 1.3) and with horizontal flip. Their probabilities are averaged, the per-image threshold is applied, morphological cleanup runs over the result, and any pixel inside a user scribble is hard-snapped to its given label.
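
A hedged sketch of that averaging loop, assuming each checkpoint is a model mapping the five-channel input to foreground logits; `predict_ensemble.py` is the authoritative implementation, and the resizing details here are simplified.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ensemble_predict(models, x, scales=(0.7, 1.0, 1.3)):
    """Average sigmoid probabilities over models x scales x {identity, h-flip}.

    x: (1, 5, H, W) input tensor; each model returns (1, 1, h, w) logits.
    """
    H, W = x.shape[-2:]
    prob_sum, n = torch.zeros(1, 1, H, W, device=x.device), 0
    for model in models:
        for s in scales:
            xs = x if s == 1.0 else F.interpolate(
                x, scale_factor=s, mode="bilinear", align_corners=False)
            for flip in (False, True):
                xi = torch.flip(xs, dims=[-1]) if flip else xs
                p = torch.sigmoid(model(xi))
                if flip:
                    p = torch.flip(p, dims=[-1])  # undo the flip
                if p.shape[-2:] != (H, W):
                    p = F.interpolate(p, size=(H, W), mode="bilinear",
                                      align_corners=False)
                prob_sum += p
                n += 1
    return prob_sum / n
```

The per-image threshold, morphological cleanup, and scribble snapping then operate on the map this returns.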

## Model details

The architecture is a small U-Net, roughly 30 million parameters per checkpoint, with 64 base channels and standard encoder-decoder skip connections. Inputs are five channels: three RGB plus two one-hot scribble channels (one marks background scribbles, the other foreground scribbles). Output is per-pixel foreground probability after a sigmoid.
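
Assembling that five-channel input is mechanical; a minimal sketch, assuming the 0/1/255 scribble encoding from the quick start and plain division by 255 for normalization (the repo's actual preprocessing may differ):

```python
import numpy as np
import torch
from PIL import Image

def build_input(image_path, scribble_path):
    """Stack RGB with two one-hot scribble planes -> (1, 5, H, W) tensor."""
    rgb = np.asarray(Image.open(image_path).convert("RGB"), np.float32) / 255.0
    scr = np.asarray(Image.open(scribble_path))  # 0=bg, 1=fg, 255=unlabeled
    bg = (scr == 0).astype(np.float32)  # background-scribble plane
    fg = (scr == 1).astype(np.float32)  # foreground-scribble plane
    x = np.concatenate([rgb.transpose(2, 0, 1), bg[None], fg[None]])
    return torch.from_numpy(x).unsqueeze(0)
```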

Training loss combines binary cross-entropy with soft Dice at equal weights. The optimizer is AdamW at 1e-3 with cosine annealing, batch size 6, 150 epochs per fold, image size 384x512, on a single NVIDIA A40. Augmentation includes horizontal flip, random affine (rotation up to 12 degrees, scale 0.85 to 1.2), color jitter, scribble dropout, and CutMix at probability 0.4.
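
The loss is easy to reconstruct; a sketch with equal weights on the two terms, where the Dice smoothing constant and the batch reduction are assumptions:

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, eps=1.0):
    """Equal-weight sum of binary cross-entropy and soft Dice.

    logits, target: (B, 1, H, W); target holds {0, 1} floats.
    """
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    denom = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (denom + eps)
    return bce + dice.mean()
```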

The two ensembles differ only in training data. The first sees the 228 labeled images. The second adds 454 unlabeled test images with predicted labels from a previous CutMix ensemble. That roughly triples the visual diversity at the cost of label noise, and the trade favored diversity by about 0.01 mIoU.
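
In outline, the pseudo-labeling round just binarizes the earlier ensemble's averaged probabilities into hard targets. A hedged sketch (the fixed 0.5 cutoff is an assumption; the release's inference path uses the learned per-image threshold):

```python
import torch

@torch.no_grad()
def make_pseudo_labels(predict_fn, unlabeled_inputs, thr=0.5):
    """Turn averaged ensemble probabilities into hard 0/1 training targets.

    predict_fn: callable mapping a (1, 5, H, W) input to a (1, 1, H, W)
    probability map, e.g. the TTA ensemble sketched above.
    """
    return [(predict_fn(x) > thr).long().squeeze(0).squeeze(0)
            for x in unlabeled_inputs]
```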

## Strengths and weaknesses

Works well when foreground and background differ clearly in color: a red car on a white wall, a dark animal against bright grass, a sofa filling most of the frame.

Three kinds of cases break it:

- Low-contrast figure-ground, like a black cat on a dark couch, where neither the model nor the supervising scribbles can resolve where the object ends.
- Cluttered scenes where many objects resemble the target, like a bicycle frame surrounded by other metal parts in a junkyard.
- Thin or articulated structures whose parts look disconnected, like the spokes and frame segments of a bicycle.

The "Hardest 5" tab in the demo walks through specific examples of each.

## Limitations

Binary only. This model predicts foreground vs background, not multi-class semantic segmentation.

Scribbles required. Two of the five input channels carry the user's scribbles. The network was trained to expect them, so passing zeros there degrades quality noticeably.

Trained from scratch. The original course rules forbade pretrained encoders. With a pretrained backbone the same pipeline would likely add five to ten mIoU points.

PASCAL VOC domain. Training images are natural indoor and outdoor scenes from PASCAL VOC. Out-of-distribution images (medical, aerial, microscopy) need retraining or domain adaptation.

## Citation

```bibtex
@misc{morshnev2025scribbleseg,
  author       = {Aleksey Morshnev},
  title        = {Scribble Segmentation Ensemble},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Enorenio/scribble-segmentation}}
}
```

## License

MIT for the model weights and inference code. The PASCAL VOC dataset has its own license.