SSync: Selective Synergistic Learning for Video Object-Centric Learning

---
datasets:
- movi-c
- movi-e
- youtube-vis-2021
- coco
language:
- en
library_name: pytorch
license: mit
pipeline_tag: image-segmentation
metrics:
- fg-ari
- mbo
tags:
- video-object-centric-learning
- object-discovery
- slot-attention
- unsupervised-segmentation
- video-understanding
- pytorch
---

<div align="center">
  
<h1> SSync: Selective Synergistic Learning for Video Object-Centric Learning </h1>

**ECCV 2026** · [Paper](https://arxiv.org/abs/2606.15527v1) · [Code](https://github.com/wjun0830/SSync) · [Project Page](https://wjun0830.github.io/SSync)

**Authors:** [WonJun Moon](https://github.com/wjun0830) (KAIST), Jae-Pil Heo (Sungkyunkwan University)

  <img src="images/fig1.png" width="50%">
  <img src="images/fig2.png" width="50%">
  
</div>

## Model Description

SSync is a selective mutual-distillation framework for video object-centric learning (VOCL). Slot-based VOCL methods are guided by two spatial maps: the encoder's **attention map** (sharp boundaries, noisy interiors) and the decoder's **object map** (coherent interiors, blurry boundaries). Rather than forcing dense agreement across all spatio-temporal patches, SSync distills only the most reliable cues from each map:

- **Encoder → Decoder:** boundary refinement via crisp attention boundaries
- **Decoder → Encoder:** interior denoising via coherent object maps

This selective distillation is implemented with a **linear-complexity pseudo-labeling** scheme that avoids quadratic pairwise spatial comparisons. A **transitive pseudo-label merging** step then consolidates redundant slots based on their spatio-temporal activation consistency, which makes SSync robust to the configured number of slots.
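The two ideas above can be illustrated with a small, pure-Python sketch. It is not the SSync implementation: the 4-neighbour boundary test, the IoU criterion, the `threshold` value, and all function names are assumptions chosen for illustration, and a single frame stands in for a video clip. Pseudo-labels are per-patch argmaxes, so the cost grows linearly with the number of patches, and slots are merged transitively with a union-find structure.

```python
# Illustrative sketch only: the selection rule, the IoU criterion and the
# threshold are assumptions, not the authors' method.

def argmax_labels(scores):
    """scores[y][x] holds K per-slot scores; returns the per-patch argmax grid."""
    return [[max(range(len(cell)), key=cell.__getitem__) for cell in row]
            for row in scores]

def boundary_mask(labels):
    """True where a 4-neighbour carries a different label (hypothetical test)."""
    h, w = len(labels), len(labels[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                    mask[y][x] = True
                    break
    return mask

def selective_targets(attn_scores, obj_scores):
    """Return (decoder_targets, encoder_targets); None means 'no supervision'."""
    attn_labels = argmax_labels(attn_scores)
    obj_labels = argmax_labels(obj_scores)
    attn_edge = boundary_mask(attn_labels)
    obj_edge = boundary_mask(obj_labels)
    h, w = len(attn_labels), len(attn_labels[0])
    # Encoder -> decoder: supervise only at attention-map boundaries.
    decoder_targets = [[attn_labels[y][x] if attn_edge[y][x] else None
                        for x in range(w)] for y in range(h)]
    # Decoder -> encoder: supervise only in object-map interiors.
    encoder_targets = [[None if obj_edge[y][x] else obj_labels[y][x]
                        for x in range(w)] for y in range(h)]
    return decoder_targets, encoder_targets

def merge_slots(active, threshold=0.5):
    """active[k] is the set of patch indices where slot k fires.

    Slots whose activation sets overlap (IoU >= threshold) are linked, and
    links are closed transitively. Returns one representative id per slot.
    """
    parent = list(range(len(active)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(active)):
        for j in range(i + 1, len(active)):
            union = len(active[i] | active[j])
            if union and len(active[i] & active[j]) / union >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(len(active))]
```

In this sketch, three slots whose activations overlap only pairwise (A with B, B with C) still end up sharing one pseudo-label, which is the transitive behaviour described above. The pairwise loop runs over slots rather than patches, so the overall cost stays linear in the number of patches.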

## Evaluation Results

Object discovery on VOCL benchmarks (averaged over 3 runs):

| Dataset | FG-ARI ↑ | mBO ↑ |
|---|---|---|
| MOVi-C (336×336) | **79.4** | **39.5** |
| MOVi-E (336×336) | **84.0** | 34.8 |
| YouTube-VIS 2021 (518×518) | 42.6 | **38.7** |
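
For reference, the two metrics can be computed as in the minimal sketch below. FG-ARI is the adjusted Rand index evaluated on ground-truth foreground pixels only, and mBO averages, over ground-truth objects, the best IoU achieved by any predicted segment. The function names, the flat per-pixel ID lists, and the convention that ID `0` is background are assumptions for illustration; the numbers in the table come from the authors' evaluation code, not from this snippet.

```python
from collections import Counter
from math import comb

def fg_ari(true_ids, pred_ids, background=0):
    """Adjusted Rand index restricted to pixels whose true id is not background."""
    pairs = [(t, p) for t, p in zip(true_ids, pred_ids) if t != background]
    n = len(pairs)
    # Pair counts from the contingency table and its marginals.
    sum_ij = sum(comb(c, 2) for c in Counter(pairs).values())
    sum_a = sum(comb(c, 2) for c in Counter(t for t, _ in pairs).values())
    sum_b = sum(comb(c, 2) for c in Counter(p for _, p in pairs).values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total if total else 0.0
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single foreground cluster
    return (sum_ij - expected) / (max_index - expected)

def mbo(true_ids, pred_ids, background=0):
    """Mean over ground-truth objects of the best IoU with any predicted segment."""
    true_sets, pred_sets = {}, {}
    for i, (t, p) in enumerate(zip(true_ids, pred_ids)):
        if t != background:
            true_sets.setdefault(t, set()).add(i)
        pred_sets.setdefault(p, set()).add(i)
    best = [max(len(g & s) / len(g | s) for s in pred_sets.values())
            for g in true_sets.values()]
    return sum(best) / len(best) if best else 0.0
```

mBO lies in [0, 1], while FG-ARI can drop below 0 for worse-than-chance clusterings; the table appears to report both scaled by 100.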

## Training Data

| Dataset | Size |
|---|---|
| YouTube-VIS 2021 | 26.43 GB |
| MOVi-C | 7.43 GB |
| MOVi-E | 8.26 GB |

See [data/README.md](https://github.com/wjun0830/SSync/blob/main/data/README.md) for download instructions.

## Citation

```bibtex
@inproceedings{moon2026ssync,
  title     = {Selective Synergistic Learning for Video Object-Centric Learning},
  author    = {Moon, WonJun and Heo, Jae-Pil},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```

## Acknowledgements

Built upon [VideoSAUR](https://github.com/martius-lab/videosaur), [SlotContrast](https://github.com/martius-lab/slotcontrast), [SRL](https://github.com/hynnsk/SRL), and [SlotCurri](https://github.com/wjun0830/SlotCurri).