---
datasets:
- movi-c
- movi-e
- youtube-vis-2021
- coco
language:
- en
library_name: pytorch
license: mit
pipeline_tag: image-segmentation
metrics:
- fg-ari
- mbo
tags:
- video-object-centric-learning
- object-discovery
- slot-attention
- unsupervised-segmentation
- video-understanding
- pytorch
---

<div align="center">

<h1> SSync: Selective Synergistic Learning for Video Object-Centric Learning </h1>

**ECCV 2026** · [Paper](https://arxiv.org/abs/2606.15527v1) · [Code](https://github.com/wjun0830/SSync) · [Project Page](https://wjun0830.github.io/SSync)

**Authors:** [WonJun Moon](https://github.com/wjun0830) (KAIST), Jae-Pil Heo (Sungkyunkwan University)

<img src="images/fig1.png" width="50%">
<img src="images/fig2.png" width="50%">

</div>

## Model Description

SSync is a selective mutual-distillation framework for video object-centric learning (VOCL). Slot-based VOCL methods are guided by two spatial maps: the encoder's **attention map** (sharp boundaries, noisy interiors) and the decoder's **object map** (coherent interiors, blurry boundaries). Rather than forcing dense agreement across all spatio-temporal patches, SSync distills only the most reliable cues from each map:

- **Encoder → Decoder:** boundary refinement via crisp attention boundaries
- **Decoder → Encoder:** interior denoising via coherent object maps

This is realized through a **linear-complexity pseudo-labeling** scheme that avoids quadratic spatial comparisons. A **transitive pseudo-label merging** step then consolidates redundant slots based on spatio-temporal activation consistency, making SSync robust to the configured number of slots.
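
To illustrate what "transitive" merging means, here is a minimal sketch, not the SSync implementation. It assumes each slot is summarized by the set of flattened spatio-temporal patch indices on which it is active, uses IoU between those sets as a stand-in for "activation consistency", and merges groups with a union-find structure. The function names, the IoU criterion, and the `iou_threshold` value are all hypothetical choices made for this example.

```python
from itertools import combinations


def find(parent, i):
    """Return the root of slot i, compressing the path along the way."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def merge_redundant_slots(active, iou_threshold=0.5):
    """Group slots whose active patch sets overlap strongly.

    active: list of sets; active[k] holds the flattened (frame, patch)
            indices where slot k is active (an assumed representation).
    Returns a list mapping each slot to the smallest slot index in its group.
    """
    parent = list(range(len(active)))
    for i, j in combinations(range(len(active)), 2):
        union = len(active[i] | active[j])
        iou = len(active[i] & active[j]) / union if union else 0.0
        if iou >= iou_threshold:
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:
                # Keep the smaller index as the group representative.
                parent[max(ri, rj)] = min(ri, rj)
    return [find(parent, k) for k in range(len(active))]


# Slots 0-1 and 1-2 overlap strongly; 0-2 do not, but are merged transitively.
slots = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}, {10, 11}]
merged = merge_redundant_slots(slots)
print(merged)
```

In this toy input, slots 0 and 2 never pass the threshold directly, yet they end up in the same group because both are consistent with slot 1. See the paper and code for the criterion SSync actually uses.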

## Evaluation Results

Object discovery on VOCL benchmarks (averaged over 3 runs):

| Dataset | FG-ARI ↑ | mBO ↑ |
|---|---|---|
| MOVi-C (336×336) | **79.4** | **39.5** |
| MOVi-E (336×336) | **84.0** | 34.8 |
| YouTube-VIS 2021 (518×518) | 42.6 | **38.7** |
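
For readers unfamiliar with mBO (mean best overlap), the sketch below shows the usual definition: for each ground-truth object mask, take the highest IoU achieved by any predicted mask, then average over objects. It is a simplified illustration that represents masks as sets of pixel indices; the helper names are hypothetical, and the repository's evaluation code may differ in details such as per-frame versus per-video matching or background handling.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def mean_best_overlap(gt_masks, pred_masks):
    """Average, over ground-truth masks, of the best IoU with any prediction."""
    if not gt_masks:
        return 0.0
    best = [max((iou(gt, p) for p in pred_masks), default=0.0) for gt in gt_masks]
    return sum(best) / len(best)


# Two ground-truth objects, three predicted slots (toy data).
gt = [{0, 1, 2, 3}, {4, 5}]
pred = [{0, 1, 2}, {3, 4, 5}, {6}]
score = mean_best_overlap(gt, pred)
print(f"mBO = {100 * score:.1f}")  # scaled to percent, as in the table above
```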

## Training Data

| Dataset | Size |
|---|---|
| YouTube-VIS 2021 | 26.43 GB |
| MOVi-C | 7.43 GB |
| MOVi-E | 8.26 GB |

See [data/README.md](https://github.com/wjun0830/SSync/blob/main/data/README.md) for download instructions.

## Citation

```bibtex
@inproceedings{moon2026ssync,
  title = {Selective Synergistic Learning for Video Object-Centric Learning},
  author = {Moon, WonJun and Heo, Jae-Pil},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year = {2026}
}
```

## Acknowledgements

Built upon [VideoSAUR](https://github.com/martius-lab/videosaur), [SlotContrast](https://github.com/martius-lab/slotcontrast), [SRL](https://github.com/hynnsk/SRL), and [SlotCurri](https://github.com/wjun0830/SlotCurri).