---
datasets:
- movi-c
- movi-e
- youtube-vis-2021
- coco
language:
- en
library_name: pytorch
license: mit
pipeline_tag: image-segmentation
metrics:
- fg-ari
- mbo
tags:
- video-object-centric-learning
- object-discovery
- slot-attention
- unsupervised-segmentation
- video-understanding
- pytorch
---

# SSync: Selective Synergistic Learning for Video Object-Centric Learning

**ECCV 2026** · [Paper](https://arxiv.org/abs/2606.15527v1) · [Code](https://github.com/wjun0830/SSync) · [Project Page](https://wjun0830.github.io/SSync)

**Authors:** [WonJun Moon](https://github.com/wjun0830) (KAIST), Jae-Pil Heo (Sungkyunkwan University)

## Model Description

SSync is a selective mutual-distillation framework for video object-centric learning (VOCL). Slot-based VOCL methods are guided by two spatial maps: the encoder's **attention map** (sharp boundaries, noisy interiors) and the decoder's **object map** (coherent interiors, blurry boundaries). Rather than forcing dense agreement across all spatio-temporal patches, SSync selectively distills only the most reliable cues from each map:

- **Encoder → Decoder:** boundary refinement via crisp attention boundaries
- **Decoder → Encoder:** interior denoising via coherent object maps

This is realized through a **linear-complexity pseudo-labeling** scheme, eliminating quadratic spatial comparisons. A **transitive pseudo-label merging** step further consolidates redundant slots based on spatio-temporal activation consistency, making SSync robust to slot count configurations.

## Evaluation Results

Object discovery on VOCL benchmarks (averaged over 3 runs):

| Dataset | FG-ARI ↑ | mBO ↑ |
|---|---|---|
| MOVi-C (336×336) | **79.4** | **39.5** |
| MOVi-E (336×336) | **84.0** | 34.8 |
| YouTube-VIS 2021 (518×518) | 42.6 | **38.7** |

## Training Data

| Dataset | Size |
|---|---|
| YouTube-VIS 2021 | 26.43 GB |
| MOVi-C | 7.43 GB |
| MOVi-E | 8.26 GB |

See [data/README.md](https://github.com/wjun0830/SSync/blob/main/data/README.md) for download instructions.

## Citation

```bibtex
@inproceedings{moon2026ssync,
  title     = {Selective Synergistic Learning for Video Object-Centric Learning},
  author    = {Moon, WonJun and Heo, Jae-Pil},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```

## Acknowledgements

Built upon [VideoSAUR](https://github.com/martius-lab/videosaur), [SlotContrast](https://github.com/martius-lab/slotcontrast), [SRL](https://github.com/hynnsk/SRL), and [SlotCurri](https://github.com/wjun0830/SlotCurri).
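
## Example: Computing FG-ARI and mBO

The Evaluation Results above report FG-ARI and mBO. As a rough illustration of what these two metrics measure, the sketch below computes both from per-pixel segment ids in plain Python. It is not the evaluation code from the SSync repository: the function names (`fg_ari`, `mean_best_overlap`), the flat label-sequence input, the convention that id `0` marks background, and the handling of degenerate cases are all assumptions made for this example. Whether scores are aggregated per frame or per video is likewise not specified here. Scores are returned as fractions in [0, 1], whereas the table lists percentages.

```python
from collections import Counter
from math import comb


def fg_ari(true_ids, pred_ids, background=0):
    """Adjusted Rand index over foreground pixels only (true id != background)."""
    if len(true_ids) != len(pred_ids):
        raise ValueError("label sequences must have the same length")
    pairs = [(t, p) for t, p in zip(true_ids, pred_ids) if t != background]
    n = len(pairs)
    if n < 2:
        return float("nan")  # undefined with fewer than two foreground pixels
    joint = Counter(pairs)                      # contingency table n_ij
    true_sizes = Counter(t for t, _ in pairs)   # row sums a_i
    pred_sizes = Counter(p for _, p in pairs)   # column sums b_j
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # assumed convention when both partitions are trivial
    return (sum_joint - expected) / (max_index - expected)


def mean_best_overlap(true_ids, pred_ids, background=0):
    """Mean over ground-truth objects of the best IoU with any predicted segment."""
    if len(true_ids) != len(pred_ids):
        raise ValueError("label sequences must have the same length")
    true_sizes = Counter(true_ids)
    pred_sizes = Counter(pred_ids)
    joint = Counter(zip(true_ids, pred_ids))    # pixel counts per (true, pred) pair
    best = {}
    for (t, p), inter in joint.items():
        if t == background:
            continue
        union = true_sizes[t] + pred_sizes[p] - inter
        best[t] = max(best.get(t, 0.0), inter / union)
    if not best:
        return float("nan")  # no foreground objects to score
    return sum(best.values()) / len(best)


# Toy example: six pixels, two foreground objects (ids 1 and 2).
true_ids = [0, 0, 1, 1, 2, 2]
pred_ids = [5, 5, 7, 7, 7, 9]
print(mean_best_overlap(true_ids, pred_ids))  # about 0.583 (best IoUs 2/3 and 1/2)
print(fg_ari(true_ids, pred_ids))             # 0.0 (no better than chance)
```

Both functions make a single pass over the pixels and keep only per-segment counts, so the cost grows linearly with the number of pixels. A video can be scored by flattening all of its frames into one sequence, provided segment ids are consistent across frames.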