	---
	datasets:
	- movi-c
	- movi-e
	- youtube-vis-2021
	- coco
	language:
	- en
	library_name: pytorch
	license: mit
	pipeline_tag: image-segmentation
	metrics:
	- fg-ari
	- mbo
	tags:
	- video-object-centric-learning
	- object-discovery
	- slot-attention
	- unsupervised-segmentation
	- video-understanding
	- pytorch
	---

	<div align="center">

	<h1> SSync: Selective Synergistic Learning for Video Object-Centric Learning </h1>

	ECCV 2026 · [Paper](https://arxiv.org/abs/2606.15527v1) · [Code](https://github.com/wjun0830/SSync) · [Project Page](https://wjun0830.github.io/SSync)

	Authors: [WonJun Moon](https://github.com/wjun0830) (KAIST), Jae-Pil Heo (Sungkyunkwan University)

	<img src="images/fig1.png" width="50%">
	<img src="images/fig2.png" width="50%">

	</div>

	## Model Description

	SSync is a selective mutual-distillation framework for video object-centric learning (VOCL). Slot-based VOCL methods are guided by two spatial maps: the encoder's attention map (sharp boundaries, noisy interiors) and the decoder's object map (coherent interiors, blurry boundaries). Rather than forcing dense agreement across all spatio-temporal patches, SSync distills only the most reliable cues from each map into the other:

	- Encoder → Decoder: boundary refinement via crisp attention boundaries
	- Decoder → Encoder: interior denoising via coherent object maps

	Both directions are realized through a linear-complexity pseudo-labeling scheme, which avoids quadratic spatial comparisons. A transitive pseudo-label merging step then consolidates redundant slots based on their spatio-temporal activation consistency, making SSync robust to the configured number of slots.
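To make the pseudo-labeling and transitive merging ideas concrete, here is a minimal sketch in plain Python. It is not the SSync implementation. The function names, the thresholds, and the merge criterion (IoU between thresholded slot activations) are all assumptions chosen for illustration; see the paper and code for the actual formulation. The sketch shows two things: per-patch labels come from an argmax over slots, so no patch-to-patch comparison is needed, and merging is transitive because it is computed with union-find over slots.

```python
from itertools import combinations


def hard_pseudo_labels(activations):
    """Assign each patch to its strongest slot.

    activations[k][n] is the weight of slot k on patch n, with patches
    flattened over space and time. The cost is O(K * N); patches are never
    compared with each other.
    """
    num_slots = len(activations)
    num_patches = len(activations[0])
    return [
        max(range(num_slots), key=lambda k: activations[k][n])
        for n in range(num_patches)
    ]


def merge_redundant_slots(activations, active_thresh=0.3, iou_thresh=0.6):
    """Map each slot to a representative slot id (the smallest in its group).

    Assumed criterion: two slots are redundant when the sets of patches on
    which they are active overlap with IoU >= iou_thresh. Union-find makes
    the relation transitive: if A~B and B~C, then A, B and C share one id.
    """
    active = [
        {n for n, w in enumerate(row) if w >= active_thresh}
        for row in activations
    ]
    parent = list(range(len(activations)))

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    for a, b in combinations(range(len(activations)), 2):
        union = active[a] | active[b]
        if union and len(active[a] & active[b]) / len(union) >= iou_thresh:
            root_a, root_b = find(a), find(b)
            parent[max(root_a, root_b)] = min(root_a, root_b)
    return [find(k) for k in range(len(activations))]


# Toy input: 4 slots, 6 patches. Slots 0 and 1 split the same object.
attn = [
    [0.50, 0.50, 0.40, 0.05, 0.05, 0.05],
    [0.40, 0.40, 0.50, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.85, 0.10, 0.10],
    [0.05, 0.05, 0.05, 0.05, 0.80, 0.80],
]
labels = hard_pseudo_labels(attn)          # [0, 0, 1, 2, 3, 3]
roots = merge_redundant_slots(attn)        # [0, 0, 2, 3]
merged = [roots[k] for k in labels]        # [0, 0, 0, 2, 3, 3]
print(labels, roots, merged)
```

The pairwise loop runs over slots rather than patches, so with a small, fixed slot count the total cost stays linear in the number of patches.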

	## Evaluation Results

	Object discovery on VOCL benchmarks (averaged over 3 runs):

	| Dataset | FG-ARI ↑ | mBO ↑ |
	|---|---|---|
	| MOVi-C (336×336) | 79.4 | 39.5 |
	| MOVi-E (336×336) | 84.0 | 34.8 |
	| YouTube-VIS 2021 (518×518) | 42.6 | 38.7 |
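For readers unfamiliar with the two metrics, the following is a small reference sketch in plain Python, not the evaluation code used for the numbers above. It assumes label maps flattened into lists (for video, over all frames), a ground-truth background id of `0`, and the standard definitions: FG-ARI is the adjusted Rand index restricted to ground-truth foreground pixels, and mBO averages, over ground-truth objects, the best IoU achieved by any predicted segment. The function names are hypothetical, and the table presumably reports both metrics scaled by 100.

```python
from collections import Counter
from math import comb


def fg_ari(true_labels, pred_labels, background=0):
    """Adjusted Rand index computed only on ground-truth foreground pixels."""
    pairs = [(t, p) for t, p in zip(true_labels, pred_labels) if t != background]
    total_pairs = comb(len(pairs), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two foreground pixels: treated as trivial agreement
    sum_cells = sum(comb(c, 2) for c in Counter(pairs).values())
    sum_true = sum(comb(c, 2) for c in Counter(t for t, _ in pairs).values())
    sum_pred = sum(comb(c, 2) for c in Counter(p for _, p in pairs).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


def mbo(true_labels, pred_labels, background=0):
    """Mean over ground-truth objects of the best IoU with any predicted segment."""
    pred_masks = {}
    for i, p in enumerate(pred_labels):
        pred_masks.setdefault(p, set()).add(i)
    best_ious = []
    for t in set(true_labels) - {background}:
        t_mask = {i for i, v in enumerate(true_labels) if v == t}
        best_ious.append(
            max(len(t_mask & m) / len(t_mask | m) for m in pred_masks.values())
        )
    return sum(best_ious) / len(best_ious) if best_ious else 0.0


# Toy example: 8 pixels, two objects (ids 1 and 2) plus background (id 0).
true = [0, 0, 1, 1, 1, 2, 2, 2]
pred = [5, 5, 7, 7, 8, 9, 9, 9]  # object 1 is over-segmented into slots 7 and 8
print(round(fg_ari(true, pred), 3), round(mbo(true, pred), 3))
```

In the toy example, splitting object 1 across two predicted segments lowers both scores, while object 2 is matched exactly.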

	## Training Data

	| Dataset | Size |
	|---|---|
	| YouTube-VIS 2021 | 26.43 GB |
	| MOVi-C | 7.43 GB |
	| MOVi-E | 8.26 GB |

	See [data/README.md](https://github.com/wjun0830/SSync/blob/main/data/README.md) for download instructions.

	## Citation

	```bibtex
	@inproceedings{moon2026ssync,
	title = {Selective Synergistic Learning for Video Object-Centric Learning},
	author = {Moon, WonJun and Heo, Jae-Pil},
	booktitle = {European Conference on Computer Vision (ECCV)},
	year = {2026}
	}
	```

	## Acknowledgements

	Built upon [VideoSAUR](https://github.com/martius-lab/videosaur), [SlotContrast](https://github.com/martius-lab/slotcontrast), [SRL](https://github.com/hynnsk/SRL), and [SlotCurri](https://github.com/wjun0830/SlotCurri).