An RGB reconstruction head based on RAE, trained on ImageNet. It is part of "A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens" (CVPR 2026 Highlight).
Requires a frozen DINOv3 ViT-B backbone. See the DeltaTok GitHub repository for training and evaluation code.
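As a rough illustration of what "frozen backbone" means in practice, the PyTorch sketch below pairs a fixed feature extractor with a trainable reconstruction head. Both `StandInBackbone` and `StandInHead` are hypothetical placeholders, not the actual DINOv3 or DeltaTok modules, and the patch size of 16, feature width of 768, and 224x224 input are assumptions made for the example. The DeltaTok GitHub repository remains the reference for the real loading code.

```python
import torch
from torch import nn


class StandInBackbone(nn.Module):
    """Hypothetical placeholder for the DINOv3 ViT-B backbone (not the real model)."""

    def __init__(self, patch: int = 16, dim: int = 768):
        super().__init__()
        # A single strided convolution mimics a ViT patch embedding.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.embed(images)  # (B, dim, H / patch, W / patch)


class StandInHead(nn.Module):
    """Hypothetical placeholder for the RGB reconstruction head."""

    def __init__(self, patch: int = 16, dim: int = 768):
        super().__init__()
        # A transposed convolution maps each feature back to a patch of pixels.
        self.to_rgb = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.to_rgb(features)  # (B, 3, H, W)


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode so the module stays fixed."""
    module.requires_grad_(False)
    return module.eval()


backbone = freeze(StandInBackbone())
head = StandInHead()

images = torch.rand(2, 3, 224, 224)  # assumed input resolution
with torch.no_grad():  # no autograd graph is needed through the frozen backbone
    features = backbone(images)
reconstruction = head(features)  # only the head's parameters can receive gradients
```

Calling `eval()` in addition to `requires_grad_(False)` matters for backbones that contain dropout or normalization layers with running statistics, since disabling gradients alone does not stop those layers from behaving differently in training mode.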
@inproceedings{kerssies2026deltatok,
title = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
author = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}