DeltaTok
Collection
Pre-trained DeltaTok tokenizer, DeltaWorld predictor, and evaluation heads for segmentation, depth, and RGB reconstruction. See deltatok.github.io.
Generative ViT-B predictor that samples future delta tokens for diverse video prediction. Trained on Kinetics-700 at 512×512 resolution. Requires a trained DeltaTok tokenizer and a frozen DINOv3 ViT-B backbone, neither of which is included.
See the DeltaTok GitHub repository for training and evaluation code.
@inproceedings{kerssies2026deltatok,
title = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
author = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}