datasets:
- kinetics700
library_name: pytorch
license: apache-2.0
pipeline_tag: image-to-video
tags:
- deltatok
- cvpr2026-highlight
DeltaWorld (Predictor) — Kinetics-700
DeltaWorld is a generative world model operating on "delta" tokens to efficiently generate diverse plausible futures, as introduced in A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens (CVPR 2026 Highlight).
This repository contains the generative ViT-B predictor trained on Kinetics-700 at 512x512 resolution.
Metrics
Prediction quality is measured by applying downstream task heads to the predicted features. Each DeltaWorld cell reports the best-of-20 result, with the mean result in parentheses. "Best" selects, out of the 20 samples, the one with the lowest DINOv3-feature loss to the ground truth; "mean" averages the DINOv3 features across all samples before evaluation.
| Method | Horizon | VSPW mIoU (↑) | Cityscapes mIoU (↑) | KITTI RMSE (↓) |
|---|---|---|---|---|
| Copy last (lower bound) | Short (1 frame) | 51.2 | 53.5 | 3.76 |
| DeltaWorld | Short (1 frame) | 56.3 (54.2) | 66.2 (64.2) | 2.95 (3.32) |
| Copy last (lower bound) | Mid (3 frames) | 44.3 | 39.6 | 4.86 |
| DeltaWorld | Mid (3 frames) | 51.5 (46.6) | 55.3 (49.5) | 3.71 (4.74) |
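The "best" and "mean" aggregations described above can be illustrated with a minimal NumPy sketch. It is not taken from the DeltaTok code base: the function name `aggregate_samples`, the array shapes, and the use of mean squared error as the DINOv3-feature loss are assumptions made for illustration only.

```python
import numpy as np


def aggregate_samples(pred_feats, gt_feats):
    """Return the (best, mean) feature maps for one prediction target.

    pred_feats: array of shape (S, N, D), S sampled futures with N tokens of
                dimension D each (hypothetical layout).
    gt_feats:   array of shape (N, D), features of the ground-truth frame.
    """
    # Per-sample loss to the ground truth. MSE is an assumption; the text
    # only specifies "DINOv3-feature loss".
    losses = ((pred_feats - gt_feats[None]) ** 2).mean(axis=(1, 2))
    # "best": the single sample closest to the ground truth.
    best = pred_feats[np.argmin(losses)]
    # "mean": features averaged over all samples before any task head runs.
    mean = pred_feats.mean(axis=0)
    return best, mean


# Toy data standing in for 20 sampled futures of a 4-token, 8-dim feature map.
rng = np.random.default_rng(0)
gt = rng.normal(size=(4, 8))
samples = gt[None] + rng.normal(scale=0.5, size=(20, 4, 8))
samples[7] = gt + 0.01  # make sample 7 clearly the closest to the ground truth

best, mean = aggregate_samples(samples, gt)
```

Either result would then be passed to the downstream task head. Because "best" uses the ground truth to pick a sample, it reads as an oracle measure of how good the closest of the sampled futures is, while "mean" needs no ground truth at selection time.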
Usage
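The predictor requires a trained DeltaTok tokenizer and a frozen DINOv3 ViT-B backbone. Full training and evaluation code is available in the DeltaTok GitHub repository. To evaluate the predictor, run the following command, replacing the two checkpoint paths with the locations of your local copies: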
python main.py validate -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
--model.ckpt_path=path/to/deltaworld-kinetics/pytorch_model.bin \
--model.network.tokenizer.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin
Citation
@inproceedings{kerssies2026deltatok,
title = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
author = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}