datasets:
  - kinetics700
library_name: pytorch
license: apache-2.0
pipeline_tag: image-to-video
tags:
  - deltatok
  - cvpr2026-highlight

DeltaWorld (Predictor) — Kinetics-700

DeltaWorld is a generative world model operating on "delta" tokens to efficiently generate diverse plausible futures, as introduced in A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens (CVPR 2026 Highlight).

This repository contains the generative ViT-B predictor trained on Kinetics-700 at 512x512 resolution.

Metrics

Prediction quality is measured by applying downstream task heads to the predicted features. Each cell reports the best-of-20 result, with the mean result in parentheses: "best" selects the sample with the lowest DINOv3-feature loss to the ground truth, while "mean" averages the DINOv3 features across all 20 samples before evaluation.

Method	Horizon	VSPW mIoU (↑)	Cityscapes mIoU (↑)	KITTI RMSE (↓)
Copy last (lower bound)	Short (1 frame)	51.2	53.5	3.76
DeltaWorld	Short (1 frame)	56.3 (54.2)	66.2 (64.2)	2.95 (3.32)
Copy last (lower bound)	Mid (3 frames)	44.3	39.6	4.86
DeltaWorld	Mid (3 frames)	51.5 (46.6)	55.3 (49.5)	3.71 (4.74)
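
The two aggregation modes behind the table cells can be sketched as follows. This is a minimal illustration, not the repository's evaluation code: the function names `select_best` and `aggregate_mean` are hypothetical, mean squared error stands in for the DINOv3-feature loss (the exact loss is not stated here), and random arrays stand in for real predicted features.

```python
import numpy as np


def select_best(samples: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Return the sample closest to the ground-truth features.

    samples: (K, N, D) array of K sampled feature maps.
    target:  (N, D) array of ground-truth features.
    Assumption: MSE is used as the feature loss.
    """
    losses = ((samples - target[None]) ** 2).mean(axis=(1, 2))
    return samples[int(np.argmin(losses))]


def aggregate_mean(samples: np.ndarray) -> np.ndarray:
    """Average the features over all K samples before evaluation."""
    return samples.mean(axis=0)


# Toy stand-ins: 20 noisy samples around a random "ground truth".
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8))
samples = target[None] + rng.normal(scale=0.5, size=(20, 4, 8))

best = select_best(samples, target)   # features fed to the task head for "best"
mean = aggregate_mean(samples)        # features fed to the task head for "mean"
```

In both cases a single feature map is passed to the downstream task head. Note that "best" requires access to the ground truth, whereas "mean" does not.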

Usage

Requires a trained DeltaTok tokenizer and a frozen DINOv3 ViT-B backbone. Full training and evaluation code is available in the DeltaTok GitHub repository. To evaluate:

python main.py validate -c configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml \
  --model.ckpt_path=path/to/deltaworld-kinetics/pytorch_model.bin \
  --model.network.tokenizer.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin
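
When scripting evaluations, it may help to assemble the same command programmatically and confirm that both checkpoints exist before launching. The sketch below is one way to do that; the helper names `build_validate_command` and `missing_checkpoints` are hypothetical and not part of the DeltaTok repository, and the paths are the placeholders from the command above.

```python
from pathlib import Path

# Config path taken from the command above.
CONFIG = "configs/deltaworld_vitb_dinov3_vitb_kinetics.yaml"


def build_validate_command(predictor_ckpt: str, tokenizer_ckpt: str,
                           config: str = CONFIG) -> list[str]:
    """Assemble the validate command as an argument list."""
    return [
        "python", "main.py", "validate", "-c", config,
        f"--model.ckpt_path={predictor_ckpt}",
        f"--model.network.tokenizer.ckpt_path={tokenizer_ckpt}",
    ]


def missing_checkpoints(*paths: str) -> list[str]:
    """Return the checkpoint paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


predictor = "path/to/deltaworld-kinetics/pytorch_model.bin"
tokenizer = "path/to/deltatok-kinetics/pytorch_model.bin"

cmd = build_validate_command(predictor, tokenizer)
missing = missing_checkpoints(predictor, tokenizer)
if missing:
    print("Missing checkpoints:", ", ".join(missing))
else:
    print("Ready to run:", " ".join(cmd))
```

Once both files are present, the list can be passed to `subprocess.run(cmd, check=True)` from the root of the DeltaTok repository.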

Citation

@inproceedings{kerssies2026deltatok,
  title     = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
  author    = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}