Image Feature Extraction
PyTorch
deltatok
cvpr2026-highlight

Improve model card: add pipeline tag, links to paper, code, and project page

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +16 -7
README.md CHANGED
@@ -1,19 +1,28 @@
 ---
+datasets:
+- kinetics700
 library_name: pytorch
-tags:
-- deltatok
 license: apache-2.0
-datasets:
-- kinetics700
+pipeline_tag: image-feature-extraction
+tags:
+- deltatok
 ---
 
 # DeltaTok (Tokenizer) — Kinetics-700
 
-ViT-B encoder and decoder that compresses consecutive video frame features into a single continuous delta token. Trained on Kinetics-700 at 512x512 resolution. Requires a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-B backbone (not included).
+This repository contains the DeltaTok weights as presented in the paper [A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens](https://huggingface.co/papers/2604.04913) (CVPR 2026).
+
+[**Project Page**](https://deltatok.github.io) | [**GitHub**](https://github.com/amazon-far/deltatok)
+
+DeltaTok is a video tokenizer that encodes the vision foundation model (VFM) feature differences between consecutive frames into a single continuous "delta" token. This approach significantly reduces the token count in video sequences (e.g., 1,024x reduction) while enabling efficient generative world modeling.
+
+## Model Description
+
+This repository contains the ViT-B encoder and decoder trained on Kinetics-700 at 512x512 resolution. The model is designed to work with a frozen [DINOv3](https://github.com/facebookresearch/dinov3) ViT-B backbone (not included).
 
 ## Usage
 
-See the [DeltaTok GitHub repository](https://github.com/amazon-far/deltatok) for training and evaluation code.
+Please refer to the [DeltaTok GitHub repository](https://github.com/amazon-far/deltatok) for setup, training, and evaluation instructions.
 
 ## Acknowledgements
 
@@ -29,4 +38,4 @@ See the [DeltaTok GitHub repository](https://github.com/amazon-far/deltatok) for
 booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
 year = {2026}
 }
-```
+```
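
Note on the "1,024x reduction" figure in the added description: the card does not say how it is derived. One plausible reading, sketched below, is that a 512x512 frame passed through a ViT with 16x16 patches yields 32 x 32 = 1,024 patch tokens, which DeltaTok replaces with one delta token. The patch size of 16 is an assumption (it is not stated in the card), class and register tokens are ignored, and `patch_token_count` is a hypothetical helper written for this illustration, not part of the DeltaTok code.

```python
def patch_token_count(height: int, width: int, patch_size: int) -> int:
    """Count the non-overlapping patch tokens a ViT produces for one frame."""
    if height % patch_size or width % patch_size:
        raise ValueError("frame size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


# Assumption: 16x16 patches; the model card only states the 512x512 resolution.
PATCH_SIZE = 16
DELTA_TOKENS_PER_FRAME = 1  # stated in the card: one delta token per frame

tokens_per_frame = patch_token_count(512, 512, PATCH_SIZE)
reduction = tokens_per_frame // DELTA_TOKENS_PER_FRAME

print(f"patch tokens per frame: {tokens_per_frame}")
print(f"reduction factor: {reduction}x")
```

If the backbone used a different patch size, the same arithmetic would give a different factor, so it may be worth stating the patch size in the card next to the resolution.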