EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Code: https://github.com/HKU-MMLab/EVATok

Project Page: https://silentview.github.io/EVATok

Arxiv: https://arxiv.org/abs/2603.12267

Download Checkpoints

Tokenizers and Routers

All video tokenizers and routers below are trained for 16x128x128 videos (16 frames at 128x128 resolution).

| Tokenizer | Train Set | Config | Param. (Tokenizer) | Router Config | Router Ckpt (link) | #rTokens | rFVD | LPIPS | Tokenizer Ckpt (link) |
|---|---|---|---|---|---|---|---|---|---|
| S-B | WebVid-10M | VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 721 | 7.3 | 0.1063 | VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt |
| S-B | UCF-101 & K600 | VQ_SB_final_with_router_w_lpips_1.2.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 774 | 9.7 | 0.1140 | VQ_SB_final_with_router_ucf_k600_1000k.pt |
| S-B (Proxy) | WebVid-10M | VQ_SB_proxy_3fps.yaml | 145M | - | - | - | - | - | VQ_SB_proxy_3fps_webvid_400k.pt |

Downloading AR Models

Note that AR model inference does not use the routers.

For UCF-101 Class-to-video Generation

If you do not have access to V-JEPA2, you can use an alternative router that does not depend on it. Config: router_w_lpips_1.2_raw.yaml, ckpt: router_w_lpips_l1.2_raw_50k.pt. In reconstruction tests, there is no obvious quality gap between this router and the V-JEPA2-based one.

For Kinetics-600 Frame Prediction

The 5 condition frames are encoded into 512 + 128 = 640 conditioning tokens.
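As a sanity check on the conditioning length, the token budget can be computed directly. A minimal sketch: the 512/128 split between the two token groups is taken from the line above, but how those groups map onto the 5 condition frames is not specified here, so the variable names are illustrative assumptions only.

```python
# Sanity check for the Kinetics-600 frame-prediction conditioning length.
# The names below are hypothetical labels for the two token groups; the
# README only states that the 5 condition frames yield 512 + 128 tokens.
FIRST_GROUP_TOKENS = 512   # assumed first token group
SECOND_GROUP_TOKENS = 128  # assumed second token group


def conditioning_tokens() -> int:
    """Total number of tokens that encode the 5 condition frames."""
    return FIRST_GROUP_TOKENS + SECOND_GROUP_TOKENS


print(conditioning_tokens())  # 640
```

Any AR prompt for frame prediction should therefore reserve 640 token positions for the condition frames before generation begins.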
