EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

Code: https://github.com/HKU-MMLab/EVATok

Project Page: https://silentview.github.io/EVATok

Arxiv: https://arxiv.org/abs/2603.12267

Download Checkpoints

Tokenizers and Routers

All video tokenizers and routers below are trained for 16x128x128 videos (16 frames at 128x128 resolution).

| Tokenizer | Train Set | Config | Param. (Tokenizer) | Router Config | Router Ckpt (link) | #rTokens | rFVD | LPIPS | Tokenizer Ckpt (link) |
|---|---|---|---|---|---|---|---|---|---|
| S-B | WebVid-10M | VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 721 | 7.3 | 0.1063 | VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt |
| S-B | UCF-101 & K600 | VQ_SB_final_with_router_w_lpips_1.2.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 774 | 9.7 | 0.1140 | VQ_SB_final_with_router_ucf_k600_1000k.pt |
| S-B (Proxy) | WebVid-10M | VQ_SB_proxy_3fps.yaml | 145M | - | - | - | - | - | VQ_SB_proxy_3fps_webvid_400k.pt |

Downloading AR Models

Note that AR model inference does not use the routers.

For UCF-101 Class-to-video Generation

If you do not have access to V-JEPA2, you can use an alternative router that does not depend on it. Config: router_w_lpips_1.2_raw.yaml, ckpt: router_w_lpips_l1.2_raw_50k.pt. In reconstruction tests, there is no obvious quality gap between this router and the V-JEPA2-based one.

For Kinetics-600 Frame Prediction

The 5 condition frames are encoded into 512 + 128 = 640 conditioning tokens.
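As a sanity check on the conditioning length, the token budget can be computed directly. A minimal sketch: the 512/128 split between the two token groups is taken from the line above, but how those groups map onto the 5 condition frames is not specified here, so the variable names are illustrative assumptions only.

```python
# Sanity check for the Kinetics-600 frame-prediction conditioning length.
# The names below are hypothetical labels for the two token groups; the
# README only states that the 5 condition frames yield 512 + 128 tokens.
FIRST_GROUP_TOKENS = 512   # assumed first token group
SECOND_GROUP_TOKENS = 128  # assumed second token group


def conditioning_tokens() -> int:
    """Total number of tokens that encode the 5 condition frames."""
    return FIRST_GROUP_TOKENS + SECOND_GROUP_TOKENS


print(conditioning_tokens())  # 640
```

Any AR prompt for frame prediction should therefore reserve 640 token positions for the condition frames before generation begins.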
