EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
Code: https://github.com/HKU-MMLab/EVATok
Project Page: https://silentview.github.io/EVATok
Arxiv: https://arxiv.org/abs/2603.12267
All video tokenizers and routers listed here are for 16x128x128 videos (16 frames at 128x128 resolution).
| Tokenizer | Train Set | Config | Param. (Tokenizer) | Router config | Router ckpt (link) | #rTokens | rFVD | LPIPS | Tokenizer ckpt (link) |
|---|---|---|---|---|---|---|---|---|---|
| S-B | WebVid-10M | VQ_SB_final_with_router_w_lpips_1.2_3fps_webvid.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 721 | 7.3 | 0.1063 | VQ_SB_with_router_w_lpips_1.2_3fps_webvid_1000k.pt |
| S-B | UCF-101 & K600 | VQ_SB_final_with_router_w_lpips_1.2.yaml | 145M | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt | 774 | 9.7 | 0.1140 | VQ_SB_final_with_router_ucf_k600_1000k.pt |
| S-B (Proxy) | WebVid-10M | VQ_SB_proxy_3fps.yaml | 145M | - | - | - | - | - | VQ_SB_proxy_3fps_webvid_400k.pt |
Note that AR model inference does not use routers.
| AR Model | Param. (AR) | gFVD | #gTokens | AR Model Download Link | Tok. ckpt | Tok. config | Router config | Router ckpt |
|---|---|---|---|---|---|---|---|---|
| GPT-L-plus | 633M | 48 | 756 | GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
| GPT-L | 327M | 62 | 756 | GPT_L_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
If you do not have access to V-JEPA2, you can use an alternative router that does not depend on it. Config: router_w_lpips_1.2_raw.yaml, ckpt: router_w_lpips_l1.2_raw_50k.pt. In reconstruction tests, there is no obvious gap between this router and the one that depends on V-JEPA2.
The 5 conditioning frames are encoded into 512+128=640 tokens, which serve as the conditioning tokens.
| AR Model | Param. (AR) | gFVD | #gTokens | AR Model Download Link | Tok. ckpt | Tok. config | Router config | Router ckpt |
|---|---|---|---|---|---|---|---|---|
| GPT-L-plus | 633M | 4.0 | 862 | GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
| GPT-L | 327M | 4.6 | 862 | GPT_L_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt | VQ_SB_final_with_router_ucf_k600_1000k.pt | VQ_SB_final_with_router_w_lpips_1.2.yaml | router_w_lpips_1.2.yaml | router_w_lpips_1.2_50k.pt |
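
The two AR tables pair each AR checkpoint with a tokenizer checkpoint, a tokenizer config, and router files. Since AR inference does not use routers, the files needed to run a released AR model reduce to the first three. The following is a minimal sketch of that lookup, not part of the EVATok codebase: `ARRelease`, `RELEASES`, and `files_needed` are hypothetical names, the task keys `c2v` and `fp` are simply the infixes found in the checkpoint filenames, and including the router files when `inference_only=False` is an assumption about non-inference use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ARRelease:
    """One row of the AR model tables (hypothetical helper type)."""
    ar_ckpt: str
    tok_ckpt: str
    tok_config: str
    router_config: str
    router_ckpt: str


# Every AR row in the tables shares the same tokenizer and router files.
TOK_CKPT = "VQ_SB_final_with_router_ucf_k600_1000k.pt"
TOK_CONFIG = "VQ_SB_final_with_router_w_lpips_1.2.yaml"
ROUTER_CONFIG = "router_w_lpips_1.2.yaml"
ROUTER_CKPT = "router_w_lpips_1.2_50k.pt"


def _row(ar_ckpt: str) -> ARRelease:
    return ARRelease(ar_ckpt, TOK_CKPT, TOK_CONFIG, ROUTER_CONFIG, ROUTER_CKPT)


# Keys are (filename infix, AR model name); values are copied from the tables.
RELEASES = {
    ("c2v", "GPT-L-plus"): _row("GPT_LP_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt"),
    ("c2v", "GPT-L"): _row("GPT_L_c2v_VQ_SB_with_router_w_lpips_1.2_e3000.pt"),
    ("fp", "GPT-L-plus"): _row("GPT_LP_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt"),
    ("fp", "GPT-L"): _row("GPT_L_fp_VQ_SB_with_router_w_lpips_1.2_512_128_prefix_e75.pt"),
}


def files_needed(task: str, model: str, inference_only: bool = True) -> list[str]:
    """Return the filenames to fetch for one AR model.

    Router files are skipped for inference, following the note that
    AR inference does not use routers.
    """
    release = RELEASES[(task, model)]
    files = [release.ar_ckpt, release.tok_ckpt, release.tok_config]
    if not inference_only:
        # Assumption: router files matter only outside inference.
        files += [release.router_config, release.router_ckpt]
    return files


if __name__ == "__main__":
    for name in files_needed("c2v", "GPT-L"):
        print(name)
```

Calling `files_needed("fp", "GPT-L-plus", inference_only=False)` returns all five filenames from that table row, including the router config and checkpoint.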