# PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

<div align="center">
  <a href="https://plan-lab.github.io/pyratok"><img src="https://img.shields.io/badge/Project-Website-blue?style=for-the-badge&logo=googlechrome"></a>
  <a href="https://arxiv.org/abs/2601.16210"><img src="https://img.shields.io/badge/arXiv-2601.16210-b31b1b.svg?style=for-the-badge"></a>
  <a href="https://github.com/PLAN-Lab/PyraTok"><img src="https://img.shields.io/badge/Code-GitHub-black?style=for-the-badge&logo=github"></a>
</div>

---

### 📢 Official Announcement
**PyraTok** has been officially accepted to **CVPR 2026**! 🎉  
This repository contains the pretrained weights and model implementation for the Language-aligned Pyramidal Tokenizer.

---

## 🚀 Overview

**PyraTok** is a video tokenizer designed to serve both video understanding and video generation. Unlike traditional VAEs that operate at a single visual scale, PyraTok introduces a **Language-aligned Pyramidal Quantization (LaPQ)** module, which learns discrete latents at multiple spatiotemporal resolutions and ties them to language through a shared binary codebook.

### Key Innovations:
* **Pyramidal Structure:** Learns semantically structured discrete latents across multiple spatiotemporal resolutions.
* **Language Alignment:** Tightly couples visual tokens with language using a shared, large binary codebook (up to 48K tokens).
* **Scalability:** Scales from standard resolutions up to **4K/8K video**.
* **Unified Backbone:** A single model that excels in Video QA, Zero-Shot Segmentation, and high-fidelity Text-to-Video generation.
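
To make the idea of binary tokens at several resolutions more concrete, here is a minimal, purely illustrative sketch. It is not the PyraTok implementation: the helper names (`pool_pairs`, `sign_token`, `tokenize_pyramid`), the average pooling, and the sign-based quantizer are all assumptions chosen for simplicity. It takes a short sequence of latent vectors, builds coarser levels by averaging neighbouring positions, and turns each vector into an integer token by reading the signs of its channels as bits.

```python
# Illustrative sketch only; NOT the PyraTok/LaPQ implementation.
# Assumptions: average pooling builds the pyramid, and each latent vector is
# quantized to a binary code by taking the sign of every channel.
from typing import List

Vector = List[float]


def pool_pairs(latents: List[Vector]) -> List[Vector]:
    """Halve the sequence length by averaging adjacent pairs of vectors."""
    return [
        [(a + b) / 2.0 for a, b in zip(latents[i], latents[i + 1])]
        for i in range(0, len(latents) - 1, 2)
    ]


def sign_token(vector: Vector) -> int:
    """Read channel signs as bits (positive -> 1), most significant first."""
    token = 0
    for value in vector:
        token = (token << 1) | (1 if value > 0 else 0)
    return token


def tokenize_pyramid(latents: List[Vector], levels: int) -> List[List[int]]:
    """Return token ids per level, ordered from finest to coarsest."""
    tokens = []
    current = latents
    for _ in range(levels):
        tokens.append([sign_token(v) for v in current])
        current = pool_pairs(current)
    return tokens


# Four positions, three channels each: 2**3 = 8 possible codes per position.
latents = [
    [0.5, -1.0, 2.0],
    [-1.5, -1.0, 1.0],
    [1.0, 1.0, -3.0],
    [3.0, -2.0, 1.0],
]
pyramid_tokens = tokenize_pyramid(latents, levels=3)
print(pyramid_tokens)  # [[5, 1, 6, 5], [1, 4], [5]]
```

In this toy setup every level draws its tokens from the same set of binary codes, which is one simple way to picture a codebook shared across scales. How PyraTok actually builds its pyramid and aligns codes with language is described in the paper and the code repository linked above.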



## Citation

```
@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}
```