| --- |
| license: mit |
| pipeline_tag: feature-extraction |
| tags: |
| - visual-tokenizer |
| - image-reconstruction |
| --- |
| |
| # TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders |
|
|
| <p align="center"> |
| <a href="https://huggingface.co/papers/2604.07340"><img src="https://img.shields.io/badge/Paper-Arxiv-b31b1b.svg" alt="arXiv"></a> |
| <a href="https://github.com/inclusionAI/TC-AE"><img src="https://img.shields.io/badge/Code-GitHub-blue?logo=github" alt="GitHub"></a> |
| </p> |
| <div align="center"> |
| <a href="https://tliby.github.io/" target="_blank">Teng Li</a><sup>1,2,*</sup>, |
| <a href="https://huang-ziyuan.github.io/" target="_blank">Ziyuan Huang</a><sup>1,*,✉</sup>, |
| <a href="https://scholar.google.com/citations?user=kwDXTpAAAAAJ&hl=en" target="_blank">Cong Chen</a><sup>1,3,*</sup>, |
| <a href="https://ychenl.github.io/" target="_blank">Yangfu Li</a><sup>1,4</sup>, |
| <a href="https://qc-ly.github.io/" target="_blank">Yuanhuiyi Lyu</a><sup>1,5</sup>, <br> |
| <a href="#" target="_blank">Dandan Zheng</a><sup>1</sup>, |
| <a href="https://scholar.google.com/citations?user=Ljk2BvIAAAAJ&hl=en" target="_blank">Chunhua Shen</a><sup>3</sup>, |
| <a href="https://eejzhang.people.ust.hk/" target="_blank">Jun Zhang</a><sup>2,✉</sup><br> |
| <sup>1</sup>Inclusion AI, Ant Group, <sup>2</sup>HKUST, <sup>3</sup>ZJU, <sup>4</sup>ECNU, <sup>5</sup>HKUST (GZ) <br> |
| <sup>*</sup>Equal contribution, ✉ Corresponding authors <br> |
| </div> |
| |
|
|
|
|
| ## Introduction |
|
|
| <p align="center"> |
| <img src="assets/pipeline.png" width=98%> |
| </p> |
|
|
|
|
|
|
| **TC-AE** is a Vision Transformer (ViT)-based tokenizer for deep image compression and visual generation. Deep compression methods typically widen the latent channel dimension to preserve reconstruction quality at high compression ratios, but this often causes representation collapse, which degrades generative performance. TC-AE approaches the problem from a different angle: **optimizing the token space**, the bridge between pixels and latent representations. By scaling the number of tokens and strengthening their semantic structure, TC-AE improves both reconstruction and generation quality. |
|
| Key innovations: |
|
|
| - Token Space Optimization: The first approach to address representation collapse through token space optimization |
| - Staged Token Compression: Decomposes token-to-latent mapping into two stages, reducing structural information loss in the bottleneck |
| - Semantic Enhancement: Incorporates self-supervised learning to produce more generative-friendly latents |
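
To make the trade-off between channel width and token count concrete, the following is a small arithmetic sketch. It assumes square images, non-overlapping ViT patches, and a latent grid obtained by uniform spatial downsampling. The function names and all numbers (image size 256, patch sizes 16 and 8, downsampling factor 32, channel counts 32 and 128) are illustrative assumptions, not TC-AE's actual configuration.

```python
from typing import Tuple


def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of ViT tokens for a square image cut into non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def latent_shape(image_size: int, downsample: int, channels: int) -> Tuple[int, int, int]:
    """(height, width, channels) of a latent grid after uniform spatial downsampling."""
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by downsample")
    side = image_size // downsample
    return (side, side, channels)


def compression_ratio(image_size: int, downsample: int, channels: int) -> float:
    """Ratio of RGB pixel values to latent values (higher means more compression)."""
    h, w, c = latent_shape(image_size, downsample, channels)
    return (3 * image_size * image_size) / (h * w * c)


if __name__ == "__main__":
    # Hypothetical settings, chosen only to illustrate the arithmetic.
    print("tokens at patch 16:", num_tokens(256, 16))
    print("tokens at patch 8: ", num_tokens(256, 8))
    print("ratio at 32 channels: ", compression_ratio(256, 32, 32))
    print("ratio at 128 channels:", compression_ratio(256, 32, 128))
```

Under these assumptions, halving the patch size quadruples the number of tokens without changing the latent shape, whereas widening the latent channels lowers the compression ratio directly.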
|
|
| ## Usage |
|
|
| ### Environment Setup |
|
|
| To set up the environment for TC-AE, run the following commands from the root of the cloned repository: |
|
|
| ```shell |
| conda create -n tcae python=3.9 |
| conda activate tcae |
| pip install -r requirements.txt |
| ``` |
|
|
| ### Image Reconstruction Demo |
|
| Reconstruct the images in a folder with a pretrained checkpoint: |
|
|
| ```shell |
| python tcae/script/demo_recon.py \ |
| --img_folder /path/to/your/images \ |
| --output_folder /path/to/output \ |
| --ckpt_path results/tcae.pt \ |
| --config configs/TC-AE-SL.yaml \ |
| --rank 0 |
| ``` |
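
If several image folders need to be processed, the demo command above can be assembled programmatically. The sketch below is a hypothetical wrapper, not part of the repository: `build_demo_command` and `run_demo_on_subfolders` are names introduced here, and the only flags used are the ones shown in the command above. It assumes the script is launched from the repository root.

```python
import shlex
import subprocess
from pathlib import Path
from typing import List


def build_demo_command(img_folder, output_folder,
                       ckpt_path="results/tcae.pt",
                       config="configs/TC-AE-SL.yaml",
                       rank=0) -> List[str]:
    """Build the argument list for one run of the reconstruction demo."""
    return [
        "python", "tcae/script/demo_recon.py",
        "--img_folder", str(img_folder),
        "--output_folder", str(output_folder),
        "--ckpt_path", str(ckpt_path),
        "--config", str(config),
        "--rank", str(rank),
    ]


def run_demo_on_subfolders(input_root, output_root, dry_run=True) -> List[List[str]]:
    """Run (or, with dry_run, only print) the demo once per immediate subfolder."""
    commands = []
    subfolders = sorted(p for p in Path(input_root).iterdir() if p.is_dir())
    for sub in subfolders:
        cmd = build_demo_command(sub, Path(output_root) / sub.name)
        commands.append(cmd)
        if dry_run:
            print(shlex.join(cmd))  # show the command without executing it
        else:
            subprocess.run(cmd, check=True)  # raise if the demo exits non-zero
    return commands
```

Calling `run_demo_on_subfolders("/path/to/image_sets", "/path/to/output")` with the default `dry_run=True` only prints the commands, which is a cheap way to check paths before launching real runs.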
|
|
| ### ImageNet Reconstruction Evaluation |
|
|
| Evaluate reconstruction quality on the ImageNet validation set: |
|
|
| ```shell |
| python tcae/script/eval_recon.py \ |
| --ckpt_path results/tcae.pt \ |
| --dataset_root /path/to/imagenet_val \ |
| --config configs/TC-AE-SL.yaml \ |
| --rank 0 |
| ``` |
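
This document does not state which metrics `eval_recon.py` reports. As a point of reference, peak signal-to-noise ratio (PSNR) is one widely used measure of reconstruction quality. The sketch below is an independent, dependency-free illustration; it assumes 8-bit pixel values passed as flat sequences, and the `psnr` helper is a name introduced here, not a function from the repository.

```python
import math
from typing import Sequence


def psnr(original: Sequence[float], reconstructed: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixel values")
    if len(original) == 0:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Toy values: each reconstructed pixel is off by one intensity level.
    print(psnr([10, 20, 30, 40], [11, 21, 31, 41]))
```

Higher PSNR means the reconstruction is closer to the original. For full images, the same formula is usually applied to array data with a numerical library rather than Python lists.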
|
|
| ## Citation |
|
|
|
|
| ```bibtex |
| @article{li2026tcae, |
| title={TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders}, |
| author={Li, Teng and Huang, Ziyuan and Chen, Cong and Li, Yangfu and Lyu, Yuanhuiyi and Zheng, Dandan and Shen, Chunhua and Zhang, Jun}, |
| journal={arXiv preprint arXiv:2604.07340}, |
| year={2026} |
| } |
| ``` |