Update README.md

# TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

<h5 align="center">

[arXiv](https://arxiv.org/abs/2505.05422)
[GitHub](https://github.com/TencentARC/TokLIP)

</h5>

Welcome to the official code repository for "[**TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation**](https://arxiv.org/abs/2505.05422)".

Your star means a lot to us in developing this project! ⭐⭐⭐

## 📰 News

* [2025/08/18] 🚀 Check out our latest results on arXiv ([PDF](https://arxiv.org/pdf/2505.05422))!
* [2025/08/18] 🔥 We release TokLIP-XL with 512 resolution: [🤗 TokLIP_XL_512](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_XL_512.pt)!
* [2025/08/05] 🔥 We release the training code!
* [2025/06/05] 🔥 We release the code and models!
* [2025/05/09] 🚀 Our paper is available on arXiv!

## 👀 Introduction

<img src="./docs/TokLIP.png" alt="TokLIP" style="zoom:50%;" />

- We introduce TokLIP, a visual tokenizer that enhances comprehension by **semanticizing** vector-quantized (VQ) tokens and **incorporating CLIP-level semantics** while enabling end-to-end multimodal autoregressive training with standard VQ tokens.

### Model Weight

| Model | Resolution | VQGAN | IN Top-1 | COCO TR@1 | COCO IR@1 | Weight |
| :-------: | :--------: | :---: | :------: | :-------: | :-------: | :----: |
| TokLIP-S | 256 | [LlamaGen](https://huggingface.co/peizesun/llamagen_t2i/blob/main/vq_ds16_t2i.pt) | 76.4 | 64.06 | 48.46 | [🤗 TokLIP_S_256](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_S_256.pt) |
| TokLIP-L | 384 | [LlamaGen](https://huggingface.co/peizesun/llamagen_t2i/blob/main/vq_ds16_t2i.pt) | 80.0 | 68.00 | 52.87 | [🤗 TokLIP_L_384](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_L_384.pt) |
| TokLIP-XL | 512 | [IBQ](https://huggingface.co/TencentARC/IBQ-Tokenizer-262144/blob/main/imagenet256_262144.ckpt) | 80.8 | 69.36 | 53.79 | [🤗 TokLIP_XL_512](https://huggingface.co/TencentARC/TokLIP/blob/main/TokLIP_XL_512.pt) |

### Training

1. Please refer to [img2dataset](https://github.com/rom1504/img2dataset) to prepare the WebDataset required for training. You may choose datasets such as **CC3M**, **CC12M**, or **LAION**.

2. Prepare the teacher models using `src/covert.py`:

   ```bash
   cd src
   TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-256' --save_path './model/siglip2-so400m-vit-l16-256.pt'
   TIMM_MODEL='original' python covert.py --model_name 'ViT-SO400M-16-SigLIP2-384' --save_path './model/siglip2-so400m-vit-l16-384.pt'
   ```

3. Train TokLIP with the scripts `src/train_toklip_256.sh` and `src/train_toklip_384.sh`. You need to set the `--train-data` and `--train-num-samples` arguments accordingly.
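
For orientation, a launch could look like the sketch below. The shard pattern and sample count are placeholders for whatever your img2dataset run produced, and the scripts may hard-code these flags instead of forwarding command-line arguments, so check the script before relying on this exact form.

```bash
# Illustrative sketch only; verify how src/train_toklip_256.sh consumes its arguments.
cd src
# --train-data: WebDataset shards produced by img2dataset (placeholder path/pattern)
# --train-num-samples: total number of training samples contained in those shards (placeholder value)
bash train_toklip_256.sh \
    --train-data '/data/cc3m-wds/{00000..00331}.tar' \
    --train-num-samples 2905954
```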

### Evaluation

Please first download the TokLIP model weights.

We provide evaluation scripts for ImageNet classification and MSCOCO retrieval in `src/test_toklip_256.sh`, `src/test_toklip_384.sh`, and `src/test_toklip_512.sh`.

Please set `--pretrained`, `--imagenet-val`, and `--coco-dir` to your specific paths.
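
As a sketch (the weight and dataset locations below are placeholders, not paths shipped with the repo), an evaluation run for the 384-resolution model could look like:

```bash
# Sketch only: edit the placeholder paths to match your local setup.
cd src
# Inside test_toklip_384.sh, point these flags at your local files:
#   --pretrained   ./model/TokLIP_L_384.pt
#   --imagenet-val /path/to/imagenet/val
#   --coco-dir     /path/to/coco
bash test_toklip_384.sh
```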

### Model Usage

We provide the `build_toklip_encoder` function in `src/create_toklip.py`; you can load TokLIP directly by passing the `model`, `image_size`, and `model_path` parameters.
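
A minimal loading sketch is shown below. It assumes the function can be imported from the `src` directory and that the returned module accepts an image batch; the config name, input size, and weight path are placeholders, so check `src/create_toklip.py` for the exact signature and outputs.

```python
import torch

# Run from within src/ (or add src/ to PYTHONPATH) so the import resolves.
from create_toklip import build_toklip_encoder

# Placeholder values: choose the config and checkpoint matching the weight you downloaded.
toklip = build_toklip_encoder(
    model='ViT-SO400M-16-SigLIP2-384-toklip',
    image_size=384,
    model_path='./model/TokLIP_L_384.pt',
)
toklip.eval()

with torch.no_grad():
    dummy = torch.randn(1, 3, 384, 384)  # stand-in for a preprocessed image batch
    outputs = toklip(dummy)              # inspect the returned tokens/features for your use case
```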

## 📝 TODOs

- [x] Release the training code.
- [x] Release TokLIP-XL with 512 resolution.

## 📧 Contact

Discussions and potential collaborations are also welcome.

## 🙏 Acknowledgement

This repo is built upon the following projects:

* [OpenCLIP](https://github.com/mlfoundations/open_clip)
* [LlamaGen](https://github.com/FoundationVision/LlamaGen)
* [DeCLIP](https://github.com/Sense-GVT/DeCLIP)
* [SEED-Voken](https://github.com/TencentARC/SEED-Voken)

We thank the authors for their code.

Please cite our work if you use our code or discuss our findings in your own research:

```bibtex
@article{lin2025toklip,
  title={TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation},
  author={Lin, Haokun and Wang, Teng and Ge, Yixiao and Ge, Yuying and Lu, Zhichao and Wei, Ying and Zhang, Qingfu and Sun, Zhenan and Shan, Ying},
  journal={arXiv preprint arXiv:2505.05422},
  year={2025}
}
```