---
language:
- en
- zh
library_name: transformers
license: apache-2.0
pipeline_tag: image-feature-extraction
tags:
- visual-tokenizer
- feature-extraction
- image-reconstruction
- autoregressive
---

## MingTok: A Unified Tokenizer for Visual Understanding and Generation without Vector Quantization

<p align="center">📑 <a href="https://arxiv.org/pdf/2510.06590">Technical Report</a> | 📖 <a href="https://inclusionai.github.io/blog/mingtok/">Project Page</a> | 🤗 <a href="https://huggingface.co/inclusionAI/MingTok-Vision">Hugging Face</a> | 🤖 <a href="https://modelscope.cn/models/inclusionAI/MingTok-Vision">ModelScope</a> | 💾 <a href="https://github.com/inclusionAI/Ming-UniVision">GitHub</a></p>

## Key Features

- 🖼️ **First Continuous Unified Vision Tokenizer:** MingTok enables unified vision understanding and generation via a continuous latent space, eliminating quantization while preserving semantic and perceptual fidelity.
- 🎯 **High-Fidelity Image Reconstruction:** A three-stage architecture (encoding, expansion, reconstruction) captures fine details and global structure for accurate, high-quality image recovery.
- ⚡ **Accelerated Autoregressive Convergence:** Masked modeling with multi-level supervision shapes a compact, semantically rich latent space, enabling faster and more stable autoregressive training.

<div align="center">
<img src="assets/0830-mingtok-fig1.jpg" alt="Model Architecture" width="80%"/>
</div>

**Figure 1: Conceptual comparison and qualitative examples of MingTok.**

## Usage

```python
# build MingTok
import torch
import torchvision.transforms as T
from PIL import Image

from mingtok.modeling_mingtok import MingTok
# CenterCropProcessor ships with the Ming-UniVision repo; the module path
# below is an assumption, adjust it if your checkout places it elsewhere.
from mingtok.utils import CenterCropProcessor

mingtok_model = MingTok.from_pretrained("inclusionAI/MingTok-Vision")
mingtok_model = mingtok_model.cuda()

img_path = "mingtok/asset/mingtok.png"
save_path = "mingtok/asset/mingtok_recon.png"

# load the original image, center-crop to 512 and normalize with mean/std 0.5
image = Image.open(img_path).convert("RGB")
processor = CenterCropProcessor(image_size=512, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
image = processor(image).cuda().unsqueeze(0)

# perform reconstruction
with torch.no_grad():
    image_recon = mingtok_model.forward_enc_dec(image)
    # latent = mingtok_model.low_level_encoder(image)
    # semantic_feat = mingtok_model.semantic_decoder(latent)['x_norm_patchtokens']
    # image_recon = mingtok_model.forward_pixel_decoder(semantic_feat)

# undo the normalization (x * std + mean) and save the result
output_mean = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
output_std = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
output_image = (image_recon * output_std + output_mean)[0]
# clamp to [0, 1] so out-of-range values do not wrap when converted to 8-bit
output_image = T.ToPILImage()(output_image.clamp(0, 1).cpu())
output_image.save(save_path)
```
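
A quick way to sanity-check a reconstruction is PSNR, one of the metrics reported in the Performance section. The sketch below is not the evaluation code behind the reported numbers, whose protocol is not described here. `psnr` is a hypothetical helper, and it assumes both images have already been flattened into sequences of pixel values in [0, 1].

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction):
        raise ValueError("inputs must contain the same number of pixel values")
    if not reference:
        raise ValueError("inputs must not be empty")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs have no noise
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy data standing in for flattened images (hypothetical values).
original = [0.0, 0.5, 1.0, 0.25]
reconstructed = [0.1, 0.6, 0.9, 0.35]
print(f"PSNR: {psnr(original, reconstructed):.2f} dB")
```

With the tensors from the usage example, both `image` and `image_recon` would first need to be mapped back to [0, 1] (multiply by 0.5, add 0.5, clamp) and flattened, for example with `tensor.flatten().tolist()`.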

## Performance

### Image Reconstruction

<style>
  body {
    font-family: Arial, sans-serif;
    margin: 20px;
  }
  table {
    width: 100%;
    border-collapse: collapse;
    font-size: 12px;
  }
  th, td {
    border: 1px solid #ccc;
    padding: 6px 8px;
    text-align: center;
  }
  thead th {
    background-color: transparent;
    font-weight: bold;
  }
  .section-row {
    background-color: transparent;
    text-align: center;
    font-style: italic;
  }
  .uparrow {
    font-size: 10px; vertical-align: super;
  }
  .dagger {
    font-size: 10px; color: gray;
  }
  caption {
    font-weight: bold;
    font-size: 14px;
    margin: 10px 0;
    text-align: left;
  }
</style>

<table>
  <thead>
    <tr>
      <th>Tokenizer</th>
      <th>Res.</th>
      <th># Tokens</th>
      <th>rFID ↓</th>
      <th>PSNR ↑</th>
      <th>SSIM ↑</th>
      <th>LPIPS ↓</th>
    </tr>
  </thead>
  <tbody>
    <!-- Specialized tokenizers -->
    <tr class="italic">
      <td colspan="7"><em>Specialized tokenizers</em></td>
    </tr>
    <tr><td>SD-VAE</td><td>256</td><td>1024</td><td>1.06</td><td>28.62</td><td>0.86</td><td>-</td></tr>
    <tr><td>GigaTok</td><td>256</td><td>256</td><td>0.51</td><td>21.32</td><td>0.69</td><td>0.21</td></tr>
    <tr><td>VA-VAE</td><td>256</td><td>256</td><td>0.26</td><td>28.59</td><td>0.80</td><td>0.09</td></tr>
    <tr><td>HieraTok</td><td>256</td><td>256</td><td>1.04</td><td>23.90</td><td>0.72</td><td>0.09</td></tr>
    <tr><td>DC-AE</td><td>512</td><td>64</td><td>0.22</td><td>26.15</td><td>0.71</td><td>0.08</td></tr>
    <tr><td>MAE-Tok</td><td>512</td><td>128</td><td>0.62</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td>TexTok</td><td>512</td><td>256</td><td>0.73</td><td>24.45</td><td>0.66</td><td>0.19</td></tr>
    <!-- Unified tokenizers -->
    <tr class="italic">
      <td colspan="7"><em>Unified tokenizers</em></td>
    </tr>
    <tr><td>UniTok</td><td>256</td><td>256</td><td>0.38</td><td>-</td><td>-</td><td>-</td></tr>
    <tr><td>TokenFlow</td><td>384</td><td>729</td><td>0.63</td><td>22.77</td><td>0.73</td><td>-</td></tr>
    <tr><td><strong>MingTok-Vision</strong></td><td>512</td><td>256</td><td>0.54</td><td>30.77</td><td>0.62</td><td>0.14</td></tr>
    <tr><td><strong>MingTok-Vision</strong> †</td><td>512</td><td>256</td><td>0.38</td><td>31.09</td><td>0.64</td><td>0.12</td></tr>
  </tbody>
</table>

<div class="footnote">
<strong>†</strong> denotes using the semantic decoder after joint pre-training.
</div>
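
The resolution and token-count columns can be read together as a compression level. The sketch below works out how many pixels each token stands for on average, using rows copied from the table. The helper names are hypothetical, and the "equivalent patch side" assumes square inputs and is only an interpretation: the table does not say that every tokenizer arranges its tokens on a 2D grid.

```python
import math

# (tokenizer, input resolution, number of tokens), copied from the table above.
ROWS = [
    ("SD-VAE", 256, 1024),
    ("VA-VAE", 256, 256),
    ("DC-AE", 512, 64),
    ("TokenFlow", 384, 729),
    ("MingTok-Vision", 512, 256),
]


def pixels_per_token(resolution: int, num_tokens: int) -> float:
    """Average number of pixels of a square image represented by one token."""
    return resolution * resolution / num_tokens


def equivalent_patch_side(resolution: int, num_tokens: int) -> float:
    """Side length of a square patch with the same area as one token's share."""
    return math.sqrt(pixels_per_token(resolution, num_tokens))


for name, resolution, num_tokens in ROWS:
    area = pixels_per_token(resolution, num_tokens)
    side = equivalent_patch_side(resolution, num_tokens)
    print(f"{name:>15}: {area:7.1f} pixels/token (about {side:.1f} x {side:.1f})")
```

Under these assumptions, MingTok-Vision at 512 resolution with 256 tokens corresponds to 1024 pixels per token, the area of a 32 x 32 patch. That gives some context when comparing its reconstruction metrics with tokenizers that spend more tokens per image.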

## Reference

```
@article{huang2025mingunivision,
  title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
  author={Huang, Ziyuan and Zheng, DanDan and Zou, Cheng and Liu, Rui and Wang, Xiaolong and Ji, Kaixiang and Chai, Weilong and Sun, Jianxin and Wang, Libin and Lv, Yongjie and Huang, Taozhi and Liu, Jiajia and Guo, Qingpei and Yang, Ming and Chen, Jingdong and Zhou, Jun},
  journal={arXiv preprint arXiv:2510.06590},
  year={2025}
}
```