metadata
license: mit
pipeline_tag: image-to-image
tags:
  - discrete tokenization
  - autoregressive generation

InsightTok

InsightTok is a discrete visual tokenizer designed to improve the fidelity of text and faces, two of the most challenging yet perceptually important structures in autoregressive image generation.

It was introduced in the paper InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation.

Model Details

| Property | Value |
|---|---|
| Downsampling rate | 16× |
| Codebook size | 16,384 |
| Latent dimension | 256 |
| Number of parameters | 426M |
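
To make these numbers concrete, the sketch below works out how many discrete tokens a 16× tokenizer with a 16,384-entry codebook produces for one image, and how many bits those tokens carry. The 256×256 input resolution, the helper names `token_grid` and `bits_per_token`, and the requirement that image sides be multiples of 16 are assumptions made for illustration; they are not taken from the InsightTok repository.

```python
import math


def token_grid(height, width, downsample=16):
    """Return (rows, cols) of the token grid for an image of the given size."""
    # Assumption of this sketch: image sides are exact multiples of the rate.
    if height % downsample or width % downsample:
        raise ValueError("image sides must be multiples of the downsampling rate")
    return height // downsample, width // downsample


def bits_per_token(codebook_size=16_384):
    """Bits needed to index one codebook entry."""
    return math.log2(codebook_size)


# Hypothetical 256x256 input image.
rows, cols = token_grid(256, 256)
num_tokens = rows * cols
total_bits = num_tokens * bits_per_token()

print(rows, cols)        # 16 16
print(num_tokens)        # 256
print(bits_per_token())  # 14.0
print(total_bits)        # 3584.0
```

Under these assumptions, a 256×256 image is represented by a 16×16 grid of 256 tokens, each indexing one of 16,384 codebook entries (14 bits per token).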

Performance

Through localized, content-aware perceptual losses, InsightTok achieves strong text and face reconstruction quality while maintaining a compact discrete representation.

Usage

InsightTok follows the standard VQGAN-style autoencoding interface. For setup and implementation details, please refer to the GitHub repository.

# image encoding
latents, _, [_, _, indices] = vq_model.encode(input_image_tensor)
# image decoding
recon_image_tensor = vq_model.decode(latents)
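
For readers unfamiliar with what `indices` represent in a VQGAN-style interface, here is a toy illustration of vector quantization in plain Python. It is not the InsightTok implementation: the `ToyVQ` class, its methods, and the 4-entry, 2-dimensional codebook are hypothetical, and the learned encoder and decoder networks are omitted entirely. It only shows the nearest-codebook lookup that turns continuous vectors into discrete indices and back.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyVQ:
    """Hypothetical toy quantizer for illustration; not the InsightTok model."""

    def __init__(self, codebook):
        self.codebook = codebook

    def encode(self, vectors):
        # Map each input vector to the index of its nearest codebook entry.
        indices = [
            min(
                range(len(self.codebook)),
                key=lambda k: squared_distance(v, self.codebook[k]),
            )
            for v in vectors
        ]
        # The quantized latents are the selected codebook entries themselves.
        latents = [self.codebook[i] for i in indices]
        return latents, indices

    def decode_indices(self, indices):
        # Recover the quantized latents from the discrete indices alone.
        return [self.codebook[i] for i in indices]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vq = ToyVQ(codebook)

latents, indices = vq.encode([[0.1, 0.2], [0.9, 0.8], [0.2, 0.7]])
print(indices)  # [0, 3, 2]
print(vq.decode_indices(indices) == latents)  # True
```

In a real tokenizer the indices are what an autoregressive generator predicts, and the quantized latents are passed through a learned decoder to produce pixels.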

Citation

@article{yue2026insighttok,
  title={InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation},
  author={Yue, Yang and Wei, Fangyun and He, Tianyu and Zhao, Jinjing and Ni, Zanlin and Liu, Zeyu and Guo, Jiayi and Shi, Lei and Dong, Yue bit and Chen, Li and Li, Ji and Huang, Gao and Chen, Dong},
  journal={arXiv preprint arXiv:2605.14333},
  year={2026}
}