Communication-Inspired Tokenization for Structured Image Representations
Aram Davtyan • Yusuf Sahin • Yasaman Haghighi • Sebastian Stapf • Pablo Acuaviva • Alexandre Alahi • Paolo Favaro
Official pre-trained models for the paper: https://arxiv.org/abs/2602.20731
Project's website: https://araachie.github.io/comit/
Installation
Follow the instructions at https://github.com/Araachie/comit
Usage
Example usage, downloading COMiT-XL from the Hugging Face Hub:
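import torch
from comit import COMiT
device = "cuda" if torch.cuda.is_available() else "cpu"  # pick a GPU when one is available
model = COMiT.from_pretrained('cvg-unibe/comit-xl')
model.eval().to(device)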
With a pretrained COMiT model, images can be encoded into token sequences as follows:
with torch.no_grad():
    token_dict = model.tokenize(
        batch,
        global_crop=False,  # Whether to use the global crop as the first observation
        order="adaptive",   # One of ["raster_scan", "random", "adaptive"] or a list of crop indices
        num_crops=3,        # Used to truncate the list of crops to embed
    )
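One way to read the `order` and `num_crops` arguments is as selecting a sequence of crop indices and then truncating it. The sketch below illustrates that reading in plain Python. It is not taken from the COMiT code base: `resolve_crop_order`, `num_grid_crops`, and `seed` are hypothetical names, and the "adaptive" mode is deliberately left out because it depends on the model's own selection policy.

```python
import random
from typing import List, Optional, Sequence, Union


def resolve_crop_order(
    order: Union[str, Sequence[int]],
    num_grid_crops: int,              # hypothetical: total number of crops in the grid
    num_crops: Optional[int] = None,  # keep only the first `num_crops` entries
    seed: Optional[int] = None,       # hypothetical: makes the "random" order repeatable
) -> List[int]:
    """Turn an `order` specification into an explicit, truncated list of crop indices."""
    if isinstance(order, str):
        if order == "raster_scan":
            indices = list(range(num_grid_crops))
        elif order == "random":
            indices = list(range(num_grid_crops))
            random.Random(seed).shuffle(indices)
        elif order == "adaptive":
            # The adaptive order is chosen by the model and cannot be reproduced here.
            raise NotImplementedError("adaptive ordering is model-dependent")
        else:
            raise ValueError(f"unknown order: {order!r}")
    else:
        # An explicit list of crop indices is used as given.
        indices = list(order)

    if num_crops is not None:
        indices = indices[:num_crops]
    return indices


print(resolve_crop_order("raster_scan", num_grid_crops=9, num_crops=3))  # [0, 1, 2]
print(resolve_crop_order([4, 0, 7, 2], num_grid_crops=9, num_crops=3))   # [4, 0, 7]
```

Under this reading, `num_crops=3` with an explicit list simply keeps the first three indices of that list.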
By default, the tokenization pipeline returns a list of 256 6-dimensional tokens. If token indices are needed instead, they can be obtained via:
token_ids = model.quantizer.codes_to_indices(token_dict["msgs"])
To visually probe the information in the token sequences, one can decode the tokens back into images:
with torch.no_grad():
    detoken_dict = model.detokenize(
        msgs=token_dict["msgs"],
        offsets=token_dict["offsets"],
        num_steps=10,        # Number of denoising steps
        odesolver="euler",   # The numerical velocity field integration method
        cfg_weight=7.5,      # CFG strength
    )
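To give some intuition for `num_steps`, `odesolver="euler"`, and `cfg_weight`, the following is a minimal, library-independent sketch of fixed-step Euler integration of a velocity field combined with classifier-free guidance. It is not COMiT's implementation: the function names, the integration interval from t = 0 to t = 1, and the particular guidance formula `v_uncond + w * (v_cond - v_uncond)` are assumptions chosen for illustration, and the velocity fields are toy stand-ins for a neural network.

```python
from typing import Callable, List

Vector = List[float]
Field = Callable[[Vector, float], Vector]


def cfg_velocity(v_cond: Vector, v_uncond: Vector, cfg_weight: float) -> Vector:
    """Classifier-free guidance (one common convention, assumed here):
    extrapolate from the unconditional velocity towards the conditional one."""
    return [u + cfg_weight * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_integrate(velocity: Field, x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) over t in [0, 1] with fixed-step Euler updates."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def guided_field(cond: Field, uncond: Field, cfg_weight: float) -> Field:
    """Wrap two velocity fields into a single guided field."""
    def field(x: Vector, t: float) -> Vector:
        return cfg_velocity(cond(x, t), uncond(x, t), cfg_weight)
    return field


# Toy fields: the conditional field pulls towards a target, the unconditional one is zero.
target = [1.0, -2.0]
cond = lambda x, t: [ti - xi for ti, xi in zip(target, x)]
uncond = lambda x, t: [0.0 for _ in x]

x1 = euler_integrate(guided_field(cond, uncond, cfg_weight=1.0), x0=[0.0, 0.0], num_steps=10)
print(x1)  # moves from the origin towards `target`
```

More steps reduce the discretization error of the Euler update at the cost of more network evaluations, which is the trade-off controlled by `num_steps`.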
For convenience, we also provide the reconstruct method, which chains tokenize and detokenize in a single call:
with torch.no_grad():
    rec_dict = model.reconstruct(
        batch,
        global_crop=False,
        order="adaptive",
        num_crops=3,
        num_steps=10,
        odesolver="euler",
        cfg_weight=7.5,
    )
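For readers who want to inspect or modify the tokens between the two stages, the same pipeline can presumably be written out by hand. The helper below is a sketch that uses only the `tokenize` and `detokenize` calls and the `"msgs"` and `"offsets"` keys shown above; `reconstruct_in_two_steps` is a hypothetical name, and whether `reconstruct` returns exactly the same dictionary as `detokenize` is an assumption.

```python
from typing import Any, Dict


def reconstruct_in_two_steps(
    model: Any,
    batch: Any,
    *,
    global_crop: bool = False,
    order: Any = "adaptive",
    num_crops: int = 3,
    num_steps: int = 10,
    odesolver: str = "euler",
    cfg_weight: float = 7.5,
) -> Dict[str, Any]:
    """Tokenize `batch`, then decode the resulting tokens back into images."""
    # Stage 1: encode the images into token sequences.
    token_dict = model.tokenize(
        batch,
        global_crop=global_crop,
        order=order,
        num_crops=num_crops,
    )
    # The tokens in token_dict["msgs"] could be inspected or edited at this point.
    # Stage 2: decode the tokens back into images.
    return model.detokenize(
        msgs=token_dict["msgs"],
        offsets=token_dict["offsets"],
        num_steps=num_steps,
        odesolver=odesolver,
        cfg_weight=cfg_weight,
    )
```

As with the calls above, this helper would normally be invoked inside a `with torch.no_grad():` block.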
Licensing
Unless otherwise noted, the model weights are licensed under the Apache License 2.0. For the code licensing, see https://github.com/Araachie/comit?tab=readme-ov-file#licensing
Citation
If you find this work helpful, please consider citing it:
@misc{davtyan2026comit,
title={Communication-Inspired Tokenization for Structured Image Representations},
author={Aram Davtyan and Yusuf Sahin and Yasaman Haghighi and Sebastian Stapf and Pablo Acuaviva and Alexandre Alahi and Paolo Favaro},
year={2026},
eprint={2602.20731},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.20731},
}