Communication-Inspired Tokenization for Structured Image Representations

Aram Davtyan • Yusuf Sahin • Yasaman Haghighi • Sebastian Stapf • Pablo Acuaviva • Alexandre Alahi • Paolo Favaro

Official pre-trained models for the paper: Communication-Inspired Tokenization for Structured Image Representations.

Installation

Follow the instructions at https://github.com/Araachie/comit

Usage

Example usage, downloading COMiT-B from the Hugging Face Hub:

import torch
from comit import COMiT

device = "cuda" if torch.cuda.is_available() else "cpu"
model = COMiT.from_pretrained('cvg-unibe/comit-b')
model.eval().to(device)

With a pretrained COMiT model, images can be encoded into token sequences as follows:

with torch.no_grad():
  token_dict = model.tokenize(
      batch,  # Batch of input images, prepared by the caller
      global_crop=False,  # Whether to use the global crop as the first observation
      order="adaptive",  # One of ["raster_scan", "random", "adaptive"] or a list of crop indices
      num_crops=3,  # Used to truncate the list of crops to embed
  )
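The interplay of order and num_crops can be illustrated with a small, model-independent sketch. It is not COMiT's implementation: the helper resolve_crop_order, its total_crops and seed parameters, and the row-major indexing of crops are all assumptions made for illustration. The "adaptive" order is decided by the model itself and is deliberately left out.

```python
import random
from typing import List, Optional, Sequence, Union


def resolve_crop_order(
    order: Union[str, Sequence[int]],
    total_crops: int,                 # hypothetical: number of available crops
    num_crops: Optional[int] = None,  # truncate the resolved list to this length
    seed: Optional[int] = None,       # hypothetical: makes "random" repeatable
) -> List[int]:
    """Return the crop indices to embed, in visiting order (illustrative only)."""
    if isinstance(order, str):
        if order == "raster_scan":
            # Assumption: crops are indexed row by row, left to right.
            indices = list(range(total_crops))
        elif order == "random":
            indices = list(range(total_crops))
            random.Random(seed).shuffle(indices)
        else:
            # "adaptive" depends on the model and cannot be reproduced here.
            raise ValueError(f"order {order!r} is not covered by this sketch")
    else:
        # An explicit list of crop indices is used as given.
        indices = [int(i) for i in order]
        if any(not 0 <= i < total_crops for i in indices):
            raise ValueError("crop index out of range")

    # num_crops truncates the list of crops to embed.
    if num_crops is not None:
        indices = indices[:num_crops]
    return indices


first_three = resolve_crop_order("raster_scan", total_crops=9, num_crops=3)
custom = resolve_crop_order([4, 0, 8, 2], total_crops=9, num_crops=3)
```

In this sketch, truncation is applied after the order is resolved, so an explicit list longer than num_crops is simply cut short.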

By default, the tokenization pipeline returns a sequence of 256 tokens, each 6-dimensional. If token indices are needed instead, they can be obtained via:

token_ids = model.quantizer.codes_to_indices(token_dict["msgs"])
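To give an intuition for how a multi-dimensional code can correspond to a single integer index, here is a minimal sketch of mixed-radix encoding. It assumes a quantizer whose code dimensions each take a fixed number of discrete levels, which this model card does not confirm. The LEVELS values and the helpers code_to_index and index_to_code are hypothetical, and the sketch works on integer level coordinates rather than on the values stored in token_dict["msgs"].

```python
from typing import List, Sequence

# Hypothetical number of levels per code dimension (not taken from COMiT).
LEVELS = (8, 8, 8, 5, 5, 5)


def code_to_index(code: Sequence[int], levels: Sequence[int] = LEVELS) -> int:
    """Map one code (a level coordinate per dimension) to a single integer."""
    if len(code) != len(levels):
        raise ValueError("code and levels must have the same length")
    index, stride = 0, 1
    for digit, base in zip(code, levels):
        if not 0 <= digit < base:
            raise ValueError("level coordinate out of range")
        index += digit * stride  # the first dimension is the least significant digit
        stride *= base
    return index


def index_to_code(index: int, levels: Sequence[int] = LEVELS) -> List[int]:
    """Inverse of code_to_index: recover the per-dimension level coordinates."""
    code = []
    for base in levels:
        code.append(index % base)
        index //= base
    return code


example_code = [3, 1, 4, 1, 0, 2]
example_index = code_to_index(example_code)
recovered = index_to_code(example_index)
```

Under these assumed levels the vocabulary would contain 8 * 8 * 8 * 5 * 5 * 5 = 64000 entries; the actual vocabulary size depends on the quantizer configuration shipped with the model.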

To visually probe the information in the token sequences, one can decode the tokens back into images:

with torch.no_grad():
  detoken_dict = model.detokenize(
      msgs=token_dict["msgs"],
      offsets=token_dict["offsets"],
      num_steps=10,  # Number of denoising steps
      odesolver="euler",  # The numerical velocity field integration method
      cfg_weight=7.5,  # CFG strength
  )
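The roles of num_steps, odesolver="euler" and cfg_weight can be sketched on a toy problem. The code below is not COMiT's decoder: the velocity fields are made up, integration from t = 0 to t = 1 is an assumption, and the guidance rule shown is one common classifier-free guidance formulation that the model card does not confirm for this model.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def cfg_velocity(v_cond: Vector, v_uncond: Vector, cfg_weight: float) -> Vector:
    """One common CFG rule: start at the unconditional prediction and move
    cfg_weight times the gap toward the conditional one (assumed form)."""
    return [u + cfg_weight * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_integrate(velocity: VelocityFn, x0: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    h = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]  # x_{k+1} = x_k + h * v(x_k, t_k)
    return x


# Toy stand-ins for a learned network (hypothetical, for illustration only).
TARGET = [1.0, -2.0]


def toy_cond(x: Vector, t: float) -> Vector:
    return [ti - xi for ti, xi in zip(TARGET, x)]  # pulls toward TARGET


def toy_uncond(x: Vector, t: float) -> Vector:
    return [-xi for xi in x]  # pulls toward the origin


def guided(x: Vector, t: float, cfg_weight: float = 1.0) -> Vector:
    return cfg_velocity(toy_cond(x, t), toy_uncond(x, t), cfg_weight)


x_final = euler_integrate(guided, [0.0, 0.0], num_steps=10)
```

With cfg_weight equal to 1 the guided field reduces to the conditional one; larger weights extrapolate further from the unconditional prediction. More Euler steps reduce the discretization error at the cost of more velocity evaluations.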

For convenience, we also provide the reconstruct method, which chains tokenize and detokenize in a single call:

with torch.no_grad():
  rec_dict = model.reconstruct(
      batch,
      global_crop=False,
      order="adaptive",
      num_crops=3,
      num_steps=10,
      odesolver="euler",
      cfg_weight=7.5,
  )

Licensing

Unless otherwise noted, the model weights are licensed under the Apache License 2.0. For code licensing, see https://github.com/Araachie/comit?tab=readme-ov-file#licensing

Citation

If you find this work helpful, please consider citing it:

@misc{davtyan2026comit,
      title={Communication-Inspired Tokenization for Structured Image Representations}, 
      author={Aram Davtyan and Yusuf Sahin and Yasaman Haghighi and Sebastian Stapf and Pablo Acuaviva and Alexandre Alahi and Paolo Favaro},
      year={2026},
      eprint={2602.20731},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.20731}, 
}

Model size: 0.3B params (Safetensors, tensor type F32)

Paper: Communication-Inspired Tokenization for Structured Image Representations (arXiv:2602.20731, published Feb 24)