Improve model card metadata and structure

#1
by nielsr - opened
Files changed (1)
  1. README.md +35 -17
README.md CHANGED
@@ -1,12 +1,22 @@
 ---
+license: apache-2.0
+pipeline_tag: image-feature-extraction
 tags:
 - model_hub_mixin
 - pytorch_model_hub_mixin
-license: apache-2.0
+- vision
+- image-tokenization
 ---
+
 # Communication-Inspired Tokenization for Structured Image Representations
-</h1>
-<p align="left">
+
+<p align="left">
+<a href="https://huggingface.co/papers/2602.20731">Paper</a> •
+<a href="https://araachie.github.io/comit/">Project Website</a> •
+<a href="https://github.com/Araachie/comit">GitHub</a>
+</p>
+
+<p align="left">
 <a href="https://araachie.github.io">Aram Davtyan</a> •
 <a href="https://www.cvg.unibe.ch/people/sahin">Yusuf Sahin</a> •
 <a href="https://people.epfl.ch/yasaman.haghighi?lang=en">Yasaman Haghighi</a> •
@@ -14,28 +24,38 @@ license: apache-2.0
 <a href="https://www.cvg.unibe.ch/people/acuaviva">Pablo Acuaviva</a> •
 <a href="https://people.epfl.ch/alexandre.alahi?lang=en">Alexandre Alahi</a> •
 <a href="https://www.cvg.unibe.ch/people/favaro">Paolo Favaro</a>
-</p>
+</p>
 
-Official pre-trained models for the paper: https://arxiv.org/abs/2602.20731
+COMmunication inspired Tokenization (**COMiT**) is a framework for learning structured discrete visual token sequences. Unlike traditional tokenizers optimized primarily for reconstruction, COMiT constructs a latent message by iteratively observing localized image crops and recurrently updating its discrete representation, resulting in interpretable, object-centric token structure.
 
-Project's website: https://araachie.github.io/comit/
-
 ## Installation
 
-Follow the instructions at https://github.com/Araachie/comit
+Follow the instructions at the [official repository](https://github.com/Araachie/comit):
+
+```bash
+git clone https://github.com/Araachie/comit.git
+cd comit
+conda create -n comit python==3.11 -y
+conda activate comit
+pip install -e .
+```
 
 ## Usage
 
-Example usage, downloading `COMiT-L` from the Hugging Face Hub:
+### Loading the Model
+You can download and load the pre-trained `COMiT-L` model directly from the Hub:
 
 ```python
+import torch
 from comit import COMiT
 
+device = "cuda" if torch.cuda.is_available() else "cpu"
 model = COMiT.from_pretrained('cvg-unibe/comit-l')
 model.eval().to(device)
 ```
 
-With a pretrained COMiT model images can be encoded into token sequences as follows:
+### Encoding Images (Tokenization)
+With a pretrained COMiT model, images can be encoded into token sequences as follows:
 
 ```python
 with torch.no_grad():
@@ -45,14 +65,12 @@ with torch.no_grad():
 order="adaptive", # One of ["raster_scan", "random", "adaptive"] or a list of crop indices
 num_crops=3, # Used to truncate the list of crops to embed
 )
-```
 
-By default the tokenization pipeline returns a list of 256 6-dimensional tokens. If token indices are needed instead, they can be obtained via:
-
-```python
+# Get token indices (discrete IDs)
 token_ids = model.quantizer.codes_to_indices(token_dict["msgs"])
 ```
 
+### Decoding Tokens (Reconstruction)
 To visually probe the information in the token sequences, one can decode the tokens back into images:
 
 ```python
@@ -83,12 +101,12 @@ with torch.no_grad():
 
 ## Licensing
 
-Unless otherwise noted, the model weights are licensed under Apache license 2.0.
-For the code licensing, see https://github.com/Araachie/comit?tab=readme-ov-file#licensing
+Unless otherwise noted, the model weights are licensed under the Apache License 2.0.
+For the code licensing, see [GitHub licensing](https://github.com/Araachie/comit?tab=readme-ov-file#licensing).
 
 ## Citation
 
-If you find this work helpful, please consider citing our work:
+If you find this work helpful, please consider citing:
 
 ```bibtex
 @misc{davtyan2026comit,
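
---

Note: the model description added in this diff says COMiT builds its message by iteratively observing localized crops and recurrently updating a discrete representation, truncating to `num_crops` crops. A toy sketch of that control flow is below; every name here (`tokenize`, `encode`, `update`, `quantize`) is illustrative only and is not COMiT's actual API:

```python
# Toy control-flow sketch of crop-by-crop message refinement.
# All names are hypothetical stand-ins; COMiT's real interfaces differ.

def tokenize(image_crops, encode, update, quantize, num_crops=3):
    """Fold crops into a message one at a time, then discretize it."""
    message = None
    for crop in image_crops[:num_crops]:     # truncate, like num_crops above
        features = encode(crop)              # embed one localized crop
        message = update(message, features)  # recurrent state update
    return quantize(message)                 # discretize into token output

# Usage with trivial stand-in functions, just to show the data flow:
result = tokenize(
    ["ab", "abcd", "x", "zzzz"],
    encode=len,
    update=lambda m, f: (m or 0) + f,
    quantize=lambda m: m % 10,
    num_crops=3,
)
```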
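Note: the usage example calls `model.quantizer.codes_to_indices(...)` to turn 6-dimensional discrete tokens into single integer IDs. As background, a minimal sketch of how such a mapping can work, assuming an FSQ-style quantizer that packs per-dimension code levels via mixed-radix encoding (the level counts and the scheme itself are assumptions, not COMiT's documented internals):

```python
# Hypothetical sketch: pack per-dimension code levels into one integer ID
# with mixed-radix encoding. COMiT's actual quantizer may use a different
# scheme; the level counts below are made up for illustration.

def codes_to_indices(codes, levels):
    """codes: tokens as lists of per-dimension level indices.
    levels: number of levels per dimension (the mixed radix)."""
    indices = []
    for code in codes:
        idx = 0
        for level, base in zip(code, levels):
            assert 0 <= level < base
            idx = idx * base + level  # shift by this dimension's radix
        indices.append(idx)
    return indices

def indices_to_codes(indices, levels):
    """Inverse mapping: recover per-dimension levels from flat IDs."""
    codes = []
    for idx in indices:
        code = []
        for base in reversed(levels):
            code.append(idx % base)
            idx //= base
        codes.append(list(reversed(code)))
    return codes

levels = [8, 8, 8, 5, 5, 5]  # 6 dims -> 8*8*8*5*5*5 = 64000 possible IDs
tokens = [[0, 0, 0, 0, 0, 0], [7, 7, 7, 4, 4, 4], [3, 1, 4, 1, 2, 3]]
ids = codes_to_indices(tokens, levels)
assert indices_to_codes(ids, levels) == tokens  # round-trip is lossless
```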