---
license: mit
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
language:
- en
base_model:
- jhu-clsp/ettin-encoder-150m
tags:
- colpali
- vidore-experimental
- vidore
pipeline_tag: visual-document-retrieval
---

# ModernVBERT

![bg](https://cdn-uploads.huggingface.co/production/uploads/6720a87e392e9cea0187fde6/nRa7iE30dqCUHGblnK8GQ.png)

## Model
This is the model card for `modernvbert`. 

## Table of Contents
1. [Overview](#overview)
2. [Usage](#usage)
3. [Evaluation](#evaluation)
4. [License](#license)
5. [Citation](#citation)

## Overview

[ModernVBERT](https://arxiv.org/abs/2510.01149) is a suite of compact 250M-parameter vision-language encoders that achieve state-of-the-art performance in their size class and match the performance of models up to 10x larger.

For more information about ModernVBERT, please check the [arXiv](https://arxiv.org/abs/2510.01149) preprint.

### Models
- `colmodernvbert` (*ColModernVBERT* in the paper) is the late-interaction version, fine-tuned for visual document retrieval tasks. It is our best-performing model on this task.
- `bimodernvbert` (*BiModernVBERT* in the paper) is the bi-encoder version, fine-tuned for visual document retrieval tasks.
- `modernvbert-embed` is the bi-encoder version after modality alignment (using an MLM objective) and contrastive learning, without document specialization.
- `modernvbert` is the base model after modality alignment (using an MLM objective).
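
To make the difference between the bi-encoder and late-interaction variants concrete, here is a minimal pure-Python sketch of the two scoring schemes. It assumes cosine similarity for the bi-encoder and the ColBERT-style MaxSim operator for late interaction; the function names and toy vectors are hypothetical, and the exact pooling and normalization used by the released checkpoints are not reproduced here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bi_encoder_score(query_vec, doc_vec):
    """Bi-encoder: one pooled vector per query and per document, compared by cosine similarity."""
    return dot(normalize(query_vec), normalize(doc_vec))


def late_interaction_score(query_vecs, doc_vecs):
    """Late interaction (MaxSim): each query token keeps its best-matching
    document token similarity, and the per-token maxima are summed."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy embeddings (illustrative values only, not model outputs)
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

print("late interaction:", late_interaction_score(query_tokens, doc_tokens))
print("bi-encoder:", bi_encoder_score([1.0, 1.0], [2.0, 2.0]))
```

The bi-encoder stores a single vector per document, which keeps the index small. Late interaction stores one vector per token or image patch, which costs more storage but lets each query token match a specific region of the document.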


## Usage
You can use these models directly with the `transformers` library:

```sh
pip install torch transformers pillow
```

**🏎️ If your GPU supports it, we recommend using ModernVBERT with Flash Attention 2 to achieve the highest GPU throughput. To do so, install Flash Attention 2 as follows, then use the model as normal:**

```bash
pip install flash-attn
```
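
As a rough sketch, the choice between the default settings and the Flash Attention 2 settings can be made at runtime. The helper below is hypothetical (it is not part of `transformers` or this repository). It only checks whether the `flash_attn` package is importable and returns the option names used in the usage example on this card, with the dtype given as a string.

```python
import importlib.util


def pick_attention_settings(has_cuda: bool) -> dict:
    """Return loading options as plain strings (hypothetical helper).

    Flash Attention 2 settings are chosen only when a CUDA GPU is present and
    the `flash_attn` package is installed; otherwise float32 defaults are used.
    """
    flash_available = importlib.util.find_spec("flash_attn") is not None
    if has_cuda and flash_available:
        return {
            "torch_dtype": "bfloat16",
            "_attn_implementation": "flash_attention_2",
        }
    return {"torch_dtype": "float32"}


# Without a GPU the helper always falls back to the float32 defaults.
print(pick_attention_settings(has_cuda=False))
```

In practice, `has_cuda` would come from `torch.cuda.is_available()`, and the dtype string can be converted with `getattr(torch, settings["torch_dtype"])` before the options are passed to `from_pretrained`. Note that an importable `flash_attn` does not guarantee that the GPU architecture itself is supported.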

Here is an example of masked token prediction using ModernVBERT:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoProcessor
from PIL import Image
from huggingface_hub import hf_hub_download

model_id = "ModernVBERT/modernvbert"

processor = AutoProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # use torch_dtype=torch.bfloat16 for flash attention
    # _attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
text = "This [MASK] is on the wall."

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token) # Predicted token: painting
```
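
The example above looks up the first `[MASK]` with `list.index`, which raises a `ValueError` when no mask is present and ignores any additional masks. The following sketch shows one way to handle several masks and to list the top-k candidates for each. Both helpers and the toy values are hypothetical and operate on plain Python lists, so they do not depend on the model.

```python
def find_mask_positions(input_ids, mask_token_id):
    """Return the index of every mask token (empty list if there is none)."""
    return [i for i, tok in enumerate(input_ids) if tok == mask_token_id]


def top_k_token_ids(logits_row, k=5):
    """Return the ids of the k highest-scoring vocabulary entries, best first."""
    ranked = sorted(range(len(logits_row)), key=lambda i: logits_row[i], reverse=True)
    return ranked[:k]


# Toy data: the mask id 99 and the five-entry "vocabulary" are made up.
toy_input_ids = [1, 7, 99, 9, 99, 2]
toy_logits_row = [0.1, 2.5, -1.0, 1.7, 0.3]

print(find_mask_positions(toy_input_ids, mask_token_id=99))  # [2, 4]
print(top_k_token_ids(toy_logits_row, k=2))                  # [1, 3]
```

With the objects from the example above, the inputs would be `inputs["input_ids"][0].tolist()` and `tokenizer.mask_token_id`, and each row would be `outputs.logits[0, position].tolist()`. The resulting ids can then be passed to `tokenizer.decode`.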

## Evaluation


![table](https://cdn-uploads.huggingface.co/production/uploads/6720a87e392e9cea0187fde6/KEx0Y7r3hrgPJUh0_I9_1.png)
Our results can be found in the [arXiv](https://arxiv.org/abs/2510.01149) preprint.
When fine-tuned for visual document retrieval tasks, ModernVBERT matches the performance of models nearly 10x larger on visual document benchmarks. It also offers favorable inference speed on CPU compared to models of similar performance.

## License

We release the ModernVBERT model architectures, model weights, and training codebase under the MIT license.

## Citation

If you use ModernVBERT in your work, please cite:

```
@misc{teiletche2025modernvbertsmallervisualdocument,
      title={ModernVBERT: Towards Smaller Visual Document Retrievers}, 
      author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
      year={2025},
      eprint={2510.01149},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2510.01149}, 
}
```