Commit e72e13e (verified) by mlconti · 1 Parent(s): 95c4be6

Update README.md

  1. README.md +111 -1
README.md CHANGED
@@ -7,4 +7,114 @@ language:
  - en
  base_model:
  - jhu-clsp/ettin-encoder-150m
- ---
+ ---
+
+ # ModernVBERT
+
+ ## Model
+ This is the model card for `modernvbert`.
+
+ ## Table of Contents
+ 1. [Overview](#overview)
+ 2. [Usage](#usage)
+ 3. [Evaluation](#evaluation)
+ 4. [License](#license)
+ 5. [Citation](#citation)
+
+ ## Overview
+
+ [ModernVBERT](https://arxiv.org/abs/2510.01149) is a suite of compact 250M-parameter vision-language encoders that achieves state-of-the-art performance in its size class and matches the performance of models up to 10x larger.
+
+ For more information about ModernVBERT, see the [arXiv preprint](https://arxiv.org/abs/2510.01149).
+
+ ### Models
+ - `colmodernvbert` (*ColModernVBERT* in the paper) is the late-interaction version, fine-tuned for visual document retrieval. It is our best-performing model on this task.
+ - `bimodernvbert` (*BiModernVBERT* in the paper) is the bi-encoder version, fine-tuned for visual document retrieval.
+ - `modernvbert-embed` is the bi-encoder version after modality alignment (using an MLM objective) and contrastive learning, without document specialization.
+ - `modernvbert` is the base model after modality alignment (using an MLM objective).
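The difference between the late-interaction and bi-encoder variants listed above can be illustrated with a small, dependency-free sketch. It assumes the usual ColBERT-style MaxSim rule for late interaction and cosine similarity for the bi-encoder. The function names and toy vectors are hypothetical, and this is not the released models' scoring code.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def bi_encoder_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Bi-encoder: one pooled vector per input, compared by cosine similarity."""
    return dot(normalize(query_vec), normalize(doc_vec))


def late_interaction_score(
    query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]
) -> float:
    """Late interaction (MaxSim): for each query token, keep the similarity of
    its best-matching document token, then sum these maxima over the query."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)


# Toy 2-D embeddings, made up for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

print(late_interaction_score(query, doc))        # one score from per-token matches
print(bi_encoder_score([1.0, 1.0], [2.0, 2.0]))  # one score from pooled vectors
```

Late interaction keeps one embedding per token or image patch, which costs more storage but preserves fine-grained matches. A bi-encoder stores a single vector per document.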
+
+
+ ## Usage
+ You can use these models directly with the `transformers` library:
+
+ ```sh
+ pip install torch transformers pillow
+ ```
+
+ **🏎️ If your GPU supports it, we recommend using ModernVBERT with Flash Attention 2 for the highest GPU throughput. Install Flash Attention 2 as follows, then use the model as usual:**
+
+ ```bash
+ pip install flash-attn
+ ```
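One way to fall back gracefully when Flash Attention 2 is not installed is to check for the package before choosing load settings. The sketch below is an assumption-laden illustration: the helper names and the returned dictionary keys are hypothetical, and only the `flash_attn` import name and the `bfloat16`/`float32` pairing come from the package and the README text.

```python
import importlib.util
from typing import Dict


def flash_attn_installed() -> bool:
    """Return True if the `flash_attn` package is importable in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def choose_load_settings(use_flash: bool) -> Dict[str, str]:
    """Pick a dtype name and attention backend name (hypothetical keys).

    Flash Attention 2 is paired with bfloat16, as the README recommends;
    otherwise float32 with the library's default attention is used.
    """
    if use_flash:
        return {"dtype": "bfloat16", "attention": "flash_attention_2"}
    return {"dtype": "float32", "attention": "default"}


settings = choose_load_settings(flash_attn_installed())
print(settings)
```

The `dtype` name can be resolved with `getattr(torch, settings["dtype"])` and passed as `torch_dtype` when loading the model. This check only detects the package, not whether the GPU itself supports Flash Attention 2.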
+
+ Here is an example of masked token prediction using ModernVBERT:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoProcessor
+ from PIL import Image
+ from huggingface_hub import hf_hub_download
+
+ model_id = "ModernVBERT/modernvbert"
+
+ processor = AutoProcessor.from_pretrained(model_id)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForMaskedLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.float32,  # use torch_dtype=torch.bfloat16 for flash attention
+     # _attn_implementation="flash_attention_2",
+     trust_remote_code=True,
+ )
+
+ image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
+ text = "This [MASK] is on the wall."
+
+ # Create input messages
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": text},
+         ],
+     },
+ ]
+
+ # Prepare inputs
+ prompt = processor.apply_chat_template(messages)
+ inputs = processor(text=prompt, images=[image], return_tensors="pt")
+
+ # Inference (no gradients are needed for prediction)
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # To get predictions for the mask:
+ masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
+ predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
+ predicted_token = tokenizer.decode(predicted_token_id)
+ print("Predicted token:", predicted_token)
+ # Predicted token: painting
+ ```
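The example above reads only the first `[MASK]` position and its single best token. As a minimal sketch, using plain Python lists in place of tensors, the helpers below collect every masked position and the `k` highest-scoring token ids at each one. The helper names and toy values are hypothetical. With the real model, `inputs["input_ids"][0].tolist()` and `outputs.logits[0].tolist()` would presumably play the roles of `input_ids` and `logits`.

```python
from typing import List, Sequence, Tuple


def find_mask_positions(input_ids: Sequence[int], mask_token_id: int) -> List[int]:
    """Return the index of every occurrence of the mask token, not just the first."""
    return [i for i, tok in enumerate(input_ids) if tok == mask_token_id]


def top_k(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return the k (token_id, score) pairs with the highest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy data, made up for illustration: token id 4 plays the role of [MASK],
# and each position has scores over a 4-token vocabulary.
input_ids = [0, 7, 4, 9, 4, 2]
logits = [[0.1, 2.0, 0.3, 1.5] for _ in input_ids]

for pos in find_mask_positions(input_ids, mask_token_id=4):
    print(pos, top_k(logits[pos], k=2))
```

Each returned token id can then be passed to `tokenizer.decode` to obtain the candidate word.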
+
+ ## Evaluation
+
+ Our results can be found in the [arXiv preprint](https://arxiv.org/abs/2510.01149).
+ When fine-tuned for visual document retrieval, ModernVBERT matches the performance of models nearly 10x larger on visual document benchmarks. It also offers favorable CPU inference speed compared to models of similar performance.
+
+ ## License
+
+ We release the ModernVBERT model architectures, model weights, and training codebase under the MIT license.
+
+ ## Citation
+
+ If you use ModernVBERT in your work, please cite:
+
+ ```
+ @misc{teiletche2025modernvbertsmallervisualdocument,
+   title={ModernVBERT: Towards Smaller Visual Document Retrievers},
+   author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
+   year={2025},
+   eprint={2510.01149},
+   archivePrefix={arXiv},
+   primaryClass={cs.IR},
+   url={https://arxiv.org/abs/2510.01149},
+ }