paultltc committed on
Commit
f6d292b
·
verified ·
1 Parent(s): d8386c5

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,129 @@
1
+ ---
2
+ license: mit
3
+ datasets:
4
+ - HuggingFaceM4/the_cauldron
5
+ - HuggingFaceM4/Docmatix
6
+ language:
7
+ - en
8
+ base_model:
9
+ - jhu-clsp/ettin-encoder-150m
10
+ tags:
11
+ - colpali
12
+ - vidore-experimental
13
+ - vidore
14
+ pipeline_tag: visual-document-retrieval
15
+ ---
16
+
17
+ # ModernVBERT
18
+
19
+ ![bg](https://cdn-uploads.huggingface.co/production/uploads/6720a87e392e9cea0187fde6/nRa7iE30dqCUHGblnK8GQ.png)
20
+
21
+ ## Model
22
+ This is the model card for `modernvbert`.
23
+
24
+ ## Table of Contents
25
+ 1. [Overview](#overview)
26
+ 2. [Usage](#usage)
26
+ 3. [Evaluation](#evaluation)
28
+ 4. [License](#license)
29
+ 5. [Citation](#citation)
30
+
31
+ ## Overview
32
+
33
+ The [ModernVBERT](https://arxiv.org/abs/2510.01149) suite is a family of compact 250M-parameter vision-language encoders that achieve state-of-the-art performance in their size class and match the performance of models up to 10x larger.
34
+
35
+ For more information about ModernVBERT, please check the [arXiv](https://arxiv.org/abs/2510.01149) preprint.
36
+
37
+ ### Models
38
+ - `colmodernvbert` (*ColModernVBERT* in the paper) is the late-interaction version, fine-tuned for visual document retrieval. It is our best-performing model on this task.
39
+ - `bimodernvbert` (*BiModernVBERT* in the paper) is the bi-encoder version, fine-tuned for visual document retrieval.
40
+ - `modernvbert-embed` is the bi-encoder version after modality alignment (using an MLM objective) and contrastive learning, without document specialization.
41
+ - `modernvbert` is the base model after modality alignment (using an MLM objective).
42
+
43
+
44
+ ## Usage
45
+ You can use these models directly with the `transformers` library:
46
+
47
+ ```sh
48
+ pip install torch transformers pillow
49
+ ```
50
+
51
+ **🏎️ If your GPU supports it, we recommend using ModernVBERT with Flash Attention 2 to achieve the highest GPU throughput. To do so, install Flash Attention 2 as follows, then use the model as normal:**
52
+
53
+ ```bash
54
+ pip install flash-attn
55
+ ```
56
+
57
+ Here is an example of masked token prediction using ModernVBERT:
58
+
59
+ ```python
60
+ import torch
61
+ from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoProcessor
62
+ from PIL import Image
63
+ from huggingface_hub import hf_hub_download
64
+
65
+ model_id = "ModernVBERT/modernvbert"
66
+
67
+ processor = AutoProcessor.from_pretrained(model_id)
68
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
69
+ model = AutoModelForMaskedLM.from_pretrained(
70
+     model_id,
71
+     torch_dtype=torch.float32, # use torch_dtype=torch.bfloat16 for flash attention
72
+     # _attn_implementation="flash_attention_2",
73
+     trust_remote_code=True
74
+ )
75
+
76
+ image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
77
+ text = "This [MASK] is on the wall."
78
+
79
+ # Create input messages
80
+ messages = [
81
+     {
82
+         "role": "user",
83
+         "content": [
84
+             {"type": "image"},
85
+             {"type": "text", "text": text}
86
+         ]
87
+     },
88
+ ]
89
+
90
+ # Prepare inputs
91
+ prompt = processor.apply_chat_template(messages)
92
+ inputs = processor(text=prompt, images=[image], return_tensors="pt")
93
+
94
+ # Inference
95
+ with torch.no_grad():
96
+     outputs = model(**inputs)
97
+
98
+ # To get predictions for the mask:
99
+ masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
100
+ predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
101
+ predicted_token = tokenizer.decode(predicted_token_id)
102
+ print("Predicted token:", predicted_token) # Predicted token: painting
103
+ ```
104
+
105
+ ## Evaluation
106
+
107
+
108
+ ![table](https://cdn-uploads.huggingface.co/production/uploads/6720a87e392e9cea0187fde6/KEx0Y7r3hrgPJUh0_I9_1.png)
109
+ Our results can be found in the [arXiv](https://arxiv.org/abs/2510.01149) preprint.
110
+ When fine-tuned for visual document retrieval, ModernVBERT matches the performance of models nearly 10x larger on visual document benchmarks. It also offers attractive CPU inference speed compared with models of similar performance.
111
+
112
+ ## License
113
+
114
+ We release the ModernVBERT model architectures, model weights, and training codebase under the MIT license.
115
+
116
+ ## Citation
117
+
118
+ If you use ModernVBERT in your work, please cite:
119
+
120
+ ```
121
+ @misc{teiletche2025modernvbertsmallervisualdocument,
122
+ title={ModernVBERT: Towards Smaller Visual Document Retrievers},
123
+ author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
124
+ year={2025},
125
+ eprint={2510.01149},
126
+ archivePrefix={arXiv},
127
+ primaryClass={cs.IR},
128
+ url={https://arxiv.org/abs/2510.01149},
129
+ }
chat_template.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "chat_template": "{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}{% if add_generation_prompt %}<end_of_utterance>\n{% endif %}{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
3
+ }
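To see what this chat template produces, the sketch below mirrors its logic in plain Python. It is a hand-written approximation, not the Jinja engine itself: `render_chat` is a hypothetical helper name, and the real processor may further expand the `<image>` placeholder into several image tokens.

```python
def render_chat(messages, add_generation_prompt=False):
    """Approximate the Jinja chat template with plain string operations."""
    parts = []
    for message in messages:
        # Jinja's `capitalize` filter: first character upper-case, rest lower-case.
        parts.append(message["role"].capitalize())
        # No space after the colon when the message starts with an image.
        parts.append(":" if message["content"][0]["type"] == "image" else ": ")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                parts.append("<image>")
        if add_generation_prompt:
            parts.append("<end_of_utterance>\n")
    if add_generation_prompt:
        parts.append("Assistant:")
    return "".join(parts)


# Same message structure as the README usage example.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "This [MASK] is on the wall."},
        ],
    },
]
prompt = render_chat(messages)
print(prompt)
```

Because the first content item is an image, the rendered prompt has no space between `User:` and `<image>`.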
config.json ADDED
@@ -0,0 +1,98 @@
1
+ {
2
+ "image_token_id": 50407,
3
+ "initializer_range": 0.02,
4
+ "model_type": "modernvbert",
5
+ "pixel_shuffle_factor": 4,
6
+ "text_config": {
7
+ "_name_or_path": "ettin-encoder-150m",
8
+ "architectures": [
9
+ "ModernBertForMaskedLM"
10
+ ],
11
+ "attention_bias": false,
12
+ "attention_dropout": 0.0,
13
+ "causal_mask": false,
14
+ "classifier_activation": "gelu",
15
+ "classifier_bias": false,
16
+ "classifier_dropout": 0.0,
17
+ "classifier_pooling": "mean",
18
+ "cls_token_id": 50281,
19
+ "decoder_bias": true,
20
+ "deterministic_flash_attn": false,
21
+ "dtype": "float32",
22
+ "embedding_dropout": 0.0,
23
+ "global_attn_every_n_layers": 3,
24
+ "global_rope_theta": 160000.0,
25
+ "gradient_checkpointing": false,
26
+ "hidden_activation": "gelu",
27
+ "hidden_size": 768,
28
+ "initializer_cutoff_factor": 2.0,
29
+ "initializer_range": 0.02,
30
+ "intermediate_size": 1152,
31
+ "is_causal": false,
32
+ "layer_norm_eps": 1e-05,
33
+ "layer_types": [
34
+ "full_attention",
35
+ "sliding_attention",
36
+ "sliding_attention",
37
+ "full_attention",
38
+ "sliding_attention",
39
+ "sliding_attention",
40
+ "full_attention",
41
+ "sliding_attention",
42
+ "sliding_attention",
43
+ "full_attention",
44
+ "sliding_attention",
45
+ "sliding_attention",
46
+ "full_attention",
47
+ "sliding_attention",
48
+ "sliding_attention",
49
+ "full_attention",
50
+ "sliding_attention",
51
+ "sliding_attention",
52
+ "full_attention",
53
+ "sliding_attention",
54
+ "sliding_attention",
55
+ "full_attention"
56
+ ],
57
+ "local_attention": 128,
58
+ "local_rope_theta": 160000.0,
59
+ "max_position_embeddings": 7999,
60
+ "mlp_bias": false,
61
+ "mlp_dropout": 0.0,
62
+ "model_type": "modernbert",
63
+ "norm_bias": false,
64
+ "norm_eps": 1e-05,
65
+ "num_attention_heads": 12,
66
+ "num_hidden_layers": 22,
67
+ "position_embedding_type": "sans_pos",
68
+ "repad_logits_with_grad": false,
69
+ "rope_parameters": {
70
+ "full_attention": {
71
+ "rope_theta": 160000.0,
72
+ "rope_type": "default"
73
+ },
74
+ "sliding_attention": {
75
+ "rope_theta": 160000.0,
76
+ "rope_type": "default"
77
+ }
78
+ },
79
+ "sparse_pred_ignore_index": -100,
80
+ "sparse_prediction": false,
81
+ "vocab_size": 50408
82
+ },
83
+ "transformers_version": "5.0.0.dev0",
84
+ "vision_config": {
85
+ "attention_dropout": 0.0,
86
+ "hidden_act": "gelu_pytorch_tanh",
87
+ "hidden_size": 768,
88
+ "image_size": 512,
89
+ "intermediate_size": 3072,
90
+ "layer_norm_eps": 1e-06,
91
+ "model_type": "siglip_vision_model",
92
+ "num_attention_heads": 12,
93
+ "num_channels": 3,
94
+ "num_hidden_layers": 12,
95
+ "patch_size": 16
96
+ },
97
+ "tie_word_embeddings": false
98
+ }
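A few of these values determine how many tokens each image occupies in the text stream. The sketch below reproduces the arithmetic that `ModernVBertModel.__init__` in this commit uses for `image_seq_len`, on a hand-copied subset of the config. The JSON fragment and variable names are illustrative assumptions, not part of the repository.

```python
import json

# Subset of config.json from this commit, copied by hand for illustration.
config = json.loads("""
{
  "image_token_id": 50407,
  "pixel_shuffle_factor": 4,
  "text_config": {"vocab_size": 50408},
  "vision_config": {"image_size": 512, "patch_size": 16}
}
""")

vision = config["vision_config"]
patches_per_side = vision["image_size"] // vision["patch_size"]
num_patches = patches_per_side ** 2
# Pixel shuffle merges factor**2 neighbouring patches into one token.
image_seq_len = num_patches // config["pixel_shuffle_factor"] ** 2

# The image token id must be a valid index into the text vocabulary.
image_token_is_last = config["image_token_id"] == config["text_config"]["vocab_size"] - 1

print(patches_per_side, num_patches, image_seq_len, image_token_is_last)
```

With these values a 512x512 image yields 32x32 = 1024 patches, which the connector reduces to 64 image tokens.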
configuration_modernvbert.py ADDED
@@ -0,0 +1,225 @@
1
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
2
+ # This file was automatically generated from src/transformers/models/modernvbert/modular_modernvbert.py.
3
+ # Do NOT edit this file manually as any edits will be overwritten by the generation of
4
+ # the file from the modular. If any change should be done, please apply the change to the
5
+ # modular_modernvbert.py file directly. One of our CI enforces this.
6
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
7
+ import os
8
+ from typing import Any, Union
9
+
10
+ from ...configuration_utils import PretrainedConfig
11
+ from ..modernbert import ModernBertConfig
12
+ from ..siglip import SiglipConfig
13
+
14
+
15
+ class ModernVBertTextConfig(PretrainedConfig):
16
+ r"""
17
+ This is the configuration class to store the configuration of a [`ModernBERT`]. It is used to instantiate a ModernBERT
18
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
19
+ defaults will yield a similar configuration to that of the [jhu-clsp/ettin-encoder-150m](https://huggingface.co/jhu-clsp/ettin-encoder-150m) architecture.
20
+
21
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
22
+ documentation from [`PretrainedConfig`] for more information.
23
+ """
24
+
25
+ model_type = "modernvbert_text"
26
+
27
+ def __init__(
28
+ self,
29
+ text_model_name="jhu-clsp/ettin-encoder-150m",
30
+ hidden_size=768,
31
+ num_hidden_layers=22,
32
+ intermediate_size=1152,
33
+ mlp_bias=False,
34
+ vocab_size=50368,
35
+ **kwargs,
36
+ ):
37
+ super().__init__(
38
+ text_model_name=text_model_name,
39
+ hidden_size=hidden_size,
40
+ num_hidden_layers=num_hidden_layers,
41
+ intermediate_size=intermediate_size,
42
+ mlp_bias=mlp_bias,
43
+ vocab_size=vocab_size,
44
+ **kwargs,
45
+ )
46
+
47
+ @classmethod
48
+ def from_base_model(
49
+ cls,
50
+ text_model_name,
51
+ **kwargs,
52
+ ):
53
+ text_config = ModernBertConfig.from_pretrained(text_model_name)
54
+ if hasattr(text_config, "text_config"):
55
+ text_config = text_config.text_config
56
+
57
+ return cls(
58
+ text_model_name=text_model_name,
59
+ hidden_size=text_config.hidden_size,
60
+ num_hidden_layers=text_config.num_hidden_layers,
61
+ intermediate_size=text_config.intermediate_size,
62
+ mlp_bias=text_config.mlp_bias,
63
+ vocab_size=text_config.vocab_size,
64
+ **kwargs,
65
+ )
66
+
67
+
68
+ class ModernVBertVisionConfig(PretrainedConfig):
69
+ r"""
70
+ This is the configuration class to store the configuration of a [`SigLIP`]. It is used to instantiate the vision encoder part of the ModernVBERT
71
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
72
+ defaults will yield a similar configuration to that of the `google/siglip2-base-patch16-512` architecture.
73
+
74
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
75
+ documentation from [`PretrainedConfig`] for more information.
76
+ """
77
+
78
+ model_type = "modernvbert_vision"
79
+
80
+ attribute_map = {
81
+ "hidden_size": "embed_dim",
82
+ }
83
+
84
+ def __init__(
85
+ self,
86
+ vision_model_name="google/siglip2-base-patch16-512",
87
+ embed_dim=768,
88
+ image_size=512,
89
+ patch_size=16,
90
+ num_hidden_layers=12,
91
+ intermediate_size=3072,
92
+ **kwargs,
93
+ ):
94
+ super().__init__(
95
+ vision_model_name=vision_model_name,
96
+ embed_dim=embed_dim,
97
+ image_size=image_size,
98
+ patch_size=patch_size,
99
+ num_hidden_layers=num_hidden_layers,
100
+ intermediate_size=intermediate_size,
101
+ **kwargs,
102
+ )
103
+
104
+ @classmethod
105
+ def from_base_model(
106
+ cls,
107
+ vision_model_name,
108
+ **kwargs,
109
+ ):
110
+ vision_config = SiglipConfig.from_pretrained(vision_model_name)
111
+ if hasattr(vision_config, "vision_config"):
112
+ vision_config = vision_config.vision_config
113
+
114
+ return cls(
115
+ vision_model_name=vision_model_name,
116
+ embed_dim=vision_config.hidden_size,
117
+ image_size=vision_config.image_size,
118
+ patch_size=vision_config.patch_size,
119
+ num_hidden_layers=vision_config.num_hidden_layers,
120
+ intermediate_size=vision_config.intermediate_size,
121
+ **kwargs,
122
+ )
123
+
124
+
125
+ class ModernVBertConfig(PretrainedConfig):
126
+ r"""
127
+ This is the configuration class to store the configuration of a `ModernVBert` model. It is used to
128
+ instantiate a ModernVBert model according to the specified arguments and defines the model architecture.
129
+
130
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs.
131
+ See the documentation for [`PretrainedConfig`] for more details.
132
+
133
+ Args:
134
+ text_config (`PretrainedConfig` or `dict`, optional):
135
+ Custom text config or a dict with a `text_model_name` key for the text encoder. If `None`, the
136
+ default text backbone (`jhu-clsp/ettin-encoder-150m`) is used.
137
+ vision_config (`PretrainedConfig` or `dict`, optional):
138
+ Custom vision config or a dict with a `vision_model_name` key for the vision encoder. If `None`, the
139
+ default vision backbone (`google/siglip2-base-patch16-512`) is used.
140
+ image_token_id (`int`, optional, defaults to 50407):
141
+ Token id reserved for image tokens inserted into the text stream.
142
+ vocab_size (`int`, optional, defaults to 50368):
143
+ Vocabulary size used by the text embeddings.
144
+ tie_word_embeddings (`bool`, optional, defaults to `False`):
145
+ Whether to tie input token embeddings and output token embeddings.
146
+ pixel_shuffle_factor (`int`, optional, defaults to 4):
147
+ Factor used by the connector's pixel-shuffle operation, which reduces the number of image tokens by `pixel_shuffle_factor**2`.
148
+ additional_vocab_size (`int`, optional, defaults to 0):
149
+ Number of extra tokens appended to the base vocabulary (useful for adapters / special tokens).
150
+ pad_token_id (`int`, optional):
151
+ Padding token id.
152
+ initializer_range (`float`, optional, defaults to 0.02):
153
+ Stddev used for weight initialization.
154
+
155
+ Example:
156
+ ```python
157
+ >>> from modernvbert import ModernVBertConfig
158
+
159
+ >>> # Initializing configuration
160
+ >>> configuration = ModernVBertConfig()
161
+
162
+ >>> # Initializing a model from the configuration (model class is implemented in
163
+ >>> # `modernvbert.modeling_modernvbert`)
164
+
165
+ >>> from modernvbert import ModernVBertModel
166
+ >>> model = ModernVBertModel(configuration)
167
+
168
+ >>> # Accessing the model configuration
169
+ >>> cfg = model.config
170
+ ```"""
171
+
172
+ model_type = "modernvbert"
173
+ sub_configs: dict[str, Any] = {"text_config": ModernVBertTextConfig, "vision_config": ModernVBertVisionConfig}
174
+
175
+ def __init__(
176
+ self,
177
+ text_config=None,
178
+ vision_config=None,
179
+ image_token_id: int = 50407,
180
+ initializer_range=0.02,
181
+ vocab_size=50368,
182
+ pad_token_id=None,
183
+ pixel_shuffle_factor=4,
184
+ additional_vocab_size=0,
185
+ **kwargs,
186
+ ):
187
+ super().__init__(**kwargs)
188
+
189
+ if text_config is None:
190
+ text_config = self.sub_configs["text_config"].from_base_model("jhu-clsp/ettin-encoder-150m")
191
+ elif isinstance(text_config, dict):
192
+ text_config = self.sub_configs["text_config"].from_dict(text_config)
193
+ self.text_config = text_config
194
+
195
+ if vision_config is None:
196
+ vision_config = self.sub_configs["vision_config"].from_base_model("google/siglip2-base-patch16-512")
197
+ elif isinstance(vision_config, dict):
198
+ vision_config = self.sub_configs["vision_config"].from_dict(vision_config)
199
+ self.vision_config = vision_config
200
+
201
+ self.initializer_range = initializer_range
202
+ self.image_token_id = image_token_id
203
+ self.pad_token_id = pad_token_id
204
+ self.pixel_shuffle_factor = pixel_shuffle_factor
205
+ self.vocab_size = vocab_size
206
+ self.additional_vocab_size = additional_vocab_size
207
+ self.hidden_size = kwargs.pop("hidden_size", self.text_config.hidden_size)
208
+
209
+ @classmethod
210
+ def from_pretrained_models(
211
+ cls,
212
+ text_model_name: Union[str, os.PathLike],
213
+ vision_model_name: Union[str, os.PathLike],
214
+ **kwargs,
215
+ ) -> "PretrainedConfig":
216
+ text_model_config = ModernVBertTextConfig.from_base_model(text_model_name)
217
+ vision_model_config = ModernVBertVisionConfig.from_base_model(vision_model_name)
218
+ return cls(
219
+ text_config=text_model_config,
220
+ vision_config=vision_model_config,
221
+ **kwargs,
222
+ )
223
+
224
+
225
+ __all__ = ["ModernVBertConfig", "ModernVBertTextConfig", "ModernVBertVisionConfig"]
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7d38dafdb2bc949c08f0fd320fd515479e2e93f2b849dd177f89cc0362571de7
3
+ size 1165471416
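The three lines above are a Git LFS pointer, not the weights themselves: each line is a key followed by a space and a value. As a minimal sketch, the pointer can be parsed with the standard library to recover the hash algorithm, digest, and file size. `parse_lfs_pointer` is a hypothetical helper name introduced here for illustration.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7d38dafdb2bc949c08f0fd320fd515479e2e93f2b849dd177f89cc0362571de7
size 1165471416
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its key/value fields."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
size_gib = info["size_bytes"] / 2**30
print(info["hash_algo"], f"{size_gib:.2f} GiB")
```

The size field shows that the checkpoint is a little over 1 GiB on disk.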
modeling_modernvbert.py ADDED
@@ -0,0 +1,610 @@
1
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
2
+ # This file was automatically generated from src/transformers/models/modernvbert/modular_modernvbert.py.
3
+ # Do NOT edit this file manually as any edits will be overwritten by the generation of
4
+ # the file from the modular. If any change should be done, please apply the change to the
5
+ # modular_modernvbert.py file directly. One of our CI enforces this.
6
+ # 🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨🚨
7
+ from dataclasses import dataclass
8
+ from typing import Optional, Union
9
+
10
+ import torch
11
+ import torch.nn as nn
12
+ import torch.nn.functional as F
13
+ from torch.nn import CrossEntropyLoss
14
+
15
+ from ...modeling_flash_attention_utils import FlashAttentionKwargs
16
+ from ...modeling_outputs import BaseModelOutput, BaseModelOutputWithPoolingAndCrossAttentions, MaskedLMOutput
17
+ from ...modeling_utils import PreTrainedModel
18
+ from ...processing_utils import Unpack
19
+ from ...utils import auto_docstring, can_return_tuple
20
+ from ..modernbert import ModernBertConfig, ModernBertForMaskedLM, ModernBertModel
21
+ from ..siglip import SiglipVisionConfig, SiglipVisionModel
22
+ from .configuration_modernvbert import ModernVBertConfig
23
+
24
+
25
+ class DecoupledEmbedding(nn.Embedding):
26
+ # Derived from https://pytorch.org/docs/stable/_modules/torch/nn/modules/sparse.html#Embedding
27
+ """
28
+ Implements a decoupling of parameters to allow freezing (or not) a subset of the embeddings.
29
+ In practice, the regular `weight` can be trained or frozen (i.e. `partially_freeze=True`), and if `num_additional_embeddings` > 0, then it will create `num_additional_embeddings` additional parameters that are always trained.
30
+ If `num_additional_embeddings=0`, then the module defaults back to the regular behavior of `nn.Embedding`.
31
+ """
32
+
33
+ def __init__(
34
+ self,
35
+ num_embeddings,
36
+ num_additional_embeddings,
37
+ embedding_dim,
38
+ partially_freeze=False,
39
+ device=None,
40
+ dtype=None,
41
+ padding_idx=None,
42
+ **kwargs,
43
+ ) -> None:
44
+ """
45
+ num_additional_embeddings: int. Number of additional embeddings. Only useful when `partially_freeze=True`.
46
+ partially_freeze: bool. If True, the regular `weight` will be frozen. `additional_weight` is never frozen.
47
+
48
+ Note: a standard `nn.Embedding` accepts several other parameters, such as `max_norm` or `norm_type`. These are not supported here.
49
+ """
50
+ if padding_idx is not None and padding_idx > num_embeddings:
51
+ raise ValueError(f"padding_idx must be within num_embeddings. Got {padding_idx} and {num_embeddings}")
52
+
53
+ super().__init__(
54
+ num_embeddings=num_embeddings,
55
+ embedding_dim=embedding_dim,
56
+ device=device,
57
+ dtype=dtype,
58
+ padding_idx=padding_idx,
59
+ **kwargs,
60
+ )
61
+ self.num_embeddings = num_embeddings
62
+ self.num_additional_embeddings = num_additional_embeddings
63
+ self.partially_freeze = partially_freeze
64
+
65
+ if partially_freeze:
66
+ self.weight.requires_grad_(False)
67
+
68
+ if self.num_additional_embeddings > 0:
69
+ self.additional_embedding = nn.Embedding(
70
+ num_embeddings=num_additional_embeddings,
71
+ embedding_dim=embedding_dim,
72
+ device=device,
73
+ dtype=dtype,
74
+ )
75
+
76
+ def forward(self, input_ids):
77
+ """
78
+ we have 2 embeddings, with different indices - one pretrained self.weight and another
79
+ self.additional_embedding.weight that is being trained.
80
+
81
+ in order to make a lookup of the input ids, we:
82
+ 1. find out the indices of the entries belonging to the 2nd embedding
83
+ 2. extract those values while subtracting the size of the first embedding (num_embeddings),
84
+ since the 2nd embedding starts from 0 and not num_embeddings
85
+ 3. perform the 2nd embedding lookup
86
+ 4. now we handle the 1st embedding, we overwrite indices belonging to the 2nd embedding with a padding index
87
+ 5. perform the 1st embedding lookup
88
+ 6. now we overwrite the values in the 1st embedding lookup with the values of the 2nd embedding lookup
89
+
90
+ note: for the 1st embedding lookup we could have looked up only the low indices and not do
91
+ the padding, but then we have to create a new tensor and populate it with 2 tensors that are
92
+ spread out across various indices - i.e. not a simple concat - I haven't benchmarked the
93
+ complex case if it's any faster, given that seqlens are usually relatively short it's
94
+ probably not faster or if faster not by much - but might be a good idea to measure.
95
+
96
+ """
97
+ if self.num_additional_embeddings == 0:
98
+ return super().forward(input_ids)
99
+
100
+ input_ids = input_ids.clone()
101
+ additional_vocab_indices = torch.where(input_ids >= self.num_embeddings)
102
+ input_ids_additional_vocab = input_ids[additional_vocab_indices]
103
+ additional_embeddings = self.additional_embedding(input_ids_additional_vocab - self.num_embeddings)
104
+
105
+ # for successful lookup replace input_ids with 0, the results of these will be discarded anyway
106
+ input_ids[additional_vocab_indices] = 0
107
+ full_vector = F.embedding(input_ids, self.weight)
108
+ full_vector[additional_vocab_indices] = additional_embeddings # overwrite the records with high indices
109
+ return full_vector
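The lookup steps listed in the `forward` docstring above can be illustrated without PyTorch. The following is a minimal pure-Python sketch on small hand-written tables. `decoupled_lookup`, `base_table`, and `extra_table` are hypothetical names introduced for illustration and are not part of the committed code.

```python
def decoupled_lookup(input_ids, base_table, extra_table):
    """Mimic DecoupledEmbedding.forward with plain lists instead of tensors."""
    num_base = len(base_table)
    # Steps 1-3: find the ids that belong to the additional table and look them
    # up after shifting them so that they start from 0.
    extra_positions = [i for i, tok in enumerate(input_ids) if tok >= num_base]
    extra_vectors = [extra_table[input_ids[i] - num_base] for i in extra_positions]
    # Step 4: replace those ids with 0 so the base lookup stays in range.
    safe_ids = [0 if tok >= num_base else tok for tok in input_ids]
    # Step 5: base lookup (rows are copied so the tables are not mutated).
    output = [list(base_table[tok]) for tok in safe_ids]
    # Step 6: overwrite the placeholder rows with the additional vectors.
    for position, vector in zip(extra_positions, extra_vectors):
        output[position] = list(vector)
    return output


base_table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]  # ids 0..2 (possibly frozen)
extra_table = [[9.0, 9.1]]  # id 3 (always trainable in the real module)
vectors = decoupled_lookup([1, 3, 0], base_table, extra_table)
print(vectors)
```

Id 3 falls outside the base table, so its row comes from `extra_table`, while ids 1 and 0 are served by `base_table`.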
110
+
111
+
112
+ @dataclass
113
+ class ModernVBertBaseModelOutput(BaseModelOutput):
114
+ """
115
+ Base class for ModernVBERT model's outputs that may also contain a past key/values (to speed up sequential decoding).
116
+ Args:
117
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
118
+ Sequence of hidden-states at the output of the last layer of the model.
119
+ If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
120
+ hidden_size)` is output.
121
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
122
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
123
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
124
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
125
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
126
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
127
+ sequence_length)`.
128
+ Attention weights after the attention softmax, used to compute the weighted average in the self-attention
129
+ heads.
130
+ image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
131
+ Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
132
+ sequence_length, hidden_size)`.
133
+ image_hidden_states of the model produced by the vision encoder
134
+ """
135
+
136
+ last_hidden_state: torch.FloatTensor = None
137
+ hidden_states: Optional[tuple[torch.FloatTensor]] = None
138
+ attentions: Optional[tuple[torch.FloatTensor]] = None
139
+ image_hidden_states: Optional[tuple[torch.FloatTensor]] = None
140
+
141
+
142
+ @dataclass
143
+ class ModernVBertMaskedLMOutput(MaskedLMOutput):
144
+ """
145
+ Base class for ModernVBERT masked language modeling outputs, extended with the image hidden states produced by the vision encoder.
146
+ Args:
147
+ loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided):
148
+ Masked language modeling (MLM) loss.
149
+ logits (`torch.FloatTensor`):
150
+ Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
151
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
152
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
153
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
154
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
155
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
156
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
157
+ sequence_length)`.
158
+ Attention weights after the attention softmax, used to compute the weighted average in the self-attention
159
+ heads.
160
+ image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
161
+ Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
162
+ sequence_length, hidden_size)`.
163
+ image_hidden_states of the model produced by the vision encoder
164
+ """
165
+
166
+ loss: Optional[torch.FloatTensor] = None
167
+ logits: torch.FloatTensor = None
168
+ hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
169
+ attentions: Optional[tuple[torch.FloatTensor, ...]] = None
170
+ image_hidden_states: Optional[torch.FloatTensor] = None
171
+
172
+
173
+ class ModernVBertSimpleMLP(nn.Module):
174
+ """A simple linear projection layer to project the vision hidden states to the text hidden states."""
175
+
176
+ def __init__(self, input_size, output_size):
177
+ super().__init__()
178
+ self.proj = nn.Linear(input_size, output_size, bias=False)
179
+
180
+ def forward(self, x):
181
+ return self.proj(x)
182
+
183
+
184
+ class ModernVBertConnector(nn.Module):
185
+ """
186
+ Connector module for ModernVBERT. It performs a pixel shuffle operation followed by a linear projection to match the text model's hidden size.
187
+ Based on https://pytorch.org/docs/stable/generated/torch.nn.PixelShuffle.html
188
+ """
189
+
190
+ def __init__(self, config):
191
+ super().__init__()
192
+ self.pixel_shuffle_factor = config.pixel_shuffle_factor
193
+ self.modality_projection = ModernVBertSimpleMLP(
194
+ input_size=config.vision_config.hidden_size * (config.pixel_shuffle_factor**2),
195
+ output_size=config.text_config.hidden_size,
196
+ )
197
+
198
+ def pixel_shuffle(self, x, pixel_shuffle_factor):
199
+ bsz, seq, embed_dim = x.size()
200
+ height = width = int(seq**0.5)
201
+ x = x.view(bsz, height, width, embed_dim)
202
+ x = x.view(bsz, height, int(width / pixel_shuffle_factor), embed_dim * pixel_shuffle_factor)
203
+ x = x.permute(0, 2, 1, 3)
204
+ x = x.reshape(
205
+ bsz,
206
+ int(width / pixel_shuffle_factor),
207
+ int(height / pixel_shuffle_factor),
208
+ embed_dim * (pixel_shuffle_factor**2),
209
+ )
210
+ x = x.permute(0, 2, 1, 3)
211
+ return x.reshape(bsz, int(seq / (pixel_shuffle_factor**2)), embed_dim * (pixel_shuffle_factor**2))
212
+
213
+ def forward(self, image_hidden_states):
214
+ image_hidden_states = self.pixel_shuffle(image_hidden_states, self.pixel_shuffle_factor)
215
+ return self.modality_projection(image_hidden_states)
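The sequence of `view`, `permute`, and `reshape` calls in `pixel_shuffle` is easier to follow as a list of tensor shapes. The sketch below only tracks shapes and moves no data. `pixel_shuffle_shapes` is a hypothetical helper, and the input sizes assume the values from this commit's `config.json` (1024 patches of width 768, factor 4).

```python
import math


def pixel_shuffle_shapes(bsz, seq, embed_dim, factor):
    """Return the tensor shape after each step of the connector's pixel shuffle."""
    side = math.isqrt(seq)
    if side * side != seq:
        raise ValueError("seq must be a perfect square (square patch grid)")
    if side % factor != 0:
        raise ValueError("grid side must be divisible by the shuffle factor")
    small = side // factor
    return [
        (bsz, seq, embed_dim),                   # input: flat patch sequence
        (bsz, side, side, embed_dim),            # view as a 2D grid
        (bsz, side, small, embed_dim * factor),  # fold `factor` columns into channels
        (bsz, small, side, embed_dim * factor),  # permute(0, 2, 1, 3)
        (bsz, small, small, embed_dim * factor**2),  # fold `factor` rows into channels
        (bsz, small, small, embed_dim * factor**2),  # permute(0, 2, 1, 3) back
        (bsz, seq // factor**2, embed_dim * factor**2),  # flatten the grid again
    ]


shapes = pixel_shuffle_shapes(bsz=1, seq=1024, embed_dim=768, factor=4)
for shape in shapes:
    print(shape)
```

The last shape explains the projection layer's input size: `vision_config.hidden_size * pixel_shuffle_factor**2`, which the linear layer then maps to the text hidden size.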
216
+
217
+
218
+ class ModernVBertPreTrainedModel(PreTrainedModel):
219
+ config_class = ModernVBertConfig
220
+ base_model_prefix = "model"
221
+ supports_gradient_checkpointing = True
222
+ _supports_flash_attn_2 = True
223
+ _supports_sdpa = True
224
+
225
+ def _init_weights(self, module):
226
+ std = getattr(self.config, "initializer_range", 0.02)
227
+ if isinstance(module, (nn.Linear, nn.Conv2d)):
228
+ module.weight.data.normal_(mean=0.0, std=std)
229
+ if module.bias is not None:
230
+ module.bias.data.zero_()
231
+ elif isinstance(module, nn.Embedding):
232
+ module.weight.data.normal_(mean=0.0, std=std)
233
+ if module.padding_idx is not None:
234
+ module.weight.data[module.padding_idx].zero_()
235
+
236
+
237
+ @auto_docstring
238
+ class ModernVBertModel(ModernVBertPreTrainedModel):
239
+ def __init__(self, config: ModernVBertConfig):
240
+ super().__init__(config)
241
+
242
+ # init components
243
+ self.vision_model = ModernVBertModel.init_vision_model(config)
244
+ self.connector = ModernVBertConnector(config)
245
+ self.text_model = ModernVBertModel.init_language_model(config)
246
+
247
+ # set the correct dtype for vision and text models
248
+ self.vision_model.to(self.dtype)
249
+ self.text_model.to(self.dtype)
250
+ self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
251
+
252
+ self.image_seq_len = int(
253
+ ((config.vision_config.image_size // config.vision_config.patch_size) ** 2)
254
+ / (config.pixel_shuffle_factor**2)
255
+ )
256
+
257
+ self.post_init()
258
+
259
+ @staticmethod
260
+ def init_vision_model(config: ModernVBertConfig):
261
+ vision_model_config = SiglipVisionConfig.from_pretrained(
262
+ config.vision_config.vision_model_name,
263
+ _attn_implementation=config._attn_implementation,
264
+ )
265
+ vision_model = SiglipVisionModel(vision_model_config).vision_model
266
+ return vision_model
267
+
268
+ @staticmethod
269
+ def init_language_model(config: ModernVBertConfig):
270
+ text_model_config = ModernBertConfig.from_pretrained(
271
+ config.text_config.text_model_name,
272
+ _attn_implementation=config._attn_implementation,
273
+ )
274
+ text_model = ModernBertModel(text_model_config)
275
+ embed_layer = DecoupledEmbedding(
276
+ num_embeddings=text_model_config.vocab_size,
277
+ num_additional_embeddings=config.additional_vocab_size,
278
+ embedding_dim=config.hidden_size,
279
+ partially_freeze=getattr(config, "freeze_config", {"freeze_text_layers": False})["freeze_text_layers"],
280
+ padding_idx=config.pad_token_id,
281
+ )
282
+ text_model.set_input_embeddings(embed_layer)
283
+ return text_model
284
+
285
+ # Copied from transformers.models.idefics2.modeling_idefics2.Idefics2Model.enable_input_require_grads
286
+ def enable_input_require_grads(self):
287
+ """
288
+ Enables the gradients for the input embeddings.
289
+
290
+ This is useful for LoRA when using gradient checkpointing.
291
+ cf. https://github.com/huggingface/peft/issues/1402#issuecomment-1913675032
292
+
293
+ Override to set output.requires_grad = True for both the text model's and vision model's embeddings.
294
+ """
295
+
296
+ def get_lowest_module(module):
297
+ if len(list(module.children())) == 0:
298
+ # If the module has no children, it is a leaf module (e.g., Linear, Conv2d, etc.)
299
+ return module
300
+ else:
301
+ # Recursively call the function on each child module
302
+ return get_lowest_module(list(module.children())[0])
303
+
304
+ def make_inputs_require_grads(module, input, output):
305
+ output.requires_grad_(True)
306
+
307
+ self._text_require_grads_hook = self.get_input_embeddings().register_forward_hook(make_inputs_require_grads)
308
+ self._vision_require_grads_hook = get_lowest_module(self.vision_model).register_forward_hook(
309
+ make_inputs_require_grads
310
+ )
311
+
312
+ # Copied from transformers.models.idefics2.modeling_idefics2.Idefics2Model.disable_input_require_grads
313
+ def disable_input_require_grads(self):
314
+ self._text_require_grads_hook.remove()
315
+ self._vision_require_grads_hook.remove()
316
+
317
+ def get_input_embeddings(self):
318
+ return self.text_model.get_input_embeddings()
319
+
320
+ def set_input_embeddings(self, value):
321
+ self.text_model.set_input_embeddings(value)
322
+
323
+ def get_image_features(
324
+ self, pixel_values: torch.FloatTensor, pixel_attention_mask: Optional[torch.LongTensor] = None
325
+ ):
326
+ """
327
+ Derived from: https://github.com/huggingface/transformers/blob/main/src/transformers/models/smolvlm/modeling_smolvlm.py
328
+ Encodes images into continuous embeddings that can be forwarded to the language model.
329
+
330
+ Args:
331
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_images, num_channels, height, width)`):
332
+ The tensors corresponding to the input images.
333
+ pixel_attention_mask (`torch.LongTensor`, *optional*):
334
+ The attention mask indicating padded regions in the image.
335
+ """
336
+ batch_size, num_images, num_channels, height, width = pixel_values.shape
337
+ pixel_values = pixel_values.to(dtype=self.dtype) # fp16 compatibility
338
+ pixel_values = pixel_values.view(batch_size * num_images, *pixel_values.shape[2:])
339
+
340
+ # Remove padding images; a padding image is all zeros.
341
+ nb_values_per_image = pixel_values.shape[1:].numel()
342
+ real_images_inds = (pixel_values == 0.0).sum(dim=(-1, -2, -3)) != nb_values_per_image
343
+
344
+ if not any(real_images_inds):
345
+ real_images_inds[0] = True
346
+
347
+ pixel_values = pixel_values[real_images_inds].contiguous()
348
+ # Handle the vision attention mask
349
+ if pixel_attention_mask is None:
350
+ pixel_attention_mask = torch.ones(
351
+ size=[pixel_values.shape[i] for i in (0, 2, 3)],
352
+ dtype=torch.bool,
353
+ device=pixel_values.device,
354
+ )
355
+ else:
356
+ # Remove padding images from the mask
357
+ pixel_attention_mask = pixel_attention_mask.view(batch_size * num_images, *pixel_attention_mask.shape[2:])
358
+ pixel_attention_mask = pixel_attention_mask[real_images_inds].contiguous()
359
+
360
+ patch_size = self.config.vision_config.patch_size
361
+ patches_subgrid = pixel_attention_mask.unfold(dimension=1, size=patch_size, step=patch_size)
362
+ patches_subgrid = patches_subgrid.unfold(dimension=2, size=patch_size, step=patch_size)
363
+ patch_attention_mask = (patches_subgrid.sum(dim=(-1, -2)) > 0).bool()
364
+
365
+ # Get sequence from the vision encoder
366
+ image_hidden_states = self.vision_model(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
367
+ image_hidden_states = image_hidden_states.last_hidden_state
368
+
369
+ return image_hidden_states
370
+
371
+ def inputs_merger(self, input_ids, inputs_embeds, image_hidden_states):
372
+ """Adapted from https://github.com/huggingface/transformers/blob/main/src/transformers/models/smolvlm/modeling_smolvlm.py
373
+
374
+ This method aims at merging the token embeddings with the image hidden states into one single sequence of vectors that are fed to the transformer LM.
375
+ The merging happens as follows:
376
+ - The text token sequence is: `tok_1 tok_2 tok_3 <fake_token_around_image> <image> <image> ... <image> <fake_token_around_image> tok_4`.
377
+ - We get the image hidden states for the image through the vision encoder and that hidden state, after a pixel shuffle operation, is then projected into the text embedding space.
378
+ We thus have a sequence of image hidden states of size (1, image_seq_len, hidden_dim), where 1 is for batch_size of 1 image and hidden_dim is the hidden_dim of the LM transformer.
379
+ - The merging happens so that we obtain the following sequence: `vector_tok_1 vector_tok_2 vector_tok_3 vector_fake_tok_around_image {sequence of image_seq_len image hidden states} vector_fake_tok_around_image vector_tok_4`. That sequence is fed to the LM.
380
+ - To fit the format of that sequence, the `<image>` positions of `inputs_embeds` are overwritten with the image hidden states; `input_ids` and `attention_mask` are left unchanged.
381
+ """
382
+
383
+ _, patch_size, _ = image_hidden_states.shape
384
+
385
+ if input_ids is None:
386
+ image_mask = inputs_embeds == self.get_input_embeddings()(
387
+ torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
388
+ )
389
+ image_mask = image_mask[..., 0] # slice off the hidden dim
390
+ else:
391
+ image_mask = input_ids == self.config.image_token_id
392
+
393
+ # Assert that the input <image> tokens are valid (i.e. multiple of patch_size)
394
+ num_image_tokens = image_mask.sum(dim=1)
395
+ if not torch.all(num_image_tokens % patch_size == 0):
396
+ raise ValueError("Number of <image> tokens not divisible by patch_size.")
397
+
398
+ blocks_per_sample = num_image_tokens // patch_size
399
+
400
+ offsets = torch.nn.functional.pad(blocks_per_sample.cumsum(dim=0), (1, 0), value=0)
401
+ block_offset = offsets[:-1]
402
+ row_cum = image_mask.cumsum(dim=-1)
403
+ chunk_idx = (row_cum - 1) // patch_size
404
+ local_idx = (row_cum - 1) % patch_size
405
+ block_idx = block_offset.unsqueeze(1) + chunk_idx
406
+
407
+ image_embeds = torch.zeros_like(inputs_embeds)
408
+ image_embeds[image_mask] = image_hidden_states[block_idx[image_mask], local_idx[image_mask], :]
409
+
410
+ return torch.where(image_mask.unsqueeze(-1), image_embeds, inputs_embeds)
411
+
412
+ @can_return_tuple
413
+ @auto_docstring(
414
+ custom_intro="""
415
+ Inputs fed to the model can have an arbitrary number of images. To account for this, pixel_values fed to
416
+ the model have image padding -> (batch_size, max_num_images, 3, max_heights, max_widths) where
417
+ max_num_images is the maximum number of images among the batch_size samples in the batch.
418
+ Padding images are not needed beyond padding the pixel_values at the entrance of the model.
419
+ For efficiency, we only pass through the vision_model's forward the real images by
420
+ discarding the padding images i.e. pixel_values of size (image_batch_size, 3, height, width) where
421
+ image_batch_size would be 7 when num_images_per_sample=[1, 3, 1, 2] and max_num_images would be 3.
422
+ """,
423
+ checkpoint="modernvbert/ModernVBert",
424
+ )
425
+ def forward(
426
+ self,
427
+ input_ids: torch.LongTensor = None,
428
+ attention_mask: Optional[torch.Tensor] = None,
429
+ position_ids: Optional[torch.LongTensor] = None,
430
+ inputs_embeds: Optional[torch.FloatTensor] = None,
431
+ pixel_values: Optional[torch.FloatTensor] = None,
432
+ pixel_attention_mask: Optional[torch.BoolTensor] = None,
433
+ image_hidden_states: Optional[torch.FloatTensor] = None,
434
+ output_attentions: Optional[bool] = None,
435
+ output_hidden_states: Optional[bool] = None,
436
+ return_dict: Optional[bool] = None,
437
+ **kwargs: Unpack[FlashAttentionKwargs],
438
+ ) -> Union[tuple, BaseModelOutputWithPoolingAndCrossAttentions]:
439
+ r"""
440
+ pixel_attention_mask (`torch.Tensor` of shape `(batch_size, image_size, image_size)`, *optional*):
441
+ Mask to avoid performing attention on padding pixel indices.
442
+ image_hidden_states (`torch.FloatTensor` of shape `(num_images, image_seq_len, hidden_size)`, *optional*):
443
+ The hidden states of the image encoder after modality projection.
444
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
445
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
446
+ config.vocab_size]` or `model.image_token_id`. Tokens with indices set to `model.image_token_id` are
447
+ ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
448
+ """
449
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
450
+ output_hidden_states = (
451
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
452
+ )
453
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
454
+
455
+ if inputs_embeds is None:
456
+ inputs_embeds = self.text_model.get_input_embeddings()(input_ids).to(input_ids.device)
457
+
458
+ # Images processing
459
+ if pixel_values is not None:
460
+ # Vision encoder pass
461
+ image_hidden_states = self.get_image_features(
462
+ pixel_values=pixel_values, pixel_attention_mask=pixel_attention_mask
463
+ )
464
+ # Modality projection & resampling
465
+ image_hidden_states = self.connector(image_hidden_states)
466
+
467
+ # Merge image and text embeddings
468
+ if image_hidden_states is not None:
469
+ image_hidden_states = image_hidden_states.to(dtype=self.dtype, device=inputs_embeds.device)
470
+ inputs_embeds = self.inputs_merger(
471
+ input_ids=input_ids, inputs_embeds=inputs_embeds, image_hidden_states=image_hidden_states
472
+ )
473
+
474
+ # Language model pass
475
+ outputs = self.text_model(
476
+ inputs_embeds=inputs_embeds,
477
+ attention_mask=attention_mask,
478
+ position_ids=position_ids,
479
+ output_attentions=output_attentions,
480
+ output_hidden_states=output_hidden_states,
481
+ return_dict=return_dict,
482
+ **kwargs,
483
+ )
484
+
485
+ return ModernVBertBaseModelOutput(
486
+ last_hidden_state=outputs.last_hidden_state,
487
+ hidden_states=outputs.hidden_states,
488
+ attentions=outputs.attentions,
489
+ image_hidden_states=image_hidden_states,
490
+ )
491
+
492
+
493
+ class ModernVBertLMHead(nn.Module):
494
+ def __init__(self, config):
495
+ super().__init__()
496
+ pretrained_config = ModernBertConfig.from_pretrained(config.text_config.text_model_name)
497
+ pretrained_model = ModernBertForMaskedLM(pretrained_config)
498
+ self.head = pretrained_model.head
499
+ self.decoder = pretrained_model.decoder
500
+
501
+ def forward(self, hidden_states):
502
+ return self.decoder(self.head(hidden_states))
503
+
504
+
505
+ @auto_docstring
506
+ class ModernVBertForMaskedLM(ModernVBertPreTrainedModel):
507
+ _tied_weights_keys = ["lm_head.decoder.weight", "model.text_model.embeddings.word_embeddings.weight"]
508
+
509
+ def __init__(self, config):
510
+ super().__init__(config)
511
+ self.in_features = config.hidden_size
512
+ self.out_additional_features = config.additional_vocab_size
513
+ self.vocab_size = config.vocab_size
514
+ self.model = ModernVBertModel(config)
515
+ self.lm_head = ModernVBertLMHead(config)
516
+ if self.out_additional_features > 0:
517
+ self.additional_fc = nn.Linear(self.in_features, self.out_additional_features, bias=False)
518
+ self.lm_head.to(self.dtype)
519
+ self.post_init()
520
+
521
+ # Copied from transformers.models.idefics2.modeling_idefics2.Idefics2ForConditionalGeneration.disable_input_require_grads
522
+ def disable_input_require_grads(self):
523
+ self._text_require_grads_hook.remove()
524
+ self._vision_require_grads_hook.remove()
525
+
526
+ @can_return_tuple
527
+ @auto_docstring(
528
+ custom_intro="""
529
+ Inputs fed to the model can have an arbitrary number of images. To account for this, pixel_values fed to
530
+ the model have image padding -> (batch_size, max_num_images, 3, max_heights, max_widths) where
531
+ max_num_images is the maximum number of images among the batch_size samples in the batch.
532
+ Padding images are not needed beyond padding the pixel_values at the entrance of the model.
533
+ For efficiency, we only pass through the vision_model's forward the real images by
534
+ discarding the padding images i.e. pixel_values of size (image_batch_size, 3, height, width) where
535
+ image_batch_size would be 7 when num_images_per_sample=[1, 3, 1, 2] and max_num_images would be 3.
536
+ """,
537
+ checkpoint="modernvbert/ModernVBert",
538
+ )
539
+ def forward(
540
+ self,
541
+ input_ids: torch.LongTensor = None,
542
+ attention_mask: Optional[torch.Tensor] = None,
543
+ position_ids: Optional[torch.LongTensor] = None,
544
+ inputs_embeds: Optional[torch.FloatTensor] = None,
545
+ pixel_values: Optional[torch.FloatTensor] = None,
546
+ pixel_attention_mask: Optional[torch.BoolTensor] = None,
547
+ image_hidden_states: Optional[torch.FloatTensor] = None,
548
+ output_attentions: Optional[bool] = None,
549
+ output_hidden_states: Optional[bool] = None,
550
+ return_dict: Optional[bool] = None,
551
+ labels: Optional[torch.LongTensor] = None,
552
+ **kwargs: Unpack[FlashAttentionKwargs],
553
+ ) -> Union[tuple, ModernVBertMaskedLMOutput]:
554
+ r"""
555
+ pixel_attention_mask (`torch.Tensor` of shape `(batch_size, image_size, image_size)`, *optional*):
556
+ Mask to avoid performing attention on padding pixel indices.
557
+ image_hidden_states (`torch.FloatTensor` of shape `(num_images, image_seq_len, hidden_size)`, *optional*):
558
+ The hidden states of the image encoder after modality projection.
559
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
560
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
561
+ config.vocab_size]` or `model.image_token_id`. Tokens with indices set to `model.image_token_id` are
562
+ ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
563
+ """
564
+
565
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
566
+ output_hidden_states = (
567
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
568
+ )
569
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
570
+
571
+ outputs = self.model(
572
+ input_ids=input_ids,
573
+ attention_mask=attention_mask,
574
+ position_ids=position_ids,
575
+ inputs_embeds=inputs_embeds,
576
+ pixel_values=pixel_values,
577
+ pixel_attention_mask=pixel_attention_mask,
578
+ image_hidden_states=image_hidden_states,
579
+ output_attentions=output_attentions,
580
+ output_hidden_states=output_hidden_states,
581
+ return_dict=return_dict,
582
+ **kwargs,
583
+ )
584
+ hidden_states = outputs[0]
585
+
586
+ logits = self.lm_head(hidden_states)
587
+
588
+ if self.out_additional_features > 0:
589
+ proj_states = self.lm_head.head(hidden_states)
590
+ additional_features = self.additional_fc(proj_states)
591
+ logits = torch.cat((logits, additional_features), -1)
592
+
593
+ loss = None
594
+ if labels is not None:
595
+ loss = CrossEntropyLoss()(logits.view(-1, self.vocab_size + self.out_additional_features), labels.view(-1))
596
+
597
+ if not return_dict:
598
+ output = (logits,) + outputs[2:]
599
+ return ((loss,) + output) if loss is not None else output
600
+
601
+ return ModernVBertMaskedLMOutput(
602
+ loss=loss,
603
+ logits=logits.float(),
604
+ hidden_states=outputs.hidden_states,
605
+ attentions=outputs.attentions,
606
+ image_hidden_states=outputs.image_hidden_states,
607
+ )
608
+
609
+
610
+ __all__ = ["ModernVBertPreTrainedModel", "ModernVBertModel", "ModernVBertForMaskedLM"]
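The index arithmetic in `inputs_merger` above can be hard to follow in tensor form. The following is a minimal pure-Python sketch of the same bookkeeping, assuming boolean masks given as nested lists; `merge_indices` and `seq_len_per_image` are illustrative names that are not part of the commit (the method above calls the second quantity `patch_size`).

```python
def merge_indices(image_mask, seq_len_per_image):
    """Map every <image> position to (block_idx, local_idx).

    image_mask: list of rows of booleans, True where the token is <image>.
    seq_len_per_image: hidden states produced per image (hypothetical name).
    block_idx selects an image in the stacked hidden states,
    local_idx selects a position inside that image.
    """
    counts = [sum(row) for row in image_mask]
    if any(c % seq_len_per_image for c in counts):
        raise ValueError("Number of <image> tokens not divisible by the per-image length.")

    # Exclusive prefix sum: index of the first image block owned by each sample.
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c // seq_len_per_image

    result = []
    for row, offset in zip(image_mask, offsets):
        seen = 0  # number of <image> tokens already visited in this row
        row_out = []
        for is_image in row:
            if is_image:
                row_out.append((offset + seen // seq_len_per_image, seen % seq_len_per_image))
                seen += 1
            else:
                row_out.append(None)  # text token: embedding is kept as-is
        result.append(row_out)
    return result


mask = [[False, True, True, False], [True, True, True, True]]
indices = merge_indices(mask, seq_len_per_image=2)
```

Here the first sample owns image block 0 and the second sample owns blocks 1 and 2, which mirrors how `block_offset` shifts `chunk_idx` in the tensor version.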
preprocessor_config.json ADDED
@@ -0,0 +1,28 @@
1
+ {
2
+ "do_convert_rgb": true,
3
+ "do_image_splitting": true,
4
+ "do_normalize": true,
5
+ "do_pad": true,
6
+ "do_rescale": true,
7
+ "do_resize": true,
8
+ "image_mean": [
9
+ 0.5,
10
+ 0.5,
11
+ 0.5
12
+ ],
13
+ "image_processor_type": "Idefics3ImageProcessor",
14
+ "image_std": [
15
+ 0.5,
16
+ 0.5,
17
+ 0.5
18
+ ],
19
+ "max_image_size": {
20
+ "longest_edge": 512
21
+ },
22
+ "processor_class": "Idefics3Processor",
23
+ "resample": 1,
24
+ "rescale_factor": 0.00392156862745098,
25
+ "size": {
26
+ "longest_edge": 2048
27
+ }
28
+ }
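As a rough illustration of what `do_rescale` and `do_normalize` imply for pixel values, the sketch below applies the `rescale_factor`, `image_mean` and `image_std` from the config above to a single channel value. It assumes the usual order of rescaling first and normalizing second; `preprocess_value` is a hypothetical helper, not a function of the image processor.

```python
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255, from the config above
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def preprocess_value(pixel):
    """Rescale an 8-bit channel value to [0, 1], then normalize it."""
    rescaled = pixel * RESCALE_FACTOR
    return (rescaled - IMAGE_MEAN) / IMAGE_STD


low = preprocess_value(0)
high = preprocess_value(255)
```

Under these assumptions, 8-bit inputs in [0, 255] end up in approximately [-1, 1].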
processor_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "image_seq_len": 64,
3
+ "processor_class": "Idefics3Processor"
4
+ }
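The value `image_seq_len: 64` can be related to the formula used in `ModernVBertModel.__init__`. The sketch below reproduces that formula; the concrete values 512, 16 and 4 are assumptions chosen because they are consistent with 64, and they are not confirmed by the files shown here.

```python
def image_seq_len(image_size, patch_size, pixel_shuffle_factor):
    """Number of image tokens per image after pixel shuffle."""
    patches_per_side = image_size // patch_size
    return int(patches_per_side**2 / pixel_shuffle_factor**2)


# Assumed values: 512-pixel images, 16-pixel patches, shuffle factor 4.
seq_len = image_seq_len(image_size=512, patch_size=16, pixel_shuffle_factor=4)
```

With these assumed values, 32 x 32 = 1024 patches are reduced by a factor of 16, which gives 64 tokens per image.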
special_tokens_map.json ADDED
@@ -0,0 +1,79 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<global-img>",
4
+ "<row_1_col_1>",
5
+ "<row_1_col_2>",
6
+ "<row_1_col_3>",
7
+ "<row_1_col_4>",
8
+ "<row_1_col_5>",
9
+ "<row_1_col_6>",
10
+ "<row_2_col_1>",
11
+ "<row_2_col_2>",
12
+ "<row_2_col_3>",
13
+ "<row_2_col_4>",
14
+ "<row_2_col_5>",
15
+ "<row_2_col_6>",
16
+ "<row_3_col_1>",
17
+ "<row_3_col_2>",
18
+ "<row_3_col_3>",
19
+ "<row_3_col_4>",
20
+ "<row_3_col_5>",
21
+ "<row_3_col_6>",
22
+ "<row_4_col_1>",
23
+ "<row_4_col_2>",
24
+ "<row_4_col_3>",
25
+ "<row_4_col_4>",
26
+ "<row_4_col_5>",
27
+ "<row_4_col_6>",
28
+ "<row_5_col_1>",
29
+ "<row_5_col_2>",
30
+ "<row_5_col_3>",
31
+ "<row_5_col_4>",
32
+ "<row_5_col_5>",
33
+ "<row_5_col_6>",
34
+ "<row_6_col_1>",
35
+ "<row_6_col_2>",
36
+ "<row_6_col_3>",
37
+ "<row_6_col_4>",
38
+ "<row_6_col_5>",
39
+ "<row_6_col_6>",
40
+ "<end_of_utterance>",
41
+ "<fake_token_around_image>",
42
+ "<image>"
43
+ ],
44
+ "cls_token": {
45
+ "content": "[CLS]",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ },
51
+ "mask_token": {
52
+ "content": "[MASK]",
53
+ "lstrip": true,
54
+ "normalized": false,
55
+ "rstrip": false,
56
+ "single_word": false
57
+ },
58
+ "pad_token": {
59
+ "content": "[PAD]",
60
+ "lstrip": false,
61
+ "normalized": false,
62
+ "rstrip": false,
63
+ "single_word": false
64
+ },
65
+ "sep_token": {
66
+ "content": "[SEP]",
67
+ "lstrip": false,
68
+ "normalized": false,
69
+ "rstrip": false,
70
+ "single_word": false
71
+ },
72
+ "unk_token": {
73
+ "content": "[UNK]",
74
+ "lstrip": false,
75
+ "normalized": false,
76
+ "rstrip": false,
77
+ "single_word": false
78
+ }
79
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,1310 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "|||IP_ADDRESS|||",
5
+ "lstrip": false,
6
+ "normalized": true,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": false
10
+ },
11
+ "1": {
12
+ "content": "<|padding|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "50254": {
20
+ "content": " ",
21
+ "lstrip": false,
22
+ "normalized": true,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": false
26
+ },
27
+ "50255": {
28
+ "content": " ",
29
+ "lstrip": false,
30
+ "normalized": true,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": false
34
+ },
35
+ "50256": {
36
+ "content": " ",
37
+ "lstrip": false,
38
+ "normalized": true,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": false
42
+ },
43
+ "50257": {
44
+ "content": " ",
45
+ "lstrip": false,
46
+ "normalized": true,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": false
50
+ },
51
+ "50258": {
52
+ "content": " ",
53
+ "lstrip": false,
54
+ "normalized": true,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": false
58
+ },
59
+ "50259": {
60
+ "content": " ",
61
+ "lstrip": false,
62
+ "normalized": true,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": false
66
+ },
67
+ "50260": {
68
+ "content": " ",
69
+ "lstrip": false,
70
+ "normalized": true,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": false
74
+ },
75
+ "50261": {
76
+ "content": " ",
77
+ "lstrip": false,
78
+ "normalized": true,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": false
82
+ },
83
+ "50262": {
84
+ "content": " ",
85
+ "lstrip": false,
86
+ "normalized": true,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": false
90
+ },
91
+ "50263": {
92
+ "content": " ",
93
+ "lstrip": false,
94
+ "normalized": true,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": false
98
+ },
99
+ "50264": {
100
+ "content": " ",
101
+ "lstrip": false,
102
+ "normalized": true,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": false
106
+ },
107
+ "50265": {
108
+ "content": " ",
109
+ "lstrip": false,
110
+ "normalized": true,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": false
114
+ },
115
+ "50266": {
116
+ "content": " ",
117
+ "lstrip": false,
118
+ "normalized": true,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": false
122
+ },
123
+ "50267": {
124
+ "content": " ",
125
+ "lstrip": false,
126
+ "normalized": true,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": false
130
+ },
131
+ "50268": {
132
+ "content": " ",
133
+ "lstrip": false,
134
+ "normalized": true,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": false
138
+ },
139
+ "50269": {
140
+ "content": " ",
141
+ "lstrip": false,
142
+ "normalized": true,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": false
146
+ },
147
+ "50270": {
148
+ "content": " ",
149
+ "lstrip": false,
150
+ "normalized": true,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": false
154
+ },
155
+ "50271": {
156
+ "content": " ",
157
+ "lstrip": false,
158
+ "normalized": true,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": false
162
+ },
163
+ "50272": {
164
+ "content": " ",
165
+ "lstrip": false,
166
+ "normalized": true,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": false
170
+ },
171
+ "50273": {
172
+ "content": " ",
173
+ "lstrip": false,
174
+ "normalized": true,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": false
178
+ },
179
+ "50274": {
180
+ "content": " ",
181
+ "lstrip": false,
182
+ "normalized": true,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": false
186
+ },
187
+ "50275": {
188
+ "content": " ",
189
+ "lstrip": false,
190
+ "normalized": true,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": false
194
+ },
195
+ "50276": {
196
+ "content": " ",
197
+ "lstrip": false,
198
+ "normalized": true,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": false
202
+ },
203
+ "50277": {
204
+ "content": "|||EMAIL_ADDRESS|||",
205
+ "lstrip": false,
206
+ "normalized": true,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": false
210
+ },
211
+ "50278": {
212
+ "content": "|||PHONE_NUMBER|||",
213
+ "lstrip": false,
214
+ "normalized": true,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": false
218
+ },
219
+ "50279": {
220
+ "content": "<|endoftext|>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "50280": {
228
+ "content": "[UNK]",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "50281": {
236
+ "content": "[CLS]",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "50282": {
244
+ "content": "[SEP]",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "50283": {
252
+ "content": "[PAD]",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "50284": {
260
+ "content": "[MASK]",
261
+ "lstrip": true,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "50285": {
268
+ "content": "[unused0]",
269
+ "lstrip": false,
270
+ "normalized": true,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": false
274
+ },
275
+ "50286": {
276
+ "content": "[unused1]",
277
+ "lstrip": false,
278
+ "normalized": true,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": false
282
+ },
283
+ "50287": {
284
+ "content": "[unused2]",
285
+ "lstrip": false,
286
+ "normalized": true,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": false
290
+ },
291
+ "50288": {
292
+ "content": "[unused3]",
293
+ "lstrip": false,
294
+ "normalized": true,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": false
298
+ },
299
+ "50289": {
300
+ "content": "[unused4]",
301
+ "lstrip": false,
302
+ "normalized": true,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": false
306
+ },
307
+ "50290": {
308
+ "content": "[unused5]",
309
+ "lstrip": false,
310
+ "normalized": true,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": false
314
+ },
315
+ "50291": {
316
+ "content": "[unused6]",
317
+ "lstrip": false,
318
+ "normalized": true,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": false
322
+ },
323
+ "50292": {
324
+ "content": "[unused7]",
325
+ "lstrip": false,
326
+ "normalized": true,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": false
330
+ },
331
+ "50293": {
332
+ "content": "[unused8]",
333
+ "lstrip": false,
334
+ "normalized": true,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": false
338
+ },
339
+ "50294": {
340
+ "content": "[unused9]",
341
+ "lstrip": false,
342
+ "normalized": true,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": false
346
+ },
347
+ "50295": {
348
+ "content": "[unused10]",
349
+ "lstrip": false,
350
+ "normalized": true,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": false
354
+ },
355
+ "50296": {
356
+ "content": "[unused11]",
357
+ "lstrip": false,
358
+ "normalized": true,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": false
362
+ },
363
+ "50297": {
364
+ "content": "[unused12]",
365
+ "lstrip": false,
366
+ "normalized": true,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": false
370
+ },
371
+ "50298": {
372
+ "content": "[unused13]",
373
+ "lstrip": false,
374
+ "normalized": true,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": false
378
+ },
379
+ "50299": {
380
+ "content": "[unused14]",
381
+ "lstrip": false,
382
+ "normalized": true,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": false
386
+ },
387
+ "50300": {
388
+ "content": "[unused15]",
389
+ "lstrip": false,
390
+ "normalized": true,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": false
394
+ },
395
+ "50301": {
396
+ "content": "[unused16]",
397
+ "lstrip": false,
398
+ "normalized": true,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": false
402
+ },
403
+ "50302": {
404
+ "content": "[unused17]",
405
+ "lstrip": false,
406
+ "normalized": true,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": false
410
+ },
411
+ "50303": {
412
+ "content": "[unused18]",
413
+ "lstrip": false,
414
+ "normalized": true,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": false
418
+ },
419
+ "50304": {
420
+ "content": "[unused19]",
421
+ "lstrip": false,
422
+ "normalized": true,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": false
426
+ },
427
+ "50305": {
428
+ "content": "[unused20]",
429
+ "lstrip": false,
430
+ "normalized": true,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": false
434
+ },
435
+ "50306": {
436
+ "content": "[unused21]",
437
+ "lstrip": false,
438
+ "normalized": true,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": false
442
+ },
443
+ "50307": {
444
+ "content": "[unused22]",
445
+ "lstrip": false,
446
+ "normalized": true,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": false
450
+ },
451
+ "50308": {
452
+ "content": "[unused23]",
453
+ "lstrip": false,
454
+ "normalized": true,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": false
458
+ },
459
+ "50309": {
460
+ "content": "[unused24]",
461
+ "lstrip": false,
462
+ "normalized": true,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": false
466
+ },
467
+ "50310": {
468
+ "content": "[unused25]",
469
+ "lstrip": false,
470
+ "normalized": true,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": false
474
+ },
475
+ "50311": {
476
+ "content": "[unused26]",
477
+ "lstrip": false,
478
+ "normalized": true,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": false
482
+ },
483
+ "50312": {
484
+ "content": "[unused27]",
485
+ "lstrip": false,
486
+ "normalized": true,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": false
490
+ },
491
+ "50313": {
492
+ "content": "[unused28]",
493
+ "lstrip": false,
494
+ "normalized": true,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": false
498
+ },
499
+ "50314": {
500
+ "content": "[unused29]",
501
+ "lstrip": false,
502
+ "normalized": true,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": false
506
+ },
507
+ "50315": {
508
+ "content": "[unused30]",
509
+ "lstrip": false,
510
+ "normalized": true,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": false
514
+ },
515
+ "50316": {
516
+ "content": "[unused31]",
517
+ "lstrip": false,
518
+ "normalized": true,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": false
522
+ },
523
+ "50317": {
524
+ "content": "[unused32]",
525
+ "lstrip": false,
526
+ "normalized": true,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": false
530
+ },
531
+ "50318": {
532
+ "content": "[unused33]",
533
+ "lstrip": false,
534
+ "normalized": true,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": false
538
+ },
539
+ "50319": {
540
+ "content": "[unused34]",
541
+ "lstrip": false,
542
+ "normalized": true,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": false
546
+ },
547
+ "50320": {
548
+ "content": "[unused35]",
549
+ "lstrip": false,
550
+ "normalized": true,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": false
554
+ },
555
+ "50321": {
556
+ "content": "[unused36]",
557
+ "lstrip": false,
558
+ "normalized": true,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": false
562
+ },
563
+ "50322": {
564
+ "content": "[unused37]",
565
+ "lstrip": false,
566
+ "normalized": true,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": false
570
+ },
571
+ "50323": {
572
+ "content": "[unused38]",
573
+ "lstrip": false,
574
+ "normalized": true,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": false
578
+ },
579
+ "50324": {
580
+ "content": "[unused39]",
581
+ "lstrip": false,
582
+ "normalized": true,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": false
586
+ },
587
+ "50325": {
588
+ "content": "[unused40]",
589
+ "lstrip": false,
590
+ "normalized": true,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": false
594
+ },
595
+ "50326": {
596
+ "content": "[unused41]",
597
+ "lstrip": false,
598
+ "normalized": true,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": false
602
+ },
603
+ "50327": {
604
+ "content": "[unused42]",
605
+ "lstrip": false,
606
+ "normalized": true,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": false
610
+ },
611
+ "50328": {
612
+ "content": "[unused43]",
613
+ "lstrip": false,
614
+ "normalized": true,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": false
618
+ },
619
+ "50329": {
620
+ "content": "[unused44]",
621
+ "lstrip": false,
622
+ "normalized": true,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": false
626
+ },
627
+ "50330": {
628
+ "content": "[unused45]",
629
+ "lstrip": false,
630
+ "normalized": true,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": false
634
+ },
635
+ "50331": {
636
+ "content": "[unused46]",
637
+ "lstrip": false,
638
+ "normalized": true,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": false
642
+ },
643
+ "50332": {
644
+ "content": "[unused47]",
645
+ "lstrip": false,
646
+ "normalized": true,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": false
650
+ },
651
+ "50333": {
652
+ "content": "[unused48]",
653
+ "lstrip": false,
654
+ "normalized": true,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": false
658
+ },
659
+ "50334": {
660
+ "content": "[unused49]",
661
+ "lstrip": false,
662
+ "normalized": true,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": false
666
+ },
667
+ "50335": {
668
+ "content": "[unused50]",
669
+ "lstrip": false,
670
+ "normalized": true,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": false
674
+ },
675
+ "50336": {
676
+ "content": "[unused51]",
677
+ "lstrip": false,
678
+ "normalized": true,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": false
682
+ },
683
+ "50337": {
684
+ "content": "[unused52]",
685
+ "lstrip": false,
686
+ "normalized": true,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": false
690
+ },
691
+ "50338": {
692
+ "content": "[unused53]",
693
+ "lstrip": false,
694
+ "normalized": true,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": false
698
+ },
699
+ "50339": {
700
+ "content": "[unused54]",
701
+ "lstrip": false,
702
+ "normalized": true,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": false
706
+ },
707
+ "50340": {
708
+ "content": "[unused55]",
709
+ "lstrip": false,
710
+ "normalized": true,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": false
714
+ },
715
+ "50341": {
716
+ "content": "[unused56]",
717
+ "lstrip": false,
718
+ "normalized": true,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": false
722
+ },
723
+ "50342": {
724
+ "content": "[unused57]",
725
+ "lstrip": false,
726
+ "normalized": true,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": false
730
+ },
731
+ "50343": {
732
+ "content": "[unused58]",
733
+ "lstrip": false,
734
+ "normalized": true,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": false
738
+ },
739
+ "50344": {
740
+ "content": "[unused59]",
741
+ "lstrip": false,
742
+ "normalized": true,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": false
746
+ },
747
+ "50345": {
748
+ "content": "[unused60]",
749
+ "lstrip": false,
750
+ "normalized": true,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": false
754
+ },
755
+ "50346": {
756
+ "content": "[unused61]",
757
+ "lstrip": false,
758
+ "normalized": true,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": false
762
+ },
763
+ "50347": {
764
+ "content": "[unused62]",
765
+ "lstrip": false,
766
+ "normalized": true,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": false
770
+ },
771
+ "50348": {
772
+ "content": "[unused63]",
773
+ "lstrip": false,
774
+ "normalized": true,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": false
778
+ },
779
+ "50349": {
780
+ "content": "[unused64]",
781
+ "lstrip": false,
782
+ "normalized": true,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": false
786
+ },
787
+ "50350": {
788
+ "content": "[unused65]",
789
+ "lstrip": false,
790
+ "normalized": true,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": false
794
+ },
795
+ "50351": {
796
+ "content": "[unused66]",
797
+ "lstrip": false,
798
+ "normalized": true,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": false
802
+ },
803
+ "50352": {
804
+ "content": "[unused67]",
805
+ "lstrip": false,
806
+ "normalized": true,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": false
810
+ },
811
+ "50353": {
812
+ "content": "[unused68]",
813
+ "lstrip": false,
814
+ "normalized": true,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": false
818
+ },
819
+ "50354": {
820
+ "content": "[unused69]",
821
+ "lstrip": false,
822
+ "normalized": true,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": false
826
+ },
827
+ "50355": {
828
+ "content": "[unused70]",
829
+ "lstrip": false,
830
+ "normalized": true,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": false
834
+ },
835
+ "50356": {
836
+ "content": "[unused71]",
837
+ "lstrip": false,
838
+ "normalized": true,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": false
842
+ },
843
+ "50357": {
844
+ "content": "[unused72]",
845
+ "lstrip": false,
846
+ "normalized": true,
847
+ "rstrip": false,
848
+ "single_word": false,
849
+ "special": false
850
+ },
851
+ "50358": {
852
+ "content": "[unused73]",
853
+ "lstrip": false,
854
+ "normalized": true,
855
+ "rstrip": false,
856
+ "single_word": false,
857
+ "special": false
858
+ },
859
+ "50359": {
860
+ "content": "[unused74]",
861
+ "lstrip": false,
862
+ "normalized": true,
863
+ "rstrip": false,
864
+ "single_word": false,
865
+ "special": false
866
+ },
867
+ "50360": {
868
+ "content": "[unused75]",
869
+ "lstrip": false,
870
+ "normalized": true,
871
+ "rstrip": false,
872
+ "single_word": false,
873
+ "special": false
874
+ },
875
+ "50361": {
876
+ "content": "[unused76]",
877
+ "lstrip": false,
878
+ "normalized": true,
879
+ "rstrip": false,
880
+ "single_word": false,
881
+ "special": false
882
+ },
883
+ "50362": {
884
+ "content": "[unused77]",
885
+ "lstrip": false,
886
+ "normalized": true,
887
+ "rstrip": false,
888
+ "single_word": false,
889
+ "special": false
890
+ },
891
+ "50363": {
892
+ "content": "[unused78]",
893
+ "lstrip": false,
894
+ "normalized": true,
895
+ "rstrip": false,
896
+ "single_word": false,
897
+ "special": false
898
+ },
899
+ "50364": {
900
+ "content": "[unused79]",
901
+ "lstrip": false,
902
+ "normalized": true,
903
+ "rstrip": false,
904
+ "single_word": false,
905
+ "special": false
906
+ },
907
+ "50365": {
908
+ "content": "[unused80]",
909
+ "lstrip": false,
910
+ "normalized": true,
911
+ "rstrip": false,
912
+ "single_word": false,
913
+ "special": false
914
+ },
915
+ "50366": {
916
+ "content": "[unused81]",
917
+ "lstrip": false,
918
+ "normalized": true,
919
+ "rstrip": false,
920
+ "single_word": false,
921
+ "special": false
922
+ },
923
+ "50367": {
924
+ "content": "[unused82]",
925
+ "lstrip": false,
926
+ "normalized": true,
927
+ "rstrip": false,
928
+ "single_word": false,
929
+ "special": false
930
+ },
931
+ "50368": {
932
+ "content": "<global-img>",
933
+ "lstrip": false,
934
+ "normalized": false,
935
+ "rstrip": false,
936
+ "single_word": false,
937
+ "special": true
938
+ },
939
+ "50369": {
940
+ "content": "<row_1_col_1>",
941
+ "lstrip": false,
942
+ "normalized": false,
943
+ "rstrip": false,
944
+ "single_word": false,
945
+ "special": true
946
+ },
947
+ "50370": {
948
+ "content": "<row_1_col_2>",
949
+ "lstrip": false,
950
+ "normalized": false,
951
+ "rstrip": false,
952
+ "single_word": false,
953
+ "special": true
954
+ },
955
+ "50371": {
956
+ "content": "<row_1_col_3>",
957
+ "lstrip": false,
958
+ "normalized": false,
959
+ "rstrip": false,
960
+ "single_word": false,
961
+ "special": true
962
+ },
963
+ "50372": {
964
+ "content": "<row_1_col_4>",
965
+ "lstrip": false,
966
+ "normalized": false,
967
+ "rstrip": false,
968
+ "single_word": false,
969
+ "special": true
970
+ },
971
+ "50373": {
972
+ "content": "<row_1_col_5>",
973
+ "lstrip": false,
974
+ "normalized": false,
975
+ "rstrip": false,
976
+ "single_word": false,
977
+ "special": true
978
+ },
979
+ "50374": {
980
+ "content": "<row_1_col_6>",
981
+ "lstrip": false,
982
+ "normalized": false,
983
+ "rstrip": false,
984
+ "single_word": false,
985
+ "special": true
986
+ },
987
+ "50375": {
988
+ "content": "<row_2_col_1>",
989
+ "lstrip": false,
990
+ "normalized": false,
991
+ "rstrip": false,
992
+ "single_word": false,
993
+ "special": true
994
+ },
995
+ "50376": {
996
+ "content": "<row_2_col_2>",
997
+ "lstrip": false,
998
+ "normalized": false,
999
+ "rstrip": false,
1000
+ "single_word": false,
1001
+ "special": true
1002
+ },
1003
+ "50377": {
1004
+ "content": "<row_2_col_3>",
1005
+ "lstrip": false,
1006
+ "normalized": false,
1007
+ "rstrip": false,
1008
+ "single_word": false,
1009
+ "special": true
1010
+ },
1011
+ "50378": {
1012
+ "content": "<row_2_col_4>",
1013
+ "lstrip": false,
1014
+ "normalized": false,
1015
+ "rstrip": false,
1016
+ "single_word": false,
1017
+ "special": true
1018
+ },
1019
+ "50379": {
1020
+ "content": "<row_2_col_5>",
1021
+ "lstrip": false,
1022
+ "normalized": false,
1023
+ "rstrip": false,
1024
+ "single_word": false,
1025
+ "special": true
1026
+ },
1027
+ "50380": {
1028
+ "content": "<row_2_col_6>",
1029
+ "lstrip": false,
1030
+ "normalized": false,
1031
+ "rstrip": false,
1032
+ "single_word": false,
1033
+ "special": true
1034
+ },
1035
+ "50381": {
1036
+ "content": "<row_3_col_1>",
1037
+ "lstrip": false,
1038
+ "normalized": false,
1039
+ "rstrip": false,
1040
+ "single_word": false,
1041
+ "special": true
1042
+ },
1043
+ "50382": {
1044
+ "content": "<row_3_col_2>",
1045
+ "lstrip": false,
1046
+ "normalized": false,
1047
+ "rstrip": false,
1048
+ "single_word": false,
1049
+ "special": true
1050
+ },
1051
+ "50383": {
1052
+ "content": "<row_3_col_3>",
1053
+ "lstrip": false,
1054
+ "normalized": false,
1055
+ "rstrip": false,
1056
+ "single_word": false,
1057
+ "special": true
1058
+ },
1059
+ "50384": {
1060
+ "content": "<row_3_col_4>",
1061
+ "lstrip": false,
1062
+ "normalized": false,
1063
+ "rstrip": false,
1064
+ "single_word": false,
1065
+ "special": true
1066
+ },
1067
+ "50385": {
1068
+ "content": "<row_3_col_5>",
1069
+ "lstrip": false,
1070
+ "normalized": false,
1071
+ "rstrip": false,
1072
+ "single_word": false,
1073
+ "special": true
1074
+ },
1075
+ "50386": {
1076
+ "content": "<row_3_col_6>",
1077
+ "lstrip": false,
1078
+ "normalized": false,
1079
+ "rstrip": false,
1080
+ "single_word": false,
1081
+ "special": true
1082
+ },
1083
+ "50387": {
1084
+ "content": "<row_4_col_1>",
1085
+ "lstrip": false,
1086
+ "normalized": false,
1087
+ "rstrip": false,
1088
+ "single_word": false,
1089
+ "special": true
1090
+ },
1091
+ "50388": {
1092
+ "content": "<row_4_col_2>",
1093
+ "lstrip": false,
1094
+ "normalized": false,
1095
+ "rstrip": false,
1096
+ "single_word": false,
1097
+ "special": true
1098
+ },
1099
+ "50389": {
1100
+ "content": "<row_4_col_3>",
1101
+ "lstrip": false,
1102
+ "normalized": false,
1103
+ "rstrip": false,
1104
+ "single_word": false,
1105
+ "special": true
1106
+ },
1107
+ "50390": {
1108
+ "content": "<row_4_col_4>",
1109
+ "lstrip": false,
1110
+ "normalized": false,
1111
+ "rstrip": false,
1112
+ "single_word": false,
1113
+ "special": true
1114
+ },
1115
+ "50391": {
1116
+ "content": "<row_4_col_5>",
1117
+ "lstrip": false,
1118
+ "normalized": false,
1119
+ "rstrip": false,
1120
+ "single_word": false,
1121
+ "special": true
1122
+ },
1123
+ "50392": {
1124
+ "content": "<row_4_col_6>",
1125
+ "lstrip": false,
1126
+ "normalized": false,
1127
+ "rstrip": false,
1128
+ "single_word": false,
1129
+ "special": true
1130
+ },
1131
+ "50393": {
1132
+ "content": "<row_5_col_1>",
1133
+ "lstrip": false,
1134
+ "normalized": false,
1135
+ "rstrip": false,
1136
+ "single_word": false,
1137
+ "special": true
1138
+ },
1139
+ "50394": {
1140
+ "content": "<row_5_col_2>",
1141
+ "lstrip": false,
1142
+ "normalized": false,
1143
+ "rstrip": false,
1144
+ "single_word": false,
1145
+ "special": true
1146
+ },
1147
+ "50395": {
1148
+ "content": "<row_5_col_3>",
1149
+ "lstrip": false,
1150
+ "normalized": false,
1151
+ "rstrip": false,
1152
+ "single_word": false,
1153
+ "special": true
1154
+ },
1155
+ "50396": {
1156
+ "content": "<row_5_col_4>",
1157
+ "lstrip": false,
1158
+ "normalized": false,
1159
+ "rstrip": false,
1160
+ "single_word": false,
1161
+ "special": true
1162
+ },
1163
+ "50397": {
1164
+ "content": "<row_5_col_5>",
1165
+ "lstrip": false,
1166
+ "normalized": false,
1167
+ "rstrip": false,
1168
+ "single_word": false,
1169
+ "special": true
1170
+ },
1171
+ "50398": {
1172
+ "content": "<row_5_col_6>",
1173
+ "lstrip": false,
1174
+ "normalized": false,
1175
+ "rstrip": false,
1176
+ "single_word": false,
1177
+ "special": true
1178
+ },
1179
+ "50399": {
1180
+ "content": "<row_6_col_1>",
1181
+ "lstrip": false,
1182
+ "normalized": false,
1183
+ "rstrip": false,
1184
+ "single_word": false,
1185
+ "special": true
1186
+ },
1187
+ "50400": {
1188
+ "content": "<row_6_col_2>",
1189
+ "lstrip": false,
1190
+ "normalized": false,
1191
+ "rstrip": false,
1192
+ "single_word": false,
1193
+ "special": true
1194
+ },
1195
+ "50401": {
1196
+ "content": "<row_6_col_3>",
1197
+ "lstrip": false,
1198
+ "normalized": false,
1199
+ "rstrip": false,
1200
+ "single_word": false,
1201
+ "special": true
1202
+ },
1203
+ "50402": {
1204
+ "content": "<row_6_col_4>",
1205
+ "lstrip": false,
1206
+ "normalized": false,
1207
+ "rstrip": false,
1208
+ "single_word": false,
1209
+ "special": true
1210
+ },
1211
+ "50403": {
1212
+ "content": "<row_6_col_5>",
1213
+ "lstrip": false,
1214
+ "normalized": false,
1215
+ "rstrip": false,
1216
+ "single_word": false,
1217
+ "special": true
1218
+ },
1219
+ "50404": {
1220
+ "content": "<row_6_col_6>",
1221
+ "lstrip": false,
1222
+ "normalized": false,
1223
+ "rstrip": false,
1224
+ "single_word": false,
1225
+ "special": true
1226
+ },
1227
+ "50405": {
1228
+ "content": "<end_of_utterance>",
1229
+ "lstrip": false,
1230
+ "normalized": false,
1231
+ "rstrip": false,
1232
+ "single_word": false,
1233
+ "special": true
1234
+ },
1235
+ "50406": {
1236
+ "content": "<fake_token_around_image>",
1237
+ "lstrip": false,
1238
+ "normalized": false,
1239
+ "rstrip": false,
1240
+ "single_word": false,
1241
+ "special": true
1242
+ },
1243
+ "50407": {
1244
+ "content": "<image>",
1245
+ "lstrip": false,
1246
+ "normalized": false,
1247
+ "rstrip": false,
1248
+ "single_word": false,
1249
+ "special": true
1250
+ }
1251
+ },
1252
+ "additional_special_tokens": [
1253
+ "<global-img>",
1254
+ "<row_1_col_1>",
1255
+ "<row_1_col_2>",
1256
+ "<row_1_col_3>",
1257
+ "<row_1_col_4>",
1258
+ "<row_1_col_5>",
1259
+ "<row_1_col_6>",
1260
+ "<row_2_col_1>",
1261
+ "<row_2_col_2>",
1262
+ "<row_2_col_3>",
1263
+ "<row_2_col_4>",
1264
+ "<row_2_col_5>",
1265
+ "<row_2_col_6>",
1266
+ "<row_3_col_1>",
1267
+ "<row_3_col_2>",
1268
+ "<row_3_col_3>",
1269
+ "<row_3_col_4>",
1270
+ "<row_3_col_5>",
1271
+ "<row_3_col_6>",
1272
+ "<row_4_col_1>",
1273
+ "<row_4_col_2>",
1274
+ "<row_4_col_3>",
1275
+ "<row_4_col_4>",
1276
+ "<row_4_col_5>",
1277
+ "<row_4_col_6>",
1278
+ "<row_5_col_1>",
1279
+ "<row_5_col_2>",
1280
+ "<row_5_col_3>",
1281
+ "<row_5_col_4>",
1282
+ "<row_5_col_5>",
1283
+ "<row_5_col_6>",
1284
+ "<row_6_col_1>",
1285
+ "<row_6_col_2>",
1286
+ "<row_6_col_3>",
1287
+ "<row_6_col_4>",
1288
+ "<row_6_col_5>",
1289
+ "<row_6_col_6>",
1290
+ "<end_of_utterance>",
1291
+ "<fake_token_around_image>",
1292
+ "<image>"
1293
+ ],
1294
+ "clean_up_tokenization_spaces": true,
1295
+ "cls_token": "[CLS]",
1296
+ "extra_special_tokens": {},
1297
+ "legacy": false,
1298
+ "mask_token": "[MASK]",
1299
+ "model_input_names": [
1300
+ "input_ids",
1301
+ "attention_mask",
1302
+ "pixel_values",
1303
+ "pixel_attention_mask"
1304
+ ],
1305
+ "model_max_length": 8192,
1306
+ "pad_token": "[PAD]",
1307
+ "sep_token": "[SEP]",
1308
+ "tokenizer_class": "PreTrainedTokenizerFast",
1309
+ "unk_token": "[UNK]"
1310
+ }
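
The `added_tokens_decoder` entries above follow a regular layout: the image-tile tokens `<row_r_col_c>` cover a 6x6 grid in row-major order starting at ID 50369, right after `<global-img>` (50368), and the `[unusedN]` entries visible in this section sit at ID 50285 + N. The sketch below reproduces that mapping in plain Python. The helper names (`tile_token`, `tile_token_id`, `unused_token_id`) are hypothetical and not part of `transformers`, and the constants are read off this file rather than taken from any documented API, so treat it as a sanity check on the config, not as the processor's actual logic.

```python
# Hypothetical helpers illustrating the ID layout seen in this tokenizer_config.json.
# All constants are assumptions read off the added_tokens_decoder entries above.

GRID_SIZE = 6            # rows and columns run from 1 to 6 in the config
GLOBAL_IMG_ID = 50368    # "<global-img>"
FIRST_TILE_ID = 50369    # "<row_1_col_1>"
UNUSED_OFFSET = 50285    # "[unused10]" is 50295, "[unused82]" is 50367


def _check_grid_position(row: int, col: int) -> None:
    """Raise ValueError if (row, col) is outside the 6x6 grid."""
    if not (1 <= row <= GRID_SIZE and 1 <= col <= GRID_SIZE):
        raise ValueError(f"row and col must be in 1..{GRID_SIZE}, got ({row}, {col})")


def tile_token(row: int, col: int) -> str:
    """Return the tile token string, e.g. '<row_2_col_3>'."""
    _check_grid_position(row, col)
    return f"<row_{row}_col_{col}>"


def tile_token_id(row: int, col: int) -> int:
    """Return the token ID for a tile, assuming row-major order from FIRST_TILE_ID."""
    _check_grid_position(row, col)
    return FIRST_TILE_ID + (row - 1) * GRID_SIZE + (col - 1)


def unused_token_id(n: int) -> int:
    """Return the ID of '[unusedN]' for the range visible in this section (10..82)."""
    if not (10 <= n <= 82):
        raise ValueError("only [unused10]..[unused82] appear in this section")
    return UNUSED_OFFSET + n


if __name__ == "__main__":
    for row, col in [(1, 1), (2, 1), (6, 6)]:
        print(tile_token(row, col), tile_token_id(row, col))
```

To check against the real files, loading the tokenizer with `AutoTokenizer.from_pretrained(...)` and calling `convert_tokens_to_ids` on the same token strings should return matching IDs, provided the published tokenizer agrees with this config.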