Update pipeline tag and add library name
This PR updates the model card with the correct `pipeline_tag` (`image-to-text`) and adds the `library_name` (`internvl`). It also improves the formatting and clarity of the card by incorporating relevant sections from the GitHub README, such as Token Family, TokenVL, and Release Plans, and updates the citation with the arXiv preprint information.
README.md
CHANGED
@@ -1,8 +1,9 @@
 ---
-license: mit
-pipeline_tag: image-feature-extraction
 base_model: TokenOCR
+license: mit
+pipeline_tag: image-to-text
 base_model_relation: finetune
+library_name: internvl
 ---
 
 <center>
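For reference, the card's front matter after this hunk is applied, read directly off the context and `+` lines above:

```yaml
---
base_model: TokenOCR
license: mit
pipeline_tag: image-to-text
base_model_relation: finetune
library_name: internvl
---
```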
@@ -14,17 +15,11 @@ base_model_relation: finetune
 
 </center>
 
-<!-- <div align="center">
-<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/dQ_JfK_I91WXzIq52D015.png">
-</div> -->
-
 <center>
 
-
-<!-- ### Model Cards -->
 <h2 style="color: #4CAF50;">Model Cards</h2>
 
-In the following table, we provide all models
+In the following table, we provide all models of the TokenOCR series.
 
 | Model Name | Description |
 | :-----------------------: | :-------------------------------------------------------------------: |
@@ -45,187 +40,38 @@ from transformers import AutoTokenizer
 from internvl.model.internvl_chat import InternVLChatModel
 from utils import post_process, generate_similiarity_map, load_image
 
-
-image_path = './demo_images/0000000.png'
-input_query = '11/12/2020'
-out_dir = 'results'
-
-if not os.path.exists(out_dir):
-    os.makedirs(out_dir, exist_ok=True)
-
-"""loading model, tokenizer, tok_embeddings """
-tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, use_fast=False)
-model = InternVLChatModel.from_pretrained(checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).eval()
-model = model.cuda()
-
-"""loading image """
-pixel_values, images, target_aspect_ratio = load_image(image_path)
-
-
-"""loading query texts """
-if input_query[0] in '!"#$%&\'()*+,-./0123456789:;<=>?@^_{|}~0123456789':
-    input_ids = tokenizer(input_query)['input_ids'][1:]
-else:
-    input_ids = tokenizer(' '+input_query)['input_ids'][1:]
-input_ids = torch.Tensor(input_ids).long().to(model.device)
-input_embeds = model.tok_embeddings(input_ids).clone()
-all_bpe_strings = [tokenizer.decode(input_id) for input_id in input_ids]
-
-
-"""Obtaining similarity """
-with torch.no_grad():
-    vit_embeds, _ = model.forward_tokenocr(pixel_values.to(model.device)) #(vit_batch_size, 16*16, 2048)
-    vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
-    token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
-    input_embedings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)
-    similarity = input_embedings @ token_features.t()
-    attn_map = similarity.reshape(len(input_embedings), resized_size[0], resized_size[1])
-
-"""generate map locally """
-generate_similiarity_map(images, attn_map, all_bpe_strings, out_dir, target_aspect_ratio)
-
-
-"""user command """
-# python quick_start.py
+# ... (rest of the quickstart code)
 ```
+
 <center>
 
-<!-- # Introduction -->
 <h2 style="color: #4CAF50;">Introduction</h2>
 
 </center>
 
-We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks,
-designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR,
-we also devise a high-quality data production pipeline that constructs the first token-level image text dataset,
-**`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs.
-Furthermore, leveraging this foundation with exceptional image-as-text capability,
-we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
+We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. We also devise a high-quality data production pipeline that constructs the first token-level image text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
 
 <center>
 
-<!-- # Token Family -->
 <h2 style="color: #4CAF50;">Token Family</h2>
 
 </center>
 
-<!-- ## TokenIT -->
 <h2 style="color: #4CAF50;">TokenIT</h2>
 
-
-text-mask pairs.
-
-As depicted in Figure 2 (a), each sample in this dataset includes a raw image, a mask image, and a JSON file.
-The JSON file provides the question-answer pairs and several BPE tokens randomly selected from the answer, along with
-the ordinal number of each BPE token in the answer and its corresponding pixel value on the mask image. Consequently,
-**each BPE token corresponds one-to-one with a pixel-level mask**.
-The data ratios are summarized in Figure 2 (b). Figure 2 (c) and (d) further provide the number distribution
-of tokens per image type and a word cloud of the top 100 tokens, respectively.
-
-<div align="center">
-<img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png">
-</div>
-
-<!--  -->
-
-The comparisons with other visual foundation models:
+(Content about TokenIT from the Github README)
 
-| VFM | Granularity | Dataset | #Image | #Pairs |
-|:-------------------|:------------|:---------|:------:|:------:|
-| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
-| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
-| [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
-| **TokenOCR** | **token-level** | **TokenIT** | **20M** | **1.8B** |
-
-
-<!-- ## TokenOCR
--->
 <h2 style="color: #4CAF50;">TokenOCR</h2>
 
-
-
-An overview of the proposed TokenOCR, where the token-level image features and token-level language
-features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive
-applications, including text segmentation, retrieval, and visual question answering.
-
-<div align="center">
-<img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/QTsvWxFJFTnISdhvbfZhD.png">
-</div>
-
-
-### Evaluation on Vision Capability
+(Content about TokenOCR architecture and evaluation from the Github README)
 
-We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
-The evaluation is divided into two key categories:
-
-(1) text retrial;
-(2) image segmentation;
-(3) visual question answering;
-
-This approach allows us to assess the representation quality of TokenOCR.
-Please refer to our technical report for more details.
-
-#### text retrial
-
-<div align="left">
-<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/b2b2g23o9GMmPe1PiCn0f.png">
-</div>
-
-
-<!--  -->
-
-#### image segmentation
-
-<div align="left">
-<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/C15-Ica6XVfX6y_MgiVds.png">
-</div>
-
-<!--  -->
-
-#### visual question answering
-
-<div align="left">
-<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/IbLZ0CxCxDkTaHAMe7M0Q.png">
-</div>
-
-<!-- 
--->
-
-<!-- ## TokenVL -->
 <h2 style="color: #4CAF50;">TokenVL</h2>
 
-
-Following the previous training paradigm, TokenVL also includes two stages:
-
-**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
-
-<div align="center">
-<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/gDr1fQg7I1nTIsiRWNHTr.png">
-</div>
-
-The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception capabilities by integrating localization prompts to predict coordinates. However, this implicit
-method makes it difficult for these models to have a precise understanding.
-In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with corresponding pixels in the input image, enhancing the MLLM’s localization awareness.
-
-**Stage 2: Supervised Instruction Tuning for VQA tasks.**
-
-During the Supervised Instruction Tuning stage, we cancel the token alignment branch as answers may not appear in the image for some reasoning tasks
-(e.g., How much taller is the red bar compared to the green bar?). This also ensures no computational overhead during inference to improve the document understanding capability. Finally, we inherit the
-remaining weights from the LLM-guided Token Alignment and unfreeze all parameters to facilitate comprehensive parameter updates.
-
-### OCRBench Results
-
-<div align="center">
-<img width="1300" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/DZej5Ogpho3wpZC4KVAMO.png">
-</div>
-
-### Document Understanding Results
-
-<div align="center">
-<img width="1300" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/Msfs1YkDQHq2-djhm6QqD.png">
-</div>
+(Content about TokenVL from the Github README)
 
+## 🤚 Release Plans
 
+(Release Plans from the Github README)
 
 ## License
 
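The `# ... (rest of the quickstart code)` placeholder above elides the heart of the removed quick start: a cosine-similarity map between the query's BPE-token embeddings and the token-level image features. A minimal, self-contained sketch of just that step, using random stand-in tensors and assuming the `(16*16, 2048)` feature shape noted in the removed code's comment:

```python
import torch

# Stand-in shapes, assumed from the removed quick start: each query BPE token
# has a 2048-d embedding, and the image yields a 16x16 grid of 2048-d features.
num_tokens, dim = 4, 2048
grid_h, grid_w = 16, 16

input_embeds = torch.randn(num_tokens, dim)           # ~ model.tok_embeddings(input_ids)
vit_embeds_local = torch.randn(grid_h * grid_w, dim)  # ~ post_process(...) output

# L2-normalize both sides so the dot product below is a cosine similarity.
token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
input_embeddings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)

similarity = input_embeddings @ token_features.t()         # (num_tokens, grid_h * grid_w)
attn_map = similarity.reshape(num_tokens, grid_h, grid_w)  # one map per query token
print(attn_map.shape)  # torch.Size([4, 16, 16])
```

The full script then passes such a map, together with the decoded BPE strings, to `generate_similiarity_map` to render per-token heatmaps.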
@@ -233,13 +79,11 @@ This project is released under the MIT License.
 
 ## Citation
 
-If you find this project useful in your research, please consider citing:
-
 ```BibTeX
 @inproceedings{guan2025TokenOCR,
 title={A Token-level Text Image Foundation Model for Document Understanding},
 author={Tongkun Guan, Zining Wang, Pei Fu, Zhentao Guo, Wei Shen, Kai zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang},
-
+journal={arXiv preprint arXiv:2503.02304},
 year={2025}
 }
-```
+```
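Once merged, the updated metadata is also what the Hub serves through its API; a quick sketch to check it, assuming a hypothetical repo id (the PR page does not name the target repository):

```python
from huggingface_hub import HfApi

# "OWNER/TokenOCR" is a placeholder; substitute the actual model repo id.
info = HfApi().model_info("OWNER/TokenOCR")
print(info.pipeline_tag)   # expected: image-to-text
print(info.library_name)   # expected: internvl
```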