---
license: mit
pipeline_tag: image-to-text
base_model: TokenFD
base_model_relation: finetune
---

<center>

<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>


[\[📂 GitHub\]](https://github.com/Token-family/TokenFD)  [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/project_page/)   [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/Token-level_Text_Image_Foundation_Model)

</center>

<!-- <div align="center">
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/dQ_JfK_I91WXzIq52D015.png">
</div> -->

<center>


<h2 style="color: #4CAF50;">Model Cards</h2>

In the following table, we provide all models of the TokenFD series, customized and adapted from current open-source base models:

|        Model Name         |                                Description                                |
| :-----------------------: | :-------------------------------------------------------------------: |
| [TokenFD-ResNet50-bilingual](https://huggingface.co/TongkunGuan/R50) | Backbone: ResNet-50; feature dimension: 2048; supports interaction with English and Chinese texts. |
| [TokenFD-InternViT2.5-bilingual](https://huggingface.co/TongkunGuan/TokenFD_2048_Bilingual_seg) | Backbone: ViT; feature dimension: 2048; supports interaction with English and Chinese texts. |
| [TokenFD-InternViT2.5-english](https://huggingface.co/TongkunGuan/TokenFD_4096_English_seg) | (Recommended 👍) Backbone: ViT; feature dimension: 4096; supports interaction with English texts only. |
| [TokenFD-QwenViT2.5-bilingual](https://huggingface.co/TongkunGuan/TokenFD_no_dis) | Backbone: QwenViT2.5; supports interaction with English and Chinese texts. |

</center>

<div align="center">
  <img width="800" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/MHfn1TgL1DkiEhNmD8rnK.gif">
</div>

### Quick Start

> \[!Warning\]
> 🚨 Note: Public data contains far fewer Chinese text images than English ones, so we recommend using the **`TokenFD-4096-English-seg`** version.

```python
import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image

checkpoint = 'TongkunGuan/TokenFD_4096_English_seg'  # HF repo ID, or a local path to the downloaded checkpoint
image_path = './demo_images/0000000.png'
input_query = '11/12/2020'
out_dir = 'results'

os.makedirs(out_dir, exist_ok=True)

"""loading model, tokenizer, tok_embeddings """
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, use_fast=False)
model = InternVLChatModel.from_pretrained(checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).eval()
model = model.cuda()

"""loading image """
pixel_values, images, target_aspect_ratio = load_image(image_path)
 

"""loading query texts """
if input_query[0] in '!"#$%&\'()*+,-./0123456789:;<=>?@^_{|}~0123456789':
    input_ids = tokenizer(input_query)['input_ids'][1:]
else:
    input_ids = tokenizer(' '+input_query)['input_ids'][1:]
input_ids = torch.Tensor(input_ids).long().to(model.device)
input_embeds = model.tok_embeddings(input_ids).clone()
all_bpe_strings = [tokenizer.decode(input_id) for input_id in input_ids]


"""Obtaining similarity """
with torch.no_grad():
  vit_embeds, _ = model.forward_tokenocr(pixel_values.to(model.device)) #(vit_batch_size, 16*16, 2048)
  vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
  token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
  input_embedings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)
  similarity = input_embedings @ token_features.t()
  attn_map = similarity.reshape(len(input_embedings), resized_size[0], resized_size[1])

"""generate map locally """
generate_similiarity_map(images, attn_map, all_bpe_strings, out_dir, target_aspect_ratio)


"""user command """
# python quick_start.py
```
<center>

<h2 style="color: #4CAF50;">Introduction</h2>

</center>

We are excited to announce the release of **`TokenFD`**, the first token-level visual foundation model specifically tailored for text-image-related tasks 
and designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD, 
we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, 
**`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. 
Furthermore, leveraging this foundation model's exceptional image-as-text capability, 
we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks. 

<center>
  
<h2 style="color: #4CAF50;">Token Family</h2>

</center>

<h2 style="color: #4CAF50;">TokenIT</h2>

In the following figure, we provide an overview of the self-constructed token-level **TokenIT** dataset, comprising 20 million images and 1.8 billion
token-mask pairs. 

As depicted in Figure 2 (a), each sample in this dataset includes a raw image, a mask image, and a JSON file. 
The JSON file provides the question-answer pairs and several BPE tokens randomly selected from the answer, along with 
the ordinal number of each BPE token in the answer and its corresponding pixel value on the mask image. Consequently,
**each BPE token corresponds one-to-one with a pixel-level mask**. 
The data ratios are summarized in Figure 2 (b). Figures 2 (c) and (d) further show the distribution 
of token counts per image type and a word cloud of the top 100 tokens, respectively.

<div align="center">
  <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png">
</div>
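
For concreteness, here is a hypothetical sketch of what a single TokenIT record might contain, based on the description above (the field names are illustrative assumptions, not the dataset's actual schema):

```python
# Illustrative only: field names and values are assumptions inferred from the
# dataset description above, not the actual TokenIT schema.
sample = {
    "image": "images/0000000.png",  # raw text image
    "mask": "masks/0000000.png",    # mask image whose pixel values index BPE tokens
    "question": "What is the date shown in the image?",
    "answer": "11/12/2020",
    "bpe_tokens": [
        # each selected BPE token records its position in the answer ("ordinal")
        # and the pixel value marking its region on the mask image ("mask_value")
        {"token": "11", "ordinal": 0, "mask_value": 1},
        {"token": "/12", "ordinal": 1, "mask_value": 2},
    ],
}
```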


Comparisons with other visual foundation models:

| VFM                | Granularity | Dataset  | #Image | #Pairs |
|:-------------------|:------------|:---------|:------:|:------:|
| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M  | 400M   | 0.4B   |
| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M    | -      |
| [SAM](https://github.com/facebookresearch/segment-anything)  | pixel-level | SA1B     | 11M    | 1.1B   |
| **TokenFD**           | **token-level** | **TokenIT**  | **20M**    | **1.8B**   |


<h2 style="color: #4CAF50;">TokenFD</h2>

### Model Architecture

An overview of the proposed TokenFD, where the token-level image features and token-level language
features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive
applications, including text segmentation, retrieval, and visual question answering.

<div align="center">
  <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/6vNkEPzolBWVM--beoxLI.png">
</div>
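
To make this concrete, here is a minimal sketch (not the model's actual inference code) of how the normalized `token_features` and text-token embeddings from the Quick Start above could drive two of these applications; the 0.2 threshold is an illustrative assumption:

```python
import torch

def segment_and_score(token_features: torch.Tensor,  # (num_patches, dim), L2-normalized
                      text_embeds: torch.Tensor,     # (num_tokens, dim), L2-normalized
                      grid_h: int, grid_w: int,
                      threshold: float = 0.2):       # illustrative value, not from the paper
    """Sketch: turn the token-to-patch cosine-similarity map into a text
    segmentation mask and a retrieval score."""
    similarity = text_embeds @ token_features.t()        # (num_tokens, num_patches)
    sim_map = similarity.reshape(-1, grid_h, grid_w)     # one map per query BPE token
    seg_mask = (sim_map > threshold).any(dim=0)          # union over the query's tokens
    score = sim_map.flatten(1).max(dim=1).values.mean()  # max over patches, mean over tokens
    return seg_mask, score
```

Ranking images by `score` gives text-to-image retrieval, while the binary `seg_mask` approximates text segmentation.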



### Evaluation on Vision Capability

We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks. 
The evaluation is divided into three key categories:

(1) text retrieval;
(2) image segmentation;
(3) visual question answering.

This approach allows us to assess the representation quality of TokenFD. 
Please refer to our technical report for more details.

#### text retrieval

<div align="left">
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/wlLcdB0hpC666PrEQSDaM.png">
</div>


#### image segmentation

<div align="left">
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/0HqFXP8OC2tLH4d7scdMt.png">
</div>


#### visual question answering

<div align="left">
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/3PBP0akDiMbupu_Gr7lzP.png">
</div>


<h2 style="color: #4CAF50;">TokenVL</h2>

We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding. 
Following the previous training paradigm, TokenVL also comprises two stages: 

**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**

<div align="center">
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/5ZCzz1tYy0bnIIZFxgTPN.png">
</div>


The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial-wise text perception by integrating localization prompts and predicting coordinates. However, this implicit
method makes it difficult for such models to develop a precise spatial understanding of text. 
In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with the corresponding pixels in the input image, enhancing the MLLM’s localization awareness.
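
As a hedged sketch of this idea (the paper defines the actual objective; the formulation below is an assumption for illustration), each BPE token's similarity map could be supervised by its TokenIT pixel mask through a pixel-wise loss:

```python
import torch
import torch.nn.functional as F

def token_alignment_loss(sim_maps: torch.Tensor,    # (num_tokens, H, W) similarity logits
                         token_masks: torch.Tensor  # (num_tokens, H, W) binary BPE-token masks
                         ) -> torch.Tensor:
    """Hypothetical stand-in for the token alignment objective: each token's
    similarity map is pushed toward its pixel-level mask from TokenIT."""
    return F.binary_cross_entropy_with_logits(sim_maps, token_masks.float())
```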

**Stage 2: Supervised Instruction Tuning for VQA tasks.**

During the Supervised Instruction Tuning stage, we remove the token alignment branch, since the answer may not appear in the image for some reasoning tasks 
(e.g., “How much taller is the red bar compared to the green bar?”). Removing the branch also ensures no extra computational overhead during inference, improving document understanding capability. Finally, we inherit the
remaining weights from the LLM-guided Token Alignment stage and unfreeze all parameters to facilitate comprehensive parameter updates.

### OCRBench Results

<div align="center">
  <img width="1300" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/DZej5Ogpho3wpZC4KVAMO.png">
</div>

### Document Understanding Results

<div align="center">
  <img width="1300" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/Msfs1YkDQHq2-djhm6QqD.png">
</div>



## License

This project is released under the MIT License.

## Citation

If you find this project useful in your research, please consider citing:

```BibTeX
@article{guan2025TokenFD,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  journal={arXiv preprint arXiv:2503.02304},
  year={2025}
}
```