---
license: mit
pipeline_tag: image-to-text
base_model: TokenFD
base_model_relation: finetune
---
<center>
<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>
[\[📂 GitHub\]](https://github.com/Token-family/TokenFD) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/project_page/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/Token-level_Text_Image_Foundation_Model)
</center>
<!-- <div align="center">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/dQ_JfK_I91WXzIq52D015.png">
</div> -->
<center>
<!-- ### Model Cards -->
<h2 style="color: #4CAF50;">Model Cards</h2>
<!-- In the following table, we provide all models [🤗 link] of the series. -->
<!-- | Model Name | Description |
| :-----------------------: | :-------------------------------------------------------------------: |
| [R50](https://huggingface.co/TongkunGuan/R50) |Backbone is ResNet-50;feature dimension is 2048; support interactive with English and Chinese texts. |
| [TokenFD-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenFD_2048_Bilingual_seg) |Backbone is ViT;feature dimension is 2048; support interactive with English and Chinese texts. |
| [TokenFD-4096-English-seg](https://huggingface.co/TongkunGuan/TokenFD_4096_English_seg) |(We recommend 👍) Backbone is ViT; feature dimension is 4096; only supports interactive with English texts. |
-->
We provide customized adaptations of TokenFD on top of several currently open-source base models:

- [TokenFD-ResNet50-bilingual](https://huggingface.co/TongkunGuan/R50)
- [TokenFD-InternViT2.5-bilingual](https://huggingface.co/TongkunGuan/TokenFD_2048_Bilingual_seg)
- [TokenFD-InternViT2.5-english](https://huggingface.co/TongkunGuan/TokenFD_4096_English_seg)
- [TokenFD-QwenViT2.5-bilingual](https://huggingface.co/TongkunGuan/TokenFD_no_dis)
</center>
<div align="center">
<img width="800" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/MHfn1TgL1DkiEhNmD8rnK.gif">
</div>
### Quick Start
> \[!Warning\]
> 🚨 Note: Since public datasets contain fewer Chinese text images than English ones, we recommend using the **`TokenFD-4096-English-seg`** version.
```python
import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image  # helper functions from the GitHub repo above
checkpoint = 'TongkunGuan/TokenFD_4096_English_seg'  # Hugging Face repo id, or a local path to the downloaded checkpoint
image_path = './demo_images/0000000.png'
input_query = '11/12/2020'
out_dir = 'results'
os.makedirs(out_dir, exist_ok=True)
"""loading model, tokenizer, tok_embeddings """
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, use_fast=False)
model = InternVLChatModel.from_pretrained(checkpoint, low_cpu_mem_usage=True, torch_dtype=torch.bfloat16).eval()
model = model.cuda()
"""loading image """
pixel_values, images, target_aspect_ratio = load_image(image_path)
"""loading query texts """
if input_query[0] in '!"#$%&\'()*+,-./0123456789:;<=>?@^_{|}~0123456789':
input_ids = tokenizer(input_query)['input_ids'][1:]
else:
input_ids = tokenizer(' '+input_query)['input_ids'][1:]
input_ids = torch.Tensor(input_ids).long().to(model.device)
input_embeds = model.tok_embeddings(input_ids).clone()
all_bpe_strings = [tokenizer.decode(input_id) for input_id in input_ids]
"""Obtaining similarity """
with torch.no_grad():
    vit_embeds, _ = model.forward_tokenocr(pixel_values.to(model.device))  # (vit_batch_size, 16*16, 2048)
    vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
    token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
    input_embeddings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)
    similarity = input_embeddings @ token_features.t()
    attn_map = similarity.reshape(len(input_embeddings), resized_size[0], resized_size[1])
"""generate map locally """
generate_similiarity_map(images, attn_map, all_bpe_strings, out_dir, target_aspect_ratio)
"""user command """
# python quick_start.py
```
<center>
<!-- # Introduction -->
<h2 style="color: #4CAF50;">Introduction</h2>
</center>
We are excited to announce the release of **`TokenFD`**, the first token-level visual foundation model specifically tailored for text-image-related tasks,
designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD,
we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset,
**`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs.
Furthermore, leveraging this foundation with exceptional image-as-text capability,
we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
<center>
<!-- # Token Family -->
<h2 style="color: #4CAF50;">Token Family</h2>
</center>
<!-- ## TokenIT -->
<h2 style="color: #4CAF50;">TokenIT</h2>
In the following picture, we provide an overview of the self-constructed token-level **TokenIT** dataset, comprising 20 million images and 1.8 billion
token-mask pairs.
As depicted in Figure 2 (a), each sample in this dataset includes a raw image, a mask image, and a JSON file.
The JSON file provides the question-answer pairs and several BPE tokens randomly selected from the answer, along with
the ordinal number of each BPE token in the answer and its corresponding pixel value on the mask image. Consequently,
**each BPE token corresponds one-to-one with a pixel-level mask**.
The data ratios are summarized in Figure 2 (b). Figure 2 (c) and (d) further provide the number distribution
of tokens per image type and a word cloud of the top 100 tokens, respectively.
<div align="center">
<img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png">
</div>
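For concreteness, a single TokenIT sample might be organized like the sketch below. The field names are hypothetical illustrations of the structure described above (a raw image, a mask image, and a JSON file tying each sampled BPE token to its ordinal position in the answer and its pixel value in the mask), not the dataset's actual schema; the question/answer strings are made up around the Quick Start's demo query.

```python
# Hypothetical layout of one TokenIT sample (illustrative field names only):
sample = {
    "image": "0000000.png",       # raw text image
    "mask": "0000000_mask.png",   # mask image: pixel value k marks the k-th sampled token's region
    "annotation": {               # contents of the accompanying JSON file
        "question": "What is the date shown in the document?",
        "answer": "11/12/2020",
        "bpe_tokens": [
            {"token": "11",    "ordinal": 0, "mask_value": 1},
            {"token": "/12",   "ordinal": 1, "mask_value": 2},
            {"token": "/2020", "ordinal": 2, "mask_value": 3},
        ],
    },
}
```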
<!--  -->
A comparison with other visual foundation models:
| VFM | Granularity | Dataset | #Image | #Pairs |
|:-------------------|:------------|:---------|:------:|:------:|
| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
| [SAM](https://github.com/facebookresearch/segment-anything) | pixel-level | SA1B | 11M | 1.1B |
| **TokenFD** | **token-level** | **TokenIT** | **20M** | **1.8B** |
<!-- ## TokenFD
-->
<h2 style="color: #4CAF50;">TokenFD</h2>
### Model Architecture
An overview of the proposed TokenFD, where the token-level image features and token-level language
features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive
applications, including text segmentation, retrieval, and visual question answering.
<div align="center">
<img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/6vNkEPzolBWVM--beoxLI.png">
</div>
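To make the "image-as-text" alignment concrete for text segmentation, the sketch below thresholds the per-token similarity map from the Quick Start into a binary mask. This is a minimal illustration under assumptions: the aggregation (max over query tokens) and the 0.5 threshold are arbitrary choices for the sketch, not the paper's segmentation protocol.

```python
import torch
import torch.nn.functional as F

def similarity_to_mask(attn_map: torch.Tensor, image_hw: tuple, threshold: float = 0.5) -> torch.Tensor:
    """Turn a (num_tokens, h, w) similarity map into a binary text mask."""
    per_patch = attn_map.max(dim=0).values                    # a patch is text if any query token matches it
    per_patch = (per_patch - per_patch.min()) / (per_patch.max() - per_patch.min() + 1e-6)
    up = F.interpolate(per_patch[None, None], size=image_hw, mode='bilinear', align_corners=False)
    return (up[0, 0] > threshold).to(torch.uint8)             # 1 = text, 0 = background
```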
### Evaluation on Vision Capability
We present a comprehensive evaluation of the vision encoder’s performance across various domains and tasks.
The evaluation is divided into three key categories:
(1) text retrieval;
(2) image segmentation;
(3) visual question answering.
This approach allows us to assess the representation quality of TokenFD.
Please refer to our technical report for more details.
#### text retrieval
<div align="left">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/wlLcdB0hpC666PrEQSDaM.png">
</div>
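As a rough sketch of how token-level features support retrieval, one plausible scoring rule reuses the normalized features from the Quick Start: score each image by the mean, over query tokens, of each token's best-matching patch similarity. This aggregation is an assumption for illustration, not necessarily the protocol behind the numbers above.

```python
import torch

def retrieval_score(text_embeds: torch.Tensor, image_token_feats: torch.Tensor) -> float:
    """text_embeds: (T, C) L2-normalized BPE token embeddings;
    image_token_feats: (P, C) L2-normalized token-level image features."""
    sim = text_embeds @ image_token_feats.t()        # (T, P) token-to-patch similarities
    return sim.max(dim=1).values.mean().item()       # best patch per token, averaged

# Ranking a gallery (features computed as in the Quick Start):
# scores = [retrieval_score(input_embeddings, feats) for feats in gallery_features]
```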
<!--  -->
#### image segmentation
<div align="left">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/0HqFXP8OC2tLH4d7scdMt.png">
</div>
<!--  -->
#### visual question answering
<div align="left">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/3PBP0akDiMbupu_Gr7lzP.png">
</div>
<!-- 
-->
<!-- ## TokenVL -->
<h2 style="color: #4CAF50;">TokenVL</h2>
We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
Following the previous training paradigm, TokenVL also includes two stages:
**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
<div align="center">
<img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/5ZCzz1tYy0bnIIZFxgTPN.png">
</div>
The framework of LLM-guided Token Alignment Training. Existing MLLMs primarily enhance spatial text perception by integrating localization prompts to predict coordinates. However, this implicit
method makes it difficult for these models to achieve a precise text-pixel understanding.
In contrast, the proposed token alignment uses BPE token masks to directly and explicitly align text with the corresponding pixels in the input image, enhancing the MLLM's localization awareness.
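The exact alignment loss is detailed in the paper; as a rough sketch under assumptions, the explicit supervision could take the form of per-token mask prediction, where each BPE token's similarity to every visual token is pushed toward that token's binary mask from TokenIT.

```python
import torch
import torch.nn.functional as F

def token_alignment_loss(text_embeds, vit_embeds, token_masks):
    """Illustrative objective only (not the paper's exact formulation).
    text_embeds: (T, C) BPE token embeddings sampled from the answer.
    vit_embeds:  (P, C) token-level visual features over the patch grid.
    token_masks: (T, P) binary TokenIT masks, 1 where a patch belongs to the token."""
    logits = (F.normalize(text_embeds, dim=-1) @ F.normalize(vit_embeds, dim=-1).t()) / 0.1  # temperature: arbitrary
    return F.binary_cross_entropy_with_logits(logits, token_masks.float())
```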
**Stage 2: Supervised Instruction Tuning for VQA tasks.**
During the Supervised Instruction Tuning stage, we remove the token alignment branch, since the answer may not appear in the image for some reasoning tasks
(e.g., "How much taller is the red bar compared to the green bar?"). This also ensures there is no extra computational overhead during inference. Finally, we inherit the
remaining weights from the LLM-guided Token Alignment stage and unfreeze all parameters to facilitate comprehensive parameter updates.
### OCRBench Results
<div align="center">
<img width="1300" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/DZej5Ogpho3wpZC4KVAMO.png">
</div>
### Document Understanding Results
<div align="center">
<img width="1300" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/Msfs1YkDQHq2-djhm6QqD.png">
</div>
## License
This project is released under the MIT License.
## Citation
If you find this project useful in your research, please consider citing:
```BibTeX
@article{guan2025tokenfd,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Guan, Tongkun and Wang, Zining and Fu, Pei and Guo, Zhentao and Shen, Wei and Zhou, Kai and Yue, Tiezhu and Duan, Chen and Sun, Hao and Jiang, Qianyi and Luo, Junfeng and Yang, Xiaokang},
  journal={arXiv preprint arXiv:2503.02304},
  year={2025}
}
```