---
license: mit
pipeline_tag: image-feature-extraction
base_model: TokenFD
base_model_relation: finetune
---

<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>

[\[📂 GitHub\]](https://github.com/Token-family/TokenFD) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenFD_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenFD)

</center>

<!-- ### Model Cards -->
<h2 style="color: #4CAF50;">Model Cards</h2>

In the following table, we provide all models [🤗 link] of the TokenFD series.

| Model Name | Description |
| :-----------------------: | :-------------------------------------------------------------------: |
| R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese texts. |
| [TokenFD-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenFD_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese texts. |
| [TokenFD-4096-English-seg](https://huggingface.co/TongkunGuan/TokenFD_4096_English_seg) | (We recommend 👍) Backbone is ViT; feature dimension is 4096; only supports interaction with English texts. |
</center>

### Quick Start

> \[!Warning]
> 🚨 Note: Since public data contain fewer Chinese images than English ones, we recommend using the **`TokenFD-4096-English-seg`** version.

```python
import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image

checkpoint = '/mnt/dolphinfs/hdd_pool/docker/user/hadoop-mt-ocr/guantongkun/VFM_try/processed_models/TokenFD_4096_English_seg'
image_path = './demo_images/0000000.png'
input_query = '11/12/2020'
out_dir = 'results'

# ... (model loading, image preprocessing, and tokenization omitted in this excerpt) ...

"""Obtaining similarity"""
with torch.no_grad():
    vit_embeds, _ = model.forward_TokenFD(pixel_values.to(model.device))  # (vit_batch_size, 16*16, 2048)
    vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
    token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
    input_embedings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)

# ... (similarity-map generation and saving omitted in this excerpt) ...
```
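The heart of the snippet above is a cosine-similarity computation between L2-normalized visual token features and text-token embeddings. Below is a minimal, self-contained NumPy sketch of that step; the array shapes and random inputs are hypothetical stand-ins for the model outputs (`vit_embeds_local`, `input_embeds`), not the real tensors:

```python
import numpy as np

# Hypothetical stand-ins: a 16x16 grid of visual tokens and 3 BPE text
# tokens, each with feature dimension 2048 (shapes assumed for illustration).
rng = np.random.default_rng(0)
vit_embeds_local = rng.standard_normal((256, 2048))
input_embeds = rng.standard_normal((3, 2048))

# L2-normalize both sides, mirroring the Quick Start code.
token_features = vit_embeds_local / np.linalg.norm(vit_embeds_local, axis=-1, keepdims=True)
input_embedings = input_embeds / np.linalg.norm(input_embeds, axis=-1, keepdims=True)

# Cosine similarity of every text token against every visual token, then
# reshape each row into the 16x16 spatial grid for heat-map visualization.
similarity = input_embedings @ token_features.T  # shape (3, 256), values in [-1, 1]
attn_map = similarity.reshape(-1, 16, 16)        # one map per BPE token
```

Each per-token map can then be upsampled and overlaid on the input image, which is what the `generate_similiarity_map` utility does in the full script.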

</center>

We are excited to announce the release of **`TokenFD`**, the first token-level visual foundation model specifically tailored for text-image-related tasks and
designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD,
we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset,
**`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs.
Furthermore, leveraging this foundation with its exceptional image-as-text capability,
we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.

<center>

| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
| [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
| **TokenFD** | **token-level** | **TokenIT** | **20M** | **1.8B** |

<!-- ## TokenFD -->
<h2 style="color: #4CAF50;">TokenFD</h2>

### Model Architecture

An overview of the proposed TokenFD, where the token-level image features and token-level language
features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive
applications, including text segmentation, retrieval, and visual question answering.

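As an illustration of how such token-level alignment can drive text segmentation, the sketch below thresholds a per-token similarity map into a binary patch mask. This is a hypothetical example, not the repository's code; the grid size and threshold value are assumptions:

```python
import numpy as np

# Hypothetical similarity map for one query token over a 16x16 grid of visual
# tokens, standing in for the output of the token-level alignment above.
rng = np.random.default_rng(1)
similarity_map = rng.uniform(-1.0, 1.0, size=(16, 16))

# Illustrative segmentation: threshold the map into a binary mask marking
# which image patches respond to the queried text (0.5 is an arbitrary choice).
threshold = 0.5
text_mask = similarity_map > threshold

# Fraction of patches flagged; in practice the mask would be upsampled to
# image resolution for display.
coverage = float(text_mask.mean())
```

The same map supports retrieval (rank candidates by their maximum similarity) and can feed a VQA pipeline as localized evidence.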
(2) image segmentation;
(3) visual question answering;

This approach allows us to assess the representation quality of TokenFD.
Please refer to our technical report for more details.

#### Text Retrieval

<!-- ## TokenVL -->
<h2 style="color: #4CAF50;">TokenVL</h2>

We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
Following the previous training paradigm, TokenVL also includes two stages:

**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**

If you find this project useful in your research, please consider citing:

```BibTeX
@inproceedings{guan2025TokenFD,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  booktitle={Arxiv},