TongkunGuan committed

Commit f97d1f4 · verified · 1 Parent(s): 1cb3e8a

Update README.md

Files changed (1):
  1. README.md (+18 −18)
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
 license: mit
 pipeline_tag: image-feature-extraction
-base_model: TokenOCR
+base_model: TokenFD
 base_model_relation: finetune
 ---
 
@@ -10,7 +10,7 @@ base_model_relation: finetune
 <h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>
 
 
-[\[📂 GitHub\]](https://github.com/Token-family/TokenOCR) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenOCR_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenOCR)
+[\[📂 GitHub\]](https://github.com/Token-family/TokenFD) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenFD_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenFD)
 
 </center>
 
@@ -24,19 +24,19 @@ base_model_relation: finetune
 <!-- ### Model Cards -->
 <h2 style="color: #4CAF50;">Model Cards</h2>
 
-In the following table, we provide all models [🤗 link] of the TokenOCR series.
+In the following table, we provide all models [🤗 link] of the series.
 
 | Model Name | Description |
 | :-----------------------: | :-------------------------------------------------------------------: |
 | R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese text. |
-| [TokenOCR-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text. |
-| [TokenOCR-4096-English-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_English_seg) | (We recommend 👍) Backbone is ViT; feature dimension is 4096; only supports interaction with English text. |
+| [TokenFD-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenFD_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text. |
+| [TokenFD-4096-English-seg](https://huggingface.co/TongkunGuan/TokenFD_4096_English_seg) | (We recommend 👍) Backbone is ViT; feature dimension is 4096; only supports interaction with English text. |
 </center>
 
 ### Quick Start
 
 > \[!Warning\]
-> 🚨 Note: Since there are fewer Chinese images than English images in public data, we recommend you use the **`TokenOCR-4096-English-seg`** version.
+> 🚨 Note: Since there are fewer Chinese images than English images in public data, we recommend you use the **`TokenFD-4096-English-seg`** version.
 
 ```python
 import os
@@ -45,7 +45,7 @@ from transformers import AutoTokenizer
 from internvl.model.internvl_chat import InternVLChatModel
 from utils import post_process, generate_similiarity_map, load_image
 
-checkpoint = '/mnt/dolphinfs/hdd_pool/docker/user/hadoop-mt-ocr/guantongkun/VFM_try/processed_models/TokenOCR_4096_English_seg'
+checkpoint = '/mnt/dolphinfs/hdd_pool/docker/user/hadoop-mt-ocr/guantongkun/VFM_try/processed_models/TokenFD_4096_English_seg'
 image_path = './demo_images/0000000.png'
 input_query = '11/12/2020'
 out_dir = 'results'
@@ -74,7 +74,7 @@ all_bpe_strings = [tokenizer.decode(input_id) for input_id in input_ids]
 
 """Obtaining similarity """
 with torch.no_grad():
-    vit_embeds, _ = model.forward_tokenocr(pixel_values.to(model.device))  # (vit_batch_size, 16*16, 2048)
+    vit_embeds, _ = model.forward_TokenFD(pixel_values.to(model.device))  # (vit_batch_size, 16*16, 2048)
     vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
     token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
     input_embedings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)
@@ -95,12 +95,12 @@ generate_similiarity_map(images, attn_map, all_bpe_strings, out_dir, target_aspe
 
 </center>
 
-We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks,
-designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR,
+We are excited to announce the release of **`TokenFD`**, the first token-level visual foundation model specifically tailored for text-image-related tasks,
+designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD,
 we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset,
 **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs.
 Furthermore, leveraging this foundation with exceptional image-as-text capability,
-we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
+we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
 
 <center>
 
@@ -135,16 +135,16 @@ The comparisons with other visual foundation models:
 | [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
 | [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
 | [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
-| **TokenOCR** | **token-level** | **TokenIT** | **20M** | **1.8B** |
+| **TokenFD** | **token-level** | **TokenIT** | **20M** | **1.8B** |
 
 
-<!-- ## TokenOCR
+<!-- ## TokenFD
 -->
-<h2 style="color: #4CAF50;">TokenOCR</h2>
+<h2 style="color: #4CAF50;">TokenFD</h2>
 
 ### Model Architecture
 
-An overview of the proposed TokenOCR, where the token-level image features and token-level language
+An overview of the proposed TokenFD, where the token-level image features and token-level language
 features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive
 applications, including text segmentation, retrieval, and visual question answering.
 
@@ -162,7 +162,7 @@ The evaluation is divided into two key categories:
 (2) image segmentation;
 (3) visual question answering;
 
-This approach allows us to assess the representation quality of TokenOCR.
+This approach allows us to assess the representation quality of TokenFD.
 Please refer to our technical report for more details.
 
 #### text retrieval
@@ -194,7 +194,7 @@ Please refer to our technical report for more details.
 <!-- ## TokenVL -->
 <h2 style="color: #4CAF50;">TokenVL</h2>
 
-We employ the TokenOCR as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
+We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
 Following the previous training paradigm, TokenVL also includes two stages:
 
 **Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**
@@ -236,7 +236,7 @@ This project is released under the MIT License.
 If you find this project useful in your research, please consider citing:
 
 ```BibTeX
-@inproceedings{guan2025TokenOCR,
+@inproceedings{guan2025TokenFD,
   title={A Token-level Text Image Foundation Model for Document Understanding},
   author={Tongkun Guan, Zining Wang, Pei Fu, Zhentao Guo, Wei Shen, Kai Zhou, Tiezhu Yue, Chen Duan, Hao Sun, Qianyi Jiang, Junfeng Luo, Xiaokang Yang},
   booktitle={Arxiv},
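
For context on the hunks above: the renamed `forward_TokenFD` call produces the patch features that the Quick Start then normalizes and compares against the query's BPE embeddings. A minimal NumPy sketch of just that math follows; the shapes and variable names are illustrative stand-ins, not the repository's actual API.

```python
import numpy as np

# Illustrative shapes: one image tile of 16*16 patches with 2048-dim
# features, and a 4-token BPE query. Random tensors stand in for the
# real model outputs.
num_patches, dim, num_bpe_tokens = 16 * 16, 2048, 4
rng = np.random.default_rng(0)

vit_embeds = rng.standard_normal((num_patches, dim))       # stand-in for the visual features
input_embeds = rng.standard_normal((num_bpe_tokens, dim))  # stand-in for query token embeddings

# L2-normalize both sides so their dot products are cosine similarities,
# mirroring the `.norm(dim=-1, keepdim=True)` divisions in the snippet.
token_features = vit_embeds / np.linalg.norm(vit_embeds, axis=-1, keepdims=True)
input_embeddings = input_embeds / np.linalg.norm(input_embeds, axis=-1, keepdims=True)

# One similarity heatmap per BPE token: (num_bpe_tokens, num_patches).
attn_map = input_embeddings @ token_features.T
print(attn_map.shape)  # (4, 256)
```

Each row of `attn_map` can then be reshaped to the patch grid and rendered over the image, which is presumably what the snippet's `generate_similiarity_map` helper does with the real features.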