---
license: mit
pipeline_tag: image-feature-extraction
base_model: TokenFD
base_model_relation: finetune
---

<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>

[\[📂 GitHub\]](https://github.com/Token-family/TokenFD) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenFD_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenFD)

</center>

<!-- ### Model Cards -->
<h2 style="color: #4CAF50;">Model Cards</h2>

In the following table, we provide all models [🤗 link] of the TokenFD series.

| Model Name | Description |
| :-----------------------: | :-------------------------------------------------------------------: |
| R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese texts. |
| [TokenFD-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenFD_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese texts. |
| [TokenFD-4096-English-seg](https://huggingface.co/TongkunGuan/TokenFD_4096_English_seg) | (We recommend 👍) Backbone is ViT; feature dimension is 4096; only supports interaction with English texts. |
</center>

### Quick Start

> \[!Warning]
> 🚨 Note: Since public data contain fewer Chinese images than English ones, we recommend using the **`TokenFD-4096-English-seg`** version.

```python
import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image

checkpoint = '/mnt/dolphinfs/hdd_pool/docker/user/hadoop-mt-ocr/guantongkun/VFM_try/processed_models/TokenFD_4096_English_seg'
image_path = './demo_images/0000000.png'
input_query = '11/12/2020'
out_dir = 'results'

# ... (model loading, image preprocessing, and tokenization omitted in this excerpt) ...

"""Obtaining similarity"""
with torch.no_grad():
    vit_embeds, _ = model.forward_TokenFD(pixel_values.to(model.device))  # (vit_batch_size, 16*16, 2048)
    vit_embeds_local, resized_size = post_process(vit_embeds, target_aspect_ratio)
    token_features = vit_embeds_local / vit_embeds_local.norm(dim=-1, keepdim=True)
    input_embedings = input_embeds / input_embeds.norm(dim=-1, keepdim=True)

# ... (similarity-map generation and saving omitted in this excerpt) ...
```
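The heart of the snippet above is a cosine-similarity computation between L2-normalized visual token features and text-token embeddings. Below is a minimal, self-contained NumPy sketch of that step; the array shapes and random inputs are hypothetical stand-ins for the model outputs (`vit_embeds_local`, `input_embeds`), not the real tensors:

```python
import numpy as np

# Hypothetical stand-ins: a 16x16 grid of visual tokens and 3 BPE text
# tokens, each with feature dimension 2048 (shapes assumed for illustration).
rng = np.random.default_rng(0)
vit_embeds_local = rng.standard_normal((256, 2048))
input_embeds = rng.standard_normal((3, 2048))

# L2-normalize both sides, mirroring the Quick Start code.
token_features = vit_embeds_local / np.linalg.norm(vit_embeds_local, axis=-1, keepdims=True)
input_embedings = input_embeds / np.linalg.norm(input_embeds, axis=-1, keepdims=True)

# Cosine similarity of every text token against every visual token, then
# reshape each row into the 16x16 spatial grid for heat-map visualization.
similarity = input_embedings @ token_features.T  # shape (3, 256), values in [-1, 1]
attn_map = similarity.reshape(-1, 16, 16)        # one map per BPE token
```

Each per-token map can then be upsampled and overlaid on the input image, which is what the `generate_similiarity_map` utility does in the full script.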

</center>

We are excited to announce the release of **`TokenFD`**, the first token-level visual foundation model specifically tailored for text-image-related tasks and
designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD,
we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset,
**`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs.
Furthermore, leveraging this foundation with its exceptional image-as-text capability,
we seamlessly replace previous VFMs with TokenFD to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.

<center>

| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
| [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
| **TokenFD** | **token-level** | **TokenIT** | **20M** | **1.8B** |

<!-- ## TokenFD -->
<h2 style="color: #4CAF50;">TokenFD</h2>

### Model Architecture

An overview of the proposed TokenFD, where the token-level image features and token-level language
features are aligned within the same semantic space. This “image-as-text” alignment seamlessly facilitates user-interactive
applications, including text segmentation, retrieval, and visual question answering.

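As an illustration of how such token-level alignment can drive text segmentation, the sketch below thresholds a per-token similarity map into a binary patch mask. This is a hypothetical example, not the repository's code; the grid size and threshold value are assumptions:

```python
import numpy as np

# Hypothetical similarity map for one query token over a 16x16 grid of visual
# tokens, standing in for the output of the token-level alignment above.
rng = np.random.default_rng(1)
similarity_map = rng.uniform(-1.0, 1.0, size=(16, 16))

# Illustrative segmentation: threshold the map into a binary mask marking
# which image patches respond to the queried text (0.5 is an arbitrary choice).
threshold = 0.5
text_mask = similarity_map > threshold

# Fraction of patches flagged; in practice the mask would be upsampled to
# image resolution for display.
coverage = float(text_mask.mean())
```

The same map supports retrieval (rank candidates by their maximum similarity) and can feed a VQA pipeline as localized evidence.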
(2) image segmentation;
(3) visual question answering;

This approach allows us to assess the representation quality of TokenFD.
Please refer to our technical report for more details.

#### Text Retrieval

<!-- ## TokenVL -->
<h2 style="color: #4CAF50;">TokenVL</h2>

We employ TokenFD as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
Following the previous training paradigm, TokenVL also includes two stages:

**Stage 1: LLM-guided Token Alignment Training for text parsing tasks.**

If you find this project useful in your research, please consider citing:

```BibTeX
@inproceedings{guan2025TokenFD,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  booktitle={Arxiv},