# Introduction

We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with its exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.

# Token Family

## TokenIT

The figure below provides an overview of the self-constructed token-level **TokenIT** dataset, comprising 20 million images and 1.8 billion token-mask pairs.

As depicted in Figure 2 (a), each sample in this dataset includes a raw image, a mask image, and a JSON file. The JSON file provides the question-answer pairs and several BPE tokens randomly selected from the answer, along with the ordinal number of each BPE token in the answer and its corresponding pixel value on the mask image. Consequently, **each BPE token corresponds one-to-one with a pixel-level mask**. The data ratios are summarized in Figure 2 (b). Figure 2 (c) and (d) further provide the number distribution of tokens per image type and a word cloud of the top 100 tokens, respectively.
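The token-to-mask correspondence described above can be sketched in code. This is a minimal illustration only: the field names (`question`, `answer`, `tokens`, `index`, `pixel_value`) are hypothetical stand-ins for the dataset's actual schema, and the tiny mask array replaces a real mask image.

```python
import json
import numpy as np

# Hypothetical TokenIT sample; field names are illustrative, not the real schema.
sample = json.loads("""
{
  "question": "What does the sign say?",
  "answer": "token level",
  "tokens": [
    {"text": "token", "index": 0, "pixel_value": 1},
    {"text": "level", "index": 1, "pixel_value": 2}
  ]
}
""")

# Toy 4x6 mask image: each BPE token's region is painted with its pixel value,
# so a token's binary mask is recovered by matching that value.
mask_image = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

def token_mask(mask: np.ndarray, token_info: dict) -> np.ndarray:
    """Recover the binary pixel mask for one BPE token from its pixel value."""
    return (mask == token_info["pixel_value"]).astype(np.uint8)

for tok in sample["tokens"]:
    m = token_mask(mask_image, tok)
    print(tok["index"], tok["text"], int(m.sum()))  # pixels covered by this token
```

The one-to-one pairing means a single integer lookup per token is enough to go from a BPE token to its spatial support in the image.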
<div align="center">
<img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png">
</div>
Comparison with other visual foundation models:

| VFM | Granularity | Dataset | #Image | #Pairs |
|:-------------------|:------------|:---------|:------:|:------:|
| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
| [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
| **TokenOCR** | **token-level** | **TokenIT** | **20M** | **1.8B** |

## TokenOCR