TongkunGuan committed · verified · Commit 33a506b · 1 Parent(s): 4a119d5

Update README.md

Files changed (1): README.md (+16 −9)
README.md CHANGED
@@ -13,34 +13,41 @@ base_model_relation: finetune
 
 # Introduction
 
- We are excited to announce the release of `TokenOCR`, the first token-level visual foundation model specifically tailored for text-image-related tasks,
 designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR,
 we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset,
- `TokenIT`, comprising 20 million images and 1.8 billion token-mask pairs.
 Furthermore, leveraging this foundation with exceptional image-as-text capability,
- we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, `TokenVL`, for VQA-based document understanding tasks.
 
 # Token Family
 
 ## TokenIT
 
 <div align="center">
- <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png">
 </div>
 
 <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png) -->
- An overview of the self-constructed token-level TokenIT dataset, comprising 20 million images and 1.8 billion
- text-mask pairs. (a) provides a detailed description of each sample, including the raw image, a mask, and a JSON file that
- records BPE token information. We also count (b) the data distribution, (c) the number of selected BPE tokens, and (d) a
- word cloud map highlighting the top 100 BPE tokens.
 
 | VFM | Granularity | Dataset | #Image | #Pairs |
 |:-------------------|:------------|:---------|:------:|:------:|
 | [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
 | [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
 | [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
- | **TokenOCR** | token-level | **TokenIT** | **20M** | **1.8B** |
 
 
 ## TokenOCR
 
 
 # Introduction
 
+ We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks,
 designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR,
 we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset,
+ **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs.
 Furthermore, leveraging this foundation with exceptional image-as-text capability,
+ we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
 
 # Token Family
 
 ## TokenIT
 
+ The figure below provides an overview of the self-constructed token-level **TokenIT** dataset, comprising 20 million images and 1.8 billion
+ text-mask pairs.
+
+ As depicted in Figure 2 (a), each sample in this dataset includes a raw image, a mask image, and a JSON file.
+ The JSON file provides the question-answer pairs and several BPE tokens randomly selected from the answer, along with
+ the ordinal number of each BPE token in the answer and its corresponding pixel value on the mask image. Consequently,
+ **each BPE token corresponds one-to-one with a pixel-level mask**.
+ The data ratios are summarized in Figure 2 (b). Figure 2 (c) and (d) further provide the number distribution
+ of tokens per image type and a word cloud of the top 100 tokens, respectively.
+
 <div align="center">
+ <img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png">
 </div>
 
 <!-- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png) -->
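The token-to-mask correspondence described above can be sketched in a few lines of Python. This is a minimal illustration, not the released loader: the field names (`answer`, `bpe_tokens`, `mask_value`) and the toy mask are assumptions about the schema, chosen only to show how a pixel value on the mask image recovers the binary mask of one BPE token.

```python
import numpy as np

# Hypothetical TokenIT-style sample; field names are illustrative assumptions.
sample = {
    "question": "What does the sign say?",
    "answer": "open daily",
    "bpe_tokens": [
        {"token": "open", "ordinal": 0, "mask_value": 1},
        {"token": " daily", "ordinal": 1, "mask_value": 2},
    ],
}

# Toy mask image: each pixel stores the mask_value of the BPE token
# it belongs to (0 = background).
mask = np.array([
    [0, 1, 1, 0],
    [0, 2, 2, 2],
])

def token_masks(sample, mask):
    """Return one binary mask per BPE token, keyed by the token string."""
    return {
        t["token"]: (mask == t["mask_value"]).astype(np.uint8)
        for t in sample["bpe_tokens"]
    }

masks = token_masks(sample, mask)
print(int(masks["open"].sum()))  # → 2 pixels covered by the token "open"
```

Because every BPE token carries its own `mask_value`, each token maps one-to-one to a pixel-level mask, which is the property the dataset description emphasizes.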
 
+ A comparison with other visual foundation models:
 
 | VFM | Granularity | Dataset | #Image | #Pairs |
 |:-------------------|:------------|:---------|:------:|:------:|
 | [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
 | [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
 | [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
+ | **TokenOCR** | **token-level** | **TokenIT** | **20M** | **1.8B** |
 
 
 ## TokenOCR