# Introduction

We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with its exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.

# Token Family

## TokenIT

The figure below provides an overview of the self-constructed token-level **TokenIT** dataset, comprising 20 million images and 1.8 billion token-mask pairs.

As depicted in Figure 2 (a), each sample in this dataset includes a raw image, a mask image, and a JSON file. The JSON file provides the question-answer pairs and several BPE tokens randomly selected from the answer, along with the ordinal number of each BPE token in the answer and its corresponding pixel value on the mask image. Consequently, **each BPE token corresponds one-to-one with a pixel-level mask**. The data ratios are summarized in Figure 2 (b). Figure 2 (c) and (d) further provide the number distribution of tokens per image type and a word cloud of the top 100 tokens, respectively.
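The token-to-mask correspondence described above can be sketched in code. This is a minimal illustration only: the field names (`question`, `answer`, `tokens`, `index`, `pixel_value`) are hypothetical stand-ins for the dataset's actual schema, and the tiny mask array replaces a real mask image.

```python
import json
import numpy as np

# Hypothetical TokenIT sample; field names are illustrative, not the real schema.
sample = json.loads("""
{
  "question": "What does the sign say?",
  "answer": "token level",
  "tokens": [
    {"text": "token", "index": 0, "pixel_value": 1},
    {"text": "level", "index": 1, "pixel_value": 2}
  ]
}
""")

# Toy 4x6 mask image: each BPE token's region is painted with its pixel value,
# so a token's binary mask is recovered by matching that value.
mask_image = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
])

def token_mask(mask: np.ndarray, token_info: dict) -> np.ndarray:
    """Recover the binary pixel mask for one BPE token from its pixel value."""
    return (mask == token_info["pixel_value"]).astype(np.uint8)

for tok in sample["tokens"]:
    m = token_mask(mask_image, tok)
    print(tok["index"], tok["text"], int(m.sum()))  # pixels covered by this token
```

The one-to-one pairing means a single integer lookup per token is enough to go from a BPE token to its spatial support in the image.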
<div align="center">
<img width="1000" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/650d4a36cbd0c7d550d3b41b/WcQwU3-xjyT5Vm-pZhACo.png">
</div>
Comparison with other visual foundation models:

| VFM | Granularity | Dataset | #Image | #Pairs |
|:-------------------|:------------|:---------|:------:|:------:|
| [CLIP](https://github.com/openai/CLIP) | image-level | WIT400M | 400M | 0.4B |
| [DINO](https://github.com/facebookresearch/dino) | image-level | ImageNet | 14M | - |
| [SAM](https://github.com/facebookresearch/SAM) | pixel-level | SA1B | 11M | 1.1B |
| **TokenOCR** | **token-level** | **TokenIT** | **20M** | **1.8B** |

## TokenOCR