Update README.md
Browse files

README.md CHANGED

@@ -5,6 +5,9 @@ base_model: TokenOCR
 base_model_relation: finetune
 ---
 
+# A Token-level Text Image Foundation Model for Document Understanding
+
+
 [\[GitHub\]](https://github.com/Token-family/TokenOCR) [\[Paper\]]() [\[Blog\]]() [\[HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[Quick Start\]](#quick-start)
 
 <div align="center">
@@ -22,7 +25,8 @@ we seamlessly replace previous VFMs with TokenOCR to construct a document-level
 
 # Token Family
 
-## TokenIT
+<!-- ## TokenIT -->
+<h2 style="color: #4CAF50;">TokenIT</h2>
 
 In the following picture, we provide an overview of the self-constructed token-level **TokenIT** dataset, comprising 20 million images and 1.8 billion
 text-mask pairs.
@@ -50,7 +54,9 @@ The comparisons with other visual foundation models:
 | **TokenOCR** | **token-level** | **TokenIT** | **20M** | **1.8B** |
 
 
-## TokenOCR
+<!-- ## TokenOCR
+-->
+<h2 style="color: #4CAF50;">TokenOCR</h2>
 
 ### Model Architecture
 
@@ -136,7 +142,8 @@ Please refer to our technical report for more details.
 
 <!--
 -->
-## TokenVL
+<!-- ## TokenVL -->
+<h2 style="color: #4CAF50;">TokenVL</h2>
 
 We employ TokenOCR as the visual foundation model and further develop an MLLM, named TokenVL, tailored for document understanding.
 Following the previous training paradigm, TokenVL also includes two stages:
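The token-level text-mask pairing that the TokenIT section describes (a text token paired with its pixel mask, roughly 1.8B pairs over 20M images) can be sketched as a data record. This is a minimal illustrative sketch; the class and field names below are assumptions for clarity, not TokenIT's actual schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TokenMaskPair:
    """One token-level annotation: a text token and its pixel mask.

    Field names are illustrative; the real TokenIT schema may differ.
    """
    token: str         # the text token, e.g. a word or subword piece
    mask: np.ndarray   # boolean mask over the image, shape (H, W)


@dataclass
class TokenITSample:
    image_id: str
    pairs: list[TokenMaskPair]  # ~90 pairs/image on average (1.8B / 20M)


# Toy example: a 4x4 image where one token covers the top-left 2x2 block.
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True
sample = TokenITSample(
    image_id="img_000001",
    pairs=[TokenMaskPair(token="Token", mask=mask)],
)
print(int(sample.pairs[0].mask.sum()))  # number of pixels the token covers
```

At token granularity, each mask localizes a single token rather than a full line or word box, which is what distinguishes TokenIT from image-level or word-level datasets in the comparison table above.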