We are excited to announce the release of `TokenOCR`, the first token-level visual foundation model specifically tailored for text-image-related tasks and designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data-production pipeline that constructs the first token-level image-text dataset, `TokenIT`, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation's exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, `TokenVL`, for VQA-based document understanding tasks.
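Each TokenIT sample ties a language token to the image pixels that render it. As a rough illustration of what one token-mask pair might look like (the field names and layout here are our own assumption, not the released schema):

```python
from dataclasses import dataclass


@dataclass
class TokenMaskPair:
    """Hypothetical record for one TokenIT token-mask pair.

    The real released schema may differ; this only illustrates the idea
    of pairing a BPE token with a pixel-level binary mask.
    """
    token: str             # BPE token string as produced by the tokenizer
    token_id: int          # id of the token in the tokenizer vocabulary
    mask: list             # binary mask over the image; 1 = pixel belongs to the token

    def area(self) -> int:
        """Number of foreground pixels covered by this token's mask."""
        return sum(sum(row) for row in self.mask)


# One toy pair on a 4x4 image: the token "OCR" occupies a 2x2 region.
pair = TokenMaskPair(token="OCR", token_id=42,
                     mask=[[0, 1, 1, 0],
                           [0, 1, 1, 0],
                           [0, 0, 0, 0],
                           [0, 0, 0, 0]])
```

At 1.8 billion pairs over 20 million images, this works out to roughly 90 token-mask pairs per image on average.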
## Token Family

## TokenIT

## TokenOCR
In the following table, we provide an overview of the InternViT 2.5 series.

| Model Name                | HF Link                                                                |
| :------------------------ | :--------------------------------------------------------------------- |
| InternViT-300M-448px-V2_5 | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) |
| InternViT-6B-448px-V2_5   | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5)   |
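As the model names indicate, both variants operate on 448×448-pixel tiles. A quick back-of-the-envelope sketch of the visual-token budget per tile, assuming the ViT patch size of 14 and the 0.5 pixel-unshuffle ratio used elsewhere in the InternVL series (neither number is stated in this README, so treat both as assumptions):

```python
def vit_token_count(image_size: int = 448, patch_size: int = 14,
                    downsample: float = 0.5) -> int:
    """Visual tokens per tile: patchify, then pixel-unshuffle.

    patch_size=14 and downsample=0.5 are assumed defaults borrowed from
    the InternVL series, not values stated in this model card.
    """
    patches_per_side = image_size // patch_size       # 448 // 14 = 32
    tokens_per_side = int(patches_per_side * downsample)
    return tokens_per_side ** 2


# 32x32 patches reduced to 16x16 = 256 visual tokens per 448px tile.
n_tokens = vit_token_count()
```

Under these assumptions each tile yields 256 visual tokens after downsampling, versus 1024 raw patches.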
## TokenVL

## Model Architecture

As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.
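The "ViT-MLP-LLM" wiring can be sketched as: ViT patch features pass through a randomly initialized MLP projector into the LLM's embedding space. Below is a minimal pure-Python sketch of such a projector; the dimensions, initialization scale, and exact activation are our illustrative assumptions, not the released configuration (the actual projector may differ, e.g. by including normalization layers):

```python
import random


def linear(x, w, b):
    """y = W x + b for a single feature vector x (pure-Python matmul)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def gelu_approx(v):
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    return [xi / (1.0 + 2.718281828 ** (-1.702 * xi)) for xi in v]


class MLPProjector:
    """Randomly initialized two-layer MLP mapping ViT features to LLM width.

    Dimensions are illustrative assumptions, not the released config.
    """

    def __init__(self, vit_dim: int, llm_dim: int, seed: int = 0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)]
                    for _ in range(rows)]

        self.w1, self.b1 = init(llm_dim, vit_dim), [0.0] * llm_dim
        self.w2, self.b2 = init(llm_dim, llm_dim), [0.0] * llm_dim

    def __call__(self, vit_feature):
        h = gelu_approx(linear(vit_feature, self.w1, self.b1))
        return linear(h, self.w2, self.b2)


# Project one toy 8-dim ViT patch feature into a 16-dim LLM embedding space.
proj = MLPProjector(vit_dim=8, llm_dim=16)
llm_token = proj([0.1] * 8)
```

Because the projector is randomly initialized while the ViT and LLM are pre-trained, training typically begins by aligning this bridge before (or alongside) tuning the larger components.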