We are excited to announce the release of `TokenOCR`, the first token-level visual foundation model specifically tailored for text-image-related tasks and designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, `TokenIT`, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation's exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, `TokenVL`, for VQA-based document understanding tasks.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/o9_FX5D8_NOS1gfnebp5s.png)

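To make the notion of a token-mask pair concrete, here is a minimal, purely hypothetical sketch of what one TokenIT-style record could look like: each language token that appears in an image is paired with a pixel mask locating it. The field names and layout below are illustrative assumptions, not the actual dataset schema.

```python
import numpy as np

H, W = 32, 32  # toy image size for illustration only

# Hypothetical record layout: one image plus a list of (token, mask) pairs,
# where each boolean mask marks the pixels rendering that token.
record = {
    "image": np.zeros((H, W, 3), dtype=np.uint8),
    "token_mask_pairs": [
        {"token": "hello", "mask": np.zeros((H, W), dtype=bool)},
        {"token": "world", "mask": np.zeros((H, W), dtype=bool)},
    ],
}

print(len(record["token_mask_pairs"]))  # 2
```

At the reported scale (20M images, 1.8B pairs) this corresponds to roughly 90 token-mask pairs per image on average.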
## Token Family

## TokenIT

## TokenOCR

In the following table, we provide an overview of the InternViT 2.5 series.

| Model Name | HF Link |
| :-----------------------: | :-------------------------------------------------------------------: |
| InternViT-300M-448px-V2_5 | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) |
| InternViT-6B-448px-V2_5 | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5) |

## TokenVL

## Model Architecture

As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.
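The "ViT-MLP-LLM" data flow described above can be sketched in a few lines: ViT patch tokens pass through a randomly initialized MLP projector into the LLM's embedding space, then are concatenated with text embeddings to form the LLM input sequence. The dimensions, the two-layer GELU MLP, and the random stand-in tensors below are illustrative assumptions, not the real model's sizes or weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration; the real ViT and LLM hidden sizes differ.
VIT_DIM, LLM_DIM, N_PATCH_TOKENS, N_TEXT_TOKENS = 64, 128, 16, 8

# Randomly initialized two-layer MLP projector: ViT space -> LLM space.
W1 = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def project(vit_tokens):
    """Map ViT patch tokens into the LLM embedding space (GELU MLP)."""
    h = vit_tokens @ W1 + b1
    # tanh approximation of GELU
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W2 + b2

# Stand-ins for the ViT output and the LLM's text token embeddings.
vit_tokens = rng.standard_normal((N_PATCH_TOKENS, VIT_DIM))
text_embeds = rng.standard_normal((N_TEXT_TOKENS, LLM_DIM))

# Projected visual tokens and text embeddings are concatenated into one
# sequence and fed to the LLM.
llm_input = np.concatenate([project(vit_tokens), text_embeds], axis=0)
print(llm_input.shape)  # (24, 128)
```

Because only the projector is freshly initialized, both the ViT and the LLM can start from their pre-trained weights, which is what makes swapping in a different vision backbone straightforward.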