We are excited to announce the release of `TokenOCR`, the first token-level visual foundation model specifically tailored for text-image-related tasks and designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenOCR, we also devise a high-quality data-production pipeline that constructs the first token-level image-text dataset, `TokenIT`, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation's exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenOCR to construct a document-level MLLM, `TokenVL`, for VQA-based document understanding tasks.
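Each TokenIT sample ties a language token to the image pixels that render it. As a rough illustration of what one token-mask pair might look like (the field names and layout here are our own assumption, not the released schema):

```python
from dataclasses import dataclass


@dataclass
class TokenMaskPair:
    """Hypothetical record for one TokenIT token-mask pair.

    The real released schema may differ; this only illustrates the idea
    of pairing a BPE token with a pixel-level binary mask.
    """
    token: str             # BPE token string as produced by the tokenizer
    token_id: int          # id of the token in the tokenizer vocabulary
    mask: list             # binary mask over the image; 1 = pixel belongs to the token

    def area(self) -> int:
        """Number of foreground pixels covered by this token's mask."""
        return sum(sum(row) for row in self.mask)


# One toy pair on a 4x4 image: the token "OCR" occupies a 2x2 region.
pair = TokenMaskPair(token="OCR", token_id=42,
                     mask=[[0, 1, 1, 0],
                           [0, 1, 1, 0],
                           [0, 0, 0, 0],
                           [0, 0, 0, 0]])
```

At 1.8 billion pairs over 20 million images, this works out to roughly 90 token-mask pairs per image on average.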
## Token Family

## TokenIT

## TokenOCR
In the following table, we provide an overview of the InternViT 2.5 series.

| Model Name                | HF Link                                                                |
| :------------------------ | :--------------------------------------------------------------------- |
| InternViT-300M-448px-V2_5 | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-300M-448px-V2_5) |
| InternViT-6B-448px-V2_5   | [🤗 link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V2_5)   |
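As the model names indicate, both variants operate on 448×448-pixel tiles. A quick back-of-the-envelope sketch of the visual-token budget per tile, assuming the ViT patch size of 14 and the 0.5 pixel-unshuffle ratio used elsewhere in the InternVL series (neither number is stated in this README, so treat both as assumptions):

```python
def vit_token_count(image_size: int = 448, patch_size: int = 14,
                    downsample: float = 0.5) -> int:
    """Visual tokens per tile: patchify, then pixel-unshuffle.

    patch_size=14 and downsample=0.5 are assumed defaults borrowed from
    the InternVL series, not values stated in this model card.
    """
    patches_per_side = image_size // patch_size       # 448 // 14 = 32
    tokens_per_side = int(patches_per_side * downsample)
    return tokens_per_side ** 2


# 32x32 patches reduced to 16x16 = 256 visual tokens per 448px tile.
n_tokens = vit_token_count()
```

Under these assumptions each tile yields 256 visual tokens after downsampling, versus 1024 raw patches.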
## TokenVL

## Model Architecture

As shown in the following figure, InternVL 2.5 retains the same model architecture as its predecessors, InternVL 1.5 and 2.0, following the "ViT-MLP-LLM" paradigm. In this new version, we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.
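The "ViT-MLP-LLM" wiring can be sketched as: ViT patch features pass through a randomly initialized MLP projector into the LLM's embedding space. Below is a minimal pure-Python sketch of such a projector; the dimensions, initialization scale, and exact activation are our illustrative assumptions, not the released configuration (the actual projector may differ, e.g. by including normalization layers):

```python
import random


def linear(x, w, b):
    """y = W x + b for a single feature vector x (pure-Python matmul)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def gelu_approx(v):
    """Sigmoid approximation of GELU: x * sigmoid(1.702 * x)."""
    return [xi / (1.0 + 2.718281828 ** (-1.702 * xi)) for xi in v]


class MLPProjector:
    """Randomly initialized two-layer MLP mapping ViT features to LLM width.

    Dimensions are illustrative assumptions, not the released config.
    """

    def __init__(self, vit_dim: int, llm_dim: int, seed: int = 0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)]
                    for _ in range(rows)]

        self.w1, self.b1 = init(llm_dim, vit_dim), [0.0] * llm_dim
        self.w2, self.b2 = init(llm_dim, llm_dim), [0.0] * llm_dim

    def __call__(self, vit_feature):
        h = gelu_approx(linear(vit_feature, self.w1, self.b1))
        return linear(h, self.w2, self.b2)


# Project one toy 8-dim ViT patch feature into a 16-dim LLM embedding space.
proj = MLPProjector(vit_dim=8, llm_dim=16)
llm_token = proj([0.1] * 8)
```

Because the projector is randomly initialized while the ViT and LLM are pre-trained, training typically begins by aligning this bridge before (or alongside) tuning the larger components.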