---
base_model: TokenOCR
license: mit
pipeline_tag: image-to-text
base_model_relation: finetune
library_name: internvl
---
<center>
<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>
[\[📂 GitHub\]](https://github.com/Token-family/TokenOCR) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenOCR_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenOCR)
</center>
<center>
<h2 style="color: #4CAF50;">Model Cards</h2>
The following table lists all models in the TokenOCR series.
| Model Name | Description |
| :-----------------------: | :-------------------------------------------------------------------: |
| R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese text. |
| [TokenOCR-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text. |
| [TokenOCR-4096-English-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_English_seg) | (We recommend 👍) Backbone is ViT; feature dimension is 4096; supports interaction with English text only. |
</center>
### Quick Start
> \[!Warning\]
> 🚨 Note: Because public data contains far fewer Chinese images than English ones, we recommend using the **`TokenOCR-4096-English-seg`** version.
```python
import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image
# ... (rest of the quickstart code)
```
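The elided steps above produce a token-to-image similarity map. As a minimal self-contained sketch of that computation (random features and hypothetical tensor shapes stand in for TokenOCR's actual outputs), the token embedding is compared against every image-patch feature by cosine similarity and the scores are reshaped into a 2D heat map:

```python
import numpy as np

# Hypothetical shapes for illustration; the real features come from TokenOCR.
h, w, dim = 16, 16, 2048

rng = np.random.default_rng(0)
image_feats = rng.standard_normal((h * w, dim))  # patch-level image features
token_embed = rng.standard_normal(dim)           # one text-token embedding

# Cosine similarity between the token embedding and every image patch.
norms = np.linalg.norm(image_feats, axis=-1) * np.linalg.norm(token_embed)
sims = image_feats @ token_embed / norms

# Reshape to a 2D heat map; peaks mark where the token appears in the image.
similarity_map = sims.reshape(h, w)
print(similarity_map.shape)  # (16, 16)
```

In the real pipeline, `post_process` and `generate_similiarity_map` from the repository's `utils` handle this on the model's features; the sketch only shows the shape of the computation.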
<center>
<h2 style="color: #4CAF50;">Introduction</h2>
</center>
We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks and designed to support a variety of traditional downstream applications. We also devise a high-quality data-production pipeline that constructs the first token-level image-text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Building on this foundation's exceptional image-as-text capability, we seamlessly replace the visual foundation models (VFMs) in prior pipelines with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
<center>
<h2 style="color: #4CAF50;">Token Family</h2>
</center>
<h2 style="color: #4CAF50;">TokenIT</h2>
(Content about TokenIT from the Github README)
<h2 style="color: #4CAF50;">TokenOCR</h2>
(Content about TokenOCR architecture and evaluation from the Github README)
<h2 style="color: #4CAF50;">TokenVL</h2>
(Content about TokenVL from the Github README)
## 🤚 Release Plans
(Release Plans from the Github README)
## License
This project is released under the MIT License.
## Citation
```BibTeX
@article{guan2025TokenOCR,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  journal={arXiv preprint arXiv:2503.02304},
  year={2025}
}
``` |