---
base_model: TokenOCR
license: mit
pipeline_tag: image-to-text
base_model_relation: finetune
library_name: internvl
---
<center>
<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>
[\[📂 GitHub\]](https://github.com/Token-family/TokenOCR) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenOCR_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenOCR)
</center>
<center>
<h2 style="color: #4CAF50;">Model Cards</h2>
The following table lists all models in the TokenOCR series.
| Model Name | Description |
| :-----------------------: | :-------------------------------------------------------------------: |
| R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese text. |
| [TokenOCR-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text. |
| [TokenOCR-4096-English-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_English_seg) | (We recommend 👍) Backbone is ViT; feature dimension is 4096; supports interaction with English text only. |
</center>
### Quick Start
> \[!Warning\]
> 🚨 Note: Because public data contains far fewer Chinese images than English ones, we recommend using the **`TokenOCR-4096-English-seg`** version.
```python
import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image
# ... (rest of the quickstart code)
```
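The elided steps above produce a token-to-image similarity map. As a minimal self-contained sketch of that computation (random features and hypothetical tensor shapes stand in for TokenOCR's actual outputs), the token embedding is compared against every image-patch feature by cosine similarity and the scores are reshaped into a 2D heat map:

```python
import numpy as np

# Hypothetical shapes for illustration; the real features come from TokenOCR.
h, w, dim = 16, 16, 2048

rng = np.random.default_rng(0)
image_feats = rng.standard_normal((h * w, dim))  # patch-level image features
token_embed = rng.standard_normal(dim)           # one text-token embedding

# Cosine similarity between the token embedding and every image patch.
norms = np.linalg.norm(image_feats, axis=-1) * np.linalg.norm(token_embed)
sims = image_feats @ token_embed / norms

# Reshape to a 2D heat map; peaks mark where the token appears in the image.
similarity_map = sims.reshape(h, w)
print(similarity_map.shape)  # (16, 16)
```

In the real pipeline, `post_process` and `generate_similiarity_map` from the repository's `utils` handle this on the model's features; the sketch only shows the shape of the computation.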
<center>
<h2 style="color: #4CAF50;">Introduction</h2>
</center>
We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model specifically tailored for text-image-related tasks and designed to support a variety of traditional downstream applications. We also devise a high-quality data-production pipeline that constructs the first token-level image-text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Building on this foundation's exceptional image-as-text capability, we seamlessly replace the visual foundation models (VFMs) in prior pipelines with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
<center>
<h2 style="color: #4CAF50;">Token Family</h2>
</center>
<h2 style="color: #4CAF50;">TokenIT</h2>
(Content about TokenIT from the Github README)
<h2 style="color: #4CAF50;">TokenOCR</h2>
(Content about TokenOCR architecture and evaluation from the Github README)
<h2 style="color: #4CAF50;">TokenVL</h2>
(Content about TokenVL from the Github README)
## 🤚 Release Plans
(Release Plans from the Github README)
## License
This project is released under the MIT License.
## Citation
```BibTeX
@article{guan2025TokenOCR,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  journal={arXiv preprint arXiv:2503.02304},
  year={2025}
}
``` |