---
base_model: TokenOCR
license: mit
pipeline_tag: image-to-text
base_model_relation: finetune
library_name: internvl
---
<center>
<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>
[\[📂 GitHub\]](https://github.com/Token-family/TokenOCR) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenOCR_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenOCR)
</center>
<center>
<h2 style="color: #4CAF50;">Model Cards</h2>
The table below lists all models in the TokenOCR series.
| Model Name | Description |
| :-----------------------: | :-------------------------------------------------------------------: |
| R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese text. |
| [TokenOCR-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text. |
| [TokenOCR-4096-English-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_English_seg) | (We recommend 👍) Backbone is ViT; feature dimension is 4096; supports interaction with English text only. |
</center>
### Quick Start
> \[!Warning\]
> 🚨 Note: Public data contains fewer Chinese images than English ones, so we recommend the **`TokenOCR-4096-English-seg`** version.
```python
import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image
# ... (rest of the quickstart code)
```
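The quick start imports a similarity-map helper from `utils`, whose implementation is not shown here. As a rough illustration of the general idea only, the sketch below scores one token embedding against every image patch feature with cosine similarity and squashes the result with a sigmoid. The function name `token_similarity_map`, the `(H, W, D)` feature layout, and the `temperature` parameter are assumptions made for this example; they are not the repository's API and may differ from what the released code does.

```python
import numpy as np


def token_similarity_map(patch_features, token_embedding, temperature=1.0):
    """Hypothetical helper: cosine similarity of one token against all patches.

    patch_features: array of shape (H, W, D), one D-dim feature per patch.
    token_embedding: array of shape (D,).
    Returns an (H, W) map with values in (0, 1) after a sigmoid.
    """
    feats = np.asarray(patch_features, dtype=np.float64)
    tok = np.asarray(token_embedding, dtype=np.float64)
    # L2-normalise both sides so the dot product is a cosine similarity.
    feats = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-12)
    tok = tok / (np.linalg.norm(tok) + 1e-12)
    cosine = feats @ tok  # shape (H, W)
    # Sigmoid keeps the map in (0, 1); temperature is an assumed knob.
    return 1.0 / (1.0 + np.exp(-cosine / temperature))


# Toy usage with random features (2048 matches the smaller feature dimension
# listed in the model table; the data itself is synthetic).
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 6, 2048))
token = features[1, 2]  # pretend the token matches the patch at row 1, col 2
sim = token_similarity_map(features, token)
row, col = np.unravel_index(np.argmax(sim), sim.shape)
```

In this toy setup the highest score lands on the patch the token was copied from, which is the behaviour one would expect from any similarity-based token-to-region lookup.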
<center>
<h2 style="color: #4CAF50;">Introduction</h2>
</center>
We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model tailored specifically to text-image tasks and designed to support a variety of traditional downstream applications. We also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, building on this foundation and its strong image-as-text capability, we replace previous visual foundation models (VFMs) with TokenOCR to construct a document-level MLLM, **`TokenVL`**, for VQA-based document understanding tasks.
<center>
<h2 style="color: #4CAF50;">Token Family</h2>
</center>
<h2 style="color: #4CAF50;">TokenIT</h2>
(Content about TokenIT from the GitHub README)
<h2 style="color: #4CAF50;">TokenOCR</h2>
(Content about TokenOCR architecture and evaluation from the GitHub README)
<h2 style="color: #4CAF50;">TokenVL</h2>
(Content about TokenVL from the GitHub README)
## 🤚 Release Plans
(Release Plans from the GitHub README)
## License
This project is released under the MIT License.
## Citation
```BibTeX
@article{guan2025TokenOCR,
title={A Token-level Text Image Foundation Model for Document Understanding},
author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
journal={arXiv preprint arXiv:2503.02304},
year={2025}
}
```