| | --- |
| | base_model: TokenOCR |
| | license: mit |
| | pipeline_tag: image-to-text |
| | base_model_relation: finetune |
| | library_name: internvl |
| | --- |
| | |
| | <center> |
| | |
| | <h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1> |
| | |
| | [\[📂 GitHub\]](https://github.com/Token-family/TokenOCR) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenOCR_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenOCR) |
| | |
| | </center> |
| | |
| | <center> |
| | |
| | <h2 style="color: #4CAF50;">Model Cards</h2> |
| | |
| | The following table lists all models in the TokenOCR series. |
| | |
| | | Model Name | Description | |
| | | :-----------------------: | :-------------------------------------------------------------------: | |
| | | R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese text. | |
| | | [TokenOCR-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text. | |
| | | [TokenOCR-4096-English-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_English_seg) | (Recommended 👍) Backbone is ViT; feature dimension is 4096; supports interaction with English text only. | |
| | </center> |
| | |
| | ### Quick Start |
| | |
| | > [!WARNING] |
| | > 🚨 Note: Because public datasets contain fewer Chinese text images than English ones, we recommend the **`TokenOCR-4096-English-seg`** version. |
| | |
| | ```python |
| | import os |
| | import torch |
| | from transformers import AutoTokenizer |
| | from internvl.model.internvl_chat import InternVLChatModel |
| | from utils import post_process, generate_similiarity_map, load_image |
| | |
| | # ... (rest of the quickstart code) |
| | ``` |
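The quick start imports a similarity-map helper from `utils`, but its implementation is not shown here. To illustrate the general idea behind such a map (scoring every image patch against one text-token embedding), here is a minimal sketch in plain Python so that it runs without the model or PyTorch installed. The names `cosine` and `similarity_map`, the grid layout, and the toy values are all assumptions made for illustration; this is not the repository's code, and the real helper may compute its scores differently.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_map(
    patch_features: Sequence[Sequence[Vector]], token_embedding: Vector
) -> List[List[float]]:
    """Hypothetical helper: one cosine score per patch in a rows x cols grid."""
    return [[cosine(patch, token_embedding) for patch in row] for row in patch_features]


# Toy 2x3 grid of 4-dimensional patch features (made-up values, not model output).
features = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]],
    [[0.0, 0.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0]],
]
query = [0.0, 2.0, 0.0, 0.0]  # stands in for one text-token embedding

sim = similarity_map(features, query)

# Grid position (row, col) of the patch that best matches the query token.
best = max(
    ((r, c) for r in range(len(sim)) for c in range(len(sim[0]))),
    key=lambda rc: sim[rc[0]][rc[1]],
)
```

In practice the patch features and token embedding would come from the model, and the resulting grid would be upsampled to the image resolution before being overlaid on the input image.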
| | |
| | <center> |
| | |
| | <h2 style="color: #4CAF50;">Introduction</h2> |
| | |
| | </center> |
| | |
| | We are excited to announce the release of **`TokenOCR`**, the first token-level visual foundation model tailored to text-image tasks and designed to support a variety of traditional downstream applications. We also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, **`TokenIT`**, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, building on this foundation model's exceptional image-as-text capability, we replace previous visual foundation models (VFMs) with TokenOCR to construct a document-level multimodal large language model (MLLM), **`TokenVL`**, for VQA-based document understanding tasks. |
| | |
| | <center> |
| | |
| | <h2 style="color: #4CAF50;">Token Family</h2> |
| | |
| | </center> |
| | |
| | <h2 style="color: #4CAF50;">TokenIT</h2> |
| | |
| | (Content about TokenIT from the GitHub README) |
| | |
| | <h2 style="color: #4CAF50;">TokenOCR</h2> |
| | |
| | (Content about TokenOCR architecture and evaluation from the GitHub README) |
| | |
| | <h2 style="color: #4CAF50;">TokenVL</h2> |
| | |
| | (Content about TokenVL from the GitHub README) |
| | |
| | ## 🤚 Release Plans |
| | |
| | (Release Plans from the GitHub README) |
| | |
| | ## License |
| | |
| | This project is released under the MIT License. |
| | |
| | ## Citation |
| | |
| | ```BibTeX |
| | @article{guan2025TokenOCR, |
| | title={A Token-level Text Image Foundation Model for Document Understanding}, |
| | author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang}, |
| | journal={arXiv preprint arXiv:2503.02304}, |
| | year={2025} |
| | } |
| | ``` |