---
base_model: TokenOCR
license: mit
pipeline_tag: image-to-text
base_model_relation: finetune
library_name: internvl
---
A Token-level Text Image Foundation Model for Document Understanding
Model Cards
The following table lists all models in the TokenOCR series.
| Model Name | Description |
|---|---|
| R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese text. |
| TokenOCR-2048-Bilingual-seg | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text. |
| TokenOCR-4096-English-seg | (Recommended 👍) Backbone is ViT; feature dimension is 4096; supports interaction with English text only. |
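The table can also be read as a small lookup. The sketch below encodes the three rows as plain data and picks a variant by required language and backbone. `ModelVariant`, `VARIANTS`, and `pick_variant` are hypothetical names used only for illustration; they are not part of the TokenOCR codebase, and the language codes `"en"` and `"zh"` are an assumed shorthand for English and Chinese.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVariant:
    """One row of the model table (illustrative container, not a TokenOCR API)."""
    name: str
    backbone: str
    feature_dim: int
    languages: frozenset


# Rows copied from the table above.
VARIANTS = (
    ModelVariant("R50", "ResNet-50", 2048, frozenset({"en", "zh"})),
    ModelVariant("TokenOCR-2048-Bilingual-seg", "ViT", 2048, frozenset({"en", "zh"})),
    ModelVariant("TokenOCR-4096-English-seg", "ViT", 4096, frozenset({"en"})),
)


def pick_variant(language: str, backbone: str = "ViT") -> ModelVariant:
    """Return the variant with the largest feature dimension that
    supports `language` and uses `backbone`."""
    candidates = [
        v for v in VARIANTS
        if language in v.languages and v.backbone == backbone
    ]
    if not candidates:
        raise ValueError(f"no variant supports {language!r} with backbone {backbone!r}")
    return max(candidates, key=lambda v: v.feature_dim)


if __name__ == "__main__":
    for lang in ("en", "zh"):
        print(lang, "->", pick_variant(lang).name)
```

Preferring the larger feature dimension mirrors the recommendation in the table for English text; Chinese text falls back to a bilingual variant because the 4096-dimensional model is English-only.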
Quick Start
🚨 Note: Because public data contains fewer Chinese images than English ones, we recommend using the TokenOCR-4096-English-seg version.
import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image
# ... (rest of the quickstart code)
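The helper names imported above suggest that the quickstart ends by comparing a text token's embedding with per-patch image features and rendering the result as a similarity map. The elided code is not reproduced here. The sketch below only illustrates that general idea, using cosine similarity on plain Python lists so that it runs without any dependencies. `cosine` and `similarity_map` are hypothetical names and are not the repository's `generate_similiarity_map`; the real models work on 2048- or 4096-dimensional tensors rather than the toy 2-dimensional vectors used here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_map(token_embedding, patch_features, height, width):
    """Score every image patch against one token embedding.

    patch_features is a flat, row-major list of height * width feature vectors.
    Returns a height x width grid of cosine similarities in [-1, 1].
    """
    if len(patch_features) != height * width:
        raise ValueError("patch_features must contain height * width vectors")
    scores = [cosine(token_embedding, patch) for patch in patch_features]
    return [scores[row * width:(row + 1) * width] for row in range(height)]


# Toy input (assumed values): a 2 x 2 grid of 2-dimensional patch features.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
token = [1.0, 0.0]
heatmap = similarity_map(token, patches, height=2, width=2)

if __name__ == "__main__":
    for row in heatmap:
        print(["%.3f" % score for score in row])
```

Patches whose features point in the same direction as the token embedding score close to 1, so high values in the grid mark the image regions most likely to contain that token.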
Introduction
We are excited to announce the release of TokenOCR, the first token-level visual foundation model tailored specifically to text-image tasks and designed to support a variety of traditional downstream applications. We also devise a high-quality data production pipeline and use it to construct TokenIT, the first token-level image-text dataset, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, building on TokenOCR's strong image-as-text capability, we use it in place of previous visual foundation models (VFMs) to construct TokenVL, a document-level multimodal large language model (MLLM) for VQA-based document understanding tasks.
Token Family
TokenIT
(Content about TokenIT from the GitHub README)
TokenOCR
(Content about TokenOCR architecture and evaluation from the GitHub README)
TokenVL
(Content about TokenVL from the GitHub README)
🤚 Release Plans
(Release Plans from the GitHub README)
License
This project is released under the MIT License.
Citation
@article{guan2025TokenOCR,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  journal={arXiv preprint arXiv:2503.02304},
  year={2025}
}