	---
	base_model: TokenOCR
	license: mit
	pipeline_tag: image-to-text
	base_model_relation: finetune
	library_name: internvl
	---

	<center>

	<h1 style="color: black;">A Token-level Text Image Foundation Model for Document Understanding</h1>


	[\[📂 GitHub\]](https://github.com/Token-family/TokenOCR) [\[📖 Paper\]](https://arxiv.org/pdf/2503.02304) [\[🆕 Project Pages\]](https://token-family.github.io/TokenOCR_project/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/TongkunGuan/TokenOCR)

	</center>

	<center>

	<h2 style="color: #4CAF50;">Model Cards</h2>

	The following table lists all models in the TokenOCR series.

	| Model Name | Description |
	| :-----------------------: | :-------------------------------------------------------------------: |
	| R50 | Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese text. |
	| [TokenOCR-2048-Bilingual-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_Bilingual_seg) | Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text. |
	| [TokenOCR-4096-English-seg](https://huggingface.co/TongkunGuan/TokenOCR_4096_English_seg) | (Recommended 👍) Backbone is ViT; feature dimension is 4096; supports interaction with English text only. |
	</center>

	### Quick Start

	> [!Warning]
	> 🚨 Note: Because public data contains fewer Chinese images than English ones, we recommend the `TokenOCR-4096-English-seg` version.
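
As a rough illustration of the recommendation above, the sketch below encodes the model table as plain data and picks a checkpoint by language. The `Checkpoint` class and the `pick_checkpoint` helper are hypothetical names introduced only for this example; they are not part of the TokenOCR code base, and the repo ids are simply copied from the links in the table.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Checkpoint:
    name: str
    backbone: str
    feature_dim: int
    languages: Tuple[str, ...]
    repo_id: Optional[str]  # None when the table gives no link


# Values copied from the model table; nothing here is queried from the Hub.
CHECKPOINTS = [
    Checkpoint("R50", "ResNet-50", 2048, ("en", "zh"), None),
    Checkpoint("TokenOCR-2048-Bilingual-seg", "ViT", 2048, ("en", "zh"),
               "TongkunGuan/TokenOCR_4096_Bilingual_seg"),  # repo id as linked in the table
    Checkpoint("TokenOCR-4096-English-seg", "ViT", 4096, ("en",),
               "TongkunGuan/TokenOCR_4096_English_seg"),
]


def pick_checkpoint(language: str = "en") -> Checkpoint:
    """Hypothetical helper: among linked checkpoints covering `language`,
    prefer the one with the largest feature dimension."""
    candidates = [c for c in CHECKPOINTS if language in c.languages and c.repo_id]
    if not candidates:
        raise ValueError(f"no linked checkpoint supports {language!r}")
    return max(candidates, key=lambda c: c.feature_dim)


print(pick_checkpoint("en").name)  # TokenOCR-4096-English-seg
print(pick_checkpoint("zh").name)  # TokenOCR-2048-Bilingual-seg
```

With this selection rule, English-only use resolves to the recommended 4096-dimensional model, while Chinese text falls back to the bilingual one.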

	```python
	import os
	import torch
	from transformers import AutoTokenizer
	from internvl.model.internvl_chat import InternVLChatModel
	from utils import post_process, generate_similiarity_map, load_image

	# ... (rest of the quickstart code)
	```
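
The rest of the quick start is elided above, so the repository's own `generate_similiarity_map` is not shown here. As a dependency-free illustration of the general idea that name suggests, the sketch below scores every location of an image feature grid against a single text token embedding with cosine similarity and thresholds the result into a mask. The function names, the cosine scoring, and the 0.5 threshold are all assumptions made for this example, not the repository's implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_map(features: List[List[Vector]], token: Vector) -> List[List[float]]:
    """Score every location of an H x W x C feature grid against one token embedding."""
    return [[cosine(cell, token) for cell in row] for row in features]


def binarize(sim: List[List[float]], threshold: float = 0.5) -> List[List[int]]:
    """Turn a similarity map into a 0/1 mask; the threshold is an arbitrary example value."""
    return [[1 if s >= threshold else 0 for s in row] for row in sim]


# Toy 2 x 2 grid with 3-dimensional features (the models above use 2048 or 4096).
features = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(1.0, 1.0, 0.0), (0.0, 0.0, 1.0)],
]
token = (1.0, 0.0, 0.0)

mask = binarize(similarity_map(features, token))
print(mask)  # [[1, 0], [1, 0]]
```

In practice this kind of scoring would be done with batched tensor operations in `torch` rather than nested Python lists; the list version is only meant to make the per-location comparison explicit.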

	<center>

	<h2 style="color: #4CAF50;">Introduction</h2>

	</center>

	We are excited to announce the release of `TokenOCR`, the first token-level visual foundation model specifically tailored for text-image-related tasks and designed to support a variety of traditional downstream applications. We also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, `TokenIT`, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, building on this foundation model and its strong image-as-text capability, we replace the visual foundation models (VFMs) used in previous work with TokenOCR to construct a document-level multimodal large language model (MLLM), `TokenVL`, for VQA-based document understanding tasks.
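
To make the term "token-mask pair" concrete, here is a minimal sketch of what one such record could look like, together with the average number of pairs per image implied by the dataset figures quoted above. The `TokenMaskPair` class and its fields are a hypothetical schema for illustration only; the actual TokenIT storage format is not described in this card.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenMaskPair:
    """Hypothetical record: one text token and the pixels it occupies."""
    token: str
    mask: List[List[int]]  # H x W grid, 1 where the token is rendered

    def area(self) -> int:
        """Number of pixels covered by the token."""
        return sum(sum(row) for row in self.mask)


pair = TokenMaskPair(token="Tok", mask=[[0, 1, 1, 0], [0, 1, 1, 0]])
print(pair.area())  # 4

# Dataset-level figures quoted in the introduction.
num_images = 20_000_000
num_pairs = 1_800_000_000
print(num_pairs // num_images)  # 90 token-mask pairs per image on average
```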

	<center>

	<h2 style="color: #4CAF50;">Token Family</h2>

	</center>

	<h2 style="color: #4CAF50;">TokenIT</h2>

	(Content about TokenIT from the GitHub README)

	<h2 style="color: #4CAF50;">TokenOCR</h2>

	(Content about TokenOCR architecture and evaluation from the GitHub README)

	<h2 style="color: #4CAF50;">TokenVL</h2>

	(Content about TokenVL from the GitHub README)

	## 🤚 Release Plans

	(Release Plans from the GitHub README)

	## License

	This project is released under the MIT License.

	## Citation

	```BibTeX
	@article{guan2025TokenOCR,
	title={A Token-level Text Image Foundation Model for Document Understanding},
	author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
	journal={arXiv preprint arXiv:2503.02304},
	year={2025}
	}
	```