base_model: TokenOCR
license: mit
pipeline_tag: image-to-text
base_model_relation: finetune
library_name: internvl

A Token-level Text Image Foundation Model for Document Understanding

[📂 GitHub] [📖 Paper] [🆕 Project Pages] [🤗 HF Demo]

Model Cards

The following table lists all models in the TokenOCR series.

Model Name	Description
R50	Backbone is ResNet-50; feature dimension is 2048; supports interaction with English and Chinese text.
TokenOCR-2048-Bilingual-seg	Backbone is ViT; feature dimension is 2048; supports interaction with English and Chinese text.
TokenOCR-4096-English-seg	(Recommended 👍) Backbone is ViT; feature dimension is 4096; supports interaction with English text only.
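
The table above can also be read as a small lookup from language to candidate models. The sketch below is only an illustration: the variant names and properties are copied from the table, while the `VARIANTS` dictionary and the `pick_variants` helper are hypothetical and not part of the TokenOCR code base.

```python
from typing import List

# Names, backbones, feature dimensions, and languages come from the table above.
# The dictionary layout and the helper are assumptions made for illustration.
VARIANTS = {
    "R50": {"backbone": "ResNet-50", "feature_dim": 2048, "languages": {"en", "zh"}},
    "TokenOCR-2048-Bilingual-seg": {"backbone": "ViT", "feature_dim": 2048, "languages": {"en", "zh"}},
    "TokenOCR-4096-English-seg": {"backbone": "ViT", "feature_dim": 4096, "languages": {"en"}},
}


def pick_variants(language: str) -> List[str]:
    """Return the variants that support `language`, largest feature dimension first."""
    matches = [name for name, spec in VARIANTS.items() if language in spec["languages"]]
    # sorted() is stable, so variants with equal feature_dim keep their table order.
    return sorted(matches, key=lambda name: VARIANTS[name]["feature_dim"], reverse=True)


if __name__ == "__main__":
    print(pick_variants("en"))
    print(pick_variants("zh"))
```

For English input the recommended TokenOCR-4096-English-seg comes first; for Chinese input only the two bilingual variants are returned.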

Quick Start

🚨 Note: Because public datasets contain fewer Chinese images than English ones, we recommend using the TokenOCR-4096-English-seg version.

import os
import torch
from transformers import AutoTokenizer
from internvl.model.internvl_chat import InternVLChatModel
from utils import post_process, generate_similiarity_map, load_image

# ... (rest of the quickstart code)
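
The `generate_similiarity_map` helper imported above is not shown in this excerpt. As a rough idea of what a token-level similarity map can look like, the sketch below assumes that the model yields one feature vector per image location and one embedding per text token, and that the map is the cosine similarity between the two. That is an assumption for illustration, not the repository's implementation; `cosine_similarity_map` and `threshold_mask` are hypothetical names, and the tiny 2-dimensional features stand in for the 2048- or 4096-dimensional ones listed in the model table.

```python
import math


def cosine_similarity_map(features, token_embedding):
    """Compare every location of an H x W feature grid with one token embedding.

    features: nested lists of shape H x W x D (assumed layout).
    token_embedding: list of length D.
    Returns an H x W grid of cosine similarities in [-1, 1].
    """
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    token_norm = norm(token_embedding)
    sim_map = []
    for row in features:
        sim_row = []
        for vec in row:
            denom = norm(vec) * token_norm
            dot = sum(a * b for a, b in zip(vec, token_embedding))
            # Guard against zero vectors instead of dividing by zero.
            sim_row.append(dot / denom if denom > 0 else 0.0)
        sim_map.append(sim_row)
    return sim_map


def threshold_mask(sim_map, threshold=0.5):
    """Turn a similarity map into a binary mask (1 where similarity >= threshold)."""
    return [[1 if s >= threshold else 0 for s in row] for row in sim_map]


if __name__ == "__main__":
    # Toy 2 x 2 grid of 2-dimensional features and one toy token embedding.
    features = [[[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 1.0], [-1.0, 0.0]]]
    token = [1.0, 0.0]
    sim = cosine_similarity_map(features, token)
    print(threshold_mask(sim))
```

In a real pipeline the same computation would normally be done with batched tensor operations; the plain-Python loops here only keep the example dependency-free.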

Introduction

We are excited to announce the release of TokenOCR, the first token-level visual foundation model (VFM) tailored to text-image tasks and designed to support a variety of traditional downstream applications. We also devised a high-quality data production pipeline and used it to construct TokenIT, the first token-level image-text dataset, comprising 20 million images and 1.8 billion token-mask pairs. Building on this foundation and its image-as-text capability, we replace the VFMs used in previous work with TokenOCR to construct TokenVL, a document-level multimodal large language model (MLLM) for VQA-based document understanding tasks.

Token Family

TokenIT

(Content about TokenIT from the GitHub README)

TokenOCR

(Content about TokenOCR architecture and evaluation from the GitHub README)

TokenVL

(Content about TokenVL from the GitHub README)

🤚 Release Plans

(Release Plans from the GitHub README)

License

This project is released under the MIT License.

Citation

@article{guan2025TokenOCR,
  title={A Token-level Text Image Foundation Model for Document Understanding},
  author={Tongkun Guan and Zining Wang and Pei Fu and Zhentao Guo and Wei Shen and Kai Zhou and Tiezhu Yue and Chen Duan and Hao Sun and Qianyi Jiang and Junfeng Luo and Xiaokang Yang},
  journal={arXiv preprint arXiv:2503.02304},
  year={2025}
}