Image-Text-to-Text
Transformers
Safetensors
youtu_vl
text-generation
conversational
custom_code

Improve model card: add transformers library, pipeline tag, and paper link

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +76 -35
README.md CHANGED
@@ -1,12 +1,14 @@
 ---
+base_model:
+- tencent/Youtu-LLM-2B
 license: other
 license_name: youtu-parsing
 license_link: https://huggingface.co/tencent/Youtu-Parsing/blob/main/LICENSE.txt
-pipeline_tag: image-text-to-text
-base_model:
-- tencent/Youtu-LLM-2B
+pipeline_tag: image-segmentation
+library_name: transformers
 base_model_relation: finetune
 ---
+
 <div align="center">
 
 # <img src="assets/youtu-parsing-logo.png" alt="Youtu-Parsing Logo" height="100px">
@@ -22,7 +24,9 @@ base_model_relation: finetune
 
 ## 🎯 Introduction
 
-**Youtu-Parsing** is a specialized document parsing model built upon the open-source Youtu-LLM 2B foundation. By extending the capabilities of the base model with a prompt-guided framework and NaViT-style dynamic visual encoder, Youtu-Parsing offers enhanced parsing capabilities for diverse document elements including text, tables, formulas, and charts. The model incorporates an efficient parallel decoding mechanism that significantly accelerates inference, making it practical for real-world document analysis applications. We share Youtu-Parsing with the community to facilitate research and development in document understanding.
+**Youtu-Parsing** is a specialized document parsing model built upon the open-source Youtu-LLM 2B foundation, as presented in the paper [Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision](https://huggingface.co/papers/2601.19798).
+
+By extending the capabilities of the base model with a prompt-guided framework and NaViT-style dynamic visual encoder, Youtu-Parsing offers enhanced parsing capabilities for diverse document elements including text, tables, formulas, and charts. The model incorporates an efficient parallel decoding mechanism that significantly accelerates inference, making it practical for real-world document analysis applications. We share Youtu-Parsing with the community to facilitate research and development in document understanding.
 
 
 ## ✨ Key Features
@@ -62,33 +66,70 @@ base_model_relation: finetune
 <a id="quickstart"></a>
 
 ## 🚀 Quick Start
-### Install packages
-```bash
-conda create -n youtu_parsing python=3.10
-conda activate youtu_parsing
-pip install git+https://github.com/TencentCloudADP/youtu-parsing.git#subdirectory=youtu_hf_parser
-
-# install the flash-attn2
-# For CUDA 12.x + PyTorch 2.6 + Python 3.10 + Linux x86_64:
-pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
 
-# Alternative: Install from PyPI
-pip install flash-attn==2.7.0
+### Installation
+Ensure your Python environment has the `transformers` library installed:
+```bash
+pip install "transformers>=4.56.0,<=4.57.1" torch accelerate pillow torchvision opencv-python-headless
 ```
 
-### Usage with transformers
+### Usage with Transformers
+You can interact with the model using the `transformers` library:
+
 ```python
-from youtu_hf_parser import YoutuOCRParserHF
+from transformers import AutoProcessor, AutoModelForCausalLM
+import torch
+
+model = AutoModelForCausalLM.from_pretrained(
+    "tencent/Youtu-VL-4B-Instruct",
+    attn_implementation="flash_attention_2",
+    torch_dtype="auto",
+    device_map="cuda",
+    trust_remote_code=True
+).eval()
+
+processor = AutoProcessor.from_pretrained(
+    "tencent/Youtu-VL-4B-Instruct",
+    use_fast=True,
+    trust_remote_code=True
+)
 
-# Initialize the parser
-parser = YoutuOCRParserHF(
-    model_path=model_path,
-    enable_angle_correct=True,  # Set to False to disable angle correction
-    angle_correct_model_path=angle_correct_model_path
+img_path = "path/to/your/image.png"
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": img_path},
+            {"type": "text", "text": "Describe the image"},
+        ],
+    }
+]
+
+inputs = processor.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_dict=True,
+    return_tensors="pt"
+).to(model.device)
+
+generated_ids = model.generate(
+    **inputs,
+    temperature=0.1,
+    top_p=0.001,
+    repetition_penalty=1.05,
+    do_sample=True,
+    max_new_tokens=32768,
 )
 
-# Parse an image
-parser.parse_file(input_path=image_path, output_dir=output_dir)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+outputs = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(f"Youtu-VL output:
+{outputs[0]}")
 ```
 
 ## 🎨 Visualization
@@ -138,16 +179,6 @@ We would like to thank [Youtu-LLM](https://github.com/TencentCloudADP/youtu-tip/
 
 If you find our work useful in your research, please consider citing the following paper:
 ```
-@article{youtu-parsing,
-title={Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding},
-author={Tencent Youtu Lab},
-year={2026},
-eprint={},
-archivePrefix={},
-primaryClass={},
-url={},
-}
-
 @article{youtu-vl,
 title={Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision},
 author={Tencent Youtu Lab},
@@ -158,6 +189,16 @@ If you find our work useful in your research, please consider citing the followi
 url={https://arxiv.org/abs/2601.19798},
 }
 
+@article{youtu-parsing,
+title={Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding},
+author={Tencent Youtu Lab},
+year={2026},
+eprint={},
+archivePrefix={},
+primaryClass={},
+url={},
+}
+
 @article{youtu-llm,
 title={Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models},
 author={Tencent Youtu Lab},
@@ -167,4 +208,4 @@ If you find our work useful in your research, please consider citing the followi
 primaryClass={cs.CL},
 url={https://arxiv.org/abs/2512.24618},
 }
-```
+```
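The usage snippet added by this PR slices the prompt tokens off each generated sequence before decoding. A minimal standalone sketch of that trimming step, with plain Python lists standing in for the tensors (the token ids here are made up for illustration):

```python
# Sketch of the prompt-trimming step from the new usage snippet:
# each sequence returned by generate() begins with the prompt tokens,
# so we slice them off before decoding. Hypothetical token ids.
input_ids = [[101, 7, 8, 9]]                  # prompt token ids
generated_ids = [[101, 7, 8, 9, 42, 43, 44]]  # prompt + newly generated ids

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)
]
print(generated_ids_trimmed)  # [[42, 43, 44]]
```

Only the newly generated ids remain, which is what gets passed to `batch_decode`.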