Add library_name and project page link

#3 opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +10 -7
README.md CHANGED
@@ -1,15 +1,17 @@
 ---
 license: other
 license_name: youtu-vl
-extra_gated_eu_disallowed: true
 license_link: https://huggingface.co/tencent/Youtu-VL-4B-Instruct/blob/main/LICENSE.txt
 pipeline_tag: image-text-to-text
+extra_gated_eu_disallowed: true
+library_name: transformers
 ---
+
 <div align="center">
 
 # <img src="assets/youtu-vl-logo.png" alt="Youtu-VL Logo" height="100px">
 
-[📃 License](LICENSE.txt) • [💻 Code](https://github.com/TencentCloudADP/youtu-vl) • [📑 Technical Report](https://arxiv.org/abs/2601.19798) • [📊 Benchmarks](#benchmarks) • [🚀 Getting Started](#quickstart)
+[🏠 Project Page](https://youtu-tip.com/#llm) • [📃 License](LICENSE.txt) • [💻 Code](https://github.com/TencentCloudADP/youtu-vl) • [📑 Technical Report](https://arxiv.org/abs/2601.19798) • [📊 Benchmarks](#benchmarks) • [🚀 Getting Started](#quickstart)
 </div>
 
 ## 🎯 Introduction
@@ -23,7 +25,7 @@ pipeline_tag: image-text-to-text
 
 - **Promising Performance with High Efficiency**: Despite its compact 4B-parameter architecture, the model achieves competitive results across a wide range of general multimodal tasks, including general visual question answering (VQA), multimodal reasoning and mathematics, optical character recognition (OCR), multi-image and real-world understanding, hallucination evaluation, and GUI agent tasks.
 
-<p align="center">
+<p align="center\">
 <img src="assets/youtu-vl-overview.png" width="90%"/>
 <p>
 
@@ -40,7 +42,7 @@ pipeline_tag: image-text-to-text
 
 - **Vision-Centric Prediction with a Standard Architecture (no task-specific modules)**: Youtu-VL treats image and text tokens with equivalent autoregressive status, empowering it to perform vision-centric tasks for both dense vision prediction (e.g., segmentation, depth) and text-based prediction (e.g., grounding, detection) within a standard VLM architecture, eliminating the need for task-specific additions. This design yields a versatile general-purpose VLM, allowing a single model to flexibly accommodate a wide range of vision-centric and vision-language requirements.
 
-<p align="center">
+<p align="center\">
 <img src="assets/architecture.png" width="90%"/>
 <p>
 
@@ -49,7 +51,7 @@ pipeline_tag: image-text-to-text
 
 ### Vision-Centric Tasks
 
-<p align="center">
+<p align="center\">
 <img src="assets/vision-centric-performance.png" width="90%"/>
 <p>
 
@@ -57,7 +59,7 @@ pipeline_tag: image-text-to-text
 ### General Multimodal Tasks
 
 
-<p align="center">
+<p align="center\">
 <img src="assets/general-multimodal-performance.png" width="90%"/>
 <p>
 
@@ -122,7 +124,8 @@ outputs = processor.batch_decode(
     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
 )
 generated_text = outputs[0]
-print(f"Youtu-VL output:\n{generated_text}")
+print(f"Youtu-VL output:
+{generated_text}")
 ```
 
 ## 🎉 Citation
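
For context on the last hunk: `generated_ids_trimmed` in the quickstart typically strips the echoed prompt tokens from each generated sequence before `batch_decode`, so only the model's continuation is printed. A minimal, framework-free sketch of that trimming step, using plain Python lists in place of tensors (the function name is an assumption for illustration, not part of the repo's API):

```python
def trim_generated(input_ids, generated_ids):
    """Drop the echoed prompt prefix from each generated sequence,
    keeping only the newly generated continuation tokens."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Prompt of 3 tokens; the model echoes it and appends 2 new tokens.
prompts = [[101, 7, 8]]
outputs = [[101, 7, 8, 42, 102]]
print(trim_generated(prompts, outputs))  # -> [[42, 102]]
```

With real `transformers` outputs the same slicing is done per batch row on the tensors returned by `model.generate`, before the decoded text is printed by the `print(f"Youtu-VL output:\n{generated_text}")` line shown in the diff.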