Add pipeline tag, library name and paper link to model card

#1 by nielsr (HF Staff) - opened

Files changed (1): README.md (+14, -81)

README.md CHANGED
@@ -1,3 +1,13 @@
+---
+license: other
+library_name: transformers
+pipeline_tag: image-text-to-text
+tags:
+- moe
+- multimodal
+- vision
+---
+
 <div align="center">
 <h1>
 Yuan3.0 Ultra Multimodal Foundation Large Language Model
@@ -15,13 +25,12 @@
 
 </div>
 
-
-
-
-
 -----
 
+Yuan3.0 Ultra is a trillion-parameter Mixture-of-Experts (MoE) multimodal large language model designed for enterprise-grade scenarios.
 
+- **Paper:** [Yuan3.0 Ultra: A Trillion-Parameter Enterprise-Oriented MoE LLM](https://huggingface.co/papers/2601.14327)
+- **Repository:** [GitHub - Yuan-lab-LLM/Yuan3.0-Ultra](https://github.com/Yuan-lab-LLM/Yuan3.0-Ultra)
 
 ## Recent Updates πŸŽ‰πŸŽ‰
 
@@ -65,20 +74,12 @@ Yuan3.0 Ultra delivers outstanding performance on retrieval-augmented generation
 
 Fig.2: Benchmark Evaluation Results of Yuan3.0 Ultra
 
-
-
 </div>
 
 ## 3. Core Technologies
 
 ### Layer-Adaptive Expert Pruning (LAEP)
 
-The evolution of expert load during large model pre-training can be divided into two phases:
-* **Phase 1 β€” Initial Transition Phase**: Occurring at the early stage of model pre-training, where expert loads exhibit substantial volatility inherited from random initialization, with the number of tokens routed to the same expert potentially varying by orders of magnitude;
-* **Phase 2 β€” Stable Phase**: Expert token loads across experts become temporally stable, with per-expert token counts exhibiting only relatively minor fluctuations;
-
-In the stable phase of training, expert token loads are highly imbalanced: a small number of experts carry a large share of computation while some experts remain persistently underutilized, leading to wasted computational resources. The disparity between the highest- and lowest-load experts in the stable phase can reach nearly 500Γ—.
-
 LAEP adaptively prunes low-load experts layer by layer according to the token distribution in each layer during the stable phase, and proposes an expert rearrangement algorithm that greedily rearranges the remaining experts across computing devices to achieve balanced load. Yuan3.0 Ultra begins pre-training with 1515B parameters and applies LAEP during the stable phase, achieving 33.3% parameter reduction and a 49% improvement in pre-training efficiency.
 
 ### Revised Reflection Inhibition Reward Mechanism (RIRM)
@@ -96,115 +97,47 @@ During the Fast-thinking RL phase, models tend to produce excessive reflection s
 | Yuan3.0 Ultra int4 | 1.01T | 4bit | 64K | HuggingFace | [ModelScope]( https://modelscope.cn/models/YuanLabAI/Yuan3.0-Ultra-int4 ) \| [HuggingFace]( https://huggingface.co/YuanLabAI/Yuan3.0-Ultra-int4 ) \| [WiseModel]( https://www.wisemodel.cn/models/YuanLabAI/Yuan3.0-Ultra-int4 )
 
 
-
 ## 5. Evaluation Results
 
 Yuan3.0 Ultra achieves leading performance across multiple enterprise-level core benchmarks.
 
 ### 5.1 Multimodal RAG Evaluation: Docmatix πŸ†
 
-Docmatix evaluates a model's comprehensive ability to retrieve, associate, and accurately answer questions across multiple modalities (text, tables, images) within multi-page, complex documents.
-
 | Model | Accuracy (%) |
 |---|:---:|
 | GPT-4o | 56.8 |
-| o3 | 45.6 |
-| GPT-5.1 | 48.5 |
-| GPT-5.2 | 48.4 |
-| Gemini 3.1 Pro | 35.3 |
-| Claude Opus 4.6 | 46.2 |
-| Kimi K2.5 | 36.9 |
 | **Yuan3.0 Ultra** | **67.4** |
 
----
-
 ### 5.2 Text RAG Evaluation: ChatRAG πŸ†
 
-ChatRAG comprises 10 tasks, covering long-context retrieval (D2D, QuAC, QReCC), short-context and structured retrieval (CoQA, DoQA, CFQA, SQA, HDial), and Wikipedia-based retrieval (TCQA, INSCIT). Yuan3.0 Ultra achieves an average accuracy of **68.2%**, ranking first on 9 out of 10 tasks.
-
 | Model | Avg. | D2D | QuAC | QReCC | CoQA | DoQA | CFQA | SQA | TCQA | HDial | INSCIT |
 |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| DeepSeek-V3 | 50.5 | 31.6 | 28.9 | 49.3 | 77.0 | 26.1 | 83.5 | 82.1 | 46.7 | 47.4 | 32.1 |
-| GPT-4o | 50.5 | 32.8 | 26.6 | 49.3 | 76.1 | 28.8 | 81.9 | 81.1 | 49.8 | 41.3 | 26.7 |
-| o3 | 44.1 | 23.1 | 20.8 | 40.4 | 69.4 | 18.6 | 67.8 | 86.7 | 45.9 | 41.3 | 26.7 |
-| DeepSeek-R1 | 39.4 | 21.5 | 22.2 | 42.4 | 62.5 | 24.7 | 81.5 | 82.1 | 30.7 | 38.0 | 28.7 |
-| GPT-5.1 | 46.1 | 28.2 | 23.2 | 45.4 | 68.8 | 20.9 | 73.1 | 81.3 | 44.7 | 45.4 | 30.0 |
-| GPT-5.2 | 45.6 | 30.2 | 23.1 | 47.0 | 64.8 | 25.3 | 72.3 | 79.1 | 38.3 | 45.3 | 30.9 |
-| Gemini 3.1 Pro | 49.7 | 33.1 | 27.3 | 47.0 | 73.5 | 34.2 | 75.7 | 85.5 | 42.4 | 48.2 | 30.3 |
-| Claude Opus 4.6 | 52.9 | 35.3 | 26.6 | 49.4 | 76.4 | 37.3 | **86.5** | 85.5 | 50.2 | 48.9 | 33.2 |
-| Kimi K2.5 | 53.6 | 34.6 | 30.9 | 49.9 | 82.5 | 35.8 | 82.3 | 83.6 | 50.8 | 51.1 | 34.4 |
 | **Yuan3.0 Ultra** | **68.2** | **55.8** | **54.5** | **57.3** | **94.6** | **63.4** | 79.8 | **91.0** | **72.4** | **72.9** | **40.0** |
 
----
-
 ### 5.3 Multimodal Complex Table Understanding Evaluation: MMTab
 
-MMTab spans 15 evaluation sets, covering task types including table question answering, fact checking, and long-context table processing. Yuan3.0 Ultra surpasses Claude Opus 4.6 and Gemini 3.1 Pro with an average accuracy of **62.3%**, demonstrating comprehensive and well-balanced multimodal table processing capability.
-
 | Model | Avg. | TABMWP | WTQ | HiTab | TAT-QA | FeTaQA | TabFact | InfoTabs |
 |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
-| GPT-5.1 | 55.2 | 65.0 | 60.8 | **77.8** | 61.4 | 8.7 | 52.8 | 64.3 |
-| GPT-5.2 | 37.3 | 67.2 | 69.8 | 15.8 | 28.0 | 6.2 | 63.5 | 69.3 |
-| Gemini 3.1 Pro | 45.1 | 80.1 | **79.6** | 48.3 | 50.5 | 9.6 | 71.1 | 74.4 |
-| Claude Opus 4.6 | 39.8 | 67.6 | 76.0 | 44.1 | 44.5 | 12.0 | 30.7 | 59.6 |
-| Kimi K2.5 | **66.2** | **95.9** | 79.3 | 63.9 | 62.4 | 7.4 | **90.6** | 81.8 |
 | **Yuan3.0 Ultra** | 62.3 | 91.8 | 77.9 | 67.6 | **74.9** | **39.2** | 90.4 | **89.7** |
 
-
-*Full results across all 15 tasks are available in the technical report.*
-
----
-
 ### 5.4 Text Summarization Evaluation: SummEval πŸ†
 
-SummEval comprehensively evaluates summarization quality from three dimensions: lexical overlap (ROUGE-1/2), semantic similarity (BERTScore), and factual consistency (SummaC), serving as an important reference for historical information compression capability in Agent applications. Yuan3.0 Ultra achieves an average accuracy of **62.8%**.
-
 | Model | Avg. | ROUGE-1 | ROUGE-2 | BERTScore | SummaC |
 |---|:---:|:---:|:---:|:---:|:---:|
-| DeepSeek-V3 | 59.3 | 25.5 | 9.2 | 86.3 | **68.2** |
-| DeepSeek-V3.2 | 51.4 | 33.3 | 11.9 | 85.6 | 41.8 |
-| GPT-4o | 46.5 | 25.0 | 8.9 | 85.9 | 32.5 |
-| GPT-5.1 | 49.4 | 27.5 | 10.2 | 84.6 | 40.5 |
-| GPT-5.2 | 48.6 | 30.3 | 10.7 | 84.9 | 36.4 |
-| Gemini 3.1 Pro | 48.5 | 32.4 | 11.4 | 85.4 | 34.3 |
-| Claude Opus 4.6 | 49.9 | 33.1 | 11.0 | 85.9 | 37.8 |
-| Kimi K2.5 | 49.8 | 32.3 | 11.3 | 85.4 | 38.2 |
 | **Yuan3.0 Ultra** | **62.8** | **59.1** | **41.0** | **91.1** | 45.4 |
 
----
-
 ### 5.5 Tool Invocation Evaluation: BFCL V3
 
-BFCL V3 evaluates real-world tool invocation capability across dimensions including static function selection (Non-Live AST), dynamic real-time execution (Live AST), multi-turn context maintenance (Multi-turn), relevance detection (Relevance), and irrelevant call rejection (Irrelevance Detection). Yuan3.0 Ultra delivers balanced performance across all categories, achieving an average score of **67.8%**, with particular strength in Irrelevance Detection (86.0%).
-
 | Model | Avg. | Non-Live AST | Live AST | Multi-turn | Relevance | Irrelevance |
 |---|:---:|:---:|:---:|:---:|:---:|:---:|
-| Qwen3-235B-A22B | 68.0 | 87.9 | 77.0 | 40.1 | **83.3** | 76.3 |
-| Claude-3.7-Sonnet | 58.6 | 41.3 | 78.4 | 48.4 | 72.2 | 81.4 |
-| GPT-5.2 | 60.6 | 80.9 | 76.2 | 24.6 | 72.2 | 79.7 |
-| Gemini 3.1 Pro | **78.8** | **91.5** | **84.9** | **60.3** | 61.1 | **88.2** |
-| Claude Opus 4.6 | 74.9 | 88.2 | 78.9 | 59.8 | 61.1 | 78.0 |
-| Kimi K2.5 | 70.6 | 86.4 | 78.6 | 48.6 | 61.1 | 77.0 |
 | **Yuan3.0 Ultra** | 67.8 | 81.7 | 74.5 | 45.3 | 66.7 | 86.0 |
 
-
-
-
----
-
 ### 5.6 Text-to-SQL Evaluation: Spider 1.0 & BIRD
 
-Spider 1.0 and BIRD are two major benchmarks in the Text-to-SQL domain. Yuan3.0 Ultra demonstrates strong performance on both evaluations.
-
 | Model | Spider 1.0 | BIRD |
 |---|:---:|:---:|
-| Qwen3.5-397B-A17B | 82.4 | 39.6 |
-| DeepSeek-V3.2 | 80.7 | 38.9 |
-| Kimi K2.5 | 82.7 | **43.5** |
 | **Yuan3.0 Ultra** | **83.9** | 39.2 |
 
 
 ## 6. License
-Use of Yuan 3.0 code and models must comply with the [Yuan 3.0 Model License Agreement](https://github.com/Yuan-lab-LLM/Yuan3.0?tab=License-1-ov-file). Yuan 3.0 models support commercial use and do not require an application for authorization. Please familiarize yourself with and adhere to the agreement. Do not use the open-source models, code, or any derivatives produced from this open-source project for any purposes that may cause harm to the nation or society, or for any services that have not undergone safety assessment and registration.
-
-Although measures have been taken during training to ensure data compliance and accuracy to the best of our ability, given the enormous scale of model parameters and the influence of probabilistic randomness, we cannot guarantee the accuracy of generated outputs, and models are susceptible to being misled by input instructions. This project assumes no responsibility for data security risks, public opinion risks, or any risks and liabilities arising from the model being misled, misused, disseminated, or improperly exploited due to the use of open-source models and code. You shall bear full and sole responsibility for all risks and consequences arising from your use, copying, distribution, and modification of this open-source project.
+Use of Yuan 3.0 code and models must comply with the [Yuan 3.0 Model License Agreement](https://github.com/Yuan-lab-LLM/Yuan3.0?tab=License-1-ov-file). Yuan 3.0 models support commercial use and do not require an application for authorization. Please familiarize yourself with and adhere to the agreement. Do not use the open-source models, code, or any derivatives produced from this open-source project for any purposes that may cause harm to the nation or society, or for any services that have not undergone safety assessment and registration.
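
The substance of this PR is the YAML frontmatter block it prepends to README.md; the Hub reads those keys (`pipeline_tag`, `library_name`, `tags`, `license`) to surface the model correctly. As an illustration of what that block encodes, here is a minimal hand-rolled sketch of extracting the frontmatter from a model card. This is not the Hub's actual parser, and `parse_front_matter` is a hypothetical helper written for this example; it only handles the flat keys and simple lists used here.

```python
# Sketch: pull the key/value pairs out of a leading "---"-delimited YAML
# frontmatter block, like the one this PR adds to README.md.
# Hand-rolled for illustration only; a real tool would use a YAML library.
import re


def parse_front_matter(readme_text: str) -> dict:
    """Return key/value pairs from a leading '---' frontmatter block."""
    m = re.match(r"^---\n(.*?)\n---\n", readme_text, re.DOTALL)
    if not m:
        return {}
    meta: dict = {}
    current_list_key = None  # tracks a "key:" line that opens a list
    for line in m.group(1).splitlines():
        if line.startswith("- ") and current_list_key:
            meta[current_list_key].append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if value:  # scalar value on the same line
                meta[key] = value
                current_list_key = None
            else:  # bare "key:" starts a list of "- item" lines
                meta[key] = []
                current_list_key = key
    return meta


# The exact block added by this PR:
readme = """---
license: other
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- moe
- multimodal
- vision
---

# Yuan3.0 Ultra
"""

print(parse_front_matter(readme))
# β†’ {'license': 'other', 'library_name': 'transformers',
#    'pipeline_tag': 'image-text-to-text',
#    'tags': ['moe', 'multimodal', 'vision']}
```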