Add pipeline tag, library name and paper link to model card
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,3 +1,13 @@
<div align="center">
<h1>
Yuan3.0 Ultra Multimodal Foundation Large Language Model

@@ -15,13 +25,12 @@

</div>

-----

## Recent Updates

@@ -65,20 +74,12 @@ Yuan3.0 Ultra delivers outstanding performance on retrieval-augmented generation
Fig.2: Benchmark Evaluation Results of Yuan3.0 Ultra

</div>

## 3. Core Technologies

### Layer-Adaptive Expert Pruning (LAEP)

The evolution of expert load during large model pre-training can be divided into two phases:

* **Phase 1 - Initial Transition Phase**: occurring at the early stage of pre-training, where expert loads exhibit substantial volatility inherited from random initialization, with the number of tokens routed to the same expert potentially varying by orders of magnitude;
* **Phase 2 - Stable Phase**: expert token loads become temporally stable, with per-expert token counts exhibiting only relatively minor fluctuations.

In the stable phase, expert token loads are highly imbalanced: a small number of experts carry a large share of the computation while others remain persistently underutilized, wasting computational resources. The disparity between the highest- and lowest-load experts in the stable phase can reach nearly 500×.

LAEP adaptively prunes low-load experts layer by layer according to each layer's token distribution during the stable phase, and introduces an expert rearrangement algorithm that greedily redistributes the remaining experts across computing devices to balance load. Yuan3.0 Ultra begins pre-training with 1515B parameters and applies LAEP during the stable phase, achieving a 33.3% parameter reduction and a 49% improvement in pre-training efficiency.
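The pruning and rearrangement steps can be sketched as follows. This is a toy illustration, not the production algorithm: the keep ratio mirrors the reported 33.3% parameter reduction, while the loads, device count, and tie-breaking are invented for the example.

```python
def laep_prune(expert_loads, keep_ratio=2/3):
    """Keep the highest-load experts in a layer; a 2/3 keep ratio mirrors
    the 33.3% parameter reduction reported for Yuan3.0 Ultra."""
    n_keep = max(1, round(len(expert_loads) * keep_ratio))
    ranked = sorted(range(len(expert_loads)),
                    key=lambda i: expert_loads[i], reverse=True)
    return sorted(ranked[:n_keep])

def rearrange(expert_loads, kept, n_devices):
    """Greedy rearrangement: place experts in descending load order,
    always onto the currently least-loaded device."""
    devices = [[] for _ in range(n_devices)]
    totals = [0] * n_devices
    for e in sorted(kept, key=lambda i: expert_loads[i], reverse=True):
        d = totals.index(min(totals))
        devices[d].append(e)
        totals[d] += expert_loads[e]
    return devices, totals

# Invented per-expert token counts for one layer in the stable phase.
loads = [900, 30, 410, 2, 615, 88, 350, 5, 270]
kept = laep_prune(loads)              # prunes the three lowest-load experts
devices, totals = rearrange(loads, kept, n_devices=2)
```

On the example loads, the three near-idle experts (indices 1, 3, and 7) are pruned and the rest are split across two devices with near-equal token totals.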

### Revised Reflection Inhibition Reward Mechanism (RIRM)
@@ -96,115 +97,47 @@ During the Fast-thinking RL phase, models tend to produce excessive reflection s

| Yuan3.0 Ultra int4 | 1.01T | 4bit | 64K | HuggingFace | [ModelScope](https://modelscope.cn/models/YuanLabAI/Yuan3.0-Ultra-int4) \| [HuggingFace](https://huggingface.co/YuanLabAI/Yuan3.0-Ultra-int4) \| [WiseModel](https://www.wisemodel.cn/models/YuanLabAI/Yuan3.0-Ultra-int4) |
## 5. Evaluation Results

Yuan3.0 Ultra achieves leading performance across multiple enterprise-level core benchmarks.

### 5.1 Multimodal RAG Evaluation: Docmatix

Docmatix evaluates a model's comprehensive ability to retrieve, associate, and accurately answer questions across multiple modalities (text, tables, images) within multi-page, complex documents.

| Model | Accuracy (%) |
|---|:---:|
| GPT-4o | 56.8 |
| o3 | 45.6 |
| GPT-5.1 | 48.5 |
| GPT-5.2 | 48.4 |
| Gemini 3.1 Pro | 35.3 |
| Claude Opus 4.6 | 46.2 |
| Kimi K2.5 | 36.9 |
| **Yuan3.0 Ultra** | **67.4** |

---
### 5.2 Text RAG Evaluation: ChatRAG

ChatRAG comprises 10 tasks, covering long-context retrieval (D2D, QuAC, QReCC), short-context and structured retrieval (CoQA, DoQA, CFQA, SQA, HDial), and Wikipedia-based retrieval (TCQA, INSCIT). Yuan3.0 Ultra achieves an average accuracy of **68.2%**, ranking first on 9 out of 10 tasks.

| Model | Avg. | D2D | QuAC | QReCC | CoQA | DoQA | CFQA | SQA | TCQA | HDial | INSCIT |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| DeepSeek-V3 | 50.5 | 31.6 | 28.9 | 49.3 | 77.0 | 26.1 | 83.5 | 82.1 | 46.7 | 47.4 | 32.1 |
| GPT-4o | 50.5 | 32.8 | 26.6 | 49.3 | 76.1 | 28.8 | 81.9 | 81.1 | 49.8 | 41.3 | 26.7 |
| o3 | 44.1 | 23.1 | 20.8 | 40.4 | 69.4 | 18.6 | 67.8 | 86.7 | 45.9 | 41.3 | 26.7 |
| DeepSeek-R1 | 39.4 | 21.5 | 22.2 | 42.4 | 62.5 | 24.7 | 81.5 | 82.1 | 30.7 | 38.0 | 28.7 |
| GPT-5.1 | 46.1 | 28.2 | 23.2 | 45.4 | 68.8 | 20.9 | 73.1 | 81.3 | 44.7 | 45.4 | 30.0 |
| GPT-5.2 | 45.6 | 30.2 | 23.1 | 47.0 | 64.8 | 25.3 | 72.3 | 79.1 | 38.3 | 45.3 | 30.9 |
| Gemini 3.1 Pro | 49.7 | 33.1 | 27.3 | 47.0 | 73.5 | 34.2 | 75.7 | 85.5 | 42.4 | 48.2 | 30.3 |
| Claude Opus 4.6 | 52.9 | 35.3 | 26.6 | 49.4 | 76.4 | 37.3 | **86.5** | 85.5 | 50.2 | 48.9 | 33.2 |
| Kimi K2.5 | 53.6 | 34.6 | 30.9 | 49.9 | 82.5 | 35.8 | 82.3 | 83.6 | 50.8 | 51.1 | 34.4 |
| **Yuan3.0 Ultra** | **68.2** | **55.8** | **54.5** | **57.3** | **94.6** | **63.4** | 79.8 | **91.0** | **72.4** | **72.9** | **40.0** |

---
### 5.3 Multimodal Complex Table Understanding Evaluation: MMTab

MMTab spans 15 evaluation sets, covering task types including table question answering, fact checking, and long-context table processing. Yuan3.0 Ultra surpasses Claude Opus 4.6 and Gemini 3.1 Pro with an average accuracy of **62.3%**, demonstrating comprehensive and well-balanced multimodal table processing capability.

| Model | Avg. | TABMWP | WTQ | HiTab | TAT-QA | FeTaQA | TabFact | InfoTabs |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| GPT-5.1 | 55.2 | 65.0 | 60.8 | **77.8** | 61.4 | 8.7 | 52.8 | 64.3 |
| GPT-5.2 | 37.3 | 67.2 | 69.8 | 15.8 | 28.0 | 6.2 | 63.5 | 69.3 |
| Gemini 3.1 Pro | 45.1 | 80.1 | **79.6** | 48.3 | 50.5 | 9.6 | 71.1 | 74.4 |
| Claude Opus 4.6 | 39.8 | 67.6 | 76.0 | 44.1 | 44.5 | 12.0 | 30.7 | 59.6 |
| Kimi K2.5 | **66.2** | **95.9** | 79.3 | 63.9 | 62.4 | 7.4 | **90.6** | 81.8 |
| **Yuan3.0 Ultra** | 62.3 | 91.8 | 77.9 | 67.6 | **74.9** | **39.2** | 90.4 | **89.7** |

*Full results across all 15 tasks are available in the technical report.*

---
### 5.4 Text Summarization Evaluation: SummEval

SummEval comprehensively evaluates summarization quality along three dimensions: lexical overlap (ROUGE-1/2), semantic similarity (BERTScore), and factual consistency (SummaC), serving as an important reference for historical information compression capability in Agent applications. Yuan3.0 Ultra achieves an average score of **62.8%**.

| Model | Avg. | ROUGE-1 | ROUGE-2 | BERTScore | SummaC |
|---|:---:|:---:|:---:|:---:|:---:|
| DeepSeek-V3 | 59.3 | 25.5 | 9.2 | 86.3 | **68.2** |
| DeepSeek-V3.2 | 51.4 | 33.3 | 11.9 | 85.6 | 41.8 |
| GPT-4o | 46.5 | 25.0 | 8.9 | 85.9 | 32.5 |
| GPT-5.1 | 49.4 | 27.5 | 10.2 | 84.6 | 40.5 |
| GPT-5.2 | 48.6 | 30.3 | 10.7 | 84.9 | 36.4 |
| Gemini 3.1 Pro | 48.5 | 32.4 | 11.4 | 85.4 | 34.3 |
| Claude Opus 4.6 | 49.9 | 33.1 | 11.0 | 85.9 | 37.8 |
| Kimi K2.5 | 49.8 | 32.3 | 11.3 | 85.4 | 38.2 |
| **Yuan3.0 Ultra** | **62.8** | **59.1** | **41.0** | **91.1** | 45.4 |
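For intuition about the lexical-overlap dimension, ROUGE-1 F1 can be sketched as unigram overlap between candidate and reference. The SummEval scores above come from the standard metric implementations, so this simplified version (no stemming or normalization) is illustration only.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1 as whitespace-tokenized unigram overlap (simplified:
    none of the stemming/normalization of the reference toolkit)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat",
                  "the cat lay on the mat")   # 5/6, about 0.83
```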
---
### 5.5 Tool Invocation Evaluation: BFCL V3

BFCL V3 evaluates real-world tool invocation capability across dimensions including static function selection (Non-Live AST), dynamic real-time execution (Live AST), multi-turn context maintenance (Multi-turn), relevance detection (Relevance), and irrelevant call rejection (Irrelevance Detection). Yuan3.0 Ultra delivers balanced performance across all categories, achieving an average score of **67.8%**, with particular strength in Irrelevance Detection (86.0%).

| Model | Avg. | Non-Live AST | Live AST | Multi-turn | Relevance | Irrelevance |
|---|:---:|:---:|:---:|:---:|:---:|:---:|
| Qwen3-235B-A22B | 68.0 | 87.9 | 77.0 | 40.1 | **83.3** | 76.3 |
| Claude-3.7-Sonnet | 58.6 | 41.3 | 78.4 | 48.4 | 72.2 | 81.4 |
| GPT-5.2 | 60.6 | 80.9 | 76.2 | 24.6 | 72.2 | 79.7 |
| Gemini 3.1 Pro | **78.8** | **91.5** | **84.9** | **60.3** | 61.1 | **88.2** |
| Claude Opus 4.6 | 74.9 | 88.2 | 78.9 | 59.8 | 61.1 | 78.0 |
| Kimi K2.5 | 70.6 | 86.4 | 78.6 | 48.6 | 61.1 | 77.0 |
| **Yuan3.0 Ultra** | 67.8 | 81.7 | 74.5 | 45.3 | 66.7 | 86.0 |
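As a rough picture of the AST categories, a predicted tool call is scored by structural matching against a ground-truth schema. The sketch below is hedged: the function name, parameters, and matching rules are invented examples, and BFCL's real checker additionally handles types, optional parameters, and parallel calls.

```python
def ast_match(pred: dict, truth: dict) -> bool:
    """Simplified AST-style check: the function name must match and every
    parameter must take one of the ground truth's accepted values.
    (Treats all parameters as required, unlike the real benchmark.)"""
    if pred.get("name") != truth["name"]:
        return False
    args, allowed = pred.get("args", {}), truth["args"]
    if set(args) != set(allowed):
        return False                   # missing or unexpected parameters
    return all(args[k] in allowed[k] for k in allowed)

# Invented example schema: each parameter maps to its accepted values.
truth = {"name": "get_weather",
         "args": {"city": ["Paris"], "unit": ["C", "celsius"]}}
```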
---
### 5.6 Text-to-SQL Evaluation: Spider 1.0 & BIRD

Spider 1.0 and BIRD are two major benchmarks in the Text-to-SQL domain. Yuan3.0 Ultra demonstrates strong performance on both evaluations.

| Model | Spider 1.0 | BIRD |
|---|:---:|:---:|
| Qwen3.5-397B-A17B | 82.4 | 39.6 |
| DeepSeek-V3.2 | 80.7 | 38.9 |
| Kimi K2.5 | 82.7 | **43.5** |
| **Yuan3.0 Ultra** | **83.9** | 39.2 |
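Text-to-SQL benchmarks like these are commonly scored by execution accuracy: the predicted and gold queries run against the test database and must return the same rows. A minimal sketch with an invented toy schema (the real Spider/BIRD harnesses add value normalization and, for BIRD, efficiency scoring):

```python
import sqlite3

def execution_match(db: sqlite3.Connection, pred_sql: str, gold_sql: str) -> bool:
    """Execution accuracy: compare result sets, order-insensitively."""
    try:
        pred = db.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return False                 # an unexecutable prediction scores zero
    gold = db.execute(gold_sql).fetchall()
    return sorted(pred) == sorted(gold)

# Toy database, invented for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INT)")
db.executemany("INSERT INTO emp VALUES (?, ?, ?)",
               [("Ann", "AI", 120), ("Bo", "AI", 90), ("Cy", "HR", 80)])

# Different surface form, same rows: counts as a match.
ok = execution_match(db,
                     "SELECT name FROM emp WHERE dept = 'AI' ORDER BY salary DESC",
                     "SELECT name FROM emp WHERE dept = 'AI'")
```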

## 6. License

Use of Yuan 3.0 code and models must comply with the [Yuan 3.0 Model License Agreement](https://github.com/Yuan-lab-LLM/Yuan3.0?tab=License-1-ov-file). Yuan 3.0 models support commercial use and do not require an application for authorization. Please familiarize yourself with and adhere to the agreement. Do not use the open-source models, code, or any derivatives of this project for any purpose that may cause harm to the nation or society, or for any service that has not undergone safety assessment and registration.

Although measures have been taken during training to ensure data compliance and accuracy to the best of our ability, given the enormous scale of model parameters and the influence of probabilistic randomness, we cannot guarantee the accuracy of generated outputs, and models are susceptible to being misled by input instructions. This project assumes no responsibility for data security risks, public-opinion risks, or any risks and liabilities arising from the model being misled, misused, disseminated, or improperly exploited. You bear full and sole responsibility for all risks and consequences arising from your use, copying, distribution, and modification of this open-source project.
+---
+license: other
+library_name: transformers
+pipeline_tag: image-text-to-text
+tags:
+- moe
+- multimodal
+- vision
+---
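The added lines above are the model card's YAML frontmatter. As a toy illustration of its shape, a minimal parser for flat `key: value` pairs and `- item` lists might look like the sketch below; real model cards should go through a full YAML parser (for example via the `huggingface_hub` library), so this handles only this flat layout.

```python
def parse_frontmatter(card_text: str) -> dict:
    """Toy parser: extract flat key/value pairs and simple lists from the
    frontmatter between the first two `---` fences."""
    meta, key = {}, None
    body = card_text.split("---")[1]            # text between the fences
    for line in body.strip().splitlines():
        if line.startswith("- ") and key:
            meta[key].append(line[2:].strip())  # list item for the open key
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            meta[key] = value if value else []  # empty value opens a list
    return meta

card = """---
license: other
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- moe
- multimodal
- vision
---
# Yuan3.0 Ultra
"""
meta = parse_frontmatter(card)
```

The `pipeline_tag` and `library_name` fields are what let the Hub show the correct widget and "Use this model" snippet for the card.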
+Yuan3.0 Ultra is a trillion-parameter Mixture-of-Experts (MoE) multimodal large language model designed for enterprise-grade scenarios.
+
+- **Paper:** [Yuan3.0 Ultra: A Trillion-Parameter Enterprise-Oriented MoE LLM](https://huggingface.co/papers/2601.14327)
+- **Repository:** [GitHub - Yuan-lab-LLM/Yuan3.0-Ultra](https://github.com/Yuan-lab-LLM/Yuan3.0-Ultra)
|