zhanghanxiao committed on
Commit 994b7a0 · verified · 1 Parent(s): 734bad7

Update README.md

Files changed (1)
  1. README.md +6 -48
README.md CHANGED

@@ -23,11 +23,11 @@ This curriculum greatly enhances the model’s efficiency and reasoning depth, a
 ### Flagship-Level Efficient Reasoning
 
 <p align="center">
-<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/FRNXSJFZGXkAAAAAT-AAAAgADkV7AQFr/original"/>
+<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/YiXwTb4Q_vsAAAAAT-AAAAgADkV7AQFr/original"/>
 <p>
 
 <p align="center">
-<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/3in4SJr8YPkAAAAAUNAAAAgADkV7AQFr/original"/>
+<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/MEh7Q5FtzbAAAAAAUQAAAAgADkV7AQFr/original"/>
 <p>
 
 We comprehensively evaluated Ling-1T against leading flagship models, including both **open-source giants** (e.g., *DeepSeek-V3.1-Terminus*, *Kimi-K2-Instruct-0905*) and **closed-source APIs** (*GPT-5-main*, *Gemini-2.5-Pro*).
@@ -73,7 +73,7 @@ Key architectural innovations include:
 * **QK Normalization** for fully stable convergence
 
 <p align="center">
-<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/03WMQZIYxpUAAAAAVTAAAAgADkV7AQFr/original"/>
+<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/naA9TJe7ttIAAAAAVRAAAAgADkV7AQFr/original"/>
 <p>
 
 Ling-1T is the **largest FP8-trained foundation model** known to date.
@@ -111,51 +111,9 @@ Empirically, LPO offers superior **training stability** and **generalization** a
 Ling-1T has been extensively evaluated across **knowledge**, **code**, **math**, **reasoning**, **agent**, and **alignment** benchmarks.
 It currently stands as the **best open-source flagship non-thinking model**, rivaling closed-source APIs in complex reasoning while maintaining exceptional efficiency and interpretability.
 
-## Evaluation
-| Task | Benchmark | DeepSeek-V3.1-Terminus | Kimi-K2-Instruct-0905 | gpt-5-main | Gemini 2.5 Pro | Ling-1T |
-| --------------------- | -------------------------- | ---------------------------------------- | ---------------------------------------- | ---------- | ---------------------------------------- | ---------------------------------------- |
-| | | (NonThinking) | | | (thinkBudget=128) | |
-| **Knowledge** | **Professional Knowledge** | | | | | |
-| | C-Eval | __91.76__ | 91.12 | 83.59 | 88.77 | __<span style="color:red">92.19</span>__ |
-| | MMLU-Redux (EM) | 92.37 | 91.58 | **92.75** | __<span style="color:red">94.67</span>__ | 92.25 |
-| | MMLU-Pro | __<span style="color:red">83.25</span>__ | 81.03 | 81.94 | **82.13** | 82.04 |
-| **Knowledge** | **STEM** | | | | | |
-| | MMLU-Pro-Stem | 87.91 | 85.30 | 73.45 | __<span style="color:red">88.60</span>__ | **88.5** |
-| | OlympiadBench-stem | 87.83 | 79.13 | 78.26 | **89.57** | __<span style="color:red">91.3</span>__ |
-| | GPQA-Diamond | __<span style="color:red">76.23</span>__ | **73.93** | 71.31 | 71.81 | 72.98 |
-| **Coding** | **Code Generation** | | | | | |
-| | MultiPL-E | **77.68** | 73.76 | 76.66 | 71.48 | __<span style="color:red">77.91</span>__ |
-| | mbpp | 90.69 | 89.96 | **91.72** | 91.01 | __<span style="color:red">96.87</span>__ |
-| | LiveCodeBench (2408-2505) | 48.02 | 48.95 | **48.57** | 45.43 | __<span style="color:red">61.68</span>__ |
-| | CodeForces-rating | 1582 | 1574 | 1120 | **1675** | __<span style="color:red">1901</span>__ |
-| | BIRD_SQL | 44.88 | 46.45 | 43.97 | __<span style="color:red">54.76</span>__ | **52.38** |
-| **Coding** | **Software Development** | | | | | |
-| | ArtifactsBench | 43.29 | 44.87 | 41.04 | __<span style="color:red">60.28</span>__ | **59.31** |
-| | FullStack Bench | **55.48** | 54.00 | 50.92 | 48.19 | __<span style="color:red">56.55</span>__ |
-| | Aider | **88.16** | 85.34 | 84.40 | __<span style="color:red">89.85</span>__ | 83.65 |
-| **Math** | **Competition Math** | | | | | |
-| | CNMO 2024 | 73.78 | 68.92 | 63.11 | **74.65** | __<span style="color:red">79.25</span>__ |
-| | AIME 2025 | 55.21 | 50.16 | 59.43 | **70.10** | __<span style="color:red">70.42</span>__ |
-| | UGMathBench | **72.70** | 69.97 | 67.27 | 70.10 | __<span style="color:red">74.95</span>__ |
-| | Omni-Math | 64.77 | 62.42 | 61.09 | **72.02** | __<span style="color:red">74.46</span>__ |
-| **Math** | **Professional Math** | | | | | |
-| | FinanceReasoning | 86.44 | 84.83 | 86.28 | **86.65** | __<span style="color:red">87.45</span>__ |
-| | Optibench | 64.30 | 60.83 | 40.06 | **68.76** | __<span style="color:red">74.71</span>__ |
-| | OptMATH | 35.99 | 35.84 | 39.16 | **42.77** | __<span style="color:red">57.68</span>__ |
-| **General Reasoning** | | | | | | |
-| | BBEH | **42.86** | 34.83 | 39.75 | 29.08 | __<span style="color:red">47.34</span>__ |
-| | KOR-Bench | **73.76** | 73.20 | 70.56 | 59.68 | __<span style="color:red">76.00</span>__ |
-| | ARC-AGI-1 | 14.69 | **22.19** | 14.06 | 18.94 | __<span style="color:red">43.81</span>__ |
-| | ZebraLogic | 81.6 | **85.5** | 57.3 | 70.2 | __<span style="color:red">90.8</span>__ |
-| **Agent** | | | | | | |
-| | BFCL-V3 | 52.67 | __<span style="color:red">71.05</span>__ | 50.27 | 63.31 | **69.64** |
-| **Alignment** | | | | | | |
-| | Arena Hard V2 ELO | 54.09 | __<span style="color:red">76.95</span>__ | 68.37 | 65.37 | **76.26** |
-| | Arena Hard V2 Win Rate | 63.24 | 69.88 | 65.06 | **74.46** | __<span style="color:red">75.83</span>__ |
-| | writing_bench | 80.95 | **87.59** | 77.07 | 80.53 | __<span style="color:red">89.4</span>__ |
-| | Creative Writing v3 | 85.18 | **87.01** | 80.93 | 84.99 | <span style="color:red">89.24</span> |
-| | MultiChallenge | 42.49 | 48.72 | 48.72 | **51.28** | __<span style="color:red">58.24</span>__ |
-
+<p align="center">
+<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/KrwiQZEDHV0AAAAAWkAAAAgADkV7AQFr/original"/>
+<p>
 
 
 ## Model Downloads