Update README.md
README.md CHANGED
@@ -23,11 +23,11 @@ This curriculum greatly enhances the model’s efficiency and reasoning depth, a
 ### Flagship-Level Efficient Reasoning
 
 <p align="center">
-<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/
+<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/YiXwTb4Q_vsAAAAAT-AAAAgADkV7AQFr/original"/>
 <p>
 
 <p align="center">
-<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/
+<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/MEh7Q5FtzbAAAAAAUQAAAAgADkV7AQFr/original"/>
 <p>
 
 We comprehensively evaluated Ling-1T against leading flagship models, including both **open-source giants** (e.g., *DeepSeek-V3.1-Terminus*, *Kimi-K2-Instruct-0905*) and **closed-source APIs** (*GPT-5-main*, *Gemini-2.5-Pro*).

@@ -73,7 +73,7 @@ Key architectural innovations include:
 * **QK Normalization** for fully stable convergence
 
 <p align="center">
-<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/
+<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/naA9TJe7ttIAAAAAVRAAAAgADkV7AQFr/original"/>
 <p>
 
 Ling-1T is the **largest FP8-trained foundation model** known to date.

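The **QK Normalization** bullet in the hunk above names a stabilization technique: normalize the query and key vectors before their dot product so attention logits stay bounded. A minimal single-head NumPy sketch, assuming RMS normalization — illustrative only, not Ling-1T's actual implementation:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each vector so its root-mean-square is 1 (norm becomes sqrt(d)).
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def qk_norm_attention(q, k, v):
    # Normalizing q and k caps every attention logit at sqrt(d),
    # no matter how large the Q/K projection outputs grow — this
    # bounded-logit property is what aids stable convergence.
    q, k = rms_norm(q), rms_norm(k)
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                # |logit| <= sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax rows sum to 1
    return w @ v
```

Since each normalized row has norm √d, every scaled logit is bounded by √d regardless of activation magnitude, which keeps the softmax away from saturation.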
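FP8 training, as claimed above, stores tensors in an 8-bit floating-point format. The following NumPy simulation of E4M3-style rounding is a sketch under assumptions: the 3-bit mantissa and 448 max finite value follow the common OCP FP8 convention, not anything stated about Ling-1T's actual training recipe, and subnormals are ignored:

```python
import numpy as np

def quantize_fp8_e4m3(x):
    # Simulate E4M3 rounding: clamp to the max finite value (448),
    # then round the significand to 3 explicit mantissa bits.
    # Subnormals and the exact exponent floor are ignored here.
    x = np.clip(np.asarray(x, dtype=np.float64), -448.0, 448.0)
    m, e = np.frexp(x)            # x = m * 2**e, with m in [0.5, 1)
    steps = 2 ** 3                # 3 mantissa bits -> 8 steps per octave
    s = np.round(m * 2 * steps) / steps   # significand rounded in [1, 2]
    return np.ldexp(s / 2, e)
```

For example, 3.3 lands on 3.25 because representable values between 2 and 4 are spaced 0.25 apart; values beyond 448 saturate. Training in such a format trades precision for roughly half the memory and bandwidth of FP16.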
@@ -111,51 +111,9 @@ Empirically, LPO offers superior **training stability** and **generalization** a
 Ling-1T has been extensively evaluated across **knowledge**, **code**, **math**, **reasoning**, **agent**, and **alignment** benchmarks.
 It currently stands as the **best open-source flagship non-thinking model**, rivaling closed-source APIs in complex reasoning while maintaining exceptional efficiency and interpretability.
 
-
-
-
-| | | (NonThinking) | | | (thinkBudget=128) | |
-| **Knowledge** | **Professional Knowledge** | | | | | |
-| | C-Eval | __91.76__ | 91.12 | 83.59 | 88.77 | __<span style="color:red">92.19</span>__ |
-| | MMLU-Redux (EM) | 92.37 | 91.58 | **92.75** | __<span style="color:red">94.67</span>__ | 92.25 |
-| | MMLU-Pro | __<span style="color:red">83.25</span>__ | 81.03 | 81.94 | **82.13** | 82.04 |
-| **Knowledge** | **STEM** | | | | | |
-| | MMLU-Pro-Stem | 87.91 | 85.30 | 73.45 | __<span style="color:red">88.60</span>__ | **88.5** |
-| | OlympiadBench-stem | 87.83 | 79.13 | 78.26 | **89.57** | __<span style="color:red">91.3</span>__ |
-| | GPQA-Diamond | __<span style="color:red">76.23</span>__ | **73.93** | 71.31 | 71.81 | 72.98 |
-| **Coding** | **Code Generation** | | | | | |
-| | MultiPL-E | **77.68** | 73.76 | 76.66 | 71.48 | __<span style="color:red">77.91</span>__ |
-| | mbpp | 90.69 | 89.96 | **91.72** | 91.01 | __<span style="color:red">96.87</span>__ |
-| | LiveCodeBench (2408-2505) | 48.02 | 48.95 | **48.57** | 45.43 | __<span style="color:red">61.68</span>__ |
-| | CodeForces-rating | 1582 | 1574 | 1120 | **1675** | __<span style="color:red">1901</span>__ |
-| | BIRD_SQL | 44.88 | 46.45 | 43.97 | __<span style="color:red">54.76</span>__ | **52.38** |
-| **Coding** | **Software Development** | | | | | |
-| | ArtifactsBench | 43.29 | 44.87 | 41.04 | __<span style="color:red">60.28</span>__ | **59.31** |
-| | FullStack Bench | **55.48** | 54.00 | 50.92 | 48.19 | __<span style="color:red">56.55</span>__ |
-| | Aider | **88.16** | 85.34 | 84.40 | __<span style="color:red">89.85</span>__ | 83.65 |
-| **Math** | **Competition Math** | | | | | |
-| | CNMO 2024 | 73.78 | 68.92 | 63.11 | **74.65** | __<span style="color:red">79.25</span>__ |
-| | AIME 2025 | 55.21 | 50.16 | 59.43 | **70.10** | __<span style="color:red">70.42</span>__ |
-| | UGMathBench | **72.70** | 69.97 | 67.27 | 70.10 | __<span style="color:red">74.95</span>__ |
-| | Omni-Math | 64.77 | 62.42 | 61.09 | **72.02** | __<span style="color:red">74.46</span>__ |
-| **Math** | **Professional Math** | | | | | |
-| | FinanceReasoning | 86.44 | 84.83 | 86.28 | **86.65** | __<span style="color:red">87.45</span>__ |
-| | Optibench | 64.30 | 60.83 | 40.06 | **68.76** | __<span style="color:red">74.71</span>__ |
-| | OptMATH | 35.99 | 35.84 | 39.16 | **42.77** | __<span style="color:red">57.68</span>__ |
-| **General Reasoning** | | | | | | |
-| | BBEH | **42.86** | 34.83 | 39.75 | 29.08 | __<span style="color:red">47.34</span>__ |
-| | KOR-Bench | **73.76** | 73.20 | 70.56 | 59.68 | __<span style="color:red">76.00</span>__ |
-| | ARC-AGI-1 | 14.69 | **22.19** | 14.06 | 18.94 | __<span style="color:red">43.81</span>__ |
-| | ZebraLogic | 81.6 | **85.5** | 57.3 | 70.2 | __<span style="color:red">90.8</span>__ |
-| **Agent** | | | | | | |
-| | BFCL-V3 | 52.67 | __<span style="color:red">71.05</span>__ | 50.27 | 63.31 | **69.64** |
-| **Alignment** | | | | | | |
-| | Arena Hard V2 ELO | 54.09 | __<span style="color:red">76.95</span>__ | 68.37 | 65.37 | **76.26** |
-| | Arena Hard V2 Win Rate | 63.24 | 69.88 | 65.06 | **74.46** | __<span style="color:red">75.83</span>__ |
-| | writing_bench | 80.95 | **87.59** | 77.07 | 80.53 | __<span style="color:red">89.4</span>__ |
-| | Creative Writing v3 | 85.18 | **87.01** | 80.93 | 84.99 | <span style="color:red">89.24</span> |
-| | MultiChallenge | 42.49 | 48.72 | 48.72 | **51.28** | __<span style="color:red">58.24</span>__ |
-
+<p align="center">
+<img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/KrwiQZEDHV0AAAAAWkAAAAgADkV7AQFr/original"/>
+<p>
 
 
 ## Model Downloads