nitpick: update several formatting issues in the card
README.md CHANGED
```diff
@@ -16,7 +16,7 @@ library_name: transformers
 **Ling-1T** is the first flagship *non-thinking* model in the Ling 2.0 series, featuring **1 trillion total parameters** with **≈ 50 billion active parameters per token**.
 Built on the Ling 2.0 architecture, Ling-1T is designed to push the limits of *efficient reasoning* and *scalable cognition*.
 
-Pre-trained on **20 trillion+ high-quality, reasoning-dense tokens**, Ling-1T-base supports up to **
+Pre-trained on **20 trillion+ high-quality, reasoning-dense tokens**, Ling-1T-base supports up to **128K context length** and adopts an **evolutionary chain-of-thought (Evo-CoT)** process across mid-training and post-training.
 This curriculum greatly enhances the model’s efficiency and reasoning depth, allowing Ling-1T to achieve **state-of-the-art performance** on multiple complex reasoning benchmarks—balancing **accuracy** and **efficiency**.
 
 
```
```diff
@@ -49,7 +49,7 @@ On **ArtifactsBench**, Ling-1T ranks **first among open-source models**, and the
 ### Emergent Intelligence at Trillion-Scale
 
 Scaling to the trillion-parameter level has revealed strong **emergent reasoning and transfer capabilities**.
-For example, in the **BFCL V3** tool-use benchmark, Ling-1T achieves **≈ 70
+For example, in the **BFCL V3** tool-use benchmark, Ling-1T achieves **≈ 70% tool-call accuracy** with only light instruction tuning—despite having seen no large-scale trajectory data during training.
 Ling-1T can:
 
 * Interpret complex natural-language instructions
```
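For context on the metric this hunk completes: BFCL-style tool-call accuracy boils down to comparing a model's emitted function call against a gold call. The following is a minimal sketch of that comparison, not the benchmark's actual harness (which does much richer AST matching, type coercion, and optional-parameter handling); the function and field names here are illustrative assumptions.

```python
import json

def tool_call_matches(predicted: str, expected: dict) -> bool:
    """Check whether a model's emitted tool call matches a gold call.

    Sketch only: real BFCL scoring normalizes types and handles
    optional parameters; here we require exact name/argument equality.
    """
    try:
        call = json.loads(predicted)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    return (
        call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )

# Hypothetical example call, not from the benchmark itself
gold = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
print(tool_call_matches(
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}',
    gold))  # True
```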
```diff
@@ -67,7 +67,7 @@ This ensures architectural and hyperparameter scalability even under **1e25–1e
 
 Key architectural innovations include:
 
-* **
+* **1T total / 50B active parameters** with a **1/32 MoE activation ratio**
 * **MTP layers** for enhanced compositional reasoning
 * **Aux-loss-free**, **sigmoid-scoring expert routing** with **zero-mean updates**
 * **QK Normalization** for fully stable convergence
```
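The sigmoid-scoring, aux-loss-free routing with zero-mean updates mentioned in this hunk can be sketched as follows. This is an assumption-laden illustration of the general technique (a selection-only bias nudged toward uniform expert load), not Ling-1T's actual implementation; all shapes and learning rates are made up, with `k=8` of 256 experts matching the 1/32 activation ratio.

```python
import numpy as np

def route(tokens, w_router, bias, k=8):
    """Sigmoid-scored top-k expert routing (sketch).

    Expert affinities use a sigmoid rather than softmax; `bias` only
    influences which experts are *selected*, not the gate weights,
    which is the essence of aux-loss-free load balancing.
    """
    scores = 1.0 / (1.0 + np.exp(-(tokens @ w_router)))  # (n_tokens, n_experts)
    topk = np.argsort(scores + bias, axis=1)[:, -k:]     # selection uses bias
    gates = np.take_along_axis(scores, topk, axis=1)
    gates /= gates.sum(axis=1, keepdims=True)            # gates ignore bias
    return topk, gates

def update_bias(bias, topk, n_experts, lr=1e-3):
    """Zero-mean bias update: push load toward uniform without an aux loss."""
    load = np.bincount(topk.ravel(), minlength=n_experts).astype(float)
    error = load.mean() - load                  # positive for underloaded experts
    return bias + lr * (error - error.mean())   # zero-mean by construction

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 64, 32, 256      # 8/256 active = 1/32 ratio
tokens = rng.normal(size=(n_tokens, d_model))
w_router = rng.normal(size=(d_model, n_experts))
bias = np.zeros(n_experts)
topk, gates = route(tokens, w_router, bias)
bias = update_bias(bias, topk, n_experts)
```

Because the bias never enters the gate weights, balancing pressure does not distort the forward pass the way an auxiliary load-balancing loss would.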
```diff
@@ -77,7 +77,7 @@ Key architectural innovations include:
 <p>
 
 Ling-1T is the **largest FP8-trained foundation model** known to date.
-FP8 mixed-precision training yields **15
+FP8 mixed-precision training yields **15%+ end-to-end speedup**, improved memory efficiency, and maintains **≤ 0.1% loss deviation** from BF16 across **1T tokens**.
 A fine-grained, **heterogeneous 1F1B interleaved pipeline** further boosts utilization by 40 %+.
 System-level optimizations—fused kernels, communication scheduling, recomputation, checkpointing, simulation, and telemetry—ensure stable trillion-scale training.
 
```
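To make the FP8 claim in this hunk concrete: the precision cost being bounded comes from E4M3's 3 mantissa bits plus a per-tensor scale. Below is a rough simulation of that rounding step, assuming per-tensor scaling to E4M3's max normal value of 448; real FP8 training keeps master weights in higher precision and only casts matmul operands, and nothing here reflects Ling-1T's actual recipe.

```python
import numpy as np

def quantize_e4m3(x, scale):
    """Simulate FP8 E4M3 rounding via a per-tensor scale (sketch).

    E4M3 has 3 mantissa bits and a max normal value of 448; we keep
    4 significant binary digits (implicit leading 1 + 3 mantissa bits).
    """
    scaled = np.clip(x / scale, -448.0, 448.0)
    m, e = np.frexp(scaled)           # mantissa in [0.5, 1), exponent
    m = np.round(m * 16) / 16         # quantize mantissa to 4 binary digits
    return np.ldexp(m, e) * scale

x = np.random.default_rng(0).normal(size=10_000).astype(np.float32)
scale = np.abs(x).max() / 448.0       # map the tensor's max onto E4M3's range
err = np.abs(quantize_e4m3(x, scale) - x).mean() / np.abs(x).mean()
print(f"mean relative rounding error: {err:.4f}")
```

Per-element rounding error of a few percent is tolerable because it is unbiased noise on matmul operands, which is why end-to-end loss can stay within the ≤ 0.1% deviation the card reports.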
```diff
@@ -85,7 +85,7 @@ System-level optimizations—fused kernels, communication scheduling, recomputat
 <img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/y5UVSKACgLEAAAAAVcAAAAgADkV7AQFr/original"/>
 <p>
 
-Pre-training used over **
+Pre-training used over **20T high-quality tokens**, with **> 40% reasoning-dense data** in later stages.
 Mid-training introduced **curated chain-of-thought corpora** for “**reasoning pre-activation**”, improving downstream reasoning stability.
 A custom **WSM (Warmup–Stable–Merge)** LR scheduler with mid-train checkpoint merging simulates LR decay and boosts generalization.
 
```
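The WSM (Warmup–Stable–Merge) scheduler in this hunk replaces the usual decay phase with checkpoint merging. A minimal sketch of that idea, with entirely illustrative hyperparameters (Ling-1T's actual peak LR and step counts are not stated here):

```python
import numpy as np

def wsm_lr(step, peak_lr=1e-4, warmup_steps=2_000):
    """Warmup-Stable LR: linear warmup, then hold flat; no decay phase."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

def merge_checkpoints(checkpoints):
    """The 'Merge' step: averaging checkpoints taken during the stable
    phase approximates the generalization benefit of LR decay without
    ever lowering the learning rate."""
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in checkpoints[0]}

# Toy demo: two stable-phase checkpoints averaged into one
ckpts = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
merged = merge_checkpoints(ckpts)
```

Keeping the LR flat means any stable-phase checkpoint can seed a new branch (e.g. mid-training), while the merged weights serve as the decayed-equivalent release candidate.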