nitpick: update several formatting issues in the card
README.md CHANGED
```diff
@@ -16,7 +16,7 @@ library_name: transformers
 **Ling-1T** is the first flagship *non-thinking* model in the Ling 2.0 series, featuring **1 trillion total parameters** with **≈ 50 billion active parameters per token**.
 Built on the Ling 2.0 architecture, Ling-1T is designed to push the limits of *efficient reasoning* and *scalable cognition*.
 
-Pre-trained on **20 trillion+ high-quality, reasoning-dense tokens**, Ling-1T-base supports up to **
+Pre-trained on **20 trillion+ high-quality, reasoning-dense tokens**, Ling-1T-base supports up to **128K context length** and adopts an **evolutionary chain-of-thought (Evo-CoT)** process across mid-training and post-training.
 This curriculum greatly enhances the model’s efficiency and reasoning depth, allowing Ling-1T to achieve **state-of-the-art performance** on multiple complex reasoning benchmarks—balancing **accuracy** and **efficiency**.
 
 
```
```diff
@@ -49,7 +49,7 @@ On **ArtifactsBench**, Ling-1T ranks **first among open-source models**, and the
 ### Emergent Intelligence at Trillion-Scale
 
 Scaling to the trillion-parameter level has revealed strong **emergent reasoning and transfer capabilities**.
-For example, in the **BFCL V3** tool-use benchmark, Ling-1T achieves **≈ 70
+For example, in the **BFCL V3** tool-use benchmark, Ling-1T achieves **≈ 70% tool-call accuracy** with only light instruction tuning—despite having seen no large-scale trajectory data during training.
 Ling-1T can:
 
 * Interpret complex natural-language instructions
```
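For context on the metric this hunk completes: BFCL-style tool-call accuracy boils down to comparing a model's emitted function call against a gold call. The following is a minimal sketch of that comparison, not the benchmark's actual harness (which does much richer AST matching, type coercion, and optional-parameter handling); the function and field names here are illustrative assumptions.

```python
import json

def tool_call_matches(predicted: str, expected: dict) -> bool:
    """Check whether a model's emitted tool call matches a gold call.

    Sketch only: real BFCL scoring normalizes types and handles
    optional parameters; here we require exact name/argument equality.
    """
    try:
        call = json.loads(predicted)
    except json.JSONDecodeError:
        return False  # malformed output counts as a miss
    return (
        call.get("name") == expected["name"]
        and call.get("arguments") == expected["arguments"]
    )

# Hypothetical example call, not from the benchmark itself
gold = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
print(tool_call_matches(
    '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}',
    gold))  # True
```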
```diff
@@ -67,7 +67,7 @@ This ensures architectural and hyperparameter scalability even under **1e25–1e
 
 Key architectural innovations include:
 
-* **
+* **1T total / 50B active parameters** with a **1/32 MoE activation ratio**
 * **MTP layers** for enhanced compositional reasoning
 * **Aux-loss-free**, **sigmoid-scoring expert routing** with **zero-mean updates**
 * **QK Normalization** for fully stable convergence
```
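The sigmoid-scoring, aux-loss-free routing with zero-mean updates mentioned in this hunk can be sketched as follows. This is an assumption-laden illustration of the general technique (a selection-only bias nudged toward uniform expert load), not Ling-1T's actual implementation; all shapes and learning rates are made up, with `k=8` of 256 experts matching the 1/32 activation ratio.

```python
import numpy as np

def route(tokens, w_router, bias, k=8):
    """Sigmoid-scored top-k expert routing (sketch).

    Expert affinities use a sigmoid rather than softmax; `bias` only
    influences which experts are *selected*, not the gate weights,
    which is the essence of aux-loss-free load balancing.
    """
    scores = 1.0 / (1.0 + np.exp(-(tokens @ w_router)))  # (n_tokens, n_experts)
    topk = np.argsort(scores + bias, axis=1)[:, -k:]     # selection uses bias
    gates = np.take_along_axis(scores, topk, axis=1)
    gates /= gates.sum(axis=1, keepdims=True)            # gates ignore bias
    return topk, gates

def update_bias(bias, topk, n_experts, lr=1e-3):
    """Zero-mean bias update: push load toward uniform without an aux loss."""
    load = np.bincount(topk.ravel(), minlength=n_experts).astype(float)
    error = load.mean() - load                  # positive for underloaded experts
    return bias + lr * (error - error.mean())   # zero-mean by construction

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 64, 32, 256      # 8/256 active = 1/32 ratio
tokens = rng.normal(size=(n_tokens, d_model))
w_router = rng.normal(size=(d_model, n_experts))
bias = np.zeros(n_experts)
topk, gates = route(tokens, w_router, bias)
bias = update_bias(bias, topk, n_experts)
```

Because the bias never enters the gate weights, balancing pressure does not distort the forward pass the way an auxiliary load-balancing loss would.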
```diff
@@ -77,7 +77,7 @@ Key architectural innovations include:
 <p>
 
 Ling-1T is the **largest FP8-trained foundation model** known to date.
-FP8 mixed-precision training yields **15
+FP8 mixed-precision training yields **15%+ end-to-end speedup**, improved memory efficiency, and maintains **≤ 0.1% loss deviation** from BF16 across **1T tokens**.
 A fine-grained, **heterogeneous 1F1B interleaved pipeline** further boosts utilization by 40 %+.
 System-level optimizations—fused kernels, communication scheduling, recomputation, checkpointing, simulation, and telemetry—ensure stable trillion-scale training.
 
```
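To make the FP8 claim in this hunk concrete: the precision cost being bounded comes from E4M3's 3 mantissa bits plus a per-tensor scale. Below is a rough simulation of that rounding step, assuming per-tensor scaling to E4M3's max normal value of 448; real FP8 training keeps master weights in higher precision and only casts matmul operands, and nothing here reflects Ling-1T's actual recipe.

```python
import numpy as np

def quantize_e4m3(x, scale):
    """Simulate FP8 E4M3 rounding via a per-tensor scale (sketch).

    E4M3 has 3 mantissa bits and a max normal value of 448; we keep
    4 significant binary digits (implicit leading 1 + 3 mantissa bits).
    """
    scaled = np.clip(x / scale, -448.0, 448.0)
    m, e = np.frexp(scaled)           # mantissa in [0.5, 1), exponent
    m = np.round(m * 16) / 16         # quantize mantissa to 4 binary digits
    return np.ldexp(m, e) * scale

x = np.random.default_rng(0).normal(size=10_000).astype(np.float32)
scale = np.abs(x).max() / 448.0       # map the tensor's max onto E4M3's range
err = np.abs(quantize_e4m3(x, scale) - x).mean() / np.abs(x).mean()
print(f"mean relative rounding error: {err:.4f}")
```

Per-element rounding error of a few percent is tolerable because it is unbiased noise on matmul operands, which is why end-to-end loss can stay within the ≤ 0.1% deviation the card reports.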
```diff
@@ -85,7 +85,7 @@ System-level optimizations—fused kernels, communication scheduling, recomputat
 <img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/y5UVSKACgLEAAAAAVcAAAAgADkV7AQFr/original"/>
 <p>
 
-Pre-training used over **
+Pre-training used over **20T high-quality tokens**, with **> 40% reasoning-dense data** in later stages.
 Mid-training introduced **curated chain-of-thought corpora** for “**reasoning pre-activation**”, improving downstream reasoning stability.
 A custom **WSM (Warmup–Stable–Merge)** LR scheduler with mid-train checkpoint merging simulates LR decay and boosts generalization.
 
```
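The WSM (Warmup–Stable–Merge) scheduler in this hunk replaces the usual decay phase with checkpoint merging. A minimal sketch of that idea, with entirely illustrative hyperparameters (Ling-1T's actual peak LR and step counts are not stated here):

```python
import numpy as np

def wsm_lr(step, peak_lr=1e-4, warmup_steps=2_000):
    """Warmup-Stable LR: linear warmup, then hold flat; no decay phase."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr

def merge_checkpoints(checkpoints):
    """The 'Merge' step: averaging checkpoints taken during the stable
    phase approximates the generalization benefit of LR decay without
    ever lowering the learning rate."""
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in checkpoints[0]}

# Toy demo: two stable-phase checkpoints averaged into one
ckpts = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
merged = merge_checkpoints(ckpts)
```

Keeping the LR flat means any stable-phase checkpoint can seed a new branch (e.g. mid-training), while the merged weights serve as the decayed-equivalent release candidate.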