Text Generation · Transformers · Safetensors · bailing_moe · conversational · custom_code
RichardBian committed · Commit 441ad0e · verified · Parent(s): 1b86b66

nitpick: update several formatting issues in the card

Files changed (1): README.md (+5 −5)
README.md CHANGED
```diff
@@ -16,7 +16,7 @@ library_name: transformers
 **Ling-1T** is the first flagship *non-thinking* model in the Ling 2.0 series, featuring **1 trillion total parameters** with **≈ 50 billion active parameters per token**.
 Built on the Ling 2.0 architecture, Ling-1T is designed to push the limits of *efficient reasoning* and *scalable cognition*.
 
-Pre-trained on **20 trillion+ high-quality, reasoning-dense tokens**, Ling-1T-base supports up to **128 K context length** and adopts an **evolutionary chain-of-thought (Evo-CoT)** process across mid-training and post-training.
+Pre-trained on **20 trillion+ high-quality, reasoning-dense tokens**, Ling-1T-base supports up to **128K context length** and adopts an **evolutionary chain-of-thought (Evo-CoT)** process across mid-training and post-training.
 This curriculum greatly enhances the model’s efficiency and reasoning depth, allowing Ling-1T to achieve **state-of-the-art performance** on multiple complex reasoning benchmarks—balancing **accuracy** and **efficiency**.
 
 
@@ -49,7 +49,7 @@ On **ArtifactsBench**, Ling-1T ranks **first among open-source models**, and the
 ### Emergent Intelligence at Trillion-Scale
 
 Scaling to the trillion-parameter level has revealed strong **emergent reasoning and transfer capabilities**.
-For example, in the **BFCL V3** tool-use benchmark, Ling-1T achieves **≈ 70 % tool-call accuracy** with only light instruction tuning—despite having seen no large-scale trajectory data during training.
+For example, in the **BFCL V3** tool-use benchmark, Ling-1T achieves **≈ 70% tool-call accuracy** with only light instruction tuning—despite having seen no large-scale trajectory data during training.
 Ling-1T can:
 
 * Interpret complex natural-language instructions
@@ -67,7 +67,7 @@ This ensures architectural and hyperparameter scalability even under **1e25–1e
 
 Key architectural innovations include:
 
-* **1 T total / 50 B active parameters** with a **1/32 MoE activation ratio**
+* **1T total / 50B active parameters** with a **1/32 MoE activation ratio**
 * **MTP layers** for enhanced compositional reasoning
 * **Aux-loss-free**, **sigmoid-scoring expert routing** with **zero-mean updates**
 * **QK Normalization** for fully stable convergence
@@ -77,7 +77,7 @@ Key architectural innovations include:
 <p>
 
 Ling-1T is the **largest FP8-trained foundation model** known to date.
-FP8 mixed-precision training yields **15 %+ end-to-end speedup**, improved memory efficiency, and maintains **≤ 0.1 % loss deviation** from BF16 across **1 T tokens**.
+FP8 mixed-precision training yields **15%+ end-to-end speedup**, improved memory efficiency, and maintains **≤ 0.1% loss deviation** from BF16 across **1T tokens**.
 A fine-grained, **heterogeneous 1F1B interleaved pipeline** further boosts utilization by 40 %+.
 System-level optimizations—fused kernels, communication scheduling, recomputation, checkpointing, simulation, and telemetry—ensure stable trillion-scale training.
 
@@ -85,7 +85,7 @@ System-level optimizations—fused kernels, communication scheduling, recomputat
 <img src="https://mdn.alipayobjects.com/huamei_bcz3yt/afts/img/y5UVSKACgLEAAAAAVcAAAAgADkV7AQFr/original"/>
 <p>
 
-Pre-training used over **20 T high-quality tokens**, with **> 40 % reasoning-dense data** in later stages.
+Pre-training used over **20T high-quality tokens**, with **> 40% reasoning-dense data** in later stages.
 Mid-training introduced **curated chain-of-thought corpora** for “**reasoning pre-activation**”, improving downstream reasoning stability.
 A custom **WSM (Warmup–Stable–Merge)** LR scheduler with mid-train checkpoint merging simulates LR decay and boosts generalization.
```
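
The card's "1T total / 50B active parameters with a 1/32 MoE activation ratio" figures fit together once the always-active parameters (attention, embeddings, shared components) are counted separately from the routed experts. A minimal sketch of that arithmetic, where the 98% expert fraction is purely an illustrative assumption and not a published number:

```python
# Back-of-the-envelope check of "1T total / ~50B active, 1/32 activation
# ratio". The split between always-active params and routed-expert params
# below is an illustrative assumption, not a figure from the card.

def active_params(total: float, expert_fraction: float, activation_ratio: float) -> float:
    """Active parameters per token for a simple MoE layout.

    total            -- total parameter count
    expert_fraction  -- fraction of `total` living in routed experts
    activation_ratio -- fraction of routed-expert params used per token
    """
    dense = total * (1.0 - expert_fraction)            # always active
    routed = total * expert_fraction * activation_ratio  # 1/32 of experts
    return dense + routed

if __name__ == "__main__":
    total = 1.0e12          # 1T total parameters
    expert_fraction = 0.98  # assumed: ~98% of params in routed experts
    ratio = 1 / 32          # 1/32 activation ratio from the card
    print(f"{active_params(total, expert_fraction, ratio) / 1e9:.1f}B active")
    # prints "50.6B active" under these assumed splits
```

Note that a naive 1e12 / 32 ≈ 31B undershoots the stated ~50B; the gap is exactly the always-active share, which is why both numbers can hold at once.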
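
The WSM (Warmup–Stable–Merge) scheduler mentioned in the card replaces the usual LR-decay phase with checkpoint merging during the constant-LR phase. A minimal sketch of the idea, assuming uniform weight averaging and plain dict state for illustration; the actual merging scheme and schedule shapes are not specified in the card:

```python
# Sketch of Warmup-Stable-Merge: warm up the LR, hold it constant, and
# average checkpoints taken during the stable phase instead of decaying.
# Uniform averaging of dict-based "state dicts" is an assumption here.

def lr_at(step: int, warmup: int, peak_lr: float) -> float:
    """Linear warmup, then constant LR (no decay phase; merging replaces it)."""
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr

def merge_checkpoints(checkpoints: list[dict[str, float]]) -> dict[str, float]:
    """Uniformly average a list of state dicts (param name -> weight)."""
    n = len(checkpoints)
    return {name: sum(ckpt[name] for ckpt in checkpoints) / n
            for name in checkpoints[0]}

if __name__ == "__main__":
    # LR schedule: ramps to peak, then stays flat.
    print(lr_at(50, 100, 3e-4))   # 0.00015 (half of peak, mid-warmup)
    print(lr_at(500, 100, 3e-4))  # 0.0003  (stable phase)

    # "Merge" step over three stable-phase checkpoints.
    ckpts = [{"w": 1.0}, {"w": 3.0}, {"w": 5.0}]
    print(merge_checkpoints(ckpts))  # {'w': 3.0}
```

The design intuition is that averaging stable-phase checkpoints lands the weights near the center of the region the optimizer has been exploring, which is roughly what an annealed LR would also achieve.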