---
license: mit
base_model:
- inclusionAI/Ling-flash-base-2.0
pipeline_tag: text-generation
library_name: transformers
---

<p align="center">
<img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4QxcQrBlTiAAAAAAQXAAAAgAemJ7AQ/original" width="100"/>
</p>

<p align="center">🤗 <a href="https://huggingface.co/inclusionAI">Hugging Face</a>&nbsp;&nbsp;|&nbsp;&nbsp;🤖 <a href="https://modelscope.cn/organization/inclusionAI">ModelScope</a></p>

## Introduction

Today, we are officially open-sourcing Ring-flash-2.0.

This is a **high-performance thinking model, deeply optimized** based on Ling-flash-2.0-base. Like Ling-flash-2.0, Ring-flash-2.0 has a total of 100B parameters, with only 6.1B activated per inference. Our independently developed **IcePop algorithm** has successfully addressed the challenge of training instability in reinforcement learning (RL) for MoE LLMs after cold-start Long-CoT SFT, enabling the model’s complex reasoning capabilities to continuously improve throughout extended RL training cycles.

Ring-flash-2.0 demonstrates significant breakthroughs across multiple challenging benchmarks, including **math competitions**, **code generation**, and **logical reasoning**. Its performance not only surpasses that of SOTA dense models under 40B parameters but also rivals larger open-weight MoE models and closed-source high-performance thinking-model APIs.

### Leading-Level Performance in Complex Reasoning

We selected **representative open-source thinking models** and **closed-source APIs** for comparison, including GPT-OSS-120B (medium), Qwen3-32B-Thinking, Seed-OSS-36B-Instruct, and Gemini-2.5-Flash.

The benchmarking results demonstrate that Ring-flash-2.0 exhibits leading performance across multiple challenging general reasoning tasks, including:

- **Math competitions** (AIME 25, Omni-MATH)
- **Code generation** (LiveCodeBench, CodeForces-Elo)
- **Logical reasoning** (ARC-Prize)

It also shows strong competitiveness in specialized domains such as:

- **Scientific and medical reasoning** (GPQA-Diamond, HealthBench)

More surprisingly, although Ring-flash-2.0 is primarily designed for complex reasoning, it outperforms all other compared models in **creative writing** (Creative Writing v3) and matches the creative capability of its "twin brother"—the non-thinking model Ling-flash-2.0.

<p align="center">
<img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*jLbeS74JqB8AAAAAWmAAAAgAemJ7AQ/original"/>
</p>

<p align="center">
<img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*_AG2T62ZWNsAAAAAWKAAAAgAemJ7AQ/original"/>
</p>

### Efficient Architecture, High-Speed Inference

<p align="center">
<img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*awCaS4yTD9UAAAAAUdAAAAgAemJ7AQ/original"/>
</p>

Building on the highly efficient MoE architecture of the Ling 2.0 series, and through structural optimizations such as a **1/32 expert activation ratio** and **MTP layers**, Ring-flash-2.0 activates only 6.1B (4.8B non-embedding) parameters while delivering performance comparable to a ~40B dense model.
Thanks to its low-activation, high-sparsity design, Ring-flash-2.0 achieves a generation speed of **200+ tokens/sec** when deployed on just four H20 GPUs, significantly reducing inference costs for thinking models in high-concurrency scenarios.
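
To make the sparsity concrete, here is a quick back-of-the-envelope check using only the parameter counts quoted above (the comparison to a ~40B dense model is this card's claim, not derived here):

```python
# Parameter counts quoted in this card.
total_params = 100e9       # total parameters in Ring-flash-2.0
activated_params = 6.1e9   # parameters activated per forward pass

# Fraction of the model active for any single token.
active_fraction = activated_params / total_params
print(f"active fraction: {active_fraction:.1%}")  # 6.1%

# The 1/32 expert activation ratio means roughly one expert in
# every 32 is consulted per token in each MoE layer.
print(f"expert activation ratio: {1 / 32:.3%}")   # 3.125%
```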

## IcePop: Cooling Down Training-Inference Gaps in RL for MoE Models

During RL training for MoE models, the precision discrepancy between the training and inference engines is more pronounced than for dense models. This gap widens progressively as sequence length and training steps increase—particularly during long-sequence generation and extended training cycles. A more critical issue is that the original GRPO algorithm begins to break down within a limited number of training steps: the probability discrepancy for the same token between the training and inference phases gradually increases, and once this relative difference exceeds 5%, training effectively fails. This poses a significant challenge for long-horizon reinforcement learning over lengthy sequences.

To address this issue, we introduced a key solution: **distribution calibration via masked bidirectional truncation, which effectively narrows the gap between training and inference**.

- Bidirectional truncation: we truncate not only tokens where the training probability is significantly higher than the inference probability, but also the reverse case, where the training probability is much lower.
- Masking: tokens with excessively large discrepancies are excluded from gradient computation.

For a detailed introduction to the algorithm, please refer to our technical blog: https://ringtech.notion.site/icepop
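
To illustrate the truncation-and-masking idea (not the exact IcePop criterion; the tolerance value and the probability-ratio formulation below are assumptions for this sketch), a minimal NumPy version might look like:

```python
import numpy as np

def icepop_mask(train_logp, infer_logp, tol=0.10):
    """Illustrative sketch of masked bidirectional truncation.

    train_logp, infer_logp: per-token log-probabilities of the same
    sampled tokens under the training and inference engines.
    tol: maximum tolerated relative discrepancy (assumed value).
    Returns a boolean mask; False = exclude the token from gradients.
    """
    # Relative discrepancy of the token probability, e.g. 0.05 = 5%.
    ratio = np.exp(train_logp - infer_logp)
    # Bidirectional: reject both ratio >> 1 and ratio << 1.
    return np.abs(ratio - 1.0) <= tol

# Toy example: the middle token's training probability is far below
# its inference probability (ratio 0.75), so it gets masked out.
train = np.log(np.array([0.50, 0.30, 0.20]))
infer = np.log(np.array([0.49, 0.40, 0.21]))
mask = icepop_mask(train, infer)
print(mask)
```

Tokens whose training/inference probability ratio drifts too far in either direction contribute no gradient, which is what keeps the gap from compounding over long RL runs.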

## SFT + RLVR + RLHF Multi-Stage Training

To comprehensively enhance the capabilities of Ring-flash-2.0, we designed a two-stage RL pipeline. First, lightweight Long-CoT SFT equips the Ling-flash-2.0-base model with diverse thinking patterns. This is followed by RL training with Verifiable Rewards (RLVR) to continually stimulate the model’s reasoning potential. Finally, an RLHF phase is incorporated to improve the model’s general abilities.

During RL training, we compared directly combining RLVR and RLHF into joint training against the two-stage RL pipeline we ultimately adopted. Both approaches showed similar effectiveness in our experiments. However, due to the differing difficulty levels of RLVR and RLHF tasks—with RLHF involving relatively shorter model rollouts—joint training resulted in more long-tail generations. From an engineering-efficiency perspective, we therefore adopted the two-stage RL approach.

<p align="center">
<img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4Q_4SbSv73YAAAAAQ6AAAAgAemJ7AQ/original"/>
</p>

## Quickstart

### 🤗 Hugging Face Transformers

Here is a code snippet to show you how to use the chat model with `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "inclusionAI/Ring-flash-2.0"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=8192
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### vLLM

#### Environment Preparation

```shell
pip install -e .
```

#### Offline Inference:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-2.0")

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=16384)

llm = LLM(model="inclusionAI/Ring-flash-2.0", dtype='bfloat16')
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
outputs = llm.generate([text], sampling_params)
```

#### Online Inference:

```shell
vllm serve inclusionAI/Ring-flash-2.0
```

To handle long context in vLLM using YaRN, we need to follow these two steps:

1. Add a `rope_scaling` field to the model's `config.json` file, for example:

```json
{
    ...,
    "rope_scaling": {
        ...
    }
}
```

2. Use an additional parameter `--max-model-len` to specify the desired maximum context length when starting the vLLM service.
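
For example, a launch command combining both steps might look like the following (the 65536 value is purely illustrative; choose the context length your `rope_scaling` factor actually supports):

```shell
vllm serve inclusionAI/Ring-flash-2.0 \
    --max-model-len 65536
```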

For detailed guidance, please refer to the vLLM [`instructions`](https://docs.vllm.ai/en/latest/).

### SGLang

#### Environment Preparation

We will submit our model to the official SGLang release later; for now, you can prepare the environment as follows:

```shell
pip3 install sglang==0.5.2rc0 sgl-kernel==0.3.7.post1
```

You can use the docker image as well:

```shell
docker pull lmsysorg/sglang:v0.5.2rc0-cu126
```

Then apply our patch to the sglang installation:

```shell
# The `patch` command is required; run `yum install -y patch` if needed
patch -d `python -c 'import sglang;import os; print(os.path.dirname(sglang.__file__))'` -p3 < inference/sglang/bailing_moe_v2.patch
```

#### Run Inference

SGLang now supports both BF16 and FP8 models; which one is used depends on the dtype of the model in `${MODEL_PATH}`. Both share the same launch command:

- Start server:

```shell
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --trust-remote-code \
    --attention-backend fa3
```

MTP is supported for the base model, but not yet for the chat model. You can add the parameter `--speculative-algorithm NEXTN` to the start command.
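
For a base-model checkpoint, the start command with MTP speculative decoding enabled might then look like this (an illustration of the note above, not an additional documented configuration):

```shell
python -m sglang.launch_server \
    --model-path $MODEL_PATH \
    --trust-remote-code \
    --attention-backend fa3 \
    --speculative-algorithm NEXTN
```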

- Client:

```shell
curl -s http://localhost:${PORT}/v1/chat/completions \
    ...
```

More usage examples can be found [here](https://docs.sglang.ai/basic_usage/send_request.html).

### Finetuning

We recommend using [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) to [finetune Ring](https://github.com/inclusionAI/Ling-V2/blob/main/docs/llamafactory_finetuning.md).

## License

This code repository is licensed under [the MIT License](https://github.com/inclusionAI/Ring-V2/blob/master/LICENSE).