Update README.md
README.md
CHANGED
|
@@ -1,59 +1,90 @@
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
base_model:
|
| 4 |
-
- inclusionAI/Ling-mini-base-2.0-20T
|
| 5 |
pipeline_tag: text-generation
|
| 6 |
library_name: transformers
|
| 7 |
---
|
|
|
|
| 8 |
# Ring-mini-2.0
|
| 9 |
|
| 10 |
<p align="center">
|
| 11 |
<img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4QxcQrBlTiAAAAAAQXAAAAgAemJ7AQ/original" width="100"/>
|
| 12 |
<p>
|
| 13 |
-
|
| 14 |
-
<p align="center">🤗 <a href="https://huggingface.co/inclusionAI">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/inclusionAI">ModelScope</a></p>
|
| 15 |
|
| 16 |
Today, we officially release Ring-mini-2.0, a high-performance inference-oriented MoE model deeply optimized based on the Ling 2.0 architecture. With only 16B total parameters and 1.4B activated parameters, it achieves comprehensive reasoning capabilities comparable to dense models below the 10B scale. It excels particularly in logical reasoning, code generation, and mathematical tasks, while supporting 128K long-context processing and 300+ tokens/s high-speed generation.
|
| 17 |
|
| 18 |
## Enhanced Reasoning: Joint Training with SFT + RLVR + RLHF
|
|
|
|
| 19 |
Built upon Ling-mini-2.0-base, Ring-mini-2.0 undergoes further training with Long-CoT SFT, more stable and continuous RLVR, and RLHF joint optimization, significantly improving the stability and generalization of complex reasoning. On multiple challenging benchmarks (LiveCodeBench, AIME 2025, GPQA, ARC-AGI-v1, etc.), it outperforms dense models below 10B and even rivals larger MoE models (e.g., gpt-oss-20B-medium) with comparable output lengths, particularly excelling in logical reasoning.
|
| 20 |
|
| 21 |
<p align="center">
|
| 22 |
<img src="https://mdn.alipayobjects.com/huamei_d2byvp/afts/img/O2YKQqkdEvAAAAAASzAAAAgADod9AQFr/original" width="1000"/>
|
| 23 |
|
| 24 |
<p>
|
| 25 |
-
|
| 26 |
## High Sparsity, High-Speed Generation
|
| 27 |
Inheriting the efficient MoE design of the Ling 2.0 series, Ring-mini-2.0 activates only 1.4B parameters and achieves performance equivalent to 7–8B dense models through architectural optimizations such as 1/32 expert activation ratio and MTP layers. Thanks to its low activation and high sparsity design, Ring-mini-2.0 delivers a throughput of 300+ tokens/s when deployed on H20. With Expert Dual Streaming inference optimization, this can be further boosted to 500+ tokens/s, significantly reducing inference costs for high-concurrency scenarios involving thinking models. Additionally, with YaRN extrapolation, it supports 128K long-context processing, achieving a relative speedup of up to 7x in long-output scenarios.
|
| 28 |
<p align="center">
|
| 29 |
<img src="https://mdn.alipayobjects.com/huamei_d2byvp/afts/img/gjJKSpFVphEAAAAAgdAAAAgADod9AQFr/original" width="1000"/>
|
| 30 |
<p>
|
| 31 |
-
|
| 32 |
<p align="center">
|
| 33 |
<img src="https://mdn.alipayobjects.com/huamei_d2byvp/afts/img/o-vGQadCF_4AAAAAgLAAAAgADod9AQFr/original" width="1000"/>
|
| 34 |
<p>
|
| 35 |
|
| 36 |
-
|
| 37 |
## Model Downloads
|
| 38 |
|
| 39 |
<div align="center">
|
| 40 |
|
| 41 |
-
|
|
| 42 |
-
| :-----------
|
| 43 |
-
| Ring-mini-2.0 | 16.8B | 1.4B
|
|
|
|
| 44 |
</div>
|
| 45 |
|
| 46 |
## Quickstart
|
| 47 |
|
| 48 |
### 🤗 Hugging Face Transformers
|
| 49 |
|
| 50 |
Here is a code snippet to show you how to use the chat model with `transformers`:
|
| 51 |
|
| 52 |
```python
|
| 53 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 54 |
-
|
| 55 |
model_name = "inclusionAI/Ring-mini-2.0"
|
| 56 |
-
|
| 57 |
model = AutoModelForCausalLM.from_pretrained(
|
| 58 |
model_name,
|
| 59 |
torch_dtype="auto",
|
|
@@ -61,7 +92,6 @@ model = AutoModelForCausalLM.from_pretrained(
|
|
| 61 |
trust_remote_code=True
|
| 62 |
)
|
| 63 |
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 64 |
-
|
| 65 |
prompt = "Give me a short introduction to large language models."
|
| 66 |
messages = [
|
| 67 |
{"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
|
|
@@ -74,7 +104,6 @@ text = tokenizer.apply_chat_template(
|
|
| 74 |
enable_thinking=True
|
| 75 |
)
|
| 76 |
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)
|
| 77 |
-
|
| 78 |
generated_ids = model.generate(
|
| 79 |
**model_inputs,
|
| 80 |
max_new_tokens=8192
|
|
@@ -82,12 +111,13 @@ generated_ids = model.generate(
|
|
| 82 |
generated_ids = [
|
| 83 |
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
|
| 84 |
]
|
| 85 |
-
|
| 86 |
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
| 87 |
```
|
| 88 |
|
| 89 |
## License
|
|
|
|
| 90 |
This code repository is licensed under [the MIT License](https://huggingface.co/inclusionAI/Ring-mini-2.0/blob/main/LICENSE).
|
| 91 |
|
| 92 |
## Citation
|
| 93 |
-
|
|
|
|
|
|
| 1 |
---
|
| 2 |
license: mit
|
| 3 |
base_model:
|
| 4 |
+
- inclusionAI/Ling-mini-base-2.0-20T
|
| 5 |
pipeline_tag: text-generation
|
| 6 |
library_name: transformers
|
| 7 |
---
|
| 8 |
+
|
| 9 |
# Ring-mini-2.0
|
| 10 |
|
| 11 |
<p align="center">
|
| 12 |
<img src="https://mdn.alipayobjects.com/huamei_qa8qxu/afts/img/A*4QxcQrBlTiAAAAAAQXAAAAgAemJ7AQ/original" width="100"/>
|
| 13 |
<p>
|
| 14 |
+
<p align="center">🤗 <a href="https://huggingface.co/inclusionAI">Hugging Face</a>   |   🤖 <a href="https://modelscope.cn/organization/inclusionAI">ModelScope</a>   |   <a href="https://zenmux.ai/inclusionai/ring-mini-2.0?utm_source=hf_inclusionAI">Experience Now</a></p>
|
|
|
|
| 15 |
|
| 16 |
Today, we officially release Ring-mini-2.0, a high-performance inference-oriented MoE model deeply optimized based on the Ling 2.0 architecture. With only 16B total parameters and 1.4B activated parameters, it achieves comprehensive reasoning capabilities comparable to dense models below the 10B scale. It excels particularly in logical reasoning, code generation, and mathematical tasks, while supporting 128K long-context processing and 300+ tokens/s high-speed generation.
|
| 17 |
|
| 18 |
## Enhanced Reasoning: Joint Training with SFT + RLVR + RLHF
|
| 19 |
+
|
| 20 |
Built upon Ling-mini-2.0-base, Ring-mini-2.0 undergoes further training with Long-CoT SFT, more stable and continuous RLVR, and RLHF joint optimization, significantly improving the stability and generalization of complex reasoning. On multiple challenging benchmarks (LiveCodeBench, AIME 2025, GPQA, ARC-AGI-v1, etc.), it outperforms dense models below 10B and even rivals larger MoE models (e.g., gpt-oss-20B-medium) with comparable output lengths, particularly excelling in logical reasoning.
|
| 21 |
|
| 22 |
<p align="center">
|
| 23 |
<img src="https://mdn.alipayobjects.com/huamei_d2byvp/afts/img/O2YKQqkdEvAAAAAASzAAAAgADod9AQFr/original" width="1000"/>
|
| 24 |
|
| 25 |
<p>
|
|
|
|
| 26 |
## High Sparsity, High-Speed Generation
|
| 27 |
Inheriting the efficient MoE design of the Ling 2.0 series, Ring-mini-2.0 activates only 1.4B parameters and achieves performance equivalent to 7–8B dense models through architectural optimizations such as 1/32 expert activation ratio and MTP layers. Thanks to its low activation and high sparsity design, Ring-mini-2.0 delivers a throughput of 300+ tokens/s when deployed on H20. With Expert Dual Streaming inference optimization, this can be further boosted to 500+ tokens/s, significantly reducing inference costs for high-concurrency scenarios involving thinking models. Additionally, with YaRN extrapolation, it supports 128K long-context processing, achieving a relative speedup of up to 7x in long-output scenarios.
|
| 28 |
<p align="center">
|
| 29 |
<img src="https://mdn.alipayobjects.com/huamei_d2byvp/afts/img/gjJKSpFVphEAAAAAgdAAAAgADod9AQFr/original" width="1000"/>
|
| 30 |
<p>
|
|
|
|
| 31 |
<p align="center">
|
| 32 |
<img src="https://mdn.alipayobjects.com/huamei_d2byvp/afts/img/o-vGQadCF_4AAAAAgLAAAAgADod9AQFr/original" width="1000"/>
|
| 33 |
<p>
|
| 34 |
|
|
|
|
| 35 |
## Model Downloads
|
| 36 |
|
| 37 |
<div align="center">
|
| 38 |
|
| 39 |
+
| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download** |
|
| 40 |
+
| :-----------: | :---------------: | :-------------------: | :----------------: | :--------------------------------------------------------------------------------------------------------------------------------------------: |
|
| 41 |
+
| Ring-mini-2.0 | 16.8B | 1.4B | 128K | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-mini-2.0) <br>[🤖 Modelscope](https://modelscope.cn/models/inclusionAI/Ring-mini-2.0) |
|
| 42 |
+
|
| 43 |
</div>
|
| 44 |
|
| 45 |
## Quickstart
|
| 46 |
|
| 47 |
+
### Try Online
|
| 48 |
+
|
| 49 |
+
You can experience Ring-mini-2.0 online at: [ZenMux](https://zenmux.ai/inclusionai/ring-mini-2.0?utm_source=hf_inclusionAI)
|
| 50 |
+
|
| 51 |
+
### API Usage
|
| 52 |
+
|
| 53 |
+
You can also use Ring-mini-2.0 through API calls:
|
| 54 |
+
|
| 55 |
+
```python
|
| 56 |
+
from openai import OpenAI
|
| 57 |
+
|
| 58 |
+
# 1. Initialize the OpenAI client
|
| 59 |
+
client = OpenAI(
|
| 60 |
+
# 2. Point the base URL to the ZenMux endpoint
|
| 61 |
+
base_url="https://zenmux.ai/api/v1",
|
| 62 |
+
# 3. Replace with the API Key from your ZenMux user console
|
| 63 |
+
api_key="<your ZENMUX_API_KEY>",
|
| 64 |
+
)
|
| 65 |
+
|
| 66 |
+
# 4. Make a request
|
| 67 |
+
completion = client.chat.completions.create(
|
| 68 |
+
# 5. Specify the model to use in the format "provider/model-name"
|
| 69 |
+
model="inclusionai/ring-mini-2.0",
|
| 70 |
+
messages=[
|
| 71 |
+
{
|
| 72 |
+
"role": "user",
|
| 73 |
+
"content": "What is the meaning of life?"
|
| 74 |
+
}
|
| 75 |
+
]
|
| 76 |
+
)
|
| 77 |
+
|
| 78 |
+
print(completion.choices[0].message.content)
|
| 79 |
+
```
|
| 80 |
+
|
| 81 |
### 🤗 Hugging Face Transformers
|
| 82 |
|
| 83 |
Here is a code snippet to show you how to use the chat model with `transformers`:
|
| 84 |
|
| 85 |
```python
|
| 86 |
from transformers import AutoModelForCausalLM, AutoTokenizer
|
|
|
|
| 87 |
model_name = "inclusionAI/Ring-mini-2.0"
|
|
|
|
| 88 |
model = AutoModelForCausalLM.from_pretrained(
|
| 89 |
model_name,
|
| 90 |
torch_dtype="auto",
|
|
|
|
| 92 |
trust_remote_code=True
|
| 93 |
)
|
| 94 |
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
|
|
|
| 95 |
prompt = "Give me a short introduction to large language models."
|
| 96 |
messages = [
|
| 97 |
{"role": "system", "content": "You are Ring, an assistant created by inclusionAI"},
|
|
|
|
| 104 |
enable_thinking=True
|
| 105 |
)
|
| 106 |
model_inputs = tokenizer([text], return_tensors="pt", return_token_type_ids=False).to(model.device)
|
|
|
|
| 107 |
generated_ids = model.generate(
|
| 108 |
**model_inputs,
|
| 109 |
max_new_tokens=8192
|
|
|
|
| 111 |
generated_ids = [
|
| 112 |
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
|
| 113 |
]
|
|
|
|
| 114 |
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
|
| 115 |
```
|
| 116 |
|
| 117 |
## License
|
| 118 |
+
|
| 119 |
This code repository is licensed under [the MIT License](https://huggingface.co/inclusionAI/Ring-mini-2.0/blob/main/LICENSE).
|
| 120 |
|
| 121 |
## Citation
|
| 122 |
+
|
| 123 |
+
TODO
|
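The headline figures in the README can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: it takes the numbers quoted in the README (16.8B total and 1.4B activated parameters, 300+ and 500+ tokens/s, and `max_new_tokens=8192` from the Transformers snippet) at face value and assumes a constant decode rate, which real deployments will not have. The function names are hypothetical helpers, not part of any library.

```python
# Figures quoted in the README, treated as nominal values rather than measurements.
TOTAL_PARAMS_B = 16.8   # "#Total Params" column of the model table
ACTIVE_PARAMS_B = 1.4   # "#Activated Params" column of the model table
MAX_NEW_TOKENS = 8192   # max_new_tokens used in the Transformers snippet


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's parameters that are used for each token."""
    return active_b / total_b


def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens, assuming a constant decode rate."""
    return num_tokens / tokens_per_second


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
baseline_s = decode_seconds(MAX_NEW_TOKENS, 300)   # quoted H20 throughput
optimized_s = decode_seconds(MAX_NEW_TOKENS, 500)  # quoted Expert Dual Streaming throughput

print(f"active fraction: {fraction:.1%}")
print(f"{MAX_NEW_TOKENS} tokens at 300 tok/s: {baseline_s:.1f} s")
print(f"{MAX_NEW_TOKENS} tokens at 500 tok/s: {optimized_s:.1f} s")
```

The activated share works out to about 1/12 of the total, which is larger than the 1/32 expert activation ratio. One plausible reading, not stated in the README, is that non-expert parameters are active for every token.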