GGUF quants of Ring-mini-2.0

llama.cpp usage: the `--jinja` flag must be passed so that the model's chat template is applied.
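A minimal invocation sketch (the quant filename is an assumption; substitute whichever GGUF file you downloaded):

```bash
# Hypothetical example: replace the model path with your chosen quant.
# --jinja applies the chat template embedded in the GGUF metadata.
./llama-cli -m Ring-mini-2.0-Q4_K_M.gguf --jinja -p "Why is the sky blue?"
```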

Quantized with llama.cpp (commit 6de8ed75196c7cd98c1f34bbf3a7452451ba8ac2).

The importance matrix was generated with the combined_all_medium dataset from eaddario/imatrix-calibration.

All quants were generated/calibrated with the imatrix, including the K quants.

Compressed from BF16.
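For reference, a sketch of the general llama.cpp imatrix workflow used for quants like these (the file paths and calibration filename are assumptions, not the exact commands used for this repo):

```bash
# 1. Compute the importance matrix from the BF16 GGUF and a calibration text file.
./llama-imatrix -m Ring-mini-2.0-BF16.gguf -f combined_all_medium.txt -o imatrix.dat

# 2. Quantize with the imatrix; this applies to K quants such as Q4_K_M as well.
./llama-quantize --imatrix imatrix.dat Ring-mini-2.0-BF16.gguf Ring-mini-2.0-Q4_K_M.gguf Q4_K_M
```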


Ring-mini-2.0

🤗 Hugging Face   |   🤖 ModelScope   |   Experience Now

Today, we officially release Ring-mini-2.0, a high-performance reasoning MoE model deeply optimized on top of the Ling 2.0 architecture. With only 16.8B total parameters and 1.4B activated parameters, it achieves comprehensive reasoning capabilities comparable to those of dense models below the 10B scale. It excels particularly at logical reasoning, code generation, and mathematical tasks, while supporting 128K long-context processing and 300+ tokens/s high-speed generation.

Enhanced Reasoning: Joint Training with SFT + RLVR + RLHF

Built upon Ling-mini-2.0-base, Ring-mini-2.0 undergoes further training with Long-CoT SFT, a more stable and continuous RLVR stage, and joint RLHF optimization, significantly improving the stability and generalization of complex reasoning. On multiple challenging benchmarks (LiveCodeBench, AIME 2025, GPQA, ARC-AGI-v1, etc.), it outperforms dense models below 10B and even rivals larger MoE models (e.g., gpt-oss-20B-medium) at comparable output lengths, excelling particularly at logical reasoning.

High Sparsity, High-Speed Generation

Inheriting the efficient MoE design of the Ling 2.0 series, Ring-mini-2.0 activates only 1.4B parameters yet matches the performance of 7–8B dense models, thanks to architectural optimizations such as a 1/32 expert activation ratio and MTP (multi-token prediction) layers. Owing to this low-activation, high-sparsity design, Ring-mini-2.0 delivers a throughput of 300+ tokens/s when deployed on an H20; with Expert Dual Streaming inference optimization, this can be further boosted to 500+ tokens/s, significantly reducing inference costs for high-concurrency scenarios involving thinking models. Additionally, YaRN extrapolation extends support to 128K long-context processing, achieving up to a 7x relative speedup in long-output scenarios.
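A hedged llama-server sketch for long-context use with llama.cpp's YaRN support (the --rope-scale value assumes a 32K native window extended 4x to 128K; verify against the model's actual metadata before relying on it):

```bash
# Hypothetical long-context serving example; the scaling factor is an
# assumption and should be checked against the model's native context.
./llama-server -m Ring-mini-2.0-Q4_K_M.gguf \
  --jinja \
  -c 131072 \
  --rope-scaling yarn --rope-scale 4 \
  -ngl 99 --port 8080
```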

Model Downloads

| Model | #Total Params | #Activated Params | Context Length | Download |
|:-----:|:-------------:|:-----------------:|:--------------:|:--------:|
| Ring-mini-2.0 | 16.8B | 1.4B | 128K | 🤗 HuggingFace · 🤖 ModelScope |
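To fetch a quant from this repo with the huggingface_hub CLI (the include pattern is illustrative; pick the bit width you want):

```bash
# Downloads only the Q4_K_M file(s) from this repo into ./models.
huggingface-cli download redponike/Ring-mini-2.0-GGUF \
  --include "*Q4_K_M*" --local-dir ./models
```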
GGUF details

Architecture: bailingmoe2
Model size: 16.8B params

Quantizations are provided at 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit precision.

