腾云智算（Tenyunw）专注于大模型推理优化和GPU云基础设施。

我们在做什么：
- 推理加速：实现了 Eagle3 for training on Qwen3-8B（20,000+ downloads），显著提升LLM推理速度
- GPU云平台：为AIGC应用提供性价比最优的推理部署方案
- 量化优化：基于QAT+ DPO的训练和推理框架，帮助客户在Blackwell架构硬件上翻倍并发能力

核心团队来自腾讯云、华为云、面壁智能等，在AI基础设施领域有15年以上实战经验。

🎁 限时福利：
如果你在做大模型应用部署，我们提供免费的推理优化咨询（30分钟技术诊断），帮你分析：
- 当前部署方案的性能瓶颈
- 成本优化的具体路径
- 适合你场景的推理加速方案

适合谁：
- 月推理费用 > $5K的团队
- 需要降低推理成本30%+
- 考虑自建推理集群

联系方式：
- 创始人/CTO Rocky（前腾讯云行业架构师团队负责人）
  - 微信：[rocket-assassin]

![image](https://cdn-uploads.huggingface.co/production/uploads/66f6b53b9989324a6937cc22/EUTm6-euzdR3wzu-oIJxq.png)

  - LinkedIn: [https://www.linkedin.com/in/wangchao0808/]
- 官网：https://www.tenyunw.com/
- Email: rockywang@tenyunw.com

We help AI builders deploy faster and cheaper. Let's talk.

---
datasets:
- abisee/cnn_dailymail
- nvidia/Nemotron-Post-Training-Dataset-v2
base_model:
- MiniMaxAI/MiniMax-M2.1
base_model_relation: quantized
license: mit
pipeline_tag: text-generation
---
# MiniMax-M2.1-NVFP4

**Format:** NVFP4 — optimal partial quantization of weights & activations to NVFP4.  
**Base model:** `MiniMax-M2.1-NVFP4`  
**How it was made:** [AutoQuantized](https://nvidia.github.io/Model-Optimizer/guides/_pytorch_quantization.html#optimal-partial-quantization-using-auto-quantize) with [NVIDIA Model-Optimizer](https://github.com/NVIDIA/Model-Optimizer/) (NVFP4), using the default calibration mix. ([cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) and [nemotron-post-training-dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2)) 

Check the [original model card](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for information about this model.

---


sglang Inference Note:

```
vim /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py
```

change the code in 1517 line like this:

```
        ), f"Expected {name}_weight_scale.dim(2) == {expected_blocks[name]}, got {weight_scale.shape[-1]}"
    else:
        pass
        # For other backends, ensure the per-input block dimension is aligned to 16.
        #assert (
        #    weight_scale.shape[assert_dim] % block_size == 0
        #), f"Expected {name}_weight_scale.dim({assert_dim}) to be divisible by {block_size}"
```

deploy command MiniMax-M2.1-NVFP4 on sglang:
```
python3 -m sglang.launch_server --model-path  MiniMax-M2.1-NVFP4/   --quantization modelopt_fp4  --tp 8 --attention-backend flashinfer  --trust-remote-code
```

# perf

We performed deployment on 8x 5090, and the stress test performance data is provided below.

```

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   10.7085 |
+-----------------------------------+-----------+
| Number of concurrency             |    1      |
+-----------------------------------+-----------+
| Request rate (req/s)              |   -1      |
+-----------------------------------+-----------+
| Total requests                    |    1      |
+-----------------------------------+-----------+
| Succeed requests                  |    1      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   47.8126 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  143.438  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0934 |
+-----------------------------------+-----------+
| Average latency (s)               |   10.7085 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.5682 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0198 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0202 |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request |  512      |
+-----------------------------------+-----------+
2026-01-07 04:00:24 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.5682  | 0.0196  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     25%     |  0.5682  | 0.0197  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     50%     |  0.5682  | 0.0198  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     66%     |  0.5682  | 0.0199  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     75%     |  0.5682  | 0.0199  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     80%     |  0.5682  | 0.0199  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     90%     |  0.5682  | 0.0201  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     95%     |  0.5682  | 0.0204  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     98%     |  0.5682  | 0.0393  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     99%     |  0.5682  | 0.0396  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   24.0981 |
+-----------------------------------+-----------+
| Number of concurrency             |   16      |
+-----------------------------------+-----------+
| Request rate (req/s)              |   -1      |
+-----------------------------------+-----------+
| Total requests                    |   16      |
+-----------------------------------+-----------+
| Succeed requests                  |   16      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  339.944  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 1019.83   |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.664  |
+-----------------------------------+-----------+
| Average latency (s)               |   24.0845 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    5.7343 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0359 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0358 |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request |  512      |
+-----------------------------------+-----------+
2026-01-07 04:11:34 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  2.4771  | 0.0275  |  0.0301  |   24.0913   |     1024     |      512      |    21.2486     |    63.7458    |
|     25%     |  3.6108  |  0.028  |  0.0313  |   24.0928   |     1024     |      512      |    21.2493     |    63.7479    |
|     50%     |  5.8605  | 0.0284  |  0.0357  |   24.0939   |     1024     |      512      |    21.2507     |    63.7521    |
|     66%     |  6.985   | 0.0287  |  0.0379  |   24.094    |     1024     |      512      |     21.251     |    63.753     |
|     75%     |   8.11   | 0.0289  |  0.0401  |   24.095    |     1024     |      512      |     21.252     |    63.7559    |
|     80%     |   8.11   |  0.029  |  0.0401  |   24.095    |     1024     |      512      |     21.252     |    63.7559    |
|     90%     |  8.7294  | 0.0295  |  0.0423  |   24.0957   |     1024     |      512      |    21.2525     |    63.7576    |
|     95%     |  9.3849  | 0.0298  |  0.0445  |   24.0971   |     1024     |      512      |    21.3819     |    64.1458    |
|     98%     |  9.3849  | 0.0308  |  0.0445  |   24.0971   |     1024     |      512      |    21.3819     |    64.1458    |
|     99%     |  9.3849  | 0.0328  |  0.0445  |   24.0971   |     1024     |      512      |    21.3819     |    64.1458    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   35.9271 |
+-----------------------------------+-----------+
| Number of concurrency             |   32      |
+-----------------------------------+-----------+
| Request rate (req/s)              |   -1      |
+-----------------------------------+-----------+
| Total requests                    |   32      |
+-----------------------------------+-----------+
| Succeed requests                  |   32      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  456.034  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 1368.1    |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.8907 |
+-----------------------------------+-----------+
| Average latency (s)               |   35.914  |
+-----------------------------------+-----------+
| Average time to first token (s)   |   10.2324 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0503 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0501 |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request |  512      |
+-----------------------------------+-----------+
2026-01-07 04:14:20 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  3.5765  | 0.0338  |  0.0369  |   35.913    |     1024     |      512      |    14.2526     |    42.7577    |
|     25%     |  5.8246  | 0.0351  |  0.0413  |   35.9153   |     1024     |      512      |     14.254     |    42.7621    |
|     50%     | 10.3224  | 0.0356  |  0.0501  |   35.9185   |     1024     |      512      |    14.2545     |    42.7636    |
|     66%     | 13.6935  | 0.0359  |  0.0564  |   35.9196   |     1024     |      512      |    14.2556     |    42.7667    |
|     75%     | 14.8185  | 0.0362  |  0.0589  |   35.9198   |     1024     |      512      |    14.2561     |    42.7682    |
|     80%     | 15.9416  | 0.0365  |  0.0611  |   35.9199   |     1024     |      512      |    14.2561     |    42.7683    |
|     90%     | 17.0672  |  0.037  |  0.0633  |   35.9233   |     1024     |      512      |    14.2567     |     42.77     |
|     95%     | 17.6148  | 0.0373  |  0.0655  |   35.9244   |     1024     |      512      |    14.2573     |    42.7718    |
|     98%     | 17.6331  | 0.0379  |  0.0677  |   35.9256   |     1024     |      512      |    14.3066     |    42.9197    |
|     99%     | 17.6331  | 0.0394  |  0.0677  |   35.9256   |     1024     |      512      |    14.3066     |    42.9197    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```