Tenyunw (腾云智算) focuses on large-model inference optimization and GPU cloud infrastructure.

What we do:
- Inference acceleration: implemented Eagle3 for training on Qwen3-8B (20,000+ downloads), significantly speeding up LLM inference
- GPU cloud platform: the most cost-effective inference deployment solutions for AIGC applications
- Quantization optimization: a QAT + DPO training and inference framework that helps customers double their concurrency on Blackwell-architecture hardware

The core team comes from Tencent Cloud, Huawei Cloud, ModelBest (面壁智能), and others, with more than 15 years of hands-on experience in AI infrastructure.

🎁 Limited-time offer: if you are deploying large-model applications, we offer a free inference optimization consultation (a 30-minute technical diagnosis) to help you analyze:
- The performance bottlenecks of your current deployment
- Concrete paths to cost reduction
- The inference acceleration approach that fits your scenario

Who it is for:
- Teams spending more than $5K per month on inference
- Teams that need to cut inference costs by 30% or more
- Teams considering building their own inference cluster

Contact:
- Founder/CTO Rocky (former head of the Tencent Cloud industry architect team)
- WeChat: [rocket-assassin] ![image](https://cdn-uploads.huggingface.co/production/uploads/66f6b53b9989324a6937cc22/EUTm6-euzdR3wzu-oIJxq.png)
- LinkedIn: [https://www.linkedin.com/in/wangchao0808/]
- Website: https://www.tenyunw.com/
- Email: rockywang@tenyunw.com

We help AI builders deploy faster and cheaper. Let's talk.

---
datasets:
- abisee/cnn_dailymail
- nvidia/Nemotron-Post-Training-Dataset-v2
base_model:
- MiniMaxAI/MiniMax-M2.1
base_model_relation: quantized
license: mit
pipeline_tag: text-generation
---

# MiniMax-M2.1-NVFP4

**Format:** NVFP4, with optimal partial quantization of weights and activations to NVFP4.

**Base model:** `MiniMaxAI/MiniMax-M2.1`

**How it was made:** [AutoQuantized](https://nvidia.github.io/Model-Optimizer/guides/_pytorch_quantization.html#optimal-partial-quantization-using-auto-quantize) with [NVIDIA Model-Optimizer](https://github.com/NVIDIA/Model-Optimizer/) (NVFP4), using the default calibration mix ([cnn_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail) and [nemotron-post-training-dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2)).

Check the [original model card](https://huggingface.co/MiniMaxAI/MiniMax-M2.1) for information about this model.

---

SGLang inference note: open the following file.

```
vim /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py
```

Change the code at line 1517 to the following, which replaces the block-alignment assertion with `pass`:

```
    ), f"Expected {name}_weight_scale.dim(2) == {expected_blocks[name]}, got {weight_scale.shape[-1]}"
else:
    pass
    # For other backends, ensure the per-input block dimension is aligned to 16.
    # assert (
    #     weight_scale.shape[assert_dim] % block_size == 0
    # ), f"Expected {name}_weight_scale.dim({assert_dim}) to be divisible by {block_size}"
```

Command to deploy MiniMax-M2.1-NVFP4 on SGLang:

```
python3 -m sglang.launch_server --model-path MiniMax-M2.1-NVFP4/ --quantization modelopt_fp4 --tp 8 --attention-backend flashinfer --trust-remote-code
```

# Performance

We deployed the model on 8x RTX 5090 GPUs. The stress-test results at concurrency 1, 16, and 32 are listed below.

```
Benchmarking summary:
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 10.7085   |
+-----------------------------------+-----------+
| Number of concurrency             | 1         |
+-----------------------------------+-----------+
| Request rate (req/s)              | -1        |
+-----------------------------------+-----------+
| Total requests                    | 1         |
+-----------------------------------+-----------+
| Succeed requests                  | 1         |
+-----------------------------------+-----------+
| Failed requests                   | 0         |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   | 47.8126   |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 143.438   |
+-----------------------------------+-----------+
| Request throughput (req/s)        | 0.0934    |
+-----------------------------------+-----------+
| Average latency (s)               | 10.7085   |
+-----------------------------------+-----------+
| Average time to first token (s)   | 0.5682    |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0198    |
+-----------------------------------+-----------+
| Average inter-token latency (s)   | 0.0202    |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request | 512       |
+-----------------------------------+-----------+
2026-01-07 04:00:24 - evalscope - INFO: Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10%         | 0.5682   | 0.0196  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 25%         | 0.5682   | 0.0197  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 50%         | 0.5682   | 0.0198  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 66%         | 0.5682   | 0.0199  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 75%         | 0.5682   | 0.0199  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 80%         | 0.5682   | 0.0199  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 90%         | 0.5682   | 0.0201  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 95%         | 0.5682   | 0.0204  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 98%         | 0.5682   | 0.0393  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
| 99%         | 0.5682   | 0.0396  | 0.0198   | 10.7085     | 1024         | 512           | 47.8126        | 143.4379      |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 24.0981   |
+-----------------------------------+-----------+
| Number of concurrency             | 16        |
+-----------------------------------+-----------+
| Request rate (req/s)              | -1        |
+-----------------------------------+-----------+
| Total requests                    | 16        |
+-----------------------------------+-----------+
| Succeed requests                  | 16        |
+-----------------------------------+-----------+
| Failed requests                   | 0         |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   | 339.944   |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 1019.83   |
+-----------------------------------+-----------+
| Request throughput (req/s)        | 0.664     |
+-----------------------------------+-----------+
| Average latency (s)               | 24.0845   |
+-----------------------------------+-----------+
| Average time to first token (s)   | 5.7343    |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0359    |
+-----------------------------------+-----------+
| Average inter-token latency (s)   | 0.0358    |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request | 512       |
+-----------------------------------+-----------+
2026-01-07 04:11:34 - evalscope - INFO: Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10%         | 2.4771   | 0.0275  | 0.0301   | 24.0913     | 1024         | 512           | 21.2486        | 63.7458       |
| 25%         | 3.6108   | 0.028   | 0.0313   | 24.0928     | 1024         | 512           | 21.2493        | 63.7479       |
| 50%         | 5.8605   | 0.0284  | 0.0357   | 24.0939     | 1024         | 512           | 21.2507        | 63.7521       |
| 66%         | 6.985    | 0.0287  | 0.0379   | 24.094      | 1024         | 512           | 21.251         | 63.753        |
| 75%         | 8.11     | 0.0289  | 0.0401   | 24.095      | 1024         | 512           | 21.252         | 63.7559       |
| 80%         | 8.11     | 0.029   | 0.0401   | 24.095      | 1024         | 512           | 21.252         | 63.7559       |
| 90%         | 8.7294   | 0.0295  | 0.0423   | 24.0957     | 1024         | 512           | 21.2525        | 63.7576       |
| 95%         | 9.3849   | 0.0298  | 0.0445   | 24.0971     | 1024         | 512           | 21.3819        | 64.1458       |
| 98%         | 9.3849   | 0.0308  | 0.0445   | 24.0971     | 1024         | 512           | 21.3819        | 64.1458       |
| 99%         | 9.3849   | 0.0328  | 0.0445   | 24.0971     | 1024         | 512           | 21.3819        | 64.1458       |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 35.9271   |
+-----------------------------------+-----------+
| Number of concurrency             | 32        |
+-----------------------------------+-----------+
| Request rate (req/s)              | -1        |
+-----------------------------------+-----------+
| Total requests                    | 32        |
+-----------------------------------+-----------+
| Succeed requests                  | 32        |
+-----------------------------------+-----------+
| Failed requests                   | 0         |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   | 456.034   |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 1368.1    |
+-----------------------------------+-----------+
| Request throughput (req/s)        | 0.8907    |
+-----------------------------------+-----------+
| Average latency (s)               | 35.914    |
+-----------------------------------+-----------+
| Average time to first token (s)   | 10.2324   |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0503    |
+-----------------------------------+-----------+
| Average inter-token latency (s)   | 0.0501    |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request | 512       |
+-----------------------------------+-----------+
2026-01-07 04:14:20 - evalscope - INFO: Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10%         | 3.5765   | 0.0338  | 0.0369   | 35.913      | 1024         | 512           | 14.2526        | 42.7577       |
| 25%         | 5.8246   | 0.0351  | 0.0413   | 35.9153     | 1024         | 512           | 14.254         | 42.7621       |
| 50%         | 10.3224  | 0.0356  | 0.0501   | 35.9185     | 1024         | 512           | 14.2545        | 42.7636       |
| 66%         | 13.6935  | 0.0359  | 0.0564   | 35.9196     | 1024         | 512           | 14.2556        | 42.7667       |
| 75%         | 14.8185  | 0.0362  | 0.0589   | 35.9198     | 1024         | 512           | 14.2561        | 42.7682       |
| 80%         | 15.9416  | 0.0365  | 0.0611   | 35.9199     | 1024         | 512           | 14.2561        | 42.7683       |
| 90%         | 17.0672  | 0.037   | 0.0633   | 35.9233     | 1024         | 512           | 14.2567        | 42.77         |
| 95%         | 17.6148  | 0.0373  | 0.0655   | 35.9244     | 1024         | 512           | 14.2573        | 42.7718       |
| 98%         | 17.6331  | 0.0379  | 0.0677   | 35.9256     | 1024         | 512           | 14.3066        | 42.9197       |
| 99%         | 17.6331  | 0.0394  | 0.0677   | 35.9256     | 1024         | 512           | 14.3066        | 42.9197       |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
```
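
As a sanity check on the summaries above, the throughput rows can be recomputed from the request count, the per-request token counts, and the wall-clock time. The sketch below assumes each throughput figure is simply total tokens (or requests) divided by "Time taken for tests"; the function and variable names are illustrative and are not part of evalscope.

```python
# Values copied from the three benchmark summaries above.
INPUT_TOKENS = 1024   # average input tokens per request
OUTPUT_TOKENS = 512   # average output tokens per request

# concurrency -> (total requests, time taken for tests in seconds)
RUNS = {
    1: (1, 10.7085),
    16: (16, 24.0981),
    32: (32, 35.9271),
}


def throughput(requests, seconds):
    """Return (output tok/s, total tok/s, req/s) for one run.

    Assumption: throughput = tokens (or requests) / wall-clock time.
    """
    output_tps = requests * OUTPUT_TOKENS / seconds
    total_tps = requests * (INPUT_TOKENS + OUTPUT_TOKENS) / seconds
    return output_tps, total_tps, requests / seconds


def summarize(runs=RUNS):
    """Map each concurrency level to its derived throughput figures."""
    return {c: throughput(n, t) for c, (n, t) in runs.items()}


if __name__ == "__main__":
    results = summarize()
    base_output_tps = results[1][0]  # single-request output throughput
    for concurrency, (out_tps, total_tps, req_s) in results.items():
        print(
            f"concurrency={concurrency:>2}  output={out_tps:8.2f} tok/s  "
            f"total={total_tps:8.2f} tok/s  req/s={req_s:.4f}  "
            f"scaling vs 1: {out_tps / base_output_tps:.2f}x"
        )
```

Under that assumption the recomputed values agree with the reported ones to within about 0.01 tok/s, and aggregate output throughput at concurrency 16 and 32 comes out at roughly 7.1x and 9.5x the single-request figure, while per-request output speed drops from about 47.8 tok/s to about 21.2 and 14.3 tok/s.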