docs: add KTransformers CPU offloading inference guide
Add KTransformers as a recommended inference option for MiMo-V2-Flash.
KTransformers enables efficient deployment on consumer-grade hardware by offloading MoE expert computations to CPU while keeping other components on GPU. With 4× RTX 5090 + 2× AMD EPYC 9355, it achieves up to 35.7 tokens/s decode speed.
Benchmarks: https://ktransformers.net/benchmarks#MiMo-V2-Flash-FP8-TP4
README.md

````diff
@@ -250,6 +250,12 @@ curl -i http://localhost:9001/v1/chat/completions \
 }'
 ```
 
+### Inference with KTransformers (CPU Offloading)
+
+Built on top of SGLang, [KTransformers](https://github.com/kvcache-ai/ktransformers) enables efficient MiMo-V2-Flash deployment on consumer-grade hardware by offloading MoE expert computations to the CPU while keeping other components on GPU. With **4× RTX 5090 + 2× AMD EPYC 9355**, it achieves up to **35.7 tokens/s** decode speed.
+
+For quick start and benchmarks, visit [KTransformers](https://ktransformers.net/zh/benchmarks#MiMo-V2-Flash-FP8-TP4).
+
 ### Notifications
 
 #### 1. System prompt
````
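Once a KTransformers server is up, it exposes the same OpenAI-compatible `/v1/chat/completions` endpoint that the README's curl example targets. Below is a minimal Python sketch of that request; the port (9001) mirrors the curl example, and the model name is an assumption — use whatever id your server actually reports:

```python
import json
import urllib.request

# Same OpenAI-compatible chat-completions request as the curl example above.
# Port 9001 matches that example; adjust if your server listens elsewhere.
url = "http://localhost:9001/v1/chat/completions"
payload = {
    "model": "MiMo-V2-Flash",  # hypothetical id; use the one your server reports
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Send once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, existing OpenAI client libraries should also work by pointing their base URL at the local server.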