ErvinX committed
Commit 69da6c0 · verified · 1 Parent(s): 5a66373

docs: add KTransformers CPU offloading inference guide


Add KTransformers as a recommended inference option for MiMo-V2-Flash.

KTransformers enables efficient deployment on consumer-grade hardware by offloading MoE expert computations to CPU while keeping other components on GPU. With 4× RTX 5090 + 2× AMD EPYC 9355, it achieves up to 35.7 tokens/s decode speed.

Benchmarks: https://ktransformers.net/benchmarks#MiMo-V2-Flash-FP8-TP4

Files changed (1)
  1. README.md +6 -0
README.md CHANGED
@@ -250,6 +250,12 @@ curl -i http://localhost:9001/v1/chat/completions \
   }'
   ```
 
+ ### Inference with KTransformers (CPU Offloading)
+
+ [KTransformers](https://github.com/kvcache-ai/ktransformers), built on top of SGLang, enables efficient MiMo-V2-Flash deployment on consumer-grade hardware by offloading MoE expert computations to the CPU while keeping the other components on the GPU. With **4× RTX 5090 + 2× AMD EPYC 9355**, it achieves up to **35.7 tokens/s** decode speed.
+
+ For a quick start and benchmarks, see the [KTransformers benchmarks](https://ktransformers.net/zh/benchmarks#MiMo-V2-Flash-FP8-TP4).
+
  ### Notifications
 
  #### 1. System prompt
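The offloading idea the commit describes — experts in host memory, routing on the accelerator, only the top-k selected experts doing work per token — can be sketched in a few lines. This is a toy numpy illustration of the general MoE-offload pattern, not KTransformers' actual implementation; all names and shapes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

D, N_EXPERTS, TOP_K = 8, 4, 2

# Expert weight matrices live in host (CPU) memory. In a real system the
# router, attention, and dense layers would stay resident on the GPU.
cpu_experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]
router_w = rng.standard_normal((D, N_EXPERTS))  # "GPU-resident" gate

def moe_forward(x):
    """Route one token to its top-k experts and mix their outputs."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]            # chosen expert ids
    weights = np.exp(logits[top])
    weights /= weights.sum()                     # softmax over selected experts
    # Only the selected experts compute; the rest stay idle in host memory,
    # which is what makes CPU offload of a sparse MoE layer affordable.
    return sum(w * (x @ cpu_experts[i]) for w, i in zip(weights, top))

y = moe_forward(rng.standard_normal(D))
print(y.shape)  # prints (8,)
```

Because only `TOP_K` of `N_EXPERTS` expert matmuls run per token, the CPU-side work scales with the active experts rather than the full parameter count — the property the benchmark numbers above rely on.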