MiMo-V2-Flash-GGUF Running Primarily on the CPU instead of Nvidia GPUs
First of all, deepest thanks to the Unsloth team for the amazing quants you have been providing to the community for years, and also to the Llama.cpp team for making it possible for us to run large models locally! Without your incredible work, many of us would have no chance to run large AI models privately!
I have tried two different 2-bit quants of MiMo-V2-Flash-GGUF so far, and both run primarily on the CPU: there is minimal activity on the GPUs, while most CPU cores are fully loaded at 100%. I have tried many different Llama.cpp configurations, always with the same result. With the exact same Llama.cpp configurations, all other models run on the GPUs, as expected.
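In case it helps with the diagnosis, this is how I am checking where the weights actually land at load time. The exact log wording varies between Llama.cpp versions, so the grep patterns below are only approximate:

# Capture the server's load-time log, which reports how many layers were
# offloaded and which backend buffers (CUDA0..CUDA3 vs CPU) hold the weights.
~/llamacpp/llama.cpp/build/bin/llama-server [same flags as below] 2>&1 | tee server.log

# Then, in another terminal:
grep -i "offloaded" server.log      # e.g. "offloaded N/N layers to GPU"
grep -i "buffer size" server.log    # per-backend weight buffer sizes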
I have four RTX 5090 GPUs and a fifth-generation Xeon processor. I am running Ubuntu 24.04.3 LTS and CUDA V13.0.88.
I am attaching a btop screenshot in case it helps (assuming the image comes through). It shows most CPU cores fully loaded, with minimal GPU processing activity, even though the GPU memory is full and all model layers are loaded on the GPUs. I am also pasting below one of the Llama.cpp configurations I am using; I have tried many variations of it, with the same outcome.
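For the GPU side, I am also watching per-device compute utilization directly with nvidia-smi, which confirms what btop shows (the dmon columns can differ slightly between driver versions):

# Per-GPU utilization, one sample per second: "sm" is compute utilization,
# "mem" is memory-bandwidth utilization.
nvidia-smi dmon -s u

# Or a coarser view, refreshed every second:
watch -n 1 nvidia-smi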
My prompts are very complex, in English, and I am processing legal documents.
I also noticed that in reasoning mode (--reasoning-budget -1), the reasoning starts out very well, with excellent instruction following, clearly heading towards a very good answer. But after about 1,500 words, the reasoning falls into loops and repetitions and gets stuck.
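I have not found sampling settings that stop the looping yet. If it is ordinary repetition rather than a model or quant issue, flags like the following might help; these llama-server sampling options exist in recent builds, but the values here are untuned guesses:

# Candidate anti-repetition settings to append to the llama-server command:
#   --repeat-penalty  classic repetition penalty (1.0 = off)
#   --repeat-last-n   how many recent tokens the penalty considers
#   --dry-multiplier  enables the DRY sampler (0.0 = off, the default)
--temp 0.7 \
--repeat-penalty 1.1 \
--repeat-last-n 256 \
--dry-multiplier 0.8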
For comparison, with the exact same Llama.cpp configurations and 2-bit quants from Unsloth, GLM-4.7 does very well both with reasoning on and off, and runs fully on the GPUs as expected. But GLM-4.7 is a bit too slow for me to use in production with my current GPUs.
There is clearly something different about this MiMo-V2-Flash model, but I don't know whether the problem I am experiencing is with the model itself or with the quants. I don't think the problem is my Llama.cpp configuration or installation, but that is certainly possible.
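To rule out my installation, I can at least confirm that the binary was built with CUDA and that it sees all four cards at startup. Current Llama.cpp builds use the GGML_CUDA cmake option, and the startup log should mention the CUDA devices (again, the exact log text depends on the version):

# Rebuild with CUDA support explicitly enabled:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# At startup the server should log something like
# "ggml_cuda_init: found 4 CUDA devices":
grep -i "cuda" server.log | head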
lcpp ()
{
    clear
    cd ~/llamacpp/
    source .venv/bin/activate
    export CUDA_VISIBLE_DEVICES=0,1,2,3
    #export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
    # Note: for --threads I also tried 0, 1, etc.; the model still engages most CPU cores at 100%.
    # Note: for --reasoning-format I also tried the other formats.
    ~/llamacpp/llama.cpp/build/bin/llama-server \
        --host 0.0.0.0 --port [redacted] \
        --model /home/[redacted]/MiMo-V2-Flash-GGUF/MiMo-V2-Flash-UD-Q2_K_XL-00001-of-00003.gguf \
        --flash-attn on \
        --ctx-size 20000 \
        --verbosity 1 \
        --threads 20 \
        --reasoning-budget -1 \
        --reasoning-format deepseek-legacy \
        --n-gpu-layers 999 \
        --device CUDA0,CUDA1,CUDA2,CUDA3 \
        --tensor-split 25,25,25,25 \
        --split-mode layer \
        --parallel 1 \
        --jinja \
        --prio 3 \
        -ub 2400 \
        -b 2000
}
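To separate the quant itself from my server configuration, I am also planning to run the same GGUF through llama-bench, which ships with Llama.cpp and bypasses the server entirely (the parameter values below are just a starting point):

# Pure prompt-processing / generation benchmark on the same quant; if this
# also pins the CPU while the GPUs idle, the server configuration is not
# the cause.
~/llamacpp/llama.cpp/build/bin/llama-bench \
    -m /home/[redacted]/MiMo-V2-Flash-GGUF/MiMo-V2-Flash-UD-Q2_K_XL-00001-of-00003.gguf \
    -ngl 999 -fa 1 -p 512 -n 128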
I saw that MiMo-V2-Flash received very good reviews from the community, and I had high hopes for it. Thank you for any ideas or suggestions!
MD