Dynamic Quants producing garbage thinking output (llama.cpp + CUDA 13 issue)
Just pulled both UD-Q3_K_XL and UD-Q4_K_XL and I'm getting the same result. I was previously running M2.5; I pointed the model at these M2.7 quants and also updated llama.cpp to the latest version.
My logs become flooded with:
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 3 | task 0 | n_decoded = 7446, n_remaining = -1, next token: 23 ''
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 7446
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 7447, front = 0
slot update_slots: id 3 | task 0 | slot decode token, n_ctx = 196608, n_tokens = 7485, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
srv update_chat_: Parsing chat message:
! $''
Parsing PEG input with format peg-native: [e~[
]~b]ai
<think>
! $''
srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"\u0017"}}],"created":1775973650,"id":"chatcmpl-edmSe4RiidSxhopP0cNN8g7Juh8uvO5A","model":"UD-Q4_K_XL.gguf","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}
Here are my llama-server params:
llama-server --port 5512 \
  -m /models/MiniMax-M2.7/UD-Q4_K_XL.gguf \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  -b 4096 -ub 2048 \
  --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
Update
The issue seems to be caused by newer versions of llama.cpp together with CUDA 13. Building llama.cpp at commit https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9 has resolved the issue for me.
I tried it just then and rebuilt from source - Q3_K_XL works fine
I also tried Q4_K_XL and it works:
I used the exact command you had:
./master_llama_cpp/llama.cpp/llama-cli --model MiniMax-M2.7-GGUF/UD-Q4_K_XL/MiniMax-M2.7-UD-Q4_K_XL-00001-of-00004.gguf --flash-attn on --no-mmap -b 4096 -ub 2048 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
I am also getting the same output as OP. I'm running V100 GPUs, and the thinking output seems to have slowed to a crawl while also outputting random characters.
I even tested your Unsloth studio and it's not working with this model either.
You both built llama.cpp from source, right? Can you try other quant providers and see if they work?
I did actually try other uploaders and they were the same (I have fast internet). I tried two other uploaders' Q3_K_L and Q3_K_M quants, and your UD-Q3_K_L. I was wondering if that's a llama.cpp bug, some kind of V100 incompatibility with MiniMax 2.7, because none of them worked for me. For reference, I have used other model types just fine, like Qwen or MiniMax 2.5. I also used the latest llama.cpp source, uploaded 13 hours ago.
I tried offloading the MoE experts as well, and it works fine via -ot ".ffn_.*_exps.=CPU"
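For anyone unsure what that override does: the pattern matches the MoE expert FFN tensors and keeps them on the CPU while everything else stays on the GPU. A quick sketch of what the regex selects (the tensor names below are illustrative of typical GGUF MoE naming, not dumped from this model):

```shell
# The -ot pattern ".ffn_.*_exps.=CPU" keeps tensors whose names contain
# "ffn_..._exps" (the MoE expert weights) on the CPU.
# Illustrative tensor names, checked against the same pattern with grep:
for t in blk.0.ffn_gate_exps.weight blk.0.ffn_down_exps.weight blk.0.attn_q.weight; do
  if printf '%s\n' "$t" | grep -Eq ".ffn_.*_exps."; then
    echo "$t -> CPU"
  else
    echo "$t -> GPU"
  fi
done
# blk.0.ffn_gate_exps.weight -> CPU
# blk.0.ffn_down_exps.weight -> CPU
# blk.0.attn_q.weight -> GPU
```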
Can you try installing llama.cpp at commit https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9, i.e. the one before https://github.com/ggml-org/llama.cpp/pull/19378?
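In case it helps, a minimal sketch of pinning a source build to that commit (the CMake flags assume a standard CUDA build; adjust the job count to your machine):

```shell
# Clone llama.cpp and pin to the known-good commit
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 009a1133268d040a7c574a7b9c95413b0be369a9

# Rebuild with CUDA support (standard llama.cpp CMake flags)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
```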
Looks like it is a llama.cpp issue. I am just so used to pulling new models and updating llama.cpp when I do. Backing up to the llama.cpp build I was using before and running this, I am no longer getting the garbage output.
Wait a second, are you guys using CUDA 13.2 by any chance?
Yeah, I'm on that version. Driver Version: 595.45.04, CUDA Driver Version: 13.2
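For anyone else checking: the CUDA version the driver reports and the toolkit version you compiled llama.cpp with can differ, so it's worth looking at both (a quick sketch; exact output format varies by driver and toolkit):

```shell
# Driver version and the CUDA version the driver supports
# (shown in the header lines of the output)
nvidia-smi | head -n 4

# Toolkit version used when compiling llama.cpp
nvcc --version | grep release
```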
But even after building at that commit, I'm still seeing the problem.
I used an ik_llama.cpp build from April 5th and it's working now, so it's definitely a regression in newer llama.cpp builds. Which I think is odd in itself, because I thought Unsloth quants were not compatible with ik_llama.cpp.
Oh my, CUDA 13.2 is the culprit.
Do NOT use CUDA 13.2! NVIDIA is fixing the issue of bad outputs.
For me the UD-Q4_K_XL works fine: no garbage thinking output and so on. I just used the most recent llama.cpp CUDA server Docker image.
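For reference, a sketch of how I'd launch it with that image (the image tag, port, and model path below are assumptions; check the llama.cpp Docker docs for the current tag):

```shell
# Run the llama.cpp CUDA server image with a local model directory mounted.
# ghcr.io tag and model path are illustrative.
docker run --gpus all -v /models:/models -p 5512:5512 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/MiniMax-M2.7/UD-Q4_K_XL.gguf --host 0.0.0.0 --port 5512
```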
See https://github.com/unslothai/unsloth/issues/4849 - we already pinged NVIDIA. CUDA 13.2 breaks llama.cpp quants under 4-bit; use CUDA 13.1 or lower, or as a fallback use our pre-compiled binaries from https://github.com/unslothai/llama.cpp/releases/tag/b8746, which use CUDA 13.0.
llama.cpp commit https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9 also works for me using CUDA 12. I confirmed I was always using CUDA 12, so something in newer builds broke it even for me. I might try to bisect in the next few days to find what it is.
So this fixed the outputs in my case:
- I'm using llama.cpp at the suggested commit: https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9
- Driver Version: 595.45.04, CUDA Driver Version: 13.2
- UD-IQ4_XS doesn't work.
- UD-Q3_K_XL works.
Finally got it tested, and yes, it's working with the build you linked.
Can you edit your main post so people can find the fix easily without reading the thread? Thank you!! :)
@orlandocollins - try our CUDA 13.0 pre-compiled binaries at https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions/1#69db47965d14cc8ca1c6d3e4