Dynamic Quants producing garbage thinking output (llama.cpp + CUDA 13 issue)
Just pulled both UD-Q3_K_XL and UD-Q4_K_XL and I'm getting the same result. I was previously running M2.5; I pointed the model at these M2.7 quants and also updated llama.cpp to the latest version.
My logs become flooded with:
res send: sending result for task id = 0
res send: task id = 0 pushed to result queue
slot process_toke: id 3 | task 0 | n_decoded = 7446, n_remaining = -1, next token: 23 ''
srv update_slots: run slots completed
que start_loop: waiting for new tasks
que start_loop: processing new tasks
que start_loop: processing task, id = 7446
que start_loop: update slots
srv update_slots: posting NEXT_RESPONSE
que post: new task, id = 7447, front = 0
slot update_slots: id 3 | task 0 | slot decode token, n_ctx = 196608, n_tokens = 7485, truncated = 0
srv update_slots: decoding batch, n_tokens = 1
set_adapters_lora: adapters = (nil)
adapters_lora_are_same: adapters = (nil)
set_embeddings: value = 0
srv update_chat_: Parsing chat message:
! $''
Parsing PEG input with format peg-native: [e~[
]~b]ai
<think>
! $''
srv operator(): http: streamed chunk: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"reasoning_content":"\u0017"}}],"created":1775973650,"id":"chatcmpl-edmSe4RiidSxhopP0cNN8g7Juh8uvO5A","model":"UD-Q4_K_XL.gguf","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}
Here are my llama-server params:
llama-server --port 5512 \
  -m /models/MiniMax-M2.7/UD-Q4_K_XL.gguf \
  -fa on \
  -ctk q8_0 -ctv q8_0 \
  --no-mmap \
  -b 4096 -ub 2048 \
  --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
Update
The issue seems to be caused by newer versions of llama.cpp together with CUDA 13. Building llama.cpp at commit https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9 has resolved the issue for me.
I tried it just then and rebuilt from source - Q3_K_XL works fine
I also tried Q4_K_XL and it works:
I used the exact command you had:
./master_llama_cpp/llama.cpp/llama-cli --model MiniMax-M2.7-GGUF/UD-Q4_K_XL/MiniMax-M2.7-UD-Q4_K_XL-00001-of-00004.gguf --flash-attn on --no-mmap -b 4096 -ub 2048 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01 --repeat-penalty 1.0
I am also getting the same output as OP. I'm running V100 GPUs, and the thinking output seems to have slowed to a crawl while also outputting random characters.
I even tested your Unsloth studio and it's not working with this model either.
You both built llama.cpp from source, right? Can you try other quant providers and see if they work?
I did actually try other uploaders and they were the same (I have fast internet). I tried two other uploaders' Q3_K_L and Q3_K_M quants, and your UD-Q3_K_L. I was wondering if that's a llama.cpp bug, some kind of V100 incompatibility with MiniMax 2.7, because none of them worked for me. For reference, I have used other model types just fine, like Qwen or MiniMax 2.5. I also used the latest llama.cpp source, uploaded 13 hours ago.
I tried offloading the MoE experts as well, and it works fine via -ot ".ffn_.*_exps.=CPU"
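For anyone unsure what that override does: the pattern matches the MoE expert FFN tensors and keeps them on the CPU while everything else stays on the GPU. A quick sketch of what the regex selects (the tensor names below are illustrative of typical GGUF MoE naming, not dumped from this model):

```shell
# The -ot pattern ".ffn_.*_exps.=CPU" keeps tensors whose names contain
# "ffn_..._exps" (the MoE expert weights) on the CPU.
# Illustrative tensor names, checked against the same pattern with grep:
for t in blk.0.ffn_gate_exps.weight blk.0.ffn_down_exps.weight blk.0.attn_q.weight; do
  if printf '%s\n' "$t" | grep -Eq ".ffn_.*_exps."; then
    echo "$t -> CPU"
  else
    echo "$t -> GPU"
  fi
done
# blk.0.ffn_gate_exps.weight -> CPU
# blk.0.ffn_down_exps.weight -> CPU
# blk.0.attn_q.weight -> GPU
```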
Can you try installing llama.cpp at commit https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9, i.e. the one before https://github.com/ggml-org/llama.cpp/pull/19378?
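In case it helps, a minimal sketch of pinning a source build to that commit (the CMake flags assume a standard CUDA build; adjust the job count to your machine):

```shell
# Clone llama.cpp and pin to the known-good commit
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout 009a1133268d040a7c574a7b9c95413b0be369a9

# Rebuild with CUDA support (standard llama.cpp CMake flags)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
```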
Looks like it is a llama.cpp issue. I am just so used to pulling new models and updating llama.cpp when I do. Backing up to the llama.cpp build I was using before and running this, I am no longer getting the garbage output.
Wait a second, are you guys using CUDA 13.2 by any chance?
Yeah, I'm on that version. Driver Version: 595.45.04, CUDA Driver Version: 13.2
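For anyone else checking: the CUDA version the driver reports and the toolkit version you compiled llama.cpp with can differ, so it's worth looking at both (a quick sketch; exact output format varies by driver and toolkit):

```shell
# Driver version and the CUDA version the driver supports
# (shown in the header lines of the output)
nvidia-smi | head -n 4

# Toolkit version used when compiling llama.cpp
nvcc --version | grep release
```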
But even after building at that commit, I'm still seeing the problem.
I used an ik_llama.cpp build from April 5th and it's working now, so it's definitely a regression in newer llama.cpp builds. Which I think is odd in itself, because I thought Unsloth quants were not compatible with ik_llama.cpp.
Oh my, CUDA 13.2 is the culprit.
Do NOT use CUDA 13.2! NVIDIA is fixing the issue of bad outputs.
For me the UD-Q4_K_XL works fine: no garbage thinking output and so on. I just used the most recent llama.cpp CUDA server Docker image.
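For reference, a sketch of how I'd launch it with that image (the image tag, port, and model path below are assumptions; check the llama.cpp Docker docs for the current tag):

```shell
# Run the llama.cpp CUDA server image with a local model directory mounted.
# ghcr.io tag and model path are illustrative.
docker run --gpus all -v /models:/models -p 5512:5512 \
  ghcr.io/ggml-org/llama.cpp:server-cuda \
  -m /models/MiniMax-M2.7/UD-Q4_K_XL.gguf --host 0.0.0.0 --port 5512
```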
See https://github.com/unslothai/unsloth/issues/4849 - we already pinged NVIDIA. CUDA 13.2 breaks llama.cpp quants under 4-bit; use CUDA 13.1 or lower, or as a fallback use our pre-compiled binaries from https://github.com/unslothai/llama.cpp/releases/tag/b8746, which use CUDA 13.0.
llama.cpp commit https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9 also works for me using CUDA 12. I confirmed I was always using CUDA 12, so something in newer builds broke it even for me. I might try to bisect in the next few days to find what it is.
So this fixed the outputs in my case:
- I'm using llama.cpp at the suggested commit: https://github.com/ggml-org/llama.cpp/commit/009a1133268d040a7c574a7b9c95413b0be369a9
- Driver Version: 595.45.04, CUDA Driver Version: 13.2
- UD-IQ4_XS doesn't work.
- UD-Q3_K_XL works.
Finally got it tested, and yes, it's working with the build you linked.
Can you edit your main post so people can find the fix easily without reading the thread? Thank you!! :)
@orlandocollins - try our CUDA 13.0 pre-compiled binaries at https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions/1#69db47965d14cc8ca1c6d3e4