Jan 21: GGUFs all UPDATED!!!

#1
by danielhanchen - opened
Unsloth AI org
•
edited 3 days ago

Jan 21 UPDATE: llama.cpp has fixed a bug that caused the model to loop and produce poor outputs. We have reconverted and reuploaded the model, so outputs should be much better now.

You can now use Z.ai's recommended parameters and get great results:

  • For general use-case: --temp 1.0 --top-p 0.95
  • For tool-calling: --temp 0.7 --top-p 1.0

Guide: https://unsloth.ai/docs/models/glm-4.7-flash
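As a sketch, the general-use settings above could be combined into a llama-server invocation like this (the model filename and port are placeholders, not from the post; adjust for your setup):

```shell
# Serving the updated GGUF with Z.ai's recommended general-use sampling.
# Model path and port are placeholder assumptions.
./llama-server \
    -m GLM-4.7-Flash-Q4_K_M.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --port 8080
# For tool-calling workloads, swap in: --temp 0.7 --top-p 1.0
```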


danielhanchen pinned discussion
danielhanchen changed discussion title from Looping issues should now be fixed. to Looping issues should now be mostly fixed.
Unsloth AI org
•
edited 4 days ago

Please be sure to add --dry-multiplier 1.1

Or use the entire set --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1, which seems to work better for many people!
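For reference, a minimal sketch of that full sampler combination on a llama-server command line (the model path is a placeholder assumption):

```shell
# Full sampler workaround combination; model path is a placeholder.
./llama-server \
    -m GLM-4.7-Flash-Q4_K_M.gguf \
    --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 \
    --dry-multiplier 1.1
```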

danielhanchen changed discussion title from Looping issues should now be mostly fixed. to Add --dry-multiplier 1.1 to reduce looping issues!

I didn't find any settings that avoid the looping... waiting for possible further updates to LM Studio (llama.cpp).

Updated the to-do list
Create basic game structure
Implement core gameplay mechanics and pipe system

Kilo said
}//???????????????????????????????????????????????? [output continues with hundreds more '?' characters; truncated]
Unsloth AI org
•
edited 4 days ago

I didn't find any settings that avoid the looping... waiting for possible further updates to LM Studio (llama.cpp).

Updated the to-do list
Create basic game structure
Implement core gameplay mechanics and pipe system

Kilo said
}//???????????????????????????????????????????????? [output continues with hundreds more '?' characters; truncated]

Where was this from and do you have a screenshot of your config?

If you're using LM Studio, use --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 and disable repeat penalty.

temp 0.2, top-k 50, top-p 0.95, min-p 0.01, repeat penalty 1.1.
Will try to disable it

unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q4_K_M.gguf
repeat penalty disabled

Let me create the complete game file.</think>I'll create a complete Flappy Bird clone with enhanced graphics in a single HTML file. Let me build this step by step.
Checkpoint
(Current)
Kilo Code wants to create a new file
API Request...
01:10 PM
$0.0000
Kilo said
}//????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
Unsloth AI org

temp 0.2, top-k 50, top-p 0.95, min-p 0.01, repeat penalty 1.1.
Will try to disable it

The repeat penalty has to be disabled; it's not the same as the DRY multiplier.

Unsloth AI org

unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q4_K_M.gguf
repeat penalty disabled

Let me create the complete game file.</think>I'll create a complete Flappy Bird clone with enhanced graphics in a single HTML file. Let me build this step by step.
Checkpoint
(Current)
Kilo Code wants to create a new file
API Request...
01:10 PM
$0.0000
Kilo said
}//????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

Is this LM Studio or llama.cpp?

LM Studio runtime (llama.cpp v1.103.2); will try the UD model.

UD same problem

I should delegate this task completely with clear instructions for Code mode to create the single-file implementation.
</think>


Checkpoint
(Current)

Kilo Code wants to create a new subtask in code mode
Subtask Instructions

Error

Error


Kilo said
}//???????????????????????????????????????????

The Z.ai team officially recommends following the GLM-4.7 sampler settings.

I'm curious where the recommendation for temp 0.2 and other things came from since it didn't seem to come from Z.ai?

Yeah, it will be a mess for a few days, but I am happy to wait for this gem to work :D

I fixed it.

🎉 Issue Solved! GLM-4.7-Flash Q6_K - Working perfectly now!

Environment

  • Model: unsloth/GLM-4.7-Flash-GGUF (Q6_K, 23GB)
  • llama.cpp: build 7779 (commit 6df686bee)
  • Hardware: RTX 4090, 128 GB RAM

✅ Root Cause Found

The issue was NOT the model itself. The model works perfectly with llama-cli!

Problem: Server/API mode with incorrect parameters caused chaotic output.

| Mode | Parameters | Result |
|------|------------|--------|
| llama-cli | Default (no extra params) | ✅ Working |
| llama-server | `--chat-template glm4 --jinja --temp 0.2 --dry-multiplier 1.1` | ❌ Chaos |

πŸ“ Test Results (All Passed βœ…)

Test Input Output
Greeting "hi" "Hello! I'm the GLM large language model..."
Chinese "δ½ ε₯½" "δ½ ε₯½οΌζˆ‘ζ˜―GLMε€§θ―­θ¨€ζ¨‘εž‹..."
Math "What is 2+2?" "2 + 2 = 4"
Math "Solve x^2-4=0" Correct solution with steps
Coding "write a script for ocr" Complete Python script

🔧 How to Fix

Remove these parameters from llama-server:

# Don't use these with GLM-4.7-Flash:
--chat-template glm4
--jinja
--temp 0.2
--top-p 0.95
--top-k 50
--min-p 0.01
--dry-multiplier 1.1

Use default parameters instead:
./llama-server \
    -m "GLM-4.7-Flash-Q6_K.gguf" \
    --host 0.0.0.0 \
    --port 8080
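Once the server is up, a quick smoke test against llama.cpp's OpenAI-compatible endpoint might look like the following (host and port assumed from the command above):

```shell
# Smoke test: send one chat request to the running llama-server.
# Assumes the server is listening on port 8080 as in the command above.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 2+2?"}]}'
```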

💭 Why This Happens

We suspect the chat template (--chat-template glm4 --jinja) conflicts with the model's internal reasoning process, causing the output to be replaced with incorrect tokens.

πŸ™ Conclusion

The model is excellent! Please update your documentation to avoid recommending these parameters for GLM-4.7-Flash.

Ref: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/1

Screenshots: Screenshot_2026-01-20_07-54-04, 07-53-58, 07-53-49, 07-53-40, 07-53-19.
These are the test results (partial).

Unsloth AI org
•
edited 4 days ago

The Z.ai team officially recommends following the GLM-4.7 sampler settings.

I'm curious where the recommendation for temp 0.2 and other things came from since it didn't seem to come from Z.ai?

Around five people said it worked. After testing GLM's recommended parameters, we still saw looping issues. Also, I think it's what LM Studio uses by default as well.

After using these sampling params, it started working better.

We shall include two ways to run the model for people to test and see which is better.

I can confirm that the MLX version of LM Studio worked for me with temp 0.2 etc. With the official parameters from Z.ai, I had looping and gibberish... 😵‍💫

IMO, if the official parameters don't work well, it's probably because something is broken in the llama.cpp implementation. I would wait for better GGUF models instead of trying different sampling params.

Unsloth AI org
•
edited 4 days ago

I can confirm that the MLX version of LM Studio worked for me with temp 0.2 etc. With the official parameters from Z.ai, I had looping and gibberish... 😵‍💫

IMO, if the official parameters don't work well, it's probably because something is broken in the llama.cpp implementation. I would wait for better GGUF models instead of trying different sampling params.

We experienced looping as well with the officially recommended parameters, until we added --dry-multiplier 1.1, which worked well for us. FYI, this was tested not just on our quant but on other uploaders' quants as well. Using the sampling parameters with temp 0.2 also worked.

Have you tested the original BF16 model or the Z.ai API?

@danielhanchen
Hello.
What is your final opinion about this advice from the above post?

Just to remove --chat-template glm4 --jinja and keep the rest in place?

🔧 How to Fix
Remove these parameters from llama-server:

# Don't use these with GLM-4.7-Flash:
--chat-template glm4
--jinja
--temp 0.2
--top-p 0.95
--top-k 50
--min-p 0.01
--dry-multiplier 1.1

Use default parameters instead:
./llama-server \
    -m "GLM-4.7-Flash-Q6_K.gguf" \
    --host 0.0.0.0 \
    --port 8080

💭 Why This Happens

We suspect the chat template (--chat-template glm4 --jinja) conflicts with the model's internal reasoning process, causing the output to be replaced with incorrect tokens.

πŸ™ Conclusion

The model is excellent! Please update your documentation to avoid recommending these parameters for GLM-4.7-Flash.

Ref: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/1

I think @CHNtentes's comments are reasonable: there shouldn't be such a mismatch between the BF16 and GGUF sampling params.
Also, we still have a massive throughput degradation over time (with AND without flash attention; no FA just makes it start higher, but the degradation rate is the same, going from ~90 t/s to something like 30 t/s by the 1000th token). That's not usable for a thinking, and apparently quite verbose, model.
Also, I don't know about others, but I can clearly hear that my GPU's noise is unexpected: the pitch fluctuates quite randomly, when it should just decrease progressively as computation becomes more expensive with each new token.
I personally wouldn't try to find workarounds while there are still fundamental issues to be fixed.

relevant: https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/12#696ec9bd8ce5068e7b3f11bc

Just running llama-cli seems to work well. But still, according to my usual test questions, this model might be good at coding, but not at logical thinking. It failed at several easy questions:

I have a boat with three available spaces. I want to transport a man, a sheep, and a cat to the other side of the river. How can I do that?

I have a bowl with a small cup inside. I placed the bowl upside down on a table and then pick up the bowl to put it in the microwave. Where is that cup?

On llamacpp with UD Q4_K_XL, using the default params (nothing added to llamacpp) as mentioned above fixed the looping issues for me.

No more looping, tool calling works great in Goose. I am having some issues with Zed where it doesn't seem to understand how to submit a tool call, but that was working with --jinja so hints will probably fix it.

And I'm with @owao on flash-attention, not having it is brutal. You don't realize what you have until you lose it!

Unsloth AI org

On llamacpp with UD Q4_K_XL, using the default params (nothing added to llamacpp) as mentioned above fixed the looping issues for me.

No more looping, tool calling works great in Goose. I am having some issues with Zed where it doesn't seem to understand how to submit a tool call, but that was working with --jinja so hints will probably fix it.

And I'm with @owao on flash-attention, not having it is brutal. You don't realize what you have until you lose it!

What are the default parameters you used?

It really seems like the llama.cpp implementation might have bugs: https://github.com/ggml-org/llama.cpp/pull/18936#issuecomment-3774525719

On llamacpp with UD Q4_K_XL, using the default params (nothing added to llamacpp) as mentioned above fixed the looping issues for me.

No more looping, tool calling works great in Goose. I am having some issues with Zed where it doesn't seem to understand how to submit a tool call, but that was working with --jinja so hints will probably fix it.

And I'm with @owao on flash-attention, not having it is brutal. You don't realize what you have until you lose it!

What are the default parameters you used?

Hi @shimmyshimmer, same as in the update from @gannima: I just removed all parameters relating to the model.

So no --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1.

OK, so that wasn't too clear, I admit ;). To be exhaustive, and since I just have a 16 GB card, I went with:

--offline -m GLM-4.7-Flash-UD-Q4_K_XL.gguf --no-slots -np 1 --port 8000 --host 0.0.0.0 -fa off -fitt 500 --n-cpu-moe 16 --ubatch-size 128

But as you can see, these have nothing to do with sampling parameters, and that is all I used.

Additional note: I got the best --jinja performance with --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.0 --dry-multiplier 0.1. Tool calling worked in Goose and Zed, I was not getting any repetition looping. However, the model was regularly making simple copy/paste errors from context (e.g. mangling IP addresses or paths when producing reports or tool calls). I was just about to give up when I saw this thread.

Possible fix: https://github.com/ggml-org/llama.cpp/pull/18980

I wonder if this will fix the looping issues some people have experienced.

Unsloth AI org
•
edited 3 days ago

Possible fix: https://github.com/ggml-org/llama.cpp/pull/18980

I wonder if this will fix the looping issues some people have experienced.

I really hope it will fix the issue; we have been going crazy over the correct parameters etc.

Many people say this works, some people say it doesn't. Some people say something else works, while other people say it doesn't.

Possible fix: https://github.com/ggml-org/llama.cpp/pull/18980

I wonder if this will fix the looping issues some people have experienced.

I really hope it will fix the issue; we have been going crazy over the correct parameters etc.

Many people say this works, some people say it doesn't. Some people say something else works, while other people say it doesn't.

That's why I made those comments above. If the root cause is in the model inference, changing the sampling parameters might improve the output in some cases, but in other cases there's no improvement, or it's even worse.

And this isn't rare for llama.cpp support nowadays. I admire these community devs for their work, but sometimes they might overlook something and merge the support PR; then many people assume it's perfectly supported when it's not.

Unsloth AI org
•
edited 3 days ago

Hey guys, llama.cpp fixed an issue in the implementation. We reuploaded, and results are much better now! Can you retest and see if it's better?

Just use GLM-4.7's original parameters: https://unsloth.ai/docs/models/glm-4.7-flash#usage-guide

CC: @CHNtentes @coder543 @dugrema @gannima @urtuuuu @owao @McG-221 @Reverger @Ukro

danielhanchen changed discussion title from Add --dry-multiplier 1.1 to reduce looping issues! to Jan 21: GGUFs all UPDATED!!!
danielhanchen changed discussion title from Jan 21: GGUFs all UPDATED!!! to Jan 21: GGUFs UPDATED!!!
danielhanchen changed discussion title from Jan 21: GGUFs UPDATED!!! to Jan 21: GGUFs all UPDATED!!!
danielhanchen unpinned discussion
Unsloth AI org

If using llama.cpp, don't forget to set min_p = 0.01; this applies only to llama.cpp, as its default is 0.1.
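A minimal sketch of passing the flag explicitly so the llama.cpp default doesn't apply (model path is a placeholder assumption):

```shell
# Override llama.cpp's built-in min-p default explicitly.
# Model path is a placeholder; adjust for your setup.
./llama-server -m GLM-4.7-Flash-Q4_K_M.gguf --min-p 0.01
```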

Some people reported that the MLA implementation is broken and flash attention is unusable. I suppose that means the VRAM usage and speed are still not optimal?

If using llama.cpp, don't forget to set min_p = 0.01; this applies only to llama.cpp, as its default is 0.1.

(Quick note on this: it has changed to a default min-p of 0.05 and top-p of 0.95 now. The README is not up to date; I just noticed and made a PR. https://github.com/ggml-org/llama.cpp/blob/master/common/common.h)
@danielhanchen

I'll personally wait for the dust to settle before giving it another try.


Unsloth AI org

If using llama.cpp, don't forget to set min_p = 0.01; this applies only to llama.cpp, as its default is 0.1.

(Quick note on this: it has changed to a default min-p of 0.05 and top-p of 0.95 now. The README is not up to date; I just noticed and made a PR. https://github.com/ggml-org/llama.cpp/blob/master/common/common.h)
@danielhanchen

Thank you, will update docs.

@shimmyshimmer
Shouldn't we set min-p to 0? None of transformers, SGLang, or vLLM use it by default, and Z.ai doesn't mention it at all.
I was thinking the same for top-k (setting it to 0 instead of llama.cpp's default of 40), but I see that while SGLang and vLLM likewise don't use it by default, transformers sets it to 50.

@shimmyshimmer
Shouldn't we set min-p to 0? None of transformers, SGLang, or vLLM use it by default, and Z.ai doesn't mention it at all.
I was thinking the same for top-k (setting it to 0 instead of llama.cpp's default of 40), but I see that while SGLang and vLLM likewise don't use it by default, transformers sets it to 50.

I once read that setting top-k to 0 makes the pool of candidate tokens too large and slows down sampling. Not sure if that's true, though.

Well, over 2×20 generations (for each model, 10 with top-k 40 and 10 with top-k 0), with gpt-oss-20b and Qwen-VL-32B, there was no noticeable difference. At least we tried!

I confirm that removing "--temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05" solved the problem of GLM not being able to call the tools correctly when using OpenHands.

We are using this:

ARGS="--no-mmap -fa on
-c 131072
-m $GGUF_PATH/unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q5_K_M.gguf"

llama-server \
    $ARGS \
    -ngl 999 \
    --host 0.0.0.0 \
    --port 12345 \
    --cache-ram -1 \
    --parallel 2 \
    --batch-size 512 \
    --metrics

llama.cpp and weights were updated before testing.
