Jan 21: GGUFs all UPDATED!!!
Jan 21 UPDATE: llama.cpp has fixed a bug which caused the model to loop and produce poor outputs. We have reconverted and reuploaded the model so outputs should be much much better now.
You can now use Z.ai's recommended parameters and get great results:
- For general use-case: --temp 1.0 --top-p 0.95
- For tool-calling: --temp 0.7 --top-p 1.0
Guide: https://unsloth.ai/docs/models/glm-4.7-flash
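For reference, a minimal llama-cli run with the general use-case settings above might look like the sketch below; the model filename and context size are placeholders, so adjust them to your download and hardware:

```bash
# General use-case sampling (Z.ai's recommended values).
# For tool-calling, swap in --temp 0.7 --top-p 1.0 instead.
./llama-cli \
  -m GLM-4.7-Flash-Q4_K_M.gguf \
  -c 32768 \
  --temp 1.0 \
  --top-p 0.95
```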

Definitely please add --dry-multiplier 1.1
Or the entire set --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1, which seems to work better for many people!
I haven't found any settings that avoid the loop... waiting for possible further LM Studio (llama.cpp) updates.
Updated the to-do list
Create basic game structure
Implement core gameplay mechanics and pipe system
Kilo said
}//????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
Where was this from and do you have a screenshot of your config?
If you're using LM Studio, use --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 and disable repeat penalty.
temp 0.2, top-k 50, top-p 0.95, min-p 0.01, repeat penalty 1.1.
Will try to disable it
unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q4_K_M.gguf
repeat penalty disabled
Let me create the complete game file.</think>I'll create a complete Flappy Bird clone with enhanced graphics in a single HTML file. Let me build this step by step.
Checkpoint
(Current)
Kilo Code wants to create a new file
API Request...
01:10 PM
$0.0000
Kilo said
}//????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
temp 0.2, top-k 50, top-p 0.95, min-p 0.01, repeat penalty 1.1.
Will try to disable it
Repeat penalty has to be disabled; it's not the same thing as the DRY multiplier.
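To spell the difference out in llama.cpp flag terms, here is a rough sketch; as far as I know, --repeat-penalty 1.0 turns the classic repetition penalty off and --dry-multiplier 0.0 turns DRY off, but double-check against your build's --help (the model path is a placeholder):

```bash
# Classic repetition penalty: --repeat-penalty (1.0 = no penalty applied).
# DRY sampler: --dry-multiplier (0.0 = disabled).
./llama-server \
  -m GLM-4.7-Flash-Q4_K_M.gguf \
  --repeat-penalty 1.0 \
  --dry-multiplier 1.1
```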
unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q4_K_M.gguf
Is this LM Studio or llama.cpp?
LM Studio runtime (llama.cpp v1.103.2); will try the UD model.
UD: same problem.
I should delegate this task completely with clear instructions for Code mode to create the single-file implementation.
</think>
Checkpoint
(Current)
Kilo Code wants to create a new subtask in code mode
Subtask Instructions
Error
Error
Kilo said
}//???????????????????????????????????????????
The Z.ai team officially recommends following the GLM-4.7 sampler settings.
I'm curious where the recommendation for temp 0.2 and other things came from since it didn't seem to come from Z.ai?
Yeah, it will be a mess for a few days, but I am happy to wait for this gem to work :D
I fixed it.
Issue Solved! GLM-4.7-Flash Q6_K - Working perfectly now!
Environment
- Model: unsloth/GLM-4.7-Flash-GGUF (Q6_K, 23GB)
- llama.cpp: build 7779 (commit 6df686bee)
- Hardware: RTX 4090, 128 GB RAM
Root Cause Found
The issue was NOT the model itself. The model works perfectly with llama-cli!
Problem: Server/API mode with incorrect parameters caused chaotic output.
| Mode | Parameters | Result |
|---|---|---|
| llama-cli | Default (no extra params) | Working |
| llama-server | --chat-template glm4 --jinja --temp 0.2 --dry-multiplier 1.1 | Chaos |
Test Results (All Passed)
| Test | Input | Output |
|---|---|---|
| Greeting | "hi" | "Hello! I'm the GLM large language model..." |
| Chinese | "你好" ("Hello") | "你好！我是GLM大语言模型..." ("Hello! I'm the GLM large language model...") |
| Math | "What is 2+2?" | "2 + 2 = 4" |
| Math | "Solve x^2-4=0" | Correct solution with steps |
| Coding | "write a script for ocr" | Complete Python script |
How to Fix
Remove these parameters from llama-server:
# Don't use these with GLM-4.7-Flash:
--chat-template glm4
--jinja
--temp 0.2
--top-p 0.95
--top-k 50
--min-p 0.01
--dry-multiplier 1.1
Use default parameters instead:
./llama-server \
-m "GLM-4.7-Flash-Q6_K.gguf" \
--host 0.0.0.0 \
--port 8080
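As a quick sanity check after restarting with defaults, a request like the following against llama-server's OpenAI-compatible endpoint should come back coherent and non-looping; this is only a sketch, with the port matching the command above:

```bash
# Ask a trivial question and confirm the reply is not a wall of repeated tokens.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "max_tokens": 64
      }'
```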
Why This Happens
We suspect the chat template (--chat-template glm4 --jinja) conflicts with the model's internal reasoning process, causing the output to be replaced with incorrect tokens.
Conclusion
The model is excellent! Please update your documentation to avoid recommending these parameters for GLM-4.7-Flash.
Ref: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/1
The Z.ai team officially recommends following the GLM-4.7 sampler settings.
I'm curious where the recommendation for temp 0.2 and other things came from since it didn't seem to come from Z.ai?
Around 5 people said it worked. After testing GLM's recommended parameters, we still saw looping issues. Also, I think it's what LM Studio uses by default as well.
After using these sampling params, it started working better.
We'll include both ways to run the model so people can test and see which is better.
I can confirm that the MLX version of LM Studio worked for me with temp 0.2 etc. With the official parameters from Z.ai, I had looping and gibberish...
IMO, if the official parameters don't work well, it's probably because there's something broken in the llama.cpp implementation. I would wait for better GGUF models instead of trying different sampling params.
We experienced looping as well with the officially recommended parameters, until we added --dry-multiplier 1.1, which worked well for us. FYI, this was tested not just with our quant, but with other uploaders' quants as well. Using the sampling parameters with 0.2 temp also worked.
Have you tested the original BF16 model or the Z.ai API?
@danielhanchen
Hello.
What is your final opinion about this advice from the above post?
Just to remove --chat-template glm4 --jinja and keep the rest in place?
I think @CHNtentes's comments are reasonable; there shouldn't be such a mismatch between the BF16 and GGUF sampling params...
Plus, we still have a massive throughput degradation over time (with AND without flash attention; no FA just makes it start higher, but the degradation rate is the same), going from ~90 t/s to something like 30 t/s by the 1000th token. That's not usable for a thinking, and apparently quite verbose, model.
Also, I don't know about others, but my GPU's noise is clearly abnormal: the pitch fluctuates quite randomly, where it should just decrease progressively as computation becomes more expensive with each new token.
I personally wouldn't try to find workarounds while there are still fundamental issues to be fixed.
relevant: https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/12#696ec9bd8ce5068e7b3f11bc
Just running llama-cli seems to work well. But still, according to my usual test questions, this model might be good at coding, but not at logical thinking. It failed at several easy questions:
I have a boat with three available spaces. I want to transport a man, a sheep, and a cat to the other side of the river. How can I do that?
I have a bowl with a small cup inside. I placed the bowl upside down on a table and then picked up the bowl to put it in the microwave. Where is that cup?
On llama.cpp with UD Q4_K_XL, using the default params (nothing added to llama.cpp) as mentioned above fixed the looping issues for me.
No more looping, tool calling works great in Goose. I am having some issues with Zed where it doesn't seem to understand how to submit a tool call, but that was working with --jinja so hints will probably fix it.
And I'm with @owao on flash-attention, not having it is brutal. You don't realize what you have until you lose it!
What are the default parameters you used?
It really seems like the llama.cpp implementation might have bugs: https://github.com/ggml-org/llama.cpp/pull/18936#issuecomment-3774525719
Hi @shimmyshimmer, same as in the update from @gannima: I just removed all parameters relating to the model.
So no --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1.
Ok, so that wasn't too clear, I admit ;). To be exhaustive, and since I just have a 16 GB card, I went with:
--offline -m GLM-4.7-Flash-UD-Q4_K_XL.gguf --no-slots -np 1 --port 8000 --host 0.0.0.0 -fa off -fitt 500 --n-cpu-moe 16 --ubatch-size 128
But as you can see, these have nothing to do with model parameters, and that is all I used.
Additional note: I got the best --jinja performance with --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.0 --dry-multiplier 0.1. Tool calling worked in Goose and Zed, and I was not getting any repetition looping. However, the model was regularly making simple copy/paste errors from context (e.g. mangling IP addresses or paths when producing reports or tool calls). I was just about to give up when I saw this thread.
Possible fix: https://github.com/ggml-org/llama.cpp/pull/18980
I wonder if this will fix the looping issues some people have experienced.
I really hope it will fix the issue; we have been going crazy over the correct parameters, etc.
Many people say this works, some people say it doesn't. Some people say something else works while other people say it doesn't.
That's why I made those comments above. If the root cause is in the model inference, changing the sampling parameters might improve the output in some cases, but in other cases there's no improvement, or it's even worse.
And this isn't rare for llama.cpp support nowadays. I admire these community devs for their work, but sometimes they might overlook something and merge the support PR, and then many people assume the model is perfectly supported when it's not.
Hey guys, llama.cpp fixed an issue in the implementation. We reuploaded and the results are much better now!!! Could you retest and see if it's better now?
Just use GLM-4.7's original parameters: https://unsloth.ai/docs/models/glm-4.7-flash#usage-guide
CC: @CHNtentes @coder543 @dugrema @gannima @urtuuuu @owao @McG-221 @Reverger @Ukro
If using llama.cpp, don't forget to set min_p = 0.01, only for llama.cpp as the default is 0.1.
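If you drive llama-server over HTTP instead of CLI flags, the sampling values can also be sent per request; a sketch against the server's native /completion endpoint (host and port are assumptions, adjust to your setup):

```bash
# Per-request sampling: min_p set explicitly to 0.01 as recommended above.
curl http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Hello",
        "temperature": 1.0,
        "top_p": 0.95,
        "min_p": 0.01,
        "n_predict": 64
      }'
```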
Some people reported that the MLA implementation is broken and flash attention is unusable. I suppose that means the VRAM usage and speed are still not optimal?
If using llama.cpp, don't forget to set min_p = 0.01, only for llama.cpp as the default is 0.1.
(Quick note on that: the llama.cpp defaults have since changed to min-p 0.05 and top-p 0.95; the README is not up to date, I just noticed, and I made a PR. https://github.com/ggml-org/llama.cpp/blob/master/common/common.h)
@danielhanchen
Thank you, will update docs.
@shimmyshimmer
Shouldn't we set min-p to 0? None of transformers, sglang, or vllm uses it by default, and zai.org doesn't mention it at all.
I was thinking the same for top-k (set it to 0 instead of llama.cpp's default of 40), but I see that while sglang and vllm likewise don't use it by default, transformers on the other hand sets it to 50.
I once read that setting top-k to 0 makes the number of candidate tokens too large and slows down sampling. Not sure if that's true though.
Well, over 2*20 generations (10 with top-k 40 and 10 with top-k 0), for gpt-oss-20b and for Qwen-VL-32B, no noticeable difference. At least we tried!
I confirm that removing "--temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05" solved the problem of GLM not being able to call the tools correctly when using OpenHands.
We are using this:
ARGS="--no-mmap -fa on
-c 131072
-m $GGUF_PATH/unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q5_K_M.gguf"
llama-server
$ARGS
-ngl 999
--host 0.0.0.0
--port 12345
--cache-ram -1
--parallel 2
--batch-size 512
--metrics
llama.cpp and weights were updated before testing.





