Jan 21: All GLM-4.7-Flash quants reuploaded - much better outputs!
llama.cpp has fixed a bug which caused the model to loop and produce poor outputs.
Thanks to the work of llama.cpp and its contributors, we have now reconverted and reuploaded the model.
Outputs should now be much better, especially after our testing.
Please re-download, thanks!
You can now use Z.ai's recommended parameters and get great results:
- For general use: --temp 1.0 --top-p 0.95
- For tool-calling: --temp 0.7 --top-p 1.0
- Remember to disable repeat penalty!
If using llama.cpp, set --min-p 0.01, since llama.cpp's default is 0.1.
Please let us know if you see an improvement!
Guide: https://unsloth.ai/docs/models/glm-4.7-flash
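To make the presets concrete, here is a minimal sketch of building a request body for llama-server's OpenAI-compatible /v1/chat/completions endpoint with these values. Note that min_p and repeat_penalty are llama-server extensions to the OpenAI schema; treat the exact field names as an assumption to verify against your llama.cpp build.

```python
import json

# Recommended sampling presets from the guide above. "min_p" is added per
# the llama.cpp note, and "repeat_penalty": 1.0 disables repeat penalty.
PRESETS = {
    "general": {"temperature": 1.0, "top_p": 0.95, "min_p": 0.01},
    "tool_calling": {"temperature": 0.7, "top_p": 1.0, "min_p": 0.01},
}

def build_payload(messages, preset="general", model="GLM-4.7-Flash"):
    """Assemble a /v1/chat/completions request body for llama-server."""
    body = {"model": model, "messages": messages, "repeat_penalty": 1.0}
    body.update(PRESETS[preset])
    return body

payload = build_payload([{"role": "user", "content": "Hi"}], preset="tool_calling")
print(json.dumps(payload, indent=2))
```

You would then POST this body to your running llama-server instance with any HTTP client.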

As an example using UD-Q4_K_XL:

./llama.cpp/llama-cli \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --temp 1.0 --top-p 0.95 --min-p 0.01 --jinja

and on a long conversation:
Hi
What is 2+2
Create a Python Flappy Bird game
Create a totally different game in Rust
Find bugs in both
Make the 1st game I mentioned but in a standalone HTML file
Find bugs and show the fixed game
and the HTML code is at https://unsloth.ai/docs/models/glm-4.7-flash#flappy-bird-example-with-ud-q4_k_xl
I ran it, and it created the below:
Does it allow NSFW questions?
Works perfectly! Thanks, and great work!
Amazing, thanks for testing! We're waiting for more feedback before we tweet, ahaha.
Hi!
Updated llama.cpp with this morning's release and downloaded the updated (Q8_0) model, but I'm still being hit by inconsistent tool calls in agentic coding tasks. Much less than with the first release, however.
Response to a /v1/chat/completions call directly captured from llama-server:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"tool_calls":[{"index":2,"function":{"arguments":""}}]}}],"created":1768993844,"id":"chatcmpl-wO6uPHolWapSa3vpxZNSFstNjQfl7gfK","model":"GLM-4.7-Flash-Q8_0.gguf","system_fingerprint":"b7787-37c35f0e1","object":"chat.completion.chunk"}
data: {"error":{"code":500,"message":"Invalid diff: now finding less tool calls!","type":"server_error"}}
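For anyone debugging this, here is a small sketch of scanning the streamed SSE lines for error chunks like the one above. The sample lines below are illustrative, trimmed versions of the capture, not a live stream.

```python
import json

# Minimal SSE line parser for llama-server's streamed /v1/chat/completions
# output, useful for spotting error chunks injected mid-stream.
def parse_sse_lines(lines):
    events = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):].strip()
        if body == "[DONE]":
            break
        events.append(json.loads(body))
    return events

sample = [
    'data: {"choices":[{"finish_reason":null,"index":0,"delta":{"tool_calls":[{"index":2,"function":{"arguments":""}}]}}]}',
    'data: {"error":{"code":500,"message":"Invalid diff: now finding less tool calls!","type":"server_error"}}',
]
events = parse_sse_lines(sample)
# Error chunks can arrive mid-stream, so check every event, not just the last.
errors = [e["error"]["message"] for e in events if "error" in e]
print(errors)
```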
llama-server command line:
llama-server --model /Users/vox/.llama/models/GLM-4.7-Flash-Q8_0.gguf
--n-gpu-layers -1
--threads 16
--port 8011
--host 127.0.0.1
--jinja
--parallel 1
--ctx-size 131072
-b 2048
-ub 512
--temp 0.7
--top-p 1.0
--min-p 0.01
(Flash attention disabled as this is running on a Metal backend -- 64GB unified memory.)
Cheers!
I have re-downloaded UD Q4_K_XL and updated my local llama.cpp docker image (version 7786). And yes, all is good!
I used the above recommended parameters for tools: --temp 0.7 --top-p 1.0 --min-p 0.01 --jinja. This is working really well with Goose and Zed. Impressive model. Thanks for all the work! Note that for Zed, I had to add a custom context rule to help with tool usage:
Tool Call Syntax Rule
Always call tools with correct JSON syntax: tool_name followed by {parameters} without any special prefixes, formatting, or angle brackets. Never use Harmony-like formatting like <|start|> or <|end|>.
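As a sanity check for that rule, one could lint captured tool-call argument strings before dispatching them. This is a hedged sketch: the token list is a guess at the offending Harmony-style markers, not an exhaustive set.

```python
import json

# Reject tool-call argument strings that contain Harmony-style chat tokens
# instead of (or wrapped around) plain JSON. Token list is illustrative.
BAD_TOKENS = ("<|start|>", "<|end|>", "<|channel|>")

def is_clean_tool_call(arguments: str) -> bool:
    if any(tok in arguments for tok in BAD_TOKENS):
        return False
    try:
        json.loads(arguments)  # must be valid bare JSON
    except ValueError:
        return False
    return True

print(is_clean_tool_call('{"city": "Berlin"}'))           # True
print(is_clean_tool_call('<|start|>{"city": "Berlin"}'))  # False
```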
Hi Daniel,
Incredible work! Thank you very much for the detailed documentation and the fast solution!
I can confirm: the latest llama.cpp fixed the bug.
Latest llama compiled based on https://unsloth.ai/docs/models/glm-4.7-flash#llama.cpp-tutorial-gguf without any problem.
And run as API server:
./llama-server
-m /home/b/.lmstudio/models/unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q6_K_XL.gguf
--host 0.0.0.0
--port 1234
-ngl 99
-c 65536
--parallel 1
--jinja
Speed:
OPENCODE: my dual RTX 3090s produce 77-85 tokens per second, which is not bad (however, Qwen3-Coder-Instruct Q6 gives 130 t/s).
The generated Solar System looks amazing. 😀 Better than Devstral-Small-2-24B-Instruct-2512 from Mistral, which was my favorite for clever tool calls.
There seem to be tool-calling / freezing issues when using it in Roo Code + LM Studio. I am using the recommended parameters for tool calling and the latest llama.cpp version recommended by LM Studio.
Latest LM Studio (libraries) does not have the necessary llama.cpp code yet.
You can wait for LM Studio or use llama.cpp.
Opencode works amazing with llama.cpp and unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q6_K_XL.gguf.
Much better than Qwen3-Next (80B) or anything in the 30B area.
I'm using GLM-4.7-Flash-UD-Q4_K_XL.gguf.
Windows 11, RTX 3090 24GB, 64GB DDR5.
LM Studio Runtime v1.104.2 still doesn't work properly: it no longer repeats, but the code it produces is not OK.
With the same prompt and same settings, llama.cpp shines.
So you can still wait for LM Studio or use llama.cpp.
Hmm, I get this warning from llama.cpp (driven by opencode):
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
Edit: I dug more into this, I think the template is working correctly - just llama.cpp's heuristics are actually failing and printing a warning.
I suppose this is a llama.cpp issue, but does anyone else see a fast, almost constant degradation as more tokens are generated? With FA, I start out at about 35 tk/s, and by the time it's done thinking I'm at 3 tk/s. Without FA, I go from 70+ tk/s down to 20 tk/s. All in all, it ends up being faster for me to run inference with bigger models: DeepSeek, Kimi, or GLM 4.7.
Looks like it got fixed within the last few hours. I rebuilt llama.cpp and it's now fixed and usable with FA. I ran the same prompt with a high reasoning budget: it starts at about 82 tk/s and ends at 70 tk/s after 17k output tokens, and took about a quarter of the time. The result is solid. Thanks, Unsloth!
Hi!
GLM-4.7-Flash-Q4_K_M.gguf – llama.cpp + MCP Tooling Test
I tested GLM-4.7-Flash-Q4_K_M.gguf on llama.cpp server with MCP tool calls (weather API) and Simple Tool Call Test.
Setup
Testing on a 5060 Ti:
F:\llama-b7790-bin-win-cuda-13.1-x64\llama-server.exe ^
-m "F:\OLLAMA_MODELS\GLM-4.7-Flash-Q4_K_M.gguf" ^
--host 0.0.0.0 ^
--port 8010 ^
-c 8192 ^
-b 64 ^
-ngl 40 ^
--temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 ^
--jinja ^
--threads 16 --parallel 1 ^
--chat-template-kwargs "{\"enable_thinking\": false}" ^
--timeout 300 ^
--flash-attn on ^
--fit on
Performance
Throughput: 6.82 tok/s
Stable at 8k context
Smooth streaming
Tool calling and MCP works correctly
Observations
Good schema understanding
Clean table formatting
No hallucinated API calls
Good English output
Works well for agent-style workflows
Thanks unsloth
@danielhanchen I saw two PRs (#18936 and #18980), but they were already merged into master at the time, so I did not compile a specific branch. Anyway, I updated to version 7802 this morning, but I still see the model struggling to call tools. There are no more catastrophic failures as before, which is a good sign, but it still formats the calls with bad parameters from time to time.
This is promising, but I'm afraid I had to promptly revert to Qwen3, since this model is not reliable for production use. E.g., a simple agentic task like "extract this method to its own .cpp file" would fail badly, since the erroneous tool calls to edit the original source file will likely mess up its content by removing the wrong part of the code, creating a feedback loop where the model introduces more and more errors while trying to correct its initial edit.
However, using it in ask mode often gives encouraging reasoning outputs, so I guess when/if this tool-calling issue is resolved, this model would be a good contender in a production coding environment.
I also confirm: the latest llama.cpp build works with Flash Attention.
./llama.cpp/llama-server
-m /home/b/.lmstudio/models/unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q6_K_XL.gguf --host 0.0.0.0 --port 1234
-ngl 99
-c 65536
-fa on
--jinja
--temp 0.7
--top-p 1.0
--min-p 0.01
--threads -1
--parallel 1
This seems to work perfectly with opencode (a complex Playwright test created with the Playwright MCP...). AWESOME!
Look here at what code you can generate with this Q6-quantized version, the latest llama.cpp, and opencode:
https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/36
I just completed a main-branch build of llama.cpp; the reported version is 7811. And yes, flash attention works properly now. I'm getting 31 tokens/sec with 15,000 tokens loaded in context (15 MoE layers offloaded), which is awesome.
@williamliao, I also have a 5060 Ti with 16GB, but on an unimpressive Intel i3-10100 CPU. I still get > 30 tokens/sec with a context of 26880 with UD Q4_K_XL. Here's the trick with MoE models like GLM, Qwen3 and GPT-OSS:

-ngl 99 --n-cpu-moe 15

Change 15 to put more or fewer MoE layers on the CPU and adjust context length accordingly. The parameters tell llama.cpp to put all dense layers on the GPU and move only 15 MoE layers to the CPU. That is, unless you have a reason not to use the --n-cpu-moe feature, of course!
I'm still getting random loops when generating even the simplest code (generally occurring after hitting 50k+ context). I'm using the updated Unsloth quants (GLM-4.7-Flash-Q8_0.gguf), served locally with llama.cpp server on b7819 (also tried b7815), with temperature 0.7, top-p 1.0, min-p 0.01, and repeat-penalty 1.0 (off).
It's kind of frustrating, because when the model actually generates code it's of pretty good quality, but it almost always gets stuck in this repetition loop.
Can I ask about your opencode setup? I've updated it and it still runs into the same issue: "Invalid diff: now finding less tool calls!"
LM Studio still doesn't work properly for me, but llama.cpp is incredibly good.
Nothing special for opencode, so the key is llama.cpp.
Just refresh both to the latest version.
These are my current versions and configs:
b@big3:~$ /adat/ai/llama.cpp/llama-server --version
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 7816 (a14b960bc)
built with GNU 13.3.0 for Linux x86_64
b@big3:~$ opencode --version
1.1.34
b@big3:~$ more ~/.config/opencode/opencode.json
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "playwright": {
      "type": "local",
      "command": [
        "npx",
        "-y",
        "@playwright/mcp@latest",
        "--extension"
      ],
      "environment": {
        "PLAYWRIGHT_MCP_EXTENSION_TOKEN": "sometoken"
      },
      "enabled": true
    }
  },
  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Llama.cpp (Local)",
      "options": {
        "baseURL": "http://localhost:1234/v1",
        "name": "glm-4-7-flash"
      },
      "models": {
        "glm-4-7-flash": {
          "name": "GLM-4.7 Flash (6-bit)"
        }
      }
    },
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "LM Studio (Local)",
      "options": {
        "baseURL": "http://localhost:1234/v1",
        "name": "devstral-small-2-24b-instruct-2512"
      },
      "models": {
        "devstral-small-2-24b-instruct-2512": {
          "name": "Devstral Small 24B"
        }
      }
    }
  },
  "model": "llama-cpp/glm-4-7-flash"
}
I'm getting sporadic loops too: latest llama.cpp (b7825), up-to-date GLM-4.7-Flash-UD-Q4_K_XL.gguf, launched with:
H:\llama.cpp\llama-server ^
--model "H:\unsloth_GLM-4.7-Flash-GGUF\GLM-4.7-Flash-UD-Q4_K_XL.gguf" ^
--alias "GLM-4.7-Flash" ^
--threads -1 ^
--seed 3407 ^
--ctx-size 65536 ^
--temp 1.0 ^
--top-p 0.95 ^
--min-p 0.01 ^
--repeat-penalty 1.0 ^
--port 8082 ^
--host 0.0.0.0 ^
--api-key 12345 ^
--fit on ^
--flash-attn on ^
--batch-size 1024 ^
--ubatch-size 256
An aside question: this is on an RTX 3090 - would I do better with the UD-Q5 model instead of UD-Q4, or will UD-Q5 be slower?
It's getting better. Using Q8_K_XL with the recommended parameters, served by llama.cpp 7822. Tool calling is near perfect now.
I ran into a reasoning loop, however, and a deterministic one: the same prompt with the same context always leads to a loop at the same stage within the reasoning process (C++ code analysis). Increasing the repeat penalty to 1.05 did the trick. It is not recommended, but still better than hitting loops.
Not Qwen3, but a pretty good alternative.
Just a note about the template warning issue which affects tool calling, this is logged with llama.cpp: https://github.com/ggml-org/llama.cpp/issues/19009
I mentioned there that the pre-bug version 7751 of llama.cpp works flawlessly, at least for me, for tool usage with GLM 4.7 Flash. I finally got a Flappy Bird implementation going with UD Q4_K_XL, llama.cpp build 7751, and the Zed editor.
But with llama.cpp version 7761+ (I tried at least 3 different versions), my projects always ended up self-destructing through tools misfiring (truncated, corrupted, and even deleted files) over multiple coding sessions.
The issue "Invalid diff: now finding less tool calls!" was, in my case, simply caused by a context window that was too small. After increasing it from 25k to 70k, the issue disappeared. Hope it helps!


