Jan 21: All GLM-4.7-Flash quants reuploaded - much better outputs!
llama.cpp has fixed a bug which caused the model to loop and produce poor outputs.
Thanks to the work of llama.cpp and its contributors, we have now reconverted and reuploaded the model.
Outputs should now be much better, especially after our testing.
Please re-download, thanks!
You can now use Z.ai's recommended parameters and get great results:
- For general use: --temp 1.0 --top-p 0.95
- For tool-calling: --temp 0.7 --top-p 1.0
- Remember to disable repeat penalty!
If using llama.cpp, set --min-p 0.01, since llama.cpp's default is 0.1.
Please let us know if you see an improvement!
Guide: https://unsloth.ai/docs/models/glm-4.7-flash
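To make the presets concrete, here is a minimal sketch of building a request body for llama-server's OpenAI-compatible /v1/chat/completions endpoint with these values. Note that min_p and repeat_penalty are llama-server extensions to the OpenAI schema; treat the exact field names as an assumption to verify against your llama.cpp build.

```python
import json

# Recommended sampling presets from the guide above. "min_p" is added per
# the llama.cpp note, and "repeat_penalty": 1.0 disables repeat penalty.
PRESETS = {
    "general": {"temperature": 1.0, "top_p": 0.95, "min_p": 0.01},
    "tool_calling": {"temperature": 0.7, "top_p": 1.0, "min_p": 0.01},
}

def build_payload(messages, preset="general", model="GLM-4.7-Flash"):
    """Assemble a /v1/chat/completions request body for llama-server."""
    body = {"model": model, "messages": messages, "repeat_penalty": 1.0}
    body.update(PRESETS[preset])
    return body

payload = build_payload([{"role": "user", "content": "Hi"}], preset="tool_calling")
print(json.dumps(payload, indent=2))
```

You would then POST this body to your running llama-server instance with any HTTP client.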

As an example using UD-Q4_K_XL:

./llama.cpp/llama-cli \
    --model unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q4_K_XL.gguf \
    --temp 1.0 --top-p 0.95 --min-p 0.01 --jinja

and on a long conversation:
Hi
What is 2+2
Create a Python Flappy Bird game
Create a totally different game in Rust
Find bugs in both
Make the 1st game I mentioned but in a standalone HTML file
Find bugs and show the fixed game
and the HTML code is at https://unsloth.ai/docs/models/glm-4.7-flash#flappy-bird-example-with-ud-q4_k_xl
I ran it, and it created the below:
Does it allow NSFW questions?
Works perfectly! Thanks, and great work!
Amazing, thanks for testing! We're waiting for more feedback before we tweet, ahaha.
Hi!
Updated llama.cpp with this morning's release and downloaded the updated (Q8_0) model, but I'm still being hit by inconsistent tool calls in agentic coding tasks. Much less than with the first release, however.
Response to a /v1/chat/completions call directly captured from llama-server:
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"tool_calls":[{"index":2,"function":{"arguments":""}}]}}],"created":1768993844,"id":"chatcmpl-wO6uPHolWapSa3vpxZNSFstNjQfl7gfK","model":"GLM-4.7-Flash-Q8_0.gguf","system_fingerprint":"b7787-37c35f0e1","object":"chat.completion.chunk"}
data: {"error":{"code":500,"message":"Invalid diff: now finding less tool calls!","type":"server_error"}}
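For anyone debugging this, here is a small sketch of scanning the streamed SSE lines for error chunks like the one above. The sample lines below are illustrative, trimmed versions of the capture, not a live stream.

```python
import json

# Minimal SSE line parser for llama-server's streamed /v1/chat/completions
# output, useful for spotting error chunks injected mid-stream.
def parse_sse_lines(lines):
    events = []
    for line in lines:
        if not line.startswith("data: "):
            continue
        body = line[len("data: "):].strip()
        if body == "[DONE]":
            break
        events.append(json.loads(body))
    return events

sample = [
    'data: {"choices":[{"finish_reason":null,"index":0,"delta":{"tool_calls":[{"index":2,"function":{"arguments":""}}]}}]}',
    'data: {"error":{"code":500,"message":"Invalid diff: now finding less tool calls!","type":"server_error"}}',
]
events = parse_sse_lines(sample)
# Error chunks can arrive mid-stream, so check every event, not just the last.
errors = [e["error"]["message"] for e in events if "error" in e]
print(errors)
```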
llama-server command line:
llama-server --model /Users/vox/.llama/models/GLM-4.7-Flash-Q8_0.gguf
--n-gpu-layers -1
--threads 16
--port 8011
--host 127.0.0.1
--jinja
--parallel 1
--ctx-size 131072
-b 2048
-ub 512
--temp 0.7
--top-p 1.0
--min-p 0.01
(Flash attention disabled as this is running on a Metal backend -- 64GB unified memory.)
Cheers!
I have re-downloaded UD Q4_K_XL and updated my local llama.cpp docker image (version 7786). And yes, all is good!
I used the above recommended parameters for tools: --temp 0.7 --top-p 1.0 --min-p 0.01 --jinja. This is working really well with Goose and Zed. Impressive model. Thanks for all the work! Note that for Zed, I had to add a custom context rule to help with tool usage:
Tool Call Syntax Rule
Always call tools with correct JSON syntax: tool_name followed by {parameters} without any special prefixes, formatting, or angle brackets. Never use Harmony-like formatting like <|start|> or <|end|>.
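As a sanity check for that rule, one could lint captured tool-call argument strings before dispatching them. This is a hedged sketch: the token list is a guess at the offending Harmony-style markers, not an exhaustive set.

```python
import json

# Reject tool-call argument strings that contain Harmony-style chat tokens
# instead of (or wrapped around) plain JSON. Token list is illustrative.
BAD_TOKENS = ("<|start|>", "<|end|>", "<|channel|>")

def is_clean_tool_call(arguments: str) -> bool:
    if any(tok in arguments for tok in BAD_TOKENS):
        return False
    try:
        json.loads(arguments)  # must be valid bare JSON
    except ValueError:
        return False
    return True

print(is_clean_tool_call('{"city": "Berlin"}'))           # True
print(is_clean_tool_call('<|start|>{"city": "Berlin"}'))  # False
```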
Hi Daniel,
Incredible work! Thank you very much for the detailed documentation and the fast solution!
I can confirm: the latest llama.cpp fixed the bug.
Latest llama compiled based on https://unsloth.ai/docs/models/glm-4.7-flash#llama.cpp-tutorial-gguf without any problem.
And run as API server:
./llama-server
-m /home/b/.lmstudio/models/unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q6_K_XL.gguf
--host 0.0.0.0
--port 1234
-ngl 99
-c 65536
--parallel 1
--jinja
Speed:
OPENCODE: my dual RTX 3090s produce 77-85 tokens per second, which is not bad (however, Qwen3-Coder-Instruct Q6 gives 130 t/s).
The generated Solar System looks amazing. 😀 Better than Devstral-Small-2-24B-Instruct-2512 from Mistral, which was my favorite for clever tool calls.
There seem to be tool-calling / freezing issues when using it in Roo Code + LM Studio. I am using the recommended parameters for tool calling and the latest llama.cpp version recommended by LM Studio.
Latest LM Studio (libraries) does not have the necessary llama.cpp code yet.
You can wait for LM Studio or use llama.cpp.
Opencode works amazing with llama.cpp and unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q6_K_XL.gguf.
Much better than Qwen3-Next (80B) or anything in the 30B area.
I'm using GLM-4.7-Flash-UD-Q4_K_XL.gguf.
Windows 11, RTX 3090 24GB, 64GB DDR5.
LM Studio Runtime v1.104.2 still doesn't work properly: it no longer repeats, but the code it produces is not OK.
With the same prompt and same settings, llama.cpp shines.
So you can still wait for LM Studio or use llama.cpp.
Hmm, I get this warning from llama.cpp (driven by opencode):
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
Edit: I dug more into this, I think the template is working correctly - just llama.cpp's heuristics are actually failing and printing a warning.
I suppose this is a llama.cpp issue, but does anyone else see a fast, almost constant degradation as more tokens are generated? With FA, I start out at about 35 tk/s, and by the time it's done thinking I'm at 3 tk/s. Without FA, I go from 70+ tk/s down to 20 tk/s. All in all, it ends up being faster for me to run inference with bigger models: DeepSeek, Kimi, or GLM 4.7.
Looks like it got fixed within the last few hours. I rebuilt llama.cpp and it's now fixed and usable with FA. I ran the same prompt with a high reasoning budget: it starts at about 82 tk/s and ends at 70 tk/s after 17k output tokens, and took about a quarter of the time. The result is solid. Thanks, Unsloth!
Hi!
GLM-4.7-Flash-Q4_K_M.gguf – llama.cpp + MCP Tooling Test
I tested GLM-4.7-Flash-Q4_K_M.gguf on llama.cpp server with MCP tool calls (weather API) and Simple Tool Call Test.
Setup
Testing on a 5060 Ti:
F:\llama-b7790-bin-win-cuda-13.1-x64\llama-server.exe ^
-m "F:\OLLAMA_MODELS\GLM-4.7-Flash-Q4_K_M.gguf" ^
--host 0.0.0.0 ^
--port 8010 ^
-c 8192 ^
-b 64 ^
-ngl 40 ^
--temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 ^
--jinja ^
--threads 16 --parallel 1 ^
--chat-template-kwargs "{\"enable_thinking\": false}" ^
--timeout 300 ^
--flash-attn on ^
--fit on
Performance
Throughput: 6.82 tok/s
Stable at 8k context
Smooth streaming
Tool calling and MCP works correctly
Observations
Good schema understanding
Clean table formatting
No hallucinated API calls
Good English output
Works well for agent-style workflows
Thanks unsloth
@danielhanchen I saw two PRs (#18936 and #18980), but they were already merged into master at the time, so I did not compile a specific branch. Anyway, I updated to version 7802 this morning, but I still see the model struggling to call tools. There are no more catastrophic failures as before, which is a good sign, but it still formats the calls with bad parameters from time to time.
This is promising, but I'm afraid I had to promptly revert to Qwen3, since this model is not reliable for production use. E.g., a simple agentic task like "extract this method to its own .cpp file" would fail badly, since the erroneous tool calls to edit the original source file will likely mess up its content by removing the wrong part of the code, creating a feedback loop where the model introduces more and more errors while trying to correct its initial edit.
However, using it in ask mode often gives encouraging reasoning outputs, so I guess when/if this tool-calling issue is resolved, this model would be a good contender in a production coding environment.
I also confirm: the latest llama.cpp build works with Flash Attention.
./llama.cpp/llama-server
-m /home/b/.lmstudio/models/unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-UD-Q6_K_XL.gguf --host 0.0.0.0 --port 1234
-ngl 99
-c 65536
-fa on
--jinja
--temp 0.7
--top-p 1.0
--min-p 0.01
--threads -1
--parallel 1
This seems to work perfectly with opencode (a complex Playwright test created with the Playwright MCP...). AWESOME!
Look here at what code you can generate with this Q6-quantized version, the latest llama.cpp, and opencode:
https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/36
I just completed a main-branch build of llama.cpp; the reported version is 7811. And yes, flash attention works properly now. I'm getting 31 tokens/sec with 15,000 tokens loaded in context (15 MoE layers offloaded), which is awesome.
@williamliao, I also have a 5060 Ti with 16GB, but on an unimpressive Intel i3-10100 CPU. I still get > 30 tokens/sec with a context of 26880 with UD Q4_K_XL. Here's the trick with MoE models like GLM, Qwen3 and GPT-OSS:

-ngl 99 --n-cpu-moe 15

Change 15 to put more or fewer MoE layers on the CPU and adjust context length accordingly. The parameters tell llama.cpp to put all dense layers on the GPU and move only 15 MoE layers to the CPU. That is, unless you have a reason not to use the --n-cpu-moe feature, of course!
I'm still getting random loops when generating even the simplest code (generally occurring after hitting 50k+ context). I'm using the updated Unsloth quants (GLM-4.7-Flash-Q8_0.gguf), served locally with llama.cpp server on b7819 (also tried b7815), with temperature 0.7, top-p 1.0, min-p 0.01, and repeat-penalty 1.0 (off).
It's kind of frustrating, because when the model actually generates code it's of pretty good quality, but it almost always gets stuck in this repetition loop.
Can I ask about your opencode setup? I've updated it and it still runs into the same issue: "Invalid diff: now finding less tool calls!"
LM Studio still doesn't work properly for me, but llama.cpp is incredibly good.
Nothing special for opencode, so the key is llama.cpp.
Just refresh both to the latest version.
These are my current versions and configs:
b@big3:~$ /adat/ai/llama.cpp/llama-server --version
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 7816 (a14b960bc)
built with GNU 13.3.0 for Linux x86_64
b@big3:~$ opencode --version
1.1.34
b@big3:~$ more ~/.config/opencode/opencode.json
{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "playwright": {
      "type": "local",
      "command": [
        "npx",
        "-y",
        "@playwright/mcp@latest",
        "--extension"
      ],
      "environment": {
        "PLAYWRIGHT_MCP_EXTENSION_TOKEN": "sometoken"
      },
      "enabled": true
    }
  },
  "provider": {
    "llama-cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Llama.cpp (Local)",
      "options": {
        "baseURL": "http://localhost:1234/v1",
        "name": "glm-4-7-flash"
      },
      "models": {
        "glm-4-7-flash": {
          "name": "GLM-4.7 Flash (6-bit)"
        }
      }
    },
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "LM Studio (Local)",
      "options": {
        "baseURL": "http://localhost:1234/v1",
        "name": "devstral-small-2-24b-instruct-2512"
      },
      "models": {
        "devstral-small-2-24b-instruct-2512": {
          "name": "Devstral Small 24B"
        }
      }
    }
  },
  "model": "llama-cpp/glm-4-7-flash"
}
I'm getting sporadic loops too: latest llama.cpp (b7825), up-to-date GLM-4.7-Flash-UD-Q4_K_XL.gguf, launched with:
H:\llama.cpp\llama-server ^
--model "H:\unsloth_GLM-4.7-Flash-GGUF\GLM-4.7-Flash-UD-Q4_K_XL.gguf" ^
--alias "GLM-4.7-Flash" ^
--threads -1 ^
--seed 3407 ^
--ctx-size 65536 ^
--temp 1.0 ^
--top-p 0.95 ^
--min-p 0.01 ^
--repeat-penalty 1.0 ^
--port 8082 ^
--host 0.0.0.0 ^
--api-key 12345 ^
--fit on ^
--flash-attn on ^
--batch-size 1024 ^
--ubatch-size 256
An aside question: this is on an RTX 3090 - would I do better with the UD-Q5 model instead of UD-Q4, or will UD-Q5 be slower?
It's getting better. Using Q8_K_XL with the recommended parameters, served by llama.cpp 7822. Tool calling is near perfect now.
I ran into a reasoning loop, however, and a deterministic one: the same prompt with the same context always leads to a loop at the same stage within the reasoning process (C++ code analysis). Increasing the repeat penalty to 1.05 did the trick. It is not recommended, but still better than hitting loops.
Not Qwen3, but a pretty good alternative.
Just a note about the template warning issue which affects tool calling, this is logged with llama.cpp: https://github.com/ggml-org/llama.cpp/issues/19009
I mentioned there that the pre-bug version 7751 of llama.cpp works flawlessly, at least for me, for tool usage with GLM 4.7 Flash. I finally got a Flappy Bird implementation going with UD Q4_K_XL, llama.cpp build 7751, and the Zed editor.
But with llama.cpp version 7761+ (I tried at least 3 different versions), my projects always ended up self-destructing through tools misfiring (truncated, corrupted, and even deleted files) over multiple coding sessions.
The issue "Invalid diff: now finding less tool calls!" was, in my case, simply caused by a context window that was too small. After increasing it from 25k to 70k, the issue disappeared. Hope it helps!


