Jan 21: GGUFs all UPDATED!!!

#1
by danielhanchen - opened
Unsloth AI org
•
edited 3 days ago

Jan 21 UPDATE: llama.cpp has fixed a bug that caused the model to loop and produce poor outputs. We have reconverted and reuploaded the model, so outputs should be much better now.

You can now use Z.ai's recommended parameters and get great results:

  • For general use-case: --temp 1.0 --top-p 0.95
  • For tool-calling: --temp 0.7 --top-p 1.0

Guide: https://unsloth.ai/docs/models/glm-4.7-flash
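As a sketch, the general-use settings above could be combined into a llama-server invocation like this (the model filename and port are placeholders, not from the post; adjust for your setup):

```shell
# Serving the updated GGUF with Z.ai's recommended general-use sampling.
# Model path and port are placeholder assumptions.
./llama-server \
    -m GLM-4.7-Flash-Q4_K_M.gguf \
    --temp 1.0 \
    --top-p 0.95 \
    --port 8080
# For tool-calling workloads, swap in: --temp 0.7 --top-p 1.0
```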


danielhanchen pinned discussion
danielhanchen changed discussion title from Looping issues should now be fixed. to Looping issues should now be mostly fixed.
Unsloth AI org
•
edited 4 days ago

Please be sure to add --dry-multiplier 1.1

Or use the entire set --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1, which seems to work better for many people!
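For reference, a minimal sketch of that full sampler combination on a llama-server command line (the model path is a placeholder assumption):

```shell
# Full sampler workaround combination; model path is a placeholder.
./llama-server \
    -m GLM-4.7-Flash-Q4_K_M.gguf \
    --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 \
    --dry-multiplier 1.1
```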

danielhanchen changed discussion title from Looping issues should now be mostly fixed. to Add --dry-multiplier 1.1 to reduce looping issues!

I didn't find any settings that avoid the looping... waiting for possible further updates to LM Studio (llama.cpp).

Updated the to-do list
Create basic game structure
Implement core gameplay mechanics and pipe system

Kilo said
}//???????????????????????????????????????????????? [output continues with hundreds more '?' characters; truncated]
Unsloth AI org
•
edited 4 days ago

I didn't find any settings that avoid the looping... waiting for possible further updates to LM Studio (llama.cpp).

Updated the to-do list
Create basic game structure
Implement core gameplay mechanics and pipe system

Kilo said
}//???????????????????????????????????????????????? [output continues with hundreds more '?' characters; truncated]

Where was this from and do you have a screenshot of your config?

If you're using LM Studio, use --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 and disable repeat penalty.

temp 0.2, top-k 50, top-p 0.95, min-p 0.01, repeat penalty 1.1.
Will try to disable it

unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q4_K_M.gguf
repeat penalty disabled

Let me create the complete game file.</think>I'll create a complete Flappy Bird clone with enhanced graphics in a single HTML file. Let me build this step by step.
Checkpoint
(Current)
Kilo Code wants to create a new file
API Request...
01:10 PM
$0.0000
Kilo said
}//????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
Unsloth AI org

temp 0.2, top-k 50, top-p 0.95, min-p 0.01, repeat penalty 1.1.
Will try to disable it

The repeat penalty has to be disabled; it's not the same as the DRY multiplier.

Unsloth AI org

unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q4_K_M.gguf
repeat penalty disabled

Let me create the complete game file.</think>I'll create a complete Flappy Bird clone with enhanced graphics in a single HTML file. Let me build this step by step.
Checkpoint
(Current)
Kilo Code wants to create a new file
API Request...
01:10 PM
$0.0000
Kilo said
}//????????????????????????????????????????????????????????????????????????????????????????????????????????????????????

Is this LM Studio or llama.cpp?

LM Studio runtime (llama.cpp v1.103.2); will try the UD model.

UD same problem

I should delegate this task completely with clear instructions for Code mode to create the single-file implementation.
</think>


Checkpoint
(Current)

Kilo Code wants to create a new subtask in code mode
Subtask Instructions

Error

Error


Kilo said
}//???????????????????????????????????????????

The Z.ai team officially recommends following the GLM-4.7 sampler settings.

I'm curious where the recommendation for temp 0.2 and other things came from since it didn't seem to come from Z.ai?

Yeah, it will be a mess for a few days, but I am happy to wait for this gem to work :D

I fixed it.

🎉 Issue Solved! GLM-4.7-Flash Q6_K - Working perfectly now!

Environment

  • Model: unsloth/GLM-4.7-Flash-GGUF (Q6_K, 23GB)
  • llama.cpp: build 7779 (commit 6df686bee)
  • Hardware: RTX 4090, 128 GB RAM

✅ Root Cause Found

The issue was NOT the model itself. The model works perfectly with llama-cli!

Problem: Server/API mode with incorrect parameters caused chaotic output.

| Mode | Parameters | Result |
|------|------------|--------|
| llama-cli | Default (no extra params) | ✅ Working |
| llama-server | `--chat-template glm4 --jinja --temp 0.2 --dry-multiplier 1.1` | ❌ Chaos |

πŸ“ Test Results (All Passed βœ…)

Test Input Output
Greeting "hi" "Hello! I'm the GLM large language model..."
Chinese "δ½ ε₯½" "δ½ ε₯½οΌζˆ‘ζ˜―GLMε€§θ―­θ¨€ζ¨‘εž‹..."
Math "What is 2+2?" "2 + 2 = 4"
Math "Solve x^2-4=0" Correct solution with steps
Coding "write a script for ocr" Complete Python script

🔧 How to Fix

Remove these parameters from llama-server:

# Don't use these with GLM-4.7-Flash:
--chat-template glm4
--jinja
--temp 0.2
--top-p 0.95
--top-k 50
--min-p 0.01
--dry-multiplier 1.1

Use default parameters instead:
./llama-server \
    -m "GLM-4.7-Flash-Q6_K.gguf" \
    --host 0.0.0.0 \
    --port 8080
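Once the server is up, a quick smoke test against llama.cpp's OpenAI-compatible endpoint might look like the following (host and port assumed from the command above):

```shell
# Smoke test: send one chat request to the running llama-server.
# Assumes the server is listening on port 8080 as in the command above.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 2+2?"}]}'
```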

💭 Why This Happens

We suspect the chat template (--chat-template glm4 --jinja) conflicts with the model's internal reasoning process, causing the output to be replaced with incorrect tokens.

πŸ™ Conclusion

The model is excellent! Please update your documentation to avoid recommending these parameters for GLM-4.7-Flash.

Ref: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/1

Screenshots: Screenshot_2026-01-20_07-54-04, 07-53-58, 07-53-49, 07-53-40, 07-53-19.
These are the test results (partial).

Unsloth AI org
•
edited 4 days ago

The Z.ai team officially recommends following the GLM-4.7 sampler settings.

I'm curious where the recommendation for temp 0.2 and other things came from since it didn't seem to come from Z.ai?

Around five people said it worked. After testing GLM's recommended parameters, we still saw looping issues. Also, I think it's what LM Studio uses by default as well.

After using these sampling params, it started working better.

We shall include two ways to run the model for people to test and see which is better.

I can confirm that the MLX version of LM Studio worked for me with temp 0.2 etc. With the official parameters from Z.ai, I had looping and gibberish... 😵‍💫

IMO, if the official parameters don't work well, it's probably because something is broken in the llama.cpp implementation. I would wait for better GGUF models instead of trying different sampling params.

Unsloth AI org
•
edited 4 days ago

I can confirm that the MLX version of LM Studio worked for me with temp 0.2 etc. With the official parameters from Z.ai, I had looping and gibberish... 😵‍💫

IMO, if the official parameters don't work well, it's probably because something is broken in the llama.cpp implementation. I would wait for better GGUF models instead of trying different sampling params.

We experienced looping as well with the officially recommended parameters, until we added --dry-multiplier 1.1, which worked well for us. FYI, this was tested not just on our quant but on other uploaders' quants as well. Using the sampling parameters with temp 0.2 also worked.

Have you tested the original BF16 model or the Z.ai API?

@danielhanchen
Hello.
What is your final opinion about this advice from the above post?

Just to remove --chat-template glm4 --jinja and keep the rest in place?

🔧 How to Fix
Remove these parameters from llama-server:

# Don't use these with GLM-4.7-Flash:
--chat-template glm4
--jinja
--temp 0.2
--top-p 0.95
--top-k 50
--min-p 0.01
--dry-multiplier 1.1

Use default parameters instead:
./llama-server \
    -m "GLM-4.7-Flash-Q6_K.gguf" \
    --host 0.0.0.0 \
    --port 8080

💭 Why This Happens

We suspect the chat template (--chat-template glm4 --jinja) conflicts with the model's internal reasoning process, causing the output to be replaced with incorrect tokens.

πŸ™ Conclusion

The model is excellent! Please update your documentation to avoid recommending these parameters for GLM-4.7-Flash.

Ref: https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/1

I think @CHNtentes's comments are reasonable: there shouldn't be such a mismatch between the BF16 and GGUF sampling params.
Also, we still have a massive throughput degradation over time (with AND without flash attention; no FA just makes it start higher, but the degradation rate is the same, going from ~90 t/s to something like 30 t/s by the 1000th token). That's not usable for a thinking, and apparently quite verbose, model.
Also, I don't know about others, but I can clearly hear that my GPU's noise is unexpected: the pitch fluctuates quite randomly, when it should just decrease progressively as computation becomes more expensive with each new token.
I personally wouldn't try to find workarounds while there are still fundamental issues to be fixed.

relevant: https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/12#696ec9bd8ce5068e7b3f11bc

Just running llama-cli seems to work well. But still, according to my usual test questions, this model might be good at coding, but not at logical thinking. It failed at several easy questions:

I have a boat with three available spaces. I want to transport a man, a sheep, and a cat to the other side of the river. How can I do that?

I have a bowl with a small cup inside. I placed the bowl upside down on a table and then pick up the bowl to put it in the microwave. Where is that cup?

On llamacpp with UD Q4_K_XL, using the default params (nothing added to llamacpp) as mentioned above fixed the looping issues for me.

No more looping, tool calling works great in Goose. I am having some issues with Zed where it doesn't seem to understand how to submit a tool call, but that was working with --jinja so hints will probably fix it.

And I'm with @owao on flash-attention, not having it is brutal. You don't realize what you have until you lose it!

Unsloth AI org

On llamacpp with UD Q4_K_XL, using the default params (nothing added to llamacpp) as mentioned above fixed the looping issues for me.

No more looping, tool calling works great in Goose. I am having some issues with Zed where it doesn't seem to understand how to submit a tool call, but that was working with --jinja so hints will probably fix it.

And I'm with @owao on flash-attention, not having it is brutal. You don't realize what you have until you lose it!

What are the default parameters you used?

It really seems like the llama.cpp implementation might have bugs: https://github.com/ggml-org/llama.cpp/pull/18936#issuecomment-3774525719

On llamacpp with UD Q4_K_XL, using the default params (nothing added to llamacpp) as mentioned above fixed the looping issues for me.

No more looping, tool calling works great in Goose. I am having some issues with Zed where it doesn't seem to understand how to submit a tool call, but that was working with --jinja so hints will probably fix it.

And I'm with @owao on flash-attention, not having it is brutal. You don't realize what you have until you lose it!

What are the default parameters you used?

Hi @shimmyshimmer, same as in the update from @gannima: I just removed all parameters relating to the model.

So no --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.01 --dry-multiplier 1.1.

OK, so that wasn't too clear, I admit ;). To be exhaustive, and since I just have a 16 GB card, I went with:

--offline -m GLM-4.7-Flash-UD-Q4_K_XL.gguf --no-slots -np 1 --port 8000 --host 0.0.0.0 -fa off -fitt 500 --n-cpu-moe 16 --ubatch-size 128

But as you can see, these have nothing to do with sampling parameters, and that is all I used.

Additional note: I got the best --jinja performance with --temp 0.2 --top-k 50 --top-p 0.95 --min-p 0.0 --dry-multiplier 0.1. Tool calling worked in Goose and Zed, I was not getting any repetition looping. However, the model was regularly making simple copy/paste errors from context (e.g. mangling IP addresses or paths when producing reports or tool calls). I was just about to give up when I saw this thread.

Possible fix: https://github.com/ggml-org/llama.cpp/pull/18980

I wonder if this will fix the looping issues some people have experienced.

Unsloth AI org
•
edited 3 days ago

Possible fix: https://github.com/ggml-org/llama.cpp/pull/18980

I wonder if this will fix the looping issues some people have experienced.

I really hope it will fix the issue; we have been going crazy over the correct parameters etc.

Many people say this works, some people say it doesn't. Some people say something else works, while other people say it doesn't.

Possible fix: https://github.com/ggml-org/llama.cpp/pull/18980

I wonder if this will fix the looping issues some people have experienced.

I really hope it will fix the issue; we have been going crazy over the correct parameters etc.

Many people say this works, some people say it doesn't. Some people say something else works, while other people say it doesn't.

That's why I made those comments above. If the root cause is in the model inference, changing the sampling parameters might improve the output in some cases, but in other cases there's no improvement, or it's even worse.

And this isn't rare for llama.cpp support nowadays. I admire these community devs for their work, but sometimes they might overlook something and merge the support PR; then many people assume it's perfectly supported when it's not.

Unsloth AI org
•
edited 3 days ago

Hey guys, llama.cpp fixed an issue in the implementation. We reuploaded, and results are much better now! Can you retest and see if it's better?

Just use GLM-4.7's original parameters: https://unsloth.ai/docs/models/glm-4.7-flash#usage-guide

CC: @CHNtentes @coder543 @dugrema @gannima @urtuuuu @owao @McG-221 @Reverger @Ukro

danielhanchen changed discussion title from Add --dry-multiplier 1.1 to reduce looping issues! to Jan 21: GGUFs all UPDATED!!!
danielhanchen changed discussion title from Jan 21: GGUFs all UPDATED!!! to Jan 21: GGUFs UPDATED!!!
danielhanchen changed discussion title from Jan 21: GGUFs UPDATED!!! to Jan 21: GGUFs all UPDATED!!!
danielhanchen unpinned discussion
Unsloth AI org

If using llama.cpp, don't forget to set min_p = 0.01; this applies only to llama.cpp, as its default is 0.1.
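A minimal sketch of passing the flag explicitly so the llama.cpp default doesn't apply (model path is a placeholder assumption):

```shell
# Override llama.cpp's built-in min-p default explicitly.
# Model path is a placeholder; adjust for your setup.
./llama-server -m GLM-4.7-Flash-Q4_K_M.gguf --min-p 0.01
```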

Some people reported that the MLA implementation is broken and flash attention is unusable. I suppose that means the VRAM usage and speed are still not optimal?

If using llama.cpp, don't forget to set min_p = 0.01; this applies only to llama.cpp, as its default is 0.1.

(Quick note on this: it has changed to a default min-p of 0.05 and top-p of 0.95 now. The README is not up to date; I just noticed and made a PR. https://github.com/ggml-org/llama.cpp/blob/master/common/common.h)
@danielhanchen

I'll personally wait for the dust to settle before giving it another try.


Unsloth AI org

If using llama.cpp, don't forget to set min_p = 0.01; this applies only to llama.cpp, as its default is 0.1.

(Quick note on this: it has changed to a default min-p of 0.05 and top-p of 0.95 now. The README is not up to date; I just noticed and made a PR. https://github.com/ggml-org/llama.cpp/blob/master/common/common.h)
@danielhanchen

Thank you, will update docs.

@shimmyshimmer
Shouldn't we set min-p to 0? None of transformers, SGLang, or vLLM use it by default, and Z.ai doesn't mention it at all.
I was thinking the same for top-k (setting it to 0 instead of llama.cpp's default of 40), but I see that while SGLang and vLLM likewise don't use it by default, transformers sets it to 50.

@shimmyshimmer
Shouldn't we set min-p to 0? None of transformers, SGLang, or vLLM use it by default, and Z.ai doesn't mention it at all.
I was thinking the same for top-k (setting it to 0 instead of llama.cpp's default of 40), but I see that while SGLang and vLLM likewise don't use it by default, transformers sets it to 50.

I once read that setting top-k to 0 makes the pool of candidate tokens too large and slows down sampling. Not sure if that's true, though.

Well, over 2×20 generations (for each model, 10 with top-k 40 and 10 with top-k 0), with gpt-oss-20b and Qwen-VL-32B, there was no noticeable difference. At least we tried!

I confirm that removing "--temp 0.7 --min-p 0.0 --top-p 0.80 --top-k 20 --repeat-penalty 1.05" solved the problem of GLM not being able to call the tools correctly when using OpenHands.

We are using this:

ARGS="--no-mmap -fa on
-c 131072
-m $GGUF_PATH/unsloth/GLM-4.7-Flash-GGUF/GLM-4.7-Flash-Q5_K_M.gguf"

llama-server \
    $ARGS \
    -ngl 999 \
    --host 0.0.0.0 \
    --port 12345 \
    --cache-ram -1 \
    --parallel 2 \
    --batch-size 512 \
    --metrics

llama.cpp and weights were updated before testing.
