Can't seem to correctly generate structured output
It looks like this model isn't outputting thinking tags correctly, or vLLM somehow fails to remove all thinking traces. When generating structured output, the thinking traces end up included in the output, which is why it isn't JSON-parseable. Is anybody facing the same issue?
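For what it's worth, a client-side workaround is to strip the reasoning block before parsing. This is only a sketch and assumes the traces are wrapped in `<think>...</think>` tags; the exact tag name for this model may differ:

```python
import json
import re

def parse_structured_output(raw: str) -> dict:
    """Workaround sketch: remove any <think>...</think> traces (assumed tag
    format) before attempting to parse the remainder as JSON."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(cleaned)

raw = '<think>Let me build the object...</think>{"answer": 42}'
print(parse_structured_output(raw))  # → {'answer': 42}
```

Of course this only papers over the problem; if the model emits an unclosed thinking tag, the output still won't parse.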
I also encounter a lot of wrong tool calls and find the model behaving rather strangely. I'd love a GLM Air with vision, but am wondering if the quant is broken. Has anyone tried another quant?
Thank you for raising this to me. I will investigate this.
Is it maybe because of the unusual recommended generation settings:
top_p: 0.6
top_k: 2
temperature: 0.8
repetition_penalty: 1.1
that when you calibrate on the regular model, you end up with a skewed calibration for the quant?
I will do some experiments to confirm, and perhaps requantize the model.
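To illustrate why those settings look odd, here is a toy sampler (purely illustrative, not vLLM's actual implementation, and the filter order is an assumption) showing how `top_k: 2` collapses the candidate pool to at most two tokens before `top_p` even applies:

```python
import math
import random

def sample_filtered(logits, top_p=0.6, top_k=2, temperature=0.8):
    """Toy illustration of the settings above: temperature scaling,
    then top-k truncation, then nucleus (top-p) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # sort candidate indices by probability, descending
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep only the k most likely tokens
    kept = order[:top_k]
    # top-p: keep the smallest prefix whose cumulative mass reaches top_p
    cum, nucleus = 0.0, []
    for i in kept:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # renormalize and sample from the surviving candidates
    mass = sum(probs[i] for i in nucleus)
    r = random.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i, nucleus
    return nucleus[-1], nucleus

token, survivors = sample_filtered([2.0, 1.5, 0.1, -1.0])
print(survivors)  # with top_k=2, at most two candidates survive
```

With a pool that small, calibration data collected under different sampling assumptions could plausibly look quite different from what the quantized model sees in practice.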
Something that popped up in my mind:
If you leave the vision encoder at its original BF16 precision, do you even need a multimodal calibration set?
The calibration set for your GLM 4.5 Air works really well, but of course isn't multimodal.
I did think the same as you.
The vision params in this model are kept at BF16 precision, but assuming the activations produced by visual inputs differ from those produced by text, the model would still require a visual dataset to calibrate those activations.
Regarding my GLM 4.5 Air, does that mean tool calling and structured outputs work well in your use cases?
I really want to thank you; your feedback really helps me shape my future models.
You're welcome, and thank you of course for providing these quants.
GLM 4.5 Air is the daily driver at the company I work at and works really well, including tool calling.
We haven't used structured outputs yet, but we're busy implementing more tools and flows around the model, so maybe in the future.
Regarding my GLM 4.5 Air, does that mean tool calling and structured outputs work well in your use cases?
Your GLM 4.5 Air quant is working perfectly fine for me with regard to structured output, other than the model feeling a bit dumb sometimes 😄 but I think that is to be expected at 4-bit.
I'm no expert but an LLM pointed me to this paper:
https://arxiv.org/html/2509.23729v2#S3
Here there's barely any difference between a multimodal calibration set and text only for AWQ.
It still doesn't tell you everything about choosing the calibration set, though.
And it's tested on just 2 models.
https://nicsefc.ee.tsinghua.edu.cn/nics_file/pdf/37995c49-c9a4-4de2-812d-9f34b40978ae.pdf
Here they imply vision tokens are less important, which contradicts the practice of leaving the vision layers untouched.
Maybe it means calibrating for vision is not as important as calibrating for text.
Other thought:
A multimodal calibration set may give you different activations than a text-only one.
Maybe the text calibration dataset you use for Air corresponds better with how I intend to use the model.
In this case, both text-only and multimodal uses are foreseen for the model; does it make sense to calibrate on both text and multimodal datasets?
That's interesting. Thank you for the paper!
I am going to quantize TranslateGemma and was worried that there is no suitable visual dataset for calibration, but I guess a multilingual text-only dataset should work.
Good way to find out, especially since TranslateGemma is a bit smaller.
It should take less compute to produce a quant, and it's easier to run both the pre- and post-quantization models to compare outputs.
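As one crude way to compare the pre- and post-quantization models on the same prompts, a simple token-agreement score could be used (purely illustrative, not a rigorous quant-quality metric):

```python
def token_agreement(a: str, b: str) -> float:
    """Crude illustration: fraction of positions where whitespace-split
    tokens match between two model outputs for the same prompt."""
    ta, tb = a.split(), b.split()
    if not ta and not tb:
        return 1.0
    matches = sum(1 for x, y in zip(ta, tb) if x == y)
    return matches / max(len(ta), len(tb))

print(token_agreement("the cat sat on the mat",
                      "the cat sat on a mat"))  # high: 5 of 6 tokens match
```

In practice you'd probably want greedy decoding on both models so the comparison isn't dominated by sampling noise.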
Please let me know your findings.
@cpatonn any updates on this?
I tried 4.6V on the GLM website but also got quite mixed results; is the model just a bit wacky?
Also, I don't see an option to control temperature on the website, so I can't tell whether that's the issue 😅
@HenkTenk Yes, it is true that a vision dataset does not necessarily improve quantized model accuracy, but the diversity of the text dataset (various languages, math symbols, coding data, etc.) does.
But considering the model was released 2 months ago, it is not bleeding-edge anymore, and with the BF16 GLM 4.6V model itself possibly not being good at structured output and tool calling, I am a bit hesitant to requantize it.
Do you still use GLM 4.6V? I will requantize it if you prefer.
Well, I don't use 4.6V, but I still use 4.5 Air.
In testing for uses within the company I work for, it really hits the sweet spot between Dutch proficiency, general intelligence, and coding ability.
Some say GLM 4.6V is just Air with vision attached, but I just haven't experienced that with the quants I tried.
Having Air with vision on top would be pretty major.
If you say the original model also has problems with tool calling, I guess the game is over.
Thanks anyway for responding.