Can't seem to correctly generate structured output
It looks like this model isn't outputting thinking tags correctly, or vLLM somehow fails to remove all thinking traces. When generating structured output, the thinking traces end up included in the output, which is why it isn't JSON-parseable. Is anybody facing the same issue?
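For what it's worth, a client-side workaround is to strip the reasoning block before parsing. This is only a sketch and assumes the traces are wrapped in `<think>...</think>` tags; the exact tag name for this model may differ:

```python
import json
import re

def parse_structured_output(raw: str) -> dict:
    """Workaround sketch: remove any <think>...</think> traces (assumed tag
    format) before attempting to parse the remainder as JSON."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(cleaned)

raw = '<think>Let me build the object...</think>{"answer": 42}'
print(parse_structured_output(raw))  # → {'answer': 42}
```

Of course this only papers over the problem; if the model emits an unclosed thinking tag, the output still won't parse.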
I also encounter a lot of wrong tool calls and find the model behaving rather strangely. I'd love a GLM Air with vision, but am wondering if the quant is broken. Has anyone tried another quant?
Thank you for raising this to me. I will investigate this.
Is it maybe because of the unusual recommended generation settings:
top_p: 0.6
top_k: 2
temperature: 0.8
repetition_penalty: 1.1
that when you calibrate on the regular model, you end up with a skewed calibration for the quant?
I will do some experiments to confirm, and perhaps requantize the model.
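To illustrate why those settings look odd, here is a toy sampler (purely illustrative, not vLLM's actual implementation, and the filter order is an assumption) showing how `top_k: 2` collapses the candidate pool to at most two tokens before `top_p` even applies:

```python
import math
import random

def sample_filtered(logits, top_p=0.6, top_k=2, temperature=0.8):
    """Toy illustration of the settings above: temperature scaling,
    then top-k truncation, then nucleus (top-p) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # sort candidate indices by probability, descending
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep only the k most likely tokens
    kept = order[:top_k]
    # top-p: keep the smallest prefix whose cumulative mass reaches top_p
    cum, nucleus = 0.0, []
    for i in kept:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # renormalize and sample from the surviving candidates
    mass = sum(probs[i] for i in nucleus)
    r = random.random() * mass
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i, nucleus
    return nucleus[-1], nucleus

token, survivors = sample_filtered([2.0, 1.5, 0.1, -1.0])
print(survivors)  # with top_k=2, at most two candidates survive
```

With a pool that small, calibration data collected under different sampling assumptions could plausibly look quite different from what the quantized model sees in practice.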
Something that popped up in my mind:
If you leave the vision encoder at its original BF16 precision, do you even need a multimodal calibration set?
The calibration set for your GLM 4.5 Air works really well, but of course isn't multimodal.
I did think the same as you.
The vision params in this model are kept at BF16 precision, but assuming the activations produced by visual inputs differ from those produced by text, the model would still require a visual dataset to calibrate those activations.
Regarding my GLM 4.5 Air, does that mean tool calling and structured outputs work well in your use cases?
I really want to thank you; your feedback really helps me shape my future models.
You're welcome, and thank you of course for providing these quants.
GLM 4.5 Air is the daily driver at the company I work at and works really well, including tool calling.
We haven't used structured outputs yet, but we're busy implementing more tools and flows around the model, so maybe in the future.
Regarding my GLM 4.5 Air, does that mean tool calling and structured outputs work well in your use cases?
Your GLM 4.5 Air quant is working perfectly fine for me with regard to structured output, other than the model feeling a bit dumb sometimes 😄 but I think that is to be expected at 4-bit.
I'm no expert but an LLM pointed me to this paper:
https://arxiv.org/html/2509.23729v2#S3
Here there's barely any difference between a multimodal calibration set and text only for AWQ.
It still doesn't tell you everything about choosing the calibration set, though.
And it's tested on just 2 models.
https://nicsefc.ee.tsinghua.edu.cn/nics_file/pdf/37995c49-c9a4-4de2-812d-9f34b40978ae.pdf
Here they imply vision tokens are less important, which contradicts the practice of leaving the vision layers untouched.
Maybe it means calibrating for vision is not as important as calibrating for text.
Other thought:
A multimodal calibration set may give you different activations than a text-only one.
Maybe the text calibration dataset you use for Air corresponds better with how I intend to use the model.
In this case, both text-only and multimodal uses are foreseen for the model; does it make sense to calibrate on both text and multimodal datasets?
That's interesting. Thank you for the paper!
I am going to quantize TranslateGemma and was worried that there is no suitable visual dataset for calibration, but I guess a multilingual text-only dataset should work.
Good way to find out, especially since TranslateGemma is a bit smaller.
It should take less compute to produce a quant, and it's easier to run both the pre- and post-quantization models to compare outputs.
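As one crude way to compare the pre- and post-quantization models on the same prompts, a simple token-agreement score could be used (purely illustrative, not a rigorous quant-quality metric):

```python
def token_agreement(a: str, b: str) -> float:
    """Crude illustration: fraction of positions where whitespace-split
    tokens match between two model outputs for the same prompt."""
    ta, tb = a.split(), b.split()
    if not ta and not tb:
        return 1.0
    matches = sum(1 for x, y in zip(ta, tb) if x == y)
    return matches / max(len(ta), len(tb))

print(token_agreement("the cat sat on the mat",
                      "the cat sat on a mat"))  # high: 5 of 6 tokens match
```

In practice you'd probably want greedy decoding on both models so the comparison isn't dominated by sampling noise.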
Please let me know your findings.
@cpatonn any updates on this?
I tried 4.6V on the GLM website but also got quite mixed results; is the model just a bit wacky?
Also, I don't see an option to control temperature on the website, so I can't tell whether that's the issue 😅
@HenkTenk Yes, it is true that a vision dataset does not necessarily improve quantized model accuracy, but the diversity of the text dataset (various languages, math symbols, coding data, etc.) does.
But considering the model was released 2 months ago, it is not bleeding-edge anymore, and with the BF16 GLM 4.6V model itself possibly not being good at structured output and tool calling, I am a bit hesitant to requantize it.
Do you still use GLM 4.6V? I will requantize it if you prefer.
Well, I don't use 4.6V, but I still use 4.5 Air.
In testing for uses within the company I work for, it really hits the sweet spot between Dutch proficiency, general intelligence, and coding ability.
Some say GLM 4.6V is just Air with vision attached, but I just haven't experienced that with the quants I tried.
Having Air with vision on top would be pretty major.
If you say the original model also has problems with tool calling, I guess the game is over.
Thanks anyway for responding.