Bug in Q8_K_XL script?

#1
by ubergarm - opened

Heya team unsloth, I was just peeping what you're doing with your _XL quants to answer "what makes an XL quant an XL quant" for myself.

However, I noticed your big one seems to have F16 tensors instead of BF16 tensors, which I'm guessing might be a mistake in your conversion pipeline script? I don't see F16 on the smaller quants I spot-checked, just the Q8_K_XL's, maybe?

https://huggingface.co/unsloth/Olmo-3.1-32B-Instruct-GGUF?show_file_info=Olmo-3.1-32B-Instruct-UD-Q8_K_XL.gguf

[screenshot: oops-f16-not-bf16]

Oh, I see it too over here on a different quant: https://huggingface.co/unsloth/Mistral-Large-3-675B-Instruct-2512-GGUF?show_file_info=UD-Q8_K_XL%2FMistral-Large-3-675B-Instruct-2512-UD-Q8_K_XL-00001-of-00017.gguf

Maybe it's just Hugging Face displaying something wrong? But the original bf16 upload does show the bf16 type. I just want to make sure you're not accidentally clipping weights by downcasting bf16 tensors to f16, unless maybe you checked them all first somehow?
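(For anyone following along: the clipping concern is that f16 tops out at 65504, while bf16 shares float32's exponent range. A minimal sanity check, sketched in numpy with float32 standing in for bf16 since numpy has no native bf16 type:)

```python
import numpy as np

# f16's largest finite value; any bf16 weight beyond this would be
# clipped to inf on a naive bf16 -> f16 downcast.
F16_MAX = float(np.finfo(np.float16).max)  # 65504.0

def count_clipped(weights):
    """Count values that would overflow when cast down to f16.

    `weights` is a float32 array standing in for bf16 data.
    """
    return int(np.sum(np.abs(weights) > F16_MAX))

# Toy tensor: typical weights are tiny, but one outlier exceeds f16 range.
w = np.array([0.01, -0.5, 70000.0, 3.0], dtype=np.float32)
print(count_clipped(w))          # -> 1
print(w.astype(np.float16))      # the 70000.0 outlier becomes inf
```

Real LLM weights are almost always far below 65504 in magnitude, so a scan like this usually comes back clean, but it's cheap insurance before downcasting.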

Cheers!

Unsloth AI org

@ubergarm Hey! Sorry for the delay - F16 was selected because folks kept telling us Mac devices had much slower inference on BF16, so in general F16 is chosen to alleviate those issues. llama.cpp upcasts to FP32 for intermediate activations, so it's generally fine.

In our tests, it's generally not the weight types that matter - it's rather the accumulation during matmuls, i.e. large_bf16 * large_bf16 will cause overflow issues if the accumulation is done in fp16. It's generally OK for a layer or so, since accumulation errors stay contained.
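(The accumulation point can be seen in a few lines of numpy: each individual product fits comfortably in f16, but a running sum kept in f16 overflows, while an f32 accumulator is fine. The sizes here are made up for illustration.)

```python
import numpy as np

# Each elementwise product is 100 * 100 = 10000, well within f16 range,
# but summing 1000 of them reaches 1e7, far past f16's max of 65504.
a = np.full(1000, 100.0, dtype=np.float16)
b = np.full(1000, 100.0, dtype=np.float16)

# Dot product with the running sum kept in f16 (simulating f16 accumulation).
acc_f16 = np.float16(0.0)
for x, y in zip(a, b):
    acc_f16 = np.float16(acc_f16 + np.float16(x * y))

# Same dot product with an f32 accumulator (what upcasting buys you).
acc_f32 = np.dot(a.astype(np.float32), b.astype(np.float32))

print(acc_f16)  # inf - the f16 accumulator overflowed partway through
print(acc_f32)  # 10000000.0
```

This is why the accumulator precision matters more than the storage precision of the weights themselves.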

I'm not sure whether llama.cpp still has the slowdown - I'll probably default to using BF16 again, but I need to check beforehand.
