Bug in Q8_K_XL script?
Heya team unsloth, I was just peeping what you're doing with your _XL quants to answer "what makes an XL quant an XL quant?" for myself.
However, I noticed your big one seems to have F16 tensors instead of BF16 tensors, which I'm guessing might be a mistake in your conversion pipeline script? I don't see F16 on the smaller quants I spot checked, just the Q8_K_XL's maybe?
Oh, I see it too over here on a different quant: https://huggingface.co/unsloth/Mistral-Large-3-675B-Instruct-2512-GGUF?show_file_info=UD-Q8_K_XL%2FMistral-Large-3-675B-Instruct-2512-UD-Q8_K_XL-00001-of-00017.gguf
Maybe it's just Hugging Face displaying something wrong? But the original BF16 upload shows the bf16 type. I just want to make sure you're not accidentally clipping weights by downcasting a BF16 tensor to an F16 tensor, unless maybe you checked them all first somehow?
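
To illustrate what I mean, here's a quick PyTorch sketch (toy values, obviously not your actual pipeline): BF16 keeps F32's 8-bit exponent, while F16 only has 5, so any BF16 weight past F16's max of 65504 overflows on a naive cast:

```python
import torch

# bf16 keeps f32's 8-bit exponent (max ~3.4e38); f16 has only 5 exponent
# bits and tops out at 65504, so big bf16 values can't survive a downcast.
w = torch.tensor([1.0, 7.0e4, -1.0e6], dtype=torch.bfloat16)

# Naive downcast: anything past f16's range overflows to +/-inf.
print(w.to(torch.float16))  # tensor([1., inf, -inf], dtype=torch.float16)
```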
Cheers!
@ubergarm Hey! Sorry for the delay - F16 was selected because folks kept telling us Mac devices had much slower inference on BF16, so in general F16 is chosen to alleviate those issues. llama.cpp upcasts to FP32 for intermediate activations, so it's generally fine.
In our tests, it's generally not the weight types that matter but rather the accumulation during matmuls - i.e. accumulating large_bf16 * large_bf16 products in fp16 will overflow and cause issues. It's usually OK for 1 layer or so since accumulation errors are contained.
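
A rough NumPy toy of that accumulation point (hypothetical values, not our actual kernels): every individual fp16 product fits easily, but the running fp16 sum blows past 65504, while an fp32 accumulator is fine:

```python
import numpy as np

# Two toy vectors; every element is representable in fp16.
a = np.full(4096, 8.0, dtype=np.float16)
b = np.full(4096, 8.0, dtype=np.float16)

# Dot product accumulated in fp16: each product (64.0) fits easily, but
# the running sum (4096 * 64 = 262144) blows past fp16's max of 65504.
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)

# Same dot product with an fp32 accumulator (the kind of upcast
# llama.cpp does for intermediates): no overflow.
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc32 += np.float32(x) * np.float32(y)

print(acc16)  # inf
print(acc32)  # 262144.0
```

That's why the storage type matters much less than where the accumulation happens.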
I'm not sure whether llama.cpp still has the slowdown - I'll probably default to using BF16 again, but I need to check beforehand.
