Any chances for Q8/Q6 quants?

#1
by boneshr - opened

Hi, I've been using Gemma4 E4B for various small tasks like combing through search results or docs. I've tried a setup with Q8 on a 16GB GPU and Q4 on the NPU via FLM on a separate device. I'm generally pretty happy with how it works, but I noticed that with the Q4 quant via FLM it sometimes uses double curly brackets when tool calling (tool{{"arg": "value"}} instead of tool{"arg": "value"}). When that happens the call fails to parse, and the model gives up and outputs a message along the lines of 'there seems to be a problem with (some) tool'.

In theory I think I could just try to hardcode around this, but to me it seems to be the type of mistake that models make in lower quants. Thus I was wondering, would it be architecturally possible to have a Q8 version too? I'm not super well versed in NPU architectures, but in general most Ryzen AI devices have 16+ GB of RAM so it should be pretty usable, if the architecture allows it.
I skimmed through FLM documentation and as far as I can tell it's non-trivial to make my own quants or anything of that sort, but if I'm wrong please correct me :p
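
For reference, the "hardcode around this" idea could look roughly like the sketch below. It is only a client-side workaround under assumptions of mine: the call arrives as plain text in the form name{...json...}, and the names `CALL_RE` and `parse_tool_call` are made up for illustration, not part of FLM or any Gemma tooling. It tries strict JSON parsing first and only strips one layer of braces if that fails.

```python
import json
import re

# Assumed (hypothetical) call format: a tool name followed directly by a JSON object.
CALL_RE = re.compile(
    r"^\s*(?P<name>[A-Za-z_][\w.-]*)\s*(?P<args>\{.*\})\s*$", re.DOTALL
)


def parse_tool_call(text):
    """Return (name, args) for a tool call, or None if it cannot be parsed."""
    m = CALL_RE.match(text)
    if m is None:
        return None
    name, raw = m.group("name"), m.group("args")

    # Always try the text as-is first so valid calls are never altered.
    candidates = [raw]
    # Fallback: drop one outer brace pair if the object is wrapped in {{ ... }}.
    if raw.startswith("{{") and raw.endswith("}}"):
        candidates.append(raw[1:-1])

    for candidate in candidates:
        try:
            args = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(args, dict):
            return name, args
    return None
```

Because the strict parse runs first, legitimately nested arguments such as tool{"a": {"b": 1}} pass through untouched; the brace stripping only kicks in for input that is not valid JSON to begin with.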

We're diligently working on a way to eliminate the JSON parsing issue in the FLM runtime (server). Please stay tuned for the v0.9.42 release. Thank you!
