Any chances for Q8/Q6 quants?

by boneshr - opened 17 days ago

Hi, I've been using Gemma4 E4B for various small tasks like combing through search results or docs. I've tried a setup with Q8 on a 16 GB GPU and Q4 on the NPU via FLM on a separate device. I'm generally pretty happy with how it works, but I noticed that via FLM with the Q4 quant it sometimes uses double curly brackets when tool calling (tool{{"arg": "value"}} instead of tool{"arg": "value"}), which makes the model give up and output a message along the lines of 'there seems to be a problem with (some) tool'.

In theory I think I could just try to hardcode around this, but to me it seems to be the type of mistake that models make at lower quants. So I was wondering, would it be architecturally possible to have a Q8 version too? I'm not super well versed in NPU architectures, but most Ryzen AI devices have 16+ GB of RAM, so it should be pretty usable if the architecture allows it.
I skimmed through the FLM documentation and as far as I can tell it's non-trivial to make my own quants or anything of that sort, but if I'm wrong please correct me :p
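
For reference, the "hardcode around this" idea could look roughly like the sketch below. It is only a client-side fallback under assumptions: it assumes the tool call arrives as plain text in the tool{...} shape from the example above, and the names parse_tool_call and CALL_RE are made up for illustration, not part of FLM or any other library. It tries normal JSON parsing first and only strips one layer of doubled outer braces if that fails.

```python
import json
import re

# Assumed format (from the example in this post): a tool name followed
# directly by a JSON object, e.g. tool{"arg": "value"}.
# CALL_RE and parse_tool_call are hypothetical helper names.
CALL_RE = re.compile(r"^\s*([A-Za-z_][\w.-]*)\s*(\{.*\})\s*$", re.DOTALL)


def parse_tool_call(text):
    """Return (name, args); raise ValueError if the call cannot be parsed."""
    match = CALL_RE.match(text)
    if match is None:
        raise ValueError(f"not a tool call: {text!r}")
    name, payload = match.group(1), match.group(2)

    # First attempt: the payload is already valid JSON.
    try:
        return name, json.loads(payload)
    except json.JSONDecodeError:
        pass

    # Fallback: strip exactly one layer of doubled outer braces and retry.
    if payload.startswith("{{") and payload.endswith("}}"):
        try:
            return name, json.loads(payload[1:-1])
        except json.JSONDecodeError:
            pass

    raise ValueError(f"unparseable tool arguments: {payload!r}")


if __name__ == "__main__":
    print(parse_tool_call('tool{{"arg": "value"}}'))
    print(parse_tool_call('tool{"arg": {"nested": 1}}'))
```

Because valid JSON is tried first, a legitimate nested object that happens to end in }} is left untouched; the brace stripping only kicks in when parsing has already failed.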

FastFlowLM

Owner 15 days ago

We are diligently working on a way to eliminate the JSON parsing issue in the FLM runtime (server). Please stay tuned for the v0.9.42 release. Thank you!
