Performance evaluation of Gemma 3-27b-it with different quantization methods (4-bit vs 8-bit)

#102
by Ryan1007 - opened

Hi team, I'm planning to deploy Gemma 3-27b-it on a consumer GPU with limited VRAM. I've noticed some performance variations when using 4-bit quantization (bitsandbytes). Have you guys performed any benchmarks on how much the reasoning capability drops compared to the FP16 version? Any recommended quantization parameters for maintaining logical consistency?

Hi @Ryan1007
Google has not published an official benchmark table specifically comparing bitsandbytes quantization to the FP16/BF16 base model for Gemma 3-27b-it. However, you can refer to the community-led benchmarks available on Reddit. I have included links to these benchmarks below for your reference.
https://www.reddit.com/r/LocalLLaMA/comments/1k6nrl1/i_benchmarked_the_gemma_3_27b_qat_models/
https://www.reddit.com/r/LocalLLaMA/comments/1k3jal4/gemma_3_qat_versus_other_q4_quants/

To maintain logical consistency, a good starting point is NF4 quantization with double quantization enabled, while keeping the compute dtype in FP16. In practice that means setting bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, and bnb_4bit_compute_dtype=torch.float16. We've generally seen NF4 hold up better than plain linear 4-bit, especially around outlier weights, and double quantization recovers a bit more fidelity, which translates into more stable reasoning. Please let me know if this setup helps you.

Thanks
