Wingless Imp's many forms.
I've downloaded and compared them, and concluded that this quant surpasses the Max and Q4_K_M versions.
Test scenario: QA benchmarks at multiple temperatures/regenerations with an F32 KV cache via Layla (llama.cpp) on a Qualcomm Snapdragon 8 Gen 3 with 16 GB RAM.
Findings: The nonlinear quantization is close to Q4_0 (ARM-optimized) in speed but with higher fidelity, especially for low-parameter models. This high-attention version was capable of answering questions the Q4_K_M found difficult. The Max version performed much like the Q4_K_M, but with a significant speed advantage on my pocket hardware; this one, even more so. I've looked over the tensor differences, and this feels like a sweet, sweet spot for speed/intelligence at the current moment.
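For anyone wanting to reproduce a similar speed comparison on a desktop llama.cpp build, a minimal sketch using the real `llama-bench` tool is below. The model filenames are hypothetical placeholders, so the commands are printed rather than executed; substitute your own GGUF files and run them directly.

```shell
# Sketch: llama-bench invocations to compare speed across quantizations.
# model-<quant>.gguf filenames are hypothetical placeholders -- the commands
# are echoed, not executed, since no model files ship with this note.
quants="Q4_0 Q4_K_M Q5_K_S"
for q in $quants; do
  cmd="./llama-bench -m model-${q}.gguf -p 512 -n 128"
  echo "$cmd"
done
```

Running each printed command reports prompt-processing and token-generation throughput (t/s) for that file, which is the kind of side-by-side comparison described above.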
Thank you so much for testing. This observation has been noted by several other mobile users, with the only caveat being that NL quants are somewhat less consistent (due to the nature of the iMatrix, calibration, etc.).
With that said, your observation indeed adds to this quant being a 'sweet spot' of sorts for mobile.
Thank you for the feedback 🙏🏻
The size of this model is closer to Q5_K_S than Q4_K_M. It's great if the quality is similar but with higher speed on phones, though.