Wingless Imp's many forms.
I've downloaded and compared them, and concluded that this quant surpasses the Max and Q4_K_M versions.
Test scenario: QA benchmarks at multiple temperatures/regenerations with an F32 KV cache via Layla (llama.cpp) on a Qualcomm Snapdragon 8 Gen 3 with 16 GB RAM.
Findings: The nonlinear quantization is close to Q4_0 (ARM-optimized) in speed but with higher fidelity, especially for low-parameter models. This high-attention version was capable of answering questions the Q4_K_M found difficult. The Max version performed much like the Q4_K_M, but with a significant speed advantage on my pocket hardware; this one, even more so. I've looked over the tensor differences, and this feels like a sweet, sweet spot for speed/intelligence at the current moment.
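For anyone wanting to reproduce a similar speed comparison on a desktop llama.cpp build, a minimal sketch using the real `llama-bench` tool is below. The model filenames are hypothetical placeholders, so the commands are printed rather than executed; substitute your own GGUF files and run them directly.

```shell
# Sketch: llama-bench invocations to compare speed across quantizations.
# model-<quant>.gguf filenames are hypothetical placeholders -- the commands
# are echoed, not executed, since no model files ship with this note.
quants="Q4_0 Q4_K_M Q5_K_S"
for q in $quants; do
  cmd="./llama-bench -m model-${q}.gguf -p 512 -n 128"
  echo "$cmd"
done
```

Running each printed command reports prompt-processing and token-generation throughput (t/s) for that file, which is the kind of side-by-side comparison described above.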
Thank you so much for testing. This observation has been noted by several other mobile users, with the only caveat being that NL quants are somewhat less consistent (due to the nature of the iMatrix, calibration, etc.).
With that said, your observation indeed adds to this quant being a 'sweet spot' of sorts for mobile.
Thank you for the feedback 🙏🏻
The size of this model is closer to Q5_K_S than Q4_K_M. It's great if the quality is similar but with higher speed on phones, though.