q3 for 24vram + 64ram systems please?
Yeah, now that ik support is merged into main: https://github.com/ikawrakow/ik_llama.cpp/pull/1240
I'll cook a few smaller quants covering that range using imatrix.
I'm awake and cooking again; the imatrix is uploaded.
I'm fishing for the best perplexity at your target size. This one might be a little too big if you want longer context, though it would probably just barely fit 64k with a q8_0 KV cache:
- smol-IQ3_KS 77.156 GiB (3.365 BPW)
- IQ2_KL 71.527 GiB (3.120 BPW)
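As a rough sanity check, the size and BPW of both quants should imply the same total parameter count (BPW here averages over all tensors, so this is only approximate):

```python
# Back out the approximate parameter count from quant size (GiB) and
# average bits-per-weight. Sizes/BPW are the two quants listed above.
def n_params(size_gib: float, bpw: float) -> float:
    return size_gib * 2**30 * 8 / bpw  # GiB -> bits, then bits / (bits/weight)

a = n_params(77.156, 3.365)  # smol-IQ3_KS
b = n_params(71.527, 3.120)  # IQ2_KL
print(f"{a/1e9:.1f}B vs {b/1e9:.1f}B")  # both come out near ~197B
```

Both land within a fraction of a percent of each other, so the listed sizes and BPW figures are self-consistent.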
Otherwise I'm checking a slightly smaller IQ2_KL... I'm leaving attn/shexp/the first 3 dense layers all at Q8_0, which gives the best quality at a slight cost in size and TG speed (the larger active weights have to move through memory bandwidth)...
I'll holler after I graph it and get a better feel for the curve!
UPDATE: Got some test quants benchmarked... If I can fit 64k context with the smol-IQ3_KS in under 64GiB+24GiB, I'll likely release that one. It's looking pretty tight: you may need to run only 32k context, or knock the KV cache down to something like q6_0 or q4_0 with the Hadamard transform to fit it, especially if you're not running headless or need a browser open too...
```
llm_load_tensors: CPU buffer size = 79007.73 MiB
llama_new_context_with_model: KV self size = 6120.00 MiB, K (q8_0): 3060.00 MiB, V (q8_0): 3060.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: CPU compute buffer size = 2078.00 MiB
```
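For reference, the 6120 MiB KV self size above is what you'd expect from a q8_0 cache at 64k context. A rough calculator (the layer count, KV head count, and head dim below are assumptions chosen to reproduce the log, not confirmed model specs; q6_0 block layout is the ik_llama.cpp one):

```python
# Rough KV-cache size calculator for llama.cpp-style quantized caches.
# Each cache type stores blocks of 32 values plus an fp16 scale.
BYTES_PER_ELEM = {
    "q8_0": 34 / 32,  # 32x int8  + fp16 scale = 34 bytes / 32 elems
    "q6_0": 26 / 32,  # 32x 6-bit + fp16 scale = 26 bytes / 32 elems
    "q4_0": 18 / 32,  # 32x 4-bit + fp16 scale = 18 bytes / 32 elems
}

def kv_cache_mib(n_ctx, n_layer, n_head_kv, head_dim, ctype):
    elems_per_side = n_ctx * n_layer * n_head_kv * head_dim  # K (or V) alone
    total_bytes = 2 * elems_per_side * BYTES_PER_ELEM[ctype]  # K + V
    return total_bytes / 2**20

# ASSUMED architecture for illustration: 90 layers, 4 KV heads of dim 128.
# These happen to reproduce the log's 6120 MiB at 64k ctx with q8_0.
for t in ("q8_0", "q6_0", "q4_0"):
    print(t, kv_cache_mib(64 * 1024, 90, 4, 128, t), "MiB")
```

Under those assumptions, dropping from q8_0 to q4_0 shrinks the 64k cache from 6120 MiB to 3240 MiB, which is roughly the headroom being discussed above.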
