Q3 for 24 GB VRAM + 64 GB RAM systems please?


Yeah, now that support is merged into ik_llama.cpp main: https://github.com/ikawrakow/ik_llama.cpp/pull/1240

I'll cook a few smaller quants, including that size range, using an imatrix.
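
For reference, the imatrix gets made with ik_llama.cpp's llama-imatrix over a calibration corpus, roughly like this (filenames and calibration text here are placeholders, not the exact setup):

```bash
# Sketch of imatrix generation (placeholder filenames, not the exact setup):
./build/bin/llama-imatrix \
    -m model-BF16.gguf \
    -f calibration_corpus.txt \
    -o imatrix.dat \
    --ctx-size 512
```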

@amane2

I'm awake and cooking again; the imatrix is uploaded.

I'm fishing for the best perplexity in your target size. This one might be a little too big if you want longer context, but it would likely fit 64k with a q8_0 KV cache, just barely, maybe?

  • smol-IQ3_KS 77.156 GiB (3.365 BPW)
  • IQ2_KL 71.527 GiB (3.120 BPW)
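
For the 24 GiB VRAM + 64 GiB RAM target, the launch I have in mind looks roughly like this on ik_llama.cpp. It's only a sketch; the filenames, layer picks, and thread count are placeholders you'd tune to your box:

```bash
# Rough launch sketch for a 24 GiB VRAM + 64 GiB RAM box (placeholders, tune to taste):
#   -c 65536             -> 64k context
#   -fa -ctk/-ctv q8_0   -> flash attention + q8_0 quantized KV cache
#   -ngl 99              -> offload all layers by default
#   -ot "...=CUDA0"      -> pin a few routed-expert layers onto the 24 GiB GPU
#   -ot "exps=CPU"       -> keep the remaining routed experts in system RAM
#   -fmoe                -> ik_llama.cpp fused-MoE path
./build/bin/llama-server \
    -m smol-IQ3_KS-00001-of-00002.gguf \
    -c 65536 -fa -ctk q8_0 -ctv q8_0 \
    -ngl 99 \
    -ot "blk\.(3|4|5)\.ffn_.*=CUDA0" \
    -ot "exps=CPU" \
    -fmoe --threads 16
```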

Otherwise I'm checking a slightly smaller IQ2_KL... I'm leaving the attn/shexp/first 3 dense layers all at Q8_0, which gives the best quality at a slight cost in size and TG speed (the larger active weights have to move through memory bandwidth)...
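
Concretely, that kind of mix is the sort of thing ik_llama.cpp's llama-quantize --custom-q overrides express; the tensor-name regexes and filenames below are illustrative, not the exact recipe behind these uploads:

```bash
# Illustrative sketch of the mix described above (example regexes/filenames,
# not the exact recipe): attention, shared expert, and the first 3 dense FFN
# layers stay at q8_0; everything else falls back to the base type (IQ2_KL).
custom="blk\..*\.attn_.*=q8_0"
custom+=",blk\..*\.ffn_.*_shexp\.weight=q8_0"
custom+=",blk\.[0-2]\.ffn_(gate|up|down)\.weight=q8_0"

./build/bin/llama-quantize \
    --imatrix imatrix.dat \
    --custom-q "$custom" \
    model-BF16.gguf model-IQ2_KL.gguf IQ2_KL 16
```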

I'll holler after I graph it and get a better feel for the curve!
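
For the curious, each point on that curve is just a perplexity run over the usual test text, something like this (the test file shown is the typical wiki.test.raw style corpus, not necessarily the exact data used here):

```bash
# Sketch of a perplexity measurement for one quant:
./build/bin/llama-perplexity \
    -m model-smol-IQ3_KS.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    --threads 16
```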

UPDATE: Got some test quants benchmarked... If I can fit 64k context with the smol-IQ3_KS in under 64 GiB + 24 GiB, I'll likely release that one. It's looking pretty tight, and it may require you to run only 32k context, or knock the KV cache down to q6_0 or q4_0 with the Hadamard transform or something to fit it, especially if you're not running headless or need a browser too...

```
llm_load_tensors:        CPU buffer size = 79007.73 MiB
llama_new_context_with_model: KV self size  = 6120.00 MiB, K (q8_0): 3060.00 MiB, V (q8_0): 3060.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =  2078.00 MiB
```
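
Adding up those buffers shows why it's tight: the 77.156 GiB test quant plus 64k of q8_0 KV cache plus the compute buffer lands just under the combined 88 GiB budget, before the OS or a browser take their cut.

```bash
# Rough fit check from the log above (77.156 GiB test quant, 64k ctx, q8_0 KV):
#   weights ~79008 MiB + KV ~6120 MiB + compute ~2078 MiB
# versus 64 GiB RAM + 24 GiB VRAM, minus OS/desktop overhead.
echo $((79008 + 6120 + 2078))   # ~87206 MiB used
echo $(((64 + 24) * 1024))      # ~90112 MiB total budget
```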

I may have found a winner... interestingly, knocking all attn/shexp/first 3 dense layers from full q8_0 down to iq6_k was barely noticeable at all (in perplexity)... and it saves a little over a GiB, which might be "enough" hah...

Okay, uploading the smol-IQ3_KS 75.934 GiB (3.312 BPW) now!

[Perplexity comparison graph: ppl-Step-3.5]
