Request for an IQ5 Quant

#19
by binahz

Hey, I know this model is kinda old news now, but imo it's still one of the best for intelligence, longer-context performance, and a nice writing style. Could you please create an IQ5 quant for it, similar to the ones you made for DeepSeek V3.1? That would make it perfect for 768GB systems with 24GB VRAM. The only other one that comes close, by anikifoss, seems to have been quanted in such a way that moving any of the full layers to GPU heavily hampers inference performance...

Heya, I checked and I do still have access to my DeepSeek-R1-0528-bf16-safetensors/ files, but given this was one of my earlier models, I never went back and redid it with the more recent recipes.

I guess you prefer the older model over the more recent DeepSeek-V3.1-GGUF or DeepSeek-V3.1-Terminus-GGUF versions?

Not sure I'll get to it, especially if newer models land this week. I'm also doing some maintenance on the big remote rig atm.

Regarding @anikifoss 's recipes, iirc they use slightly larger routed expert layers and no imatrix by design. Not sure why offloading more layers would hurt your performance though, as offloading each additional routed expert layer onto GPU typically helps token generation speeds very slightly. KTransformers had an issue where offloading extra layers messed up the CUDA graphs or something, but afaik ik_llama.cpp (which I assume you're using?) should be fine. That said, ik's fork moves pretty fast, with many small optimizations landing in just the past week, e.g. https://github.com/ikawrakow/ik_llama.cpp/pull/842, which you could try testing with and without -no-ooae etc. Just random guessing on my part at the moment.
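
For reference, here's roughly how I'd pin a few extra routed expert layers onto GPU with ik_llama.cpp. This is just a minimal sketch: the model filename is a placeholder and the layer indices in the first `-ot` rule are illustrative, so adjust both for your rig. Note the rules match in order, so the CUDA0 rule has to come before the catch-all CPU rule:

```bash
# Offload everything via -ngl, then pin the routed experts of a few
# layers (indices illustrative) to CUDA0 while the remaining routed
# experts stay on CPU. -ot rules are applied in the order given.
./build/bin/llama-server \
    --model DeepSeek-R1-0528-IQ5_K.gguf \
    -ngl 99 \
    -ot "blk\.(3|4|5)\.ffn_.*_exps=CUDA0" \
    -ot "exps=CPU" \
    -c 32768 \
    --threads 24
```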

Finally, are you benchmarking with llama-sweep-bench for your speed comparisons? It's the best way to visualize both PP and TG speeds across various kv-cache depths imo.
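
If not, here's a rough sketch of an invocation, again with a placeholder filename and values you'd tune to your setup:

```bash
# Sweep PP and TG speeds across increasing kv-cache depth; run once per
# configuration (e.g. with and without the extra expert offload, or
# with/without -no-ooae) and compare the resulting tables.
./build/bin/llama-sweep-bench \
    --model DeepSeek-R1-0528-IQ5_K.gguf \
    -c 32768 \
    -ngl 99 \
    -ot "exps=CPU" \
    --threads 24
```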

Cheers!

@binahz

I just saw a possibly related (or not) issue mentioning that adding routed experts onto GPU was hurting performance: https://github.com/ggml-org/llama.cpp/issues/16945#issuecomment-3478207201

Are you seeing the "heavily hampers inference performance" issue on other quants too, or just the big one by anikifoss?
