Suggestion: Add IQ3 variants.

#7
by tarruda - opened

IQ3 might extract more of the original model's performance on devices with 128 GB of RAM. For example, I have an M1 Ultra with 128 GB, and it can run MiMo 2.5 (310B parameters) with unsloth's IQ3_XXS quant and 128k context.

This is fully GPU offloaded BTW.
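
For anyone who wants to sanity-check the fit, here is a rough back-of-the-envelope sketch of the weight footprint. The bits-per-weight figures are approximate assumptions taken from llama.cpp's nominal values for each quant type, and the function name is hypothetical. Real GGUF files mix quant types per tensor, and the KV cache for the context comes on top, so treat the result as a ballpark only.

```python
# Rough estimate of quantized weight size; not an exact GGUF file size.
# ASSUMPTION: nominal bits per weight per quant type (approximate values).
APPROX_BPW = {
    "IQ3_XXS": 3.06,
    "Q3_K_M": 3.9,
}


def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_in_ram(n_params: float, quant: str, ram_gb: float) -> bool:
    """True if the weights alone are smaller than the given RAM budget."""
    return estimate_weights_gb(n_params, APPROX_BPW[quant]) < ram_gb


if __name__ == "__main__":
    n_params = 310e9  # MiMo 2.5 parameter count, as quoted in the post above
    for quant in APPROX_BPW:
        size = estimate_weights_gb(n_params, APPROX_BPW[quant])
        print(f"{quant}: ~{size:.1f} GB, fits in 128 GB: {fits_in_ram(n_params, quant, 128)}")
```

Under these assumptions, a 310B model comes out at roughly 119 GB for IQ3_XXS versus roughly 151 GB for Q3_K_M, which is why an IQ3 variant is the one that can squeeze into 128 GB.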

BatiAI's Q3_K_M is about 126 GB and fits in 128 GB of RAM with GPU offloading on Windows, but it is made for Macs as well.
(See the README for the batai.cpp fork, though mainline's PR sounds about done.)

https://huggingface.co/batiai/DeepSeek-V4-Flash-GGUF/tree/main

You will have hardly any room left for context, unless you stream the routed experts from SSD with https://github.com/Anemll/anemll-flash-llama.cpp/tree/DeepSeek-V4-SSD

@Dredd can you share a link to the PR?
