Include MXFP4 and NVFP4 quants please!!!

#2605

by Iwaku-Real - opened 10 days ago

FP4 formats have had widespread support for a few months now, but it's still often hit or miss whether someone has made either version of a given model, and whether it is even a good one. (Remember that Nvidia said NVFP4 is supposed to be better than Q8 and almost as good as FP16 if done right!) I highly suggest you set up your workflow to determine the best quality-to-size ratios for MXFP4/NVFP4 quantization for each model and include each (like NVFP4-full, NVFP4-mixed) among your uploads, ideally in both Safetensors and GGUF so users of llama.cpp and vLLM can use them. It would really encourage more people to use them, because standard Qn GGUFs do NOT have the same speed boosts on hardware that supports native FP4.

RichardErkhov

10 days ago

We are only doing main llama.cpp. If this is already included in main llama.cpp, please let me know, as I did not notice that support. We are not doing other quants like AWQ and GPTQ, as we don't have enough GPU power to process such a huge number of requests, especially with big models.

redaihf

9 days ago • edited 9 days ago

MXFP4 is a supported GGUF quant type. Unsloth says it is only useful for "pure" MoE models. I assume this means natively trained MXFP4 models like GPT-OSS.

Iwaku-Real

9 days ago

We are only doing main llama.cpp. If this is already included in main llama.cpp, please let me know, as I did not notice that support.

It's been in main llama.cpp since April: https://github.com/ggml-org/llama.cpp/pull/22196

nicoboss

8 days ago

Yes, MXFP4/NVFP4 are only useful for GPT-OSS and other models that get released in this specific format. Even for those, such quants are very controversial. MXFP4 quants were already discussed many times previously, and we do provide them for all GPT-OSS-based models if we don't forget, but I can't personally recommend them as they simply offer a bad quality-per-size ratio. Keep in mind that for GPT-OSS-based models the relevant tensors are always in MXFP4 no matter which quant we choose, so also quantizing the other tensors to MXFP4 seems kind of stupid, and I did indeed observe a major degradation, as did many other users.

nicoboss

8 days ago

If there is any specific model for which you really want MXFP4, NVFP4 or any other special quants, you can always explicitly request them and we usually provide them.

Iwaku-Real

8 days ago

Nice, but I just think having NVFP4 quants should be a default in future uploads.

nicoboss

8 days ago

Nice, but I just think having NVFP4 quants should be a default in future uploads.

Why should we spend our resources and limited Hugging Face storage on quants that are worse than similarly sized alternatives unless a model uses them as its source format? I can't think of a single hardware combination where someone would want to choose MXFP4/NVFP4 over any better alternative. Is there any highly specialized hardware that would benefit from MXFP4/NVFP4 quants? The only place where MXFP4/NVFP4 make a lot of sense is vLLM, where you have highly optimized GPU kernels, especially for the latest generation of graphics cards, that run them at amazing speed. But those would then not be GGUFs but Safetensors, and so not something our GGUF-focused team will provide.

Iwaku-Real

7 days ago

I can't think of a single hardware combination where someone would want to choose MXFP4/NVFP4 over any better alternative. Is there any highly specialized hardware that would benefit from MXFP4/NVFP4 quants?

All Blackwell GPUs (RTX 50, DGX Spark, and B200/B300) whether consumer or not, they absolutely support MXFP4 and NVFP4. (AMD's MI350 supports MXFP4 oddly enough.)
The point of Nvidia championing NVFP4 is so they can get comparable quality to Q8 while making compute/prefill much faster and fitting more of the model into VRAM. It's especially good for QAT models like Gemma 4 that are designed to be quantized down to 4 bits. And because llama.cpp itself supports NVFP4 GGUFs and the kernels are only getting better, I think it will soon be necessary to include NVFP4 (and much later MXFP4, once AMD figures out how to support it) so people with compatible hardware have that option for better performance.
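
To put rough numbers on the "fitting more of the model into VRAM" point, here is a minimal sketch of the storage cost per weight for block-scaled formats. It assumes the commonly documented layouts (MXFP4: 4-bit elements with one 8-bit scale per 32 weights; NVFP4: 4-bit elements with one 8-bit scale per 16 weights; llama.cpp's Q4_0 and Q8_0: one FP16 scale per 32 weights) and ignores per-tensor scales and any tensors kept at higher precision. The 20B parameter count and the helper names are hypothetical.

```python
# Effective bits per weight = element bits + scale bits amortised over the block.
# Per-tensor scales and unquantized tensors are ignored in this sketch.

def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    return element_bits + scale_bits / block_size

FORMATS = {
    "MXFP4": bits_per_weight(4, 8, 32),   # E2M1 elements, one E8M0 scale per 32 weights
    "NVFP4": bits_per_weight(4, 8, 16),   # E2M1 elements, one FP8 scale per 16 weights
    "Q4_0": bits_per_weight(4, 16, 32),   # 4-bit ints, one FP16 scale per 32 weights
    "Q8_0": bits_per_weight(8, 16, 32),   # 8-bit ints, one FP16 scale per 32 weights
}

def weights_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bpw / 8 / 2**30

if __name__ == "__main__":
    n_params = 20e9  # hypothetical 20B-parameter model
    for name, bpw in FORMATS.items():
        print(f"{name}: {bpw:.2f} bits/weight, ~{weights_gib(n_params, bpw):.1f} GiB")
```

By this estimate NVFP4 takes roughly half the space of Q8_0, but about the same space as other 4-bit block formats, so the size argument alone does not separate it from existing 4-bit GGUF quants.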

nicoboss

7 days ago

All Blackwell GPUs (RTX 50, DGX Spark, and B200/B300) whether consumer or not, they absolutely support MXFP4 and NVFP4. (AMD's MI350 supports MXFP4 oddly enough.)

They sure do, but you would be crazy to use llama.cpp instead of vLLM to run such a model if it fits into GPU memory. For batched requests, vLLM gives you around 4000 tokens per second of token generation while llama.cpp might give you around 40 tokens per second, which is a 100x speed difference. I know because, despite being the main contributor of team mradermacher, I myself use vLLM every time a model fits inside my available GPU memory. Two years ago, when I spent weeks testing all possible quants on all kinds of hardware configurations, I concluded that compute is never really the bottleneck unless you have some really bad hardware. Instead, token generation using llama.cpp was always memory-bandwidth bottlenecked, and the same seems to apply to vLLM. While I have not tested it yet, I have the feeling that if you run an MXFP4 and a similarly sized non-MXFP4 quant on your MXFP4-capable GPU, you should see no meaningful difference in token generation performance. You might see slightly higher prompt processing performance, but even that I wouldn't take for granted. Feel free to post some benchmark results. llama.cpp offers a standardized performance benchmark you can run for this exact purpose. If your numbers convince me, I can replicate them, and the quality isn't absolutely terrible, I will talk with the rest of the team about adding it.
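
The memory-bandwidth argument can be sketched as back-of-the-envelope arithmetic: if every weight has to be read once per decoding step, the step rate is capped by bandwidth divided by model size, regardless of which 4-bit format the bytes are stored in. The function name, the 1000 GB/s bandwidth and the model sizes below are hypothetical, and the estimate ignores KV-cache reads and compute limits.

```python
GB = 1e9

def max_tokens_per_second(model_bytes: float,
                          bandwidth_bytes_per_s: float,
                          batch_size: int = 1) -> float:
    """Rough upper bound on decode throughput when weight reads dominate.

    Each decoding step streams all weights once and yields one token per
    sequence in the batch, so the aggregate rate scales with batch size
    until compute becomes the limit (not modelled here).
    """
    steps_per_second = bandwidth_bytes_per_s / model_bytes
    return steps_per_second * batch_size

if __name__ == "__main__":
    bandwidth = 1000 * GB  # hypothetical GPU memory bandwidth
    for name, size in [("quant A", 12.0 * GB), ("quant B", 12.5 * GB)]:
        single = max_tokens_per_second(size, bandwidth)
        batched = max_tokens_per_second(size, bandwidth, batch_size=64)
        print(f"{name}: <= {single:.0f} tok/s single stream, "
              f"<= {batched:.0f} tok/s aggregate at batch 64")
```

Under these assumptions two similarly sized quants land within a few percent of each other for single-stream generation, while batching raises aggregate throughput by reusing each weight read across many sequences. Actual numbers should come from a real benchmark run.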

get comparable quality to Q8

Seems like highly misleading marketing to me. I have been using Richard's GPUs to test models 24/7 for the past year, and NVFP4 feels about as good as 4-bit AWQ and nowhere near 8-bit or full precision, though they both beat the absolutely terrible 4-bit bitsandbytes quants. I once spent a few months comparing the quality of all GGUFs and came to the conclusion that everything i1-Q5_K_M and higher is indistinguishable from the source model for humans, but everything below that can be noticed. So sure, while 8-bit and the source model might be indistinguishable, any 4-bit quant will offer worse quality. But being able to fit a much larger model into your available memory gains you more than you lose to the lower precision, so in that sense quantization-aware training makes a lot of sense.

AMD figures out how to support it

I wouldn't mind if they don't, as NVFP4 is kind of a terrible datatype. Instead, they should implement a 4-bit data type that is good. You only have 4 bits to represent a number, leaving you with just 16 possible values. Now, like a total idiot, whoever invented NVFP4 decided it's a great idea to have a positive and a negative zero for absolutely no reason other than keeping the binary representation the same as for other floating-point numbers. This helps other 4-bit quant formats designed by more intelligent people to easily beat it. Generally, why even waste half of your 16 numbers on negative numbers when positive numbers are far more useful to train LLMs? As you can see, there are technical reasons why NVFP4 quants are kind of bad compared to similarly sized quants.
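
The duplicate-zero point can be checked by enumerating all 16 codes. This is a small sketch assuming the E2M1 element layout (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit) that MXFP4 and NVFP4 use for their 4-bit elements; the function name `decode_e2m1` is made up for illustration, and the block scales are left out.

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit."""
    sign = -1.0 if code & 0b1000 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 0b1
    if exponent == 0:
        magnitude = 0.5 * mantissa  # subnormal range: 0 or 0.5
    else:
        magnitude = (1 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude

# All 16 bit patterns, in code order 0b0000 .. 0b1111.
VALUES = [decode_e2m1(code) for code in range(16)]

if __name__ == "__main__":
    for code, value in enumerate(VALUES):
        print(f"{code:04b} -> {value}")
    # +0.0 and -0.0 compare equal, so the set collapses them into one value.
    print("distinct values:", len(set(VALUES)))
```

The positive half decodes to 0, 0.5, 1, 1.5, 2, 3, 4 and 6, and the negative half mirrors it, so code `1000` is a second zero and only 15 of the 16 codes carry distinct values.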

Iwaku-Real

7 days ago • edited 7 days ago

NVFP4 is kind of a terrible datatype

In some ways I agree, but unfortunately the current implementation is baked into silicon and Nvidia isn't very likely to fix it. So there's not much else that can be done, unless the RaZeR format (which can win back 30% or more of the accuracy loss) is more widely implemented by inference engines and kernels.
