Is it possible to make a smaller NVFP4 quant at 340-360 GB to fit in 4x96 GB?
Hi, is it possible to make a smaller NVFP4 quant at 340-360 GB so it fits in 4x96 GB? I've never done a quant before, but I'm willing to try. I'm wondering if we can quantize more layers to get the size down a bit more?
You could try quantizing the indexer, but my intuition says you probably don't want to. I think this is about as small as you can get with NVFP4 without really hurting model performance. If you give up GPU acceleration, you could go smaller with llama.cpp-style quantization.
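If you do want to try quantizing more layers yourself, one possible route is llm-compressor's NVFP4 scheme. This is just a minimal sketch of that route (my suggestion, not necessarily how this quant was produced); the model path, calibration dataset, and ignore list are all placeholders you would need to adapt:

```python
# Hypothetical sketch: one-shot NVFP4 quantization with llm-compressor.
# Model path, dataset, and ignore list are placeholders, not the recipe
# actually used for this repo's quant.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "path/to/base-model"  # placeholder

recipe = QuantizationModifier(
    targets="Linear",     # quantize Linear layers
    scheme="NVFP4",       # FP4 weights + activations
    ignore=["lm_head"],   # keep the output head in higher precision;
                          # shrinking this list (e.g. also quantizing the
                          # indexer) trades quality for a smaller file
)

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",   # small calibration set for activation scales
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="model-nvfp4",
)
```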
It should work in vLLM on sm100. Unfortunately, because of how NVIDIA decided to segment their consumer vs. datacenter Blackwell cards, much of the code in Triton/DeepGEMM/etc. doesn't properly support sm120. The vLLM hackery was mostly straightforward, but DeepGEMM (https://github.com/deepseek-ai/DeepGEMM) required extensive work to even get something working, and it is still a ways off from something I would try to get merged. This is why I only provided the CPU reference implementation for validation and experimentation with this model. Hopefully, with time, sm120 (RTX Pro 6000 Blackwell) will get better support from projects like DeepGEMM/Triton/vLLM/SGLang/etc.
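To make the segmentation issue concrete: these libraries typically pick kernels off the device's compute capability, and a dispatch table that only knows about the datacenter arches has nothing to hand back on sm120. The snippet below is only an illustration of that failure mode, not actual DeepGEMM/Triton/vLLM code:

```python
# Illustrative only -- not actual DeepGEMM/vLLM code. Shows how arch gating
# by compute capability ends up excluding sm120 (RTX Pro 6000 Blackwell).
import torch

def arch_string() -> str:
    major, minor = torch.cuda.get_device_capability()
    return f"sm{major}{minor}"  # e.g. "sm90", "sm100", "sm120"

SUPPORTED = {"sm90", "sm100"}  # typical whitelist in datacenter-focused kernels

def pick_kernel() -> str:
    arch = arch_string()
    if arch not in SUPPORTED:
        # The hardware can run FP4/FP8 kernels, but the dispatch table
        # simply has no sm120 entry, so everything falls over here.
        raise NotImplementedError(f"No kernel registered for {arch}")
    return f"kernel_{arch}"
```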
I uploaded https://hub.docker.com/repository/docker/eous/vllm-sm120/general, which has my sm120 hacks. It is very much MVP/research-grade and will probably not work. That said, I just tested the model, and with a smaller context you should be able to fit it on 4x 96 GB GPUs.
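For anyone trying to reproduce the "4x 96 GB with a smaller context" setup, here is a rough sketch of a launch via the vLLM Python API; the model path and the exact context/memory numbers are placeholders to tune, not values I've verified:

```python
# Hypothetical launch sketch -- model path, context length, and memory
# fraction are placeholders; adjust to whatever actually fits on 4x 96 GB.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/nvfp4-quant",   # placeholder for the NVFP4 checkpoint
    tensor_parallel_size=4,        # spread weights across the 4 GPUs
    max_model_len=16384,           # smaller context to leave room for weights
    gpu_memory_utilization=0.95,   # use most of each 96 GB card
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```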
@eousphoros almost!
NVCC compilation failed:
/root/.cache/vllm/deep_gemm/cache/kernel.smxx_fp8_mqa_logits.6170cd6e0de7e861f56139277bd6b709/kernel.cu:2:10: fatal error: deep_gemm/impls/sm120_fp8_mqa_logits.cuh: No such file or directory
    2 | #include <deep_gemm/impls/sm120_fp8_mqa_logits.cuh>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Is this maybe because I used the AWQ variation instead of NVFP4? It's a bit smaller, that's why.
edit: oooh, I need to install DeepGEMM, I see. I also needed to edit the install script to use python3 instead of just python and add --force-reinstall.
ok: Successfully installed deep-gemm-2.2.0+local
edit 2: still getting NVCC compilation failed: /root/.cache/vllm/deep_gemm/cache/kernel.smxx_fp8_mqa_logits.6170cd6e0de7e861f56139277bd6b709/kernel.cu:2:10: fatal error: deep_gemm/impls/sm120_fp8_mqa_logits.cuh: No such file or directory
Ah whoops, I forgot to copy the decode kernel into the container. I pushed a new container up. Also, no idea if this will work with AWQ; it barely works with my NVFP4 quant.
(APIServer pid=1) INFO 12-05 16:23:56 [loggers.py:248] Engine 000: Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Don't expect this to be fast, but it is faster than CPU inference.
Let's go!!!!!
copying build/lib.linux-x86_64-cpython-312/flash_mla/cuda.cpython-312-x86_64-linux-gnu.so -> flash_mla
edit:
Figured out how to build _flashmla_C.abi3.so and _flashmla_extension_C.abi3.so.
vLLM has its own build file for FlashMLA.
You make edits in the vLLM repo and then point it at a custom FlashMLA repo, which will be missing some files, so you then need to either clone the vLLM fork of FlashMLA and implement the custom sm120 build in that, or move the missing files over. I went with the first option: I switched to the vLLM fork of FlashMLA and redid the pybindings, etc.
I made a stab at porting to sm120; it's early stage, work in progress.
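For anyone following along with the custom FlashMLA build above, the core of an sm120 port on the build side is just emitting code for that arch. A rough sketch of what that looks like in a torch CUDAExtension-style setup.py (source file and extension names are placeholders, not the real FlashMLA layout):

```python
# Rough sketch of adding an sm120 target to a CUDAExtension build.
# Source/extension names are placeholders, not the real FlashMLA file list.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="flash_mla",
    ext_modules=[
        CUDAExtension(
            name="flash_mla._flashmla_C",
            sources=[
                "csrc/flash_mla_api.cpp",     # placeholder pybind entry point
                "csrc/flash_mla_kernel.cu",   # placeholder kernel source
            ],
            extra_compile_args={
                "cxx": ["-O3"],
                "nvcc": [
                    "-O3",
                    # the key change: emit code for consumer/workstation Blackwell
                    "-gencode", "arch=compute_120,code=sm_120",
                ],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```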
Sorry, I got distracted with a model fine-tune finishing its training; looks like you've got this. Good luck, have fun. I'll come back to this in a couple of days if you haven't gotten it working.
@eousphoros for sure I don't got it but crossing my fingers we will get it working. just having fun learning
I wish I could use my shiny new 2x RTX 6000 Pros with these NVFP4 models. I can't fit DeepSeek, but fitting Qwen 235B or MiniMax-M2 would be great. Anything I try fails.
@mtcl
What's the error, and what is your launch command?
If you want, hop in here on the Aider Discord under models and benchmarks: minimax m2, and I'll try to find the commands I used to launch it. My HISTFILE was only 1000, and I ran so many grep and sed commands on the FlashMLA repo recently that all my old SGLang/vLLM commands got nuked from bash history. But I'll check my notes.
discord link: https://discord.com/channels/1131200896827654144/1431467264607256636
Try --cpu-offload-gb 150.
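If it helps, the --cpu-offload-gb flag corresponds to the cpu_offload_gb engine argument in the Python API. A rough sketch, with the model path and sizes as placeholders:

```python
# Sketch only: the CLI flag --cpu-offload-gb maps to the cpu_offload_gb
# engine argument. Model path and numbers are placeholders to adjust.
from vllm import LLM

llm = LLM(
    model="path/to/qwen-235b-quant",  # placeholder for a model too big for 2x 96 GB
    tensor_parallel_size=2,           # the two RTX Pro 6000s
    cpu_offload_gb=150,               # park ~150 GiB of weights in system RAM
    max_model_len=8192,               # keep the KV cache modest
)
```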
Can you please send me an invite to the Discord? I tried clicking on this link and am unable to post messages there.