Is it possible to make a smaller NVFP4 quant at 340-360 GB to fit in 4x96 GB?
Hi, is it possible to make a smaller NVFP4 quant at 340-360 GB so it fits in 4x96 GB? I've never done a quant before, but I'm willing to try. I'm wondering if we can quantize more layers to get the size down a bit more?
You could try quantizing the indexer, but my intuition says you probably don't want to. I think this is about as small as you can get with NVFP4 without really hurting model performance. If you give up GPU acceleration, you could go smaller with llama.cpp-style quantization.
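If you do want to try quantizing more layers yourself, one possible route is llm-compressor's NVFP4 scheme. This is just a minimal sketch of that route (my suggestion, not necessarily how this quant was produced); the model path, calibration dataset, and ignore list are all placeholders you would need to adapt:

```python
# Hypothetical sketch: one-shot NVFP4 quantization with llm-compressor.
# Model path, dataset, and ignore list are placeholders, not the recipe
# actually used for this repo's quant.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "path/to/base-model"  # placeholder

recipe = QuantizationModifier(
    targets="Linear",     # quantize Linear layers
    scheme="NVFP4",       # FP4 weights + activations
    ignore=["lm_head"],   # keep the output head in higher precision;
                          # shrinking this list (e.g. also quantizing the
                          # indexer) trades quality for a smaller file
)

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",   # small calibration set for activation scales
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="model-nvfp4",
)
```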
It should work in vLLM on sm100. Unfortunately, because of how NVIDIA decided to segment their consumer vs. datacenter Blackwell cards, much of the code in Triton/DeepGEMM/etc. doesn't properly support sm120. The vLLM hackery was mostly straightforward, but DeepGEMM (https://github.com/deepseek-ai/DeepGEMM) required extensive work to even get something working, and it is still a ways off from something I would try to get merged. This is why I only provided the CPU reference implementation for validation and experimentation with this model. Hopefully, with time, sm120 (RTX Pro 6000 Blackwell) will get better support from projects like DeepGEMM/Triton/vLLM/SGLang/etc.
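To make the segmentation issue concrete: these libraries typically pick kernels off the device's compute capability, and a dispatch table that only knows about the datacenter arches has nothing to hand back on sm120. The snippet below is only an illustration of that failure mode, not actual DeepGEMM/Triton/vLLM code:

```python
# Illustrative only -- not actual DeepGEMM/vLLM code. Shows how arch gating
# by compute capability ends up excluding sm120 (RTX Pro 6000 Blackwell).
import torch

def arch_string() -> str:
    major, minor = torch.cuda.get_device_capability()
    return f"sm{major}{minor}"  # e.g. "sm90", "sm100", "sm120"

SUPPORTED = {"sm90", "sm100"}  # typical whitelist in datacenter-focused kernels

def pick_kernel() -> str:
    arch = arch_string()
    if arch not in SUPPORTED:
        # The hardware can run FP4/FP8 kernels, but the dispatch table
        # simply has no sm120 entry, so everything falls over here.
        raise NotImplementedError(f"No kernel registered for {arch}")
    return f"kernel_{arch}"
```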
I uploaded https://hub.docker.com/repository/docker/eous/vllm-sm120/general, which has my sm120 hacks. It is very much MVP/research-grade and will probably not work. That said, I just tested the model, and with a smaller context you should be able to fit it on 4x 96 GB GPUs.
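For anyone trying to reproduce the "4x 96 GB with a smaller context" setup, here is a rough sketch of a launch via the vLLM Python API; the model path and the exact context/memory numbers are placeholders to tune, not values I've verified:

```python
# Hypothetical launch sketch -- model path, context length, and memory
# fraction are placeholders; adjust to whatever actually fits on 4x 96 GB.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/nvfp4-quant",   # placeholder for the NVFP4 checkpoint
    tensor_parallel_size=4,        # spread weights across the 4 GPUs
    max_model_len=16384,           # smaller context to leave room for weights
    gpu_memory_utilization=0.95,   # use most of each 96 GB card
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```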
@eousphoros almost!
NVCC compilation failed:
/root/.cache/vllm/deep_gemm/cache/kernel.smxx_fp8_mqa_logits.6170cd6e0de7e861f56139277bd6b709/kernel.cu:2:10: fatal error: deep_gemm/impls/sm120_fp8_mqa_logits.cuh: No such file or directory
    2 | #include <deep_gemm/impls/sm120_fp8_mqa_logits.cuh>
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Is this maybe because I used the AWQ variation instead of NVFP4? It's a bit smaller, that's why.
edit: oooh, I need to install DeepGEMM, I see. I also needed to edit the install script to use python3 instead of just python and add --force-reinstall.
ok: Successfully installed deep-gemm-2.2.0+local
edit 2: still getting NVCC compilation failed: /root/.cache/vllm/deep_gemm/cache/kernel.smxx_fp8_mqa_logits.6170cd6e0de7e861f56139277bd6b709/kernel.cu:2:10: fatal error: deep_gemm/impls/sm120_fp8_mqa_logits.cuh: No such file or directory
Ah whoops, I forgot to copy the decode kernel into the container. I pushed a new container up. Also, no idea if this will work with AWQ; it barely works with my NVFP4 quant.
(APIServer pid=1) INFO 12-05 16:23:56 [loggers.py:248] Engine 000: Avg prompt throughput: 0.7 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Don't expect this to be fast, but it is faster than CPU inference.
Let's go!!!!!
copying build/lib.linux-x86_64-cpython-312/flash_mla/cuda.cpython-312-x86_64-linux-gnu.so -> flash_mla
edit:
Figured out how to build _flashmla_C.abi3.so and _flashmla_extension_C.abi3.so.
vLLM has its own build file for FlashMLA.
You make edits in the vLLM repo and then point it at a custom FlashMLA repo, which will be missing some files, so you then need to either clone the vLLM fork of FlashMLA and implement the custom sm120 build in that, or move the missing files over. I went with the first option: I switched to the vLLM fork of FlashMLA and redid the pybindings, etc.
I made a stab at porting to sm120; it's early stage, work in progress.
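For anyone following along with the custom FlashMLA build above, the core of an sm120 port on the build side is just emitting code for that arch. A rough sketch of what that looks like in a torch CUDAExtension-style setup.py (source file and extension names are placeholders, not the real FlashMLA layout):

```python
# Rough sketch of adding an sm120 target to a CUDAExtension build.
# Source/extension names are placeholders, not the real FlashMLA file list.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="flash_mla",
    ext_modules=[
        CUDAExtension(
            name="flash_mla._flashmla_C",
            sources=[
                "csrc/flash_mla_api.cpp",     # placeholder pybind entry point
                "csrc/flash_mla_kernel.cu",   # placeholder kernel source
            ],
            extra_compile_args={
                "cxx": ["-O3"],
                "nvcc": [
                    "-O3",
                    # the key change: emit code for consumer/workstation Blackwell
                    "-gencode", "arch=compute_120,code=sm_120",
                ],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```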
Sorry, I got distracted with a model fine-tune finishing its training; looks like you've got this. Good luck, have fun. I'll come back to this in a couple of days if you haven't gotten it working.
@eousphoros for sure I don't got it but crossing my fingers we will get it working. just having fun learning
I wish I could use my shiny new 2x RTX 6000 Pros with these NVFP4 models. I can't fit DeepSeek, but fitting Qwen 235B or MiniMax-M2 would be great. Anything I try fails.
@mtcl
What's the error, and what is your launch command?
If you want, hop in here on the Aider Discord under models and benchmarks: minimax m2, and I'll try to find the commands I used to launch it. My HISTFILE was only 1000, and I ran so many grep and sed commands on the FlashMLA repo recently that all my old SGLang/vLLM commands got nuked from bash history. But I'll check my notes.
discord link: https://discord.com/channels/1131200896827654144/1431467264607256636
Try --cpu-offload-gb 150.
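If it helps, the --cpu-offload-gb flag corresponds to the cpu_offload_gb engine argument in the Python API. A rough sketch, with the model path and sizes as placeholders:

```python
# Sketch only: the CLI flag --cpu-offload-gb maps to the cpu_offload_gb
# engine argument. Model path and numbers are placeholders to adjust.
from vllm import LLM

llm = LLM(
    model="path/to/qwen-235b-quant",  # placeholder for a model too big for 2x 96 GB
    tensor_parallel_size=2,           # the two RTX Pro 6000s
    cpu_offload_gb=150,               # park ~150 GiB of weights in system RAM
    max_model_len=8192,               # keep the KV cache modest
)
```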
Can you please send me an invite to the Discord? I tried clicking on this link and am unable to post messages there.