is NVFP4 supported on sm120 (blackwell rtx pro 6000, rtx 5090 etc)?
if yes, how do we run it? does it work in vLLM and/or SGLang now on sm120 RTX cards?
Yeah it does. There's a docker command on the model page for my NVFP4 quant that should get it running. I test these models mostly on a 2 x RTX Pro 6000 Blackwell system but sometimes on a B200. There's an environment flag you need to set to get INTELLECT-3 running but it's included in there. NVFP4 is still fragile but I've gotten quite a lot of models running on sm120 if you look at my other models. They're all NVFP4 quants and I usually have a note in the event that one wouldn't run. Those aren't that common though.
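For anyone skimming: the command on the model page is roughly this shape (the model repo below is a placeholder for whichever NVFP4 quant you're running; the env flag is the one mentioned above and discussed later in this thread):

# Rough shape of the docker command; <org>/INTELLECT-3-NVFP4 is a placeholder.
# VLLM_USE_FLASHINFER_MOE_FP4=1 is the flag needed for NVFP4 MoE on sm120.
docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm/vllm-openai:nightly \
  --model <org>/INTELLECT-3-NVFP4 \
  --tensor-parallel-size 2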
I don't really know anything about SGLang, but I guess it's possible they would also work there. I'm using llm-compressor for the quantization, so any tool that can consume llm-compressor output could theoretically run them.
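If you want to roll your own, the recipe is roughly this shape (the model id, dataset, and calibration settings below are illustrative placeholders, not my exact script):

# Minimal llm-compressor NVFP4 sketch; model id and calibration
# settings are placeholders, not the exact ones used for these quants.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize all Linear layers...
    scheme="NVFP4",        # ...to the NVFP4 format
    ignore=["lm_head"],    # keep the output head in higher precision
)

# NVFP4 fits global scales during calibration, so a small
# calibration set is required.
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="model-NVFP4",
)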
super cool!
I got it to work with 1 GPU using:
VLLM_USE_FLASHINFER_MOE_FP4=1
for 2 GPUs with -tp 2 I needed to add:
VLLM_FLASHINFER_MOE_BACKEND=throughput
for 4 GPUs with -tp 4 I have not gotten it to work yet.
I built vLLM from source from the latest pull on GitHub. Also using the latest stable PyTorch 2.9.1+cu130, latest Triton, and latest flashinfer-python too I believe.
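Putting that together, the launches look roughly like this (model path is a placeholder):

# 1 GPU:
VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm serve <path-to-nvfp4-quant>

# 2 GPUs: -tp 2 additionally needs the throughput MoE backend:
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
  vllm serve <path-to-nvfp4-quant> --tensor-parallel-size 2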
wait... it only produces !!!!!!!!!!!!!!!!!!!!... forever with -tp 2
So it was producing good output in the single-GPU case, but with tensor parallel 2 you're seeing gibberish?
@Firworks yes, exactly. It was good with a single GPU, except I'm not sure if it emitted the thinking tags correctly. With two GPUs I just got !!!!!!!!!!!!!!!!!!!!... I'll see if I can get you more details today. A new DeepSeek just came out too :) If we can get an NVFP4 that fits in 4x96GB that would be a game changer.
I'm game to try and see if I can get it quantized. It should come out to something around 370GB maybe. I think it'll be a tight squeeze on 4x96GB but maybe we'll get lucky. I'll give it a shot tomorrow assuming someone else hasn't beaten me to the NVFP4 by then.
I integrated this NVFP4 model into claude-code and it worked very well.
If you are using WSL Docker with the vllm/vllm-openai:nightly image and still encounter errors, build a derived image from the following Dockerfile:
FROM vllm/vllm-openai:nightly

# Install CUDA Toolkit headers and build tools
RUN apt-get update && apt-get install -y \
    cuda-toolkit-12-8 \
    cuda-nvcc-12-8 \
    cuda-libraries-dev-12-8 \
    libcurand-dev-12-8 \
    libcublas-dev-12-8 \
    ninja-build \
    build-essential

ENV CUDA_HOME=/usr/local/cuda-12.8
ENV CUDA_PATH=/usr/local/cuda-12.8
ENV FLASHINFER_CUDA_HOME=/usr/local/cuda-12.8
ENV CPATH=/usr/local/cuda-12.8/include
ENV C_INCLUDE_PATH=/usr/local/cuda-12.8/include
ENV CPLUS_INCLUDE_PATH=/usr/local/cuda-12.8/include
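Then build it and use it in place of the stock image (the tag name here is arbitrary; the model path is a placeholder and the serve flags are whatever you normally pass):

docker build -t vllm-openai-cuda-headers .
docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm-openai-cuda-headers \
  --model <path-to-nvfp4-quant>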
Additional reminder: if an error like this occurs:

RuntimeError: Error in function 'aligned_alloc' at /workspace/include/flashinfer/allocator.h:49: Buffer overflow when allocating memory for batch_prefill_tmp_v with size 703856640 and alignment 16, but only 413138944 bytes available in AlignedAllocator. Increase the workspace buffer size.

[rank0]:[W1201 21:33:10.932534662 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Please add environment variables:
-e VLLM_USE_FLASHINFER_MOE_FP4=1 -e VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=2147483648
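(2147483648 bytes = 2 GiB, comfortably above the ~704 MB allocation the error asks for.) For example, with the image built above:

docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=2147483648 \
  vllm-openai-cuda-headers \
  --model <path-to-nvfp4-quant>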
do you get spam of "WARNING 12-09 13:37:20 [vllm.py:1322] Current vLLM config is not set." on vLLM 0.12.0?