is NVFP4 supported on sm120 (blackwell rtx pro 6000, rtx 5090 etc)?
if yes, how do we run it? does it work in vLLM and/or SGLang now on sm120 RTX cards?
Yeah it does. There's a docker command on the model page for my NVFP4 quant that should get it running. I test these models mostly on a 2 x RTX Pro 6000 Blackwell system but sometimes on a B200. There's an environment flag you need to set to get INTELLECT-3 running but it's included in there. NVFP4 is still fragile but I've gotten quite a lot of models running on sm120 if you look at my other models. They're all NVFP4 quants and I usually have a note in the event that one wouldn't run. Those aren't that common though.
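For anyone skimming: the command on the model page is roughly this shape (the model repo below is a placeholder for whichever NVFP4 quant you're running; the env flag is the one mentioned above and discussed later in this thread):

# Rough shape of the docker command; <org>/INTELLECT-3-NVFP4 is a placeholder.
# VLLM_USE_FLASHINFER_MOE_FP4=1 is the flag needed for NVFP4 MoE on sm120.
docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm/vllm-openai:nightly \
  --model <org>/INTELLECT-3-NVFP4 \
  --tensor-parallel-size 2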
I don't really know anything about SGLang, but I guess it's possible they would also work there. I'm using llm-compressor for the quantization, so any tool that can consume llm-compressor output could theoretically run them.
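If you want to roll your own, the recipe is roughly this shape (the model id, dataset, and calibration settings below are illustrative placeholders, not my exact script):

# Minimal llm-compressor NVFP4 sketch; model id and calibration
# settings are placeholders, not the exact ones used for these quants.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize all Linear layers...
    scheme="NVFP4",        # ...to the NVFP4 format
    ignore=["lm_head"],    # keep the output head in higher precision
)

# NVFP4 fits global scales during calibration, so a small
# calibration set is required.
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="model-NVFP4",
)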
super cool!
I got it to work with 1 GPU using:
VLLM_USE_FLASHINFER_MOE_FP4=1
for 2 GPUs with -tp 2 I needed to add:
VLLM_FLASHINFER_MOE_BACKEND=throughput
for 4 GPUs with -tp 4 I have not gotten it to work yet.
I built vLLM from source from the latest pull on GitHub. Also using the latest stable PyTorch 2.9.1+cu130, latest Triton, and latest flashinfer-python too I believe.
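Putting that together, the launches look roughly like this (model path is a placeholder):

# 1 GPU:
VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm serve <path-to-nvfp4-quant>

# 2 GPUs: -tp 2 additionally needs the throughput MoE backend:
VLLM_USE_FLASHINFER_MOE_FP4=1 \
VLLM_FLASHINFER_MOE_BACKEND=throughput \
  vllm serve <path-to-nvfp4-quant> --tensor-parallel-size 2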
wait... it only produces !!!!!!!!!!!!!!!!!!!!... forever with -tp 2
So it was producing good output in the single-GPU case, but with tensor parallel 2 you're seeing gibberish?
@Firworks yes, exactly. It was good with a single GPU, except I'm not sure if it emitted the thinking tags correctly. With two GPUs I just got !!!!!!!!!!!!!!!!!!!!... I'll see if I can get you more details today. A new DeepSeek just came out too :) If we can get an NVFP4 that fits in 4x96GB that would be a game changer.
I'm game to try and see if I can get it quantized. It should come out to something around 370GB maybe. I think it'll be a tight squeeze on 4x96GB but maybe we'll get lucky. I'll give it a shot tomorrow assuming someone else hasn't beaten me to the NVFP4 by then.
I integrated this NVFP4 model into claude-code and it worked very well.
If you are using WSL Docker with the vllm/vllm-openai:nightly image and still encounter errors, build a derived image from the following Dockerfile:
FROM vllm/vllm-openai:nightly

# Install CUDA Toolkit headers and build tools
RUN apt-get update && apt-get install -y \
    cuda-toolkit-12-8 \
    cuda-nvcc-12-8 \
    cuda-libraries-dev-12-8 \
    libcurand-dev-12-8 \
    libcublas-dev-12-8 \
    ninja-build \
    build-essential

ENV CUDA_HOME=/usr/local/cuda-12.8
ENV CUDA_PATH=/usr/local/cuda-12.8
ENV FLASHINFER_CUDA_HOME=/usr/local/cuda-12.8
ENV CPATH=/usr/local/cuda-12.8/include
ENV C_INCLUDE_PATH=/usr/local/cuda-12.8/include
ENV CPLUS_INCLUDE_PATH=/usr/local/cuda-12.8/include
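Then build it and use it in place of the stock image (the tag name here is arbitrary; the model path is a placeholder and the serve flags are whatever you normally pass):

docker build -t vllm-openai-cuda-headers .
docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  vllm-openai-cuda-headers \
  --model <path-to-nvfp4-quant>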
Additional reminder: if an error like this occurs:

RuntimeError: Error in function 'aligned_alloc' at /workspace/include/flashinfer/allocator.h:49: Buffer overflow when allocating memory for batch_prefill_tmp_v with size 703856640 and alignment 16, but only 413138944 bytes available in AlignedAllocator. Increase the workspace buffer size.

[rank0]:[W1201 21:33:10.932534662 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Please add environment variables:
-e VLLM_USE_FLASHINFER_MOE_FP4=1 -e VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=2147483648
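(2147483648 bytes = 2 GiB, comfortably above the ~704 MB allocation the error asks for.) For example, with the image built above:

docker run --gpus all --ipc=host -p 8000:8000 \
  -e VLLM_USE_FLASHINFER_MOE_FP4=1 \
  -e VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=2147483648 \
  vllm-openai-cuda-headers \
  --model <path-to-nvfp4-quant>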
do you get spam of "WARNING 12-09 13:37:20 [vllm.py:1322] Current vLLM config is not set." on vLLM 0.12.0?