Works with vLLM with VLLM_USE_FLASHINFER_MOE_FP4

#2
by bnjmnmarie - opened

Hello,

Thanks for quantizing this model!
It runs with vLLM, but by default it doesn't load the right kernel for it (not sure why), so I had to set the flag below. I ran it on a single RTX Pro 6000.

export VLLM_USE_FLASHINFER_MOE_FP4=1
vllm serve Firworks/INTELLECT-3-nvfp4 --max-model-len 32000
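
(Equivalently, the variable can be set just for that one invocation with the standard shell env-prefix form; this is the same thing as the export above, nothing vLLM-specific:)

VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve Firworks/INTELLECT-3-nvfp4 --max-model-len 32000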

Trace

INFO 11-28 16:20:10 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=8065) INFO 11-28 16:20:10 [api_server.py:1977] vLLM API server version 0.11.2
(APIServer pid=8065) INFO 11-28 16:20:10 [utils.py:253] non-default args: {'model_tag': 'Firworks/INTELLECT-3-nvfp4', 'model': 'Firworks/INTELLECT-3-nvfp4', 'max_model_len': 32000}
(APIServer pid=8065) INFO 11-28 16:20:11 [model.py:631] Resolved architecture: Glm4MoeForCausalLM
(APIServer pid=8065) INFO 11-28 16:20:11 [model.py:1745] Using max model len 32000
(APIServer pid=8065) INFO 11-28 16:20:11 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:19 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='Firworks/INTELLECT-3-nvfp4', speculative_config=None, tokenizer='Firworks/INTELLECT-3-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Firworks/INTELLECT-3-nvfp4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:20 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.19.0.2:33027 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:20 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [gpu_model_runner.py:3259] Starting to load model Firworks/INTELLECT-3-nvfp4...
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [compressed_tensors_w4a4_nvfp4.py:63] Using flashinfer-cutlass for NVFP4 GEMM
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [cuda.py:418] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [cuda.py:427] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [layer.py:342] Enabled separate cuda stream for MoE shared_experts
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [nvfp4_moe_support.py:38] Using FlashInfer kernels for CompressedTensorsW4A4MoeMethod.
Loading safetensors checkpoint shards:   0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/13 [00:00<00:11,  1.06it/s]
Loading safetensors checkpoint shards:  15% Completed | 2/13 [00:02<00:12,  1.17s/it]
Loading safetensors checkpoint shards:  23% Completed | 3/13 [00:03<00:12,  1.25s/it]
Loading safetensors checkpoint shards:  31% Completed | 4/13 [00:04<00:11,  1.28s/it]
Loading safetensors checkpoint shards:  38% Completed | 5/13 [00:06<00:10,  1.31s/it]
Loading safetensors checkpoint shards:  46% Completed | 6/13 [00:07<00:09,  1.32s/it]
Loading safetensors checkpoint shards:  54% Completed | 7/13 [00:08<00:07,  1.32s/it]
Loading safetensors checkpoint shards:  62% Completed | 8/13 [00:10<00:06,  1.33s/it]
Loading safetensors checkpoint shards:  69% Completed | 9/13 [00:11<00:05,  1.33s/it]
Loading safetensors checkpoint shards:  77% Completed | 10/13 [00:12<00:03,  1.33s/it]
Loading safetensors checkpoint shards:  85% Completed | 11/13 [00:14<00:02,  1.33s/it]
Loading safetensors checkpoint shards:  92% Completed | 12/13 [00:15<00:01,  1.33s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:16<00:00,  1.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:16<00:00,  1.24s/it]
(EngineCore_DP0 pid=8202) 
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:38 [default_loader.py:314] Loading weights took 16.26 seconds
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:39 [gpu_model_runner.py:3338] Model loading took 57.7486 GiB memory and 17.918689 seconds
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:48 [backends.py:631] Using cache directory: /root/.cache/vllm/torch_compile_cache/7e1a614333/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:48 [backends.py:647] Dynamo bytecode transform time: 8.95 s
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:49 [backends.py:251] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:03 [backends.py:282] Compiling a graph for dynamic shape takes 13.33 s
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:05 [monitor.py:34] torch.compile takes 22.29 s in total
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:07 [gpu_worker.py:359] Available KV cache memory: 21.97 GiB
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:07 [kv_cache_utils.py:1229] GPU KV cache size: 125,184 tokens
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:07 [kv_cache_utils.py:1234] Maximum concurrency for 32,000 tokens per request: 3.91x
(EngineCore_DP0 pid=8202) 2025-11-28 16:21:07,779 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=8202) 2025-11-28 16:21:15,708 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 51/51 [00:16<00:00,  3.16it/s]
Capturing CUDA graphs (decode, FULL): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 51/51 [01:00<00:00,  1.18s/it]
(EngineCore_DP0 pid=8202) INFO 11-28 16:22:33 [gpu_model_runner.py:4244] Graph capturing finished in 77 secs, took -1.65 GiB
(EngineCore_DP0 pid=8202) INFO 11-28 16:22:33 [core.py:250] init engine (profile, create kv cache, warmup model) took 113.48 seconds
(APIServer pid=8065) INFO 11-28 16:22:35 [api_server.py:1725] Supported tasks: ['generate']
(APIServer pid=8065) WARNING 11-28 16:22:36 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=8065) INFO 11-28 16:22:36 [serving_responses.py:154] Using default chat sampling params from model: {'temperature': 0.6}
(APIServer pid=8065) INFO 11-28 16:22:36 [serving_chat.py:131] Using default chat sampling params from model: {'temperature': 0.6}
(APIServer pid=8065) INFO 11-28 16:22:36 [serving_completion.py:73] Using default completion sampling params from model: {'temperature': 0.6}
(APIServer pid=8065) INFO 11-28 16:22:36 [serving_chat.py:131] Using default chat sampling params from model: {'temperature': 0.6}
(APIServer pid=8065) INFO 11-28 16:22:36 [api_server.py:2052] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:38] Available routes are:
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=8065) INFO:     Started server process [8065]
(APIServer pid=8065) INFO:     Waiting for application startup.
(APIServer pid=8065) INFO:     Application startup complete
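
Once the server is up, a quick sanity check against the OpenAI-compatible chat endpoint works as expected. The route, port, and served model name below are taken from the log above; the prompt and max_tokens are just placeholders:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Firworks/INTELLECT-3-nvfp4",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'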

Nice! Thanks for figuring that out. I confirmed that it runs with that flag and also updated the model card.

I'm still puzzled as to why it's not needed for GLM-4.5 Air Base, but I'm glad it's working now. I ran a few test prompts and got some great responses.
