Works with vLLM with VLLM_USE_FLASHINFER_MOE_FP4

#2
by bnjmnmarie - opened

Hello,

Thanks for quantizing this model!
It runs with vLLM, but by default it doesn't load the right kernel for it (not sure why), so I had to set the flag below. I ran it on a single RTX Pro 6000.

export VLLM_USE_FLASHINFER_MOE_FP4=1
vllm serve Firworks/INTELLECT-3-nvfp4 --max-model-len 32000
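
(Equivalently, the variable can be set just for that one invocation with the standard shell env-prefix form; this is the same thing as the export above, nothing vLLM-specific:)

VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve Firworks/INTELLECT-3-nvfp4 --max-model-len 32000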

Trace

INFO 11-28 16:20:10 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=8065) INFO 11-28 16:20:10 [api_server.py:1977] vLLM API server version 0.11.2
(APIServer pid=8065) INFO 11-28 16:20:10 [utils.py:253] non-default args: {'model_tag': 'Firworks/INTELLECT-3-nvfp4', 'model': 'Firworks/INTELLECT-3-nvfp4', 'max_model_len': 32000}
(APIServer pid=8065) INFO 11-28 16:20:11 [model.py:631] Resolved architecture: Glm4MoeForCausalLM
(APIServer pid=8065) INFO 11-28 16:20:11 [model.py:1745] Using max model len 32000
(APIServer pid=8065) INFO 11-28 16:20:11 [scheduler.py:216] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:19 [core.py:93] Initializing a V1 LLM engine (v0.11.2) with config: model='Firworks/INTELLECT-3-nvfp4', speculative_config=None, tokenizer='Firworks/INTELLECT-3-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Firworks/INTELLECT-3-nvfp4, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:20 [parallel_state.py:1208] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.19.0.2:33027 backend=nccl
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:20 [parallel_state.py:1394] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [gpu_model_runner.py:3259] Starting to load model Firworks/INTELLECT-3-nvfp4...
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [compressed_tensors_w4a4_nvfp4.py:63] Using flashinfer-cutlass for NVFP4 GEMM
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [cuda.py:418] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [cuda.py:427] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [layer.py:342] Enabled separate cuda stream for MoE shared_experts
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:21 [nvfp4_moe_support.py:38] Using FlashInfer kernels for CompressedTensorsW4A4MoeMethod.
Loading safetensors checkpoint shards:   0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/13 [00:00<00:11,  1.06it/s]
Loading safetensors checkpoint shards:  15% Completed | 2/13 [00:02<00:12,  1.17s/it]
Loading safetensors checkpoint shards:  23% Completed | 3/13 [00:03<00:12,  1.25s/it]
Loading safetensors checkpoint shards:  31% Completed | 4/13 [00:04<00:11,  1.28s/it]
Loading safetensors checkpoint shards:  38% Completed | 5/13 [00:06<00:10,  1.31s/it]
Loading safetensors checkpoint shards:  46% Completed | 6/13 [00:07<00:09,  1.32s/it]
Loading safetensors checkpoint shards:  54% Completed | 7/13 [00:08<00:07,  1.32s/it]
Loading safetensors checkpoint shards:  62% Completed | 8/13 [00:10<00:06,  1.33s/it]
Loading safetensors checkpoint shards:  69% Completed | 9/13 [00:11<00:05,  1.33s/it]
Loading safetensors checkpoint shards:  77% Completed | 10/13 [00:12<00:03,  1.33s/it]
Loading safetensors checkpoint shards:  85% Completed | 11/13 [00:14<00:02,  1.33s/it]
Loading safetensors checkpoint shards:  92% Completed | 12/13 [00:15<00:01,  1.33s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:16<00:00,  1.09s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:16<00:00,  1.24s/it]
(EngineCore_DP0 pid=8202) 
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:38 [default_loader.py:314] Loading weights took 16.26 seconds
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:39 [gpu_model_runner.py:3338] Model loading took 57.7486 GiB memory and 17.918689 seconds
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:48 [backends.py:631] Using cache directory: /root/.cache/vllm/torch_compile_cache/7e1a614333/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:48 [backends.py:647] Dynamo bytecode transform time: 8.95 s
(EngineCore_DP0 pid=8202) INFO 11-28 16:20:49 [backends.py:251] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:03 [backends.py:282] Compiling a graph for dynamic shape takes 13.33 s
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:05 [monitor.py:34] torch.compile takes 22.29 s in total
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:07 [gpu_worker.py:359] Available KV cache memory: 21.97 GiB
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:07 [kv_cache_utils.py:1229] GPU KV cache size: 125,184 tokens
(EngineCore_DP0 pid=8202) INFO 11-28 16:21:07 [kv_cache_utils.py:1234] Maximum concurrency for 32,000 tokens per request: 3.91x
(EngineCore_DP0 pid=8202) 2025-11-28 16:21:07,779 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=8202) 2025-11-28 16:21:15,708 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 51/51 [00:16<00:00,  3.16it/s]
Capturing CUDA graphs (decode, FULL): 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 51/51 [01:00<00:00,  1.18s/it]
(EngineCore_DP0 pid=8202) INFO 11-28 16:22:33 [gpu_model_runner.py:4244] Graph capturing finished in 77 secs, took -1.65 GiB
(EngineCore_DP0 pid=8202) INFO 11-28 16:22:33 [core.py:250] init engine (profile, create kv cache, warmup model) took 113.48 seconds
(APIServer pid=8065) INFO 11-28 16:22:35 [api_server.py:1725] Supported tasks: ['generate']
(APIServer pid=8065) WARNING 11-28 16:22:36 [model.py:1568] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=8065) INFO 11-28 16:22:36 [serving_responses.py:154] Using default chat sampling params from model: {'temperature': 0.6}
(APIServer pid=8065) INFO 11-28 16:22:36 [serving_chat.py:131] Using default chat sampling params from model: {'temperature': 0.6}
(APIServer pid=8065) INFO 11-28 16:22:36 [serving_completion.py:73] Using default completion sampling params from model: {'temperature': 0.6}
(APIServer pid=8065) INFO 11-28 16:22:36 [serving_chat.py:131] Using default chat sampling params from model: {'temperature': 0.6}
(APIServer pid=8065) INFO 11-28 16:22:36 [api_server.py:2052] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:38] Available routes are:
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=8065) INFO 11-28 16:22:36 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=8065) INFO:     Started server process [8065]
(APIServer pid=8065) INFO:     Waiting for application startup.
(APIServer pid=8065) INFO:     Application startup complete
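
Once the server is up, a quick sanity check against the OpenAI-compatible chat endpoint works as expected. The route, port, and served model name below are taken from the log above; the prompt and max_tokens are just placeholders:

curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Firworks/INTELLECT-3-nvfp4",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'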

Nice! Thanks for figuring that out. I confirmed that it runs with that flag and also updated the model card.

I'm still puzzled as to why it's not needed for GLM-4.5 Air Base, but I'm glad it's working now. I ran a few test prompts and got some great responses.
