---
# Qwen3-Coder-30B-A3B-Instruct-nvfp4
**Note:** This model (NVFP4 quantization) was tested on an NVIDIA RTX PRO 6000 (Blackwell, sm_120, CUDA 12.9, driver 575.64.03) with vLLM 0.11.0 and with NVIDIA's NGC vLLM container 25.09-py3 (vLLM v0.10.1). It currently fails to run due to NVFP4 MoE kernel-initialization issues: "no kernel image is available" in `ops.shuffle_rows` (vLLM 0.11.0) and "Failed to initialize GEMM" in `cutlass_fp4_moe_mm` (vLLM 0.10.1). See vLLM issues [#20522](https://github.com/vllm-project/vllm/issues/20522), [#23826](https://github.com/vllm-project/vllm/issues/23826), and [#18153](https://github.com/vllm-project/vllm/issues/18153) for details. A source build with `TORCH_CUDA_ARCH_LIST="12.0"` or a future vLLM release (e.g., v0.12.0) may resolve this.
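
For reference, a minimal offline-inference sketch of the invocation that hits the failure above. The quantization config is auto-detected from the checkpoint, so no extra flags are needed; the local repo path, context length, and prompt are placeholders, not values from this card.

```python
from vllm import LLM, SamplingParams

# Minimal repro sketch: on Blackwell (sm_120) this currently fails during
# NVFP4 MoE kernel initialization, as described in the note above.
# The repo path below is a placeholder for this card's checkpoint.
llm = LLM(
    model="Qwen3-Coder-30B-A3B-Instruct-nvfp4",
    max_model_len=4096,  # arbitrary small context, just for the repro
)

outputs = llm.generate(
    ["# Write a function that reverses a linked list\n"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```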
**Format:** NVFP4 — weights & activations quantized to FP4 with dual scaling (see the first sketch below).
**Base model:** `Qwen/Qwen3-Coder-30B-A3B-Instruct`
**How it was made:** One-shot calibration with LLM Compressor (NVFP4 recipe), using long-sequence calibration data from `nvidia/OpenCodeInstruct`.
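
To make "dual scaling" concrete, here is a toy NumPy sketch of the scheme as commonly described for NVFP4: FP4 (E2M1) values in 16-element blocks, one FP8 (E4M3) scale per block, and one FP32 scale per tensor. This illustrates the numerics only, not the actual kernels; the constants come from the format specs (6.0 is the FP4 max, 448.0 the E4M3 max), and rounding the block scale itself to E4M3 is omitted.

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes.
FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize-dequantize a 1-D tensor with NVFP4-style dual scaling (toy)."""
    # Per-tensor FP32 scale, chosen so per-block scales fit the E4M3 range (max 448).
    tensor_scale = np.abs(x).max() / (6.0 * 448.0)
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        chunk = x[i : i + block]
        # Per-block scale: maps the block max onto the FP4 max (6.0).
        # A real implementation would also round this scale to E4M3; omitted here.
        block_scale = max(np.abs(chunk).max() / (6.0 * tensor_scale), 1e-12)
        scaled = chunk / (block_scale * tensor_scale)
        # Round each element to the nearest representable FP4 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_MAGNITUDES).argmin(axis=1)
        out[i : i + block] = np.sign(scaled) * FP4_MAGNITUDES[idx] * block_scale * tensor_scale
    return out

w = np.random.randn(64).astype(np.float32)
print(np.abs(w - fake_nvfp4(w)).max())  # small but nonzero quantization error
```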
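
The card does not include the calibration script, so the following is only a sketch of what a one-shot NVFP4 run with recent LLM Compressor versions typically looks like. The sample count, sequence length, dataset column names, and ignored-module patterns are all assumptions, not values taken from this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
NUM_SAMPLES = 512   # assumed; the card does not state the sample count
MAX_SEQ_LEN = 8192  # assumed "long-seq" length; not stated in the card

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data from nvidia/OpenCodeInstruct; the column names are assumptions.
ds = load_dataset("nvidia/OpenCodeInstruct", split=f"train[:{NUM_SAMPLES}]")

def to_text(sample):
    return {"text": tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["input"]},
         {"role": "assistant", "content": sample["output"]}],
        tokenize=False)}

def tokenize(sample):
    return tokenizer(sample["text"], max_length=MAX_SEQ_LEN,
                     truncation=True, add_special_tokens=False)

ds = ds.map(to_text)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# NVFP4 recipe: FP4 weights/activations for Linear layers; lm_head (and,
# assumed here, the MoE router gates) stay in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*mlp.gate$"],  # gate pattern is an assumption
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

SAVE_DIR = "Qwen3-Coder-30B-A3B-Instruct-nvfp4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```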