Firworks committed on
Commit
291ea5d
·
verified ·
1 Parent(s): 0644434

Update README.md

Files changed (1)
  1. README.md +2 -0
README.md CHANGED

```diff
@@ -7,6 +7,8 @@ base_model:
 ---
 # Qwen3-Coder-30B-A3B-Instruct-nvfp4
 
+**Note**: This model (NVFP4 quantization) was tested on an NVIDIA RTX PRO 6000 (Blackwell, sm_120, CUDA 12.9, Driver 575.64.03) using vLLM 0.11.0 and NVIDIA's NGC vLLM container 25.09-py3 (vLLM 0.10.1). It fails to run due to issues with NVFP4 MoE kernel initialization, specifically "no kernel image is available" in `ops.shuffle_rows` (vLLM 0.11.0) and "Failed to initialize GEMM" in `cutlass_fp4_moe_mm` (vLLM 0.10.1). See related vLLM GitHub issues [#20522](https://github.com/vllm-project/vllm/issues/20522), [#23826](https://github.com/vllm-project/vllm/issues/23826), and [#18153](https://github.com/vllm-project/vllm/issues/18153) for details. A source build with `TORCH_CUDA_ARCH_LIST="12.0"` or a future vLLM release (e.g., v0.12.0) may resolve this.
+
 **Format:** NVFP4 — weights & activations quantized to FP4 with dual scaling.
 **Base model:** `Qwen/Qwen3-Coder-30B-A3B-Instruct`
 **How it was made:** One-shot calibration with LLM Compressor (NVFP4 recipe), long-seq calibration with nvidia/OpenCodeInstruct.
```
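
The source-build workaround mentioned in the note could look roughly like the sketch below. This is an assumption, not a verified recipe: the repository URL and editable-install step follow vLLM's general from-source instructions, and the exact flags may differ between releases.

```shell
# Hypothetical sketch of building vLLM from source with Blackwell kernels.
# sm_120 corresponds to compute capability 12.0, so restrict the arch list
# to that target before compiling the CUDA extensions.
git clone https://github.com/vllm-project/vllm.git
cd vllm
export TORCH_CUDA_ARCH_LIST="12.0"
pip install -e . --no-build-isolation
```

If the failure persists after a source build, the linked issues suggest it is a kernel-availability problem rather than a model problem, so checking them for a fixed release is the next step.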