---
# Qwen3-Coder-30B-A3B-Instruct-nvfp4
**Note:** This model (NVFP4 quantization) was tested on an NVIDIA RTX PRO 6000 (Blackwell, sm_120, CUDA 12.9, driver 575.64.03) with vLLM 0.11.0 and with NVIDIA's NGC vLLM container 25.09-py3 (vLLM v0.10.1). It currently fails to run due to NVFP4 MoE kernel-initialization issues: "no kernel image is available" in `ops.shuffle_rows` (vLLM 0.11.0) and "Failed to initialize GEMM" in `cutlass_fp4_moe_mm` (vLLM 0.10.1). See vLLM issues [#20522](https://github.com/vllm-project/vllm/issues/20522), [#23826](https://github.com/vllm-project/vllm/issues/23826), and [#18153](https://github.com/vllm-project/vllm/issues/18153) for details. A source build with `TORCH_CUDA_ARCH_LIST="12.0"` or a future vLLM release (e.g., v0.12.0) may resolve this.
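
For reference, a minimal offline-inference sketch of the invocation that hits the failure above. The quantization config is auto-detected from the checkpoint, so no extra flags are needed; the local repo path, context length, and prompt are placeholders, not values from this card.

```python
from vllm import LLM, SamplingParams

# Minimal repro sketch: on Blackwell (sm_120) this currently fails during
# NVFP4 MoE kernel initialization, as described in the note above.
# The repo path below is a placeholder for this card's checkpoint.
llm = LLM(
    model="Qwen3-Coder-30B-A3B-Instruct-nvfp4",
    max_model_len=4096,  # arbitrary small context, just for the repro
)

outputs = llm.generate(
    ["# Write a function that reverses a linked list\n"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```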
**Format:** NVFP4 — weights & activations quantized to FP4 with dual scaling (see the first sketch below).
**Base model:** `Qwen/Qwen3-Coder-30B-A3B-Instruct`
**How it was made:** One-shot calibration with LLM Compressor (NVFP4 recipe), using long-sequence calibration data from `nvidia/OpenCodeInstruct`.
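
To make "dual scaling" concrete, here is a toy NumPy sketch of the scheme as commonly described for NVFP4: FP4 (E2M1) values in 16-element blocks, one FP8 (E4M3) scale per block, and one FP32 scale per tensor. This illustrates the numerics only, not the actual kernels; the constants come from the format specs (6.0 is the FP4 max, 448.0 the E4M3 max), and rounding the block scale itself to E4M3 is omitted.

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes.
FP4_MAGNITUDES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_nvfp4(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Quantize-dequantize a 1-D tensor with NVFP4-style dual scaling (toy)."""
    # Per-tensor FP32 scale, chosen so per-block scales fit the E4M3 range (max 448).
    tensor_scale = np.abs(x).max() / (6.0 * 448.0)
    out = np.empty_like(x)
    for i in range(0, x.size, block):
        chunk = x[i : i + block]
        # Per-block scale: maps the block max onto the FP4 max (6.0).
        # A real implementation would also round this scale to E4M3; omitted here.
        block_scale = max(np.abs(chunk).max() / (6.0 * tensor_scale), 1e-12)
        scaled = chunk / (block_scale * tensor_scale)
        # Round each element to the nearest representable FP4 magnitude.
        idx = np.abs(np.abs(scaled)[:, None] - FP4_MAGNITUDES).argmin(axis=1)
        out[i : i + block] = np.sign(scaled) * FP4_MAGNITUDES[idx] * block_scale * tensor_scale
    return out

w = np.random.randn(64).astype(np.float32)
print(np.abs(w - fake_nvfp4(w)).max())  # small but nonzero quantization error
```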
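
The card does not include the calibration script, so the following is only a sketch of what a one-shot NVFP4 run with recent LLM Compressor versions typically looks like. The sample count, sequence length, dataset column names, and ignored-module patterns are all assumptions, not values taken from this model.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Coder-30B-A3B-Instruct"
NUM_SAMPLES = 512   # assumed; the card does not state the sample count
MAX_SEQ_LEN = 8192  # assumed "long-seq" length; not stated in the card

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data from nvidia/OpenCodeInstruct; the column names are assumptions.
ds = load_dataset("nvidia/OpenCodeInstruct", split=f"train[:{NUM_SAMPLES}]")

def to_text(sample):
    return {"text": tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["input"]},
         {"role": "assistant", "content": sample["output"]}],
        tokenize=False)}

def tokenize(sample):
    return tokenizer(sample["text"], max_length=MAX_SEQ_LEN,
                     truncation=True, add_special_tokens=False)

ds = ds.map(to_text)
ds = ds.map(tokenize, remove_columns=ds.column_names)

# NVFP4 recipe: FP4 weights/activations for Linear layers; lm_head (and,
# assumed here, the MoE router gates) stay in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*mlp.gate$"],  # gate pattern is an assumption
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

SAVE_DIR = "Qwen3-Coder-30B-A3B-Instruct-nvfp4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```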