---
license: apache-2.0
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
tags:
- qwen3
- moe
- nvfp4
- quantized
- nvidia-modelopt
- coding
- dgx-spark
model_type: qwen3_moe
quantized_by: kleinpanic93
pipeline_tag: text-generation
library_name: transformers
---
# Qwen3-Coder-30B-A3B-Instruct-NVFP4
NVFP4 (4-bit floating point) quantization of [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct), optimized for NVIDIA Blackwell GPUs.
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | Qwen/Qwen3-Coder-30B-A3B-Instruct |
| **Architecture** | Qwen3MoeForCausalLM (Mixture-of-Experts) |
| **Total Parameters** | 30B (3B active per token) |
| **Experts** | 128 per layer |
| **Quantization** | NVFP4 (4-bit NV floating point) |
| **KV Cache** | FP8 (8-bit float) |
| **Original Precision** | BF16 |
| **Quantized Size** | ~57 GB |
| **Quantization Tool** | NVIDIA ModelOpt 0.41.0 |
| **Calibration** | 512 samples (synthetic) |
| **Hardware** | NVIDIA DGX Spark GB10 (Blackwell) |
## Quantization Details
- **Method:** Post-training quantization via `nvidia-modelopt` with `NVFP4_DEFAULT_CFG`
- **Weights:** 4-bit NV floating point, group size 16
- **Activations:** 4-bit NV floating point, group size 16
- **KV Cache:** FP8 quantized for reduced memory during inference
- **Excluded layers:** `lm_head` and all MoE router/gate layers (48 total) — these remain in original precision to preserve routing quality
- **Export method:** HuggingFace `save_pretrained` with manual `quantization_config` injection (ModelOpt 0.41.0 native export does not yet support `Qwen3MoeExperts`)
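To make the group-size-16 scheme above concrete, here is an illustrative pure-Python sketch of NVFP4-style fake quantization: each group of 16 values shares one scale, and every value snaps to the nearest FP4 (E2M1) representable magnitude. This is a simplified model for intuition only (the real format also quantizes the per-group scale to FP8, which is omitted here), not ModelOpt's actual kernels.

```python
# Representable FP4 (E2M1) magnitudes; actual values carry a sign bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_group(values, group_size=16):
    """Fake-quantize one group: pick a scale so the group's max |v| maps
    to 6.0 (the largest FP4 magnitude), then snap each value to the grid.
    Simplification: the scale itself is kept in full precision, whereas
    real NVFP4 stores it as FP8."""
    assert len(values) <= group_size
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    dequantized = []
    for v in values:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        dequantized.append(mag * scale * (1.0 if v >= 0 else -1.0))
    return dequantized, scale

# Example: the group maximum (3.0) is reproduced exactly; smaller
# values land on the nearest grid point times the shared scale.
deq, scale = quantize_group([3.0, 1.0, -0.7, 0.26])
print(deq, scale)  # [3.0, 1.0, -0.75, 0.25] 0.5
```

Note how coarse the grid is between 4.0 and 6.0: values near a group's maximum quantize with larger absolute error, which is why calibration data matters for picking good scales.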
## Usage
### With vLLM (Recommended)
```bash
vllm serve kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4 \
    --quantization modelopt \
    --trust-remote-code \
    --max-model-len 32768
```
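Once the server is up, vLLM exposes an OpenAI-compatible API. A minimal chat-completion request payload looks like the following sketch (assumptions: vLLM's default port 8000 and the served model id matching the repo id above):

```python
import json

# OpenAI-compatible chat request body; the model field must match
# the id passed to `vllm serve`.
payload = {
    "model": "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
    "temperature": 0.2,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or the openai client).
```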
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format a chat prompt with the model's template and generate.
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Hardware Requirements
- **Minimum VRAM:** ~57 GB (unified memory or dedicated)
- **Tested on:** NVIDIA DGX Spark (GB10, 128 GB unified memory)
- **Recommended:** NVIDIA Blackwell GPUs (GB10, GB200, B200)
## Provenance
```json
{
"source_model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
"quantization": "NVFP4",
"tool": "nvidia-modelopt 0.41.0",
"export_method": "save_pretrained_manual",
"calib_size": 512,
"calib_dataset": "synthetic-random",
"hardware": "NVIDIA GB10 (Blackwell)",
"elapsed_sec": 472
}
```
## Limitations
- This quantization uses **synthetic calibration data** (random tokens) because the container runs in offline mode. Production-grade quantization with real calibration data (e.g., C4, RedPajama) may yield slightly better quality.
- The export uses `save_pretrained` fallback rather than ModelOpt's native HF checkpoint exporter, since `Qwen3MoeExperts` is not yet in ModelOpt 0.41.0's export allowlist. The quantization math is identical — only the serialization path differs.
- MoE gate/router layers are preserved in original precision by design.
## License
This model inherits the [Apache 2.0 license](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/LICENSE) from the base Qwen3-Coder-30B-A3B-Instruct model.
## Acknowledgments
- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [NVIDIA](https://github.com/NVIDIA/TensorRT-Model-Optimizer) for ModelOpt quantization toolkit