---
license: apache-2.0
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
tags:
- qwen3
- moe
- nvfp4
- quantized
- nvidia-modelopt
- coding
- dgx-spark
model_type: qwen3_moe
quantized_by: kleinpanic93
pipeline_tag: text-generation
library_name: transformers
---

# Qwen3-Coder-30B-A3B-Instruct-NVFP4

NVFP4 (4-bit floating point) quantization of [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct), optimized for NVIDIA Blackwell GPUs.

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | Qwen/Qwen3-Coder-30B-A3B-Instruct |
| **Architecture** | Qwen3MoeForCausalLM (Mixture-of-Experts) |
| **Total Parameters** | 30B (3B active per token) |
| **Experts** | 128 per layer |
| **Quantization** | NVFP4 (4-bit NV floating point) |
| **KV Cache** | FP8 (8-bit float) |
| **Original Precision** | BF16 |
| **Quantized Size** | ~57 GB |
| **Quantization Tool** | NVIDIA ModelOpt 0.41.0 |
| **Calibration** | 512 samples (synthetic) |
| **Hardware** | NVIDIA DGX Spark GB10 (Blackwell) |

## Quantization Details

- **Method:** Post-training quantization via `nvidia-modelopt` with `NVFP4_DEFAULT_CFG`
- **Weights:** 4-bit NV floating point, group size 16
- **Activations:** 4-bit NV floating point, group size 16
- **KV Cache:** FP8 quantized for reduced memory during inference
- **Excluded layers:** `lm_head` and all MoE router/gate layers (48 total); these remain in original precision to preserve routing quality
- **Export method:** HuggingFace `save_pretrained` with manual `quantization_config` injection (ModelOpt 0.41.0 native export does not yet support `Qwen3MoeExperts`)

## Usage

### With vLLM (Recommended)

```bash
vllm serve kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4 \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 32768
```

### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4"
)
```

## Hardware Requirements

- **Minimum VRAM:** ~57 GB (unified memory or dedicated)
- **Tested on:** NVIDIA DGX Spark (GB10, 128 GB unified memory)
- **Recommended:** NVIDIA Blackwell GPUs (GB10, GB200, B200)

## Provenance

```json
{
  "source_model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
  "quantization": "NVFP4",
  "tool": "nvidia-modelopt 0.41.0",
  "export_method": "save_pretrained_manual",
  "calib_size": 512,
  "calib_dataset": "synthetic-random",
  "hardware": "NVIDIA GB10 (Blackwell)",
  "elapsed_sec": 472
}
```

## Limitations

- This quantization uses **synthetic calibration data** (random tokens) because the container runs in offline mode. Production-grade quantization with real calibration data (e.g., C4, RedPajama) may yield slightly better quality.
- The export uses the `save_pretrained` fallback rather than ModelOpt's native HF checkpoint exporter, since `Qwen3MoeExperts` is not yet in ModelOpt 0.41.0's export allowlist. The quantization math is identical; only the serialization path differs.
- MoE gate/router layers are preserved in original precision by design.

## License

This model inherits the [Apache 2.0 license](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/LICENSE) from the base Qwen3-Coder-30B-A3B-Instruct model.

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [NVIDIA](https://github.com/NVIDIA/TensorRT-Model-Optimizer) for the ModelOpt quantization toolkit
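## Appendix: NVFP4 Block Quantization, Illustrated

The "group size 16" in the quantization details above means each run of 16 consecutive weights shares a single scale factor, and each weight is stored as one of the 15 representable FP4 (E2M1) values. The toy sketch below shows that rounding behavior in plain Python. It is illustrative only: the scale convention (mapping the group's max magnitude to 6.0, the largest FP4 value) is a simplifying assumption, and ModelOpt's actual NVFP4 implementation stores FP8 (E4M3) block scales and packs two 4-bit codes per byte, details this sketch deliberately omits.

```python
# Toy sketch of NVFP4-style block quantization (illustrative only).
# Real NVFP4 uses FP8 (E4M3) block scales and packed 4-bit storage;
# here we only show the per-group scale-and-round behavior.

# The 15 distinct representable FP4 (E2M1) values: 0 and +/-{0.5, 1, 1.5, 2, 3, 4, 6}.
FP4_GRID = sorted({s * m for s in (-1.0, 1.0)
                   for m in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)})

GROUP_SIZE = 16  # NVFP4 shares one scale per 16 consecutive values


def quantize_group(values):
    """Fake-quantize one group: pick a scale so the max magnitude maps to
    6.0 (the largest FP4 value), then round each scaled value to the grid."""
    assert len(values) == GROUP_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = [min(FP4_GRID, key=lambda g: abs(v / scale - g)) for v in values]
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate values from the shared scale and FP4 codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    group = [0.01 * i for i in range(GROUP_SIZE)]  # 0.00 .. 0.15
    scale, codes = quantize_group(group)
    recon = dequantize_group(scale, codes)
    for orig, rec in zip(group, recon):
        print(f"{orig:+.3f} -> {rec:+.5f}")
```

Running the sketch makes the coarseness of 4-bit storage visible: within each group only 15 distinct reconstruction levels exist, which is why the per-group scale (and excluding sensitive layers like the MoE routers, as this checkpoint does) matters so much for quality.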