| | --- |
| | license: apache-2.0 |
| | base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct |
| | tags: |
| | - qwen3 |
| | - moe |
| | - nvfp4 |
| | - quantized |
| | - nvidia-modelopt |
| | - coding |
| | - dgx-spark |
| | model_type: qwen3_moe |
| | quantized_by: kleinpanic93 |
| | pipeline_tag: text-generation |
| | library_name: transformers |
| | --- |
| | |
| | # Qwen3-Coder-30B-A3B-Instruct-NVFP4 |
| |
|
| | NVFP4 (4-bit floating point) quantization of [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct), optimized for NVIDIA Blackwell GPUs. |
| |
|
| | ## Model Details |
| |
|
| | | Property | Value | |
| | |----------|-------| |
| | | **Base Model** | Qwen/Qwen3-Coder-30B-A3B-Instruct | |
| | | **Architecture** | Qwen3MoeForCausalLM (Mixture-of-Experts) | |
| | | **Total Parameters** | 30B (3B active per token) | |
| | | **Experts** | 128 per layer | |
| | | **Quantization** | NVFP4 (4-bit NV floating point) | |
| | | **KV Cache** | FP8 (8-bit float) | |
| | | **Original Precision** | BF16 | |
| | | **Quantized Size** | ~57 GB | |
| | | **Quantization Tool** | NVIDIA ModelOpt 0.41.0 | |
| | | **Calibration** | 512 samples (synthetic) | |
| | | **Hardware** | NVIDIA DGX Spark GB10 (Blackwell) | |
| |
|
| | ## Quantization Details |
| |
|
| | - **Method:** Post-training quantization via `nvidia-modelopt` with `NVFP4_DEFAULT_CFG` |
| | - **Weights:** 4-bit NV floating point, group size 16 |
| | - **Activations:** 4-bit NV floating point, group size 16 |
| | - **KV Cache:** FP8 quantized for reduced memory during inference |
| | - **Excluded layers:** `lm_head` and all MoE router/gate layers (48 total) — these remain in original precision to preserve routing quality |
| | - **Export method:** HuggingFace `save_pretrained` with manual `quantization_config` injection (ModelOpt 0.41.0 native export does not yet support `Qwen3MoeExperts`) |
| |
|
| | ## Usage |
| |
|
| | ### With vLLM (Recommended) |
| |
|
| | ```bash |
| | vllm serve kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4 \ |
| | --quantization modelopt \ |
| | --trust-remote-code \ |
| | --max-model-len 32768 |
| | ``` |
| |
|
| | ### With Transformers |
| |
|
| | ```python |
| | from transformers import AutoModelForCausalLM, AutoTokenizer |
| | |
| | model = AutoModelForCausalLM.from_pretrained( |
| | "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4", |
| | device_map="auto", |
| | trust_remote_code=True, |
| | ) |
| | tokenizer = AutoTokenizer.from_pretrained( |
| | "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4" |
| | ) |
| | ``` |
| |
|
| | ## Hardware Requirements |
| |
|
| | - **Minimum VRAM:** ~57 GB (unified memory or dedicated) |
| | - **Tested on:** NVIDIA DGX Spark (GB10, 128 GB unified memory) |
| | - **Recommended:** NVIDIA Blackwell GPUs (GB10, GB200, B200) |
| |
|
| | ## Provenance |
| |
|
| | ```json |
| | { |
| | "source_model": "Qwen/Qwen3-Coder-30B-A3B-Instruct", |
| | "quantization": "NVFP4", |
| | "tool": "nvidia-modelopt 0.41.0", |
| | "export_method": "save_pretrained_manual", |
| | "calib_size": 512, |
| | "calib_dataset": "synthetic-random", |
| | "hardware": "NVIDIA GB10 (Blackwell)", |
| | "elapsed_sec": 472 |
| | } |
| | ``` |
| |
|
| | ## Limitations |
| |
|
| | - This quantization uses **synthetic calibration data** (random tokens) because the container runs in offline mode. Production-grade quantization with real calibration data (e.g., C4, RedPajama) may yield slightly better quality. |
| | - The export uses `save_pretrained` fallback rather than ModelOpt's native HF checkpoint exporter, since `Qwen3MoeExperts` is not yet in ModelOpt 0.41.0's export allowlist. The quantization math is identical — only the serialization path differs. |
| | - MoE gate/router layers are preserved in original precision by design. |
| |
|
| | ## License |
| |
|
| | This model inherits the [Apache 2.0 license](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/LICENSE) from the base Qwen3-Coder-30B-A3B-Instruct model. |
| |
|
| | ## Acknowledgments |
| |
|
| | - [Qwen Team](https://huggingface.co/Qwen) for the base model |
| | - [NVIDIA](https://github.com/NVIDIA/TensorRT-Model-Optimizer) for ModelOpt quantization toolkit |
| |
|