---
license: apache-2.0
base_model: Qwen/Qwen3-Coder-30B-A3B-Instruct
tags:
- qwen3
- moe
- nvfp4
- quantized
- nvidia-modelopt
- coding
- dgx-spark
model_type: qwen3_moe
quantized_by: kleinpanic93
pipeline_tag: text-generation
library_name: transformers
---
# Qwen3-Coder-30B-A3B-Instruct-NVFP4
NVFP4 (4-bit floating point) quantization of [Qwen/Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct), optimized for NVIDIA Blackwell GPUs.
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | Qwen/Qwen3-Coder-30B-A3B-Instruct |
| **Architecture** | Qwen3MoeForCausalLM (Mixture-of-Experts) |
| **Total Parameters** | 30B (3B active per token) |
| **Experts** | 128 per layer |
| **Quantization** | NVFP4 (4-bit NV floating point) |
| **KV Cache** | FP8 (8-bit float) |
| **Original Precision** | BF16 |
| **Quantized Size** | ~57 GB |
| **Quantization Tool** | NVIDIA ModelOpt 0.41.0 |
| **Calibration** | 512 samples (synthetic) |
| **Hardware** | NVIDIA DGX Spark GB10 (Blackwell) |
## Quantization Details
- **Method:** Post-training quantization via `nvidia-modelopt` with `NVFP4_DEFAULT_CFG`
- **Weights:** 4-bit NVIDIA floating point (FP4 E2M1), group size 16
- **Activations:** 4-bit NVIDIA floating point (FP4 E2M1), group size 16
- **KV Cache:** FP8 quantized for reduced memory during inference
- **Excluded layers:** `lm_head` and all MoE router/gate layers (48 total) — these remain in original precision to preserve routing quality
- **Export method:** HuggingFace `save_pretrained` with manual `quantization_config` injection (ModelOpt 0.41.0 native export does not yet support `Qwen3MoeExperts`)
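
To make the scheme above concrete, here is a minimal, self-contained sketch of NVFP4-style block quantization: values are split into groups of 16, each group shares one scale, and each value is rounded to the nearest FP4 (E2M1) code point. This is an illustration of the rounding math only, not the ModelOpt implementation (which also quantizes the per-block scales to FP8).

```python
# Positive magnitudes representable in FP4 E2M1 (plus sign bit).
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one group of 16 floats: shared scale + nearest FP4 code."""
    # Scale so the largest magnitude maps to 6.0, the FP4 maximum.
    scale = max(abs(x) for x in block) / 6.0 or 1.0
    codes = []
    for x in block:
        mag = min(FP4_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(mag if x >= 0 else -mag)
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate values: code * shared block scale."""
    return [c * scale for c in codes]
```

The worst-case rounding error is half the largest gap between adjacent FP4 code points (the gap between 4.0 and 6.0), multiplied by the block scale, which is why per-group scaling with small groups matters.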
## Usage
### With vLLM (Recommended)
```bash
vllm serve kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4 \
  --quantization modelopt \
  --trust-remote-code \
  --max-model-len 32768
```
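
Once the server is up, vLLM exposes an OpenAI-compatible API. A minimal client sketch using only the standard library (the host, port, and prompt are assumptions, not part of this card):

```python
import json
import urllib.request

# Chat-completions payload for vLLM's OpenAI-compatible endpoint
# (assumes the `vllm serve` command above is running on localhost:8000).
payload = {
    "model": "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4",
    "messages": [
        {"role": "user", "content": "Write a Python function that reverses a string."}
    ],
    "max_tokens": 256,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(request) as resp:  # uncomment with a live server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```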
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kleinpanic93/Qwen3-Coder-30B-A3B-Instruct-NVFP4"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Generate a completion using the model's chat template.
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Hardware Requirements
- **Minimum VRAM:** ~57 GB (unified memory or dedicated)
- **Tested on:** NVIDIA DGX Spark (GB10, 128 GB unified memory)
- **Recommended:** NVIDIA Blackwell GPUs (GB10, GB200, B200)
## Provenance
```json
{
"source_model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
"quantization": "NVFP4",
"tool": "nvidia-modelopt 0.41.0",
"export_method": "save_pretrained_manual",
"calib_size": 512,
"calib_dataset": "synthetic-random",
"hardware": "NVIDIA GB10 (Blackwell)",
"elapsed_sec": 472
}
```
## Limitations
- This quantization uses **synthetic calibration data** (random tokens) because the container runs in offline mode. Production-grade quantization with real calibration data (e.g., C4, RedPajama) may yield slightly better quality.
- The export uses `save_pretrained` fallback rather than ModelOpt's native HF checkpoint exporter, since `Qwen3MoeExperts` is not yet in ModelOpt 0.41.0's export allowlist. The quantization math is identical — only the serialization path differs.
- MoE gate/router layers are preserved in original precision by design.
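
For reference, "synthetic calibration data" here means batches of uniformly random token IDs rather than real text. A minimal sketch of that kind of input (the vocabulary size and sequence length are assumptions for illustration, not values taken from the quantization run):

```python
import random

VOCAB_SIZE = 151_936  # approximate Qwen3 vocabulary size (assumption)
SEQ_LEN = 512         # calibration sequence length (assumption)

def random_calib_sample(seq_len=SEQ_LEN, vocab_size=VOCAB_SIZE, seed=None):
    """One synthetic calibration sequence: uniformly random token IDs."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(seq_len)]

# e.g. a small batch of reproducible synthetic samples
batch = [random_calib_sample(seed=i) for i in range(4)]
```

Because these sequences carry no linguistic structure, the activation statistics they produce can differ from those of real code or text, which is why real calibration corpora may improve quality slightly.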
## License
This model inherits the [Apache 2.0 license](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct/blob/main/LICENSE) from the base Qwen3-Coder-30B-A3B-Instruct model.
## Acknowledgments
- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [NVIDIA](https://github.com/NVIDIA/TensorRT-Model-Optimizer) for ModelOpt quantization toolkit