|
|
--- |
|
|
library_name: transformers |
|
|
base_model: Qwen/Qwen3-Coder-Next |
|
|
tags: |
|
|
- qwen3 |
|
|
- moe |
|
|
- nvfp4 |
|
|
- quantized |
|
|
- llmcompressor |
|
|
- vllm |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50-series / RTX Pro), try my vLLM fork, which resolves P2P / TP=2 issues (PR into upstream pending).
|
|
https://github.com/Gadflyii/vllm/tree/main |
|
|
|
|
|
# Qwen3-Coder-Next-NVFP4 |
|
|
|
|
|
NVFP4 quantized version of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) (80B-A3B). |
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Property | Value | |
|
|
|----------|-------| |
|
|
| **Base Model** | Qwen/Qwen3-Coder-Next | |
|
|
| **Architecture** | Qwen3NextForCausalLM (Hybrid DeltaNet + Attention + MoE) | |
|
|
| **Parameters** | 80B total, 3B activated per token | |
|
|
| **Experts** | 512 total, 10 activated + 1 shared | |
|
|
| **Layers** | 48 | |
|
|
| **Context Length** | 262,144 tokens (256K) | |
|
|
| **Quantization** | NVFP4 (FP4 weights + FP4 activations) | |
|
|
| **Size** | 45GB (down from ~149GB BF16, 70% reduction) | |
|
|
| **Format** | compressed-tensors | |
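The size figures above can be sanity-checked with back-of-envelope arithmetic. NVFP4 stores FP4 (E2M1) weights in blocks of 16 with one FP8 (E4M3) scale per block, i.e. roughly 4 + 8/16 = 4.5 bits per weight. The sketch below is an approximation only: it ignores the layers kept in BF16, embeddings, and the second-level per-tensor scales.

```python
# Back-of-envelope checkpoint-size estimate for an 80B-parameter model.
# NVFP4: 4-bit weights in blocks of 16, one 8-bit (E4M3) scale per block.
params = 80e9

bf16_bytes = params * 2                      # 16 bits per weight
bits_per_weight = 4 + 8 / 16                 # 4.5 bits incl. block scales
nvfp4_bytes = params * bits_per_weight / 8

print(f"BF16:  {bf16_bytes / 1e9:.0f} GB")   # ~160 GB (~149 GiB)
print(f"NVFP4: {nvfp4_bytes / 1e9:.0f} GB")  # ~45 GB
print(f"Reduction: {1 - nvfp4_bytes / bf16_bytes:.0%}")  # ~72% ideal
```

The ideal ~72% reduction lands closer to the ~70% reported above because the router gates, shared-expert gates, and DeltaNet layers stay in BF16.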
|
|
|
|
|
## Quantization Details |
|
|
|
|
|
Quantized using [llmcompressor](https://github.com/vllm-project/llm-compressor) 0.9.0.1. |
|
|
|
|
|
```python |
|
|
NUM_CALIBRATION_SAMPLES = 20 |
|
|
MAX_SEQUENCE_LENGTH = 2048 |
|
|
DATASET_ID = "HuggingFaceH4/ultrachat_200k"

DATASET_SPLIT = "train_sft"
|
|
moe_calibrate_all_experts = True |
|
|
|
|
|
# Layers kept in BF16 |
|
|
ignore = [ |
|
|
"lm_head", |
|
|
"re:.*mlp.gate$", # MoE router gates |
|
|
"re:.*mlp.shared_expert_gate$", # Shared expert gates |
|
|
"re:.*linear_attn.*", # DeltaNet linear attention |
|
|
] |
|
|
``` |
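A minimal end-to-end script in the spirit of the settings above might look like the following. This is a hedged sketch, not the exact script used to produce this checkpoint: the `oneshot()` keyword names (in particular the dataset wiring and `moe_calibrate_all_experts`) follow current llmcompressor examples and may differ across llmcompressor versions.

```python
# Sketch of an NVFP4 oneshot quantization run with llmcompressor
# (keyword names may vary by llmcompressor version; verify against its docs).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Coder-Next"
SAVE_DIR = "Qwen3-Coder-Next-NVFP4"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers, keeping the sensitive layers listed above in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)

oneshot(
    model=model,
    dataset="ultrachat_200k",
    splits={"calibration": "train_sft[:20]"},
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=20,
    moe_calibrate_all_experts=True,  # route calibration data through every expert
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```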
|
|
|
|
|
## Benchmark Results |
|
|
|
|
|
### MMLU-Pro |
|
|
|
|
|
| Model | Accuracy | Delta | |
|
|
|-------|----------|-------| |
|
|
| BF16 | 52.90% | - | |
|
|
| **NVFP4** | **51.27%** | **-1.63%** | |
|
|
|
|
|
### Context Length Testing |
|
|
|
|
|
Successfully tested up to **128K tokens** with an FP8 KV cache (insufficient VRAM to test longer contexts).
|
|
|
|
|
## Usage with vLLM |
|
|
|
|
|
Requires vLLM 0.16.0+ (NVFP4 support) and Transformers 5.0.0+.
|
|
|
|
|
```bash |
|
|
# Serve with vLLM
|
|
vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \ |
|
|
--tensor-parallel-size 2 \ |
|
|
--max-model-len 131072 \ |
|
|
--kv-cache-dtype fp8 |
|
|
``` |
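Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with the `openai` Python package; the base URL and port assume vLLM's defaults, and the prompt is just an illustration:

```python
# Query the vLLM OpenAI-compatible endpoint (assumes default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/Qwen3-Coder-Next-NVFP4",
    messages=[
        {"role": "user",
         "content": "Write a Python function that merges two sorted lists."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```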
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0, same as the base model.
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- [Qwen Team](https://huggingface.co/Qwen) for the base model |
|
|
- [RedHatAI](https://huggingface.co/RedHatAI) for the quantization approach reference |
|
|
- [vLLM Project](https://github.com/vllm-project/vllm) for llmcompressor |
|
|
|