|
|
--- |
|
|
library_name: transformers |
|
|
base_model: Qwen/Qwen3-Coder-Next |
|
|
tags: |
|
|
- qwen3 |
|
|
- moe |
|
|
- nvfp4 |
|
|
- quantized |
|
|
- llmcompressor |
|
|
- vllm |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
|
|
|
# Note: If you have a multi-GPU SM120 Blackwell system (RTX 50-series / RTX Pro), try my vLLM fork, which resolves P2P / TP=2 issues (PR into upstream pending).
|
|
https://github.com/Gadflyii/vllm/tree/main |
|
|
|
|
|
# Qwen3-Coder-Next-NVFP4 |
|
|
|
|
|
NVFP4 quantized version of [Qwen/Qwen3-Coder-Next](https://huggingface.co/Qwen/Qwen3-Coder-Next) (80B-A3B). |
|
|
|
|
|
## Model Details |
|
|
|
|
|
| Property | Value | |
|
|
|----------|-------| |
|
|
| **Base Model** | Qwen/Qwen3-Coder-Next | |
|
|
| **Architecture** | Qwen3NextForCausalLM (Hybrid DeltaNet + Attention + MoE) | |
|
|
| **Parameters** | 80B total, 3B activated per token | |
|
|
| **Experts** | 512 total, 10 activated + 1 shared | |
|
|
| **Layers** | 48 | |
|
|
| **Context Length** | 262,144 tokens (256K) | |
|
|
| **Quantization** | NVFP4 (FP4 weights + FP4 activations) | |
|
|
| **Size** | 45GB (down from ~149GB BF16, 70% reduction) | |
|
|
| **Format** | compressed-tensors | |
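The size figures above can be sanity-checked with back-of-envelope arithmetic. NVFP4 stores FP4 (E2M1) weights in blocks of 16 with one FP8 (E4M3) scale per block, i.e. roughly 4 + 8/16 = 4.5 bits per weight. The sketch below is an approximation only: it ignores the layers kept in BF16, embeddings, and the second-level per-tensor scales.

```python
# Back-of-envelope checkpoint-size estimate for an 80B-parameter model.
# NVFP4: 4-bit weights in blocks of 16, one 8-bit (E4M3) scale per block.
params = 80e9

bf16_bytes = params * 2                      # 16 bits per weight
bits_per_weight = 4 + 8 / 16                 # 4.5 bits incl. block scales
nvfp4_bytes = params * bits_per_weight / 8

print(f"BF16:  {bf16_bytes / 1e9:.0f} GB")   # ~160 GB (~149 GiB)
print(f"NVFP4: {nvfp4_bytes / 1e9:.0f} GB")  # ~45 GB
print(f"Reduction: {1 - nvfp4_bytes / bf16_bytes:.0%}")  # ~72% ideal
```

The ideal ~72% reduction lands closer to the ~70% reported above because the router gates, shared-expert gates, and DeltaNet layers stay in BF16.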
|
|
|
|
|
## Quantization Details |
|
|
|
|
|
Quantized using [llmcompressor](https://github.com/vllm-project/llm-compressor) 0.9.0.1. |
|
|
|
|
|
```python |
|
|
NUM_CALIBRATION_SAMPLES = 20 |
|
|
MAX_SEQUENCE_LENGTH = 2048 |
|
|
DATASET_ID = "HuggingFaceH4/ultrachat_200k"

DATASET_SPLIT = "train_sft"
|
|
moe_calibrate_all_experts = True |
|
|
|
|
|
# Layers kept in BF16 |
|
|
ignore = [ |
|
|
"lm_head", |
|
|
"re:.*mlp.gate$", # MoE router gates |
|
|
"re:.*mlp.shared_expert_gate$", # Shared expert gates |
|
|
"re:.*linear_attn.*", # DeltaNet linear attention |
|
|
] |
|
|
``` |
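A minimal end-to-end script in the spirit of the settings above might look like the following. This is a hedged sketch, not the exact script used to produce this checkpoint: the `oneshot()` keyword names (in particular the dataset wiring and `moe_calibrate_all_experts`) follow current llmcompressor examples and may differ across llmcompressor versions.

```python
# Sketch of an NVFP4 oneshot quantization run with llmcompressor
# (keyword names may vary by llmcompressor version; verify against its docs).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-Coder-Next"
SAVE_DIR = "Qwen3-Coder-Next-NVFP4"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers, keeping the sensitive layers listed above in BF16.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=[
        "lm_head",
        "re:.*mlp.gate$",
        "re:.*mlp.shared_expert_gate$",
        "re:.*linear_attn.*",
    ],
)

oneshot(
    model=model,
    dataset="ultrachat_200k",
    splits={"calibration": "train_sft[:20]"},
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=20,
    moe_calibrate_all_experts=True,  # route calibration data through every expert
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```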
|
|
|
|
|
## Benchmark Results |
|
|
|
|
|
### MMLU-Pro |
|
|
|
|
|
| Model | Accuracy | Delta | |
|
|
|-------|----------|-------| |
|
|
| BF16 | 52.90% | - | |
|
|
| **NVFP4** | **51.27%** | **-1.63%** | |
|
|
|
|
|
### Context Length Testing |
|
|
|
|
|
Successfully tested up to **128K tokens** with an FP8 KV cache (insufficient VRAM to test longer contexts).
|
|
|
|
|
## Usage with vLLM |
|
|
|
|
|
Requires vLLM 0.16.0+ (NVFP4 support) and Transformers 5.0.0+.
|
|
|
|
|
```bash |
|
|
# Serve with vLLM
|
|
vllm serve GadflyII/Qwen3-Coder-Next-NVFP4 \ |
|
|
--tensor-parallel-size 2 \ |
|
|
--max-model-len 131072 \ |
|
|
--kv-cache-dtype fp8 |
|
|
``` |
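Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch with the `openai` Python package; the base URL and port assume vLLM's defaults, and the prompt is just an illustration:

```python
# Query the vLLM OpenAI-compatible endpoint (assumes default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/Qwen3-Coder-Next-NVFP4",
    messages=[
        {"role": "user",
         "content": "Write a Python function that merges two sorted lists."},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```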
|
|
|
|
|
## License |
|
|
|
|
|
Apache 2.0, same as the base model.
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- [Qwen Team](https://huggingface.co/Qwen) for the base model |
|
|
- [RedHatAI](https://huggingface.co/RedHatAI) for the quantization approach reference |
|
|
- [vLLM Project](https://github.com/vllm-project/vllm) for llmcompressor |
|
|
|