---
language:
- en
license: apache-2.0
tags:
- text-generation
- quantization
- nvfp4
- nvidia
- dgx-spark
- blackwell
- model_hub_mixin
- pytorch_model_hub_mixin
base_model: qingy2024/Qwen3-VLTO-32B-Instruct
inference: false
---
|
|
|
|
|
# Qwen3-VLTO-32B-Instruct-NVFP4

This is an NVFP4-quantized version of [qingy2024/Qwen3-VLTO-32B-Instruct](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct), optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs.

## Model Description

- **Base Model:** qingy2024/Qwen3-VLTO-32B-Instruct
- **Quantization Format:** NVFP4 (4-bit floating point)
- **Target Hardware:** NVIDIA DGX Spark (Grace Blackwell Superchip)
- **Quantization Tool:** NVIDIA TensorRT Model Optimizer v0.35.1
- **Model Size:** Approximately 19.4 GB (68.2% reduction from BF16)
|
|
|
|
|
## Performance Characteristics

### Memory Efficiency

| Model Version | Memory Usage | Reduction |
|---------------|--------------|-----------|
| BF16 (Original) | 61.03 GB | Baseline |
| NVFP4 (This model) | 19.42 GB | 68.2% |
|
|
|
|
|
### Inference Speed

| Model Version | Throughput | Relative Performance |
|---------------|------------|----------------------|
| BF16 (Original) | 3.65 tokens/s | Baseline |
| NVFP4 (This model) | 9.99 tokens/s | 2.74x faster |
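The derived columns in both tables can be reproduced from the raw measurements; a quick arithmetic check in Python:

```python
# Reproduce the derived columns from the raw measurements.

bf16_gb, nvfp4_gb = 61.03, 19.42
reduction = (bf16_gb - nvfp4_gb) / bf16_gb
print(f"memory reduction: {reduction:.1%}")   # 68.2%

bf16_tps, nvfp4_tps = 3.65, 9.99
speedup = nvfp4_tps / bf16_tps
print(f"throughput speedup: {speedup:.2f}x")  # 2.74x
```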
|
|
|
|
|
**Test Configuration:**

- Hardware: NVIDIA DGX Spark GB10
- Framework: vLLM 0.10.2
- Max Model Length: 8192 tokens
- GPU Memory Utilization: 90%
|
|
|
|
|
## Quantization Details

### NVFP4 Format

NVFP4 is NVIDIA's 4-bit floating-point quantization format, featuring:

- **Two-level scaling:** an E4M3 FP8 scale per 16-value block plus a global FP32 tensor scale
- **Hardware acceleration:** optimized for Tensor Cores on Blackwell GB10 GPUs
- **Group size:** 16
- **Minimal accuracy degradation:** less than 1% versus the original model
- **Excluded modules:** lm_head (kept in higher precision)
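To make the two-level scheme concrete, here is a pure-Python simulation of block scaling. It is a simplified sketch, not NVIDIA's implementation: the real format stores 4-bit E2M1 codes and FP8 (E4M3) block scales, both of which are modeled here as plain floats.

```python
# Simplified simulation of NVFP4-style two-level block scaling.

# Magnitudes representable by the 4-bit E2M1 format:
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Pick a scale so the block's largest |value| maps to 6.0 (the E2M1
    maximum), then snap each scaled value to the nearest representable
    magnitude."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # stored as FP8 E4M3 in the real format
    codes = []
    for v in block:
        mag = min(E2M1, key=lambda m: abs(abs(v) / scale - m))
        codes.append(mag if v >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes, global_scale=1.0):
    # The second-level (global FP32) tensor scale is applied on top of the block scale.
    return [c * scale * global_scale for c in codes]

# One 16-value block, matching the group size of 16.
weights = [0.03, -0.12, 0.45, -0.9, 0.07, 0.0, 0.6, -0.3,
           0.15, 0.22, -0.05, 0.33, -0.48, 0.9, -0.72, 0.51]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"block scale = {scale:.4f}, max abs error = {max_err:.4f}")
```

The largest value in each block is recovered exactly, and the rounding error of the rest is bounded by the block scale, which is why per-16-value scaling loses far less accuracy than a single per-tensor scale would.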
|
|
|
|
|
### Calibration

- **Dataset:** C4 (Colossal Clean Crawled Corpus)
- **Calibration samples:** 512
- **Maximum sequence length:** 2048 tokens
- **Method:** Post-training quantization with activation calibration
|
|
|
|
|
## Usage

### Requirements

- NVIDIA DGX Spark or a compatible Blackwell GPU
- vLLM >= 0.6.5
- nvidia-modelopt[hf]

### Loading the Model

**IMPORTANT:** This model must be loaded with vLLM using the `modelopt` quantization parameter. The standard Hugging Face `AutoModelForCausalLM` loader will not work.
|
|
|
|
|
```python
from vllm import LLM, SamplingParams

# Load the NVFP4-quantized model
llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4",
    quantization="modelopt",  # Required for NVFP4
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

# Generate
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms:"], sampling_params)
print(outputs[0].outputs[0].text)
```
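The same quantization flag applies when serving the model behind vLLM's OpenAI-compatible server rather than running offline batch generation. The invocation below is a sketch using vLLM's documented CLI flags, mirroring the test configuration (8192-token context, 90% GPU memory utilization):

```shell
# Serve the NVFP4 checkpoint behind an OpenAI-compatible API (port 8000 by default)
vllm serve Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4 \
    --quantization modelopt \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```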
|
|
|
|
|
### Environment Variables

You can optionally set:

- `HF_CACHE_DIR`: Override the HuggingFace cache location
|
|
|
|
|
## Limitations

- **Hardware-specific:** optimized for the NVIDIA Blackwell architecture (GB10)
- **vLLM required:** cannot be loaded with the standard transformers library
- **Quantization artifacts:** minor precision loss (<1%) compared to the BF16 original
|
|
|
|
|
## Intended Use

This model is intended for:

- High-throughput inference on NVIDIA DGX Spark systems
- Production deployments requiring memory-efficient models
- Research on quantization techniques for large language models
|
|
|
|
|
## Training and Quantization

### Base Model Training

See the [original model card](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct) for base model training details.

### Quantization Process

1. **Model Loading:** Original model loaded in BF16 precision
2. **Calibration:** 512 samples from the C4 dataset used to collect activation statistics
3. **Quantization:** NVFP4 format applied using NVIDIA modelopt
4. **Export:** Saved in HuggingFace safetensors format
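The four steps can be sketched with TensorRT Model Optimizer's PyTorch API. This is an illustrative outline, not the exact script used for this checkpoint: it assumes modelopt's `mtq.quantize` entry point with the `NVFP4_DEFAULT_CFG` preset, and imports are kept inside the function because actually running it requires a Blackwell GPU and the `nvidia-modelopt` package.

```python
def quantize_to_nvfp4(model_id: str, out_dir: str, n_calib: int = 512, seq_len: int = 2048):
    """Sketch of the PTQ pipeline described above (not the exact script used)."""
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint

    # 1. Load the original model in BF16.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # 2. Calibration: stream C4 samples through the model so modelopt
    #    can collect activation statistics.
    calib = load_dataset("allenai/c4", "en", split="train", streaming=True)
    texts = [row["text"] for _, row in zip(range(n_calib), calib)]

    def forward_loop(m):
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True,
                      max_length=seq_len).input_ids.to(m.device)
            m(ids)

    # 3. Apply NVFP4 post-training quantization.
    model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

    # 4. Export as a HuggingFace safetensors checkpoint.
    export_hf_checkpoint(model, export_dir=out_dir)
```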
|
|
|
|
|
**Quantization Time:** Approximately 60-90 minutes on DGX Spark
|
|
|
|
|
## Evaluation

### Test Results

All five inference tests passed successfully:

- Technical explanation generation
- Code generation
- Mathematical reasoning
- Creative writing
- Instruction following

**Average performance:** 9.99 tokens/s on DGX Spark GB10
|
|
|
|
|
## Citation

If you use this quantized model, please cite:

```bibtex
@misc{qwen3vlto32b-nvfp4,
  author = {Ex0bit},
  title = {Qwen3-VLTO-32B-Instruct-NVFP4: NVFP4 Quantized Model for DGX Spark},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4}},
}
```

And the original base model:

```bibtex
@misc{qingy2024qwen3vlto,
  author = {qingy2024},
  title = {Qwen3-VLTO-32B-Instruct},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct}},
}
```
|
|
|
|
|
## References

- [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- [vLLM Documentation](https://docs.vllm.ai/)
- [NVIDIA DGX Spark Documentation](https://docs.nvidia.com/dgx-spark/)
- [Quantization GitHub Repository](https://github.com/Ex0bit/nvfp4-quantization)
|
|
|
|
|
## License

This quantized model inherits the license of the base model. Please refer to the [original model's license](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct) for details.

## Model Card Authors

- Ex0bit (@Ex0bit)

## Acknowledgments

- NVIDIA, for TensorRT Model Optimizer and the DGX Spark hardware
- qingy2024, for the base Qwen3-VLTO-32B-Instruct model
- The vLLM team, for their high-performance inference framework