---
language:
- en
license: apache-2.0
tags:
- text-generation
- quantization
- nvfp4
- nvidia
- dgx-spark
- blackwell
- model_hub_mixin
- pytorch_model_hub_mixin
base_model: qingy2024/Qwen3-VLTO-32B-Instruct
inference: false
---
# Qwen3-VLTO-32B-Instruct-NVFP4
This is an NVFP4 quantized version of [qingy2024/Qwen3-VLTO-32B-Instruct](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct), optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs.
## Model Description
- **Base Model:** qingy2024/Qwen3-VLTO-32B-Instruct
- **Quantization Format:** NVFP4 (4-bit floating point)
- **Target Hardware:** NVIDIA DGX Spark (Grace Blackwell Superchip)
- **Quantization Tool:** NVIDIA TensorRT Model Optimizer v0.35.1
- **Model Size:** Approximately 20 GB (68% reduction from BF16)
## Performance Characteristics
### Memory Efficiency
| Model Version | Memory Usage | Reduction |
|--------------|--------------|-----------|
| BF16 (Original) | 61.03 GB | Baseline |
| NVFP4 (This model) | 19.42 GB | 68.2% |
### Inference Speed
| Model Version | Throughput | Relative Performance |
|--------------|------------|---------------------|
| BF16 (Original) | 3.65 tokens/s | Baseline |
| NVFP4 (This model) | 9.99 tokens/s | 2.74x faster |
**Test Configuration:**
- Hardware: NVIDIA DGX Spark GB10
- Framework: vLLM 0.10.2
- Max Model Length: 8192 tokens
- GPU Memory Utilization: 90%
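A number like this can be reproduced with a short vLLM script. The sketch below mirrors the test configuration above, but it is illustrative, not the exact benchmark harness used:

```python
import time
from vllm import LLM, SamplingParams

# Load the quantized model with the same settings as the test configuration.
llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4",
    quantization="modelopt",
    max_model_len=8192,
    gpu_memory_utilization=0.9,
)

params = SamplingParams(max_tokens=256)
start = time.perf_counter()
outputs = llm.generate(["Explain quantum computing in simple terms:"], params)
elapsed = time.perf_counter() - start

# Count only generated tokens; prompt processing is included in the wall time.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.2f} tokens/s")
```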
## Quantization Details
### NVFP4 Format
NVFP4 is NVIDIA's 4-bit floating point quantization format featuring:
- **Two-level scaling:** an FP8 (E4M3) scale per 16-value block plus a global FP32 per-tensor scale (see the sketch after this list)
- **Hardware acceleration:** Optimized for Tensor Cores on Blackwell GB10 GPUs
- **Group size:** 16
- **Minimal accuracy degradation:** Less than 1% vs original model
- **Excluded modules:** lm_head (kept in higher precision)
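The two-level scheme can be illustrated with a small fake-quantization toy. This is a sketch under simplifying assumptions (the real format packs 4-bit codes and rounds the per-block scale to E4M3 for Blackwell Tensor Cores), but it shows the underlying math: each value is reconstructed as `fp4_code × fp8_block_scale × fp32_global_scale`.

```python
import numpy as np

# Toy fake-quantization sketch of NVFP4's two-level scaling (illustrative
# only; the real format stores packed 4-bit codes and FP8 block scales).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX, E4M3_MAX = 6.0, 448.0

def fake_quantize_nvfp4(x: np.ndarray, block_size: int = 16) -> np.ndarray:
    blocks = x.astype(np.float32).reshape(-1, block_size)
    # Global FP32 scale, chosen so every per-block scale fits in E4M3 range.
    global_scale = np.abs(blocks).max() / (FP4_MAX * E4M3_MAX) + 1e-12
    # Per-block scale relative to the global scale (a real implementation
    # would round this value to FP8 E4M3 here).
    block_scale = np.abs(blocks).max(axis=1, keepdims=True) / (FP4_MAX * global_scale)
    block_scale = np.maximum(block_scale, 1e-12)
    # Snap each value to the nearest E2M1 grid point, keeping the sign.
    scaled = blocks / (block_scale * global_scale)
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    # Dequantize so the round-trip error can be inspected.
    return (q * block_scale * global_scale).reshape(x.shape)

x = np.random.randn(4, 32).astype(np.float32)
print(f"mean abs round-trip error: {np.abs(x - fake_quantize_nvfp4(x)).mean():.4f}")
```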
### Calibration
- **Dataset:** C4 (Colossal Clean Crawled Corpus)
- **Calibration samples:** 512
- **Maximum sequence length:** 2048 tokens
- **Method:** Post-training quantization with activation calibration
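For illustration, a calibration batch matching these settings could be assembled roughly as follows; the dataset id `allenai/c4` and the exact tokenizer call are assumptions, not the script used here:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("qingy2024/Qwen3-VLTO-32B-Instruct")
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

# 512 samples, truncated to the 2048-token maximum sequence length.
calib_batches = []
for i, sample in enumerate(stream):
    if i >= 512:
        break
    calib_batches.append(
        tok(sample["text"], truncation=True, max_length=2048, return_tensors="pt")
    )
```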
## Usage
### Requirements
- NVIDIA DGX Spark or compatible Blackwell GPU
- vLLM >= 0.6.5 (the benchmarks above were run with vLLM 0.10.2)
- nvidia-modelopt[hf]
### Loading the Model
**IMPORTANT:** This model must be loaded with vLLM using the `modelopt` quantization parameter. Standard HuggingFace `AutoModelForCausalLM` will not work.
```python
from vllm import LLM, SamplingParams

# Load the NVFP4 quantized model
llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4",
    quantization="modelopt",  # Required for NVFP4
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

# Generate
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms:"], sampling_params)
print(outputs[0].outputs[0].text)
```
### Environment Variables
You can optionally set:
- `HF_CACHE_DIR`: Override HuggingFace cache location
## Limitations
- **Hardware specific:** Optimized for NVIDIA Blackwell architecture (GB10)
- **vLLM required:** Cannot be loaded with the standard `transformers` library
- **Quantization artifacts:** Minor precision loss (<1%) compared to the BF16 original
## Intended Use
This model is intended for:
- High-throughput inference on NVIDIA DGX Spark systems
- Production deployments requiring memory-efficient models
- Research on quantization techniques for large language models
## Training and Quantization
### Base Model Training
See the [original model card](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct) for base model training details.
### Quantization Process
1. **Model Loading:** Original model loaded in BF16 precision
2. **Calibration:** 512 samples from C4 dataset for activation statistics
3. **Quantization:** NVFP4 format applied using NVIDIA modelopt
4. **Export:** Saved in HuggingFace safetensors format
**Quantization Time:** Approximately 60-90 minutes on DGX Spark
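A minimal sketch of these steps with TensorRT Model Optimizer is below. The config name `NVFP4_DEFAULT_CFG` and the `export_hf_checkpoint` helper follow modelopt's documented API, but treat this as an outline rather than the exact script used:

```python
import torch
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM

# 1. Load the original model in BF16.
model = AutoModelForCausalLM.from_pretrained(
    "qingy2024/Qwen3-VLTO-32B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# 2. Calibration: run the batches collected earlier (see the calibration
#    sketch above) so activation statistics can be gathered.
def forward_loop(model):
    for batch in calib_batches:
        model(batch["input_ids"].to(model.device))

# 3. Apply NVFP4 post-training quantization.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# 4. Export to a HuggingFace safetensors checkpoint.
export_hf_checkpoint(model, export_dir="Qwen3-VLTO-32B-Instruct-NVFP4")
```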
## Evaluation
### Test Results
All 5 inference tests passed successfully:
- Technical explanation generation
- Code generation
- Mathematical reasoning
- Creative writing
- Instruction following
**Average performance:** 9.99 tokens/s on DGX Spark GB10
## Citation
If you use this quantized model, please cite:
```bibtex
@misc{qwen3vlto32b-nvfp4,
  author       = {Ex0bit},
  title        = {Qwen3-VLTO-32B-Instruct-NVFP4: NVFP4 Quantized Model for DGX Spark},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4}},
}
```
And the original base model:
```bibtex
@misc{qingy2024qwen3vlto,
  author       = {qingy2024},
  title        = {Qwen3-VLTO-32B-Instruct},
  year         = {2024},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct}},
}
```
## References
- [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- [vLLM Documentation](https://docs.vllm.ai/)
- [NVIDIA DGX Spark Documentation](https://docs.nvidia.com/dgx-spark/)
- [Quantization GitHub Repository](https://github.com/Ex0bit/nvfp4-quantization)
## License
This quantized model inherits the license from the base model. Please refer to the [original model's license](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct) for details.
## Model Card Authors
- Ex0bit (@Ex0bit)
## Acknowledgments
- NVIDIA for TensorRT Model Optimizer and DGX Spark hardware
- qingy2024 for the base Qwen3-VLTO-32B-Instruct model
- The vLLM team for the high-performance inference framework