---
language:
- en
license: apache-2.0
tags:
- text-generation
- quantization
- nvfp4
- nvidia
- dgx-spark
- blackwell
- model_hub_mixin
- pytorch_model_hub_mixin
base_model: qingy2024/Qwen3-VLTO-32B-Instruct
inference: false
---
|
|
|
|
|
# Qwen3-VLTO-32B-Instruct-NVFP4

This is an NVFP4-quantized version of [qingy2024/Qwen3-VLTO-32B-Instruct](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct), optimized for NVIDIA DGX Spark systems with Blackwell GB10 GPUs.

## Model Description

- **Base Model:** qingy2024/Qwen3-VLTO-32B-Instruct
- **Quantization Format:** NVFP4 (4-bit floating point)
- **Target Hardware:** NVIDIA DGX Spark (Grace Blackwell Superchip)
- **Quantization Tool:** NVIDIA TensorRT Model Optimizer v0.35.1
- **Model Size:** Approximately 19.4 GB (68.2% reduction from BF16)
|
|
|
|
|
## Performance Characteristics

### Memory Efficiency

| Model Version | Memory Usage | Reduction |
|---------------|--------------|-----------|
| BF16 (Original) | 61.03 GB | Baseline |
| NVFP4 (This model) | 19.42 GB | 68.2% |
|
|
|
|
|
### Inference Speed

| Model Version | Throughput | Relative Performance |
|---------------|------------|----------------------|
| BF16 (Original) | 3.65 tokens/s | Baseline |
| NVFP4 (This model) | 9.99 tokens/s | 2.74x faster |
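The derived columns in both tables can be reproduced from the raw measurements; a quick arithmetic check in Python:

```python
# Reproduce the derived columns from the raw measurements.

bf16_gb, nvfp4_gb = 61.03, 19.42
reduction = (bf16_gb - nvfp4_gb) / bf16_gb
print(f"memory reduction: {reduction:.1%}")   # 68.2%

bf16_tps, nvfp4_tps = 3.65, 9.99
speedup = nvfp4_tps / bf16_tps
print(f"throughput speedup: {speedup:.2f}x")  # 2.74x
```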
|
|
|
|
|
**Test Configuration:**

- Hardware: NVIDIA DGX Spark GB10
- Framework: vLLM 0.10.2
- Max Model Length: 8192 tokens
- GPU Memory Utilization: 90%
|
|
|
|
|
## Quantization Details

### NVFP4 Format

NVFP4 is NVIDIA's 4-bit floating-point quantization format, featuring:

- **Two-level scaling:** an E4M3 FP8 scale per 16-value block plus a global FP32 tensor scale
- **Hardware acceleration:** optimized for Tensor Cores on Blackwell GB10 GPUs
- **Group size:** 16
- **Minimal accuracy degradation:** less than 1% versus the original model
- **Excluded modules:** lm_head (kept in higher precision)
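To make the two-level scheme concrete, here is a pure-Python simulation of block scaling. It is a simplified sketch, not NVIDIA's implementation: the real format stores 4-bit E2M1 codes and FP8 (E4M3) block scales, both of which are modeled here as plain floats.

```python
# Simplified simulation of NVFP4-style two-level block scaling.

# Magnitudes representable by the 4-bit E2M1 format:
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Pick a scale so the block's largest |value| maps to 6.0 (the E2M1
    maximum), then snap each scaled value to the nearest representable
    magnitude."""
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # stored as FP8 E4M3 in the real format
    codes = []
    for v in block:
        mag = min(E2M1, key=lambda m: abs(abs(v) / scale - m))
        codes.append(mag if v >= 0 else -mag)
    return scale, codes

def dequantize_block(scale, codes, global_scale=1.0):
    # The second-level (global FP32) tensor scale is applied on top of the block scale.
    return [c * scale * global_scale for c in codes]

# One 16-value block, matching the group size of 16.
weights = [0.03, -0.12, 0.45, -0.9, 0.07, 0.0, 0.6, -0.3,
           0.15, 0.22, -0.05, 0.33, -0.48, 0.9, -0.72, 0.51]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"block scale = {scale:.4f}, max abs error = {max_err:.4f}")
```

The largest value in each block is recovered exactly, and the rounding error of the rest is bounded by the block scale, which is why per-16-value scaling loses far less accuracy than a single per-tensor scale would.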
|
|
|
|
|
### Calibration

- **Dataset:** C4 (Colossal Clean Crawled Corpus)
- **Calibration samples:** 512
- **Maximum sequence length:** 2048 tokens
- **Method:** Post-training quantization with activation calibration
|
|
|
|
|
## Usage

### Requirements

- NVIDIA DGX Spark or a compatible Blackwell GPU
- vLLM >= 0.6.5
- nvidia-modelopt[hf]

### Loading the Model

**IMPORTANT:** This model must be loaded with vLLM using the `modelopt` quantization parameter. The standard Hugging Face `AutoModelForCausalLM` loader will not work.
|
|
|
|
|
```python
from vllm import LLM, SamplingParams

# Load the NVFP4-quantized model
llm = LLM(
    model="Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4",
    quantization="modelopt",  # Required for NVFP4
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
)

# Generate
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain quantum computing in simple terms:"], sampling_params)
print(outputs[0].outputs[0].text)
```
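The same quantization flag applies when serving the model behind vLLM's OpenAI-compatible server rather than running offline batch generation. The invocation below is a sketch using vLLM's documented CLI flags, mirroring the test configuration (8192-token context, 90% GPU memory utilization):

```shell
# Serve the NVFP4 checkpoint behind an OpenAI-compatible API (port 8000 by default)
vllm serve Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4 \
    --quantization modelopt \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9
```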
|
|
|
|
|
### Environment Variables

You can optionally set:

- `HF_CACHE_DIR`: Override the HuggingFace cache location
|
|
|
|
|
## Limitations

- **Hardware-specific:** optimized for the NVIDIA Blackwell architecture (GB10)
- **vLLM required:** cannot be loaded with the standard transformers library
- **Quantization artifacts:** minor precision loss (<1%) compared to the BF16 original
|
|
|
|
|
## Intended Use

This model is intended for:

- High-throughput inference on NVIDIA DGX Spark systems
- Production deployments requiring memory-efficient models
- Research on quantization techniques for large language models
|
|
|
|
|
## Training and Quantization

### Base Model Training

See the [original model card](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct) for base model training details.

### Quantization Process

1. **Model Loading:** Original model loaded in BF16 precision
2. **Calibration:** 512 samples from the C4 dataset used to collect activation statistics
3. **Quantization:** NVFP4 format applied using NVIDIA modelopt
4. **Export:** Saved in HuggingFace safetensors format
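The four steps can be sketched with TensorRT Model Optimizer's PyTorch API. This is an illustrative outline, not the exact script used for this checkpoint: it assumes modelopt's `mtq.quantize` entry point with the `NVFP4_DEFAULT_CFG` preset, and imports are kept inside the function because actually running it requires a Blackwell GPU and the `nvidia-modelopt` package.

```python
def quantize_to_nvfp4(model_id: str, out_dir: str, n_calib: int = 512, seq_len: int = 2048):
    """Sketch of the PTQ pipeline described above (not the exact script used)."""
    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import modelopt.torch.quantization as mtq
    from modelopt.torch.export import export_hf_checkpoint

    # 1. Load the original model in BF16.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # 2. Calibration: stream C4 samples through the model so modelopt
    #    can collect activation statistics.
    calib = load_dataset("allenai/c4", "en", split="train", streaming=True)
    texts = [row["text"] for _, row in zip(range(n_calib), calib)]

    def forward_loop(m):
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True,
                      max_length=seq_len).input_ids.to(m.device)
            m(ids)

    # 3. Apply NVFP4 post-training quantization.
    model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

    # 4. Export as a HuggingFace safetensors checkpoint.
    export_hf_checkpoint(model, export_dir=out_dir)
```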
|
|
|
|
|
**Quantization Time:** Approximately 60-90 minutes on DGX Spark
|
|
|
|
|
## Evaluation

### Test Results

All five inference tests passed successfully:

- Technical explanation generation
- Code generation
- Mathematical reasoning
- Creative writing
- Instruction following

**Average performance:** 9.99 tokens/s on DGX Spark GB10
|
|
|
|
|
## Citation

If you use this quantized model, please cite:

```bibtex
@misc{qwen3vlto32b-nvfp4,
  author = {Ex0bit},
  title = {Qwen3-VLTO-32B-Instruct-NVFP4: NVFP4 Quantized Model for DGX Spark},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Ex0bit/Qwen3-VLTO-32B-Instruct-NVFP4}},
}
```

And the original base model:

```bibtex
@misc{qingy2024qwen3vlto,
  author = {qingy2024},
  title = {Qwen3-VLTO-32B-Instruct},
  year = {2024},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct}},
}
```
|
|
|
|
|
## References

- [NVIDIA TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- [vLLM Documentation](https://docs.vllm.ai/)
- [NVIDIA DGX Spark Documentation](https://docs.nvidia.com/dgx-spark/)
- [Quantization GitHub Repository](https://github.com/Ex0bit/nvfp4-quantization)
|
|
|
|
|
## License

This quantized model inherits the license of the base model. Please refer to the [original model's license](https://huggingface.co/qingy2024/Qwen3-VLTO-32B-Instruct) for details.

## Model Card Authors

- Ex0bit (@Ex0bit)

## Acknowledgments

- NVIDIA, for TensorRT Model Optimizer and the DGX Spark hardware
- qingy2024, for the base Qwen3-VLTO-32B-Instruct model
- The vLLM team, for their high-performance inference framework