---
language:
- en
- de
tags:
- sparse
- quantization
- nvfp4
- 2:4-sparsity
- vllm
- qwen
license: apache-2.0
base_model: Qwen/Qwen3-8B
---

# Sparse FP4 Collection

This model is part of Cortecs' experimental model collection based on Qwen3. The collection features **2:4 structured sparsity** and **NVFP4 / NVFP4-A16 quantization**, optionally followed by light fine-tuning. The goal is to explore the trade-offs between compression, accuracy, and throughput on Blackwell-class GPUs.

## Model Description

The models are derived from the Qwen3 family and compressed using:

- 2:4 structured sparsity (50% of weights zeroed)
- NVFP4 or NVFP4-A16 quantization
- Optional short fine-tuning to recover accuracy

These models target very high throughput on modern hardware while retaining useful accuracy for English and multilingual tasks.

## Evaluation

All results were produced with a unified evaluation pipeline using standard academic benchmarks.

### Benchmark Results

| Model | ARC | Hellaswag | MMLU | ARC_de | Hellaswag_de | MMLU_de | TruthfulQA | CrowS | English Avg | German Avg | Safety Avg |
|----------------------------------|------|-----------|-------|--------|--------------|---------|------------|-------|-------------|------------|------------|
| Qwen3 8B | 66.7 | 67.2 | 78.22 | 54.8 | 54.9 | 67.8 | 54.42 | 37.69 | 70.71 | 59.17 | 46.06 |
| Qwen3 4B | 63.3 | 62.5 | 73.07 | 47.5 | 49.9 | 65.1 | 54.76 | 41.03 | 66.29 | 54.17 | 47.90 |
| Qwen3 8B NVFP4A16 | 66.4 | 66.5 | 75.54 | 54.2 | 54.4 | 67.7 | 53.72 | 38.04 | 69.48 | 58.77 | 45.88 |
| Qwen3 8B NVFP4 | 66.3 | 66.6 | 75.54 | 54.4 | 54.3 | 68.1 | 53.76 | 37.92 | 69.48 | 58.93 | 45.84 |
| Qwen3 8B Sparse NVFP4A16 | 50.5 | 57.4 | 53.35 | 30.7 | 36.0 | 34.4 | 46.95 | 39.89 | 53.75 | 33.70 | 43.42 |
| Qwen3 8B Sparse Finetune 0.01 | 53.8 | 62.8 | 60.17 | 35.8 | 46.6 | 46.4 | 50.66 | 39.18 | 58.92 | 42.93 | 44.92 |
| Qwen3 8B Sparse Finetune 0.1 | 56.4 | 62.2 | 60.89 | 38.9 | 46.2 | 44.0 | 52.13 | 38.04 | 59.83 | 43.03 | 45.09 |

## Performance

Throughput measurements were conducted on a single B200 GPU.

| Model | Total tokens/s |
|---------------------------------------|----------------|
| Qwen3 8B | 30379 |
| Qwen3 4B | 34483 |
| Qwen3 8B NVFP4A16 | 15978 |
| Qwen3 8B Sparse NVFP4A16 | 15860 |
| Qwen3 8B NVFP4 | 35296 |

## Notes

- 2:4 structured sparsity always zeroes exactly 50% of the weights.
- FP4 execution on Blackwell requires specialized kernels; throughput varies depending on backend support.
- Sparse FP4 models show reduced accuracy but improved efficiency; light fine-tuning is essential to recover accuracy.

## Intended Use

These models are **experimental** and intended only for evaluating sparsity and quantization strategies. They should **not** be used in production systems, safety-critical applications, or deployment scenarios involving real user data.

## Limitations

Sparse FP4 models may exhibit reduced robustness and generalization.
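## Appendix: Illustrative Sketches

The 2:4 structured sparsity pattern described above can be sketched in a few lines: within every contiguous group of four weights, only the two largest-magnitude values are kept, which guarantees exactly 50% zeros. This is a minimal NumPy illustration of the pattern, not the pruning code used to produce these checkpoints.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Apply 2:4 structured sparsity: in each group of 4 consecutive
    weights, zero out the 2 smallest-magnitude entries."""
    flat = weights.reshape(-1, 4)                    # groups of 4
    # indices of the 2 smallest |w| per group
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.01]])
print(prune_2_4(w))   # exactly 2 of every 4 weights survive
```

Because the pattern is fixed (2 of every 4), hardware with sparse tensor core support can skip the zeroed multiplications, which is what makes this form of sparsity attractive on Blackwell-class GPUs.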
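For intuition on the NVFP4 side: NVFP4 stores weights as 4-bit FP4 (E2M1) values together with a per-block scale over small blocks of elements. The sketch below quantizes one block to the E2M1 grid using a simple absmax scale; it deliberately ignores the encoding of the scale factor itself and is an illustration of the idea, not the kernel or recipe used for these models. The block size and scale handling here are simplifying assumptions.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 value (sign stored separately)
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block: np.ndarray):
    """Quantize one block of nonzero weights: scale so the absmax maps to
    6.0 (the largest E2M1 magnitude), then snap each scaled value to the
    nearest grid point. Dequantized value is q * scale."""
    scale = np.abs(block).max() / 6.0
    scaled = block / scale
    # nearest representable magnitude, with the sign restored afterwards
    idx = np.abs(np.abs(scaled)[:, None] - E2M1[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1[idx]
    return q, scale

block = np.linspace(-1.0, 1.0, 16)      # one 16-element block
q, scale = quantize_block_fp4(block)
max_err = np.abs(q * scale - block).max()
```

The coarse, non-uniform E2M1 grid (only 8 magnitudes, with wide gaps above 2.0) is why FP4 quantization alone costs some accuracy, and why combining it with 2:4 sparsity, as in the tables above, compounds the accuracy loss unless it is followed by fine-tuning.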