---
language:
- en
- de
tags:
- sparse
- quantization
- nvfp4
- 2:4-sparsity
- vllm
- qwen
license: apache-2.0
base_model: Qwen/Qwen3-8B
---

# Sparse FP4 Collection

This model is part of Cortecs' experimental model collection based on Qwen3. The collection features **2:4 structured sparsity** and **NVFP4 / NVFP4-A16 quantization**, optionally followed by light fine-tuning. The goal is to explore the trade-offs between compression, accuracy, and throughput on Blackwell-class GPUs.

## Model Description

The models are derived from the Qwen3 family and compressed using:

- 2:4 structured sparsity (50% of weights zeroed)
- NVFP4 or NVFP4-A16 quantization
- Optional short fine-tuning to recover accuracy

These models target very high throughput on modern hardware while retaining useful accuracy for English and multilingual tasks.

## Evaluation

All results were produced with a unified evaluation pipeline using standard academic benchmarks.

### Benchmark Results

| Model | ARC | Hellaswag | MMLU | ARC_de | Hellaswag_de | MMLU_de | TruthfulQA | CrowS | English Avg | German Avg | Safety Avg |
|----------------------------------|------|-----------|-------|--------|--------------|---------|------------|-------|-------------|------------|------------|
| Qwen3 8B | 66.7 | 67.2 | 78.22 | 54.8 | 54.9 | 67.8 | 54.42 | 37.69 | 70.71 | 59.17 | 46.06 |
| Qwen3 4B | 63.3 | 62.5 | 73.07 | 47.5 | 49.9 | 65.1 | 54.76 | 41.03 | 66.29 | 54.17 | 47.90 |
| Qwen3 8B NVFP4A16 | 66.4 | 66.5 | 75.54 | 54.2 | 54.4 | 67.7 | 53.72 | 38.04 | 69.48 | 58.77 | 45.88 |
| Qwen3 8B NVFP4 | 66.3 | 66.6 | 75.54 | 54.4 | 54.3 | 68.1 | 53.76 | 37.92 | 69.48 | 58.93 | 45.84 |
| Qwen3 8B Sparse NVFP4A16 | 50.5 | 57.4 | 53.35 | 30.7 | 36.0 | 34.4 | 46.95 | 39.89 | 53.75 | 33.70 | 43.42 |
| Qwen3 8B Sparse Finetune 0.01 | 53.8 | 62.8 | 60.17 | 35.8 | 46.6 | 46.4 | 50.66 | 39.18 | 58.92 | 42.93 | 44.92 |
| Qwen3 8B Sparse Finetune 0.1 | 56.4 | 62.2 | 60.89 | 38.9 | 46.2 | 44.0 | 52.13 | 38.04 | 59.83 | 43.03 | 45.09 |

## Performance

Throughput measurements were conducted on a single B200 GPU.

| Model | Total tokens/s |
|---------------------------------------|----------------|
| Qwen3 8B | 30379 |
| Qwen3 4B | 34483 |
| Qwen3 8B NVFP4A16 | 15978 |
| Qwen3 8B Sparse NVFP4A16 | 15860 |
| Qwen3 8B NVFP4 | 35296 |

## Notes

- 2:4 structured sparsity always zeroes exactly 50% of the weights.
- FP4 execution on Blackwell requires specialized kernels; throughput varies depending on backend support.
- Sparse FP4 models show reduced accuracy but improved efficiency; light fine-tuning is essential to recover accuracy.

## Intended Use

These models are **experimental** and intended only for evaluating sparsity and quantization strategies. They should **not** be used in production systems, safety-critical applications, or deployment scenarios involving real user data.

## Limitations

Sparse FP4 models may exhibit reduced robustness and generalization.
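## Appendix: Illustrative Sketches

The 2:4 structured sparsity pattern described above can be sketched in a few lines: within every contiguous group of four weights, only the two largest-magnitude values are kept, which guarantees exactly 50% zeros. This is a minimal NumPy illustration of the pattern, not the pruning code used to produce these checkpoints.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Apply 2:4 structured sparsity: in each group of 4 consecutive
    weights, zero out the 2 smallest-magnitude entries."""
    flat = weights.reshape(-1, 4)                    # groups of 4
    # indices of the 2 smallest |w| per group
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

w = np.array([[0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.7, 0.01]])
print(prune_2_4(w))   # exactly 2 of every 4 weights survive
```

Because the pattern is fixed (2 of every 4), hardware with sparse tensor core support can skip the zeroed multiplications, which is what makes this form of sparsity attractive on Blackwell-class GPUs.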
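For intuition on the NVFP4 side: NVFP4 stores weights as 4-bit FP4 (E2M1) values together with a per-block scale over small blocks of elements. The sketch below quantizes one block to the E2M1 grid using a simple absmax scale; it deliberately ignores the encoding of the scale factor itself and is an illustration of the idea, not the kernel or recipe used for these models. The block size and scale handling here are simplifying assumptions.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 value (sign stored separately)
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block: np.ndarray):
    """Quantize one block of nonzero weights: scale so the absmax maps to
    6.0 (the largest E2M1 magnitude), then snap each scaled value to the
    nearest grid point. Dequantized value is q * scale."""
    scale = np.abs(block).max() / 6.0
    scaled = block / scale
    # nearest representable magnitude, with the sign restored afterwards
    idx = np.abs(np.abs(scaled)[:, None] - E2M1[None, :]).argmin(axis=1)
    q = np.sign(scaled) * E2M1[idx]
    return q, scale

block = np.linspace(-1.0, 1.0, 16)      # one 16-element block
q, scale = quantize_block_fp4(block)
max_err = np.abs(q * scale - block).max()
```

The coarse, non-uniform E2M1 grid (only 8 magnitudes, with wide gaps above 2.0) is why FP4 quantization alone costs some accuracy, and why combining it with 2:4 sparsity, as in the tables above, compounds the accuracy loss unless it is followed by fine-tuning.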