This is the official QAT FP-Quant checkpoint of `qwen/Qwen3-8B`, produced as described in the [**"Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization"**](https://arxiv.org/abs/2509.23202) paper.
This model can be run on Blackwell-generation NVIDIA GPUs via [QuTLASS](https://github.com/IST-DASLab/qutlass) and [FP-Quant](https://github.com/IST-DASLab/FP-Quant) in either [transformers](https://huggingface.co/docs/transformers/main/en/quantization/fp_quant) or [vLLM](https://github.com/vllm-project/vllm/pull/24440).
The approximate recipe for training this model (up to local batch size and learning rate) is available [here](https://github.com/IST-DASLab/nanochat-qat/blob/qat/transformers_distill.py).
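The linked script contains the exact recipe; as a rough illustration, QAT with distillation typically minimizes the KL divergence between the original (teacher) model's output distribution and the quantized (student) model's. A minimal sketch of that standard logit-distillation loss (function names and the temperature parameter are illustrative assumptions, not taken from the script):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # stabilize exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL(teacher || student) over the vocabulary, averaged
    over tokens -- a common objective for QAT with distillation."""
    p = softmax(teacher_logits, temperature)
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits, temperature) + 1e-12)
    return float((p * (log_p - log_q)).sum(axis=-1).mean())
```

When student and teacher logits match, the loss is zero; any mismatch in the predicted distributions makes it positive.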
This checkpoint achieves the following accuracy relative to the original BF16 model and plain round-to-nearest (RTN) quantization:
| Model | MMLU | GSM8k | Hellaswag | Winogrande | Avg |
|-------|------|-------|-----------|------------|-----|
| `qwen/Qwen3-8B` | 73.0 | 90.9 | 75.5 | 70.6 | 77.5 |
| RTN | 70.2 | 86.4 | 73.0 | 68.1 | 74.4 |
| QAT (this checkpoint) | 71.3 | 89.2 | 75.2 | 70.4 | 76.5 |
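For intuition about the RTN baseline in the table above, here is a sketch of fake-quantization to the MXFP4 format with round-to-nearest: each group of 32 values shares one power-of-two (E8M0) scale, and each value is rounded to the nearest FP4 (E2M1) grid point. The function name and the scale-selection rule are illustrative assumptions, not the QuTLASS kernels:

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (sign handled separately)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_rtn(x, group_size=32):
    """Fake-quantize x with MXFP4 round-to-nearest: each group of
    `group_size` values shares one power-of-two (E8M0) scale."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)
    amax = np.abs(g).max(axis=1, keepdims=True)
    # Pick a power-of-two scale so the largest magnitude fits in [0, 6]
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / 6.0))
    mags = np.abs(g) / scale
    # Round each scaled magnitude to the nearest FP4 grid point
    idx = np.abs(mags[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(g) * FP4_GRID[idx] * scale).reshape(orig_shape)
```

Values that land exactly on a scaled grid point survive unchanged; everything else is rounded, which is the error source that QAT training compensates for.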