Qwen3-ASR Technical Report
Paper: 2601.21337
NVFP4 quantized version of Qwen3-ASR-1.7B, produced with NVIDIA ModelOpt (Max calibration algorithm) and ready for deployment with vLLM.

Only the LLM component (`model`, ~1.4B params, 82% of the total) is quantized to NVFP4; the `audio_tower` and `lm_head` remain in BF16.
| Component | Parameters | Quantized |
|---|---|---|
| `audio_tower` | ~0.3B | No |
| `model` (LLM) | ~1.4B | NVFP4 |
| `lm_head` | shared | No |
Calibration algorithm: Max (per-tensor symmetric, `MaxCalibrator`)
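As a rough illustration of what Max calibration does (this is not the ModelOpt implementation, and real NVFP4 additionally uses per-block FP8 scale factors, omitted here), the calibrator picks a single symmetric scale from the largest absolute value seen, then values are snapped to the nearest representable FP4 (E2M1) magnitude:

```python
# Sketch of per-tensor symmetric Max calibration for FP4 (E2M1).
# Magnitudes representable in E2M1 (sign is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def max_calibrate(tensor):
    """Max calibration: scale so the largest |value| maps to the E2M1 max (6.0)."""
    amax = max(abs(v) for v in tensor)
    return amax / 6.0 if amax > 0 else 1.0

def fake_quantize(tensor, scale):
    """Quantize-dequantize: snap each value to the nearest scaled grid point."""
    out = []
    for v in tensor:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out

weights = [0.03, -0.12, 0.25, -0.6, 1.2]  # hypothetical weight values
scale = max_calibrate(weights)            # 1.2 / 6.0 = 0.2
print(fake_quantize(weights, scale))
```

Small-magnitude values collapse to zero or a coarse grid point, which is why only the LLM weights (robust to this) are quantized while the audio tower stays in BF16.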
Memory usage measured with vLLM at `gpu_memory_utilization=0.7`:
| Model | Weights | KV Cache |
|---|---|---|
| BF16 | 3.87 GB | 16.58 GB |
| NVFP4 | 1.99 GB | 18.43 GB |
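A quick sanity check on the table above: the weight savings from NVFP4 reappear almost one-for-one as extra KV-cache capacity, since vLLM allocates the cache from the memory left over after loading weights.

```python
# Figures from the table above (GB, measured at gpu_memory_utilization=0.7).
bf16_weights, nvfp4_weights = 3.87, 1.99
bf16_kv, nvfp4_kv = 16.58, 18.43

weight_savings = bf16_weights - nvfp4_weights
kv_gain = nvfp4_kv - bf16_kv
print(f"weight savings: {weight_savings:.2f} GB")  # 1.88 GB
print(f"KV-cache gain:  {kv_gain:.2f} GB")         # 1.85 GB
```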
| Model | WER |
|---|---|
| BF16 | 7.34% |
| NVFP4 | 10.73% |
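The accuracy cost of quantization, expressed as a relative increase over the BF16 baseline:

```python
# Relative WER degradation of NVFP4 vs. BF16 (values from the table above).
bf16_wer, nvfp4_wer = 7.34, 10.73
rel_increase = (nvfp4_wer - bf16_wer) / bf16_wer
print(f"relative WER increase: {rel_increase:.0%}")  # 46%
```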
| Concurrency | BF16 (req/s) | NVFP4 (req/s) |
|---|---|---|
| 1 | 15.42 | 15.77 |
| 256 | 227.88 | 260.87 |
| 512 | 227.50 | 262.87 |
| 1024 | 219.63 | 268.25 |
NVFP4 achieves the highest throughput at high concurrency because its smaller weight footprint frees more KV-cache capacity, allowing more requests to be batched.
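The gain grows with concurrency, which is consistent with the KV-cache explanation: at low concurrency the models are latency-bound and nearly identical, while at batch sizes large enough to exhaust the cache the extra capacity pays off.

```python
# Per-concurrency throughput gain of NVFP4 over BF16 (req/s from the table above).
bf16 = {1: 15.42, 256: 227.88, 512: 227.50, 1024: 219.63}
nvfp4 = {1: 15.77, 256: 260.87, 512: 262.87, 1024: 268.25}

for c in bf16:
    gain = nvfp4[c] / bf16[c] - 1
    print(f"concurrency {c:>4}: +{gain:.1%}")
```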
| Model | Cosine Similarity |
|---|---|
| NVFP4 | 0.945 |
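The cosine similarity presumably compares NVFP4 outputs against the BF16 baseline (the card does not say on which tensors it is computed). For reference, the metric itself is:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy example with hypothetical logit vectors (not actual model outputs).
bf16_logits = [2.1, -0.5, 0.3, 1.7]
nvfp4_logits = [2.0, -0.4, 0.35, 1.6]
print(round(cosine_similarity(bf16_logits, nvfp4_logits), 3))
```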
```bash
git clone https://github.com/QwenLM/Qwen3-ASR
pip install nvidia-modelopt vllm

qwen-asr-serve vrfai/qwen3asr-nvfp4 \
    --quantization modelopt_fp4 \
    --gpu-memory-utilization 0.7
```
```python
import io

import requests
import soundfile as sf

def transcribe(audio_path, url="http://localhost:8000/v1/audio/transcriptions"):
    # Read the audio file and re-encode it as WAV in memory.
    audio, sr = sf.read(audio_path)
    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="WAV")
    buf.seek(0)
    # POST to the OpenAI-compatible transcription endpoint.
    r = requests.post(
        url,
        files={"file": ("audio.wav", buf, "audio/wav")},
        data={"model": "vrfai/qwen3asr-nvfp4"},
    )
    return r.json().get("text", "")
```
Base model: Qwen/Qwen3-ASR-1.7B