Qwen3-ASR-1.7B NVFP4 Quantized

NVFP4 quantized version of Qwen3-ASR-1.7B using NVIDIA ModelOpt (Max calibration algorithm), ready for deployment with vLLM.

Quantization Details

Only the LLM component (model: ~1.4B params, 82% of the total) is quantized to NVFP4; the audio_tower and lm_head remain in BF16.

Component     Parameters   Quantized
audio_tower   ~0.3B        No (BF16)
model (LLM)   ~1.4B        Yes (NVFP4)
lm_head       shared       No (BF16)

Calibration algorithm: Max (per-tensor symmetric, MaxCalibrator)
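Max calibration picks the quantization scale from the largest observed weight magnitude. A minimal sketch of per-tensor symmetric calibration onto the FP4 (E2M1) value grid, whose largest representable magnitude is 6.0 (illustration only; ModelOpt's actual NVFP4 format additionally carries an FP8 scale per 16-element block):

```python
import numpy as np

FP4_MAX = 6.0  # largest magnitude representable in E2M1

def max_calibrate_scale(weights: np.ndarray) -> float:
    """Per-tensor symmetric scale mapping |w|max onto the FP4 range."""
    return float(np.abs(weights).max() / FP4_MAX)

def fake_quantize(weights: np.ndarray) -> np.ndarray:
    """Quantize-dequantize to the E2M1 grid (rounding behavior is illustrative)."""
    scale = max_calibrate_scale(weights)
    grid = np.array([0, 0.5, 1, 1.5, 2, 3, 4, 6])     # positive E2M1 values
    levels = np.concatenate([-grid[::-1], grid])       # full symmetric grid
    idx = np.abs(weights[..., None] / scale - levels).argmin(axis=-1)
    return levels[idx] * scale
```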

Results

Memory Usage (RTX 5090, gpu_memory_utilization=0.7)

Precision   Model Weights   KV Cache
BF16        3.87 GB         16.58 GB
NVFP4       1.99 GB         18.43 GB
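A back-of-envelope estimate explains most of the weight savings, assuming ~4.5 bits/param for NVFP4 (4-bit values plus one FP8 scale per 16-element block) and 16 bits for the unquantized parts; the measured numbers above are slightly higher due to runtime overhead:

```python
def gb(params_billion: float, bits_per_param: float) -> float:
    """Weight footprint in GB for a parameter count at a given bit width."""
    return params_billion * bits_per_param / 8

bf16_total = gb(1.7, 16)                    # ~3.4 GB (measured: 3.87 GB)
nvfp4_total = gb(1.4, 4.5) + gb(0.3, 16)    # ~1.4 GB (measured: 1.99 GB)
```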

WER (760 VIVOS test samples, concurrency=1)

Model   WER
BF16    7.34%
NVFP4   10.73%

Quantization costs about 3.4 points of WER absolute (~46% relative) on this test set.
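WER here is the word-level Levenshtein edit distance between reference and hypothesis transcripts, divided by the reference word count. A minimal reference implementation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```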

Throughput across Concurrency Levels (7168 VIVOS samples)

Concurrency   BF16 (req/s)   NVFP4 (req/s)
1             15.42          15.77
256           227.88         260.87
512           227.50         262.87
1024          219.63         268.25

NVFP4 achieves the highest throughput at high concurrency: its smaller weight footprint leaves more GPU memory for KV cache, so vLLM can keep more sequences in flight.
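Throughput figures like these are typically gathered by keeping a fixed number of requests in flight and dividing completed requests by wall-clock time. A minimal sketch of that measurement loop, with a stand-in send_request (the actual benchmark harness is not part of this card):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(send_request, samples, concurrency: int) -> float:
    """Send all samples with at most `concurrency` in flight; return req/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Exhaust the iterator so all requests complete before timing stops.
        list(pool.map(send_request, samples))
    return len(samples) / (time.perf_counter() - start)
```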

LLM Output Cosine Similarity vs BF16 (50 samples)

Model Cosine Similarity
NVFP4 0.945
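The 0.945 figure compares LLM outputs between precisions; the card does not state which tensors are compared (hidden states vs. logits), but the metric itself is standard cosine similarity:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two flattened tensors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```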

Usage

Serving with vLLM

git clone https://github.com/QwenLM/Qwen3-ASR
pip install nvidia-modelopt vllm

qwen-asr-serve vrfai/qwen3asr-nvfp4 \
    --quantization modelopt_fp4 \
    --gpu-memory-utilization 0.7

Inference

import io

import requests
import soundfile as sf

def transcribe(audio_path, url="http://localhost:8000/v1/audio/transcriptions"):
    # Re-encode the input as WAV in memory so any soundfile-readable
    # format can be sent to the OpenAI-compatible endpoint.
    audio, sr = sf.read(audio_path)
    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="WAV")
    buf.seek(0)
    r = requests.post(
        url,
        files={"file": ("audio.wav", buf, "audio/wav")},
        data={"model": "vrfai/qwen3asr-nvfp4"},
    )
    r.raise_for_status()
    return r.json().get("text", "")

