# Qwen3-ASR-1.7B FP8 Quantized

FP8 quantized version of Qwen3-ASR-1.7B using NVIDIA ModelOpt (Max calibration algorithm), ready for deployment with vLLM.

## Quantization Details

Only the LLM component (`model`, ~1.4B parameters, ~82% of the total) is quantized to FP8; the `audio_tower` and `lm_head` remain in BF16.

| Component     | Parameters | Quantized |
|---------------|------------|-----------|
| `audio_tower` | ~0.3B      | No        |
| `model` (LLM) | ~1.4B      | FP8       |
| `lm_head`     | shared     | No        |

Calibration algorithm: Max (per-tensor symmetric, MaxCalibrator)
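Conceptually, max calibration maps the observed absolute maximum of each tensor onto the top of the FP8 E4M3 range (±448). A minimal sketch in plain Python (illustrative only; ModelOpt's `MaxCalibrator` additionally rounds values to the FP8 mantissa grid, which is omitted here):

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_scale(weights):
    """Per-tensor symmetric scale: map amax onto the FP8 E4M3 maximum."""
    amax = max(abs(w) for w in weights)
    return amax / FP8_E4M3_MAX

def quantize(weights, scale):
    # Divide by the scale and clamp to the representable range.
    # Real FP8 quantization also rounds each value to the nearest
    # representable E4M3 number; that step is skipped in this sketch.
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]
```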

## Results

### Memory Usage (RTX 5090, `gpu_memory_utilization=0.7`)

| Model | Weights | KV Cache |
|-------|---------|----------|
| BF16  | 3.87 GB | 16.58 GB |
| FP8   | 2.55 GB | 17.86 GB |
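A back-of-envelope check on the FP8 weight footprint (pure arithmetic from the parameter counts above; the measured 2.55 GB also includes embedding/`lm_head` weights, quantization scales, and allocator overhead, which this estimate omits):

```python
# ~1.4B params at 1 byte (FP8) plus ~0.3B params at 2 bytes (BF16).
fp8_bytes = 1.4e9 * 1    # quantized LLM weights
bf16_bytes = 0.3e9 * 2   # audio_tower kept in BF16
total_gb = (fp8_bytes + bf16_bytes) / 1024**3
print(f"{total_gb:.2f} GB")  # the gap to the measured 2.55 GB comes from
                             # the components this estimate leaves out
```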

### WER (760 VIVOS test samples, concurrency=1)

| Model | WER   |
|-------|-------|
| BF16  | 7.34% |
| FP8   | 7.60% |
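WER here is the standard metric: word-level edit distance (substitutions, insertions, and deletions) divided by the number of reference words. A minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```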

### Throughput across Concurrency Levels (7168 VIVOS samples)

| Concurrency | BF16 (req/s) | FP8 (req/s) |
|-------------|--------------|-------------|
| 1           | 15.42        | 19.37       |
| 256         | 227.88       | 246.47      |
| 512         | 227.50       | 251.37      |
| 1024        | 219.63       | 248.24      |
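The relative speedup follows directly from the table: FP8 is about 1.26x faster at concurrency 1, narrowing to roughly 1.08–1.13x at higher concurrency levels:

```python
# Speedup ratios derived from the throughput table above.
bf16 = {1: 15.42, 256: 227.88, 512: 227.50, 1024: 219.63}
fp8 = {1: 19.37, 256: 246.47, 512: 251.37, 1024: 248.24}

speedup = {c: fp8[c] / bf16[c] for c in bf16}
for c, s in speedup.items():
    print(f"concurrency {c:>4}: {s:.2f}x")
```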

### LLM Output Cosine Similarity vs. BF16 (50 samples)

| Model | Cosine Similarity |
|-------|-------------------|
| FP8   | 0.994             |
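Cosine similarity is the dot product of two output vectors normalized by their magnitudes (exactly which tensors were compared, e.g. final hidden states vs. logits, is not specified here):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```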

## Usage

### Serving with vLLM

```shell
git clone https://github.com/QwenLM/Qwen3-ASR
pip install nvidia-modelopt vllm

qwen-asr-serve vrfai/qwen3asr-fp8 \
    --quantization modelopt \
    --gpu-memory-utilization 0.7
```

### Inference

```python
import io

import requests
import soundfile as sf

def transcribe(audio_path, url="http://localhost:8000/v1/audio/transcriptions"):
    """Send an audio file to the server and return the transcribed text."""
    audio, sr = sf.read(audio_path)
    # Re-encode to WAV in memory so any soundfile-readable format works.
    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="WAV")
    buf.seek(0)
    r = requests.post(
        url,
        files={"file": ("audio.wav", buf, "audio/wav")},
        data={"model": "vrfai/qwen3asr-fp8"},
    )
    r.raise_for_status()
    return r.json().get("text", "")

# Example: print(transcribe("sample.wav"))
```
