This repo stores weights for my OpenVINO implementation of Qwen3-ASR-0.6B, soon to be a part of OpenArc.
Features
Stateful KV Cache
- OpenVINO models can be configured for "stateful" operation. Instead of passing the KV cache in and out of the model on every decode step, the cache stays on the inference device (GPU or CPU) as internal model state. This increases decode throughput by cutting the time the device spends waiting for data.
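To get a feel for why this matters, here is a rough back-of-the-envelope sketch. It is a simplified model of my own, not how OpenVINO accounts for transfers: it assumes a stateless decoder hands the whole cache in and gets the grown cache back on every step, and it counts cached token positions rather than bytes. The function names are hypothetical; the token counts are taken from the benchmark in this README.

```python
def stateless_tokens_moved(prompt_tokens: int, generated_tokens: int) -> int:
    """Cached token positions crossing the host/device boundary when the
    KV cache is an ordinary model input and output on every decode step
    (simplifying assumption, see the text above)."""
    moved = 0
    cache_len = prompt_tokens  # the cache holds the prompt after prefill
    for _ in range(generated_tokens):
        moved += cache_len  # existing cache passed in
        cache_len += 1      # one new token appended by the decode step
        moved += cache_len  # grown cache passed back out
    return moved


def stateful_tokens_moved(prompt_tokens: int, generated_tokens: int) -> int:
    """With the cache kept as device-side state, none of it is re-sent."""
    return 0


PROMPT_TOKENS = 292     # prompt_tokens from the A770 benchmark
GENERATED_TOKENS = 77   # generated_tokens from the A770 benchmark

print("stateless:", stateless_tokens_moved(PROMPT_TOKENS, GENERATED_TOKENS))
print("stateful: ", stateful_tokens_moved(PROMPT_TOKENS, GENERATED_TOKENS))
```

Multiply the stateless count by the per-token cache size (layers, KV heads, head dimension, bytes per element) to estimate traffic in bytes. The stateless total grows roughly quadratically with sequence length, which is the cost the stateful setup avoids.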
"Mixed" Precision
- Audio Encoder: FP16
- Thinker Embeddings: INT8
- Decoder Language model: INT8
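For readers new to INT8 weights, the sketch below shows the general idea behind asymmetric 8-bit quantization (the repo name mentions INT8_ASYM). It is a generic, per-tensor illustration in plain Python. The actual export tooling, grouping, and per-channel details used for these weights are not shown here, and the helper names are hypothetical.

```python
def quantize_asym_u8(values):
    """Map floats to unsigned 8-bit codes using a scale and a zero point."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0   # fall back to 1.0 for a constant tensor
    zero_point = round(-lo / scale)  # the code that represents 0.0
    codes = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from the 8-bit codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.5, 0.0, 0.25, 1.0]  # made-up example values
codes, scale, zero_point = quantize_asym_u8(weights)
restored = dequantize(codes, scale, zero_point)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.4f} -> code {c:3d} -> {r:+.4f}")
```

The zero point lets the 256 available codes cover a range that is not centered on zero, which is what separates asymmetric from symmetric quantization. The round trip is lossy: each restored value lands within about one quantization step of the original.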
NPU
- This model should be NPU friendly, but I have not tested it on an NPU.
A770 Benchmarks
=== Qwen3 ASR Transcription ===
Text: I must not fear. Fear is the mind killer. Fear is the little death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past, I will turn the inner eye to see its path. But the fear is gone; it will be nothing. Only I will remain.
Metrics:
feature_sec: 0.004266884992830455
encoder_sec: 1.3122401779983193
prefill_sec: 0.13349935901351273
prefill_tok_s: 2187.276419585235
decode_sec: 1.3044207430211827
decode_tok_s: 59.0300333783864
detok_sec: 0.0690425910288468
prompt_tokens: 292
generated_tokens: 77
encoder_tokens: 277
audio_duration_sec: 21.2906875
model_load_sec: 4.678463210002519
end_to_end_sec: 3.443493520957418
rtf: 0.16173707499851367
language: English
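The derived figures above can be reproduced from the raw timings. The formulas in this sketch are inferred from the numbers rather than taken from the benchmark script: it assumes rtf is end_to_end_sec divided by audio_duration_sec, and that each tok/s value is the token count divided by the time spent in that stage.

```python
# Raw values copied from the A770 benchmark output.
metrics = {
    "prefill_sec": 0.13349935901351273,
    "decode_sec": 1.3044207430211827,
    "prompt_tokens": 292,
    "generated_tokens": 77,
    "audio_duration_sec": 21.2906875,
    "end_to_end_sec": 3.443493520957418,
}


def derive(m):
    """Recompute the derived metrics under the assumed formulas."""
    return {
        # Real-time factor: processing time per second of audio (< 1 is faster than real time).
        "rtf": m["end_to_end_sec"] / m["audio_duration_sec"],
        "prefill_tok_s": m["prompt_tokens"] / m["prefill_sec"],
        "decode_tok_s": m["generated_tokens"] / m["decode_sec"],
    }


derived = derive(metrics)
for name, value in derived.items():
    print(f"{name}: {value:.4f}")
```

An rtf of about 0.16 means the 21.3 second clip was transcribed in roughly a sixth of its playback time, excluding model load.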
I'm not done hacking on this and will update the weights as my implementation matures.
Model tree for dseditor/Qwen3-ASR-0.6B-INT8_ASYM-OpenVINO
Base model: Qwen/Qwen3-ASR-0.6B