This repo stores weights for my OpenVINO implementation of Qwen3-ASR-0.6B, soon to be a part of OpenArc.

Features

Stateful KV Cache

  • OpenVINO models can be configured for "stateful" operation. Instead of copying the KV cache between devices on every decode step, we keep it resident on whichever device runs the decoder (GPU or CPU). This increases decode throughput by reducing the time the GPU spends waiting for data.
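
To make the difference concrete, here is a toy sketch in plain Python. It is not the OpenVINO API and not the code used for this model; the class names and the "entries moved" counter are made up for illustration. It only counts how many cache entries cross the call boundary when the caller owns the cache versus when the decoder keeps it internally.

```python
class StatelessDecoder:
    """Toy decoder: the caller owns the KV cache and passes it in on every step."""

    def step(self, token, kv_cache):
        new_cache = kv_cache + [token]  # one new "KV entry" per token
        # Entries crossing the call boundary: token in, cache in, cache out.
        moved = 1 + len(kv_cache) + len(new_cache)
        return new_cache, moved


class StatefulDecoder:
    """Toy decoder: the KV cache lives inside the object between steps."""

    def __init__(self):
        self._kv_cache = []

    def reset(self):
        # Clear the internal cache before starting a new sequence.
        self._kv_cache.clear()

    def step(self, token):
        self._kv_cache.append(token)
        return 1  # only the new token crosses the call boundary


def entries_moved(num_tokens):
    """Return (stateless_total, stateful_total) for a decode of num_tokens steps."""
    stateless, stateful = StatelessDecoder(), StatefulDecoder()
    cache, moved_stateless, moved_stateful = [], 0, 0
    for token in range(num_tokens):
        cache, moved = stateless.step(token, cache)
        moved_stateless += moved
        moved_stateful += stateful.step(token)
    return moved_stateless, moved_stateful


if __name__ == "__main__":
    print(entries_moved(77))
```

In this toy, the stateless total grows roughly with the square of the sequence length, while the stateful total grows linearly. Real transfer costs depend on the device, the driver, and the runtime.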

"Mixed" Precision

  • Audio Encoder: FP16
  • Thinker Embeddings: INT8
  • Decoder Language model: INT8
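
For readers new to INT8 weights, below is a minimal sketch of asymmetric 8-bit quantization (a scale plus a zero point), which is what the INT8_ASYM suffix in the repository name suggests. This is an assumption for illustration: the function names are hypothetical, the scheme is per-tensor for simplicity, and it is not the tooling used to produce these weights.

```python
def quantize_asym(values, num_bits=8):
    """Map floats to unsigned integers using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [
        min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize_asym(quantized, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(q - zero_point) * scale for q in quantized]


if __name__ == "__main__":
    weights = [-0.5, 0.0, 0.25, 1.0]
    q, scale, zp = quantize_asym(weights)
    print(q, scale, zp)
    print(dequantize_asym(q, scale, zp))
```

Each restored value differs from the original by at most about half a quantization step (scale / 2). That rounding error is the price of storing one byte per weight.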

NPU

  • This model should be NPU-friendly, but I have not tested it on one.

A770 Benchmarks

=== Qwen3 ASR Transcription ===
Text: I must not fear. Fear is the mind killer. Fear is the little death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past, I will turn the inner eye to see its path. But the fear is gone; it will be nothing. Only I will remain.

Metrics:
  feature_sec: 0.004266884992830455
  encoder_sec: 1.3122401779983193
  prefill_sec: 0.13349935901351273
  prefill_tok_s: 2187.276419585235
  decode_sec: 1.3044207430211827
  decode_tok_s: 59.0300333783864
  detok_sec: 0.0690425910288468
  prompt_tokens: 292
  generated_tokens: 77
  encoder_tokens: 277
  audio_duration_sec: 21.2906875
  model_load_sec: 4.678463210002519
  end_to_end_sec: 3.443493520957418
  rtf: 0.16173707499851367
  language: English
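
The throughput and real-time-factor figures above appear to follow directly from the raw timings and token counts. The sketch below recomputes them from the reported values. The formulas are inferred from the numbers rather than taken from the benchmark script, so treat them as an assumption.

```python
# Raw values copied from the metrics above.
metrics = {
    "prefill_sec": 0.13349935901351273,
    "decode_sec": 1.3044207430211827,
    "prompt_tokens": 292,
    "generated_tokens": 77,
    "audio_duration_sec": 21.2906875,
    "end_to_end_sec": 3.443493520957418,
}


def derive(m):
    """Recompute the derived metrics from raw timings and token counts."""
    return {
        # Prompt tokens processed per second during prefill.
        "prefill_tok_s": m["prompt_tokens"] / m["prefill_sec"],
        # Tokens generated per second during decode.
        "decode_tok_s": m["generated_tokens"] / m["decode_sec"],
        # Real-time factor: processing time divided by audio length.
        "rtf": m["end_to_end_sec"] / m["audio_duration_sec"],
    }


if __name__ == "__main__":
    for name, value in derive(metrics).items():
        print(f"{name}: {value:.4f}")
```

An rtf below 1.0 means transcription finishes faster than the audio plays. Here about 21.3 seconds of audio were processed in about 3.4 seconds end to end, with model load time reported separately.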

I'm not done hacking on this and will update the weights as my implementation matures.
