Qwen3-ASR-1.7B — OpenVINO INT8 with Explicit KV-Cache

An OpenVINO-optimized version of Qwen/Qwen3-ASR-1.7B, exported and quantized independently as a community effort. It is not affiliated with Intel or any official OpenVINO project. GPU support (Intel or NVIDIA) has not been tested.


Model Architecture

The inference pipeline is split into four OpenVINO IR models:

File                       Precision  Shape In                                                  Shape Out
audio_encoder_model        FP16       mel (128, 1000)                                           audio_embeds (1, 130, 2048)
thinker_embeddings_model   INT8       input_ids (1, L)                                          token_embeds (1, L, 2048)
decoder_prefill_kv_model   INT8       input_embeds (1, L, 2048), position_ids                   logits, past_keys (28, 1, 8, L, 128), past_values
decoder_kv_model           INT8       new_embed (1, 1, 2048), new_pos, past_keys, past_values   logits, new_keys, new_values

(L denotes the variable token sequence length.)

Quantization Approach

Explicit KV-Cache (not Stateful)

The decoder is split into two models that pass KV tensors explicitly between steps:

  1. Prefill (decoder_prefill_kv_model): processes the full context (audio embeddings + prompt tokens) in a single forward pass, returning past_keys and past_values as output tensors.
  2. Decode (decoder_kv_model): accepts one new token embedding at a time along with the accumulated KV tensors, appends one step, and returns updated new_keys / new_values.

This design does not rely on OpenVINO stateful model internals. KV tensors are plain NumPy arrays, making the inference loop fully transparent and portable.

Prefill:  [audio_embeds + prompt_embeds]  →  logit₀, past_K, past_V
Decode₁:  [emb₁, pos₁, past_K, past_V]   →  logit₁, K₁, V₁
Decode₂:  [emb₂, pos₂, K₁, V₁]          →  logit₂, K₂, V₂
   ...

KV tensor shape: (28 layers, 1 batch, 8 GQA heads, seq_len, 128 head_dim)
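Because the KV tensors are plain NumPy arrays, the host-side loop reduces to threading the cache through each step. A minimal shape-level sketch, with dimensions taken from the tables above; `fake_decode_step` is a hypothetical stand-in for the real `decoder_kv_model` call, which attends over the cache and returns it grown by one position:

```python
import numpy as np

LAYERS, BATCH, KV_HEADS, HEAD_DIM = 28, 1, 8, 128

def fake_decode_step(past_k, past_v):
    """Stand-in for decoder_kv_model: the real model appends the new
    position internally and returns the grown (seq_len + 1) cache."""
    def grow(t):
        step = np.zeros((LAYERS, BATCH, KV_HEADS, 1, HEAD_DIM), np.float32)
        return np.concatenate([t, step], axis=3)  # axis 3 = seq_len
    return grow(past_k), grow(past_v)

# Prefill covers the full context: 130 audio-embedding frames + 8 prompt tokens
# (the prompt length here is illustrative).
ctx_len = 130 + 8
past_k = np.zeros((LAYERS, BATCH, KV_HEADS, ctx_len, HEAD_DIM), np.float32)
past_v = np.zeros_like(past_k)

# The decode loop just feeds the returned cache into the next step.
for _ in range(4):
    past_k, past_v = fake_decode_step(past_k, past_v)

print(past_k.shape)  # (28, 1, 8, 142, 128) after 4 decoded tokens
```

Each cached position costs 28 layers × 2 (K and V) × 8 heads × 128 dims = 57,344 values, i.e. about 224 KiB at FP32.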

Weight-Only INT8 Asymmetric Compression

Quantization was applied with NNCF compress_weights:

import nncf
import openvino as ov

core  = ov.Core()
model = core.read_model("decoder_prefill_kv_model.xml")

# Weight-only INT8 asymmetric compression; no calibration dataset required.
quantized = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT8_ASYM,
)

# Saves the compressed IR over the original file; the same step was
# repeated for the other INT8 models listed above.
ov.save_model(quantized, "decoder_prefill_kv_model.xml")

Only the weights are compressed; activations remain FP32 at runtime. This eliminates the need for calibration data and avoids the accuracy collapse that full post-training quantization causes on speech models when calibration data is limited.

Why not full PTQ? Full activation quantization (nncf.quantize) with a small calibration set (~25 samples) produces garbled output on Qwen3-ASR. Weight-only compression (compress_weights) gives a clean accuracy/size trade-off with zero calibration overhead.


Audio Constraints

  • Maximum 10 seconds per chunk. The audio encoder was exported with a fixed mel spectrogram shape of (128, 1000), corresponding to exactly 10 s at 16 kHz. Longer audio must be split before inference.
  • 16,000 Hz sample rate, mono, float32 samples
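Splitting longer audio into fixed 10 s windows can be sketched with plain NumPy. This assumes shorter trailing chunks are zero-padded to the full window (the padding strategy is an assumption, and `chunk_audio` is a hypothetical helper; mel extraction itself is not shown):

```python
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SAMPLES = 10 * SAMPLE_RATE  # 10 s, matching the fixed (128, 1000) mel shape

def chunk_audio(audio: np.ndarray) -> list[np.ndarray]:
    """Split mono float32 audio into exact 10 s chunks, zero-padding the last."""
    chunks = []
    for start in range(0, len(audio), CHUNK_SAMPLES):
        chunk = audio[start:start + CHUNK_SAMPLES]
        if len(chunk) < CHUNK_SAMPLES:
            chunk = np.pad(chunk, (0, CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk.astype(np.float32))
    return chunks

# 25 s of audio -> three chunks of exactly 160,000 samples each
chunks = chunk_audio(np.zeros(25 * SAMPLE_RATE, dtype=np.float32))
```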

CPU Benchmarks

Tested on CPU device, 10-second Chinese speech segment:

Mode                                      RTF
Full-context FP16 (no KV cache)           3.06×
Explicit KV-Cache FP16                    0.47×
Explicit KV-Cache INT8_ASYM (this repo)   0.22×

RTF (real-time factor) is processing time divided by audio duration; RTF < 1.0 means faster than real time.
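The table's figures translate directly into wall-clock time for the benchmark's 10 s segment (processing time = RTF × audio duration):

```python
# RTF values from the benchmark table above.
rtf_by_mode = {
    "full-context FP16": 3.06,
    "KV-cache FP16": 0.47,
    "KV-cache INT8_ASYM": 0.22,
}

# Implied processing time for a 10 s segment, per mode.
seconds = {mode: rtf * 10.0 for mode, rtf in rtf_by_mode.items()}
# The INT8 KV-cache pipeline takes ~2.2 s for 10 s of audio, roughly a
# 14x speedup over the no-cache baseline (3.06 / 0.22 ≈ 13.9).
```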


Repository Contents

audio_encoder_model.xml / .bin          FP16  audio mel encoder
thinker_embeddings_model.xml / .bin     INT8  token embedding table
decoder_prefill_kv_model.xml / .bin     INT8  full-context prefill, outputs past KV
decoder_kv_model.xml / .bin             INT8  single-step decode, explicit KV I/O
prompt_template.json                         token IDs for prompt construction
vocab.json / merges.txt                      BPE tokenizer files
config.json / tokenizer_config.json          model configuration

Supported Languages

30 languages including Chinese, English, Japanese, Cantonese, Korean, and more. See the "supported_languages" field in prompt_template.json for the complete list.


License

Apache 2.0 — same as the original Qwen/Qwen3-ASR-1.7B.
