# Qwen3-ASR-1.7B — OpenVINO INT8 with Explicit KV-Cache
An OpenVINO-optimized version of Qwen/Qwen3-ASR-1.7B, exported and quantized independently as a community effort. It is not affiliated with Intel or any official OpenVINO project. GPU support (Intel or NVIDIA) has not been tested.
## Model Architecture
The inference pipeline is split into four OpenVINO IR models:
| File | Precision | Shape In | Shape Out |
|---|---|---|---|
| `audio_encoder_model` | FP16 | mel `(128, 1000)` | audio_embeds `(1, 130, 2048)` |
| `thinker_embeddings_model` | INT8 | input_ids `(1, L)` | token_embeds `(1, L, 2048)` |
| `decoder_prefill_kv_model` | INT8 | input_embeds `(1, L, 2048)`, position_ids | logits, past_keys `(28, 1, 8, L, 128)`, past_values |
| `decoder_kv_model` | INT8 | new_embed `(1, 1, 2048)`, new_pos, past_keys, past_values | logits, new_keys, new_values |
## Quantization Approach
### Explicit KV-Cache (not Stateful)
The decoder is split into two models that pass KV tensors explicitly between steps:
- Prefill (`decoder_prefill_kv_model`): processes the full context (audio embeddings + prompt tokens) in a single forward pass, returning `past_keys` and `past_values` as output tensors.
- Decode (`decoder_kv_model`): accepts one new token embedding at a time along with the accumulated KV tensors, appends one step, and returns updated `new_keys`/`new_values`.
This design does not rely on OpenVINO stateful model internals. KV tensors are plain NumPy arrays, making the inference loop fully transparent and portable.
```
Prefill: [audio_embeds + prompt_embeds] → logit₀, past_K, past_V
Decode₁: [emb₁, pos₁, past_K, past_V]   → logit₁, K₁, V₁
Decode₂: [emb₂, pos₂, K₁, V₁]           → logit₂, K₂, V₂
...
```

KV tensor shape: `(28 layers, 1 batch, 8 GQA heads, seq_len, 128 head_dim)`
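The append step above can be sketched in plain NumPy. This is a minimal illustration of the KV layout, using zero tensors in place of real model outputs; the context length of 140 (130 audio embeddings plus an assumed 10 prompt tokens) is only an example.

```python
import numpy as np

# Layout from the table above: (layers, batch, gqa_heads, seq_len, head_dim)
LAYERS, BATCH, HEADS, HEAD_DIM = 28, 1, 8, 128

def append_kv(past_keys, past_values, new_keys, new_values):
    """Concatenate one decode step onto the accumulated KV tensors (seq axis = 3)."""
    keys = np.concatenate([past_keys, new_keys], axis=3)
    values = np.concatenate([past_values, new_values], axis=3)
    return keys, values

# Prefill produced KV for an example 140-token context.
past_k = np.zeros((LAYERS, BATCH, HEADS, 140, HEAD_DIM), dtype=np.float32)
past_v = np.zeros_like(past_k)

# One decode step emits KV for a single new position.
new_k = np.zeros((LAYERS, BATCH, HEADS, 1, HEAD_DIM), dtype=np.float32)
new_v = np.zeros_like(new_k)

past_k, past_v = append_kv(past_k, past_v, new_k, new_v)
print(past_k.shape)  # (28, 1, 8, 141, 128)
```

Because the cache is just a NumPy array, the same `append_kv` step works unchanged on any device or runtime that returns plain tensors.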
### Weight-Only INT8 Asymmetric Compression
Quantization was applied with NNCF `compress_weights`:
```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("decoder_prefill_kv_model.xml")

quantized = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT8_ASYM,
)

ov.save_model(quantized, "decoder_prefill_kv_model.xml")
```
Only the weights are compressed; activations remain FP32. This eliminates the need for calibration data and avoids the accuracy collapse that full PTQ causes on speech models when calibration data is limited.
Why not full PTQ? Full activation quantization (`nncf.quantize`) with a small calibration set (~25 samples) produces garbled output on Qwen3-ASR. Weight-only compression (`compress_weights`) gives a clean accuracy/size trade-off with zero calibration overhead.
## Audio Constraints
- Maximum 10 seconds per chunk. The audio encoder was exported with a fixed mel spectrogram shape of `(128, 1000)`, corresponding to exactly 10 s at 16 kHz. Longer audio must be split before inference.
- 16,000 Hz, mono (float32).
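Splitting longer audio into 10-second chunks can be done with plain NumPy. This is a minimal sketch; the `split_audio` helper is not part of the repository, and it assumes the input is already 16 kHz mono float32.

```python
import numpy as np

SAMPLE_RATE = 16_000            # required input rate
CHUNK_SECONDS = 10              # encoder's fixed window
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS

def split_audio(audio: np.ndarray) -> list:
    """Split mono float32 audio into consecutive chunks of at most 10 s."""
    return [audio[i:i + CHUNK_SAMPLES]
            for i in range(0, len(audio), CHUNK_SAMPLES)]

audio = np.zeros(25 * SAMPLE_RATE, dtype=np.float32)  # 25 s of silence
chunks = split_audio(audio)
print([len(c) / SAMPLE_RATE for c in chunks])  # [10.0, 10.0, 5.0]
```

Note that the trailing chunk is shorter than 10 s; depending on how the mel features are computed, it may need zero-padding to fill the fixed `(128, 1000)` input shape.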
## CPU Benchmarks
Measured on a CPU device with a 10-second Chinese speech segment:
| Mode | RTF |
|---|---|
| Full-context FP16 (no KV cache) | 3.06× |
| Explicit KV-Cache FP16 | 0.47× |
| Explicit KV-Cache INT8_ASYM (this repo) | 0.22× |
RTF < 1.0 means faster than real-time.
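For reference, the real-time factor is simply processing time divided by audio duration. The 2.2 s processing time below is a hypothetical figure chosen to match the table's 0.22× row:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; < 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

# e.g. a 10 s clip transcribed in 2.2 s:
print(real_time_factor(2.2, 10.0))  # ≈ 0.22
```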
## Repository Contents

| File | Description |
|---|---|
| `audio_encoder_model.xml` / `.bin` | FP16 audio mel encoder |
| `thinker_embeddings_model.xml` / `.bin` | INT8 token embedding table |
| `decoder_prefill_kv_model.xml` / `.bin` | INT8 full-context prefill, outputs past KV |
| `decoder_kv_model.xml` / `.bin` | INT8 single-step decode, explicit KV I/O |
| `prompt_template.json` | Token IDs for prompt construction |
| `vocab.json` / `merges.txt` | BPE tokenizer files |
| `config.json` / `tokenizer_config.json` | Model configuration |
## Supported Languages
30 languages including Chinese, English, Japanese, Cantonese, Korean, and more.
See `prompt_template.json` → `"supported_languages"` for the complete list.
## License
Apache 2.0 — same as the original Qwen/Qwen3-ASR-1.7B.