Commit ec7d0d5 (verified, parent b795ad8) by dseditor: Upload README.md with huggingface_hub

---
license: apache-2.0
base_model:
- Qwen/Qwen3-ASR-1.7B
tags:
- OpenVINO
---

# Qwen3-ASR-1.7B — OpenVINO INT8 with Explicit KV-Cache

An OpenVINO-optimized version of [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B),
exported and quantized independently as a community effort.
**Not affiliated with Intel or any official OpenVINO project.**
GPU support (Intel or NVIDIA) has not been tested.

---

## Model Architecture

The inference pipeline is split into four OpenVINO IR models:

| File | Precision | Shape In | Shape Out |
|------|-----------|----------|-----------|
| `audio_encoder_model` | FP16 | `mel (128, 1000)` | `audio_embeds (1, 130, 2048)` |
| `thinker_embeddings_model` | INT8 | `input_ids (1, L)` | `token_embeds (1, L, 2048)` |
| `decoder_prefill_kv_model` | INT8 | `input_embeds (1, L, 2048)`, `position_ids` | `logits`, `past_keys (28, 1, 8, L, 128)`, `past_values` |
| `decoder_kv_model` | INT8 | `new_embed (1, 1, 2048)`, `new_pos`, `past_keys`, `past_values` | `logits`, `new_keys`, `new_values` |

---

## Quantization Approach

### Explicit KV-Cache (not Stateful)

The decoder is split into two models that pass KV tensors **explicitly** between steps:

1. **Prefill** (`decoder_prefill_kv_model`): processes the full context (audio embeddings + prompt tokens) in a single forward pass, returning `past_keys` and `past_values` as output tensors.
2. **Decode** (`decoder_kv_model`): accepts one new token embedding at a time along with the accumulated KV tensors, appends one step, and returns updated `new_keys` / `new_values`.

This design does not rely on OpenVINO's stateful-model internals: KV tensors are plain NumPy arrays, making the inference loop fully transparent and portable.

```
Prefill: [audio_embeds + prompt_embeds] → logit₀, past_K, past_V
Decode₁: [emb₁, pos₁, past_K, past_V] → logit₁, K₁, V₁
Decode₂: [emb₂, pos₂, K₁, V₁] → logit₂, K₂, V₂
...
```

KV tensor shape: `(28 layers, 1 batch, 8 GQA heads, seq_len, 128 head_dim)`
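
The loop above can be sketched with plain NumPy. This is only a shape-level illustration: `prefill` and `decode` below are stand-ins for the two compiled OpenVINO models (real code would build them with `core.compile_model(...)` and pass named input tensors), and `VOCAB` is a small placeholder, not the real vocabulary size.

```python
import numpy as np

# Dimensions from the table above; VOCAB is a placeholder for illustration.
LAYERS, HEADS, HEAD_DIM, HIDDEN, VOCAB = 28, 8, 128, 2048, 1000

def prefill(input_embeds, position_ids):
    # Stand-in for decoder_prefill_kv_model: full-context pass, emits past KV.
    L = input_embeds.shape[1]
    kv = np.zeros((LAYERS, 1, HEADS, L, HEAD_DIM), np.float32)
    return np.zeros((1, L, VOCAB), np.float32), kv, kv.copy()

def decode(new_embed, new_pos, past_keys, past_values):
    # Stand-in for decoder_kv_model: appends one step along the seq axis.
    step = np.zeros((LAYERS, 1, HEADS, 1, HEAD_DIM), np.float32)
    new_keys = np.concatenate([past_keys, step], axis=3)
    new_values = np.concatenate([past_values, step], axis=3)
    return np.zeros((1, 1, VOCAB), np.float32), new_keys, new_values

# Greedy loop: prefill once over the full context, then decode token by token.
context = np.zeros((1, 140, HIDDEN), np.float32)  # audio embeds + prompt embeds
logits, K, V = prefill(context, np.arange(140)[None])
tokens = [int(logits[0, -1].argmax())]
for pos in range(140, 143):
    emb = np.zeros((1, 1, HIDDEN), np.float32)  # embedding of tokens[-1]
    logits, K, V = decode(emb, np.array([[pos]]), K, V)
    tokens.append(int(logits[0, -1].argmax()))
```

Because the KV tensors are ordinary arrays, the cache grows by exactly one sequence position per decode step and can be inspected or serialized at any point.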

### Weight-Only INT8 Asymmetric Compression

Quantization was applied with [NNCF](https://github.com/openvinotoolkit/nncf) `compress_weights`:

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("decoder_prefill_kv_model.xml")

quantized = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT8_ASYM,
)
ov.save_model(quantized, "decoder_prefill_kv_model.xml")
```

Only the **weights** are compressed; activations remain FP32. This eliminates the need for calibration data and avoids the accuracy collapse that full PTQ causes on speech models when calibration data is limited.

> **Why not full PTQ?**
> Full activation quantization (`nncf.quantize`) with a small calibration set (~25 samples)
> produces garbled output on Qwen3-ASR. Weight-only compression (`compress_weights`) gives
> a clean accuracy/size trade-off with zero calibration overhead.

---

## Audio Constraints

- **Maximum 10 seconds per chunk.**
  The audio encoder was exported with a fixed mel-spectrogram shape of `(128, 1000)`,
  corresponding to exactly 10 s at 16 kHz. Longer audio must be split before inference.
- **16,000 Hz, mono, float32.**
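
Splitting longer audio into encoder-sized chunks can be done with plain NumPy (a minimal sketch, assuming 10 s × 16,000 Hz = 160,000 samples per chunk and zero-padding for the final partial chunk):

```python
import numpy as np

CHUNK = 10 * 16_000  # 160,000 samples = 10 s at 16 kHz

def split_audio(audio: np.ndarray) -> list[np.ndarray]:
    """Split mono float32 audio into fixed 10 s chunks, zero-padding the last."""
    chunks = []
    for start in range(0, len(audio), CHUNK):
        piece = audio[start:start + CHUNK]
        if len(piece) < CHUNK:
            piece = np.pad(piece, (0, CHUNK - len(piece)))
        chunks.append(piece.astype(np.float32))
    return chunks
```

Each chunk is then fed through the mel front-end and the encoder independently; whether to transcribe chunks separately or stitch transcripts afterwards is left to the caller.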

---

## CPU Benchmarks

Tested on CPU with a 10-second Chinese speech segment:

| Mode | RTF |
|------|-----|
| Full-context FP16 (no KV cache) | 3.06× |
| Explicit KV-Cache FP16 | 0.47× |
| **Explicit KV-Cache INT8_ASYM (this repo)** | **0.22×** |

RTF (real-time factor) is processing time divided by audio duration; RTF < 1.0 means faster than real-time.
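
As a concrete check of the table above (a trivial helper, not part of this repo):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return processing_seconds / audio_seconds

# The 0.22× row corresponds to a 10 s clip processed in about 2.2 s.
assert abs(rtf(2.2, 10.0) - 0.22) < 1e-9
```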

---

## Repository Contents

```
audio_encoder_model.xml / .bin          FP16 audio mel encoder
thinker_embeddings_model.xml / .bin     INT8 token embedding table
decoder_prefill_kv_model.xml / .bin     INT8 full-context prefill, outputs past KV
decoder_kv_model.xml / .bin             INT8 single-step decode, explicit KV I/O
prompt_template.json                    token IDs for prompt construction
vocab.json / merges.txt                 BPE tokenizer files
config.json / tokenizer_config.json     model configuration
```

---

## Supported Languages

30 languages, including Chinese, English, Japanese, Cantonese, Korean, and more.
See `prompt_template.json` → `"supported_languages"` for the complete list.

---

## License

Apache 2.0 — same as the original [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B).