Commit ec7d0d5 (verified, parent b795ad8) by dseditor: Upload README.md with huggingface_hub

---
license: apache-2.0
base_model:
- Qwen/Qwen3-ASR-1.7B
tags:
- OpenVINO
---

# Qwen3-ASR-1.7B — OpenVINO INT8 with Explicit KV-Cache

An OpenVINO-optimized version of [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B),
exported and quantized independently as a community effort.
**Not affiliated with Intel or any official OpenVINO project.**
GPU support (Intel or NVIDIA) has not been tested.

---

## Model Architecture

The inference pipeline is split into four OpenVINO IR models:

| File | Precision | Shape In | Shape Out |
|------|-----------|----------|-----------|
| `audio_encoder_model` | FP16 | `mel (128, 1000)` | `audio_embeds (1, 130, 2048)` |
| `thinker_embeddings_model` | INT8 | `input_ids (1, L)` | `token_embeds (1, L, 2048)` |
| `decoder_prefill_kv_model` | INT8 | `input_embeds (1, L, 2048)`, `position_ids` | `logits`, `past_keys (28, 1, 8, L, 128)`, `past_values` |
| `decoder_kv_model` | INT8 | `new_embed (1, 1, 2048)`, `new_pos`, `past_keys`, `past_values` | `logits`, `new_keys`, `new_values` |

---

## Quantization Approach

### Explicit KV-Cache (not Stateful)

The decoder is split into two models that pass KV tensors **explicitly** between steps:

1. **Prefill** (`decoder_prefill_kv_model`): processes the full context (audio embeddings + prompt tokens) in a single forward pass, returning `past_keys` and `past_values` as output tensors.
2. **Decode** (`decoder_kv_model`): accepts one new token embedding at a time along with the accumulated KV tensors, appends one step, and returns updated `new_keys` / `new_values`.

This design does not rely on OpenVINO's stateful-model internals: KV tensors are plain NumPy arrays, making the inference loop fully transparent and portable.

```
Prefill: [audio_embeds + prompt_embeds] → logit₀, past_K, past_V
Decode₁: [emb₁, pos₁, past_K, past_V] → logit₁, K₁, V₁
Decode₂: [emb₂, pos₂, K₁, V₁] → logit₂, K₂, V₂
...
```

KV tensor shape: `(28 layers, 1 batch, 8 GQA heads, seq_len, 128 head_dim)`
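
The loop above can be sketched with plain NumPy. This is only a shape-level illustration: `prefill` and `decode` below are stand-ins for the two compiled OpenVINO models (real code would build them with `core.compile_model(...)` and pass named input tensors), and `VOCAB` is a small placeholder, not the real vocabulary size.

```python
import numpy as np

# Dimensions from the table above; VOCAB is a placeholder for illustration.
LAYERS, HEADS, HEAD_DIM, HIDDEN, VOCAB = 28, 8, 128, 2048, 1000

def prefill(input_embeds, position_ids):
    # Stand-in for decoder_prefill_kv_model: full-context pass, emits past KV.
    L = input_embeds.shape[1]
    kv = np.zeros((LAYERS, 1, HEADS, L, HEAD_DIM), np.float32)
    return np.zeros((1, L, VOCAB), np.float32), kv, kv.copy()

def decode(new_embed, new_pos, past_keys, past_values):
    # Stand-in for decoder_kv_model: appends one step along the seq axis.
    step = np.zeros((LAYERS, 1, HEADS, 1, HEAD_DIM), np.float32)
    new_keys = np.concatenate([past_keys, step], axis=3)
    new_values = np.concatenate([past_values, step], axis=3)
    return np.zeros((1, 1, VOCAB), np.float32), new_keys, new_values

# Greedy loop: prefill once over the full context, then decode token by token.
context = np.zeros((1, 140, HIDDEN), np.float32)  # audio embeds + prompt embeds
logits, K, V = prefill(context, np.arange(140)[None])
tokens = [int(logits[0, -1].argmax())]
for pos in range(140, 143):
    emb = np.zeros((1, 1, HIDDEN), np.float32)  # embedding of tokens[-1]
    logits, K, V = decode(emb, np.array([[pos]]), K, V)
    tokens.append(int(logits[0, -1].argmax()))
```

Because the KV tensors are ordinary arrays, the cache grows by exactly one sequence position per decode step and can be inspected or serialized at any point.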

### Weight-Only INT8 Asymmetric Compression

Quantization was applied with [NNCF](https://github.com/openvinotoolkit/nncf) `compress_weights`:

```python
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("decoder_prefill_kv_model.xml")

quantized = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT8_ASYM,
)
ov.save_model(quantized, "decoder_prefill_kv_model.xml")
```

Only the **weights** are compressed; activations remain FP32. This eliminates the need for calibration data and avoids the accuracy collapse that full PTQ causes on speech models when calibration data is limited.

> **Why not full PTQ?**
> Full activation quantization (`nncf.quantize`) with a small calibration set (~25 samples)
> produces garbled output on Qwen3-ASR. Weight-only compression (`compress_weights`) gives
> a clean accuracy/size trade-off with zero calibration overhead.

---

## Audio Constraints

- **Maximum 10 seconds per chunk.**
  The audio encoder was exported with a fixed mel-spectrogram shape of `(128, 1000)`,
  corresponding to exactly 10 s at 16 kHz. Longer audio must be split before inference.
- **16,000 Hz, mono, float32.**
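
Splitting longer audio into encoder-sized chunks can be done with plain NumPy (a minimal sketch, assuming 10 s × 16,000 Hz = 160,000 samples per chunk and zero-padding for the final partial chunk):

```python
import numpy as np

CHUNK = 10 * 16_000  # 160,000 samples = 10 s at 16 kHz

def split_audio(audio: np.ndarray) -> list[np.ndarray]:
    """Split mono float32 audio into fixed 10 s chunks, zero-padding the last."""
    chunks = []
    for start in range(0, len(audio), CHUNK):
        piece = audio[start:start + CHUNK]
        if len(piece) < CHUNK:
            piece = np.pad(piece, (0, CHUNK - len(piece)))
        chunks.append(piece.astype(np.float32))
    return chunks
```

Each chunk is then fed through the mel front-end and the encoder independently; whether to transcribe chunks separately or stitch transcripts afterwards is left to the caller.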

---

## CPU Benchmarks

Tested on CPU with a 10-second Chinese speech segment:

| Mode | RTF |
|------|-----|
| Full-context FP16 (no KV cache) | 3.06× |
| Explicit KV-Cache FP16 | 0.47× |
| **Explicit KV-Cache INT8_ASYM (this repo)** | **0.22×** |

RTF (real-time factor) is processing time divided by audio duration; RTF < 1.0 means faster than real-time.
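
As a concrete check of the table above (a trivial helper, not part of this repo):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return processing_seconds / audio_seconds

# The 0.22× row corresponds to a 10 s clip processed in about 2.2 s.
assert abs(rtf(2.2, 10.0) - 0.22) < 1e-9
```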

---

## Repository Contents

```
audio_encoder_model.xml / .bin          FP16 audio mel encoder
thinker_embeddings_model.xml / .bin     INT8 token embedding table
decoder_prefill_kv_model.xml / .bin     INT8 full-context prefill, outputs past KV
decoder_kv_model.xml / .bin             INT8 single-step decode, explicit KV I/O
prompt_template.json                    token IDs for prompt construction
vocab.json / merges.txt                 BPE tokenizer files
config.json / tokenizer_config.json     model configuration
```

---

## Supported Languages

30 languages, including Chinese, English, Japanese, Cantonese, Korean, and more.
See `prompt_template.json` → `"supported_languages"` for the complete list.

---

## License

Apache 2.0 — same as the original [Qwen/Qwen3-ASR-1.7B](https://huggingface.co/Qwen/Qwen3-ASR-1.7B).