Trondtempia richiejp committed on
Commit ca66821 · 0 Parent(s)

Duplicate from LocalAI-io/LocalVQE


Co-authored-by: Richard Palethorpe <richiejp@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,41 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ localvqe-baseline.gguf filter=lfs diff=lfs merge=lfs -text
+ localvqe-finetune.gguf filter=lfs diff=lfs merge=lfs -text
+ localvqe-v1-f32.gguf filter=lfs diff=lfs merge=lfs -text
+ localvqe-v1-1.3M-f32.gguf filter=lfs diff=lfs merge=lfs -text
+ localvqe-technical-report.pdf filter=lfs diff=lfs merge=lfs -text
+ localvqe-v1.1-1.3M-f32.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,295 @@
+ ---
+ library_name: pytorch
+ tags:
+ - audio-to-audio
+ - speech-enhancement
+ - acoustic-echo-cancellation
+ - noise-suppression
+ - ggml
+ license: apache-2.0
+ ---
+
+ # LocalVQE
+
+ [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md.svg)](https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo)
+ [![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
+ [![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
+
+ **Local Voice Quality Enhancement**: a compact neural model for joint
+ acoustic echo cancellation (AEC), noise suppression, and dereverberation of
+ 16 kHz speech, designed to run on commodity CPUs in real time.
+
+ - 1.3 M parameters (~5 MB F32)
+ - ~1.66 ms per 16 ms frame on Zen4 (24 threads), **≈9.6× realtime**
+ - Causal, streaming: 256-sample hop, 16 ms algorithmic latency
+ - F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
+   PyTorch reference included for verification and research
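
The headline figures in the list above follow from simple arithmetic. The sketch below uses only numbers quoted in this README to derive the hop duration, the real-time factor, and the approximate F32 weight size. The variable names are illustrative and are not taken from the LocalVQE source.

```python
# Illustrative arithmetic only; names are not from the LocalVQE source.
SAMPLE_RATE_HZ = 16_000      # the model operates on 16 kHz speech
HOP_SAMPLES = 256            # streaming hop size
COMPUTE_MS_PER_HOP = 1.66    # quoted Zen4 figure (24 threads)
PARAMS = 1_300_000           # ~1.3 M parameters (rounded)
BYTES_PER_F32 = 4

# Duration of audio covered by one hop, in milliseconds.
hop_ms = 1000 * HOP_SAMPLES / SAMPLE_RATE_HZ
# How many times faster than real time one hop is processed.
realtime_factor = hop_ms / COMPUTE_MS_PER_HOP
# Raw weight storage at 4 bytes per parameter, in megabytes.
weights_mb = PARAMS * BYTES_PER_F32 / 1e6

print(f"hop duration: {hop_ms:.1f} ms")
print(f"real-time factor: {realtime_factor:.1f}x")
print(f"F32 weights: ~{weights_mb:.1f} MB")
```

The ~5.2 MB estimate is close to the published GGUF size; file metadata and the rounded parameter count would plausibly account for the difference.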
+
+ Try it live: <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>.
+
+ This page is the Hugging Face model card; it hosts the published weights.
+ Source code, build system, tests, and training pipeline live in the GitHub
+ repository: <https://github.com/localai-org/LocalVQE>.
+
+ The current release is **v1.1**, which fixes intermittent crackling that the
+ previous release produced under heavy background noise.
+
+ The technical report describing the architecture, streaming-state contract,
+ and streaming-causal normalisation operator is included in this repo as
+ [`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
+ like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
+ from an existing author in those categories. If you can endorse, please
+ reach out via the GitHub repo.
+
+ **Authors:**
+ - Richard Palethorpe ([richiejp](https://github.com/richiejp))
+ - Claude (Anthropic)
+
+ LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023,
+ *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
+ Cancellation, Noise Suppression and Dereverberation*,
+ [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)) that is smaller,
+ GGML-native, and tuned for streaming CPU inference. The architecture is
+ documented in the technical report linked above.
+
+ ## A concrete example
+
+ Picture a video call from a laptop. Your microphone picks up three things
+ alongside your voice:
+
+ 1. The remote participant's voice, played back through your speakers and
+    caught again by your mic. This is the **echo**. Without cancellation
+    they hear themselves a fraction of a second later.
+ 2. Your own voice bouncing off walls, desk, and monitor before reaching
+    the mic. This is **reverberation**, the "tunnel" or "bathroom" sound
+    that makes you sound far away to the listener.
+ 3. A fan, keyboard clatter, a dog barking, or traffic outside: plain
+    **background noise**.
+
+ LocalVQE removes all three in a single causal pass, frame by frame, on
+ the CPU, so only your voice reaches the far end.
+
+ ## Why this, and not a classical AEC/NS stack?
+
+ Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
+ NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
+ frame and remain a strong baseline when the acoustic path is benign. LocalVQE
+ is interesting when you want:
+
+ - **Robustness to non-linear echo paths** (small loudspeakers, handheld
+   devices, plastic laptop chassis) where linear AEC leaves residual echo.
+ - **Non-stationary noise suppression** (babble, keyboards, fans changing
+   speed) that energy-based noise estimators struggle with.
+ - **One model, many conditions**: no per-device tuning of step sizes,
+   forgetting factors, or VAD thresholds.
+ - **A single deterministic causal pass**: no double-talk detector, no
+   adaptation state that can diverge.
+
+ The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
+ ~1–2 ms/frame. On anything larger than a microcontroller that's still a
+ small fraction of the 16 ms real-time budget.
+
+ ## Why this, and not DeepVQE?
+
+ Microsoft never released DeepVQE: no weights, no reference
+ implementation, no streaming runtime. We re-implemented it from the
+ paper as a GGML graph at
+ [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
+ (the full-width ~7.5 M-parameter version) before starting LocalVQE.
+ LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
+ (~5 MB F32), small enough to run on commodity CPUs in real time.
+
+ ## Files in this repository
+
+ | File | Size | Description |
+ |---|---|---|
+ | `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
+ | `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export, the file the C++ inference engine loads. |
+
+ Only F32 GGUF is published today. A `quantize` tool is included in the
+ C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
+ released.
+
+ ## Validation Results
+
+ Full 800-clip evaluation on the
+ [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
+ (real recordings, not synthetic mixes).
+
+ | Scenario                          |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
+ |-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
+ | doubletalk                        | 115 |          4.70 |         2.35 |       8.4 dB |          2.85 |
+ | doubletalk-with-movement          | 185 |          4.63 |         2.35 |       8.3 dB |          2.80 |
+ | farend-singletalk                 | 107 |          2.98 |         4.91 |      44.7 dB |          1.93 |
+ | farend-singletalk-with-movement   | 193 |          3.40 |         4.95 |      45.0 dB |          1.91 |
+ | nearend-singletalk                | 200 |          4.99 |         4.05 |       2.5 dB |          3.13 |
+
+ - **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
+   quality predictor. "Echo" rates how well echo was removed; "degradation"
+   rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
+ - **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. It is only meaningful on
+   far-end single-talk, where the input is echo-only; on scenes with active
+   near-end speech it understates echo removal because both numerator and
+   denominator are dominated by speech.
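
The blind ERLE formula above can be sketched in a few lines of plain Python. This is an illustration of the stated formula only, not the evaluation script used to produce the table; the function name `blind_erle_db` and the toy signals are assumptions made for the example.

```python
import math


def blind_erle_db(mic, enhanced):
    """Blind ERLE in dB: 10*log10(mean(mic^2) / mean(enhanced^2))."""
    if len(mic) != len(enhanced) or len(mic) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(x * x for x in enhanced) / len(enhanced)
    if mic_power == 0.0:
        raise ValueError("mic signal is silent; ERLE is undefined")
    if enh_power == 0.0:
        return math.inf  # output is digital silence
    return 10.0 * math.log10(mic_power / enh_power)


# Toy far-end single-talk case: the "enhanced" signal is the mic signal
# scaled by 0.01 in amplitude, which corresponds to 40 dB of attenuation.
mic = [math.sin(0.1 * n) for n in range(1600)]
enhanced = [0.01 * x for x in mic]
print(f"{blind_erle_db(mic, enhanced):.1f} dB")
```

Because the metric only compares input and output energy, any near-end speech that survives enhancement keeps the denominator large, which is why the double-talk and near-end rows show single-digit values.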
+
+ ## Building the C++ Inference Engine
+
+ Source, build system, and tests live at
+ <https://github.com/localai-org/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
+ compiler. A [Nix](https://nixos.org/) flake is provided:
+
+ ```bash
+ git clone --recursive https://github.com/localai-org/LocalVQE.git
+ cd LocalVQE
+
+ # With Nix:
+ nix develop
+ cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
+ cmake --build ggml/build -j$(nproc)
+
+ # Without Nix: install cmake, gcc/clang, pkg-config, libsndfile, then:
+ cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
+ cmake --build ggml/build -j$(nproc)
+ ```
+
+ Binaries land in `ggml/build/bin/`. The CPU build produces multiple
+ `libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
+ Keep the binaries and `.so` files together.
+
+ ### Vulkan backend (embedded / integrated-GPU targets)
+
+ Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
+ CPU build: an additional `libggml-vulkan.so` is produced in
+ `ggml/build/bin/`, and the runtime loader picks it up when a Vulkan ICD is
+ present; otherwise it falls back to the CPU variants.
+
+ ```bash
+ cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
+ cmake --build ggml/build -j$(nproc)
+ ```
+
+ The Nix flake's dev shell already includes `vulkan-loader`,
+ `vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
+ from your distro (Debian: `libvulkan-dev vulkan-headers glslc`/`shaderc`).
+
+ ### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)
+
+ Measured with `bench` on a Zen4 desktop (Ryzen 9 7900). Each hop is a
+ full `ggml_backend_graph_compute`.
+
+ | Backend                     | Threads |     p50 |     p99 |     max |
+ |-----------------------------|--------:|--------:|--------:|--------:|
+ | CPU                         |       1 | 3.40 ms | 3.57 ms | 5.06 ms |
+ | CPU                         |       2 | 2.07 ms | 2.25 ms | 3.65 ms |
+ | CPU                         |       4 | 1.32 ms | 1.57 ms | 6.91 ms |
+ | Vulkan, AMD iGPU (RADV)     |     n/a | 4.43 ms | 4.62 ms | 5.07 ms |
+ | Vulkan, NVIDIA RTX 5070 Ti  |     n/a | 1.79 ms | 3.41 ms | 4.14 ms |
+
+ Vulkan p50/p99 are tight, but worst-case single-hop latency on a
+ shared desktop is sensitive to external GPU clients (display
+ compositor, browser). On a dedicated embedded device with no
+ compositor contending for the queue, expect the quieter end of the
+ range.
+
+ ## Running Inference
+
+ Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (see the file
+ list above) via `huggingface-cli`, the Hub web UI, or `hf_hub_download`
+ from `huggingface_hub`, then run one of the tools below.
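
For scripted downloads, a minimal sketch using `hf_hub_download` might look like the following. The repository ID `LocalAI-io/LocalVQE` is an assumption taken from the "Duplicate from" line of this commit; substitute the repository you are actually pulling from. The helper name `fetch_localvqe_weights` is hypothetical.

```python
REPO_ID = "LocalAI-io/LocalVQE"            # assumption: upstream repo named in this commit
FILENAME = "localvqe-v1.1-1.3M-f32.gguf"   # F32 GGUF listed in the file table


def fetch_localvqe_weights(repo_id: str = REPO_ID, filename: str = FILENAME) -> str:
    """Download the GGUF file into the local Hub cache and return its path."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


# Example (requires network access and `pip install huggingface_hub`):
#   model_path = fetch_localvqe_weights()
#   print(model_path)
```

The returned path points into the local Hub cache and can be passed directly to the command-line tools.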
+
+ ### CLI
+
+ ```bash
+ ./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
+   --in-wav mic.wav ref.wav \
+   --out-wav enhanced.wav
+ ```
+
+ Both the mic and the far-end reference inputs must be 16 kHz mono PCM.
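
Since the CLI expects a specific input format, it can help to validate files before running it. The sketch below uses Python's standard `wave` module to check sample rate and channel count; the function name `check_localvqe_input` is hypothetical and this check is not part of the LocalVQE tooling.

```python
import wave

EXPECTED_RATE_HZ = 16_000
EXPECTED_CHANNELS = 1


def check_localvqe_input(path: str) -> None:
    """Raise ValueError unless `path` is a 16 kHz mono WAV file."""
    # wave.open raises wave.Error for files it cannot parse, which
    # includes compressed (non-PCM) WAV encodings.
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE_HZ:
        raise ValueError(f"{path}: expected {EXPECTED_RATE_HZ} Hz, got {rate} Hz")
    if channels != EXPECTED_CHANNELS:
        raise ValueError(f"{path}: expected mono, got {channels} channels")


# Example:
#   for p in ("mic.wav", "ref.wav"):
#       check_localvqe_input(p)
```

Running the check on both `mic.wav` and `ref.wav` catches the most common mismatch, a 44.1 or 48 kHz recording, before the engine sees it.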
+
+ ### Benchmark
+
+ ```bash
+ ./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
+   --in-wav mic.wav ref.wav --iters 10 --profile
+ ```
+
+ ### Shared Library (C API)
+
+ ```bash
+ cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
+ cmake --build ggml/build -j$(nproc)
+ ```
+
+ Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
+ `ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
+ integration.
+
+ ### Quantizing (experimental)
+
+ Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
+ tool in the C++ build can produce GGUF variants from the F32 reference
+ for experimentation:
+
+ ```bash
+ ./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
+ ```
+
+ Expect end-to-end quality loss until proper per-tensor selection and
+ calibration have been worked through.
+
+ ## PyTorch Reference
+
+ `localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
+ It is provided for verification, ablation, and downstream research, not
+ for end-user inference, which should go through the GGML build above. The
+ model definition lives under `pytorch/` in the
+ [GitHub repo](https://github.com/localai-org/LocalVQE):
+
+ ```bash
+ git clone https://github.com/localai-org/LocalVQE.git
+ cd LocalVQE/pytorch
+ pip install -r requirements.txt
+ ```
+
+ ## Citing LocalVQE
+
+ If you use LocalVQE in academic work, please cite the repository via the
+ `CITATION.cff` at <https://github.com/localai-org/LocalVQE>. GitHub renders
+ a "Cite this repository" button that produces APA and BibTeX entries
+ automatically.
+
+ For a DOI, we recommend citing a specific release via
+ [Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
+ also cite the upstream DeepVQE paper:
+
+ ```bibtex
+ @inproceedings{indenbom2023deepvqe,
+   title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
+                Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
+   author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
+                and Saabas, Ando and P{\"a}rnamaa, Tanel
+                and Gu{\v{z}}vin, Jegor and Cutler, Ross},
+   booktitle = {Interspeech},
+   year      = {2023},
+   doi       = {10.21437/Interspeech.2023-2176}
+ }
+ ```
+
+ ## Dataset Attribution
+
+ Published weights are trained on data from the
+ [ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
+ (Microsoft, CC BY 4.0) and fine-tuned on the
+ [ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).
+
+ ## Safety Note
+
+ Training data was filtered by DNSMOS perceived-quality scores, which can
+ misclassify distressed speech (screaming, crying) as noise. LocalVQE may
+ attenuate or distort such signals and must not be relied upon for
+ emergency-call or other safety-critical applications.
+
+ ## License
+
+ Apache License 2.0.
localvqe-technical-report.pdf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:351d69f2f10bc775f77f5a034c196fe94c57634e27403412bec7d417ccdcb468
+ size 365911
localvqe-v1-1.3M-f32.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d5eaf577449d0f920d8ee5e1042b8ddc7b6627313a042c62e2ada1b42719ab30
+ size 5162720
localvqe-v1-1.3M.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:499d7cadfe939c2f7849ce2926c791de97c10f084fbfd8243794d199a0d54f8a
+ size 11656320
localvqe-v1.1-1.3M-f32.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c118227c6b433d6aa36d9e4b993e0f31aa60787ea38d301d04db917a4a2b0a84
+ size 5173088
localvqe-v1.1-1.3M.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:76aabaa3bca3a9d7989463226312aa2344f978403c3e0e007e58a15922c97707
+ size 11453482
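
The three-line files added above are Git LFS pointers: the repository stores only a spec version, a SHA-256 object ID, and a byte size, while the binary itself lives in LFS storage. As a rough sketch, a pointer can be parsed as shown below. The function name `parse_lfs_pointer` is hypothetical, and the example input is the technical-report pointer from this commit.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into version, hash algorithm, oid, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:351d69f2f10bc775f77f5a034c196fe94c57634e27403412bec7d417ccdcb468
size 365911
"""
info = parse_lfs_pointer(pointer)
print(info["oid_algo"], info["size"])
```

Comparing `info["size"]` and the digest against a downloaded file (for example with `hashlib.sha256`) is one way to confirm that the file matches the pointer recorded in the commit.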