---
library_name: pytorch
tags:
  - audio-to-audio
  - speech-enhancement
  - acoustic-echo-cancellation
  - noise-suppression
  - ggml
license: apache-2.0
---

# LocalVQE

[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md.svg)](https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo)
[![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)

**Local Voice Quality Enhancement**: a compact neural model for joint
acoustic echo cancellation (AEC), noise suppression, and dereverberation of
16 kHz speech, designed to run on commodity CPUs in real time.

- 1.3 M parameters (~5 MB F32)
- ~1.66 ms per 16 ms frame on Zen4 (24 threads), i.e. **≈9.6× realtime**
- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
  PyTorch reference included for verification and research
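
The figures in the list above follow from simple arithmetic. The following is a minimal Python sketch that reproduces them from the stated sample rate, hop size, per-frame compute time, and parameter count. The constant and function names are illustrative and are not taken from the LocalVQE code base.

```python
SAMPLE_RATE_HZ = 16_000      # input sample rate stated above
HOP_SAMPLES = 256            # samples consumed per streaming hop
COMPUTE_MS_PER_HOP = 1.66    # per-frame compute time quoted above
PARAMETERS = 1_300_000       # approximate parameter count
BYTES_PER_F32 = 4            # size of one 32-bit float weight


def hop_duration_ms(hop_samples: int, sample_rate_hz: int) -> float:
    """Audio duration covered by one hop, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(hop_ms: float, compute_ms: float) -> float:
    """How many times faster than real time one hop is processed."""
    return hop_ms / compute_ms


hop_ms = hop_duration_ms(HOP_SAMPLES, SAMPLE_RATE_HZ)   # 16.0 ms
speedup = realtime_factor(hop_ms, COMPUTE_MS_PER_HOP)   # about 9.6
weights_mb = PARAMETERS * BYTES_PER_F32 / 1e6           # about 5.2 MB

print(f"hop = {hop_ms:.1f} ms, speedup = {speedup:.1f}x, weights = {weights_mb:.1f} MB")
```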

Try it live: <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>.

This page is the Hugging Face model card; it hosts the published weights.
Source code, build system, tests, and training pipeline live in the GitHub
repository: <https://github.com/localai-org/LocalVQE>.

The current release is **v1.1**, which fixes intermittent crackling the
previous release produced under heavy background noise.

The technical report describing the architecture, streaming-state contract,
and streaming-causal normalisation operator is included in this repo as
[`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
from an existing author in those categories. If you can endorse, please
reach out via the GitHub repo.

**Authors:**
- Richard Palethorpe ([richiejp](https://github.com/richiejp))
- Claude (Anthropic)

LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023:
*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
Cancellation, Noise Suppression and Dereverberation*,
[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). It is smaller, GGML-native,
and tuned for streaming CPU inference. The architecture is documented in
the technical report linked above.

## A concrete example

Picture a video call from a laptop. Your microphone picks up three things
alongside your voice:

1. The remote participant's voice, played back through your speakers and
   caught again by your mic: this is the **echo**. Without cancellation
   they hear themselves a fraction of a second later.
2. Your own voice bouncing off walls, desk, and monitor before reaching
   the mic: this is **reverberation**, the "tunnel" or "bathroom" sound
   that makes you sound far away to the listener.
3. A fan, keyboard clatter, a dog barking, or traffic outside: plain
   **background noise**.

LocalVQE removes all three in a single causal pass, frame by frame, on
the CPU, so only your voice reaches the far end.

## Why this, and not a classical AEC/NS stack?

Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
frame and remain a strong baseline when the acoustic path is benign. LocalVQE
is interesting when you want:

- **Robustness to non-linear echo paths** (small loudspeakers, handheld
  devices, plastic laptop chassis) where linear AEC leaves residual echo.
- **Non-stationary noise suppression** (babble, keyboards, fans changing
  speed) that energy-based noise estimators struggle with.
- **One model, many conditions**: no per-device tuning of step sizes,
  forgetting factors, or VAD thresholds.
- **A single deterministic causal pass**: no double-talk detector, no
  adaptation state that can diverge.

The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
~1–2 ms/frame. On anything larger than a microcontroller that's still a
small fraction of a real-time budget.

## Why this, and not DeepVQE?

Microsoft never released DeepVQE: no weights, no reference
implementation, no streaming runtime. We re-implemented it from the
paper as a GGML graph at
[richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
(the full-width ~7.5 M-parameter version) before starting LocalVQE.
LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
(~5 MB F32), small enough to run on commodity CPUs in real time.

## Files in this repository

| File | Size | Description |
|---|---|---|
| `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export; this is what the C++ inference engine loads. |

Only F32 GGUF is published today. A `quantize` tool is included in the
C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
released.

## Validation Results

Full 800-clip eval on the
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge),
which consists of real recordings, not synthetic mixes.

| Scenario                          |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
| doubletalk                        | 115 |          4.70 |         2.35 |       8.4 dB |          2.85 |
| doubletalk-with-movement          | 185 |          4.63 |         2.35 |       8.3 dB |          2.80 |
| farend-singletalk                 | 107 |          2.98 |         4.91 |      44.7 dB |          1.93 |
| farend-singletalk-with-movement   | 193 |          3.40 |         4.95 |      45.0 dB |          1.91 |
| nearend-singletalk                | 200 |          4.99 |         4.05 |       2.5 dB |          3.13 |

- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
  quality predictor. "Echo" rates how well echo was removed; "degradation"
  rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
- **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
  far-end single-talk where the input is echo-only; on scenes with active
  near-end speech it understates echo removal because both numerator and
  denominator are dominated by speech.
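
To make the blind ERLE definition concrete, here is a minimal Python sketch of the formula above. It averages power over the whole clip. How the published evaluation estimates the expectation (per clip, per frame, or otherwise) is not stated here, so treat that choice and the function name `blind_erle_db` as assumptions.

```python
import math
from typing import Sequence


def blind_erle_db(mic: Sequence[float], enhanced: Sequence[float]) -> float:
    """Blind ERLE in dB: 10*log10(mean(mic^2) / mean(enhanced^2))."""
    if len(mic) == 0 or len(mic) != len(enhanced):
        raise ValueError("signals must be non-empty and equal in length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(y * y for y in enhanced) / len(enhanced)
    if mic_power == 0.0 or enh_power == 0.0:
        raise ValueError("ERLE is undefined when either signal is all zeros")
    return 10.0 * math.log10(mic_power / enh_power)


# Toy far-end single-talk case: the output is the echo-only input scaled
# by 0.01, so the amplitude drops 100-fold and the power drops by 40 dB.
mic = [math.sin(0.1 * n) for n in range(1600)]
enhanced = [0.01 * x for x in mic]
erle = blind_erle_db(mic, enhanced)
print(f"blind ERLE = {erle:.1f} dB")
```

With near-end speech present, most of the energy in both `mic` and `enhanced` is speech that should be kept, so the ratio stays close to 1 (0 dB) however much echo was removed.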

## Building the C++ Inference Engine

Source, build system, and tests live at
<https://github.com/localai-org/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
compiler. A [Nix](https://nixos.org/) flake is provided:

```bash
git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE

# With Nix:
nix develop
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

# Without Nix: install cmake, gcc/clang, pkg-config, libsndfile, then:
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)
```

Binaries land in `ggml/build/bin/`. The CPU build produces multiple
`libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
Keep the binaries and `.so` files together.

### Vulkan backend (embedded / integrated-GPU targets)

Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
CPU build: an additional `libggml-vulkan.so` is produced in
`ggml/build/bin/`, and the runtime loader picks it up when a Vulkan ICD is
present; otherwise it falls back to the CPU variants.

```bash
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
cmake --build ggml/build -j$(nproc)
```

The Nix flake's dev shell already includes `vulkan-loader`,
`vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
from your distro (Debian: `libvulkan-dev vulkan-headers
glslc`/`shaderc`).

### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

Measured with `bench` on a Zen4 desktop (Ryzen 9 7900). Each hop is a
full `ggml_backend_graph_compute`.

| Backend                     | Threads | p50     | p99     | max     |
|-----------------------------|--------:|--------:|--------:|--------:|
| CPU                         |       1 | 3.40 ms | 3.57 ms | 5.06 ms |
| CPU                         |       2 | 2.07 ms | 2.25 ms | 3.65 ms |
| CPU                         |       4 | 1.32 ms | 1.57 ms | 6.91 ms |
| Vulkan, AMD iGPU (RADV)     |     n/a | 4.43 ms | 4.62 ms | 5.07 ms |
| Vulkan, NVIDIA RTX 5070 Ti  |     n/a | 1.79 ms | 3.41 ms | 4.14 ms |

Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a
shared desktop is sensitive to external GPU clients (display
compositor, browser). On a dedicated embedded device with no
compositor contending for the queue, expect the quieter end of the
range.
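
To read the table against the 16 ms per-hop budget, the short Python sketch below converts each latency into a share of that budget. The numbers are copied from the table; the dictionary layout and the helper name `budget_share` are illustrative only.

```python
HOP_BUDGET_MS = 16.0  # 256 samples at 16 kHz

# (p50, p99, max) per-hop latency in ms, copied from the table above.
LATENCY_MS = {
    "CPU, 1 thread": (3.40, 3.57, 5.06),
    "CPU, 2 threads": (2.07, 2.25, 3.65),
    "CPU, 4 threads": (1.32, 1.57, 6.91),
    "Vulkan, AMD iGPU (RADV)": (4.43, 4.62, 5.07),
    "Vulkan, NVIDIA RTX 5070 Ti": (1.79, 3.41, 4.14),
}


def budget_share(latency_ms: float, budget_ms: float = HOP_BUDGET_MS) -> float:
    """Fraction of the per-hop budget consumed by one hop."""
    return latency_ms / budget_ms


for backend, (p50, p99, worst) in LATENCY_MS.items():
    print(f"{backend:28s} p50 {budget_share(p50):5.1%}  "
          f"p99 {budget_share(p99):5.1%}  max {budget_share(worst):5.1%}")

# True when even the slowest recorded hop of every backend fits the budget.
all_within_budget = all(worst < HOP_BUDGET_MS for _, _, worst in LATENCY_MS.values())
```

A share below 100% in the `max` column means the slowest hop recorded in that run still finished before the next hop of audio arrived.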

## Running Inference

Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (see the file list above)
via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
`huggingface_hub`. Then:

### CLI

```bash
./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
    --in-wav mic.wav ref.wav \
    --out-wav enhanced.wav
```

Expects 16 kHz mono PCM for both mic and far-end reference.
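
Before running the CLI it can help to confirm that both inputs match this format. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV files. The function name `check_localvqe_input` is hypothetical and not part of LocalVQE. The required bit depth is not stated here, so the sample width is reported but not enforced.

```python
import wave


def check_localvqe_input(path: str) -> dict:
    """Read a WAV header and verify the file is 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        info = {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "frames": wav.getnframes(),
        }
    if info["sample_rate"] != 16_000:
        raise ValueError(f"expected 16000 Hz, got {info['sample_rate']} Hz")
    if info["channels"] != 1:
        raise ValueError(f"expected mono, got {info['channels']} channels")
    return info
```

Calling `check_localvqe_input("mic.wav")` and `check_localvqe_input("ref.wav")` before invoking `localvqe` surfaces a sample-rate or channel mismatch as a clear error.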

### Benchmark

```bash
./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
    --in-wav mic.wav ref.wav --iters 10 --profile
```

### Shared Library (C API)

```bash
cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)
```

Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
`ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
integration.

### Quantizing (experimental)

Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
tool in the C++ build can produce GGUF variants from the F32 reference
for experimentation:

```bash
./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
```

Expect end-to-end quality loss until proper per-tensor selection and
calibration have been worked through.
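
For intuition about where the loss comes from, here is a simplified Python illustration of Q8_0-style block quantisation: each block of 32 weights shares one scale, and every weight is rounded to an 8-bit integer. This is a sketch of the general scheme, not the code of the `quantize` tool. It keeps the scale at full precision, whereas GGML stores it as a 16-bit float, and the random weights are synthetic rather than taken from the model.

```python
import random

BLOCK_SIZE = 32  # values per Q8_0 block in GGML


def q8_0_round_trip(values):
    """Quantise to 8-bit integers with one scale per block, then dequantise."""
    restored = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        scale = max(abs(v) for v in block) / 127.0
        if scale == 0.0:
            # An all-zero block is reproduced exactly.
            restored.extend(0.0 for _ in block)
            continue
        quantised = [round(v / scale) for v in block]  # integers in [-127, 127]
        restored.extend(q * scale for q in quantised)
    return restored


random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(1024)]  # synthetic weights
restored = q8_0_round_trip(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
largest = max(abs(w) for w in weights)
print(f"max abs error {max_error:.2e} for weights up to {largest:.2e}")
```

The rounding error per weight is at most half a quantisation step, so one large value in a block coarsens the resolution for the other 31 values in that block.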

## PyTorch Reference

`localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
It is provided for verification, ablation, and downstream research, not
for end-user inference, which should go through the GGML build above. The
model definition lives under `pytorch/` in the
[GitHub repo](https://github.com/localai-org/LocalVQE):

```bash
git clone https://github.com/localai-org/LocalVQE.git
cd LocalVQE/pytorch
pip install -r requirements.txt
```

## Citing LocalVQE

If you use LocalVQE in academic work, please cite the repository via the
`CITATION.cff` at <https://github.com/localai-org/LocalVQE>. GitHub renders
a "Cite this repository" button that produces APA and BibTeX entries
automatically.

For a DOI, we recommend citing a specific release via
[Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
also cite the upstream DeepVQE paper:

```bibtex
@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
               and Saabas, Ando and P{\"a}rnamaa, Tanel and Gu{\v{z}}vin, Jegor
               and Cutler, Ross},
  booktitle = {Interspeech},
  year      = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}
```

## Dataset Attribution

Published weights are trained on data from the
[ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
(Microsoft, CC BY 4.0) and fine-tuned on the
[ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).

## Safety Note

Training data was filtered by DNSMOS perceived-quality scores, which can
misclassify distressed speech (screaming, crying) as noise. LocalVQE may
attenuate or distort such signals and must not be relied upon for emergency
call or safety-critical applications.

## License

Apache License 2.0.