---
library_name: pytorch
tags:
- audio-to-audio
- speech-enhancement
- acoustic-echo-cancellation
- noise-suppression
- ggml
license: apache-2.0
---
# LocalVQE
[Demo](https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo) ·
[GitHub](https://github.com/localai-org/LocalVQE) ·
[License: Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
**Local Voice Quality Enhancement**: a compact neural model for joint
acoustic echo cancellation (AEC), noise suppression, and dereverberation of
16 kHz speech, designed to run on commodity CPUs in real time.
- 1.3 M parameters (~5 MB F32)
- ~1.66 ms per 16 ms frame on Zen4 (24 threads): **≈9.6× realtime**
- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
PyTorch reference included for verification and research
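The headline numbers above follow from the sample rate, hop size, and measured per-hop compute; a quick arithmetic check (values taken from the list above):

```python
SAMPLE_RATE = 16_000   # Hz
HOP = 256              # samples per streaming hop

hop_ms = 1000 * HOP / SAMPLE_RATE     # duration of one hop
print(hop_ms)                         # 16.0

compute_ms = 1.66                     # measured per-hop compute, Zen4 / 24 threads
print(round(hop_ms / compute_ms, 1))  # 9.6 -> ~9.6x realtime

params = 1.3e6                        # parameter count
print(params * 4 / 1e6)               # 5.2 -> ~5 MB of F32 weights
```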
Try it live: <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>.
This page is the Hugging Face model card; it hosts the published weights.
Source code, build system, tests, and training pipeline live in the GitHub
repository: <https://github.com/localai-org/LocalVQE>.
The current release is **v1.1**, which fixes the intermittent crackling that
the previous release produced under heavy background noise.
The technical report describing the architecture, streaming-state contract,
and streaming-causal normalisation operator is included in this repo as
[`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
from an existing author in those categories; if you can endorse, please
reach out via the GitHub repo.
**Authors:**
- Richard Palethorpe ([richiejp](https://github.com/richiejp))
- Claude (Anthropic)
LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023,
*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
Cancellation, Noise Suppression and Dereverberation*,
[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)): smaller, GGML-native,
and tuned for streaming CPU inference. The architecture is documented in
the technical report linked above.
## A concrete example
Picture a video call from a laptop. Your microphone picks up three things
alongside your voice:
1. The remote participant's voice, played back through your speakers and
caught again by your mic: this is the **echo**. Without cancellation
they hear themselves a fraction of a second later.
2. Your own voice bouncing off walls, desk, and monitor before reaching
the mic: this is **reverberation**, the "tunnel" or "bathroom" sound
that makes you feel far away from the listener.
3. A fan, keyboard clatter, a dog barking, or traffic outside: plain
**background noise**.
LocalVQE removes all three in a single causal pass, frame by frame, on
the CPU, so only your voice reaches the far end.
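In signal terms, the microphone capture is roughly near-end speech plus a delayed, attenuated copy of the far-end reference plus noise. A toy mixing sketch (pure Python, illustrative only; real echo paths are longer and non-linear, which is exactly why a learned model helps):

```python
import math
import random

def simulate_mic(near, far, echo_gain=0.5, delay=80, noise_std=0.01, seed=0):
    """Toy mic signal: near-end speech + delayed/attenuated far-end echo + noise."""
    rng = random.Random(seed)
    mic = []
    for n in range(len(near)):
        echo = echo_gain * far[n - delay] if n >= delay else 0.0
        mic.append(near[n] + echo + rng.gauss(0.0, noise_std))
    return mic

# 100 ms of toy signals at 16 kHz
n_samples = 1600
near = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(n_samples)]
far = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(n_samples)]
mic = simulate_mic(near, far)
# An AEC model like LocalVQE receives both `mic` and the `far` reference
# and must recover `near` from the mixture.
```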
## Why this, and not a classical AEC/NS stack?
Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
frame and remain a strong baseline when the acoustic path is benign. LocalVQE
is interesting when you want:
- **Robustness to non-linear echo paths** (small loudspeakers, handheld
devices, plastic laptop chassis) where linear AEC leaves residual echo.
- **Non-stationary noise suppression** (babble, keyboards, fans changing
speed) that energy-based noise estimators struggle with.
- **One model, many conditions**: no per-device tuning of step sizes,
forgetting factors, or VAD thresholds.
- **A single deterministic causal pass**: no double-talk detector, no
adaptation state that can diverge.
The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
~1-2 ms/frame. On anything larger than a microcontroller that's still a
small fraction of a real-time budget.
## Why this, and not DeepVQE?
Microsoft never released DeepVQE: no weights, no reference
implementation, no streaming runtime. We re-implemented it from the
paper as a GGML graph at
[richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
(the full-width ~7.5 M-parameter version) before starting LocalVQE.
LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
(~5 MB F32), small enough to run on commodity CPUs in real time.
## Files in this repository
| File | Size | Description |
|---|---|---|
| `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export: what the C++ inference engine loads. |
Only F32 GGUF is published today. A `quantize` tool is included in the
C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
released.
## Validation Results
Full 800-clip eval on the
[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
using real recordings, not synthetic mixes.
| Scenario | n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
| doubletalk | 115 | 4.70 | 2.35 | 8.4 dB | 2.85 |
| doubletalk-with-movement | 185 | 4.63 | 2.35 | 8.3 dB | 2.80 |
| farend-singletalk | 107 | 2.98 | 4.91 | 44.7 dB | 1.93 |
| farend-singletalk-with-movement | 193 | 3.40 | 4.95 | 45.0 dB | 1.91 |
| nearend-singletalk | 200 | 4.99 | 4.05 | 2.5 dB | 3.13 |
- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
quality predictor. "Echo" rates how well echo was removed; "degradation"
rates how clean the resulting speech is. 1-5 MOS scale, higher is better.
- **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
far-end single-talk where the input is echo-only; on scenes with active
near-end speech it understates echo removal because both numerator and
denominator are dominated by speech.
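The blind-ERLE definition above is a one-liner over signal energies; a sketch with a synthetic 20 dB example:

```python
import math

def blind_erle_db(mic, enh):
    """Blind ERLE: 10*log10(E[mic^2] / E[enh^2]), per the definition above."""
    e_mic = sum(x * x for x in mic) / len(mic)
    e_enh = sum(x * x for x in enh) / len(enh)
    return 10.0 * math.log10(e_mic / e_enh)

# Echo-only input whose amplitude is reduced 10x (i.e. 20 dB of attenuation):
mic = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
enh = [0.1 * x for x in mic]
print(round(blind_erle_db(mic, enh)))  # 20
```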
## Building the C++ Inference Engine
Source, build system, and tests live at
<https://github.com/localai-org/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
compiler. A [Nix](https://nixos.org/) flake is provided:
```bash
git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE
# With Nix:
nix develop
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)
# Without Nix: install cmake, gcc/clang, pkg-config, libsndfile, then:
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)
```
Binaries land in `ggml/build/bin/`. The CPU build produces multiple
`libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
Keep the binaries and `.so` files together.
### Vulkan backend (embedded / integrated-GPU targets)
Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
CPU build: an additional `libggml-vulkan.so` is produced in
`ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
present, otherwise it falls back to the CPU variants.
```bash
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
cmake --build ggml/build -j$(nproc)
```
The Nix flake's dev shell already includes `vulkan-loader`,
`vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
from your distro (Debian: `libvulkan-dev vulkan-headers
glslc`/`shaderc`).
### Streaming latency (per-hop, 16 kHz / 256-sample hop = 16 ms budget)
Measured with `bench` on a Zen4 desktop (Ryzen 9 7900). Each hop is a
full `ggml_backend_graph_compute`.
| Backend | Threads | p50 | p99 | max |
|-----------------------------|--------:|--------:|--------:|--------:|
| CPU | 1 | 3.40 ms | 3.57 ms | 5.06 ms |
| CPU | 2 | 2.07 ms | 2.25 ms | 3.65 ms |
| CPU | 4 | 1.32 ms | 1.57 ms | 6.91 ms |
| Vulkan (AMD iGPU, RADV)     |     n/a | 4.43 ms | 4.62 ms | 5.07 ms |
| Vulkan (NVIDIA RTX 5070 Ti) |     n/a | 1.79 ms | 3.41 ms | 4.14 ms |
Vulkan p50 and p99 are tight, but worst-case single-hop latency on a
shared desktop is sensitive to external GPU clients (display
compositor, browser). On a dedicated embedded device with no
compositor contending for the queue, expect the quieter end of the
range.
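Percentile figures like the table's p50/p99 are derived from per-hop wall-clock samples; a nearest-rank sketch of how such a summary can be computed (not necessarily the exact method `bench` uses):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# Hypothetical per-hop timings from a short run, in milliseconds:
laps = [3.4, 3.5, 3.4, 3.6, 5.1, 3.4, 3.5, 3.4, 3.5, 3.6]
print(percentile(laps, 50))  # 3.5 (median hop)
print(percentile(laps, 99))  # 5.1 (worst hop in this short run)
```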
## Running Inference
Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (the file list above)
either via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from
`huggingface_hub`. Then:
### CLI
```bash
./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
--in-wav mic.wav ref.wav \
--out-wav enhanced.wav
```
Expects 16 kHz mono PCM for both mic and far-end reference.
### Benchmark
```bash
./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
--in-wav mic.wav ref.wav --iters 10 --profile
```
### Shared Library (C API)
```bash
cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)
```
Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
`ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
integration.
### Quantizing (experimental)
Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
tool in the C++ build can produce GGUF variants from the F32 reference
for experimentation:
```bash
./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
```
Expect end-to-end quality loss until proper per-tensor selection and
calibration have been worked through.
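One way to quantify that loss while experimenting is to run the same clip through the F32 and quantized models and compare the two outputs sample-by-sample; a minimal SNR helper (generic sketch, not part of the repo):

```python
import math

def snr_db(reference, test):
    """SNR of `test` against `reference` in dB; higher means closer outputs."""
    sig = sum(r * r for r in reference)
    err = sum((r - t) ** 2 for r, t in zip(reference, test))
    if err == 0.0:
        return float("inf")
    return 10.0 * math.log10(sig / err)

# e.g. snr_db(f32_samples, q8_samples); values above ~40 dB are often
# considered perceptually transparent for speech.
```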
## PyTorch Reference
`localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
It is provided for verification, ablation, and downstream research, not
for end-user inference, which should go through the GGML build above. The
model definition lives under `pytorch/` in the
[GitHub repo](https://github.com/localai-org/LocalVQE):
```bash
git clone https://github.com/localai-org/LocalVQE.git
cd LocalVQE/pytorch
pip install -r requirements.txt
```
## Citing LocalVQE
If you use LocalVQE in academic work, please cite the repository via the
`CITATION.cff` at <https://github.com/localai-org/LocalVQE>; GitHub renders
a "Cite this repository" button that produces APA and BibTeX entries
automatically.
For a DOI, we recommend citing a specific release via
[Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
also cite the upstream DeepVQE paper:
```bibtex
@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin and
               Saabas, Ando and P{\"a}rnamaa, Tanel and Gu{\v{z}}vin, Jegor
               and Cutler, Ross},
  booktitle = {Interspeech},
  year      = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}
```
## Dataset Attribution
Published weights are trained on data from the
[ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
(Microsoft, CC BY 4.0) and fine-tuned on the
[ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).
## Safety Note
Training data was filtered by DNSMOS perceived-quality scores, which can
misclassify distressed speech (screaming, crying) as noise. LocalVQE may
attenuate or distort such signals and must not be relied upon for emergency
call or safety-critical applications.
## License
Apache License 2.0.