---
library_name: pytorch
tags:
- audio-to-audio
- speech-enhancement
- acoustic-echo-cancellation
- noise-suppression
- ggml
license: apache-2.0
---

# LocalVQE

[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md.svg)](https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo)
[![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)

**Local Voice Quality Enhancement**: a compact neural model for joint acoustic echo cancellation (AEC), noise suppression, and dereverberation of 16 kHz speech, designed to run on commodity CPUs in real time.

- 1.3 M parameters (~5 MB F32)
- ~1.66 ms per 16 ms frame on Zen4 (24 threads), i.e. **≈9.6× realtime**
- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml); PyTorch reference included for verification and research

Try it live: <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>.

This page is the Hugging Face model card; it hosts the published weights. Source code, build system, tests, and training pipeline live in the GitHub repository: <https://github.com/localai-org/LocalVQE>.

The current release is **v1.1**, which fixes intermittent crackling that the previous release produced under heavy background noise.

The technical report describing the architecture, streaming-state contract, and streaming-causal normalisation operator is included in this repo as [`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement from an existing author in those categories. If you can endorse, please reach out via the GitHub repo.

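
The real-time figures in the summary bullets at the top of this card follow from simple arithmetic. As a minimal sketch (the function and constant names are illustrative and not part of the LocalVQE code base), the per-hop budget and real-time factor can be checked like this:

```python
# Illustrative arithmetic only; names here are hypothetical, not LocalVQE APIs.
SAMPLE_RATE_HZ = 16_000      # input sample rate stated in this card
HOP_SAMPLES = 256            # streaming hop size stated in this card
COMPUTE_MS_PER_HOP = 1.66    # per-frame compute time quoted for Zen4, 24 threads


def hop_budget_ms(sample_rate_hz: int, hop_samples: int) -> float:
    """Wall-clock duration of one hop of audio, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(budget_ms: float, compute_ms: float) -> float:
    """How many times faster than real time one hop is processed."""
    return budget_ms / compute_ms


budget_ms = hop_budget_ms(SAMPLE_RATE_HZ, HOP_SAMPLES)
rtf = realtime_factor(budget_ms, COMPUTE_MS_PER_HOP)

print(f"hop budget: {budget_ms:.1f} ms")   # hop budget: 16.0 ms
print(f"realtime factor: {rtf:.1f}x")      # realtime factor: 9.6x
```

Plugging in a different per-hop compute time, such as a p50 value from the streaming latency table further down, gives the headroom for that configuration.
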
**Authors:**

- Richard Palethorpe ([richiejp](https://github.com/richiejp))
- Claude (Anthropic)

LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023: *DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation*, [arXiv:2306.03177](https://arxiv.org/abs/2306.03177)). It is smaller, GGML-native, and tuned for streaming CPU inference. The architecture is documented in the technical report linked above.

## A concrete example

Picture a video call from a laptop. Your microphone picks up three things alongside your voice:

1. The remote participant's voice, played back through your speakers and caught again by your mic. This is the **echo**. Without cancellation they hear themselves a fraction of a second later.
2. Your own voice bouncing off walls, desk, and monitor before reaching the mic. This is **reverberation**, the "tunnel" or "bathroom" sound that makes you sound far away to the listener.
3. A fan, keyboard clatter, a dog barking, or traffic outside. This is plain **background noise**.

LocalVQE removes all three in a single causal pass, frame by frame, on the CPU, so only your voice reaches the far end.

## Why this, and not a classical AEC/NS stack?

Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per frame and remain a strong baseline when the acoustic path is benign. LocalVQE is interesting when you want:

- **Robustness to non-linear echo paths** (small loudspeakers, handheld devices, plastic laptop chassis) where linear AEC leaves residual echo.
- **Non-stationary noise suppression** (babble, keyboards, fans changing speed) that energy-based noise estimators struggle with.
- **One model, many conditions**: no per-device tuning of step sizes, forgetting factors, or VAD thresholds.
- **A single deterministic causal pass**: no double-talk detector, no adaptation state that can diverge.

The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE ~1–2 ms/frame. On anything larger than a microcontroller that is still a small fraction of the 16 ms real-time budget.

## Why this, and not DeepVQE?

Microsoft never released DeepVQE: no weights, no reference implementation, no streaming runtime. We re-implemented it from the paper as a GGML graph at [richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml) (the full-width ~7.5 M-parameter version) before starting LocalVQE. LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters (~5 MB F32), small enough to run on commodity CPUs in real time.

## Files in this repository

| File | Size | Description |
|---|---|---|
| `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint: DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export, the file the C++ inference engine loads. |

Only F32 GGUF is published today. A `quantize` tool is included in the C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been released.

## Validation Results

Full 800-clip eval on the [ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge) (real recordings, not synthetic mixes).

| Scenario                        |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
|---------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
| doubletalk                      | 115 |          4.70 |         2.35 |       8.4 dB |          2.85 |
| doubletalk-with-movement        | 185 |          4.63 |         2.35 |       8.3 dB |          2.80 |
| farend-singletalk               | 107 |          2.98 |         4.91 |      44.7 dB |          1.93 |
| farend-singletalk-with-movement | 193 |          3.40 |         4.95 |      45.0 dB |          1.91 |
| nearend-singletalk              | 200 |          4.99 |         4.05 |       2.5 dB |          3.13 |

- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC quality predictor. "Echo" rates how well echo was removed; "deg" (degradation) rates how clean the resulting speech is. Both use a 1–5 MOS scale; higher is better.
- **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`, the ratio of microphone energy to enhanced-output energy in dB. It is only meaningful on far-end single-talk, where the input is echo-only; on scenes with active near-end speech it understates echo removal because both numerator and denominator are dominated by speech.

## Building the C++ Inference Engine

Source, build system, and tests live at <https://github.com/localai-org/LocalVQE>. Building requires CMake ≥ 3.20 and a C++17 compiler. A [Nix](https://nixos.org/) flake is provided:

```bash
git clone --recursive https://github.com/localai-org/LocalVQE.git
cd LocalVQE

# With Nix:
nix develop
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)

# Without Nix: install cmake, gcc/clang, pkg-config, libsndfile, then:
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
cmake --build ggml/build -j$(nproc)
```

Binaries land in `ggml/build/bin/`. The CPU build produces multiple `libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512), one of which is selected at runtime. Keep the binaries and `.so` files together.

### Vulkan backend (embedded / integrated-GPU targets)

Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the CPU build: an additional `libggml-vulkan.so` is produced in `ggml/build/bin/`, and the runtime loader picks it up when a Vulkan ICD is present; otherwise it falls back to the CPU variants.

```bash
cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
cmake --build ggml/build -j$(nproc)
```

The Nix flake's dev shell already includes `vulkan-loader`, `vulkan-headers`, and `shaderc`. Without Nix, install the equivalents from your distro (Debian: `libvulkan-dev vulkan-headers glslc`/`shaderc`).

### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)

Measured with `bench` on a Zen4 desktop (Ryzen 9 7900). Each hop is a full `ggml_backend_graph_compute`.

| Backend                     | Threads |     p50 |     p99 |     max |
|-----------------------------|--------:|--------:|--------:|--------:|
| CPU                         |       1 | 3.40 ms | 3.57 ms | 5.06 ms |
| CPU                         |       2 | 2.07 ms | 2.25 ms | 3.65 ms |
| CPU                         |       4 | 1.32 ms | 1.57 ms | 6.91 ms |
| Vulkan: AMD iGPU (RADV)     |     n/a | 4.43 ms | 4.62 ms | 5.07 ms |
| Vulkan: NVIDIA RTX 5070 Ti  |     n/a | 1.79 ms | 3.41 ms | 4.14 ms |

Vulkan p50/p95/p99 are tight, but worst-case single-hop latency on a shared desktop is sensitive to external GPU clients (display compositor, browser). On a dedicated embedded device with no compositor contending for the queue, expect the quieter end of the range.

## Running Inference

Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (see the file list above) via `huggingface-cli`, the Hub web UI, or `hf_hub_download` from `huggingface_hub`. Then:

### CLI

```bash
./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav \
  --out-wav enhanced.wav
```

Both the mic and far-end reference inputs must be 16 kHz mono PCM.

### Benchmark

```bash
./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
  --in-wav mic.wav ref.wav --iters 10 --profile
```

### Shared Library (C API)

```bash
cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
cmake --build ggml/build -j$(nproc)
```

This produces `liblocalvqe.so` with the API declared in `ggml/localvqe_api.h`. See `ggml/example_purego_test.go` in the GitHub repo for a Go / `purego` integration.

### Quantizing (experimental)

Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize` tool in the C++ build can produce GGUF variants from the F32 reference for experimentation:

```bash
./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
```

Expect end-to-end quality loss until proper per-tensor selection and calibration have been worked through.

## PyTorch Reference

`localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
It is provided for verification, ablation, and downstream research, not for end-user inference, which should go through the GGML build above. The model definition lives under `pytorch/` in the [GitHub repo](https://github.com/localai-org/LocalVQE):

```bash
git clone https://github.com/localai-org/LocalVQE.git
cd LocalVQE/pytorch
pip install -r requirements.txt
```

## Citing LocalVQE

If you use LocalVQE in academic work, please cite the repository via the `CITATION.cff` at <https://github.com/localai-org/LocalVQE>. GitHub renders a "Cite this repository" button that produces APA and BibTeX entries automatically. For a DOI, we recommend citing a specific release via [Zenodo](https://zenodo.org), which mints a DOI per GitHub release.

Please also cite the upstream DeepVQE paper:

```bibtex
@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin and Saabas, Ando and P{\"a}rnamaa, Tanel and Gu{\v{z}}vin, Jegor and Cutler, Ross},
  booktitle = {Interspeech},
  year      = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}
```

## Dataset Attribution

Published weights are pre-trained on data from the [ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge) (Microsoft, CC BY 4.0) and fine-tuned on the [ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).

## Safety Note

Training data was filtered by DNSMOS perceived-quality scores, which can misclassify distressed speech (screaming, crying) as noise. LocalVQE may attenuate or distort such signals and must not be relied upon for emergency-call or other safety-critical applications.

## License

Apache License 2.0.