Duplicate from LocalAI-io/LocalVQE

Co-authored-by: Richard Palethorpe <richiejp@users.noreply.huggingface.co>

Files changed (7)

.gitattributes +41 -0
README.md +295 -0
localvqe-technical-report.pdf +3 -0
localvqe-v1-1.3M-f32.gguf +3 -0
localvqe-v1-1.3M.pt +3 -0
localvqe-v1.1-1.3M-f32.gguf +3 -0
localvqe-v1.1-1.3M.pt +3 -0

.gitattributes ADDED

	@@ -0,0 +1,41 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+localvqe-baseline.gguf filter=lfs diff=lfs merge=lfs -text
+localvqe-finetune.gguf filter=lfs diff=lfs merge=lfs -text
+localvqe-v1-f32.gguf filter=lfs diff=lfs merge=lfs -text
+localvqe-v1-1.3M-f32.gguf filter=lfs diff=lfs merge=lfs -text
+localvqe-technical-report.pdf filter=lfs diff=lfs merge=lfs -text
+localvqe-v1.1-1.3M-f32.gguf filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,295 @@

+---
+library_name: pytorch
+tags:
+  - audio-to-audio
+  - speech-enhancement
+  - acoustic-echo-cancellation
+  - noise-suppression
+  - ggml
+license: apache-2.0
+---
+# LocalVQE
+[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-md.svg)](https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo)
+[![GitHub](https://img.shields.io/badge/GitHub-localai--org%2FLocalVQE-181717?logo=github)](https://github.com/localai-org/LocalVQE)
+[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
+**Local Voice Quality Enhancement** — a compact neural model for joint
+acoustic echo cancellation (AEC), noise suppression, and dereverberation of
+16 kHz speech, designed to run on commodity CPUs in real time.
+- 1.3 M parameters (~5 MB F32)
+- ~1.66 ms per 16 ms frame on Zen4 (24 threads) — **≈9.6× realtime**
+- Causal, streaming: 256-sample hop, 16 ms algorithmic latency
+- F32 reference inference in C++ via [GGML](https://github.com/ggml-org/ggml);
+  PyTorch reference included for verification and research
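The headline figures in the list above follow from simple arithmetic. A minimal sketch that uses only the numbers quoted there (the variable names are illustrative, and 4 bytes per F32 parameter is the only added assumption):

```python
# Real-time and size arithmetic for the figures quoted in the list above.
SAMPLE_RATE_HZ = 16_000      # 16 kHz speech
HOP_SAMPLES = 256            # samples consumed per streaming step
COMPUTE_MS_PER_HOP = 1.66    # quoted Zen4 measurement
PARAMS = 1_300_000           # 1.3 M parameters
BYTES_PER_F32 = 4

hop_ms = 1000.0 * HOP_SAMPLES / SAMPLE_RATE_HZ   # audio covered by one hop
realtime_factor = hop_ms / COMPUTE_MS_PER_HOP    # >1 means faster than real time
budget_share = COMPUTE_MS_PER_HOP / hop_ms       # fraction of the hop spent computing
f32_mb = PARAMS * BYTES_PER_F32 / 1e6            # raw weight storage in MB

print(f"hop duration:    {hop_ms:.1f} ms")
print(f"realtime factor: {realtime_factor:.1f}x")
print(f"budget used:     {budget_share:.1%}")
print(f"F32 weights:     {f32_mb:.1f} MB")
```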
+Try it live: <https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo>.
+This page is the Hugging Face model card — it hosts the published weights.
+Source code, build system, tests, and training pipeline live in the GitHub
+repository: <https://github.com/localai-org/LocalVQE>.
+The current release is **v1.1**, which fixes intermittent crackling the
+previous release produced under heavy background noise.
+The technical report describing the architecture, streaming-state contract,
+and streaming-causal normalisation operator is included in this repo as
+[`localvqe-technical-report.pdf`](localvqe-technical-report.pdf). We would
+like to publish it to arXiv (`eess.AS` / `cs.SD`) but need an endorsement
+from an existing author in those categories — if you can endorse, please
+reach out via the GitHub repo.
+**Authors:**
+- Richard Palethorpe ([richiejp](https://github.com/richiejp))
+- Claude (Anthropic)
+LocalVQE is a derivative of **DeepVQE** (Indenbom et al., Interspeech 2023 —
+*DeepVQE: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo
+Cancellation, Noise Suppression and Dereverberation*,
+[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)) — smaller, GGML-native,
+and tuned for streaming CPU inference. The architecture is documented in
+the technical report linked above.
+## A concrete example
+Picture a video call from a laptop. Your microphone picks up three things
+alongside your voice:
+1. The remote participant's voice, played back through your speakers and
+   caught again by your mic — this is the **echo**. Without cancellation
+   they hear themselves a fraction of a second later.
+2. Your own voice bouncing off walls, desk, and monitor before reaching
+   the mic — this is **reverberation**, the "tunnel" or "bathroom" sound
+   that makes you feel far away from the listener.
+3. A fan, keyboard clatter, a dog barking, or traffic outside — plain
+   **background noise**.
+LocalVQE removes all three in a single causal pass, frame by frame, on
+the CPU, so only your voice reaches the far end.
+## Why this, and not a classical AEC/NS stack?
+Hand-tuned DSP pipelines (NLMS/AP/Kalman AEC, Wiener/spectral-subtraction
+NS, MCRA noise tracking, RLS dereverb) can run in tens of microseconds per
+frame and remain a strong baseline when the acoustic path is benign. LocalVQE
+is interesting when you want:
+- **Robustness to non-linear echo paths** (small loudspeakers, handheld
+  devices, plastic laptop chassis) where linear AEC leaves residual echo.
+- **Non-stationary noise suppression** (babble, keyboards, fans changing
+  speed) that energy-based noise estimators struggle with.
+- **One model, many conditions** — no per-device tuning of step sizes,
+  forgetting factors, or VAD thresholds.
+- **A single deterministic causal pass** — no double-talk detector, no
+  adaptation state that can diverge.
+The trade-off is CPU: a classical stack might cost ~0.1 ms/frame, LocalVQE
+~1–2 ms/frame. On anything larger than a microcontroller that is still only
+about 6–13 % of the 16 ms per-frame real-time budget.
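For readers unfamiliar with the classical baseline, the following is a minimal sketch of a normalised LMS (NLMS) echo canceller, the simplest of the adaptive filters named above. It is a textbook illustration and not code from LocalVQE. The function name, tap count, step size, and toy signals are all assumptions chosen for the example. It shows the step size and adaptation state that a classical stack has to tune and protect.

```python
import math
import random


def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract a linear estimate of the echo of `ref` from `mic`."""
    w = [0.0] * taps          # adaptive filter weights: the adaptation state
    history = [0.0] * taps    # most recent reference samples, newest first
    out = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_est = sum(wi * xi for wi, xi in zip(w, history))
        e = d - echo_est      # residual that would be sent to the far end
        norm = eps + sum(xi * xi for xi in history)
        step = mu * e / norm  # step size normalised by reference power
        w = [wi + step * xi for wi, xi in zip(w, history)]
        out.append(e)
    return out


def energy(x):
    return sum(v * v for v in x)


# Toy far-end single-talk scene: the mic hears a delayed, scaled copy of the
# reference (a purely linear echo path) plus a little sensor noise.
random.seed(0)
n = 2000
ref = [random.uniform(-1.0, 1.0) for _ in range(n)]
noise = [0.001 * random.uniform(-1.0, 1.0) for _ in range(n)]
echo = [0.0, 0.0] + [0.5 * r for r in ref[:-2]]
mic = [s + v for s, v in zip(echo, noise)]

residual = nlms_cancel(mic, ref)
tail = slice(-500, None)      # measure after the filter has converged
erle_db = 10.0 * math.log10(energy(mic[tail]) / energy(residual[tail]))
print(f"ERLE over the last 500 samples: {erle_db:.1f} dB")
```

A linear filter like this does well on the toy scene because the echo path is linear. The non-linear paths listed above are where it leaves residual echo.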
+## Why this, and not DeepVQE?
+Microsoft never released DeepVQE — no weights, no reference
+implementation, no streaming runtime. We re-implemented it from the
+paper as a GGML graph at
+[richiejp/deepvqe-ggml](https://github.com/richiejp/deepvqe-ggml)
+(the full-width ~7.5 M-parameter version) before starting LocalVQE.
+LocalVQE is the same idea pruned and rebuilt to ~1.3 M parameters
+(~5 MB F32), small enough to run on commodity CPUs in real time.
+## Files in this repository
+| File | Size | Description |
+|---|---|---|
+| `localvqe-v1.1-1.3M.pt` | 11 MB | PyTorch checkpoint — DNS5 pre-training + ICASSP 2022/2023 AEC Challenge fine-tune. |
+| `localvqe-v1.1-1.3M-f32.gguf` | 5 MB | GGML F32 export — what the C++ inference engine loads. |
+Only F32 GGUF is published today. A `quantize` tool is included in the
+C++ build (see below); calibrated Q4_K / Q8_0 weights have not yet been
+released.
+## Validation Results
+Full 800-clip eval on the
+[ICASSP 2022 AEC Challenge blind test set](https://github.com/microsoft/AEC-Challenge)
+— real recordings, not synthetic mixes.
+| Scenario                          |   n | AECMOS echo ↑ | AECMOS deg ↑ | blind ERLE ↑ | DNSMOS OVRL ↑ |
+|-----------------------------------|----:|--------------:|-------------:|-------------:|--------------:|
+| doubletalk                        | 115 |          4.70 |         2.35 |       8.4 dB |          2.85 |
+| doubletalk-with-movement          | 185 |          4.63 |         2.35 |       8.3 dB |          2.80 |
+| farend-singletalk                 | 107 |          2.98 |         4.91 |      44.7 dB |          1.93 |
+| farend-singletalk-with-movement   | 193 |          3.40 |         4.95 |      45.0 dB |          1.91 |
+| nearend-singletalk                | 200 |          4.99 |         4.05 |       2.5 dB |          3.13 |
+- **AECMOS** (Purin et al., ICASSP 2022) is Microsoft's non-intrusive AEC
+  quality predictor. "Echo" rates how well echo was removed; "degradation"
+  rates how clean the resulting speech is. 1–5 MOS scale, higher is better.
+- **Blind ERLE** is `10·log10(E[mic²] / E[enh²])`. Only meaningful on
+  far-end single-talk where the input is echo-only; on scenes with active
+  near-end speech it understates echo removal because both numerator and
+  denominator are dominated by speech.
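The blind ERLE definition above translates directly into code. A minimal sketch in plain Python, assuming `mic` and `enh` are equal-length sequences of time-aligned samples. The function name and the `eps` guard are illustrative and are not taken from the LocalVQE evaluation scripts:

```python
import math


def blind_erle_db(mic, enh, eps=1e-12):
    """10*log10(E[mic^2] / E[enh^2]) over two equal-length sample sequences."""
    if len(mic) != len(enh) or not mic:
        raise ValueError("mic and enh must be non-empty and the same length")
    mic_power = sum(s * s for s in mic) / len(mic)
    enh_power = sum(s * s for s in enh) / len(enh)
    # eps keeps the ratio finite if the enhanced signal is digital silence.
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


# Far-end single-talk toy case: the output is the input scaled by 0.01,
# i.e. amplitude reduced 100x, which is a 40 dB drop in power.
mic = [math.sin(0.1 * n) for n in range(1600)]
enh = [0.01 * s for s in mic]
print(f"{blind_erle_db(mic, enh):.1f} dB")
```

If the model passes near-end speech through unchanged, `enh` is close to `mic` and the ratio sits near 0 dB, which is why the near-end and doubletalk rows in the table show low values.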
+## Building the C++ Inference Engine
+Source, build system, and tests live at
+<https://github.com/localai-org/LocalVQE>. Requires CMake ≥ 3.20 and a C++17
+compiler. A [Nix](https://nixos.org/) flake is provided:
+```bash
+git clone --recursive https://github.com/localai-org/LocalVQE.git
+cd LocalVQE
+# With Nix:
+nix develop
+cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
+cmake --build ggml/build -j$(nproc)
+# Without Nix — install cmake, gcc/clang, pkg-config, libsndfile, then:
+cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release
+cmake --build ggml/build -j$(nproc)
+```
+Binaries land in `ggml/build/bin/`. The CPU build produces multiple
+`libggml-cpu-*.so` variants (SSE4.2 / AVX2 / AVX-512) selected at runtime.
+Keep the binaries and `.so` files together.
+### Vulkan backend (embedded / integrated-GPU targets)
+Add `-DLOCALVQE_VULKAN=ON` to the configure step. This composes with the
+CPU build — an additional `libggml-vulkan.so` is produced in
+`ggml/build/bin/` and the runtime loader picks it up when a Vulkan ICD is
+present, otherwise it falls back to the CPU variants.
+```bash
+cmake -S ggml -B ggml/build -DCMAKE_BUILD_TYPE=Release -DLOCALVQE_VULKAN=ON
+cmake --build ggml/build -j$(nproc)
+```
+The Nix flake's dev shell already includes `vulkan-loader`,
+`vulkan-headers`, and `shaderc`. Without Nix, install the equivalents
+from your distro (Debian: `libvulkan-dev`, `vulkan-headers`, and
+`glslc` or `shaderc`).
+### Streaming latency (per-hop, 16 kHz / 256-sample hop → 16 ms budget)
+Measured with `bench` on a Zen4 desktop (Ryzen 9 7900). Each hop is one
+full `ggml_backend_graph_compute` call.
+| Backend                     | Threads | p50     | p99     | max     |
+|-----------------------------|--------:|--------:|--------:|--------:|
+| CPU                         |       1 | 3.40 ms | 3.57 ms | 5.06 ms |
+| CPU                         |       2 | 2.07 ms | 2.25 ms | 3.65 ms |
+| CPU                         |       4 | 1.32 ms | 1.57 ms | 6.91 ms |
+| Vulkan — AMD iGPU (RADV)    |       — | 4.43 ms | 4.62 ms | 5.07 ms |
+| Vulkan — NVIDIA RTX 5070 Ti |       — | 1.79 ms | 3.41 ms | 4.14 ms |
+Vulkan p50/p99 are tight, but worst-case single-hop latency on a
+shared desktop is sensitive to external GPU clients (display
+compositor, browser). On a dedicated embedded device with no
+compositor contending for the queue, expect the quieter end of the
+range.
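To put the table in terms of the 16 ms hop budget, the sketch below recomputes each row's share of that budget from the numbers above. The data layout and function name are illustrative:

```python
HOP_BUDGET_MS = 16.0   # 256 samples at 16 kHz

# (backend, threads, p50, p99, max) in milliseconds, copied from the table.
ROWS = [
    ("CPU", 1, 3.40, 3.57, 5.06),
    ("CPU", 2, 2.07, 2.25, 3.65),
    ("CPU", 4, 1.32, 1.57, 6.91),
    ("Vulkan AMD iGPU (RADV)", None, 4.43, 4.62, 5.07),
    ("Vulkan NVIDIA RTX 5070 Ti", None, 1.79, 3.41, 4.14),
]


def budget_report(rows, budget_ms=HOP_BUDGET_MS):
    """Return (label, p50 share, worst-case share) for each measurement row."""
    report = []
    for backend, threads, p50, _p99, worst in rows:
        label = backend if threads is None else f"{backend} x{threads}"
        report.append((label, p50 / budget_ms, worst / budget_ms))
    return report


for label, typical, worst in budget_report(ROWS):
    print(f"{label:28s} typical {typical:5.1%}  worst case {worst:5.1%}")
```

Even the slowest hop in the table (6.91 ms, CPU with 4 threads) uses about 43 % of the budget.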
+## Running Inference
+Download `localvqe-v1.1-1.3M-f32.gguf` from this repository (see the file
+list above) via `huggingface-cli`, the Hub web UI, or `hf_hub_download`
+from `huggingface_hub`. Then:
+### CLI
+```bash
+./ggml/build/bin/localvqe localvqe-v1.1-1.3M-f32.gguf \
+    --in-wav mic.wav ref.wav \
+    --out-wav enhanced.wav
+```
+The CLI expects 16 kHz mono PCM WAV files for both the microphone signal and the far-end reference.
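Because the CLI expects a specific input format, it can help to check the files before running it. A minimal sketch using Python's standard `wave` module. The helper name `check_wav` and the 16-bit sample width of the demo file are assumptions for illustration and are not part of LocalVQE:

```python
import os
import tempfile
import wave


def check_wav(path, rate_hz=16_000, channels=1):
    """Return a list of format problems; an empty list means the file looks OK."""
    problems = []
    with wave.open(path, "rb") as wf:   # raises wave.Error on unsupported data
        if wf.getframerate() != rate_hz:
            problems.append(
                f"sample rate is {wf.getframerate()} Hz, expected {rate_hz}"
            )
        if wf.getnchannels() != channels:
            problems.append(
                f"{wf.getnchannels()} channels, expected {channels}"
            )
    return problems


# Demo: write one second of 16-bit silence at 16 kHz mono, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "mic.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)          # 16-bit samples (an assumption for the demo)
    wf.setframerate(16_000)
    wf.writeframes(b"\x00\x00" * 16_000)

print(check_wav(demo_path))     # an empty list means no problems were found
```

Running the same check on both `mic.wav` and `ref.wav` catches the common case of a 44.1 or 48 kHz recording being passed in unresampled.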
+### Benchmark
+```bash
+./ggml/build/bin/bench localvqe-v1.1-1.3M-f32.gguf \
+    --in-wav mic.wav ref.wav --iters 10 --profile
+```
+### Shared Library (C API)
+```bash
+cmake -S ggml -B ggml/build -DLOCALVQE_BUILD_SHARED=ON
+cmake --build ggml/build -j$(nproc)
+```
+Produces `liblocalvqe.so` with the API in `ggml/localvqe_api.h`. See
+`ggml/example_purego_test.go` in the GitHub repo for a Go / `purego`
+integration.
+### Quantizing (experimental)
+Calibrated Q4_K / Q8_0 weights are not yet published. The `quantize`
+tool in the C++ build can produce GGUF variants from the F32 reference
+for experimentation:
+```bash
+./ggml/build/bin/quantize localvqe-v1.1-1.3M-f32.gguf localvqe-v1.1-1.3M-q8.gguf Q8_0
+```
+Expect end-to-end quality loss until proper per-tensor selection and
+calibration have been worked through.
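To give a feel for where that loss comes from, here is a minimal sketch of Q8_0-style block quantization as GGML defines it: weights are grouped into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. The sketch keeps the scale in full precision (GGML stores it as FP16) and uses Python's `round`, so it is an approximation of the format and not the `quantize` tool's code. The function names and toy weights are illustrative.

```python
BLOCK = 32   # Q8_0 block size in GGML


def quantize_q8_0(weights):
    """Quantize a flat list (length divisible by 32) into (scale, ints) blocks."""
    if len(weights) % BLOCK:
        raise ValueError("length must be a multiple of 32")
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        scale = max(abs(w) for w in chunk) / 127.0   # largest value maps to +/-127
        inv = 1.0 / scale if scale else 0.0
        blocks.append((scale, [round(w * inv) for w in chunk]))
    return blocks


def dequantize_q8_0(blocks):
    """Rebuild approximate float weights from (scale, ints) blocks."""
    return [scale * q for scale, qs in blocks for q in qs]


# Toy weights in [-0.1, 0.1]; real tensors would come from the F32 GGUF.
weights = [((i * 37) % 101 - 50) / 500.0 for i in range(64)]
blocks = quantize_q8_0(weights)
restored = dequantize_q8_0(blocks)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_err:.2e}")
```

Each block costs 34 bytes for 32 weights, about 8.5 bits per weight, so a model with every tensor stored as Q8_0 would be a little over a quarter of the F32 size.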
+## PyTorch Reference
+`localvqe-v1.1-1.3M.pt` is the PyTorch checkpoint used to produce the GGUF export.
+It is provided for verification, ablation, and downstream research — not
+for end-user inference, which should go through the GGML build above. The
+model definition lives under `pytorch/` in the
+[GitHub repo](https://github.com/localai-org/LocalVQE):
+```bash
+git clone https://github.com/localai-org/LocalVQE.git
+cd LocalVQE/pytorch
+pip install -r requirements.txt
+```
+## Citing LocalVQE
+If you use LocalVQE in academic work, please cite the repository via the
+`CITATION.cff` at <https://github.com/localai-org/LocalVQE> — GitHub renders
+a "Cite this repository" button that produces APA and BibTeX entries
+automatically.
+For a DOI, we recommend citing a specific release via
+[Zenodo](https://zenodo.org), which mints a DOI per GitHub release. Please
+also cite the upstream DeepVQE paper:
+```bibtex
+@inproceedings{indenbom2023deepvqe,
+  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
+               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
+  author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
+               and Saabas, Ando and P{\"a}rnamaa, Tanel
+               and Gu{\v{z}}vin, Jegor and Cutler, Ross},
+  booktitle = {Interspeech},
+  year      = {2023},
+  doi       = {10.21437/Interspeech.2023-2176}
+}
+```
+## Dataset Attribution
+Published weights are trained on data from the
+[ICASSP 2023 Deep Noise Suppression Challenge](https://github.com/microsoft/DNS-Challenge)
+(Microsoft, CC BY 4.0) and fine-tuned on the
+[ICASSP 2022/2023 Acoustic Echo Cancellation Challenge](https://github.com/microsoft/AEC-Challenge).
+## Safety Note
+Training data was filtered by DNSMOS perceived-quality scores, which can
+misclassify distressed speech (screaming, crying) as noise. LocalVQE may
+attenuate or distort such signals and must not be relied upon for emergency
+call or safety-critical applications.
+## License
+Apache License 2.0.

localvqe-technical-report.pdf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:351d69f2f10bc775f77f5a034c196fe94c57634e27403412bec7d417ccdcb468
+size 365911

localvqe-v1-1.3M-f32.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d5eaf577449d0f920d8ee5e1042b8ddc7b6627313a042c62e2ada1b42719ab30
+size 5162720

localvqe-v1-1.3M.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:499d7cadfe939c2f7849ce2926c791de97c10f084fbfd8243794d199a0d54f8a
+size 11656320

localvqe-v1.1-1.3M-f32.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c118227c6b433d6aa36d9e4b993e0f31aa60787ea38d301d04db917a4a2b0a84
+size 5173088

localvqe-v1.1-1.3M.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:76aabaa3bca3a9d7989463226312aa2344f978403c3e0e007e58a15922c97707
+size 11453482