# DeepVQE-AEC (GGUF)

GGML/GGUF inference model for DeepVQE (Indenbom et al., Interspeech 2023): joint acoustic echo cancellation (AEC), noise suppression, and dereverberation.
## Quick Start

### Build
Requires CMake 3.20+ and a C++17 compiler. The ggml library is included as a git submodule.
```bash
git clone --recursive https://github.com/richiejp/deepvqe-ggml
cd deepvqe-ggml/ggml

# CLI only
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# With shared library (C API for FFI from Python, Go, etc.)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DDEEPVQE_BUILD_SHARED=ON
cmake --build build
```
### CLI

Process STFT-domain audio (NumPy `.npy` files):

```bash
./build/deepvqe deepvqe.gguf --input-npy mic_stft.npy ref_stft.npy
```
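The exact `.npy` layout expected by the CLI is defined by the repo's export scripts; as a sketch, producing complex STFT inputs with the parameters listed under Model Details (512 FFT, 256 hop, sqrt-Hann window) might look like this. The array layout (`complex64` vs. stacked real/imag) is an assumption, so check the repo before relying on it:

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    # sqrt-Hann analysis window, per the Model Details table
    win = np.sqrt(np.hanning(n_fft))
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    # One-sided FFT: (n_frames, n_fft // 2 + 1) complex bins
    return np.fft.rfft(frames, n=n_fft)

mic = np.zeros(16000, dtype=np.float32)  # replace with real 16 kHz mono audio
ref = np.zeros(16000, dtype=np.float32)
np.save("mic_stft.npy", stft(mic).astype(np.complex64))
np.save("ref_stft.npy", stft(ref).astype(np.complex64))
```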
### C API

The shared library (`libdeepvqe.so`) exposes a simple C API for integration into any language:
```c
#include "deepvqe_api.h"

// Load model
uintptr_t ctx = deepvqe_new("deepvqe.gguf");

// Process 16 kHz mono float32 audio
// mic: microphone input (with echo + noise)
// ref: far-end reference (what the speaker is hearing)
// out: cleaned output (pre-allocated, same length)
int ret = deepvqe_process_f32(ctx, mic, ref, n_samples, out);

// int16 PCM variant also available
ret = deepvqe_process_s16(ctx, mic_s16, ref_s16, n_samples, out_s16);

deepvqe_free(ctx);
```
### Python (ctypes)

```python
import ctypes, numpy as np

lib = ctypes.CDLL("./build/libdeepvqe.so")
lib.deepvqe_new.restype = ctypes.c_void_p
lib.deepvqe_new.argtypes = [ctypes.c_char_p]
lib.deepvqe_process_f32.restype = ctypes.c_int
lib.deepvqe_process_f32.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,
    ctypes.c_int, ctypes.c_void_p,
]
lib.deepvqe_free.argtypes = [ctypes.c_void_p]

ctx = lib.deepvqe_new(b"deepvqe.gguf")

mic = np.zeros(16000, dtype=np.float32)  # 1 second of 16 kHz audio
ref = np.zeros(16000, dtype=np.float32)
out = np.empty_like(mic)

ret = lib.deepvqe_process_f32(
    ctx,
    mic.ctypes.data,
    ref.ctypes.data,
    len(mic),
    out.ctypes.data,
)

lib.deepvqe_free(ctx)
```
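Since both `deepvqe_process_f32` and `deepvqe_process_s16` entry points exist, a common need is converting between float and int16 PCM at the boundary. A conversion sketch (not part of the library; standard symmetric-clip PCM scaling assumed):

```python
import numpy as np

def f32_to_s16(x):
    # Clip to [-1, 1], then scale to the int16 range.
    return (np.clip(x, -1.0, 1.0) * 32767.0).astype(np.int16)

def s16_to_f32(x):
    # Scale int16 samples back into [-1, 1).
    return x.astype(np.float32) / 32768.0
```

With these helpers, `f32_to_s16(mic)` and `f32_to_s16(ref)` can feed `deepvqe_process_s16` using the same argument pattern as the float call above.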
Used in production by VoxInput for real-time voice input with echo cancellation.
## Model Details

| Property | Value |
|---|---|
| Architecture | DeepVQE with AlignBlock (soft delay estimation) |
| Parameters | ~8.0M |
| Sample rate | 16 kHz |
| STFT | 512 FFT, 256 hop (16 ms), sqrt-Hann window |
| Delay range | dmax = 32 frames (320 ms) |
| Format | GGUF (F32) |
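The frame timing follows directly from these parameters; a quick sanity check at the 16 kHz sample rate:

```python
sr = 16000                     # sample rate (Hz)
n_fft, hop = 512, 256          # STFT size and hop from the table above

window_ms = 1000 * n_fft / sr  # analysis window duration
hop_ms = 1000 * hop / sr       # time advanced per frame
print(window_ms, hop_ms)       # 32.0 16.0
```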
## Training
Trained on the full DNS5 16 kHz dataset (~300K clean speech files after DNSMOS quality filtering, 64K noise, 60K impulse responses) on a single NVIDIA RTX 5070 (16 GB).
**Safety note:** Training data was filtered by DNSMOS perceived-quality scores, which can misclassify distressed speech (e.g. screaming, crying) as noise. This model may therefore attenuate or distort such signals and should not be relied upon for emergency-call or other safety-critical applications.
## Data

- DNS5 (Microsoft, CC BY 4.0)
- ICASSP 2022 AEC Challenge (echo scenarios)
See deepvqe-ggml for training code and full documentation.
## References

- DeepVQE paper (Indenbom et al., 2023)
- deepvqe-ggml (training & export code)