DeepVQE-AEC (GGUF)

GGML/GGUF inference model for DeepVQE (Indenbom et al., Interspeech 2023): joint acoustic echo cancellation (AEC), noise suppression, and dereverberation.

Quick Start

Build

Requires CMake 3.20+ and a C++17 compiler. The ggml library is included as a git submodule.

git clone --recursive https://github.com/richiejp/deepvqe-ggml
cd deepvqe-ggml/ggml

# CLI only
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# With shared library (C API for FFI from Python, Go, etc.)
cmake -B build -DCMAKE_BUILD_TYPE=Release -DDEEPVQE_BUILD_SHARED=ON
cmake --build build

CLI

Process STFT-domain audio (NumPy .npy files):

./build/deepvqe deepvqe.gguf --input-npy mic_stft.npy ref_stft.npy
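
The CLI consumes precomputed STFTs, and the array layout it expects is not described in this section. As a rough sketch, assuming the STFT settings listed under Model Details (512-point FFT, 256-sample hop, sqrt-Hann window) and a hypothetical layout of complex64 with shape (frames, 257), input files could be produced with NumPy alone. The function name stft_sqrt_hann, the zero-padding of the tail, and the output layout are all assumptions; check the deepvqe-ggml repository for the actual format.

```python
import numpy as np

def stft_sqrt_hann(x, n_fft=512, hop=256):
    """STFT with a sqrt-Hann window.

    Returns a complex64 array of shape (frames, n_fft // 2 + 1).
    The layout is an assumption, not the documented CLI format.
    """
    x = np.asarray(x, dtype=np.float32)
    # Periodic Hann window, then its square root (sqrt-Hann).
    n = np.arange(n_fft)
    window = np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_fft))
    # Zero-pad the tail so the last partial frame is kept (assumption).
    n_frames = 1 + max(0, int(np.ceil((len(x) - n_fft) / hop)))
    padded = np.zeros((n_frames - 1) * hop + n_fft, dtype=np.float32)
    padded[: len(x)] = x
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] for i in range(n_frames)]
    )
    return np.fft.rfft(frames * window, axis=1).astype(np.complex64)

# Hypothetical usage, with mic and ref as 1-D float32 arrays at 16 kHz:
# np.save("mic_stft.npy", stft_sqrt_hann(mic))
# np.save("ref_stft.npy", stft_sqrt_hann(ref))
```

One second of 16 kHz audio yields 62 frames of 257 frequency bins under these assumptions.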

C API

The shared library (libdeepvqe.so) exposes a simple C API for integration into any language:

#include "deepvqe_api.h"

// Load model
uintptr_t ctx = deepvqe_new("deepvqe.gguf");

// Process 16kHz mono float32 audio
//   mic: microphone input (with echo + noise)
//   ref: far-end reference (the signal played through the loudspeaker)
//   out: cleaned output (pre-allocated, same length)
int ret = deepvqe_process_f32(ctx, mic, ref, n_samples, out);

// int16 PCM variant also available
ret = deepvqe_process_s16(ctx, mic_s16, ref_s16, n_samples, out_s16);

deepvqe_free(ctx);

Python (ctypes)

import ctypes, numpy as np

lib = ctypes.CDLL("./build/libdeepvqe.so")
lib.deepvqe_new.restype = ctypes.c_void_p
lib.deepvqe_new.argtypes = [ctypes.c_char_p]
lib.deepvqe_process_f32.restype = ctypes.c_int
lib.deepvqe_process_f32.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p,
    ctypes.c_int, ctypes.c_void_p,
]
lib.deepvqe_free.argtypes = [ctypes.c_void_p]

ctx = lib.deepvqe_new(b"deepvqe.gguf")

mic = np.zeros(16000, dtype=np.float32)  # 1 second of 16kHz audio
ref = np.zeros(16000, dtype=np.float32)
out = np.empty_like(mic)

ret = lib.deepvqe_process_f32(
    ctx,
    mic.ctypes.data,
    ref.ctypes.data,
    len(mic),
    out.ctypes.data,
)

lib.deepvqe_free(ctx)
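
Because the ctypes call passes raw pointers, the arrays must already be float32, C-contiguous, and of equal length; a float64 or strided array would be read incorrectly without any error. The following is a minimal sketch of input preparation. The helper names load_wav_16k_mono and prepare_buffers are illustrative and not part of the library, and both the [-1, 1) scaling of float samples and the truncation to the shorter signal are assumptions.

```python
import wave
import numpy as np

def load_wav_16k_mono(path):
    """Read a 16-bit PCM WAV file and return float32 samples."""
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000:
            raise ValueError(f"expected 16 kHz, got {wf.getframerate()} Hz")
        if wf.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype="<i2")
    # Assumption: float input uses the conventional [-1, 1) range.
    return pcm.astype(np.float32) / 32768.0

def prepare_buffers(mic, ref):
    """Return (mic, ref, out) as C-contiguous float32 arrays of equal length."""
    # Assumption: truncate to the shorter signal rather than zero-pad.
    n = min(len(mic), len(ref))
    mic32 = np.ascontiguousarray(mic[:n], dtype=np.float32)
    ref32 = np.ascontiguousarray(ref[:n], dtype=np.float32)
    out = np.empty(n, dtype=np.float32)
    return mic32, ref32, out
```

The three returned arrays can then be passed to deepvqe_process_f32 through their .ctypes.data pointers, with len(out) as the sample count.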

Used in production by VoxInput for real-time voice input with echo cancellation.

Model Details

Architecture: DeepVQE with AlignBlock (soft delay estimation)
Parameters: ~8.0M
Sample rate: 16 kHz
STFT: 512-point FFT, 256-sample hop (16 ms), sqrt-Hann window
Delay range: dmax = 32 frames (512 ms at a 16 ms hop)
Format: GGUF (F32)
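
The timing figures above follow directly from the sample rate and STFT settings. A small sketch of that arithmetic, where the helper name frames_to_ms is illustrative and not part of the library:

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 512           # samples per analysis window
HOP = 256             # samples between successive frames

def frames_to_ms(n_frames, hop=HOP, sample_rate=SAMPLE_RATE):
    """Duration in milliseconds spanned by n_frames hops."""
    return 1000.0 * n_frames * hop / sample_rate

hop_ms = frames_to_ms(1)                   # 16.0 ms per frame step
window_ms = 1000.0 * N_FFT / SAMPLE_RATE   # 32.0 ms analysis window
n_bins = N_FFT // 2 + 1                    # 257 frequency bins
frames_per_second = SAMPLE_RATE / HOP      # 62.5 frames per second
```

The same helper converts any delay expressed in frames to milliseconds.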

Training

Trained on the full DNS5 16 kHz dataset (~300K clean speech files after DNSMOS quality filtering, 64K noise, 60K impulse responses) on a single NVIDIA RTX 5070 (16 GB).

Safety note: Training data was filtered by DNSMOS perceived quality scores, which can misclassify distressed speech (e.g. screaming, crying) as noise. This model may therefore attenuate or distort such signals and should not be relied upon for emergency call or safety-critical applications.

Data

See deepvqe-ggml for training code and full documentation.

References
