richiejp committed · Commit 432fc91 · verified · 1 Parent(s): 1023775

Upload README.md with huggingface_hub

Files changed (1): README.md (ADDED, +69 -0)

---
library_name: pytorch
tags:
- audio-to-audio
- speech-enhancement
- acoustic-echo-cancellation
- noise-suppression
- ggml
license: apache-2.0
---

# LocalVQE: Local Voice Quality Enhancement

Real-time joint acoustic echo cancellation (AEC), noise suppression (NS), and
dereverberation for 16 kHz speech. Based on *DeepVQE: Real Time Deep Voice
Quality Enhancement* (Indenbom et al., Interspeech 2023,
[arXiv:2306.03177](https://arxiv.org/abs/2306.03177)), redesigned for
quantization-aware CPU inference with DCT-II analysis/synthesis in place of
STFT.

Project source: <https://github.com/richiejp/LocalVQE>

## Files

| File | Size | Description |
|---|---|---|
| `baseline.pt` | 11 MB | DNS5-trained baseline (val_loss 0.0213, ERLE 14.9 dB). |
| `finetune.pt` | 11 MB | AEC fine-tune (1 epoch on AEC Challenge synthetic set). |
| `localvqe-baseline.gguf` | 5 MB | GGML F32 export of `baseline.pt` (BN-folded, DCT weights embedded). |
| `localvqe-finetune.gguf` | 5 MB | GGML F32 export of `finetune.pt`. |
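
One quick way to confirm that a downloaded `.gguf` file is intact is to read its header. The sketch below assumes the standard GGUF layout from version 2 onward (4-byte magic `GGUF`, then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key/value count). The function name `read_gguf_header` is illustrative and not part of this project.

```python
import struct


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file.

    Assumes the GGUF v2+ header layout: magic, uint32, uint64, uint64,
    all little-endian.
    """
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # 4 bytes version + 8 bytes tensor count + 8 bytes metadata count
    version, n_tensors, n_kv = struct.unpack("<IQQ", header[4:24])
    return version, n_tensors, n_kv
```

Calling `read_gguf_header("localvqe-baseline.gguf")` on a complete download should return three integers. A `ValueError` suggests a truncated file or a Git LFS pointer fetched instead of the binary.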

## Usage (GGML / C++ / Go)

```bash
# Build the ggml binary
cd ggml && cmake -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j

# Run inference on a 16 kHz WAV pair
./build/bin/localvqe localvqe-baseline.gguf \
    --in-wav mic.wav ref.wav --out-wav enhanced.wav
```
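
Because the model expects 16 kHz input, it can help to check both WAV files before running the binary. The following is a minimal sketch using only Python's standard `wave` module, which reads PCM WAV only. The helper name `check_wav_pair` is hypothetical, and the README does not say how the binary itself handles a sample-rate mismatch.

```python
import wave

EXPECTED_RATE = 16000  # the README states the model targets 16 kHz speech


def check_wav_pair(mic_path, ref_path, expected_rate=EXPECTED_RATE):
    """Raise ValueError if either file is not sampled at expected_rate.

    Returns the frame counts (mic, ref) so callers can compare lengths.
    """
    counts = []
    for path in (mic_path, ref_path):
        with wave.open(path, "rb") as w:
            rate = w.getframerate()
            if rate != expected_rate:
                raise ValueError(f"{path}: {rate} Hz, expected {expected_rate} Hz")
            counts.append(w.getnframes())
    return tuple(counts)
```

For example, `check_wav_pair("mic.wav", "ref.wav")` returns the two frame counts, which also shows whether the microphone and far-end recordings have the same length.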

Per-frame wall time on Zen4 (24 threads): ~1.66 ms (9.6× realtime at
16 kHz / 256-sample hop).
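
The realtime factor follows from the hop size: each frame advances the stream by 256 samples, which is 16 ms of audio at 16 kHz, and that is divided by the measured processing time. A short sketch of the arithmetic, with the variable and function names chosen here for illustration:

```python
SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256
FRAME_WALL_TIME_MS = 1.66  # measured per-frame figure quoted above


def realtime_factor(hop_samples, sample_rate_hz, wall_time_ms):
    """Audio duration covered by one hop divided by the time taken to process it."""
    hop_ms = 1000.0 * hop_samples / sample_rate_hz
    return hop_ms / wall_time_ms


hop_ms = 1000.0 * HOP_SAMPLES / SAMPLE_RATE_HZ  # 16.0 ms of audio per frame
factor = realtime_factor(HOP_SAMPLES, SAMPLE_RATE_HZ, FRAME_WALL_TIME_MS)
print(f"{hop_ms:.1f} ms per hop, {factor:.1f}x realtime")
```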

## Architecture

| Component | Value |
|-----------|-------|
| Sample rate | 16 kHz |
| Analysis basis | DCT-II (Conv1d filterbank, 512 filters, stride 256, frozen) |
| Mic encoder | 5 blocks: 2 → 64 → 128 → 128 → 128 → 128 |
| Far-end encoder | 2 blocks: 2 → 32 → 128 |
| AlignBlock | Cross-attention soft delay, d_max=32 (320 ms), h=32 |
| Bottleneck | GRU(1152 → 576) + Linear(576 → 1152) |
| Decoder | 5 sub-pixel conv + BN blocks: 128 → 128 → 128 → 64 → 64 |
| CCM | 27-ch → 3×3 complex convolving mask (real-valued arithmetic) |
| Parameters | ~8.0 M |
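
To illustrate the "Analysis basis" row, the sketch below applies a DCT-II filterbank as a strided 1-D convolution in plain Python. Several details are assumptions made for the example and are not confirmed by the table: the kernel length equals the number of filters (512), the basis uses orthonormal scaling, no window function is applied, and the signal is not padded. The function names are illustrative.

```python
import math


def dct2_basis(n):
    """Orthonormal DCT-II basis: row k is the k-th analysis filter of length n.

    Orthonormal scaling is an assumption; the model's own scaling may differ.
    """
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def analyze(signal, n_filters=512, hop=256, basis=None):
    """Slide a length-n_filters window over signal with the given hop and
    project each window onto the DCT-II basis (a strided 1-D convolution).

    No padding and no window function are applied (assumptions).
    """
    if basis is None:
        basis = dct2_basis(n_filters)
    frames = []
    for start in range(0, len(signal) - n_filters + 1, hop):
        window = signal[start:start + n_filters]
        frames.append([sum(b * x for b, x in zip(row, window)) for row in basis])
    return frames
```

Under these assumptions, one second of 16 kHz audio (16000 samples) yields 61 frames of 512 coefficients each. Because the basis is fixed, it can be stored as frozen convolution weights rather than learned.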

## Citation

```bibtex
@inproceedings{indenbom2023deepvqe,
  title={{DeepVQE}: Real Time Deep Voice Quality Enhancement for Joint Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author={Indenbom, Evgenii and Ristea, Nicolae-C{\u a}t{\u a}lin and Saabas, Ando and P{\"a}rnamaa, Tanel and Gu{\v z}vin, Jegor and Cutler, Ross},
  booktitle={Interspeech},
  year={2023}
}
```