---
license: mit
library_name: pytorch
tags:
- speech-enhancement
- audio
- denoising
- onnx
- causal
- streaming
- real-time
- edge-ai
- stm32
datasets:
- JacobLinCool/VoiceBank-DEMAND-16k
---

# LiSenNet

Ultra-compact, causal, real-time speech enhancers trained on
VoiceBank-DEMAND-16k. Each is a sub-band U-Net with a magnitude-only mask; phase
comes from a 2-iteration Griffin-Lim pass offline, or from the noisy input for
real-time use. Port of
**Yan et al., _LiSenNet_,
[arXiv:2409.13285](https://arxiv.org/abs/2409.13285)**
([hyyan2k/LiSenNet](https://github.com/hyyan2k/LiSenNet), MIT).

This repo holds **three variants**, each in its own subfolder:

| subfolder | recipe | params | NPU-compiles | FP32 PESQ | real-time int8 PESQ |
| --------- | ------ | -----: | :----------: | --------: | ------------------: |
| [`gru/`](./gru) | dual-path **GRU** (faithful) | 36,783 | ✗ | 3.006 | 2.930 |
| [`conv/`](./conv) | dual-path **conv** | 41,063 | ✗ | 2.970 | 2.855 |
| [`conv-hardened/`](./conv-hardened) | conv + **NPU-hardened** | 36,288 | ✓ | **3.013** | **2.998** |

PESQ is wideband, on the full 824-utterance VoiceBank-DEMAND test split.

* **`gru/`** is the faithful reproduction and the original quality reference. Its
  GRU + 2-axis `LayerNorm` do **not** compile to the STM32N6 Neural-ART NPU.
* **`conv/`** replaces the GRU bottleneck with a dual-path conv one (0 GRU /
  0 LayerNormalization). Its ops map to the NPU, but the FIFO-state streaming
  graph (`conv/g_best_streaming_fp32.onnx`,
  `feat + N state_i_in -> est_mag + N state_i_out`) crashes the Neural-ART
  codegen, so it is kept as the CPU/onnxruntime frame-by-frame reference.
* **`conv-hardened/`** is the **NPU-deployable** variant and the current best
  model overall. It uses per-channel BatchNorm (which folds into the convs),
  ReLU, plain ConvTranspose upsampling, and a stateless **windowed** deploy graph
  (`conv-hardened/g_best_windowed_int8_static.onnx`, signed QInt8,
  `feat_window (B,3,132,257) -> est_mag (B,64,257)`, window = receptive field
  68 + 64 emitted frames) that **compiles to Neural-ART**. This is the artifact
  handed to stedgeai. The hardened primitives also quantize far better (int8
  drop −0.016 vs −0.115 for `conv/`).

Code and full write-up: [https://github.com/LarocheC/eco8-neaixt](https://github.com/LarocheC/eco8-neaixt); see
[RESULTS_LISENNET.md](https://github.com/LarocheC/eco8-neaixt/blob/main/RESULTS_LISENNET.md).

## Files (per subfolder)

Every subfolder contains `config.json`, `g_best` (PyTorch
`{"generator": state_dict}`), and `g_best_fp32.onnx` plus
`g_best_int8_static.onnx` (whole-utterance mask sub-network,
`feat (B,3,T,F) -> est_mag (B,T,F)`). `conv/` additionally has
`g_best_streaming_fp32.onnx` and `g_best_streaming_int8_static.onnx` (single
frame + explicit state I/O); `conv-hardened/` has `g_best_windowed_fp32.onnx`
and `g_best_windowed_int8_static.onnx` (stateless windowed deploy graph, the
stedgeai / Neural-ART target). The ONNX graphs contain only the mask
sub-network; STFT, feature build and phase recovery stay host-side.
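
For the real-time path, the host-side phase recovery amounts to reusing the noisy phase. The following is a minimal NumPy sketch of that step, not code from the repository. It assumes `noisy_spec` is the complex STFT of the noisy input with shape `(T, 257)`, and that `est_mag` is already a linear-scale magnitude (if the model works on compressed magnitudes, undo the compression first). `apply_noisy_phase` is a hypothetical helper name, and the STFT settings themselves are not specified here.

```python
import numpy as np

def apply_noisy_phase(est_mag, noisy_spec):
    """Combine an enhanced magnitude with the phase of the noisy STFT.

    est_mag:    (T, F) real, non-negative enhanced magnitudes
    noisy_spec: (T, F) complex STFT of the noisy input
    Returns the complex (T, F) spectrum to hand to an inverse STFT.
    """
    phase = np.angle(noisy_spec)          # noisy phase, in radians
    return est_mag * np.exp(1j * phase)   # enhanced magnitude, noisy phase

# Toy data standing in for a real STFT (T=4 frames, F=257 bins).
rng = np.random.default_rng(0)
noisy_spec = rng.standard_normal((4, 257)) + 1j * rng.standard_normal((4, 257))
est_mag = 0.5 * np.abs(noisy_spec)        # pretend the mask halved every bin
enhanced_spec = apply_noisy_phase(est_mag, noisy_spec)
```

The result keeps the magnitudes predicted by the network and the phase of the input, so it can be passed to whichever inverse STFT matches the analysis STFT used to build the features.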

## Loading (PyTorch)

```python
import json

import torch
from huggingface_hub import hf_hub_download

# Project modules, expected on PYTHONPATH from the source repository.
from common.env import AttrDict
from lisennet.model import build_lisennet

REPO, SUB = "claroche1/LiSenNet", "conv-hardened"  # or "gru" / "conv"
with open(hf_hub_download(REPO, f"{SUB}/config.json")) as f:
    cfg = json.load(f)
ckpt = torch.load(hf_hub_download(REPO, f"{SUB}/g_best"), map_location="cpu", weights_only=True)
model = build_lisennet(AttrDict(cfg)).eval()
model.load_state_dict(ckpt["generator"])  # then: model(noisy_wav)["est"]
```

## Running the NPU windowed deploy graph (`conv-hardened/`)

Stateless: feed a sliding window of the last `68 + 64 = 132` feature frames and
read the 64 newest enhanced-magnitude frames (no state tensors to carry).

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

sess = ort.InferenceSession(
    hf_hub_download("claroche1/LiSenNet", "conv-hardened/g_best_windowed_int8_static.onnx"),
    providers=["CPUExecutionProvider"],
)
feat_window = np.zeros((1, 3, 132, 257), np.float32)  # last 68+64 feature frames
est_mag = sess.run(["est_mag"], {"feat_window": feat_window})[0]  # (1, 64, 257)
```
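
To process a longer feature sequence, the window can be advanced 64 frames at a time. The sketch below shows one way to do this. It is not taken from the repository: `enhance_windowed` and `newest_frames` are hypothetical names, and zero-padding the 68 context frames at the start of the stream is an assumption. A stand-in function replaces the ONNX session so the slicing logic can be checked on its own.

```python
import numpy as np

CONTEXT, EMIT = 68, 64                      # receptive field + emitted frames

def enhance_windowed(feat, run_window):
    """Run a stateless windowed model over a whole feature sequence.

    feat:       (1, 3, T, F) feature tensor, T >= 1
    run_window: callable mapping (1, 3, CONTEXT + EMIT, F) -> (1, EMIT, F)
    Returns est_mag with shape (1, T, F).
    """
    T = feat.shape[2]
    n_win = -(-T // EMIT)                   # ceil(T / EMIT)
    pad_right = n_win * EMIT - T            # fill the last window with zeros
    # Assumption: zeros stand in for the frames before the stream starts.
    padded = np.pad(feat, ((0, 0), (0, 0), (CONTEXT, pad_right), (0, 0)))
    outs = []
    for k in range(n_win):
        start = k * EMIT                    # hop by the number of emitted frames
        window = padded[:, :, start:start + CONTEXT + EMIT]
        outs.append(run_window(window))
    return np.concatenate(outs, axis=1)[:, :T]   # drop the right padding

def newest_frames(window):
    # Stand-in model: echo channel 0 of the EMIT newest frames.
    return window[:, 0, -EMIT:, :]

feat = np.random.default_rng(0).standard_normal((1, 3, 150, 257)).astype(np.float32)
est = enhance_windowed(feat, newest_frames)      # (1, 150, 257)
```

With a real session, `run_window` could be `lambda w: sess.run(["est_mag"], {"feat_window": w})[0]`. Because the stand-in simply echoes the newest frames, `est` equals `feat[:, 0]`, which confirms that each output frame lines up with its input frame.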

## Running the CPU streaming graph frame-by-frame (`conv/`)

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

sess = ort.InferenceSession(
    hf_hub_download("claroche1/LiSenNet", "conv/g_best_streaming_fp32.onnx"),
    providers=["CPUExecutionProvider"],
)
state_in = [i for i in sess.get_inputs() if i.name != "feat"]  # FIFO states
out_names = [o.name for o in sess.get_outputs()]               # est_mag + state_*_out

def zeros(shape):
    # Symbolic (non-int) dimensions such as the batch axis are set to 1.
    return np.zeros([d if isinstance(d, int) else 1 for d in shape], np.float32)

states = {i.name: zeros(i.shape) for i in state_in}  # start-of-stream = zeros

def step(feat_t):  # feat_t: (1, 3, 1, 257)
    res = sess.run(out_names, {"feat": feat_t, **states})
    # Relies on the state outputs being listed in the same order as the state inputs.
    for i, v in zip(state_in, res[1:]):
        states[i.name] = v
    return res[0]  # est_mag (1, 1, 257)
```
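
The state tensors let each causal layer see its recent history without reprocessing old frames. The toy example below illustrates that general idea with a single depthwise causal convolution. It is only an illustration: the actual state layout of the streaming graph is not described here, and both function names are hypothetical. The frame-by-frame result matches the whole-sequence result when the state starts as zeros.

```python
import numpy as np

def causal_conv_offline(x, w):
    """Whole-sequence causal conv. x: (T, C) frames, w: (K, C) taps."""
    K = w.shape[0]
    padded = np.concatenate([np.zeros((K - 1, x.shape[1])), x])  # zero history
    # Output frame t sees input frames t-K+1 .. t only (no look-ahead).
    return np.stack([(padded[t:t + K] * w).sum(axis=0) for t in range(x.shape[0])])

def causal_conv_step(frame, state, w):
    """One streaming step. frame: (C,), state: (K-1, C) previous frames."""
    window = np.concatenate([state, frame[None]])  # (K, C), oldest first
    out = (window * w).sum(axis=0)
    return out, window[1:]                         # drop oldest frame -> new state

rng = np.random.default_rng(0)
x, w = rng.standard_normal((10, 4)), rng.standard_normal((3, 4))

state = np.zeros((w.shape[0] - 1, x.shape[1]))     # start-of-stream = zeros
streamed = []
for frame in x:
    y, state = causal_conv_step(frame, state, w)
    streamed.append(y)
streamed = np.stack(streamed)
offline = causal_conv_offline(x, w)                # same values as `streamed`
```

The streaming ONNX graph applies the same pattern across all of its layers at once, which is why `step` has to feed every returned state back in on the next call.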

## License

MIT. See the [source repository](https://github.com/LarocheC/eco8-neaixt) for training code and full attribution.