---
license: mit
library_name: pytorch
tags:
- speech-enhancement
- audio
- denoising
- onnx
- causal
- streaming
- real-time
- edge-ai
- stm32
datasets:
- JacobLinCool/VoiceBank-DEMAND-16k
---
# LiSenNet
Ultra-compact, causal, real-time speech enhancers trained on
VoiceBank-DEMAND-16k. Each model is a sub-band U-Net with a magnitude-only mask;
phase comes from 2 iterations of Griffin-Lim when running offline, or is taken
from the noisy input for real-time use. Port of
**Yan, Zhou, Chen & Lu, _LiSenNet_,
[arXiv:2409.13285](https://arxiv.org/abs/2409.13285)**
([hyyan2k/LiSenNet](https://github.com/hyyan2k/LiSenNet), MIT).
This repo holds **three variants**, each in its own subfolder:
| subfolder | recipe | params | NPU-compiles | FP32 PESQ | real-time int8 PESQ |
| --------- | ------ | -----: | :----------: | --------: | ------------------: |
| [`gru/`](./gru) | dual-path **GRU** (faithful) | 36,783 | ✗ | 3.006 | 2.930 |
| [`conv/`](./conv) | dual-path **conv** | 41,063 | ✗ | 2.970 | 2.855 |
| [`conv-hardened/`](./conv-hardened) | conv + **NPU-hardened** | 36,288 | ✓ | **3.013** | **2.998** |
PESQ is wideband, on the full 824-utterance VoiceBank-DEMAND test split.
* **`gru/`** is the faithful reproduction and the original quality reference. Its
GRU + 2-axis `LayerNorm` do **not** compile to the STM32N6 Neural-ART NPU.
* **`conv/`** replaces the GRU bottleneck with a dual-path conv one (0 GRU /
0 LayerNormalization). Its ops map to the NPU, but the FIFO-state streaming
graph (`conv/g_best_streaming_fp32.onnx`, `feat + N state_i_in -> est_mag +
N state_i_out`) crashes the Neural-ART codegen, so it is kept as the
CPU/onnxruntime frame-by-frame reference.
* **`conv-hardened/`** is the **NPU-deployable** variant and the current best
model overall. It uses per-channel BatchNorm (which folds into the convs), ReLU,
and plain ConvTranspose upsampling, and ships a stateless **windowed** deploy
graph (`conv-hardened/g_best_windowed_int8_static.onnx`, signed QInt8,
`feat_window (B,3,132,257) -> est_mag (B,64,257)`; window = 68 receptive-field
frames + 64 emitted frames) that **compiles to Neural-ART**. This is the
artifact handed to stedgeai. The hardened primitives also quantize far better
(int8 drop −0.016 vs −0.115 for `conv/`).
Code + full write-up: [https://github.com/LarocheC/eco8-neaixt](https://github.com/LarocheC/eco8-neaixt) — see
[RESULTS_LISENNET.md](https://github.com/LarocheC/eco8-neaixt/blob/main/RESULTS_LISENNET.md).
## Files (per subfolder)
Every subfolder contains:

* `config.json`
* `g_best` (PyTorch checkpoint, `{"generator": state_dict}`)
* `g_best_fp32.onnx` and `g_best_int8_static.onnx` (whole-utterance mask
sub-network, `feat (B,3,T,F) -> est_mag (B,T,F)`)

Two subfolders carry extra graphs:

* `conv/` additionally has `g_best_streaming_fp32.onnx` and
`g_best_streaming_int8_static.onnx` (single frame + explicit state I/O).
* `conv-hardened/` additionally has `g_best_windowed_fp32.onnx` and
`g_best_windowed_int8_static.onnx` (stateless windowed deploy graph, the
stedgeai / Neural-ART target).

The ONNX graphs are the mask sub-network only. STFT, feature build and phase
recovery stay host-side.
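
As an illustration of the host-side phase-recovery step in its real-time form (reusing the noisy phase), here is a minimal sketch. The function name `apply_noisy_phase` is hypothetical and not part of this repo. It assumes `est_mag` and the noisy STFT share the same `(T, F)` grid and that `est_mag` is a linear magnitude; if the pipeline works on a compressed magnitude, that would have to be undone first, which this sketch does not cover. The STFT/ISTFT themselves are left out because their parameters are defined in the source repository, not here.

```python
import numpy as np


def apply_noisy_phase(est_mag, noisy_spec):
    """Combine an enhanced magnitude with the phase of the noisy STFT.

    est_mag:    real array, shape (T, F), enhanced magnitude (assumed linear)
    noisy_spec: complex array, shape (T, F), STFT of the noisy input
    returns:    complex array, shape (T, F), ready for an inverse STFT
    """
    phase = np.angle(noisy_spec)          # phase of the noisy signal, in radians
    return est_mag * np.exp(1j * phase)   # magnitude from the model, phase from the input
```

The result keeps the model's magnitude exactly and only borrows the phase, so no iterative step is involved.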
## Loading (PyTorch)
```python
import json

import torch
from huggingface_hub import hf_hub_download

# Project modules, assumed to be importable from a checkout of the source repo.
from common.env import AttrDict
from lisennet.model import build_lisennet

REPO, SUB = "claroche1/LiSenNet", "conv-hardened"  # or "gru" / "conv"

with open(hf_hub_download(REPO, f"{SUB}/config.json")) as f:
    cfg = json.load(f)
ckpt = torch.load(hf_hub_download(REPO, f"{SUB}/g_best"), map_location="cpu", weights_only=True)

model = build_lisennet(AttrDict(cfg)).eval()
model.load_state_dict(ckpt["generator"])
# Inference: model(noisy_wav)["est"]
```
## Running the NPU windowed deploy graph (`conv-hardened/`)
Stateless: feed a sliding window of the last `68 + 64 = 132` feature frames and
read the 64 newest enhanced-magnitude frames (no state tensors to carry).
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

sess = ort.InferenceSession(
    hf_hub_download("claroche1/LiSenNet", "conv-hardened/g_best_windowed_int8_static.onnx"),
    providers=["CPUExecutionProvider"],
)
feat_window = np.zeros((1, 3, 132, 257), np.float32)  # last 68+64 feature frames
est_mag = sess.run(["est_mag"], {"feat_window": feat_window})[0]  # (1, 64, 257)
```
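
To process a whole feature sequence with the windowed graph, the window can be slid forward 64 frames at a time. The sketch below shows one way to do that. `enhance_windowed` and `run_window` are hypothetical names, and two choices are assumptions rather than documented behavior: the start of the stream is padded with 68 all-zero feature frames, and the tail is zero-padded up to a multiple of 64 and then trimmed.

```python
import numpy as np

RF, HOP = 68, 64   # receptive-field context and frames emitted per call
WIN = RF + HOP     # 132 frames per window


def enhance_windowed(feat, run_window):
    """Slide a 132-frame window over feat and stitch the 64-frame outputs.

    feat:       float32 array, shape (1, 3, T, 257), with T >= 1
    run_window: callable mapping (1, 3, 132, 257) -> (1, 64, 257)
    returns:    array of shape (1, T, 257)
    """
    T = feat.shape[2]
    pad_right = (-T) % HOP  # frames needed to reach a multiple of HOP
    # Assumption: zero frames stand in for missing past and future context.
    padded = np.pad(feat, ((0, 0), (0, 0), (RF, pad_right), (0, 0)))
    chunks = []
    for start in range(0, T + pad_right, HOP):
        window = np.ascontiguousarray(padded[:, :, start:start + WIN, :])
        chunks.append(run_window(window))  # the 64 newest frames of this window
    return np.concatenate(chunks, axis=1)[:, :T, :]
```

With the session above, `run_window` could be `lambda w: sess.run(["est_mag"], {"feat_window": w})[0]`.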
## Running the CPU streaming graph frame-by-frame (`conv/`)
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

sess = ort.InferenceSession(
    hf_hub_download("claroche1/LiSenNet", "conv/g_best_streaming_fp32.onnx"),
    providers=["CPUExecutionProvider"],
)
state_in = [i for i in sess.get_inputs() if i.name != "feat"]  # FIFO state inputs
out_names = [o.name for o in sess.get_outputs()]  # est_mag + state_*_out


def zeros(shape):
    # Symbolic (non-int) dimensions are set to 1.
    return np.zeros([d if isinstance(d, int) else 1 for d in shape], np.float32)


states = {i.name: zeros(i.shape) for i in state_in}  # start-of-stream = zeros


def step(feat_t):  # feat_t: (1, 3, 1, 257)
    res = sess.run(out_names, {"feat": feat_t, **states})
    # Assumes the state outputs follow est_mag in the same order as the state inputs.
    for i, v in zip(state_in, res[1:]):
        states[i.name] = v
    return res[0]  # est_mag (1, 1, 257)
```
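
A per-frame function such as `step` can be driven over a full feature sequence with a short loop. The helper below is a minimal sketch under that assumption; the name `stream_utterance` is hypothetical, and it takes the per-frame function as an argument so it does not depend on a loaded session.

```python
import numpy as np


def stream_utterance(feat, step_fn):
    """Feed feat one frame at a time and stack the per-frame outputs.

    feat:    float32 array, shape (1, 3, T, 257), with T >= 1
    step_fn: callable mapping (1, 3, 1, 257) -> (1, 1, 257)
    returns: array of shape (1, T, 257)
    """
    outputs = []
    for t in range(feat.shape[2]):
        frame = np.ascontiguousarray(feat[:, :, t:t + 1, :])  # keep the time axis
        outputs.append(step_fn(frame))
    return np.concatenate(outputs, axis=1)
```

Because `step` keeps its FIFO states between calls, the states should be reset to zeros before starting a new utterance.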
## License
MIT. See the [source repository](https://github.com/LarocheC/eco8-neaixt) for training code and full attribution.