	---
	license: mit
	library_name: pytorch
	tags:
	- speech-enhancement
	- audio
	- denoising
	- onnx
	- causal
	- streaming
	- real-time
	- edge-ai
	- stm32
	datasets:
	- JacobLinCool/VoiceBank-DEMAND-16k
	---

	# LiSenNet

	Ultra-compact, causal, real-time speech enhancers trained on
	VoiceBank-DEMAND-16k. The model is a sub-band U-Net that predicts a
	magnitude-only mask; phase comes from 2 iterations of Griffin-Lim when running
	offline, or from the noisy input when running in real time. Port of
	**Yan, Zhou, Chen & Lu, _LiSenNet_,
	[arXiv:2409.13285](https://arxiv.org/abs/2409.13285)**
	([hyyan2k/LiSenNet](https://github.com/hyyan2k/LiSenNet), MIT).

	This repo holds three variants, each in its own subfolder:

	| subfolder | recipe | params | NPU-compiles | FP32 PESQ | real-time int8 PESQ |
	| --------- | ------ | -----: | :----------: | --------: | ------------------: |
	| [`gru/`](./gru) | dual-path GRU (faithful) | 36,783 | ✗ | 3.006 | 2.930 |
	| [`conv/`](./conv) | dual-path conv | 41,063 | ✗ | 2.970 | 2.855 |
	| [`conv-hardened/`](./conv-hardened) | conv + NPU-hardened | 36,288 | ✓ | 3.013 | 2.998 |

	PESQ is wideband, on the full 824-utterance VoiceBank-DEMAND test split.

	* `gru/` is the faithful reproduction and the original quality reference. Its
	GRU + 2-axis `LayerNorm` do not compile to the STM32N6 Neural-ART NPU.
	* `conv/` replaces the GRU bottleneck with a dual-path conv one (0 GRU /
	0 LayerNormalization ops). Its ops map to the NPU, but the FIFO-state streaming
	graph (`conv/g_best_streaming_fp32.onnx`, `feat + N state_i_in -> est_mag +
	N state_i_out`) crashes the Neural-ART codegen, so this variant is kept as the
	CPU/onnxruntime frame-by-frame reference.
	* `conv-hardened/` is the NPU-deployable variant and the current best
	model overall. It uses per-channel BatchNorm (which folds into the convs), ReLU,
	plain ConvTranspose upsampling, and a stateless windowed deploy graph
	(`conv-hardened/g_best_windowed_int8_static.onnx`, signed QInt8,
	`feat_window (B,3,132,257) -> est_mag (B,64,257)`, where the 132-frame window
	is the 68-frame receptive field plus 64 emitted frames). That graph compiles
	to Neural-ART and is the artifact handed to stedgeai. The hardened primitives
	also quantize far better (int8 drop −0.016 vs −0.115 for `conv/`).
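
The int8 drop is the real-time int8 PESQ minus the FP32 PESQ. As a rough check, the snippet below recomputes it for all three variants from the table entries. Because the inputs are the rounded table values, a result can differ in the last digit from a figure derived from unrounded scores. The names `pesq` and `int8_drop` are illustrative only.

```python
# FP32 and real-time int8 PESQ, copied from the table above (3 decimals).
pesq = {
    "gru": (3.006, 2.930),
    "conv": (2.970, 2.855),
    "conv-hardened": (3.013, 2.998),
}


def int8_drop(fp32: float, int8: float) -> float:
    """PESQ change going from the FP32 column to the int8 column."""
    return round(int8 - fp32, 3)


drops = {name: int8_drop(*scores) for name, scores in pesq.items()}
for name, drop in drops.items():
    print(f"{name:14s} {drop:+.3f}")
```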

	Code + full write-up: [https://github.com/LarocheC/eco8-neaixt](https://github.com/LarocheC/eco8-neaixt) — see
	[RESULTS_LISENNET.md](https://github.com/LarocheC/eco8-neaixt/blob/main/RESULTS_LISENNET.md).

	## Files (per subfolder)

	Every subfolder contains:

	* `config.json`
	* `g_best`: PyTorch checkpoint, `{"generator": state_dict}`
	* `g_best_fp32.onnx` and `g_best_int8_static.onnx`: whole-utterance mask
	sub-network, `feat (B,3,T,F) -> est_mag (B,T,F)`

	Two subfolders carry extra graphs:

	* `conv/` additionally has `g_best_streaming_fp32.onnx` and
	`g_best_streaming_int8_static.onnx` (single frame + explicit state I/O).
	* `conv-hardened/` has `g_best_windowed_fp32.onnx` and
	`g_best_windowed_int8_static.onnx` (stateless windowed deploy graph, the
	stedgeai / Neural-ART target).

	The ONNX graphs are the mask sub-network only; STFT, feature build and phase
	recovery stay host-side.
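
For the real-time path, the host-side phase recovery amounts to pairing the enhanced magnitude with the phase of the noisy STFT. The sketch below shows only that step, under two assumptions not confirmed here: `est_mag` is on the same scale as the linear STFT magnitude (if the repo's config compresses magnitudes, undo that first), and both arrays are laid out as `(T, F)`. The helper name `apply_noisy_phase` is hypothetical; the STFT and inverse STFT themselves are left to the host's own implementation.

```python
import numpy as np


def apply_noisy_phase(est_mag: np.ndarray, noisy_spec: np.ndarray) -> np.ndarray:
    """Combine an enhanced magnitude with the phase of the noisy STFT.

    est_mag:    (T, F) non-negative magnitudes from the mask sub-network
    noisy_spec: (T, F) complex STFT of the noisy input
    """
    if est_mag.shape != noisy_spec.shape:
        raise ValueError(f"shape mismatch: {est_mag.shape} vs {noisy_spec.shape}")
    phase = np.angle(noisy_spec)            # phase of the noisy input
    return est_mag * np.exp(1j * phase)     # enhanced complex spectrogram


# Toy data: 4 frames of 257 bins; the "network output" is a stand-in that
# simply halves the noisy magnitude.
rng = np.random.default_rng(0)
noisy_spec = rng.standard_normal((4, 257)) + 1j * rng.standard_normal((4, 257))
est_mag = 0.5 * np.abs(noisy_spec)
enhanced = apply_noisy_phase(est_mag, noisy_spec)
```

The result is what an inverse STFT would consume to produce the enhanced waveform.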

	## Loading (PyTorch)

	```python
	import json, torch
	from huggingface_hub import hf_hub_download
	from common.env import AttrDict
	from lisennet.model import build_lisennet

	REPO, SUB = "claroche1/LiSenNet", "conv-hardened"  # or "gru" / "conv"
	with open(hf_hub_download(REPO, f"{SUB}/config.json")) as f:
	    cfg = json.load(f)
	ckpt = torch.load(hf_hub_download(REPO, f"{SUB}/g_best"), map_location="cpu", weights_only=True)
	model = build_lisennet(AttrDict(cfg)).eval()
	model.load_state_dict(ckpt["generator"])
	# Inference: model(noisy_wav)["est"]
	```

	## Running the NPU windowed deploy graph (`conv-hardened/`)

	Stateless: feed a sliding window of the last `68 + 64 = 132` feature frames and
	read the 64 newest enhanced-magnitude frames (no state tensors to carry).

	```python
	import numpy as np, onnxruntime as ort
	from huggingface_hub import hf_hub_download

	sess = ort.InferenceSession(
	    hf_hub_download("claroche1/LiSenNet", "conv-hardened/g_best_windowed_int8_static.onnx"),
	    providers=["CPUExecutionProvider"],
	)
	feat_window = np.zeros((1, 3, 132, 257), np.float32) # last 68+64 feature frames
	est_mag = sess.run(["est_mag"], {"feat_window": feat_window})[0] # (1, 64, 257)
	```
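
To process more than one window, the graph can be driven by hopping 64 frames at a time over the feature sequence. The following is a minimal offline sketch, not the repository's own driver. It assumes zeros stand in for the missing 68 frames of history at the start of the stream and for the padding needed to complete the last chunk. `run_window` is a hypothetical callable wrapping something like `sess.run(["est_mag"], {"feat_window": w})[0]`; a NumPy stand-in replaces it here so the example runs without downloading the model.

```python
import numpy as np

CONTEXT, EMIT = 68, 64        # receptive field and emitted frames, per this README
WINDOW = CONTEXT + EMIT       # 132 frames fed per call


def enhance_windowed(feat: np.ndarray, run_window) -> np.ndarray:
    """Run a stateless windowed model over a whole feature sequence.

    feat:       (1, 3, T, 257) feature frames
    run_window: maps (1, 3, 132, 257) -> (1, 64, 257)  (hypothetical wrapper)
    returns:    (1, T, 257) enhanced magnitudes
    """
    total = feat.shape[2]
    if total == 0:
        return np.zeros((feat.shape[0], 0, feat.shape[3]), feat.dtype)
    n_chunks = -(-total // EMIT)                  # ceil(total / EMIT)
    pad_right = n_chunks * EMIT - total
    # Assumption: zero frames replace the unavailable past and the tail padding.
    padded = np.pad(feat, ((0, 0), (0, 0), (CONTEXT, pad_right), (0, 0)))
    chunks = []
    for k in range(n_chunks):
        start = k * EMIT                          # hop by the emitted length
        chunks.append(run_window(padded[:, :, start:start + WINDOW, :]))
    return np.concatenate(chunks, axis=1)[:, :total, :]


# Stand-in "model": returns the 64 newest frames of channel 0 unchanged.
toy_model = lambda w: w[:, 0, -EMIT:, :]
feat = np.random.default_rng(0).standard_normal((1, 3, 150, 257)).astype(np.float32)
out = enhance_windowed(feat, toy_model)
```

In a live stream each call has to wait until 64 new frames are available, since that is how many frames one window emits.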

	## Running the CPU streaming graph frame-by-frame (`conv/`)

	```python
	import numpy as np, onnxruntime as ort
	from huggingface_hub import hf_hub_download

	sess = ort.InferenceSession(
	    hf_hub_download("claroche1/LiSenNet", "conv/g_best_streaming_fp32.onnx"),
	    providers=["CPUExecutionProvider"],
	)
	state_in = [i for i in sess.get_inputs() if i.name != "feat"] # FIFO states
	out_names = [o.name for o in sess.get_outputs()] # est_mag + state_*_out
	zeros = lambda s: np.zeros([d if isinstance(d, int) else 1 for d in s], np.float32)
	states = {i.name: zeros(i.shape) for i in state_in} # start-of-stream = zeros

	def step(feat_t):  # feat_t: (1, 3, 1, 257)
	    res = sess.run(out_names, {"feat": feat_t, **states})
	    for i, v in zip(state_in, res[1:]):
	        states[i.name] = v  # carry each FIFO state into the next call
	    return res[0]  # est_mag (1, 1, 257)
	```
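
A per-frame function such as `step` above can then be applied to a whole feature sequence one frame at a time. The sketch below is one way to do that; `enhance_stream` and `step_fn` are hypothetical names, and a NumPy stand-in replaces the ONNX session so the example runs on its own. The `np.ascontiguousarray` call is a defensive choice so each slice is passed on as a contiguous array.

```python
import numpy as np


def enhance_stream(feat: np.ndarray, step_fn) -> np.ndarray:
    """Feed frames one by one and stack the per-frame outputs.

    feat:    (1, 3, T, 257) feature frames
    step_fn: maps (1, 3, 1, 257) -> (1, 1, 257), keeping any state itself
    returns: (1, T, 257)
    """
    frames = [
        step_fn(np.ascontiguousarray(feat[:, :, t:t + 1, :]))
        for t in range(feat.shape[2])
    ]
    if not frames:
        return np.zeros((feat.shape[0], 0, feat.shape[3]), feat.dtype)
    return np.concatenate(frames, axis=1)


# Stand-in for `step`: passes channel 0 of the current frame through.
toy_step = lambda feat_t: feat_t[:, 0, :, :]
feat = np.random.default_rng(0).standard_normal((1, 3, 10, 257)).astype(np.float32)
est = enhance_stream(feat, toy_step)
```

Frames must be fed in order, because a stateful `step_fn` carries its state from one call to the next.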

	## License

	MIT. See the [source repository](https://github.com/LarocheC/eco8-neaixt) for training code and full attribution.