Fine-grained Soundscape Control for Augmented Hearing
Pretrained target sound extraction (TSE) models from the MobiSys 2026 paper.
These models extract target sounds from binaural mixtures in real time (~10 ms latency), conditioned on sound-class labels via FiLM (Feature-wise Linear Modulation).
Fine-Grained Soundscape Control for Augmented Hearing. arXiv:2603.00395. In Proceedings of the 24th ACM International Conference on Mobile Systems, Applications, and Services (MobiSys '26), 2026.
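FiLM conditioning itself is simple to sketch. The snippet below is a minimal, hypothetical illustration (the module name, dimensions, and shapes are made up for the example), not the paper's implementation:

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift features
    with (gamma, beta) predicted from a conditioning vector."""
    def __init__(self, cond_dim: int, n_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, n_channels)
        self.to_beta = nn.Linear(cond_dim, n_channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), cond: (batch, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(-1)  # (batch, channels, 1)
        beta = self.to_beta(cond).unsqueeze(-1)
        return gamma * x + beta

# Condition a feature map on a one-hot sound-class label
film = FiLM(cond_dim=20, n_channels=32)
x = torch.randn(1, 32, 100)   # hypothetical feature map
label = torch.zeros(1, 20)
label[0, 3] = 1.0             # select sound class 3
y = film(x, label)
```

The conditioning leaves the feature map's shape unchanged, so a FiLM layer can be inserted after any (or every) block of the separator network, which is what the FiLM-placement ablation below varies.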
Table 1. Single-channel model comparison.

| Model | Architecture | D | H | B | Params | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|---|---|---|---|---|
| Orange Pi | TFGridNet | 32 | 64 | 6 | 1.9M | 12.26 | 10.16 |
| Raspberry Pi | TFGridNet | 16 | 64 | 3 | 0.5M | 10.81 | 8.89 |
| NeuralAids | TFMLPNet | 32 | 32 | 6 | 0.7M | 11.57 | 9.44 |
| Waveformer | Waveformer | - | - | - | 1.7M | 8.32 | 6.51 |
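SNRi and SI-SDRi are improvement metrics: the metric on the model output minus the same metric on the unprocessed mixture. Below is a minimal NumPy sketch of the standard definitions (not necessarily the paper's exact evaluation code; the signals are synthetic placeholders):

```python
import numpy as np

def snr(est, ref):
    """Signal-to-noise ratio in dB between an estimate and a reference."""
    noise = est - ref
    return 10 * np.log10(np.sum(ref**2) / np.sum(noise**2))

def si_sdr(est, ref):
    """Scale-invariant SDR in dB: project the estimate onto the
    reference before measuring the residual."""
    alpha = np.dot(est, ref) / np.dot(ref, ref)
    target = alpha * ref
    residual = est - target
    return 10 * np.log10(np.sum(target**2) / np.sum(residual**2))

# "Improvement" compares model output vs. the unprocessed mixture
rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)              # hypothetical target signal
mix = ref + 0.5 * rng.standard_normal(16000)  # noisy input mixture
est = ref + 0.1 * rng.standard_normal(16000)  # model estimate
snr_i = snr(est, ref) - snr(mix, ref)
si_sdr_i = si_sdr(est, ref) - si_sdr(mix, ref)
```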
Table 2. Varying the number of outputs (Orange Pi model).

| Outputs | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|
| 5-out | 12.26 | 10.16 |
| 20-out | 11.89 | 9.72 |
Table 3. FiLM placement ablation.

| Model | FiLM | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|---|
| Orange Pi | First | 11.43 | 9.28 |
| Orange Pi | All | 12.31 | 10.18 |
| Orange Pi | All-except-first | 12.26 | 10.16 |
| NeuralAids | First | 10.62 | 8.58 |
| NeuralAids | All | 11.62 | 9.51 |
| NeuralAids | All-except-first | 11.57 | 9.44 |
```python
from huggingface_hub import snapshot_download

# Download model checkpoints + configs
repo_id = "ooshyun/fine_grained_soundscape_control"
model_name = "orange_pi"  # or: raspberry_pi, neuralaid, orange_pi_film_all, etc.
local_dir = snapshot_download(repo_id=repo_id)
```

Using the evaluation script (recommended):

```shell
pip install -r requirements.txt
python -m src.tse.eval --pretrained ooshyun/fine_grained_soundscape_control --model orange_pi --data_dir /path/to/data_dir
```
| Short Name | Description | HF Directory |
|---|---|---|
| `orange_pi` | Table 1 default (1ch, film=all-ex-1st) | `tfgridnet_large_..._1ch_..._film_all_except_first_...` |
| `raspberry_pi` | Table 1 small (1ch) | `tfgridnet_small_..._1ch_...` |
| `neuralaid` | Table 1 TFMLPNet (1ch) | `tfmlpnet_..._1ch_...` |
| `waveformer` | Table 1 baseline (1ch) | `waveformer_...` |
| `orange_pi_5out` | Table 2 (5ch, 5out) | `tfgridnet_large_..._5ch_5spk_5out_..._film_all_except_first_...` |
| `orange_pi_20out` | Table 2 (20ch, 20out) | `tfgridnet_large_..._20ch_5spk_20out_...` |
| `orange_pi_film_first` | Table 3 FiLM=first | `tfgridnet_large_..._5ch_..._film_first_...` |
| `orange_pi_film_all` | Table 3 FiLM=all | `tfgridnet_large_..._5ch_..._film_all_...` |
| `orange_pi_film_all_except_first` | Table 3 FiLM=all-ex-1st | `tfgridnet_large_..._5ch_..._film_all_except_first_...` |
| `neuralaid_film_first` | Table 3 NeuralAids first | `tfmlpnet_..._film_first_...` |
| `neuralaid_film_all` | Table 3 NeuralAids all | `tfmlpnet_..._film_all_layers_6_...` |
| `neuralaid_film_all_except_first` | Table 3 NeuralAids all-ex-1st | `tfmlpnet_..._film_all_except_first_...` |
All models use a causal STFT front end (see each model's config for the exact window and hop sizes).
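As a minimal illustration of what "causal" means here, the sketch below frames the signal without center padding, so no frame looks at future samples; the `n_fft`/`hop` values are placeholders, not the models' actual settings:

```python
import torch

# Hypothetical parameters; the real window/hop come from each model's config.
n_fft, hop = 256, 128
window = torch.hann_window(n_fft)

x = torch.randn(1, 16000)  # mono waveform, batch of 1

# center=False avoids the reflection padding that would leak future samples
# into a frame: frame t covers samples [t*hop, t*hop + n_fft) only.
spec = torch.stft(x, n_fft=n_fft, hop_length=hop, window=window,
                  center=False, return_complex=True)
# spec has shape (batch, n_fft // 2 + 1, n_frames)
```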
Models are trained on BinauralCuratedDataset, which synthesizes binaural mixtures on the fly from six public datasets (FSD50K, ESC-50, MUSDB18, DISCO, TAU-2019, CIPIC HRTF).
Code: [github.com/ooshyun/fine_grained_soundscape_control](https://github.com/ooshyun/fine_grained_soundscape_control)
```bibtex
@inproceedings{aurchestra2026,
  title     = {Fine-Grained Soundscape Control for Augmented Hearing},
  booktitle = {Proceedings of the 24th ACM International Conference on
               Mobile Systems, Applications, and Services (MobiSys '26)},
  year      = {2026},
}
```