Fine-Grained Soundscape Control for Augmented Hearing

Pretrained TSE (Target Sound Extraction) models from the MobiSys '26 paper below.

These models extract target sounds from binaural mixtures in real time (~10 ms algorithmic latency), conditioned on sound class labels via FiLM (Feature-wise Linear Modulation).
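As a quick illustration of the conditioning mechanism, the sketch below shows the core FiLM operation: a class-label embedding is projected to per-channel scale and shift parameters that modulate a feature map. All names and dimensions here are illustrative, not the paper's actual layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(x, gamma, beta):
    """FiLM: per-channel affine modulation of features by a conditioning vector."""
    # x: (channels, frames) feature map; gamma, beta: (channels,)
    return gamma[:, None] * x + beta[:, None]

# Toy example: 32-channel features over 10 frames, conditioned on a label embedding
channels, frames, embed_dim = 32, 10, 16
x = rng.standard_normal((channels, frames))
label_embed = rng.standard_normal(embed_dim)

# A linear projection maps the class embedding to (gamma, beta)
W = rng.standard_normal((2 * channels, embed_dim)) / np.sqrt(embed_dim)
gamma, beta = np.split(W @ label_embed, 2)

y = film(x, gamma, beta)
assert y.shape == x.shape
```

In the real models this modulation is applied inside the TFGridNet/TFMLPNet blocks; Table 3 below ablates which blocks receive it.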

Paper

Fine-Grained Soundscape Control for Augmented Hearing. arXiv:2603.00395.
In Proceedings of the 24th ACM International Conference on Mobile Systems, Applications, and Services (MobiSys '26), 2026.

Models

Table 1: TSE Model Comparison (1ch output, single source)

| Model | Architecture | D | H | B | Params | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|---|---|---|---|---|
| Orange Pi | TFGridNet | 32 | 64 | 6 | 1.9M | 12.26 | 10.16 |
| Raspberry Pi | TFGridNet | 16 | 64 | 3 | 0.5M | 10.81 | 8.89 |
| NeuralAids | TFMLPNet | 32 | 32 | 6 | 0.7M | 11.57 | 9.44 |
| Waveformer | Waveformer | - | - | - | 1.7M | 8.32 | 6.51 |

Table 2: Multi-output Models (Orange Pi, 5 sources)

| Outputs | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|
| 5-out | 12.26 | 10.16 |
| 20-out | 11.89 | 9.72 |

Table 3: FiLM Ablation

| Model | FiLM | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|---|
| Orange Pi | First | 11.43 | 9.28 |
| Orange Pi | All | 12.31 | 10.18 |
| Orange Pi | All-except-first | 12.26 | 10.16 |
| NeuralAids | First | 10.62 | 8.58 |
| NeuralAids | All | 11.62 | 9.51 |
| NeuralAids | All-except-first | 11.57 | 9.44 |
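The three FiLM variants differ only in which of the B blocks receive conditioning. A minimal sketch of that placement policy (function name and policy strings are illustrative, not the repository's API):

```python
def film_layers(num_blocks, policy):
    """Return the indices of blocks that receive FiLM conditioning (illustrative)."""
    if policy == "first":
        return [0]
    if policy == "all":
        return list(range(num_blocks))
    if policy == "all_except_first":
        return list(range(1, num_blocks))
    raise ValueError(f"unknown policy: {policy}")

# Orange Pi has B = 6 blocks
print(film_layers(6, "all_except_first"))  # [1, 2, 3, 4, 5]
```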

Usage

```python
from huggingface_hub import hf_hub_download

# Pick a model by short name (see the Model Names table below)
repo_id = "ooshyun/fine_grained_soundscape_control"
model_name = "orange_pi"  # or: raspberry_pi, neuralaid, orange_pi_film_all, etc.

# Checkpoints and configs live in per-model directories in the repo
# (the "HF Directory" column below); individual files can be fetched with
# hf_hub_download(repo_id, filename=...).

# Using the evaluation script (recommended):
# pip install -r requirements.txt
# python -m src.tse.eval --pretrained ooshyun/fine_grained_soundscape_control --model orange_pi --data_dir /path/to/data_dir
```

Model Names

| Short Name | Description | HF Directory |
|---|---|---|
| orange_pi | Table 1 default (1ch, film=all-ex-1st) | tfgridnet_large_..._1ch_..._film_all_except_first_... |
| raspberry_pi | Table 1 small (1ch) | tfgridnet_small_..._1ch_... |
| neuralaid | Table 1 TFMLPNet (1ch) | tfmlpnet_..._1ch_... |
| waveformer | Table 1 baseline (1ch) | waveformer_... |
| orange_pi_5out | Table 2 (5ch, 5out) | tfgridnet_large_..._5ch_5spk_5out_..._film_all_except_first_... |
| orange_pi_20out | Table 2 (20ch, 20out) | tfgridnet_large_..._20ch_5spk_20out_... |
| orange_pi_film_first | Table 3 FiLM=first | tfgridnet_large_..._5ch_..._film_first_... |
| orange_pi_film_all | Table 3 FiLM=all | tfgridnet_large_..._5ch_..._film_all_... |
| orange_pi_film_all_except_first | Table 3 FiLM=all-ex-1st | tfgridnet_large_..._5ch_..._film_all_except_first_... |
| neuralaid_film_first | Table 3 NeuralAids first | tfmlpnet_..._film_first_... |
| neuralaid_film_all | Table 3 NeuralAids all | tfmlpnet_..._film_all_layers_6_... |
| neuralaid_film_all_except_first | Table 3 NeuralAids all-ex-1st | tfmlpnet_..._film_all_except_first_... |

STFT Configuration

All models use a causal STFT with:

  • Sample rate: 16 kHz
  • Chunk size: 96 samples (6 ms)
  • Lookahead: 64 samples (4 ms)
  • Lookback: 96 samples (6 ms)
  • FFT size: 256
  • Algorithmic latency: 10 ms
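The 10 ms figure follows directly from the numbers above: the model must wait for one chunk plus the lookahead before it can emit output.

```python
SAMPLE_RATE = 16_000  # Hz
chunk, lookahead = 96, 64  # samples

# Algorithmic latency = one chunk plus the lookahead the model waits for
latency_samples = chunk + lookahead                    # 160 samples
latency_ms = 1000 * latency_samples / SAMPLE_RATE
print(latency_ms)  # 10.0
```

The lookback extends the analysis window into the past, so it adds context but no extra latency.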

Training Data

Models are trained on BinauralCuratedDataset, which synthesizes binaural mixtures on the fly from six public datasets (FSD50K, ESC-50, MUSDB18, DISCO, TAU-2019, CIPIC HRTF).
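The core of such on-the-fly synthesis is spatializing each mono source by convolving it with a left/right head-related impulse response (HRIR) pair. A minimal sketch, using random placeholder HRIRs; the actual pipeline draws measured HRIRs from the CIPIC database for a chosen direction:

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16_000

# Mono source and a placeholder HRIR pair (the real pipeline uses CIPIC HRIRs)
source = rng.standard_normal(sr)            # 1 s of audio
hrir_left = rng.standard_normal(128) * 0.1
hrir_right = rng.standard_normal(128) * 0.1

# Binaural rendering: convolve the source with each ear's impulse response
left = np.convolve(source, hrir_left)[:sr]
right = np.convolve(source, hrir_right)[:sr]
binaural = np.stack([left, right])          # (2, time) binaural signal
assert binaural.shape == (2, sr)
```

Summing several sources rendered this way (each with its own direction) yields a binaural mixture; the un-spatialized target serves as the training label.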

Code

github.com/ooshyun/fine_grained_soundscape_control

Citation

@inproceedings{aurchestra2026,
  title     = {Fine-Grained Soundscape Control for Augmented Hearing},
  booktitle = {Proceedings of the 24th ACM International Conference on
               Mobile Systems, Applications, and Services (MobiSys '26)},
  year      = {2026},
}