Fine-Grained Soundscape Control for Augmented Hearing

Pretrained TSE (Target Sound Extraction) models from the MobiSys '26 paper below.

These models extract target sounds from binaural mixtures in real time (~10 ms algorithmic latency), conditioned on sound class labels via FiLM (Feature-wise Linear Modulation).
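As a quick illustration of the conditioning mechanism, the sketch below shows the core FiLM operation: a class-label embedding is projected to per-channel scale and shift parameters that modulate a feature map. All names and dimensions here are illustrative, not the paper's actual layer sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(x, gamma, beta):
    """FiLM: per-channel affine modulation of features by a conditioning vector."""
    # x: (channels, frames) feature map; gamma, beta: (channels,)
    return gamma[:, None] * x + beta[:, None]

# Toy example: 32-channel features over 10 frames, conditioned on a label embedding
channels, frames, embed_dim = 32, 10, 16
x = rng.standard_normal((channels, frames))
label_embed = rng.standard_normal(embed_dim)

# A linear projection maps the class embedding to (gamma, beta)
W = rng.standard_normal((2 * channels, embed_dim)) / np.sqrt(embed_dim)
gamma, beta = np.split(W @ label_embed, 2)

y = film(x, gamma, beta)
assert y.shape == x.shape
```

In the real models this modulation is applied inside the TFGridNet/TFMLPNet blocks; Table 3 below ablates which blocks receive it.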

Paper

Fine-Grained Soundscape Control for Augmented Hearing. arXiv:2603.00395.
In Proceedings of the 24th ACM International Conference on Mobile Systems, Applications, and Services (MobiSys '26), 2026.

Models

Table 1: TSE Model Comparison (1ch output, single source)

| Model | Architecture | D | H | B | Params | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|---|---|---|---|---|
| Orange Pi | TFGridNet | 32 | 64 | 6 | 1.9M | 12.26 | 10.16 |
| Raspberry Pi | TFGridNet | 16 | 64 | 3 | 0.5M | 10.81 | 8.89 |
| NeuralAids | TFMLPNet | 32 | 32 | 6 | 0.7M | 11.57 | 9.44 |
| Waveformer | Waveformer | - | - | - | 1.7M | 8.32 | 6.51 |

Table 2: Multi-output Models (Orange Pi, 5 sources)

| Outputs | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|
| 5-out | 12.26 | 10.16 |
| 20-out | 11.89 | 9.72 |

Table 3: FiLM Ablation

| Model | FiLM | SNRi (dB) | SI-SDRi (dB) |
|---|---|---|---|
| Orange Pi | First | 11.43 | 9.28 |
| Orange Pi | All | 12.31 | 10.18 |
| Orange Pi | All-except-first | 12.26 | 10.16 |
| NeuralAids | First | 10.62 | 8.58 |
| NeuralAids | All | 11.62 | 9.51 |
| NeuralAids | All-except-first | 11.57 | 9.44 |
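The three FiLM variants differ only in which of the B blocks receive conditioning. A minimal sketch of that placement policy (function name and policy strings are illustrative, not the repository's API):

```python
def film_layers(num_blocks, policy):
    """Return the indices of blocks that receive FiLM conditioning (illustrative)."""
    if policy == "first":
        return [0]
    if policy == "all":
        return list(range(num_blocks))
    if policy == "all_except_first":
        return list(range(1, num_blocks))
    raise ValueError(f"unknown policy: {policy}")

# Orange Pi has B = 6 blocks
print(film_layers(6, "all_except_first"))  # [1, 2, 3, 4, 5]
```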

Usage

```python
from huggingface_hub import hf_hub_download

# Pick a model by short name (see the Model Names table below)
repo_id = "ooshyun/fine_grained_soundscape_control"
model_name = "orange_pi"  # or: raspberry_pi, neuralaid, orange_pi_film_all, etc.

# Checkpoints and configs live in per-model directories in the repo
# (the "HF Directory" column below); individual files can be fetched with
# hf_hub_download(repo_id, filename=...).

# Using the evaluation script (recommended):
# pip install -r requirements.txt
# python -m src.tse.eval --pretrained ooshyun/fine_grained_soundscape_control --model orange_pi --data_dir /path/to/data_dir
```

Model Names

| Short Name | Description | HF Directory |
|---|---|---|
| orange_pi | Table 1 default (1ch, film=all-ex-1st) | tfgridnet_large_..._1ch_..._film_all_except_first_... |
| raspberry_pi | Table 1 small (1ch) | tfgridnet_small_..._1ch_... |
| neuralaid | Table 1 TFMLPNet (1ch) | tfmlpnet_..._1ch_... |
| waveformer | Table 1 baseline (1ch) | waveformer_... |
| orange_pi_5out | Table 2 (5ch, 5out) | tfgridnet_large_..._5ch_5spk_5out_..._film_all_except_first_... |
| orange_pi_20out | Table 2 (20ch, 20out) | tfgridnet_large_..._20ch_5spk_20out_... |
| orange_pi_film_first | Table 3 FiLM=first | tfgridnet_large_..._5ch_..._film_first_... |
| orange_pi_film_all | Table 3 FiLM=all | tfgridnet_large_..._5ch_..._film_all_... |
| orange_pi_film_all_except_first | Table 3 FiLM=all-ex-1st | tfgridnet_large_..._5ch_..._film_all_except_first_... |
| neuralaid_film_first | Table 3 NeuralAids first | tfmlpnet_..._film_first_... |
| neuralaid_film_all | Table 3 NeuralAids all | tfmlpnet_..._film_all_layers_6_... |
| neuralaid_film_all_except_first | Table 3 NeuralAids all-ex-1st | tfmlpnet_..._film_all_except_first_... |

STFT Configuration

All models use a causal STFT with:

  • Sample rate: 16 kHz
  • Chunk size: 96 samples (6 ms)
  • Lookahead: 64 samples (4 ms)
  • Lookback: 96 samples (6 ms)
  • FFT size: 256
  • Algorithmic latency: 10 ms
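The 10 ms figure follows directly from the numbers above: the model must wait for one chunk plus the lookahead before it can emit output.

```python
SAMPLE_RATE = 16_000  # Hz
chunk, lookahead = 96, 64  # samples

# Algorithmic latency = one chunk plus the lookahead the model waits for
latency_samples = chunk + lookahead                    # 160 samples
latency_ms = 1000 * latency_samples / SAMPLE_RATE
print(latency_ms)  # 10.0
```

The lookback extends the analysis window into the past, so it adds context but no extra latency.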

Training Data

Models are trained on BinauralCuratedDataset, which synthesizes binaural mixtures on the fly from six public datasets (FSD50K, ESC-50, MUSDB18, DISCO, TAU-2019, CIPIC HRTF).
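The core of such on-the-fly synthesis is spatializing each mono source by convolving it with a left/right head-related impulse response (HRIR) pair. A minimal sketch, using random placeholder HRIRs; the actual pipeline draws measured HRIRs from the CIPIC database for a chosen direction:

```python
import numpy as np

rng = np.random.default_rng(0)
sr = 16_000

# Mono source and a placeholder HRIR pair (the real pipeline uses CIPIC HRIRs)
source = rng.standard_normal(sr)            # 1 s of audio
hrir_left = rng.standard_normal(128) * 0.1
hrir_right = rng.standard_normal(128) * 0.1

# Binaural rendering: convolve the source with each ear's impulse response
left = np.convolve(source, hrir_left)[:sr]
right = np.convolve(source, hrir_right)[:sr]
binaural = np.stack([left, right])          # (2, time) binaural signal
assert binaural.shape == (2, sr)
```

Summing several sources rendered this way (each with its own direction) yields a binaural mixture; the un-spatialized target serves as the training label.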

Code

github.com/ooshyun/fine_grained_soundscape_control

Citation

@inproceedings{aurchestra2026,
  title     = {Fine-Grained Soundscape Control for Augmented Hearing},
  booktitle = {Proceedings of the 24th ACM International Conference on
               Mobile Systems, Applications, and Services (MobiSys '26)},
  year      = {2026},
}