---
license: mit
tags:
  - audio
  - sound-event-detection
  - audio-spectrogram-transformer
  - yamnet
datasets:
  - audioset
language:
  - en
pipeline_tag: audio-classification
---

# Sound Event Detection - Pretrained Models

Pretrained models for Sound Event Detection (SED) used in the MobiSys 2026 submission #198, "Fine-grained Soundscape Control for Augmented Hearing".

## Models

### 1. YAMNet (Pretrained Baseline)

- Source: google/yamnet (TensorFlow) / PyTorch reimplementation
- Classes: 521 AudioSet classes
- Usage: Loaded directly from HuggingFace; no checkpoint in this repo

### 2. AST (Pretrained Baseline)

- Source: MIT/ast-finetuned-audioset-10-10-0.4593
- Architecture: Audio Spectrogram Transformer
- Classes: 527 AudioSet classes
- Usage: Loaded directly from HuggingFace; no checkpoint in this repo

### 3. Fine-tuned AST (sed_ast_snr_ctl_v2_16k)

- Base model: MIT/ast-finetuned-audioset-10-10-0.4593
- Fine-tuned on: On-the-fly synthesized binaural audio mixtures (SNR-controlled, 16 kHz)
- Classes: 20 target sound classes
- Training: AdamW, OneCycleLR with group-wise learning rates (backbone 1e-5, head 1e-3), 80 epochs
- Checkpoint: sed_ast_snr_ctl_v2_16k/checkpoints/best.pt
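The group-wise learning-rate setup described above can be sketched in PyTorch as follows; the two `nn.Linear` modules are hypothetical stand-ins for the AST backbone and the 20-class head, and `steps_per_epoch` is an assumption (it depends on dataset and batch size):

```python
import torch
from torch import nn, optim

# Hypothetical stand-ins for the pretrained backbone and the new head.
backbone = nn.Linear(128, 64)
head = nn.Linear(64, 20)  # 20 target sound classes

# Group-wise learning rates: conservative for pretrained backbone weights,
# larger for the freshly initialized classification head.
optimizer = optim.AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

steps_per_epoch = 100  # assumption; depends on dataset/batch size
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=[1e-5, 1e-3],  # one peak learning rate per parameter group
    epochs=80,
    steps_per_epoch=steps_per_epoch,
)
```

`OneCycleLR` accepts one `max_lr` per parameter group, which is what makes the backbone/head split work with a single scheduler.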

## File Structure

```
.
├── README.md
└── sed_ast_snr_ctl_v2_16k/
    ├── config.json          # Training configuration
    └── checkpoints/
        └── best.pt          # Fine-tuned model weights (~2GB)
```

## Usage

```python
# Fine-tuned AST
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="ooshyun/sound_event_detection",
    filename="sed_ast_snr_ctl_v2_16k/checkpoints/best.pt",
)

config_path = hf_hub_download(
    repo_id="ooshyun/sound_event_detection",
    filename="sed_ast_snr_ctl_v2_16k/config.json",
)
```

For training and evaluation code, see ooshyun/sound_event_detection.

## Citation

If you use these models, please cite:

MobiSys 2026 #198 "Fine-grained Soundscape Control for Augmented Hearing"