# Urban Sound Tagging with Audio Spectrogram Transformer

## Model Description

Fine-tuned Audio Spectrogram Transformer (AST) for hierarchical urban sound classification on the SONYC-UST dataset.

- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
- Dataset: SONYC-UST (Sounds of New York City - Urban Sound Tagging)
- Task: Multi-label hierarchical audio classification
- Fine-grained Classes: 29
- Coarse Classes: 8
## Performance

### Test Set Results

| Metric | Coarse (Macro) | Fine (Macro) |
|---|---|---|
| AUPRC ⭐ | 0.3184 | 0.3983 |
| AUC | 0.7828 | 0.9611 |
| F1 | 0.2070 | 0.1520 |

*AUPRC (average precision) is the primary evaluation metric for DCASE urban sound tagging tasks.*
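For reference, macro AUPRC can be computed with scikit-learn's `average_precision_score`. The toy labels and scores below are purely illustrative, not SONYC-UST data:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy multi-hot labels and scores for 4 clips x 3 classes (illustrative only)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.6],
                    [0.1, 0.8, 0.3],
                    [0.7, 0.6, 0.2],
                    [0.2, 0.1, 0.9]])

# Macro AUPRC: average precision per class, then the unweighted mean
macro_auprc = average_precision_score(y_true, y_score, average="macro")
```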
## Training Configuration

| Parameter | Value |
|---|---|
| Batch size | 128 |
| Learning rate | 5e-5 |
| Epochs | 20 |
| Backbone | Frozen except last 4 layers |
| Mixed precision | Enabled (FP16) |
| Attention | SDPA (Scaled Dot Product) |
| Data loading | In-memory |
| Dataset normalization | SONYC-UST stats |
| Augmentation | SpecAugment + Mixup |
| Mixup α | 0.2 |
| Num workers | 8 |
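The Mixup augmentation listed above (α = 0.2) draws a blend weight from Beta(0.2, 0.2) and mixes both the spectrograms and their multi-hot labels. A minimal NumPy sketch; the `mixup` helper is illustrative, not code from the training repo:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples and their multi-hot labels with a Beta-sampled weight."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Two spectrograms with multi-hot labels (shapes are illustrative)
spec_a, labels_a = np.ones((128, 1024)), np.array([1.0, 0.0, 1.0])
spec_b, labels_b = np.zeros((128, 1024)), np.array([0.0, 1.0, 0.0])
mixed_spec, mixed_labels = mixup(spec_a, labels_a, spec_b, labels_b)
```

With a small α most sampled weights land near 0 or 1, so mixed examples stay close to one of the originals; the soft labels keep the multi-label BCE target consistent with the blend.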
## Quick Start

```python
from transformers import AutoFeatureExtractor
from huggingface_hub import snapshot_download
import sys

# Download model files
repo_path = snapshot_download(
    repo_id="xd-br0/ast-sonyc-ust-v2",
    allow_patterns=["*.py", "*.json", "*.safetensors"]
)

# Make the custom modules importable
sys.path.insert(0, repo_path)
from configuration_ast_sonyc import ASTSONYCConfig
from modeling_ast_sonyc import ASTSONYCModel
from pipeline_sonyc import SONYCAudioPipeline

# Load model and feature extractor
config = ASTSONYCConfig.from_pretrained(repo_path)
model = ASTSONYCModel.from_pretrained(repo_path, config=config)
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_path)

# Create the classifier pipeline
classifier = SONYCAudioPipeline(
    model=model,
    feature_extractor=feature_extractor
)

# Classify audio
results = classifier("urban_sound.wav", top_k=5, threshold=0.3)

# Display results
print("Fine-grained predictions:")
for pred in results["fine"]:
    print(f"{pred['label']}: {pred['score']:.3f}")

print("\nCoarse-grained predictions:")
for pred in results["coarse"]:
    print(f"{pred['label']}: {pred['score']:.3f}")
```
### Alternative: One-liner Helper Function

```python
from transformers import AutoFeatureExtractor
from huggingface_hub import snapshot_download
import sys

def load_ast_sonyc_classifier(repo_id="xd-br0/ast-sonyc-ust-v2"):
    """Load the AST-SONYC classifier from the Hugging Face Hub."""
    # Download the repo and make its modules importable
    repo_path = snapshot_download(
        repo_id=repo_id,
        allow_patterns=["*.py", "*.json", "*.safetensors"]
    )
    sys.path.insert(0, repo_path)

    # Import the custom classes shipped with the repo
    from configuration_ast_sonyc import ASTSONYCConfig
    from modeling_ast_sonyc import ASTSONYCModel
    from pipeline_sonyc import SONYCAudioPipeline

    # Load model and feature extractor
    config = ASTSONYCConfig.from_pretrained(repo_path)
    model = ASTSONYCModel.from_pretrained(repo_path, config=config)
    feature_extractor = AutoFeatureExtractor.from_pretrained(repo_path)
    return SONYCAudioPipeline(model=model, feature_extractor=feature_extractor)

# Usage
classifier = load_ast_sonyc_classifier()
results = classifier("audio.wav")
```
### Advanced: Get All Predictions

```python
# Get all predictions regardless of confidence threshold
results = classifier("audio.wav", return_all=True)

# Or use a custom threshold
results = classifier("audio.wav", threshold=0.1, top_k=10)
```
## Label Hierarchy

### Coarse Classes (8)

- alert-signal
- dog
- engine
- human-voice
- machinery-impact
- music
- non-machinery-impact
- powered-saw
### Fine-grained Classes (29)

- amplified-speech
- car-alarm
- car-horn
- chainsaw
- dog-barking-whining
- engine-of-uncertain-size
- hoe-ram
- ice-cream-truck
- jackhammer
- large-crowd
- large-rotating-saw
- large-sounding-engine
- medium-sounding-engine
- mobile-music
- music-from-uncertain-source
- non-machinery-impact
- other-unknown-alert-signal
- other-unknown-human-voice
- other-unknown-impact-machinery
- other-unknown-powered-saw
- person-or-small-group-shouting
- person-or-small-group-talking
- pile-driver
- reverse-beeper
- rock-drill
- siren
- small-medium-rotating-saw
- small-sounding-engine
- stationary-music
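Each fine-grained class rolls up to exactly one coarse class (e.g. `car-horn` → `alert-signal`). The shipped model applies this fine→coarse masking internally; the subset mapping and max-aggregation helper below are only an illustrative sketch:

```python
# Illustrative subset of the fine -> coarse taxonomy (hypothetical helper;
# the pipeline handles this mapping internally).
FINE_TO_COARSE = {
    "car-alarm": "alert-signal",
    "car-horn": "alert-signal",
    "siren": "alert-signal",
    "jackhammer": "machinery-impact",
    "chainsaw": "powered-saw",
    "dog-barking-whining": "dog",
}

def coarse_from_fine(fine_scores):
    """Roll fine-grained scores up to coarse classes by taking the max per parent."""
    coarse = {}
    for label, score in fine_scores.items():
        parent = FINE_TO_COARSE[label]
        coarse[parent] = max(coarse.get(parent, 0.0), score)
    return coarse

print(coarse_from_fine({"car-horn": 0.89, "siren": 0.76, "chainsaw": 0.12}))
```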
## Model Architecture

```
Input Audio (16 kHz, 10 s)
        ↓
Audio Spectrogram Transformer (MIT/AST)
        ↓
Patch Embeddings (197 patches)
        ↓
Transformer Encoder (12 layers)
        ↓
[CLS] Token Pooling
        ↓
Hierarchical Classification
 ├─→ Fine-grained Head → 29 classes
 └─→ Coarse Head → 8 classes
```
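The final step maps the pooled [CLS] embedding through two independent sigmoid heads. A minimal NumPy sketch with randomly initialised stand-in weights (the real heads are learned, and `hierarchical_heads` is a hypothetical name):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_FINE, N_COARSE = 768, 29, 8

# Randomly initialised stand-ins for the two learned classification heads
W_fine = rng.normal(scale=0.02, size=(HIDDEN, N_FINE))
W_coarse = rng.normal(scale=0.02, size=(HIDDEN, N_COARSE))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hierarchical_heads(cls_embedding):
    """Map the pooled [CLS] embedding to independent multi-label scores."""
    fine = sigmoid(cls_embedding @ W_fine)      # 29 fine-grained scores
    coarse = sigmoid(cls_embedding @ W_coarse)  # 8 coarse scores
    return fine, coarse

fine, coarse = hierarchical_heads(rng.normal(size=HIDDEN))
```

Sigmoid (rather than softmax) outputs let several tags be active at once, which multi-label urban audio requires.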
## Training Details

- Optimizer: AdamW with differential learning rates
  - Backbone: 1e-5
  - Classification heads: 1e-4
- Loss Function: Hierarchical BCE with fine→coarse masking
- Augmentation:
  - SpecAugment (time + frequency masking)
  - Mixup (α=0.2)
  - Gaussian noise
- Mixed Precision: FP16/BF16
- Batch Size: 128
- Scheduler: Cosine annealing with warmup
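The cosine-annealing-with-warmup schedule above can be sketched in pure Python. `lr_at_step`, the step counts, and the warmup length are illustrative assumptions; only the two base rates come from this card:

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then cosine annealing down to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Differential base rates, as listed above
BACKBONE_LR, HEAD_LR = 1e-5, 1e-4
total, warmup = 10_000, 500  # assumed step counts for illustration
print(lr_at_step(warmup, total, HEAD_LR, warmup))  # peak head LR
```

In practice the same schedule is applied to both parameter groups, so the backbone always trains an order of magnitude more gently than the freshly initialised heads.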
## Output Format

The model returns predictions in a hierarchical format:

```python
{
    "fine": [
        {"label": "car-horn", "score": 0.89},
        {"label": "siren", "score": 0.76},
        ...
    ],
    "coarse": [
        {"label": "alert-signal", "score": 0.92},
        {"label": "engine", "score": 0.45},
        ...
    ]
}
```
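The pipeline's `threshold` and `top_k` arguments behave roughly like the hypothetical post-processing helper below (a sketch, not the actual pipeline code):

```python
def filter_predictions(results, threshold=0.3, top_k=5):
    """Keep the top_k predictions per level that clear the score threshold."""
    out = {}
    for level, preds in results.items():
        kept = [p for p in preds if p["score"] >= threshold]
        kept.sort(key=lambda p: p["score"], reverse=True)
        out[level] = kept[:top_k]
    return out

example = {
    "fine": [{"label": "car-horn", "score": 0.89},
             {"label": "siren", "score": 0.76},
             {"label": "dog-barking-whining", "score": 0.05}],
    "coarse": [{"label": "alert-signal", "score": 0.92}],
}
print(filter_predictions(example))
```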
## Limitations

- Geographic Bias: Trained exclusively on New York City urban sounds; may not generalize well to other cities or rural environments
- Fixed Duration: Designed for 10-second audio clips; longer or shorter clips may need preprocessing
- Temporal Resolution: Cannot localize sounds within the 10-second window
- Class Imbalance: Some classes (e.g., dog barking) have fewer training examples
- Environmental Factors: Performance may degrade with:
  - High background noise
  - Multiple overlapping sound sources
  - Weather conditions (wind, rain)
  - Recording quality issues
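For the fixed-duration limitation, one workable preprocessing step is to split longer recordings into 10-second windows before calling the classifier. A sketch in NumPy; the helper name and hop length are assumptions:

```python
import numpy as np

SR = 16_000              # model's expected sample rate
CLIP_SAMPLES = 10 * SR   # 10-second window

def chunk_audio(waveform, hop_seconds=10):
    """Split a 1-D waveform into 10-second clips, zero-padding the last one."""
    hop = hop_seconds * SR
    chunks = []
    for start in range(0, max(len(waveform), 1), hop):
        clip = waveform[start:start + CLIP_SAMPLES]
        if len(clip) < CLIP_SAMPLES:
            clip = np.pad(clip, (0, CLIP_SAMPLES - len(clip)))
        chunks.append(clip)
    return chunks

clips = chunk_audio(np.zeros(25 * SR))  # 25 s recording -> three 10 s clips
```

Per-clip predictions can then be aggregated (e.g. max-pooled per label) to tag the whole recording.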
## Ethical Considerations

- Privacy: Model can identify human voices but not speaker identity
- Bias: Training data reflects NYC's specific urban soundscape
- Intended Use: Environmental monitoring, noise pollution analysis, urban planning
- Misuse Prevention: Should not be used for surveillance without proper consent
## Citation

If you use this model, please cite:

```bibtex
@inproceedings{bello2019sonyc,
  title={SONYC Urban Sound Tagging (SONYC-UST): a multilabel dataset from an urban acoustic sensor network},
  author={Bello, Juan Pablo and Silva, Claudio and Mydlarz, Charlie and Salamon, Justin and Doraiswamy, Harish and Arora, Aneesh and Cartwright, Mark},
  booktitle={Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE)},
  pages={35--39},
  year={2019}
}

@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Interspeech},
  year={2021}
}
```
## License

This model is released under the MIT License. See LICENSE for details.
## Acknowledgments

- MIT for the pre-trained AST model
- NYU SONYC team for the SONYC-UST dataset
- HuggingFace for the Transformers library
## Contact

For questions, issues, or contributions:

- Model Repository: GitLab
- Issues: Please report bugs and feature requests on GitHub
**Note:** This model is intended for research and educational purposes. For production deployment, please evaluate performance on your specific use case and environment.