Urban Sound Tagging with Audio Spectrogram Transformer

Model Description

An Audio Spectrogram Transformer (AST) fine-tuned for hierarchical, multi-label urban sound classification on the SONYC-UST dataset.

  • Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
  • Dataset: SONYC-UST (Sounds of New York City - Urban Sound Tagging)
  • Task: Multi-label hierarchical audio classification
  • Fine-grained Classes: 29
  • Coarse Classes: 8

Performance

Test Set Results

| Metric  | Coarse (Macro) | Fine (Macro) |
|---------|----------------|--------------|
| AUPRC ⭐ | 0.3184         | 0.3983       |
| AUC     | 0.7828         | 0.9611       |
| F1      | 0.2070         | 0.1520       |

AUPRC (area under the precision-recall curve, i.e., average precision) is the primary evaluation metric for DCASE urban sound tagging tasks.
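For intuition, macro AUPRC can be computed as the unweighted mean of per-class average precision. The pure-Python sketch below illustrates the idea; it is not the DCASE evaluation code, which should be used for official comparisons.

```python
def average_precision(y_true, y_score):
    """AP for one class: mean precision at the rank of each true positive."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            ap += hits / rank
    return ap / hits if hits else 0.0

def macro_auprc(y_true_per_class, y_score_per_class):
    """Unweighted (macro) mean of per-class AP."""
    aps = [average_precision(t, s)
           for t, s in zip(y_true_per_class, y_score_per_class)]
    return sum(aps) / len(aps)

# A perfect ranking gives AP = 1.0 for that class
print(average_precision([1, 1, 0], [0.9, 0.8, 0.1]))  # 1.0
```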

Training Configuration

| Parameter             | Value                           |
|-----------------------|---------------------------------|
| Batch size            | 128                             |
| Learning rate         | 5e-5                            |
| Epochs                | 20                              |
| Backbone              | Frozen (last 4 layers unfrozen) |
| Mixed precision       | Enabled (FP16)                  |
| Attention             | SDPA (Scaled Dot Product)       |
| Data loading          | In-memory                       |
| Dataset normalization | SONYC-UST stats                 |
| Augmentation          | SpecAugment + Mixup             |
| Mixup α               | 0.2                             |
| Num workers           | 8                               |

Quick Start

from transformers import AutoFeatureExtractor
from huggingface_hub import snapshot_download
import sys

# Download model files
repo_path = snapshot_download(
    repo_id="xd-br0/ast-sonyc-ust-v2",
    allow_patterns=["*.py", "*.json", "*.safetensors"]
)

# Add to Python path
sys.path.insert(0, repo_path)

# Import custom classes
from configuration_ast_sonyc import ASTSONYCConfig
from modeling_ast_sonyc import ASTSONYCModel
from pipeline_sonyc import SONYCAudioPipeline

# Load model
config = ASTSONYCConfig.from_pretrained(repo_path)
model = ASTSONYCModel.from_pretrained(repo_path, config=config)
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_path)

# Create classifier
classifier = SONYCAudioPipeline(
    model=model,
    feature_extractor=feature_extractor
)

# Classify audio
results = classifier("urban_sound.wav", top_k=5, threshold=0.3)

# Display results
print("Fine-grained predictions:")
for pred in results["fine"]:
    print(f"{pred['label']}: {pred['score']:.3f}")

print("\nCoarse-grained predictions:")
for pred in results["coarse"]:
    print(f"{pred['label']}: {pred['score']:.3f}")

Alternative: Helper Function for One-line Loading

from transformers import AutoFeatureExtractor
from huggingface_hub import snapshot_download
import sys

def load_ast_sonyc_classifier(repo_id="xd-br0/ast-sonyc-ust-v2"):
    """Load AST-SONYC classifier from HuggingFace Hub"""
    # Download and setup
    repo_path = snapshot_download(repo_id=repo_id, allow_patterns=["*.py", "*.json", "*.safetensors"])
    sys.path.insert(0, repo_path)

    # Import custom classes
    from configuration_ast_sonyc import ASTSONYCConfig
    from modeling_ast_sonyc import ASTSONYCModel
    from pipeline_sonyc import SONYCAudioPipeline

    # Load model
    config = ASTSONYCConfig.from_pretrained(repo_path)
    model = ASTSONYCModel.from_pretrained(repo_path, config=config)
    feature_extractor = AutoFeatureExtractor.from_pretrained(repo_path)

    return SONYCAudioPipeline(model=model, feature_extractor=feature_extractor)

# Usage
classifier = load_ast_sonyc_classifier()
results = classifier("audio.wav")

Advanced: Get All Predictions

# Get all predictions regardless of confidence threshold
results = classifier("audio.wav", return_all=True)

# Or use a custom threshold
results = classifier("audio.wav", threshold=0.1, top_k=10)

Label Hierarchy

Coarse Classes (8)

  • alert-signal
  • dog
  • engine
  • human-voice
  • machinery-impact
  • music
  • non-machinery-impact
  • powered-saw

Fine-grained Classes (29)

  • amplified-speech
  • car-alarm
  • car-horn
  • chainsaw
  • dog-barking-whining
  • engine-of-uncertain-size
  • hoe-ram
  • ice-cream-truck
  • jackhammer
  • large-crowd
  • large-rotating-saw
  • large-sounding-engine
  • medium-sounding-engine
  • mobile-music
  • music-from-uncertain-source
  • non-machinery-impact
  • other-unknown-alert-signal
  • other-unknown-human-voice
  • other-unknown-impact-machinery
  • other-unknown-powered-saw
  • person-or-small-group-shouting
  • person-or-small-group-talking
  • pile-driver
  • reverse-beeper
  • rock-drill
  • siren
  • small-medium-rotating-saw
  • small-sounding-engine
  • stationary-music
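The fine-to-coarse grouping implied by the label names can be captured as a simple lookup table. The mapping below follows the published SONYC-UST taxonomy, but treat it as illustrative; the authoritative mapping ships with the dataset.

```python
# Fine → coarse grouping per the SONYC-UST taxonomy (illustrative lookup).
COARSE_CHILDREN = {
    "alert-signal": ["car-alarm", "car-horn", "other-unknown-alert-signal",
                     "reverse-beeper", "siren"],
    "dog": ["dog-barking-whining"],
    "engine": ["engine-of-uncertain-size", "large-sounding-engine",
               "medium-sounding-engine", "small-sounding-engine"],
    "human-voice": ["amplified-speech", "large-crowd",
                    "other-unknown-human-voice",
                    "person-or-small-group-shouting",
                    "person-or-small-group-talking"],
    "machinery-impact": ["hoe-ram", "jackhammer",
                         "other-unknown-impact-machinery",
                         "pile-driver", "rock-drill"],
    "music": ["ice-cream-truck", "mobile-music",
              "music-from-uncertain-source", "stationary-music"],
    "non-machinery-impact": ["non-machinery-impact"],
    "powered-saw": ["chainsaw", "large-rotating-saw",
                    "other-unknown-powered-saw", "small-medium-rotating-saw"],
}
# Invert to map each fine label to its coarse parent
FINE_TO_COARSE = {f: c for c, fines in COARSE_CHILDREN.items() for f in fines}
```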

Model Architecture

Input Audio (16kHz, 10s)
    ↓
Audio Spectrogram Transformer (MIT/AST)
    ↓
Patch Embeddings (197 patches)
    ↓
Transformer Encoder (12 layers)
    ↓
[CLS] Token Pooling
    ↓
Hierarchical Classification
    ├─→ Fine-grained Head → 29 classes
    └─→ Coarse Head → 8 classes
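The two heads at the bottom of the diagram are independent sigmoid classifiers over the pooled [CLS] embedding. The NumPy sketch below shows the structure only; the shapes (768-dim embedding) and names are illustrative, not the repo's actual module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_heads(cls_embedding, w_fine, b_fine, w_coarse, b_coarse):
    """Apply two independent sigmoid heads to the pooled [CLS] vector.

    cls_embedding: (768,) pooled transformer output (illustrative dim)
    w_fine: (29, 768), w_coarse: (8, 768)
    """
    fine = sigmoid(w_fine @ cls_embedding + b_fine)        # (29,) per-class probs
    coarse = sigmoid(w_coarse @ cls_embedding + b_coarse)  # (8,)
    return fine, coarse
```

Because each output is an independent sigmoid (not a softmax), multiple labels can be active at once, which is what the multi-label task requires.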

Training Details

  • Optimizer: AdamW with differential learning rates
    • Backbone: 1e-5
    • Classification heads: 1e-4
  • Loss Function: Hierarchical BCE with fine→coarse masking
  • Augmentation:
    • SpecAugment (time + frequency masking)
    • Mixup (α=0.2)
    • Gaussian noise
  • Mixed Precision: FP16/BF16
  • Batch Size: 128
  • Scheduler: Cosine annealing with warmup
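The hierarchical BCE with fine→coarse masking can be sketched as below: fine-level loss terms are zeroed out by a mask (e.g., where fine labels are uncertain, leaving only coarse supervision). The exact masking rules in this repo may differ; this is a minimal illustration, not the training code.

```python
import numpy as np

def hierarchical_bce(p_fine, y_fine, p_coarse, y_coarse, fine_mask):
    """Masked hierarchical BCE (illustrative sketch).

    fine_mask zeroes out fine-level terms for clips whose fine labels are
    uncertain, so only the coarse loss supervises those clips.
    """
    eps = 1e-7
    def bce(p, y):
        return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    fine_loss = (bce(p_fine, y_fine) * fine_mask).sum() / max(fine_mask.sum(), 1.0)
    coarse_loss = bce(p_coarse, y_coarse).mean()
    return fine_loss + coarse_loss
```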

Output Format

The model returns predictions in a hierarchical format:

{
    "fine": [
        {"label": "car-horn", "score": 0.89},
        {"label": "siren", "score": 0.76},
        ...
    ],
    "coarse": [
        {"label": "alert-signal", "score": 0.92},
        {"label": "engine", "score": 0.45},
        ...
    ]
}
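Downstream code can consume this dictionary directly; for example, a hypothetical helper that flags coarse categories above a confidence threshold:

```python
def flagged_categories(results, threshold=0.5):
    """Return coarse labels whose score meets the threshold."""
    return [p["label"] for p in results["coarse"] if p["score"] >= threshold]

sample = {
    "fine": [{"label": "car-horn", "score": 0.89}],
    "coarse": [{"label": "alert-signal", "score": 0.92},
               {"label": "engine", "score": 0.45}],
}
print(flagged_categories(sample))  # ['alert-signal']
```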

Limitations

  • Geographic Bias: Trained exclusively on New York City urban sounds; may not generalize well to other cities or rural environments
  • Fixed Duration: Designed for 10-second audio clips; longer/shorter clips may need preprocessing
  • Temporal Resolution: Cannot localize sounds within the 10-second window
  • Class Imbalance: Some classes (e.g., dog barking) have fewer training examples
  • Environmental Factors: Performance may degrade with:
    • High background noise
    • Multiple overlapping sound sources
    • Weather conditions (wind, rain)
    • Recording quality issues
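Regarding the fixed 10-second duration: longer clips must be trimmed and shorter ones padded before feature extraction. A minimal sketch, assuming 16 kHz mono input as a NumPy array (windowing long clips into multiple 10-second segments is another option):

```python
import numpy as np

def fit_to_length(wav, sr=16000, seconds=10.0):
    """Trim or zero-pad a mono waveform to exactly `seconds` at rate `sr`."""
    target = int(sr * seconds)
    if len(wav) >= target:
        return wav[:target]          # trim the tail of longer clips
    return np.pad(wav, (0, target - len(wav)))  # zero-pad shorter clips
```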

Ethical Considerations

  • Privacy: The model detects the presence of human voices but does not identify individual speakers
  • Bias: Training data reflects NYC's specific urban soundscape
  • Intended Use: Environmental monitoring, noise pollution analysis, urban planning
  • Misuse Prevention: Should not be used for surveillance without proper consent

Citation

If you use this model, please cite:

@inproceedings{bello2019sonyc,
  title={SONYC Urban Sound Tagging (SONYC-UST): a multilabel dataset from an urban acoustic sensor network},
  author={Bello, Juan Pablo and Silva, Claudio and Mydlarz, Charlie and Salamon, Justin and Doraiswamy, Harish and Arora, Aneesh and Cartwright, Mark},
  booktitle={Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE)},
  pages={35--39},
  year={2019}
}

@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Interspeech},
  year={2021}
}

License

This model is released under the MIT License. See LICENSE for details.

Acknowledgments

  • MIT for the pre-trained AST model
  • NYU SONYC team for the SONYC-UST dataset
  • HuggingFace for the Transformers library

Contact

For questions, issues, or contributions:

  • Model Repository: Gitlab
  • Issues: Please report bugs and feature requests on GitHub

Note: This model is intended for research and educational purposes. For production deployment, please evaluate performance on your specific use case and environment.
