SIREN-TRANSCRIBE
Music Analysis with a Self-Supervised Foundation Model
SIREN-TRANSCRIBE is part of the SIREN Audio Suite, a family of neural audio processing models designed for professional music production workflows.
Model Description
SIREN-TRANSCRIBE extracts musical key and tempo from audio. It uses a 330M-parameter self-supervised music encoder as its foundation model, with two custom heads on top: a classification head for key and a regression head for tempo.
Key capabilities:
- Key detection - Identify musical key (24 classes: major and minor)
- Tempo estimation - Estimate tempo in BPM (regression output)
- Foundation model - Built on a 330M-parameter self-supervised music encoder
Architecture
| Component | Details |
|---|---|
| Base Model | 330M-parameter music encoder (24 transformer layers) |
| Key Head | Custom MLP (24 classes) |
| Tempo Head | Custom MLP (regression) |
| Sample Rate | 24 kHz |
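As a rough illustration of how the two head outputs could be turned into the `{"key": ..., "tempo": ...}` result shown under Usage, here is a minimal sketch. The label ordering (12 major keys followed by 12 minor keys), the `KEY_LABELS` list, and the `decode_outputs` helper are all assumptions made for this example; the checkpoint may order its classes differently.

```python
# Hypothetical label order: 12 major keys, then 12 minor keys.
# The real checkpoint may use a different ordering.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
KEY_LABELS = PITCH_CLASSES + [p + "m" for p in PITCH_CLASSES]


def decode_outputs(key_logits, tempo_bpm):
    """Turn raw head outputs into a {"key", "tempo"} dictionary."""
    if len(key_logits) != len(KEY_LABELS):
        raise ValueError(
            f"expected {len(KEY_LABELS)} key logits, got {len(key_logits)}"
        )
    # Argmax over the 24 key classes picks the predicted key.
    best = max(range(len(key_logits)), key=lambda i: key_logits[i])
    # The tempo head is a regression output, so it is passed through as a float.
    return {"key": KEY_LABELS[best], "tempo": round(float(tempo_bpm), 1)}


logits = [0.0] * 24
logits[21] = 3.2  # index 21 is "Am" under the assumed ordering
print(decode_outputs(logits, 120.5))
```

Keeping the label list in one place makes it easy to swap in the real class ordering once it is known.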
The SIREN Family
| Model | Purpose |
|---|---|
| SIREN-FX | Neural audio effects |
| SIREN-FIX | Audio restoration and repair |
| SIREN-MASTER | Audio enhancement and mastering |
| SIREN-STEER | Steerable audio transformations |
| SIREN-SEPARATE | Source separation |
| SIREN-TRANSCRIBE | Music analysis (this model) |
Usage
import torch
# Load the checkpoint onto the CPU
checkpoint = torch.load('siren_transcribe.pt', map_location='cpu')
# The model expects audio sampled at 24 kHz
# Example output: {"key": "Am", "tempo": 120.5}
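Because the model expects 24 kHz audio, it can help to verify a file's sample rate before running inference. The sketch below uses only Python's standard `wave` module; the `check_wav_sample_rate` helper is a hypothetical name introduced here and is not part of the released code. It only checks WAV files and does not resample.

```python
import wave

EXPECTED_SAMPLE_RATE = 24_000  # from the model card: input must be 24 kHz


def check_wav_sample_rate(path, expected=EXPECTED_SAMPLE_RATE):
    """Return the WAV file's sample rate, raising if it is not the expected one."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
    if rate != expected:
        raise ValueError(
            f"{path}: sample rate is {rate} Hz, expected {expected} Hz; "
            "resample the audio first"
        )
    return rate
```

Audio at other rates (for example 44.1 kHz) would need to be resampled with an audio library of your choice before being passed to the model.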
Training Details
- Training Data: Large-scale music dataset with analysis labels
- Hardware: NVIDIA B200 GPU
- Training Duration: 100 epochs
Intended Use
- Musical key detection
- Tempo/BPM estimation
- Music information retrieval
- DJ tools and music organization
- Research in music understanding
Limitations
- Key detection is limited to the 24 major and minor keys
- Tempo estimation works best in the 60-200 BPM range
- Requires 24 kHz input
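The model card does not say how estimates outside the 60-200 BPM range should be handled. One common convention in tempo estimation is to fold a value by octaves (doubling or halving) until it lands in the preferred range. The `fold_tempo` helper below is a hypothetical post-processing step for downstream tools, not something the model itself is known to do.

```python
def fold_tempo(bpm, low=60.0, high=200.0):
    """Double or halve a tempo estimate until it lies within [low, high]."""
    if bpm <= 0:
        raise ValueError("tempo must be positive")
    if high < 2 * low:
        # Guarantees that doubling or halving can always land inside the range.
        raise ValueError("range must span at least one octave")
    while bpm < low:
        bpm *= 2.0
    while bpm > high:
        bpm /= 2.0
    return bpm


print(fold_tempo(45.0))   # below the range, doubled once
print(fold_tempo(250.0))  # above the range, halved once
```

Folding changes the reported value, so a tool may prefer to keep the raw estimate alongside the folded one.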
License
Apache 2.0
Citation
@software{siren_transcribe_2026,
title={SIREN-TRANSCRIBE: Music Analysis with Foundation Model},
author={SIREN Team},
year={2026},
url={https://huggingface.co/hilarl/siren-transcribe}
}