---
license: apache-2.0
language:
- en
base_model: MIT/ast-finetuned-audioset-10-10-0.4593
tags:
- audio-classification
- music
- speech
- ast
- audio-spectrogram-transformer
pipeline_tag: audio-classification
datasets:
- AIGenLab/speech-music-82k
metrics:
- accuracy
library_name: transformers
---
|
|
|
|
|
# AST Music vs Speech Classifier (82K)

Fine-tuned Audio Spectrogram Transformer (AST) for binary music vs speech classification.

## Model Details

- **Base Model:** MIT/ast-finetuned-audioset-10-10-0.4593
- **Task:** Binary Audio Classification (Music vs Speech)
- **Training Dataset:** AIGenLab/speech-music-82k (82,000 samples)
- **Overall Accuracy:** 86.7% (26/30 test files)

---
|
|
|
|
|
## Performance Results

| Category | Accuracy | Correct | Total |
|----------|----------|---------|-------|
| Pure Music | 100.0% | 10 | 10 |
| Pure Speech | 60.0% | 6 | 10 |
| Speech + Music | 100.0% | 10 | 10 |

### Pure Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_10.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_3.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

### Pure Speech

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_1.wav | 0.051 | 0.949 | SPEECH | ✅ |
| speech_10.wav | 0.039 | 0.961 | SPEECH | ✅ |
| speech_2.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_3.wav | 0.372 | 0.628 | SPEECH | ✅ |
| speech_4.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_5.wav | 0.997 | 0.003 | MUSIC | ❌ |
| speech_6.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_7.wav | 0.170 | 0.830 | SPEECH | ✅ |
| speech_8.wav | 0.870 | 0.130 | MUSIC | ❌ |
| speech_9.wav | 0.035 | 0.965 | SPEECH | ✅ |

### Speech + Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_and_music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_10.wav | 0.987 | 0.013 | MUSIC | ✅ |
| speech_and_music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_3wav.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

---
|
|
|
|
|
## Quick Start

```python
from transformers import pipeline

# Load the model
classifier = pipeline(
    "audio-classification",
    model="AIGenLab/AST-speech-and-music-classifier-82K"
)

# Classify audio
result = classifier("your_audio.wav")
print(result)
```
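The pipeline returns a list of `{label, score}` dicts. A minimal helper for turning that into a single decision; the label strings below are illustrative assumptions — check `model.config.id2label` for the names this checkpoint actually uses:

```python
def top_label(result, threshold=0.5):
    """Return the highest-scoring label, or 'uncertain' below the threshold."""
    best = max(result, key=lambda r: r["score"])
    return best["label"] if best["score"] >= threshold else "uncertain"

# Example with a pipeline-style output (hardcoded for illustration):
scores = [{"label": "music", "score": 0.93}, {"label": "speech", "score": 0.07}]
print(top_label(scores))  # music
```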
|
|
|
|
|
--- |
|
|
|
|
|
## Advanced Usage

```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torch
import torchaudio

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-82K"
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-82K"
)

# Load audio (16 kHz required)
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# Process
inputs = feature_extractor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Class order assumed music=0, speech=1; verify with model.config.id2label
music_score = predictions[0][0].item()
speech_score = predictions[0][1].item()

print(f"Music: {music_score:.3f}")
print(f"Speech: {speech_score:.3f}")
```
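AST operates on roughly 10-second spectrograms, so the feature extractor effectively truncates longer recordings. One option for long files — an assumption about workflow, not something this model card prescribes — is to split the waveform into fixed-length chunks, classify each, and aggregate:

```python
def chunk_audio(samples, sample_rate=16000, chunk_seconds=10.0):
    """Split a mono sample sequence into fixed-length chunks (last may be shorter)."""
    chunk_len = int(sample_rate * chunk_seconds)
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]

# 25 s at 16 kHz -> three chunks: 10 s, 10 s, 5 s
samples = [0.0] * (16000 * 25)
print([len(c) for c in chunk_audio(samples)])  # [160000, 160000, 80000]
```

Run each chunk through the feature extractor and model as above, then average the softmax scores (or take a majority vote) across chunks.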
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Dataset | AIGenLab/speech-music-82k (82,000 samples) |
| Epochs | 1 |
| Batch Size | 64 |
| Learning Rate | 3e-5 |
| Loss Weight | Music: 2.5x, Speech: 1.0x |
| Optimizer | AdamW |
| Framework | Transformers + PyTorch |
|
|
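The loss weights in the table correspond to class-weighted cross-entropy — in PyTorch, `torch.nn.CrossEntropyLoss(weight=torch.tensor([2.5, 1.0]))`, where music at index 0 is an assumption. A dependency-free sketch of the same computation for a single example:

```python
import math

def weighted_cross_entropy(logits, target, weights=(2.5, 1.0)):
    """Class-weighted cross-entropy for one example (assumes music=0, speech=1)."""
    m = max(logits)  # shift logits for numerical stability
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return -weights[target] * (logits[target] - log_norm)

# The same logit gap costs 2.5x more when the true class is music:
music_loss = weighted_cross_entropy([0.0, 2.0], target=0)
speech_loss = weighted_cross_entropy([2.0, 0.0], target=1)
print(round(music_loss / speech_loss, 3))  # 2.5
```

Up-weighting the music class penalizes music-misclassified-as-speech errors more heavily, which is consistent with the test results above, where errors concentrate on the speech side.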
|