---
license: apache-2.0
language:
- en
base_model: MIT/ast-finetuned-audioset-10-10-0.4593
tags:
- audio-classification
- music
- speech
- ast
- audio-spectrogram-transformer
pipeline_tag: audio-classification
datasets:
- AIGenLab/speech-music-82k
metrics:
- accuracy
library_name: transformers
---
# 🎵 AST Music vs Speech Classifier (82K)

Fine-tuned Audio Spectrogram Transformer (AST) for classifying audio as music or speech.
## Model Details

- **Base Model:** MIT/ast-finetuned-audioset-10-10-0.4593
- **Task:** Binary audio classification (music vs speech)
- **Training Dataset:** AIGenLab/speech-music-82k (82,000 samples)
- **Overall Accuracy:** 86.7% (26/30 clips on a small 30-clip evaluation set)
## 📊 Performance Results

| Category | Accuracy | Correct | Total |
|---|---|---|---|
| Pure Music | 100.0% | 10 | 10 |
| Pure Speech | 60.0% | 6 | 10 |
| Speech + Music | 100.0% | 10 | 10 |
### Pure Music

| File | Music Score | Speech Score | Prediction | Result |
|---|---|---|---|---|
| music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_10.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_3.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |
### Pure Speech

| File | Music Score | Speech Score | Prediction | Result |
|---|---|---|---|---|
| speech_1.wav | 0.051 | 0.949 | SPEECH | ✅ |
| speech_10.wav | 0.039 | 0.961 | SPEECH | ✅ |
| speech_2.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_3.wav | 0.372 | 0.628 | SPEECH | ✅ |
| speech_4.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_5.wav | 0.997 | 0.003 | MUSIC | ❌ |
| speech_6.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_7.wav | 0.170 | 0.830 | SPEECH | ✅ |
| speech_8.wav | 0.870 | 0.130 | MUSIC | ❌ |
| speech_9.wav | 0.035 | 0.965 | SPEECH | ✅ |
### Speech + Music

Clips containing both speech and music carry the MUSIC label in this evaluation.

| File | Music Score | Speech Score | Prediction | Result |
|---|---|---|---|---|
| speech_and_music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_10.wav | 0.987 | 0.013 | MUSIC | ✅ |
| speech_and_music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_3wav.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |
## 🚀 Quick Start

```python
from transformers import pipeline

# Load the fine-tuned model
classifier = pipeline(
    "audio-classification",
    model="AIGenLab/AST-speech-and-music-classifier-82K"
)

# Classify an audio file
result = classifier("your_audio.wav")
print(result)
```
## 🔧 Advanced Usage

```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torch
import torchaudio

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-82K"
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-82K"
)

# Load audio and resample to the 16 kHz the model expects
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# Extract spectrogram features
inputs = feature_extractor(
    audio.squeeze().numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
music_score = predictions[0][0].item()
speech_score = predictions[0][1].item()

print(f"Music: {music_score:.3f}")
print(f"Speech: {speech_score:.3f}")
```
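Rather than hardcoding score indices, it is safer to read the label order from the checkpoint's `model.config.id2label` mapping. A minimal dependency-free sketch of that lookup, assuming the index order `0 = music, 1 = speech` used above:

```python
import math

# Assumed label map; the real one ships with the checkpoint as
# model.config.id2label and should be read from there.
id2label = {0: "music", 1: "speech"}

def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.1, -0.4]  # dummy logits for one clip
probs = softmax(logits)
scores = {id2label[i]: p for i, p in enumerate(probs)}
top_label = max(scores, key=scores.get)
```

Using `id2label` keeps the code correct even if a future revision of the checkpoint changes the label order.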
## 📈 Training Details

| Parameter | Value |
|---|---|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Dataset | AIGenLab/speech-music-82k (82,000 samples) |
| Epochs | 1 |
| Batch Size | 64 |
| Learning Rate | 3e-5 |
| Class Loss Weights | Music: 2.5x, Speech: 1.0x |
| Optimizer | AdamW |
| Framework | Transformers + PyTorch |
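The asymmetric loss weights in the table mean a misclassified music example is penalized 2.5x as heavily as a misclassified speech example. The training script is not published, so the following is an assumed reconstruction of that class-weighted cross-entropy (label order `0 = music, 1 = speech` is also an assumption), written with only the standard library:

```python
import math

# Class weights from the training table: Music 2.5x, Speech 1.0x (assumed order)
CLASS_WEIGHTS = [2.5, 1.0]

def weighted_ce(logits, label):
    """Cross-entropy for one example, scaled by its class weight."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return CLASS_WEIGHTS[label] * (log_sum - logits[label])

# Two equally wrong predictions: the music error costs 2.5x more
loss_music = weighted_ce([0.0, 2.0], label=0)   # music clip predicted as speech
loss_speech = weighted_ce([2.0, 0.0], label=1)  # speech clip predicted as music
```

In PyTorch this corresponds to `nn.CrossEntropyLoss(weight=torch.tensor([2.5, 1.0]))`; the weighting likely explains the strong bias toward MUSIC visible in the Pure Speech results.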