---
license: apache-2.0
language:
- en
base_model: MIT/ast-finetuned-audioset-10-10-0.4593
tags:
- audio-classification
- music
- speech
- ast
- audio-spectrogram-transformer
pipeline_tag: audio-classification
datasets:
- AIGenLab/speech-music-1k
metrics:
- accuracy
library_name: transformers
---

# 🎵 AST Music vs Speech Classifier

Fine-tuned Audio Spectrogram Transformer (AST) for music vs. speech classification.

## Model Details

- **Base Model:** MIT/ast-finetuned-audioset-10-10-0.4593
- **Task:** Binary Audio Classification (Music vs Speech)
- **Training Dataset:** AIGenLab/speech-music-1k (1000 samples)
- **Overall Accuracy:** 90.0% (27/30 on the evaluation clips below)

---

## 📊 Performance Results

| Category | Accuracy | Correct | Total |
|----------|----------|---------|-------|
| Pure Music | 100.0% | 10 | 10 |
| Pure Speech | 70.0% | 7 | 10 |
| Speech + Music | 100.0% | 10 | 10 |

### Pure Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_10.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_3.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

### Pure Speech

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_1.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_10.wav | 0.002 | 0.998 | SPEECH | ✅ |
| speech_2.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_3.wav | 0.714 | 0.286 | MUSIC | ❌ |
| speech_4.wav | 0.906 | 0.094 | MUSIC | ❌ |
| speech_5.wav | 0.350 | 0.650 | SPEECH | ✅ |
| speech_6.wav | 0.895 | 0.105 | MUSIC | ❌ |
| speech_7.wav | 0.068 | 0.932 | SPEECH | ✅ |
| speech_8.wav | 0.097 | 0.903 | SPEECH | ✅ |
| speech_9.wav | 0.083 | 0.917 | SPEECH | ✅ |

### Speech + Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_and_music_1.wav | 0.995 | 0.005 | MUSIC | ✅ |
| speech_and_music_10.wav | 0.987 | 0.013 | MUSIC | ✅ |
| speech_and_music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_3wav.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_5.wav | 0.998 | 0.002 | MUSIC | ✅ |
| speech_and_music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

---

## 🚀 Quick Start

```python
from transformers import pipeline

# Load the fine-tuned model
classifier = pipeline(
    "audio-classification",
    model="AIGenLab/AST-speech-and-music-classifier-1K"
)

# Classify an audio file
result = classifier("your_audio.wav")
print(result)
```

---

## 🔧 Advanced Usage

```python
import torch
import torchaudio
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-1K"
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-1K"
)

# Load audio and resample to the 16 kHz the model expects
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)

# Downmix multi-channel audio to mono (squeeze alone only handles mono files)
audio = audio.mean(dim=0)

# Extract spectrogram features
inputs = feature_extractor(
    audio.numpy(),
    sampling_rate=16000,
    return_tensors="pt"
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

music_score = predictions[0][0].item()
speech_score = predictions[0][1].item()

print(f"Music: {music_score:.3f}")
print(f"Speech: {speech_score:.3f}")
```

---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Dataset | AIGenLab/speech-music-1k (1000 samples) |
| Epochs | 1 |
| Batch Size | 64 |
| Learning Rate | 3e-5 |
| Loss Weight | Music: 2.5x, Speech: 1.0x |
| Optimizer | AdamW |
| Framework | Transformers + PyTorch |