---
license: apache-2.0
language:
- en
base_model: MIT/ast-finetuned-audioset-10-10-0.4593
tags:
- audio-classification
- music
- speech
- ast
- audio-spectrogram-transformer
pipeline_tag: audio-classification
datasets:
- AIGenLab/speech-music-82k
metrics:
- accuracy
library_name: transformers
---

# 🎵 AST Music vs Speech Classifier (82K)

An Audio Spectrogram Transformer (AST) fine-tuned to classify audio clips as music or speech. In the evaluation below, clips that mix speech with music are counted as music.

## Model Details

- **Base Model:** MIT/ast-finetuned-audioset-10-10-0.4593
- **Task:** Binary Audio Classification (Music vs Speech)
- **Training Dataset:** AIGenLab/speech-music-82k (82,000 samples)
- **Overall Accuracy:** 86.7% (26/30 files in the evaluation below)

---

## 📊 Performance Results

| Category | Accuracy | Correct | Total |
|----------|----------|---------|-------|
| Pure Music | 100.0% | 10 | 10 |
| Pure Speech | 60.0% | 6 | 10 |
| Speech + Music | 100.0% | 10 | 10 |
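
The overall figure follows directly from the per-category counts. As a minimal sketch (the dictionary below simply copies the Correct and Total columns from the table; the variable names are illustrative), the arithmetic can be reproduced like this:

```python
# (correct, total) per category, copied from the summary table above
results = {
    "Pure Music": (10, 10),
    "Pure Speech": (6, 10),
    "Speech + Music": (10, 10),
}

# Per-category accuracy is correct / total
per_category = {name: correct / total for name, (correct, total) in results.items()}

# Overall accuracy pools all files rather than averaging the percentages
total_correct = sum(correct for correct, _ in results.values())
total_files = sum(total for _, total in results.values())
overall = total_correct / total_files

for name, acc in per_category.items():
    print(f"{name}: {acc:.1%}")
print(f"Overall: {overall:.1%} ({total_correct}/{total_files})")
```

Because every category has the same number of files, pooling the counts and averaging the three percentages give the same result here.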

### Pure Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_10.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_3.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |

### Pure Speech

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_1.wav | 0.051 | 0.949 | SPEECH | ✅ |
| speech_10.wav | 0.039 | 0.961 | SPEECH | ✅ |
| speech_2.wav | 0.000 | 1.000 | SPEECH | ✅ |
| speech_3.wav | 0.372 | 0.628 | SPEECH | ✅ |
| speech_4.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_5.wav | 0.997 | 0.003 | MUSIC | ❌ |
| speech_6.wav | 1.000 | 0.000 | MUSIC | ❌ |
| speech_7.wav | 0.170 | 0.830 | SPEECH | ✅ |
| speech_8.wav | 0.870 | 0.130 | MUSIC | ❌ |
| speech_9.wav | 0.035 | 0.965 | SPEECH | ✅ |

### Speech + Music

| File | Music Score | Speech Score | Prediction | Result |
|------|-------------|--------------|------------|--------|
| speech_and_music_1.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_10.wav | 0.987 | 0.013 | MUSIC | ✅ |
| speech_and_music_2.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_3wav.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_4.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_5.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_6.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_7.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_8.wav | 1.000 | 0.000 | MUSIC | ✅ |
| speech_and_music_9.wav | 1.000 | 0.000 | MUSIC | ✅ |


---

## 🚀 Quick Start
```python
from transformers import pipeline

# Load the model
classifier = pipeline(
    "audio-classification", 
    model="AIGenLab/AST-speech-and-music-classifier-82K"
)

# Classify audio
result = classifier("your_audio.wav")
print(result)
```
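
The `audio-classification` pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below shows one way to turn that list into a label-to-score mapping and pick the top class. The helper names are hypothetical, and the label strings in the example (`"music"`, `"speech"`) are assumptions; the actual names come from the model's config.

```python
def scores_by_label(pipeline_output):
    """Map each label to its score from a list of {"label", "score"} dicts."""
    return {item["label"]: item["score"] for item in pipeline_output}


def top_label(pipeline_output):
    """Return the label with the highest score."""
    return max(pipeline_output, key=lambda item: item["score"])["label"]


# Hypothetical pipeline output; label names are assumed, not taken from the model
example = [
    {"label": "music", "score": 0.987},
    {"label": "speech", "score": 0.013},
]

print(scores_by_label(example))
print(top_label(example))
```

With a real model, pass the `result` from the snippet above in place of `example`.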

---

## 🔧 Advanced Usage
```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import torch
import torchaudio

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-82K"
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
    "AIGenLab/AST-speech-and-music-classifier-82K"
)

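# Load audio and convert it to 16 kHz mono (the feature extractor expects a 1-D waveform)
audio, sr = torchaudio.load("audio.wav")
if sr != 16000:
    audio = torchaudio.functional.resample(audio, sr, 16000)
audio = audio.mean(dim=0)  # downmix channels; leaves mono audio unchanged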

# Process
inputs = feature_extractor(
    audio.squeeze().numpy(), 
    sampling_rate=16000, 
    return_tensors="pt"
)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

# Class order as used in this card (0 = music, 1 = speech); confirm with model.config.id2label
music_score = predictions[0][0].item()
speech_score = predictions[0][1].item()

print(f"Music: {music_score:.3f}")
print(f"Speech: {speech_score:.3f}")
```

---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Dataset | AIGenLab/speech-music-82k (82,000 samples) |
| Epochs | 1 |
| Batch Size | 64 |
| Learning Rate | 3e-5 |
| Loss Weight | Music: 2.5x, Speech: 1.0x |
| Optimizer | AdamW |
| Framework | Transformers + PyTorch |
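
The Loss Weight row suggests a class-weighted loss, although the card does not show the training code. The following is a minimal pure-Python sketch of weighted cross-entropy under two assumptions: that the weights were applied per class to a cross-entropy loss, and that index 0 is music and index 1 is speech. The function name is illustrative. In PyTorch the same idea is usually expressed by passing a `weight` tensor to `torch.nn.CrossEntropyLoss`.

```python
import math

# Class weights from the table above; index order (0 = music, 1 = speech) is an assumption
CLASS_WEIGHTS = [2.5, 1.0]


def weighted_cross_entropy(logits, targets, weights=CLASS_WEIGHTS):
    """Weighted-mean cross-entropy over a batch of logit rows."""
    total_loss = 0.0
    total_weight = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp of the logits
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        # Negative log-likelihood of the true class
        nll = log_norm - row[target]
        total_loss += weights[target] * nll
        total_weight += weights[target]
    # Normalize by the summed weights of the targets in the batch
    return total_loss / total_weight


# Both samples are scored as "speech"; the first is actually music
batch_logits = [[0.0, 2.0], [0.0, 2.0]]
batch_targets = [0, 1]

weighted = weighted_cross_entropy(batch_logits, batch_targets)
unweighted = weighted_cross_entropy(batch_logits, batch_targets, weights=[1.0, 1.0])
print(f"weighted: {weighted:.4f}, unweighted: {unweighted:.4f}")
```

With a 2.5x weight on music, a music clip mistaken for speech contributes more to the loss than the reverse error. That may help explain why the errors in the Pure Speech table all lean toward MUSIC, though the card does not confirm this.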