---
license: mit
tags:
- audio
- audio-classification
- speech
- music
- pytorch
- ast
- VAD
library_name: pytorch
pipeline_tag: audio-classification
---

# Audio Classification with AST (Music / Non-Speech / Speech)

This model is a **fine-tuned Audio Spectrogram Transformer (AST)** for **audio classification**.
It classifies audio clips into three categories:

- **Speech**
- **Music**
- **Non-Speech**

The model operates on **log-Mel filterbank features extracted from 16 kHz audio** and uses a **Transformer-based architecture** adapted for audio spectrograms.

Training and source code are available here:

**GitHub repository:**
https://github.com/areffarhadi/audio-classification/tree/main/AST-model

---

# Model Details

## Architecture

The model is based on the **Audio Spectrogram Transformer (AST)**, which applies the Vision Transformer architecture to audio spectrograms.

Key characteristics:

- Transformer encoder architecture
- Patch-based spectrogram representation
- Learned positional embeddings
- Classification token and distillation token
- Final classification head fine-tuned for the target classes

### Input

- **Audio format:** WAV
- **Sampling rate:** 16 kHz
- **Features:** Log-Mel filterbank
- **Mel bins:** 128
- **Target length:** 1024 frames

### Output Classes

| Index | Label |
|-------|-------|
| 0 | Music |
| 1 | Non-Speech |
| 2 | Speech |
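
Given this index-to-label mapping, post-processing the model's logits can be sketched as below. `predict_label` is an illustrative helper, not part of the released inference script.

```python
# Minimal sketch: map output logits to the class labels in the table above.
import torch

ID2LABEL = {0: "Music", 1: "Non-Speech", 2: "Speech"}


def predict_label(logits: torch.Tensor) -> str:
    """Accepts a (3,) or (1, 3) logits tensor and returns the label string."""
    probs = torch.softmax(logits.reshape(-1), dim=0)
    return ID2LABEL[int(torch.argmax(probs))]
```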

---

# Usage

[**Inference code**](https://github.com/areffarhadi/audio-classification/blob/main/AST-model/ast_inference_with_manifest.py)