# Audio Classification with AST (Music / Non-Speech / Speech)
This model is a fine-tuned Audio Spectrogram Transformer (AST) for audio classification.
It classifies audio clips into three categories:
- Speech
- Music
- Non-Speech
The model operates on log-Mel filterbank features extracted from 16 kHz audio and uses a Transformer-based architecture adapted for audio spectrograms.
Training and inference code are available in the GitHub repository:
https://github.com/areffarhadi/audio-classification/tree/main/AST-model
## Model Details
### Architecture
The model is based on the Audio Spectrogram Transformer (AST), which applies the Vision Transformer (ViT) architecture to audio spectrograms.
Key characteristics:
- Transformer encoder architecture
- Patch-based spectrogram representation
- Learned positional embeddings
- Classification token and distillation token
- Final classification head fine-tuned for the target classes
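The patch-based representation above can be sketched in a few lines. In the AST paper the spectrogram is split into 16x16 patches with a stride of 10 in both time and frequency (so patches overlap); the grid arithmetic below assumes those defaults together with this model's 128-bin by 1024-frame input.

```python
def patch_grid(n_mels=128, n_frames=1024, patch=16, stride=10):
    """Count the overlapping patches AST extracts from a spectrogram.

    Each patch is flattened, linearly projected to an embedding, and
    given a learned positional embedding before entering the encoder.
    """
    freq_patches = (n_mels - patch) // stride + 1
    time_patches = (n_frames - patch) // stride + 1
    return freq_patches, time_patches, freq_patches * time_patches

print(patch_grid())  # (12, 101, 1212): a 12 x 101 grid, 1212 patch tokens
```

The 1212-token sequence is then prepended with the classification and distillation tokens before the Transformer encoder.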
### Input
- Audio format: WAV
- Sampling rate: 16 kHz
- Features: Log-Mel filterbank
- Mel bins: 128
- Target length: 1024 frames
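Because clips vary in duration, the log-Mel feature matrix must be padded or truncated to the fixed 1024-frame target length before it reaches the model. The feature extraction itself is typically done with a Kaldi-compatible filterbank routine; the sketch below covers only the length-normalization step, with NumPy standing in for the actual feature pipeline.

```python
import numpy as np

def pad_or_truncate(fbank: np.ndarray, target_len: int = 1024) -> np.ndarray:
    """Zero-pad or truncate a (frames, 128) log-Mel matrix to target_len frames."""
    n_frames, n_mels = fbank.shape
    if n_frames < target_len:
        pad = np.zeros((target_len - n_frames, n_mels), dtype=fbank.dtype)
        return np.vstack([fbank, pad])
    return fbank[:target_len]

short = pad_or_truncate(np.ones((600, 128)))   # padded up to 1024 frames
long = pad_or_truncate(np.ones((1500, 128)))   # truncated down to 1024 frames
print(short.shape, long.shape)
```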
### Output Classes
| Index | Label |
|---|---|
| 0 | Music |
| 1 | Non-Speech |
| 2 | Speech |
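Given the index-to-label mapping above, the model's 3-class logits can be turned into a label and confidence with a softmax and an argmax. The logit values below are hypothetical, just to show the mapping.

```python
import numpy as np

ID2LABEL = {0: "Music", 1: "Non-Speech", 2: "Speech"}

def predict_label(logits):
    """Softmax the 3-class logits and return (label, confidence)."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    idx = int(probs.argmax())
    return ID2LABEL[idx], float(probs[idx])

label, confidence = predict_label([0.2, -1.3, 4.1])  # hypothetical logits
print(label, round(confidence, 3))
```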