# Audio Classification with AST (Music / Non-Speech / Speech)
This model is a fine-tuned Audio Spectrogram Transformer (AST) for audio classification.
It classifies audio clips into three categories:
- Speech
- Music
- Non-Speech
The model operates on log-Mel filterbank features extracted from 16 kHz audio and uses a Transformer-based architecture adapted for audio spectrograms.
Training and inference code are available in the GitHub repository:
https://github.com/areffarhadi/audio-classification/tree/main/AST-model
## Model Details
### Architecture
The model is based on the Audio Spectrogram Transformer (AST), which applies the Vision Transformer (ViT) architecture to audio spectrograms.
Key characteristics:
- Transformer encoder architecture
- Patch-based spectrogram representation
- Learned positional embeddings
- Classification token and distillation token
- Final classification head fine-tuned for the target classes
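The patch-based representation above can be sketched in a few lines. In the AST paper the spectrogram is split into 16x16 patches with a stride of 10 in both time and frequency (so patches overlap); the grid arithmetic below assumes those defaults together with this model's 128-bin by 1024-frame input.

```python
def patch_grid(n_mels=128, n_frames=1024, patch=16, stride=10):
    """Count the overlapping patches AST extracts from a spectrogram.

    Each patch is flattened, linearly projected to an embedding, and
    given a learned positional embedding before entering the encoder.
    """
    freq_patches = (n_mels - patch) // stride + 1
    time_patches = (n_frames - patch) // stride + 1
    return freq_patches, time_patches, freq_patches * time_patches

print(patch_grid())  # (12, 101, 1212): a 12 x 101 grid, 1212 patch tokens
```

The 1212-token sequence is then prepended with the classification and distillation tokens before the Transformer encoder.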
### Input
- Audio format: WAV
- Sampling rate: 16 kHz
- Features: Log-Mel filterbank
- Mel bins: 128
- Target length: 1024 frames
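Because clips vary in duration, the log-Mel feature matrix must be padded or truncated to the fixed 1024-frame target length before it reaches the model. The feature extraction itself is typically done with a Kaldi-compatible filterbank routine; the sketch below covers only the length-normalization step, with NumPy standing in for the actual feature pipeline.

```python
import numpy as np

def pad_or_truncate(fbank: np.ndarray, target_len: int = 1024) -> np.ndarray:
    """Zero-pad or truncate a (frames, 128) log-Mel matrix to target_len frames."""
    n_frames, n_mels = fbank.shape
    if n_frames < target_len:
        pad = np.zeros((target_len - n_frames, n_mels), dtype=fbank.dtype)
        return np.vstack([fbank, pad])
    return fbank[:target_len]

short = pad_or_truncate(np.ones((600, 128)))   # padded up to 1024 frames
long = pad_or_truncate(np.ones((1500, 128)))   # truncated down to 1024 frames
print(short.shape, long.shape)
```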
### Output Classes
| Index | Label |
|---|---|
| 0 | Music |
| 1 | Non-Speech |
| 2 | Speech |
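Given the index-to-label mapping above, the model's 3-class logits can be turned into a label and confidence with a softmax and an argmax. The logit values below are hypothetical, just to show the mapping.

```python
import numpy as np

ID2LABEL = {0: "Music", 1: "Non-Speech", 2: "Speech"}

def predict_label(logits):
    """Softmax the 3-class logits and return (label, confidence)."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    idx = int(probs.argmax())
    return ID2LABEL[idx], float(probs[idx])

label, confidence = predict_label([0.2, -1.3, 4.1])  # hypothetical logits
print(label, round(confidence, 3))
```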