Audio Classification with AST (Music / Non-Speech / Speech)

This model is a fine-tuned Audio Spectrogram Transformer (AST) for audio classification.
It classifies audio clips into three categories:

  • Music
  • Non-Speech
  • Speech

The model operates on log-Mel filterbank features extracted from 16 kHz audio and uses a Transformer-based architecture adapted for audio spectrograms.

Training and inference code are available in the GitHub repository:

https://github.com/areffarhadi/audio-classification/tree/main/AST-model


Model Details

Architecture

The model is based on Audio Spectrogram Transformer (AST), which applies the Vision Transformer architecture to audio spectrograms.

Key characteristics:

  • Transformer encoder architecture
  • Patch-based spectrogram representation
  • Learned positional embeddings
  • Classification token and distillation token
  • Final classification head fine-tuned for the target classes
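Assuming the standard Hugging Face `transformers` AST implementation, the head replacement described above can be sketched as follows. The base checkpoint id is an assumption (a common public AudioSet-pretrained AST); this card does not state which weights were used for initialization.

```python
from transformers import ASTForAudioClassification

# Label mapping taken from the Output Classes section of this card.
ID2LABEL = {0: "Music", 1: "Non-Speech", 2: "Speech"}
LABEL2ID = {v: k for k, v in ID2LABEL.items()}

def build_model(base_checkpoint: str = "MIT/ast-finetuned-audioset-10-10-0.4593"):
    """Load an AudioSet-pretrained AST and attach a fresh 3-way head.

    The base checkpoint id above is an assumption, not stated in the card.
    """
    return ASTForAudioClassification.from_pretrained(
        base_checkpoint,
        num_labels=3,
        id2label=ID2LABEL,
        label2id=LABEL2ID,
        ignore_mismatched_sizes=True,  # pretrained head is 527-way (AudioSet)
    )
```

`ignore_mismatched_sizes=True` lets `from_pretrained` discard the AudioSet classification head and randomly initialize a 3-way head, which is then fine-tuned on the target data.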

Input

  • Audio format: WAV
  • Sampling rate: 16 kHz
  • Features: Log-Mel filterbank
  • Mel bins: 128
  • Target length: 1024 frames

Output Classes

Index  Label
0      Music
1      Non-Speech
2      Speech

Usage

Inference code
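A minimal inference sketch, assuming the model is loadable with the Hugging Face `transformers` AST classes; the checkpoint path is a placeholder, since this card does not state a Hub repo id for the fine-tuned weights:

```python
import torch
from transformers import ASTFeatureExtractor, ASTForAudioClassification

# Label mapping from the Output Classes table.
ID2LABEL = {0: "Music", 1: "Non-Speech", 2: "Speech"}

def classify(waveform, model, extractor) -> str:
    """Classify a mono 16 kHz waveform (1-D float array) into one of
    Music / Non-Speech / Speech."""
    inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return ID2LABEL[int(logits.argmax(dim=-1))]

# Example wiring (placeholder path -- substitute the actual fine-tuned
# weights, e.g. a local directory saved by the training script):
# extractor = ASTFeatureExtractor(num_mel_bins=128, max_length=1024)
# model = ASTForAudioClassification.from_pretrained("path/to/checkpoint")
# model.eval()
# print(classify(audio_array, model, extractor))
```

The feature extractor handles the log-Mel conversion and padding to 1024 frames described in the Input section, so `waveform` can be passed as raw 16 kHz samples.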
