pk1308/23f2000254-t12026-genre-classifier

Tags: Audio Classification, music-genre-classification, mel-spectrogram


Music Genre Classifier - 23f2000254-t12026

A hierarchical Vision Transformer for music genre classification that uses Mel spectrogram and Constant-Q Transform (CQT) features.

Model Architecture

Type: Hierarchical Vision Transformer (ViT)
Input: 6 audio segments of 7 seconds each, with 2 seconds of overlap between consecutive segments
Features:
- Mel spectrogram (128 bins)
- Constant-Q Transform (84 bins)
Encoder: Factorized attention (embedding dimension 256, 8 heads, 4 layers)
Temporal: Transformer encoder with CLS token aggregation
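
The segmentation described under Input can be sketched as follows. This is a minimal illustration, not the model's actual preprocessing code: the 22,050 Hz sample rate, the zero-padding of short segments, and the function names `segment_bounds` and `split_waveform` are all assumptions that the card does not state.

```python
def segment_bounds(segment_s=7.0, overlap_s=2.0, n_segments=6, sample_rate=22050):
    """Return (start, end) sample indices for each overlapping segment.

    sample_rate is an assumed value; the card does not specify it.
    """
    hop_s = segment_s - overlap_s                  # 7 s window, 2 s overlap -> 5 s hop
    seg_len = int(round(segment_s * sample_rate))  # samples per segment
    hop_len = int(round(hop_s * sample_rate))      # samples between segment starts
    return [(i * hop_len, i * hop_len + seg_len) for i in range(n_segments)]


def split_waveform(samples, bounds):
    """Cut a 1-D sequence of samples into segments.

    Segments that run past the end of the audio are zero-padded
    (an assumption; the real pipeline may handle this differently).
    """
    segments = []
    for start, end in bounds:
        chunk = list(samples[start:end])
        chunk += [0.0] * ((end - start) - len(chunk))
        segments.append(chunk)
    return segments


bounds = segment_bounds()
for index, (start, end) in enumerate(bounds):
    print(f"segment {index}: samples {start}..{end}")
```

Under these assumptions the six windows start every 5 seconds and together span 5 × 5 + 7 = 32 seconds of audio. Each segment would then be converted to its Mel spectrogram and CQT features before reaching the encoder.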

Training Details

Dataset: 10,000 synthetic mashups (GTZAN-based)
Classes: 10 genres (blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock)
Best validation F1: 0.763
Best validation accuracy: 76.6%
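
For reference, accuracy and F1 over the ten genres can be computed as sketched below. The card does not say which F1 averaging was used; macro averaging is assumed here purely for illustration, and the helper names `accuracy` and `macro_f1` are hypothetical.

```python
GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=GENRES):
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted examples contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy example, not real model output.
y_true = ["rock", "rock", "jazz", "pop"]
y_pred = ["rock", "jazz", "jazz", "pop"]
print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred, labels=["rock", "jazz", "pop"]))
```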

Files

best_model.pt: Model checkpoint with weights, config, and genre mapping
history.json: Training history
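
A minimal sketch of inspecting the training history follows. The layout of history.json is not documented in this card, so the code assumes a JSON object that maps metric names to per-epoch lists. The key name `val_f1` and the helper names are hypothetical and should be checked against the actual file.

```python
import json


def load_history(path):
    """Read the training history from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def best_epoch(history, metric="val_f1"):
    """Return (1-based epoch, value) for the highest value of `metric`.

    Assumes history[metric] is a list with one entry per epoch;
    "val_f1" is a hypothetical key name.
    """
    values = history[metric]
    index = max(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Toy history for illustration only, not the contents of the real file.
example = {"val_f1": [0.5, 0.7, 0.6]}
epoch, value = best_epoch(example)
print(f"best epoch: {epoch}, val_f1: {value}")
```

With the real file, the same call would be `best_epoch(load_history("history.json"))`, once the key names have been confirmed.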
