Music Genre Classifier - 23f2000254-t12026

A hierarchical Vision Transformer that classifies music genre from Mel spectrogram and Constant-Q Transform (CQT) features.

Model Architecture

  • Type: Hierarchical Vision Transformer (ViT)
  • Input: 6 audio segments of 7 seconds each, with 2 seconds of overlap between consecutive segments (a 5-second hop)
  • Features:
    • Mel spectrogram (128 bins)
    • Constant-Q Transform (84 bins)
  • Encoder: Factorized attention with embedding dimension 256, 8 heads, 4 layers
  • Temporal: Transformer encoder with CLS token aggregation
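
The segment layout described under Input can be sketched as follows. This is a minimal illustration, not the model's actual preprocessing code: the function names `segment_bounds` and `slice_segments` are hypothetical, and zero-padding a short final segment is an assumption, since the card does not say how clips shorter than the full span are handled.

```python
from typing import List, Sequence, Tuple


def segment_bounds(
    num_segments: int = 6, window_s: float = 7.0, overlap_s: float = 2.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for each overlapping segment."""
    hop_s = window_s - overlap_s  # 7 s window with 2 s overlap gives a 5 s hop
    if hop_s <= 0:
        raise ValueError("overlap must be smaller than the window length")
    return [(i * hop_s, i * hop_s + window_s) for i in range(num_segments)]


def slice_segments(
    samples: Sequence[float], sample_rate: int, **kwargs
) -> List[List[float]]:
    """Cut a mono signal into segments; zero-pad past the end (assumption)."""
    segments = []
    for start_s, end_s in segment_bounds(**kwargs):
        start = int(round(start_s * sample_rate))
        end = int(round(end_s * sample_rate))
        chunk = list(samples[start:end])
        chunk.extend([0.0] * ((end - start) - len(chunk)))  # pad short tail
        segments.append(chunk)
    return segments


if __name__ == "__main__":
    for start_s, end_s in segment_bounds():
        print(f"{start_s:5.1f} s to {end_s:5.1f} s")
```

With these settings the six windows together span 32 seconds, so a 30-second clip would need roughly 2 seconds of padding in the last segment under this scheme.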

Training Details

  • Dataset: 10,000 synthetic mashups (GTZAN-based)
  • Classes: 10 genres (blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock)
  • Best Val F1: 0.763
  • Best Val Accuracy: 76.6%
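
For readers comparing the two validation metrics, the sketch below shows how accuracy and an F1 score could be computed for a single-label, multi-class problem. The card does not state which F1 averaging was used or whether each mashup carries a single label; macro averaging and single-label targets are both assumptions here, and the helper names are hypothetical.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], labels: List[str]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that never appears and is never predicted scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only; not data from this model.
    truth = ["blues", "blues", "rock", "rock"]
    preds = ["blues", "rock", "rock", "rock"]
    print(accuracy(truth, preds))
    print(macro_f1(truth, preds, ["blues", "rock"]))
```

Macro F1 weights every genre equally, so it can differ from accuracy when some genres are classified much better than others.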

Files

  • best_model.pt: Model checkpoint with weights, config, and genre mapping
  • history.json: Training history
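
A hedged sketch for inspecting both files is shown below. The key names inside `best_model.pt` and the layout of `history.json` are not documented here, so the code only lists whatever top-level entries it finds rather than assuming any names. `load_checkpoint`, `load_history`, and `summarize_checkpoint` are hypothetical helpers, and the sketch assumes the checkpoint is a dictionary saved with `torch.save`.

```python
import json
from typing import Any, Dict


def load_checkpoint(path: str = "best_model.pt") -> Dict[str, Any]:
    """Load the checkpoint on CPU (assumes a dict saved with torch.save)."""
    import torch  # imported lazily so the other helpers work without PyTorch

    # Recent PyTorch versions default to weights_only=True, which may reject
    # checkpoints that contain arbitrary pickled Python objects.
    return torch.load(path, map_location="cpu")


def load_history(path: str = "history.json") -> Any:
    """Read the training history; its structure is not documented here."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def summarize_checkpoint(checkpoint: Dict[str, Any]) -> Dict[str, str]:
    """Describe each top-level entry without assuming any key names."""
    summary = {}
    for key, value in checkpoint.items():
        if isinstance(value, (dict, list, tuple)):
            summary[key] = f"{type(value).__name__} with {len(value)} entries"
        else:
            summary[key] = type(value).__name__
    return summary


if __name__ == "__main__":
    for key, description in summarize_checkpoint(load_checkpoint()).items():
        print(f"{key}: {description}")
```

Printing the summary first makes it easier to find which entries hold the weights, the config, and the genre mapping before rebuilding the model.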