Music Genre Classifier - 23f2000254-t12026
Hierarchical Vision Transformer for music genre classification using Mel spectrogram and CQT features.
Model Architecture
- Type: Hierarchical Vision Transformer (ViT)
- Input: six 7-second audio segments, with a 2-second overlap between consecutive segments
- Features:
  - Mel spectrogram (128 bins)
  - Constant-Q Transform (84 bins)
- Encoder: Factorized attention with 256 embed dim, 8 heads, 4 layers
- Temporal: Transformer encoder with CLS token aggregation
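The input segmentation described above can be sketched as follows. With 7-second segments and a 2-second overlap, the hop between segment starts is 5 seconds, so six segments span 7 + 5 × 5 = 32 seconds of audio. This is a minimal sketch, not the project's actual preprocessing code: the function name `segment_audio`, the 22,050 Hz sample rate, and the zero-padding of short clips are all assumptions.

```python
import numpy as np

def segment_audio(y, sr=22050, seg_sec=7.0, overlap_sec=2.0, n_segments=6):
    """Split a 1-D waveform into fixed-length, overlapping segments.

    Hypothetical helper: sample rate and padding policy are assumptions.
    """
    seg_len = int(round(seg_sec * sr))                # samples per segment
    hop = int(round((seg_sec - overlap_sec) * sr))    # samples between segment starts
    total = seg_len + hop * (n_segments - 1)          # samples needed for all segments

    # Assumption: zero-pad clips that are too short, truncate clips that are too long.
    if len(y) < total:
        y = np.pad(y, (0, total - len(y)))
    else:
        y = y[:total]

    return np.stack([y[i * hop : i * hop + seg_len] for i in range(n_segments)])

# A silent 30-second clip, used only to illustrate the output shape.
clip = np.zeros(30 * 22050, dtype=np.float32)
segments = segment_audio(clip)
print(segments.shape)  # (6, 154350)
```

Each row of the result would then be converted to Mel spectrogram and CQT features before being passed to the encoder.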
Training Details
- Dataset: 10,000 synthetic mashups (GTZAN-based)
- Classes: 10 genres (blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, rock)
- Best Val F1: 0.763
- Best Val Accuracy: 76.6%
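For reference, the two validation metrics can be computed from predicted and true labels as in the sketch below. It assumes single-label predictions and macro-averaged F1; the averaging mode behind the reported score is not stated here, and the helper name `accuracy_and_macro_f1` is hypothetical.

```python
import numpy as np

def accuracy_and_macro_f1(y_true, y_pred, n_classes=10):
    """Return (accuracy, macro-averaged F1) for integer class labels.

    Assumption: one label per example and macro averaging over all classes.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))

    f1_scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # A class that never appears in labels or predictions contributes 0.
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)

    return accuracy, float(np.mean(f1_scores))

# Toy two-class example, not taken from the real validation set.
acc, f1 = accuracy_and_macro_f1([0, 0, 1, 1], [0, 1, 1, 1], n_classes=2)
print(round(acc, 3), round(f1, 3))  # 0.75 0.733
```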
Files
- best_model.pt: Model checkpoint with weights, config, and genre mapping
- history.json: Training history
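The training history can be inspected with the standard library alone. The sketch below assumes history.json holds a JSON object mapping metric names to per-epoch lists; the key names `val_f1` and `val_acc` and both helper functions are hypothetical, so check the actual file before relying on them.

```python
import json

def load_history(path="history.json"):
    """Read the training history file into a Python object."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def best_epoch(history, metric="val_f1"):
    """Return (1-based epoch, value) for the highest value of `metric`.

    Assumption: history[metric] is a list with one entry per epoch.
    """
    values = history[metric]
    idx = max(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# Made-up values for illustration only; not the model's real training curve.
example = {"val_f1": [0.41, 0.58, 0.55], "val_acc": [0.45, 0.60, 0.57]}
print(best_epoch(example))  # (2, 0.58)
```

With the real file, `best_epoch(load_history())` would report the epoch at which the chosen metric peaked, provided the assumed layout matches.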