Lyrics Genre Classification Model
This model predicts music genres from song lyrics. It is a multi-label classifier, so a single song can be assigned several genres at once. It was trained on the Yegor25/lyrics_genre_dataset_large dataset.
Model Details
- Base Model: roberta-base
- Task: Multi-label text classification
- Training Data: 850,000 examples
- Test Data: 150,000 examples
- Number of Genres: 43
- Max Sequence Length: 512 tokens
Genres
alt-rock, axé, blues, classical, country, dance, electro-pop, electronic, electronica, folk, forró, funk, funk-carioca, gothic, hard-rock, hardcore, heavy-metal, hip-hop, indie, indie-pop, indie-rock, j-pop, j-rock, jazz, metal, mpb, new-wave, pagode, pop, pop-rock, progressive-rock, punk, punk-rock, r&b, rap, reggae, religion, rock, samba, sertanejo, soul, synth-pop, trap
Performance
- F1 Score (Macro): 0.2547
- F1 Score (Micro): 0.5507
- F1 Score (Weighted): 0.5845
- Subset Accuracy: 0.4105
- Hamming Loss: 0.0285
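The card does not say how these metrics were computed. The sketch below assumes the standard multi-label definitions (the ones scikit-learn uses) and applies them to a small made-up example, not to the model's actual predictions. It is meant to show why macro F1 can sit well below micro F1 when rare genres are missed.

```python
# Toy multi-label data: rows are songs, columns are genres (1 = genre applies).
# These values are made up for illustration; they are not model outputs.
y_true = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
y_pred = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 0, 0],
]


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true, y_pred, j):
    """True positives, false positives, and false negatives for label column j."""
    tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


n_samples, n_labels = len(y_true), len(y_true[0])
counts = [label_counts(y_true, y_pred, j) for j in range(n_labels)]

# Macro F1: average of per-label F1, so every genre counts equally.
macro_f1 = sum(f1(*c) for c in counts) / n_labels

# Micro F1: pool the counts over all labels first, so frequent genres dominate.
micro_f1 = f1(*(sum(col) for col in zip(*counts)))

# Weighted F1: per-label F1 weighted by how often each label truly occurs.
support = [sum(t[j] for t in y_true) for j in range(n_labels)]
weighted_f1 = sum(f1(*c) * s for c, s in zip(counts, support)) / sum(support)

# Subset accuracy: fraction of songs whose full label set is predicted exactly.
subset_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples

# Hamming loss: fraction of individual label slots that are wrong.
wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
hamming = wrong / (n_samples * n_labels)

print(f"macro={macro_f1:.3f} micro={micro_f1:.3f} weighted={weighted_f1:.3f}")
print(f"subset_accuracy={subset_accuracy:.3f} hamming_loss={hamming:.3f}")
```

In this toy case the frequent first label is predicted perfectly and the two rare labels are always missed, which pulls macro F1 far below micro F1. With 43 genres, Hamming loss can also stay low even when whole label sets are often wrong, because most label slots are correctly predicted as absent.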
Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pickle

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("xiejoshua/lyrics-genre-classifier")
tokenizer = AutoTokenizer.from_pretrained("xiejoshua/lyrics-genre-classifier")

# Load genre labels (download mlb.pkl from model repo)
with open('mlb.pkl', 'rb') as f:
    mlb = pickle.load(f)

# Predict
lyrics = "Your lyrics here..."
inputs = tokenizer(lyrics, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.sigmoid(outputs.logits)
    predicted_labels = (predictions > 0.5).int().cpu().numpy()

# Get probabilities for all genres
genre_probs = {
    genre: float(prob)
    for genre, prob in zip(mlb.classes_, predictions[0].cpu().numpy())
}

# Sort by probability
sorted_probs = sorted(genre_probs.items(), key=lambda x: x[1], reverse=True)
print("Top 5 predicted genres:")
for genre, prob in sorted_probs[:5]:
    print(f"  {genre}: {prob:.2%}")
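The snippet above prints the five most probable genres regardless of the 0.5 threshold. To get a final label set, one option is to keep only the genres above the threshold. The sketch below is a hypothetical helper, not part of the model repository. The fallback to the single top genre when nothing clears the threshold is an illustrative design choice, and the probabilities are made up.

```python
def select_genres(genre_probs, threshold=0.5, fallback_top_k=1):
    """Return (genre, prob) pairs above the threshold, highest first.

    If nothing clears the threshold, fall back to the top-k genres so the
    caller always gets at least one suggestion (an illustrative choice).
    """
    ranked = sorted(genre_probs.items(), key=lambda item: item[1], reverse=True)
    selected = [(genre, prob) for genre, prob in ranked if prob > threshold]
    return selected or ranked[:fallback_top_k]


# Hypothetical probabilities standing in for the genre_probs dict built above.
example_probs = {"rock": 0.81, "pop-rock": 0.56, "indie": 0.22, "jazz": 0.03}
print(select_genres(example_probs))

# No genre clears 0.5 here, so the helper falls back to the single top genre.
low_confidence = {"rock": 0.31, "pop": 0.27, "jazz": 0.03}
print(select_genres(low_confidence))
```

Lowering `threshold` trades precision for recall. A value tuned on held-out data may work better than 0.5 for rarer genres, though the card does not report any such tuning.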
Training Details
- Epochs: 4
- Batch Size: 48
- Learning Rate: 2.5e-05
- Optimizer: AdamW
- Weight Decay: 0.01
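For a rough sense of training length, the number of optimizer steps can be estimated from the figures on this card. The sketch below makes three assumptions the card does not state: the batch size of 48 is the effective global batch size, there is no gradient accumulation, and the final partial batch of each epoch is kept.

```python
import math

# Values taken from this model card.
train_examples = 850_000
batch_size = 48
epochs = 4

# Assumptions (not stated in the card): 48 is the effective global batch size,
# no gradient accumulation, and the last partial batch of each epoch is kept.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```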
Limitations
- Trained primarily on English lyrics
- May have biases based on training data distribution
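- Performance is likely weaker on newer or niche genres that are under-represented in the training data; the gap between macro F1 (0.2547) and micro F1 (0.5507) points to uneven per-genre performance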
Citation
If you use this model, please cite the original dataset:
@dataset{yegor25_lyrics_genre,
author = {Yegor25},
title = {Lyrics Genre Dataset Large},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/Yegor25/lyrics_genre_dataset_large}
}