Lyrics Genre Classification Model

This model predicts one or more music genres for a set of song lyrics. It is a multi-label classifier fine-tuned from roberta-base on the Yegor25/lyrics_genre_dataset_large dataset.

Model Details

Base Model: roberta-base
Task: Multi-label text classification
Training Data: 850,000 examples
Test Data: 150,000 examples
Number of Genres: 43
Max Sequence Length: 512 tokens
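
Lyrics longer than the 512-token limit are cut off by the tokenizer call in the usage example (truncation=True), so later verses never reach the model. One possible workaround, not described by the model author, is to split the token ids into overlapping windows, score each window separately, and average the per-genre probabilities. The sketch below covers only the windowing and averaging logic in plain Python. The helper names `split_into_windows` and `average_probabilities` are hypothetical, the stride of 384 is an arbitrary choice, and the window size of 510 assumes two positions are reserved for the special tokens RoBERTa adds to each sequence.

```python
from typing import Dict, List, Sequence


def split_into_windows(token_ids: Sequence[int], window_size: int = 510,
                       stride: int = 384) -> List[List[int]]:
    """Split token ids into overlapping windows of at most `window_size`.

    `stride` is the step between window starts; it must not exceed
    `window_size`, otherwise tokens between windows would be skipped.
    """
    if window_size <= 0 or not 0 < stride <= window_size:
        raise ValueError("need window_size > 0 and 0 < stride <= window_size")
    if len(token_ids) <= window_size:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window_size]))
        if start + window_size >= len(token_ids):
            break  # this window already reaches the end of the lyrics
        start += stride
    return windows


def average_probabilities(per_window: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-genre probabilities over all windows of one song."""
    if not per_window:
        raise ValueError("need at least one window")
    genres = per_window[0].keys()
    return {g: sum(p[g] for p in per_window) / len(per_window) for g in genres}


# Toy demonstration with made-up values (not real token ids or model output)
windows = split_into_windows(list(range(1000)))
song_probs = average_probabilities([
    {"rock": 0.8, "pop": 0.2},
    {"rock": 0.6, "pop": 0.4},
])
```

Averaging is only one way to combine windows; taking the per-genre maximum is another, and which works better for this model is untested here.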

Genres

alt-rock, axé, blues, classical, country, dance, electro-pop, electronic, electronica, folk, forró, funk, funk-carioca, gothic, hard-rock, hardcore, heavy-metal, hip-hop, indie, indie-pop, indie-rock, j-pop, j-rock, jazz, metal, mpb, new-wave, pagode, pop, pop-rock, progressive-rock, punk, punk-rock, r&b, rap, reggae, religion, rock, samba, sertanejo, soul, synth-pop, trap

Performance

F1 Score (Macro): 0.2547
F1 Score (Micro): 0.5507
F1 Score (Weighted): 0.5845
Subset Accuracy: 0.4105
Hamming Loss: 0.0285
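
These metrics measure different things, which is why the values differ so much. Macro F1 averages the per-genre F1 scores with equal weight, so genres the model rarely gets right pull it down. Micro F1 pools every genre decision before computing F1. Subset accuracy counts a song as correct only when its full genre set matches exactly. Hamming loss is the fraction of individual genre decisions that are wrong, which tends to be small when there are many genres and most of them are absent for any given song. The following minimal sketch computes four of the metrics in plain Python on invented toy data (four songs, three labels; not this model's predictions). Weighted F1 is omitted, and the function names are illustrative.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # rows = songs, columns = genres, values 0 or 1


def _counts(y_true: Matrix, y_pred: Matrix, label: int) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for one label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 0 and p[label] == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t[label] == 1 and p[label] == 0)
    return tp, fp, fn


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined as 0 here when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(y_true[0])
    return sum(_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)) / n_labels


def micro_f1(y_true: Matrix, y_pred: Matrix) -> float:
    """F1 computed from counts pooled over all labels."""
    per_label = [_counts(y_true, y_pred, j) for j in range(len(y_true[0]))]
    tp = sum(c[0] for c in per_label)
    fp = sum(c[1] for c in per_label)
    fn = sum(c[2] for c in per_label)
    return _f1(tp, fp, fn)


def subset_accuracy(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of songs whose whole label row is predicted exactly."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) for tv, pv in zip(t, p) if tv != pv)
    return wrong / (len(y_true) * len(y_true[0]))


# Toy data: the "model" always predicts only the first (most common) label
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

print(f"macro F1:        {macro_f1(y_true, y_pred):.3f}")         # 0.286
print(f"micro F1:        {micro_f1(y_true, y_pred):.3f}")         # 0.667
print(f"subset accuracy: {subset_accuracy(y_true, y_pred):.3f}")  # 0.500
print(f"Hamming loss:    {hamming_loss(y_true, y_pred):.3f}")     # 0.250
```

In the toy data, ignoring the two rare labels costs little in micro F1 but cuts macro F1 to less than half of it, the same kind of gap seen between the macro and micro scores reported above.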

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pickle

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("xiejoshua/lyrics-genre-classifier")
tokenizer = AutoTokenizer.from_pretrained("xiejoshua/lyrics-genre-classifier")

# Load the genre label encoder (download mlb.pkl from the model repo).
# Only unpickle files you trust: pickle can execute arbitrary code on load.
with open('mlb.pkl', 'rb') as f:
    mlb = pickle.load(f)

# Predict
lyrics = "Your lyrics here..."

inputs = tokenizer(lyrics, return_tensors="pt", truncation=True, max_length=512)

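with torch.no_grad():
    outputs = model(**inputs)
    # Sigmoid rather than softmax: each genre gets an independent probability
    predictions = torch.sigmoid(outputs.logits)
    # Binary indicator row: 1 where a genre's probability exceeds 0.5
    predicted_labels = (predictions > 0.5).int().cpu().numpy()

# Names of the genres that passed the 0.5 threshold
predicted_genres = [g for g, on in zip(mlb.classes_, predicted_labels[0]) if on]
print("Predicted genres:", predicted_genres)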

# Get probabilities for all genres
genre_probs = {
    genre: float(prob)
    for genre, prob in zip(mlb.classes_, predictions[0].cpu().numpy())
}

# Sort by probability
sorted_probs = sorted(genre_probs.items(), key=lambda x: x[1], reverse=True)
print("Top 5 predicted genres:")
for genre, prob in sorted_probs[:5]:
    print(f"  {genre}: {prob:.2%}")
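
With the fixed 0.5 threshold used above, some lyrics may end up with no predicted genre at all. A small post-processing helper can fall back to the highest-scoring genre in that case. This is a sketch rather than part of the model repository: `select_genres` and its parameters are hypothetical, and it expects a dict shaped like `genre_probs` above.

```python
from typing import Dict, List


def select_genres(genre_probs: Dict[str, float], threshold: float = 0.5,
                  min_genres: int = 1) -> List[str]:
    """Return genres whose probability exceeds `threshold`, best first.

    If fewer than `min_genres` pass the threshold, return the
    `min_genres` most probable genres instead.
    """
    ranked = sorted(genre_probs, key=genre_probs.get, reverse=True)
    selected = [g for g in ranked if genre_probs[g] > threshold]
    if len(selected) < min_genres:
        selected = ranked[:min_genres]
    return selected


# Made-up probabilities for illustration (not real model output)
confident = select_genres({"rock": 0.91, "pop": 0.62, "jazz": 0.03})
uncertain = select_genres({"rock": 0.31, "pop": 0.12, "jazz": 0.03})
print(confident)  # ['rock', 'pop']
print(uncertain)  # ['rock']
```

The reported metrics do not say which threshold was used for evaluation, so a different threshold or fallback may change how results compare with the Performance figures.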

Training Details

Epochs: 4
Batch Size: 48
Learning Rate: 2.5e-05
Optimizer: AdamW
Weight Decay: 0.01
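
The card does not state the loss function. Multi-label fine-tuning commonly uses binary cross-entropy over the per-genre logits, and Hugging Face sequence-classification models apply `BCEWithLogitsLoss` when `problem_type` is set to `multi_label_classification`. Whether this model was trained that way is an assumption. The plain-Python sketch below shows that loss for a single example in a numerically stable form; `bce_with_logits` is an illustrative name.

```python
import math
from typing import Sequence


def bce_with_logits(logits: Sequence[float], targets: Sequence[float]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses max(x, 0) - x*y + log(1 + exp(-|x|)), which avoids overflow for
    large |x| and equals -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x).
    """
    if not logits or len(logits) != len(targets):
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Made-up logits for three genres; the song is labelled with the first only
targets = [1.0, 0.0, 0.0]
loss_good = bce_with_logits([4.0, -3.0, -5.0], targets)  # confident and correct
loss_bad = bce_with_logits([-4.0, 3.0, -5.0], targets)   # confident and wrong
print(f"good: {loss_good:.4f}, bad: {loss_bad:.4f}")
```

Because each genre contributes its own term, the loss matches the independent sigmoid probabilities used in the usage example.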

Limitations

Trained primarily on English lyrics
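Predictions may reflect biases in the genre distribution of the training data
Performance may vary on newer or niche genres that are under-represented in the training data; the gap between macro F1 (0.2547) and micro F1 (0.5507) suggests that some genres are predicted far less reliably than others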

Citation

If you use this model, please cite the original dataset:

@dataset{yegor25_lyrics_genre,
  author = {Yegor25},
  title = {Lyrics Genre Dataset Large},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/Yegor25/lyrics_genre_dataset_large}
}
