Lyrics Genre Classification Model

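This model predicts one or more music genres for a song from its lyrics. It is a multi-label classifier, fine-tuned from roberta-base on the Yegor25/lyrics_genre_dataset_large dataset, so a single song can be assigned several of the 43 genres at once.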

Model Details

  • Base Model: roberta-base
  • Task: Multi-label text classification
  • Training Data: 850,000 examples
  • Test Data: 150,000 examples
  • Number of Genres: 43
  • Max Sequence Length: 512 tokens

Genres

alt-rock, axé, blues, classical, country, dance, electro-pop, electronic, electronica, folk, forró, funk, funk-carioca, gothic, hard-rock, hardcore, heavy-metal, hip-hop, indie, indie-pop, indie-rock, j-pop, j-rock, jazz, metal, mpb, new-wave, pagode, pop, pop-rock, progressive-rock, punk, punk-rock, r&b, rap, reggae, religion, rock, samba, sertanejo, soul, synth-pop, trap
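
Because the task is multi-label, each song's target can be pictured as a 43-slot vector of 0/1 flags, one per genre. The sketch below illustrates that encoding in plain Python. The helper names `encode_genres` and `decode_genres` are hypothetical, and the index order (the order of the list above) is an assumption; the authoritative order is whatever `mlb.classes_` holds in the model repo.

```python
# Genre names copied from the list above. The index order is an assumption;
# the order actually used by the model is stored in mlb.classes_.
GENRES = [
    "alt-rock", "axé", "blues", "classical", "country", "dance",
    "electro-pop", "electronic", "electronica", "folk", "forró", "funk",
    "funk-carioca", "gothic", "hard-rock", "hardcore", "heavy-metal",
    "hip-hop", "indie", "indie-pop", "indie-rock", "j-pop", "j-rock",
    "jazz", "metal", "mpb", "new-wave", "pagode", "pop", "pop-rock",
    "progressive-rock", "punk", "punk-rock", "r&b", "rap", "reggae",
    "religion", "rock", "samba", "sertanejo", "soul", "synth-pop", "trap",
]


def encode_genres(song_genres):
    """Hypothetical helper: return a 0/1 list with one slot per genre."""
    wanted = set(song_genres)
    unknown = wanted - set(GENRES)
    if unknown:
        raise ValueError(f"Unknown genres: {sorted(unknown)}")
    return [1 if genre in wanted else 0 for genre in GENRES]


def decode_genres(vector):
    """Hypothetical helper: return the genre names whose flag is set."""
    return [genre for genre, flag in zip(GENRES, vector) if flag]


target = encode_genres(["indie-rock", "alt-rock"])
print(len(target), sum(target))   # vector length and number of active genres
print(decode_genres(target))      # names come back in list order
```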

Performance

  • F1 Score (Macro): 0.2547
  • F1 Score (Micro): 0.5507
  • F1 Score (Weighted): 0.5845
  • Subset Accuracy: 0.4105
  • Hamming Loss: 0.0285
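
For readers unfamiliar with these multi-label metrics, the following plain-Python sketch shows how each one is defined, using a made-up toy example with 3 labels and 4 songs. It is not the evaluation script used for this model; the function names and the toy data are illustrative assumptions.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred):
    """Compute the five metrics above from 0/1 indicator rows."""
    n_samples, n_labels = len(y_true), len(y_true[0])
    rows = list(zip(y_true, y_pred))

    # Per-label true positives, false positives, false negatives
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in rows if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in rows if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in rows if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    per_label_f1 = [f1(*c) for c in counts]
    support = [tp + fn for tp, _, fn in counts]  # true occurrences per label
    total_tp = sum(c[0] for c in counts)
    total_fp = sum(c[1] for c in counts)
    total_fn = sum(c[2] for c in counts)
    wrong = sum(1 for t, p in rows for j in range(n_labels) if t[j] != p[j])

    return {
        # Unweighted mean over labels: rare genres count as much as common ones
        "f1_macro": sum(per_label_f1) / n_labels,
        # Pooled counts over all labels: dominated by frequent genres
        "f1_micro": f1(total_tp, total_fp, total_fn),
        # Mean over labels, weighted by how often each label truly occurs
        "f1_weighted": (
            sum(f * s for f, s in zip(per_label_f1, support)) / sum(support)
            if sum(support) else 0.0
        ),
        # Fraction of songs whose full label set is predicted exactly
        "subset_accuracy": sum(1 for t, p in rows if list(t) == list(p)) / n_samples,
        # Fraction of individual song/genre decisions that are wrong
        "hamming_loss": wrong / (n_samples * n_labels),
    }


# Toy data (made up): label 0 is common, labels 1 and 2 are rare and missed
y_true = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0]]
metrics = multilabel_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the common label is predicted perfectly while both rare labels are missed, so macro F1 (about 0.33) falls well below micro F1 (0.75), the same pattern as in the scores reported above. Read the same way, a Hamming loss of 0.0285 over 43 genres corresponds to roughly 1.2 wrong genre decisions per song (0.0285 × 43 ≈ 1.23).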

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pickle

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("xiejoshua/lyrics-genre-classifier")
tokenizer = AutoTokenizer.from_pretrained("xiejoshua/lyrics-genre-classifier")

# Load genre labels (download mlb.pkl from model repo)
with open('mlb.pkl', 'rb') as f:
    mlb = pickle.load(f)

# Predict
lyrics = "Your lyrics here..."

inputs = tokenizer(lyrics, return_tensors="pt", truncation=True, max_length=512)

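with torch.no_grad():
    outputs = model(**inputs)

# Multi-label output: an independent sigmoid per genre, not a softmax
predictions = torch.sigmoid(outputs.logits)
# 0/1 indicator row per input; genres with probability above 0.5 are selected
predicted_labels = (predictions > 0.5).int().cpu().numpy()
# Assuming mlb is a scikit-learn MultiLabelBinarizer, map the row back to names
print("Genres above 0.5:", mlb.inverse_transform(predicted_labels)[0])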

# Get probabilities for all genres
genre_probs = {
    genre: float(prob)
    for genre, prob in zip(mlb.classes_, predictions[0].cpu().numpy())
}

# Sort by probability
sorted_probs = sorted(genre_probs.items(), key=lambda x: x[1], reverse=True)
print("Top 5 predicted genres:")
for genre, prob in sorted_probs[:5]:
    print(f"  {genre}: {prob:.2%}")
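
With a fixed 0.5 cutoff, a song may end up with no genre at all when every probability is low. One possible way to handle that, sketched below in plain Python, is to fall back to the highest-scoring genre. The helper `select_genres`, its parameters, and the example probabilities are all hypothetical and not part of the model's API; it accepts any dict shaped like `genre_probs` above.

```python
def select_genres(genre_probs, threshold=0.5, fallback_top_k=1):
    """Hypothetical helper: pick genres above a threshold, with a fallback.

    genre_probs maps genre name -> probability. If no genre exceeds the
    threshold, the fallback_top_k most probable genres are returned instead.
    """
    ranked = sorted(genre_probs.items(), key=lambda item: item[1], reverse=True)
    selected = [genre for genre, prob in ranked if prob > threshold]
    if not selected:
        selected = [genre for genre, _ in ranked[:fallback_top_k]]
    return selected


# Made-up probabilities for illustration only
confident = select_genres({"rock": 0.81, "indie": 0.56, "pop": 0.12})
uncertain = select_genres({"rock": 0.31, "pop": 0.12})
print(confident)
print(uncertain)
```

Whether a fallback is appropriate depends on the application: it guarantees at least one genre per song, at the cost of returning low-confidence labels.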

Training Details

  • Epochs: 4
  • Batch Size: 48
  • Learning Rate: 2.5e-05
  • Optimizer: AdamW
  • Weight Decay: 0.01
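
As a rough sense of scale, the figures above imply the following number of optimizer steps. This is back-of-the-envelope arithmetic under assumptions the card does not state: a single device, no gradient accumulation, and the final partial batch of each epoch being kept.

```python
import math

train_examples = 850_000  # from Model Details
batch_size = 48           # from Training Details
epochs = 4                # from Training Details

# Assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total steps:     {total_steps}")
```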

Limitations

  • Trained primarily on English lyrics
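  • Predictions may reflect the genre imbalance of the training data; the gap between macro F1 (0.2547) and micro F1 (0.5507) suggests weaker performance on less frequent genres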
  • Performance may vary on newer or niche genres not well-represented in training data

Citation

If you use this model, please cite the original dataset:

@dataset{yegor25_lyrics_genre,
  author = {Yegor25},
  title = {Lyrics Genre Dataset Large},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/Yegor25/lyrics_genre_dataset_large}
}