Banger Scorer

Rate how good a song sounds, 0-10. A lightweight MLP trained on top of frozen MERT-v1-330M embeddings to predict music quality from audio. Trained on FMA-Small play count data, designed for relative ranking of AI-generated songs within a batch.

Global scatter: predicted vs actual banger scores

Model Description

The banger scorer takes a raw audio waveform, encodes it through MERT-v1-330M (frozen, 330M params) to produce a 1024-dimensional embedding, then passes that embedding through a small trained MLP to output a single scalar score from 0 to 10.

The core use case: generate a batch of songs with a model like ACE-Step, score them all automatically, and keep only the top-scoring tracks. Bangers only.

Audio waveform (24kHz, 30s)
    --> MERT-v1-330M (frozen, 330M params)
    --> 1024-dim embedding (mean-pooled across ~1200 time frames)
    --> MLP Scorer (558K trainable params)
    --> Score 0-10

Architecture

Encoder: m-a-p/MERT-v1-330M -- a 24-layer self-supervised music understanding model trained on 160K hours of audio. Completely frozen during training; used only to extract embeddings.

Scorer head: MLP with progressive bottleneck:

Input: 1024-dim MERT embedding
    --> Linear(1024, 512) + BatchNorm + ReLU + Dropout(0.3)
    --> Linear(512, 256)  + BatchNorm + ReLU + Dropout(0.3)
    --> Linear(256, 128)  + BatchNorm + ReLU + Dropout(0.15)
    --> Linear(128, 1)
Output: scalar score (0-10)
  • Total trainable parameters: 558K (2.6 MB in float32)
  • Inference time: < 1ms per prediction (MLP only, excludes MERT encoding)

Training Data

Trained on FMA-Small (Free Music Archive), a Creative Commons dataset of 8,000 tracks across 8 balanced genres (1,000 each): Hip-Hop, Pop, Folk, Experimental, Rock, International, Electronic, Instrumental.

Labels: Play counts from FMA, log-normalized to a 0-10 scale:

import numpy as np

# df is the FMA tracks metadata, with raw play counts in the "listens" column
log_listens = np.log1p(df["listens"])
banger_score = (log_listens - log_listens.min()) / (log_listens.max() - log_listens.min()) * 10.0

After log-normalization, the training score distribution is heavily concentrated in the 1-5 range, with very few examples above 7:
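
A worked example with hypothetical play counts (not actual FMA values) shows why: log1p compresses large counts, so only the most-played track in the dataset can reach 10, and mid-popularity tracks pile up in the middle of the scale.

```python
import numpy as np

# Hypothetical play counts (illustrative only, not real FMA data)
listens = np.array([0, 10, 100, 1_000, 1_000_000])
log_listens = np.log1p(listens)
scores = (log_listens - log_listens.min()) / (log_listens.max() - log_listens.min()) * 10.0
# scores ~ [0, 1.74, 3.34, 5.0, 10.0]: 1,000 plays lands near 5,
# and only the top track in the dataset reaches 10
```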

Training score distribution

Training genre distribution

Pre-computed embeddings are available as a separate dataset: treadon/fma-mert-embeddings. MERT embedding extraction took 101 minutes on M4 Pro MPS across 7,997 successfully processed tracks (3 corrupt MP3s failed, 99.96% success rate).
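
If you are extracting embeddings for many tracks yourself, caching them on disk avoids re-running the slow MERT forward pass. A minimal sketch, with the encoder abstracted as a callable (`encode_fn` is a placeholder for whatever wraps the pipeline shown in How to Use):

```python
import hashlib
from pathlib import Path

import numpy as np

def get_embedding(audio_path, encode_fn, cache_dir="mert_cache"):
    """Return the embedding for audio_path, calling encode_fn (a callable
    mapping a file path to a numpy array) only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)
    key = hashlib.sha1(str(audio_path).encode()).hexdigest()
    cached = cache / f"{key}.npy"
    if cached.exists():
        return np.load(cached)
    emb = np.asarray(encode_fn(audio_path))
    np.save(cached, emb)
    return emb
```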

Training Procedure

  • Split: 70% train (5,600) / 15% validation (1,200) / 15% test (1,200)
  • Optimizer: AdamW, lr=1e-3, weight_decay=1e-4
  • Scheduler: CosineAnnealingLR over 200 max epochs
  • Loss: MSE
  • Batch size: 64
  • Early stopping: Patience 20, best model at epoch 9 (early stopped at epoch 30)
  • Training time: ~30 seconds on Apple M4 Pro (MPS) with cached embeddings
  • Device: Apple M4 Pro, Metal Performance Shaders (MPS)
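
The procedure above can be sketched as a standard supervised loop. This is a sketch on synthetic data standing in for the cached MERT embeddings (dimensions shrunk for brevity; the real run uses 1024-dim inputs and the MLP from the Architecture section), with the same optimizer, scheduler, loss, batch size, and early-stopping settings:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 64)
y = X[:, :4].sum(dim=1)                      # arbitrary synthetic regression target
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=200)
loss_fn = nn.MSELoss()

best_val, best_state, patience, bad = float("inf"), None, 20, 0
for epoch in range(200):                     # 200 max epochs
    model.train()
    for i in range(0, len(X_train), 64):     # batch size 64
        xb, yb = X_train[i:i + 64], y_train[i:i + 64]
        opt.zero_grad()
        loss_fn(model(xb).squeeze(-1), yb).backward()
        opt.step()
    sched.step()
    model.eval()
    with torch.no_grad():
        val = loss_fn(model(X_val).squeeze(-1), y_val).item()
    if val < best_val:                       # keep the best checkpoint
        best_val, best_state, bad = val, model.state_dict(), 0
    else:
        bad += 1
        if bad >= patience:                  # early stopping, patience 20
            break
```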

Evaluation Results

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Test MAE | 0.858 | < 1.5 | Exceeded by 43% |
| Test Spearman | 0.468 | > 0.4 | Hit target |
| Val MAE | 0.822 | -- | -- |
| Model size | ~2.6 MB | < 50 MB | -- |

MAE of 0.858 on a 0-10 scale means predictions are typically off by less than 1 point. For filtering purposes (pick the best 5 from 50), the model reliably distinguishes a 2/10 from a 6/10, even if it cannot tell a 7/10 from an 8/10.
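
Both headline metrics can be reproduced on toy numbers (the arrays below are illustrative, not the actual test-set predictions):

```python
import numpy as np
from scipy.stats import spearmanr

# Toy ground-truth scores and predictions (illustrative only)
y_true = np.array([2.0, 3.1, 4.5, 1.2, 6.0])
y_pred = np.array([2.4, 2.9, 4.0, 1.8, 5.1])

mae = np.abs(y_true - y_pred).mean()   # average absolute error, in score points
rho, _ = spearmanr(y_true, y_pred)     # rank correlation: only the ordering matters
```

Spearman is the more relevant metric for the top-K filtering use case, since it measures ranking agreement rather than absolute calibration.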

Test Results: 230 AI-Generated Songs

The scorer was tested on 230 songs generated with ACE-Step 1.5: 10 genres of 20 songs each, plus one banger-optimized run of 30 songs. Lyrics spanned English, Spanish, Hindi, Punjabi, and Chinese.

Genre ranking by mean score

| Rank | Genre | Mean | Best | Range |
|------|-------|------|------|-------|
| 1 | Electronic/EDM | 3.71 | 5.29 | 2.80-5.29 |
| 2 | Punjabi/Bhangra | 3.77 | 4.26 | 2.79-4.26 |
| 3 | Bollywood | 3.53 | 4.38 | 2.70-4.38 |
| 4 | C-Pop | 3.20 | 4.47 | 2.13-4.47 |
| 5 | Latin/Reggaeton | 3.19 | 3.90 | 2.36-3.90 |
| 6 | Pop/Dance | 3.05 | 4.31 | 1.98-4.31 |
| 7 | Rock/Alternative | 3.03 | 3.66 | 2.02-3.66 |
| 8 | Hip Hop | 2.92 | 3.38 | 2.52-3.38 |
| 9 | Acoustic/Folk | 2.63 | 3.31 | 2.03-3.31 |
| 10 | R&B/Soul | 2.62 | 3.21 | 2.14-3.21 |

Overall best: Melodic techno, 130 BPM, Eb minor -- scored 5.29/10 (67th percentile of FMA).

Score distribution histogram

Box plot of scores by genre

Optimization Impact

After analyzing the 200 random-parameter songs, a banger-optimized run of 30 songs was generated using only the highest-scoring parameter combinations (dark electronic/industrial styles, 126-138 BPM, minor keys only). Results:

Optimization impact comparison

| Metric | Random (200 songs) | Optimized (30 songs) | Improvement |
|--------|--------------------|----------------------|-------------|
| Mean score | 3.17 | 3.48 | +10% |
| Songs >= 3.5 | 20% | 60% | 3x |
| Songs >= 4.0 | 5% | 20% | 4x |
| Top score | 5.29 | 5.29 | Same ceiling |

The optimization raised the floor and consistency dramatically (4x hit rate for scores >= 4.0) without raising the ceiling.
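
The hit-rate rows in the table are simple threshold fractions; a sketch with illustrative (not actual) scores:

```python
def hit_rate(scores, threshold):
    """Fraction of songs scoring at or above threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

# Illustrative batches, not the real 200- and 30-song score lists
random_batch = [2.8, 3.1, 3.6, 2.5, 3.0]
optimized_batch = [3.6, 3.9, 3.4, 4.1, 3.7]
```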

Hit rate comparison by genre

Musical Analysis

Top vs bottom songs comparison

Key findings:

  • Minor keys outperformed major keys (3.26 vs 3.03 mean)
  • BPM sweet spots vary by genre: EDM peaks at 126-138, Punjabi at 95-105, Pop at 124-128
  • Slower BPMs (< 85) consistently underperformed across all genres
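
The aggregation behind these findings is a plain groupby over the per-song results table; a sketch on toy rows (not the actual 230-song data):

```python
import pandas as pd

# Toy per-song results (illustrative only)
df = pd.DataFrame({
    "key_mode": ["minor", "major", "minor", "major", "minor"],
    "bpm": [130, 120, 134, 80, 100],
    "score": [4.1, 3.2, 3.8, 2.4, 3.3],
})
by_mode = df.groupby("key_mode")["score"].mean()  # mean score per key mode
```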

Major vs minor key analysis

Key analysis

BPM vs score

BPM/Key heatmap for banger-optimized run

Generated vs training score distributions

How to Use

import torch
import librosa
import numpy as np
from transformers import AutoModel, AutoFeatureExtractor

# 1. Load MERT encoder
mert_name = "m-a-p/MERT-v1-330M"
feature_extractor = AutoFeatureExtractor.from_pretrained(mert_name, trust_remote_code=True)
mert = AutoModel.from_pretrained(mert_name, trust_remote_code=True)
mert.eval()

# 2. Load the scorer MLP
from huggingface_hub import hf_hub_download
import torch.nn as nn

class BangerScorer(nn.Module):
    def __init__(self, input_dim=1024, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(dropout / 2),
            nn.Linear(128, 1),
        )
    def forward(self, x):
        return self.net(x).squeeze(-1)

model_path = hf_hub_download("treadon/banger-scorer", "scorer_model.pt")
scorer = BangerScorer()
scorer.load_state_dict(torch.load(model_path, weights_only=True))
scorer.eval()

# 3. Score an audio file
waveform, _ = librosa.load("song.mp3", sr=24000, mono=True)
waveform = waveform[:24000 * 30]  # truncate to 30s

inputs = feature_extractor(waveform, sampling_rate=24000, return_tensors="pt")
with torch.no_grad():
    embedding = mert(**inputs).last_hidden_state.mean(dim=1)  # (1, 1024)
    score = scorer(embedding).item()
    score = max(0, min(10, score))  # clamp to 0-10

print(f"Banger score: {score:.2f}/10")

Intended Use

Do: Use the scorer to rank songs relative to each other within a batch. Generate N candidates, score them all, keep the top K.

Don't: Treat the score as an absolute quality judgment. A song scoring 3.5 is not objectively "bad" -- it just means the model thinks it is less likely to be popular based on patterns learned from FMA play counts.

The scorer is most useful as a cheap, fast filter to surface promising candidates from a large batch of AI-generated music, reducing the amount of human listening needed.
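
The generate-score-keep workflow reduces to a top-K selection. A sketch, where `score_fn` is a placeholder for the full scoring pipeline in How to Use:

```python
def top_k(paths, score_fn, k=5):
    """Score every candidate and return the k highest-scoring paths."""
    scored = sorted(((score_fn(p), p) for p in paths), reverse=True)
    return [p for _, p in scored[:k]]
```

For example, `top_k(candidate_files, score_song, k=5)` would surface the best 5 of a 50-song batch for human review.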

Limitations

  • Genre bias: The scorer strongly prefers high-energy, beat-driven music (EDM, Punjabi/Bhangra, Bollywood) over mellow genres (R&B, Acoustic/Folk). This reflects FMA's popularity distribution, not absolute musical quality.
  • FMA popularity != mainstream popularity. FMA is indie/unsigned artists on a free archive. The model learned "music that people actively seek out on a niche platform," not Billboard chart hits.
  • 30-second clips. Songs are scored based on 30-second excerpts. Quality aspects involving full-track structure (build-ups, drops, bridges) are not fully captured.
  • Mean pooling loses temporal info. By averaging MERT's per-frame outputs, temporal dynamics are collapsed. A song with a brilliant 10-second hook and 20 seconds of noise averages to the same embedding as a consistently mediocre song.
  • Popularity is not quality. Some brilliant niche music has low play counts; some generic music has millions of plays. The model learns statistical tendencies, not absolute aesthetic truth.
  • Distribution shift. MERT was trained on real music. AI-generated music may contain subtle artifacts that shift the embedding distribution in ways the scorer was not trained to handle.
  • Training data skew. Only 45 tracks out of 8,000 score above 7 in the training data. The model learned the 2-5 range well but cannot confidently score anything higher.

Caption style analysis

Citation

If you use this model, please cite the MERT paper and the FMA dataset:

@article{li2023mert,
  title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and others},
  journal={arXiv preprint arXiv:2306.00107},
  year={2023}
}

@inproceedings{defferrard2017fma,
  title={FMA: A Dataset For Music Analysis},
  author={Defferrard, Micha{\"e}l and Benzi, Kirell and Vandergheynst, Pierre and Bresson, Xavier},
  booktitle={ISMIR},
  year={2017}
}

Model Card Contact

treadon on HuggingFace
