Upload MusicQualityModel checkpoint

Files changed (6)

README.md +85 -0
base.yaml +118 -0
best_model.pt +3 -0
checkpoint_info.json +17 -0
config.yaml +33 -0
model_state_dict.pt +3 -0

README.md ADDED

	@@ -0,0 +1,85 @@

+---
+language: en
+library_name: pytorch
+license: mit
+pipeline_tag: audio-classification
+tags:
+- audio
+- music
+- quality-assessment
+- MOS-prediction
+- music-generation
+---
+# MusicQualityModel — A3a_lora
+Multi-head neural evaluator for music generation quality, built on
+[MuQ](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter) representations (this
+checkpoint adapts the encoder with LoRA) with learned attention pooling and per-dimension MLP prediction heads.
+## Model Details
+- **Encoder:** `OpenMuQ/MuQ-large-msd-iter` (tuning mode: `lora`)
+- **Pooling:** Attention-weighted mean pooling
+- **Heads:** MI (`overall_quality`), TA (`textual_alignment`)
+- **Loss:** `ordinal_ce`
+- **Input:** Mono audio waveform at 24,000 Hz, up to 10.0 s (240,000 samples)
+## Performance
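+Evaluated with 5-fold cross-validation on MusicEval (2,748 clips from 31 text-to-music (TTM) systems).
+`checkpoint_info.json` records validation metrics at epoch 2: MI SRCC 0.825 / PCC 0.814, TA SRCC 0.573 / PCC 0.572.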
+## Usage
+```python
+import torch
+import torchaudio
+from omegaconf import OmegaConf
+from huggingface_hub import hf_hub_download
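+# Download files. config.yaml only overrides base.yaml (see its `defaults` list),
+# so both are fetched; merging them here is an assumption about how cfg is built.
+base_path = hf_hub_download("zhudi2825/MuQ-Eval", "base.yaml")
+config_path = hf_hub_download("zhudi2825/MuQ-Eval", "config.yaml")
+model_path = hf_hub_download("zhudi2825/MuQ-Eval", "best_model.pt")
+# Load config (experiment values override base values) and build model
+cfg = OmegaConf.merge(OmegaConf.load(base_path), OmegaConf.load(config_path))
+# MusicQualityModel lives in the project's source tree, which must be importable
+from src.model import MusicQualityModel
+model = MusicQualityModel(cfg)
+# weights_only=False unpickles arbitrary objects; only load checkpoints you trust
+ckpt = torch.load(model_path, map_location="cpu", weights_only=False)
+model.load_state_dict(ckpt["model_state"])
+model.eval()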
+# Run inference
+waveform, sr = torchaudio.load("audio.wav")
+if sr != 24000:
+    waveform = torchaudio.transforms.Resample(sr, 24000)(waveform)
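+waveform = waveform.mean(0)  # downmix channels to mono
+waveform = waveform[:240000].unsqueeze(0)  # keep the first 10 s; shape [1, samples]
+with torch.no_grad():
+    preds = model(waveform)
+    # Per-head expected scores are read from the model after the forward pass
+    scores = model._last_expected_scores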
+    for name, score in scores.items():
+        print(f"{name}: {score.item():.2f}")
+```
+## Training
+- **Dataset:** MusicEval (BAAI/MusicEval), 5-fold CV stratified by TTM system
+- **Epochs:** 30
+- **Batch size:** 16
+- **Optimizer:** AdamW (lr=0.0001, wd=0.01)
+- **Scheduler:** cosine with 500 warmup steps
+- **Precision:** bf16
+## Citation
+If you use this model, please cite:
+```bibtex
+@article{musicquality2026,
+  title={Frozen Music Representations Suffice for Per-Sample Quality Prediction of Generated Music},
+  year={2026}
+}
+```
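
Note on the loss: the README lists `ordinal_ce` and reads `model._last_expected_scores` at inference, but it does not show how a distribution over `ordinal_bins: 5` becomes a scalar score. The sketch below is one plausible reading, not the code in `src.model`. It assumes the bins are centred on the integer ratings 1 to 5 and that `ordinal_sigma` is the standard deviation of a Gaussian used to soften the training target. All helper names are hypothetical.

```python
import math

def soft_targets(score, bins=5, sigma=0.5):
    """Gaussian-smoothed target over bin centres 1..bins (assumed layout)."""
    weights = [math.exp(-0.5 * ((k - score) / sigma) ** 2)
               for k in range(1, bins + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy between a soft target and the predicted bin distribution."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

def expected_score(logits):
    """Collapse bin logits to a scalar: sum over k of p_k * k, k = 1..bins."""
    probs = softmax(logits)
    return sum(p * k for k, p in enumerate(probs, start=1))

# Hypothetical example: a clip rated 3.4 and an untrained (uniform) head.
target = soft_targets(3.4)
uniform_logits = [0.0] * 5
loss = cross_entropy(uniform_logits, target)
score = expected_score(uniform_logits)
```

Under these assumptions a head with uniform logits scores the midpoint of the scale, and the soft target for a rating of 3.4 puts most of its mass on bin 3 with the remainder mainly on bin 4.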

base.yaml ADDED

	@@ -0,0 +1,118 @@

+# Base configuration for neural music quality evaluator
+# All experiment configs inherit from this.
+seed: 42
+# --- Data ---
+data:
+  musiceval_id: "BAAI/MusicEval"
+  songeval_id: "ASLP-lab/SongEval"
+  sample_rate: 24000
+  clip_duration_sec: 10.0
+  clip_samples: 240000  # 24000 * 10
+  num_workers: 4
+  pin_memory: true
+  # MusicEval CV
+  cv_folds: 5
+  cv_stratify_by: "model"  # stratify by TTM model
+  # SongEval chunking
+  songeval_chunk_mode: "random"  # random|center|multi
+  songeval_num_chunks: 1  # per song during training
+  # Dimension mapping
+  musiceval_dims:
+    MI: "overall_quality"
+    TA: "textual_alignment"
+  songeval_dims:
+    MI: "Musicality"
+    TA: null  # no text alignment in SongEval
+    PQ: "Coherence"
+# --- Model ---
+model:
+  encoder: "muq"  # muq | mert_95m | mert_330m
+  encoder_id: "OpenMuQ/MuQ-large-msd-iter"
+  encoder_dim: 1024
+  freeze_encoder: true
+  # Pooling
+  pooling: "attention"  # attention | mean | cls
+  # Prediction heads
+  heads:
+    - name: "MI"
+      output_dim: 1  # regression
+    - name: "TA"
+      output_dim: 1
+  # Head architecture
+  head_hidden_dim: 256
+  head_dropout: 0.1
+  head_layers: 2  # number of MLP layers in head
+  # LoRA (only when tuning_mode=lora)
+  lora:
+    r: 16
+    alpha: 32
+    target_modules: ["q_proj", "k_proj", "v_proj", "out_proj"]
+    dropout: 0.1
+    bias: "none"
+  # Tuning mode: frozen | lora | full
+  tuning_mode: "frozen"
+# --- Loss ---
+loss:
+  type: "mse"  # mse | ordinal_ce | ordinal_ce_contrastive
+  # Ordinal CE params
+  ordinal_bins: 5
+  ordinal_sigma: 0.5
+  # Contrastive params
+  contrastive_weight: 0.5
+  contrastive_margin: 0.5
+  contrastive_warmstart_epoch: 6  # add contrastive loss after this epoch
+  # Uncertainty weighting (Kendall et al.)
+  uncertainty_weighting: false
+  # Bias calibration (MBNet-style)
+  bias_calibration: false
+# --- Training ---
+training:
+  epochs: 30
+  batch_size: 16
+  lr: 1.0e-4
+  weight_decay: 0.01
+  warmup_steps: 500
+  scheduler: "cosine"  # cosine | linear | constant
+  gradient_clip_norm: 1.0
+  mixed_precision: "bf16"  # bf16 | fp16 | no
+  gradient_checkpointing: false
+  # Early stopping
+  patience: 7
+  monitor: "val/MI_srcc"
+  monitor_mode: "max"
+  # Logging
+  log_every_n_steps: 10
+  eval_every_n_epochs: 1
+  save_top_k: 3
+# --- Evaluation ---
+evaluation:
+  bootstrap_n: 1000
+  bootstrap_ci: 0.95
+  steiger_alpha: 0.0083  # Bonferroni-corrected for 6 comparisons
+# --- Paths ---
+paths:
+  output_dir: "./outputs"
+  cache_dir: "./cache"
+  wandb_project: "music-quality-evaluator"
+# --- Experiment ---
+experiment:
+  name: "base"
+  tags: []
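
Note on the schedule: `training` asks for `scheduler: "cosine"` with `warmup_steps: 500` and a peak `lr` of `1.0e-4`, without saying how the two phases are joined. A common form is linear warmup followed by cosine decay to zero, sketched below. The decay floor, the `total_steps` value and the function name are assumptions, since the training code is not part of this commit.

```python
import math

def lr_at(step, total_steps, peak_lr=1.0e-4, warmup_steps=500):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed form)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length; the real value depends on dataset size and epochs.
TOTAL_STEPS = 4000
schedule = [lr_at(s, TOTAL_STEPS) for s in (0, 250, 500, 2250, 4000)]
```

Under this form, the `lr` of roughly 6.18e-05 recorded in `checkpoint_info.json` would be consistent with a checkpoint saved before warmup finished, although the step count is not recorded there.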

best_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f7f259c533dc72db8f6306d3ca85ac8c4d4528bab07e5d059ea49654c6e61d66
+size 1354863088

checkpoint_info.json ADDED

	@@ -0,0 +1,17 @@

+{
+  "epoch": 2,
+  "metrics": {
+    "loss/MI": 1.0626029676447313,
+    "loss/TA": 1.2027087981502216,
+    "loss/total": 2.2653117713828883,
+    "train/loss_avg": 2.2653117713828883,
+    "val/loss": 2.3157517766952513,
+    "val/MI_pcc": 0.8138228058815002,
+    "val/MI_srcc": 0.8251110190549646,
+    "val/TA_pcc": 0.5718650817871094,
+    "val/TA_srcc": 0.5726515902330593,
+    "epoch": 2,
+    "time_sec": 54.221648931503296,
+    "lr": 6.183999999999998e-05
+  }
+}
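
Note on the metrics: the `*_pcc` and `*_srcc` keys are presumably Pearson and Spearman correlation between predicted and human scores. For readers unfamiliar with the pair, the sketch below computes both in plain Python, with Spearman defined as Pearson on average ranks. The toy ratings are invented for illustration and are not MusicEval data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented toy scores: five clips, human ratings vs. model predictions.
human = [1.5, 2.0, 3.5, 4.0, 4.5]
predicted = [1.8, 2.6, 3.0, 4.4, 4.1]
pcc = pearson(human, predicted)
srcc = spearman(human, predicted)
```

SRCC only rewards getting the ordering right, which is why `monitor: "val/MI_srcc"` in `base.yaml` selects checkpoints by ranking quality rather than by absolute error.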

config.yaml ADDED

	@@ -0,0 +1,33 @@

+# A3a: MuQ + LoRA r=16 + Ordinal CE
+# Phase 2 component ablation: add LoRA to A2.
+defaults:
+  - base
+experiment:
+  name: "A3a_lora"
+  tags: ["ablation", "lora", "ordinal_ce", "phase2"]
+model:
+  tuning_mode: "lora"
+  pooling: "attention"
+  head_layers: 3
+  head_hidden_dim: 512
+  lora:
+    r: 16
+    alpha: 32
+    target_modules: ["q_proj", "k_proj", "v_proj", "out_proj"]
+    dropout: 0.1
+loss:
+  type: "ordinal_ce"
+  ordinal_bins: 5
+  ordinal_sigma: 0.5
+  uncertainty_weighting: false
+  bias_calibration: false
+training:
+  epochs: 30
+  batch_size: 16
+  lr: 1.0e-4  # LR for encoder tuning (same value as base.yaml)
+  gradient_checkpointing: false
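
Note on inheritance: `config.yaml` lists `base` under `defaults` and only overrides a handful of keys, so values such as `model.encoder_id` exist only in `base.yaml`. Loading `config.yaml` alone would leave them missing. The sketch below shows the override semantics with a plain recursive merge over a hand-copied subset of both files. `deep_merge` is a hypothetical helper, and with OmegaConf the analogous call would be `OmegaConf.merge(base_cfg, experiment_cfg)`.

```python
def deep_merge(base, override):
    """Return a new dict where override wins, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Subset of base.yaml, copied by hand for illustration.
base = {
    "model": {
        "encoder_id": "OpenMuQ/MuQ-large-msd-iter",
        "tuning_mode": "frozen",
        "head_layers": 2,
        "head_hidden_dim": 256,
    },
    "loss": {"type": "mse", "ordinal_bins": 5},
}

# Subset of config.yaml (experiment A3a_lora).
experiment = {
    "model": {"tuning_mode": "lora", "head_layers": 3, "head_hidden_dim": 512},
    "loss": {"type": "ordinal_ce"},
}

cfg = deep_merge(base, experiment)
```

After the merge, `cfg["model"]` carries the LoRA settings from the experiment file alongside the encoder ID inherited from the base file, and `base` itself is left unchanged.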

model_state_dict.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aaec85582cbea0d726d0511f571c58435c251d08e1c3231bfc722343e6a30868
+size 1341136958