zhudi2825 committed (verified)
Commit b6ed007 · 1 Parent(s): 981c67f

Upload MusicQualityModel checkpoint

Files changed (6)
  1. README.md +85 -0
  2. base.yaml +118 -0
  3. best_model.pt +3 -0
  4. checkpoint_info.json +17 -0
  5. config.yaml +33 -0
  6. model_state_dict.pt +3 -0
README.md ADDED
@@ -0,0 +1,85 @@
---
language: en
library_name: pytorch
license: mit
pipeline_tag: audio-classification
tags:
- audio
- music
- quality-assessment
- MOS-prediction
- music-generation
---

# MusicQualityModel — A3a_lora

Multi-head neural evaluator for music generation quality, built on
[MuQ](https://huggingface.co/OpenMuQ/MuQ-large-msd-iter) representations
(LoRA-tuned in this checkpoint) with learned attention pooling and
per-dimension MLP prediction heads.

## Model Details

- **Encoder:** `OpenMuQ/MuQ-large-msd-iter` (tuning mode: `lora`)
- **Pooling:** Attention-weighted mean pooling
- **Heads:** MI, TA
- **Loss:** `ordinal_ce`
- **Input:** Audio waveform at 24000 Hz, max 10.0 s

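Attention-weighted mean pooling collapses the encoder's per-frame features into one clip embedding. A minimal framework-agnostic sketch, assuming a single learned scoring vector `w` (the repository's actual attention module may differ):

```python
import numpy as np

def attention_pool(frames: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Attention-weighted mean over time.

    frames: [T, D] per-frame encoder features (e.g. MuQ, D=1024)
    w:      [D]    hypothetical learned scoring vector
    Returns a single [D] clip embedding.
    """
    scores = frames @ w                      # [T] one scalar per frame
    scores -= scores.max()                   # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum()                     # softmax over time
    return (alpha[:, None] * frames).sum(0)  # weighted mean, [D]

rng = np.random.default_rng(0)
frames = rng.normal(size=(250, 1024)).astype(np.float32)
pooled = attention_pool(frames, rng.normal(size=1024).astype(np.float32))
print(pooled.shape)  # (1024,)
```

With `w = 0` the softmax weights are uniform and this reduces to plain mean pooling, which is the `pooling: "mean"` baseline in `base.yaml`.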
## Performance

Evaluated with 5-fold cross-validation on MusicEval (2,748 clips, 31 TTM systems).

## Usage

```python
import torch
import torchaudio
from omegaconf import OmegaConf
from huggingface_hub import hf_hub_download

# Download config and checkpoint from the Hub
config_path = hf_hub_download("zhudi2825/MuQ-Eval", "config.yaml")
model_path = hf_hub_download("zhudi2825/MuQ-Eval", "best_model.pt")

# Load config and build the model (requires this repo's src/ on PYTHONPATH)
cfg = OmegaConf.load(config_path)
from src.model import MusicQualityModel
model = MusicQualityModel(cfg)

ckpt = torch.load(model_path, map_location="cpu", weights_only=False)
model.load_state_dict(ckpt["model_state"])
model.eval()

# Prepare a mono 24 kHz clip, capped at 10 s
waveform, sr = torchaudio.load("audio.wav")
if sr != 24000:
    waveform = torchaudio.transforms.Resample(sr, 24000)(waveform)
waveform = waveform.mean(0)                # mono
waveform = waveform[:240000].unsqueeze(0)  # [1, samples]

# Run inference
with torch.no_grad():
    preds = model(waveform)
    scores = model._last_expected_scores
    for name, score in scores.items():
        print(f"{name}: {score.item():.2f}")
```
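With the `ordinal_ce` loss, each head outputs logits over 5 score bins, and `_last_expected_scores` presumably holds the softmax-expected score per dimension. A hedged NumPy sketch of that reduction (bin centers at 1–5, matching a 5-point MOS scale, are an assumption):

```python
import numpy as np

def expected_score(logits: np.ndarray,
                   bin_centers=(1.0, 2.0, 3.0, 4.0, 5.0)) -> float:
    """Collapse ordinal-head logits into a scalar MOS-style score.

    logits: [K] unnormalized scores over K ordinal bins (K=5 here).
    Returns sum_k softmax(logits)_k * center_k.
    """
    z = logits - logits.max()        # numerical stability
    p = np.exp(z) / np.exp(z).sum()  # softmax over bins
    return float(p @ np.asarray(bin_centers))

print(round(expected_score(np.zeros(5)), 6))  # uniform bins -> 3.0
```

Uniform logits give the mid-scale score 3.0; a strongly peaked bin pulls the expectation toward that bin's center.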

## Training

- **Dataset:** MusicEval (BAAI/MusicEval) — 5-fold stratified CV by TTM model
- **Epochs:** 30
- **Batch size:** 16
- **Optimizer:** AdamW (lr=1e-4, weight decay=0.01)
- **Scheduler:** cosine with 500 warmup steps
- **Precision:** bf16

## Citation

If you use this model, please cite:

```bibtex
@article{musicquality2026,
  title={Frozen Music Representations Suffice for Per-Sample Quality Prediction of Generated Music},
  year={2026}
}
```
base.yaml ADDED
@@ -0,0 +1,118 @@
# Base configuration for neural music quality evaluator
# All experiment configs inherit from this.

seed: 42

# --- Data ---
data:
  musiceval_id: "BAAI/MusicEval"
  songeval_id: "ASLP-lab/SongEval"
  sample_rate: 24000
  clip_duration_sec: 10.0
  clip_samples: 240000  # 24000 * 10
  num_workers: 4
  pin_memory: true

  # MusicEval CV
  cv_folds: 5
  cv_stratify_by: "model"  # stratify by TTM model

  # SongEval chunking
  songeval_chunk_mode: "random"  # random | center | multi
  songeval_num_chunks: 1  # per song during training

  # Dimension mapping
  musiceval_dims:
    MI: "overall_quality"
    TA: "textual_alignment"
  songeval_dims:
    MI: "Musicality"
    TA: null  # no text alignment in SongEval
    PQ: "Coherence"

# --- Model ---
model:
  encoder: "muq"  # muq | mert_95m | mert_330m
  encoder_id: "OpenMuQ/MuQ-large-msd-iter"
  encoder_dim: 1024
  freeze_encoder: true

  # Pooling
  pooling: "attention"  # attention | mean | cls

  # Prediction heads
  heads:
    - name: "MI"
      output_dim: 1  # regression
    - name: "TA"
      output_dim: 1

  # Head architecture
  head_hidden_dim: 256
  head_dropout: 0.1
  head_layers: 2  # number of MLP layers in head

  # LoRA (only when tuning_mode=lora)
  lora:
    r: 16
    alpha: 32
    target_modules: ["q_proj", "k_proj", "v_proj", "out_proj"]
    dropout: 0.1
    bias: "none"

  # Tuning mode: frozen | lora | full
  tuning_mode: "frozen"

# --- Loss ---
loss:
  type: "mse"  # mse | ordinal_ce | ordinal_ce_contrastive
  # Ordinal CE params
  ordinal_bins: 5
  ordinal_sigma: 0.5
  # Contrastive params
  contrastive_weight: 0.5
  contrastive_margin: 0.5
  contrastive_warmstart_epoch: 6  # add contrastive loss after this epoch
  # Uncertainty weighting (Kendall et al.)
  uncertainty_weighting: false
  # Bias calibration (MBNet-style)
  bias_calibration: false

# --- Training ---
training:
  epochs: 30
  batch_size: 16
  lr: 1.0e-4
  weight_decay: 0.01
  warmup_steps: 500
  scheduler: "cosine"  # cosine | linear | constant
  gradient_clip_norm: 1.0
  mixed_precision: "bf16"  # bf16 | fp16 | no
  gradient_checkpointing: false

  # Early stopping
  patience: 7
  monitor: "val/MI_srcc"
  monitor_mode: "max"

  # Logging
  log_every_n_steps: 10
  eval_every_n_epochs: 1
  save_top_k: 3

# --- Evaluation ---
evaluation:
  bootstrap_n: 1000
  bootstrap_ci: 0.95
  steiger_alpha: 0.0083  # Bonferroni-corrected for 6 comparisons

# --- Paths ---
paths:
  output_dir: "./outputs"
  cache_dir: "./cache"
  wandb_project: "music-quality-evaluator"

# --- Experiment ---
experiment:
  name: "base"
  tags: []
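The `ordinal_bins: 5` / `ordinal_sigma: 0.5` settings suggest cross-entropy against a Gaussian-smoothed target distribution over the score bins. A hedged NumPy sketch of such target construction (bin centers at 1–5 and this exact smoothing are assumptions, not the repository's verified implementation):

```python
import numpy as np

def soft_ordinal_target(mos: float, bins: int = 5, sigma: float = 0.5) -> np.ndarray:
    """Turn a continuous MOS label into a soft distribution over ordinal bins.

    Places a Gaussian of width `sigma` on bin centers 1..bins around the
    label, then normalizes; cross-entropy against the head's logits would
    then give the assumed `ordinal_ce` loss.
    """
    centers = np.arange(1, bins + 1, dtype=np.float64)
    logp = -0.5 * ((centers - mos) / sigma) ** 2
    p = np.exp(logp - logp.max())  # subtract max for numerical stability
    return p / p.sum()

print(np.round(soft_ordinal_target(3.5), 3))  # mass split between bins 3 and 4
```

A label of 3.5 lands halfway between two bin centers, so the target splits its mass symmetrically, while an integer label concentrates it on one bin.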
best_model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f7f259c533dc72db8f6306d3ca85ac8c4d4528bab07e5d059ea49654c6e61d66
size 1354863088
checkpoint_info.json ADDED
@@ -0,0 +1,17 @@
{
  "epoch": 2,
  "metrics": {
    "loss/MI": 1.0626029676447313,
    "loss/TA": 1.2027087981502216,
    "loss/total": 2.2653117713828883,
    "train/loss_avg": 2.2653117713828883,
    "val/loss": 2.3157517766952513,
    "val/MI_pcc": 0.8138228058815002,
    "val/MI_srcc": 0.8251110190549646,
    "val/TA_pcc": 0.5718650817871094,
    "val/TA_srcc": 0.5726515902330593,
    "epoch": 2,
    "time_sec": 54.221648931503296,
    "lr": 6.183999999999998e-05
  }
}
config.yaml ADDED
@@ -0,0 +1,33 @@
# A3a: MuQ + LoRA r=16 + Ordinal CE
# Phase 2 component ablation: add LoRA to A2.

defaults:
  - base

experiment:
  name: "A3a_lora"
  tags: ["ablation", "lora", "ordinal_ce", "phase2"]

model:
  tuning_mode: "lora"
  pooling: "attention"
  head_layers: 3
  head_hidden_dim: 512
  lora:
    r: 16
    alpha: 32
    target_modules: ["q_proj", "k_proj", "v_proj", "out_proj"]
    dropout: 0.1

loss:
  type: "ordinal_ce"
  ordinal_bins: 5
  ordinal_sigma: 0.5
  uncertainty_weighting: false
  bias_calibration: false

training:
  epochs: 30
  batch_size: 16
  lr: 1.0e-4  # lower LR for encoder tuning
  gradient_checkpointing: false
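The LoRA block above (r=16, alpha=32 on the attention projections) adds a trainable low-rank update to each frozen weight. A minimal NumPy sketch of the standard LoRA forward rule y = Wx + (alpha/r)·B(Ax); the matrix sizes here are illustrative, not MuQ's actual dimensions:

```python
import numpy as np

def lora_forward(x, W, A, B, r=16, alpha=32):
    """y = W x + (alpha / r) * B (A x), the standard LoRA update.

    W: [d_out, d_in] frozen base projection (e.g. q_proj)
    A: [r, d_in]     trainable down-projection
    B: [d_out, r]    trainable up-projection (zero-initialized)
    """
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d, r = 64, 16
W = rng.normal(size=(d, d))        # frozen
A = rng.normal(size=(r, d)) * 0.01 # trainable
B = np.zeros((d, r))               # zero init: update is a no-op at start
x = rng.normal(size=d)
print(np.allclose(lora_forward(x, W, A, B), W @ x))  # True at initialization
```

Because B starts at zero, training begins exactly at the pretrained encoder's behavior and only 2·r·d parameters per projection are learned.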
model_state_dict.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aaec85582cbea0d726d0511f571c58435c251d08e1c3231bfc722343e6a30868
size 1341136958