---
license: apache-2.0
pipeline_tag: audio-classification
tags:
- music
- song
- aesthetics
- ASAE
---



# **HEAR**: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation
[**Paper**](https://arxiv.org/pdf/2511.18869) |
[**Model**](https://huggingface.co/earlab/EAR_HEAR)
<br>

Official PyTorch implementation of the ICASSP 2026 paper "HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation".

This repository contains the training and evaluation code for HEAR, a robust framework designed to address the challenges of multidimensional music aesthetic evaluation under limited labeled data.
![Overview of the HEAR framework](HEAR.png)
## 🌟 Key Features
* **Excellent Performance**: Ranked 2nd/19 on Track 1 and 5th/17 on Track 2 in the [ICASSP 2026 Automatic Song Aesthetics Evaluation Challenge](https://aslp-lab.github.io/Automatic-Song-Aesthetics-Evaluation-Challenge/).
* **Robustness**: Combines multi-source, multi-scale representations with hierarchical augmentation to learn robust features from limited labeled data.
* **Dual Capability**: Optimized for both exact score prediction and ranking (Top-Tier Song Identification).

## πŸ“¦ Installation
Clone the repository and install dependencies:
```bash
git clone https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.git
cd EAR_HEAR
git submodule update --init --recursive

conda create -n hear python=3.10 -y
conda activate hear
pip install -r requirements.txt
```

## πŸš€ Quick Start
```bash
# Download pretrained model weights
export HF_ENDPOINT=https://hf-mirror.com  # optional mirror for users in Mainland China
hf download earlab/EAR_HEAR --local-dir pretrained_models

# Track 1: Single-Label Inference (Musicality)
python inference.py \
    --input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \
    --output_json_path output.json \
    --model_path pretrained_models/track_1.pth \
    --model_config_path config_track_1.yaml

# Track 2: Multi-Label Inference (5 Dimensions)
python inference.py \
    --input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \
    --output_json_path output.json \
    --model_path pretrained_models/track_2.pth \
    --model_config_path config_track_2.yaml
```
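
To score a folder of songs, you can drive `inference.py` from Python. The sketch below is illustrative only: the output directory name is a placeholder, and the schema of each output JSON depends on the track (one dimension for Track 1, five for Track 2).

```python
# Minimal batch-inference sketch; directory names are placeholders.
import json
import subprocess
from pathlib import Path

AUDIO_DIR = Path("data_pipeline/origin_song_eval_dataset/mp3")
OUT_DIR = Path("outputs")
OUT_DIR.mkdir(exist_ok=True)

scores = {}
for audio in sorted(AUDIO_DIR.glob("*.mp3")):
    out_json = OUT_DIR / f"{audio.stem}.json"
    subprocess.run(
        [
            "python", "inference.py",
            "--input_audio_path", str(audio),
            "--output_json_path", str(out_json),
            "--model_path", "pretrained_models/track_2.pth",
            "--model_config_path", "config_track_2.yaml",
        ],
        check=True,
    )
    with open(out_json) as f:
        scores[audio.stem] = json.load(f)  # per-track schema, see above

print(f"Scored {len(scores)} files")
```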

## 🎯 Training

### Step 1: Data Preparation

First, prepare the dataset by running the data pipeline:

```bash
cd data_pipeline
bash run.sh
```

This script will:
1. **Download Dataset**: Download the [SongEval](https://huggingface.co/datasets/ASLP-lab/SongEval) dataset
2. **Split Dataset**: Split the dataset into training and validation sets based on [the challenge's validation IDs](https://github.com/ASLP-lab/Automatic-Song-Aesthetics-Evaluation-Challenge/blob/main/static/val_ids.txt)
3. **Audio Augmentation**: Apply audio augmentation to the training set (a rough sketch follows this list)
4. **Extract Features**: Extract MuQ and MusicFM features for both training and test sets
5. **Generate PKL Files**: Generate `train_set.pkl` and `test_set.pkl` files for training and evaluation
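
The actual augmentation chain for step 3 is implemented inside `data_pipeline`; as a rough illustration only, here is a minimal waveform-level chain using [Audiomentations](https://github.com/iver56/audiomentations), which this repo acknowledges below. Every parameter value here is a placeholder, not the pipeline's real setting.

```python
# Illustrative waveform augmentation; NOT the pipeline's actual chain or settings.
import librosa
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

# Load one clip (path from Quick Start) and apply the random chain.
samples, sr = librosa.load("data_pipeline/origin_song_eval_dataset/mp3/0.mp3",
                           sr=None, mono=True)
augmented = augment(samples=samples, sample_rate=int(sr))
```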


### Step 2: Model Training

After data preparation, you can train the HEAR model for either Track 1 (single-label: Musicality) or Track 2 (multi-label: 5 dimensions).

#### Track 1: Single-Label Training (Musicality)

Train the model for musicality prediction:

```bash
python train_track_1.py \
    --experiment_name track1_exp \
    --train-data /path/to/train_set.pkl \
    --test-data /path/to/test_set.pkl \
    --max-epoch 60 \
    --batch-size 8 \
    --lr 1e-5 \
    --weight_decay 1e-3 \
    --accum_steps 4 \
    --lambda 0.15 \
    --workers 8 \
    --seed 0
```

#### Track 2: Multi-Label Training (5 Dimensions)

Train the model for multi-dimensional aesthetic evaluation:

```bash
python train_track_2.py \
    --experiment_name track2_exp \
    --train-data /path/to/train_set.pkl \
    --test-data /path/to/test_set.pkl \
    --max-epoch 60 \
    --batch-size 8 \
    --lr 1e-5 \
    --weight_decay 1e-3 \
    --accum_steps 4 \
    --lambda 0.05 \
    --workers 8 \
    --seed 0
```

#### Key Parameters

* `--max-epoch`: Maximum number of training epochs (default: 60)
* `--batch-size`: Batch size for training (default: 8)
* `--experiment_name`: Name of the experiment for saving models and logs
* `--lr`: Learning rate (default: 1e-5)
* `--weight_decay`: Weight decay for optimizer (default: 1e-3)
* `--accum_steps`: Gradient accumulation steps (default: 4)
* `--lambda`: Weight for the ranking loss term (Track 1: 0.15, Track 2: 0.05); see the sketch after this list
* `--workers`: Number of data loading workers (default: 8)
* `--seed`: Random seed for reproducibility (default: 0)
* `--train-data`: Path to training data pkl file (default: `data_pipeline/dataset_pkl/train_set.pkl`)
* `--test-data`: Path to test data pkl file (default: `data_pipeline/dataset_pkl/test_set.pkl`)
* `--log-dir`: Path to tensorboard log directory (default: `./log/tensorboard_records/{experiment_name}`)
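
How `--lambda` enters the objective is defined in the training scripts (the ranking component draws on allRank, acknowledged below). Purely as a hedged illustration of weighting a pairwise ranking term against a regression term, here is a hypothetical `combined_loss`; the margin value and loss shape are not taken from this repo.

```python
# Hypothetical sketch of lam (--lambda) trading off regression vs. ranking.
import torch
import torch.nn.functional as F

def combined_loss(pred: torch.Tensor, target: torch.Tensor,
                  lam: float = 0.15) -> torch.Tensor:
    """pred/target: 1-D score tensors for one batch (per dimension on Track 2)."""
    mse = F.mse_loss(pred, target)
    i, j = torch.triu_indices(len(pred), len(pred), offset=1)  # all batch pairs
    diff = target[i] - target[j]
    mask = diff != 0  # only rank pairs whose labels actually differ
    if mask.any():
        rank = F.margin_ranking_loss(pred[i][mask], pred[j][mask],
                                     torch.sign(diff[mask]), margin=0.1)
    else:
        rank = pred.new_zeros(())
    return mse + lam * rank
```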

#### Evaluation Mode

To evaluate a trained model, use the `--eval` flag:

```bash
python train_track_1.py --eval --experiment_name track1_exp
python train_track_2.py --eval --experiment_name track2_exp
```

#### Model Configuration

Model architectures are configured in:
* `config_track_1.yaml` - Configuration for Track 1
* `config_track_2.yaml` - Configuration for Track 2

Trained models are saved in `log/models/{experiment_name}/model.pth`, and training logs are saved to TensorBoard in `./log/tensorboard_records/{experiment_name}/` (or custom path specified by `--log-dir`).
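
To sanity-check a saved checkpoint, you can load it on CPU and list a few parameter shapes. This assumes `model.pth` is a standard torch-serialized file; whether it stores a raw `state_dict` or a wrapper dict depends on the training script.

```python
# Quick checkpoint inspection (path assumes --experiment_name track1_exp).
import torch

ckpt = torch.load("log/models/track1_exp/model.pth", map_location="cpu")
# Unwrap a possible {"state_dict": ...} wrapper; otherwise use the dict as-is.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state.items())[:10]:
    print(name, tuple(tensor.shape))
```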


## πŸ™ Acknowledgement

We sincerely thank the authors and contributors of the following open-source projects:

* **[SongEval](https://github.com/ASLP-lab/SongEval)**
* **[SongFormer](https://github.com/ASLP-lab/SongFormer)**
* **[Audiomentations](https://github.com/iver56/audiomentations)**
* **[Wespeaker](https://github.com/wenet-e2e/wespeaker)**
* **[allRank](https://github.com/allegro/allRank)**

We would like to express our special thanks to **Shizhe Chen** from **Shanghai Conservatory of Music** for his invaluable guidance and insights on music aesthetics.

## πŸ“š Citation
```bibtex
@misc{liu2025hearhierarchicallyenhancedaesthetic,
      title={HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation}, 
      author={Shuyang Liu and Yuan Jin and Rui Lin and Shizhe Chen and Junyu Dai and Tao Jiang},
      year={2025},
      eprint={2511.18869},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2511.18869}, 
}
```