---
license: apache-2.0
pipeline_tag: audio-classification
tags:
- music
- song
- aesthetics
- ASAE
---
# **HEAR**: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation
[**Paper**](https://arxiv.org/pdf/2511.18869) |
[**Model**](https://huggingface.co/earlab/EAR_HEAR)
<br>
Official PyTorch implementation of the ICASSP 2026 paper "HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation".
This repository contains the training and evaluation code for HEAR, a robust framework for multidimensional music aesthetic evaluation under limited labeled data.

## Key Features
* **Excellent Performance**: Ranked 2nd/19 on Track 1 and 5th/17 on Track 2 in the [ICASSP 2026 Automatic Song Aesthetics Evaluation Challenge](https://aslp-lab.github.io/Automatic-Song-Aesthetics-Evaluation-Challenge/).
* **Robustness**: Synergizes Multi-Source Multi-Scale Representations and Hierarchical Augmentation to capture robust features under limited labeled data.
* **Dual Capability**: Optimized for both exact score prediction and ranking (Top-Tier Song Identification).
## Installation
Clone the repository and install dependencies:
```bash
git clone https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.git
cd EAR_HEAR
git submodule update --init --recursive
conda create -n hear python=3.10 -y
conda activate hear
pip install -r requirements.txt
```
## Quick Start
```bash
# Download pretrained model weights
export HF_ENDPOINT=https://hf-mirror.com  # Needed for HuggingFace downloads from Mainland China
hf download earlab/EAR_HEAR --local-dir pretrained_models
# Track 1: Single-Label Inference (Musicality)
python inference.py \
--input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \
--output_json_path output.json \
--model_path pretrained_models/track_1.pth \
--model_config_path config_track_1.yaml
# Track 2: Multi-Label Inference (5 Dimensions)
python inference.py \
--input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \
--output_json_path output.json \
--model_path pretrained_models/track_2.pth \
--model_config_path config_track_2.yaml
```
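After inference, the predicted scores can be consumed from the output JSON. A minimal sketch, assuming the file maps dimension names to scalar scores; the exact keys depend on the track and on what `inference.py` writes:
```python
import json

# Read the scores written by inference.py. The key names are an
# assumption for illustration; inspect output.json for the real schema.
with open("output.json") as f:
    scores = json.load(f)

for dimension, value in scores.items():
    print(f"{dimension}: {value}")
```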
## Training
### Step 1: Data Preparation
First, prepare the dataset by running the data pipeline:
```bash
cd data_pipeline
bash run.sh
```
This script will:
1. **Download Dataset**: Download the [SongEval](https://huggingface.co/datasets/ASLP-lab/SongEval) dataset
2. **Split Dataset**: Split the dataset into training and validation sets based on [the challenge's validation IDs](https://github.com/ASLP-lab/Automatic-Song-Aesthetics-Evaluation-Challenge/blob/main/static/val_ids.txt)
3. **Audio Augmentation**: Apply audio augmentation to the training set (a waveform-level sketch follows this list)
4. **Extract Features**: Extract MuQ and MusicFM features for both training and test sets
5. **Generate PKL Files**: Generate `train_set.pkl` and `test_set.pkl` files for training and evaluation
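The augmentation step corresponds to HEAR's Hierarchical Augmentation component; the actual transform chain is defined inside the data pipeline scripts. As a minimal sketch of waveform-level augmentation with [Audiomentations](https://github.com/iver56/audiomentations), where the specific transforms and parameters are illustrative rather than the repository's configuration:
```python
import numpy as np
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Illustrative augmentation chain; the repository's actual transforms
# and probabilities are set by data_pipeline/run.sh and its scripts.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

# 10 seconds of dummy mono audio at 24 kHz, in place of a real training clip.
samples = np.random.uniform(-1.0, 1.0, size=24000 * 10).astype(np.float32)
augmented = augment(samples=samples, sample_rate=24000)
```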
### Step 2: Model Training
After data preparation, you can train the HEAR model for either Track 1 (single-label: Musicality) or Track 2 (multi-label: 5 dimensions).
#### Track 1: Single-Label Training (Musicality)
Train the model for musicality prediction:
```bash
python train_track_1.py \
--experiment_name track1_exp \
--train-data /path/to/train_set.pkl \
--test-data /path/to/test_set.pkl \
--max-epoch 60 \
--batch-size 8 \
--lr 1e-5 \
--weight_decay 1e-3 \
--accum_steps 4 \
--lambda 0.15 \
--workers 8 \
--seed 0
```
#### Track 2: Multi-Label Training (5 Dimensions)
Train the model for multi-dimensional aesthetic evaluation:
```bash
python train_track_2.py \
--experiment_name track2_exp \
--train-data /path/to/train_set.pkl \
--test-data /path/to/test_set.pkl \
--max-epoch 60 \
--batch-size 8 \
--lr 1e-5 \
--weight_decay 1e-3 \
--accum_steps 4 \
--lambda 0.05 \
--workers 8 \
--seed 0
```
#### Key Parameters
* `--max-epoch`: Maximum number of training epochs (default: 60)
* `--batch-size`: Batch size for training (default: 8)
* `--experiment_name`: Name of the experiment for saving models and logs
* `--lr`: Learning rate (default: 1e-5)
* `--weight_decay`: Weight decay for optimizer (default: 1e-3)
* `--accum_steps`: Gradient accumulation steps (default: 4)
* `--lambda`: Weight of the ranking loss term (Track 1: 0.15, Track 2: 0.05); a sketch of this lambda-weighted objective follows this list
* `--workers`: Number of data loading workers (default: 8)
* `--seed`: Random seed for reproducibility (default: 0)
* `--train-data`: Path to training data pkl file (default: `data_pipeline/dataset_pkl/train_set.pkl`)
* `--test-data`: Path to test data pkl file (default: `data_pipeline/dataset_pkl/test_set.pkl`)
* `--log-dir`: Path to tensorboard log directory (default: `./log/tensorboard_records/{experiment_name}`)
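The `--lambda` flag balances exact score regression against ranking quality. A minimal sketch of such a lambda-weighted objective, assuming an MSE regression term plus a pairwise margin ranking term; the repository's actual losses live in `train_track_1.py` / `train_track_2.py` and may differ (e.g. a listwise loss as in allRank):
```python
import torch
import torch.nn.functional as F

def combined_loss(pred: torch.Tensor, target: torch.Tensor,
                  lam: float = 0.15, margin: float = 0.1) -> torch.Tensor:
    """Illustrative lambda-weighted objective: MSE regression + pairwise ranking.

    pred, target: 1-D tensors of predicted / ground-truth aesthetic scores.
    ("lam" stands in for --lambda, which is a reserved word in Python.)
    """
    regression = F.mse_loss(pred, target)

    # Pairwise differences: entry [i, j] is score_i - score_j.
    diff_pred = pred.unsqueeze(1) - pred.unsqueeze(0)
    diff_true = target.unsqueeze(1) - target.unsqueeze(0)

    # Margin ranking: for every pair with target_i != target_j, penalize
    # predictions that disagree with the ordering (or agree by < margin).
    sign = torch.sign(diff_true)
    mask = sign != 0
    if mask.any():
        ranking = F.relu(margin - sign * diff_pred)[mask].mean()
    else:
        ranking = pred.new_zeros(())  # batch of tied targets: no ranking signal

    return regression + lam * ranking

# Example: a batch of three songs.
pred = torch.tensor([3.2, 4.1, 2.8], requires_grad=True)
target = torch.tensor([3.0, 4.5, 2.5])
loss = combined_loss(pred, target)
loss.backward()
```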
#### Evaluation Mode
To evaluate a trained model, use the `--eval` flag:
```bash
python train_track_1.py --eval --experiment_name track1_exp
python train_track_2.py --eval --experiment_name track2_exp
```
#### Model Configuration
Model architectures are configured in:
* `config_track_1.yaml` - Configuration for Track 1
* `config_track_2.yaml` - Configuration for Track 2
Trained models are saved in `log/models/{experiment_name}/model.pth`, and training logs are saved to TensorBoard in `./log/tensorboard_records/{experiment_name}/` (or custom path specified by `--log-dir`).
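To inspect or reuse a trained checkpoint outside the training scripts, standard PyTorch loading applies. A minimal sketch, assuming `model.pth` is an ordinary `torch.save` artifact (whether it holds a raw state dict or a wrapper dict with extra metadata depends on the training script):
```python
import torch

# Load onto CPU; move to GPU as needed. weights_only=True restricts
# unpickling to tensors and primitives, which suits plain state dicts.
checkpoint = torch.load(
    "log/models/track1_exp/model.pth",  # log/models/{experiment_name}/model.pth
    map_location="cpu",
    weights_only=True,
)
print(type(checkpoint))  # raw state dict vs. wrapper dict
```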
## Acknowledgement
We sincerely thank the authors and contributors of the following open-source projects:
* **[SongEval](https://github.com/ASLP-lab/SongEval)**
* **[SongFormer](https://github.com/ASLP-lab/SongFormer)**
* **[Audiomentations](https://github.com/iver56/audiomentations)**
* **[Wespeaker](https://github.com/wenet-e2e/wespeaker)**
* **[allRank](https://github.com/allegro/allRank)**
We would like to express our special thanks to **Shizhe Chen** from **Shanghai Conservatory of Music** for his invaluable guidance and insights on music aesthetics.
## Citation
```bibtex
@misc{liu2025hearhierarchicallyenhancedaesthetic,
  title={Hear: Hierarchically Enhanced Aesthetic Representations For Multidimensional Music Evaluation},
  author={Shuyang Liu and Yuan Jin and Rui Lin and Shizhe Chen and Junyu Dai and Tao Jiang},
  year={2025},
  eprint={2511.18869},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2511.18869},
}
```