|
|
--- |
|
|
license: apache-2.0 |
|
|
pipeline_tag: audio-classification |
|
|
tags: |
|
|
- music |
|
|
- song |
|
|
- aesthetics |
|
|
- ASAE |
|
|
--- |
|
|
|
|
|
|
|
|
|
|
|
# **HEAR**: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation |
|
|
[**Paper**](https://arxiv.org/pdf/2511.18869) | |
|
|
[**Model**](https://huggingface.co/earlab/EAR_HEAR) |
|
|
<br> |
|
|
|
|
|
Official PyTorch implementation of the ICASSP 2026 paper "HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation".
|
|
|
|
|
This repository contains the training and evaluation code for HEAR, a robust framework designed to address the challenges of multidimensional music aesthetic evaluation under limited labeled data. |
|
|
 |
|
|
## ✨ Key Features
|
|
* **Excellent Performance**: Ranked 2nd/19 on Track 1 and 5th/17 on Track 2 in the [ICASSP 2026 Automatic Song Aesthetics Evaluation Challenge](https://aslp-lab.github.io/Automatic-Song-Aesthetics-Evaluation-Challenge/). |
|
|
* **Robustness**: Synergizes Multi-Source Multi-Scale Representations and Hierarchical Augmentation to capture robust features under limited labeled data. |
|
|
* **Dual Capability**: Optimized for both exact score prediction and ranking (Top-Tier Song Identification). |
|
|
|
|
|
## 📦 Installation
|
|
Clone the repository and install dependencies: |
|
|
```bash
|
|
git clone https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.git
cd EAR_HEAR
|
|
git submodule update --init --recursive |
|
|
|
|
|
conda create -n hear python=3.10 -y |
|
|
conda activate hear |
|
|
pip install -r requirements.txt |
|
|
``` |
|
|
|
|
|
## 🚀 Quick Start
|
|
```bash
|
|
# Download pretrained model weights |
|
|
export HF_ENDPOINT=https://hf-mirror.com # Users in Mainland China may need this mirror for HuggingFace downloads
|
|
hf download earlab/EAR_HEAR --local-dir pretrained_models |
|
|
|
|
|
# Track 1: Single-Label Inference (Musicality) |
|
|
python inference.py \ |
|
|
--input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \ |
|
|
    --output_json_path output.json \
|
|
--model_path pretrained_models/track_1.pth \ |
|
|
--model_config_path config_track_1.yaml |
|
|
|
|
|
|
|
|
# Track 2: Multi-Label Inference (5 Dimensions) |
|
|
python inference.py \ |
|
|
--input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \ |
|
|
    --output_json_path output.json \
|
|
--model_path pretrained_models/track_2.pth \ |
|
|
--model_config_path config_track_2.yaml |
|
|
``` |
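
The inference script writes predicted scores to the file given by `--output_json_path`. A minimal sketch for reading them back, assuming the JSON maps dimension names to float scores (the exact keys depend on the track and config, so inspect your own `output.json` for the actual schema):

```python
import json

# Load the predictions written by inference.py (key names are an
# assumption; adapt to the schema your output.json actually uses).
with open("output.json") as f:
    scores = json.load(f)

for dimension, score in scores.items():
    print(f"{dimension}: {score:.3f}")
```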
|
|
|
|
|
## 🎯 Training
|
|
|
|
|
### Step 1: Data Preparation |
|
|
|
|
|
First, prepare the dataset by running the data pipeline: |
|
|
|
|
|
```bash |
|
|
cd data_pipeline |
|
|
bash run.sh |
|
|
``` |
|
|
|
|
|
This script will: |
|
|
1. **Download Dataset**: Download the [SongEval](https://huggingface.co/datasets/ASLP-lab/SongEval) dataset |
|
|
2. **Split Dataset**: Split the dataset into training and validation sets based on [the challenge's validation IDs](https://github.com/ASLP-lab/Automatic-Song-Aesthetics-Evaluation-Challenge/blob/main/static/val_ids.txt)
|
|
3. **Audio Augmentation**: Apply audio augmentation to the training set |
|
|
4. **Extract Features**: Extract MuQ and MusicFM features for both training and test sets |
|
|
5. **Generate PKL Files**: Generate `train_set.pkl` and `test_set.pkl` files for training and evaluation |
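
The generated pkl files can be inspected directly before training. A minimal sketch, assuming they are standard Python pickles holding the per-song feature/label records consumed by the training scripts (adapt the field access to what the pipeline actually stores):

```python
import pickle

# Peek at the prepared training set to sanity-check the pipeline output.
with open("data_pipeline/dataset_pkl/train_set.pkl", "rb") as f:
    train_set = pickle.load(f)

print(f"Loaded {len(train_set)} training records")
```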
|
|
|
|
|
|
|
|
### Step 2: Model Training |
|
|
|
|
|
After data preparation, you can train the HEAR model for either Track 1 (single-label: Musicality) or Track 2 (multi-label: 5 dimensions). |
|
|
|
|
|
#### Track 1: Single-Label Training (Musicality) |
|
|
|
|
|
Train the model for musicality prediction: |
|
|
|
|
|
```bash |
|
|
python train_track_1.py \ |
|
|
--experiment_name track1_exp \ |
|
|
--train-data /path/to/train_set.pkl \ |
|
|
--test-data /path/to/test_set.pkl \ |
|
|
--max-epoch 60 \ |
|
|
--batch-size 8 \ |
|
|
--lr 1e-5 \ |
|
|
--weight_decay 1e-3 \ |
|
|
--accum_steps 4 \ |
|
|
--lambda 0.15 \ |
|
|
--workers 8 \ |
|
|
--seed 0 |
|
|
``` |
|
|
|
|
|
#### Track 2: Multi-Label Training (5 Dimensions) |
|
|
|
|
|
Train the model for multi-dimensional aesthetic evaluation: |
|
|
|
|
|
```bash |
|
|
python train_track_2.py \ |
|
|
--experiment_name track2_exp \ |
|
|
--train-data /path/to/train_set.pkl \ |
|
|
--test-data /path/to/test_set.pkl \ |
|
|
--max-epoch 60 \ |
|
|
--batch-size 8 \ |
|
|
--lr 1e-5 \ |
|
|
--weight_decay 1e-3 \ |
|
|
--accum_steps 4 \ |
|
|
--lambda 0.05 \ |
|
|
--workers 8 \ |
|
|
--seed 0 |
|
|
``` |
|
|
|
|
|
#### Key Parameters |
|
|
|
|
|
* `--max-epoch`: Maximum number of training epochs (default: 60) |
|
|
* `--batch-size`: Batch size for training (default: 8) |
|
|
* `--experiment_name`: Name of the experiment for saving models and logs |
|
|
* `--lr`: Learning rate (default: 1e-5) |
|
|
* `--weight_decay`: Weight decay for optimizer (default: 1e-3) |
|
|
* `--accum_steps`: Gradient accumulation steps (default: 4) |
|
|
* `--lambda`: Weight for the ranking loss term (Track 1: 0.15, Track 2: 0.05); a sketch of how this weight enters the loss follows this list
|
|
* `--workers`: Number of data loading workers (default: 8) |
|
|
* `--seed`: Random seed for reproducibility (default: 0) |
|
|
* `--train-data`: Path to training data pkl file (default: `data_pipeline/dataset_pkl/train_set.pkl`) |
|
|
* `--test-data`: Path to test data pkl file (default: `data_pipeline/dataset_pkl/test_set.pkl`) |
|
|
* `--log-dir`: Path to tensorboard log directory (default: `./log/tensorboard_records/{experiment_name}`) |
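
`--lambda` balances the score-regression objective against a ranking objective. A minimal illustrative sketch of how such a combined loss is typically formed; the repository's actual loss functions (built on allRank) may differ:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.15) -> torch.Tensor:
    """Illustrative regression + pairwise ranking loss.

    `lam` plays the role of --lambda (0.15 for Track 1, 0.05 for Track 2).
    """
    # Regression term: how close predicted scores are to the labels.
    reg = F.mse_loss(pred, target)
    # Pairwise ranking term: penalize pairs whose predicted order
    # disagrees with the label order within the batch.
    diff_pred = pred.unsqueeze(0) - pred.unsqueeze(1)
    diff_true = target.unsqueeze(0) - target.unsqueeze(1)
    rank = F.relu(-diff_pred * torch.sign(diff_true)).mean()
    return reg + lam * rank
```

The regression term drives exact score prediction, while the ranking term supports the top-tier song identification capability mentioned above.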
|
|
|
|
|
#### Evaluation Mode |
|
|
|
|
|
To evaluate a trained model, use the `--eval` flag: |
|
|
|
|
|
```bash |
|
|
python train_track_1.py --eval --experiment_name track1_exp |
|
|
python train_track_2.py --eval --experiment_name track2_exp |
|
|
``` |
|
|
|
|
|
#### Model Configuration |
|
|
|
|
|
Model architectures are configured in: |
|
|
* `config_track_1.yaml` - Configuration for Track 1 |
|
|
* `config_track_2.yaml` - Configuration for Track 2 |
|
|
|
|
|
Trained models are saved to `log/models/{experiment_name}/model.pth`, and training logs are written to TensorBoard under `./log/tensorboard_records/{experiment_name}/` (or the custom path given by `--log-dir`).
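
A saved checkpoint can then be passed to `inference.py` via `--model_path` and `--model_config_path`. If you need to load the weights manually, a minimal sketch, assuming `model.pth` stores a plain `state_dict` (check the training script for the actual save format):

```python
import torch

# Load the saved weights on CPU (the state_dict layout is an
# assumption; the training script defines what is serialized).
checkpoint = torch.load("log/models/track1_exp/model.pth", map_location="cpu")
print(f"Checkpoint contains {len(checkpoint)} entries")
```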
|
|
|
|
|
|
|
|
## 🙏 Acknowledgement
|
|
|
|
|
We sincerely thank the authors and contributors of the following open-source projects:
|
|
|
|
|
* **[SongEval](https://github.com/ASLP-lab/SongEval)** |
|
|
* **[SongFormer](https://github.com/ASLP-lab/SongFormer)** |
|
|
* **[Audiomentations](https://github.com/iver56/audiomentations)** |
|
|
* **[Wespeaker](https://github.com/wenet-e2e/wespeaker)** |
|
|
* **[allRank](https://github.com/allegro/allRank)** |
|
|
|
|
|
We would like to express our special thanks to **Shizhe Chen** from the **Shanghai Conservatory of Music** for his invaluable guidance and insights on music aesthetics.
|
|
|
|
|
## 📖 Citation
|
|
```bibtex |
|
|
@misc{liu2025hearhierarchicallyenhancedaesthetic, |
|
|
title={Hear: Hierarchically Enhanced Aesthetic Representations For Multidimensional Music Evaluation}, |
|
|
author={Shuyang Liu and Yuan Jin and Rui Lin and Shizhe Chen and Junyu Dai and Tao Jiang}, |
|
|
year={2025}, |
|
|
eprint={2511.18869}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.SD}, |
|
|
url={https://arxiv.org/abs/2511.18869}, |
|
|
} |
|
|
``` |
|
|
|
|
|
|