---
license: apache-2.0
pipeline_tag: audio-classification
tags:
- music
- song
- aesthetics
- ASAE
---



# **HEAR**: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation
[**Paper**](https://arxiv.org/pdf/2511.18869) |
[**Model**](https://huggingface.co/earlab/EAR_HEAR)
<br>

Official PyTorch implementation of the ICASSP 2026 paper "HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation".

This repository contains the training and evaluation code for HEAR, a robust framework designed to address the challenges of multidimensional music aesthetic evaluation under limited labeled data.
![Overview of the HEAR framework](HEAR.png)
## 🌟 Key Features
* **Excellent Performance**: Ranked 2nd/19 on Track 1 and 5th/17 on Track 2 in the [ICASSP 2026 Automatic Song Aesthetics Evaluation Challenge](https://aslp-lab.github.io/Automatic-Song-Aesthetics-Evaluation-Challenge/).
* **Robustness**: Combines multi-source, multi-scale representations with hierarchical augmentation to learn robust features from limited labeled data.
* **Dual Capability**: Optimized for both exact score prediction and ranking (Top-Tier Song Identification).

## πŸ“¦ Installation
Clone the repository and install dependencies:
```bash
git clone https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.git
cd EAR_HEAR
git submodule update --init --recursive

conda create -n hear python=3.10 -y
conda activate hear
pip install -r requirements.txt
```

## πŸš€ Quick Start
```bash
# Download pretrained model weights
export HF_ENDPOINT=https://hf-mirror.com  # optional mirror for users in Mainland China
hf download earlab/EAR_HEAR --local-dir pretrained_models

# Track 1: Single-Label Inference (Musicality)
python inference.py \
    --input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \
    --output_json_path output.json \
    --model_path pretrained_models/track_1.pth \
    --model_config_path config_track_1.yaml

# Track 2: Multi-Label Inference (5 Dimensions)
python inference.py \
    --input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \
    --output_json_path output.json \
    --model_path pretrained_models/track_2.pth \
    --model_config_path config_track_2.yaml
```
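
To score a folder of songs, you can drive `inference.py` from Python. The sketch below is illustrative only: the output directory name is a placeholder, and the schema of each output JSON depends on the track (one dimension for Track 1, five for Track 2).

```python
# Minimal batch-inference sketch; directory names are placeholders.
import json
import subprocess
from pathlib import Path

AUDIO_DIR = Path("data_pipeline/origin_song_eval_dataset/mp3")
OUT_DIR = Path("outputs")
OUT_DIR.mkdir(exist_ok=True)

scores = {}
for audio in sorted(AUDIO_DIR.glob("*.mp3")):
    out_json = OUT_DIR / f"{audio.stem}.json"
    subprocess.run(
        [
            "python", "inference.py",
            "--input_audio_path", str(audio),
            "--output_json_path", str(out_json),
            "--model_path", "pretrained_models/track_2.pth",
            "--model_config_path", "config_track_2.yaml",
        ],
        check=True,
    )
    with open(out_json) as f:
        scores[audio.stem] = json.load(f)  # per-track schema, see above

print(f"Scored {len(scores)} files")
```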

## 🎯 Training

### Step 1: Data Preparation

First, prepare the dataset by running the data pipeline:

```bash
cd data_pipeline
bash run.sh
```

This script will:
1. **Download Dataset**: Download the [SongEval](https://huggingface.co/datasets/ASLP-lab/SongEval) dataset
2. **Split Dataset**: Split the dataset into training and validation sets based on [the challenge's validation IDs](https://github.com/ASLP-lab/Automatic-Song-Aesthetics-Evaluation-Challenge/blob/main/static/val_ids.txt)
3. **Audio Augmentation**: Apply audio augmentation to the training set (a rough sketch follows this list)
4. **Extract Features**: Extract MuQ and MusicFM features for both training and test sets
5. **Generate PKL Files**: Generate `train_set.pkl` and `test_set.pkl` files for training and evaluation
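
The actual augmentation chain for step 3 is implemented inside `data_pipeline`; as a rough illustration only, here is a minimal waveform-level chain using [Audiomentations](https://github.com/iver56/audiomentations), which this repo acknowledges below. Every parameter value here is a placeholder, not the pipeline's real setting.

```python
# Illustrative waveform augmentation; NOT the pipeline's actual chain or settings.
import librosa
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

# Load one clip (path from Quick Start) and apply the random chain.
samples, sr = librosa.load("data_pipeline/origin_song_eval_dataset/mp3/0.mp3",
                           sr=None, mono=True)
augmented = augment(samples=samples, sample_rate=int(sr))
```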


### Step 2: Model Training

After data preparation, you can train the HEAR model for either Track 1 (single-label: Musicality) or Track 2 (multi-label: 5 dimensions).

#### Track 1: Single-Label Training (Musicality)

Train the model for musicality prediction:

```bash
python train_track_1.py \
    --experiment_name track1_exp \
    --train-data /path/to/train_set.pkl \
    --test-data /path/to/test_set.pkl \
    --max-epoch 60 \
    --batch-size 8 \
    --lr 1e-5 \
    --weight_decay 1e-3 \
    --accum_steps 4 \
    --lambda 0.15 \
    --workers 8 \
    --seed 0
```

#### Track 2: Multi-Label Training (5 Dimensions)

Train the model for multi-dimensional aesthetic evaluation:

```bash
python train_track_2.py \
    --experiment_name track2_exp \
    --train-data /path/to/train_set.pkl \
    --test-data /path/to/test_set.pkl \
    --max-epoch 60 \
    --batch-size 8 \
    --lr 1e-5 \
    --weight_decay 1e-3 \
    --accum_steps 4 \
    --lambda 0.05 \
    --workers 8 \
    --seed 0
```

#### Key Parameters

* `--max-epoch`: Maximum number of training epochs (default: 60)
* `--batch-size`: Batch size for training (default: 8)
* `--experiment_name`: Name of the experiment for saving models and logs
* `--lr`: Learning rate (default: 1e-5)
* `--weight_decay`: Weight decay for optimizer (default: 1e-3)
* `--accum_steps`: Gradient accumulation steps (default: 4)
* `--lambda`: Weight for the ranking loss term (Track 1: 0.15, Track 2: 0.05); see the sketch after this list
* `--workers`: Number of data loading workers (default: 8)
* `--seed`: Random seed for reproducibility (default: 0)
* `--train-data`: Path to training data pkl file (default: `data_pipeline/dataset_pkl/train_set.pkl`)
* `--test-data`: Path to test data pkl file (default: `data_pipeline/dataset_pkl/test_set.pkl`)
* `--log-dir`: Path to tensorboard log directory (default: `./log/tensorboard_records/{experiment_name}`)
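
How `--lambda` enters the objective is defined in the training scripts (the ranking component draws on allRank, acknowledged below). Purely as a hedged illustration of weighting a pairwise ranking term against a regression term, here is a hypothetical `combined_loss`; the margin value and loss shape are not taken from this repo.

```python
# Hypothetical sketch of lam (--lambda) trading off regression vs. ranking.
import torch
import torch.nn.functional as F

def combined_loss(pred: torch.Tensor, target: torch.Tensor,
                  lam: float = 0.15) -> torch.Tensor:
    """pred/target: 1-D score tensors for one batch (per dimension on Track 2)."""
    mse = F.mse_loss(pred, target)
    i, j = torch.triu_indices(len(pred), len(pred), offset=1)  # all batch pairs
    diff = target[i] - target[j]
    mask = diff != 0  # only rank pairs whose labels actually differ
    if mask.any():
        rank = F.margin_ranking_loss(pred[i][mask], pred[j][mask],
                                     torch.sign(diff[mask]), margin=0.1)
    else:
        rank = pred.new_zeros(())
    return mse + lam * rank
```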

#### Evaluation Mode

To evaluate a trained model, use the `--eval` flag:

```bash
python train_track_1.py --eval --experiment_name track1_exp
python train_track_2.py --eval --experiment_name track2_exp
```

#### Model Configuration

Model architectures are configured in:
* `config_track_1.yaml` - Configuration for Track 1
* `config_track_2.yaml` - Configuration for Track 2

Trained models are saved in `log/models/{experiment_name}/model.pth`, and training logs are saved to TensorBoard in `./log/tensorboard_records/{experiment_name}/` (or custom path specified by `--log-dir`).
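
To sanity-check a saved checkpoint, you can load it on CPU and list a few parameter shapes. This assumes `model.pth` is a standard torch-serialized file; whether it stores a raw `state_dict` or a wrapper dict depends on the training script.

```python
# Quick checkpoint inspection (path assumes --experiment_name track1_exp).
import torch

ckpt = torch.load("log/models/track1_exp/model.pth", map_location="cpu")
# Unwrap a possible {"state_dict": ...} wrapper; otherwise use the dict as-is.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
for name, tensor in list(state.items())[:10]:
    print(name, tuple(tensor.shape))
```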


## πŸ™ Acknowledgement

We sincerely thank the authors and contributors of the following open-source projects:

* **[SongEval](https://github.com/ASLP-lab/SongEval)**
* **[SongFormer](https://github.com/ASLP-lab/SongFormer)**
* **[Audiomentations](https://github.com/iver56/audiomentations)**
* **[Wespeaker](https://github.com/wenet-e2e/wespeaker)**
* **[allRank](https://github.com/allegro/allRank)**

We would like to express our special thanks to **Shizhe Chen** from **Shanghai Conservatory of Music** for his invaluable guidance and insights on music aesthetics.

## πŸ“š Citation
```bibtex
@misc{liu2025hearhierarchicallyenhancedaesthetic,
      title={HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation}, 
      author={Shuyang Liu and Yuan Jin and Rui Lin and Shizhe Chen and Junyu Dai and Tao Jiang},
      year={2025},
      eprint={2511.18869},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2511.18869}, 
}
```