---
license: apache-2.0
pipeline_tag: audio-classification
tags:
- music
- song
- aesthetics
- ASAE
---

# **HEAR**: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation

[**Paper**](https://arxiv.org/pdf/2511.18869) | [**Model**](https://huggingface.co/earlab/EAR_HEAR)
Official PyTorch implementation of the ICASSP 2026 paper "HEAR: Hierarchically Enhanced Aesthetic Representations for Multidimensional Music Evaluation".

This repository contains the training and evaluation code for HEAR, a robust framework designed to address the challenges of multidimensional music aesthetic evaluation under limited labeled data.

![](HEAR.png)

## 🌟 Key Features

* **Strong Performance**: Ranked 2nd/19 on Track 1 and 5th/17 on Track 2 in the [ICASSP 2026 Automatic Song Aesthetics Evaluation Challenge](https://aslp-lab.github.io/Automatic-Song-Aesthetics-Evaluation-Challenge/).
* **Robustness**: Synergizes multi-source, multi-scale representations with hierarchical augmentation to learn robust features from limited labeled data.
* **Dual Capability**: Optimized for both exact score prediction and ranking (top-tier song identification).

## 📦 Installation

Clone the repository and install dependencies:

```bash
git clone https://github.com/Eps-Acoustic-Revolution-Lab/EAR_HEAR.git
cd EAR_HEAR
git submodule update --init --recursive
conda create -n hear python=3.10 -y
conda activate hear
pip install -r requirements.txt
```

## 🚀 Quick Start

```bash
# Download pretrained model weights
export HF_ENDPOINT=https://hf-mirror.com  # Only needed for HuggingFace downloads from Mainland China
hf download earlab/EAR_HEAR --local-dir pretrained_models

# Track 1: Single-Label Inference (Musicality)
python inference.py \
    --input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \
    --output_json_path output.json \
    --model_path pretrained_models/track_1.pth \
    --model_config_path config_track_1.yaml

# Track 2: Multi-Label Inference (5 Dimensions)
python inference.py \
    --input_audio_path data_pipeline/origin_song_eval_dataset/mp3/0.mp3 \
    --output_json_path output.json \
    --model_path pretrained_models/track_2.pth \
    --model_config_path config_track_2.yaml
```

## 🎯 Training

### Step 1: Data Preparation

First, prepare the dataset by running the data pipeline:

```bash
cd data_pipeline
bash run.sh
```

This script will:

1. **Download Dataset**: Download the [SongEval](https://huggingface.co/datasets/ASLP-lab/SongEval) dataset
2. **Split Dataset**: Split the dataset into training and validation sets based on [the challenge's validation IDs](https://github.com/ASLP-lab/Automatic-Song-Aesthetics-Evaluation-Challenge/blob/main/static/val_ids.txt)
3. **Audio Augmentation**: Apply audio augmentation to the training set
4. **Extract Features**: Extract MuQ and MusicFM features for both the training and test sets
5. **Generate PKL Files**: Generate `train_set.pkl` and `test_set.pkl` for training and evaluation

### Step 2: Model Training

After data preparation, you can train the HEAR model for either Track 1 (single label: Musicality) or Track 2 (multi-label: 5 dimensions).

#### Track 1: Single-Label Training (Musicality)

Train the model for musicality prediction:

```bash
python train_track_1.py \
    --experiment_name track1_exp \
    --train-data /path/to/train_set.pkl \
    --test-data /path/to/test_set.pkl \
    --max-epoch 60 \
    --batch-size 8 \
    --lr 1e-5 \
    --weight_decay 1e-3 \
    --accum_steps 4 \
    --lambda 0.15 \
    --workers 8 \
    --seed 0
```

#### Track 2: Multi-Label Training (5 Dimensions)

Train the model for multi-dimensional aesthetic evaluation:

```bash
python train_track_2.py \
    --experiment_name track2_exp \
    --train-data /path/to/train_set.pkl \
    --test-data /path/to/test_set.pkl \
    --max-epoch 60 \
    --batch-size 8 \
    --lr 1e-5 \
    --weight_decay 1e-3 \
    --accum_steps 4 \
    --lambda 0.05 \
    --workers 8 \
    --seed 0
```

#### Key Parameters

* `--max-epoch`: Maximum number of training epochs (default: 60)
* `--batch-size`: Batch size for training (default: 8)
* `--experiment_name`: Name of the experiment, used for saving models and logs
* `--lr`: Learning rate (default: 1e-5)
* `--weight_decay`: Weight decay for the optimizer (default: 1e-3)
* `--accum_steps`: Gradient accumulation steps (default: 4)
* `--lambda`: Weight of the ranking loss (Track 1: 0.15, Track 2: 0.05)
* `--workers`: Number of data loading workers (default: 8)
* `--seed`: Random seed for reproducibility (default: 0)
* `--train-data`: Path to the training data pkl file (default: `data_pipeline/dataset_pkl/train_set.pkl`)
* `--test-data`: Path to the test data pkl file (default: `data_pipeline/dataset_pkl/test_set.pkl`)
* `--log-dir`: Path to the TensorBoard log directory (default: `./log/tensorboard_records/{experiment_name}`)

#### Evaluation Mode

To evaluate a trained model, use the `--eval` flag:

```bash
python train_track_1.py --eval --experiment_name track1_exp
python train_track_2.py --eval --experiment_name track2_exp
```

#### Model Configuration

Model architectures are configured in:

* `config_track_1.yaml` - Configuration for Track 1
* `config_track_2.yaml` - Configuration for Track 2

Trained models are saved to `log/models/{experiment_name}/model.pth`, and training logs are written to TensorBoard in `./log/tensorboard_records/{experiment_name}/` (or the custom path specified by `--log-dir`).

## 🙏 Acknowledgement

We sincerely thank the authors and contributors of the following open-source projects:

* **[SongEval](https://github.com/ASLP-lab/SongEval)**
* **[SongFormer](https://github.com/ASLP-lab/SongFormer)**
* **[Audiomentations](https://github.com/iver56/audiomentations)**
* **[Wespeaker](https://github.com/wenet-e2e/wespeaker)**
* **[allRank](https://github.com/allegro/allRank)**

We would also like to express our special thanks to **Shizhe Chen** from the **Shanghai Conservatory of Music** for his invaluable guidance and insights on music aesthetics.

## 📚 Citation

```bibtex
@misc{liu2025hearhierarchicallyenhancedaesthetic,
      title={Hear: Hierarchically Enhanced Aesthetic Representations For Multidimensional Music Evaluation},
      author={Shuyang Liu and Yuan Jin and Rui Lin and Shizhe Chen and Junyu Dai and Tao Jiang},
      year={2025},
      eprint={2511.18869},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2511.18869},
}
```
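## 💡 Note on the `--lambda` Parameter

The training scripts combine score regression with a ranking objective, weighted by `--lambda` (0.15 for Track 1, 0.05 for Track 2), which is what gives the model its dual score-prediction and ranking capability. As a rough, illustrative sketch only (hypothetical function names and a hinge-style pairwise loss; the actual loss is defined in `train_track_1.py` / `train_track_2.py`), such a combination might look like:

```python
# Illustrative sketch, NOT the repository's actual loss code:
# a regression loss plus a pairwise ranking loss weighted by --lambda.

def mse_loss(preds, targets):
    # Mean squared error between predicted and ground-truth scores.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def pairwise_ranking_loss(preds, targets, margin=0.1):
    # Hinge penalty on every pair whose predicted order violates
    # (or insufficiently separates) the ground-truth order.
    total, count = 0.0, 0
    for i in range(len(preds)):
        for j in range(len(preds)):
            if targets[i] > targets[j]:
                total += max(0.0, margin - (preds[i] - preds[j]))
                count += 1
    return total / max(count, 1)

def combined_loss(preds, targets, lam=0.15):
    # lam plays the role of --lambda (Track 1 default: 0.15).
    return mse_loss(preds, targets) + lam * pairwise_ranking_loss(preds, targets)
```

A larger `--lambda` pushes the model toward preserving relative orderings (useful for top-tier song identification), while a smaller value keeps the focus on exact score prediction.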