|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- laion/LAION-DISCO-12M |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
pipeline_tag: audio-to-audio |
|
|
tags: |
|
|
- music |
|
|
- vae |
|
|
- perceptual weighting |
|
|
- phase |
|
|
--- |
|
|
# Ξ΅ar-VAE: High Fidelity Music Reconstruction Model |
|
|
[[Demo Page](https://eps-acoustic-revolution-lab.github.io/EAR_VAE/)] - [[Codes](https://github.com/Eps-Acoustic-Revolution-Lab/EAR_VAE)] - [[Paper](http://arxiv.org/abs/2509.14912)] |
|
|
|
|
|
This repository contains the official inference code for Ξ΅ar-VAE, aa 44.1 kHz music signal reconstruction model that rethinks and optimizes VAE training for audio. It targets two common weaknesses in existing open-source VAEsβphase accuracy and stereophonic spatial representationβby aligning objectives with auditory perception and introducing phase-aware training. Experiments show substantial improvements across diverse metrics, with particular strength in high-frequency harmonics and spatial characteristics. |
|
|
|
|
|
> β2025-12-10 Updateβ: a new model weight works in 48kHz sample rate, same-level vocal performance with better stereophonic energy reconstruction. |
|
|
|
|
|
Why Ξ΅ar-VAE: |
|
|
- π§ Perceptual alignment: A K-weighting perceptual filter is applied before loss computation to better match human hearing. |
|
|
- π Phase-aware objectives: Two novel phase losses |
|
|
- Stereo Correlation Loss for robust inter-channel coherence. |
|
|
- Phase-Derivative Loss using Instantaneous Frequency and Group Delay for phase precision. |
|
|
- π Spectral supervision paradigm: Magnitude supervised across MSLR (Mid/Side/Left/Right) components, while phase is supervised only by LR (Left/Right), improving stability and fidelity. |
|
|
- π 44.1 kHz performance: Outperforms leading open-source models across diverse metrics. |
|
|
|
|
|
## 1. Installation |
|
|
|
|
|
Follow these steps to set up the environment and install the necessary dependencies. |
|
|
|
|
|
### Installation Steps |
|
|
|
|
|
1. **Clone the repository:** |
|
|
```bash |
|
|
git clone <your-repo-url> |
|
|
cd ear_vae |
|
|
``` |
|
|
|
|
|
2. **Create and activate a conda environment:** |
|
|
```bash |
|
|
conda create -n ear_vae python=3.8 |
|
|
conda activate ear_vae |
|
|
``` |
|
|
|
|
|
3. **Run the installation script:** |
|
|
|
|
|
This script will install the remaining dependencies. |
|
|
```bash |
|
|
bash install_requirements.sh |
|
|
``` |
|
|
This will install: |
|
|
- `descript-audio-codec` |
|
|
- `alias-free-torch` |
|
|
- `ffmpeg < 7` (via conda) |
|
|
|
|
|
4. **Download the model weight:** |
|
|
|
|
|
You could download the model checkpoint from **[Hugging Face](https://huggingface.co/earlab/EAR_VAE)** |
|
|
## 2. Usage |
|
|
|
|
|
The `inference.py` script is used to process audio files from an input directory and save the reconstructed audio to an output directory. |
|
|
|
|
|
### Running Inference |
|
|
|
|
|
You can run the inference with the following command: |
|
|
|
|
|
```bash |
|
|
python inference.py --indir <input_directory> --outdir <output_directory> --model_path <path_to_model> --device <device> |
|
|
``` |
|
|
|
|
|
### Command-Line Arguments |
|
|
|
|
|
- `--indir`: (Optional) Path to the input directory containing audio files. Default: `./data`. |
|
|
- `--outdir`: (Optional) Path to the output directory where reconstructed audio will be saved. Default: `./results`. |
|
|
- `--model_path`: (Optional) Path to the pretrained model weights (`.pyt` file). Default: `./pretrained_weight/ear_vae_44k.pyt`. |
|
|
- `--device`: (Optional) The device to run the model on (e.g., `cuda:0` or `cpu`). Defaults to `cuda:0` if available, otherwise `cpu`. |
|
|
|
|
|
### Example |
|
|
|
|
|
1. Place your input audio files (e.g., `.wav`, `.mp3`) into the `data/` directory. |
|
|
2. Run the inference script: |
|
|
|
|
|
```bash |
|
|
python inference.py |
|
|
``` |
|
|
This will use the default paths. The reconstructed audio files will be saved in the `results/` directory. |
|
|
|
|
|
## 3. Project Structure |
|
|
|
|
|
``` |
|
|
. |
|
|
βββ README.md # This file |
|
|
βββ config/ # For model configurations |
|
|
β βββ model_config.json |
|
|
βββ data/ # Default directory for input audio files |
|
|
βββ eval/ # Scripts for model evaluation |
|
|
β βββ eval_compare_matrix.py |
|
|
β βββ install_requirements.sh |
|
|
β βββ README.md |
|
|
βββ inference.py # Main script for running audio reconstruction |
|
|
βββ install_requirements.sh # Installation script for dependencies |
|
|
βββ model/ # Contains the model architecture code |
|
|
β βββ sa2vae.py |
|
|
β βββ transformer.py |
|
|
β βββ vaegan.py |
|
|
βββ pretrained_weight/ # Directory for pretrained model weights |
|
|
β βββ your_weight_here |
|
|
``` |
|
|
|
|
|
## 4. Model Details |
|
|
|
|
|
The model is a Variational Autoencoder with a Generative Adversarial Network (VAE-GAN) structure. |
|
|
- **Encoder**: An Oobleck-style encoder that downsamples the input audio into a latent representation. |
|
|
- **Bottleneck**: A VAE bottleneck that introduces a probabilistic latent space, sampling from a learned mean and variance. |
|
|
- **Decoder**: An Oobleck-style decoder that upsamples the latent representation back into an audio waveform. |
|
|
- **Transformer**: A Continuous Transformer can optionally be placed in the bottleneck to further process the latent sequence. |
|
|
|
|
|
This architecture allows for efficient and high-quality audio reconstruction. |
|
|
|
|
|
## 5. Evaluation |
|
|
|
|
|
The `eval/` directory contains scripts to evaluate the model's reconstruction performance using objective metrics. |
|
|
|
|
|
### Evaluation Prerequisites |
|
|
|
|
|
1. **Install Dependencies**: The evaluation script has its own set of dependencies. Install them by running the script in the `eval` directory: |
|
|
```bash |
|
|
bash eval/install_requirements.sh |
|
|
``` |
|
|
This will install libraries such as `auraloss`. |
|
|
|
|
|
2. **FFmpeg**: The script uses `ffmpeg` for loudness analysis. Make sure `ffmpeg` is installed and available in your system's PATH. You can install it via conda: |
|
|
```bash |
|
|
conda install -c conda-forge 'ffmpeg<7' |
|
|
``` |
|
|
|
|
|
### Running Evaluation |
|
|
|
|
|
The `eval_compare_matrix.py` script compares the reconstructed audio with the original ground truth files and computes various metrics. |
|
|
|
|
|
For more details on the evaluation metrics and options, refer to the `eval/README.md` file. |
|
|
|
|
|
## 6. Acknowledgements |
|
|
|
|
|
This project builds upon the work of several open-source projects. We would like to extend our special thanks to: |
|
|
|
|
|
- **[Stability AI's Stable Audio Tools](https://github.com/Stability-AI/stable-audio-tools)**: For providing a foundational framework and tools for audio generation. |
|
|
- **[Descript's Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: For the weight-normed convolusional layers |
|
|
|
|
|
Their contributions have been invaluable to the development of Ξ΅ar-VAE. |