---
license: apache-2.0
datasets:
- laion/LAION-DISCO-12M
language:
- en
- zh
pipeline_tag: audio-to-audio
tags:
- music
- vae
- perceptual weighting
- phase
---
# εar-VAE: High Fidelity Music Reconstruction Model
[[Demo Page](https://eps-acoustic-revolution-lab.github.io/EAR_VAE/)] - [[Codes](https://github.com/Eps-Acoustic-Revolution-Lab/EAR_VAE)] - [[Paper](http://arxiv.org/abs/2509.14912)]

This repository contains the official inference code for εar-VAE, a 44.1 kHz music signal reconstruction model that rethinks and optimizes VAE training for audio. It targets two common weaknesses in existing open-source VAEs, phase accuracy and stereophonic spatial representation, by aligning objectives with auditory perception and introducing phase-aware training. Experiments show substantial improvements across diverse metrics, with particular strength in high-frequency harmonics and spatial characteristics.

> ⭐2025-12-10 Update⭐: A new model weight that operates at a 48 kHz sample rate is available. It delivers comparable vocal performance with better stereophonic energy reconstruction.

Why εar-VAE:
- 🎧 Perceptual alignment: A K-weighting perceptual filter is applied before loss computation to better match human hearing.
- 🔁 Phase-aware objectives: Two novel phase losses:
  - Stereo Correlation Loss for robust inter-channel coherence.
  - Phase-Derivative Loss using Instantaneous Frequency and Group Delay for phase precision.
- 🌈 Spectral supervision paradigm: Magnitude is supervised across MSLR (Mid/Side/Left/Right) components, while phase is supervised only on LR (Left/Right), improving stability and fidelity.
- 📈 44.1 kHz performance: Outperforms leading open-source models across diverse metrics.
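
To make the terms above concrete, here is a minimal NumPy sketch of the quantities these objectives are built on: the Mid/Side decomposition, an inter-channel correlation measure, and the two phase derivatives (instantaneous frequency along time, group delay along frequency). It is an illustration only, not the loss implementation from the paper. The function names, the `(L ± R) / 2` scaling, the Pearson-style correlation, and the STFT settings are all assumptions made for this example.

```python
import numpy as np

def mid_side(left, right):
    """Mid/Side decomposition (assumed convention: M=(L+R)/2, S=(L-R)/2)."""
    return 0.5 * (left + right), 0.5 * (left - right)

def stereo_correlation(left, right, eps=1e-8):
    """Pearson-style correlation between the two channels, in [-1, 1]."""
    l = left - left.mean()
    r = right - right.mean()
    return float(np.sum(l * r) / (np.sqrt(np.sum(l**2) * np.sum(r**2)) + eps))

def stft_phase(x, n_fft=1024, hop=256):
    """Phase of a Hann-windowed STFT, shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.angle(np.fft.rfft(frames, axis=-1))

def wrap(phase):
    """Wrap phase differences to the principal interval [-pi, pi)."""
    return (phase + np.pi) % (2 * np.pi) - np.pi

def phase_derivatives(phase):
    """Finite-difference instantaneous frequency and group delay."""
    inst_freq = wrap(np.diff(phase, axis=0))     # derivative along time
    group_delay = -wrap(np.diff(phase, axis=1))  # negative derivative along frequency
    return inst_freq, group_delay

# Toy stereo signal: the right channel is a quieter copy of the left.
sr = 44100
t = np.arange(sr) / sr
left = np.sin(2 * np.pi * 440.0 * t)
right = 0.5 * left

mid, side = mid_side(left, right)
corr = stereo_correlation(left, right)
inst_freq, group_delay = phase_derivatives(stft_phase(left))
print(f"correlation={corr:.3f}, IF shape={inst_freq.shape}, GD shape={group_delay.shape}")
```

A training loss would compare these quantities between the reference and the reconstruction; how εar-VAE weights and combines them is described in the paper.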

## 1. Installation

Follow these steps to set up the environment and install the necessary dependencies.

### Installation Steps

1.  **Clone the repository:**
    ```bash
    git clone https://github.com/Eps-Acoustic-Revolution-Lab/EAR_VAE
    cd EAR_VAE
    ```

2.  **Create and activate a conda environment:**
    ```bash
    conda create -n ear_vae python=3.8
    conda activate ear_vae
    ```

3.  **Run the installation script:**

    ```bash
    bash install_requirements.sh
    ```
    This installs the remaining dependencies:
    - `descript-audio-codec`
    - `alias-free-torch`
    - `ffmpeg < 7` (via conda)
    
4.  **Download the model weight:**

    You can download the model checkpoint from **[Hugging Face](https://huggingface.co/earlab/EAR_VAE)**.
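
If you prefer to fetch the checkpoint from a script, the sketch below uses `hf_hub_download` from the `huggingface_hub` package (installed separately with `pip install huggingface_hub`). The repository ID comes from the link above. The file name is an assumption taken from the default `--model_path` documented in the Usage section, so check the file list on the Hub before relying on it. `download_weight` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

REPO_ID = "earlab/EAR_VAE"             # from the Hugging Face link above
FILENAME = "ear_vae_44k.pyt"           # assumption: matches the default --model_path
TARGET_DIR = Path("pretrained_weight")  # default weight directory of this project

def download_weight(repo_id=REPO_ID, filename=FILENAME, target_dir=TARGET_DIR):
    """Download one checkpoint file into target_dir and return its local path."""
    # Imported lazily so the module loads even if huggingface_hub is missing.
    from huggingface_hub import hf_hub_download

    target_dir.mkdir(parents=True, exist_ok=True)
    local_path = hf_hub_download(repo_id=repo_id, filename=filename,
                                 local_dir=target_dir)
    return Path(local_path)
```

Calling `download_weight()` from the repository root should place the file where `inference.py` looks for it by default, provided the file name assumption holds.
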
## 2. Usage

The `inference.py` script is used to process audio files from an input directory and save the reconstructed audio to an output directory.

### Running Inference

You can run the inference with the following command:

```bash
python inference.py --indir <input_directory> --outdir <output_directory> --model_path <path_to_model> --device <device>
```

### Command-Line Arguments

-   `--indir`: (Optional) Path to the input directory containing audio files. Default: `./data`.
-   `--outdir`: (Optional) Path to the output directory where reconstructed audio will be saved. Default: `./results`.
-   `--model_path`: (Optional) Path to the pretrained model weights (`.pyt` file). Default: `./pretrained_weight/ear_vae_44k.pyt`.
-   `--device`: (Optional) The device to run the model on (e.g., `cuda:0` or `cpu`). Defaults to `cuda:0` if available, otherwise `cpu`.
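
For reference, the documented arguments and defaults correspond to an `argparse` setup roughly like the one below. This is a sketch reconstructed from the list above, not the actual contents of `inference.py`; `build_parser` and `default_device` are hypothetical names.

```python
import argparse

def default_device():
    """Return 'cuda:0' when PyTorch reports a usable GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:  # lets the sketch run without PyTorch installed
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"

def build_parser():
    parser = argparse.ArgumentParser(description="Reconstruct audio files with a VAE.")
    parser.add_argument("--indir", default="./data",
                        help="directory containing the input audio files")
    parser.add_argument("--outdir", default="./results",
                        help="directory for the reconstructed audio")
    parser.add_argument("--model_path",
                        default="./pretrained_weight/ear_vae_44k.pyt",
                        help="path to the pretrained weights (.pyt file)")
    parser.add_argument("--device", default=default_device(),
                        help="device to run on, e.g. cuda:0 or cpu")
    return parser

# Override only the input directory; every other option keeps its default.
args = build_parser().parse_args(["--indir", "./my_audio"])
print(args.indir, args.outdir, args.model_path, args.device)
```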

### Example

1.  Place your input audio files (e.g., `.wav`, `.mp3`) into the `data/` directory.
2.  Run the inference script:

    ```bash
    python inference.py
    ```
    This will use the default paths. The reconstructed audio files will be saved in the `results/` directory.
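
When driving the script from Python, for example to process several folders in a row, a small wrapper along these lines can help. It only uses the flags documented above. `find_audio_files` and `build_command` are hypothetical helpers, and the extension set is an assumption limited to the two formats named in step 1.

```python
import sys
from pathlib import Path

AUDIO_EXTENSIONS = {".wav", ".mp3"}  # assumption: only the formats named above

def find_audio_files(indir):
    """List audio files directly inside indir, sorted by name."""
    return sorted(p for p in Path(indir).iterdir()
                  if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS)

def build_command(indir="./data", outdir="./results", model_path=None, device=None):
    """Build the inference command; omitted options fall back to script defaults."""
    cmd = [sys.executable, "inference.py",
           "--indir", str(indir), "--outdir", str(outdir)]
    if model_path is not None:
        cmd += ["--model_path", str(model_path)]
    if device is not None:
        cmd += ["--device", device]
    return cmd

print(" ".join(build_command(device="cpu")))
```

From the repository root, `subprocess.run(build_command(device="cpu"), check=True)` would then launch the run, and `find_audio_files("./data")` gives a quick check that the input directory is not empty.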

## 3. Project Structure

```
.
├── README.md               # This file
├── config/                 # For model configurations
│   └── model_config.json
├── data/                   # Default directory for input audio files
├── eval/                   # Scripts for model evaluation
│   ├── eval_compare_matrix.py
│   ├── install_requirements.sh
│   └── README.md
├── inference.py            # Main script for running audio reconstruction
├── install_requirements.sh # Installation script for dependencies
├── model/                  # Contains the model architecture code
│   ├── sa2vae.py
│   ├── transformer.py
│   └── vaegan.py
├── pretrained_weight/      # Directory for pretrained model weights
│   └── your_weight_here
```

## 4. Model Details

The model is a Variational Autoencoder with a Generative Adversarial Network (VAE-GAN) structure.
-   **Encoder**: An Oobleck-style encoder that downsamples the input audio into a latent representation.
-   **Bottleneck**: A VAE bottleneck that introduces a probabilistic latent space, sampling from a learned mean and variance.
-   **Decoder**: An Oobleck-style decoder that upsamples the latent representation back into an audio waveform.
-   **Transformer**: A Continuous Transformer can optionally be placed in the bottleneck to further process the latent sequence.

This architecture allows for efficient and high-quality audio reconstruction.
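
The bottleneck step can be illustrated with the standard reparameterization trick: the encoder predicts a mean and a variance per latent element, and a sample is drawn as `mean + std * noise`. The NumPy sketch below assumes a log-variance parameterization and a mean-reduced KL term, which are textbook choices; the actual model code may parameterize the variance differently. `vae_bottleneck` is a hypothetical name.

```python
import numpy as np

def vae_bottleneck(mean, log_var, rng):
    """Sample z ~ N(mean, exp(log_var)) and return it with the KL term.

    The KL divergence is taken against a standard normal prior and
    averaged over all latent elements.
    """
    std = np.exp(0.5 * log_var)
    z = mean + std * rng.standard_normal(mean.shape)
    kl = 0.5 * np.mean(mean**2 + np.exp(log_var) - 1.0 - log_var)
    return z, float(kl)

rng = np.random.default_rng(0)
# Toy latent with shape (channels, frames); the sizes are illustrative only.
mean = rng.standard_normal((64, 100))
log_var = np.full((64, 100), -2.0)

z, kl = vae_bottleneck(mean, log_var, rng)
print(z.shape, kl)
```

At inference time a reconstruction model can also pass the mean through directly instead of sampling; which of the two this project does is not stated here.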

## 5. Evaluation

The `eval/` directory contains scripts to evaluate the model's reconstruction performance using objective metrics.

### Evaluation Prerequisites

1.  **Install Dependencies**: The evaluation script has its own set of dependencies. Install them by running the script in the `eval` directory:
    ```bash
    bash eval/install_requirements.sh
    ```
    This will install libraries such as `auraloss`.

2.  **FFmpeg**: The script uses `ffmpeg` for loudness analysis. Make sure `ffmpeg` is installed and available in your system's PATH. You can install it via conda:
    ```bash
    conda install -c conda-forge 'ffmpeg<7'
    ```
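
Before running the evaluation, you can confirm from Python that `ffmpeg` is reachable on the PATH and see which major version is installed. This is a minimal pre-flight sketch with hypothetical helper names. The version parsing assumes the usual `ffmpeg version X.Y...` banner and returns `None` for builds that print something else, such as nightly git builds.

```python
import re
import shutil
import subprocess

def parse_major_version(version_output):
    """Extract the major version from `ffmpeg -version` output, or None."""
    match = re.search(r"ffmpeg version n?(\d+)\.", version_output)
    return int(match.group(1)) if match else None

def check_ffmpeg():
    """Return (path, major_version); path is None if ffmpeg is not on the PATH."""
    path = shutil.which("ffmpeg")
    if path is None:
        return None, None
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    return path, parse_major_version(result.stdout)

path, major = check_ffmpeg()
if path is None:
    print("ffmpeg not found on PATH")
elif major is not None and major >= 7:
    print(f"found ffmpeg {major} at {path}; this project installs ffmpeg<7")
else:
    print(f"found ffmpeg at {path} (major version: {major})")
```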

### Running Evaluation

The `eval_compare_matrix.py` script compares the reconstructed audio with the original ground truth files and computes various metrics.

For more details on the evaluation metrics and options, refer to the `eval/README.md` file.

## 6. Acknowledgements

This project builds upon the work of several open-source projects. We would like to extend our special thanks to:

-   **[Stability AI's Stable Audio Tools](https://github.com/Stability-AI/stable-audio-tools)**: For providing a foundational framework and tools for audio generation.
-   **[Descript's Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: For the weight-normed convolutional layers.

Their contributions have been invaluable to the development of εar-VAE.