---
license: apache-2.0
datasets:
- laion/LAION-DISCO-12M
language:
- en
- zh
pipeline_tag: audio-to-audio
tags:
- music
- vae
- perceptual weighting
- phase
---
# εar-VAE: High Fidelity Music Reconstruction Model
[[Demo Page](https://eps-acoustic-revolution-lab.github.io/EAR_VAE/)] - [[Codes](https://github.com/Eps-Acoustic-Revolution-Lab/EAR_VAE)] - [[Paper](http://arxiv.org/abs/2509.14912)]

This repository contains the official inference code for εar-VAE, a 44.1 kHz music signal reconstruction model that rethinks and optimizes VAE training for audio. It targets two common weaknesses in existing open-source VAEs, phase accuracy and stereophonic spatial representation, by aligning objectives with auditory perception and introducing phase-aware training. Experiments show substantial improvements across diverse metrics, with particular strength in high-frequency harmonics and spatial characteristics.

> ⭐2025-12-10 Update⭐: A new model weight that operates at a 48 kHz sample rate is available. It offers the same level of vocal performance with better stereophonic energy reconstruction.

Why εar-VAE:
- 🎧 Perceptual alignment: A K-weighting perceptual filter is applied before loss computation to better match human hearing.
- 🔁 Phase-aware objectives: Two novel phase losses
  - Stereo Correlation Loss for robust inter-channel coherence.
  - Phase-Derivative Loss using Instantaneous Frequency and Group Delay for phase precision.
- 🌈 Spectral supervision paradigm: Magnitude is supervised across MSLR (Mid/Side/Left/Right) components, while phase is supervised only by LR (Left/Right), improving stability and fidelity.
- 📈 44.1 kHz performance: Outperforms leading open-source models across diverse metrics.
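
To make the Mid/Side terminology and the notion of inter-channel coherence concrete, here is a minimal pure-Python sketch. It assumes the common convention M = (L + R) / 2 and S = (L - R) / 2, and it uses a plain Pearson correlation as a stand-in for coherence. The function names `mid_side` and `channel_correlation` are hypothetical, and this is not the loss used in training; see the paper for the actual definitions.

```python
import math

def mid_side(left, right):
    """Split a stereo pair into Mid and Side signals.

    Assumes M = (L + R) / 2 and S = (L - R) / 2; the scaling used in the
    actual training code may differ.
    """
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side

def channel_correlation(left, right):
    """Pearson correlation between two channels, in [-1, 1].

    Illustrative stand-in for inter-channel coherence, not the paper's loss.
    """
    n = len(left)
    mean_l = sum(left) / n
    mean_r = sum(right) / n
    cov = sum((l - mean_l) * (r - mean_r) for l, r in zip(left, right))
    var_l = sum((l - mean_l) ** 2 for l in left)
    var_r = sum((r - mean_r) ** 2 for r in right)
    denom = math.sqrt(var_l * var_r)
    # A constant channel has zero variance; report 0.0 instead of dividing by zero.
    return cov / denom if denom > 0 else 0.0

# Toy stereo signal: the right channel is an attenuated copy of the left.
left = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
right = [0.5 * x for x in left]
mid, side = mid_side(left, right)
print(round(channel_correlation(left, right), 3))
```

With this convention the original channels are recovered as L = M + S and R = M - S, which is why magnitude supervision on all four components constrains both the shared and the spatial content.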

## 1. Installation

Follow these steps to set up the environment and install the necessary dependencies.

### Installation Steps

1.  **Clone the repository:**
    ```bash
    git clone <your-repo-url>
    cd ear_vae
    ```

2.  **Create and activate a conda environment:**
    ```bash
    conda create -n ear_vae python=3.8
    conda activate ear_vae
    ```

3.  **Run the installation script:**
    
    This script will install the remaining dependencies.
    ```bash
    bash install_requirements.sh
    ```
    This will install:
    - `descript-audio-codec`
    - `alias-free-torch`
    - `ffmpeg < 7` (via conda)
    
4.  **Download the model weight:**

    Download the model checkpoint from **[Hugging Face](https://huggingface.co/earlab/EAR_VAE)** and place it in `pretrained_weight/`, the default location expected by `inference.py`.

## 2. Usage

The `inference.py` script is used to process audio files from an input directory and save the reconstructed audio to an output directory.

### Running Inference

You can run the inference with the following command:

```bash
python inference.py --indir <input_directory> --outdir <output_directory> --model_path <path_to_model> --device <device>
```

### Command-Line Arguments

-   `--indir`: (Optional) Path to the input directory containing audio files. Default: `./data`.
-   `--outdir`: (Optional) Path to the output directory where reconstructed audio will be saved. Default: `./results`.
-   `--model_path`: (Optional) Path to the pretrained model weights (`.pyt` file). Default: `./pretrained_weight/ear_vae_44k.pyt`.
-   `--device`: (Optional) The device to run the model on (e.g., `cuda:0` or `cpu`). Defaults to `cuda:0` if available, otherwise `cpu`.
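
The flags and defaults listed above could be declared with `argparse` roughly as follows. This is a hypothetical sketch that mirrors the documented behavior, not the contents of `inference.py`; the helper names `default_device` and `build_parser` are assumptions, and the device check falls back to `cpu` when PyTorch is not importable.

```python
import argparse

def default_device():
    """Return "cuda:0" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so this sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"

def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Reconstruct audio files with a pretrained VAE")
    parser.add_argument("--indir", default="./data",
                        help="directory containing input audio files")
    parser.add_argument("--outdir", default="./results",
                        help="directory where reconstructed audio is written")
    parser.add_argument("--model_path", default="./pretrained_weight/ear_vae_44k.pyt",
                        help="path to the pretrained model weights")
    parser.add_argument("--device", default=default_device(),
                        help="device to run on, e.g. cuda:0 or cpu")
    return parser

# Passing an empty list parses no flags, so every default applies.
args = build_parser().parse_args([])
print(args.indir, args.outdir, args.model_path, args.device)
```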

### Example

1.  Place your input audio files (e.g., `.wav`, `.mp3`) into the `data/` directory.
2.  Run the inference script:

    ```bash
    python inference.py
    ```
    This will use the default paths. The reconstructed audio files will be saved in the `results/` directory.

## 3. Project Structure

```
.
├── README.md               # This file
├── config/                 # For model configurations
│   └── model_config.json
├── data/                   # Default directory for input audio files
├── eval/                   # Scripts for model evaluation
│   ├── eval_compare_matrix.py
│   ├── install_requirements.sh
│   └── README.md
├── inference.py            # Main script for running audio reconstruction
├── install_requirements.sh # Installation script for dependencies
├── model/                  # Contains the model architecture code
│   ├── sa2vae.py
│   ├── transformer.py
│   └── vaegan.py
├── pretrained_weight/      # Directory for pretrained model weights
│   └── your_weight_here
```

## 4. Model Details

The model is a Variational Autoencoder with a Generative Adversarial Network (VAE-GAN) structure.
-   **Encoder**: An Oobleck-style encoder that downsamples the input audio into a latent representation.
-   **Bottleneck**: A VAE bottleneck that introduces a probabilistic latent space, sampling from a learned mean and variance.
-   **Decoder**: An Oobleck-style decoder that upsamples the latent representation back into an audio waveform.
-   **Transformer**: A Continuous Transformer can optionally be placed in the bottleneck to further process the latent sequence.

This architecture allows for efficient and high-quality audio reconstruction.
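
The bottleneck's sampling step can be illustrated with the standard reparameterization trick. The sketch below is pure Python and assumes the encoder outputs a mean and a log-variance per latent dimension; the actual parameterization in `model/` may differ, and the function names are hypothetical.

```python
import math
import random

def sample_latent(mean, log_var, rng=random):
    """Reparameterization trick: z = mean + std * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, 1)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mean, log_var))

# Toy encoder outputs for a 3-dimensional latent (illustrative values only).
rng = random.Random(0)
mean = [0.0, 1.0, -0.5]
log_var = [0.0, -2.0, 0.5]
z = sample_latent(mean, log_var, rng)
print(z, kl_to_standard_normal(mean, log_var))
```

Sampling this way keeps the random draw separate from the learned parameters, so gradients can flow through `mean` and `log_var` during training. At inference time a deterministic variant that simply returns `mean` is also common.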

## 5. Evaluation

The `eval/` directory contains scripts to evaluate the model's reconstruction performance using objective metrics.

### Evaluation Prerequisites

1.  **Install Dependencies**: The evaluation script has its own set of dependencies. Install them by running the script in the `eval` directory:
    ```bash
    bash eval/install_requirements.sh
    ```
    This will install libraries such as `auraloss`.

2.  **FFmpeg**: The script uses `ffmpeg` for loudness analysis. Make sure `ffmpeg` is installed and available in your system's PATH. You can install it via conda:
    ```bash
    conda install -c conda-forge 'ffmpeg<7'
    ```
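
Before running the evaluation, it can help to confirm that `ffmpeg` is actually reachable from Python. The following is a small sketch using only the standard library; the helper name `ffmpeg_version_line` is hypothetical and is not part of the evaluation scripts.

```python
import shutil
import subprocess

def ffmpeg_version_line():
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is not on PATH."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    result = subprocess.run([exe, "-version"], capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None

line = ffmpeg_version_line()
print(line if line else "ffmpeg not found on PATH")
```

The reported version can then be checked by eye against the `ffmpeg<7` constraint used in the installation step.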

### Running Evaluation

The `eval_compare_matrix.py` script compares the reconstructed audio with the original ground truth files and computes various metrics.

For more details on the evaluation metrics and options, refer to the `eval/README.md` file.

## 6. Acknowledgements

This project builds upon the work of several open-source projects. We would like to extend our special thanks to:

-   **[Stability AI's Stable Audio Tools](https://github.com/Stability-AI/stable-audio-tools)**: For providing a foundational framework and tools for audio generation.
-   **[Descript's Audio Codec](https://github.com/descriptinc/descript-audio-codec)**: For the weight-normed convolutional layers.

Their contributions have been invaluable to the development of εar-VAE.