Spaces:
Sleeping
Sleeping
| # Descript Audio Codec - VAE Variant (.dac-vae): High-Fidelity Audio Compression with Variational Autoencoder | |
| This repository contains training and inference scripts for the Descript Audio Codec VAE variant (.dac-vae), a modified version of the [original DAC](https://github.com/descriptinc/descript-audio-codec) that replaces the RVQGAN architecture with a Variational Autoencoder while maintaining the same high-quality audio compression capabilities. | |
| ## Overview | |
| Building on the foundation of the [original Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec), **DAC-VAE** adapts the architecture to use Variational Autoencoder principles instead of Residual Vector Quantization (RVQ). | |
| ### Key Differences from Original DAC | |
| 👉 **DAC-VAE** compresses **24 kHz audio** (instead of 44.1 kHz) using a continuous latent representation through VAE architecture | |
| ### 🔄 Architecture Changes: | |
| - Replaces the RVQGAN's discrete codebook with VAE's continuous latent space | |
| - Maintains the same encoder-decoder backbone architecture from the original DAC | |
| - Swaps vector quantization layers for VAE reparameterization trick | |
| - Preserves the multi-scale discriminator design for adversarial training | |
| ### 🎯 Inherited Features from Original DAC: | |
| - High-fidelity neural audio compression | |
| - Universal model for all audio domains (speech, environment, music, etc.) | |
| - Efficient encoding and decoding | |
| - State-of-the-art reconstruction quality | |
| ## Why VAE Instead of RVQGAN? | |
| This fork explores an alternative approach to the original DAC's discrete coding strategy: | |
| | Component | Original DAC (RVQGAN) | DAC-VAE (This Repo) | | |
| |-----------|----------------------|---------------------| | |
| | Latent Space | Discrete (VQ codes) | Continuous (Gaussian) | | |
| | Sampling Rate | 44.1 kHz | 24 kHz | | |
| | Quantization | Residual VQ with codebooks | VAE reparameterization | | |
| | Training Objective | Reconstruction + VQ + Adversarial | Reconstruction + KL + Adversarial | | |
| | Compression | Fixed bitrate (8 kbps) | Variable (KL-controlled) | | |
| ## Installation | |
| ```bash | |
| # Clone this repository | |
| git clone https://github.com/primepake/dac-vae.git | |
| cd dac-vae | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| ``` | |
| ## Usage | |
| ### Inference | |
| ```bash | |
| python3 inference.py \ | |
| --checkpoint checkpoint.pt \ | |
| --config configs/configx2.yml \ | |
| --mode encode_decode \ | |
| --input test.wav \ | |
| --output reconstruction.wav | |
| ``` | |
| ### Training | |
| ```bash | |
| # Single GPU training | |
| python3 train.py --run_id factorx2 | |
| # Multi-GPU training (4 GPUs) | |
| torchrun --nnodes=1 --nproc_per_node=4 train.py --run_id factorx2 | |
| ``` | |
| ## Model Architecture | |
| DAC-VAE preserves most of the original DAC architecture with key modifications: | |
| - **Encoder**: Same convolutional architecture as original DAC | |
| - **Latent Layer**: VAE reparameterization (replaces VQ-VAE quantization) | |
| - **Decoder**: Identical transposed convolution architecture | |
| - **Discriminator**: Same multi-scale discriminator for perceptual quality | |
| ### Configuration | |
| The model can be configured through YAML files in the `configs/` directory: | |
| - `configx2.yml`: Default 24kHz configuration with 2x downsampling factor | |
| - Adjust latent dimensions, KL weight, and other hyperparameters as needed | |
| ## Training Details | |
| ### Dataset Preparation | |
| Prepare your audio dataset with the following structure: | |
| ``` | |
| dataset/ | |
| ├── train/ | |
| │ ├── audio1.wav | |
| │ ├── audio2.wav | |
| │ └── ... | |
| └── val/ | |
| ├── audio1.wav | |
| ├── audio2.wav | |
| └── ... | |
| ``` | |
| ### Training Command | |
| ```bash | |
| torchrun --nnodes=1 --nproc_per_node=4 train.py \ | |
| --run_id my_experiment \ | |
| --config configs/configx2.yml | |
| ``` | |
| ## Evaluation | |
| Evaluate model performance using: | |
| ```bash | |
| python3 evaluate.py \ | |
| --checkpoint checkpoint.pt \ | |
| --test_dir /path/to/test/audio | |
| ``` | |
| ## Pretrained Models | |
| | Model | Sample Rate | Config | Download | | |
| |-------|-------------|---------|----------| | |
| | dac_vae_24khz_v1 | 24 kHz | config.yml | [64 dim 3x frames](https://github.com/primepake/dac_vae/releases/tag/64dim-3xframe_rate) | | |
| | dac_vae_24khz_v1 | 24 kHz | configx2.yml | [80 dim 2x frames](https://github.com/primepake/dac_vae/releases/tag/80dim-2xframe_rate) | | |
| ## Citation | |
| If you use DAC-VAE, please cite both this work and the original DAC paper: | |
| ```bibtex | |
| @misc{dacvae2024, | |
| title={DAC-VAE: Variational Autoencoder Adaptation of Descript Audio Codec}, | |
| author={primepake}, | |
| year={2024}, | |
| url={https://github.com/primepake/dac-vae} | |
| } | |
| @misc{kumar2023high, | |
| title={High-Fidelity Audio Compression with Improved RVQGAN}, | |
| author={Kumar, Rithesh and Seetharaman, Prem and Luebs, Alejandro and Kumar, Ishaan and Kumar, Kundan}, | |
| journal={arXiv preprint arXiv:2306.06546}, | |
| year={2023} | |
| } | |
| ``` | |
| ## License | |
| This project maintains the same license as the original Descript Audio Codec. See [LICENSE](https://github.com/descriptinc/descript-audio-codec/blob/main/LICENSE) file for details. | |
| ## Acknowledgments | |
| This work is built directly on top of the excellent [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec) by the Descript team. We thank them for open-sourcing their high-quality implementation, which made this VAE exploration possible. | |
| ## Related Links | |
| - [Original DAC Repository](https://github.com/descriptinc/descript-audio-codec) | |
| - [Original DAC Paper](https://arxiv.org/abs/2306.06546) | |
| - [Descript Audio Codec Demo](https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a18f30bfd) | |
| ## Contributing | |
| Contributions are welcome! Please feel free to submit a Pull Request. | |
| ## Contact | |
| For questions and feedback, please open an issue in this repository. |