# Descript Audio Codec - VAE Variant (.dac-vae): High-Fidelity Audio Compression with Variational Autoencoder

This repository contains training and inference scripts for the Descript Audio Codec VAE variant (.dac-vae), a modified version of the [original DAC](https://github.com/descriptinc/descript-audio-codec) that replaces the RVQGAN architecture with a Variational Autoencoder while maintaining the same high-quality audio compression capabilities.

## Overview

Building on the foundation of the [original Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec), **DAC-VAE** adapts the architecture to use Variational Autoencoder principles instead of Residual Vector Quantization (RVQ).

### Key Differences from Original DAC

👉 **DAC-VAE** compresses **24 kHz audio** (instead of 44.1 kHz) into a continuous latent representation produced by a VAE.

### 🔄 Architecture Changes:

- Replaces RVQGAN's discrete codebooks with a continuous VAE latent space
- Maintains the same encoder-decoder backbone as the original DAC
- Swaps the vector quantization layers for the VAE reparameterization trick
- Preserves the multi-scale discriminator design for adversarial training

### 🎯 Inherited Features from Original DAC:

- High-fidelity neural audio compression
- Universal model for all audio domains (speech, environment, music, etc.)
- Efficient encoding and decoding
- State-of-the-art reconstruction quality

## Why VAE Instead of RVQGAN?
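The heart of the swap is the reparameterization trick: instead of snapping encoder outputs to codebook entries, the encoder predicts a mean and log-variance per frame, and a latent is sampled from that Gaussian in a way that keeps gradients flowing. The following is an illustrative NumPy sketch, not code from this repository; the tensor shapes and the 64-dim latent (chosen to match the released 64-dim checkpoint) are assumptions for the example:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness is isolated in eps, so gradients can flow through
    mu and log_var during training (the reparameterization trick).
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over batch.

    This term replaces the codebook/commitment losses of the original
    RVQGAN and controls how much information the continuous latent carries.
    """
    per_dim = 0.5 * (np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return float(np.mean(np.sum(per_dim, axis=-1)))

# Toy example: batch of 2 latent frames, 64 dims each (shapes are assumptions).
rng = np.random.default_rng(0)
mu = 0.1 * rng.standard_normal((2, 64))
log_var = np.full((2, 64), -2.0)

z = reparameterize(mu, log_var, rng)
print(z.shape)                    # (2, 64)
print(kl_divergence(mu, log_var))  # non-negative scalar
```

In the full training objective the KL term is weighted (the "KL weight" exposed in the configs), trading reconstruction fidelity against how compressible the latent is.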
This fork explores an alternative to the original DAC's discrete coding strategy:

| Component | Original DAC (RVQGAN) | DAC-VAE (This Repo) |
|-----------|----------------------|---------------------|
| Latent Space | Discrete (VQ codes) | Continuous (Gaussian) |
| Sampling Rate | 44.1 kHz | 24 kHz |
| Quantization | Residual VQ with codebooks | VAE reparameterization |
| Training Objective | Reconstruction + VQ + Adversarial | Reconstruction + KL + Adversarial |
| Compression | Fixed bitrate (8 kbps) | Variable (KL-controlled) |

## Installation

```bash
# Clone this repository
git clone https://github.com/primepake/dac-vae.git
cd dac-vae

# Install dependencies
pip install -r requirements.txt
```

## Usage

### Inference

```bash
python3 inference.py \
    --checkpoint checkpoint.pt \
    --config configs/configx2.yml \
    --mode encode_decode \
    --input test.wav \
    --output reconstruction.wav
```

### Training

```bash
# Single GPU training
python3 train.py --run_id factorx2

# Multi-GPU training (4 GPUs)
torchrun --nnodes=1 --nproc_per_node=4 train.py --run_id factorx2
```

## Model Architecture

DAC-VAE preserves most of the original DAC architecture, with key modifications:

- **Encoder**: Same convolutional architecture as the original DAC
- **Latent Layer**: VAE reparameterization (replaces VQ-VAE quantization)
- **Decoder**: Identical transposed-convolution architecture
- **Discriminator**: Same multi-scale discriminator for perceptual quality

### Configuration

The model can be configured through YAML files in the `configs/` directory:

- `configx2.yml`: Default 24 kHz configuration with a 2x downsampling factor
- Adjust latent dimensions, KL weight, and other hyperparameters as needed

## Training Details

### Dataset Preparation

Prepare your audio dataset with the following structure:

```
dataset/
├── train/
│   ├── audio1.wav
│   ├── audio2.wav
│   └── ...
└── val/
    ├── audio1.wav
    ├── audio2.wav
    └── ...
```

### Training Command

```bash
torchrun --nnodes=1 --nproc_per_node=4 train.py \
    --run_id my_experiment \
    --config configs/configx2.yml
```

## Evaluation

Evaluate model performance using:

```bash
python3 evaluate.py \
    --checkpoint checkpoint.pt \
    --test_dir /path/to/test/audio
```

## Pretrained Models

| Model | Sample Rate | Config | Download |
|-------|-------------|--------|----------|
| dac_vae_24khz_v1 | 24 kHz | config.yml | [64 dim 3x frames](https://github.com/primepake/dac_vae/releases/tag/64dim-3xframe_rate) |
| dac_vae_24khz_v1 | 24 kHz | configx2.yml | [80 dim 2x frames](https://github.com/primepake/dac_vae/releases/tag/80dim-2xframe_rate) |

## Citation

If you use DAC-VAE, please cite both this work and the original DAC paper:

```bibtex
@misc{dacvae2024,
  title={DAC-VAE: Variational Autoencoder Adaptation of Descript Audio Codec},
  author={primepake},
  year={2024},
  url={https://github.com/primepake/dac-vae}
}

@misc{kumar2023high,
  title={High-Fidelity Audio Compression with Improved RVQGAN},
  author={Kumar, Rithesh and Seetharaman, Prem and Luebs, Alejandro and Kumar, Ishaan and Kumar, Kundan},
  journal={arXiv preprint arXiv:2306.06546},
  year={2023}
}
```

## License

This project maintains the same license as the original Descript Audio Codec. See the [LICENSE](https://github.com/descriptinc/descript-audio-codec/blob/main/LICENSE) file for details.

## Acknowledgments

This work builds directly on the excellent [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec) by the Descript team. We thank them for open-sourcing their high-quality implementation, which made this VAE exploration possible.

## Related Links

- [Original DAC Repository](https://github.com/descriptinc/descript-audio-codec)
- [Original DAC Paper](https://arxiv.org/abs/2306.06546)
- [Descript Audio Codec Demo](https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a18f30bfd)

## Contributing

Contributions are welcome!
Please feel free to submit a Pull Request.

## Contact

For questions and feedback, please open an issue in this repository.