# Descript Audio Codec - VAE Variant (.dac-vae): High-Fidelity Audio Compression with Variational Autoencoder
This repository contains training and inference scripts for the Descript Audio Codec VAE variant (.dac-vae), a modified version of the [original DAC](https://github.com/descriptinc/descript-audio-codec) that replaces the RVQGAN architecture with a Variational Autoencoder while maintaining the same high-quality audio compression capabilities.
## Overview
Building on the foundation of the [original Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec), **DAC-VAE** adapts the architecture to use Variational Autoencoder principles instead of Residual Vector Quantization (RVQ).
### Key Differences from Original DAC
👉 **DAC-VAE** compresses **24 kHz audio** (instead of 44.1 kHz) into a continuous latent representation produced by a VAE
### 🔄 Architecture Changes:
- Replaces the RVQGAN's discrete codebook with VAE's continuous latent space
- Maintains the same encoder-decoder backbone architecture from the original DAC
- Swaps vector quantization layers for the VAE reparameterization trick
- Preserves the multi-scale discriminator design for adversarial training
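
The swap from vector quantization to reparameterization can be illustrated with a minimal sketch. It assumes the encoder emits a mean and a log-variance per latent dimension, which is the usual VAE convention; the function name `reparameterize` and the use of plain Python lists are illustrative and are not taken from this repository's code.

```python
import math
import random


def reparameterize(mean, logvar, rng=random):
    """Sample z = mean + sigma * eps with eps ~ N(0, 1), element-wise.

    Assumption: the encoder outputs a mean and a log-variance for each
    latent dimension (hypothetical names, not this repo's API).
    """
    if len(mean) != len(logvar):
        raise ValueError("mean and logvar must have the same length")
    z = []
    for m, lv in zip(mean, logvar):
        sigma = math.exp(0.5 * lv)   # standard deviation from log-variance
        eps = rng.gauss(0.0, 1.0)    # noise drawn independently of the encoder
        z.append(m + sigma * eps)    # randomness enters only through eps
    return z
```

Because the randomness is isolated in `eps`, a deep-learning framework can backpropagate through `mean` and `logvar`, which is not directly possible through a discrete codebook lookup.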
### 🎯 Inherited Features from Original DAC:
- High-fidelity neural audio compression
- Universal model for all audio domains (speech, environment, music, etc.)
- Efficient encoding and decoding
- State-of-the-art reconstruction quality
## Why VAE Instead of RVQGAN?
This fork explores an alternative approach to the original DAC's discrete coding strategy:
| Component | Original DAC (RVQGAN) | DAC-VAE (This Repo) |
|-----------|----------------------|---------------------|
| Latent Space | Discrete (VQ codes) | Continuous (Gaussian) |
| Sampling Rate | 44.1 kHz | 24 kHz |
| Quantization | Residual VQ with codebooks | VAE reparameterization |
| Training Objective | Reconstruction + VQ + Adversarial | Reconstruction + KL + Adversarial |
| Compression | Fixed bitrate (8 kbps) | Variable (KL-controlled) |
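
The KL term in the table has a closed form when the posterior is a diagonal Gaussian and the prior is a standard normal. The sketch below assumes that setup; `kl_weight`, its default value, and the way the three terms are summed are hypothetical and may differ from the loss actually used in `train.py`.

```python
import math


def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, logvar)
    )


def generator_loss(recon_loss, kl, adv_loss, kl_weight=1e-2):
    """Combine the three terms from the table; the weighting is illustrative only."""
    return recon_loss + kl_weight * kl + adv_loss
```

In a setup like this, a larger `kl_weight` pulls the latents toward the prior and reduces how much information they carry, which is one way to read "KL-controlled" compression.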
## Installation
```bash
# Clone this repository
git clone https://github.com/primepake/dac-vae.git
cd dac-vae
# Install dependencies
pip install -r requirements.txt
```
## Usage
### Inference
```bash
python3 inference.py \
--checkpoint checkpoint.pt \
--config configs/configx2.yml \
--mode encode_decode \
--input test.wav \
--output reconstruction.wav
```
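
To reconstruct many files, the command above can be wrapped in a short script. This is a sketch that reuses only the flags shown in that command; the helper names `build_inference_cmd` and `reconstruct_directory` are hypothetical and not part of this repository.

```python
import subprocess
import sys
from pathlib import Path


def build_inference_cmd(input_wav, output_wav,
                        checkpoint="checkpoint.pt",
                        config="configs/configx2.yml"):
    """Return the argument list for one encode_decode run of inference.py."""
    return [
        sys.executable, "inference.py",
        "--checkpoint", str(checkpoint),
        "--config", str(config),
        "--mode", "encode_decode",
        "--input", str(input_wav),
        "--output", str(output_wav),
    ]


def reconstruct_directory(input_dir, output_dir):
    """Run inference.py once per .wav file found directly in input_dir."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(input_dir).glob("*.wav")):
        cmd = build_inference_cmd(wav, output_dir / wav.name)
        subprocess.run(cmd, check=True)  # raises if inference.py exits non-zero
```

Calling `reconstruct_directory("inputs", "reconstructions")` from the repository root would process each file in turn. Each call reloads the checkpoint, so this favors simplicity over speed.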
### Training
```bash
# Single GPU training
python3 train.py --run_id factorx2
# Multi-GPU training (4 GPUs)
torchrun --nnodes=1 --nproc_per_node=4 train.py --run_id factorx2
```
## Model Architecture
DAC-VAE preserves most of the original DAC architecture with key modifications:
- **Encoder**: Same convolutional architecture as original DAC
- **Latent Layer**: VAE reparameterization (replaces residual vector quantization)
- **Decoder**: Identical transposed convolution architecture
- **Discriminator**: Same multi-scale discriminator for perceptual quality
### Configuration
The model can be configured through YAML files in the `configs/` directory:
- `configx2.yml`: Default 24 kHz configuration with 2x downsampling factor
- Adjust latent dimensions, KL weight, and other hyperparameters as needed
## Training Details
### Dataset Preparation
Prepare your audio dataset with the following structure:
```
dataset/
├── train/
│   ├── audio1.wav
│   ├── audio2.wav
│   └── ...
└── val/
    ├── audio1.wav
    ├── audio2.wav
    └── ...
```
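
If the audio starts out as a flat folder of `.wav` files, a short script can produce the layout above. The sketch below is one possible approach and is not part of this repository; the function name `split_dataset`, the 5% validation fraction, and the choice to copy rather than move files are all assumptions.

```python
import random
import shutil
from pathlib import Path


def split_dataset(source_dir, dataset_dir, val_fraction=0.05, seed=0):
    """Copy .wav files from source_dir into dataset_dir/train and dataset_dir/val."""
    files = sorted(Path(source_dir).glob("*.wav"))
    random.Random(seed).shuffle(files)  # deterministic shuffle for a repeatable split
    # Keep at least one validation file whenever any audio is present.
    n_val = max(1, int(len(files) * val_fraction)) if files else 0
    splits = {"val": files[:n_val], "train": files[n_val:]}
    for name, members in splits.items():
        target = Path(dataset_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        for wav in members:
            shutil.copy2(wav, target / wav.name)
    return {name: len(members) for name, members in splits.items()}
```

The function returns the number of files placed in each split, which makes it easy to sanity-check the result before starting a training run.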
### Training Command
```bash
torchrun --nnodes=1 --nproc_per_node=4 train.py \
--run_id my_experiment \
--config configs/configx2.yml
```
## Evaluation
Evaluate model performance using:
```bash
python3 evaluate.py \
--checkpoint checkpoint.pt \
--test_dir /path/to/test/audio
```
## Pretrained Models
| Model | Sample Rate | Config | Download |
|-------|-------------|---------|----------|
| dac_vae_24khz_v1 | 24 kHz | config.yml | [64 dim 3x frames](https://github.com/primepake/dac_vae/releases/tag/64dim-3xframe_rate) |
| dac_vae_24khz_v1 | 24 kHz | configx2.yml | [80 dim 2x frames](https://github.com/primepake/dac_vae/releases/tag/80dim-2xframe_rate) |
## Citation
If you use DAC-VAE, please cite both this work and the original DAC paper:
```bibtex
@misc{dacvae2024,
title={DAC-VAE: Variational Autoencoder Adaptation of Descript Audio Codec},
author={primepake},
year={2024},
url={https://github.com/primepake/dac-vae}
}
@article{kumar2023high,
title={High-Fidelity Audio Compression with Improved RVQGAN},
author={Kumar, Rithesh and Seetharaman, Prem and Luebs, Alejandro and Kumar, Ishaan and Kumar, Kundan},
journal={arXiv preprint arXiv:2306.06546},
year={2023}
}
```
## License
This project uses the same license as the original Descript Audio Codec. See the [LICENSE](https://github.com/descriptinc/descript-audio-codec/blob/main/LICENSE) file for details.
## Acknowledgments
This work is built directly on top of the excellent [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec) by the Descript team. We thank them for open-sourcing their high-quality implementation, which made this VAE exploration possible.
## Related Links
- [Original DAC Repository](https://github.com/descriptinc/descript-audio-codec)
- [Original DAC Paper](https://arxiv.org/abs/2306.06546)
- [Descript Audio Codec Demo](https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a18f30bfd)
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Contact
For questions and feedback, please open an issue in this repository.