	# Descript Audio Codec - VAE Variant (DAC-VAE): High-Fidelity Audio Compression with a Variational Autoencoder

	This repository contains training and inference scripts for the Descript Audio Codec VAE variant (DAC-VAE), a modified version of the [original DAC](https://github.com/descriptinc/descript-audio-codec) that replaces the residual vector quantization of the RVQGAN architecture with a Variational Autoencoder bottleneck while aiming to preserve the original's high-fidelity audio compression.

	## Overview

	Building on the foundation of the [original Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec), DAC-VAE adapts the architecture to use Variational Autoencoder principles instead of Residual Vector Quantization (RVQ).

	### Key Differences from Original DAC

	👉 DAC-VAE compresses 24 kHz audio (instead of 44.1 kHz) into a continuous latent representation using a VAE bottleneck

	### 🔄 Architecture Changes:

	- Replaces the RVQGAN's discrete codebook with VAE's continuous latent space
	- Maintains the same encoder-decoder backbone architecture from the original DAC
	- Swaps the vector quantization layers for the VAE reparameterization trick
	- Preserves the multi-scale discriminator design for adversarial training
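
As a rough illustration of the reparameterization trick named above, here is a minimal pure-Python sketch. It assumes the encoder outputs a mean and a log-variance per latent dimension, which is the usual VAE setup but is not confirmed by this README. The function name `reparameterize` is hypothetical and is not taken from this repository's code.

```python
import math
import random


def reparameterize(mu, logvar, rng=random):
    """Sample z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    mu and logvar are plain lists of floats here; a real model would
    use tensors. Both names are assumptions, not this repo's API.
    """
    z = []
    for m, lv in zip(mu, logvar):
        sigma = math.exp(0.5 * lv)   # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)    # noise drawn independently of mu and logvar
        z.append(m + sigma * eps)    # randomness is isolated in eps
    return z
```

Because the noise enters only through `eps`, the sample stays a differentiable function of `mu` and `logvar`, which is what lets a continuous latent replace the discrete codebook lookup during training.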

	### 🎯 Inherited Features from Original DAC:

	- High-fidelity neural audio compression
	- Universal model for all audio domains (speech, environment, music, etc.)
	- Efficient encoding and decoding
	- State-of-the-art reconstruction quality

	## Why VAE Instead of RVQGAN?

	This fork explores an alternative approach to the original DAC's discrete coding strategy:

	| Component | Original DAC (RVQGAN) | DAC-VAE (This Repo) |
	|-----------|----------------------|---------------------|
	| Latent Space | Discrete (VQ codes) | Continuous (Gaussian) |
	| Sampling Rate | 44.1 kHz | 24 kHz |
	| Quantization | Residual VQ with codebooks | VAE reparameterization |
	| Training Objective | Reconstruction + VQ + Adversarial | Reconstruction + KL + Adversarial |
	| Compression | Fixed bitrate (8 kbps) | Variable (KL-controlled) |
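
The "Reconstruction + KL + Adversarial" objective in the table can be written out as follows. This is a sketch under stated assumptions: a diagonal Gaussian posterior and a standard normal prior, which is the common VAE choice but is not confirmed by this README. The weights $\lambda_{\mathrm{rec}}$, $\beta$, and $\lambda_{\mathrm{adv}}$ are placeholder symbols, not values from the configs.

```latex
\mathcal{L}
  = \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}}
  + \beta\, D_{\mathrm{KL}}\!\left(q(z \mid x)\,\|\,\mathcal{N}(0, I)\right)
  + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}},
\qquad
D_{\mathrm{KL}}
  = \frac{1}{2}\sum_{i}\left(\mu_i^{2} + \sigma_i^{2} - 1 - \log\sigma_i^{2}\right)
```

Under these assumptions, raising $\beta$ pulls the latent closer to the prior and lowers the information it carries, which is one way to read "KL-controlled" compression in the last table row.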

	## Installation

	```bash
	# Clone this repository
	git clone https://github.com/primepake/dac-vae.git
	cd dac-vae

	# Install dependencies
	pip install -r requirements.txt
	```

	## Usage

	### Inference

	```bash
	python3 inference.py \
	--checkpoint checkpoint.pt \
	--config configs/configx2.yml \
	--mode encode_decode \
	--input test.wav \
	--output reconstruction.wav
	```

	### Training

	```bash
	# Single GPU training
	python3 train.py --run_id factorx2

	# Multi-GPU training (4 GPUs)
	torchrun --nnodes=1 --nproc_per_node=4 train.py --run_id factorx2
	```

	## Model Architecture

	DAC-VAE preserves most of the original DAC architecture with key modifications:

	- Encoder: Same convolutional architecture as original DAC
	- Latent Layer: VAE reparameterization (replaces residual vector quantization)
	- Decoder: Identical transposed convolution architecture
	- Discriminator: Same multi-scale discriminator for perceptual quality

	### Configuration

	The model can be configured through YAML files in the `configs/` directory:

	- `configx2.yml`: Default 24 kHz configuration with 2x downsampling factor
	- Adjust latent dimensions, KL weight, and other hyperparameters as needed
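
When tuning the latent dimension or downsampling, it helps to estimate how many latent frames a clip produces. The sketch below is illustrative arithmetic only. In DAC-style convolutional encoders the hop length is typically the product of the encoder strides, but the value `320` used here is a made-up placeholder and is not read from `configx2.yml`. All function names are hypothetical.

```python
SAMPLE_RATE = 24_000      # from this README
HYPOTHETICAL_HOP = 320    # assumption for illustration, NOT read from configx2.yml


def latent_frame_rate(sample_rate: int, hop_length: int) -> float:
    """Latent frames produced per second of audio."""
    return sample_rate / hop_length


def latent_frames(num_samples: int, hop_length: int) -> int:
    """Frames for a clip, assuming the input is padded up to a multiple of hop_length."""
    return -(-num_samples // hop_length)  # ceiling division


def latent_values_per_second(sample_rate: int, hop_length: int, latent_dim: int) -> float:
    """Continuous values the encoder emits per second of audio."""
    return latent_frame_rate(sample_rate, hop_length) * latent_dim
```

With the placeholder hop, `latent_frame_rate(SAMPLE_RATE, HYPOTHETICAL_HOP)` gives 75 frames per second. Substitute the real hop length and latent dimension from your config to get meaningful numbers.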

	## Training Details

	### Dataset Preparation

	Prepare your audio dataset with the following structure:

	```
	dataset/
	├── train/
	│   ├── audio1.wav
	│   ├── audio2.wav
	│   └── ...
	└── val/
	    ├── audio1.wav
	    ├── audio2.wav
	    └── ...
	```
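
Before launching a long training run, it can be worth checking that the folder matches this layout. The following is a minimal standard-library sketch. The helper names `list_split_files` and `check_dataset` are hypothetical and not part of this repository, and the check assumes that only `.wav` files directly inside `train/` and `val/` matter.

```python
from pathlib import Path


def list_split_files(dataset_root, split, extension=".wav"):
    """Return the sorted audio files directly inside dataset_root/split."""
    split_dir = Path(dataset_root) / split
    if not split_dir.is_dir():
        raise FileNotFoundError(f"missing split directory: {split_dir}")
    return sorted(
        p for p in split_dir.iterdir()
        if p.is_file() and p.suffix.lower() == extension
    )


def check_dataset(dataset_root):
    """Count the .wav files in train/ and val/, failing if either is empty."""
    counts = {}
    for split in ("train", "val"):
        files = list_split_files(dataset_root, split)
        if not files:
            raise ValueError(f"no .wav files found in {split}/")
        counts[split] = len(files)
    return counts

# Example: check_dataset("dataset") returns a dict such as {"train": N, "val": M}
```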

	### Training Command

	```bash
	torchrun --nnodes=1 --nproc_per_node=4 train.py \
	--run_id my_experiment \
	--config configs/configx2.yml
	```

	## Evaluation

	Evaluate model performance using:

	```bash
	python3 evaluate.py \
	--checkpoint checkpoint.pt \
	--test_dir /path/to/test/audio
	```

	## Pretrained Models

	| Model | Sample Rate | Config | Download |
	|-------|-------------|--------|----------|
	| dac_vae_24khz_v1 | 24 kHz | config.yml | [64 dim 3x frames](https://github.com/primepake/dac_vae/releases/tag/64dim-3xframe_rate) |
	| dac_vae_24khz_v1 | 24 kHz | configx2.yml | [80 dim 2x frames](https://github.com/primepake/dac_vae/releases/tag/80dim-2xframe_rate) |


	## Citation

	If you use DAC-VAE, please cite both this work and the original DAC paper:

	```bibtex
	@misc{dacvae2024,
	title={DAC-VAE: Variational Autoencoder Adaptation of Descript Audio Codec},
	author={primepake},
	year={2024},
	url={https://github.com/primepake/dac-vae}
	}

	@article{kumar2023high,
	title={High-Fidelity Audio Compression with Improved RVQGAN},
	author={Kumar, Rithesh and Seetharaman, Prem and Luebs, Alejandro and Kumar, Ishaan and Kumar, Kundan},
	journal={arXiv preprint arXiv:2306.06546},
	year={2023}
	}
	```

	## License

	This project uses the same license as the original Descript Audio Codec. See the [LICENSE](https://github.com/descriptinc/descript-audio-codec/blob/main/LICENSE) file for details.

	## Acknowledgments

	This work is built directly on top of the excellent [Descript Audio Codec](https://github.com/descriptinc/descript-audio-codec) by the Descript team. We thank them for open-sourcing their high-quality implementation, which made this VAE exploration possible.

	## Related Links

	- [Original DAC Repository](https://github.com/descriptinc/descript-audio-codec)
	- [Original DAC Paper](https://arxiv.org/abs/2306.06546)
	- [Descript Audio Codec Demo](https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a18f30bfd)

	## Contributing

	Contributions are welcome! Please feel free to submit a Pull Request.

	## Contact

	For questions and feedback, please open an issue in this repository.