---
license: mit
tags:
- text-to-audio
- controlnet
---

<img src="arts/ezaudio.png">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio combines high-quality audio synthesis with lower computational demands.

🎛 Try EzAudio on the Hugging Face Space: [EzAudio: Text-to-Audio Generation, Editing, and Inpainting](https://huggingface.co/spaces/OpenSound/EzAudio)

🎮 A ControlNet demo is also available: [EzAudio-ControlNet](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)!

## Installation

Clone the repository:

```
git clone git@github.com:haidog-yaqub/EzAudio.git
```

Install the dependencies:

```
cd EzAudio
pip install -r requirements.txt
```

Download the checkpoints from [https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio)

## Usage

You can use the model with the following code:

```python
import torch

from api.ezaudio import load_models, generate_audio

# model and config paths
config_name = 'ckpts/ezaudio-xl.yml'
ckpt_path = 'ckpts/s3/ezaudio_s3_xl.pt'
vae_path = 'ckpts/vae/1m.pt'
# save_path = 'output/'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load model
(autoencoder, unet, tokenizer,
 text_encoder, noise_scheduler, params) = load_models(config_name, ckpt_path,
                                                      vae_path, device)

prompt = "a dog barking in the distance"
sr, audio = generate_audio(prompt, autoencoder, unet, tokenizer,
                           text_encoder, noise_scheduler, params, device)
```
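
To listen to the result, the returned sample rate and waveform can be written to a WAV file. Below is a minimal sketch using only the standard library; it assumes `audio` is a 1-D float array with values in [-1, 1] (check the actual dtype and shape returned by `generate_audio` and adjust if needed):

```python
import wave

import numpy as np


def save_wav(path, sr, audio):
    """Convert float audio in [-1, 1] to 16-bit PCM and write it as a mono WAV."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as f:
        f.setnchannels(1)   # mono output
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())


# e.g. save_wav('output/dog_bark.wav', sr, audio)
```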

## Todo

- [x] Release Gradio demo along with checkpoints: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet demo along with checkpoints: [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [ ] Release checkpoints for stage1 and stage2
- [ ] Release training pipeline and dataset

## Reference

If you find the code useful for your research, please consider citing:

```bibtex
@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}
```

## Acknowledgement

Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).