---
license: cc-by-nc-nd-4.0
---

# ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation

This page shares the official model checkpoints of the paper \
*ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation* \
from the Microsoft Applied Sciences Group and UC Berkeley, \
by [Yatong Bai](https://bai-yt.github.io),
[Trung Dang](https://www.microsoft.com/applied-sciences/people/trung-dang),
[Dung Tran](https://www.microsoft.com/applied-sciences/people/dung-tran),
[Kazuhito Koishida](https://www.microsoft.com/applied-sciences/people/kazuhito-koishida),
and [Somayeh Sojoudi](https://people.eecs.berkeley.edu/~sojoudi/).

**[[🤗 Live Demo](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA)]**
**[[Preprint Paper](https://arxiv.org/abs/2309.10740)]**
**[[Project Homepage](https://consistency-tta.github.io)]**
**[[Code](https://github.com/Bai-YT/ConsistencyTTA)]**
**[[Model Checkpoints](https://huggingface.co/Bai-YT/ConsistencyTTA)]**
**[[Generation Examples](https://consistency-tta.github.io/demo.html)]**

## Description

**2024/06 Updates:**

- We have launched an interactive live demo of ConsistencyTTA on [🤗 Hugging Face](https://huggingface.co/spaces/Bai-YT/ConsistencyTTA).
- ConsistencyTTA has been accepted to ***INTERSPEECH 2024***! We look forward to meeting you on Kos Island.

This work proposes a *consistency distillation* framework to train
text-to-audio (TTA) generation models that require only a single neural network query,
reducing the computation of the core step of diffusion-based TTA models by a factor of 400.
By incorporating *classifier-free guidance* into the distillation framework,
our models retain the impressive generation quality and diversity of diffusion models.
Furthermore, the non-recurrent differentiable structure of the consistency model
allows for end-to-end fine-tuning with novel loss functions such as the CLAP score, further boosting performance.

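To give a sense of the guidance step mentioned above, the snippet below is a toy NumPy sketch of classifier-free guidance, where the unconditional network output is extrapolated toward the text-conditional output. It is purely illustrative: the function name, array shapes, and guidance weight are assumptions, not the actual ConsistencyTTA implementation (see the GitHub code for that).

```python
import numpy as np

def cfg_combine(out_uncond, out_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    output toward the conditional output with guidance weight w.
    w = 0 ignores the text prompt; w > 1 strengthens conditioning."""
    return out_uncond + w * (out_cond - out_uncond)

# Toy example with random stand-ins for the two network outputs
rng = np.random.default_rng(0)
out_uncond = rng.standard_normal((4, 8))  # hypothetical latent shape
out_cond = rng.standard_normal((4, 8))

guided = cfg_combine(out_uncond, out_cond, w=3.0)
```

In a standard diffusion sampler this combination runs at every denoising step (two network queries per step); distilling it into the consistency student is what lets ConsistencyTTA keep guided-quality output from a single query.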
<center>
<img src="main_figure_.png" alt="ConsistencyTTA Results" title="Results" width="480"/>
</center>

## Model Details

We share three model checkpoints:

- [ConsistencyTTA directly distilled from a diffusion model](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA.zip);
- [ConsistencyTTA fine-tuned by optimizing the CLAP score](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/ConsistencyTTA_CLAPFT.zip);
- [The diffusion teacher model from which ConsistencyTTA is distilled](https://huggingface.co/Bai-YT/ConsistencyTTA/blob/main/LightweightLDM.zip).

The first two models are capable of high-quality single-step text-to-audio generation. Each generation is 10 seconds long.

After downloading and unzipping the files, place them in the `saved` directory.
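For example, assuming the three zip files listed above have been downloaded into the repository root, a shell snippet along these lines unpacks whichever of them are present into `saved` (the exact expected layout inside `saved` may differ; the GitHub page is authoritative):

```shell
# Unpack any downloaded checkpoint archives into the `saved` directory.
mkdir -p saved
for f in ConsistencyTTA.zip ConsistencyTTA_CLAPFT.zip LightweightLDM.zip; do
    if [ -f "$f" ]; then
        unzip -o "$f" -d saved/
    fi
done
```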

The training and inference code is available on our [GitHub page](https://github.com/Bai-YT/ConsistencyTTA); please refer to it for usage details.