---
license: mit
datasets:
- ShoukanLabs/AniSpeech
- vctk
- blabble-io/libritts_r
language:
- en
pipeline_tag: text-to-speech
base_model: yl4579/StyleTTS2-LibriTTS
---

<style>

.TitleContainer {
  background-color: #ffffff;
  margin-bottom: 0rem;
  margin-left: auto;
  margin-right: auto;
  width: 40%;
  height: 30%;
  border-radius: 10rem;
  border: 0.5vw solid #ff593e;
  transition: .6s;
}

.TitleContainer:hover {
  transform: scale(1.05);
}

.VokanLogo {
  margin: auto;
  display: block;
}

audio {
  margin: 0.5rem;
}

.audio-container {
  display: flex;
  justify-content: center;
  align-items: center;
}

</style>

<hr>

<div class="TitleContainer" align="center">
  <!--<img src="https://huggingface.co/ShoukanLabs/Vokan/resolve/main/Vokan.gif" class="VokanLogo">-->
  <img src="Vokan.gif" class="VokanLogo">
</div>

<p align="center" style="font-size: 1vw; font-weight: bold; color: #ff593e;">A StyleTTS2 fine-tune, designed for expressiveness.</p>

<hr>

<div class='audio-container'>
  <a align="center" href="https://discord.gg/5bq9HqVhsJ"><img src="https://img.shields.io/badge/find_us_at_the-ShoukanLabs_Discord-invite?style=flat-square&logo=discord&logoColor=%23ffffff&labelColor=%235865F2&color=%23ffffff" width="320" alt="discord"></a>
  <!--<a align="left" style="font-size: 1.3rem; font-weight: bold; color: #5662f6;" href="https://discord.gg/5bq9HqVhsJ">find us on Discord</a>-->
</div>

**Vokan** is a finetuned **StyleTTS2** model built for authentic, expressive zero-shot performance,
and designed to serve as a stronger base model for further finetuning.
It leverages a diverse dataset and extensive training to generate high-quality synthesized speech.
Trained on a combination of the AniSpeech, VCTK, and LibriTTS-R datasets, Vokan stays authentic and natural across a variety of accents and contexts.
With over 6 days of audio from 672 diverse, expressive speakers,
Vokan captures a wide range of vocal characteristics, which contributes to its strong performance.
Although it was trained on less data than the original base model, the broad array of accents and speakers enriches the model's vector space.
Training required significant computational resources: roughly 300 hours on a single H100 plus a further 600 hours on a single RTX 3090.
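For a quick sense of scale, those figures work out to only a dozen or so minutes of audio per speaker on average. A minimal back-of-envelope sketch, assuming "over 6 days" means roughly 144 hours:

```python
# Back-of-envelope check on the training-set figures quoted above:
# ~6 days of audio spread across 672 speakers.
total_hours = 6 * 24                      # "over 6 days" -> ~144 hours
speakers = 672
minutes_per_speaker = total_hours * 60 / speakers
print(f"~{minutes_per_speaker:.1f} min/speaker on average")  # ~12.9
```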

You can read more in our article on [DagsHub](https://dagshub.com/blog/styletts2/)!

<hr>
<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Vokan Samples!</p>
<div class='audio-container'>
  <div>
    <audio controls>
      <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%201.wav" type="audio/wav">
      Your browser does not support the audio element.
    </audio>
  </div>

  <div>
    <audio controls>
      <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%202.wav" type="audio/wav">
      Your browser does not support the audio element.
    </audio>
  </div>
</div>
<div class='audio-container'>
  <div>
    <audio controls>
      <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%203.wav" type="audio/wav">
      Your browser does not support the audio element.
    </audio>
  </div>
  <div>
    <audio controls>
      <source src="https://dagshub.com/StyleTTS/Article/raw/74539c801ce3a894ec3df6b52fa2dd579637481d/demo%204.wav" type="audio/wav">
      Your browser does not support the audio element.
    </audio>
  </div>
</div>
<hr>

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Acknowledgements!</p>

- **[DagsHub](https://dagshub.com):** Special thanks to DagsHub for sponsoring GPU compute resources as well as offering an excellent versioning service, enabling efficient model training and development. A shoutout to Dean in particular!
- **[camenduru](https://github.com/camenduru):** Thanks to camenduru for their expertise in cloud infrastructure and model training, which played a crucial role in the development of Vokan. Please give them a follow!

<hr>

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Conclusion!</p>

V2 is currently in the works, aiming to be bigger and better in every way, including multilingual support!
This is where you come in: if you have any large single-speaker datasets, in any language,
you can contribute them to our **Vokan dataset**, a large **community dataset** that combines
many smaller single-speaker datasets into one big multispeaker corpus.
You can upload your Uberduck- or FakeYou-compliant datasets via the
**[Vokan](https://huggingface.co/ShoukanLabs/Vokan)** bot on the **[ShoukanLabs Discord Server](https://discord.gg/hdVeretude)**.
The more data we have, the better the models we produce will be!

[This model is also available on DagsHub](https://dagshub.com/ShoukanLabs/Vokan)
<hr>

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">Citations!</p>

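If you want to fetch individual files from this repo programmatically, the Hub serves them under a predictable `/resolve/` URL, the same layout used by the logo link above. A minimal sketch; the `config.yml` filename here is an assumption, so check the repo's file listing for the actual names:

```python
# Build a direct-download URL for a file in the ShoukanLabs/Vokan repo using
# the Hugging Face Hub's standard /resolve/<revision>/<filename> URL layout.
REPO_ID = "ShoukanLabs/Vokan"

def hub_file_url(filename: str, revision: str = "main") -> str:
    """Direct-download URL for `filename` at `revision` in the Vokan repo."""
    return f"https://huggingface.co/{REPO_ID}/resolve/{revision}/{filename}"

# "config.yml" is a placeholder; substitute a real file name from the repo.
print(hub_file_url("config.yml"))
# https://huggingface.co/ShoukanLabs/Vokan/resolve/main/config.yml
```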
|
| | ```citations |
| | @misc{li2023styletts, |
| | title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models}, |
| | author={Yinghao Aaron Li and Cong Han and Vinay S. Raghavan and Gavin Mischler and Nima Mesgarani}, |
| | year={2023}, |
| | eprint={2306.07691}, |
| | archivePrefix={arXiv}, |
| | primaryClass={eess.AS} |
| | } |
| | |
| | @misc{zen2019libritts, |
| | title={LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech}, |
| | author={Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu}, |
| | year={2019}, |
| | eprint={1904.02882}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.SD} |
| | } |
| | |
| | Christophe Veaux, Junichi Yamagishi, Kirsten MacDonald, |
| | "CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit", |
| | The Centre for Speech Technology Research (CSTR), |
| | University of Edinburgh |
| | ``` |

<p align="center" style="font-size: 2vw; font-weight: bold; color: #ff593e;">License!</p>

```
MIT
```