---
language:
- en
- multilingual
tags:
- text-to-speech
- speech-synthesis
- pytorch
- styletts2
- speaches
- neural-tts
- voice-cloning
pipeline_tag: text-to-speech
library_name: pytorch
license: mit
datasets:
- LibriTTS
metrics:
- naturalness
- similarity
widget:
- text: "Hello, this is a sample of StyleTTS2 speech synthesis."
  example_title: "English Sample"
- text: "StyleTTS2 can synthesize high-quality speech with style control."
  example_title: "Style Control Sample"
---

# StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training

StyleTTS 2 is a text-to-speech model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level text-to-speech synthesis. It builds upon the original StyleTTS with significant improvements in naturalness and speaker similarity.

## Model Description

- **Model Type**: Neural Text-to-Speech (TTS)
- **Language(s)**: English (primary), with support for 18+ additional languages
- **License**: MIT
- **Paper**: [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models](https://arxiv.org/abs/2306.07691)
- **Sample Rate**: 24,000 Hz
- **Architecture**: Style diffusion with adversarial training

## Features

- **High-Quality Synthesis**: Achieves human-level naturalness in speech synthesis
- **Style Control**: Advanced style transfer and voice cloning capabilities
- **Multi-Language Support**: Primary English model with support for 18+ additional languages
- **Voice Cloning**: Clones voices from short reference audio samples
- **Diffusion-Based**: Uses style diffusion for high-quality audio generation

## Usage

This model is designed for text-to-speech synthesis with the following capabilities:

1. **Multi-Voice Synthesis**: Generate speech using preset voice styles
2. **Voice Cloning**: Clone voices from reference audio samples
3. **Style Control**: Fine-tune synthesis parameters for different styles
4. **Multi-Language**: Support for various languages with English-accented pronunciation

### Parameters

- `alpha` (0.0-1.0): Style blending factor (default: 0.3)
- `beta` (0.0-1.0): Style mixing factor (default: 0.7)
- `diffusion_steps` (3-20): Number of diffusion steps; more steps trade latency for quality (default: 5)
- `embedding_scale` (1.0-10.0): Embedding scale (classifier-free guidance) factor (default: 1.0)
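### Example

A minimal inference sketch follows, showing how the parameters above are typically passed to a synthesis call. The `styletts2` package, the `StyleTTS2` class, the `inference` method, and the `ref_audio` argument are assumed names for illustration only; the official repository exposes inference through scripts and notebooks, so adapt the sketch to whichever entry point you use.

```python
# Hypothetical wrapper API -- package, class, and argument names are
# placeholders; only alpha, beta, diffusion_steps, and embedding_scale
# correspond to the documented parameters above.
import soundfile as sf

from styletts2 import StyleTTS2  # hypothetical package/class

tts = StyleTTS2()  # assumed to load the default LibriTTS checkpoint

# Synthesis with a preset voice.
wav = tts.inference(
    "Hello, this is a sample of StyleTTS2 speech synthesis.",
    alpha=0.3,            # style blending factor
    beta=0.7,             # style mixing factor
    diffusion_steps=5,    # more steps: higher quality, slower inference
    embedding_scale=1.0,  # guidance strength; >1.0 is more expressive
)
sf.write("output.wav", wav, 24000)  # the model generates 24 kHz audio

# Voice cloning from a short reference clip (hypothetical argument name).
wav = tts.inference(
    "StyleTTS2 can synthesize high-quality speech with style control.",
    ref_audio="reference_speaker.wav",
    alpha=0.1,  # lower alpha/beta keep more of the reference style
    beta=0.3,
    diffusion_steps=10,
)
sf.write("cloned.wav", wav, 24000)
```

In the official inference code, higher `alpha` and `beta` values shift the style toward what the model predicts from the text, while lower values stay closer to the reference audio, which is why the cloning call above lowers both.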
## Training Data

- **Primary Dataset**: LibriTTS
- **Languages**: English (primary) + 18 additional languages
- **Training Approach**: Style diffusion with adversarial training using large speech language models

## Performance

StyleTTS 2 achieves human-level performance in:

- **Naturalness**: Comparable to human speech in listening tests
- **Similarity**: High-fidelity voice cloning and style transfer
- **Quality**: Superior audio quality compared to previous TTS models

## Limitations

- **Compute Requirements**: Requires significant computational resources for inference
- **English-First**: Optimized for English; other languages may have accented pronunciation
- **Context Dependency**: Performance varies with input text length and complexity

## Citation

```bibtex
@article{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S. and Mischler, Gavin and Mesgarani, Nima},
  journal={arXiv preprint arXiv:2306.07691},
  year={2023}
}
```

## Links

- Paper: [https://arxiv.org/abs/2306.07691](https://arxiv.org/abs/2306.07691)
- Samples: [https://styletts2.github.io/](https://styletts2.github.io/)
- Code: [https://github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- License: MIT License