---
language:
- en
- multilingual
tags:
- text-to-speech
- speech-synthesis
- pytorch
- styletts2
- speaches
- neural-tts
- voice-cloning
pipeline_tag: text-to-speech
library_name: pytorch
license: mit
datasets:
- LibriTTS
metrics:
- naturalness
- similarity
widget:
- text: "Hello, this is a sample of StyleTTS2 speech synthesis."
  example_title: "English Sample"
- text: "StyleTTS2 can synthesize high-quality speech with style control."
  example_title: "Style Control Sample"
---

# StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training

StyleTTS 2 is a text-to-speech model that combines style diffusion with adversarial training against large speech language models (SLMs), aiming for human-level speech synthesis. It builds on the original StyleTTS and improves both naturalness and speaker similarity.

## Model Description

- **Model Type**: Neural Text-to-Speech (TTS)
- **Language(s)**: English (primary), with support for 18+ languages
- **License**: MIT
- **Paper**: [StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training](https://arxiv.org/abs/2306.07691)
- **Sample Rate**: 24,000 Hz
- **Architecture**: Style diffusion with adversarial training
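
Downstream code needs the 24,000 Hz sample rate to convert between sample counts and durations and to write correct audio headers. The sketch below uses only Python's standard `wave` module. The sine tone stands in for model output, and the helper names `duration_seconds` and `write_wav_24k` are illustrative, not part of StyleTTS 2.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 24_000  # Hz, as stated in the model description


def duration_seconds(num_samples: int, sample_rate: int = SAMPLE_RATE) -> float:
    """Length of a mono clip in seconds."""
    return num_samples / sample_rate


def write_wav_24k(samples, target) -> None:
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM at 24 kHz.

    `target` may be a file path or a binary file-like object.
    """
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Placeholder signal: half a second of a 440 Hz tone instead of model output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(12_000)]
buffer = io.BytesIO()
write_wav_24k(tone, buffer)
```

Passing a file path instead of the in-memory `buffer` writes the same data to disk.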

## Features

- **High-Quality Synthesis**: Targets human-level naturalness in speech synthesis
- **Style Control**: Style transfer between voices and speaking styles
- **Multi-Language Support**: Primary English model with support for 18+ additional languages
- **Voice Cloning**: Clones a voice from a reference audio sample
- **Diffusion-Based**: Samples the speech style with a diffusion model rather than generating the audio itself by diffusion

## Usage

This model is designed for text-to-speech synthesis with the following capabilities:

1. **Multi-Voice Synthesis**: Generate speech using preset voice styles
2. **Voice Cloning**: Clone voices from reference audio samples
3. **Style Control**: Adjust synthesis parameters to obtain different styles
4. **Multi-Language**: Synthesize text in other languages, though with English-accented pronunciation

### Parameters

- `alpha` (0.0-1.0): Style blending factor (default: 0.3)
- `beta` (0.0-1.0): Style mixing factor (default: 0.7)
- `diffusion_steps` (3-20): Number of diffusion steps for quality (default: 5)
- `embedding_scale` (1.0-10.0): Embedding scale factor (default: 1.0)
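
The ranges and defaults above can be captured in a small validation helper, and a blending factor can be illustrated with a linear interpolation. This is a sketch under stated assumptions: `SynthesisParams` and `blend_style` are hypothetical names, and the interpolation direction (1.0 takes the style predicted from the text, 0.0 keeps the reference style) is an assumption modeled on the upstream inference demo, not a documented API of this model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisParams:
    """Hypothetical container for the documented parameters and defaults."""
    alpha: float = 0.3
    beta: float = 0.7
    diffusion_steps: int = 5
    embedding_scale: float = 1.0

    def __post_init__(self) -> None:
        # Reject values outside the ranges documented above.
        if not 0.0 <= self.alpha <= 1.0:
            raise ValueError("alpha must be in [0.0, 1.0]")
        if not 0.0 <= self.beta <= 1.0:
            raise ValueError("beta must be in [0.0, 1.0]")
        if not 3 <= self.diffusion_steps <= 20:
            raise ValueError("diffusion_steps must be in [3, 20]")
        if not 1.0 <= self.embedding_scale <= 10.0:
            raise ValueError("embedding_scale must be in [1.0, 10.0]")


def blend_style(predicted, reference, factor):
    """Linear interpolation between two equal-length style vectors.

    Assumption: factor=1.0 returns the text-predicted style and
    factor=0.0 returns the reference style.
    """
    if len(predicted) != len(reference):
        raise ValueError("style vectors must have the same length")
    return [factor * p + (1.0 - factor) * r for p, r in zip(predicted, reference)]


params = SynthesisParams()
predicted = [1.0, 2.0]  # toy stand-in for a style predicted from text
reference = [0.0, 0.0]  # toy stand-in for a style taken from reference audio
blended = blend_style(predicted, reference, params.alpha)
```

Validating at construction time surfaces an out-of-range value before any synthesis work starts; for example, `SynthesisParams(alpha=1.5)` raises `ValueError`.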

## Training Data

- **Primary Dataset**: LibriTTS
- **Languages**: English (primary) + 18 additional languages
- **Training Approach**: Style diffusion with adversarial training using large speech language models

## Performance

The StyleTTS 2 paper reports the following results:

- **Naturalness**: Surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multi-speaker VCTK dataset, as judged by native English speakers
- **Similarity**: When trained on LibriTTS, outperforms previous publicly available models for zero-shot speaker adaptation
- **Quality**: Rated above the earlier TTS models it was compared against

## Limitations

- **Compute Requirements**: Requires significant computational resources for inference
- **English-First**: Optimized for English; other languages may be pronounced with an English accent
- **Context Dependency**: Output quality varies with the length and complexity of the input text
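
Because quality depends on input length, one common workaround is to split long text into sentence-sized chunks and synthesize each chunk separately. The sketch below shows only the splitting step, using a simple regular expression. `chunk_text` and the default 200-character budget are illustrative assumptions, not values taken from the model or its documentation.

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text(
    "First sentence. Second one! A third, slightly longer sentence?",
    max_chars=32,
)
```

Splitting on sentence boundaries keeps each chunk a complete utterance, so the per-chunk audio can be concatenated without cutting words.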

## Citation

```bibtex
@article{li2023styletts2,
  title={StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models},
  author={Li, Yinghao Aaron and Han, Cong and Raghavan, Vinay S. and Mischler, Gavin and Mesgarani, Nima},
  journal={arXiv preprint arXiv:2306.07691},
  year={2023}
}
```

## Links

- Paper: [https://arxiv.org/abs/2306.07691](https://arxiv.org/abs/2306.07691)
- Samples: [https://styletts2.github.io/](https://styletts2.github.io/)
- Code: [https://github.com/yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
- License: MIT License