---
library_name: transformers
pipeline_tag: text-to-speech
license: mit
language:
- en
inference: false
tags:
- text-to-speech
- tts
- expressive
- parler-tts
- voice-synthesis
- multi-speaker
- audio
base_model: parler-tts/parler-tts-mini-v1.1
---
<div align="center">
<img src="https://huggingface.co/voicing-ai/ParlerVoice/resolve/main/logo.svg" alt="VoicingAI Logo" width="200"/>
# ParlerVoice
### **Professional Text-to-Speech by VoicingAI R&D Labs**
[License: MIT](https://opensource.org/licenses/MIT)
[Python](https://www.python.org/downloads/)
[Hugging Face](https://huggingface.co/TieIncred/ParlerVoice)
**ParlerVoice** is a text-to-speech model fine-tuned from `parler-tts/parler-tts-mini-v1.1` on 650+ hours of curated audio. It offers expressive control through natural language descriptions and keeps named speaker identities consistent across emotional states.
</div>
---
## **Key Features**
- **Extensive Training Data**: Fine-tuned on 650+ hours of carefully curated, high-quality proprietary audio data (dataset release coming soon!)
- **Comprehensive Speaker Library**: 85 distinct speaker identities with consistent, recognizable voices across different accents and demographics
- **Advanced Expressiveness**: Precise control over tone, emotion, pitch, pace, style, reverb, and background noise through natural language descriptions
- **Technical Architecture**: Two-tokenizer system enabling both prompt-based and description-based generation
- **Multi-Accent Support**: Coverage for American, British, Australian, Canadian, South African, Italian, and Irish accents
### **Technical Specifications**
- **Base Model**: `parler-tts/parler-tts-mini-v1.1`
- **Training Data**: 650+ hours of curated proprietary audio (dataset release coming soon - stay tuned!)
- **Architecture**: Two-tokenizer flow for enhanced control and consistency
- **Output Quality**: 24kHz high-fidelity audio generation
---
## **Technical Performance**
Our technical evaluation demonstrates strong performance across key metrics:
1. **Performance Benchmarks**: Achieved 95.2% speaker similarity consistency across different emotional states and a 4.7/5.0 naturalness score in human evaluations
2. **Architecture Studies**: Analysis showed the two-tokenizer approach provides improved expressive control compared to single-tokenizer baselines
3. **Comparative Analysis**: Offers competitive inference speed while maintaining high audio quality at 24kHz resolution
4. **Dataset Quality**: The 650+ hour curated proprietary dataset supports 85 distinct voice identities across 7 accent categories (public release coming soon!)
**[View Full Technical Report & Audio Samples](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)**
---
## **Installation**
```bash
# Install base dependencies
pip install git+https://github.com/huggingface/parler-tts.git
# Install ParlerVoice requirements (for advanced features and presets);
# run this from a clone of the ParlerVoice GitHub repository
pip install -r requirements.txt
```
---
## **Usage**
### **Quick Start with Transformers API**
```python
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model and its two tokenizers: one for the spoken text (prompt),
# one for the voice description (taken from the model's text encoder).
model = ParlerTTSForConditionalGeneration.from_pretrained("TieIncred/ParlerVoice").to(device)
prompt_tokenizer = AutoTokenizer.from_pretrained("TieIncred/ParlerVoice")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

# The prompt is what gets spoken; the description controls how it is spoken.
prompt = "Hey, how are you doing today?"
description = (
    "Connor conveys a neutral mood through a professional and controlled delivery. "
    "He speaks with a slightly low pitch, adding subtle weight to his delivery. "
    "His pace is moderate, keeping the speech easy to follow. "
    "His voice is slightly expressive, with subtle emotional inflections. "
    "The recording is exceptionally clean and close-sounding."
)

desc_inputs = description_tokenizer(description, return_tensors="pt").to(device)
prompt_inputs = prompt_tokenizer(prompt, return_tensors="pt").to(device)

gen = model.generate(
    input_ids=desc_inputs.input_ids,
    attention_mask=desc_inputs.attention_mask,
    prompt_input_ids=prompt_inputs.input_ids,
    prompt_attention_mask=prompt_inputs.attention_mask,
)

# Move the waveform to the CPU and write it at the model's sampling rate.
audio_arr = gen.cpu().numpy().squeeze()
sf.write("parlervoice_out.wav", audio_arr, model.config.sampling_rate)
```
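After generation it can be useful to sanity-check the clip before saving or playing it. The following is a minimal sketch, not part of ParlerVoice: `clip_stats` is a hypothetical helper that derives the duration from the sample count and sampling rate and reports the peak amplitude. It uses stand-in data so it runs without the model; with a real model, pass `audio_arr` and `model.config.sampling_rate` instead.

```python
def clip_stats(samples, sampling_rate):
    """Return (duration in seconds, peak absolute amplitude) for a mono clip."""
    if sampling_rate <= 0:
        raise ValueError("sampling_rate must be positive")
    duration = len(samples) / sampling_rate
    peak = max((abs(float(s)) for s in samples), default=0.0)
    return duration, peak


# Stand-in data: one second of silence followed by a single louder sample.
# 24_000 is the output rate quoted in this card; with a real model, pass
# model.config.sampling_rate rather than hard-coding a value.
fake_audio = [0.0] * 24_000 + [0.5]
duration, peak = clip_stats(fake_audio, 24_000)
print(f"{duration:.3f} s, peak {peak:.2f}")
```

A peak of exactly 0.0 or a duration far shorter than expected usually means the generation step should be inspected before the file is used.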
### **Advanced Usage with Speaker Presets** (Recommended)
For best results, use the ParlerVoice inference engine from the [GitHub repository](https://github.com/VoicingAI/ParlerVoice):
```python
from parlervoice_infer.engine import ParlerVoiceInference
from parlervoice_infer.config import GenerationConfig

# Initialize the engine
infer = ParlerVoiceInference(
    checkpoint_path="TieIncred/ParlerVoice",
    base_model_path="parler-tts/parler-tts-mini-v1.1",
)

# Generate with speaker preset
cfg = GenerationConfig()
audio, path = infer.generate_with_speaker_preset(
    prompt="Welcome to the future of voice AI!",
    speaker="Connor",       # Choose from the available speakers
    preset="professional",  # Other options: casual, narration, dramatic, podcast, news_anchor
    config=cfg,
    output_path="welcome_voice.wav",
)
```
### **Maximum Control with Rich Descriptions**
```python
# For maximum control and consistency, write the description yourself.
# This reuses the `infer` engine created in the speaker-preset example above.
desc = (
    "Connor conveys a confident, professional tone with a warm and engaging delivery. "
    "He speaks with a moderate pace, clear articulation, and subtle emotional warmth. "
    "His voice has a rich, resonant quality that commands attention while remaining approachable. "
    "The recording is clean and professional with minimal background noise."
)
audio, path = infer.generate_audio(
    prompt="Innovation in AI voice technology continues to push boundaries.",
    description=desc,
    output_path="innovative_voice.wav",
)
```
### **Command Line Interface**
```bash
python -m parlervoice_infer \
--checkpoint "TieIncred/ParlerVoice" \
--prompt "Experience the next generation of voice synthesis!" \
--speaker Connor \
--preset dramatic \
--output parlervoice_demo.wav
```
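To run the command-line interface over several prompts, the argument list can be assembled in Python. This is a sketch under the assumption that the CLI accepts exactly the flags shown above (`--checkpoint`, `--prompt`, `--speaker`, `--preset`, `--output`); `build_cli_command` and the job list are hypothetical names used for illustration. The script only prints the commands, so it runs without the package installed.

```python
import shlex
import sys


def build_cli_command(prompt, speaker, preset, output,
                      checkpoint="TieIncred/ParlerVoice"):
    """Build the argv list for one `python -m parlervoice_infer` call."""
    return [
        sys.executable, "-m", "parlervoice_infer",
        "--checkpoint", checkpoint,
        "--prompt", prompt,
        "--speaker", speaker,
        "--preset", preset,
        "--output", output,
    ]


# Hypothetical batch: (prompt, speaker, preset, output file).
jobs = [
    ("Welcome back to the show.", "Connor", "podcast", "intro.wav"),
    ("And that's all for today.", "Sophie", "casual", "outro.wav"),
]
commands = [build_cli_command(*job) for job in jobs]
for cmd in commands:
    # Print a copy-pasteable command line. To execute it instead, pass `cmd`
    # to subprocess.run(cmd, check=True) where the package is installed.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with prompts that contain apostrophes or other special characters.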
---
## 🗣️ **Speaker Library**
ParlerVoice features an extensive collection of **85 professionally curated speaker identities**:
### **🇺🇸 American Speakers**
**Male:** Tyler, Ryan, Jackson, Kyle, Derek, Cameron, Marcus, Ethan, Parker, Hayden, Grant, Chase, Tucker, Dalton, Zach, Brandon, Austin, Trevor, Jordan, Nathan, Blake, Garrett, Caleb, Logan, Hunter, Mason, Colton, Flynn, Devin, Carson, Preston, Landon, Bryce, Jasper, Cole, Noah, Taylor, Trent, Shane, Jared, Reid, Spencer, Wyatt, Luke, Cody, Drew, Henry, Vincent, Nolan, Kane, Ian, Kent, Jace, Max, Reed, Wade, George, Seth, Cruz, Miles, John, Michael
**Female:** Madison, Ashley, Jennifer, Samantha, Brittany, Camille, Rachel, Paige, Haley, Megan, Alexis, Zara, Grace, Alice, Olivia
### **🇬🇧 British Speakers**
- Oliver (Male)
- Sophie (Female)
### **🇦🇺 Australian / New Zealand**
- **Male:** Liam, Finn
- **Female:** Ruby, Emma, Chloe
### **International Accents**
- **Connor** (Male, Canadian)
- **Thabo** (Male, South African)
- **Marco** (Male, Italian)
- **Cian** (Male, Irish)
- **Wei** (Male, Chinese)
- **Aoife** (Female, Irish)
- **Siobhan** (Female, Irish)
- **Johan** (Male, Dutch)
- **Pieter** (Male, Dutch)
- **Ingrid** (Female, Dutch)
- **Priya** (Female, Indian)
- **Mei, Lin, Xiao, Li, Jing, Yan** (Chinese)
- **Elena** (Female, Spanish/European)
*Full details in the [technical documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)*
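Because the speaker name is written into the description as plain text, a misspelled name may silently fall back to a different voice. A small lookup can catch that before generation. The sketch below is illustrative only: the table is a hand-copied subset of the list above, and `SPEAKERS_BY_ACCENT` and `find_accent` are hypothetical names, not part of the ParlerVoice package.

```python
# A hand-copied subset of the speaker list above, keyed by accent group.
# This table is illustrative only; it is not read from the model or the repo.
SPEAKERS_BY_ACCENT = {
    "American": ["Tyler", "Ryan", "Madison", "Ashley"],
    "British": ["Oliver", "Sophie"],
    "Australian / New Zealand": ["Liam", "Finn", "Ruby", "Emma", "Chloe"],
    "Canadian": ["Connor"],
    "Irish": ["Cian", "Aoife", "Siobhan"],
}


def find_accent(name):
    """Return the accent group for `name` (case-insensitive), or None if unknown."""
    wanted = name.strip().lower()
    for accent, names in SPEAKERS_BY_ACCENT.items():
        if wanted in (n.lower() for n in names):
            return accent
    return None


for candidate in ["connor", "Sophie", "Conor"]:
    accent = find_accent(candidate)
    status = accent if accent else "not in the local table, check the spelling"
    print(f"{candidate}: {status}")
```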
---
## **Key Capabilities**
### **Expressive Control**
- **Natural Language Descriptions**: Control emotion, tone, pace, and style through intuitive text descriptions
- **Real-time Adjustment**: Modify expressiveness on-the-fly for dynamic content
- **Contextual Awareness**: Maintains consistency across long-form content
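For long-form text, one common approach with TTS models is to split the input into sentence-sized chunks and generate each chunk with the same description, then concatenate the audio. This card does not prescribe that workflow, so the following is only a sketch of the text-splitting step; `split_into_chunks` and the `max_chars` limit are assumptions chosen for illustration.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = (
    "Welcome to the show. Today we look at expressive speech synthesis. "
    "First, a short recap of last week! Are you ready? Then let's begin."
)
chunks = split_into_chunks(long_text, max_chars=60)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

Each chunk would then be passed as the prompt while the description stays fixed, which is the part that keeps the named speaker and style the same from chunk to chunk.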
### **Audio Quality**
- **High-Fidelity Output**: 24kHz crystal-clear audio reproduction
- **Noise Control**: Advanced background noise and reverb management
- **Speaker Consistency**: Maintains voice identity across different emotional states
### **Performance Optimizations**
- **Efficient Inference**: Optimized for both CPU and GPU deployment
- **Batch Processing**: Handle multiple requests simultaneously
- **Streaming Support**: Real-time audio generation capabilities
- **Compatible with SDPA and compile optimizations** from upstream Parler-TTS
For optimization tips, see [Parler-TTS INFERENCE.md](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md)
---
## 💡 **Best Practices**
### Recommended Usage for Optimal Results
- **Use speaker presets** from the repository for consistent, high-quality outputs
- **Include named speakers** in descriptions to bias towards specific voice identities
- **Provide detailed descriptions** for maximum control over expressiveness and tone
- **Pull latest updates** from the repo as we actively refine description phrasing
### Example Description Template
```
[Speaker Name] conveys a [emotion] mood through a [style] delivery.
They speak with a [pitch level] pitch and [pace] pace.
The voice is [expressiveness level], with [characteristics].
The recording is [quality level] with [background description].
```
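The template can be filled programmatically so that descriptions stay consistently phrased across many clips. The helper below is a minimal sketch: `build_description` and its parameter names are hypothetical, and the example values are taken from the Quick Start description, apart from the background wording, which is made up for illustration. The helper does not adjust "a"/"an", so pick values that read naturally after "a".

```python
def build_description(speaker, emotion, style, pitch, pace,
                      expressiveness, characteristics,
                      quality, background, pronoun="They"):
    """Fill the four-sentence description template with concrete values."""
    # "They speak" but "He speaks" / "She speaks".
    verb = "speak" if pronoun.lower() == "they" else "speaks"
    return " ".join([
        f"{speaker} conveys a {emotion} mood through a {style} delivery.",
        f"{pronoun} {verb} with a {pitch} pitch and {pace} pace.",
        f"The voice is {expressiveness}, with {characteristics}.",
        f"The recording is {quality} with {background}.",
    ])


description = build_description(
    speaker="Connor",
    emotion="neutral",
    style="professional and controlled",
    pitch="slightly low",
    pace="moderate",
    expressiveness="slightly expressive",
    characteristics="subtle emotional inflections",
    quality="exceptionally clean",
    background="no background noise",
    pronoun="He",
)
print(description)
```

The resulting string can be used wherever a `description` is expected in the examples above.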
---
## **License**
This project is licensed under the **MIT License**.
**Open Source & Free to Use** - ParlerVoice is available for:
- ✅ Commercial applications and services
- ✅ Academic research and educational purposes
- ✅ Personal projects and community contributions
- ✅ Integration into other products and services
- ✅ Modification and redistribution
---
## **Citations**
If you use this work, please consider citing:
```bibtex
@software{iqbal2025parlervoice,
title={ParlerVoice: Expressive Text-to-Speech with Advanced Speaker Control},
author={Tausif Iqbal and Zeeshan and Anant},
year={2025},
publisher={VoicingAI R\&D Labs},
url={https://github.com/VoicingAI/ParlerVoice}
}
@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
@misc{lyth2024natural,
title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
author={Dan Lyth and Simon King},
year={2024},
eprint={2402.01912},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
```
---
## **Resources**
- **GitHub Repository**: [VoicingAI/ParlerVoice](https://github.com/VoicingAI/ParlerVoice)
- **Technical Report & Samples**: [Notion Documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)
- **Hugging Face Model**: [TieIncred/ParlerVoice](https://huggingface.co/TieIncred/ParlerVoice)
- **Base Model**: [parler-tts/parler-tts-mini-v1.1](https://huggingface.co/parler-tts/parler-tts-mini-v1.1)
---
<div align="center">
**Made with ❤️ by VoicingAI R&D Labs**
**Principal Researcher**: [Tausif Iqbal](https://www.linkedin.com/in/tausif-iqbal-77819a182/)
**Core Team**: [Zeeshan](https://www.linkedin.com/in/zeeshan-parvez/) • [Anant](https://www.linkedin.com/in/anant-upadhyay-052524256/)
*Developed at [VoicingAI](https://voicing.ai)*
</div>