---
library_name: transformers
pipeline_tag: text-to-speech
license: mit
language:
- en
inference: false
tags:
- text-to-speech
- tts
- expressive
- parler-tts
- voice-synthesis
- multi-speaker
- audio
base_model: parler-tts/parler-tts-mini-v1.1
---

<div align="center">
  <img src="https://huggingface.co/voicing-ai/ParlerVoice/resolve/main/logo.svg" alt="VoicingAI Logo" width="200"/>

# ParlerVoice

### **Professional Text-to-Speech by VoicingAI R&D Labs**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue.svg)](https://huggingface.co/TieIncred/ParlerVoice)

**ParlerVoice** is a text-to-speech model with expressive control and consistent speaker identities. It is built on `parler-tts/parler-tts-mini-v1.1` and fine-tuned on 650+ hours of curated proprietary audio.

</div>

---

## ✨ **Key Features**

- **🏆 Extensive Training Data**: Fine-tuned on 650+ hours of carefully curated, high-quality proprietary audio data (dataset release coming soon!)
- **👥 Comprehensive Speaker Library**: 85 distinct speaker identities with consistent, recognizable voices across different accents and demographics
- **🎭 Advanced Expressiveness**: Precise control over tone, emotion, pitch, pace, style, reverb, and background noise through natural language descriptions
- **🔬 Technical Architecture**: Advanced two-tokenizer system enabling both prompt-based and description-based generation
- **🌍 Multi-Accent Support**: Coverage for American, British, Australian, Canadian, South African, Italian, and Irish accents

### **Technical Specifications**
- **Base Model**: `parler-tts/parler-tts-mini-v1.1`
- **Training Data**: 650+ hours of curated proprietary audio (dataset release coming soon - stay tuned!)
- **Architecture**: Two-tokenizer flow for enhanced control and consistency
- **Output Quality**: 24kHz high-fidelity audio generation

---

## 📈 **Technical Performance**

Our technical evaluation demonstrates strong performance across key metrics:

1. **🏆 Performance Benchmarks**: Achieved 95.2% speaker similarity consistency across different emotional states and 4.7/5.0 naturalness score in comprehensive human evaluations

2. **🔬 Architecture Studies**: Analysis showed the two-tokenizer approach provides improved expressive control compared to single-tokenizer baselines

3. **⚖️ Comparative Analysis**: Offers competitive inference speed while maintaining high audio quality at 24kHz resolution

4. **🌍 Dataset Quality**: The 650+ hour curated proprietary dataset supports 85 distinct voice identities across 7 accent categories (public release coming soon!)

📊 **[View Full Technical Report & Audio Samples](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)**

---

## 🛠 **Installation**

```bash
# Install base dependencies
pip install git+https://github.com/huggingface/parler-tts.git

# Install ParlerVoice (for advanced features and presets);
# run this from a local clone of the ParlerVoice GitHub repository
pip install -r requirements.txt
```

---

## 💻 **Usage**

### **Quick Start with Transformers API**

```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model and its two tokenizers: one for the text to be spoken (prompt)
# and one for the voice description (taken from the text encoder's config)
model = ParlerTTSForConditionalGeneration.from_pretrained("TieIncred/ParlerVoice").to(device)
prompt_tokenizer = AutoTokenizer.from_pretrained("TieIncred/ParlerVoice")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Hey, how are you doing today?"
description = (
    "Connor conveys a neutral mood through a professional and controlled delivery. "
    "He speaks with a slightly low pitch, adding subtle weight to his delivery. "
    "His pace is moderate, keeping the speech easy to follow. "
    "His voice is slightly expressive, with subtle emotional inflections. "
    "The recording is exceptionally clean and close-sounding."
)

# Tokenize the voice description and the text to be spoken
desc_inputs = description_tokenizer(description, return_tensors="pt").to(device)
prompt_inputs = prompt_tokenizer(prompt, return_tensors="pt").to(device)

# The description conditions the voice; the prompt is the text that is spoken
gen = model.generate(
    input_ids=desc_inputs.input_ids,
    attention_mask=desc_inputs.attention_mask,
    prompt_input_ids=prompt_inputs.input_ids,
    prompt_attention_mask=prompt_inputs.attention_mask,
)

# Move the waveform to the CPU and save it at the model's sampling rate
audio_arr = gen.cpu().numpy().squeeze()
sf.write("parlervoice_out.wav", audio_arr, model.config.sampling_rate)
```

### **Advanced Usage with Speaker Presets** (Recommended)

For best results, use the ParlerVoice inference engine from the [GitHub repository](https://github.com/VoicingAI/ParlerVoice):

```python
from parlervoice_infer.engine import ParlerVoiceInference
from parlervoice_infer.config import GenerationConfig

# Initialize the engine
infer = ParlerVoiceInference(
    checkpoint_path="TieIncred/ParlerVoice",
    base_model_path="parler-tts/parler-tts-mini-v1.1",
)

# Generate with speaker preset
cfg = GenerationConfig()
audio, path = infer.generate_with_speaker_preset(
    prompt="Welcome to the future of voice AI!",
    speaker="Connor",  # Choose from 85 available speakers
    preset="professional",  # Options: casual, narration, dramatic, podcast, news_anchor
    config=cfg,
    output_path="welcome_voice.wav",
)
```

### **Maximum Control with Rich Descriptions**

```python
# For maximum control and consistency, pass a full description.
# Reuses the `infer` engine created in the speaker-preset example.
desc = (
    "Connor conveys a confident, professional tone with a warm and engaging delivery. "
    "He speaks with a moderate pace, clear articulation, and subtle emotional warmth. "
    "His voice has a rich, resonant quality that commands attention while remaining approachable. "
    "The recording is clean and professional with minimal background noise."
)

audio, path = infer.generate_audio(
    prompt="Innovation in AI voice technology continues to push boundaries.",
    description=desc,
    output_path="innovative_voice.wav",
)
```

### **Command Line Interface**

```bash
python -m parlervoice_infer \
  --checkpoint "TieIncred/ParlerVoice" \
  --prompt "Experience the next generation of voice synthesis!" \
  --speaker Connor \
  --preset dramatic \
  --output parlervoice_demo.wav
```
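
To render the same line with several speakers, one option is to build the CLI invocation programmatically. The sketch below uses only the flags shown above; the helper name `build_cli_command` and the output file naming are illustrative assumptions, and nothing is executed unless you add the `subprocess.run` call yourself.

```python
import shlex


def build_cli_command(prompt, speaker, preset, output,
                      checkpoint="TieIncred/ParlerVoice"):
    """Return the argv list for one `python -m parlervoice_infer` call.

    Hypothetical helper: it only assembles the flags shown in this README.
    """
    return [
        "python", "-m", "parlervoice_infer",
        "--checkpoint", checkpoint,
        "--prompt", prompt,
        "--speaker", speaker,
        "--preset", preset,
        "--output", output,
    ]


prompt = "Experience the next generation of voice synthesis!"
commands = [
    build_cli_command(prompt, speaker, "dramatic", f"demo_{speaker.lower()}.wav")
    for speaker in ("Connor", "Sophie")
]

for cmd in commands:
    # Print a shell-safe version of each command for inspection.
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when the prompt contains spaces or punctuation.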

---

## 🗣️ **Speaker Library**

ParlerVoice features an extensive collection of **85 professionally curated speaker identities**:

### **🇺🇸 American Speakers**

**Male:** Tyler, Ryan, Jackson, Kyle, Derek, Cameron, Marcus, Ethan, Parker, Hayden, Grant, Chase, Tucker, Dalton, Zach, Brandon, Austin, Trevor, Jordan, Nathan, Blake, Garrett, Caleb, Logan, Hunter, Mason, Colton, Flynn, Devin, Carson, Preston, Landon, Bryce, Jasper, Cole, Noah, Taylor, Trent, Shane, Jared, Reid, Spencer, Wyatt, Luke, Cody, Drew, Henry, Vincent, Nolan, Kane, Ian, Kent, Jace, Max, Reed, Wade, George, Seth, Cruz, Miles, John, Michael

**Female:** Madison, Ashley, Jennifer, Samantha, Brittany, Camille, Rachel, Paige, Haley, Megan, Alexis, Zara, Grace, Alice, Olivia

### **🇬🇧 British Speakers**
- Oliver (Male)
- Sophie (Female)

### **🇦🇺 Australian / New Zealand**
- **Male:** Liam, Finn
- **Female:** Ruby, Emma, Chloe

### **🌍 International Accents**
- **Connor** (Male, Canadian)
- **Thabo** (Male, South African)
- **Marco** (Male, Italian)
- **Cian** (Male, Irish)
- **Wei** (Male, Chinese)
- **Aoife** (Female, Irish)
- **Siobhan** (Female, Irish)
- **Johan** (Male, Dutch)
- **Pieter** (Male, Dutch)
- **Ingrid** (Female, Dutch)
- **Priya** (Female, Indian)
- **Mei, Lin, Xiao, Li, Jing, Yan** (Chinese)
- **Elena** (Female, Spanish/European)

*Full details in the [technical documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)*

---

## ⚡ **Key Capabilities**

### **🎭 Expressive Control**
- **Natural Language Descriptions**: Control emotion, tone, pace, and style through intuitive text descriptions
- **Real-time Adjustment**: Modify expressiveness on-the-fly for dynamic content
- **Contextual Awareness**: Maintains consistency across long-form content

### **🔊 Audio Quality**
- **High-Fidelity Output**: 24kHz crystal-clear audio reproduction
- **Noise Control**: Advanced background noise and reverb management
- **Speaker Consistency**: Maintains voice identity across different emotional states

### **🚀 Performance Optimizations**
- **Efficient Inference**: Optimized for both CPU and GPU deployment
- **Batch Processing**: Handle multiple requests simultaneously
- **Streaming Support**: Real-time audio generation capabilities
- **Compatible with SDPA and compile optimizations** from upstream Parler-TTS

For optimization tips, see [Parler-TTS INFERENCE.md](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md)
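
As a minimal sketch of the SDPA option, the upstream Parler-TTS inference guide passes `attn_implementation="sdpa"` to `from_pretrained`. The helper name `load_with_sdpa` is hypothetical, and how much this checkpoint gains from SDPA is an assumption to verify on your own hardware.

```python
def load_with_sdpa(model_id: str = "TieIncred/ParlerVoice", device: str = "cpu"):
    """Load the checkpoint with PyTorch scaled dot-product attention (SDPA).

    Hypothetical helper; the `attn_implementation` argument follows the
    upstream Parler-TTS inference guide.
    """
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed.
    from parler_tts import ParlerTTSForConditionalGeneration

    model = ParlerTTSForConditionalGeneration.from_pretrained(
        model_id,
        attn_implementation="sdpa",
    )
    return model.to(device)


# Example (downloads the checkpoint, so it is left commented out):
# model = load_with_sdpa(device="cuda:0")
```

The returned model can be used with the same tokenizers and `generate` call as in the Quick Start example.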

---

## 💡 **Best Practices**

### Recommended Usage for Optimal Results
- **Use speaker presets** from the repository for consistent, high-quality outputs
- **Include named speakers** in descriptions to bias towards specific voice identities
- **Provide detailed descriptions** for maximum control over expressiveness and tone
- **Pull latest updates** from the repo as we actively refine description phrasing

### Example Description Template
```
[Speaker Name] conveys a [emotion] mood through a [style] delivery. 
They speak with a [pitch level] pitch and [pace] pace. 
The voice is [expressiveness level], with [characteristics]. 
The recording is [quality level] with [background description].
```
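
One way to keep descriptions consistent is to fill the template from structured fields. The sketch below is illustrative only: the `VoiceDescription` class and its field names are assumptions that mirror the bracketed slots above, not part of the ParlerVoice package.

```python
from dataclasses import dataclass


@dataclass
class VoiceDescription:
    """Fields mirror the bracketed slots in the description template."""
    speaker: str
    emotion: str
    style: str
    pitch: str
    pace: str
    expressiveness: str
    characteristics: str
    quality: str
    background: str

    def render(self) -> str:
        # One sentence per template line, joined with single spaces.
        sentences = [
            f"{self.speaker} conveys a {self.emotion} mood through a {self.style} delivery.",
            f"They speak with a {self.pitch} pitch and {self.pace} pace.",
            f"The voice is {self.expressiveness}, with {self.characteristics}.",
            f"The recording is {self.quality} with {self.background}.",
        ]
        return " ".join(sentences)


description = VoiceDescription(
    speaker="Connor",
    emotion="neutral",
    style="professional",
    pitch="slightly low",
    pace="moderate",
    expressiveness="slightly expressive",
    characteristics="subtle emotional inflections",
    quality="exceptionally clean",
    background="no background noise",
).render()
print(description)
```

The resulting string can be passed as the `description` argument in the examples above. Values that start with a vowel (for example `emotion="excited"`) need the article adjusted by hand.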

---

## 📋 **License**

This project is licensed under the **MIT License**.

**Open Source & Free to Use** - ParlerVoice is available for:
- ✅ Commercial applications and services
- ✅ Academic research and educational purposes
- ✅ Personal projects and community contributions
- ✅ Integration into other products and services
- ✅ Modification and redistribution

---

## 📚 **Citations**

If you use this work, please consider citing:

```bibtex
@software{iqbal2025parlervoice,
  title={ParlerVoice: Expressive Text-to-Speech with Advanced Speaker Control},
  author={Tausif Iqbal and Zeeshan and Anant},
  year={2025},
  publisher={VoicingAI R\&D Labs},
  url={https://github.com/VoicingAI/ParlerVoice}
}

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

@misc{lyth2024natural,
  title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author={Dan Lyth and Simon King},
  year={2024},
  eprint={2402.01912},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```

---

## 🔗 **Resources**

- **📦 GitHub Repository**: [VoicingAI/ParlerVoice](https://github.com/VoicingAI/ParlerVoice)
- **📊 Technical Report & Samples**: [Notion Documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)
- **🤗 Hugging Face Model**: [TieIncred/ParlerVoice](https://huggingface.co/TieIncred/ParlerVoice)
- **🎯 Base Model**: [parler-tts/parler-tts-mini-v1.1](https://huggingface.co/parler-tts/parler-tts-mini-v1.1)

---

<div align="center">

**Made with ❤️ by VoicingAI R&D Labs**

**Principal Researcher**: [Tausif Iqbal](https://www.linkedin.com/in/tausif-iqbal-77819a182/)

**Core Team**: [Zeeshan](https://www.linkedin.com/in/zeeshan-parvez/) • [Anant](https://www.linkedin.com/in/anant-upadhyay-052524256/)

*Developed at [VoicingAI](https://voicing.ai)*

</div>