---
library_name: transformers
pipeline_tag: text-to-speech
license: mit
language:
- en
inference: false
tags:
- text-to-speech
- tts
- expressive
- parler-tts
- voice-synthesis
- multi-speaker
- audio
base_model: parler-tts/parler-tts-mini-v1.1
---
<div align="center">
<img src="https://huggingface.co/voicing-ai/ParlerVoice/resolve/main/logo.svg" alt="VoicingAI Logo" width="200"/>
# ParlerVoice
### **Professional Text-to-Speech by VoicingAI R&D Labs**
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Hugging Face](https://img.shields.io/badge/πŸ€—%20Hugging%20Face-Model-blue.svg)](https://huggingface.co/TieIncred/ParlerVoice)
**ParlerVoice** is an advanced text-to-speech model offering enhanced expressive control and speaker consistency. Built on proven neural architectures and trained on extensive curated datasets, ParlerVoice provides high-quality voice synthesis capabilities.
</div>
---
## ✨ **Key Features**
- **πŸ† Extensive Training Data**: Fine-tuned on 650+ hours of carefully curated, high-quality proprietary audio data (dataset release coming soon!)
- **πŸ‘₯ Comprehensive Speaker Library**: 85 distinct speaker identities with consistent, recognizable voices across different accents and demographics
- **🎭 Advanced Expressiveness**: Precise control over tone, emotion, pitch, pace, style, reverb, and background noise through natural language descriptions
- **πŸ”¬ Technical Architecture**: Advanced two-tokenizer system enabling both prompt-based and description-based generation
- **🌍 Multi-Accent Support**: Coverage for American, British, Australian, Canadian, South African, Italian, and Irish accents
### **Technical Specifications**
- **Base Model**: `parler-tts/parler-tts-mini-v1.1`
- **Training Data**: 650+ hours of curated proprietary audio (dataset release coming soon)
- **Architecture**: Two-tokenizer flow for enhanced control and consistency
- **Output Quality**: 24kHz high-fidelity audio generation
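Because output is fixed at 24 kHz mono, duration follows directly from the sample count. A minimal standard-library sketch of writing 24 kHz 16-bit PCM (the sine tone here is a synthetic stand-in for model output, not ParlerVoice audio):

```python
import wave, struct, math

SAMPLE_RATE = 24_000  # ParlerVoice output rate

# Synthetic 0.5 s, 440 Hz tone standing in for model output (floats in [-1, 1]).
samples = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE // 2)]
duration_s = len(samples) / SAMPLE_RATE  # sample count / rate = 0.5 s

# Write as 16-bit PCM WAV using only the standard library.
with wave.open("tone_24k.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))
```

In practice, `soundfile.sf.write(...)` (as in the usage examples below) handles this conversion for you; the sketch only makes the 24 kHz bookkeeping explicit.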
---
## πŸ“ˆ **Technical Performance**
Our technical evaluation demonstrates strong performance across key metrics:
1. **πŸ† Performance Benchmarks**: Achieved 95.2% speaker similarity consistency across different emotional states and 4.7/5.0 naturalness score in comprehensive human evaluations
2. **πŸ”¬ Architecture Studies**: Analysis showed the two-tokenizer approach provides improved expressive control compared to single-tokenizer baselines
3. **βš–οΈ Comparative Analysis**: Offers competitive inference speed while maintaining high audio quality at 24kHz resolution
4. **🌍 Dataset Quality**: The 650+ hour curated proprietary dataset supports 85 distinct voice identities across 7 accent categories (public release coming soon!)
πŸ“Š **[View Full Technical Report & Audio Samples](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)**
---
## πŸ›  **Installation**
```bash
# Install base dependencies
pip install git+https://github.com/huggingface/parler-tts.git
# Install ParlerVoice (for advanced features and presets)
git clone https://github.com/VoicingAI/ParlerVoice.git
cd ParlerVoice
pip install -r requirements.txt
```
---
## πŸ’» **Usage**
### **Quick Start with Transformers API**
```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf
device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Load the model
model = ParlerTTSForConditionalGeneration.from_pretrained("TieIncred/ParlerVoice").to(device)
prompt_tokenizer = AutoTokenizer.from_pretrained("TieIncred/ParlerVoice")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)
prompt = "Hey, how are you doing today?"
description = (
"Connor conveys a neutral mood through a professional and controlled delivery. "
"He speaks with a slightly low pitch, adding subtle weight to his delivery. "
"His pace is moderate, keeping the speech easy to follow. "
"His voice is slightly expressive, with subtle emotional inflections. "
"The recording is exceptionally clean and close-sounding."
)
desc_inputs = description_tokenizer(description, return_tensors="pt").to(device)
prompt_inputs = prompt_tokenizer(prompt, return_tensors="pt").to(device)
gen = model.generate(
input_ids=desc_inputs.input_ids,
attention_mask=desc_inputs.attention_mask,
prompt_input_ids=prompt_inputs.input_ids,
prompt_attention_mask=prompt_inputs.attention_mask,
)
audio_arr = gen.cpu().numpy().squeeze()
sf.write("parlervoice_out.wav", audio_arr, model.config.sampling_rate)
```
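Long scripts are often easier to generate one sentence at a time with the same description, then stitched together. A hedged NumPy sketch of the concatenation step; the arrays below are placeholders for per-sentence outputs of `model.generate(...).cpu().numpy().squeeze()`, and the 200 ms gap length is an illustrative choice, not a model requirement:

```python
import numpy as np

SAMPLE_RATE = 24_000  # matches model.config.sampling_rate for this checkpoint

# Placeholders for per-sentence model outputs (1.0 s and 0.5 s of audio).
chunks = [
    np.zeros(SAMPLE_RATE, dtype=np.float32),
    np.ones(SAMPLE_RATE // 2, dtype=np.float32),
]

# Insert 200 ms of silence between sentences so the joins sound natural.
gap = np.zeros(int(0.2 * SAMPLE_RATE), dtype=np.float32)
parts = []
for i, chunk in enumerate(chunks):
    if i:
        parts.append(gap)
    parts.append(chunk)
full_audio = np.concatenate(parts)

# Then write as usual: sf.write("long_form.wav", full_audio, SAMPLE_RATE)
```

Reusing one description string across all chunks helps keep the voice identity stable over the whole passage.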
### **Advanced Usage with Speaker Presets** (Recommended)
For best results, use the ParlerVoice inference engine from the [GitHub repository](https://github.com/VoicingAI/ParlerVoice):
```python
from parlervoice_infer.engine import ParlerVoiceInference
from parlervoice_infer.config import GenerationConfig
# Initialize the engine
infer = ParlerVoiceInference(
checkpoint_path="TieIncred/ParlerVoice",
base_model_path="parler-tts/parler-tts-mini-v1.1",
)
# Generate with speaker preset
cfg = GenerationConfig()
audio, path = infer.generate_with_speaker_preset(
prompt="Welcome to the future of voice AI!",
speaker="Connor", # Choose from 85 available speakers
preset="professional", # Options: casual, narration, dramatic, podcast, news_anchor
config=cfg,
output_path="welcome_voice.wav",
)
```
### **Maximum Control with Rich Descriptions**
```python
# For maximum control and consistency
desc = (
"Connor conveys a confident, professional tone with a warm and engaging delivery. "
"He speaks with a moderate pace, clear articulation, and subtle emotional warmth. "
"His voice has a rich, resonant quality that commands attention while remaining approachable. "
"The recording is clean and professional with minimal background noise."
)
audio, path = infer.generate_audio(
prompt="Innovation in AI voice technology continues to push boundaries.",
description=desc,
output_path="innovative_voice.wav",
)
```
### **Command Line Interface**
```bash
python -m parlervoice_infer \
--checkpoint "TieIncred/ParlerVoice" \
--prompt "Experience the next generation of voice synthesis!" \
--speaker Connor \
--preset dramatic \
--output parlervoice_demo.wav
```
---
## πŸ—£οΈ **Speaker Library**
ParlerVoice features an extensive collection of **85 professionally curated speaker identities**:
### **πŸ‡ΊπŸ‡Έ American Speakers**
**Male:** Tyler, Ryan, Jackson, Kyle, Derek, Cameron, Marcus, Ethan, Parker, Hayden, Grant, Chase, Tucker, Dalton, Zach, Brandon, Austin, Trevor, Jordan, Nathan, Blake, Garrett, Caleb, Logan, Hunter, Mason, Colton, Flynn, Devin, Carson, Preston, Landon, Bryce, Jasper, Cole, Noah, Taylor, Trent, Shane, Jared, Reid, Spencer, Wyatt, Luke, Cody, Drew, Henry, Vincent, Nolan, Kane, Ian, Kent, Jace, Max, Reed, Wade, George, Seth, Cruz, Miles, John, Michael
**Female:** Madison, Ashley, Jennifer, Samantha, Brittany, Camille, Rachel, Paige, Haley, Megan, Alexis, Zara, Grace, Alice, Olivia
### **πŸ‡¬πŸ‡§ British Speakers**
- Oliver (Male)
- Sophie (Female)
### **πŸ‡¦πŸ‡Ί Australian / New Zealand**
- **Male:** Liam, Finn
- **Female:** Ruby, Emma, Chloe
### **🌍 International Accents**
- **Connor** (Male, Canadian)
- **Thabo** (Male, South African)
- **Marco** (Male, Italian)
- **Cian** (Male, Irish)
- **Wei** (Male, Chinese)
- **Aoife** (Female, Irish)
- **Siobhan** (Female, Irish)
- **Johan** (Male, Dutch)
- **Pieter** (Male, Dutch)
- **Ingrid** (Female, Dutch)
- **Priya** (Female, Indian)
- **Mei, Lin, Xiao, Li, Jing, Yan** (Chinese)
- **Elena** (Female, Spanish/European)
*Full details in the [technical documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)*
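When wiring speaker names into an application, it helps to validate them before calling generation. A sketch using a small, partial accent-to-speaker mapping; this is an illustration only, not the full 85-voice roster and not an official ParlerVoice API:

```python
# Partial mapping for illustration; see the technical documentation for the full roster.
SPEAKERS_BY_ACCENT = {
    "american": ["Tyler", "Ryan", "Madison", "Ashley"],
    "british": ["Oliver", "Sophie"],
    "canadian": ["Connor"],
    "irish": ["Cian", "Aoife", "Siobhan"],
}

def accent_of(speaker: str) -> str:
    """Return the accent for a known speaker, or raise ValueError."""
    for accent, names in SPEAKERS_BY_ACCENT.items():
        if speaker in names:
            return accent
    raise ValueError(f"Unknown speaker: {speaker!r}")
```

Validating up front gives a clear error instead of a silently off-target voice when a name is misspelled.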
---
## ⚑ **Key Capabilities**
### **🎭 Expressive Control**
- **Natural Language Descriptions**: Control emotion, tone, pace, and style through intuitive text descriptions
- **Real-time Adjustment**: Modify expressiveness on-the-fly for dynamic content
- **Contextual Awareness**: Maintains consistency across long-form content
### **πŸ”Š Audio Quality**
- **High-Fidelity Output**: 24kHz crystal-clear audio reproduction
- **Noise Control**: Advanced background noise and reverb management
- **Speaker Consistency**: Maintains voice identity across different emotional states
### **πŸš€ Performance Optimizations**
- **Efficient Inference**: Optimized for both CPU and GPU deployment
- **Batch Processing**: Handle multiple requests simultaneously
- **Streaming Support**: Real-time audio generation capabilities
- **Upstream Compatibility**: Works with the SDPA attention and `torch.compile` optimizations from upstream Parler-TTS

For optimization tips, see [Parler-TTS INFERENCE.md](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md).
---
## πŸ’‘ **Best Practices**
### Recommended Usage for Optimal Results
- **Use speaker presets** from the repository for consistent, high-quality outputs
- **Include named speakers** in descriptions to bias towards specific voice identities
- **Provide detailed descriptions** for maximum control over expressiveness and tone
- **Pull latest updates** from the repo as we actively refine description phrasing
### Example Description Template
```
[Speaker Name] conveys a [emotion] mood through a [style] delivery.
They speak with a [pitch level] pitch and [pace] pace.
The voice is [expressiveness level], with [characteristics].
The recording is [quality level] with [background description].
```
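As an illustration, the template can also be filled programmatically. The helper and field names below are hypothetical conveniences, not part of the ParlerVoice API:

```python
# Hypothetical helper mirroring the description template above.
TEMPLATE = (
    "{speaker} conveys a {emotion} mood through a {style} delivery. "
    "They speak with a {pitch} pitch and {pace} pace. "
    "The voice is {expressiveness}, with {characteristics}. "
    "The recording is {quality} with {background}."
)

def build_description(**fields: str) -> str:
    """Fill the description template; raises KeyError if a field is missing."""
    return TEMPLATE.format(**fields)

desc = build_description(
    speaker="Connor", emotion="neutral", style="professional and controlled",
    pitch="slightly low", pace="moderate",
    expressiveness="slightly expressive", characteristics="subtle emotional inflections",
    quality="exceptionally clean", background="no audible background noise",
)
```

Generating descriptions this way keeps phrasing consistent across batches, which in turn helps speaker consistency.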
---
## πŸ“‹ **License**
This project is licensed under the **MIT License**.
**Open Source & Free to Use** - ParlerVoice is available for:
- βœ… Commercial applications and services
- βœ… Academic research and educational purposes
- βœ… Personal projects and community contributions
- βœ… Integration into other products and services
- βœ… Modification and redistribution
---
## πŸ“š **Citations**
If you use this work, please consider citing:
```bibtex
@software{iqbal2025parlervoice,
title={ParlerVoice: Expressive Text-to-Speech with Advanced Speaker Control},
author={Tausif Iqbal and Zeeshan and Anant},
year={2025},
publisher={VoicingAI R\&D Labs},
url={https://github.com/VoicingAI/ParlerVoice}
}
@misc{lacombe-etal-2024-parler-tts,
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
title = {Parler-TTS},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
@misc{lyth2024natural,
title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
author={Dan Lyth and Simon King},
year={2024},
eprint={2402.01912},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
```
---
## πŸ”— **Resources**
- **πŸ“¦ GitHub Repository**: [VoicingAI/ParlerVoice](https://github.com/VoicingAI/ParlerVoice)
- **πŸ“Š Technical Report & Samples**: [Notion Documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)
- **πŸ€— Hugging Face Model**: [TieIncred/ParlerVoice](https://huggingface.co/TieIncred/ParlerVoice)
- **🎯 Base Model**: [parler-tts/parler-tts-mini-v1.1](https://huggingface.co/parler-tts/parler-tts-mini-v1.1)
---
<div align="center">
**Made with ❀️ by VoicingAI R&D Labs**
**Principal Researcher**: [Tausif Iqbal](https://www.linkedin.com/in/tausif-iqbal-77819a182/)
**Core Team**: [Zeeshan](https://www.linkedin.com/in/zeeshan-parvez/) β€’ [Anant](https://www.linkedin.com/in/anant-upadhyay-052524256/)
*Developed at [VoicingAI](https://voicing.ai)*
</div>