|
|
--- |
|
|
library_name: transformers |
|
|
pipeline_tag: text-to-speech |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
inference: false |
|
|
tags: |
|
|
- text-to-speech |
|
|
- tts |
|
|
- expressive |
|
|
- parler-tts |
|
|
- voice-synthesis |
|
|
- multi-speaker |
|
|
- audio |
|
|
base_model: parler-tts/parler-tts-mini-v1.1 |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://huggingface.co/voicing-ai/ParlerVoice/resolve/main/logo.svg" alt="VoicingAI Logo" width="200"/> |
|
|
|
|
|
# ParlerVoice |
|
|
|
|
|
### **Professional Text-to-Speech by VoicingAI R&D Labs** |
|
|
|
|
|
[License: MIT](https://opensource.org/licenses/MIT) · [Python](https://www.python.org/downloads/) · [Model on Hugging Face](https://huggingface.co/TieIncred/ParlerVoice)
|
|
|
|
|
**ParlerVoice** is an advanced text-to-speech model offering enhanced expressive control and speaker consistency. Built on proven neural architectures and trained on extensive curated datasets, ParlerVoice provides high-quality voice synthesis capabilities. |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## **Key Features**
|
|
|
|
|
- **Extensive Training Data**: Fine-tuned on 650+ hours of carefully curated, high-quality proprietary audio data (dataset release coming soon!)

- **Comprehensive Speaker Library**: 85 distinct speaker identities with consistent, recognizable voices across different accents and demographics

- **Advanced Expressiveness**: Precise control over tone, emotion, pitch, pace, style, reverb, and background noise through natural language descriptions

- **Two-Tokenizer Architecture**: Separate tokenizers for the transcript prompt and the voice description, enabling both prompt-based and description-based generation

- **Multi-Accent Support**: Coverage for American, British, Australian, Canadian, South African, Italian, and Irish accents
|
|
|
|
|
### **Technical Specifications** |
|
|
- **Base Model**: `parler-tts/parler-tts-mini-v1.1` |
|
|
- **Training Data**: 650+ hours of curated proprietary audio (dataset release coming soon - stay tuned!) |
|
|
- **Architecture**: Two-tokenizer flow for enhanced control and consistency |
|
|
- **Output Quality**: 24kHz high-fidelity audio generation |
|
|
|
|
|
--- |
|
|
|
|
|
## **Technical Performance**
|
|
|
|
|
Our technical evaluation demonstrates strong performance across key metrics: |
|
|
|
|
|
1. **Performance Benchmarks**: Achieved 95.2% speaker-similarity consistency across different emotional states and a 4.7/5.0 naturalness score in human evaluations

2. **Architecture Studies**: Analysis showed the two-tokenizer approach provides improved expressive control compared to single-tokenizer baselines

3. **Comparative Analysis**: Offers competitive inference speed while maintaining high audio quality at 24kHz

4. **Dataset Quality**: The 650+ hour curated proprietary dataset supports 85 distinct voice identities across 7 accent categories (public release coming soon!)

**[View Full Technical Report & Audio Samples](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)**
|
|
|
|
|
--- |
|
|
|
|
|
## **Installation**
|
|
|
|
|
```bash
# Install the Parler-TTS base library
pip install git+https://github.com/huggingface/parler-tts.git

# Install ParlerVoice dependencies (for advanced features and presets)
git clone https://github.com/VoicingAI/ParlerVoice.git
cd ParlerVoice
pip install -r requirements.txt
```
|
|
|
|
|
--- |
|
|
|
|
|
## **Usage**
|
|
|
|
|
### **Quick Start with Transformers API** |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from parler_tts import ParlerTTSForConditionalGeneration |
|
|
from transformers import AutoTokenizer |
|
|
import soundfile as sf |
|
|
|
|
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
|
|
|
|
# Load the model |
|
|
model = ParlerTTSForConditionalGeneration.from_pretrained("TieIncred/ParlerVoice").to(device) |
|
|
prompt_tokenizer = AutoTokenizer.from_pretrained("TieIncred/ParlerVoice") |
|
|
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path) |
|
|
|
|
|
prompt = "Hey, how are you doing today?" |
|
|
description = ( |
|
|
"Connor conveys a neutral mood through a professional and controlled delivery. " |
|
|
"He speaks with a slightly low pitch, adding subtle weight to his delivery. " |
|
|
"His pace is moderate, keeping the speech easy to follow. " |
|
|
"His voice is slightly expressive, with subtle emotional inflections. " |
|
|
"The recording is exceptionally clean and close-sounding." |
|
|
) |
|
|
|
|
|
desc_inputs = description_tokenizer(description, return_tensors="pt").to(device) |
|
|
prompt_inputs = prompt_tokenizer(prompt, return_tensors="pt").to(device) |
|
|
|
|
|
gen = model.generate( |
|
|
input_ids=desc_inputs.input_ids, |
|
|
attention_mask=desc_inputs.attention_mask, |
|
|
prompt_input_ids=prompt_inputs.input_ids, |
|
|
prompt_attention_mask=prompt_inputs.attention_mask, |
|
|
) |
|
|
|
|
|
audio_arr = gen.cpu().numpy().squeeze() |
|
|
sf.write("parlervoice_out.wav", audio_arr, model.config.sampling_rate) |
|
|
``` |
|
|
|
|
|
### **Advanced Usage with Speaker Presets** (Recommended) |
|
|
|
|
|
For best results, use the ParlerVoice inference engine from the [GitHub repository](https://github.com/VoicingAI/ParlerVoice): |
|
|
|
|
|
```python |
|
|
from parlervoice_infer.engine import ParlerVoiceInference |
|
|
from parlervoice_infer.config import GenerationConfig |
|
|
|
|
|
# Initialize the engine |
|
|
infer = ParlerVoiceInference( |
|
|
checkpoint_path="TieIncred/ParlerVoice", |
|
|
base_model_path="parler-tts/parler-tts-mini-v1.1", |
|
|
) |
|
|
|
|
|
# Generate with speaker preset |
|
|
cfg = GenerationConfig() |
|
|
audio, path = infer.generate_with_speaker_preset( |
|
|
prompt="Welcome to the future of voice AI!", |
|
|
speaker="Connor", # Choose from 85 available speakers |
|
|
preset="professional",  # Options: professional, casual, narration, dramatic, podcast, news_anchor
|
|
config=cfg, |
|
|
output_path="welcome_voice.wav", |
|
|
) |
|
|
``` |
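To make the preset mechanism concrete, here is a minimal, hypothetical sketch of how a preset name might expand into a full voice description. The preset names come from the options listed above; the `PRESET_PHRASES` mapping and its phrasing are illustrative, not the library's actual templates.

```python
# Hypothetical mapping from preset names to description phrases.
# The keys match the presets documented above; the values are
# illustrative and not taken from the ParlerVoice source.
PRESET_PHRASES = {
    "professional": "a neutral mood through a professional and controlled delivery",
    "casual": "a relaxed mood through a friendly, conversational delivery",
    "narration": "a calm mood through a steady, measured storytelling delivery",
    "dramatic": "an intense mood through an animated, highly expressive delivery",
    "podcast": "an engaging mood through a warm, intimate delivery",
    "news_anchor": "an authoritative mood through a crisp, formal delivery",
}

def build_preset_description(speaker: str, preset: str) -> str:
    """Combine a speaker name with a preset phrase into a full description."""
    if preset not in PRESET_PHRASES:
        raise ValueError(f"Unknown preset: {preset!r}")
    return (
        f"{speaker} conveys {PRESET_PHRASES[preset]}. "
        "The recording is exceptionally clean and close-sounding."
    )

print(build_preset_description("Connor", "professional"))
```

A description built this way can be passed directly to the plain `model.generate` flow shown in the Quick Start, which is useful if you want preset-like behavior without the inference engine.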
|
|
|
|
|
### **Maximum Control with Rich Descriptions** |
|
|
|
|
|
```python |
|
|
# For maximum control and consistency |
|
|
desc = ( |
|
|
"Connor conveys a confident, professional tone with a warm and engaging delivery. " |
|
|
"He speaks with a moderate pace, clear articulation, and subtle emotional warmth. " |
|
|
"His voice has a rich, resonant quality that commands attention while remaining approachable. " |
|
|
"The recording is clean and professional with minimal background noise." |
|
|
) |
|
|
|
|
|
audio, path = infer.generate_audio( |
|
|
prompt="Innovation in AI voice technology continues to push boundaries.", |
|
|
description=desc, |
|
|
output_path="innovative_voice.wav", |
|
|
) |
|
|
``` |
|
|
|
|
|
### **Command Line Interface** |
|
|
|
|
|
```bash |
|
|
python -m parlervoice_infer \ |
|
|
--checkpoint "TieIncred/ParlerVoice" \ |
|
|
--prompt "Experience the next generation of voice synthesis!" \ |
|
|
--speaker Connor \ |
|
|
--preset dramatic \ |
|
|
--output parlervoice_demo.wav |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## **Speaker Library**
|
|
|
|
|
ParlerVoice features an extensive collection of **85 professionally curated speaker identities**: |
|
|
|
|
|
### **American Speakers**
|
|
|
|
|
**Male:** Tyler, Ryan, Jackson, Kyle, Derek, Cameron, Marcus, Ethan, Parker, Hayden, Grant, Chase, Tucker, Dalton, Zach, Brandon, Austin, Trevor, Jordan, Nathan, Blake, Garrett, Caleb, Logan, Hunter, Mason, Colton, Flynn, Devin, Carson, Preston, Landon, Bryce, Jasper, Cole, Noah, Taylor, Trent, Shane, Jared, Reid, Spencer, Wyatt, Luke, Cody, Drew, Henry, Vincent, Nolan, Kane, Ian, Kent, Jace, Max, Reed, Wade, George, Seth, Cruz, Miles, John, Michael |
|
|
|
|
|
**Female:** Madison, Ashley, Jennifer, Samantha, Brittany, Camille, Rachel, Paige, Haley, Megan, Alexis, Zara, Grace, Alice, Olivia |
|
|
|
|
|
### **British Speakers**
|
|
- Oliver (Male) |
|
|
- Sophie (Female) |
|
|
|
|
|
### **Australian / New Zealand**
|
|
- **Male:** Liam, Finn |
|
|
- **Female:** Ruby, Emma, Chloe |
|
|
|
|
|
### **International Accents**
|
|
- **Connor** (Male, Canadian) |
|
|
- **Thabo** (Male, South African) |
|
|
- **Marco** (Male, Italian) |
|
|
- **Cian** (Male, Irish) |
|
|
- **Wei** (Male, Chinese) |
|
|
- **Aoife** (Female, Irish) |
|
|
- **Siobhan** (Female, Irish) |
|
|
- **Johan** (Male, Dutch) |
|
|
- **Pieter** (Male, Dutch) |
|
|
- **Ingrid** (Female, Dutch) |
|
|
- **Priya** (Female, Indian) |
|
|
- **Mei, Lin, Xiao, Li, Jing, Yan** (Chinese) |
|
|
- **Elena** (Female, Spanish/European) |
|
|
|
|
|
*Full details in the [technical documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)* |
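When selecting speakers programmatically, it can help to validate the requested name before generation. The sketch below groups a small subset of the speakers listed above by accent; the `SPEAKERS` dict and `find_accent` helper are illustrative only, not part of the ParlerVoice API.

```python
# Illustrative subset of the 85-speaker library, grouped by accent.
# Names are taken from the lists above; the grouping helper is a
# hypothetical convenience, not shipped with the library.
SPEAKERS = {
    "american": ["Tyler", "Ryan", "Madison", "Ashley"],
    "british": ["Oliver", "Sophie"],
    "australian": ["Liam", "Finn", "Ruby", "Emma", "Chloe"],
    "canadian": ["Connor"],
    "irish": ["Cian", "Aoife", "Siobhan"],
}

def find_accent(speaker: str) -> str:
    """Return the accent group for a speaker, or raise if unknown."""
    for accent, names in SPEAKERS.items():
        if speaker in names:
            return accent
    raise KeyError(f"Speaker {speaker!r} not in the local subset")

print(find_accent("Connor"))
```

Checking names up front avoids silently generating with a description that names a speaker the model was never trained on.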
|
|
|
|
|
--- |
|
|
|
|
|
## **Key Capabilities**
|
|
|
|
|
### **Expressive Control**
|
|
- **Natural Language Descriptions**: Control emotion, tone, pace, and style through intuitive text descriptions |
|
|
- **Real-time Adjustment**: Modify expressiveness on-the-fly for dynamic content |
|
|
- **Contextual Awareness**: Maintains consistency across long-form content |
|
|
|
|
|
### **Audio Quality**
|
|
- **High-Fidelity Output**: 24kHz crystal-clear audio reproduction |
|
|
- **Noise Control**: Advanced background noise and reverb management |
|
|
- **Speaker Consistency**: Maintains voice identity across different emotional states |
|
|
|
|
|
### **Performance Optimizations**
|
|
- **Efficient Inference**: Optimized for both CPU and GPU deployment |
|
|
- **Batch Processing**: Handle multiple requests simultaneously |
|
|
- **Streaming Support**: Real-time audio generation capabilities |
|
|
- **Compatible with SDPA and compile optimizations** from upstream Parler-TTS |
|
|
|
|
|
For optimization tips, see [Parler-TTS INFERENCE.md](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md) |
|
|
|
|
|
--- |
|
|
|
|
|
## **Best Practices**
|
|
|
|
|
### Recommended Usage for Optimal Results |
|
|
- **Use speaker presets** from the repository for consistent, high-quality outputs |
|
|
- **Include named speakers** in descriptions to bias towards specific voice identities |
|
|
- **Provide detailed descriptions** for maximum control over expressiveness and tone |
|
|
- **Pull latest updates** from the repo as we actively refine description phrasing |
|
|
|
|
|
### Example Description Template |
|
|
``` |
|
|
[Speaker Name] conveys a [emotion] mood through a [style] delivery. |
|
|
They speak with a [pitch level] pitch and [pace] pace. |
|
|
The voice is [expressiveness level], with [characteristics]. |
|
|
The recording is [quality level] with [background description]. |
|
|
``` |
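The template above can be filled mechanically. Here is a minimal sketch of a helper that does so; the function name and its field set are illustrative conveniences, not part of the ParlerVoice API.

```python
# Hypothetical helper that fills the description template above.
# Every parameter corresponds to one bracketed slot in the template.
def fill_description(speaker, emotion, style, pitch, pace,
                     expressiveness, characteristics, quality, background):
    """Build a full voice description from the template's slots."""
    return (
        f"{speaker} conveys a {emotion} mood through a {style} delivery. "
        f"They speak with a {pitch} pitch and {pace} pace. "
        f"The voice is {expressiveness}, with {characteristics}. "
        f"The recording is {quality} with {background}."
    )

desc = fill_description(
    speaker="Connor", emotion="neutral", style="professional and controlled",
    pitch="slightly low", pace="moderate",
    expressiveness="slightly expressive",
    characteristics="subtle emotional inflections",
    quality="exceptionally clean", background="no audible background noise",
)
print(desc)
```

Generating descriptions from a fixed template like this keeps phrasing consistent across requests, which in turn helps speaker consistency.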
|
|
|
|
|
--- |
|
|
|
|
|
## **License**
|
|
|
|
|
This project is licensed under the **MIT License**. |
|
|
|
|
|
**Open Source & Free to Use** - ParlerVoice is available for: |
|
|
- ✅ Commercial applications and services
- ✅ Academic research and educational purposes
- ✅ Personal projects and community contributions
- ✅ Integration into other products and services
- ✅ Modification and redistribution
|
|
|
|
|
--- |
|
|
|
|
|
## **Citations**
|
|
|
|
|
If you use this work, please consider citing: |
|
|
|
|
|
```bibtex |
|
|
@software{iqbal2025parlervoice, |
|
|
title={ParlerVoice: Expressive Text-to-Speech with Advanced Speaker Control}, |
|
|
author={Tausif Iqbal and Zeeshan and Anant}, |
|
|
year={2025}, |
|
|
publisher={VoicingAI R\&D Labs}, |
|
|
url={https://github.com/VoicingAI/ParlerVoice} |
|
|
} |
|
|
|
|
|
@misc{lacombe-etal-2024-parler-tts, |
|
|
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi}, |
|
|
title = {Parler-TTS}, |
|
|
year = {2024}, |
|
|
publisher = {GitHub}, |
|
|
journal = {GitHub repository}, |
|
|
howpublished = {\url{https://github.com/huggingface/parler-tts}} |
|
|
} |
|
|
|
|
|
@misc{lyth2024natural, |
|
|
title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations}, |
|
|
author={Dan Lyth and Simon King}, |
|
|
year={2024}, |
|
|
eprint={2402.01912}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.SD} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## **Resources**
|
|
|
|
|
- **GitHub Repository**: [VoicingAI/ParlerVoice](https://github.com/VoicingAI/ParlerVoice)

- **Technical Report & Samples**: [Notion Documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)

- **Hugging Face Model**: [TieIncred/ParlerVoice](https://huggingface.co/TieIncred/ParlerVoice)

- **Base Model**: [parler-tts/parler-tts-mini-v1.1](https://huggingface.co/parler-tts/parler-tts-mini-v1.1)
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Made with ❤️ by VoicingAI R&D Labs**
|
|
|
|
|
**Principal Researcher**: [Tausif Iqbal](https://www.linkedin.com/in/tausif-iqbal-77819a182/) |
|
|
|
|
|
**Core Team**: [Zeeshan](https://www.linkedin.com/in/zeeshan-parvez/) β’ [Anant](https://www.linkedin.com/in/anant-upadhyay-052524256/) |
|
|
|
|
|
*Developed at [VoicingAI](https://voicing.ai)* |
|
|
|
|
|
</div> |
|
|
|