---
library_name: transformers
pipeline_tag: text-to-speech
license: mit
language:
- en
inference: false
tags:
- text-to-speech
- tts
- expressive
- parler-tts
- voice-synthesis
- multi-speaker
- audio
base_model: parler-tts/parler-tts-mini-v1.1
---
# ParlerVoice

### **Professional Text-to-Speech by VoicingAI R&D Labs**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue.svg)](https://huggingface.co/TieIncred/ParlerVoice)

**ParlerVoice** is a text-to-speech model fine-tuned from `parler-tts/parler-tts-mini-v1.1` for expressive control and speaker consistency. It is trained on 650+ hours of curated audio and offers 85 named speaker identities whose delivery is steered through natural-language descriptions.


---

## ✨ **Key Features**

- **🏆 Extensive Training Data**: Fine-tuned on 650+ hours of carefully curated, high-quality proprietary audio data (dataset release coming soon!)
- **👥 Comprehensive Speaker Library**: 85 distinct speaker identities with consistent, recognizable voices across different accents and demographics
- **🎭 Advanced Expressiveness**: Precise control over tone, emotion, pitch, pace, style, reverb, and background noise through natural language descriptions
- **🔬 Technical Architecture**: Two-tokenizer system enabling both prompt-based and description-based generation
- **🌍 Multi-Accent Support**: Coverage for American, British, Australian, Canadian, South African, Italian, and Irish accents

### **Technical Specifications**

- **Base Model**: `parler-tts/parler-tts-mini-v1.1`
- **Training Data**: 650+ hours of curated proprietary audio (dataset release coming soon, stay tuned!)
- **Architecture**: Two-tokenizer flow for enhanced control and consistency
- **Output Quality**: 24 kHz high-fidelity audio generation

---

## 📈 **Technical Performance**

Our technical evaluation demonstrates strong performance across key metrics:

1. **🏆 Performance Benchmarks**: Achieved 95.2% speaker similarity consistency across different emotional states and a 4.7/5.0 naturalness score in human evaluations
2. **🔬 Architecture Studies**: Analysis showed the two-tokenizer approach provides improved expressive control compared to single-tokenizer baselines
3. **⚖️ Comparative Analysis**: Offers competitive inference speed while maintaining high audio quality at 24 kHz resolution
4. **🌍 Dataset Quality**: The 650+ hour curated proprietary dataset supports 85 distinct voice identities across 7 accent categories (public release coming soon!)


📊 **[View Full Technical Report & Audio Samples](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)**

---

## 🛠 **Installation**

```bash
# Install base dependencies
pip install git+https://github.com/huggingface/parler-tts.git

# Install ParlerVoice (for advanced features and presets);
# run this from a clone of the ParlerVoice GitHub repository
pip install -r requirements.txt
```

---

## 💻 **Usage**

### **Quick Start with Transformers API**

```python
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model
model = ParlerTTSForConditionalGeneration.from_pretrained("voicing-ai/ParlerVoice").to(device)

# Two tokenizers: one for the text to speak, one for the voice description
prompt_tokenizer = AutoTokenizer.from_pretrained("voicing-ai/ParlerVoice")
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path)

prompt = "Hey, how are you doing today?"
description = (
    "Connor conveys a neutral mood through a professional and controlled delivery. "
    "He speaks with a slightly low pitch, adding subtle weight to his delivery. "
    "His pace is moderate, keeping the speech easy to follow. "
    "His voice is slightly expressive, with subtle emotional inflections. "
    "The recording is exceptionally clean and close-sounding."
)

desc_inputs = description_tokenizer(description, return_tensors="pt").to(device)
prompt_inputs = prompt_tokenizer(prompt, return_tensors="pt").to(device)

# The description conditions the voice; the prompt is the text that is spoken
gen = model.generate(
    input_ids=desc_inputs.input_ids,
    attention_mask=desc_inputs.attention_mask,
    prompt_input_ids=prompt_inputs.input_ids,
    prompt_attention_mask=prompt_inputs.attention_mask,
)

# Move the waveform to the CPU and write it at the model's sampling rate
audio_arr = gen.cpu().numpy().squeeze()
sf.write("parlervoice_out.wav", audio_arr, model.config.sampling_rate)
```

### **Advanced Usage with Speaker Presets** (Recommended)

For best results, use the ParlerVoice inference engine from the [GitHub repository](https://github.com/VoicingAI/ParlerVoice):

```python
from parlervoice_infer.engine import ParlerVoiceInference
from parlervoice_infer.config import GenerationConfig

# Initialize the engine
infer = ParlerVoiceInference(
    checkpoint_path="voicing-ai/ParlerVoice",
    base_model_path="parler-tts/parler-tts-mini-v1.1",
)

# Generate with speaker preset
cfg = GenerationConfig()
audio, path = infer.generate_with_speaker_preset(
    prompt="Welcome to the future of voice AI!",
    speaker="Connor",  # Choose from 85 available speakers
    preset="professional",  # Presets include: professional, casual, narration, dramatic, podcast, news_anchor
    config=cfg,
    output_path="welcome_voice.wav",
)
```

### **Maximum Control with Rich Descriptions**

```python
# For maximum control and consistency.
# Reuses the `infer` engine created in the previous example.
desc = (
    "Connor conveys a confident, professional tone with a warm and engaging delivery. "
    "He speaks with a moderate pace, clear articulation, and subtle emotional warmth. "
    "His voice has a rich, resonant quality that commands attention while remaining approachable. "
    "The recording is clean and professional with minimal background noise."
)

audio, path = infer.generate_audio(
    prompt="Innovation in AI voice technology continues to push boundaries.",
    description=desc,
    output_path="innovative_voice.wav",
)
```

### **Command Line Interface**

```bash
python -m parlervoice_infer \
  --checkpoint "voicing-ai/ParlerVoice" \
  --prompt "Experience the next generation of voice synthesis!" \
  --speaker Connor \
  --preset dramatic \
  --output parlervoice_demo.wav
```

---

## 🗣️ **Speaker Library**

ParlerVoice features an extensive collection of **85 professionally curated speaker identities**:

### **🇺🇸 American Speakers**

**Male:** Tyler, Ryan, Jackson, Kyle, Derek, Cameron, Marcus, Ethan, Parker, Hayden, Grant, Chase, Tucker, Dalton, Zach, Brandon, Austin, Trevor, Jordan, Nathan, Blake, Garrett, Caleb, Logan, Hunter, Mason, Colton, Flynn, Devin, Carson, Preston, Landon, Bryce, Jasper, Cole, Noah, Taylor, Trent, Shane, Jared, Reid, Spencer, Wyatt, Luke, Cody, Drew, Henry, Vincent, Nolan, Kane, Ian, Kent, Jace, Max, Reed, Wade, George, Seth, Cruz, Miles, John, Michael

**Female:** Madison, Ashley, Jennifer, Samantha, Brittany, Camille, Rachel, Paige, Haley, Megan, Alexis, Zara, Grace, Alice, Olivia

### **🇬🇧 British Speakers**

- Oliver (Male)
- Sophie (Female)

### **🇦🇺 Australian / New Zealand**

- **Male:** Liam, Finn
- **Female:** Ruby, Emma, Chloe

### **🌍 International Accents**

- **Connor** (Male, Canadian)
- **Thabo** (Male, South African)
- **Marco** (Male, Italian)
- **Cian** (Male, Irish)
- **Wei** (Male, Chinese)
- **Aoife** (Female, Irish)
- **Siobhan** (Female, Irish)
- **Johan** (Male, Dutch)
- **Pieter** (Male, Dutch)
- **Ingrid** (Female, Dutch)
- **Priya** (Female, Indian)
- **Mei, Lin, Xiao, Li, Jing, Yan** (Chinese)
- **Elena** (Female, Spanish/European)

*Full details in the [technical documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)*

---

## ⚑ **Key Capabilities**

### **🎭 Expressive Control**

- **Natural Language Descriptions**: Control emotion, tone, pace, and style through intuitive text descriptions
- **Real-time Adjustment**: Modify expressiveness on the fly for dynamic content
- **Contextual Awareness**: Maintains consistency across long-form content

### **🔊 Audio Quality**

- **High-Fidelity Output**: 24 kHz audio output
- **Noise Control**: Background noise and reverb are controllable through the description
- **Speaker Consistency**: Maintains voice identity across different emotional states

### **🚀 Performance Optimizations**

- **Efficient Inference**: Optimized for both CPU and GPU deployment
- **Batch Processing**: Handle multiple requests simultaneously
- **Streaming Support**: Real-time audio generation capabilities
- **Compatible with SDPA and compile optimizations** from upstream Parler-TTS

For optimization tips, see [Parler-TTS INFERENCE.md](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md)

---

## 💡 **Best Practices**

### Recommended Usage for Optimal Results

- **Use speaker presets** from the repository for consistent, high-quality outputs
- **Include named speakers** in descriptions to bias generation towards specific voice identities
- **Provide detailed descriptions** for maximum control over expressiveness and tone
- **Pull the latest updates** from the repo, as we actively refine description phrasing

### Example Description Template

```
[Speaker Name] conveys a [emotion] mood through a [style] delivery.
They speak with a [pitch level] pitch and [pace] pace.
The voice is [expressiveness level], with [characteristics].
The recording is [quality level] with [background description].
```

---

## 📋 **License**

This project is licensed under the **MIT License**.

**Open Source & Free to Use** - ParlerVoice is available for:

- ✅ Commercial applications and services
- ✅ Academic research and educational purposes
- ✅ Personal projects and community contributions
- ✅ Integration into other products and services
- ✅ Modification and redistribution

---

## 📚 **Citations**

If you use this work, please consider citing:

```bibtex
@software{iqbal2025parlervoice,
  title={ParlerVoice: Expressive Text-to-Speech with Advanced Speaker Control},
  author={Tausif Iqbal and Zeeshan and Anant},
  year={2025},
  publisher={VoicingAI R\&D Labs},
  url={https://github.com/VoicingAI/ParlerVoice}
}

@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}

@misc{lyth2024natural,
  title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author={Dan Lyth and Simon King},
  year={2024},
  eprint={2402.01912},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```

---

## 🔗 **Resources**

- **📦 GitHub Repository**: [VoicingAI/ParlerVoice](https://github.com/VoicingAI/ParlerVoice)
- **📊 Technical Report & Samples**: [Notion Documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)
- **🤗 Hugging Face Model**: [voicing-ai/ParlerVoice](https://huggingface.co/voicing-ai/ParlerVoice)
- **🎯 Base Model**: [parler-tts/parler-tts-mini-v1.1](https://huggingface.co/parler-tts/parler-tts-mini-v1.1)

---
**Made with ❤️ by VoicingAI R&D Labs**

**Principal Researcher**: [Tausif Iqbal](https://www.linkedin.com/in/tausif-iqbal-77819a182/)

**Core Team**: [Zeeshan](https://www.linkedin.com/in/zeeshan-parvez/) • [Anant](https://www.linkedin.com/in/anant-upadhyay-052524256/)

*Developed at [VoicingAI](https://voicing.ai)*