|
|
--- |
|
|
library_name: transformers |
|
|
pipeline_tag: text-to-speech |
|
|
license: mit |
|
|
language: |
|
|
- en |
|
|
inference: false |
|
|
tags: |
|
|
- text-to-speech |
|
|
- tts |
|
|
- expressive |
|
|
- parler-tts |
|
|
- voice-synthesis |
|
|
- multi-speaker |
|
|
- audio |
|
|
base_model: parler-tts/parler-tts-mini-v1.1 |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://huggingface.co/voicing-ai/ParlerVoice/resolve/main/logo.svg" alt="VoicingAI Logo" width="200"/> |
|
|
|
|
|
# ParlerVoice |
|
|
|
|
|
### **Professional Text-to-Speech by VoicingAI R&D Labs** |
|
|
|
|
|
[License: MIT](https://opensource.org/licenses/MIT) · [Python](https://www.python.org/downloads/) · [Model on Hugging Face](https://huggingface.co/TieIncred/ParlerVoice)
|
|
|
|
|
**ParlerVoice** is an advanced text-to-speech model offering enhanced expressive control and speaker consistency. Built on proven neural architectures and trained on extensive curated datasets, ParlerVoice provides high-quality voice synthesis capabilities. |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## **Key Features**
|
|
|
|
|
- **Extensive Training Data**: Fine-tuned on 650+ hours of carefully curated, high-quality proprietary audio data (dataset release coming soon!)

- **Comprehensive Speaker Library**: 85 distinct speaker identities with consistent, recognizable voices across different accents and demographics

- **Advanced Expressiveness**: Precise control over tone, emotion, pitch, pace, style, reverb, and background noise through natural language descriptions

- **Two-Tokenizer Architecture**: Separate tokenizers for the transcript prompt and the voice description, enabling both prompt-based and description-based generation

- **Multi-Accent Support**: Coverage for American, British, Australian, Canadian, South African, Italian, and Irish accents
|
|
|
|
|
### **Technical Specifications** |
|
|
- **Base Model**: `parler-tts/parler-tts-mini-v1.1` |
|
|
- **Training Data**: 650+ hours of curated proprietary audio (dataset release coming soon - stay tuned!) |
|
|
- **Architecture**: Two-tokenizer flow for enhanced control and consistency |
|
|
- **Output Quality**: 24kHz high-fidelity audio generation |
|
|
|
|
|
--- |
|
|
|
|
|
## **Technical Performance**
|
|
|
|
|
Our technical evaluation demonstrates strong performance across key metrics: |
|
|
|
|
|
1. **Performance Benchmarks**: Achieved 95.2% speaker-similarity consistency across different emotional states and a 4.7/5.0 naturalness score in human evaluations

2. **Architecture Studies**: Analysis showed the two-tokenizer approach provides improved expressive control compared to single-tokenizer baselines

3. **Comparative Analysis**: Offers competitive inference speed while maintaining high audio quality at 24kHz

4. **Dataset Quality**: The 650+ hour curated proprietary dataset supports 85 distinct voice identities across 7 accent categories (public release coming soon!)

**[View Full Technical Report & Audio Samples](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)**
|
|
|
|
|
--- |
|
|
|
|
|
## **Installation**
|
|
|
|
|
```bash
# Install the Parler-TTS base library
pip install git+https://github.com/huggingface/parler-tts.git

# Install ParlerVoice dependencies (for advanced features and presets)
git clone https://github.com/VoicingAI/ParlerVoice.git
cd ParlerVoice
pip install -r requirements.txt
```
|
|
|
|
|
--- |
|
|
|
|
|
## **Usage**
|
|
|
|
|
### **Quick Start with Transformers API** |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from parler_tts import ParlerTTSForConditionalGeneration |
|
|
from transformers import AutoTokenizer |
|
|
import soundfile as sf |
|
|
|
|
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
|
|
|
|
# Load the model |
|
|
model = ParlerTTSForConditionalGeneration.from_pretrained("TieIncred/ParlerVoice").to(device) |
|
|
prompt_tokenizer = AutoTokenizer.from_pretrained("TieIncred/ParlerVoice") |
|
|
description_tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder._name_or_path) |
|
|
|
|
|
prompt = "Hey, how are you doing today?" |
|
|
description = ( |
|
|
"Connor conveys a neutral mood through a professional and controlled delivery. " |
|
|
"He speaks with a slightly low pitch, adding subtle weight to his delivery. " |
|
|
"His pace is moderate, keeping the speech easy to follow. " |
|
|
"His voice is slightly expressive, with subtle emotional inflections. " |
|
|
"The recording is exceptionally clean and close-sounding." |
|
|
) |
|
|
|
|
|
desc_inputs = description_tokenizer(description, return_tensors="pt").to(device) |
|
|
prompt_inputs = prompt_tokenizer(prompt, return_tensors="pt").to(device) |
|
|
|
|
|
gen = model.generate( |
|
|
input_ids=desc_inputs.input_ids, |
|
|
attention_mask=desc_inputs.attention_mask, |
|
|
prompt_input_ids=prompt_inputs.input_ids, |
|
|
prompt_attention_mask=prompt_inputs.attention_mask, |
|
|
) |
|
|
|
|
|
audio_arr = gen.cpu().numpy().squeeze() |
|
|
sf.write("parlervoice_out.wav", audio_arr, model.config.sampling_rate) |
|
|
``` |
|
|
|
|
|
### **Advanced Usage with Speaker Presets** (Recommended) |
|
|
|
|
|
For best results, use the ParlerVoice inference engine from the [GitHub repository](https://github.com/VoicingAI/ParlerVoice): |
|
|
|
|
|
```python |
|
|
from parlervoice_infer.engine import ParlerVoiceInference |
|
|
from parlervoice_infer.config import GenerationConfig |
|
|
|
|
|
# Initialize the engine |
|
|
infer = ParlerVoiceInference( |
|
|
checkpoint_path="TieIncred/ParlerVoice", |
|
|
base_model_path="parler-tts/parler-tts-mini-v1.1", |
|
|
) |
|
|
|
|
|
# Generate with speaker preset |
|
|
cfg = GenerationConfig() |
|
|
audio, path = infer.generate_with_speaker_preset( |
|
|
prompt="Welcome to the future of voice AI!", |
|
|
speaker="Connor", # Choose from 85 available speakers |
|
|
preset="professional",  # Options: professional, casual, narration, dramatic, podcast, news_anchor
|
|
config=cfg, |
|
|
output_path="welcome_voice.wav", |
|
|
) |
|
|
``` |
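To make the preset mechanism concrete, here is a minimal, hypothetical sketch of how a preset name might expand into a full voice description. The preset names come from the options listed above; the `PRESET_PHRASES` mapping and its phrasing are illustrative, not the library's actual templates.

```python
# Hypothetical mapping from preset names to description phrases.
# The keys match the presets documented above; the values are
# illustrative and not taken from the ParlerVoice source.
PRESET_PHRASES = {
    "professional": "a neutral mood through a professional and controlled delivery",
    "casual": "a relaxed mood through a friendly, conversational delivery",
    "narration": "a calm mood through a steady, measured storytelling delivery",
    "dramatic": "an intense mood through an animated, highly expressive delivery",
    "podcast": "an engaging mood through a warm, intimate delivery",
    "news_anchor": "an authoritative mood through a crisp, formal delivery",
}

def build_preset_description(speaker: str, preset: str) -> str:
    """Combine a speaker name with a preset phrase into a full description."""
    if preset not in PRESET_PHRASES:
        raise ValueError(f"Unknown preset: {preset!r}")
    return (
        f"{speaker} conveys {PRESET_PHRASES[preset]}. "
        "The recording is exceptionally clean and close-sounding."
    )

print(build_preset_description("Connor", "professional"))
```

A description built this way can be passed directly to the plain `model.generate` flow shown in the Quick Start, which is useful if you want preset-like behavior without the inference engine.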
|
|
|
|
|
### **Maximum Control with Rich Descriptions** |
|
|
|
|
|
```python |
|
|
# For maximum control and consistency |
|
|
desc = ( |
|
|
"Connor conveys a confident, professional tone with a warm and engaging delivery. " |
|
|
"He speaks with a moderate pace, clear articulation, and subtle emotional warmth. " |
|
|
"His voice has a rich, resonant quality that commands attention while remaining approachable. " |
|
|
"The recording is clean and professional with minimal background noise." |
|
|
) |
|
|
|
|
|
audio, path = infer.generate_audio( |
|
|
prompt="Innovation in AI voice technology continues to push boundaries.", |
|
|
description=desc, |
|
|
output_path="innovative_voice.wav", |
|
|
) |
|
|
``` |
|
|
|
|
|
### **Command Line Interface** |
|
|
|
|
|
```bash |
|
|
python -m parlervoice_infer \ |
|
|
--checkpoint "TieIncred/ParlerVoice" \ |
|
|
--prompt "Experience the next generation of voice synthesis!" \ |
|
|
--speaker Connor \ |
|
|
--preset dramatic \ |
|
|
--output parlervoice_demo.wav |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## **Speaker Library**
|
|
|
|
|
ParlerVoice features an extensive collection of **85 professionally curated speaker identities**: |
|
|
|
|
|
### **American Speakers**
|
|
|
|
|
**Male:** Tyler, Ryan, Jackson, Kyle, Derek, Cameron, Marcus, Ethan, Parker, Hayden, Grant, Chase, Tucker, Dalton, Zach, Brandon, Austin, Trevor, Jordan, Nathan, Blake, Garrett, Caleb, Logan, Hunter, Mason, Colton, Flynn, Devin, Carson, Preston, Landon, Bryce, Jasper, Cole, Noah, Taylor, Trent, Shane, Jared, Reid, Spencer, Wyatt, Luke, Cody, Drew, Henry, Vincent, Nolan, Kane, Ian, Kent, Jace, Max, Reed, Wade, George, Seth, Cruz, Miles, John, Michael |
|
|
|
|
|
**Female:** Madison, Ashley, Jennifer, Samantha, Brittany, Camille, Rachel, Paige, Haley, Megan, Alexis, Zara, Grace, Alice, Olivia |
|
|
|
|
|
### **British Speakers**
|
|
- Oliver (Male) |
|
|
- Sophie (Female) |
|
|
|
|
|
### **Australian / New Zealand**
|
|
- **Male:** Liam, Finn |
|
|
- **Female:** Ruby, Emma, Chloe |
|
|
|
|
|
### **International Accents**
|
|
- **Connor** (Male, Canadian) |
|
|
- **Thabo** (Male, South African) |
|
|
- **Marco** (Male, Italian) |
|
|
- **Cian** (Male, Irish) |
|
|
- **Wei** (Male, Chinese) |
|
|
- **Aoife** (Female, Irish) |
|
|
- **Siobhan** (Female, Irish) |
|
|
- **Johan** (Male, Dutch) |
|
|
- **Pieter** (Male, Dutch) |
|
|
- **Ingrid** (Female, Dutch) |
|
|
- **Priya** (Female, Indian) |
|
|
- **Mei, Lin, Xiao, Li, Jing, Yan** (Chinese) |
|
|
- **Elena** (Female, Spanish/European) |
|
|
|
|
|
*Full details in the [technical documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)* |
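When selecting speakers programmatically, it can help to validate the requested name before generation. The sketch below groups a small subset of the speakers listed above by accent; the `SPEAKERS` dict and `find_accent` helper are illustrative only, not part of the ParlerVoice API.

```python
# Illustrative subset of the 85-speaker library, grouped by accent.
# Names are taken from the lists above; the grouping helper is a
# hypothetical convenience, not shipped with the library.
SPEAKERS = {
    "american": ["Tyler", "Ryan", "Madison", "Ashley"],
    "british": ["Oliver", "Sophie"],
    "australian": ["Liam", "Finn", "Ruby", "Emma", "Chloe"],
    "canadian": ["Connor"],
    "irish": ["Cian", "Aoife", "Siobhan"],
}

def find_accent(speaker: str) -> str:
    """Return the accent group for a speaker, or raise if unknown."""
    for accent, names in SPEAKERS.items():
        if speaker in names:
            return accent
    raise KeyError(f"Speaker {speaker!r} not in the local subset")

print(find_accent("Connor"))
```

Checking names up front avoids silently generating with a description that names a speaker the model was never trained on.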
|
|
|
|
|
--- |
|
|
|
|
|
## **Key Capabilities**
|
|
|
|
|
### **Expressive Control**
|
|
- **Natural Language Descriptions**: Control emotion, tone, pace, and style through intuitive text descriptions |
|
|
- **Real-time Adjustment**: Modify expressiveness on-the-fly for dynamic content |
|
|
- **Contextual Awareness**: Maintains consistency across long-form content |
|
|
|
|
|
### **Audio Quality**
|
|
- **High-Fidelity Output**: 24kHz crystal-clear audio reproduction |
|
|
- **Noise Control**: Advanced background noise and reverb management |
|
|
- **Speaker Consistency**: Maintains voice identity across different emotional states |
|
|
|
|
|
### **Performance Optimizations**
|
|
- **Efficient Inference**: Optimized for both CPU and GPU deployment |
|
|
- **Batch Processing**: Handle multiple requests simultaneously |
|
|
- **Streaming Support**: Real-time audio generation capabilities |
|
|
- **Compatible with SDPA and compile optimizations** from upstream Parler-TTS |
|
|
|
|
|
For optimization tips, see [Parler-TTS INFERENCE.md](https://github.com/huggingface/parler-tts/blob/main/INFERENCE.md) |
|
|
|
|
|
--- |
|
|
|
|
|
## **Best Practices**
|
|
|
|
|
### Recommended Usage for Optimal Results |
|
|
- **Use speaker presets** from the repository for consistent, high-quality outputs |
|
|
- **Include named speakers** in descriptions to bias towards specific voice identities |
|
|
- **Provide detailed descriptions** for maximum control over expressiveness and tone |
|
|
- **Pull latest updates** from the repo as we actively refine description phrasing |
|
|
|
|
|
### Example Description Template |
|
|
``` |
|
|
[Speaker Name] conveys a [emotion] mood through a [style] delivery. |
|
|
They speak with a [pitch level] pitch and [pace] pace. |
|
|
The voice is [expressiveness level], with [characteristics]. |
|
|
The recording is [quality level] with [background description]. |
|
|
``` |
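The template above can be filled mechanically. Here is a minimal sketch of a helper that does so; the function name and its field set are illustrative conveniences, not part of the ParlerVoice API.

```python
# Hypothetical helper that fills the description template above.
# Every parameter corresponds to one bracketed slot in the template.
def fill_description(speaker, emotion, style, pitch, pace,
                     expressiveness, characteristics, quality, background):
    """Build a full voice description from the template's slots."""
    return (
        f"{speaker} conveys a {emotion} mood through a {style} delivery. "
        f"They speak with a {pitch} pitch and {pace} pace. "
        f"The voice is {expressiveness}, with {characteristics}. "
        f"The recording is {quality} with {background}."
    )

desc = fill_description(
    speaker="Connor", emotion="neutral", style="professional and controlled",
    pitch="slightly low", pace="moderate",
    expressiveness="slightly expressive",
    characteristics="subtle emotional inflections",
    quality="exceptionally clean", background="no audible background noise",
)
print(desc)
```

Generating descriptions from a fixed template like this keeps phrasing consistent across requests, which in turn helps speaker consistency.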
|
|
|
|
|
--- |
|
|
|
|
|
## **License**
|
|
|
|
|
This project is licensed under the **MIT License**. |
|
|
|
|
|
**Open Source & Free to Use** - ParlerVoice is available for: |
|
|
- ✅ Commercial applications and services
- ✅ Academic research and educational purposes
- ✅ Personal projects and community contributions
- ✅ Integration into other products and services
- ✅ Modification and redistribution
|
|
|
|
|
--- |
|
|
|
|
|
## **Citations**
|
|
|
|
|
If you use this work, please consider citing: |
|
|
|
|
|
```bibtex |
|
|
@software{iqbal2025parlervoice, |
|
|
title={ParlerVoice: Expressive Text-to-Speech with Advanced Speaker Control}, |
|
|
author={Tausif Iqbal and Zeeshan and Anant}, |
|
|
year={2025}, |
|
|
publisher={VoicingAI R\&D Labs}, |
|
|
url={https://github.com/VoicingAI/ParlerVoice} |
|
|
} |
|
|
|
|
|
@misc{lacombe-etal-2024-parler-tts, |
|
|
author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi}, |
|
|
title = {Parler-TTS}, |
|
|
year = {2024}, |
|
|
publisher = {GitHub}, |
|
|
journal = {GitHub repository}, |
|
|
howpublished = {\url{https://github.com/huggingface/parler-tts}} |
|
|
} |
|
|
|
|
|
@misc{lyth2024natural, |
|
|
title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations}, |
|
|
author={Dan Lyth and Simon King}, |
|
|
year={2024}, |
|
|
eprint={2402.01912}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.SD} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## **Resources**
|
|
|
|
|
- **GitHub Repository**: [VoicingAI/ParlerVoice](https://github.com/VoicingAI/ParlerVoice)

- **Technical Report & Samples**: [Notion Documentation](https://quilt-growth-39a.notion.site/ParlerVoice-28a776bb53f280949beef800875eb0f7?source=copy_link)

- **Hugging Face Model**: [TieIncred/ParlerVoice](https://huggingface.co/TieIncred/ParlerVoice)

- **Base Model**: [parler-tts/parler-tts-mini-v1.1](https://huggingface.co/parler-tts/parler-tts-mini-v1.1)
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Made with ❤️ by VoicingAI R&D Labs**
|
|
|
|
|
**Principal Researcher**: [Tausif Iqbal](https://www.linkedin.com/in/tausif-iqbal-77819a182/) |
|
|
|
|
|
**Core Team**: [Zeeshan](https://www.linkedin.com/in/zeeshan-parvez/) β’ [Anant](https://www.linkedin.com/in/anant-upadhyay-052524256/) |
|
|
|
|
|
*Developed at [VoicingAI](https://voicing.ai)* |
|
|
|
|
|
</div> |
|
|
|