# PrettyVoice: Expressive Text-to-Speech Model
## Model Overview
PrettyVoice is an expressive Text-to-Speech (TTS) and Chat-to-Speech model fine-tuned to generate emotional, conversational, and natural human-like speech.
This model is optimized for acted dialogue, emotional expression, and conversational delivery, not flat or robotic narration.
> ⚠️ This is an audio-generation model, not a text-only language model.
## Model Details
| Field | Value |
|---|---|
| Model name | somrajmondal/PrettyVoice_model |
| Author | Somraj Mondal |
| Architecture | CsmForConditionalGeneration |
| Language | English |
| Output | Speech audio (WAV, 24 kHz) |
| Speaker support | Single speaker (speaker_id = 0) |
| License | Apache 2.0 |
## Intended Use
### ✅ Designed for
- Text-to-Speech (TTS)
- Chat-style speech generation
- Emotional and expressive dialogue
- Storytelling and voice acting
- Romantic and conversational AI voices
- Multimodal chat → audio pipelines
### ❌ Not designed for
- Text-only generation
- Code generation
- Speech-to-text (ASR)
- Multilingual speech
- Robotic or factual narration
## Installation
```bash
pip install unsloth transformers soundfile torch accelerate safetensors
```
## Inference Code
```python
import torch
import soundfile as sf
from IPython.display import Audio, display
from unsloth import FastModel
from transformers import CsmForConditionalGeneration

# Load model and processor
model, processor = FastModel.from_pretrained(
    model_name="somrajmondal/PrettyVoice_model",
    auto_model=CsmForConditionalGeneration,
    max_seq_length=2048,
)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "<smiles> Hey guys! How are you today? How sweet does my voice sound?"
speaker_id = 0

conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

# -------------------------
# Prepare inputs and compute token length
# -------------------------
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# Number of input tokens
input_length = inputs["input_ids"].shape[1]

# Estimate max_new_tokens for audio generation,
# clamped to a sensible range
max_new_tokens = max(150, min(int(input_length * 2.5), 3600))
print(f"Input tokens: {input_length}, max_new_tokens set to: {max_new_tokens}")

# -------------------------
# Generate audio
# -------------------------
audio_values = model.generate(
    **inputs.to(device),
    max_new_tokens=max_new_tokens,
    output_audio=True,
    # do_sample=True,   # makes the voice more expressive
    # temperature=0.9,
    # top_p=0.9,
)

# -------------------------
# Convert to numpy, peak-normalize, save, and play
# -------------------------
audio = audio_values[0].float().cpu().numpy()
audio = audio / (abs(audio).max() + 1e-6)  # normalize to [-1, 1]
sf.write("example_voice_context.wav", audio, 24000)
display(Audio(audio, rate=24000))
print("Saved example_voice_context.wav successfully!")
```
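The last step of the script peak-normalizes the waveform and writes it out at the model's 24 kHz sample rate. The sketch below illustrates that step in isolation using only the standard library (a synthetic 440 Hz sine stands in for the model's output; the filename `sanity_check.wav` is just an example), and shows how the duration of the resulting WAV follows from frame count divided by sample rate:

```python
import math
import struct
import wave

SAMPLE_RATE = 24000  # PrettyVoice outputs 24 kHz audio

# Stand-in waveform (1 second of a 440 Hz sine); in practice this
# would be the model's generated audio as floats.
samples = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
           for t in range(SAMPLE_RATE)]

# Peak-normalize to [-1, 1], mirroring the script above
peak = max(abs(s) for s in samples) + 1e-6
samples = [s / peak for s in samples]

# Write 16-bit mono PCM with the stdlib `wave` module
with wave.open("sanity_check.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 2 bytes = 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(b"".join(
        struct.pack("<h", int(s * 32767)) for s in samples))

# Duration (seconds) = frames / sample rate
with wave.open("sanity_check.wav", "rb") as f:
    duration = f.getnframes() / f.getframerate()
print(f"Duration: {duration:.2f} s")
```

The same frames-divided-by-rate check is a quick way to confirm that `example_voice_context.wav` came out at the expected length after generation.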