PrettyVoice – Expressive Text-to-Speech Model

Model Overview

PrettyVoice is an expressive Text-to-Speech (TTS) and Chat-to-Speech model fine-tuned to generate emotional, conversational, and natural human-like speech.

This model is optimized for acted dialogue, emotional expression, and conversational delivery, not flat or robotic narration.

⚠️ This is an audio-generation model, not a text-only language model.


Model Details

| Field | Value |
|---|---|
| Model name | somrajmondal/PrettyVoice_model |
| Author | Somraj Mondal |
| Architecture | CsmForConditionalGeneration |
| Language | English |
| Output | Speech audio (WAV, 24 kHz) |
| Speaker support | Single speaker (`speaker_id = 0`) |
| License | Apache 2.0 |

Intended Use

✅ Designed for

  • Text-to-Speech (TTS)
  • Chat-style speech generation
  • Emotional and expressive dialogue
  • Storytelling and voice acting
  • Romantic and conversational AI voices
  • Multimodal chat → audio pipelines

❌ Not designed for

  • Text-only generation
  • Code generation
  • Speech-to-text (ASR)
  • Multilingual speech
  • Robotic or factual narration
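For chat → audio pipelines, each turn is a dict whose role is the speaker id as a string and whose content is a list of typed parts, as used in the inference code in this card. A small helper for building such a turn might look like this (the helper name `make_turn` is our own, not part of any library):

```python
def make_turn(text: str, speaker_id: int = 0) -> dict:
    """Build one conversation turn in the chat format expected by
    processor.apply_chat_template: the role is the speaker id as a
    string, and the content is a list of typed parts (one text part)."""
    return {
        "role": str(speaker_id),
        "content": [{"type": "text", "text": text}],
    }

conversation = [make_turn("<smiles> Hey guys! How are you today?")]
print(conversation[0]["role"])                # "0"
print(conversation[0]["content"][0]["type"])  # "text"
```

Since PrettyVoice is single-speaker, `speaker_id` should stay at 0.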

Installation

```bash
pip install unsloth transformers soundfile torch accelerate safetensors
```

Inference Code

```python
import torch
import soundfile as sf
from IPython.display import Audio, display
from unsloth import FastModel
from transformers import CsmForConditionalGeneration

# Load model and processor
model, processor = FastModel.from_pretrained(
    model_name="somrajmondal/PrettyVoice_model",
    auto_model=CsmForConditionalGeneration,
    max_seq_length=2048,
)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "<smiles> Hey guys! How are you today? How sweet does my voice sound?"
speaker_id = 0
conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

# -------------------------
# Prepare inputs and compute token length
# -------------------------
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# Number of input tokens
input_length = inputs["input_ids"].shape[1]

# Estimate max_new_tokens for audio generation,
# clamped to a sensible range
max_new_tokens = max(150, min(int(input_length * 2.5), 3600))

print(f"Input tokens: {input_length}, max_new_tokens set to: {max_new_tokens}")

# -------------------------
# Generate audio
# -------------------------
audio_values = model.generate(
    **inputs.to(device),
    max_new_tokens=max_new_tokens,
    output_audio=True,
    # do_sample=True,       # makes the voice more expressive
    # temperature=0.9,
    # top_p=0.9,
)

# -------------------------
# Convert to numpy, normalize, save, and play
# -------------------------
audio = audio_values[0].float().cpu().numpy()
audio = audio / (abs(audio).max() + 1e-6)  # peak-normalize to [-1, 1]

sf.write("example_voice_context.wav", audio, 24000)
display(Audio(audio, rate=24000))

print("Saved example_voice_context.wav successfully!")
```
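The `max_new_tokens` estimate above scales the input token count by 2.5 and clamps the result to [150, 3600]. It can be factored into a small helper for reuse; the ratio and bounds are the values used in this example, not limits documented by the model:

```python
def estimate_max_new_tokens(input_length: int,
                            ratio: float = 2.5,
                            floor: int = 150,
                            ceiling: int = 3600) -> int:
    """Scale the input token count by `ratio`, then clamp to
    [floor, ceiling] so very short prompts still get enough audio
    tokens and very long ones stay bounded."""
    return max(floor, min(int(input_length * ratio), ceiling))

print(estimate_max_new_tokens(40))    # 40 * 2.5 = 100 -> clamped up to 150
print(estimate_max_new_tokens(200))   # 500
print(estimate_max_new_tokens(5000))  # 12500 -> clamped down to 3600
```

Raise the ceiling if long passages get cut off mid-sentence; lowering the ratio speeds up generation at the risk of truncated audio.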

Format: Safetensors · Model size: 2B params · Tensor types: F32, F16

Model tree for somrajmondal/PrettyVoice_model

Base model: sesame/csm-1b → finetuned as unsloth/csm-1b → finetuned as this model.