PrettyVoice – Expressive Text-to-Speech Model

Model Overview

PrettyVoice is an expressive Text-to-Speech (TTS) and Chat-to-Speech model fine-tuned to generate emotional, conversational, and natural human-like speech.

This model is optimized for acted dialogue, emotional expression, and conversational delivery, not flat or robotic narration.

⚠️ This is an audio-generation model, not a text-only language model.


Model Details

| Field | Value |
|---|---|
| Model name | somrajmondal/PrettyVoice_model |
| Author | Somraj Mondal |
| Architecture | CsmForConditionalGeneration |
| Language | English |
| Output | Speech audio (WAV, 24 kHz) |
| Speaker support | Single speaker (`speaker_id = 0`) |
| License | Apache 2.0 |

Intended Use

✅ Designed for

  • Text-to-Speech (TTS)
  • Chat-style speech generation
  • Emotional and expressive dialogue
  • Storytelling and voice acting
  • Romantic and conversational AI voices
  • Multimodal chat → audio pipelines

❌ Not designed for

  • Text-only generation
  • Code generation
  • Speech-to-text (ASR)
  • Multilingual speech
  • Robotic or factual narration
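For chat → audio pipelines, each turn is a dict whose role is the speaker id as a string and whose content is a list of typed parts, as used in the inference code in this card. A small helper for building such a turn might look like this (the helper name `make_turn` is our own, not part of any library):

```python
def make_turn(text: str, speaker_id: int = 0) -> dict:
    """Build one conversation turn in the chat format expected by
    processor.apply_chat_template: the role is the speaker id as a
    string, and the content is a list of typed parts (one text part)."""
    return {
        "role": str(speaker_id),
        "content": [{"type": "text", "text": text}],
    }

conversation = [make_turn("<smiles> Hey guys! How are you today?")]
print(conversation[0]["role"])                # "0"
print(conversation[0]["content"][0]["type"])  # "text"
```

Since PrettyVoice is single-speaker, `speaker_id` should stay at 0.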

Installation

```bash
pip install unsloth transformers soundfile torch accelerate safetensors
```

Inference Code

```python
import torch
import soundfile as sf
from IPython.display import Audio, display
from unsloth import FastModel
from transformers import CsmForConditionalGeneration

# Load model and processor
model, processor = FastModel.from_pretrained(
    model_name="somrajmondal/PrettyVoice_model",
    auto_model=CsmForConditionalGeneration,
    max_seq_length=2048,
)

# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text = "<smiles> Hey guys! How are you today? How sweet does my voice sound?"
speaker_id = 0
conversation = [
    {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]

# -------------------------
# Prepare inputs and compute token length
# -------------------------
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)

# Number of input tokens
input_length = inputs["input_ids"].shape[1]

# Estimate max_new_tokens for audio generation,
# clamped to a sensible range
max_new_tokens = max(150, min(int(input_length * 2.5), 3600))

print(f"Input tokens: {input_length}, max_new_tokens set to: {max_new_tokens}")

# -------------------------
# Generate audio
# -------------------------
audio_values = model.generate(
    **inputs.to(device),
    max_new_tokens=max_new_tokens,
    output_audio=True,
    # do_sample=True,       # makes the voice more expressive
    # temperature=0.9,
    # top_p=0.9,
)

# -------------------------
# Convert to numpy, normalize, save, and play
# -------------------------
audio = audio_values[0].float().cpu().numpy()
audio = audio / (abs(audio).max() + 1e-6)  # peak-normalize to [-1, 1]

sf.write("example_voice_context.wav", audio, 24000)
display(Audio(audio, rate=24000))

print("Saved example_voice_context.wav successfully!")
```
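The `max_new_tokens` estimate above scales the input token count by 2.5 and clamps the result to [150, 3600]. It can be factored into a small helper for reuse; the ratio and bounds are the values used in this example, not limits documented by the model:

```python
def estimate_max_new_tokens(input_length: int,
                            ratio: float = 2.5,
                            floor: int = 150,
                            ceiling: int = 3600) -> int:
    """Scale the input token count by `ratio`, then clamp to
    [floor, ceiling] so very short prompts still get enough audio
    tokens and very long ones stay bounded."""
    return max(floor, min(int(input_length * ratio), ceiling))

print(estimate_max_new_tokens(40))    # 40 * 2.5 = 100 -> clamped up to 150
print(estimate_max_new_tokens(200))   # 500
print(estimate_max_new_tokens(5000))  # 12500 -> clamped down to 3600
```

Raise the ceiling if long passages get cut off mid-sentence; lowering the ratio speeds up generation at the risk of truncated audio.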

Format: Safetensors · Model size: 2B params · Tensor types: F32, F16

Model tree for somrajmondal/PrettyVoice_model

Base model: sesame/csm-1b → finetuned as unsloth/csm-1b → finetuned as this model.