PrettyVoice – Expressive Text-to-Speech Model
Model Overview
PrettyVoice is an expressive Text-to-Speech (TTS) and Chat-to-Speech model, fine-tuned to generate emotional, conversational, natural-sounding human speech.
It is optimized for acted dialogue, emotional expression, and conversational delivery rather than flat or robotic narration.
⚠️ This is an audio-generation model, not a text-only language model.
Model Details
| Field | Value |
|---|---|
| Model name | somrajmondal/PrettyVoice_model |
| Author | Somraj Mondal |
| Architecture | CsmForConditionalGeneration |
| Language | English |
| Output | Speech waveform, 24 kHz (saved as WAV in the inference example) |
| Speaker support | Single speaker (speaker_id = 0) |
| License | Apache 2.0 |
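The table lists one speaker, `speaker_id = 0`, and the inference code passes that ID as the string `role` of a chat turn. The following minimal sketch builds and checks that conversation structure. It assumes a hypothetical helper name, `build_conversation`, which is not part of Transformers or Unsloth, and it covers text-only turns:

```python
from typing import Any

# Per the Model Details table, only speaker 0 is supported by this fine-tune.
SUPPORTED_SPEAKER_IDS = {0}


def build_conversation(text: str, speaker_id: int = 0) -> list[dict[str, Any]]:
    """Hypothetical helper: wrap `text` in the chat format used by the
    inference example (role = speaker ID as a string, text-only content)."""
    if speaker_id not in SUPPORTED_SPEAKER_IDS:
        raise ValueError(f"Unsupported speaker_id {speaker_id}; this model supports only 0")
    if not text.strip():
        raise ValueError("text must be non-empty")
    return [
        {"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
    ]


if __name__ == "__main__":
    conversation = build_conversation("<smiles> Hey guys! How are you today?")
    print(conversation)
```

The returned list can be passed to `processor.apply_chat_template(...)` in the same way as `conversation` in the inference code. The validation rejects any speaker ID other than 0, matching the table.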
Intended Use
✅ Designed for
- Text-to-Speech (TTS)
- Chat-style speech generation
- Emotional and expressive dialogue
- Storytelling and voice acting
- Romantic and conversational AI voices
- Multimodal chat → audio pipelines
❌ Not designed for
- Text-only generation
- Code generation
- Speech-to-text (ASR)
- Multilingual speech
- Robotic or factual narration
Installation
pip install unsloth transformers soundfile torch accelerate safetensors
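Before you run the inference code, you can confirm that the packages from the pip command are importable by looking them up with the standard library's `importlib.util.find_spec`, which locates a module without importing it. This is a small sketch, and the helper name `missing_modules` is hypothetical. The inference code also imports `IPython.display`, which notebook environments such as Colab and Kaggle already provide.

```python
import importlib.util

# Import names for the packages installed by the pip command above.
REQUIRED_MODULES = ["unsloth", "transformers", "soundfile", "torch", "accelerate", "safetensors"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```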
Inference Code
import torch
import soundfile as sf
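from IPython.display import Audio, display  # audio playback in Jupyter/Colab/Kaggle
from unsloth import FastModel
from transformers import CsmForConditionalGeneration
# Load model and processor (the processor provides apply_chat_template for CSM inputs)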
model, processor = FastModel.from_pretrained(
model_name="somrajmondal/PrettyVoice_model",
auto_model=CsmForConditionalGeneration,
max_seq_length=2048,
)
# Move model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
text = '''<smiles> Hey guys! How are you today? How sweet does my voice sound?'''
speaker_id = 0
conversation = [
{"role": str(speaker_id), "content": [{"type": "text", "text": text}]},
]
# -------------------------
# Prepare inputs and compute token length
# -------------------------
inputs_for_length = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
return_tensors="pt",
)
# Number of input tokens
input_length = inputs_for_length["input_ids"].shape[1]
# Rough heuristic: scale the audio-token budget with the prompt length
max_new_tokens = int(input_length * 2.5)
# Clamp the budget to the range [150, 3600]
max_new_tokens = max(150, min(max_new_tokens, 3600))
print(f"Input tokens: {input_length}, max_new_tokens set to: {max_new_tokens}")
# -------------------------
# Generate audio
# -------------------------
audio_values = model.generate(
**inputs_for_length.to(device),
max_new_tokens=max_new_tokens,
output_audio=True,
# do_sample=True,  # optional: sampling may give more varied, expressive delivery
# temperature=0.9,
# top_p=0.9,
)
# -------------------------
# Convert to numpy, normalize, save and play
# -------------------------
audio = audio_values[0].float().cpu().numpy()
audio = audio / (abs(audio).max() + 1e-6) # normalize
sf.write("example_voice_context.wav", audio, 24000)
display(Audio(audio, rate=24000))
print("Saved example_voice_context.wav successfully!")
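The token-budget and normalization steps in the inference code are plain arithmetic. The dependency-free sketch below implements both so you can check the numbers without loading the model. The function names are hypothetical. The seconds conversion assumes that CSM emits roughly one audio frame per generated token at 12.5 frames per second (the Mimi codec rate). Treat that figure as an assumption, not a documented property of this fine-tune.

```python
def estimate_max_new_tokens(input_length: int, ratio: float = 2.5,
                            lower: int = 150, upper: int = 3600) -> int:
    """Same heuristic as the inference code: scale the prompt length, then clamp."""
    return max(lower, min(int(input_length * ratio), upper))


def approx_max_seconds(max_new_tokens: int, frame_rate_hz: float = 12.5) -> float:
    """ASSUMPTION: one generated token is about one audio frame at 12.5 frames/s."""
    return max_new_tokens / frame_rate_hz


def peak_normalize(samples: list[float], eps: float = 1e-6) -> list[float]:
    """Scale so the loudest sample is just under 1.0, like audio / (max|audio| + eps)."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / (peak + eps) for s in samples]


if __name__ == "__main__":
    budget = estimate_max_new_tokens(100)
    print(budget, approx_max_seconds(budget))
    print(peak_normalize([0.5, -1.0, 0.25]))
```

Under that frame-rate assumption, the 3600-token ceiling corresponds to about 288 seconds of audio, and the 150-token floor corresponds to about 12 seconds.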