Roxi-TTS Pro (1.7B): Indian-English text-to-speech
Roxi-TTS Pro is a 1.7B text-to-speech model that speaks in a clear, natural Indian-English accent. It is built for customer-support calls and website voice assistants, and it is the highest-quality voice in the Roxi line. If you need an Indian-English voice that sounds warm, professional, and telephony-ready, start here.
Why Roxi-TTS Pro
- Natural Indian-English accent, not a generic English voice with an accent bolted on.
- Highest intelligibility in the Roxi line: a word error rate of 0.18 (Whisper-base.en), with a speaker similarity of 0.97 (WavLM-SV).
- Stable generation with fewer cut-offs than the smaller models, so most lines are usable on the first try.
- 24 kHz output, single consistent branded voice.
- Apache-2.0 base models, so it is commercially permissive end to end.
Quick facts
| Field | Value |
|---|---|
| Base model | OpenMOSS-Team/MOSS-TTS-Local-Transformer (1.7B, Apache-2.0) |
| Audio tokenizer | OpenMOSS-Team/MOSS-Audio-Tokenizer (Apache-2.0) |
| Method | LoRA (PEFT), r=32, alpha=64, merged into the base weights |
| Training data | About 4 hours, single IndicTTS-English speaker, 2371 clips |
| Output | 24 kHz mono |
| Speaker similarity | 0.97 (WavLM-SV cosine to held-out target) |
| Intelligibility WER | 0.18 (Whisper-base.en on generated audio) |
| Speed | Real-time factor about 2.5 on a 16 GB GPU (best for offline or premium audio) |
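For readers unfamiliar with the two quality metrics in the table, here is a minimal sketch of how each is defined. It is not the evaluation script behind the reported numbers, and the lowercase-and-split normalization is an assumption, since the card does not say how text was normalized. In the setup described above, `hypothesis` would be the Whisper-base.en transcript of the generated audio and the two vectors would be WavLM-SV speaker embeddings.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

A WER of 0.18 therefore means roughly 18 word-level edits per 100 reference words, and a similarity of 0.97 means the generated and target embeddings point in almost the same direction.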
Install
Built for transformers 4.57.1. Clone the MOSS-TTS repository so the model class is importable; the quick start adds the clone to sys.path.
```bash
pip install "transformers==4.57.1" torch torchaudio soundfile librosa peft
git clone https://github.com/OpenMOSS/MOSS-TTS.git
```
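Because the card pins an exact transformers version and relies on a local clone, a quick environment check before loading the model can save a confusing import error. The sketch below is illustrative only: the helper names are hypothetical, and the `MOSS-TTS` path assumes the clone sits in the current working directory.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path

REQUIRED_TRANSFORMERS = "4.57.1"  # version pinned by this card


def transformers_status(required: str = REQUIRED_TRANSFORMERS) -> str:
    """Return a short message describing the installed transformers version."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if installed == required:
        return f"transformers {installed} matches the pinned version"
    return f"transformers {installed} found, but this card pins {required}"


def moss_repo_present(path: str = "MOSS-TTS") -> bool:
    """True if the cloned repository directory exists (path is an assumption)."""
    return Path(path).is_dir()


print(transformers_status())
print("MOSS-TTS clone found:", moss_repo_present())
```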
Quick start
```python
import sys, torch, soundfile as sf

sys.path.insert(0, "MOSS-TTS")  # cloned repo, provides moss_tts_local
from transformers import AutoProcessor
from moss_tts_local.modeling_moss_tts import MossTTSDelayModel

repo = "IOTEverythin/roxi-tts-pro"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

# Load the processor and move its audio tokenizer to the target device.
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

model = MossTTSDelayModel.from_pretrained(
    repo, torch_dtype=dtype, attn_implementation="sdpa"
).to(device).eval()

text = "Welcome to Voz Vox. How may I help you today?"
instruction = "Speak naturally in a clear, conversational Indian-English style."
conv = [[processor.build_user_message(text=text, instruction=instruction)]]
batch = processor(conv, mode="generation")

# Sampling is enabled, so repeated runs produce different takes.
out = model.generate(
    input_ids=batch["input_ids"].to(device),
    attention_mask=batch["attention_mask"].to(device),
    max_new_tokens=4096, do_sample=True, temperature=0.9,
)

# Decode the generated codes to a waveform and write it at the model's sampling rate.
audio = processor.decode(out)[0].audio_codes_list[0]
sf.write("out.wav", audio.float().cpu().numpy(), processor.model_config.sampling_rate)
```
Tips for reliable output: write numbers as words, spell brand names phonetically (for example Voz Vox), avoid raw abbreviations, and keep sentences to about twelve words. Generation is autoregressive and can occasionally under-generate, so if a clip is short, generate two or three times and keep the longest, then trim leading and trailing silence. Do not raise max_new_tokens far above the default, since the codec decode grows quadratically in memory.
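The retry advice above can be sketched as a small helper: generate a few takes, keep the longest, then trim silence from both ends. This is a minimal illustration under stated assumptions, not part of the model's API. The names `trim_silence` and `best_of` are hypothetical, the amplitude threshold of 0.01 is an arbitrary starting point, and `generate` stands in for whatever callable wraps the quick-start `model.generate` and `processor.decode` steps and returns a one-dimensional sequence of samples.

```python
from typing import Callable, Sequence


def trim_silence(samples: Sequence[float], threshold: float = 0.01) -> Sequence[float]:
    """Drop leading and trailing samples whose absolute value is at or below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return samples[:0]  # the whole clip is silence
    return samples[loud[0] : loud[-1] + 1]


def best_of(
    generate: Callable[[], Sequence[float]],
    attempts: int = 3,
    threshold: float = 0.01,
) -> Sequence[float]:
    """Call generate() several times, keep the longest clip, then trim its silence."""
    candidates = [generate() for _ in range(attempts)]
    longest = max(candidates, key=len)
    return trim_silence(longest, threshold)


# Stand-in for three stochastic generations of the same line.
fake_clips = iter([[0.0, 0.2, 0.0], [0.0, 0.0, 0.3, -0.4, 0.1, 0.0], [0.5]])
clip = best_of(lambda: next(fake_clips), attempts=3)
print(clip)  # [0.3, -0.4, 0.1]
```

Selecting by raw length follows the order given in the tips. If takes tend to differ mainly in padding silence, trimming each candidate before comparing lengths may be a better choice.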
Which Roxi voice should I use
| Model | Base | Best for | Speaker sim | WER |
|---|---|---|---|---|
| roxi-tts-pro (this) | MOSS-TTS-Local 1.7B | Highest quality, offline or premium audio | 0.97 | 0.18 |
| roxi-tts-v3.1 | MOSS-TTS-Nano 0.1B | Real-time, live voice agents | 0.96 | 0.33 |
Use Roxi-TTS Pro when quality matters most and you can pre-render or afford a GPU. Use the smaller 0.1B voice when you need real-time, low-latency speech for a live agent.
Performance and deployability
Measured on a single 16 GB GPU (bf16, SDPA attention): real-time factor about 2.5, that is roughly 13 seconds of compute per 5 seconds of audio, with peak GPU memory about 13.4 GB. This makes Roxi-TTS Pro well suited to offline or pre-rendered speech and to a premium quality tier. For live, low-latency turn taking, prefer the 0.1B roxi-tts-v3.1, or optimize this model with quantization, torch.compile, a faster GPU, or by caching common phrases.
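The real-time factor translates directly into rendering time and throughput. The arithmetic below uses only the 2.5 figure reported above and assumes it holds roughly constant across clip lengths, which the card does not confirm. The function names are illustrative.

```python
def compute_seconds(audio_seconds: float, rtf: float = 2.5) -> float:
    """Estimated compute time for a clip at a given real-time factor."""
    return audio_seconds * rtf


def audio_seconds_per_gpu_hour(rtf: float = 2.5) -> float:
    """Seconds of audio one GPU can render in an hour at a given real-time factor."""
    return 3600.0 / rtf


print(compute_seconds(5.0))          # 12.5, the "roughly 13 seconds" quoted above
print(audio_seconds_per_gpu_hour())  # 1440.0, i.e. about 24 minutes of audio per GPU-hour
```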
Intended use
Indian-English text to speech for customer-support calls and website voice assistants: natural, warm or professional, and telephony aware. Single-speaker branded voice.
Limitations
- The training data is read speech, so delivery is somewhat formal rather than fully conversational.
- Not real-time on a single consumer GPU. See Performance and deployability.
- Stochastic under-generation. Use the retry approach and keep sentences short.
- Style and emotion control are not reliable. The voice is neutral. For emotion, see roxi-tts-emotion.
- Requires transformers 4.57.1.
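The advice to keep sentences short can be applied mechanically before synthesis. The sketch below splits input at sentence-ending punctuation and then caps each piece at twelve words. The function name `split_for_tts` is hypothetical, and the hard word-count cut is a crude assumption: splitting at commas or clause boundaries would likely sound more natural.

```python
import re
from typing import List


def split_for_tts(text: str, max_words: int = 12) -> List[str]:
    """Split text into sentence-like chunks of at most max_words words."""
    # Break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        # Cut overlong sentences into consecutive runs of max_words words.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start : start + max_words]))
    return chunks


print(split_for_tts("Welcome to Voz Vox. How may I help you today?"))
# ['Welcome to Voz Vox.', 'How may I help you today?']
```

Each chunk can then be synthesized separately and the resulting clips concatenated.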
License and attribution
Released under Apache-2.0. Built on MOSS-TTS-Local-Transformer (Apache-2.0) and its audio tokenizer (Apache-2.0). Training data is the IIT-Madras Indic TTS English set accessed via SPRINGLab/IndicTTS-English. The dataset license requires the following notice:
COPYRIGHT 2016 TTS Consortium, TDIL, Meity, represented by Hema A. Murthy and S. Umesh, Department of Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.
Responsible use
This voice is derived from a real dataset speaker. Do not use it to impersonate real people or for fraud, social engineering, or deception. Disclose AI-generated audio where required by law or policy. Provided as is, without warranty.
Evaluation results
- Speaker similarity (WavLM-SV, vs target): 0.970 (self-reported)
- Intelligibility WER (Whisper-base.en): 0.180 (self-reported)