Roxi-TTS Pro (1.7B): Indian-English text-to-speech

Roxi-TTS Pro is a 1.7B-parameter text-to-speech model that speaks in a clear, natural Indian-English accent. It is built for customer-support calls and website voice assistants, and it is the highest-quality voice in the Roxi line. If you need an Indian-English voice that sounds warm, professional, and telephony-ready, start here.

Why Roxi-TTS Pro

Natural Indian-English accent, not a generic English voice with an accent bolted on.
Highest intelligibility in the Roxi line: word error rate 0.18 (Whisper-base.en), with speaker similarity 0.97 (WavLM-SV).
Stable generation with fewer cut-offs than the smaller models, so most lines are usable on the first try.
24 kHz output, single consistent branded voice.
Apache-2.0 base models, so it is commercially permissive end to end.
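
For readers unfamiliar with the two metrics quoted in the list above, the following is a minimal sketch of how each is conventionally defined: word error rate as word-level edit distance divided by the reference length, and speaker similarity as the cosine between two embedding vectors. The function names are illustrative, the lowercase-and-split normalization is an assumption (the card does not describe its text normalization), and the sketch does not run Whisper or WavLM-SV itself.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Under this definition, a WER of 0.18 corresponds to roughly 18 word-level errors per 100 reference words in the transcript of the generated audio.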

Quick facts

Field	Value
Base model	OpenMOSS-Team/MOSS-TTS-Local-Transformer (1.7B, Apache-2.0)
Audio tokenizer	OpenMOSS-Team/MOSS-Audio-Tokenizer (Apache-2.0)
Method	LoRA (PEFT), r=32, alpha=64, merged into the base weights
Training data	About 4 hours, single IndicTTS-English speaker, 2371 clips
Output	24 kHz mono
Speaker similarity	0.97 (WavLM-SV cosine to held-out target)
Intelligibility WER	0.18 (Whisper-base.en on generated audio)
Speed	Real-time factor about 2.5 on a 16 GB GPU (best for offline or premium audio)
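
The "merged into the base weights" entry in the table above can be illustrated with a small sketch. It assumes the conventional LoRA update W + (alpha / r) * B A, which gives a scale of 64 / 32 = 2.0 for the values listed; the card does not say whether a variant such as rank-stabilized scaling was used. The function name and the toy matrices are hypothetical and use plain nested lists rather than tensors.

```python
def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), where r is the number of rows in A.

    W is rows x cols, B is rows x r, A is r x cols (plain nested lists).
    """
    rank = len(A)
    scale = alpha / rank
    merged = [row[:] for row in W]  # copy so the base weights are not modified
    for i in range(len(W)):
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(rank))
            merged[i][j] += scale * delta
    return merged


# Toy example with r=2 and alpha=4, the same alpha / r ratio as r=32, alpha=64.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.0], [0.0, 0.25]]
merged = merge_lora(W, A, B, alpha=4)
```

After merging, inference needs only the single merged matrix, which is why the quick start loads the model without any adapter step.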

Install

Built for transformers 4.57.1. Clone the MOSS-TTS repository so the model class is importable; the quick start adds the clone to sys.path.

pip install "transformers==4.57.1" torch torchaudio soundfile librosa peft
git clone https://github.com/OpenMOSS/MOSS-TTS.git

Quick start

import sys, torch, soundfile as sf
sys.path.insert(0, "MOSS-TTS")  # cloned repo, provides moss_tts_local
from transformers import AutoProcessor
from moss_tts_local.modeling_moss_tts import MossTTSDelayModel

repo = "IOTEverythin/roxi-tts-pro"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

model = MossTTSDelayModel.from_pretrained(
    repo, torch_dtype=dtype, attn_implementation="sdpa"
).to(device).eval()

text = "Welcome to Voz Vox. How may I help you today?"
instruction = "Speak naturally in a clear, conversational Indian-English style."

# Build a single-turn conversation and tokenize it for generation.
conv = [[processor.build_user_message(text=text, instruction=instruction)]]
batch = processor(conv, mode="generation")

# Sample audio tokens autoregressively.
out = model.generate(
    input_ids=batch["input_ids"].to(device),
    attention_mask=batch["attention_mask"].to(device),
    max_new_tokens=4096, do_sample=True, temperature=0.9,
)

# Decode the first result to a waveform and save it at the model's sampling rate.
audio = processor.decode(out)[0].audio_codes_list[0]
sf.write("out.wav", audio.float().cpu().numpy(), processor.model_config.sampling_rate)

Tips for reliable output: write numbers as words, spell brand names phonetically (for example, Voz Vox), avoid raw abbreviations, and keep sentences to about twelve words. Generation is autoregressive and can occasionally under-generate, so if a clip comes out too short, generate it two or three times, keep the longest take, and then trim leading and trailing silence. Do not raise max_new_tokens far above the default, since the codec decode grows quadratically in memory.
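
The retry advice above can be sketched as a small helper. This is an illustration under stated assumptions: `generate_once` is a hypothetical zero-argument callable that wraps the quick-start generation and returns one clip as a sequence of float samples, and the 0.01 amplitude threshold for silence is an arbitrary starting point, not a value from this card.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute value is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def best_of(generate_once, attempts=3, threshold=0.01):
    """Generate several takes, keep the longest, then trim its silent edges."""
    clips = [generate_once() for _ in range(attempts)]
    longest = max(clips, key=len)
    return trim_silence(longest, threshold)
```

Choosing the longest take before trimming follows the order given in the tips. Trimming every take first and then comparing lengths is a reasonable alternative when some takes are padded with long silences.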

Which Roxi voice should I use

Model	Base	Best for	Speaker similarity	WER
roxi-tts-pro (this)	MOSS-TTS-Local 1.7B	Highest quality, offline or premium audio	0.97	0.18
roxi-tts-v3.1	MOSS-TTS-Nano 0.1B	Real-time, live voice agents	0.96	0.33

Use Roxi-TTS Pro when quality matters most and you can pre-render or afford a GPU. Use the smaller 0.1B voice when you need real-time, low-latency speech for a live agent.

Performance and deployability

Measured on a single 16 GB GPU (bf16, SDPA attention): real-time factor about 2.5, that is roughly 13 seconds of compute per 5 seconds of audio, with peak GPU memory about 13.4 GB. This makes Roxi-TTS Pro well suited to offline or pre-rendered speech and to a premium quality tier. For live, low-latency turn taking, prefer the 0.1B roxi-tts-v3.1, or optimize this model with quantization, torch.compile, a faster GPU, or by caching common phrases.
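
The arithmetic behind the real-time factor can be written out as a short sketch. It takes the real-time factor to be compute time divided by audio duration, which matches the figures quoted above; the function names are illustrative.

```python
def compute_seconds(audio_seconds: float, rtf: float = 2.5) -> float:
    """Estimated generation time: real-time factor times audio duration."""
    return rtf * audio_seconds


def is_real_time(rtf: float) -> bool:
    """Generation keeps up with playback only when the factor is at most 1."""
    return rtf <= 1.0


five_second_clip = compute_seconds(5.0)   # 12.5 seconds, the "roughly 13" above
one_hour_batch = compute_seconds(3600.0)  # 9000 seconds, or 2.5 hours
```

At a factor of 2.5 the model cannot keep pace with a live caller, which is the reason for the pre-rendering and caching suggestions above.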

Intended use

Indian-English text-to-speech for customer-support calls and website voice assistants that need a natural, warm, professional, and telephony-ready voice. The model provides a single-speaker branded voice.

Limitations

The training data is read speech, so delivery is somewhat formal rather than fully conversational.
Not real-time on a single consumer GPU. See Performance and deployability.
Stochastic under-generation: clips occasionally end early. Retry as described in the quick-start tips and keep sentences short.
Style and emotion control are not reliable. The voice is neutral. For emotion, see roxi-tts-emotion.
Requires transformers 4.57.1.
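
The short-sentence advice in the limitations above can be applied before synthesis by splitting input text into small chunks. The sketch below is one simple way to do that, assuming the twelve-word guideline from the quick-start tips; the function name is illustrative, and breaking on a raw word count is crude compared with breaking at phrase boundaries.

```python
import re


def split_long_sentences(text: str, max_words: int = 12):
    """Split text into sentences, then break any sentence longer than max_words."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```

Each returned chunk can then be synthesized separately and the resulting clips concatenated.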

License and attribution

Released under Apache-2.0. Built on MOSS-TTS-Local-Transformer (Apache-2.0) and its audio tokenizer (Apache-2.0). Training data is the IIT-Madras Indic TTS English set accessed via SPRINGLab/IndicTTS-English. The dataset license requires the following notice:

Responsible use

This voice is derived from a real dataset speaker. Do not use it to impersonate real people or for fraud, social engineering, or deception. Disclose AI-generated audio where required by law or policy. Provided as is, without warranty.
