Roxi-TTS Pro (1.7B): Indian-English text-to-speech

Roxi-TTS Pro is a 1.7B-parameter text-to-speech model that speaks in a clear, natural Indian-English accent. It is built for customer-support calls and website voice assistants, and it is the highest-quality voice in the Roxi line. If you need an Indian-English voice that sounds warm, professional, and telephony-ready, start here.

Why Roxi-TTS Pro

  • Natural Indian-English accent, not a generic English voice with an accent bolted on.
  • Highest intelligibility in the Roxi line: word error rate (WER) of 0.18, measured with Whisper-base.en, and speaker similarity of 0.97 (WavLM-SV cosine).
  • Stable generation with fewer cut-offs than the smaller models, so most lines are usable on the first try.
  • 24 kHz output, single consistent branded voice.
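  • Apache-2.0 base model and audio tokenizer, so the model stack is commercially permissive. The training data carries its own attribution notice (see License and attribution).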

Quick facts

| Field | Value |
| --- | --- |
| Base model | OpenMOSS-Team/MOSS-TTS-Local-Transformer (1.7B, Apache-2.0) |
| Audio tokenizer | OpenMOSS-Team/MOSS-Audio-Tokenizer (Apache-2.0) |
| Method | LoRA (PEFT), r=32, alpha=64, merged into the base weights |
| Training data | About 4 hours, single IndicTTS-English speaker, 2371 clips |
| Output | 24 kHz mono |
| Speaker similarity | 0.97 (WavLM-SV cosine to held-out target) |
| Intelligibility | WER 0.18 (Whisper-base.en on generated audio) |
| Speed | Real-time factor about 2.5 on a 16 GB GPU (best for offline or premium audio) |

Install

Built for transformers 4.57.1. Clone the MOSS-TTS repository so the model class is importable; the quick start adds the cloned folder to sys.path.

pip install "transformers==4.57.1" torch torchaudio soundfile librosa peft
git clone https://github.com/OpenMOSS/MOSS-TTS.git
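
Because the model is built against one specific transformers release, it can help to check the installed version before loading anything. The following is a minimal sketch, not part of the Roxi or MOSS-TTS code: the helper names `version_matches` and `check_transformers` are hypothetical, and it relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata

REQUIRED_TRANSFORMERS = "4.57.1"  # version named in this model card


def version_matches(installed: str, required: str = REQUIRED_TRANSFORMERS) -> bool:
    """Return True when the installed version string equals the required one."""
    return installed.strip() == required


def check_transformers() -> str:
    """Describe whether the installed transformers matches the required version."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if version_matches(installed):
        return f"transformers {installed} matches"
    return f"transformers {installed} found, expected {REQUIRED_TRANSFORMERS}"


if __name__ == "__main__":
    print(check_transformers())
```

Run it once after the pip install step. A mismatch does not prove the model will fail, but it is the first thing to rule out if loading breaks.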

Quick start

import sys, torch, soundfile as sf
sys.path.insert(0, "MOSS-TTS")  # cloned repo, provides moss_tts_local
from transformers import AutoProcessor
from moss_tts_local.modeling_moss_tts import MossTTSDelayModel

repo = "IOTEverythin/roxi-tts-pro"
# Use bf16 on a GPU and fall back to fp32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

# Load the processor and move its audio tokenizer to the same device as the model.
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)

# Load the model in inference mode.
model = MossTTSDelayModel.from_pretrained(
    repo, torch_dtype=dtype, attn_implementation="sdpa"
).to(device).eval()

text = "Welcome to Voz Vox. How may I help you today?"
instruction = "Speak naturally in a clear, conversational Indian-English style."

# Build a one-message conversation and tokenize it for generation.
conv = [[processor.build_user_message(text=text, instruction=instruction)]]
batch = processor(conv, mode="generation")
# Sample audio tokens; sampling is stochastic, so each run can differ.
out = model.generate(
    input_ids=batch["input_ids"].to(device),
    attention_mask=batch["attention_mask"].to(device),
    max_new_tokens=4096, do_sample=True, temperature=0.9,
)
# Decode the tokens to a waveform and write it at the model's sampling rate.
audio = processor.decode(out)[0].audio_codes_list[0]
sf.write("out.wav", audio.float().cpu().numpy(), processor.model_config.sampling_rate)

Tips for reliable output: write numbers as words, spell brand names phonetically (for example, Voz Vox), avoid raw abbreviations, and keep sentences to about twelve words. Generation is autoregressive and can occasionally under-generate. If a clip comes out short, generate two or three times, keep the longest result, and trim leading and trailing silence. Do not raise max_new_tokens far above the value in the quick start (4096), since memory for the codec decode grows quadratically with output length.
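
Two of these tips can be automated: flagging sentences that run past about twelve words, and retrying generation while keeping the longest clip. The sketch below is illustrative only. The function names, the 0.01 amplitude threshold, and the `generate_once` callable are assumptions, not part of the Roxi or MOSS-TTS API. Clips are treated as plain sequences of float samples, so the code works on lists and on 1-D NumPy arrays alike.

```python
import re


def long_sentences(text, max_words=12):
    """Return the sentences in text that have more than max_words words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if len(s.split()) > max_words]


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute value is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]


def generate_best(generate_once, attempts=3):
    """Call generate_once() several times and keep the longest trimmed clip."""
    best = None
    for _ in range(attempts):
        clip = trim_silence(generate_once())
        if best is None or len(clip) > len(best):
            best = clip
    return best
```

In practice `generate_once` would wrap the generate and decode calls from the quick start and return the waveform. The per-sample threshold is crude; an energy-based trimmer such as `librosa.effects.trim` is a reasonable alternative.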

Which Roxi voice should I use

| Model | Base | Best for | Speaker sim | WER |
| --- | --- | --- | --- | --- |
| roxi-tts-pro (this) | MOSS-TTS-Local 1.7B | Highest quality, offline or premium audio | 0.97 | 0.18 |
| roxi-tts-v3.1 | MOSS-TTS-Nano 0.1B | Real-time, live voice agents | 0.96 | 0.33 |

Use Roxi-TTS Pro when quality matters most and you can pre-render or afford a GPU. Use the smaller 0.1B voice when you need real-time, low-latency speech for a live agent.

Performance and deployability

Measured on a single 16 GB GPU (bf16, SDPA attention), the real-time factor is about 2.5, or roughly 13 seconds of compute per 5 seconds of audio, and peak GPU memory is about 13.4 GB. This makes Roxi-TTS Pro well suited to offline or pre-rendered speech and to a premium quality tier. For live, low-latency turn-taking, prefer the 0.1B roxi-tts-v3.1, or optimize this model with quantization, torch.compile, a faster GPU, or by caching common phrases.
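
The arithmetic behind the real-time factor, and the idea of caching common phrases, can be sketched as follows. This is an illustration under stated assumptions, not code shipped with the model: `PhraseCache` and the `render` callable are hypothetical names, the cache is in memory only, and the 2.5 default is the approximate figure reported above.

```python
import hashlib


def real_time_factor(compute_seconds, audio_seconds):
    """Real-time factor: seconds of compute per second of audio produced."""
    return compute_seconds / audio_seconds


def estimated_compute_seconds(audio_seconds, rtf=2.5):
    """Estimate compute time for a clip from an assumed real-time factor."""
    return audio_seconds * rtf


class PhraseCache:
    """In-memory cache of rendered audio, keyed by the exact text and instruction."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(text, instruction):
        # Hash instruction and text together so either change produces a new entry.
        payload = f"{instruction}\x00{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get_or_render(self, text, instruction, render):
        """Return cached audio, calling render(text, instruction) only on a miss."""
        key = self._key(text, instruction)
        if key not in self._store:
            self._store[key] = render(text, instruction)
        return self._store[key]
```

With these defaults a 5-second clip is estimated at 12.5 seconds of compute, in line with the rough figure above. Because the quick start samples with do_sample=True, a cache also pins one take of each phrase, so repeated greetings sound identical from call to call.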

Intended use

Indian-English text-to-speech for customer-support calls and website voice assistants: natural, warm or professional, and telephony-aware. It provides a single-speaker branded voice.

Limitations

  • The training data is read speech, so delivery is somewhat formal rather than fully conversational.
  • Not real-time on a single consumer GPU. See Performance and deployability.
  • Stochastic under-generation. Use the retry approach and keep sentences short.
  • Style and emotion control are not reliable. The voice is neutral. For emotion, see roxi-tts-emotion.
  • Requires transformers 4.57.1.

License and attribution

Released under Apache-2.0. Built on MOSS-TTS-Local-Transformer (Apache-2.0) and its audio tokenizer (Apache-2.0). Training data is the IIT-Madras Indic TTS English set accessed via SPRINGLab/IndicTTS-English. The dataset license requires the following notice:

COPYRIGHT 2016 TTS Consortium, TDIL, Meity, represented by Hema A. Murthy and S. Umesh, Department of Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.

Responsible use

This voice is derived from a real dataset speaker. Do not use it to impersonate real people or for fraud, social engineering, or deception. Disclose AI-generated audio where required by law or policy. Provided as is, without warranty.
