Roxi-TTS v3.1: Indian-English voice (MOSS-TTS-Nano LoRA)
A compact Indian-English text-to-speech voice: a LoRA fine-tune of the 0.1 B MOSS-TTS-Nano on ~4 hours of a single studio speaker. Built for the VozVox voice-agent platform (customer-support / website assistants). Tiny, fast, 48 kHz, and commercially permissive end-to-end.
This is the current best of the Roxi line (preferred by ear over v2 and v3). It's an honest 0.1 B proof-of-concept: natural and clearly Indian, but read-speech in style, not yet fully conversational. See Limitations.
Model at a glance
| Property | Value |
|---|---|
| Base model | OpenMOSS-Team/MOSS-TTS-Nano (0.1 B, autoregressive audio-token + LLM), Apache-2.0 |
| Audio tokenizer | OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano (Apache-2.0) |
| Method | LoRA (PEFT): r=32, α=64, targets c_attn, c_proj, fc_in, fc_out (~4.2% params), BF16, merged |
| Training data | ~4 h, single IndicTTS-English speaker (studio, 48 kHz), 2,634 clips |
| Output | 48 kHz mono WAV |
| Speaker similarity | 0.96 (WavLM-SV cosine to held-out target) |
| Intelligibility (WER) | 0.33 (Whisper-base.en on generated audio) |
The Roxi line
| Model | Speaker / data | Speaker-sim | WER | Notes |
|---|---|---|---|---|
| roxi-tts-v2 | speaker A, ~50 min, r16 | 0.96 | 0.26 | milder voice |
| roxi-tts-v3 | speaker B, ~70 min, r16 | 0.96 | 0.29 | different voice, fewer cut-offs |
| roxi-tts-v3.1 (this) | speaker B, ~4 h, r32 | 0.96 | 0.33 | preferred by ear (smoothest) |
| roxi-tts-v2-onnx | ONNX/CPU build of v2 | 0.73 | 0.25 | no transformers dependency |
Requirements (please read)
This model uses MOSS-TTS-Nano's custom code (trust_remote_code), which is built for
transformers==4.57.1. On transformers 5.x it produces NaN/noise. Pin it, and restart your
runtime after installing (Colab preloads 5.x):
pip install "transformers==4.57.1" torch torchaudio soundfile librosa sentencepiece
import transformers; assert transformers.__version__ == "4.57.1" # verify before generating
Also: load one trust_remote_code model per kernel (loading two corrupts the module cache).
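The one-line assert above imports transformers before checking it. As a minimal sketch, the same pin can be checked from package metadata without importing the library at all, which is convenient as the first cell of a notebook. The function name `check_transformers_pin` is hypothetical and not part of this repository; only `importlib.metadata` from the standard library is used.

```python
from importlib import metadata

REQUIRED = "4.57.1"  # the pin stated in the Requirements text


def check_transformers_pin(required: str = REQUIRED) -> tuple[bool, str]:
    """Return (ok, message) based on the installed package metadata.

    Hypothetical helper: it reads what is installed on disk and does not
    import transformers, so it is safe to run before loading the model.
    """
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False, f'transformers is not installed; run: pip install "transformers=={required}"'
    if installed != required:
        return False, (
            f"found transformers {installed}, need {required}; "
            "reinstall the pinned version, then restart the runtime"
        )
    return True, f"transformers {installed} matches the pin"


ok, message = check_transformers_pin()
print(message)
```

Because the metadata reflects what is installed rather than what the running kernel has already imported, a passing check after a fresh `pip install` still calls for the runtime restart described above.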
Usage
import torch
from transformers import AutoModelForCausalLM
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the merged checkpoint in float32 and switch to inference mode.
model = AutoModelForCausalLM.from_pretrained(
    "IOTEverythin/roxi-tts-v3.1", trust_remote_code=True, dtype=torch.float32
).to(device).eval()
# Synthesize one sentence; the audio is written to out.wav.
res = model.inference(
    text="Welcome. Your appointment is confirmed for Monday at ten thirty in the morning.",
    output_audio_path="out.wav", mode="continuation",
    audio_tokenizer_type="moss-audio-tokenizer-nano",
    audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
    device=device, audio_repetition_penalty=1.1, use_kv_cache=True,
)
from IPython.display import Audio; Audio("out.wav")
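The model card lists the output as 48 kHz mono WAV. A quick way to confirm that for a generated file is to read its header; the sketch below uses only the standard-library `wave` module. The function name `describe_wav` is hypothetical, and the sketch assumes the file is PCM-encoded: `wave` cannot open floating-point WAV files, in which case `soundfile.info` from the installed `soundfile` package is the alternative.

```python
import wave


def describe_wav(path: str) -> dict:
    """Read header fields from a PCM WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate": rate,
            "channels": wav.getnchannels(),
            "duration_s": frames / rate if rate else 0.0,
        }


# Usage, once out.wav exists (assumes PCM encoding):
# info = describe_wav("out.wav")
# print(info)
```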
Recommended helper (retry + trim)
Generation is autoregressive and occasionally under-generates (cuts off), so retry and trim silence:
import numpy as np, soundfile as sf, librosa
from IPython.display import Audio, display
def say(text, tries=6):
    # Expect roughly 3 words per second; best holds (seconds, samples, sample rate).
    target = len(text.split()) / 3.0
    best = (0, None, 24000)
    for _ in range(tries):
        model.inference(text=text, output_audio_path="out.wav", mode="continuation",
                        audio_tokenizer_type="moss-audio-tokenizer-nano",
                        audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
                        device=device, audio_repetition_penalty=1.1, use_kv_cache=True)
        y, sr = sf.read("out.wav")
        y = y.mean(1) if y.ndim > 1 else y  # downmix to mono if needed
        yt, _ = librosa.effects.trim(y.astype(np.float32), top_db=35)  # strip leading/trailing silence
        if len(yt) / sr > best[0]:  # keep the longest take so far
            best = (len(yt) / sr, yt, sr)
        if len(yt) / sr >= target * 0.9:  # long enough: stop retrying
            break
    display(Audio(best[1], rate=best[2]))
say("Our Bengaluru office is open until six thirty this evening.")
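The stopping rule inside `say` is a plain duration heuristic: it assumes about 3 words per second and accepts a take once the trimmed audio reaches 90% of that estimate. The sketch below isolates that arithmetic so the threshold for a given sentence can be checked without running the model. The names `expected_seconds` and `looks_complete` are hypothetical; the two constants are the ones used in the helper above.

```python
WORDS_PER_SECOND = 3.0  # speaking-rate estimate used by say()
MIN_FRACTION = 0.9      # say() accepts a take at 90% of the estimate


def expected_seconds(text: str) -> float:
    """Estimated spoken duration from the whitespace-separated word count."""
    return len(text.split()) / WORDS_PER_SECOND


def looks_complete(text: str, duration_s: float) -> bool:
    """True when a take is long enough to be accepted without another retry."""
    return duration_s >= expected_seconds(text) * MIN_FRACTION


sentence = "Our Bengaluru office is open until six thirty this evening."
print(round(expected_seconds(sentence), 2))  # 10 words / 3.0 words per second
print(looks_complete(sentence, 3.2), looks_complete(sentence, 1.5))
```

A word count is only a rough proxy for duration: spelled-out numbers and long words take more time per word, so a take can pass this check and still be clipped.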
Tips: spell brands phonetically ("Voz Vox"), avoid raw abbreviations ("in the morning", not "A M"),
write numbers as words, and keep sentences ≤ ~12 words for reliability. Do not raise
max_new_frames (the codec decode is O(n²) memory and can OOM).
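One way to follow the sentence-length tip for longer passages is to split the text before synthesis. The sketch below is an assumption about how that could be done, not part of this repository: `split_for_tts` is a hypothetical helper that splits on sentence-ending punctuation and then cuts any sentence over the word budget into fixed-size pieces.

```python
import re

MAX_WORDS = 12  # the rough per-sentence budget from the tips above


def split_for_tts(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Split text into sentences, then break long sentences into word-limited chunks."""
    chunks = []
    # Split after ., ! or ? followed by whitespace; the punctuation stays attached.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = sentence.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


passage = (
    "Thank you for calling. Your order will arrive on Monday and the courier "
    "will call you before he reaches your house in the evening."
)
for chunk in split_for_tts(passage):
    print(chunk)
```

Cutting at a fixed word count can land mid-phrase, so rewriting long sentences by hand, or splitting at commas first, will usually sound better than relying on the fallback.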
Intended use
Indian-English TTS for customer-support calls and website voice assistants: natural, conversational, warm/professional, telephony-aware. Single-speaker branded voice.
Training
- Data: a single expressive (storytelling) speaker isolated from SPRINGLab/IndicTTS-English via WavLM-SV speaker clustering across dataset shards (~2,634 clips, ~4 h, studio 48 kHz, proper-case transcripts).
- LoRA r=32/α=64 on c_attn, c_proj, fc_in, fc_out, BF16, lr 1e-4 cosine, grad-accum 8; epoch-2 checkpoint selected (later epochs overfit the reading style and raised WER).
- Evaluation on held-out clips: speaker similarity via microsoft/wavlm-base-plus-sv; intelligibility via openai/whisper-base.en on the actual generated audio.
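The two reported metrics follow standard definitions: WER is the word-level edit distance between the transcript of the generated audio and the reference text, divided by the reference length, and speaker similarity is the cosine between two speaker embeddings. The sketch below shows only that arithmetic in pure Python. It is not the evaluation script used for this model; the text normalization (lowercasing, stripping punctuation) is an assumption, and the embeddings and transcripts would come from the WavLM-SV and Whisper models named above.

```python
import math
import re


def normalize(text: str) -> list[str]:
    """Lowercase and strip punctuation (an assumed normalization)."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                curr[j - 1] + 1,         # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


print(wer("Your appointment is confirmed.", "your appointment was confirmed"))
```

Read this way, the reported WER of 0.33 corresponds to roughly one word-level error per three reference words as judged by the Whisper transcript, which includes the recognizer's own mistakes.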
Limitations
- 0.1 B model: sounds synthetic vs larger TTS; naturalness is capped by size.
- Read-speech data: delivery is somewhat formal, not fully conversational; accent is the training speaker's (clear, mildly Indian), not strongly stereotypical.
- Stochastic cut-offs: use the retry helper; keep sentences short.
- Telephony (8 kHz) not separately tuned. Style/emotion control is not reliable (neutral only).
- Requires transformers==4.57.1 (see Requirements).
Attribution & license
Released under Apache-2.0. Built on MOSS-TTS-Nano (Apache-2.0) and its audio tokenizer (Apache-2.0).
Training data: IIT-Madras Indic TTS (English) via SPRINGLab/IndicTTS-English. Required notice:
COPYRIGHT 2016 TTS Consortium, TDIL, Meity, represented by Hema A. Murthy & S. Umesh, Department of Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.
Responsible use
This voice is derived from a real dataset speaker. Do not use it to impersonate real people, or for fraud, social engineering, or deception. Disclose AI-generated audio where required by law/policy. Provided "as is", without warranty.