🎙️ Roxi-TTS v3.1 — Indian-English voice (MOSS-TTS-Nano LoRA)

A compact Indian-English text-to-speech voice — a LoRA fine-tune of the 0.1 B MOSS-TTS-Nano on ~4 hours of a single studio speaker. Built for the VozVox voice-agent platform (customer-support / website assistants). Tiny, fast, 48 kHz, and commercially permissive end-to-end.

This is the current best of the Roxi line (preferred by ear over v2 and v3). It's an honest 0.1 B proof-of-concept: natural and clearly Indian, but read-speech in style — not yet fully conversational. See Limitations.

📋 Model at a glance


Base model	`OpenMOSS-Team/MOSS-TTS-Nano` (0.1 B, autoregressive audio-token + LLM) — Apache-2.0
Audio tokenizer	`OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano` (Apache-2.0)
Method	LoRA (PEFT) — r=32, α=64, targets `c_attn, c_proj, fc_in, fc_out` (~4.2% params), BF16, merged
Training data	~4 h, single IndicTTS-English speaker (studio, 48 kHz), 2,634 clips
Output	48 kHz mono WAV
Speaker similarity	0.96 (WavLM-SV cosine to held-out target)
Intelligibility (WER)	0.33 (Whisper-base.en on generated audio)

🧬 The Roxi line

Model	Speaker / data	Speaker-sim	WER	Notes
`roxi-tts-v2`	speaker A, ~50 min, r16	0.96	0.26	milder voice
`roxi-tts-v3`	speaker B, ~70 min, r16	0.96	0.29	different voice, fewer cut-offs
`roxi-tts-v3.1` (this)	speaker B, ~4 h, r32	0.96	0.33	preferred by ear (smoothest)
`roxi-tts-v2-onnx`	ONNX/CPU build of v2	0.73	0.25	no transformers dependency
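
For readers unfamiliar with the two metric columns, here is a minimal sketch of how such numbers are typically computed once you already have a transcript of the generated audio (for example from Whisper) and two speaker embeddings (for example from WavLM-SV). The function names and the lowercase, whitespace-split normalization are assumptions for illustration; this card does not state the exact text normalization used.

```python
import math

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()   # assumed normalization: lowercase + whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# One substituted word out of four reference words
print(word_error_rate("the office is open", "the office was open"))  # 0.25
```

Lower WER is better, and a speaker similarity closer to 1.0 means the generated voice is closer to the target speaker.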

⚠️ Requirements — please read

This model uses MOSS-TTS-Nano's custom code (trust_remote_code), which is built for transformers==4.57.1. On transformers 5.x it produces NaN/noise. Pin it, and restart your runtime after installing (Colab preloads 5.x):

pip install "transformers==4.57.1" torch torchaudio soundfile librosa sentencepiece

import transformers; assert transformers.__version__ == "4.57.1"   # verify before generating

Also: load one trust_remote_code model per kernel (loading two corrupts the module cache).
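
A common failure mode is installing the pinned version while an older import of transformers is still alive in the same process. The sketch below checks both the installed package metadata and any already-imported module, without importing transformers itself. The helper names (`pin_problems`, `installed_version`, `loaded_version`) are hypothetical and not part of any library.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

REQUIRED = "4.57.1"  # pin stated in this card

def pin_problems(installed, loaded, required=REQUIRED):
    """Return a list of human-readable problems; an empty list means the pin is satisfied."""
    problems = []
    if installed is None:
        problems.append(f"transformers is not installed; expected {required}")
    elif installed != required:
        problems.append(f"transformers {installed} is installed; expected {required}")
    # A module imported before the reinstall keeps running the old code.
    if loaded is not None and loaded != required:
        problems.append(f"transformers {loaded} is already imported; restart the runtime")
    return problems

def installed_version():
    """Version recorded in the package metadata, or None if not installed."""
    try:
        return version("transformers")
    except PackageNotFoundError:
        return None

def loaded_version():
    """Version of the transformers module already imported in this process, if any."""
    module = sys.modules.get("transformers")
    return getattr(module, "__version__", None) if module else None

for problem in pin_problems(installed_version(), loaded_version()):
    print("WARNING:", problem)
```

Running this at the top of a notebook, before any model code, prints nothing when the environment matches the pin.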

🚀 Usage

import torch
from transformers import AutoModelForCausalLM
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "IOTEverythin/roxi-tts-v3.1", trust_remote_code=True, dtype=torch.float32
).to(device).eval()

res = model.inference(
    text="Welcome. Your appointment is confirmed for Monday at ten thirty in the morning.",
    output_audio_path="out.wav", mode="continuation",
    audio_tokenizer_type="moss-audio-tokenizer-nano",
    audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
    device=device, audio_repetition_penalty=1.1, use_kv_cache=True,
)
from IPython.display import Audio; Audio("out.wav")

Recommended helper (retry + trim)

Generation is autoregressive and occasionally under-generates (cuts off) — retry and trim silence:

import numpy as np, soundfile as sf, librosa
from IPython.display import Audio, display

def say(text, tries=6):
    # Rough expected duration, assuming about 3 words per second.
    target = len(text.split()) / 3.0
    best = (0.0, None, None)  # (duration in seconds, trimmed audio, sample rate)
    for _ in range(tries):
        model.inference(text=text, output_audio_path="out.wav", mode="continuation",
            audio_tokenizer_type="moss-audio-tokenizer-nano",
            audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
            device=device, audio_repetition_penalty=1.1, use_kv_cache=True)
        y, sr = sf.read("out.wav")
        y = y.mean(axis=1) if y.ndim > 1 else y  # downmix to mono
        # Strip leading and trailing silence.
        yt, _ = librosa.effects.trim(y.astype(np.float32), top_db=35)
        dur = len(yt) / sr
        if dur > best[0]: best = (dur, yt, sr)   # keep the longest take so far
        if dur >= target * 0.9: break            # long enough, stop retrying
    if best[1] is None:
        raise RuntimeError("no audio was generated in any attempt")
    display(Audio(best[1], rate=best[2]))

say("Our Bengaluru office is open until six thirty this evening.")

Tips: spell brands phonetically ("Voz Vox"), avoid raw abbreviations ("in the morning", not "A M"), write numbers as words, and keep sentences ≤ ~12 words for reliability. Do not raise max_new_frames (the codec decode is O(n²) memory and can OOM).
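
Two of these tips (phonetic brand spellings and short sentences) are easy to apply automatically. The following is a rough sketch, not part of the model's API: `prepare_text` and the `SPOKEN_FORMS` table are hypothetical names, and the fallback that cuts over-long sentences at a fixed word count is a crude assumption that may split at an unnatural point.

```python
import re

MAX_WORDS = 12  # the tips above suggest roughly 12 words per sentence

# Hypothetical substitution table; extend it with your own brands and abbreviations.
SPOKEN_FORMS = {"VozVox": "Voz Vox"}

def prepare_text(text, max_words=MAX_WORDS, spoken_forms=SPOKEN_FORMS):
    """Apply spoken-form substitutions, then split into chunks of at most max_words words."""
    for written, spoken in spoken_forms.items():
        text = text.replace(written, spoken)
    # Split after sentence-final punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        # Sentences longer than max_words are cut into fixed-size word groups.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks

print(prepare_text("Welcome to VozVox. How can I help?"))
# ['Welcome to Voz Vox.', 'How can I help?']
```

Each returned chunk can then be synthesized separately and the resulting audio concatenated.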

🎯 Intended use

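Indian-English TTS for customer-support calls and website voice assistants, as a single-speaker branded voice. The target delivery is natural, warm and professional; the current model is not yet fully conversational and is not separately tuned for 8 kHz telephony (see Limitations).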

🏗️ Training

Data: a single expressive (storytelling) speaker isolated from SPRINGLab/IndicTTS-English via WavLM-SV speaker clustering across dataset shards (2,634 clips, ~4 h, studio 48 kHz, proper-case transcripts).
LoRA r=32/α=64 on c_attn, c_proj, fc_in, fc_out, BF16, lr 1e-4 cosine, grad-accum 8; epoch-2 checkpoint selected (later epochs overfit the reading style and raised WER).
Evaluation on held-out clips: speaker similarity via microsoft/wavlm-base-plus-sv; intelligibility via openai/whisper-base.en on the actual generated audio.
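
The LoRA settings listed above can be written down as a small configuration sketch. Only the rank, alpha and target modules come from this card; the mapping onto PEFT's `LoraConfig` and the `CAUSAL_LM` task type shown in the comment are assumptions, and other hyperparameters (such as dropout) were not stated.

```python
# Values taken from this card; anything not listed here was not stated.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["c_attn", "c_proj", "fc_in", "fc_out"],
}

# With standard LoRA scaling, the low-rank update is multiplied by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(scaling)  # 2.0

# Assumption: with PEFT installed, these values would map onto its config class, e.g.
#   from peft import LoraConfig
#   config = LoraConfig(task_type="CAUSAL_LM", **lora_kwargs)
```

Keeping alpha at twice the rank gives the same update scaling as the earlier r=16 models would have had with α=32, though this card does not state which alpha those versions used.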

🧱 Limitations

0.1 B model — sounds synthetic vs larger TTS; naturalness is capped by size.
Read-speech data — delivery is somewhat formal, not fully conversational; accent is the training speaker's (clear, mildly Indian), not strongly stereotypical.
Stochastic cut-offs — use the retry helper; keep sentences short.
Telephony (8 kHz) not separately tuned. Style/emotion control is not reliable (neutral only).
Requires transformers==4.57.1 (see Requirements).

🙏 Attribution & license

Released under Apache-2.0. Built on MOSS-TTS-Nano (Apache-2.0) and its audio tokenizer (Apache-2.0). Training data: IIT-Madras Indic TTS (English) via SPRINGLab/IndicTTS-English. Required notice:

COPYRIGHT 2016 TTS Consortium, TDIL, Meity — represented by Hema A. Murthy & S. Umesh, Department of Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.

🛡️ Responsible use

This voice is derived from a real dataset speaker. Do not use it to impersonate real people, for fraud, social engineering, or deception. Disclose AI-generated audio where required by law/policy. Provided "as is", without warranty.

Weights: Safetensors, 0.1B params, tensor type F32.
Model tree for IOTEverythin/roxi-tts-v3.1: base model OpenMOSS-Team/MOSS-TTS-Nano-100M.
