🎙️ Roxi-TTS v3.1: Indian-English voice (MOSS-TTS-Nano LoRA)

A compact Indian-English text-to-speech voice: a LoRA fine-tune of the 0.1 B MOSS-TTS-Nano on ~4 hours of audio from a single studio speaker. Built for the VozVox voice-agent platform (customer-support and website assistants). Tiny, fast, 48 kHz, and commercially permissive end-to-end.

This is the current pick of the Roxi line: preferred by ear over v2 and v3, although its WER (0.33) is the highest of the three. It's an honest 0.1 B proof-of-concept: natural and clearly Indian, but read-speech in style and not yet fully conversational. See Limitations.

📋 Model at a glance

Base model: OpenMOSS-Team/MOSS-TTS-Nano (0.1 B, autoregressive audio-token + LLM), Apache-2.0
Audio tokenizer: OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano (Apache-2.0)
Method: LoRA (PEFT), r=32, α=64, targets c_attn, c_proj, fc_in, fc_out (~4.2% of parameters), BF16, merged
Training data: ~4 h, single IndicTTS-English speaker (studio, 48 kHz), 2,634 clips
Output: 48 kHz mono WAV
Speaker similarity: 0.96 (WavLM-SV cosine to held-out target)
Intelligibility (WER): 0.33 (Whisper-base.en on generated audio)

🧬 The Roxi line

| Model | Speaker / data | Speaker-sim | WER | Notes |
| --- | --- | --- | --- | --- |
| roxi-tts-v2 | speaker A, ~50 min, r16 | 0.96 | 0.26 | milder voice |
| roxi-tts-v3 | speaker B, ~70 min, r16 | 0.96 | 0.29 | different voice, fewer cut-offs |
| roxi-tts-v3.1 (this) | speaker B, ~4 h, r32 | 0.96 | 0.33 | preferred by ear (smoothest) |
| roxi-tts-v2-onnx | ONNX/CPU build of v2 | 0.73 | 0.25 | no transformers dependency |
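
The two metrics reported for each model can be sketched in plain Python. This is only an illustration of the arithmetic, not the evaluation script used for this card: it assumes the Whisper transcript and the WavLM-SV embeddings have already been computed, and it assumes a simple lowercase, whitespace-split normalization that the card does not specify. The function names `wer` and `cosine_similarity` are hypothetical.

```python
import math

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of each outer iteration, prev[j] is the edit distance
    # between the first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(wer("the office is open until six", "the office is open till six"))
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

Read this way, and assuming the reported WER is a fraction rather than a percentage, a WER of 0.33 means roughly one word-level error for every three reference words.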

⚠️ Requirements (please read)

This model uses MOSS-TTS-Nano's custom code (trust_remote_code), which is built for transformers==4.57.1. On transformers 5.x it produces NaN/noise. Pin it, and restart your runtime after installing (Colab preloads 5.x):

pip install "transformers==4.57.1" torch torchaudio soundfile librosa sentencepiece
import transformers; assert transformers.__version__ == "4.57.1"   # verify before generating

Also: load one trust_remote_code model per kernel (loading two corrupts the module cache).
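
Because a wrong transformers version shows up as NaN/noise rather than as an error, it may help to check the installed version from package metadata before importing the library at all. A minimal standard-library sketch; the helper names `check_transformers_pin` and `installed_transformers_version` are hypothetical and not part of this repository:

```python
from importlib import metadata

REQUIRED = "4.57.1"  # version the MOSS-TTS-Nano remote code is built for, per this card

def check_transformers_pin(installed, required=REQUIRED):
    """Return True only if the installed version string matches the pin exactly."""
    return installed is not None and installed.strip() == required

def installed_transformers_version():
    """Read the version from package metadata without importing transformers."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None  # transformers is not installed

if __name__ == "__main__":
    found = installed_transformers_version()
    if check_transformers_pin(found):
        print("transformers pin OK:", found)
    else:
        print(f"transformers is {found!r}; expected {REQUIRED}. "
              "Reinstall the pinned version, then restart the runtime.")
```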

🚀 Usage

import torch
from transformers import AutoModelForCausalLM
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(
    "IOTEverythin/roxi-tts-v3.1", trust_remote_code=True, dtype=torch.float32
).to(device).eval()

res = model.inference(
    text="Welcome. Your appointment is confirmed for Monday at ten thirty in the morning.",
    output_audio_path="out.wav", mode="continuation",
    audio_tokenizer_type="moss-audio-tokenizer-nano",
    audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
    device=device, audio_repetition_penalty=1.1, use_kv_cache=True,
)
from IPython.display import Audio; Audio("out.wav")

Recommended helper (retry + trim)

Generation is autoregressive and occasionally under-generates (cuts off), so retry and trim silence:

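import numpy as np, soundfile as sf, librosa
from IPython.display import Audio, display

def say(text, tries=6):
    # Expected duration in seconds, assuming roughly 3 words per second.
    target = len(text.split()) / 3.0
    best = (0.0, None, None)  # (duration in s, trimmed samples, sample rate)
    for _ in range(tries):
        model.inference(text=text, output_audio_path="out.wav", mode="continuation",
            audio_tokenizer_type="moss-audio-tokenizer-nano",
            audio_tokenizer_pretrained_name_or_path="OpenMOSS-Team/MOSS-Audio-Tokenizer-Nano",
            device=device, audio_repetition_penalty=1.1, use_kv_cache=True)
        y, sr = sf.read("out.wav")
        y = y.mean(axis=1) if y.ndim > 1 else y  # downmix to mono
        # Trim leading/trailing silence so the duration reflects actual speech.
        yt, _ = librosa.effects.trim(y.astype(np.float32), top_db=35)
        duration = len(yt) / sr
        if duration > best[0]:
            best = (duration, yt, sr)  # keep the longest take so far
        if duration >= target * 0.9:
            break  # close enough to the expected length; stop retrying
    if best[1] is None:
        raise RuntimeError("no audio was generated")
    display(Audio(best[1], rate=best[2]))

say("Our Bengaluru office is open until six thirty this evening.")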

Tips: spell brands phonetically ("Voz Vox"), avoid raw abbreviations ("in the morning", not "A M"), write numbers as words, and keep sentences ≤ ~12 words for reliability. Do not raise max_new_frames (the codec decode is O(n²) memory and can OOM).
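
The sentence-length tip can be applied automatically before synthesis. The sketch below is one possible approach, not part of this repository: the helper name `split_for_tts` is hypothetical, and cutting long sentences at a fixed word count is a crude stand-in for splitting at natural pauses such as commas.

```python
import re

def split_for_tts(text, max_words=12):
    """Split text into sentences, then cut any long sentence into chunks of at most max_words words."""
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks

if __name__ == "__main__":
    demo = "Welcome. Your appointment is confirmed for Monday at ten thirty in the morning."
    for chunk in split_for_tts(demo):
        print(chunk)
```

Each chunk could then be passed to say() in turn, at the cost of a short pause between chunks.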

🎯 Intended use

Indian-English TTS for customer-support calls and website voice assistants, as a single-speaker branded voice. The target delivery is natural, conversational, warm/professional, and telephony-aware; see Limitations for where the current model falls short of that.

🏗️ Training

  • Data: a single expressive (storytelling) speaker isolated from SPRINGLab/IndicTTS-English via WavLM-SV speaker clustering across dataset shards (2,634 clips, ~4 h, studio 48 kHz, proper-case transcripts).
  • LoRA r=32/α=64 on c_attn, c_proj, fc_in, fc_out, BF16, lr 1e-4 cosine, grad-accum 8; epoch-2 checkpoint selected (later epochs overfit the reading style and raised WER).
  • Evaluation on held-out clips: speaker similarity via microsoft/wavlm-base-plus-sv; intelligibility via openai/whisper-base.en on the actual generated audio.
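
The LoRA settings listed above could be expressed with PEFT roughly as follows. This is a sketch of the stated hyperparameters only, not the actual training script: values the card does not give (dropout, bias handling, task type) are left at PEFT defaults, and the helper `lora_params` is hypothetical.

```python
# Hyperparameters as stated in this card.
lora_kwargs = dict(
    r=32,
    lora_alpha=64,
    target_modules=["c_attn", "c_proj", "fc_in", "fc_out"],
)

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in x d_out linear layer (A is r x d_in, B is d_out x r)."""
    return r * (d_in + d_out)

try:
    from peft import LoraConfig  # optional: only needed to actually attach adapters
    config = LoraConfig(**lora_kwargs)
except ImportError:
    config = None  # peft not installed; the numbers above still hold

if __name__ == "__main__":
    print("scaling:", scaling)
```

With r=32 and α=64 the update is scaled by 2.0. As an illustration with made-up dimensions, a hypothetical 768 x 768 layer would gain 32 × (768 + 768) = 49,152 trainable parameters; the real layer sizes of MOSS-TTS-Nano are not stated in this card.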

🧱 Limitations

  • 0.1 B model: sounds synthetic vs larger TTS; naturalness is capped by size.
  • Read-speech data: delivery is somewhat formal, not fully conversational; accent is the training speaker's (clear, mildly Indian), not strongly stereotypical.
  • Stochastic cut-offs: use the retry helper; keep sentences short.
  • Telephony (8 kHz) not separately tuned. Style/emotion control is not reliable (neutral only).
  • Requires transformers==4.57.1 (see Requirements).

🙏 Attribution & license

Released under Apache-2.0. Built on MOSS-TTS-Nano (Apache-2.0) and its audio tokenizer (Apache-2.0). Training data: IIT-Madras Indic TTS (English) via SPRINGLab/IndicTTS-English. Required notice:

COPYRIGHT 2016 TTS Consortium, TDIL, Meity, represented by Hema A. Murthy & S. Umesh, Department of Computer Science and Engineering and Electrical Engineering, IIT Madras. ALL RIGHTS RESERVED.

🛡️ Responsible use

This voice is derived from a real dataset speaker. Do not use it to impersonate real people, for fraud, social engineering, or deception. Disclose AI-generated audio where required by law/policy. Provided "as is", without warranty.
