Roxi-TTS Emotion (1.7B): controllable emotional Indian-English text-to-speech

Roxi-TTS Emotion is a 1.7B text-to-speech model that speaks Indian-English in eight selectable emotions from a single, consistent voice. You choose the emotion per sentence with a short text instruction, so the same speaker can sound happy, sad, angry, excited, calm, apologetic, fearful, or neutral.

Non-commercial. This model was trained on the Skit-AI Emotional TTS dataset, which is licensed CC BY-NC 4.0. The model is therefore released under CC BY-NC 4.0 and must not be used for commercial purposes. For commercial use, obtain a license from Skit-AI.

Why Roxi-TTS Emotion

  • Eight emotions from one voice: neutral, happy, sad, angry, excited, calm, apologetic, fear.
  • Controlled by a short text instruction, with no separate model per emotion.
  • One consistent speaker across all emotions: 0.958 speaker similarity (WavLM-SV).
  • Measurably expressive: 2.1x spread in pitch variation across the eight emotions.
  • Natural Indian-English accent, 24 kHz output.

Emotions and how to steer

Set the instruction field to one of the following:

Emotion    | Instruction
neutral    | Speak in a neutral, clear, conversational Indian-English style.
happy      | Speak in a happy, cheerful, warm tone, in a clear Indian-English style.
sad        | Speak in a sad, downcast, sorrowful tone, in a clear Indian-English style.
angry      | Speak in an angry, irritated, forceful tone, in a clear Indian-English style.
excited    | Speak in an excited, high-energy, enthusiastic tone, in a clear Indian-English style.
calm       | Speak in a calm, relaxed, soothing tone, in a clear Indian-English style.
apologetic | Speak in an apologetic, regretful, gentle tone, in a clear Indian-English style.
fear       | Speak in a fearful, anxious, worried tone, in a clear Indian-English style.
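
The instruction strings above can be kept in a lookup so that calling code passes an emotion name instead of a full sentence. This is a minimal sketch: `EMOTION_INSTRUCTIONS` and `instruction_for` are illustrative names, not part of the model's API, and the strings are copied from the table.

```python
# Instruction strings copied verbatim from the table above.
# EMOTION_INSTRUCTIONS and instruction_for are hypothetical helper names.
EMOTION_INSTRUCTIONS = {
    "neutral": "Speak in a neutral, clear, conversational Indian-English style.",
    "happy": "Speak in a happy, cheerful, warm tone, in a clear Indian-English style.",
    "sad": "Speak in a sad, downcast, sorrowful tone, in a clear Indian-English style.",
    "angry": "Speak in an angry, irritated, forceful tone, in a clear Indian-English style.",
    "excited": "Speak in an excited, high-energy, enthusiastic tone, in a clear Indian-English style.",
    "calm": "Speak in a calm, relaxed, soothing tone, in a clear Indian-English style.",
    "apologetic": "Speak in an apologetic, regretful, gentle tone, in a clear Indian-English style.",
    "fear": "Speak in a fearful, anxious, worried tone, in a clear Indian-English style.",
}


def instruction_for(emotion):
    """Return the instruction string for an emotion name (case-insensitive)."""
    key = emotion.strip().lower()
    if key not in EMOTION_INSTRUCTIONS:
        valid = ", ".join(sorted(EMOTION_INSTRUCTIONS))
        raise ValueError(f"Unknown emotion {emotion!r}; expected one of: {valid}")
    return EMOTION_INSTRUCTIONS[key]
```

Rejecting unknown names early is a deliberate choice here: an emotion such as "surprise" is not among the eight, so it raises an error instead of being passed through as free text.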

Quick facts

Field             | Value
Base model        | OpenMOSS-Team/MOSS-TTS-Local-Transformer (1.7B, Apache-2.0)
Audio tokenizer   | OpenMOSS-Team/MOSS-Audio-Tokenizer (Apache-2.0)
Method            | LoRA (PEFT), r=16, alpha=32, merged into the base weights
Training data     | Skit-AI Emotional TTS, single Indian-English female speaker, 8 emotions, about 2825 clips
Output            | 24 kHz mono
Voice consistency | 0.958 speaker similarity across emotions (WavLM-SV)
Emotion range     | 2.1x pitch-variation spread across the eight emotions
Speed             | Real-time factor about 2.5 on a 16 GB GPU, not real-time
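
To make the speed figure concrete: assuming the real-time factor is defined as synthesis time divided by audio duration (the common convention, and consistent with "not real-time" here), a value of about 2.5 means each second of speech takes roughly 2.5 seconds to generate. The helper below is a rough planning sketch only; `estimate_synthesis_seconds` is a hypothetical name, and actual timings will vary with hardware and sentence length.

```python
def estimate_synthesis_seconds(audio_seconds, rtf=2.5):
    """Estimate wall-clock generation time from audio length.

    Assumes RTF = synthesis time / audio duration; 2.5 is the
    approximate figure quoted above for a 16 GB GPU.
    """
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


# A 4-second sentence would take about 10 seconds under this assumption.
print(estimate_synthesis_seconds(4.0))  # 10.0
```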

Install

Built for transformers 4.57.1. Clone the MOSS-TTS repository so the model class is importable; the quick start below adds the clone to sys.path.

pip install "transformers==4.57.1" torch torchaudio soundfile librosa peft
git clone https://github.com/OpenMOSS/MOSS-TTS.git
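
Because the model is built for one specific transformers version, it can help to check the installed version before loading anything. The following is a small sketch using only the standard library; `transformers_status` is a hypothetical helper name, and the exact-match comparison is an assumption (nearby versions may or may not work).

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_TRANSFORMERS = "4.57.1"  # version stated in this card


def transformers_status(required=REQUIRED_TRANSFORMERS):
    """Return 'ok', 'missing', or 'mismatch: <installed version>'."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "missing"
    return "ok" if installed == required else f"mismatch: {installed}"


print(transformers_status())
```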

Quick start

import sys, torch, soundfile as sf
sys.path.insert(0, "MOSS-TTS")  # make the cloned MOSS-TTS repository importable
from transformers import AutoProcessor
from moss_tts_local.modeling_moss_tts import MossTTSDelayModel

repo = "IOTEverythin/roxi-tts-emotion"
# Use bfloat16 on GPU; fall back to float32 on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

# Load the processor, move its audio tokenizer to the target device, then load the model.
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)
model = MossTTSDelayModel.from_pretrained(repo, torch_dtype=dtype, attn_implementation="sdpa").to(device).eval()

text = "I just heard the news about the meeting tomorrow."
# One of the eight instruction strings from "Emotions and how to steer".
instruction = "Speak in an excited, high-energy, enthusiastic tone, in a clear Indian-English style."

# Build a one-item batch and sample the output.
conv = [[processor.build_user_message(text=text, instruction=instruction)]]
batch = processor(conv, mode="generation")
out = model.generate(
    input_ids=batch["input_ids"].to(device),
    attention_mask=batch["attention_mask"].to(device),
    max_new_tokens=4096, do_sample=True, temperature=0.9,
)
# Decode the result and write it at the model's sampling rate.
audio = processor.decode(out)[0].audio_codes_list[0]
sf.write("out.wav", audio.float().cpu().numpy(), processor.model_config.sampling_rate)

Swap the instruction to change the emotion. Generation is autoregressive and can under-generate, so if a clip is short, generate a few times and keep the longest, then trim silence. Keep sentences to about twelve words.
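
The retry advice above (generate a few times, keep the longest, then trim silence) can be sketched as follows. This is an illustration under stated assumptions, not the author's procedure: `trim_silence` and `best_of` are hypothetical names, the audio is assumed to be a 1-D mono sequence of float samples, and the 0.01 amplitude threshold is a guess to tune by ear. The code is pure Python so that it works on both lists and 1-D NumPy arrays.

```python
def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is at or below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) <= threshold:
        start += 1
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    return samples[start:end]


def best_of(generate_once, attempts=3, threshold=0.01):
    """Call generate_once() several times, keep the longest clip, then trim it.

    generate_once is any zero-argument callable that returns one clip
    as a 1-D sequence of float samples.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    clips = [generate_once() for _ in range(attempts)]
    longest = max(clips, key=len)
    return trim_silence(longest, threshold)
```

With the quick start, `generate_once` would wrap the `model.generate` and `processor.decode` calls and return the NumPy array that is passed to `sf.write`. A fixed amplitude threshold is the simplest option; an energy-based trimmer such as `librosa.effects.trim` may handle noisy tails better.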

Limitations

  • Non-commercial license (see below).
  • Not real-time on a single consumer GPU.
  • Eight emotions. Surprise is excluded because the source had no audio for it.
  • Emotion is set per sentence through the instruction, not through inline tags inside a sentence.
  • Requires transformers 4.57.1.
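
Since emotion is set per sentence, a passage with changing emotions has to be split into sentences, each paired with its own instruction and generated separately. The sketch below shows only that planning step. `split_sentences` and `per_sentence_jobs` are hypothetical helpers, and the regex splitter is deliberately naive (it will mis-split abbreviations such as "Dr. Rao").

```python
import re

# Two instruction strings taken from "Emotions and how to steer".
APOLOGETIC = "Speak in an apologetic, regretful, gentle tone, in a clear Indian-English style."
HAPPY = "Speak in a happy, cheerful, warm tone, in a clear Indian-English style."


def split_sentences(passage):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", passage.strip())
    return [p for p in parts if p]


def per_sentence_jobs(segments):
    """Expand (passage, instruction) segments into one (sentence, instruction) job per sentence."""
    return [
        (sentence, instruction)
        for passage, instruction in segments
        for sentence in split_sentences(passage)
    ]


jobs = per_sentence_jobs([
    ("I am sorry about the delay. We made a mistake.", APOLOGETIC),
    ("Your order ships today!", HAPPY),
])
for sentence, instruction in jobs:
    # Word counts help spot sentences longer than the suggested twelve words.
    print(f"{len(sentence.split()):2d} words | {sentence}")
```

Each job would then be passed to `processor.build_user_message(text=..., instruction=...)` as in the quick start, and the resulting clips concatenated in order.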

License and attribution

Released under CC BY-NC 4.0 (non-commercial). Built on MOSS-TTS-Local-Transformer (Apache-2.0) and the MOSS Audio Tokenizer (Apache-2.0). The emotion control is learned from the Skit-AI Emotional TTS dataset (https://github.com/skit-ai/emotion-tts-dataset), which is licensed CC BY-NC 4.0, copyright Skit.ai. Because the training data is non-commercial, this derivative model is non-commercial as well.

Responsible use

This voice is derived from a real dataset speaker. Do not use it to impersonate real people or for fraud, social engineering, or deception. Disclose AI-generated audio where required by law or policy. Provided as is, without warranty.
