AudAR TTS Pro Xpression v2

AudAR TTS Pro Xpression v2 is a 4-billion parameter autoregressive text-to-speech model with native support for paralinguistic expression control. It generates highly expressive, emotionally nuanced speech in both Arabic and English with fine-grained control over vocal delivery through inline expression tags.

What's New in v2

  • Improved training: Extended training with best checkpoint selection (epoch 6.97, lowest eval loss)
  • Better emotional fidelity: Refined expression token learning for more natural delivery
  • Reduced artifacts: Improved stability in long-form generation

Key Features

  • 4B parameter autoregressive architecture trained on large-scale expressive speech data
  • Zero-shot voice cloning from a short reference audio (5-15 seconds)
  • 11 paralinguistic expression tokens for precise emotional control
  • Bilingual Arabic and English with dialect awareness
  • 24kHz output via NeuCodec neural vocoder
  • Flash Attention v2 optimized for fast inference on modern GPUs

Paralinguistic Expression System

Active Expression Tokens

These tokens have been trained with strong acoustic grounding on large-scale expressive data (>2,000 training examples each):

Token        Effect                      Description
[gasp]       Audible intake of breath    Surprise, shock, realization
[trembling]  Shaky, unsteady voice       Fear, cold, extreme emotion
[shouting]   Raised volume, projection   Anger, urgency, excitement
[crying]     Tearful vocal quality       Sadness, grief, overwhelming emotion
[giggles]    Light laughter              Amusement, nervousness, flirtation
[cough]      Throat clearing / cough     Illness, hesitation, interruption
[yawn]       Yawning vocalization        Tiredness, boredom
[panicked]   Rapid, breathless delivery  Emergency, fear, anxiety
[tired]      Low energy, slower pace     Exhaustion, fatigue
[very slow]  Deliberately slow pacing    Emphasis, gravity, sleepiness
[very fast]  Accelerated delivery        Urgency, excitement, news reporting

Legacy Prosody Tags

These tags are inherited from the base training data and provide additional stylistic control:

Tag          Effect                      Description
[laughs]     Full laughter               Joy, humor
[whispers]   Reduced volume, breathy     Secrecy, intimacy, suspense
[sighs]      Exhalation                  Resignation, relief, frustration
[excited]    High energy, bright tone    Enthusiasm, good news
[curious]    Rising intonation           Questioning, wonder
[sarcastic]  Flat/exaggerated tone       Irony, mockery

Usage

Input Format

The model uses a structured prompt format with reference audio encoding:

user: Convert the text to speech:
<|REF_TEXT_START|>{reference_text}<|REF_TEXT_END|>
<|REF_SPEECH_START|>{encoded_reference_audio}<|REF_SPEECH_END|>
<|TARGET_TEXT_START|>{target_text_with_expression_tags}<|TARGET_TEXT_END|>
assistant:
<|TARGET_CODES_START|>{generated_speech_codes}<|TARGET_CODES_END|>
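The structured format above can be assembled with a small helper. This is an illustrative sketch, not an official API: the `build_prompt` name and signature are assumptions, while the special-token strings are taken verbatim from the format shown.

```python
# Sketch of a prompt builder for the structured format above.
# The special tokens come from the documented format; the function
# itself is illustrative, not part of the released tooling.

def build_prompt(ref_text: str, ref_codes: list[int], target_text: str) -> str:
    # Each reference speech code becomes one <|speech_N|> token.
    ref_codes_str = "".join(f"<|speech_{c}|>" for c in ref_codes)
    return (
        "user: Convert the text to speech:"
        f"<|REF_TEXT_START|>{ref_text}<|REF_TEXT_END|>"
        f"<|REF_SPEECH_START|>{ref_codes_str}<|REF_SPEECH_END|>"
        f"<|TARGET_TEXT_START|>{target_text}<|TARGET_TEXT_END|>"
        "\nassistant:<|TARGET_CODES_START|>"
    )

prompt = build_prompt("Hello there.", [12, 345], "[excited] Great news!")
```

Note that the prompt deliberately ends at `<|TARGET_CODES_START|>` so generation continues with speech codes.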

Expression Tag Placement

Tags can be placed before, after, or inline within text:

# Tag before text - sets the tone for what follows
text = "[panicked] Everyone get out of the building now!"

# Inline tags - expression shifts mid-sentence
text = "He said he was fine [crying] but I could tell he wasn't"

# Multiple tags - layered expression
text = "[very fast] Breaking news! [shouting] The team has won!"

Expressive Dialogue Examples

English - Joy & Celebration

text = "[excited] I got the job! They called me this morning! [giggles] I literally jumped out of bed and started dancing!"

English - Grief

text = "[very slow] I held his hand until the very end. [crying] He looked at me and smiled... and then he was just... gone. [trembling] The room felt so empty."

Arabic - Excitement

text = "[excited] نجحت! والله نجحت بامتياز! [giggles] ما صدقت لما شفت النتيجة! [shouting] الحمد لله!"

Arabic - Sadness

text = "[crying] فقدناه... فقدنا أغلى إنسان. [trembling] كل يوم أصحى وأحسّ إنه بيجي... بس ما يجي. [very slow] وحشتني يا أبوي."

Arabic - Sports

text = "[very fast] عاجل! المنتخب سجّل هدف في الدقيقة التسعين! [shouting] يا الله! [giggles] الجمهور جنّ جنونه!"

Best Practices

  1. Don't overuse tags - One or two tags per sentence give the most natural results
  2. Match tag to content - The expression should be contextually appropriate
  3. Use transitions - Combine fast/slow pacing with emotional tags for dynamic delivery
  4. Reference audio matters - The reference speaker's natural style influences tag interpretation
  5. Combine with punctuation - Exclamation marks and ellipses reinforce expression tags
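Best practice 1 can be checked mechanically. The sketch below flags sentences carrying more than two tags; the sentence splitting is deliberately naive (split on `.`, `!`, `?`) and the helper is our own illustration, not shipped tooling:

```python
import re

def overloaded_sentences(text: str, max_tags: int = 2) -> list[str]:
    """Return sentences that carry more than max_tags expression tags."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if len(re.findall(r"\[[^\]]+\]", s)) > max_tags]

ok = "[excited] I got the job! [giggles] I started dancing!"
bad = "[excited] [shouting] [giggles] [very fast] We won the cup!"
print(len(overloaded_sentences(ok)), len(overloaded_sentences(bad)))  # 0 1
```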

Voice Profiles

Pre-selected reference speakers are available as a separate dataset: audarai/voice_profile_xpress_v1

12 speakers (6 Arabic, 6 English) with gender diversity and dialect coverage.

Technical Specifications

Property            Value
Parameters          4.19 billion
Architecture        Autoregressive Transformer
Precision           bfloat16
Vocab Size          217,240 tokens
Max Context         8,192 tokens
Audio Codec         NeuCodec (24 kHz, single codebook)
Output Sample Rate  24,000 Hz
Languages           Arabic, English
Attention           Flash Attention v2 supported
Training            rsLoRA, rank 64, alpha 128
Best Checkpoint     Epoch 6.97, eval_loss 5.1397

Inference Requirements

  • GPU: NVIDIA GPU with 16GB+ VRAM (H100/A100/RTX 4090 recommended)
  • Dependencies: transformers, torch, neucodec
  • Recommended: Flash Attention 2 for optimal throughput
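A quick back-of-envelope check supports the 16GB+ guidance: 4.19B parameters at 2 bytes each (bfloat16) is roughly 7.8 GiB of weights alone, before the KV cache, activations, and the codec are accounted for.

```python
# Rough weight-memory estimate for a 4.19B-parameter bfloat16 model.
params = 4.19e9
bytes_per_param = 2            # bfloat16
weight_gib = params * bytes_per_param / 1024**3
print(f"{weight_gib:.1f} GiB")  # ~7.8 GiB, weights only
```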

Quick Start

import torch
import re
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec

# Load model
model_id = "audarai/tts-pro-xpression-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
model.eval()

# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec")
codec.eval()

# Encode reference audio (16kHz)
import librosa
wav, _ = librosa.load("reference.wav", sr=16000, mono=True)
wav_t = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_t).squeeze(0).squeeze(0).tolist()

# Build prompt
ref_text = "Your reference transcript here"
target_text = "[excited] This is amazing news! [giggles] I can't believe it!"
ref_codes_str = ''.join(f'<|speech_{c}|>' for c in ref_codes)

prompt = (
    f'user: Convert the text to speech:'
    f'<|REF_TEXT_START|>{ref_text}<|REF_TEXT_END|>'
    f'<|REF_SPEECH_START|>{ref_codes_str}<|REF_SPEECH_END|>'
    f'<|TARGET_TEXT_START|>{target_text}<|TARGET_TEXT_END|>'
    f'\nassistant:<|TARGET_CODES_START|>'
)

input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
eos_id = tokenizer.convert_tokens_to_ids("<|TARGET_CODES_END|>")

# Generate
with torch.no_grad():
    output = model.generate(
        input_ids, max_length=input_ids.shape[1] + 2048,
        eos_token_id=eos_id, do_sample=True,
        temperature=1.0, top_k=50, min_new_tokens=50, use_cache=True,
    )

# Decode to audio
gen_str = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=False)
speech_ids = [int(n) for n in re.findall(r"<\|speech_(\d+)\|>", gen_str)]
codes_t = torch.tensor(speech_ids, dtype=torch.long)[None, None, :]
with torch.no_grad():
    audio = codec.decode_code(codes_t).cpu().numpy()[0, 0, :]
sf.write("output.wav", audio, 24000)

Citation

@misc{audar2025xpression,
  title={AudAR TTS Pro Xpression: Paralinguistic Expression Control for Neural Text-to-Speech},
  author={AudAR AI},
  year={2025},
  publisher={AudAR}
}

License

This model is released under the AudAR Commercial License. Contact licensing@audar.ai for commercial use inquiries.
