# AudAR TTS Pro Xpression v2
AudAR TTS Pro Xpression v2 is a 4-billion parameter autoregressive text-to-speech model with native support for paralinguistic expression control. It generates highly expressive, emotionally nuanced speech in both Arabic and English with fine-grained control over vocal delivery through inline expression tags.
## What's New in v2
- Improved training: Extended training with best checkpoint selection (epoch 6.97, lowest eval loss)
- Better emotional fidelity: Refined expression token learning for more natural delivery
- Reduced artifacts: Improved stability in long-form generation
## Key Features
- 4B parameter autoregressive architecture trained on large-scale expressive speech data
- Zero-shot voice cloning from a short reference audio (5-15 seconds)
- 11 paralinguistic expression tokens for precise emotional control
- Bilingual Arabic and English with dialect awareness
- 24kHz output via NeuCodec neural vocoder
- Flash Attention v2 optimized for fast inference on modern GPUs
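Since cloning quality depends on the 5-15 second reference window mentioned above, a quick length check before encoding can catch bad inputs early. This is a minimal sketch; `check_reference_duration` is a hypothetical helper, not part of the model's API:

```python
import numpy as np

def check_reference_duration(wav: np.ndarray, sr: int,
                             min_s: float = 5.0, max_s: float = 15.0) -> float:
    """Return the clip duration in seconds, warning if it falls
    outside the 5-15 s range recommended for zero-shot cloning."""
    duration = len(wav) / sr
    if not (min_s <= duration <= max_s):
        print(f"warning: reference is {duration:.1f}s; "
              f"{min_s:.0f}-{max_s:.0f}s works best for cloning")
    return duration

# Example: a 10-second clip at 16 kHz passes silently
dur = check_reference_duration(np.zeros(160_000), sr=16_000)
```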
## Paralinguistic Expression System

### Active Expression Tokens
These tokens have been trained with strong acoustic grounding on large-scale expressive data (>2,000 training examples each):
| Token | Effect | Description |
|---|---|---|
| `[gasp]` | Audible intake of breath | Surprise, shock, realization |
| `[trembling]` | Shaky, unsteady voice | Fear, cold, extreme emotion |
| `[shouting]` | Raised volume, projection | Anger, urgency, excitement |
| `[crying]` | Tearful vocal quality | Sadness, grief, overwhelming emotion |
| `[giggles]` | Light laughter | Amusement, nervousness, flirtation |
| `[cough]` | Throat clearing / cough | Illness, hesitation, interruption |
| `[yawn]` | Yawning vocalization | Tiredness, boredom |
| `[panicked]` | Rapid, breathless delivery | Emergency, fear, anxiety |
| `[tired]` | Low energy, slower pace | Exhaustion, fatigue |
| `[very slow]` | Deliberately slow pacing | Emphasis, gravity, sleepiness |
| `[very fast]` | Accelerated delivery | Urgency, excitement, news reporting |
### Legacy Prosody Tags
These tags are inherited from the base training data and provide additional stylistic control:
| Tag | Effect | Description |
|---|---|---|
| `[laughs]` | Full laughter | Joy, humor |
| `[whispers]` | Reduced volume, breathy | Secrecy, intimacy, suspense |
| `[sighs]` | Exhalation | Resignation, relief, frustration |
| `[excited]` | High energy, bright tone | Enthusiasm, good news |
| `[curious]` | Rising intonation | Questioning, wonder |
| `[sarcastic]` | Flat/exaggerated tone | Irony, mockery |
## Usage

### Input Format
The model uses a structured prompt format with reference audio encoding:
```
user: Convert the text to speech:
<|REF_TEXT_START|>{reference_text}<|REF_TEXT_END|>
<|REF_SPEECH_START|>{encoded_reference_audio}<|REF_SPEECH_END|>
<|TARGET_TEXT_START|>{target_text_with_expression_tags}<|TARGET_TEXT_END|>
assistant:
<|TARGET_CODES_START|>{generated_speech_codes}<|TARGET_CODES_END|>
```
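The prompt can be assembled with plain string formatting. `build_prompt` is a hypothetical convenience wrapper; the delimiter tokens and `<|speech_N|>` code format come from the model card:

```python
def build_prompt(ref_text: str, ref_codes: list[int], target_text: str) -> str:
    """Assemble the structured TTS prompt from its three parts."""
    # Each reference code becomes one <|speech_N|> token
    ref_codes_str = "".join(f"<|speech_{c}|>" for c in ref_codes)
    return (
        "user: Convert the text to speech:"
        f"<|REF_TEXT_START|>{ref_text}<|REF_TEXT_END|>"
        f"<|REF_SPEECH_START|>{ref_codes_str}<|REF_SPEECH_END|>"
        f"<|TARGET_TEXT_START|>{target_text}<|TARGET_TEXT_END|>"
        "\nassistant:<|TARGET_CODES_START|>"
    )

p = build_prompt("Hello there.", [1, 2], "[excited] Hi!")
```

The prompt deliberately ends at `<|TARGET_CODES_START|>` so that generation continues with speech codes until `<|TARGET_CODES_END|>` is emitted.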
### Expression Tag Placement
Tags can be placed before, after, or inline within text:
```python
# Tag before text - sets the tone for what follows
text = "[panicked] Everyone get out of the building now!"

# Inline tags - expression shifts mid-sentence
text = "He said he was fine [crying] but I could tell he wasn't"

# Multiple tags - layered expression
text = "[very fast] Breaking news! [shouting] The team has won!"
```
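To sanity-check placement, target text can be decomposed into (tag, segment) pairs that show exactly which span of text each expression shift governs. `split_on_tags` is a hypothetical debugging helper, not part of the model's API:

```python
import re

def split_on_tags(text: str):
    """Split target text into (tag, following_text) pairs;
    text before the first tag is paired with None."""
    pairs, tag = [], None
    for part in re.split(r"(\[[^\]]+\])", text):
        if re.fullmatch(r"\[[^\]]+\]", part):
            tag = part
        elif part.strip():
            pairs.append((tag, part.strip()))
            tag = None
    return pairs

print(split_on_tags("He said he was fine [crying] but I could tell he wasn't"))
# [(None, 'He said he was fine'), ('[crying]', "but I could tell he wasn't")]
```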
## Expressive Dialogue Examples
### English - Joy & Celebration

```python
text = "[excited] I got the job! They called me this morning! [giggles] I literally jumped out of bed and started dancing!"
```
### English - Grief

```python
text = "[very slow] I held his hand until the very end. [crying] He looked at me and smiled... and then he was just... gone. [trembling] The room felt so empty."
```
### Arabic - Excitement

```python
text = "[excited] نجحت! والله نجحت بامتياز! [giggles] ما صدقت لما شفت النتيجة! [shouting] الحمد لله!"
# Translation: "I passed! I swear I passed with distinction! I couldn't believe it when I saw the result! Thank God!"
```
### Arabic - Sadness

```python
text = "[crying] فقدناه... فقدنا أغلى إنسان. [trembling] كل يوم أصحى وأحسّ إنه بيجي... بس ما يجي. [very slow] وحشتني يا أبوي."
# Translation: "We lost him... we lost the dearest person. Every day I wake up feeling he's coming back... but he never comes. I miss you, Dad."
```
### Arabic - Sports

```python
text = "[very fast] عاجل! المنتخب سجّل هدف في الدقيقة التسعين! [shouting] يا الله! [giggles] الجمهور جنّ جنونه!"
# Translation: "Breaking! The national team scored a goal in the 90th minute! Oh my God! The crowd has gone wild!"
```
## Best Practices
- Don't overuse tags - One or two per sentence provides the most natural results
- Match tag to content - The expression should be contextually appropriate
- Use transitions - Combine fast/slow pacing with emotional tags for dynamic delivery
- Reference audio matters - The reference speaker's natural style influences tag interpretation
- Combine with punctuation - Exclamation marks and ellipses reinforce expression tags
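The first guideline above can be checked mechanically. `tag_density` is a hypothetical lint that approximates sentence boundaries by terminal punctuation; keeping the result at or below ~2 matches the "one or two tags per sentence" advice:

```python
import re

def tag_density(text: str) -> float:
    """Average number of expression tags per sentence."""
    stripped = re.sub(r"\[[^\]]+\]", "", text)          # drop tags first
    sentences = [s for s in re.split(r"[.!?]+", stripped) if s.strip()]
    tags = re.findall(r"\[[^\]]+\]", text)
    return len(tags) / max(len(sentences), 1)

print(tag_density("[gasp] Oh no! [crying] He left."))  # 1.0
```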
## Voice Profiles

Pre-selected reference speakers are available as a separate dataset: `audarai/voice_profile_xpress_v1`
12 speakers (6 Arabic, 6 English) with gender diversity and dialect coverage.
## Technical Specifications
| Property | Value |
|---|---|
| Parameters | 4.19 billion |
| Architecture | Autoregressive Transformer |
| Precision | bfloat16 |
| Vocab Size | 217,240 tokens |
| Max Context | 8192 tokens |
| Audio Codec | NeuCodec (24kHz, single codebook) |
| Output Sample Rate | 24,000 Hz |
| Languages | Arabic, English |
| Flash Attention | v2 supported |
| Training | rsLoRA, rank 64, alpha 128 |
| Best Checkpoint | Epoch 6.97, eval_loss 5.1397 |
## Inference Requirements
- GPU: NVIDIA GPU with 16GB+ VRAM (H100/A100/RTX 4090 recommended)
- Dependencies: `transformers`, `torch`, `neucodec`, `librosa`, `soundfile`
- Recommended: Flash Attention 2 for optimal throughput
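The 16GB+ VRAM recommendation is roughly consistent with the weight footprint alone. A back-of-envelope estimate (weights only, ignoring the KV cache at 8192-token context and the codec):

```python
# Rough weight-memory estimate for the 4.19B-parameter model in bfloat16
params = 4.19e9
bytes_per_param = 2  # bfloat16 stores each parameter in 2 bytes
weights_gb = params * bytes_per_param / 1024**3
print(f"{weights_gb:.1f} GB")  # ~7.8 GB for weights alone
```

The remaining headroom goes to activations, the KV cache, and NeuCodec, which is why 16 GB is the practical floor.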
## Quick Start
```python
import re

import librosa
import soundfile as sf
import torch
from neucodec import NeuCodec
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_id = "audarai/tts-pro-xpression-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
model.eval()

# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec")
codec.eval()

# Encode reference audio (16 kHz mono)
wav, _ = librosa.load("reference.wav", sr=16000, mono=True)
wav_t = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_t).squeeze(0).squeeze(0).tolist()

# Build prompt
ref_text = "Your reference transcript here"
target_text = "[excited] This is amazing news! [giggles] I can't believe it!"
ref_codes_str = "".join(f"<|speech_{c}|>" for c in ref_codes)
prompt = (
    f"user: Convert the text to speech:"
    f"<|REF_TEXT_START|>{ref_text}<|REF_TEXT_END|>"
    f"<|REF_SPEECH_START|>{ref_codes_str}<|REF_SPEECH_END|>"
    f"<|TARGET_TEXT_START|>{target_text}<|TARGET_TEXT_END|>"
    f"\nassistant:<|TARGET_CODES_START|>"
)
input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
eos_id = tokenizer.convert_tokens_to_ids("<|TARGET_CODES_END|>")

# Generate speech codes
with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=input_ids.shape[1] + 2048,
        eos_token_id=eos_id,
        do_sample=True,
        temperature=1.0,
        top_k=50,
        min_new_tokens=50,
        use_cache=True,
    )

# Decode generated codes to 24 kHz audio
gen_str = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=False)
speech_ids = [int(n) for n in re.findall(r"<\|speech_(\d+)\|>", gen_str)]
codes_t = torch.tensor(speech_ids, dtype=torch.long)[None, None, :]
with torch.no_grad():
    audio = codec.decode_code(codes_t).cpu().numpy()[0, 0, :]
sf.write("output.wav", audio, 24000)
```
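When sizing the generation budget, note that output duration scales with the number of generated speech codes. Assuming NeuCodec emits about 50 codes per second of audio (an assumption; verify against the codec's configuration), the 2048-new-token budget above caps output near 41 seconds:

```python
# ASSUMPTION: NeuCodec frame rate of 50 codes per second of audio.
# Check codec.config (or the neucodec docs) before relying on this.
FRAME_RATE_HZ = 50

def est_duration_s(num_codes: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Estimate output audio duration from the speech-code count."""
    return num_codes / frame_rate_hz

print(est_duration_s(2048))  # 40.96
```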
## Citation
```bibtex
@misc{audar2025xpression,
  title={AudAR TTS Pro Xpression: Paralinguistic Expression Control for Neural Text-to-Speech},
  author={AudAR AI},
  year={2025},
  publisher={AudAR}
}
```
## License
This model is released under the AudAR Commercial License. Contact licensing@audar.ai for commercial use inquiries.