AudAR TTS Pro Xpression v2

AudAR TTS Pro Xpression v2 is a 4-billion parameter autoregressive text-to-speech model with native support for paralinguistic expression control. It generates highly expressive, emotionally nuanced speech in both Arabic and English with fine-grained control over vocal delivery through inline expression tags.

What's New in v2

  • Improved training: Extended training with best checkpoint selection (epoch 6.97, lowest eval loss)
  • Better emotional fidelity: Refined expression token learning for more natural delivery
  • Reduced artifacts: Improved stability in long-form generation

Key Features

  • 4B parameter autoregressive architecture trained on large-scale expressive speech data
  • Zero-shot voice cloning from a short reference audio (5-15 seconds)
  • 11 paralinguistic expression tokens for precise emotional control
  • Bilingual Arabic and English with dialect awareness
  • 24kHz output via NeuCodec neural vocoder
  • Flash Attention v2 optimized for fast inference on modern GPUs

Paralinguistic Expression System

Active Expression Tokens

These tokens have been trained with strong acoustic grounding on large-scale expressive data (>2,000 training examples each):

Token        Effect                      Description
[gasp]       Audible intake of breath    Surprise, shock, realization
[trembling]  Shaky, unsteady voice       Fear, cold, extreme emotion
[shouting]   Raised volume, projection   Anger, urgency, excitement
[crying]     Tearful vocal quality       Sadness, grief, overwhelming emotion
[giggles]    Light laughter              Amusement, nervousness, flirtation
[cough]      Throat clearing / cough     Illness, hesitation, interruption
[yawn]       Yawning vocalization        Tiredness, boredom
[panicked]   Rapid, breathless delivery  Emergency, fear, anxiety
[tired]      Low energy, slower pace     Exhaustion, fatigue
[very slow]  Deliberately slow pacing    Emphasis, gravity, sleepiness
[very fast]  Accelerated delivery        Urgency, excitement, news reporting

Legacy Prosody Tags

These tags are inherited from the base training data and provide additional stylistic control:

Tag          Effect                      Description
[laughs]     Full laughter               Joy, humor
[whispers]   Reduced volume, breathy     Secrecy, intimacy, suspense
[sighs]      Exhalation                  Resignation, relief, frustration
[excited]    High energy, bright tone    Enthusiasm, good news
[curious]    Rising intonation           Questioning, wonder
[sarcastic]  Flat/exaggerated tone       Irony, mockery

Usage

Input Format

The model uses a structured prompt format with reference audio encoding:

user: Convert the text to speech:
<|REF_TEXT_START|>{reference_text}<|REF_TEXT_END|>
<|REF_SPEECH_START|>{encoded_reference_audio}<|REF_SPEECH_END|>
<|TARGET_TEXT_START|>{target_text_with_expression_tags}<|TARGET_TEXT_END|>
assistant:
<|TARGET_CODES_START|>{generated_speech_codes}<|TARGET_CODES_END|>
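The structured format above can be assembled with a small helper. This is an illustrative sketch, not an official API: the `build_prompt` name and signature are assumptions, while the special-token strings are taken verbatim from the format shown.

```python
# Sketch of a prompt builder for the structured format above.
# The special tokens come from the documented format; the function
# itself is illustrative, not part of the released tooling.

def build_prompt(ref_text: str, ref_codes: list[int], target_text: str) -> str:
    # Each reference speech code becomes one <|speech_N|> token.
    ref_codes_str = "".join(f"<|speech_{c}|>" for c in ref_codes)
    return (
        "user: Convert the text to speech:"
        f"<|REF_TEXT_START|>{ref_text}<|REF_TEXT_END|>"
        f"<|REF_SPEECH_START|>{ref_codes_str}<|REF_SPEECH_END|>"
        f"<|TARGET_TEXT_START|>{target_text}<|TARGET_TEXT_END|>"
        "\nassistant:<|TARGET_CODES_START|>"
    )

prompt = build_prompt("Hello there.", [12, 345], "[excited] Great news!")
```

Note that the prompt deliberately ends at `<|TARGET_CODES_START|>` so generation continues with speech codes.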

Expression Tag Placement

Tags can be placed before, after, or inline within text:

# Tag before text - sets the tone for what follows
text = "[panicked] Everyone get out of the building now!"

# Inline tags - expression shifts mid-sentence
text = "He said he was fine [crying] but I could tell he wasn't"

# Multiple tags - layered expression
text = "[very fast] Breaking news! [shouting] The team has won!"

Expressive Dialogue Examples

English - Joy & Celebration

text = "[excited] I got the job! They called me this morning! [giggles] I literally jumped out of bed and started dancing!"

English - Grief

text = "[very slow] I held his hand until the very end. [crying] He looked at me and smiled... and then he was just... gone. [trembling] The room felt so empty."

Arabic - Excitement

text = "[excited] نجحت! والله نجحت بامتياز! [giggles] ما صدقت لما شفت النتيجة! [shouting] الحمد لله!"

Arabic - Sadness

text = "[crying] فقدناه... فقدنا أغلى إنسان. [trembling] كل يوم أصحى وأحسّ إنه بيجي... بس ما يجي. [very slow] وحشتني يا أبوي."

Arabic - Sports

text = "[very fast] عاجل! المنتخب سجّل هدف في الدقيقة التسعين! [shouting] يا الله! [giggles] الجمهور جنّ جنونه!"

Best Practices

  1. Don't overuse tags - One or two tags per sentence give the most natural results
  2. Match tag to content - The expression should be contextually appropriate
  3. Use transitions - Combine fast/slow pacing with emotional tags for dynamic delivery
  4. Reference audio matters - The reference speaker's natural style influences tag interpretation
  5. Combine with punctuation - Exclamation marks and ellipses reinforce expression tags
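Best practice 1 can be checked mechanically. The sketch below flags sentences carrying more than two tags; the sentence splitting is deliberately naive (split on `.`, `!`, `?`) and the helper is our own illustration, not shipped tooling:

```python
import re

def overloaded_sentences(text: str, max_tags: int = 2) -> list[str]:
    """Return sentences that carry more than max_tags expression tags."""
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if len(re.findall(r"\[[^\]]+\]", s)) > max_tags]

ok = "[excited] I got the job! [giggles] I started dancing!"
bad = "[excited] [shouting] [giggles] [very fast] We won the cup!"
print(len(overloaded_sentences(ok)), len(overloaded_sentences(bad)))  # 0 1
```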

Voice Profiles

Pre-selected reference speakers are available as a separate dataset: audarai/voice_profile_xpress_v1

12 speakers (6 Arabic, 6 English) with gender diversity and dialect coverage.

Technical Specifications

Property            Value
Parameters          4.19 billion
Architecture        Autoregressive Transformer
Precision           bfloat16
Vocab Size          217,240 tokens
Max Context         8,192 tokens
Audio Codec         NeuCodec (24 kHz, single codebook)
Output Sample Rate  24,000 Hz
Languages           Arabic, English
Attention           Flash Attention v2 supported
Training            rsLoRA, rank 64, alpha 128
Best Checkpoint     Epoch 6.97, eval_loss 5.1397

Inference Requirements

  • GPU: NVIDIA GPU with 16GB+ VRAM (H100/A100/RTX 4090 recommended)
  • Dependencies: transformers, torch, neucodec
  • Recommended: Flash Attention 2 for optimal throughput
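A quick back-of-envelope check supports the 16GB+ guidance: 4.19B parameters at 2 bytes each (bfloat16) is roughly 7.8 GiB of weights alone, before the KV cache, activations, and the codec are accounted for.

```python
# Rough weight-memory estimate for a 4.19B-parameter bfloat16 model.
params = 4.19e9
bytes_per_param = 2            # bfloat16
weight_gib = params * bytes_per_param / 1024**3
print(f"{weight_gib:.1f} GiB")  # ~7.8 GiB, weights only
```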

Quick Start

import torch
import re
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec

# Load model
model_id = "audarai/tts-pro-xpression-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)
model.eval()

# Load codec
codec = NeuCodec.from_pretrained("neuphonic/neucodec")
codec.eval()

# Encode reference audio (16kHz)
import librosa
wav, _ = librosa.load("reference.wav", sr=16000, mono=True)
wav_t = torch.from_numpy(wav).float().unsqueeze(0).unsqueeze(0)
with torch.no_grad():
    ref_codes = codec.encode_code(audio_or_path=wav_t).squeeze(0).squeeze(0).tolist()

# Build prompt
ref_text = "Your reference transcript here"
target_text = "[excited] This is amazing news! [giggles] I can't believe it!"
ref_codes_str = ''.join(f'<|speech_{c}|>' for c in ref_codes)

prompt = (
    f'user: Convert the text to speech:'
    f'<|REF_TEXT_START|>{ref_text}<|REF_TEXT_END|>'
    f'<|REF_SPEECH_START|>{ref_codes_str}<|REF_SPEECH_END|>'
    f'<|TARGET_TEXT_START|>{target_text}<|TARGET_TEXT_END|>'
    f'\nassistant:<|TARGET_CODES_START|>'
)

input_ids = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to("cuda")
eos_id = tokenizer.convert_tokens_to_ids("<|TARGET_CODES_END|>")

# Generate
with torch.no_grad():
    output = model.generate(
        input_ids, max_length=input_ids.shape[1] + 2048,
        eos_token_id=eos_id, do_sample=True,
        temperature=1.0, top_k=50, min_new_tokens=50, use_cache=True,
    )

# Decode to audio
gen_str = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=False)
speech_ids = [int(n) for n in re.findall(r"<\|speech_(\d+)\|>", gen_str)]
codes_t = torch.tensor(speech_ids, dtype=torch.long)[None, None, :]
with torch.no_grad():
    audio = codec.decode_code(codes_t).cpu().numpy()[0, 0, :]
sf.write("output.wav", audio, 24000)

Citation

@misc{audar2025xpression,
  title={AudAR TTS Pro Xpression: Paralinguistic Expression Control for Neural Text-to-Speech},
  author={AudAR AI},
  year={2025},
  publisher={AudAR}
}

License

This model is released under the AudAR Commercial License. Contact licensing@audar.ai for commercial use inquiries.
