Wren-TTS-0.5B-multi-expressive

Expressive multilingual speech LLM in the Wren series. Fine-tuned from shangeth/Wren-TTS-0.5B-multi on the style-tagged Expresso dataset to add 23 emotion and delivery-style tags while retaining the 8-language voice-cloning ability of the base model.

Generates Kyutai Mimi neural-codec tokens from text using a Qwen/Qwen2.5-0.5B backbone, then decodes them to a 24 kHz waveform with the Mimi decoder.

Supports the same 8 languages as the base (English, German, French, Spanish, Dutch, Italian, Polish, Portuguese), but style tags were seen only with English text during fine-tuning.

Style tags

Prepend any of these tags to the input text to condition the speaking style:

<angry>       <articulated>  <awe>         <bored>       <breath>
<calm>        <confused>     <desire>      <disgusted>   <enunciated>
<fast>        <fearful>      <happy>       <laugh>       <laughing>
<narration>   <projected>    <reduced>     <sad>         <sarcastic>
<sleepy>      <sympathetic>  <whisper>

Example:

<happy> Welcome to the show!
<whisper> I'll tell you a secret.
<sad> This is the saddest day of my life.

Architecture

Identical to the multi base: the same delay-pattern Mimi prediction, the same 8 codebooks, and the same multispeaker reference conditioning. Only the LLM weights are fine-tuned.

[<style_tag>] text ──► Qwen2.5-0.5B ──► k=8 Mimi heads (delay) ──► Mimi decoder ──► 24 kHz

See the base model card for the full architectural detail.
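
As a rough illustration of the delay pattern, here is a minimal sketch assuming codebook k is shifted right by k frames (a common delay-pattern convention; the base card documents the exact offsets, and the PAD token id here is a placeholder):

import torch

K, PAD = 8, 0  # 8 Mimi codebooks; PAD is a placeholder token id (assumption)

def apply_delay(codes: torch.Tensor) -> torch.Tensor:
    # codes: (K, T) acoustic tokens -> (K, T + K - 1) delayed view, where
    # codebook k is shifted right by k frames so each LLM step can emit
    # one token per codebook without looking into the future.
    n_cb, n_frames = codes.shape
    out = torch.full((n_cb, n_frames + n_cb - 1), PAD, dtype=codes.dtype)
    for k in range(n_cb):
        out[k, k : k + n_frames] = codes[k]
    return out

def remove_delay(delayed: torch.Tensor) -> torch.Tensor:
    # Inverse shift before the tokens go to the Mimi decoder.
    n_cb, total = delayed.shape
    n_frames = total - n_cb + 1
    return torch.stack([delayed[k, k : k + n_frames] for k in range(n_cb)])

codes = torch.randint(0, 2048, (K, 10))  # dummy (codebooks, frames) tokens
assert torch.equal(remove_delay(apply_delay(codes)), codes)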

Fine-tuning recipe

Per-epoch data mix (~52k rows; ≈ 30 min per epoch on a single A100-40GB):

dataset                               weight   rows / epoch   role
shangeth/expresso-mimi-codes-tagged   1.0      ~26,000        the new domain (style tags)
shangeth/mls-mimi-codes               0.004    ~24,000        multilingual replay
shangeth/libritts-r-mimi-codes        0.005    ~2,400         English speaker diversity
shangeth/vctk-mimi-codes              0.05     ~2,200         accent diversity
shangeth/jenny-mimi-codes             0.0001   ~2             retention
shangeth/ljspeech-mimi-codes          0.0002   ~3             retention
  • LR: 1e-5 with 2k warmup steps, cosine decay to step 10k (sketched after this list)
  • Effective batch size: 24
  • Optimizer: AdamW (β=(0.9, 0.95)), wd=0.01, grad-clip 1.0
  • AMP: enabled (autocast)
  • Early stopped at epoch 3 / 20 (val patience=2; best val_loss=3.826)
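
A minimal reconstruction of that schedule with the stated optimizer settings. The linear warmup shape and a zero floor held after step 10k are assumptions; the card does not state either:

import math
import torch

WARMUP, TOTAL = 2_000, 10_000

def lr_lambda(step: int) -> float:
    if step < WARMUP:
        return step / WARMUP                            # assumed linear warmup
    if step < TOTAL:
        p = (step - WARMUP) / (TOTAL - WARMUP)
        return 0.5 * (1.0 + math.cos(math.pi * p))      # cosine decay over 2k..10k
    return 0.0                                          # assumed floor after step 10k

params = torch.nn.Linear(8, 8).parameters()  # stand-in for the LLM's parameters
optimizer = torch.optim.AdamW(params, lr=1e-5, betas=(0.9, 0.95), weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)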

Replay weights are tuned for balanced per-voice exposure rather than per-row volume, so single-speaker datasets (Jenny, LJSpeech) don't overexpose their one voice.
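
A sketch of one way to realise this mix, assuming each weight is the fraction of a dataset resampled per epoch (rows/epoch ≈ weight × dataset size, matching the table). The "train" split names and the sampler itself are assumptions; the training code is not published:

import random
from datasets import concatenate_datasets, load_dataset

# Per-epoch sampling fractions from the table above.
MIX = {
    "shangeth/expresso-mimi-codes-tagged": 1.0,
    "shangeth/mls-mimi-codes":             0.004,
    "shangeth/libritts-r-mimi-codes":      0.005,
    "shangeth/vctk-mimi-codes":            0.05,
    "shangeth/jenny-mimi-codes":           0.0001,
    "shangeth/ljspeech-mimi-codes":        0.0002,
}

def epoch_mix(seed: int):
    rng = random.Random(seed)
    parts = []
    for name, weight in MIX.items():
        ds = load_dataset(name, split="train")
        n = max(1, round(weight * len(ds)))  # rows/epoch = weight x dataset size
        parts.append(ds.select(rng.sample(range(len(ds)), n)))
    return concatenate_datasets(parts).shuffle(seed=seed)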

Usage

pip install torch torchaudio transformers datasets

A reference audio clip is required. The model was trained multispeaker-only; without ref_codes it produces poor output.

import torch
import numpy as np
from datasets import load_dataset
from transformers import AutoModel, AutoProcessor

model_id = "shangeth/Wren-TTS-0.5B-multi-expressive"
device   = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model     = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

# Reference voice: stream one LibriSpeech clip and encode it to Mimi codes
sample  = next(iter(load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True)))
ref_wav = torch.from_numpy(np.asarray(sample["audio"]["array"], dtype=np.float32)).unsqueeze(0)
ref_sr  = sample["audio"]["sampling_rate"]
ref_codes = model.encode_audio(ref_wav, ref_sr)[:, :150]  # keep ~12 s (150 frames at Mimi's 12.5 Hz)

# Style-tagged synthesis
texts = [
    "<happy> Welcome to the show, everybody!",
    "<sad> This is the saddest day of my life.",
    "<whisper> I will tell you a secret.",
    "<sarcastic> Oh sure, that worked out perfectly.",
    "<sleepy> I'm so tired, I can barely keep my eyes open.",
]
for i, text in enumerate(texts):
    inputs = processor(text)                        # tokenize the style-tagged text
    inputs = {k: v.to(device) for k, v in inputs.items()}
    waveform = model.generate(
        **inputs,
        ref_codes=ref_codes,                        # voice-cloning reference (required)
        max_audio_frames=300, min_audio_frames=20,  # ~24 s cap at Mimi's 12.5 Hz frame rate
        temperature=0.8, top_k=50, top_p=0.9,
        output_audio=True,                          # decode Mimi tokens to a waveform
    )
    processor.save_audio(waveform, f"out_{i}.wav")

Limitations & known issues

  • Limited expressive training data. Expresso is a single English domain (~37 h), so style following is decent, not perfect. Quality varies across tags; high-frequency tags (<happy>, <sad>, <whisper>) work most reliably.
  • Style tags are English-only. During fine-tune, tags only co-occurred with English text; behaviour with multilingual prompts is undefined and may degrade language quality.
  • Inherited from base: hallucinated continuations, audiobook-style prosody on untagged English, varying per-language quality (German/Dutch/French strongest).
  • 0.5B backbone — quality is below frontier expressive TTS systems.

License

CC-BY-NC-4.0 — non-commercial use only. Inherited from the Expresso fine-tune data. The base model and upstream components carry their own licenses; review before redistribution.

Citation

@misc{wren2026,
  title  = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
  author = {Shangeth Rajaa},
  year   = {2026},
  url    = {https://github.com/shangeth/wren}
}