Wren-TTS-0.5B-multi

Multilingual speech LLM in the Wren series. Generates Kyutai Mimi neural-codec tokens from text using a Qwen/Qwen2.5-0.5B backbone, then decodes to 24 kHz waveform with the Mimi decoder.

Supports 8 languages: English, German, French, Spanish, Dutch, Italian, Polish, Portuguese.

For an English-only sibling on the same architecture, see shangeth/Wren-TTS-360M-en. For expressive synthesis with style tags, see shangeth/Wren-TTS-0.5B-multi-expressive.

Why Qwen2.5-0.5B for multilingual?

The English variant Wren-TTS-360M-en uses SmolLM2-360M, which was pretrained almost entirely on English text and tokenizes non-Latin or diacritic-heavy languages poorly (token counts explode toward one token per character). Qwen2.5-0.5B ships a 151k-token multilingual vocabulary with efficient tokenization for all 7 MLS languages.

Architecture is otherwise identical to the English variant — same delay-pattern Mimi prediction, same 8 codebooks, same multispeaker reference conditioning.

Architecture

text ──► Qwen2.5-0.5B ──► k=8 Mimi heads (delay pattern) ──► Mimi decoder ──► 24 kHz

Within each step the k heads run in parallel (one forward, k logits), but across steps they sit on a delay so cb_q at frame f is emitted at step s = f + q. The delay preserves RVQ conditioning (cb_q sees cb_0..cb_{q-1} of the same frame in the hidden state) while cutting sequence length to T + k - 1 and LLM calls to one forward per step — instead of k forwards per step in a flat interleaved layout.
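
To make the layout concrete, here is a minimal sketch of the delay pattern (the function name is hypothetical, and the AUDIO_PAD id of 2048 is assumed from the extra embedding row listed below):

import torch

def apply_delay_pattern(codes: torch.Tensor, pad_id: int = 2048) -> torch.Tensor:
    # codes: [k, T] Mimi codebook indices for one utterance.
    # Shift codebook q right by q steps so cb_q at frame f lands at step f + q;
    # the triangular edges are filled with AUDIO_PAD (assumed id 2048).
    k, T = codes.shape
    out = torch.full((k, T + k - 1), pad_id, dtype=codes.dtype)
    for q in range(k):
        out[q, q:q + T] = codes[q]
    return out

# 8 codebooks over 4 frames -> an [8, 11] grid (T + k - 1 = 11 steps)
print(apply_delay_pattern(torch.randint(0, 2048, (8, 4))).shape)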

  • Backbone: Qwen2.5-0.5B (causal LM; transformer body ~358M params, multilingual 151k-token vocab)
  • Audio tokenizer: Mimi (kyutai/mimi), 12.5 fps, 2048-entry codebooks
  • Codebooks used: all 8 Mimi codebooks
  • Per-codebook input tables: Embedding(2049, hidden) per codebook (the extra row is AUDIO_PAD for the delay edges)
  • Per-codebook output heads: Linear(hidden, 2048) for cb1..cb7; cb0 uses Linear(hidden, 2049), where the extra class is AUDIO_EOS (the stop token). A module-level sketch follows this list.
  • Speaker conditioning (required): prepend <|reference_start|> ref_codes <|reference_end|> to the prompt; ref_codes is the Mimi encoding of a short reference clip in the target voice. Trained multispeaker-only — output quality is poor without a reference.
  • Cross-lingual voice transfer: the reference voice and the target text language do not have to match. Take an English speaker's voice and synthesize French in that voice; quality varies (English voices generally transfer cleanly because English is mixed throughout multilingual training).
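
A minimal sketch of these embedding tables and heads (variable names are hypothetical; 896 is Qwen2.5-0.5B's hidden size):

import torch.nn as nn

K, HIDDEN, CB = 8, 896, 2048  # codebooks, backbone hidden size, entries per codebook

# Input side: one table per codebook; the extra row (id 2048) is AUDIO_PAD.
audio_embeddings = nn.ModuleList([nn.Embedding(CB + 1, HIDDEN) for _ in range(K)])

# Output side: cb0 carries one extra class (AUDIO_EOS); cb1..cb7 predict plain codes.
audio_heads = nn.ModuleList(
    [nn.Linear(HIDDEN, CB + 1)] + [nn.Linear(HIDDEN, CB) for _ in range(K - 1)]
)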

Training data (~1.87M utterances total)

English (~433k)

  • VCTK — 109 speakers, multiple English accents (~44k)
  • Jenny — single high-quality voice (~21k)
  • LibriTTS-R train-clean-{100,360} + train-other-500 (~355k)
  • LJSpeech — single speaker (~13k)

Multilingual via MLS (~1.45M)

Language          Utterances
German (de)          469,942
Dutch (nl)           374,287
French (fr)          258,213
Spanish (es)         220,701
Italian (it)          59,623
Portuguese (pt)       37,533
Polish (pl)           25,043

Mimi codes are pre-extracted and published as shangeth/mls-mimi-codes plus the per-corpus mimi-codes datasets, which avoids a redundant 6M-utterance encoding pass during training.
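
For reference, a sketch of how such codes can be pre-extracted with the transformers Mimi implementation (the actual extraction pipeline and dataset schema are not documented here, so treat this as illustrative):

import torch
from transformers import AutoFeatureExtractor, MimiModel

fe   = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
mimi = MimiModel.from_pretrained("kyutai/mimi").eval()

wav = torch.zeros(24000)  # placeholder: 1 s of silence at Mimi's 24 kHz
inputs = fe(raw_audio=wav.numpy(), sampling_rate=fe.sampling_rate, return_tensors="pt")
with torch.no_grad():
    codes = mimi.encode(inputs["input_values"]).audio_codes  # [batch, n_q, frames]
codes = codes[:, :8]  # keep the first 8 codebooks, matching this model
print(codes.shape)    # 12.5 fps, so about 13 frames for 1 s of audio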

Training is single-pass and from scratch (no two-stage pretrain-then-finetune). Held-out validation combines LibriTTS-R dev_clean, MLS dev for all 7 languages, and a 5% split from each single-speaker English source. All dataset mixture weights are set to 1.0 (every row is seen every epoch, no subsampling). Trained on a single A100-40GB.

Text casing and punctuation are preserved. Pass text naturally — do not pre-lowercase.

Usage

pip install torch torchaudio transformers datasets

A reference audio clip is required. The model was trained multispeaker-only; without ref_codes it produces poor output. Any 3–12 s speech clip in any of the supported languages works as the voice reference.

import torch
import numpy as np
from datasets import load_dataset
from transformers import AutoModel, AutoProcessor

model_id = "shangeth/Wren-TTS-0.5B-multi"
device   = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model     = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

# Reference voice — any short clip in any of the 8 supported languages.
sample  = next(iter(load_dataset("openslr/librispeech_asr", "clean", split="test", streaming=True)))
ref_wav = torch.from_numpy(np.asarray(sample["audio"]["array"], dtype=np.float32)).unsqueeze(0)
ref_sr  = sample["audio"]["sampling_rate"]
ref_codes = model.encode_audio(ref_wav, ref_sr)[:, :150]   # 150 frames at 12.5 fps ≈ 12 s

# Synthesize in any supported language — pass the text in that language naturally.
texts = [
    "Hello world, how are you today?",                       # en
    "Hallo Welt, wie geht es dir heute?",                    # de
    "Bonjour le monde, comment ça va aujourd'hui ?",         # fr
    "Hola mundo, ¿cómo estás hoy?",                          # es
    "Hallo wereld, hoe gaat het vandaag?",                   # nl
    "Ciao mondo, come stai oggi?",                           # it
    "Witaj świecie, jak się masz dzisiaj?",                  # pl
    "Olá mundo, como estás hoje?",                           # pt
]
for i, text in enumerate(texts):
    inputs = processor(text)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    waveform = model.generate(
        **inputs,
        ref_codes=ref_codes,
        max_audio_frames=300,
        min_audio_frames=20,
        temperature=0.8, top_k=50, top_p=0.9,
        output_audio=True,
    )
    processor.save_audio(waveform, f"out_{i}.wav")

Sampling tips

Defaults: temperature=0.8, top_k=50, top_p=0.9, max_audio_frames=300 (~24 s). If the model hallucinates extra speech past the input text:

  • Raise eos_bias — e.g. 2.0–6.0 — to make the model more eager to stop
  • Lower temperature (0.6) and top_p (0.8)
  • Set max_audio_frames ≈ 1–2 × len(text_in_chars) to scale the cap with the input
  • Set min_audio_frames=1 for very short prompts
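
Putting the tips together (eos_bias and the frame caps are the generate() arguments used in Usage above; the length heuristic itself is illustrative):

text   = "Bonjour !"
inputs = {k: v.to(device) for k, v in processor(text).items()}
waveform = model.generate(
    **inputs,
    ref_codes=ref_codes,
    max_audio_frames=max(20, 2 * len(text)),  # rough cap: ~1-2 frames per character
    min_audio_frames=1,                       # allow very short outputs
    temperature=0.6, top_k=50, top_p=0.8,     # tighter sampling
    eos_bias=4.0,                             # bias toward the AUDIO_EOS stop token
    output_audio=True,
)
processor.save_audio(waveform, "out_short.wav")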

Limitations & known issues

  • Hallucinated continuations: occasionally generates plausible speech past the input text. Mitigate with eos_bias at inference.
  • Language coverage: only the 8 trained languages. Anything else produces noise / accented gibberish.
  • Per-language quality varies with data volume: German / Dutch / French are strongest (largest training shares); Polish / Portuguese / Italian have less training data and may sound less natural.
  • Audiobook-style prosody inherited from MLS / LibriTTS-R (both LibriVox-derived); not as expressive as conversational TTS.
  • Speaker-id collisions across MLS languages: MLS uses per-language integer speaker IDs that may overlap numerically across languages. The training-time multispeaker reference picker bucketed colliding IDs together, so very occasionally the model saw a cross-language reference during training. In practice this had negligible effect on quality.
  • 0.5B backbone — quality is below frontier multilingual TTS systems.

The Wren series

Wren is a family of compact (<3B parameter) multimodal speech LLMs — small enough to run on a single consumer GPU, designed for open research on unified speech understanding and synthesis. Planned siblings:

  • Wren-TTS — text → speech (this release; English + multilingual variants)
  • Wren-ASR — speech → text
  • Wren-LM — speech-language modelling / dialog
  • Wren-Omni — unified ASR + TTS + LM in one checkpoint

All Wren models share the same design principles: small backbone LLM + neural audio codec, open weights, simple PyTorch checkpoints, reproducible training recipes.

Repository contents

File                      Purpose
model.safetensors         Model weights
config.json               WrenConfig (with auto_map for trust_remote_code)
tokenizer.json + friends  Qwen2.5 tokenizer with Wren's 3 special tokens added
processor_config.json     WrenProcessor auto_map
configuration_wren.py     WrenConfig(PretrainedConfig)
modeling_wren.py          WrenForTTS(PreTrainedModel); loads the Mimi codec lazily on first generate
processing_wren.py        WrenProcessor(ProcessorMixin); tokenize + save_audio
README.md                 This model card

Citation

@misc{wren2026,
  title  = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
  author = {Shangeth Rajaa},
  year   = {2026},
  url    = {https://github.com/shangeth/wren}
}

License

Apache-2.0 for the checkpoint weights and code in this repo. Upstream components carry their own licenses — review before redistribution.
