# Wren-ASR-0.5B-multi
Multilingual automatic speech recognition model in the Wren series. Encodes audio with the Kyutai Mimi neural codec, then transcribes with a Qwen/Qwen2.5-0.5B backbone — no acoustic encoder, no CTC, just a small LLM consuming Mimi codes as input embeddings.
Supports 8 languages: English, German, French, Spanish, Dutch, Italian, Polish, Portuguese.
## Links
- Training & inference code: github.com/shangeth/wren-asr
- Wren research project: github.com/shangeth/wren
- TTS counterpart: shangeth/Wren-TTS-0.5B-multi
- Dataset extraction (Mimi codes): github.com/shangeth/wren-datasets
- Demo Space: huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo
## Architecture
```
audio ──► Mimi encoder (k=3) ──► Qwen2.5-0.5B (audio prefix → text) ──► transcript
```
Mimi codes serve as a discrete audio prefix in the LLM's input embedding space.
At each audio frame, the k=3 codebook codes go through k separate input embedding
tables; their sum (scaled by 1/√k) is the input embedding for that step. The
audio prefix is wrapped in `<|audio_start|>` / `<|audio_end|>` tokens, after
which the LLM autoregressively emits text using its native vocabulary and
`lm_head`; no new output heads were added.
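A minimal sketch of that embedding step (the module and variable names are illustrative assumptions, not the repo's actual API; Qwen2.5-0.5B's hidden size is 896):

```python
import torch
import torch.nn as nn

# Illustrative sketch of the summed-codebook input embedding described above;
# the real module lives in modeling_wren_asr.py.
k, codebook_size, d_model = 3, 2048, 896

audio_embeds = nn.ModuleList(nn.Embedding(codebook_size, d_model) for _ in range(k))

def frame_embeddings(codes: torch.Tensor) -> torch.Tensor:
    """codes: (T_frames, k) Mimi codes -> (T_frames, d_model) LLM input embeddings."""
    summed = sum(audio_embeds[i](codes[:, i]) for i in range(k))
    return summed / k ** 0.5  # scale by 1/sqrt(k), as described above
```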
- Backbone: Qwen2.5-0.5B (causal LM; transformer body ~358M params, 151k-token multilingual vocab)
- Audio tokenizer: Mimi (`kyutai/mimi`), 12.5 fps, 2048-entry codebooks
- Codebooks used: first 3 (semantic-content-rich); cuts per-frame embedding lookups 8/3× vs 8-codebook variants
- Audio prefix: `<|audio_start|>` + summed-codebook embeds × T_frames + `<|audio_end|>`
- Output: standard text autoregression via `model.llm.generate(inputs_embeds=...)`
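Continuing the sketch above, the prefix assembly and decoding fit together roughly like this (the `llm`/`tokenizer` handles and the use of `convert_tokens_to_ids` for the special tokens are assumptions inferred from the bullets, not the repo's actual code):

```python
# Wrap the frame embeddings in the <|audio_start|> / <|audio_end|> token
# embeddings, then decode text with the LLM's native vocabulary and lm_head.
def transcribe_from_codes(llm, tokenizer, codes: torch.Tensor) -> str:
    tok_embed = llm.get_input_embeddings()
    start_id = tokenizer.convert_tokens_to_ids("<|audio_start|>")
    end_id = tokenizer.convert_tokens_to_ids("<|audio_end|>")
    start = tok_embed(torch.tensor([start_id]))
    end = tok_embed(torch.tensor([end_id]))
    # prefix: (1, T_frames + 2, d_model), using frame_embeddings() from above
    prefix = torch.cat([start, frame_embeddings(codes), end], dim=0).unsqueeze(0)
    ids = llm.generate(inputs_embeds=prefix, max_new_tokens=200)
    return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
```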
## Training data
Trained on the union of every dataset used to train Wren-TTS — the same 6 corpora that power the en/multi/expressive TTS recipes, with text used as the ASR target:
| Dataset | Rows | Language(s) |
|---|---|---|
| VCTK | ~44k | en (109 speakers, multiple accents) |
| Jenny | ~21k | en (single speaker) |
| LibriTTS-R | ~360k | en (clean_100 + clean_360 + other_500) |
| LJSpeech | ~13k | en (single speaker) |
| MLS | ~6.0M | de · fr · es · it · nl · pl · pt |
| Expresso (tagged) | ~26k | en (style tags stripped at load time) |
| Total | ~6.46M rows / epoch | |
Mimi codes are pre-extracted and published as the per-corpus mimi-codes datasets
(see the dataset-extraction link under Links); there is no online encoding during
training. Single-pass from-scratch training with k=3 codebooks. Held-out
validation combines LibriTTS-R dev_clean, MLS dev (all 7 languages), Expresso
dev (tags stripped), and a 5% split from each single-speaker English source.
All dataset weights are 1.0 (every row, every epoch, no subsampling). Trained
on a single A100-40GB.
Text casing and punctuation are preserved in the ground-truth transcripts.
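For reference, offline extraction along these lines is possible with the Mimi support in `transformers` (a hedged sketch, not the wren-datasets pipeline; the input file name is a placeholder):

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, MimiModel

# Hedged sketch of offline Mimi code extraction; the actual pipeline lives
# in github.com/shangeth/wren-datasets.
fe = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
codec = MimiModel.from_pretrained("kyutai/mimi").eval()

wav, sr = torchaudio.load("clip.wav")  # placeholder file
wav = torchaudio.functional.resample(wav, sr, fe.sampling_rate).mean(0)  # mono, 24 kHz
inputs = fe(raw_audio=wav.numpy(), sampling_rate=fe.sampling_rate, return_tensors="pt")
with torch.no_grad():
    # Keep only the first 3 codebooks, matching the training recipe
    codes = codec.encode(inputs["input_values"], num_quantizers=3).audio_codes
print(codes.shape)  # (1, 3, T_frames) at 12.5 fps
```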
## Usage
```bash
pip install torch torchaudio transformers
```

```python
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

model_id = "shangeth/Wren-ASR-0.5B-multi"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

# Load any short clip (one of the 8 supported languages, ≤ 30 s)
wav, sr = torchaudio.load("input.wav")
inputs = processor(audio=wav, sampling_rate=sr)
inputs = {k: v.to(device) for k, v in inputs.items()}

ids = model.generate(**inputs, max_new_tokens=200)
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)
```
## Sampling tips
Defaults: greedy decoding (`do_sample=False`). For longer / harder utterances:

- Pass `do_sample=True, temperature=0.7, top_p=0.9` for more diverse hypotheses
- Raise `max_new_tokens` if transcripts are getting cut off
- Audio is hard-capped at 30 s (375 frames @ 12.5 fps) by the training recipe; for longer audio, segment first (a minimal chunking sketch follows this list)
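A minimal chunking sketch under those constraints, with `processor` and `model` as loaded in the Usage snippet (naive fixed 30 s windows; a VAD-based splitter would avoid cutting words mid-window):

```python
import torch
import torchaudio

def transcribe_long(path: str, processor, model, device: str, chunk_s: float = 30.0) -> str:
    """Naive fixed-window transcription for audio longer than the 30 s cap."""
    wav, sr = torchaudio.load(path)
    step = int(chunk_s * sr)
    pieces = []
    for start in range(0, wav.shape[-1], step):
        seg = wav[..., start : start + step]
        inputs = processor(audio=seg, sampling_rate=sr)
        inputs = {k: v.to(device) for k, v in inputs.items()}
        with torch.no_grad():
            ids = model.generate(**inputs, max_new_tokens=200)
        pieces.append(processor.batch_decode(ids, skip_special_tokens=True)[0])
    return " ".join(pieces)
```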
## Limitations & known issues
- Language coverage: only the 8 trained languages. Out-of-distribution audio produces noise / hallucinated text in the closest matching language.
- Per-language quality varies with data volume: German / Dutch / French are strongest (largest training shares); Polish / Portuguese / Italian have less training data and may be less accurate.
- Audiobook-style audio dominates training: MLS / LibriTTS-R / LJSpeech / Jenny are all studio-style read speech. Performance on conversational audio, noisy environments, or accented far-field input may degrade.
- 0.5B backbone: quality is below frontier ASR systems (Whisper-large-v3, USM, etc.). The pitch is "small enough to run anywhere" plus an architecture shared with Wren-TTS-0.5B-multi for unified speech-text experimentation.
- 30 s audio cap: hard-capped at training time; longer audio needs to be segmented externally.
- No speaker diarization. Single-stream transcription only.
## The Wren series
Wren is a family of compact (<3B parameter) multimodal speech LLMs — small enough to run on a single consumer GPU, designed for open research on unified speech understanding and synthesis.
- Wren-TTS — text → speech (English + multilingual + expressive variants)
- Wren-ASR — speech → text (this release)
- Wren-LM — speech-language modelling / dialog (planned)
- Wren-Omni — unified ASR + TTS + LM in one checkpoint (planned)
All Wren models share the same design principles: small backbone LLM + neural audio codec, open weights, simple PyTorch checkpoints, reproducible training recipes. Wren-ASR uses the same Qwen2.5-0.5B backbone as Wren-TTS-0.5B-multi and is trained on the same corpora — making the pair a natural starting point for unified speech-text modelling research.
## Repository contents
| File | Purpose |
|---|---|
| `model.safetensors` | Model weights |
| `config.json` | `WrenASRConfig` (with `auto_map` for `trust_remote_code`) |
| `tokenizer.json` + friends | Qwen2.5 tokenizer with Wren-ASR's 2 special tokens added |
| `processor_config.json` | `WrenASRProcessor` `auto_map` |
| `configuration_wren_asr.py` | `WrenASRConfig(PretrainedConfig)` |
| `modeling_wren_asr.py` | `WrenForASR(PreTrainedModel)`; loads the Mimi codec lazily on first call |
| `processing_wren_asr.py` | `WrenASRProcessor(ProcessorMixin)`; audio → Mimi codes + text decode |
| `README.md` | This model card |
## Citation
```bibtex
@misc{wren2026,
  title  = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
  author = {Shangeth Rajaa},
  year   = {2026},
  url    = {https://github.com/shangeth/wren}
}
```
## License
Apache-2.0 for the checkpoint weights and code in this repo. Upstream components carry their own licenses — review before redistribution. The Expresso dataset (used for English style robustness) is CC-BY-NC-4.0; if you build derived models on this checkpoint and want to release them commercially, retrain with Expresso excluded.