---
license: apache-2.0
language:
- en
- de
- fr
- es
- nl
- it
- pl
- pt
library_name: pytorch
tags:
- automatic-speech-recognition
- asr
- audio
- speech-recognition
- multilingual
- wren
- mimi
- qwen2.5
- neural-codec
pipeline_tag: automatic-speech-recognition
datasets:
- shangeth/mls-mimi-codes
- shangeth/libritts-r-mimi-codes
- shangeth/vctk-mimi-codes
- shangeth/jenny-mimi-codes
- shangeth/ljspeech-mimi-codes
- shangeth/expresso-mimi-codes-tagged
- facebook/multilingual_librispeech
- mythicinfinity/libritts_r
- keithito/lj_speech
- CSTR-Edinburgh/vctk
- reach-vb/jenny_tts_dataset
- ylacombe/expresso
---
# Wren-ASR-0.5B-multi
**Multilingual** automatic speech recognition model in the Wren series. Encodes
audio with the [Kyutai Mimi](https://huggingface.co/kyutai/mimi) neural codec,
then transcribes with a [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
backbone: no acoustic encoder, no CTC, just a small LLM consuming Mimi codes as
input embeddings.
Supports **8 languages**: English, German, French, Spanish, Dutch, Italian, Polish, Portuguese.
## Links
- **Training & inference code:** [github.com/shangeth/wren-asr](https://github.com/shangeth/wren-asr)
- **Wren research project:** [github.com/shangeth/wren](https://github.com/shangeth/wren)
- **TTS counterpart:** [shangeth/Wren-TTS-0.5B-multi](https://huggingface.co/shangeth/Wren-TTS-0.5B-multi)
- **Dataset extraction (Mimi codes):** [github.com/shangeth/wren-datasets](https://github.com/shangeth/wren-datasets)
- **Demo Space:** [huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo](https://huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo)
## Architecture
```
audio ──► Mimi encoder (k=3) ──► Qwen2.5-0.5B (audio prefix → text) ──► transcript
```
Mimi codes serve as a discrete audio prefix in the LLM's input embedding space.
At each audio frame the k=3 codebook codes go through k separate input embedding
tables; their sum (scaled by 1/√k) is the input embedding for that step. The
audio prefix is wrapped in `<|audio_start|>` / `<|audio_end|>` tokens, after
which the LLM autoregressively emits text using its native vocabulary and
`lm_head`; no new output heads were added.
- **Backbone:** Qwen2.5-0.5B (causal LM; transformer body ~358M params, 151k-token multilingual vocab)
- **Audio tokenizer:** Mimi (`kyutai/mimi`), 12.5 fps, 2048-entry codebooks
- **Codebooks used:** first 3 (semantic-content-rich); reduces input embedding size by 8/3× vs 8-codebook variants
- **Audio prefix:** `<|audio_start|>` + summed-codebook embeds × T_frames + `<|audio_end|>`
- **Output:** standard text autoregression via `model.llm.generate(inputs_embeds=...)`
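
The embedding step described above can be sketched as follows. This is a minimal illustration, not the code in `modeling_wren_asr.py`: the names `SummedCodebookEmbedding` and `build_audio_prefix`, the `(batch, k, T_frames)` code layout, and the default `hidden_size=896` (Qwen2.5-0.5B's hidden size) are assumptions made for the example.

```python
import math

import torch
import torch.nn as nn


class SummedCodebookEmbedding(nn.Module):
    """Hypothetical sketch: one embedding table per Mimi codebook, summed and scaled by 1/sqrt(k)."""

    def __init__(self, k: int = 3, codebook_size: int = 2048, hidden_size: int = 896):
        super().__init__()
        self.k = k
        # One independent lookup table per codebook level.
        self.tables = nn.ModuleList(
            [nn.Embedding(codebook_size, hidden_size) for _ in range(k)]
        )

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, k, T_frames) integer Mimi codes (assumed layout).
        summed = sum(table(codes[:, i]) for i, table in enumerate(self.tables))
        # Scale so the variance of the sum stays comparable to a single embedding.
        return summed / math.sqrt(self.k)


def build_audio_prefix(frame_embeds, start_embed, end_embed):
    """Wrap per-frame embeddings with the embeddings of the start/end marker tokens."""
    batch = frame_embeds.size(0)
    start = start_embed.expand(batch, 1, -1)  # (batch, 1, hidden)
    end = end_embed.expand(batch, 1, -1)      # (batch, 1, hidden)
    return torch.cat([start, frame_embeds, end], dim=1)  # (batch, T_frames + 2, hidden)
```

In this sketch, `start_embed` and `end_embed` stand in for the LLM's input embeddings of `<|audio_start|>` and `<|audio_end|>`; the resulting tensor is the kind of input one would pass as `inputs_embeds`.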
## Training data
Trained on the **union of every dataset used to train Wren-TTS**: the same
6 corpora that power the en/multi/expressive TTS recipes, with text used as the
ASR target:
| Dataset | Rows | Language(s) |
|---|---|---|
| VCTK | ~44k | en (109 speakers, multiple accents) |
| Jenny | ~21k | en (single speaker) |
| LibriTTS-R | ~360k | en (clean_100 + clean_360 + other_500) |
| LJSpeech | ~13k | en (single speaker) |
| MLS | ~6.0M | de · fr · es · it · nl · pl · pt |
| Expresso (tagged) | ~26k | en (style tags stripped at load time) |
| **Total** | **~6.46M** rows / epoch | |
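
The total row follows directly from the approximate per-corpus counts. A quick check, using only the rounded figures from the table (so the result is itself approximate), also shows how strongly MLS dominates the mix:

```python
# Approximate row counts in thousands, copied from the table above.
rows_k = {
    "VCTK": 44,
    "Jenny": 21,
    "LibriTTS-R": 360,
    "LJSpeech": 13,
    "MLS": 6000,
    "Expresso (tagged)": 26,
}

total_k = sum(rows_k.values())
mls_share = rows_k["MLS"] / total_k

print(f"total ≈ {total_k / 1000:.2f}M rows")   # total ≈ 6.46M rows
print(f"MLS share ≈ {mls_share:.1%}")          # MLS share ≈ 92.8%
```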
Mimi codes are pre-extracted and published as the per-corpus mimi-codes datasets
(see the `datasets` list in the metadata above), so no online encoding happens
during training. Single-pass from-scratch training with k=3 codebooks. Held-out
validation combines LibriTTS-R `dev_clean` + MLS `dev` (all 7 langs) + Expresso
`dev` (tags stripped) + 5% of each single-speaker English source. All dataset
weights are set to 1.0 (every row, every epoch, no subsampling). Trained on a
single A100-40GB.
Text casing and punctuation are preserved in the ground-truth transcripts.
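
For the sources that ship without a dev split, the 5% hold-out mentioned above can be produced with a deterministic shuffle. The sketch below is one plausible way to do it, not the recipe used for this checkpoint: the helper name `holdout_split`, the seed, and the rounding rule are all assumptions.

```python
import random


def holdout_split(rows, fraction=0.05, seed=0):
    """Hypothetical helper: deterministically hold out `fraction` of rows for validation."""
    rows = list(rows)
    # A seeded RNG keeps the split identical across runs.
    random.Random(seed).shuffle(rows)
    n_val = max(1, round(len(rows) * fraction))
    return rows[n_val:], rows[:n_val]  # (train, validation)


train, val = holdout_split(range(1000))
print(len(train), len(val))  # 950 50
```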
## Usage
```bash
pip install torch torchaudio transformers
```
```python
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor
model_id = "shangeth/Wren-ASR-0.5B-multi"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()
# Load any short clip (one of the 8 supported languages, ≤ 30 s)
wav, sr = torchaudio.load("input.wav")
inputs = processor(audio=wav, sampling_rate=sr)
inputs = {k: v.to(device) for k, v in inputs.items()}
ids = model.generate(**inputs, max_new_tokens=200)
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)
```
## Sampling tips
Defaults: greedy decoding (`do_sample=False`). For longer / harder utterances:
- Pass `do_sample=True, temperature=0.7, top_p=0.9` for more diverse outputs
- Raise `max_new_tokens` if transcripts are getting cut off
- Audio is hard-capped at 30 s (375 frames @ 12.5 fps) by the training recipe;
for longer audio, segment first
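
The 30 s cap and the frame arithmetic above can be turned into a simple segmentation helper. This is a minimal sketch under stated assumptions: `num_frames` and `chunk_bounds` are hypothetical names, and fixed-length cuts can split a word at a boundary, so a silence- or VAD-based splitter may give better transcripts.

```python
import math

MIMI_FPS = 12.5     # frame rate stated in this card
MAX_SECONDS = 30.0  # training-time cap stated in this card


def num_frames(duration_s: float, fps: float = MIMI_FPS) -> int:
    """Approximate number of Mimi frames for a clip, rounded up."""
    return math.ceil(duration_s * fps)


def chunk_bounds(num_samples: int, sample_rate: int, max_seconds: float = MAX_SECONDS):
    """Split [0, num_samples) into consecutive (start, end) sample ranges of at most max_seconds."""
    step = int(max_seconds * sample_rate)
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]


print(num_frames(30.0))                 # 375
print(chunk_bounds(70 * 16000, 16000))  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each `(start, end)` pair can then be used to slice the waveform (for example `wav[:, start:end]`) before running the usage snippet on every chunk and joining the transcripts.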
## Limitations & known issues
- **Language coverage:** only the 8 trained languages. Out-of-distribution
audio produces noise / hallucinated text in the closest matching language.
- **Per-language quality varies with data volume:** German / Dutch / French
are strongest (largest training shares); Polish / Portuguese / Italian have
less training data and may be less accurate.
- **Audiobook-style audio dominates training:** MLS / LibriTTS-R / LJSpeech /
Jenny are all studio-style read speech. Performance on conversational audio,
noisy environments, or accented far-field input may degrade.
- **0.5B backbone:** quality is below frontier ASR systems (Whisper-large-v3,
USM, etc.). The pitch is "small enough to run anywhere" + "shares architecture
with Wren-TTS-0.5B-multi for unified speech-text experimentation".
- **30s audio cap.** Hard-cap at training time; longer audio needs to be
segmented externally.
- **No speaker diarization.** Single-stream transcription only.
## The Wren series
Wren is a family of compact (<3B parameter) multimodal speech LLMs, small
enough to run on a single consumer GPU, designed for open research on unified
speech understanding and synthesis.
- **Wren-TTS:** text → speech (English + multilingual + expressive variants)
- **Wren-ASR:** speech → text (this release)
- **Wren-LM:** speech-language modelling / dialog (planned)
- **Wren-Omni:** unified ASR + TTS + LM in one checkpoint (planned)
All Wren models share the same design principles: small backbone LLM + neural
audio codec, open weights, simple PyTorch checkpoints, reproducible training
recipes. Wren-ASR uses the same Qwen2.5-0.5B backbone as Wren-TTS-0.5B-multi
and is trained on the same corpora, making the pair a natural starting point
for unified speech-text modelling research.
## Repository contents
| File | Purpose |
|---|---|
| `model.safetensors` | Model weights |
| `config.json` | `WrenASRConfig` (with `auto_map` for `trust_remote_code`) |
| `tokenizer.json` + friends | Qwen2.5 tokenizer with Wren-ASR's 2 special tokens added |
| `processor_config.json` | `WrenASRProcessor` auto_map |
| `configuration_wren_asr.py` | `WrenASRConfig(PretrainedConfig)` |
| `modeling_wren_asr.py` | `WrenForASR(PreTrainedModel)`; loads Mimi codec lazily on first call |
| `processing_wren_asr.py` | `WrenASRProcessor(ProcessorMixin)`; audio → Mimi codes + text decode |
| `README.md` | This model card |
## Citation
```bibtex
@misc{wren2026,
title = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
author = {Shangeth Rajaa},
year = {2026},
url = {https://github.com/shangeth/wren}
}
```
## License
Apache-2.0 for the checkpoint weights and code in this repo.
Upstream components carry their own licenses; review before redistribution.
The Expresso dataset (used for English style robustness) is CC-BY-NC-4.0; if
you build derived models on this checkpoint and want to release them
commercially, retrain with Expresso excluded.