license: apache-2.0
language:
  - en
  - de
  - fr
  - es
  - nl
  - it
  - pl
  - pt
library_name: pytorch
tags:
  - automatic-speech-recognition
  - asr
  - audio
  - speech-recognition
  - multilingual
  - wren
  - mimi
  - qwen2.5
  - neural-codec
pipeline_tag: automatic-speech-recognition
datasets:
  - shangeth/mls-mimi-codes
  - shangeth/libritts-r-mimi-codes
  - shangeth/vctk-mimi-codes
  - shangeth/jenny-mimi-codes
  - shangeth/ljspeech-mimi-codes
  - shangeth/expresso-mimi-codes-tagged
  - facebook/multilingual_librispeech
  - mythicinfinity/libritts_r
  - keithito/lj_speech
  - CSTR-Edinburgh/vctk
  - reach-vb/jenny_tts_dataset
  - ylacombe/expresso

Wren-ASR-0.5B-multi

Multilingual automatic speech recognition model in the Wren series. Encodes audio with the Kyutai Mimi neural codec, then transcribes with a Qwen/Qwen2.5-0.5B backbone: no acoustic encoder, no CTC, just a small LLM consuming Mimi codes as input embeddings.

Supports 8 languages: English, German, French, Spanish, Dutch, Italian, Polish, Portuguese.

Architecture

audio ──► Mimi encoder (k=3) ──► Qwen2.5-0.5B (audio prefix → text) ──► transcript

Mimi codes serve as a discrete audio prefix in the LLM's input embedding space. At each audio frame the k=3 codebook codes go through k separate input embedding tables; their sum (scaled by 1/√k) is the input embedding for that step. The audio prefix is wrapped in <|audio_start|> / <|audio_end|> tokens, after which the LLM autoregressively emits text using its native vocabulary and lm_head. No new output heads were added.

  • Backbone: Qwen2.5-0.5B (causal LM; transformer body ~358M params, 151k-token multilingual vocab)
  • Audio tokenizer: Mimi (kyutai/mimi), 12.5 fps, 2048-entry codebooks
  • Codebooks used: first 3 (semantic-content-rich); reduces input embedding size 8/3× vs 8-codebook variants
  • Audio prefix: <|audio_start|> + summed-codebook embeds × T_frames + <|audio_end|>
  • Output: standard text autoregression via model.llm.generate(inputs_embeds=...)
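
The summed-codebook embedding and audio prefix described above can be sketched in a few lines. This is a minimal NumPy illustration, not the model's actual PyTorch code: the hidden width `HIDDEN`, the random tables, the stand-in start/end embeddings, and the helper names `embed_audio_codes` and `build_audio_prefix` are all assumptions made for the example.

```python
import numpy as np

K = 3                 # codebooks used per frame (from the model card)
CODEBOOK_SIZE = 2048  # entries per Mimi codebook (from the model card)
HIDDEN = 896          # ASSUMPTION: embedding width, chosen for illustration

rng = np.random.default_rng(0)

# One input embedding table per codebook. Random values stand in for learned weights.
tables = [rng.standard_normal((CODEBOOK_SIZE, HIDDEN)) for _ in range(K)]

# Hypothetical stand-ins for the <|audio_start|> / <|audio_end|> token embeddings.
audio_start = rng.standard_normal((1, HIDDEN))
audio_end = rng.standard_normal((1, HIDDEN))


def embed_audio_codes(codes, tables):
    """Map Mimi codes of shape (T, k) to frame embeddings of shape (T, hidden)."""
    k = len(tables)
    # Look up each codebook's code in its own table, then sum across codebooks.
    summed = sum(tables[i][codes[:, i]] for i in range(k))
    # Scale the sum by 1/sqrt(k).
    return summed / np.sqrt(k)


def build_audio_prefix(codes, tables, start, end):
    """Wrap the frame embeddings in start/end embeddings: shape (T + 2, hidden)."""
    frames = embed_audio_codes(codes, tables)
    return np.concatenate([start, frames, end], axis=0)


# 30 s of audio at 12.5 fps is 375 frames, each carrying K codes.
codes = rng.integers(0, CODEBOOK_SIZE, size=(375, K))
prefix = build_audio_prefix(codes, tables, audio_start, audio_end)
print(prefix.shape)  # (377, 896)
```

With independent, unit-variance tables, the 1/√k factor keeps the summed embedding at roughly unit variance regardless of k. Under the assumed width, three tables hold 3 × 2048 × 896 entries versus 8 × 2048 × 896 for eight codebooks, which is the 8/3× ratio quoted in the list.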

Training data

Trained on the union of every dataset used to train Wren-TTS, i.e. the same 6 corpora that power the en/multi/expressive TTS recipes, with the text used as the ASR target:

| Dataset | Rows | Language(s) |
|---|---|---|
| VCTK | ~44k | en (109 speakers, multiple accents) |
| Jenny | ~21k | en (single speaker) |
| LibriTTS-R | ~360k | en (clean_100 + clean_360 + other_500) |
| LJSpeech | ~13k | en (single speaker) |
| MLS | ~6.0M | de · fr · es · it · nl · pl · pt |
| Expresso (tagged) | ~26k | en (style tags stripped at load time) |
| Total | ~6.46M rows / epoch | |
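
As a quick sanity check on the total, the per-corpus counts can be summed directly. The figures below are the rounded values from the table, so the resulting shares are approximate.

```python
# Approximate row counts in thousands, copied from the table above.
rows_k = {
    "VCTK": 44,
    "Jenny": 21,
    "LibriTTS-R": 360,
    "LJSpeech": 13,
    "MLS": 6000,
    "Expresso (tagged)": 26,
}

total_k = sum(rows_k.values())
english_only_k = total_k - rows_k["MLS"]  # every corpus except MLS is English-only

print(f"total: ~{total_k / 1000:.2f}M rows")  # total: ~6.46M rows
print(f"English-only corpora: {english_only_k / total_k:.1%} of rows")  # 7.2% of rows
```

On these rounded numbers, MLS supplies roughly 93% of the rows in each epoch.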

Mimi codes are pre-extracted and published as the per-corpus mimi-codes datasets (see Datasets above), so there is no online encoding during training. Training is single-pass and from scratch, using k=3 codebooks. Held-out validation combines LibriTTS-R dev_clean + MLS dev (all 7 langs) + Expresso dev (tags stripped) + 5% per single-speaker English source. All per-dataset weights are set to 1.0 (every row, every epoch, no subsampling). Trained on a single A100-40GB.
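
For the corpora that ship without a dev split, a 5% holdout could be carved out along the following lines. This is only a sketch of one reasonable approach: the helper `split_holdout`, the seeded shuffle, and the seed value are hypothetical and are not taken from the Wren training code.

```python
import random


def split_holdout(num_rows, holdout_fraction=0.05, seed=0):
    """Return (train_indices, val_indices) using a fixed, reproducible shuffle."""
    indices = list(range(num_rows))
    # A seeded shuffle makes the split identical on every run.
    random.Random(seed).shuffle(indices)
    n_val = round(num_rows * holdout_fraction)
    val = sorted(indices[:n_val])
    train = sorted(indices[n_val:])
    return train, val


# Example: a single-speaker corpus of about 21k rows.
train_idx, val_idx = split_holdout(21_000)
print(len(train_idx), len(val_idx))  # 19950 1050
```

Fixing the seed matters here because a split that changes between runs would leak validation rows into training across restarts.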

Text casing and punctuation are preserved in the ground-truth transcripts.

Usage

pip install torch torchaudio transformers

import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

model_id  = "shangeth/Wren-ASR-0.5B-multi"
device    = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model     = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

# Load any short clip (one of the 8 supported languages, ≤ 30 s)
wav, sr = torchaudio.load("input.wav")

inputs = processor(audio=wav, sampling_rate=sr)
inputs = {k: v.to(device) for k, v in inputs.items()}

ids  = model.generate(**inputs, max_new_tokens=200)
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)

Sampling tips

Defaults: greedy decoding (do_sample=False). For longer / harder utterances:

  • Pass do_sample=True, temperature=0.7, top_p=0.9 to sample more diverse outputs
  • Raise max_new_tokens if transcripts are getting cut off
  • Audio is hard-capped at 30 s (375 frames @ 12.5 fps) by the training recipe; for longer audio, segment first
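
The 30 s cap and the segmentation step can be worked through with plain arithmetic. The sketch below cuts at fixed sample offsets. The helper names `num_frames` and `chunk_bounds` and the 24 kHz sample rate in the example are illustrative assumptions, and a real pipeline would probably prefer to cut at silences so that words are not split.

```python
import math

FRAME_RATE_HZ = 12.5  # Mimi frame rate (from the model card)
MAX_SECONDS = 30.0    # training-time audio cap (from the model card)


def num_frames(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Number of codec frames needed to cover a clip of the given duration."""
    return math.ceil(seconds * frame_rate_hz)


def chunk_bounds(num_samples, sample_rate, max_seconds=MAX_SECONDS):
    """Return (start, end) sample offsets of consecutive chunks of at most max_seconds."""
    step = int(max_seconds * sample_rate)
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


print(num_frames(30.0))  # 375

# A 70 s clip at an assumed 24 kHz sample rate splits into 30 s + 30 s + 10 s.
print(chunk_bounds(70 * 24_000, 24_000))
# [(0, 720000), (720000, 1440000), (1440000, 1680000)]
```

Each (start, end) pair can then be used to slice the waveform before it is passed to the processor, with the per-chunk transcripts joined afterwards.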

Limitations & known issues

  • Language coverage: only the 8 trained languages. Out-of-distribution audio produces noise / hallucinated text in the closest matching language.
  • Per-language quality varies with data volume: German / Dutch / French are strongest (largest training shares); Polish / Portuguese / Italian have less training data and may be less accurate.
  • Audiobook-style audio dominates training: MLS / LibriTTS-R / LJSpeech / Jenny are all studio-style read speech. Performance on conversational audio, noisy environments, or accented far-field input may degrade.
  • 0.5B backbone: quality is below frontier ASR systems (Whisper-large-v3, USM, etc.). The pitch is "small enough to run anywhere" + "shares architecture with Wren-TTS-0.5B-multi for unified speech-text experimentation".
  • 30s audio cap. Hard-cap at training time; longer audio needs to be segmented externally.
  • No speaker diarization. Single-stream transcription only.

The Wren series

Wren is a family of compact (<3B parameter) multimodal speech LLMs, small enough to run on a single consumer GPU and designed for open research on unified speech understanding and synthesis.

  • Wren-TTS: text → speech (English + multilingual + expressive variants)
  • Wren-ASR: speech → text (this release)
  • Wren-LM: speech-language modelling / dialog (planned)
  • Wren-Omni: unified ASR + TTS + LM in one checkpoint (planned)

All Wren models share the same design principles: small backbone LLM + neural audio codec, open weights, simple PyTorch checkpoints, reproducible training recipes. Wren-ASR uses the same Qwen2.5-0.5B backbone as Wren-TTS-0.5B-multi and is trained on the same corpora, which makes the pair a natural starting point for unified speech-text modelling research.

Repository contents

| File | Purpose |
|---|---|
| model.safetensors | Model weights |
| config.json | WrenASRConfig (with auto_map for trust_remote_code) |
| tokenizer.json + friends | Qwen2.5 tokenizer with Wren-ASR's 2 special tokens added |
| processor_config.json | WrenASRProcessor auto_map |
| configuration_wren_asr.py | WrenASRConfig(PretrainedConfig) |
| modeling_wren_asr.py | WrenForASR(PreTrainedModel); loads the Mimi codec lazily on first call |
| processing_wren_asr.py | WrenASRProcessor(ProcessorMixin); audio → Mimi codes + text decode |
| README.md | This model card |

Citation

@misc{wren2026,
  title  = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
  author = {Shangeth Rajaa},
  year   = {2026},
  url    = {https://github.com/shangeth/wren}
}

License

Apache-2.0 for the checkpoint weights and code in this repo. Upstream components carry their own licenses; review them before redistribution. The Expresso dataset (used for English style robustness) is CC-BY-NC-4.0; if you build derived models on this checkpoint and want to release them commercially, retrain with Expresso excluded.