---
license: apache-2.0
language:
- en
- de
- fr
- es
- nl
- it
- pl
- pt
library_name: pytorch
tags:
- automatic-speech-recognition
- asr
- audio
- speech-recognition
- multilingual
- wren
- mimi
- qwen2.5
- neural-codec
pipeline_tag: automatic-speech-recognition
datasets:
- shangeth/mls-mimi-codes
- shangeth/libritts-r-mimi-codes
- shangeth/vctk-mimi-codes
- shangeth/jenny-mimi-codes
- shangeth/ljspeech-mimi-codes
- shangeth/expresso-mimi-codes-tagged
- facebook/multilingual_librispeech
- mythicinfinity/libritts_r
- keithito/lj_speech
- CSTR-Edinburgh/vctk
- reach-vb/jenny_tts_dataset
- ylacombe/expresso
---

# Wren-ASR-0.5B-multi

**Multilingual** automatic speech recognition model in the Wren series. It encodes
audio with the [Kyutai Mimi](https://huggingface.co/kyutai/mimi) neural codec,
then transcribes with a [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
backbone: no acoustic encoder, no CTC, just a small LLM consuming Mimi codes as
input embeddings.

Supports **8 languages**: English, German, French, Spanish, Dutch, Italian, Polish, Portuguese.

## Links

- **Training & inference code:** [github.com/shangeth/wren-asr](https://github.com/shangeth/wren-asr)
- **Wren research project:** [github.com/shangeth/wren](https://github.com/shangeth/wren)
- **TTS counterpart:** [shangeth/Wren-TTS-0.5B-multi](https://huggingface.co/shangeth/Wren-TTS-0.5B-multi)
- **Dataset extraction (Mimi codes):** [github.com/shangeth/wren-datasets](https://github.com/shangeth/wren-datasets)
- **Demo Space:** [huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo](https://huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo)

## Architecture

```
audio ──► Mimi encoder (k=3) ──► Qwen2.5-0.5B (audio prefix → text) ──► transcript
```

Mimi codes serve as a discrete audio prefix in the LLM's input embedding space.
At each audio frame, the k=3 codebook codes go through k separate input embedding
tables; their sum, scaled by 1/√k, is the input embedding for that step. The
audio prefix is wrapped in `<|audio_start|>` / `<|audio_end|>` tokens, after
which the LLM autoregressively emits text using its native vocabulary and
`lm_head`. No new output heads were added.

- **Backbone:** Qwen2.5-0.5B (causal LM; transformer body ~358M params, 151k-token multilingual vocab)
- **Audio tokenizer:** Mimi (`kyutai/mimi`), 12.5 fps, 2048-entry codebooks
- **Codebooks used:** first 3 (semantic-content-rich); shrinks the audio input embedding tables by a factor of 8/3 vs 8-codebook variants
- **Audio prefix:** `<|audio_start|>` + summed-codebook embeds × T_frames + `<|audio_end|>`
- **Output:** standard text autoregression via `model.llm.generate(inputs_embeds=...)`

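
The summed-codebook embedding and the audio prefix described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the code in `modeling_wren_asr.py`: the class and function names are hypothetical, the width of 896 is Qwen2.5-0.5B's hidden size, and the zero vectors stand in for the learned `<|audio_start|>` / `<|audio_end|>` embeddings.

```python
import math

import torch
import torch.nn as nn


class SummedCodebookEmbedding(nn.Module):
    """Hypothetical sketch: one embedding table per codebook, summed per frame."""

    def __init__(self, num_codebooks: int = 3, codebook_size: int = 2048, dim: int = 896):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_codebooks)]
        )
        self.scale = 1.0 / math.sqrt(num_codebooks)  # the 1/sqrt(k) factor

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, num_codebooks, frames) integer Mimi code indices
        summed = sum(table(codes[:, i]) for i, table in enumerate(self.tables))
        return summed * self.scale  # (batch, frames, dim)


def build_audio_prefix(frames: torch.Tensor, start: torch.Tensor, end: torch.Tensor) -> torch.Tensor:
    """Wrap frame embeddings with start/end token embeddings of shape (dim,)."""
    batch = frames.shape[0]
    return torch.cat(
        [start.expand(batch, 1, -1), frames, end.expand(batch, 1, -1)], dim=1
    )


embed = SummedCodebookEmbedding()
codes = torch.randint(0, 2048, (2, 3, 375))  # 2 clips, k=3, 30 s at 12.5 fps
frames = embed(codes)
# Placeholder vectors; the real model would use learned special-token embeddings
prefix = build_audio_prefix(frames, torch.zeros(896), torch.zeros(896))
print(prefix.shape)  # torch.Size([2, 377, 896])
print(sum(p.numel() for p in embed.parameters()))  # audio embedding parameters for k=3
```

Under these assumptions the audio embedding tables hold 3 × 2048 × 896 parameters; an 8-codebook variant with the same width would need 8/3 times as many, which is the reduction the list above refers to.
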
## Training data

Trained on the **union of every dataset used to train Wren-TTS**: the same
6 corpora that power the en/multi/expressive TTS recipes, with text used as the
ASR target:

| Dataset | Rows | Language(s) |
|---|---|---|
| VCTK | ~44k | en (109 speakers, multiple accents) |
| Jenny | ~21k | en (single speaker) |
| LibriTTS-R | ~360k | en (clean_100 + clean_360 + other_500) |
| LJSpeech | ~13k | en (single speaker) |
| MLS | ~6.0M | de · fr · es · it · nl · pl · pt |
| Expresso (tagged) | ~26k | en (style tags stripped at load time) |
| **Total** | **~6.46M** rows / epoch | |

Mimi codes are pre-extracted and published as the per-corpus mimi-codes datasets
(see the `datasets` list in the metadata above), so no online encoding happens
during training. Single-pass, from-scratch training with k=3 codebooks. Held-out
validation combines LibriTTS-R `dev_clean` + MLS `dev` (all 7 languages) +
Expresso `dev` (tags stripped) + a 5% split from each single-speaker English
source. All dataset weights are set to 1.0 (every row, every epoch, no
subsampling). Trained on a single A100-40GB.

Text casing and punctuation are preserved in the ground-truth transcripts.

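
As a quick sanity check on the table above, the per-corpus row counts can be summed to recover the stated total and to see how the rows split between English and the seven MLS languages. The numbers are the approximate counts from the table (in thousands); these are rows, not hours of audio.

```python
# Approximate row counts in thousands, copied from the training-data table
rows_k = {
    "VCTK": 44,
    "Jenny": 21,
    "LibriTTS-R": 360,
    "LJSpeech": 13,
    "MLS": 6000,
    "Expresso (tagged)": 26,
}

total_k = sum(rows_k.values())
# Per the table, every corpus except MLS is English-only
english_k = total_k - rows_k["MLS"]

print(f"total   ~{total_k / 1000:.2f}M rows")
print(f"English ~{english_k}k rows ({english_k / total_k:.1%} of all rows)")
```

With every dataset weight at 1.0, roughly 7% of the rows seen per epoch are English and the rest come from MLS.
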
## Usage

```bash
pip install torch torchaudio transformers
```

```python
import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

model_id = "shangeth/Wren-ASR-0.5B-multi"
device = "cuda" if torch.cuda.is_available() else "cpu"

# trust_remote_code=True loads the custom Wren classes shipped in this repo
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

# Load any short clip (one of the 8 supported languages, ≤ 30 s)
wav, sr = torchaudio.load("input.wav")

# Preprocess the waveform and move the resulting tensors to the model's device
inputs = processor(audio=wav, sampling_rate=sr)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate token ids, then decode them back to text
ids = model.generate(**inputs, max_new_tokens=200)
text = processor.batch_decode(ids, skip_special_tokens=True)[0]
print(text)
```

## Sampling tips

Default: greedy decoding (`do_sample=False`). For longer or harder utterances:
- Pass `do_sample=True, temperature=0.7, top_p=0.9` to sample more diverse hypotheses
- Raise `max_new_tokens` if transcripts are getting cut off
- Audio is hard-capped at 30 s (375 frames @ 12.5 fps) by the training recipe;
  for longer audio, segment first

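
For clips longer than the 30 s cap, one simple option is fixed-length windowing. The helper below is a hypothetical sketch, not part of this repo: it only computes sample-index boundaries, so it works with any waveform array, and it makes no attempt to cut at silences.

```python
MAX_SECONDS = 30.0  # training-time cap stated in this card
FRAME_RATE = 12.5   # Mimi frames per second


def segment_bounds(num_samples: int, sample_rate: int, max_seconds: float = MAX_SECONDS):
    """Return (start, end) sample indices of consecutive chunks of at most max_seconds."""
    chunk = int(max_seconds * sample_rate)
    return [(s, min(s + chunk, num_samples)) for s in range(0, num_samples, chunk)]


# 70 s of 24 kHz audio: two full 30 s chunks plus a 10 s remainder
bounds = segment_bounds(70 * 24_000, 24_000)
print(bounds)
print(int(MAX_SECONDS * FRAME_RATE))  # frames in a full-length chunk
```

Each `(start, end)` pair can then be used to slice the waveform from the usage example (`wav[:, start:end]`), run the slice through the processor and `generate`, and join the transcripts. Because fixed windows can split a word at a boundary, cutting at pauses is likely to give cleaner results.
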
## Limitations & known issues

- **Language coverage:** only the 8 trained languages. Out-of-distribution
  audio produces noise or hallucinated text in the closest matching language.
- **Per-language quality varies with data volume:** German / Dutch / French
  are strongest (largest training shares); Polish / Portuguese / Italian have
  less training data and may be less accurate.
- **Audiobook-style audio dominates training:** MLS / LibriTTS-R / LJSpeech /
  Jenny are all studio-style read speech. Performance may degrade on
  conversational audio, noisy environments, or accented far-field input.
- **0.5B backbone:** quality is below frontier ASR systems (Whisper-large-v3,
  USM, etc.). The pitch is "small enough to run anywhere" + "shares architecture
  with Wren-TTS-0.5B-multi for unified speech-text experimentation".
- **30 s audio cap.** Hard cap at training time; longer audio needs to be
  segmented externally.
- **No speaker diarization.** Single-stream transcription only.

## The Wren series

Wren is a family of compact (<3B parameter) multimodal speech LLMs, small
enough to run on a single consumer GPU and designed for open research on unified
speech understanding and synthesis.

- **Wren-TTS:** text → speech (English + multilingual + expressive variants)
- **Wren-ASR:** speech → text (this release)
- **Wren-LM:** speech-language modelling / dialog (planned)
- **Wren-Omni:** unified ASR + TTS + LM in one checkpoint (planned)

All Wren models share the same design principles: small backbone LLM + neural
audio codec, open weights, simple PyTorch checkpoints, reproducible training
recipes. Wren-ASR uses the same Qwen2.5-0.5B backbone as Wren-TTS-0.5B-multi
and is trained on the same corpora, making the pair a natural starting point
for unified speech-text modelling research.

## Repository contents

| File | Purpose |
|---|---|
| `model.safetensors` | Model weights |
| `config.json` | `WrenASRConfig` (with `auto_map` for `trust_remote_code`) |
| `tokenizer.json` + friends | Qwen2.5 tokenizer with Wren-ASR's 2 special tokens added |
| `processor_config.json` | `WrenASRProcessor` auto_map |
| `configuration_wren_asr.py` | `WrenASRConfig(PretrainedConfig)` |
| `modeling_wren_asr.py` | `WrenForASR(PreTrainedModel)`; loads the Mimi codec lazily on first call |
| `processing_wren_asr.py` | `WrenASRProcessor(ProcessorMixin)`; audio → Mimi codes, plus text decoding |
| `README.md` | This model card |

## Citation

```bibtex
@misc{wren2026,
  title  = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
  author = {Shangeth Rajaa},
  year   = {2026},
  url    = {https://github.com/shangeth/wren}
}
```

## License

Apache-2.0 for the checkpoint weights and code in this repo.
Upstream components carry their own licenses; review them before redistribution.
The Expresso dataset (used for English style robustness) is CC-BY-NC-4.0; if
you build derived models on this checkpoint and want to release them
commercially, retrain with Expresso excluded.