	---
	license: apache-2.0
	language:
	- en
	- de
	- fr
	- es
	- nl
	- it
	- pl
	- pt
	library_name: pytorch
	tags:
	- automatic-speech-recognition
	- asr
	- audio
	- speech-recognition
	- multilingual
	- wren
	- mimi
	- qwen2.5
	- neural-codec
	pipeline_tag: automatic-speech-recognition
	datasets:
	- shangeth/mls-mimi-codes
	- shangeth/libritts-r-mimi-codes
	- shangeth/vctk-mimi-codes
	- shangeth/jenny-mimi-codes
	- shangeth/ljspeech-mimi-codes
	- shangeth/expresso-mimi-codes-tagged
	- facebook/multilingual_librispeech
	- mythicinfinity/libritts_r
	- keithito/lj_speech
	- CSTR-Edinburgh/vctk
	- reach-vb/jenny_tts_dataset
	- ylacombe/expresso
	---

	# Wren-ASR-0.5B-multi

	Multilingual automatic speech recognition model in the Wren series. Encodes
	audio with the [Kyutai Mimi](https://huggingface.co/kyutai/mimi) neural codec,
	then transcribes with a [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
	backbone — no acoustic encoder, no CTC, just a small LLM consuming Mimi codes as
	input embeddings.

	Supports 8 languages: English, German, French, Spanish, Dutch, Italian, Polish, Portuguese.

	## Links

	- Training & inference code: [github.com/shangeth/wren-asr](https://github.com/shangeth/wren-asr)
	- Wren research project: [github.com/shangeth/wren](https://github.com/shangeth/wren)
	- TTS counterpart: [shangeth/Wren-TTS-0.5B-multi](https://huggingface.co/shangeth/Wren-TTS-0.5B-multi)
	- Dataset extraction (Mimi codes): [github.com/shangeth/wren-datasets](https://github.com/shangeth/wren-datasets)
	- Demo Space: [huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo](https://huggingface.co/spaces/shangeth/Wren-ASR-0.5B-multi-demo)

	## Architecture

	```
	audio ──► Mimi encoder (k=3) ──► Qwen2.5-0.5B (audio prefix → text) ──► transcript
	```

	Mimi codes serve as a discrete audio prefix in the LLM's input embedding space.
	At each audio frame the k=3 codebook codes go through k separate input embedding
	tables; their sum (scaled by 1/√k) is the input embedding for that step. The
	audio prefix is wrapped in `<|audio_start|>` / `<|audio_end|>` tokens, after
	which the LLM autoregressively emits text using its native vocabulary and
	`lm_head` — no new output heads were added.
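
The per-frame embedding described above can be sketched in plain Python. This is a toy illustration, not the repository's implementation: the table contents are random, `HIDDEN` is a deliberately tiny width, and the names `tables`, `frame_embedding`, and `audio_prefix` are hypothetical.

```python
import math
import random

K = 3                 # codebooks used, per this model card
CODEBOOK_SIZE = 2048  # entries per Mimi codebook, per this model card
HIDDEN = 8            # toy width for illustration; the real backbone is much wider

rng = random.Random(0)
# One embedding table per codebook: tables[c][code] is a vector of length HIDDEN.
tables = [
    [[rng.gauss(0.0, 1.0) for _ in range(HIDDEN)] for _ in range(CODEBOOK_SIZE)]
    for _ in range(K)
]


def frame_embedding(codes):
    """Sum the K per-codebook embeddings of one frame and scale by 1/sqrt(K)."""
    if len(codes) != K:
        raise ValueError(f"expected {K} codes per frame, got {len(codes)}")
    scale = 1.0 / math.sqrt(K)
    return [
        scale * sum(tables[c][codes[c]][d] for c in range(K))
        for d in range(HIDDEN)
    ]


def audio_prefix(frames):
    """Map a sequence of K-code frames to one input embedding per frame."""
    return [frame_embedding(codes) for codes in frames]


frames = [(5, 17, 2047), (0, 1, 2)]  # two made-up frames of K=3 codes each
prefix = audio_prefix(frames)
```

The 1/√k factor keeps the variance of the summed vector comparable to that of a single embedding when the tables are independently initialised.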

	- Backbone: Qwen2.5-0.5B (causal LM; transformer body ~358M params, 151k-token multilingual vocab)
	- Audio tokenizer: Mimi (`kyutai/mimi`), 12.5 fps, 2048-entry codebooks
	- Codebooks used: first 3 (semantic-content-rich); this needs 3 audio embedding tables instead of 8, i.e. 8/3× fewer audio input-embedding parameters than 8-codebook variants
	- Audio prefix: `<|audio_start|>` + summed-codebook embeds × T_frames + `<|audio_end|>`
	- Output: standard text autoregression via `model.llm.generate(inputs_embeds=...)`
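
As a rough worked example of the numbers in this list, the snippet below estimates the audio prefix length for a clip and the size of the audio embedding tables. It assumes frames are rounded up, that each table is as wide as the backbone's hidden size, and that this hidden size is 896 (the value in the upstream Qwen2.5-0.5B config); none of these details are stated in this card.

```python
import math

FRAME_RATE_HZ = 12.5   # Mimi frame rate, per this model card
CODEBOOK_SIZE = 2048   # entries per codebook, per this model card
HIDDEN_SIZE = 896      # assumption: hidden size of the upstream Qwen2.5-0.5B config


def prefix_length(duration_s):
    """Audio frames plus the two marker tokens wrapping the prefix."""
    frames = math.ceil(duration_s * FRAME_RATE_HZ)  # assumption: partial frames round up
    return frames + 2


def embedding_params(num_codebooks):
    """Parameters in the audio input embedding tables (one table per codebook)."""
    return num_codebooks * CODEBOOK_SIZE * HIDDEN_SIZE


# A 30 s clip: 375 frames + 2 marker tokens = 377 prefix positions.
print(prefix_length(30.0))
# Ratio of 8-codebook to 3-codebook embedding parameters is 8/3.
print(embedding_params(8) / embedding_params(3))
```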

	## Training data

	Trained on the union of every dataset used to train Wren-TTS — the same
	6 corpora that power the en/multi/expressive TTS recipes, with text used as the
	ASR target:

	| Dataset | Rows | Language(s) |
	|---|---|---|
	| VCTK | ~44k | en (109 speakers, multiple accents) |
	| Jenny | ~21k | en (single speaker) |
	| LibriTTS-R | ~360k | en (clean_100 + clean_360 + other_500) |
	| LJSpeech | ~13k | en (single speaker) |
	| MLS | ~6.0M | de · fr · es · it · nl · pl · pt |
	| Expresso (tagged) | ~26k | en (style tags stripped at load time) |
	| Total | ~6.46M rows / epoch | |
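
To see how the table's approximate row counts add up, and how heavily MLS dominates the mix, a quick check using only the rounded figures above:

```python
# Approximate row counts from the table, in thousands.
rows_k = {
    "VCTK": 44,
    "Jenny": 21,
    "LibriTTS-R": 360,
    "LJSpeech": 13,
    "MLS": 6000,
    "Expresso (tagged)": 26,
}

total_k = sum(rows_k.values())         # all corpora combined
english_k = total_k - rows_k["MLS"]    # every non-MLS corpus in the table is English
mls_share = rows_k["MLS"] / total_k

print(f"total ≈ {total_k / 1000:.2f}M rows")
print(f"MLS share ≈ {mls_share:.1%}, English share ≈ {english_k / total_k:.1%}")
```

With these rounded counts, the seven MLS languages account for roughly 93% of rows and English for roughly 7%.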

	Mimi codes are pre-extracted and published as the per-corpus mimi-codes datasets
	(listed under `datasets` in the metadata above), so no online encoding happens
	during training. Single-pass from-scratch training with k=3 codebooks. Held-out
	validation combines LibriTTS-R `dev_clean` + MLS `dev` (all 7 langs) + Expresso
	`dev` (tags stripped) + 5% of each single-speaker English source. All dataset
	sampling weights are set to 1.0 (every row, every epoch, no subsampling).
	Trained on a single A100-40GB.
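
The card does not say how the 5% hold-out for the single-speaker sources was drawn. One reproducible way to carve out such a split is to hash each utterance ID into a bucket; the sketch below is an illustrative approach, not the author's method, and the ID format and the `is_validation` helper are hypothetical.

```python
import hashlib


def is_validation(utt_id, fraction=0.05):
    """Deterministically assign about `fraction` of utterance IDs to validation."""
    digest = hashlib.sha256(utt_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


# Hypothetical IDs standing in for a ~13k-row single-speaker corpus.
ids = [f"ljspeech-{i:05d}" for i in range(13000)]
val = [u for u in ids if is_validation(u)]
train = [u for u in ids if not is_validation(u)]
print(len(train), len(val))
```

Because the assignment depends only on the ID, the split stays stable across runs and machines without storing an index file.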

	Text casing and punctuation are preserved in the ground-truth transcripts.

	## Usage

	```bash
	pip install torch torchaudio transformers
	```

	```python
	import torch
	import torchaudio
	from transformers import AutoModel, AutoProcessor

	model_id = "shangeth/Wren-ASR-0.5B-multi"
	device = "cuda" if torch.cuda.is_available() else "cpu"

	processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
	model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

	# Load any short clip (one of the 8 supported languages, ≤ 30 s)
	wav, sr = torchaudio.load("input.wav")

	inputs = processor(audio=wav, sampling_rate=sr)
	inputs = {k: v.to(device) for k, v in inputs.items()}

	ids = model.generate(**inputs, max_new_tokens=200)
	text = processor.batch_decode(ids, skip_special_tokens=True)[0]
	print(text)
	```

	## Sampling tips

	Default: greedy decoding (`do_sample=False`). For longer / harder utterances:
	- Pass `do_sample=True, temperature=0.7, top_p=0.9` to sample more varied hypotheses (this is nucleus sampling, not beam search)
	- Raise `max_new_tokens` if transcripts are getting cut off
	- Audio is hard-capped at 30 s (375 frames @ 12.5 fps) by the training recipe;
	  for longer audio, segment first
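
A minimal sketch of the segmentation step, assuming fixed-length cuts are acceptable. The helper `segment_bounds` is hypothetical and only computes sample ranges; splitting at silences (for example with a voice activity detector) would avoid cutting words in half.

```python
MAX_SECONDS = 30.0  # training-time cap from this model card


def segment_bounds(num_samples, sample_rate, max_seconds=MAX_SECONDS):
    """Return (start, end) sample ranges, each at most `max_seconds` long."""
    if num_samples < 0 or sample_rate <= 0:
        raise ValueError("num_samples must be >= 0 and sample_rate must be > 0")
    step = int(max_seconds * sample_rate)
    return [
        (start, min(start + step, num_samples))
        for start in range(0, num_samples, step)
    ]


# A 75 s clip at 16 kHz splits into 30 s + 30 s + 15 s.
bounds = segment_bounds(75 * 16000, 16000)
print(bounds)
```

With a waveform tensor of shape `(channels, samples)`, as returned by `torchaudio.load`, each chunk is `wav[:, start:end]`; transcribe the chunks one by one and join the resulting text.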

	## Limitations & known issues

	- Language coverage: only the 8 trained languages. Audio in any other language
	produces noise or hallucinated text in the closest matching language.
	- Per-language quality varies with data volume: German / Dutch / French
	are strongest (largest training shares); Polish / Portuguese / Italian have
	less training data and may be less accurate.
	- Audiobook-style audio dominates training: MLS / LibriTTS-R / LJSpeech /
	Jenny are all studio-style read speech. Performance on conversational audio,
	noisy environments, or accented far-field input may degrade.
	- 0.5B backbone — quality is below frontier ASR systems (Whisper-large-v3,
	USM, etc.). The pitch is "small enough to run anywhere" + "shares architecture
	with Wren-TTS-0.5B-multi for unified speech-text experimentation".
	- 30 s audio cap. Hard-capped at training time; longer audio needs to be
	segmented externally.
	- No speaker diarization. Single-stream transcription only.

	## The Wren series

	Wren is a family of compact (<3B parameter) multimodal speech LLMs — small
	enough to run on a single consumer GPU, designed for open research on unified
	speech understanding and synthesis.

	- Wren-TTS — text → speech (English + multilingual + expressive variants)
	- Wren-ASR — speech → text (this release)
	- Wren-LM — speech-language modelling / dialog (planned)
	- Wren-Omni — unified ASR + TTS + LM in one checkpoint (planned)

	All Wren models share the same design principles: small backbone LLM + neural
	audio codec, open weights, simple PyTorch checkpoints, reproducible training
	recipes. Wren-ASR uses the same Qwen2.5-0.5B backbone as Wren-TTS-0.5B-multi
	and is trained on the same corpora — making the pair a natural starting point
	for unified speech-text modelling research.

	## Repository contents

	| File | Purpose |
	|---|---|
	| `model.safetensors` | Model weights |
	| `config.json` | `WrenASRConfig` (with `auto_map` for `trust_remote_code`) |
	| `tokenizer.json` + friends | Qwen2.5 tokenizer with Wren-ASR's 2 special tokens added |
	| `processor_config.json` | `WrenASRProcessor` auto_map |
	| `configuration_wren_asr.py` | `WrenASRConfig(PretrainedConfig)` |
	| `modeling_wren_asr.py` | `WrenForASR(PreTrainedModel)`; loads the Mimi codec lazily on first call |
	| `processing_wren_asr.py` | `WrenASRProcessor(ProcessorMixin)`; audio → Mimi codes + text decode |
	| `README.md` | This model card |
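
For readers unfamiliar with `auto_map`: with `trust_remote_code=True`, Transformers resolves the `Auto*` classes to `module.ClassName` entries that point at the Python files shipped in the repo. The dictionaries below are a guess at what those entries plausibly look like, built only from the file and class names in the table; the actual contents of `config.json` and `processor_config.json` may differ.

```python
import json

# Assumed mapping: "<module file without .py>.<class name>", per the table above.
config_auto_map = {
    "AutoConfig": "configuration_wren_asr.WrenASRConfig",
    "AutoModel": "modeling_wren_asr.WrenForASR",
}
processor_auto_map = {
    "AutoProcessor": "processing_wren_asr.WrenASRProcessor",
}

# Print the fragments as they might appear inside the two JSON files.
print(json.dumps({"auto_map": config_auto_map}, indent=2))
print(json.dumps({"auto_map": processor_auto_map}, indent=2))
```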

	## Citation

	```bibtex
	@misc{wren2026,
	title = {Wren: A Family of Small Open-Weight Models for Unified Speech-Text Modelling},
	author = {Shangeth Rajaa},
	year = {2026},
	url = {https://github.com/shangeth/wren}
	}
	```

	## License

	Apache-2.0 for the checkpoint weights and code in this repo.
	Upstream components carry their own licenses — review before redistribution.
	The Expresso dataset (used for English style robustness) is CC-BY-NC-4.0; if
	you build derived models on this checkpoint and want to release them
	commercially, retrain with Expresso excluded.