---
language: en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- speech-to-text
- asr
- speech
- english
- qwen3
- audio
- reinforcement-learning
datasets:
- openslr/librispeech_asr
- speechcolab/gigaspeech
- mozilla-foundation/common_voice_17_0
- facebook/voxpopuli
- LIUM/tedlium
- edinburghcstr/ami
- anton-l/earnings22
- kensho/spgispeech
metrics:
- wer
model-index:
- name: Musci-ASR-2.4B
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: Open ASR Leaderboard
      type: hf-audio/esb-datasets-test-only-sorted
    metrics:
    - type: wer
      value: 5.44
      name: Average WER
license: apache-2.0
---
# Musci-ASR-2.4B
Musci-ASR-2.4B is an English speech-to-text model that pairs a Qwen3-1.7B-base language-model backbone with a Qwen3-Omni-MoE audio encoder. A gated-MLP adapter projects audio features into the language-model embedding space. The model is trained on public English ASR corpora and fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits.
The model has approximately 2.4B parameters and is distributed as a single `bfloat16` safetensors shard of approximately 4.84 GB.
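The parameter count and the shard size are consistent with each other, since `bfloat16` stores each parameter in 2 bytes. The following is a minimal sketch of that arithmetic, assuming "GB" means 10^9 bytes and ignoring the small safetensors header; the helper names are hypothetical and not part of the repository.

```python
BYTES_PER_BF16 = 2  # bfloat16 uses 16 bits per parameter


def checkpoint_size_gb(n_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Approximate checkpoint size in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def params_from_size_gb(size_gb: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Approximate parameter count implied by a checkpoint size in decimal gigabytes."""
    return size_gb * 1e9 / bytes_per_param


print(f"{checkpoint_size_gb(2.4e9):.2f} GB")           # 4.80 GB
print(f"{params_from_size_gb(4.84) / 1e9:.2f}B params")  # 2.42B params
```

Read the other way, a 4.84 GB shard corresponds to roughly 2.42B parameters, which rounds to the stated 2.4B.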
## Model Details
- **Developed by:** Musci Research
- **Model type:** Automatic Speech Recognition / speech-to-text model
- **Language:** English
- **License:** Apache-2.0
- **Library:** Transformers
- **Backbone:** Qwen3-1.7B-base, 28 layers, hidden size 2048
- **Audio encoder:** Qwen3-Omni-MoE audio encoder
- **Adapter:** Gated-MLP adapter, hidden size 8192
- **Parameter size:** approximately 2.4B
- **Checkpoint format:** `bfloat16` safetensors
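To illustrate the adapter entry above, here is a minimal PyTorch sketch of a gated MLP that maps audio-encoder frames into a 2048-dimensional embedding space through an 8192-wide hidden layer. Only the hidden size (8192) and the backbone hidden size (2048) come from this card. The class name, the SiLU activation, the bias-free layers, and the `audio_dim=1024` input width are assumptions made for illustration; the actual adapter lives in the repository's remote code and may differ.

```python
import torch
from torch import nn


class GatedMLPAdapter(nn.Module):
    """Illustrative gated MLP: down(act(gate(x)) * up(x)). Not the repository's implementation."""

    def __init__(self, audio_dim: int, hidden_dim: int = 8192, lm_dim: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(audio_dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(audio_dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, lm_dim, bias=False)
        self.act = nn.SiLU()  # assumed activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The gate branch modulates the up branch element-wise before projecting down.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))


adapter = GatedMLPAdapter(audio_dim=1024)  # hypothetical encoder output width
frames = torch.randn(1, 50, 1024)          # (batch, audio frames, audio_dim)
projected = adapter(frames)
print(projected.shape)                     # torch.Size([1, 50, 2048])
```

The output has one 2048-dimensional vector per audio frame, so the frames can sit alongside text-token embeddings in the language model's input sequence.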
## Intended Use
This model is intended for English automatic speech recognition: transcribing English speech audio for research and evaluation.
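For evaluation, the metric declared in this card's metadata is word error rate (WER). The sketch below computes WER as the word-level edit distance divided by the number of reference words. It is a plain-Python illustration, not the leaderboard's scoring code: it only lowercases and splits on whitespace, whereas benchmark evaluations typically apply a fuller text normalization first. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Multiplying the result by 100 gives a percentage on the same scale as the WER value reported in the metadata.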
## Inference
```python
import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

REPO = "Musci-research/Musci-ASR-2.4B"
DEVICE = "cuda:0"

# Load the model in bfloat16; the repository ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

# The processor classes are defined in the repository's remote code.
MusciProcessor = get_class_from_dynamic_module("processing_Musci.MusciProcessor", REPO)
MelConfig = get_class_from_dynamic_module("processing_Musci.MelConfig", REPO)

# Log-mel settings; these match the Audio Frontend section.
mel_cfg = MelConfig(
    mel_sr=16000,
    mel_dim=128,
    mel_n_fft=400,
    mel_hop_length=160,
)
processor = MusciProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))

# Load the audio as 16 kHz mono and build the model inputs.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)

# Greedy decoding, stopping at the processor's end token.
with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Drop the prompt tokens and decode only the generated continuation.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)
```
## Audio Frontend
- **Sample rate:** 16 kHz
- **Features:** Whisper log-mel filterbank
- **Mel bins:** 128
- **FFT size:** 400
- **Hop length:** 160
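The frontend values above translate directly into time units: at 16 kHz, a 400-sample FFT window spans 25 ms and a 160-sample hop advances 10 ms, giving 100 feature frames per second. The sketch below works through that arithmetic. The frame-count estimate (`n_samples // HOP_LENGTH`) is an assumption based on a typical centered STFT; the repository's processor may pad or trim differently, and the function name is hypothetical.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between consecutive frames
N_MELS = 128          # mel bins per frame


def frontend_timing(duration_s: float) -> dict:
    """Derive window/hop durations and an approximate feature shape for a clip."""
    n_samples = int(round(duration_s * SAMPLE_RATE))
    approx_frames = n_samples // HOP_LENGTH  # assumption: one frame per hop, no extra padding
    return {
        "window_ms": 1000 * N_FFT / SAMPLE_RATE,
        "hop_ms": 1000 * HOP_LENGTH / SAMPLE_RATE,
        "frames_per_second": SAMPLE_RATE / HOP_LENGTH,
        "approx_frames": approx_frames,
        "approx_feature_shape": (N_MELS, approx_frames),
    }


info = frontend_timing(10.0)
print(info["window_ms"], info["hop_ms"], info["frames_per_second"])  # 25.0 10.0 100.0
print(info["approx_feature_shape"])                                  # (128, 1000)
```

Under these assumptions, a 10-second clip yields roughly a 128 x 1000 log-mel matrix before the audio encoder.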
## Training
The model was trained on public English ASR corpora and fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits.
## Limitations
The model is designed for English ASR. It may perform worse on non-English speech, heavy accents, noisy recordings, overlapping speakers, far-field audio, domain-specific terminology, or audio conditions that differ significantly from the training and evaluation data. The output should be manually reviewed before use in high-stakes settings.
## Citation
```bibtex
@misc{musci_asr_2025,
title = {{Musci-ASR-2.4B}},
author = {{Musci Research}},
year = {2025},
howpublished = {\url{https://huggingface.co/Musci-research/Musci-ASR-2.4B}}
}
```
## License
This model is released under the Apache-2.0 license.