---
language: en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - automatic-speech-recognition
  - speech-to-text
  - asr
  - speech
  - english
  - qwen3
  - audio
  - reinforcement-learning
datasets:
  - openslr/librispeech_asr
  - speechcolab/gigaspeech
  - mozilla-foundation/common_voice_17_0
  - facebook/voxpopuli
  - LIUM/tedlium
  - edinburghcstr/ami
  - anton-l/earnings22
  - kensho/spgispeech
metrics:
  - wer
model-index:
  - name: Musci-ASR-2.4B
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: Open ASR Leaderboard
          type: hf-audio/esb-datasets-test-only-sorted
        metrics:
          - type: wer
            value: 5.44
            name: Average WER
license: apache-2.0
---

# Musci-ASR-2.4B

Musci-ASR-2.4B is an English speech-to-text model that pairs a Qwen3-1.7B-base language-model backbone with a Qwen3-Omni-MoE audio encoder. A gated-MLP adapter projects audio features into the language-model embedding space. The model is trained on public English ASR corpora and fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits.

The model has approximately 2.4B parameters and is distributed as a single `bfloat16` safetensors shard of approximately 4.84 GB.
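The parameter count and shard size quoted above are consistent with each other, as a quick back-of-envelope check shows. The sketch below assumes that "GB" means decimal gigabytes (10^9 bytes) and that the shard holds little besides the `bfloat16` weights; neither is stated in this card, and the helper names are purely illustrative.

```python
# Back-of-envelope check: bfloat16 stores each parameter in 2 bytes.
BYTES_PER_BF16 = 2


def checkpoint_size_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Raw weight size in decimal gigabytes (assumption: 1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def params_from_size(size_gb: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Invert the estimate: parameter count implied by a given shard size."""
    return size_gb * 1e9 / bytes_per_param


# 2.4B parameters at 2 bytes each comes to roughly 4.8 GB.
print(round(checkpoint_size_gb(2.4e9), 2))
# A 4.84 GB shard implies roughly 2.42B parameters.
print(round(params_from_size(4.84) / 1e9, 2))
```

The same arithmetic gives a rough lower bound on the accelerator memory needed just to hold the weights, before activations and the key-value cache.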


## Model Details

- **Developed by:** Musci Research
- **Model type:** Automatic Speech Recognition / speech-to-text model
- **Language:** English
- **License:** Apache-2.0
- **Library:** Transformers
- **Backbone:** Qwen3-1.7B-base, 28 layers, hidden size 2048
- **Audio encoder:** Qwen3-Omni-MoE audio encoder
- **Adapter:** Gated-MLP adapter, hidden size 8192
- **Parameter size:** approximately 2.4B
- **Checkpoint format:** `bfloat16` safetensors
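
To make the adapter entry more concrete, here is a minimal sketch of what a gated-MLP projector can look like. It is not the released implementation: the SiLU gate, the absence of biases, the class name `GatedMLPAdapter`, and the `audio_dim=1024` encoder width are all assumptions chosen for illustration. Only the adapter hidden size (8192) and the backbone hidden size (2048) come from the list above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMLPAdapter(nn.Module):
    """Illustrative gated-MLP projector (hypothetical, not the released code)."""

    def __init__(self, audio_dim: int, hidden_dim: int = 8192, lm_dim: int = 2048):
        super().__init__()
        # Two parallel projections into the adapter hidden size: one is passed
        # through a nonlinearity and acts as a gate on the other.
        self.gate_proj = nn.Linear(audio_dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(audio_dim, hidden_dim, bias=False)
        # Final projection into the language-model embedding size.
        self.down_proj = nn.Linear(hidden_dim, lm_dim, bias=False)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # SiLU gating is an assumption; the card only says "gated MLP".
        gated = F.silu(self.gate_proj(audio_features)) * self.up_proj(audio_features)
        return self.down_proj(gated)


# audio_dim=1024 is a placeholder; the real encoder width is not given here.
adapter = GatedMLPAdapter(audio_dim=1024)
frames = torch.randn(1, 50, 1024)  # (batch, audio frames, encoder features)
print(adapter(frames).shape)  # torch.Size([1, 50, 2048])
```

Whatever the exact form, the output width has to match the backbone hidden size so that projected audio frames can sit alongside text token embeddings in the same sequence.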

## Intended Use

This model is intended for English automatic speech recognition, that is, transcribing English speech audio for research and evaluation purposes.
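
For evaluation, the metric listed in this card's metadata is word error rate (WER). The following is a minimal sketch of the standard word-level edit-distance definition; it only lowercases and splits on whitespace, whereas leaderboard-style evaluation usually applies a fuller text normalizer first, so its numbers are not directly comparable to the reported score. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling one-row Levenshtein table).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

The function returns a fraction; multiply by 100 to compare against scores quoted as percentages.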

## Inference

```python
import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

REPO = "Musci-research/Musci-ASR-2.4B"
DEVICE = "cuda:0"

# Load the weights in bfloat16. trust_remote_code=True lets Transformers run
# the custom modeling code shipped in the repository.
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

# The processor classes are defined in the repository's processing_Musci module
# and are loaded dynamically.
MusciProcessor = get_class_from_dynamic_module("processing_Musci.MusciProcessor", REPO)
MelConfig = get_class_from_dynamic_module("processing_Musci.MelConfig", REPO)

# Log-mel settings; these match the Audio Frontend section.
mel_cfg = MelConfig(
    mel_sr=16000,
    mel_dim=128,
    mel_n_fft=400,
    mel_hop_length=160,
)
processor = MusciProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))

# librosa resamples to 16 kHz and downmixes to mono by default.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
# Cast the audio features to the model dtype (bfloat16).
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)

# Greedy decoding; generation stops at the processor's end token.
with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Drop the prompt tokens and decode only the newly generated ones.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)
```

## Audio Frontend

- **Sample rate:** 16 kHz
- **Features:** Whisper log-mel filterbank
- **Mel bins:** 128
- **FFT size:** 400
- **Hop length:** 160
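
The frontend settings above translate into concrete timing figures, worked through in the short sketch below. It assumes one mel frame per hop with no extra padding frame, a common convention for Whisper-style extractors; the released processor's exact frame count may differ by a frame, and any further downsampling inside the audio encoder is not covered here. The helper names are illustrative.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between successive frames
N_MELS = 128          # mel bins per frame


def window_ms(n_fft: int = N_FFT, sr: int = SAMPLE_RATE) -> float:
    """Length of one analysis window in milliseconds."""
    return 1000 * n_fft / sr


def frames_per_second(hop: int = HOP_LENGTH, sr: int = SAMPLE_RATE) -> float:
    """Number of mel frames produced per second of audio."""
    return sr / hop


def mel_shape(duration_s: float) -> tuple[int, int]:
    """Approximate (mel bins, frames) shape, assuming one frame per hop."""
    num_samples = int(round(duration_s * SAMPLE_RATE))
    return (N_MELS, num_samples // HOP_LENGTH)


print(window_ms())          # 25.0 ms window
print(frames_per_second())  # 100.0 frames per second, i.e. a 10 ms hop
print(mel_shape(10.0))      # (128, 1000) for a 10-second clip
```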

## Training

The model was trained on public English ASR corpora and fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits.

## Limitations

The model is designed for English ASR. It may perform worse on non-English speech, heavy accents, noisy recordings, overlapping speakers, far-field audio, domain-specific terminology, or audio conditions that differ significantly from the training and evaluation data. The output should be manually reviewed before use in high-stakes settings.

## Citation

```bibtex
@misc{musci_asr_2025,
  title        = {{Musci-ASR-2.4B}},
  author       = {{Musci Research}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Musci-research/Musci-ASR-2.4B}}
}
```

## License

This model is released under the Apache-2.0 license.