---
language: en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - automatic-speech-recognition
  - speech-to-text
  - asr
  - speech
  - english
  - qwen3
  - audio
  - reinforcement-learning
datasets:
  - openslr/librispeech_asr
  - speechcolab/gigaspeech
  - mozilla-foundation/common_voice_17_0
  - facebook/voxpopuli
  - LIUM/tedlium
  - edinburghcstr/ami
  - anton-l/earnings22
  - kensho/spgispeech
metrics:
  - wer
model-index:
  - name: Musci-ASR-2.4B
    results:
      - task:
          type: automatic-speech-recognition
        dataset:
          name: Open ASR Leaderboard
          type: hf-audio/esb-datasets-test-only-sorted
        metrics:
          - type: wer
            value: 5.44
            name: Average WER
license: apache-2.0
---

# Musci-ASR-2.4B

Musci-ASR-2.4B is an English speech-to-text model that pairs a Qwen3-1.7B-base language-model backbone with a Qwen3-Omni-MoE audio encoder. A gated-MLP adapter projects audio features into the language-model embedding space. The model was trained on public English ASR corpora and then fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits.

The model has approximately 2.4B parameters and is distributed as a single `bfloat16` safetensors shard of approximately 4.84 GB.
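The parameter count and checkpoint size quoted above are consistent with each other, since `bfloat16` stores each value in 2 bytes. The following is a rough sanity check only: it assumes "4.84 GB" means decimal gigabytes and ignores the small safetensors header, and the helper name `estimate_param_count` is illustrative, not part of the model's code.

```python
# Rough consistency check between checkpoint size and parameter count.
BYTES_PER_BF16_PARAM = 2  # bfloat16 uses 16 bits per value


def estimate_param_count(checkpoint_bytes: float,
                         bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Estimate parameter count from file size, ignoring header overhead."""
    return checkpoint_bytes / bytes_per_param


# Assumption: 1 GB = 1e9 bytes.
approx_params = estimate_param_count(4.84e9)
print(f"{approx_params / 1e9:.2f}B parameters")  # prints "2.42B parameters"
```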


## Model Details

- **Developed by:** Musci Research
- **Model type:** Automatic Speech Recognition / speech-to-text model
- **Language:** English
- **License:** Apache-2.0
- **Library:** Transformers
- **Backbone:** Qwen3-1.7B-base, 28 layers, hidden size 2048
- **Audio encoder:** Qwen3-Omni-MoE audio encoder
- **Adapter:** Gated-MLP adapter, hidden size 8192
- **Parameter size:** approximately 2.4B
- **Checkpoint format:** `bfloat16` safetensors

## Intended Use

This model is intended for English automatic speech recognition: transcribing English speech audio for research and evaluation purposes.
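For evaluation, this card reports word error rate (WER). The sketch below shows how WER can be computed for a single reference/hypothesis pair using word-level edit distance. It is a minimal illustration, not the leaderboard's evaluation code: the function name `word_error_rate` is hypothetical, no text normalization is applied, and the result is a fraction rather than a percentage, so its numbers are not directly comparable to the score reported in the metadata.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One missing word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

In practice, corpus-level WER sums the edit counts and reference lengths over all utterances before dividing, rather than averaging per-utterance scores.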

## Inference

```python
import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

# Hub repository and target device (assumes a CUDA GPU is available).
REPO = "Musci-research/Musci-ASR-2.4B"
DEVICE = "cuda:0"

# Load the weights in bfloat16 and switch to eval mode.
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

# Fetch the custom processor classes shipped in the repository.
MusciProcessor = get_class_from_dynamic_module("processing_Musci.MusciProcessor", REPO)
MelConfig = get_class_from_dynamic_module("processing_Musci.MelConfig", REPO)

# Mel settings matching the Audio Frontend section.
mel_cfg = MelConfig(
    mel_sr=16000,
    mel_dim=128,
    mel_n_fft=400,
    mel_hop_length=160,
)
processor = MusciProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))

# Load the audio as 16 kHz mono and build the model inputs.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
# Cast the audio features to the model dtype (bfloat16).
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)

# Greedy decoding: no sampling, single beam.
with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Drop the prompt tokens and decode only the newly generated ones.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)
```

## Audio Frontend

- **Sample rate:** 16 kHz
- **Features:** Whisper log-mel filterbank
- **Mel bins:** 128
- **FFT size:** 400
- **Hop length:** 160
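These settings translate into a 25 ms analysis window and a 10 ms frame step. The short sketch below works through that arithmetic and estimates how many mel frames a clip produces. It assumes one frame per hop; the exact count depends on how the processor pads the signal, and the audio encoder may downsample further before the language model sees the features. The helper `approx_num_frames` is illustrative only.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between successive frames

window_ms = 1000 * N_FFT / SAMPLE_RATE    # 25.0 ms window
hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE  # 10.0 ms step


def approx_num_frames(duration_s: float) -> int:
    """Approximate mel frame count, assuming one frame per hop."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH


print(window_ms, hop_ms)        # 25.0 10.0
print(approx_num_frames(30.0))  # 3000
```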

## Training

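The model was trained on public English ASR corpora and then fine-tuned with reinforcement learning on the Open ASR Leaderboard training splits. The `datasets` field in this card's metadata lists the corpora: LibriSpeech, GigaSpeech, Common Voice 17.0, VoxPopuli, TED-LIUM, AMI, Earnings22, and SPGISpeech.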

## Limitations

The model is designed for English ASR. It may perform worse on non-English speech, heavy accents, noisy recordings, overlapping speakers, far-field audio, domain-specific terminology, or audio conditions that differ significantly from the training and evaluation data. The output should be manually reviewed before use in high-stakes settings.

## Citation

```bibtex
@misc{musci_asr_2025,
  title        = {{Musci-ASR-2.4B}},
  author       = {{Musci Research}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Musci-research/Musci-ASR-2.4B}}
}
```

## License

This model is released under the Apache-2.0 license.