---
language: en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech
- english
license: apache-2.0
---
# Musci-ASR-2.4B

An English speech-to-text model that pairs a Qwen3 language-model backbone with a
Qwen3-Omni-MoE audio encoder. It was trained on public English ASR corpora and then tuned
with reinforcement learning on the Open ASR Leaderboard training splits. The model has
about 2.4B parameters in total and is distributed as a single `bfloat16` safetensors
shard (~4.84 GB).
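
The parameter count and the shard size quoted above are consistent with each other, since `bfloat16` stores each weight in two bytes. The following is only a back-of-the-envelope sketch: it assumes the 4.84 GB figure is in decimal gigabytes and ignores the small safetensors header, and the helper name `approx_param_count` is made up for illustration.

```python
# Rough consistency check between shard size and parameter count.
BYTES_PER_BF16 = 2      # bfloat16 uses 16 bits per value
SHARD_SIZE_GB = 4.84    # size quoted in this card (assumed to be decimal GB)


def approx_param_count(size_gb: float, bytes_per_param: int = BYTES_PER_BF16) -> float:
    """Estimate the number of parameters stored in a checkpoint of `size_gb` gigabytes."""
    return size_gb * 1e9 / bytes_per_param


approx_params = approx_param_count(SHARD_SIZE_GB)
print(f"{approx_params / 1e9:.2f}B parameters")  # prints: 2.42B parameters
```

The same arithmetic gives a lower bound on the accelerator memory needed just to hold the weights; activations and the KV cache come on top of that.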
## Inference

```python
import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

REPO = "Musci-research/Musci-ASR-2.4B"
DEVICE = "cuda:0"

# Load the weights in bfloat16; the repo ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

# The processor classes live in the repo's remote code, so fetch them dynamically.
MusciProcessor = get_class_from_dynamic_module("processing_Musci.MusciProcessor", REPO)
MelConfig = get_class_from_dynamic_module("processing_Musci.MelConfig", REPO)
mel_cfg = MelConfig(mel_sr=16000, mel_dim=128, mel_n_fft=400, mel_hop_length=160)
processor = MusciProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))

# librosa loads the file as mono and resamples it to 16 kHz.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)

# Greedy decoding, stopping at the processor's end token.
with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Keep only the newly generated tokens (drop the prompt) before decoding.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)
```
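
The slicing step near the end of the snippet above is there because `generate` on decoder-only models typically returns the prompt ids followed by the continuation, so the prompt has to be cut off before decoding. A minimal sketch of the same idea with plain Python lists in place of tensors; the token ids and the `strip_prompt` helper are hypothetical and not part of the Musci code:

```python
def strip_prompt(output_ids, prompt_len):
    """Drop the first `prompt_len` ids from every sequence in the batch."""
    return [seq[prompt_len:] for seq in output_ids]


# Hypothetical ids: three prompt tokens followed by two generated tokens.
prompt_ids = [[101, 7, 8]]
output_ids = [[101, 7, 8, 42, 43]]

new_ids = strip_prompt(output_ids, prompt_len=len(prompt_ids[0]))
print(new_ids)  # [[42, 43]]
```

With tensors, `out_ids[:, prompt_len:]` performs the same cut along the sequence dimension for the whole batch at once.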
## Audio frontend

- Sample rate: **16 kHz**
- Features: Whisper log-mel filterbank with `n_mels=128`, `n_fft=400`, `hop_length=160`
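
The frontend parameters listed above translate into a 25 ms analysis window, a 10 ms hop, and about 100 feature frames per second of audio. The sketch below works through that arithmetic. It assumes one frame per hop (`num_samples // hop_length`), which is the convention of Whisper's reference frontend; the exact frame count produced by this model's processor may differ by a frame or two depending on padding. The function name `frontend_geometry` is invented for this example.

```python
SAMPLE_RATE = 16_000  # Hz
N_FFT = 400           # samples per analysis window
HOP_LENGTH = 160      # samples between consecutive frames
N_MELS = 128          # mel bins per frame


def frontend_geometry(num_samples: int) -> dict:
    """Return window/hop durations and the approximate feature shape for a clip."""
    window_ms = 1000 * N_FFT / SAMPLE_RATE
    hop_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
    # Assumption: one frame per hop, with no extra trailing frame.
    num_frames = num_samples // HOP_LENGTH
    return {
        "window_ms": window_ms,
        "hop_ms": hop_ms,
        "num_frames": num_frames,
        "feature_shape": (N_MELS, num_frames),
    }


# A 10-second clip at 16 kHz has 160,000 samples.
geometry = frontend_geometry(10 * SAMPLE_RATE)
print(geometry)
```

Under these assumptions a 10-second clip yields a feature matrix of shape `(128, 1000)`.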
## License

Apache-2.0.