---
language: en
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- asr
- speech
- english
license: apache-2.0
---

# Musci-ASR-2.4B

An English speech-to-text model that pairs a Qwen3 language-model backbone with a Qwen3-Omni-MoE audio encoder. It was trained on public English ASR corpora and then tuned with reinforcement learning on the Open ASR Leaderboard training splits.

The model has \~2.4B parameters in total and is distributed as a single `bfloat16` safetensors shard (\~4.84 GB).

## Inference

```python
import librosa
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.dynamic_module_utils import get_class_from_dynamic_module

REPO = "Musci-research/Musci-ASR-2.4B"
DEVICE = "cuda:0"

# Load the model in bfloat16, move it to the GPU, and switch to eval mode.
model = AutoModelForCausalLM.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(DEVICE).eval()
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

# The processor and mel config classes ship as remote code in the repo.
MusciProcessor = get_class_from_dynamic_module("processing_Musci.MusciProcessor", REPO)
MelConfig = get_class_from_dynamic_module("processing_Musci.MelConfig", REPO)

# Mel settings match the audio frontend described below.
mel_cfg = MelConfig(mel_sr=16000, mel_dim=128, mel_n_fft=400, mel_hop_length=160)
processor = MusciProcessor(tokenizer, config=mel_cfg, enable_time_marker=False)
processor.load_template(hf_hub_download(REPO, "chat_template_default.py"))

# Load the audio as mono, resampled to 16 kHz.
waveform, _ = librosa.load("your_audio.wav", sr=16000)
inputs = processor(audio=waveform, return_tensors="pt").to(DEVICE)
# Cast the audio features to the model dtype (bfloat16).
inputs["audio_data"] = inputs["audio_data"].to(model.dtype)

# Greedy decoding: no sampling, single beam.
with torch.no_grad():
    out_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,
        num_beams=1,
        use_cache=True,
        eos_token_id=[processor.end_token_id],
    )

# Drop the prompt tokens and decode only the newly generated ones.
new_ids = out_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(new_ids, skip_special_tokens=True)[0].strip()
print(transcript)
```

## Audio frontend

- Sample rate: **16 kHz**
- Features: Whisper log-mel filterbank with `n_mels=128`, `n_fft=400`, `hop_length=160`

## License

Apache-2.0.
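
## Frame timing

As a rough guide to what the audio frontend settings mean in time units, the sketch below converts the sample rate, `n_fft`, and `hop_length` listed under "Audio frontend" into window and hop durations. The helper `frame_stats` is hypothetical and not part of the repository, and the frame count is only an approximation: it assumes one frame per hop with no partial trailing frame, whereas the actual processor may pad or trim differently.

```python
SAMPLE_RATE = 16_000  # Hz, from the audio frontend section
N_FFT = 400           # STFT window length in samples
HOP_LENGTH = 160      # stride between consecutive frames in samples


def frame_stats(num_samples: int) -> dict:
    """Return window/hop durations and an approximate mel frame count.

    Hypothetical helper for illustration; not part of the model repo.
    """
    return {
        "window_ms": 1000 * N_FFT / SAMPLE_RATE,
        "hop_ms": 1000 * HOP_LENGTH / SAMPLE_RATE,
        "frames_per_second": SAMPLE_RATE / HOP_LENGTH,
        # Assumption: one frame per hop, no partial trailing frame.
        "approx_frames": num_samples // HOP_LENGTH,
    }


# Ten seconds of 16 kHz audio.
stats = frame_stats(10 * SAMPLE_RATE)
print(stats)
```

With these settings each mel frame covers a 25 ms window and frames are spaced 10 ms apart, so the frontend produces roughly 100 frames per second of audio before any downsampling inside the audio encoder.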