---
license: apache-2.0
language:
- en
- ko
library_name: transformers
tags:
- audio
- text-generation
pipeline_tag: audio-text-to-text
base_model:
- Qwen/Qwen3-4B
model-index:
- name: SYMPHONY-ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.56
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.45
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.91
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 4.43
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 6.30
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.39
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.29
new_version:
  okestro-ai-lab/SYMPHONY-ASR
---

🚀 **SYMPHONY-ASR** is an **Automatic Speech Recognition (ASR)-specialized model** designed for efficient and accurate speech-to-text transcription.

## 📌 Key Features

* ⚡ Efficient long-form speech processing
* 🧠 Adaptation of pre-trained LLMs to audio
* ✅ Evaluated on standard ASR benchmarks (WERs listed above)

## 🚀 Getting Started

### 1. Installation

First, install the required libraries.

```bash
sudo apt install ffmpeg
pip install torch==2.3.1 torchaudio==2.3.1 transformers==4.53.1 peft==0.14.0 accelerate==0.34.2 librosa==0.11.0 einops==0.8.1 openai-whisper soundfile
```

### 2. Load the Model and Tokenizer

You can easily load the model using `AutoModelForCausalLM.from_pretrained`. This model includes custom code, so the `trust_remote_code=True` option is required.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

repo_id = "okestro-ai-lab/SYMPHONY-ASR"

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
generation_config = GenerationConfig.from_pretrained(repo_id)

model.eval()
```

### 3. Sample Inference (Automatic Speech Recognition)

This example shows how to load an audio file and transcribe it to text.

```python
import torch
import librosa

# 1. Load the audio file at its native sampling rate
# ⬅️ Path to the audio file to be transcribed
wav_path = "sample_audio/English_audio.wav"
wav, sample_rate = librosa.load(wav_path, sr=None)

# SYMPHONY-ASR requires 16 kHz audio.
if sample_rate != 16000:
    audio = librosa.resample(wav, orig_sr=sample_rate, target_sr=16000)
else:
    audio = wav
# 2. Prepare the prompt and tokenize it
# Automatic Speech Recognition (ASR) task
# For additional tasks, refer to the Supported Tasks section below.
# A task token is not required, but it is recommended for more reliable task selection.
TASK_TOKEN = "<|ASR|>"
AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"
user_prompt = f"{TASK_TOKEN}{AUDIO_TOKEN}\nTranscribe the audio clip into text."
prompt = [{"role": "user", "content": user_prompt}]

input_ids = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors='pt'
).to(model.device)

# 3. Perform inference
# The model's generate function expects the audio input as a batch of waveforms.
audio_tensor = torch.tensor([audio], dtype=torch.float32).to(model.device)

with torch.no_grad():
    with torch.autocast("cuda", dtype=torch.bfloat16):
        output_ids = model.generate(
            input_ids=input_ids,
            audio=audio_tensor,
            generation_config=generation_config,
            max_new_tokens=256
        )

# 4. Decode the result
transcription = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print("--- Transcription Result ---")
print(transcription)
```

---

## 📌 Supported Tasks

You can perform different tasks by using the following special tokens in your prompt:

* `<|ASR|>`: **Automatic Speech Recognition** - Transcribes audio into text.
* `<|AST|>`: **Automatic Speech Translation** - Translates audio into text in another language.
* `<|SSUM|>`: **Speech Summarization** - Summarizes the content of an audio clip.
* `<|SQQA|>`: **Spoken Query-based Question Answering** - Answers questions based on the content of an audio clip.

---

## ⚡ GPU Requirements

SYMPHONY-ASR inference requires a GPU with sufficient memory.

| Task          | Recommended GPU    | Minimum VRAM |
| ------------- | ------------------ | ------------ |
| **Inference** | NVIDIA A100 / H100 | ≥ 11.8 GB    |

> 💡 Using mixed precision (`bfloat16` or `fp16`) is recommended to reduce memory usage.
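The task tokens listed under Supported Tasks all share the same prompt shape as the ASR example: a task token, the audio placeholder tokens, and a natural-language instruction. As a minimal, model-free sketch (the per-task instruction strings below are illustrative assumptions, not prompts required by the model), a small helper can assemble the chat message for any task:

```python
# Hypothetical helper that builds the chat message for each SYMPHONY task token.
# The instruction text per task is an illustrative assumption; any natural-language
# instruction can follow the task and audio tokens.
AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"

TASK_INSTRUCTIONS = {
    "<|ASR|>": "Transcribe the audio clip into text.",
    "<|AST|>": "Translate the audio clip into Korean text.",
    "<|SSUM|>": "Summarize the content of the audio clip.",
    "<|SQQA|>": "Answer the question asked in the audio clip.",
}

def build_prompt(task_token: str) -> list:
    """Return a chat message list ready for tokenizer.apply_chat_template."""
    instruction = TASK_INSTRUCTIONS[task_token]
    user_prompt = f"{task_token}{AUDIO_TOKEN}\n{instruction}"
    return [{"role": "user", "content": user_prompt}]

# Example: a speech-translation prompt
messages = build_prompt("<|AST|>")
print(messages[0]["content"].startswith("<|AST|>"))  # True
```

The returned message list can be passed to `tokenizer.apply_chat_template` exactly as in the ASR example above, with the corresponding audio tensor supplied to `model.generate`.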