---
license: apache-2.0
language:
- en
- ko
library_name: transformers
tags:
- audio
- text-generation
pipeline_tag: audio-text-to-text
base_model:
- Qwen/Qwen3-4B
model-index:
- name: SYMPHONY-ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.56
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.45
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.91
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 4.43
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 6.30
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.39
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.29
new_version:
  okestro-ai-lab/SYMPHONY-ASR
---

🚀 **SYMPHONY-ASR** is an **Automatic Speech Recognition (ASR)-specialized model** designed for efficient and accurate speech-to-text transcription.

## 📌 Key Features

* ⚡ Efficient long-form speech processing
* 🧠 Adaptation of pre-trained LLMs to audio
* ✅ Evaluated on standard ASR benchmarks (WERs listed above)

## 🚀 Getting Started

### 1. Installation

First, install the required libraries.

```bash
sudo apt install ffmpeg
pip install torch==2.3.1 torchaudio==2.3.1 transformers==4.53.1 peft==0.14.0 accelerate==0.34.2 librosa==0.11.0 einops==0.8.1 openai-whisper soundfile
```

### 2. Load the Model and Tokenizer

You can easily load the model using `AutoModelForCausalLM.from_pretrained`. This model includes custom code, so the `trust_remote_code=True` option is required.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

repo_id = "okestro-ai-lab/SYMPHONY-ASR"

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
generation_config = GenerationConfig.from_pretrained(repo_id)

model.eval()
```

### 3. Sample Inference (Automatic Speech Recognition)

This example shows how to load an audio file and transcribe it to text.

```python
import torch
import librosa

# 1. Load the audio file at its native sampling rate
# ⬅️ Path to the audio file to be transcribed
wav_path = "sample_audio/English_audio.wav"
wav, sample_rate = librosa.load(wav_path, sr=None)

# SYMPHONY-ASR requires 16 kHz audio.
if sample_rate != 16000:
    audio = librosa.resample(wav, orig_sr=sample_rate, target_sr=16000)
else:
    audio = wav
# 2. Prepare the prompt and tokenize it
# Automatic Speech Recognition (ASR) task
# For additional tasks, refer to the Supported Tasks section below.
# A task token is not required, but it is recommended for more reliable task selection.
TASK_TOKEN = "<|ASR|>"
AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"
user_prompt = f"{TASK_TOKEN}{AUDIO_TOKEN}\nTranscribe the audio clip into text."
prompt = [{"role": "user", "content": user_prompt}]

input_ids = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors='pt'
).to(model.device)

# 3. Perform inference
# The model's generate function expects the audio input as a batch of waveforms.
audio_tensor = torch.tensor([audio], dtype=torch.float32).to(model.device)

with torch.no_grad():
    with torch.autocast("cuda", dtype=torch.bfloat16):
        output_ids = model.generate(
            input_ids=input_ids,
            audio=audio_tensor,
            generation_config=generation_config,
            max_new_tokens=256
        )

# 4. Decode the result
transcription = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print("--- Transcription Result ---")
print(transcription)
```

---

## 📌 Supported Tasks

You can perform different tasks by using the following special tokens in your prompt:

* `<|ASR|>`: **Automatic Speech Recognition** - Transcribes audio into text.
* `<|AST|>`: **Automatic Speech Translation** - Translates audio into text in another language.
* `<|SSUM|>`: **Speech Summarization** - Summarizes the content of an audio clip.
* `<|SQQA|>`: **Spoken Query-based Question Answering** - Answers questions based on the content of an audio clip.

---

## ⚡ GPU Requirements

SYMPHONY-ASR inference requires a GPU with sufficient memory.

| Task          | Recommended GPU    | Minimum VRAM |
| ------------- | ------------------ | ------------ |
| **Inference** | NVIDIA A100 / H100 | ≥ 11.8 GB    |

> 💡 Using mixed precision (`bfloat16` or `fp16`) is recommended to reduce memory usage.
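The task tokens listed under Supported Tasks all share the same prompt shape as the ASR example: a task token, the audio placeholder tokens, and a natural-language instruction. As a minimal, model-free sketch (the per-task instruction strings below are illustrative assumptions, not prompts required by the model), a small helper can assemble the chat message for any task:

```python
# Hypothetical helper that builds the chat message for each SYMPHONY task token.
# The instruction text per task is an illustrative assumption; any natural-language
# instruction can follow the task and audio tokens.
AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"

TASK_INSTRUCTIONS = {
    "<|ASR|>": "Transcribe the audio clip into text.",
    "<|AST|>": "Translate the audio clip into Korean text.",
    "<|SSUM|>": "Summarize the content of the audio clip.",
    "<|SQQA|>": "Answer the question asked in the audio clip.",
}

def build_prompt(task_token: str) -> list:
    """Return a chat message list ready for tokenizer.apply_chat_template."""
    instruction = TASK_INSTRUCTIONS[task_token]
    user_prompt = f"{task_token}{AUDIO_TOKEN}\n{instruction}"
    return [{"role": "user", "content": user_prompt}]

# Example: a speech-translation prompt
messages = build_prompt("<|AST|>")
print(messages[0]["content"].startswith("<|AST|>"))  # True
```

The returned message list can be passed to `tokenizer.apply_chat_template` exactly as in the ASR example above, with the corresponding audio tensor supplied to `model.generate`.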