---
license: apache-2.0
language:
- en
- ko
library_name: transformers
tags:
- audio
- text-generation
pipeline_tag: audio-text-to-text
base_model:
- Qwen/Qwen3-4B
model-index:
- name: SYMPHONY-ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: AMI (Meetings test)
      type: edinburghcstr/ami
      config: ihm
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.56
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Earnings-22
      type: revdotcom/earnings22
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.45
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 9.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 1.91
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 4.43
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 6.3
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 3.39
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: SPGI Speech
      type: kensho/spgispeech
      config: test
      split: test
      args:
        language: en
    metrics:
    - name: Test WER
      type: wer
      value: 2.29
new_version: okestro-ai-lab/SYMPHONY-ASR
---
SYMPHONY-ASR is an Automatic Speech Recognition (ASR)-specialized model designed for efficient and accurate speech-to-text transcription.

## Model Architecture

SYMPHONY-ASR adapts a pre-trained LLM (Qwen/Qwen3-4B) to the audio modality for efficient speech-to-text transcription.

## Key Features

- Efficient long-form speech processing
- Adaptation of pre-trained LLMs to audio
- Evaluation results on standard ASR benchmarks (WERs listed above)
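To illustrate the long-form point: audio longer than the model can take in one pass is typically split into fixed-length 16 kHz segments and transcribed segment by segment. The helper below is a minimal, hypothetical pre-processing sketch, not part of the SYMPHONY-ASR API:

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000, chunk_seconds: float = 30.0):
    """Split a mono waveform into fixed-length chunks (the last chunk may be shorter)."""
    chunk_len = int(sr * chunk_seconds)
    return [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]

# 95 seconds of audio at 16 kHz -> four chunks of 30 s, 30 s, 30 s, and 5 s
chunks = chunk_audio(np.zeros(95 * 16000))
print([len(c) / 16000 for c in chunks])  # -> [30.0, 30.0, 30.0, 5.0]
```

Each chunk can then be passed to the inference loop shown in "Sample Inference" below, with the transcriptions concatenated afterwards.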
## Getting Started

### 1. Installation

First, install the required libraries.

```bash
sudo apt install ffmpeg
pip install torch==2.3.1 peft==0.14.0 librosa==0.11.0 transformers==4.53.1 \
    accelerate==0.34.2 einops==0.8.1 torchaudio==2.3.1 openai-whisper soundfile
```
### 2. Load Model and Tokenizer

You can load the model with `AutoModelForCausalLM.from_pretrained`. Because this model includes custom code, the `trust_remote_code=True` option is required.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# Enter your Hugging Face repository ID here.
repo_id = "okestro-ai-lab/SYMPHONY-ASR"

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
generation_config = GenerationConfig.from_pretrained(repo_id)
model.eval()
```
### 3. Sample Inference (Automatic Speech Recognition)

This example shows how to load an audio file and transcribe it to text.

```python
import torch
import librosa

# 1. Load and resample the audio file
# Path to the audio file to be transcribed
wav_path = "sample_audio/English_audio.wav"
wav, sample_rate = librosa.load(wav_path, sr=None)  # sr=None keeps the native sample rate

# SYMPHONY-ASR requires 16 kHz audio.
if sample_rate != 16000:
    audio = librosa.resample(wav, orig_sr=sample_rate, target_sr=16000)
else:
    audio = wav

# 2. Prepare and tokenize the prompt
# Automatic Speech Recognition (ASR) task.
# For additional tasks, refer to "Supported Tasks" below.
# A task token is not required, but it is recommended for better task adherence.
TASK_TOKEN = "<|ASR|>"
AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"
user_prompt = f"{TASK_TOKEN}{AUDIO_TOKEN}\nTranscribe the audio clip into text."
prompt = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
).to(model.device)

# 3. Perform inference
# The model's generate function expects the audio input as a batch (here, batch size 1).
audio_tensor = torch.tensor(audio, dtype=torch.float32).unsqueeze(0).to(model.device)

with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        output_ids = model.generate(
            input_ids=input_ids,
            audio=audio_tensor,
            generation_config=generation_config,
            max_new_tokens=256
        )

# 4. Decode the result
transcription = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print("--- Transcription Result ---")
print(transcription)
```
## Supported Tasks

You can perform different tasks by using the following special tokens in your prompt:

- `<|ASR|>`: Automatic Speech Recognition, which transcribes audio into text.
- `<|AST|>`: Automatic Speech Translation, which translates audio into text in another language.
- `<|SSUM|>`: Speech Summarization, which summarizes the content of an audio clip.
- `<|SQQA|>`: Spoken Query-based Question Answering, which answers questions based on the content of an audio clip.
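These task tokens slot into the same prompt pattern used in the ASR example above. The sketch below shows how only the task token and instruction change between tasks; the `build_prompt` helper and the instruction sentences are illustrative, not a documented API:

```python
AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"

def build_prompt(task_token: str, instruction: str) -> list:
    """Compose a chat-style user turn: task token, audio placeholder, instruction."""
    return [{"role": "user", "content": f"{task_token}{AUDIO_TOKEN}\n{instruction}"}]

# Same structure as the ASR example; only the task token and instruction change.
asr = build_prompt("<|ASR|>", "Transcribe the audio clip into text.")
ssum = build_prompt("<|SSUM|>", "Summarize the audio clip.")
print(asr[0]["content"].split("\n")[0])
# -> <|ASR|><|audio_bos|><|AUDIO|><|audio_eos|>
```

The resulting list is passed to `tokenizer.apply_chat_template` exactly as in the sample inference code.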
## GPU Requirements

SYMPHONY-ASR inference requires a GPU with sufficient memory.

| Task | Recommended GPU | Minimum VRAM |
|---|---|---|
| Inference | NVIDIA A100 / H100 | ≥ 11.8 GB |

Tip: using mixed precision (`bfloat16` or `fp16`) is recommended to reduce memory usage.
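As a back-of-the-envelope check on that figure (an estimate, not taken from the model card): a 4B-parameter model stored in `bfloat16` needs roughly 7.5 GiB for the weights alone, so the 11.8 GB minimum leaves headroom for activations and the KV cache.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory footprint of the model weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3

bf16 = weight_memory_gib(4e9, 2)  # bfloat16 / fp16: 2 bytes per parameter
fp32 = weight_memory_gib(4e9, 4)  # fp32: 4 bytes per parameter
print(f"bf16 weights: {bf16:.1f} GiB, fp32 weights: {fp32:.1f} GiB")
# -> bf16 weights: 7.5 GiB, fp32 weights: 14.9 GiB
```

This also shows why full-precision (`fp32`) inference would roughly double the weight footprint and exceed the stated minimum on its own.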