---
license: apache-2.0
language:
  - en
  - ko
library_name: transformers
tags:
  - audio
  - text-generation
pipeline_tag: audio-text-to-text
base_model:
  - Qwen/Qwen3-4B
model-index:
  - name: SYMPHONY-ASR
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: AMI (Meetings test)
          type: edinburghcstr/ami
          config: ihm
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 9.56
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Earnings-22
          type: revdotcom/earnings22
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 9.45
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: GigaSpeech
          type: speechcolab/gigaspeech
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 9.96
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (clean)
          type: librispeech_asr
          config: clean
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 1.91
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: LibriSpeech (other)
          type: librispeech_asr
          config: other
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 4.43
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Vox Populi
          type: facebook/voxpopuli
          config: en
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 6.3
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: tedlium-v3
          type: LIUM/tedlium
          config: release1
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 3.39
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: SPGI Speech
          type: kensho/spgispeech
          config: test
          split: test
          args:
            language: en
        metrics:
          - name: Test WER
            type: wer
            value: 2.29
new_version: okestro-ai-lab/SYMPHONY-ASR
---

# SYMPHONY-ASR

SYMPHONY-ASR is an Automatic Speech Recognition (ASR)-specialized model designed for efficient and accurate speech-to-text transcription.

## 📖 Model Architecture

🚀 SYMPHONY-ASR is an Automatic Speech Recognition (ASR)-specialized model designed for efficient speech-to-text transcription.

## 📌 Key Features

- ⚡ Efficient long-form speech processing
- 🧠 Adaptation of pre-trained LLMs to audio
- ✅ Evaluated on standard ASR benchmarks (WERs listed in the metadata above)
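The benchmark figures above are word error rates (WER): the word-level edit distance between the reference and the hypothesis transcript, divided by the number of reference words. A minimal, self-contained sketch of the metric (an illustrative helper, not the scoring script used for these benchmarks):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits over 6 reference words
```

In practice, evaluation pipelines also apply text normalization (casing, punctuation) before computing WER, which this sketch omits.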

## 🚀 Getting Started

### 1. Installation

First, install the required system and Python dependencies:

```bash
sudo apt install ffmpeg
pip install "torch==2.3.1" "torchaudio==2.3.1" "peft==0.14.0" \
    "librosa==0.11.0" "transformers==4.53.1" "accelerate==0.34.2" \
    "einops==0.8.1" openai-whisper soundfile
```

### 2. Load the Model and Tokenizer

You can load the model with `AutoModelForCausalLM.from_pretrained`. Because the repository ships custom modeling code, the `trust_remote_code=True` option is required.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

# ⬅️ Enter your Hugging Face repository ID here.
repo_id = "okestro-ai-lab/SYMPHONY-ASR"

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
generation_config = GenerationConfig.from_pretrained(repo_id)

model.eval()
```

### 3. Sample Inference (Automatic Speech Recognition)

This example loads an audio file and transcribes it to text.

```python
import torch
import librosa

# 1. Load and resample the audio file
# ⬅️ Path to the audio file to be transcribed
wav_path = "sample_audio/English_audio.wav"
wav, sample_rate = librosa.load(wav_path, sr=None)  # keep the native sample rate

# SYMPHONY-ASR requires 16 kHz audio.
if sample_rate != 16000:
    audio = librosa.resample(wav, orig_sr=sample_rate, target_sr=16000)
else:
    audio = wav

# 2. Prepare and tokenize the prompt
# Automatic Speech Recognition (ASR) task
# Additional tasks: please refer to "Supported Tasks" below
# A task token is not required, but it is recommended for more reliable task selection.
TASK_TOKEN = "<|ASR|>"
AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"
user_prompt = f"{TASK_TOKEN}{AUDIO_TOKEN}\nTranscribe the audio clip into text."

prompt = [{"role": "user", "content": user_prompt}]
input_ids = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
).to(model.device)

# 3. Perform inference
# The model's generate function expects a batched audio tensor of shape [1, num_samples].
audio_tensor = torch.tensor(audio, dtype=torch.float32).unsqueeze(0).to(model.device)

with torch.no_grad():
    with torch.autocast("cuda", dtype=torch.bfloat16):
        output_ids = model.generate(
            input_ids=input_ids,
            audio=audio_tensor,
            generation_config=generation_config,
            max_new_tokens=256
        )

# 4. Decode the result
transcription = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

print("--- Transcription Result ---")
print(transcription)
```
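For recordings much longer than a single context window, long-form processing is typically handled by splitting the waveform into fixed-length windows before transcription. A minimal chunking sketch; the 30-second window and 1-second overlap are illustrative assumptions, not settings documented for this model:

```python
import numpy as np

def chunk_audio(audio: np.ndarray, sr: int = 16000,
                chunk_s: float = 30.0, overlap_s: float = 1.0) -> list:
    """Split a mono waveform into overlapping fixed-length windows."""
    chunk = int(chunk_s * sr)
    step = chunk - int(overlap_s * sr)
    chunks = []
    for start in range(0, len(audio), step):
        chunks.append(audio[start:start + chunk])
        if start + chunk >= len(audio):  # last window already covers the tail
            break
    return chunks
```

Each chunk can then be passed to `model.generate` exactly as in the example above, and the per-chunk transcripts concatenated.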

## 📌 Supported Tasks

You can perform different tasks by using the following special tokens in your prompt:

- `<|ASR|>`: Automatic Speech Recognition. Transcribes audio into text.
- `<|AST|>`: Automatic Speech Translation. Translates audio into text of another language.
- `<|SSUM|>`: Speech Summarization. Summarizes the content of an audio clip.
- `<|SQQA|>`: Spoken Query-based Question Answering. Answers questions based on the content of an audio clip.
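All four tasks reuse the prompt format from the inference example: task token, then the audio placeholder, then an instruction. A small helper that composes the user turn; the per-task instruction strings here are illustrative assumptions, only the task tokens themselves come from this model card:

```python
AUDIO_TOKEN = "<|audio_bos|><|AUDIO|><|audio_eos|>"

# Hypothetical instruction wording per task, for illustration only.
TASK_INSTRUCTIONS = {
    "<|ASR|>": "Transcribe the audio clip into text.",
    "<|AST|>": "Translate the audio clip into Korean.",
    "<|SSUM|>": "Summarize the content of the audio clip.",
    "<|SQQA|>": "Answer the question asked in the audio clip.",
}

def build_user_prompt(task_token: str) -> str:
    """Compose the user turn: task token, audio placeholder, then instruction."""
    return f"{task_token}{AUDIO_TOKEN}\n{TASK_INSTRUCTIONS[task_token]}"

print(build_user_prompt("<|ASR|>"))
```

The returned string is what goes into the `content` field of the user message before `tokenizer.apply_chat_template`.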

## ⚡ GPU Requirements

SYMPHONY-ASR inference requires a GPU with sufficient memory.

| Task      | Recommended GPU    | Minimum VRAM |
|-----------|--------------------|--------------|
| Inference | NVIDIA A100 / H100 | ≥ 11.8 GB    |

💡 Using mixed precision (bfloat16 or fp16) is recommended to reduce memory usage.
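The quoted minimum is plausible for the Qwen3-4B backbone: roughly 4B parameters at 2 bytes each in bf16/fp16, plus activations, KV cache, and the audio encoder. A back-of-envelope estimate; the parameter count and the overhead allowance are assumptions, and actual usage varies with sequence length:

```python
def vram_estimate_gb(params_billion: float, bytes_per_param: int = 2,
                     overhead_gb: float = 3.0) -> float:
    """Weights in bf16/fp16 plus a rough allowance for activations and KV cache."""
    weights_gb = params_billion * 1e9 * bytes_per_param / 1024**3
    return weights_gb + overhead_gb

print(round(vram_estimate_gb(4.0), 1))  # ~7.5 GB of weights plus overhead
```

This lands in the same ballpark as the ≥ 11.8 GB figure above once longer inputs and framework overhead are accounted for.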