Yoruba ASR — Qwen2-Audio-7B Fine-tuned

A LoRA fine-tune of Qwen2-Audio-7B-Instruct for Yoruba automatic speech recognition (ASR), trained on the AfricanVoices corpus.

Model Details

Model Description

  • Developed by: Similoluwa Ola-obaado
  • Model type: Audio-language multimodal Transformer (Qwen2-Audio)
  • Language(s): Yoruba (yo)
  • License: Apache 2.0
  • Finetuned from: Qwen/Qwen2-Audio-7B-Instruct

Uses

Direct Use

  • Yoruba speech-to-text transcription
  • Low-resource ASR research

Out-of-Scope Use

  • Non-Yoruba languages
  • Real-time streaming ASR

How to Get Started with the Model

import librosa
import torch
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

model_id = "Simih/yoruba_asr_audio_llm"

# Load the model in bf16 and move it to the GPU.
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, sampling_rate=16000)

# Load the recording and resample it to the rate the feature extractor expects.
audio, sr = librosa.load("sample.flac", sr=processor.feature_extractor.sampling_rate)

# A single user turn: an audio placeholder followed by the instruction.
conversation = [
    {"role": "user", "content": [
        {"type": "audio"},
        {"type": "text", "text": "Transcribe the audio"},
    ]}
]

# Render the chat prompt as text, then build model inputs from text and audio.
# Note: older transformers releases name the audio keyword `audios` instead of `audio`.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], audio=[audio], return_tensors="pt").to("cuda")

# Generate, then strip the prompt tokens so only the transcription is decoded.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
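
For recordings longer than a single utterance, it may help to transcribe in pieces. Qwen2-Audio uses a Whisper-style feature extractor that, to my understanding, works on windows of about 30 seconds, so longer inputs are likely to be truncated. The sketch below is an assumption-laden helper, not part of this model's code: `split_into_chunks` is a hypothetical name, and the 30-second default is an assumption you should check against `processor.feature_extractor`.

```python
def split_into_chunks(samples, sampling_rate=16000, chunk_seconds=30.0):
    """Yield consecutive, non-overlapping slices of at most `chunk_seconds`.

    `samples` can be any sliceable sequence, e.g. the NumPy array
    returned by librosa.load. The 30 s default is an assumption.
    """
    chunk_len = int(sampling_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]
```

Each chunk can then be passed through the same `processor` and `model.generate` steps as above and the transcriptions joined with spaces. Fixed-length cuts can split a word in half, so cutting at silences would probably give cleaner results.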

Training Details

Training Data

Yoruba speech data from the AfricanVoices corpus, consisting of paired audio-transcription samples across multiple speakers and domains.

Training Hyperparameters

  • Training regime: bf16 mixed precision
  • Learning rate: 2e-5
  • Epochs: 1
  • Batch size: 8
  • LR scheduler: Cosine
  • LoRA rank: 32, alpha: 64, RSLoRA: true
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
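
To make the rank, alpha, and RSLoRA settings above concrete, the short sketch below compares the scaling factor applied to the low-rank update under classic LoRA (alpha / r) and under rank-stabilized LoRA (alpha / sqrt(r)). The variable names are illustrative. If the adapter was trained with the PEFT library (an assumption, the card does not say), these values would correspond to `LoraConfig(r=32, lora_alpha=64, use_rslora=True, ...)`.

```python
import math

# Values taken from the hyperparameter list above.
lora_rank = 32
lora_alpha = 64

# Classic LoRA scales the low-rank update by alpha / r.
standard_scale = lora_alpha / lora_rank

# Rank-stabilized LoRA (RSLoRA) scales by alpha / sqrt(r) instead,
# which keeps the update magnitude from shrinking as the rank grows.
rslora_scale = lora_alpha / math.sqrt(lora_rank)

print(f"standard LoRA scale: {standard_scale:.2f}")
print(f"RSLoRA scale:        {rslora_scale:.2f}")
```

With rank 32 and alpha 64, the RSLoRA factor is several times larger than the classic one, which is worth keeping in mind when comparing this learning rate with non-RSLoRA setups.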

Evaluation

Results

Metric    Score
WER       44.1%

Evaluated on the AfricanVoices Yoruba test set with Yoruba text normalization (lowercasing, punctuation removal, Unicode NFC normalization).
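
The normalization and scoring described above can be sketched roughly as follows. This is a minimal illustration, not the evaluation script used for the reported number: `normalize_yoruba` and `word_error_rate` are hypothetical names, the definition of "punctuation" (Unicode categories starting with "P") is an assumption, and WER is computed with a plain word-level edit distance. Tone marks and under-dots are combining marks rather than punctuation, so they survive the filter.

```python
import unicodedata


def normalize_yoruba(text: str) -> str:
    """NFC-normalize, lowercase, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Assumption: "punctuation" means Unicode categories starting with "P".
    # Combining diacritics (category "Mn") are kept.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize_yoruba(reference).split()
    hyp = normalize_yoruba(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    # Levenshtein distance over words, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

A corpus-level score would normally sum the edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates. On that scale, the reported 44.1% corresponds to a value of 0.441.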
