Yoruba ASR — Qwen2-Audio-7B Fine-tuned

A LoRA fine-tune of Qwen2-Audio-7B-Instruct for Yoruba automatic speech recognition (ASR), trained on the Yoruba data in the AfricanVoices corpus.

Model Details

Model Description

Developed by: Similoluwa Ola-obaado
Model type: Audio-language multimodal Transformer (Qwen2-Audio)
Language(s): Yoruba (yo)
License: Apache 2.0
Finetuned from: Qwen/Qwen2-Audio-7B-Instruct

Model Sources

Repository: https://github.com/simi-I/yoruba_asr_audio_llm

Uses

Direct Use

Yoruba speech-to-text transcription
Low-resource ASR research

Out-of-Scope Use

Non-Yoruba languages
Real-time streaming ASR

How to Get Started with the Model

import librosa
import torch
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

model_id = "Simih/yoruba_asr_audio_llm"

# Load the fine-tuned model in bfloat16 and move it to the GPU.
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, sampling_rate=16000)

# Load the clip and resample it to the rate the feature extractor expects.
audio, sr = librosa.load("sample.flac", sr=processor.feature_extractor.sampling_rate)

# Single-turn prompt: one audio placeholder followed by the instruction.
conversation = [
    {"role": "user", "content": [
        {"type": "audio"},
        {"type": "text", "text": "Transcribe the audio"},
    ]}
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# Pass the sampling rate explicitly so the feature extractor can validate it.
inputs = processor(text=[text], audio=[audio], sampling_rate=sr, return_tensors="pt").to("cuda")

# Generate, then drop the prompt tokens so only the transcription is decoded.
generate_ids = model.generate(**inputs, max_new_tokens=256)
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])

Training Details

Training Data

Yoruba speech data from the AfricanVoices corpus: paired audio and transcription samples covering multiple speakers and domains.

Training Hyperparameters

Training regime: bf16 mixed precision
Learning rate: 2e-5
Epochs: 1
Batch size: 8
LR scheduler: Cosine
LoRA rank: 32, alpha: 64, RSLoRA: true
Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
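
To make the LoRA settings above concrete, the following minimal sketch computes the scaling factor applied to the low-rank update with and without rank stabilization. It assumes the usual definitions (standard LoRA scales by alpha / r, rsLoRA by alpha / sqrt(r)); the helper names and the 4096 x 4096 layer shape are hypothetical and chosen only for illustration, not taken from the training code.

```python
import math

# Values taken from the hyperparameter list above.
LORA_RANK = 32
LORA_ALPHA = 64


def lora_scaling(alpha: float, rank: int, rank_stabilized: bool) -> float:
    """Return the factor that multiplies the low-rank update B @ A."""
    if rank_stabilized:
        # rsLoRA divides by sqrt(rank) instead of rank.
        return alpha / math.sqrt(rank)
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added to one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


standard = lora_scaling(LORA_ALPHA, LORA_RANK, rank_stabilized=False)
stabilized = lora_scaling(LORA_ALPHA, LORA_RANK, rank_stabilized=True)
print(f"standard LoRA scaling: {standard:.2f}")
print(f"rsLoRA scaling:        {stabilized:.2f}")

# Hypothetical 4096 x 4096 projection, used only to illustrate the formula.
print("added parameters:", lora_param_count(4096, 4096, LORA_RANK))
```

With rank 32 and alpha 64, the rank-stabilized factor is roughly 11.31 rather than 2.0, so the adapter update is weighted more heavily than the same alpha would imply under standard LoRA. In Hugging Face PEFT this setting corresponds to `use_rslora=True` on `LoraConfig`.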

Evaluation

Results

Word error rate (WER): 44.1%

Evaluated on the AfricanVoices Yoruba test set after Yoruba text normalization (lowercasing, punctuation removal, and Unicode NFC normalization).
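
The normalization steps and the metric can be sketched as follows. This is an illustrative approximation using only the Python standard library, not the evaluation script behind the reported score: the function names are hypothetical, punctuation is assumed to mean any character in a Unicode "P" category, and WER is computed per utterance with a word-level edit distance.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, apply NFC, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text.lower())
    # Combining tone marks are category "Mn", so Yoruba diacritics are kept.
    kept = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize_text(reference).split()
    hyp = normalize_text(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Decomposed input (base letter + combining acute) becomes precomposed NFC.
print(normalize_text("Ba\u0301wo ni, Ade\u0301?"))
print(word_error_rate("a b c d", "a x c"))
```

Applying NFC before comparison matters for Yoruba because the same tone-marked vowel can be encoded either as one precomposed code point or as a base letter plus a combining mark; without normalization, visually identical words would count as errors.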
