Yoruba ASR — Qwen2-Audio-7B Fine-tuned
This model is a LoRA fine-tune of Qwen2-Audio-7B-Instruct for Yoruba automatic speech recognition (ASR), trained on the AfricanVoices corpus.
Model Details
Model Description
- Developed by: Similoluwa Ola-obaado
- Model type: Audio-language multimodal Transformer (Qwen2-Audio)
- Language(s): Yoruba (yo)
- License: Apache 2.0
- Finetuned from: Qwen/Qwen2-Audio-7B-Instruct
Model Sources
- Repository: https://github.com/simi-I/yoruba_asr_audio_llm
Uses
Direct Use
- Yoruba speech-to-text transcription
- Low-resource ASR research
Out-of-Scope Use
- Non-Yoruba languages
- Real-time streaming ASR
How to Get Started with the Model
import librosa
import torch
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor

model_id = "Simih/yoruba_asr_audio_llm"

# Load the fine-tuned model in bfloat16 and move it to the GPU.
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, sampling_rate=16000)

# Load the audio, resampled to the rate the feature extractor expects.
audio, sr = librosa.load("sample.flac", sr=processor.feature_extractor.sampling_rate)

# One user turn: an audio placeholder followed by the transcription instruction.
conversation = [
    {"role": "user", "content": [
        {"type": "audio"},
        {"type": "text", "text": "Transcribe the audio"},
    ]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], audio=[audio], return_tensors="pt").to("cuda")

generate_ids = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens so that only the generated transcription is decoded.
generate_ids = generate_ids[:, inputs.input_ids.size(1):]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
Training Details
Training Data
The model was trained on the Yoruba speech data in the AfricanVoices corpus, which consists of paired audio and transcription samples across multiple speakers and domains.
Training Hyperparameters
- Training regime: bf16 mixed precision
- Learning rate: 2e-5
- Epochs: 1
- Batch size: 8
- LR scheduler: Cosine
- LoRA rank: 32, alpha: 64, RSLoRA: true
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
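For readers unfamiliar with RSLoRA, the sketch below shows what the rank and alpha settings above imply for the adapter scaling factor. The dictionary keys mirror the argument names of PEFT's `LoraConfig`. This is an assumption, since the card does not state which library was used for training, and `lora_settings` and `lora_scaling` are hypothetical names introduced only for illustration. Standard LoRA scales the low-rank update by alpha / r, while rank-stabilized LoRA uses alpha / sqrt(r).

```python
import math

# Values copied from the hyperparameter list above. The key names follow
# PEFT's LoraConfig arguments, which is an assumption about the training setup.
lora_settings = {
    "r": 32,
    "lora_alpha": 64,
    "use_rslora": True,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}


def lora_scaling(r: int, lora_alpha: float, use_rslora: bool) -> float:
    """Return the multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        # Rank-stabilized LoRA: alpha / sqrt(r)
        return lora_alpha / math.sqrt(r)
    # Standard LoRA: alpha / r
    return lora_alpha / r


rs_scale = lora_scaling(
    lora_settings["r"], lora_settings["lora_alpha"], lora_settings["use_rslora"]
)
std_scale = lora_scaling(lora_settings["r"], lora_settings["lora_alpha"], False)

print(f"rsLoRA scaling:        {rs_scale:.2f}")   # 11.31
print(f"standard LoRA scaling: {std_scale:.2f}")  # 2.00
```

With rank 32 and alpha 64, the rank-stabilized factor is roughly 5.7 times larger than the standard one, which is worth keeping in mind when comparing the learning rate here with other LoRA recipes.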
Evaluation
Results
| Metric | Score |
|---|---|
| WER | 44.1% |
WER (word error rate, lower is better) was measured on the AfricanVoices Yoruba test set after Yoruba text normalization: lowercasing, punctuation removal, and Unicode NFC normalization.
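The normalization and scoring described above can be sketched in plain Python as follows. This is a minimal illustration rather than the evaluation script behind the reported number: the function names are hypothetical, the order of the normalization steps is an assumption, and punctuation is identified by Unicode category so that Yoruba tone marks and under-dots (combining marks) are preserved. WER is computed at corpus level as total word edits divided by total reference words.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, apply NFC, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text.lower())
    # Unicode categories starting with "P" are punctuation; combining
    # diacritics (category "Mn") are left untouched.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def word_edits(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = normalize_text(reference).split()
    hyp = normalize_text(hypothesis).split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Corpus-level WER over (reference, hypothesis) pairs."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        edits, n_words = word_edits(reference, hypothesis)
        total_edits += edits
        total_words += n_words
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words


# Toy example: one substituted word out of four reference words.
print(corpus_wer([("a b c d", "a b x d")]))  # 0.25
```

Applying the same `normalize_text` to both references and hypotheses matters for Yoruba in particular, because visually identical strings can differ in how their diacritics are encoded.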