# Whisper Small Setswana (LoRA) - 5,000 Steps
This model is a fine-tuned version of openai/whisper-small on the Setswana (`tn`) Common Voice dataset. It was optimized for high-accuracy automatic speech recognition (ASR) as part of the PuoSpeaker project.
## 📊 Training Summary
- Total Steps: 5,000
- Final Training Loss: 0.1736
- Hardware: NVIDIA RTX A4000 (16GB VRAM)
- Method: Parameter-Efficient Fine-Tuning (PEFT) using LoRA (rank 64, alpha 128)
- Duration: ~5.5 hours
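The LoRA hyperparameters above can be sketched as a `peft` configuration. Only the rank and alpha are stated on this card; the target modules and dropout below are assumptions (the attention projections are a common choice when adapting Whisper):

```python
from peft import LoraConfig

# Rank and alpha come from the training summary above.
# target_modules and lora_dropout are assumed, not stated on the card.
lora_config = LoraConfig(
    r=64,             # LoRA rank
    lora_alpha=128,   # scaling factor (alpha / r = 2.0)
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    bias="none",
)
```

Passing this config to `peft.get_peft_model` on a loaded Whisper model freezes the base weights and trains only the adapter.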
## 🎯 Capabilities & Limitations
### ✅ Automatic Speech Recognition (ASR) - "Near-Perfect"
The model performs strongly in capturing Setswana phonetics, tone, and rhythm.
- Pros: Handles fast native speech and subtle vowel distinctions (e.g., 'ê' and 'ô') with high precision.
- Suitability: Professional-grade transcription and pronunciation scoring.
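Transcription accuracy claims like the above are conventionally quantified with word error rate (WER), which this card does not report. WER can be computed with a small word-level edit-distance routine; the Setswana sentence pair below is purely illustrative:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative pair: one substituted word out of four -> WER 0.25
print(wer("ke a go leboga", "ke a go lebogang"))  # 0.25
```

In practice a library such as `jiwer` does the same computation with text normalization built in.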
### ⚠️ Text-to-Speech (TTS) - "Low Quality"
While the ASR is strong, the current Text-to-Speech (TTS) integration (XTTS-v2) is at a prototype stage.
- Current State: Low prosody alignment and robotic rhythm.
- Next Steps: Dedicated TTS prosody fine-tuning is required to match the ASR's quality.
## 🛠️ Usage (Python)
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import torch

base_model = "openai/whisper-small"
adapter_model = "ogaufi/whisper-small-tn-lora-v2"  # Recommended 5k checkpoint

# Load the base Whisper model, then attach the LoRA adapter on top
processor = WhisperProcessor.from_pretrained(base_model)
model = WhisperForConditionalGeneration.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()

# Standard Whisper inference follows: `audio` is a 16 kHz mono float array
# inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# with torch.no_grad():
#     ids = model.generate(input_features=inputs.input_features)
# text = processor.batch_decode(ids, skip_special_tokens=True)[0]
```
## 📚 Dataset Details
Trained on the Common Voice Setswana corpus, using the validated split to ensure high-quality, human-verified transcriptions.