# Whisper Small Setswana (LoRA) - 5,000 Steps
This model is a fine-tuned version of openai/whisper-small on the Setswana (`tn`) Common Voice dataset. It was optimized for high-accuracy automatic speech recognition (ASR) as part of the PuoSpeaker project.
## 📊 Training Summary
- Total Steps: 5,000
- Final Training Loss: 0.1736
- Hardware: NVIDIA RTX A4000 (16GB VRAM)
- Method: Parameter-Efficient Fine-Tuning (PEFT) using LoRA (rank 64, alpha 128)
- Duration: ~5.5 hours
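The LoRA hyperparameters above can be sketched as a `peft` configuration. Only the rank and alpha are stated on this card; the target modules and dropout below are assumptions (the attention projections are a common choice when adapting Whisper):

```python
from peft import LoraConfig

# Rank and alpha come from the training summary above.
# target_modules and lora_dropout are assumed, not stated on the card.
lora_config = LoraConfig(
    r=64,             # LoRA rank
    lora_alpha=128,   # scaling factor (alpha / r = 2.0)
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    bias="none",
)
```

Passing this config to `peft.get_peft_model` on a loaded Whisper model freezes the base weights and trains only the adapter.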
## 🎯 Capabilities & Limitations
### ✅ Automatic Speech Recognition (ASR) - "Near-Perfect"
The model performs strongly in capturing Setswana phonetics, tone, and rhythm.
- Pros: Handles fast native speech and subtle vowel distinctions (e.g., 'ê' and 'ô') with high precision.
- Suitability: Professional-grade transcription and pronunciation scoring.
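Transcription accuracy claims like the above are conventionally quantified with word error rate (WER), which this card does not report. WER can be computed with a small word-level edit-distance routine; the Setswana sentence pair below is purely illustrative:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative pair: one substituted word out of four -> WER 0.25
print(wer("ke a go leboga", "ke a go lebogang"))  # 0.25
```

In practice a library such as `jiwer` does the same computation with text normalization built in.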
### ⚠️ Text-to-Speech (TTS) - "Low Quality"
While the ASR is strong, the current Text-to-Speech (TTS) integration (XTTS-v2) is at a prototype stage.
- Current State: Low prosody alignment and robotic rhythm.
- Next Steps: Dedicated TTS prosody fine-tuning is required to match the ASR's quality.
## 🛠️ Usage (Python)
```python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from peft import PeftModel
import torch

base_model = "openai/whisper-small"
adapter_model = "ogaufi/whisper-small-tn-lora-v2"  # Recommended 5k checkpoint

# Load the base Whisper model, then attach the LoRA adapter on top
processor = WhisperProcessor.from_pretrained(base_model)
model = WhisperForConditionalGeneration.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()

# Standard Whisper inference follows: `audio` is a 16 kHz mono float array
# inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
# with torch.no_grad():
#     ids = model.generate(input_features=inputs.input_features)
# text = processor.batch_decode(ids, skip_special_tokens=True)[0]
```
## 📚 Dataset Details
Trained on the Common Voice Setswana corpus, using the validated split to ensure high-quality, human-verified transcriptions.