Based on the paper *LoRA: Low-Rank Adaptation of Large Language Models* (arXiv:2106.09685).
A LoRA (Low-Rank Adaptation) fine-tune of openai/whisper-large-v3 for Uzbek speech recognition, trained on the Mozilla Common Voice Uzbek dataset.
This adapter is designed to work as part of a full speaker-diarization pipeline using pyannote/speaker-diarization-3.1, producing timestamped, speaker-labelled Uzbek transcripts from any audio file.
Only ~1% of parameters were trained (15.7M out of 1.55B), so the adapter is just 63 MB while the base model stays completely frozen.
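The 15.7M / 63 MB figures can be checked with back-of-the-envelope arithmetic. The sketch below assumes Whisper large-v3's published architecture (hidden size 1280, 32 encoder and 32 decoder layers); everything else follows from the LoRA shapes stated in this card.

```python
# Back-of-the-envelope check of the trainable-parameter and adapter-size figures.
d_model = 1280            # hidden size of whisper-large-v3
rank = 32                 # LoRA rank used for this adapter
encoder_layers = 32
decoder_layers = 32

# q_proj and v_proj appear in every attention block: encoder self-attention,
# decoder self-attention, and decoder cross-attention.
attention_blocks = encoder_layers + 2 * decoder_layers   # 96
projections = 2 * attention_blocks                       # q_proj + v_proj per block

# Each LoRA-adapted projection adds A (rank x d_model) and B (d_model x rank).
params_per_projection = 2 * rank * d_model
trainable = projections * params_per_projection
print(trainable)                 # 15,728,640  ~= 15.7M

# Stored in fp32 (4 bytes per parameter), that is roughly 63 MB on disk.
print(trainable * 4 / 1e6)       # ~62.9 MB
```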
Key engineering details:

- A `run.sh` watchdog that survives SIGSEGV / GPU memory crashes and auto-resumes from the latest checkpoint
- A custom `WhisperDataset` that keeps plain Python lists in RAM, bypassing the HuggingFace Arrow/mmap memory issues that caused segfaults
- LoRA applied to the `q_proj` and `v_proj` attention layers only, achieving strong Uzbek accuracy without catastrophic forgetting of the base model

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# =====================================================================
# 💡 REMINDER: This LoRA adapter was specifically trained on and
# MUST be used with the base model: "openai/whisper-large-v3"
# =====================================================================

# 1. Load the configuration and base model
model_id = "AnvarMexmonov/uz-speech-adapter-v1"
config = PeftConfig.from_pretrained(model_id)

# The config automatically pulls "openai/whisper-large-v3" as the base
processor = WhisperProcessor.from_pretrained(config.base_model_name_or_path)
model = WhisperForConditionalGeneration.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 2. Load the custom Uzbek LoRA adapter
model = PeftModel.from_pretrained(model, model_id)
print("Custom Uzbek Whisper Large-v3 model is ready for inference!")
```
| Setting | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA via HuggingFace PEFT |
| LoRA targets | q_proj, v_proj |
| LoRA rank / alpha | 32 / 64 |
| Trainable params | 15.7M / 1.55B (1%) |
| Dataset | yakhyo/mozilla-common-voice-uzbek |
| Train samples | 3,000 clips |
| Eval samples | 200 clips |
| Steps | 1,000 |
| Effective batch size | 8 |
| Learning rate | 5e-4 with 50-step warmup |
| Precision | bf16 |
| GPU | NVIDIA RTX 5090 |
| Training time | ~90 minutes |
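The LoRA settings in the table above could be expressed as a `peft` `LoraConfig` roughly as follows. This is a minimal sketch, not the card's actual training script: the rank, alpha, and target modules come from the table, while the dropout and bias values are assumptions not stated here.

```python
from peft import LoraConfig

# Sketch of an adapter config matching the table above.
lora_config = LoraConfig(
    r=32,                                  # LoRA rank (from the table)
    lora_alpha=64,                         # LoRA alpha (from the table)
    target_modules=["q_proj", "v_proj"],   # attention projections only
    lora_dropout=0.05,                     # assumption: not stated in the card
    bias="none",                           # assumption: not stated in the card
)
```

With `get_peft_model(base_model, lora_config)`, this would freeze the base Whisper weights and train only the low-rank adapters.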
| Metric | Value |
|---|---|
| Best eval loss | 0.7835 |
| Training steps | 1,000 |
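The training-length figures above imply roughly 2.7 passes over the data, a quick sanity check of which is:

```python
# Epochs implied by the tables: 1,000 optimizer steps at an effective
# batch size of 8, over 3,000 training clips.
steps = 1000
effective_batch = 8
train_clips = 3000

samples_seen = steps * effective_batch   # 8,000 samples
epochs = samples_seen / train_clips
print(round(epochs, 2))                  # ~2.67 epochs
```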
The complete diarization + transcription project (code, training scripts, inference) is available at: https://github.com/anvarmexmonov/WhoSaidWhatWhen-Uzbek
MIT © AnvarMexmonov