Paper: Robust Speech Recognition via Large-Scale Weak Supervision (arXiv:2212.04356)
This repository contains a LoRA/PEFT adapter for Kyrgyz automatic speech recognition (ASR).
This repo provides adapter weights only. For inference, you must load the base model and then attach this adapter via PEFT.
If you want a single, standalone checkpoint, use the merged model linked above.
Dataset: fsicoli/common_voice_22_0 (config: ky)

Evaluation on Common Voice 22.0 Kyrgyz (test split):

- WER (normalized): 16.2061
- WER_ortho (orthographic): 19.1491
- test_loss: 0.1722

Quick check (200 random test samples):

- WER: 16.1677
- WER_ortho: 19.6021

Note: WER depends on text normalization (punctuation/case), decoding settings, and audio preprocessing.
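The gap between WER and WER_ortho reported above comes from text normalization. The following is a minimal sketch of that effect, assuming a simple lowercase-and-strip-punctuation normalizer; the exact normalizer used for the reported numbers is not specified in this card, and the example sentences are made up for illustration.

```python
import unicodedata
from typing import Tuple


def normalize(text: str) -> str:
    """Lowercase and strip punctuation (an assumed, simplified normalizer)."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Keep only letters, digits, and whitespace.
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(text.split())


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer(references, hypotheses, normalizer=None) -> float:
    """Corpus-level WER in percent: total errors / total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        if normalizer is not None:
            ref, hyp = normalizer(ref), normalizer(hyp)
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return 100.0 * errors / words


# Hypothetical reference/hypothesis pairs (not taken from the test set).
refs = ["Салам, дүйнө!", "Бүгүн аба ырайы жакшы."]
hyps = ["салам дүйнө", "Бүгүн аба ырайы жаман."]

wer_ortho = wer(refs, hyps)                        # raw text, case and punctuation count
wer_norm = wer(refs, hyps, normalizer=normalize)   # normalized text
print(f"WER_ortho: {wer_ortho:.2f}  WER: {wer_norm:.2f}")
```

In this toy case the orthographic score is higher only because of casing and punctuation mismatches, which is why the two metrics should not be compared across models unless the same normalizer is used.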
LoRA fine-tuning summary:
- LoRA config: r=8, lora_alpha=16, lora_dropout=0.1
- Target modules: q_proj, v_proj
- Training length: max_steps=4000
- Selected checkpoint: checkpoint-4000 (WER=16.21)

Training progress (selected checkpoints):

| Step | Train loss | Val loss | WER_ortho | WER |
|---|---|---|---|---|
| 500 | 0.7980 | 0.7911 | 44.3501 | 42.0754 |
| 1000 | 0.3980 | 0.2043 | 28.9947 | 27.8551 |
| 1500 | 0.1712 | 0.1821 | 20.7479 | 17.7343 |
| 2000 | 0.1734 | 0.1770 | 20.7569 | 17.6977 |
| 2500 | 0.1935 | 0.1743 | 19.7995 | 16.8192 |
| 3000 | 0.3406 | 0.1728 | 19.8988 | 16.9656 |
| 3500 | 0.3192 | 0.1724 | 19.3840 | 16.4074 |
| 4000 | 0.1499 | 0.1722 | 19.1491 | 16.2061 |
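
As a rough sanity check on the LoRA settings in the fine-tuning summary (r=8, lora_alpha=16, target modules q_proj and v_proj), the adapter size can be estimated by hand. This sketch assumes the base model has the standard Whisper medium architecture (hidden size 1024, 24 encoder layers, 24 decoder layers), which this card does not state explicitly, and that the target module names match both self-attention and cross-attention projections.

```python
# Assumed Whisper medium dimensions (not stated in this card).
D_MODEL = 1024
ENCODER_LAYERS = 24
DECODER_LAYERS = 24

# Values from the fine-tuning summary.
R = 8
LORA_ALPHA = 16
TARGET_MODULES = ("q_proj", "v_proj")


def lora_params_per_linear(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * in_features + out_features * r


# Each encoder layer has one attention block (self-attention);
# each decoder layer has two (self-attention and cross-attention).
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_linears = attention_blocks * len(TARGET_MODULES)
trainable = adapted_linears * lora_params_per_linear(D_MODEL, D_MODEL, R)

# The low-rank update is scaled by lora_alpha / r before being added.
scaling = LORA_ALPHA / R

print(f"adapted linear layers: {adapted_linears}")
print(f"estimated trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapter holds roughly 2.4M trainable parameters, a small fraction of the base model, which is why the adapter weights can be distributed separately.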
```bash
pip install -U "transformers" "peft" "accelerate" "torch"
```
Inference with the base model plus the adapter:

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

adapter_id = "AleksTv/whisper-medium-ky-lora"

# Read the base model ID from the adapter config.
peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path  # nineninesix/kyrgyz-whisper-medium

device = 0 if torch.cuda.is_available() else -1
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load without device_map: the pipeline moves the model to `device` below.
# Combining device_map="auto" here with device=... in the pipeline can conflict.
base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    base_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# Attach the LoRA adapter to the base model.
model = PeftModel.from_pretrained(base_model, adapter_id)

# The base model uses custom tokenizer components for Kyrgyz support.
processor = AutoProcessor.from_pretrained(base_id, trust_remote_code=True)

asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=dtype,  # cast input features to the model dtype
    device=device,
)
print(asr("path/to/audio.wav")["text"])
```
To merge the adapter into the base model and save a standalone checkpoint:

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

adapter_id = "AleksTv/whisper-medium-ky-lora"

# Read the base model ID from the adapter config.
peft_cfg = PeftConfig.from_pretrained(adapter_id)
base_id = peft_cfg.base_model_name_or_path

dtype = torch.float16 if torch.cuda.is_available() else torch.float32

base_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    base_id,
    torch_dtype=dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)

# Attach the adapter, then fold the LoRA weights into the base weights.
model = PeftModel.from_pretrained(base_model, adapter_id)
merged = model.merge_and_unload()

# Save the merged model and the processor side by side.
out_dir = "whisper-medium-ky-merged"
merged.save_pretrained(out_dir, safe_serialization=True)
AutoProcessor.from_pretrained(base_id, trust_remote_code=True).save_pretrained(out_dir)
```
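
Conceptually, merging folds the low-rank update into each adapted weight matrix, W' = W + (lora_alpha / r) · B · A, so the merged layer gives the same output as the base layer plus the adapter path. The toy example below illustrates that identity with plain Python lists. It is a sketch of the idea, not PEFT's actual implementation, and all matrices and sizes are hypothetical values chosen for readability.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[s * x for x in row] for row in a]


# Hypothetical toy layer: 3 input features, 2 output features, rank 1.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]          # base weight, shape (out=2, in=3)
A = [[0.5, 0.0, 1.0]]           # LoRA A, shape (r=1, in=3)
B = [[2.0],
     [-1.0]]                    # LoRA B, shape (out=2, r=1)
r, lora_alpha = 1, 2
scaling = lora_alpha / r        # same ratio as r=8, lora_alpha=16

x = [[1.0], [2.0], [3.0]]       # input as a column vector

# Adapter attached: base output plus the scaled low-rank path.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged: fold the update into the weight, then apply once.
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

Both paths produce the same output here, which is why a merged checkpoint can be used without PEFT at inference time. In reduced precision such as float16, the merged and unmerged outputs may differ slightly due to rounding.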
License: Apache-2.0.