Whisper Small — Vietnamese (LoRA Fine-tuned)

A LoRA fine-tune of openai/whisper-small for Vietnamese speech transcription, trained on the Vietnamese (vi) subset of Mozilla Common Voice 11.0.

Training Results

Metric           Value
Training Loss    0.9382
Epochs           5
Global Steps     470
Samples/sec      7.37
Total FLOPs      4.60e+18

Model Details

  • Base model: openai/whisper-small (244M params)
  • Method: LoRA (Low-Rank Adaptation)
  • Trainable params: ~13M (5.09% of all parameters, adapter included)
  • Target modules: q_proj, v_proj, k_proj, out_proj, fc1, fc2
  • LoRA rank: 32 · alpha: 64 · dropout: 0.05
  • Language: Vietnamese (vi)
  • Task: Transcription
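
The trainable-parameter figure above can be checked with a little arithmetic. The sketch below assumes the usual whisper-small dimensions (hidden size 768, feed-forward size 3072, 12 encoder and 12 decoder layers) and a checkpoint size of 241,734,912 parameters; neither is stated in this card, so treat both as assumptions. Under them, the quoted 5.09% is reproduced when the adapter weights are counted in the denominator.

```python
# Assumed whisper-small dimensions (not stated in this card).
D_MODEL = 768
FFN_DIM = 3072
ENCODER_LAYERS = 12
DECODER_LAYERS = 12
RANK = 32  # LoRA rank from the list above

# Assumed parameter count of the base checkpoint.
BASE_PARAMS = 241_734_912


def lora_params(in_features, out_features, rank=RANK):
    """LoRA adds A (rank x in) and B (out x rank) to each wrapped linear layer."""
    return rank * (in_features + out_features)


# q_proj, k_proj, v_proj, out_proj are all D_MODEL -> D_MODEL.
attention_block = 4 * lora_params(D_MODEL, D_MODEL)
# fc1 expands to FFN_DIM, fc2 projects back.
mlp_block = lora_params(D_MODEL, FFN_DIM) + lora_params(FFN_DIM, D_MODEL)

# Encoder layers have self-attention only; decoder layers add cross-attention.
encoder = ENCODER_LAYERS * (attention_block + mlp_block)
decoder = DECODER_LAYERS * (2 * attention_block + mlp_block)

trainable = encoder + decoder
share = 100 * trainable / (BASE_PARAMS + trainable)

print(f"{trainable:,} trainable parameters, {share:.2f}% of all parameters")
```

Matching target modules by name also picks up the decoder's cross-attention projections, which is why the decoder contributes more adapter weights than the encoder.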

Training Details

  • Dataset: Mozilla Common Voice 11.0 (vi)
  • Learning rate: 1e-4 with linear warmup (500 steps)
  • Batch size: 8 × 2 gradient accumulation = effective 16
  • Precision: FP16
  • Framework: 🤗 Transformers + PEFT
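
To see how the batch size and warmup settings interact, here is a minimal sketch of a linear warmup schedule. It assumes the standard shape used by the linear scheduler in Transformers (ramp from zero to the peak, then linear decay), which this card does not confirm. With 470 global steps and a 500-step warmup, such a schedule would still be ramping up when training ends, finishing near 9.4e-5 rather than at the 1e-4 peak.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 500
GLOBAL_STEPS = 470      # from the training results
PER_DEVICE_BATCH = 8
GRAD_ACCUM = 2

# One optimizer step consumes this many samples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM


def lr_at(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS, total_steps=GLOBAL_STEPS):
    """Assumed schedule: linear warmup from 0, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 100, 250, GLOBAL_STEPS):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```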

Data augmentation applied:

  • Speed perturbation ±10% (p=0.3)
  • Additive Gaussian noise (p=0.3)
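
The two augmentations could be sketched as follows. This is an illustration, not the training code: the card gives only the probabilities and the ±10% range, so the uniform sampling of the speed factor, the linear-interpolation resampling, and the noise level (`noise_std=0.005`) are all assumptions, and the function names are hypothetical.

```python
import numpy as np


def speed_perturb(audio, factor):
    """Resample so the clip plays `factor` times faster (linear interpolation)."""
    n_out = max(1, int(round(len(audio) / factor)))
    src_positions = np.linspace(0, len(audio) - 1, num=n_out)
    resampled = np.interp(src_positions, np.arange(len(audio)), audio)
    return resampled.astype(np.float32)


def add_gaussian_noise(audio, std, rng):
    """Add zero-mean Gaussian noise and keep samples inside [-1, 1]."""
    noise = rng.normal(0.0, std, size=audio.shape).astype(np.float32)
    return np.clip(audio + noise, -1.0, 1.0)


def augment(audio, rng, p_speed=0.3, p_noise=0.3,
            max_speed_change=0.10, noise_std=0.005):
    """Apply each augmentation independently with its own probability."""
    audio = np.asarray(audio, dtype=np.float32)
    if rng.random() < p_speed:
        # Assumption: factor drawn uniformly from [0.9, 1.1].
        factor = rng.uniform(1.0 - max_speed_change, 1.0 + max_speed_change)
        audio = speed_perturb(audio, factor)
    if rng.random() < p_noise:
        # Assumption: fixed noise level; the card does not state one.
        audio = add_gaussian_noise(audio, noise_std, rng)
    return audio
```

A call such as `augment(waveform, np.random.default_rng(0))` returns a float32 waveform whose length changes only when the speed perturbation fires.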

Usage

from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model = PeftModel.from_pretrained(base, "LakoreAI/whisper-small-vi-lora")
processor = WhisperProcessor.from_pretrained("LakoreAI/whisper-small-vi-lora")

# Optional: merge LoRA for faster inference
model = model.merge_and_unload()
model.eval()

# Inference
def transcribe(audio_array, sampling_rate=16000):
    """Transcribe a mono float waveform; Whisper expects 16 kHz audio."""
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        # Keyword argument works for both the merged model and the PEFT wrapper
        ids = model.generate(
            input_features=inputs.input_features,
            language="vietnamese",
            task="transcribe",
            max_new_tokens=225,
        )
    return processor.tokenizer.decode(ids[0], skip_special_tokens=True)

Limitations

  • Optimized for Vietnamese only; accuracy on other languages will degrade significantly
  • Common Voice data skews toward read speech; accuracy on spontaneous or accented speech may be worse
  • Short clips (<1s) or clipped audio may cause hallucinations
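
One way to act on the last limitation is to screen audio before transcribing it. The helper below is a hypothetical sketch: the name `check_audio`, the clipping threshold (`clip_level=0.999`), and the tolerated clipped fraction (1%) are illustrative choices, not values from this card. Only the 1-second minimum comes from the list above.

```python
import numpy as np


def check_audio(audio, sampling_rate=16000, min_seconds=1.0,
                clip_level=0.999, max_clipped_fraction=0.01):
    """Return a list of warnings for clips likely to cause hallucinations."""
    audio = np.asarray(audio, dtype=np.float32)
    warnings = []

    duration = len(audio) / sampling_rate
    if duration < min_seconds:
        warnings.append(f"clip is {duration:.2f}s, shorter than {min_seconds:.1f}s")

    # Fraction of samples sitting at (or beyond) full scale.
    clipped = float(np.mean(np.abs(audio) >= clip_level)) if len(audio) else 0.0
    if clipped > max_clipped_fraction:
        warnings.append(f"{clipped:.1%} of samples are at full scale (clipping)")

    return warnings
```

An empty list means neither check fired; a caller might skip or flag any clip for which warnings are returned.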