whisper-mild-lora-adapter

A LoRA fine-tune of openai/whisper-large-v3 specialised for mild-severity dysarthric speech. This adapter is one of three severity-specific checkpoints produced by a larger system that routes audio through a wav2vec2-based severity classifier before transcription.

Model Details

Field	Value
Base model	`openai/whisper-large-v3`
Fine-tuning method	LoRA (PEFT 0.11.1)
Target severity	Mild dysarthria
Language	English (`en`)
Task	Transcription
Framework	PyTorch + 🤗 Transformers
Inference dtype	`float16`

Companion models

Severity	Repo
Mild	`jojo007unfi/whisper-mild` ← this model
Moderate	`jojo007unfi/whisper-moderate`
Severe	`jojo007unfi/whisper-severe`
Severity router	`jojo007unfi/whisper-severity-classifier`

Motivation

Although mild dysarthria is the least severe form of motor speech impairment, standard Whisper large-v3 still incurs elevated error rates on speakers with subtle articulatory differences, reduced prosodic range, or mildly irregular rhythm. This adapter is trained specifically on mild-severity dysarthric audio to reduce those residual errors and produce cleaner, more reliable transcripts for real-time accessibility use.

Performance

All metrics are evaluated on a held-out test split of mild-severity dysarthric speech. Lower is better for both WER and CER.

Model	WER (%)	CER (%)
`whisper-large-v3` (baseline, no fine-tune)	`25.45`	`14.92`
This adapter (mild LoRA)	`21.91`	`12.31`
Relative improvement	↓ 13.9%	↓ 17.5%
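
The relative-improvement row can be reproduced from the WER and CER figures in the table. The helper name below is illustrative only; the arithmetic is the usual relative error reduction, (baseline - adapter) / baseline.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative error reduction, in percent, going from baseline to improved."""
    return (baseline - improved) / baseline * 100.0

# WER and CER values copied from the table above.
wer_gain = relative_reduction(25.45, 21.91)
cer_gain = relative_reduction(14.92, 12.31)

print(f"WER reduction: {wer_gain:.1f}%  CER reduction: {cer_gain:.1f}%")
```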

System Architecture

This adapter is designed to be used inside a severity-routing pipeline, not in isolation:

Raw audio (16 kHz, mono)
    │
    ▼
SeverityClassifier  (wav2vec2-base → MLP head)
    │  labels: mild | moderate | severe
    │
    ├─ mild     → whisper-mild-lora-adapter      ◄ this model
    ├─ moderate → whisper-moderate-lora-adapter
    └─ severe   → whisper-severe-lora-adapter
                        │
                        ▼
              Streaming transcription
              (TextIteratorStreamer, greedy decode)

The classifier uses the first 8 seconds of audio to route the session. The LoRA adapter is merged into the base weights (merge_and_unload()) and used for the remainder of the WebSocket session.
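
The routing step described above can be pictured with a small sketch. This is not the serving script: `routing_window`, `pick_adapter`, and `dummy_classifier` are hypothetical names, and the classifier is a stand-in so the example runs without model weights. Only the 16 kHz rate, the 8-second window, the three labels, and the adapter names come from this card.

```python
from typing import Callable, Sequence

SAMPLE_RATE = 16_000      # pipeline input: 16 kHz mono
ROUTING_WINDOW_S = 8      # the classifier sees the first 8 seconds

# Adapter names as written in the diagram above.
ADAPTERS = {
    "mild": "whisper-mild-lora-adapter",
    "moderate": "whisper-moderate-lora-adapter",
    "severe": "whisper-severe-lora-adapter",
}

def routing_window(audio: Sequence[float], sample_rate: int = SAMPLE_RATE) -> Sequence[float]:
    """Return the leading slice of audio that the classifier would be given."""
    return audio[: ROUTING_WINDOW_S * sample_rate]

def pick_adapter(audio: Sequence[float], classify: Callable[[Sequence[float]], str]) -> str:
    """Classify once on the routing window and fix the adapter for the session."""
    label = classify(routing_window(audio))
    if label not in ADAPTERS:
        raise ValueError(f"unexpected severity label: {label!r}")
    return ADAPTERS[label]

def dummy_classifier(window: Sequence[float]) -> str:
    """Hypothetical stand-in for the wav2vec2 classifier; always answers 'mild'."""
    return "mild"

session_audio = [0.0] * (SAMPLE_RATE * 12)   # 12 s of placeholder samples
adapter_name = pick_adapter(session_audio, dummy_classifier)
print(adapter_name)
```

Because the decision is made once per session, a mis-classification on the first 8 seconds carries through to every later chunk of that session.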

How to Use

Standalone inference

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base_model_id = "openai/whisper-large-v3"
adapter_id    = "jojo007unfi/whisper-mild-lora-adapter"

processor = WhisperProcessor.from_pretrained(base_model_id, language="en", task="transcribe")

base  = WhisperForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
model = model.merge_and_unload()   # fuse LoRA weights for faster inference

model.config.forced_decoder_ids            = None
model.config.suppress_tokens               = []
model.generation_config.forced_decoder_ids = None

def transcribe(audio_array, sample_rate: int = 16_000) -> str:
    inputs = processor(
        audio_array, sampling_rate=sample_rate,
        return_tensors="pt", return_attention_mask=True
    )
    input_features = inputs.input_features.to(model.device, dtype=torch.float16)
    attention_mask = inputs.attention_mask.to(model.device)

    with torch.no_grad():
        ids = model.generate(
            input_features,
            attention_mask=attention_mask,
            language="en",
            task="transcribe",
            num_beams=1,
            max_new_tokens=225,
            temperature=0.0,
            no_repeat_ngram_size=5,
            repetition_penalty=1.8,
            compression_ratio_threshold=1.35,
            condition_on_prev_tokens=False,
        )
    return processor.tokenizer.decode(ids[0], skip_special_tokens=True)
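
`transcribe` expects a 16 kHz mono waveform. As a minimal sketch of preparing that input with the standard library only, the hypothetical helper below reads a 16-bit PCM WAV file into floats and rejects files that would need resampling or downmixing first. The file name in the usage comment is a placeholder.

```python
import array
import sys
import wave

def load_wav_16k_mono(path: str) -> list:
    """Read a 16-bit PCM, 16 kHz, mono WAV file into floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16_000:
            raise ValueError("expected 16 kHz audio; resample before transcribing")
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM")
        pcm = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()   # WAV samples are little-endian
    return [sample / 32768.0 for sample in pcm]

# Usage with the function defined above (placeholder path):
# text = transcribe(load_wav_16k_mono("sample.wav"))
```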

Inside the full routing pipeline

See jojo007unfi/whisper-severity-classifier for the classifier and the modal_streaming_whisper_severity.py serving script that orchestrates classifier + all three adapters over a WebSocket with real-time token streaming.

Training Details

Base model

openai/whisper-large-v3 — 1.5 B parameter encoder-decoder transformer pre-trained on 5 million hours of multilingual audio.

Fine-tuning method

Low-Rank Adaptation (LoRA) via PEFT 0.11.1. LoRA injects trainable rank-decomposition matrices into the attention layers of the Whisper decoder, keeping base model weights frozen. This sharply reduces the trainable parameter count and, on narrow domain-specific data, can approach full fine-tuning quality.
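
To give a feel for the parameter saving, the sketch below compares one dense attention projection with the pair of low-rank matrices LoRA would train in its place. The rank used here is a hypothetical value chosen for illustration; this card does not state the rank or target modules actually used. The width of 1280 is the hidden size of Whisper large models.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA pair: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

D_MODEL = 1280   # hidden width of Whisper large models
RANK = 8         # hypothetical rank, NOT taken from this model card

dense = D_MODEL * D_MODEL                        # weights in one frozen projection
lora = lora_param_count(D_MODEL, D_MODEL, RANK)  # weights LoRA trains instead

print(f"dense: {dense:,}  LoRA: {lora:,}  ratio: {lora / dense:.2%}")
```

Under these assumptions each adapted projection trains a little over one percent of the weights it modifies; the real figure depends on the rank and on which projections are targeted.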

Training data

Mild-severity dysarthric speech recordings, English, 16 kHz mono. Data sourced from [YOUR_DATASET] — annotated transcripts aligned with audio from speakers with mild motor speech impairment.

Generation constraints (applied at both training and inference)

Hyperparameter	Value	Rationale
`no_repeat_ngram_size`	5	Blocks 5-gram repeats — prevents Whisper looping on irregular rhythm
`repetition_penalty`	1.8	Suppresses confabulation on phonemes that deviate from standard articulation
`compression_ratio_threshold`	1.35	Rejects outputs that are too compressible (repetitive)
`condition_on_prev_tokens`	`False`	Prevents prior context polluting predictions on short streaming chunks
`num_beams`	1 (streaming)	Greedy decode required for `TextIteratorStreamer` compatibility
`max_new_tokens`	225	Standard Whisper 30-second window limit
`temperature`	0.0	Deterministic output
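
The idea behind `compression_ratio_threshold` can be illustrated with zlib: text that loops compresses far better than ordinary speech, so a high ratio is a cheap signal of repetition. The sketch below measures the ratio on UTF-8 text. The computation inside the library may differ in detail (for example, by operating on token ids), so treat this as an illustration of the principle rather than the exact check that runs during generation.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; higher means more repetitive."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

THRESHOLD = 1.35   # value from the table above

looping = "the cat sat " * 40   # what a decoding loop tends to look like
varied = "Please open the window and pass me the blue notebook."

for name, text in [("looping", looping), ("varied", varied)]:
    ratio = compression_ratio(text)
    verdict = "reject" if ratio > THRESHOLD else "keep"
    print(f"{name}: ratio={ratio:.2f} -> {verdict}")
```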

Inference dtype

float16 on CUDA (NVIDIA A10G, 24 GB VRAM). The merged checkpoint fits alongside the severity classifier and the other two adapters on a single GPU.
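
A rough weights-only estimate shows why this is plausible. The sketch assumes three fully merged float16 copies of the base model are resident at once, which this card does not confirm, and it ignores the classifier, activations, and decoding caches, so real usage will be higher.

```python
PARAMS = 1.5e9         # parameter count quoted for the base model in this card
BYTES_PER_PARAM = 2    # float16
GPU_GB = 24            # A10G memory, as stated above

per_model_gb = PARAMS * BYTES_PER_PARAM / 1e9   # weights only
three_models_gb = 3 * per_model_gb              # assumption: one merged copy per severity

print(f"one merged model: {per_model_gb:.1f} GB, three: {three_models_gb:.1f} GB of {GPU_GB} GB")
```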

Intended Use

Direct use: Real-time or batch transcription of mild-severity dysarthric English speech, particularly in accessibility tooling, AAC (augmentative and alternative communication) applications, and clinical documentation workflows.

Use within the routing system: Automatically selected by the wav2vec2 severity classifier when a speaker's dysarthria is detected as mild severity.

Out-of-Scope Use

Non-dysarthric general-purpose ASR — use the unmodified whisper-large-v3 instead; this adapter may underperform on typical speech due to domain shift.
Languages other than English — the adapter was trained solely on English data.
Speaker identification or any biometric inference — this model transcribes speech content only.

Limitations and Bias

Performance degrades on speakers whose mild dysarthria presentation differs substantially from the training distribution (e.g. different aetiologies, accents, or recording conditions).
The severity boundary between "mild" and "moderate" is fuzzy; classifier mis-routing may direct audio here when the moderate adapter would have been more appropriate.
Audio chunks whose RMS level falls below the VAD threshold (0.02) are treated as background noise or non-speech and silently dropped, so quiet or very short utterances, especially in noisy environments, may be missed entirely.
The model inherits any biases present in the base whisper-large-v3 for phonemes and vocabulary not well-represented in dysarthric training data.
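
The RMS gate mentioned in the limitations above can be pictured as follows. This is a minimal sketch, assuming float samples in [-1, 1] and a simple per-chunk energy test; the function names are hypothetical and the actual voice-activity logic in the serving script may differ. Only the 0.02 threshold comes from this card.

```python
import math

VAD_RMS_THRESHOLD = 0.02   # value quoted in the limitations above

def rms(chunk) -> float:
    """Root-mean-square amplitude of a chunk of float samples in [-1, 1]."""
    if not chunk:
        return 0.0
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))

def passes_vad(chunk, threshold: float = VAD_RMS_THRESHOLD) -> bool:
    """True if the chunk is loud enough to be forwarded for transcription."""
    return rms(chunk) >= threshold

# One second each at 16 kHz: a quiet and a louder 220 Hz tone.
quiet = [0.01 * math.sin(2 * math.pi * 220 * n / 16_000) for n in range(16_000)]
loud = [0.10 * math.sin(2 * math.pi * 220 * n / 16_000) for n in range(16_000)]

print("quiet kept:", passes_vad(quiet), " loud kept:", passes_vad(loud))
```

A gate of this kind cannot tell soft speech from silence, which is why low-volume speakers are at risk of having speech dropped.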

Environmental Impact

Estimated using the ML CO₂ Impact Calculator.

Field	Value
Hardware	`NVIDIA A10G`
Training duration	`2` hours
Cloud provider	`Modal Labs, Inc.`
Compute region	`US East`
Estimated CO₂ emitted	`0.44` kg

Citation

@misc{jojo007unfi2024whisper-mild,
  author    = {TinyefuzaJoe and Mariajemanabaccwa and KatulubaPaul and SsekibuuleRajabRayan},
  title     = {whisper-mild-lora-adapter: LoRA fine-tune of Whisper large-v3
               for mild dysarthric speech},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jojo007unfi/whisper-mild-lora-adapter}
}

Framework Versions

PEFT 0.11.1
Transformers ≥ 4.40
PyTorch ≥ 2.1
Safetensors ≥ 0.4
