whisper-moderate-lora-adapter

A LoRA fine-tune of openai/whisper-large-v3 specialised for moderate-severity dysarthric speech. This adapter is one of three severity-specific checkpoints produced by a larger system that routes audio through a wav2vec2-based severity classifier before transcription.


Model Details

| Field | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT 0.11.1) |
| Target severity | Moderate dysarthria |
| Language | English (en) |
| Task | Transcription |
| Framework | PyTorch + 🤗 Transformers |
| Inference dtype | float16 |

Companion models

| Severity | Repo |
|---|---|
| Mild | jojo007unfi/whisper-mild |
| Moderate | jojo007unfi/whisper-moderate ← this model |
| Severe | jojo007unfi/whisper-severe |
| Severity router | jojo007unfi/whisper-severity-classifier |

Motivation

Standard Whisper large-v3 struggles with dysarthric speech: irregular rhythm, reduced articulatory precision, and atypical prosody cause high word-error rates that make real-time transcription unreliable for accessibility use cases. This adapter was trained specifically on moderate-severity dysarthric audio to narrow that gap.


Performance

All metrics are evaluated on a held-out test split of moderate-severity dysarthric speech. Lower is better for both WER and CER.

| Model | WER (%) | CER (%) |
|---|---|---|
| whisper-large-v3 (baseline, no fine-tune) | 27.55 | 17.62 |
| This adapter (moderate LoRA) | 20.90 | 12.15 |
| Relative improvement | ↓ 24.1% | ↓ 31.0% |
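
The relative-improvement row follows directly from the two rows above it. The sketch below reproduces that arithmetic and, for reference, shows the standard word-level edit-distance definition of WER. The card does not say which scoring tool or text normalisation was used, so `wer` and `relative_reduction` are illustrative names, not the authors' evaluation code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if the words match)
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Percentage by which `adapted` is lower than `baseline`."""
    return (baseline - adapted) / baseline * 100


wer_gain = relative_reduction(27.55, 20.90)
cer_gain = relative_reduction(17.62, 12.15)
print(f"WER: {wer_gain:.1f}%  CER: {cer_gain:.1f}%")  # WER: 24.1%  CER: 31.0%
```

CER is computed the same way with characters in place of words.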

System Architecture

This adapter is designed to be used inside a severity-routing pipeline, not in isolation:

Raw audio (16 kHz, mono)
    │
    ▼
SeverityClassifier  (wav2vec2-base → MLP head)
    │  labels: mild | moderate | severe
    │
    ├─ mild     → whisper-mild-lora-adapter
    ├─ moderate → whisper-moderate-lora-adapter  ◄ this model
    └─ severe   → whisper-severe-lora-adapter
                        │
                        ▼
              Streaming transcription
              (TextIteratorStreamer, greedy decode)

The classifier uses the first 8 seconds of audio to route the session. The Whisper LoRA adapter is then loaded, merged into the base weights (merge_and_unload()), and used for the remainder of the WebSocket session.
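
The routing step described above can be sketched as follows. This is a minimal illustration, not the serving script: `classify` is a hypothetical callable standing in for the wav2vec2 classifier, and `routing_window` and `pick_adapter` are names invented for this example. Only the 16 kHz sample rate, the 8-second window and the three labels come from the card.

```python
SAMPLE_RATE = 16_000      # Hz, as stated for the pipeline input
ROUTING_WINDOW_S = 8      # seconds of audio the classifier sees

# Adapter names as they appear in the architecture diagram.
ADAPTERS = {
    "mild": "whisper-mild-lora-adapter",
    "moderate": "whisper-moderate-lora-adapter",
    "severe": "whisper-severe-lora-adapter",
}


def routing_window(samples):
    """Return the leading slice of audio that is shown to the classifier."""
    return samples[: SAMPLE_RATE * ROUTING_WINDOW_S]


def pick_adapter(samples, classify):
    """Classify the first 8 s once and return the adapter for the session.

    `classify` is a hypothetical callable mapping audio samples to one of
    "mild", "moderate" or "severe"; it stands in for the real classifier.
    """
    label = classify(routing_window(samples))
    if label not in ADAPTERS:
        raise ValueError(f"unexpected severity label: {label!r}")
    return ADAPTERS[label]


# Usage with a stub classifier on 10 s of silence.
audio = [0.0] * (SAMPLE_RATE * 10)
adapter = pick_adapter(audio, classify=lambda window: "moderate")
print(adapter)
```

Because the decision is made once per session, a speaker whose first 8 seconds are unrepresentative stays on the chosen adapter until the session ends.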


How to Use

Standalone inference

import numpy as np
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

base_model_id = "openai/whisper-large-v3"
adapter_id    = "jojo007unfi/whisper-moderate-lora-adapter"

processor = WhisperProcessor.from_pretrained(base_model_id, language="en", task="transcribe")

# Load the base model in half precision, then attach the LoRA adapter.
base = WhisperForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
model = model.merge_and_unload()   # fuse LoRA weights for faster inference

# Clear legacy decoding defaults; language and task are passed to generate() instead.
model.config.forced_decoder_ids            = None
model.config.suppress_tokens               = []
model.generation_config.forced_decoder_ids = None

def transcribe(audio_array: np.ndarray, sample_rate: int = 16_000) -> str:
    """Transcribe one mono clip sampled at 16 kHz."""
    inputs = processor(
        audio_array, sampling_rate=sample_rate,
        return_tensors="pt", return_attention_mask=True
    )
    # Match the model's device and dtype before generation.
    input_features = inputs.input_features.to(model.device, dtype=torch.float16)
    attention_mask = inputs.attention_mask.to(model.device)

    with torch.no_grad():
        ids = model.generate(
            input_features,
            attention_mask=attention_mask,
            language="en",
            task="transcribe",
            num_beams=1,                     # greedy: lowest latency
            max_new_tokens=225,
            temperature=0.0,
            no_repeat_ngram_size=5,
            repetition_penalty=1.8,
            compression_ratio_threshold=1.35,
            condition_on_prev_tokens=False,
        )
    return processor.tokenizer.decode(ids[0], skip_special_tokens=True)

Inside the full routing pipeline

See jojo007unfi/whisper-severity-classifier for the classifier and the modal_streaming_whisper_severity.py serving script that orchestrates classifier + all three adapters over a WebSocket with real-time token streaming.


Training Details

Base model

openai/whisper-large-v3: a 1.5 B-parameter encoder-decoder transformer pre-trained on 5 million hours of multilingual audio.

Fine-tuning method

Low-Rank Adaptation (LoRA) via PEFT 0.11.1. LoRA injects trainable rank-decomposition matrices into the attention layers of the Whisper decoder while keeping the base model weights frozen. This sharply reduces the trainable parameter count and, on domain-specific data, typically comes close to full fine-tune quality.
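
To make the parameter saving concrete, here is a small sketch of the LoRA update W' = W + (alpha / r) · B · A and of the parameter count for one attention projection. The card does not give the rank, alpha or target modules that were actually used, so `RANK = 8` and `ALPHA = 16` are hypothetical values chosen only for illustration; 1280 is the hidden size of whisper-large-v3.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def matmul(x, y):
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, b, a, alpha: float, rank: int):
    """Fold the update into the frozen weight: W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Hypothetical settings: the card does not state the rank or alpha used.
RANK, ALPHA = 8, 16
D_MODEL = 1280  # hidden size of whisper-large-v3

full = D_MODEL * D_MODEL
lora = lora_trainable_params(D_MODEL, D_MODEL, RANK)
print(f"full: {full:,}  LoRA: {lora:,}  ({lora / full:.2%} of the matrix)")

# Toy 2x2 merge with rank 1 to show what merging does numerically.
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]      # 2 x 1
a = [[0.5, -0.5]]       # 1 x 2
merged = merge_lora(w, b, a, alpha=2.0, rank=1)
print(merged)
```

Once merged, the layer is an ordinary dense matrix again, which is why a merged model runs without any adapter overhead at inference time.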

Training data

Moderate-severity dysarthric speech recordings, English, 16 kHz mono. Data sourced from [YOUR_DATASET]: annotated transcripts aligned with audio from speakers with moderate motor speech impairment.

Generation constraints (applied at both training and inference)

| Hyperparameter | Value | Rationale |
|---|---|---|
| no_repeat_ngram_size | 5 | Blocks 5-gram repeats; critical for dysarthric audio, where Whisper tends to loop |
| repetition_penalty | 1.8 | Strong penalty suppresses confabulation on unclear phonemes |
| compression_ratio_threshold | 1.35 | Rejects outputs that are too compressible (i.e. repetitive) |
| condition_on_prev_tokens | False | Prevents prior context from polluting predictions on short chunks |
| num_beams | 1 (streaming) | Greedy decode required for TextIteratorStreamer compatibility |
| max_new_tokens | 225 | Standard Whisper 30-second window limit |
| temperature | 0.0 | Deterministic output |
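
As an illustration of the compression-ratio check, the sketch below measures how well a transcript compresses with zlib and flags it when the ratio exceeds 1.35. It applies the idea to decoded text for readability; the library's internal bookkeeping may differ, and `compression_ratio` and `looks_repetitive` are names made up for this example.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 length divided by zlib-compressed length."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def looks_repetitive(text: str, threshold: float = 1.35) -> bool:
    """True when the text compresses better than the threshold allows."""
    return compression_ratio(text) > threshold


normal = "Please bring me a glass of water."
looped = "the cat sat on the mat " * 20

print(looks_repetitive(normal), looks_repetitive(looped))
```

A looping hypothesis repeats the same bytes over and over, so it compresses far better than ordinary speech. Very short outputs rarely trip the check, because zlib's fixed overhead dominates the compressed size.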

Inference dtype

float16 on CUDA (NVIDIA A10G, 24 GB VRAM). The merged checkpoint fits comfortably alongside the severity classifier on a single GPU.


Intended Use

Direct use: Real-time or batch transcription of moderate-severity dysarthric English speech, particularly in accessibility tooling, AAC (augmentative and alternative communication) applications, and clinical documentation workflows.

Use within the routing system: Automatically selected by the wav2vec2 severity classifier when a speaker's dysarthria is detected as moderate severity.


Out-of-Scope Use

  • Non-dysarthric general-purpose ASR: use the unmodified whisper-large-v3 instead; this adapter may underperform on typical speech due to domain shift.
  • Languages other than English: the adapter was trained solely on English data.
  • Speaker identification or any biometric inference: this model transcribes speech content only.

Limitations and Bias

  • Performance degrades on speakers whose dysarthria presentation differs substantially from the training distribution (e.g. different aetiologies, accents, or recording conditions).
  • The severity boundary between "moderate" and adjacent categories is fuzzy; when the classifier mis-routes, audio reaches this adapter although the mild or severe adapter would have been more appropriate, or the reverse.
  • Background noise and non-speech audio below the VAD RMS threshold (0.02) are silently dropped, so short utterances in noisy environments may be missed entirely.
  • The model inherits any biases present in the base whisper-large-v3 for phonemes and vocabulary not well-represented in dysarthric training data.
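
The RMS gate mentioned in the list above can be pictured with a short sketch. It assumes float samples normalised to [-1, 1] and a per-chunk decision; the card gives only the 0.02 threshold, so the chunking, the `>=` comparison and the function names are assumptions made for illustration.

```python
import math

VAD_RMS_THRESHOLD = 0.02  # value quoted in the limitations list


def rms(samples) -> float:
    """Root-mean-square amplitude of a chunk of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def keep_chunk(samples, threshold: float = VAD_RMS_THRESHOLD) -> bool:
    """True if the chunk is loud enough to be passed on for transcription."""
    return rms(samples) >= threshold


def sine(amplitude: float, freq_hz: float = 220.0, n: int = 16_000, sr: int = 16_000):
    """One second of a test tone, used here in place of real speech."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sr) for i in range(n)]


quiet = sine(0.01)   # RMS ~ 0.007, below the gate
loud = sine(0.10)    # RMS ~ 0.071, above the gate

print(keep_chunk(quiet), keep_chunk(loud))
```

A gate like this is cheap but level-dependent: a quiet speaker or a distant microphone can fall below the threshold even when speech is present.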

Environmental Impact

Estimated using the ML CO₂ Impact Calculator.

| Field | Value |
|---|---|
| Hardware | NVIDIA A10G |
| Training duration | 2 hours |
| Cloud provider | Modal Labs, Inc. |
| Compute region | US East |
| Estimated CO₂ emitted | 0.44 kg |

Citation

If you use this model in research, please cite:

@misc{jojo007unfi2024whisper-moderate,
  author    = {TinyefuzaJoe and MariaJemanabaccwa and KatulubaPaul and SsekibuuleRajabRayan},
  title     = {whisper-moderate-lora-adapter: LoRA fine-tune of Whisper large-v3
               for moderate dysarthric speech},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jojo007unfi/whisper-moderate-lora-adapter}
}

Framework Versions

  • PEFT 0.11.1
  • Transformers ≥ 4.40
  • PyTorch ≥ 2.1
  • Safetensors ≥ 0.4