whisper-moderate-lora-adapter
A LoRA fine-tune of openai/whisper-large-v3 specialised for moderate-severity dysarthric speech. This adapter is one of three severity-specific checkpoints produced by a larger system that routes audio through a wav2vec2-based severity classifier before transcription.
Model Details
| Field | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT 0.11.1) |
| Target severity | Moderate dysarthria |
| Language | English (en) |
| Task | Transcription |
| Framework | PyTorch + 🤗 Transformers |
| Inference dtype | float16 |
Companion models
| Severity | Repo |
|---|---|
| Mild | jojo007unfi/whisper-mild |
| Moderate | jojo007unfi/whisper-moderate (this model) |
| Severe | jojo007unfi/whisper-severe |
| Severity router | jojo007unfi/whisper-severity-classifier |
Motivation
Standard Whisper large-v3 struggles with dysarthric speech: irregular rhythm, reduced articulatory precision, and atypical prosody cause high word-error rates that make real-time transcription unreliable for accessibility use cases. This adapter was trained specifically on moderate-severity dysarthric audio to close that gap.
Performance
All metrics are evaluated on a held-out test split of moderate-severity dysarthric speech. Lower is better for both WER and CER.
| Model | WER (%) | CER (%) |
|---|---|---|
| whisper-large-v3 (baseline, no fine-tune) | 27.55 | 17.62 |
| This adapter (moderate LoRA) | 20.90 | 12.15 |
| Relative improvement | 24.1% lower | 31.0% lower |
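For readers new to the metric, the following sketch shows one common way to compute WER (word-level edit distance divided by the number of reference words) and reproduces the relative-improvement row from the reported figures. The function names are illustrative, and the example sentence is invented; the card does not state which scoring tool or text normalisation produced the numbers in the table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER as a fraction: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Relative error reduction in percent."""
    return (baseline - adapted) / baseline * 100.0


# Invented example: one of five reference words is dropped.
print(word_error_rate("please turn on the light", "please turn the light"))  # 0.2

# Figures from the table above.
print(round(relative_reduction(27.55, 20.90), 1))  # 24.1 (WER)
print(round(relative_reduction(17.62, 12.15), 1))  # 31.0 (CER)
```

CER follows the same pattern with characters instead of words as the tokens.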
System Architecture
This adapter is designed to be used inside a severity-routing pipeline, not in isolation:
Raw audio (16 kHz, mono)
        │
        ▼
SeverityClassifier (wav2vec2-base → MLP head)
        │  labels: mild | moderate | severe
        │
        ├─ mild     → whisper-mild-lora-adapter
        ├─ moderate → whisper-moderate-lora-adapter  ← this model
        ├─ severe   → whisper-severe-lora-adapter
        │
        ▼
Streaming transcription
(TextIteratorStreamer, greedy decode)
The classifier uses the first 8 seconds of audio to route the session. The Whisper LoRA adapter is then loaded, merged into the base weights (merge_and_unload()), and used for the remainder of the WebSocket session.
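The routing step can be sketched roughly as follows. This is a simplified illustration, not the serving script: `classify` is a hypothetical stand-in for the wav2vec2 severity classifier, the helper names are invented for this example, and the repo IDs are the ones listed under Companion models.

```python
# Repo IDs as listed in the "Companion models" table.
ADAPTER_REPOS = {
    "mild": "jojo007unfi/whisper-mild",
    "moderate": "jojo007unfi/whisper-moderate",
    "severe": "jojo007unfi/whisper-severe",
}

SAMPLE_RATE = 16_000   # pipeline input is 16 kHz mono
ROUTING_SECONDS = 8    # the classifier only sees the first 8 seconds


def routing_window(samples, sample_rate=SAMPLE_RATE, seconds=ROUTING_SECONDS):
    """Return the leading slice of audio that is passed to the classifier."""
    return samples[: sample_rate * seconds]


def select_adapter(label: str) -> str:
    """Map a severity label to the adapter repo to load for the session."""
    try:
        return ADAPTER_REPOS[label]
    except KeyError:
        raise ValueError(f"unknown severity label: {label!r}") from None


def route_session(samples, classify):
    """Choose one adapter per session.

    `classify` is a hypothetical callable standing in for the severity
    classifier; it receives the routing window and returns a label.
    """
    label = classify(routing_window(samples))
    return label, select_adapter(label)
```

Because the decision is made once per session, a speaker whose first 8 seconds are unrepresentative keeps the same adapter until the session ends.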
How to Use
Standalone inference
import numpy as np  # only needed for the type hint on transcribe()
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base_model_id = "openai/whisper-large-v3"
adapter_id = "jojo007unfi/whisper-moderate"  # repo ID as listed under Companion models

processor = WhisperProcessor.from_pretrained(base_model_id, language="en", task="transcribe")
base = WhisperForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
model = model.merge_and_unload()  # fuse LoRA weights for faster inference

# Clear forced/suppressed tokens so language and task are set per generate() call
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None

def transcribe(audio_array: "np.ndarray", sample_rate: int = 16_000) -> str:
    # Convert raw audio to log-mel input features
    inputs = processor(
        audio_array, sampling_rate=sample_rate,
        return_tensors="pt", return_attention_mask=True
    )
    input_features = inputs.input_features.to(model.device, dtype=torch.float16)
    attention_mask = inputs.attention_mask.to(model.device)
    with torch.no_grad():
        ids = model.generate(
            input_features,
            attention_mask=attention_mask,
            language="en",
            task="transcribe",
            num_beams=1,  # greedy decoding: lowest latency
            max_new_tokens=225,
            temperature=0.0,
            no_repeat_ngram_size=5,
            repetition_penalty=1.8,
            compression_ratio_threshold=1.35,
            condition_on_prev_tokens=False,
        )
    return processor.tokenizer.decode(ids[0], skip_special_tokens=True)
Inside the full routing pipeline
See jojo007unfi/whisper-severity-classifier for the classifier and the modal_streaming_whisper_severity.py serving script that orchestrates classifier + all three adapters over a WebSocket with real-time token streaming.
Training Details
Base model
openai/whisper-large-v3: a 1.5B-parameter encoder-decoder transformer pre-trained on 5 million hours of multilingual audio.
Fine-tuning method
Low-Rank Adaptation (LoRA) via PEFT 0.11.1. LoRA injects trainable rank-decomposition matrices into the attention layers of the Whisper decoder, keeping base model weights frozen. This drastically reduces trainable parameter count while matching full fine-tune quality on domain-specific data.
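To give a sense of scale, the sketch below counts trainable parameters for a single attention projection under LoRA, where the frozen weight W is augmented with a low-rank update B·A. The rank used here is a hypothetical value chosen for illustration, since this card does not state the rank, alpha, or exact target modules; the hidden size of 1280 is that of Whisper large-v3.

```python
D_MODEL = 1280  # hidden size of Whisper large-v3
RANK = 8        # hypothetical LoRA rank; the card does not state the value used


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense projection matrix W (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the low-rank pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


full = full_params(D_MODEL, D_MODEL)
lora = lora_params(D_MODEL, D_MODEL, RANK)
print(full, lora, f"{lora / full:.2%}")  # 1638400 20480 1.25%
```

Under these assumptions each adapted projection trains about 1.25% of the parameters it would need under full fine-tuning, and after `merge_and_unload()` the update is folded back into W so inference cost matches the base model.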
Training data
Moderate-severity dysarthric speech recordings, English, 16 kHz mono. Data sourced from [YOUR_DATASET]: annotated transcripts aligned with audio from speakers with moderate motor speech impairment.
Generation constraints (applied at both training and inference)
| Hyperparameter | Value | Rationale |
|---|---|---|
| no_repeat_ngram_size | 5 | Blocks 5-gram repeats; critical for dysarthric audio, where Whisper tends to loop |
| repetition_penalty | 1.8 | Strong penalty suppresses confabulation on unclear phonemes |
| compression_ratio_threshold | 1.35 | Rejects outputs that are too compressible (i.e. repetitive) |
| condition_on_prev_tokens | False | Prevents prior context from polluting predictions on short chunks |
| num_beams | 1 (streaming) | Greedy decode required for TextIteratorStreamer compatibility |
| max_new_tokens | 225 | Standard Whisper 30-second window limit |
| temperature | 0.0 | Deterministic output |
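To illustrate why a compression ratio flags looping output, the sketch below computes the ratio the way the reference OpenAI Whisper code does (UTF-8 byte length divided by zlib-compressed length) and compares it with the 1.35 threshold from the table. The two example strings are invented, and how the serving code acts on a rejected output is not shown here.

```python
import zlib

THRESHOLD = 1.35  # compression_ratio_threshold from the table above


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


# Invented examples: a plain sentence and a looped phrase.
normal = "could you please open the window"
looping = "open the window " * 20

for name, text in [("normal", normal), ("looping", looping)]:
    ratio = compression_ratio(text)
    verdict = "rejected" if ratio > THRESHOLD else "kept"
    print(f"{name}: ratio={ratio:.2f} -> {verdict}")
```

A looped phrase compresses to a small fraction of its size, so its ratio sits far above the threshold, while a short ordinary sentence barely compresses at all.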
Inference dtype
float16 on CUDA (NVIDIA A10G, 24 GB VRAM). The merged checkpoint fits comfortably alongside the severity classifier on a single GPU.
Intended Use
Direct use: Real-time or batch transcription of moderate-severity dysarthric English speech, particularly in accessibility tooling, AAC (augmentative and alternative communication) applications, and clinical documentation workflows.
Use within the routing system: Automatically selected by the wav2vec2 severity classifier when a speaker's dysarthria is detected as moderate severity.
Out-of-Scope Use
- Non-dysarthric general-purpose ASR: use the unmodified whisper-large-v3 instead; this adapter may underperform on typical speech due to domain shift.
- Languages other than English: the adapter was trained solely on English data.
- Speaker identification or any biometric inference: this model transcribes speech content only.
Limitations and Bias
- Performance degrades on speakers whose dysarthria presentation differs substantially from the training distribution (e.g. different aetiologies, accents, or recording conditions).
- The severity boundary between "moderate" and adjacent categories is fuzzy; mis-routing by the classifier will direct audio to this adapter when the severe adapter may have been more appropriate, or vice versa.
- Background noise and non-speech audio below the VAD RMS threshold (0.02) are silently dropped, so short utterances in noisy environments may be missed entirely.
- The model inherits any biases present in the base whisper-large-v3 for phonemes and vocabulary not well-represented in dysarthric training data.
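The RMS gate mentioned above can be pictured with a minimal sketch like the one below. It assumes audio samples are floats normalised to the range [-1, 1] and that the gate is applied per chunk; the function names are illustrative and the actual VAD in the serving script may differ.

```python
import math

VAD_RMS_THRESHOLD = 0.02  # value quoted in the limitation above


def rms(samples) -> float:
    """Root-mean-square energy of a chunk of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples, threshold: float = VAD_RMS_THRESHOLD) -> bool:
    """Keep a chunk only if its energy reaches the threshold; quieter chunks are dropped."""
    return rms(samples) >= threshold


quiet_chunk = [0.01, -0.01] * 800   # soft speech or low background noise
loud_chunk = [0.1, -0.1] * 800      # clearly audible speech
print(is_speech(quiet_chunk), is_speech(loud_chunk))  # False True
```

An energy gate of this kind cannot tell soft speech from silence, which is why quiet utterances can be lost without any error being raised.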
Environmental Impact
Estimated using the ML CO₂ Impact Calculator.
| Field | Value |
|---|---|
| Hardware | NVIDIA A10G |
| Training duration | 2 hours |
| Cloud provider | Modal Labs Inc. |
| Compute region | US East |
| Estimated CO₂ emitted | 0.44 kg |
Citation
If you use this model in research, please cite:
@misc{jojo007unfi2024whisper-moderate,
  author    = {TinyefuzaJoe and MariaJemanabaccwa and KatulubaPaul and SsekibuuleRajabRayan},
  title     = {whisper-moderate-lora-adapter: LoRA fine-tune of Whisper large-v3
               for moderate dysarthric speech},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jojo007unfi/whisper-moderate-lora-adapter}
}
Framework Versions
- PEFT 0.11.1
- Transformers ≥ 4.40
- PyTorch ≥ 2.1
- Safetensors ≥ 0.4