whisper-severe-lora-adapter
A LoRA fine-tune of openai/whisper-large-v3 specialised for severe-severity dysarthric speech. This adapter is one of three severity-specific checkpoints produced by a larger system that routes audio through a wav2vec2-based severity classifier before transcription.
Model Details
| Field | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT 0.11.1) |
| Target severity | Severe dysarthria |
| Language | English (en) |
| Task | Transcription |
| Framework | PyTorch + 🤗 Transformers |
| Inference dtype | float16 |
Companion models
| Severity | Repo |
|---|---|
| Mild | jojo007unfi/whisper-mild |
| Moderate | jojo007unfi/whisper-moderate |
| Severe | jojo007unfi/whisper-severe (this model) |
| Severity router | jojo007unfi/whisper-severity-classifier |
Motivation
Severe dysarthria presents the hardest transcription challenge for general-purpose ASR: highly reduced intelligibility, significant articulatory distortion, atypical prosody, and irregular timing cause standard Whisper large-v3 to produce transcripts that are often unusable. Speakers with severe dysarthria stand to gain the most from a domain-adapted model, yet are the least served by off-the-shelf systems. This adapter was trained specifically on severe-severity dysarthric audio to maximise intelligibility recovery for this underserved population.
Performance
All metrics are evaluated on a held-out test split of severe-severity dysarthric speech. Lower is better for both WER and CER.
| Model | WER (%) | CER (%) |
|---|---|---|
| whisper-large-v3 (baseline, no fine-tune) | 30.43 | 21.08 |
| This adapter (severe LoRA) | 25.35 | 17.77 |
| Relative improvement | ↓ 16.7% | ↓ 15.7% |
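The relative-improvement row follows directly from the two rows above it. As a minimal sketch (the helper names `word_error_rate` and `relative_improvement` are illustrative and not taken from the authors' evaluation code, and the card does not say which scoring tool was used), WER can be computed as a word-level edit distance divided by the reference length, and the improvement as the percentage reduction from the baseline:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the previous prefix of ref and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline: float, adapted: float) -> float:
    """Percentage reduction of an error rate relative to the baseline."""
    return (baseline - adapted) / baseline * 100.0


# Values copied from the table above.
wer_gain = round(relative_improvement(30.43, 25.35), 1)
cer_gain = round(relative_improvement(21.08, 17.77), 1)
print(wer_gain, cer_gain)
```

CER can be computed the same way by comparing characters instead of whitespace-separated words.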
System Architecture
This adapter is designed to be used inside a severity-routing pipeline, not in isolation:
Raw audio (16 kHz, mono)
        │
        ▼
SeverityClassifier (wav2vec2-base → MLP head)
        │ labels: mild | moderate | severe
        │
        ├─ mild → whisper-mild-lora-adapter
        ├─ moderate → whisper-moderate-lora-adapter
        └─ severe → whisper-severe-lora-adapter ← this model
        │
        ▼
Streaming transcription
(TextIteratorStreamer, greedy decode)
The classifier uses the first 8 seconds of audio to route the session. The LoRA adapter is merged into the base weights (merge_and_unload()) and used for the remainder of the WebSocket session.
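The route-once behaviour described above can be sketched as follows. This is an illustration only, not the serving script: `RoutedSession` and the `classify` callable are hypothetical names, audio is assumed to arrive as lists of float samples at 16 kHz, and the handling of sessions shorter than 8 seconds is left out because the card does not describe it.

```python
SAMPLE_RATE = 16_000     # Hz, as in the diagram above
ROUTING_SECONDS = 8      # classifier window stated in the text


class RoutedSession:
    """Buffers audio until the routing window is full, then fixes the label."""

    def __init__(self, classify):
        self._classify = classify   # hypothetical callable: samples -> label
        self._buffer = []
        self.label = None           # set once, then reused for the session

    def feed(self, chunk):
        """Add a chunk of samples; return the session label once it is known."""
        if self.label is None:
            self._buffer.extend(chunk)
            needed = SAMPLE_RATE * ROUTING_SECONDS
            if len(self._buffer) >= needed:
                # Classify only the first 8 seconds, and only once.
                self.label = self._classify(self._buffer[:needed])
        return self.label
```

Once `label` is set, later chunks no longer reach the classifier, which matches the idea of loading and merging a single adapter per WebSocket session.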
How to Use
Standalone inference
import torch
from peft import PeftModel
from transformers import WhisperForConditionalGeneration, WhisperProcessor

base_model_id = "openai/whisper-large-v3"
adapter_id = "jojo007unfi/whisper-severe"  # repo ID as listed under Companion models

processor = WhisperProcessor.from_pretrained(base_model_id, language="en", task="transcribe")
base = WhisperForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
model = model.merge_and_unload()  # fuse LoRA weights for faster inference

# Clear legacy decoding defaults so language/task are set per generate() call.
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None


def transcribe(audio_array, sample_rate: int = 16_000) -> str:
    # Convert raw audio to log-mel features plus an attention mask.
    inputs = processor(
        audio_array, sampling_rate=sample_rate,
        return_tensors="pt", return_attention_mask=True
    )
    input_features = inputs.input_features.to(model.device, dtype=torch.float16)
    attention_mask = inputs.attention_mask.to(model.device)
    with torch.no_grad():
        ids = model.generate(
            input_features,
            attention_mask=attention_mask,
            language="en",
            task="transcribe",
            num_beams=1,
            max_new_tokens=225,
            temperature=0.0,
            no_repeat_ngram_size=5,
            repetition_penalty=1.8,
            compression_ratio_threshold=1.35,
            condition_on_prev_tokens=False,
        )
    return processor.tokenizer.decode(ids[0], skip_special_tokens=True)
Tip: For severe dysarthria, audio pre-processing (noise reduction, normalisation) before passing to the model can further improve results.
Inside the full routing pipeline
See jojo007unfi/whisper-severity-classifier for the classifier and the modal_streaming_whisper_severity.py serving script that orchestrates classifier + all three adapters over a WebSocket with real-time token streaming.
Training Details
Base model
openai/whisper-large-v3: a 1.5 B parameter encoder-decoder transformer pre-trained on 5 million hours of multilingual audio.
Fine-tuning method
Low-Rank Adaptation (LoRA) via PEFT 0.11.1. LoRA injects trainable rank-decomposition matrices into the attention layers of the Whisper decoder, keeping base model weights frozen. Fine-tuning on severe dysarthric data teaches the decoder to map highly distorted acoustic patterns to correct token sequences without catastrophic forgetting of the base model's general speech knowledge.
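To make the low-rank update concrete, here is a toy sketch in plain Python. The matrix sizes, the rank, and the `alpha` value are invented for illustration (the card does not state the adapter's rank, alpha, or target modules). It shows that adding the scaled product of the two small matrices to the frozen weight gives the same output as running the adapter path separately, which is why a merge step such as merge_and_unload() can fuse the adapter without changing the model's predictions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Toy sizes (all values are illustrative): d_out = 2, d_in = 3, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight, d_out x d_in
B = [[0.5], [-0.5]]                      # trainable factor, d_out x r
A = [[1.0, 2.0, 0.0]]                    # trainable factor, r x d_in
alpha, r = 2.0, 1
scaling = alpha / r

x = [[1.0], [1.0], [1.0]]                # input as a column vector

# Adapter kept separate: y = W x + scaling * B (A x)
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged into the weight: W' = W + scaling * (B A), then y = W' x
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Only `A` and `B` would receive gradients during fine-tuning; `W` stays fixed, which is what keeps the base model's general speech knowledge intact.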
Training data
Severe-severity dysarthric speech recordings, English, 16 kHz mono. Data sourced from [YOUR_DATASET]: annotated transcripts aligned with audio from speakers with severe motor speech impairment.
Generation constraints (applied at both training and inference)
| Hyperparameter | Value | Rationale |
|---|---|---|
| no_repeat_ngram_size | 5 | Critical for severe cases: Whisper loops heavily on distorted phonemes |
| repetition_penalty | 1.8 | Strong penalty essential to suppress confabulation on low-intelligibility audio |
| compression_ratio_threshold | 1.35 | Rejects highly repetitive outputs that indicate model confusion |
| condition_on_prev_tokens | False | Prevents cascading errors in streaming when previous chunks were uncertain |
| num_beams | 1 (streaming) | Greedy decode required for TextIteratorStreamer compatibility |
| max_new_tokens | 225 | Standard Whisper 30-second window limit |
| temperature | 0.0 | Deterministic output; avoids stochastic confabulation |
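To illustrate why a compression ratio flags looping output, the sketch below uses a common zlib-based definition: the size of the UTF-8 text divided by the size of its compressed form, as in the original OpenAI Whisper code. How the Transformers generate() call or this project's serving script computes and applies the ratio may differ (for example, by working on token IDs). The helper names are illustrative.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed byte length."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_degenerate(text: str, threshold: float = 1.35) -> bool:
    """True when the text compresses better than the threshold allows."""
    return compression_ratio(text) > threshold


looping = "the cat " * 40                               # typical repetition loop
ordinary = "Please call my sister after lunch today."   # ordinary short utterance

print(looks_degenerate(looping), looks_degenerate(ordinary))
```

Repeated phrases compress very well, so their ratio rises far above the threshold, while a short, varied sentence barely compresses at all.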
Inference dtype
float16 on CUDA (NVIDIA A10G, 24 GB VRAM). The merged checkpoint fits alongside the severity classifier and the other two adapters on a single GPU.
Intended Use
Direct use: Real-time or batch transcription of severe-severity dysarthric English speech, particularly in accessibility tooling, AAC (augmentative and alternative communication) applications, and clinical documentation workflows where speaker intelligibility is significantly reduced.
Use within the routing system: Automatically selected by the wav2vec2 severity classifier when a speaker's dysarthria is detected as severe. The classifier defaults to moderate when its confidence falls below 50%, so edge cases on the severe/moderate boundary will fall to the moderate adapter.
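The fallback rule can be sketched as below. This is a minimal illustration, not the classifier's actual code: `select_adapter` and `CONFIDENCE_FLOOR` are hypothetical names, the classifier is assumed to return one probability per label, and the repo IDs are copied from the Companion models table.

```python
ADAPTERS = {
    "mild": "jojo007unfi/whisper-mild",
    "moderate": "jojo007unfi/whisper-moderate",
    "severe": "jojo007unfi/whisper-severe",
}
CONFIDENCE_FLOOR = 0.5  # below this, the router falls back to "moderate"


def select_adapter(probabilities: dict) -> str:
    """Pick the adapter repo for the most likely label, with the fallback."""
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence < CONFIDENCE_FLOOR:
        label = "moderate"
    return ADAPTERS[label]


print(select_adapter({"mild": 0.10, "moderate": 0.20, "severe": 0.70}))
print(select_adapter({"mild": 0.20, "moderate": 0.35, "severe": 0.45}))
```

In the second call, "severe" is the top label but its confidence is under 50%, so the moderate adapter is chosen. This is the boundary case the paragraph above describes.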
Out-of-Scope Use
- Non-dysarthric general-purpose ASR: use the unmodified whisper-large-v3 instead; this adapter is heavily domain-shifted and will underperform on typical speech.
- Languages other than English: the adapter was trained solely on English data.
- Speaker identification or any biometric inference: this model transcribes speech content only.
Limitations and Bias
- Severe dysarthria is highly speaker-dependent. Performance will vary more than the mild/moderate adapters depending on how closely a new speaker's presentation matches the training distribution.
- The severity boundary between "severe" and "moderate" is fuzzy; classifier mis-routing may direct audio here when the moderate adapter would have been more appropriate, or vice versa.
- Very low signal-to-noise audio may fall below the VAD RMS threshold (0.02) and be silently discarded; this disproportionately affects speakers with very low vocal intensity.
- The model inherits any biases present in the base whisper-large-v3 for phonemes and vocabulary not well-represented in dysarthric training data.
- Training data for severe dysarthria is inherently scarce. If your speaker's presentation is unlike any in the training set, results may be poor regardless of the adapter.
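The VAD limitation can be checked before sending audio to the service. The sketch below assumes samples are floats normalised to the range [-1, 1] and that the gate compares a chunk's root-mean-square level with the 0.02 threshold. The function names are illustrative, and the real serving script may window or smooth the signal differently.

```python
import math

VAD_RMS_THRESHOLD = 0.02  # value quoted in the limitations above


def rms(samples) -> float:
    """Root-mean-square level of a chunk of float samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def passes_vad(samples, threshold: float = VAD_RMS_THRESHOLD) -> bool:
    """True when the chunk is loud enough to be kept."""
    return rms(samples) >= threshold


quiet_chunk = [0.01, -0.01] * 100   # low vocal intensity
louder_chunk = [0.1, -0.1] * 100

print(passes_vad(quiet_chunk), passes_vad(louder_chunk))
```

If a speaker's recordings routinely fail this check, raising the input gain or normalising the audio before streaming may keep their speech from being dropped.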
Environmental Impact
Estimated using the ML CO₂ Impact Calculator.
| Field | Value |
|---|---|
| Hardware | NVIDIA A10G |
| Training duration | 2 hours |
| Cloud provider | Modal Labs, Inc. |
| Compute region | US East |
| Estimated CO₂ emitted | 0.44 kg |
Citation
@misc{jojo007unfi2024whisper-severe,
  author = {TinyefuzaJoe and MariaJemanabaccwa and KatulubaPaul and SsekibuuleRajabRayan},
  title = {whisper-severe-lora-adapter: LoRA fine-tune of Whisper large-v3
           for severe dysarthric speech},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/jojo007unfi/whisper-severe-lora-adapter}
}
Framework Versions
- PEFT 0.11.1
- Transformers ≥ 4.40
- PyTorch ≥ 2.1
- Safetensors ≥ 0.4