whisper-mild-lora-adapter
A LoRA fine-tune of openai/whisper-large-v3 specialised for mild-severity dysarthric speech. This adapter is one of three severity-specific checkpoints produced by a larger system that routes audio through a wav2vec2-based severity classifier before transcription.
Model Details
| Field | Value |
|---|---|
| Base model | openai/whisper-large-v3 |
| Fine-tuning method | LoRA (PEFT 0.11.1) |
| Target severity | Mild dysarthria |
| Language | English (en) |
| Task | Transcription |
| Framework | PyTorch + 🤗 Transformers |
| Inference dtype | float16 |
Companion models
| Severity | Repo |
|---|---|
| Mild | jojo007unfi/whisper-mild ← this model |
| Moderate | jojo007unfi/whisper-moderate |
| Severe | jojo007unfi/whisper-severe |
| Severity router | jojo007unfi/whisper-severity-classifier |
Motivation
Although mild dysarthria is the least severe form of motor speech impairment, standard Whisper large-v3 still shows elevated error rates on speakers with subtle articulatory differences, reduced prosodic range, or mildly irregular rhythm. This adapter is trained specifically on mild-severity dysarthric audio to reduce those residual errors and produce cleaner, more reliable transcripts for real-time accessibility use.
Performance
All metrics are evaluated on a held-out test split of mild-severity dysarthric speech. Lower is better for both WER and CER.
| Model | WER (%) | CER (%) |
|---|---|---|
| whisper-large-v3 (baseline, no fine-tune) | 25.45 | 14.92 |
| This adapter (mild LoRA) | 21.91 | 12.31 |
| Relative improvement | ↓ 13.9% | ↓ 17.5% |
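To make the table easier to check, here is a minimal sketch of how WER and the relative-improvement row can be computed. The function names `wer` and `relative_reduction` are illustrative and are not taken from the authors' evaluation code, which this card does not show. Text normalisation before scoring is also left out.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, adapted: float) -> float:
    """Relative error reduction in percent (positive means fewer errors)."""
    return (baseline - adapted) / baseline * 100.0
```

Applying `relative_reduction` to the WER and CER columns above reproduces the last row of the table when rounded to one decimal place. CER follows the same recipe with characters in place of words.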
System Architecture
This adapter is designed to be used inside a severity-routing pipeline, not in isolation:
Raw audio (16 kHz, mono)
        │
        ▼
SeverityClassifier (wav2vec2-base → MLP head)
        │  labels: mild | moderate | severe
        │
        ├─ mild     → whisper-mild-lora-adapter   ← this model
        ├─ moderate → whisper-moderate-lora-adapter
        └─ severe   → whisper-severe-lora-adapter
        │
        ▼
Streaming transcription
(TextIteratorStreamer, greedy decode)
The classifier uses the first 8 seconds of audio to route the session. The LoRA adapter is merged into the base weights (merge_and_unload()) and used for the remainder of the WebSocket session.
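The routing step described above can be sketched roughly as follows. This is not the serving script: `routing_window`, `select_adapter`, and the `classify` callable are hypothetical names introduced for illustration, and the real classifier interface may differ. The repo ids are the ones listed under Companion models.

```python
from typing import Callable, Sequence

SAMPLE_RATE = 16_000   # Hz, as stated in this card
ROUTING_SECONDS = 8    # the classifier sees the first 8 seconds

# Repo ids as listed in the Companion models table
ADAPTERS = {
    "mild": "jojo007unfi/whisper-mild",
    "moderate": "jojo007unfi/whisper-moderate",
    "severe": "jojo007unfi/whisper-severe",
}


def routing_window(audio: Sequence[float], sample_rate: int = SAMPLE_RATE) -> Sequence[float]:
    """Return the leading slice of audio used to classify severity."""
    return audio[: ROUTING_SECONDS * sample_rate]


def select_adapter(audio: Sequence[float], classify: Callable[[Sequence[float]], str]) -> str:
    """Classify the routing window once and return the adapter repo id for the session."""
    label = classify(routing_window(audio))
    if label not in ADAPTERS:
        raise ValueError(f"unexpected severity label: {label!r}")
    return ADAPTERS[label]
```

Because the decision is made once per session, a mis-routed speaker stays on the same adapter until the WebSocket session ends.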
How to Use
Standalone inference
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel
base_model_id = "openai/whisper-large-v3"
adapter_id = "jojo007unfi/whisper-mild"  # Hub repo id of this adapter (see Companion models)
processor = WhisperProcessor.from_pretrained(base_model_id, language="en", task="transcribe")
base = WhisperForConditionalGeneration.from_pretrained(
    base_model_id, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
model = model.merge_and_unload()  # fuse LoRA weights for faster inference

# Clear forced decoder ids and suppressed tokens; language and task are passed to generate() instead
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None


def transcribe(audio_array, sample_rate: int = 16_000) -> str:
    """Transcribe a mono waveform (1-D float array) sampled at 16 kHz."""
    inputs = processor(
        audio_array, sampling_rate=sample_rate,
        return_tensors="pt", return_attention_mask=True
    )
    # Match the model's device and half-precision dtype
    input_features = inputs.input_features.to(model.device, dtype=torch.float16)
    attention_mask = inputs.attention_mask.to(model.device)
    with torch.no_grad():
        ids = model.generate(
            input_features,
            attention_mask=attention_mask,
            language="en",
            task="transcribe",
            num_beams=1,
            max_new_tokens=225,
            temperature=0.0,
            no_repeat_ngram_size=5,
            repetition_penalty=1.8,
            compression_ratio_threshold=1.35,
            condition_on_prev_tokens=False,
        )
    return processor.tokenizer.decode(ids[0], skip_special_tokens=True)
Inside the full routing pipeline
See jojo007unfi/whisper-severity-classifier for the classifier and the modal_streaming_whisper_severity.py serving script that orchestrates classifier + all three adapters over a WebSocket with real-time token streaming.
Training Details
Base model
openai/whisper-large-v3: a 1.5 B-parameter encoder-decoder transformer pre-trained on 5 million hours of multilingual audio.
Fine-tuning method
Low-Rank Adaptation (LoRA) via PEFT 0.11.1. LoRA injects trainable rank-decomposition matrices into the attention layers of the Whisper decoder while keeping the base model weights frozen. This sharply reduces the number of trainable parameters while aiming to stay close to full fine-tuning quality on domain-specific data.
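As a rough illustration of the parameter saving: LoRA replaces the update to a frozen weight matrix of shape d_out × d_in with a product of two small matrices, B (d_out × r) and A (r × d_in), so each adapted matrix adds r · (d_in + d_out) trainable parameters. The sketch below works this out for one attention projection of whisper-large-v3 (hidden size 1280). The rank is an assumption: this card does not state the value used, so `RANK = 8` is purely illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight the pair is attached to (bias ignored)."""
    return d_in * d_out


D_MODEL = 1280  # hidden size of whisper-large-v3
RANK = 8        # hypothetical rank; not stated in this card

per_matrix = lora_trainable_params(D_MODEL, D_MODEL, RANK)
fraction = per_matrix / dense_params(D_MODEL, D_MODEL)

print(f"trainable per adapted matrix: {per_matrix}")
print(f"fraction of the frozen matrix: {fraction:.4f}")
```

Under these assumptions each adapted projection trains 20,480 parameters, or 1.25% of the 1,638,400 in the frozen matrix. A different rank scales both numbers linearly.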
Training data
Mild-severity dysarthric speech recordings, English, 16 kHz mono. Data sourced from [YOUR_DATASET]: annotated transcripts aligned with audio from speakers with mild motor speech impairment.
Generation constraints (applied at both training and inference)
| Hyperparameter | Value | Rationale |
|---|---|---|
| no_repeat_ngram_size | 5 | Blocks 5-gram repeats; prevents Whisper looping on irregular rhythm |
| repetition_penalty | 1.8 | Suppresses confabulation on phonemes that deviate from standard articulation |
| compression_ratio_threshold | 1.35 | Rejects outputs that are too compressible (repetitive) |
| condition_on_prev_tokens | False | Prevents prior context polluting predictions on short streaming chunks |
| num_beams | 1 (streaming) | Greedy decode required for TextIteratorStreamer compatibility |
| max_new_tokens | 225 | Standard Whisper 30-second window limit |
| temperature | 0.0 | Deterministic output |
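To show what "too compressible" means for the compression_ratio_threshold row, the sketch below uses the text-based definition found in OpenAI's reference Whisper code: UTF-8 size divided by zlib-compressed size. Whether the serving pipeline computes the ratio in exactly this way is an assumption, and `is_too_repetitive` is a hypothetical helper name.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def is_too_repetitive(text: str, threshold: float = 1.35) -> bool:
    """True when the text compresses better than the threshold allows."""
    return compression_ratio(text) > threshold
```

A looping output such as `"the cat " * 50` compresses very well and is flagged, while a short, varied sentence stays below the threshold because zlib gains little on it.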
Inference dtype
float16 on CUDA (NVIDIA A10G, 24 GB VRAM). The merged checkpoint fits alongside the severity classifier and the other two adapters on a single GPU.
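A back-of-the-envelope check of the memory claim, counting weights only. The Whisper parameter count is the figure quoted in this card. The classifier size (about 95 million parameters for wav2vec2-base) and the worst case of three fully merged Whisper copies resident at once are assumptions, and activations, KV cache, and framework overhead are ignored.

```python
BYTES_PER_FP16 = 2
GIB = 2 ** 30


def fp16_weight_gib(num_params: float) -> float:
    """Memory needed to hold the weights in float16, in GiB."""
    return num_params * BYTES_PER_FP16 / GIB


WHISPER_PARAMS = 1.5e9     # figure quoted in this card
CLASSIFIER_PARAMS = 95e6   # assumption: approximate size of wav2vec2-base

one_model = fp16_weight_gib(WHISPER_PARAMS)
worst_case = 3 * one_model + fp16_weight_gib(CLASSIFIER_PARAMS)

print(f"one merged model: {one_model:.2f} GiB")
print(f"three merged models + classifier: {worst_case:.2f} GiB")
```

Even this worst case stays well under the 24 GB available on an A10G, which is consistent with the statement above.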
Intended Use
Direct use: Real-time or batch transcription of mild-severity dysarthric English speech, particularly in accessibility tooling, AAC (augmentative and alternative communication) applications, and clinical documentation workflows.
Use within the routing system: Automatically selected by the wav2vec2 severity classifier when a speaker's dysarthria is detected as mild severity.
Out-of-Scope Use
- Non-dysarthric general-purpose ASR: use the unmodified whisper-large-v3 instead; this adapter may underperform on typical speech due to domain shift.
- Languages other than English: the adapter was trained solely on English data.
- Speaker identification or any biometric inference: this model transcribes speech content only.
Limitations and Bias
- Performance degrades on speakers whose mild dysarthria presentation differs substantially from the training distribution (e.g. different aetiologies, accents, or recording conditions).
- The severity boundary between "mild" and "moderate" is fuzzy; classifier mis-routing may direct audio here when the moderate adapter would have been more appropriate.
- Background noise and non-speech audio below the VAD RMS threshold (0.02) are silently dropped; short utterances in noisy environments may be missed entirely.
- The model inherits any biases present in the base whisper-large-v3 for phonemes and vocabulary not well-represented in dysarthric training data.
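The RMS gate mentioned in the limitations can be pictured with the following sketch. It assumes samples are floats normalised to [-1.0, 1.0] and that the gate is applied per chunk. The function names are hypothetical and the actual VAD logic in the serving script may differ.

```python
import math
from typing import Sequence

VAD_RMS_THRESHOLD = 0.02  # value quoted in this card


def rms(chunk: Sequence[float]) -> float:
    """Root-mean-square amplitude of a chunk of samples."""
    if len(chunk) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))


def keep_chunk(chunk: Sequence[float], threshold: float = VAD_RMS_THRESHOLD) -> bool:
    """True if the chunk is loud enough to be passed on for transcription."""
    return rms(chunk) >= threshold
```

A quiet utterance whose RMS falls under 0.02 is indistinguishable from silence to this kind of gate, which is why soft speech in a noisy room can be dropped without any error being raised.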
Environmental Impact
Estimated using the ML CO₂ Impact Calculator.
| Field | Value |
|---|---|
| Hardware | NVIDIA A10G |
| Training duration | 2 hours |
| Cloud provider | Modal Labs, Inc. |
| Compute region | US East |
| Estimated CO₂ emitted | 0.44 kg |
Citation
@misc{jojo007unfi2024whisper-mild,
  author    = {TinyefuzaJoe and Mariajemanabaccwa and KatulubaPaul and SsekibuuleRajabRayan},
  title     = {whisper-mild-lora-adapter: LoRA fine-tune of Whisper large-v3
               for mild dysarthric speech},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/jojo007unfi/whisper-mild-lora-adapter}
}
Framework Versions
- PEFT 0.11.1
- Transformers ≥ 4.40
- PyTorch ≥ 2.1
- Safetensors ≥ 0.4