Dysarthric Speech Severity Level Classifier
A regression probe trained on top of Whisper-large-v3 encoder features for estimating the severity level of dysarthric speech.
Score scale: 1.0 (most severe dysarthria) to 7.0 (typical speech)
GitHub: JaesungBae/DA-DSQA
Model Description
This model uses a three-stage training pipeline:
- Pseudo-labeling: a baseline probe generates pseudo-labels for unlabeled data.
- Contrastive pre-training: weakly supervised contrastive learning with typical-speech augmentation.
- Fine-tuning: the regression probe is fine-tuned with the pre-trained projector.
Architecture: Whisper-large-v3 encoder (frozen) → LayerNorm → 2-layer MLP (proj_dim=320) → Statistics Pooling (mean+std) → Linear → Score
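To make the Statistics Pooling (mean+std) step concrete, here is a minimal pure-Python sketch, not the repository's implementation. It collapses a variable-length sequence of frame vectors into one fixed-size vector by concatenating the per-dimension mean and standard deviation. The function name `stats_pool` is hypothetical, and the use of the population standard deviation (dividing by the number of frames) is an assumption; the actual code may use the sample estimate instead.

```python
import math


def stats_pool(frames):
    """Pool T frame vectors of width D into one vector of width 2*D: [means..., stds...]."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Population standard deviation (divide by T): an assumption of this sketch.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Three frames of a toy 2-dimensional projection (the model itself uses proj_dim=320).
pooled = stats_pool([[1.0, 0.0], [2.0, 0.0], [3.0, 6.0]])
print(len(pooled))  # 4
```

With proj_dim=320, the same operation would give the final Linear layer a 640-dimensional input regardless of utterance length.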
For details, see our paper:
Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech (arXiv:2603.15988)
Available Checkpoints
This repository contains 9 checkpoints trained with different contrastive losses:
| Checkpoint | Contrastive Loss | τ |
|---|---|---|
| proposed_L_coarse_tau0.1 | Proposed (L_coarse) | 0.1 |
| proposed_L_coarse_tau1.0 | Proposed (L_coarse) | 1.0 |
| proposed_L_coarse_tau10.0 | Proposed (L_coarse) | 10.0 |
| proposed_L_coarse_tau50.0 | Proposed (L_coarse) | 50.0 |
| proposed_L_coarse_tau100.0 (default) | Proposed (L_coarse) | 100.0 |
| proposed_L_cont_tau0.1 | Proposed (L_cont) | 0.1 |
| proposed_L_dis_tau1.0 | Proposed (L_dis) | 1.0 |
| rank-n-contrast_tau100.0 | Rank-N-Contrast | 100.0 |
| simclr_tau0.1 | SimCLR | 0.1 |
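The checkpoint names listed above all follow a `<loss>_tau<value>` pattern, so the loss variant and temperature can be recovered from the name alone. The helper below is a small sketch of that idea; `parse_checkpoint_name` is a hypothetical function, not part of the repository.

```python
def parse_checkpoint_name(name):
    """Split '<loss>_tau<value>' into (loss, tau as float)."""
    loss, sep, tau = name.rpartition("_tau")
    if not sep:
        raise ValueError(f"no '_tau' suffix in {name!r}")
    return loss, float(tau)


for name in ["proposed_L_coarse_tau100.0", "rank-n-contrast_tau100.0", "simclr_tau0.1"]:
    loss, tau = parse_checkpoint_name(name)
    print(f"{loss}: tau={tau}")
```

This can be handy for grouping results by loss type or sorting them by temperature when comparing checkpoints.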
Setup
1. Create conda environment
conda create -n da-dsqa python=3.10 -y
conda activate da-dsqa
2. Install PyTorch with CUDA
conda install pytorch torchaudio -c pytorch -y
For a GPU build with a specific CUDA version, see pytorch.org for the appropriate command.
3. Install remaining dependencies
pip install -r requirements.txt
Note: Silero VAD is loaded automatically at runtime via torch.hub; no separate installation is needed.
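After the installation steps, a quick check like the following can confirm that the main packages are importable before the large model download starts. This is a sketch: `missing_modules` is a hypothetical helper, and the list covers only the packages named in this README, since the contents of requirements.txt are not shown here.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages referenced in this README; requirements.txt may list more.
missing = missing_modules(["torch", "torchaudio", "huggingface_hub"])
print("Missing:", missing if missing else "none")
```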
Runtime Dependencies
This model loads openai/whisper-large-v3 (~6GB) and Silero VAD at initialization time. Ensure sufficient memory is available.
Usage
With the custom pipeline
import sys
from huggingface_hub import snapshot_download
# Download the model
model_dir = snapshot_download("jaesungbae/da-dsqa")
# Make pipeline.py importable (assumes it sits at the root of the downloaded snapshot)
sys.path.insert(0, model_dir)
# Load pipeline (defaults to proposed_L_coarse_tau100.0)
from pipeline import PreTrainedPipeline
pipe = PreTrainedPipeline(model_dir)
# Run inference
result = pipe("/path/to/audio.wav")
print(result)
# {"severity_score": 4.25, "raw_score": 4.2483, "model_name": "proposed_L_coarse_tau100.0"}
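The example output reports both a `raw_score` and a rounded `severity_score`. How the pipeline derives one from the other is not documented here, so the following is only a plausible sketch: it assumes the raw regression output is clipped to the 1.0 to 7.0 scale and rounded to two decimals. The function name and the clipping step are assumptions.

```python
def to_severity_score(raw_score, low=1.0, high=7.0, ndigits=2):
    """Clip a raw regression output to the score scale, then round it (assumed behavior)."""
    clipped = min(max(raw_score, low), high)
    return round(clipped, ndigits)


print(to_severity_score(4.2483))  # 4.25
print(to_severity_score(7.4))     # 7.0
print(to_severity_score(0.3))     # 1.0
```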
Select a specific checkpoint
# Option 1: specify at initialization
pipe = PreTrainedPipeline(model_dir, model_name="simclr_tau0.1")
# Option 2: switch at runtime (Whisper & VAD stay loaded)
pipe.switch_model("rank-n-contrast_tau100.0")
result = pipe("/path/to/audio.wav")
# Option 3: override per call
result = pipe("/path/to/audio.wav", model_name="proposed_L_dis_tau1.0")
Batch inference
results = pipe.batch_inference([
    "/path/to/audio1.wav",
    "/path/to/audio2.wav",
    "/path/to/audio3.wav",
])
for r in results:
    print(f"{r['file']}: {r['severity_score']}")
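Batch results are often easier to work with once saved to disk. The sketch below writes a list of result dictionaries to CSV using the `file` and `severity_score` keys shown above. `write_scores_csv` is a hypothetical helper, and the sample rows are placeholders standing in for the output of `pipe.batch_inference`, not real model scores.

```python
import csv
import io


def write_scores_csv(results, handle):
    """Write one row per result, keeping only the file path and severity score."""
    writer = csv.DictWriter(
        handle, fieldnames=["file", "severity_score"], extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(results)


# Placeholder rows in the shape returned by batch inference (values are illustrative).
sample_results = [
    {"file": "/path/to/audio1.wav", "severity_score": 4.25, "raw_score": 4.2483},
    {"file": "/path/to/audio2.wav", "severity_score": 6.1, "raw_score": 6.0972},
]

buffer = io.StringIO()
write_scores_csv(sample_results, buffer)
print(buffer.getvalue())
```

To write to a file instead, pass a handle from `open("scores.csv", "w", newline="")`.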
List available checkpoints
print(pipe.list_models())
# ['proposed_L_coarse_tau0.1', 'proposed_L_coarse_tau1.0', ...]
Compare all checkpoints on a single file
for name in pipe.list_models():
    result = pipe("/path/to/audio.wav", model_name=name)
    print(f"{name}: {result['severity_score']}")
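When comparing checkpoints on one file, it can help to summarize how much the scores disagree. The sketch below assumes the per-checkpoint scores have been collected into a dictionary; `summarize_scores` is a hypothetical helper, and the checkpoint names and numbers are placeholders, not actual model outputs.

```python
import statistics


def summarize_scores(scores):
    """Return the mean, minimum, maximum, and range of a {checkpoint: score} mapping."""
    values = list(scores.values())
    if not values:
        raise ValueError("no scores to summarize")
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


# Placeholder scores keyed by made-up checkpoint names.
summary = summarize_scores({"checkpoint_a": 4.0, "checkpoint_b": 4.5, "checkpoint_c": 5.0})
print(summary)
```

A large range across checkpoints may indicate that the recording is hard to score and that a single estimate should be read with caution.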
Standalone inference
Clone the full repository and run:
python inference.py \
    --wav /path/to/audio.wav \
    --checkpoint ./checkpoints/stage3/proposed_L_coarse_tau100.0/average
Citation
@misc{bae2026something,
title = {Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech},
author = {Jaesung Bae and Xiuwen Zheng and Minje Kim and Chang D. Yoo and Mark Hasegawa-Johnson},
year = {2026},
eprint = {2603.15988},
archivePrefix = {arXiv},
primaryClass = {eess.AS},
url = {https://arxiv.org/abs/2603.15988}
}