---
license: mit
tags:
- speech
- dysarthria
- severity-estimation
- whisper
- audio-classification
language:
- en
pipeline_tag: audio-classification
---
# Dysarthric Speech Severity Level Classifier
A regression probe trained on top of Whisper-large-v3 encoder features for estimating the severity level of dysarthric speech.
**Score scale:** 1.0 (most severe dysarthria) to 7.0 (typical speech)
**GitHub:** [JaesungBae/DA-DSQA](https://github.com/JaesungBae/DA-DSQA)
## Model Description
This model uses a three-stage training pipeline:
1. **Pseudo-labeling**: A baseline probe generates pseudo-labels for unlabeled data
2. **Contrastive pre-training**: Weakly-supervised contrastive learning with typical speech augmentation
3. **Fine-tuning**: Regression probe fine-tuned with the pre-trained projector
**Architecture:** Whisper-large-v3 encoder (frozen) → LayerNorm → 2-layer MLP (proj_dim=320) → Statistics Pooling (mean+std) → Linear → Score
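The probe layers listed above can be sketched in PyTorch as follows. This is a minimal illustration rather than the repository's implementation: the class name `SeverityProbe`, the ReLU activation, and the 1280-dimensional input (the encoder width of Whisper-large-v3) are assumptions. Only `proj_dim=320` and the layer order come from this card.

```python
import torch
import torch.nn as nn


class SeverityProbe(nn.Module):
    """Hypothetical probe: LayerNorm -> 2-layer MLP -> mean+std pooling -> Linear."""

    def __init__(self, feat_dim: int = 1280, proj_dim: int = 320):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        # 2-layer MLP projector; the ReLU in between is an assumption.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        # Mean and std are concatenated, so the head sees 2 * proj_dim inputs.
        self.head = nn.Linear(2 * proj_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim), e.g. frozen Whisper encoder outputs.
        x = self.projector(self.norm(feats))
        pooled = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)
        return self.head(pooled).squeeze(-1)  # one score per utterance


probe = SeverityProbe()
dummy = torch.randn(2, 50, 1280)  # random stand-in for encoder features
scores = probe(dummy)
print(scores.shape)
```

Statistics pooling collapses the variable-length frame axis, so utterances of any duration map to a single fixed-size vector before the final linear layer.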
For details, see our paper:
> **Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech** [[arXiv]](https://arxiv.org/abs/2603.15988)
## Available Checkpoints
This repository contains **9 checkpoints** trained with different contrastive losses:
| Checkpoint | Contrastive Loss | τ |
|---|---|---|
| `proposed_L_coarse_tau0.1` | Proposed (L_coarse) | 0.1 |
| `proposed_L_coarse_tau1.0` | Proposed (L_coarse) | 1.0 |
| `proposed_L_coarse_tau10.0` | Proposed (L_coarse) | 10.0 |
| `proposed_L_coarse_tau50.0` | Proposed (L_coarse) | 50.0 |
| **`proposed_L_coarse_tau100.0`** (default) | Proposed (L_coarse) | 100.0 |
| `proposed_L_cont_tau0.1` | Proposed (L_cont) | 0.1 |
| `proposed_L_dis_tau1.0` | Proposed (L_dis) | 1.0 |
| `rank-n-contrast_tau100.0` | Rank-N-Contrast | 100.0 |
| `simclr_tau0.1` | SimCLR | 0.1 |
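
Each checkpoint name in the table ends in `_tau<value>`, so the loss and τ can be recovered from the name alone. A minimal sketch, assuming that naming convention holds for every checkpoint; `parse_checkpoint_name` is a hypothetical helper, not part of the pipeline:

```python
def parse_checkpoint_name(name: str) -> tuple[str, float]:
    """Split a checkpoint name such as 'simclr_tau0.1' into (loss, tau)."""
    loss, sep, tau = name.rpartition("_tau")
    if not sep:
        raise ValueError(f"no '_tau' suffix in {name!r}")
    return loss, float(tau)


# Names copied from the table above.
names = [
    "proposed_L_coarse_tau0.1",
    "proposed_L_coarse_tau1.0",
    "proposed_L_coarse_tau10.0",
    "proposed_L_coarse_tau50.0",
    "proposed_L_coarse_tau100.0",
    "proposed_L_cont_tau0.1",
    "proposed_L_dis_tau1.0",
    "rank-n-contrast_tau100.0",
    "simclr_tau0.1",
]

# Group the available tau values by contrastive loss.
by_loss: dict[str, list[float]] = {}
for name in names:
    loss, tau = parse_checkpoint_name(name)
    by_loss.setdefault(loss, []).append(tau)

for loss, taus in by_loss.items():
    print(loss, taus)
```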
## Setup
### 1. Create conda environment
```bash
conda create -n da-dsqa python=3.10 -y
conda activate da-dsqa
```
### 2. Install PyTorch with CUDA
```bash
conda install pytorch torchaudio -c pytorch -y
```
> For a GPU build with a specific CUDA version, see [pytorch.org](https://pytorch.org/get-started/locally/) for the appropriate command.
### 3. Install remaining dependencies
```bash
pip install -r requirements.txt
```
> **Note:** [Silero VAD](https://github.com/snakers4/silero-vad) is loaded automatically at runtime via `torch.hub`; no separate installation is needed.
### Runtime Dependencies
This model loads **openai/whisper-large-v3** (~6GB) and **Silero VAD** at initialization time. Ensure sufficient memory is available.
## Usage
### With the custom pipeline
```python
import sys
from huggingface_hub import snapshot_download
# Download the model
model_dir = snapshot_download("jaesungbae/da-dsqa")
# Make pipeline.py importable (assumes it ships in the root of the downloaded repo)
sys.path.insert(0, model_dir)
# Load pipeline (defaults to proposed_L_coarse_tau100.0)
from pipeline import PreTrainedPipeline
pipe = PreTrainedPipeline(model_dir)
# Run inference
result = pipe("/path/to/audio.wav")
print(result)
# {"severity_score": 4.25, "raw_score": 4.2483, "model_name": "proposed_L_coarse_tau100.0"}
```
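
In the example output, `severity_score` looks like `raw_score` rounded to two decimals. A regression head is not inherently bounded, so a caller may want to keep downstream values inside the documented 1.0 to 7.0 scale. A minimal sketch, assuming the pipeline does not already clamp its output (this card does not say either way); `clamp_score` is a hypothetical helper:

```python
SCORE_MIN, SCORE_MAX = 1.0, 7.0  # scale documented at the top of this card


def clamp_score(raw_score: float, ndigits: int = 2) -> float:
    """Clamp a raw regression output to the documented scale, then round."""
    return round(min(max(raw_score, SCORE_MIN), SCORE_MAX), ndigits)


print(clamp_score(4.2483))  # 4.25
print(clamp_score(7.31))    # 7.0
print(clamp_score(0.6))     # 1.0
```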
### Select a specific checkpoint
```python
# Option 1: specify at initialization
pipe = PreTrainedPipeline(model_dir, model_name="simclr_tau0.1")
# Option 2: switch at runtime (Whisper & VAD stay loaded)
pipe.switch_model("rank-n-contrast_tau100.0")
result = pipe("/path/to/audio.wav")
# Option 3: override per call
result = pipe("/path/to/audio.wav", model_name="proposed_L_dis_tau1.0")
```
### Batch inference
```python
results = pipe.batch_inference([
    "/path/to/audio1.wav",
    "/path/to/audio2.wav",
    "/path/to/audio3.wav",
])
for r in results:
    print(f"{r['file']}: {r['severity_score']}")
```
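
Batch results are a list of dicts, so they are easy to export for later analysis. A minimal sketch that sorts by score (most severe first) and writes CSV text, assuming each result carries the `file` and `severity_score` keys used above; `results_to_csv` is a hypothetical helper and the sample values are made up:

```python
import csv
import io


def results_to_csv(results: list[dict]) -> str:
    """Return CSV text with one row per file, lowest (most severe) score first."""
    rows = sorted(results, key=lambda r: r["severity_score"])
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["file", "severity_score"], extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up stand-in for the output of pipe.batch_inference([...]).
example = [
    {"file": "a.wav", "severity_score": 5.1},
    {"file": "b.wav", "severity_score": 2.4},
]
csv_text = results_to_csv(example)
print(csv_text)
```

`extrasaction="ignore"` drops any additional keys in the result dicts (such as `raw_score` or `model_name`) instead of raising an error.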
### List available checkpoints
```python
print(pipe.list_models())
# ['proposed_L_coarse_tau0.1', 'proposed_L_coarse_tau1.0', ...]
```
### Compare all checkpoints on a single file
```python
for name in pipe.list_models():
    result = pipe("/path/to/audio.wav", model_name=name)
    print(f"{name}: {result['severity_score']}")
```
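
When comparing checkpoints, a summary of how much the scores agree can be more useful than the raw list. A minimal sketch using only the standard library; `summarize` is a hypothetical helper and the scores below are made-up placeholders, not measured results:

```python
from statistics import mean, stdev


def summarize(scores_by_model: dict[str, float]) -> dict[str, float]:
    """Summarize per-checkpoint scores for one audio file."""
    values = list(scores_by_model.values())
    return {
        "mean": round(mean(values), 2),
        "stdev": round(stdev(values), 2) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


# With a loaded pipeline, the dict could be built as:
#   scores = {n: pipe(wav, model_name=n)["severity_score"] for n in pipe.list_models()}
scores = {
    "simclr_tau0.1": 4.0,
    "rank-n-contrast_tau100.0": 4.4,
    "proposed_L_coarse_tau100.0": 4.2,
}
summary = summarize(scores)
print(summary)
```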
### Standalone inference
Clone the [full repository](https://github.com/JaesungBae/DA-DSQA) and run:
```bash
python inference.py \
--wav /path/to/audio.wav \
--checkpoint ./checkpoints/stage3/proposed_L_coarse_tau100.0/average
```
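
To score many files with the standalone script, the same command can be generated once per file. A minimal sketch that only builds and prints the commands, using the `--wav` and `--checkpoint` flags shown above; the `recordings` directory and the `build_commands` helper are hypothetical:

```python
import shlex
from pathlib import Path

CHECKPOINT = "./checkpoints/stage3/proposed_L_coarse_tau100.0/average"


def build_commands(wav_paths, checkpoint: str = CHECKPOINT) -> list[list[str]]:
    """Build one inference.py invocation per audio file."""
    return [
        ["python", "inference.py", "--wav", str(p), "--checkpoint", checkpoint]
        for p in wav_paths
    ]


wavs = sorted(Path("recordings").glob("*.wav"))  # hypothetical input directory
for cmd in build_commands(wavs):
    print(shlex.join(cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```

Each invocation starts a fresh Python process and so reloads Whisper; for large batches, `batch_inference` from the custom pipeline is likely the faster route.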
## Citation
```bibtex
@misc{bae2026something,
title = {Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech},
author = {Jaesung Bae and Xiuwen Zheng and Minje Kim and Chang D. Yoo and Mark Hasegawa-Johnson},
year = {2026},
eprint = {2603.15988},
archivePrefix = {arXiv},
primaryClass = {eess.AS},
url = {https://arxiv.org/abs/2603.15988}
}
```