---
license: mit
tags:
  - speech
  - dysarthria
  - severity-estimation
  - whisper
  - audio-classification
language:
  - en
pipeline_tag: audio-classification
---

# Dysarthric Speech Severity Level Classifier

A regression probe trained on top of Whisper-large-v3 encoder features for estimating the severity level of dysarthric speech.

**Score scale:** 1.0 (most severe dysarthria) to 7.0 (typical speech)

**GitHub:** [JaesungBae/DA-DSQA](https://github.com/JaesungBae/DA-DSQA)

## Model Description

This model uses a three-stage training pipeline:
1. **Pseudo-labeling:** A baseline probe generates pseudo-labels for unlabeled data
2. **Contrastive pre-training:** Weakly-supervised contrastive learning with typical-speech augmentation
3. **Fine-tuning:** The regression probe is fine-tuned with the pre-trained projector
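
The proposed contrastive losses (L_coarse, L_cont, L_dis) are defined in the paper; the checkpoint table below also includes SimCLR, whose generic temperature-scaled InfoNCE form can be sketched as follows. This NumPy version and its pairing convention (`z1[i]` and `z2[i]` are embeddings of two augmented views of the same utterance) are illustrative assumptions, not the repository's code.

```python
import numpy as np

def info_nce(z1, z2, tau):
    """SimCLR-style InfoNCE loss with temperature tau.

    z1, z2: (N, D) embedding batches; z1[i] and z2[i] are a positive pair.
    """
    # L2-normalize so the dot product is a cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                       # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on the diagonal
```

A larger τ flattens the softmax over negatives, which is one reason the table below sweeps τ over several orders of magnitude.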

**Architecture:** Whisper-large-v3 encoder (frozen) → LayerNorm → 2-layer MLP (proj_dim=320) → Statistics Pooling (mean+std) → Linear → Score
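
At the shape level, the probe head can be sketched as below. The dimensions (1280-d Whisper-large-v3 encoder frames, proj_dim=320, mean+std pooling to 640) follow the description above; the NumPy forward pass with random weights is only an illustrative stand-in for the trained PyTorch modules.

```python
import numpy as np

def probe_head(feats, proj_dim=320, seed=0):
    """Shape-level sketch of the regression probe on frozen encoder features.

    feats: (T, 1280) array of Whisper-large-v3 encoder frames.
    """
    rng = np.random.default_rng(seed)
    d = feats.shape[-1]
    # LayerNorm over the feature dimension
    x = (feats - feats.mean(-1, keepdims=True)) / (feats.std(-1, keepdims=True) + 1e-5)
    # 2-layer MLP projector (random weights here; trained in the real probe)
    w1 = rng.standard_normal((d, proj_dim))
    w2 = rng.standard_normal((proj_dim, proj_dim))
    x = np.maximum(x @ w1, 0.0) @ w2                 # (T, 320)
    # Statistics pooling: concatenate mean and std over time
    pooled = np.concatenate([x.mean(0), x.std(0)])   # (640,)
    # Final linear layer -> scalar severity score
    w_out = rng.standard_normal(2 * proj_dim)
    return float(pooled @ w_out)
```

Only the small head on top of the frozen encoder is trained, which keeps the per-checkpoint footprint small relative to the shared Whisper weights.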

For details, see our paper:
> **Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech** [[arXiv]](https://arxiv.org/abs/2603.15988)

## Available Checkpoints

This repository contains **9 checkpoints** trained with different contrastive losses:

| Checkpoint | Contrastive Loss | τ |
|---|---|---|
| `proposed_L_coarse_tau0.1` | Proposed (L_coarse) | 0.1 |
| `proposed_L_coarse_tau1.0` | Proposed (L_coarse) | 1.0 |
| `proposed_L_coarse_tau10.0` | Proposed (L_coarse) | 10.0 |
| `proposed_L_coarse_tau50.0` | Proposed (L_coarse) | 50.0 |
| **`proposed_L_coarse_tau100.0`** (default) | Proposed (L_coarse) | 100.0 |
| `proposed_L_cont_tau0.1` | Proposed (L_cont) | 0.1 |
| `proposed_L_dis_tau1.0` | Proposed (L_dis) | 1.0 |
| `rank-n-contrast_tau100.0` | Rank-N-Contrast | 100.0 |
| `simclr_tau0.1` | SimCLR | 0.1 |

## Setup

### 1. Create conda environment

```bash
conda create -n da-dsqa python=3.10 -y
conda activate da-dsqa
```

### 2. Install PyTorch with CUDA

```bash
conda install pytorch torchaudio -c pytorch -y
```

> For a GPU build with a specific CUDA version, see [pytorch.org](https://pytorch.org/get-started/locally/) for the appropriate command.

### 3. Install remaining dependencies

```bash
pip install -r requirements.txt
```

> **Note:** [Silero VAD](https://github.com/snakers4/silero-vad) is loaded automatically at runtime via `torch.hub`; no separate installation is needed.

### Runtime Dependencies

This model loads **openai/whisper-large-v3** (~6 GB) and **Silero VAD** at initialization time, so the first run downloads both; ensure sufficient disk space and memory are available.

## Usage

### With the custom pipeline

```python
from huggingface_hub import snapshot_download

# Download the model
model_dir = snapshot_download("jaesungbae/da-dsqa")

# Load pipeline (defaults to proposed_L_coarse_tau100.0)
from pipeline import PreTrainedPipeline
pipe = PreTrainedPipeline(model_dir)

# Run inference
result = pipe("/path/to/audio.wav")
print(result)
# {"severity_score": 4.25, "raw_score": 4.2483, "model_name": "proposed_L_coarse_tau100.0"}
```

### Select a specific checkpoint

```python
# Option 1: specify at initialization
pipe = PreTrainedPipeline(model_dir, model_name="simclr_tau0.1")

# Option 2: switch at runtime (Whisper & VAD stay loaded)
pipe.switch_model("rank-n-contrast_tau100.0")
result = pipe("/path/to/audio.wav")

# Option 3: override per call
result = pipe("/path/to/audio.wav", model_name="proposed_L_dis_tau1.0")
```

### Batch inference

```python
results = pipe.batch_inference([
    "/path/to/audio1.wav",
    "/path/to/audio2.wav",
    "/path/to/audio3.wav",
])
for r in results:
    print(f"{r['file']}: {r['severity_score']}")
```

### List available checkpoints

```python
print(pipe.list_models())
# ['proposed_L_coarse_tau0.1', 'proposed_L_coarse_tau1.0', ...]
```

### Compare all checkpoints on a single file

```python
for name in pipe.list_models():
    result = pipe("/path/to/audio.wav", model_name=name)
    print(f"{name}: {result['severity_score']}")
```

### Standalone inference

Clone the [full repository](https://github.com/JaesungBae/DA-DSQA) and run:

```bash
python inference.py \
    --wav /path/to/audio.wav \
    --checkpoint ./checkpoints/stage3/proposed_L_coarse_tau100.0/average
```

## Citation

```bibtex
@misc{bae2026something,
  title         = {Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech},
  author        = {Jaesung Bae and Xiuwen Zheng and Minje Kim and Chang D. Yoo and Mark Hasegawa-Johnson},
  year          = {2026},
  eprint        = {2603.15988},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2603.15988}
}
```