	---
	license: mit
	tags:
	- speech
	- dysarthria
	- severity-estimation
	- whisper
	- audio-classification
	language:
	- en
	pipeline_tag: audio-classification
	---

	# Dysarthric Speech Severity Level Classifier

	A regression probe trained on top of Whisper-large-v3 encoder features for estimating the severity level of dysarthric speech.

	Score scale: 1.0 (most severe dysarthria) to 7.0 (typical speech)

	GitHub: [JaesungBae/DA-DSQA](https://github.com/JaesungBae/DA-DSQA)

	## Model Description

	This model uses a three-stage training pipeline:
	1. Pseudo-labeling — A baseline probe generates pseudo-labels for unlabeled data
	2. Contrastive pre-training — Weakly-supervised contrastive learning with typical speech augmentation
	3. Fine-tuning — Regression probe fine-tuned with the pre-trained projector

	Architecture: Whisper-large-v3 encoder (frozen) → LayerNorm → 2-layer MLP (proj_dim=320) → Statistics Pooling (mean+std) → Linear → Score
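The layer chain above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's implementation: the class name `SeverityProbeSketch`, the ReLU activation, the width of the MLP's hidden layer, and pooling over the time axis are all assumptions. The input width of 1280 matches the Whisper-large-v3 encoder's hidden size, and `proj_dim=320` comes from the description above.

```python
import torch
import torch.nn as nn


class SeverityProbeSketch(nn.Module):
    """Illustrative probe head; activation and hidden width are assumptions."""

    def __init__(self, feat_dim: int = 1280, proj_dim: int = 320):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        # 2-layer MLP projector (hypothetical layout: Linear, ReLU, Linear)
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )
        # Statistics pooling concatenates mean and std, doubling the width
        self.head = nn.Linear(2 * proj_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) frozen encoder outputs
        x = self.projector(self.norm(feats))
        pooled = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)
        return self.head(pooled).squeeze(-1)  # one scalar score per utterance


probe = SeverityProbeSketch()
dummy_feats = torch.randn(2, 50, 1280)  # random stand-in for encoder features
scores = probe(dummy_feats)
print(scores.shape)
```

An untrained head like this produces arbitrary values; only the shapes and the order of operations are meant to carry over.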

	For details, see our paper:
	> Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech [[arXiv]](https://arxiv.org/abs/2603.15988)

	## Available Checkpoints

	This repository contains 9 checkpoints trained with different contrastive losses and values of τ:

	| Checkpoint | Contrastive Loss | τ |
	|---|---|---|
	| `proposed_L_coarse_tau0.1` | Proposed (L_coarse) | 0.1 |
	| `proposed_L_coarse_tau1.0` | Proposed (L_coarse) | 1.0 |
	| `proposed_L_coarse_tau10.0` | Proposed (L_coarse) | 10.0 |
	| `proposed_L_coarse_tau50.0` | Proposed (L_coarse) | 50.0 |
	| `proposed_L_coarse_tau100.0` (default) | Proposed (L_coarse) | 100.0 |
	| `proposed_L_cont_tau0.1` | Proposed (L_cont) | 0.1 |
	| `proposed_L_dis_tau1.0` | Proposed (L_dis) | 1.0 |
	| `rank-n-contrast_tau100.0` | Rank-N-Contrast | 100.0 |
	| `simclr_tau0.1` | SimCLR | 0.1 |
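The checkpoint names in the table appear to follow a `<loss>_tau<τ>` pattern. Assuming that pattern holds for every checkpoint, a small helper can recover both parts from a name, which is handy when grouping results by loss. The function `parse_checkpoint_name` is hypothetical and not part of the repository.

```python
def parse_checkpoint_name(name: str) -> tuple[str, float]:
    """Split '<loss>_tau<value>' into (loss, value); assumes the naming pattern above."""
    loss, tau = name.rsplit("_tau", 1)
    return loss, float(tau)


parsed_default = parse_checkpoint_name("proposed_L_coarse_tau100.0")
parsed_simclr = parse_checkpoint_name("simclr_tau0.1")
print(parsed_default, parsed_simclr)
```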

	## Setup

	### 1. Create conda environment

	```bash
	conda create -n da-dsqa python=3.10 -y
	conda activate da-dsqa
	```

	### 2. Install PyTorch with CUDA

	```bash
	conda install pytorch torchaudio -c pytorch -y
	```

	> For a GPU build with a specific CUDA version, see [pytorch.org](https://pytorch.org/get-started/locally/) for the appropriate command.

	### 3. Install remaining dependencies

	```bash
	pip install -r requirements.txt
	```

	> Note: [Silero VAD](https://github.com/snakers4/silero-vad) is loaded automatically at runtime via `torch.hub` — no separate installation needed.

	### Runtime Dependencies

	The pipeline loads `openai/whisper-large-v3` (~6 GB) and Silero VAD when it is initialized, so make sure enough memory is available before creating it.

	## Usage

	### With the custom pipeline

	```python
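	import sys

	from huggingface_hub import snapshot_download

	# Download the model
	model_dir = snapshot_download("jaesungbae/da-dsqa")

	# Make the snapshot's pipeline.py importable
	# (assumes it sits at the root of the downloaded repository)
	sys.path.insert(0, model_dir)
	from pipeline import PreTrainedPipeline

	# Load pipeline (defaults to proposed_L_coarse_tau100.0)
	pipe = PreTrainedPipeline(model_dir)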

	# Run inference
	result = pipe("/path/to/audio.wav")
	print(result)
	# {"severity_score": 4.25, "raw_score": 4.2483, "model_name": "proposed_L_coarse_tau100.0"}
	```

	### Select a specific checkpoint

	```python
	# Option 1: specify at initialization
	pipe = PreTrainedPipeline(model_dir, model_name="simclr_tau0.1")

	# Option 2: switch at runtime (Whisper & VAD stay loaded)
	pipe.switch_model("rank-n-contrast_tau100.0")
	result = pipe("/path/to/audio.wav")

	# Option 3: override per call
	result = pipe("/path/to/audio.wav", model_name="proposed_L_dis_tau1.0")
	```

	### Batch inference

	```python
	results = pipe.batch_inference([
	    "/path/to/audio1.wav",
	    "/path/to/audio2.wav",
	    "/path/to/audio3.wav",
	])
	for r in results:
	    print(f"{r['file']}: {r['severity_score']}")
	```
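For larger batches it can be convenient to save the scores instead of printing them. The sketch below writes results to CSV using only the standard library. It assumes each result is a dict with `file` and `severity_score` keys, as in the loop above; the helper `scores_to_csv` and the example values are made up for illustration.

```python
import csv
import io


def scores_to_csv(results, out):
    """Write one 'file,severity_score' row per result to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["file", "severity_score"])
    for r in results:
        writer.writerow([r["file"], r["severity_score"]])


# Stand-in for the output of pipe.batch_inference(...); values are invented.
example_results = [
    {"file": "/path/to/audio1.wav", "severity_score": 4.25},
    {"file": "/path/to/audio2.wav", "severity_score": 6.1},
]

buffer = io.StringIO()
scores_to_csv(example_results, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with `open(path, "w", newline="")` in place of the `StringIO` buffer.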

	### List available checkpoints

	```python
	print(pipe.list_models())
	# ['proposed_L_coarse_tau0.1', 'proposed_L_coarse_tau1.0', ...]
	```

	### Compare all checkpoints on a single file

	```python
	for name in pipe.list_models():
	    result = pipe("/path/to/audio.wav", model_name=name)
	    print(f"{name}: {result['severity_score']}")
	```
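When comparing checkpoints, a quick summary of how much the scores disagree can be more readable than the raw list. The sketch below assumes the per-checkpoint scores have been collected into a dict keyed by checkpoint name; `summarize_scores` is a hypothetical helper and the numbers are invented.

```python
from statistics import mean, pstdev


def summarize_scores(scores: dict[str, float]) -> dict[str, float]:
    """Return mean, population std, min, and max of per-checkpoint scores."""
    values = list(scores.values())
    return {
        "mean": mean(values),
        "std": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


# Invented values standing in for {name: pipe(..., model_name=name)["severity_score"]}.
example_scores = {
    "proposed_L_coarse_tau100.0": 4.0,
    "rank-n-contrast_tau100.0": 5.0,
    "simclr_tau0.1": 3.0,
}
summary = summarize_scores(example_scores)
print(summary)
```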

	### Standalone inference

	Clone the [full repository](https://github.com/JaesungBae/DA-DSQA) and run:

	```bash
	python inference.py \
	    --wav /path/to/audio.wav \
	    --checkpoint ./checkpoints/stage3/proposed_L_coarse_tau100.0/average
	```

	## Citation

	```bibtex
	@misc{bae2026something,
	  title = {Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech},
	  author = {Jaesung Bae and Xiuwen Zheng and Minje Kim and Chang D. Yoo and Mark Hasegawa-Johnson},
	  year = {2026},
	  eprint = {2603.15988},
	  archivePrefix = {arXiv},
	  primaryClass = {eess.AS},
	  url = {https://arxiv.org/abs/2603.15988}
	}
	```