Upload folder using huggingface_hub
- README.md +131 -0
- checkpoints/proposed_L_coarse_tau0.1/model.safetensors +3 -0
- checkpoints/proposed_L_coarse_tau1.0/model.safetensors +3 -0
- checkpoints/proposed_L_coarse_tau10.0/model.safetensors +3 -0
- checkpoints/proposed_L_coarse_tau100.0/model.safetensors +3 -0
- checkpoints/proposed_L_coarse_tau50.0/model.safetensors +3 -0
- checkpoints/proposed_L_cont_tau0.1/model.safetensors +3 -0
- checkpoints/proposed_L_dis_tau1.0/model.safetensors +3 -0
- checkpoints/rank-n-contrast_tau100.0/model.safetensors +3 -0
- checkpoints/simclr_tau0.1/model.safetensors +3 -0
- config.json +22 -0
- pipeline.py +307 -0
README.md
ADDED
@@ -0,0 +1,131 @@
---
license: mit
tags:
- speech
- dysarthria
- severity-estimation
- whisper
- audio-classification
language:
- en
pipeline_tag: audio-classification
---

# Dysarthric Speech Severity Level Classifier

A regression probe trained on top of Whisper-large-v3 encoder features for estimating the severity level of dysarthric speech.

**Score scale:** 1.0 (most severe dysarthria) to 7.0 (typical speech)

## Model Description

This model uses a three-stage training pipeline:
1. **Pseudo-labeling:** a baseline probe generates pseudo-labels for unlabeled data
2. **Contrastive pre-training:** weakly supervised contrastive learning with typical-speech augmentation
3. **Fine-tuning:** the regression probe is fine-tuned starting from the pre-trained projector

**Architecture:** Whisper-large-v3 encoder (frozen) → LayerNorm → 2-layer MLP (proj_dim=320) → Statistics Pooling (mean+std) → Linear → Score
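
The probe head itself is small (each checkpoint is ~2 MB). The sketch below mirrors the `WhisperFeatureProbeV2` module shipped in `pipeline.py` in this repo, with length masking omitted for brevity:

```python
import torch
import torch.nn as nn

class ProbeHead(nn.Module):
    """Minimal sketch of the severity probe (see pipeline.py for the full version)."""

    def __init__(self, input_dim=1280, proj_dim=320, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(input_dim)
        self.mlp = nn.Sequential(
            nn.Linear(input_dim, proj_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(proj_dim, proj_dim), nn.ReLU(), nn.Dropout(dropout),
        )
        self.classifier = nn.Linear(proj_dim * 2, 1)

    def forward(self, feats):  # feats: (batch, frames, 1280) Whisper encoder states
        x = self.mlp(self.norm(feats))
        # Statistics pooling: concatenate per-utterance mean and std over frames.
        pooled = torch.cat([x.mean(dim=1), x.std(dim=1)], dim=1)
        return self.classifier(pooled).squeeze(-1)  # one severity score per utterance
```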

For details, see our paper:
> **Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech** [[arXiv]](https://arxiv.org/abs/2603.15988)

## Available Checkpoints

This repository contains **9 checkpoints** trained with different contrastive losses:

| Checkpoint | Contrastive Loss | τ |
|---|---|---|
| `proposed_L_coarse_tau0.1` | Proposed (L_coarse) | 0.1 |
| `proposed_L_coarse_tau1.0` | Proposed (L_coarse) | 1.0 |
| **`proposed_L_coarse_tau10.0`** (default) | Proposed (L_coarse) | 10.0 |
| `proposed_L_coarse_tau50.0` | Proposed (L_coarse) | 50.0 |
| `proposed_L_coarse_tau100.0` | Proposed (L_coarse) | 100.0 |
| `proposed_L_cont_tau0.1` | Proposed (L_cont) | 0.1 |
| `proposed_L_dis_tau1.0` | Proposed (L_dis) | 1.0 |
| `rank-n-contrast_tau100.0` | Rank-N-Contrast | 100.0 |
| `simclr_tau0.1` | SimCLR | 0.1 |

## Usage

### With the custom pipeline

```python
import sys

from huggingface_hub import snapshot_download

# Download the model
model_dir = snapshot_download("jaesungbae/severity-level-classifier")

# Make pipeline.py importable from the downloaded snapshot
sys.path.append(model_dir)

# Load pipeline (defaults to proposed_L_coarse_tau10.0)
from pipeline import PreTrainedPipeline

pipe = PreTrainedPipeline(model_dir)

# Run inference
result = pipe("/path/to/audio.wav")
print(result)
# {"severity_score": 4.25, "raw_score": 4.2483, "model_name": "proposed_L_coarse_tau10.0"}
```

### Select a specific checkpoint

```python
# Option 1: specify at initialization
pipe = PreTrainedPipeline(model_dir, model_name="simclr_tau0.1")

# Option 2: switch at runtime (Whisper & VAD stay loaded)
pipe.switch_model("rank-n-contrast_tau100.0")
result = pipe("/path/to/audio.wav")

# Option 3: override per call
result = pipe("/path/to/audio.wav", model_name="proposed_L_dis_tau1.0")
```

### List available checkpoints

```python
print(pipe.list_models())
# ['proposed_L_coarse_tau0.1', 'proposed_L_coarse_tau1.0', ...]
```

### Compare all checkpoints on a single file

```python
for name in pipe.list_models():
    result = pipe("/path/to/audio.wav", model_name=name)
    print(f"{name}: {result['severity_score']}")
```

### Standalone inference

Clone the [full repository](https://github.com/JaesungBae/DA-DSQA) and run:

```bash
python inference.py \
    --wav /path/to/audio.wav \
    --checkpoint ./checkpoints/stage3/proposed_L_coarse_tau10.0/average
```

## Requirements

- Python 3.10+
- PyTorch + torchaudio
- transformers >= 4.40.0
- safetensors >= 0.4.0
- Silero VAD (loaded via `torch.hub` at runtime)
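
A matching environment can be set up along these lines (package names inferred from the imports in `pipeline.py`; `huggingface_hub` is only needed for `snapshot_download`):

```bash
pip install torch torchaudio "transformers>=4.40.0" "safetensors>=0.4.0" huggingface_hub
```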

## Runtime Dependencies

This model loads **openai/whisper-large-v3** (~6GB) and **Silero VAD** at initialization time. Ensure sufficient memory is available.
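
Both downloads can be pre-cached, e.g. to avoid the cost at first inference or to prepare an offline machine. This sketch simply issues the same calls `pipeline.py` makes at initialization:

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Populate the local Hugging Face cache with the Whisper weights (~6GB).
WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
WhisperModel.from_pretrained("openai/whisper-large-v3")

# Populate the torch.hub cache with Silero VAD.
torch.hub.load("snakers4/silero-vad", "silero_vad")
```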

## Citation

```bibtex
@misc{bae2026something,
  title         = {Something from Nothing: Data Augmentation for Robust Severity Level Estimation of Dysarthric Speech},
  author        = {Jaesung Bae and Xiuwen Zheng and Minje Kim and Chang D. Yoo and Mark Hasegawa-Johnson},
  year          = {2026},
  eprint        = {2603.15988},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2603.15988}
}
```

checkpoints/proposed_L_coarse_tau0.1/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e5db1659b456f24c1f59d718e7eb53953c0d3490c634db37ae9dc0486597a844
size 2064020

checkpoints/proposed_L_coarse_tau1.0/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a0a640d0d054fbc551004472c838a9d573dfbe9a47b218470ee54f551476559a
size 2064020

checkpoints/proposed_L_coarse_tau10.0/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cf031ed59c2b0c127e8d0bcee38a57936aea5e32b4fc5750dc0985ffe02a2c94
size 2064020

checkpoints/proposed_L_coarse_tau100.0/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:67574bec8e6cbaab9ea9c74f5c331a8590d08cbeb02f55c28e1eabaa7cbf6788
size 2064020

checkpoints/proposed_L_coarse_tau50.0/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fdc5e9a6751f03db05dc74e3029308e9f1c39f379ea8496f40cb9ca65c953097
size 2064020

checkpoints/proposed_L_cont_tau0.1/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c28f3c4cb5e5e5e424ae4672d362fa1a3e2642834b851c7d7e9c152036b2b9f3
size 2064020

checkpoints/proposed_L_dis_tau1.0/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:38ea3fd55b00ae1e7ea3ddfa64e623a73bf482ba925f6c09794d7d636e9c79e9
size 2064020

checkpoints/rank-n-contrast_tau100.0/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3457d677fe1467f7a4376d232e1084fedcc847f29afb5511b666c886b743543f
size 2064020

checkpoints/simclr_tau0.1/model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3f53f4e3121d9ede7c2c7c37f38a7eae1e63f3b664fca2af1a6383ff540d110e
size 2064020

config.json
ADDED
@@ -0,0 +1,22 @@
{
  "model_type": "whisper_severity_probe",
  "architectures": ["WhisperFeatureProbeV2"],
  "input_dim": 1280,
  "proj_dim": 320,
  "dropout": 0.1,
  "num_classes": 1,
  "whisper_model_name": "openai/whisper-large-v3",
  "sampling_rate": 16000,
  "default_checkpoint": "proposed_L_coarse_tau10.0",
  "available_checkpoints": [
    "proposed_L_coarse_tau0.1",
    "proposed_L_coarse_tau1.0",
    "proposed_L_coarse_tau10.0",
    "proposed_L_coarse_tau50.0",
    "proposed_L_coarse_tau100.0",
    "proposed_L_cont_tau0.1",
    "proposed_L_dis_tau1.0",
    "rank-n-contrast_tau100.0",
    "simclr_tau0.1"
  ]
}

pipeline.py
ADDED
@@ -0,0 +1,307 @@
"""
Custom inference pipeline for HuggingFace Hub.

Pipeline: WAV -> Silero VAD -> Whisper feature extraction -> Probe -> Severity score

Score scale: 1.0 (most severe) to 7.0 (typical speech)

Supports multiple checkpoints. Pass `model_name` to select which checkpoint to use:

    pipe = PreTrainedPipeline(model_dir)                              # default
    pipe = PreTrainedPipeline(model_dir, model_name="simclr_tau0.1")  # specific

Available checkpoints:
- proposed_L_coarse_tau0.1
- proposed_L_coarse_tau1.0
- proposed_L_coarse_tau10.0 (default)
- proposed_L_coarse_tau50.0
- proposed_L_coarse_tau100.0
- proposed_L_cont_tau0.1
- proposed_L_dis_tau1.0
- rank-n-contrast_tau100.0
- simclr_tau0.1
"""

import io
import json
import os

import torch
import torch.nn as nn
import torchaudio

SAMPLING_RATE = 16000
WHISPER_MODEL_NAME = "openai/whisper-large-v3"
WHISPER_HIDDEN_DIM = 1280
DEFAULT_CHECKPOINT = "proposed_L_coarse_tau10.0"


class WhisperFeatureProbeV2(nn.Module):
    """
    Regression probe on Whisper encoder features.

    Architecture: LayerNorm -> Linear -> ReLU -> Dropout -> Linear -> ReLU -> Dropout
        -> Statistics Pooling (mean+std) -> Linear(proj_dim*2, num_classes)
    """

    def __init__(self, input_dim=1280, proj_dim=256, dropout=0.1, num_classes=1):
        super().__init__()
        self.norm = nn.LayerNorm(input_dim)
        self.projector = nn.Linear(input_dim, proj_dim)
        self.projector2 = nn.Linear(proj_dim, proj_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(proj_dim * 2, num_classes)

    def forward(self, input_values, lengths=None, **kwargs):
        x = self.norm(input_values)
        x = self.dropout(self.relu(self.projector(x)))
        x = self.dropout(self.relu(self.projector2(x)))

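        # Masked statistics pooling: mean/std over valid frames only, with the
        # variance computed as E[x^2] - E[x]^2.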
        if lengths is not None:
            batch_size, max_len, _ = x.shape
            mask = (
                torch.arange(max_len, device=x.device).unsqueeze(0)
                < lengths.unsqueeze(1)
            )
            mask_f = mask.unsqueeze(-1).float()
            x_masked = x * mask_f
            lengths_f = lengths.unsqueeze(1).float().clamp(min=1)
            mean = x_masked.sum(dim=1) / lengths_f
            var = (x_masked**2).sum(dim=1) / lengths_f - mean**2
            std = var.clamp(min=1e-8).sqrt()
        else:
            mean = x.mean(dim=1)
            std = x.std(dim=1)

        pooled = torch.cat([mean, std], dim=1)
        logits = self.classifier(pooled)

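        # Lightweight stand-in for a transformers-style ModelOutput, exposing
        # .logits and .hidden_states without an extra dependency.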
        return type("Output", (), {"logits": logits, "hidden_states": pooled})()


def _load_vad():
    """Load Silero VAD model."""
    model, utils = torch.hub.load(
        repo_or_dir="snakers4/silero-vad",
        model="silero_vad",
        force_reload=False,
        onnx=False,
    )
    model.eval()
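    # torch.hub returns (model, utils); the first util is get_speech_timestamps.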
    get_speech_timestamps = utils[0]
    return model, get_speech_timestamps


def _apply_vad(wav, vad_model, get_speech_timestamps):
    """Apply VAD and return concatenated speech segments."""
    if wav.dim() > 1:
        wav = wav.squeeze()

    speech_timestamps = get_speech_timestamps(
        wav,
        vad_model,
        threshold=0.5,
        sampling_rate=SAMPLING_RATE,
        min_speech_duration_ms=250,
        min_silence_duration_ms=100,
        speech_pad_ms=30,
    )

    if not speech_timestamps:
        return wav

    segments = [
        wav[max(0, ts["start"]) : min(len(wav), ts["end"])]
        for ts in speech_timestamps
    ]
    return torch.cat(segments)


def _extract_features(wav, whisper_model, processor, device):
    """Extract Whisper encoder last-layer hidden states."""
    if isinstance(wav, torch.Tensor):
        wav_np = wav.cpu().numpy()
    else:
        wav_np = wav

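    # Whisper's log-mel front end uses a 160-sample hop and the encoder halves
    # the frame rate, so one encoder frame covers 320 waveform samples; this is
    # used below to trim the padded 30 s window back to the true utterance length.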
    feat_len = len(wav_np) // 320

    input_features = processor(
        wav_np, sampling_rate=SAMPLING_RATE, return_tensors="pt"
    ).input_features.to(
        device=device, dtype=next(whisper_model.parameters()).dtype
    )

    with torch.no_grad():
        out = whisper_model.encoder(input_features, output_hidden_states=True)

    return out.last_hidden_state[:, :feat_len, :].float()


def _load_probe(checkpoint_dir, device):
    """Load a probe model from a checkpoint directory."""
    probe = WhisperFeatureProbeV2(
        input_dim=WHISPER_HIDDEN_DIM, proj_dim=320, num_classes=1
    )
    safe_path = os.path.join(checkpoint_dir, "model.safetensors")
    bin_path = os.path.join(checkpoint_dir, "pytorch_model.bin")
    if os.path.isfile(safe_path):
        from safetensors.torch import load_file

        state_dict = load_file(safe_path, device=str(device))
    elif os.path.isfile(bin_path):
        state_dict = torch.load(
            bin_path, map_location=device, weights_only=True
        )
    else:
        raise FileNotFoundError(
            f"No model.safetensors or pytorch_model.bin in {checkpoint_dir}"
        )
    probe.load_state_dict(state_dict)
    probe.to(device).eval()
    return probe


def _discover_checkpoints(path):
    """Find all available checkpoint subdirectories."""
    checkpoints_dir = os.path.join(path, "checkpoints")
    if not os.path.isdir(checkpoints_dir):
        return []
    names = []
    for name in sorted(os.listdir(checkpoints_dir)):
        ckpt_dir = os.path.join(checkpoints_dir, name)
        if os.path.isdir(ckpt_dir) and (
            os.path.isfile(os.path.join(ckpt_dir, "model.safetensors"))
            or os.path.isfile(os.path.join(ckpt_dir, "pytorch_model.bin"))
        ):
            names.append(name)
    return names


class PreTrainedPipeline:
    """
    HuggingFace custom inference pipeline for dysarthric speech severity estimation.

    Accepts a WAV file path or raw audio bytes and returns a severity score
    on a 1.0 (most severe) to 7.0 (typical speech) scale.

    Supports multiple checkpoints stored under `checkpoints/` in the model repo.
    Use `model_name` to select which checkpoint, or call `switch_model()` to
    change at runtime.

    Args:
        path: Path to the downloaded HuggingFace model directory.
        model_name: Name of the checkpoint to load (e.g., "proposed_L_coarse_tau10.0").
            If None, uses the default from config.json.
    """

    def __init__(self, path: str, model_name: str = None):
        self.path = path
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"
        )

        # Read config
        config_path = os.path.join(path, "config.json")
        if os.path.isfile(config_path):
            with open(config_path) as f:
                self.config = json.load(f)
        else:
            self.config = {}

        # Discover available checkpoints
        self.available_checkpoints = _discover_checkpoints(path)
        if not self.available_checkpoints:
            raise FileNotFoundError(
                f"No checkpoints found under {os.path.join(path, 'checkpoints')}/"
            )

        # Load probe for the selected checkpoint
        if model_name is None:
            model_name = self.config.get("default_checkpoint", DEFAULT_CHECKPOINT)
        self.current_model_name = None
        self.probe = None
        self.switch_model(model_name)

        # Load Whisper encoder (shared across all checkpoints)
        from transformers import WhisperFeatureExtractor, WhisperModel

        self.processor = WhisperFeatureExtractor.from_pretrained(
            WHISPER_MODEL_NAME
        )
        self.whisper = WhisperModel.from_pretrained(WHISPER_MODEL_NAME)
        self.whisper.eval().to(self.device)

        # Load Silero VAD (shared across all checkpoints)
        self.vad_model, self.get_speech_timestamps = _load_vad()

    def switch_model(self, model_name: str):
        """
        Switch to a different checkpoint without reloading Whisper or VAD.

        Args:
            model_name: Name of the checkpoint (e.g., "simclr_tau0.1")
        """
        if model_name == self.current_model_name:
            return

        if model_name not in self.available_checkpoints:
            raise ValueError(
                f"Checkpoint '{model_name}' not found. "
                f"Available: {self.available_checkpoints}"
            )

        checkpoint_dir = os.path.join(self.path, "checkpoints", model_name)
        self.probe = _load_probe(checkpoint_dir, self.device)
        self.current_model_name = model_name

    def list_models(self):
        """Return list of available checkpoint names."""
        return list(self.available_checkpoints)

    def __call__(self, inputs, model_name: str = None):
        """
        Run severity estimation on audio input.

        Args:
            inputs: file path (str) or raw audio bytes
            model_name: optionally override the checkpoint for this call

        Returns:
            dict with "severity_score" (clipped to 1-7), "raw_score",
            and "model_name"
        """
        if model_name is not None:
            self.switch_model(model_name)

        # Load audio (file path or in-memory bytes)
        if isinstance(inputs, str):
            wav, sr = torchaudio.load(inputs)
        else:
            # Assume a bytes-like object (bytes, bytearray, ...) that BytesIO can wrap.
            wav, sr = torchaudio.load(io.BytesIO(inputs))

        if sr != SAMPLING_RATE:
            wav = torchaudio.functional.resample(wav, sr, SAMPLING_RATE)
        wav = wav.squeeze()

        # VAD
        wav = _apply_vad(wav, self.vad_model, self.get_speech_timestamps)

        # Whisper feature extraction
        features = _extract_features(
            wav, self.whisper, self.processor, self.device
        )

        # Probe inference
        with torch.no_grad():
            output = self.probe(features)
        score = output.logits.item()

        return {
            "severity_score": round(max(1.0, min(7.0, score)), 2),
            "raw_score": round(score, 4),
            "model_name": self.current_model_name,
        }