---
license: apache-2.0
datasets:
- AImpower/MandarinStutteredSpeech
language:
- zh
metrics:
- cer
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---
# Model Card: AImpower/StutteredSpeechASR

This model is a version of OpenAI's `whisper-large-v2` fine-tuned on the **AImpower/MandarinStutteredSpeech** dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).

## Model Details

* **Base Model:** `openai/whisper-large-v2`
* **Language:** Mandarin Chinese
* **Fine-tuning Dataset:** [AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)
* **Fine-tuning Method:** The model was fine-tuned with AdaLoRA, an adaptive variant of LoRA (low-rank adaptation), with the goal of preserving speech disfluencies in its transcriptions.
* **Paper:** [Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset](https://doi.org/10.1145/3715275.3732179)

## Model Description

This model is specifically adapted to produce more accurate and authentic transcriptions for Mandarin-speaking PWS.  
Standard Automatic Speech Recognition (ASR) models often exhibit "fluency bias": they smooth over or delete stuttered speech patterns such as repetitions and interjections.  
This model was instead fine-tuned on literal transcriptions that intentionally preserve these disfluencies.

The primary goal is to create a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.

## Intended Uses & Limitations

### Intended Use

This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It's particularly useful for:
* Improving accessibility in speech-to-text applications.
* Linguistic research on stuttered speech.
* Developing more inclusive voice-enabled technologies.

### Limitations

* **Language Specificity:** The model is fine-tuned exclusively on Mandarin Chinese and is not intended for other languages.
* **Data Specificity:** Performance is optimized for speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
* **Variability:** Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.

---

## How to Use

You can use the model with the `transformers` library. Ensure you have `torch`, `transformers`, and `librosa` installed.

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the fine-tuned model and processor
model_path = "AImpower/StutteredSpeechASR"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load an example audio file (replace with your audio file)
audio_input_name = "example_stuttered_speech.wav"
waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)

# Process the audio and generate transcription
input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
input_features = input_features.to(device)

# On recent transformers versions you can optionally pin the language and task,
# e.g. model.generate(input_features, language="zh", task="transcribe")
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
```

-----

## Training Data

The model was fine-tuned on the **[AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)** dataset.  
This dataset was created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.

  * **Size:** The dataset contains nearly 50 hours of speech from 72 adults who stutter.
  * **Content:** It includes both unscripted, spontaneous conversations between two PWS and the dictation of 200 voice commands.
  * **Transcription:** Training used verbatim (literal) transcriptions that include disfluencies such as word repetitions and interjections; preserving them was a deliberate choice by the community to ensure their speech was represented authentically.

## Training Procedure

  * **Data Split:** A three-fold cross-validation approach was used, with data split by participant to ensure robustness. Each fold had a roughly 65:10:25 split for train/dev/test sets, with a balanced representation of mild, moderate, and severe stuttering levels. This model card represents the best-performing fold.
  * **Hyperparameters:**
      * **Epochs:** 3
      * **Learning Rate:** 0.001
      * **Optimizer:** AdamW
      * **Batch Size:** 16
      * **Fine-tuning Method:** AdaLora
  * **GPU:** Four NVIDIA A100 80G GPUs
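A setup along the lines of the procedure above can be sketched with the `peft` library. Note that this is an illustration, not the authors' training script: the AdaLoRA rank-schedule values (`init_r`, `target_r`), `lora_alpha`, dropout, and the choice of `target_modules` are assumptions, since the paper's card does not report them.

```python
# Sketch: AdaLoRA fine-tuning setup for Whisper with peft.
# All AdaLoRA hyperparameter values below are illustrative assumptions.
from transformers import AutoModelForSpeechSeq2Seq
from peft import AdaLoraConfig, get_peft_model

base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2")

adalora_config = AdaLoraConfig(
    init_r=12,           # initial rank per adapted matrix (assumed)
    target_r=4,          # target average rank after adaptive pruning (assumed)
    lora_alpha=32,       # LoRA scaling factor (assumed)
    lora_dropout=0.1,    # adapter dropout (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)

model = get_peft_model(base, adalora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The peft-wrapped model can then be passed to a standard `Seq2SeqTrainer` with the hyperparameters listed above (AdamW, lr 0.001, batch size 16, 3 epochs).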
-----

## Evaluation Results

The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline `whisper-large-v2` model.  
The key metric used is Character Error Rate (CER), evaluated on literal transcriptions to measure the model's ability to preserve disfluencies.

| Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER |
| :------------------ | :------------------- | :------------------- |
| Mild                | 16.34%               | **5.80%**            |
| Moderate            | 21.72%               | **9.03%**            |
| Severe              | 49.24%               | **20.46%**           |

*(Results from Figure 3 of the paper)*
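For reference, CER is an edit-distance metric computed at the character level; libraries such as `jiwer` provide it directly, but a minimal dependency-free sketch looks like this:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (substitutions + deletions + insertions) / len(reference)."""
    r, h = list(reference), list(hypothesis)
    if not r:
        return float(len(h) > 0)
    # Levenshtein distance via single-row dynamic programming
    dp = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (r[i - 1] != h[j - 1]))      # substitution / match
            prev = cur
    return dp[len(h)] / len(r)

# Deleting a repeated syllable (a typical fluency-bias error) counts against the model:
print(cer("我我想说", "我想说"))  # 1 edit over 4 reference characters -> 0.25
```

Because the references here are literal transcriptions, a model that "cleans up" a repetition incurs deletion errors, which is exactly what the fine-tuning reduces.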

Notably, the model achieved a significant reduction in **deletion errors (DEL)**, especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.

## Citation

If you use this model, please cite the original paper:

```bibtex
@inproceedings{li2025collective,
  author = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
  title = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
  year = {2025},
  isbn = {9798400714825},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3715275.3732179},
  booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
  pages = {2768–2783},
  location = {Athens, Greece},
  series = {FAccT '25}
}
```