
Model Card: AImpower/StutteredSpeechASR

This model is a version of OpenAI's whisper-large-v2 fine-tuned on the AImpower/MandarinStutteredSpeech dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).

Model Details

Model Description

This model is specifically adapted to provide more accurate and authentic transcriptions for Mandarin-speaking PWS.
Standard Automatic Speech Recognition (ASR) models often exhibit "fluency bias": they smooth over or delete stuttered speech patterns such as repetitions and interjections.
This model was fine-tuned on literal transcriptions that intentionally preserve these disfluencies.

The primary goal is to create a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.

Intended Uses & Limitations

Intended Use

This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It's particularly useful for:

  • Improving accessibility in speech-to-text applications.
  • Linguistic research on stuttered speech.
  • Developing more inclusive voice-enabled technologies.

Limitations

  • Language Specificity: The model is fine-tuned exclusively on Mandarin Chinese and is not intended for other languages.
  • Data Specificity: Performance is optimized for speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
  • Variability: Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.

How to Use

You can use the model with the transformers library. Ensure you have torch, transformers, and librosa installed.

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the fine-tuned model and processor
model_path = "AImpower/StutteredSpeechASR"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load an example audio file (replace with your audio file)
audio_input_name = "example_stuttered_speech.wav"
waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)

# Process the audio and generate transcription
input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
input_features = input_features.to(device)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
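Whisper models process audio in fixed 30-second windows, so longer recordings are usually split into chunks before being passed to the processor. The helper below is a minimal sketch of such pre-chunking (the chunk length and the transcribe-per-chunk loop are illustrative assumptions, not part of this model's release):

```python
import numpy as np

CHUNK_SECONDS = 30   # Whisper's fixed input window length
SAMPLE_RATE = 16000  # sampling rate expected by the Whisper processor

def chunk_waveform(waveform, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D waveform into consecutive chunks of at most chunk_seconds."""
    chunk_len = chunk_seconds * sample_rate
    return [waveform[i:i + chunk_len] for i in range(0, len(waveform), chunk_len)]

# Example: a 70-second waveform splits into 30s + 30s + 10s chunks.
dummy = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
chunks = chunk_waveform(dummy)
durations = [len(c) / SAMPLE_RATE for c in chunks]
print(durations)  # [30.0, 30.0, 10.0]
```

Each chunk can then be fed through the processor/generate loop shown above and the partial transcriptions concatenated; note that naive chunking can cut words at chunk boundaries, so overlap-based strategies may give better results.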

Training Data

The model was fine-tuned on the AImpower/MandarinStutteredSpeech dataset.
This dataset was created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.

  • Size: The dataset contains nearly 50 hours of speech from 72 adults who stutter.
  • Content: It includes both unscripted, spontaneous conversations between two PWS and the dictation of 200 voice commands.
  • Transcription: The training was performed on verbatim (literal) transcriptions that include disfluencies such as word repetitions and interjections, which was a deliberate choice by the community to ensure their speech was represented authentically.

Training Procedure

  • Data Split: A three-fold cross-validation approach was used, with data split by participant to ensure robustness. Each fold had a roughly 65:10:25 split for train/dev/test sets, with a balanced representation of mild, moderate, and severe stuttering levels. This model card represents the best-performing fold.
  • Hyperparameters:
    • Epochs: 3
    • Learning Rate: 0.001
    • Optimizer: AdamW
    • Batch Size: 16
    • Fine-tuning Method: AdaLoRA
  • Hardware: Four NVIDIA A100 80GB GPUs
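As a rough illustration of the fine-tuning setup, AdaLoRA can be applied to a Whisper model with the peft library. This is a hypothetical configuration sketch only: the ranks, target modules, and schedule values below are assumptions for illustration, not the values used for this model.

```python
from peft import AdaLoraConfig, get_peft_model
from transformers import AutoModelForSpeechSeq2Seq

# Load the base model that was fine-tuned (per the model description)
base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2")

# Hypothetical AdaLoRA configuration; ranks and target modules are
# illustrative assumptions, not the paper's reported settings.
config = AdaLoraConfig(
    init_r=12,                            # initial LoRA rank per module
    target_r=4,                           # average rank after budget pruning
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The wrapped model can then be trained with the usual transformers Trainer or a custom loop; AdaLoRA adaptively reallocates rank budget across modules during training, which is what distinguishes it from plain LoRA.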

Evaluation Results

The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline whisper-large-v2 model.
The key metric used is Character Error Rate (CER), evaluated on literal transcriptions to measure the model's ability to preserve disfluencies.

Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER
--------------------|----------------------|---------------------
Mild                | 16.34%               | 5.80%
Moderate            | 21.72%               | 9.03%
Severe              | 49.24%               | 20.46%

(Results from Figure 3 of the paper)

Notably, the model achieved a significant reduction in deletion errors (DEL), especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.
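CER and its deletion/insertion/substitution components can be computed with a standard Levenshtein alignment. The sketch below (not the paper's evaluation code) shows how a fluency-biased transcription that drops repeated syllables registers as deletion errors against the verbatim reference:

```python
def cer_breakdown(ref, hyp):
    """Character error rate with substitution/deletion/insertion counts.

    CER = (S + D + I) / len(ref), computed against the literal (verbatim)
    reference so that deleted disfluencies count as errors.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion (ref char dropped)
                           dp[i][j - 1] + 1)         # insertion (extra hyp char)
    # Backtrack through the table to classify each edit
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return (subs + dels + ins) / n, subs, dels, ins

# A fluency-biased model deletes the repeated syllables: two deletion errors.
cer, s, d, i = cer_breakdown("我我想想说", "我想说")
print(cer, s, d, i)  # 0.4 0 2 0
```

Against the verbatim reference "我我想想说" ("I-I wa-want to say"), the smoothed hypothesis "我想说" scores a CER of 40%, all from deletions, which is exactly the failure mode the fine-tuning targets.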

Citation

If you use this model, please cite the original paper:

@inproceedings{li2025collective,
  author = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
  title = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
  year = {2025},
  isbn = {9798400714825},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3715275.3732179},
  booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
  pages = {2768–2783},
  location = {Athens, Greece},
  series = {FAccT '25}
}