
Model Card: AImpower/StutteredSpeechASR

This model is a version of OpenAI's whisper-large-v2 fine-tuned on the AImpower/MandarinStutteredSpeech dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).

Model Details

Model Description

This model is specifically adapted to provide more accurate and authentic transcriptions for Mandarin-speaking PWS.
Standard Automatic Speech Recognition (ASR) models often exhibit "fluency bias": they smooth over or delete stuttered speech patterns such as repetitions and interjections.
This model was fine-tuned on literal transcriptions that intentionally preserve these disfluencies.

The primary goal is to create a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.

Intended Uses & Limitations

Intended Use

This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It's particularly useful for:

  • Improving accessibility in speech-to-text applications.
  • Linguistic research on stuttered speech.
  • Developing more inclusive voice-enabled technologies.

Limitations

  • Language Specificity: The model is fine-tuned exclusively on Mandarin Chinese and is not intended for other languages.
  • Data Specificity: Performance is optimized for speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
  • Variability: Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.

How to Use

You can use the model with the transformers library. Ensure you have torch, transformers, and librosa installed.

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the fine-tuned model and processor
model_path = "AImpower/StutteredSpeechASR"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load an example audio file (replace with your audio file)
audio_input_name = "example_stuttered_speech.wav"
waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)

# Process the audio and generate transcription
input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
input_features = input_features.to(device)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
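Whisper models process audio in fixed 30-second windows, so longer recordings are usually split into chunks before being passed to the processor. The helper below is a minimal sketch of such pre-chunking (the chunk length and the transcribe-per-chunk loop are illustrative assumptions, not part of this model's release):

```python
import numpy as np

CHUNK_SECONDS = 30   # Whisper's fixed input window length
SAMPLE_RATE = 16000  # sampling rate expected by the Whisper processor

def chunk_waveform(waveform, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a 1-D waveform into consecutive chunks of at most chunk_seconds."""
    chunk_len = chunk_seconds * sample_rate
    return [waveform[i:i + chunk_len] for i in range(0, len(waveform), chunk_len)]

# Example: a 70-second waveform splits into 30s + 30s + 10s chunks.
dummy = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
chunks = chunk_waveform(dummy)
durations = [len(c) / SAMPLE_RATE for c in chunks]
print(durations)  # [30.0, 30.0, 10.0]
```

Each chunk can then be fed through the processor/generate loop shown above and the partial transcriptions concatenated; note that naive chunking can cut words at chunk boundaries, so overlap-based strategies may give better results.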

Training Data

The model was fine-tuned on the AImpower/MandarinStutteredSpeech dataset.
This dataset was created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.

  • Size: The dataset contains nearly 50 hours of speech from 72 adults who stutter.
  • Content: It includes both unscripted, spontaneous conversations between two PWS and the dictation of 200 voice commands.
  • Transcription: The training was performed on verbatim (literal) transcriptions that include disfluencies such as word repetitions and interjections, which was a deliberate choice by the community to ensure their speech was represented authentically.

Training Procedure

  • Data Split: A three-fold cross-validation approach was used, with data split by participant to ensure robustness. Each fold had a roughly 65:10:25 split for train/dev/test sets, with a balanced representation of mild, moderate, and severe stuttering levels. This model card represents the best-performing fold.
  • Hyperparameters:
    • Epochs: 3
    • Learning Rate: 0.001
    • Optimizer: AdamW
    • Batch Size: 16
    • Fine-tuning Method: AdaLoRA
  • Hardware: Four NVIDIA A100 80GB GPUs
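As a rough illustration of the fine-tuning setup, AdaLoRA can be applied to a Whisper model with the peft library. This is a hypothetical configuration sketch only: the ranks, target modules, and schedule values below are assumptions for illustration, not the values used for this model.

```python
from peft import AdaLoraConfig, get_peft_model
from transformers import AutoModelForSpeechSeq2Seq

# Load the base model that was fine-tuned (per the model description)
base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2")

# Hypothetical AdaLoRA configuration; ranks and target modules are
# illustrative assumptions, not the paper's reported settings.
config = AdaLoraConfig(
    init_r=12,                            # initial LoRA rank per module
    target_r=4,                           # average rank after budget pruning
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumption)
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The wrapped model can then be trained with the usual transformers Trainer or a custom loop; AdaLoRA adaptively reallocates rank budget across modules during training, which is what distinguishes it from plain LoRA.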

Evaluation Results

The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline whisper-large-v2 model.
The key metric used is Character Error Rate (CER), evaluated on literal transcriptions to measure the model's ability to preserve disfluencies.

Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER
--------------------|----------------------|---------------------
Mild                | 16.34%               | 5.80%
Moderate            | 21.72%               | 9.03%
Severe              | 49.24%               | 20.46%

(Results from Figure 3 of the paper)

Notably, the model achieved a significant reduction in deletion errors (DEL), especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.
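CER and its deletion/insertion/substitution components can be computed with a standard Levenshtein alignment. The sketch below (not the paper's evaluation code) shows how a fluency-biased transcription that drops repeated syllables registers as deletion errors against the verbatim reference:

```python
def cer_breakdown(ref, hyp):
    """Character error rate with substitution/deletion/insertion counts.

    CER = (S + D + I) / len(ref), computed against the literal (verbatim)
    reference so that deleted disfluencies count as errors.
    """
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match / substitution
                           dp[i - 1][j] + 1,         # deletion (ref char dropped)
                           dp[i][j - 1] + 1)         # insertion (extra hyp char)
    # Backtrack through the table to classify each edit
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return (subs + dels + ins) / n, subs, dels, ins

# A fluency-biased model deletes the repeated syllables: two deletion errors.
cer, s, d, i = cer_breakdown("我我想想说", "我想说")
print(cer, s, d, i)  # 0.4 0 2 0
```

Against the verbatim reference "我我想想说" ("I-I wa-want to say"), the smoothed hypothesis "我想说" scores a CER of 40%, all from deletions, which is exactly the failure mode the fine-tuning targets.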

Citation

If you use this model, please cite the original paper:

@inproceedings{li2025collective,
  author = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
  title = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
  year = {2025},
  isbn = {9798400714825},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3715275.3732179},
  booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
  pages = {2768–2783},
  location = {Athens, Greece},
  series = {FAccT '25}
}