---
license: apache-2.0
datasets:
- AImpower/MandarinStutteredSpeech
language:
- zh
metrics:
- cer
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---
# Model Card: AImpower/StutteredSpeechASR
This model is a version of OpenAI's `whisper-large-v2` fine-tuned on the **AImpower/MandarinStutteredSpeech** dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).
## Model Details
* **Base Model:** `openai/whisper-large-v2`
* **Language:** Mandarin Chinese
* **Fine-tuning Dataset:** [AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)
* **Fine-tuning Method:** AdaLoRA, an adaptive-budget variant of LoRA (low-rank adapter) fine-tuning, trained on verbatim transcriptions so that speech disfluencies are preserved.
* **Paper:** [Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset](https://doi.org/10.1145/3715275.3732179)
## Model Description
This model is specifically adapted to provide more accurate and authentic transcriptions for Mandarin-speaking PWS.
Standard Automatic Speech Recognition (ASR) models often exhibit a "fluency bias": they smooth over or delete stuttered speech patterns such as repetitions and interjections.
This model was fine-tuned on literal transcriptions that intentionally preserve these disfluencies.
The primary goal is to create a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.
## Intended Uses & Limitations
### Intended Use
This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It's particularly useful for:
* Improving accessibility in speech-to-text applications.
* Linguistic research on stuttered speech.
* Developing more inclusive voice-enabled technologies.
### Limitations
* **Language Specificity:** The model is fine-tuned exclusively on Mandarin Chinese and is not intended for other languages.
* **Data Specificity:** Performance is optimized for speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
* **Variability:** Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.
---
## How to Use
You can use the model with the `transformers` library. Ensure you have `torch`, `transformers`, and `librosa` installed.
```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the fine-tuned model and processor
model_path = "AImpower/StutteredSpeechASR"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load an example audio file (replace with your own).
# Whisper expects 16 kHz mono audio, so resample on load.
audio_input_name = "example_stuttered_speech.wav"
waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)

# Convert the waveform to log-Mel input features and generate a transcription
input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
input_features = input_features.to(device)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")
```
---
## Training Data
The model was fine-tuned on the **[AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)** dataset.
This dataset was created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.
* **Size:** The dataset contains nearly 50 hours of speech from 72 adults who stutter.
* **Content:** It includes both unscripted, spontaneous conversations between two PWS and the dictation of 200 voice commands.
* **Transcription:** Training used verbatim (literal) transcriptions that preserve disfluencies such as word repetitions and interjections; this was a deliberate choice by the community to ensure their speech was represented authentically.
## Training Procedure
* **Data Split:** A three-fold cross-validation approach was used, with data split by participant to ensure robustness. Each fold had a roughly 65:10:25 split for train/dev/test sets, with a balanced representation of mild, moderate, and severe stuttering levels. This model card represents the best-performing fold.
* **Hyperparameters:**
* **Epochs:** 3
* **Learning Rate:** 0.001
* **Optimizer:** AdamW
* **Batch Size:** 16
* **Fine-tuning Method:** AdaLora
  * **Hardware:** 4× NVIDIA A100 80 GB GPUs
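As a rough illustration of this setup (not the authors' exact training script), an AdaLoRA configuration built with Hugging Face's `peft` library might look like the following. The rank budget, step count, and target modules shown here are assumptions for the sketch, not values reported in the paper:

```python
from peft import AdaLoraConfig

# AdaLoRA adaptively reallocates a low-rank parameter budget across weight
# matrices during training. All values below are illustrative guesses.
adalora_config = AdaLoraConfig(
    init_r=12,        # initial rank of each low-rank adapter
    target_r=4,       # average target rank after budget pruning
    lora_alpha=32,
    lora_dropout=0.1,
    total_step=3000,  # total optimizer steps, required by the rank schedule
    target_modules=["q_proj", "v_proj"],  # attention projections in Whisper
)

# The adapter is then attached to the base model before training, e.g.:
# from peft import get_peft_model
# from transformers import AutoModelForSpeechSeq2Seq
# base = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2")
# model = get_peft_model(base, adalora_config)
```

Only the adapter parameters are updated during AdaLoRA fine-tuning, which keeps the memory footprint far below full fine-tuning of the 1.5B-parameter base model.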
---
## Evaluation Results
The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline `whisper-large-v2` model.
The key metric used is Character Error Rate (CER), evaluated on literal transcriptions to measure the model's ability to preserve disfluencies.
| Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER |
| :------------------ | :------------------- | :------------------- |
| Mild | 16.34% | **5.80%** |
| Moderate | 21.72% | **9.03%** |
| Severe | 49.24% | **20.46%** |
*(Results from Figure 3 of the paper)*
Notably, the model achieved a significant reduction in **deletion errors (DEL)**, especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.
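For reference, CER is the character-level edit distance (insertions + deletions + substitutions) between hypothesis and reference, divided by the reference length. Libraries such as `jiwer` or `evaluate` provide this metric; a minimal pure-Python sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance over reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # One-row dynamic-programming edit distance over the character sequences
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution or match
            )
            prev = cur
    return dp[-1] / len(ref)

# A "smoothed" hypothesis that drops a repeated syllable is penalized:
# one deletion over six reference characters.
print(cer("我我想去北京", "我想去北京"))
```

This is also why deletion errors matter here: a fluency-biased model that omits repetitions incurs deletions against the literal reference, which this fine-tune sharply reduces.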
## Citation
If you use this model, please cite the original paper:
```bibtex
@inproceedings{li2025collective,
author = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
title = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
year = {2025},
isbn = {9798400714825},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3715275.3732179},
booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
pages = {2768–2783},
location = {Athens, Greece},
series = {FAccT '25}
}
```