---
license: apache-2.0
datasets:
- AImpower/MandarinStutteredSpeech
language:
- zh
metrics:
- cer
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---

# Model Card: AImpower/StutteredSpeechASR

This model is a version of OpenAI's `whisper-large-v2` fine-tuned on the **AImpower/MandarinStutteredSpeech** dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).

## Model Details

* **Base Model:** `openai/whisper-large-v2`
* **Language:** Mandarin Chinese
* **Fine-tuning Dataset:** [AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)
* **Fine-tuning Method:** AdaLoRA, an adaptive-budget variant of low-rank adaptation (LoRA), trained on literal transcriptions so that speech disfluencies are preserved in the output.
* **Paper:** [Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset](https://doi.org/10.1145/3715275.3732179)

## Model Description

This model is specifically adapted to provide more accurate and authentic transcriptions for Mandarin-speaking PWS.
Standard automatic speech recognition (ASR) models often exhibit "fluency bias": they smooth over or delete stuttered speech patterns such as repetitions and interjections.
This model was instead fine-tuned on literal transcriptions that intentionally preserve these disfluencies.

The primary goal is a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.

## Intended Uses & Limitations

### Intended Use

This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It is particularly useful for:

* Improving accessibility in speech-to-text applications.
* Linguistic research on stuttered speech.
* Developing more inclusive voice-enabled technologies.

### Limitations

* **Language Specificity:** The model is fine-tuned exclusively on Mandarin Chinese and is not intended for other languages.
* **Data Specificity:** Performance is optimized for speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
* **Variability:** Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.

---

## How to Use

You can use the model with the `transformers` library. Ensure you have `torch`, `transformers`, and `librosa` installed.

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the fine-tuned model and processor
model_path = "AImpower/StutteredSpeechASR"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load an example audio file (replace with your audio file);
# Whisper expects 16 kHz mono audio
audio_input_name = "example_stuttered_speech.wav"
waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)

# Process the audio and generate the transcription
input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
input_features = input_features.to(device)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
```
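For recordings longer than Whisper's 30-second input window, the `transformers` pipeline API with chunking is a convenient alternative. The sketch below is illustrative; the helper names (`build_asr`, `transcribe_file`) are our own, not part of any library.

```python
from transformers import pipeline

def build_asr(model_id: str = "AImpower/StutteredSpeechASR", device: str = "cpu"):
    """Build an ASR pipeline; chunk_length_s enables long-form audio
    beyond Whisper's 30-second window."""
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,
        device=device,
    )

def transcribe_file(path: str) -> str:
    """Transcribe a single audio file; the pipeline handles resampling."""
    asr = build_asr()
    return asr(path)["text"]
```

Chunked inference trades a small amount of accuracy at chunk boundaries for the ability to process arbitrarily long conversations.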

---

## Training Data

The model was fine-tuned on the **[AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)** dataset.
This dataset was created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.

* **Size:** The dataset contains nearly 50 hours of speech from 72 adults who stutter.
* **Content:** It includes both unscripted, spontaneous conversations between two PWS and the dictation of 200 voice commands.
* **Transcription:** Training used verbatim (literal) transcriptions that include disfluencies such as word repetitions and interjections; preserving them was a deliberate choice by the community to ensure their speech was represented authentically.

## Training Procedure

* **Data Split:** Three-fold cross-validation, with data split by participant to ensure robustness. Each fold used a roughly 65:10:25 train/dev/test split with balanced representation of mild, moderate, and severe stuttering levels. This model card represents the best-performing fold.
* **Hyperparameters:**
  * **Epochs:** 3
  * **Learning Rate:** 0.001
  * **Optimizer:** AdamW
  * **Batch Size:** 16
* **Fine-tuning Method:** AdaLoRA
* **Hardware:** Four NVIDIA A100 80 GB GPUs

---
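For readers unfamiliar with AdaLoRA, a configuration with the `peft` library might look like the fragment below. This is a sketch only: the rank budget, target modules, dropout, and step count are illustrative assumptions, not values reported in the paper.

```python
from peft import AdaLoraConfig, TaskType

# Sketch of an AdaLoRA configuration; rank budget (init_r/target_r),
# target modules, dropout, and total_step are illustrative assumptions.
adalora_config = AdaLoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    init_r=12,           # initial per-module rank budget (assumed)
    target_r=8,          # final average rank after pruning (assumed)
    lora_alpha=32,       # scaling factor (assumed)
    lora_dropout=0.1,    # (assumed)
    total_step=10000,    # total optimizer steps for the rank schedule (assumed)
    target_modules=["q_proj", "v_proj"],  # Whisper attention projections (assumed)
)

# The adapter would then be attached to the base Whisper model with:
#   from peft import get_peft_model
#   model = get_peft_model(base_model, adalora_config)
```

Unlike plain LoRA, AdaLoRA prunes its rank budget during training, shifting capacity toward the layers that matter most.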

## Evaluation Results

The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline `whisper-large-v2` model.
The key metric is character error rate (CER), evaluated against literal transcriptions to measure the model's ability to preserve disfluencies.
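Concretely, CER is the character-level edit distance (substitutions, deletions, insertions) divided by the reference length; for Chinese, the character rather than the word is the unit of comparison. A minimal stdlib sketch of the arithmetic (evaluation pipelines typically use a library such as `jiwer`):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over characters,
    normalized by the (non-empty) reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m
```

For example, against a literal reference 我我我想去北京 with a repeated character, a "fluency-biased" hypothesis 我想去北京 incurs two deletions, so the dropped repetitions alone contribute a CER of 2/7 ≈ 28.6%.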

| Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER |
| :------------------ | :------------------- | :------------------- |
| Mild | 16.34% | **5.80%** |
| Moderate | 21.72% | **9.03%** |
| Severe | 49.24% | **20.46%** |

*(Results from Figure 3 of the paper)*

Notably, the model achieved a significant reduction in **deletion errors (DEL)**, especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.
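The per-operation breakdown behind a DEL figure can be recovered from the backtrace of the same edit-distance alignment. A stdlib sketch (the function name is ours, and tie-breaking between equal-cost alignments can vary across implementations):

```python
def edit_ops(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions) from a minimal
    character-level alignment; deletions are reference characters
    the hypothesis dropped."""
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    subs = dels = ins = 0
    i, j = m, n
    # Walk back through the table, preferring diagonal (match/substitute) moves
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]):
            subs += reference[i - 1] != hypothesis[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

# A fluency-biased hypothesis that drops stuttered repetitions
# shows up purely as deletions:
print(edit_ops("我我我想去北京", "我想去北京"))  # → (0, 2, 0)
```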

## Citation

If you use this model, please cite the original paper:

```bibtex
@inproceedings{li2025collective,
  author    = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
  title     = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
  year      = {2025},
  isbn      = {9798400714825},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3715275.3732179},
  booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
  pages     = {2768--2783},
  location  = {Athens, Greece},
  series    = {FAccT '25}
}
```