Automatic Speech Recognition
Safetensors
Chinese
whisper
kexinf1 committed (verified)
Commit 2f2d8cc · 1 Parent(s): 037c4fc

Update README.md

Files changed (1)
  1. README.md +88 -51
README.md CHANGED
@@ -1,88 +1,125 @@
- ---
- license: cc-by-4.0
- datasets:
- - AImpower/MandarinStutteredSpeech
- language:
- - zh
- base_model:
- - openai/whisper-large-v2
- pipeline_tag: automatic-speech-recognition
- ---
- # Whisper Large v2 Chinese Stuttering Fine-Tuned

- This is a fine-tuned version of OpenAIs Whisper Large v2 model, adapted for transcribing Mandarin Chinese speech, especially with stuttering.
- The model was fine-tuned on the **AS-70: A Mandarin stuttered speech dataset** for automatic speech recognition and stuttering event detection.

  ## Model Details

- **Model type:** Automatic Speech Recognition (ASR)
- **Languages:** Mandarin Chinese
- **Base model:** Whisper Large v2
- **Finetuned by:** Rong Gong et al.
- **Model folder:** `whisper-large-v2-finetune`

- ### Authors of the Dataset & Paper

- - Dataset: Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li
- - Dataset paper: Gong, R., Xue, H., Wang, L., Xu, X., Li, Q., Xie, L., Bu, H., Wu, S., Zhou, J., Qin, Y., Zhang, B., Du, J., Bin, J., Li, M. (2024) AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection. Proc. Interspeech 2024, 5098-5102, doi: 10.21437/Interspeech.2024-918
- - Fine-tuning paper: Jingjin Li, Qisheng Li, Rong Gong, Lezhi Wang, and Shaomei Wu. 2025. Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25). Association for Computing Machinery, New York, NY, USA, 2768–2783. https://doi.org/10.1145/3715275.3732179

- ## Intended Uses

- - Transcribing Mandarin Chinese spoken language verbatim, particularly for speakers who stutter.
- - Research in stuttering affirming speech therapy, clinical linguistics, or accessibility applications.

- ### Out-of-Scope Use

- - Non-Chinese languages or highly noisy audio.
- - Real-time transcription without optimization.
- - Sensitive or legal audio without human verification.
- - Other use cases that undermine the dignity and quality of life of people who stutter.

- ## Limitations & Risks

- - Accuracy may drop on fast speech, mixed-language speech, or heavy background noise.
- - Stuttering is highly variable and heterogenous, certain stuttering patterns may still result in high transcription errors.
- - Not recommended to use as sole source for clinical or legal decisions.

  ## How to Use

  ```python
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
  import torch

- model_path = "whisper-large-v2-finetune"
  model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
  processor = AutoProcessor.from_pretrained(model_path)

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model.to(device)
  ```
- ## Training Details

- - **Data:** training split of AS-70 Mandarin stuttered speech dataset.
- - **Preprocessing:** Standard Whisper tokenization and audio normalization.
- - **Training regime:** PEFT fine-tuning

- ---

- ## Evaluation

- - **Test data:** Held-out evaluation split of AS-70 Mandarin stuttered speech dataset.
- - **Metrics:**
- - **Results:**

- ---

- ## Environmental Impact

- - **Hardware:** NVIDIA A100 GPU
- - **Compute hours:**

- ---

  ## Citation

- **Paper:**
- Jingjin Li, Qisheng Li, Rong Gong, Lezhi Wang, and Shaomei Wu. 2025. Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25). Association for Computing Machinery, New York, NY, USA, 2768–2783. https://doi.org/10.1145/3715275.3732179

+ # Model Card: AImpower/StutteredSpeechASR

+ This model is a version of OpenAI's `whisper-large-v2` fine-tuned on the **AImpower/MandarinStutteredSpeech** dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).

  ## Model Details

+ * **Base Model:** `openai/whisper-large-v2`
+ * **Language:** Mandarin Chinese
+ * **Fine-tuning Dataset:** [AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)
+ * **Fine-tuning Method:** AdaLora, an adaptive variant of LoRA (low-rank adaptation), chosen so the model preserves speech disfluencies in its transcriptions.
+ * **Paper:** [Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset](https://doi.org/10.1145/3715275.3732179)

+ ## Model Description

+ This model is specifically adapted to provide more accurate and authentic transcriptions for Mandarin-speaking PWS.
+ Standard Automatic Speech Recognition (ASR) models often exhibit "fluency bias": they "smoothen" out or delete stuttered speech patterns such as repetitions and interjections.
+ This model was fine-tuned on **literal transcriptions** that intentionally preserve these disfluencies.

+ The primary goal is to create a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.

+ ## Intended Uses & Limitations

+ ### Intended Use

+ This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It is particularly useful for:
+ * Improving accessibility in speech-to-text applications.
+ * Linguistic research on stuttered speech.
+ * Developing more inclusive voice-enabled technologies.

+ ### Limitations

+ * **Language Specificity:** The model is trained exclusively on Mandarin Chinese and is not intended for other languages.
+ * **Data Specificity:** Performance is optimized for the speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
+ * **Variability:** Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.

+ ---
  ## How to Use

+ You can use the model with the `transformers` library. Ensure you have `torch`, `transformers`, and `librosa` installed.
+
  ```python
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
  import torch
+ import librosa

+ # Load the fine-tuned model and processor
+ model_path = "AImpower/StutteredSpeechASR"
  model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
  processor = AutoProcessor.from_pretrained(model_path)

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model.to(device)
+
+ # Load an example audio file (replace with your audio file)
+ audio_input_name = "example_stuttered_speech.wav"
+ waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)
+
+ # Process the audio and generate transcription
+ input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
+ input_features = input_features.to(device)
+
+ predicted_ids = model.generate(input_features)
+ transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
+
+ print(f"Transcription: {transcription}")
  ```
 
+ -----

+ ## Training Data

+ The model was fine-tuned on the **[AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)** dataset.
+ This dataset was created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.

+ * **Size:** The dataset contains nearly 50 hours of speech from 72 adults who stutter.
+ * **Content:** It includes both unscripted, spontaneous conversations between two PWS and the dictation of 200 voice commands.
+ * **Transcription:** Training used verbatim (literal) transcriptions that keep disfluencies such as word repetitions and interjections; preserving them was a deliberate choice by the community to ensure their speech is represented authentically.

+ ## Training Procedure

+ * **Data Split:** A three-fold cross-validation approach was used, with data split by participant to ensure robustness. Each fold used a roughly 65:10:25 train/dev/test split with balanced representation of mild, moderate, and severe stuttering levels. This model card represents the best-performing fold.
+ * **Hyperparameters:**
+   * **Epochs:** 3
+   * **Learning Rate:** 0.001
+   * **Optimizer:** AdamW
+   * **Batch Size:** 16
+ * **Fine-tuning Method:** AdaLora

+ -----

+ ## Evaluation Results
+
+ The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline `whisper-large-v2` model.
+ The key metric is Character Error Rate (CER), evaluated against literal transcriptions to measure the model's ability to preserve disfluencies.
+
+ | Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER |
+ | :------------------ | :------------------- | :------------------- |
+ | Mild                | 16.34%               | **5.80%**            |
+ | Moderate            | 21.72%               | **9.03%**            |
+ | Severe              | 49.24%               | **20.46%**           |
+
+ *(Results from Figure 3 of the paper)*
+
+ Notably, the model achieved a significant reduction in **deletion errors (DEL)**, especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.
 
  ## Citation

+ If you use this model, please cite the original paper:
+
+ ```bibtex
+ @inproceedings{li2025collective,
+   author    = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
+   title     = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
+   year      = {2025},
+   isbn      = {9798400714825},
+   publisher = {Association for Computing Machinery},
+   address   = {New York, NY, USA},
+   url       = {https://doi.org/10.1145/3715275.3732179},
+   booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
+   pages     = {2768--2783},
+   location  = {Athens, Greece},
+   series    = {FAccT '25}
+ }
+ ```