Automatic Speech Recognition
Safetensors
Chinese
whisper
kexinf1 committed (verified)
Commit 2f2d8cc · 1 Parent(s): 037c4fc

Update README.md

Files changed (1)
  1. README.md +88 -51
README.md CHANGED
@@ -1,88 +1,125 @@
- ---
- license: cc-by-4.0
- datasets:
- - AImpower/MandarinStutteredSpeech
- language:
- - zh
- base_model:
- - openai/whisper-large-v2
- pipeline_tag: automatic-speech-recognition
- ---
- # Whisper Large v2 Chinese Stuttering Fine-Tuned

- This is a fine-tuned version of OpenAIs Whisper Large v2 model, adapted for transcribing Mandarin Chinese speech, especially with stuttering.
- The model was fine-tuned on the **AS-70: A Mandarin stuttered speech dataset** for automatic speech recognition and stuttering event detection.

  ## Model Details

- **Model type:** Automatic Speech Recognition (ASR)
- **Languages:** Mandarin Chinese
- **Base model:** Whisper Large v2
- **Finetuned by:** Rong Gong et al.
- **Model folder:** `whisper-large-v2-finetune`

- ### Authors of the Dataset & Paper

- - Dataset: Rong Gong, Hongfei Xue, Lezhi Wang, Xin Xu, Qisheng Li, Lei Xie, Hui Bu, Shaomei Wu, Jiaming Zhou, Yong Qin, Binbin Zhang, Jun Du, Jia Bin, Ming Li
- - Dataset paper: Gong, R., Xue, H., Wang, L., Xu, X., Li, Q., Xie, L., Bu, H., Wu, S., Zhou, J., Qin, Y., Zhang, B., Du, J., Bin, J., Li, M. (2024) AS-70: A Mandarin stuttered speech dataset for automatic speech recognition and stuttering event detection. Proc. Interspeech 2024, 5098-5102, doi: 10.21437/Interspeech.2024-918
- - Fine-tuning paper: Jingjin Li, Qisheng Li, Rong Gong, Lezhi Wang, and Shaomei Wu. 2025. Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25). Association for Computing Machinery, New York, NY, USA, 2768–2783. https://doi.org/10.1145/3715275.3732179

- ## Intended Uses

- - Transcribing Mandarin Chinese spoken language verbatim, particularly for speakers who stutter.
- - Research in stuttering affirming speech therapy, clinical linguistics, or accessibility applications.

- ### Out-of-Scope Use

- - Non-Chinese languages or highly noisy audio.
- - Real-time transcription without optimization.
- - Sensitive or legal audio without human verification.
- - Other use cases that undermine the dignity and quality of life of people who stutter.

- ## Limitations & Risks

- - Accuracy may drop on fast speech, mixed-language speech, or heavy background noise.
- - Stuttering is highly variable and heterogenous, certain stuttering patterns may still result in high transcription errors.
- - Not recommended to use as sole source for clinical or legal decisions.

  ## How to Use

  ```python
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
  import torch

- model_path = "whisper-large-v2-finetune"
  model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
  processor = AutoProcessor.from_pretrained(model_path)

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model.to(device)
  ```
- ## Training Details

- - **Data:** training split of AS-70 Mandarin stuttered speech dataset.
- - **Preprocessing:** Standard Whisper tokenization and audio normalization.
- - **Training regime:** PEFT fine-tuning

- ---

- ## Evaluation

- - **Test data:** Held-out evaluation split of AS-70 Mandarin stuttered speech dataset.
- - **Metrics:**
- - **Results:**

- ---

- ## Environmental Impact

- - **Hardware:** NVIDIA A100 GPU
- - **Compute hours:**

- ---

  ## Citation

- **Paper:**
- Jingjin Li, Qisheng Li, Rong Gong, Lezhi Wang, and Shaomei Wu. 2025. Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25). Association for Computing Machinery, New York, NY, USA, 2768–2783. https://doi.org/10.1145/3715275.3732179

+ # Model Card: AImpower/StutteredSpeechASR

+ This model is a version of OpenAI's `whisper-large-v2` fine-tuned on the **AImpower/MandarinStutteredSpeech** dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).

  ## Model Details

+ * **Base Model:** `openai/whisper-large-v2`
+ * **Language:** Mandarin Chinese
+ * **Fine-tuning Dataset:** [AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)
+ * **Fine-tuning Method:** AdaLora, an adaptive variant of LoRA (low-rank adaptation), chosen so the model preserves speech disfluencies in its transcriptions.
+ * **Paper:** [Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset](https://doi.org/10.1145/3715275.3732179)

+ ## Model Description

+ This model is specifically adapted to provide more accurate and authentic transcriptions for Mandarin-speaking PWS.
+ Standard Automatic Speech Recognition (ASR) models often exhibit "fluency bias": they "smoothen" out or delete stuttered speech patterns such as repetitions and interjections.
+ This model was fine-tuned on **literal transcriptions** that intentionally preserve these disfluencies.

+ The primary goal is to create a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.

+ ## Intended Uses & Limitations

+ ### Intended Use

+ This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It is particularly useful for:
+ * Improving accessibility in speech-to-text applications.
+ * Linguistic research on stuttered speech.
+ * Developing more inclusive voice-enabled technologies.

+ ### Limitations

+ * **Language Specificity:** The model is trained exclusively on Mandarin Chinese and is not intended for other languages.
+ * **Data Specificity:** Performance is optimized for the speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
+ * **Variability:** Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.

+ ---
  ## How to Use

+ You can use the model with the `transformers` library. Ensure you have `torch`, `transformers`, and `librosa` installed.
+
  ```python
  from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
  import torch
+ import librosa

+ # Load the fine-tuned model and processor
+ model_path = "AImpower/StutteredSpeechASR"
  model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
  processor = AutoProcessor.from_pretrained(model_path)

  device = "cuda" if torch.cuda.is_available() else "cpu"
  model.to(device)
+
+ # Load an example audio file (replace with your audio file)
+ audio_input_name = "example_stuttered_speech.wav"
+ waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)
+
+ # Process the audio and generate transcription
+ input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
+ input_features = input_features.to(device)
+
+ predicted_ids = model.generate(input_features)
+ transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
+
+ print(f"Transcription: {transcription}")
  ```
 
+ -----

+ ## Training Data

+ The model was fine-tuned on the **[AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)** dataset.
+ This dataset was created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.

+ * **Size:** The dataset contains nearly 50 hours of speech from 72 adults who stutter.
+ * **Content:** It includes both unscripted, spontaneous conversations between two PWS and the dictation of 200 voice commands.
+ * **Transcription:** Training used verbatim (literal) transcriptions that keep disfluencies such as word repetitions and interjections; preserving them was a deliberate choice by the community to ensure their speech is represented authentically.

+ ## Training Procedure

+ * **Data Split:** A three-fold cross-validation approach was used, with data split by participant to ensure robustness. Each fold used a roughly 65:10:25 train/dev/test split with balanced representation of mild, moderate, and severe stuttering levels. This model card represents the best-performing fold.
+ * **Hyperparameters:**
+   * **Epochs:** 3
+   * **Learning Rate:** 0.001
+   * **Optimizer:** AdamW
+   * **Batch Size:** 16
+ * **Fine-tuning Method:** AdaLora

+ -----

+ ## Evaluation Results
+
+ The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline `whisper-large-v2` model.
+ The key metric is Character Error Rate (CER), evaluated against literal transcriptions to measure the model's ability to preserve disfluencies.
+
+ | Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER |
+ | :------------------ | :------------------- | :------------------- |
+ | Mild                | 16.34%               | **5.80%**            |
+ | Moderate            | 21.72%               | **9.03%**            |
+ | Severe              | 49.24%               | **20.46%**           |
+
+ *(Results from Figure 3 of the paper)*
+
+ Notably, the model achieved a significant reduction in **deletion errors (DEL)**, especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.
 
  ## Citation

+ If you use this model, please cite the original paper:
+
+ ```bibtex
+ @inproceedings{li2025collective,
+   author    = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
+   title     = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
+   year      = {2025},
+   isbn      = {9798400714825},
+   publisher = {Association for Computing Machinery},
+   address   = {New York, NY, USA},
+   url       = {https://doi.org/10.1145/3715275.3732179},
+   booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
+   pages     = {2768--2783},
+   location  = {Athens, Greece},
+   series    = {FAccT '25}
+ }
+ ```