---
license: cc-by-4.0
datasets:
- AImpower/MandarinStutteredSpeech
language:
- zh
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---
# Model Card: AImpower/StutteredSpeechASR

This model is a version of OpenAI's `whisper-large-v2` fine-tuned on the **AImpower/MandarinStutteredSpeech** dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).

## Model Details

* **Base Model:** `openai/whisper-large-v2`
* **Language:** Mandarin Chinese
* **Fine-tuning Dataset:** [AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)
* **Fine-tuning Method:** Parameter-efficient fine-tuning with AdaLoRA (adaptive low-rank adaptation), chosen so the model preserves speech disfluencies in its transcriptions.
* **Paper:** [Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset](https://doi.org/10.1145/3715275.3732179)
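
The AdaLoRA setup can be sketched with Hugging Face's `peft` library. This is a minimal illustration, not the authors' training code: the rank-schedule values (`init_r`, `target_r`, `total_step`) and the `target_modules` choice are assumptions, since the card does not specify them.

```python
from transformers import AutoModelForSpeechSeq2Seq
from peft import AdaLoraConfig, get_peft_model

# Load the base model that was fine-tuned (per the card above).
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v2")

# AdaLoRA starts each adapter at init_r and adaptively prunes ranks toward
# target_r over training. All values below are illustrative assumptions.
config = AdaLoraConfig(
    init_r=12,            # initial LoRA rank before adaptive pruning
    target_r=4,           # average target rank after budget reallocation
    total_step=5000,      # total optimizer steps (assumed; required by peft)
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```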

## Model Description

This model is specifically adapted to provide more accurate and authentic transcriptions for Mandarin-speaking PWS. Standard automatic speech recognition (ASR) models often exhibit "fluency bias": they smooth out or delete stuttered speech patterns such as repetitions and interjections. This model was instead fine-tuned on **literal transcriptions** that intentionally preserve these disfluencies.

The primary goal is a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.

## Intended Uses & Limitations

### Intended Use

This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It is particularly useful for:

* Improving accessibility in speech-to-text applications.
* Linguistic research on stuttered speech.
* Developing more inclusive voice-enabled technologies.

### Limitations

* **Language Specificity:** The model is trained exclusively on Mandarin Chinese and is not intended for other languages.
* **Data Specificity:** Performance is optimized for the speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
* **Variability:** Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.

---

## How to Use

You can use the model with the `transformers` library. Ensure you have `torch`, `transformers`, and `librosa` installed.

```python
import torch
import librosa
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Load the fine-tuned model and processor
model_path = "AImpower/StutteredSpeechASR"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load an example audio file (replace with your own); Whisper expects 16 kHz audio
audio_input_name = "example_stuttered_speech.wav"
waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)

# Extract log-mel input features and move them to the model's device
input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
input_features = input_features.to(device)

# Generate and decode the transcription
predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
```

---

## Training Data

The model was fine-tuned on the **[AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)** dataset, created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.

* **Size:** Nearly 50 hours of speech from 72 adults who stutter.
* **Content:** Unscripted, spontaneous conversations between two PWS, plus dictation of 200 voice commands.
* **Transcription:** Training used verbatim (literal) transcriptions that retain disfluencies such as word repetitions and interjections, a deliberate choice by the community to ensure their speech is represented authentically.

## Training Procedure

* **Data Split:** Three-fold cross-validation, split by participant so no speaker appears in more than one partition. Each fold uses a roughly 65:10:25 train/dev/test split with balanced representation of mild, moderate, and severe stuttering. This model card reports the best-performing fold.
* **Hyperparameters:**
  * **Epochs:** 3
  * **Learning Rate:** 0.001
  * **Optimizer:** AdamW
  * **Batch Size:** 16
  * **Fine-tuning Method:** AdaLoRA
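
A participant-level split like the one described above can be sketched in a few lines. This is an illustration, not the authors' code: it shuffles speakers and cuts by ratio, and omits the severity balancing the card mentions.

```python
import random

def split_by_participant(participant_ids, seed=0, ratios=(0.65, 0.10, 0.25)):
    """Shuffle participants, then cut into train/dev/test so that no
    speaker's recordings appear in more than one partition."""
    ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = round(ratios[0] * n)
    n_dev = round(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

# 72 speakers, as in the dataset
train, dev, test = split_by_participant(range(72))
print(len(train), len(dev), len(test))  # 47 7 18
```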

---

## Evaluation Results

The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline `whisper-large-v2` model. The key metric is Character Error Rate (CER), evaluated on literal transcriptions to measure the model's ability to preserve disfluencies.

| Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER |
| :------------------ | :------------------- | :------------------- |
| Mild                | 16.34%               | **5.80%**            |
| Moderate            | 21.72%               | **9.03%**            |
| Severe              | 49.24%               | **20.46%**           |

*(Results from Figure 3 of the paper)*

Notably, the model achieved a significant reduction in **deletion errors (DEL)**, especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.
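
CER is edit distance at the character level, normalized by reference length. The snippet below uses made-up strings (not data from the paper) to show why scoring against a literal reference penalizes a fluency-biased hypothesis that deletes repetitions:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # reference char missing from hypothesis
                cur[j - 1] + 1,          # extra char inserted in hypothesis
                prev[j - 1] + (r != h),  # substitution (0 if chars match)
            ))
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

reference = "我我我想去北京"  # literal reference, repetition preserved
smoothed = "我想去北京"       # fluency-biased output: repetition deleted
literal = "我我我想去北京"    # disfluency-preserving output

print(f"smoothed CER: {cer(reference, smoothed):.2f}")  # 0.29 (two deletions)
print(f"literal CER:  {cer(reference, literal):.2f}")   # 0.00
```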

## Citation

If you use this model, please cite the original paper:

```bibtex
@inproceedings{li2025collective,
  author    = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
  title     = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
  year      = {2025},
  isbn      = {9798400714825},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3715275.3732179},
  booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
  pages     = {2768--2783},
  location  = {Athens, Greece},
  series    = {FAccT '25}
}
```