---
license: apache-2.0
datasets:
- AImpower/MandarinStutteredSpeech
language:
- zh
metrics:
- cer
base_model:
- openai/whisper-large-v2
pipeline_tag: automatic-speech-recognition
---

# Model Card: AImpower/StutteredSpeechASR

This model is a version of OpenAI's `whisper-large-v2` fine-tuned on the **AImpower/MandarinStutteredSpeech** dataset, a grassroots-collected corpus of Mandarin Chinese speech from people who stutter (PWS).

## Model Details

* **Base Model:** `openai/whisper-large-v2`
* **Language:** Mandarin Chinese
* **Fine-tuning Dataset:** [AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)
* **Fine-tuning Method:** AdaLoRA, an adaptive-budget variant of low-rank adaptation (LoRA), trained on literal transcriptions so that speech disfluencies are preserved in the output.
* **Paper:** [Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset](https://doi.org/10.1145/3715275.3732179)

## Model Description

This model is specifically adapted to provide more accurate and authentic transcriptions for Mandarin-speaking PWS.
Standard automatic speech recognition (ASR) models often exhibit "fluency bias": they smooth over or delete stuttered speech patterns such as repetitions and interjections.
This model was instead fine-tuned on literal transcriptions that intentionally preserve these disfluencies.

The primary goal is a more inclusive ASR system that recognizes and respects the natural speech patterns of PWS, reducing deletion errors and improving overall accuracy.

## Intended Uses & Limitations

### Intended Use

This model is intended for transcribing conversational Mandarin Chinese speech from individuals who stutter. It is particularly useful for:

* Improving accessibility in speech-to-text applications.
* Linguistic research on stuttered speech.
* Developing more inclusive voice-enabled technologies.

### Limitations

* **Language Specificity:** The model is fine-tuned exclusively on Mandarin Chinese and is not intended for other languages.
* **Data Specificity:** Performance is optimized for speech patterns present in the AImpower/MandarinStutteredSpeech dataset. It may not perform as well on other types of atypical speech or in environments with significant background noise.
* **Variability:** Stuttering is highly variable. While the model shows significant improvements across severity levels, accuracy may still vary between individuals and contexts.

---

## How to Use

You can use the model with the `transformers` library. Ensure you have `torch`, `transformers`, and `librosa` installed.

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import librosa

# Load the fine-tuned model and processor
model_path = "AImpower/StutteredSpeechASR"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Load an example audio file (replace with your audio file);
# Whisper expects 16 kHz mono audio
audio_input_name = "example_stuttered_speech.wav"
waveform, sampling_rate = librosa.load(audio_input_name, sr=16000)

# Process the audio and generate the transcription
input_features = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt").input_features
input_features = input_features.to(device)

predicted_ids = model.generate(input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

print(f"Transcription: {transcription}")
```
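For recordings longer than Whisper's 30-second input window, the `transformers` pipeline API with chunking is a convenient alternative. The sketch below is illustrative; the helper names (`build_asr`, `transcribe_file`) are our own, not part of any library.

```python
from transformers import pipeline

def build_asr(model_id: str = "AImpower/StutteredSpeechASR", device: str = "cpu"):
    """Build an ASR pipeline; chunk_length_s enables long-form audio
    beyond Whisper's 30-second window."""
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,
        device=device,
    )

def transcribe_file(path: str) -> str:
    """Transcribe a single audio file; the pipeline handles resampling."""
    asr = build_asr()
    return asr(path)["text"]
```

Chunked inference trades a small amount of accuracy at chunk boundaries for the ability to process arbitrarily long conversations.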

---

## Training Data

The model was fine-tuned on the **[AImpower/MandarinStutteredSpeech](https://huggingface.co/datasets/AImpower/MandarinStutteredSpeech)** dataset.
This dataset was created through a community-led, grassroots effort with StammerTalk, an online community for Chinese-speaking PWS.

* **Size:** The dataset contains nearly 50 hours of speech from 72 adults who stutter.
* **Content:** It includes both unscripted, spontaneous conversations between two PWS and the dictation of 200 voice commands.
* **Transcription:** Training used verbatim (literal) transcriptions that include disfluencies such as word repetitions and interjections; preserving them was a deliberate choice by the community to ensure their speech was represented authentically.

## Training Procedure

* **Data Split:** Three-fold cross-validation, with data split by participant to ensure robustness. Each fold used a roughly 65:10:25 train/dev/test split with balanced representation of mild, moderate, and severe stuttering levels. This model card represents the best-performing fold.
* **Hyperparameters:**
  * **Epochs:** 3
  * **Learning Rate:** 0.001
  * **Optimizer:** AdamW
  * **Batch Size:** 16
* **Fine-tuning Method:** AdaLoRA
* **Hardware:** Four NVIDIA A100 80 GB GPUs

---
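For readers unfamiliar with AdaLoRA, a configuration with the `peft` library might look like the fragment below. This is a sketch only: the rank budget, target modules, dropout, and step count are illustrative assumptions, not values reported in the paper.

```python
from peft import AdaLoraConfig, TaskType

# Sketch of an AdaLoRA configuration; rank budget (init_r/target_r),
# target modules, dropout, and total_step are illustrative assumptions.
adalora_config = AdaLoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    init_r=12,           # initial per-module rank budget (assumed)
    target_r=8,          # final average rank after pruning (assumed)
    lora_alpha=32,       # scaling factor (assumed)
    lora_dropout=0.1,    # (assumed)
    total_step=10000,    # total optimizer steps for the rank schedule (assumed)
    target_modules=["q_proj", "v_proj"],  # Whisper attention projections (assumed)
)

# The adapter would then be attached to the base Whisper model with:
#   from peft import get_peft_model
#   model = get_peft_model(base_model, adalora_config)
```

Unlike plain LoRA, AdaLoRA prunes its rank budget during training, shifting capacity toward the layers that matter most.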

## Evaluation Results

The fine-tuned model demonstrates a substantial improvement in transcription accuracy across all stuttering severity levels compared to the baseline `whisper-large-v2` model.
The key metric is character error rate (CER), evaluated against literal transcriptions to measure the model's ability to preserve disfluencies.
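Concretely, CER is the character-level edit distance (substitutions, deletions, insertions) divided by the reference length; for Chinese, the character rather than the word is the unit of comparison. A minimal stdlib sketch of the arithmetic (evaluation pipelines typically use a library such as `jiwer`):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over characters,
    normalized by the (non-empty) reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m
```

For example, against a literal reference 我我我想去北京 with a repeated character, a "fluency-biased" hypothesis 我想去北京 incurs two deletions, so the dropped repetitions alone contribute a CER of 2/7 ≈ 28.6%.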

| Stuttering Severity | Baseline Whisper CER | Fine-tuned Model CER |
| :------------------ | :------------------- | :------------------- |
| Mild | 16.34% | **5.80%** |
| Moderate | 21.72% | **9.03%** |
| Severe | 49.24% | **20.46%** |

*(Results from Figure 3 of the paper)*

Notably, the model achieved a significant reduction in **deletion errors (DEL)**, especially for severe speech (from 26.56% to 2.29%), indicating that it is much more effective at preserving repeated words and phrases instead of omitting them.
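The per-operation breakdown behind a DEL figure can be recovered from the backtrace of the same edit-distance alignment. A stdlib sketch (the function name is ours, and tie-breaking between equal-cost alignments can vary across implementations):

```python
def edit_ops(reference: str, hypothesis: str):
    """Return (substitutions, deletions, insertions) from a minimal
    character-level alignment; deletions are reference characters
    the hypothesis dropped."""
    m, n = len(reference), len(hypothesis)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    subs = dels = ins = 0
    i, j = m, n
    # Walk back through the table, preferring diagonal (match/substitute) moves
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]):
            subs += reference[i - 1] != hypothesis[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

# A fluency-biased hypothesis that drops stuttered repetitions
# shows up purely as deletions:
print(edit_ops("我我我想去北京", "我想去北京"))  # → (0, 2, 0)
```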

## Citation

If you use this model, please cite the original paper:

```bibtex
@inproceedings{li2025collective,
  author    = {Li, Jingjin and Li, Qisheng and Gong, Rong and Wang, Lezhi and Wu, Shaomei},
  title     = {Our Collective Voices: The Social and Technical Values of a Grassroots Chinese Stuttered Speech Dataset},
  year      = {2025},
  isbn      = {9798400714825},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3715275.3732179},
  booktitle = {The 2025 ACM Conference on Fairness, Accountability, and Transparency},
  pages     = {2768--2783},
  location  = {Athens, Greece},
  series    = {FAccT '25}
}
```