	## Model Card: Wav2vec-Classroom

	### Model Overview
	Model Name: Wav2vec-Classroom
	Version: 1.0
	Developed By: Ahmed Adel Attia (University of Maryland & Stanford University)
	Date: 2025

	Description:
	Wav2vec-Classroom is an automatic speech recognition (ASR) model designed for robust performance in classroom environments. The model is adapted from Wav2vec2.0 using Continued Pretraining (CPT) on large-scale unlabeled classroom audio data, followed by fine-tuning on a small set of transcribed classroom recordings. This approach enhances the model’s ability to handle classroom noise, overlapping speech, and diverse microphone setups.

	Use Cases:
	- Speech-to-text transcription for classroom recordings.
	- Automatic feedback generation for educational AI tools.
	- ASR research in low-resource, noisy environments.

	### Model Details
	Architecture: Wav2vec2.0-based self-supervised model, fine-tuned with Fairseq

	Training Data:
	- Unlabeled Classroom Audio (NCTE dataset): 5235 hours of classroom recordings used for self-supervised CPT.
	- NCTE-Gold: 5.15 hours of human-verified classroom transcriptions for supervised fine-tuning.

	Training Strategy:
	1. Continued Pretraining (CPT): The model is initialized from a pretrained Wav2vec2.0 checkpoint and further pretrained on 5235 hours of unlabeled classroom speech, which lets it learn domain-specific acoustic representations.
	2. Supervised Fine-tuning: The resulting CPT checkpoint is then fine-tuned on the NCTE-Gold dataset so that it learns to map classroom audio to transcriptions.
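The two stages above differ sharply in how much data each one sees. The following is a minimal sketch that records the recipe as plain data and computes the ratio of unlabeled to labeled hours. The dataset names and hour counts come from this card; the `Stage` class, the `RECIPE` tuple, and the function name are hypothetical and are not part of the Fairseq training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage (illustrative structure, not a Fairseq config)."""
    name: str
    data: str
    hours: float
    labeled: bool


# Dataset names and hour counts are taken from the model card.
RECIPE = (
    Stage("continued_pretraining", "NCTE (unlabeled)", 5235.0, labeled=False),
    Stage("supervised_finetuning", "NCTE-Gold", 5.15, labeled=True),
)


def unlabeled_to_labeled_ratio(recipe) -> float:
    """Return hours of unlabeled audio per hour of transcribed audio."""
    unlabeled = sum(s.hours for s in recipe if not s.labeled)
    labeled = sum(s.hours for s in recipe if s.labeled)
    if labeled == 0:
        raise ValueError("recipe contains no labeled data")
    return unlabeled / labeled


if __name__ == "__main__":
    for stage in RECIPE:
        kind = "labeled" if stage.labeled else "unlabeled"
        print(f"{stage.name}: {stage.hours} h of {kind} audio ({stage.data})")
    print(f"unlabeled/labeled ratio: {unlabeled_to_labeled_ratio(RECIPE):.1f}")
```

The ratio comes out at roughly a thousand hours of unlabeled audio per transcribed hour, which is why the recipe relies on self-supervised pretraining for most of the domain adaptation.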

	### Evaluation Results
	Word Error Rate (WER) comparison on NCTE and MPT test sets:

	| Training Data | NCTE WER | MPT WER |
	|--------------|----------|---------|
	| Pretraining from Scratch (W2V-SCR) | 30.25 / 38.59 | 51.39 / 38.59 |
	| Wav2vec2.0-LV60K (No CPT) | 30.39 / 33.56 | 39.11 / 37.82 |
	| Wav2vec2.0-Robust (No CPT) | 27.99 / 31.49 | 35.07 / 36.36 |
	| Wav2vec2.0-Robust (CPT) | 17.71 / 26.50 | 25.04 / 30.97 |
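For readers unfamiliar with the metric, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. It assumes plain whitespace tokenization with no text normalization, which may differ from the scoring setup behind the table. The function names and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up sentences: one substitution and one deletion over 7 words.
    ref = "please open your books to page ten"
    hyp = "please open your book to page"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")

    # First NCTE value of the two Wav2vec2.0-Robust rows in the table.
    print(f"relative reduction: {relative_reduction(27.99, 17.71):.3f}")
```

Applied to the first NCTE value of the two Wav2vec2.0-Robust rows (27.99 without CPT, 17.71 with CPT), `relative_reduction` gives a relative WER reduction of a little over one third.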

	### Limitations
	- The model is optimized for classroom speech and may not generalize well to other domains.
	- Background noise, overlapping speech, and speaker variations may still impact performance.
	- The labeled fine-tuning set is small (5.15 hours), which may limit ASR accuracy on particularly challenging audio.

	### Usage Request
	If you use the Wav2vec-Classroom model in your research, please acknowledge this work and cite the following paper:

	> CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments
	> Ahmed Adel Attia, Dorottya Demszky, Tolulopé Ògúnrẹ̀mí, Jing Liu, Carol Espy-Wilson
	> arXiv preprint arXiv:2409.14494, 2024

	```
	@article{attia2024cpt_wav2vec,
	title={CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments},
	author={Ahmed Adel Attia and Dorottya Demszky and Tolulopé Ògúnrẹ̀mí and Jing Liu and Carol Espy-Wilson},
	journal={arXiv preprint arXiv:2409.14494},
	year={2024}
	}
	```
