| ## Model Card: Wav2vec-Classroom |
|
|
| ### Model Overview |
| **Model Name:** Wav2vec-Classroom |
| **Version:** 1.0 |
| **Developed By:** Ahmed Adel Attia (University of Maryland & Stanford University) |
| **Date:** 2025 |
|
|
| **Description:** |
| Wav2vec-Classroom is an automatic speech recognition (ASR) model designed for robust performance in classroom environments. The model is adapted from Wav2vec2.0 using **Continued Pretraining (CPT)** on large-scale unlabeled classroom audio data, followed by fine-tuning on a small set of transcribed classroom recordings. This approach enhances the model’s ability to handle classroom noise, overlapping speech, and diverse microphone setups. |
|
|
| **Use Cases:** |
| - Speech-to-text transcription for classroom recordings. |
| - Automatic feedback generation for educational AI tools. |
| - ASR research in low-resource, noisy environments. |
|
|
| ### Model Details |
| **Architecture:** Wav2vec2.0-based self-supervised model, fine-tuned with Fairseq |
|
|
| **Training Data:** |
| - **Unlabeled Classroom Audio (NCTE dataset):** 5235 hours of classroom recordings used for self-supervised CPT. |
| - **NCTE-Gold:** 5.15 hours of human-verified classroom transcriptions for supervised fine-tuning. |
|
|
| **Training Strategy:** |
| 1. **Continued Pretraining (CPT):** The model is initialized from a pretrained Wav2vec2.0 checkpoint and further pretrained, in a self-supervised manner, on 5235 hours of unlabeled classroom audio so that it learns domain-specific acoustic representations. |
| 2. **Supervised Fine-tuning:** The CPT model is then fine-tuned on the 5.15 hours of human-verified transcriptions in NCTE-Gold to map these representations to text. |
|
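Wav2vec2.0 models are typically fine-tuned with a CTC objective, in which the network emits one label per audio frame and the transcript is recovered by merging repeated labels and dropping blanks. This card does not describe the decoding setup that was used, so the following is only a minimal sketch of greedy CTC decoding. The symbol names `BLANK` and `WORD_SEP` and the example frame sequence are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "<blank>"  # hypothetical name for the CTC blank symbol
WORD_SEP = "|"     # assumed word-boundary symbol in a character vocabulary


def ctc_greedy_decode(frame_labels):
    """Collapse per-frame labels into text: merge repeats, then drop blanks."""
    # Step 1: merge consecutive duplicate labels into a single label.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove blanks, which only separate genuine repeated characters.
    chars = [label for label in collapsed if label != BLANK]
    # Step 3: turn word-boundary symbols into spaces.
    return "".join(chars).replace(WORD_SEP, " ").strip()


# Illustrative per-frame argmax labels (not real model output).
frames = ["<blank>", "H", "H", "<blank>", "I", "|", "|",
          "A", "<blank>", "L", "L", "<blank>", "L"]
print(ctc_greedy_decode(frames))  # prints "HI ALL"
```

The blank between the two runs of `L` is what allows the double letter to survive the merge step.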
|
| ### Evaluation Results |
| **Word Error Rate (WER, %; lower is better) on the NCTE and MPT test sets, with the best results in bold:** |
|
|
| | Pretraining Configuration | NCTE WER (%) | MPT WER (%) | |
| |--------------|----------|---------| |
| | **Pretraining from Scratch (W2V-SCR)** | 30.25 / 38.59 | 51.39 / 38.59 | |
| | **Wav2vec2.0-LV60K (No CPT)** | 30.39 / 33.56 | 39.11 / 37.82 | |
| | **Wav2vec2.0-Robust (No CPT)** | 27.99 / 31.49 | 35.07 / 36.36 | |
| | **Wav2vec2.0-Robust (CPT)** | **17.71 / 26.50** | **25.04 / 30.97** | |
|
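For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below shows one common way to compute it, along with the relative reduction between two WER values. The card does not specify the scoring tool or text normalization used for the table, so the function names and the example sentences are illustrative assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent using word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Two substitutions against a four-word reference.
print(word_error_rate("please open your books", "please open the book"))  # 50.0

# First NCTE value of the Wav2vec2.0-Robust rows: No CPT (27.99) vs. CPT (17.71).
print(round(relative_reduction(27.99, 17.71), 1))  # 36.7
```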
|
| ### Limitations |
| - The model is optimized for classroom speech and may not generalize well to other domains. |
| - Background noise, overlapping speech, and speaker variations may still impact performance. |
| - Only 5.15 hours of labeled training data were available for fine-tuning, which may limit accuracy on the most challenging recordings. |
|
|
| ### Usage Request |
| If you use the Wav2vec-Classroom model in your research, please acknowledge this work and cite the following paper: |
|
|
| > **CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments** |
| > Ahmed Adel Attia, Dorottya Demszky, Tolulopé Ògúnrẹ̀mí, Jing Liu, Carol Espy-Wilson |
| > *arXiv preprint arXiv:2409.14494*, 2024 |
|
|
| ```bibtex |
| @article{attia2024cpt_wav2vec, |
| title={CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments}, |
| author={Ahmed Adel Attia and Dorottya Demszky and Tolulopé Ògúnrẹ̀mí and Jing Liu and Carol Espy-Wilson}, |
| journal={arXiv preprint arXiv:2409.14494}, |
| year={2024} |
| } |
| ``` |
|
|
|
|