## Model Card: Wav2vec-Classroom

### Model Overview
**Model Name:** Wav2vec-Classroom
**Version:** 1.0
**Developed By:** Ahmed Adel Attia (University of Maryland & Stanford University)
**Date:** 2025

**Description:**
Wav2vec-Classroom is an automatic speech recognition (ASR) model designed for robust performance in classroom environments. The model is adapted from Wav2vec2.0 using **Continued Pretraining (CPT)** on large-scale unlabeled classroom audio data, followed by fine-tuning on a small set of transcribed classroom recordings. This approach enhances the model’s ability to handle classroom noise, overlapping speech, and diverse microphone setups.

**Use Case:**
- Speech-to-text transcription for classroom recordings.
- Automatic feedback generation for educational AI tools.
- ASR research in low-resource, noisy environments.
### Model Details
**Architecture:** Wav2vec2.0-based self-supervised model, fine-tuned with Fairseq

**Training Data:**
- **Unlabeled Classroom Audio (NCTE dataset):** 5235 hours of classroom recordings used for self-supervised CPT.
- **NCTE-Gold:** 5.15 hours of human-verified classroom transcriptions for supervised fine-tuning.

**Training Strategy:**
1. **Continued Pretraining (CPT):** The model is initialized with a pre-trained Wav2vec2.0 checkpoint and further pre-trained on 5235 hours of unlabeled classroom speech data. This step allows the model to learn domain-specific acoustic representations.
2. **Supervised Fine-tuning:** The CPT-pretrained model is then fine-tuned using the NCTE-Gold dataset for better alignment with transcriptions.
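The card does not say how the fine-tuned model's outputs are turned into text. Wav2vec2.0 fine-tuning conventionally uses a CTC objective, so, assuming that applies here, a minimal sketch of greedy CTC decoding could look like the following. The vocabulary, the blank index, and the `|` word-boundary symbol are illustrative assumptions, not values taken from this model.

```python
from itertools import groupby

# Hypothetical character vocabulary; index 0 is assumed to be the CTC blank.
VOCAB = ["<blank>", "|", "a", "c", "l", "s"]
BLANK_ID = 0


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame labels, drop blanks, and map ids to text."""
    # Step 1: merge consecutive duplicates (one label per run of frames).
    collapsed = [label for label, _ in groupby(frame_ids)]
    # Step 2: remove blank symbols, which only separate repeated characters.
    chars = [vocab[i] for i in collapsed if i != blank_id]
    # Step 3: "|" is assumed to mark word boundaries.
    return "".join(chars).replace("|", " ").strip()


# Made-up per-frame argmax ids spelling the word "class".
frames = [3, 3, 0, 4, 4, 2, 0, 5, 5, 0, 5, 1]
decoded = ctc_greedy_decode(frames)
print(decoded)
```

The blank between the two `s` frames is what lets the decoder keep a doubled letter instead of collapsing it.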
### Evaluation Results
**Word Error Rate (WER) comparison on NCTE and MPT test sets:**

| Training Data | NCTE WER | MPT WER |
|--------------|----------|---------|
| **Pretraining from Scratch (W2V-SCR)** | 30.25 / 38.59 | 51.39 / 38.59 |
| **Wav2vec2.0-LV60K (No CPT)** | 30.39 / 33.56 | 39.11 / 37.82 |
| **Wav2vec2.0-Robust (No CPT)** | 27.99 / 31.49 | 35.07 / 36.36 |
| **Wav2vec2.0-Robust (CPT)** | **17.71 / 26.50** | **25.04 / 30.97** |
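WER counts word-level substitutions, deletions, and insertions and divides by the number of reference words. The sketch below shows one way to compute it; it is not the evaluation script behind the table, the example sentences are made up, and no text normalization is applied. The helper `relative_reduction` is also illustrative, and is applied here to the first value in the NCTE column for the two Wav2vec2.0-Robust rows.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


# Made-up example: one substitution ("your" -> "you") and one deletion ("to").
wer = word_error_rate(
    "please open your books to page ten",
    "please open you books page ten",
)
print(f"WER: {wer:.3f}")

# First NCTE value, Wav2vec2.0-Robust without and with CPT.
print(f"Relative reduction: {relative_reduction(27.99, 17.71):.1%}")
```

On those two table values, moving from 27.99 to 17.71 is a relative WER reduction of roughly 37%.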
### Limitations
- The model is optimized for classroom speech and may not generalize well to other domains.
- Background noise, overlapping speech, and speaker variations may still impact performance.
- The amount of labeled training data remains limited (5.15 hours in NCTE-Gold), which may affect ASR accuracy in extreme cases.
### Usage Request
If you use the Wav2vec-Classroom model in your research, please acknowledge this work and cite the following paper:

> **CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments**
> Ahmed Adel Attia, Dorottya Demszky, Tolulopé Ògúnrẹ̀mí, Jing Liu, Carol Espy-Wilson
> *arXiv preprint arXiv:2409.14494*, 2024

```
@article{attia2024cpt_wav2vec,
  title={CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments},
  author={Ahmed Adel Attia and Dorottya Demszky and Tolulopé Ògúnrẹ̀mí and Jing Liu and Carol Espy-Wilson},
  journal={arXiv preprint arXiv:2409.14494},
  year={2024}
}
```