Commit 62db0d8 by aadel4 · verified · 1 Parent(s): b314481

Create README.md

Files changed (1)
  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ## Model Card: Wav2vec-Classroom
+
+ ### Model Overview
+ **Model Name:** Wav2vec-Classroom
+ **Version:** 1.0
+ **Developed By:** Ahmed Adel Attia (University of Maryland & Stanford University)
+ **Date:** 2025
+
+ **Description:**
+ Wav2vec-Classroom is an automatic speech recognition (ASR) model built for classroom environments. It is adapted from Wav2vec2.0 through **Continued Pretraining (CPT)** on 5235 hours of unlabeled classroom audio, followed by fine-tuning on 5.15 hours of transcribed classroom recordings. CPT exposes the model to classroom acoustics before any labels are used, which improves its handling of classroom noise, overlapping speech, and diverse microphone setups.
+
+ **Use Cases:**
+ - Speech-to-text transcription for classroom recordings.
+ - Automatic feedback generation for educational AI tools.
+ - ASR research in low-resource, noisy environments.
+
+ ### Model Details
+ **Architecture:** Wav2vec2.0-based self-supervised model, fine-tuned with Fairseq
+
+ **Training Data:**
+ - **Unlabeled Classroom Audio (NCTE dataset):** 5235 hours of classroom recordings used for self-supervised CPT.
+ - **NCTE-Gold:** 5.15 hours of human-verified classroom transcriptions for supervised fine-tuning.
+
+ **Training Strategy:**
+ 1. **Continued Pretraining (CPT):** The model is initialized from a pre-trained Wav2vec2.0 checkpoint and pre-trained further on the 5235 hours of unlabeled classroom speech. This step is self-supervised, so it needs no transcripts, and it lets the model learn domain-specific acoustic representations.
+ 2. **Supervised Fine-tuning:** The CPT model is then fine-tuned on the NCTE-Gold transcriptions so that it learns to map these representations to text.
+
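The card does not state the fine-tuning objective or the decoder. Wav2vec 2.0 models are conventionally fine-tuned with a CTC loss, so, assuming that also holds here, the last step from per-frame predictions to text can be sketched as greedy CTC decoding. The vocabulary, the blank index, and the frame labels below are hypothetical and chosen only for illustration.

```python
from itertools import groupby
from typing import Mapping, Sequence

BLANK_ID = 0  # assumption: index of the CTC blank token in a toy vocabulary


def ctc_greedy_decode(frame_ids: Sequence[int],
                      id_to_char: Mapping[int, str],
                      blank_id: int = BLANK_ID) -> str:
    """Greedy CTC decoding: collapse repeated frame labels, then drop blanks."""
    # Step 1: merge runs of identical labels (one label per run).
    collapsed = [label for label, _ in groupby(frame_ids)]
    # Step 2: remove blank tokens, which only separate repeated characters.
    chars = [id_to_char[i] for i in collapsed if i != blank_id]
    # Wav2vec 2.0 character vocabularies conventionally use "|" as word boundary.
    return "".join(chars).replace("|", " ").strip()


# Hypothetical vocabulary and per-frame argmax labels, not taken from the model.
vocab = {0: "<blank>", 1: "|", 2: "h", 3: "i", 4: "a", 5: "l"}
frames = [2, 2, 0, 3, 3, 1, 1, 4, 0, 5, 5, 0, 5]
print(ctc_greedy_decode(frames, vocab))  # prints "hi all"
```

The blank between the two `l` frames is what keeps the double letter in "all"; without it the repeated labels would collapse into one.
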
+ ### Evaluation Results
+ **Word Error Rate (WER, %) comparison on NCTE and MPT test sets (lower is better):**
+
+ | Training Data | NCTE WER | MPT WER |
+ |--------------|----------|---------|
+ | **Pretraining from Scratch (W2V-SCR)** | 30.25 / 38.59 | 51.39 / 38.59 |
+ | **Wav2vec2.0-LV60K (No CPT)** | 30.39 / 33.56 | 39.11 / 37.82 |
+ | **Wav2vec2.0-Robust (No CPT)** | 27.99 / 31.49 | 35.07 / 36.36 |
+ | **Wav2vec2.0-Robust (CPT)** | **17.71 / 26.50** | **25.04 / 30.97** |
+
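For readers comparing rows, here is a minimal sketch of how WER is commonly computed (word-level edit distance divided by the number of reference words) and how a relative reduction between two rows can be derived. It is not the authors' evaluation script. The example sentences are made up, and the relative reduction uses only the first value of each NCTE cell, since the card does not say what the two values per cell denote. Corpus-level WER usually sums errors over all utterances before dividing; this sketch scores a single utterance.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Made-up utterance: two substitutions over four reference words.
print(round(100 * word_error_rate("please open your books", "please open the book"), 2))  # 50.0
# First NCTE value, Wav2vec2.0-Robust without CPT vs. with CPT.
print(round(100 * relative_reduction(27.99, 17.71), 1))  # 36.7
```
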
+ ### Limitations
+ - The model is optimized for classroom speech and may not generalize well to other domains.
+ - Background noise, overlapping speech, and speaker variation can still degrade performance.
+ - Only 5.15 hours of labeled data were available for fine-tuning, which may limit accuracy in extreme cases that this data covers poorly.
+
+ ### Usage Request
+ If you use the Wav2vec-Classroom model in your research, please acknowledge this work and cite the following paper:
+
+ > **CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments**
+ > Ahmed Adel Attia, Dorottya Demszky, Tolulopé Ògúnrẹ̀mí, Jing Liu, Carol Espy-Wilson
+ > *arXiv preprint arXiv:2409.14494*, 2024
+
+ ```
+ @article{attia2024cpt_wav2vec,
+ title={CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments},
+ author={Ahmed Adel Attia and Dorottya Demszky and Tolulopé Ògúnrẹ̀mí and Jing Liu and Carol Espy-Wilson},
+ journal={arXiv preprint arXiv:2409.14494},
+ year={2024}
+ }
+ ```
+