## Model Card: Wav2vec-Classroom

### Model Overview
**Model Name:** Wav2vec-Classroom  
**Version:** 1.0  
**Developed By:** Ahmed Adel Attia (University of Maryland & Stanford University)  
**Date:** 2025  

**Description:**  
Wav2vec-Classroom is an automatic speech recognition (ASR) model designed for robust performance in classroom environments. The model is adapted from Wav2vec2.0 using **Continued Pretraining (CPT)** on large-scale unlabeled classroom audio data, followed by fine-tuning on a small set of transcribed classroom recordings. This approach enhances the model’s ability to handle classroom noise, overlapping speech, and diverse microphone setups.

**Use Cases:**  
- Speech-to-text transcription for classroom recordings.  
- Automatic feedback generation for educational AI tools.  
- ASR research in low-resource, noisy environments.

### Model Details
**Architecture:** Wav2vec2.0-based self-supervised model, fine-tuned with Fairseq  

**Training Data:**  
- **Unlabeled Classroom Audio (NCTE dataset):** 5235 hours of classroom recordings used for self-supervised CPT.  
- **NCTE-Gold:** 5.15 hours of human-verified classroom transcriptions for supervised fine-tuning.

**Training Strategy:**  
1. **Continued Pretraining (CPT):** The model is initialized from a pre-trained Wav2vec2.0 checkpoint and further pre-trained, with the same self-supervised objective, on 5235 hours of unlabeled classroom speech. This step lets the model learn domain-specific acoustic representations.
2. **Supervised Fine-tuning:** The resulting CPT checkpoint is then fine-tuned on the NCTE-Gold dataset so that its outputs align with the human-verified transcriptions.

### Evaluation Results
**Word Error Rate (WER, %; lower is better) on the NCTE and MPT test sets:**

| Model (pretraining setup) | NCTE WER | MPT WER |
|--------------|----------|---------|
| **Pretraining from Scratch (W2V-SCR)** | 30.25 / 38.59 | 51.39 / 38.59 |
| **Wav2vec2.0-LV60K (No CPT)** | 30.39 / 33.56 | 39.11 / 37.82 |
| **Wav2vec2.0-Robust (No CPT)** | 27.99 / 31.49 | 35.07 / 36.36 |
| **Wav2vec2.0-Robust (CPT)** | **17.71 / 26.50** | **25.04 / 30.97** |
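
For readers less familiar with the metric, the following is a minimal sketch of how WER and a relative WER reduction can be computed. It is an illustration only: this card does not state which scoring tool or text normalization produced the numbers above, and the function names `word_error_rate` and `relative_reduction` are hypothetical helpers, not part of any released code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumption: plain whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the word-level edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution ("books" -> "book") and one deletion ("ten") over 6 words.
    print(word_error_rate("open your books to page ten", "open your book to page"))
    # First NCTE value of Wav2vec2.0-Robust without CPT vs. with CPT, from the table.
    print(relative_reduction(27.99, 17.71))
```

Applied to the first NCTE value in the last two table rows, `relative_reduction(27.99, 17.71)` gives the share of the no-CPT error rate that continued pretraining removes.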

### Limitations
- The model is optimized for classroom speech and may not generalize well to other domains.
- Background noise, overlapping speech, and speaker variations may still impact performance.
- Labeled training data is limited (5.15 hours), so accuracy may degrade on speech or acoustic conditions that are poorly represented in that set.

### Usage Request
If you use the Wav2vec-Classroom model in your research, please acknowledge this work and cite the following paper:

> **CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments**  
> Ahmed Adel Attia, Dorottya Demszky, Tolulopé Ògúnrẹ̀mí, Jing Liu, Carol Espy-Wilson  
> ICASSP 2025

```
@inproceedings{attia2024cpt_wav2vec,
  title={CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments},
  author={Ahmed Adel Attia and Dorottya Demszky and Tolulopé Ògúnrẹ̀mí and Jing Liu and Carol Espy-Wilson},
  booktitle={ICASSP 2025},
  year={2025}
}
```