---
license: mit
base_model:
- facebook/wav2vec2-large-robust
- aadel4/Wav2vec_Classroom
pipeline_tag: automatic-speech-recognition
tags:
- wav2vec2
library_name: transformers
---
## Model Card: Wav2vec_Classroom_FT

### Model Overview
**Model Name:** Wav2vec_Classroom_FT  
**Version:** 1.0  
**Developed By:** Ahmed Adel Attia (University of Maryland and Stanford University)  
**Date:** 2025  

**Description:**  
Wav2vec_Classroom_FT is an automatic speech recognition (ASR) model for classroom speech transcription, produced by direct fine-tuning on a small set of human-verified gold-standard transcriptions. Unlike **NCTE-WSP-ASR**, it does not use weak transcriptions for an intermediate training stage and is trained solely on high-quality annotations.

This model is adapted from **[Wav2vec-Classroom](https://huggingface.co/aadel4/Wav2vec_Classroom)**, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation involves direct fine-tuning on a limited transcribed dataset.

This model was originally trained with the Fairseq library and then ported to the Hugging Face Transformers format.

**Use Case:**  
- Speech-to-text transcription for classroom environments.  
- ASR applications requiring high precision with limited data.  
- Benchmarking ASR performance without weakly supervised pretraining.
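Since the checkpoint has been ported to Transformers, inference should follow the standard wav2vec 2.0 CTC workflow. The sketch below is a minimal example under that assumption; the repo id shown is the CPT base model listed in the metadata above, so substitute the actual Hugging Face id of this fine-tuned checkpoint.

```python
"""Minimal inference sketch for a ported wav2vec 2.0 CTC checkpoint.

Assumption: the model is hosted as a standard Transformers Wav2Vec2 CTC
model. MODEL_ID below is a placeholder (the CPT base from the metadata);
replace it with the repo id of Wav2vec_Classroom_FT.
"""
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "aadel4/Wav2vec_Classroom"  # placeholder; use the FT repo id


def transcribe(audio, sampling_rate=16_000, model_id=MODEL_ID):
    """Transcribe a 1-D float waveform (16 kHz mono) to text."""
    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    # Normalize and batch the raw waveform.
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits

    # Greedy CTC decoding: argmax per frame, then collapse repeats/blanks.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

Classroom recordings at other sample rates should be resampled to 16 kHz before calling `transcribe`, as wav2vec 2.0 models expect 16 kHz input.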

### Model Details
**Architecture:** wav2vec 2.0-based model fine-tuned with Fairseq  

**Training Data:**  
- **NCTE-Gold:** 13 hours of manually transcribed classroom recordings.

**Training Strategy:**  
1. **Direct Fine-tuning:** The model is fine-tuned directly on NCTE-Gold without any pretraining on weak transcripts.
2. **Evaluation:** The model is tested on classroom ASR tasks to compare its performance with WSP-based models.

### Evaluation Results
**Word Error Rate (WER) comparison on NCTE and MPT test sets:**

| Training Data | NCTE WER | MPT WER |
|--------------|----------|---------|
| **Baseline (TEDLIUM-trained ASR)** | 55.82 / 50.56 | 55.11 / 50.50 |
| **NCTE-Gold only (NCTE-Baseline-ASR)** | 21.12 / 16.47 | 31.52 / 27.93 |
| **NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold)** | **16.54 / 13.51** | **25.07 / 23.70** |
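The WER figures above are word-level edit distance divided by the number of reference words. The paper's exact scorer and text normalization are not specified here, but a plain pure-Python version of the metric looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Note: real ASR scoring usually normalizes text first (casing,
    punctuation); that step is omitted in this sketch.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)


# One substitution in three reference words -> WER of 1/3.
print(wer("the cat sat", "the cat sit"))
```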

### Limitations
- The model is trained on a small dataset (13 hours), which limits its ability to generalize beyond classroom speech.
- Performance is lower than **NCTE-WSP-ASR**, which benefits from weak transcripts for pretraining.
- Background noise, overlapping speech, and speaker variations may still impact transcription quality.

### Usage Request
If you use this model (referred to as **NCTE-Baseline-ASR** in the paper) in your research, please acknowledge this work and refer to the original paper submitted to Interspeech 2025.

For inquiries or collaborations, please contact the authors of the original paper.