|
|
--- |
|
|
license: mit |
|
|
base_model: |
|
|
- facebook/wav2vec2-large-robust |
|
|
- aadel4/Wav2vec_Classroom |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
tags: |
|
|
- wav2vec2 |
|
|
library_name: transformers |
|
|
--- |
|
|
## Model Card: Wav2vec_Classroom_FT |
|
|
|
|
|
### Model Overview |
|
|
**Model Name:** Wav2vec_Classroom_FT |
|
|
**Version:** 1.0 |
|
|
**Developed By:** Ahmed Adel Attia (University of Maryland and Stanford University) |
|
|
**Date:** 2025 |
|
|
|
|
|
**Description:** |
|
|
Wav2vec_Classroom_FT is an automatic speech recognition (ASR) model for classroom speech transcription, trained by direct fine-tuning on a small set of human-verified, gold-standard transcriptions. Unlike **NCTE-WSP-ASR**, this model does not leverage weak transcriptions for intermediate training and is trained solely on high-quality annotations.
|
|
|
|
|
This model is adapted from **[Wav2vec-Classroom](https://huggingface.co/aadel4/Wav2vec_Classroom)**, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation consists of direct fine-tuning on a small transcribed dataset.
|
|
|
|
|
This model was originally trained with the fairseq library and then ported to Hugging Face.
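The ported checkpoint can be loaded with the `transformers` ASR pipeline. The sketch below is illustrative only: the repo id `aadel4/Wav2vec_Classroom_FT` and the helper `chunk_audio` are assumptions (classroom recordings are often long, so splitting them into fixed-length segments before inference is a common practice, not the authors' documented procedure).

```python
def chunk_audio(samples, sr=16_000, chunk_s=30.0):
    """Split a long recording into fixed-length chunks so each segment
    fits comfortably in a single forward pass. `chunk_s` is in seconds."""
    step = int(sr * chunk_s)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

if __name__ == "__main__":
    # Lazy import so the helper above stays usable without transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="aadel4/Wav2vec_Classroom_FT",  # assumed repo id; verify before use
    )
    # The pipeline resamples the file to the 16 kHz rate wav2vec 2.0 expects.
    print(asr("classroom_recording.wav")["text"])
```

Alternatively, `pipeline(..., chunk_length_s=30)` lets the pipeline handle long-file chunking itself.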
|
|
|
|
|
**Use Case:** |
|
|
- Speech-to-text transcription for classroom environments. |
|
|
- ASR applications requiring high precision with limited data. |
|
|
- Benchmarking ASR performance without weakly supervised pretraining. |
|
|
|
|
|
### Model Details |
|
|
**Architecture:** Wav2vec2.0-based model fine-tuned with Fairseq |
|
|
|
|
|
**Training Data:** |
|
|
- **NCTE-Gold:** 13 hours of manually transcribed classroom recordings. |
|
|
|
|
|
**Training Strategy:** |
|
|
1. **Direct Fine-tuning:** The model is fine-tuned directly on NCTE-Gold without any pretraining on weak transcripts. |
|
|
2. **Evaluation:** The model is tested on classroom ASR tasks to compare its performance with WSP-based models. |
|
|
|
|
|
### Evaluation Results |
|
|
**Word Error Rate (WER) comparison on NCTE and MPT test sets:** |
|
|
|
|
|
| Training Data | NCTE WER | MPT WER | |
|
|
|--------------|----------|---------| |
|
|
| **Baseline (TEDLIUM-trained ASR)** | 55.82 / 50.56 | 55.11 / 50.50 | |
|
|
| **NCTE-Gold only (NCTE-Baseline-ASR)** | 21.12 / 16.47 | 31.52 / 27.93 | |
|
|
| **NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold)** | **16.54 / 13.51** | **25.07 / 23.70** | |
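The WER figures above can be reproduced in principle with a standard word-level edit distance. The sketch below is a minimal reference implementation, not the authors' exact evaluation script (which may apply additional text normalization before scoring).

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution / match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the teacher asked a question", "the teacher ask question")` gives 0.4 (one substitution plus one deletion over five reference words). Libraries such as `jiwer` provide the same metric with built-in normalization.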
|
|
|
|
|
### Limitations |
|
|
- The model is trained on a small dataset (13 hours), which limits its ability to generalize beyond classroom speech. |
|
|
- Performance is lower than **NCTE-WSP-ASR**, which benefits from weak transcripts for pretraining. |
|
|
- Background noise, overlapping speech, and speaker variations may still impact transcription quality. |
|
|
|
|
|
### Usage Request |
|
|
If you use the NCTE-Baseline-ASR model (Wav2vec_Classroom_FT) in your research, please acknowledge this work and cite the original paper submitted to Interspeech 2025.
|
|
|
|
|
For inquiries or collaborations, please contact the authors of the original paper. |