---
license: mit
base_model:
- facebook/wav2vec2-large-robust
- aadel4/Wav2vec_Classroom
pipeline_tag: automatic-speech-recognition
tags:
- wav2vec2
library_name: transformers
---

## Model Card: Wav2vec_Classroom_WSP_FT

### Model Overview

**Model Name:** Wav2vec_Classroom_WSP_FT

**Version:** 1.0

**Developed By:** Ahmed Adel Attia (University of Maryland & Stanford University)

**Date:** 2025

**Description:**
Wav2vec_Classroom_WSP_FT is an automatic speech recognition (ASR) model trained specifically for classroom speech transcription using a weakly supervised pretraining (WSP) approach. The model first undergoes supervised pretraining on weakly transcribed classroom data (NCTE-Weak) and is then fine-tuned on a small amount of human-verified gold-standard data (NCTE-Gold). This methodology allows the model to generalize well despite the scarcity of precisely transcribed classroom speech.

This model is adapted from **[Wav2vec-Classroom](https://huggingface.co/aadel4/Wav2vec_Classroom)**, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation involves further fine-tuning to leverage weak transcriptions before final refinement on high-quality annotations.

This model was originally trained with the fairseq library and then ported to Hugging Face Transformers.

**Use Case:**
- Speech-to-text transcription for classroom environments.
- Educational research and analysis of classroom discourse.
- Low-resource ASR applications where gold-standard labels are limited.

### Model Details

**Architecture:** Wav2vec2.0-based model fine-tuned with fairseq

**Training Data:**
- **NCTE-Weak:** 5,000 hours of weak transcriptions from the NCTE dataset.
- **NCTE-Gold:** 13 hours of manually transcribed classroom recordings.

**Training Strategy:**
1. **Weakly Supervised Pretraining (WSP):** The model is first trained on NCTE-Weak transcripts, which contain alignment errors and omissions but still provide useful weak supervision.
2. **Precise Fine-tuning:** The pretrained model is then fine-tuned on NCTE-Gold so that it adapts to high-quality transcriptions.

### Evaluation Results

**Word Error Rate (WER) comparison on NCTE and MPT test sets:**

| Training Data | NCTE WER (%) | MPT WER (%) |
|--------------|----------|---------|
| **Baseline (TEDLIUM-trained ASR)** | 55.82 / 50.56 | 55.11 / 50.50 |
| **NCTE-Weak only** | 36.23 / 32.30 | 50.84 / 46.09 |
| **NCTE-Gold only** | 21.12 / 16.47 | 31.52 / 27.93 |
| **Self-training** | 17.45 / 15.09 | 27.42 / 26.24 |
| **NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold)** | **16.54 / 13.51** | **25.07 / 23.70** |

### Limitations

- The model relies on weak supervision, so transcription quality depends on the balance between weak and gold-standard data.
- Classroom noise, overlapping speech, and spontaneous interactions may still lead to recognition errors.
- The model was trained specifically on elementary math classrooms and may not generalize well to other educational settings without further adaptation.

### Usage Request

If you use the NCTE-WSP-ASR model in your research, please acknowledge this work and refer to the original paper submitted to Interspeech 2025.

For inquiries or collaborations, please contact the authors of the original paper.
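
### Example: Transcription and WER Scoring

The card lists `transformers` as the library and reports results as word error rate, so a short sketch of both steps may be useful. It rests on several assumptions that this card does not confirm: that the checkpoint is available on the Hugging Face Hub under the repository id `aadel4/Wav2vec_Classroom_WSP_FT` (guessed from the model name and the base model's namespace), that the repository ships a tokenizer and feature extractor loadable with `Wav2Vec2Processor`, and that input audio is mono and sampled at 16 kHz. The helper names `transcribe` and `word_error_rate` are illustrative. The WER helper is the standard word-level edit distance divided by the reference length; it applies no text normalization and is not necessarily the scoring setup behind the table above.

```python
# Assumed repository id; replace it with the actual Hub id if it differs.
MODEL_ID = "aadel4/Wav2vec_Classroom_WSP_FT"


def transcribe(waveform, model_id=MODEL_ID, sampling_rate=16_000):
    """Greedy CTC decoding of a 1-D float waveform (assumed mono, 16 kHz)."""
    # Imported lazily so the WER helper below works without torch installed.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_id)
    model = Wav2Vec2ForCTC.from_pretrained(model_id)
    model.eval()

    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    # Pick the most likely token per frame; batch_decode collapses CTC repeats
    # and blanks into text.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

`word_error_rate` returns a fraction, so multiply by 100 to compare with the percentages in the evaluation table. For example, `word_error_rate("two plus two is four", "two plus to is four")` returns `0.2`, since one of the five reference words is substituted. `transcribe` reloads the model on every call to keep the sketch short; for more than a handful of files, load the processor and model once and reuse them.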