|
|
--- |
|
|
license: mit |
|
|
base_model: |
|
|
- facebook/wav2vec2-large-robust |
|
|
- aadel4/Wav2vec_Classroom |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
tags: |
|
|
- wav2vec2 |
|
|
library_name: transformers |
|
|
--- |
|
|
## Model Card: Wav2vec_Classroom_FT |
|
|
|
|
|
### Model Overview |
|
|
**Model Name:** Wav2vec_Classroom_FT |
|
|
**Version:** 1.0 |
|
|
**Developed By:** Ahmed Adel Attia (University of Maryland and Stanford University) |
|
|
**Date:** 2025 |
|
|
|
|
|
**Description:** |
|
|
Wav2vec_Classroom_FT is an automatic speech recognition (ASR) model for classroom speech transcription, trained by direct fine-tuning on a small set of human-verified, gold-standard transcriptions. Unlike **NCTE-WSP-ASR**, this model does not leverage weak transcriptions for intermediate training and is trained solely on high-quality annotations.
|
|
|
|
|
This model is adapted from **[Wav2vec-Classroom](https://huggingface.co/aadel4/Wav2vec_Classroom)**, which was trained using continued pretraining (CPT) on large-scale unlabeled classroom speech data. The adaptation consists of direct fine-tuning on a small transcribed dataset.
|
|
|
|
|
This model was originally trained with the fairseq library and then ported to Hugging Face.
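The ported checkpoint can be loaded with the `transformers` ASR pipeline. The sketch below is illustrative only: the repo id `aadel4/Wav2vec_Classroom_FT` and the helper `chunk_audio` are assumptions (classroom recordings are often long, so splitting them into fixed-length segments before inference is a common practice, not the authors' documented procedure).

```python
def chunk_audio(samples, sr=16_000, chunk_s=30.0):
    """Split a long recording into fixed-length chunks so each segment
    fits comfortably in a single forward pass. `chunk_s` is in seconds."""
    step = int(sr * chunk_s)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

if __name__ == "__main__":
    # Lazy import so the helper above stays usable without transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model="aadel4/Wav2vec_Classroom_FT",  # assumed repo id; verify before use
    )
    # The pipeline resamples the file to the 16 kHz rate wav2vec 2.0 expects.
    print(asr("classroom_recording.wav")["text"])
```

Alternatively, `pipeline(..., chunk_length_s=30)` lets the pipeline handle long-file chunking itself.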
|
|
|
|
|
**Use Case:** |
|
|
- Speech-to-text transcription for classroom environments. |
|
|
- ASR applications requiring high precision with limited data. |
|
|
- Benchmarking ASR performance without weakly supervised pretraining. |
|
|
|
|
|
### Model Details |
|
|
**Architecture:** Wav2vec2.0-based model fine-tuned with Fairseq |
|
|
|
|
|
**Training Data:** |
|
|
- **NCTE-Gold:** 13 hours of manually transcribed classroom recordings. |
|
|
|
|
|
**Training Strategy:** |
|
|
1. **Direct Fine-tuning:** The model is fine-tuned directly on NCTE-Gold without any pretraining on weak transcripts. |
|
|
2. **Evaluation:** The model is tested on classroom ASR tasks to compare its performance with WSP-based models. |
|
|
|
|
|
### Evaluation Results |
|
|
**Word Error Rate (WER) comparison on NCTE and MPT test sets:** |
|
|
|
|
|
| Training Data | NCTE WER | MPT WER | |
|
|
|--------------|----------|---------| |
|
|
| **Baseline (TEDLIUM-trained ASR)** | 55.82 / 50.56 | 55.11 / 50.50 | |
|
|
| **NCTE-Gold only (NCTE-Baseline-ASR)** | 21.12 / 16.47 | 31.52 / 27.93 | |
|
|
| **NCTE-WSP-ASR (NCTE-Weak → NCTE-Gold)** | **16.54 / 13.51** | **25.07 / 23.70** | |
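The WER figures above can be reproduced in principle with a standard word-level edit distance. The sketch below is a minimal reference implementation, not the authors' exact evaluation script (which may apply additional text normalization before scoring).

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution / match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the teacher asked a question", "the teacher ask question")` gives 0.4 (one substitution plus one deletion over five reference words). Libraries such as `jiwer` provide the same metric with built-in normalization.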
|
|
|
|
|
### Limitations |
|
|
- The model is trained on a small dataset (13 hours), which limits its ability to generalize beyond classroom speech. |
|
|
- Performance is lower than **NCTE-WSP-ASR**, which benefits from weak transcripts for pretraining. |
|
|
- Background noise, overlapping speech, and speaker variations may still impact transcription quality. |
|
|
|
|
|
### Usage Request |
|
|
If you use the NCTE-Baseline-ASR model (Wav2vec_Classroom_FT) in your research, please acknowledge this work and cite the original paper submitted to Interspeech 2025.
|
|
|
|
|
For inquiries or collaborations, please contact the authors of the original paper. |