|
|
--- |
|
|
library_name: transformers |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
tags: |
|
|
- speech |
|
|
- asr |
|
|
- ctc |
|
|
- wav2vec2 |
|
|
- common-voice |
|
|
- onnx |
|
|
- sagemaker |
|
|
- huggingface |
|
|
- transformers |
|
|
- jiwer |
|
|
datasets: |
|
|
- mozilla-foundation/common_voice_17_0 |
|
|
base_model: |
|
|
- facebook/wav2vec2-base-960h |
|
|
license: other |
|
|
language: en |
|
|
metrics: |
|
|
- wer |
|
|
- cer |
|
|
--- |
|
|
|
|
|
# Model Card for **ASR** (CTC-based English speech recognition)
|
|
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
This repository contains an end‑to‑end **Automatic Speech Recognition (ASR)** pipeline built around Hugging Face Transformers. The default configuration fine‑tunes **`facebook/wav2vec2-base-960h`** with a **CTC** head on a 50k‑example subsample of **Common Voice 17.0 (English)** and provides scripts to **train, evaluate, export to ONNX, and deploy on AWS SageMaker**. It also includes a robust audio‑loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** Amirhossein Yousefi (GitHub: `@amirhossein-yousefi`) |
|
|
- **Funded by:** Not specified
|
|
- **Shared by:** Amirhossein Yousefi
|
|
- **Model type:** CTC-based ASR using Transformers (**Wav2Vec2ForCTC**) |
|
|
- **Language(s) (NLP):** English (`en`) |
|
|
- **License:** The base model is Apache-2.0; the license for this repository and the fine-tuned weights is not explicitly stated here (treat as **other** until clarified)
|
|
- **Finetuned from model:** `facebook/wav2vec2-base-960h`
|
|
|
|
|
> The training/evaluation pipeline uses Hugging Face `transformers`, `datasets`, and `jiwer` and includes scripts for inference and SageMaker deployment. |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** https://github.com/amirhossein-yousefi/ASR |
|
|
- **Paper:** Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
|
|
- **Demo:** N/A (local CLI and SageMaker examples included)
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
- General‑purpose **English** speech transcription for short to moderately long audio segments (default duration filter: ~1–18 seconds).
|
|
- Local batch transcription via CLI or Python, or real‑time deployment via AWS SageMaker (JSON base64 or raw WAV content types). |
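For quick local transcription, a minimal sketch using the Transformers `pipeline` API (the checkpoint path below is an assumption; any Hugging Face Hub id works as well):

```python
from transformers import pipeline

# "./outputs/asr" is a hypothetical local checkpoint directory produced by the training scripts.
asr = pipeline("automatic-speech-recognition", model="./outputs/asr")

# The pipeline decodes and resamples common audio formats (via ffmpeg) before running the CTC model.
print(asr("path/to/file.wav")["text"])
```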
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
- Domain adaptation / further fine‑tuning on task‑ or accent‑specific datasets. |
|
|
- Export to **ONNX** for CPU‑friendly inference and integration in production applications. |
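For the ONNX path, a hedged sketch of CPU inference with `onnxruntime` after exporting via `src/export_onnx.py`; the output file name, and the assumption that the exported graph takes the processor's `input_values` and returns CTC logits, are illustrative rather than guaranteed by the repo:

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./outputs/asr")  # hypothetical checkpoint path
session = ort.InferenceSession("wav2vec2_ctc.onnx", providers=["CPUExecutionProvider"])  # hypothetical export name

# Assumes a 16 kHz mono WAV; resample/mix down first if needed.
audio, sr = sf.read("path/to/file.wav", dtype="float32")
inputs = processor(audio, sampling_rate=sr, return_tensors="np")

# Run the graph and decode greedily (argmax per frame, then collapse repeats/blanks).
input_name = session.get_inputs()[0].name
logits = session.run(None, {input_name: inputs["input_values"]})[0]
pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])
```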
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- **Speaker diarization**, **punctuation restoration**, and **true streaming ASR** are not included. |
|
|
- Multilingual or code‑switched speech without additional fine‑tuning. |
|
|
- Very long files without chunking; heavy background noise without augmentation/tuning. |
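Long recordings are out of scope for single‑pass inference, but a common workaround (an assumption about your workflow, not a script shipped in the repo) is windowed inference via the pipeline's chunking parameters:

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="./outputs/asr")  # hypothetical checkpoint path

# chunk_length_s splits long audio into windows; stride_length_s overlaps them so CTC outputs can be stitched.
result = asr("path/to/long_recording.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```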
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- The default fine‑tuning dataset (**Common Voice 17.0, English**) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out‑of‑domain audio (e.g., telephony, medical terms). |
|
|
- Transcriptions may contain mistakes and can include sensitive/PII if present in audio; handle outputs responsibly. |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
- Always evaluate **WER/CER** on your own held‑out data (a `jiwer` sketch follows this list). Consider adding punctuation/casing restoration models and domain vocabularies as needed.
|
|
- For regulated contexts, incorporate a human‑in‑the‑loop review and data governance. |
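A minimal sketch for computing corpus-level WER/CER with `jiwer`; the reference and hypothesis strings below are placeholders:

```python
import jiwer

references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]

# jiwer accepts lists of strings and returns corpus-level error rates.
wer = jiwer.wer(references, hypotheses)
cer = jiwer.cer(references, hypotheses)
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```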
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
**Python (local inference):** |
|
|
```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face Hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

# Load audio, mix down to mono, and resample to the rate the processor expects (16 kHz for Wav2Vec2).
wav, sr = torchaudio.load("path/to/file.wav")
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

# Greedy CTC decoding: argmax over the vocabulary per frame, then collapse repeats and blanks.
inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
```
|
|
|
|
|
**CLI (example):** |
|
|
```bash |
|
|
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav |
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset:** Common Voice 17.0 (English), text column: `sentence` |
|
|
- **Duration filter:** min ~1.0s, max ~18.0s |
|
|
- **Notes:** Case‑aware text normalization and character‑whitelist filtering to match the tokenizer vocabulary; optional waveform augmentations (a hedged normalization sketch follows below).
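One possible normalization, as a hedged sketch; the exact rules live in the repo's preprocessing code, and the character whitelist below is an assumption matching the base model's uppercase English vocabulary:

```python
import re

# Hypothetical whitelist: wav2vec2-base-960h uses an uppercase English vocabulary plus apostrophe.
_DISALLOWED = re.compile(r"[^A-Z' ]+")

def normalize_sentence(text: str) -> str:
    """Uppercase, drop characters outside the tokenizer vocabulary, and collapse whitespace."""
    text = text.upper()
    text = _DISALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_sentence("Hello, world! It's 9 o'clock."))  # -> "HELLO WORLD IT'S O'CLOCK"
```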
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing
|
|
|
|
|
- Robust audio decoding (FFmpeg preferred on Windows; fallback to `torchaudio/soundfile/librosa`), resampling to 16 kHz as required by Wav2Vec2. |
|
|
- Tokenization via the model’s processor; dynamic padding with a **CTC** collator. |
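A minimal sketch of a CTC data collator with dynamic padding, a common pattern for Wav2Vec2 fine‑tuning; the class itself and the `input_values`/`labels` field names follow standard processor outputs and are an assumption, not necessarily the repo's exact implementation:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    """Dynamically pads audio inputs and labels separately, masking padded label positions with -100."""
    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=True, return_tensors="pt")

        # Replace tokenizer padding with -100 so the CTC loss ignores padded positions.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch
```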
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Epochs:** 3 |
|
|
- **Per‑device batch size:** 8 (× **8** grad accumulation → effective **64**) |
|
|
- **Learning rate:** 3e‑5 |
|
|
- **Warmup ratio:** 0.05 |
|
|
- **Optimizer:** `adamw_torch_fused` |
|
|
- **Weight decay:** 0.0 |
|
|
- **Precision:** FP16 |
|
|
- **Max grad norm:** 1.0 |
|
|
- **Logging:** every 50 steps; **Eval/Save:** every 500 steps; keep last 2 checkpoints; early stopping patience = 3 |
|
|
- **Seed:** 42 |
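Expressed as Hugging Face `TrainingArguments`, the settings above would look roughly like the sketch below; argument names follow current `transformers`, and the output path, metric key, and exact flag spellings in the repo may differ:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/asr",        # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,     # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="wer",       # metric key is an assumption
    greater_is_better=False,
    seed=42,
)
# Early stopping (patience = 3) would be added via EarlyStoppingCallback when constructing the Trainer.
```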
|
|
|
|
|
#### Speeds, Sizes, Times
|
|
|
|
|
- **Total FLOPs (training):** 10,814,747,992,293,114,000 (≈1.08 × 10¹⁹)
|
|
- **Training runtime:** ~11,168 s (≈3.1 h) for 2,346 steps
|
|
- **Logs:** TensorBoard at `src/output/logs` (or similar path as configured) |
|
|
|
|
|
### Evaluation |
|
|
|
|
|
#### Testing Data, Factors & Metrics |
|
|
|
|
|
- **Metrics:** **WER** (primary) and **CER** (auxiliary), computed with `jiwer` utilities. |
|
|
- **Factors:** English speech across CV17 splits; performance varies by accent, recording conditions, and utterance length. |
|
|
|
|
|
#### Results |
|
|
|
|
|
- Training logs include **loss**, **eval WER**, and **eval CER** curves; see the `assets/` directory for plots.
|
|
|
|
|
#### Summary |
|
|
|
|
|
- Baseline WER/CER are logged at each evaluation step; users should report domain‑specific results on their own datasets.
|
|
|
|
|
## Model Examination |
|
|
|
|
|
- Greedy decoding is used by default; beam search and LM fusion are not included in this repo. Inspect logits and alignments as needed for error analysis.
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** Laptop (Windows) |
|
|
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), Driver 576.52 |
|
|
- **CUDA / PyTorch:** CUDA 12.9, PyTorch 2.8.0+cu129 |
|
|
- **Hours used:** ~3.1 h (approx.) |
|
|
- **Cloud Provider:** N/A for local; **AWS SageMaker** utilities available for cloud training/deployment |
|
|
- **Compute Region:** N/A (local) |
|
|
- **Carbon Emitted:** Not calculated; estimate with the [MLCO2 calculator](https://mlco2.github.io/impact#compute) |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
- **Architecture:** Wav2Vec2 encoder with **CTC** output layer |
|
|
- **Objective:** Character‑level CTC loss for ASR |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
- Local GPU as above; or AWS instance types via SageMaker scripts (e.g., `ml.g4dn.xlarge`). |
|
|
|
|
|
#### Software |
|
|
|
|
|
- Python 3.10+ |
|
|
- Key dependencies: `transformers`, `datasets`, `torch`, `torchaudio`, `soundfile`, `librosa`, `jiwer`, `onnxruntime` (for ONNX testing), and `boto3`/`sagemaker` for deployment. |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex |
|
|
@article{baevski2020wav2vec, |
|
|
title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations}, |
|
|
author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael}, |
|
|
journal={arXiv preprint arXiv:2006.11477}, |
|
|
year={2020} |
|
|
} |
|
|
``` |
|
|
|
|
|
**APA:** |
|
|
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). *wav2vec 2.0: A framework for self‑supervised learning of speech representations*. arXiv:2006.11477. |
|
|
|
|
|
## Glossary |
|
|
|
|
|
- **WER**: Word Error Rate, i.e., (substitutions + deletions + insertions) / number of reference words; lower is better.
|
|
- **CER**: Character Error Rate, the same ratio computed over characters instead of words; lower is better.
|
|
- **CTC**: Connectionist Temporal Classification, an alignment‑free loss for sequence labeling. |
|
|
|
|
|
## More Information |
|
|
|
|
|
- **ONNX export:** `src/export_onnx.py` |
|
|
- **AWS SageMaker:** scripts in `sagemaker/` for training, deployment, and autoscaling (see the invocation sketch below).
|
|
- **Training/metrics plots:** see `assets/` (e.g., `train_loss.svg`, `eval_wer.svg`, `eval_cer.svg`). |
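A hedged sketch of invoking a deployed real‑time endpoint with `boto3`; the endpoint name and content type are assumptions, and per the repo's description the handler also accepts JSON with base64‑encoded audio:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; the deployment scripts in sagemaker/ determine the real one.
with open("path/to/file.wav", "rb") as f:
    response = runtime.invoke_endpoint(
        EndpointName="asr-wav2vec2-endpoint",
        ContentType="audio/wav",   # raw WAV; a JSON/base64 payload is the alternative
        Body=f.read(),
    )

print(response["Body"].read().decode("utf-8"))
```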
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
- Amirhossein Yousefi (repo author) |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR |
|
|
|