---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- asr
- ctc
- wav2vec2
- common-voice
- onnx
- sagemaker
- huggingface
- transformers
- jiwer
datasets:
- mozilla-foundation/common_voice_17_0
base_model:
- facebook/wav2vec2-base-960h
license: other
language: en
metrics:
- wer
- cer
---
# Model Card for **ASR** (CTC-based English ASR)
This repository contains an end‑to‑end **Automatic Speech Recognition (ASR)** pipeline built around Hugging Face Transformers. The default configuration fine‑tunes **`facebook/wav2vec2-base-960h`** with a **CTC** head on a 50k‑utterance subsample of **Common Voice 17.0 (English)** and provides scripts to **train, evaluate, export to ONNX, and deploy on AWS SageMaker**. It also includes a robust audio‑loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).
## Model Details
### Model Description
- **Developed by:** Amirhossein Yousefi (GitHub: `@amirhossein-yousefi`)
- **Funded by:** Not specified
- **Shared by:** Amirhossein Yousefi
- **Model type:** CTC-based ASR using Transformers (**Wav2Vec2ForCTC**)
- **Language(s) (NLP):** English (`en`)
- **License:** Base model is Apache-2.0; repository/fine-tuned weights license not explicitly stated here (treat as **other** until clarified)
- **Finetuned from model:** `facebook/wav2vec2-base-960h`
> The training/evaluation pipeline uses Hugging Face `transformers`, `datasets`, and `jiwer` and includes scripts for inference and SageMaker deployment.
### Model Sources
- **Repository:** https://github.com/amirhossein-yousefi/ASR
- **Paper:** Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
- **Demo:** N/A (local CLI and SageMaker examples included)
## Uses
### Direct Use
- General‑purpose **English** speech transcription for short to moderate‑length audio segments (default duration filter: ~1–18 seconds).
- Local batch transcription via CLI or Python, or real‑time deployment via AWS SageMaker (JSON base64 or raw WAV content types).
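For the SageMaker path, a hedged invocation sketch with `boto3` is shown below; the endpoint name and the JSON payload key (`audio_b64`) are assumptions, so check the `sagemaker/` scripts for the exact request contract:

```python
import base64, json
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("path/to/file.wav", "rb") as f:
    audio_bytes = f.read()

# Option 1: JSON with base64-encoded audio (payload key is an assumption)
resp = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"audio_b64": base64.b64encode(audio_bytes).decode()}),
)
print(resp["Body"].read().decode())

# Option 2: raw WAV bytes
resp = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",
    ContentType="audio/wav",
    Body=audio_bytes,
)
print(resp["Body"].read().decode())
```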
### Downstream Use
- Domain adaptation / further fine‑tuning on task‑ or accent‑specific datasets.
- Export to **ONNX** for CPU‑friendly inference and integration in production applications.
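A minimal `onnxruntime` inference sketch follows; the ONNX file path and tensor names (`input_values`, `logits`) are assumptions, so verify them against what `src/export_onnx.py` actually emits:

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./outputs/asr")
sess = ort.InferenceSession("./outputs/asr/model.onnx", providers=["CPUExecutionProvider"])

audio = np.random.randn(16000).astype(np.float32)  # stand-in for a real 16 kHz waveform
inputs = processor(audio, sampling_rate=16000, return_tensors="np")
logits = sess.run(None, {"input_values": inputs["input_values"]})[0]
pred_ids = logits.argmax(axis=-1)
print(processor.batch_decode(pred_ids)[0])
```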
### Out-of-Scope Use
- **Speaker diarization**, **punctuation restoration**, and **true streaming ASR** are not included.
- Multilingual or code‑switched speech without additional fine‑tuning.
- Very long files without chunking; heavy background noise without augmentation/tuning.
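If you must process long files anyway, a naive fixed-window chunking sketch (an assumption, not the repo's approach) is shown below; windows that cut words mid-way degrade accuracy, which is why overlapping windows or VAD-based splitting are usually preferred:

```python
import torch

def transcribe_long(wav_1d, sr, model, processor, device, window_s=15.0):
    """Transcribe a long mono waveform by joining independent window transcripts."""
    step = int(window_s * sr)
    texts = []
    for start in range(0, wav_1d.numel(), step):
        chunk = wav_1d[start:start + step]
        if chunk.numel() < int(0.05 * sr):  # skip sub-50 ms tails, too short for the conv frontend
            continue
        inputs = processor(chunk.numpy(), sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs["input_values"].to(device)).logits
        texts.append(processor.batch_decode(logits.argmax(dim=-1).cpu().numpy())[0])
    return " ".join(texts)
```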
## Bias, Risks, and Limitations
- The default fine‑tuning dataset (**Common Voice 17.0, English**) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out‑of‑domain audio (e.g., telephony, medical terms).
- Transcriptions may contain mistakes and can include sensitive/PII if present in audio; handle outputs responsibly.
### Recommendations
- Always evaluate **WER/CER** on your own hold‑out data (a minimal `jiwer` sketch follows this list). Consider adding punctuation/casing restoration models and domain vocabularies as needed.
- For regulated contexts, incorporate a human‑in‑the‑loop review and data governance.
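A minimal `jiwer` check on a hold-out set (hypothetical file paths) might look like:

```python
import jiwer

# One reference and one hypothesis transcript per line, aligned by index
refs = [line.strip() for line in open("holdout_refs.txt", encoding="utf-8")]
hyps = [line.strip() for line in open("holdout_hyps.txt", encoding="utf-8")]

print(f"WER: {jiwer.wer(refs, hyps):.3f}")
print(f"CER: {jiwer.cer(refs, hyps):.3f}")
```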
## How to Get Started with the Model
**Python (local inference):**
```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face Hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

# Load the audio and resample to the model's expected rate (16 kHz for Wav2Vec2)
wav, sr = torchaudio.load("path/to/file.wav")  # assumes mono; mix down multi-channel audio first
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits

# Greedy CTC decoding: frame-wise argmax, then collapse repeats and blanks
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
```
**CLI (example):**
```bash
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav
```
## Training Details
### Training Data
- **Dataset:** Common Voice 17.0 (English), text column: `sentence`
- **Duration filter:** min ~1.0s, max ~18.0s
- **Notes:** Case‑aware text normalization and character‑whitelist filtering to match the tokenizer vocabulary; optional waveform augmentations.
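As a rough illustration of the whitelist idea (the repo's exact normalization rules may differ), allowed characters can be derived from the tokenizer's own vocabulary:

```python
import re
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
# Single-character tokens the CTC head can emit (base-960h: uppercase letters, "'", "|")
vocab_chars = {tok for tok in processor.tokenizer.get_vocab() if len(tok) == 1}

def normalize(sentence: str) -> str:
    s = re.sub(r"\s+", " ", sentence).strip().upper()  # base-960h vocabulary is uppercase
    return "".join(ch for ch in s if ch in vocab_chars or ch == " ")

print(normalize("Hello, world!"))  # -> "HELLO WORLD"
```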
### Training Procedure
#### Preprocessing
- Robust audio decoding (FFmpeg preferred on Windows; fallbacks to `torchaudio`, `soundfile`, `librosa`), resampling to 16 kHz as required by Wav2Vec2.
- Tokenization via the model’s processor; dynamic padding with a **CTC** collator.
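The collator follows the standard Hugging Face CTC recipe; a sketch is below (the repo's own collator may differ in details):

```python
from dataclasses import dataclass
import torch
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor

    def __call__(self, features):
        # Audio inputs and label ids have different lengths, so pad them separately
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")
        # Replace label padding with -100 so the CTC loss ignores it
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch
```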
#### Training Hyperparameters
- **Epochs:** 3
- **Per‑device batch size:** 8 (× **8** grad accumulation → effective **64**)
- **Learning rate:** 3e‑5
- **Warmup ratio:** 0.05
- **Optimizer:** `adamw_torch_fused`
- **Weight decay:** 0.0
- **Precision:** FP16
- **Max grad norm:** 1.0
- **Logging:** every 50 steps; **Eval/Save:** every 500 steps; keep last 2 checkpoints; early stopping patience = 3
- **Seed:** 42
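For orientation, the settings above map roughly onto `transformers.TrainingArguments` as in the sketch below (argument names follow recent `transformers` releases; the output directory is an assumption):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="./outputs/asr",       # assumed path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,    # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",            # "evaluation_strategy" on older releases
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)
# Pass EarlyStoppingCallback(early_stopping_patience=3) via the Trainer's `callbacks` argument.
```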
#### Speeds, Sizes, Times
- **Total FLOPs (training):** ≈1.08 × 10¹⁹ (10,814,747,992,293,114,000)
- **Training runtime:** ~11,168 s for 2,346 steps
- **Logs:** TensorBoard at `src/output/logs` (or similar path as configured)
### Evaluation
#### Testing Data, Factors & Metrics
- **Metrics:** **WER** (primary) and **CER** (auxiliary), computed with `jiwer` utilities (a Trainer‑compatible sketch follows below).
- **Factors:** English speech across CV17 splits; performance varies by accent, recording conditions, and utterance length.
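A Trainer-compatible metric function in this style (assuming `processor` is in scope and label padding uses `-100`) could look like:

```python
import numpy as np
import jiwer

def compute_metrics(pred):
    pred_ids = np.argmax(pred.predictions, axis=-1)
    # Restore pad ids so the tokenizer can decode the references
    label_ids = np.where(pred.label_ids == -100, processor.tokenizer.pad_token_id, pred.label_ids)
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(label_ids, group_tokens=False)
    return {"wer": jiwer.wer(label_str, pred_str), "cer": jiwer.cer(label_str, pred_str)}
```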
#### Results
- Training includes **loss**, **eval WER**, and **eval CER** curves. See the `assets/` directory for plots.
#### Summary
- Baseline WER/CER are logged per‑eval; users should report domain‑specific results on their own datasets.
## Model Examination
- Greedy decoding by default; beam search/LM fusion is not included in this repo. Inspect logits and alignments if needed for error analysis.
## Environmental Impact
- **Hardware Type:** Laptop (Windows)
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), Driver 576.52
- **CUDA / PyTorch:** CUDA 12.9, PyTorch 2.8.0+cu129
- **Hours used:** ~3.1 h (approx.)
- **Cloud Provider:** N/A for local; **AWS SageMaker** utilities available for cloud training/deployment
- **Compute Region:** N/A (local)
- **Carbon Emitted:** Not calculated; estimate with the [MLCO2 calculator](https://mlco2.github.io/impact#compute)
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** Wav2Vec2 encoder with **CTC** output layer
- **Objective:** Character‑level CTC loss for ASR
### Compute Infrastructure
#### Hardware
- Local GPU as above; or AWS instance types via SageMaker scripts (e.g., `ml.g4dn.xlarge`).
#### Software
- Python 3.10+
- Key dependencies: `transformers`, `datasets`, `torch`, `torchaudio`, `soundfile`, `librosa`, `jiwer`, `onnxruntime` (for ONNX testing), and `boto3`/`sagemaker` for deployment.
## Citation
**BibTeX:**
```bibtex
@article{baevski2020wav2vec,
title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
journal={arXiv preprint arXiv:2006.11477},
year={2020}
}
```
**APA:**
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). *wav2vec 2.0: A framework for self‑supervised learning of speech representations*. arXiv:2006.11477.
## Glossary
- **WER**: Word Error Rate; lower is better.
- **CER**: Character Error Rate; lower is better.
- **CTC**: Connectionist Temporal Classification, an alignment‑free loss for sequence labeling.
## More Information
- **ONNX export:** `src/export_onnx.py` (a minimal export sketch follows this list)
- **AWS SageMaker:** scripts in `sagemaker/` for training, deployment, and autoscaling.
- **Training/metrics plots:** see `assets/` (e.g., `train_loss.svg`, `eval_wer.svg`, `eval_cer.svg`).
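For orientation, an independent export sketch is shown below; `src/export_onnx.py` is authoritative, and the wrapper, paths, and axis names here are assumptions:

```python
import torch
from transformers import AutoModelForCTC

class LogitsOnly(torch.nn.Module):
    """Wrapper so torch.onnx.export traces a plain tensor output instead of a ModelOutput."""
    def __init__(self, model):
        super().__init__()
        self.model = model
    def forward(self, input_values):
        return self.model(input_values).logits

model = LogitsOnly(AutoModelForCTC.from_pretrained("./outputs/asr").eval())
dummy = torch.randn(1, 16000)  # one second of 16 kHz audio
torch.onnx.export(
    model, (dummy,), "model.onnx",
    input_names=["input_values"], output_names=["logits"],
    dynamic_axes={"input_values": {0: "batch", 1: "time"},
                  "logits": {0: "batch", 1: "frames"}},
    opset_version=17,
)
```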
## Model Card Authors
- Amirhossein Yousefi (repo author)
## Model Card Contact
- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR