---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- asr
- ctc
- wav2vec2
- common-voice
- onnx
- sagemaker
- huggingface
- transformers
- jiwer
datasets:
- mozilla-foundation/common_voice_17_0
base_model:
- facebook/wav2vec2-base-960h
license: other
language: en
metrics:
- wer
- cer
---

# Model Card for **ASR** (CTC-based ASR on English)

<!-- Provide a quick summary of what the model is/does. -->
This repository contains an end-to-end **Automatic Speech Recognition (ASR)** pipeline built around Hugging Face Transformers. The default configuration fine-tunes **`facebook/wav2vec2-base-960h`** with a **CTC** head on **Common Voice 17.0 (English)** and provides scripts to **train, evaluate, export to ONNX, and deploy on AWS SageMaker**. It also includes a robust audio-loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefi (GitHub: `@amirhossein-yousefi`)
- **Funded by:** Not specified
- **Shared by:** Amirhossein Yousefi
- **Model type:** CTC-based ASR using Transformers (**Wav2Vec2ForCTC**)
- **Language(s) (NLP):** English (`en`)
- **License:** The base model is Apache-2.0; the license of this repository and the fine-tuned weights is not explicitly stated here (treat as **other** until clarified)
- **Finetuned from model:** `facebook/wav2vec2-base-960h`

> The training/evaluation pipeline uses Hugging Face `transformers`, `datasets`, and `jiwer` and includes scripts for inference and SageMaker deployment.

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/ASR
- **Paper:** Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
- **Demo:** N/A (local CLI and SageMaker examples included)

## Uses

### Direct Use

- General-purpose **English** speech transcription for short to moderate audio segments (default duration filter: ~1–18 seconds).
- Local batch transcription via CLI or Python, or real-time deployment via AWS SageMaker (JSON base64 or raw WAV content types); a minimal endpoint-invocation sketch follows this list.

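The snippet below is a minimal sketch of invoking a deployed endpoint with `boto3`. The endpoint name (`asr-endpoint`) and the JSON field name (`audio_b64`) are placeholders, not the repository's actual schema; the handler code under `sagemaker/` defines the real payload format.

```python
import base64
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Option 1: JSON body carrying base64-encoded audio (ContentType: application/json).
with open("path/to/file.wav", "rb") as f:
    payload = {"audio_b64": base64.b64encode(f.read()).decode("utf-8")}  # field name is a placeholder

resp = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(resp["Body"].read().decode("utf-8"))

# Option 2: raw WAV bytes (ContentType: audio/wav).
with open("path/to/file.wav", "rb") as f:
    resp = runtime.invoke_endpoint(
        EndpointName="asr-endpoint",
        ContentType="audio/wav",
        Body=f.read(),
    )
print(resp["Body"].read().decode("utf-8"))
```
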
### Downstream Use

- Domain adaptation / further fine-tuning on task- or accent-specific datasets.
- Export to **ONNX** for CPU-friendly inference and integration in production applications.

### Out-of-Scope Use

- **Speaker diarization**, **punctuation restoration**, and **true streaming ASR** are not included.
- Multilingual or code-switched speech without additional fine-tuning.
- Very long files without chunking; heavy background noise without augmentation/tuning.

## Bias, Risks, and Limitations

- The default fine-tuning dataset (**Common Voice 17.0, English**) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out-of-domain audio (e.g., telephony, medical terms).
- Transcriptions may contain mistakes and can include sensitive/PII content if it is present in the audio; handle outputs responsibly.

### Recommendations

- Always evaluate **WER/CER** on your own held-out data. Consider adding punctuation/casing restoration models and domain vocabularies as needed.
- For regulated contexts, incorporate human-in-the-loop review and data governance.

## How to Get Started with the Model

**Python (local inference):**
```python
import torch, torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face Hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

# Load audio and resample to the model's expected rate (16 kHz for Wav2Vec2).
wav, sr = torchaudio.load("path/to/file.wav")
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits

# Greedy CTC decoding: frame-wise argmax, then collapse repeats and blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
```

**CLI (example):**
```bash
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav
```

## Training Details

### Training Data

- **Dataset:** Common Voice 17.0 (English), text column: `sentence`
- **Duration filter:** min ~1.0 s, max ~18.0 s
- **Notes:** Case-aware text normalization and whitelist filtering to match the tokenizer vocabulary; optional waveform augmentations (a sketch of these filters follows this list).

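The sketch below illustrates the duration filter and vocabulary-whitelist normalization with `datasets`; the column names (`audio`, `sentence`) follow Common Voice, while the exact rules in the repository's preprocessing code may differ.

```python
from datasets import Audio, load_dataset

MIN_SEC, MAX_SEC = 1.0, 18.0
SAMPLE_RATE = 16_000

# Character whitelist matching the wav2vec2-base-960h vocabulary
# (upper-case letters and apostrophe); spaces are kept for the word delimiter.
VOCAB_CHARS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ' ")

ds = load_dataset("mozilla-foundation/common_voice_17_0", "en", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=SAMPLE_RATE))

def keep_by_duration(example):
    # Keep utterances roughly between 1 s and 18 s.
    duration = len(example["audio"]["array"]) / SAMPLE_RATE
    return MIN_SEC <= duration <= MAX_SEC

def normalize_text(example):
    # Upper-case to match the vocabulary, then drop out-of-vocabulary characters.
    example["sentence"] = "".join(ch for ch in example["sentence"].upper() if ch in VOCAB_CHARS)
    return example

ds = ds.filter(keep_by_duration).map(normalize_text)
```
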
### Training Procedure

#### Preprocessing

- Robust audio decoding (FFmpeg preferred on Windows; fallback to `torchaudio`/`soundfile`/`librosa`), with resampling to 16 kHz as required by Wav2Vec2.
- Tokenization via the model's processor; dynamic padding with a **CTC** data collator (a minimal collator sketch follows below).

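The collator pairs dynamically padded audio features with padded label ids whose padding positions are masked out of the loss. Below is a minimal sketch of this standard pattern for Wav2Vec2 CTC fine-tuning; the repository's own collator may differ in details.

```python
from dataclasses import dataclass
from typing import Dict, List

import torch
from transformers import Wav2Vec2Processor


@dataclass
class CTCDataCollator:
    """Dynamically pad 'input_values' (waveforms) and 'labels' (token ids) for CTC."""

    processor: Wav2Vec2Processor
    padding: bool = True

    def __call__(self, features: List[Dict]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

        # Replace label padding with -100 so the CTC loss ignores those positions.
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
        return batch
```
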
#### Training Hyperparameters

- **Epochs:** 3
- **Per-device batch size:** 8 (× **8** gradient accumulation → effective **64**)
- **Learning rate:** 3e-5
- **Warmup ratio:** 0.05
- **Optimizer:** `adamw_torch_fused`
- **Weight decay:** 0.0
- **Precision:** FP16
- **Max grad norm:** 1.0
- **Logging:** every 50 steps; **Eval/Save:** every 500 steps; keep last 2 checkpoints; early-stopping patience = 3
- **Seed:** 42 (see the `TrainingArguments` sketch below)

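As a rough mapping of the values above onto `transformers.TrainingArguments` (a sketch, not the repository's actual training script; the output directory is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/asr",          # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,       # 8 x 8 = effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,         # needed for early stopping
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)
```

Early stopping itself is typically attached via `transformers.EarlyStoppingCallback(early_stopping_patience=3)` on the `Trainer`.
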
#### Speeds, Sizes, Times

- **Total FLOPs (training):** 10,814,747,992,293,114,000 (≈1.08 × 10¹⁹)
- **Training runtime:** ~11,168 s for 2,346 steps
- **Logs:** TensorBoard logs at `src/output/logs` (or a similar path, as configured)

### Evaluation

#### Testing Data, Factors & Metrics

- **Metrics:** **WER** (primary) and **CER** (auxiliary), computed with `jiwer` utilities (toy example below).
- **Factors:** English speech across the Common Voice 17.0 splits; performance varies by accent, recording conditions, and utterance length.

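A minimal illustration of the two metrics with `jiwer` on toy strings (any normalization the repository applies before scoring is not shown here):

```python
import jiwer

references = ["the cat sat on the mat"]
hypotheses = ["the cat sat on a mat"]

wer = jiwer.wer(references, hypotheses)  # word error rate (primary)
cer = jiwer.cer(references, hypotheses)  # character error rate (auxiliary)
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```
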
#### Results

- Training logs include **loss**, **eval WER**, and **eval CER** curves. See the `assets/` directory for plots.

#### Summary

- Baseline WER/CER are logged at each evaluation step; users should report domain-specific results on their own datasets.

## Model Examination

- Greedy decoding is used by default; beam-search/LM fusion is not included in this repository. For error analysis, inspect the logits and frame-level alignments, as in the sketch below.

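Continuing from the inference example in the "How to Get Started" section (reusing `logits` and `processor`), one way to look at frame-level predictions before CTC collapsing:

```python
# One token id per ~20 ms frame; blanks appear as the pad token.
frame_ids = logits.argmax(dim=-1)[0].tolist()
frame_tokens = processor.tokenizer.convert_ids_to_tokens(frame_ids)
print(frame_tokens[:50])

# Per-frame confidence: maximum softmax probability at each frame.
frame_conf = logits.softmax(dim=-1)[0].max(dim=-1).values
print(f"mean confidence: {frame_conf.mean().item():.3f}, min: {frame_conf.min().item():.3f}")
```
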
## Environmental Impact

- **Hardware Type:** Laptop (Windows)
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), driver 576.52
- **CUDA / PyTorch:** CUDA 12.9, PyTorch 2.8.0+cu129
- **Hours used:** ~3.1 h
- **Cloud Provider:** N/A for local runs; **AWS SageMaker** utilities are available for cloud training/deployment
- **Compute Region:** N/A (local)
- **Carbon Emitted:** Not calculated; estimate with the [MLCO2 calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Wav2Vec2 encoder with a **CTC** output layer
- **Objective:** Character-level CTC loss for ASR

### Compute Infrastructure

#### Hardware

- Local GPU as above, or AWS instance types via the SageMaker scripts (e.g., `ml.g4dn.xlarge`).

#### Software

- Python 3.10+
- Key dependencies: `transformers`, `datasets`, `torch`, `torchaudio`, `soundfile`, `librosa`, `jiwer`, `onnxruntime` (for ONNX testing), and `boto3`/`sagemaker` for deployment.

## Citation

**BibTeX:**
```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}
```

**APA:**
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). *wav2vec 2.0: A framework for self-supervised learning of speech representations*. arXiv:2006.11477.

## Glossary

- **WER**: Word Error Rate; lower is better (formula below).
- **CER**: Character Error Rate; lower is better.
- **CTC**: Connectionist Temporal Classification, an alignment-free loss for sequence labeling.

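For reference, WER is computed from a minimum edit-distance alignment between hypothesis and reference words; CER is the same computation over characters:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference.
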
## More Information

- **ONNX export:** `src/export_onnx.py` (a usage sketch follows below)
- **AWS SageMaker:** scripts in `sagemaker/` for training, deployment, and autoscaling.
- **Training/metrics plots:** see `assets/` (e.g., `train_loss.svg`, `eval_wer.svg`, `eval_cer.svg`).

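After exporting, running the graph with `onnxruntime` might look like the sketch below; the ONNX file name and input name depend on how `src/export_onnx.py` performs the export, so both are placeholders. The example assumes a 16 kHz mono WAV.

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoProcessor

# Placeholders: the actual ONNX path and input name are set by src/export_onnx.py.
session = ort.InferenceSession("./outputs/asr/model.onnx", providers=["CPUExecutionProvider"])
processor = AutoProcessor.from_pretrained("./outputs/asr")

audio, sr = sf.read("path/to/file.wav", dtype="float32")  # expects 16 kHz mono
inputs = processor(audio, sampling_rate=sr, return_tensors="np")

input_name = session.get_inputs()[0].name  # typically "input_values"
logits = session.run(None, {input_name: inputs["input_values"].astype(np.float32)})[0]

pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])
```
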
## Model Card Authors

- Amirhossein Yousefi (repository author)

## Model Card Contact

- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR