---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
  - speech
  - asr
  - ctc
  - wav2vec2
  - common-voice
  - onnx
  - sagemaker
  - huggingface
  - transformers
  - jiwer
datasets:
  - mozilla-foundation/common_voice_17_0
base_model:
  - facebook/wav2vec2-base-960h
license: other
language: en
metrics:
  - wer
  - cer
---

# Model Card for **ASR** (CTC-based English ASR)

<!-- Provide a quick summary of what the model is/does. -->
This repository contains an end‑to‑end **Automatic Speech Recognition (ASR)** pipeline built around Hugging Face Transformers. The default configuration fine‑tunes **`facebook/wav2vec2-base-960h`** with a **CTC** head on a 50k‑sample subset of **Common Voice 17.0 (English)** and provides scripts to **train, evaluate, export to ONNX, and deploy on AWS SageMaker**. It also includes a robust audio‑loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefi (GitHub: `@amirhossein-yousefi`)
- **Funded by:** Not specified
- **Shared by:** Amirhossein Yousefi
- **Model type:** CTC-based ASR using Transformers (**Wav2Vec2ForCTC**)
- **Language(s) (NLP):** English (`en`)
- **License:** Base model is Apache-2.0; repository/fine-tuned weights license not explicitly stated here (treat as **other** until clarified)
- **Finetuned from model:** `facebook/wav2vec2-base-960h`

> The training/evaluation pipeline uses Hugging Face `transformers`, `datasets`, and `jiwer` and includes scripts for inference and SageMaker deployment.

### Model Sources 

- **Repository:** https://github.com/amirhossein-yousefi/ASR
- **Paper:** Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
- **Demo:** N/A (local CLI and SageMaker examples included)

## Uses

### Direct Use

- General‑purpose **English** speech transcription for short to moderate audio segments (default duration filter: ~1–18 seconds).
- Local batch transcription via CLI or Python, or real‑time deployment via AWS SageMaker (JSON base64 or raw WAV content types).
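For the SageMaker path, the exact request schema depends on the inference handler shipped in `sagemaker/`; the following is a minimal sketch assuming a JSON body with a base64‑encoded `audio` field and a hypothetical endpoint name `asr-endpoint` (adjust both to match the deployed handler):

```python
import base64
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Read a local WAV file and base64-encode it for a JSON payload.
with open("path/to/file.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",             # assumed endpoint name
    ContentType="application/json",
    Body=json.dumps({"audio": audio_b64}),   # assumed JSON field name
)
print(json.loads(response["Body"].read()))
```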

### Downstream Use 

- Domain adaptation / further fine‑tuning on task‑ or accent‑specific datasets.
- Export to **ONNX** for CPU‑friendly inference and integration in production applications.
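`src/export_onnx.py` handles the export itself; once an ONNX file exists, CPU inference can look roughly like the sketch below. The file path and the tensor names `input_values`/`logits` are assumptions, so check the export script for the names it actually uses:

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoProcessor

# Assumed model path and input/output names; verify against src/export_onnx.py.
processor = AutoProcessor.from_pretrained("./outputs/asr")
session = ort.InferenceSession("./outputs/asr/model.onnx", providers=["CPUExecutionProvider"])

audio, sr = sf.read("path/to/file.wav")  # assumes a 16 kHz mono WAV
inputs = processor(audio, sampling_rate=sr, return_tensors="np", padding=True)

logits = session.run(None, {"input_values": inputs["input_values"].astype(np.float32)})[0]
pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])
```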

### Out-of-Scope Use

- **Speaker diarization**, **punctuation restoration**, and **true streaming ASR** are not included.
- Multilingual or code‑switched speech without additional fine‑tuning.
- Very long files without chunking; heavy background noise without augmentation/tuning.
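If you must handle long recordings anyway, a naive workaround (not part of this repo) is fixed‑length chunking before transcription; the sketch below uses ~18 s windows to mirror the training‑time duration cap, at the cost of possibly garbling words that straddle chunk boundaries:

```python
import torch
import torchaudio


def transcribe_long(path, model, processor, device="cpu", chunk_s=18.0):
    """Naive fixed-window chunking; words split at chunk boundaries may be mis-transcribed."""
    wav, sr = torchaudio.load(path)
    target_sr = processor.feature_extractor.sampling_rate
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    wav = wav.mean(dim=0)  # downmix to mono
    chunk_len = int(chunk_s * target_sr)

    texts = []
    for start in range(0, wav.numel(), chunk_len):
        chunk = wav[start:start + chunk_len]
        inputs = processor(chunk.numpy(), sampling_rate=target_sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs["input_values"].to(device)).logits
        ids = torch.argmax(logits, dim=-1)
        texts.append(processor.batch_decode(ids)[0])
    return " ".join(texts)
```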

## Bias, Risks, and Limitations

- The default fine‑tuning dataset (**Common Voice 17.0, English**) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out‑of‑domain audio (e.g., telephony, medical terms).
- Transcriptions may contain mistakes and can include sensitive/PII if present in audio; handle outputs responsibly.

### Recommendations

- Always evaluate **WER/CER** on your own hold‑out data. Consider adding punctuation casing models and domain vocabularies as needed.
- For regulated contexts, incorporate a human‑in‑the‑loop review and data governance.

## How to Get Started with the Model

**Python (local inference):**
```python
import torch, torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

wav, sr = torchaudio.load("path/to/file.wav")
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
```

**CLI (example):**
```bash
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav
```

## Training Details

### Training Data

- **Dataset:** Common Voice 17.0 (English), text column: `sentence`
- **Duration filter:** min ~1.0s, max ~18.0s
- **Notes:** Case‑aware text normalization and character‑whitelist filtering to match the tokenizer vocabulary; optional waveform augmentations.
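The actual normalization rules live in the training scripts; as a rough illustration of whitelist filtering against the CTC tokenizer vocabulary (the `normalize` helper below is hypothetical, not the repo's implementation):

```python
import re
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

# Single-character tokens the CTC tokenizer can emit, plus a plain space.
vocab_chars = set(processor.tokenizer.get_vocab().keys())
allowed = {c for c in vocab_chars if len(c) == 1} | {" "}

def normalize(text: str) -> str:
    """Uppercase (the base-960h vocab is uppercase), collapse whitespace, drop out-of-vocab chars."""
    text = re.sub(r"\s+", " ", text.upper()).strip()
    return "".join(c for c in text if c in allowed)

print(normalize("Hello, world!"))  # -> "HELLO WORLD" (punctuation removed)
```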

### Training Procedure

#### Preprocessing

- Robust audio decoding (FFmpeg preferred on Windows; fallback to `torchaudio/soundfile/librosa`), resampling to 16 kHz as required by Wav2Vec2.
- Tokenization via the model’s processor; dynamic padding with a **CTC** collator.
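Dynamic padding for CTC typically pads audio features and label sequences separately and masks padded label positions with `-100` so they are ignored by the loss; a minimal sketch of such a collator (not necessarily identical to the repo's implementation):

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from transformers import Wav2Vec2Processor


@dataclass
class DataCollatorCTCWithPadding:
    """Pads input_values and labels independently; -100 labels are ignored by the CTC loss."""
    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.feature_extractor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.tokenizer.pad(label_features, padding=True, return_tensors="pt")

        # Replace padded label positions with -100 so they don't contribute to the loss.
        labels = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
        batch["labels"] = labels
        return batch
```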

#### Training Hyperparameters

- **Epochs:** 3
- **Per‑device batch size:** 8 (× **8** grad accumulation → effective **64**)
- **Learning rate:** 3e‑5
- **Warmup ratio:** 0.05
- **Optimizer:** `adamw_torch_fused`
- **Weight decay:** 0.0
- **Precision:** FP16
- **Max grad norm:** 1.0
- **Logging:** every 50 steps; **Eval/Save:** every 500 steps; keep last 2 checkpoints; early stopping patience = 3
- **Seed:** 42
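These settings map roughly onto Hugging Face `TrainingArguments` as sketched below (argument names follow recent `transformers` releases; the repo's actual training config may differ):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./outputs/asr",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,   # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    weight_decay=0.0,
    optim="adamw_torch_fused",
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)
# Early stopping (patience = 3) would be attached to the Trainer as a callback:
# trainer = Trainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```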

#### Speeds, Sizes, Times

- **Total FLOPs (training):** 10,814,747,992,293,114,000
- **Training runtime:** ~11,168 s for 2,346 steps
- **Logs:** TensorBoard at `src/output/logs` (or similar path as configured)

### Evaluation

#### Testing Data, Factors & Metrics

- **Metrics:** **WER** (primary) and **CER** (auxiliary), computed with `jiwer` utilities (a minimal scoring example follows this list).
- **Factors:** English speech across CV17 splits; performance varies by accent, recording conditions, and utterance length.
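A minimal example of computing both metrics with `jiwer`; apply the same text normalization to references and hypotheses before scoring:

```python
import jiwer

references = ["the cat sat on the mat"]
hypotheses = ["the cat sat on mat"]

print("WER:", jiwer.wer(references, hypotheses))  # 1 deletion / 6 words ≈ 0.167
print("CER:", jiwer.cer(references, hypotheses))
```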

#### Results

- Training produces **loss**, **eval WER**, and **eval CER** curves; see the `assets/` directory for the plots.

#### Summary

- Baseline WER/CER are logged per‑eval; users should report domain‑specific results on their own datasets.

## Model Examination 

- Greedy decoding by default; beam search/LM fusion is not included in this repo. Inspect logits and alignments if needed for error analysis.

## Environmental Impact

- **Hardware Type:** Laptop (Windows)
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), Driver 576.52
- **CUDA / PyTorch:** CUDA 12.9, PyTorch 2.8.0+cu129
- **Hours used:** ~3.1 h
- **Cloud Provider:** N/A for local; **AWS SageMaker** utilities available for cloud training/deployment
- **Compute Region:** N/A (local)
- **Carbon Emitted:** Not calculated; estimate with the [MLCO2 calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications 

### Model Architecture and Objective

- **Architecture:** Wav2Vec2 encoder with **CTC** output layer
- **Objective:** Character‑level CTC loss for ASR

### Compute Infrastructure

#### Hardware

- Local GPU as above; or AWS instance types via SageMaker scripts (e.g., `ml.g4dn.xlarge`).

#### Software

- Python 3.10+
- Key dependencies: `transformers`, `datasets`, `torch`, `torchaudio`, `soundfile`, `librosa`, `jiwer`, `onnxruntime` (for ONNX testing), and `boto3`/`sagemaker` for deployment.

## Citation 

**BibTeX:**
```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}
```

**APA:**
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). *wav2vec 2.0: A framework for self‑supervised learning of speech representations*. arXiv:2006.11477.

## Glossary 

- **WER**: Word Error Rate; lower is better.
- **CER**: Character Error Rate; lower is better.
- **CTC**: Connectionist Temporal Classification, an alignment‑free loss for sequence labeling.

## More Information 

- **ONNX export:** `src/export_onnx.py`
- **AWS SageMaker:** scripts in `sagemaker/` for training, deployment, and autoscaling.
- **Training/metrics plots:** see `assets/` (e.g., `train_loss.svg`, `eval_wer.svg`, `eval_cer.svg`).

## Model Card Authors 

- Amirhossein Yousefi (repo author)

## Model Card Contact

- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR