---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- asr
- ctc
- wav2vec2
- common-voice
- onnx
- sagemaker
- huggingface
- transformers
- jiwer
datasets:
- mozilla-foundation/common_voice_17_0
base_model:
- facebook/wav2vec2-base-960h
license: other
language: en
metrics:
- wer
- cer
---

# Model Card for **ASR** (CTC-based English speech recognition)

This repository contains an end-to-end **Automatic Speech Recognition (ASR)** pipeline built around Hugging Face Transformers. The default configuration fine-tunes **`facebook/wav2vec2-base-960h`** with a **CTC** head on **Common Voice 17.0 (English)** and provides scripts to **train, evaluate, export to ONNX, and deploy on AWS SageMaker**. It also includes a robust audio-loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefi (GitHub: `@amirhossein-yousefi`)
- **Funded by:** Not specified
- **Shared by:** Amirhossein Yousefi
- **Model type:** CTC-based ASR using Transformers (**Wav2Vec2ForCTC**)
- **Language(s):** English (`en`)
- **License:** The base model is Apache-2.0; the license of this repository and the fine-tuned weights is not explicitly stated (treat as **other** until clarified)
- **Finetuned from model:** `facebook/wav2vec2-base-960h`

> The training/evaluation pipeline uses Hugging Face `transformers`, `datasets`, and `jiwer`, and includes scripts for inference and SageMaker deployment.

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/ASR
- **Paper:** Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
- **Demo:** N/A (local CLI and SageMaker examples included)

## Uses

### Direct Use

- General-purpose **English** speech transcription for short to moderate audio segments (default duration filter: ~1–18 seconds).
- Local batch transcription via CLI or Python, or real-time deployment via AWS SageMaker (JSON base64 or raw WAV content types); a minimal client sketch follows below.
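
For SageMaker deployment, a client might look like the following sketch. It assumes a JSON payload carrying a base64-encoded WAV under an `audio` key and a hypothetical endpoint name `asr-endpoint`; the actual request schema is defined by the repo's inference handler in `sagemaker/`.

```python
import base64
import json

import boto3  # AWS SDK; credentials and region come from your environment

# Hypothetical endpoint name -- replace with the one you actually deployed.
ENDPOINT_NAME = "asr-endpoint"

runtime = boto3.client("sagemaker-runtime")

# Read a local WAV file and base64-encode it for the JSON content type.
with open("path/to/file.wav", "rb") as f:
    payload = {"audio": base64.b64encode(f.read()).decode("utf-8")}  # "audio" key is an assumption

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))  # transcription returned by the handler
```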

### Downstream Use

- Domain adaptation / further fine-tuning on task- or accent-specific datasets.
- Export to **ONNX** for CPU-friendly inference and integration in production applications (see the `onnxruntime` sketch below).
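
As a rough illustration of consuming an exported model, the sketch below runs a CTC forward pass with `onnxruntime` and greedy-decodes with the saved processor. The input name `input_values` and the export path are assumptions; verify them against the graph actually produced by `src/export_onnx.py`.

```python
import numpy as np
import onnxruntime as ort
import torchaudio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./outputs/asr")
session = ort.InferenceSession("./outputs/asr/model.onnx")  # assumed export path

# Load and resample the audio to the model's expected rate.
wav, sr = torchaudio.load("path/to/file.wav")
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="np")
# "input_values" is the conventional input name for wav2vec2 exports --
# check `session.get_inputs()` against your own ONNX graph.
logits = session.run(None, {"input_values": inputs["input_values"]})[0]
pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])
```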

### Out-of-Scope Use

- **Speaker diarization**, **punctuation restoration**, and **true streaming ASR** are not included.
- Multilingual or code-switched speech without additional fine-tuning.
- Very long files without chunking; heavy background noise without augmentation/tuning.

## Bias, Risks, and Limitations

- The default fine-tuning dataset (**Common Voice 17.0, English**) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out-of-domain audio (e.g., telephony, medical terms).
- Transcriptions may contain mistakes and can include sensitive/PII content if present in the audio; handle outputs responsibly.

### Recommendations

- Always evaluate **WER/CER** on your own held-out data (see the sketch below). Consider adding punctuation/casing restoration models and domain vocabularies as needed.
- For regulated contexts, incorporate human-in-the-loop review and data governance.
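
As a quick example of such a check with `jiwer` (the transcripts below are placeholders):

```python
import jiwer

# Placeholder lists -- pair each reference transcript with the model's hypothesis.
references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
print(f"CER: {jiwer.cer(references, hypotheses):.3f}")
```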

## How to Get Started with the Model

**Python (local inference):**
```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face Hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

# Load the audio and resample to the rate the model expects (16 kHz for Wav2Vec2).
wav, sr = torchaudio.load("path/to/file.wav")
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
```

**CLI (example):**
```bash
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav
```

## Training Details

### Training Data

- **Dataset:** Common Voice 17.0 (English), text column: `sentence`
- **Duration filter:** min ~1.0 s, max ~18.0 s
- **Notes:** Case-aware normalization and whitelist filtering to match the tokenizer vocabulary; optional waveform augmentations. A sketch of the whitelist idea follows below.
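
The repo's normalization utilities are not reproduced here, but the whitelist idea can be sketched as follows; the uppercase convention and word-delimiter handling match the `facebook/wav2vec2-base-960h` vocabulary:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

# Single-character tokens the CTC tokenizer can emit (special tokens are longer).
vocab_chars = {c for c in processor.tokenizer.get_vocab() if len(c) == 1}

def whitelist_filter(text: str) -> str:
    """Uppercase (the 960h vocab is uppercase) and drop out-of-vocabulary characters."""
    text = text.upper().replace(" ", "|")  # "|" is the word delimiter in this vocab
    return "".join(c for c in text if c in vocab_chars).replace("|", " ").strip()

print(whitelist_filter("Hello, world!"))  # -> "HELLO WORLD"
```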

### Training Procedure

#### Preprocessing

- Robust audio decoding (FFmpeg preferred on Windows; fallbacks to `torchaudio`/`soundfile`/`librosa`), with resampling to 16 kHz as required by Wav2Vec2.
- Tokenization via the model's processor; dynamic padding with a **CTC** collator (see the sketch below).
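
The repo's collator is not shown here; the sketch below illustrates the standard CTC padding pattern, assuming dataset rows with `input_values` and `labels` keys: audio features and labels are padded separately, and label padding is masked with `-100` so the loss ignores it.

```python
from dataclasses import dataclass
from typing import Any

import torch
from transformers import Wav2Vec2Processor

@dataclass
class CTCDataCollator:
    """Dynamically pads input audio features and labels for CTC training."""

    processor: Wav2Vec2Processor

    def __call__(self, features: list[dict[str, Any]]) -> dict[str, torch.Tensor]:
        audio = [{"input_values": f["input_values"]} for f in features]
        labels = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.feature_extractor.pad(audio, padding=True, return_tensors="pt")
        label_batch = self.processor.tokenizer.pad(labels, padding=True, return_tensors="pt")

        # Replace label padding with -100 so the CTC loss skips those positions.
        batch["labels"] = label_batch["input_ids"].masked_fill(
            label_batch["attention_mask"].ne(1), -100
        )
        return batch
```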

#### Training Hyperparameters

- **Epochs:** 3
- **Per-device batch size:** 8 (× **8** grad accumulation → effective **64**)
- **Learning rate:** 3e-5
- **Warmup ratio:** 0.05
- **Optimizer:** `adamw_torch_fused`
- **Weight decay:** 0.0
- **Precision:** FP16
- **Max grad norm:** 1.0
- **Logging:** every 50 steps; **Eval/Save:** every 500 steps; keep last 2 checkpoints; early stopping patience = 3
- **Seed:** 42
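
A hedged reconstruction of these settings as `TrainingArguments` (argument names follow recent `transformers` releases; the repo's training script is authoritative):

```python
from transformers import EarlyStoppingCallback, TrainingArguments

args = TrainingArguments(
    output_dir="./outputs/asr",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,   # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",           # "evaluation_strategy" on older transformers
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)
# Early stopping is applied via a callback, e.g.:
# trainer = Trainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```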

#### Speeds, Sizes, Times

- **Total FLOPs (training):** 10,814,747,992,293,114,000 (≈ 1.08 × 10¹⁹)
- **Training runtime:** ~11,168 s (~3.1 h) for 2,346 steps
- **Logs:** TensorBoard logs at `src/output/logs` (or a similar path, as configured)

### Evaluation

#### Testing Data, Factors & Metrics

- **Metrics:** **WER** (primary) and **CER** (auxiliary), computed with `jiwer` utilities (see the `compute_metrics` sketch below).
- **Factors:** English speech across Common Voice 17.0 splits; performance varies by accent, recording conditions, and utterance length.
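
A minimal `compute_metrics` for a CTC `Trainer`, assuming a `processor` in scope; the repo's own evaluation utilities may normalize text differently:

```python
import numpy as np
import jiwer

def compute_metrics(pred):
    """Decode CTC predictions and score them with jiwer WER/CER."""
    pred_ids = np.argmax(pred.predictions, axis=-1)

    # Restore the pad token where labels were masked with -100, then decode.
    labels = pred.label_ids.copy()
    labels[labels == -100] = processor.tokenizer.pad_token_id

    hypotheses = processor.batch_decode(pred_ids)
    references = processor.batch_decode(labels, group_tokens=False)

    return {
        "wer": jiwer.wer(references, hypotheses),
        "cer": jiwer.cer(references, hypotheses),
    }
```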

#### Results

- Training logs include **loss**, **eval WER**, and **eval CER** curves; see the `assets/` directory for plots.

#### Summary

- Baseline WER/CER are logged at each evaluation step; users should report domain-specific results on their own datasets.

## Model Examination

- Greedy decoding by default; beam search / LM fusion is not included in this repo. Inspect logits and frame-level alignments for error analysis if needed (a sketch follows below).
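
As an illustration, per-frame argmax ids can be mapped to rough timestamps; this sketch assumes the `logits` and `processor` from the inference snippet above and the standard Wav2Vec2 frame stride of 320 samples (20 ms at 16 kHz):

```python
import torch

pred_ids = torch.argmax(logits, dim=-1)[0]          # per-frame token ids
frame_sec = 320 / 16_000                            # ~20 ms per encoder frame (assumed stride)

for i, token_id in enumerate(pred_ids.tolist()):
    token = processor.tokenizer.convert_ids_to_tokens(token_id)
    if token != processor.tokenizer.pad_token:      # skip CTC blanks
        print(f"{i * frame_sec:6.2f}s  {token}")
```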

## Environmental Impact

- **Hardware Type:** Laptop (Windows)
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), driver 576.52
- **CUDA / PyTorch:** CUDA 12.9, PyTorch 2.8.0+cu129
- **Hours used:** ~3.1 h
- **Cloud Provider:** N/A (trained locally); **AWS SageMaker** utilities are available for cloud training/deployment
- **Compute Region:** N/A (local)
- **Carbon Emitted:** Not calculated; estimate with the [MLCO2 calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Wav2Vec2 encoder with a **CTC** output layer
- **Objective:** Character-level CTC loss for ASR

### Compute Infrastructure

#### Hardware

- Local GPU as above, or AWS instance types via the SageMaker scripts (e.g., `ml.g4dn.xlarge`).

#### Software

- Python 3.10+
- Key dependencies: `transformers`, `datasets`, `torch`, `torchaudio`, `soundfile`, `librosa`, `jiwer`, `onnxruntime` (for ONNX testing), and `boto3`/`sagemaker` for deployment.

## Citation

**BibTeX:**
```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}
```

**APA:**
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). *wav2vec 2.0: A framework for self-supervised learning of speech representations*. arXiv:2006.11477.

## Glossary

- **WER:** Word Error Rate; lower is better.
- **CER:** Character Error Rate; lower is better.
- **CTC:** Connectionist Temporal Classification, an alignment-free loss for sequence labeling.

## More Information

- **ONNX export:** `src/export_onnx.py`
- **AWS SageMaker:** scripts in `sagemaker/` for training, deployment, and autoscaling.
- **Training/metrics plots:** see `assets/` (e.g., `train_loss.svg`, `eval_wer.svg`, `eval_cer.svg`).

## Model Card Authors

- Amirhossein Yousefi (repo author)

## Model Card Contact

- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR