---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- asr
- ctc
- wav2vec2
- common-voice
- onnx
- sagemaker
- huggingface
- transformers
- jiwer
datasets:
- mozilla-foundation/common_voice_17_0
base_model:
- facebook/wav2vec2-base-960h
license: other
language: en
metrics:
- wer
- cer
---

# Model Card for **ASR** (CTC-based ASR on English)

<!-- Provide a quick summary of what the model is/does. -->
This repository contains an end-to-end **Automatic Speech Recognition (ASR)** pipeline built around Hugging Face Transformers. The default configuration fine-tunes **`facebook/wav2vec2-base-960h`** with a **CTC** head on **Common Voice 17.0 (English)** and provides scripts to **train, evaluate, export to ONNX, and deploy on AWS SageMaker**. It also includes a robust audio-loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefi (GitHub: `@amirhossein-yousefi`)
- **Funded by:** Not specified
- **Shared by:** Amirhossein Yousefi
- **Model type:** CTC-based ASR using Transformers (**Wav2Vec2ForCTC**)
- **Language(s) (NLP):** English (`en`)
- **License:** The base model is Apache-2.0; the license of this repository and the fine-tuned weights is not explicitly stated here (treat as **other** until clarified)
- **Finetuned from model:** `facebook/wav2vec2-base-960h`

> The training/evaluation pipeline uses Hugging Face `transformers`, `datasets`, and `jiwer` and includes scripts for inference and SageMaker deployment.

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/ASR
- **Paper:** Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
- **Demo:** N/A (local CLI and SageMaker examples included)

## Uses

### Direct Use

- General-purpose **English** speech transcription for short to moderate audio segments (default duration filter: ~1–18 seconds).
- Local batch transcription via CLI or Python, or real-time deployment via AWS SageMaker (JSON base64 or raw WAV content types); a minimal endpoint-invocation sketch follows this list.

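The snippet below is a minimal sketch of invoking a deployed endpoint with `boto3`. The endpoint name (`asr-endpoint`) and the JSON field name (`audio_b64`) are placeholders, not the repository's actual schema; the handler code under `sagemaker/` defines the real payload format.

```python
import base64
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Option 1: JSON body carrying base64-encoded audio (ContentType: application/json).
with open("path/to/file.wav", "rb") as f:
    payload = {"audio_b64": base64.b64encode(f.read()).decode("utf-8")}  # field name is a placeholder

resp = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",  # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(resp["Body"].read().decode("utf-8"))

# Option 2: raw WAV bytes (ContentType: audio/wav).
with open("path/to/file.wav", "rb") as f:
    resp = runtime.invoke_endpoint(
        EndpointName="asr-endpoint",
        ContentType="audio/wav",
        Body=f.read(),
    )
print(resp["Body"].read().decode("utf-8"))
```
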
### Downstream Use

- Domain adaptation / further fine-tuning on task- or accent-specific datasets.
- Export to **ONNX** for CPU-friendly inference and integration in production applications.

### Out-of-Scope Use

- **Speaker diarization**, **punctuation restoration**, and **true streaming ASR** are not included.
- Multilingual or code-switched speech without additional fine-tuning.
- Very long files without chunking; heavy background noise without augmentation/tuning.

## Bias, Risks, and Limitations

- The default fine-tuning dataset (**Common Voice 17.0, English**) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out-of-domain audio (e.g., telephony, medical terms).
- Transcriptions may contain mistakes and can include sensitive/PII content if it is present in the audio; handle outputs responsibly.

### Recommendations

- Always evaluate **WER/CER** on your own held-out data. Consider adding punctuation/casing restoration models and domain vocabularies as needed.
- For regulated contexts, incorporate human-in-the-loop review and data governance.

## How to Get Started with the Model

**Python (local inference):**
```python
import torch, torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face Hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

# Load audio and resample to the model's expected rate (16 kHz for Wav2Vec2).
wav, sr = torchaudio.load("path/to/file.wav")
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits

# Greedy CTC decoding: frame-wise argmax, then collapse repeats and blanks.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
```

**CLI (example):**
```bash
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav
```

## Training Details

### Training Data

- **Dataset:** Common Voice 17.0 (English), text column: `sentence`
- **Duration filter:** min ~1.0 s, max ~18.0 s
- **Notes:** Case-aware text normalization and whitelist filtering to match the tokenizer vocabulary; optional waveform augmentations (a sketch of these filters follows this list).

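The sketch below illustrates the duration filter and vocabulary-whitelist normalization with `datasets`; the column names (`audio`, `sentence`) follow Common Voice, while the exact rules in the repository's preprocessing code may differ.

```python
from datasets import Audio, load_dataset

MIN_SEC, MAX_SEC = 1.0, 18.0
SAMPLE_RATE = 16_000

# Character whitelist matching the wav2vec2-base-960h vocabulary
# (upper-case letters and apostrophe); spaces are kept for the word delimiter.
VOCAB_CHARS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ' ")

ds = load_dataset("mozilla-foundation/common_voice_17_0", "en", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=SAMPLE_RATE))

def keep_by_duration(example):
    # Keep utterances roughly between 1 s and 18 s.
    duration = len(example["audio"]["array"]) / SAMPLE_RATE
    return MIN_SEC <= duration <= MAX_SEC

def normalize_text(example):
    # Upper-case to match the vocabulary, then drop out-of-vocabulary characters.
    example["sentence"] = "".join(ch for ch in example["sentence"].upper() if ch in VOCAB_CHARS)
    return example

ds = ds.filter(keep_by_duration).map(normalize_text)
```
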
### Training Procedure

#### Preprocessing

- Robust audio decoding (FFmpeg preferred on Windows; fallback to `torchaudio`/`soundfile`/`librosa`), with resampling to 16 kHz as required by Wav2Vec2.
- Tokenization via the model's processor; dynamic padding with a **CTC** data collator (a minimal collator sketch follows below).

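The collator pairs dynamically padded audio features with padded label ids whose padding positions are masked out of the loss. Below is a minimal sketch of this standard pattern for Wav2Vec2 CTC fine-tuning; the repository's own collator may differ in details.

```python
from dataclasses import dataclass
from typing import Dict, List

import torch
from transformers import Wav2Vec2Processor


@dataclass
class CTCDataCollator:
    """Dynamically pad 'input_values' (waveforms) and 'labels' (token ids) for CTC."""

    processor: Wav2Vec2Processor
    padding: bool = True

    def __call__(self, features: List[Dict]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=self.padding, return_tensors="pt")

        # Replace label padding with -100 so the CTC loss ignores those positions.
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch["attention_mask"].ne(1), -100)
        return batch
```
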
#### Training Hyperparameters

- **Epochs:** 3
- **Per-device batch size:** 8 (× **8** gradient accumulation → effective **64**)
- **Learning rate:** 3e-5
- **Warmup ratio:** 0.05
- **Optimizer:** `adamw_torch_fused`
- **Weight decay:** 0.0
- **Precision:** FP16
- **Max grad norm:** 1.0
- **Logging:** every 50 steps; **Eval/Save:** every 500 steps; keep last 2 checkpoints; early-stopping patience = 3
- **Seed:** 42 (see the `TrainingArguments` sketch below)

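As a rough mapping of the values above onto `transformers.TrainingArguments` (a sketch, not the repository's actual training script; the output directory is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/asr",          # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,       # 8 x 8 = effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,         # needed for early stopping
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)
```

Early stopping itself is typically attached via `transformers.EarlyStoppingCallback(early_stopping_patience=3)` on the `Trainer`.
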
#### Speeds, Sizes, Times

- **Total FLOPs (training):** 10,814,747,992,293,114,000 (≈1.08 × 10¹⁹)
- **Training runtime:** ~11,168 s for 2,346 steps
- **Logs:** TensorBoard logs at `src/output/logs` (or a similar path, as configured)

### Evaluation

#### Testing Data, Factors & Metrics

- **Metrics:** **WER** (primary) and **CER** (auxiliary), computed with `jiwer` utilities (toy example below).
- **Factors:** English speech across the Common Voice 17.0 splits; performance varies by accent, recording conditions, and utterance length.

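A minimal illustration of the two metrics with `jiwer` on toy strings (any normalization the repository applies before scoring is not shown here):

```python
import jiwer

references = ["the cat sat on the mat"]
hypotheses = ["the cat sat on a mat"]

wer = jiwer.wer(references, hypotheses)  # word error rate (primary)
cer = jiwer.cer(references, hypotheses)  # character error rate (auxiliary)
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```
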
#### Results

- Training logs include **loss**, **eval WER**, and **eval CER** curves. See the `assets/` directory for plots.

#### Summary

- Baseline WER/CER are logged at each evaluation step; users should report domain-specific results on their own datasets.

## Model Examination

- Greedy decoding is used by default; beam-search/LM fusion is not included in this repository. For error analysis, inspect the logits and frame-level alignments, as in the sketch below.

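Continuing from the inference example in the "How to Get Started" section (reusing `logits` and `processor`), one way to look at frame-level predictions before CTC collapsing:

```python
# One token id per ~20 ms frame; blanks appear as the pad token.
frame_ids = logits.argmax(dim=-1)[0].tolist()
frame_tokens = processor.tokenizer.convert_ids_to_tokens(frame_ids)
print(frame_tokens[:50])

# Per-frame confidence: maximum softmax probability at each frame.
frame_conf = logits.softmax(dim=-1)[0].max(dim=-1).values
print(f"mean confidence: {frame_conf.mean().item():.3f}, min: {frame_conf.min().item():.3f}")
```
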
## Environmental Impact

- **Hardware Type:** Laptop (Windows)
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), driver 576.52
- **CUDA / PyTorch:** CUDA 12.9, PyTorch 2.8.0+cu129
- **Hours used:** ~3.1 h
- **Cloud Provider:** N/A for local runs; **AWS SageMaker** utilities are available for cloud training/deployment
- **Compute Region:** N/A (local)
- **Carbon Emitted:** Not calculated; estimate with the [MLCO2 calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Wav2Vec2 encoder with a **CTC** output layer
- **Objective:** Character-level CTC loss for ASR

### Compute Infrastructure

#### Hardware

- Local GPU as above, or AWS instance types via the SageMaker scripts (e.g., `ml.g4dn.xlarge`).

#### Software

- Python 3.10+
- Key dependencies: `transformers`, `datasets`, `torch`, `torchaudio`, `soundfile`, `librosa`, `jiwer`, `onnxruntime` (for ONNX testing), and `boto3`/`sagemaker` for deployment.

## Citation

**BibTeX:**
```bibtex
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}
```

**APA:**
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). *wav2vec 2.0: A framework for self-supervised learning of speech representations*. arXiv:2006.11477.

## Glossary

- **WER**: Word Error Rate; lower is better (formula below).
- **CER**: Character Error Rate; lower is better.
- **CTC**: Connectionist Temporal Classification, an alignment-free loss for sequence labeling.

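For reference, WER is computed from a minimum edit-distance alignment between hypothesis and reference words; CER is the same computation over characters:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of words in the reference.
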
## More Information

- **ONNX export:** `src/export_onnx.py` (a usage sketch follows below)
- **AWS SageMaker:** scripts in `sagemaker/` for training, deployment, and autoscaling.
- **Training/metrics plots:** see `assets/` (e.g., `train_loss.svg`, `eval_wer.svg`, `eval_cer.svg`).

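After exporting, running the graph with `onnxruntime` might look like the sketch below; the ONNX file name and input name depend on how `src/export_onnx.py` performs the export, so both are placeholders. The example assumes a 16 kHz mono WAV.

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoProcessor

# Placeholders: the actual ONNX path and input name are set by src/export_onnx.py.
session = ort.InferenceSession("./outputs/asr/model.onnx", providers=["CPUExecutionProvider"])
processor = AutoProcessor.from_pretrained("./outputs/asr")

audio, sr = sf.read("path/to/file.wav", dtype="float32")  # expects 16 kHz mono
inputs = processor(audio, sampling_rate=sr, return_tensors="np")

input_name = session.get_inputs()[0].name  # typically "input_values"
logits = session.run(None, {input_name: inputs["input_values"].astype(np.float32)})[0]

pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])
```
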
## Model Card Authors

- Amirhossein Yousefi (repository author)

## Model Card Contact

- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR