|
|
--- |
|
|
library_name: transformers |
|
|
pipeline_tag: automatic-speech-recognition |
|
|
tags: |
|
|
- speech |
|
|
- asr |
|
|
- ctc |
|
|
- wav2vec2 |
|
|
- common-voice |
|
|
- onnx |
|
|
- sagemaker |
|
|
- huggingface |
|
|
- transformers |
|
|
- jiwer |
|
|
datasets: |
|
|
- mozilla-foundation/common_voice_17_0 |
|
|
base_model: |
|
|
- facebook/wav2vec2-base-960h |
|
|
license: other |
|
|
language: en |
|
|
metrics: |
|
|
- wer |
|
|
- cer |
|
|
--- |
|
|
|
|
|
# Model Card for **ASR** (CTC-based English speech recognition)
|
|
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
This repository contains an end‑to‑end **Automatic Speech Recognition (ASR)** pipeline built around Hugging Face Transformers. The default configuration fine‑tunes **`facebook/wav2vec2-base-960h`** with a **CTC** head on a 50k‑example subsample of **Common Voice 17.0 (English)** and provides scripts to **train, evaluate, export to ONNX, and deploy on AWS SageMaker**. It also includes a robust audio‑loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** Amirhossein Yousefi (GitHub: `@amirhossein-yousefi`) |
|
|
- **Funded by:** Not specified
|
|
- **Shared by:** Amirhossein Yousefi
|
|
- **Model type:** CTC-based ASR using Transformers (**Wav2Vec2ForCTC**) |
|
|
- **Language(s) (NLP):** English (`en`) |
|
|
- **License:** The base model is Apache-2.0; the license for this repository and the fine-tuned weights is not explicitly stated here (treat as **other** until clarified)
|
|
- **Finetuned from model:** `facebook/wav2vec2-base-960h`
|
|
|
|
|
> The training/evaluation pipeline uses Hugging Face `transformers`, `datasets`, and `jiwer` and includes scripts for inference and SageMaker deployment. |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** https://github.com/amirhossein-yousefi/ASR |
|
|
- **Paper:** Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
|
|
- **Demo:** N/A (local CLI and SageMaker examples included)
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
- General‑purpose **English** speech transcription for short to moderately long audio segments (default duration filter: ~1–18 seconds).
|
|
- Local batch transcription via CLI or Python, or real‑time deployment via AWS SageMaker (JSON base64 or raw WAV content types). |
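For quick local transcription, a minimal sketch using the Transformers `pipeline` API (the checkpoint path below is an assumption; any Hugging Face Hub id works as well):

```python
from transformers import pipeline

# "./outputs/asr" is a hypothetical local checkpoint directory produced by the training scripts.
asr = pipeline("automatic-speech-recognition", model="./outputs/asr")

# The pipeline decodes and resamples common audio formats (via ffmpeg) before running the CTC model.
print(asr("path/to/file.wav")["text"])
```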
|
|
|
|
|
### Downstream Use |
|
|
|
|
|
- Domain adaptation / further fine‑tuning on task‑ or accent‑specific datasets. |
|
|
- Export to **ONNX** for CPU‑friendly inference and integration in production applications. |
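For the ONNX path, a hedged sketch of CPU inference with `onnxruntime` after exporting via `src/export_onnx.py`; the output file name, and the assumption that the exported graph takes the processor's `input_values` and returns CTC logits, are illustrative rather than guaranteed by the repo:

```python
import numpy as np
import onnxruntime as ort
import soundfile as sf
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./outputs/asr")  # hypothetical checkpoint path
session = ort.InferenceSession("wav2vec2_ctc.onnx", providers=["CPUExecutionProvider"])  # hypothetical export name

# Assumes a 16 kHz mono WAV; resample/mix down first if needed.
audio, sr = sf.read("path/to/file.wav", dtype="float32")
inputs = processor(audio, sampling_rate=sr, return_tensors="np")

# Run the graph and decode greedily (argmax per frame, then collapse repeats/blanks).
input_name = session.get_inputs()[0].name
logits = session.run(None, {input_name: inputs["input_values"]})[0]
pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])
```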
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- **Speaker diarization**, **punctuation restoration**, and **true streaming ASR** are not included. |
|
|
- Multilingual or code‑switched speech without additional fine‑tuning. |
|
|
- Very long files without chunking; heavy background noise without augmentation/tuning. |
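Long recordings are out of scope for single‑pass inference, but a common workaround (an assumption about your workflow, not a script shipped in the repo) is windowed inference via the pipeline's chunking parameters:

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="./outputs/asr")  # hypothetical checkpoint path

# chunk_length_s splits long audio into windows; stride_length_s overlaps them so CTC outputs can be stitched.
result = asr("path/to/long_recording.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```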
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
- The default fine‑tuning dataset (**Common Voice 17.0, English**) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out‑of‑domain audio (e.g., telephony, medical terms). |
|
|
- Transcriptions may contain mistakes and can include sensitive/PII if present in audio; handle outputs responsibly. |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
- Always evaluate **WER/CER** on your own held‑out data (a `jiwer` sketch follows this list). Consider adding punctuation/casing restoration models and domain vocabularies as needed.
|
|
- For regulated contexts, incorporate a human‑in‑the‑loop review and data governance. |
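A minimal sketch for computing corpus-level WER/CER with `jiwer`; the reference and hypothesis strings below are placeholders:

```python
import jiwer

references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]

# jiwer accepts lists of strings and returns corpus-level error rates.
wer = jiwer.wer(references, hypotheses)
cer = jiwer.cer(references, hypotheses)
print(f"WER: {wer:.3f}  CER: {cer:.3f}")
```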
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
**Python (local inference):** |
|
|
```python
import torch
import torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face Hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

# Load audio, mix down to mono, and resample to the rate the processor expects (16 kHz for Wav2Vec2).
wav, sr = torchaudio.load("path/to/file.wav")
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

# Greedy CTC decoding: argmax over the vocabulary per frame, then collapse repeats and blanks.
inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
```
|
|
|
|
|
**CLI (example):** |
|
|
```bash |
|
|
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav |
|
|
``` |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
- **Dataset:** Common Voice 17.0 (English), text column: `sentence` |
|
|
- **Duration filter:** min ~1.0s, max ~18.0s |
|
|
- **Notes:** Case‑aware text normalization and character‑whitelist filtering to match the tokenizer vocabulary; optional waveform augmentations (a hedged normalization sketch follows below).
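One possible normalization, as a hedged sketch; the exact rules live in the repo's preprocessing code, and the character whitelist below is an assumption matching the base model's uppercase English vocabulary:

```python
import re

# Hypothetical whitelist: wav2vec2-base-960h uses an uppercase English vocabulary plus apostrophe.
_DISALLOWED = re.compile(r"[^A-Z' ]+")

def normalize_sentence(text: str) -> str:
    """Uppercase, drop characters outside the tokenizer vocabulary, and collapse whitespace."""
    text = text.upper()
    text = _DISALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_sentence("Hello, world! It's 9 o'clock."))  # -> "HELLO WORLD IT'S O'CLOCK"
```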
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing
|
|
|
|
|
- Robust audio decoding (FFmpeg preferred on Windows; fallback to `torchaudio/soundfile/librosa`), resampling to 16 kHz as required by Wav2Vec2. |
|
|
- Tokenization via the model’s processor; dynamic padding with a **CTC** collator. |
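A minimal sketch of a CTC data collator with dynamic padding, a common pattern for Wav2Vec2 fine‑tuning; the class itself and the `input_values`/`labels` field names follow standard processor outputs and are an assumption, not necessarily the repo's exact implementation:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    """Dynamically pads audio inputs and labels separately, masking padded label positions with -100."""
    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = self.processor.pad(labels=label_features, padding=True, return_tensors="pt")

        # Replace tokenizer padding with -100 so the CTC loss ignores padded positions.
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        batch["labels"] = labels
        return batch
```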
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Epochs:** 3 |
|
|
- **Per‑device batch size:** 8 (× **8** grad accumulation → effective **64**) |
|
|
- **Learning rate:** 3e‑5 |
|
|
- **Warmup ratio:** 0.05 |
|
|
- **Optimizer:** `adamw_torch_fused` |
|
|
- **Weight decay:** 0.0 |
|
|
- **Precision:** FP16 |
|
|
- **Max grad norm:** 1.0 |
|
|
- **Logging:** every 50 steps; **Eval/Save:** every 500 steps; keep last 2 checkpoints; early stopping patience = 3 |
|
|
- **Seed:** 42 |
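Expressed as Hugging Face `TrainingArguments`, the settings above would look roughly like the sketch below; argument names follow current `transformers`, and the output path, metric key, and exact flag spellings in the repo may differ:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/asr",        # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,     # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,       # required for early stopping
    metric_for_best_model="wer",       # metric key is an assumption
    greater_is_better=False,
    seed=42,
)
# Early stopping (patience = 3) would be added via EarlyStoppingCallback when constructing the Trainer.
```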
|
|
|
|
|
#### Speeds, Sizes, Times
|
|
|
|
|
- **Total FLOPs (training):** 10,814,747,992,293,114,000 (≈1.08 × 10¹⁹)
|
|
- **Training runtime:** ~11,168 s (≈3.1 h) for 2,346 steps
|
|
- **Logs:** TensorBoard at `src/output/logs` (or similar path as configured) |
|
|
|
|
|
### Evaluation |
|
|
|
|
|
#### Testing Data, Factors & Metrics |
|
|
|
|
|
- **Metrics:** **WER** (primary) and **CER** (auxiliary), computed with `jiwer` utilities. |
|
|
- **Factors:** English speech across CV17 splits; performance varies by accent, recording conditions, and utterance length. |
|
|
|
|
|
#### Results |
|
|
|
|
|
- Training logs include **loss**, **eval WER**, and **eval CER** curves; see the `assets/` directory for plots.
|
|
|
|
|
#### Summary |
|
|
|
|
|
- Baseline WER/CER are logged at each evaluation step; users should report domain‑specific results on their own datasets.
|
|
|
|
|
## Model Examination |
|
|
|
|
|
- Greedy decoding is used by default; beam search and LM fusion are not included in this repo. Inspect logits and alignments as needed for error analysis.
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** Laptop (Windows) |
|
|
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), Driver 576.52 |
|
|
- **CUDA / PyTorch:** CUDA 12.9, PyTorch 2.8.0+cu129 |
|
|
- **Hours used:** ~3.1 h (approx.) |
|
|
- **Cloud Provider:** N/A for local; **AWS SageMaker** utilities available for cloud training/deployment |
|
|
- **Compute Region:** N/A (local) |
|
|
- **Carbon Emitted:** Not calculated; estimate with the [MLCO2 calculator](https://mlco2.github.io/impact#compute) |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
|
|
|
- **Architecture:** Wav2Vec2 encoder with **CTC** output layer |
|
|
- **Objective:** Character‑level CTC loss for ASR |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
#### Hardware |
|
|
|
|
|
- Local GPU as above; or AWS instance types via SageMaker scripts (e.g., `ml.g4dn.xlarge`). |
|
|
|
|
|
#### Software |
|
|
|
|
|
- Python 3.10+ |
|
|
- Key dependencies: `transformers`, `datasets`, `torch`, `torchaudio`, `soundfile`, `librosa`, `jiwer`, `onnxruntime` (for ONNX testing), and `boto3`/`sagemaker` for deployment. |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex |
|
|
@article{baevski2020wav2vec, |
|
|
title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations}, |
|
|
author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael}, |
|
|
journal={arXiv preprint arXiv:2006.11477}, |
|
|
year={2020} |
|
|
} |
|
|
``` |
|
|
|
|
|
**APA:** |
|
|
Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). *wav2vec 2.0: A framework for self‑supervised learning of speech representations*. arXiv:2006.11477. |
|
|
|
|
|
## Glossary |
|
|
|
|
|
- **WER**: Word Error Rate, i.e., (substitutions + deletions + insertions) / number of reference words; lower is better.
|
|
- **CER**: Character Error Rate, the same ratio computed over characters instead of words; lower is better.
|
|
- **CTC**: Connectionist Temporal Classification, an alignment‑free loss for sequence labeling. |
|
|
|
|
|
## More Information |
|
|
|
|
|
- **ONNX export:** `src/export_onnx.py` |
|
|
- **AWS SageMaker:** scripts in `sagemaker/` for training, deployment, and autoscaling (see the invocation sketch below).
|
|
- **Training/metrics plots:** see `assets/` (e.g., `train_loss.svg`, `eval_wer.svg`, `eval_cer.svg`). |
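A hedged sketch of invoking a deployed real‑time endpoint with `boto3`; the endpoint name and content type are assumptions, and per the repo's description the handler also accepts JSON with base64‑encoded audio:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name; the deployment scripts in sagemaker/ determine the real one.
with open("path/to/file.wav", "rb") as f:
    response = runtime.invoke_endpoint(
        EndpointName="asr-wav2vec2-endpoint",
        ContentType="audio/wav",   # raw WAV; a JSON/base64 payload is the alternative
        Body=f.read(),
    )

print(response["Body"].read().decode("utf-8"))
```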
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
- Amirhossein Yousefi (repo author) |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR |
|
|
|