---
title: Multilingual ASR
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: gradio_ui.py
pinned: false
---

# 🎙️ Multilingual Automatic Speech Recognition (ASR)

> **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/adiitya29/Multilingual-ASR) · **Landing Page:** [Vercel](https://YOUR_VERCEL_URL)

## 📌 Project Overview & Importance
An end-to-end Automatic Speech Recognition (ASR) pipeline that converts raw, unstructured audio into transcribed text. ASR is notoriously difficult because audio data varies so widely: background noise, hardware differences, accents, and speaking rates all change the signal.

This project bridges deep learning model inference with full-stack software engineering. It deploys a production-ready application that handles digital signal processing, neural network inference, and a user-facing REST API, all served from a single Python process.

## ⚙️ How It Works (The Pipeline)
1. **Audio Ingestion & DSP:** Raw audio (`.mp3`, `.wav`) is loaded via `librosa` and resampled to **16 kHz**, the sampling rate the model was trained on.
2. **Feature Extraction:** The Wav2Vec2 Processor normalizes the waveform and returns padded PyTorch tensors.
3. **Acoustic Model Inference:** `Wav2Vec2ForCTC` (1.26 GB, Large architecture) runs the forward pass using self-supervised learned speech representations.
4. **CTC Decoding:** Connectionist Temporal Classification (CTC) decoding turns the raw logits into the most probable character sequence.
5. **Output & Storage:** The transcript is persisted in a local JSON history and is downloadable as `.txt`; the full history is exportable as `.csv`.
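
The CTC decoding step can be illustrated without loading the model. The following is a minimal sketch of greedy CTC decoding in plain Python, not the project's actual code: the toy vocabulary, `BLANK_ID`, and the `ctc_greedy_decode` and `one_hot` helpers are assumptions made up for this example. It picks the most probable token for each frame, merges consecutive repeats, and then drops the blank token.

```python
from itertools import groupby

# Hypothetical toy vocabulary (an assumption for this sketch):
# index 0 is the CTC blank token and "|" marks a word boundary.
VOCAB = ["<pad>", "|", "a", "c", "t"]
BLANK_ID = 0


def ctc_greedy_decode(logits, vocab=VOCAB, blank_id=BLANK_ID):
    """Greedy CTC decode: argmax per frame, merge repeats, drop blanks."""
    # 1. Most probable token id for every time frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Remove blanks, then map ids to characters.
    chars = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace("|", " ").strip()


def one_hot(index, size=len(VOCAB)):
    """Build a fake logit frame that strongly prefers one token."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frame-level ids: c c _ a a _ t t | _ a  (where _ is the blank)
frames = [one_hot(i) for i in [3, 3, 0, 2, 2, 0, 4, 4, 1, 0, 2]]
transcript = ctc_greedy_decode(frames)
print(transcript)  # cat a
```

Because repeats are merged before blanks are removed, a blank between two identical tokens is what lets a genuinely doubled letter survive decoding.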

## 🛠️ Technology Stack

| Layer | Technology |
|---|---|
| Acoustic Model | `facebook/wav2vec2-large-960h-lv60-self` |
| Deep Learning | PyTorch (CPU-forced inference) |
| DSP | Librosa |
| Backend API | FastAPI + Uvicorn |
| ML UI | Gradio (Tabbed Blocks) |
| Language Detection | LangDetect |
| Landing Page | React + Vite |
| ML Deployment | Hugging Face Spaces |
| Web Deployment | Vercel |
| CI/CD | GitHub Actions |

## 🗂️ Project Structure
```
├── app/
│   ├── asr_model.py        # Wav2Vec2 model loading & inference (lazy-loaded)
│   ├── audio_processing.py # Librosa resampling to 16kHz
│   └── history.py          # JSON persistence, CSV/TXT export
├── landing page/           # React + Vite landing page (deployed to Vercel)
│   └── src/
│       ├── components/     # Nav, Hero, HowItWorks, TechStack, About, Footer
│       └── index.css
├── notebooks/
│   ├── 01_evaluation.ipynb # WER evaluation template
│   └── 02_finetuning.ipynb # Fine-tuning notebook (Colab-ready)
├── .github/workflows/
│   └── sync_to_hub.yml     # Auto-deploys to Hugging Face on every git push
├── gradio_ui.py            # Gradio Tabbed UI (Transcribe + History tabs)
├── main.py                 # FastAPI entry point, mounts Gradio at "/"
└── requirements.txt
```

## 🧠 Interview Talking Points (Key Technical Decisions)

### 1. Why Wav2Vec 2.0?
Traditional ASR models require massive amounts of accurately transcribed audio. Wav2Vec 2.0 uses **Self-Supervised Learning**: it learns from raw, unlabeled audio by masking parts of the speech signal and predicting the missing content (similar to BERT for text). This keeps it accurate even when labeled fine-tuning data is scarce.

### 2. Handling Apple Silicon Hardware Constraints
During development on an M1 Mac, model inference hung indefinitely. I traced this to a PyTorch limitation: the `mps` backend lacks support for the CTC operations used by Wav2Vec 2.0. **Solution:** A hardware fallback in `asr_model.py` forces CPU execution, prioritizing stability over theoretical GPU speed.
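
A fallback of this kind might look like the sketch below. This is not the code in `asr_model.py`: the `select_device` helper is hypothetical, and the CUDA branch is an assumption (the project itself describes CPU-forced inference). The point is only that the `mps` backend is never selected, even when it is available.

```python
def select_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device string for inference, never choosing the "mps" backend."""
    if cuda_available:
        # Assumption for this sketch: an NVIDIA GPU would be safe to use.
        return "cuda"
    # mps_available is deliberately ignored: inference hung on that backend,
    # so Apple Silicon machines fall back to the CPU.
    return "cpu"


# With PyTorch installed, the flags would typically come from
# torch.cuda.is_available() and torch.backends.mps.is_available().
device_on_m1 = select_device(cuda_available=False, mps_available=True)
print(device_on_m1)  # cpu
```

Keeping the decision in one small function makes it easy to revisit once the backend supports the required operations.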

### 3. Lazy Loading Pattern
Loading a 1.26GB model on server boot blocks FastAPI's main thread and causes timeouts. **Solution:** The model is loaded only on the first transcription request. Server boot time stays under 1 second regardless of model size.
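
The pattern can be sketched in a few lines of standard-library Python. The `LazyModel` class and `fake_loader` function below are hypothetical stand-ins, not the project's implementation; in the real app the loader would be whatever function builds the Wav2Vec2 model and processor.

```python
import threading


class LazyModel:
    """Defer an expensive load until the first call to get()."""

    def __init__(self, loader):
        self._loader = loader          # callable that builds the model
        self._model = None             # nothing is loaded at construction time
        self._lock = threading.Lock()  # guards against concurrent first requests

    def get(self):
        if self._model is None:
            with self._lock:
                # Re-check inside the lock so only one thread runs the loader.
                if self._model is None:
                    self._model = self._loader()
        return self._model


load_calls = []


def fake_loader():
    """Stand-in for the slow model load; records how often it runs."""
    load_calls.append(1)
    return object()


lazy_model = LazyModel(fake_loader)  # cheap: no load happens here
first = lazy_model.get()             # first request triggers the load
second = lazy_model.get()            # later requests reuse the same object
print(len(load_calls))  # 1
```

The trade-off is that the first transcription request pays the full load time instead of server startup.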

### 4. Unified Server Architecture
Rather than running two separate processes, the Gradio UI is mounted directly onto the FastAPI app (`app.mount("/", gr.routes.App.create_app(demo))`). One `uvicorn` process serves both the REST API and the interactive UI.

### 5. Dual CI/CD Pipelines
- **ML Backend:** `sync_to_hub.yml` (GitHub Actions) auto-deploys to Hugging Face Spaces on every push to `main`, using a scoped `HF_TOKEN` secret.
- **Frontend:** Vercel's GitHub integration auto-builds and deploys the React landing page on every push, with the `landing page/` subfolder set as the root directory.