
	---
	title: Multilingual ASR
	emoji: 🎙️
	colorFrom: blue
	colorTo: indigo
	sdk: gradio
	app_file: gradio_ui.py
	pinned: false
	---

	# 🎙️ Multilingual Automatic Speech Recognition (ASR)

	> Live Demo: [Hugging Face Space](https://huggingface.co/spaces/adiitya29/Multilingual-ASR) · Landing Page: [Vercel](https://YOUR_VERCEL_URL)

	## 📌 Project Overview & Importance
	An end-to-end Automatic Speech Recognition (ASR) pipeline that converts raw unstructured audio into transcribed text. ASR is notoriously difficult due to variability in audio data — background noise, hardware differences, varied accents, and speech speeds.

	This project bridges Deep Learning model inference with full-stack software engineering, deploying a production-ready application that handles digital signal processing, neural network inference, and a user-facing REST API — all served from a single Python process.

	## ⚙️ How It Works (The Pipeline)
	1. Audio Ingestion & DSP: Raw audio (`.mp3`, `.wav`) is loaded via `librosa` and resampled to 16 kHz, the sample rate the model was trained on.
	2. Feature Extraction: The Wav2Vec2 Processor normalizes the waveform and packs it into padded PyTorch tensors.
	3. Acoustic Model Inference: `Wav2Vec2ForCTC` (1.26 GB, Large architecture) runs the forward pass using speech representations learned through self-supervised pre-training.
	4. CTC Decoding: Connectionist Temporal Classification (CTC) decoding maps the per-frame logits to the most probable character sequence, collapsing repeated tokens and removing blank tokens.
	5. Output & Storage: The transcript is persisted in a local JSON history and can be downloaded as `.txt`; the full history can be exported as `.csv`.
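
Step 4 can be illustrated without loading the model. The sketch below is a minimal greedy (best-path) CTC decoder in plain Python, assuming a hypothetical five-token vocabulary in which `<pad>` acts as the CTC blank and `|` marks word boundaries, mirroring the convention of the Hugging Face Wav2Vec2 tokenizer. The function names and the fake logits are illustrative only; the project may well rely on the processor's own decode method instead.

```python
from itertools import groupby

# Hypothetical toy vocabulary (assumption): index 0 is the CTC blank,
# "|" is the word delimiter.
VOCAB = ["<pad>", "|", "A", "C", "T"]
BLANK_ID = 0


def greedy_ctc_decode(logits, vocab=VOCAB, blank_id=BLANK_ID):
    """Decode a [time][vocab] list of scores into text using best-path CTC."""
    # 1. Pick the highest-scoring token id at every time step.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Collapse runs of the same token id into a single id.
    collapsed = [token_id for token_id, _ in groupby(best_path)]
    # 3. Drop blank tokens.
    chars = [vocab[i] for i in collapsed if i != blank_id]
    # 4. Turn the word delimiter into a space.
    return "".join(chars).replace("|", " ").strip()


def one_hot(index, size=len(VOCAB)):
    """Build a fake logit frame that strongly favors one token."""
    return [5.0 if i == index else 0.0 for i in range(size)]


# Frame-level path: A A | | <pad> C C A <pad> T T
frames = [one_hot(i) for i in [2, 2, 1, 1, 0, 3, 3, 2, 0, 4, 4]]
decoded = greedy_ctc_decode(frames)
print(decoded)  # A CAT
```

Because repeats are collapsed before blanks are removed, a genuinely doubled letter only survives when a blank frame separates the two occurrences.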

	## 🛠️ Technology Stack

	| Layer | Technology |
	|---|---|
	| Acoustic Model | `facebook/wav2vec2-large-960h-lv60-self` |
	| Deep Learning | PyTorch (CPU-forced inference) |
	| DSP | Librosa |
	| Backend API | FastAPI + Uvicorn |
	| ML UI | Gradio (Tabbed Blocks) |
	| Language Detection | LangDetect |
	| Landing Page | React + Vite |
	| ML Deployment | Hugging Face Spaces |
	| Web Deployment | Vercel |
	| CI/CD | GitHub Actions |

	## 🗂️ Project Structure
	```
	├── app/
	│ ├── asr_model.py # Wav2Vec2 model loading & inference (lazy-loaded)
	│ ├── audio_processing.py # Librosa resampling to 16kHz
	│ └── history.py # JSON persistence, CSV/TXT export
	├── landing page/ # React + Vite landing page (deployed to Vercel)
	│ └── src/
	│ ├── components/ # Nav, Hero, HowItWorks, TechStack, About, Footer
	│ └── index.css
	├── notebooks/
	│ ├── 01_evaluation.ipynb # WER evaluation template
	│ └── 02_finetuning.ipynb # Fine-tuning notebook (Colab-ready)
	├── .github/workflows/
	│ └── sync_to_hub.yml # Auto-deploys to Hugging Face on every git push
	├── gradio_ui.py # Gradio Tabbed UI (Transcribe + History tabs)
	├── main.py # FastAPI entry point, mounts Gradio at "/"
	└── requirements.txt
	```
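
The tree describes `app/history.py` as handling JSON persistence and CSV export. The sketch below shows one way that could look using only the standard library. The function names (`append_entry`, `export_csv`) and the record fields (`timestamp`, `filename`, `transcript`) are assumptions made for illustration, not the module's actual interface.

```python
import csv
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["timestamp", "filename", "transcript"]  # assumed record layout


def append_entry(history_path, filename, transcript):
    """Append one transcript record to a JSON list stored on disk."""
    path = Path(history_path)
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    entries.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "filename": filename,
        "transcript": transcript,
    })
    path.write_text(json.dumps(entries, indent=2, ensure_ascii=False), encoding="utf-8")
    return entries


def export_csv(history_path, csv_path):
    """Write the full history to a CSV file and return the number of rows."""
    entries = json.loads(Path(history_path).read_text(encoding="utf-8"))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(entries)
    return len(entries)


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    history_file = Path(tmp) / "history.json"
    append_entry(history_file, "hello.wav", "HELLO WORLD")
    append_entry(history_file, "bye.mp3", "GOODBYE")
    exported_rows = export_csv(history_file, Path(tmp) / "history.csv")
    print(exported_rows)
```

Rewriting the whole JSON file on every append keeps the sketch short; it is only reasonable while the history stays small.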

	## 🧠 Interview Talking Points (Key Technical Decisions)

	### 1. Why Wav2Vec 2.0?
	Traditional ASR models require massive amounts of accurately transcribed audio. Wav2Vec 2.0 uses self-supervised learning: it pre-trains on raw, unlabeled audio by masking spans of latent speech features and learning to identify the correct representation for each masked span (similar in spirit to BERT's masked-token objective for text). This lets it reach high accuracy even when labeled fine-tuning data is scarce.

	### 2. Handling Apple Silicon Hardware Constraints
	During development on an M1 Mac, model inference hung indefinitely. I traced this to a PyTorch limitation: the `mps` backend lacks support for the CTC operations used by Wav2Vec2. Solution: a hardware fallback in `asr_model.py` forces CPU execution, prioritizing stability over theoretical GPU speed.
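
A device-selection helper along these lines is one way to express such a fallback. This is a sketch under assumptions: `select_device` and its `allow_mps` flag are hypothetical names, and the project itself simply forces CPU, which is the stricter version of the same idea. The availability flags are passed in as plain booleans so the logic can be exercised without PyTorch installed.

```python
def select_device(cuda_available: bool, mps_available: bool, allow_mps: bool = False) -> str:
    """Return a device string suitable for `model.to(...)`.

    MPS is skipped unless explicitly allowed, because some operations
    may be unsupported there; CPU is the safe default.
    """
    if cuda_available:
        return "cuda"
    if mps_available and allow_mps:
        return "mps"
    return "cpu"


# With PyTorch installed, the flags would come from the runtime, e.g.:
#   import torch
#   device = select_device(torch.cuda.is_available(),
#                          torch.backends.mps.is_available())
apple_silicon_device = select_device(cuda_available=False, mps_available=True)
print(apple_silicon_device)  # cpu
```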

	### 3. Lazy Loading Pattern
	Loading a 1.26 GB model on server boot blocks FastAPI's main thread and causes timeouts. Solution: The model is loaded only on the first transcription request. Server boot time stays under 1 second regardless of model size, at the cost of a slower first request.
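
The pattern itself is small. Below is a minimal sketch of a thread-safe lazy loader, assuming a hypothetical `LazyModel` wrapper and a fake loader standing in for the real `from_pretrained` call. The actual code in `asr_model.py` may be organized differently.

```python
import threading


class LazyModel:
    """Defer an expensive load until the first call, then reuse the result."""

    def __init__(self, loader):
        self._loader = loader          # zero-argument callable that builds the model
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        if self._model is None:        # fast path once the model is loaded
            with self._lock:           # only one thread performs the load
                if self._model is None:
                    self._model = self._loader()
        return self._model


load_calls = []


def fake_loader():
    # Stand-in for an expensive call such as loading model weights from disk.
    load_calls.append(1)
    return {"name": "fake-model"}


asr = LazyModel(fake_loader)
calls_before_first_request = len(load_calls)  # nothing loaded at "boot"
first = asr.get()                             # first request triggers the load
second = asr.get()                            # later requests reuse the instance
print(calls_before_first_request, len(load_calls), first is second)  # 0 1 True
```

The second check inside the lock matters when two requests arrive before the first load finishes: without it, both would load the model.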

	### 4. Unified Server Architecture
	Rather than running two separate processes, the Gradio UI is mounted directly onto the FastAPI app (`app.mount("/", gr.routes.App.create_app(demo))`). One `uvicorn` process serves both the REST API and the interactive UI.

	### 5. Dual CI/CD Pipelines
	- ML Backend: `sync_to_hub.yml` (GitHub Actions) auto-deploys to Hugging Face Spaces on every push to `main`, using a scoped `HF_TOKEN` secret.
	- Frontend: Vercel's GitHub integration auto-builds and deploys the React landing page on every push, with the `landing page/` subfolder set as the root directory.