title: Multilingual ASR
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: gradio_ui.py
pinned: false
Multilingual Automatic Speech Recognition (ASR)
Live Demo: Hugging Face Space · Landing Page: Vercel
Project Overview & Importance
An end-to-end Automatic Speech Recognition (ASR) pipeline that converts raw unstructured audio into transcribed text. ASR is notoriously difficult due to variability in audio data: background noise, hardware differences, varied accents, and speech speeds.
This project bridges Deep Learning model inference with full-stack software engineering, deploying a production-ready application that handles digital signal processing, neural network inference, and a user-facing REST API, all served from a single Python process.
How It Works (The Pipeline)
- Audio Ingestion & DSP: Raw audio (.mp3, .wav) is loaded via librosa and resampled to 16kHz, the exact rate the model was trained on.
- Feature Extraction: The Wav2Vec2 Processor normalizes the waveform into padded PyTorch tensors.
- Acoustic Model Inference: Wav2Vec2ForCTC (1.26GB, Large architecture) runs the forward pass using self-supervised learned speech representations.
- CTC Decoding: Connectionist Temporal Classification decodes raw logits into the most probable character sequence.
- Output & Storage: The transcript is persisted in local JSON history, downloadable as .txt, with full history exportable as .csv.
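The CTC decoding step can be illustrated without the model itself. The following is a minimal sketch of greedy (best-path) CTC decoding in plain Python, not the project's actual code: the toy vocabulary, the `greedy_ctc_decode` function, and the `one_hot` helper are hypothetical names introduced for illustration. It assumes index 0 is the CTC blank and `|` marks a word boundary, which mirrors the convention of Wav2Vec2's character tokenizer.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
# "|" marks a word boundary.
VOCAB = ["<pad>", "|", "H", "E", "L", "O", "W", "R", "D"]
BLANK_ID = 0


def greedy_ctc_decode(frame_logits):
    """Decode a frames-by-vocab list of scores with best-path CTC decoding."""
    # 1. Pick the highest-scoring token id for every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_logits]
    # 2. Collapse consecutive repeats of the same id.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Drop blanks; a blank between two identical ids keeps both letters.
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    # 4. Turn word-boundary markers into spaces.
    return "".join(chars).replace("|", " ").strip()


def one_hot(token_id):
    """Demo helper: build a frame whose argmax is token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(len(VOCAB))]


# Frame-level ids: H H E <blank> L L <blank> L O |
frames = [one_hot(i) for i in [2, 2, 3, 0, 4, 4, 0, 4, 5, 1]]
print(greedy_ctc_decode(frames))  # prints "HELLO"
```

The blank between the two runs of `L` is what lets the decoder emit a double letter; without it, the repeated frames would collapse into a single `L`.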
Technology Stack
| Layer | Technology |
|---|---|
| Acoustic Model | facebook/wav2vec2-large-960h-lv60-self |
| Deep Learning | PyTorch (CPU-forced inference) |
| DSP | Librosa |
| Backend API | FastAPI + Uvicorn |
| ML UI | Gradio (Tabbed Blocks) |
| Language Detection | LangDetect |
| Landing Page | React + Vite |
| ML Deployment | Hugging Face Spaces |
| Web Deployment | Vercel |
| CI/CD | GitHub Actions |
Project Structure
├── app/
│   ├── asr_model.py            # Wav2Vec2 model loading & inference (lazy-loaded)
│   ├── audio_processing.py     # Librosa resampling to 16kHz
│   └── history.py              # JSON persistence, CSV/TXT export
├── landing page/               # React + Vite landing page (deployed to Vercel)
│   └── src/
│       ├── components/         # Nav, Hero, HowItWorks, TechStack, About, Footer
│       └── index.css
├── notebooks/
│   ├── 01_evaluation.ipynb     # WER evaluation template
│   └── 02_finetuning.ipynb     # Fine-tuning notebook (Colab-ready)
├── .github/workflows/
│   └── sync_to_hub.yml         # Auto-deploys to Hugging Face on every git push
├── gradio_ui.py                # Gradio Tabbed UI (Transcribe + History tabs)
├── main.py                     # FastAPI entry point, mounts Gradio at "/"
└── requirements.txt
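The tree describes history.py as handling JSON persistence plus CSV/TXT export. A minimal sketch of what such a module could look like, using only the standard library, is given here. The function names (`append_entry`, `export_txt`, `export_csv`) and the record fields (`timestamp`, `filename`, `transcript`) are assumptions made for illustration, not the project's actual schema.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed record fields; the real history.py may store different keys.
FIELDS = ["timestamp", "filename", "transcript"]


def append_entry(history_path: Path, filename: str, transcript: str) -> list:
    """Append one transcript record to a JSON history file and return all records."""
    if history_path.exists():
        records = json.loads(history_path.read_text(encoding="utf-8"))
    else:
        records = []
    records.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "filename": filename,
        "transcript": transcript,
    })
    history_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records


def export_txt(transcript: str, out_path: Path) -> Path:
    """Write a single transcript to a plain-text file."""
    out_path.write_text(transcript, encoding="utf-8")
    return out_path


def export_csv(history_path: Path, out_path: Path) -> Path:
    """Convert the full JSON history into a CSV file with a header row."""
    records = json.loads(history_path.read_text(encoding="utf-8"))
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(records)
    return out_path
```

Rewriting the whole JSON file on every append is simple and adequate for a small local history; a larger deployment would probably want an append-only format or a database instead.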
Interview Talking Points (Key Technical Decisions)
1. Why Wav2Vec 2.0?
Traditional ASR models require massive amounts of perfectly transcribed audio. Wav2Vec 2.0 uses Self-Supervised Learning: it learns from raw, unlabeled audio by masking parts of speech and predicting the missing content (similar to BERT for text). This makes it highly accurate even when fine-tuning data is scarce.
2. Handling Apple Silicon Hardware Constraints
During development on an M1 Mac, model inference hung indefinitely. I traced this to a PyTorch limitation: the mps backend lacks support for the CTC operations used by Wav2Vec. Solution: a hardware fallback in asr_model.py forces CPU execution, prioritizing stability over theoretical GPU speed.
3. Lazy Loading Pattern
Loading a 1.26GB model on server boot blocks FastAPI's main thread and causes timeouts. Solution: The model is loaded only on the first transcription request. Server boot time stays under 1 second regardless of model size.
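The lazy-loading idea can be sketched independently of the model. The example below is an illustration under assumptions rather than the code in asr_model.py: `LazyModel`, `fake_loader`, and `load_calls` are hypothetical names, and a cheap stand-in replaces the real model load so the pattern can be run anywhere. A lock guards the first load so that two concurrent first requests do not both trigger it.

```python
import threading
from typing import Any, Callable, Optional


class LazyModel:
    """Defer an expensive load until first use, and run it only once."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        # Fast path: skip the lock once the model is loaded.
        if self._model is None:
            with self._lock:
                # Re-check inside the lock in case another thread loaded it first.
                if self._model is None:
                    self._model = self._loader()
        return self._model


# Stand-in loader for the demo; it records how many times it was invoked.
load_calls = []


def fake_loader() -> dict:
    load_calls.append(1)
    return {"name": "stand-in model"}


# Creating the wrapper is cheap: nothing is loaded until get() is called.
asr_model = LazyModel(fake_loader)
```

In the real application the loader would presumably wrap something like `Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")`, and the transcription handler would call `asr_model.get()`. The trade-off is that the first request pays the full load time.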
4. Unified Server Architecture
Rather than running two separate processes, the Gradio UI is mounted directly onto the FastAPI app (app.mount("/", gr.routes.App.create_app(demo))). One uvicorn process serves both the REST API and the interactive UI.
5. Dual CI/CD Pipelines
- ML Backend: sync_to_hub.yml (GitHub Actions) auto-deploys to Hugging Face Spaces on every push to main, using a scoped HF_TOKEN secret.
- Frontend: Vercel's GitHub integration auto-builds and deploys the React landing page on every push, with the landing page/ subfolder set as the root directory.