title: Multilingual ASR
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: gradio_ui.py
pinned: false

🎙️ Multilingual Automatic Speech Recognition (ASR)

Live Demo: Hugging Face Space · Landing Page: Vercel

📌 Project Overview & Importance

An end-to-end Automatic Speech Recognition (ASR) pipeline that converts raw, unstructured audio into transcribed text. ASR is notoriously difficult because audio data varies so much: background noise, hardware differences, varied accents, and speech speeds.

This project bridges deep learning model inference with full-stack software engineering, deploying a production-ready application that handles digital signal processing, neural network inference, and a user-facing REST API, all served from a single Python process.

⚙️ How It Works (The Pipeline)

  1. Audio Ingestion & DSP: Raw audio (.mp3, .wav) is loaded via librosa and resampled to 16 kHz, the sampling rate the model was trained on.
  2. Feature Extraction: The Wav2Vec2 Processor normalizes the waveform into padded PyTorch tensors.
  3. Acoustic Model Inference: Wav2Vec2ForCTC (1.26GB, Large architecture) runs the forward pass using self-supervised learned speech representations.
  4. CTC Decoding: Connectionist Temporal Classification decodes raw logits into the most probable character sequence.
  5. Output & Storage: The transcript is persisted in local JSON history, downloadable as .txt, with full history exportable as .csv.
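
The CTC decoding step (step 4) can be illustrated in pure Python. This is only a sketch of greedy CTC decoding: in the real pipeline the processor's own decoding performs this work. The toy vocabulary, the blank index, the `one_hot` helper, and the function name `greedy_ctc_decode` are all assumptions made for illustration. Per-frame predictions are reduced to their best token, consecutive repeats are merged, and blank tokens are dropped.

```python
from itertools import groupby

# Hypothetical toy vocabulary (an assumption, not the real model vocabulary).
# Index 0 plays the role of the CTC blank; "|" stands in for a word delimiter.
VOCAB = ["<pad>", "|", "H", "E", "L", "O"]
BLANK_ID = 0


def greedy_ctc_decode(logits, vocab=VOCAB, blank_id=BLANK_ID):
    """Turn per-frame scores (a list of lists) into a transcript string."""
    # 1. Pick the highest-scoring token id for every time frame.
    frame_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Merge runs of the same token id into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 3. Drop blanks; a blank between two identical ids keeps both letters.
    chars = [vocab[i] for i in collapsed if i != blank_id]
    return "".join(chars).replace("|", " ").strip()


def one_hot(index, size=len(VOCAB)):
    """Helper for the demo: a fake logit frame that favours one token."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frames: H H _ E L L _ L O O  (the blank separates the double "L").
frames = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5]
logits = [one_hot(i) for i in frames]
print(greedy_ctc_decode(logits))  # HELLO
```

The blank token is what lets CTC emit genuinely doubled letters: without the blank between the two runs of "L", they would collapse into one.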

🛠️ Technology Stack

| Layer | Technology |
| --- | --- |
| Acoustic Model | facebook/wav2vec2-large-960h-lv60-self |
| Deep Learning | PyTorch (CPU-forced inference) |
| DSP | Librosa |
| Backend API | FastAPI + Uvicorn |
| ML UI | Gradio (Tabbed Blocks) |
| Language Detection | LangDetect |
| Landing Page | React + Vite |
| ML Deployment | Hugging Face Spaces |
| Web Deployment | Vercel |
| CI/CD | GitHub Actions |

🗂️ Project Structure

├── app/
│   ├── asr_model.py        # Wav2Vec2 model loading & inference (lazy-loaded)
│   ├── audio_processing.py # Librosa resampling to 16kHz
│   └── history.py          # JSON persistence, CSV/TXT export
├── landing page/           # React + Vite landing page (deployed to Vercel)
│   └── src/
│       ├── components/     # Nav, Hero, HowItWorks, TechStack, About, Footer
│       └── index.css
├── notebooks/
│   ├── 01_evaluation.ipynb # WER evaluation template
│   └── 02_finetuning.ipynb # Fine-tuning notebook (Colab-ready)
├── .github/workflows/
│   └── sync_to_hub.yml     # Auto-deploys to Hugging Face on every git push
├── gradio_ui.py            # Gradio Tabbed UI (Transcribe + History tabs)
├── main.py                 # FastAPI entry point, mounts Gradio at "/"
└── requirements.txt
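
As a rough idea of what the persistence layer in app/history.py might look like, the sketch below appends transcripts to a JSON file and exports the full history as CSV using only the standard library. The function names (`append_entry`, `export_csv`), the record fields, and the file layout are assumptions for illustration, not the project's actual code.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed record layout; the real history.py may store different fields.
FIELDS = ["timestamp", "filename", "transcript"]


def append_entry(history_path: Path, filename: str, transcript: str) -> list:
    """Append one transcript to the JSON history file and return all entries."""
    if history_path.exists():
        entries = json.loads(history_path.read_text(encoding="utf-8"))
    else:
        entries = []
    entries.append(
        {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "filename": filename,
            "transcript": transcript,
        }
    )
    history_path.write_text(json.dumps(entries, indent=2), encoding="utf-8")
    return entries


def export_csv(history_path: Path, csv_path: Path) -> None:
    """Write every stored entry to a CSV file with a header row."""
    entries = json.loads(history_path.read_text(encoding="utf-8"))
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(entries)
```

A single JSON file keeps the history human-readable and dependency-free, at the cost of rewriting the whole file on each append, which is acceptable for a small demo history.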

🧠 Interview Talking Points (Key Technical Decisions)

1. Why Wav2Vec 2.0?

Traditional ASR models require massive amounts of accurately transcribed audio. Wav2Vec 2.0 uses self-supervised learning: it learns from raw, unlabeled audio by masking spans of the encoded speech signal and learning to identify the masked content (similar to BERT for text). This lets it reach high accuracy even when labeled fine-tuning data is scarce.

2. Handling Apple Silicon Hardware Constraints

During development on an M1 Mac, model inference hung indefinitely. I traced this to a PyTorch limitation: the mps backend lacks support for CTC operations used by Wav2Vec. Solution: a hardware fallback in asr_model.py forces CPU execution, prioritizing stability over theoretical GPU speed.

3. Lazy Loading Pattern

Loading a 1.26GB model on server boot blocks FastAPI's main thread and causes timeouts. Solution: The model is loaded only on the first transcription request. Server boot time stays under 1 second regardless of model size.
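
A minimal sketch of the lazy-loading idea, assuming a thread-safe wrapper around whatever function actually loads the model. The class name `LazyModel` and the `fake_loader` stand-in are hypothetical; the real asr_model.py may be structured differently.

```python
import threading
from typing import Any, Callable


class LazyModel:
    """Defer an expensive load until the first call to get()."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Any = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> Any:
        # Fast path: skip the lock once the model is in memory.
        if not self._loaded:
            with self._lock:
                # Re-check inside the lock so concurrent first requests
                # trigger only one load.
                if not self._loaded:
                    self._model = self._loader()
                    self._loaded = True
        return self._model


load_calls = []


def fake_loader() -> str:
    """Stand-in for the real (slow) model load; records each invocation."""
    load_calls.append("loaded")
    return "model-placeholder"


# Constructing the wrapper is cheap; nothing is loaded at import time.
asr_model = LazyModel(fake_loader)
```

In a request handler, `asr_model.get()` would be called at the start of each transcription: the first request pays the load cost, and later requests reuse the cached object.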

4. Unified Server Architecture

Rather than running two separate processes, the Gradio UI is mounted directly onto the FastAPI app (app.mount("/", gr.routes.App.create_app(demo))). One uvicorn process serves both the REST API and the interactive UI.

5. Dual CI/CD Pipelines

  • ML Backend: sync_to_hub.yml (GitHub Actions) auto-deploys to Hugging Face Spaces on every push to main, using a scoped HF_TOKEN secret.
  • Frontend: Vercel's GitHub integration auto-builds and deploys the React landing page on every push, with the landing page/ subfolder set as the root directory.