---
title: Multilingual ASR
emoji: 🎙️
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: gradio_ui.py
pinned: false
---

# 🎙️ Multilingual Automatic Speech Recognition (ASR)

> **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/adiitya29/Multilingual-ASR) · **Landing Page:** [Vercel](https://YOUR_VERCEL_URL)

## 📌 Project Overview & Importance

An end-to-end Automatic Speech Recognition (ASR) pipeline that converts raw unstructured audio into transcribed text. ASR is notoriously difficult due to variability in audio data: background noise, hardware differences, varied accents, and speech speeds.

This project bridges Deep Learning model inference with full-stack software engineering, deploying a production-ready application that handles digital signal processing, neural network inference, and a user-facing REST API, all served from a single Python process.

## ⚙️ How It Works (The Pipeline)

1. **Audio Ingestion & DSP:** Raw audio (`.mp3`, `.wav`) is loaded via `librosa` and resampled to **16 kHz**, the exact rate the model was trained on.
2. **Feature Extraction:** The Wav2Vec2 Processor normalizes the waveform into padded PyTorch tensors.
3. **Acoustic Model Inference:** `Wav2Vec2ForCTC` (1.26 GB, Large architecture) runs the forward pass using self-supervised learned speech representations.
4. **CTC Decoding:** Connectionist Temporal Classification decodes raw logits into the most probable character sequence.
5. **Output & Storage:** The transcript is persisted in local JSON history, downloadable as `.txt`, with full history exportable as `.csv`.
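
The CTC decoding step can be illustrated without loading the model. The sketch below is a minimal greedy (best-path) CTC decoder in plain Python: it takes the top-scoring token for each frame, collapses consecutive repeats, and removes blanks. The toy `VOCAB`, the `one_hot` helper, and the choice of greedy decoding are assumptions made for illustration; the project's actual decoding may differ in detail.

```python
from itertools import groupby

# Toy vocabulary (an assumption for illustration, not the model's real one).
# "<pad>" plays the role of the CTC blank and "|" marks a word boundary.
VOCAB = ["<pad>", "|", "H", "E", "L", "O"]
BLANK_ID = 0


def greedy_ctc_decode(frame_scores):
    """Decode per-frame scores into text with greedy (best-path) CTC."""
    # 1. Pick the highest-scoring token id for every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Collapse runs of the same id into a single id.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Drop blanks, map ids to characters, and turn "|" into spaces.
    chars = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(chars).replace("|", " ").strip()


def one_hot(token_id):
    """Hypothetical helper: a frame whose score peaks at token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(len(VOCAB))]


# Frames spell: H H E <pad> L L <pad> L O |
frames = [one_hot(i) for i in [2, 2, 3, 0, 4, 4, 0, 4, 5, 1]]
print(greedy_ctc_decode(frames))
```

The blank between the two `L` runs is what lets the decoder keep a genuine double letter while still merging frames that repeat the same character.
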
## 🛠️ Technology Stack

| Layer | Technology |
|---|---|
| Acoustic Model | `facebook/wav2vec2-large-960h-lv60-self` |
| Deep Learning | PyTorch (CPU-forced inference) |
| DSP | Librosa |
| Backend API | FastAPI + Uvicorn |
| ML UI | Gradio (Tabbed Blocks) |
| Language Detection | LangDetect |
| Landing Page | React + Vite |
| ML Deployment | Hugging Face Spaces |
| Web Deployment | Vercel |
| CI/CD | GitHub Actions |

## 🗂️ Project Structure

```
├── app/
│   ├── asr_model.py            # Wav2Vec2 model loading & inference (lazy-loaded)
│   ├── audio_processing.py     # Librosa resampling to 16kHz
│   └── history.py              # JSON persistence, CSV/TXT export
├── landing page/               # React + Vite landing page (deployed to Vercel)
│   └── src/
│       ├── components/         # Nav, Hero, HowItWorks, TechStack, About, Footer
│       └── index.css
├── notebooks/
│   ├── 01_evaluation.ipynb     # WER evaluation template
│   └── 02_finetuning.ipynb     # Fine-tuning notebook (Colab-ready)
├── .github/workflows/
│   └── sync_to_hub.yml         # Auto-deploys to Hugging Face on every git push
├── gradio_ui.py                # Gradio Tabbed UI (Transcribe + History tabs)
├── main.py                     # FastAPI entry point, mounts Gradio at "/"
└── requirements.txt
```

## 🧠 Interview Talking Points (Key Technical Decisions)

### 1. Why Wav2Vec 2.0?

Traditional ASR models require massive amounts of perfectly transcribed audio. Wav2Vec 2.0 uses **Self-Supervised Learning**: it learns from raw, unlabeled audio by masking parts of the speech signal and predicting the missing content (similar to BERT for text). This makes it highly accurate even when fine-tuning data is scarce.

### 2. Handling Apple Silicon Hardware Constraints

During development on an M1 Mac, model inference hung indefinitely. I traced this to a PyTorch limitation: the `mps` backend lacks support for the CTC operations used by Wav2Vec.
**Solution:** A hardware fallback in `asr_model.py` forces CPU execution, prioritizing stability over theoretical GPU speed.

### 3. Lazy Loading Pattern

Loading a 1.26 GB model on server boot blocks FastAPI's main thread and causes timeouts.
**Solution:** The model is loaded only on the first transcription request. Server boot time stays under 1 second regardless of model size.

### 4. Unified Server Architecture

Rather than running two separate processes, the Gradio UI is mounted directly onto the FastAPI app (`app.mount("/", gr.routes.App.create_app(demo))`). One `uvicorn` process serves both the REST API and the interactive UI.

### 5. Dual CI/CD Pipelines

- **ML Backend:** `sync_to_hub.yml` (GitHub Actions) auto-deploys to Hugging Face Spaces on every push to `main`, using a scoped `HF_TOKEN` secret.
- **Frontend:** Vercel's GitHub integration auto-builds and deploys the React landing page on every push, with the `landing page/` subfolder set as the root directory.