Spaces: adiitya29/Multilingual-ASR


adiitya29 committed 4 days ago

Commit

4f241a2

1 Parent(s): 1343069

Docs: rewrite README for technical depth and portfolio presentation

Files changed (1)

README.md +25 -62

@@ -10,73 +10,36 @@ pinned: false
 # 🎙️ Multilingual Automatic Speech Recognition (ASR)
-A full-stack, end-to-end Automatic Speech Recognition web application that allows users to upload audio files, detect spoken language, convert speech to text, and download transcriptions. It includes both a beautiful graphical web interface and a programmatic REST API endpoint.
-## 🚀 Features
-- **Upload & Transcribe**: Easily upload `.wav` or `.mp3` audio files and get text transcriptions.
-- **REST API Support**: Fully functional API endpoint to transcribe audio programmatically.
-- **Language Detection**: Automatically detects the language of the transcribed text.
-- **Transcription History**: Saves all previous transcriptions in a tabular UI with a dedicated history tab.
-- **Export Options**: Download individual transcripts as `.txt` files or export your entire history as a `.csv`.
-- **Lazy Model Loading**: Optimizes startup speed by downloading and loading the AI model only when the first transcription is requested.
-## 🛠️ Technology Stack
-- **Frontend Interface**: [Gradio](https://gradio.app/) - For the interactive web UI.
-- **REST API**: [FastAPI](https://fastapi.tiangolo.com/) - Wraps the Gradio UI and exposes a `/api/transcribe` endpoint.
-- **Deep Learning Backend**: [PyTorch](https://pytorch.org/) - Handles tensor operations and model execution.
-- **Machine Learning library**: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) - Manages downloading and running the ASR models.
-- **Audio Preprocessing**: [Librosa](https://librosa.org/) - Handles loading and resampling arbitrary audio to the strict `16kHz` required by the model.
-- **Language Detection**: `langdetect` - Analyzes the resulting transcript string to determine the spoken language.
-## 🧠 Model Used
-**Primary Model**: `facebook/wav2vec2-large-960h-lv60-self`
-- This project uses the Large version of Meta's Wav2Vec2 model, which provides a significant accuracy boost over the base versions.
-- *Note:* The project defaults to CPU execution to bypass known PyTorch `mps` (Apple Silicon) bugs associated with this specific model architecture.
-## ⚙️ Setup and Installation
-1. **Clone the repository**:
-   ```bash
-   git clone <your-repository-url>
-   cd "Multilingual Automatic Speech Recognition"
-   ```
-2. **Create a virtual environment**:
-   ```bash
-   python -m venv venv
-   source venv/bin/activate
-   ```
-3. **Install dependencies**:
-   ```bash
-   pip install -r requirements.txt
-   ```
-4. **(Optional) Pre-download the model**:
-   To avoid waiting on your first run, download the model weights ahead of time:
-   ```bash
-   hf download facebook/wav2vec2-large-960h-lv60-self
-   ```
-## 🖥️ Usage
-Start the unified server (which runs both the API and the Web UI):
-```bash
-python main.py
-```
-- **Web UI:** Open your browser and navigate to `http://127.0.0.1:7860`.
-- **REST API:** Send a POST request with an audio file to `http://127.0.0.1:7860/api/transcribe`.
-## 📁 Project Structure
-- `main.py`: The FastAPI server entry point.
-- `gradio_ui.py`: The Gradio frontend layout and user logic.
-- `app/`: Core backend modules.
-  - `asr_model.py`: Handles Hugging Face model loading and inference.
-  - `audio_processing.py`: Handles audio ingest and 16kHz resampling.
-  - `language_detection.py`: Determines language from text.
-  - `history.py`: Manages reading and writing to `data/history.json`.
-- `notebooks/`: Jupyter notebooks for experimental fine-tuning and evaluation.

 # 🎙️ Multilingual Automatic Speech Recognition (ASR)
+## 📌 Project Overview & Importance
+This project is an end-to-end Automatic Speech Recognition (ASR) pipeline that converts raw audio into transcribed text. ASR is a hard machine-learning problem because audio varies so widely: background noise, microphone hardware, accents, and speaking rate all change the signal.
+By building this project, I demonstrated the ability to bridge heavy Deep Learning models and full-stack software engineering, deploying an application that handles digital signal processing, neural network inference, and a user-facing REST API.
+## ⚙️ How It Works (The Pipeline)
+1. **Audio Ingestion & DSP:** The user uploads raw audio (`.mp3`, `.wav`). The backend uses `librosa` to load the audio as a sample array and resamples it to **16 kHz**, the sampling rate the acoustic model was trained on and therefore expects at inference time.
+2. **Feature Extraction:** The raw waveform is passed to the Wav2Vec2 processor, which normalizes it and returns PyTorch tensors.
+3. **Acoustic Model Inference:** The tensors are passed through the Wav2Vec2 neural network, which outputs a score (logit) for every token in its vocabulary at each audio frame.
+4. **CTC Decoding:** Connectionist Temporal Classification (CTC) decoding turns the raw logits into readable text. In its simplest (greedy) form, it takes the most likely token per frame, collapses consecutive repeats, and drops the blank token.
+5. **Language Detection & Storage:** `langdetect` analyzes the resulting transcript to identify its language, and the output is persisted in local JSON storage that backs the UI's History Dataframe.
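The CTC decoding step (step 4) can be illustrated with a minimal greedy decoder in plain Python. This is a sketch under stated assumptions, not the project's actual code: the five-token vocabulary, the blank at index 0, and the `|` word delimiter are hypothetical stand-ins for the checkpoint's real vocabulary, and the logits are hand-made rather than produced by the model.

```python
from typing import List, Sequence

# Hypothetical toy vocabulary (assumption for illustration only).
# Index 0 plays the role of the CTC blank token; "|" marks word boundaries.
VOCAB = ["<blank>", "|", "A", "C", "T"]
BLANK_ID = 0
WORD_DELIMITER = "|"


def greedy_ctc_decode(
    logits: Sequence[Sequence[float]],
    vocab: Sequence[str] = VOCAB,
    blank_id: int = BLANK_ID,
) -> str:
    """Decode per-frame scores of shape (frames, vocab_size) into text."""
    # 1. Pick the highest-scoring token id for every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]

    # 2. Collapse consecutive repeats and 3. drop blank tokens.
    #    A blank between two identical tokens keeps them as two characters.
    chars: List[str] = []
    previous = None
    for token_id in best_ids:
        if token_id != previous and token_id != blank_id:
            chars.append(vocab[token_id])
        previous = token_id

    # 4. Turn word delimiters into spaces.
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


def one_hot(index: int, size: int = len(VOCAB)) -> List[float]:
    """Build a fake logit frame whose maximum sits at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frames: C C <blank> A T T | A
frames = [one_hot(i) for i in [3, 3, 0, 2, 4, 4, 1, 2]]
print(greedy_ctc_decode(frames))  # CAT A
```

Greedy decoding is the simplest option; beam search with a language model is a common alternative when accuracy matters more than latency.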
+## 🛠️ Technology Stack & Architecture
+- **Machine Learning Model:** `facebook/wav2vec2-large-960h-lv60-self` (pre-trained acoustic model, fine-tuned on 960 hours of LibriSpeech)
+- **Deep Learning Framework:** `PyTorch` (Tensor operations and inference)
+- **Digital Signal Processing:** `Librosa` (Audio resampling and array manipulation)
+- **Backend API:** `FastAPI` (Asynchronous Python web framework)
+- **Frontend UI:** `Gradio` (Python-based UI library for rapid ML prototyping)
+- **CI/CD & Deployment:** `GitHub Actions` syncing to `Hugging Face Spaces`
+## 🧠 Interview Talking Points (Key Technical Decisions)
+### 1. Why Wav2Vec 2.0?
+Traditional supervised ASR models need large amounts of accurately transcribed audio. Wav2Vec 2.0 uses **Self-Supervised Learning**: it learns speech representations from raw, unlabeled audio by masking spans of the encoded speech and training the model to identify the correct content for each masked span (similar in spirit to BERT's masked-language modeling for text). The pre-trained model can then be fine-tuned with comparatively little labeled data and still reach high accuracy.
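To make the masking idea concrete, here is a minimal sketch of span masking over a sequence of frames. It is an illustration, not the training code: the function name and the `seed` parameter are hypothetical, and the defaults (a start probability of 0.065 and spans of 10 frames) follow the values reported for wav2vec 2.0 pre-training. The contrastive objective that is then computed on the masked positions is not shown.

```python
import random
from typing import List


def sample_span_mask(
    num_frames: int,
    start_prob: float = 0.065,  # share of frames chosen as span starts
    span_length: int = 10,      # frames masked from each start
    seed: int = 0,              # hypothetical parameter, for reproducibility
) -> List[bool]:
    """Return a boolean mask where True marks a frame hidden from the model."""
    rng = random.Random(seed)

    # Sample span start indices without replacement.
    num_starts = int(start_prob * num_frames)
    starts = rng.sample(range(num_frames), num_starts)

    # Mask `span_length` consecutive frames from every start.
    # Spans may overlap and are clipped at the end of the sequence.
    mask = [False] * num_frames
    for start in starts:
        for t in range(start, min(start + span_length, num_frames)):
            mask[t] = True
    return mask


mask = sample_span_mask(num_frames=200)
print(f"masked {sum(mask)} of {len(mask)} frames")
```

Because spans overlap, the share of masked frames is lower than `start_prob * span_length` would suggest.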
+### 2. Handling Hardware Constraints (Apple Silicon)
+During development on an M1 Mac, model inference would hang indefinitely. I traced this to a known PyTorch limitation: the `mps` (Metal Performance Shaders) backend on Apple Silicon currently lacks support for certain operations used by the Wav2Vec2 CTC model. **My Solution:** I implemented a hardware fallback that explicitly runs the model on the `CPU`, prioritizing stability over potential GPU speed.
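A fallback of this kind can be sketched as a small pure function. The name `select_device` and the `allow_mps` flag are hypothetical, and preferring CUDA when it is available is an assumption of this sketch; the text above only states that the model is forced onto the CPU.

```python
def select_device(
    cuda_available: bool,
    mps_available: bool,
    allow_mps: bool = False,
) -> str:
    """Pick a device name, skipping `mps` unless explicitly allowed."""
    if cuda_available:
        # Assumption: an NVIDIA GPU is unaffected by the mps issue.
        return "cuda"
    if mps_available and allow_mps:
        # Opt-in only, so Apple Silicon defaults to the stable CPU path.
        return "mps"
    return "cpu"


# On an Apple Silicon machine without CUDA, the default is the CPU.
print(select_device(cuda_available=False, mps_available=True))  # cpu
```

With PyTorch installed, the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the result would be passed to `model.to(device)`.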
+### 3. Server Architecture & Lazy Loading
+Loading a 1.2 GB neural network into RAM takes time. If I loaded the model globally on server boot, it would block startup and cause FastAPI to time out or appear unresponsive to early web requests. **My Solution:** I implemented the **Lazy Loading** design pattern: the model stays unloaded until the first transcription request, which keeps web server boot time under 1 second. The trade-off is that the first request pays the one-time loading cost.
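The pattern can be sketched with a thread-safe holder that defers an expensive load until first use. `LazyModel` and `fake_loader` are hypothetical names for illustration; in the real application the loader would be whatever function loads the Wav2Vec2 weights.

```python
import threading
from typing import Any, Callable, List, Optional


class LazyModel:
    """Defers an expensive load until the first call to get()."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        # Fast path: once loaded, no lock is needed.
        if self._model is None:
            with self._lock:
                # Re-check inside the lock so that concurrent first
                # requests trigger only one load.
                if self._model is None:
                    self._model = self._loader()
        return self._model


# Hypothetical stand-in for the real model loader.
load_calls: List[int] = []


def fake_loader() -> dict:
    load_calls.append(1)
    return {"name": "stand-in model"}


# Creating the holder at import time is cheap; nothing is loaded yet.
asr_model = LazyModel(fake_loader)
```

A request handler would call `asr_model.get()` before running inference, so only the first request waits for the load and every later request reuses the same object.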
+### 4. Continuous Integration / Deployment (CI/CD)
+Instead of deploying manually, I configured an automated CI/CD pipeline. I wrote a YAML workflow in GitHub Actions (`sync_to_hub.yml`) that runs on every push to the `main` branch. It authenticates with a GitHub Repository Secret (`HF_TOKEN`) and pushes the latest code to Hugging Face Spaces, which then rebuilds the app, so the production environment always mirrors `main` with zero manual intervention.