adiitya29 committed on
Commit
4f241a2
·
1 Parent(s): 1343069

Docs: rewrite README for technical depth and portfolio presentation

Files changed (1)
  1. README.md +25 -62
README.md CHANGED
@@ -10,73 +10,36 @@ pinned: false
10
 
11
  # 🎙️ Multilingual Automatic Speech Recognition (ASR)
12
 
13
- A full-stack, end-to-end Automatic Speech Recognition web application that allows users to upload audio files, detect spoken language, convert speech to text, and download transcriptions. It includes both a beautiful graphical web interface and a programmatic REST API endpoint.
 
14
 
15
- ## 🚀 Features
16
 
17
- - **Upload & Transcribe**: Easily upload `.wav` or `.mp3` audio files and get text transcriptions.
18
- - **REST API Support**: Fully functional API endpoint to transcribe audio programmatically.
19
- - **Language Detection**: Automatically detects the language of the transcribed text.
20
- - **Transcription History**: Saves all previous transcriptions in a tabular UI with a dedicated history tab.
21
- - **Export Options**: Download individual transcripts as `.txt` files or export your entire history as a `.csv`.
22
- - **Lazy Model Loading**: Optimizes startup speed by downloading and loading the AI model only when the first transcription is requested.
23
 
24
- ## 🛠️ Technology Stack
 
 
 
 
 
 
25
 
26
- - **Frontend Interface**: [Gradio](https://gradio.app/) - For the interactive web UI.
27
- - **REST API**: [FastAPI](https://fastapi.tiangolo.com/) - Wraps the Gradio UI and exposes a `/api/transcribe` endpoint.
28
- - **Deep Learning Backend**: [PyTorch](https://pytorch.org/) - Handles tensor operations and model execution.
29
- - **Machine Learning library**: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) - Manages downloading and running the ASR models.
30
- - **Audio Preprocessing**: [Librosa](https://librosa.org/) - Handles loading and resampling arbitrary audio to the strict `16kHz` required by the model.
31
- - **Language Detection**: `langdetect` - Analyzes the resulting transcript string to determine the spoken language.
32
 
33
- ## 🧠 Model Used
 
34
 
35
- **Primary Model**: `facebook/wav2vec2-large-960h-lv60-self`
36
- - This project uses the Large version of Meta's Wav2Vec2 model, which provides a significant accuracy boost over the base versions.
37
- - *Note:* The project defaults to CPU execution to bypass known PyTorch `mps` (Apple Silicon) bugs associated with this specific model architecture.
38
 
39
- ## ⚙️ Setup and Installation
 
40
 
41
- 1. **Clone the repository**:
42
- ```bash
43
- git clone <your-repository-url>
44
- cd "Multilingual Automatic Speech Recognition"
45
- ```
46
-
47
- 2. **Create a virtual environment**:
48
- ```bash
49
- python -m venv venv
50
- source venv/bin/activate
51
- ```
52
-
53
- 3. **Install dependencies**:
54
- ```bash
55
- pip install -r requirements.txt
56
- ```
57
-
58
- 4. **(Optional) Pre-download the model**:
59
- To avoid waiting on your first run, download the model weights ahead of time:
60
- ```bash
61
- hf download facebook/wav2vec2-large-960h-lv60-self
62
- ```
63
-
64
- ## 🖥️ Usage
65
-
66
- Start the unified server (which runs both the API and the Web UI):
67
- ```bash
68
- python main.py
69
- ```
70
- - **Web UI:** Open your browser and navigate to `http://127.0.0.1:7860`.
71
- - **REST API:** Send a POST request with an audio file to `http://127.0.0.1:7860/api/transcribe`.
72
-
73
- ## 📁 Project Structure
74
-
75
- - `main.py`: The FastAPI server entry point.
76
- - `gradio_ui.py`: The Gradio frontend layout and user logic.
77
- - `app/`: Core backend modules.
78
- - `asr_model.py`: Handles Hugging Face model loading and inference.
79
- - `audio_processing.py`: Handles audio ingest and 16kHz resampling.
80
- - `language_detection.py`: Determines language from text.
81
- - `history.py`: Manages reading and writing to `data/history.json`.
82
- - `notebooks/`: Jupyter notebooks for experimental fine-tuning and evaluation.
 
10
 
11
  # 🎙️ Multilingual Automatic Speech Recognition (ASR)
12
 
13
+ ## 📌 Project Overview & Importance
14
+ This project is an end-to-end Automatic Speech Recognition (ASR) pipeline that converts raw unstructured audio data into transcribed text. ASR is a notoriously difficult problem in Machine Learning due to immense variability in audio data (background noise, hardware microphone differences, varied accents, and speech speeds).
15
 
16
+ By building this project, I demonstrated the ability to bridge the gap between heavy Deep Learning models and full-stack software engineering, deploying a production-ready application that handles complex digital signal processing, neural network inference, and a user-facing REST API.
17
 
18
+ ## ⚙️ How It Works (The Pipeline)
19
+ 1. **Audio Ingestion & DSP:** The user uploads raw audio (`.mp3`, `.wav`). The backend uses `librosa` to load the audio as an array and resample it to **16 kHz**, the sampling rate the acoustic model was trained on and therefore expects at inference time.
20
+ 2. **Feature Extraction:** The raw waveform is passed to the Wav2Vec2 Processor, which normalizes the samples and returns them as PyTorch tensors.
21
+ 3. **Acoustic Model Inference:** The tensors are passed through the Wav2Vec2 neural network, which outputs, for every audio frame, a score (logit) for each token in its character-level vocabulary.
22
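+ 4. **CTC Decoding:** Connectionist Temporal Classification (CTC) decoding turns the per-frame logits into readable text by taking the most likely token for each frame, collapsing consecutive repeats, and dropping blank tokens.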
23
+ 5. **Language Detection & Storage:** NLP libraries (`langdetect`) analyze the resulting string, and the output is persisted in local JSON storage for the UI's History Dataframe.
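The CTC decoding step in the pipeline above can be illustrated with a small, dependency-free sketch. This is not the project's code: the toy vocabulary, the `greedy_ctc_decode` function, and the `one_hot` helper are hypothetical names introduced here, and the real application presumably relies on the decoding built into the Hugging Face processor rather than a hand-written loop.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<blank>", "h", "e", "l", "o", " "]
BLANK_ID = 0


def greedy_ctc_decode(logits, vocab=VOCAB, blank_id=BLANK_ID):
    """Greedy CTC decoding of per-frame scores (a list of frames x vocab)."""
    # 1. Pick the highest-scoring token id for every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Collapse runs of the same token id into a single occurrence.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Drop blanks; a blank between two equal tokens keeps a real double letter.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


def one_hot(index, size=len(VOCAB)):
    """Build a fake logit frame whose maximum sits at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frames spell h h _ e l l _ l o o, where "_" is the blank token.
frames = [one_hot(i) for i in [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]]
decoded = greedy_ctc_decode(frames)
print(decoded)
```

The blank frame between the two `l` runs is what lets the decoder keep the double letter instead of merging it into one.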
24
 
25
+ ## 🛠️ Technology Stack & Architecture
26
+ - **Machine Learning Model:** `facebook/wav2vec2-large-960h-lv60-self` (Pre-trained Acoustic Model)
27
+ - **Deep Learning Framework:** `PyTorch` (Tensor operations and inference)
28
+ - **Digital Signal Processing:** `Librosa` (Audio resampling and array manipulation)
29
+ - **Backend API:** `FastAPI` (Asynchronous Python web framework)
30
+ - **Frontend UI:** `Gradio` (Python-based UI library for rapid ML prototyping)
31
+ - **CI/CD & Deployment:** `GitHub Actions` syncing to `Hugging Face Spaces`
32
 
33
+ ## 🧠 Interview Talking Points (Key Technical Decisions)
 
 
 
 
 
34
 
35
+ ### 1. Why Wav2Vec 2.0?
36
+ Traditional ASR models require massive amounts of accurately transcribed audio to learn. Wav2Vec 2.0 uses **Self-Supervised Learning**: it learns representations from raw, unlabeled audio by masking parts of the speech and training the model to predict the missing parts (similar to how BERT works for text). Because most of the learning happens without labels, the model can reach strong accuracy even when labeled fine-tuning data is scarce.
37
 
38
+ ### 2. Handling Hardware Constraints (Apple Silicon)
39
+ During development on an M1 Mac, I encountered an issue where the model inference would hang indefinitely. I debugged this to a known PyTorch limitation: the `mps` (Metal Performance Shaders) backend on Apple Silicon currently lacks support for certain advanced CTC operations used by Wav2Vec. **My Solution:** I implemented a hardware-fallback in the code to explicitly force the model to run on the `CPU`, prioritizing stability over theoretical GPU speed.
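A minimal sketch of what such a hardware fallback could look like is given below. The `pick_device` function and its `allow_mps` flag are hypothetical names chosen for illustration; the project itself is only described as forcing `CPU`, so the CUDA branch here is an added assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool, allow_mps: bool = False) -> str:
    """Return the device string to run inference on.

    `allow_mps` defaults to False so that Apple Silicon machines fall back
    to the CPU, mirroring the stability-over-speed choice described above.
    """
    if cuda_available:
        return "cuda"  # assumption: NVIDIA GPUs are not affected by the issue
    if mps_available and allow_mps:
        return "mps"   # opt-in only, since inference was observed to hang
    return "cpu"       # safe default


# On an M1 Mac (no CUDA, MPS present) the default choice is the CPU.
m1_choice = pick_device(cuda_available=False, mps_available=True)
print(m1_choice)
```

In a real application the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the returned string would be passed to `model.to(...)`.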
 
40
 
41
+ ### 3. Server Architecture & Lazy Loading
42
+ Loading a 1.2GB neural network into RAM takes time. If I loaded the model globally on server boot, it would block the main thread and cause FastAPI to time out or appear unresponsive to early web requests. **My Solution:** I implemented the **Lazy Loading** design pattern. The model remains unloaded until the first user explicitly requests a transcription. This keeps the web server boot time to under 1 second.
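The lazy-loading pattern described above can be sketched as follows. The `LazyModel` class and the fake loader are hypothetical and stand in for whatever the project's `asr_model.py` actually does; the lock is an added assumption so that two simultaneous first requests do not both trigger a load.

```python
import threading


class LazyModel:
    """Defer an expensive load until the first call to get()."""

    def __init__(self, loader):
        self._loader = loader          # zero-argument callable that builds the model
        self._model = None             # nothing is loaded at construction time
        self._lock = threading.Lock()  # guards against concurrent first requests

    def get(self):
        if self._model is None:            # fast path once the model is loaded
            with self._lock:
                if self._model is None:    # re-check after acquiring the lock
                    self._model = self._loader()
        return self._model


load_calls = []


def fake_loader():
    """Stand-in for an expensive model load; records each invocation."""
    load_calls.append(1)
    return {"name": "fake-asr-model"}


lazy = LazyModel(fake_loader)
calls_before_first_request = len(load_calls)  # server "boot": nothing loaded yet
first = lazy.get()                            # first request triggers the load
second = lazy.get()                           # later requests reuse the instance
```

The trade-off is that the first transcription request pays the full loading cost, which is why the earlier README revision also offered pre-downloading the weights.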
43
 
44
+ ### 4. Continuous Integration / Deployment (CI/CD)
45
+ Instead of manually deploying code, I configured a fully automated CI/CD pipeline. I wrote a YAML workflow in GitHub Actions (`sync_to_hub.yml`) that runs on every push to the `main` branch. It uses a GitHub Repository Secret (`HF_TOKEN`) to authenticate and push the latest code to Hugging Face Spaces, so the production environment always mirrors the `main` branch with zero manual intervention.