Docs: rewrite README for technical depth and portfolio presentation
README.md CHANGED
@@ -10,73 +10,36 @@ pinned: false

 # Multilingual Automatic Speech Recognition (ASR)

-

-
-
-
-
-
-
-
-

-## 🛠️ Technology Stack

-
-- **REST API**: [FastAPI](https://fastapi.tiangolo.com/) - Wraps the Gradio UI and exposes a `/api/transcribe` endpoint.
-- **Deep Learning Backend**: [PyTorch](https://pytorch.org/) - Handles tensor operations and model execution.
-- **Machine Learning library**: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) - Manages downloading and running the ASR models.
-- **Audio Preprocessing**: [Librosa](https://librosa.org/) - Handles loading and resampling arbitrary audio to the strict `16kHz` required by the model.
-- **Language Detection**: `langdetect` - Analyzes the resulting transcript string to determine the spoken language.

-##

-
-
-- *Note:* The project defaults to CPU execution to bypass known PyTorch `mps` (Apple Silicon) bugs associated with this specific model architecture.

-##

-
-
-git clone <your-repository-url>
-cd "Multilingual Automatic Speech Recognition"
-```
-
-2. **Create a virtual environment**:
-```bash
-python -m venv venv
-source venv/bin/activate
-```
-
-3. **Install dependencies**:
-```bash
-pip install -r requirements.txt
-```
-
-4. **(Optional) Pre-download the model**:
-To avoid waiting on your first run, download the model weights ahead of time:
-```bash
-hf download facebook/wav2vec2-large-960h-lv60-self
-```
-
-## 🖥️ Usage
-
-Start the unified server (which runs both the API and the Web UI):
-```bash
-python main.py
-```
-- **Web UI:** Open your browser and navigate to `http://127.0.0.1:7860`.
-- **REST API:** Send a POST request with an audio file to `http://127.0.0.1:7860/api/transcribe`.
-
-## Project Structure
-
-- `main.py`: The FastAPI server entry point.
-- `gradio_ui.py`: The Gradio frontend layout and user logic.
-- `app/`: Core backend modules.
-  - `asr_model.py`: Handles Hugging Face model loading and inference.
-  - `audio_processing.py`: Handles audio ingest and 16kHz resampling.
-  - `language_detection.py`: Determines language from text.
-  - `history.py`: Manages reading and writing to `data/history.json`.
-- `notebooks/`: Jupyter notebooks for experimental fine-tuning and evaluation.
+## Project Overview & Importance
+This project is an end-to-end Automatic Speech Recognition (ASR) pipeline that converts raw audio into transcribed text. ASR is a hard machine learning problem because audio varies so widely: background noise, microphone hardware, accents, and speaking rate all change the signal.

+By building this project, I demonstrated the ability to bridge heavy deep learning models and full-stack software engineering, deploying a production-ready application that handles digital signal processing, neural network inference, and a user-facing REST API.

+## ⚙️ How It Works (The Pipeline)
+1. **Audio Ingestion & DSP:** The user uploads raw audio (`.mp3`, `.wav`). The backend uses `librosa` to load the audio into an array and resample it to **16 kHz**, the sample rate the acoustic model expects.
+2. **Feature Extraction:** The raw waveform is passed to the Wav2Vec2 processor, which normalizes it and returns PyTorch tensors.
+3. **Acoustic Model Inference:** The tensors are passed through the Wav2Vec2 neural network, which outputs per-frame logits over its character vocabulary.
+4. **CTC Decoding:** Connectionist Temporal Classification (CTC) decoding collapses repeated predictions and drops blank tokens to turn the raw logits into readable text.
+5. **Language Detection & Storage:** `langdetect` analyzes the resulting string, and the output is persisted in local JSON storage for the UI's History Dataframe.
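
The CTC decoding step in the pipeline above can be illustrated with a small, dependency-free sketch of greedy (best-path) decoding. This is not the project's code, which most likely relies on the Hugging Face processor for decoding. The toy vocabulary, `greedy_ctc_decode`, and `one_hot` are hypothetical names introduced here for illustration; the conventions of a blank token at index 0 and `|` as the word delimiter are assumed to match the Wav2Vec2 character vocabulary.

```python
from typing import List, Sequence

# Hypothetical toy vocabulary. Index 0 plays the role of the CTC blank token
# and "|" marks word boundaries (assumed to mirror the Wav2Vec2 convention).
VOCAB = ["<pad>", "|", "H", "E", "L", "O", "W", "R", "D"]
BLANK_ID = 0


def greedy_ctc_decode(
    logits: Sequence[Sequence[float]],
    vocab: Sequence[str] = VOCAB,
    blank_id: int = BLANK_ID,
) -> str:
    """Decode per-frame scores into text with greedy (best-path) CTC decoding."""
    # Step 1: pick the highest-scoring token id for every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]

    # Step 2: collapse consecutive repeats and drop blank tokens.
    chars: List[str] = []
    previous = None
    for token_id in best_path:
        if token_id != previous and token_id != blank_id:
            chars.append(vocab[token_id])
        previous = token_id

    # Step 3: turn the word delimiter into spaces.
    return "".join(chars).replace("|", " ").strip()


def one_hot(token_id: int, size: int = len(VOCAB)) -> List[float]:
    """Build a fake logit frame that strongly prefers one token."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


# Frames spell: H H <blank> E L <blank> L O | W O R L D
frames = [one_hot(i) for i in (2, 2, 0, 3, 4, 0, 4, 5, 1, 6, 5, 7, 4, 8)]
print(greedy_ctc_decode(frames))  # HELLO WORLD
```

The blank frame between the two `L` frames is what lets CTC emit a genuine double letter: without it, the repeated `L` predictions would collapse into one.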

+## 🛠️ Technology Stack & Architecture
+- **Machine Learning Model:** `facebook/wav2vec2-large-960h-lv60-self` (Pre-trained Acoustic Model)
+- **Deep Learning Framework:** `PyTorch` (Tensor operations and inference)
+- **Digital Signal Processing:** `Librosa` (Audio resampling and array manipulation)
+- **Backend API:** `FastAPI` (Asynchronous Python web framework)
+- **Frontend UI:** `Gradio` (Python-based UI library for rapid ML prototyping)
+- **CI/CD & Deployment:** `GitHub Actions` syncing to `Hugging Face Spaces`

+## 🧠 Interview Talking Points (Key Technical Decisions)

+### 1. Why Wav2Vec 2.0?
+Traditional ASR models need large amounts of accurately transcribed audio to learn. Wav2Vec 2.0 uses **self-supervised learning**: it learns representations from raw, unlabeled audio by masking spans of the speech signal and training the model to predict the masked content (similar to how BERT works for text). This lets it reach high accuracy even when labeled fine-tuning data is scarce.
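
To make the masking idea concrete, here is a rough sketch of span masking over a sequence of latent frames. It is an illustration only, not code from this project or from the Transformers library. The function name `sample_span_mask` is hypothetical, and the default values (a 6.5% chance that a frame starts a span, spans of 10 frames) are assumptions taken from the settings described in the Wav2Vec 2.0 paper.

```python
import random
from typing import List, Optional


def sample_span_mask(
    num_frames: int,
    start_prob: float = 0.065,  # assumed share of frames chosen as span starts
    span_length: int = 10,      # assumed number of frames masked per span
    seed: Optional[int] = None,
) -> List[bool]:
    """Return a boolean mask where True marks a frame hidden from the model."""
    rng = random.Random(seed)

    # Sample span start indices without replacement.
    num_starts = int(start_prob * num_frames)
    starts = rng.sample(range(num_frames), num_starts)

    # Mask each span; spans may overlap and are clipped at the sequence end.
    mask = [False] * num_frames
    for start in starts:
        for t in range(start, min(start + span_length, num_frames)):
            mask[t] = True
    return mask


mask = sample_span_mask(num_frames=200, seed=0)
print(f"{sum(mask)} of {len(mask)} frames masked")
```

During pre-training, the model only sees the unmasked frames as they are and is scored on how well it identifies the content at the masked positions, which is why no transcripts are needed at that stage.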

+### 2. Handling Hardware Constraints (Apple Silicon)
+During development on an M1 Mac, I hit an issue where model inference would hang indefinitely. I traced this to a known PyTorch limitation: the `mps` (Metal Performance Shaders) backend on Apple Silicon currently lacks support for certain CTC operations used by Wav2Vec 2.0. **My Solution:** I implemented a hardware fallback that explicitly forces the model to run on the CPU, prioritizing stability over theoretical GPU speed.
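
A hardware fallback of this kind could look roughly like the sketch below. It is not the project's actual code: `choose_device` and its `allow_mps` flag are hypothetical, and the choice to still use CUDA when it is present is an assumption, since the README only states that the model is forced onto the CPU. The availability flags are passed in as plain booleans so the logic can be run without PyTorch installed.

```python
def choose_device(cuda_available: bool, mps_available: bool, allow_mps: bool = False) -> str:
    """Return the device string to hand to `model.to(...)`.

    MPS is skipped unless explicitly allowed, so an Apple Silicon machine
    falls back to the CPU instead of hanging during inference.
    """
    if cuda_available:  # assumption: CUDA is still used when present
        return "cuda"
    if mps_available and allow_mps:
        return "mps"
    return "cpu"


# In a real application the flags would come from PyTorch, for example:
#   import torch
#   device = choose_device(torch.cuda.is_available(),
#                          torch.backends.mps.is_available())
print(choose_device(cuda_available=False, mps_available=True))  # cpu
```

Keeping the opt-in flag makes it easy to re-test the `mps` backend after a PyTorch upgrade without touching the rest of the code.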

+### 3. Server Architecture & Lazy Loading
+Loading a 1.2 GB neural network into RAM takes time. If the model were loaded globally at server boot, it would block the main thread and cause FastAPI to time out or appear unresponsive to early web requests. **My Solution:** I implemented the **lazy loading** design pattern: the model stays unloaded until the first user explicitly requests a transcription. This keeps the web server boot time under 1 second.
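
The lazy loading pattern described above can be sketched as a small thread-safe wrapper. This is a minimal illustration under stated assumptions rather than the project's implementation: `LazyResource` and `fake_load_model` are hypothetical names, and the stand-in loader returns a dummy function instead of loading Wav2Vec2.

```python
import threading
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defer an expensive load until the first call to `get()`."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        if not self._loaded:  # fast path once the resource is loaded
            with self._lock:
                if not self._loaded:  # re-check: another thread may have loaded it
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]


load_calls: List[int] = []


def fake_load_model() -> Callable[[List[float]], str]:
    """Hypothetical stand-in for the slow model load."""
    load_calls.append(1)
    return lambda audio: f"transcript for {len(audio)} samples"


# Creating the wrapper at import time is cheap; nothing is loaded yet.
asr_model = LazyResource(fake_load_model)

# The first request triggers the load; later requests reuse the same object.
transcribe = asr_model.get()
print(transcribe([0.0] * 16000))
```

The trade-off is that the first transcription request pays the full load time, so the first user sees a slower response in exchange for a fast server start.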

+### 4. Continuous Integration / Deployment (CI/CD)
+Instead of deploying manually, I configured a fully automated CI/CD pipeline. I wrote a GitHub Actions workflow (`sync_to_hub.yml`) that runs on every push to the `main` branch. It uses a GitHub repository secret (`HF_TOKEN`) to push the latest code to Hugging Face Spaces, so the production environment always mirrors the `main` branch with no manual intervention.