project title changed and details added in readme.md
README.md
@@ -0,0 +1,72 @@
# Multilingual Automatic Speech Recognition (ASR)

A full-stack, end-to-end Automatic Speech Recognition web application that lets users upload audio files, detect the spoken language, convert speech to text, and download the transcriptions. It provides both a graphical web interface and a programmatic REST API endpoint.

## Features

- **Upload & Transcribe**: Upload `.wav` or `.mp3` audio files and get text transcriptions.
- **REST API Support**: An API endpoint for transcribing audio programmatically.
- **Language Detection**: Automatically detects the language of the transcribed text.
- **Transcription History**: Saves all previous transcriptions and shows them as a table in a dedicated history tab.
- **Export Options**: Download individual transcripts as `.txt` files or export the entire history as a `.csv` file.
- **Lazy Model Loading**: Speeds up startup by downloading and loading the model only when the first transcription is requested.

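
The lazy-loading behaviour described in the feature list can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `LazyResource`, `fake_load_model`, and `asr_model` are hypothetical names, and the stand-in loader returns a string where the real application would load the Wav2Vec2 weights.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyResource(Generic[T]):
    """Defers an expensive load until first use, then caches the result."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self) -> T:
        # Double-checked locking: concurrent first requests trigger one load.
        if not self._loaded:
            with self._lock:
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value  # type: ignore[return-value]


# Hypothetical stand-in for the real (slow) model loader.
load_calls = []


def fake_load_model() -> str:
    load_calls.append("loaded")
    return "model-object"


# Creating the wrapper is cheap; nothing is loaded until .get() is called.
asr_model = LazyResource(fake_load_model)
```

With this shape, server startup only constructs the wrapper, and the first transcription request pays the download and load cost once.
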

## Technology Stack

- **Frontend Interface**: [Gradio](https://gradio.app/) - Provides the interactive web UI.
- **REST API**: [FastAPI](https://fastapi.tiangolo.com/) - Wraps the Gradio UI and exposes a `/api/transcribe` endpoint.
- **Deep Learning Backend**: [PyTorch](https://pytorch.org/) - Handles tensor operations and model execution.
- **Machine Learning Library**: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) - Manages downloading and running the ASR models.
- **Audio Preprocessing**: [Librosa](https://librosa.org/) - Loads arbitrary audio and resamples it to the 16 kHz sample rate the model requires.
- **Language Detection**: `langdetect` - Analyzes the resulting transcript string to determine the spoken language.


## Model Used

**Primary Model**: `facebook/wav2vec2-large-960h-lv60-self`
- This project uses the Large variant of Meta's Wav2Vec2 model, which is more accurate than the base variants.
- The checkpoint was pretrained on Libri-Light and fine-tuned on 960 hours of LibriSpeech, both English-language corpora, so expect the best results on English audio.
- *Note:* The project defaults to CPU execution to avoid known PyTorch `mps` (Apple Silicon) issues with this model architecture.

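
For orientation, CPU inference with this checkpoint might look roughly like the sketch below. It uses the standard Transformers CTC classes together with `librosa.load` for 16 kHz resampling. The function name `transcribe` and the module-level constants are illustrative assumptions, not the contents of `app/asr_model.py`. The heavy imports sit inside the function so that importing the module stays cheap.

```python
MODEL_ID = "facebook/wav2vec2-large-960h-lv60-self"
TARGET_SAMPLE_RATE = 16_000  # sample rate the model expects


def transcribe(path: str, device: str = "cpu") -> str:
    """Transcribe one audio file (illustrative helper, not the project's API)."""
    # Imported lazily: these packages are slow to import and only needed here.
    import librosa
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
    model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).to(device).eval()

    # Load as mono and resample to 16 kHz in one step.
    waveform, _ = librosa.load(path, sr=TARGET_SAMPLE_RATE, mono=True)
    inputs = processor(
        waveform, sampling_rate=TARGET_SAMPLE_RATE, return_tensors="pt"
    )

    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits

    # Greedy CTC decoding: take the most likely token at every frame.
    predicted_ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(predicted_ids)[0]
```

A real service would load the processor and model once and reuse them across requests rather than reloading them on every call as this sketch does.
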

## Setup and Installation

1. **Clone the repository**:
   ```bash
   git clone <your-repository-url>
   cd "Multilingual Automatic Speech Recognition"
   ```

2. **Create a virtual environment**:
   ```bash
   python -m venv venv
   source venv/bin/activate  # on Windows: venv\Scripts\activate
   ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

4. **(Optional) Pre-download the model**:
   To avoid waiting on the first run, download the model weights ahead of time:
   ```bash
   hf download facebook/wav2vec2-large-960h-lv60-self
   ```


## Usage

Start the unified server, which runs both the API and the web UI:

```bash
python main.py
```

- **Web UI:** Open `http://127.0.0.1:7860` in your browser.
- **REST API:** Send a POST request with an audio file to `http://127.0.0.1:7860/api/transcribe`.

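
As a rough illustration of calling the endpoint from Python using only the standard library, the sketch below builds a `multipart/form-data` upload by hand. The form field name `file` and the JSON response are assumptions, since the README does not specify the request schema, and `build_multipart` and `transcribe_file` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid

API_URL = "http://127.0.0.1:7860/api/transcribe"


def build_multipart(field_name, filename, payload, content_type="audio/wav"):
    """Return (body, content_type_header) for a single-file form upload."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe_file(path, field_name="file"):
    """POST an audio file to the server; the field name is an assumption."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            field_name, os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumes the endpoint replies with a JSON document.
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `transcribe_file("sample.wav")` would send the upload. If the endpoint expects a different field name, pass it as `field_name`.
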

## Project Structure

- `main.py`: The FastAPI server entry point.
- `gradio_ui.py`: The Gradio frontend layout and user logic.
- `app/`: Core backend modules.
  - `asr_model.py`: Handles Hugging Face model loading and inference.
  - `audio_processing.py`: Handles audio ingest and 16 kHz resampling.
  - `language_detection.py`: Determines the language from text.
  - `history.py`: Manages reading from and writing to `data/history.json`.
- `notebooks/`: Jupyter notebooks for experimental fine-tuning and evaluation.

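
To make the history and export flow concrete, here is one way a JSON-backed history with CSV export could be written. The record fields (`timestamp`, `filename`, `language`, `text`) and the function names are assumptions for illustration. The actual schema of `data/history.json` is not documented here.

```python
import csv
import json
from datetime import datetime, timezone
from pathlib import Path

# Assumed record layout; the real history schema may differ.
FIELDS = ["timestamp", "filename", "language", "text"]


def load_history(path: Path) -> list:
    """Return all saved entries, or an empty list if no history exists yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def append_entry(path: Path, filename: str, language: str, text: str) -> dict:
    """Append one transcription record and rewrite the JSON file."""
    entries = load_history(path)
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "filename": filename,
        "language": language,
        "text": text,
    }
    entries.append(entry)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return entry


def export_csv(history_path: Path, csv_path: Path) -> int:
    """Write the whole history to a CSV file and return the row count."""
    entries = load_history(history_path)
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(entries)
    return len(entries)
```

Rewriting the whole file on every append is simple and adequate for a small single-user history. A larger deployment would likely want a database or an append-only format instead.
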