Spaces: adiitya29/Multilingual-ASR

adiitya29 committed 6 days ago

Commit

a91b83c

1 Parent(s): 9622244

project title changed and details added in readme.md

Files changed (1)

README.md +72 -0

	@@ -0,0 +1,72 @@

+# 🎙️ Multilingual Automatic Speech Recognition (ASR)
+An end-to-end Automatic Speech Recognition web application that lets users upload audio files, convert speech to text, detect the language of the transcript, and download the results. It provides both a graphical web interface and a REST API endpoint.
+## 🚀 Features
+- **Upload & Transcribe**: Easily upload `.wav` or `.mp3` audio files and get text transcriptions.
+- **REST API Support**: Exposes an `/api/transcribe` endpoint for transcribing audio programmatically.
+- **Language Detection**: Automatically detects the language of the transcribed text.
+- **Transcription History**: Saves previous transcriptions and lists them as a table in a dedicated history tab.
+- **Export Options**: Download individual transcripts as `.txt` files or export your entire history as a `.csv`.
+- **Lazy Model Loading**: Optimizes startup speed by downloading and loading the AI model only when the first transcription is requested.
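The lazy-loading behaviour described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `LazyModel` and `fake_loader` are hypothetical names, and the loader is a stand-in for whatever `app/asr_model.py` does to download and load the weights.

```python
import threading
from typing import Callable, Optional


class LazyModel:
    """Defer an expensive load until the first call to get()."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader
        self._model: Optional[object] = None
        self._lock = threading.Lock()

    def get(self) -> object:
        # Fast path: the model is already in memory.
        if self._model is None:
            with self._lock:
                # Re-check inside the lock so concurrent first requests
                # trigger only one load.
                if self._model is None:
                    self._model = self._loader()
        return self._model


def fake_loader() -> dict:
    # Hypothetical stand-in for the real model download and load step.
    fake_loader.calls += 1
    return {"name": "facebook/wav2vec2-large-960h-lv60-self"}


fake_loader.calls = 0

# Nothing is loaded at import time; the first get() pays the cost.
asr_model = LazyModel(fake_loader)
```

With this pattern the server starts immediately, and only the first transcription request waits for the model to load.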
+## 🛠️ Technology Stack
+- **Frontend Interface**: [Gradio](https://gradio.app/) - For the interactive web UI.
+- **REST API**: [FastAPI](https://fastapi.tiangolo.com/) - Wraps the Gradio UI and exposes a `/api/transcribe` endpoint.
+- **Deep Learning Backend**: [PyTorch](https://pytorch.org/) - Handles tensor operations and model execution.
+- **Machine Learning Library**: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) - Manages downloading and running the ASR models.
+- **Audio Preprocessing**: [Librosa](https://librosa.org/) - Loads audio files and resamples them to the 16 kHz sample rate the model expects.
+- **Language Detection**: `langdetect` - Infers the language from the text of the resulting transcript.
+## 🧠 Model Used
+**Primary Model**: `facebook/wav2vec2-large-960h-lv60-self`
+- This project uses the Large version of Meta's Wav2Vec2 model, which was pretrained and fine-tuned on English speech (Libri-Light and 960 hours of LibriSpeech) and is generally more accurate than the base versions.
+- *Note:* The project defaults to CPU execution to avoid known issues with PyTorch's `mps` backend (Apple Silicon) for this specific model architecture.
+## ⚙️ Setup and Installation
+1. **Clone the repository**:
+   ```bash
+   git clone <your-repository-url>
+   cd "Multilingual Automatic Speech Recognition"
+   ```
+2. **Create a virtual environment**:
+   ```bash
+   python -m venv venv
+   source venv/bin/activate
+   ```
+3. **Install dependencies**:
+   ```bash
+   pip install -r requirements.txt
+   ```
+4. **(Optional) Pre-download the model**:
+   To avoid waiting on your first run, download the model weights ahead of time:
+   ```bash
+   hf download facebook/wav2vec2-large-960h-lv60-self
+   ```
+## 🖥️ Usage
+Start the unified server (which runs both the API and the Web UI):
+```bash
+python main.py
+```
+- **Web UI:** Open your browser and navigate to `http://127.0.0.1:7860`.
+- **REST API:** Send a POST request with an audio file to `http://127.0.0.1:7860/api/transcribe`.
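As a rough sketch of the POST request mentioned above, the snippet below builds a multipart upload using only the Python standard library. The form field name `file` and the helper `build_transcribe_request` are assumptions for illustration; the README does not state the field name or the response format, so check `main.py` before relying on either.

```python
import uuid
import urllib.request


def build_transcribe_request(
    audio_bytes: bytes,
    filename: str,
    url: str = "http://127.0.0.1:7860/api/transcribe",
    field: str = "file",  # assumed form field name, not confirmed by the README
) -> urllib.request.Request:
    """Build a multipart/form-data POST request carrying one audio file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Example usage (requires the server to be running):
# with open("sample.wav", "rb") as f:
#     req = build_transcribe_request(f.read(), "sample.wav")
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```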
+## 📁 Project Structure
+- `main.py`: The FastAPI server entry point.
+- `gradio_ui.py`: The Gradio frontend layout and user logic.
+- `app/`: Core backend modules.
+  - `asr_model.py`: Handles Hugging Face model loading and inference.
+  - `audio_processing.py`: Handles audio ingest and 16 kHz resampling.
+  - `language_detection.py`: Determines language from text.
+  - `history.py`: Manages reading from and writing to `data/history.json`.
+- `notebooks/`: Jupyter notebooks for experimental fine-tuning and evaluation.
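To illustrate how a JSON-backed history with CSV export might work, here is a minimal standard-library sketch. The function names and the entry fields (`filename`, `language`, `text`) are hypothetical; the actual schema of `data/history.json` is not documented in this README.

```python
import csv
import json
from pathlib import Path


def load_history(path: Path) -> list:
    """Return saved entries, or an empty list if the file does not exist yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def append_entry(path: Path, entry: dict) -> None:
    """Append one entry and rewrite the whole JSON file."""
    entries = load_history(path)
    entries.append(entry)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        json.dumps(entries, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def export_csv(
    path: Path,
    csv_path: Path,
    fields=("filename", "language", "text"),  # assumed column names
) -> None:
    """Write the full history as CSV with a header row."""
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(fields), extrasaction="ignore")
        writer.writeheader()
        writer.writerows(load_history(path))
```

Rewriting the whole file on every append keeps the sketch simple and is fine for a small history; a larger deployment would likely want a database or an append-only format instead.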