adiitya29 committed on
Commit a91b83c · 1 Parent(s): 9622244

project title changed and details added in readme.md

Files changed (1)
  1. README.md +72 -0
README.md CHANGED
@@ -0,0 +1,72 @@
+ # 🎙️ Multilingual Automatic Speech Recognition (ASR)
+
+ A full-stack, end-to-end Automatic Speech Recognition web application that lets users upload audio files, detect the spoken language, convert speech to text, and download the transcriptions. It provides both a graphical web interface and a programmatic REST API endpoint.
+
+ ## 🚀 Features
+
+ - **Upload & Transcribe**: Upload `.wav` or `.mp3` audio files and get text transcriptions.
+ - **REST API Support**: An API endpoint for transcribing audio programmatically.
+ - **Language Detection**: Automatically detects the language of the transcribed text.
+ - **Transcription History**: Saves all previous transcriptions and shows them as a table in a dedicated history tab.
+ - **Export Options**: Download individual transcripts as `.txt` files or export your entire history as a `.csv` file.
+ - **Lazy Model Loading**: Keeps startup fast by downloading and loading the model only when the first transcription is requested.
+
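The lazy-loading feature listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `asr_model.py`: the `LazyModel` class and its `loader` argument are hypothetical names, and in the real application the loader would presumably wrap a Hugging Face call such as `Wav2Vec2ForCTC.from_pretrained(...)`.

```python
import threading
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class LazyModel(Generic[T]):
    """Defers an expensive load until the first call to get().

    Hypothetical helper for illustration; the loader is any zero-argument
    callable that returns the loaded model object.
    """

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._model: Optional[T] = None
        self._lock = threading.Lock()

    @property
    def loaded(self) -> bool:
        """True once the loader has run and its result is cached."""
        return self._model is not None

    def get(self) -> T:
        """Return the cached model, loading it on first use."""
        if self._model is None:
            # Double-checked locking: only one thread runs the loader,
            # even if several requests arrive at the same time.
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model
```

With this shape, the server can create `LazyModel(loader)` at import time at almost no cost, and the first transcription request pays the download and load time by calling `get()`.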
+ ## 🛠️ Technology Stack
+
+ - **Frontend Interface**: [Gradio](https://gradio.app/) - Provides the interactive web UI.
+ - **REST API**: [FastAPI](https://fastapi.tiangolo.com/) - Wraps the Gradio UI and exposes a `/api/transcribe` endpoint.
+ - **Deep Learning Backend**: [PyTorch](https://pytorch.org/) - Handles tensor operations and model execution.
+ - **Machine Learning Library**: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) - Manages downloading and running the ASR models.
+ - **Audio Preprocessing**: [Librosa](https://librosa.org/) - Loads audio in arbitrary formats and resamples it to the 16 kHz sampling rate the model requires.
+ - **Language Detection**: `langdetect` - Analyzes the resulting transcript text to determine the spoken language.
+
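To make the 16 kHz resampling step concrete, here is a small standard-library sketch of what resampling does to a signal. It uses plain linear interpolation, which is a simplification and not the algorithm Librosa applies; in the project itself, a call such as `librosa.load(path, sr=16000)` would normally handle both loading and resampling. The function name `resample_linear` is an assumption made for this example.

```python
from typing import List

TARGET_SR = 16_000  # sampling rate the README says the model requires


def resample_linear(
    samples: List[float], orig_sr: int, target_sr: int = TARGET_SR
) -> List[float]:
    """Resample a mono signal by linear interpolation (illustration only)."""
    if orig_sr <= 0 or target_sr <= 0:
        raise ValueError("sampling rates must be positive")
    if orig_sr == target_sr or not samples:
        return list(samples)

    # The duration stays the same, so the sample count scales with the rate.
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    step = orig_sr / target_sr  # input samples advanced per output sample
    last = len(samples) - 1

    out: List[float] = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Blend the two neighbouring input samples.
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

One second of 44.1 kHz audio (44,100 samples) comes out as 16,000 samples. A production resampler also applies a low-pass filter to avoid aliasing when downsampling, which this sketch omits.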
+ ## 🧠 Model Used
+
+ **Primary Model**: `facebook/wav2vec2-large-960h-lv60-self`
+ - This project uses the Large version of Meta's Wav2Vec2 model, which is more accurate than the base versions.
+ - *Note:* The project defaults to CPU execution to avoid known PyTorch `mps` (Apple Silicon) bugs with this model architecture.
+
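The CPU default described in the note could look something like the sketch below. The `pick_device` helper is hypothetical, and the opt-in CUDA path is an assumption: the README only states that CPU is the default and that `mps` is avoided. In a real application the `cuda_available` flag would typically come from `torch.cuda.is_available()`.

```python
def pick_device(requested: str = "cpu", cuda_available: bool = False) -> str:
    """Choose a PyTorch device string, never returning "mps".

    Hypothetical helper: CPU is the default, CUDA is used only when it is
    explicitly requested and available, and any other request (including
    "mps") falls back to CPU.
    """
    if requested.lower() == "cuda" and cuda_available:
        return "cuda"
    return "cpu"
```

Keeping the decision in one small function makes it easy to revisit if the `mps` issues are resolved upstream.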
+ ## ⚙️ Setup and Installation
+
+ 1. **Clone the repository**:
+ ```bash
+ git clone <your-repository-url>
+ cd "Multilingual Automatic Speech Recognition"
+ ```
+
+ 2. **Create a virtual environment**:
+ ```bash
+ python -m venv venv
+ source venv/bin/activate
+ ```
+
+ 3. **Install dependencies**:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 4. **(Optional) Pre-download the model**:
+ To avoid waiting on your first run, download the model weights ahead of time:
+ ```bash
+ hf download facebook/wav2vec2-large-960h-lv60-self
+ ```
+
+ ## 🖥️ Usage
+
+ Start the unified server, which runs both the API and the web UI:
+ ```bash
+ python main.py
+ ```
+ - **Web UI:** Open your browser and navigate to `http://127.0.0.1:7860`.
+ - **REST API:** Send a POST request with an audio file to `http://127.0.0.1:7860/api/transcribe`.
+
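A client for the REST endpoint above might be sketched like this, using only the Python standard library. The multipart field name `file`, the filename `sample.wav`, and the shape of the response are assumptions, since the README does not document them; check the endpoint definition in `main.py` before relying on any of them.

```python
import urllib.request
import uuid
from typing import Tuple

API_URL = "http://127.0.0.1:7860/api/transcribe"  # URL given in the README


def build_multipart(
    field: str, filename: str, payload: bytes, content_type: str = "audio/wav"
) -> Tuple[bytes, str]:
    """Encode one file as a multipart/form-data body.

    Returns the encoded body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def build_request(
    audio: bytes, filename: str = "sample.wav", field: str = "file"
) -> urllib.request.Request:
    """Prepare the POST request; the field name "file" is an assumption."""
    body, content_type = build_multipart(field, filename, audio)
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the raw response text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With the server running, `send(build_request(open("sample.wav", "rb").read()))` would return whatever the endpoint responds with, which is presumably JSON containing the transcript.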
+ ## 📁 Project Structure
+
+ - `main.py`: The FastAPI server entry point.
+ - `gradio_ui.py`: The Gradio frontend layout and user logic.
+ - `app/`: Core backend modules.
+   - `asr_model.py`: Handles Hugging Face model loading and inference.
+   - `audio_processing.py`: Handles audio ingest and 16 kHz resampling.
+   - `language_detection.py`: Determines language from text.
+   - `history.py`: Manages reading and writing to `data/history.json`.
+ - `notebooks/`: Jupyter notebooks for experimental fine-tuning and evaluation.
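
As an illustration of what a module like `history.py` might do, the sketch below appends transcription records to `data/history.json` and reads them back. The function names and the record fields (`timestamp`, `filename`, `language`, `text`) are assumptions made for this example, not the project's actual schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List

HISTORY_PATH = Path("data/history.json")  # path named in the README


def load_history(path: Path = HISTORY_PATH) -> List[Dict[str, str]]:
    """Return all saved entries, or an empty list if the file does not exist."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def append_entry(
    filename: str, language: str, text: str, path: Path = HISTORY_PATH
) -> Dict[str, str]:
    """Append one transcription record and rewrite the history file."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "filename": filename,
        "language": language,
        "text": text,
    }
    history = load_history(path)
    history.append(entry)

    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file first, then swap it in, so a crash
    # mid-write does not leave a truncated history file behind.
    tmp_path = path.with_suffix(".tmp")
    with tmp_path.open("w", encoding="utf-8") as fh:
        json.dump(history, fh, ensure_ascii=False, indent=2)
    tmp_path.replace(path)
    return entry
```

Setting `ensure_ascii=False` keeps non-Latin transcripts readable in the JSON file, which matters for a multilingual tool.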