Spaces: Build error
GitHub Actions committed on
Commit · 57b8470
Parent(s): 921859d
Deploy from bot_text branch - Sat Dec 27 17:53:04 UTC 2025
Browse files
- 1.4.0 +0 -0
- INSTALL.md +152 -0
- README.md +214 -10
- STT_INTEGRATION_GUIDE.md +270 -0
- app.py +1263 -1105
- hf-space +1 -0
- pyproject.toml +139 -0
- requirements.txt +42 -42
- requirements_coqui.txt +6 -0
- requirements_hubert.txt +7 -0
- requirements_tawasul.txt +13 -0
- requirements_vosk.txt +3 -0
- requirements_wav2vec2.txt +24 -0
- requirements_whisper.txt +27 -0
- setup.py +212 -0
- setup_hf_auth.py +156 -0
- stt/__init__.py +19 -0
- stt/chirp3_stt.py +136 -0
- stt/coqui_stt.py +390 -0
- stt/example_custom_stt.py +288 -0
- stt/hubert_arabic_stt.py +568 -0
- stt/stt_base.py +251 -0
- stt/tawasul_stt.py +448 -0
- stt/vosk_stt.py +561 -0
- stt/wav2vec2_arabic_stt.py +509 -0
- stt/whisper_stt.py +377 -0
- test_coqui.py +163 -0
- test_gradio_voice_transcriber.py +186 -0
- test_hubert_arabic.py +146 -0
- test_tawasul.py +132 -0
- test_vosk.py +185 -0
- test_wav2vec2_arabic.py +142 -0
- test_whisper_local.py +32 -0
- uv.lock +0 -0
1.4.0
ADDED
File without changes

INSTALL.md
ADDED
@@ -0,0 +1,152 @@
# Installation Guide

This guide explains how to install and set up the Modular Voice Transcriber with different STT models.

## 🚀 Quick Start

### Option 1: Automated Setup (Recommended)
```bash
# Essential models (Whisper + Wav2Vec2)
python setup.py --profile essential --test

# Or for specific models only
python setup.py --profile whisper-only --test
python setup.py --profile wav2vec2-only --test
```

### Option 2: Manual Installation

#### Base Installation
```bash
# Core requirements
pip install gradio>=4.0.0 numpy>=1.21.0 soundfile>=0.12.1
```

#### Choose Your STT Models

**OpenAI Whisper (Local + API)**
```bash
pip install -r requirements_whisper.txt
# Or: pip install -e .[whisper,whisper-api]
```

**Wav2Vec2 Arabic**
```bash
pip install -r requirements_wav2vec2.txt
# Or: pip install -e .[wav2vec2]
```

**All Models**
```bash
pip install -r requirements.txt
# Or: pip install -e .[all-stt]
```
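To confirm which engines your chosen profile actually enabled, you can probe the optional imports the same way `app.py` does. This is a minimal sketch that assumes you run it from the repository root, so the `stt/` package is importable:

```python
# Sanity check: which optional STT backends did the install make importable?
# Mirrors the try/except import pattern used in app.py.
import importlib

BACKENDS = {
    "Whisper": "stt.whisper_stt",
    "Wav2Vec2 Arabic": "stt.wav2vec2_arabic_stt",
    "HuBERT Arabic": "stt.hubert_arabic_stt",
    "Tawasul": "stt.tawasul_stt",
    "Vosk": "stt.vosk_stt",
    "Coqui": "stt.coqui_stt",
}

for name, module in BACKENDS.items():
    try:
        importlib.import_module(module)
        print(f"[ok]      {name}")
    except ImportError as exc:
        print(f"[missing] {name}: {exc}")
```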
## 📦 Installation Profiles

| Profile | Models Included | Use Case |
|---------|-----------------|----------|
| `minimal` | None | Interface only (for development) |
| `essential` | Whisper + Wav2Vec2 | Best balance of features |
| `whisper-only` | OpenAI Whisper | English + Multilingual |
| `wav2vec2-only` | Wav2Vec2 Arabic | Arabic Egyptian dialect |
| `all` | All supported models | Complete functionality |

## 🔧 System Requirements

### Minimum Requirements
- Python 3.8+
- 4GB RAM
- 2GB free disk space

### Recommended Requirements
- Python 3.9+
- 8GB RAM
- 5GB free disk space
- GPU with CUDA support (for faster transcription)

## 📊 Model Download Sizes

| Model | First Download | Disk Space |
|-------|----------------|------------|
| Whisper Tiny | 39MB | 39MB |
| Whisper Base | 142MB | 142MB |
| Whisper Medium | 1.5GB | 1.5GB |
| Wav2Vec2 Arabic | 1.2GB | 1.2GB |

## 🧪 Testing Your Installation

### Test Individual Models
```bash
# Test Wav2Vec2 Arabic
python test_wav2vec2_arabic.py

# Test Whisper
python test_whisper_local.py
```

### Test Full Interface
```bash
python gradio_voice_transcriber_clean.py
```

## 🔍 Troubleshooting

### Common Issues

**Import Error: transformers**
```bash
pip install transformers torch torchaudio
```

**Import Error: whisper**
```bash
pip install openai-whisper
```

**CUDA Issues**
- Install PyTorch with CUDA support from [pytorch.org](https://pytorch.org)
- Or use CPU-only: `pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu`

**Model Download Issues**
- Check internet connection
- Hugging Face models download automatically on first use
- Downloads go to `~/.cache/huggingface/` and `~/.cache/whisper/` (relocation options are shown below)
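If the default cache locations are a problem (small home partition, shared machines), both caches can be redirected. A hedged sketch: `HF_HOME` is the documented Hugging Face cache override and must be set before the libraries are imported, and `whisper.load_model` accepts an explicit `download_root` argument; the paths shown are placeholders.

```python
import os

# Redirect the Hugging Face cache (Wav2Vec2/HuBERT/Tawasul downloads).
# Must be set before transformers/huggingface_hub are imported.
os.environ["HF_HOME"] = "/data/hf-cache"  # placeholder path

import whisper  # openai-whisper

# Whisper accepts an explicit download directory per call.
model = whisper.load_model("base", download_root="/data/whisper-cache")
```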
### Performance Tips

**For Better Speed:**
- Use a GPU if available
- Choose smaller models for real-time use
- Use larger models only when accuracy matters more than speed

**For Better Quality:**
- Record in a quiet environment
- Use a good microphone
- Speak clearly and at a normal pace
- Choose the appropriate language/dialect model

## 🔄 Updating

```bash
# Update to latest versions
pip install --upgrade -r requirements.txt

# Update specific models
pip install --upgrade transformers openai-whisper
```

## 🎯 Next Steps

1. **Run the interface:** `python gradio_voice_transcriber_clean.py`
2. **Choose your model** from the dropdown
3. **Load the model** (the first load downloads it)
4. **Test with audio** recording or upload
5. **Check quality analysis** for audio tips

## 📚 Additional Resources

- [Gradio Documentation](https://gradio.app/docs/)
- [Whisper by OpenAI](https://openai.com/research/whisper)
- [Wav2Vec2 Models](https://huggingface.co/models?search=wav2vec2)
- [Transformers Library](https://huggingface.co/docs/transformers/)
README.md
CHANGED
@@ -1,12 +1,216 @@
(The previous 12-line README was removed; its text is not captured in this view apart from a `---` delimiter.)
# Modular Voice Transcriber

A comprehensive, modular Gradio-based web interface for speech-to-text transcription supporting multiple STT engines, including OpenAI Whisper, Wav2Vec2, HuBERT Arabic, Tawasul, Vosk, and Coqui STT models.

## 🌟 Features

- **Comprehensive STT Support**: 7+ different speech-to-text engines
- **Multiple Models**: OpenAI Whisper, Wav2Vec2 Arabic, HuBERT, Tawasul, Vosk, Coqui STT
- **Arabic Language Focus**: Specialized models for Arabic dialect recognition
- **Web Interface**: User-friendly Gradio interface with image gallery
- **Real-time Processing**: Live audio recording and transcription
- **Quality Analysis**: Audio quality feedback and recommendations
- **Device Support**: Automatic CPU/GPU detection and selection
- **Authentication**: Support for private HuggingFace models
- **Static Class Support**: Optimized memory usage for certain models
- **Visual Interface**: Interactive image gallery with thumbnail navigation

## 🚀 Quick Start

### Option 1: Automated Setup (Recommended)
```bash
# Essential models (Whisper + Wav2Vec2)
python setup.py --profile essential --test

# Or specific models
python setup.py --profile whisper-only --test
python setup.py --profile wav2vec2-only --test
```

### Option 2: Manual Installation

```bash
# Base installation (Whisper + Wav2Vec2)
pip install -r requirements.txt

# Specific model installations
pip install -r requirements_whisper.txt    # OpenAI Whisper only
pip install -r requirements_wav2vec2.txt   # Wav2Vec2 only
pip install -r requirements_hubert.txt     # HuBERT Arabic only
pip install -r requirements_tawasul.txt    # Tawasul Arabic only
pip install -r requirements_vosk.txt       # Vosk offline only
pip install -r requirements_coqui.txt      # Coqui STT only

# Or install with specific extras
pip install -e .[essential]   # Whisper + Wav2Vec2
pip install -e .[all-stt]     # All models
```

## 🎯 Supported STT Models

| Model | Language | Size | Type | Quality | Features |
|-------|----------|------|------|---------|----------|
| **Whisper Tiny** | Multilingual | 39MB | Local/API | Fast | General purpose |
| **Whisper Base** | Multilingual | 142MB | Local/API | Good | General purpose |
| **Whisper Medium** | Multilingual | 1.5GB | Local/API | Better | General purpose |
| **Whisper Large** | Multilingual | 2.9GB | Local/API | Best | General purpose |
| **Wav2Vec2 Arabic** | Arabic | 1.2GB | Local | Excellent | Arabic dialects |
| **HuBERT Arabic** | Arabic Egyptian | 1.2GB | Local | Excellent | Egyptian dialect |
| **Tawasul V0** | Arabic | 800MB | Local | Very Good | Arabic speech, static class |
| **Vosk** | Multilingual | 50MB-1.8GB | Local/Offline | Good | Offline capable |
| **Coqui STT** | Multilingual | 180MB-2GB | Local | Good | Open source |

## 🔧 Usage

### Start the Interface
```bash
python gradio_voice_transcriber_clean.py
```

### Using Different Models

1. **Select Model**: Choose from the dropdown (WhisperSTT, Wav2Vec2ArabicSTT, HuBERTArabicSTT, TawasulSTT, VoskSTT, CoquiSTT)
2. **Configure**: Set model size, device, language, and authentication if needed
3. **Load**: Click "Load Model" (the first load downloads the model automatically)
4. **Transcribe**: Record audio or upload audio files
5. **Gallery**: Browse sample images using the interactive thumbnail gallery

### Authentication for Private Models

Some experimental models require HuggingFace authentication:

```bash
# Option 1: Use the helper script
python setup_hf_auth.py

# Option 2: Manual login
pip install huggingface-hub
huggingface-cli login

# Option 3: Use a token in the interface
# Get a token from: https://huggingface.co/settings/tokens
# Enter it in the "HuggingFace Token" field
```
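The same login can also be done programmatically, which is convenient on headless machines or Spaces. A minimal sketch using the public `login` helper from `huggingface_hub` (the token value is a placeholder):

```python
from huggingface_hub import login

# Equivalent to `huggingface-cli login`: stores the token so that
# transformers can fetch gated/private checkpoints afterwards.
login(token="hf_xxx")  # placeholder; create one at https://huggingface.co/settings/tokens
```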
## 🏗️ Adding New STT Models

The system is designed to be easily extensible:

1. **Create STT Class**: Inherit from `BaseSTT` in `stt/your_model.py`
2. **Register Model**: Add it to `STT_MODELS` in `gradio_voice_transcriber_clean.py`
3. **Configure Options**: Update the `get_model_options()` method
4. **Test**: Run your model through the interface

See `STT_INTEGRATION_GUIDE.md` for detailed instructions.

## 🎨 Interface Features

### Image Gallery
The enhanced interface (`gradio_voice_transcript_temp.py`) includes:
- **Interactive Gallery**: Browse sample images with thumbnail navigation
- **Horizontal Scrolling**: Smooth image browsing experience
- **Thumbnail Selection**: Click thumbnails to view full images
- **Gallery Controls**: Navigation and zoom functionality

### Static Class Models
Some models (like TawasulSTT) use a static class implementation (sketched below) for:
- **Memory Efficiency**: Reduced memory footprint
- **Faster Loading**: Optimized model initialization
- **Shared Resources**: Better resource management
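A hedged sketch of what "static class" means here: all state lives on the class itself and the model is loaded at most once, so every caller shares one instance. The `object()` stand-ins are placeholders; the real implementation is in `stt/tawasul_stt.py`.

```python
class StaticSTT:
    """All state is class-level; no instances are ever created."""

    model = None        # the loaded checkpoint, shared by every caller
    is_loaded = False

    @classmethod
    def load_model(cls, **kwargs):
        if cls.is_loaded:        # idempotent: repeated loads are free
            return
        cls.model = object()     # stand-in for the real model load
        cls.is_loaded = True

    @classmethod
    def transcribe_audio(cls, audio_data, sample_rate=None):
        if not cls.is_loaded:
            raise RuntimeError("call load_model() first")
        return "..."             # stand-in for real inference
```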
## 📁 Model Storage Locations

Different STT models store their files in various locations (the snippet below reports how much disk each cache is using):

| Model Type | Storage Location | Description |
|------------|------------------|-------------|
| **Hugging Face Models** | `~/.cache/huggingface/` | Wav2Vec2, HuBERT, Tawasul models |
| **Whisper Models** | `~/.cache/whisper/` | OpenAI Whisper model files |
| **Vosk Models** | `~/.vosk/models/` | Offline Vosk language models |
| **Coqui Models** | Managed by model manager | Coqui STT model files |
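These caches grow quickly with the larger checkpoints. A small, hedged helper to check usage (paths taken from the table above; adjust if you relocated a cache):

```python
from pathlib import Path

CACHES = {
    "huggingface": Path.home() / ".cache" / "huggingface",
    "whisper": Path.home() / ".cache" / "whisper",
    "vosk": Path.home() / ".vosk" / "models",
}

for name, root in CACHES.items():
    if root.exists():
        # Sum the sizes of all regular files below the cache root.
        size = sum(p.stat().st_size for p in root.rglob("*") if p.is_file())
    else:
        size = 0
    print(f"{name:12s} {size / 1e9:6.2f} GB  ({root})")
```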
## 📁 Project Structure

```
STT-trails/
├── gradio_voice_transcriber.py        # Main comprehensive interface
├── gradio_voice_transcript_temp.py    # Enhanced interface with image gallery
├── stt/                               # STT implementations
│   ├── stt_base.py                    # Base class for all STT models
│   ├── whisper_stt.py                 # OpenAI Whisper implementation
│   ├── wav2vec2_arabic_stt.py         # Wav2Vec2 Arabic model
│   ├── hubert_arabic_stt.py           # HuBERT Arabic dialect model
│   ├── tawasul_stt.py                 # Tawasul Arabic model (static class)
│   ├── vosk_stt.py                    # Vosk offline STT
│   ├── coqui_stt.py                   # Coqui open-source STT
│   └── example_custom_stt.py          # Template for new models
├── setup.py                           # Installation helper
├── setup_hf_auth.py                   # HuggingFace authentication helper
├── test_*.py                          # Model testing scripts
├── requirements*.txt                  # Dependencies for each model
├── recordings/                        # Audio recordings directory
├── INSTALL.md                         # Detailed installation guide
├── STT_INTEGRATION_GUIDE.md           # Developer integration guide
└── pyproject.toml                     # Project configuration
```

## 🧪 Testing

```bash
# Test specific models
python test_whisper_local.py       # Test Whisper models
python test_wav2vec2_arabic.py     # Test Wav2Vec2 Arabic
python test_hubert_arabic.py       # Test HuBERT Arabic
python test_tawasul.py             # Test Tawasul Arabic
python test_vosk.py                # Test Vosk offline STT
python test_coqui.py               # Test Coqui STT

# Test installation
python setup.py --profile essential --test

# Run the interface
python gradio_voice_transcriber.py        # Main interface
python gradio_voice_transcript_temp.py    # Interface with image gallery
```

## 🔍 Troubleshooting

### Model Loading Issues
- **HuggingFace Authentication**: Use `setup_hf_auth.py` or a manual token
- **Memory Issues**: Use smaller models or CPU-only mode
- **Internet Required**: The first model download needs an internet connection

### Audio Issues
- **No Audio Detected**: Check microphone permissions and volume
- **Poor Quality**: Use the audio quality analysis feature
- **Wrong Language**: Select the appropriate model for your language

### Performance Tips
- **Use GPU**: Automatic if CUDA-enabled PyTorch is installed
- **Chunk Long Audio**: Handled automatically for clips over 20 seconds
- **Choose the Right Model**: Balance size vs. accuracy for your use case

## 📄 License

This project is open source. See the individual model licenses:
- OpenAI Whisper: MIT License
- Wav2Vec2: MIT License
- HuggingFace Transformers: Apache 2.0

## 🤝 Contributing

1. Fork the repository
2. Create your feature branch
3. Add your STT model following the integration guide
4. Submit a pull request

## 📚 Resources

- [OpenAI Whisper](https://openai.com/research/whisper)
- [Wav2Vec2 Paper](https://arxiv.org/abs/2006.11477)
- [HuggingFace Models](https://huggingface.co/models)
- [Gradio Documentation](https://gradio.app/docs/)

---

**Made with ❤️ for the speech recognition community**
STT_INTEGRATION_GUIDE.md
ADDED
@@ -0,0 +1,270 @@
# Modular STT Integration Guide

This guide explains how to integrate new Speech-to-Text models into the modular Gradio voice transcriber.

## 🏗️ Architecture Overview

The system is built with a modular architecture that makes it easy to add new STT engines:

```
gradio_voice_transcriber_clean.py
├── ModelManager          # Handles model registration and loading
├── AudioProcessor        # Preprocesses audio for better quality
├── TranscriptionEngine   # Manages the transcription workflow
└── GradioInterface       # Creates the web UI
```

## 🔧 Adding a New STT Model

### Step 1: Create Your STT Class

Create a new file in the `stt/` directory (e.g., `your_stt.py`) that inherits from `BaseSTT`:

```python
from stt.stt_base import BaseSTT, STTResult
import numpy as np

class YourSTT(BaseSTT):
    model_name = "YourSTT"
    model = None
    is_loaded = False
    config = {}

    @classmethod
    def load_model(cls, **kwargs):
        # Initialize your STT service
        cls.model = your_stt_client()
        cls.is_loaded = True

    @classmethod
    def transcribe_audio(cls, audio_data, sample_rate=None):
        # Implement transcription logic
        result = cls.model.transcribe(audio_data)
        return STTResult(text=result.text, confidence=result.confidence)
```
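For reference while implementing, here is a hedged sketch of the `STTResult` shape this guide relies on. The fields match the ones used in the "Metadata and Confidence" section below, but the authoritative definition lives in `stt/stt_base.py`:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

@dataclass
class STTResult:
    """Container for one transcription outcome (sketch; see stt/stt_base.py)."""
    text: str
    confidence: Optional[float] = None
    processing_time: Optional[float] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
```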
### Step 2: Register Your Model

Add your model to the registry in `gradio_voice_transcriber_clean.py`:

```python
# Import your model
from stt.your_stt import YourSTT

# Add to the registry
STT_MODELS = {
    "WhisperSTT": WhisperSTT,
    "YourSTT": YourSTT,  # Add this line
}
```

### Step 3: Configure Model Options

Update the `ModelManager.get_model_options()` method:

```python
@staticmethod
def get_model_options(model_name: str) -> Dict[str, Any]:
    if model_name == "YourSTT":
        return {
            "model_sizes": ["small", "large"],
            "supports_api": True,
            "languages": [("English", "en"), ("Spanish", "es")],
            "default_params": {"temperature": 0.0}
        }
    # ... existing code
```

### Step 4: Handle Model Loading

Update the loading logic in `ModelManager.load_model()`:

```python
if model_name == "YourSTT":
    api_key = kwargs.get("api_key", "")
    model_size = kwargs.get("model_size", "small")

    YourSTT.load_model(api_key=api_key, model_size=model_size)
    status = f"✅ {model_name} loaded successfully"
```

## 📝 Real Examples

### Azure Speech Service

```python
import azure.cognitiveservices.speech as speechsdk

class AzureSTT(BaseSTT):
    model_name = "AzureSTT"

    @classmethod
    def load_model(cls, subscription_key, region):
        speech_config = speechsdk.SpeechConfig(
            subscription=subscription_key,
            region=region
        )
        cls.model = speech_config
        cls.is_loaded = True

    @classmethod
    def transcribe_audio(cls, audio_data, sample_rate=None):
        # Convert the audio and send it to Azure
        # Return an STTResult with the transcription
        pass
```

### Google Cloud Speech

```python
from google.cloud import speech

class GoogleSTT(BaseSTT):
    model_name = "GoogleSTT"

    @classmethod
    def load_model(cls, credentials_path):
        cls.model = speech.SpeechClient()
        cls.is_loaded = True

    @classmethod
    def transcribe_audio(cls, audio_data, sample_rate=None):
        # Process with Google Cloud Speech
        pass
```

### AssemblyAI

```python
import assemblyai as aai

class AssemblyAISTT(BaseSTT):
    model_name = "AssemblyAISTT"

    @classmethod
    def load_model(cls, api_key):
        aai.settings.api_key = api_key
        cls.model = aai.Transcriber()
        cls.is_loaded = True

    @classmethod
    def transcribe_audio(cls, audio_data, sample_rate=None):
        # Save the audio temporarily and transcribe
        pass
```

## 🎯 Best Practices

### 1. Error Handling
```python
@classmethod
def load_model(cls, **kwargs):
    try:
        # Model loading logic
        cls.is_loaded = True
    except Exception as e:
        cls.is_loaded = False
        raise RuntimeError(f"Failed to load {cls.model_name}: {e}")
```

### 2. Configuration Management
```python
class YourSTT(BaseSTT):
    config = {
        "default_language": "en",
        "timeout": 30,
        "retry_count": 3
    }

    @classmethod
    def set_language(cls, language):
        cls.config["default_language"] = language
```

### 3. Audio Format Handling
```python
@classmethod
def transcribe_audio(cls, audio_data, sample_rate=None):
    # Handle numpy arrays
    if isinstance(audio_data, np.ndarray):
        # Convert to the required format
        audio_bytes = audio_to_bytes(audio_data, sample_rate)
    else:
        # Handle file paths
        with open(audio_data, 'rb') as f:
            audio_bytes = f.read()

    # Transcribe and return the result
```
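The `audio_to_bytes` helper above is not defined anywhere in the snippet; it is a hypothetical function. One way to write it, sketched here with `soundfile` (already a core dependency), is to encode the array as an in-memory WAV file:

```python
import io

import numpy as np
import soundfile as sf

def audio_to_bytes(audio_data: np.ndarray, sample_rate: int) -> bytes:
    """Encode a mono float32 array as WAV bytes (hypothetical helper)."""
    buf = io.BytesIO()
    sf.write(buf, audio_data, sample_rate, format="WAV", subtype="PCM_16")
    return buf.getvalue()
```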
### 4. Metadata and Confidence
```python
return STTResult(
    text=transcription,
    confidence=confidence_score,
    processing_time=processing_time,
    metadata={
        "model": cls.model_name,
        "language_detected": detected_language,
        "audio_duration": duration,
        "service_info": additional_info
    }
)
```

## 🚀 Testing Your Integration

1. **Unit Test Your STT Class**:
```python
def test_your_stt():
    YourSTT.load_model(api_key="test")
    dummy_audio = np.random.randn(16000).astype(np.float32)
    result = YourSTT.transcribe_audio(dummy_audio, 16000)
    assert result.text is not None
```

2. **Test in the Gradio Interface**:
   - Run `python gradio_voice_transcriber_clean.py`
   - Select your model from the dropdown
   - Load it and test with audio

## 🛠️ Advanced Features

### Custom UI Components

You can add model-specific UI components by extending the interface:

```python
# Add custom fields for your model
if model_name == "YourSTT":
    custom_setting = gr.Slider(
        minimum=0, maximum=1, value=0.5,
        label="Custom Setting"
    )
```

### Background Processing

For long-running transcriptions (a runnable sketch follows the stub):

```python
@classmethod
def transcribe_audio_async(cls, audio_data, callback):
    # Start background transcription
    # Call the callback when done
    pass
```
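As a concrete, hedged version of that stub, here is a minimal thread-based mixin. It assumes your backend's `transcribe_audio` is safe to call from a worker thread, which you should verify before relying on it:

```python
import threading

class AsyncTranscribeMixin:
    """Mixin sketch: callback-based async wrapper around transcribe_audio."""

    @classmethod
    def transcribe_audio_async(cls, audio_data, callback, sample_rate=None):
        def worker():
            try:
                result = cls.transcribe_audio(audio_data, sample_rate)
            except Exception as exc:   # deliver failures to the caller too
                callback(exc)
            else:
                callback(result)

        # Daemon thread: does not block interpreter shutdown.
        threading.Thread(target=worker, daemon=True).start()
```

Usage would be `class YourSTT(AsyncTranscribeMixin, BaseSTT): ...`, with the callback deciding how to surface results or exceptions.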
## 📋 Currently Available Models

- **WhisperSTT**: OpenAI Whisper (local + API)
- **ExampleCustomSTT**: Template for new integrations

## 🎯 Next Steps

1. Choose your STT service
2. Follow the integration pattern
3. Test thoroughly
4. Contribute back to the project!

The modular design makes it easy to support any STT service while maintaining a consistent user experience.
app.py
CHANGED
@@ -1,1106 +1,1264 @@
(The previous 1,106-line app.py was removed; the deleted text is truncated in this capture, so only the new version is shown below.)
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Modular Gradio Voice Transcriber
|
| 4 |
+
|
| 5 |
+
A flexible web interface for voice transcription supporting multiple STT models.
|
| 6 |
+
Easily extensible to support any STT implementation that follows the BaseSTT interface.
|
| 7 |
+
|
| 8 |
+
Usage:
|
| 9 |
+
python gradio_voice_transcriber_clean.py
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
+
import gradio as gr
|
| 13 |
+
import numpy as np
|
| 14 |
+
import logging
|
| 15 |
+
import time
|
| 16 |
+
from typing import Tuple, Optional, Dict, Any, Type, List, Union
|
| 17 |
+
from pathlib import Path
|
| 18 |
+
|
| 19 |
+
# Import base STT class and available implementations
|
| 20 |
+
from stt.stt_base import BaseSTT, STTResult
|
| 21 |
+
from stt.whisper_stt import WhisperSTT
|
| 22 |
+
|
| 23 |
+
# Try to import Wav2Vec2 Arabic STT (optional)
|
| 24 |
+
try:
|
| 25 |
+
from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
|
| 26 |
+
WAV2VEC2_AVAILABLE = True
|
| 27 |
+
except ImportError:
|
| 28 |
+
WAV2VEC2_AVAILABLE = False
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
# Try to import Chirp3 STT (optional)
|
| 32 |
+
try:
|
| 33 |
+
from stt.chirp3_stt import Chirp3STT
|
| 34 |
+
CHIRP3_AVAILABLE = True
|
| 35 |
+
except ImportError:
|
| 36 |
+
Chirp3STT = None
|
| 37 |
+
CHIRP3_AVAILABLE = False
|
| 38 |
+
|
| 39 |
+
# Try to import HuBERT Arabic STT (optional)
|
| 40 |
+
try:
|
| 41 |
+
from stt.hubert_arabic_stt import HuBERTArabicSTT
|
| 42 |
+
HUBERT_AVAILABLE = True
|
| 43 |
+
except ImportError:
|
| 44 |
+
HUBERT_AVAILABLE = False
|
| 45 |
+
|
| 46 |
+
# Try to import Vosk STT (optional)
|
| 47 |
+
try:
|
| 48 |
+
from stt.vosk_stt import VoskSTT
|
| 49 |
+
VOSK_AVAILABLE = True
|
| 50 |
+
except ImportError:
|
| 51 |
+
VOSK_AVAILABLE = False
|
| 52 |
+
|
| 53 |
+
# Try to import Coqui STT (optional)
|
| 54 |
+
try:
|
| 55 |
+
from stt.coqui_stt import CoquiSTT
|
| 56 |
+
COQUI_AVAILABLE = True
|
| 57 |
+
except ImportError:
|
| 58 |
+
COQUI_AVAILABLE = False
|
| 59 |
+
|
| 60 |
+
# Try to import Tawasul STT (optional)
|
| 61 |
+
try:
|
| 62 |
+
from stt.tawasul_stt import TawasulSTT
|
| 63 |
+
TAWASUL_AVAILABLE = True
|
| 64 |
+
except ImportError:
|
| 65 |
+
TAWASUL_AVAILABLE = False
|
| 66 |
+
|
| 67 |
+
# Setup logging
|
| 68 |
+
logging.basicConfig(level=logging.INFO)
|
| 69 |
+
logger = logging.getLogger(__name__)
|
| 70 |
+
|
| 71 |
+
# STT Model Registry - Add new models here
|
| 72 |
+
STT_MODELS: Dict[str, Type[BaseSTT]] = {
|
| 73 |
+
"WhisperSTT": WhisperSTT,
|
| 74 |
+
}
|
| 75 |
+
|
| 76 |
+
# Add Wav2Vec2 Arabic if available
|
| 77 |
+
if WAV2VEC2_AVAILABLE:
|
| 78 |
+
STT_MODELS["Wav2Vec2ArabicSTT"] = Wav2Vec2ArabicSTT
|
| 79 |
+
|
| 80 |
+
# Add HuBERT Arabic if available
|
| 81 |
+
if HUBERT_AVAILABLE:
|
| 82 |
+
STT_MODELS["HuBERTArabicSTT"] = HuBERTArabicSTT
|
| 83 |
+
|
| 84 |
+
# Add Vosk if available
|
| 85 |
+
if VOSK_AVAILABLE:
|
| 86 |
+
STT_MODELS["VoskSTT"] = VoskSTT
|
| 87 |
+
|
| 88 |
+
# Add Coqui STT if available
|
| 89 |
+
if COQUI_AVAILABLE:
|
| 90 |
+
STT_MODELS["CoquiSTT"] = CoquiSTT
|
| 91 |
+
|
| 92 |
+
# Add Tawasul STT if available
|
| 93 |
+
if TAWASUL_AVAILABLE:
|
| 94 |
+
STT_MODELS["TawasulSTT"] = TawasulSTT
|
| 95 |
+
|
| 96 |
+
# Global state
|
| 97 |
+
current_stt_model: Optional[Type[BaseSTT]] = None
|
| 98 |
+
current_model_config: Dict[str, Any] = {}
|
| 99 |
+
|
| 100 |
+
|
| 101 |
+
class AudioProcessor:
|
| 102 |
+
"""Handle audio preprocessing for better transcription quality."""
|
| 103 |
+
|
| 104 |
+
@staticmethod
|
| 105 |
+
def preprocess(audio_data: np.ndarray, sample_rate: int, target_sr: int = 16000) -> np.ndarray:
|
| 106 |
+
"""
|
| 107 |
+
Preprocess audio for better transcription quality.
|
| 108 |
+
|
| 109 |
+
Args:
|
| 110 |
+
audio_data: Raw audio data
|
| 111 |
+
sample_rate: Original sample rate
|
| 112 |
+
target_sr: Target sample rate (default: 16000 for Whisper)
|
| 113 |
+
|
| 114 |
+
Returns:
|
| 115 |
+
Preprocessed audio data
|
| 116 |
+
"""
|
| 117 |
+
# Convert to mono if stereo
|
| 118 |
+
if audio_data.ndim > 1:
|
| 119 |
+
audio_data = np.mean(audio_data, axis=1)
|
| 120 |
+
|
| 121 |
+
# Normalize to float32 [-1, 1]
|
| 122 |
+
if audio_data.dtype == np.int16:
|
| 123 |
+
audio_data = audio_data.astype(np.float32) / 32768.0
|
| 124 |
+
elif audio_data.dtype == np.int32:
|
| 125 |
+
audio_data = audio_data.astype(np.float32) / 2147483648.0
|
| 126 |
+
else:
|
| 127 |
+
audio_data = audio_data.astype(np.float32)
|
| 128 |
+
|
| 129 |
+
# Clip to prevent overflow
|
| 130 |
+
audio_data = np.clip(audio_data, -1.0, 1.0)
|
| 131 |
+
|
| 132 |
+
# Remove DC offset
|
| 133 |
+
audio_data = audio_data - np.mean(audio_data)
|
| 134 |
+
|
| 135 |
+
# Simple noise gate (remove very quiet sections)
|
| 136 |
+
if len(audio_data) > 0:
|
| 137 |
+
threshold = np.max(np.abs(audio_data)) * 0.01
|
| 138 |
+
audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
|
| 139 |
+
|
| 140 |
+
# Resample if needed
|
| 141 |
+
if sample_rate != target_sr:
|
| 142 |
+
audio_data = AudioProcessor._resample(audio_data, sample_rate, target_sr)
|
| 143 |
+
|
| 144 |
+
return audio_data
|
| 145 |
+
|
| 146 |
+
@staticmethod
|
| 147 |
+
def _resample(audio_data: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
|
| 148 |
+
"""Simple resampling (prefer librosa if available)."""
|
| 149 |
+
try:
|
| 150 |
+
import librosa
|
| 151 |
+
return librosa.resample(audio_data, orig_sr=orig_sr, target_sr=target_sr)
|
| 152 |
+
except ImportError:
|
| 153 |
+
# Simple resampling fallback
|
| 154 |
+
if orig_sr > target_sr:
|
| 155 |
+
step = orig_sr // target_sr
|
| 156 |
+
return audio_data[::step]
|
| 157 |
+
else:
|
| 158 |
+
repeat_factor = target_sr // orig_sr
|
| 159 |
+
return np.repeat(audio_data, repeat_factor)
|
| 160 |
+
|
| 161 |
+
@staticmethod
|
| 162 |
+
def _preprocess_audio(audio_path: str) -> Tuple[np.ndarray, int]:
|
| 163 |
+
"""
|
| 164 |
+
Preprocess audio file for STT models that need torch.Tensor input.
|
| 165 |
+
|
| 166 |
+
Args:
|
| 167 |
+
audio_path: Path to audio file
|
| 168 |
+
|
| 169 |
+
Returns:
|
| 170 |
+
Tuple of (audio_tensor_as_numpy, sample_rate) that can be converted to torch.Tensor
|
| 171 |
+
"""
|
| 172 |
+
try:
|
| 173 |
+
import librosa
|
| 174 |
+
import soundfile as sf
|
| 175 |
+
|
| 176 |
+
# Try to load with librosa first (more robust)
|
| 177 |
+
try:
|
| 178 |
+
audio_data, sample_rate = librosa.load(audio_path, sr=16000)
|
| 179 |
+
except Exception:
|
| 180 |
+
# Fallback to soundfile
|
| 181 |
+
audio_data, sample_rate = sf.read(audio_path)
|
| 182 |
+
if sample_rate != 16000:
|
| 183 |
+
audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
|
| 184 |
+
sample_rate = 16000
|
| 185 |
+
|
| 186 |
+
# Convert to mono if needed
|
| 187 |
+
if audio_data.ndim > 1:
|
| 188 |
+
audio_data = np.mean(audio_data, axis=1)
|
| 189 |
+
|
| 190 |
+
# Normalize audio to [-1, 1]
|
| 191 |
+
if audio_data.max() > 1.0:
|
| 192 |
+
audio_data = audio_data / audio_data.max()
|
| 193 |
+
|
| 194 |
+
# Remove DC offset
|
| 195 |
+
audio_data = audio_data - np.mean(audio_data)
|
| 196 |
+
|
| 197 |
+
# Apply noise gate for very quiet audio
|
| 198 |
+
threshold = np.max(np.abs(audio_data)) * 0.01
|
| 199 |
+
audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
|
| 200 |
+
|
| 201 |
+
# Convert to float32 for compatibility
|
| 202 |
+
audio_data = audio_data.astype(np.float32)
|
| 203 |
+
|
| 204 |
+
return audio_data, sample_rate
|
| 205 |
+
|
| 206 |
+
except Exception as e:
|
| 207 |
+
raise RuntimeError(f"Audio preprocessing failed: {str(e)}")
|
| 208 |
+
|
| 209 |
+
@staticmethod
|
| 210 |
+
def _preprocess_audio_torch(audio_path: str):
|
| 211 |
+
"""
|
| 212 |
+
Preprocess audio file and return torch.Tensor for PyTorch-based STT models.
|
| 213 |
+
|
| 214 |
+
Args:
|
| 215 |
+
audio_path: Path to audio file
|
| 216 |
+
|
| 217 |
+
Returns:
|
| 218 |
+
Tuple of (audio_tensor, sample_rate) where audio_tensor is torch.Tensor
|
| 219 |
+
"""
|
| 220 |
+
try:
|
| 221 |
+
import torch
|
| 222 |
+
|
| 223 |
+
# Get numpy array first
|
| 224 |
+
audio_data, sample_rate = AudioProcessor._preprocess_audio(audio_path)
|
| 225 |
+
|
| 226 |
+
# Convert to torch tensor
|
| 227 |
+
audio_tensor = torch.FloatTensor(audio_data)
|
| 228 |
+
|
| 229 |
+
return audio_tensor, sample_rate
|
| 230 |
+
|
| 231 |
+
except ImportError:
|
| 232 |
+
raise RuntimeError("PyTorch not available. Install with: pip install torch")
|
| 233 |
+
except Exception as e:
|
| 234 |
+
raise RuntimeError(f"Torch audio preprocessing failed: {str(e)}")
|
| 235 |
+
|
| 236 |
+
@staticmethod
|
| 237 |
+
def analyze_quality(audio_data: np.ndarray, sample_rate: int) -> Dict[str, Any]:
|
| 238 |
+
"""Analyze audio quality and provide feedback."""
|
| 239 |
+
if audio_data.ndim > 1:
|
| 240 |
+
audio_data = np.mean(audio_data, axis=1)
|
| 241 |
+
|
| 242 |
+
duration = len(audio_data) / sample_rate
|
| 243 |
+
max_amp = np.max(np.abs(audio_data))
|
| 244 |
+
mean_amp = np.mean(np.abs(audio_data))
|
| 245 |
+
|
| 246 |
+
# Check for clipping and silence
|
| 247 |
+
clipping_ratio = np.sum(np.abs(audio_data) > 0.95) / len(audio_data)
|
| 248 |
+
silence_threshold = max_amp * 0.01
|
| 249 |
+
silence_ratio = np.sum(np.abs(audio_data) < silence_threshold) / len(audio_data)
|
| 250 |
+
|
| 251 |
+
return {
|
| 252 |
+
"duration": duration,
|
| 253 |
+
"max_amplitude": max_amp,
|
| 254 |
+
"mean_amplitude": mean_amp,
|
| 255 |
+
"clipping_ratio": clipping_ratio,
|
| 256 |
+
"silence_ratio": silence_ratio,
|
| 257 |
+
"sample_rate": sample_rate,
|
| 258 |
+
"is_good_quality": (
|
| 259 |
+
duration > 1.0 and
|
| 260 |
+
0.1 < max_amp < 0.9 and
|
| 261 |
+
clipping_ratio < 0.01 and
|
| 262 |
+
silence_ratio < 0.5
|
| 263 |
+
)
|
| 264 |
+
}
|
| 265 |
+
|
| 266 |
+
|
| 267 |
+
class ModelManager:
|
| 268 |
+
"""Handle STT model registration and loading."""
|
| 269 |
+
|
| 270 |
+
@staticmethod
|
| 271 |
+
def get_available_models() -> List[str]:
|
| 272 |
+
"""Get list of available STT model names."""
|
| 273 |
+
return list(STT_MODELS.keys())
|
| 274 |
+
|
| 275 |
+
    @staticmethod
    def get_model_options(model_name: str) -> Dict[str, Any]:
        """Get model-specific configuration options."""
        if model_name == "WhisperSTT":
            return {
                "model_sizes": ["tiny", "base", "small", "medium", "large"],
                "supports_api": True,
                "languages": [
                    ("Auto-detect", "auto"),
                    ("English", "en"),
                    ("Spanish", "es"),
                    ("French", "fr"),
                    ("German", "de"),
                    ("Italian", "it"),
                    ("Portuguese", "pt"),
                    ("Russian", "ru"),
                    ("Japanese", "ja"),
                    ("Korean", "ko"),
                    ("Chinese", "zh"),
                    ("Dutch", "nl"),
                    ("Arabic", "ar"),
                    ("Hindi", "hi")
                ],
                "default_params": {
                    "temperature": 0.0,
                    "beam_size": 5,
                    "best_of": 5,
                    "patience": 2.0,
                    "condition_on_previous_text": True,
                }
            }

        elif model_name == "Wav2Vec2ArabicSTT":
            return {
                "model_sizes": [
                    ("Arabic Standard", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
                    ("Multilingual", "facebook/wav2vec2-large-xlsr-53"),
                    ("English Fallback", "facebook/wav2vec2-base-960h"),
                    ("Arabic Egyptian (Experimental)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian")
                ],
                "supports_api": False,
                "supports_hf_token": True,
                "languages": [
                    ("Arabic Egyptian", "ar-EG"),
                    ("Arabic Standard", "ar"),
                    ("Auto-detect", "auto"),
                ],
                "device_options": ["auto", "cpu", "cuda"],
                "default_params": {
                    "device": "auto",
                    "chunk_length": 20,
                    "return_confidence": True,
                }
            }

        elif model_name == "VoskSTT":
            return {
                "model_sizes": [
                    ("English US Small (40MB)", "vosk-model-small-en-us-0.15"),
                    ("English US Large (1.8GB)", "vosk-model-en-us-0.22"),
                    ("Arabic (318MB)", "vosk-model-ar-mgb2-0.4"),
                    ("French (1.4GB)", "vosk-model-fr-0.22"),
                    ("German (1.2GB)", "vosk-model-de-0.21"),
                    ("Spanish (1.4GB)", "vosk-model-es-0.42"),
                    ("Russian Large (1.5GB)", "vosk-model-ru-0.42"),
                    ("Russian Small (45MB)", "vosk-model-small-ru-0.22"),
                    ("Chinese Small (42MB)", "vosk-model-small-cn-0.22"),
                ],
                "supports_api": False,
                "supports_auto_download": True,
                "languages": [
                    ("Auto (based on model)", "auto"),
                    ("English", "en"),
                    ("Arabic", "ar"),
                    ("French", "fr"),
                    ("German", "de"),
                    ("Spanish", "es"),
                    ("Russian", "ru"),
                    ("Chinese", "zh"),
                ],
                "default_params": {
                    "auto_download": True,
                    "return_confidence": True,
                    "return_words": True,
                }
            }

        elif model_name == "HuBERTArabicSTT":
            return {
                "model_sizes": [
                    ("Arabic Egyptian (HuBERT)", "omarxadel/hubert-large-arabic-egyptian"),
                    ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
                    ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
                    ("Arabic MSA", "facebook/wav2vec2-large-xlsr-53")
                ],
                "supports_api": False,
                "supports_hf_token": True,
                "languages": [
                    ("Arabic Egyptian", "ar-EG"),
                    ("Arabic Standard", "ar"),
                    ("Auto-detect", "auto"),
                ],
                "device_options": ["auto", "cpu", "cuda"],
                "default_params": {
                    "device": "auto",
                    "chunk_length": 20,
                    "return_confidence": True,
                    "max_audio_length": 120
                }
            }

        elif model_name == "CoquiSTT":
            return {
                "model_sizes": [
                    ("English Large Vocab", "english-large"),
                    ("English Huge Vocab", "english-huge"),
                    ("German", "german"),
                    ("French", "french"),
                    ("Spanish", "spanish")
                ],
                "supports_api": False,
                "supports_auto_download": True,
                "languages": [
                    ("English", "en"),
                    ("German", "de"),
                    ("French", "fr"),
                    ("Spanish", "es"),
                    ("Auto (based on model)", "auto"),
                ],
                "default_params": {
                    "auto_download": True,
                    "beam_width": 512,
                    "lm_alpha": 0.931289039105002,
                    "lm_beta": 1.1834137581510284,
                    "return_confidence": True,
                    "return_timestamps": False,
                }
            }

        elif model_name == "TawasulSTT":
            return {
                "model_sizes": [
                    ("Tawasul STT V0 (Arabic)", "Kareem35/Tawasul-STT-V0"),
                    ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
                    ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
                    ("Multilingual Fallback", "facebook/wav2vec2-large-xlsr-53")
                ],
                "supports_api": False,
                "supports_hf_token": True,
                "languages": [
                    ("Arabic Standard", "ar"),
                    ("Arabic Egyptian", "ar-EG"),
                    ("Arabic Saudi", "ar-SA"),
                    ("Arabic Jordanian", "ar-JO"),
                    ("Arabic Lebanese", "ar-LB"),
                    ("Arabic Syrian", "ar-SY"),
                    ("Arabic Iraqi", "ar-IQ"),
                    ("Auto-detect", "auto"),
                ],
                "device_options": ["auto", "cpu", "cuda"],
                "default_params": {
                    "device": "auto",
                    "chunk_length": 20,
                    "return_confidence": True,
                    "max_audio_length": 300
                }
            }

        # Default options for other models
        return {
            "model_sizes": ["default"],
            "supports_api": False,
            "languages": [("Auto-detect", "auto")],
            "default_params": {}
        }

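    # Illustrative sketch (not part of the original file): the options dict drives
    # the UI, and a caller can inspect it directly, e.g.
    #
    #     opts = ModelManager.get_model_options("VoskSTT")
    #     opts["supports_auto_download"]   # True
    #     opts["model_sizes"][0]           # ("English US Small (40MB)", "vosk-model-small-en-us-0.15")
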
    @staticmethod
    def load_model(model_name: str, **kwargs) -> str:
        """Load specified STT model with configuration."""
        global current_stt_model, current_model_config

        if model_name not in STT_MODELS:
            return f"❌ Unknown model: {model_name}. Available: {list(STT_MODELS.keys())}"

        try:
            model_class = STT_MODELS[model_name]

            # Handle TawasulSTT as a static class (don't instantiate)
            if model_name == "TawasulSTT":
                model_instance = model_class  # Use class directly for static methods
            else:
                # Instantiate the model for instance-based classes
                model_instance = model_class()

            if model_name == "WhisperSTT":
                # Handle WhisperSTT-specific loading
                model_size = kwargs.get("model_size", "base")
                use_api = kwargs.get("use_api", False)
                api_key = kwargs.get("api_key", "")

                if use_api and not api_key.strip():
                    return "❌ Error: API key required for API mode"

                # Load with optimized parameters
                load_params = {
                    "model_size": model_size,
                    "use_api": use_api,
                }

                if api_key:
                    load_params["api_key"] = api_key.strip()

                # Add quality optimization parameters for local models
                if not use_api:
                    load_params.update({
                        "temperature": 0.0,
                        "beam_size": 5,
                        "best_of": 5,
                        "patience": 2.0,
                        "condition_on_previous_text": True,
                    })

                model_instance.load_model(**load_params)

                current_model_config = {
                    "model_name": model_name,
                    "model_size": model_size,
                    "use_api": use_api
                }

                status = f"✅ {model_name} ({'API' if use_api else model_size}) loaded successfully"

            elif model_name == "Wav2Vec2ArabicSTT":
                # Handle Wav2Vec2 Arabic-specific loading
                device = kwargs.get("device", "auto")
                chunk_length = kwargs.get("chunk_length", 20)
                hf_token = kwargs.get("hf_token", "")
                model_id = kwargs.get("model_size", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic")

                load_params = {
                    "device": device,
                    "chunk_length": chunk_length,
                    "model_id": model_id,
                }

                if hf_token:
                    load_params["hf_token"] = hf_token.strip()

                model_instance.load_model(**load_params)

                current_model_config = {
                    "model_name": model_name,
                    "model_id": model_id,
                    "device": device,
                    "chunk_length": chunk_length
                }

                # Extract model name for display
                model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
                status = f"✅ {model_name} ({model_display_name}) loaded on {device}"

            elif model_name == "VoskSTT":
                # Handle VoskSTT-specific loading
                model_name_param = kwargs.get("model_size", "vosk-model-small-en-us-0.15")
                auto_download = kwargs.get("auto_download", True)

                load_params = {
                    "model_name": model_name_param,
                    "auto_download": auto_download,
                }

                model_instance.load_model(**load_params)

                current_model_config = {
                    "model_name": model_name,
                    "model_name_param": model_name_param,
                    "auto_download": auto_download
                }

                status = f"✅ {model_name} ({model_name_param}) loaded successfully"

            elif model_name == "HuBERTArabicSTT":
                # Handle HuBERT Arabic-specific loading
                device = kwargs.get("device", "auto")
                chunk_length = kwargs.get("chunk_length", 20)
                hf_token = kwargs.get("hf_token", "")
                model_id = kwargs.get("model_size", "omarxadel/hubert-large-arabic-egyptian")
                max_audio_length = kwargs.get("max_audio_length", 120)

                load_params = {
                    "device": device,
                    "chunk_length": chunk_length,
                    "model_id": model_id,
                    "max_audio_length": max_audio_length,
                }

                if hf_token:
                    load_params["hf_token"] = hf_token.strip()

                model_instance.load_model(**load_params)

                current_model_config = {
                    "model_name": model_name,
                    "model_id": model_id,
                    "device": device,
                    "chunk_length": chunk_length,
                    "max_audio_length": max_audio_length
                }

                # Extract model name for display
                model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
                status = f"✅ {model_name} ({model_display_name}) loaded on {device}"

            elif model_name == "CoquiSTT":
                # Handle Coqui STT-specific loading
                model_name_param = kwargs.get("model_size", "english-large")
                auto_download = kwargs.get("auto_download", True)
                beam_width = kwargs.get("beam_width", 512)
                lm_alpha = kwargs.get("lm_alpha", 0.931289039105002)
                lm_beta = kwargs.get("lm_beta", 1.1834137581510284)

                load_params = {
                    "model_name": model_name_param,
                    "auto_download": auto_download,
                    "beam_width": beam_width,
                    "lm_alpha": lm_alpha,
                    "lm_beta": lm_beta,
                }

                model_instance.load_model(**load_params)

                current_model_config = {
                    "model_name": model_name,
                    "model_name_param": model_name_param,
                    "auto_download": auto_download,
                    "beam_width": beam_width,
                    "lm_alpha": lm_alpha,
                    "lm_beta": lm_beta
                }

                status = f"✅ {model_name} ({model_name_param}) loaded successfully"

            elif model_name == "TawasulSTT":
                # Handle Tawasul STT-specific loading (static class)
                device = kwargs.get("device", "auto")
                chunk_length = kwargs.get("chunk_length", 20)
                hf_token = kwargs.get("hf_token", "")
                model_id = kwargs.get("model_size", "Kareem35/Tawasul-STT-V0")
                max_audio_length = kwargs.get("max_audio_length", 300)

                load_params = {
                    "device": device,
                    "chunk_length": chunk_length,
                    "model_id": model_id,
                    "max_audio_length": max_audio_length,
                }

                if hf_token:
                    load_params["hf_token"] = hf_token.strip()

                # Call the static method directly
                model_class.load_model(**load_params)

                current_model_config = {
                    "model_name": model_name,
                    "model_id": model_id,
                    "device": device,
                    "chunk_length": chunk_length,
                    "max_audio_length": max_audio_length
                }

                # Extract model name for display
                model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
                status = f"✅ {model_name} ({model_display_name}) loaded on {device}"

            else:
                # Generic model loading for future STT models
                model_instance.load_model(**kwargs)
                current_model_config = {"model_name": model_name, **kwargs}
                status = f"✅ {model_name} loaded successfully"

            current_stt_model = model_instance
            logger.info(status)
            return status

        except Exception as e:
            error_msg = f"❌ Error loading {model_name}: {str(e)}"
            logger.error(error_msg)
            return error_msg

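    # Illustrative sketch (not part of the original file): loading local Whisper
    # and checking the returned status string.
    #
    #     status = ModelManager.load_model("WhisperSTT", model_size="base", use_api=False)
    #     print(status)  # "✅ WhisperSTT (base) loaded successfully" on success
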
    @staticmethod
    def get_model_info() -> str:
        """Get information about available and loaded models."""
        info = f"**Available Models:** {', '.join(STT_MODELS.keys())}\n\n"

        if current_stt_model:
            model_info = current_stt_model.get_model_info()
            # Handle different key names for the model name
            model_name = model_info.get('model_name') or model_info.get('name', 'Unknown')
            info += f"**Currently Loaded:** {model_name}\n"
            info += f"**Status:** {'✅ Ready' if model_info['is_loaded'] else '❌ Not loaded'}\n"
            info += f"**Config:** {current_model_config}"
        else:
            info += "**Currently Loaded:** None"

        return info


class ImageGallery:
    """Handle the static image gallery with thumbnail and button navigation."""

    def __init__(self):
        """Initialize image gallery with predefined images."""
        # Define your static images here - you can add more images to this list
        self.images = [
            "https://picsum.photos/400/300?random=1",  # Random image 1
            "https://picsum.photos/400/300?random=2",  # Random image 2
            "https://picsum.photos/400/300?random=3",  # Random image 3
            "https://picsum.photos/400/300?random=4",  # Random image 4
            "https://picsum.photos/400/300?random=5",  # Random image 5
        ]

        # Alternative: Use local images (uncomment and modify paths as needed)
        # self.images = [
        #     "path/to/image1.jpg",
        #     "path/to/image2.png",
        #     "path/to/image3.jpg",
        #     "path/to/image4.png",
        #     "path/to/image5.jpg",
        # ]

        self.current_index = 0

    def get_image_by_index(self, index: int) -> str:
        """Get image by index with bounds checking."""
        if 0 <= index < len(self.images):
            self.current_index = index
            return self.images[index]
        return self.images[0]  # Return first image as fallback

    def get_image_info(self, index: int) -> str:
        """Get information about the current image."""
        return f"Image {index + 1} of {len(self.images)}"

    def get_total_images(self) -> int:
        """Get total number of images."""
        return len(self.images)

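# Illustrative sketch (not part of the original file): out-of-range indices fall
# back to the first image instead of raising.
#
#     gallery = ImageGallery()
#     gallery.get_image_by_index(99)  # returns gallery.images[0]
#     gallery.get_image_info(0)       # "Image 1 of 5"
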
class TranscriptionEngine:
    """Handle audio transcription using the loaded STT model."""

    @staticmethod
    def transcribe(audio_input: Tuple[int, np.ndarray],
                   language: Optional[str] = None) -> Tuple[str, str, str]:
        """
        Transcribe audio input using the currently loaded STT model.

        Args:
            audio_input: Tuple of (sample_rate, audio_data) from Gradio
            language: Language code for transcription

        Returns:
            Tuple of (transcription, confidence_info, processing_info)
        """
        if audio_input is None:
            return "❌ No audio provided", "", ""

        if not current_stt_model or not current_stt_model.is_loaded:
            return "❌ No STT model loaded. Please load a model first.", "", ""

        try:
            sample_rate, audio_data = audio_input

            # Preprocess audio
            processed_audio = AudioProcessor.preprocess(audio_data, sample_rate)

            # Quality checks
            quality = AudioProcessor.analyze_quality(processed_audio, 16000)

            if quality["duration"] < 0.5:
                return "❌ Audio too short (minimum 0.5 seconds)", "", ""

            if quality["max_amplitude"] < 0.001:
                return "❌ Audio too quiet or silent", "", f"Max amplitude: {quality['max_amplitude']:.6f}"

            # Set language for models that support it
            if hasattr(current_stt_model, 'set_language') and language and language != "auto":
                current_stt_model.set_language(language)

            # Transcribe using different approaches for different models
            start_time = time.time()

            # TawasulSTT is a static class that needs a file path
            if current_model_config.get('model_name') == 'TawasulSTT':
                # Save the audio to a temporary file for TawasulSTT
                import tempfile
                import soundfile as sf

                with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
                    temp_path = temp_file.name
                    sf.write(temp_path, processed_audio, 16000)

                try:
                    # Call TawasulSTT.transcribe() with the file path
                    transcription, confidence_info_raw, processing_info_raw = current_stt_model.transcribe(temp_path)

                    # Create a result-like object for consistency
                    class TempResult:
                        def __init__(self, text, confidence=None, processing_time=None):
                            self.text = text
                            self.confidence = confidence
                            self.processing_time = processing_time

                    # Extract confidence from confidence_info_raw if available
                    confidence_value = None
                    if confidence_info_raw and "Confidence:" in confidence_info_raw:
                        try:
                            conf_str = confidence_info_raw.split("Confidence:")[1].strip()
                            confidence_value = float(conf_str)
                        except (ValueError, IndexError):
                            confidence_value = None

                    processing_time = time.time() - start_time
                    result = TempResult(transcription, confidence_value, processing_time)

                finally:
                    # Clean up the temporary file
                    import os
                    try:
                        os.unlink(temp_path)
                    except OSError:
                        pass
            else:
                # For other STT models that use transcribe_audio
                result = current_stt_model.transcribe_audio(processed_audio, 16000)

            # Prepare output
            transcription = result.text.strip() if result.text else "No speech detected"

            # Filter out common false positives
            if transcription.lower() in ["you", "thank you.", "thanks for watching!", ""]:
                transcription = "🔇 No clear speech detected"

            # Confidence info
            confidence_info = ""
            if result.confidence is not None:
                confidence_info = f"Confidence: {result.confidence:.2%}"
                if result.confidence < 0.3:
                    confidence_info += " (Low - consider re-recording)"
            else:
                confidence_info = "Confidence: N/A"

            # Processing info
            processing_info = f"Processing: {result.processing_time or 0:.2f}s\n"
            processing_info += f"Model: {current_model_config.get('model_name', 'Unknown')}\n"
            processing_info += f"Audio: {quality['duration']:.2f}s, {quality['max_amplitude']:.3f} amplitude\n"
            processing_info += f"Quality: {'✅ Good' if quality['is_good_quality'] else '⚠️ Poor'}"

            return transcription, confidence_info, processing_info

        except Exception as e:
            error_msg = f"❌ Transcription error: {str(e)}"
            logger.error(error_msg)
            return error_msg, "", ""

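# Illustrative sketch (not part of the original file): Gradio's numpy audio
# component hands this function a (sample_rate, samples) tuple, so a direct call
# looks like:
#
#     text, conf, info = TranscriptionEngine.transcribe((16000, recorded_samples), "en")
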
class GradioInterface:
    """Create and manage the Gradio web interface."""

    @staticmethod
    def create_interface():
        """Create the main Gradio interface."""

        # Initialize image gallery
        gallery = ImageGallery()

        with gr.Blocks(
            title="🎙️ Modular Voice Transcriber with Image Gallery",
            theme=gr.themes.Soft()
        ) as demo:

            gr.Markdown(
                """
                # 🎙️ Modular Voice Transcriber with Image Gallery

                A flexible interface supporting multiple STT models with an integrated image viewer.
                Easily extensible for new transcription engines and image collections.
                """
            )

            # Image Gallery Section (at the top)
            with gr.Row():
                with gr.Column():
                    gr.Markdown("### 🖼️ Image Gallery")

                    # Main image display
                    image_display = gr.Image(
                        value=gallery.get_image_by_index(0),
                        label="Selected Image",
                        height=400,
                        width=500,
                        interactive=False
                    )

                    # Image info
                    image_info = gr.Textbox(
                        value=gallery.get_image_info(0),
                        label="Image Info",
                        interactive=False
                    )

                    # Horizontal thumbnail gallery with actual image previews
                    gr.Markdown("**Click on a thumbnail to view:**")
                    with gr.Row():
                        # Create thumbnail gallery using Gradio's Gallery component
                        thumbnail_gallery = gr.Gallery(
                            value=gallery.images,  # All images as thumbnails
                            label="Image Gallery",
                            show_label=False,
                            elem_id="thumbnail_gallery",
                            columns=len(gallery.images),  # Horizontal layout
                            rows=1,
                            height=120,  # Small thumbnail height
                            allow_preview=False,  # Don't show preview popup
                            interactive=True
                        )

                    # Navigation buttons (kept for convenience)
                    gr.Markdown("**Or use navigation:**")
                    with gr.Row():
                        prev_btn = gr.Button("◀️ Previous", size="sm")
                        next_btn = gr.Button("Next ▶️", size="sm")
                        random_btn = gr.Button("🎲 Random", size="sm")

            gr.Markdown("---")  # Separator line

            with gr.Row():
                # Model Configuration Panel
                with gr.Column(scale=1):
                    gr.Markdown("### 🔧 Model Configuration")

                    # Model selection
                    model_selector = gr.Dropdown(
                        choices=ModelManager.get_available_models(),
                        value="WhisperSTT",
                        label="STT Model",
                        info="Choose your speech-to-text engine"
                    )

                    # Dynamic model options (updated based on the selected model)
                    model_size = gr.Dropdown(
                        choices=["tiny", "base", "small", "medium", "large"],
                        value="base",
                        label="Model Size",
                        visible=True
                    )

                    use_api = gr.Checkbox(
                        label="Use API",
                        info="Use cloud API instead of local model",
                        visible=True
                    )

                    api_key = gr.Textbox(
                        label="API Key",
                        type="password",
                        placeholder="Enter API key...",
                        visible=False
                    )

                    # Device selection for models that support it
                    device_selector = gr.Dropdown(
                        choices=["auto", "cpu", "cuda"],
                        value="auto",
                        label="Device",
                        info="Processing device (auto recommended)",
                        visible=False
                    )

                    # HuggingFace token for private models
                    hf_token = gr.Textbox(
                        label="HuggingFace Token",
                        type="password",
                        placeholder="hf_...",
                        info="Optional: For private or experimental models",
                        visible=False
                    )

                    # Load button and status
                    load_btn = gr.Button("🔄 Load Model", variant="primary")
                    load_status = gr.Textbox(
                        label="Status",
                        value="No model loaded",
                        interactive=False
                    )

                    # Model info
                    model_info = gr.Markdown(ModelManager.get_model_info())

                # Transcription Panel
                with gr.Column(scale=2):
                    gr.Markdown("### 🎤 Voice Transcription")

                    # Language selection
                    language = gr.Dropdown(
                        choices=[("Auto-detect", "auto"), ("English", "en")],
                        value="auto",
                        label="Language"
                    )

                    # Audio input
                    audio_input = gr.Audio(
                        label="Record or Upload Audio",
                        type="numpy",
                        format="wav"
                    )

                    # Action buttons
                    with gr.Row():
                        transcribe_btn = gr.Button("🎯 Transcribe", variant="primary")
                        quality_btn = gr.Button("📊 Check Quality")
                        clear_btn = gr.Button("🗑️ Clear")

                    # Outputs
                    transcription_output = gr.Textbox(
                        label="📝 Transcription",
                        lines=4,
                        placeholder="Transcribed text will appear here..."
                    )

                    with gr.Row():
                        confidence_output = gr.Textbox(
                            label="🎯 Confidence",
                            interactive=False
                        )
                        processing_output = gr.Textbox(
                            label="⏱️ Processing Info",
                            interactive=False
                        )

                    quality_output = gr.Markdown(
                        value="",
                        visible=False,
                        label="📊 Audio Quality Analysis"
                    )

            # Usage tips
            gr.Markdown(
                """
                ### 💡 Tips for Best Results
                - **Record clearly** in a quiet environment
                - **Speak at a normal pace** - not too fast or slow
                - **Use good audio quality** - avoid background noise
                - **Try different models** - larger models are more accurate but slower
                - **Check the quality analysis** to identify audio issues
                - **Browse images** using the thumbnails or navigation buttons
                """
            )

            # Event handlers
            def update_model_options(model_name: str):
                """Update interface based on selected model."""
                options = ModelManager.get_model_options(model_name)

                # Determine visibility of components
                show_model_size = len(options["model_sizes"]) > 1
                show_api = options["supports_api"]
                show_device = "device_options" in options
                show_hf_token = options.get("supports_hf_token", False)

                # Extract model size options (handle both simple lists and tuples)
                if show_model_size and isinstance(options["model_sizes"][0], tuple):
                    # Model sizes are tuples of (display_name, value)
                    size_choices = options["model_sizes"]
                    size_value = size_choices[0][1]  # Use the value from the first tuple
                else:
                    # Model sizes are simple strings
                    size_choices = options["model_sizes"]
                    size_value = size_choices[0]

                return (
                    gr.update(choices=size_choices, value=size_value, visible=show_model_size),
                    gr.update(visible=show_api),
                    gr.update(visible=False),  # Hide API key initially
                    gr.update(choices=options["languages"], value="auto"),
                    gr.update(
                        choices=options.get("device_options", ["auto"]),
                        value="auto",
                        visible=show_device
                    ),
                    gr.update(visible=show_hf_token)
                )

            def toggle_api_key(use_api: bool):
                """Show/hide the API key field."""
                return gr.update(visible=use_api)

            def load_selected_model(model_name: str, model_size: str, use_api: bool, api_key: str, device: str, hf_token: str):
                """Load the selected model with configuration."""
                kwargs = {"model_size": model_size, "use_api": use_api}
                if api_key:
                    kwargs["api_key"] = api_key
                if device and device != "auto":
                    kwargs["device"] = device
                if hf_token:
                    kwargs["hf_token"] = hf_token
                return ModelManager.load_model(model_name, **kwargs)

            def analyze_audio_quality(audio_input):
                """Analyze and display audio quality."""
                if audio_input is None:
                    return gr.update(value="", visible=False)

                sample_rate, audio_data = audio_input
                quality = AudioProcessor.analyze_quality(audio_data, sample_rate)

                report = f"""
**📊 Audio Quality Analysis:**
- Duration: {quality['duration']:.2f}s
- Max amplitude: {quality['max_amplitude']:.3f}
- Clipping: {quality['clipping_ratio']:.2%}
- Silence ratio: {quality['silence_ratio']:.2%}
- Overall quality: {'✅ Good' if quality['is_good_quality'] else '⚠️ Needs improvement'}

**🔧 Recommendations:**
{_get_quality_recommendations(quality)}
"""

                return gr.update(value=report, visible=True)

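            # Note (illustrative, not in the original file): the six gr.update()
            # values returned by update_model_options are positional and must stay
            # in sync with the outputs list wired up below:
            # [model_size, use_api, api_key, language, device_selector, hf_token].
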
            # Image Gallery Event Handlers
            current_image_index = [0]  # Use a list so nested functions can mutate it

            def select_image_from_gallery(evt: gr.SelectData):
                """Handle image selection from a gallery thumbnail."""
                index = evt.index
                current_image_index[0] = index
                image_path = gallery.get_image_by_index(index)
                image_info_text = gallery.get_image_info(index)
                return image_path, image_info_text

            def go_to_previous_image():
                """Go to the previous image."""
                current_image_index[0] = max(0, current_image_index[0] - 1)
                image_path = gallery.get_image_by_index(current_image_index[0])
                image_info_text = gallery.get_image_info(current_image_index[0])
                return image_path, image_info_text

            def go_to_next_image():
                """Go to the next image."""
                current_image_index[0] = min(gallery.get_total_images() - 1, current_image_index[0] + 1)
                image_path = gallery.get_image_by_index(current_image_index[0])
                image_info_text = gallery.get_image_info(current_image_index[0])
                return image_path, image_info_text

            def go_to_random_image():
                """Go to a random image."""
                import random
                current_image_index[0] = random.randint(0, gallery.get_total_images() - 1)
                image_path = gallery.get_image_by_index(current_image_index[0])
                image_info_text = gallery.get_image_info(current_image_index[0])
                return image_path, image_info_text

            # Connect events
            model_selector.change(
                fn=update_model_options,
                inputs=model_selector,
                outputs=[model_size, use_api, api_key, language, device_selector, hf_token]
            )

            use_api.change(
                fn=toggle_api_key,
                inputs=use_api,
                outputs=api_key
            )

            load_btn.click(
                fn=load_selected_model,
                inputs=[model_selector, model_size, use_api, api_key, device_selector, hf_token],
                outputs=load_status
            ).then(
                fn=lambda: ModelManager.get_model_info(),
                outputs=model_info
            )

            transcribe_btn.click(
                fn=TranscriptionEngine.transcribe,
                inputs=[audio_input, language],
                outputs=[transcription_output, confidence_output, processing_output]
            )

            quality_btn.click(
                fn=analyze_audio_quality,
                inputs=audio_input,
                outputs=quality_output
            )

            clear_btn.click(
                fn=lambda: ("", "", "", gr.update(value="", visible=False)),
                outputs=[transcription_output, confidence_output, processing_output, quality_output]
            )

            # Auto-transcribe on audio change (optional)
            audio_input.change(
                fn=TranscriptionEngine.transcribe,
                inputs=[audio_input, language],
                outputs=[transcription_output, confidence_output, processing_output]
            )

            # Image Gallery Event Connections
            # Connect thumbnail gallery selection
            thumbnail_gallery.select(
                fn=select_image_from_gallery,
                outputs=[image_display, image_info]
            )

            # Connect navigation buttons
            prev_btn.click(
                fn=go_to_previous_image,
                outputs=[image_display, image_info]
            )

            next_btn.click(
                fn=go_to_next_image,
                outputs=[image_display, image_info]
            )

            random_btn.click(
                fn=go_to_random_image,
                outputs=[image_display, image_info]
            )

        return demo


def _get_quality_recommendations(quality: Dict[str, Any]) -> str:
    """Generate quality recommendations based on analysis."""
    recommendations = []

    if quality["duration"] < 1.0:
        recommendations.append("• Try recording for longer (1+ seconds)")

    if quality["max_amplitude"] < 0.1:
        recommendations.append("• Increase volume or move closer to the microphone")
    elif quality["max_amplitude"] > 0.9:
        recommendations.append("• Reduce volume to avoid clipping")

    if quality["clipping_ratio"] > 0.01:
        recommendations.append("• Audio is clipping - reduce input gain")

    if quality["silence_ratio"] > 0.5:
        recommendations.append("• Too much silence - record in a quieter environment")

    if not recommendations:
        recommendations.append("• Audio quality looks good!")

    return "\n".join(recommendations)

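# Illustrative sketch (not part of the original file): a quiet half-second clip
# triggers both the duration and the volume recommendations:
#
#     _get_quality_recommendations({"duration": 0.4, "max_amplitude": 0.05,
#                                   "clipping_ratio": 0.0, "silence_ratio": 0.2})
#     # "• Try recording for longer (1+ seconds)\n• Increase volume or move closer to the microphone"
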
def main():
    """Main application entry point."""
    # Check dependencies
    print("🔍 Checking dependencies...")

    try:
        import gradio
        print("✅ Gradio available")
    except ImportError:
        print("❌ Gradio not installed. Run: pip install gradio")
        return

    # Check available STT models
    print(f"🤖 Available STT models: {ModelManager.get_available_models()}")

    # Create and launch interface
    print("🚀 Launching Gradio interface...")
    demo = GradioInterface.create_interface()

    demo.launch(
        share=False,  # Set to True for public sharing
        server_name="127.0.0.1",
        server_port=7861,
        show_error=True
    )


if __name__ == "__main__":
    main()
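For reference, the classes above can also be driven without the web UI. The sketch below is illustrative only and not part of the commit; it assumes the module is importable as `app` and that local Whisper is installed, and it uses a synthetic tone in place of a real recording (which will most likely produce the "no clear speech" result).

```python
import numpy as np
# from app import ModelManager, TranscriptionEngine  # assumed module name

print(ModelManager.load_model("WhisperSTT", model_size="base", use_api=False))

# One second of a 440 Hz tone stands in for a real 16 kHz mono recording.
sr = 16000
tone = 0.2 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)
text, confidence, info = TranscriptionEngine.transcribe((sr, tone), "en")
print(text, confidence, info, sep="\n")
```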
hf-space
ADDED
@@ -0,0 +1 @@
Subproject commit 921859d10816a9cb386449308c5f66037f50deb2
pyproject.toml
ADDED
@@ -0,0 +1,139 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "modular-voice-transcriber"
version = "0.2.0"
description = "A modular Gradio-based web interface for speech-to-text transcription supporting multiple STT engines"
readme = "README.md"
requires-python = ">=3.8"
dependencies = [
    "gradio>=4.0.0",
    "soundfile>=0.12.1",
    "numpy>=1.21.0",
    "pathlib2>=2.3.0; python_version < '3.4'",
]

[project.optional-dependencies]
dev = [
    "pytest>=7.0.0",
    "black>=22.0.0",
    "flake8>=4.0.0",
    "mypy>=1.0.0",
]
# OpenAI Whisper STT (local models)
whisper = [
    "openai-whisper>=20230314",
    "torch>=1.10.0",
    "torchaudio>=0.10.0",
]
# OpenAI Whisper API
whisper-api = [
    "openai>=1.0.0",
]
# Wav2Vec2 models (Hugging Face)
wav2vec2 = [
    "transformers>=4.20.0",
    "torch>=1.12.0",
    "torchaudio>=0.12.0",
    "librosa>=0.9.0",  # Optional but recommended
]
# HuBERT models (Hugging Face, Arabic Egyptian)
hubert = [
    "transformers>=4.20.0",
    "torch>=1.12.0",
    "torchaudio>=0.12.0",
    "librosa>=0.9.2",
    "soundfile>=0.10.3",
    "huggingface-hub>=0.14.0",
]
# Coqui STT (open-source multilingual)
coqui = [
    "coqui-stt>=1.4.0",
    "soundfile>=0.10.3",
    "librosa>=0.9.2",
    "requests>=2.25.0",
]
# Vosk STT (offline recognition)
vosk = [
    "vosk>=0.3.42",
    "soundfile>=0.12.1",
]
# Tawasul STT (Arabic speech recognition)
tawasul = [
    "transformers>=4.20.0",
    "torch>=1.12.0",
    "torchaudio>=0.12.0",
    "librosa>=0.9.2",
    "soundfile>=0.10.3",
    "huggingface-hub>=0.14.0",
]
# Azure Speech Service
azure-speech = [
    "azure-cognitiveservices-speech>=1.25.0",
]
# Google Cloud Speech-to-Text
google-speech = [
    "google-cloud-speech>=2.15.0",
]
# AssemblyAI
assemblyai = [
    "assemblyai>=0.15.0",
]
# Amazon Transcribe
aws-transcribe = [
    "boto3>=1.26.0",
    "botocore>=1.29.0",
]
# All STT engines (for full functionality)
all-stt = [
    "openai-whisper>=20230314",
    "openai>=1.0.0",
    "transformers>=4.20.0",
    "torch>=1.12.0",
    "torchaudio>=0.12.0",
    "librosa>=0.9.0",
    "vosk>=0.3.42",
    "soundfile>=0.10.3",
    "huggingface-hub>=0.14.0",
    "coqui-stt>=1.4.0",
    "requests>=2.25.0",
    "azure-cognitiveservices-speech>=1.25.0",
    "google-cloud-speech>=2.15.0",
    "assemblyai>=0.15.0",
    "boto3>=1.26.0",
]
# Essential models (Whisper + Wav2Vec2 + HuBERT + Vosk + Coqui)
essential = [
    "openai-whisper>=20230314",
    "openai>=1.0.0",
    "transformers>=4.20.0",
    "torch>=1.12.0",
    "torchaudio>=0.12.0",
    "librosa>=0.9.0",
    "soundfile>=0.10.3",
    "huggingface-hub>=0.14.0",
    "vosk>=0.3.42",
    "coqui-stt>=1.4.0",
    "requests>=2.25.0",
]

[project.urls]
Homepage = "https://github.com/your-username/modular-voice-transcriber"
Repository = "https://github.com/your-username/modular-voice-transcriber.git"
Issues = "https://github.com/your-username/modular-voice-transcriber/issues"

[project.scripts]
voice-transcriber = "gradio_voice_transcriber_clean:main"

[tool.black]
line-length = 88
target-version = ['py38']

[tool.uv]
dev-dependencies = [
    "pytest>=7.0.0",
    "black>=22.0.0",
    "flake8>=4.0.0",
]
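As a quick sanity check after installing one of the extras above, a short script can probe which backends are importable. This is an illustrative sketch, not part of the commit; the import names are assumptions based on the dependency groups (in particular, the Coqui package is assumed to expose the `stt` module).

```python
import importlib.util

# Import names assumed to correspond to the optional-dependency groups above.
BACKENDS = {
    "whisper": "whisper",                      # openai-whisper
    "whisper-api": "openai",
    "wav2vec2 / hubert / tawasul": "transformers",
    "vosk": "vosk",
    "coqui": "stt",                            # assumed import name for Coqui STT
}

for group, module in BACKENDS.items():
    found = importlib.util.find_spec(module) is not None
    print(f"{group:28s} {'available' if found else 'missing'}")
```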
requirements.txt
CHANGED
@@ -1,43 +1,43 @@
# Modular Voice Transcriber - Core Dependencies
# Base requirements for the Gradio interface
gradio>=4.0.0
soundfile>=0.12.1
numpy>=1.21.0

# Essential STT Models (Whisper + Wav2Vec2)
# OpenAI Whisper (local and API)
openai-whisper>=20231117
openai>=1.0.0

# Wav2Vec2 (Hugging Face Transformers)
transformers>=4.20.0
torch>=1.12.0
torchaudio>=0.12.0

# Audio processing (recommended)
librosa>=0.9.0

# Optional: Vosk STT (for offline recognition)
# vosk>=0.3.42

# Optional: HuBERT Arabic STT (for Arabic Egyptian dialect)
# Use requirements_hubert.txt for full setup

# Optional: Coqui STT (open-source multilingual)
# Use requirements_coqui.txt for full setup

# Optional: Tawasul STT (for Arabic speech recognition)
# Use requirements_tawasul.txt for full setup

# Installation options:
# pip install -r requirements.txt           # Core + Essential STT models
# pip install -r requirements_whisper.txt   # Whisper-only setup
# pip install -r requirements_wav2vec2.txt  # Wav2Vec2-only setup
# pip install -r requirements_vosk.txt      # Vosk-only setup
# pip install -r requirements_hubert.txt    # HuBERT Arabic-only setup
# pip install -r requirements_coqui.txt     # Coqui STT-only setup
# pip install -r requirements_tawasul.txt   # Tawasul STT-only setup
# pip install -e .[essential]               # Same as core
# pip install -e .[all-stt]                 # All supported STT engines
# pip install -e .[whisper,wav2vec2,vosk,hubert,coqui,tawasul]  # Specific models only
# pip install -e .[dev]                     # Development dependencies
requirements_coqui.txt
ADDED
@@ -0,0 +1,6 @@
# Coqui STT Requirements
coqui-stt-model-manager
soundfile>=0.10.3
librosa>=0.9.2
numpy>=1.21.0
requests>=2.25.0
requirements_hubert.txt
ADDED
@@ -0,0 +1,7 @@
# HuBERT Arabic STT Requirements
torch>=1.12.0
transformers>=4.20.0
torchaudio>=0.12.0
librosa>=0.9.2
soundfile>=0.10.3
huggingface-hub>=0.14.0
requirements_tawasul.txt
ADDED
@@ -0,0 +1,13 @@
# Tawasul STT Requirements
# Arabic Speech Recognition using the Tawasul STT V0 model
torch>=1.12.0
transformers>=4.20.0
torchaudio>=0.12.0
librosa>=0.9.2
soundfile>=0.10.3
huggingface-hub>=0.14.0
numpy>=1.21.0

# Optional: For better performance
# accelerate>=0.20.0
# optimum>=1.8.0
requirements_vosk.txt
ADDED
@@ -0,0 +1,3 @@
# Vosk STT requirements
vosk>=0.3.42
soundfile>=0.12.1
requirements_wav2vec2.txt
ADDED
@@ -0,0 +1,24 @@
# Wav2Vec2 Arabic STT Requirements
# Minimal requirements for using the Wav2Vec2 Arabic Egyptian model

# Base requirements
gradio>=4.0.0
numpy>=1.21.0
soundfile>=0.12.1

# Wav2Vec2-specific requirements
transformers>=4.20.0
torch>=1.12.0
torchaudio>=0.12.0

# Optional but highly recommended for better audio processing
librosa>=0.9.0

# Installation:
# pip install -r requirements_wav2vec2.txt

# Notes:
# - First model load will download ~1.2GB from the Hugging Face Hub
# - GPU support is automatic if PyTorch with CUDA is installed
# - Model runs on CPU, but GPU is significantly faster for longer audio
# - Optimized for the Arabic Egyptian dialect but works with Standard Arabic
requirements_whisper.txt
ADDED
@@ -0,0 +1,27 @@
# OpenAI Whisper STT Requirements
# Requirements for using OpenAI Whisper (local and API)

# Base requirements
gradio>=4.0.0
numpy>=1.21.0
soundfile>=0.12.1

# Whisper local model requirements
openai-whisper>=20231117
torch>=1.10.0
torchaudio>=0.10.0

# Whisper API requirements
openai>=1.0.0

# Optional for better audio processing
librosa>=0.9.0

# Installation:
# pip install -r requirements_whisper.txt

# Notes:
# - Local models download automatically on first use
# - The API requires an OpenAI API key
# - Model sizes: tiny(39MB) < base(142MB) < small(461MB) < medium(1.5GB) < large(2.9GB)
# - GPU support is automatic if PyTorch with CUDA is installed
setup.py
ADDED
|
@@ -0,0 +1,212 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Setup script for Modular Voice Transcriber
|
| 4 |
+
|
| 5 |
+
This script helps set up the environment and install dependencies
|
| 6 |
+
based on which STT models you want to use.
|
| 7 |
+
"""
|
| 8 |
+
|
| 9 |
+
import subprocess
|
| 10 |
+
import sys
|
| 11 |
+
import argparse
|
| 12 |
+
from pathlib import Path
|
| 13 |
+
|
| 14 |
+
def run_command(command, description=""):
|
| 15 |
+
"""Run a command and handle errors."""
|
| 16 |
+
if description:
|
| 17 |
+
print(f"π¦ {description}...")
|
| 18 |
+
|
| 19 |
+
try:
|
| 20 |
+
result = subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
|
| 21 |
+
print(f"β
{description or 'Command'} completed successfully")
|
| 22 |
+
return True
|
| 23 |
+
except subprocess.CalledProcessError as e:
|
| 24 |
+
print(f"β {description or 'Command'} failed: {e}")
|
| 25 |
+
if e.stdout:
|
| 26 |
+
print(f"Output: {e.stdout}")
|
| 27 |
+
if e.stderr:
|
| 28 |
+
print(f"Error: {e.stderr}")
|
| 29 |
+
return False
|
| 30 |
+
|
| 31 |
+
def install_requirements(requirements_file):
|
| 32 |
+
"""Install requirements from a specific file."""
|
| 33 |
+
if not Path(requirements_file).exists():
|
| 34 |
+
print(f"β Requirements file not found: {requirements_file}")
|
| 35 |
+
return False
|
| 36 |
+
|
| 37 |
+
return run_command(
|
| 38 |
+
f"pip install -r {requirements_file}",
|
| 39 |
+
f"Installing requirements from {requirements_file}"
|
| 40 |
+
)
|
| 41 |
+
|
| 42 |
+
def install_optional_dependencies(groups):
|
| 43 |
+
"""Install optional dependencies using pip install -e ."""
|
| 44 |
+
group_str = ",".join(groups)
|
| 45 |
+
return run_command(
|
| 46 |
+
f"pip install -e .[{group_str}]",
|
| 47 |
+
f"Installing optional dependencies: {group_str}"
|
| 48 |
+
)
|
| 49 |
+
|
| 50 |
+
def test_imports(modules):
|
| 51 |
+
"""Test if modules can be imported."""
|
| 52 |
+
print("\nπ Testing module imports...")
|
| 53 |
+
all_good = True
|
| 54 |
+
|
| 55 |
+
for module in modules:
|
| 56 |
+
try:
|
| 57 |
+
__import__(module)
|
| 58 |
+
print(f"β
{module}")
|
| 59 |
+
except ImportError as e:
|
| 60 |
+
print(f"β {module}: {e}")
|
| 61 |
+
all_good = False
|
| 62 |
+
|
| 63 |
+
return all_good
|
| 64 |
+
|
def main():
    parser = argparse.ArgumentParser(description="Setup Modular Voice Transcriber")
    parser.add_argument(
        "--profile",
        choices=["minimal", "essential", "whisper-only", "wav2vec2-only", "vosk-only", "hubert-only", "coqui-only", "tawasul-only", "all"],
        default="essential",
        help="Installation profile (default: essential)"
    )
    parser.add_argument(
        "--test",
        action="store_true",
        help="Test the installation after setup"
    )

    args = parser.parse_args()

    print("🚀 Modular Voice Transcriber Setup")
    print("=" * 50)
    print(f"Profile: {args.profile}")
    print()

    # Install base requirements first
    print("📦 Installing base requirements...")
    base_success = run_command(
        "pip install gradio>=4.0.0 numpy>=1.21.0 soundfile>=0.12.1",
        "Installing base dependencies"
    )

    if not base_success:
        print("❌ Failed to install base requirements. Exiting.")
        return 1

    # Install profile-specific requirements
    success = True

    if args.profile == "minimal":
        print("\n📦 Minimal installation - Gradio interface only")
        # Base requirements already installed

    elif args.profile == "essential":
        print("\n📦 Essential installation - Whisper + Wav2Vec2")
        success = install_optional_dependencies(["essential"])

    elif args.profile == "whisper-only":
        print("\n📦 Whisper-only installation")
        success = install_requirements("requirements_whisper.txt")

    elif args.profile == "wav2vec2-only":
        print("\n📦 Wav2Vec2-only installation")
        success = install_requirements("requirements_wav2vec2.txt")

    elif args.profile == "vosk-only":
        print("\n📦 Vosk-only installation")
        success = install_requirements("requirements_vosk.txt")

    elif args.profile == "hubert-only":
        print("\n📦 HuBERT Arabic-only installation")
        success = install_requirements("requirements_hubert.txt")

    elif args.profile == "coqui-only":
        print("\n📦 Coqui STT-only installation")
        success = install_requirements("requirements_coqui.txt")

    elif args.profile == "tawasul-only":
        print("\n📦 Tawasul STT-only installation")
        success = install_requirements("requirements_tawasul.txt")

    elif args.profile == "all":
        print("\n📦 Full installation - All STT models")
        success = install_optional_dependencies(["all-stt"])

    if not success:
        print(f"❌ Failed to install {args.profile} profile requirements.")
        return 1

    # Test installation if requested
    if args.test:
        print("\n🧪 Testing installation...")

        # Basic imports
        basic_modules = ["gradio", "numpy", "soundfile"]
        test_imports(basic_modules)

        # Profile-specific tests
        if args.profile in ["essential", "whisper-only", "all"]:
            whisper_modules = ["whisper", "openai"]
            test_imports(whisper_modules)

        if args.profile in ["essential", "wav2vec2-only", "hubert-only", "tawasul-only", "all"]:
            wav2vec2_modules = ["transformers", "torch", "torchaudio"]
            test_imports(wav2vec2_modules)

        # Test our modules
        try:
            from stt.stt_base import BaseSTT
            from stt.whisper_stt import WhisperSTT
            print("✅ STT base classes")
        except ImportError as e:
            print(f"❌ STT base classes: {e}")

        if args.profile in ["essential", "wav2vec2-only", "all"]:
            try:
                from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
                print("✅ Wav2Vec2 Arabic STT")
            except ImportError as e:
                print(f"❌ Wav2Vec2 Arabic STT: {e}")

        if args.profile in ["hubert-only", "all"]:
            try:
                from stt.hubert_arabic_stt import HuBERTArabicSTT
                print("✅ HuBERT Arabic STT")
            except ImportError as e:
                print(f"❌ HuBERT Arabic STT: {e}")

        if args.profile in ["coqui-only", "all"]:
            try:
                from stt.coqui_stt import CoquiSTT
                print("✅ Coqui STT")
            except ImportError as e:
                print(f"❌ Coqui STT: {e}")

        if args.profile in ["tawasul-only", "all"]:
            try:
                from stt.tawasul_stt import TawasulSTT
                print("✅ Tawasul STT")
            except ImportError as e:
                print(f"❌ Tawasul STT: {e}")

        if args.profile in ["vosk-only", "all"]:
            try:
                from stt.vosk_stt import VoskSTT
                print("✅ Vosk STT")
            except ImportError as e:
                print(f"❌ Vosk STT: {e}")

    print("\n" + "=" * 50)
    print("🎉 Setup completed!")
    print("\n💡 Next steps:")
    print("   1. Run the transcriber:")
    print("      python gradio_voice_transcriber_clean.py")
    print("\n   2. Or test specific models:")
    print("      python test_wav2vec2_arabic.py")
    print("\n   3. Check available models in the web interface")

    return 0

if __name__ == "__main__":
    sys.exit(main())
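For reference, `test_imports` above is called with a plain list of module names; its definition lives earlier in setup.py and is not shown in this hunk. A minimal sketch of such a helper, assuming it simply attempts each import and reports the outcome, would be:

```python
# Hypothetical sketch of the test_imports helper used above; the real
# implementation is defined earlier in setup.py.
import importlib

def test_imports(modules):
    """Attempt to import each named module and report success or failure."""
    for name in modules:
        try:
            importlib.import_module(name)
            print(f"✅ {name}")
        except ImportError as e:
            print(f"❌ {name}: {e}")
```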
setup_hf_auth.py
ADDED
@@ -0,0 +1,156 @@
#!/usr/bin/env python3
"""
HuggingFace Authentication Helper

This script helps set up HuggingFace authentication for accessing private models.
"""

import os
import subprocess
import sys
from pathlib import Path

def check_hf_cli():
    """Check if the huggingface-hub CLI is available."""
    try:
        result = subprocess.run(["huggingface-cli", "--version"],
                                capture_output=True, text=True, check=True)
        print(f"✅ HuggingFace CLI available: {result.stdout.strip()}")
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        print("❌ HuggingFace CLI not found")
        return False

def install_hf_hub():
    """Install the huggingface-hub package."""
    print("📦 Installing huggingface-hub...")
    try:
        subprocess.run([sys.executable, "-m", "pip", "install", "huggingface-hub"],
                       check=True)
        print("✅ huggingface-hub installed successfully")
        return True
    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to install huggingface-hub: {e}")
        return False

def login_to_hf():
    """Login to HuggingFace using the CLI."""
    print("\n🔐 Logging in to HuggingFace...")
    print("You will be prompted to paste an access token.")
    print("If you don't have a token, create one at: https://huggingface.co/settings/tokens")

    try:
        subprocess.run(["huggingface-cli", "login"], check=True)
        print("✅ Successfully logged in to HuggingFace")
        return True
    except subprocess.CalledProcessError as e:
        print(f"❌ Failed to login: {e}")
        return False

def check_auth_status():
    """Check current authentication status."""
    try:
        result = subprocess.run(["huggingface-cli", "whoami"],
                                capture_output=True, text=True, check=True)
        username = result.stdout.strip()
        print(f"✅ Logged in as: {username}")
        return True, username
    except subprocess.CalledProcessError:
        print("❌ Not logged in to HuggingFace")
        return False, None

def test_model_access():
    """Test access to the Arabic Egyptian model."""
    print("\n🧪 Testing model access...")

    try:
        from transformers import AutoTokenizer

        # Test the main model
        models_to_test = [
            "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
            "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
            "facebook/wav2vec2-large-xlsr-53"
        ]

        for model_id in models_to_test:
            try:
                print(f"Testing: {model_id}")
                tokenizer = AutoTokenizer.from_pretrained(model_id)
                print(f"✅ {model_id} - Accessible")
                return True
            except Exception as e:
                print(f"❌ {model_id} - {str(e)}")
                continue

        print("❌ None of the models are accessible")
        return False

    except ImportError:
        print("❌ Transformers library not installed")
        return False

def manual_token_setup():
    """Guide the user through manual token setup."""
    print("\n📝 Manual Token Setup")
    print("=" * 40)
    print("1. Go to: https://huggingface.co/settings/tokens")
    print("2. Create a new token with 'Read' permissions")
    print("3. Copy the token (starts with 'hf_')")
    print("4. Use it in the Gradio interface:")
    print("   - Select 'Wav2Vec2ArabicSTT'")
    print("   - Choose 'Arabic Egyptian (Experimental)' model")
    print("   - Enter your token in 'HuggingFace Token' field")
    print("   - Click 'Load Model'")
    print("\n💡 Alternatively, set environment variable:")
    print("   export HF_TOKEN=your_token_here")

def main():
    """Main authentication helper."""
    print("🤗 HuggingFace Authentication Helper")
    print("=" * 50)

    # Check if already logged in
    is_logged_in, username = check_auth_status()

    if is_logged_in:
        print(f"\n✅ Already authenticated as: {username}")

        # Test model access
        if test_model_access():
            print("\n🎉 Authentication is working! You can use the experimental models.")
        else:
            print("\n⚠️ Authentication works but model access failed.")
            print("The experimental model might not be available.")
            print("Try using the standard Arabic model instead.")

        return 0

    # Not logged in, try to set up
    print("\n❌ Not authenticated with HuggingFace")

    # Check if CLI is available
    if not check_hf_cli():
        print("\n📦 Installing HuggingFace CLI...")
        if not install_hf_hub():
            print("\n❌ Failed to install HuggingFace Hub")
            manual_token_setup()
            return 1

    # Try to login
    print("\n🔐 Setting up authentication...")
    if login_to_hf():
        # Test access after login
        if test_model_access():
            print("\n🎉 Setup complete! You can now use all models.")
        else:
            print("\n⚠️ Login successful but some models may not be accessible.")
    else:
        print("\n❌ Automatic login failed")
        manual_token_setup()
        return 1

    return 0

if __name__ == "__main__":
    sys.exit(main())
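For non-interactive environments (CI, Spaces), the same authentication can be done in code via the `huggingface_hub` API rather than the CLI. A minimal sketch, using the HF_TOKEN environment variable the script already suggests (the literal token below is a placeholder, not a real credential):

```python
# Non-interactive alternative to `huggingface-cli login`.
import os
from huggingface_hub import login

login(token=os.environ.get("HF_TOKEN", "hf_your_token_here"))  # placeholder token
```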
stt/__init__.py
ADDED
@@ -0,0 +1,19 @@
"""
STT (Speech-to-Text) Package

This package contains various STT model implementations that inherit from BaseSTT.

Available STT Models:
- DummySTT: Test implementation for interface validation
- WhisperSTT: OpenAI Whisper implementation (local + API)
"""

from .stt_base import BaseSTT, STTResult, DummySTT

# Import WhisperSTT with error handling
try:
    from .whisper_stt import WhisperSTT
    __all__ = ['BaseSTT', 'STTResult', 'DummySTT', 'WhisperSTT']
except ImportError as e:
    # WhisperSTT dependencies not available
    __all__ = ['BaseSTT', 'STTResult', 'DummySTT']
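A quick smoke test of the package interface can use the bundled DummySTT; this sketch assumes DummySTT exposes the same classmethod interface (`load_model` / `transcribe_audio`) as the other BaseSTT implementations in this package:

```python
# Minimal smoke test of the stt package; assumes DummySTT follows the
# classmethod interface used by the other models in this package.
import numpy as np
from stt import DummySTT

DummySTT.load_model()
result = DummySTT.transcribe_audio(np.zeros(16000, dtype=np.float32), 16000)
print(result.text, result.confidence)
```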
stt/chirp3_stt.py
ADDED
@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""
Chirp3 Speech-to-Text (STT) Implementation

Chirp3-style STT wired to the BaseSTT interface and backed by the Google
Cloud Speech-to-Text API. Replace the API calls with actual Chirp3 model
loading and transcription logic as needed.
"""

import time
import numpy as np
from typing import Union
import io
import wave

try:
    from google.cloud import speech
except ImportError:
    speech = None

from .stt_base import BaseSTT, STTResult

class Chirp3STT(BaseSTT):
    """
    Chirp3STT implementation using the Google Cloud Speech-to-Text API.
    Accepts a file path or numpy array as input.
    """
    model_name = "Chirp3STT"
    client = None
    is_loaded = False
    config = {
        "language": "ar-EG",
        "sample_rate": 16000,
        "encoding": "LINEAR16",
        "enable_automatic_punctuation": True,
    }

    @classmethod
    def load_model(cls, **kwargs) -> None:
        """
        Initialize the Google Cloud Speech client.
        """
        if speech is None:
            # Guard against a missing optional dependency
            raise ImportError("google-cloud-speech not installed. Install with: pip install google-cloud-speech")
        cls.client = speech.SpeechClient()
        cls.is_loaded = True

    @classmethod
    def transcribe_audio(cls, audio_data: Union[str, np.ndarray], sample_rate: int = None):
        """
        Transcribe audio using the Google Cloud Speech-to-Text API.
        Args:
            audio_data: Path to a WAV file or numpy array (float32, mono)
            sample_rate: Sample rate if a numpy array is provided
        Returns:
            STTResult
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")

        start_time = time.time()
        # Check google-cloud-speech import
        if speech is None:
            return STTResult(
                text="",
                confidence=0.0,
                processing_time=0.0,
                metadata={"error": "google-cloud-speech not installed"}
            )

        # Prepare audio for the Google API
        audio_content = None
        actual_sample_rate = sample_rate or cls.config["sample_rate"]

        if isinstance(audio_data, str):
            # File path
            try:
                with open(audio_data, "rb") as f:
                    audio_content = f.read()
            except Exception as e:
                return STTResult(
                    text="",
                    confidence=0.0,
                    processing_time=0.0,
                    metadata={"error": f"Failed to read file: {e}"}
                )
        elif isinstance(audio_data, np.ndarray):
            # Numpy array (float32 or int16): encode as 16-bit PCM WAV in memory
            arr = audio_data
            if arr.dtype != np.int16:
                arr = (arr * 32767).astype(np.int16)
            buf = io.BytesIO()
            with wave.open(buf, 'wb') as wf:
                wf.setnchannels(1)
                wf.setsampwidth(2)
                wf.setframerate(actual_sample_rate)
                wf.writeframes(arr.tobytes())
            audio_content = buf.getvalue()
        else:
            return STTResult(
                text="",
                confidence=0.0,
                processing_time=0.0,
                metadata={"error": "Unsupported audio input type"}
            )

        # Prepare the Google API request
        audio = speech.RecognitionAudio(content=audio_content)
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
            sample_rate_hertz=actual_sample_rate,
            language_code=cls.config["language"],
            enable_automatic_punctuation=cls.config["enable_automatic_punctuation"],
        )
        try:
            response = cls.client.recognize(config=config, audio=audio)
            if response.results:
                transcript = response.results[0].alternatives[0].transcript
                confidence = response.results[0].alternatives[0].confidence if response.results[0].alternatives else 0.0
            else:
                transcript = ""
                confidence = 0.0
            processing_time = time.time() - start_time
            return STTResult(
                text=transcript,
                confidence=confidence,
                processing_time=processing_time,
                metadata={"api": "google-cloud-speech"}
            )
        except Exception as e:
            return STTResult(
                text="",
                confidence=0.0,
                processing_time=time.time() - start_time,
                metadata={"error": str(e)}
            )
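A usage sketch for Chirp3STT, assuming Google Cloud credentials are already configured (for example via the GOOGLE_APPLICATION_CREDENTIALS environment variable) and that a WAV file exists at the placeholder path:

```python
# Usage sketch; requires google-cloud-speech and valid GCP credentials.
from stt.chirp3_stt import Chirp3STT

Chirp3STT.load_model()
result = Chirp3STT.transcribe_audio("recording.wav")  # placeholder file path
print(result.text, result.confidence, result.metadata)
```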
stt/coqui_stt.py
ADDED
@@ -0,0 +1,390 @@
#!/usr/bin/env python3
"""
Coqui STT Implementation with Model Manager

This module provides speech-to-text functionality using coqui-stt-model-manager.
The model manager provides a simplified interface for downloading and using
Coqui STT models with automatic model management.

Features:
- Automatic model downloading and management
- Multiple pre-trained models available
- Language-specific models
- Offline processing
- Simplified API interface
- GPU acceleration support

Dependencies:
- coqui-stt-model-manager
- numpy
- soundfile
- librosa (for audio preprocessing)

Model Management:
    Models are automatically managed by the coqui-stt-model-manager.
    Popular models include English, German, French, Spanish, and more.
"""

import os
import logging
import tempfile
from pathlib import Path
from typing import Optional, Dict, Any, List, Tuple
import numpy as np

try:
    from coqui_stt_model_manager import CoquiSTTModelManager
    COQUI_STT_AVAILABLE = True
except ImportError:
    COQUI_STT_AVAILABLE = False
    CoquiSTTModelManager = None

try:
    import soundfile as sf
    SOUNDFILE_AVAILABLE = True
except ImportError:
    SOUNDFILE_AVAILABLE = False
    sf = None

try:
    import librosa
    LIBROSA_AVAILABLE = True
except ImportError:
    LIBROSA_AVAILABLE = False
    librosa = None

from .stt_base import BaseSTT

logger = logging.getLogger(__name__)


class CoquiSTT(BaseSTT):
    """
    Coqui STT implementation using coqui-stt-model-manager.

    Coqui STT provides high-quality open-source speech recognition
    with simplified model management through the model manager.
    """

    def __init__(self):
        """Initialize Coqui STT with model manager."""
        super().__init__()
        self.model_manager = None
        self.current_model = None
        self.model_info = {}

        # Available models through the model manager
        self.available_models = {
            "english-huge": {
                "language": "en",
                "description": "English model with huge vocabulary",
                "model_id": "english-huge-vocab"
            },
            "english-large": {
                "language": "en",
                "description": "English model with large vocabulary",
                "model_id": "english-large-vocab"
            },
            "german": {
                "language": "de",
                "description": "German language model",
                "model_id": "german"
            },
            "french": {
                "language": "fr",
                "description": "French language model",
                "model_id": "french"
            },
            "spanish": {
                "language": "es",
                "description": "Spanish language model",
                "model_id": "spanish"
            }
        }

    @classmethod
    def is_available(cls) -> bool:
        """Check if Coqui STT Model Manager is available."""
        try:
            from coqui_stt_model_manager import CoquiSTTModelManager
            import soundfile
            return True
        except ImportError as e:
            logger.warning(f"Coqui STT Model Manager dependencies not available: {e}")
            return False

    def check_dependencies(self) -> Tuple[bool, str]:
        """Check if required dependencies are available."""
        missing_deps = []

        if not COQUI_STT_AVAILABLE:
            missing_deps.append("coqui-stt-model-manager")

        if not SOUNDFILE_AVAILABLE:
            missing_deps.append("soundfile")

        if not LIBROSA_AVAILABLE:
            missing_deps.append("librosa (recommended for audio preprocessing)")

        if missing_deps:
            return False, f"Missing dependencies: {', '.join(missing_deps)}"

        return True, "All dependencies available"

    def load_model(
        self,
        model_name: str = "english-large",
        auto_download: bool = True,
        beam_width: int = 512,
        lm_alpha: float = 0.931289039105002,
        lm_beta: float = 1.1834137581510284,
        **kwargs
    ) -> None:
        """
        Load a Coqui STT model using the model manager.

        Args:
            model_name: Name of the model to load
            auto_download: Whether to automatically download the model if not found
            beam_width: Beam width for CTC beam search decoder
            lm_alpha: Language model alpha parameter
            lm_beta: Language model beta parameter
            **kwargs: Additional model parameters

        Raises:
            RuntimeError: If model loading fails
        """
        deps_ok, deps_msg = self.check_dependencies()
        if not deps_ok:
            raise RuntimeError(f"Dependency check failed: {deps_msg}")

        try:
            # Initialize model manager
            logger.info("Initializing Coqui STT Model Manager...")
            self.model_manager = CoquiSTTModelManager()

            # Get model identifier
            if model_name in self.available_models:
                model_id = self.available_models[model_name]["model_id"]
            else:
                model_id = model_name  # Use as custom model ID

            # Load the model through model manager
            logger.info(f"Loading Coqui STT model: {model_id}")

            if auto_download:
                # Download and load model
                self.current_model = self.model_manager.download_and_load_model(
                    model_id=model_id,
                    beam_width=beam_width,
                    lm_alpha=lm_alpha,
                    lm_beta=lm_beta
                )
            else:
                # Try to load existing model
                self.current_model = self.model_manager.load_model(
                    model_id=model_id,
                    beam_width=beam_width,
                    lm_alpha=lm_alpha,
                    lm_beta=lm_beta
                )

            # Store model info
            self.model_info = {
                "model_name": model_name,
                "model_id": model_id,
                "beam_width": beam_width,
                "lm_alpha": lm_alpha,
                "lm_beta": lm_beta,
            }

            if model_name in self.available_models:
                self.model_info.update(self.available_models[model_name])

            logger.info(f"Coqui STT model loaded successfully: {model_name}")

        except Exception as e:
            error_msg = f"Error loading Coqui STT model: {e}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    def preprocess_audio(self, audio_data: np.ndarray, sample_rate: int) -> np.ndarray:
        """
        Preprocess audio for Coqui STT.

        Coqui STT requires 16kHz mono audio.

        Args:
            audio_data: Audio data as numpy array
            sample_rate: Original sample rate

        Returns:
            Preprocessed audio data
        """
        try:
            # Convert to mono if needed
            if len(audio_data.shape) > 1:
                audio_data = np.mean(audio_data, axis=1)

            # Resample to 16kHz if needed
            target_sr = 16000
            if sample_rate != target_sr and LIBROSA_AVAILABLE:
                audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=target_sr)
                sample_rate = target_sr
            elif sample_rate != target_sr:
                logger.warning(f"Audio is {sample_rate}Hz but Coqui STT requires 16kHz. Install librosa for automatic resampling.")

            # Normalize audio
            audio_data = audio_data.astype(np.float32)
            if np.max(np.abs(audio_data)) > 0:
                audio_data = audio_data / np.max(np.abs(audio_data))

            # Convert to int16 as required by Coqui STT
            audio_data = (audio_data * 32767).astype(np.int16)

            return audio_data

        except Exception as e:
            logger.error(f"Error preprocessing audio: {e}")
            return audio_data

    def transcribe(self, audio_path: str, **kwargs) -> Tuple[str, str, str]:
        """
        Transcribe audio using Coqui STT Model Manager.

        Args:
            audio_path: Path to audio file
            **kwargs: Additional transcription parameters

        Returns:
            Tuple of (transcription, confidence_info, processing_info)
        """
        if self.current_model is None:
            return "❌ Model not loaded. Please load the model first.", "", ""

        try:
            import time
            start_time = time.time()

            # Validate file
            if not os.path.exists(audio_path):
                return f"❌ Audio file not found: {audio_path}", "", ""

            logger.info(f"🎵 Transcribing audio with Coqui STT: {audio_path}")

            # Load audio file
            audio_data, sample_rate = sf.read(audio_path)

            # Preprocess audio
            processed_audio = self.preprocess_audio(audio_data, sample_rate)

            # Get transcription parameters
            return_confidence = kwargs.get("return_confidence", True)
            return_timestamps = kwargs.get("return_timestamps", False)

            # Perform transcription using model manager
            if return_timestamps:
                # Use metadata for word timestamps
                result = self.model_manager.transcribe_with_metadata(
                    audio_data=processed_audio,
                    model=self.current_model
                )

                # Extract text and calculate confidence
                transcription = ""
                total_confidence = 0.0
                word_count = 0

                if hasattr(result, 'transcripts') and result.transcripts:
                    for token in result.transcripts[0].tokens:
                        transcription += token.text
                        if hasattr(token, 'confidence'):
                            total_confidence += token.confidence
                            word_count += 1

                avg_confidence = total_confidence / word_count if word_count > 0 else 0.0

            else:
                # Simple transcription
                transcription = self.model_manager.transcribe(
                    audio_data=processed_audio,
                    model=self.current_model
                )
                avg_confidence = 0.8  # Estimated confidence

            # Calculate processing time
            processing_time = time.time() - start_time
            audio_duration = len(audio_data) / sample_rate

            # Create info strings
            confidence_info = f"Confidence: {avg_confidence:.2f}" if return_confidence else ""
            processing_info = (
                f"Duration: {audio_duration:.1f}s | "
                f"Time: {processing_time:.1f}s | "
                f"Model: {self.model_info.get('model_name', 'unknown')}"
            )

            logger.info(f"✅ Transcription completed in {processing_time:.1f}s")

            return transcription.strip(), confidence_info, processing_info

        except Exception as e:
            error_msg = f"❌ Coqui STT transcription failed: {str(e)}"
            logger.error(error_msg)
            return error_msg, "", ""

    def get_supported_languages(self) -> List[str]:
        """Get list of supported languages."""
        return [
            "en",  # English
            "de",  # German
            "fr",  # French
            "es",  # Spanish
        ]

    def get_model_info(self) -> Dict[str, Any]:
        """Get information about the currently loaded model."""
        if self.current_model is None:
            return {"error": "No model loaded"}

        info = self.model_info.copy()
        info.update({
            "name": "Coqui STT with Model Manager",
            "is_loaded": self.current_model is not None,
            "supported_languages": self.get_supported_languages(),
            "architecture": "DeepSpeech-based CTC",
            "provider": "Coqui AI"
        })

        return info

    def get_available_models(self) -> List[Dict[str, Any]]:
        """Get list of available models."""
        models = []
        for name, info in self.available_models.items():
            model_info = {
                "name": name,
                "language": info["language"],
                "description": info["description"],
                "model_id": info["model_id"]
            }
            models.append(model_info)

        return models

    def cleanup(self):
        """Clean up resources."""
        if self.current_model is not None:
            # Model manager handles cleanup automatically
            self.current_model = None

        if self.model_manager is not None:
            self.model_manager = None

        self.model_info = {}

        logger.info("Coqui STT cleanup completed")


# Export the class
__all__ = ["CoquiSTT", "COQUI_STT_AVAILABLE"]
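Unlike the classmethod-based models in this package, CoquiSTT is instance-based. A minimal usage sketch, with a placeholder audio path:

```python
# Usage sketch for the instance-based CoquiSTT wrapper; the audio path
# below is a placeholder.
from stt.coqui_stt import CoquiSTT

if CoquiSTT.is_available():
    stt = CoquiSTT()
    stt.load_model(model_name="english-large")
    text, confidence_info, processing_info = stt.transcribe("recording.wav")
    print(text)
    print(confidence_info, "|", processing_info)
```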
stt/example_custom_stt.py
ADDED
@@ -0,0 +1,288 @@
#!/usr/bin/env python3
"""
Example: Adding a Custom STT Model

This file demonstrates how to add a new STT model to the modular voice transcriber.
Follow this pattern to integrate any speech-to-text service.

Usage:
    1. Create your STT class following the BaseSTT interface
    2. Add it to the STT_MODELS registry in gradio_voice_transcriber_clean.py
    3. Update ModelManager.get_model_options() if needed
"""

from typing import Union, Optional
import numpy as np
from pathlib import Path
import time
import random

from stt.stt_base import BaseSTT, STTResult


class ExampleCustomSTT(BaseSTT):
    """
    Example custom STT implementation.
    This shows how to create a new STT model following the BaseSTT interface.

    Replace this with actual integration to your preferred STT service:
    - Azure Speech Service
    - Google Cloud Speech-to-Text
    - Amazon Transcribe
    - IBM Watson Speech to Text
    - AssemblyAI
    - Rev.ai
    - Or any other service
    """

    model_name = "ExampleCustomSTT"
    model = None
    is_loaded = False
    config = {
        "api_key": None,
        "region": "us-east-1",
        "language": "en-US",
        "sample_rate": 16000
    }

    @classmethod
    def load_model(cls, api_key: str = "", region: str = "us-east-1", **kwargs) -> None:
        """
        Load/initialize the custom STT service.

        Args:
            api_key: API key for the service
            region: Service region
            **kwargs: Additional configuration parameters
        """
        if not api_key:
            raise ValueError("API key required for ExampleCustomSTT")

        # Update configuration
        cls.config.update({
            "api_key": api_key,
            "region": region,
            **kwargs
        })

        # Initialize your STT service here
        # Example:
        # cls.model = YourSTTClient(
        #     api_key=api_key,
        #     region=region
        # )

        # For demonstration, just simulate initialization
        print(f"Initializing ExampleCustomSTT with region {region}")
        time.sleep(1)  # Simulate initialization time

        cls.model = f"custom_stt_client_{region}"
        cls.is_loaded = True

        print(f"✅ {cls.model_name} loaded successfully")

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using the custom STT service.

        Args:
            audio_data: Audio input (numpy array or file path)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription result with metadata
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")

        start_time = time.time()

        # Handle different input types
        if isinstance(audio_data, np.ndarray):
            # For numpy arrays, you might need to:
            # 1. Save to temporary file
            # 2. Upload to service
            # 3. Get transcription result

            duration = len(audio_data) / (sample_rate or 16000)
            print(f"Transcribing numpy array: {duration:.2f}s")

            # Simulate API call
            time.sleep(0.5 + duration * 0.1)  # Simulate processing time

            # Example transcription (replace with actual API call)
            transcription = f"[Custom STT transcription of {duration:.1f}s audio]"
            confidence = random.uniform(0.85, 0.98)  # Simulate confidence

        else:
            # Handle file path
            file_path = Path(audio_data)
            print(f"Transcribing file: {file_path.name}")

            # Simulate file upload and transcription
            time.sleep(1.0)

            transcription = f"[Custom STT transcription of {file_path.name}]"
            confidence = random.uniform(0.80, 0.95)

        processing_time = time.time() - start_time

        # Prepare metadata
        metadata = {
            "model": cls.model_name,
            "region": cls.config["region"],
            "language": cls.config.get("language", "en-US"),
            "api_used": True,
            "service": "example-custom-service"
        }

        return STTResult(
            text=transcription,
            confidence=confidence,
            processing_time=processing_time,
            metadata=metadata
        )

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set the transcription language."""
        if language:
            cls.config["language"] = language
            print(f"Language set to: {language}")

    @classmethod
    def get_supported_languages(cls) -> list:
        """Get list of supported languages."""
        return [
            "en-US", "en-GB", "es-ES", "fr-FR", "de-DE",
            "it-IT", "pt-BR", "ja-JP", "ko-KR", "zh-CN"
        ]


# Example of how to integrate into the main application:
def integrate_custom_stt():
    """
    This function shows how to add the custom STT to the main application.

    Add this to gradio_voice_transcriber_clean.py:
    """

    # 1. Import your custom STT class
    from stt.example_custom_stt import ExampleCustomSTT

    # 2. Add to STT_MODELS registry
    STT_MODELS = {
        "WhisperSTT": WhisperSTT,
        "ExampleCustomSTT": ExampleCustomSTT,  # Add this line
    }

    # 3. Update ModelManager.get_model_options() to include custom options
    def get_model_options(model_name: str):
        if model_name == "ExampleCustomSTT":
            return {
                "model_sizes": ["default"],  # No size options for this service
                "supports_api": True,
                "languages": [
                    ("Auto-detect", "auto"),
                    ("English (US)", "en-US"),
                    ("English (UK)", "en-GB"),
                    ("Spanish", "es-ES"),
                    ("French", "fr-FR"),
                    ("German", "de-DE"),
                ],
                "custom_fields": [
                    {"name": "api_key", "type": "password", "label": "API Key", "required": True},
                    {"name": "region", "type": "dropdown", "label": "Region",
                     "choices": ["us-east-1", "us-west-2", "eu-west-1"], "default": "us-east-1"}
                ]
            }
        # ... existing code for other models

    # 4. Update the load_model function to handle custom parameters
    def load_model(model_name: str, **kwargs):
        if model_name == "ExampleCustomSTT":
            api_key = kwargs.get("api_key", "")
            region = kwargs.get("region", "us-east-1")

            if not api_key:
                return "❌ API key required for ExampleCustomSTT"

            ExampleCustomSTT.load_model(api_key=api_key, region=region)
            # ... rest of loading logic


# Real-world integration examples:

class AzureSTT(BaseSTT):
    """Example Azure Speech Service integration."""

    model_name = "AzureSTT"
    model = None
    is_loaded = False

    @classmethod
    def load_model(cls, subscription_key: str, region: str, **kwargs):
        """Initialize Azure Speech SDK."""
        try:
            import azure.cognitiveservices.speech as speechsdk

            speech_config = speechsdk.SpeechConfig(
                subscription=subscription_key,
                region=region
            )
            cls.model = speech_config
            cls.is_loaded = True
        except ImportError:
            raise ImportError("Install Azure Speech SDK: pip install azure-cognitiveservices-speech")

    @classmethod
    def transcribe_audio(cls, audio_data, sample_rate=None):
        """Transcribe using Azure Speech Service."""
        # Implement Azure-specific transcription logic
        pass


class GoogleSTT(BaseSTT):
    """Example Google Cloud Speech-to-Text integration."""

    model_name = "GoogleSTT"
    model = None
    is_loaded = False

    @classmethod
    def load_model(cls, credentials_path: str, **kwargs):
        """Initialize Google Cloud Speech client."""
        try:
            from google.cloud import speech
            import os

            os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
            cls.model = speech.SpeechClient()
            cls.is_loaded = True
        except ImportError:
            raise ImportError("Install Google Cloud Speech: pip install google-cloud-speech")

    @classmethod
    def transcribe_audio(cls, audio_data, sample_rate=None):
        """Transcribe using Google Cloud Speech."""
        # Implement Google-specific transcription logic
        pass


if __name__ == "__main__":
    # Test the example custom STT
    print("Testing ExampleCustomSTT...")

    # Load model
    ExampleCustomSTT.load_model(api_key="test-api-key", region="us-east-1")

    # Test transcription
    dummy_audio = np.random.randn(16000).astype(np.float32)  # 1 second
    result = ExampleCustomSTT.transcribe_audio(dummy_audio, 16000)

    print(f"Result: {result}")
    print(f"Metadata: {result.metadata}")
    print("✅ Custom STT integration test completed!")
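The registry pattern described in `integrate_custom_stt()` boils down to a name-to-class lookup; a minimal, self-contained sketch of how a caller could dispatch on it (the helper name is illustrative, not part of the main app):

```python
# Illustrative registry dispatch, mirroring the STT_MODELS pattern above;
# assumes the repo root is on sys.path so the stt package is importable.
from stt.example_custom_stt import ExampleCustomSTT

STT_MODELS = {"ExampleCustomSTT": ExampleCustomSTT}

def load_selected_model(model_name: str, **kwargs):
    """Look up an STT class by name and initialize it."""
    cls = STT_MODELS[model_name]
    cls.load_model(**kwargs)
    return cls

model = load_selected_model("ExampleCustomSTT", api_key="test-api-key")
```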
stt/hubert_arabic_stt.py
ADDED
@@ -0,0 +1,568 @@
#!/usr/bin/env python3
"""
HuBERT Arabic Egyptian STT Implementation

Hugging Face HuBERT speech-to-text implementation for Arabic Egyptian dialect
using the omarxadel/hubert-large-arabic-egyptian model.

Usage:
    from stt.hubert_arabic_stt import HuBERTArabicSTT

    # Load model
    HuBERTArabicSTT.load_model()

    # Transcribe audio
    result = HuBERTArabicSTT.transcribe_audio(audio_array, 16000)
    print(result.text)
"""

from typing import Union, Optional, Dict, Any
import numpy as np
from pathlib import Path
import time
import logging
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

try:
    import torch
    import torchaudio
    from transformers import (
        HubertForCTC,
        Wav2Vec2Processor,
        Wav2Vec2Tokenizer,
        AutoProcessor,
        AutoModelForCTC
    )
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    TRANSFORMERS_AVAILABLE = False

try:
    import librosa
    LIBROSA_AVAILABLE = True
except ImportError:
    LIBROSA_AVAILABLE = False

from .stt_base import BaseSTT, STTResult

logger = logging.getLogger(__name__)


class HuBERTArabicSTT(BaseSTT):
    """
    HuBERT Arabic Egyptian STT implementation using Hugging Face transformers.

    Supports:
    - Arabic Egyptian dialect transcription
    - Local model execution (no API required)
    - Automatic audio preprocessing
    - Confidence estimation
    - Chunked processing for long audio
    """

    model_name = "HuBERTArabicSTT"
    model = None
    processor = None
    tokenizer = None
    is_loaded = False
    config = {
        "model_id": "omarxadel/hubert-large-arabic-egyptian",
        "fallback_models": [
            "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
            "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
            "facebook/wav2vec2-large-xlsr-53",
        ],
        "device": "auto",  # auto, cpu, cuda
        "chunk_length": 15,  # seconds, for long audio processing
        "sample_rate": 16000,
        "return_confidence": True,
        "language": "ar-EG",  # Arabic Egyptian
        "hf_token": None,  # Hugging Face token for private models
        "use_auth_token": True  # Try to use cached token
    }

    @classmethod
    def load_model(cls,
                   model_id: str = None,
                   device: str = "auto",
                   hf_token: str = None,
                   **kwargs) -> None:
        """
        Load the HuBERT Arabic model.

        Args:
            model_id: Hugging Face model ID (default: omarxadel/hubert-large-arabic-egyptian)
            device: Device to use (auto, cpu, cuda)
            hf_token: Hugging Face token for private models (optional)
            **kwargs: Additional configuration parameters
        """
        if not TRANSFORMERS_AVAILABLE:
            raise ImportError(
                "Transformers library required. Install with: "
                "pip install transformers torch torchaudio"
            )

        # Update configuration
        cls.config.update({
            "model_id": model_id or cls.config["model_id"],
            "device": device,
            "hf_token": hf_token,
            **kwargs
        })

        # Determine device
        if device == "auto":
            if torch.cuda.is_available():
                device = "cuda"
            elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
                device = "mps"  # Apple Silicon
            else:
                device = "cpu"

        cls.config["device"] = device

        # Try to load the model, with fallbacks
        models_to_try = [cls.config["model_id"]] + cls.config["fallback_models"]

        for model_id_to_try in models_to_try:
            logger.info(f"Attempting to load HuBERT model: {model_id_to_try}")

            try:
                success = cls._load_model_with_id(model_id_to_try, device, hf_token)
                if success:
                    cls.config["model_id"] = model_id_to_try  # Update to successful model
                    return
            except Exception as e:
                logger.warning(f"Failed to load {model_id_to_try}: {e}")
                continue

        # If all models failed
        raise RuntimeError(f"Failed to load any HuBERT model. Tried: {models_to_try}")

    @classmethod
    def _load_model_with_id(cls, model_id: str, device: str, hf_token: str = None) -> bool:
        """
        Load a specific model ID with authentication handling.

        Returns:
            bool: True if successful, False otherwise
        """
        logger.info(f"Loading HuBERT model: {model_id}")
        logger.info(f"Using device: {device}")

        start_time = time.time()

        # Prepare authentication
        auth_kwargs = {}
        if hf_token:
            auth_kwargs["token"] = hf_token
        elif cls.config.get("use_auth_token", True):
            auth_kwargs["use_auth_token"] = True

        try:
            # Try to load as HuBERT model first
            if "hubert" in model_id.lower():
                logger.info("Loading as HuBERT model...")
                cls.processor = AutoProcessor.from_pretrained(model_id, **auth_kwargs)
                cls.model = AutoModelForCTC.from_pretrained(model_id, **auth_kwargs)
            else:
                # Fallback to Wav2Vec2 for other models
                logger.info("Loading as Wav2Vec2 model...")
                cls.processor = Wav2Vec2Processor.from_pretrained(model_id, **auth_kwargs)
                cls.model = AutoModelForCTC.from_pretrained(model_id, **auth_kwargs)

            # Move model to device
            cls.model = cls.model.to(device)
            cls.model.eval()  # Set to evaluation mode

            # Load tokenizer for confidence calculation
            try:
                cls.tokenizer = Wav2Vec2Tokenizer.from_pretrained(model_id, **auth_kwargs)
            except Exception as e:
                logger.warning(f"Could not load tokenizer: {e}")
                cls.tokenizer = None

            cls.is_loaded = True
            load_time = time.time() - start_time

            logger.info(f"✅ HuBERT model loaded successfully in {load_time:.2f}s")
            logger.info(f"Model vocab size: {cls.model.config.vocab_size}")

            return True

        except Exception as e:
            logger.error(f"Failed to load model {model_id}: {e}")
            return False

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using HuBERT Arabic model.

        Args:
            audio_data: Audio input (numpy array or file path)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription with confidence and metadata
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")

        start_time = time.time()

        try:
            # Process input audio
            processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)

            # Check audio length
            duration = len(processed_audio) / actual_sr
            if duration < 0.1:
                return STTResult(
                    text="",
                    confidence=0.0,
                    processing_time=time.time() - start_time,
                    metadata={"error": "Audio too short", "duration": duration}
                )

            # Process with model
            if duration > cls.config.get("chunk_length", 15):
                # Handle long audio by chunking
                text, confidence = cls._transcribe_long_audio(processed_audio, actual_sr)
            else:
                # Process short audio directly
                text, confidence = cls._transcribe_chunk(processed_audio, actual_sr)

            processing_time = time.time() - start_time

            # Prepare metadata
            metadata = {
                "model": cls.config["model_id"],
                "model_type": "HuBERT" if "hubert" in cls.config["model_id"].lower() else "Wav2Vec2",
                "device": cls.config["device"],
                "language": "ar-EG",
                "duration": duration,
                "sample_rate": actual_sr,
                "chunks_processed": 1 if duration <= cls.config.get("chunk_length", 15) else int(duration / cls.config["chunk_length"]) + 1
            }

            return STTResult(
                text=text.strip(),
                confidence=confidence,
                processing_time=processing_time,
                metadata=metadata
            )

        except Exception as e:
            error_msg = f"Transcription failed: {str(e)}"
            logger.error(error_msg)
            return STTResult(
                text="",
                confidence=0.0,
                processing_time=time.time() - start_time,
                metadata={"error": error_msg}
            )

    @classmethod
    def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
        """Process and validate audio input."""
        if isinstance(audio_data, (str, Path)):
            # Load audio file
            audio_path = Path(audio_data)
            if not audio_path.exists():
                raise FileNotFoundError(f"Audio file not found: {audio_path}")

            if LIBROSA_AVAILABLE:
                audio_array, sr = librosa.load(str(audio_path), sr=cls.config["sample_rate"])
|
| 323 |
+
else:
|
| 324 |
+
# Fallback to torchaudio
|
| 325 |
+
audio_tensor, sr = torchaudio.load(str(audio_path))
|
| 326 |
+
audio_array = audio_tensor.numpy().flatten()
|
| 327 |
+
|
| 328 |
+
# Resample if needed
|
| 329 |
+
if sr != cls.config["sample_rate"]:
|
| 330 |
+
resampler = torchaudio.transforms.Resample(sr, cls.config["sample_rate"])
|
| 331 |
+
audio_tensor = resampler(audio_tensor)
|
| 332 |
+
audio_array = audio_tensor.numpy().flatten()
|
| 333 |
+
sr = cls.config["sample_rate"]
|
| 334 |
+
|
| 335 |
+
else:
|
| 336 |
+
# Handle numpy array
|
| 337 |
+
audio_array = audio_data.astype(np.float32)
|
| 338 |
+
sr = sample_rate or cls.config["sample_rate"]
|
| 339 |
+
|
| 340 |
+
# Resample if needed
|
| 341 |
+
if sr != cls.config["sample_rate"]:
|
| 342 |
+
if LIBROSA_AVAILABLE:
|
| 343 |
+
audio_array = librosa.resample(
|
| 344 |
+
audio_array,
|
| 345 |
+
orig_sr=sr,
|
| 346 |
+
target_sr=cls.config["sample_rate"]
|
| 347 |
+
)
|
| 348 |
+
else:
|
| 349 |
+
# Simple resampling fallback
|
| 350 |
+
if sr > cls.config["sample_rate"]:
|
| 351 |
+
step = sr // cls.config["sample_rate"]
|
| 352 |
+
audio_array = audio_array[::step]
|
| 353 |
+
else:
|
| 354 |
+
repeat = cls.config["sample_rate"] // sr
|
| 355 |
+
audio_array = np.repeat(audio_array, repeat)
|
| 356 |
+
|
| 357 |
+
sr = cls.config["sample_rate"]
|
| 358 |
+
|
| 359 |
+
# Normalize audio
|
| 360 |
+
if len(audio_array) > 0:
|
| 361 |
+
# Convert to mono if stereo
|
| 362 |
+
if audio_array.ndim > 1:
|
| 363 |
+
audio_array = np.mean(audio_array, axis=0)
|
| 364 |
+
|
| 365 |
+
# Normalize to [-1, 1]
|
| 366 |
+
max_val = np.max(np.abs(audio_array))
|
| 367 |
+
if max_val > 0:
|
| 368 |
+
audio_array = audio_array / max_val
|
| 369 |
+
|
| 370 |
+
return audio_array, sr
|
| 371 |
+
|
| 372 |
+
@classmethod
|
| 373 |
+
def _transcribe_chunk(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
|
| 374 |
+
"""Transcribe a single audio chunk."""
|
| 375 |
+
# Preprocess audio
|
| 376 |
+
input_values = cls.processor(
|
| 377 |
+
audio_array,
|
| 378 |
+
sampling_rate=sample_rate,
|
| 379 |
+
return_tensors="pt",
|
| 380 |
+
padding=True
|
| 381 |
+
)
|
| 382 |
+
|
| 383 |
+
# Move to device
|
| 384 |
+
input_values = {k: v.to(cls.config["device"]) for k, v in input_values.items()}
|
| 385 |
+
|
| 386 |
+
# Inference
|
| 387 |
+
with torch.no_grad():
|
| 388 |
+
logits = cls.model(**input_values).logits
|
| 389 |
+
|
| 390 |
+
# Get predicted tokens
|
| 391 |
+
predicted_ids = torch.argmax(logits, dim=-1)
|
| 392 |
+
|
| 393 |
+
# Decode transcription
|
| 394 |
+
transcription = cls.processor.batch_decode(predicted_ids)[0]
|
| 395 |
+
|
| 396 |
+
# Calculate confidence (average of max probabilities)
|
| 397 |
+
confidence = cls._calculate_confidence(logits)
|
| 398 |
+
|
| 399 |
+
return transcription, confidence
|
| 400 |
+
|
| 401 |
+
@classmethod
|
| 402 |
+
def _transcribe_long_audio(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
|
| 403 |
+
"""Transcribe long audio by chunking."""
|
| 404 |
+
chunk_length = cls.config.get("chunk_length", 15)
|
| 405 |
+
chunk_samples = int(chunk_length * sample_rate)
|
| 406 |
+
overlap_samples = int(1.0 * sample_rate) # 1 second overlap
|
| 407 |
+
|
| 408 |
+
transcriptions = []
|
| 409 |
+
confidences = []
|
| 410 |
+
|
| 411 |
+
for start in range(0, len(audio_array), chunk_samples - overlap_samples):
|
| 412 |
+
end = min(start + chunk_samples, len(audio_array))
|
| 413 |
+
chunk = audio_array[start:end]
|
| 414 |
+
|
| 415 |
+
if len(chunk) < 0.5 * sample_rate: # Skip very short chunks
|
| 416 |
+
continue
|
| 417 |
+
|
| 418 |
+
try:
|
| 419 |
+
chunk_text, chunk_confidence = cls._transcribe_chunk(chunk, sample_rate)
|
| 420 |
+
if chunk_text.strip():
|
| 421 |
+
transcriptions.append(chunk_text.strip())
|
| 422 |
+
confidences.append(chunk_confidence)
|
| 423 |
+
except Exception as e:
|
| 424 |
+
logger.warning(f"Failed to transcribe chunk: {e}")
|
| 425 |
+
continue
|
| 426 |
+
|
| 427 |
+
# Combine results
|
| 428 |
+
full_text = " ".join(transcriptions)
|
| 429 |
+
avg_confidence = np.mean(confidences) if confidences else 0.0
|
| 430 |
+
|
| 431 |
+
return full_text, avg_confidence
|
| 432 |
+
|
| 433 |
+
@classmethod
|
| 434 |
+
def _calculate_confidence(cls, logits: torch.Tensor) -> float:
|
| 435 |
+
"""Calculate confidence score from model logits."""
|
| 436 |
+
try:
|
| 437 |
+
# Apply softmax to get probabilities
|
| 438 |
+
probabilities = torch.softmax(logits, dim=-1)
|
| 439 |
+
|
| 440 |
+
# Get maximum probability for each time step
|
| 441 |
+
max_probs = torch.max(probabilities, dim=-1)[0]
|
| 442 |
+
|
| 443 |
+
# Average over time steps (excluding padding if any)
|
| 444 |
+
confidence = torch.mean(max_probs).item()
|
| 445 |
+
|
| 446 |
+
return confidence
|
| 447 |
+
|
| 448 |
+
except Exception as e:
|
| 449 |
+
logger.warning(f"Could not calculate confidence: {e}")
|
| 450 |
+
return 0.5 # Default confidence
|
| 451 |
+
|
| 452 |
+
@classmethod
|
| 453 |
+
def get_available_models(cls) -> Dict[str, Any]:
|
| 454 |
+
"""Get information about available HuBERT models."""
|
| 455 |
+
models_info = {
|
| 456 |
+
"transformers_available": TRANSFORMERS_AVAILABLE,
|
| 457 |
+
"librosa_available": LIBROSA_AVAILABLE,
|
| 458 |
+
"torch_available": True if TRANSFORMERS_AVAILABLE else False,
|
| 459 |
+
}
|
| 460 |
+
|
| 461 |
+
if TRANSFORMERS_AVAILABLE:
|
| 462 |
+
models_info.update({
|
| 463 |
+
"cuda_available": torch.cuda.is_available(),
|
| 464 |
+
"mps_available": hasattr(torch.backends, 'mps') and torch.backends.mps.is_available(),
|
| 465 |
+
"hubert_models": [
|
| 466 |
+
{
|
| 467 |
+
"id": "omarxadel/hubert-large-arabic-egyptian",
|
| 468 |
+
"name": "HuBERT Arabic Egyptian (Large)",
|
| 469 |
+
"language": "Arabic Egyptian Dialect",
|
| 470 |
+
"size": "1.3GB",
|
| 471 |
+
"type": "HuBERT"
|
| 472 |
+
}
|
| 473 |
+
],
|
| 474 |
+
"fallback_models": [
|
| 475 |
+
{
|
| 476 |
+
"id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
|
| 477 |
+
"name": "Wav2Vec2 Arabic Egyptian",
|
| 478 |
+
"language": "Arabic Egyptian",
|
| 479 |
+
"size": "1.2GB",
|
| 480 |
+
"type": "Wav2Vec2"
|
| 481 |
+
},
|
| 482 |
+
{
|
| 483 |
+
"id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
|
| 484 |
+
"name": "Wav2Vec2 Arabic Standard",
|
| 485 |
+
"language": "Arabic Standard",
|
| 486 |
+
"size": "1.2GB",
|
| 487 |
+
"type": "Wav2Vec2"
|
| 488 |
+
},
|
| 489 |
+
{
|
| 490 |
+
"id": "facebook/wav2vec2-large-xlsr-53",
|
| 491 |
+
"name": "Wav2Vec2 Multilingual",
|
| 492 |
+
"language": "Multilingual",
|
| 493 |
+
"size": "1.2GB",
|
| 494 |
+
"type": "Wav2Vec2"
|
| 495 |
+
}
|
| 496 |
+
]
|
| 497 |
+
})
|
| 498 |
+
|
| 499 |
+
return models_info
|
| 500 |
+
|
| 501 |
+
@classmethod
|
| 502 |
+
def set_language(cls, language: Optional[str]) -> None:
|
| 503 |
+
"""Set language (for compatibility - this model is Arabic-specific)."""
|
| 504 |
+
if language and not language.startswith("ar"):
|
| 505 |
+
logger.warning(f"This model is optimized for Arabic. Language '{language}' may not work well.")
|
| 506 |
+
|
| 507 |
+
cls.config["language"] = language or "ar-EG"
|
| 508 |
+
logger.info(f"Language set to: {cls.config['language']}")
|
| 509 |
+
|
| 510 |
+
@classmethod
|
| 511 |
+
def set_device(cls, device: str) -> None:
|
| 512 |
+
"""Change device for model inference."""
|
| 513 |
+
if cls.model is not None:
|
| 514 |
+
cls.model = cls.model.to(device)
|
| 515 |
+
cls.config["device"] = device
|
| 516 |
+
logger.info(f"Model moved to device: {device}")
|
| 517 |
+
|
| 518 |
+
@classmethod
|
| 519 |
+
def get_model_info(cls) -> Dict[str, Any]:
|
| 520 |
+
"""Get detailed model information."""
|
| 521 |
+
base_info = super().get_model_info()
|
| 522 |
+
|
| 523 |
+
if cls.is_loaded:
|
| 524 |
+
base_info.update({
|
| 525 |
+
"model_id": cls.config["model_id"],
|
| 526 |
+
"model_type": "HuBERT" if "hubert" in cls.config["model_id"].lower() else "Wav2Vec2",
|
| 527 |
+
"device": cls.config["device"],
|
| 528 |
+
"language": cls.config["language"],
|
| 529 |
+
"sample_rate": cls.config["sample_rate"],
|
| 530 |
+
"vocab_size": cls.model.config.vocab_size if cls.model else None,
|
| 531 |
+
"chunk_length": cls.config["chunk_length"],
|
| 532 |
+
})
|
| 533 |
+
|
| 534 |
+
return base_info
|
| 535 |
+
|
| 536 |
+
|
| 537 |
+
# Example usage and testing
|
| 538 |
+
if __name__ == "__main__":
|
| 539 |
+
print("Testing HuBERT Arabic STT implementation...")
|
| 540 |
+
|
| 541 |
+
# Check availability
|
| 542 |
+
models_info = HuBERTArabicSTT.get_available_models()
|
| 543 |
+
print(f"Available models info: {models_info}")
|
| 544 |
+
|
| 545 |
+
if models_info["transformers_available"]:
|
| 546 |
+
try:
|
| 547 |
+
print("Loading HuBERT Arabic model...")
|
| 548 |
+
HuBERTArabicSTT.load_model(device="cpu") # Use CPU for testing
|
| 549 |
+
|
| 550 |
+
print("Creating test audio...")
|
| 551 |
+
# Generate test audio (2 seconds of random noise)
|
| 552 |
+
test_audio = np.random.randn(32000).astype(np.float32) * 0.1
|
| 553 |
+
|
| 554 |
+
print("Testing transcription...")
|
| 555 |
+
result = HuBERTArabicSTT.transcribe_audio(test_audio, 16000)
|
| 556 |
+
print(f"Result: {result}")
|
| 557 |
+
print(f"Metadata: {result.metadata}")
|
| 558 |
+
|
| 559 |
+
except Exception as e:
|
| 560 |
+
print(f"Error: {e}")
|
| 561 |
+
print("Note: This is expected with random audio - the model expects Arabic speech")
|
| 562 |
+
|
| 563 |
+
else:
|
| 564 |
+
print("Transformers not installed - install with:")
|
| 565 |
+
print("pip install transformers torch torchaudio")
|
| 566 |
+
print("Optional: pip install librosa (for better audio processing)")
|
| 567 |
+
|
| 568 |
+
print("\nHuBERT Arabic STT implementation ready!")
|
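
A minimal usage sketch for the module above (assuming the package layout shown in this commit; `speech.wav` is a hypothetical 16 kHz Arabic recording):

```python
# Sketch only: HuBERTArabicSTT reuses the BaseSTT helpers from stt/stt_base.py.
from stt.hubert_arabic_stt import HuBERTArabicSTT

HuBERTArabicSTT.load_model(device="auto")               # resolves to cuda/mps/cpu
result = HuBERTArabicSTT.transcribe_file("speech.wav")  # hypothetical input file
print(result.text, result.confidence)
print(result.metadata.get("model_type"))                # "HuBERT" or a "Wav2Vec2" fallback
```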
stt/stt_base.py
ADDED
@@ -0,0 +1,251 @@
#!/usr/bin/env python3
"""
Base Speech-to-Text (STT) Static Class

This module provides an abstract base class for implementing different STT models using static methods.
All STT implementations should inherit from this class and implement the required static methods.

Usage:
    from stt_base import BaseSTT

    class MySTTModel(BaseSTT):
        model = None  # Class variable to hold the model

        @classmethod
        def load_model(cls):
            # Load your specific model
            cls.model = your_model_loader()

        @classmethod
        def transcribe_audio(cls, audio_data, sample_rate):
            # Implement transcription logic
            return STTResult("transcribed text")
"""

from abc import ABC, abstractmethod
from typing import Union, Optional, Dict, Any, ClassVar
import numpy as np
from pathlib import Path
import time
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class STTResult:
    """Container for STT transcription results with metadata."""

    def __init__(self,
                 text: str,
                 confidence: Optional[float] = None,
                 processing_time: Optional[float] = None,
                 metadata: Optional[Dict[str, Any]] = None):
        self.text = text
        self.confidence = confidence
        self.processing_time = processing_time
        self.metadata = metadata or {}

    def __str__(self) -> str:
        return self.text

    def __repr__(self) -> str:
        return f"STTResult(text='{self.text}', confidence={self.confidence}, time={self.processing_time}s)"


class BaseSTT(ABC):
    """
    Abstract base class for Speech-to-Text models using static methods.

    All STT implementations must inherit from this class and implement:
    - load_model(): Load and initialize the STT model (classmethod)
    - transcribe_audio(): Convert audio to text (classmethod)
    """

    # Class variables that subclasses should define
    model_name: ClassVar[str] = "BaseSTT"
    model: ClassVar[Any] = None
    is_loaded: ClassVar[bool] = False
    config: ClassVar[Dict[str, Any]] = {}

    @classmethod
    @abstractmethod
    def load_model(cls) -> None:
        """
        Load and initialize the STT model.

        This method must be implemented by subclasses to load their specific model.
        After successful loading, set cls.is_loaded = True.
        """
        pass

    @classmethod
    @abstractmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio data to text.

        Args:
            audio_data: Audio input - can be:
                - numpy array of audio samples
                - path to audio file (str or Path)
            sample_rate: Sample rate of audio data (required for numpy arrays)

        Returns:
            STTResult: Object containing transcribed text and metadata

        This method must be implemented by subclasses.
        """
        pass

    @classmethod
    def transcribe_file(cls, file_path: Union[str, Path]) -> STTResult:
        """
        Transcribe an audio file to text.

        Args:
            file_path: Path to the audio file

        Returns:
            STTResult: Transcription result
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} model not loaded. Call load_model() first.")

        file_path = Path(file_path)
        if not file_path.exists():
            raise FileNotFoundError(f"Audio file not found: {file_path}")

        logger.info(f"Transcribing file: {file_path}")
        start_time = time.time()

        result = cls.transcribe_audio(file_path)

        if result.processing_time is None:
            result.processing_time = time.time() - start_time

        logger.info(f"Transcription completed in {result.processing_time:.2f}s")
        return result

    @classmethod
    def transcribe_numpy(cls,
                         audio_array: np.ndarray,
                         sample_rate: int) -> STTResult:
        """
        Transcribe a numpy array of audio samples to text.

        Args:
            audio_array: Audio samples as numpy array
            sample_rate: Sample rate of the audio

        Returns:
            STTResult: Transcription result
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} model not loaded. Call load_model() first.")

        if not isinstance(audio_array, np.ndarray):
            raise TypeError("audio_array must be a numpy array")

        logger.info(f"Transcribing numpy array: shape={audio_array.shape}, sr={sample_rate}")
        start_time = time.time()

        result = cls.transcribe_audio(audio_array, sample_rate)

        if result.processing_time is None:
            result.processing_time = time.time() - start_time

        logger.info(f"Transcription completed in {result.processing_time:.2f}s")
        return result

    @classmethod
    def get_model_info(cls) -> Dict[str, Any]:
        """
        Get information about the loaded model.

        Returns:
            Dict containing model information
        """
        return {
            "model_name": cls.model_name,
            "is_loaded": cls.is_loaded,
            "config": cls.config
        }

    @classmethod
    def ensure_loaded(cls) -> None:
        """Ensure the model is loaded, load it if not."""
        if not cls.is_loaded:
            cls.load_model()

    @classmethod
    def get_status(cls) -> str:
        """Get a string representation of the model status."""
        status = "loaded" if cls.is_loaded else "not loaded"
        return f"{cls.model_name} STT Model ({status})"


class DummySTT(BaseSTT):
    """
    Dummy STT implementation for testing the static class interface.
    Returns placeholder text instead of actual transcription.
    """

    model_name = "DummySTT"
    model = None
    is_loaded = False
    config = {}

    @classmethod
    def load_model(cls) -> None:
        """Load the dummy model (just a placeholder)."""
        logger.info("Loading dummy STT model...")
        time.sleep(0.5)  # Simulate loading time
        cls.model = "dummy_model_loaded"
        cls.is_loaded = True
        logger.info("Dummy STT model loaded successfully")

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Dummy transcription - returns placeholder text.
        """
        if isinstance(audio_data, np.ndarray):
            duration = len(audio_data) / (sample_rate or 16000)
            text = f"[Dummy transcription of {duration:.1f}s audio]"
        else:
            text = f"[Dummy transcription of file: {Path(audio_data).name}]"

        # Simulate processing time
        processing_time = 0.1 + np.random.random() * 0.2
        time.sleep(processing_time)

        return STTResult(
            text=text,
            confidence=0.95,
            processing_time=processing_time,
            metadata={"model": "dummy", "simulated": True}
        )


# Example usage and testing
if __name__ == "__main__":
    # Test the dummy implementation
    print("Testing BaseSTT with static DummySTT implementation...")

    # Load the model
    DummySTT.load_model()

    # Test with dummy numpy array
    dummy_audio = np.random.randn(16000)  # 1 second at 16kHz
    result = DummySTT.transcribe_numpy(dummy_audio, 16000)
    print(f"Numpy result: {result}")
    print(f"Model info: {DummySTT.get_model_info()}")
    print(f"Status: {DummySTT.get_status()}")

    print("\nStatic BaseSTT interface ready for real STT implementations!")
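
The two abstract classmethods above are the whole contract; a minimal subclass sketch (the model handle is a placeholder, not a real loader API, and this toy accepts numpy input only):

```python
from stt.stt_base import BaseSTT, STTResult

class EchoSTT(BaseSTT):
    """Toy engine that 'transcribes' by reporting the audio duration."""
    model_name = "EchoSTT"

    @classmethod
    def load_model(cls) -> None:
        cls.model = object()  # placeholder for a real model handle
        cls.is_loaded = True

    @classmethod
    def transcribe_audio(cls, audio_data, sample_rate=None) -> STTResult:
        duration = len(audio_data) / (sample_rate or 16000)
        return STTResult(text=f"[{duration:.1f}s of audio]", confidence=1.0)
```

`transcribe_file()`, `transcribe_numpy()`, `get_model_info()`, and `get_status()` then come for free from the base class.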
stt/tawasul_stt.py
ADDED
@@ -0,0 +1,448 @@
#!/usr/bin/env python3
"""
Tawasul STT V0 Implementation

This module provides Arabic speech-to-text transcription using the Tawasul STT V0 model,
which is specifically designed for Arabic language recognition.

Tawasul STT V0 is built on the Wav2Vec2 architecture and fine-tuned for Arabic speech.
"""

import os
import logging
import warnings
from pathlib import Path
from typing import Optional, Dict, Any, Tuple, List, Union
import time

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Try to import torch for type hints
try:
    import torch
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False
    # Create a dummy torch class for type hints when torch is not available
    class torch:
        class Tensor:
            pass

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class TawasulSTT:
    """
    Tawasul STT V0 static implementation for Arabic speech recognition.

    This class provides Arabic speech-to-text transcription using the Tawasul STT V0 model,
    which is specifically optimized for Arabic language variants.
    All methods are static for direct class-level access.
    """

    # Class variables for model state
    model = None
    processor = None
    tokenizer = None
    device = "cpu"
    model_id = "Kareem35/Tawasul-STT-V0"
    is_loaded = False
    hf_token = None
    chunk_length = 20  # seconds
    max_audio_length = 300  # 5 minutes max

    # Model fallback chain for better reliability
    fallback_models = [
        "Kareem35/Tawasul-STT-V0",
        "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
        "facebook/wav2vec2-large-xlsr-53",
        "facebook/wav2vec2-base-960h"
    ]

    @staticmethod
    def is_available() -> bool:
        """Check if Tawasul STT dependencies are available."""
        if not TORCH_AVAILABLE:
            logger.warning("Tawasul STT dependencies not available: torch not installed")
            return False

        try:
            import transformers
            import torchaudio
            import librosa
            import soundfile
            return True
        except ImportError as e:
            logger.warning(f"Tawasul STT dependencies not available: {e}")
            return False

    @staticmethod
    def load_model(
        model_id: Optional[str] = None,
        device: str = "auto",
        chunk_length: int = 20,
        hf_token: Optional[str] = None,
        max_audio_length: int = 300,
        **kwargs
    ) -> None:
        """
        Load the Tawasul STT V0 model.

        Args:
            model_id: Model identifier (defaults to Tawasul STT V0)
            device: Device to use ('auto', 'cpu', 'cuda', 'mps')
            chunk_length: Audio chunk length in seconds for processing
            hf_token: Hugging Face authentication token
            max_audio_length: Maximum audio length in seconds
            **kwargs: Additional model parameters
        """
        try:
            import torch
            import transformers
            from transformers import (
                Wav2Vec2ForCTC,
                Wav2Vec2Processor,
                Wav2Vec2Tokenizer
            )
            import torchaudio
            import librosa

            # Set authentication token
            if hf_token:
                TawasulSTT.hf_token = hf_token
                # Set token for transformers
                try:
                    from huggingface_hub import login
                    login(token=hf_token, add_to_git_credential=True)
                    logger.info("✅ Authenticated with Hugging Face")
                except Exception as e:
                    logger.warning(f"HF authentication warning: {e}")

            # Determine device
            if device == "auto":
                if torch.cuda.is_available():
                    TawasulSTT.device = "cuda"
                elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
                    TawasulSTT.device = "mps"
                else:
                    TawasulSTT.device = "cpu"
            else:
                TawasulSTT.device = device

            # Set model parameters
            TawasulSTT.model_id = model_id or "Kareem35/Tawasul-STT-V0"
            TawasulSTT.chunk_length = chunk_length
            TawasulSTT.max_audio_length = max_audio_length

            # Try loading the model with the fallback chain
            model_loaded = False
            last_error = None

            models_to_try = [TawasulSTT.model_id] + [m for m in TawasulSTT.fallback_models if m != TawasulSTT.model_id]

            for model_name in models_to_try:
                try:
                    logger.info(f"Loading Tawasul STT model: {model_name}")

                    # Load model components
                    TawasulSTT.processor = Wav2Vec2Processor.from_pretrained(
                        model_name,
                        token=TawasulSTT.hf_token,
                        trust_remote_code=True
                    )

                    TawasulSTT.model = Wav2Vec2ForCTC.from_pretrained(
                        model_name,
                        token=TawasulSTT.hf_token,
                        trust_remote_code=True
                    )

                    # Try to load tokenizer if available
                    try:
                        TawasulSTT.tokenizer = Wav2Vec2Tokenizer.from_pretrained(
                            model_name,
                            token=TawasulSTT.hf_token
                        )
                    except Exception:
                        logger.info("Using processor instead of separate tokenizer")
                        TawasulSTT.tokenizer = TawasulSTT.processor.tokenizer

                    # Move model to device
                    TawasulSTT.model = TawasulSTT.model.to(TawasulSTT.device)
                    TawasulSTT.model.eval()

                    # Test model with dummy input
                    test_input = torch.randn(1, 16000).to(TawasulSTT.device)
                    with torch.no_grad():
                        _ = TawasulSTT.model(test_input)

                    TawasulSTT.model_id = model_name  # Update to actually loaded model
                    model_loaded = True
                    logger.info(f"✅ Successfully loaded Tawasul STT model: {model_name} on {TawasulSTT.device}")
                    break

                except Exception as e:
                    last_error = e
                    logger.warning(f"Failed to load {model_name}: {str(e)}")
                    continue

            if not model_loaded:
                raise RuntimeError(f"Failed to load any Tawasul STT model. Last error: {last_error}")

            TawasulSTT.is_loaded = True

            # Log model info
            total_params = sum(p.numel() for p in TawasulSTT.model.parameters())
            logger.info(f"Model loaded: {total_params:,} parameters on {TawasulSTT.device}")

        except Exception as e:
            error_msg = f"Failed to load Tawasul STT model: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @staticmethod
    def _preprocess_audio(audio_path: str) -> Tuple[torch.Tensor, int]:
        """
        Preprocess an audio file for the Tawasul STT model.

        Args:
            audio_path: Path to audio file

        Returns:
            Tuple of (audio_tensor, sample_rate)
        """
        try:
            import librosa
            import torch
            import numpy as np

            # Load audio file with proper error handling
            try:
                # Load audio at 16kHz as required by Tawasul STT
                audio, sample_rate = librosa.load(audio_path, sr=16000, mono=True)
            except Exception as load_error:
                raise RuntimeError(f"Failed to load audio file {audio_path}: {load_error}")

            # Validate audio data
            if len(audio) == 0:
                raise RuntimeError("Audio file is empty or corrupted")

            # Convert to float32 for processing
            audio = audio.astype(np.float32)

            # Remove DC offset (center around zero)
            audio = audio - np.mean(audio)

            # Normalize audio with proper scaling
            max_val = np.max(np.abs(audio))
            if max_val > 0:
                # Normalize to [-0.95, 0.95] to prevent clipping
                audio = audio / max_val * 0.95
            else:
                logger.warning("Audio appears to be silent")

            # Apply a simple noise gate to reduce background noise
            noise_threshold = np.max(np.abs(audio)) * 0.01  # 1% of max amplitude
            audio = np.where(np.abs(audio) < noise_threshold, 0, audio)

            # Check and limit audio duration
            audio_duration = len(audio) / sample_rate
            if audio_duration > TawasulSTT.max_audio_length:
                logger.warning(f"Audio duration ({audio_duration:.1f}s) exceeds maximum ({TawasulSTT.max_audio_length}s)")
                # Truncate to maximum length
                max_samples = int(TawasulSTT.max_audio_length * sample_rate)
                audio = audio[:max_samples]
                logger.info(f"Audio truncated to {TawasulSTT.max_audio_length}s")

            # Validate minimum duration
            min_duration = 0.1  # 100ms minimum
            if audio_duration < min_duration:
                logger.warning(f"Audio duration ({audio_duration:.3f}s) is very short")

            # Convert to PyTorch tensor
            audio_tensor = torch.FloatTensor(audio)

            # Log preprocessing info
            final_duration = len(audio_tensor) / sample_rate
            logger.debug(f"Audio preprocessed: {final_duration:.2f}s, max_amp: {torch.max(torch.abs(audio_tensor)):.3f}")

            return audio_tensor, sample_rate

        except Exception as e:
            error_msg = f"Audio preprocessing failed for {audio_path}: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @staticmethod
    def _chunk_audio(audio_tensor: torch.Tensor, sample_rate: int) -> List[torch.Tensor]:
        """
        Split audio into chunks for processing.

        Args:
            audio_tensor: Audio tensor
            sample_rate: Sample rate

        Returns:
            List of audio chunks
        """
        chunk_samples = int(TawasulSTT.chunk_length * sample_rate)
        chunks = []

        for i in range(0, len(audio_tensor), chunk_samples):
            chunk = audio_tensor[i:i + chunk_samples]
            if len(chunk) > sample_rate * 0.5:  # Only process chunks > 0.5 seconds
                chunks.append(chunk)

        return chunks

    @staticmethod
    def _transcribe_chunk(audio_chunk: torch.Tensor) -> Tuple[str, float]:
        """
        Transcribe a single audio chunk.

        Args:
            audio_chunk: Audio chunk tensor

        Returns:
            Tuple of (transcription, confidence_score)
        """
        try:
            import torch

            # Prepare input
            input_values = TawasulSTT.processor(
                audio_chunk,
                sampling_rate=16000,
                return_tensors="pt"
            ).input_values

            input_values = input_values.to(TawasulSTT.device)

            # Get model predictions
            with torch.no_grad():
                logits = TawasulSTT.model(input_values).logits

            # Get predicted tokens
            predicted_ids = torch.argmax(logits, dim=-1)

            # Decode transcription
            transcription = TawasulSTT.processor.decode(predicted_ids[0])

            # Calculate confidence (approximation)
            probs = torch.nn.functional.softmax(logits, dim=-1)
            max_probs = torch.max(probs, dim=-1)[0]
            confidence = torch.mean(max_probs).item()

            return transcription.strip(), confidence

        except Exception as e:
            logger.error(f"Chunk transcription error: {str(e)}")
            return "", 0.0

    @staticmethod
    def transcribe(audio_path: str, **kwargs) -> Tuple[str, str, str]:
        """
        Transcribe an audio file using Tawasul STT V0.

        Args:
            audio_path: Path to audio file
            **kwargs: Additional transcription parameters

        Returns:
            Tuple of (transcription, confidence_info, processing_info)
        """
        if not TawasulSTT.is_loaded:
            return "❌ Model not loaded. Please load the model first.", "", ""

        try:
            start_time = time.time()

            # Validate file
            if not os.path.exists(audio_path):
                return f"❌ Audio file not found: {audio_path}", "", ""

            logger.info(f"🎵 Transcribing audio with Tawasul STT: {audio_path}")

            # Preprocess audio
            audio_tensor, sample_rate = TawasulSTT._preprocess_audio(audio_path)
            audio_duration = len(audio_tensor) / sample_rate

            # Process audio in chunks
            chunks = TawasulSTT._chunk_audio(audio_tensor, sample_rate)

            if not chunks:
                return "❌ No valid audio chunks found", "", ""

            # Transcribe each chunk
            transcriptions = []
            confidences = []

            for i, chunk in enumerate(chunks):
                logger.info(f"Processing chunk {i+1}/{len(chunks)}")
                transcription, confidence = TawasulSTT._transcribe_chunk(chunk)

                if transcription:  # Only add non-empty transcriptions
                    transcriptions.append(transcription)
                    confidences.append(confidence)

            # Combine results
            if not transcriptions:
                return "❌ No transcription generated", "", ""

            final_transcription = " ".join(transcriptions).strip()
            avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0

            # Calculate processing time
            processing_time = time.time() - start_time

            # Create info strings
            confidence_info = f"Confidence: {avg_confidence:.2f}"
            processing_info = (
                f"Duration: {audio_duration:.1f}s | "
                f"Chunks: {len(chunks)} | "
                f"Time: {processing_time:.1f}s | "
                f"Model: {TawasulSTT.model_id.split('/')[-1]}"
            )

            logger.info(f"✅ Transcription completed in {processing_time:.1f}s")

            return final_transcription, confidence_info, processing_info

        except Exception as e:
            error_msg = f"❌ Tawasul STT transcription failed: {str(e)}"
            logger.error(error_msg)
            return error_msg, "", ""

    @staticmethod
    def get_supported_languages() -> List[str]:
        """Get the list of supported languages."""
        return [
            "ar",     # Arabic
            "ar-SA",  # Saudi Arabic
            "ar-EG",  # Egyptian Arabic
            "ar-JO",  # Jordanian Arabic
            "ar-LB",  # Lebanese Arabic
            "ar-SY",  # Syrian Arabic
            "ar-IQ",  # Iraqi Arabic
            "ar-MA",  # Moroccan Arabic
            "ar-DZ",  # Algerian Arabic
            "ar-TN",  # Tunisian Arabic
        ]

    @staticmethod
    def get_model_info() -> Dict[str, Any]:
        """Get model information."""
        return {
            "name": "Tawasul STT V0",
            "model_id": TawasulSTT.model_id,
            "device": TawasulSTT.device,
            "is_loaded": TawasulSTT.is_loaded,
            "supported_languages": TawasulSTT.get_supported_languages(),
            "chunk_length": TawasulSTT.chunk_length,
            "max_audio_length": TawasulSTT.max_audio_length,
            "architecture": "Wav2Vec2",
            "specialization": "Arabic Speech Recognition"
        }
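
A usage sketch for the static `TawasulSTT` API above (`recording.wav` is a hypothetical input path):

```python
from stt.tawasul_stt import TawasulSTT

if TawasulSTT.is_available():
    TawasulSTT.load_model(device="auto")  # walks fallback_models if the primary fails
    text, conf_info, proc_info = TawasulSTT.transcribe("recording.wav")
    print(text)        # the Arabic transcription
    print(conf_info)   # e.g. "Confidence: 0.87"
    print(proc_info)   # duration / chunk count / timing / model summary
```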
stt/vosk_stt.py
ADDED
@@ -0,0 +1,561 @@
#!/usr/bin/env python3
"""
Vosk STT Implementation

Vosk speech-to-text implementation using the static BaseSTT interface.
Supports multiple languages with offline models and real-time recognition.

Usage:
    from stt.vosk_stt import VoskSTT

    # Load model
    VoskSTT.load_model(model_name="vosk-model-en-us-0.22")

    # Transcribe audio
    result = VoskSTT.transcribe_audio(audio_array, 16000)
    print(result.text)
"""

from typing import Union, Optional, Dict, Any, List
import numpy as np
from pathlib import Path
import time
import json
import logging
import os
import urllib.request
import zipfile
import tempfile

try:
    import vosk
    VOSK_AVAILABLE = True
except ImportError:
    VOSK_AVAILABLE = False

try:
    import soundfile as sf
    SOUNDFILE_AVAILABLE = True
except ImportError:
    SOUNDFILE_AVAILABLE = False

from .stt_base import BaseSTT, STTResult

logger = logging.getLogger(__name__)


class VoskSTT(BaseSTT):
    """
    Vosk STT implementation supporting multiple languages and offline recognition.

    Features:
    - Multiple language support
    - Offline processing (no internet required after model download)
    - Real-time recognition capability
    - Small to large model options
    - Word-level timestamps and confidence scores
    - Lightweight and fast
    """

    model_name = "VoskSTT"
    model = None
    recognizer = None
    is_loaded = False
    config = {
        "model_name": "vosk-model-en-us-0.22",  # Default English model
        "model_path": None,  # Auto-determined
        "sample_rate": 16000,
        "language": "en",
        "download_url_base": "https://alphacephei.com/vosk/models/",
        "models_dir": str(Path.home() / ".vosk" / "models"),
        "return_confidence": True,
        "return_words": True,
        "chunk_size": 4096,
    }

    # Available Vosk models with their properties
    AVAILABLE_MODELS = {
        # English models
        "vosk-model-en-us-0.22": {
            "language": "en-US",
            "size": "1.8GB",
            "description": "English US Large",
            "url": "vosk-model-en-us-0.22.zip"
        },
        "vosk-model-small-en-us-0.15": {
            "language": "en-US",
            "size": "40MB",
            "description": "English US Small",
            "url": "vosk-model-small-en-us-0.15.zip"
        },

        # Arabic models
        "vosk-model-ar-mgb2-0.4": {
            "language": "ar",
            "size": "318MB",
            "description": "Arabic",
            "url": "vosk-model-ar-mgb2-0.4.zip"
        },

        # Multilingual and other languages
        "vosk-model-small-cn-0.22": {
            "language": "zh-CN",
            "size": "42MB",
            "description": "Chinese Small",
            "url": "vosk-model-small-cn-0.22.zip"
        },
        "vosk-model-fr-0.22": {
            "language": "fr-FR",
            "size": "1.4GB",
            "description": "French",
            "url": "vosk-model-fr-0.22.zip"
        },
        "vosk-model-de-0.21": {
            "language": "de-DE",
            "size": "1.2GB",
            "description": "German",
            "url": "vosk-model-de-0.21.zip"
        },
        "vosk-model-es-0.42": {
            "language": "es-ES",
            "size": "1.4GB",
            "description": "Spanish",
            "url": "vosk-model-es-0.42.zip"
        },
        "vosk-model-ru-0.42": {
            "language": "ru-RU",
            "size": "1.5GB",
            "description": "Russian",
            "url": "vosk-model-ru-0.42.zip"
        },
        "vosk-model-small-ru-0.22": {
            "language": "ru-RU",
            "size": "45MB",
            "description": "Russian Small",
            "url": "vosk-model-small-ru-0.22.zip"
        }
    }

    @classmethod
    def load_model(cls,
                   model_name: str = None,
                   model_path: str = None,
                   auto_download: bool = True,
                   **kwargs) -> None:
        """
        Load the Vosk model.

        Args:
            model_name: Name of the Vosk model (e.g., "vosk-model-en-us-0.22")
            model_path: Direct path to model directory (overrides model_name)
            auto_download: Automatically download the model if not found
            **kwargs: Additional configuration parameters
        """
        if not VOSK_AVAILABLE:
            raise ImportError(
                "Vosk library required. Install with: pip install vosk"
            )

        # Update configuration
        cls.config.update({
            "model_name": model_name or cls.config["model_name"],
            "model_path": model_path,
            "auto_download": auto_download,
            **kwargs
        })

        # Determine model path
        if model_path:
            final_model_path = Path(model_path)
        else:
            final_model_path = cls._get_model_path(cls.config["model_name"])

        # Check if the model exists, download if needed
        if not final_model_path.exists():
            if auto_download:
                logger.info(f"Model not found at {final_model_path}")
                cls._download_model(cls.config["model_name"])
            else:
                raise FileNotFoundError(f"Model not found: {final_model_path}")

        logger.info(f"Loading Vosk model from: {final_model_path}")
        start_time = time.time()

        try:
            # Load the Vosk model
            cls.model = vosk.Model(str(final_model_path))

            # Create recognizer
            cls.recognizer = vosk.KaldiRecognizer(cls.model, cls.config["sample_rate"])

            # Configure recognizer options (with compatibility checks)
            try:
                if hasattr(cls.recognizer, 'SetMaxAlternatives'):
                    cls.recognizer.SetMaxAlternatives(cls.config.get("max_alternatives", 3))
                    logger.info("✅ Max alternatives enabled")
            except (AttributeError, Exception) as e:
                logger.warning(f"⚠️ Max alternatives not supported: {e}")

            try:
                if hasattr(cls.recognizer, 'SetReturnWordTimes'):
                    cls.recognizer.SetReturnWordTimes(cls.config.get("return_words", True))
                    logger.info("✅ Word timing enabled")
                else:
                    logger.info("ℹ️ Word timing not available in this Vosk version")
            except (AttributeError, Exception) as e:
                logger.warning(f"⚠️ Word timing not supported: {e}")

            try:
                if hasattr(cls.recognizer, 'SetWords'):
                    cls.recognizer.SetWords(cls.config.get("return_words", True))
                    logger.info("✅ Word-level output enabled")
            except (AttributeError, Exception) as e:
                logger.info(f"ℹ️ Word-level output using basic mode: {e}")

            # Test recognizer with a small sample
            test_result = cls.recognizer.AcceptWaveform(b'\x00' * 1600)  # 0.1s of silence
            logger.info("✅ Recognizer test successful")

            cls.is_loaded = True
            load_time = time.time() - start_time

            model_info = cls.AVAILABLE_MODELS.get(cls.config["model_name"], {})
            language = model_info.get("language", "unknown")

            logger.info(f"✅ Vosk model loaded successfully in {load_time:.2f}s")
            logger.info(f"Model: {cls.config['model_name']}")
            logger.info(f"Language: {language}")
            logger.info(f"Sample rate: {cls.config['sample_rate']}Hz")

        except Exception as e:
            cls.is_loaded = False
            error_msg = f"Failed to load Vosk model: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @classmethod
    def _get_model_path(cls, model_name: str) -> Path:
        """Get the local path where a model should be stored."""
        models_dir = Path(cls.config["models_dir"])
        models_dir.mkdir(parents=True, exist_ok=True)
        return models_dir / model_name

    @classmethod
    def _download_model(cls, model_name: str) -> None:
        """Download a Vosk model if it's not already available."""
        if model_name not in cls.AVAILABLE_MODELS:
            raise ValueError(f"Unknown model: {model_name}. Available: {list(cls.AVAILABLE_MODELS.keys())}")

        model_info = cls.AVAILABLE_MODELS[model_name]
        download_url = cls.config["download_url_base"] + model_info["url"]
        model_path = cls._get_model_path(model_name)

        if model_path.exists():
            logger.info(f"Model already exists: {model_path}")
            return

        logger.info(f"Downloading Vosk model: {model_name}")
        logger.info(f"Size: {model_info['size']} - This may take a while...")
        logger.info(f"URL: {download_url}")

        try:
            # Create temporary file for download
            with tempfile.NamedTemporaryFile(suffix='.zip', delete=False) as tmp_file:
                tmp_path = tmp_file.name

            # Download with progress
            def show_progress(block_num, block_size, total_size):
                if total_size > 0:
                    percent = min(100, (block_num * block_size * 100) // total_size)
                    if block_num % 100 == 0:  # Show progress every 100 blocks
                        print(f"\rDownloading... {percent}%", end="", flush=True)

            urllib.request.urlretrieve(download_url, tmp_path, show_progress)
            print()  # New line after progress

            logger.info(f"Download complete. Extracting to: {model_path}")

            # Extract the zip file
            with zipfile.ZipFile(tmp_path, 'r') as zip_ref:
                # Extract to a temporary directory first
                extract_dir = model_path.parent / f"{model_name}_temp"
                extract_dir.mkdir(exist_ok=True)
                zip_ref.extractall(extract_dir)

                # Find the actual model directory (should contain conf/ and graph/ subdirs)
                extracted_items = list(extract_dir.iterdir())
                if len(extracted_items) == 1 and extracted_items[0].is_dir():
                    # Move the inner directory to the final location
                    extracted_items[0].rename(model_path)
                    extract_dir.rmdir()
                else:
                    # Multiple items or files - rename the temp directory
                    extract_dir.rename(model_path)

            # Cleanup
            os.unlink(tmp_path)

            logger.info(f"✅ Model downloaded and extracted successfully: {model_path}")

        except Exception as e:
            # Cleanup on failure
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            if model_path.exists():
                import shutil
                shutil.rmtree(model_path, ignore_errors=True)

            error_msg = f"Failed to download model {model_name}: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using Vosk.

        Args:
            audio_data: Audio input (numpy array or file path)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription with confidence and metadata
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")

        start_time = time.time()

        try:
            # Process input audio
            processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)

            # Check audio length
            duration = len(processed_audio) / actual_sr
            if duration < 0.1:
                return STTResult(
                    text="",
                    confidence=0.0,
                    processing_time=time.time() - start_time,
                    metadata={"error": "Audio too short", "duration": duration}
|
| 343 |
+
)
|
| 344 |
+
|
| 345 |
+
# Transcribe using Vosk
|
| 346 |
+
result_text, confidence, words = cls._transcribe_with_vosk(processed_audio)
|
| 347 |
+
|
| 348 |
+
processing_time = time.time() - start_time
|
| 349 |
+
|
| 350 |
+
# Prepare metadata
|
| 351 |
+
metadata = {
|
| 352 |
+
"model": cls.config["model_name"],
|
| 353 |
+
"language": cls.AVAILABLE_MODELS.get(cls.config["model_name"], {}).get("language", "unknown"),
|
| 354 |
+
"duration": duration,
|
| 355 |
+
"sample_rate": actual_sr,
|
| 356 |
+
"words": words if cls.config.get("return_words", True) else None,
|
| 357 |
+
"vosk_version": vosk.__version__ if hasattr(vosk, '__version__') else "unknown"
|
| 358 |
+
}
|
| 359 |
+
|
| 360 |
+
return STTResult(
|
| 361 |
+
text=result_text.strip(),
|
| 362 |
+
confidence=confidence,
|
| 363 |
+
processing_time=processing_time,
|
| 364 |
+
metadata=metadata
|
| 365 |
+
)
|
| 366 |
+
|
| 367 |
+
except Exception as e:
|
| 368 |
+
error_msg = f"Transcription failed: {str(e)}"
|
| 369 |
+
logger.error(error_msg)
|
| 370 |
+
return STTResult(
|
| 371 |
+
text="",
|
| 372 |
+
confidence=0.0,
|
| 373 |
+
processing_time=time.time() - start_time,
|
| 374 |
+
metadata={"error": error_msg}
|
| 375 |
+
)
|
| 376 |
+
|
| 377 |
+
@classmethod
|
| 378 |
+
def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
|
| 379 |
+
"""Process and validate audio input."""
|
| 380 |
+
if isinstance(audio_data, (str, Path)):
|
| 381 |
+
# Load audio file
|
| 382 |
+
audio_path = Path(audio_data)
|
| 383 |
+
if not audio_path.exists():
|
| 384 |
+
raise FileNotFoundError(f"Audio file not found: {audio_path}")
|
| 385 |
+
|
| 386 |
+
if SOUNDFILE_AVAILABLE:
|
| 387 |
+
audio_array, sr = sf.read(str(audio_path))
|
| 388 |
+
if audio_array.ndim > 1:
|
| 389 |
+
audio_array = np.mean(audio_array, axis=1) # Convert to mono
|
| 390 |
+
else:
|
| 391 |
+
raise ImportError("soundfile required for file input. Install with: pip install soundfile")
|
| 392 |
+
else:
|
| 393 |
+
# Handle numpy array
|
| 394 |
+
audio_array = audio_data.astype(np.float32)
|
| 395 |
+
sr = sample_rate or cls.config["sample_rate"]
|
| 396 |
+
|
| 397 |
+
# Convert to mono if stereo
|
| 398 |
+
if audio_array.ndim > 1:
|
| 399 |
+
audio_array = np.mean(audio_array, axis=1)
|
| 400 |
+
|
| 401 |
+
# Resample to target sample rate if needed
|
| 402 |
+
target_sr = cls.config["sample_rate"]
|
| 403 |
+
if sr != target_sr:
|
| 404 |
+
# Simple resampling
|
| 405 |
+
if sr > target_sr:
|
| 406 |
+
step = sr // target_sr
|
| 407 |
+
audio_array = audio_array[::step]
|
| 408 |
+
else:
|
| 409 |
+
repeat = target_sr // sr
|
| 410 |
+
audio_array = np.repeat(audio_array, repeat)
|
| 411 |
+
sr = target_sr
|
| 412 |
+
|
| 413 |
+
# Normalize and convert to 16-bit PCM format expected by Vosk
|
| 414 |
+
audio_array = np.clip(audio_array, -1.0, 1.0)
|
| 415 |
+
audio_int16 = (audio_array * 32767).astype(np.int16)
|
| 416 |
+
|
| 417 |
+
return audio_int16, sr
|
| 418 |
+
|
| 419 |
+
@classmethod
|
| 420 |
+
def _transcribe_with_vosk(cls, audio_int16: np.ndarray) -> tuple:
|
| 421 |
+
"""Transcribe audio using Vosk recognizer."""
|
| 422 |
+
# Convert to bytes
|
| 423 |
+
audio_bytes = audio_int16.tobytes()
|
| 424 |
+
|
| 425 |
+
# Reset recognizer for new transcription
|
| 426 |
+
cls.recognizer = vosk.KaldiRecognizer(cls.model, cls.config["sample_rate"])
|
| 427 |
+
|
| 428 |
+
# Configure recognizer with compatibility checks
|
| 429 |
+
try:
|
| 430 |
+
if hasattr(cls.recognizer, 'SetReturnWordTimes'):
|
| 431 |
+
cls.recognizer.SetReturnWordTimes(cls.config.get("return_words", True))
|
| 432 |
+
except (AttributeError, Exception):
|
| 433 |
+
pass # Use basic recognition without word timing
|
| 434 |
+
|
| 435 |
+
# Process audio in chunks
|
| 436 |
+
chunk_size = cls.config.get("chunk_size", 4096)
|
| 437 |
+
partial_results = []
|
| 438 |
+
|
| 439 |
+
for i in range(0, len(audio_bytes), chunk_size):
|
| 440 |
+
chunk = audio_bytes[i:i + chunk_size]
|
| 441 |
+
if cls.recognizer.AcceptWaveform(chunk):
|
| 442 |
+
result = json.loads(cls.recognizer.Result())
|
| 443 |
+
if result.get("text"):
|
| 444 |
+
partial_results.append(result)
|
| 445 |
+
|
| 446 |
+
# Get final result
|
| 447 |
+
final_result = json.loads(cls.recognizer.FinalResult())
|
| 448 |
+
if final_result.get("text"):
|
| 449 |
+
partial_results.append(final_result)
|
| 450 |
+
|
| 451 |
+
# Combine all results
|
| 452 |
+
if not partial_results:
|
| 453 |
+
return "", 0.0, []
|
| 454 |
+
|
| 455 |
+
# Extract text and confidence
|
| 456 |
+
full_text = " ".join([r.get("text", "") for r in partial_results]).strip()
|
| 457 |
+
|
| 458 |
+
# Calculate average confidence from words
|
| 459 |
+
all_words = []
|
| 460 |
+
total_confidence = 0.0
|
| 461 |
+
word_count = 0
|
| 462 |
+
|
| 463 |
+
for result in partial_results:
|
| 464 |
+
if "result" in result:
|
| 465 |
+
words = result["result"]
|
| 466 |
+
all_words.extend(words)
|
| 467 |
+
for word in words:
|
| 468 |
+
if "conf" in word:
|
| 469 |
+
total_confidence += word["conf"]
|
| 470 |
+
word_count += 1
|
| 471 |
+
|
| 472 |
+
average_confidence = total_confidence / word_count if word_count > 0 else 0.0
|
| 473 |
+
|
| 474 |
+
return full_text, average_confidence, all_words
|
| 475 |
+
|
| 476 |
+
@classmethod
|
| 477 |
+
def get_available_models(cls) -> Dict[str, Any]:
|
| 478 |
+
"""Get information about available Vosk models."""
|
| 479 |
+
return {
|
| 480 |
+
"vosk_available": VOSK_AVAILABLE,
|
| 481 |
+
"soundfile_available": SOUNDFILE_AVAILABLE,
|
| 482 |
+
"models": cls.AVAILABLE_MODELS,
|
| 483 |
+
"models_dir": cls.config["models_dir"],
|
| 484 |
+
"downloaded_models": cls._get_downloaded_models()
|
| 485 |
+
}
|
| 486 |
+
|
| 487 |
+
@classmethod
|
| 488 |
+
def _get_downloaded_models(cls) -> List[str]:
|
| 489 |
+
"""Get list of already downloaded models."""
|
| 490 |
+
models_dir = Path(cls.config["models_dir"])
|
| 491 |
+
if not models_dir.exists():
|
| 492 |
+
return []
|
| 493 |
+
|
| 494 |
+
downloaded = []
|
| 495 |
+
for model_dir in models_dir.iterdir():
|
| 496 |
+
if model_dir.is_dir() and model_dir.name in cls.AVAILABLE_MODELS:
|
| 497 |
+
# Check if it looks like a valid Vosk model
|
| 498 |
+
if (model_dir / "conf").exists() or (model_dir / "graph").exists():
|
| 499 |
+
downloaded.append(model_dir.name)
|
| 500 |
+
|
| 501 |
+
return downloaded
|
| 502 |
+
|
| 503 |
+
@classmethod
|
| 504 |
+
def set_language(cls, language: Optional[str]) -> None:
|
| 505 |
+
"""Set language preference (informational - model determines actual language)."""
|
| 506 |
+
cls.config["language"] = language or "auto"
|
| 507 |
+
logger.info(f"Language preference set to: {cls.config['language']}")
|
| 508 |
+
logger.info("Note: Vosk model determines actual recognition language")
|
| 509 |
+
|
| 510 |
+
@classmethod
|
| 511 |
+
def list_models(cls) -> None:
|
| 512 |
+
"""Print available models in a formatted way."""
|
| 513 |
+
print("\nπ€ Available Vosk Models:")
|
| 514 |
+
print("=" * 60)
|
| 515 |
+
|
| 516 |
+
downloaded = cls._get_downloaded_models()
|
| 517 |
+
|
| 518 |
+
for model_name, info in cls.AVAILABLE_MODELS.items():
|
| 519 |
+
status = "β
Downloaded" if model_name in downloaded else "π₯ Available"
|
| 520 |
+
print(f"{status} {model_name}")
|
| 521 |
+
print(f" Language: {info['language']}")
|
| 522 |
+
print(f" Size: {info['size']}")
|
| 523 |
+
print(f" Description: {info['description']}")
|
| 524 |
+
print()
|
| 525 |
+
|
| 526 |
+
|
| 527 |
+
# Example usage and testing
|
| 528 |
+
if __name__ == "__main__":
|
| 529 |
+
print("Testing Vosk STT implementation...")
|
| 530 |
+
|
| 531 |
+
# Check availability
|
| 532 |
+
models_info = VoskSTT.get_available_models()
|
| 533 |
+
print(f"Vosk available: {models_info['vosk_available']}")
|
| 534 |
+
print(f"Downloaded models: {models_info['downloaded_models']}")
|
| 535 |
+
|
| 536 |
+
if models_info["vosk_available"]:
|
| 537 |
+
try:
|
| 538 |
+
# List available models
|
| 539 |
+
VoskSTT.list_models()
|
| 540 |
+
|
| 541 |
+
# Try to load a small English model for testing
|
| 542 |
+
print("\\nTesting with small English model...")
|
| 543 |
+
VoskSTT.load_model(model_name="vosk-model-small-en-us-0.15")
|
| 544 |
+
|
| 545 |
+
# Test with dummy audio
|
| 546 |
+
print("Testing transcription...")
|
| 547 |
+
test_audio = np.random.randn(16000).astype(np.float32) * 0.1
|
| 548 |
+
|
| 549 |
+
result = VoskSTT.transcribe_audio(test_audio, 16000)
|
| 550 |
+
print(f"Result: {result}")
|
| 551 |
+
print(f"Metadata: {result.metadata}")
|
| 552 |
+
|
| 553 |
+
except Exception as e:
|
| 554 |
+
print(f"Error: {e}")
|
| 555 |
+
print("Note: This is expected with random audio")
|
| 556 |
+
|
| 557 |
+
else:
|
| 558 |
+
print("Vosk not installed - install with: pip install vosk")
|
| 559 |
+
print("Also recommended: pip install soundfile")
|
| 560 |
+
|
| 561 |
+
print("\\nVosk STT implementation ready!")
|
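For reviewers who want to exercise this backend in isolation, here is a minimal usage sketch built from the class's own API. The recording path is illustrative; any 16 kHz mono WAV works, and numpy input is accepted too.

```python
from stt.vosk_stt import VoskSTT

# Downloads the small English model on first use, then loads it.
VoskSTT.load_model(model_name="vosk-model-small-en-us-0.15")

# File input goes through soundfile; numpy input is resampled and
# converted to 16-bit PCM before the chunked AcceptWaveform loop runs.
result = VoskSTT.transcribe_audio("recordings/recorded_audio.wav")  # illustrative path

print(result.text)                    # combined text from all partial results
print(result.confidence)             # mean per-word "conf" across segments
print(result.metadata.get("words"))  # per-word timing, if the Vosk build supports it
```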
stt/wav2vec2_arabic_stt.py
ADDED
@@ -0,0 +1,509 @@
#!/usr/bin/env python3
"""
Wav2Vec2 Arabic Egyptian STT Implementation

Hugging Face Wav2Vec2 speech-to-text implementation for the Egyptian Arabic dialect
using the wav2vec2-large-xlsr-53-arabic-egyptian model.

Usage:
    from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT

    # Load model
    Wav2Vec2ArabicSTT.load_model()

    # Transcribe audio
    result = Wav2Vec2ArabicSTT.transcribe_audio(audio_array, 16000)
    print(result.text)
"""

from typing import Union, Optional, Dict, Any
import numpy as np
from pathlib import Path
import time
import logging
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

try:
    import torch
    import torchaudio
    from transformers import (
        Wav2Vec2ForCTC,
        Wav2Vec2Processor,
        Wav2Vec2Tokenizer
    )
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    TRANSFORMERS_AVAILABLE = False

try:
    import librosa
    LIBROSA_AVAILABLE = True
except ImportError:
    LIBROSA_AVAILABLE = False

from .stt_base import BaseSTT, STTResult

logger = logging.getLogger(__name__)


class Wav2Vec2ArabicSTT(BaseSTT):
    """
    Wav2Vec2 Arabic Egyptian STT implementation using Hugging Face transformers.

    Supports:
    - Arabic Egyptian dialect transcription
    - Local model execution (no API required)
    - Automatic audio preprocessing
    - Confidence estimation
    """

    model_name = "Wav2Vec2ArabicSTT"
    model = None
    processor = None
    tokenizer = None
    is_loaded = False
    config = {
        "model_id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
        "fallback_models": [
            "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
            "facebook/wav2vec2-large-xlsr-53",
            "facebook/wav2vec2-base-960h"  # English fallback
        ],
        "device": "auto",  # auto, cpu, cuda
        "chunk_length": 20,  # seconds, for long audio processing
        "sample_rate": 16000,
        "return_confidence": True,
        "language": "ar-EG",  # Arabic Egyptian
        "hf_token": None,  # Hugging Face token for private models
        "use_auth_token": True  # Try to use a cached token
    }

    @classmethod
    def load_model(cls,
                   model_id: str = None,
                   device: str = "auto",
                   hf_token: str = None,
                   **kwargs) -> None:
        """
        Load the Wav2Vec2 Arabic model.

        Args:
            model_id: Hugging Face model ID (default: wav2vec2-large-xlsr-53-arabic-egyptian)
            device: Device to use (auto, cpu, cuda)
            hf_token: Hugging Face token for private models (optional)
            **kwargs: Additional configuration parameters
        """
        if not TRANSFORMERS_AVAILABLE:
            raise ImportError(
                "Transformers library required. Install with: "
                "pip install transformers torch torchaudio"
            )

        # Update configuration
        cls.config.update({
            "model_id": model_id or cls.config["model_id"],
            "device": device,
            "hf_token": hf_token,
            **kwargs
        })

        # Determine device
        if device == "auto":
            if torch.cuda.is_available():
                device = "cuda"
            elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
                device = "mps"  # Apple Silicon
            else:
                device = "cpu"

        cls.config["device"] = device

        # Try to load the model, with fallbacks
        models_to_try = [cls.config["model_id"]] + cls.config["fallback_models"]

        for model_id_to_try in models_to_try:
            logger.info(f"Attempting to load model: {model_id_to_try}")

            try:
                success = cls._load_model_with_id(model_id_to_try, device, hf_token)
                if success:
                    cls.config["model_id"] = model_id_to_try  # Update to the model that loaded
                    return
            except Exception as e:
                logger.warning(f"Failed to load {model_id_to_try}: {e}")
                continue

        # If all models failed
        raise RuntimeError(f"Failed to load any Wav2Vec2 model. Tried: {models_to_try}")

    @classmethod
    def _load_model_with_id(cls, model_id: str, device: str, hf_token: str = None) -> bool:
        """
        Load a specific model ID with authentication handling.

        Returns:
            bool: True if successful, False otherwise
        """
        logger.info(f"Loading Wav2Vec2 model: {model_id}")
        logger.info(f"Using device: {device}")

        start_time = time.time()

        # Prepare authentication
        auth_kwargs = {}
        if hf_token:
            auth_kwargs["token"] = hf_token
        elif cls.config.get("use_auth_token", True):
            auth_kwargs["use_auth_token"] = True

        # Load processor and tokenizer
        logger.info("Loading processor...")
        cls.processor = Wav2Vec2Processor.from_pretrained(model_id, **auth_kwargs)

        logger.info("Loading model...")
        cls.model = Wav2Vec2ForCTC.from_pretrained(model_id, **auth_kwargs)

        # Move the model to the target device
        cls.model = cls.model.to(device)
        cls.model.eval()  # Set to evaluation mode

        # Load tokenizer for confidence calculation
        try:
            cls.tokenizer = Wav2Vec2Tokenizer.from_pretrained(model_id, **auth_kwargs)
        except Exception as e:
            logger.warning(f"Could not load tokenizer: {e}")
            cls.tokenizer = None

        cls.is_loaded = True
        load_time = time.time() - start_time

        logger.info(f"✅ Wav2Vec2 model loaded successfully in {load_time:.2f}s")
        logger.info(f"Model vocab size: {cls.model.config.vocab_size}")

        return True

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using the Wav2Vec2 Arabic model.

        Args:
            audio_data: Audio input (numpy array or file path)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription with confidence and metadata
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")

        start_time = time.time()

        try:
            # Process input audio
            processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)

            # Check audio length
            duration = len(processed_audio) / actual_sr
            if duration < 0.1:
                return STTResult(
                    text="",
                    confidence=0.0,
                    processing_time=time.time() - start_time,
                    metadata={"error": "Audio too short", "duration": duration}
                )

            # Process with the model
            if duration > cls.config.get("chunk_length", 20):
                # Handle long audio by chunking
                text, confidence = cls._transcribe_long_audio(processed_audio, actual_sr)
            else:
                # Process short audio directly
                text, confidence = cls._transcribe_chunk(processed_audio, actual_sr)

            processing_time = time.time() - start_time

            # Prepare metadata
            metadata = {
                "model": cls.config["model_id"],
                "device": cls.config["device"],
                "language": "ar-EG",
                "duration": duration,
                "sample_rate": actual_sr,
                "chunks_processed": 1 if duration <= cls.config.get("chunk_length", 20) else int(duration / cls.config["chunk_length"]) + 1
            }

            return STTResult(
                text=text.strip(),
                confidence=confidence,
                processing_time=processing_time,
                metadata=metadata
            )

        except Exception as e:
            error_msg = f"Transcription failed: {str(e)}"
            logger.error(error_msg)
            return STTResult(
                text="",
                confidence=0.0,
                processing_time=time.time() - start_time,
                metadata={"error": error_msg}
            )

    @classmethod
    def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
        """Process and validate audio input."""
        if isinstance(audio_data, (str, Path)):
            # Load audio file
            audio_path = Path(audio_data)
            if not audio_path.exists():
                raise FileNotFoundError(f"Audio file not found: {audio_path}")

            if LIBROSA_AVAILABLE:
                audio_array, sr = librosa.load(str(audio_path), sr=cls.config["sample_rate"])
            else:
                # Fallback to torchaudio
                audio_tensor, sr = torchaudio.load(str(audio_path))
                audio_array = audio_tensor.numpy().flatten()

                # Resample if needed
                if sr != cls.config["sample_rate"]:
                    resampler = torchaudio.transforms.Resample(sr, cls.config["sample_rate"])
                    audio_tensor = resampler(audio_tensor)
                    audio_array = audio_tensor.numpy().flatten()
                    sr = cls.config["sample_rate"]

        else:
            # Handle numpy array
            audio_array = audio_data.astype(np.float32)
            sr = sample_rate or cls.config["sample_rate"]

            # Resample if needed
            if sr != cls.config["sample_rate"]:
                if LIBROSA_AVAILABLE:
                    audio_array = librosa.resample(
                        audio_array,
                        orig_sr=sr,
                        target_sr=cls.config["sample_rate"]
                    )
                else:
                    # Simple resampling fallback (integer rate ratios only)
                    if sr > cls.config["sample_rate"]:
                        step = sr // cls.config["sample_rate"]
                        audio_array = audio_array[::step]
                    else:
                        repeat = cls.config["sample_rate"] // sr
                        audio_array = np.repeat(audio_array, repeat)

                sr = cls.config["sample_rate"]

        # Normalize audio
        if len(audio_array) > 0:
            # Convert to mono if stereo; average over the channel axis
            # of a (samples, channels) array, matching the other backends
            if audio_array.ndim > 1:
                audio_array = np.mean(audio_array, axis=1)

            # Normalize to [-1, 1]
            max_val = np.max(np.abs(audio_array))
            if max_val > 0:
                audio_array = audio_array / max_val

        return audio_array, sr

    @classmethod
    def _transcribe_chunk(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
        """Transcribe a single audio chunk."""
        # Preprocess audio
        input_values = cls.processor(
            audio_array,
            sampling_rate=sample_rate,
            return_tensors="pt",
            padding=True
        )

        # Move to device
        input_values = {k: v.to(cls.config["device"]) for k, v in input_values.items()}

        # Inference
        with torch.no_grad():
            logits = cls.model(**input_values).logits

        # Get predicted tokens
        predicted_ids = torch.argmax(logits, dim=-1)

        # Decode transcription
        transcription = cls.processor.batch_decode(predicted_ids)[0]

        # Calculate confidence (average of max probabilities)
        confidence = cls._calculate_confidence(logits)

        return transcription, confidence

    @classmethod
    def _transcribe_long_audio(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
        """Transcribe long audio by chunking."""
        chunk_length = cls.config.get("chunk_length", 20)
        chunk_samples = int(chunk_length * sample_rate)
        overlap_samples = int(1.0 * sample_rate)  # 1 second overlap

        transcriptions = []
        confidences = []

        for start in range(0, len(audio_array), chunk_samples - overlap_samples):
            end = min(start + chunk_samples, len(audio_array))
            chunk = audio_array[start:end]

            if len(chunk) < 0.5 * sample_rate:  # Skip very short chunks
                continue

            try:
                chunk_text, chunk_confidence = cls._transcribe_chunk(chunk, sample_rate)
                if chunk_text.strip():
                    transcriptions.append(chunk_text.strip())
                    confidences.append(chunk_confidence)
            except Exception as e:
                logger.warning(f"Failed to transcribe chunk: {e}")
                continue

        # Combine results
        full_text = " ".join(transcriptions)
        avg_confidence = np.mean(confidences) if confidences else 0.0

        return full_text, avg_confidence

    @classmethod
    def _calculate_confidence(cls, logits: torch.Tensor) -> float:
        """Calculate a confidence score from model logits."""
        try:
            # Apply softmax to get probabilities
            probabilities = torch.softmax(logits, dim=-1)

            # Get the maximum probability for each time step
            max_probs = torch.max(probabilities, dim=-1)[0]

            # Average over time steps (excluding padding if any)
            confidence = torch.mean(max_probs).item()

            return confidence

        except Exception as e:
            logger.warning(f"Could not calculate confidence: {e}")
            return 0.5  # Default confidence

    @classmethod
    def get_available_models(cls) -> Dict[str, Any]:
        """Get information about available Wav2Vec2 models."""
        models_info = {
            "transformers_available": TRANSFORMERS_AVAILABLE,
            "librosa_available": LIBROSA_AVAILABLE,
            "torch_available": TRANSFORMERS_AVAILABLE,
        }

        if TRANSFORMERS_AVAILABLE:
            models_info.update({
                "cuda_available": torch.cuda.is_available(),
                "mps_available": hasattr(torch.backends, 'mps') and torch.backends.mps.is_available(),
                "public_models": [
                    {
                        "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
                        "name": "Wav2Vec2 Arabic (Large)",
                        "language": "Arabic",
                        "size": "1.2GB"
                    },
                    {
                        "id": "facebook/wav2vec2-large-xlsr-53",
                        "name": "Wav2Vec2 Multilingual (Large)",
                        "language": "Multilingual (including Arabic)",
                        "size": "1.2GB"
                    },
                    {
                        "id": "facebook/wav2vec2-base-960h",
                        "name": "Wav2Vec2 English Base",
                        "language": "English",
                        "size": "360MB"
                    }
                ],
                "experimental_models": [
                    {
                        "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
                        "name": "Wav2Vec2 Arabic Egyptian (Large)",
                        "language": "Arabic Egyptian Dialect",
                        "size": "1.2GB",
                        "note": "May require Hugging Face authentication"
                    }
                ]
            })

        return models_info

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set language (for compatibility - this model is Arabic-specific)."""
        if language and not language.startswith("ar"):
            logger.warning(f"This model is optimized for Arabic. Language '{language}' may not work well.")

        cls.config["language"] = language or "ar-EG"
        logger.info(f"Language set to: {cls.config['language']}")

    @classmethod
    def set_device(cls, device: str) -> None:
        """Change the device used for model inference."""
        if cls.model is not None:
            cls.model = cls.model.to(device)
        cls.config["device"] = device
        logger.info(f"Model moved to device: {device}")

    @classmethod
    def get_model_info(cls) -> Dict[str, Any]:
        """Get detailed model information."""
        base_info = super().get_model_info()

        if cls.is_loaded:
            base_info.update({
                "model_id": cls.config["model_id"],
                "device": cls.config["device"],
                "language": cls.config["language"],
                "sample_rate": cls.config["sample_rate"],
                "vocab_size": cls.model.config.vocab_size if cls.model else None,
            })

        return base_info


# Example usage and testing
if __name__ == "__main__":
    print("Testing Wav2Vec2 Arabic STT implementation...")

    # Check availability
    models_info = Wav2Vec2ArabicSTT.get_available_models()
    print(f"Available models info: {models_info}")

    if models_info["transformers_available"]:
        try:
            print("Loading Wav2Vec2 Arabic model...")
            Wav2Vec2ArabicSTT.load_model()

            print("Creating test audio...")
            # Generate test audio (1 second of random noise)
            test_audio = np.random.randn(16000).astype(np.float32) * 0.1

            print("Testing transcription...")
            result = Wav2Vec2ArabicSTT.transcribe_audio(test_audio, 16000)
            print(f"Result: {result}")
            print(f"Metadata: {result.metadata}")

        except Exception as e:
            print(f"Error: {e}")
            print("Note: this is expected with random audio - the model expects Arabic speech")

    else:
        print("Transformers not installed - install with:")
        print("pip install transformers torch torchaudio")
        print("Optional: pip install librosa (for better audio processing)")

    print("\nWav2Vec2 Arabic STT implementation ready!")
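A short, hedged usage sketch for this class. The `device="cpu"` choice and the silent 5-second clip are illustrative; the model IDs come from the fallback list defined in the file above.

```python
import numpy as np
from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT

# Tries the Egyptian-dialect checkpoint first, then falls back to the
# generic Arabic and multilingual XLSR models if it cannot be fetched.
Wav2Vec2ArabicSTT.load_model(device="cpu")

# Any sample rate is accepted; input is resampled to 16 kHz and normalized.
audio = np.zeros(5 * 16000, dtype=np.float32)  # placeholder: 5 s of silence
result = Wav2Vec2ArabicSTT.transcribe_audio(audio, sample_rate=16000)

print(result.text)        # greedy CTC decode of the argmax token path
print(result.confidence)  # mean of per-frame max softmax probabilities
print(result.metadata["chunks_processed"])
```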
stt/whisper_stt.py
ADDED
@@ -0,0 +1,377 @@
#!/usr/bin/env python3
"""
Whisper STT Implementation

OpenAI Whisper speech-to-text implementation using the static BaseSTT interface.
Supports both local Whisper models and OpenAI API calls.

Usage:
    from stt.whisper_stt import WhisperSTT

    # Load model (local)
    WhisperSTT.load_model()

    # Transcribe audio
    result = WhisperSTT.transcribe_file("audio.wav")
    print(result.text)

    # Or use the OpenAI API
    WhisperSTT.load_model(use_api=True, api_key="your-key")
    result = WhisperSTT.transcribe_file("audio.wav")
"""

from typing import Union, Optional, Dict, Any
import numpy as np
from pathlib import Path
import time
import logging
import tempfile
import os

try:
    import whisper
    WHISPER_AVAILABLE = True
except ImportError:
    WHISPER_AVAILABLE = False

try:
    import openai
    OPENAI_AVAILABLE = True
except ImportError:
    OPENAI_AVAILABLE = False

try:
    import soundfile as sf
    SOUNDFILE_AVAILABLE = True
except ImportError:
    SOUNDFILE_AVAILABLE = False

from .stt_base import BaseSTT, STTResult

logger = logging.getLogger(__name__)


class WhisperSTT(BaseSTT):
    """
    OpenAI Whisper STT implementation with support for both local models and the API.

    Supports:
    - Local Whisper models (tiny, base, small, medium, large)
    - OpenAI Whisper API calls
    - Multiple audio formats via soundfile
    - Confidence scoring and metadata
    """

    model_name = "WhisperSTT"
    model = None
    is_loaded = False
    config = {
        "model_size": "base",
        "use_api": False,
        "api_key": None,
        "language": None,  # Auto-detect if None
        "task": "transcribe",  # "transcribe" or "translate"
        "temperature": 0.0,
        "best_of": 5,
        "beam_size": 5,
        "patience": 1.0,
        "length_penalty": 1.0,
        "suppress_tokens": "-1",
        "initial_prompt": None,
        "condition_on_previous_text": True,
        "fp16": True,
        "compression_ratio_threshold": 2.4,
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6
    }

    @classmethod
    def load_model(cls,
                   model_size: str = "base",
                   use_api: bool = False,
                   api_key: Optional[str] = None,
                   **kwargs) -> None:
        """
        Load the Whisper model (local or API setup).

        Args:
            model_size: Size of local model ("tiny", "base", "small", "medium", "large")
            use_api: Use the OpenAI API instead of a local model
            api_key: OpenAI API key (required if use_api=True)
            **kwargs: Additional Whisper parameters
        """
        cls.config.update({
            "model_size": model_size,
            "use_api": use_api,
            "api_key": api_key,
            **kwargs
        })

        if use_api:
            cls._load_api_model(api_key)
        else:
            cls._load_local_model(model_size)

    @classmethod
    def _load_local_model(cls, model_size: str) -> None:
        """Load a local Whisper model."""
        if not WHISPER_AVAILABLE:
            raise ImportError(
                "OpenAI Whisper not installed. Install with: pip install openai-whisper"
            )

        logger.info(f"Loading Whisper local model: {model_size}")
        start_time = time.time()

        try:
            cls.model = whisper.load_model(model_size)
            cls.is_loaded = True
            load_time = time.time() - start_time
            logger.info(f"Whisper model '{model_size}' loaded successfully in {load_time:.2f}s")

        except Exception as e:
            cls.is_loaded = False
            raise RuntimeError(f"Failed to load Whisper model '{model_size}': {e}")

    @classmethod
    def _load_api_model(cls, api_key: Optional[str]) -> None:
        """Set up the OpenAI API client."""
        if not OPENAI_AVAILABLE:
            raise ImportError(
                "OpenAI Python client not installed. Install with: pip install openai"
            )

        if not api_key:
            # Try to get it from the environment
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key required. Set the OPENAI_API_KEY environment variable or pass the api_key parameter."
                )

        logger.info("Setting up OpenAI Whisper API client")

        try:
            openai.api_key = api_key
            cls.model = "whisper-1"  # API model identifier
            cls.is_loaded = True
            logger.info("OpenAI Whisper API client configured successfully")

        except Exception as e:
            cls.is_loaded = False
            raise RuntimeError(f"Failed to setup OpenAI API: {e}")

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using Whisper (local or API).

        Args:
            audio_data: Audio input (numpy array, file path, or audio file)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription with confidence and metadata
        """
        if not cls.is_loaded:
            raise RuntimeError("Whisper model not loaded. Call load_model() first.")

        start_time = time.time()

        try:
            if cls.config["use_api"]:
                result = cls._transcribe_api(audio_data, sample_rate)
            else:
                result = cls._transcribe_local(audio_data, sample_rate)

            processing_time = time.time() - start_time
            result.processing_time = processing_time

            logger.info(f"Transcription completed in {processing_time:.2f}s")
            return result

        except Exception as e:
            logger.error(f"Transcription failed: {e}")
            raise RuntimeError(f"Whisper transcription failed: {e}")

    @classmethod
    def _transcribe_local(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> STTResult:
        """Transcribe using a local Whisper model."""

        # Prepare transcription options
        transcribe_options = {
            "language": cls.config.get("language"),
            "task": cls.config.get("task", "transcribe"),
            "temperature": cls.config.get("temperature", 0.0),
            "best_of": cls.config.get("best_of", 5),
            "beam_size": cls.config.get("beam_size", 5),
            "patience": cls.config.get("patience", 1.0),
            "length_penalty": cls.config.get("length_penalty", 1.0),
            "suppress_tokens": cls.config.get("suppress_tokens", "-1"),
            "initial_prompt": cls.config.get("initial_prompt"),
            "condition_on_previous_text": cls.config.get("condition_on_previous_text", True),
            "fp16": cls.config.get("fp16", True),
            "compression_ratio_threshold": cls.config.get("compression_ratio_threshold", 2.4),
            "logprob_threshold": cls.config.get("logprob_threshold", -1.0),
            "no_speech_threshold": cls.config.get("no_speech_threshold", 0.6)
        }

        # Remove None values
        transcribe_options = {k: v for k, v in transcribe_options.items() if v is not None}

        # Handle numpy arrays
        if isinstance(audio_data, np.ndarray):
            audio_input = audio_data.astype(np.float32)
            # Whisper expects mono audio
            if audio_input.ndim > 1:
                audio_input = np.mean(audio_input, axis=1)
        else:
            # File path
            audio_input = str(audio_data)

        # Transcribe
        result = cls.model.transcribe(audio_input, **transcribe_options)

        # Calculate confidence (average of segment confidences if available)
        confidence = None
        if "segments" in result and result["segments"]:
            segment_confidences = []
            for segment in result["segments"]:
                if "avg_logprob" in segment:
                    # Convert log probability to a confidence estimate
                    conf = min(1.0, max(0.0, np.exp(segment["avg_logprob"])))
                    segment_confidences.append(conf)

            if segment_confidences:
                confidence = np.mean(segment_confidences)

        # Prepare metadata
        metadata = {
            "model": cls.config["model_size"],
            "language": result.get("language"),
            "task": cls.config["task"],
            "segments": len(result.get("segments", [])),
            "api_used": False
        }

        return STTResult(
            text=result["text"].strip(),
            confidence=confidence,
            metadata=metadata
        )

    @classmethod
    def _transcribe_api(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> STTResult:
        """Transcribe using the OpenAI API."""

        # Handle numpy arrays - save to a temp file for the API
        if isinstance(audio_data, np.ndarray):
            if not SOUNDFILE_AVAILABLE:
                raise ImportError("soundfile required for numpy array support. Install with: pip install soundfile")

            # Create a temporary WAV file
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                temp_path = temp_file.name

            try:
                sf.write(temp_path, audio_data, sample_rate or 16000)
                audio_file_path = temp_path
                cleanup_temp = True
            except Exception as e:
                if os.path.exists(temp_path):
                    os.unlink(temp_path)
                raise RuntimeError(f"Failed to save temporary audio file: {e}")
        else:
            audio_file_path = str(audio_data)
            cleanup_temp = False

        try:
            # Make the API call
            with open(audio_file_path, "rb") as audio_file:
                transcript = openai.Audio.transcribe(
                    model="whisper-1",
                    file=audio_file,
                    language=cls.config.get("language"),
                    prompt=cls.config.get("initial_prompt"),
                    temperature=cls.config.get("temperature", 0.0)
                )

            # The API doesn't provide confidence scores
            metadata = {
                "model": "whisper-1",
                "language": cls.config.get("language", "auto"),
                "task": "transcribe",
                "api_used": True
            }

            return STTResult(
                text=transcript["text"].strip(),
                confidence=None,  # API doesn't provide confidence
                metadata=metadata
            )

        finally:
            # Clean up the temporary file if one was created
            if cleanup_temp and os.path.exists(audio_file_path):
                try:
                    os.unlink(audio_file_path)
                except Exception as e:
                    logger.warning(f"Failed to clean up temp file {audio_file_path}: {e}")

    @classmethod
    def get_available_models(cls) -> Dict[str, Any]:
        """Get information about available Whisper models."""
        local_models = ["tiny", "base", "small", "medium", "large"] if WHISPER_AVAILABLE else []
        api_available = OPENAI_AVAILABLE

        return {
            "local_models": local_models,
            "api_available": api_available,
            "whisper_installed": WHISPER_AVAILABLE,
            "openai_installed": OPENAI_AVAILABLE,
            "soundfile_installed": SOUNDFILE_AVAILABLE
        }

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set the transcription language."""
        cls.config["language"] = language
        logger.info(f"Language set to: {language or 'auto-detect'}")

    @classmethod
    def set_task(cls, task: str) -> None:
        """Set the task (transcribe or translate)."""
        if task not in ["transcribe", "translate"]:
            raise ValueError("Task must be 'transcribe' or 'translate'")
        cls.config["task"] = task
        logger.info(f"Task set to: {task}")


# Example usage and testing
if __name__ == "__main__":
    print("Testing WhisperSTT implementation...")

    # Check availability
    models_info = WhisperSTT.get_available_models()
    print(f"Available models: {models_info}")

    if models_info["whisper_installed"]:
        try:
            # Test with a local model
            print("\nTesting local Whisper model...")
            WhisperSTT.load_model("tiny")  # Use the tiny model for faster testing

            # Test with dummy numpy audio
            dummy_audio = np.random.randn(16000).astype(np.float32)  # 1 second
            result = WhisperSTT.transcribe_numpy(dummy_audio, 16000)
            print(f"Dummy audio result: {result}")
            print(f"Model info: {WhisperSTT.get_model_info()}")

        except Exception as e:
            print(f"Local model test failed: {e}")
    else:
        print("Whisper not installed - install with: pip install openai-whisper")

    print("\nWhisperSTT implementation ready!")
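A hedged sketch of both code paths in this module. The recording path is illustrative, and the API branch assumes the pre-1.0 `openai` client that `_transcribe_api` above targets (it calls `openai.Audio.transcribe`).

```python
import numpy as np
from stt.whisper_stt import WhisperSTT

# Local model: language pinned to Arabic, greedy decoding.
WhisperSTT.load_model("base", language="ar", temperature=0.0)
result = WhisperSTT.transcribe_audio("recordings/recorded_audio.wav")  # illustrative path
print(result.text, result.confidence)  # confidence ~ exp(avg_logprob) averaged per segment

# API mode: numpy input is written to a temporary WAV and uploaded.
# Requires OPENAI_API_KEY in the environment.
WhisperSTT.load_model(use_api=True)
audio = np.zeros(16000, dtype=np.float32)  # placeholder: one second of silence
print(WhisperSTT.transcribe_audio(audio, sample_rate=16000).text)
```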
test_coqui.py
ADDED
@@ -0,0 +1,163 @@
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Test script for Coqui STT
|
| 4 |
+
|
| 5 |
+
This script tests the Coqui STT implementation with a sample audio file.
|
| 6 |
+
Coqui STT provides open-source speech recognition with multiple language support.
|
| 7 |
+
|
| 8 |
+
Usage:
|
| 9 |
+
python test_coqui.py [audio_file]
|
| 10 |
+
|
| 11 |
+
If no audio file is provided, it will use the default recording if available.
"""

import sys
import logging
from pathlib import Path

# Add the project root to the path
sys.path.append(str(Path(__file__).parent))

from stt.coqui_stt import CoquiSTT, COQUI_STT_AVAILABLE

def test_coqui_stt(audio_file: str = None):
    """Test Coqui STT functionality."""
    print("🔊 Testing Coqui STT")
    print("=" * 50)

    # Check if Coqui STT is available
    if not COQUI_STT_AVAILABLE:
        print("❌ Coqui STT not available. Install with:")
        print("pip install coqui-stt soundfile librosa")
        return False

    # Create CoquiSTT instance
    coqui = CoquiSTT()

    # Check dependencies
    deps_ok, deps_msg = coqui.check_dependencies()
    print(f"Dependencies: {deps_msg}")
    if not deps_ok:
        return False

    # Get available models
    print("\n📦 Available Models:")
    available_models = coqui.get_available_models()
    for model in available_models:
        status = "✅ Downloaded" if model["downloaded"] else "⬇️ Available for download"
        scorer_status = " (with scorer)" if model["has_scorer"] else " (no scorer)"
        print(f"  - {model['name']}: {model['description']} ({model['size']}) {status}{scorer_status}")

    # Test model loading
    print("\n🔄 Loading English Large model...")
    model_name = "english-large"
    success = coqui.load_model(
        model_name=model_name,
        auto_download=True,
        beam_width=512
    )

    if not success:
        print("❌ Failed to load model")
        return False

    print("✅ Model loaded successfully")

    # Get model info
    model_info = coqui.get_model_info()
    print(f"\n📊 Model Info:")
    for key, value in model_info.items():
        print(f"  - {key}: {value}")

    # Test transcription
    if audio_file and Path(audio_file).exists():
        print(f"\n🎤 Transcribing: {audio_file}")
    else:
        # Look for default recording
        default_files = [
            "recordings/recorded_audio.wav",
            "recorded_audio.wav",
            "test_audio.wav"
        ]

        audio_file = None
        for file_path in default_files:
            if Path(file_path).exists():
                audio_file = file_path
                break

        if not audio_file:
            print("❌ No audio file found for testing")
            print("Record audio using the Gradio interface first, or provide a file path")
            return False

        print(f"\n🎤 Using default recording: {audio_file}")

    # Perform transcription
    try:
        print("Transcribing...")
        result = coqui.transcribe_audio(
            audio_file_path=audio_file,
            return_confidence=True,
            return_timestamps=False
        )

        if "error" in result:
            print(f"❌ Transcription error: {result['error']}")
            return False

        print("\n📝 Transcription Results:")
        print(f"  Text: {result['text']}")
        print(f"  Confidence: {result.get('confidence', 'N/A')}")
        print(f"  Language: {result.get('language', 'Unknown')}")

        # Test with timestamps if successful
        print("\n🕐 Testing with timestamps...")
        result_with_timestamps = coqui.transcribe_audio(
            audio_file_path=audio_file,
            return_confidence=True,
            return_timestamps=True
        )

        if "words" in result_with_timestamps:
            print(f"  Word count: {len(result_with_timestamps['words'])}")
            if result_with_timestamps['words']:
                print("  First few words with timestamps:")
                for word in result_with_timestamps['words'][:3]:
                    print(f"    - '{word['word']}' at {word['start_time']:.2f}s (confidence: {word.get('confidence', 'N/A')})")

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False

    # Cleanup
    coqui.cleanup()
    print("\n✅ Test completed successfully!")
    return True

def main():
    """Main function."""
    # Setup logging
    logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

    # Get audio file from command line if provided
    audio_file = sys.argv[1] if len(sys.argv) > 1 else None

    # Run test
    success = test_coqui_stt(audio_file)

    if success:
        print("\n🎉 Coqui STT is working correctly!")
        print("\n💡 Next steps:")
        print("  1. Run the main transcriber: python gradio_voice_transcriber_clean.py")
        print("  2. Select 'CoquiSTT' as your model")
        print("  3. Choose your preferred language model")
        print("  4. Start transcribing!")
    else:
        print("\n❌ Coqui STT test failed")
        return 1

    return 0

if __name__ == "__main__":
    sys.exit(main())
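For quick reference outside the test harness, the happy path above boils down to a few calls. A minimal sketch, assuming the same `stt.coqui_stt` module layout exercised by the test and a recording at the hypothetical path shown:

```python
from stt.coqui_stt import CoquiSTT, COQUI_STT_AVAILABLE

# Condensed version of the flow test_coqui_stt() walks through.
if COQUI_STT_AVAILABLE:
    coqui = CoquiSTT()
    # auto_download fetches the model files on first use
    if coqui.load_model(model_name="english-large", auto_download=True, beam_width=512):
        result = coqui.transcribe_audio(
            audio_file_path="recordings/recorded_audio.wav",  # hypothetical recording path
            return_confidence=True,
        )
        print(result.get("text", ""))
        coqui.cleanup()
```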
test_gradio_voice_transcriber.py
ADDED
@@ -0,0 +1,186 @@
import types
import numpy as np
import builtins
import time
import importlib
import pytest

# Import the module under test
import gradio_voice_transcriber as gvt


class DummySTT:
    is_loaded = True

    def __init__(self):
        self._language = None

    def load_model(self, **kwargs):
        self.is_loaded = True

    def set_language(self, lang):
        self._language = lang

    def transcribe_audio(self, audio, sample_rate):
        # Return an object mimicking STTResult
        class R:
            def __init__(self):
                self.text = "hello world"
                self.confidence = 0.75
                self.processing_time = 0.05
        return R()

    @staticmethod
    def get_model_info():
        return {"is_loaded": True, "model_name": "DummySTT"}


class DummyTawasul:
    # Static style class (no instantiation) used by code path
    is_loaded = True

    @staticmethod
    def load_model(**kwargs):
        DummyTawasul.is_loaded = True

    @staticmethod
    def get_model_info():
        return {"is_loaded": True, "model_name": "TawasulSTT"}

    @staticmethod
    def transcribe(path):
        # Return tuple like (text, confidence_info, processing_info)
        return ("transcribed from file", "Confidence: 0.42", "ok")


@pytest.fixture(autouse=True)
def reset_globals(monkeypatch):
    # Ensure clean state between tests
    gvt.current_stt_model = None
    gvt.current_model_config = {}
    yield
    gvt.current_stt_model = None
    gvt.current_model_config = {}


def test_audio_processor_preprocess_basic():
    sr = 8000
    t = np.linspace(0, 1, sr, endpoint=False)
    audio = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
    out = gvt.AudioProcessor.preprocess(audio, sr, target_sr=16000)
    # Should be float32, mono, and clipped range within [-1, 1]
    assert out.dtype == np.float32
    assert out.ndim == 1
    assert np.max(np.abs(out)) <= 1.0


def test_model_manager_load_whisper_missing_api_key_returns_error(monkeypatch):
    # Register DummySTT under WhisperSTT name to avoid heavy import
    monkeypatch.setitem(gvt.STT_MODELS, "WhisperSTT", DummySTT)
    # Request API mode without key
    msg = gvt.ModelManager.load_model("WhisperSTT", model_size="base", use_api=True, api_key="")
    assert "API key required" in msg


def test_model_manager_load_generic_success(monkeypatch):
    # Register a generic model name and load
    monkeypatch.setitem(gvt.STT_MODELS, "DummySTT", DummySTT)
    msg = gvt.ModelManager.load_model("DummySTT")
    assert msg.startswith("✅")
    assert gvt.current_stt_model is not None


def test_transcription_engine_no_audio():
    text, conf, proc = gvt.TranscriptionEngine.transcribe(None, language="en")
    assert text.startswith("❌ No audio provided")


def test_transcription_engine_requires_loaded_model():
    # Provide dummy audio but no model
    sr = 16000
    audio = np.zeros(sr, dtype=np.float32)
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert "No STT model loaded" in text


def test_transcription_engine_happy_path(monkeypatch):
    # Use DummySTT and set as the loaded model
    gvt.current_stt_model = DummySTT()
    gvt.current_model_config = {"model_name": "DummySTT"}
    # Provide a 1 second tone with enough amplitude
    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    audio = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert text == "hello world"
    assert conf.startswith("Confidence: ")
    assert "Processing:" in proc


def test_transcription_engine_filters_false_positives(monkeypatch):
    class LowTextDummy(DummySTT):
        def transcribe_audio(self, audio, sample_rate):
            class R:
                def __init__(self):
                    self.text = "you"  # a known false positive which should be filtered
                    self.confidence = None
                    self.processing_time = 0.01
            return R()

    gvt.current_stt_model = LowTextDummy()
    gvt.current_model_config = {"model_name": "LowTextDummy"}
    sr = 16000
    audio = np.ones(sr, dtype=np.float32) * 0.2
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert text == "🔇 No clear speech detected"


def test_transcription_engine_tawasul_static_path_flow(monkeypatch, tmp_path):
    # Force the Tawasul path by setting current_model_config model_name
    gvt.current_stt_model = DummyTawasul
    gvt.current_model_config = {"model_name": "TawasulSTT"}

    # Create a simple audio array meeting quality gates
    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    audio = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)

    # Monkeypatch soundfile.write to write to the provided path without needing soundfile dependency
    written = {}

    def fake_write(path, data, samplerate):
        written["path"] = path
        written["samplerate"] = samplerate
        written["len"] = len(data)

    monkeypatch.setitem(builtins.__dict__, "__SOUNDFILE_WRITE__", fake_write)

    # Patch import inside function to use our fake write via simple shim
    import types as _types

    class SFShim:
        @staticmethod
        def write(path, data, samplerate):
            fake_write(path, data, samplerate)

    monkeypatch.setitem(importlib.import_module("soundfile").__dict__ if False else globals(), "sf", SFShim)

    # Run transcription
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="ar")
    assert text == "transcribed from file"
    assert conf.startswith("Confidence: ")
    assert "Model: TawasulSTT" in proc


def test_get_quality_recommendations_messages():
    q = {
        "duration": 0.5,
        "max_amplitude": 0.95,
        "clipping_ratio": 0.02,
        "silence_ratio": 0.6,
    }
    msg = gvt._get_quality_recommendations(q)
    # Expect multiple recommendations due to thresholds
    assert "recording for longer" in msg
    assert "clipping" in msg
    assert "Too much silence" in msg
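The suite above is plain pytest and needs no special runner. A minimal sketch for invoking it programmatically, assuming pytest is installed:

```python
import sys

import pytest

# Run only this module's tests, quietly; exit with pytest's status code.
sys.exit(pytest.main(["-q", "test_gradio_voice_transcriber.py"]))
```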
test_hubert_arabic.py
ADDED
@@ -0,0 +1,146 @@
#!/usr/bin/env python3
"""
Test script for HuBERT Arabic STT model

This script tests the HuBERT Arabic Egyptian STT implementation
including authentication, model loading, and transcription.
"""

import sys
import os
from pathlib import Path
import soundfile as sf
import numpy as np

# Add the project root to the path
project_root = Path(__file__).parent
sys.path.insert(0, str(project_root))

def test_hubert_arabic_stt():
    """Test the HuBERT Arabic STT implementation."""
    print("🚀 Testing HuBERT Arabic STT")
    print("=" * 50)

    try:
        from stt.hubert_arabic_stt import HuBERTArabicSTT
        print("✅ HuBERTArabicSTT imported successfully")
    except ImportError as e:
        print(f"❌ Failed to import HuBERTArabicSTT: {e}")
        print("\n💡 To install HuBERT dependencies:")
        print("   pip install -r requirements_hubert.txt")
        return False

    # Test model loading
    print("\n📦 Testing model loading...")
    try:
        stt = HuBERTArabicSTT()

        # Try to load the primary model
        print("🔧 Loading HuBERT Arabic Egyptian model...")
        result = stt.load_model(
            model_id="omarxadel/hubert-large-arabic-egyptian",
            device="auto"
        )
        print(f"Model load result: {result}")

    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        print("\n💡 This might be due to:")
        print("   - Missing HuggingFace authentication token")
        print("   - Network connectivity issues")
        print("   - Private model access restrictions")
        print("\n🔧 Try setting up authentication:")
        print("   python setup_hf_auth.py")
        return False

    # Test with sample audio (if available)
    print("\n🎵 Testing audio transcription...")

    # Create a test audio file (silence)
    sample_rate = 16000
    duration = 2.0  # seconds
    test_audio = np.zeros(int(sample_rate * duration), dtype=np.float32)

    test_audio_path = "test_audio_hubert.wav"
    sf.write(test_audio_path, test_audio, sample_rate)

    try:
        transcription, confidence, processing_info = stt.transcribe(test_audio_path)
        print(f"✅ Transcription completed")
        print(f"   Text: '{transcription}'")
        print(f"   Confidence: {confidence}")
        print(f"   Processing: {processing_info}")

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False
    finally:
        # Clean up test file
        if os.path.exists(test_audio_path):
            os.remove(test_audio_path)

    print("\n✅ All HuBERT Arabic STT tests passed!")
    return True

def test_with_real_audio():
    """Test with real audio if available."""
    recordings_dir = Path("recordings")

    if not recordings_dir.exists():
        print(f"\n💡 No recordings directory found at {recordings_dir}")
        print("   Create the directory and add .wav files to test with real audio")
        return

    audio_files = list(recordings_dir.glob("*.wav"))
    if not audio_files:
        print(f"\n💡 No .wav files found in {recordings_dir}")
        return

    print(f"\n🎵 Testing with real audio files from {recordings_dir}...")

    try:
        from stt.hubert_arabic_stt import HuBERTArabicSTT
        stt = HuBERTArabicSTT()
        stt.load_model()

        for audio_file in audio_files[:2]:  # Test first 2 files
            print(f"\n📄 Processing: {audio_file.name}")
            try:
                transcription, confidence, processing_info = stt.transcribe(str(audio_file))
                print(f"   Text: '{transcription}'")
                print(f"   Confidence: {confidence}")
            except Exception as e:
                print(f"   ❌ Error: {e}")

    except Exception as e:
        print(f"❌ Real audio test failed: {e}")

def main():
    """Main test function."""
    print("HuBERT Arabic STT Test Suite")
    print("=" * 60)

    # Basic functionality test
    success = test_hubert_arabic_stt()

    if success:
        print("\n🎯 Running additional tests...")
        test_with_real_audio()

    print("\n" + "=" * 60)
    if success:
        print("🎉 HuBERT Arabic STT is working correctly!")
        print("\n💡 Next steps:")
        print("   1. Test with the Gradio interface:")
        print("      python gradio_voice_transcriber_clean.py")
        print("   2. Select 'HuBERTArabicSTT' as the STT model")
        print("   3. Upload Arabic Egyptian audio for transcription")
    else:
        print("❌ HuBERT Arabic STT tests failed")
        print("\n🔧 Troubleshooting:")
        print("   1. Install dependencies: pip install -r requirements_hubert.txt")
        print("   2. Set up HF authentication: python setup_hf_auth.py")
        print("   3. Check network connectivity")

if __name__ == "__main__":
    main()
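Outside the test harness, the instance API this script drives reduces to a short sketch. Assumptions: the same `stt.hubert_arabic_stt` module, HuggingFace access to the model, and a hypothetical recording path:

```python
from stt.hubert_arabic_stt import HuBERTArabicSTT

stt = HuBERTArabicSTT()
# May require a HuggingFace token; see setup_hf_auth.py
stt.load_model(model_id="omarxadel/hubert-large-arabic-egyptian", device="auto")

# transcribe() returns (text, confidence, processing_info)
text, confidence, info = stt.transcribe("recordings/sample.wav")  # hypothetical path
print(text, confidence)
```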
test_tawasul.py
ADDED
@@ -0,0 +1,132 @@
#!/usr/bin/env python3
"""
Test script for Tawasul STT V0 model

This script tests the Tawasul STT V0 Arabic speech recognition model
with sample audio files.

Usage:
    python test_tawasul.py [audio_file]

If no audio file is provided, it will test with any files in the recordings/ directory.
"""

import sys
import os
from pathlib import Path
import time
import logging

# Add the project root to the path
sys.path.insert(0, str(Path(__file__).parent))

from stt.tawasul_stt import TawasulSTT

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def test_tawasul_stt(audio_file: str = None):
    """Test Tawasul STT with an audio file."""

    print("🧪 Tawasul STT V0 Test")
    print("=" * 50)

    # Check if Tawasul STT is available
    if not TawasulSTT.is_available():
        print("❌ Tawasul STT dependencies not available!")
        print("Install with: pip install -r requirements_tawasul.txt")
        return False

    # Find audio file if not provided
    if not audio_file:
        recordings_dir = Path("recordings")
        if recordings_dir.exists():
            audio_files = list(recordings_dir.glob("*.wav")) + list(recordings_dir.glob("*.mp3"))
            if audio_files:
                audio_file = str(audio_files[0])
                print(f"🎵 Using sample audio: {audio_file}")
            else:
                print("❌ No audio files found in recordings/ directory")
                print("Please provide an audio file: python test_tawasul.py your_audio.wav")
                return False
        else:
            print("❌ No audio file provided and no recordings/ directory found")
            print("Usage: python test_tawasul.py your_audio.wav")
            return False

    if not os.path.exists(audio_file):
        print(f"❌ Audio file not found: {audio_file}")
        return False

    try:
        # Load the model (static method)
        print("📥 Loading Tawasul STT V0 model...")
        start_time = time.time()

        TawasulSTT.load_model(
            device="auto",        # Automatically choose best device
            chunk_length=20,      # 20-second chunks
            max_audio_length=300  # 5 minutes max
        )

        load_time = time.time() - start_time
        print(f"✅ Model loaded in {load_time:.1f} seconds")

        # Get model info (static method)
        model_info = TawasulSTT.get_model_info()
        print(f"\n📋 Model Information:")
        print(f"   Name: {model_info['name']}")
        print(f"   Model ID: {model_info['model_id']}")
        print(f"   Device: {model_info['device']}")
        print(f"   Architecture: {model_info['architecture']}")
        print(f"   Specialization: {model_info['specialization']}")
        print(f"   Supported Languages: {', '.join(model_info['supported_languages'][:5])}...")

        # Transcribe audio (static method)
        print(f"\n🎙️ Transcribing audio: {audio_file}")
        transcription, confidence_info, processing_info = TawasulSTT.transcribe(audio_file)

        # Display results
        print("\n" + "=" * 50)
        print("📝 TRANSCRIPTION RESULTS")
        print("=" * 50)
        print(f"Text: {transcription}")
        print(f"Confidence: {confidence_info}")
        print(f"Processing: {processing_info}")
        print("=" * 50)

        if transcription and not transcription.startswith("❌"):
            print("✅ Transcription successful!")
            return True
        else:
            print("❌ Transcription failed!")
            return False

    except Exception as e:
        print(f"❌ Test failed: {str(e)}")
        return False

def main():
    """Main function."""
    audio_file = sys.argv[1] if len(sys.argv) > 1 else None

    # Test with different configurations
    success = test_tawasul_stt(audio_file)

    if success:
        print("\n🎉 Tawasul STT test completed successfully!")
        print("\n💡 Next steps:")
        print("   1. Try the main transcriber: python gradio_voice_transcriber_clean.py")
        print("   2. Test with different Arabic audio files")
        print("   3. Experiment with different model variants")
    else:
        print("\n❌ Tawasul STT test failed!")
        print("\n🔧 Troubleshooting:")
        print("   1. Install dependencies: pip install -r requirements_tawasul.txt")
        print("   2. Check audio file format (WAV/MP3)")
        print("   3. Ensure stable internet for model download")
        print("   4. Try with a different audio file")

if __name__ == "__main__":
    main()
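Note that TawasulSTT is used as a static-style class: the test never instantiates it. A minimal sketch of that API, assuming the same module layout and a hypothetical audio path:

```python
from stt.tawasul_stt import TawasulSTT

if TawasulSTT.is_available():
    # All methods are static; no instance is created.
    TawasulSTT.load_model(device="auto", chunk_length=20, max_audio_length=300)
    text, confidence_info, processing_info = TawasulSTT.transcribe("recordings/sample.wav")  # hypothetical path
    print(text)
```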
test_vosk.py
ADDED
@@ -0,0 +1,185 @@
#!/usr/bin/env python3
"""
Test script for Vosk STT implementation

This script tests the Vosk STT implementation to ensure compatibility
and proper error handling.
"""

import sys
import numpy as np
from pathlib import Path

# Add the project directory to Python path
project_dir = Path(__file__).parent
sys.path.insert(0, str(project_dir))

def test_vosk_basic():
    """Test basic Vosk functionality."""
    print("🚀 Testing Vosk STT...")

    try:
        from stt.vosk_stt import VoskSTT
        print("✅ Successfully imported VoskSTT")
    except ImportError as e:
        print(f"❌ Failed to import VoskSTT: {e}")
        print("\n📦 Required dependencies:")
        print("pip install vosk")
        return False

    # Check Vosk availability
    print("\n🔍 Checking Vosk availability...")
    models_info = VoskSTT.get_available_models()

    for key, value in models_info.items():
        status = "✅" if value else "❌"
        print(f"{status} {key}: {value}")

    if not models_info.get("vosk_available", False):
        print("\n❌ Vosk not available. Cannot proceed with test.")
        return False

    # Test model loading with a small model
    print(f"\n🔄 Testing model loading...")
    print("⚠️ Note: This will download a small model (~40MB) if not cached")

    try:
        # Try to load a small English model first
        VoskSTT.load_model(model_name="vosk-model-small-en-us-0.15")
        print("✅ Model loaded successfully!")

        # Get model info
        model_info = VoskSTT.get_model_info()
        print(f"📊 Model info:")
        for key, value in model_info.items():
            print(f"   {key}: {value}")

    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        print("\n💡 This might be due to:")
        print("   - Network issues (model download)")
        print("   - Vosk version compatibility")
        print("   - Model availability")
        return False

    # Test with dummy audio
    print(f"\n🎤 Testing transcription with dummy audio...")
    print("⚠️ Note: Random audio won't produce meaningful text")

    try:
        # Create 2 seconds of random audio
        dummy_audio = np.random.randn(32000).astype(np.float32) * 0.1

        result = VoskSTT.transcribe_audio(dummy_audio, 16000)

        print(f"📝 Transcription result:")
        print(f"   Text: '{result.text}'")
        print(f"   Confidence: {result.confidence:.2%}" if result.confidence else "   Confidence: N/A")
        print(f"   Processing time: {result.processing_time:.2f}s")
        print(f"   Metadata: {result.metadata}")

        print("✅ Transcription test completed!")
        return True

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False

def test_vosk_models():
    """Test different Vosk models."""
    print(f"\n🌍 Testing different Vosk models...")

    try:
        from stt.vosk_stt import VoskSTT

        # Get available models
        available = VoskSTT.AVAILABLE_MODELS
        print(f"📊 Available models: {len(available)}")

        # Show a few interesting models
        interesting_models = [
            "vosk-model-small-en-us-0.15",
            "vosk-model-small-ru-0.22",
            "vosk-model-small-fr-0.22",
            "vosk-model-small-de-0.15"
        ]

        print("\n📋 Some available models:")
        for model_name in interesting_models:
            if model_name in available:
                model_info = available[model_name]
                print(f"   {model_name}:")
                print(f"      Language: {model_info['language']}")
                print(f"      Size: {model_info['size']}")
                print(f"      Description: {model_info['description']}")

        return True

    except Exception as e:
        print(f"❌ Model listing failed: {e}")
        return False

def test_integration():
    """Test integration with the modular transcriber."""
    print(f"\n🔗 Testing integration with modular transcriber...")

    try:
        from gradio_voice_transcriber_clean import ModelManager

        available_models = ModelManager.get_available_models()
        print(f"📋 Available models: {available_models}")

        if "VoskSTT" in available_models:
            print("✅ VoskSTT is registered in the modular transcriber")

            # Test model options
            options = ModelManager.get_model_options("VoskSTT")
            print(f"📊 Model options: {options}")

            return True
        else:
            print("❌ VoskSTT not found in available models")
            return False

    except ImportError as e:
        print(f"❌ Failed to import modular transcriber components: {e}")
        return False

def main():
    """Main test function."""
    print("🧪 Vosk STT Test Suite")
    print("=" * 50)

    # Test basic functionality
    basic_test = test_vosk_basic()

    # Test model listing
    models_test = test_vosk_models()

    # Test integration
    integration_test = test_integration()

    print("\n" + "=" * 50)
    print("📊 Test Results Summary:")
    print(f"   Basic Functionality: {'✅ PASS' if basic_test else '❌ FAIL'}")
    print(f"   Model Listing: {'✅ PASS' if models_test else '❌ FAIL'}")
    print(f"   Integration: {'✅ PASS' if integration_test else '❌ FAIL'}")

    if basic_test and models_test and integration_test:
        print("\n🎉 All tests passed! Vosk STT is ready to use.")
        print("\n💡 Next steps:")
        print("   1. Run: python gradio_voice_transcriber_clean.py")
        print("   2. Select 'VoskSTT' from the dropdown")
        print("   3. Choose your model (small models are faster)")
        print("   4. Load the model and test with audio!")
        print("\n🌍 Vosk supports many languages offline!")
    else:
        print("\n❌ Some tests failed. Please check the errors above.")

        if not basic_test:
            print("\n📦 To fix Vosk issues:")
            print("   pip install vosk")
            print("   Check internet connection for model download")

if __name__ == "__main__":
    main()
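As with Tawasul, the Vosk backend is driven through static methods here. A minimal sketch of the same calls the basic test makes, assuming the small English model can be downloaded on first use:

```python
import numpy as np

from stt.vosk_stt import VoskSTT

# Downloads a ~40MB model on first use if not cached.
VoskSTT.load_model(model_name="vosk-model-small-en-us-0.15")

audio = np.zeros(16000, dtype=np.float32)  # one second of silence
result = VoskSTT.transcribe_audio(audio, 16000)
print(result.text, f"{result.processing_time:.2f}s")
```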
test_wav2vec2_arabic.py
ADDED
@@ -0,0 +1,142 @@
#!/usr/bin/env python3
"""
Test script for Wav2Vec2 Arabic STT

This script tests the Wav2Vec2 Arabic STT implementation without requiring
the full Gradio interface.
"""

import sys
import numpy as np
from pathlib import Path

# Add the project directory to Python path
project_dir = Path(__file__).parent
sys.path.insert(0, str(project_dir))

def test_wav2vec2_arabic():
    """Test the Wav2Vec2 Arabic STT implementation."""
    print("🚀 Testing Wav2Vec2 Arabic STT...")

    try:
        from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
        print("✅ Successfully imported Wav2Vec2ArabicSTT")
    except ImportError as e:
        print(f"❌ Failed to import Wav2Vec2ArabicSTT: {e}")
        print("\n📦 Required dependencies:")
        print("pip install transformers torch torchaudio")
        print("Optional: pip install librosa")
        return False

    # Check model availability
    print("\n🔍 Checking model availability...")
    models_info = Wav2Vec2ArabicSTT.get_available_models()

    for key, value in models_info.items():
        status = "✅" if value else "❌"
        print(f"{status} {key}: {value}")

    if not models_info.get("transformers_available", False):
        print("\n❌ Transformers not available. Cannot proceed with test.")
        return False

    # Test model loading (this will download the model if not cached)
    print(f"\n🔄 Loading model...")
    print("⚠️ Note: First run will download ~1.2GB model from Hugging Face")

    try:
        Wav2Vec2ArabicSTT.load_model(device="cpu")  # Use CPU for testing
        print("✅ Model loaded successfully!")

        # Get model info
        model_info = Wav2Vec2ArabicSTT.get_model_info()
        print(f"📊 Model info:")
        for key, value in model_info.items():
            print(f"   {key}: {value}")

    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        return False

    # Test with dummy audio (this won't produce meaningful Arabic text)
    print(f"\n🎤 Testing transcription with dummy audio...")
    print("⚠️ Note: Random audio won't produce meaningful Arabic text")

    try:
        # Create 2 seconds of random audio
        dummy_audio = np.random.randn(32000).astype(np.float32) * 0.1

        result = Wav2Vec2ArabicSTT.transcribe_audio(dummy_audio, 16000)

        print(f"📝 Transcription result:")
        print(f"   Text: '{result.text}'")
        print(f"   Confidence: {result.confidence:.2%}" if result.confidence else "   Confidence: N/A")
        print(f"   Processing time: {result.processing_time:.2f}s")
        print(f"   Metadata: {result.metadata}")

        print("✅ Transcription test completed!")
        return True

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False

def test_integration():
    """Test integration with the modular transcriber."""
    print(f"\n🔗 Testing integration with modular transcriber...")

    try:
        from gradio_voice_transcriber_clean import ModelManager, STT_MODELS

        available_models = ModelManager.get_available_models()
        print(f"📋 Available models: {available_models}")

        if "Wav2Vec2ArabicSTT" in available_models:
            print("✅ Wav2Vec2ArabicSTT is registered in the modular transcriber")

            # Test model options
            options = ModelManager.get_model_options("Wav2Vec2ArabicSTT")
            print(f"📋 Model options: {options}")

            return True
        else:
            print("❌ Wav2Vec2ArabicSTT not found in available models")
            return False

    except ImportError as e:
        print(f"❌ Failed to import modular transcriber components: {e}")
        return False

def main():
    """Main test function."""
    print("🧪 Wav2Vec2 Arabic STT Test Suite")
    print("=" * 50)

    # Test individual STT implementation
    stt_test = test_wav2vec2_arabic()

    # Test integration
    integration_test = test_integration()

    print("\n" + "=" * 50)
    print("📊 Test Results Summary:")
    print(f"   STT Implementation: {'✅ PASS' if stt_test else '❌ FAIL'}")
    print(f"   Integration: {'✅ PASS' if integration_test else '❌ FAIL'}")

    if stt_test and integration_test:
        print("\n🎉 All tests passed! The Wav2Vec2 Arabic STT is ready to use.")
        print("\n💡 Next steps:")
        print("   1. Run: python gradio_voice_transcriber_clean.py")
        print("   2. Select 'Wav2Vec2ArabicSTT' from the dropdown")
        print("   3. Choose your device (CPU/CUDA)")
        print("   4. Load the model and test with Arabic audio!")
    else:
        print("\n❌ Some tests failed. Please check the errors above.")

        if not stt_test:
            print("\n📦 To fix STT implementation issues:")
            print("   pip install transformers torch torchaudio")
            print("   pip install librosa  # optional, for better audio processing")

if __name__ == "__main__":
    main()
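The same static pattern applies to the Wav2Vec2 backend. A minimal sketch of the calls the test exercises; note that the first use downloads the ~1.2GB checkpoint:

```python
import numpy as np

from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT

Wav2Vec2ArabicSTT.load_model(device="cpu")  # CPU keeps the sketch portable

audio = np.zeros(32000, dtype=np.float32)  # two seconds of silence
result = Wav2Vec2ArabicSTT.transcribe_audio(audio, 16000)
print(result.text)
```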
test_whisper_local.py
ADDED
@@ -0,0 +1,32 @@
#!/usr/bin/env python3
"""
Simple Whisper Test

Load Whisper model and test transcription.
"""

import numpy as np
from stt import WhisperSTT

def main():
    print("Loading Whisper model...")
    WhisperSTT.load_model("tiny")  # Load tiny model (fastest)

    if not WhisperSTT.is_loaded:
        print("Failed to load model")
        return

    print("Model loaded successfully!")

    # Create some test audio (1 second of random noise)
    test_audio = np.random.randn(16000).astype(np.float32)

    print("Transcribing audio...")
    result = WhisperSTT.transcribe_numpy(test_audio, 16000)

    print(f"Result: {result.text}")
    print(f"Confidence: {result.confidence}")
    print(f"Time: {result.processing_time:.2f}s")

if __name__ == "__main__":
    main()
uv.lock
ADDED
The diff for this file is too large to render.