GitHub Actions committed on
Commit 57b8470 · 1 Parent(s): 921859d

Deploy from bot_text branch - Sat Dec 27 17:53:04 UTC 2025
1.4.0 ADDED
File without changes
INSTALL.md ADDED
@@ -0,0 +1,152 @@
+ # Installation Guide
+
+ This guide explains how to install and set up the Modular Voice Transcriber with different STT models.
+
+ ## 🚀 Quick Start
+
+ ### Option 1: Automated Setup (Recommended)
+ ```bash
+ # Essential models (Whisper + Wav2Vec2)
+ python setup.py --profile essential --test
+
+ # Or for specific models only
+ python setup.py --profile whisper-only --test
+ python setup.py --profile wav2vec2-only --test
+ ```
+
+ ### Option 2: Manual Installation
+
+ #### Base Installation
+ ```bash
+ # Core requirements (quote the specifiers so the shell doesn't treat >= as a redirect)
+ pip install "gradio>=4.0.0" "numpy>=1.21.0" "soundfile>=0.12.1"
+ ```
+
+ #### Choose Your STT Models
+
+ **OpenAI Whisper (Local + API)**
+ ```bash
+ pip install -r requirements_whisper.txt
+ # Or: pip install -e .[whisper,whisper-api]
+ ```
+
+ **Wav2Vec2 Arabic**
+ ```bash
+ pip install -r requirements_wav2vec2.txt
+ # Or: pip install -e .[wav2vec2]
+ ```
+
+ **All Models**
+ ```bash
+ pip install -r requirements.txt
+ # Or: pip install -e .[all-stt]
+ ```
+
+ ## 📦 Installation Profiles
+
+ | Profile | Models Included | Use Case |
+ |---------|----------------|----------|
+ | `minimal` | None | Interface only (for development) |
+ | `essential` | Whisper + Wav2Vec2 | Best balance of features |
+ | `whisper-only` | OpenAI Whisper | English + Multilingual |
+ | `wav2vec2-only` | Wav2Vec2 Arabic | Arabic Egyptian dialect |
+ | `all` | All supported models | Complete functionality |
+
+ ## 🔧 System Requirements
+
+ ### Minimum Requirements
+ - Python 3.8+
+ - 4GB RAM
+ - 2GB free disk space
+
+ ### Recommended Requirements
+ - Python 3.9+
+ - 8GB RAM
+ - 5GB free disk space
+ - GPU with CUDA support (for faster transcription)
+
+ ## 📋 Model Download Sizes
+
+ | Model | First Download | Disk Space |
+ |-------|---------------|------------|
+ | Whisper Tiny | 39MB | 39MB |
+ | Whisper Base | 142MB | 142MB |
+ | Whisper Medium | 1.5GB | 1.5GB |
+ | Wav2Vec2 Arabic | 1.2GB | 1.2GB |
+
+ ## 🧪 Testing Your Installation
+
+ ### Test Individual Models
+ ```bash
+ # Test Wav2Vec2 Arabic
+ python test_wav2vec2_arabic.py
+
+ # Test Whisper
+ python test_whisper_local.py
+ ```
+
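+ You can also verify which optional back ends actually import before launching the interface. This is a minimal sketch using the same try/except import pattern the app itself uses (package names as in the requirements files above):
+
+ ```python
+ # Quick availability check for optional STT dependencies.
+ for name in ("whisper", "transformers", "torch", "vosk"):
+     try:
+         __import__(name)
+         print(f"{name}: OK")
+     except ImportError:
+         print(f"{name}: not installed")
+ ```
+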
+ ### Test Full Interface
+ ```bash
+ python gradio_voice_transcriber_clean.py
+ ```
+
+ ## 🔍 Troubleshooting
+
+ ### Common Issues
+
+ **Import Error: transformers**
+ ```bash
+ pip install transformers torch torchaudio
+ ```
+
+ **Import Error: whisper**
+ ```bash
+ pip install openai-whisper
+ ```
+
+ **CUDA Issues**
+ - Install PyTorch with CUDA support from [pytorch.org](https://pytorch.org) (quick check below)
+ - Or use CPU-only: `pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu`
+
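+ To confirm whether the GPU build is active, a one-line check against the standard PyTorch API:
+
+ ```python
+ import torch  # whichever build you installed above
+ print("CUDA available:", torch.cuda.is_available())
+ ```
+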
+ **Model Download Issues**
+ - Check your internet connection
+ - Hugging Face models download automatically on first use
+ - Downloads go to `~/.cache/huggingface/` and `~/.cache/whisper/`
+
+ ### Performance Tips
+
+ **For Better Speed:**
+ - Use a GPU if available
+ - Choose smaller models for real-time use
+
+ **For Better Quality:**
+ - Use larger models for better accuracy
+ - Record in a quiet environment
+ - Use a good microphone
+ - Speak clearly and at a normal pace
+ - Choose the appropriate language/dialect model
+
+ ## 🔄 Updating
+
+ ```bash
+ # Update to latest versions
+ pip install --upgrade -r requirements.txt
+
+ # Update specific models
+ pip install --upgrade transformers openai-whisper
+ ```
+
+ ## 🎯 Next Steps
+
+ 1. **Run the interface:** `python gradio_voice_transcriber_clean.py`
+ 2. **Choose your model** from the dropdown
+ 3. **Load the model** (the first load downloads it)
+ 4. **Test with audio** recording or upload
+ 5. **Check quality analysis** for audio tips
+
+ ## 📚 Additional Resources
+
+ - [Gradio Documentation](https://gradio.app/docs/)
+ - [Whisper by OpenAI](https://openai.com/research/whisper)
+ - [Wav2Vec2 Models](https://huggingface.co/models?search=wav2vec2)
+ - [Transformers Library](https://huggingface.co/docs/transformers/)
README.md CHANGED
@@ -1,12 +1,216 @@
- ---
- title: Stt Trails
- emoji: 🌖
- colorFrom: pink
- colorTo: pink
- sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Modular Voice Transcriber
+
+ A comprehensive, modular Gradio-based web interface for speech-to-text transcription supporting multiple STT engines, including OpenAI Whisper, Wav2Vec2, HuBERT Arabic, Tawasul, Vosk, and Coqui STT models.
+
+ ## 🌟 Features
+
+ - **Comprehensive STT Support**: 7+ different speech-to-text engines
+ - **Multiple Models**: OpenAI Whisper, Wav2Vec2 Arabic, HuBERT, Tawasul, Vosk, Coqui STT
+ - **Arabic Language Focus**: Specialized models for Arabic dialect recognition
+ - **Web Interface**: User-friendly Gradio interface with image gallery
+ - **Real-time Processing**: Live audio recording and transcription
+ - **Quality Analysis**: Audio quality feedback and recommendations
+ - **Device Support**: Automatic CPU/GPU detection and selection
+ - **Authentication**: Support for private Hugging Face models
+ - **Static Class Support**: Optimized memory usage for certain models
+ - **Visual Interface**: Interactive image gallery with thumbnail navigation
+
+ ## 🚀 Quick Start
+
+ ### Option 1: Automated Setup (Recommended)
+ ```bash
+ # Essential models (Whisper + Wav2Vec2)
+ python setup.py --profile essential --test
+
+ # Or specific models
+ python setup.py --profile whisper-only --test
+ python setup.py --profile wav2vec2-only --test
+ ```
+
+ ### Option 2: Manual Installation
+
+ ```bash
+ # Base installation (Whisper + Wav2Vec2)
+ pip install -r requirements.txt
+
+ # Specific model installations
+ pip install -r requirements_whisper.txt   # OpenAI Whisper only
+ pip install -r requirements_wav2vec2.txt  # Wav2Vec2 only
+ pip install -r requirements_hubert.txt    # HuBERT Arabic only
+ pip install -r requirements_tawasul.txt   # Tawasul Arabic only
+ pip install -r requirements_vosk.txt      # Vosk offline only
+ pip install -r requirements_coqui.txt     # Coqui STT only
+
+ # Or install with specific extras
+ pip install -e .[essential]  # Whisper + Wav2Vec2
+ pip install -e .[all-stt]    # All models
+ ```
+
+ ## 🎯 Supported STT Models
+
+ | Model | Language | Size | Type | Quality | Features |
+ |-------|----------|------|------|---------|----------|
+ | **Whisper Tiny** | Multilingual | 39MB | Local/API | Fast | General purpose |
+ | **Whisper Base** | Multilingual | 142MB | Local/API | Good | General purpose |
+ | **Whisper Medium** | Multilingual | 1.5GB | Local/API | Better | General purpose |
+ | **Whisper Large** | Multilingual | 2.9GB | Local/API | Best | General purpose |
+ | **Wav2Vec2 Arabic** | Arabic | 1.2GB | Local | Excellent | Arabic dialects |
+ | **HuBERT Arabic** | Arabic Egyptian | 1.2GB | Local | Excellent | Egyptian dialect |
+ | **Tawasul V0** | Arabic | 800MB | Local | Very Good | Arabic speech, static class |
+ | **Vosk** | Multilingual | 50MB-1.8GB | Local/Offline | Good | Offline capable |
+ | **Coqui STT** | Multilingual | 180MB-2GB | Local | Good | Open source |
+
+ ## 🔧 Usage
+
+ ### Start the Interface
+ ```bash
+ python gradio_voice_transcriber_clean.py
+ ```
+
+ ### Using Different Models
+
+ 1. **Select Model**: Choose from the dropdown (WhisperSTT, Wav2Vec2ArabicSTT, HuBERTArabicSTT, TawasulSTT, VoskSTT, CoquiSTT)
+ 2. **Configure**: Set model size, device, language, and authentication if needed
+ 3. **Load**: Click "Load Model" (the first load downloads the model automatically)
+ 4. **Transcribe**: Record audio or upload audio files
+ 5. **Gallery**: Browse sample images using the interactive thumbnail gallery
+
+ ### Authentication for Private Models
+
+ Some experimental models require Hugging Face authentication:
+
+ ```bash
+ # Option 1: Use the helper script
+ python setup_hf_auth.py
+
+ # Option 2: Manual login
+ pip install huggingface-hub
+ huggingface-cli login
+
+ # Option 3: Use a token in the interface
+ # Get a token from: https://huggingface.co/settings/tokens
+ # Enter it in the "HuggingFace Token" field
+ ```
+
+ ## πŸ—οΈ Adding New STT Models
96
+
97
+ The system is designed to be easily extensible:
98
+
99
+ 1. **Create STT Class**: Inherit from `BaseSTT` in `stt/your_model.py`
100
+ 2. **Register Model**: Add to `STT_MODELS` in `gradio_voice_transcriber_clean.py`
101
+ 3. **Configure Options**: Update `get_model_options()` method
102
+ 4. **Test**: Run your model through the interface
103
+
104
+ See `STT_INTEGRATION_GUIDE.md` for detailed instructions.
105
+
106
+ ## 🎨 Interface Features
107
+
108
+ ### Image Gallery
109
+ The enhanced interface (`gradio_voice_transcript_temp.py`) includes:
110
+ - **Interactive Gallery**: Browse sample images with thumbnail navigation
111
+ - **Horizontal Scrolling**: Smooth image browsing experience
112
+ - **Thumbnail Selection**: Click thumbnails to view full images
113
+ - **Gallery Controls**: Navigation and zoom functionality
114
+
115
+ ### Static Class Models
116
+ Some models (like TawasulSTT) use static class implementation for:
117
+ - **Memory Efficiency**: Reduced memory footprint
118
+ - **Faster Loading**: Optimized model initialization
119
+ - **Shared Resources**: Better resource management
120
+
121
+ ## πŸ“ Model Storage Locations
122
+
123
+ Different STT models store their files in various locations:
124
+
125
+ | Model Type | Storage Location | Description |
126
+ |------------|------------------|-------------|
127
+ | **Hugging Face Models** | `~/.cache/huggingface/` | Wav2Vec2, HuBERT, Tawasul models |
128
+ | **Whisper Models** | `~/.cache/whisper/` | OpenAI Whisper model files |
129
+ | **Vosk Models** | `~/.vosk/models/` | Offline Vosk language models |
130
+ | **Coqui Models** | Managed by model manager | Coqui STT model files |
131
+
132
+ ## πŸ“ Project Structure
133
+
134
+ ```
135
+ STT-trails/
136
+ β”œβ”€β”€ gradio_voice_transcriber.py # Main comprehensive interface
137
+ β”œβ”€β”€ gradio_voice_transcript_temp.py # Enhanced interface with image gallery
138
+ β”œβ”€β”€ stt/ # STT implementations
139
+ β”‚ β”œβ”€β”€ stt_base.py # Base class for all STT models
140
+ β”‚ β”œβ”€β”€ whisper_stt.py # OpenAI Whisper implementation
141
+ β”‚ β”œβ”€β”€ wav2vec2_arabic_stt.py # Wav2Vec2 Arabic model
142
+ β”‚ β”œβ”€β”€ hubert_arabic_stt.py # HuBERT Arabic dialect model
143
+ β”‚ β”œβ”€β”€ tawasul_stt.py # Tawasul Arabic model (static class)
144
+ β”‚ β”œβ”€β”€ vosk_stt.py # Vosk offline STT
145
+ β”‚ β”œβ”€β”€ coqui_stt.py # Coqui open-source STT
146
+ β”‚ └── example_custom_stt.py # Template for new models
147
+ β”œβ”€β”€ setup.py # Installation helper
148
+ β”œβ”€β”€ setup_hf_auth.py # HuggingFace authentication helper
149
+ β”œβ”€β”€ test_*.py # Model testing scripts
150
+ β”œβ”€β”€ requirements*.txt # Dependencies for each model
151
+ β”œβ”€β”€ recordings/ # Audio recordings directory
152
+ β”œβ”€β”€ INSTALL.md # Detailed installation guide
153
+ β”œβ”€β”€ STT_INTEGRATION_GUIDE.md # Developer integration guide
154
+ └── pyproject.toml # Project configuration
155
+ ```
156
+
157
+ ## πŸ§ͺ Testing
158
+
159
+ ```bash
160
+ # Test specific models
161
+ python test_whisper_local.py # Test Whisper models
162
+ python test_wav2vec2_arabic.py # Test Wav2Vec2 Arabic
163
+ python test_hubert_arabic.py # Test HuBERT Arabic
164
+ python test_tawasul.py # Test Tawasul Arabic
165
+ python test_vosk.py # Test Vosk offline STT
166
+ python test_coqui.py # Test Coqui STT
167
+
168
+ # Test installation
169
+ python setup.py --profile essential --test
170
+
171
+ # Run the interface
172
+ python gradio_voice_transcriber.py # Main interface
173
+ python gradio_voice_transcript_temp.py # Interface with image gallery
174
+ ```
175
+
176
+ ## πŸ” Troubleshooting
177
+
178
+ ### Model Loading Issues
179
+ - **HuggingFace Authentication**: Use `setup_hf_auth.py` or manual token
180
+ - **Memory Issues**: Use smaller models or CPU-only mode
181
+ - **Internet Required**: First model download needs internet connection
182
+
183
+ ### Audio Issues
184
+ - **No Audio Detected**: Check microphone permissions and volume
185
+ - **Poor Quality**: Use audio quality analysis feature
186
+ - **Wrong Language**: Select appropriate model for your language
187
+
188
+ ### Performance Tips
189
+ - **Use GPU**: Automatic if CUDA PyTorch is installed
190
+ - **Chunk Long Audio**: Handled automatically for 20+ second clips
191
+ - **Choose Right Model**: Balance size vs. accuracy for your use case
192
+
193
+ ## πŸ“„ License
194
+
195
+ This project is open source. See individual model licenses:
196
+ - OpenAI Whisper: MIT License
197
+ - Wav2Vec2: MIT License
198
+ - HuggingFace Transformers: Apache 2.0
199
+
200
+ ## 🀝 Contributing
201
+
202
+ 1. Fork the repository
203
+ 2. Create your feature branch
204
+ 3. Add your STT model following the integration guide
205
+ 4. Submit a pull request
206
+
207
+ ## πŸ“š Resources
208
+
209
+ - [OpenAI Whisper](https://openai.com/research/whisper)
210
+ - [Wav2Vec2 Paper](https://arxiv.org/abs/2006.11477)
211
+ - [HuggingFace Models](https://huggingface.co/models)
212
+ - [Gradio Documentation](https://gradio.app/docs/)
213
+
214
  ---
215
 
216
+ **Made with ❀️ for the speech recognition community**
STT_INTEGRATION_GUIDE.md ADDED
@@ -0,0 +1,270 @@
+ # Modular STT Integration Guide
+
+ This guide explains how to integrate new Speech-to-Text models into the modular Gradio voice transcriber.
+
+ ## 🏗️ Architecture Overview
+
+ The system is built with a modular architecture that makes it easy to add new STT engines:
+
+ ```
+ gradio_voice_transcriber_clean.py
+ ├── ModelManager         # Handles model registration and loading
+ ├── AudioProcessor       # Preprocesses audio for better quality
+ ├── TranscriptionEngine  # Manages transcription workflow
+ └── GradioInterface      # Creates the web UI
+ ```
+
+ ## 🔧 Adding a New STT Model
+
+ ### Step 1: Create Your STT Class
+
+ Create a new file in the `stt/` directory (e.g., `your_stt.py`) that inherits from `BaseSTT`:
+
+ ```python
+ from stt.stt_base import BaseSTT, STTResult
+ import numpy as np
+
+ class YourSTT(BaseSTT):
+     model_name = "YourSTT"
+     model = None
+     is_loaded = False
+     config = {}
+
+     @classmethod
+     def load_model(cls, **kwargs):
+         # Initialize your STT service (your_stt_client is a placeholder)
+         cls.model = your_stt_client()
+         cls.is_loaded = True
+
+     @classmethod
+     def transcribe_audio(cls, audio_data, sample_rate=None):
+         # Implement transcription logic
+         result = cls.model.transcribe(audio_data)
+         return STTResult(text=result.text, confidence=result.confidence)
+ ```
+
+ ### Step 2: Register Your Model
+
+ Add your model to the registry in `gradio_voice_transcriber_clean.py`:
+
+ ```python
+ # Import your model
+ from stt.your_stt import YourSTT
+
+ # Add to registry
+ STT_MODELS = {
+     "WhisperSTT": WhisperSTT,
+     "YourSTT": YourSTT,  # Add this line
+ }
+ ```
+
+ ### Step 3: Configure Model Options
+
+ Update the `ModelManager.get_model_options()` method:
+
+ ```python
+ @staticmethod
+ def get_model_options(model_name: str) -> Dict[str, Any]:
+     if model_name == "YourSTT":
+         return {
+             "model_sizes": ["small", "large"],
+             "supports_api": True,
+             "languages": [("English", "en"), ("Spanish", "es")],
+             "default_params": {"temperature": 0.0}
+         }
+     # ... existing code
+ ```
+
+ ### Step 4: Handle Model Loading
+
+ Update the loading logic in `ModelManager.load_model()`:
+
+ ```python
+ if model_name == "YourSTT":
+     api_key = kwargs.get("api_key", "")
+     model_size = kwargs.get("model_size", "small")
+
+     YourSTT.load_model(api_key=api_key, model_size=model_size)
+     status = f"✅ {model_name} loaded successfully"
+ ```
+
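+ Once registered and loaded, the interface drives every model through the same entry point. An end-to-end smoke test of that path, as a minimal sketch (it assumes the `YourSTT` template above has working `load_model` and `transcribe_audio` implementations):
+
+ ```python
+ import numpy as np
+ from stt.your_stt import YourSTT
+
+ YourSTT.load_model(model_size="small")
+ dummy_audio = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
+ result = YourSTT.transcribe_audio(dummy_audio, 16000)
+ print(result.text, result.confidence)
+ ```
+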
+ ## 📝 Real Examples
+
+ ### Azure Speech Service
+
+ ```python
+ import azure.cognitiveservices.speech as speechsdk
+
+ class AzureSTT(BaseSTT):
+     model_name = "AzureSTT"
+
+     @classmethod
+     def load_model(cls, subscription_key, region):
+         speech_config = speechsdk.SpeechConfig(
+             subscription=subscription_key,
+             region=region
+         )
+         cls.model = speech_config
+         cls.is_loaded = True
+
+     @classmethod
+     def transcribe_audio(cls, audio_data, sample_rate=None):
+         # Convert audio and send to Azure
+         # Return STTResult with transcription
+         pass
+ ```
+
+ ### Google Cloud Speech
+
+ ```python
+ import os
+ from google.cloud import speech
+
+ class GoogleSTT(BaseSTT):
+     model_name = "GoogleSTT"
+
+     @classmethod
+     def load_model(cls, credentials_path):
+         # The client reads credentials from this environment variable
+         os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
+         cls.model = speech.SpeechClient()
+         cls.is_loaded = True
+
+     @classmethod
+     def transcribe_audio(cls, audio_data, sample_rate=None):
+         # Process with Google Cloud Speech
+         pass
+ ```
+
+ ### AssemblyAI
+
+ ```python
+ import assemblyai as aai
+
+ class AssemblyAISTT(BaseSTT):
+     model_name = "AssemblyAISTT"
+
+     @classmethod
+     def load_model(cls, api_key):
+         aai.settings.api_key = api_key
+         cls.model = aai.Transcriber()
+         cls.is_loaded = True
+
+     @classmethod
+     def transcribe_audio(cls, audio_data, sample_rate=None):
+         # Save audio temporarily and transcribe
+         pass
+ ```
+
+ ## 🎯 Best Practices
+
+ ### 1. Error Handling
+ ```python
+ @classmethod
+ def load_model(cls, **kwargs):
+     try:
+         # Model loading logic
+         cls.is_loaded = True
+     except Exception as e:
+         cls.is_loaded = False
+         raise RuntimeError(f"Failed to load {cls.model_name}: {e}")
+ ```
+
+ ### 2. Configuration Management
+ ```python
+ class YourSTT(BaseSTT):
+     config = {
+         "default_language": "en",
+         "timeout": 30,
+         "retry_count": 3
+     }
+
+     @classmethod
+     def set_language(cls, language):
+         cls.config["default_language"] = language
+ ```
+
+ ### 3. Audio Format Handling
+ ```python
+ @classmethod
+ def transcribe_audio(cls, audio_data, sample_rate=None):
+     # Handle numpy arrays
+     if isinstance(audio_data, np.ndarray):
+         # Convert to required format (audio_to_bytes is your own helper)
+         audio_bytes = audio_to_bytes(audio_data, sample_rate)
+     else:
+         # Handle file paths
+         with open(audio_data, 'rb') as f:
+             audio_bytes = f.read()
+
+     # Transcribe and return result
+ ```
+
+ ### 4. Metadata and Confidence
+ ```python
+ return STTResult(
+     text=transcription,
+     confidence=confidence_score,
+     processing_time=processing_time,
+     metadata={
+         "model": cls.model_name,
+         "language_detected": detected_language,
+         "audio_duration": duration,
+         "service_info": additional_info
+     }
+ )
+ ```
+
+ ## 🚀 Testing Your Integration
+
+ 1. **Unit Test Your STT Class**:
+ ```python
+ def test_your_stt():
+     YourSTT.load_model(api_key="test")
+     dummy_audio = np.random.randn(16000).astype(np.float32)
+     result = YourSTT.transcribe_audio(dummy_audio, 16000)
+     assert result.text is not None
+ ```
+
+ 2. **Test in the Gradio Interface**:
+    - Run `python gradio_voice_transcriber_clean.py`
+    - Select your model from the dropdown
+    - Load it and test with audio
+
+ ## 🛠️ Advanced Features
+
+ ### Custom UI Components
+
+ You can add model-specific UI components by extending the interface:
+
+ ```python
+ # Add custom fields for your model
+ if model_name == "YourSTT":
+     custom_setting = gr.Slider(
+         minimum=0, maximum=1, value=0.5,
+         label="Custom Setting"
+     )
+ ```
+
+ ### Background Processing
+
+ For long-running transcriptions:
+
+ ```python
+ import threading
+
+ @classmethod
+ def transcribe_audio_async(cls, audio_data, callback):
+     # Run the transcription in a background thread and hand the
+     # result to the callback when it finishes
+     def _worker():
+         callback(cls.transcribe_audio(audio_data))
+     threading.Thread(target=_worker, daemon=True).start()
+ ```
+
+ ## 📊 Currently Available Models
+
+ - **WhisperSTT**: OpenAI Whisper (local + API)
+ - **ExampleCustomSTT**: Template for new integrations
+
+ ## 🎯 Next Steps
+
+ 1. Choose your STT service
+ 2. Follow the integration pattern
+ 3. Test thoroughly
+ 4. Contribute back to the project!
+
+ The modular design makes it easy to support any STT service while maintaining a consistent user experience.
app.py CHANGED
@@ -1,1106 +1,1264 @@
1
- #!/usr/bin/env python3
2
- """
3
- Modular Gradio Voice Transcriber
4
-
5
- A flexible web interface for voice transcription supporting multiple STT models.
6
- Easily extensible to support any STT implementation that follows the BaseSTT interface.
7
-
8
- Usage:
9
- python gradio_voice_transcriber_clean.py
10
- """
11
-
12
- import gradio as gr
13
- import numpy as np
14
- import logging
15
- import time
16
- from typing import Tuple, Optional, Dict, Any, Type, List, Union
17
- from pathlib import Path
18
-
19
- # Import base STT class and available implementations
20
- from stt.stt_base import BaseSTT, STTResult
21
- from stt.whisper_stt import WhisperSTT
22
-
23
- # Try to import Wav2Vec2 Arabic STT (optional)
24
- try:
25
- from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
26
- WAV2VEC2_AVAILABLE = True
27
- except ImportError:
28
- WAV2VEC2_AVAILABLE = False
29
-
30
- # Try to import HuBERT Arabic STT (optional)
31
- try:
32
- from stt.hubert_arabic_stt import HuBERTArabicSTT
33
- HUBERT_AVAILABLE = True
34
- except ImportError:
35
- HUBERT_AVAILABLE = False
36
-
37
- # Try to import Vosk STT (optional)
38
- try:
39
- from stt.vosk_stt import VoskSTT
40
- VOSK_AVAILABLE = True
41
- except ImportError:
42
- VOSK_AVAILABLE = False
43
-
44
- # Try to import Coqui STT (optional)
45
- try:
46
- from stt.coqui_stt import CoquiSTT
47
- COQUI_AVAILABLE = True
48
- except ImportError:
49
- COQUI_AVAILABLE = False
50
-
51
- # Try to import Tawasul STT (optional)
52
- try:
53
- from stt.tawasul_stt import TawasulSTT
54
- TAWASUL_AVAILABLE = True
55
- except ImportError:
56
- TAWASUL_AVAILABLE = False
57
-
58
- # Setup logging
59
- logging.basicConfig(level=logging.INFO)
60
- logger = logging.getLogger(__name__)
61
-
62
- # STT Model Registry - Add new models here
63
- STT_MODELS: Dict[str, Type[BaseSTT]] = {
64
- "WhisperSTT": WhisperSTT,
65
- }
66
-
67
- # Add Wav2Vec2 Arabic if available
68
- if WAV2VEC2_AVAILABLE:
69
- STT_MODELS["Wav2Vec2ArabicSTT"] = Wav2Vec2ArabicSTT
70
-
71
- # Add HuBERT Arabic if available
72
- if HUBERT_AVAILABLE:
73
- STT_MODELS["HuBERTArabicSTT"] = HuBERTArabicSTT
74
-
75
- # Add Vosk if available
76
- if VOSK_AVAILABLE:
77
- STT_MODELS["VoskSTT"] = VoskSTT
78
-
79
- # Add Coqui STT if available
80
- if COQUI_AVAILABLE:
81
- STT_MODELS["CoquiSTT"] = CoquiSTT
82
-
83
- # Add Tawasul STT if available
84
- if TAWASUL_AVAILABLE:
85
- STT_MODELS["TawasulSTT"] = TawasulSTT
86
-
87
- # Global state
88
- current_stt_model: Optional[Type[BaseSTT]] = None
89
- current_model_config: Dict[str, Any] = {}
90
-
91
-
92
- class AudioProcessor:
93
- """Handle audio preprocessing for better transcription quality."""
94
-
95
- @staticmethod
96
- def preprocess(audio_data: np.ndarray, sample_rate: int, target_sr: int = 16000) -> np.ndarray:
97
- """
98
- Preprocess audio for better transcription quality.
99
-
100
- Args:
101
- audio_data: Raw audio data
102
- sample_rate: Original sample rate
103
- target_sr: Target sample rate (default: 16000 for Whisper)
104
-
105
- Returns:
106
- Preprocessed audio data
107
- """
108
- # Convert to mono if stereo
109
- if audio_data.ndim > 1:
110
- audio_data = np.mean(audio_data, axis=1)
111
-
112
- # Normalize to float32 [-1, 1]
113
- if audio_data.dtype == np.int16:
114
- audio_data = audio_data.astype(np.float32) / 32768.0
115
- elif audio_data.dtype == np.int32:
116
- audio_data = audio_data.astype(np.float32) / 2147483648.0
117
- else:
118
- audio_data = audio_data.astype(np.float32)
119
-
120
- # Clip to prevent overflow
121
- audio_data = np.clip(audio_data, -1.0, 1.0)
122
-
123
- # Remove DC offset
124
- audio_data = audio_data - np.mean(audio_data)
125
-
126
- # Simple noise gate (remove very quiet sections)
127
- if len(audio_data) > 0:
128
- threshold = np.max(np.abs(audio_data)) * 0.01
129
- audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
130
-
131
- # Resample if needed
132
- if sample_rate != target_sr:
133
- audio_data = AudioProcessor._resample(audio_data, sample_rate, target_sr)
134
-
135
- return audio_data
136
-
137
- @staticmethod
138
- def _resample(audio_data: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
139
- """Simple resampling (prefer librosa if available)."""
140
- try:
141
- import librosa
142
- return librosa.resample(audio_data, orig_sr=orig_sr, target_sr=target_sr)
143
- except ImportError:
144
- # Simple resampling fallback
145
- if orig_sr > target_sr:
146
- step = orig_sr // target_sr
147
- return audio_data[::step]
148
- else:
149
- repeat_factor = target_sr // orig_sr
150
- return np.repeat(audio_data, repeat_factor)
151
-
152
- @staticmethod
153
- def _preprocess_audio(audio_path: str) -> Tuple[np.ndarray, int]:
154
- """
155
- Preprocess audio file for STT models that need torch.Tensor input.
156
-
157
- Args:
158
- audio_path: Path to audio file
159
-
160
- Returns:
161
- Tuple of (audio_tensor_as_numpy, sample_rate) that can be converted to torch.Tensor
162
- """
163
- try:
164
- import librosa
165
- import soundfile as sf
166
-
167
- # Try to load with librosa first (more robust)
168
- try:
169
- audio_data, sample_rate = librosa.load(audio_path, sr=16000)
170
- except Exception:
171
- # Fallback to soundfile
172
- audio_data, sample_rate = sf.read(audio_path)
173
- if sample_rate != 16000:
174
- audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
175
- sample_rate = 16000
176
-
177
- # Convert to mono if needed
178
- if audio_data.ndim > 1:
179
- audio_data = np.mean(audio_data, axis=1)
180
-
181
- # Normalize audio to [-1, 1]
182
- if audio_data.max() > 1.0:
183
- audio_data = audio_data / audio_data.max()
184
-
185
- # Remove DC offset
186
- audio_data = audio_data - np.mean(audio_data)
187
-
188
- # Apply noise gate for very quiet audio
189
- threshold = np.max(np.abs(audio_data)) * 0.01
190
- audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
191
-
192
- # Convert to float32 for compatibility
193
- audio_data = audio_data.astype(np.float32)
194
-
195
- return audio_data, sample_rate
196
-
197
- except Exception as e:
198
- raise RuntimeError(f"Audio preprocessing failed: {str(e)}")
199
-
200
- @staticmethod
201
- def _preprocess_audio_torch(audio_path: str):
202
- """
203
- Preprocess audio file and return torch.Tensor for PyTorch-based STT models.
204
-
205
- Args:
206
- audio_path: Path to audio file
207
-
208
- Returns:
209
- Tuple of (audio_tensor, sample_rate) where audio_tensor is torch.Tensor
210
- """
211
- try:
212
- import torch
213
-
214
- # Get numpy array first
215
- audio_data, sample_rate = AudioProcessor._preprocess_audio(audio_path)
216
-
217
- # Convert to torch tensor
218
- audio_tensor = torch.FloatTensor(audio_data)
219
-
220
- return audio_tensor, sample_rate
221
-
222
- except ImportError:
223
- raise RuntimeError("PyTorch not available. Install with: pip install torch")
224
- except Exception as e:
225
- raise RuntimeError(f"Torch audio preprocessing failed: {str(e)}")
226
-
227
- @staticmethod
228
- def analyze_quality(audio_data: np.ndarray, sample_rate: int) -> Dict[str, Any]:
229
- """Analyze audio quality and provide feedback."""
230
- if audio_data.ndim > 1:
231
- audio_data = np.mean(audio_data, axis=1)
232
-
233
- duration = len(audio_data) / sample_rate
234
- max_amp = np.max(np.abs(audio_data))
235
- mean_amp = np.mean(np.abs(audio_data))
236
-
237
- # Check for clipping and silence
238
- clipping_ratio = np.sum(np.abs(audio_data) > 0.95) / len(audio_data)
239
- silence_threshold = max_amp * 0.01
240
- silence_ratio = np.sum(np.abs(audio_data) < silence_threshold) / len(audio_data)
241
-
242
- return {
243
- "duration": duration,
244
- "max_amplitude": max_amp,
245
- "mean_amplitude": mean_amp,
246
- "clipping_ratio": clipping_ratio,
247
- "silence_ratio": silence_ratio,
248
- "sample_rate": sample_rate,
249
- "is_good_quality": (
250
- duration > 1.0 and
251
- 0.1 < max_amp < 0.9 and
252
- clipping_ratio < 0.01 and
253
- silence_ratio < 0.5
254
- )
255
- }
256
-
257
-
258
- class ModelManager:
259
- """Handle STT model registration and loading."""
260
-
261
- @staticmethod
262
- def get_available_models() -> List[str]:
263
- """Get list of available STT model names."""
264
- return list(STT_MODELS.keys())
265
-
266
- @staticmethod
267
- def get_model_options(model_name: str) -> Dict[str, Any]:
268
- """Get model-specific configuration options."""
269
- if model_name == "WhisperSTT":
270
- return {
271
- "model_sizes": ["tiny", "base", "small", "medium", "large"],
272
- "supports_api": True,
273
- "languages": [
274
- ("Auto-detect", "auto"),
275
- ("English", "en"),
276
- ("Spanish", "es"),
277
- ("French", "fr"),
278
- ("German", "de"),
279
- ("Italian", "it"),
280
- ("Portuguese", "pt"),
281
- ("Russian", "ru"),
282
- ("Japanese", "ja"),
283
- ("Korean", "ko"),
284
- ("Chinese", "zh"),
285
- ("Dutch", "nl"),
286
- ("Arabic", "ar"),
287
- ("Hindi", "hi")
288
- ],
289
- "default_params": {
290
- "temperature": 0.0,
291
- "beam_size": 5,
292
- "best_of": 5,
293
- "patience": 2.0,
294
- "condition_on_previous_text": True,
295
- }
296
- }
297
-
298
- elif model_name == "Wav2Vec2ArabicSTT":
299
- return {
300
- "model_sizes": [
301
- ("Arabic Standard", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
302
- ("Multilingual", "facebook/wav2vec2-large-xlsr-53"),
303
- ("English Fallback", "facebook/wav2vec2-base-960h"),
304
- ("Arabic Egyptian (Experimental)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian")
305
- ],
306
- "supports_api": False,
307
- "supports_hf_token": True,
308
- "languages": [
309
- ("Arabic Egyptian", "ar-EG"),
310
- ("Arabic Standard", "ar"),
311
- ("Auto-detect", "auto"),
312
- ],
313
- "device_options": ["auto", "cpu", "cuda"],
314
- "default_params": {
315
- "device": "auto",
316
- "chunk_length": 20,
317
- "return_confidence": True,
318
- }
319
- }
320
-
321
- elif model_name == "VoskSTT":
322
- return {
323
- "model_sizes": [
324
- ("English US Small (40MB)", "vosk-model-small-en-us-0.15"),
325
- ("English US Large (1.8GB)", "vosk-model-en-us-0.22"),
326
- ("Arabic (318MB)", "vosk-model-ar-mgb2-0.4"),
327
- ("French (1.4GB)", "vosk-model-fr-0.22"),
328
- ("German (1.2GB)", "vosk-model-de-0.21"),
329
- ("Spanish (1.4GB)", "vosk-model-es-0.42"),
330
- ("Russian Large (1.5GB)", "vosk-model-ru-0.42"),
331
- ("Russian Small (45MB)", "vosk-model-small-ru-0.22"),
332
- ("Chinese Small (42MB)", "vosk-model-small-cn-0.22"),
333
- ],
334
- "supports_api": False,
335
- "supports_auto_download": True,
336
- "languages": [
337
- ("Auto (based on model)", "auto"),
338
- ("English", "en"),
339
- ("Arabic", "ar"),
340
- ("French", "fr"),
341
- ("German", "de"),
342
- ("Spanish", "es"),
343
- ("Russian", "ru"),
344
- ("Chinese", "zh"),
345
- ],
346
- "default_params": {
347
- "auto_download": True,
348
- "return_confidence": True,
349
- "return_words": True,
350
- }
351
- }
352
-
353
- elif model_name == "HuBERTArabicSTT":
354
- return {
355
- "model_sizes": [
356
- ("Arabic Egyptian (HuBERT)", "omarxadel/hubert-large-arabic-egyptian"),
357
- ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
358
- ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
359
- ("Arabic MSA", "facebook/wav2vec2-large-xlsr-53")
360
- ],
361
- "supports_api": False,
362
- "supports_hf_token": True,
363
- "languages": [
364
- ("Arabic Egyptian", "ar-EG"),
365
- ("Arabic Standard", "ar"),
366
- ("Auto-detect", "auto"),
367
- ],
368
- "device_options": ["auto", "cpu", "cuda"],
369
- "default_params": {
370
- "device": "auto",
371
- "chunk_length": 20,
372
- "return_confidence": True,
373
- "max_audio_length": 120
374
- }
375
- }
376
-
377
- elif model_name == "CoquiSTT":
378
- return {
379
- "model_sizes": [
380
- ("English Large Vocab", "english-large"),
381
- ("English Huge Vocab", "english-huge"),
382
- ("German", "german"),
383
- ("French", "french"),
384
- ("Spanish", "spanish")
385
- ],
386
- "supports_api": False,
387
- "supports_auto_download": True,
388
- "languages": [
389
- ("English", "en"),
390
- ("German", "de"),
391
- ("French", "fr"),
392
- ("Spanish", "es"),
393
- ("Auto (based on model)", "auto"),
394
- ],
395
- "default_params": {
396
- "auto_download": True,
397
- "beam_width": 512,
398
- "lm_alpha": 0.931289039105002,
399
- "lm_beta": 1.1834137581510284,
400
- "return_confidence": True,
401
- "return_timestamps": False,
402
- }
403
- }
404
-
405
- elif model_name == "TawasulSTT":
406
- return {
407
- "model_sizes": [
408
- ("Tawasul STT V0 (Arabic)", "Kareem35/Tawasul-STT-V0"),
409
- ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
410
- ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
411
- ("Multilingual Fallback", "facebook/wav2vec2-large-xlsr-53")
412
- ],
413
- "supports_api": False,
414
- "supports_hf_token": True,
415
- "languages": [
416
- ("Arabic Standard", "ar"),
417
- ("Arabic Egyptian", "ar-EG"),
418
- ("Arabic Saudi", "ar-SA"),
419
- ("Arabic Jordanian", "ar-JO"),
420
- ("Arabic Lebanese", "ar-LB"),
421
- ("Arabic Syrian", "ar-SY"),
422
- ("Arabic Iraqi", "ar-IQ"),
423
- ("Auto-detect", "auto"),
424
- ],
425
- "device_options": ["auto", "cpu", "cuda"],
426
- "default_params": {
427
- "device": "auto",
428
- "chunk_length": 20,
429
- "return_confidence": True,
430
- "max_audio_length": 300
431
- }
432
- }
433
-
434
- # Default options for other models
435
- return {
436
- "model_sizes": ["default"],
437
- "supports_api": False,
438
- "languages": [("Auto-detect", "auto")],
439
- "default_params": {}
440
- }
441
-
442
- @staticmethod
443
- def load_model(model_name: str, **kwargs) -> str:
444
- """Load specified STT model with configuration."""
445
- global current_stt_model, current_model_config
446
-
447
- if model_name not in STT_MODELS:
448
- return f"❌ Unknown model: {model_name}. Available: {list(STT_MODELS.keys())}"
449
-
450
- try:
451
- model_class = STT_MODELS[model_name]
452
-
453
- # Handle TawasulSTT as static class (don't instantiate)
454
- if model_name == "TawasulSTT":
455
- model_instance = model_class # Use class directly for static methods
456
- else:
457
- # Instantiate the model for instance-based classes
458
- model_instance = model_class()
459
-
460
- if model_name == "WhisperSTT":
461
- # Handle WhisperSTT specific loading
462
- model_size = kwargs.get("model_size", "base")
463
- use_api = kwargs.get("use_api", False)
464
- api_key = kwargs.get("api_key", "")
465
-
466
- if use_api and not api_key.strip():
467
- return "❌ Error: API key required for API mode"
468
-
469
- # Load with optimized parameters
470
- load_params = {
471
- "model_size": model_size,
472
- "use_api": use_api,
473
- }
474
-
475
- if api_key:
476
- load_params["api_key"] = api_key.strip()
477
-
478
- # Add quality optimization parameters for local models
479
- if not use_api:
480
- load_params.update({
481
- "temperature": 0.0,
482
- "beam_size": 5,
483
- "best_of": 5,
484
- "patience": 2.0,
485
- "condition_on_previous_text": True,
486
- })
487
-
488
- model_instance.load_model(**load_params)
489
-
490
- current_model_config = {
491
- "model_name": model_name,
492
- "model_size": model_size,
493
- "use_api": use_api
494
- }
495
-
496
- status = f"βœ… {model_name} ({'API' if use_api else model_size}) loaded successfully"
497
-
498
- elif model_name == "Wav2Vec2ArabicSTT":
499
- # Handle Wav2Vec2 Arabic specific loading
500
- device = kwargs.get("device", "auto")
501
- chunk_length = kwargs.get("chunk_length", 20)
502
- hf_token = kwargs.get("hf_token", "")
503
- model_id = kwargs.get("model_size", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
504
-
505
- load_params = {
506
- "device": device,
507
- "chunk_length": chunk_length,
508
- "model_id": model_id,
509
- }
510
-
511
- if hf_token:
512
- load_params["hf_token"] = hf_token.strip()
513
-
514
- model_instance.load_model(**load_params)
515
-
516
- current_model_config = {
517
- "model_name": model_name,
518
- "model_id": model_id,
519
- "device": device,
520
- "chunk_length": chunk_length
521
- }
522
-
523
- # Extract model name for display
524
- model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
525
- status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
526
-
527
- elif model_name == "VoskSTT":
528
- # Handle VoskSTT specific loading
529
- model_name_param = kwargs.get("model_size", "vosk-model-small-en-us-0.15")
530
- auto_download = kwargs.get("auto_download", True)
531
-
532
- load_params = {
533
- "model_name": model_name_param,
534
- "auto_download": auto_download,
535
- }
536
-
537
- model_instance.load_model(**load_params)
538
-
539
- current_model_config = {
540
- "model_name": model_name,
541
- "model_name_param": model_name_param,
542
- "auto_download": auto_download
543
- }
544
-
545
- status = f"βœ… {model_name} ({model_name_param}) loaded successfully"
546
-
547
- elif model_name == "HuBERTArabicSTT":
548
- # Handle HuBERT Arabic specific loading
549
- device = kwargs.get("device", "auto")
550
- chunk_length = kwargs.get("chunk_length", 20)
551
- hf_token = kwargs.get("hf_token", "")
552
- model_id = kwargs.get("model_size", "omarxadel/hubert-large-arabic-egyptian")
553
- max_audio_length = kwargs.get("max_audio_length", 120)
554
-
555
- load_params = {
556
- "device": device,
557
- "chunk_length": chunk_length,
558
- "model_id": model_id,
559
- "max_audio_length": max_audio_length,
560
- }
561
-
562
- if hf_token:
563
- load_params["hf_token"] = hf_token.strip()
564
-
565
- model_instance.load_model(**load_params)
566
-
567
- current_model_config = {
568
- "model_name": model_name,
569
- "model_id": model_id,
570
- "device": device,
571
- "chunk_length": chunk_length,
572
- "max_audio_length": max_audio_length
573
- }
574
-
575
- # Extract model name for display
576
- model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
577
- status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
578
-
579
- elif model_name == "CoquiSTT":
580
- # Handle Coqui STT specific loading
581
- model_name_param = kwargs.get("model_size", "english-large")
582
- auto_download = kwargs.get("auto_download", True)
583
- beam_width = kwargs.get("beam_width", 512)
584
- lm_alpha = kwargs.get("lm_alpha", 0.931289039105002)
585
- lm_beta = kwargs.get("lm_beta", 1.1834137581510284)
586
-
587
- load_params = {
588
- "model_name": model_name_param,
589
- "auto_download": auto_download,
590
- "beam_width": beam_width,
591
- "lm_alpha": lm_alpha,
592
- "lm_beta": lm_beta,
593
- }
594
-
595
- model_instance.load_model(**load_params)
596
-
597
- current_model_config = {
598
- "model_name": model_name,
599
- "model_name_param": model_name_param,
600
- "auto_download": auto_download,
601
- "beam_width": beam_width,
602
- "lm_alpha": lm_alpha,
603
- "lm_beta": lm_beta
604
- }
605
-
606
- status = f"βœ… {model_name} ({model_name_param}) loaded successfully"
607
-
608
- elif model_name == "TawasulSTT":
609
- # Handle Tawasul STT specific loading (static class)
610
- device = kwargs.get("device", "auto")
611
- chunk_length = kwargs.get("chunk_length", 20)
612
- hf_token = kwargs.get("hf_token", "")
613
- model_id = kwargs.get("model_size", "Kareem35/Tawasul-STT-V0")
614
- max_audio_length = kwargs.get("max_audio_length", 300)
615
-
616
- load_params = {
617
- "device": device,
618
- "chunk_length": chunk_length,
619
- "model_id": model_id,
620
- "max_audio_length": max_audio_length,
621
- }
622
-
623
- if hf_token:
624
- load_params["hf_token"] = hf_token.strip()
625
-
626
- # Call static method directly
627
- model_class.load_model(**load_params)
628
-
629
- current_model_config = {
630
- "model_name": model_name,
631
- "model_id": model_id,
632
- "device": device,
633
- "chunk_length": chunk_length,
634
- "max_audio_length": max_audio_length
635
- }
636
-
637
- # Extract model name for display
638
- model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
639
- status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
640
-
641
- else:
642
- # Generic model loading for future STT models
643
- model_instance.load_model(**kwargs)
644
- current_model_config = {"model_name": model_name, **kwargs}
645
- status = f"βœ… {model_name} loaded successfully"
646
-
647
- current_stt_model = model_instance
648
- logger.info(status)
649
- return status
650
-
651
- except Exception as e:
652
- error_msg = f"❌ Error loading {model_name}: {str(e)}"
653
- logger.error(error_msg)
654
- return error_msg
655
-
656
- @staticmethod
657
- def get_model_info() -> str:
658
- """Get information about available and loaded models."""
659
- info = f"**Available Models:** {', '.join(STT_MODELS.keys())}\n\n"
660
-
661
- if current_stt_model:
662
- model_info = current_stt_model.get_model_info()
663
- # Handle different key names for model name
664
- model_name = model_info.get('model_name') or model_info.get('name', 'Unknown')
665
- info += f"**Currently Loaded:** {model_name}\n"
666
- info += f"**Status:** {'βœ… Ready' if model_info['is_loaded'] else '❌ Not loaded'}\n"
667
- info += f"**Config:** {current_model_config}"
668
- else:
669
- info += "**Currently Loaded:** None"
670
-
671
- return info
672
-
673
-
674
- class TranscriptionEngine:
675
- """Handle audio transcription using the loaded STT model."""
676
-
677
- @staticmethod
678
- def transcribe(audio_input: Tuple[int, np.ndarray],
679
- language: Optional[str] = None) -> Tuple[str, str, str]:
680
- """
681
- Transcribe audio input using the currently loaded STT model.
682
-
683
- Args:
684
- audio_input: Tuple of (sample_rate, audio_data) from Gradio
685
- language: Language code for transcription
686
-
687
- Returns:
688
- Tuple of (transcription, confidence_info, processing_info)
689
- """
690
- if audio_input is None:
691
- return "❌ No audio provided", "", ""
692
-
693
- if not current_stt_model or not current_stt_model.is_loaded:
694
- return "❌ No STT model loaded. Please load a model first.", "", ""
695
-
696
- try:
697
- sample_rate, audio_data = audio_input
698
-
699
- # Preprocess audio
700
- processed_audio = AudioProcessor.preprocess(audio_data, sample_rate)
701
-
702
- # Quality checks
703
- quality = AudioProcessor.analyze_quality(processed_audio, 16000)
704
-
705
- if quality["duration"] < 0.5:
706
- return "❌ Audio too short (minimum 0.5 seconds)", "", ""
707
-
708
- if quality["max_amplitude"] < 0.001:
709
- return "❌ Audio too quiet or silent", "", f"Max amplitude: {quality['max_amplitude']:.6f}"
710
-
711
- # Set language for models that support it
712
- if hasattr(current_stt_model, 'set_language') and language and language != "auto":
713
- current_stt_model.set_language(language)
714
-
715
- # Transcribe using different approaches for different models
716
- start_time = time.time()
717
-
718
- # Check if this is TawasulSTT (static class) which needs file path
719
- if current_model_config.get('model_name') == 'TawasulSTT':
720
- # TawasulSTT needs a file path, so save audio to temporary file
721
- import tempfile
722
- import soundfile as sf
723
-
724
- with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
725
- temp_path = temp_file.name
726
- sf.write(temp_path, processed_audio, 16000)
727
-
728
- try:
729
- # Call TawasulSTT.transcribe() with file path
730
- transcription, confidence_info_raw, processing_info_raw = current_stt_model.transcribe(temp_path)
731
-
732
- # Create a result-like object for consistency
733
- class TempResult:
734
- def __init__(self, text, confidence=None, processing_time=None):
735
- self.text = text
736
- self.confidence = confidence
737
- self.processing_time = processing_time
738
-
739
- # Extract confidence from confidence_info_raw if available
740
- confidence_value = None
741
- if confidence_info_raw and "Confidence:" in confidence_info_raw:
742
- try:
743
- conf_str = confidence_info_raw.split("Confidence:")[1].strip()
744
- confidence_value = float(conf_str)
745
- except:
746
- confidence_value = None
747
-
748
- processing_time = time.time() - start_time
749
- result = TempResult(transcription, confidence_value, processing_time)
750
-
751
- finally:
752
- # Clean up temporary file
753
- import os
754
- try:
755
- os.unlink(temp_path)
756
- except:
757
- pass
758
- else:
759
- # For other STT models that use transcribe_audio
760
- result = current_stt_model.transcribe_audio(processed_audio, 16000)
761
-
762
- # Prepare output
763
- transcription = result.text.strip() if result.text else "No speech detected"
764
-
765
- # Filter out common false positives
766
- if transcription.lower() in ["you", "thank you.", "thanks for watching!", ""]:
767
- transcription = "πŸ”‡ No clear speech detected"
768
-
769
- # Confidence info
770
- confidence_info = ""
771
- if result.confidence is not None:
772
- confidence_info = f"Confidence: {result.confidence:.2%}"
773
- if result.confidence < 0.3:
774
- confidence_info += " (Low - consider re-recording)"
775
- else:
776
- confidence_info = "Confidence: N/A"
777
-
778
- # Processing info
779
- processing_info = f"Processing: {result.processing_time or 0:.2f}s\n"
780
- processing_info += f"Model: {current_model_config.get('model_name', 'Unknown')}\n"
781
- processing_info += f"Audio: {quality['duration']:.2f}s, {quality['max_amplitude']:.3f} amplitude\n"
782
- processing_info += f"Quality: {'βœ… Good' if quality['is_good_quality'] else '⚠️ Poor'}"
783
-
784
- return transcription, confidence_info, processing_info
785
-
786
- except Exception as e:
787
- error_msg = f"❌ Transcription error: {str(e)}"
788
- logger.error(error_msg)
789
- return error_msg, "", ""
790
-
791
-
792
- class GradioInterface:
793
- """Create and manage the Gradio web interface."""
794
-
795
- @staticmethod
796
- def create_interface():
797
- """Create the main Gradio interface."""
798
- with gr.Blocks(
799
- title="πŸŽ™οΈ Modular Voice Transcriber",
800
- theme=gr.themes.Soft()
801
- ) as demo:
802
-
803
- gr.Markdown(
804
- """
805
- # πŸŽ™οΈ Modular Voice Transcriber
806
-
807
- A flexible interface supporting multiple STT models.
808
- Easily extensible for new transcription engines.
809
- """
810
- )
811
-
812
- with gr.Row():
813
- # Model Configuration Panel
814
- with gr.Column(scale=1):
815
- gr.Markdown("### πŸ”§ Model Configuration")
816
-
817
- # Model selection
818
- model_selector = gr.Dropdown(
819
- choices=ModelManager.get_available_models(),
820
- value="WhisperSTT",
821
- label="STT Model",
822
- info="Choose your speech-to-text engine"
823
- )
824
-
825
- # Dynamic model options (will update based on selected model)
826
- model_size = gr.Dropdown(
827
- choices=["tiny", "base", "small", "medium", "large"],
828
- value="base",
829
- label="Model Size",
830
- visible=True
831
- )
832
-
833
- use_api = gr.Checkbox(
834
- label="Use API",
835
- info="Use cloud API instead of local model",
836
- visible=True
837
- )
838
-
839
- api_key = gr.Textbox(
840
- label="API Key",
841
- type="password",
842
- placeholder="Enter API key...",
843
- visible=False
844
- )
845
-
846
- # Device selection for models that support it
847
- device_selector = gr.Dropdown(
848
- choices=["auto", "cpu", "cuda"],
849
- value="auto",
850
- label="Device",
851
- info="Processing device (auto recommended)",
852
- visible=False
853
- )
854
-
855
- # HuggingFace token for private models
856
- hf_token = gr.Textbox(
857
- label="HuggingFace Token",
858
- type="password",
859
- placeholder="hf_...",
860
- info="Optional: For private or experimental models",
861
- visible=False
862
- )
863
-
864
- # Load button and status
865
- load_btn = gr.Button("πŸ”„ Load Model", variant="primary")
866
- load_status = gr.Textbox(
867
- label="Status",
868
- value="No model loaded",
869
- interactive=False
870
- )
871
-
872
- # Model info
873
- model_info = gr.Markdown(ModelManager.get_model_info())
874
-
875
- # Transcription Panel
876
- with gr.Column(scale=2):
877
- gr.Markdown("### 🎀 Voice Transcription")
878
-
879
- # Language selection
880
- language = gr.Dropdown(
881
- choices=[("Auto-detect", "auto"), ("English", "en")],
882
- value="auto",
883
- label="Language"
884
- )
885
-
886
- # Audio input
887
- audio_input = gr.Audio(
888
- label="Record or Upload Audio",
889
- type="numpy",
890
- format="wav"
891
- )
892
-
893
- # Action buttons
894
- with gr.Row():
895
- transcribe_btn = gr.Button("🎯 Transcribe", variant="primary")
896
- quality_btn = gr.Button("πŸ“Š Check Quality")
897
- clear_btn = gr.Button("πŸ—‘οΈ Clear")
898
-
899
- # Outputs
900
- transcription_output = gr.Textbox(
901
- label="πŸ“ Transcription",
902
- lines=4,
903
- placeholder="Transcribed text will appear here..."
904
- )
905
-
906
- with gr.Row():
907
- confidence_output = gr.Textbox(
908
- label="🎯 Confidence",
909
- interactive=False
910
- )
911
- processing_output = gr.Textbox(
912
- label="⏱️ Processing Info",
913
- interactive=False
914
- )
915
-
916
- quality_output = gr.Markdown(
917
- value="",
918
- visible=False,
919
- label="πŸ“Š Audio Quality Analysis"
920
- )
921
-
922
- # Usage tips
923
- gr.Markdown(
924
- """
925
- ### πŸ’‘ Tips for Best Results
926
- - **Record clearly** in a quiet environment
927
- - **Speak at normal pace** - not too fast or slow
928
- - **Use good audio quality** - avoid background noise
929
- - **Try different models** - larger models are more accurate but slower
930
- - **Check quality analysis** to identify audio issues
931
- """
932
- )
933
-
934
- # Event handlers
935
- def update_model_options(model_name: str):
936
- """Update interface based on selected model."""
937
- options = ModelManager.get_model_options(model_name)
938
-
939
- # Determine visibility of components
940
- show_model_size = len(options["model_sizes"]) > 1
941
- show_api = options["supports_api"]
942
- show_device = "device_options" in options
943
- show_hf_token = options.get("supports_hf_token", False)
944
-
945
- # Extract model size options (handle both simple lists and tuples)
946
- if show_model_size and isinstance(options["model_sizes"][0], tuple):
947
- # Model sizes are tuples of (display_name, value)
948
- size_choices = options["model_sizes"]
949
- size_value = size_choices[0][1] # Use the value from first tuple
950
- else:
951
- # Model sizes are simple strings
952
- size_choices = options["model_sizes"]
953
- size_value = size_choices[0]
954
-
955
- return (
956
- gr.update(choices=size_choices, value=size_value, visible=show_model_size),
957
- gr.update(visible=show_api),
958
- gr.update(visible=False), # Hide API key initially
959
- gr.update(choices=options["languages"], value="auto"),
960
- gr.update(
961
- choices=options.get("device_options", ["auto"]),
962
- value="auto",
963
- visible=show_device
964
- ),
965
- gr.update(visible=show_hf_token)
966
- )
967
-
968
- def toggle_api_key(use_api: bool):
969
- """Show/hide API key field."""
970
- return gr.update(visible=use_api)
971
-
972
- def load_selected_model(model_name: str, model_size: str, use_api: bool, api_key: str, device: str, hf_token: str):
973
- """Load the selected model with configuration."""
974
- kwargs = {"model_size": model_size, "use_api": use_api}
975
- if api_key:
976
- kwargs["api_key"] = api_key
977
- if device and device != "auto":
978
- kwargs["device"] = device
979
- if hf_token:
980
- kwargs["hf_token"] = hf_token
981
- return ModelManager.load_model(model_name, **kwargs)
982
-
983
- def analyze_audio_quality(audio_input):
984
- """Analyze and display audio quality."""
985
- if audio_input is None:
986
- return "", gr.update(visible=False)
987
-
988
- sample_rate, audio_data = audio_input
989
- quality = AudioProcessor.analyze_quality(audio_data, sample_rate)
990
-
991
- report = f"""
992
- **πŸ“Š Audio Quality Analysis:**
993
- - Duration: {quality['duration']:.2f}s
994
- - Max amplitude: {quality['max_amplitude']:.3f}
995
- - Clipping: {quality['clipping_ratio']:.2%}
996
- - Silence ratio: {quality['silence_ratio']:.2%}
997
- - Overall quality: {'βœ… Good' if quality['is_good_quality'] else '⚠️ Needs improvement'}
998
-
999
- **πŸ”§ Recommendations:**
1000
- {_get_quality_recommendations(quality)}
1001
- """
1002
-
1003
- return report, gr.update(visible=True)
1004
-
1005
- # Connect events
1006
- model_selector.change(
1007
- fn=update_model_options,
1008
- inputs=model_selector,
1009
- outputs=[model_size, use_api, api_key, language, device_selector, hf_token]
1010
- )
1011
-
1012
- use_api.change(
1013
- fn=toggle_api_key,
1014
- inputs=use_api,
1015
- outputs=api_key
1016
- )
1017
-
1018
- load_btn.click(
1019
- fn=load_selected_model,
1020
- inputs=[model_selector, model_size, use_api, api_key, device_selector, hf_token],
1021
- outputs=load_status
1022
- ).then(
1023
- fn=lambda: ModelManager.get_model_info(),
1024
- outputs=model_info
1025
- )
1026
-
1027
- transcribe_btn.click(
1028
- fn=TranscriptionEngine.transcribe,
1029
- inputs=[audio_input, language],
1030
- outputs=[transcription_output, confidence_output, processing_output]
1031
- )
1032
-
1033
- quality_btn.click(
1034
- fn=analyze_audio_quality,
1035
- inputs=audio_input,
1036
- outputs=[quality_output, quality_output]
1037
- )
1038
-
1039
- clear_btn.click(
1040
- fn=lambda: ("", "", "", "", gr.update(visible=False)),
1041
- outputs=[transcription_output, confidence_output, processing_output, quality_output, quality_output]
1042
- )
1043
-
1044
- # Auto-transcribe on audio change (optional)
1045
- audio_input.change(
1046
- fn=TranscriptionEngine.transcribe,
1047
- inputs=[audio_input, language],
1048
- outputs=[transcription_output, confidence_output, processing_output]
1049
- )
1050
-
1051
- return demo
1052
-
1053
-
1054
- def _get_quality_recommendations(quality: Dict[str, Any]) -> str:
1055
- """Generate quality recommendations based on analysis."""
1056
- recommendations = []
1057
-
1058
- if quality["duration"] < 1.0:
1059
- recommendations.append("β€’ Try recording for longer (1+ seconds)")
1060
-
1061
- if quality["max_amplitude"] < 0.1:
1062
- recommendations.append("β€’ Increase volume or move closer to microphone")
1063
- elif quality["max_amplitude"] > 0.9:
1064
- recommendations.append("β€’ Reduce volume to avoid clipping")
1065
-
1066
- if quality["clipping_ratio"] > 0.01:
1067
- recommendations.append("β€’ Audio is clipping - reduce input gain")
1068
-
1069
- if quality["silence_ratio"] > 0.5:
1070
- recommendations.append("β€’ Too much silence - record in quieter environment")
1071
-
1072
- if not recommendations:
1073
- recommendations.append("β€’ Audio quality looks good!")
1074
-
1075
- return "\n".join(recommendations)
1076
-
1077
-
1078
- def main():
1079
- """Main application entry point."""
1080
- # Check dependencies
1081
- print("πŸ” Checking dependencies...")
1082
-
1083
- try:
1084
- import gradio
1085
- print("βœ… Gradio available")
1086
- except ImportError:
1087
- print("❌ Gradio not installed. Run: pip install gradio")
1088
- return
1089
-
1090
- # Check available STT models
1091
- print(f"πŸ€– Available STT models: {ModelManager.get_available_models()}")
1092
-
1093
- # Create and launch interface
1094
- print("πŸš€ Launching Gradio interface...")
1095
- demo = GradioInterface.create_interface()
1096
-
1097
- demo.launch(
1098
- share=True, # Set to True for public sharing
1099
- server_name="127.0.0.1",
1100
- server_port=7860,
1101
- show_error=True
1102
- )
1103
-
1104
-
1105
- if __name__ == "__main__":
1106
  main()
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Modular Gradio Voice Transcriber
4
+
5
+ A flexible web interface for voice transcription supporting multiple STT models.
6
+ Easily extensible to support any STT implementation that follows the BaseSTT interface.
7
+
8
+ Usage:
9
+ python gradio_voice_transcriber_clean.py
10
+ """
11
+
12
+ import gradio as gr
13
+ import numpy as np
14
+ import logging
15
+ import time
16
+ from typing import Tuple, Optional, Dict, Any, Type, List, Union
17
+ from pathlib import Path
18
+
19
+ # Import base STT class and available implementations
20
+ from stt.stt_base import BaseSTT, STTResult
21
+ from stt.whisper_stt import WhisperSTT
22
+
23
+ # Try to import Wav2Vec2 Arabic STT (optional)
24
+ try:
25
+ from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
26
+ WAV2VEC2_AVAILABLE = True
27
+ except ImportError:
28
+ WAV2VEC2_AVAILABLE = False
29
+
30
+
31
+ # Try to import Chirp3 STT (optional)
32
+ try:
33
+ from stt.chirp3_stt import Chirp3STT
34
+ CHIRP3_AVAILABLE = True
35
+ except ImportError:
36
+ Chirp3STT = None
37
+ CHIRP3_AVAILABLE = False
38
+
39
+ # Try to import HuBERT Arabic STT (optional)
40
+ try:
41
+ from stt.hubert_arabic_stt import HuBERTArabicSTT
42
+ HUBERT_AVAILABLE = True
43
+ except ImportError:
44
+ HUBERT_AVAILABLE = False
45
+
46
+ # Try to import Vosk STT (optional)
47
+ try:
48
+ from stt.vosk_stt import VoskSTT
49
+ VOSK_AVAILABLE = True
50
+ except ImportError:
51
+ VOSK_AVAILABLE = False
52
+
53
+ # Try to import Coqui STT (optional)
54
+ try:
55
+ from stt.coqui_stt import CoquiSTT
56
+ COQUI_AVAILABLE = True
57
+ except ImportError:
58
+ COQUI_AVAILABLE = False
59
+
60
+ # Try to import Tawasul STT (optional)
61
+ try:
62
+ from stt.tawasul_stt import TawasulSTT
63
+ TAWASUL_AVAILABLE = True
64
+ except ImportError:
65
+ TAWASUL_AVAILABLE = False
66
+
67
+ # Setup logging
68
+ logging.basicConfig(level=logging.INFO)
69
+ logger = logging.getLogger(__name__)
70
+
71
+ # STT Model Registry - Add new models here
72
+ STT_MODELS: Dict[str, Type[BaseSTT]] = {
73
+ "WhisperSTT": WhisperSTT,
74
+ }
75
+
76
+ # Add Wav2Vec2 Arabic if available
77
+ if WAV2VEC2_AVAILABLE:
78
+ STT_MODELS["Wav2Vec2ArabicSTT"] = Wav2Vec2ArabicSTT
79
+
80
+ # Add HuBERT Arabic if available
81
+ if HUBERT_AVAILABLE:
82
+ STT_MODELS["HuBERTArabicSTT"] = HuBERTArabicSTT
83
+
84
+ # Add Vosk if available
85
+ if VOSK_AVAILABLE:
86
+ STT_MODELS["VoskSTT"] = VoskSTT
87
+
88
+ # Add Coqui STT if available
89
+ if COQUI_AVAILABLE:
90
+ STT_MODELS["CoquiSTT"] = CoquiSTT
91
+
92
+ # Add Tawasul STT if available
93
+ if TAWASUL_AVAILABLE:
94
+ STT_MODELS["TawasulSTT"] = TawasulSTT
95
+
96
+ # Global state
97
+ current_stt_model: Optional[Type[BaseSTT]] = None
98
+ current_model_config: Dict[str, Any] = {}
99
+
100
+
101
+ class AudioProcessor:
102
+ """Handle audio preprocessing for better transcription quality."""
103
+
104
+ @staticmethod
105
+ def preprocess(audio_data: np.ndarray, sample_rate: int, target_sr: int = 16000) -> np.ndarray:
106
+ """
107
+ Preprocess audio for better transcription quality.
108
+
109
+ Args:
110
+ audio_data: Raw audio data
111
+ sample_rate: Original sample rate
112
+ target_sr: Target sample rate (default: 16000 for Whisper)
113
+
114
+ Returns:
115
+ Preprocessed audio data
116
+ """
117
+ # Convert to mono if stereo
118
+ if audio_data.ndim > 1:
119
+ audio_data = np.mean(audio_data, axis=1)
120
+
121
+ # Normalize to float32 [-1, 1]
122
+ if audio_data.dtype == np.int16:
123
+ audio_data = audio_data.astype(np.float32) / 32768.0
124
+ elif audio_data.dtype == np.int32:
125
+ audio_data = audio_data.astype(np.float32) / 2147483648.0
126
+ else:
127
+ audio_data = audio_data.astype(np.float32)
128
+
129
+ # Clip to prevent overflow
130
+ audio_data = np.clip(audio_data, -1.0, 1.0)
131
+
132
+ # Remove DC offset
133
+ audio_data = audio_data - np.mean(audio_data)
134
+
135
+ # Simple noise gate (remove very quiet sections)
136
+ if len(audio_data) > 0:
137
+ threshold = np.max(np.abs(audio_data)) * 0.01
138
+ audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
139
+
140
+ # Resample if needed
141
+ if sample_rate != target_sr:
142
+ audio_data = AudioProcessor._resample(audio_data, sample_rate, target_sr)
143
+
144
+ return audio_data
145
+
146
+ @staticmethod
147
+ def _resample(audio_data: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
148
+ """Simple resampling (prefer librosa if available)."""
149
+ try:
150
+ import librosa
151
+ return librosa.resample(audio_data, orig_sr=orig_sr, target_sr=target_sr)
152
+ except ImportError:
153
+ # Naive fallback (decimation / sample repetition): only approximates the target rate and may alias; librosa is preferred
154
+ if orig_sr > target_sr:
155
+ step = orig_sr // target_sr
156
+ return audio_data[::step]
157
+ else:
158
+ repeat_factor = target_sr // orig_sr
159
+ return np.repeat(audio_data, repeat_factor)
160
+
161
+ @staticmethod
162
+ def _preprocess_audio(audio_path: str) -> Tuple[np.ndarray, int]:
163
+ """
164
+ Preprocess audio file for STT models that need torch.Tensor input.
165
+
166
+ Args:
167
+ audio_path: Path to audio file
168
+
169
+ Returns:
170
+ Tuple of (audio_tensor_as_numpy, sample_rate) that can be converted to torch.Tensor
171
+ """
172
+ try:
173
+ import librosa
174
+ import soundfile as sf
175
+
176
+ # Try to load with librosa first (more robust)
177
+ try:
178
+ audio_data, sample_rate = librosa.load(audio_path, sr=16000)
179
+ except Exception:
180
+ # Fallback to soundfile
181
+ audio_data, sample_rate = sf.read(audio_path)
182
+ if sample_rate != 16000:
183
+ audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
184
+ sample_rate = 16000
185
+
186
+ # Convert to mono if needed
187
+ if audio_data.ndim > 1:
188
+ audio_data = np.mean(audio_data, axis=1)
189
+
190
+ # Normalize audio to [-1, 1]
191
+ if audio_data.max() > 1.0:
192
+ audio_data = audio_data / audio_data.max()
193
+
194
+ # Remove DC offset
195
+ audio_data = audio_data - np.mean(audio_data)
196
+
197
+ # Apply noise gate for very quiet audio
198
+ threshold = np.max(np.abs(audio_data)) * 0.01
199
+ audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
200
+
201
+ # Convert to float32 for compatibility
202
+ audio_data = audio_data.astype(np.float32)
203
+
204
+ return audio_data, sample_rate
205
+
206
+ except Exception as e:
207
+ raise RuntimeError(f"Audio preprocessing failed: {str(e)}")
208
+
209
+ @staticmethod
210
+ def _preprocess_audio_torch(audio_path: str):
211
+ """
212
+ Preprocess audio file and return torch.Tensor for PyTorch-based STT models.
213
+
214
+ Args:
215
+ audio_path: Path to audio file
216
+
217
+ Returns:
218
+ Tuple of (audio_tensor, sample_rate) where audio_tensor is torch.Tensor
219
+ """
220
+ try:
221
+ import torch
222
+
223
+ # Get numpy array first
224
+ audio_data, sample_rate = AudioProcessor._preprocess_audio(audio_path)
225
+
226
+ # Convert to torch tensor
227
+ audio_tensor = torch.FloatTensor(audio_data)
228
+
229
+ return audio_tensor, sample_rate
230
+
231
+ except ImportError:
232
+ raise RuntimeError("PyTorch not available. Install with: pip install torch")
233
+ except Exception as e:
234
+ raise RuntimeError(f"Torch audio preprocessing failed: {str(e)}")
235
+
236
+ @staticmethod
237
+ def analyze_quality(audio_data: np.ndarray, sample_rate: int) -> Dict[str, Any]:
238
+ """Analyze audio quality and provide feedback."""
239
+ if audio_data.ndim > 1:
240
+ audio_data = np.mean(audio_data, axis=1)
241
+
242
+ duration = len(audio_data) / sample_rate
243
+ max_amp = np.max(np.abs(audio_data))
244
+ mean_amp = np.mean(np.abs(audio_data))
245
+
246
+ # Check for clipping and silence
247
+ clipping_ratio = np.sum(np.abs(audio_data) > 0.95) / len(audio_data)
248
+ silence_threshold = max_amp * 0.01
249
+ silence_ratio = np.sum(np.abs(audio_data) < silence_threshold) / len(audio_data)
250
+
251
+ return {
252
+ "duration": duration,
253
+ "max_amplitude": max_amp,
254
+ "mean_amplitude": mean_amp,
255
+ "clipping_ratio": clipping_ratio,
256
+ "silence_ratio": silence_ratio,
257
+ "sample_rate": sample_rate,
258
+ "is_good_quality": (
259
+ duration > 1.0 and
260
+ 0.1 < max_amp < 0.9 and
261
+ clipping_ratio < 0.01 and
262
+ silence_ratio < 0.5
263
+ )
264
+ }
265
+
266
+
267
+ class ModelManager:
268
+ """Handle STT model registration and loading."""
269
+
270
+ @staticmethod
271
+ def get_available_models() -> List[str]:
272
+ """Get list of available STT model names."""
273
+ return list(STT_MODELS.keys())
274
+
275
+ @staticmethod
276
+ def get_model_options(model_name: str) -> Dict[str, Any]:
277
+ """Get model-specific configuration options."""
278
+ if model_name == "WhisperSTT":
279
+ return {
280
+ "model_sizes": ["tiny", "base", "small", "medium", "large"],
281
+ "supports_api": True,
282
+ "languages": [
283
+ ("Auto-detect", "auto"),
284
+ ("English", "en"),
285
+ ("Spanish", "es"),
286
+ ("French", "fr"),
287
+ ("German", "de"),
288
+ ("Italian", "it"),
289
+ ("Portuguese", "pt"),
290
+ ("Russian", "ru"),
291
+ ("Japanese", "ja"),
292
+ ("Korean", "ko"),
293
+ ("Chinese", "zh"),
294
+ ("Dutch", "nl"),
295
+ ("Arabic", "ar"),
296
+ ("Hindi", "hi")
297
+ ],
298
+ "default_params": {
299
+ "temperature": 0.0,
300
+ "beam_size": 5,
301
+ "best_of": 5,
302
+ "patience": 2.0,
303
+ "condition_on_previous_text": True,
304
+ }
305
+ }
306
+
307
+ elif model_name == "Wav2Vec2ArabicSTT":
308
+ return {
309
+ "model_sizes": [
310
+ ("Arabic Standard", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
311
+ ("Multilingual", "facebook/wav2vec2-large-xlsr-53"),
312
+ ("English Fallback", "facebook/wav2vec2-base-960h"),
313
+ ("Arabic Egyptian (Experimental)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian")
314
+ ],
315
+ "supports_api": False,
316
+ "supports_hf_token": True,
317
+ "languages": [
318
+ ("Arabic Egyptian", "ar-EG"),
319
+ ("Arabic Standard", "ar"),
320
+ ("Auto-detect", "auto"),
321
+ ],
322
+ "device_options": ["auto", "cpu", "cuda"],
323
+ "default_params": {
324
+ "device": "auto",
325
+ "chunk_length": 20,
326
+ "return_confidence": True,
327
+ }
328
+ }
329
+
330
+ elif model_name == "VoskSTT":
331
+ return {
332
+ "model_sizes": [
333
+ ("English US Small (40MB)", "vosk-model-small-en-us-0.15"),
334
+ ("English US Large (1.8GB)", "vosk-model-en-us-0.22"),
335
+ ("Arabic (318MB)", "vosk-model-ar-mgb2-0.4"),
336
+ ("French (1.4GB)", "vosk-model-fr-0.22"),
337
+ ("German (1.2GB)", "vosk-model-de-0.21"),
338
+ ("Spanish (1.4GB)", "vosk-model-es-0.42"),
339
+ ("Russian Large (1.5GB)", "vosk-model-ru-0.42"),
340
+ ("Russian Small (45MB)", "vosk-model-small-ru-0.22"),
341
+ ("Chinese Small (42MB)", "vosk-model-small-cn-0.22"),
342
+ ],
343
+ "supports_api": False,
344
+ "supports_auto_download": True,
345
+ "languages": [
346
+ ("Auto (based on model)", "auto"),
347
+ ("English", "en"),
348
+ ("Arabic", "ar"),
349
+ ("French", "fr"),
350
+ ("German", "de"),
351
+ ("Spanish", "es"),
352
+ ("Russian", "ru"),
353
+ ("Chinese", "zh"),
354
+ ],
355
+ "default_params": {
356
+ "auto_download": True,
357
+ "return_confidence": True,
358
+ "return_words": True,
359
+ }
360
+ }
361
+
362
+ elif model_name == "HuBERTArabicSTT":
363
+ return {
364
+ "model_sizes": [
365
+ ("Arabic Egyptian (HuBERT)", "omarxadel/hubert-large-arabic-egyptian"),
366
+ ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
367
+ ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
368
+ ("Arabic MSA", "facebook/wav2vec2-large-xlsr-53")
369
+ ],
370
+ "supports_api": False,
371
+ "supports_hf_token": True,
372
+ "languages": [
373
+ ("Arabic Egyptian", "ar-EG"),
374
+ ("Arabic Standard", "ar"),
375
+ ("Auto-detect", "auto"),
376
+ ],
377
+ "device_options": ["auto", "cpu", "cuda"],
378
+ "default_params": {
379
+ "device": "auto",
380
+ "chunk_length": 20,
381
+ "return_confidence": True,
382
+ "max_audio_length": 120
383
+ }
384
+ }
385
+
386
+ elif model_name == "CoquiSTT":
387
+ return {
388
+ "model_sizes": [
389
+ ("English Large Vocab", "english-large"),
390
+ ("English Huge Vocab", "english-huge"),
391
+ ("German", "german"),
392
+ ("French", "french"),
393
+ ("Spanish", "spanish")
394
+ ],
395
+ "supports_api": False,
396
+ "supports_auto_download": True,
397
+ "languages": [
398
+ ("English", "en"),
399
+ ("German", "de"),
400
+ ("French", "fr"),
401
+ ("Spanish", "es"),
402
+ ("Auto (based on model)", "auto"),
403
+ ],
404
+ "default_params": {
405
+ "auto_download": True,
406
+ "beam_width": 512,
407
+ "lm_alpha": 0.931289039105002,
408
+ "lm_beta": 1.1834137581510284,
409
+ "return_confidence": True,
410
+ "return_timestamps": False,
411
+ }
412
+ }
413
+
414
+ elif model_name == "TawasulSTT":
415
+ return {
416
+ "model_sizes": [
417
+ ("Tawasul STT V0 (Arabic)", "Kareem35/Tawasul-STT-V0"),
418
+ ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
419
+ ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
420
+ ("Multilingual Fallback", "facebook/wav2vec2-large-xlsr-53")
421
+ ],
422
+ "supports_api": False,
423
+ "supports_hf_token": True,
424
+ "languages": [
425
+ ("Arabic Standard", "ar"),
426
+ ("Arabic Egyptian", "ar-EG"),
427
+ ("Arabic Saudi", "ar-SA"),
428
+ ("Arabic Jordanian", "ar-JO"),
429
+ ("Arabic Lebanese", "ar-LB"),
430
+ ("Arabic Syrian", "ar-SY"),
431
+ ("Arabic Iraqi", "ar-IQ"),
432
+ ("Auto-detect", "auto"),
433
+ ],
434
+ "device_options": ["auto", "cpu", "cuda"],
435
+ "default_params": {
436
+ "device": "auto",
437
+ "chunk_length": 20,
438
+ "return_confidence": True,
439
+ "max_audio_length": 300
440
+ }
441
+ }
442
+
443
+ # Default options for other models
444
+ return {
445
+ "model_sizes": ["default"],
446
+ "supports_api": False,
447
+ "languages": [("Auto-detect", "auto")],
448
+ "default_params": {}
449
+ }
450
+
451
+ @staticmethod
452
+ def load_model(model_name: str, **kwargs) -> str:
453
+ """Load specified STT model with configuration."""
454
+ global current_stt_model, current_model_config
455
+
456
+ if model_name not in STT_MODELS:
457
+ return f"❌ Unknown model: {model_name}. Available: {list(STT_MODELS.keys())}"
458
+
459
+ try:
460
+ model_class = STT_MODELS[model_name]
461
+
462
+ # Handle TawasulSTT as static class (don't instantiate)
463
+ if model_name == "TawasulSTT":
464
+ model_instance = model_class # Use class directly for static methods
465
+ else:
466
+ # Instantiate the model for instance-based classes
467
+ model_instance = model_class()
468
+
469
+ if model_name == "WhisperSTT":
470
+ # Handle WhisperSTT specific loading
471
+ model_size = kwargs.get("model_size", "base")
472
+ use_api = kwargs.get("use_api", False)
473
+ api_key = kwargs.get("api_key", "")
474
+
475
+ if use_api and not api_key.strip():
476
+ return "❌ Error: API key required for API mode"
477
+
478
+ # Load with optimized parameters
479
+ load_params = {
480
+ "model_size": model_size,
481
+ "use_api": use_api,
482
+ }
483
+
484
+ if api_key:
485
+ load_params["api_key"] = api_key.strip()
486
+
487
+ # Add quality optimization parameters for local models
488
+ if not use_api:
489
+ load_params.update({
490
+ "temperature": 0.0,
491
+ "beam_size": 5,
492
+ "best_of": 5,
493
+ "patience": 2.0,
494
+ "condition_on_previous_text": True,
495
+ })
496
+
497
+ model_instance.load_model(**load_params)
498
+
499
+ current_model_config = {
500
+ "model_name": model_name,
501
+ "model_size": model_size,
502
+ "use_api": use_api
503
+ }
504
+
505
+ status = f"βœ… {model_name} ({'API' if use_api else model_size}) loaded successfully"
506
+
507
+ elif model_name == "Wav2Vec2ArabicSTT":
508
+ # Handle Wav2Vec2 Arabic specific loading
509
+ device = kwargs.get("device", "auto")
510
+ chunk_length = kwargs.get("chunk_length", 20)
511
+ hf_token = kwargs.get("hf_token", "")
512
+ model_id = kwargs.get("model_size", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
513
+
514
+ load_params = {
515
+ "device": device,
516
+ "chunk_length": chunk_length,
517
+ "model_id": model_id,
518
+ }
519
+
520
+ if hf_token:
521
+ load_params["hf_token"] = hf_token.strip()
522
+
523
+ model_instance.load_model(**load_params)
524
+
525
+ current_model_config = {
526
+ "model_name": model_name,
527
+ "model_id": model_id,
528
+ "device": device,
529
+ "chunk_length": chunk_length
530
+ }
531
+
532
+ # Extract model name for display
533
+ model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
534
+ status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
535
+
536
+ elif model_name == "VoskSTT":
537
+ # Handle VoskSTT specific loading
538
+ model_name_param = kwargs.get("model_size", "vosk-model-small-en-us-0.15")
539
+ auto_download = kwargs.get("auto_download", True)
540
+
541
+ load_params = {
542
+ "model_name": model_name_param,
543
+ "auto_download": auto_download,
544
+ }
545
+
546
+ model_instance.load_model(**load_params)
547
+
548
+ current_model_config = {
549
+ "model_name": model_name,
550
+ "model_name_param": model_name_param,
551
+ "auto_download": auto_download
552
+ }
553
+
554
+ status = f"βœ… {model_name} ({model_name_param}) loaded successfully"
555
+
556
+ elif model_name == "HuBERTArabicSTT":
557
+ # Handle HuBERT Arabic specific loading
558
+ device = kwargs.get("device", "auto")
559
+ chunk_length = kwargs.get("chunk_length", 20)
560
+ hf_token = kwargs.get("hf_token", "")
561
+ model_id = kwargs.get("model_size", "omarxadel/hubert-large-arabic-egyptian")
562
+ max_audio_length = kwargs.get("max_audio_length", 120)
563
+
564
+ load_params = {
565
+ "device": device,
566
+ "chunk_length": chunk_length,
567
+ "model_id": model_id,
568
+ "max_audio_length": max_audio_length,
569
+ }
570
+
571
+ if hf_token:
572
+ load_params["hf_token"] = hf_token.strip()
573
+
574
+ model_instance.load_model(**load_params)
575
+
576
+ current_model_config = {
577
+ "model_name": model_name,
578
+ "model_id": model_id,
579
+ "device": device,
580
+ "chunk_length": chunk_length,
581
+ "max_audio_length": max_audio_length
582
+ }
583
+
584
+ # Extract model name for display
585
+ model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
586
+ status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
587
+
588
+ elif model_name == "CoquiSTT":
589
+ # Handle Coqui STT specific loading
590
+ model_name_param = kwargs.get("model_size", "english-large")
591
+ auto_download = kwargs.get("auto_download", True)
592
+ beam_width = kwargs.get("beam_width", 512)
593
+ lm_alpha = kwargs.get("lm_alpha", 0.931289039105002)
594
+ lm_beta = kwargs.get("lm_beta", 1.1834137581510284)
595
+
596
+ load_params = {
597
+ "model_name": model_name_param,
598
+ "auto_download": auto_download,
599
+ "beam_width": beam_width,
600
+ "lm_alpha": lm_alpha,
601
+ "lm_beta": lm_beta,
602
+ }
603
+
604
+ model_instance.load_model(**load_params)
605
+
606
+ current_model_config = {
607
+ "model_name": model_name,
608
+ "model_name_param": model_name_param,
609
+ "auto_download": auto_download,
610
+ "beam_width": beam_width,
611
+ "lm_alpha": lm_alpha,
612
+ "lm_beta": lm_beta
613
+ }
614
+
615
+ status = f"βœ… {model_name} ({model_name_param}) loaded successfully"
616
+
617
+ elif model_name == "TawasulSTT":
618
+ # Handle Tawasul STT specific loading (static class)
619
+ device = kwargs.get("device", "auto")
620
+ chunk_length = kwargs.get("chunk_length", 20)
621
+ hf_token = kwargs.get("hf_token", "")
622
+ model_id = kwargs.get("model_size", "Kareem35/Tawasul-STT-V0")
623
+ max_audio_length = kwargs.get("max_audio_length", 300)
624
+
625
+ load_params = {
626
+ "device": device,
627
+ "chunk_length": chunk_length,
628
+ "model_id": model_id,
629
+ "max_audio_length": max_audio_length,
630
+ }
631
+
632
+ if hf_token:
633
+ load_params["hf_token"] = hf_token.strip()
634
+
635
+ # Call static method directly
636
+ model_class.load_model(**load_params)
637
+
638
+ current_model_config = {
639
+ "model_name": model_name,
640
+ "model_id": model_id,
641
+ "device": device,
642
+ "chunk_length": chunk_length,
643
+ "max_audio_length": max_audio_length
644
+ }
645
+
646
+ # Extract model name for display
647
+ model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
648
+ status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
649
+
650
+ else:
651
+ # Generic model loading for future STT models
652
+ model_instance.load_model(**kwargs)
653
+ current_model_config = {"model_name": model_name, **kwargs}
654
+ status = f"βœ… {model_name} loaded successfully"
655
+
656
+ current_stt_model = model_instance
657
+ logger.info(status)
658
+ return status
659
+
660
+ except Exception as e:
661
+ error_msg = f"❌ Error loading {model_name}: {str(e)}"
662
+ logger.error(error_msg)
663
+ return error_msg
664
+
665
+ @staticmethod
666
+ def get_model_info() -> str:
667
+ """Get information about available and loaded models."""
668
+ info = f"**Available Models:** {', '.join(STT_MODELS.keys())}\n\n"
669
+
670
+ if current_stt_model:
671
+ model_info = current_stt_model.get_model_info()
672
+ # Handle different key names for model name
673
+ model_name = model_info.get('model_name') or model_info.get('name', 'Unknown')
674
+ info += f"**Currently Loaded:** {model_name}\n"
675
+ info += f"**Status:** {'βœ… Ready' if model_info['is_loaded'] else '❌ Not loaded'}\n"
676
+ info += f"**Config:** {current_model_config}"
677
+ else:
678
+ info += "**Currently Loaded:** None"
679
+
680
+ return info
681
+
682
+
683
+ class ImageGallery:
684
+ """Handle a static image gallery with thumbnail and button navigation."""
685
+
686
+ def __init__(self):
687
+ """Initialize image gallery with predefined images."""
688
+ # Define your static images here - you can add more images to this list
689
+ self.images = [
690
+ "https://picsum.photos/400/300?random=1", # Random image 1
691
+ "https://picsum.photos/400/300?random=2", # Random image 2
692
+ "https://picsum.photos/400/300?random=3", # Random image 3
693
+ "https://picsum.photos/400/300?random=4", # Random image 4
694
+ "https://picsum.photos/400/300?random=5", # Random image 5
695
+ ]
696
+
697
+ # Alternative: Use local images (uncomment and modify paths as needed)
698
+ # self.images = [
699
+ # "path/to/image1.jpg",
700
+ # "path/to/image2.png",
701
+ # "path/to/image3.jpg",
702
+ # "path/to/image4.png",
703
+ # "path/to/image5.jpg",
704
+ # ]
705
+
706
+ self.current_index = 0
707
+
708
+ def get_image_by_index(self, index: int) -> str:
709
+ """Get image by index with bounds checking."""
710
+ if 0 <= index < len(self.images):
711
+ self.current_index = index
712
+ return self.images[index]
713
+ return self.images[0] # Return first image as fallback
714
+
715
+ def get_image_info(self, index: int) -> str:
716
+ """Get information about current image."""
717
+ return f"Image {index + 1} of {len(self.images)}"
718
+
719
+ def get_total_images(self) -> int:
720
+ """Get total number of images."""
721
+ return len(self.images)
722
+
723
+
724
+ class TranscriptionEngine:
725
+ """Handle audio transcription using the loaded STT model."""
726
+
727
+ @staticmethod
728
+ def transcribe(audio_input: Tuple[int, np.ndarray],
729
+ language: Optional[str] = None) -> Tuple[str, str, str]:
730
+ """
731
+ Transcribe audio input using the currently loaded STT model.
732
+
733
+ Args:
734
+ audio_input: Tuple of (sample_rate, audio_data) from Gradio
735
+ language: Language code for transcription
736
+
737
+ Returns:
738
+ Tuple of (transcription, confidence_info, processing_info)
739
+ """
740
+ if audio_input is None:
741
+ return "❌ No audio provided", "", ""
742
+
743
+ if not current_stt_model or not current_stt_model.is_loaded:
744
+ return "❌ No STT model loaded. Please load a model first.", "", ""
745
+
746
+ try:
747
+ sample_rate, audio_data = audio_input
748
+
749
+ # Preprocess audio
750
+ processed_audio = AudioProcessor.preprocess(audio_data, sample_rate)
751
+
752
+ # Quality checks
753
+ quality = AudioProcessor.analyze_quality(processed_audio, 16000)
754
+
755
+ if quality["duration"] < 0.5:
756
+ return "❌ Audio too short (minimum 0.5 seconds)", "", ""
757
+
758
+ if quality["max_amplitude"] < 0.001:
759
+ return "❌ Audio too quiet or silent", "", f"Max amplitude: {quality['max_amplitude']:.6f}"
760
+
761
+ # Set language for models that support it
762
+ if hasattr(current_stt_model, 'set_language') and language and language != "auto":
763
+ current_stt_model.set_language(language)
764
+
765
+ # Transcribe using different approaches for different models
766
+ start_time = time.time()
767
+
768
+ # Check if this is TawasulSTT (static class) which needs file path
769
+ if current_model_config.get('model_name') == 'TawasulSTT':
770
+ # TawasulSTT needs a file path, so save audio to temporary file
771
+ import tempfile
772
+ import soundfile as sf
773
+
774
+ with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
775
+ temp_path = temp_file.name
776
+ sf.write(temp_path, processed_audio, 16000)
777
+
778
+ try:
779
+ # Call TawasulSTT.transcribe() with file path
780
+ transcription, confidence_info_raw, processing_info_raw = current_stt_model.transcribe(temp_path)
781
+
782
+ # Create a result-like object for consistency
783
+ class TempResult:
784
+ def __init__(self, text, confidence=None, processing_time=None):
785
+ self.text = text
786
+ self.confidence = confidence
787
+ self.processing_time = processing_time
788
+
789
+ # Extract confidence from confidence_info_raw if available
790
+ confidence_value = None
791
+ if confidence_info_raw and "Confidence:" in confidence_info_raw:
792
+ try:
793
+ conf_str = confidence_info_raw.split("Confidence:")[1].strip()
794
+ confidence_value = float(conf_str)
795
+ except (ValueError, IndexError):
796
+ confidence_value = None
797
+
798
+ processing_time = time.time() - start_time
799
+ result = TempResult(transcription, confidence_value, processing_time)
800
+
801
+ finally:
802
+ # Clean up temporary file
803
+ import os
804
+ try:
805
+ os.unlink(temp_path)
806
+ except OSError:
807
+ pass
808
+ else:
809
+ # For other STT models that use transcribe_audio
810
+ result = current_stt_model.transcribe_audio(processed_audio, 16000)
811
+
812
+ # Prepare output
813
+ transcription = result.text.strip() if result.text else "No speech detected"
814
+
815
+ # Filter out common false positives
816
+ if transcription.lower() in ["you", "thank you.", "thanks for watching!", ""]:
817
+ transcription = "πŸ”‡ No clear speech detected"
818
+
819
+ # Confidence info
820
+ confidence_info = ""
821
+ if result.confidence is not None:
822
+ confidence_info = f"Confidence: {result.confidence:.2%}"
823
+ if result.confidence < 0.3:
824
+ confidence_info += " (Low - consider re-recording)"
825
+ else:
826
+ confidence_info = "Confidence: N/A"
827
+
828
+ # Processing info
829
+ processing_info = f"Processing: {result.processing_time or 0:.2f}s\n"
830
+ processing_info += f"Model: {current_model_config.get('model_name', 'Unknown')}\n"
831
+ processing_info += f"Audio: {quality['duration']:.2f}s, {quality['max_amplitude']:.3f} amplitude\n"
832
+ processing_info += f"Quality: {'βœ… Good' if quality['is_good_quality'] else '⚠️ Poor'}"
833
+
834
+ return transcription, confidence_info, processing_info
835
+
836
+ except Exception as e:
837
+ error_msg = f"❌ Transcription error: {str(e)}"
838
+ logger.error(error_msg)
839
+ return error_msg, "", ""
840
+
841
+
842
+ class GradioInterface:
843
+ """Create and manage the Gradio web interface."""
844
+
845
+ @staticmethod
846
+ def create_interface():
847
+ """Create the main Gradio interface."""
848
+
849
+ # Initialize image gallery
850
+ gallery = ImageGallery()
851
+
852
+ with gr.Blocks(
853
+ title="πŸŽ™οΈ Modular Voice Transcriber with Image Gallery",
854
+ theme=gr.themes.Soft()
855
+ ) as demo:
856
+
857
+ gr.Markdown(
858
+ """
859
+ # πŸŽ™οΈ Modular Voice Transcriber with Image Gallery
860
+
861
+ A flexible interface supporting multiple STT models with an integrated image viewer.
862
+ Easily extensible for new transcription engines and image collections.
863
+ """
864
+ )
865
+
866
+ # Image Gallery Section (at the top)
867
+ with gr.Row():
868
+ with gr.Column():
869
+ gr.Markdown("### πŸ–ΌοΈ Image Gallery")
870
+
871
+ # Main image display
872
+ image_display = gr.Image(
873
+ value=gallery.get_image_by_index(0),
874
+ label="Selected Image",
875
+ height=400,
876
+ width=500,
877
+ interactive=False
878
+ )
879
+
880
+ # Image info
881
+ image_info = gr.Textbox(
882
+ value=gallery.get_image_info(0),
883
+ label="Image Info",
884
+ interactive=False
885
+ )
886
+
887
+ # Horizontal thumbnail gallery with actual image previews
888
+ gr.Markdown("**Click on a thumbnail to view:**")
889
+ with gr.Row():
890
+ # Create thumbnail gallery using Gradio's Gallery component
891
+ thumbnail_gallery = gr.Gallery(
892
+ value=gallery.images, # All images as thumbnails
893
+ label="Image Gallery",
894
+ show_label=False,
895
+ elem_id="thumbnail_gallery",
896
+ columns=len(gallery.images), # Horizontal layout
897
+ rows=1,
898
+ height=120, # Small thumbnail height
899
+ allow_preview=False, # Don't show preview popup
900
+ interactive=True
901
+ )
902
+
903
+ # Navigation buttons (kept for convenience)
904
+ gr.Markdown("**Or use navigation:**")
905
+ with gr.Row():
906
+ prev_btn = gr.Button("◀️ Previous", size="sm")
907
+ next_btn = gr.Button("Next ▢️", size="sm")
908
+ random_btn = gr.Button("🎲 Random", size="sm")
909
+
910
+ gr.Markdown("---") # Separator line
911
+
912
+ with gr.Row():
913
+ # Model Configuration Panel
914
+ with gr.Column(scale=1):
915
+ gr.Markdown("### πŸ”§ Model Configuration")
916
+
917
+ # Model selection
918
+ model_selector = gr.Dropdown(
919
+ choices=ModelManager.get_available_models(),
920
+ value="WhisperSTT",
921
+ label="STT Model",
922
+ info="Choose your speech-to-text engine"
923
+ )
924
+
925
+ # Dynamic model options (will update based on selected model)
926
+ model_size = gr.Dropdown(
927
+ choices=["tiny", "base", "small", "medium", "large"],
928
+ value="base",
929
+ label="Model Size",
930
+ visible=True
931
+ )
932
+
933
+ use_api = gr.Checkbox(
934
+ label="Use API",
935
+ info="Use cloud API instead of local model",
936
+ visible=True
937
+ )
938
+
939
+ api_key = gr.Textbox(
940
+ label="API Key",
941
+ type="password",
942
+ placeholder="Enter API key...",
943
+ visible=False
944
+ )
945
+
946
+ # Device selection for models that support it
947
+ device_selector = gr.Dropdown(
948
+ choices=["auto", "cpu", "cuda"],
949
+ value="auto",
950
+ label="Device",
951
+ info="Processing device (auto recommended)",
952
+ visible=False
953
+ )
954
+
955
+ # HuggingFace token for private models
956
+ hf_token = gr.Textbox(
957
+ label="HuggingFace Token",
958
+ type="password",
959
+ placeholder="hf_...",
960
+ info="Optional: For private or experimental models",
961
+ visible=False
962
+ )
963
+
964
+ # Load button and status
965
+ load_btn = gr.Button("πŸ”„ Load Model", variant="primary")
966
+ load_status = gr.Textbox(
967
+ label="Status",
968
+ value="No model loaded",
969
+ interactive=False
970
+ )
971
+
972
+ # Model info
973
+ model_info = gr.Markdown(ModelManager.get_model_info())
974
+
975
+ # Transcription Panel
976
+ with gr.Column(scale=2):
977
+ gr.Markdown("### 🎀 Voice Transcription")
978
+
979
+ # Language selection
980
+ language = gr.Dropdown(
981
+ choices=[("Auto-detect", "auto"), ("English", "en")],
982
+ value="auto",
983
+ label="Language"
984
+ )
985
+
986
+ # Audio input
987
+ audio_input = gr.Audio(
988
+ label="Record or Upload Audio",
989
+ type="numpy",
990
+ format="wav"
991
+ )
992
+
993
+ # Action buttons
994
+ with gr.Row():
995
+ transcribe_btn = gr.Button("🎯 Transcribe", variant="primary")
996
+ quality_btn = gr.Button("πŸ“Š Check Quality")
997
+ clear_btn = gr.Button("πŸ—‘οΈ Clear")
998
+
999
+ # Outputs
1000
+ transcription_output = gr.Textbox(
1001
+ label="πŸ“ Transcription",
1002
+ lines=4,
1003
+ placeholder="Transcribed text will appear here..."
1004
+ )
1005
+
1006
+ with gr.Row():
1007
+ confidence_output = gr.Textbox(
1008
+ label="🎯 Confidence",
1009
+ interactive=False
1010
+ )
1011
+ processing_output = gr.Textbox(
1012
+ label="⏱️ Processing Info",
1013
+ interactive=False
1014
+ )
1015
+
1016
+ quality_output = gr.Markdown(
1017
+ value="",
1018
+ visible=False,
1019
+ label="πŸ“Š Audio Quality Analysis"
1020
+ )
1021
+
1022
+ # Usage tips
1023
+ gr.Markdown(
1024
+ """
1025
+ ### πŸ’‘ Tips for Best Results
1026
+ - **Record clearly** in a quiet environment
1027
+ - **Speak at normal pace** - not too fast or slow
1028
+ - **Use good audio quality** - avoid background noise
1029
+ - **Try different models** - larger models are more accurate but slower
1030
+ - **Check quality analysis** to identify audio issues
1031
+ - **Browse images** using the thumbnail gallery or the navigation buttons
1032
+ """
1033
+ )
1034
+
1035
+ # Event handlers
1036
+ def update_model_options(model_name: str):
1037
+ """Update interface based on selected model."""
1038
+ options = ModelManager.get_model_options(model_name)
1039
+
1040
+ # Determine visibility of components
1041
+ show_model_size = len(options["model_sizes"]) > 1
1042
+ show_api = options["supports_api"]
1043
+ show_device = "device_options" in options
1044
+ show_hf_token = options.get("supports_hf_token", False)
1045
+
1046
+ # Extract model size options (handle both simple lists and tuples)
1047
+ if show_model_size and isinstance(options["model_sizes"][0], tuple):
1048
+ # Model sizes are tuples of (display_name, value)
1049
+ size_choices = options["model_sizes"]
1050
+ size_value = size_choices[0][1] # Use the value from first tuple
1051
+ else:
1052
+ # Model sizes are simple strings
1053
+ size_choices = options["model_sizes"]
1054
+ size_value = size_choices[0]
1055
+
1056
+ return (
1057
+ gr.update(choices=size_choices, value=size_value, visible=show_model_size),
1058
+ gr.update(visible=show_api),
1059
+ gr.update(visible=False), # Hide API key initially
1060
+ gr.update(choices=options["languages"], value="auto"),
1061
+ gr.update(
1062
+ choices=options.get("device_options", ["auto"]),
1063
+ value="auto",
1064
+ visible=show_device
1065
+ ),
1066
+ gr.update(visible=show_hf_token)
1067
+ )
1068
+
1069
+ def toggle_api_key(use_api: bool):
1070
+ """Show/hide API key field."""
1071
+ return gr.update(visible=use_api)
1072
+
1073
+ def load_selected_model(model_name: str, model_size: str, use_api: bool, api_key: str, device: str, hf_token: str):
1074
+ """Load the selected model with configuration."""
1075
+ kwargs = {"model_size": model_size, "use_api": use_api}
1076
+ if api_key:
1077
+ kwargs["api_key"] = api_key
1078
+ if device and device != "auto":
1079
+ kwargs["device"] = device
1080
+ if hf_token:
1081
+ kwargs["hf_token"] = hf_token
1082
+ return ModelManager.load_model(model_name, **kwargs)
1083
+
1084
+ def analyze_audio_quality(audio_input):
1085
+ """Analyze and display audio quality."""
1086
+ if audio_input is None:
1087
+ return "", gr.update(visible=False)
1088
+
1089
+ sample_rate, audio_data = audio_input
1090
+ quality = AudioProcessor.analyze_quality(audio_data, sample_rate)
1091
+
1092
+ report = f"""
1093
+ **πŸ“Š Audio Quality Analysis:**
1094
+ - Duration: {quality['duration']:.2f}s
1095
+ - Max amplitude: {quality['max_amplitude']:.3f}
1096
+ - Clipping: {quality['clipping_ratio']:.2%}
1097
+ - Silence ratio: {quality['silence_ratio']:.2%}
1098
+ - Overall quality: {'βœ… Good' if quality['is_good_quality'] else '⚠️ Needs improvement'}
1099
+
1100
+ **πŸ”§ Recommendations:**
1101
+ {_get_quality_recommendations(quality)}
1102
+ """
1103
+
1104
+ return report, gr.update(visible=True)
1105
+
1106
+ # Image Gallery Event Handlers
1107
+ current_image_index = [0] # Use list to make it mutable in nested functions
1108
+
1109
+ def select_image_from_gallery(evt: gr.SelectData):
1110
+ """Handle image selection from gallery thumbnail."""
1111
+ index = evt.index
1112
+ current_image_index[0] = index
1113
+ image_path = gallery.get_image_by_index(index)
1114
+ image_info_text = gallery.get_image_info(index)
1115
+ return image_path, image_info_text
1116
+
1117
+ def go_to_previous_image():
1118
+ """Go to previous image."""
1119
+ current_image_index[0] = max(0, current_image_index[0] - 1)
1120
+ image_path = gallery.get_image_by_index(current_image_index[0])
1121
+ image_info_text = gallery.get_image_info(current_image_index[0])
1122
+ return image_path, image_info_text
1123
+
1124
+ def go_to_next_image():
1125
+ """Go to next image."""
1126
+ current_image_index[0] = min(gallery.get_total_images() - 1, current_image_index[0] + 1)
1127
+ image_path = gallery.get_image_by_index(current_image_index[0])
1128
+ image_info_text = gallery.get_image_info(current_image_index[0])
1129
+ return image_path, image_info_text
1130
+
1131
+ def go_to_random_image():
1132
+ """Go to random image."""
1133
+ import random
1134
+ current_image_index[0] = random.randint(0, gallery.get_total_images() - 1)
1135
+ image_path = gallery.get_image_by_index(current_image_index[0])
1136
+ image_info_text = gallery.get_image_info(current_image_index[0])
1137
+ return image_path, image_info_text
1139
+
1140
+ # Connect events
1141
+ model_selector.change(
1142
+ fn=update_model_options,
1143
+ inputs=model_selector,
1144
+ outputs=[model_size, use_api, api_key, language, device_selector, hf_token]
1145
+ )
1146
+
1147
+ use_api.change(
1148
+ fn=toggle_api_key,
1149
+ inputs=use_api,
1150
+ outputs=api_key
1151
+ )
1152
+
1153
+ load_btn.click(
1154
+ fn=load_selected_model,
1155
+ inputs=[model_selector, model_size, use_api, api_key, device_selector, hf_token],
1156
+ outputs=load_status
1157
+ ).then(
1158
+ fn=lambda: ModelManager.get_model_info(),
1159
+ outputs=model_info
1160
+ )
1161
+
1162
+ transcribe_btn.click(
1163
+ fn=TranscriptionEngine.transcribe,
1164
+ inputs=[audio_input, language],
1165
+ outputs=[transcription_output, confidence_output, processing_output]
1166
+ )
1167
+
1168
+ quality_btn.click(
1169
+ fn=analyze_audio_quality,
1170
+ inputs=audio_input,
1171
+ outputs=[quality_output, quality_output]
1172
+ )
1173
+
1174
+ clear_btn.click(
1175
+ fn=lambda: ("", "", "", "", gr.update(visible=False)),
1176
+ outputs=[transcription_output, confidence_output, processing_output, quality_output, quality_output]
1177
+ )
1178
+
1179
+ # Auto-transcribe on audio change (optional)
1180
+ audio_input.change(
1181
+ fn=TranscriptionEngine.transcribe,
1182
+ inputs=[audio_input, language],
1183
+ outputs=[transcription_output, confidence_output, processing_output]
1184
+ )
1185
+
1186
+ # Image Gallery Event Connections
1187
+ # Connect thumbnail gallery selection
1188
+ thumbnail_gallery.select(
1189
+ fn=select_image_from_gallery,
1190
+ outputs=[image_display, image_info]
1191
+ )
1192
+
1193
+ # Connect navigation buttons
1194
+ prev_btn.click(
1195
+ fn=go_to_previous_image,
1196
+ outputs=[image_display, image_info]
1197
+ )
1198
+
1199
+ next_btn.click(
1200
+ fn=go_to_next_image,
1201
+ outputs=[image_display, image_info]
1202
+ )
1203
+
1204
+ random_btn.click(
1205
+ fn=go_to_random_image,
1206
+ outputs=[image_display, image_info]
1207
+ )
1208
+
1209
+ return demo
1210
+
1211
+
1212
+ def _get_quality_recommendations(quality: Dict[str, Any]) -> str:
1213
+ """Generate quality recommendations based on analysis."""
1214
+ recommendations = []
1215
+
1216
+ if quality["duration"] < 1.0:
1217
+ recommendations.append("β€’ Try recording for longer (1+ seconds)")
1218
+
1219
+ if quality["max_amplitude"] < 0.1:
1220
+ recommendations.append("β€’ Increase volume or move closer to microphone")
1221
+ elif quality["max_amplitude"] > 0.9:
1222
+ recommendations.append("β€’ Reduce volume to avoid clipping")
1223
+
1224
+ if quality["clipping_ratio"] > 0.01:
1225
+ recommendations.append("β€’ Audio is clipping - reduce input gain")
1226
+
1227
+ if quality["silence_ratio"] > 0.5:
1228
+ recommendations.append("β€’ Too much silence - record in quieter environment")
1229
+
1230
+ if not recommendations:
1231
+ recommendations.append("β€’ Audio quality looks good!")
1232
+
1233
+ return "\n".join(recommendations)
1234
+
1235
+
1236
+ def main():
1237
+ """Main application entry point."""
1238
+ # Check dependencies
1239
+ print("πŸ” Checking dependencies...")
1240
+
1241
+ try:
1242
+ import gradio
1243
+ print("βœ… Gradio available")
1244
+ except ImportError:
1245
+ print("❌ Gradio not installed. Run: pip install gradio")
1246
+ return
1247
+
1248
+ # Check available STT models
1249
+ print(f"πŸ€– Available STT models: {ModelManager.get_available_models()}")
1250
+
1251
+ # Create and launch interface
1252
+ print("πŸš€ Launching Gradio interface...")
1253
+ demo = GradioInterface.create_interface()
1254
+
1255
+ demo.launch(
1256
+ share=False, # Set to True for public sharing
1257
+ server_name="127.0.0.1",
1258
+ server_port=7861,
1259
+ show_error=True
1260
+ )
1261
+
1262
+
1263
+ if __name__ == "__main__":
1264
  main()
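
The `STT_MODELS` registry above is the single extension point: a new engine only has to follow the `BaseSTT` contract and register itself behind an optional import. Below is a minimal sketch of such a plugin, assuming the interface implied by this file (`load_model(**kwargs)`, `transcribe_audio(audio, sample_rate)` returning an `STTResult` with `text`/`confidence`/`processing_time`, plus an `is_loaded` flag); `EchoSTT` is a hypothetical name, and the authoritative signatures live in `stt/stt_base.py`:

```python
# Hypothetical plugin sketch; BaseSTT/STTResult signatures are assumed
# from their usage in gradio_voice_transcriber_clean.py.
import time

import numpy as np

from stt.stt_base import BaseSTT, STTResult


class EchoSTT(BaseSTT):
    """Toy engine that reports the clip duration instead of real speech."""

    def __init__(self):
        self.is_loaded = False

    def load_model(self, **kwargs) -> None:
        # A real engine would download / initialize its model here.
        self.is_loaded = True

    def get_model_info(self) -> dict:
        return {"model_name": "EchoSTT", "is_loaded": self.is_loaded}

    def transcribe_audio(self, audio_data: np.ndarray, sample_rate: int) -> STTResult:
        start = time.time()
        duration = len(audio_data) / sample_rate
        return STTResult(
            text=f"[{duration:.2f}s of audio]",
            confidence=1.0,
            processing_time=time.time() - start,
        )


# Registration mirrors the optional-import pattern used above:
# STT_MODELS["EchoSTT"] = EchoSTT
```
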
hf-space ADDED
@@ -0,0 +1 @@
1
+ Subproject commit 921859d10816a9cb386449308c5f66037f50deb2
pyproject.toml ADDED
@@ -0,0 +1,139 @@
1
+ [build-system]
2
+ requires = ["hatchling"]
3
+ build-backend = "hatchling.build"
4
+
5
+ [project]
6
+ name = "modular-voice-transcriber"
7
+ version = "0.2.0"
8
+ description = "A modular Gradio-based web interface for speech-to-text transcription supporting multiple STT engines"
9
+ readme = "README.md"
10
+ requires-python = ">=3.8"
11
+ dependencies = [
12
+ "gradio>=4.0.0",
13
+ "soundfile>=0.12.1",
14
+ "numpy>=1.21.0",
15
+ "pathlib2>=2.3.0; python_version < '3.4'",
16
+ ]
17
+
18
+ [project.optional-dependencies]
19
+ dev = [
20
+ "pytest>=7.0.0",
21
+ "black>=22.0.0",
22
+ "flake8>=4.0.0",
23
+ "mypy>=1.0.0",
24
+ ]
25
+ # OpenAI Whisper STT (local models)
26
+ whisper = [
27
+ "openai-whisper>=20230314",
28
+ "torch>=1.10.0",
29
+ "torchaudio>=0.10.0",
30
+ ]
31
+ # OpenAI Whisper API
32
+ whisper-api = [
33
+ "openai>=1.0.0",
34
+ ]
35
+ # Wav2Vec2 models (Hugging Face)
36
+ wav2vec2 = [
37
+ "transformers>=4.20.0",
38
+ "torch>=1.12.0",
39
+ "torchaudio>=0.12.0",
40
+ "librosa>=0.9.0", # Optional but recommended
41
+ ]
42
+ # HuBERT models (Hugging Face, Arabic Egyptian)
43
+ hubert = [
44
+ "transformers>=4.20.0",
45
+ "torch>=1.12.0",
46
+ "torchaudio>=0.12.0",
47
+ "librosa>=0.9.2",
48
+ "soundfile>=0.10.3",
49
+ "huggingface-hub>=0.14.0",
50
+ ]
51
+ # Coqui STT (open-source multilingual)
52
+ coqui = [
53
+ "coqui-stt>=1.4.0",
54
+ "soundfile>=0.10.3",
55
+ "librosa>=0.9.2",
56
+ "requests>=2.25.0",
57
+ ]
58
+ # Vosk STT (offline recognition)
59
+ vosk = [
60
+ "vosk>=0.3.42",
61
+ "soundfile>=0.12.1",
62
+ ]
63
+ # Tawasul STT (Arabic speech recognition)
64
+ tawasul = [
65
+ "transformers>=4.20.0",
66
+ "torch>=1.12.0",
67
+ "torchaudio>=0.12.0",
68
+ "librosa>=0.9.2",
69
+ "soundfile>=0.10.3",
70
+ "huggingface-hub>=0.14.0",
71
+ ]
72
+ # Azure Speech Service
73
+ azure-speech = [
74
+ "azure-cognitiveservices-speech>=1.25.0",
75
+ ]
76
+ # Google Cloud Speech-to-Text
77
+ google-speech = [
78
+ "google-cloud-speech>=2.15.0",
79
+ ]
80
+ # AssemblyAI
81
+ assemblyai = [
82
+ "assemblyai>=0.15.0",
83
+ ]
84
+ # Amazon Transcribe
85
+ aws-transcribe = [
86
+ "boto3>=1.26.0",
87
+ "botocore>=1.29.0",
88
+ ]
89
+ # All STT engines (for full functionality)
90
+ all-stt = [
91
+ "openai-whisper>=20230314",
92
+ "openai>=1.0.0",
93
+ "transformers>=4.20.0",
94
+ "torch>=1.12.0",
95
+ "torchaudio>=0.12.0",
96
+ "librosa>=0.9.0",
97
+ "vosk>=0.3.42",
98
+ "soundfile>=0.10.3",
99
+ "huggingface-hub>=0.14.0",
100
+ "coqui-stt>=1.4.0",
101
+ "requests>=2.25.0",
102
+ "azure-cognitiveservices-speech>=1.25.0",
103
+ "google-cloud-speech>=2.15.0",
104
+ "assemblyai>=0.15.0",
105
+ "boto3>=1.26.0",
106
+ ]
107
+ # Essential models (Whisper + Wav2Vec2 + HuBERT + Vosk + Coqui)
108
+ essential = [
109
+ "openai-whisper>=20230314",
110
+ "openai>=1.0.0",
111
+ "transformers>=4.20.0",
112
+ "torch>=1.12.0",
113
+ "torchaudio>=0.12.0",
114
+ "librosa>=0.9.0",
115
+ "soundfile>=0.10.3",
116
+ "huggingface-hub>=0.14.0",
117
+ "vosk>=0.3.42",
118
+ "coqui-stt>=1.4.0",
119
+ "requests>=2.25.0",
120
+ ]
121
+
122
+ [project.urls]
123
+ Homepage = "https://github.com/your-username/modular-voice-transcriber"
124
+ Repository = "https://github.com/your-username/modular-voice-transcriber.git"
125
+ Issues = "https://github.com/your-username/modular-voice-transcriber/issues"
126
+
127
+ [project.scripts]
128
+ voice-transcriber = "gradio_voice_transcriber_clean:main"
129
+
130
+ [tool.black]
131
+ line-length = 88
132
+ target-version = ['py38']
133
+
134
+ [tool.uv]
135
+ dev-dependencies = [
136
+ "pytest>=7.0.0",
137
+ "black>=22.0.0",
138
+ "flake8>=4.0.0",
139
+ ]
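
The `[project.scripts]` table wires a `voice-transcriber` console command to `main()` in `gradio_voice_transcriber_clean.py`, so after `pip install -e .` the command is equivalent to this small launcher:

```python
# What the `voice-transcriber` console script resolves to after installation.
from gradio_voice_transcriber_clean import main

if __name__ == "__main__":
    main()
```
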
requirements.txt CHANGED
@@ -1,43 +1,43 @@
1
- # Modular Voice Transcriber - Core Dependencies
2
- # Base requirements for the Gradio interface
3
- gradio>=4.0.0
4
- soundfile>=0.12.1
5
- numpy>=1.21.0
6
-
7
- # Essential STT Models (Whisper + Wav2Vec2)
8
- # OpenAI Whisper (local and API)
9
- openai-whisper>=20231117
10
- openai>=1.0.0
11
-
12
- # Wav2Vec2 (Hugging Face Transformers)
13
- transformers>=4.20.0
14
- torch>=1.12.0
15
- torchaudio>=0.12.0
16
-
17
- # Audio processing (recommended)
18
- librosa>=0.9.0
19
-
20
- # Optional: Vosk STT (for offline recognition)
21
- # vosk>=0.3.42
22
-
23
- # Optional: HuBERT Arabic STT (for Arabic Egyptian dialect)
24
- # Use requirements_hubert.txt for full setup
25
-
26
- # Optional: Coqui STT (open-source multilingual)
27
- # Use requirements_coqui.txt for full setup
28
-
29
- # Optional: Tawasul STT (for Arabic speech recognition)
30
- # Use requirements_tawasul.txt for full setup
31
-
32
- # Installation options:
33
- # pip install -r requirements.txt # Core + Essential STT models
34
- # pip install -r requirements_whisper.txt # Whisper-only setup
35
- # pip install -r requirements_wav2vec2.txt # Wav2Vec2-only setup
36
- # pip install -r requirements_vosk.txt # Vosk-only setup
37
- # pip install -r requirements_hubert.txt # HuBERT Arabic-only setup
38
- # pip install -r requirements_coqui.txt # Coqui STT-only setup
39
- # pip install -r requirements_tawasul.txt # Tawasul STT-only setup
40
- # pip install -e .[essential] # Same as core
41
- # pip install -e .[all-stt] # All supported STT engines
42
- # pip install -e .[whisper,wav2vec2,vosk,hubert,coqui,tawasul] # Specific models only
43
  # pip install -e .[dev] # Development dependencies
 
1
+ # Modular Voice Transcriber - Core Dependencies
2
+ # Base requirements for the Gradio interface
3
+ gradio>=4.0.0
4
+ soundfile>=0.12.1
5
+ numpy>=1.21.0
6
+
7
+ # Essential STT Models (Whisper + Wav2Vec2)
8
+ # OpenAI Whisper (local and API)
9
+ openai-whisper>=20231117
10
+ openai>=1.0.0
11
+
12
+ # Wav2Vec2 (Hugging Face Transformers)
13
+ transformers>=4.20.0
14
+ torch>=1.12.0
15
+ torchaudio>=0.12.0
16
+
17
+ # Audio processing (recommended)
18
+ librosa>=0.9.0
19
+
20
+ # Optional: Vosk STT (for offline recognition)
21
+ # vosk>=0.3.42
22
+
23
+ # Optional: HuBERT Arabic STT (for Arabic Egyptian dialect)
24
+ # Use requirements_hubert.txt for full setup
25
+
26
+ # Optional: Coqui STT (open-source multilingual)
27
+ # Use requirements_coqui.txt for full setup
28
+
29
+ # Optional: Tawasul STT (for Arabic speech recognition)
30
+ # Use requirements_tawasul.txt for full setup
31
+
32
+ # Installation options:
33
+ # pip install -r requirements.txt # Core + Essential STT models
34
+ # pip install -r requirements_whisper.txt # Whisper-only setup
35
+ # pip install -r requirements_wav2vec2.txt # Wav2Vec2-only setup
36
+ # pip install -r requirements_vosk.txt # Vosk-only setup
37
+ # pip install -r requirements_hubert.txt # HuBERT Arabic-only setup
38
+ # pip install -r requirements_coqui.txt # Coqui STT-only setup
39
+ # pip install -r requirements_tawasul.txt # Tawasul STT-only setup
40
+ # pip install -e .[essential] # Same as core
41
+ # pip install -e .[all-stt] # All supported STT engines
42
+ # pip install -e .[whisper,wav2vec2,vosk,hubert,coqui,tawasul] # Specific models only
43
  # pip install -e .[dev] # Development dependencies
requirements_coqui.txt ADDED
@@ -0,0 +1,6 @@
1
+ # Coqui STT Requirements
2
+ coqui-stt-model-manager
3
+ soundfile>=0.10.3
4
+ librosa>=0.9.2
5
+ numpy>=1.21.0
6
+ requests>=2.25.0
requirements_hubert.txt ADDED
@@ -0,0 +1,7 @@
1
+ # HuBERT Arabic STT Requirements
2
+ torch>=1.12.0
3
+ transformers>=4.20.0
4
+ torchaudio>=0.12.0
5
+ librosa>=0.9.2
6
+ soundfile>=0.10.3
7
+ huggingface-hub>=0.14.0
requirements_tawasul.txt ADDED
@@ -0,0 +1,13 @@
1
+ # Tawasul STT Requirements
2
+ # Arabic Speech Recognition using Tawasul STT V0 model
3
+ torch>=1.12.0
4
+ transformers>=4.20.0
5
+ torchaudio>=0.12.0
6
+ librosa>=0.9.2
7
+ soundfile>=0.10.3
8
+ huggingface-hub>=0.14.0
9
+ numpy>=1.21.0
10
+
11
+ # Optional: For better performance
12
+ # accelerate>=0.20.0
13
+ # optimum>=1.8.0
requirements_vosk.txt ADDED
@@ -0,0 +1,3 @@
1
+ # Vosk STT requirements
2
+ vosk>=0.3.42
3
+ soundfile>=0.12.1
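
When `auto_download` is disabled in the interface, these Vosk model folders have to be fetched and extracted manually. A stand-alone smoke test using the standard `vosk` API, assuming the small English model directory sits next to the script and `sample.wav` is a placeholder 16 kHz mono PCM WAV:

```python
import json
import wave

from vosk import KaldiRecognizer, Model

# Path to an extracted model folder (auto-downloaded by the app when enabled).
model = Model("vosk-model-small-en-us-0.15")

with wave.open("sample.wav", "rb") as wf:  # placeholder: 16 kHz, mono, PCM
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    print(json.loads(rec.FinalResult())["text"])
```
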
requirements_wav2vec2.txt ADDED
@@ -0,0 +1,24 @@
1
+ # Wav2Vec2 Arabic STT Requirements
2
+ # Minimal requirements for using the Wav2Vec2 Arabic Egyptian model
3
+
4
+ # Base requirements
5
+ gradio>=4.0.0
6
+ numpy>=1.21.0
7
+ soundfile>=0.12.1
8
+
9
+ # Wav2Vec2 specific requirements
10
+ transformers>=4.20.0
11
+ torch>=1.12.0
12
+ torchaudio>=0.12.0
13
+
14
+ # Optional but highly recommended for better audio processing
15
+ librosa>=0.9.0
16
+
17
+ # Installation:
18
+ # pip install -r requirements_wav2vec2.txt
19
+
20
+ # Notes:
21
+ # - First model load will download ~1.2GB from Hugging Face Hub
22
+ # - GPU support is automatic if PyTorch with CUDA is installed
23
+ # - Model runs on CPU but GPU is significantly faster for longer audio
24
+ # - Optimized for Arabic Egyptian dialect but works with Standard Arabic
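
The note on automatic GPU support matches the interface's `device = "auto"` option; presumably it resolves to CUDA when available (the actual logic lives in `stt/wav2vec2_arabic_stt.py`, not shown here). A quick check of what "auto" would pick on a given machine:

```python
import torch

# "auto" falls back to CPU when no CUDA-enabled PyTorch build is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Wav2Vec2 inference device: {device}")
```
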
requirements_whisper.txt ADDED
@@ -0,0 +1,27 @@
1
+ # OpenAI Whisper STT Requirements
2
+ # Requirements for using OpenAI Whisper (local and API)
3
+
4
+ # Base requirements
5
+ gradio>=4.0.0
6
+ numpy>=1.21.0
7
+ soundfile>=0.12.1
8
+
9
+ # Whisper local model requirements
10
+ openai-whisper>=20231117
11
+ torch>=1.10.0
12
+ torchaudio>=0.10.0
13
+
14
+ # Whisper API requirements
15
+ openai>=1.0.0
16
+
17
+ # Optional for better audio processing
18
+ librosa>=0.9.0
19
+
20
+ # Installation:
21
+ # pip install -r requirements_whisper.txt
22
+
23
+ # Notes:
24
+ # - Local models download automatically on first use
25
+ # - API requires OpenAI API key
26
+ # - Model sizes: tiny(39MB) < base(142MB) < small(461MB) < medium(1.5GB) < large(2.9GB)
27
+ # - GPU support automatic if PyTorch with CUDA is installed
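
A quick smoke test for the local Whisper path once these requirements are installed, using the standard `openai-whisper` API (`sample.wav` is a placeholder path):

```python
import whisper

# First use downloads ~142MB for "base" and caches it under ~/.cache/whisper.
model = whisper.load_model("base")
result = model.transcribe("sample.wav", language="en")  # placeholder file
print(result["text"])
```
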
setup.py ADDED
@@ -0,0 +1,212 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Setup script for Modular Voice Transcriber
4
+
5
+ This script helps set up the environment and install dependencies
6
+ based on which STT models you want to use.
7
+ """
8
+
9
+ import subprocess
10
+ import sys
11
+ import argparse
12
+ from pathlib import Path
13
+
14
+ def run_command(command, description=""):
15
+ """Run a command and handle errors."""
16
+ if description:
17
+ print(f"πŸ“¦ {description}...")
18
+
19
+ try:
20
+ result = subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
21
+ print(f"βœ… {description or 'Command'} completed successfully")
22
+ return True
23
+ except subprocess.CalledProcessError as e:
24
+ print(f"❌ {description or 'Command'} failed: {e}")
25
+ if e.stdout:
26
+ print(f"Output: {e.stdout}")
27
+ if e.stderr:
28
+ print(f"Error: {e.stderr}")
29
+ return False
30
+
31
+ def install_requirements(requirements_file):
32
+ """Install requirements from a specific file."""
33
+ if not Path(requirements_file).exists():
34
+ print(f"❌ Requirements file not found: {requirements_file}")
35
+ return False
36
+
37
+ return run_command(
38
+ f"pip install -r {requirements_file}",
39
+ f"Installing requirements from {requirements_file}"
40
+ )
41
+
42
+ def install_optional_dependencies(groups):
43
+ """Install optional dependencies using pip install -e ."""
44
+ group_str = ",".join(groups)
45
+ return run_command(
46
+ f'pip install -e ".[{group_str}]"',  # quoted so interactive shells do not glob the [...] extras
47
+ f"Installing optional dependencies: {group_str}"
48
+ )
49
+
50
+ def test_imports(modules):
51
+ """Test if modules can be imported."""
52
+ print("\nπŸ” Testing module imports...")
53
+ all_good = True
54
+
55
+ for module in modules:
56
+ try:
57
+ __import__(module)
58
+ print(f"βœ… {module}")
59
+ except ImportError as e:
60
+ print(f"❌ {module}: {e}")
61
+ all_good = False
62
+
63
+ return all_good
64
+
65
+ def main():
66
+ parser = argparse.ArgumentParser(description="Setup Modular Voice Transcriber")
67
+ parser.add_argument(
68
+ "--profile",
69
+ choices=["minimal", "essential", "whisper-only", "wav2vec2-only", "vosk-only", "hubert-only", "coqui-only", "tawasul-only", "all"],
70
+ default="essential",
71
+ help="Installation profile (default: essential)"
72
+ )
73
+ parser.add_argument(
74
+ "--test",
75
+ action="store_true",
76
+ help="Test the installation after setup"
77
+ )
78
+
79
+ args = parser.parse_args()
80
+
81
+ print("πŸš€ Modular Voice Transcriber Setup")
82
+ print("=" * 50)
83
+ print(f"Profile: {args.profile}")
84
+ print()
85
+
86
+ # Install base requirements first
87
+ print("πŸ“¦ Installing base requirements...")
88
+ base_success = run_command(
89
+ 'pip install "gradio>=4.0.0" "numpy>=1.21.0" "soundfile>=0.12.1"',  # quoted: an unquoted >= is treated as shell redirection
90
+ "Installing base dependencies"
91
+ )
92
+
93
+ if not base_success:
94
+ print("❌ Failed to install base requirements. Exiting.")
95
+ return 1
96
+
97
+ # Install profile-specific requirements
98
+ success = True
99
+
100
+ if args.profile == "minimal":
101
+ print("\nπŸ“¦ Minimal installation - Gradio interface only")
102
+ # Base requirements already installed
103
+
104
+ elif args.profile == "essential":
105
+ print("\nπŸ“¦ Essential installation - Whisper + Wav2Vec2")
106
+ success = install_optional_dependencies(["essential"])
107
+
108
+ elif args.profile == "whisper-only":
109
+ print("\nπŸ“¦ Whisper-only installation")
110
+ success = install_requirements("requirements_whisper.txt")
111
+
112
+ elif args.profile == "wav2vec2-only":
113
+ print("\nπŸ“¦ Wav2Vec2-only installation")
114
+ success = install_requirements("requirements_wav2vec2.txt")
115
+
116
+ elif args.profile == "vosk-only":
117
+ print("\nπŸ“¦ Vosk-only installation")
118
+ success = install_requirements("requirements_vosk.txt")
119
+
120
+ elif args.profile == "hubert-only":
121
+ print("\nπŸ“¦ HuBERT Arabic-only installation")
122
+ success = install_requirements("requirements_hubert.txt")
123
+
124
+ elif args.profile == "coqui-only":
125
+ print("\nπŸ“¦ Coqui STT-only installation")
126
+ success = install_requirements("requirements_coqui.txt")
127
+
128
+ elif args.profile == "tawasul-only":
129
+ print("\nπŸ“¦ Tawasul STT-only installation")
130
+ success = install_requirements("requirements_tawasul.txt")
131
+
132
+ elif args.profile == "all":
133
+ print("\nπŸ“¦ Full installation - All STT models")
134
+ success = install_optional_dependencies(["all-stt"])
135
+
136
+ if not success:
137
+ print(f"❌ Failed to install {args.profile} profile requirements.")
138
+ return 1
139
+
140
+ # Test installation if requested
141
+ if args.test:
142
+ print("\nπŸ§ͺ Testing installation...")
143
+
144
+ # Basic imports
145
+ basic_modules = ["gradio", "numpy", "soundfile"]
146
+ test_imports(basic_modules)
147
+
148
+ # Profile-specific tests
149
+ if args.profile in ["essential", "whisper-only", "all"]:
150
+ whisper_modules = ["whisper", "openai"]
151
+ test_imports(whisper_modules)
152
+
153
+ if args.profile in ["essential", "wav2vec2-only", "hubert-only", "tawasul-only", "all"]:
154
+ wav2vec2_modules = ["transformers", "torch", "torchaudio"]
155
+ test_imports(wav2vec2_modules)
156
+
157
+ # Test our modules
158
+ try:
159
+ from stt.stt_base import BaseSTT
160
+ from stt.whisper_stt import WhisperSTT
161
+ print("βœ… STT base classes")
162
+ except ImportError as e:
163
+ print(f"❌ STT base classes: {e}")
164
+
165
+ if args.profile in ["essential", "wav2vec2-only", "all"]:
166
+ try:
167
+ from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
168
+ print("βœ… Wav2Vec2 Arabic STT")
169
+ except ImportError as e:
170
+ print(f"❌ Wav2Vec2 Arabic STT: {e}")
171
+
172
+ if args.profile in ["hubert-only", "all"]:
173
+ try:
174
+ from stt.hubert_arabic_stt import HuBERTArabicSTT
175
+ print("βœ… HuBERT Arabic STT")
176
+ except ImportError as e:
177
+ print(f"❌ HuBERT Arabic STT: {e}")
178
+
179
+ if args.profile in ["coqui-only", "all"]:
180
+ try:
181
+ from stt.coqui_stt import CoquiSTT
182
+ print("βœ… Coqui STT")
183
+ except ImportError as e:
184
+ print(f"❌ Coqui STT: {e}")
185
+
186
+ if args.profile in ["tawasul-only", "all"]:
187
+ try:
188
+ from stt.tawasul_stt import TawasulSTT
189
+ print("βœ… Tawasul STT")
190
+ except ImportError as e:
191
+ print(f"❌ Tawasul STT: {e}")
192
+
193
+ if args.profile in ["vosk-only", "all"]:
194
+ try:
195
+ from stt.vosk_stt import VoskSTT
196
+ print("βœ… Vosk STT")
197
+ except ImportError as e:
198
+ print(f"❌ Vosk STT: {e}")
199
+
200
+ print("\n" + "=" * 50)
201
+ print("πŸŽ‰ Setup completed!")
202
+ print("\nπŸ’‘ Next steps:")
203
+ print(" 1. Run the transcriber:")
204
+ print(" python gradio_voice_transcriber_clean.py")
205
+ print("\n 2. Or test specific models:")
206
+ print(" python test_wav2vec2_arabic.py")
207
+ print("\n 3. Check available models in the web interface")
208
+
209
+ return 0
210
+
211
+ if __name__ == "__main__":
212
+ sys.exit(main())
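+
+ # Example invocations, mirroring the argparse options defined above:
+ #
+ # python setup.py --profile essential --test
+ # python setup.py --profile hubert-only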
setup_hf_auth.py ADDED
@@ -0,0 +1,156 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ HuggingFace Authentication Helper
4
+
5
+ This script helps set up HuggingFace authentication for accessing private models.
6
+ """
7
+
8
+ import os
9
+ import subprocess
10
+ import sys
11
+ from pathlib import Path
12
+
13
+ def check_hf_cli():
14
+ """Check if huggingface-hub CLI is available."""
15
+ try:
16
+ result = subprocess.run(["huggingface-cli", "--version"],
17
+ capture_output=True, text=True, check=True)
18
+ print(f"βœ… HuggingFace CLI available: {result.stdout.strip()}")
19
+ return True
20
+ except (subprocess.CalledProcessError, FileNotFoundError):
21
+ print("❌ HuggingFace CLI not found")
22
+ return False
23
+
24
+ def install_hf_hub():
25
+ """Install huggingface-hub package."""
26
+ print("πŸ“¦ Installing huggingface-hub...")
27
+ try:
28
+ subprocess.run([sys.executable, "-m", "pip", "install", "huggingface-hub"],
29
+ check=True)
30
+ print("βœ… huggingface-hub installed successfully")
31
+ return True
32
+ except subprocess.CalledProcessError as e:
33
+ print(f"❌ Failed to install huggingface-hub: {e}")
34
+ return False
35
+
36
+ def login_to_hf():
37
+ """Login to HuggingFace using CLI."""
38
+ print("\nπŸ” Logging in to HuggingFace...")
39
+ print("This will open a browser to get your token.")
40
+ print("If you don't have a token, create one at: https://huggingface.co/settings/tokens")
41
+
42
+ try:
43
+ subprocess.run(["huggingface-cli", "login"], check=True)
44
+ print("βœ… Successfully logged in to HuggingFace")
45
+ return True
46
+ except subprocess.CalledProcessError as e:
47
+ print(f"❌ Failed to login: {e}")
48
+ return False
49
+
50
+ def check_auth_status():
51
+ """Check current authentication status."""
52
+ try:
53
+ result = subprocess.run(["huggingface-cli", "whoami"],
54
+ capture_output=True, text=True, check=True)
55
+ username = result.stdout.strip()
56
+ print(f"βœ… Logged in as: {username}")
57
+ return True, username
58
+ except subprocess.CalledProcessError:
59
+ print("❌ Not logged in to HuggingFace")
60
+ return False, None
61
+
62
+ def test_model_access():
63
+ """Test access to the Arabic Egyptian model."""
64
+ print("\nπŸ§ͺ Testing model access...")
65
+
66
+ try:
67
+ from transformers import AutoTokenizer
68
+
69
+ # Try the preferred Egyptian model first, then fall back to more general checkpoints
70
+ models_to_test = [
71
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
72
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
73
+ "facebook/wav2vec2-large-xlsr-53"
74
+ ]
75
+
76
+ for model_id in models_to_test:
77
+ try:
78
+ print(f"Testing: {model_id}")
79
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
80
+ print(f"βœ… {model_id} - Accessible")
81
+ return True
82
+ except Exception as e:
83
+ print(f"❌ {model_id} - {str(e)}")
84
+ continue
85
+
86
+ print("❌ None of the models are accessible")
87
+ return False
88
+
89
+ except ImportError:
90
+ print("❌ Transformers library not installed")
91
+ return False
92
+
93
+ def manual_token_setup():
94
+ """Guide user through manual token setup."""
95
+ print("\nπŸ“ Manual Token Setup")
96
+ print("=" * 40)
97
+ print("1. Go to: https://huggingface.co/settings/tokens")
98
+ print("2. Create a new token with 'Read' permissions")
99
+ print("3. Copy the token (starts with 'hf_')")
100
+ print("4. Use it in the Gradio interface:")
101
+ print(" - Select 'Wav2Vec2ArabicSTT'")
102
+ print(" - Choose 'Arabic Egyptian (Experimental)' model")
103
+ print(" - Enter your token in 'HuggingFace Token' field")
104
+ print(" - Click 'Load Model'")
105
+ print("\nπŸ’‘ Alternatively, set environment variable:")
106
+ print(" export HF_TOKEN=your_token_here")
107
+
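+ # A minimal sketch of the HF_TOKEN environment-variable route above
+ # (assumes huggingface_hub is installed; the token value is a placeholder):
+ #
+ # import os
+ # from huggingface_hub import whoami
+ # print(whoami(token=os.environ.get("HF_TOKEN")))
+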
108
+ def main():
109
+ """Main authentication helper."""
110
+ print("πŸ€— HuggingFace Authentication Helper")
111
+ print("=" * 50)
112
+
113
+ # Check if already logged in
114
+ is_logged_in, username = check_auth_status()
115
+
116
+ if is_logged_in:
117
+ print(f"\nβœ… Already authenticated as: {username}")
118
+
119
+ # Test model access
120
+ if test_model_access():
121
+ print("\nπŸŽ‰ Authentication is working! You can use the experimental models.")
122
+ else:
123
+ print("\n⚠️ Authentication works but model access failed.")
124
+ print("The experimental model might not be available.")
125
+ print("Try using the standard Arabic model instead.")
126
+
127
+ return 0
128
+
129
+ # Not logged in, try to set up
130
+ print("\n❌ Not authenticated with HuggingFace")
131
+
132
+ # Check if CLI is available
133
+ if not check_hf_cli():
134
+ print("\nπŸ“¦ Installing HuggingFace CLI...")
135
+ if not install_hf_hub():
136
+ print("\n❌ Failed to install HuggingFace Hub")
137
+ manual_token_setup()
138
+ return 1
139
+
140
+ # Try to login
141
+ print("\nπŸ” Setting up authentication...")
142
+ if login_to_hf():
143
+ # Test access after login
144
+ if test_model_access():
145
+ print("\nπŸŽ‰ Setup complete! You can now use all models.")
146
+ else:
147
+ print("\n⚠️ Login successful but some models may not be accessible.")
148
+ else:
149
+ print("\n❌ Automatic login failed")
150
+ manual_token_setup()
151
+ return 1
152
+
153
+ return 0
154
+
155
+ if __name__ == "__main__":
156
+ sys.exit(main())
stt/__init__.py ADDED
@@ -0,0 +1,19 @@
1
+ """
2
+ STT (Speech-to-Text) Package
3
+
4
+ This package contains various STT model implementations that inherit from BaseSTT.
5
+
6
+ Available STT Models:
7
+ - DummySTT: Test implementation for interface validation
8
+ - WhisperSTT: OpenAI Whisper implementation (local + API)
9
+ """
10
+
11
+ from .stt_base import BaseSTT, STTResult, DummySTT
12
+
13
+ # Import WhisperSTT with error handling
14
+ try:
15
+ from .whisper_stt import WhisperSTT
16
+ __all__ = ['BaseSTT', 'STTResult', 'DummySTT', 'WhisperSTT']
17
+ except ImportError:
18
+ # WhisperSTT dependencies not available
19
+ __all__ = ['BaseSTT', 'STTResult', 'DummySTT']
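+
+ # Usage sketch: callers can probe __all__ to see which backends imported
+ # cleanly before picking one (DummySTT is always present):
+ #
+ # import stt
+ # backend = stt.WhisperSTT if 'WhisperSTT' in stt.__all__ else stt.DummySTT
+ # backend.load_model()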
stt/chirp3_stt.py ADDED
@@ -0,0 +1,136 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Chirp3 Speech-to-Text (STT) Implementation
4
+
5
+ Chirp3STT adapts the Google Cloud Speech-to-Text API to the BaseSTT interface:
6
+ it accepts a WAV file path or a mono numpy array and returns an STTResult.
7
+ """
8
+
9
+ from .stt_base import BaseSTT, STTResult
10
+
11
+
12
+ import time
13
+ import numpy as np
14
+ from typing import Union
15
+ import io
16
+ import wave
17
+
18
+ try:
19
+ from google.cloud import speech
20
+ except ImportError:
21
+ speech = None
22
+
24
+
25
+ class Chirp3STT(BaseSTT):
26
+ """
27
+ Chirp3STT implementation using Google Cloud Speech-to-Text API.
28
+ Accepts file path or numpy array as input.
29
+ """
30
+ model_name = "Chirp3STT"
31
+ client = None
32
+ is_loaded = False
33
+ config = {
34
+ "language": "ar-EG",
35
+ "sample_rate": 16000,
36
+ "encoding": "LINEAR16",
37
+ "enable_automatic_punctuation": True,
38
+ }
39
+
40
+ @classmethod
41
+ def load_model(cls, **kwargs) -> None:
42
+ """
43
+ Initialize Google Cloud Speech client.
44
+ """
45
+ if speech is None:
+ raise ImportError("google-cloud-speech is not installed: pip install google-cloud-speech")
+ cls.client = speech.SpeechClient()
46
+ cls.is_loaded = True
47
+
48
+ @classmethod
49
+ def transcribe_audio(cls, audio_data: Union[str, np.ndarray], sample_rate: int = None):
50
+ """
51
+ Transcribe audio using Google Cloud Speech-to-Text API.
52
+ Args:
53
+ audio_data: Path to WAV file or numpy array (float32, mono)
54
+ sample_rate: Sample rate if numpy array is provided
55
+ Returns:
56
+ STTResult
57
+ """
58
+ if not cls.is_loaded:
59
+ raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")
60
+
61
+ start_time = time.time()
62
+ # Check google-cloud-speech import
63
+ if speech is None:
64
+ return STTResult(
65
+ text="",
66
+ confidence=0.0,
67
+ processing_time=0.0,
68
+ metadata={"error": "google-cloud-speech not installed"}
69
+ )
70
+
71
+ # Prepare audio for Google API
72
+ audio_content = None
73
+ actual_sample_rate = sample_rate or cls.config["sample_rate"]
74
+
75
+ if isinstance(audio_data, str):
76
+ # File path
77
+ try:
78
+ with open(audio_data, "rb") as f:
79
+ audio_content = f.read()
80
+ except Exception as e:
81
+ return STTResult(
82
+ text="",
83
+ confidence=0.0,
84
+ processing_time=0.0,
85
+ metadata={"error": f"Failed to read file: {e}"}
86
+ )
87
+ elif isinstance(audio_data, np.ndarray):
88
+ # Numpy array (float32 or int16)
89
+ arr = audio_data
90
+ if arr.dtype != np.int16:
91
+ arr = (arr * 32767).astype(np.int16)
92
+ buf = io.BytesIO()
93
+ with wave.open(buf, 'wb') as wf:
94
+ wf.setnchannels(1)
95
+ wf.setsampwidth(2)
96
+ wf.setframerate(actual_sample_rate)
97
+ wf.writeframes(arr.tobytes())
98
+ audio_content = buf.getvalue()
99
+ else:
100
+ return STTResult(
101
+ text="",
102
+ confidence=0.0,
103
+ processing_time=0.0,
104
+ metadata={"error": "Unsupported audio input type"}
105
+ )
106
+
107
+ # Prepare Google API request
108
+ audio = speech.RecognitionAudio(content=audio_content)
109
+ config = speech.RecognitionConfig(
110
+ encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
111
+ sample_rate_hertz=actual_sample_rate,
112
+ language_code=cls.config["language"],
113
+ enable_automatic_punctuation=cls.config["enable_automatic_punctuation"],
114
+ )
115
+ try:
116
+ response = cls.client.recognize(config=config, audio=audio)
117
+ if response.results:
118
+ transcript = response.results[0].alternatives[0].transcript
119
+ confidence = response.results[0].alternatives[0].confidence if response.results[0].alternatives else 0.0
120
+ else:
121
+ transcript = ""
122
+ confidence = 0.0
123
+ processing_time = time.time() - start_time
124
+ return STTResult(
125
+ text=transcript,
126
+ confidence=confidence,
127
+ processing_time=processing_time,
128
+ metadata={"api": "google-cloud-speech"}
129
+ )
130
+ except Exception as e:
131
+ return STTResult(
132
+ text="",
133
+ confidence=0.0,
134
+ processing_time=time.time() - start_time,
135
+ metadata={"error": str(e)}
136
+ )
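+
+ # Minimal usage sketch (assumes Google Cloud credentials are configured,
+ # e.g. via GOOGLE_APPLICATION_CREDENTIALS; "recording.wav" is a placeholder):
+ #
+ # Chirp3STT.load_model()
+ # result = Chirp3STT.transcribe_audio("recording.wav")
+ # print(result.text, result.confidence)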
stt/coqui_stt.py ADDED
@@ -0,0 +1,390 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Coqui STT Implementation with Model Manager
4
+
5
+ This module provides speech-to-text functionality using coqui-stt-model-manager.
6
+ The model manager provides a simplified interface for downloading and using
7
+ Coqui STT models with automatic model management.
8
+
9
+ Features:
10
+ - Automatic model downloading and management
11
+ - Multiple pre-trained models available
12
+ - Language-specific models
13
+ - Offline processing
14
+ - Simplified API interface
15
+ - GPU acceleration support
16
+
17
+ Dependencies:
18
+ - coqui-stt-model-manager
19
+ - numpy
20
+ - soundfile
21
+ - librosa (for audio preprocessing)
22
+
23
+ Model Management:
24
+ Models are automatically managed by the coqui-stt-model-manager.
25
+ Popular models include English, German, French, Spanish, and more.
26
+ """
27
+
28
+ import os
29
+ import logging
30
+ import tempfile
31
+ from pathlib import Path
32
+ from typing import Optional, Dict, Any, List, Tuple
33
+ import numpy as np
34
+
35
+ try:
36
+ from coqui_stt_model_manager import CoquiSTTModelManager
37
+ COQUI_STT_AVAILABLE = True
38
+ except ImportError:
39
+ COQUI_STT_AVAILABLE = False
40
+ CoquiSTTModelManager = None
41
+
42
+ try:
43
+ import soundfile as sf
44
+ SOUNDFILE_AVAILABLE = True
45
+ except ImportError:
46
+ SOUNDFILE_AVAILABLE = False
47
+ sf = None
48
+
49
+ try:
50
+ import librosa
51
+ LIBROSA_AVAILABLE = True
52
+ except ImportError:
53
+ LIBROSA_AVAILABLE = False
54
+ librosa = None
55
+
56
+ from .stt_base import BaseSTT
57
+
58
+ logger = logging.getLogger(__name__)
59
+
60
+
61
+ class CoquiSTT(BaseSTT):
62
+ """
63
+ Coqui STT implementation using coqui-stt-model-manager.
64
+
65
+ Coqui STT provides high-quality open-source speech recognition
66
+ with simplified model management through the model manager.
67
+ """
68
+
69
+ def __init__(self):
70
+ """Initialize Coqui STT with model manager."""
71
+ super().__init__()
72
+ self.model_manager = None
73
+ self.current_model = None
74
+ self.model_info = {}
75
+
76
+ # Available models through the model manager
77
+ self.available_models = {
78
+ "english-huge": {
79
+ "language": "en",
80
+ "description": "English model with huge vocabulary",
81
+ "model_id": "english-huge-vocab"
82
+ },
83
+ "english-large": {
84
+ "language": "en",
85
+ "description": "English model with large vocabulary",
86
+ "model_id": "english-large-vocab"
87
+ },
88
+ "german": {
89
+ "language": "de",
90
+ "description": "German language model",
91
+ "model_id": "german"
92
+ },
93
+ "french": {
94
+ "language": "fr",
95
+ "description": "French language model",
96
+ "model_id": "french"
97
+ },
98
+ "spanish": {
99
+ "language": "es",
100
+ "description": "Spanish language model",
101
+ "model_id": "spanish"
102
+ }
103
+ }
104
+
105
+ @classmethod
106
+ def is_available(cls) -> bool:
107
+ """Check if Coqui STT Model Manager is available."""
108
+ try:
109
+ from coqui_stt_model_manager import CoquiSTTModelManager
110
+ import soundfile
111
+ return True
112
+ except ImportError as e:
113
+ logger.warning(f"Coqui STT Model Manager dependencies not available: {e}")
114
+ return False
115
+
116
+ def check_dependencies(self) -> Tuple[bool, str]:
117
+ """Check if required dependencies are available."""
118
+ missing_deps = []
119
+
120
+ if not COQUI_STT_AVAILABLE:
121
+ missing_deps.append("coqui-stt-model-manager")
122
+
123
+ if not SOUNDFILE_AVAILABLE:
124
+ missing_deps.append("soundfile")
125
+
126
+ if not LIBROSA_AVAILABLE:
127
+ missing_deps.append("librosa (recommended for audio preprocessing)")
128
+
129
+ if missing_deps:
130
+ return False, f"Missing dependencies: {', '.join(missing_deps)}"
131
+
132
+ return True, "All dependencies available"
133
+
134
+ def load_model(
135
+ self,
136
+ model_name: str = "english-large",
137
+ auto_download: bool = True,
138
+ beam_width: int = 512,
139
+ lm_alpha: float = 0.931289039105002,
140
+ lm_beta: float = 1.1834137581510284,
141
+ **kwargs
142
+ ) -> None:
143
+ """
144
+ Load a Coqui STT model using the model manager.
145
+
146
+ Args:
147
+ model_name: Name of the model to load
148
+ auto_download: Whether to automatically download the model if not found
149
+ beam_width: Beam width for CTC beam search decoder
150
+ lm_alpha: Language model alpha parameter
151
+ lm_beta: Language model beta parameter
152
+ **kwargs: Additional model parameters
153
+
154
+ Raises:
155
+ RuntimeError: If model loading fails
156
+ """
157
+ deps_ok, deps_msg = self.check_dependencies()
158
+ if not deps_ok:
159
+ raise RuntimeError(f"Dependency check failed: {deps_msg}")
160
+
161
+ try:
162
+ # Initialize model manager
163
+ logger.info("Initializing Coqui STT Model Manager...")
164
+ self.model_manager = CoquiSTTModelManager()
165
+
166
+ # Get model identifier
167
+ if model_name in self.available_models:
168
+ model_id = self.available_models[model_name]["model_id"]
169
+ else:
170
+ model_id = model_name # Use as custom model ID
171
+
172
+ # Load the model through model manager
173
+ logger.info(f"Loading Coqui STT model: {model_id}")
174
+
175
+ if auto_download:
176
+ # Download and load model
177
+ self.current_model = self.model_manager.download_and_load_model(
178
+ model_id=model_id,
179
+ beam_width=beam_width,
180
+ lm_alpha=lm_alpha,
181
+ lm_beta=lm_beta
182
+ )
183
+ else:
184
+ # Try to load existing model
185
+ self.current_model = self.model_manager.load_model(
186
+ model_id=model_id,
187
+ beam_width=beam_width,
188
+ lm_alpha=lm_alpha,
189
+ lm_beta=lm_beta
190
+ )
191
+
192
+ # Store model info
193
+ self.model_info = {
194
+ "model_name": model_name,
195
+ "model_id": model_id,
196
+ "beam_width": beam_width,
197
+ "lm_alpha": lm_alpha,
198
+ "lm_beta": lm_beta,
199
+ }
200
+
201
+ if model_name in self.available_models:
202
+ self.model_info.update(self.available_models[model_name])
203
+
204
+ logger.info(f"Coqui STT model loaded successfully: {model_name}")
205
+
206
+ except Exception as e:
207
+ error_msg = f"Error loading Coqui STT model: {e}"
208
+ logger.error(error_msg)
209
+ raise RuntimeError(error_msg)
210
+
211
+ def preprocess_audio(self, audio_data: np.ndarray, sample_rate: int) -> np.ndarray:
212
+ """
213
+ Preprocess audio for Coqui STT.
214
+
215
+ Coqui STT requires 16kHz mono audio.
216
+
217
+ Args:
218
+ audio_data: Audio data as numpy array
219
+ sample_rate: Original sample rate
220
+
221
+ Returns:
222
+ Preprocessed audio data
223
+ """
224
+ try:
225
+ # Convert to mono if needed
226
+ if len(audio_data.shape) > 1:
227
+ audio_data = np.mean(audio_data, axis=1)
228
+
229
+ # Resample to 16kHz if needed
230
+ target_sr = 16000
231
+ if sample_rate != target_sr and LIBROSA_AVAILABLE:
232
+ audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=target_sr)
233
+ sample_rate = target_sr
234
+ elif sample_rate != target_sr:
235
+ logger.warning(f"Audio is {sample_rate}Hz but Coqui STT requires 16kHz. Install librosa for automatic resampling.")
236
+
237
+ # Normalize audio
238
+ audio_data = audio_data.astype(np.float32)
239
+ if np.max(np.abs(audio_data)) > 0:
240
+ audio_data = audio_data / np.max(np.abs(audio_data))
241
+
242
+ # Convert to int16 as required by Coqui STT
243
+ audio_data = (audio_data * 32767).astype(np.int16)
244
+
245
+ return audio_data
246
+
247
+ except Exception as e:
248
+ logger.error(f"Error preprocessing audio: {e}")
249
+ return audio_data
250
+
251
+ def transcribe(self, audio_path: str, **kwargs) -> Tuple[str, str, str]:
252
+ """
253
+ Transcribe audio using Coqui STT Model Manager.
254
+
255
+ Args:
256
+ audio_path: Path to audio file
257
+ **kwargs: Additional transcription parameters
258
+
259
+ Returns:
260
+ Tuple of (transcription, confidence_info, processing_info)
261
+ """
262
+ if self.current_model is None:
263
+ return "❌ Model not loaded. Please load the model first.", "", ""
264
+
265
+ try:
266
+ import time
267
+ start_time = time.time()
268
+
269
+ # Validate file
270
+ if not os.path.exists(audio_path):
271
+ return f"❌ Audio file not found: {audio_path}", "", ""
272
+
273
+ logger.info(f"🎡 Transcribing audio with Coqui STT: {audio_path}")
274
+
275
+ # Load audio file
276
+ audio_data, sample_rate = sf.read(audio_path)
277
+
278
+ # Preprocess audio
279
+ processed_audio = self.preprocess_audio(audio_data, sample_rate)
280
+
281
+ # Get transcription parameters
282
+ return_confidence = kwargs.get("return_confidence", True)
283
+ return_timestamps = kwargs.get("return_timestamps", False)
284
+
285
+ # Perform transcription using model manager
286
+ if return_timestamps:
287
+ # Use metadata for word timestamps
288
+ result = self.model_manager.transcribe_with_metadata(
289
+ audio_data=processed_audio,
290
+ model=self.current_model
291
+ )
292
+
293
+ # Extract text and calculate confidence
294
+ transcription = ""
295
+ total_confidence = 0.0
296
+ word_count = 0
297
+
298
+ if hasattr(result, 'transcripts') and result.transcripts:
299
+ for token in result.transcripts[0].tokens:
300
+ transcription += token.text
301
+ if hasattr(token, 'confidence'):
302
+ total_confidence += token.confidence
303
+ word_count += 1
304
+
305
+ avg_confidence = total_confidence / word_count if word_count > 0 else 0.0
306
+
307
+ else:
308
+ # Simple transcription
309
+ transcription = self.model_manager.transcribe(
310
+ audio_data=processed_audio,
311
+ model=self.current_model
312
+ )
313
+ avg_confidence = 0.8 # Estimated confidence
314
+
315
+ # Calculate processing time
316
+ processing_time = time.time() - start_time
317
+ audio_duration = len(audio_data) / sample_rate
318
+
319
+ # Create info strings
320
+ confidence_info = f"Confidence: {avg_confidence:.2f}" if return_confidence else ""
321
+ processing_info = (
322
+ f"Duration: {audio_duration:.1f}s | "
323
+ f"Time: {processing_time:.1f}s | "
324
+ f"Model: {self.model_info.get('model_name', 'unknown')}"
325
+ )
326
+
327
+ logger.info(f"βœ… Transcription completed in {processing_time:.1f}s")
328
+
329
+ return transcription.strip(), confidence_info, processing_info
330
+
331
+ except Exception as e:
332
+ error_msg = f"❌ Coqui STT transcription failed: {str(e)}"
333
+ logger.error(error_msg)
334
+ return error_msg, "", ""
335
+
336
+ def get_supported_languages(self) -> List[str]:
337
+ """Get list of supported languages."""
338
+ return [
339
+ "en", # English
340
+ "de", # German
341
+ "fr", # French
342
+ "es", # Spanish
343
+ ]
344
+
345
+ def get_model_info(self) -> Dict[str, Any]:
346
+ """Get information about the currently loaded model."""
347
+ if self.current_model is None:
348
+ return {"error": "No model loaded"}
349
+
350
+ info = self.model_info.copy()
351
+ info.update({
352
+ "name": "Coqui STT with Model Manager",
353
+ "is_loaded": self.current_model is not None,
354
+ "supported_languages": self.get_supported_languages(),
355
+ "architecture": "DeepSpeech-based CTC",
356
+ "provider": "Coqui AI"
357
+ })
358
+
359
+ return info
360
+
361
+ def get_available_models(self) -> List[Dict[str, Any]]:
362
+ """Get list of available models."""
363
+ models = []
364
+ for name, info in self.available_models.items():
365
+ model_info = {
366
+ "name": name,
367
+ "language": info["language"],
368
+ "description": info["description"],
369
+ "model_id": info["model_id"]
370
+ }
371
+ models.append(model_info)
372
+
373
+ return models
374
+
375
+ def cleanup(self):
376
+ """Clean up resources."""
377
+ if self.current_model is not None:
378
+ # Model manager handles cleanup automatically
379
+ self.current_model = None
380
+
381
+ if self.model_manager is not None:
382
+ self.model_manager = None
383
+
384
+ self.model_info = {}
385
+
386
+ logger.info("Coqui STT cleanup completed")
387
+
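+ # Note: BaseSTT declares transcribe_audio() as abstract and CoquiSTT only
+ # defines transcribe(), so CoquiSTT() raises TypeError until an override
+ # exists. A minimal delegating shim (a sketch; assumes file-path input and
+ # needs STTResult imported from .stt_base):
+ #
+ # def transcribe_audio(self, audio_data, sample_rate=None):
+ #     text, _, _ = self.transcribe(str(audio_data))
+ #     return STTResult(text)
+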
388
+
389
+ # Export the class
390
+ __all__ = ["CoquiSTT", "COQUI_STT_AVAILABLE"]
stt/example_custom_stt.py ADDED
@@ -0,0 +1,288 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Example: Adding a Custom STT Model
4
+
5
+ This file demonstrates how to add a new STT model to the modular voice transcriber.
6
+ Follow this pattern to integrate any speech-to-text service.
7
+
8
+ Usage:
9
+ 1. Create your STT class following the BaseSTT interface
10
+ 2. Add it to the STT_MODELS registry in gradio_voice_transcriber_clean.py
11
+ 3. Update ModelManager.get_model_options() if needed
12
+ """
13
+
14
+ from typing import Union, Optional
15
+ import numpy as np
16
+ from pathlib import Path
17
+ import time
18
+ import random
19
+
20
+ from stt.stt_base import BaseSTT, STTResult
21
+
22
+
23
+ class ExampleCustomSTT(BaseSTT):
24
+ """
25
+ Example custom STT implementation.
26
+ This shows how to create a new STT model following the BaseSTT interface.
27
+
28
+ Replace this with actual integration to your preferred STT service:
29
+ - Azure Speech Service
30
+ - Google Cloud Speech-to-Text
31
+ - Amazon Transcribe
32
+ - IBM Watson Speech to Text
33
+ - AssemblyAI
34
+ - Rev.ai
35
+ - Or any other service
36
+ """
37
+
38
+ model_name = "ExampleCustomSTT"
39
+ model = None
40
+ is_loaded = False
41
+ config = {
42
+ "api_key": None,
43
+ "region": "us-east-1",
44
+ "language": "en-US",
45
+ "sample_rate": 16000
46
+ }
47
+
48
+ @classmethod
49
+ def load_model(cls, api_key: str = "", region: str = "us-east-1", **kwargs) -> None:
50
+ """
51
+ Load/initialize the custom STT service.
52
+
53
+ Args:
54
+ api_key: API key for the service
55
+ region: Service region
56
+ **kwargs: Additional configuration parameters
57
+ """
58
+ if not api_key:
59
+ raise ValueError("API key required for ExampleCustomSTT")
60
+
61
+ # Update configuration
62
+ cls.config.update({
63
+ "api_key": api_key,
64
+ "region": region,
65
+ **kwargs
66
+ })
67
+
68
+ # Initialize your STT service here
69
+ # Example:
70
+ # cls.model = YourSTTClient(
71
+ # api_key=api_key,
72
+ # region=region
73
+ # )
74
+
75
+ # For demonstration, just simulate initialization
76
+ print(f"Initializing ExampleCustomSTT with region {region}")
77
+ time.sleep(1) # Simulate initialization time
78
+
79
+ cls.model = f"custom_stt_client_{region}"
80
+ cls.is_loaded = True
81
+
82
+ print(f"βœ… {cls.model_name} loaded successfully")
83
+
84
+ @classmethod
85
+ def transcribe_audio(cls,
86
+ audio_data: Union[np.ndarray, str, Path],
87
+ sample_rate: Optional[int] = None) -> STTResult:
88
+ """
89
+ Transcribe audio using the custom STT service.
90
+
91
+ Args:
92
+ audio_data: Audio input (numpy array or file path)
93
+ sample_rate: Sample rate for numpy arrays
94
+
95
+ Returns:
96
+ STTResult: Transcription result with metadata
97
+ """
98
+ if not cls.is_loaded:
99
+ raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")
100
+
101
+ start_time = time.time()
102
+
103
+ # Handle different input types
104
+ if isinstance(audio_data, np.ndarray):
105
+ # For numpy arrays, you might need to:
106
+ # 1. Save to temporary file
107
+ # 2. Upload to service
108
+ # 3. Get transcription result
109
+
110
+ duration = len(audio_data) / (sample_rate or 16000)
111
+ print(f"Transcribing numpy array: {duration:.2f}s")
112
+
113
+ # Simulate API call
114
+ time.sleep(0.5 + duration * 0.1) # Simulate processing time
115
+
116
+ # Example transcription (replace with actual API call)
117
+ transcription = f"[Custom STT transcription of {duration:.1f}s audio]"
118
+ confidence = random.uniform(0.85, 0.98) # Simulate confidence
119
+
120
+ else:
121
+ # Handle file path
122
+ file_path = Path(audio_data)
123
+ print(f"Transcribing file: {file_path.name}")
124
+
125
+ # Simulate file upload and transcription
126
+ time.sleep(1.0)
127
+
128
+ transcription = f"[Custom STT transcription of {file_path.name}]"
129
+ confidence = random.uniform(0.80, 0.95)
130
+
131
+ processing_time = time.time() - start_time
132
+
133
+ # Prepare metadata
134
+ metadata = {
135
+ "model": cls.model_name,
136
+ "region": cls.config["region"],
137
+ "language": cls.config.get("language", "en-US"),
138
+ "api_used": True,
139
+ "service": "example-custom-service"
140
+ }
141
+
142
+ return STTResult(
143
+ text=transcription,
144
+ confidence=confidence,
145
+ processing_time=processing_time,
146
+ metadata=metadata
147
+ )
148
+
149
+ @classmethod
150
+ def set_language(cls, language: Optional[str]) -> None:
151
+ """Set the transcription language."""
152
+ if language:
153
+ cls.config["language"] = language
154
+ print(f"Language set to: {language}")
155
+
156
+ @classmethod
157
+ def get_supported_languages(cls) -> list:
158
+ """Get list of supported languages."""
159
+ return [
160
+ "en-US", "en-GB", "es-ES", "fr-FR", "de-DE",
161
+ "it-IT", "pt-BR", "ja-JP", "ko-KR", "zh-CN"
162
+ ]
163
+
164
+
165
+ # Example of how to integrate into the main application:
166
+ def integrate_custom_stt():
167
+ """
168
+ This function shows how to add the custom STT to the main application.
169
+
170
+ Add this to gradio_voice_transcriber_clean.py:
171
+ """
172
+
173
+ # 1. Import your custom STT class
174
+ from stt.example_custom_stt import ExampleCustomSTT
175
+
176
+ # 2. Add to STT_MODELS registry
177
+ STT_MODELS = {
178
+ "WhisperSTT": WhisperSTT,
179
+ "ExampleCustomSTT": ExampleCustomSTT, # Add this line
180
+ }
181
+
182
+ # 3. Update ModelManager.get_model_options() to include custom options
183
+ def get_model_options(model_name: str):
184
+ if model_name == "ExampleCustomSTT":
185
+ return {
186
+ "model_sizes": ["default"], # No size options for this service
187
+ "supports_api": True,
188
+ "languages": [
189
+ ("Auto-detect", "auto"),
190
+ ("English (US)", "en-US"),
191
+ ("English (UK)", "en-GB"),
192
+ ("Spanish", "es-ES"),
193
+ ("French", "fr-FR"),
194
+ ("German", "de-DE"),
195
+ ],
196
+ "custom_fields": [
197
+ {"name": "api_key", "type": "password", "label": "API Key", "required": True},
198
+ {"name": "region", "type": "dropdown", "label": "Region",
199
+ "choices": ["us-east-1", "us-west-2", "eu-west-1"], "default": "us-east-1"}
200
+ ]
201
+ }
202
+ # ... existing code for other models
203
+
204
+ # 4. Update the load_model function to handle custom parameters
205
+ def load_model(model_name: str, **kwargs):
206
+ if model_name == "ExampleCustomSTT":
207
+ api_key = kwargs.get("api_key", "")
208
+ region = kwargs.get("region", "us-east-1")
209
+
210
+ if not api_key:
211
+ return "❌ API key required for ExampleCustomSTT"
212
+
213
+ ExampleCustomSTT.load_model(api_key=api_key, region=region)
214
+ # ... rest of loading logic
215
+
216
+
217
+ # Real-world integration examples:
218
+
219
+ class AzureSTT(BaseSTT):
220
+ """Example Azure Speech Service integration."""
221
+
222
+ model_name = "AzureSTT"
223
+ model = None
224
+ is_loaded = False
225
+
226
+ @classmethod
227
+ def load_model(cls, subscription_key: str, region: str, **kwargs):
228
+ """Initialize Azure Speech SDK."""
229
+ try:
230
+ import azure.cognitiveservices.speech as speechsdk
231
+
232
+ speech_config = speechsdk.SpeechConfig(
233
+ subscription=subscription_key,
234
+ region=region
235
+ )
236
+ cls.model = speech_config
237
+ cls.is_loaded = True
238
+ except ImportError:
239
+ raise ImportError("Install Azure Speech SDK: pip install azure-cognitiveservices-speech")
240
+
241
+ @classmethod
242
+ def transcribe_audio(cls, audio_data, sample_rate=None):
243
+ """Transcribe using Azure Speech Service."""
244
+ # Implement Azure-specific transcription logic
245
+ pass
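+ # A hedged sketch for file input (assumes the Azure Speech SDK; error
+ # handling omitted):
+ #
+ # import azure.cognitiveservices.speech as speechsdk
+ # audio_config = speechsdk.audio.AudioConfig(filename=str(audio_data))
+ # recognizer = speechsdk.SpeechRecognizer(speech_config=cls.model,
+ #                                         audio_config=audio_config)
+ # return STTResult(text=recognizer.recognize_once().text)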
246
+
247
+
248
+ class GoogleSTT(BaseSTT):
249
+ """Example Google Cloud Speech-to-Text integration."""
250
+
251
+ model_name = "GoogleSTT"
252
+ model = None
253
+ is_loaded = False
254
+
255
+ @classmethod
256
+ def load_model(cls, credentials_path: str, **kwargs):
257
+ """Initialize Google Cloud Speech client."""
258
+ try:
259
+ from google.cloud import speech
260
+ import os
261
+
262
+ os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
263
+ cls.model = speech.SpeechClient()
264
+ cls.is_loaded = True
265
+ except ImportError:
266
+ raise ImportError("Install Google Cloud Speech: pip install google-cloud-speech")
267
+
268
+ @classmethod
269
+ def transcribe_audio(cls, audio_data, sample_rate=None):
270
+ """Transcribe using Google Cloud Speech."""
271
+ # Implement Google-specific transcription logic
272
+ pass
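+ # stt/chirp3_stt.py in this repo shows a complete google-cloud-speech
+ # recognize() flow that could be adapted here.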
273
+
274
+
275
+ if __name__ == "__main__":
276
+ # Test the example custom STT
277
+ print("Testing ExampleCustomSTT...")
278
+
279
+ # Load model
280
+ ExampleCustomSTT.load_model(api_key="test-api-key", region="us-east-1")
281
+
282
+ # Test transcription
283
+ dummy_audio = np.random.randn(16000).astype(np.float32) # 1 second
284
+ result = ExampleCustomSTT.transcribe_audio(dummy_audio, 16000)
285
+
286
+ print(f"Result: {result}")
287
+ print(f"Metadata: {result.metadata}")
288
+ print("βœ… Custom STT integration test completed!")
stt/hubert_arabic_stt.py ADDED
@@ -0,0 +1,568 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ HuBERT Arabic Egyptian STT Implementation
4
+
5
+ Hugging Face HuBERT speech-to-text implementation for Arabic Egyptian dialect
6
+ using the omarxadel/hubert-large-arabic-egyptian model.
7
+
8
+ Usage:
9
+ from stt.hubert_arabic_stt import HuBERTArabicSTT
10
+
11
+ # Load model
12
+ HuBERTArabicSTT.load_model()
13
+
14
+ # Transcribe audio
15
+ result = HuBERTArabicSTT.transcribe_audio(audio_array, 16000)
16
+ print(result.text)
17
+ """
18
+
19
+ from typing import Union, Optional, Dict, Any
20
+ import numpy as np
21
+ from pathlib import Path
22
+ import time
23
+ import logging
24
+ import warnings
25
+
26
+ # Suppress warnings for cleaner output
27
+ warnings.filterwarnings("ignore")
28
+
29
+ try:
30
+ import torch
31
+ import torchaudio
32
+ from transformers import (
33
+ HubertForCTC,
34
+ Wav2Vec2Processor,
35
+ Wav2Vec2Tokenizer,
36
+ AutoProcessor,
37
+ AutoModelForCTC
38
+ )
39
+ TRANSFORMERS_AVAILABLE = True
40
+ except ImportError:
41
+ TRANSFORMERS_AVAILABLE = False
42
+
43
+ try:
44
+ import librosa
45
+ LIBROSA_AVAILABLE = True
46
+ except ImportError:
47
+ LIBROSA_AVAILABLE = False
48
+
49
+ from .stt_base import BaseSTT, STTResult
50
+
51
+ logger = logging.getLogger(__name__)
52
+
53
+
54
+ class HuBERTArabicSTT(BaseSTT):
96
+ """
97
+ HuBERT Arabic Egyptian STT implementation using Hugging Face transformers.
98
+
99
+ Supports:
100
+ - Arabic Egyptian dialect transcription
101
+ - Local model execution (no API required)
102
+ - Automatic audio preprocessing
103
+ - Confidence estimation
104
+ - Chunked processing for long audio
105
+ """
106
+
107
+ model_name = "HuBERTArabicSTT"
108
+ model = None
109
+ processor = None
110
+ tokenizer = None
111
+ is_loaded = False
112
+ config = {
113
+ "model_id": "omarxadel/hubert-large-arabic-egyptian",
114
+ "fallback_models": [
115
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
116
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
117
+ "facebook/wav2vec2-large-xlsr-53",
118
+ ],
119
+ "device": "auto", # auto, cpu, cuda
120
+ "chunk_length": 15, # seconds, for long audio processing
121
+ "sample_rate": 16000,
122
+ "return_confidence": True,
123
+ "language": "ar-EG", # Arabic Egyptian
124
+ "hf_token": None, # Hugging Face token for private models
125
+ "use_auth_token": True # Try to use cached token
126
+ }
127
+
128
+ @classmethod
129
+ def load_model(cls,
130
+ model_id: str = None,
131
+ device: str = "auto",
132
+ hf_token: str = None,
133
+ **kwargs) -> None:
134
+ """
135
+ Load the HuBERT Arabic model.
136
+
137
+ Args:
138
+ model_id: Hugging Face model ID (default: omarxadel/hubert-large-arabic-egyptian)
139
+ device: Device to use (auto, cpu, cuda)
140
+ hf_token: Hugging Face token for private models (optional)
141
+ **kwargs: Additional configuration parameters
142
+ """
143
+ if not TRANSFORMERS_AVAILABLE:
144
+ raise ImportError(
145
+ "Transformers library required. Install with: "
146
+ "pip install transformers torch torchaudio"
147
+ )
148
+
149
+ # Update configuration
150
+ cls.config.update({
151
+ "model_id": model_id or cls.config["model_id"],
152
+ "device": device,
153
+ "hf_token": hf_token,
154
+ **kwargs
155
+ })
156
+
157
+ # Determine device
158
+ if device == "auto":
159
+ if torch.cuda.is_available():
160
+ device = "cuda"
161
+ elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
162
+ device = "mps" # Apple Silicon
163
+ else:
164
+ device = "cpu"
165
+
166
+ cls.config["device"] = device
167
+
168
+ # Try to load the model, with fallbacks
169
+ models_to_try = [cls.config["model_id"]] + cls.config["fallback_models"]
170
+
171
+ for model_id_to_try in models_to_try:
172
+ logger.info(f"Attempting to load HuBERT model: {model_id_to_try}")
173
+
174
+ try:
175
+ success = cls._load_model_with_id(model_id_to_try, device, hf_token)
176
+ if success:
177
+ cls.config["model_id"] = model_id_to_try # Update to successful model
178
+ return
179
+ except Exception as e:
180
+ logger.warning(f"Failed to load {model_id_to_try}: {e}")
181
+ continue
182
+
183
+ # If all models failed
184
+ raise RuntimeError(f"Failed to load any HuBERT model. Tried: {models_to_try}")
185
+
186
+ @classmethod
187
+ def _load_model_with_id(cls, model_id: str, device: str, hf_token: str = None) -> bool:
188
+ """
189
+ Load a specific model ID with authentication handling.
190
+
191
+ Returns:
192
+ bool: True if successful, False otherwise
193
+ """
194
+ logger.info(f"Loading HuBERT model: {model_id}")
195
+ logger.info(f"Using device: {device}")
196
+
197
+ start_time = time.time()
198
+
199
+ # Prepare authentication
200
+ auth_kwargs = {}
201
+ if hf_token:
202
+ auth_kwargs["token"] = hf_token
203
+ elif cls.config.get("use_auth_token", True):
204
+ auth_kwargs["use_auth_token"] = True
205
+
206
+ try:
207
+ # Try to load as HuBERT model first
208
+ if "hubert" in model_id.lower():
209
+ logger.info("Loading as HuBERT model...")
210
+ cls.processor = AutoProcessor.from_pretrained(model_id, **auth_kwargs)
211
+ cls.model = AutoModelForCTC.from_pretrained(model_id, **auth_kwargs)
212
+ else:
213
+ # Fallback to Wav2Vec2 for other models
214
+ logger.info("Loading as Wav2Vec2 model...")
215
+ cls.processor = Wav2Vec2Processor.from_pretrained(model_id, **auth_kwargs)
216
+ cls.model = AutoModelForCTC.from_pretrained(model_id, **auth_kwargs)
217
+
218
+ # Move model to device
219
+ cls.model = cls.model.to(device)
220
+ cls.model.eval() # Set to evaluation mode
221
+
222
+ # Load tokenizer for confidence calculation
223
+ try:
224
+ cls.tokenizer = Wav2Vec2Tokenizer.from_pretrained(model_id, **auth_kwargs)
225
+ except Exception as e:
226
+ logger.warning(f"Could not load tokenizer: {e}")
227
+ cls.tokenizer = None
228
+
229
+ cls.is_loaded = True
230
+ load_time = time.time() - start_time
231
+
232
+ logger.info(f"βœ… HuBERT model loaded successfully in {load_time:.2f}s")
233
+ logger.info(f"Model vocab size: {cls.model.config.vocab_size}")
234
+
235
+ return True
236
+
237
+ except Exception as e:
238
+ logger.error(f"Failed to load model {model_id}: {e}")
239
+ return False
240
+
241
+ @classmethod
242
+ def transcribe_audio(cls,
243
+ audio_data: Union[np.ndarray, str, Path],
244
+ sample_rate: Optional[int] = None) -> STTResult:
245
+ """
246
+ Transcribe audio using HuBERT Arabic model.
247
+
248
+ Args:
249
+ audio_data: Audio input (numpy array or file path)
250
+ sample_rate: Sample rate for numpy arrays
251
+
252
+ Returns:
253
+ STTResult: Transcription with confidence and metadata
254
+ """
255
+ if not cls.is_loaded:
256
+ raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")
257
+
258
+ start_time = time.time()
259
+
260
+ try:
261
+ # Process input audio
262
+ processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)
263
+
264
+ # Check audio length
265
+ duration = len(processed_audio) / actual_sr
266
+ if duration < 0.1:
267
+ return STTResult(
268
+ text="",
269
+ confidence=0.0,
270
+ processing_time=time.time() - start_time,
271
+ metadata={"error": "Audio too short", "duration": duration}
272
+ )
273
+
274
+ # Process with model
275
+ if duration > cls.config.get("chunk_length", 15):
276
+ # Handle long audio by chunking
277
+ text, confidence = cls._transcribe_long_audio(processed_audio, actual_sr)
278
+ else:
279
+ # Process short audio directly
280
+ text, confidence = cls._transcribe_chunk(processed_audio, actual_sr)
281
+
282
+ processing_time = time.time() - start_time
283
+
284
+ # Prepare metadata
285
+ metadata = {
286
+ "model": cls.config["model_id"],
287
+ "model_type": "HuBERT" if "hubert" in cls.config["model_id"].lower() else "Wav2Vec2",
288
+ "device": cls.config["device"],
289
+ "language": "ar-EG",
290
+ "duration": duration,
291
+ "sample_rate": actual_sr,
292
+ "chunks_processed": 1 if duration <= cls.config.get("chunk_length", 15) else int(duration / cls.config["chunk_length"]) + 1
293
+ }
294
+
295
+ return STTResult(
296
+ text=text.strip(),
297
+ confidence=confidence,
298
+ processing_time=processing_time,
299
+ metadata=metadata
300
+ )
301
+
302
+ except Exception as e:
303
+ error_msg = f"Transcription failed: {str(e)}"
304
+ logger.error(error_msg)
305
+ return STTResult(
306
+ text="",
307
+ confidence=0.0,
308
+ processing_time=time.time() - start_time,
309
+ metadata={"error": error_msg}
310
+ )
311
+
312
+ @classmethod
313
+ def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
314
+ """Process and validate audio input."""
315
+ if isinstance(audio_data, (str, Path)):
316
+ # Load audio file
317
+ audio_path = Path(audio_data)
318
+ if not audio_path.exists():
319
+ raise FileNotFoundError(f"Audio file not found: {audio_path}")
320
+
321
+ if LIBROSA_AVAILABLE:
322
+ audio_array, sr = librosa.load(str(audio_path), sr=cls.config["sample_rate"])
323
+ else:
324
+ # Fallback to torchaudio
325
+ audio_tensor, sr = torchaudio.load(str(audio_path))
326
+ audio_array = audio_tensor.numpy().flatten()
327
+
328
+ # Resample if needed
329
+ if sr != cls.config["sample_rate"]:
330
+ resampler = torchaudio.transforms.Resample(sr, cls.config["sample_rate"])
331
+ audio_tensor = resampler(audio_tensor)
332
+ audio_array = audio_tensor.numpy().flatten()
333
+ sr = cls.config["sample_rate"]
334
+
335
+ else:
336
+ # Handle numpy array
337
+ audio_array = audio_data.astype(np.float32)
338
+ sr = sample_rate or cls.config["sample_rate"]
339
+
340
+ # Resample if needed
341
+ if sr != cls.config["sample_rate"]:
342
+ if LIBROSA_AVAILABLE:
343
+ audio_array = librosa.resample(
344
+ audio_array,
345
+ orig_sr=sr,
346
+ target_sr=cls.config["sample_rate"]
347
+ )
348
+ else:
349
+ # Simple resampling fallback
350
+ if sr > cls.config["sample_rate"]:
351
+ step = sr // cls.config["sample_rate"]
352
+ audio_array = audio_array[::step]
353
+ else:
354
+ repeat = cls.config["sample_rate"] // sr
355
+ audio_array = np.repeat(audio_array, repeat)
356
+
357
+ sr = cls.config["sample_rate"]
358
+
359
+ # Normalize audio
360
+ if len(audio_array) > 0:
361
+ # Convert to mono if stereo
362
+ if audio_array.ndim > 1:
363
+ audio_array = np.mean(audio_array, axis=0)
364
+
365
+ # Normalize to [-1, 1]
366
+ max_val = np.max(np.abs(audio_array))
367
+ if max_val > 0:
368
+ audio_array = audio_array / max_val
369
+
370
+ return audio_array, sr
371
+
372
+ @classmethod
373
+ def _transcribe_chunk(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
374
+ """Transcribe a single audio chunk."""
375
+ # Preprocess audio
376
+ input_values = cls.processor(
377
+ audio_array,
378
+ sampling_rate=sample_rate,
379
+ return_tensors="pt",
380
+ padding=True
381
+ )
382
+
383
+ # Move to device
384
+ input_values = {k: v.to(cls.config["device"]) for k, v in input_values.items()}
385
+
386
+ # Inference
387
+ with torch.no_grad():
388
+ logits = cls.model(**input_values).logits
389
+
390
+ # Get predicted tokens
391
+ predicted_ids = torch.argmax(logits, dim=-1)
392
+
393
+ # Decode transcription
394
+ transcription = cls.processor.batch_decode(predicted_ids)[0]
395
+
396
+ # Calculate confidence (average of max probabilities)
397
+ confidence = cls._calculate_confidence(logits)
398
+
399
+ return transcription, confidence
400
+
401
+ @classmethod
402
+ def _transcribe_long_audio(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
403
+ """Transcribe long audio by chunking."""
404
+ chunk_length = cls.config.get("chunk_length", 15)
405
+ chunk_samples = int(chunk_length * sample_rate)
406
+ overlap_samples = int(1.0 * sample_rate) # 1 second overlap
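+ # The effective stride is chunk_samples - overlap_samples: with the default
+ # 15 s chunks and 1 s overlap, each window starts 14 s after the previous one.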
407
+
408
+ transcriptions = []
409
+ confidences = []
410
+
411
+ for start in range(0, len(audio_array), chunk_samples - overlap_samples):
412
+ end = min(start + chunk_samples, len(audio_array))
413
+ chunk = audio_array[start:end]
414
+
415
+ if len(chunk) < 0.5 * sample_rate: # Skip very short chunks
416
+ continue
417
+
418
+ try:
419
+ chunk_text, chunk_confidence = cls._transcribe_chunk(chunk, sample_rate)
420
+ if chunk_text.strip():
421
+ transcriptions.append(chunk_text.strip())
422
+ confidences.append(chunk_confidence)
423
+ except Exception as e:
424
+ logger.warning(f"Failed to transcribe chunk: {e}")
425
+ continue
426
+
427
+ # Combine results
428
+ full_text = " ".join(transcriptions)
429
+ avg_confidence = np.mean(confidences) if confidences else 0.0
430
+
431
+ return full_text, avg_confidence
432
+
433
+ @classmethod
434
+ def _calculate_confidence(cls, logits: torch.Tensor) -> float:
435
+ """Calculate confidence score from model logits."""
436
+ try:
437
+ # Apply softmax to get probabilities
438
+ probabilities = torch.softmax(logits, dim=-1)
439
+
440
+ # Get maximum probability for each time step
441
+ max_probs = torch.max(probabilities, dim=-1)[0]
442
+
443
+ # Average over time steps (excluding padding if any)
444
+ confidence = torch.mean(max_probs).item()
445
+
446
+ return confidence
447
+
448
+ except Exception as e:
449
+ logger.warning(f"Could not calculate confidence: {e}")
450
+ return 0.5 # Default confidence
451
+
452
+ @classmethod
453
+ def get_available_models(cls) -> Dict[str, Any]:
454
+ """Get information about available HuBERT models."""
455
+ models_info = {
456
+ "transformers_available": TRANSFORMERS_AVAILABLE,
457
+ "librosa_available": LIBROSA_AVAILABLE,
458
+ "torch_available": True if TRANSFORMERS_AVAILABLE else False,
459
+ }
460
+
461
+ if TRANSFORMERS_AVAILABLE:
462
+ models_info.update({
463
+ "cuda_available": torch.cuda.is_available(),
464
+ "mps_available": hasattr(torch.backends, 'mps') and torch.backends.mps.is_available(),
465
+ "hubert_models": [
466
+ {
467
+ "id": "omarxadel/hubert-large-arabic-egyptian",
468
+ "name": "HuBERT Arabic Egyptian (Large)",
469
+ "language": "Arabic Egyptian Dialect",
470
+ "size": "1.3GB",
471
+ "type": "HuBERT"
472
+ }
473
+ ],
474
+ "fallback_models": [
475
+ {
476
+ "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
477
+ "name": "Wav2Vec2 Arabic Egyptian",
478
+ "language": "Arabic Egyptian",
479
+ "size": "1.2GB",
480
+ "type": "Wav2Vec2"
481
+ },
482
+ {
483
+ "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
484
+ "name": "Wav2Vec2 Arabic Standard",
485
+ "language": "Arabic Standard",
486
+ "size": "1.2GB",
487
+ "type": "Wav2Vec2"
488
+ },
489
+ {
490
+ "id": "facebook/wav2vec2-large-xlsr-53",
491
+ "name": "Wav2Vec2 Multilingual",
492
+ "language": "Multilingual",
493
+ "size": "1.2GB",
494
+ "type": "Wav2Vec2"
495
+ }
496
+ ]
497
+ })
498
+
499
+ return models_info
500
+
501
+ @classmethod
502
+ def set_language(cls, language: Optional[str]) -> None:
503
+ """Set language (for compatibility - this model is Arabic-specific)."""
504
+ if language and not language.startswith("ar"):
505
+ logger.warning(f"This model is optimized for Arabic. Language '{language}' may not work well.")
506
+
507
+ cls.config["language"] = language or "ar-EG"
508
+ logger.info(f"Language set to: {cls.config['language']}")
509
+
510
+ @classmethod
511
+ def set_device(cls, device: str) -> None:
512
+ """Change device for model inference."""
513
+ if cls.model is not None:
514
+ cls.model = cls.model.to(device)
515
+ cls.config["device"] = device
516
+ logger.info(f"Model moved to device: {device}")
517
+
518
+ @classmethod
519
+ def get_model_info(cls) -> Dict[str, Any]:
520
+ """Get detailed model information."""
521
+ base_info = super().get_model_info()
522
+
523
+ if cls.is_loaded:
524
+ base_info.update({
525
+ "model_id": cls.config["model_id"],
526
+ "model_type": "HuBERT" if "hubert" in cls.config["model_id"].lower() else "Wav2Vec2",
527
+ "device": cls.config["device"],
528
+ "language": cls.config["language"],
529
+ "sample_rate": cls.config["sample_rate"],
530
+ "vocab_size": cls.model.config.vocab_size if cls.model else None,
531
+ "chunk_length": cls.config["chunk_length"],
532
+ })
533
+
534
+ return base_info
535
+
536
+
537
+ # Example usage and testing
538
+ if __name__ == "__main__":
539
+ print("Testing HuBERT Arabic STT implementation...")
540
+
541
+ # Check availability
542
+ models_info = HuBERTArabicSTT.get_available_models()
543
+ print(f"Available models info: {models_info}")
544
+
545
+ if models_info["transformers_available"]:
546
+ try:
547
+ print("Loading HuBERT Arabic model...")
548
+ HuBERTArabicSTT.load_model(device="cpu") # Use CPU for testing
549
+
550
+ print("Creating test audio...")
551
+ # Generate test audio (2 seconds of random noise)
552
+ test_audio = np.random.randn(32000).astype(np.float32) * 0.1
553
+
554
+ print("Testing transcription...")
555
+ result = HuBERTArabicSTT.transcribe_audio(test_audio, 16000)
556
+ print(f"Result: {result}")
557
+ print(f"Metadata: {result.metadata}")
558
+
559
+ except Exception as e:
560
+ print(f"Error: {e}")
561
+ print("Note: This is expected with random audio - the model expects Arabic speech")
562
+
563
+ else:
564
+ print("Transformers not installed - install with:")
565
+ print("pip install transformers torch torchaudio")
566
+ print("Optional: pip install librosa (for better audio processing)")
567
+
568
+ print("\nHuBERT Arabic STT implementation ready!")
stt/stt_base.py ADDED
@@ -0,0 +1,251 @@
#!/usr/bin/env python3
"""
Base Speech-to-Text (STT) Static Class

This module provides an abstract base class for implementing different STT models using static methods.
All STT implementations should inherit from this class and implement the required static methods.

Usage:
    from stt_base import BaseSTT

    class MySTTModel(BaseSTT):
        model = None  # Class variable to hold the model

        @classmethod
        def load_model(cls):
            # Load your specific model
            cls.model = your_model_loader()

        @classmethod
        def transcribe_audio(cls, audio_data, sample_rate):
            # Implement transcription logic
            return STTResult("transcribed text")
"""

from abc import ABC, abstractmethod
from typing import Union, Optional, Dict, Any, ClassVar
import numpy as np
from pathlib import Path
import time
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class STTResult:
    """Container for STT transcription results with metadata."""

    def __init__(self,
                 text: str,
                 confidence: Optional[float] = None,
                 processing_time: Optional[float] = None,
                 metadata: Optional[Dict[str, Any]] = None):
        self.text = text
        self.confidence = confidence
        self.processing_time = processing_time
        self.metadata = metadata or {}

    def __str__(self) -> str:
        return self.text

    def __repr__(self) -> str:
        return f"STTResult(text='{self.text}', confidence={self.confidence}, time={self.processing_time}s)"


class BaseSTT(ABC):
    """
    Abstract base class for Speech-to-Text models using static methods.

    All STT implementations must inherit from this class and implement:
    - load_model(): Load and initialize the STT model (classmethod)
    - transcribe_audio(): Convert audio to text (classmethod)
    """

    # Class variables that subclasses should define
    model_name: ClassVar[str] = "BaseSTT"
    model: ClassVar[Any] = None
    is_loaded: ClassVar[bool] = False
    config: ClassVar[Dict[str, Any]] = {}

    @classmethod
    @abstractmethod
    def load_model(cls) -> None:
        """
        Load and initialize the STT model.

        This method must be implemented by subclasses to load their specific model.
        After successful loading, set cls.is_loaded = True
        """
        pass

    @classmethod
    @abstractmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio data to text.

        Args:
            audio_data: Audio input - can be:
                - numpy array of audio samples
                - path to audio file (str or Path)
            sample_rate: Sample rate of audio data (required for numpy arrays)

        Returns:
            STTResult: Object containing transcribed text and metadata

        This method must be implemented by subclasses.
        """
        pass

    @classmethod
    def transcribe_file(cls, file_path: Union[str, Path]) -> STTResult:
        """
        Transcribe an audio file to text.

        Args:
            file_path: Path to the audio file

        Returns:
            STTResult: Transcription result
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} model not loaded. Call load_model() first.")

        file_path = Path(file_path)
        if not file_path.exists():
            raise FileNotFoundError(f"Audio file not found: {file_path}")

        logger.info(f"Transcribing file: {file_path}")
        start_time = time.time()

        result = cls.transcribe_audio(file_path)

        if result.processing_time is None:
            result.processing_time = time.time() - start_time

        logger.info(f"Transcription completed in {result.processing_time:.2f}s")
        return result

    @classmethod
    def transcribe_numpy(cls,
                         audio_array: np.ndarray,
                         sample_rate: int) -> STTResult:
        """
        Transcribe a numpy array of audio samples to text.

        Args:
            audio_array: Audio samples as numpy array
            sample_rate: Sample rate of the audio

        Returns:
            STTResult: Transcription result
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} model not loaded. Call load_model() first.")

        if not isinstance(audio_array, np.ndarray):
            raise TypeError("audio_array must be a numpy array")

        logger.info(f"Transcribing numpy array: shape={audio_array.shape}, sr={sample_rate}")
        start_time = time.time()

        result = cls.transcribe_audio(audio_array, sample_rate)

        if result.processing_time is None:
            result.processing_time = time.time() - start_time

        logger.info(f"Transcription completed in {result.processing_time:.2f}s")
        return result

    @classmethod
    def get_model_info(cls) -> Dict[str, Any]:
        """
        Get information about the loaded model.

        Returns:
            Dict containing model information
        """
        return {
            "model_name": cls.model_name,
            "is_loaded": cls.is_loaded,
            "config": cls.config
        }

    @classmethod
    def ensure_loaded(cls) -> None:
        """Ensure the model is loaded, load it if not."""
        if not cls.is_loaded:
            cls.load_model()

    @classmethod
    def get_status(cls) -> str:
        """Get a string representation of the model status."""
        status = "loaded" if cls.is_loaded else "not loaded"
        return f"{cls.model_name} STT Model ({status})"


class DummySTT(BaseSTT):
    """
    Dummy STT implementation for testing the static class interface.
    Returns placeholder text instead of actual transcription.
    """

    model_name = "DummySTT"
    model = None
    is_loaded = False
    config = {}

    @classmethod
    def load_model(cls) -> None:
        """Load the dummy model (just a placeholder)."""
        logger.info("Loading dummy STT model...")
        time.sleep(0.5)  # Simulate loading time
        cls.model = "dummy_model_loaded"
        cls.is_loaded = True
        logger.info("Dummy STT model loaded successfully")

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Dummy transcription - returns placeholder text.
        """
        if isinstance(audio_data, np.ndarray):
            duration = len(audio_data) / (sample_rate or 16000)
            text = f"[Dummy transcription of {duration:.1f}s audio]"
        else:
            text = f"[Dummy transcription of file: {Path(audio_data).name}]"

        # Simulate processing time
        processing_time = 0.1 + np.random.random() * 0.2
        time.sleep(processing_time)

        return STTResult(
            text=text,
            confidence=0.95,
            processing_time=processing_time,
            metadata={"model": "dummy", "simulated": True}
        )


# Example usage and testing
if __name__ == "__main__":
    # Test the dummy implementation
    print("Testing BaseSTT with static DummySTT implementation...")

    # Load the model
    DummySTT.load_model()

    # Test with dummy numpy array
    dummy_audio = np.random.randn(16000)  # 1 second at 16kHz
    result = DummySTT.transcribe_numpy(dummy_audio, 16000)
    print(f"Numpy result: {result}")
    print(f"Model info: {DummySTT.get_model_info()}")
    print(f"Status: {DummySTT.get_status()}")

    print("\nStatic BaseSTT interface ready for real STT implementations!")
stt/tawasul_stt.py ADDED
@@ -0,0 +1,448 @@
#!/usr/bin/env python3
"""
Tawasul STT V0 Implementation

This module provides Arabic speech-to-text transcription using the Tawasul STT V0 model,
which is specifically designed for Arabic language recognition.

Tawasul STT V0 is built on Wav2Vec2 architecture and fine-tuned for Arabic speech.
"""

import os
import logging
import warnings
from pathlib import Path
from typing import Optional, Dict, Any, Tuple, List, Union
import time

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Try to import torch for type hints
try:
    import torch
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False
    # Create a dummy torch class for type hints when torch is not available
    class torch:
        class Tensor:
            pass

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class TawasulSTT:
    """
    Tawasul STT V0 static implementation for Arabic speech recognition.

    This class provides Arabic speech-to-text transcription using the Tawasul STT V0 model,
    which is specifically optimized for Arabic language variants.
    All methods are static for direct class-level access.
    """

    # Class variables for model state
    model = None
    processor = None
    tokenizer = None
    device = "cpu"
    model_id = "Kareem35/Tawasul-STT-V0"
    is_loaded = False
    hf_token = None
    chunk_length = 20  # seconds
    max_audio_length = 300  # 5 minutes max

    # Model fallback chain for better reliability
    fallback_models = [
        "Kareem35/Tawasul-STT-V0",
        "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
        "facebook/wav2vec2-large-xlsr-53",
        "facebook/wav2vec2-base-960h"
    ]

    @staticmethod
    def is_available() -> bool:
        """Check if Tawasul STT dependencies are available."""
        if not TORCH_AVAILABLE:
            logger.warning("Tawasul STT dependencies not available: torch not installed")
            return False

        try:
            import transformers
            import torchaudio
            import librosa
            import soundfile
            return True
        except ImportError as e:
            logger.warning(f"Tawasul STT dependencies not available: {e}")
            return False

    @staticmethod
    def load_model(
        model_id: Optional[str] = None,
        device: str = "auto",
        chunk_length: int = 20,
        hf_token: Optional[str] = None,
        max_audio_length: int = 300,
        **kwargs
    ) -> None:
        """
        Load the Tawasul STT V0 model.

        Args:
            model_id: Model identifier (defaults to Tawasul STT V0)
            device: Device to use ('auto', 'cpu', 'cuda', 'mps')
            chunk_length: Audio chunk length in seconds for processing
            hf_token: Hugging Face authentication token
            max_audio_length: Maximum audio length in seconds
            **kwargs: Additional model parameters
        """
        try:
            import torch
            import transformers
            from transformers import (
                Wav2Vec2ForCTC,
                Wav2Vec2Processor,
                Wav2Vec2Tokenizer
            )
            import torchaudio
            import librosa

            # Set authentication token
            if hf_token:
                TawasulSTT.hf_token = hf_token
                # Set token for transformers
                try:
                    from huggingface_hub import login
                    login(token=hf_token, add_to_git_credential=True)
                    logger.info("βœ… Authenticated with Hugging Face")
                except Exception as e:
                    logger.warning(f"HF authentication warning: {e}")

            # Determine device
            if device == "auto":
                if torch.cuda.is_available():
                    TawasulSTT.device = "cuda"
                elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
                    TawasulSTT.device = "mps"
                else:
                    TawasulSTT.device = "cpu"
            else:
                TawasulSTT.device = device

            # Set model parameters
            TawasulSTT.model_id = model_id or "Kareem35/Tawasul-STT-V0"
            TawasulSTT.chunk_length = chunk_length
            TawasulSTT.max_audio_length = max_audio_length

            # Try loading the model with fallback chain
            model_loaded = False
            last_error = None

            models_to_try = [TawasulSTT.model_id] + [m for m in TawasulSTT.fallback_models if m != TawasulSTT.model_id]

            for model_name in models_to_try:
                try:
                    logger.info(f"πŸ”„ Loading Tawasul STT model: {model_name}")

                    # Load model components
                    TawasulSTT.processor = Wav2Vec2Processor.from_pretrained(
                        model_name,
                        token=TawasulSTT.hf_token,
                        trust_remote_code=True
                    )

                    TawasulSTT.model = Wav2Vec2ForCTC.from_pretrained(
                        model_name,
                        token=TawasulSTT.hf_token,
                        trust_remote_code=True
                    )

                    # Try to load tokenizer if available
                    try:
                        TawasulSTT.tokenizer = Wav2Vec2Tokenizer.from_pretrained(
                            model_name,
                            token=TawasulSTT.hf_token
                        )
                    except Exception:
                        logger.info("Using processor instead of separate tokenizer")
                        TawasulSTT.tokenizer = TawasulSTT.processor.tokenizer

                    # Move model to device
                    TawasulSTT.model = TawasulSTT.model.to(TawasulSTT.device)
                    TawasulSTT.model.eval()

                    # Test model with dummy input
                    test_input = torch.randn(1, 16000).to(TawasulSTT.device)
                    with torch.no_grad():
                        _ = TawasulSTT.model(test_input)

                    TawasulSTT.model_id = model_name  # Update to actually loaded model
                    model_loaded = True
                    logger.info(f"βœ… Successfully loaded Tawasul STT model: {model_name} on {TawasulSTT.device}")
                    break

                except Exception as e:
                    last_error = e
                    logger.warning(f"Failed to load {model_name}: {str(e)}")
                    continue

            if not model_loaded:
                raise RuntimeError(f"Failed to load any Tawasul STT model. Last error: {last_error}")

            TawasulSTT.is_loaded = True

            # Log model info
            total_params = sum(p.numel() for p in TawasulSTT.model.parameters())
            logger.info(f"πŸ“Š Model loaded: {total_params:,} parameters on {TawasulSTT.device}")

        except Exception as e:
            error_msg = f"Failed to load Tawasul STT model: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @staticmethod
    def _preprocess_audio(audio_path: str) -> Tuple[torch.Tensor, int]:
        """
        Preprocess audio file for Tawasul STT model.

        Args:
            audio_path: Path to audio file

        Returns:
            Tuple of (audio_tensor, sample_rate)
        """
        try:
            import librosa
            import torch
            import numpy as np

            # Load audio file with proper error handling
            try:
                # Load audio at 16kHz as required by Tawasul STT
                audio, sample_rate = librosa.load(audio_path, sr=16000, mono=True)
            except Exception as load_error:
                raise RuntimeError(f"Failed to load audio file {audio_path}: {load_error}")

            # Validate audio data
            if len(audio) == 0:
                raise RuntimeError("Audio file is empty or corrupted")

            # Convert to float32 for processing
            audio = audio.astype(np.float32)

            # Remove DC offset (center around zero)
            audio = audio - np.mean(audio)

            # Normalize audio with proper scaling
            max_val = np.max(np.abs(audio))
            if max_val > 0:
                # Normalize to [-0.95, 0.95] to prevent clipping
                audio = audio / max_val * 0.95
            else:
                logger.warning("Audio appears to be silent")

            # Apply simple noise gate to reduce background noise
            noise_threshold = np.max(np.abs(audio)) * 0.01  # 1% of max amplitude
            audio = np.where(np.abs(audio) < noise_threshold, 0, audio)

            # Check and limit audio duration
            audio_duration = len(audio) / sample_rate
            if audio_duration > TawasulSTT.max_audio_length:
                logger.warning(f"Audio duration ({audio_duration:.1f}s) exceeds maximum ({TawasulSTT.max_audio_length}s)")
                # Truncate to maximum length
                max_samples = int(TawasulSTT.max_audio_length * sample_rate)
                audio = audio[:max_samples]
                logger.info(f"Audio truncated to {TawasulSTT.max_audio_length}s")

            # Validate minimum duration
            min_duration = 0.1  # 100ms minimum
            if audio_duration < min_duration:
                logger.warning(f"Audio duration ({audio_duration:.3f}s) is very short")

            # Convert to PyTorch tensor
            audio_tensor = torch.FloatTensor(audio)

            # Log preprocessing info
            final_duration = len(audio_tensor) / sample_rate
            logger.debug(f"Audio preprocessed: {final_duration:.2f}s, max_amp: {torch.max(torch.abs(audio_tensor)):.3f}")

            return audio_tensor, sample_rate

        except Exception as e:
            error_msg = f"Audio preprocessing failed for {audio_path}: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @staticmethod
    def _chunk_audio(audio_tensor: torch.Tensor, sample_rate: int) -> List[torch.Tensor]:
        """
        Split audio into chunks for processing.

        Args:
            audio_tensor: Audio tensor
            sample_rate: Sample rate

        Returns:
            List of audio chunks
        """
        chunk_samples = int(TawasulSTT.chunk_length * sample_rate)
        chunks = []

        for i in range(0, len(audio_tensor), chunk_samples):
            chunk = audio_tensor[i:i + chunk_samples]
            if len(chunk) > sample_rate * 0.5:  # Only process chunks > 0.5 seconds
                chunks.append(chunk)

        return chunks

    @staticmethod
    def _transcribe_chunk(audio_chunk: torch.Tensor) -> Tuple[str, float]:
        """
        Transcribe a single audio chunk.

        Args:
            audio_chunk: Audio chunk tensor

        Returns:
            Tuple of (transcription, confidence_score)
        """
        try:
            import torch

            # Prepare input
            input_values = TawasulSTT.processor(
                audio_chunk,
                sampling_rate=16000,
                return_tensors="pt"
            ).input_values

            input_values = input_values.to(TawasulSTT.device)

            # Get model predictions
            with torch.no_grad():
                logits = TawasulSTT.model(input_values).logits

            # Get predicted tokens
            predicted_ids = torch.argmax(logits, dim=-1)

            # Decode transcription
            transcription = TawasulSTT.processor.decode(predicted_ids[0])

            # Calculate confidence (approximation)
            probs = torch.nn.functional.softmax(logits, dim=-1)
            max_probs = torch.max(probs, dim=-1)[0]
            confidence = torch.mean(max_probs).item()

            return transcription.strip(), confidence

        except Exception as e:
            logger.error(f"Chunk transcription error: {str(e)}")
            return "", 0.0

    @staticmethod
    def transcribe(audio_path: str, **kwargs) -> Tuple[str, str, str]:
        """
        Transcribe audio file using Tawasul STT V0.

        Args:
            audio_path: Path to audio file
            **kwargs: Additional transcription parameters

        Returns:
            Tuple of (transcription, confidence_info, processing_info)
        """
        if not TawasulSTT.is_loaded:
            return "❌ Model not loaded. Please load the model first.", "", ""

        try:
            start_time = time.time()

            # Validate file
            if not os.path.exists(audio_path):
                return f"❌ Audio file not found: {audio_path}", "", ""

            logger.info(f"🎡 Transcribing audio with Tawasul STT: {audio_path}")

            # Preprocess audio
            audio_tensor, sample_rate = TawasulSTT._preprocess_audio(audio_path)
            audio_duration = len(audio_tensor) / sample_rate

            # Process audio in chunks
            chunks = TawasulSTT._chunk_audio(audio_tensor, sample_rate)

            if not chunks:
                return "❌ No valid audio chunks found", "", ""

            # Transcribe each chunk
            transcriptions = []
            confidences = []

            for i, chunk in enumerate(chunks):
                logger.info(f"Processing chunk {i+1}/{len(chunks)}")
                transcription, confidence = TawasulSTT._transcribe_chunk(chunk)

                if transcription:  # Only add non-empty transcriptions
                    transcriptions.append(transcription)
                    confidences.append(confidence)

            # Combine results
            if not transcriptions:
                return "❌ No transcription generated", "", ""

            final_transcription = " ".join(transcriptions).strip()
            avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0

            # Calculate processing time
            processing_time = time.time() - start_time

            # Create info strings
            confidence_info = f"Confidence: {avg_confidence:.2f}"
            processing_info = (
                f"Duration: {audio_duration:.1f}s | "
                f"Chunks: {len(chunks)} | "
                f"Time: {processing_time:.1f}s | "
                f"Model: {TawasulSTT.model_id.split('/')[-1]}"
            )

            logger.info(f"βœ… Transcription completed in {processing_time:.1f}s")

            return final_transcription, confidence_info, processing_info

        except Exception as e:
            error_msg = f"❌ Tawasul STT transcription failed: {str(e)}"
            logger.error(error_msg)
            return error_msg, "", ""

    @staticmethod
    def get_supported_languages() -> List[str]:
        """Get list of supported languages."""
        return [
            "ar",     # Arabic
            "ar-SA",  # Saudi Arabic
            "ar-EG",  # Egyptian Arabic
            "ar-JO",  # Jordanian Arabic
            "ar-LB",  # Lebanese Arabic
            "ar-SY",  # Syrian Arabic
            "ar-IQ",  # Iraqi Arabic
            "ar-MA",  # Moroccan Arabic
            "ar-DZ",  # Algerian Arabic
            "ar-TN",  # Tunisian Arabic
        ]

    @staticmethod
    def get_model_info() -> Dict[str, Any]:
        """Get model information."""
        return {
            "name": "Tawasul STT V0",
            "model_id": TawasulSTT.model_id,
            "device": TawasulSTT.device,
            "is_loaded": TawasulSTT.is_loaded,
            "supported_languages": TawasulSTT.get_supported_languages(),
            "chunk_length": TawasulSTT.chunk_length,
            "max_audio_length": TawasulSTT.max_audio_length,
            "architecture": "Wav2Vec2",
            "specialization": "Arabic Speech Recognition"
        }
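Unlike the `BaseSTT` subclasses, `TawasulSTT.transcribe()` returns a plain 3-tuple of strings rather than an `STTResult`, so callers unpack it directly. A hedged usage sketch, assuming torch/transformers/librosa are installed and using `arabic_clip.wav` as a placeholder path:

```python
from stt.tawasul_stt import TawasulSTT

if TawasulSTT.is_available():
    TawasulSTT.load_model(device="auto")  # walks the fallback chain if the primary model fails
    text, confidence_info, processing_info = TawasulSTT.transcribe("arabic_clip.wav")
    print(text)
    print(confidence_info, "|", processing_info)
```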
stt/vosk_stt.py ADDED
@@ -0,0 +1,561 @@
#!/usr/bin/env python3
"""
Vosk STT Implementation

Vosk speech-to-text implementation using the static BaseSTT interface.
Supports multiple languages with offline models and real-time recognition.

Usage:
    from stt.vosk_stt import VoskSTT

    # Load model
    VoskSTT.load_model(model_name="vosk-model-en-us-0.22")

    # Transcribe audio
    result = VoskSTT.transcribe_audio(audio_array, 16000)
    print(result.text)
"""

from typing import Union, Optional, Dict, Any, List
import numpy as np
from pathlib import Path
import time
import json
import logging
import os
import urllib.request
import zipfile
import tempfile

try:
    import vosk
    VOSK_AVAILABLE = True
except ImportError:
    VOSK_AVAILABLE = False

try:
    import soundfile as sf
    SOUNDFILE_AVAILABLE = True
except ImportError:
    SOUNDFILE_AVAILABLE = False

from .stt_base import BaseSTT, STTResult

logger = logging.getLogger(__name__)


class VoskSTT(BaseSTT):
    """
    Vosk STT implementation supporting multiple languages and offline recognition.

    Features:
    - Multiple language support
    - Offline processing (no internet required after model download)
    - Real-time recognition capability
    - Small to large model options
    - Word-level timestamps and confidence scores
    - Lightweight and fast
    """

    model_name = "VoskSTT"
    model = None
    recognizer = None
    is_loaded = False
    config = {
        "model_name": "vosk-model-en-us-0.22",  # Default English model
        "model_path": None,  # Auto-determined
        "sample_rate": 16000,
        "language": "en",
        "download_url_base": "https://alphacephei.com/vosk/models/",
        "models_dir": str(Path.home() / ".vosk" / "models"),
        "return_confidence": True,
        "return_words": True,
        "chunk_size": 4096,
    }

    # Available Vosk models with their properties
    AVAILABLE_MODELS = {
        # English models
        "vosk-model-en-us-0.22": {
            "language": "en-US",
            "size": "1.8GB",
            "description": "English US Large",
            "url": "vosk-model-en-us-0.22.zip"
        },
        "vosk-model-small-en-us-0.15": {
            "language": "en-US",
            "size": "40MB",
            "description": "English US Small",
            "url": "vosk-model-small-en-us-0.15.zip"
        },

        # Arabic models
        "vosk-model-ar-mgb2-0.4": {
            "language": "ar",
            "size": "318MB",
            "description": "Arabic",
            "url": "vosk-model-ar-mgb2-0.4.zip"
        },

        # Multilingual and other languages
        "vosk-model-small-cn-0.22": {
            "language": "zh-CN",
            "size": "42MB",
            "description": "Chinese Small",
            "url": "vosk-model-small-cn-0.22.zip"
        },
        "vosk-model-fr-0.22": {
            "language": "fr-FR",
            "size": "1.4GB",
            "description": "French",
            "url": "vosk-model-fr-0.22.zip"
        },
        "vosk-model-de-0.21": {
            "language": "de-DE",
            "size": "1.2GB",
            "description": "German",
            "url": "vosk-model-de-0.21.zip"
        },
        "vosk-model-es-0.42": {
            "language": "es-ES",
            "size": "1.4GB",
            "description": "Spanish",
            "url": "vosk-model-es-0.42.zip"
        },
        "vosk-model-ru-0.42": {
            "language": "ru-RU",
            "size": "1.5GB",
            "description": "Russian",
            "url": "vosk-model-ru-0.42.zip"
        },
        "vosk-model-small-ru-0.22": {
            "language": "ru-RU",
            "size": "45MB",
            "description": "Russian Small",
            "url": "vosk-model-small-ru-0.22.zip"
        }
    }

    @classmethod
    def load_model(cls,
                   model_name: str = None,
                   model_path: str = None,
                   auto_download: bool = True,
                   **kwargs) -> None:
        """
        Load the Vosk model.

        Args:
            model_name: Name of the Vosk model (e.g., "vosk-model-en-us-0.22")
            model_path: Direct path to model directory (overrides model_name)
            auto_download: Automatically download model if not found
            **kwargs: Additional configuration parameters
        """
        if not VOSK_AVAILABLE:
            raise ImportError(
                "Vosk library required. Install with: pip install vosk"
            )

        # Update configuration
        cls.config.update({
            "model_name": model_name or cls.config["model_name"],
            "model_path": model_path,
            "auto_download": auto_download,
            **kwargs
        })

        # Determine model path
        if model_path:
            final_model_path = Path(model_path)
        else:
            final_model_path = cls._get_model_path(cls.config["model_name"])

        # Check if model exists, download if needed
        if not final_model_path.exists():
            if auto_download:
                logger.info(f"Model not found at {final_model_path}")
                cls._download_model(cls.config["model_name"])
            else:
                raise FileNotFoundError(f"Model not found: {final_model_path}")

        logger.info(f"Loading Vosk model from: {final_model_path}")
        start_time = time.time()

        try:
            # Load the Vosk model
            cls.model = vosk.Model(str(final_model_path))

            # Create recognizer
            cls.recognizer = vosk.KaldiRecognizer(cls.model, cls.config["sample_rate"])

            # Configure recognizer options (with compatibility checks)
            try:
                if hasattr(cls.recognizer, 'SetMaxAlternatives'):
                    cls.recognizer.SetMaxAlternatives(cls.config.get("max_alternatives", 3))
                    logger.info("βœ… Max alternatives enabled")
            except Exception as e:
                logger.warning(f"⚠️ Max alternatives not supported: {e}")

            try:
                if hasattr(cls.recognizer, 'SetReturnWordTimes'):
                    cls.recognizer.SetReturnWordTimes(cls.config.get("return_words", True))
                    logger.info("βœ… Word timing enabled")
                else:
                    logger.info("ℹ️ Word timing not available in this Vosk version")
            except Exception as e:
                logger.warning(f"⚠️ Word timing not supported: {e}")

            try:
                if hasattr(cls.recognizer, 'SetWords'):
                    cls.recognizer.SetWords(cls.config.get("return_words", True))
                    logger.info("βœ… Word-level output enabled")
            except Exception as e:
                logger.info(f"ℹ️ Word-level output using basic mode: {e}")

            # Test recognizer with a small sample (800 int16 samples, ~0.05s of silence)
            cls.recognizer.AcceptWaveform(b'\x00' * 1600)
            logger.info("βœ… Recognizer test successful")

            cls.is_loaded = True
            load_time = time.time() - start_time

            model_info = cls.AVAILABLE_MODELS.get(cls.config["model_name"], {})
            language = model_info.get("language", "unknown")

            logger.info(f"βœ… Vosk model loaded successfully in {load_time:.2f}s")
            logger.info(f"Model: {cls.config['model_name']}")
            logger.info(f"Language: {language}")
            logger.info(f"Sample rate: {cls.config['sample_rate']}Hz")

        except Exception as e:
            cls.is_loaded = False
            error_msg = f"Failed to load Vosk model: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @classmethod
    def _get_model_path(cls, model_name: str) -> Path:
        """Get the local path where a model should be stored."""
        models_dir = Path(cls.config["models_dir"])
        models_dir.mkdir(parents=True, exist_ok=True)
        return models_dir / model_name

    @classmethod
    def _download_model(cls, model_name: str) -> None:
        """Download a Vosk model if it's not already available."""
        if model_name not in cls.AVAILABLE_MODELS:
            raise ValueError(f"Unknown model: {model_name}. Available: {list(cls.AVAILABLE_MODELS.keys())}")

        model_info = cls.AVAILABLE_MODELS[model_name]
        download_url = cls.config["download_url_base"] + model_info["url"]
        model_path = cls._get_model_path(model_name)

        if model_path.exists():
            logger.info(f"Model already exists: {model_path}")
            return

        logger.info(f"Downloading Vosk model: {model_name}")
        logger.info(f"Size: {model_info['size']} - This may take a while...")
        logger.info(f"URL: {download_url}")

        tmp_path = None
        try:
            # Create temporary file for download
            with tempfile.NamedTemporaryFile(suffix='.zip', delete=False) as tmp_file:
                tmp_path = tmp_file.name

            # Download with progress
            def show_progress(block_num, block_size, total_size):
                if total_size > 0:
                    percent = min(100, (block_num * block_size * 100) // total_size)
                    if block_num % 100 == 0:  # Show progress every 100 blocks
                        print(f"\rDownloading... {percent}%", end="", flush=True)

            urllib.request.urlretrieve(download_url, tmp_path, show_progress)
            print()  # New line after progress

            logger.info(f"Download complete. Extracting to: {model_path}")

            # Extract the zip file
            with zipfile.ZipFile(tmp_path, 'r') as zip_ref:
                # Extract to temporary directory first
                extract_dir = model_path.parent / f"{model_name}_temp"
                extract_dir.mkdir(exist_ok=True)
                zip_ref.extractall(extract_dir)

                # Find the actual model directory (should contain conf/ and graph/ subdirs)
                extracted_items = list(extract_dir.iterdir())
                if len(extracted_items) == 1 and extracted_items[0].is_dir():
                    # Move the inner directory to the final location
                    extracted_items[0].rename(model_path)
                    extract_dir.rmdir()
                else:
                    # Multiple items or files - rename the temp directory
                    extract_dir.rename(model_path)

            # Cleanup
            os.unlink(tmp_path)

            logger.info(f"βœ… Model downloaded and extracted successfully: {model_path}")

        except Exception as e:
            # Cleanup on failure
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
            if model_path.exists():
                import shutil
                shutil.rmtree(model_path, ignore_errors=True)

            error_msg = f"Failed to download model {model_name}: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using Vosk.

        Args:
            audio_data: Audio input (numpy array or file path)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription with confidence and metadata
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")

        start_time = time.time()

        try:
            # Process input audio
            processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)

            # Check audio length
            duration = len(processed_audio) / actual_sr
            if duration < 0.1:
                return STTResult(
                    text="",
                    confidence=0.0,
                    processing_time=time.time() - start_time,
                    metadata={"error": "Audio too short", "duration": duration}
                )

            # Transcribe using Vosk
            result_text, confidence, words = cls._transcribe_with_vosk(processed_audio)

            processing_time = time.time() - start_time

            # Prepare metadata
            metadata = {
                "model": cls.config["model_name"],
                "language": cls.AVAILABLE_MODELS.get(cls.config["model_name"], {}).get("language", "unknown"),
                "duration": duration,
                "sample_rate": actual_sr,
                "words": words if cls.config.get("return_words", True) else None,
                "vosk_version": vosk.__version__ if hasattr(vosk, '__version__') else "unknown"
            }

            return STTResult(
                text=result_text.strip(),
                confidence=confidence,
                processing_time=processing_time,
                metadata=metadata
            )

        except Exception as e:
            error_msg = f"Transcription failed: {str(e)}"
            logger.error(error_msg)
            return STTResult(
                text="",
                confidence=0.0,
                processing_time=time.time() - start_time,
                metadata={"error": error_msg}
            )

    @classmethod
    def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
        """Process and validate audio input."""
        if isinstance(audio_data, (str, Path)):
            # Load audio file
            audio_path = Path(audio_data)
            if not audio_path.exists():
                raise FileNotFoundError(f"Audio file not found: {audio_path}")

            if SOUNDFILE_AVAILABLE:
                audio_array, sr = sf.read(str(audio_path))
                if audio_array.ndim > 1:
                    audio_array = np.mean(audio_array, axis=1)  # Convert to mono
            else:
                raise ImportError("soundfile required for file input. Install with: pip install soundfile")
        else:
            # Handle numpy array
            audio_array = audio_data.astype(np.float32)
            sr = sample_rate or cls.config["sample_rate"]

            # Convert to mono if stereo
            if audio_array.ndim > 1:
                audio_array = np.mean(audio_array, axis=1)

        # Resample to target sample rate if needed
        target_sr = cls.config["sample_rate"]
        if sr != target_sr:
            # Simple resampling
            if sr > target_sr:
                step = sr // target_sr
                audio_array = audio_array[::step]
            else:
                repeat = target_sr // sr
                audio_array = np.repeat(audio_array, repeat)
            sr = target_sr

        # Normalize and convert to 16-bit PCM format expected by Vosk
        audio_array = np.clip(audio_array, -1.0, 1.0)
        audio_int16 = (audio_array * 32767).astype(np.int16)

        return audio_int16, sr

    @classmethod
    def _transcribe_with_vosk(cls, audio_int16: np.ndarray) -> tuple:
        """Transcribe audio using Vosk recognizer."""
        # Convert to bytes
        audio_bytes = audio_int16.tobytes()

        # Reset recognizer for new transcription
        cls.recognizer = vosk.KaldiRecognizer(cls.model, cls.config["sample_rate"])

        # Configure recognizer with compatibility checks
        try:
            if hasattr(cls.recognizer, 'SetReturnWordTimes'):
                cls.recognizer.SetReturnWordTimes(cls.config.get("return_words", True))
        except Exception:
            pass  # Use basic recognition without word timing

        # Process audio in chunks
        chunk_size = cls.config.get("chunk_size", 4096)
        partial_results = []

        for i in range(0, len(audio_bytes), chunk_size):
            chunk = audio_bytes[i:i + chunk_size]
            if cls.recognizer.AcceptWaveform(chunk):
                result = json.loads(cls.recognizer.Result())
                if result.get("text"):
                    partial_results.append(result)

        # Get final result
        final_result = json.loads(cls.recognizer.FinalResult())
        if final_result.get("text"):
            partial_results.append(final_result)

        # Combine all results
        if not partial_results:
            return "", 0.0, []

        # Extract text and confidence
        full_text = " ".join([r.get("text", "") for r in partial_results]).strip()

        # Calculate average confidence from words
        all_words = []
        total_confidence = 0.0
        word_count = 0

        for result in partial_results:
            if "result" in result:
                words = result["result"]
                all_words.extend(words)
                for word in words:
                    if "conf" in word:
                        total_confidence += word["conf"]
                        word_count += 1

        average_confidence = total_confidence / word_count if word_count > 0 else 0.0

        return full_text, average_confidence, all_words

    @classmethod
    def get_available_models(cls) -> Dict[str, Any]:
        """Get information about available Vosk models."""
        return {
            "vosk_available": VOSK_AVAILABLE,
            "soundfile_available": SOUNDFILE_AVAILABLE,
            "models": cls.AVAILABLE_MODELS,
            "models_dir": cls.config["models_dir"],
            "downloaded_models": cls._get_downloaded_models()
        }

    @classmethod
    def _get_downloaded_models(cls) -> List[str]:
        """Get list of already downloaded models."""
        models_dir = Path(cls.config["models_dir"])
        if not models_dir.exists():
            return []

        downloaded = []
        for model_dir in models_dir.iterdir():
            if model_dir.is_dir() and model_dir.name in cls.AVAILABLE_MODELS:
                # Check if it looks like a valid Vosk model
                if (model_dir / "conf").exists() or (model_dir / "graph").exists():
                    downloaded.append(model_dir.name)

        return downloaded

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set language preference (informational - model determines actual language)."""
        cls.config["language"] = language or "auto"
        logger.info(f"Language preference set to: {cls.config['language']}")
        logger.info("Note: Vosk model determines actual recognition language")

    @classmethod
    def list_models(cls) -> None:
        """Print available models in a formatted way."""
        print("\n🎀 Available Vosk Models:")
        print("=" * 60)

        downloaded = cls._get_downloaded_models()

        for model_name, info in cls.AVAILABLE_MODELS.items():
            status = "βœ… Downloaded" if model_name in downloaded else "πŸ“₯ Available"
            print(f"{status} {model_name}")
            print(f"   Language: {info['language']}")
            print(f"   Size: {info['size']}")
            print(f"   Description: {info['description']}")
            print()


# Example usage and testing
if __name__ == "__main__":
    print("Testing Vosk STT implementation...")

    # Check availability
    models_info = VoskSTT.get_available_models()
    print(f"Vosk available: {models_info['vosk_available']}")
    print(f"Downloaded models: {models_info['downloaded_models']}")

    if models_info["vosk_available"]:
        try:
            # List available models
            VoskSTT.list_models()

            # Try to load a small English model for testing
            print("\nTesting with small English model...")
            VoskSTT.load_model(model_name="vosk-model-small-en-us-0.15")

            # Test with dummy audio
            print("Testing transcription...")
            test_audio = np.random.randn(16000).astype(np.float32) * 0.1

            result = VoskSTT.transcribe_audio(test_audio, 16000)
            print(f"Result: {result}")
            print(f"Metadata: {result.metadata}")

        except Exception as e:
            print(f"Error: {e}")
            print("Note: This is expected with random audio")

    else:
        print("Vosk not installed - install with: pip install vosk")
        print("Also recommended: pip install soundfile")

    print("\nVosk STT implementation ready!")
stt/wav2vec2_arabic_stt.py ADDED
@@ -0,0 +1,509 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Wav2Vec2 Arabic Egyptian STT Implementation
4
+
5
+ Hugging Face Wav2Vec2 speech-to-text implementation for Arabic Egyptian dialect
6
+ using the wav2vec2-large-xlsr-53-arabic-egyptian model.
7
+
8
+ Usage:
9
+ from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
10
+
11
+ # Load model
12
+ Wav2Vec2ArabicSTT.load_model()
13
+
14
+ # Transcribe audio
15
+ result = Wav2Vec2ArabicSTT.transcribe_audio(audio_array, 16000)
16
+ print(result.text)
17
+ """
18
+
19
+ from typing import Union, Optional, Dict, Any
20
+ import numpy as np
21
+ from pathlib import Path
22
+ import time
23
+ import logging
24
+ import warnings
25
+
26
+ # Suppress warnings for cleaner output
27
+ warnings.filterwarnings("ignore")
28
+
29
+ try:
30
+ import torch
31
+ import torchaudio
32
+ from transformers import (
33
+ Wav2Vec2ForCTC,
34
+ Wav2Vec2Processor,
35
+ Wav2Vec2Tokenizer
36
+ )
37
+ TRANSFORMERS_AVAILABLE = True
38
+ except ImportError:
39
+ TRANSFORMERS_AVAILABLE = False
40
+
41
+ try:
42
+ import librosa
43
+ LIBROSA_AVAILABLE = True
44
+ except ImportError:
45
+ LIBROSA_AVAILABLE = False
46
+
47
+ from .stt_base import BaseSTT, STTResult
48
+
49
+ logger = logging.getLogger(__name__)
50
+
51
+
52
+ class Wav2Vec2ArabicSTT(BaseSTT):
53
+ """
54
+ Wav2Vec2 Arabic Egyptian STT implementation using Hugging Face transformers.
55
+
56
+ Supports:
57
+ - Arabic Egyptian dialect transcription
58
+ - Local model execution (no API required)
59
+ - Automatic audio preprocessing
60
+ - Confidence estimation
61
+ """
62
+
63
+ model_name = "Wav2Vec2ArabicSTT"
64
+ model = None
65
+ processor = None
66
+ tokenizer = None
67
+ is_loaded = False
68
+ config = {
69
+ "model_id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
70
+ "fallback_models": [
71
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
72
+ "facebook/wav2vec2-large-xlsr-53",
73
+ "facebook/wav2vec2-base-960h" # English fallback
74
+ ],
75
+ "device": "auto", # auto, cpu, cuda
76
+ "chunk_length": 20, # seconds, for long audio processing
77
+ "sample_rate": 16000,
78
+ "return_confidence": True,
79
+ "language": "ar-EG", # Arabic Egyptian
80
+ "hf_token": None, # Hugging Face token for private models
81
+ "use_auth_token": True # Try to use cached token
82
+ }
83
+
84
+ @classmethod
85
+ def load_model(cls,
86
+ model_id: str = None,
87
+ device: str = "auto",
88
+ hf_token: str = None,
89
+ **kwargs) -> None:
90
+ """
91
+ Load the Wav2Vec2 Arabic model.
92
+
93
+ Args:
94
+ model_id: Hugging Face model ID (default: wav2vec2-large-xlsr-53-arabic-egyptian)
95
+ device: Device to use (auto, cpu, cuda)
96
+ hf_token: Hugging Face token for private models (optional)
97
+ **kwargs: Additional configuration parameters
98
+ """
99
+ if not TRANSFORMERS_AVAILABLE:
100
+ raise ImportError(
101
+ "Transformers library required. Install with: "
102
+ "pip install transformers torch torchaudio"
103
+ )
104
+
105
+ # Update configuration
106
+ cls.config.update({
107
+ "model_id": model_id or cls.config["model_id"],
108
+ "device": device,
109
+ "hf_token": hf_token,
110
+ **kwargs
111
+ })
112
+
113
+ # Determine device
114
+ if device == "auto":
115
+ if torch.cuda.is_available():
116
+ device = "cuda"
117
+ elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
118
+ device = "mps" # Apple Silicon
119
+ else:
120
+ device = "cpu"
121
+
122
+ cls.config["device"] = device
123
+
124
+ # Try to load the model, with fallbacks
125
+ models_to_try = [cls.config["model_id"]] + cls.config["fallback_models"]
126
+
127
+ for model_id_to_try in models_to_try:
128
+ logger.info(f"Attempting to load model: {model_id_to_try}")
129
+
130
+ try:
131
+ success = cls._load_model_with_id(model_id_to_try, device, hf_token)
132
+ if success:
133
+ cls.config["model_id"] = model_id_to_try # Update to successful model
134
+ return
135
+ except Exception as e:
136
+ logger.warning(f"Failed to load {model_id_to_try}: {e}")
137
+ continue
138
+
139
+ # If all models failed
140
+ raise RuntimeError(f"Failed to load any Wav2Vec2 model. Tried: {models_to_try}")
141
+
142
+ @classmethod
143
+ def _load_model_with_id(cls, model_id: str, device: str, hf_token: str = None) -> bool:
144
+ """
145
+ Load a specific model ID with authentication handling.
146
+
147
+ Returns:
148
+ bool: True if successful, False otherwise
149
+ """
150
+ logger.info(f"Loading Wav2Vec2 model: {model_id}")
151
+ logger.info(f"Using device: {device}")
152
+
153
+ start_time = time.time()
154
+
155
+ # Prepare authentication
156
+ auth_kwargs = {}
157
+ if hf_token:
158
+ auth_kwargs["token"] = hf_token
159
+ elif cls.config.get("use_auth_token", True):
160
+ auth_kwargs["use_auth_token"] = True
161
+
162
+ # Load processor and tokenizer
163
+ logger.info("Loading processor...")
164
+ cls.processor = Wav2Vec2Processor.from_pretrained(model_id, **auth_kwargs)
165
+
166
+ logger.info("Loading model...")
167
+ cls.model = Wav2Vec2ForCTC.from_pretrained(model_id, **auth_kwargs)
168
+
169
+ # Move model to device
170
+ cls.model = cls.model.to(device)
171
+ cls.model.eval() # Set to evaluation mode
172
+
173
+ # Load tokenizer for confidence calculation
174
+ try:
175
+ cls.tokenizer = Wav2Vec2Tokenizer.from_pretrained(model_id, **auth_kwargs)
176
+ except Exception as e:
177
+ logger.warning(f"Could not load tokenizer: {e}")
178
+ cls.tokenizer = None
179
+
180
+ cls.is_loaded = True
181
+ load_time = time.time() - start_time
182
+
183
+ logger.info(f"βœ… Wav2Vec2 model loaded successfully in {load_time:.2f}s")
184
+ logger.info(f"Model vocab size: {cls.model.config.vocab_size}")
185
+
186
+ return True
187
+
188
+ @classmethod
189
+ def transcribe_audio(cls,
190
+ audio_data: Union[np.ndarray, str, Path],
191
+ sample_rate: Optional[int] = None) -> STTResult:
192
+ """
193
+ Transcribe audio using Wav2Vec2 Arabic model.
194
+
195
+ Args:
196
+ audio_data: Audio input (numpy array or file path)
197
+ sample_rate: Sample rate for numpy arrays
198
+
199
+ Returns:
200
+ STTResult: Transcription with confidence and metadata
201
+ """
202
+ if not cls.is_loaded:
203
+ raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")
204
+
205
+ start_time = time.time()
206
+
207
+ try:
208
+ # Process input audio
209
+ processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)
210
+
211
+ # Check audio length
212
+ duration = len(processed_audio) / actual_sr
213
+ if duration < 0.1:
214
+ return STTResult(
215
+ text="",
216
+ confidence=0.0,
217
+ processing_time=time.time() - start_time,
218
+ metadata={"error": "Audio too short", "duration": duration}
219
+ )
220
+
221
+ # Process with model
222
+ if duration > cls.config.get("chunk_length", 20):
223
+ # Handle long audio by chunking
224
+ text, confidence = cls._transcribe_long_audio(processed_audio, actual_sr)
225
+ else:
226
+ # Process short audio directly
227
+ text, confidence = cls._transcribe_chunk(processed_audio, actual_sr)
228
+
229
+ processing_time = time.time() - start_time
230
+
231
+ # Prepare metadata
232
+ metadata = {
233
+ "model": cls.config["model_id"],
234
+ "device": cls.config["device"],
235
+ "language": "ar-EG",
236
+ "duration": duration,
237
+ "sample_rate": actual_sr,
238
+ "chunks_processed": 1 if duration <= cls.config.get("chunk_length", 20) else int(duration / cls.config["chunk_length"]) + 1
239
+ }
240
+
241
+ return STTResult(
242
+ text=text.strip(),
243
+ confidence=confidence,
244
+ processing_time=processing_time,
245
+ metadata=metadata
246
+ )
247
+
248
+ except Exception as e:
249
+ error_msg = f"Transcription failed: {str(e)}"
250
+ logger.error(error_msg)
251
+ return STTResult(
252
+ text="",
253
+ confidence=0.0,
254
+ processing_time=time.time() - start_time,
255
+ metadata={"error": error_msg}
256
+ )
257
+
258
+ @classmethod
259
+ def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
260
+ """Process and validate audio input."""
261
+ if isinstance(audio_data, (str, Path)):
262
+ # Load audio file
263
+ audio_path = Path(audio_data)
264
+ if not audio_path.exists():
265
+ raise FileNotFoundError(f"Audio file not found: {audio_path}")
266
+
267
+ if LIBROSA_AVAILABLE:
268
+ audio_array, sr = librosa.load(str(audio_path), sr=cls.config["sample_rate"])
269
+ else:
270
+ # Fallback to torchaudio
271
+ audio_tensor, sr = torchaudio.load(str(audio_path))
272
+ audio_array = audio_tensor.numpy().flatten()
273
+
274
+ # Resample if needed
275
+ if sr != cls.config["sample_rate"]:
276
+ resampler = torchaudio.transforms.Resample(sr, cls.config["sample_rate"])
277
+ audio_tensor = resampler(audio_tensor)
278
+ audio_array = audio_tensor.numpy().flatten()
279
+ sr = cls.config["sample_rate"]
280
+
281
+ else:
282
+ # Handle numpy array
283
+ audio_array = audio_data.astype(np.float32)
284
+ sr = sample_rate or cls.config["sample_rate"]
285
+
286
+ # Resample if needed
287
+ if sr != cls.config["sample_rate"]:
288
+ if LIBROSA_AVAILABLE:
289
+ audio_array = librosa.resample(
290
+ audio_array,
291
+ orig_sr=sr,
292
+ target_sr=cls.config["sample_rate"]
293
+ )
294
+ else:
295
+ # Simple resampling fallback
296
+ if sr > cls.config["sample_rate"]:
297
+ step = sr // cls.config["sample_rate"]
298
+ audio_array = audio_array[::step]
299
+ else:
300
+ repeat = cls.config["sample_rate"] // sr
301
+ audio_array = np.repeat(audio_array, repeat)
302
+
303
+ sr = cls.config["sample_rate"]
304
+
305
+ # Normalize audio
306
+ if len(audio_array) > 0:
307
+ # Convert to mono if stereo
308
+ if audio_array.ndim > 1:
309
+ audio_array = np.mean(audio_array, axis=0)
310
+
311
+ # Normalize to [-1, 1]
312
+ max_val = np.max(np.abs(audio_array))
313
+ if max_val > 0:
314
+ audio_array = audio_array / max_val
315
+
316
+ return audio_array, sr
317
+
318
+ @classmethod
319
    def _transcribe_chunk(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
        """Transcribe a single audio chunk."""
        # Preprocess audio
        input_values = cls.processor(
            audio_array,
            sampling_rate=sample_rate,
            return_tensors="pt",
            padding=True
        )

        # Move to device
        input_values = {k: v.to(cls.config["device"]) for k, v in input_values.items()}

        # Inference
        with torch.no_grad():
            logits = cls.model(**input_values).logits

        # Get predicted tokens
        predicted_ids = torch.argmax(logits, dim=-1)

        # Decode transcription
        transcription = cls.processor.batch_decode(predicted_ids)[0]

        # Calculate confidence (average of max probabilities)
        confidence = cls._calculate_confidence(logits)

        return transcription, confidence

    @classmethod
    def _transcribe_long_audio(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
        """Transcribe long audio by chunking."""
        chunk_length = cls.config.get("chunk_length", 20)
        chunk_samples = int(chunk_length * sample_rate)
        overlap_samples = int(1.0 * sample_rate)  # 1 second overlap

        transcriptions = []
        confidences = []

        for start in range(0, len(audio_array), chunk_samples - overlap_samples):
            end = min(start + chunk_samples, len(audio_array))
            chunk = audio_array[start:end]

            if len(chunk) < 0.5 * sample_rate:  # Skip very short chunks
                continue

            try:
                chunk_text, chunk_confidence = cls._transcribe_chunk(chunk, sample_rate)
                if chunk_text.strip():
                    transcriptions.append(chunk_text.strip())
                    confidences.append(chunk_confidence)
            except Exception as e:
                logger.warning(f"Failed to transcribe chunk: {e}")
                continue

        # Combine results
        full_text = " ".join(transcriptions)
        avg_confidence = np.mean(confidences) if confidences else 0.0

        return full_text, avg_confidence

    @classmethod
    def _calculate_confidence(cls, logits: torch.Tensor) -> float:
        """Calculate confidence score from model logits."""
        try:
            # Apply softmax to get probabilities
            probabilities = torch.softmax(logits, dim=-1)

            # Get maximum probability for each time step
            max_probs = torch.max(probabilities, dim=-1)[0]

            # Average over time steps (excluding padding if any)
            confidence = torch.mean(max_probs).item()

            return confidence

        except Exception as e:
            logger.warning(f"Could not calculate confidence: {e}")
            return 0.5  # Default confidence

    @classmethod
    def get_available_models(cls) -> Dict[str, Any]:
        """Get information about available Wav2Vec2 models."""
        models_info = {
            "transformers_available": TRANSFORMERS_AVAILABLE,
            "librosa_available": LIBROSA_AVAILABLE,
            "torch_available": TRANSFORMERS_AVAILABLE,
        }

        if TRANSFORMERS_AVAILABLE:
            models_info.update({
                "cuda_available": torch.cuda.is_available(),
                "mps_available": hasattr(torch.backends, 'mps') and torch.backends.mps.is_available(),
                "public_models": [
                    {
                        "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
                        "name": "Wav2Vec2 Arabic (Large)",
                        "language": "Arabic",
                        "size": "1.2GB"
                    },
                    {
                        "id": "facebook/wav2vec2-large-xlsr-53",
                        "name": "Wav2Vec2 Multilingual (Large)",
                        "language": "Multilingual (including Arabic)",
                        "size": "1.2GB"
                    },
                    {
                        "id": "facebook/wav2vec2-base-960h",
                        "name": "Wav2Vec2 English Base",
                        "language": "English",
                        "size": "360MB"
                    }
                ],
                "experimental_models": [
                    {
                        "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
                        "name": "Wav2Vec2 Arabic Egyptian (Large)",
                        "language": "Arabic Egyptian Dialect",
                        "size": "1.2GB",
                        "note": "May require HuggingFace authentication"
                    }
                ]
            })

        return models_info

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set language (for compatibility - this model is Arabic-specific)."""
        if language and not language.startswith("ar"):
            logger.warning(f"This model is optimized for Arabic. Language '{language}' may not work well.")

        cls.config["language"] = language or "ar-EG"
        logger.info(f"Language set to: {cls.config['language']}")

    @classmethod
    def set_device(cls, device: str) -> None:
        """Change device for model inference."""
        if cls.model is not None:
            cls.model = cls.model.to(device)
        cls.config["device"] = device
        logger.info(f"Model moved to device: {device}")

    @classmethod
    def get_model_info(cls) -> Dict[str, Any]:
        """Get detailed model information."""
        base_info = super().get_model_info()

        if cls.is_loaded:
            base_info.update({
                "model_id": cls.config["model_id"],
                "device": cls.config["device"],
                "language": cls.config["language"],
                "sample_rate": cls.config["sample_rate"],
                "vocab_size": cls.model.config.vocab_size if cls.model else None,
            })

        return base_info


# Example usage and testing
if __name__ == "__main__":
    print("Testing Wav2Vec2 Arabic STT implementation...")

    # Check availability
    models_info = Wav2Vec2ArabicSTT.get_available_models()
    print(f"Available models info: {models_info}")

    if models_info["transformers_available"]:
        try:
            print("Loading Wav2Vec2 Arabic model...")
            Wav2Vec2ArabicSTT.load_model()

            print("Creating test audio...")
            # Generate test audio (1 second of random noise)
            test_audio = np.random.randn(16000).astype(np.float32) * 0.1

            print("Testing transcription...")
            result = Wav2Vec2ArabicSTT.transcribe_audio(test_audio, 16000)
            print(f"Result: {result}")
            print(f"Metadata: {result.metadata}")

        except Exception as e:
            print(f"Error: {e}")
            print("Note: This is expected with random audio - the model expects Arabic speech")

    else:
        print("Transformers not installed - install with:")
        print("pip install transformers torch torchaudio")
        print("Optional: pip install librosa (for better audio processing)")

    print("\nWav2Vec2 Arabic STT implementation ready!")
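# Hedged usage sketch for real audio (an addition to the test above, not part
# of the original API surface). Assumes a local WAV file exists at
# "recordings/sample_ar.wav" -- the path is illustrative -- and that the
# optional soundfile dependency is installed:
#
#     import soundfile as sf
#     audio, sr = sf.read("recordings/sample_ar.wav", dtype="float32")
#     Wav2Vec2ArabicSTT.load_model(device="cpu")
#     result = Wav2Vec2ArabicSTT.transcribe_audio(audio, sr)
#     print(result.text, result.confidence)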
stt/whisper_stt.py ADDED
@@ -0,0 +1,377 @@
#!/usr/bin/env python3
"""
Whisper STT Implementation

OpenAI Whisper speech-to-text implementation using the static BaseSTT interface.
Supports both local Whisper models and OpenAI API calls.

Usage:
    from stt.whisper_stt import WhisperSTT

    # Load model (local)
    WhisperSTT.load_model()

    # Transcribe audio
    result = WhisperSTT.transcribe_file("audio.wav")
    print(result.text)

    # Or use OpenAI API
    WhisperSTT.load_model(use_api=True, api_key="your-key")
    result = WhisperSTT.transcribe_file("audio.wav")
"""

from typing import Union, Optional, Dict, Any
import numpy as np
from pathlib import Path
import time
import logging
import tempfile
import os

try:
    import whisper
    WHISPER_AVAILABLE = True
except ImportError:
    WHISPER_AVAILABLE = False

try:
    import openai
    OPENAI_AVAILABLE = True
except ImportError:
    OPENAI_AVAILABLE = False

try:
    import soundfile as sf
    SOUNDFILE_AVAILABLE = True
except ImportError:
    SOUNDFILE_AVAILABLE = False

from .stt_base import BaseSTT, STTResult

logger = logging.getLogger(__name__)


class WhisperSTT(BaseSTT):
    """
    OpenAI Whisper STT implementation with support for both local models and API.

    Supports:
    - Local Whisper models (tiny, base, small, medium, large)
    - OpenAI Whisper API calls
    - Multiple audio formats via soundfile
    - Confidence scoring and metadata
    """

    model_name = "WhisperSTT"
    model = None
    is_loaded = False
    config = {
        "model_size": "base",
        "use_api": False,
        "api_key": None,
        "language": None,  # Auto-detect if None
        "task": "transcribe",  # "transcribe" or "translate"
        "temperature": 0.0,
        "best_of": 5,
        "beam_size": 5,
        "patience": 1.0,
        "length_penalty": 1.0,
        "suppress_tokens": "-1",
        "initial_prompt": None,
        "condition_on_previous_text": True,
        "fp16": True,
        "compression_ratio_threshold": 2.4,
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6
    }

    @classmethod
    def load_model(cls,
                   model_size: str = "base",
                   use_api: bool = False,
                   api_key: Optional[str] = None,
                   **kwargs) -> None:
        """
        Load the Whisper model (local or API setup).

        Args:
            model_size: Size of local model ("tiny", "base", "small", "medium", "large")
            use_api: Use OpenAI API instead of local model
            api_key: OpenAI API key (required if use_api=True)
            **kwargs: Additional Whisper parameters
        """
        cls.config.update({
            "model_size": model_size,
            "use_api": use_api,
            "api_key": api_key,
            **kwargs
        })

        if use_api:
            cls._load_api_model(api_key)
        else:
            cls._load_local_model(model_size)

    @classmethod
    def _load_local_model(cls, model_size: str) -> None:
        """Load local Whisper model."""
        if not WHISPER_AVAILABLE:
            raise ImportError(
                "OpenAI Whisper not installed. Install with: pip install openai-whisper"
            )

        logger.info(f"Loading Whisper local model: {model_size}")
        start_time = time.time()

        try:
            cls.model = whisper.load_model(model_size)
            cls.is_loaded = True
            load_time = time.time() - start_time
            logger.info(f"Whisper model '{model_size}' loaded successfully in {load_time:.2f}s")

        except Exception as e:
            cls.is_loaded = False
            raise RuntimeError(f"Failed to load Whisper model '{model_size}': {e}")

    @classmethod
    def _load_api_model(cls, api_key: Optional[str]) -> None:
        """Setup OpenAI API client."""
        if not OPENAI_AVAILABLE:
            raise ImportError(
                "OpenAI Python client not installed. Install with: pip install openai"
            )

        if not api_key:
            # Try to get from environment
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key required. Set OPENAI_API_KEY environment variable or pass api_key parameter."
                )

        logger.info("Setting up OpenAI Whisper API client")

        try:
            openai.api_key = api_key
            cls.model = "whisper-1"  # API model identifier
            cls.is_loaded = True
            logger.info("OpenAI Whisper API client configured successfully")

        except Exception as e:
            cls.is_loaded = False
            raise RuntimeError(f"Failed to setup OpenAI API: {e}")

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using Whisper (local or API).

        Args:
            audio_data: Audio input (numpy array, file path, or audio file)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription with confidence and metadata
        """
        if not cls.is_loaded:
            raise RuntimeError("Whisper model not loaded. Call load_model() first.")

        start_time = time.time()

        try:
            if cls.config["use_api"]:
                result = cls._transcribe_api(audio_data, sample_rate)
            else:
                result = cls._transcribe_local(audio_data, sample_rate)

            processing_time = time.time() - start_time
            result.processing_time = processing_time

            logger.info(f"Transcription completed in {processing_time:.2f}s")
            return result

        except Exception as e:
            logger.error(f"Transcription failed: {e}")
            raise RuntimeError(f"Whisper transcription failed: {e}")

    @classmethod
    def _transcribe_local(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> STTResult:
        """Transcribe using local Whisper model."""

        # Prepare transcription options
        transcribe_options = {
            "language": cls.config.get("language"),
            "task": cls.config.get("task", "transcribe"),
            "temperature": cls.config.get("temperature", 0.0),
            "best_of": cls.config.get("best_of", 5),
            "beam_size": cls.config.get("beam_size", 5),
            "patience": cls.config.get("patience", 1.0),
            "length_penalty": cls.config.get("length_penalty", 1.0),
            "suppress_tokens": cls.config.get("suppress_tokens", "-1"),
            "initial_prompt": cls.config.get("initial_prompt"),
            "condition_on_previous_text": cls.config.get("condition_on_previous_text", True),
            "fp16": cls.config.get("fp16", True),
            "compression_ratio_threshold": cls.config.get("compression_ratio_threshold", 2.4),
            "logprob_threshold": cls.config.get("logprob_threshold", -1.0),
            "no_speech_threshold": cls.config.get("no_speech_threshold", 0.6)
        }

        # Remove None values
        transcribe_options = {k: v for k, v in transcribe_options.items() if v is not None}

        # Handle numpy arrays
        if isinstance(audio_data, np.ndarray):
            audio_input = audio_data.astype(np.float32)
            # Whisper expects mono audio
            if audio_input.ndim > 1:
                audio_input = np.mean(audio_input, axis=1)
        else:
            # File path
            audio_input = str(audio_data)

        # Transcribe
        result = cls.model.transcribe(audio_input, **transcribe_options)

        # Calculate confidence (average of segment confidences if available)
        confidence = None
        if "segments" in result and result["segments"]:
            segment_confidences = []
            for segment in result["segments"]:
                if "avg_logprob" in segment:
                    # Convert log prob to confidence estimate
                    conf = min(1.0, max(0.0, np.exp(segment["avg_logprob"])))
                    segment_confidences.append(conf)

            if segment_confidences:
                confidence = np.mean(segment_confidences)

        # Prepare metadata
        metadata = {
            "model": cls.config["model_size"],
            "language": result.get("language"),
            "task": cls.config["task"],
            "segments": len(result.get("segments", [])),
            "api_used": False
        }

        return STTResult(
            text=result["text"].strip(),
            confidence=confidence,
            metadata=metadata
        )

    @classmethod
    def _transcribe_api(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> STTResult:
        """Transcribe using OpenAI API."""

        # Handle numpy arrays - save to temp file for API
        if isinstance(audio_data, np.ndarray):
            if not SOUNDFILE_AVAILABLE:
                raise ImportError("soundfile required for numpy array support. Install with: pip install soundfile")

            # Create temporary WAV file
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                temp_path = temp_file.name

            try:
                sf.write(temp_path, audio_data, sample_rate or 16000)
                audio_file_path = temp_path
                cleanup_temp = True
            except Exception as e:
                if os.path.exists(temp_path):
                    os.unlink(temp_path)
                raise RuntimeError(f"Failed to save temporary audio file: {e}")
        else:
            audio_file_path = str(audio_data)
            cleanup_temp = False

        try:
            # Make API call
            with open(audio_file_path, "rb") as audio_file:
                transcript = openai.Audio.transcribe(
                    model="whisper-1",
                    file=audio_file,
                    language=cls.config.get("language"),
                    prompt=cls.config.get("initial_prompt"),
                    temperature=cls.config.get("temperature", 0.0)
                )

            # API doesn't provide confidence scores
            metadata = {
                "model": "whisper-1",
                "language": cls.config.get("language", "auto"),
                "task": "transcribe",
                "api_used": True
            }

            return STTResult(
                text=transcript["text"].strip(),
                confidence=None,  # API doesn't provide confidence
                metadata=metadata
            )

        finally:
            # Clean up temporary file if created
            if cleanup_temp and os.path.exists(audio_file_path):
                try:
                    os.unlink(audio_file_path)
                except Exception as e:
                    logger.warning(f"Failed to cleanup temp file {audio_file_path}: {e}")

    @classmethod
    def get_available_models(cls) -> Dict[str, Any]:
        """Get information about available Whisper models."""
        local_models = ["tiny", "base", "small", "medium", "large"] if WHISPER_AVAILABLE else []
        api_available = OPENAI_AVAILABLE

        return {
            "local_models": local_models,
            "api_available": api_available,
            "whisper_installed": WHISPER_AVAILABLE,
            "openai_installed": OPENAI_AVAILABLE,
            "soundfile_installed": SOUNDFILE_AVAILABLE
        }

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set the transcription language."""
        cls.config["language"] = language
        logger.info(f"Language set to: {language or 'auto-detect'}")

    @classmethod
    def set_task(cls, task: str) -> None:
        """Set the task (transcribe or translate)."""
        if task not in ["transcribe", "translate"]:
            raise ValueError("Task must be 'transcribe' or 'translate'")
        cls.config["task"] = task
        logger.info(f"Task set to: {task}")


# Example usage and testing
if __name__ == "__main__":
    print("Testing WhisperSTT implementation...")

    # Check availability
    models_info = WhisperSTT.get_available_models()
    print(f"Available models: {models_info}")

    if models_info["whisper_installed"]:
        try:
            # Test with local model
            print("\nTesting local Whisper model...")
            WhisperSTT.load_model("tiny")  # Use tiny model for faster testing

            # Test with dummy numpy audio
            dummy_audio = np.random.randn(16000).astype(np.float32)  # 1 second
            result = WhisperSTT.transcribe_numpy(dummy_audio, 16000)
            print(f"Dummy audio result: {result}")
            print(f"Model info: {WhisperSTT.get_model_info()}")

        except Exception as e:
            print(f"Local model test failed: {e}")
    else:
        print("Whisper not installed - install with: pip install openai-whisper")

    print("\nWhisperSTT implementation ready!")
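# Hedged sketch of the API path (not exercised above since it needs network
# access and an OpenAI account). Assumes OPENAI_API_KEY is set in the
# environment; "meeting.mp3" is an illustrative file name:
#
#     WhisperSTT.load_model(use_api=True, api_key=os.getenv("OPENAI_API_KEY"))
#     result = WhisperSTT.transcribe_audio("meeting.mp3")
#     print(result.text, result.metadata["api_used"])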
test_coqui.py ADDED
@@ -0,0 +1,163 @@
#!/usr/bin/env python3
"""
Test script for Coqui STT

This script tests the Coqui STT implementation with a sample audio file.
Coqui STT provides open-source speech recognition with multiple language support.

Usage:
    python test_coqui.py [audio_file]

If no audio file is provided, it will use the default recording if available.
"""

import sys
import logging
from pathlib import Path
from typing import Optional

# Add the project root to the path
sys.path.append(str(Path(__file__).parent))

from stt.coqui_stt import CoquiSTT, COQUI_STT_AVAILABLE


def test_coqui_stt(audio_file: Optional[str] = None):
    """Test Coqui STT functionality."""
    print("πŸš€ Testing Coqui STT")
    print("=" * 50)

    # Check if Coqui STT is available
    if not COQUI_STT_AVAILABLE:
        print("❌ Coqui STT not available. Install with:")
        print("pip install coqui-stt soundfile librosa")
        return False

    # Create CoquiSTT instance
    coqui = CoquiSTT()

    # Check dependencies
    deps_ok, deps_msg = coqui.check_dependencies()
    print(f"Dependencies: {deps_msg}")
    if not deps_ok:
        return False

    # Get available models
    print("\nπŸ“¦ Available Models:")
    available_models = coqui.get_available_models()
    for model in available_models:
        status = "βœ… Downloaded" if model["downloaded"] else "⬇️ Available for download"
        scorer_status = " (with scorer)" if model["has_scorer"] else " (no scorer)"
        print(f" - {model['name']}: {model['description']} ({model['size']}) {status}{scorer_status}")

    # Test model loading
    print("\nπŸ”„ Loading English Large model...")
    model_name = "english-large"
    success = coqui.load_model(
        model_name=model_name,
        auto_download=True,
        beam_width=512
    )

    if not success:
        print("❌ Failed to load model")
        return False

    print("βœ… Model loaded successfully")

    # Get model info
    model_info = coqui.get_model_info()
    print(f"\nπŸ“‹ Model Info:")
    for key, value in model_info.items():
        print(f" - {key}: {value}")

    # Test transcription
    if audio_file and Path(audio_file).exists():
        print(f"\n🎀 Transcribing: {audio_file}")
    else:
        # Look for default recording
        default_files = [
            "recordings/recorded_audio.wav",
            "recorded_audio.wav",
            "test_audio.wav"
        ]

        audio_file = None
        for file_path in default_files:
            if Path(file_path).exists():
                audio_file = file_path
                break

        if not audio_file:
            print("❌ No audio file found for testing")
            print("Record audio using the Gradio interface first, or provide a file path")
            return False

        print(f"\n🎀 Using default recording: {audio_file}")

    # Perform transcription
    try:
        print("Transcribing...")
        result = coqui.transcribe_audio(
            audio_file_path=audio_file,
            return_confidence=True,
            return_timestamps=False
        )

        if "error" in result:
            print(f"❌ Transcription error: {result['error']}")
            return False

        print("\nπŸ“ Transcription Results:")
        print(f" Text: {result['text']}")
        print(f" Confidence: {result.get('confidence', 'N/A')}")
        print(f" Language: {result.get('language', 'Unknown')}")

        # Test with timestamps if successful
        print("\nπŸ• Testing with timestamps...")
        result_with_timestamps = coqui.transcribe_audio(
            audio_file_path=audio_file,
            return_confidence=True,
            return_timestamps=True
        )

        if "words" in result_with_timestamps:
            print(f" Word count: {len(result_with_timestamps['words'])}")
            if result_with_timestamps['words']:
                print(" First few words with timestamps:")
                for word in result_with_timestamps['words'][:3]:
                    print(f" - '{word['word']}' at {word['start_time']:.2f}s (confidence: {word.get('confidence', 'N/A')})")

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False

    # Cleanup
    coqui.cleanup()
    print("\nβœ… Test completed successfully!")
    return True


def main():
    """Main function."""
    # Setup logging
    logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

    # Get audio file from command line if provided
    audio_file = sys.argv[1] if len(sys.argv) > 1 else None

    # Run test
    success = test_coqui_stt(audio_file)

    if success:
        print("\nπŸŽ‰ Coqui STT is working correctly!")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Run the main transcriber: python gradio_voice_transcriber_clean.py")
        print(" 2. Select 'CoquiSTT' as your model")
        print(" 3. Choose your preferred language model")
        print(" 4. Start transcribing!")
    else:
        print("\n❌ Coqui STT test failed")
        return 1

    return 0


if __name__ == "__main__":
    sys.exit(main())
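# Hedged invocation examples (file names are illustrative):
#
#     python test_coqui.py                  # falls back to recordings/recorded_audio.wav
#     python test_coqui.py my_clip.wav      # test a specific audio file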
test_gradio_voice_transcriber.py ADDED
@@ -0,0 +1,186 @@
import builtins
import numpy as np
import pytest

# Import the module under test
import gradio_voice_transcriber as gvt


class DummySTT:
    is_loaded = True

    def __init__(self):
        self._language = None

    def load_model(self, **kwargs):
        self.is_loaded = True

    def set_language(self, lang):
        self._language = lang

    def transcribe_audio(self, audio, sample_rate):
        # Return an object mimicking STTResult
        class R:
            def __init__(self):
                self.text = "hello world"
                self.confidence = 0.75
                self.processing_time = 0.05
        return R()

    @staticmethod
    def get_model_info():
        return {"is_loaded": True, "model_name": "DummySTT"}


class DummyTawasul:
    # Static-style class (no instantiation) used by the Tawasul code path
    is_loaded = True

    @staticmethod
    def load_model(**kwargs):
        DummyTawasul.is_loaded = True

    @staticmethod
    def get_model_info():
        return {"is_loaded": True, "model_name": "TawasulSTT"}

    @staticmethod
    def transcribe(path):
        # Return tuple like (text, confidence_info, processing_info)
        return ("transcribed from file", "Confidence: 0.42", "ok")


@pytest.fixture(autouse=True)
def reset_globals(monkeypatch):
    # Ensure clean state between tests
    gvt.current_stt_model = None
    gvt.current_model_config = {}
    yield
    gvt.current_stt_model = None
    gvt.current_model_config = {}


def test_audio_processor_preprocess_basic():
    sr = 8000
    t = np.linspace(0, 1, sr, endpoint=False)
    audio = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
    out = gvt.AudioProcessor.preprocess(audio, sr, target_sr=16000)
    # Should be float32, mono, and clipped range within [-1, 1]
    assert out.dtype == np.float32
    assert out.ndim == 1
    assert np.max(np.abs(out)) <= 1.0


def test_model_manager_load_whisper_missing_api_key_returns_error(monkeypatch):
    # Register DummySTT under the WhisperSTT name to avoid a heavy import
    monkeypatch.setitem(gvt.STT_MODELS, "WhisperSTT", DummySTT)
    # Request API mode without a key
    msg = gvt.ModelManager.load_model("WhisperSTT", model_size="base", use_api=True, api_key="")
    assert "API key required" in msg


def test_model_manager_load_generic_success(monkeypatch):
    # Register a generic model name and load
    monkeypatch.setitem(gvt.STT_MODELS, "DummySTT", DummySTT)
    msg = gvt.ModelManager.load_model("DummySTT")
    assert msg.startswith("βœ…")
    assert gvt.current_stt_model is not None


def test_transcription_engine_no_audio():
    text, conf, proc = gvt.TranscriptionEngine.transcribe(None, language="en")
    assert text.startswith("❌ No audio provided")


def test_transcription_engine_requires_loaded_model():
    # Provide dummy audio but no model
    sr = 16000
    audio = np.zeros(sr, dtype=np.float32)
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert "No STT model loaded" in text


def test_transcription_engine_happy_path(monkeypatch):
    # Use DummySTT and set it as the loaded model
    gvt.current_stt_model = DummySTT()
    gvt.current_model_config = {"model_name": "DummySTT"}
    # Provide a 1-second tone with enough amplitude
    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    audio = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert text == "hello world"
    assert conf.startswith("Confidence: ")
    assert "Processing:" in proc


def test_transcription_engine_filters_false_positives(monkeypatch):
    class LowTextDummy(DummySTT):
        def transcribe_audio(self, audio, sample_rate):
            class R:
                def __init__(self):
                    self.text = "you"  # a known false positive which should be filtered
                    self.confidence = None
                    self.processing_time = 0.01
            return R()

    gvt.current_stt_model = LowTextDummy()
    gvt.current_model_config = {"model_name": "LowTextDummy"}
    sr = 16000
    audio = np.ones(sr, dtype=np.float32) * 0.2
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert text == "πŸ”‡ No clear speech detected"


def test_transcription_engine_tawasul_static_path_flow(monkeypatch, tmp_path):
    # Force the Tawasul path by setting current_model_config model_name
    gvt.current_stt_model = DummyTawasul
    gvt.current_model_config = {"model_name": "TawasulSTT"}

    # Create a simple audio array meeting the quality gates
    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    audio = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)

    # Record calls to a fake soundfile.write so the test does not need the
    # soundfile dependency
    written = {}

    def fake_write(path, data, samplerate):
        written["path"] = path
        written["samplerate"] = samplerate
        written["len"] = len(data)

    monkeypatch.setitem(builtins.__dict__, "__SOUNDFILE_WRITE__", fake_write)

    class SFShim:
        @staticmethod
        def write(path, data, samplerate):
            fake_write(path, data, samplerate)

    # Replace the "sf" name in this test module's namespace with the shim
    monkeypatch.setitem(globals(), "sf", SFShim)

    # Run transcription
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="ar")
    assert text == "transcribed from file"
    assert conf.startswith("Confidence: ")
    assert "Model: TawasulSTT" in proc


def test_get_quality_recommendations_messages():
    q = {
        "duration": 0.5,
        "max_amplitude": 0.95,
        "clipping_ratio": 0.02,
        "silence_ratio": 0.6,
    }
    msg = gvt._get_quality_recommendations(q)
    # Expect multiple recommendations due to thresholds
    assert "recording for longer" in msg
    assert "clipping" in msg
    assert "Too much silence" in msg
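# Hedged run note: this is a plain pytest suite, e.g.
#
#     pytest test_gradio_voice_transcriber.py -q
#
# The heavy STT backends are stubbed out with DummySTT/DummyTawasul, so the
# suite should not need torch or whisper installed -- only whatever
# gradio_voice_transcriber itself imports at module load.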
test_hubert_arabic.py ADDED
@@ -0,0 +1,146 @@
#!/usr/bin/env python3
"""
Test script for HuBERT Arabic STT model

This script tests the HuBERT Arabic Egyptian STT implementation
including authentication, model loading, and transcription.
"""

import sys
import os
from pathlib import Path
import soundfile as sf
import numpy as np

# Add the project root to the path
project_root = Path(__file__).parent
sys.path.insert(0, str(project_root))


def test_hubert_arabic_stt():
    """Test the HuBERT Arabic STT implementation."""
    print("πŸš€ Testing HuBERT Arabic STT")
    print("=" * 50)

    try:
        from stt.hubert_arabic_stt import HuBERTArabicSTT
        print("βœ… HuBERTArabicSTT imported successfully")
    except ImportError as e:
        print(f"❌ Failed to import HuBERTArabicSTT: {e}")
        print("\nπŸ’‘ To install HuBERT dependencies:")
        print(" pip install -r requirements_hubert.txt")
        return False

    # Test model loading
    print("\nπŸ“¦ Testing model loading...")
    try:
        stt = HuBERTArabicSTT()

        # Try to load the primary model
        print("πŸ”§ Loading HuBERT Arabic Egyptian model...")
        result = stt.load_model(
            model_id="omarxadel/hubert-large-arabic-egyptian",
            device="auto"
        )
        print(f"Model load result: {result}")

    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        print("\nπŸ’‘ This might be due to:")
        print(" - Missing HuggingFace authentication token")
        print(" - Network connectivity issues")
        print(" - Private model access restrictions")
        print("\nπŸ”§ Try setting up authentication:")
        print(" python setup_hf_auth.py")
        return False

    # Test with sample audio (if available)
    print("\n🎡 Testing audio transcription...")

    # Create a test audio file (silence)
    sample_rate = 16000
    duration = 2.0  # seconds
    test_audio = np.zeros(int(sample_rate * duration), dtype=np.float32)

    test_audio_path = "test_audio_hubert.wav"
    sf.write(test_audio_path, test_audio, sample_rate)

    try:
        transcription, confidence, processing_info = stt.transcribe(test_audio_path)
        print(f"βœ… Transcription completed")
        print(f" Text: '{transcription}'")
        print(f" Confidence: {confidence}")
        print(f" Processing: {processing_info}")

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False
    finally:
        # Clean up test file
        if os.path.exists(test_audio_path):
            os.remove(test_audio_path)

    print("\nβœ… All HuBERT Arabic STT tests passed!")
    return True


def test_with_real_audio():
    """Test with real audio if available."""
    recordings_dir = Path("recordings")

    if not recordings_dir.exists():
        print(f"\nπŸ’‘ No recordings directory found at {recordings_dir}")
        print(" Create the directory and add .wav files to test with real audio")
        return

    audio_files = list(recordings_dir.glob("*.wav"))
    if not audio_files:
        print(f"\nπŸ’‘ No .wav files found in {recordings_dir}")
        return

    print(f"\n🎡 Testing with real audio files from {recordings_dir}...")

    try:
        from stt.hubert_arabic_stt import HuBERTArabicSTT
        stt = HuBERTArabicSTT()
        stt.load_model()

        for audio_file in audio_files[:2]:  # Test first 2 files
            print(f"\nπŸ”Š Processing: {audio_file.name}")
            try:
                transcription, confidence, processing_info = stt.transcribe(str(audio_file))
                print(f" Text: '{transcription}'")
                print(f" Confidence: {confidence}")
            except Exception as e:
                print(f" ❌ Error: {e}")

    except Exception as e:
        print(f"❌ Real audio test failed: {e}")


def main():
    """Main test function."""
    print("HuBERT Arabic STT Test Suite")
    print("=" * 60)

    # Basic functionality test
    success = test_hubert_arabic_stt()

    if success:
        print("\n🎯 Running additional tests...")
        test_with_real_audio()

    print("\n" + "=" * 60)
    if success:
        print("πŸŽ‰ HuBERT Arabic STT is working correctly!")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Test with the Gradio interface:")
        print("    python gradio_voice_transcriber_clean.py")
        print(" 2. Select 'HuBERTArabicSTT' as the STT model")
        print(" 3. Upload Arabic Egyptian audio for transcription")
    else:
        print("❌ HuBERT Arabic STT tests failed")
        print("\nπŸ”§ Troubleshooting:")
        print(" 1. Install dependencies: pip install -r requirements_hubert.txt")
        print(" 2. Set up HF authentication: python setup_hf_auth.py")
        print(" 3. Check network connectivity")


if __name__ == "__main__":
    main()
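# Hedged authentication note: besides the project's setup_hf_auth.py helper,
# gated or private Hugging Face models can usually be accessed by running
# `huggingface-cli login` once, or by exporting HF_TOKEN before launching this
# script. Whether this particular model requires a token depends on its repo
# settings.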
test_tawasul.py ADDED
@@ -0,0 +1,132 @@
#!/usr/bin/env python3
"""
Test script for Tawasul STT V0 model

This script tests the Tawasul STT V0 Arabic speech recognition model
with sample audio files.

Usage:
    python test_tawasul.py [audio_file]

If no audio file is provided, it will test with any files in the recordings/ directory.
"""

import sys
import os
from pathlib import Path
from typing import Optional
import time
import logging

# Add the project root to the path
sys.path.insert(0, str(Path(__file__).parent))

from stt.tawasul_stt import TawasulSTT

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def test_tawasul_stt(audio_file: Optional[str] = None):
    """Test Tawasul STT with an audio file."""

    print("πŸ§ͺ Tawasul STT V0 Test")
    print("=" * 50)

    # Check if Tawasul STT is available
    if not TawasulSTT.is_available():
        print("❌ Tawasul STT dependencies not available!")
        print("Install with: pip install -r requirements_tawasul.txt")
        return False

    # Find an audio file if none was provided
    if not audio_file:
        recordings_dir = Path("recordings")
        if recordings_dir.exists():
            audio_files = list(recordings_dir.glob("*.wav")) + list(recordings_dir.glob("*.mp3"))
            if audio_files:
                audio_file = str(audio_files[0])
                print(f"🎡 Using sample audio: {audio_file}")
            else:
                print("❌ No audio files found in recordings/ directory")
                print("Please provide an audio file: python test_tawasul.py your_audio.wav")
                return False
        else:
            print("❌ No audio file provided and no recordings/ directory found")
            print("Usage: python test_tawasul.py your_audio.wav")
            return False

    if not os.path.exists(audio_file):
        print(f"❌ Audio file not found: {audio_file}")
        return False

    try:
        # Load the model (static method)
        print("πŸ“₯ Loading Tawasul STT V0 model...")
        start_time = time.time()

        TawasulSTT.load_model(
            device="auto",  # Automatically choose best device
            chunk_length=20,  # 20-second chunks
            max_audio_length=300  # 5 minutes max
        )

        load_time = time.time() - start_time
        print(f"βœ… Model loaded in {load_time:.1f} seconds")

        # Get model info (static method)
        model_info = TawasulSTT.get_model_info()
        print(f"\nπŸ“Š Model Information:")
        print(f" Name: {model_info['name']}")
        print(f" Model ID: {model_info['model_id']}")
        print(f" Device: {model_info['device']}")
        print(f" Architecture: {model_info['architecture']}")
        print(f" Specialization: {model_info['specialization']}")
        print(f" Supported Languages: {', '.join(model_info['supported_languages'][:5])}...")

        # Transcribe audio (static method)
        print(f"\nπŸŽ™οΈ Transcribing audio: {audio_file}")
        transcription, confidence_info, processing_info = TawasulSTT.transcribe(audio_file)

        # Display results
        print("\n" + "=" * 50)
        print("πŸ“ TRANSCRIPTION RESULTS")
        print("=" * 50)
        print(f"Text: {transcription}")
        print(f"Confidence: {confidence_info}")
        print(f"Processing: {processing_info}")
        print("=" * 50)

        if transcription and not transcription.startswith("❌"):
            print("βœ… Transcription successful!")
            return True
        else:
            print("❌ Transcription failed!")
            return False

    except Exception as e:
        print(f"❌ Test failed: {str(e)}")
        return False


def main():
    """Main function."""
    audio_file = sys.argv[1] if len(sys.argv) > 1 else None

    # Test with different configurations
    success = test_tawasul_stt(audio_file)

    if success:
        print("\nπŸŽ‰ Tawasul STT test completed successfully!")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Try the main transcriber: python gradio_voice_transcriber_clean.py")
        print(" 2. Test with different Arabic audio files")
        print(" 3. Experiment with different model variants")
    else:
        print("\n❌ Tawasul STT test failed!")
        print("\nπŸ”§ Troubleshooting:")
        print(" 1. Install dependencies: pip install -r requirements_tawasul.txt")
        print(" 2. Check audio file format (WAV/MP3)")
        print(" 3. Ensure stable internet for model download")
        print(" 4. Try with a different audio file")


if __name__ == "__main__":
    main()
test_vosk.py ADDED
@@ -0,0 +1,185 @@
#!/usr/bin/env python3
"""
Test script for Vosk STT implementation

This script tests the Vosk STT implementation to ensure compatibility
and proper error handling.
"""

import sys
import numpy as np
from pathlib import Path

# Add the project directory to the Python path
project_dir = Path(__file__).parent
sys.path.insert(0, str(project_dir))


def test_vosk_basic():
    """Test basic Vosk functionality."""
    print("πŸ” Testing Vosk STT...")

    try:
        from stt.vosk_stt import VoskSTT
        print("βœ… Successfully imported VoskSTT")
    except ImportError as e:
        print(f"❌ Failed to import VoskSTT: {e}")
        print("\nπŸ“¦ Required dependencies:")
        print("pip install vosk")
        return False

    # Check Vosk availability
    print("\nπŸ“Š Checking Vosk availability...")
    models_info = VoskSTT.get_available_models()

    for key, value in models_info.items():
        status = "βœ…" if value else "❌"
        print(f"{status} {key}: {value}")

    if not models_info.get("vosk_available", False):
        print("\n❌ Vosk not available. Cannot proceed with test.")
        return False

    # Test model loading with a small model
    print(f"\nπŸ”„ Testing model loading...")
    print("⚠️ Note: This will download a small model (~40MB) if not cached")

    try:
        # Try to load a small English model first
        VoskSTT.load_model(model_name="vosk-model-small-en-us-0.15")
        print("βœ… Model loaded successfully!")

        # Get model info
        model_info = VoskSTT.get_model_info()
        print(f"πŸ“‹ Model info:")
        for key, value in model_info.items():
            print(f" {key}: {value}")

    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        print("\nπŸ’‘ This might be due to:")
        print(" - Network issues (model download)")
        print(" - Vosk version compatibility")
        print(" - Model availability")
        return False

    # Test with dummy audio
    print(f"\n🎀 Testing transcription with dummy audio...")
    print("⚠️ Note: Random audio won't produce meaningful text")

    try:
        # Create 2 seconds of random audio
        dummy_audio = np.random.randn(32000).astype(np.float32) * 0.1

        result = VoskSTT.transcribe_audio(dummy_audio, 16000)

        print(f"πŸ“ Transcription result:")
        print(f" Text: '{result.text}'")
        print(f" Confidence: {result.confidence:.2%}" if result.confidence else " Confidence: N/A")
        print(f" Processing time: {result.processing_time:.2f}s")
        print(f" Metadata: {result.metadata}")

        print("βœ… Transcription test completed!")
        return True

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False


def test_vosk_models():
    """Test different Vosk models."""
    print(f"\nπŸ“‹ Testing different Vosk models...")

    try:
        from stt.vosk_stt import VoskSTT

        # Get available models
        available = VoskSTT.AVAILABLE_MODELS
        print(f"πŸ“Š Available models: {len(available)}")

        # Show a few interesting models
        interesting_models = [
            "vosk-model-small-en-us-0.15",
            "vosk-model-small-ru-0.22",
            "vosk-model-small-fr-0.22",
            "vosk-model-small-de-0.15"
        ]

        print("\n🌍 Some available models:")
        for model_name in interesting_models:
            if model_name in available:
                model_info = available[model_name]
                print(f" {model_name}:")
                print(f"  Language: {model_info['language']}")
                print(f"  Size: {model_info['size']}")
                print(f"  Description: {model_info['description']}")

        return True

    except Exception as e:
        print(f"❌ Model listing failed: {e}")
        return False


def test_integration():
    """Test integration with the modular transcriber."""
    print(f"\nπŸ”— Testing integration with modular transcriber...")

    try:
        from gradio_voice_transcriber_clean import ModelManager

        available_models = ModelManager.get_available_models()
        print(f"πŸ“‹ Available models: {available_models}")

        if "VoskSTT" in available_models:
            print("βœ… VoskSTT is registered in the modular transcriber")

            # Test model options
            options = ModelManager.get_model_options("VoskSTT")
            print(f"πŸ“Š Model options: {options}")

            return True
        else:
            print("❌ VoskSTT not found in available models")
            return False

    except ImportError as e:
        print(f"❌ Failed to import modular transcriber components: {e}")
        return False


def main():
    """Main test function."""
    print("πŸ§ͺ Vosk STT Test Suite")
    print("=" * 50)

    # Test basic functionality
    basic_test = test_vosk_basic()

    # Test model listing
    models_test = test_vosk_models()

    # Test integration
    integration_test = test_integration()

    print("\n" + "=" * 50)
    print("πŸ“Š Test Results Summary:")
    print(f" Basic Functionality: {'βœ… PASS' if basic_test else '❌ FAIL'}")
    print(f" Model Listing: {'βœ… PASS' if models_test else '❌ FAIL'}")
    print(f" Integration: {'βœ… PASS' if integration_test else '❌ FAIL'}")

    if basic_test and models_test and integration_test:
        print("\nπŸŽ‰ All tests passed! Vosk STT is ready to use.")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Run: python gradio_voice_transcriber_clean.py")
        print(" 2. Select 'VoskSTT' from the dropdown")
        print(" 3. Choose your model (small models are faster)")
        print(" 4. Load the model and test with audio!")
        print("\n🌍 Vosk supports many languages offline!")
    else:
        print("\n❌ Some tests failed. Please check the errors above.")

    if not basic_test:
        print("\nπŸ“¦ To fix Vosk issues:")
        print(" pip install vosk")
        print(" Check internet connection for model download")


if __name__ == "__main__":
    main()
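# Hedged note on offline setups: Vosk models can also be downloaded manually
# from https://alphacephei.com/vosk/models and unpacked locally; whether
# VoskSTT picks up a pre-downloaded model directory depends on how
# stt/vosk_stt.py resolves model paths, which is not shown in this diff.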
test_wav2vec2_arabic.py ADDED
@@ -0,0 +1,142 @@
#!/usr/bin/env python3
"""
Test script for Wav2Vec2 Arabic STT

This script tests the Wav2Vec2 Arabic STT implementation without requiring
the full Gradio interface.
"""

import sys
import numpy as np
from pathlib import Path

# Add the project directory to the Python path
project_dir = Path(__file__).parent
sys.path.insert(0, str(project_dir))


def test_wav2vec2_arabic():
    """Test the Wav2Vec2 Arabic STT implementation."""
    print("πŸ” Testing Wav2Vec2 Arabic STT...")

    try:
        from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
        print("βœ… Successfully imported Wav2Vec2ArabicSTT")
    except ImportError as e:
        print(f"❌ Failed to import Wav2Vec2ArabicSTT: {e}")
        print("\nπŸ“¦ Required dependencies:")
        print("pip install transformers torch torchaudio")
        print("Optional: pip install librosa")
        return False

    # Check model availability
    print("\nπŸ“Š Checking model availability...")
    models_info = Wav2Vec2ArabicSTT.get_available_models()

    for key, value in models_info.items():
        status = "βœ…" if value else "❌"
        print(f"{status} {key}: {value}")

    if not models_info.get("transformers_available", False):
        print("\n❌ Transformers not available. Cannot proceed with test.")
        return False

    # Test model loading (this will download the model if not cached)
    print(f"\nπŸ”„ Loading model...")
    print("⚠️ Note: First run will download ~1.2GB model from Hugging Face")

    try:
        Wav2Vec2ArabicSTT.load_model(device="cpu")  # Use CPU for testing
        print("βœ… Model loaded successfully!")

        # Get model info
        model_info = Wav2Vec2ArabicSTT.get_model_info()
        print(f"πŸ“‹ Model info:")
        for key, value in model_info.items():
            print(f" {key}: {value}")

    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        return False

    # Test with dummy audio (this won't produce meaningful Arabic text)
    print(f"\n🎀 Testing transcription with dummy audio...")
    print("⚠️ Note: Random audio won't produce meaningful Arabic text")

    try:
        # Create 2 seconds of random audio
        dummy_audio = np.random.randn(32000).astype(np.float32) * 0.1

        result = Wav2Vec2ArabicSTT.transcribe_audio(dummy_audio, 16000)

        print(f"πŸ“ Transcription result:")
        print(f" Text: '{result.text}'")
        print(f" Confidence: {result.confidence:.2%}" if result.confidence else " Confidence: N/A")
        print(f" Processing time: {result.processing_time:.2f}s")
        print(f" Metadata: {result.metadata}")

        print("βœ… Transcription test completed!")
        return True

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False


def test_integration():
    """Test integration with the modular transcriber."""
    print(f"\nπŸ”— Testing integration with modular transcriber...")

    try:
        from gradio_voice_transcriber_clean import ModelManager, STT_MODELS

        available_models = ModelManager.get_available_models()
        print(f"πŸ“‹ Available models: {available_models}")

        if "Wav2Vec2ArabicSTT" in available_models:
            print("βœ… Wav2Vec2ArabicSTT is registered in the modular transcriber")

            # Test model options
            options = ModelManager.get_model_options("Wav2Vec2ArabicSTT")
            print(f"πŸ“Š Model options: {options}")

            return True
        else:
            print("❌ Wav2Vec2ArabicSTT not found in available models")
            return False

    except ImportError as e:
        print(f"❌ Failed to import modular transcriber components: {e}")
        return False


def main():
    """Main test function."""
    print("πŸ§ͺ Wav2Vec2 Arabic STT Test Suite")
    print("=" * 50)

    # Test the individual STT implementation
    stt_test = test_wav2vec2_arabic()

    # Test integration
    integration_test = test_integration()

    print("\n" + "=" * 50)
    print("πŸ“Š Test Results Summary:")
    print(f" STT Implementation: {'βœ… PASS' if stt_test else '❌ FAIL'}")
    print(f" Integration: {'βœ… PASS' if integration_test else '❌ FAIL'}")

    if stt_test and integration_test:
        print("\nπŸŽ‰ All tests passed! The Wav2Vec2 Arabic STT is ready to use.")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Run: python gradio_voice_transcriber_clean.py")
        print(" 2. Select 'Wav2Vec2ArabicSTT' from the dropdown")
        print(" 3. Choose your device (CPU/CUDA)")
        print(" 4. Load the model and test with Arabic audio!")
    else:
        print("\n❌ Some tests failed. Please check the errors above.")

    if not stt_test:
        print("\nπŸ“¦ To fix STT implementation issues:")
        print(" pip install transformers torch torchaudio")
        print(" pip install librosa  # optional, for better audio processing")


if __name__ == "__main__":
    main()
test_whisper_local.py ADDED
@@ -0,0 +1,32 @@
#!/usr/bin/env python3
"""
Simple Whisper Test

Load the Whisper model and test transcription.
"""

import numpy as np
from stt import WhisperSTT


def main():
    print("Loading Whisper model...")
    WhisperSTT.load_model("tiny")  # Load the tiny model (fastest)

    if not WhisperSTT.is_loaded:
        print("Failed to load model")
        return

    print("Model loaded successfully!")

    # Create some test audio (1 second of random noise)
    test_audio = np.random.randn(16000).astype(np.float32)

    print("Transcribing audio...")
    result = WhisperSTT.transcribe_numpy(test_audio, 16000)

    print(f"Result: {result.text}")
    print(f"Confidence: {result.confidence}")
    print(f"Time: {result.processing_time:.2f}s")


if __name__ == "__main__":
    main()
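# Hedged follow-up sketch: to test with real speech instead of noise, whisper's
# own audio loader can be used ("speech.wav" is an illustrative path):
#
#     import whisper
#     audio = whisper.load_audio("speech.wav")  # float32 mono at 16 kHz
#     result = WhisperSTT.transcribe_numpy(audio, 16000)
#     print(result.text)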
uv.lock ADDED
The diff for this file is too large to render. See raw diff