GitHub Actions committed on
Commit 57b8470 · 1 Parent(s): 921859d

Deploy from bot_text branch - Sat Dec 27 17:53:04 UTC 2025
1.4.0 ADDED
File without changes
INSTALL.md ADDED
@@ -0,0 +1,152 @@
+ # Installation Guide
+
+ This guide explains how to install and set up the Modular Voice Transcriber with different STT models.
+
+ ## 🚀 Quick Start
+
+ ### Option 1: Automated Setup (Recommended)
+ ```bash
+ # Essential models (Whisper + Wav2Vec2)
+ python setup.py --profile essential --test
+
+ # Or for specific models only
+ python setup.py --profile whisper-only --test
+ python setup.py --profile wav2vec2-only --test
+ ```
+
+ ### Option 2: Manual Installation
+
+ #### Base Installation
+ ```bash
+ # Core requirements (quote the specifiers so the shell doesn't treat >= as a redirect)
+ pip install "gradio>=4.0.0" "numpy>=1.21.0" "soundfile>=0.12.1"
+ ```
+
+ #### Choose Your STT Models
+
+ **OpenAI Whisper (Local + API)**
+ ```bash
+ pip install -r requirements_whisper.txt
+ # Or: pip install -e .[whisper,whisper-api]
+ ```
+
+ **Wav2Vec2 Arabic**
+ ```bash
+ pip install -r requirements_wav2vec2.txt
+ # Or: pip install -e .[wav2vec2]
+ ```
+
+ **All Models**
+ ```bash
+ pip install -r requirements.txt
+ # Or: pip install -e .[all-stt]
+ ```
+
+ ## 📦 Installation Profiles
+
+ | Profile | Models Included | Use Case |
+ |---------|----------------|----------|
+ | `minimal` | None | Interface only (for development) |
+ | `essential` | Whisper + Wav2Vec2 | Best balance of features |
+ | `whisper-only` | OpenAI Whisper | English + Multilingual |
+ | `wav2vec2-only` | Wav2Vec2 Arabic | Arabic Egyptian dialect |
+ | `all` | All supported models | Complete functionality |
+
+ ## 🔧 System Requirements
+
+ ### Minimum Requirements
+ - Python 3.8+
+ - 4GB RAM
+ - 2GB free disk space
+
+ ### Recommended Requirements
+ - Python 3.9+
+ - 8GB RAM
+ - 5GB free disk space
+ - GPU with CUDA support (for faster transcription)
+
+ ## 📋 Model Download Sizes
+
+ | Model | First Download | Disk Space |
+ |-------|---------------|------------|
+ | Whisper Tiny | 39MB | 39MB |
+ | Whisper Base | 142MB | 142MB |
+ | Whisper Medium | 1.5GB | 1.5GB |
+ | Wav2Vec2 Arabic | 1.2GB | 1.2GB |
+
+ ## 🧪 Testing Your Installation
+
+ ### Test Individual Models
+ ```bash
+ # Test Wav2Vec2 Arabic
+ python test_wav2vec2_arabic.py
+
+ # Test Whisper
+ python test_whisper_local.py
+ ```
+
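+ You can also verify which optional back ends actually import before launching the interface. This is a minimal sketch using the same try/except import pattern the app itself uses (package names as in the requirements files above):
+
+ ```python
+ # Quick availability check for optional STT dependencies.
+ for name in ("whisper", "transformers", "torch", "vosk"):
+     try:
+         __import__(name)
+         print(f"{name}: OK")
+     except ImportError:
+         print(f"{name}: not installed")
+ ```
+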
+ ### Test Full Interface
+ ```bash
+ python gradio_voice_transcriber_clean.py
+ ```
+
+ ## 🔍 Troubleshooting
+
+ ### Common Issues
+
+ **Import Error: transformers**
+ ```bash
+ pip install transformers torch torchaudio
+ ```
+
+ **Import Error: whisper**
+ ```bash
+ pip install openai-whisper
+ ```
+
+ **CUDA Issues**
+ - Install PyTorch with CUDA support from [pytorch.org](https://pytorch.org) (quick check below)
+ - Or use CPU-only: `pip install torch torchaudio --index-url https://download.pytorch.org/whl/cpu`
+
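+ To confirm whether the GPU build is active, a one-line check against the standard PyTorch API:
+
+ ```python
+ import torch  # whichever build you installed above
+ print("CUDA available:", torch.cuda.is_available())
+ ```
+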
+ **Model Download Issues**
+ - Check your internet connection
+ - Hugging Face models download automatically on first use
+ - Downloads go to `~/.cache/huggingface/` and `~/.cache/whisper/`
+
+ ### Performance Tips
+
+ **For Better Speed:**
+ - Use a GPU if available
+ - Choose smaller models for real-time use
+
+ **For Better Quality:**
+ - Use larger models for better accuracy
+ - Record in a quiet environment
+ - Use a good microphone
+ - Speak clearly and at a normal pace
+ - Choose the appropriate language/dialect model
+
+ ## 🔄 Updating
+
+ ```bash
+ # Update to latest versions
+ pip install --upgrade -r requirements.txt
+
+ # Update specific models
+ pip install --upgrade transformers openai-whisper
+ ```
+
+ ## 🎯 Next Steps
+
+ 1. **Run the interface:** `python gradio_voice_transcriber_clean.py`
+ 2. **Choose your model** from the dropdown
+ 3. **Load the model** (the first load downloads it)
+ 4. **Test with audio** recording or upload
+ 5. **Check quality analysis** for audio tips
+
+ ## 📚 Additional Resources
+
+ - [Gradio Documentation](https://gradio.app/docs/)
+ - [Whisper by OpenAI](https://openai.com/research/whisper)
+ - [Wav2Vec2 Models](https://huggingface.co/models?search=wav2vec2)
+ - [Transformers Library](https://huggingface.co/docs/transformers/)
README.md CHANGED
@@ -1,12 +1,216 @@
- ---
- title: Stt Trails
- emoji: 🌖
- colorFrom: pink
- colorTo: pink
- sdk: gradio
- sdk_version: 5.49.1
- app_file: app.py
- pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Modular Voice Transcriber
+
+ A comprehensive, modular Gradio-based web interface for speech-to-text transcription supporting multiple STT engines, including OpenAI Whisper, Wav2Vec2, HuBERT Arabic, Tawasul, Vosk, and Coqui STT models.
+
+ ## 🌟 Features
+
+ - **Comprehensive STT Support**: 7+ different speech-to-text engines
+ - **Multiple Models**: OpenAI Whisper, Wav2Vec2 Arabic, HuBERT, Tawasul, Vosk, Coqui STT
+ - **Arabic Language Focus**: Specialized models for Arabic dialect recognition
+ - **Web Interface**: User-friendly Gradio interface with image gallery
+ - **Real-time Processing**: Live audio recording and transcription
+ - **Quality Analysis**: Audio quality feedback and recommendations
+ - **Device Support**: Automatic CPU/GPU detection and selection
+ - **Authentication**: Support for private Hugging Face models
+ - **Static Class Support**: Optimized memory usage for certain models
+ - **Visual Interface**: Interactive image gallery with thumbnail navigation
+
+ ## 🚀 Quick Start
+
+ ### Option 1: Automated Setup (Recommended)
+ ```bash
+ # Essential models (Whisper + Wav2Vec2)
+ python setup.py --profile essential --test
+
+ # Or specific models
+ python setup.py --profile whisper-only --test
+ python setup.py --profile wav2vec2-only --test
+ ```
+
+ ### Option 2: Manual Installation
+
+ ```bash
+ # Base installation (Whisper + Wav2Vec2)
+ pip install -r requirements.txt
+
+ # Specific model installations
+ pip install -r requirements_whisper.txt   # OpenAI Whisper only
+ pip install -r requirements_wav2vec2.txt  # Wav2Vec2 only
+ pip install -r requirements_hubert.txt    # HuBERT Arabic only
+ pip install -r requirements_tawasul.txt   # Tawasul Arabic only
+ pip install -r requirements_vosk.txt      # Vosk offline only
+ pip install -r requirements_coqui.txt     # Coqui STT only
+
+ # Or install with specific extras
+ pip install -e .[essential]  # Whisper + Wav2Vec2
+ pip install -e .[all-stt]    # All models
+ ```
+
+ ## 🎯 Supported STT Models
+
+ | Model | Language | Size | Type | Quality | Features |
+ |-------|----------|------|------|---------|----------|
+ | **Whisper Tiny** | Multilingual | 39MB | Local/API | Fast | General purpose |
+ | **Whisper Base** | Multilingual | 142MB | Local/API | Good | General purpose |
+ | **Whisper Medium** | Multilingual | 1.5GB | Local/API | Better | General purpose |
+ | **Whisper Large** | Multilingual | 2.9GB | Local/API | Best | General purpose |
+ | **Wav2Vec2 Arabic** | Arabic | 1.2GB | Local | Excellent | Arabic dialects |
+ | **HuBERT Arabic** | Arabic Egyptian | 1.2GB | Local | Excellent | Egyptian dialect |
+ | **Tawasul V0** | Arabic | 800MB | Local | Very Good | Arabic speech, static class |
+ | **Vosk** | Multilingual | 50MB-1.8GB | Local/Offline | Good | Offline capable |
+ | **Coqui STT** | Multilingual | 180MB-2GB | Local | Good | Open source |
+
+ ## 🔧 Usage
+
+ ### Start the Interface
+ ```bash
+ python gradio_voice_transcriber_clean.py
+ ```
+
+ ### Using Different Models
+
+ 1. **Select Model**: Choose from the dropdown (WhisperSTT, Wav2Vec2ArabicSTT, HuBERTArabicSTT, TawasulSTT, VoskSTT, CoquiSTT)
+ 2. **Configure**: Set model size, device, language, and authentication if needed
+ 3. **Load**: Click "Load Model" (the first load downloads the model automatically)
+ 4. **Transcribe**: Record audio or upload audio files
+ 5. **Gallery**: Browse sample images using the interactive thumbnail gallery
+
+ ### Authentication for Private Models
+
+ Some experimental models require Hugging Face authentication:
+
+ ```bash
+ # Option 1: Use the helper script
+ python setup_hf_auth.py
+
+ # Option 2: Manual login
+ pip install huggingface-hub
+ huggingface-cli login
+
+ # Option 3: Use a token in the interface
+ # Get a token from: https://huggingface.co/settings/tokens
+ # Enter it in the "HuggingFace Token" field
+ ```
+
+ ## πŸ—οΈ Adding New STT Models
96
+
97
+ The system is designed to be easily extensible:
98
+
99
+ 1. **Create STT Class**: Inherit from `BaseSTT` in `stt/your_model.py`
100
+ 2. **Register Model**: Add to `STT_MODELS` in `gradio_voice_transcriber_clean.py`
101
+ 3. **Configure Options**: Update `get_model_options()` method
102
+ 4. **Test**: Run your model through the interface
103
+
104
+ See `STT_INTEGRATION_GUIDE.md` for detailed instructions.
105
+
106
+ ## 🎨 Interface Features
107
+
108
+ ### Image Gallery
109
+ The enhanced interface (`gradio_voice_transcript_temp.py`) includes:
110
+ - **Interactive Gallery**: Browse sample images with thumbnail navigation
111
+ - **Horizontal Scrolling**: Smooth image browsing experience
112
+ - **Thumbnail Selection**: Click thumbnails to view full images
113
+ - **Gallery Controls**: Navigation and zoom functionality
114
+
115
+ ### Static Class Models
116
+ Some models (like TawasulSTT) use static class implementation for:
117
+ - **Memory Efficiency**: Reduced memory footprint
118
+ - **Faster Loading**: Optimized model initialization
119
+ - **Shared Resources**: Better resource management
120
+
121
+ ## πŸ“ Model Storage Locations
122
+
123
+ Different STT models store their files in various locations:
124
+
125
+ | Model Type | Storage Location | Description |
126
+ |------------|------------------|-------------|
127
+ | **Hugging Face Models** | `~/.cache/huggingface/` | Wav2Vec2, HuBERT, Tawasul models |
128
+ | **Whisper Models** | `~/.cache/whisper/` | OpenAI Whisper model files |
129
+ | **Vosk Models** | `~/.vosk/models/` | Offline Vosk language models |
130
+ | **Coqui Models** | Managed by model manager | Coqui STT model files |
131
+
132
+ ## πŸ“ Project Structure
133
+
134
+ ```
135
+ STT-trails/
136
+ β”œβ”€β”€ gradio_voice_transcriber.py # Main comprehensive interface
137
+ β”œβ”€β”€ gradio_voice_transcript_temp.py # Enhanced interface with image gallery
138
+ β”œβ”€β”€ stt/ # STT implementations
139
+ β”‚ β”œβ”€β”€ stt_base.py # Base class for all STT models
140
+ β”‚ β”œβ”€β”€ whisper_stt.py # OpenAI Whisper implementation
141
+ β”‚ β”œβ”€β”€ wav2vec2_arabic_stt.py # Wav2Vec2 Arabic model
142
+ β”‚ β”œβ”€β”€ hubert_arabic_stt.py # HuBERT Arabic dialect model
143
+ β”‚ β”œβ”€β”€ tawasul_stt.py # Tawasul Arabic model (static class)
144
+ β”‚ β”œβ”€β”€ vosk_stt.py # Vosk offline STT
145
+ β”‚ β”œβ”€β”€ coqui_stt.py # Coqui open-source STT
146
+ β”‚ └── example_custom_stt.py # Template for new models
147
+ β”œβ”€β”€ setup.py # Installation helper
148
+ β”œβ”€β”€ setup_hf_auth.py # HuggingFace authentication helper
149
+ β”œβ”€β”€ test_*.py # Model testing scripts
150
+ β”œβ”€β”€ requirements*.txt # Dependencies for each model
151
+ β”œβ”€β”€ recordings/ # Audio recordings directory
152
+ β”œβ”€β”€ INSTALL.md # Detailed installation guide
153
+ β”œβ”€β”€ STT_INTEGRATION_GUIDE.md # Developer integration guide
154
+ └── pyproject.toml # Project configuration
155
+ ```
156
+
157
+ ## πŸ§ͺ Testing
158
+
159
+ ```bash
160
+ # Test specific models
161
+ python test_whisper_local.py # Test Whisper models
162
+ python test_wav2vec2_arabic.py # Test Wav2Vec2 Arabic
163
+ python test_hubert_arabic.py # Test HuBERT Arabic
164
+ python test_tawasul.py # Test Tawasul Arabic
165
+ python test_vosk.py # Test Vosk offline STT
166
+ python test_coqui.py # Test Coqui STT
167
+
168
+ # Test installation
169
+ python setup.py --profile essential --test
170
+
171
+ # Run the interface
172
+ python gradio_voice_transcriber.py # Main interface
173
+ python gradio_voice_transcript_temp.py # Interface with image gallery
174
+ ```
175
+
176
+ ## πŸ” Troubleshooting
177
+
178
+ ### Model Loading Issues
179
+ - **HuggingFace Authentication**: Use `setup_hf_auth.py` or manual token
180
+ - **Memory Issues**: Use smaller models or CPU-only mode
181
+ - **Internet Required**: First model download needs internet connection
182
+
183
+ ### Audio Issues
184
+ - **No Audio Detected**: Check microphone permissions and volume
185
+ - **Poor Quality**: Use audio quality analysis feature
186
+ - **Wrong Language**: Select appropriate model for your language
187
+
188
+ ### Performance Tips
189
+ - **Use GPU**: Automatic if CUDA PyTorch is installed
190
+ - **Chunk Long Audio**: Handled automatically for 20+ second clips
191
+ - **Choose Right Model**: Balance size vs. accuracy for your use case
192
+
193
+ ## πŸ“„ License
194
+
195
+ This project is open source. See individual model licenses:
196
+ - OpenAI Whisper: MIT License
197
+ - Wav2Vec2: MIT License
198
+ - HuggingFace Transformers: Apache 2.0
199
+
200
+ ## 🀝 Contributing
201
+
202
+ 1. Fork the repository
203
+ 2. Create your feature branch
204
+ 3. Add your STT model following the integration guide
205
+ 4. Submit a pull request
206
+
207
+ ## πŸ“š Resources
208
+
209
+ - [OpenAI Whisper](https://openai.com/research/whisper)
210
+ - [Wav2Vec2 Paper](https://arxiv.org/abs/2006.11477)
211
+ - [HuggingFace Models](https://huggingface.co/models)
212
+ - [Gradio Documentation](https://gradio.app/docs/)
213
+
214
  ---
215
 
216
+ **Made with ❀️ for the speech recognition community**
STT_INTEGRATION_GUIDE.md ADDED
@@ -0,0 +1,270 @@
+ # Modular STT Integration Guide
+
+ This guide explains how to integrate new Speech-to-Text models into the modular Gradio voice transcriber.
+
+ ## 🏗️ Architecture Overview
+
+ The system is built with a modular architecture that makes it easy to add new STT engines:
+
+ ```
+ gradio_voice_transcriber_clean.py
+ ├── ModelManager         # Handles model registration and loading
+ ├── AudioProcessor       # Preprocesses audio for better quality
+ ├── TranscriptionEngine  # Manages transcription workflow
+ └── GradioInterface      # Creates the web UI
+ ```
+
+ ## 🔧 Adding a New STT Model
+
+ ### Step 1: Create Your STT Class
+
+ Create a new file in the `stt/` directory (e.g., `your_stt.py`) that inherits from `BaseSTT`:
+
+ ```python
+ from stt.stt_base import BaseSTT, STTResult
+ import numpy as np
+
+ class YourSTT(BaseSTT):
+     model_name = "YourSTT"
+     model = None
+     is_loaded = False
+     config = {}
+
+     @classmethod
+     def load_model(cls, **kwargs):
+         # Initialize your STT service (your_stt_client is a placeholder)
+         cls.model = your_stt_client()
+         cls.is_loaded = True
+
+     @classmethod
+     def transcribe_audio(cls, audio_data, sample_rate=None):
+         # Implement transcription logic
+         result = cls.model.transcribe(audio_data)
+         return STTResult(text=result.text, confidence=result.confidence)
+ ```
+
+ ### Step 2: Register Your Model
+
+ Add your model to the registry in `gradio_voice_transcriber_clean.py`:
+
+ ```python
+ # Import your model
+ from stt.your_stt import YourSTT
+
+ # Add to registry
+ STT_MODELS = {
+     "WhisperSTT": WhisperSTT,
+     "YourSTT": YourSTT,  # Add this line
+ }
+ ```
+
+ ### Step 3: Configure Model Options
+
+ Update the `ModelManager.get_model_options()` method:
+
+ ```python
+ @staticmethod
+ def get_model_options(model_name: str) -> Dict[str, Any]:
+     if model_name == "YourSTT":
+         return {
+             "model_sizes": ["small", "large"],
+             "supports_api": True,
+             "languages": [("English", "en"), ("Spanish", "es")],
+             "default_params": {"temperature": 0.0}
+         }
+     # ... existing code
+ ```
+
+ ### Step 4: Handle Model Loading
+
+ Update the loading logic in `ModelManager.load_model()`:
+
+ ```python
+ if model_name == "YourSTT":
+     api_key = kwargs.get("api_key", "")
+     model_size = kwargs.get("model_size", "small")
+
+     YourSTT.load_model(api_key=api_key, model_size=model_size)
+     status = f"✅ {model_name} loaded successfully"
+ ```
+
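+ Once registered and loaded, the interface drives every model through the same entry point. An end-to-end smoke test of that path, as a minimal sketch (it assumes the `YourSTT` template above has working `load_model` and `transcribe_audio` implementations):
+
+ ```python
+ import numpy as np
+ from stt.your_stt import YourSTT
+
+ YourSTT.load_model(model_size="small")
+ dummy_audio = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
+ result = YourSTT.transcribe_audio(dummy_audio, 16000)
+ print(result.text, result.confidence)
+ ```
+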
+ ## 📝 Real Examples
+
+ ### Azure Speech Service
+
+ ```python
+ import azure.cognitiveservices.speech as speechsdk
+
+ class AzureSTT(BaseSTT):
+     model_name = "AzureSTT"
+
+     @classmethod
+     def load_model(cls, subscription_key, region):
+         speech_config = speechsdk.SpeechConfig(
+             subscription=subscription_key,
+             region=region
+         )
+         cls.model = speech_config
+         cls.is_loaded = True
+
+     @classmethod
+     def transcribe_audio(cls, audio_data, sample_rate=None):
+         # Convert audio and send to Azure
+         # Return STTResult with transcription
+         pass
+ ```
+
+ ### Google Cloud Speech
+
+ ```python
+ import os
+ from google.cloud import speech
+
+ class GoogleSTT(BaseSTT):
+     model_name = "GoogleSTT"
+
+     @classmethod
+     def load_model(cls, credentials_path):
+         # The client reads credentials from this environment variable
+         os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
+         cls.model = speech.SpeechClient()
+         cls.is_loaded = True
+
+     @classmethod
+     def transcribe_audio(cls, audio_data, sample_rate=None):
+         # Process with Google Cloud Speech
+         pass
+ ```
+
+ ### AssemblyAI
+
+ ```python
+ import assemblyai as aai
+
+ class AssemblyAISTT(BaseSTT):
+     model_name = "AssemblyAISTT"
+
+     @classmethod
+     def load_model(cls, api_key):
+         aai.settings.api_key = api_key
+         cls.model = aai.Transcriber()
+         cls.is_loaded = True
+
+     @classmethod
+     def transcribe_audio(cls, audio_data, sample_rate=None):
+         # Save audio temporarily and transcribe
+         pass
+ ```
+
+ ## 🎯 Best Practices
+
+ ### 1. Error Handling
+ ```python
+ @classmethod
+ def load_model(cls, **kwargs):
+     try:
+         # Model loading logic
+         cls.is_loaded = True
+     except Exception as e:
+         cls.is_loaded = False
+         raise RuntimeError(f"Failed to load {cls.model_name}: {e}")
+ ```
+
+ ### 2. Configuration Management
+ ```python
+ class YourSTT(BaseSTT):
+     config = {
+         "default_language": "en",
+         "timeout": 30,
+         "retry_count": 3
+     }
+
+     @classmethod
+     def set_language(cls, language):
+         cls.config["default_language"] = language
+ ```
+
+ ### 3. Audio Format Handling
+ ```python
+ @classmethod
+ def transcribe_audio(cls, audio_data, sample_rate=None):
+     # Handle numpy arrays
+     if isinstance(audio_data, np.ndarray):
+         # Convert to required format (audio_to_bytes is your own helper)
+         audio_bytes = audio_to_bytes(audio_data, sample_rate)
+     else:
+         # Handle file paths
+         with open(audio_data, 'rb') as f:
+             audio_bytes = f.read()
+
+     # Transcribe and return result
+ ```
+
+ ### 4. Metadata and Confidence
+ ```python
+ return STTResult(
+     text=transcription,
+     confidence=confidence_score,
+     processing_time=processing_time,
+     metadata={
+         "model": cls.model_name,
+         "language_detected": detected_language,
+         "audio_duration": duration,
+         "service_info": additional_info
+     }
+ )
+ ```
+
+ ## 🚀 Testing Your Integration
+
+ 1. **Unit Test Your STT Class**:
+ ```python
+ def test_your_stt():
+     YourSTT.load_model(api_key="test")
+     dummy_audio = np.random.randn(16000).astype(np.float32)
+     result = YourSTT.transcribe_audio(dummy_audio, 16000)
+     assert result.text is not None
+ ```
+
+ 2. **Test in the Gradio Interface**:
+    - Run `python gradio_voice_transcriber_clean.py`
+    - Select your model from the dropdown
+    - Load it and test with audio
+
+ ## 🛠️ Advanced Features
+
+ ### Custom UI Components
+
+ You can add model-specific UI components by extending the interface:
+
+ ```python
+ # Add custom fields for your model
+ if model_name == "YourSTT":
+     custom_setting = gr.Slider(
+         minimum=0, maximum=1, value=0.5,
+         label="Custom Setting"
+     )
+ ```
+
+ ### Background Processing
+
+ For long-running transcriptions:
+
+ ```python
+ import threading
+
+ @classmethod
+ def transcribe_audio_async(cls, audio_data, callback):
+     # Run the transcription in a background thread and hand the
+     # result to the callback when it finishes
+     def _worker():
+         callback(cls.transcribe_audio(audio_data))
+     threading.Thread(target=_worker, daemon=True).start()
+ ```
+
+ ## 📊 Currently Available Models
+
+ - **WhisperSTT**: OpenAI Whisper (local + API)
+ - **ExampleCustomSTT**: Template for new integrations
+
+ ## 🎯 Next Steps
+
+ 1. Choose your STT service
+ 2. Follow the integration pattern
+ 3. Test thoroughly
+ 4. Contribute back to the project!
+
+ The modular design makes it easy to support any STT service while maintaining a consistent user experience.
app.py CHANGED
@@ -1,1106 +1,1264 @@
1
- #!/usr/bin/env python3
2
- """
3
- Modular Gradio Voice Transcriber
4
-
5
- A flexible web interface for voice transcription supporting multiple STT models.
6
- Easily extensible to support any STT implementation that follows the BaseSTT interface.
7
-
8
- Usage:
9
- python gradio_voice_transcriber_clean.py
10
- """
11
-
12
- import gradio as gr
13
- import numpy as np
14
- import logging
15
- import time
16
- from typing import Tuple, Optional, Dict, Any, Type, List, Union
17
- from pathlib import Path
18
-
19
- # Import base STT class and available implementations
20
- from stt.stt_base import BaseSTT, STTResult
21
- from stt.whisper_stt import WhisperSTT
22
-
23
- # Try to import Wav2Vec2 Arabic STT (optional)
24
- try:
25
- from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
26
- WAV2VEC2_AVAILABLE = True
27
- except ImportError:
28
- WAV2VEC2_AVAILABLE = False
29
-
30
- # Try to import HuBERT Arabic STT (optional)
31
- try:
32
- from stt.hubert_arabic_stt import HuBERTArabicSTT
33
- HUBERT_AVAILABLE = True
34
- except ImportError:
35
- HUBERT_AVAILABLE = False
36
-
37
- # Try to import Vosk STT (optional)
38
- try:
39
- from stt.vosk_stt import VoskSTT
40
- VOSK_AVAILABLE = True
41
- except ImportError:
42
- VOSK_AVAILABLE = False
43
-
44
- # Try to import Coqui STT (optional)
45
- try:
46
- from stt.coqui_stt import CoquiSTT
47
- COQUI_AVAILABLE = True
48
- except ImportError:
49
- COQUI_AVAILABLE = False
50
-
51
- # Try to import Tawasul STT (optional)
52
- try:
53
- from stt.tawasul_stt import TawasulSTT
54
- TAWASUL_AVAILABLE = True
55
- except ImportError:
56
- TAWASUL_AVAILABLE = False
57
-
58
- # Setup logging
59
- logging.basicConfig(level=logging.INFO)
60
- logger = logging.getLogger(__name__)
61
-
62
- # STT Model Registry - Add new models here
63
- STT_MODELS: Dict[str, Type[BaseSTT]] = {
64
- "WhisperSTT": WhisperSTT,
65
- }
66
-
67
- # Add Wav2Vec2 Arabic if available
68
- if WAV2VEC2_AVAILABLE:
69
- STT_MODELS["Wav2Vec2ArabicSTT"] = Wav2Vec2ArabicSTT
70
-
71
- # Add HuBERT Arabic if available
72
- if HUBERT_AVAILABLE:
73
- STT_MODELS["HuBERTArabicSTT"] = HuBERTArabicSTT
74
-
75
- # Add Vosk if available
76
- if VOSK_AVAILABLE:
77
- STT_MODELS["VoskSTT"] = VoskSTT
78
-
79
- # Add Coqui STT if available
80
- if COQUI_AVAILABLE:
81
- STT_MODELS["CoquiSTT"] = CoquiSTT
82
-
83
- # Add Tawasul STT if available
84
- if TAWASUL_AVAILABLE:
85
- STT_MODELS["TawasulSTT"] = TawasulSTT
86
-
87
- # Global state
88
- current_stt_model: Optional[Type[BaseSTT]] = None
89
- current_model_config: Dict[str, Any] = {}
90
-
91
-
92
- class AudioProcessor:
93
- """Handle audio preprocessing for better transcription quality."""
94
-
95
- @staticmethod
96
- def preprocess(audio_data: np.ndarray, sample_rate: int, target_sr: int = 16000) -> np.ndarray:
97
- """
98
- Preprocess audio for better transcription quality.
99
-
100
- Args:
101
- audio_data: Raw audio data
102
- sample_rate: Original sample rate
103
- target_sr: Target sample rate (default: 16000 for Whisper)
104
-
105
- Returns:
106
- Preprocessed audio data
107
- """
108
- # Convert to mono if stereo
109
- if audio_data.ndim > 1:
110
- audio_data = np.mean(audio_data, axis=1)
111
-
112
- # Normalize to float32 [-1, 1]
113
- if audio_data.dtype == np.int16:
114
- audio_data = audio_data.astype(np.float32) / 32768.0
115
- elif audio_data.dtype == np.int32:
116
- audio_data = audio_data.astype(np.float32) / 2147483648.0
117
- else:
118
- audio_data = audio_data.astype(np.float32)
119
-
120
- # Clip to prevent overflow
121
- audio_data = np.clip(audio_data, -1.0, 1.0)
122
-
123
- # Remove DC offset
124
- audio_data = audio_data - np.mean(audio_data)
125
-
126
- # Simple noise gate (remove very quiet sections)
127
- if len(audio_data) > 0:
128
- threshold = np.max(np.abs(audio_data)) * 0.01
129
- audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
130
-
131
- # Resample if needed
132
- if sample_rate != target_sr:
133
- audio_data = AudioProcessor._resample(audio_data, sample_rate, target_sr)
134
-
135
- return audio_data
136
-
137
- @staticmethod
138
- def _resample(audio_data: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
139
- """Simple resampling (prefer librosa if available)."""
140
- try:
141
- import librosa
142
- return librosa.resample(audio_data, orig_sr=orig_sr, target_sr=target_sr)
143
- except ImportError:
144
- # Simple resampling fallback
145
- if orig_sr > target_sr:
146
- step = orig_sr // target_sr
147
- return audio_data[::step]
148
- else:
149
- repeat_factor = target_sr // orig_sr
150
- return np.repeat(audio_data, repeat_factor)
151
-
152
- @staticmethod
153
- def _preprocess_audio(audio_path: str) -> Tuple[np.ndarray, int]:
154
- """
155
- Preprocess audio file for STT models that need torch.Tensor input.
156
-
157
- Args:
158
- audio_path: Path to audio file
159
-
160
- Returns:
161
- Tuple of (audio_tensor_as_numpy, sample_rate) that can be converted to torch.Tensor
162
- """
163
- try:
164
- import librosa
165
- import soundfile as sf
166
-
167
- # Try to load with librosa first (more robust)
168
- try:
169
- audio_data, sample_rate = librosa.load(audio_path, sr=16000)
170
- except Exception:
171
- # Fallback to soundfile
172
- audio_data, sample_rate = sf.read(audio_path)
173
- if sample_rate != 16000:
174
- audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
175
- sample_rate = 16000
176
-
177
- # Convert to mono if needed
178
- if audio_data.ndim > 1:
179
- audio_data = np.mean(audio_data, axis=1)
180
-
181
- # Normalize audio to [-1, 1]
182
- if audio_data.max() > 1.0:
183
- audio_data = audio_data / audio_data.max()
184
-
185
- # Remove DC offset
186
- audio_data = audio_data - np.mean(audio_data)
187
-
188
- # Apply noise gate for very quiet audio
189
- threshold = np.max(np.abs(audio_data)) * 0.01
190
- audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
191
-
192
- # Convert to float32 for compatibility
193
- audio_data = audio_data.astype(np.float32)
194
-
195
- return audio_data, sample_rate
196
-
197
- except Exception as e:
198
- raise RuntimeError(f"Audio preprocessing failed: {str(e)}")
199
-
200
- @staticmethod
201
- def _preprocess_audio_torch(audio_path: str):
202
- """
203
- Preprocess audio file and return torch.Tensor for PyTorch-based STT models.
204
-
205
- Args:
206
- audio_path: Path to audio file
207
-
208
- Returns:
209
- Tuple of (audio_tensor, sample_rate) where audio_tensor is torch.Tensor
210
- """
211
- try:
212
- import torch
213
-
214
- # Get numpy array first
215
- audio_data, sample_rate = AudioProcessor._preprocess_audio(audio_path)
216
-
217
- # Convert to torch tensor
218
- audio_tensor = torch.FloatTensor(audio_data)
219
-
220
- return audio_tensor, sample_rate
221
-
222
- except ImportError:
223
- raise RuntimeError("PyTorch not available. Install with: pip install torch")
224
- except Exception as e:
225
- raise RuntimeError(f"Torch audio preprocessing failed: {str(e)}")
226
-
227
- @staticmethod
228
- def analyze_quality(audio_data: np.ndarray, sample_rate: int) -> Dict[str, Any]:
229
- """Analyze audio quality and provide feedback."""
230
- if audio_data.ndim > 1:
231
- audio_data = np.mean(audio_data, axis=1)
232
-
233
- duration = len(audio_data) / sample_rate
234
- max_amp = np.max(np.abs(audio_data))
235
- mean_amp = np.mean(np.abs(audio_data))
236
-
237
- # Check for clipping and silence
238
- clipping_ratio = np.sum(np.abs(audio_data) > 0.95) / len(audio_data)
239
- silence_threshold = max_amp * 0.01
240
- silence_ratio = np.sum(np.abs(audio_data) < silence_threshold) / len(audio_data)
241
-
242
- return {
243
- "duration": duration,
244
- "max_amplitude": max_amp,
245
- "mean_amplitude": mean_amp,
246
- "clipping_ratio": clipping_ratio,
247
- "silence_ratio": silence_ratio,
248
- "sample_rate": sample_rate,
249
- "is_good_quality": (
250
- duration > 1.0 and
251
- 0.1 < max_amp < 0.9 and
252
- clipping_ratio < 0.01 and
253
- silence_ratio < 0.5
254
- )
255
- }
256
-
257
-
258
- class ModelManager:
259
- """Handle STT model registration and loading."""
260
-
261
- @staticmethod
262
- def get_available_models() -> List[str]:
263
- """Get list of available STT model names."""
264
- return list(STT_MODELS.keys())
265
-
266
- @staticmethod
267
- def get_model_options(model_name: str) -> Dict[str, Any]:
268
- """Get model-specific configuration options."""
269
- if model_name == "WhisperSTT":
270
- return {
271
- "model_sizes": ["tiny", "base", "small", "medium", "large"],
272
- "supports_api": True,
273
- "languages": [
274
- ("Auto-detect", "auto"),
275
- ("English", "en"),
276
- ("Spanish", "es"),
277
- ("French", "fr"),
278
- ("German", "de"),
279
- ("Italian", "it"),
280
- ("Portuguese", "pt"),
281
- ("Russian", "ru"),
282
- ("Japanese", "ja"),
283
- ("Korean", "ko"),
284
- ("Chinese", "zh"),
285
- ("Dutch", "nl"),
286
- ("Arabic", "ar"),
287
- ("Hindi", "hi")
288
- ],
289
- "default_params": {
290
- "temperature": 0.0,
291
- "beam_size": 5,
292
- "best_of": 5,
293
- "patience": 2.0,
294
- "condition_on_previous_text": True,
295
- }
296
- }
297
-
298
- elif model_name == "Wav2Vec2ArabicSTT":
299
- return {
300
- "model_sizes": [
301
- ("Arabic Standard", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
302
- ("Multilingual", "facebook/wav2vec2-large-xlsr-53"),
303
- ("English Fallback", "facebook/wav2vec2-base-960h"),
304
- ("Arabic Egyptian (Experimental)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian")
305
- ],
306
- "supports_api": False,
307
- "supports_hf_token": True,
308
- "languages": [
309
- ("Arabic Egyptian", "ar-EG"),
310
- ("Arabic Standard", "ar"),
311
- ("Auto-detect", "auto"),
312
- ],
313
- "device_options": ["auto", "cpu", "cuda"],
314
- "default_params": {
315
- "device": "auto",
316
- "chunk_length": 20,
317
- "return_confidence": True,
318
- }
319
- }
320
-
321
- elif model_name == "VoskSTT":
322
- return {
323
- "model_sizes": [
324
- ("English US Small (40MB)", "vosk-model-small-en-us-0.15"),
325
- ("English US Large (1.8GB)", "vosk-model-en-us-0.22"),
326
- ("Arabic (318MB)", "vosk-model-ar-mgb2-0.4"),
327
- ("French (1.4GB)", "vosk-model-fr-0.22"),
328
- ("German (1.2GB)", "vosk-model-de-0.21"),
329
- ("Spanish (1.4GB)", "vosk-model-es-0.42"),
330
- ("Russian Large (1.5GB)", "vosk-model-ru-0.42"),
331
- ("Russian Small (45MB)", "vosk-model-small-ru-0.22"),
332
- ("Chinese Small (42MB)", "vosk-model-small-cn-0.22"),
333
- ],
334
- "supports_api": False,
335
- "supports_auto_download": True,
336
- "languages": [
337
- ("Auto (based on model)", "auto"),
338
- ("English", "en"),
339
- ("Arabic", "ar"),
340
- ("French", "fr"),
341
- ("German", "de"),
342
- ("Spanish", "es"),
343
- ("Russian", "ru"),
344
- ("Chinese", "zh"),
345
- ],
346
- "default_params": {
347
- "auto_download": True,
348
- "return_confidence": True,
349
- "return_words": True,
350
- }
351
- }
352
-
353
- elif model_name == "HuBERTArabicSTT":
354
- return {
355
- "model_sizes": [
356
- ("Arabic Egyptian (HuBERT)", "omarxadel/hubert-large-arabic-egyptian"),
357
- ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
358
- ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
359
- ("Arabic MSA", "facebook/wav2vec2-large-xlsr-53")
360
- ],
361
- "supports_api": False,
362
- "supports_hf_token": True,
363
- "languages": [
364
- ("Arabic Egyptian", "ar-EG"),
365
- ("Arabic Standard", "ar"),
366
- ("Auto-detect", "auto"),
367
- ],
368
- "device_options": ["auto", "cpu", "cuda"],
369
- "default_params": {
370
- "device": "auto",
371
- "chunk_length": 20,
372
- "return_confidence": True,
373
- "max_audio_length": 120
374
- }
375
- }
376
-
377
- elif model_name == "CoquiSTT":
378
- return {
379
- "model_sizes": [
380
- ("English Large Vocab", "english-large"),
381
- ("English Huge Vocab", "english-huge"),
382
- ("German", "german"),
383
- ("French", "french"),
384
- ("Spanish", "spanish")
385
- ],
386
- "supports_api": False,
387
- "supports_auto_download": True,
388
- "languages": [
389
- ("English", "en"),
390
- ("German", "de"),
391
- ("French", "fr"),
392
- ("Spanish", "es"),
393
- ("Auto (based on model)", "auto"),
394
- ],
395
- "default_params": {
396
- "auto_download": True,
397
- "beam_width": 512,
398
- "lm_alpha": 0.931289039105002,
399
- "lm_beta": 1.1834137581510284,
400
- "return_confidence": True,
401
- "return_timestamps": False,
402
- }
403
- }
404
-
405
- elif model_name == "TawasulSTT":
406
- return {
407
- "model_sizes": [
408
- ("Tawasul STT V0 (Arabic)", "Kareem35/Tawasul-STT-V0"),
409
- ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
410
- ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
411
- ("Multilingual Fallback", "facebook/wav2vec2-large-xlsr-53")
412
- ],
413
- "supports_api": False,
414
- "supports_hf_token": True,
415
- "languages": [
416
- ("Arabic Standard", "ar"),
417
- ("Arabic Egyptian", "ar-EG"),
418
- ("Arabic Saudi", "ar-SA"),
419
- ("Arabic Jordanian", "ar-JO"),
420
- ("Arabic Lebanese", "ar-LB"),
421
- ("Arabic Syrian", "ar-SY"),
422
- ("Arabic Iraqi", "ar-IQ"),
423
- ("Auto-detect", "auto"),
424
- ],
425
- "device_options": ["auto", "cpu", "cuda"],
426
- "default_params": {
427
- "device": "auto",
428
- "chunk_length": 20,
429
- "return_confidence": True,
430
- "max_audio_length": 300
431
- }
432
- }
433
-
434
- # Default options for other models
435
- return {
436
- "model_sizes": ["default"],
437
- "supports_api": False,
438
- "languages": [("Auto-detect", "auto")],
439
- "default_params": {}
440
- }
441
-
442
- @staticmethod
443
- def load_model(model_name: str, **kwargs) -> str:
444
- """Load specified STT model with configuration."""
445
- global current_stt_model, current_model_config
446
-
447
- if model_name not in STT_MODELS:
448
- return f"❌ Unknown model: {model_name}. Available: {list(STT_MODELS.keys())}"
449
-
450
- try:
451
- model_class = STT_MODELS[model_name]
452
-
453
- # Handle TawasulSTT as static class (don't instantiate)
454
- if model_name == "TawasulSTT":
455
- model_instance = model_class # Use class directly for static methods
456
- else:
457
- # Instantiate the model for instance-based classes
458
- model_instance = model_class()
459
-
460
- if model_name == "WhisperSTT":
461
- # Handle WhisperSTT specific loading
462
- model_size = kwargs.get("model_size", "base")
463
- use_api = kwargs.get("use_api", False)
464
- api_key = kwargs.get("api_key", "")
465
-
466
- if use_api and not api_key.strip():
467
- return "❌ Error: API key required for API mode"
468
-
469
- # Load with optimized parameters
470
- load_params = {
471
- "model_size": model_size,
472
- "use_api": use_api,
473
- }
474
-
475
- if api_key:
476
- load_params["api_key"] = api_key.strip()
477
-
478
- # Add quality optimization parameters for local models
479
- if not use_api:
480
- load_params.update({
481
- "temperature": 0.0,
482
- "beam_size": 5,
483
- "best_of": 5,
484
- "patience": 2.0,
485
- "condition_on_previous_text": True,
486
- })
487
-
488
- model_instance.load_model(**load_params)
489
-
490
- current_model_config = {
491
- "model_name": model_name,
492
- "model_size": model_size,
493
- "use_api": use_api
494
- }
495
-
496
- status = f"βœ… {model_name} ({'API' if use_api else model_size}) loaded successfully"
497
-
498
- elif model_name == "Wav2Vec2ArabicSTT":
499
- # Handle Wav2Vec2 Arabic specific loading
500
- device = kwargs.get("device", "auto")
501
- chunk_length = kwargs.get("chunk_length", 20)
502
- hf_token = kwargs.get("hf_token", "")
503
- model_id = kwargs.get("model_size", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
504
-
505
- load_params = {
506
- "device": device,
507
- "chunk_length": chunk_length,
508
- "model_id": model_id,
509
- }
510
-
511
- if hf_token:
512
- load_params["hf_token"] = hf_token.strip()
513
-
514
- model_instance.load_model(**load_params)
515
-
516
- current_model_config = {
517
- "model_name": model_name,
518
- "model_id": model_id,
519
- "device": device,
520
- "chunk_length": chunk_length
521
- }
522
-
523
- # Extract model name for display
524
- model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
525
- status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
526
-
527
- elif model_name == "VoskSTT":
528
- # Handle VoskSTT specific loading
529
- model_name_param = kwargs.get("model_size", "vosk-model-small-en-us-0.15")
530
- auto_download = kwargs.get("auto_download", True)
531
-
532
- load_params = {
533
- "model_name": model_name_param,
534
- "auto_download": auto_download,
535
- }
536
-
537
- model_instance.load_model(**load_params)
538
-
539
- current_model_config = {
540
- "model_name": model_name,
541
- "model_name_param": model_name_param,
542
- "auto_download": auto_download
543
- }
544
-
545
- status = f"βœ… {model_name} ({model_name_param}) loaded successfully"
546
-
547
- elif model_name == "HuBERTArabicSTT":
548
- # Handle HuBERT Arabic specific loading
549
- device = kwargs.get("device", "auto")
550
- chunk_length = kwargs.get("chunk_length", 20)
551
- hf_token = kwargs.get("hf_token", "")
552
- model_id = kwargs.get("model_size", "omarxadel/hubert-large-arabic-egyptian")
553
- max_audio_length = kwargs.get("max_audio_length", 120)
554
-
555
- load_params = {
556
- "device": device,
557
- "chunk_length": chunk_length,
558
- "model_id": model_id,
559
- "max_audio_length": max_audio_length,
560
- }
561
-
562
- if hf_token:
563
- load_params["hf_token"] = hf_token.strip()
564
-
565
- model_instance.load_model(**load_params)
566
-
567
- current_model_config = {
568
- "model_name": model_name,
569
- "model_id": model_id,
570
- "device": device,
571
- "chunk_length": chunk_length,
572
- "max_audio_length": max_audio_length
573
- }
574
-
575
- # Extract model name for display
576
- model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
577
- status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
578
-
579
- elif model_name == "CoquiSTT":
580
- # Handle Coqui STT specific loading
581
- model_name_param = kwargs.get("model_size", "english-large")
582
- auto_download = kwargs.get("auto_download", True)
583
- beam_width = kwargs.get("beam_width", 512)
584
- lm_alpha = kwargs.get("lm_alpha", 0.931289039105002)
585
- lm_beta = kwargs.get("lm_beta", 1.1834137581510284)
586
-
587
- load_params = {
588
- "model_name": model_name_param,
589
- "auto_download": auto_download,
590
- "beam_width": beam_width,
591
- "lm_alpha": lm_alpha,
592
- "lm_beta": lm_beta,
593
- }
594
-
595
- model_instance.load_model(**load_params)
596
-
597
- current_model_config = {
598
- "model_name": model_name,
599
- "model_name_param": model_name_param,
600
- "auto_download": auto_download,
601
- "beam_width": beam_width,
602
- "lm_alpha": lm_alpha,
603
- "lm_beta": lm_beta
604
- }
605
-
606
- status = f"βœ… {model_name} ({model_name_param}) loaded successfully"
607
-
608
- elif model_name == "TawasulSTT":
609
- # Handle Tawasul STT specific loading (static class)
610
- device = kwargs.get("device", "auto")
611
- chunk_length = kwargs.get("chunk_length", 20)
612
- hf_token = kwargs.get("hf_token", "")
613
- model_id = kwargs.get("model_size", "Kareem35/Tawasul-STT-V0")
614
- max_audio_length = kwargs.get("max_audio_length", 300)
615
-
616
- load_params = {
617
- "device": device,
618
- "chunk_length": chunk_length,
619
- "model_id": model_id,
620
- "max_audio_length": max_audio_length,
621
- }
622
-
623
- if hf_token:
624
- load_params["hf_token"] = hf_token.strip()
625
-
626
- # Call static method directly
627
- model_class.load_model(**load_params)
628
-
629
- current_model_config = {
630
- "model_name": model_name,
631
- "model_id": model_id,
632
- "device": device,
633
- "chunk_length": chunk_length,
634
- "max_audio_length": max_audio_length
635
- }
636
-
637
- # Extract model name for display
638
- model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
639
- status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
640
-
641
- else:
642
- # Generic model loading for future STT models
643
- model_instance.load_model(**kwargs)
644
- current_model_config = {"model_name": model_name, **kwargs}
645
- status = f"βœ… {model_name} loaded successfully"
646
-
647
- current_stt_model = model_instance
648
- logger.info(status)
649
- return status
650
-
651
- except Exception as e:
652
- error_msg = f"❌ Error loading {model_name}: {str(e)}"
653
- logger.error(error_msg)
654
- return error_msg
655
-
656
- @staticmethod
657
- def get_model_info() -> str:
658
- """Get information about available and loaded models."""
659
- info = f"**Available Models:** {', '.join(STT_MODELS.keys())}\n\n"
660
-
661
- if current_stt_model:
662
- model_info = current_stt_model.get_model_info()
663
- # Handle different key names for model name
664
- model_name = model_info.get('model_name') or model_info.get('name', 'Unknown')
665
- info += f"**Currently Loaded:** {model_name}\n"
666
- info += f"**Status:** {'βœ… Ready' if model_info['is_loaded'] else '❌ Not loaded'}\n"
667
- info += f"**Config:** {current_model_config}"
668
- else:
669
- info += "**Currently Loaded:** None"
670
-
671
- return info
672
-
673
-
674
- class TranscriptionEngine:
675
- """Handle audio transcription using the loaded STT model."""
676
-
677
- @staticmethod
678
- def transcribe(audio_input: Tuple[int, np.ndarray],
679
- language: Optional[str] = None) -> Tuple[str, str, str]:
680
- """
681
- Transcribe audio input using the currently loaded STT model.
682
-
683
- Args:
684
- audio_input: Tuple of (sample_rate, audio_data) from Gradio
685
- language: Language code for transcription
686
-
687
- Returns:
688
- Tuple of (transcription, confidence_info, processing_info)
689
- """
690
- if audio_input is None:
691
- return "❌ No audio provided", "", ""
692
-
693
- if not current_stt_model or not current_stt_model.is_loaded:
694
- return "❌ No STT model loaded. Please load a model first.", "", ""
695
-
696
- try:
697
- sample_rate, audio_data = audio_input
698
-
699
- # Preprocess audio
700
- processed_audio = AudioProcessor.preprocess(audio_data, sample_rate)
701
-
702
- # Quality checks
703
- quality = AudioProcessor.analyze_quality(processed_audio, 16000)
704
-
705
- if quality["duration"] < 0.5:
706
- return "❌ Audio too short (minimum 0.5 seconds)", "", ""
707
-
708
- if quality["max_amplitude"] < 0.001:
709
- return "❌ Audio too quiet or silent", "", f"Max amplitude: {quality['max_amplitude']:.6f}"
710
-
711
- # Set language for models that support it
712
- if hasattr(current_stt_model, 'set_language') and language and language != "auto":
713
- current_stt_model.set_language(language)
714
-
715
- # Transcribe using different approaches for different models
716
- start_time = time.time()
717
-
718
- # Check if this is TawasulSTT (static class) which needs file path
719
- if current_model_config.get('model_name') == 'TawasulSTT':
720
- # TawasulSTT needs a file path, so save audio to temporary file
721
- import tempfile
722
- import soundfile as sf
723
-
724
- with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
725
- temp_path = temp_file.name
726
- sf.write(temp_path, processed_audio, 16000)
727
-
728
- try:
729
- # Call TawasulSTT.transcribe() with file path
730
- transcription, confidence_info_raw, processing_info_raw = current_stt_model.transcribe(temp_path)
731
-
732
- # Create a result-like object for consistency
733
- class TempResult:
734
- def __init__(self, text, confidence=None, processing_time=None):
735
- self.text = text
736
- self.confidence = confidence
737
- self.processing_time = processing_time
738
-
739
- # Extract confidence from confidence_info_raw if available
740
- confidence_value = None
741
- if confidence_info_raw and "Confidence:" in confidence_info_raw:
742
- try:
743
- conf_str = confidence_info_raw.split("Confidence:")[1].strip()
744
- confidence_value = float(conf_str)
745
- except:
746
- confidence_value = None
747
-
748
- processing_time = time.time() - start_time
749
- result = TempResult(transcription, confidence_value, processing_time)
750
-
751
- finally:
752
- # Clean up temporary file
753
- import os
754
- try:
755
- os.unlink(temp_path)
756
- except:
757
- pass
758
- else:
759
- # For other STT models that use transcribe_audio
760
- result = current_stt_model.transcribe_audio(processed_audio, 16000)
761
-
762
- # Prepare output
763
- transcription = result.text.strip() if result.text else "No speech detected"
764
-
765
- # Filter out common false positives
766
- if transcription.lower() in ["you", "thank you.", "thanks for watching!", ""]:
767
- transcription = "πŸ”‡ No clear speech detected"
768
-
769
- # Confidence info
770
- confidence_info = ""
771
- if result.confidence is not None:
772
- confidence_info = f"Confidence: {result.confidence:.2%}"
773
- if result.confidence < 0.3:
774
- confidence_info += " (Low - consider re-recording)"
775
- else:
776
- confidence_info = "Confidence: N/A"
777
-
778
- # Processing info
779
- processing_info = f"Processing: {result.processing_time or 0:.2f}s\n"
780
- processing_info += f"Model: {current_model_config.get('model_name', 'Unknown')}\n"
781
- processing_info += f"Audio: {quality['duration']:.2f}s, {quality['max_amplitude']:.3f} amplitude\n"
782
- processing_info += f"Quality: {'βœ… Good' if quality['is_good_quality'] else '⚠️ Poor'}"
783
-
784
- return transcription, confidence_info, processing_info
785
-
786
- except Exception as e:
787
- error_msg = f"❌ Transcription error: {str(e)}"
788
- logger.error(error_msg)
789
- return error_msg, "", ""
790
-
791
-
792
- class GradioInterface:
793
- """Create and manage the Gradio web interface."""
794
-
795
- @staticmethod
796
- def create_interface():
797
- """Create the main Gradio interface."""
798
- with gr.Blocks(
799
- title="πŸŽ™οΈ Modular Voice Transcriber",
800
- theme=gr.themes.Soft()
801
- ) as demo:
802
-
803
- gr.Markdown(
804
- """
805
- # πŸŽ™οΈ Modular Voice Transcriber
806
-
807
- A flexible interface supporting multiple STT models.
808
- Easily extensible for new transcription engines.
809
- """
810
- )
811
-
812
- with gr.Row():
813
- # Model Configuration Panel
814
- with gr.Column(scale=1):
815
- gr.Markdown("### πŸ”§ Model Configuration")
816
-
817
- # Model selection
818
- model_selector = gr.Dropdown(
819
- choices=ModelManager.get_available_models(),
820
- value="WhisperSTT",
821
- label="STT Model",
822
- info="Choose your speech-to-text engine"
823
- )
824
-
825
- # Dynamic model options (will update based on selected model)
826
- model_size = gr.Dropdown(
827
- choices=["tiny", "base", "small", "medium", "large"],
828
- value="base",
829
- label="Model Size",
830
- visible=True
831
- )
832
-
833
- use_api = gr.Checkbox(
834
- label="Use API",
835
- info="Use cloud API instead of local model",
836
- visible=True
837
- )
838
-
839
- api_key = gr.Textbox(
840
- label="API Key",
841
- type="password",
842
- placeholder="Enter API key...",
843
- visible=False
844
- )
845
-
846
- # Device selection for models that support it
847
- device_selector = gr.Dropdown(
848
- choices=["auto", "cpu", "cuda"],
849
- value="auto",
850
- label="Device",
851
- info="Processing device (auto recommended)",
852
- visible=False
853
- )
854
-
855
- # HuggingFace token for private models
856
- hf_token = gr.Textbox(
857
- label="HuggingFace Token",
858
- type="password",
859
- placeholder="hf_...",
860
- info="Optional: For private or experimental models",
861
- visible=False
862
- )
863
-
864
- # Load button and status
865
- load_btn = gr.Button("πŸ”„ Load Model", variant="primary")
866
- load_status = gr.Textbox(
867
- label="Status",
868
- value="No model loaded",
869
- interactive=False
870
- )
871
-
872
- # Model info
873
- model_info = gr.Markdown(ModelManager.get_model_info())
874
-
875
- # Transcription Panel
876
- with gr.Column(scale=2):
877
- gr.Markdown("### 🎀 Voice Transcription")
878
-
879
- # Language selection
880
- language = gr.Dropdown(
881
- choices=[("Auto-detect", "auto"), ("English", "en")],
882
- value="auto",
883
- label="Language"
884
- )
885
-
886
- # Audio input
887
- audio_input = gr.Audio(
888
- label="Record or Upload Audio",
889
- type="numpy",
890
- format="wav"
891
- )
892
-
893
- # Action buttons
894
- with gr.Row():
895
- transcribe_btn = gr.Button("🎯 Transcribe", variant="primary")
896
- quality_btn = gr.Button("πŸ“Š Check Quality")
897
- clear_btn = gr.Button("πŸ—‘οΈ Clear")
898
-
899
- # Outputs
900
- transcription_output = gr.Textbox(
901
- label="πŸ“ Transcription",
902
- lines=4,
903
- placeholder="Transcribed text will appear here..."
904
- )
905
-
906
- with gr.Row():
907
- confidence_output = gr.Textbox(
908
- label="🎯 Confidence",
909
- interactive=False
910
- )
911
- processing_output = gr.Textbox(
912
- label="⏱️ Processing Info",
913
- interactive=False
914
- )
915
-
916
- quality_output = gr.Markdown(
917
- value="",
918
- visible=False,
919
- label="πŸ“Š Audio Quality Analysis"
920
- )
921
-
922
- # Usage tips
923
- gr.Markdown(
924
- """
925
- ### πŸ’‘ Tips for Best Results
926
- - **Record clearly** in a quiet environment
927
- - **Speak at normal pace** - not too fast or slow
928
- - **Use good audio quality** - avoid background noise
929
- - **Try different models** - larger models are more accurate but slower
930
- - **Check quality analysis** to identify audio issues
931
- """
932
- )
933
-
934
- # Event handlers
935
- def update_model_options(model_name: str):
936
- """Update interface based on selected model."""
937
- options = ModelManager.get_model_options(model_name)
938
-
939
- # Determine visibility of components
940
- show_model_size = len(options["model_sizes"]) > 1
941
- show_api = options["supports_api"]
942
- show_device = "device_options" in options
943
- show_hf_token = options.get("supports_hf_token", False)
944
-
945
- # Extract model size options (handle both simple lists and tuples)
946
- if show_model_size and isinstance(options["model_sizes"][0], tuple):
947
- # Model sizes are tuples of (display_name, value)
948
- size_choices = options["model_sizes"]
949
- size_value = size_choices[0][1] # Use the value from first tuple
950
- else:
951
- # Model sizes are simple strings
952
- size_choices = options["model_sizes"]
953
- size_value = size_choices[0]
954
-
955
- return (
956
- gr.update(choices=size_choices, value=size_value, visible=show_model_size),
957
- gr.update(visible=show_api),
958
- gr.update(visible=False), # Hide API key initially
959
- gr.update(choices=options["languages"], value="auto"),
960
- gr.update(
961
- choices=options.get("device_options", ["auto"]),
962
- value="auto",
963
- visible=show_device
964
- ),
965
- gr.update(visible=show_hf_token)
966
- )
967
-
968
- def toggle_api_key(use_api: bool):
969
- """Show/hide API key field."""
970
- return gr.update(visible=use_api)
971
-
972
- def load_selected_model(model_name: str, model_size: str, use_api: bool, api_key: str, device: str, hf_token: str):
973
- """Load the selected model with configuration."""
974
- kwargs = {"model_size": model_size, "use_api": use_api}
975
- if api_key:
976
- kwargs["api_key"] = api_key
977
- if device and device != "auto":
978
- kwargs["device"] = device
979
- if hf_token:
980
- kwargs["hf_token"] = hf_token
981
- return ModelManager.load_model(model_name, **kwargs)
982
-
983
- def analyze_audio_quality(audio_input):
984
- """Analyze and display audio quality."""
985
- if audio_input is None:
986
- return "", gr.update(visible=False)
987
-
988
- sample_rate, audio_data = audio_input
989
- quality = AudioProcessor.analyze_quality(audio_data, sample_rate)
990
-
991
- report = f"""
992
- **πŸ“Š Audio Quality Analysis:**
993
- - Duration: {quality['duration']:.2f}s
994
- - Max amplitude: {quality['max_amplitude']:.3f}
995
- - Clipping: {quality['clipping_ratio']:.2%}
996
- - Silence ratio: {quality['silence_ratio']:.2%}
997
- - Overall quality: {'βœ… Good' if quality['is_good_quality'] else '⚠️ Needs improvement'}
998
-
999
- **πŸ”§ Recommendations:**
1000
- {_get_quality_recommendations(quality)}
1001
- """
1002
-
1003
- return report, gr.update(visible=True)
1004
-
1005
- # Connect events
1006
- model_selector.change(
1007
- fn=update_model_options,
1008
- inputs=model_selector,
1009
- outputs=[model_size, use_api, api_key, language, device_selector, hf_token]
1010
- )
1011
-
1012
- use_api.change(
1013
- fn=toggle_api_key,
1014
- inputs=use_api,
1015
- outputs=api_key
1016
- )
1017
-
1018
- load_btn.click(
1019
- fn=load_selected_model,
1020
- inputs=[model_selector, model_size, use_api, api_key, device_selector, hf_token],
1021
- outputs=load_status
1022
- ).then(
1023
- fn=lambda: ModelManager.get_model_info(),
1024
- outputs=model_info
1025
- )
1026
-
1027
- transcribe_btn.click(
1028
- fn=TranscriptionEngine.transcribe,
1029
- inputs=[audio_input, language],
1030
- outputs=[transcription_output, confidence_output, processing_output]
1031
- )
1032
-
1033
- quality_btn.click(
1034
- fn=analyze_audio_quality,
1035
- inputs=audio_input,
1036
- outputs=[quality_output, quality_output]
1037
- )
1038
-
1039
- clear_btn.click(
1040
- fn=lambda: ("", "", "", "", gr.update(visible=False)),
1041
- outputs=[transcription_output, confidence_output, processing_output, quality_output, quality_output]
1042
- )
1043
-
1044
- # Auto-transcribe on audio change (optional)
1045
- audio_input.change(
1046
- fn=TranscriptionEngine.transcribe,
1047
- inputs=[audio_input, language],
1048
- outputs=[transcription_output, confidence_output, processing_output]
1049
- )
1050
-
1051
- return demo
1052
-
1053
-
1054
- def _get_quality_recommendations(quality: Dict[str, Any]) -> str:
1055
- """Generate quality recommendations based on analysis."""
1056
- recommendations = []
1057
-
1058
- if quality["duration"] < 1.0:
1059
- recommendations.append("β€’ Try recording for longer (1+ seconds)")
1060
-
1061
- if quality["max_amplitude"] < 0.1:
1062
- recommendations.append("β€’ Increase volume or move closer to microphone")
1063
- elif quality["max_amplitude"] > 0.9:
1064
- recommendations.append("β€’ Reduce volume to avoid clipping")
1065
-
1066
- if quality["clipping_ratio"] > 0.01:
1067
- recommendations.append("β€’ Audio is clipping - reduce input gain")
1068
-
1069
- if quality["silence_ratio"] > 0.5:
1070
- recommendations.append("β€’ Too much silence - record in quieter environment")
1071
-
1072
- if not recommendations:
1073
- recommendations.append("β€’ Audio quality looks good!")
1074
-
1075
- return "\n".join(recommendations)
1076
-
1077
-
1078
- def main():
1079
- """Main application entry point."""
1080
- # Check dependencies
1081
- print("πŸ” Checking dependencies...")
1082
-
1083
- try:
1084
- import gradio
1085
- print("βœ… Gradio available")
1086
- except ImportError:
1087
- print("❌ Gradio not installed. Run: pip install gradio")
1088
- return
1089
-
1090
- # Check available STT models
1091
- print(f"πŸ€– Available STT models: {ModelManager.get_available_models()}")
1092
-
1093
- # Create and launch interface
1094
- print("πŸš€ Launching Gradio interface...")
1095
- demo = GradioInterface.create_interface()
1096
-
1097
- demo.launch(
1098
- share=True, # Set to True for public sharing
1099
- server_name="127.0.0.1",
1100
- server_port=7860,
1101
- show_error=True
1102
- )
1103
-
1104
-
1105
- if __name__ == "__main__":
1106
  main()
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Modular Gradio Voice Transcriber
4
+
5
+ A flexible web interface for voice transcription supporting multiple STT models.
6
+ Easily extensible to support any STT implementation that follows the BaseSTT interface.
7
+
8
+ Usage:
9
+ python gradio_voice_transcriber_clean.py
10
+ """
11
+
12
+ import gradio as gr
13
+ import numpy as np
14
+ import logging
15
+ import time
16
+ from typing import Tuple, Optional, Dict, Any, Type, List, Union
17
+ from pathlib import Path
18
+
19
+ # Import base STT class and available implementations
20
+ from stt.stt_base import BaseSTT, STTResult
21
+ from stt.whisper_stt import WhisperSTT
22
+
23
+ # Try to import Wav2Vec2 Arabic STT (optional)
24
+ try:
25
+ from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
26
+ WAV2VEC2_AVAILABLE = True
27
+ except ImportError:
28
+ WAV2VEC2_AVAILABLE = False
29
+
30
+
31
+ # Try to import Chirp3 STT (optional)
32
+ try:
33
+ from stt.chirp3_stt import Chirp3STT
34
+ CHIRP3_AVAILABLE = True
35
+ except ImportError:
36
+ Chirp3STT = None
37
+ CHIRP3_AVAILABLE = False
38
+
39
+ # Try to import HuBERT Arabic STT (optional)
40
+ try:
41
+ from stt.hubert_arabic_stt import HuBERTArabicSTT
42
+ HUBERT_AVAILABLE = True
43
+ except ImportError:
44
+ HUBERT_AVAILABLE = False
45
+
46
+ # Try to import Vosk STT (optional)
47
+ try:
48
+ from stt.vosk_stt import VoskSTT
49
+ VOSK_AVAILABLE = True
50
+ except ImportError:
51
+ VOSK_AVAILABLE = False
52
+
53
+ # Try to import Coqui STT (optional)
54
+ try:
55
+ from stt.coqui_stt import CoquiSTT
56
+ COQUI_AVAILABLE = True
57
+ except ImportError:
58
+ COQUI_AVAILABLE = False
59
+
60
+ # Try to import Tawasul STT (optional)
61
+ try:
62
+ from stt.tawasul_stt import TawasulSTT
63
+ TAWASUL_AVAILABLE = True
64
+ except ImportError:
65
+ TAWASUL_AVAILABLE = False
66
+
67
+ # Setup logging
68
+ logging.basicConfig(level=logging.INFO)
69
+ logger = logging.getLogger(__name__)
70
+
71
+ # STT Model Registry - Add new models here
72
+ STT_MODELS: Dict[str, Type[BaseSTT]] = {
73
+ "WhisperSTT": WhisperSTT,
74
+ }
75
+
76
+ # Add Wav2Vec2 Arabic if available
77
+ if WAV2VEC2_AVAILABLE:
78
+ STT_MODELS["Wav2Vec2ArabicSTT"] = Wav2Vec2ArabicSTT
79
+
80
+ # Add HuBERT Arabic if available
81
+ if HUBERT_AVAILABLE:
82
+ STT_MODELS["HuBERTArabicSTT"] = HuBERTArabicSTT
83
+
84
+ # Add Vosk if available
85
+ if VOSK_AVAILABLE:
86
+ STT_MODELS["VoskSTT"] = VoskSTT
87
+
88
+ # Add Coqui STT if available
89
+ if COQUI_AVAILABLE:
90
+ STT_MODELS["CoquiSTT"] = CoquiSTT
91
+
92
+ # Add Tawasul STT if available
93
+ if TAWASUL_AVAILABLE:
94
+ STT_MODELS["TawasulSTT"] = TawasulSTT
95
+
96
+ # Global state
97
+ current_stt_model: Optional[Type[BaseSTT]] = None
98
+ current_model_config: Dict[str, Any] = {}
99
+
100
+
101
+ class AudioProcessor:
102
+ """Handle audio preprocessing for better transcription quality."""
103
+
104
+ @staticmethod
105
+ def preprocess(audio_data: np.ndarray, sample_rate: int, target_sr: int = 16000) -> np.ndarray:
106
+ """
107
+ Preprocess audio for better transcription quality.
108
+
109
+ Args:
110
+ audio_data: Raw audio data
111
+ sample_rate: Original sample rate
112
+ target_sr: Target sample rate (default: 16000 for Whisper)
113
+
114
+ Returns:
115
+ Preprocessed audio data
116
+ """
117
+ # Convert to mono if stereo
118
+ if audio_data.ndim > 1:
119
+ audio_data = np.mean(audio_data, axis=1)
120
+
121
+ # Normalize to float32 [-1, 1]
122
+ if audio_data.dtype == np.int16:
123
+ audio_data = audio_data.astype(np.float32) / 32768.0
124
+ elif audio_data.dtype == np.int32:
125
+ audio_data = audio_data.astype(np.float32) / 2147483648.0
126
+ else:
127
+ audio_data = audio_data.astype(np.float32)
128
+
129
+ # Clip to prevent overflow
130
+ audio_data = np.clip(audio_data, -1.0, 1.0)
131
+
132
+ # Remove DC offset
133
+ audio_data = audio_data - np.mean(audio_data)
134
+
135
+ # Simple noise gate (remove very quiet sections)
136
+ if len(audio_data) > 0:
137
+ threshold = np.max(np.abs(audio_data)) * 0.01
138
+ audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
139
+
140
+ # Resample if needed
141
+ if sample_rate != target_sr:
142
+ audio_data = AudioProcessor._resample(audio_data, sample_rate, target_sr)
143
+
144
+ return audio_data
145
+
146
+ @staticmethod
147
+ def _resample(audio_data: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
148
+ """Simple resampling (prefer librosa if available)."""
149
+ try:
150
+ import librosa
151
+ return librosa.resample(audio_data, orig_sr=orig_sr, target_sr=target_sr)
152
+ except ImportError:
153
+ # Naive fallback (decimation / sample repetition): only approximates the target rate and may alias; librosa is preferred
154
+ if orig_sr > target_sr:
155
+ step = orig_sr // target_sr
156
+ return audio_data[::step]
157
+ else:
158
+ repeat_factor = target_sr // orig_sr
159
+ return np.repeat(audio_data, repeat_factor)
160
+
161
+ @staticmethod
162
+ def _preprocess_audio(audio_path: str) -> Tuple[np.ndarray, int]:
163
+ """
164
+ Preprocess audio file for STT models that need torch.Tensor input.
165
+
166
+ Args:
167
+ audio_path: Path to audio file
168
+
169
+ Returns:
170
+ Tuple of (audio_tensor_as_numpy, sample_rate) that can be converted to torch.Tensor
171
+ """
172
+ try:
173
+ import librosa
174
+ import soundfile as sf
175
+
176
+ # Try to load with librosa first (more robust)
177
+ try:
178
+ audio_data, sample_rate = librosa.load(audio_path, sr=16000)
179
+ except Exception:
180
+ # Fallback to soundfile
181
+ audio_data, sample_rate = sf.read(audio_path)
182
+ if sample_rate != 16000:
183
+ audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=16000)
184
+ sample_rate = 16000
185
+
186
+ # Convert to mono if needed
187
+ if audio_data.ndim > 1:
188
+ audio_data = np.mean(audio_data, axis=1)
189
+
190
+ # Normalize audio to [-1, 1]
191
+ if audio_data.max() > 1.0:
192
+ audio_data = audio_data / audio_data.max()
193
+
194
+ # Remove DC offset
195
+ audio_data = audio_data - np.mean(audio_data)
196
+
197
+ # Apply noise gate for very quiet audio
198
+ threshold = np.max(np.abs(audio_data)) * 0.01
199
+ audio_data = np.where(np.abs(audio_data) < threshold, 0, audio_data)
200
+
201
+ # Convert to float32 for compatibility
202
+ audio_data = audio_data.astype(np.float32)
203
+
204
+ return audio_data, sample_rate
205
+
206
+ except Exception as e:
207
+ raise RuntimeError(f"Audio preprocessing failed: {str(e)}")
208
+
209
+ @staticmethod
210
+ def _preprocess_audio_torch(audio_path: str):
211
+ """
212
+ Preprocess audio file and return torch.Tensor for PyTorch-based STT models.
213
+
214
+ Args:
215
+ audio_path: Path to audio file
216
+
217
+ Returns:
218
+ Tuple of (audio_tensor, sample_rate) where audio_tensor is torch.Tensor
219
+ """
220
+ try:
221
+ import torch
222
+
223
+ # Get numpy array first
224
+ audio_data, sample_rate = AudioProcessor._preprocess_audio(audio_path)
225
+
226
+ # Convert to torch tensor
227
+ audio_tensor = torch.FloatTensor(audio_data)
228
+
229
+ return audio_tensor, sample_rate
230
+
231
+ except ImportError:
232
+ raise RuntimeError("PyTorch not available. Install with: pip install torch")
233
+ except Exception as e:
234
+ raise RuntimeError(f"Torch audio preprocessing failed: {str(e)}")
235
+
236
+ @staticmethod
237
+ def analyze_quality(audio_data: np.ndarray, sample_rate: int) -> Dict[str, Any]:
238
+ """Analyze audio quality and provide feedback."""
239
+ if audio_data.ndim > 1:
240
+ audio_data = np.mean(audio_data, axis=1)
241
+
242
+ duration = len(audio_data) / sample_rate
243
+ max_amp = np.max(np.abs(audio_data))
244
+ mean_amp = np.mean(np.abs(audio_data))
245
+
246
+ # Check for clipping and silence
247
+ clipping_ratio = np.sum(np.abs(audio_data) > 0.95) / len(audio_data)
248
+ silence_threshold = max_amp * 0.01
249
+ silence_ratio = np.sum(np.abs(audio_data) < silence_threshold) / len(audio_data)
250
+
251
+ return {
252
+ "duration": duration,
253
+ "max_amplitude": max_amp,
254
+ "mean_amplitude": mean_amp,
255
+ "clipping_ratio": clipping_ratio,
256
+ "silence_ratio": silence_ratio,
257
+ "sample_rate": sample_rate,
258
+ "is_good_quality": (
259
+ duration > 1.0 and
260
+ 0.1 < max_amp < 0.9 and
261
+ clipping_ratio < 0.01 and
262
+ silence_ratio < 0.5
263
+ )
264
+ }
265
+
266
+
267
+ class ModelManager:
268
+ """Handle STT model registration and loading."""
269
+
270
+ @staticmethod
271
+ def get_available_models() -> List[str]:
272
+ """Get list of available STT model names."""
273
+ return list(STT_MODELS.keys())
274
+
275
+ @staticmethod
276
+ def get_model_options(model_name: str) -> Dict[str, Any]:
277
+ """Get model-specific configuration options."""
278
+ if model_name == "WhisperSTT":
279
+ return {
280
+ "model_sizes": ["tiny", "base", "small", "medium", "large"],
281
+ "supports_api": True,
282
+ "languages": [
283
+ ("Auto-detect", "auto"),
284
+ ("English", "en"),
285
+ ("Spanish", "es"),
286
+ ("French", "fr"),
287
+ ("German", "de"),
288
+ ("Italian", "it"),
289
+ ("Portuguese", "pt"),
290
+ ("Russian", "ru"),
291
+ ("Japanese", "ja"),
292
+ ("Korean", "ko"),
293
+ ("Chinese", "zh"),
294
+ ("Dutch", "nl"),
295
+ ("Arabic", "ar"),
296
+ ("Hindi", "hi")
297
+ ],
298
+ "default_params": {
299
+ "temperature": 0.0,
300
+ "beam_size": 5,
301
+ "best_of": 5,
302
+ "patience": 2.0,
303
+ "condition_on_previous_text": True,
304
+ }
305
+ }
306
+
307
+ elif model_name == "Wav2Vec2ArabicSTT":
308
+ return {
309
+ "model_sizes": [
310
+ ("Arabic Standard", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
311
+ ("Multilingual", "facebook/wav2vec2-large-xlsr-53"),
312
+ ("English Fallback", "facebook/wav2vec2-base-960h"),
313
+ ("Arabic Egyptian (Experimental)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian")
314
+ ],
315
+ "supports_api": False,
316
+ "supports_hf_token": True,
317
+ "languages": [
318
+ ("Arabic Egyptian", "ar-EG"),
319
+ ("Arabic Standard", "ar"),
320
+ ("Auto-detect", "auto"),
321
+ ],
322
+ "device_options": ["auto", "cpu", "cuda"],
323
+ "default_params": {
324
+ "device": "auto",
325
+ "chunk_length": 20,
326
+ "return_confidence": True,
327
+ }
328
+ }
329
+
330
+ elif model_name == "VoskSTT":
331
+ return {
332
+ "model_sizes": [
333
+ ("English US Small (40MB)", "vosk-model-small-en-us-0.15"),
334
+ ("English US Large (1.8GB)", "vosk-model-en-us-0.22"),
335
+ ("Arabic (318MB)", "vosk-model-ar-mgb2-0.4"),
336
+ ("French (1.4GB)", "vosk-model-fr-0.22"),
337
+ ("German (1.2GB)", "vosk-model-de-0.21"),
338
+ ("Spanish (1.4GB)", "vosk-model-es-0.42"),
339
+ ("Russian Large (1.5GB)", "vosk-model-ru-0.42"),
340
+ ("Russian Small (45MB)", "vosk-model-small-ru-0.22"),
341
+ ("Chinese Small (42MB)", "vosk-model-small-cn-0.22"),
342
+ ],
343
+ "supports_api": False,
344
+ "supports_auto_download": True,
345
+ "languages": [
346
+ ("Auto (based on model)", "auto"),
347
+ ("English", "en"),
348
+ ("Arabic", "ar"),
349
+ ("French", "fr"),
350
+ ("German", "de"),
351
+ ("Spanish", "es"),
352
+ ("Russian", "ru"),
353
+ ("Chinese", "zh"),
354
+ ],
355
+ "default_params": {
356
+ "auto_download": True,
357
+ "return_confidence": True,
358
+ "return_words": True,
359
+ }
360
+ }
361
+
362
+ elif model_name == "HuBERTArabicSTT":
363
+ return {
364
+ "model_sizes": [
365
+ ("Arabic Egyptian (HuBERT)", "omarxadel/hubert-large-arabic-egyptian"),
366
+ ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
367
+ ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
368
+ ("Arabic MSA", "facebook/wav2vec2-large-xlsr-53")
369
+ ],
370
+ "supports_api": False,
371
+ "supports_hf_token": True,
372
+ "languages": [
373
+ ("Arabic Egyptian", "ar-EG"),
374
+ ("Arabic Standard", "ar"),
375
+ ("Auto-detect", "auto"),
376
+ ],
377
+ "device_options": ["auto", "cpu", "cuda"],
378
+ "default_params": {
379
+ "device": "auto",
380
+ "chunk_length": 20,
381
+ "return_confidence": True,
382
+ "max_audio_length": 120
383
+ }
384
+ }
385
+
386
+ elif model_name == "CoquiSTT":
387
+ return {
388
+ "model_sizes": [
389
+ ("English Large Vocab", "english-large"),
390
+ ("English Huge Vocab", "english-huge"),
391
+ ("German", "german"),
392
+ ("French", "french"),
393
+ ("Spanish", "spanish")
394
+ ],
395
+ "supports_api": False,
396
+ "supports_auto_download": True,
397
+ "languages": [
398
+ ("English", "en"),
399
+ ("German", "de"),
400
+ ("French", "fr"),
401
+ ("Spanish", "es"),
402
+ ("Auto (based on model)", "auto"),
403
+ ],
404
+ "default_params": {
405
+ "auto_download": True,
406
+ "beam_width": 512,
407
+ "lm_alpha": 0.931289039105002,
408
+ "lm_beta": 1.1834137581510284,
409
+ "return_confidence": True,
410
+ "return_timestamps": False,
411
+ }
412
+ }
413
+
414
+ elif model_name == "TawasulSTT":
415
+ return {
416
+ "model_sizes": [
417
+ ("Tawasul STT V0 (Arabic)", "Kareem35/Tawasul-STT-V0"),
418
+ ("Arabic Standard (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"),
419
+ ("Arabic Egyptian (Wav2Vec2)", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian"),
420
+ ("Multilingual Fallback", "facebook/wav2vec2-large-xlsr-53")
421
+ ],
422
+ "supports_api": False,
423
+ "supports_hf_token": True,
424
+ "languages": [
425
+ ("Arabic Standard", "ar"),
426
+ ("Arabic Egyptian", "ar-EG"),
427
+ ("Arabic Saudi", "ar-SA"),
428
+ ("Arabic Jordanian", "ar-JO"),
429
+ ("Arabic Lebanese", "ar-LB"),
430
+ ("Arabic Syrian", "ar-SY"),
431
+ ("Arabic Iraqi", "ar-IQ"),
432
+ ("Auto-detect", "auto"),
433
+ ],
434
+ "device_options": ["auto", "cpu", "cuda"],
435
+ "default_params": {
436
+ "device": "auto",
437
+ "chunk_length": 20,
438
+ "return_confidence": True,
439
+ "max_audio_length": 300
440
+ }
441
+ }
442
+
443
+ # Default options for other models
444
+ return {
445
+ "model_sizes": ["default"],
446
+ "supports_api": False,
447
+ "languages": [("Auto-detect", "auto")],
448
+ "default_params": {}
449
+ }
450
+
451
+ @staticmethod
452
+ def load_model(model_name: str, **kwargs) -> str:
453
+ """Load specified STT model with configuration."""
454
+ global current_stt_model, current_model_config
455
+
456
+ if model_name not in STT_MODELS:
457
+ return f"❌ Unknown model: {model_name}. Available: {list(STT_MODELS.keys())}"
458
+
459
+ try:
460
+ model_class = STT_MODELS[model_name]
461
+
462
+ # Handle TawasulSTT as static class (don't instantiate)
463
+ if model_name == "TawasulSTT":
464
+ model_instance = model_class # Use class directly for static methods
465
+ else:
466
+ # Instantiate the model for instance-based classes
467
+ model_instance = model_class()
468
+
469
+ if model_name == "WhisperSTT":
470
+ # Handle WhisperSTT specific loading
471
+ model_size = kwargs.get("model_size", "base")
472
+ use_api = kwargs.get("use_api", False)
473
+ api_key = kwargs.get("api_key", "")
474
+
475
+ if use_api and not api_key.strip():
476
+ return "❌ Error: API key required for API mode"
477
+
478
+ # Load with optimized parameters
479
+ load_params = {
480
+ "model_size": model_size,
481
+ "use_api": use_api,
482
+ }
483
+
484
+ if api_key:
485
+ load_params["api_key"] = api_key.strip()
486
+
487
+ # Add quality optimization parameters for local models
488
+ if not use_api:
489
+ load_params.update({
490
+ "temperature": 0.0,
491
+ "beam_size": 5,
492
+ "best_of": 5,
493
+ "patience": 2.0,
494
+ "condition_on_previous_text": True,
495
+ })
496
+
497
+ model_instance.load_model(**load_params)
498
+
499
+ current_model_config = {
500
+ "model_name": model_name,
501
+ "model_size": model_size,
502
+ "use_api": use_api
503
+ }
504
+
505
+ status = f"βœ… {model_name} ({'API' if use_api else model_size}) loaded successfully"
506
+
507
+ elif model_name == "Wav2Vec2ArabicSTT":
508
+ # Handle Wav2Vec2 Arabic specific loading
509
+ device = kwargs.get("device", "auto")
510
+ chunk_length = kwargs.get("chunk_length", 20)
511
+ hf_token = kwargs.get("hf_token", "")
512
+ model_id = kwargs.get("model_size", "jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
513
+
514
+ load_params = {
515
+ "device": device,
516
+ "chunk_length": chunk_length,
517
+ "model_id": model_id,
518
+ }
519
+
520
+ if hf_token:
521
+ load_params["hf_token"] = hf_token.strip()
522
+
523
+ model_instance.load_model(**load_params)
524
+
525
+ current_model_config = {
526
+ "model_name": model_name,
527
+ "model_id": model_id,
528
+ "device": device,
529
+ "chunk_length": chunk_length
530
+ }
531
+
532
+ # Extract model name for display
533
+ model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
534
+ status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
535
+
536
+ elif model_name == "VoskSTT":
537
+ # Handle VoskSTT specific loading
538
+ model_name_param = kwargs.get("model_size", "vosk-model-small-en-us-0.15")
539
+ auto_download = kwargs.get("auto_download", True)
540
+
541
+ load_params = {
542
+ "model_name": model_name_param,
543
+ "auto_download": auto_download,
544
+ }
545
+
546
+ model_instance.load_model(**load_params)
547
+
548
+ current_model_config = {
549
+ "model_name": model_name,
550
+ "model_name_param": model_name_param,
551
+ "auto_download": auto_download
552
+ }
553
+
554
+ status = f"βœ… {model_name} ({model_name_param}) loaded successfully"
555
+
556
+ elif model_name == "HuBERTArabicSTT":
557
+ # Handle HuBERT Arabic specific loading
558
+ device = kwargs.get("device", "auto")
559
+ chunk_length = kwargs.get("chunk_length", 20)
560
+ hf_token = kwargs.get("hf_token", "")
561
+ model_id = kwargs.get("model_size", "omarxadel/hubert-large-arabic-egyptian")
562
+ max_audio_length = kwargs.get("max_audio_length", 120)
563
+
564
+ load_params = {
565
+ "device": device,
566
+ "chunk_length": chunk_length,
567
+ "model_id": model_id,
568
+ "max_audio_length": max_audio_length,
569
+ }
570
+
571
+ if hf_token:
572
+ load_params["hf_token"] = hf_token.strip()
573
+
574
+ model_instance.load_model(**load_params)
575
+
576
+ current_model_config = {
577
+ "model_name": model_name,
578
+ "model_id": model_id,
579
+ "device": device,
580
+ "chunk_length": chunk_length,
581
+ "max_audio_length": max_audio_length
582
+ }
583
+
584
+ # Extract model name for display
585
+ model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
586
+ status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
587
+
588
+ elif model_name == "CoquiSTT":
589
+ # Handle Coqui STT specific loading
590
+ model_name_param = kwargs.get("model_size", "english-large")
591
+ auto_download = kwargs.get("auto_download", True)
592
+ beam_width = kwargs.get("beam_width", 512)
593
+ lm_alpha = kwargs.get("lm_alpha", 0.931289039105002)
594
+ lm_beta = kwargs.get("lm_beta", 1.1834137581510284)
595
+
596
+ load_params = {
597
+ "model_name": model_name_param,
598
+ "auto_download": auto_download,
599
+ "beam_width": beam_width,
600
+ "lm_alpha": lm_alpha,
601
+ "lm_beta": lm_beta,
602
+ }
603
+
604
+ model_instance.load_model(**load_params)
605
+
606
+ current_model_config = {
607
+ "model_name": model_name,
608
+ "model_name_param": model_name_param,
609
+ "auto_download": auto_download,
610
+ "beam_width": beam_width,
611
+ "lm_alpha": lm_alpha,
612
+ "lm_beta": lm_beta
613
+ }
614
+
615
+ status = f"βœ… {model_name} ({model_name_param}) loaded successfully"
616
+
617
+ elif model_name == "TawasulSTT":
618
+ # Handle Tawasul STT specific loading (static class)
619
+ device = kwargs.get("device", "auto")
620
+ chunk_length = kwargs.get("chunk_length", 20)
621
+ hf_token = kwargs.get("hf_token", "")
622
+ model_id = kwargs.get("model_size", "Kareem35/Tawasul-STT-V0")
623
+ max_audio_length = kwargs.get("max_audio_length", 300)
624
+
625
+ load_params = {
626
+ "device": device,
627
+ "chunk_length": chunk_length,
628
+ "model_id": model_id,
629
+ "max_audio_length": max_audio_length,
630
+ }
631
+
632
+ if hf_token:
633
+ load_params["hf_token"] = hf_token.strip()
634
+
635
+ # Call static method directly
636
+ model_class.load_model(**load_params)
637
+
638
+ current_model_config = {
639
+ "model_name": model_name,
640
+ "model_id": model_id,
641
+ "device": device,
642
+ "chunk_length": chunk_length,
643
+ "max_audio_length": max_audio_length
644
+ }
645
+
646
+ # Extract model name for display
647
+ model_display_name = model_id.split('/')[-1] if '/' in model_id else model_id
648
+ status = f"βœ… {model_name} ({model_display_name}) loaded on {device}"
649
+
650
+ else:
651
+ # Generic model loading for future STT models
652
+ model_instance.load_model(**kwargs)
653
+ current_model_config = {"model_name": model_name, **kwargs}
654
+ status = f"βœ… {model_name} loaded successfully"
655
+
656
+ current_stt_model = model_instance
657
+ logger.info(status)
658
+ return status
659
+
660
+ except Exception as e:
661
+ error_msg = f"❌ Error loading {model_name}: {str(e)}"
662
+ logger.error(error_msg)
663
+ return error_msg
664
+
665
+ @staticmethod
666
+ def get_model_info() -> str:
667
+ """Get information about available and loaded models."""
668
+ info = f"**Available Models:** {', '.join(STT_MODELS.keys())}\n\n"
669
+
670
+ if current_stt_model:
671
+ model_info = current_stt_model.get_model_info()
672
+ # Handle different key names for model name
673
+ model_name = model_info.get('model_name') or model_info.get('name', 'Unknown')
674
+ info += f"**Currently Loaded:** {model_name}\n"
675
+ info += f"**Status:** {'βœ… Ready' if model_info['is_loaded'] else '❌ Not loaded'}\n"
676
+ info += f"**Config:** {current_model_config}"
677
+ else:
678
+ info += "**Currently Loaded:** None"
679
+
680
+ return info
681
+
682
+
683
+ class ImageGallery:
684
+ """Handle a static image gallery with thumbnail and button navigation."""
685
+
686
+ def __init__(self):
687
+ """Initialize image gallery with predefined images."""
688
+ # Define your static images here - you can add more images to this list
689
+ self.images = [
690
+ "https://picsum.photos/400/300?random=1", # Random image 1
691
+ "https://picsum.photos/400/300?random=2", # Random image 2
692
+ "https://picsum.photos/400/300?random=3", # Random image 3
693
+ "https://picsum.photos/400/300?random=4", # Random image 4
694
+ "https://picsum.photos/400/300?random=5", # Random image 5
695
+ ]
696
+
697
+ # Alternative: Use local images (uncomment and modify paths as needed)
698
+ # self.images = [
699
+ # "path/to/image1.jpg",
700
+ # "path/to/image2.png",
701
+ # "path/to/image3.jpg",
702
+ # "path/to/image4.png",
703
+ # "path/to/image5.jpg",
704
+ # ]
705
+
706
+ self.current_index = 0
707
+
708
+ def get_image_by_index(self, index: int) -> str:
709
+ """Get image by index with bounds checking."""
710
+ if 0 <= index < len(self.images):
711
+ self.current_index = index
712
+ return self.images[index]
713
+ return self.images[0] # Return first image as fallback
714
+
715
+ def get_image_info(self, index: int) -> str:
716
+ """Get information about current image."""
717
+ return f"Image {index + 1} of {len(self.images)}"
718
+
719
+ def get_total_images(self) -> int:
720
+ """Get total number of images."""
721
+ return len(self.images)
722
+
723
+
724
+ class TranscriptionEngine:
725
+ """Handle audio transcription using the loaded STT model."""
726
+
727
+ @staticmethod
728
+ def transcribe(audio_input: Tuple[int, np.ndarray],
729
+ language: Optional[str] = None) -> Tuple[str, str, str]:
730
+ """
731
+ Transcribe audio input using the currently loaded STT model.
732
+
733
+ Args:
734
+ audio_input: Tuple of (sample_rate, audio_data) from Gradio
735
+ language: Language code for transcription
736
+
737
+ Returns:
738
+ Tuple of (transcription, confidence_info, processing_info)
739
+ """
740
+ if audio_input is None:
741
+ return "❌ No audio provided", "", ""
742
+
743
+ if not current_stt_model or not current_stt_model.is_loaded:
744
+ return "❌ No STT model loaded. Please load a model first.", "", ""
745
+
746
+ try:
747
+ sample_rate, audio_data = audio_input
748
+
749
+ # Preprocess audio
750
+ processed_audio = AudioProcessor.preprocess(audio_data, sample_rate)
751
+
752
+ # Quality checks
753
+ quality = AudioProcessor.analyze_quality(processed_audio, 16000)
754
+
755
+ if quality["duration"] < 0.5:
756
+ return "❌ Audio too short (minimum 0.5 seconds)", "", ""
757
+
758
+ if quality["max_amplitude"] < 0.001:
759
+ return "❌ Audio too quiet or silent", "", f"Max amplitude: {quality['max_amplitude']:.6f}"
760
+
761
+ # Set language for models that support it
762
+ if hasattr(current_stt_model, 'set_language') and language and language != "auto":
763
+ current_stt_model.set_language(language)
764
+
765
+ # Transcribe using different approaches for different models
766
+ start_time = time.time()
767
+
768
+ # Check if this is TawasulSTT (static class) which needs file path
769
+ if current_model_config.get('model_name') == 'TawasulSTT':
770
+ # TawasulSTT needs a file path, so save audio to temporary file
771
+ import tempfile
772
+ import soundfile as sf
773
+
774
+ with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as temp_file:
775
+ temp_path = temp_file.name
776
+ sf.write(temp_path, processed_audio, 16000)
777
+
778
+ try:
779
+ # Call TawasulSTT.transcribe() with file path
780
+ transcription, confidence_info_raw, processing_info_raw = current_stt_model.transcribe(temp_path)
781
+
782
+ # Create a result-like object for consistency
783
+ class TempResult:
784
+ def __init__(self, text, confidence=None, processing_time=None):
785
+ self.text = text
786
+ self.confidence = confidence
787
+ self.processing_time = processing_time
788
+
789
+ # Extract confidence from confidence_info_raw if available
790
+ confidence_value = None
791
+ if confidence_info_raw and "Confidence:" in confidence_info_raw:
792
+ try:
793
+ conf_str = confidence_info_raw.split("Confidence:")[1].strip()
794
+ confidence_value = float(conf_str)
795
+ except (ValueError, IndexError):
796
+ confidence_value = None
797
+
798
+ processing_time = time.time() - start_time
799
+ result = TempResult(transcription, confidence_value, processing_time)
800
+
801
+ finally:
802
+ # Clean up temporary file
803
+ import os
804
+ try:
805
+ os.unlink(temp_path)
806
+ except OSError:
807
+ pass
808
+ else:
809
+ # For other STT models that use transcribe_audio
810
+ result = current_stt_model.transcribe_audio(processed_audio, 16000)
811
+
812
+ # Prepare output
813
+ transcription = result.text.strip() if result.text else "No speech detected"
814
+
815
+ # Filter out common false positives
816
+ if transcription.lower() in ["you", "thank you.", "thanks for watching!", ""]:
817
+ transcription = "πŸ”‡ No clear speech detected"
818
+
819
+ # Confidence info
820
+ confidence_info = ""
821
+ if result.confidence is not None:
822
+ confidence_info = f"Confidence: {result.confidence:.2%}"
823
+ if result.confidence < 0.3:
824
+ confidence_info += " (Low - consider re-recording)"
825
+ else:
826
+ confidence_info = "Confidence: N/A"
827
+
828
+ # Processing info
829
+ processing_info = f"Processing: {result.processing_time or 0:.2f}s\n"
830
+ processing_info += f"Model: {current_model_config.get('model_name', 'Unknown')}\n"
831
+ processing_info += f"Audio: {quality['duration']:.2f}s, {quality['max_amplitude']:.3f} amplitude\n"
832
+ processing_info += f"Quality: {'βœ… Good' if quality['is_good_quality'] else '⚠️ Poor'}"
833
+
834
+ return transcription, confidence_info, processing_info
835
+
836
+ except Exception as e:
837
+ error_msg = f"❌ Transcription error: {str(e)}"
838
+ logger.error(error_msg)
839
+ return error_msg, "", ""
840
+
841
+
842
+ class GradioInterface:
843
+ """Create and manage the Gradio web interface."""
844
+
845
+ @staticmethod
846
+ def create_interface():
847
+ """Create the main Gradio interface."""
848
+
849
+ # Initialize image gallery
850
+ gallery = ImageGallery()
851
+
852
+ with gr.Blocks(
853
+ title="πŸŽ™οΈ Modular Voice Transcriber with Image Gallery",
854
+ theme=gr.themes.Soft()
855
+ ) as demo:
856
+
857
+ gr.Markdown(
858
+ """
859
+ # πŸŽ™οΈ Modular Voice Transcriber with Image Gallery
860
+
861
+ A flexible interface supporting multiple STT models with an integrated image viewer.
862
+ Easily extensible for new transcription engines and image collections.
863
+ """
864
+ )
865
+
866
+ # Image Gallery Section (at the top)
867
+ with gr.Row():
868
+ with gr.Column():
869
+ gr.Markdown("### πŸ–ΌοΈ Image Gallery")
870
+
871
+ # Main image display
872
+ image_display = gr.Image(
873
+ value=gallery.get_image_by_index(0),
874
+ label="Selected Image",
875
+ height=400,
876
+ width=500,
877
+ interactive=False
878
+ )
879
+
880
+ # Image info
881
+ image_info = gr.Textbox(
882
+ value=gallery.get_image_info(0),
883
+ label="Image Info",
884
+ interactive=False
885
+ )
886
+
887
+ # Horizontal thumbnail gallery with actual image previews
888
+ gr.Markdown("**Click on a thumbnail to view:**")
889
+ with gr.Row():
890
+ # Create thumbnail gallery using Gradio's Gallery component
891
+ thumbnail_gallery = gr.Gallery(
892
+ value=gallery.images, # All images as thumbnails
893
+ label="Image Gallery",
894
+ show_label=False,
895
+ elem_id="thumbnail_gallery",
896
+ columns=len(gallery.images), # Horizontal layout
897
+ rows=1,
898
+ height=120, # Small thumbnail height
899
+ allow_preview=False, # Don't show preview popup
900
+ interactive=True
901
+ )
902
+
903
+ # Navigation buttons (kept for convenience)
904
+ gr.Markdown("**Or use navigation:**")
905
+ with gr.Row():
906
+ prev_btn = gr.Button("◀️ Previous", size="sm")
907
+ next_btn = gr.Button("Next ▢️", size="sm")
908
+ random_btn = gr.Button("🎲 Random", size="sm")
909
+
910
+ gr.Markdown("---") # Separator line
911
+
912
+ with gr.Row():
913
+ # Model Configuration Panel
914
+ with gr.Column(scale=1):
915
+ gr.Markdown("### πŸ”§ Model Configuration")
916
+
917
+ # Model selection
918
+ model_selector = gr.Dropdown(
919
+ choices=ModelManager.get_available_models(),
920
+ value="WhisperSTT",
921
+ label="STT Model",
922
+ info="Choose your speech-to-text engine"
923
+ )
924
+
925
+ # Dynamic model options (will update based on selected model)
926
+ model_size = gr.Dropdown(
927
+ choices=["tiny", "base", "small", "medium", "large"],
928
+ value="base",
929
+ label="Model Size",
930
+ visible=True
931
+ )
932
+
933
+ use_api = gr.Checkbox(
934
+ label="Use API",
935
+ info="Use cloud API instead of local model",
936
+ visible=True
937
+ )
938
+
939
+ api_key = gr.Textbox(
940
+ label="API Key",
941
+ type="password",
942
+ placeholder="Enter API key...",
943
+ visible=False
944
+ )
945
+
946
+ # Device selection for models that support it
947
+ device_selector = gr.Dropdown(
948
+ choices=["auto", "cpu", "cuda"],
949
+ value="auto",
950
+ label="Device",
951
+ info="Processing device (auto recommended)",
952
+ visible=False
953
+ )
954
+
955
+ # HuggingFace token for private models
956
+ hf_token = gr.Textbox(
957
+ label="HuggingFace Token",
958
+ type="password",
959
+ placeholder="hf_...",
960
+ info="Optional: For private or experimental models",
961
+ visible=False
962
+ )
963
+
964
+ # Load button and status
965
+ load_btn = gr.Button("πŸ”„ Load Model", variant="primary")
966
+ load_status = gr.Textbox(
967
+ label="Status",
968
+ value="No model loaded",
969
+ interactive=False
970
+ )
971
+
972
+ # Model info
973
+ model_info = gr.Markdown(ModelManager.get_model_info())
974
+
975
+ # Transcription Panel
976
+ with gr.Column(scale=2):
977
+ gr.Markdown("### 🎀 Voice Transcription")
978
+
979
+ # Language selection
980
+ language = gr.Dropdown(
981
+ choices=[("Auto-detect", "auto"), ("English", "en")],
982
+ value="auto",
983
+ label="Language"
984
+ )
985
+
986
+ # Audio input
987
+ audio_input = gr.Audio(
988
+ label="Record or Upload Audio",
989
+ type="numpy",
990
+ format="wav"
991
+ )
992
+
993
+ # Action buttons
994
+ with gr.Row():
995
+ transcribe_btn = gr.Button("🎯 Transcribe", variant="primary")
996
+ quality_btn = gr.Button("πŸ“Š Check Quality")
997
+ clear_btn = gr.Button("πŸ—‘οΈ Clear")
998
+
999
+ # Outputs
1000
+ transcription_output = gr.Textbox(
1001
+ label="πŸ“ Transcription",
1002
+ lines=4,
1003
+ placeholder="Transcribed text will appear here..."
1004
+ )
1005
+
1006
+ with gr.Row():
1007
+ confidence_output = gr.Textbox(
1008
+ label="🎯 Confidence",
1009
+ interactive=False
1010
+ )
1011
+ processing_output = gr.Textbox(
1012
+ label="⏱️ Processing Info",
1013
+ interactive=False
1014
+ )
1015
+
1016
+ quality_output = gr.Markdown(
1017
+ value="",
1018
+ visible=False,
1019
+ label="πŸ“Š Audio Quality Analysis"
1020
+ )
1021
+
1022
+ # Usage tips
1023
+ gr.Markdown(
1024
+ """
1025
+ ### πŸ’‘ Tips for Best Results
1026
+ - **Record clearly** in a quiet environment
1027
+ - **Speak at normal pace** - not too fast or slow
1028
+ - **Use good audio quality** - avoid background noise
1029
+ - **Try different models** - larger models are more accurate but slower
1030
+ - **Check quality analysis** to identify audio issues
1031
+ - **Browse images** using the thumbnail gallery or the navigation buttons
1032
+ """
1033
+ )
1034
+
1035
+ # Event handlers
1036
+ def update_model_options(model_name: str):
1037
+ """Update interface based on selected model."""
1038
+ options = ModelManager.get_model_options(model_name)
1039
+
1040
+ # Determine visibility of components
1041
+ show_model_size = len(options["model_sizes"]) > 1
1042
+ show_api = options["supports_api"]
1043
+ show_device = "device_options" in options
1044
+ show_hf_token = options.get("supports_hf_token", False)
1045
+
1046
+ # Extract model size options (handle both simple lists and tuples)
1047
+ if show_model_size and isinstance(options["model_sizes"][0], tuple):
1048
+ # Model sizes are tuples of (display_name, value)
1049
+ size_choices = options["model_sizes"]
1050
+ size_value = size_choices[0][1] # Use the value from first tuple
1051
+ else:
1052
+ # Model sizes are simple strings
1053
+ size_choices = options["model_sizes"]
1054
+ size_value = size_choices[0]
1055
+
1056
+ return (
1057
+ gr.update(choices=size_choices, value=size_value, visible=show_model_size),
1058
+ gr.update(visible=show_api),
1059
+ gr.update(visible=False), # Hide API key initially
1060
+ gr.update(choices=options["languages"], value="auto"),
1061
+ gr.update(
1062
+ choices=options.get("device_options", ["auto"]),
1063
+ value="auto",
1064
+ visible=show_device
1065
+ ),
1066
+ gr.update(visible=show_hf_token)
1067
+ )
1068
+
1069
+ def toggle_api_key(use_api: bool):
1070
+ """Show/hide API key field."""
1071
+ return gr.update(visible=use_api)
1072
+
1073
+ def load_selected_model(model_name: str, model_size: str, use_api: bool, api_key: str, device: str, hf_token: str):
1074
+ """Load the selected model with configuration."""
1075
+ kwargs = {"model_size": model_size, "use_api": use_api}
1076
+ if api_key:
1077
+ kwargs["api_key"] = api_key
1078
+ if device and device != "auto":
1079
+ kwargs["device"] = device
1080
+ if hf_token:
1081
+ kwargs["hf_token"] = hf_token
1082
+ return ModelManager.load_model(model_name, **kwargs)
1083
+
1084
+ def analyze_audio_quality(audio_input):
1085
+ """Analyze and display audio quality."""
1086
+ if audio_input is None:
1087
+ return "", gr.update(visible=False)
1088
+
1089
+ sample_rate, audio_data = audio_input
1090
+ quality = AudioProcessor.analyze_quality(audio_data, sample_rate)
1091
+
1092
+ report = f"""
1093
+ **πŸ“Š Audio Quality Analysis:**
1094
+ - Duration: {quality['duration']:.2f}s
1095
+ - Max amplitude: {quality['max_amplitude']:.3f}
1096
+ - Clipping: {quality['clipping_ratio']:.2%}
1097
+ - Silence ratio: {quality['silence_ratio']:.2%}
1098
+ - Overall quality: {'βœ… Good' if quality['is_good_quality'] else '⚠️ Needs improvement'}
1099
+
1100
+ **πŸ”§ Recommendations:**
1101
+ {_get_quality_recommendations(quality)}
1102
+ """
1103
+
1104
+ return report, gr.update(visible=True)
1105
+
1106
+ # Image Gallery Event Handlers
1107
+ current_image_index = [0] # Use list to make it mutable in nested functions
1108
+
1109
+ def select_image_from_gallery(evt: gr.SelectData):
1110
+ """Handle image selection from gallery thumbnail."""
1111
+ index = evt.index
1112
+ current_image_index[0] = index
1113
+ image_path = gallery.get_image_by_index(index)
1114
+ image_info_text = gallery.get_image_info(index)
1115
+ return image_path, image_info_text
1116
+
1117
+ def go_to_previous_image():
1118
+ """Go to previous image."""
1119
+ current_image_index[0] = max(0, current_image_index[0] - 1)
1120
+ image_path = gallery.get_image_by_index(current_image_index[0])
1121
+ image_info_text = gallery.get_image_info(current_image_index[0])
1122
+ return image_path, image_info_text
1123
+
1124
+ def go_to_next_image():
1125
+ """Go to next image."""
1126
+ current_image_index[0] = min(gallery.get_total_images() - 1, current_image_index[0] + 1)
1127
+ image_path = gallery.get_image_by_index(current_image_index[0])
1128
+ image_info_text = gallery.get_image_info(current_image_index[0])
1129
+ return image_path, image_info_text
1130
+
1131
+ def go_to_random_image():
1132
+ """Go to random image."""
1133
+ import random
1134
+ current_image_index[0] = random.randint(0, gallery.get_total_images() - 1)
1135
+ image_path = gallery.get_image_by_index(current_image_index[0])
1136
+ image_info_text = gallery.get_image_info(current_image_index[0])
1137
+ return image_path, image_info_text
1139
+
1140
+ # Connect events
1141
+ model_selector.change(
1142
+ fn=update_model_options,
1143
+ inputs=model_selector,
1144
+ outputs=[model_size, use_api, api_key, language, device_selector, hf_token]
1145
+ )
1146
+
1147
+ use_api.change(
1148
+ fn=toggle_api_key,
1149
+ inputs=use_api,
1150
+ outputs=api_key
1151
+ )
1152
+
1153
+ load_btn.click(
1154
+ fn=load_selected_model,
1155
+ inputs=[model_selector, model_size, use_api, api_key, device_selector, hf_token],
1156
+ outputs=load_status
1157
+ ).then(
1158
+ fn=lambda: ModelManager.get_model_info(),
1159
+ outputs=model_info
1160
+ )
1161
+
1162
+ transcribe_btn.click(
1163
+ fn=TranscriptionEngine.transcribe,
1164
+ inputs=[audio_input, language],
1165
+ outputs=[transcription_output, confidence_output, processing_output]
1166
+ )
1167
+
1168
+ quality_btn.click(
1169
+ fn=analyze_audio_quality,
1170
+ inputs=audio_input,
1171
+ outputs=[quality_output, quality_output]
1172
+ )
1173
+
1174
+ clear_btn.click(
1175
+ fn=lambda: ("", "", "", "", gr.update(visible=False)),
1176
+ outputs=[transcription_output, confidence_output, processing_output, quality_output, quality_output]
1177
+ )
1178
+
1179
+ # Auto-transcribe on audio change (optional)
1180
+ audio_input.change(
1181
+ fn=TranscriptionEngine.transcribe,
1182
+ inputs=[audio_input, language],
1183
+ outputs=[transcription_output, confidence_output, processing_output]
1184
+ )
1185
+
1186
+ # Image Gallery Event Connections
1187
+ # Connect thumbnail gallery selection
1188
+ thumbnail_gallery.select(
1189
+ fn=select_image_from_gallery,
1190
+ outputs=[image_display, image_info]
1191
+ )
1192
+
1193
+ # Connect navigation buttons
1194
+ prev_btn.click(
1195
+ fn=go_to_previous_image,
1196
+ outputs=[image_display, image_info]
1197
+ )
1198
+
1199
+ next_btn.click(
1200
+ fn=go_to_next_image,
1201
+ outputs=[image_display, image_info]
1202
+ )
1203
+
1204
+ random_btn.click(
1205
+ fn=go_to_random_image,
1206
+ outputs=[image_display, image_info]
1207
+ )
1208
+
1209
+ return demo
1210
+
1211
+
1212
+ def _get_quality_recommendations(quality: Dict[str, Any]) -> str:
1213
+ """Generate quality recommendations based on analysis."""
1214
+ recommendations = []
1215
+
1216
+ if quality["duration"] < 1.0:
1217
+ recommendations.append("β€’ Try recording for longer (1+ seconds)")
1218
+
1219
+ if quality["max_amplitude"] < 0.1:
1220
+ recommendations.append("β€’ Increase volume or move closer to microphone")
1221
+ elif quality["max_amplitude"] > 0.9:
1222
+ recommendations.append("β€’ Reduce volume to avoid clipping")
1223
+
1224
+ if quality["clipping_ratio"] > 0.01:
1225
+ recommendations.append("β€’ Audio is clipping - reduce input gain")
1226
+
1227
+ if quality["silence_ratio"] > 0.5:
1228
+ recommendations.append("β€’ Too much silence - record in quieter environment")
1229
+
1230
+ if not recommendations:
1231
+ recommendations.append("β€’ Audio quality looks good!")
1232
+
1233
+ return "\n".join(recommendations)
1234
+
1235
+
1236
+ def main():
1237
+ """Main application entry point."""
1238
+ # Check dependencies
1239
+ print("πŸ” Checking dependencies...")
1240
+
1241
+ try:
1242
+ import gradio
1243
+ print("βœ… Gradio available")
1244
+ except ImportError:
1245
+ print("❌ Gradio not installed. Run: pip install gradio")
1246
+ return
1247
+
1248
+ # Check available STT models
1249
+ print(f"πŸ€– Available STT models: {ModelManager.get_available_models()}")
1250
+
1251
+ # Create and launch interface
1252
+ print("πŸš€ Launching Gradio interface...")
1253
+ demo = GradioInterface.create_interface()
1254
+
1255
+ demo.launch(
1256
+ share=False, # Set to True for public sharing
1257
+ server_name="127.0.0.1",
1258
+ server_port=7861,
1259
+ show_error=True
1260
+ )
1261
+
1262
+
1263
+ if __name__ == "__main__":
1264
  main()
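
The `STT_MODELS` registry above is the single extension point: a new engine only has to follow the `BaseSTT` contract and register itself behind an optional import. Below is a minimal sketch of such a plugin, assuming the interface implied by this file (`load_model(**kwargs)`, `transcribe_audio(audio, sample_rate)` returning an `STTResult` with `text`/`confidence`/`processing_time`, plus an `is_loaded` flag); `EchoSTT` is a hypothetical name, and the authoritative signatures live in `stt/stt_base.py`:

```python
# Hypothetical plugin sketch; BaseSTT/STTResult signatures are assumed
# from their usage in gradio_voice_transcriber_clean.py.
import time

import numpy as np

from stt.stt_base import BaseSTT, STTResult


class EchoSTT(BaseSTT):
    """Toy engine that reports the clip duration instead of real speech."""

    def __init__(self):
        self.is_loaded = False

    def load_model(self, **kwargs) -> None:
        # A real engine would download / initialize its model here.
        self.is_loaded = True

    def get_model_info(self) -> dict:
        return {"model_name": "EchoSTT", "is_loaded": self.is_loaded}

    def transcribe_audio(self, audio_data: np.ndarray, sample_rate: int) -> STTResult:
        start = time.time()
        duration = len(audio_data) / sample_rate
        return STTResult(
            text=f"[{duration:.2f}s of audio]",
            confidence=1.0,
            processing_time=time.time() - start,
        )


# Registration mirrors the optional-import pattern used above:
# STT_MODELS["EchoSTT"] = EchoSTT
```
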
hf-space ADDED
@@ -0,0 +1 @@
1
+ Subproject commit 921859d10816a9cb386449308c5f66037f50deb2
pyproject.toml ADDED
@@ -0,0 +1,139 @@
1
+ [build-system]
2
+ requires = ["hatchling"]
3
+ build-backend = "hatchling.build"
4
+
5
+ [project]
6
+ name = "modular-voice-transcriber"
7
+ version = "0.2.0"
8
+ description = "A modular Gradio-based web interface for speech-to-text transcription supporting multiple STT engines"
9
+ readme = "README.md"
10
+ requires-python = ">=3.8"
11
+ dependencies = [
12
+ "gradio>=4.0.0",
13
+ "soundfile>=0.12.1",
14
+ "numpy>=1.21.0",
15
+ "pathlib2>=2.3.0; python_version < '3.4'",
16
+ ]
17
+
18
+ [project.optional-dependencies]
19
+ dev = [
20
+ "pytest>=7.0.0",
21
+ "black>=22.0.0",
22
+ "flake8>=4.0.0",
23
+ "mypy>=1.0.0",
24
+ ]
25
+ # OpenAI Whisper STT (local models)
26
+ whisper = [
27
+ "openai-whisper>=20230314",
28
+ "torch>=1.10.0",
29
+ "torchaudio>=0.10.0",
30
+ ]
31
+ # OpenAI Whisper API
32
+ whisper-api = [
33
+ "openai>=1.0.0",
34
+ ]
35
+ # Wav2Vec2 models (Hugging Face)
36
+ wav2vec2 = [
37
+ "transformers>=4.20.0",
38
+ "torch>=1.12.0",
39
+ "torchaudio>=0.12.0",
40
+ "librosa>=0.9.0", # Optional but recommended
41
+ ]
42
+ # HuBERT models (Hugging Face, Arabic Egyptian)
43
+ hubert = [
44
+ "transformers>=4.20.0",
45
+ "torch>=1.12.0",
46
+ "torchaudio>=0.12.0",
47
+ "librosa>=0.9.2",
48
+ "soundfile>=0.10.3",
49
+ "huggingface-hub>=0.14.0",
50
+ ]
51
+ # Coqui STT (open-source multilingual)
52
+ coqui = [
53
+ "coqui-stt>=1.4.0",
54
+ "soundfile>=0.10.3",
55
+ "librosa>=0.9.2",
56
+ "requests>=2.25.0",
57
+ ]
58
+ # Vosk STT (offline recognition)
59
+ vosk = [
60
+ "vosk>=0.3.42",
61
+ "soundfile>=0.12.1",
62
+ ]
63
+ # Tawasul STT (Arabic speech recognition)
64
+ tawasul = [
65
+ "transformers>=4.20.0",
66
+ "torch>=1.12.0",
67
+ "torchaudio>=0.12.0",
68
+ "librosa>=0.9.2",
69
+ "soundfile>=0.10.3",
70
+ "huggingface-hub>=0.14.0",
71
+ ]
72
+ # Azure Speech Service
73
+ azure-speech = [
74
+ "azure-cognitiveservices-speech>=1.25.0",
75
+ ]
76
+ # Google Cloud Speech-to-Text
77
+ google-speech = [
78
+ "google-cloud-speech>=2.15.0",
79
+ ]
80
+ # AssemblyAI
81
+ assemblyai = [
82
+ "assemblyai>=0.15.0",
83
+ ]
84
+ # Amazon Transcribe
85
+ aws-transcribe = [
86
+ "boto3>=1.26.0",
87
+ "botocore>=1.29.0",
88
+ ]
89
+ # All STT engines (for full functionality)
90
+ all-stt = [
91
+ "openai-whisper>=20230314",
92
+ "openai>=1.0.0",
93
+ "transformers>=4.20.0",
94
+ "torch>=1.12.0",
95
+ "torchaudio>=0.12.0",
96
+ "librosa>=0.9.0",
97
+ "vosk>=0.3.42",
98
+ "soundfile>=0.10.3",
99
+ "huggingface-hub>=0.14.0",
100
+ "coqui-stt>=1.4.0",
101
+ "requests>=2.25.0",
102
+ "azure-cognitiveservices-speech>=1.25.0",
103
+ "google-cloud-speech>=2.15.0",
104
+ "assemblyai>=0.15.0",
105
+ "boto3>=1.26.0",
106
+ ]
107
+ # Essential models (Whisper + Wav2Vec2 + HuBERT + Vosk + Coqui)
108
+ essential = [
109
+ "openai-whisper>=20230314",
110
+ "openai>=1.0.0",
111
+ "transformers>=4.20.0",
112
+ "torch>=1.12.0",
113
+ "torchaudio>=0.12.0",
114
+ "librosa>=0.9.0",
115
+ "soundfile>=0.10.3",
116
+ "huggingface-hub>=0.14.0",
117
+ "vosk>=0.3.42",
118
+ "coqui-stt>=1.4.0",
119
+ "requests>=2.25.0",
120
+ ]
121
+
122
+ [project.urls]
123
+ Homepage = "https://github.com/your-username/modular-voice-transcriber"
124
+ Repository = "https://github.com/your-username/modular-voice-transcriber.git"
125
+ Issues = "https://github.com/your-username/modular-voice-transcriber/issues"
126
+
127
+ [project.scripts]
128
+ voice-transcriber = "gradio_voice_transcriber_clean:main"
129
+
130
+ [tool.black]
131
+ line-length = 88
132
+ target-version = ['py38']
133
+
134
+ [tool.uv]
135
+ dev-dependencies = [
136
+ "pytest>=7.0.0",
137
+ "black>=22.0.0",
138
+ "flake8>=4.0.0",
139
+ ]
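
The `[project.scripts]` table wires a `voice-transcriber` console command to `main()` in `gradio_voice_transcriber_clean.py`, so after `pip install -e .` the command is equivalent to this small launcher:

```python
# What the `voice-transcriber` console script resolves to after installation.
from gradio_voice_transcriber_clean import main

if __name__ == "__main__":
    main()
```
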
requirements.txt CHANGED
@@ -1,43 +1,43 @@
1
- # Modular Voice Transcriber - Core Dependencies
2
- # Base requirements for the Gradio interface
3
- gradio>=4.0.0
4
- soundfile>=0.12.1
5
- numpy>=1.21.0
6
-
7
- # Essential STT Models (Whisper + Wav2Vec2)
8
- # OpenAI Whisper (local and API)
9
- openai-whisper>=20231117
10
- openai>=1.0.0
11
-
12
- # Wav2Vec2 (Hugging Face Transformers)
13
- transformers>=4.20.0
14
- torch>=1.12.0
15
- torchaudio>=0.12.0
16
-
17
- # Audio processing (recommended)
18
- librosa>=0.9.0
19
-
20
- # Optional: Vosk STT (for offline recognition)
21
- # vosk>=0.3.42
22
-
23
- # Optional: HuBERT Arabic STT (for Arabic Egyptian dialect)
24
- # Use requirements_hubert.txt for full setup
25
-
26
- # Optional: Coqui STT (open-source multilingual)
27
- # Use requirements_coqui.txt for full setup
28
-
29
- # Optional: Tawasul STT (for Arabic speech recognition)
30
- # Use requirements_tawasul.txt for full setup
31
-
32
- # Installation options:
33
- # pip install -r requirements.txt # Core + Essential STT models
34
- # pip install -r requirements_whisper.txt # Whisper-only setup
35
- # pip install -r requirements_wav2vec2.txt # Wav2Vec2-only setup
36
- # pip install -r requirements_vosk.txt # Vosk-only setup
37
- # pip install -r requirements_hubert.txt # HuBERT Arabic-only setup
38
- # pip install -r requirements_coqui.txt # Coqui STT-only setup
39
- # pip install -r requirements_tawasul.txt # Tawasul STT-only setup
40
- # pip install -e .[essential] # Same as core
41
- # pip install -e .[all-stt] # All supported STT engines
42
- # pip install -e .[whisper,wav2vec2,vosk,hubert,coqui,tawasul] # Specific models only
43
  # pip install -e .[dev] # Development dependencies
 
1
+ # Modular Voice Transcriber - Core Dependencies
2
+ # Base requirements for the Gradio interface
3
+ gradio>=4.0.0
4
+ soundfile>=0.12.1
5
+ numpy>=1.21.0
6
+
7
+ # Essential STT Models (Whisper + Wav2Vec2)
8
+ # OpenAI Whisper (local and API)
9
+ openai-whisper>=20231117
10
+ openai>=1.0.0
11
+
12
+ # Wav2Vec2 (Hugging Face Transformers)
13
+ transformers>=4.20.0
14
+ torch>=1.12.0
15
+ torchaudio>=0.12.0
16
+
17
+ # Audio processing (recommended)
18
+ librosa>=0.9.0
19
+
20
+ # Optional: Vosk STT (for offline recognition)
21
+ # vosk>=0.3.42
22
+
23
+ # Optional: HuBERT Arabic STT (for Arabic Egyptian dialect)
24
+ # Use requirements_hubert.txt for full setup
25
+
26
+ # Optional: Coqui STT (open-source multilingual)
27
+ # Use requirements_coqui.txt for full setup
28
+
29
+ # Optional: Tawasul STT (for Arabic speech recognition)
30
+ # Use requirements_tawasul.txt for full setup
31
+
32
+ # Installation options:
33
+ # pip install -r requirements.txt # Core + Essential STT models
34
+ # pip install -r requirements_whisper.txt # Whisper-only setup
35
+ # pip install -r requirements_wav2vec2.txt # Wav2Vec2-only setup
36
+ # pip install -r requirements_vosk.txt # Vosk-only setup
37
+ # pip install -r requirements_hubert.txt # HuBERT Arabic-only setup
38
+ # pip install -r requirements_coqui.txt # Coqui STT-only setup
39
+ # pip install -r requirements_tawasul.txt # Tawasul STT-only setup
40
+ # pip install -e .[essential] # Same as core
41
+ # pip install -e .[all-stt] # All supported STT engines
42
+ # pip install -e .[whisper,wav2vec2,vosk,hubert,coqui,tawasul] # Specific models only
43
  # pip install -e .[dev] # Development dependencies
requirements_coqui.txt ADDED
@@ -0,0 +1,6 @@
1
+ # Coqui STT Requirements
2
+ coqui-stt-model-manager
3
+ soundfile>=0.10.3
4
+ librosa>=0.9.2
5
+ numpy>=1.21.0
6
+ requests>=2.25.0
requirements_hubert.txt ADDED
@@ -0,0 +1,7 @@
1
+ # HuBERT Arabic STT Requirements
2
+ torch>=1.12.0
3
+ transformers>=4.20.0
4
+ torchaudio>=0.12.0
5
+ librosa>=0.9.2
6
+ soundfile>=0.10.3
7
+ huggingface-hub>=0.14.0
requirements_tawasul.txt ADDED
@@ -0,0 +1,13 @@
1
+ # Tawasul STT Requirements
2
+ # Arabic Speech Recognition using Tawasul STT V0 model
3
+ torch>=1.12.0
4
+ transformers>=4.20.0
5
+ torchaudio>=0.12.0
6
+ librosa>=0.9.2
7
+ soundfile>=0.10.3
8
+ huggingface-hub>=0.14.0
9
+ numpy>=1.21.0
10
+
11
+ # Optional: For better performance
12
+ # accelerate>=0.20.0
13
+ # optimum>=1.8.0
requirements_vosk.txt ADDED
@@ -0,0 +1,3 @@
1
+ # Vosk STT requirements
2
+ vosk>=0.3.42
3
+ soundfile>=0.12.1
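
When `auto_download` is disabled in the interface, these Vosk model folders have to be fetched and extracted manually. A stand-alone smoke test using the standard `vosk` API, assuming the small English model directory sits next to the script and `sample.wav` is a placeholder 16 kHz mono PCM WAV:

```python
import json
import wave

from vosk import KaldiRecognizer, Model

# Path to an extracted model folder (auto-downloaded by the app when enabled).
model = Model("vosk-model-small-en-us-0.15")

with wave.open("sample.wav", "rb") as wf:  # placeholder: 16 kHz, mono, PCM
    rec = KaldiRecognizer(model, wf.getframerate())
    while True:
        data = wf.readframes(4000)
        if not data:
            break
        rec.AcceptWaveform(data)
    print(json.loads(rec.FinalResult())["text"])
```
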
requirements_wav2vec2.txt ADDED
@@ -0,0 +1,24 @@
1
+ # Wav2Vec2 Arabic STT Requirements
2
+ # Minimal requirements for using the Wav2Vec2 Arabic Egyptian model
3
+
4
+ # Base requirements
5
+ gradio>=4.0.0
6
+ numpy>=1.21.0
7
+ soundfile>=0.12.1
8
+
9
+ # Wav2Vec2 specific requirements
10
+ transformers>=4.20.0
11
+ torch>=1.12.0
12
+ torchaudio>=0.12.0
13
+
14
+ # Optional but highly recommended for better audio processing
15
+ librosa>=0.9.0
16
+
17
+ # Installation:
18
+ # pip install -r requirements_wav2vec2.txt
19
+
20
+ # Notes:
21
+ # - First model load will download ~1.2GB from Hugging Face Hub
22
+ # - GPU support is automatic if PyTorch with CUDA is installed
23
+ # - Model runs on CPU but GPU is significantly faster for longer audio
24
+ # - Optimized for Arabic Egyptian dialect but works with Standard Arabic
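
The note on automatic GPU support matches the interface's `device = "auto"` option; presumably it resolves to CUDA when available (the actual logic lives in `stt/wav2vec2_arabic_stt.py`, not shown here). A quick check of what "auto" would pick on a given machine:

```python
import torch

# "auto" falls back to CPU when no CUDA-enabled PyTorch build is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Wav2Vec2 inference device: {device}")
```
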
requirements_whisper.txt ADDED
@@ -0,0 +1,27 @@
1
+ # OpenAI Whisper STT Requirements
2
+ # Requirements for using OpenAI Whisper (local and API)
3
+
4
+ # Base requirements
5
+ gradio>=4.0.0
6
+ numpy>=1.21.0
7
+ soundfile>=0.12.1
8
+
9
+ # Whisper local model requirements
10
+ openai-whisper>=20231117
11
+ torch>=1.10.0
12
+ torchaudio>=0.10.0
13
+
14
+ # Whisper API requirements
15
+ openai>=1.0.0
16
+
17
+ # Optional for better audio processing
18
+ librosa>=0.9.0
19
+
20
+ # Installation:
21
+ # pip install -r requirements_whisper.txt
22
+
23
+ # Notes:
24
+ # - Local models download automatically on first use
25
+ # - API requires OpenAI API key
26
+ # - Model sizes: tiny(39MB) < base(142MB) < small(461MB) < medium(1.5GB) < large(2.9GB)
27
+ # - GPU support automatic if PyTorch with CUDA is installed
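
A quick smoke test for the local Whisper path once these requirements are installed, using the standard `openai-whisper` API (`sample.wav` is a placeholder path):

```python
import whisper

# First use downloads ~142MB for "base" and caches it under ~/.cache/whisper.
model = whisper.load_model("base")
result = model.transcribe("sample.wav", language="en")  # placeholder file
print(result["text"])
```
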
setup.py ADDED
@@ -0,0 +1,212 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Setup script for Modular Voice Transcriber
4
+
5
+ This script helps set up the environment and install dependencies
6
+ based on which STT models you want to use.
7
+ """
8
+
9
+ import subprocess
10
+ import sys
11
+ import argparse
12
+ from pathlib import Path
13
+
14
+ def run_command(command, description=""):
15
+ """Run a command and handle errors."""
16
+ if description:
17
+ print(f"πŸ“¦ {description}...")
18
+
19
+ try:
20
+ result = subprocess.run(command, shell=True, check=True, capture_output=True, text=True)
21
+ print(f"βœ… {description or 'Command'} completed successfully")
22
+ return True
23
+ except subprocess.CalledProcessError as e:
24
+ print(f"❌ {description or 'Command'} failed: {e}")
25
+ if e.stdout:
26
+ print(f"Output: {e.stdout}")
27
+ if e.stderr:
28
+ print(f"Error: {e.stderr}")
29
+ return False
30
+
31
+ def install_requirements(requirements_file):
32
+ """Install requirements from a specific file."""
33
+ if not Path(requirements_file).exists():
34
+ print(f"❌ Requirements file not found: {requirements_file}")
35
+ return False
36
+
37
+ return run_command(
38
+ f"pip install -r {requirements_file}",
39
+ f"Installing requirements from {requirements_file}"
40
+ )
41
+
42
+ def install_optional_dependencies(groups):
43
+ """Install optional dependencies using pip install -e ."""
44
+ group_str = ",".join(groups)
45
+ return run_command(
46
+ f'pip install -e ".[{group_str}]"',  # quoted so interactive shells do not glob the [...] extras
47
+ f"Installing optional dependencies: {group_str}"
48
+ )
49
+
50
+ def test_imports(modules):
51
+ """Test if modules can be imported."""
52
+ print("\nπŸ” Testing module imports...")
53
+ all_good = True
54
+
55
+ for module in modules:
56
+ try:
57
+ __import__(module)
58
+ print(f"βœ… {module}")
59
+ except ImportError as e:
60
+ print(f"❌ {module}: {e}")
61
+ all_good = False
62
+
63
+ return all_good
64
+
65
+ def main():
66
+ parser = argparse.ArgumentParser(description="Setup Modular Voice Transcriber")
67
+ parser.add_argument(
68
+ "--profile",
69
+ choices=["minimal", "essential", "whisper-only", "wav2vec2-only", "vosk-only", "hubert-only", "coqui-only", "tawasul-only", "all"],
70
+ default="essential",
71
+ help="Installation profile (default: essential)"
72
+ )
73
+ parser.add_argument(
74
+ "--test",
75
+ action="store_true",
76
+ help="Test the installation after setup"
77
+ )
78
+
79
+ args = parser.parse_args()
80
+
81
+ print("πŸš€ Modular Voice Transcriber Setup")
82
+ print("=" * 50)
83
+ print(f"Profile: {args.profile}")
84
+ print()
85
+
86
+ # Install base requirements first
87
+ print("πŸ“¦ Installing base requirements...")
88
+ base_success = run_command(
89
+ 'pip install "gradio>=4.0.0" "numpy>=1.21.0" "soundfile>=0.12.1"',  # quoted: an unquoted >= is treated as shell redirection
90
+ "Installing base dependencies"
91
+ )
92
+
93
+ if not base_success:
94
+ print("❌ Failed to install base requirements. Exiting.")
95
+ return 1
96
+
97
+ # Install profile-specific requirements
98
+ success = True
99
+
100
+ if args.profile == "minimal":
101
+ print("\nπŸ“¦ Minimal installation - Gradio interface only")
102
+ # Base requirements already installed
103
+
104
+ elif args.profile == "essential":
105
+ print("\nπŸ“¦ Essential installation - Whisper + Wav2Vec2")
106
+ success = install_optional_dependencies(["essential"])
107
+
108
+ elif args.profile == "whisper-only":
109
+ print("\nπŸ“¦ Whisper-only installation")
110
+ success = install_requirements("requirements_whisper.txt")
111
+
112
+ elif args.profile == "wav2vec2-only":
113
+ print("\nπŸ“¦ Wav2Vec2-only installation")
114
+ success = install_requirements("requirements_wav2vec2.txt")
115
+
116
+ elif args.profile == "vosk-only":
117
+ print("\nπŸ“¦ Vosk-only installation")
118
+ success = install_requirements("requirements_vosk.txt")
119
+
120
+ elif args.profile == "hubert-only":
121
+ print("\nπŸ“¦ HuBERT Arabic-only installation")
122
+ success = install_requirements("requirements_hubert.txt")
123
+
124
+ elif args.profile == "coqui-only":
125
+ print("\nπŸ“¦ Coqui STT-only installation")
126
+ success = install_requirements("requirements_coqui.txt")
127
+
128
+ elif args.profile == "tawasul-only":
129
+ print("\nπŸ“¦ Tawasul STT-only installation")
130
+ success = install_requirements("requirements_tawasul.txt")
131
+
132
+ elif args.profile == "all":
133
+ print("\nπŸ“¦ Full installation - All STT models")
134
+ success = install_optional_dependencies(["all-stt"])
135
+
136
+ if not success:
137
+ print(f"❌ Failed to install {args.profile} profile requirements.")
138
+ return 1
139
+
140
+ # Test installation if requested
141
+ if args.test:
142
+ print("\nπŸ§ͺ Testing installation...")
143
+
144
+ # Basic imports
145
+ basic_modules = ["gradio", "numpy", "soundfile"]
146
+ test_imports(basic_modules)
147
+
148
+ # Profile-specific tests
149
+ if args.profile in ["essential", "whisper-only", "all"]:
150
+ whisper_modules = ["whisper", "openai"]
151
+ test_imports(whisper_modules)
152
+
153
+ if args.profile in ["essential", "wav2vec2-only", "hubert-only", "tawasul-only", "all"]:
154
+ wav2vec2_modules = ["transformers", "torch", "torchaudio"]
155
+ test_imports(wav2vec2_modules)
156
+
157
+ # Test our modules
158
+ try:
159
+ from stt.stt_base import BaseSTT
160
+ from stt.whisper_stt import WhisperSTT
161
+ print("βœ… STT base classes")
162
+ except ImportError as e:
163
+ print(f"❌ STT base classes: {e}")
164
+
165
+ if args.profile in ["essential", "wav2vec2-only", "all"]:
166
+ try:
167
+ from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
168
+ print("βœ… Wav2Vec2 Arabic STT")
169
+ except ImportError as e:
170
+ print(f"❌ Wav2Vec2 Arabic STT: {e}")
171
+
172
+ if args.profile in ["hubert-only", "all"]:
173
+ try:
174
+ from stt.hubert_arabic_stt import HuBERTArabicSTT
175
+ print("βœ… HuBERT Arabic STT")
176
+ except ImportError as e:
177
+ print(f"❌ HuBERT Arabic STT: {e}")
178
+
179
+ if args.profile in ["coqui-only", "all"]:
180
+ try:
181
+ from stt.coqui_stt import CoquiSTT
182
+ print("βœ… Coqui STT")
183
+ except ImportError as e:
184
+ print(f"❌ Coqui STT: {e}")
185
+
186
+ if args.profile in ["tawasul-only", "all"]:
187
+ try:
188
+ from stt.tawasul_stt import TawasulSTT
189
+ print("βœ… Tawasul STT")
190
+ except ImportError as e:
191
+ print(f"❌ Tawasul STT: {e}")
192
+
193
+ if args.profile in ["vosk-only", "all"]:
194
+ try:
195
+ from stt.vosk_stt import VoskSTT
196
+ print("βœ… Vosk STT")
197
+ except ImportError as e:
198
+ print(f"❌ Vosk STT: {e}")
199
+
200
+ print("\n" + "=" * 50)
201
+ print("πŸŽ‰ Setup completed!")
202
+ print("\nπŸ’‘ Next steps:")
203
+ print(" 1. Run the transcriber:")
204
+ print(" python gradio_voice_transcriber_clean.py")
205
+ print("\n 2. Or test specific models:")
206
+ print(" python test_wav2vec2_arabic.py")
207
+ print("\n 3. Check available models in the web interface")
208
+
209
+ return 0
210
+
211
+ if __name__ == "__main__":
212
+ sys.exit(main())
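+
+ # Example invocations, mirroring the argparse options defined above:
+ #
+ # python setup.py --profile essential --test
+ # python setup.py --profile hubert-only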
setup_hf_auth.py ADDED
@@ -0,0 +1,156 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ HuggingFace Authentication Helper
4
+
5
+ This script helps set up HuggingFace authentication for accessing private models.
6
+ """
7
+
8
+ import os
9
+ import subprocess
10
+ import sys
11
+ from pathlib import Path
12
+
13
+ def check_hf_cli():
14
+ """Check if huggingface-hub CLI is available."""
15
+ try:
16
+ result = subprocess.run(["huggingface-cli", "--version"],
17
+ capture_output=True, text=True, check=True)
18
+ print(f"βœ… HuggingFace CLI available: {result.stdout.strip()}")
19
+ return True
20
+ except (subprocess.CalledProcessError, FileNotFoundError):
21
+ print("❌ HuggingFace CLI not found")
22
+ return False
23
+
24
+ def install_hf_hub():
25
+ """Install huggingface-hub package."""
26
+ print("πŸ“¦ Installing huggingface-hub...")
27
+ try:
28
+ subprocess.run([sys.executable, "-m", "pip", "install", "huggingface-hub"],
29
+ check=True)
30
+ print("βœ… huggingface-hub installed successfully")
31
+ return True
32
+ except subprocess.CalledProcessError as e:
33
+ print(f"❌ Failed to install huggingface-hub: {e}")
34
+ return False
35
+
36
+ def login_to_hf():
37
+ """Login to HuggingFace using CLI."""
38
+ print("\nπŸ” Logging in to HuggingFace...")
39
+ print("This will open a browser to get your token.")
40
+ print("If you don't have a token, create one at: https://huggingface.co/settings/tokens")
41
+
42
+ try:
43
+ subprocess.run(["huggingface-cli", "login"], check=True)
44
+ print("βœ… Successfully logged in to HuggingFace")
45
+ return True
46
+ except subprocess.CalledProcessError as e:
47
+ print(f"❌ Failed to login: {e}")
48
+ return False
49
+
50
+ def check_auth_status():
51
+ """Check current authentication status."""
52
+ try:
53
+ result = subprocess.run(["huggingface-cli", "whoami"],
54
+ capture_output=True, text=True, check=True)
55
+ username = result.stdout.strip()
56
+ print(f"βœ… Logged in as: {username}")
57
+ return True, username
58
+ except subprocess.CalledProcessError:
59
+ print("❌ Not logged in to HuggingFace")
60
+ return False, None
61
+
62
+ def test_model_access():
63
+ """Test access to the Arabic Egyptian model."""
64
+ print("\nπŸ§ͺ Testing model access...")
65
+
66
+ try:
67
+ from transformers import AutoTokenizer
68
+
69
+ # Try the preferred Egyptian model first, then fall back to more general checkpoints
70
+ models_to_test = [
71
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
72
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
73
+ "facebook/wav2vec2-large-xlsr-53"
74
+ ]
75
+
76
+ for model_id in models_to_test:
77
+ try:
78
+ print(f"Testing: {model_id}")
79
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
80
+ print(f"βœ… {model_id} - Accessible")
81
+ return True
82
+ except Exception as e:
83
+ print(f"❌ {model_id} - {str(e)}")
84
+ continue
85
+
86
+ print("❌ None of the models are accessible")
87
+ return False
88
+
89
+ except ImportError:
90
+ print("❌ Transformers library not installed")
91
+ return False
92
+
93
+ def manual_token_setup():
94
+ """Guide user through manual token setup."""
95
+ print("\nπŸ“ Manual Token Setup")
96
+ print("=" * 40)
97
+ print("1. Go to: https://huggingface.co/settings/tokens")
98
+ print("2. Create a new token with 'Read' permissions")
99
+ print("3. Copy the token (starts with 'hf_')")
100
+ print("4. Use it in the Gradio interface:")
101
+ print(" - Select 'Wav2Vec2ArabicSTT'")
102
+ print(" - Choose 'Arabic Egyptian (Experimental)' model")
103
+ print(" - Enter your token in 'HuggingFace Token' field")
104
+ print(" - Click 'Load Model'")
105
+ print("\nπŸ’‘ Alternatively, set environment variable:")
106
+ print(" export HF_TOKEN=your_token_here")
107
+
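+ # A minimal sketch of the HF_TOKEN environment-variable route above
+ # (assumes huggingface_hub is installed; the token value is a placeholder):
+ #
+ # import os
+ # from huggingface_hub import whoami
+ # print(whoami(token=os.environ.get("HF_TOKEN")))
+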
108
+ def main():
109
+ """Main authentication helper."""
110
+ print("πŸ€— HuggingFace Authentication Helper")
111
+ print("=" * 50)
112
+
113
+ # Check if already logged in
114
+ is_logged_in, username = check_auth_status()
115
+
116
+ if is_logged_in:
117
+ print(f"\nβœ… Already authenticated as: {username}")
118
+
119
+ # Test model access
120
+ if test_model_access():
121
+ print("\nπŸŽ‰ Authentication is working! You can use the experimental models.")
122
+ else:
123
+ print("\n⚠️ Authentication works but model access failed.")
124
+ print("The experimental model might not be available.")
125
+ print("Try using the standard Arabic model instead.")
126
+
127
+ return 0
128
+
129
+ # Not logged in, try to set up
130
+ print("\n❌ Not authenticated with HuggingFace")
131
+
132
+ # Check if CLI is available
133
+ if not check_hf_cli():
134
+ print("\nπŸ“¦ Installing HuggingFace CLI...")
135
+ if not install_hf_hub():
136
+ print("\n❌ Failed to install HuggingFace Hub")
137
+ manual_token_setup()
138
+ return 1
139
+
140
+ # Try to login
141
+ print("\nπŸ” Setting up authentication...")
142
+ if login_to_hf():
143
+ # Test access after login
144
+ if test_model_access():
145
+ print("\nπŸŽ‰ Setup complete! You can now use all models.")
146
+ else:
147
+ print("\n⚠️ Login successful but some models may not be accessible.")
148
+ else:
149
+ print("\n❌ Automatic login failed")
150
+ manual_token_setup()
151
+ return 1
152
+
153
+ return 0
154
+
155
+ if __name__ == "__main__":
156
+ sys.exit(main())
stt/__init__.py ADDED
@@ -0,0 +1,19 @@
1
+ """
2
+ STT (Speech-to-Text) Package
3
+
4
+ This package contains various STT model implementations that inherit from BaseSTT.
5
+
6
+ Available STT Models:
7
+ - DummySTT: Test implementation for interface validation
8
+ - WhisperSTT: OpenAI Whisper implementation (local + API)
9
+ """
10
+
11
+ from .stt_base import BaseSTT, STTResult, DummySTT
12
+
13
+ # Import WhisperSTT with error handling
14
+ try:
15
+ from .whisper_stt import WhisperSTT
16
+ __all__ = ['BaseSTT', 'STTResult', 'DummySTT', 'WhisperSTT']
17
+ except ImportError:
18
+ # WhisperSTT dependencies not available
19
+ __all__ = ['BaseSTT', 'STTResult', 'DummySTT']
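+
+ # Usage sketch: callers can probe __all__ to see which backends imported
+ # cleanly before picking one (DummySTT is always present):
+ #
+ # import stt
+ # backend = stt.WhisperSTT if 'WhisperSTT' in stt.__all__ else stt.DummySTT
+ # backend.load_model()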
stt/chirp3_stt.py ADDED
@@ -0,0 +1,136 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Chirp3 Speech-to-Text (STT) Implementation
4
+
5
+ Chirp3STT adapts the Google Cloud Speech-to-Text API to the BaseSTT interface:
6
+ it accepts a WAV file path or a mono numpy array and returns an STTResult.
7
+ """
8
+
9
+ from .stt_base import BaseSTT, STTResult
10
+
11
+
12
+ import time
13
+ import numpy as np
14
+ from typing import Union
15
+ import io
16
+ import wave
17
+
18
+ try:
19
+ from google.cloud import speech
20
+ except ImportError:
21
+ speech = None
22
+
24
+
25
+ class Chirp3STT(BaseSTT):
26
+ """
27
+ Chirp3STT implementation using Google Cloud Speech-to-Text API.
28
+ Accepts file path or numpy array as input.
29
+ """
30
+ model_name = "Chirp3STT"
31
+ client = None
32
+ is_loaded = False
33
+ config = {
34
+ "language": "ar-EG",
35
+ "sample_rate": 16000,
36
+ "encoding": "LINEAR16",
37
+ "enable_automatic_punctuation": True,
38
+ }
39
+
40
+ @classmethod
41
+ def load_model(cls, **kwargs) -> None:
42
+ """
43
+ Initialize Google Cloud Speech client.
44
+ """
45
+ if speech is None:
+ raise ImportError("google-cloud-speech is not installed: pip install google-cloud-speech")
+ cls.client = speech.SpeechClient()
46
+ cls.is_loaded = True
47
+
48
+ @classmethod
49
+ def transcribe_audio(cls, audio_data: Union[str, np.ndarray], sample_rate: int = None):
50
+ """
51
+ Transcribe audio using Google Cloud Speech-to-Text API.
52
+ Args:
53
+ audio_data: Path to WAV file or numpy array (float32, mono)
54
+ sample_rate: Sample rate if numpy array is provided
55
+ Returns:
56
+ STTResult
57
+ """
58
+ if not cls.is_loaded:
59
+ raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")
60
+
61
+ start_time = time.time()
62
+ # Check google-cloud-speech import
63
+ if speech is None:
64
+ return STTResult(
65
+ text="",
66
+ confidence=0.0,
67
+ processing_time=0.0,
68
+ metadata={"error": "google-cloud-speech not installed"}
69
+ )
70
+
71
+ # Prepare audio for Google API
72
+ audio_content = None
73
+ actual_sample_rate = sample_rate or cls.config["sample_rate"]
74
+
75
+ if isinstance(audio_data, str):
76
+ # File path
77
+ try:
78
+ with open(audio_data, "rb") as f:
79
+ audio_content = f.read()
80
+ except Exception as e:
81
+ return STTResult(
82
+ text="",
83
+ confidence=0.0,
84
+ processing_time=0.0,
85
+ metadata={"error": f"Failed to read file: {e}"}
86
+ )
87
+ elif isinstance(audio_data, np.ndarray):
88
+ # Numpy array (float32 or int16)
89
+ arr = audio_data
90
+ if arr.dtype != np.int16:
91
+ arr = (arr * 32767).astype(np.int16)
92
+ buf = io.BytesIO()
93
+ with wave.open(buf, 'wb') as wf:
94
+ wf.setnchannels(1)
95
+ wf.setsampwidth(2)
96
+ wf.setframerate(actual_sample_rate)
97
+ wf.writeframes(arr.tobytes())
98
+ audio_content = buf.getvalue()
99
+ else:
100
+ return STTResult(
101
+ text="",
102
+ confidence=0.0,
103
+ processing_time=0.0,
104
+ metadata={"error": "Unsupported audio input type"}
105
+ )
106
+
107
+ # Prepare Google API request
108
+ audio = speech.RecognitionAudio(content=audio_content)
109
+ config = speech.RecognitionConfig(
110
+ encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
111
+ sample_rate_hertz=actual_sample_rate,
112
+ language_code=cls.config["language"],
113
+ enable_automatic_punctuation=cls.config["enable_automatic_punctuation"],
114
+ )
115
+ try:
116
+ response = cls.client.recognize(config=config, audio=audio)
117
+ if response.results:
118
+ transcript = response.results[0].alternatives[0].transcript
119
+ confidence = response.results[0].alternatives[0].confidence if response.results[0].alternatives else 0.0
120
+ else:
121
+ transcript = ""
122
+ confidence = 0.0
123
+ processing_time = time.time() - start_time
124
+ return STTResult(
125
+ text=transcript,
126
+ confidence=confidence,
127
+ processing_time=processing_time,
128
+ metadata={"api": "google-cloud-speech"}
129
+ )
130
+ except Exception as e:
131
+ return STTResult(
132
+ text="",
133
+ confidence=0.0,
134
+ processing_time=time.time() - start_time,
135
+ metadata={"error": str(e)}
136
+ )
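+
+ # Minimal usage sketch (assumes Google Cloud credentials are configured,
+ # e.g. via GOOGLE_APPLICATION_CREDENTIALS; "recording.wav" is a placeholder):
+ #
+ # Chirp3STT.load_model()
+ # result = Chirp3STT.transcribe_audio("recording.wav")
+ # print(result.text, result.confidence)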
stt/coqui_stt.py ADDED
@@ -0,0 +1,390 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Coqui STT Implementation with Model Manager
4
+
5
+ This module provides speech-to-text functionality using coqui-stt-model-manager.
6
+ The model manager provides a simplified interface for downloading and using
7
+ Coqui STT models with automatic model management.
8
+
9
+ Features:
10
+ - Automatic model downloading and management
11
+ - Multiple pre-trained models available
12
+ - Language-specific models
13
+ - Offline processing
14
+ - Simplified API interface
15
+ - GPU acceleration support
16
+
17
+ Dependencies:
18
+ - coqui-stt-model-manager
19
+ - numpy
20
+ - soundfile
21
+ - librosa (for audio preprocessing)
22
+
23
+ Model Management:
24
+ Models are automatically managed by the coqui-stt-model-manager.
25
+ Popular models include English, German, French, Spanish, and more.
26
+ """
27
+
28
+ import os
29
+ import logging
30
+ import tempfile
31
+ from pathlib import Path
32
+ from typing import Optional, Dict, Any, List, Tuple
33
+ import numpy as np
34
+
35
+ try:
36
+ from coqui_stt_model_manager import CoquiSTTModelManager
37
+ COQUI_STT_AVAILABLE = True
38
+ except ImportError:
39
+ COQUI_STT_AVAILABLE = False
40
+ CoquiSTTModelManager = None
41
+
42
+ try:
43
+ import soundfile as sf
44
+ SOUNDFILE_AVAILABLE = True
45
+ except ImportError:
46
+ SOUNDFILE_AVAILABLE = False
47
+ sf = None
48
+
49
+ try:
50
+ import librosa
51
+ LIBROSA_AVAILABLE = True
52
+ except ImportError:
53
+ LIBROSA_AVAILABLE = False
54
+ librosa = None
55
+
56
+ from .stt_base import BaseSTT
57
+
58
+ logger = logging.getLogger(__name__)
59
+
60
+
61
+ class CoquiSTT(BaseSTT):
62
+ """
63
+ Coqui STT implementation using coqui-stt-model-manager.
64
+
65
+ Coqui STT provides high-quality open-source speech recognition
66
+ with simplified model management through the model manager.
67
+ """
68
+
69
+ def __init__(self):
70
+ """Initialize Coqui STT with model manager."""
71
+ super().__init__()
72
+ self.model_manager = None
73
+ self.current_model = None
74
+ self.model_info = {}
75
+
76
+ # Available models through the model manager
77
+ self.available_models = {
78
+ "english-huge": {
79
+ "language": "en",
80
+ "description": "English model with huge vocabulary",
81
+ "model_id": "english-huge-vocab"
82
+ },
83
+ "english-large": {
84
+ "language": "en",
85
+ "description": "English model with large vocabulary",
86
+ "model_id": "english-large-vocab"
87
+ },
88
+ "german": {
89
+ "language": "de",
90
+ "description": "German language model",
91
+ "model_id": "german"
92
+ },
93
+ "french": {
94
+ "language": "fr",
95
+ "description": "French language model",
96
+ "model_id": "french"
97
+ },
98
+ "spanish": {
99
+ "language": "es",
100
+ "description": "Spanish language model",
101
+ "model_id": "spanish"
102
+ }
103
+ }
104
+
105
+ @classmethod
106
+ def is_available(cls) -> bool:
107
+ """Check if Coqui STT Model Manager is available."""
108
+ try:
109
+ from coqui_stt_model_manager import CoquiSTTModelManager
110
+ import soundfile
111
+ return True
112
+ except ImportError as e:
113
+ logger.warning(f"Coqui STT Model Manager dependencies not available: {e}")
114
+ return False
115
+
116
+ def check_dependencies(self) -> Tuple[bool, str]:
117
+ """Check if required dependencies are available."""
118
+ missing_deps = []
119
+
120
+ if not COQUI_STT_AVAILABLE:
121
+ missing_deps.append("coqui-stt-model-manager")
122
+
123
+ if not SOUNDFILE_AVAILABLE:
124
+ missing_deps.append("soundfile")
125
+
126
+ if not LIBROSA_AVAILABLE:
127
+ missing_deps.append("librosa (recommended for audio preprocessing)")
128
+
129
+ if missing_deps:
130
+ return False, f"Missing dependencies: {', '.join(missing_deps)}"
131
+
132
+ return True, "All dependencies available"
133
+
134
+ def load_model(
135
+ self,
136
+ model_name: str = "english-large",
137
+ auto_download: bool = True,
138
+ beam_width: int = 512,
139
+ lm_alpha: float = 0.931289039105002,
140
+ lm_beta: float = 1.1834137581510284,
141
+ **kwargs
142
+ ) -> None:
143
+ """
144
+ Load a Coqui STT model using the model manager.
145
+
146
+ Args:
147
+ model_name: Name of the model to load
148
+ auto_download: Whether to automatically download the model if not found
149
+ beam_width: Beam width for CTC beam search decoder
150
+ lm_alpha: Language model alpha parameter
151
+ lm_beta: Language model beta parameter
152
+ **kwargs: Additional model parameters
153
+
154
+ Raises:
155
+ RuntimeError: If model loading fails
156
+ """
157
+ deps_ok, deps_msg = self.check_dependencies()
158
+ if not deps_ok:
159
+ raise RuntimeError(f"Dependency check failed: {deps_msg}")
160
+
161
+ try:
162
+ # Initialize model manager
163
+ logger.info("Initializing Coqui STT Model Manager...")
164
+ self.model_manager = CoquiSTTModelManager()
165
+
166
+ # Get model identifier
167
+ if model_name in self.available_models:
168
+ model_id = self.available_models[model_name]["model_id"]
169
+ else:
170
+ model_id = model_name # Use as custom model ID
171
+
172
+ # Load the model through model manager
173
+ logger.info(f"Loading Coqui STT model: {model_id}")
174
+
175
+ if auto_download:
176
+ # Download and load model
177
+ self.current_model = self.model_manager.download_and_load_model(
178
+ model_id=model_id,
179
+ beam_width=beam_width,
180
+ lm_alpha=lm_alpha,
181
+ lm_beta=lm_beta
182
+ )
183
+ else:
184
+ # Try to load existing model
185
+ self.current_model = self.model_manager.load_model(
186
+ model_id=model_id,
187
+ beam_width=beam_width,
188
+ lm_alpha=lm_alpha,
189
+ lm_beta=lm_beta
190
+ )
191
+
192
+ # Store model info
193
+ self.model_info = {
194
+ "model_name": model_name,
195
+ "model_id": model_id,
196
+ "beam_width": beam_width,
197
+ "lm_alpha": lm_alpha,
198
+ "lm_beta": lm_beta,
199
+ }
200
+
201
+ if model_name in self.available_models:
202
+ self.model_info.update(self.available_models[model_name])
203
+
204
+ logger.info(f"Coqui STT model loaded successfully: {model_name}")
205
+
206
+ except Exception as e:
207
+ error_msg = f"Error loading Coqui STT model: {e}"
208
+ logger.error(error_msg)
209
+ raise RuntimeError(error_msg)
210
+
211
+ def preprocess_audio(self, audio_data: np.ndarray, sample_rate: int) -> np.ndarray:
212
+ """
213
+ Preprocess audio for Coqui STT.
214
+
215
+ Coqui STT requires 16kHz mono audio.
216
+
217
+ Args:
218
+ audio_data: Audio data as numpy array
219
+ sample_rate: Original sample rate
220
+
221
+ Returns:
222
+ Preprocessed audio data
223
+ """
224
+ try:
225
+ # Convert to mono if needed
226
+ if len(audio_data.shape) > 1:
227
+ audio_data = np.mean(audio_data, axis=1)
228
+
229
+ # Resample to 16kHz if needed
230
+ target_sr = 16000
231
+ if sample_rate != target_sr and LIBROSA_AVAILABLE:
232
+ audio_data = librosa.resample(audio_data, orig_sr=sample_rate, target_sr=target_sr)
233
+ sample_rate = target_sr
234
+ elif sample_rate != target_sr:
235
+ logger.warning(f"Audio is {sample_rate}Hz but Coqui STT requires 16kHz. Install librosa for automatic resampling.")
236
+
237
+ # Normalize audio
238
+ audio_data = audio_data.astype(np.float32)
239
+ if np.max(np.abs(audio_data)) > 0:
240
+ audio_data = audio_data / np.max(np.abs(audio_data))
241
+
242
+ # Convert to int16 as required by Coqui STT
243
+ audio_data = (audio_data * 32767).astype(np.int16)
244
+
245
+ return audio_data
246
+
247
+ except Exception as e:
248
+ logger.error(f"Error preprocessing audio: {e}")
249
+ return audio_data
250
+
251
+ def transcribe(self, audio_path: str, **kwargs) -> Tuple[str, str, str]:
252
+ """
253
+ Transcribe audio using Coqui STT Model Manager.
254
+
255
+ Args:
256
+ audio_path: Path to audio file
257
+ **kwargs: Additional transcription parameters
258
+
259
+ Returns:
260
+ Tuple of (transcription, confidence_info, processing_info)
261
+ """
262
+ if self.current_model is None:
263
+ return "❌ Model not loaded. Please load the model first.", "", ""
264
+
265
+ try:
266
+ import time
267
+ start_time = time.time()
268
+
269
+ # Validate file
270
+ if not os.path.exists(audio_path):
271
+ return f"❌ Audio file not found: {audio_path}", "", ""
272
+
273
+ logger.info(f"🎡 Transcribing audio with Coqui STT: {audio_path}")
274
+
275
+ # Load audio file
276
+ audio_data, sample_rate = sf.read(audio_path)
277
+
278
+ # Preprocess audio
279
+ processed_audio = self.preprocess_audio(audio_data, sample_rate)
280
+
281
+ # Get transcription parameters
282
+ return_confidence = kwargs.get("return_confidence", True)
283
+ return_timestamps = kwargs.get("return_timestamps", False)
284
+
285
+ # Perform transcription using model manager
286
+ if return_timestamps:
287
+ # Use metadata for word timestamps
288
+ result = self.model_manager.transcribe_with_metadata(
289
+ audio_data=processed_audio,
290
+ model=self.current_model
291
+ )
292
+
293
+ # Extract text and calculate confidence
294
+ transcription = ""
295
+ total_confidence = 0.0
296
+ word_count = 0
297
+
298
+ if hasattr(result, 'transcripts') and result.transcripts:
299
+ for token in result.transcripts[0].tokens:
300
+ transcription += token.text
301
+ if hasattr(token, 'confidence'):
302
+ total_confidence += token.confidence
303
+ word_count += 1
304
+
305
+ avg_confidence = total_confidence / word_count if word_count > 0 else 0.0
306
+
307
+ else:
308
+ # Simple transcription
309
+ transcription = self.model_manager.transcribe(
310
+ audio_data=processed_audio,
311
+ model=self.current_model
312
+ )
313
+ avg_confidence = 0.8 # Estimated confidence
314
+
315
+ # Calculate processing time
316
+ processing_time = time.time() - start_time
317
+ audio_duration = len(audio_data) / sample_rate
318
+
319
+ # Create info strings
320
+ confidence_info = f"Confidence: {avg_confidence:.2f}" if return_confidence else ""
321
+ processing_info = (
322
+ f"Duration: {audio_duration:.1f}s | "
323
+ f"Time: {processing_time:.1f}s | "
324
+ f"Model: {self.model_info.get('model_name', 'unknown')}"
325
+ )
326
+
327
+ logger.info(f"βœ… Transcription completed in {processing_time:.1f}s")
328
+
329
+ return transcription.strip(), confidence_info, processing_info
330
+
331
+ except Exception as e:
332
+ error_msg = f"❌ Coqui STT transcription failed: {str(e)}"
333
+ logger.error(error_msg)
334
+ return error_msg, "", ""
335
+
336
+ def get_supported_languages(self) -> List[str]:
337
+ """Get list of supported languages."""
338
+ return [
339
+ "en", # English
340
+ "de", # German
341
+ "fr", # French
342
+ "es", # Spanish
343
+ ]
344
+
345
+ def get_model_info(self) -> Dict[str, Any]:
346
+ """Get information about the currently loaded model."""
347
+ if self.current_model is None:
348
+ return {"error": "No model loaded"}
349
+
350
+ info = self.model_info.copy()
351
+ info.update({
352
+ "name": "Coqui STT with Model Manager",
353
+ "is_loaded": self.current_model is not None,
354
+ "supported_languages": self.get_supported_languages(),
355
+ "architecture": "DeepSpeech-based CTC",
356
+ "provider": "Coqui AI"
357
+ })
358
+
359
+ return info
360
+
361
+ def get_available_models(self) -> List[Dict[str, Any]]:
362
+ """Get list of available models."""
363
+ models = []
364
+ for name, info in self.available_models.items():
365
+ model_info = {
366
+ "name": name,
367
+ "language": info["language"],
368
+ "description": info["description"],
369
+ "model_id": info["model_id"]
370
+ }
371
+ models.append(model_info)
372
+
373
+ return models
374
+
375
+ def cleanup(self):
376
+ """Clean up resources."""
377
+ if self.current_model is not None:
378
+ # Model manager handles cleanup automatically
379
+ self.current_model = None
380
+
381
+ if self.model_manager is not None:
382
+ self.model_manager = None
383
+
384
+ self.model_info = {}
385
+
386
+ logger.info("Coqui STT cleanup completed")
387
+
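+ # Note: BaseSTT declares transcribe_audio() as abstract and CoquiSTT only
+ # defines transcribe(), so CoquiSTT() raises TypeError until an override
+ # exists. A minimal delegating shim (a sketch; assumes file-path input and
+ # needs STTResult imported from .stt_base):
+ #
+ # def transcribe_audio(self, audio_data, sample_rate=None):
+ #     text, _, _ = self.transcribe(str(audio_data))
+ #     return STTResult(text)
+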
388
+
389
+ # Export the class
390
+ __all__ = ["CoquiSTT", "COQUI_STT_AVAILABLE"]
stt/example_custom_stt.py ADDED
@@ -0,0 +1,288 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Example: Adding a Custom STT Model
4
+
5
+ This file demonstrates how to add a new STT model to the modular voice transcriber.
6
+ Follow this pattern to integrate any speech-to-text service.
7
+
8
+ Usage:
9
+ 1. Create your STT class following the BaseSTT interface
10
+ 2. Add it to the STT_MODELS registry in gradio_voice_transcriber_clean.py
11
+ 3. Update ModelManager.get_model_options() if needed
12
+ """
13
+
14
+ from typing import Union, Optional
15
+ import numpy as np
16
+ from pathlib import Path
17
+ import time
18
+ import random
19
+
20
+ from stt.stt_base import BaseSTT, STTResult
21
+
22
+
23
+ class ExampleCustomSTT(BaseSTT):
24
+ """
25
+ Example custom STT implementation.
26
+ This shows how to create a new STT model following the BaseSTT interface.
27
+
28
+ Replace this with actual integration to your preferred STT service:
29
+ - Azure Speech Service
30
+ - Google Cloud Speech-to-Text
31
+ - Amazon Transcribe
32
+ - IBM Watson Speech to Text
33
+ - AssemblyAI
34
+ - Rev.ai
35
+ - Or any other service
36
+ """
37
+
38
+ model_name = "ExampleCustomSTT"
39
+ model = None
40
+ is_loaded = False
41
+ config = {
42
+ "api_key": None,
43
+ "region": "us-east-1",
44
+ "language": "en-US",
45
+ "sample_rate": 16000
46
+ }
47
+
48
+ @classmethod
49
+ def load_model(cls, api_key: str = "", region: str = "us-east-1", **kwargs) -> None:
50
+ """
51
+ Load/initialize the custom STT service.
52
+
53
+ Args:
54
+ api_key: API key for the service
55
+ region: Service region
56
+ **kwargs: Additional configuration parameters
57
+ """
58
+ if not api_key:
59
+ raise ValueError("API key required for ExampleCustomSTT")
60
+
61
+ # Update configuration
62
+ cls.config.update({
63
+ "api_key": api_key,
64
+ "region": region,
65
+ **kwargs
66
+ })
67
+
68
+ # Initialize your STT service here
69
+ # Example:
70
+ # cls.model = YourSTTClient(
71
+ # api_key=api_key,
72
+ # region=region
73
+ # )
74
+
75
+ # For demonstration, just simulate initialization
76
+ print(f"Initializing ExampleCustomSTT with region {region}")
77
+ time.sleep(1) # Simulate initialization time
78
+
79
+ cls.model = f"custom_stt_client_{region}"
80
+ cls.is_loaded = True
81
+
82
+ print(f"βœ… {cls.model_name} loaded successfully")
83
+
84
+ @classmethod
85
+ def transcribe_audio(cls,
86
+ audio_data: Union[np.ndarray, str, Path],
87
+ sample_rate: Optional[int] = None) -> STTResult:
88
+ """
89
+ Transcribe audio using the custom STT service.
90
+
91
+ Args:
92
+ audio_data: Audio input (numpy array or file path)
93
+ sample_rate: Sample rate for numpy arrays
94
+
95
+ Returns:
96
+ STTResult: Transcription result with metadata
97
+ """
98
+ if not cls.is_loaded:
99
+ raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")
100
+
101
+ start_time = time.time()
102
+
103
+ # Handle different input types
104
+ if isinstance(audio_data, np.ndarray):
105
+ # For numpy arrays, you might need to:
106
+ # 1. Save to temporary file
107
+ # 2. Upload to service
108
+ # 3. Get transcription result
109
+
110
+ duration = len(audio_data) / (sample_rate or 16000)
111
+ print(f"Transcribing numpy array: {duration:.2f}s")
112
+
113
+ # Simulate API call
114
+ time.sleep(0.5 + duration * 0.1) # Simulate processing time
115
+
116
+ # Example transcription (replace with actual API call)
117
+ transcription = f"[Custom STT transcription of {duration:.1f}s audio]"
118
+ confidence = random.uniform(0.85, 0.98) # Simulate confidence
119
+
120
+ else:
121
+ # Handle file path
122
+ file_path = Path(audio_data)
123
+ print(f"Transcribing file: {file_path.name}")
124
+
125
+ # Simulate file upload and transcription
126
+ time.sleep(1.0)
127
+
128
+ transcription = f"[Custom STT transcription of {file_path.name}]"
129
+ confidence = random.uniform(0.80, 0.95)
130
+
131
+ processing_time = time.time() - start_time
132
+
133
+ # Prepare metadata
134
+ metadata = {
135
+ "model": cls.model_name,
136
+ "region": cls.config["region"],
137
+ "language": cls.config.get("language", "en-US"),
138
+ "api_used": True,
139
+ "service": "example-custom-service"
140
+ }
141
+
142
+ return STTResult(
143
+ text=transcription,
144
+ confidence=confidence,
145
+ processing_time=processing_time,
146
+ metadata=metadata
147
+ )
148
+
149
+ @classmethod
150
+ def set_language(cls, language: Optional[str]) -> None:
151
+ """Set the transcription language."""
152
+ if language:
153
+ cls.config["language"] = language
154
+ print(f"Language set to: {language}")
155
+
156
+ @classmethod
157
+ def get_supported_languages(cls) -> list:
158
+ """Get list of supported languages."""
159
+ return [
160
+ "en-US", "en-GB", "es-ES", "fr-FR", "de-DE",
161
+ "it-IT", "pt-BR", "ja-JP", "ko-KR", "zh-CN"
162
+ ]
163
+
164
+
165
+ # Example of how to integrate into the main application:
166
+ def integrate_custom_stt():
167
+ """
168
+ This function shows how to add the custom STT to the main application.
169
+
170
+ Add this to gradio_voice_transcriber_clean.py:
171
+ """
172
+
173
+ # 1. Import your custom STT class
174
+ from stt.example_custom_stt import ExampleCustomSTT
175
+
176
+ # 2. Add to STT_MODELS registry
177
+ STT_MODELS = {
178
+ "WhisperSTT": WhisperSTT,
179
+ "ExampleCustomSTT": ExampleCustomSTT, # Add this line
180
+ }
181
+
182
+ # 3. Update ModelManager.get_model_options() to include custom options
183
+ def get_model_options(model_name: str):
184
+ if model_name == "ExampleCustomSTT":
185
+ return {
186
+ "model_sizes": ["default"], # No size options for this service
187
+ "supports_api": True,
188
+ "languages": [
189
+ ("Auto-detect", "auto"),
190
+ ("English (US)", "en-US"),
191
+ ("English (UK)", "en-GB"),
192
+ ("Spanish", "es-ES"),
193
+ ("French", "fr-FR"),
194
+ ("German", "de-DE"),
195
+ ],
196
+ "custom_fields": [
197
+ {"name": "api_key", "type": "password", "label": "API Key", "required": True},
198
+ {"name": "region", "type": "dropdown", "label": "Region",
199
+ "choices": ["us-east-1", "us-west-2", "eu-west-1"], "default": "us-east-1"}
200
+ ]
201
+ }
202
+ # ... existing code for other models
203
+
204
+ # 4. Update the load_model function to handle custom parameters
205
+ def load_model(model_name: str, **kwargs):
206
+ if model_name == "ExampleCustomSTT":
207
+ api_key = kwargs.get("api_key", "")
208
+ region = kwargs.get("region", "us-east-1")
209
+
210
+ if not api_key:
211
+ return "❌ API key required for ExampleCustomSTT"
212
+
213
+ ExampleCustomSTT.load_model(api_key=api_key, region=region)
214
+ # ... rest of loading logic
215
+
216
+
217
+ # Real-world integration examples:
218
+
219
+ class AzureSTT(BaseSTT):
220
+ """Example Azure Speech Service integration."""
221
+
222
+ model_name = "AzureSTT"
223
+ model = None
224
+ is_loaded = False
225
+
226
+ @classmethod
227
+ def load_model(cls, subscription_key: str, region: str, **kwargs):
228
+ """Initialize Azure Speech SDK."""
229
+ try:
230
+ import azure.cognitiveservices.speech as speechsdk
231
+
232
+ speech_config = speechsdk.SpeechConfig(
233
+ subscription=subscription_key,
234
+ region=region
235
+ )
236
+ cls.model = speech_config
237
+ cls.is_loaded = True
238
+ except ImportError:
239
+ raise ImportError("Install Azure Speech SDK: pip install azure-cognitiveservices-speech")
240
+
241
+ @classmethod
242
+ def transcribe_audio(cls, audio_data, sample_rate=None):
243
+ """Transcribe using Azure Speech Service."""
244
+ # Implement Azure-specific transcription logic
245
+ pass
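+ # A hedged sketch for file input (assumes the Azure Speech SDK; error
+ # handling omitted):
+ #
+ # import azure.cognitiveservices.speech as speechsdk
+ # audio_config = speechsdk.audio.AudioConfig(filename=str(audio_data))
+ # recognizer = speechsdk.SpeechRecognizer(speech_config=cls.model,
+ #                                         audio_config=audio_config)
+ # return STTResult(text=recognizer.recognize_once().text)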
246
+
247
+
248
+ class GoogleSTT(BaseSTT):
249
+ """Example Google Cloud Speech-to-Text integration."""
250
+
251
+ model_name = "GoogleSTT"
252
+ model = None
253
+ is_loaded = False
254
+
255
+ @classmethod
256
+ def load_model(cls, credentials_path: str, **kwargs):
257
+ """Initialize Google Cloud Speech client."""
258
+ try:
259
+ from google.cloud import speech
260
+ import os
261
+
262
+ os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = credentials_path
263
+ cls.model = speech.SpeechClient()
264
+ cls.is_loaded = True
265
+ except ImportError:
266
+ raise ImportError("Install Google Cloud Speech: pip install google-cloud-speech")
267
+
268
+ @classmethod
269
+ def transcribe_audio(cls, audio_data, sample_rate=None):
270
+ """Transcribe using Google Cloud Speech."""
271
+ # Implement Google-specific transcription logic
272
+ pass
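+ # stt/chirp3_stt.py in this repo shows a complete google-cloud-speech
+ # recognize() flow that could be adapted here.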
273
+
274
+
275
+ if __name__ == "__main__":
276
+ # Test the example custom STT
277
+ print("Testing ExampleCustomSTT...")
278
+
279
+ # Load model
280
+ ExampleCustomSTT.load_model(api_key="test-api-key", region="us-east-1")
281
+
282
+ # Test transcription
283
+ dummy_audio = np.random.randn(16000).astype(np.float32) # 1 second
284
+ result = ExampleCustomSTT.transcribe_audio(dummy_audio, 16000)
285
+
286
+ print(f"Result: {result}")
287
+ print(f"Metadata: {result.metadata}")
288
+ print("βœ… Custom STT integration test completed!")
stt/hubert_arabic_stt.py ADDED
@@ -0,0 +1,568 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ HuBERT Arabic Egyptian STT Implementation
4
+
5
+ Hugging Face HuBERT speech-to-text implementation for Arabic Egyptian dialect
6
+ using the omarxadel/hubert-large-arabic-egyptian model.
7
+
8
+ Usage:
9
+ from stt.hubert_arabic_stt import HuBERTArabicSTT
10
+
11
+ # Load model
12
+ HuBERTArabicSTT.load_model()
13
+
14
+ # Transcribe audio
15
+ result = HuBERTArabicSTT.transcribe_audio(audio_array, 16000)
16
+ print(result.text)
17
+ """
18
+
19
+ from typing import Union, Optional, Dict, Any
20
+ import numpy as np
21
+ from pathlib import Path
22
+ import time
23
+ import logging
24
+ import warnings
25
+
26
+ # Suppress warnings for cleaner output
27
+ warnings.filterwarnings("ignore")
28
+
29
+ try:
30
+ import torch
31
+ import torchaudio
32
+ from transformers import (
33
+ HubertForCTC,
34
+ Wav2Vec2Processor,
35
+ Wav2Vec2Tokenizer,
36
+ AutoProcessor,
37
+ AutoModelForCTC
38
+ )
39
+ TRANSFORMERS_AVAILABLE = True
40
+ except ImportError:
41
+ TRANSFORMERS_AVAILABLE = False
42
+
43
+ try:
44
+ import librosa
45
+ LIBROSA_AVAILABLE = True
46
+ except ImportError:
47
+ LIBROSA_AVAILABLE = False
48
+
49
+ from .stt_base import BaseSTT, STTResult
50
+
51
+ logger = logging.getLogger(__name__)
52
+
53
+
54
+ class HuBERTArabicSTT(BaseSTT):
96
+ """
97
+ HuBERT Arabic Egyptian STT implementation using Hugging Face transformers.
98
+
99
+ Supports:
100
+ - Arabic Egyptian dialect transcription
101
+ - Local model execution (no API required)
102
+ - Automatic audio preprocessing
103
+ - Confidence estimation
104
+ - Chunked processing for long audio
105
+ """
106
+
107
+ model_name = "HuBERTArabicSTT"
108
+ model = None
109
+ processor = None
110
+ tokenizer = None
111
+ is_loaded = False
112
+ config = {
113
+ "model_id": "omarxadel/hubert-large-arabic-egyptian",
114
+ "fallback_models": [
115
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
116
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
117
+ "facebook/wav2vec2-large-xlsr-53",
118
+ ],
119
+ "device": "auto", # auto, cpu, cuda
120
+ "chunk_length": 15, # seconds, for long audio processing
121
+ "sample_rate": 16000,
122
+ "return_confidence": True,
123
+ "language": "ar-EG", # Arabic Egyptian
124
+ "hf_token": None, # Hugging Face token for private models
125
+ "use_auth_token": True # Try to use cached token
126
+ }
127
+
128
+ @classmethod
129
+ def load_model(cls,
130
+ model_id: str = None,
131
+ device: str = "auto",
132
+ hf_token: str = None,
133
+ **kwargs) -> None:
134
+ """
135
+ Load the HuBERT Arabic model.
136
+
137
+ Args:
138
+ model_id: Hugging Face model ID (default: omarxadel/hubert-large-arabic-egyptian)
139
+ device: Device to use (auto, cpu, cuda)
140
+ hf_token: Hugging Face token for private models (optional)
141
+ **kwargs: Additional configuration parameters
142
+ """
143
+ if not TRANSFORMERS_AVAILABLE:
144
+ raise ImportError(
145
+ "Transformers library required. Install with: "
146
+ "pip install transformers torch torchaudio"
147
+ )
148
+
149
+ # Update configuration
150
+ cls.config.update({
151
+ "model_id": model_id or cls.config["model_id"],
152
+ "device": device,
153
+ "hf_token": hf_token,
154
+ **kwargs
155
+ })
156
+
157
+ # Determine device
158
+ if device == "auto":
159
+ if torch.cuda.is_available():
160
+ device = "cuda"
161
+ elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
162
+ device = "mps" # Apple Silicon
163
+ else:
164
+ device = "cpu"
165
+
166
+ cls.config["device"] = device
167
+
168
+ # Try to load the model, with fallbacks
169
+ models_to_try = [cls.config["model_id"]] + cls.config["fallback_models"]
170
+
171
+ for model_id_to_try in models_to_try:
172
+ logger.info(f"Attempting to load HuBERT model: {model_id_to_try}")
173
+
174
+ try:
175
+ success = cls._load_model_with_id(model_id_to_try, device, hf_token)
176
+ if success:
177
+ cls.config["model_id"] = model_id_to_try # Update to successful model
178
+ return
179
+ except Exception as e:
180
+ logger.warning(f"Failed to load {model_id_to_try}: {e}")
181
+ continue
182
+
183
+ # If all models failed
184
+ raise RuntimeError(f"Failed to load any HuBERT model. Tried: {models_to_try}")
185
+
186
+ @classmethod
187
+ def _load_model_with_id(cls, model_id: str, device: str, hf_token: str = None) -> bool:
188
+ """
189
+ Load a specific model ID with authentication handling.
190
+
191
+ Returns:
192
+ bool: True if successful, False otherwise
193
+ """
194
+ logger.info(f"Loading HuBERT model: {model_id}")
195
+ logger.info(f"Using device: {device}")
196
+
197
+ start_time = time.time()
198
+
199
+ # Prepare authentication
200
+ auth_kwargs = {}
201
+ if hf_token:
202
+ auth_kwargs["token"] = hf_token
203
+ elif cls.config.get("use_auth_token", True):
204
+ auth_kwargs["use_auth_token"] = True
205
+
206
+ try:
207
+ # Try to load as HuBERT model first
208
+ if "hubert" in model_id.lower():
209
+ logger.info("Loading as HuBERT model...")
210
+ cls.processor = AutoProcessor.from_pretrained(model_id, **auth_kwargs)
211
+ cls.model = AutoModelForCTC.from_pretrained(model_id, **auth_kwargs)
212
+ else:
213
+ # Fallback to Wav2Vec2 for other models
214
+ logger.info("Loading as Wav2Vec2 model...")
215
+ cls.processor = Wav2Vec2Processor.from_pretrained(model_id, **auth_kwargs)
216
+ cls.model = AutoModelForCTC.from_pretrained(model_id, **auth_kwargs)
217
+
218
+ # Move model to device
219
+ cls.model = cls.model.to(device)
220
+ cls.model.eval() # Set to evaluation mode
221
+
222
+ # Load tokenizer for confidence calculation
223
+ try:
224
+ cls.tokenizer = Wav2Vec2Tokenizer.from_pretrained(model_id, **auth_kwargs)
225
+ except Exception as e:
226
+ logger.warning(f"Could not load tokenizer: {e}")
227
+ cls.tokenizer = None
228
+
229
+ cls.is_loaded = True
230
+ load_time = time.time() - start_time
231
+
232
+ logger.info(f"βœ… HuBERT model loaded successfully in {load_time:.2f}s")
233
+ logger.info(f"Model vocab size: {cls.model.config.vocab_size}")
234
+
235
+ return True
236
+
237
+ except Exception as e:
238
+ logger.error(f"Failed to load model {model_id}: {e}")
239
+ return False
240
+
241
+ @classmethod
242
+ def transcribe_audio(cls,
243
+ audio_data: Union[np.ndarray, str, Path],
244
+ sample_rate: Optional[int] = None) -> STTResult:
245
+ """
246
+ Transcribe audio using HuBERT Arabic model.
247
+
248
+ Args:
249
+ audio_data: Audio input (numpy array or file path)
250
+ sample_rate: Sample rate for numpy arrays
251
+
252
+ Returns:
253
+ STTResult: Transcription with confidence and metadata
254
+ """
255
+ if not cls.is_loaded:
256
+ raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")
257
+
258
+ start_time = time.time()
259
+
260
+ try:
261
+ # Process input audio
262
+ processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)
263
+
264
+ # Check audio length
265
+ duration = len(processed_audio) / actual_sr
266
+ if duration < 0.1:
267
+ return STTResult(
268
+ text="",
269
+ confidence=0.0,
270
+ processing_time=time.time() - start_time,
271
+ metadata={"error": "Audio too short", "duration": duration}
272
+ )
273
+
274
+ # Process with model
275
+ if duration > cls.config.get("chunk_length", 15):
276
+ # Handle long audio by chunking
277
+ text, confidence = cls._transcribe_long_audio(processed_audio, actual_sr)
278
+ else:
279
+ # Process short audio directly
280
+ text, confidence = cls._transcribe_chunk(processed_audio, actual_sr)
281
+
282
+ processing_time = time.time() - start_time
283
+
284
+ # Prepare metadata
285
+ metadata = {
286
+ "model": cls.config["model_id"],
287
+ "model_type": "HuBERT" if "hubert" in cls.config["model_id"].lower() else "Wav2Vec2",
288
+ "device": cls.config["device"],
289
+ "language": "ar-EG",
290
+ "duration": duration,
291
+ "sample_rate": actual_sr,
292
+ "chunks_processed": 1 if duration <= cls.config.get("chunk_length", 15) else int(duration / cls.config["chunk_length"]) + 1
293
+ }
294
+
295
+ return STTResult(
296
+ text=text.strip(),
297
+ confidence=confidence,
298
+ processing_time=processing_time,
299
+ metadata=metadata
300
+ )
301
+
302
+ except Exception as e:
303
+ error_msg = f"Transcription failed: {str(e)}"
304
+ logger.error(error_msg)
305
+ return STTResult(
306
+ text="",
307
+ confidence=0.0,
308
+ processing_time=time.time() - start_time,
309
+ metadata={"error": error_msg}
310
+ )
311
+
312
+ @classmethod
313
+ def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
314
+ """Process and validate audio input."""
315
+ if isinstance(audio_data, (str, Path)):
316
+ # Load audio file
317
+ audio_path = Path(audio_data)
318
+ if not audio_path.exists():
319
+ raise FileNotFoundError(f"Audio file not found: {audio_path}")
320
+
321
+ if LIBROSA_AVAILABLE:
322
+ audio_array, sr = librosa.load(str(audio_path), sr=cls.config["sample_rate"])
323
+ else:
324
+ # Fallback to torchaudio
325
+ audio_tensor, sr = torchaudio.load(str(audio_path))
326
+ audio_array = audio_tensor.numpy().flatten()
327
+
328
+ # Resample if needed
329
+ if sr != cls.config["sample_rate"]:
330
+ resampler = torchaudio.transforms.Resample(sr, cls.config["sample_rate"])
331
+ audio_tensor = resampler(audio_tensor)
332
+ audio_array = audio_tensor.numpy().flatten()
333
+ sr = cls.config["sample_rate"]
334
+
335
+ else:
336
+ # Handle numpy array
337
+ audio_array = audio_data.astype(np.float32)
338
+ sr = sample_rate or cls.config["sample_rate"]
339
+
340
+ # Resample if needed
341
+ if sr != cls.config["sample_rate"]:
342
+ if LIBROSA_AVAILABLE:
343
+ audio_array = librosa.resample(
344
+ audio_array,
345
+ orig_sr=sr,
346
+ target_sr=cls.config["sample_rate"]
347
+ )
348
+ else:
349
+ # Simple resampling fallback
350
+ if sr > cls.config["sample_rate"]:
351
+ step = sr // cls.config["sample_rate"]
352
+ audio_array = audio_array[::step]
353
+ else:
354
+ repeat = cls.config["sample_rate"] // sr
355
+ audio_array = np.repeat(audio_array, repeat)
356
+
357
+ sr = cls.config["sample_rate"]
358
+
359
+ # Normalize audio
360
+ if len(audio_array) > 0:
361
+ # Convert to mono if stereo
362
+ if audio_array.ndim > 1:
363
+ audio_array = np.mean(audio_array, axis=0)
364
+
365
+ # Normalize to [-1, 1]
366
+ max_val = np.max(np.abs(audio_array))
367
+ if max_val > 0:
368
+ audio_array = audio_array / max_val
369
+
370
+ return audio_array, sr
371
+
372
+ @classmethod
373
+ def _transcribe_chunk(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
374
+ """Transcribe a single audio chunk."""
375
+ # Preprocess audio
376
+ input_values = cls.processor(
377
+ audio_array,
378
+ sampling_rate=sample_rate,
379
+ return_tensors="pt",
380
+ padding=True
381
+ )
382
+
383
+ # Move to device
384
+ input_values = {k: v.to(cls.config["device"]) for k, v in input_values.items()}
385
+
386
+ # Inference
387
+ with torch.no_grad():
388
+ logits = cls.model(**input_values).logits
389
+
390
+ # Get predicted tokens
391
+ predicted_ids = torch.argmax(logits, dim=-1)
392
+
393
+ # Decode transcription
394
+ transcription = cls.processor.batch_decode(predicted_ids)[0]
395
+
396
+ # Calculate confidence (average of max probabilities)
397
+ confidence = cls._calculate_confidence(logits)
398
+
399
+ return transcription, confidence
400
+
401
+ @classmethod
402
+ def _transcribe_long_audio(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
403
+ """Transcribe long audio by chunking."""
404
+ chunk_length = cls.config.get("chunk_length", 15)
405
+ chunk_samples = int(chunk_length * sample_rate)
406
+ overlap_samples = int(1.0 * sample_rate) # 1 second overlap
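+ # The effective stride is chunk_samples - overlap_samples: with the default
+ # 15 s chunks and 1 s overlap, each window starts 14 s after the previous one.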
407
+
408
+ transcriptions = []
409
+ confidences = []
410
+
411
+ for start in range(0, len(audio_array), chunk_samples - overlap_samples):
412
+ end = min(start + chunk_samples, len(audio_array))
413
+ chunk = audio_array[start:end]
414
+
415
+ if len(chunk) < 0.5 * sample_rate: # Skip very short chunks
416
+ continue
417
+
418
+ try:
419
+ chunk_text, chunk_confidence = cls._transcribe_chunk(chunk, sample_rate)
420
+ if chunk_text.strip():
421
+ transcriptions.append(chunk_text.strip())
422
+ confidences.append(chunk_confidence)
423
+ except Exception as e:
424
+ logger.warning(f"Failed to transcribe chunk: {e}")
425
+ continue
426
+
427
+ # Combine results
428
+ full_text = " ".join(transcriptions)
429
+ avg_confidence = np.mean(confidences) if confidences else 0.0
430
+
431
+ return full_text, avg_confidence
432
+
433
+ @classmethod
434
+ def _calculate_confidence(cls, logits: torch.Tensor) -> float:
435
+ """Calculate confidence score from model logits."""
436
+ try:
437
+ # Apply softmax to get probabilities
438
+ probabilities = torch.softmax(logits, dim=-1)
439
+
440
+ # Get maximum probability for each time step
441
+ max_probs = torch.max(probabilities, dim=-1)[0]
442
+
443
+ # Average over time steps (excluding padding if any)
444
+ confidence = torch.mean(max_probs).item()
445
+
446
+ return confidence
447
+
448
+ except Exception as e:
449
+ logger.warning(f"Could not calculate confidence: {e}")
450
+ return 0.5 # Default confidence
451
+
452
+ @classmethod
453
+ def get_available_models(cls) -> Dict[str, Any]:
454
+ """Get information about available HuBERT models."""
455
+ models_info = {
456
+ "transformers_available": TRANSFORMERS_AVAILABLE,
457
+ "librosa_available": LIBROSA_AVAILABLE,
458
+ "torch_available": True if TRANSFORMERS_AVAILABLE else False,
459
+ }
460
+
461
+ if TRANSFORMERS_AVAILABLE:
462
+ models_info.update({
463
+ "cuda_available": torch.cuda.is_available(),
464
+ "mps_available": hasattr(torch.backends, 'mps') and torch.backends.mps.is_available(),
465
+ "hubert_models": [
466
+ {
467
+ "id": "omarxadel/hubert-large-arabic-egyptian",
468
+ "name": "HuBERT Arabic Egyptian (Large)",
469
+ "language": "Arabic Egyptian Dialect",
470
+ "size": "1.3GB",
471
+ "type": "HuBERT"
472
+ }
473
+ ],
474
+ "fallback_models": [
475
+ {
476
+ "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
477
+ "name": "Wav2Vec2 Arabic Egyptian",
478
+ "language": "Arabic Egyptian",
479
+ "size": "1.2GB",
480
+ "type": "Wav2Vec2"
481
+ },
482
+ {
483
+ "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
484
+ "name": "Wav2Vec2 Arabic Standard",
485
+ "language": "Arabic Standard",
486
+ "size": "1.2GB",
487
+ "type": "Wav2Vec2"
488
+ },
489
+ {
490
+ "id": "facebook/wav2vec2-large-xlsr-53",
491
+ "name": "Wav2Vec2 Multilingual",
492
+ "language": "Multilingual",
493
+ "size": "1.2GB",
494
+ "type": "Wav2Vec2"
495
+ }
496
+ ]
497
+ })
498
+
499
+ return models_info
500
+
501
+ @classmethod
502
+ def set_language(cls, language: Optional[str]) -> None:
503
+ """Set language (for compatibility - this model is Arabic-specific)."""
504
+ if language and not language.startswith("ar"):
505
+ logger.warning(f"This model is optimized for Arabic. Language '{language}' may not work well.")
506
+
507
+ cls.config["language"] = language or "ar-EG"
508
+ logger.info(f"Language set to: {cls.config['language']}")
509
+
510
+ @classmethod
511
+ def set_device(cls, device: str) -> None:
512
+ """Change device for model inference."""
513
+ if cls.model is not None:
514
+ cls.model = cls.model.to(device)
515
+ cls.config["device"] = device
516
+ logger.info(f"Model moved to device: {device}")
517
+
518
+ @classmethod
519
+ def get_model_info(cls) -> Dict[str, Any]:
520
+ """Get detailed model information."""
521
+ base_info = super().get_model_info()
522
+
523
+ if cls.is_loaded:
524
+ base_info.update({
525
+ "model_id": cls.config["model_id"],
526
+ "model_type": "HuBERT" if "hubert" in cls.config["model_id"].lower() else "Wav2Vec2",
527
+ "device": cls.config["device"],
528
+ "language": cls.config["language"],
529
+ "sample_rate": cls.config["sample_rate"],
530
+ "vocab_size": cls.model.config.vocab_size if cls.model else None,
531
+ "chunk_length": cls.config["chunk_length"],
532
+ })
533
+
534
+ return base_info
535
+
536
+
537
+ # Example usage and testing
538
+ if __name__ == "__main__":
539
+ print("Testing HuBERT Arabic STT implementation...")
540
+
541
+ # Check availability
542
+ models_info = HuBERTArabicSTT.get_available_models()
543
+ print(f"Available models info: {models_info}")
544
+
545
+ if models_info["transformers_available"]:
546
+ try:
547
+ print("Loading HuBERT Arabic model...")
548
+ HuBERTArabicSTT.load_model(device="cpu") # Use CPU for testing
549
+
550
+ print("Creating test audio...")
551
+ # Generate test audio (2 seconds of random noise)
552
+ test_audio = np.random.randn(32000).astype(np.float32) * 0.1
553
+
554
+ print("Testing transcription...")
555
+ result = HuBERTArabicSTT.transcribe_audio(test_audio, 16000)
556
+ print(f"Result: {result}")
557
+ print(f"Metadata: {result.metadata}")
558
+
559
+ except Exception as e:
560
+ print(f"Error: {e}")
561
+ print("Note: This is expected with random audio - the model expects Arabic speech")
562
+
563
+ else:
564
+ print("Transformers not installed - install with:")
565
+ print("pip install transformers torch torchaudio")
566
+ print("Optional: pip install librosa (for better audio processing)")
567
+
568
+ print("\nHuBERT Arabic STT implementation ready!")
stt/stt_base.py ADDED
@@ -0,0 +1,251 @@
#!/usr/bin/env python3
"""
Base Speech-to-Text (STT) Static Class

This module provides an abstract base class for implementing different STT models using static methods.
All STT implementations should inherit from this class and implement the required static methods.

Usage:
    from stt_base import BaseSTT

    class MySTTModel(BaseSTT):
        model = None  # Class variable to hold the model

        @classmethod
        def load_model(cls):
            # Load your specific model
            cls.model = your_model_loader()

        @classmethod
        def transcribe_audio(cls, audio_data, sample_rate):
            # Implement transcription logic
            return STTResult("transcribed text")
"""

from abc import ABC, abstractmethod
from typing import Union, Optional, Dict, Any, ClassVar
import numpy as np
from pathlib import Path
import time
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class STTResult:
    """Container for STT transcription results with metadata."""

    def __init__(self,
                 text: str,
                 confidence: Optional[float] = None,
                 processing_time: Optional[float] = None,
                 metadata: Optional[Dict[str, Any]] = None):
        self.text = text
        self.confidence = confidence
        self.processing_time = processing_time
        self.metadata = metadata or {}

    def __str__(self) -> str:
        return self.text

    def __repr__(self) -> str:
        return f"STTResult(text='{self.text}', confidence={self.confidence}, time={self.processing_time}s)"


class BaseSTT(ABC):
    """
    Abstract base class for Speech-to-Text models using static methods.

    All STT implementations must inherit from this class and implement:
    - load_model(): Load and initialize the STT model (classmethod)
    - transcribe_audio(): Convert audio to text (classmethod)
    """

    # Class variables that subclasses should define
    model_name: ClassVar[str] = "BaseSTT"
    model: ClassVar[Any] = None
    is_loaded: ClassVar[bool] = False
    config: ClassVar[Dict[str, Any]] = {}

    @classmethod
    @abstractmethod
    def load_model(cls) -> None:
        """
        Load and initialize the STT model.

        This method must be implemented by subclasses to load their specific model.
        After successful loading, set cls.is_loaded = True
        """
        pass

    @classmethod
    @abstractmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio data to text.

        Args:
            audio_data: Audio input - can be:
                - numpy array of audio samples
                - path to audio file (str or Path)
            sample_rate: Sample rate of audio data (required for numpy arrays)

        Returns:
            STTResult: Object containing transcribed text and metadata

        This method must be implemented by subclasses.
        """
        pass

    @classmethod
    def transcribe_file(cls, file_path: Union[str, Path]) -> STTResult:
        """
        Transcribe an audio file to text.

        Args:
            file_path: Path to the audio file

        Returns:
            STTResult: Transcription result
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} model not loaded. Call load_model() first.")

        file_path = Path(file_path)
        if not file_path.exists():
            raise FileNotFoundError(f"Audio file not found: {file_path}")

        logger.info(f"Transcribing file: {file_path}")
        start_time = time.time()

        result = cls.transcribe_audio(file_path)

        if result.processing_time is None:
            result.processing_time = time.time() - start_time

        logger.info(f"Transcription completed in {result.processing_time:.2f}s")
        return result

    @classmethod
    def transcribe_numpy(cls,
                         audio_array: np.ndarray,
                         sample_rate: int) -> STTResult:
        """
        Transcribe a numpy array of audio samples to text.

        Args:
            audio_array: Audio samples as numpy array
            sample_rate: Sample rate of the audio

        Returns:
            STTResult: Transcription result
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} model not loaded. Call load_model() first.")

        if not isinstance(audio_array, np.ndarray):
            raise TypeError("audio_array must be a numpy array")

        logger.info(f"Transcribing numpy array: shape={audio_array.shape}, sr={sample_rate}")
        start_time = time.time()

        result = cls.transcribe_audio(audio_array, sample_rate)

        if result.processing_time is None:
            result.processing_time = time.time() - start_time

        logger.info(f"Transcription completed in {result.processing_time:.2f}s")
        return result

    @classmethod
    def get_model_info(cls) -> Dict[str, Any]:
        """
        Get information about the loaded model.

        Returns:
            Dict containing model information
        """
        return {
            "model_name": cls.model_name,
            "is_loaded": cls.is_loaded,
            "config": cls.config
        }

    @classmethod
    def ensure_loaded(cls) -> None:
        """Ensure the model is loaded, load it if not."""
        if not cls.is_loaded:
            cls.load_model()

    @classmethod
    def get_status(cls) -> str:
        """Get a string representation of the model status."""
        status = "loaded" if cls.is_loaded else "not loaded"
        return f"{cls.model_name} STT Model ({status})"


class DummySTT(BaseSTT):
    """
    Dummy STT implementation for testing the static class interface.
    Returns placeholder text instead of actual transcription.
    """

    model_name = "DummySTT"
    model = None
    is_loaded = False
    config = {}

    @classmethod
    def load_model(cls) -> None:
        """Load the dummy model (just a placeholder)."""
        logger.info("Loading dummy STT model...")
        time.sleep(0.5)  # Simulate loading time
        cls.model = "dummy_model_loaded"
        cls.is_loaded = True
        logger.info("Dummy STT model loaded successfully")

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Dummy transcription - returns placeholder text.
        """
        if isinstance(audio_data, np.ndarray):
            duration = len(audio_data) / (sample_rate or 16000)
            text = f"[Dummy transcription of {duration:.1f}s audio]"
        else:
            text = f"[Dummy transcription of file: {Path(audio_data).name}]"

        # Simulate processing time
        processing_time = 0.1 + np.random.random() * 0.2
        time.sleep(processing_time)

        return STTResult(
            text=text,
            confidence=0.95,
            processing_time=processing_time,
            metadata={"model": "dummy", "simulated": True}
        )


# Example usage and testing
if __name__ == "__main__":
    # Test the dummy implementation
    print("Testing BaseSTT with static DummySTT implementation...")

    # Load the model
    DummySTT.load_model()

    # Test with dummy numpy array
    dummy_audio = np.random.randn(16000)  # 1 second at 16kHz
    result = DummySTT.transcribe_numpy(dummy_audio, 16000)
    print(f"Numpy result: {result}")
    print(f"Model info: {DummySTT.get_model_info()}")
    print(f"Status: {DummySTT.get_status()}")

    print("\nStatic BaseSTT interface ready for real STT implementations!")
stt/tawasul_stt.py ADDED
@@ -0,0 +1,448 @@
#!/usr/bin/env python3
"""
Tawasul STT V0 Implementation

This module provides Arabic speech-to-text transcription using the Tawasul STT V0 model,
which is specifically designed for Arabic language recognition.

Tawasul STT V0 is built on Wav2Vec2 architecture and fine-tuned for Arabic speech.
"""

import os
import logging
import warnings
from pathlib import Path
from typing import Optional, Dict, Any, Tuple, List, Union
import time

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Try to import torch for type hints
try:
    import torch
    TORCH_AVAILABLE = True
except ImportError:
    TORCH_AVAILABLE = False
    # Create a dummy torch class for type hints when torch is not available
    class torch:
        class Tensor:
            pass

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class TawasulSTT:
    """
    Tawasul STT V0 static implementation for Arabic speech recognition.

    This class provides Arabic speech-to-text transcription using the Tawasul STT V0 model,
    which is specifically optimized for Arabic language variants.
    All methods are static for direct class-level access.
    """

    # Class variables for model state
    model = None
    processor = None
    tokenizer = None
    device = "cpu"
    model_id = "Kareem35/Tawasul-STT-V0"
    is_loaded = False
    hf_token = None
    chunk_length = 20  # seconds
    max_audio_length = 300  # 5 minutes max

    # Model fallback chain for better reliability
    fallback_models = [
        "Kareem35/Tawasul-STT-V0",
        "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
        "facebook/wav2vec2-large-xlsr-53",
        "facebook/wav2vec2-base-960h"
    ]

    @staticmethod
    def is_available() -> bool:
        """Check if Tawasul STT dependencies are available."""
        if not TORCH_AVAILABLE:
            logger.warning("Tawasul STT dependencies not available: torch not installed")
            return False

        try:
            import transformers
            import torchaudio
            import librosa
            import soundfile
            return True
        except ImportError as e:
            logger.warning(f"Tawasul STT dependencies not available: {e}")
            return False

    @staticmethod
    def load_model(
        model_id: Optional[str] = None,
        device: str = "auto",
        chunk_length: int = 20,
        hf_token: Optional[str] = None,
        max_audio_length: int = 300,
        **kwargs
    ) -> None:
        """
        Load the Tawasul STT V0 model.

        Args:
            model_id: Model identifier (defaults to Tawasul STT V0)
            device: Device to use ('auto', 'cpu', 'cuda', 'mps')
            chunk_length: Audio chunk length in seconds for processing
            hf_token: Hugging Face authentication token
            max_audio_length: Maximum audio length in seconds
            **kwargs: Additional model parameters
        """
        try:
            import torch
            import transformers
            from transformers import (
                Wav2Vec2ForCTC,
                Wav2Vec2Processor,
                Wav2Vec2Tokenizer
            )
            import torchaudio
            import librosa

            # Set authentication token
            if hf_token:
                TawasulSTT.hf_token = hf_token
                # Set token for transformers
                try:
                    from huggingface_hub import login
                    login(token=hf_token, add_to_git_credential=True)
                    logger.info("βœ… Authenticated with Hugging Face")
                except Exception as e:
                    logger.warning(f"HF authentication warning: {e}")

            # Determine device
            if device == "auto":
                if torch.cuda.is_available():
                    TawasulSTT.device = "cuda"
                elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
                    TawasulSTT.device = "mps"
                else:
                    TawasulSTT.device = "cpu"
            else:
                TawasulSTT.device = device

            # Set model parameters
            TawasulSTT.model_id = model_id or "Kareem35/Tawasul-STT-V0"
            TawasulSTT.chunk_length = chunk_length
            TawasulSTT.max_audio_length = max_audio_length

            # Try loading the model with fallback chain
            model_loaded = False
            last_error = None

            models_to_try = [TawasulSTT.model_id] + [m for m in TawasulSTT.fallback_models if m != TawasulSTT.model_id]

            for model_name in models_to_try:
                try:
                    logger.info(f"πŸ”„ Loading Tawasul STT model: {model_name}")

                    # Load model components
                    TawasulSTT.processor = Wav2Vec2Processor.from_pretrained(
                        model_name,
                        token=TawasulSTT.hf_token,
                        trust_remote_code=True
                    )

                    TawasulSTT.model = Wav2Vec2ForCTC.from_pretrained(
                        model_name,
                        token=TawasulSTT.hf_token,
                        trust_remote_code=True
                    )

                    # Try to load tokenizer if available
                    try:
                        TawasulSTT.tokenizer = Wav2Vec2Tokenizer.from_pretrained(
                            model_name,
                            token=TawasulSTT.hf_token
                        )
                    except Exception:
                        logger.info("Using processor instead of separate tokenizer")
                        TawasulSTT.tokenizer = TawasulSTT.processor.tokenizer

                    # Move model to device
                    TawasulSTT.model = TawasulSTT.model.to(TawasulSTT.device)
                    TawasulSTT.model.eval()

                    # Test model with dummy input
                    test_input = torch.randn(1, 16000).to(TawasulSTT.device)
                    with torch.no_grad():
                        _ = TawasulSTT.model(test_input)

                    TawasulSTT.model_id = model_name  # Update to actually loaded model
                    model_loaded = True
                    logger.info(f"βœ… Successfully loaded Tawasul STT model: {model_name} on {TawasulSTT.device}")
                    break

                except Exception as e:
                    last_error = e
                    logger.warning(f"Failed to load {model_name}: {str(e)}")
                    continue

            if not model_loaded:
                raise RuntimeError(f"Failed to load any Tawasul STT model. Last error: {last_error}")

            TawasulSTT.is_loaded = True

            # Log model info
            total_params = sum(p.numel() for p in TawasulSTT.model.parameters())
            logger.info(f"πŸ“Š Model loaded: {total_params:,} parameters on {TawasulSTT.device}")

        except Exception as e:
            error_msg = f"Failed to load Tawasul STT model: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @staticmethod
    def _preprocess_audio(audio_path: str) -> Tuple[torch.Tensor, int]:
        """
        Preprocess audio file for Tawasul STT model.

        Args:
            audio_path: Path to audio file

        Returns:
            Tuple of (audio_tensor, sample_rate)
        """
        try:
            import librosa
            import torch
            import numpy as np

            # Load audio file with proper error handling
            try:
                # Load audio at 16kHz as required by Tawasul STT
                audio, sample_rate = librosa.load(audio_path, sr=16000, mono=True)
            except Exception as load_error:
                raise RuntimeError(f"Failed to load audio file {audio_path}: {load_error}")

            # Validate audio data
            if len(audio) == 0:
                raise RuntimeError("Audio file is empty or corrupted")

            # Convert to float32 for processing
            audio = audio.astype(np.float32)

            # Remove DC offset (center around zero)
            audio = audio - np.mean(audio)

            # Normalize audio with proper scaling
            max_val = np.max(np.abs(audio))
            if max_val > 0:
                # Normalize to [-0.95, 0.95] to prevent clipping
                audio = audio / max_val * 0.95
            else:
                logger.warning("Audio appears to be silent")

            # Apply simple noise gate to reduce background noise
            noise_threshold = np.max(np.abs(audio)) * 0.01  # 1% of max amplitude
            audio = np.where(np.abs(audio) < noise_threshold, 0, audio)

            # Check and limit audio duration
            audio_duration = len(audio) / sample_rate
            if audio_duration > TawasulSTT.max_audio_length:
                logger.warning(f"Audio duration ({audio_duration:.1f}s) exceeds maximum ({TawasulSTT.max_audio_length}s)")
                # Truncate to maximum length
                max_samples = int(TawasulSTT.max_audio_length * sample_rate)
                audio = audio[:max_samples]
                logger.info(f"Audio truncated to {TawasulSTT.max_audio_length}s")

            # Validate minimum duration
            min_duration = 0.1  # 100ms minimum
            if audio_duration < min_duration:
                logger.warning(f"Audio duration ({audio_duration:.3f}s) is very short")

            # Convert to PyTorch tensor
            audio_tensor = torch.FloatTensor(audio)

            # Log preprocessing info
            final_duration = len(audio_tensor) / sample_rate
            logger.debug(f"Audio preprocessed: {final_duration:.2f}s, max_amp: {torch.max(torch.abs(audio_tensor)):.3f}")

            return audio_tensor, sample_rate

        except Exception as e:
            error_msg = f"Audio preprocessing failed for {audio_path}: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @staticmethod
    def _chunk_audio(audio_tensor: torch.Tensor, sample_rate: int) -> List[torch.Tensor]:
        """
        Split audio into chunks for processing.

        Args:
            audio_tensor: Audio tensor
            sample_rate: Sample rate

        Returns:
            List of audio chunks
        """
        chunk_samples = int(TawasulSTT.chunk_length * sample_rate)
        chunks = []

        for i in range(0, len(audio_tensor), chunk_samples):
            chunk = audio_tensor[i:i + chunk_samples]
            if len(chunk) > sample_rate * 0.5:  # Only process chunks > 0.5 seconds
                chunks.append(chunk)

        return chunks

    @staticmethod
    def _transcribe_chunk(audio_chunk: torch.Tensor) -> Tuple[str, float]:
        """
        Transcribe a single audio chunk.

        Args:
            audio_chunk: Audio chunk tensor

        Returns:
            Tuple of (transcription, confidence_score)
        """
        try:
            import torch

            # Prepare input
            input_values = TawasulSTT.processor(
                audio_chunk,
                sampling_rate=16000,
                return_tensors="pt"
            ).input_values

            input_values = input_values.to(TawasulSTT.device)

            # Get model predictions
            with torch.no_grad():
                logits = TawasulSTT.model(input_values).logits

            # Get predicted tokens
            predicted_ids = torch.argmax(logits, dim=-1)

            # Decode transcription
            transcription = TawasulSTT.processor.decode(predicted_ids[0])

            # Calculate confidence (approximation)
            probs = torch.nn.functional.softmax(logits, dim=-1)
            max_probs = torch.max(probs, dim=-1)[0]
            confidence = torch.mean(max_probs).item()

            return transcription.strip(), confidence

        except Exception as e:
            logger.error(f"Chunk transcription error: {str(e)}")
            return "", 0.0

    @staticmethod
    def transcribe(audio_path: str, **kwargs) -> Tuple[str, str, str]:
        """
        Transcribe audio file using Tawasul STT V0.

        Args:
            audio_path: Path to audio file
            **kwargs: Additional transcription parameters

        Returns:
            Tuple of (transcription, confidence_info, processing_info)
        """
        if not TawasulSTT.is_loaded:
            return "❌ Model not loaded. Please load the model first.", "", ""

        try:
            start_time = time.time()

            # Validate file
            if not os.path.exists(audio_path):
                return f"❌ Audio file not found: {audio_path}", "", ""

            logger.info(f"🎡 Transcribing audio with Tawasul STT: {audio_path}")

            # Preprocess audio
            audio_tensor, sample_rate = TawasulSTT._preprocess_audio(audio_path)
            audio_duration = len(audio_tensor) / sample_rate

            # Process audio in chunks
            chunks = TawasulSTT._chunk_audio(audio_tensor, sample_rate)

            if not chunks:
                return "❌ No valid audio chunks found", "", ""

            # Transcribe each chunk
            transcriptions = []
            confidences = []

            for i, chunk in enumerate(chunks):
                logger.info(f"Processing chunk {i+1}/{len(chunks)}")
                transcription, confidence = TawasulSTT._transcribe_chunk(chunk)

                if transcription:  # Only add non-empty transcriptions
                    transcriptions.append(transcription)
                    confidences.append(confidence)

            # Combine results
            if not transcriptions:
                return "❌ No transcription generated", "", ""

            final_transcription = " ".join(transcriptions).strip()
            avg_confidence = sum(confidences) / len(confidences) if confidences else 0.0

            # Calculate processing time
            processing_time = time.time() - start_time

            # Create info strings
            confidence_info = f"Confidence: {avg_confidence:.2f}"
            processing_info = (
                f"Duration: {audio_duration:.1f}s | "
                f"Chunks: {len(chunks)} | "
                f"Time: {processing_time:.1f}s | "
                f"Model: {TawasulSTT.model_id.split('/')[-1]}"
            )

            logger.info(f"βœ… Transcription completed in {processing_time:.1f}s")

            return final_transcription, confidence_info, processing_info

        except Exception as e:
            error_msg = f"❌ Tawasul STT transcription failed: {str(e)}"
            logger.error(error_msg)
            return error_msg, "", ""

    @staticmethod
    def get_supported_languages() -> List[str]:
        """Get list of supported languages."""
        return [
            "ar",     # Arabic
            "ar-SA",  # Saudi Arabic
            "ar-EG",  # Egyptian Arabic
            "ar-JO",  # Jordanian Arabic
            "ar-LB",  # Lebanese Arabic
            "ar-SY",  # Syrian Arabic
            "ar-IQ",  # Iraqi Arabic
            "ar-MA",  # Moroccan Arabic
            "ar-DZ",  # Algerian Arabic
            "ar-TN",  # Tunisian Arabic
        ]

    @staticmethod
    def get_model_info() -> Dict[str, Any]:
        """Get model information."""
        return {
            "name": "Tawasul STT V0",
            "model_id": TawasulSTT.model_id,
            "device": TawasulSTT.device,
            "is_loaded": TawasulSTT.is_loaded,
            "supported_languages": TawasulSTT.get_supported_languages(),
            "chunk_length": TawasulSTT.chunk_length,
            "max_audio_length": TawasulSTT.max_audio_length,
            "architecture": "Wav2Vec2",
            "specialization": "Arabic Speech Recognition"
        }
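Unlike the `BaseSTT` subclasses, `TawasulSTT.transcribe()` returns a plain 3-tuple of strings rather than an `STTResult`, so callers unpack it directly. A hedged usage sketch, assuming torch/transformers/librosa are installed and using `arabic_clip.wav` as a placeholder path:

```python
from stt.tawasul_stt import TawasulSTT

if TawasulSTT.is_available():
    TawasulSTT.load_model(device="auto")  # walks the fallback chain if the primary model fails
    text, confidence_info, processing_info = TawasulSTT.transcribe("arabic_clip.wav")
    print(text)
    print(confidence_info, "|", processing_info)
```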
stt/vosk_stt.py ADDED
@@ -0,0 +1,561 @@
#!/usr/bin/env python3
"""
Vosk STT Implementation

Vosk speech-to-text implementation using the static BaseSTT interface.
Supports multiple languages with offline models and real-time recognition.

Usage:
    from stt.vosk_stt import VoskSTT

    # Load model
    VoskSTT.load_model(model_name="vosk-model-en-us-0.22")

    # Transcribe audio
    result = VoskSTT.transcribe_audio(audio_array, 16000)
    print(result.text)
"""

from typing import Union, Optional, Dict, Any, List
import numpy as np
from pathlib import Path
import time
import json
import logging
import os
import urllib.request
import zipfile
import tempfile

try:
    import vosk
    VOSK_AVAILABLE = True
except ImportError:
    VOSK_AVAILABLE = False

try:
    import soundfile as sf
    SOUNDFILE_AVAILABLE = True
except ImportError:
    SOUNDFILE_AVAILABLE = False

from .stt_base import BaseSTT, STTResult

logger = logging.getLogger(__name__)


class VoskSTT(BaseSTT):
    """
    Vosk STT implementation supporting multiple languages and offline recognition.

    Features:
    - Multiple language support
    - Offline processing (no internet required after model download)
    - Real-time recognition capability
    - Small to large model options
    - Word-level timestamps and confidence scores
    - Lightweight and fast
    """

    model_name = "VoskSTT"
    model = None
    recognizer = None
    is_loaded = False
    config = {
        "model_name": "vosk-model-en-us-0.22",  # Default English model
        "model_path": None,  # Auto-determined
        "sample_rate": 16000,
        "language": "en",
        "download_url_base": "https://alphacephei.com/vosk/models/",
        "models_dir": str(Path.home() / ".vosk" / "models"),
        "return_confidence": True,
        "return_words": True,
        "chunk_size": 4096,
    }

    # Available Vosk models with their properties
    AVAILABLE_MODELS = {
        # English models
        "vosk-model-en-us-0.22": {
            "language": "en-US",
            "size": "1.8GB",
            "description": "English US Large",
            "url": "vosk-model-en-us-0.22.zip"
        },
        "vosk-model-small-en-us-0.15": {
            "language": "en-US",
            "size": "40MB",
            "description": "English US Small",
            "url": "vosk-model-small-en-us-0.15.zip"
        },

        # Arabic models
        "vosk-model-ar-mgb2-0.4": {
            "language": "ar",
            "size": "318MB",
            "description": "Arabic",
            "url": "vosk-model-ar-mgb2-0.4.zip"
        },

        # Multilingual and other languages
        "vosk-model-small-cn-0.22": {
            "language": "zh-CN",
            "size": "42MB",
            "description": "Chinese Small",
            "url": "vosk-model-small-cn-0.22.zip"
        },
        "vosk-model-fr-0.22": {
            "language": "fr-FR",
            "size": "1.4GB",
            "description": "French",
            "url": "vosk-model-fr-0.22.zip"
        },
        "vosk-model-de-0.21": {
            "language": "de-DE",
            "size": "1.2GB",
            "description": "German",
            "url": "vosk-model-de-0.21.zip"
        },
        "vosk-model-es-0.42": {
            "language": "es-ES",
            "size": "1.4GB",
            "description": "Spanish",
            "url": "vosk-model-es-0.42.zip"
        },
        "vosk-model-ru-0.42": {
            "language": "ru-RU",
            "size": "1.5GB",
            "description": "Russian",
            "url": "vosk-model-ru-0.42.zip"
        },
        "vosk-model-small-ru-0.22": {
            "language": "ru-RU",
            "size": "45MB",
            "description": "Russian Small",
            "url": "vosk-model-small-ru-0.22.zip"
        }
    }

    @classmethod
    def load_model(cls,
                   model_name: str = None,
                   model_path: str = None,
                   auto_download: bool = True,
                   **kwargs) -> None:
        """
        Load the Vosk model.

        Args:
            model_name: Name of the Vosk model (e.g., "vosk-model-en-us-0.22")
            model_path: Direct path to model directory (overrides model_name)
            auto_download: Automatically download model if not found
            **kwargs: Additional configuration parameters
        """
        if not VOSK_AVAILABLE:
            raise ImportError(
                "Vosk library required. Install with: pip install vosk"
            )

        # Update configuration
        cls.config.update({
            "model_name": model_name or cls.config["model_name"],
            "model_path": model_path,
            "auto_download": auto_download,
            **kwargs
        })

        # Determine model path
        if model_path:
            final_model_path = Path(model_path)
        else:
            final_model_path = cls._get_model_path(cls.config["model_name"])

        # Check if model exists, download if needed
        if not final_model_path.exists():
            if auto_download:
                logger.info(f"Model not found at {final_model_path}")
                cls._download_model(cls.config["model_name"])
            else:
                raise FileNotFoundError(f"Model not found: {final_model_path}")

        logger.info(f"Loading Vosk model from: {final_model_path}")
        start_time = time.time()

        try:
            # Load the Vosk model
            cls.model = vosk.Model(str(final_model_path))

            # Create recognizer
            cls.recognizer = vosk.KaldiRecognizer(cls.model, cls.config["sample_rate"])

            # Configure recognizer options (with compatibility checks)
            try:
                if hasattr(cls.recognizer, 'SetMaxAlternatives'):
                    cls.recognizer.SetMaxAlternatives(cls.config.get("max_alternatives", 3))
                    logger.info("βœ… Max alternatives enabled")
            except Exception as e:
                logger.warning(f"⚠️ Max alternatives not supported: {e}")

            try:
                if hasattr(cls.recognizer, 'SetReturnWordTimes'):
                    cls.recognizer.SetReturnWordTimes(cls.config.get("return_words", True))
                    logger.info("βœ… Word timing enabled")
                else:
                    logger.info("ℹ️ Word timing not available in this Vosk version")
            except Exception as e:
                logger.warning(f"⚠️ Word timing not supported: {e}")

            try:
                if hasattr(cls.recognizer, 'SetWords'):
                    cls.recognizer.SetWords(cls.config.get("return_words", True))
                    logger.info("βœ… Word-level output enabled")
            except Exception as e:
                logger.info(f"ℹ️ Word-level output using basic mode: {e}")

            # Test recognizer with a small sample (800 int16 samples, ~0.05s of silence)
            cls.recognizer.AcceptWaveform(b'\x00' * 1600)
            logger.info("βœ… Recognizer test successful")

            cls.is_loaded = True
            load_time = time.time() - start_time

            model_info = cls.AVAILABLE_MODELS.get(cls.config["model_name"], {})
            language = model_info.get("language", "unknown")

            logger.info(f"βœ… Vosk model loaded successfully in {load_time:.2f}s")
            logger.info(f"Model: {cls.config['model_name']}")
            logger.info(f"Language: {language}")
            logger.info(f"Sample rate: {cls.config['sample_rate']}Hz")

        except Exception as e:
            cls.is_loaded = False
            error_msg = f"Failed to load Vosk model: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @classmethod
    def _get_model_path(cls, model_name: str) -> Path:
        """Get the local path where a model should be stored."""
        models_dir = Path(cls.config["models_dir"])
        models_dir.mkdir(parents=True, exist_ok=True)
        return models_dir / model_name

    @classmethod
    def _download_model(cls, model_name: str) -> None:
        """Download a Vosk model if it's not already available."""
        if model_name not in cls.AVAILABLE_MODELS:
            raise ValueError(f"Unknown model: {model_name}. Available: {list(cls.AVAILABLE_MODELS.keys())}")

        model_info = cls.AVAILABLE_MODELS[model_name]
        download_url = cls.config["download_url_base"] + model_info["url"]
        model_path = cls._get_model_path(model_name)

        if model_path.exists():
            logger.info(f"Model already exists: {model_path}")
            return

        logger.info(f"Downloading Vosk model: {model_name}")
        logger.info(f"Size: {model_info['size']} - This may take a while...")
        logger.info(f"URL: {download_url}")

        tmp_path = None
        try:
            # Create temporary file for download
            with tempfile.NamedTemporaryFile(suffix='.zip', delete=False) as tmp_file:
                tmp_path = tmp_file.name

            # Download with progress
            def show_progress(block_num, block_size, total_size):
                if total_size > 0:
                    percent = min(100, (block_num * block_size * 100) // total_size)
                    if block_num % 100 == 0:  # Show progress every 100 blocks
                        print(f"\rDownloading... {percent}%", end="", flush=True)

            urllib.request.urlretrieve(download_url, tmp_path, show_progress)
            print()  # New line after progress

            logger.info(f"Download complete. Extracting to: {model_path}")

            # Extract the zip file
            with zipfile.ZipFile(tmp_path, 'r') as zip_ref:
                # Extract to temporary directory first
                extract_dir = model_path.parent / f"{model_name}_temp"
                extract_dir.mkdir(exist_ok=True)
                zip_ref.extractall(extract_dir)

                # Find the actual model directory (should contain conf/ and graph/ subdirs)
                extracted_items = list(extract_dir.iterdir())
                if len(extracted_items) == 1 and extracted_items[0].is_dir():
                    # Move the inner directory to the final location
                    extracted_items[0].rename(model_path)
                    extract_dir.rmdir()
                else:
                    # Multiple items or files - rename the temp directory
                    extract_dir.rename(model_path)

            # Cleanup
            os.unlink(tmp_path)

            logger.info(f"βœ… Model downloaded and extracted successfully: {model_path}")

        except Exception as e:
            # Cleanup on failure
            if tmp_path and os.path.exists(tmp_path):
                os.unlink(tmp_path)
            if model_path.exists():
                import shutil
                shutil.rmtree(model_path, ignore_errors=True)

            error_msg = f"Failed to download model {model_name}: {str(e)}"
            logger.error(error_msg)
            raise RuntimeError(error_msg)

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using Vosk.

        Args:
            audio_data: Audio input (numpy array or file path)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription with confidence and metadata
        """
        if not cls.is_loaded:
            raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")

        start_time = time.time()

        try:
            # Process input audio
            processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)

            # Check audio length
            duration = len(processed_audio) / actual_sr
            if duration < 0.1:
                return STTResult(
                    text="",
                    confidence=0.0,
                    processing_time=time.time() - start_time,
                    metadata={"error": "Audio too short", "duration": duration}
                )

            # Transcribe using Vosk
            result_text, confidence, words = cls._transcribe_with_vosk(processed_audio)

            processing_time = time.time() - start_time

            # Prepare metadata
            metadata = {
                "model": cls.config["model_name"],
                "language": cls.AVAILABLE_MODELS.get(cls.config["model_name"], {}).get("language", "unknown"),
                "duration": duration,
                "sample_rate": actual_sr,
                "words": words if cls.config.get("return_words", True) else None,
                "vosk_version": vosk.__version__ if hasattr(vosk, '__version__') else "unknown"
            }

            return STTResult(
                text=result_text.strip(),
                confidence=confidence,
                processing_time=processing_time,
                metadata=metadata
            )

        except Exception as e:
            error_msg = f"Transcription failed: {str(e)}"
            logger.error(error_msg)
            return STTResult(
                text="",
                confidence=0.0,
                processing_time=time.time() - start_time,
                metadata={"error": error_msg}
            )

    @classmethod
    def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
        """Process and validate audio input."""
        if isinstance(audio_data, (str, Path)):
            # Load audio file
            audio_path = Path(audio_data)
            if not audio_path.exists():
                raise FileNotFoundError(f"Audio file not found: {audio_path}")

            if SOUNDFILE_AVAILABLE:
                audio_array, sr = sf.read(str(audio_path))
                if audio_array.ndim > 1:
                    audio_array = np.mean(audio_array, axis=1)  # Convert to mono
            else:
                raise ImportError("soundfile required for file input. Install with: pip install soundfile")
        else:
            # Handle numpy array
            audio_array = audio_data.astype(np.float32)
            sr = sample_rate or cls.config["sample_rate"]

            # Convert to mono if stereo
            if audio_array.ndim > 1:
                audio_array = np.mean(audio_array, axis=1)

        # Resample to target sample rate if needed
        target_sr = cls.config["sample_rate"]
        if sr != target_sr:
            # Simple resampling
            if sr > target_sr:
                step = sr // target_sr
                audio_array = audio_array[::step]
            else:
                repeat = target_sr // sr
                audio_array = np.repeat(audio_array, repeat)
            sr = target_sr

        # Normalize and convert to 16-bit PCM format expected by Vosk
        audio_array = np.clip(audio_array, -1.0, 1.0)
        audio_int16 = (audio_array * 32767).astype(np.int16)

        return audio_int16, sr

    @classmethod
    def _transcribe_with_vosk(cls, audio_int16: np.ndarray) -> tuple:
        """Transcribe audio using Vosk recognizer."""
        # Convert to bytes
        audio_bytes = audio_int16.tobytes()

        # Reset recognizer for new transcription
        cls.recognizer = vosk.KaldiRecognizer(cls.model, cls.config["sample_rate"])

        # Configure recognizer with compatibility checks
        try:
            if hasattr(cls.recognizer, 'SetReturnWordTimes'):
                cls.recognizer.SetReturnWordTimes(cls.config.get("return_words", True))
        except Exception:
            pass  # Use basic recognition without word timing

        # Process audio in chunks
        chunk_size = cls.config.get("chunk_size", 4096)
        partial_results = []

        for i in range(0, len(audio_bytes), chunk_size):
            chunk = audio_bytes[i:i + chunk_size]
            if cls.recognizer.AcceptWaveform(chunk):
                result = json.loads(cls.recognizer.Result())
                if result.get("text"):
                    partial_results.append(result)

        # Get final result
        final_result = json.loads(cls.recognizer.FinalResult())
        if final_result.get("text"):
            partial_results.append(final_result)

        # Combine all results
        if not partial_results:
            return "", 0.0, []

        # Extract text and confidence
        full_text = " ".join([r.get("text", "") for r in partial_results]).strip()

        # Calculate average confidence from words
        all_words = []
        total_confidence = 0.0
        word_count = 0

        for result in partial_results:
            if "result" in result:
                words = result["result"]
                all_words.extend(words)
                for word in words:
                    if "conf" in word:
                        total_confidence += word["conf"]
                        word_count += 1

        average_confidence = total_confidence / word_count if word_count > 0 else 0.0

        return full_text, average_confidence, all_words

    @classmethod
    def get_available_models(cls) -> Dict[str, Any]:
        """Get information about available Vosk models."""
        return {
            "vosk_available": VOSK_AVAILABLE,
            "soundfile_available": SOUNDFILE_AVAILABLE,
            "models": cls.AVAILABLE_MODELS,
            "models_dir": cls.config["models_dir"],
            "downloaded_models": cls._get_downloaded_models()
        }

    @classmethod
    def _get_downloaded_models(cls) -> List[str]:
        """Get list of already downloaded models."""
        models_dir = Path(cls.config["models_dir"])
        if not models_dir.exists():
            return []

        downloaded = []
        for model_dir in models_dir.iterdir():
            if model_dir.is_dir() and model_dir.name in cls.AVAILABLE_MODELS:
                # Check if it looks like a valid Vosk model
                if (model_dir / "conf").exists() or (model_dir / "graph").exists():
                    downloaded.append(model_dir.name)

        return downloaded

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set language preference (informational - model determines actual language)."""
        cls.config["language"] = language or "auto"
        logger.info(f"Language preference set to: {cls.config['language']}")
        logger.info("Note: Vosk model determines actual recognition language")

    @classmethod
    def list_models(cls) -> None:
        """Print available models in a formatted way."""
        print("\n🎀 Available Vosk Models:")
        print("=" * 60)

        downloaded = cls._get_downloaded_models()

        for model_name, info in cls.AVAILABLE_MODELS.items():
            status = "βœ… Downloaded" if model_name in downloaded else "πŸ“₯ Available"
            print(f"{status} {model_name}")
            print(f"   Language: {info['language']}")
            print(f"   Size: {info['size']}")
            print(f"   Description: {info['description']}")
            print()


# Example usage and testing
if __name__ == "__main__":
    print("Testing Vosk STT implementation...")

    # Check availability
    models_info = VoskSTT.get_available_models()
    print(f"Vosk available: {models_info['vosk_available']}")
    print(f"Downloaded models: {models_info['downloaded_models']}")

    if models_info["vosk_available"]:
        try:
            # List available models
            VoskSTT.list_models()

            # Try to load a small English model for testing
            print("\nTesting with small English model...")
            VoskSTT.load_model(model_name="vosk-model-small-en-us-0.15")

            # Test with dummy audio
            print("Testing transcription...")
            test_audio = np.random.randn(16000).astype(np.float32) * 0.1

            result = VoskSTT.transcribe_audio(test_audio, 16000)
            print(f"Result: {result}")
            print(f"Metadata: {result.metadata}")

        except Exception as e:
            print(f"Error: {e}")
            print("Note: This is expected with random audio")

    else:
        print("Vosk not installed - install with: pip install vosk")
        print("Also recommended: pip install soundfile")

    print("\nVosk STT implementation ready!")
stt/wav2vec2_arabic_stt.py ADDED
@@ -0,0 +1,509 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Wav2Vec2 Arabic Egyptian STT Implementation
4
+
5
+ Hugging Face Wav2Vec2 speech-to-text implementation for Arabic Egyptian dialect
6
+ using the wav2vec2-large-xlsr-53-arabic-egyptian model.
7
+
8
+ Usage:
9
+ from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
10
+
11
+ # Load model
12
+ Wav2Vec2ArabicSTT.load_model()
13
+
14
+ # Transcribe audio
15
+ result = Wav2Vec2ArabicSTT.transcribe_audio(audio_array, 16000)
16
+ print(result.text)
17
+ """
18
+
19
+ from typing import Union, Optional, Dict, Any
20
+ import numpy as np
21
+ from pathlib import Path
22
+ import time
23
+ import logging
24
+ import warnings
25
+
26
+ # Suppress warnings for cleaner output
27
+ warnings.filterwarnings("ignore")
28
+
29
+ try:
30
+ import torch
31
+ import torchaudio
32
+ from transformers import (
33
+ Wav2Vec2ForCTC,
34
+ Wav2Vec2Processor,
35
+ Wav2Vec2Tokenizer
36
+ )
37
+ TRANSFORMERS_AVAILABLE = True
38
+ except ImportError:
39
+ TRANSFORMERS_AVAILABLE = False
40
+
41
+ try:
42
+ import librosa
43
+ LIBROSA_AVAILABLE = True
44
+ except ImportError:
45
+ LIBROSA_AVAILABLE = False
46
+
47
+ from .stt_base import BaseSTT, STTResult
48
+
49
+ logger = logging.getLogger(__name__)
50
+
51
+
52
+ class Wav2Vec2ArabicSTT(BaseSTT):
53
+ """
54
+ Wav2Vec2 Arabic Egyptian STT implementation using Hugging Face transformers.
55
+
56
+ Supports:
57
+ - Arabic Egyptian dialect transcription
58
+ - Local model execution (no API required)
59
+ - Automatic audio preprocessing
60
+ - Confidence estimation
61
+ """
62
+
63
+ model_name = "Wav2Vec2ArabicSTT"
64
+ model = None
65
+ processor = None
66
+ tokenizer = None
67
+ is_loaded = False
68
+ config = {
69
+ "model_id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
70
+ "fallback_models": [
71
+ "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
72
+ "facebook/wav2vec2-large-xlsr-53",
73
+ "facebook/wav2vec2-base-960h" # English fallback
74
+ ],
75
+ "device": "auto", # auto, cpu, cuda
76
+ "chunk_length": 20, # seconds, for long audio processing
77
+ "sample_rate": 16000,
78
+ "return_confidence": True,
79
+ "language": "ar-EG", # Arabic Egyptian
80
+ "hf_token": None, # Hugging Face token for private models
81
+ "use_auth_token": True # Try to use cached token
82
+ }
83
+
84
+ @classmethod
85
+ def load_model(cls,
86
+ model_id: str = None,
87
+ device: str = "auto",
88
+ hf_token: str = None,
89
+ **kwargs) -> None:
90
+ """
91
+ Load the Wav2Vec2 Arabic model.
92
+
93
+ Args:
94
+ model_id: Hugging Face model ID (default: wav2vec2-large-xlsr-53-arabic-egyptian)
95
+ device: Device to use (auto, cpu, cuda)
96
+ hf_token: Hugging Face token for private models (optional)
97
+ **kwargs: Additional configuration parameters
98
+ """
99
+ if not TRANSFORMERS_AVAILABLE:
100
+ raise ImportError(
101
+ "Transformers library required. Install with: "
102
+ "pip install transformers torch torchaudio"
103
+ )
104
+
105
+ # Update configuration
106
+ cls.config.update({
107
+ "model_id": model_id or cls.config["model_id"],
108
+ "device": device,
109
+ "hf_token": hf_token,
110
+ **kwargs
111
+ })
112
+
113
+ # Determine device
114
+ if device == "auto":
115
+ if torch.cuda.is_available():
116
+ device = "cuda"
117
+ elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
118
+ device = "mps" # Apple Silicon
119
+ else:
120
+ device = "cpu"
121
+
122
+ cls.config["device"] = device
123
+
124
+ # Try to load the model, with fallbacks
125
+ models_to_try = [cls.config["model_id"]] + cls.config["fallback_models"]
126
+
127
+ for model_id_to_try in models_to_try:
128
+ logger.info(f"Attempting to load model: {model_id_to_try}")
129
+
130
+ try:
131
+ success = cls._load_model_with_id(model_id_to_try, device, hf_token)
132
+ if success:
133
+ cls.config["model_id"] = model_id_to_try # Update to successful model
134
+ return
135
+ except Exception as e:
136
+ logger.warning(f"Failed to load {model_id_to_try}: {e}")
137
+ continue
138
+
139
+ # If all models failed
140
+ raise RuntimeError(f"Failed to load any Wav2Vec2 model. Tried: {models_to_try}")
141
+
142
+ @classmethod
143
+ def _load_model_with_id(cls, model_id: str, device: str, hf_token: str = None) -> bool:
144
+ """
145
+ Load a specific model ID with authentication handling.
146
+
147
+ Returns:
148
+ bool: True if successful, False otherwise
149
+ """
150
+ logger.info(f"Loading Wav2Vec2 model: {model_id}")
151
+ logger.info(f"Using device: {device}")
152
+
153
+ start_time = time.time()
154
+
155
+ # Prepare authentication
156
+ auth_kwargs = {}
157
+ if hf_token:
158
+ auth_kwargs["token"] = hf_token
159
+ elif cls.config.get("use_auth_token", True):
160
+ auth_kwargs["use_auth_token"] = True
161
+
162
+ # Load processor and tokenizer
163
+ logger.info("Loading processor...")
164
+ cls.processor = Wav2Vec2Processor.from_pretrained(model_id, **auth_kwargs)
165
+
166
+ logger.info("Loading model...")
167
+ cls.model = Wav2Vec2ForCTC.from_pretrained(model_id, **auth_kwargs)
168
+
169
+ # Move model to device
170
+ cls.model = cls.model.to(device)
171
+ cls.model.eval() # Set to evaluation mode
172
+
173
+ # Load tokenizer for confidence calculation
174
+ try:
175
+ cls.tokenizer = Wav2Vec2Tokenizer.from_pretrained(model_id, **auth_kwargs)
176
+ except Exception as e:
177
+ logger.warning(f"Could not load tokenizer: {e}")
178
+ cls.tokenizer = None
179
+
180
+ cls.is_loaded = True
181
+ load_time = time.time() - start_time
182
+
183
+ logger.info(f"βœ… Wav2Vec2 model loaded successfully in {load_time:.2f}s")
184
+ logger.info(f"Model vocab size: {cls.model.config.vocab_size}")
185
+
186
+ return True
187
+
188
+ @classmethod
189
+ def transcribe_audio(cls,
190
+ audio_data: Union[np.ndarray, str, Path],
191
+ sample_rate: Optional[int] = None) -> STTResult:
192
+ """
193
+ Transcribe audio using Wav2Vec2 Arabic model.
194
+
195
+ Args:
196
+ audio_data: Audio input (numpy array or file path)
197
+ sample_rate: Sample rate for numpy arrays
198
+
199
+ Returns:
200
+ STTResult: Transcription with confidence and metadata
201
+ """
202
+ if not cls.is_loaded:
203
+ raise RuntimeError(f"{cls.model_name} not loaded. Call load_model() first.")
204
+
205
+ start_time = time.time()
206
+
207
+ try:
208
+ # Process input audio
209
+ processed_audio, actual_sr = cls._process_audio_input(audio_data, sample_rate)
210
+
211
+ # Check audio length
212
+ duration = len(processed_audio) / actual_sr
213
+ if duration < 0.1:
214
+ return STTResult(
215
+ text="",
216
+ confidence=0.0,
217
+ processing_time=time.time() - start_time,
218
+ metadata={"error": "Audio too short", "duration": duration}
219
+ )
220
+
221
+ # Process with model
222
+ if duration > cls.config.get("chunk_length", 20):
223
+ # Handle long audio by chunking
224
+ text, confidence = cls._transcribe_long_audio(processed_audio, actual_sr)
225
+ else:
226
+ # Process short audio directly
227
+ text, confidence = cls._transcribe_chunk(processed_audio, actual_sr)
228
+
229
+ processing_time = time.time() - start_time
230
+
231
+ # Prepare metadata
232
+ metadata = {
233
+ "model": cls.config["model_id"],
234
+ "device": cls.config["device"],
235
+ "language": "ar-EG",
236
+ "duration": duration,
237
+ "sample_rate": actual_sr,
238
+ "chunks_processed": 1 if duration <= cls.config.get("chunk_length", 20) else int(duration / cls.config["chunk_length"]) + 1
239
+ }
240
+
241
+ return STTResult(
242
+ text=text.strip(),
243
+ confidence=confidence,
244
+ processing_time=processing_time,
245
+ metadata=metadata
246
+ )
247
+
248
+ except Exception as e:
249
+ error_msg = f"Transcription failed: {str(e)}"
250
+ logger.error(error_msg)
251
+ return STTResult(
252
+ text="",
253
+ confidence=0.0,
254
+ processing_time=time.time() - start_time,
255
+ metadata={"error": error_msg}
256
+ )
257
+
258
+ @classmethod
259
+ def _process_audio_input(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> tuple:
260
+ """Process and validate audio input."""
261
+ if isinstance(audio_data, (str, Path)):
262
+ # Load audio file
263
+ audio_path = Path(audio_data)
264
+ if not audio_path.exists():
265
+ raise FileNotFoundError(f"Audio file not found: {audio_path}")
266
+
267
+ if LIBROSA_AVAILABLE:
268
+ audio_array, sr = librosa.load(str(audio_path), sr=cls.config["sample_rate"])
269
+ else:
270
+ # Fallback to torchaudio
271
+ audio_tensor, sr = torchaudio.load(str(audio_path))
272
+ audio_array = audio_tensor.numpy().flatten()
273
+
274
+ # Resample if needed
275
+ if sr != cls.config["sample_rate"]:
276
+ resampler = torchaudio.transforms.Resample(sr, cls.config["sample_rate"])
277
+ audio_tensor = resampler(audio_tensor)
278
+ audio_array = audio_tensor.numpy().flatten()
279
+ sr = cls.config["sample_rate"]
280
+
281
+ else:
282
+ # Handle numpy array
283
+ audio_array = audio_data.astype(np.float32)
284
+ sr = sample_rate or cls.config["sample_rate"]
285
+
286
+ # Resample if needed
287
+ if sr != cls.config["sample_rate"]:
288
+ if LIBROSA_AVAILABLE:
289
+ audio_array = librosa.resample(
290
+ audio_array,
291
+ orig_sr=sr,
292
+ target_sr=cls.config["sample_rate"]
293
+ )
294
+ else:
295
+ # Simple resampling fallback
296
+ if sr > cls.config["sample_rate"]:
297
+ step = sr // cls.config["sample_rate"]
298
+ audio_array = audio_array[::step]
299
+ else:
300
+ repeat = cls.config["sample_rate"] // sr
301
+ audio_array = np.repeat(audio_array, repeat)
302
+
303
+ sr = cls.config["sample_rate"]
304
+
305
+ # Normalize audio
306
+ if len(audio_array) > 0:
307
+ # Convert to mono if stereo
308
+ if audio_array.ndim > 1:
309
+ audio_array = np.mean(audio_array, axis=0)
310
+
311
+ # Normalize to [-1, 1]
312
+ max_val = np.max(np.abs(audio_array))
313
+ if max_val > 0:
314
+ audio_array = audio_array / max_val
315
+
316
+ return audio_array, sr
317
+
318
+ @classmethod
319
    def _transcribe_chunk(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
        """Transcribe a single audio chunk."""
        # Preprocess audio
        input_values = cls.processor(
            audio_array,
            sampling_rate=sample_rate,
            return_tensors="pt",
            padding=True
        )

        # Move to device
        input_values = {k: v.to(cls.config["device"]) for k, v in input_values.items()}

        # Inference
        with torch.no_grad():
            logits = cls.model(**input_values).logits

        # Get predicted tokens
        predicted_ids = torch.argmax(logits, dim=-1)

        # Decode transcription
        transcription = cls.processor.batch_decode(predicted_ids)[0]

        # Calculate confidence (average of max probabilities)
        confidence = cls._calculate_confidence(logits)

        return transcription, confidence

    @classmethod
    def _transcribe_long_audio(cls, audio_array: np.ndarray, sample_rate: int) -> tuple:
        """Transcribe long audio by chunking."""
        chunk_length = cls.config.get("chunk_length", 20)
        chunk_samples = int(chunk_length * sample_rate)
        overlap_samples = int(1.0 * sample_rate)  # 1 second overlap

        transcriptions = []
        confidences = []

        for start in range(0, len(audio_array), chunk_samples - overlap_samples):
            end = min(start + chunk_samples, len(audio_array))
            chunk = audio_array[start:end]

            if len(chunk) < 0.5 * sample_rate:  # Skip very short chunks
                continue

            try:
                chunk_text, chunk_confidence = cls._transcribe_chunk(chunk, sample_rate)
                if chunk_text.strip():
                    transcriptions.append(chunk_text.strip())
                    confidences.append(chunk_confidence)
            except Exception as e:
                logger.warning(f"Failed to transcribe chunk: {e}")
                continue

        # Combine results
        full_text = " ".join(transcriptions)
        avg_confidence = np.mean(confidences) if confidences else 0.0

        return full_text, avg_confidence

    @classmethod
    def _calculate_confidence(cls, logits: torch.Tensor) -> float:
        """Calculate confidence score from model logits."""
        try:
            # Apply softmax to get probabilities
            probabilities = torch.softmax(logits, dim=-1)

            # Get maximum probability for each time step
            max_probs = torch.max(probabilities, dim=-1)[0]

            # Average over time steps (excluding padding if any)
            confidence = torch.mean(max_probs).item()

            return confidence

        except Exception as e:
            logger.warning(f"Could not calculate confidence: {e}")
            return 0.5  # Default confidence

    @classmethod
    def get_available_models(cls) -> Dict[str, Any]:
        """Get information about available Wav2Vec2 models."""
        models_info = {
            "transformers_available": TRANSFORMERS_AVAILABLE,
            "librosa_available": LIBROSA_AVAILABLE,
            "torch_available": TRANSFORMERS_AVAILABLE,
        }

        if TRANSFORMERS_AVAILABLE:
            models_info.update({
                "cuda_available": torch.cuda.is_available(),
                "mps_available": hasattr(torch.backends, 'mps') and torch.backends.mps.is_available(),
                "public_models": [
                    {
                        "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic",
                        "name": "Wav2Vec2 Arabic (Large)",
                        "language": "Arabic",
                        "size": "1.2GB"
                    },
                    {
                        "id": "facebook/wav2vec2-large-xlsr-53",
                        "name": "Wav2Vec2 Multilingual (Large)",
                        "language": "Multilingual (including Arabic)",
                        "size": "1.2GB"
                    },
                    {
                        "id": "facebook/wav2vec2-base-960h",
                        "name": "Wav2Vec2 English Base",
                        "language": "English",
                        "size": "360MB"
                    }
                ],
                "experimental_models": [
                    {
                        "id": "jonatasgrosman/wav2vec2-large-xlsr-53-arabic-egyptian",
                        "name": "Wav2Vec2 Arabic Egyptian (Large)",
                        "language": "Arabic Egyptian Dialect",
                        "size": "1.2GB",
                        "note": "May require HuggingFace authentication"
                    }
                ]
            })

        return models_info

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set language (for compatibility - this model is Arabic-specific)."""
        if language and not language.startswith("ar"):
            logger.warning(f"This model is optimized for Arabic. Language '{language}' may not work well.")

        cls.config["language"] = language or "ar-EG"
        logger.info(f"Language set to: {cls.config['language']}")

    @classmethod
    def set_device(cls, device: str) -> None:
        """Change device for model inference."""
        if cls.model is not None:
            cls.model = cls.model.to(device)
        cls.config["device"] = device
        logger.info(f"Model moved to device: {device}")

    @classmethod
    def get_model_info(cls) -> Dict[str, Any]:
        """Get detailed model information."""
        base_info = super().get_model_info()

        if cls.is_loaded:
            base_info.update({
                "model_id": cls.config["model_id"],
                "device": cls.config["device"],
                "language": cls.config["language"],
                "sample_rate": cls.config["sample_rate"],
                "vocab_size": cls.model.config.vocab_size if cls.model else None,
            })

        return base_info


# Example usage and testing
if __name__ == "__main__":
    print("Testing Wav2Vec2 Arabic STT implementation...")

    # Check availability
    models_info = Wav2Vec2ArabicSTT.get_available_models()
    print(f"Available models info: {models_info}")

    if models_info["transformers_available"]:
        try:
            print("Loading Wav2Vec2 Arabic model...")
            Wav2Vec2ArabicSTT.load_model()

            print("Creating test audio...")
            # Generate test audio (1 second of random noise)
            test_audio = np.random.randn(16000).astype(np.float32) * 0.1

            print("Testing transcription...")
            result = Wav2Vec2ArabicSTT.transcribe_audio(test_audio, 16000)
            print(f"Result: {result}")
            print(f"Metadata: {result.metadata}")

        except Exception as e:
            print(f"Error: {e}")
            print("Note: This is expected with random audio - the model expects Arabic speech")

    else:
        print("Transformers not installed - install with:")
        print("pip install transformers torch torchaudio")
        print("Optional: pip install librosa (for better audio processing)")

    print("\nWav2Vec2 Arabic STT implementation ready!")
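# Hedged usage sketch for real audio (an addition to the test above, not part
# of the original API surface). Assumes a local WAV file exists at
# "recordings/sample_ar.wav" -- the path is illustrative -- and that the
# optional soundfile dependency is installed:
#
#     import soundfile as sf
#     audio, sr = sf.read("recordings/sample_ar.wav", dtype="float32")
#     Wav2Vec2ArabicSTT.load_model(device="cpu")
#     result = Wav2Vec2ArabicSTT.transcribe_audio(audio, sr)
#     print(result.text, result.confidence)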
stt/whisper_stt.py ADDED
@@ -0,0 +1,377 @@
#!/usr/bin/env python3
"""
Whisper STT Implementation

OpenAI Whisper speech-to-text implementation using the static BaseSTT interface.
Supports both local Whisper models and OpenAI API calls.

Usage:
    from stt.whisper_stt import WhisperSTT

    # Load model (local)
    WhisperSTT.load_model()

    # Transcribe audio
    result = WhisperSTT.transcribe_file("audio.wav")
    print(result.text)

    # Or use OpenAI API
    WhisperSTT.load_model(use_api=True, api_key="your-key")
    result = WhisperSTT.transcribe_file("audio.wav")
"""

from typing import Union, Optional, Dict, Any
import numpy as np
from pathlib import Path
import time
import logging
import tempfile
import os

try:
    import whisper
    WHISPER_AVAILABLE = True
except ImportError:
    WHISPER_AVAILABLE = False

try:
    import openai
    OPENAI_AVAILABLE = True
except ImportError:
    OPENAI_AVAILABLE = False

try:
    import soundfile as sf
    SOUNDFILE_AVAILABLE = True
except ImportError:
    SOUNDFILE_AVAILABLE = False

from .stt_base import BaseSTT, STTResult

logger = logging.getLogger(__name__)


class WhisperSTT(BaseSTT):
    """
    OpenAI Whisper STT implementation with support for both local models and API.

    Supports:
    - Local Whisper models (tiny, base, small, medium, large)
    - OpenAI Whisper API calls
    - Multiple audio formats via soundfile
    - Confidence scoring and metadata
    """

    model_name = "WhisperSTT"
    model = None
    is_loaded = False
    config = {
        "model_size": "base",
        "use_api": False,
        "api_key": None,
        "language": None,  # Auto-detect if None
        "task": "transcribe",  # "transcribe" or "translate"
        "temperature": 0.0,
        "best_of": 5,
        "beam_size": 5,
        "patience": 1.0,
        "length_penalty": 1.0,
        "suppress_tokens": "-1",
        "initial_prompt": None,
        "condition_on_previous_text": True,
        "fp16": True,
        "compression_ratio_threshold": 2.4,
        "logprob_threshold": -1.0,
        "no_speech_threshold": 0.6
    }

    @classmethod
    def load_model(cls,
                   model_size: str = "base",
                   use_api: bool = False,
                   api_key: Optional[str] = None,
                   **kwargs) -> None:
        """
        Load the Whisper model (local or API setup).

        Args:
            model_size: Size of local model ("tiny", "base", "small", "medium", "large")
            use_api: Use OpenAI API instead of local model
            api_key: OpenAI API key (required if use_api=True)
            **kwargs: Additional Whisper parameters
        """
        cls.config.update({
            "model_size": model_size,
            "use_api": use_api,
            "api_key": api_key,
            **kwargs
        })

        if use_api:
            cls._load_api_model(api_key)
        else:
            cls._load_local_model(model_size)

    @classmethod
    def _load_local_model(cls, model_size: str) -> None:
        """Load local Whisper model."""
        if not WHISPER_AVAILABLE:
            raise ImportError(
                "OpenAI Whisper not installed. Install with: pip install openai-whisper"
            )

        logger.info(f"Loading Whisper local model: {model_size}")
        start_time = time.time()

        try:
            cls.model = whisper.load_model(model_size)
            cls.is_loaded = True
            load_time = time.time() - start_time
            logger.info(f"Whisper model '{model_size}' loaded successfully in {load_time:.2f}s")

        except Exception as e:
            cls.is_loaded = False
            raise RuntimeError(f"Failed to load Whisper model '{model_size}': {e}")

    @classmethod
    def _load_api_model(cls, api_key: Optional[str]) -> None:
        """Setup OpenAI API client."""
        if not OPENAI_AVAILABLE:
            raise ImportError(
                "OpenAI Python client not installed. Install with: pip install openai"
            )

        if not api_key:
            # Try to get from environment
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError(
                    "OpenAI API key required. Set OPENAI_API_KEY environment variable or pass api_key parameter."
                )

        logger.info("Setting up OpenAI Whisper API client")

        try:
            openai.api_key = api_key
            cls.model = "whisper-1"  # API model identifier
            cls.is_loaded = True
            logger.info("OpenAI Whisper API client configured successfully")

        except Exception as e:
            cls.is_loaded = False
            raise RuntimeError(f"Failed to setup OpenAI API: {e}")

    @classmethod
    def transcribe_audio(cls,
                         audio_data: Union[np.ndarray, str, Path],
                         sample_rate: Optional[int] = None) -> STTResult:
        """
        Transcribe audio using Whisper (local or API).

        Args:
            audio_data: Audio input (numpy array, file path, or audio file)
            sample_rate: Sample rate for numpy arrays

        Returns:
            STTResult: Transcription with confidence and metadata
        """
        if not cls.is_loaded:
            raise RuntimeError("Whisper model not loaded. Call load_model() first.")

        start_time = time.time()

        try:
            if cls.config["use_api"]:
                result = cls._transcribe_api(audio_data, sample_rate)
            else:
                result = cls._transcribe_local(audio_data, sample_rate)

            processing_time = time.time() - start_time
            result.processing_time = processing_time

            logger.info(f"Transcription completed in {processing_time:.2f}s")
            return result

        except Exception as e:
            logger.error(f"Transcription failed: {e}")
            raise RuntimeError(f"Whisper transcription failed: {e}")

    @classmethod
    def _transcribe_local(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> STTResult:
        """Transcribe using local Whisper model."""

        # Prepare transcription options
        transcribe_options = {
            "language": cls.config.get("language"),
            "task": cls.config.get("task", "transcribe"),
            "temperature": cls.config.get("temperature", 0.0),
            "best_of": cls.config.get("best_of", 5),
            "beam_size": cls.config.get("beam_size", 5),
            "patience": cls.config.get("patience", 1.0),
            "length_penalty": cls.config.get("length_penalty", 1.0),
            "suppress_tokens": cls.config.get("suppress_tokens", "-1"),
            "initial_prompt": cls.config.get("initial_prompt"),
            "condition_on_previous_text": cls.config.get("condition_on_previous_text", True),
            "fp16": cls.config.get("fp16", True),
            "compression_ratio_threshold": cls.config.get("compression_ratio_threshold", 2.4),
            "logprob_threshold": cls.config.get("logprob_threshold", -1.0),
            "no_speech_threshold": cls.config.get("no_speech_threshold", 0.6)
        }

        # Remove None values
        transcribe_options = {k: v for k, v in transcribe_options.items() if v is not None}

        # Handle numpy arrays
        if isinstance(audio_data, np.ndarray):
            audio_input = audio_data.astype(np.float32)
            # Whisper expects mono audio
            if audio_input.ndim > 1:
                audio_input = np.mean(audio_input, axis=1)
        else:
            # File path
            audio_input = str(audio_data)

        # Transcribe
        result = cls.model.transcribe(audio_input, **transcribe_options)

        # Calculate confidence (average of segment confidences if available)
        confidence = None
        if "segments" in result and result["segments"]:
            segment_confidences = []
            for segment in result["segments"]:
                if "avg_logprob" in segment:
                    # Convert log prob to confidence estimate
                    conf = min(1.0, max(0.0, np.exp(segment["avg_logprob"])))
                    segment_confidences.append(conf)

            if segment_confidences:
                confidence = np.mean(segment_confidences)

        # Prepare metadata
        metadata = {
            "model": cls.config["model_size"],
            "language": result.get("language"),
            "task": cls.config["task"],
            "segments": len(result.get("segments", [])),
            "api_used": False
        }

        return STTResult(
            text=result["text"].strip(),
            confidence=confidence,
            metadata=metadata
        )

    @classmethod
    def _transcribe_api(cls, audio_data: Union[np.ndarray, str, Path], sample_rate: Optional[int]) -> STTResult:
        """Transcribe using OpenAI API."""

        # Handle numpy arrays - save to temp file for API
        if isinstance(audio_data, np.ndarray):
            if not SOUNDFILE_AVAILABLE:
                raise ImportError("soundfile required for numpy array support. Install with: pip install soundfile")

            # Create temporary WAV file
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                temp_path = temp_file.name

            try:
                sf.write(temp_path, audio_data, sample_rate or 16000)
                audio_file_path = temp_path
                cleanup_temp = True
            except Exception as e:
                if os.path.exists(temp_path):
                    os.unlink(temp_path)
                raise RuntimeError(f"Failed to save temporary audio file: {e}")
        else:
            audio_file_path = str(audio_data)
            cleanup_temp = False

        try:
            # Make API call
            with open(audio_file_path, "rb") as audio_file:
                transcript = openai.Audio.transcribe(
                    model="whisper-1",
                    file=audio_file,
                    language=cls.config.get("language"),
                    prompt=cls.config.get("initial_prompt"),
                    temperature=cls.config.get("temperature", 0.0)
                )

            # API doesn't provide confidence scores
            metadata = {
                "model": "whisper-1",
                "language": cls.config.get("language", "auto"),
                "task": "transcribe",
                "api_used": True
            }

            return STTResult(
                text=transcript["text"].strip(),
                confidence=None,  # API doesn't provide confidence
                metadata=metadata
            )

        finally:
            # Clean up temporary file if created
            if cleanup_temp and os.path.exists(audio_file_path):
                try:
                    os.unlink(audio_file_path)
                except Exception as e:
                    logger.warning(f"Failed to cleanup temp file {audio_file_path}: {e}")

    @classmethod
    def get_available_models(cls) -> Dict[str, Any]:
        """Get information about available Whisper models."""
        local_models = ["tiny", "base", "small", "medium", "large"] if WHISPER_AVAILABLE else []
        api_available = OPENAI_AVAILABLE

        return {
            "local_models": local_models,
            "api_available": api_available,
            "whisper_installed": WHISPER_AVAILABLE,
            "openai_installed": OPENAI_AVAILABLE,
            "soundfile_installed": SOUNDFILE_AVAILABLE
        }

    @classmethod
    def set_language(cls, language: Optional[str]) -> None:
        """Set the transcription language."""
        cls.config["language"] = language
        logger.info(f"Language set to: {language or 'auto-detect'}")

    @classmethod
    def set_task(cls, task: str) -> None:
        """Set the task (transcribe or translate)."""
        if task not in ["transcribe", "translate"]:
            raise ValueError("Task must be 'transcribe' or 'translate'")
        cls.config["task"] = task
        logger.info(f"Task set to: {task}")


# Example usage and testing
if __name__ == "__main__":
    print("Testing WhisperSTT implementation...")

    # Check availability
    models_info = WhisperSTT.get_available_models()
    print(f"Available models: {models_info}")

    if models_info["whisper_installed"]:
        try:
            # Test with local model
            print("\nTesting local Whisper model...")
            WhisperSTT.load_model("tiny")  # Use tiny model for faster testing

            # Test with dummy numpy audio
            dummy_audio = np.random.randn(16000).astype(np.float32)  # 1 second
            result = WhisperSTT.transcribe_numpy(dummy_audio, 16000)
            print(f"Dummy audio result: {result}")
            print(f"Model info: {WhisperSTT.get_model_info()}")

        except Exception as e:
            print(f"Local model test failed: {e}")
    else:
        print("Whisper not installed - install with: pip install openai-whisper")

    print("\nWhisperSTT implementation ready!")
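# Hedged sketch of the API path (not exercised above since it needs network
# access and an OpenAI account). Assumes OPENAI_API_KEY is set in the
# environment; "meeting.mp3" is an illustrative file name:
#
#     WhisperSTT.load_model(use_api=True, api_key=os.getenv("OPENAI_API_KEY"))
#     result = WhisperSTT.transcribe_audio("meeting.mp3")
#     print(result.text, result.metadata["api_used"])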
test_coqui.py ADDED
@@ -0,0 +1,163 @@
#!/usr/bin/env python3
"""
Test script for Coqui STT

This script tests the Coqui STT implementation with a sample audio file.
Coqui STT provides open-source speech recognition with multiple language support.

Usage:
    python test_coqui.py [audio_file]

If no audio file is provided, it will use the default recording if available.
"""

import sys
import logging
from pathlib import Path
from typing import Optional

# Add the project root to the path
sys.path.append(str(Path(__file__).parent))

from stt.coqui_stt import CoquiSTT, COQUI_STT_AVAILABLE


def test_coqui_stt(audio_file: Optional[str] = None):
    """Test Coqui STT functionality."""
    print("πŸš€ Testing Coqui STT")
    print("=" * 50)

    # Check if Coqui STT is available
    if not COQUI_STT_AVAILABLE:
        print("❌ Coqui STT not available. Install with:")
        print("pip install coqui-stt soundfile librosa")
        return False

    # Create CoquiSTT instance
    coqui = CoquiSTT()

    # Check dependencies
    deps_ok, deps_msg = coqui.check_dependencies()
    print(f"Dependencies: {deps_msg}")
    if not deps_ok:
        return False

    # Get available models
    print("\nπŸ“¦ Available Models:")
    available_models = coqui.get_available_models()
    for model in available_models:
        status = "βœ… Downloaded" if model["downloaded"] else "⬇️ Available for download"
        scorer_status = " (with scorer)" if model["has_scorer"] else " (no scorer)"
        print(f" - {model['name']}: {model['description']} ({model['size']}) {status}{scorer_status}")

    # Test model loading
    print("\nπŸ”„ Loading English Large model...")
    model_name = "english-large"
    success = coqui.load_model(
        model_name=model_name,
        auto_download=True,
        beam_width=512
    )

    if not success:
        print("❌ Failed to load model")
        return False

    print("βœ… Model loaded successfully")

    # Get model info
    model_info = coqui.get_model_info()
    print(f"\nπŸ“‹ Model Info:")
    for key, value in model_info.items():
        print(f" - {key}: {value}")

    # Test transcription
    if audio_file and Path(audio_file).exists():
        print(f"\n🎀 Transcribing: {audio_file}")
    else:
        # Look for default recording
        default_files = [
            "recordings/recorded_audio.wav",
            "recorded_audio.wav",
            "test_audio.wav"
        ]

        audio_file = None
        for file_path in default_files:
            if Path(file_path).exists():
                audio_file = file_path
                break

        if not audio_file:
            print("❌ No audio file found for testing")
            print("Record audio using the Gradio interface first, or provide a file path")
            return False

        print(f"\n🎀 Using default recording: {audio_file}")

    # Perform transcription
    try:
        print("Transcribing...")
        result = coqui.transcribe_audio(
            audio_file_path=audio_file,
            return_confidence=True,
            return_timestamps=False
        )

        if "error" in result:
            print(f"❌ Transcription error: {result['error']}")
            return False

        print("\nπŸ“ Transcription Results:")
        print(f" Text: {result['text']}")
        print(f" Confidence: {result.get('confidence', 'N/A')}")
        print(f" Language: {result.get('language', 'Unknown')}")

        # Test with timestamps if successful
        print("\nπŸ• Testing with timestamps...")
        result_with_timestamps = coqui.transcribe_audio(
            audio_file_path=audio_file,
            return_confidence=True,
            return_timestamps=True
        )

        if "words" in result_with_timestamps:
            print(f" Word count: {len(result_with_timestamps['words'])}")
            if result_with_timestamps['words']:
                print(" First few words with timestamps:")
                for word in result_with_timestamps['words'][:3]:
                    print(f" - '{word['word']}' at {word['start_time']:.2f}s (confidence: {word.get('confidence', 'N/A')})")

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False

    # Cleanup
    coqui.cleanup()
    print("\nβœ… Test completed successfully!")
    return True


def main():
    """Main function."""
    # Setup logging
    logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

    # Get audio file from command line if provided
    audio_file = sys.argv[1] if len(sys.argv) > 1 else None

    # Run test
    success = test_coqui_stt(audio_file)

    if success:
        print("\nπŸŽ‰ Coqui STT is working correctly!")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Run the main transcriber: python gradio_voice_transcriber_clean.py")
        print(" 2. Select 'CoquiSTT' as your model")
        print(" 3. Choose your preferred language model")
        print(" 4. Start transcribing!")
    else:
        print("\n❌ Coqui STT test failed")
        return 1

    return 0


if __name__ == "__main__":
    sys.exit(main())
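# Hedged invocation examples (file names are illustrative):
#
#     python test_coqui.py                  # falls back to recordings/recorded_audio.wav
#     python test_coqui.py my_clip.wav      # test a specific audio file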
test_gradio_voice_transcriber.py ADDED
@@ -0,0 +1,186 @@
import builtins
import numpy as np
import pytest

# Import the module under test
import gradio_voice_transcriber as gvt


class DummySTT:
    is_loaded = True

    def __init__(self):
        self._language = None

    def load_model(self, **kwargs):
        self.is_loaded = True

    def set_language(self, lang):
        self._language = lang

    def transcribe_audio(self, audio, sample_rate):
        # Return an object mimicking STTResult
        class R:
            def __init__(self):
                self.text = "hello world"
                self.confidence = 0.75
                self.processing_time = 0.05
        return R()

    @staticmethod
    def get_model_info():
        return {"is_loaded": True, "model_name": "DummySTT"}


class DummyTawasul:
    # Static-style class (no instantiation) used by the Tawasul code path
    is_loaded = True

    @staticmethod
    def load_model(**kwargs):
        DummyTawasul.is_loaded = True

    @staticmethod
    def get_model_info():
        return {"is_loaded": True, "model_name": "TawasulSTT"}

    @staticmethod
    def transcribe(path):
        # Return tuple like (text, confidence_info, processing_info)
        return ("transcribed from file", "Confidence: 0.42", "ok")


@pytest.fixture(autouse=True)
def reset_globals(monkeypatch):
    # Ensure clean state between tests
    gvt.current_stt_model = None
    gvt.current_model_config = {}
    yield
    gvt.current_stt_model = None
    gvt.current_model_config = {}


def test_audio_processor_preprocess_basic():
    sr = 8000
    t = np.linspace(0, 1, sr, endpoint=False)
    audio = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
    out = gvt.AudioProcessor.preprocess(audio, sr, target_sr=16000)
    # Should be float32, mono, and clipped range within [-1, 1]
    assert out.dtype == np.float32
    assert out.ndim == 1
    assert np.max(np.abs(out)) <= 1.0


def test_model_manager_load_whisper_missing_api_key_returns_error(monkeypatch):
    # Register DummySTT under the WhisperSTT name to avoid a heavy import
    monkeypatch.setitem(gvt.STT_MODELS, "WhisperSTT", DummySTT)
    # Request API mode without a key
    msg = gvt.ModelManager.load_model("WhisperSTT", model_size="base", use_api=True, api_key="")
    assert "API key required" in msg


def test_model_manager_load_generic_success(monkeypatch):
    # Register a generic model name and load
    monkeypatch.setitem(gvt.STT_MODELS, "DummySTT", DummySTT)
    msg = gvt.ModelManager.load_model("DummySTT")
    assert msg.startswith("βœ…")
    assert gvt.current_stt_model is not None


def test_transcription_engine_no_audio():
    text, conf, proc = gvt.TranscriptionEngine.transcribe(None, language="en")
    assert text.startswith("❌ No audio provided")


def test_transcription_engine_requires_loaded_model():
    # Provide dummy audio but no model
    sr = 16000
    audio = np.zeros(sr, dtype=np.float32)
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert "No STT model loaded" in text


def test_transcription_engine_happy_path(monkeypatch):
    # Use DummySTT and set it as the loaded model
    gvt.current_stt_model = DummySTT()
    gvt.current_model_config = {"model_name": "DummySTT"}
    # Provide a 1-second tone with enough amplitude
    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    audio = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert text == "hello world"
    assert conf.startswith("Confidence: ")
    assert "Processing:" in proc


def test_transcription_engine_filters_false_positives(monkeypatch):
    class LowTextDummy(DummySTT):
        def transcribe_audio(self, audio, sample_rate):
            class R:
                def __init__(self):
                    self.text = "you"  # a known false positive which should be filtered
                    self.confidence = None
                    self.processing_time = 0.01
            return R()

    gvt.current_stt_model = LowTextDummy()
    gvt.current_model_config = {"model_name": "LowTextDummy"}
    sr = 16000
    audio = np.ones(sr, dtype=np.float32) * 0.2
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="en")
    assert text == "πŸ”‡ No clear speech detected"


def test_transcription_engine_tawasul_static_path_flow(monkeypatch, tmp_path):
    # Force the Tawasul path by setting current_model_config model_name
    gvt.current_stt_model = DummyTawasul
    gvt.current_model_config = {"model_name": "TawasulSTT"}

    # Create a simple audio array meeting the quality gates
    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    audio = (0.3 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)

    # Record calls to a fake soundfile.write so the test does not need the
    # soundfile dependency
    written = {}

    def fake_write(path, data, samplerate):
        written["path"] = path
        written["samplerate"] = samplerate
        written["len"] = len(data)

    monkeypatch.setitem(builtins.__dict__, "__SOUNDFILE_WRITE__", fake_write)

    class SFShim:
        @staticmethod
        def write(path, data, samplerate):
            fake_write(path, data, samplerate)

    # Replace the "sf" name in this test module's namespace with the shim
    monkeypatch.setitem(globals(), "sf", SFShim)

    # Run transcription
    text, conf, proc = gvt.TranscriptionEngine.transcribe((sr, audio), language="ar")
    assert text == "transcribed from file"
    assert conf.startswith("Confidence: ")
    assert "Model: TawasulSTT" in proc


def test_get_quality_recommendations_messages():
    q = {
        "duration": 0.5,
        "max_amplitude": 0.95,
        "clipping_ratio": 0.02,
        "silence_ratio": 0.6,
    }
    msg = gvt._get_quality_recommendations(q)
    # Expect multiple recommendations due to thresholds
    assert "recording for longer" in msg
    assert "clipping" in msg
    assert "Too much silence" in msg
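# Hedged run note: this is a plain pytest suite, e.g.
#
#     pytest test_gradio_voice_transcriber.py -q
#
# The heavy STT backends are stubbed out with DummySTT/DummyTawasul, so the
# suite should not need torch or whisper installed -- only whatever
# gradio_voice_transcriber itself imports at module load.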
test_hubert_arabic.py ADDED
@@ -0,0 +1,146 @@
#!/usr/bin/env python3
"""
Test script for HuBERT Arabic STT model

This script tests the HuBERT Arabic Egyptian STT implementation
including authentication, model loading, and transcription.
"""

import sys
import os
from pathlib import Path
import soundfile as sf
import numpy as np

# Add the project root to the path
project_root = Path(__file__).parent
sys.path.insert(0, str(project_root))


def test_hubert_arabic_stt():
    """Test the HuBERT Arabic STT implementation."""
    print("πŸš€ Testing HuBERT Arabic STT")
    print("=" * 50)

    try:
        from stt.hubert_arabic_stt import HuBERTArabicSTT
        print("βœ… HuBERTArabicSTT imported successfully")
    except ImportError as e:
        print(f"❌ Failed to import HuBERTArabicSTT: {e}")
        print("\nπŸ’‘ To install HuBERT dependencies:")
        print(" pip install -r requirements_hubert.txt")
        return False

    # Test model loading
    print("\nπŸ“¦ Testing model loading...")
    try:
        stt = HuBERTArabicSTT()

        # Try to load the primary model
        print("πŸ”§ Loading HuBERT Arabic Egyptian model...")
        result = stt.load_model(
            model_id="omarxadel/hubert-large-arabic-egyptian",
            device="auto"
        )
        print(f"Model load result: {result}")

    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        print("\nπŸ’‘ This might be due to:")
        print(" - Missing HuggingFace authentication token")
        print(" - Network connectivity issues")
        print(" - Private model access restrictions")
        print("\nπŸ”§ Try setting up authentication:")
        print(" python setup_hf_auth.py")
        return False

    # Test with sample audio (if available)
    print("\n🎡 Testing audio transcription...")

    # Create a test audio file (silence)
    sample_rate = 16000
    duration = 2.0  # seconds
    test_audio = np.zeros(int(sample_rate * duration), dtype=np.float32)

    test_audio_path = "test_audio_hubert.wav"
    sf.write(test_audio_path, test_audio, sample_rate)

    try:
        transcription, confidence, processing_info = stt.transcribe(test_audio_path)
        print(f"βœ… Transcription completed")
        print(f" Text: '{transcription}'")
        print(f" Confidence: {confidence}")
        print(f" Processing: {processing_info}")

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False
    finally:
        # Clean up test file
        if os.path.exists(test_audio_path):
            os.remove(test_audio_path)

    print("\nβœ… All HuBERT Arabic STT tests passed!")
    return True


def test_with_real_audio():
    """Test with real audio if available."""
    recordings_dir = Path("recordings")

    if not recordings_dir.exists():
        print(f"\nπŸ’‘ No recordings directory found at {recordings_dir}")
        print(" Create the directory and add .wav files to test with real audio")
        return

    audio_files = list(recordings_dir.glob("*.wav"))
    if not audio_files:
        print(f"\nπŸ’‘ No .wav files found in {recordings_dir}")
        return

    print(f"\n🎡 Testing with real audio files from {recordings_dir}...")

    try:
        from stt.hubert_arabic_stt import HuBERTArabicSTT
        stt = HuBERTArabicSTT()
        stt.load_model()

        for audio_file in audio_files[:2]:  # Test first 2 files
            print(f"\nπŸ”Š Processing: {audio_file.name}")
            try:
                transcription, confidence, processing_info = stt.transcribe(str(audio_file))
                print(f" Text: '{transcription}'")
                print(f" Confidence: {confidence}")
            except Exception as e:
                print(f" ❌ Error: {e}")

    except Exception as e:
        print(f"❌ Real audio test failed: {e}")


def main():
    """Main test function."""
    print("HuBERT Arabic STT Test Suite")
    print("=" * 60)

    # Basic functionality test
    success = test_hubert_arabic_stt()

    if success:
        print("\n🎯 Running additional tests...")
        test_with_real_audio()

    print("\n" + "=" * 60)
    if success:
        print("πŸŽ‰ HuBERT Arabic STT is working correctly!")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Test with the Gradio interface:")
        print("    python gradio_voice_transcriber_clean.py")
        print(" 2. Select 'HuBERTArabicSTT' as the STT model")
        print(" 3. Upload Arabic Egyptian audio for transcription")
    else:
        print("❌ HuBERT Arabic STT tests failed")
        print("\nπŸ”§ Troubleshooting:")
        print(" 1. Install dependencies: pip install -r requirements_hubert.txt")
        print(" 2. Set up HF authentication: python setup_hf_auth.py")
        print(" 3. Check network connectivity")


if __name__ == "__main__":
    main()
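# Hedged authentication note: besides the project's setup_hf_auth.py helper,
# gated or private Hugging Face models can usually be accessed by running
# `huggingface-cli login` once, or by exporting HF_TOKEN before launching this
# script. Whether this particular model requires a token depends on its repo
# settings.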
test_tawasul.py ADDED
@@ -0,0 +1,132 @@
#!/usr/bin/env python3
"""
Test script for Tawasul STT V0 model

This script tests the Tawasul STT V0 Arabic speech recognition model
with sample audio files.

Usage:
    python test_tawasul.py [audio_file]

If no audio file is provided, it will test with any files in the recordings/ directory.
"""

import sys
import os
from pathlib import Path
from typing import Optional
import time
import logging

# Add the project root to the path
sys.path.insert(0, str(Path(__file__).parent))

from stt.tawasul_stt import TawasulSTT

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def test_tawasul_stt(audio_file: Optional[str] = None):
    """Test Tawasul STT with an audio file."""

    print("πŸ§ͺ Tawasul STT V0 Test")
    print("=" * 50)

    # Check if Tawasul STT is available
    if not TawasulSTT.is_available():
        print("❌ Tawasul STT dependencies not available!")
        print("Install with: pip install -r requirements_tawasul.txt")
        return False

    # Find an audio file if none was provided
    if not audio_file:
        recordings_dir = Path("recordings")
        if recordings_dir.exists():
            audio_files = list(recordings_dir.glob("*.wav")) + list(recordings_dir.glob("*.mp3"))
            if audio_files:
                audio_file = str(audio_files[0])
                print(f"🎡 Using sample audio: {audio_file}")
            else:
                print("❌ No audio files found in recordings/ directory")
                print("Please provide an audio file: python test_tawasul.py your_audio.wav")
                return False
        else:
            print("❌ No audio file provided and no recordings/ directory found")
            print("Usage: python test_tawasul.py your_audio.wav")
            return False

    if not os.path.exists(audio_file):
        print(f"❌ Audio file not found: {audio_file}")
        return False

    try:
        # Load the model (static method)
        print("πŸ“₯ Loading Tawasul STT V0 model...")
        start_time = time.time()

        TawasulSTT.load_model(
            device="auto",  # Automatically choose best device
            chunk_length=20,  # 20-second chunks
            max_audio_length=300  # 5 minutes max
        )

        load_time = time.time() - start_time
        print(f"βœ… Model loaded in {load_time:.1f} seconds")

        # Get model info (static method)
        model_info = TawasulSTT.get_model_info()
        print(f"\nπŸ“Š Model Information:")
        print(f" Name: {model_info['name']}")
        print(f" Model ID: {model_info['model_id']}")
        print(f" Device: {model_info['device']}")
        print(f" Architecture: {model_info['architecture']}")
        print(f" Specialization: {model_info['specialization']}")
        print(f" Supported Languages: {', '.join(model_info['supported_languages'][:5])}...")

        # Transcribe audio (static method)
        print(f"\nπŸŽ™οΈ Transcribing audio: {audio_file}")
        transcription, confidence_info, processing_info = TawasulSTT.transcribe(audio_file)

        # Display results
        print("\n" + "=" * 50)
        print("πŸ“ TRANSCRIPTION RESULTS")
        print("=" * 50)
        print(f"Text: {transcription}")
        print(f"Confidence: {confidence_info}")
        print(f"Processing: {processing_info}")
        print("=" * 50)

        if transcription and not transcription.startswith("❌"):
            print("βœ… Transcription successful!")
            return True
        else:
            print("❌ Transcription failed!")
            return False

    except Exception as e:
        print(f"❌ Test failed: {str(e)}")
        return False


def main():
    """Main function."""
    audio_file = sys.argv[1] if len(sys.argv) > 1 else None

    # Test with different configurations
    success = test_tawasul_stt(audio_file)

    if success:
        print("\nπŸŽ‰ Tawasul STT test completed successfully!")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Try the main transcriber: python gradio_voice_transcriber_clean.py")
        print(" 2. Test with different Arabic audio files")
        print(" 3. Experiment with different model variants")
    else:
        print("\n❌ Tawasul STT test failed!")
        print("\nπŸ”§ Troubleshooting:")
        print(" 1. Install dependencies: pip install -r requirements_tawasul.txt")
        print(" 2. Check audio file format (WAV/MP3)")
        print(" 3. Ensure stable internet for model download")
        print(" 4. Try with a different audio file")


if __name__ == "__main__":
    main()
test_vosk.py ADDED
@@ -0,0 +1,185 @@
#!/usr/bin/env python3
"""
Test script for Vosk STT implementation

This script tests the Vosk STT implementation to ensure compatibility
and proper error handling.
"""

import sys
import numpy as np
from pathlib import Path

# Add the project directory to the Python path
project_dir = Path(__file__).parent
sys.path.insert(0, str(project_dir))


def test_vosk_basic():
    """Test basic Vosk functionality."""
    print("πŸ” Testing Vosk STT...")

    try:
        from stt.vosk_stt import VoskSTT
        print("βœ… Successfully imported VoskSTT")
    except ImportError as e:
        print(f"❌ Failed to import VoskSTT: {e}")
        print("\nπŸ“¦ Required dependencies:")
        print("pip install vosk")
        return False

    # Check Vosk availability
    print("\nπŸ“Š Checking Vosk availability...")
    models_info = VoskSTT.get_available_models()

    for key, value in models_info.items():
        status = "βœ…" if value else "❌"
        print(f"{status} {key}: {value}")

    if not models_info.get("vosk_available", False):
        print("\n❌ Vosk not available. Cannot proceed with test.")
        return False

    # Test model loading with a small model
    print(f"\nπŸ”„ Testing model loading...")
    print("⚠️ Note: This will download a small model (~40MB) if not cached")

    try:
        # Try to load a small English model first
        VoskSTT.load_model(model_name="vosk-model-small-en-us-0.15")
        print("βœ… Model loaded successfully!")

        # Get model info
        model_info = VoskSTT.get_model_info()
        print(f"πŸ“‹ Model info:")
        for key, value in model_info.items():
            print(f" {key}: {value}")

    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        print("\nπŸ’‘ This might be due to:")
        print(" - Network issues (model download)")
        print(" - Vosk version compatibility")
        print(" - Model availability")
        return False

    # Test with dummy audio
    print(f"\n🎀 Testing transcription with dummy audio...")
    print("⚠️ Note: Random audio won't produce meaningful text")

    try:
        # Create 2 seconds of random audio
        dummy_audio = np.random.randn(32000).astype(np.float32) * 0.1

        result = VoskSTT.transcribe_audio(dummy_audio, 16000)

        print(f"πŸ“ Transcription result:")
        print(f" Text: '{result.text}'")
        print(f" Confidence: {result.confidence:.2%}" if result.confidence else " Confidence: N/A")
        print(f" Processing time: {result.processing_time:.2f}s")
        print(f" Metadata: {result.metadata}")

        print("βœ… Transcription test completed!")
        return True

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False


def test_vosk_models():
    """Test different Vosk models."""
    print(f"\nπŸ“‹ Testing different Vosk models...")

    try:
        from stt.vosk_stt import VoskSTT

        # Get available models
        available = VoskSTT.AVAILABLE_MODELS
        print(f"πŸ“Š Available models: {len(available)}")

        # Show a few interesting models
        interesting_models = [
            "vosk-model-small-en-us-0.15",
            "vosk-model-small-ru-0.22",
            "vosk-model-small-fr-0.22",
            "vosk-model-small-de-0.15"
        ]

        print("\n🌍 Some available models:")
        for model_name in interesting_models:
            if model_name in available:
                model_info = available[model_name]
                print(f" {model_name}:")
                print(f"  Language: {model_info['language']}")
                print(f"  Size: {model_info['size']}")
                print(f"  Description: {model_info['description']}")

        return True

    except Exception as e:
        print(f"❌ Model listing failed: {e}")
        return False


def test_integration():
    """Test integration with the modular transcriber."""
    print(f"\nπŸ”— Testing integration with modular transcriber...")

    try:
        from gradio_voice_transcriber_clean import ModelManager

        available_models = ModelManager.get_available_models()
        print(f"πŸ“‹ Available models: {available_models}")

        if "VoskSTT" in available_models:
            print("βœ… VoskSTT is registered in the modular transcriber")

            # Test model options
            options = ModelManager.get_model_options("VoskSTT")
            print(f"πŸ“Š Model options: {options}")

            return True
        else:
            print("❌ VoskSTT not found in available models")
            return False

    except ImportError as e:
        print(f"❌ Failed to import modular transcriber components: {e}")
        return False


def main():
    """Main test function."""
    print("πŸ§ͺ Vosk STT Test Suite")
    print("=" * 50)

    # Test basic functionality
    basic_test = test_vosk_basic()

    # Test model listing
    models_test = test_vosk_models()

    # Test integration
    integration_test = test_integration()

    print("\n" + "=" * 50)
    print("πŸ“Š Test Results Summary:")
    print(f" Basic Functionality: {'βœ… PASS' if basic_test else '❌ FAIL'}")
    print(f" Model Listing: {'βœ… PASS' if models_test else '❌ FAIL'}")
    print(f" Integration: {'βœ… PASS' if integration_test else '❌ FAIL'}")

    if basic_test and models_test and integration_test:
        print("\nπŸŽ‰ All tests passed! Vosk STT is ready to use.")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Run: python gradio_voice_transcriber_clean.py")
        print(" 2. Select 'VoskSTT' from the dropdown")
        print(" 3. Choose your model (small models are faster)")
        print(" 4. Load the model and test with audio!")
        print("\n🌍 Vosk supports many languages offline!")
    else:
        print("\n❌ Some tests failed. Please check the errors above.")

    if not basic_test:
        print("\nπŸ“¦ To fix Vosk issues:")
        print(" pip install vosk")
        print(" Check internet connection for model download")


if __name__ == "__main__":
    main()
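# Hedged note on offline setups: Vosk models can also be downloaded manually
# from https://alphacephei.com/vosk/models and unpacked locally; whether
# VoskSTT picks up a pre-downloaded model directory depends on how
# stt/vosk_stt.py resolves model paths, which is not shown in this diff.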
test_wav2vec2_arabic.py ADDED
@@ -0,0 +1,142 @@
#!/usr/bin/env python3
"""
Test script for Wav2Vec2 Arabic STT

This script tests the Wav2Vec2 Arabic STT implementation without requiring
the full Gradio interface.
"""

import sys
import numpy as np
from pathlib import Path

# Add the project directory to the Python path
project_dir = Path(__file__).parent
sys.path.insert(0, str(project_dir))


def test_wav2vec2_arabic():
    """Test the Wav2Vec2 Arabic STT implementation."""
    print("πŸ” Testing Wav2Vec2 Arabic STT...")

    try:
        from stt.wav2vec2_arabic_stt import Wav2Vec2ArabicSTT
        print("βœ… Successfully imported Wav2Vec2ArabicSTT")
    except ImportError as e:
        print(f"❌ Failed to import Wav2Vec2ArabicSTT: {e}")
        print("\nπŸ“¦ Required dependencies:")
        print("pip install transformers torch torchaudio")
        print("Optional: pip install librosa")
        return False

    # Check model availability
    print("\nπŸ“Š Checking model availability...")
    models_info = Wav2Vec2ArabicSTT.get_available_models()

    for key, value in models_info.items():
        status = "βœ…" if value else "❌"
        print(f"{status} {key}: {value}")

    if not models_info.get("transformers_available", False):
        print("\n❌ Transformers not available. Cannot proceed with test.")
        return False

    # Test model loading (this will download the model if not cached)
    print(f"\nπŸ”„ Loading model...")
    print("⚠️ Note: First run will download ~1.2GB model from Hugging Face")

    try:
        Wav2Vec2ArabicSTT.load_model(device="cpu")  # Use CPU for testing
        print("βœ… Model loaded successfully!")

        # Get model info
        model_info = Wav2Vec2ArabicSTT.get_model_info()
        print(f"πŸ“‹ Model info:")
        for key, value in model_info.items():
            print(f" {key}: {value}")

    except Exception as e:
        print(f"❌ Failed to load model: {e}")
        return False

    # Test with dummy audio (this won't produce meaningful Arabic text)
    print(f"\n🎀 Testing transcription with dummy audio...")
    print("⚠️ Note: Random audio won't produce meaningful Arabic text")

    try:
        # Create 2 seconds of random audio
        dummy_audio = np.random.randn(32000).astype(np.float32) * 0.1

        result = Wav2Vec2ArabicSTT.transcribe_audio(dummy_audio, 16000)

        print(f"πŸ“ Transcription result:")
        print(f" Text: '{result.text}'")
        print(f" Confidence: {result.confidence:.2%}" if result.confidence else " Confidence: N/A")
        print(f" Processing time: {result.processing_time:.2f}s")
        print(f" Metadata: {result.metadata}")

        print("βœ… Transcription test completed!")
        return True

    except Exception as e:
        print(f"❌ Transcription failed: {e}")
        return False


def test_integration():
    """Test integration with the modular transcriber."""
    print(f"\nπŸ”— Testing integration with modular transcriber...")

    try:
        from gradio_voice_transcriber_clean import ModelManager, STT_MODELS

        available_models = ModelManager.get_available_models()
        print(f"πŸ“‹ Available models: {available_models}")

        if "Wav2Vec2ArabicSTT" in available_models:
            print("βœ… Wav2Vec2ArabicSTT is registered in the modular transcriber")

            # Test model options
            options = ModelManager.get_model_options("Wav2Vec2ArabicSTT")
            print(f"πŸ“Š Model options: {options}")

            return True
        else:
            print("❌ Wav2Vec2ArabicSTT not found in available models")
            return False

    except ImportError as e:
        print(f"❌ Failed to import modular transcriber components: {e}")
        return False


def main():
    """Main test function."""
    print("πŸ§ͺ Wav2Vec2 Arabic STT Test Suite")
    print("=" * 50)

    # Test the individual STT implementation
    stt_test = test_wav2vec2_arabic()

    # Test integration
    integration_test = test_integration()

    print("\n" + "=" * 50)
    print("πŸ“Š Test Results Summary:")
    print(f" STT Implementation: {'βœ… PASS' if stt_test else '❌ FAIL'}")
    print(f" Integration: {'βœ… PASS' if integration_test else '❌ FAIL'}")

    if stt_test and integration_test:
        print("\nπŸŽ‰ All tests passed! The Wav2Vec2 Arabic STT is ready to use.")
        print("\nπŸ’‘ Next steps:")
        print(" 1. Run: python gradio_voice_transcriber_clean.py")
        print(" 2. Select 'Wav2Vec2ArabicSTT' from the dropdown")
        print(" 3. Choose your device (CPU/CUDA)")
        print(" 4. Load the model and test with Arabic audio!")
    else:
        print("\n❌ Some tests failed. Please check the errors above.")

    if not stt_test:
        print("\nπŸ“¦ To fix STT implementation issues:")
        print(" pip install transformers torch torchaudio")
        print(" pip install librosa  # optional, for better audio processing")


if __name__ == "__main__":
    main()
test_whisper_local.py ADDED
@@ -0,0 +1,32 @@
#!/usr/bin/env python3
"""
Simple Whisper Test

Load the Whisper model and test transcription.
"""

import numpy as np
from stt import WhisperSTT


def main():
    print("Loading Whisper model...")
    WhisperSTT.load_model("tiny")  # Load the tiny model (fastest)

    if not WhisperSTT.is_loaded:
        print("Failed to load model")
        return

    print("Model loaded successfully!")

    # Create some test audio (1 second of random noise)
    test_audio = np.random.randn(16000).astype(np.float32)

    print("Transcribing audio...")
    result = WhisperSTT.transcribe_numpy(test_audio, 16000)

    print(f"Result: {result.text}")
    print(f"Confidence: {result.confidence}")
    print(f"Time: {result.processing_time:.2f}s")


if __name__ == "__main__":
    main()
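# Hedged follow-up sketch: to test with real speech instead of noise, whisper's
# own audio loader can be used ("speech.wav" is an illustrative path):
#
#     import whisper
#     audio = whisper.load_audio("speech.wav")  # float32 mono at 16 kHz
#     result = WhisperSTT.transcribe_numpy(audio, 16000)
#     print(result.text)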
uv.lock ADDED
The diff for this file is too large to render. See raw diff