Spaces: anfastech/zlaqa-version-b-ai-enginee

anfastech committed on Dec 7, 2025

Commit

74a089b

1 Parent(s): e7e9fa8

Updation: ML/AI logic is now in the AI engine service

Files changed (5)

ARCHITECTURE.md +185 -0
app.py +4 -4
diagnosis/ai_engine/detect_stuttering.py +25 -83
diagnosis/ai_engine/features.py +206 -0
diagnosis/ai_engine/model_loader.py +51 -0

ARCHITECTURE.md ADDED

	@@ -0,0 +1,185 @@

+# AI Engine Architecture
+## Clean Architecture Implementation
+This AI engine follows clean architecture principles with proper separation of concerns.
+---
+## Module Structure
+```
+diagnosis/ai_engine/
+├── detect_stuttering.py    # Main detector class (business logic)
+├── model_loader.py         # Singleton pattern for model loading
+└── features.py             # Feature extraction (ASR features)
+```
+---
+## Architecture Pattern
+### 1. Model Loader (`model_loader.py`)
+**Responsibility**: Singleton pattern for model instance management
+- Ensures models are loaded only once
+- Provides clean interface: `get_stutter_detector()`
+- Handles initialization and error handling
+- Used by API layer (`app.py`)
+**Usage:**
+```python
+from diagnosis.ai_engine.model_loader import get_stutter_detector
+detector = get_stutter_detector()  # Singleton instance
+```
+---
+### 2. Feature Extractor (`features.py`)
+**Responsibility**: Feature extraction from audio using IndicWav2Vec Hindi
+**Class**: `ASRFeatureExtractor`
+**Methods:**
+- `extract_audio_features()` - Raw audio feature extraction
+- `get_transcription_features()` - Transcription with confidence scores
+- `get_word_level_features()` - Word-level timestamps and confidence
+**Design Pattern**:
+- Takes pre-loaded model and processor as dependencies
+- Single responsibility: feature extraction only
+- Reusable across different use cases
+**Usage:**
+```python
+from .features import ASRFeatureExtractor
+extractor = ASRFeatureExtractor(model, processor, device)
+features = extractor.get_transcription_features(audio)
+```
+---
+### 3. Detector (`detect_stuttering.py`)
+**Responsibility**: High-level stutter detection orchestration
+**Class**: `AdvancedStutterDetector`
+**Design:**
+- Uses feature extractor for transcription (composition)
+- Orchestrates the analysis pipeline
+- Returns structured results
+**Flow:**
+```
+Audio Input
+    ↓
+Feature Extractor (ASR)
+    ↓
+Text Analysis
+    ↓
+Results
+```
+---
+## Benefits of This Architecture
+### ✅ Separation of Concerns
+- **Model Loading**: Isolated in `model_loader.py`
+- **Feature Extraction**: Isolated in `features.py`
+- **Business Logic**: In `detect_stuttering.py`
+### ✅ Single Responsibility Principle
+- Each module has one clear purpose
+- Easy to test and maintain
+- Easy to extend or replace components
+### ✅ Dependency Injection
+- Feature extractor receives model/processor as dependencies
+- No tight coupling
+- Easy to mock for testing
+### ✅ Reusability
+- Feature extractor can be used independently
+- Model loader can be used by other modules
+- Clean interfaces between layers
+---
+## Data Flow
+```
+API Request (app.py)
+    ↓
+get_stutter_detector() [model_loader.py]
+    ↓
+AdvancedStutterDetector [detect_stuttering.py]
+    ↓
+ASRFeatureExtractor [features.py]
+    ↓
+IndicWav2Vec Hindi Model
+    ↓
+Results back through layers
+```
+---
+## Comparison with Django App
+**Before (Django App):**
+- Model loading logic in Django app
+- Feature extraction in Django app
+- Tight coupling between web app and ML logic
+**After (AI Engine Service):**
+- ✅ Model loading in AI engine service
+- ✅ Feature extraction in AI engine service
+- ✅ Django app only calls API (loose coupling)
+- ✅ ML logic isolated in dedicated service
+---
+## Extension Points
+### Adding New Features
+1. Add method to `ASRFeatureExtractor` in `features.py`
+2. Use in `AdvancedStutterDetector` via composition
+3. No changes needed to model loader
+### Adding New Models
+1. Update `detect_stuttering.py` to load new model
+2. Create new feature extractor if needed
+3. Model loader remains unchanged
+### Testing
+- Mock `ASRFeatureExtractor` in tests
+- Mock model loader for integration tests
+- Each component can be tested independently
+---
+## Key Principles Applied
+1. **Dependency Inversion**: High-level modules don't depend on low-level modules
+2. **Open/Closed**: Open for extension, closed for modification
+3. **Interface Segregation**: Clean, focused interfaces
+4. **Don't Repeat Yourself (DRY)**: Feature extraction logic centralized
+5. **Single Source of Truth**: Model instance managed by singleton
+---
+## File Responsibilities
+| File | Responsibility | Depends On |
+|------|---------------|------------|
+| `model_loader.py` | Singleton model management | `detect_stuttering.py` |
+| `features.py` | Feature extraction | `transformers`, `torch` |
+| `detect_stuttering.py` | Business logic orchestration | `features.py`, `transformers`, `torch` |
+| `app.py` | API layer | `model_loader.py` |
+---
+This architecture ensures the ML/AI logic stays in the AI engine service, not in the Django web application, following microservices best practices.
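
The "Dependency Injection" and "Testing" points in ARCHITECTURE.md can be illustrated with a small sketch. Everything here is hypothetical: `StubFeatureExtractor` and `MiniDetector` are stand-ins invented for illustration, and the repeated-word count is not the detector's real analysis. Note also that in this commit `AdvancedStutterDetector` constructs its own `ASRFeatureExtractor` during initialization, so passing the extractor into the detector's constructor is an assumption about how one might arrange things for tests.

```python
from typing import Any, Dict, List


class StubFeatureExtractor:
    """Hypothetical stand-in for ASRFeatureExtractor; returns canned results."""

    def __init__(self, transcript: str) -> None:
        self.transcript = transcript
        self.calls: List[str] = []

    def get_transcription_features(self, audio: Any, sample_rate: int = 16000) -> Dict[str, Any]:
        # Record the call so a test can check how often inference was requested.
        self.calls.append("get_transcription_features")
        return {"transcript": self.transcript, "confidence": 1.0}


class MiniDetector:
    """Hypothetical, simplified detector that receives its extractor by injection."""

    def __init__(self, feature_extractor: Any) -> None:
        self.feature_extractor = feature_extractor

    def count_repeated_words(self, audio: Any) -> int:
        features = self.feature_extractor.get_transcription_features(audio)
        words = features["transcript"].split()
        # Count positions where a word immediately repeats its predecessor.
        return sum(1 for prev, cur in zip(words, words[1:]) if prev == cur)


stub = StubFeatureExtractor("main main ghar ja ja raha hoon")
detector = MiniDetector(stub)
repeats = detector.count_repeated_words(audio=[0.0] * 16000)
print(repeats)
```

Because the stub needs no model weights, a test like this runs without `torch` or `transformers` installed, which is the practical payoff of keeping feature extraction behind a narrow interface.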

app.py CHANGED

@@ -18,12 +18,12 @@ logger = logging.getLogger(__name__)
 # Add project root to path
 sys.path.insert(0, str(Path(__file__).parent))
-# Import detector
 try:
-    from diagnosis.ai_engine.detect_stuttering import get_stutter_detector
-    logger.info("✅ Successfully imported StutterDetector")
 except ImportError as e:
-    logger.error(f"❌ Failed to import StutterDetector: {e}")
     raise
 # Initialize FastAPI

 # Add project root to path
 sys.path.insert(0, str(Path(__file__).parent))
+# Import detector using model loader (clean architecture)
 try:
+    from diagnosis.ai_engine.model_loader import get_stutter_detector
+    logger.info("✅ Successfully imported model loader")
 except ImportError as e:
+    logger.error(f"❌ Failed to import model loader: {e}")
     raise
 # Initialize FastAPI

diagnosis/ai_engine/detect_stuttering.py CHANGED

@@ -107,6 +107,14 @@ class AdvancedStutterDetector:
             ).to(DEVICE)
             self.model.eval()
             # Debug: Log processor structure
             logger.info(f"📋 Processor type: {type(self.processor)}")
             if hasattr(self.processor, 'tokenizer'):
@@ -114,7 +122,7 @@ class AdvancedStutterDetector:
             if hasattr(self.processor, 'feature_extractor'):
                 logger.info(f"📋 Feature extractor type: {type(self.processor.feature_extractor)}")
-            logger.info("✅ IndicWav2Vec Hindi ASR Engine Loaded")
         except Exception as e:
             logger.error(f"🔥 Engine Failure: {e}")
             raise
@@ -236,71 +244,22 @@ class AdvancedStutterDetector:
         return features
     def _transcribe_with_timestamps(self, audio: np.ndarray) -> Tuple[str, List[Dict], torch.Tensor]:
-        """Transcribe audio and return word timestamps and logits"""
         try:
-            inputs = self.processor(audio, sampling_rate=16000, return_tensors="pt").to(DEVICE)
-            with torch.no_grad():
-                outputs = self.model(**inputs)
-                logits = outputs.logits
-                predicted_ids = torch.argmax(logits, dim=-1)
-            # Decode transcript - IndicWav2Vec uses tokenizer for decoding
-            transcript = ""
-            try:
-                # Method 1: Try using processor's tokenizer directly
-                if hasattr(self.processor, 'tokenizer'):
-                    transcript = self.processor.tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
-                    logger.info(f"📝 Decoded via tokenizer: '{transcript}' (length: {len(transcript)})")
-                # Method 2: Try batch_decode if tokenizer not available
-                elif hasattr(self.processor, 'batch_decode'):
-                    transcript = self.processor.batch_decode(predicted_ids)[0]
-                    logger.info(f"📝 Decoded via batch_decode: '{transcript}' (length: {len(transcript)})")
-                # Method 3: Try accessing tokenizer through processor.feature_extractor or processor attributes
-                else:
-                    # Check if processor wraps a tokenizer
-                    for attr in ['tokenizer', '_tokenizer', 'decoder']:
-                        if hasattr(self.processor, attr):
-                            tokenizer = getattr(self.processor, attr)
-                            if hasattr(tokenizer, 'decode'):
-                                transcript = tokenizer.decode(predicted_ids[0], skip_special_tokens=True)
-                                logger.info(f"📝 Decoded via {attr}: '{transcript}' (length: {len(transcript)})")
-                                break
-                # Clean up transcript - remove special tokens and normalize
-                if transcript:
-                    transcript = transcript.strip()
-                    # Remove common special tokens if present
-                    transcript = transcript.replace('<pad>', '').replace('<s>', '').replace('</s>', '').replace('|', ' ').strip()
-                    # Normalize whitespace
-                    transcript = ' '.join(transcript.split())
-            except Exception as decode_error:
-                logger.error(f"⚠️ Decode error: {decode_error}", exc_info=True)
-                transcript = ""
-            # Ensure transcript is not None
-            if not transcript:
-                transcript = ""
-                logger.warning("⚠️ Empty transcript generated - model may not have produced valid output")
-                logger.warning(f"⚠️ Predicted IDs shape: {predicted_ids.shape}, sample values: {predicted_ids[0][:10].tolist() if predicted_ids.numel() > 0 else 'empty'}")
-            # Estimate word timestamps (simplified - frame-level alignment)
-            frame_duration = 0.02  # 20ms per frame
-            num_frames = logits.shape[1]
-            audio_duration = len(audio) / 16000
-            # Simple word-level timestamps (would need proper alignment for production)
-            words = transcript.split() if transcript else []
-            word_timestamps = []
-            time_per_word = audio_duration / max(len(words), 1) if words else 0
-            for i, word in enumerate(words):
-                word_timestamps.append({
-                    'word': word,
-                    'start': i * time_per_word,
-                    'end': (i + 1) * time_per_word
-                })
             return transcript, word_timestamps, logits
         except Exception as e:
@@ -860,23 +819,6 @@ class AdvancedStutterDetector:
         return round(min(max(confidence, 0.0), 1.0), 2)
-# diagnosis/ai_engine/model_loader.py
-"""Singleton pattern for model loading"""
-_detector_instance = None
-def get_stutter_detector():
-    """Get or create singleton AdvancedStutterDetector instance"""
-    global _detector_instance
-    if _detector_instance is None:
-        _detector_instance = AdvancedStutterDetector()
-    return _detector_instance
-# Singleton pattern for model loading
-_detector_instance = None
-def get_stutter_detector():
-    """Get or create singleton AdvancedStutterDetector instance"""
-    global _detector_instance
-    if _detector_instance is None:
-        _detector_instance = AdvancedStutterDetector()
-    return _detector_instance

             ).to(DEVICE)
             self.model.eval()
+            # Initialize feature extractor (clean architecture pattern)
+            from .features import ASRFeatureExtractor
+            self.feature_extractor = ASRFeatureExtractor(
+                model=self.model,
+                processor=self.processor,
+                device=DEVICE
+            )
             # Debug: Log processor structure
             logger.info(f"📋 Processor type: {type(self.processor)}")
             if hasattr(self.processor, 'tokenizer'):
             if hasattr(self.processor, 'feature_extractor'):
                 logger.info(f"📋 Feature extractor type: {type(self.processor.feature_extractor)}")
+            logger.info("✅ IndicWav2Vec Hindi ASR Engine Loaded with Feature Extractor")
         except Exception as e:
             logger.error(f"🔥 Engine Failure: {e}")
             raise
         return features
     def _transcribe_with_timestamps(self, audio: np.ndarray) -> Tuple[str, List[Dict], torch.Tensor]:
+        """
+        Transcribe audio and return word timestamps and logits.
+        Uses the feature extractor for clean separation of concerns.
+        """
         try:
+            # Use feature extractor for transcription (clean architecture)
+            features = self.feature_extractor.get_transcription_features(audio, sample_rate=16000)
+            transcript = features['transcript']
+            logits = torch.from_numpy(features['logits'])
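+            # Get word-level features for timestamps.
+            # Note: get_word_level_features() calls get_transcription_features() internally,
+            # so the model runs a second time on the same audio here.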
+            word_features = self.feature_extractor.get_word_level_features(audio, sample_rate=16000)
+            word_timestamps = word_features['word_timestamps']
+            logger.info(f"📝 Transcription via feature extractor: '{transcript}' (length: {len(transcript)}, words: {len(word_timestamps)})")
             return transcript, word_timestamps, logits
         except Exception as e:
         return round(min(max(confidence, 0.0), 1.0), 2)
+# Model loader is now in a separate module: model_loader.py
+# This follows clean architecture principles - separation of concerns
+# Import using: from diagnosis.ai_engine.model_loader import get_stutter_detector

diagnosis/ai_engine/features.py ADDED

	@@ -0,0 +1,206 @@

+# diagnosis/ai_engine/features.py
+"""
+Feature extraction for IndicWav2Vec Hindi ASR
+This module provides feature extraction capabilities using the IndicWav2Vec Hindi model.
+Focused on ASR transcription features rather than hybrid acoustic+linguistic features.
+"""
+import torch
+import numpy as np
+import logging
+from typing import Dict, Any, Tuple, Optional
+from transformers import Wav2Vec2ForCTC, AutoProcessor
+logger = logging.getLogger(__name__)
+class ASRFeatureExtractor:
+    """
+    Feature extractor using IndicWav2Vec Hindi for Automatic Speech Recognition.
+    This extractor focuses on:
+    - Audio feature extraction via IndicWav2Vec
+    - Transcription confidence scores
+    - Frame-level predictions and logits
+    - Word-level alignments (estimated)
+    Model: ai4bharat/indicwav2vec-hindi
+    """
+    def __init__(self, model: Wav2Vec2ForCTC, processor: AutoProcessor, device: str = "cpu"):
+        """
+        Initialize the ASR feature extractor.
+        Args:
+            model: Pre-loaded IndicWav2Vec Hindi model
+            processor: Pre-loaded processor for the model
+            device: Device to run inference on ('cpu' or 'cuda')
+        """
+        self.model = model
+        self.processor = processor
+        self.device = device
+        self.model.eval()
+        logger.info(f"✅ ASRFeatureExtractor initialized on {device}")
+    def extract_audio_features(self, audio: np.ndarray, sample_rate: int = 16000) -> Dict[str, Any]:
+        """
+        Extract features from audio using IndicWav2Vec Hindi.
+        Args:
+            audio: Audio waveform as numpy array
+            sample_rate: Sample rate of the audio (default: 16000)
+        Returns:
+            Dictionary containing:
+            - input_values: Processed audio features
+            - attention_mask: Attention mask (if available)
+        """
+        try:
+            # Process audio through the processor
+            inputs = self.processor(
+                audio,
+                sampling_rate=sample_rate,
+                return_tensors="pt"
+            ).to(self.device)
+            return {
+                'input_values': inputs.input_values,
+                'attention_mask': inputs.get('attention_mask', None)
+            }
+        except Exception as e:
+            logger.error(f"❌ Error extracting audio features: {e}")
+            raise
+    def get_transcription_features(
+        self,
+        audio: np.ndarray,
+        sample_rate: int = 16000
+    ) -> Dict[str, Any]:
+        """
+        Get transcription features including logits, predictions, and confidence.
+        Args:
+            audio: Audio waveform as numpy array
+            sample_rate: Sample rate of the audio (default: 16000)
+        Returns:
+            Dictionary containing:
+            - transcript: Transcribed text
+            - logits: Model logits (raw predictions)
+            - predicted_ids: Predicted token IDs
+            - probabilities: Softmax probabilities
+            - confidence: Average confidence score
+            - frame_confidence: Per-frame confidence scores
+            - num_frames: Number of output frames (logits.shape[1])
+        """
+        try:
+            # Process audio
+            inputs = self.processor(
+                audio,
+                sampling_rate=sample_rate,
+                return_tensors="pt"
+            ).to(self.device)
+            # Get model predictions
+            with torch.no_grad():
+                outputs = self.model(**inputs)
+                logits = outputs.logits
+                predicted_ids = torch.argmax(logits, dim=-1)
+            # Calculate probabilities and confidence
+            probs = torch.softmax(logits, dim=-1)
+            max_probs = torch.max(probs, dim=-1)[0]  # Get max probability per frame
+            frame_confidence = max_probs[0].cpu().numpy()
+            avg_confidence = float(torch.mean(max_probs).item())
+            # Decode transcript
+            transcript = ""
+            try:
+                if hasattr(self.processor, 'tokenizer'):
+                    transcript = self.processor.tokenizer.decode(
+                        predicted_ids[0],
+                        skip_special_tokens=True
+                    )
+                elif hasattr(self.processor, 'batch_decode'):
+                    transcript = self.processor.batch_decode(predicted_ids)[0]
+                # Clean up transcript
+                if transcript:
+                    transcript = transcript.strip()
+                    transcript = transcript.replace('<pad>', '').replace('<s>', '').replace('</s>', '').replace('|', ' ').strip()
+                    transcript = ' '.join(transcript.split())
+            except Exception as e:
+                logger.warning(f"⚠️ Decode error: {e}")
+                transcript = ""
+            return {
+                'transcript': transcript,
+                'logits': logits.cpu().numpy(),
+                'predicted_ids': predicted_ids.cpu().numpy(),
+                'probabilities': probs.cpu().numpy(),
+                'confidence': avg_confidence,
+                'frame_confidence': frame_confidence,
+                'num_frames': logits.shape[1]
+            }
+        except Exception as e:
+            logger.error(f"❌ Error getting transcription features: {e}")
+            raise
+    def get_word_level_features(
+        self,
+        audio: np.ndarray,
+        sample_rate: int = 16000
+    ) -> Dict[str, Any]:
+        """
+        Get word-level features including timestamps and confidence.
+        Args:
+            audio: Audio waveform as numpy array
+            sample_rate: Sample rate of the audio (default: 16000)
+        Returns:
+            Dictionary containing:
+            - words: List of words
+            - word_timestamps: List of dicts with 'word', 'start', and 'end' (seconds) for each word
+            - word_confidence: Confidence score for each word
+            - transcript: Full transcribed text
+        """
+        try:
+            # Get transcription features
+            features = self.get_transcription_features(audio, sample_rate)
+            transcript = features['transcript']
+            frame_confidence = features['frame_confidence']
+            num_frames = features['num_frames']
+            # Estimate word-level timestamps (simplified)
+            words = transcript.split() if transcript else []
+            audio_duration = len(audio) / sample_rate
+            time_per_word = audio_duration / max(len(words), 1) if words else 0
+            word_timestamps = []
+            word_confidence = []
+            for i, word in enumerate(words):
+                start_time = i * time_per_word
+                end_time = (i + 1) * time_per_word
+                # Estimate confidence for this word (average of corresponding frames)
+                start_frame = int((start_time / audio_duration) * num_frames)
+                end_frame = int((end_time / audio_duration) * num_frames)
+                word_conf = float(np.mean(frame_confidence[start_frame:end_frame])) if end_frame > start_frame else 0.5
+                word_timestamps.append({
+                    'word': word,
+                    'start': start_time,
+                    'end': end_time
+                })
+                word_confidence.append(word_conf)
+            return {
+                'words': words,
+                'word_timestamps': word_timestamps,
+                'word_confidence': word_confidence,
+                'transcript': transcript
+            }
+        except Exception as e:
+            logger.error(f"❌ Error getting word-level features: {e}")
+            raise
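
The word-level estimate in `get_word_level_features()` can be sketched as a pure function, which makes the arithmetic easier to check in isolation. This is a minimal sketch, not the committed code: the name `estimate_word_spans` is hypothetical, and it takes an already computed per-frame confidence sequence instead of running the model. As in the source, the audio duration is split evenly across words, so the timestamps are an approximation and not a forced alignment.

```python
from typing import Any, Dict, List, Sequence


def estimate_word_spans(
    transcript: str,
    frame_confidence: Sequence[float],
    audio_duration: float,
) -> List[Dict[str, Any]]:
    """Split the audio duration evenly across words and average frame confidence per word."""
    words = transcript.split()
    if not words or audio_duration <= 0:
        return []

    num_frames = len(frame_confidence)
    time_per_word = audio_duration / len(words)
    spans: List[Dict[str, Any]] = []

    for i, word in enumerate(words):
        start = i * time_per_word
        end = (i + 1) * time_per_word
        # Map the time span onto frame indices, proportionally to the duration.
        start_frame = int(start / audio_duration * num_frames)
        end_frame = int(end / audio_duration * num_frames)
        frames = list(frame_confidence[start_frame:end_frame])
        # Same fallback as the source: 0.5 when no frames fall inside the span.
        confidence = sum(frames) / len(frames) if frames else 0.5
        spans.append({"word": word, "start": start, "end": end, "confidence": confidence})

    return spans


spans = estimate_word_spans("namaste duniya", [1.0, 0.5, 0.25, 0.75], audio_duration=2.0)
for span in spans:
    print(span)
```

One design consequence: because this version accepts precomputed frame confidences, a caller that already has the output of `get_transcription_features()` could derive word spans without a second model pass.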

diagnosis/ai_engine/model_loader.py ADDED

	@@ -0,0 +1,51 @@

+# diagnosis/ai_engine/model_loader.py
+"""Singleton pattern for model loading
+This loader provides a clean interface for getting the detector instance.
+Uses singleton pattern to ensure models are loaded only once.
+"""
+import logging
+logger = logging.getLogger(__name__)
+_detector_instance = None
+def get_stutter_detector():
+    """
+    Get or create singleton AdvancedStutterDetector instance.
+    This ensures models are loaded only once and reused across requests.
+    Returns:
+        AdvancedStutterDetector: The singleton detector instance
+    Raises:
+        ImportError: If the detector class cannot be imported
+        Exception: Re-raised if the detector fails to initialize
+    """
+    global _detector_instance
+    if _detector_instance is None:
+        try:
+            from .detect_stuttering import AdvancedStutterDetector
+            logger.info("🔄 Initializing detector instance (first call)...")
+            _detector_instance = AdvancedStutterDetector()
+            logger.info("✅ Detector instance created successfully")
+        except ImportError as e:
+            logger.error(f"❌ Failed to import AdvancedStutterDetector: {e}")
+            raise ImportError("No StutterDetector implementation available in detect_stuttering.py") from e
+        except Exception as e:
+            logger.error(f"❌ Failed to create detector instance: {e}")
+            raise
+    return _detector_instance
+def reset_detector():
+    """
+    Reset the singleton instance (useful for testing or reloading models).
+    Note: This will force reloading of models on next get_stutter_detector() call.
+    """
+    global _detector_instance
+    _detector_instance = None
+    logger.info("🔄 Detector instance reset")
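
The lazy singleton and its reset hook can be exercised with a stand-in class, as in the sketch below. `FakeDetector` and `get_detector` are hypothetical names used only for illustration. The `threading.Lock` is an assumption and is not part of the committed loader; it is one possible way to avoid constructing the detector twice if two requests arrive before the first load finishes.

```python
import threading
from typing import Optional


class FakeDetector:
    """Hypothetical stand-in for AdvancedStutterDetector; counts constructions."""

    instances_created = 0

    def __init__(self) -> None:
        FakeDetector.instances_created += 1


_instance: Optional[FakeDetector] = None
_lock = threading.Lock()  # Assumption: the committed loader has no lock.


def get_detector() -> FakeDetector:
    """Return the shared instance, creating it on first use."""
    global _instance
    if _instance is None:
        with _lock:
            # Re-check inside the lock so only one thread constructs the instance.
            if _instance is None:
                _instance = FakeDetector()
    return _instance


def reset_detector() -> None:
    """Drop the shared instance so the next get_detector() call rebuilds it."""
    global _instance
    with _lock:
        _instance = None


first = get_detector()
second = get_detector()
reset_detector()
third = get_detector()
print(first is second, third is first, FakeDetector.instances_created)
```

In a test suite, calling the reset function in teardown keeps one test's detector from leaking into the next, which matches the "useful for testing" note in the `reset_detector()` docstring.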