anfastech committed
Commit a62077e (0 parents)

fix: resolve torch security error by pinning torch 2.6.0 and updating requirements
.gitignore ADDED
@@ -0,0 +1,4 @@
+ .env
+ hello.wav
+ venv/
+ __pycache__/
Dockerfile ADDED
@@ -0,0 +1,30 @@
+ FROM python:3.10
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     libffi-dev \
+     libsndfile1 \
+     libasound2 \
+     libxt6 \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first to leverage the Docker cache
+ COPY requirements.txt .
+
+ # Install PyTorch CPU first (2.6.0 is available for CPU)
+ RUN pip install --no-cache-dir torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0
+
+ # Install the rest of the requirements
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application files
+ COPY . .
+
+ EXPOSE 7860
+
+ ENV PYTHONUNBUFFERED=1
+
+ # Run the application
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
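The Dockerfile above can be built and run locally; a hypothetical sketch, in which the image tag `stutter-api` and the `hf_xxx` token value are placeholders:

```shell
# Build the image from the repository root (tag name is a placeholder)
docker build -t stutter-api .

# Run it, passing the Hugging Face token the model loader expects
docker run -p 7860:7860 -e HF_TOKEN=hf_xxx stutter-api
```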
Docs/ARCHITECTURE.md ADDED
@@ -0,0 +1,185 @@
+ # AI Engine Architecture
+
+ ## Clean Architecture Implementation
+
+ This AI engine follows clean architecture principles with a proper separation of concerns.
+
+ ---
+
+ ## Module Structure
+
+ ```
+ diagnosis/ai_engine/
+ ├── detect_stuttering.py   # Main detector class (business logic)
+ ├── model_loader.py        # Singleton pattern for model loading
+ └── features.py            # Feature extraction (ASR features)
+ ```
+
+ ---
+
+ ## Architecture Pattern
+
+ ### 1. Model Loader (`model_loader.py`)
+ **Responsibility**: Singleton pattern for model instance management
+
+ - Ensures models are loaded only once
+ - Provides a clean interface: `get_stutter_detector()`
+ - Handles initialization and error handling
+ - Used by the API layer (`app.py`)
+
+ **Usage:**
+ ```python
+ from diagnosis.ai_engine.model_loader import get_stutter_detector
+
+ detector = get_stutter_detector()  # Singleton instance
+ ```
+
+ ---
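The singleton described above can be sketched in a few lines. This is a minimal stand-in, not the real module: `_StubDetector` is hypothetical, since the actual `AdvancedStutterDetector` needs the model weights to construct; only the `get_stutter_detector()` name comes from the source.

```python
# Minimal sketch of the model_loader.py singleton pattern.
# _StubDetector is a hypothetical stand-in for AdvancedStutterDetector.
_detector = None

class _StubDetector:
    def analyze_audio(self, audio_path, transcript=""):
        return {"actual_transcript": "", "severity": "none"}

def get_stutter_detector():
    """Return the shared detector instance, loading it on first call."""
    global _detector
    if _detector is None:
        _detector = _StubDetector()  # real code would build AdvancedStutterDetector()
    return _detector
```

Because the module-level `_detector` is created at most once, every caller (the API layer, background tasks) shares the same loaded model.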
+
+ ### 2. Feature Extractor (`features.py`)
+ **Responsibility**: Feature extraction from audio using IndicWav2Vec Hindi
+
+ **Class**: `ASRFeatureExtractor`
+
+ **Methods:**
+ - `extract_audio_features()` - raw audio feature extraction
+ - `get_transcription_features()` - transcription with confidence scores
+ - `get_word_level_features()` - word-level timestamps and confidence
+
+ **Design Pattern:**
+ - Takes a pre-loaded model and processor as dependencies
+ - Single responsibility: feature extraction only
+ - Reusable across different use cases
+
+ **Usage:**
+ ```python
+ from .features import ASRFeatureExtractor
+
+ extractor = ASRFeatureExtractor(model, processor, device)
+ features = extractor.get_transcription_features(audio)
+ ```
+
+ ---
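Because the extractor takes its model and processor as constructor arguments, tests can inject stubs. A hedged sketch of that dependency-injection shape: `ExtractorSketch`, `FakeModel`, and `FakeProcessor` are illustrative names, not the real classes.

```python
# Sketch of the dependency-injection pattern described above.
# The real extractor wraps a Wav2Vec2ForCTC model and its processor;
# here simple stubs are injected instead.
class ExtractorSketch:
    def __init__(self, model, processor, device="cpu"):
        self.model = model
        self.processor = processor
        self.device = device

    def get_transcription_features(self, audio):
        ids = self.model.predict(audio)             # stand-in for forward + argmax
        return {"text": self.processor.decode(ids)}

class FakeModel:
    def predict(self, audio):
        return [7, 8, 9]

class FakeProcessor:
    def decode(self, ids):
        return " ".join(str(i) for i in ids)
```

Swapping `FakeModel`/`FakeProcessor` for the real objects changes nothing in the extractor itself, which is the point of injecting the dependencies.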
+
+ ### 3. Detector (`detect_stuttering.py`)
+ **Responsibility**: High-level stutter detection orchestration
+
+ **Class**: `AdvancedStutterDetector`
+
+ **Design:**
+ - Uses the feature extractor for transcription (composition)
+ - Orchestrates the analysis pipeline
+ - Returns structured results
+
+ **Flow:**
+ ```
+ Audio Input
+      ↓
+ Feature Extractor (ASR)
+      ↓
+ Text Analysis
+      ↓
+ Results
+ ```
+
+ ---
+
+ ## Benefits of This Architecture
+
+ ### ✅ Separation of Concerns
+ - **Model Loading**: isolated in `model_loader.py`
+ - **Feature Extraction**: isolated in `features.py`
+ - **Business Logic**: in `detect_stuttering.py`
+
+ ### ✅ Single Responsibility Principle
+ - Each module has one clear purpose
+ - Easy to test and maintain
+ - Easy to extend or replace components
+
+ ### ✅ Dependency Injection
+ - The feature extractor receives the model/processor as dependencies
+ - No tight coupling
+ - Easy to mock for testing
+
+ ### ✅ Reusability
+ - The feature extractor can be used independently
+ - The model loader can be used by other modules
+ - Clean interfaces between layers
+
+ ---
+
+ ## Data Flow
+
+ ```
+ API Request (app.py)
+      ↓
+ get_stutter_detector()  [model_loader.py]
+      ↓
+ AdvancedStutterDetector  [detect_stuttering.py]
+      ↓
+ ASRFeatureExtractor  [features.py]
+      ↓
+ IndicWav2Vec Hindi Model
+      ↓
+ Results back through the layers
+ ```
+
+ ---
+
+ ## Comparison with the Django App
+
+ **Before (Django app):**
+ - Model-loading logic in the Django app
+ - Feature extraction in the Django app
+ - Tight coupling between the web app and ML logic
+
+ **After (AI engine service):**
+ - ✅ Model loading in the AI engine service
+ - ✅ Feature extraction in the AI engine service
+ - ✅ The Django app only calls the API (loose coupling)
+ - ✅ ML logic isolated in a dedicated service
+
+ ---
+
+ ## Extension Points
+
+ ### Adding New Features
+ 1. Add a method to `ASRFeatureExtractor` in `features.py`
+ 2. Use it in `AdvancedStutterDetector` via composition
+ 3. No changes needed to the model loader
+
+ ### Adding New Models
+ 1. Update `detect_stuttering.py` to load the new model
+ 2. Create a new feature extractor if needed
+ 3. The model loader remains unchanged
+
+ ### Testing
+ - Mock `ASRFeatureExtractor` in unit tests
+ - Mock the model loader for integration tests
+ - Each component can be tested independently
+
+ ---
+
+ ## Key Principles Applied
+
+ 1. **Dependency Inversion**: high-level modules don't depend on low-level modules
+ 2. **Open/Closed**: open for extension, closed for modification
+ 3. **Interface Segregation**: clean, focused interfaces
+ 4. **Don't Repeat Yourself (DRY)**: feature-extraction logic is centralized
+ 5. **Single Source of Truth**: the model instance is managed by a singleton
+
+ ---
+
+ ## File Responsibilities
+
+ | File | Responsibility | Depends On |
+ |------|---------------|------------|
+ | `model_loader.py` | Singleton model management | `detect_stuttering.py` |
+ | `features.py` | Feature extraction | `transformers`, `torch` |
+ | `detect_stuttering.py` | Business-logic orchestration | `features.py`, `model_loader.py` |
+ | `app.py` | API layer | `model_loader.py` |
+
+ ---
+
+ This architecture keeps the ML/AI logic in the AI engine service rather than in the Django web application, following microservices best practices.
+
Docs/MODEL_SUMMARY.md ADDED
@@ -0,0 +1,130 @@
+ # AI Engine Model Summary
+
+ ## Simplified ASR-Only Configuration
+
+ This engine has been simplified to use **ONLY** the IndicWav2Vec Hindi model for Automatic Speech Recognition (ASR).
+
+ ---
+
+ ## Active Model
+
+ ### 1. IndicWav2Vec Hindi (Primary & Only Model)
+ - **Model ID**: `ai4bharat/indicwav2vec-hindi`
+ - **Type**: `Wav2Vec2ForCTC`
+ - **Purpose**: Automatic Speech Recognition (ASR) for Hindi and Indian languages
+ - **Status**: ✅ Active - loaded at startup
+ - **Location**: `detect_stuttering.py` lines 26, 148-156
+ - **Authentication**: Requires the `HF_TOKEN` environment variable
+
+ **Features:**
+ - Speech-to-text transcription
+ - Confidence scoring from model predictions
+ - Text-based stutter analysis (simple repetition detection)
+
+ ---
+
+ ## Removed Models
+
+ The following models have been **removed** to simplify the engine:
+
+ 1. ❌ **MMS Language Identification (LID)** - `facebook/mms-lid-126`
+    - Previously used for language detection
+    - No longer needed - IndicWav2Vec handles Hindi natively
+
+ 2. ❌ **Isolation Forest** (sklearn)
+    - Previously used for anomaly detection
+    - Removed - simple text-based analysis is used instead
+
+ ---
+
+ ## Removed Libraries
+
+ The following signal-processing libraries are no longer used:
+
+ - ❌ `parselmouth` (Praat) - voice-quality analysis
+ - ❌ `fastdtw` - repetition detection via DTW
+ - ❌ `sklearn` - machine-learning algorithms
+ - ❌ Complex acoustic feature extraction (MFCC, formants, etc.)
+
+ ---
+
+ ## Current Pipeline
+
+ ```
+ Audio Input
+      ↓
+ IndicWav2Vec Hindi ASR
+      ↓
+ Text Transcription
+      ↓
+ Basic Text Analysis
+      ↓
+ Results (transcript + simple stutter detection)
+ ```
+
+ ---
+
+ ## API Response Format
+
+ The simplified engine returns:
+
+ ```json
+ {
+     "actual_transcript": "transcribed text",
+     "target_transcript": "expected text (if provided)",
+     "mismatched_chars": ["timestamps of low confidence regions"],
+     "mismatch_percentage": 0.0,
+     "ctc_loss_score": 0.0,
+     "stutter_timestamps": [{"type": "repetition", "start": 0.0, "end": 0.5, ...}],
+     "total_stutter_duration": 0.0,
+     "stutter_frequency": 0.0,
+     "severity": "none|mild|moderate|severe",
+     "confidence_score": 0.8,
+     "speaking_rate_sps": 0.0,
+     "analysis_duration_seconds": 0.0,
+     "model_version": "indicwav2vec-hindi-asr-v1"
+ }
+ ```
+
+ ---
+
+ ## Dependencies
+
+ **Required:**
+ - `transformers` 4.35.0 - for the IndicWav2Vec model
+ - `torch` 2.6.0 - PyTorch backend (pinned by this commit's security fix)
+ - `librosa` ≥0.10.0 - audio loading (16kHz resampling)
+ - `numpy` - array operations
+
+ **Optional (for legacy methods, not used in ASR mode):**
+ - `parselmouth` - voice quality (not used)
+ - `fastdtw` - DTW algorithm (not used)
+ - `sklearn` - ML algorithms (not used)
+
+ ---
+
+ ## Usage
+
+ ```python
+ from diagnosis.ai_engine.detect_stuttering import get_stutter_detector
+
+ detector = get_stutter_detector()
+ result = detector.analyze_audio(
+     audio_path="path/to/audio.wav",
+     proper_transcript="expected text",  # optional
+     language="hindi"  # default: hindi
+ )
+
+ print(result['actual_transcript'])  # ASR transcription
+ ```
+
+ ---
+
+ ## Notes
+
+ - The engine focuses **only** on ASR transcription
+ - Stutter detection is simplified to text-based repetition analysis
+ - No complex acoustic feature extraction
+ - Faster and lighter than the previous multi-model approach
+ - Optimized for Hindi but can handle other Indian languages
+
Docs/TRANSCRIPT_DEBUG.md ADDED
@@ -0,0 +1,213 @@
+ # Transcript Debugging Guide
+
+ ## Issue: Empty Transcripts ("No transcript available")
+
+ ## Complete Flow Analysis
+
+ ### 1. Django App → API Request (`slaq-version-c/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 269-274
+ ```python
+ response = requests.post(
+     self.api_url,
+     files=files,
+     data={
+         "transcript": proper_transcript if proper_transcript else "",
+         "language": lang_code,
+     },
+     timeout=self.api_timeout
+ )
+ ```
+
+ **Status:** ✅ Sends the transcript parameter correctly
+
+ ---
+
+ ### 2. API Receives Request (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Lines 70-73
+ ```python
+ @app.post("/analyze")
+ async def analyze_audio(
+     audio: UploadFile = File(...),
+     transcript: str = Form("")  # ✅ Fixed: now uses Form() for multipart
+ ):
+ ```
+
+ **Status:** ✅ Fixed - now correctly receives the transcript via Form()
+
+ ---
+
+ ### 3. API Calls Model (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Line 106
+ ```python
+ result = detector.analyze_audio(temp_file, transcript)
+ ```
+
+ **Status:** ✅ Passes the transcript correctly
+
+ ---
+
+ ### 4. Model Transcribes Audio (`slaq-version-c-ai-enginee/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 313-369 (`_transcribe_with_timestamps`)
+
+ **Potential issues:**
+ - ❓ IndicWav2Vec decoding might not work with `processor.batch_decode()`
+ - ❓ May need to use the tokenizer directly
+ - ❓ The model might not be producing valid predictions
+
+ **Status:** ⚠️ **LIKELY ISSUE HERE** - the decoding method may be incorrect
+
+ ---
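For reference while debugging the decoding step above: greedy CTC decoding is just argmax token IDs with consecutive repeats collapsed and blanks dropped. A self-contained sketch; the blank ID of 0 and the toy vocab are illustrative, not the model's actual values.

```python
def ctc_greedy_decode(ids, blank_id=0, vocab=None):
    """Collapse repeated IDs, drop blanks; optionally map through a vocab."""
    out, prev = [], None
    for i in ids:
        if i != blank_id and i != prev:   # keep only changes that aren't blank
            out.append(i)
        prev = i
    return "".join(vocab[i] for i in out) if vocab else out
```

`processor.batch_decode()` (or the tokenizer's own `decode`) performs this collapse internally; logging the raw predicted IDs and comparing against a manual collapse like this helps tell a decoding problem apart from a model that emits only blanks.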
+
+ ### 5. Model Returns Result (`slaq-version-c-ai-enginee/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 787-794
+ ```python
+ actual_transcript = transcript if transcript else ""
+ target_transcript = proper_transcript if proper_transcript else transcript if transcript else ""
+
+ return {
+     'actual_transcript': actual_transcript,
+     'target_transcript': target_transcript,
+     ...
+ }
+ ```
+
+ **Status:** ✅ Returns the transcripts correctly (if the transcript is not empty)
+
+ ---
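The chained conditionals in step 5 are easy to misread; isolated as a helper they can be checked directly. `pick_transcripts` is a hypothetical name for illustration, mirroring the logic above.

```python
def pick_transcripts(transcript, proper_transcript):
    """Mirror step 5: the ASR text is the actual transcript; the target
    falls back to the ASR text when no expected transcript was provided."""
    actual = transcript if transcript else ""
    target = proper_transcript if proper_transcript else (transcript if transcript else "")
    return actual, target
```

Note that when the ASR output itself is empty, both values are empty strings, which is exactly the "No transcript available" symptom this guide is tracing.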
+
+ ### 6. API Returns Response (`slaq-version-c-ai-enginee/app.py`)
+
+ **Location:** Lines 109-113
+ ```python
+ actual = result.get('actual_transcript', '')
+ target = result.get('target_transcript', '')
+ logger.info(f"📝 Result transcripts - Actual: '{actual[:100]}' (len: {len(actual)}), Target: '{target[:100]}' (len: {len(target)})")
+ return result
+ ```
+
+ **Status:** ✅ Returns JSON with the transcripts
+
+ ---
+
+ ### 7. Django Receives Response (`slaq-version-c/diagnosis/ai_engine/detect_stuttering.py`)
+
+ **Location:** Lines 279-410
+ ```python
+ result = response.json()
+ # ... formatting ...
+ actual_transcript = str(api_result.get('actual_transcript', '')).strip()
+ target_transcript = str(api_result.get('target_transcript', '')).strip()
+ ```
+
+ **Status:** ✅ Extracts the transcripts correctly
+
+ ---
+
+ ### 8. Django Saves to Database (`slaq-version-c/diagnosis/tasks.py`)
+
+ **Location:** Lines 141-142
+ ```python
+ actual_transcript=actual_transcript,
+ target_transcript=target_transcript,
+ ```
+
+ **Status:** ✅ Saves correctly
+
+ ---
+
+ ## Root Cause Analysis
+
+ ### Most Likely Issue: Transcription Decoding
+
+ The IndicWav2Vec model (`ai4bharat/indicwav2vec-hindi`) may require:
+ 1. **Direct tokenizer access** instead of `processor.batch_decode()`
+ 2. **CTC decoding** with the proper tokenizer
+ 3. **Special handling** for Indic scripts
+
+ ### Fix Applied
+
+ Updated `_transcribe_with_timestamps()` to:
+ 1. Try multiple decoding methods
+ 2. Use the tokenizer directly if available
+ 3. Add comprehensive error logging
+ 4. Log predicted IDs for debugging
+
+ ---
+
+ ## Debugging Steps
+
+ ### 1. Check API Logs
+
+ When processing audio, look for:
+ ```
+ 📝 Transcribed text: '...' (length: X)
+ 📝 Final return - Actual: '...' (len: X), Target: '...' (len: Y)
+ 📝 Result transcripts - Actual: '...' (len: X), Target: '...' (len: Y)
+ ```
+
+ ### 2. Check Django Logs
+
+ Look for:
+ ```
+ 📝 Final transcripts - Actual: X chars, Target: Y chars
+ 📝 Saving transcripts - Actual: X chars, Target: Y chars
+ ```
+
+ ### 3. Check the Database
+
+ Query the `AnalysisResult` table:
+ ```sql
+ SELECT actual_transcript, target_transcript,
+        LENGTH(actual_transcript) AS actual_len, LENGTH(target_transcript) AS target_len
+ FROM diagnosis_analysisresult
+ ORDER BY created_at DESC LIMIT 5;
+ ```
+
+ ### 4. Test the API Directly
+
+ ```bash
+ curl -X POST "http://localhost:7860/analyze" \
+      -F "audio=@test.wav" \
+      -F "transcript=test transcript" \
+      -F "language=hin"
+ ```
+
+ Check the response JSON for `actual_transcript` and `target_transcript`.
+
+ ---
+
+ ## Next Steps
+
+ 1. **Rebuild the Docker image** with the latest changes
+ 2. **Check logs** during audio processing
+ 3. **Verify the processor structure** - the logs will show the processor's attributes
+ 4. **Test with Hindi audio** - the model is optimized for Hindi
+ 5. **Check that the model loaded correctly** - verify HF_TOKEN is working
+
+ ---
+
+ ## Expected Log Output (Success)
+
+ ```
+ 🚀 Initializing Advanced AI Engine on cpu...
+ ✅ HF_TOKEN found - using authenticated model access
+ 📋 Processor type: <class 'transformers.models.wav2vec2.processing_wav2vec2.Wav2Vec2Processor'>
+ 📋 Processor attributes: ['batch_decode', 'decode', 'feature_extractor', 'tokenizer', ...]
+ 📋 Tokenizer type: <class 'transformers.models.wav2vec2.tokenization_wav2vec2.Wav2Vec2CTCTokenizer'>
+ 📝 Transcribed text: 'नमस्ते मैं हिंदी बोल रहा हूं' (length: 25)
+ 📝 Final return - Actual: 'नमस्ते मैं हिंदी बोल रहा हूं' (len: 25), Target: '...' (len: X)
+ ```
+
+ ---
+
+ ## If Still Empty
+
+ 1. **Model may not be loaded correctly** - check HF_TOKEN
+ 2. **Audio format issue** - ensure 16kHz mono WAV
+ 3. **Model not producing predictions** - check predicted_ids in the logs
+ 4. **Tokenizer mismatch** - IndicWav2Vec may need special tokenizer initialization
+
README.md ADDED
@@ -0,0 +1,10 @@
+ ---
+ title: Zlaqa Version B Ai Enginee
+ emoji: ⚡
+ colorFrom: purple
+ colorTo: red
+ sdk: docker
+ pinned: false
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py ADDED
@@ -0,0 +1,186 @@
+ # app.py
+ import logging
+ import os
+ import sys
+ from pathlib import Path
+ from fastapi import FastAPI, UploadFile, File, Form, HTTPException
+ from fastapi.responses import JSONResponse
+ from fastapi.middleware.cors import CORSMiddleware
+ import gradio as gr
+
+ # Configure logging FIRST
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+     stream=sys.stdout
+ )
+ logger = logging.getLogger(__name__)
+
+ # Add project root to path
+ sys.path.insert(0, str(Path(__file__).parent))
+
+ # Import detector using the model loader (clean architecture)
+ try:
+     from diagnosis.ai_engine.model_loader import get_stutter_detector
+     logger.info("✅ Successfully imported model loader")
+ except ImportError as e:
+     logger.error(f"❌ Failed to import model loader: {e}")
+     raise
+
+ # Initialize FastAPI
+ app = FastAPI(
+     title="Stutter Detector API",
+     description="Speech analysis using Wav2Vec2 models for stutter detection",
+     version="1.0.0"
+ )
+
+ # Add CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # Global detector instance
+ detector = None
+
+ @app.on_event("startup")
+ async def startup_event():
+     """Load models on startup"""
+     global detector
+     try:
+         logger.info("🚀 Startup event: Loading AI models...")
+         detector = get_stutter_detector()
+         logger.info("✅ Models loaded successfully!")
+     except Exception as e:
+         logger.error(f"❌ Failed to load models: {e}", exc_info=True)
+         raise
+
+ def gradio_analyze(audio_path, transcript=""):
+     """Analyze audio for stuttering via the Gradio interface"""
+     if not detector:
+         return {"error": "Models not loaded yet. Please try again later."}
+     try:
+         result = detector.analyze_audio(audio_path, transcript)
+         return result
+     except Exception as e:
+         return {"error": f"Analysis failed: {str(e)}"}
+
+ # Create Gradio interface
+ gradio_app = gr.Interface(
+     fn=gradio_analyze,
+     inputs=[
+         gr.Audio(type="filepath", label="Upload Audio File"),
+         gr.Textbox(label="Optional Transcript", placeholder="Enter expected transcript here...", lines=2)
+     ],
+     outputs=gr.JSON(label="Analysis Results"),
+     title="Stutter Detection",
+     description="Upload an audio file and optionally provide a transcript to analyze for stuttering."
+ )
+
+ # Mount Gradio app to FastAPI
+ gr.mount_gradio_app(app, gradio_app, path="/gradio")
+
+ @app.get("/health")
+ async def health_check():
+     """Health check endpoint"""
+     from datetime import datetime
+     return {
+         "status": "healthy",
+         "models_loaded": detector is not None,
+         "timestamp": datetime.utcnow().isoformat() + "Z"
+     }
+
+ @app.post("/analyze")
+ async def analyze_audio(
+     audio: UploadFile = File(...),
+     transcript: str = Form("")
+ ):
+     """
+     Analyze an audio file for stuttering
+
+     Parameters:
+     - audio: WAV or MP3 audio file
+     - transcript: Optional expected transcript
+
+     Returns: Complete stutter analysis results
+     """
+     temp_file = None
+     try:
+         if not detector:
+             raise HTTPException(status_code=503, detail="Models not loaded yet. Try again in a moment.")
+
+         logger.info(f"📥 Processing: {audio.filename}")
+
+         # Create temp directory if needed
+         temp_dir = "/tmp/stutter_analysis"
+         os.makedirs(temp_dir, exist_ok=True)
+
+         # Save the uploaded file
+         temp_file = os.path.join(temp_dir, audio.filename)
+         content = await audio.read()
+
+         with open(temp_file, "wb") as f:
+             f.write(content)
+
+         logger.info(f"📂 Saved to: {temp_file} ({len(content) / 1024 / 1024:.2f} MB)")
+
+         # Analyze
+         logger.info(f"🔄 Analyzing audio with transcript: '{transcript[:50] if transcript else '(empty)'}...'")
+         result = detector.analyze_audio(temp_file, transcript)
+
+         # Log the transcript values from the result
+         actual = result.get('actual_transcript', '')
+         target = result.get('target_transcript', '')
+         logger.info(f"✅ Analysis complete: severity={result['severity']}, mismatch={result['mismatch_percentage']}%")
+         logger.info(f"📝 Result transcripts - Actual: '{actual[:100]}' (len: {len(actual)}), Target: '{target[:100]}' (len: {len(target)})")
+         return result
+
+     except HTTPException:
+         raise
+     except Exception as e:
+         logger.error(f"❌ Error during analysis: {str(e)}", exc_info=True)
+         raise HTTPException(status_code=500, detail=f"Analysis failed: {str(e)}")
+
+     finally:
+         # Cleanup
+         if temp_file and os.path.exists(temp_file):
+             try:
+                 os.remove(temp_file)
+                 logger.info(f"🧹 Cleaned up: {temp_file}")
+             except Exception as e:
+                 logger.warning(f"Could not clean up {temp_file}: {e}")
+
+ @app.get("/")
+ async def root():
+     """API documentation"""
+     return {
+         "name": "SLAQ Stutter Detector API",
+         "version": "1.0.0",
+         "status": "running",
+         "endpoints": {
+             "health": "GET /health",
+             "analyze": "POST /analyze (multipart: audio file + optional transcript field)",
+             "docs": "GET /docs (interactive API docs)",
+             "gradio": "GET /gradio (web UI for stutter detection)"
+         },
+         "models": {
+             "asr": "ai4bharat/indicwav2vec-hindi"
+         }
+     }
+
+ if __name__ == "__main__":
+     import uvicorn
+     logger.info("🚀 Starting SLAQ Stutter Detector API...")
+     uvicorn.run(
+         app,
+         host="0.0.0.0",
+         port=7860,
+         log_level="info"
+     )
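A client for the `/analyze` endpoint above needs a multipart request with an `audio` file part and a `transcript` form field. A hedged sketch: `build_analyze_request` is a hypothetical helper that only assembles the payload (the Django caller also sends a `language` field), leaving the actual `requests.post(...)` to the caller so no running server is assumed.

```python
# Hypothetical client-side helper for POST /analyze.
def build_analyze_request(audio_bytes, filename, transcript="", language="hin"):
    """Assemble the multipart parts the /analyze endpoint expects."""
    files = {"audio": (filename, audio_bytes, "audio/wav")}
    data = {"transcript": transcript, "language": language}
    return files, data

# Usage (not executed here):
#   files, data = build_analyze_request(open("test.wav", "rb").read(), "test.wav")
#   requests.post("http://localhost:7860/analyze", files=files, data=data)
```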
diagnosis/ai_engine/detect_stuttering.py ADDED
@@ -0,0 +1,824 @@
+ # diagnosis/ai_engine/detect_stuttering.py
+ import os
+ import librosa
+ import torch
+ import logging
+ import numpy as np
+ from transformers import Wav2Vec2ForCTC, AutoProcessor
+ import time
+ from dataclasses import dataclass, field
+ from typing import List, Dict, Any, Tuple
+ # Simplified: only ASR transcription is used; the complex signal-processing libraries were removed
+
+ logger = logging.getLogger(__name__)
+
+ # === CONFIGURATION ===
+ MODEL_ID = "ai4bharat/indicwav2vec-hindi"  # Only model used - IndicWav2Vec Hindi for ASR
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ HF_TOKEN = os.getenv("HF_TOKEN")  # Hugging Face token for authenticated model access
+
+ INDIAN_LANGUAGES = {
+     'hindi': 'hin', 'english': 'eng', 'tamil': 'tam', 'telugu': 'tel',
+     'bengali': 'ben', 'marathi': 'mar', 'gujarati': 'guj', 'kannada': 'kan',
+     'malayalam': 'mal', 'punjabi': 'pan', 'urdu': 'urd', 'assamese': 'asm',
+     'odia': 'ory', 'bhojpuri': 'bho', 'maithili': 'mai'
+ }
+
+ # === RESEARCH-BASED THRESHOLDS (2024-2025 Literature) ===
+ # Prolongation detection (spectral correlation + duration)
+ PROLONGATION_CORRELATION_THRESHOLD = 0.90  # >0.9 spectral similarity
+ PROLONGATION_MIN_DURATION = 0.25           # >250ms (Revisiting Rule-Based, 2025)
+
+ # Block detection (silence analysis)
+ BLOCK_SILENCE_THRESHOLD = 0.35  # >350ms silence mid-utterance
+ BLOCK_ENERGY_PERCENTILE = 10    # Bottom 10% energy = silence
+
+ # Repetition detection (DTW + text matching)
+ REPETITION_DTW_THRESHOLD = 0.15   # Normalized DTW distance
+ REPETITION_MIN_SIMILARITY = 0.85  # Text-based similarity
+
+ # Speaking-rate norms (syllables/second)
+ SPEECH_RATE_MIN = 2.0
+ SPEECH_RATE_MAX = 6.0
+ SPEECH_RATE_TYPICAL = 4.0
+
+ # Formant analysis (vowel centralization - research finding)
+ # People who stutter show a reduced vowel space area
+ VOWEL_SPACE_REDUCTION_THRESHOLD = 0.70  # 70% of typical area
+
+ # Voice quality (jitter, shimmer, HNR)
+ JITTER_THRESHOLD = 0.01   # >1% jitter indicates instability
+ SHIMMER_THRESHOLD = 0.03  # >3% shimmer
+ HNR_THRESHOLD = 15.0      # <15 dB harmonics-to-noise ratio
+
+ # Zero-crossing rate (voiced/unvoiced discrimination)
+ ZCR_VOICED_THRESHOLD = 0.1    # Low ZCR = voiced
+ ZCR_UNVOICED_THRESHOLD = 0.3  # High ZCR = unvoiced
+
+ # Entropy-based uncertainty
+ ENTROPY_HIGH_THRESHOLD = 3.5     # High confusion in model predictions
+ CONFIDENCE_LOW_THRESHOLD = 0.40  # Low-confidence frame threshold
+
+ @dataclass
+ class StutterEvent:
+     """Enhanced stutter event with multi-modal features"""
+     type: str  # 'repetition', 'prolongation', 'block', 'dysfluency'
+     start: float
+     end: float
+     text: str
+     confidence: float
+     acoustic_features: Dict[str, float] = field(default_factory=dict)
+     voice_quality: Dict[str, float] = field(default_factory=dict)
+     formant_data: Dict[str, Any] = field(default_factory=dict)
+
+
+ class AdvancedStutterDetector:
+     """
+     🎤 IndicWav2Vec Hindi ASR Engine
+
+     Simplified engine using ONLY ai4bharat/indicwav2vec-hindi for Automatic Speech Recognition.
+
+     Features:
+     - Speech-to-text transcription using the IndicWav2Vec Hindi model
+     - Text-based stutter analysis from the transcription
+     - Confidence scoring from model predictions
+     - Basic dysfluency detection from transcript patterns
+
+     Model: ai4bharat/indicwav2vec-hindi (Wav2Vec2ForCTC)
+     Purpose: Automatic Speech Recognition (ASR) for Hindi and Indian languages
+     """
+
+     def __init__(self):
+         logger.info(f"🚀 Initializing Advanced AI Engine on {DEVICE}...")
+         if HF_TOKEN:
+             logger.info("✅ HF_TOKEN found - using authenticated model access")
+         else:
+             logger.warning("⚠️ HF_TOKEN not found - model access may fail if authentication is required")
+         try:
+             # Wav2Vec2 model loading - IndicWav2Vec Hindi
+             self.processor = AutoProcessor.from_pretrained(
+                 MODEL_ID,
+                 token=HF_TOKEN
+             )
+             self.model = Wav2Vec2ForCTC.from_pretrained(
+                 MODEL_ID,
+                 token=HF_TOKEN,
+                 torch_dtype=torch.float16 if DEVICE == "cuda" else torch.float32
+             ).to(DEVICE)
+             self.model.eval()
+
+             # Initialize the feature extractor (clean architecture pattern)
+             from .features import ASRFeatureExtractor
+             self.feature_extractor = ASRFeatureExtractor(
+                 model=self.model,
+                 processor=self.processor,
+                 device=DEVICE
+             )
+
+             # Debug: log the processor structure
+             logger.info(f"📋 Processor type: {type(self.processor)}")
+             if hasattr(self.processor, 'tokenizer'):
+                 logger.info(f"📋 Tokenizer type: {type(self.processor.tokenizer)}")
+             if hasattr(self.processor, 'feature_extractor'):
+                 logger.info(f"📋 Feature extractor type: {type(self.processor.feature_extractor)}")
+
+             logger.info("✅ IndicWav2Vec Hindi ASR Engine loaded with feature extractor")
+         except Exception as e:
+             logger.error(f"🔥 Engine failure: {e}")
+             raise
+
+     def _init_common_adapters(self):
+         """Not applicable - IndicWav2Vec Hindi doesn't use adapters"""
+         pass
+
+     def _activate_adapter(self, lang_code: str):
+         """Not applicable - IndicWav2Vec Hindi doesn't use adapters"""
+         logger.info("Using IndicWav2Vec Hindi model (optimized for Hindi)")
+
+     # ===== LEGACY METHODS (NOT USED IN ASR-ONLY MODE) =====
+     # These methods are kept for reference but are not called in the simplified ASR pipeline.
+     # They require additional libraries (parselmouth, fastdtw, sklearn) that are not needed for ASR-only mode.
+
+     def _extract_comprehensive_features(self, audio: np.ndarray, sr: int, audio_path: str) -> Dict[str, Any]:
+         """Extract multi-modal acoustic features"""
+         features = {}
+
+         # MFCC (20 coefficients)
+         mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20, hop_length=512)
+         features['mfcc'] = mfcc.T  # Transpose to time x features
+
+         # Zero-crossing rate
+         zcr = librosa.feature.zero_crossing_rate(audio, hop_length=512)[0]
+         features['zcr'] = zcr
+
+         # RMS energy
+         rms_energy = librosa.feature.rms(y=audio, hop_length=512)[0]
+         features['rms_energy'] = rms_energy
+
+         # Spectral flux
+         stft = librosa.stft(audio, hop_length=512)
+         magnitude = np.abs(stft)
+         spectral_flux = np.sum(np.diff(magnitude, axis=1) * (np.diff(magnitude, axis=1) > 0), axis=0)
+         features['spectral_flux'] = spectral_flux
+
+         # Energy entropy
+         frame_energy = np.sum(magnitude ** 2, axis=0)
+         frame_energy = frame_energy + 1e-10  # Avoid log(0)
+         energy_entropy = -np.sum((magnitude ** 2 / frame_energy) * np.log(magnitude ** 2 / frame_energy + 1e-10), axis=0)
+         features['energy_entropy'] = energy_entropy
+
+         # Formant analysis using Parselmouth
+         try:
+             sound = parselmouth.Sound(audio_path)
+             formant = sound.to_formant_burg(time_step=0.01)
+             times = np.arange(0, sound.duration, 0.01)
+             f1, f2, f3, f4 = [], [], [], []
+
+             for t in times:
+                 try:
+                     f1.append(formant.get_value_at_time(1, t) if formant.get_value_at_time(1, t) > 0 else np.nan)
+                     f2.append(formant.get_value_at_time(2, t) if formant.get_value_at_time(2, t) > 0 else np.nan)
+                     f3.append(formant.get_value_at_time(3, t) if formant.get_value_at_time(3, t) > 0 else np.nan)
183
+ f4.append(formant.get_value_at_time(4, t) if formant.get_value_at_time(4, t) > 0 else np.nan)
184
+ except Exception:
185
+ f1.append(np.nan)
186
+ f2.append(np.nan)
187
+ f3.append(np.nan)
188
+ f4.append(np.nan)
189
+
190
+ formants = np.array([f1, f2, f3, f4]).T
191
+ features['formants'] = formants
192
+
193
+ # Calculate vowel space area (F1-F2 plane)
194
+ valid_f1f2 = formants[~np.isnan(formants[:, 0]) & ~np.isnan(formants[:, 1]), :2]
195
+ if len(valid_f1f2) > 0:
196
+ # Convex hull area approximation
197
+ try:
198
+ hull = ConvexHull(valid_f1f2)
199
+ vowel_space_area = hull.volume
200
+ except Exception:
201
+ vowel_space_area = np.nan
202
+ else:
203
+ vowel_space_area = np.nan
204
+
205
+ features['formant_summary'] = {
206
+ 'vowel_space_area': float(vowel_space_area) if not np.isnan(vowel_space_area) else 0.0,
207
+ 'f1_mean': float(np.nanmean(f1)) if len(f1) > 0 else 0.0,
208
+ 'f2_mean': float(np.nanmean(f2)) if len(f2) > 0 else 0.0,
209
+ 'f1_std': float(np.nanstd(f1)) if len(f1) > 0 else 0.0,
210
+ 'f2_std': float(np.nanstd(f2)) if len(f2) > 0 else 0.0
211
+ }
212
+ except Exception as e:
213
+ logger.warning(f"Formant analysis failed: {e}")
214
+ features['formants'] = np.zeros((len(audio) // 100, 4))
215
+ features['formant_summary'] = {
216
+ 'vowel_space_area': 0.0,
217
+ 'f1_mean': 0.0, 'f2_mean': 0.0,
218
+ 'f1_std': 0.0, 'f2_std': 0.0
219
+ }
220
+
221
+ # Voice Quality Metrics (Jitter, Shimmer, HNR)
222
+ try:
223
+ sound = parselmouth.Sound(audio_path)
224
+ pitch = sound.to_pitch()
225
+ point_process = parselmouth.praat.call([sound, pitch], "To PointProcess")
226
+
227
+ jitter = parselmouth.praat.call(point_process, "Get jitter (local)", 0.0, 0.0, 1.1, 1.6, 1.3, 1.6)
228
+ shimmer = parselmouth.praat.call([sound, point_process], "Get shimmer (local)", 0.0, 0.0, 0.0001, 0.02, 1.3, 1.6)
229
+ hnr = parselmouth.praat.call(sound, "Get harmonicity (cc)", 0.0, 0.0, 0.01, 1.5, 1.0, 0.1, 1.0)
230
+
231
+ features['voice_quality'] = {
232
+ 'jitter': float(jitter) if jitter is not None else 0.0,
233
+ 'shimmer': float(shimmer) if shimmer is not None else 0.0,
234
+ 'hnr_db': float(hnr) if hnr is not None else 20.0
235
+ }
236
+ except Exception as e:
237
+ logger.warning(f"Voice quality analysis failed: {e}")
238
+ features['voice_quality'] = {
239
+ 'jitter': 0.0,
240
+ 'shimmer': 0.0,
241
+ 'hnr_db': 20.0
242
+ }
243
+
244
+ return features
245
+
246
+ def _transcribe_with_timestamps(self, audio: np.ndarray) -> Tuple[str, List[Dict], torch.Tensor]:
247
+ """
248
+ Transcribe audio and return word timestamps and logits.
249
+
250
+ Uses the feature extractor for clean separation of concerns.
251
+ """
252
+ try:
253
+ # Use feature extractor for transcription (clean architecture)
254
+ features = self.feature_extractor.get_transcription_features(audio, sample_rate=16000)
255
+ transcript = features['transcript']
256
+ logits = torch.from_numpy(features['logits'])
257
+
258
+ # Get word-level features for timestamps
259
+ word_features = self.feature_extractor.get_word_level_features(audio, sample_rate=16000)
260
+ word_timestamps = word_features['word_timestamps']
261
+
262
+ logger.info(f"📝 Transcription via feature extractor: '{transcript}' (length: {len(transcript)}, words: {len(word_timestamps)})")
263
+
264
+ return transcript, word_timestamps, logits
265
+ except Exception as e:
266
+ logger.error(f"❌ Transcription failed: {e}", exc_info=True)
267
+ return "", [], torch.zeros((1, 100, 32)) # Dummy return
268
+
269
+ def _calculate_uncertainty(self, logits: torch.Tensor) -> Tuple[float, List[Dict]]:
270
+ """Calculate entropy-based uncertainty and low-confidence regions"""
271
+ try:
272
+ probs = torch.softmax(logits, dim=-1)
273
+ entropy = -torch.sum(probs * torch.log(probs + 1e-10), dim=-1)
274
+ entropy_mean = float(torch.mean(entropy).item())
275
+
276
+ # Find low-confidence regions
277
+ frame_duration = 0.02
278
+ low_conf_regions = []
279
+ confidence = torch.max(probs, dim=-1)[0]
280
+
281
+ for i in range(confidence.shape[1]):
282
+ conf = float(confidence[0, i].item())
283
+ if conf < CONFIDENCE_LOW_THRESHOLD:
284
+ low_conf_regions.append({
285
+ 'time': i * frame_duration,
286
+ 'confidence': conf
287
+ })
288
+
289
+ return entropy_mean, low_conf_regions
290
+ except Exception as e:
291
+ logger.warning(f"Uncertainty calculation failed: {e}")
292
+ return 0.0, []
293
+
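The entropy and low-confidence scan above can be illustrated without torch. Below is a minimal numpy sketch of the same softmax → entropy → low-confidence-frame computation; the `0.5` threshold and 20 ms frame step stand in for `CONFIDENCE_LOW_THRESHOLD` and the hard-coded `frame_duration`:

```python
import numpy as np

def entropy_uncertainty(logits, conf_threshold=0.5, frame_duration=0.02):
    """Numpy mirror of _calculate_uncertainty: mean entropy + low-confidence frames."""
    # Softmax over the vocabulary axis (subtract max for numerical stability)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-10)).sum(axis=-1)
    confidence = probs.max(axis=-1)
    low_conf = [
        {"time": i * frame_duration, "confidence": float(c)}
        for i, c in enumerate(confidence[0])
        if c < conf_threshold
    ]
    return float(entropy.mean()), low_conf

# Frame 1 is sharply peaked (high confidence); frame 2 is uniform (low confidence).
logits = np.array([[[10.0, 0.0, 0.0], [0.0, 0.0, 0.0]]])
mean_entropy, regions = entropy_uncertainty(logits)
```

A peaked distribution contributes near-zero entropy, while the uniform frame is flagged as a low-confidence region at `1 * frame_duration = 0.02 s`.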
294
+ def _estimate_speaking_rate(self, audio: np.ndarray, sr: int) -> float:
295
+ """Estimate speaking rate in syllables per second"""
296
+ try:
297
+ # Simple syllable estimation using energy peaks
298
+ rms = librosa.feature.rms(y=audio, hop_length=512)[0]
299
+ peaks, _ = librosa.util.peak_pick(rms, pre_max=3, post_max=3, pre_avg=3, post_avg=5, delta=0.1, wait=10)
300
+
301
+ duration = len(audio) / sr
302
+ num_syllables = len(peaks)
303
+ speaking_rate = num_syllables / duration if duration > 0 else SPEECH_RATE_TYPICAL
304
+
305
+ return max(SPEECH_RATE_MIN, min(SPEECH_RATE_MAX, speaking_rate))
306
+ except Exception as e:
307
+ logger.warning(f"Speaking rate estimation failed: {e}")
308
+ return SPEECH_RATE_TYPICAL
309
+
310
+ def _detect_prolongations_advanced(self, mfcc: np.ndarray, spectral_flux: np.ndarray,
311
+ speaking_rate: float, word_timestamps: List[Dict]) -> List[StutterEvent]:
312
+ """Detect prolongations using spectral correlation"""
313
+ events = []
314
+ frame_duration = 0.02
315
+
316
+ # Adaptive threshold based on speaking rate
317
+ min_duration = PROLONGATION_MIN_DURATION * (SPEECH_RATE_TYPICAL / max(speaking_rate, 0.1))
318
+
319
+ window_size = int(min_duration / frame_duration)
320
+ if window_size < 2:
321
+ return events
322
+
323
+ for i in range(len(mfcc) - window_size):
324
+ window = mfcc[i:i+window_size]
325
+
326
+ # Calculate spectral correlation
327
+ if len(window) > 1:
328
+ corr_matrix = np.corrcoef(window.T)
329
+ avg_correlation = np.mean(corr_matrix[np.triu_indices_from(corr_matrix, k=1)])
330
+
331
+ if avg_correlation > PROLONGATION_CORRELATION_THRESHOLD:
332
+ start_time = i * frame_duration
333
+ end_time = (i + window_size) * frame_duration
334
+
335
+ # Check if within a word boundary
336
+ for word_ts in word_timestamps:
337
+ if word_ts['start'] <= start_time <= word_ts['end']:
338
+ events.append(StutterEvent(
339
+ type='prolongation',
340
+ start=start_time,
341
+ end=end_time,
342
+ text=word_ts.get('word', ''),
343
+ confidence=float(avg_correlation),
344
+ acoustic_features={
345
+ 'spectral_correlation': float(avg_correlation),
346
+ 'duration': end_time - start_time
347
+ }
348
+ ))
349
+ break
350
+
351
+ return events
352
+
353
+ def _detect_blocks_enhanced(self, audio: np.ndarray, sr: int, rms_energy: np.ndarray,
354
+ zcr: np.ndarray, word_timestamps: List[Dict],
355
+ speaking_rate: float) -> List[StutterEvent]:
356
+ """Detect blocks using silence analysis"""
357
+ events = []
358
+ frame_duration = 0.02
359
+
360
+ # Adaptive threshold
361
+ silence_threshold = BLOCK_SILENCE_THRESHOLD * (SPEECH_RATE_TYPICAL / max(speaking_rate, 0.1))
362
+ energy_threshold = np.percentile(rms_energy, BLOCK_ENERGY_PERCENTILE)
363
+
364
+ in_silence = False
365
+ silence_start = 0
366
+
367
+ for i, energy in enumerate(rms_energy):
368
+ is_silent = energy < energy_threshold and zcr[i] < ZCR_VOICED_THRESHOLD
369
+
370
+ if is_silent and not in_silence:
371
+ silence_start = i * frame_duration
372
+ in_silence = True
373
+ elif not is_silent and in_silence:
374
+ silence_duration = (i * frame_duration) - silence_start
375
+ if silence_duration > silence_threshold:
376
+ # Check if mid-utterance (not at start/end)
377
+ audio_duration = len(audio) / sr
378
+ if silence_start > 0.1 and silence_start < audio_duration - 0.1:
379
+ events.append(StutterEvent(
380
+ type='block',
381
+ start=silence_start,
382
+ end=i * frame_duration,
383
+ text="<silence>",
384
+ confidence=0.8,
385
+ acoustic_features={
386
+ 'silence_duration': silence_duration,
387
+ 'energy_level': float(energy)
388
+ }
389
+ ))
390
+ in_silence = False
391
+
392
+ return events
393
+
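The silence-run scan in `_detect_blocks_enhanced` reduces to a small state machine. The sketch below isolates that loop on synthetic energy/ZCR arrays; the thresholds are placeholders for `BLOCK_SILENCE_THRESHOLD`, the energy percentile, and `ZCR_VOICED_THRESHOLD`:

```python
import numpy as np

def find_silent_blocks(rms_energy, zcr, energy_threshold, zcr_threshold=0.1,
                       min_silence=0.1, frame_duration=0.02):
    """Sketch of the silence-run scan in _detect_blocks_enhanced (thresholds assumed)."""
    blocks, in_silence, start = [], False, 0.0
    for i, energy in enumerate(rms_energy):
        silent = energy < energy_threshold and zcr[i] < zcr_threshold
        if silent and not in_silence:
            start, in_silence = i * frame_duration, True
        elif not silent and in_silence:
            duration = i * frame_duration - start
            if duration > min_silence:          # only runs longer than the floor count
                blocks.append((start, i * frame_duration))
            in_silence = False
    return blocks

# 20 frames: 5 loud, 10 silent (0.2 s), 5 loud -> one block from 0.1 s to 0.3 s.
energy = np.array([1.0] * 5 + [0.01] * 10 + [1.0] * 5)
zcr = np.zeros(20)
blocks = find_silent_blocks(energy, zcr, energy_threshold=0.1)
```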
394
+ def _detect_repetitions_advanced(self, mfcc: np.ndarray, formants: np.ndarray,
395
+ word_timestamps: List[Dict], transcript: str,
396
+ speaking_rate: float) -> List[StutterEvent]:
397
+ """Detect repetitions using DTW and text matching"""
398
+ events = []
399
+
400
+ if len(word_timestamps) < 2:
401
+ return events
402
+
403
+ # Text-based repetition detection
404
+ words = transcript.lower().split()
405
+ for i in range(len(words) - 1):
406
+ if words[i] == words[i+1]:
407
+ # Find corresponding timestamps
408
+ if i < len(word_timestamps) and i+1 < len(word_timestamps):
409
+ start = word_timestamps[i]['start']
410
+ end = word_timestamps[i+1]['end']
411
+
412
+ # DTW verification on MFCC
413
+ start_frame = int(start / 0.02)
414
+ mid_frame = int((start + end) / 2 / 0.02)
415
+ end_frame = int(end / 0.02)
416
+
417
+ if start_frame < len(mfcc) and end_frame < len(mfcc):
418
+ segment1 = mfcc[start_frame:mid_frame]
419
+ segment2 = mfcc[mid_frame:end_frame]
420
+
421
+ if len(segment1) > 0 and len(segment2) > 0:
422
+ try:
423
+ distance, _ = fastdtw(segment1, segment2)
424
+ normalized_distance = distance / max(len(segment1), len(segment2))
425
+
426
+ if normalized_distance < REPETITION_DTW_THRESHOLD:
427
+ events.append(StutterEvent(
428
+ type='repetition',
429
+ start=start,
430
+ end=end,
431
+ text=words[i],
432
+ confidence=1.0 - normalized_distance,
433
+ acoustic_features={
434
+ 'dtw_distance': float(normalized_distance),
435
+ 'repetition_count': 2
436
+ }
437
+ ))
438
+ except Exception:
439
+ pass
440
+
441
+ return events
442
+
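The DTW verification step can be illustrated without the `fastdtw` dependency. The sketch below is a plain dynamic-programming DTW (not the library call) with Euclidean frame cost; it shows why a genuine repetition — two near-identical MFCC segments — yields a small normalized distance:

```python
import numpy as np

def dtw_distance(a, b):
    """O(n*m) DTW with Euclidean frame cost; illustrative stand-in for fastdtw."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

seg = np.random.default_rng(0).normal(size=(10, 20))  # 10 frames of 20-dim MFCC-like features
identical = dtw_distance(seg, seg) / 10               # repeated segment: distance 0
different = dtw_distance(seg, seg[::-1]) / 10         # unrelated segment: larger distance
```

An exact repetition aligns along the diagonal at zero cost, so `normalized_distance < REPETITION_DTW_THRESHOLD` holds trivially; dissimilar segments do not.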
443
+ def _detect_voice_quality_issues(self, audio_path: str, word_timestamps: List[Dict],
444
+ voice_quality: Dict[str, float]) -> List[StutterEvent]:
445
+ """Detect dysfluencies based on voice quality metrics"""
446
+ events = []
447
+
448
+ # Global voice quality issues
449
+ if voice_quality.get('jitter', 0) > JITTER_THRESHOLD or \
450
+ voice_quality.get('shimmer', 0) > SHIMMER_THRESHOLD or \
451
+ voice_quality.get('hnr_db', 20) < HNR_THRESHOLD:
452
+
453
+ # Mark regions with poor voice quality
454
+ for word_ts in word_timestamps:
455
+ if word_ts.get('start', 0) > 0: # Skip first word
456
+ events.append(StutterEvent(
457
+ type='dysfluency',
458
+ start=word_ts['start'],
459
+ end=word_ts['end'],
460
+ text=word_ts.get('word', ''),
461
+ confidence=0.6,
462
+ voice_quality=voice_quality.copy()
463
+ ))
464
+ break # Only mark first occurrence
465
+
466
+ return events
467
+
468
+ def _is_overlapping(self, time: float, events: List[StutterEvent], threshold: float = 0.1) -> bool:
469
+ """Check if time overlaps with existing events"""
470
+ for event in events:
471
+ if event.start - threshold <= time <= event.end + threshold:
472
+ return True
473
+ return False
474
+
475
+ def _detect_anomalies(self, events: List[StutterEvent], features: Dict[str, Any]) -> List[StutterEvent]:
476
+ """Use Isolation Forest to filter anomalous events"""
477
+ if len(events) == 0:
478
+ return events
479
+
480
+ try:
481
+ # Extract features for anomaly detection
482
+ X = []
483
+ for event in events:
484
+ feat_vec = [
485
+ event.end - event.start, # Duration
486
+ event.confidence,
487
+ features.get('voice_quality', {}).get('jitter', 0),
488
+ features.get('voice_quality', {}).get('shimmer', 0)
489
+ ]
490
+ X.append(feat_vec)
491
+
492
+ X = np.array(X)
493
+ if len(X) > 1:
494
+ self.anomaly_detector.fit(X)
495
+ predictions = self.anomaly_detector.predict(X)
496
+
497
+ # Keep only non-anomalous events (predictions == 1)
498
+ filtered_events = [events[i] for i, pred in enumerate(predictions) if pred == 1]
499
+ return filtered_events
500
+ except Exception as e:
501
+ logger.warning(f"Anomaly detection failed: {e}")
502
+
503
+ return events
504
+
505
+ def _deduplicate_events_cascade(self, events: List[StutterEvent]) -> List[StutterEvent]:
506
+ """Remove overlapping events with priority: Block > Repetition > Prolongation > Dysfluency"""
507
+ if len(events) == 0:
508
+ return events
509
+
510
+ # Sort by priority and start time
511
+ priority = {'block': 4, 'repetition': 3, 'prolongation': 2, 'dysfluency': 1}
512
+ events.sort(key=lambda e: (priority.get(e.type, 0), e.start), reverse=True)
513
+
514
+ cleaned = []
515
+ for event in events:
516
+ overlap = False
517
+ for existing in cleaned:
518
+ # Check overlap
519
+ if not (event.end < existing.start or event.start > existing.end):
520
+ overlap = True
521
+ break
522
+
523
+ if not overlap:
524
+ cleaned.append(event)
525
+
526
+ # Sort by start time
527
+ cleaned.sort(key=lambda e: e.start)
528
+ return cleaned
529
+
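The cascade dedup above can be demonstrated on bare `(type, start, end)` tuples. This sketch mirrors `_deduplicate_events_cascade`: sort by priority, keep an event only if it overlaps nothing already kept, then re-sort by start time:

```python
PRIORITY = {"block": 4, "repetition": 3, "prolongation": 2, "dysfluency": 1}

def dedup(events):
    """Keep highest-priority events first; drop anything overlapping a kept event."""
    events = sorted(events, key=lambda e: (PRIORITY.get(e[0], 0), e[1]), reverse=True)
    kept = []
    for ev in events:
        # Intervals (s1, e1) and (s2, e2) overlap unless e1 < s2 or s1 > e2
        if all(ev[2] < k[1] or ev[1] > k[2] for k in kept):
            kept.append(ev)
    return sorted(kept, key=lambda e: e[1])

events = [("prolongation", 1.0, 1.5), ("block", 1.2, 1.8), ("repetition", 3.0, 3.4)]
cleaned = dedup(events)
```

The prolongation overlaps the higher-priority block, so only the block and the non-overlapping repetition survive.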
530
+ def _calculate_clinical_metrics(self, events: List[StutterEvent], duration: float,
531
+ speaking_rate: float, features: Dict[str, Any]) -> Dict[str, Any]:
532
+ """Calculate comprehensive clinical metrics"""
533
+ total_duration = sum(e.end - e.start for e in events)
534
+ frequency = (len(events) / duration * 60) if duration > 0 else 0
535
+
536
+ # Calculate severity score (0-100)
537
+ stutter_percentage = (total_duration / duration * 100) if duration > 0 else 0
538
+ frequency_score = min(frequency / 10 * 100, 100) # Normalize to 100
539
+ severity_score = (stutter_percentage * 0.6 + frequency_score * 0.4)
540
+
541
+ # Determine severity label
542
+ if severity_score < 10:
543
+ severity_label = 'none'
544
+ elif severity_score < 25:
545
+ severity_label = 'mild'
546
+ elif severity_score < 50:
547
+ severity_label = 'moderate'
548
+ else:
549
+ severity_label = 'severe'
550
+
551
+ # Calculate confidence based on multiple factors
552
+ voice_quality = features.get('voice_quality', {})
553
+ confidence = 0.8 # Base confidence
554
+
555
+ # Adjust based on voice quality metrics
556
+ if voice_quality.get('jitter', 0) > JITTER_THRESHOLD:
557
+ confidence -= 0.1
558
+ if voice_quality.get('shimmer', 0) > SHIMMER_THRESHOLD:
559
+ confidence -= 0.1
560
+ if voice_quality.get('hnr_db', 20) < HNR_THRESHOLD:
561
+ confidence -= 0.1
562
+
563
+ confidence = max(0.3, min(1.0, confidence))
564
+
565
+ return {
566
+ 'total_duration': round(total_duration, 2),
567
+ 'frequency': round(frequency, 2),
568
+ 'severity_score': round(severity_score, 2),
569
+ 'severity_label': severity_label,
570
+ 'confidence': round(confidence, 2)
571
+ }
572
+
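The severity arithmetic in `_calculate_clinical_metrics` — 60% weight on stuttered-time percentage, 40% on a frequency score capped at 100 — works through like this (a self-contained restatement of the formula above, not a new metric):

```python
def severity(total_stutter_duration, num_events, audio_duration):
    """Severity score as above: 0.6 * stutter-time % + 0.4 * capped frequency score."""
    stutter_pct = total_stutter_duration / audio_duration * 100
    frequency = num_events / audio_duration * 60           # events per minute
    frequency_score = min(frequency / 10 * 100, 100)       # 10+ events/min saturates at 100
    score = stutter_pct * 0.6 + frequency_score * 0.4
    if score < 10:
        label = "none"
    elif score < 25:
        label = "mild"
    elif score < 50:
        label = "moderate"
    else:
        label = "severe"
    return round(score, 2), label

# 3 s of stuttering and 4 events in a 60 s recording:
score, label = severity(total_stutter_duration=3.0, num_events=4, audio_duration=60.0)
```

Here `stutter_pct = 5`, `frequency_score = 40`, so `score = 3 + 16 = 19`, which falls in the "mild" band.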
573
+ def _event_to_dict(self, event: StutterEvent) -> Dict[str, Any]:
574
+ """Convert StutterEvent to dictionary"""
575
+ return {
576
+ 'type': event.type,
577
+ 'start': round(event.start, 2),
578
+ 'end': round(event.end, 2),
579
+ 'text': event.text,
580
+ 'confidence': round(event.confidence, 2),
581
+ 'acoustic_features': event.acoustic_features,
582
+ 'voice_quality': event.voice_quality,
583
+ 'formant_data': event.formant_data
584
+ }
585
+
586
+
587
+ def analyze_audio(self, audio_path: str, proper_transcript: str = "", language: str = 'hindi') -> dict:
588
+ """
589
+ Main ASR analysis pipeline using IndicWav2Vec Hindi model
590
+
591
+ Focus: Automatic Speech Recognition (ASR) transcription only
592
+ """
593
+ start_time = time.time()
594
+
595
+ # === STEP 1: Audio Loading & Preprocessing ===
596
+ audio, sr = librosa.load(audio_path, sr=16000)
597
+ duration = librosa.get_duration(y=audio, sr=sr)
598
+
599
+ # === STEP 2: ASR Transcription using IndicWav2Vec Hindi ===
600
+ transcript, word_timestamps, logits = self._transcribe_with_timestamps(audio)
601
+ logger.info(f"📝 ASR Transcription: '{transcript}' (length: {len(transcript)}, words: {len(word_timestamps)})")
602
+
603
+ # === STEP 3: Calculate Confidence from Model Predictions ===
604
+ entropy_score, low_conf_regions = self._calculate_uncertainty(logits)
605
+ avg_confidence = 1.0 - (entropy_score / 10.0) if entropy_score > 0 else 0.8
606
+ avg_confidence = max(0.0, min(1.0, avg_confidence))
607
+
608
+ # === STEP 4: Basic Text-based Analysis ===
609
+ # Simple text-based stutter detection (repetitions, hesitations)
610
+ events = []
611
+ if transcript:
612
+ words = transcript.split()
613
+ # Detect word repetitions
614
+ for i in range(len(words) - 1):
615
+ if words[i] == words[i+1] and i < len(word_timestamps) - 1:
616
+ events.append(StutterEvent(
617
+ type='repetition',
618
+ start=word_timestamps[i]['start'] if i < len(word_timestamps) else 0,
619
+ end=word_timestamps[i+1]['end'] if i+1 < len(word_timestamps) else 0,
620
+ text=words[i],
621
+ confidence=0.7
622
+ ))
623
+
624
+ # Add low confidence regions as potential dysfluencies
625
+ for region in low_conf_regions[:5]: # Limit to first 5
626
+ events.append(StutterEvent(
627
+ type='dysfluency',
628
+ start=region['time'],
629
+ end=region['time'] + 0.3,
630
+ text="<uncertainty>",
631
+ confidence=0.4,
632
+ acoustic_features={'entropy': entropy_score}
633
+ ))
634
+
635
+ # === STEP 5: Calculate Basic Metrics ===
636
+ total_duration = sum(e.end - e.start for e in events)
637
+ frequency = (len(events) / duration * 60) if duration > 0 else 0
638
+ stutter_percentage = (total_duration / duration * 100) if duration > 0 else 0
639
+
640
+ # Simple severity assessment
641
+ if stutter_percentage < 5:
642
+ severity = 'none'
643
+ elif stutter_percentage < 15:
644
+ severity = 'mild'
645
+ elif stutter_percentage < 30:
646
+ severity = 'moderate'
647
+ else:
648
+ severity = 'severe'
649
+
650
+ # === STEP 6: Return ASR Results ===
651
+ actual_transcript = transcript if transcript else ""
652
+ target_transcript = proper_transcript if proper_transcript else ""
653
+
654
+ logger.info(f"📝 Final ASR result - Actual: '{actual_transcript}' (len: {len(actual_transcript)}), Target: '{target_transcript}' (len: {len(target_transcript)})")
655
+
656
+ return {
657
+ 'actual_transcript': actual_transcript,
658
+ 'target_transcript': target_transcript,
659
+ 'mismatched_chars': [f"{r['time']:.2f}s" for r in low_conf_regions[:10]],
660
+ 'mismatch_percentage': round(stutter_percentage, 2),
661
+ 'ctc_loss_score': round(entropy_score, 4),
662
+ 'stutter_timestamps': [self._event_to_dict(e) for e in events],
663
+ 'total_stutter_duration': round(total_duration, 2),
664
+ 'stutter_frequency': round(frequency, 2),
665
+ 'severity': severity,
666
+ 'confidence_score': round(avg_confidence, 2),
667
+ 'speaking_rate_sps': round(len(word_timestamps) / duration if duration > 0 else 0, 2),
668
+ 'analysis_duration_seconds': round(time.time() - start_time, 2),
669
+ 'model_version': 'indicwav2vec-hindi-asr-v1'
670
+ }
671
+
672
+
673
+ # Legacy methods - kept for backward compatibility but may not work without additional model initialization
674
+ # These methods reference models (xlsr, base, large) that are not initialized in __init__
675
+ # The main analyze_audio() method uses the IndicWav2Vec Hindi model instead
676
+
677
+ def generate_target_transcript(self, audio_file: str) -> str:
678
+ """Generate expected transcript - Legacy method (uses IndicWav2Vec Hindi model)"""
679
+ try:
680
+ audio, sr = librosa.load(audio_file, sr=16000)
681
+ transcript, _, _ = self._transcribe_with_timestamps(audio)
682
+ return transcript
683
+ except Exception as e:
684
+ logger.error(f"Target transcript generation failed: {e}")
685
+ return ""
686
+
687
+ def transcribe_and_detect(self, audio_file: str, proper_transcript: str) -> Dict:
688
+ """Transcribe audio and detect stuttering patterns - Legacy method"""
689
+ try:
690
+ audio, _ = librosa.load(audio_file, sr=16000)
691
+ transcript, _, _ = self._transcribe_with_timestamps(audio)
692
+
693
+ # Find stuttered sequences
694
+ stuttered_chars = self.find_sequences_not_in_common(transcript, proper_transcript)
695
+
696
+ # Calculate mismatch percentage
697
+ total_mismatched = sum(len(segment) for segment in stuttered_chars)
698
+ mismatch_percentage = (total_mismatched / len(proper_transcript)) * 100 if len(proper_transcript) > 0 else 0
699
+ mismatch_percentage = min(round(mismatch_percentage), 100)
700
+
701
+ return {
702
+ 'transcription': transcript,
703
+ 'stuttered_chars': stuttered_chars,
704
+ 'mismatch_percentage': mismatch_percentage
705
+ }
706
+ except Exception as e:
707
+ logger.error(f"Transcription failed: {e}")
708
+ return {
709
+ 'transcription': '',
710
+ 'stuttered_chars': [],
711
+ 'mismatch_percentage': 0
712
+ }
713
+
714
+ def calculate_stutter_timestamps(self, audio_file: str, proper_transcript: str) -> Tuple[float, List[Tuple[float, float]]]:
715
+ """Calculate stutter timestamps - Legacy method (uses analyze_audio instead)"""
716
+ try:
717
+ # Use main analyze_audio method
718
+ result = self.analyze_audio(audio_file, proper_transcript)
719
+
720
+ # Extract timestamps from result
721
+ timestamps = []
722
+ for event in result.get('stutter_timestamps', []):
723
+ timestamps.append((event['start'], event['end']))
724
+
725
+ ctc_score = result.get('ctc_loss_score', 0.0)
726
+ return float(ctc_score), timestamps
727
+ except Exception as e:
728
+ logger.error(f"Timestamp calculation failed: {e}")
729
+ return 0.0, []
730
+
731
+
732
+ def find_max_common_characters(self, transcription1: str, transcript2: str) -> str:
733
+ """Longest Common Subsequence algorithm"""
734
+ m, n = len(transcription1), len(transcript2)
735
+ lcs_matrix = [[0] * (n + 1) for _ in range(m + 1)]
736
+
737
+ for i in range(1, m + 1):
738
+ for j in range(1, n + 1):
739
+ if transcription1[i - 1] == transcript2[j - 1]:
740
+ lcs_matrix[i][j] = lcs_matrix[i - 1][j - 1] + 1
741
+ else:
742
+ lcs_matrix[i][j] = max(lcs_matrix[i - 1][j], lcs_matrix[i][j - 1])
743
+
744
+ # Backtrack to find LCS
745
+ lcs_characters = []
746
+ i, j = m, n
747
+ while i > 0 and j > 0:
748
+ if transcription1[i - 1] == transcript2[j - 1]:
749
+ lcs_characters.append(transcription1[i - 1])
750
+ i -= 1
751
+ j -= 1
752
+ elif lcs_matrix[i - 1][j] > lcs_matrix[i][j - 1]:
753
+ i -= 1
754
+ else:
755
+ j -= 1
756
+
757
+ lcs_characters.reverse()
758
+ return ''.join(lcs_characters)
759
+
760
+
761
+ def find_sequences_not_in_common(self, transcription1: str, proper_transcript: str) -> List[str]:
762
+ """Find stuttered character sequences"""
763
+ common_characters = self.find_max_common_characters(transcription1, proper_transcript)
764
+ sequences = []
765
+ sequence = ""
766
+ i, j = 0, 0
767
+
768
+ while i < len(transcription1) and j < len(common_characters):
769
+ if transcription1[i] == common_characters[j]:
770
+ if sequence:
771
+ sequences.append(sequence)
772
+ sequence = ""
773
+ i += 1
774
+ j += 1
775
+ else:
776
+ sequence += transcription1[i]
777
+ i += 1
778
+
779
+ if sequence:
780
+ sequences.append(sequence)
781
+
782
+ return sequences
783
+
784
+
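The two text-comparison methods compose as follows: the LCS gives the characters shared with the target transcript, and everything in the actual transcript outside that subsequence is treated as a stuttered insertion. A standalone sketch mirroring `find_max_common_characters` and `find_sequences_not_in_common`:

```python
def lcs(a, b):
    """Longest common subsequence via the standard DP table + backtrack."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = M[i-1][j-1] + 1 if a[i-1] == b[j-1] else max(M[i-1][j], M[i][j-1])
    out, i, j = [], m, n
    while i and j:
        if a[i-1] == b[j-1]:
            out.append(a[i-1]); i -= 1; j -= 1
        elif M[i-1][j] > M[i][j-1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

def stuttered_segments(actual, target):
    """Runs of characters in `actual` that are not part of the LCS with `target`."""
    common, out, seq, i, j = lcs(actual, target), [], "", 0, 0
    while i < len(actual) and j < len(common):
        if actual[i] == common[j]:
            if seq:
                out.append(seq); seq = ""
            i += 1; j += 1
        else:
            seq += actual[i]; i += 1
    if seq:
        out.append(seq)
    return out

# A repeated initial syllable ("na-namaste") surfaces as the extra segment "na".
segments = stuttered_segments("nanamaste", "namaste")
```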
785
+ def _calculate_total_duration(self, timestamps: List[Tuple[float, float]]) -> float:
786
+ """Calculate total stuttering duration"""
787
+ return sum(end - start for start, end in timestamps)
788
+
789
+
790
+ def _calculate_frequency(self, timestamps: List[Tuple[float, float]], audio_file: str) -> float:
791
+ """Calculate stutters per minute"""
792
+ try:
793
+ audio_duration = librosa.get_duration(path=audio_file)
794
+ if audio_duration > 0:
795
+ return (len(timestamps) / audio_duration) * 60
796
+ return 0.0
797
+ except Exception:
798
+ return 0.0
799
+
800
+
801
+ def _determine_severity(self, mismatch_percentage: float) -> str:
802
+ """Determine severity level"""
803
+ if mismatch_percentage < 10:
804
+ return 'none'
805
+ elif mismatch_percentage < 25:
806
+ return 'mild'
807
+ elif mismatch_percentage < 50:
808
+ return 'moderate'
809
+ else:
810
+ return 'severe'
811
+
812
+
813
+ def _calculate_confidence(self, transcription_result: Dict, ctc_loss: float) -> float:
814
+ """Calculate confidence score for the analysis"""
815
+ # Lower mismatch and lower CTC loss = higher confidence
816
+ mismatch_factor = 1 - (transcription_result['mismatch_percentage'] / 100)
817
+ loss_factor = max(0, 1 - (ctc_loss / 10)) # Normalize loss
818
+ confidence = (mismatch_factor + loss_factor) / 2
819
+ return round(min(max(confidence, 0.0), 1.0), 2)
820
+
821
+
822
+ # Model loader is now in a separate module: model_loader.py
823
+ # This follows clean architecture principles - separation of concerns
824
+ # Import using: from diagnosis.ai_engine.model_loader import get_stutter_detector
diagnosis/ai_engine/features.py ADDED
@@ -0,0 +1,206 @@
1
+ # diagnosis/ai_engine/features.py
2
+ """
3
+ Feature extraction for IndicWav2Vec Hindi ASR
4
+
5
+ This module provides feature extraction capabilities using the IndicWav2Vec Hindi model.
6
+ Focused on ASR transcription features rather than hybrid acoustic+linguistic features.
7
+ """
8
+ import torch
9
+ import numpy as np
10
+ import logging
11
+ from typing import Dict, Any, Tuple, Optional
12
+ from transformers import Wav2Vec2ForCTC, AutoProcessor
13
+
14
+ logger = logging.getLogger(__name__)
15
+
16
+
17
+ class ASRFeatureExtractor:
18
+ """
19
+ Feature extractor using IndicWav2Vec Hindi for Automatic Speech Recognition.
20
+
21
+ This extractor focuses on:
22
+ - Audio feature extraction via IndicWav2Vec
23
+ - Transcription confidence scores
24
+ - Frame-level predictions and logits
25
+ - Word-level alignments (estimated)
26
+
27
+ Model: ai4bharat/indicwav2vec-hindi
28
+ """
29
+
30
+ def __init__(self, model: Wav2Vec2ForCTC, processor: AutoProcessor, device: str = "cpu"):
31
+ """
32
+ Initialize the ASR feature extractor.
33
+
34
+ Args:
35
+ model: Pre-loaded IndicWav2Vec Hindi model
36
+ processor: Pre-loaded processor for the model
37
+ device: Device to run inference on ('cpu' or 'cuda')
38
+ """
39
+ self.model = model
40
+ self.processor = processor
41
+ self.device = device
42
+ self.model.eval()
43
+ logger.info(f"✅ ASRFeatureExtractor initialized on {device}")
44
+
45
+ def extract_audio_features(self, audio: np.ndarray, sample_rate: int = 16000) -> Dict[str, Any]:
46
+ """
47
+ Extract features from audio using IndicWav2Vec Hindi.
48
+
49
+ Args:
50
+ audio: Audio waveform as numpy array
51
+ sample_rate: Sample rate of the audio (default: 16000)
52
+
53
+ Returns:
54
+ Dictionary containing:
55
+ - input_values: Processed audio features
56
+ - attention_mask: Attention mask (if available)
57
+ """
58
+ try:
59
+ # Process audio through the processor
60
+ inputs = self.processor(
61
+ audio,
62
+ sampling_rate=sample_rate,
63
+ return_tensors="pt"
64
+ ).to(self.device)
65
+
66
+ return {
67
+ 'input_values': inputs.input_values,
68
+ 'attention_mask': inputs.get('attention_mask', None)
69
+ }
70
+ except Exception as e:
71
+ logger.error(f"❌ Error extracting audio features: {e}")
72
+ raise
73
+
74
+ def get_transcription_features(
75
+ self,
76
+ audio: np.ndarray,
77
+ sample_rate: int = 16000
78
+ ) -> Dict[str, Any]:
79
+ """
80
+ Get transcription features including logits, predictions, and confidence.
81
+
82
+ Args:
83
+ audio: Audio waveform as numpy array
84
+ sample_rate: Sample rate of the audio (default: 16000)
85
+
86
+ Returns:
87
+ Dictionary containing:
88
+ - transcript: Transcribed text
89
+ - logits: Model logits (raw predictions)
90
+ - predicted_ids: Predicted token IDs
91
+ - probabilities: Softmax probabilities
92
+ - confidence: Average confidence score
93
+ - frame_confidence: Per-frame confidence scores
94
+ """
95
+ try:
96
+ # Process audio
97
+ inputs = self.processor(
98
+ audio,
99
+ sampling_rate=sample_rate,
100
+ return_tensors="pt"
101
+ ).to(self.device)
102
+
103
+ # Get model predictions
104
+ with torch.no_grad():
105
+ outputs = self.model(**inputs)
106
+ logits = outputs.logits
107
+ predicted_ids = torch.argmax(logits, dim=-1)
108
+
109
+ # Calculate probabilities and confidence
110
+ probs = torch.softmax(logits, dim=-1)
111
+ max_probs = torch.max(probs, dim=-1)[0] # Get max probability per frame
112
+ frame_confidence = max_probs[0].cpu().numpy()
113
+ avg_confidence = float(torch.mean(max_probs).item())
114
+
115
+ # Decode transcript
116
+ transcript = ""
117
+ try:
118
+ if hasattr(self.processor, 'tokenizer'):
119
+ transcript = self.processor.tokenizer.decode(
120
+ predicted_ids[0],
121
+ skip_special_tokens=True
122
+ )
123
+ elif hasattr(self.processor, 'batch_decode'):
124
+ transcript = self.processor.batch_decode(predicted_ids)[0]
125
+
126
+ # Clean up transcript
127
+ if transcript:
128
+ transcript = transcript.strip()
129
+ transcript = transcript.replace('<pad>', '').replace('<s>', '').replace('</s>', '').replace('|', ' ').strip()
130
+ transcript = ' '.join(transcript.split())
131
+ except Exception as e:
132
+ logger.warning(f"⚠️ Decode error: {e}")
133
+ transcript = ""
134
+
135
+ return {
136
+ 'transcript': transcript,
137
+ 'logits': logits.cpu().numpy(),
138
+ 'predicted_ids': predicted_ids.cpu().numpy(),
139
+ 'probabilities': probs.cpu().numpy(),
140
+ 'confidence': avg_confidence,
141
+ 'frame_confidence': frame_confidence,
142
+ 'num_frames': logits.shape[1]
143
+ }
144
+ except Exception as e:
145
+ logger.error(f"❌ Error getting transcription features: {e}")
146
+ raise
147
+
148
+ def get_word_level_features(
149
+ self,
150
+ audio: np.ndarray,
151
+ sample_rate: int = 16000
152
+ ) -> Dict[str, Any]:
153
+ """
154
+ Get word-level features including timestamps and confidence.
155
+
156
+ Args:
157
+ audio: Audio waveform as numpy array
158
+ sample_rate: Sample rate of the audio (default: 16000)
159
+
160
+ Returns:
161
+ Dictionary containing:
162
+ - words: List of words
163
+ - word_timestamps: List of (start, end) timestamps for each word
164
+ - word_confidence: Confidence score for each word
165
+ """
166
+ try:
167
+ # Get transcription features
168
+ features = self.get_transcription_features(audio, sample_rate)
169
+ transcript = features['transcript']
170
+ frame_confidence = features['frame_confidence']
171
+ num_frames = features['num_frames']
172
+
173
+ # Estimate word-level timestamps (simplified)
174
+ words = transcript.split() if transcript else []
175
+ audio_duration = len(audio) / sample_rate
176
+ time_per_word = audio_duration / max(len(words), 1) if words else 0
177
+
178
+ word_timestamps = []
179
+ word_confidence = []
180
+
181
+ for i, word in enumerate(words):
182
+ start_time = i * time_per_word
183
+ end_time = (i + 1) * time_per_word
184
+
185
+ # Estimate confidence for this word (average of corresponding frames)
186
+ start_frame = int((start_time / audio_duration) * num_frames)
187
+ end_frame = int((end_time / audio_duration) * num_frames)
188
+ word_conf = float(np.mean(frame_confidence[start_frame:end_frame])) if end_frame > start_frame else 0.5
189
+
190
+ word_timestamps.append({
191
+ 'word': word,
192
+ 'start': start_time,
193
+ 'end': end_time
194
+ })
195
+ word_confidence.append(word_conf)
196
+
197
+ return {
198
+ 'words': words,
199
+ 'word_timestamps': word_timestamps,
200
+ 'word_confidence': word_confidence,
201
+ 'transcript': transcript
202
+ }
203
+ except Exception as e:
204
+ logger.error(f"❌ Error getting word-level features: {e}")
205
+ raise
206
+
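The "simplified" estimate in `get_word_level_features` spreads words uniformly across the audio duration and averages the frame confidences that fall inside each word's span. The core arithmetic can be sketched standalone (function name illustrative; the real method pulls its inputs from `get_transcription_features`):

```python
import numpy as np

def estimate_word_timestamps(transcript: str, frame_confidence: np.ndarray,
                             audio_duration: float):
    """Evenly spread words across the audio; average frame confidences per word."""
    words = transcript.split() if transcript else []
    num_frames = len(frame_confidence)
    time_per_word = audio_duration / max(len(words), 1) if words else 0
    timestamps, confidences = [], []
    for i, word in enumerate(words):
        start, end = i * time_per_word, (i + 1) * time_per_word
        # Map the word's time span onto CTC frame indices
        sf = int((start / audio_duration) * num_frames)
        ef = int((end / audio_duration) * num_frames)
        conf = float(np.mean(frame_confidence[sf:ef])) if ef > sf else 0.5
        timestamps.append({'word': word, 'start': start, 'end': end})
        confidences.append(conf)
    return timestamps, confidences
```

Note this is a uniform-spacing heuristic, not forced alignment: a long pause or a prolonged word shifts every estimate after it, which is why the docstring calls it simplified.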
diagnosis/ai_engine/model_loader.py ADDED
@@ -0,0 +1,51 @@
+ # diagnosis/ai_engine/model_loader.py
+ """Singleton pattern for model loading
+
+ This loader provides a clean interface for getting the detector instance.
+ Uses singleton pattern to ensure models are loaded only once.
+ """
+ import logging
+
+ logger = logging.getLogger(__name__)
+
+ _detector_instance = None
+
+ def get_stutter_detector():
+     """
+     Get or create singleton AdvancedStutterDetector instance.
+
+     This ensures models are loaded only once and reused across requests.
+
+     Returns:
+         AdvancedStutterDetector: The singleton detector instance
+
+     Raises:
+         ImportError: If the detector class cannot be imported
+     """
+     global _detector_instance
+
+     if _detector_instance is None:
+         try:
+             from .detect_stuttering import AdvancedStutterDetector
+             logger.info("🔄 Initializing detector instance (first call)...")
+             _detector_instance = AdvancedStutterDetector()
+             logger.info("✅ Detector instance created successfully")
+         except ImportError as e:
+             logger.error(f"❌ Failed to import AdvancedStutterDetector: {e}")
+             raise ImportError("No StutterDetector implementation available in detect_stuttering.py") from e
+         except Exception as e:
+             logger.error(f"❌ Failed to create detector instance: {e}")
+             raise
+
+     return _detector_instance
+
+ def reset_detector():
+     """
+     Reset the singleton instance (useful for testing or reloading models).
+
+     Note: This will force reloading of models on next get_stutter_detector() call.
+     """
+     global _detector_instance
+     _detector_instance = None
+     logger.info("🔄 Detector instance reset")
+
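The module-level singleton above guarantees the heavyweight model load happens once per process, with `reset_detector()` as an escape hatch for tests. Its caching behaviour can be demonstrated with a cheap stand-in class (names here are illustrative, not from the repo):

```python
# Minimal singleton loader mirroring model_loader.py, with a cheap stand-in class
_instance = None

class DummyDetector:
    """Stand-in for AdvancedStutterDetector; the real one loads ASR models."""
    loads = 0  # counts how many times the "expensive" constructor ran
    def __init__(self):
        DummyDetector.loads += 1

def get_detector():
    global _instance
    if _instance is None:
        _instance = DummyDetector()  # expensive step happens only once
    return _instance

def reset_detector():
    global _instance
    _instance = None  # next get_detector() call reconstructs
```

Repeated `get_detector()` calls return the same object, so per-request FastAPI handlers never pay the model-loading cost twice.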
gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
requirements.txt ADDED
@@ -0,0 +1,27 @@
+ # Core ML
+ numpy>=1.24.0,<2.0.0
+ librosa>=0.10.0
+ transformers>=4.38.0,<5.0
+
+ # Audio
+ soundfile>=0.12.1
+ scipy>=1.11.0
+ praat-parselmouth>=0.4.3
+ fastdtw>=0.3.4
+ pyctcdecode==0.5.0
+
+ # API
+ fastapi>=0.115.2,<1.0
+ uvicorn>=0.24.0
+ python-multipart>=0.0.18
+
+ # Logging
+ python-json-logger>=2.0.0
+
+ # Web UI
+ gradio==6.1.0
+
+ # Explicitly pin torch to 2.6+ for transformers compatibility
+ torch>=2.6.0
+ torchvision>=0.21.0
+ torchaudio>=2.6.0