Commit d5f8ae0 (parent: 4b061f6): base modelv3

Files changed:
- README.md (+158 -30)
- main.py (+42 -10)
- models/__pycache__/base.cpython-310.pyc (binary)
- models/base.py (+2 -1)
- models/model_v1/__pycache__/wrapper.cpython-310.pyc (binary)
- models/model_v3/config.py (+7 -0, new)
- models/model_v3/model.py (+63 -0, new)
- models/model_v3/processor.py (+172 -0, new)
- models/model_v3/wrapper.py (+182 -0, new)
README.md
CHANGED
# Alzheimer's Detection Backend API

This repository contains a FastAPI-based backend for detecting Alzheimer's disease from linguistic data. It supports multiple machine learning models, including text-only analysis from `.cha` transcripts and a **multimodal model (V3)** that can process both text and audio.

## Features

- **FastAPI Framework**: High-performance, easy-to-use API.
- **Support for .cha Files**: Specialized parsing for CHAT format transcripts.
- **Multimodal Audio Support (V3)**: Process raw audio files with Automatic Speech Recognition (ASR).
- **Multiple AI Models**:
  - `Model V1`: A DeBERTa-based hybrid model focusing on semantic understanding.
  - `Model V2`: An explainable model with rich linguistic features (TTR, fillers, pauses, etc.).
  - `Model V3 (Multimodal)`: A multimodal fusion model combining text, audio spectrograms, and linguistic features.
- **CORS Support**: Configured to allow requests from frontend applications.

## Prerequisites

- **Python 3.8+**
- **pip** package manager
- **FFmpeg**: Required for audio processing in Model V3.

## Installation
The server will start at `http://127.0.0.1:8000`.

---

## API Documentation

### 1. Health Check

Checks if the API is active and lists loaded models.

```json
{
    "status": "active",
    "loaded_models": ["Model V1", "Model V2", "Model V3 (Multimodal)"]
}
```

### 2. List Models

Returns a list of all available model keys that can be used for prediction.

**Response:**
```json
{
    "models": ["Model V1", "Model V2", "Model V3 (Multimodal)"]
}
```
### 3. Predict / Analyze
**Endpoint:** `POST /predict`

Uploads files and processes them using the specified model.

**Request Type:** `multipart/form-data`

**Parameters:**

| Parameter    | Type           | Required | Description                                                                   |
|--------------|----------------|----------|-------------------------------------------------------------------------------|
| `model_name` | `string`       | **Yes**  | The key of the model (e.g., `Model V1`, `Model V2`, `Model V3 (Multimodal)`). |
| `file`       | `file (.cha)`  | Depends  | The CHAT format transcript file.                                              |
| `audio_file` | `file (audio)` | No       | An audio file (e.g., `.wav`, `.mp3`). **Only for Model V3.**                  |

#### **Input Validation Rules**

| Model                   | `file` (.cha) | `audio_file` | Notes                                                 |
|-------------------------|---------------|--------------|-------------------------------------------------------|
| `Model V1`              | **Required**  | Ignored      | Text-only model.                                      |
| `Model V2`              | **Required**  | Ignored      | Text-only model.                                      |
| `Model V3 (Multimodal)` | Optional      | Optional     | At least one file must be provided. Supports 3 modes. |

---
|
## Model V3 (Multimodal) - Deep Dive

Model V3 is a **multimodal fusion model** that combines three branches of information for its predictions:

1. **Text Branch**: Uses a DeBERTa transformer with an LSTM layer to encode textual semantics.
2. **Audio Branch**: Uses a Vision Transformer (ViT) trained on spectrograms derived from the audio.
3. **Linguistic Branch**: A simple feedforward network processing extracted linguistic features (TTR, filler ratio, pause ratio, etc.).

### Processing Modes

Model V3 handles three different input scenarios:

#### **Mode 1: CHA File Only**
- **Input:** A `.cha` transcript file.
- **Process:**
  1. Parses `*PAR:` (participant) lines from the CHA file.
  2. Cleans the text for the DeBERTa model.
  3. Extracts a 6-dimensional linguistic feature vector (TTR, fillers, repetitions, retracing, errors, pauses).
  4. **Audio branch receives a zero-tensor** (no audio input).
- **Use Case:** When you have a pre-existing transcript and no audio.

#### **Mode 2: CHA File + Audio (Segmented)**
- **Input:** A `.cha` transcript file AND an audio file.
- **Process:**
  1. Parses the CHA file for text and linguistic features (same as Mode 1).
  2. Extracts timestamps (e.g., `15123_456`) from the CHA file.
  3. Uses these timestamps to **slice the corresponding audio segments** from the full audio file.
  4. Concatenates the slices and generates a spectrogram.
  5. Passes the spectrogram to the ViT-based audio branch.
- **Use Case:** For maximum accuracy when you have a professionally transcribed CHA file that is time-aligned with its source audio.

#### **Mode 3: Audio Only (ASR)**
- **Input:** An audio file only (no `.cha` file).
- **Process:**
  1. Uses OpenAI's **Whisper** model to transcribe the audio.
  2. Applies CHAT-like formatting rules to the transcript:
     - Detects pauses and inserts `[PAUSE]` tokens.
     - Detects word repetitions and inserts `[/]` markers.
  3. Extracts linguistic features from the generated transcript.
  4. Generates a spectrogram from the **full audio file** (up to 30s).
- **Use Case:** For real-world inference when you only have raw audio (e.g., a voice recording).

---
|
### Model V3 Response Format

```json
{
    "model_version": "v3_multimodal",
    "filename": "sample.cha",
    "predicted_label": "AD",        // "AD" (Alzheimer's Disease) or "Control"
    "confidence": 0.8721,           // Probability score (0.0 - 1.0)
    "modalities_used": ["text", "linguistic", "audio"],
    "generated_transcript": null    // Populated only in Audio-Only mode (Mode 3)
}
```

**Response Fields:**

| Field                  | Type             | Description                                                                                            |
|------------------------|------------------|--------------------------------------------------------------------------------------------------------|
| `model_version`        | `string`         | Always `"v3_multimodal"` for this model.                                                               |
| `filename`             | `string`         | Name of the uploaded file, or `"audio_only_upload"` if no CHA file was provided.                       |
| `predicted_label`      | `string`         | The classification result: `"AD"` or `"Control"`.                                                      |
| `confidence`           | `float`          | The model's confidence score for the predicted label.                                                  |
| `modalities_used`      | `array[string]`  | Lists the modalities used (`"text"`, `"linguistic"`, `"audio"`).                                       |
| `generated_transcript` | `string \| null` | The transcript generated by Whisper. **Only populated in Audio-Only mode (Mode 3)**, otherwise `null`. |

---
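The `//` annotations in the response example are explanatory only; the actual payload is plain JSON. A minimal sketch of consuming it client-side with the standard library (the response body below is a made-up sample, not captured server output):

```python
import json

# Hypothetical response body from POST /predict with Model V3 (Multimodal).
response_text = """{
  "model_version": "v3_multimodal",
  "filename": "sample.cha",
  "predicted_label": "AD",
  "confidence": 0.8721,
  "modalities_used": ["text", "linguistic", "audio"],
  "generated_transcript": null
}"""

result = json.loads(response_text)
# generated_transcript is None here because this was not an audio-only request.
summary = f"{result['predicted_label']} ({result['confidence']:.0%})"
print(summary)  # AD (87%)
```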
## Example API Requests (cURL)

### Model V1 / V2 (CHA File Only)
```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V1" \
  -F "file=@/path/to/your/transcript.cha"
```

### Model V3: CHA Only
```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "file=@/path/to/your/transcript.cha"
```

### Model V3: CHA + Audio
```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "file=@/path/to/your/transcript.cha" \
  -F "audio_file=@/path/to/your/audio.wav"
```

### Model V3: Audio Only (ASR)
```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "audio_file=@/path/to/your/audio.wav"
```

---
## Older Model Response Formats

### **A. `Model V1` Output**
Focuses on sequence classification and attention scores for sentences.

```json
{
    "filename": "sample.cha",
    "prediction": "DEMENTIA",
    "confidence": 0.85,
    "is_dementia": true,
    "attention_map": [
        {
            "sentence": "I saw the cookie jar.",
            "attention_score": 0.92
        }
    ],
    "model_used": "hybrid_deberta"
}
```

### **B. `Model V2` Output**
Provides a rich set of metadata and linguistic features for explainability.

```json
{
    "filename": "sample.cha",
    "prediction": "Dementia",
    "probability_dementia": 0.78,
    "metadata": {
        "age": 72,
        "gender": "Female",
        "sentence_count": 15
    },
    "linguistic_features": {
        "TTR": 0.45,
        "fillers_ratio": 0.05,
        "repetitions_ratio": 0.02,
        "retracing_ratio": 0.01,
        "incomplete_ratio": 0.03,
        "pauses_ratio": 0.12
    },
    "key_segments": [
        {
            "text": "Um... checking the... the overflowing water.",
            "importance": 0.88
        }
    ]
}
```

---
## Project Structure

```
.
├── main.py                  # Entry point, API routes, and CORS config
├── models/                  # Model definitions and wrappers
│   ├── base.py              # Base class for model wrappers
│   ├── model_v1/            # Logic for 'Model V1' (DeBERTa Hybrid)
│   ├── model_v2/            # Logic for 'Model V2' (Explainable + Linguistic)
│   └── model_v3/            # Logic for 'Model V3 (Multimodal)'
│       ├── config.py        # Model configuration (weights path, model names)
│       ├── model.py         # Neural network architecture (TextBranch, AudioBranch, etc.)
│       ├── processor.py     # Preprocessing (Linguistic features, Spectrograms, ASR)
│       └── wrapper.py       # The main wrapper class integrating all components
└── requirements.txt         # Project dependencies
```
main.py
CHANGED
from fastapi import FastAPI, UploadFile, File, Form, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from contextlib import asynccontextmanager
from typing import Dict, Optional

from models.base import BaseModelWrapper
from models.model_v1.wrapper import HybridDebertaWrapper
from models.model_v2.wrapper import ModelV2Wrapper
from models.model_v3.wrapper import MultimodalWrapper

AVAILABLE_MODELS: Dict[str, BaseModelWrapper] = {
    "Model V1": HybridDebertaWrapper(),
    "Model V2": ModelV2Wrapper(),
    "Model V3 (Multimodal)": MultimodalWrapper()
}

@asynccontextmanager

# ... (lines 18-42 unchanged in this diff) ...

@app.post("/predict")
async def predict(
    model_name: str = Form(...),
    file: Optional[UploadFile] = File(None),       # changed to Optional
    audio_file: Optional[UploadFile] = File(None)  # added audio input
):
    if model_name not in AVAILABLE_MODELS:
        raise HTTPException(status_code=404, detail=f"Model '{model_name}' not found")

    # --- Validation logic ---
    # Models V1 and V2 REQUIRE a .cha file
    if model_name in ["Model V1", "Model V2"]:
        if not file or not file.filename.endswith('.cha'):
            raise HTTPException(status_code=400, detail=f"{model_name} requires a .cha file.")

    # Model V3 requires AT LEAST one file
    if not file and not audio_file:
        raise HTTPException(status_code=400, detail="Please provide a .cha file, an audio file, or both.")

    # --- Read files ---
    text_content = b""
    filename = "audio_only_upload"  # default if no .cha

    if file:
        if not file.filename.endswith('.cha'):
            raise HTTPException(status_code=400, detail="Text file must be a .cha file.")
        text_content = await file.read()
        filename = file.filename

    audio_content = None
    if audio_file:
        audio_content = await audio_file.read()

    # --- Prediction ---
    try:
        # audio_content is passed to every model; base.py now declares it, so the
        # V1/V2 predict signatures should also accept audio_content=None (or
        # **kwargs) even though they ignore the audio.
        result = AVAILABLE_MODELS[model_name].predict(
            file_content=text_content,
            filename=filename,
            audio_content=audio_content
        )
        return result
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        print(f"Prediction Error: {e}")  # log internal errors
        raise HTTPException(status_code=500, detail="Internal Server Error")

@app.get("/health")
def health_check():
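The endpoint's validation rules are easy to unit-test once separated from FastAPI. A small sketch (`validate_upload` is an illustrative helper, not part of this commit):

```python
def validate_upload(model_name: str, has_cha: bool, has_audio: bool):
    """Mirror of the /predict validation rules: returns an error message or None."""
    if model_name in ("Model V1", "Model V2") and not has_cha:
        return f"{model_name} requires a .cha file."
    if not has_cha and not has_audio:
        return "Please provide a .cha file, an audio file, or both."
    return None

# Model V3 accepts any combination except no files at all.
print(validate_upload("Model V3 (Multimodal)", False, True))   # None
print(validate_upload("Model V1", False, True))                # error: .cha required
print(validate_upload("Model V3 (Multimodal)", False, False))  # error: no files
```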
models/__pycache__/base.cpython-310.pyc
CHANGED

Binary files a/models/__pycache__/base.cpython-310.pyc and b/models/__pycache__/base.cpython-310.pyc differ
models/base.py
CHANGED
from abc import ABC, abstractmethod
from typing import Optional

class BaseModelWrapper(ABC):
    @abstractmethod
    # ... (line 6 unchanged in this diff: the first abstract method's signature) ...
        pass

    @abstractmethod
    def predict(self, file_content: bytes, filename: str, audio_content: Optional[bytes] = None) -> dict:
        pass
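Because the new `audio_content` parameter defaults to `None`, existing text-only wrappers keep working without audio. A toy subclass sketch (self-contained: the base class is redeclared here, with its other abstract method omitted for brevity):

```python
from abc import ABC, abstractmethod
from typing import Optional

class BaseModelWrapper(ABC):
    @abstractmethod
    def predict(self, file_content: bytes, filename: str,
                audio_content: Optional[bytes] = None) -> dict:
        ...

class EchoWrapper(BaseModelWrapper):
    """Reports which modalities it received instead of running a model."""
    def predict(self, file_content, filename, audio_content=None):
        modalities = []
        if file_content:
            modalities.append("text")
        if audio_content is not None:
            modalities.append("audio")
        return {"filename": filename, "modalities_used": modalities}

print(EchoWrapper().predict(b"*PAR: hello .", "sample.cha"))
# {'filename': 'sample.cha', 'modalities_used': ['text']}
```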
models/model_v1/__pycache__/wrapper.cpython-310.pyc
CHANGED

Binary files a/models/model_v1/__pycache__/wrapper.cpython-310.pyc and b/models/model_v1/__pycache__/wrapper.cpython-310.pyc differ
models/model_v3/config.py
ADDED
import os

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
# Keep the weights file relative to this package; joining an absolute path
# (e.g. a local "D:/..." development path) would make os.path.join discard BASE_DIR.
WEIGHTS_PATH = os.path.join(BASE_DIR, "weights", "multimodal_dementia_model.pth")
TEXT_MODEL_NAME = "microsoft/deberta-base"
MAX_LEN = 128
WHISPER_MODEL_SIZE = "base"
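A pitfall to watch when building `WEIGHTS_PATH`: `os.path.join` silently discards earlier components when a later component is absolute (under Windows path rules, any `D:/...` string). A quick demonstration using the platform-specific path modules directly, so it behaves the same on any OS:

```python
import ntpath
import posixpath

# Under Windows path semantics, an absolute drive path resets the join entirely,
# so a BASE_DIR/"weights" prefix would be thrown away.
win_result = ntpath.join("C:/app/weights", "D:/Work/model.pth")
print(win_result)  # D:/Work/model.pth

# A relative filename keeps the path anchored at the first component.
posix_result = posixpath.join("/app/weights", "model.pth")
print(posix_result)  # /app/weights/model.pth
```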
models/model_v3/model.py
ADDED
import torch
import torch.nn as nn
import timm
from transformers import AutoModel

class TextBranch(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.lstm = nn.LSTM(768, 128, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(256, 64)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        _, (h_n, _) = self.lstm(out.last_hidden_state)
        context = torch.cat((h_n[-2], h_n[-1]), dim=1)
        return self.fc(context)

class AudioBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # Pretrained ViT
        self.vit = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
        self.fc = nn.Linear(768, 64)  # Project ViT dim to 64

    def forward(self, pixel_values):
        features = self.vit(pixel_values)
        return self.fc(features)

class LinguisticBranch(nn.Module):
    def __init__(self, input_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.ReLU(),
            nn.Linear(32, 16)
        )

    def forward(self, x):
        return self.net(x)

class MultimodalFusion(nn.Module):
    def __init__(self, text_model_name='microsoft/deberta-base'):
        super().__init__()
        self.text_branch = TextBranch(text_model_name)
        self.audio_branch = AudioBranch()
        self.ling_branch = LinguisticBranch(input_dim=6)

        # Fusion: 64 (Text) + 64 (Audio) + 16 (Ling) = 144
        self.classifier = nn.Sequential(
            nn.Linear(64 + 64 + 16, 64),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(64, 2)
        )

    def forward(self, input_ids, attention_mask, pixel_values, ling_features):
        text_emb = self.text_branch(input_ids, attention_mask)
        audio_emb = self.audio_branch(pixel_values)
        ling_emb = self.ling_branch(ling_features)

        # Concat
        combined = torch.cat((text_emb, audio_emb, ling_emb), dim=1)
        return self.classifier(combined)
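The classifier's input width follows from the three branch projections (64 + 64 + 16 = 144). A stdlib-only sketch of the concatenation arithmetic, standing in for `torch.cat(..., dim=1)` with dummy per-sample embeddings:

```python
# Branch output widths from MultimodalFusion above.
TEXT_DIM, AUDIO_DIM, LING_DIM = 64, 64, 16

def fuse(text_emb, audio_emb, ling_emb):
    """Concatenate one sample's branch embeddings along the feature axis."""
    assert (len(text_emb), len(audio_emb), len(ling_emb)) == (TEXT_DIM, AUDIO_DIM, LING_DIM)
    return text_emb + audio_emb + ling_emb  # list concatenation

combined = fuse([0.1] * TEXT_DIM, [0.2] * AUDIO_DIM, [0.3] * LING_DIM)
print(len(combined))  # 144, the in_features of the first classifier Linear
```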
models/model_v3/processor.py
ADDED
import re
import os
import numpy as np
import torch
import librosa
import librosa.display
import matplotlib
import matplotlib.pyplot as plt
from PIL import Image
from torchvision import transforms
import whisper

# Force non-interactive backend for server environments
matplotlib.use('Agg')

# ==========================================
# 1. Linguistic Feature Extractor
# ==========================================
class LinguisticFeatureExtractor:
    def __init__(self):
        self.patterns = {
            'fillers': re.compile(r'&-([a-z]+)', re.IGNORECASE),
            'repetition': re.compile(r'\[/+\]'),
            'retracing': re.compile(r'\[//\]'),
            'incomplete': re.compile(r'\+[\./]+'),
            'errors': re.compile(r'\[\*.*?\]'),
            'pauses': re.compile(r'\(\.+\)')
        }

    def clean_for_bert(self, raw_text):
        text = re.sub(r'^\*PAR:\s+', '', raw_text)
        text = re.sub(r'\x15\d+_\d+\x15', '', text)
        text = re.sub(r'<|>', '', text)
        text = re.sub(r'\[.*?\]', '', text)
        text = re.sub(r'\(\.+\)', '[PAUSE]', text)
        text = text.replace('_', ' ')
        text = re.sub(r'\s+', ' ', text).strip()
        if text.endswith('[PAUSE]'):
            text = text[:-7].strip()
        return text

    def get_features(self, raw_text):
        stats = {
            'filler_count': len(self.patterns['fillers'].findall(raw_text)),
            'repetition_count': len(self.patterns['repetition'].findall(raw_text)),
            'retracing_count': len(self.patterns['retracing'].findall(raw_text)),
            'incomplete_count': len(self.patterns['incomplete'].findall(raw_text)),
            'error_count': len(self.patterns['errors'].findall(raw_text)),
            'pause_count': len(self.patterns['pauses'].findall(raw_text))
        }
        clean_for_stats = re.sub(r'\[.*?\]', '', raw_text)
        clean_for_stats = re.sub(r'&-([a-z]+)', '', clean_for_stats)
        clean_for_stats = re.sub(r'[^\w\s]', '', clean_for_stats)
        words = clean_for_stats.lower().split()
        stats['word_count'] = len(words)
        return stats

    def get_feature_vector(self, raw_text):
        stats = self.get_features(raw_text)
        n = stats['word_count'] if stats['word_count'] > 0 else 1

        # Calculate TTR (Type-Token Ratio)
        clean_for_stats = re.sub(r'\[.*?\]', '', raw_text)
        clean_for_stats = re.sub(r'&-([a-z]+)', '', clean_for_stats)
        clean_for_stats = re.sub(r'[^\w\s]', '', clean_for_stats)
        words = clean_for_stats.lower().split()
        ttr = (len(set(words)) / n) if n > 0 else 0.0

        return np.array([
            ttr,
            stats['filler_count'] / n,
            stats['repetition_count'] / n,
            stats['retracing_count'] / n,
            stats['error_count'] / n,
            stats['pause_count'] / n
        ], dtype=np.float32)
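A short, self-contained usage sketch of the counting logic above, applied to a made-up CHAT-style utterance (the regexes are restated so the snippet runs on its own):

```python
import re

# Condensed re-statement of LinguisticFeatureExtractor's counting logic.
raw = "*PAR: the boy &-uh is [/] is taking (.) cookies ."

counts = {
    "fillers": len(re.findall(r"&-([a-z]+)", raw, re.IGNORECASE)),      # &-uh
    "repetitions": len(re.findall(r"\[/+\]", raw)),                     # [/]
    "pauses": len(re.findall(r"\(\.+\)", raw)),                         # (.)
}

# Word count after stripping CHAT markers, as get_features does.
clean = re.sub(r"\[.*?\]", "", raw)
clean = re.sub(r"&-([a-z]+)", "", clean)
clean = re.sub(r"[^\w\s]", "", clean)
words = clean.lower().split()

print(counts, len(words), len(set(words)))
# {'fillers': 1, 'repetitions': 1, 'pauses': 1} 7 6
```

Note that the word list still contains the speaker tier token (`par`), since `get_features` does not apply `clean_for_bert`'s `*PAR:` stripping.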
+
# ==========================================
|
| 79 |
+
# 2. Audio Processor
|
| 80 |
+
# ==========================================
|
| 81 |
+
class AudioProcessor:
|
| 82 |
+
def __init__(self):
|
| 83 |
+
self.vit_transforms = transforms.Compose([
|
| 84 |
+
transforms.Resize((224, 224)),
|
| 85 |
+
transforms.ToTensor(),
|
| 86 |
+
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
|
| 87 |
+
])
|
| 88 |
+
|
| 89 |
+
def create_spectrogram_tensor(self, audio_path, intervals=None):
|
| 90 |
+
"""
|
| 91 |
+
Generates spectrogram image and transforms it to Tensor.
|
| 92 |
+
"""
|
| 93 |
+
try:
|
| 94 |
+
fig = plt.figure(figsize=(2.24, 2.24), dpi=100)
|
| 95 |
+
ax = fig.add_subplot(1, 1, 1)
|
| 96 |
+
fig.subplots_adjust(left=0, right=1, bottom=0, top=1)
|
| 97 |
+
|
| 98 |
+
if intervals:
|
| 99 |
+
# Load full audio then slice based on timestamps
|
| 100 |
+
y, sr = librosa.load(audio_path, sr=None)
|
| 101 |
+
clips = []
|
| 102 |
+
for start_ms, end_ms in intervals:
|
| 103 |
+
start_sample = int(start_ms * sr / 1000)
|
| 104 |
+
end_sample = int(end_ms * sr / 1000)
|
| 105 |
+
if end_sample > len(y): end_sample = len(y)
|
| 106 |
+
if start_sample < len(y):
|
| 107 |
+
clips.append(y[start_sample:end_sample])
|
| 108 |
+
if clips:
|
| 109 |
+
y = np.concatenate(clips)
|
| 110 |
+
else:
|
| 111 |
+
y = np.zeros(int(sr*30))
|
| 112 |
+
|
| 113 |
+
# Limit to 30s
|
| 114 |
+
if len(y) > 30 * sr:
|
| 115 |
+
y = y[:30 * sr]
|
| 116 |
+
else:
|
| 117 |
+
y, sr = librosa.load(audio_path, duration=30)
|
| 118 |
+
|
| 119 |
+
ms = librosa.feature.melspectrogram(y=y, sr=sr)
|
| 120 |
+
log_ms = librosa.power_to_db(ms, ref=np.max)
|
| 121 |
+
librosa.display.specshow(log_ms, sr=sr, ax=ax)
|
| 122 |
+
|
| 123 |
+
# Save to buffer instead of file
|
| 124 |
+
from io import BytesIO
|
| 125 |
+
buf = BytesIO()
|
| 126 |
+
fig.savefig(buf, format='png')
|
| 127 |
+
plt.close(fig)
|
| 128 |
+
buf.seek(0)
|
| 129 |
+
|
| 130 |
+
image = Image.open(buf).convert('RGB')
|
| 131 |
+
return self.vit_transforms(image).unsqueeze(0)
|
| 132 |
+
|
| 133 |
+
except Exception as e:
|
| 134 |
+
print(f"Spectrogram creation failed: {e}")
|
| 135 |
+
return torch.zeros((1, 3, 224, 224))
|
| 136 |
+
|
| 137 |
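The millisecond-to-sample conversion used when `intervals` is given can be sketched in isolation (values below are illustrative, not from a real CHA file):

```python
def intervals_to_sample_ranges(intervals_ms, sr, n_samples):
    """Convert CHA millisecond intervals into clamped sample index ranges,
    as create_spectrogram_tensor does before slicing the waveform."""
    ranges = []
    for start_ms, end_ms in intervals_ms:
        start = int(start_ms * sr / 1000)
        end = min(int(end_ms * sr / 1000), n_samples)
        if start < n_samples:
            ranges.append((start, end))
    return ranges

# A 20-second clip at 16 kHz, with two time-aligned utterance intervals.
ranges = intervals_to_sample_ranges([(0, 500), (15123, 15456)], sr=16000, n_samples=16000 * 20)
print(ranges)  # [(0, 8000), (241968, 247296)]
```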
+
# ==========================================
|
| 138 |
+
# 3. ASR Helper (Whisper + CHAT Rules)
|
| 139 |
+
# ==========================================
|
| 140 |
+
def apply_chat_rules(transcription_result):
|
| 141 |
+
"""
|
| 142 |
+
Converts Whisper result into CHAT-like format AND inserts [PAUSE] tokens.
|
| 143 |
+
"""
|
| 144 |
+
formatted_text = []
|
| 145 |
+
segments = transcription_result.get('segments', [])
|
| 146 |
+
last_end = 0
|
| 147 |
+
|
| 148 |
+
for seg in segments:
|
| 149 |
+
gap = seg['start'] - last_end
|
| 150 |
+
# Insert [PAUSE] token + CHAT marker
|
| 151 |
+
if gap > 0.8:
|
| 152 |
+
formatted_text.append("[PAUSE] (..)")
|
| 153 |
+
elif gap > 0.3:
|
| 154 |
+
formatted_text.append("[PAUSE] (.)")
|
| 155 |
+
|
| 156 |
+
text = seg['text'].strip()
|
| 157 |
+
|
| 158 |
+
# Repetitions (Basic Detection)
|
| 159 |
+
words = text.split()
|
| 160 |
+
processed_words = []
|
| 161 |
+
for i, w in enumerate(words):
|
| 162 |
+
clean_w = re.sub(r'[^a-zA-Z]', '', w.lower())
|
| 163 |
+
if i > 0:
|
| 164 |
+
prev_clean = re.sub(r'[^a-zA-Z]', '', words[i-1].lower())
|
| 165 |
+
if clean_w == prev_clean and clean_w:
|
| 166 |
+
processed_words[-1] = f"{words[i-1]} [/]"
|
| 167 |
+
processed_words.append(w)
|
| 168 |
+
|
| 169 |
+
formatted_text.append(" ".join(processed_words))
|
| 170 |
+
last_end = seg['end']
|
| 171 |
+
|
| 172 |
+
return " ".join(formatted_text)
|
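As a sanity check on the pause and repetition rules, here is how `apply_chat_rules` behaves on a minimal mocked Whisper-style result (the segment dict shape mirrors Whisper's `transcribe` output; the values are made up, and the helper is inlined in trimmed form so the snippet runs on its own):

```python
import re

def apply_chat_rules(transcription_result):
    # Trimmed, inlined copy of the helper above so this example is self-contained.
    formatted_text = []
    last_end = 0
    for seg in transcription_result.get('segments', []):
        gap = seg['start'] - last_end
        # Inter-segment silences become CHAT pause markers
        if gap > 0.8:
            formatted_text.append("[PAUSE] (..)")
        elif gap > 0.3:
            formatted_text.append("[PAUSE] (.)")
        words = seg['text'].strip().split()
        processed_words = []
        for i, w in enumerate(words):
            clean_w = re.sub(r'[^a-zA-Z]', '', w.lower())
            if i > 0:
                prev_clean = re.sub(r'[^a-zA-Z]', '', words[i - 1].lower())
                # Immediate word repetition gets the CHAT [/] retracing marker
                if clean_w == prev_clean and clean_w:
                    processed_words[-1] = f"{words[i - 1]} [/]"
            processed_words.append(w)
        formatted_text.append(" ".join(processed_words))
        last_end = seg['end']
    return " ".join(formatted_text)

# Mocked result: a repeated word, then a 1.0 s gap between segments
mock = {'segments': [
    {'start': 0.0, 'end': 1.0, 'text': ' the the cat'},
    {'start': 2.0, 'end': 3.0, 'text': ' sat'},
]}
print(apply_chat_rules(mock))  # the [/] the cat [PAUSE] (..) sat
```

The 0.3 s / 0.8 s thresholds map short and long silences to `(.)` and `(..)`, so downstream feature counting can treat ASR output and real CHAT transcripts uniformly.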
models/model_v3/wrapper.py
ADDED
@@ -0,0 +1,182 @@
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer
from typing import Optional
import os
import tempfile
import whisper
import re

from models.base import BaseModelWrapper
from .model import MultimodalFusion
from .processor import LinguisticFeatureExtractor, AudioProcessor, apply_chat_rules
from .config import WEIGHTS_PATH, TEXT_MODEL_NAME, MAX_LEN

class MultimodalWrapper(BaseModelWrapper):
    def __init__(self):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model = None
        self.tokenizer = None
        self.asr_model = None
        self.ling_extractor = LinguisticFeatureExtractor()
        self.audio_processor = AudioProcessor()

    def load(self):
        print("Loading Model V3 components...")
        self.tokenizer = AutoTokenizer.from_pretrained(TEXT_MODEL_NAME)
        self.model = MultimodalFusion(TEXT_MODEL_NAME)

        # Load weights, mapping to CPU when CUDA is unavailable
        if torch.cuda.is_available():
            state_dict = torch.load(WEIGHTS_PATH)
        else:
            state_dict = torch.load(WEIGHTS_PATH, map_location=torch.device('cpu'))
        self.model.load_state_dict(state_dict)
        self.model.to(self.device)
        self.model.eval()

        # Load Whisper ("base" model, as in the notebook)
        print("Loading Whisper for Audio-Only Inference...")
        self.asr_model = whisper.load_model("base")

    def predict(self, file_content: bytes, filename: str, audio_content: Optional[bytes] = None) -> dict:
        """
        Handles 3 scenarios:
        1. CHA only: file_content is the CHA transcript.
        2. CHA + audio: file_content is the CHA transcript, audio_content is the audio.
        3. Audio only: file_content is empty/dummy, audio_content is the audio.
        """

        # Determine scenario
        is_cha_provided = filename.endswith('.cha') and len(file_content) > 0
        has_audio = audio_content is not None and len(audio_content) > 0

        processed_text = ""
        ling_features = None
        audio_tensor = None
        intervals = []

        # --- SCENARIO 3: PURE AUDIO (no transcript, so generate one) ---
        if not is_cha_provided and has_audio:
            print("Processing Mode: Audio Only (ASR)")

            # Save audio to a temp file for Whisper/librosa
            with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_audio:
                tmp_audio.write(audio_content)
                tmp_path = tmp_audio.name

            try:
                # 1. Transcribe
                result = self.asr_model.transcribe(tmp_path, word_timestamps=False)
                # 2. Apply CHAT rules
                chat_transcript = apply_chat_rules(result)
                processed_text = chat_transcript  # Whisper output needs little or no BERT cleaning

                # 3. Extract features from the generated text.
                # Stats are computed manually here (as in the ASR notebook section)
                # because the ASR output does not share the raw CHA format.
                stats = self.ling_extractor.get_features(chat_transcript)
                pause_count = chat_transcript.count("[PAUSE]")
                repetition_count = chat_transcript.count("[/]")

                # Type-token ratio (TTR)
                clean_t = re.sub(r'\[.*?\]', '', chat_transcript)
                clean_t = re.sub(r'[^\w\s]', '', clean_t)
                words = clean_t.lower().split()
                n = len(words) if len(words) > 0 else 1
                ttr = len(set(words)) / n

                ling_vec = [
                    ttr,
                    stats['filler_count'] / n,
                    repetition_count / n,
                    stats['retracing_count'] / n,
                    stats['error_count'] / n,
                    pause_count / n
                ]
                ling_features = torch.tensor(ling_vec, dtype=torch.float32).unsqueeze(0)

                # 4. Generate spectrogram (whole file, no intervals)
                audio_tensor = self.audio_processor.create_spectrogram_tensor(tmp_path, intervals=None)

            finally:
                os.remove(tmp_path)

        # --- SCENARIOS 1 & 2: CHA FILE PROVIDED ---
        else:
            # Parse participant text from the CHA file
            text_str = file_content.decode('utf-8', errors='replace')
            par_lines = []

            # Timestamps look like \x15123_456\x15; this matches the
            # 'load_and_process_data' -> 'process_dir' behaviour
            full_text_for_intervals = ""

            for line in text_str.splitlines():
                if line.startswith('*PAR:'):
                    content = line[5:].strip()
                    par_lines.append(content)
                    full_text_for_intervals += content + " "

            raw_text = " ".join(par_lines)
            processed_text = self.ling_extractor.clean_for_bert(raw_text)

            # Extract linguistic features
            feats = self.ling_extractor.get_feature_vector(raw_text)
            ling_features = torch.tensor(feats, dtype=torch.float32).unsqueeze(0)

            # --- SCENARIO 2: CHA + AUDIO (segmentation) ---
            if has_audio:
                print("Processing Mode: CHA + Audio (Segmentation)")
                # Extract intervals from the raw text (containing the time bullets)
                # Notebook regex: re.findall(r'\x15(\d+)_(\d+)\x15', text_content)
                found_intervals = re.findall(r'\x15(\d+)_(\d+)\x15', full_text_for_intervals)
                intervals = [(int(s), int(e)) for s, e in found_intervals]

                with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as tmp_audio:
                    tmp_audio.write(audio_content)
                    tmp_path = tmp_audio.name

                try:
                    # Pass intervals to slice out the participant-only audio
                    audio_tensor = self.audio_processor.create_spectrogram_tensor(tmp_path, intervals=intervals)
                finally:
                    os.remove(tmp_path)

            # --- SCENARIO 1: CHA ONLY ---
            else:
                print("Processing Mode: CHA Only")
                audio_tensor = torch.zeros((1, 3, 224, 224))

        # --- COMMON INFERENCE STEPS ---
        encoding = self.tokenizer.encode_plus(
            processed_text,
            add_special_tokens=True,
            max_length=MAX_LEN,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        with torch.no_grad():
            input_ids = encoding['input_ids'].to(self.device)
            mask = encoding['attention_mask'].to(self.device)
            pixel_values = audio_tensor.to(self.device)
            ling_input = ling_features.to(self.device)

            outputs = self.model(input_ids, mask, pixel_values, ling_input)
            probs = F.softmax(outputs, dim=1)
            pred_idx = torch.argmax(probs, dim=1).item()
            confidence = probs[0][pred_idx].item()

        label_map = {0: 'Control', 1: 'AD'}

        return {
            "model_version": "v3_multimodal",
            "filename": filename if filename else "audio_upload",
            "predicted_label": label_map[pred_idx],
            "confidence": round(confidence, 4),
            "modalities_used": ["text", "linguistic"] + (["audio"] if has_audio else []),
            "generated_transcript": processed_text if not is_cha_provided else None
        }
```
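In the audio-only branch, the linguistic vector is derived from the generated transcript; the type-token-ratio term, for example, strips the CHAT markers before counting. A standalone sketch of that step, using a made-up transcript:

```python
import re

# Hypothetical CHAT-style transcript, as apply_chat_rules might produce
transcript = "[PAUSE] (.) the cat [/] cat sat on the mat"

# Strip [...] markers and punctuation before counting word types/tokens,
# mirroring the TTR computation inside predict()
clean_t = re.sub(r'\[.*?\]', '', transcript)
clean_t = re.sub(r'[^\w\s]', '', clean_t)
words = clean_t.lower().split()
n = len(words) if len(words) > 0 else 1  # guard against division by zero
ttr = len(set(words)) / n
print(round(ttr, 3))  # 5 unique words / 7 tokens
```

Removing the markers first matters: otherwise tokens like `[PAUSE]` and `[/]` would inflate the token count and depress the TTR, which is one of the six normalized features fed to the fusion model.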