---
title: Adtrack Backend
emoji: πŸ“š
colorFrom: red
colorTo: pink
sdk: docker
pinned: false
---

# Alzheimer's Detection Backend API

This repository contains a FastAPI-based backend for detecting Alzheimer's disease from linguistic data. It supports multiple machine learning models, including text-only analysis from .cha transcripts and a multimodal model (V3) that can process both text and audio.

## πŸš€ Features

- **FastAPI Framework**: High-performance, easy-to-use API.
- **Support for `.cha` Files**: Specialized parsing for CHAT-format transcripts.
- **Multimodal Audio Support (V3)**: Process raw audio files with Automatic Speech Recognition (ASR).
- **Multiple AI Models**:
  - **Model V1**: A DeBERTa-based hybrid model focused on semantic understanding.
  - **Model V2**: An explainable model with rich linguistic features (TTR, fillers, pauses, etc.).
  - **Model V3 (Multimodal)**: A fusion model combining text, audio spectrograms, and linguistic features.
- **CORS Support**: Configured to allow requests from frontend applications.

πŸ› οΈ Prerequisites

  • Python 3.8+
  • pip package manager
  • FFmpeg: Required for audio processing in Model V3.

## πŸ“₯ Installation

1. Clone the repository:

   ```bash
   git clone <repository_url>
   cd <repository_name>
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   # Windows
   python -m venv venv
   .\venv\Scripts\activate

   # macOS/Linux
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

> **Note:** Install `torch` with CUDA support if you intend to run on a GPU.

πŸƒ Usage

Start the server using uvicorn:

uvicorn main:app --reload

The server will start at http://127.0.0.1:8000.

## Configuration (Environment Variables)

| Variable | Description | Default |
| --- | --- | --- |
| `SEGMENTATION_ROOT_PATH` | Root path for auto-discovering segmentation CSVs (expected layout: `<path>/AD/*.csv` and `<path>/Control/*.csv`) | None (disabled) |

Example (Windows PowerShell):

```powershell
$env:SEGMENTATION_ROOT_PATH = "D:\dataset\segmentation"
uvicorn main:app --reload
```

## πŸ“– API Documentation

### 1. Health Check

**Endpoint:** `GET /health`

Checks whether the API is active and lists the loaded models.

**Response:**

```json
{
  "status": "active",
  "loaded_models": ["Model V1", "Model V2", "Model V3 (Multimodal)"]
}
```

### 2. List Available Models

**Endpoint:** `GET /models`

Returns all available model keys that can be used for prediction.

**Response:**

```json
{
  "models": ["Model V1", "Model V2", "Model V3 (Multimodal)"]
}
```

### 3. Predict / Analyze

**Endpoint:** `POST /predict`

Uploads files and processes them with the specified model.

**Request Type:** `multipart/form-data`

**Parameters:**

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `model_name` | string | Yes | The model key (e.g., `Model V1`, `Model V2`, `Model V3 (Multimodal)`). |
| `file` | file (`.cha`) | Depends on model | The CHAT-format transcript file. |
| `audio_file` | file (audio) | No | An audio file (e.g., `.wav`, `.mp3`). Model V3 only. |
| `segmentation_file` | file (`.csv`) | No | Segmentation CSV with speaker intervals. Model V3 only. |

#### Input Validation Rules

| Model | `file` (`.cha`) | `audio_file` | `segmentation_file` | Notes |
| --- | --- | --- | --- | --- |
| Model V1 | Required | Ignored | Ignored | Text-only model. |
| Model V2 | Required | Ignored | Ignored | Text-only model. |
| Model V3 (Multimodal) | Optional | Optional | Optional | At least one file is required. Supports 4 modes. |

## 🧠 Model V3 (Multimodal): Deep Dive

Model V3 fuses three branches of information for its predictions:

1. **Text Branch**: a DeBERTa transformer with an LSTM layer that encodes textual semantics.
2. **Audio Branch**: a Vision Transformer (ViT) trained on spectrograms derived from the audio.
3. **Linguistic Branch**: a feedforward network over extracted linguistic features (TTR, filler ratio, pause ratio, etc.).
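The fusion step can be illustrated with a toy late-fusion sketch in plain Python. The vectors, weights, and bias here are illustrative stand-ins — the real model uses learned DeBERTa, ViT, and MLP parameters:

```python
import math

def fuse_and_classify(text_vec, audio_vec, ling_vec, weights, bias):
    """Toy late-fusion head: concatenate the three branch embeddings,
    apply a linear layer, and squash the logit into an AD probability.
    All inputs are illustrative, not the backend's actual parameters."""
    fused = text_vec + audio_vec + ling_vec          # list concatenation
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    p_ad = 1 / (1 + math.exp(-logit))                # sigmoid
    return {"AD": p_ad, "Control": 1 - p_ad}
```

With all-zero weights the head is maximally uncertain (0.5/0.5); training shifts the weights so informative branch dimensions dominate the logit.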

### Processing Modes

Model V3 handles four input scenarios:

#### Mode 1: CHA File Only

- **Input:** a `.cha` transcript file.
- **Process:**
  1. Parses `*PAR:` (participant) lines from the CHA file.
  2. Cleans the text for the DeBERTa model.
  3. Extracts a 6-dimensional linguistic feature vector (TTR, fillers, repetitions, retracing, errors, pauses).
  4. The audio branch receives a zero tensor (no audio input).
- **Use Case:** you have a pre-existing transcript and no audio.
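Steps 1 and 3 above can be sketched as follows — a simplified illustration, since the production parser also handles multi-line utterances and CHAT annotation codes not shown here:

```python
import re

def parse_par_lines(cha_text):
    """Collect participant (*PAR:) utterances from a CHAT transcript."""
    utterances = []
    for line in cha_text.splitlines():
        if line.startswith("*PAR:"):
            # Drop the speaker tier label; keep the utterance text.
            utterances.append(line[len("*PAR:"):].strip())
    return utterances

def type_token_ratio(utterances):
    """TTR = unique word types / total word tokens (0.0 for empty input)."""
    tokens = [w.lower() for u in utterances for w in re.findall(r"[a-zA-Z']+", u)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "*PAR:\tthe boy is is on the stool .\n%mor:\t...\n*INV:\tmhm .\n"
utts = parse_par_lines(sample)
```

Only the `*PAR:` tier is kept; investigator (`*INV:`) and dependent (`%`) tiers are ignored, so the features describe participant speech alone.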

#### Mode 2: CHA File + Audio

- **Input:** a `.cha` transcript file AND an audio file.
- **Process:**
  1. Parses the CHA file for text and linguistic features (as in Mode 1).
  2. Extracts timestamps (e.g., `15123_456`) from the CHA file.
  3. Slices the corresponding segments from the full audio using those timestamps.
  4. Concatenates the slices and generates a spectrogram.
  5. Passes the spectrogram to the ViT-based audio branch.
- **Use Case:** maximum accuracy, when you have a professionally transcribed CHA file time-aligned with its source audio.
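Steps 2 and 3 can be sketched like this, under the assumption that timestamps appear as `start_end` millisecond pairs; real CHAT files wrap them in control characters that a production parser strips first:

```python
import re

TIMESTAMP_RE = re.compile(r"(\d+)_(\d+)")  # assumed start_end pairs in ms

def extract_intervals(cha_text):
    """Pull (start_ms, end_ms) pairs from CHAT timestamp markers,
    skipping malformed pairs where end <= start."""
    intervals = []
    for start, end in TIMESTAMP_RE.findall(cha_text):
        start_ms, end_ms = int(start), int(end)
        if end_ms > start_ms:
            intervals.append((start_ms, end_ms))
    return intervals

def intervals_to_samples(intervals, sample_rate):
    """Convert millisecond intervals to waveform sample index ranges."""
    return [(s * sample_rate // 1000, e * sample_rate // 1000)
            for s, e in intervals]
```

The sample ranges can then be used to slice the decoded waveform before concatenation and spectrogram generation.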

#### Mode 3: Audio + Segmentation (Auto-Discovery)

- **Input:** an audio file (e.g., `F03.mp3`).
- **Process:**
  1. Auto-discovers a segmentation CSV under the configured `SEGMENTATION_ROOT_PATH`:
     - `SEGMENTATION_ROOT_PATH/AD/F03.csv`
     - `SEGMENTATION_ROOT_PATH/Control/F03.csv`
  2. Parses the CSV to extract PAR (participant) speech intervals.
  3. Slices the audio on these intervals to isolate participant speech.
  4. Generates a spectrogram from the sliced audio.
  5. Transcribes with Whisper ASR and extracts linguistic features.
- **CSV Format:** `speaker,start_ms,end_ms` (e.g., `PAR,1234,5678`)
- **Use Case:** mixed audio (participant + interviewer) with segmentation files in a dataset folder.
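Parsing the documented `speaker,start_ms,end_ms` format can be sketched as below — a minimal version that tolerates a header row and skips non-PAR speakers; the real loader may accept additional speaker labels:

```python
import csv
import io

def par_intervals_from_csv(csv_text):
    """Extract (start_ms, end_ms) intervals for the PAR speaker from a
    segmentation CSV in speaker,start_ms,end_ms format."""
    intervals = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue                       # skip blank/malformed rows
        speaker, start, end = row
        # A header row ("speaker,...") fails this check and is skipped.
        if speaker.strip().upper() == "PAR":
            intervals.append((int(start), int(end)))
    return intervals
```

Only the PAR rows survive, so interviewer speech never reaches the spectrogram or ASR stages.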

> **Note:** If `SEGMENTATION_ROOT_PATH` is not configured or no matching CSV is found, the backend falls back to Mode 4.

#### Mode 4: Audio Only (Participant-Only Audio)

- **Input:** an audio file only (no `.cha` or segmentation file found).
- **Process:**
  1. Assumes the entire audio contains only participant speech.
  2. Transcribes the full audio with Whisper ASR.
  3. Applies CHAT-like formatting rules (pause detection, repetition markers).
  4. Extracts linguistic features from the generated transcript.
  5. Generates a spectrogram from the full audio file.
- **Use Case:** pre-segmented recordings or participant-only audio files.
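The pause-detection part of step 3 can be sketched from ASR word timings. The threshold value and the `(word, start, end)` tuple shape are illustrative assumptions, not the backend's actual ASR output format:

```python
def insert_pause_markers(words, pause_threshold_s=0.25):
    """Rebuild a transcript from ASR word timings, inserting a
    CHAT-style pause marker (.) wherever the silent gap between
    consecutive words exceeds the threshold (in seconds)."""
    out = []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and start - prev_end > pause_threshold_s:
            out.append("(.)")
        out.append(word)
        prev_end = end
    return " ".join(out)
```

The resulting CHAT-like string can then feed the same linguistic feature extraction used for real `.cha` transcripts.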

### Model V3 Response Format

```json
{
  "model_version": "v3_multimodal",
  "filename": "sample.cha",
  "predicted_label": "AD",
  "confidence": 0.8721,
  "modalities_used": ["text", "linguistic", "audio"],
  "visualizations": {
    "probabilities": {
      "AD": 0.8721,
      "Control": 0.1279
    },
    "key_contribution_segments": {
      "note": "Denser color indicates higher contribution to prediction",
      "segments": [
        {"text": "uh the water is overflowing", "contribution_score": 1.0},
        {"text": "and the the mother", "contribution_score": 0.5}
      ]
    }
  }
}
```
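A client can condense this response shape into the essentials like so — a small sketch assuming the fields documented above; `summarize_prediction` is an illustrative helper, not part of the API:

```python
def summarize_prediction(response):
    """Reduce a V3 /predict response to (label, confidence, top_segment).

    Returns None for the segment when no contribution data is present.
    """
    label = response["predicted_label"]
    confidence = response["confidence"]
    segments = (response.get("visualizations", {})
                        .get("key_contribution_segments", {})
                        .get("segments", []))
    top = max(segments, key=lambda s: s["contribution_score"], default=None)
    return label, confidence, top["text"] if top else None
```

Using `.get()` with defaults keeps the helper robust if a mode omits part of the `visualizations` payload.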

**Response Fields:**

| Field | Type | Description |
| --- | --- | --- |
| `model_version` | string | Always `"v3_multimodal"` for this model. |
| `filename` | string | Name of the uploaded file, or `"audio_upload"` if no CHA file was provided. |
| `predicted_label` | string | The classification result: `"AD"` or `"Control"`. |
| `confidence` | float | The model's confidence score for the predicted label. |
| `modalities_used` | array[string] | The modalities used (`"text"`, `"linguistic"`, `"audio"`). |
| `visualizations` | object | Visualization data for frontend rendering. |

**Visualizations:**

| Visualization | Description | Availability |
| --- | --- | --- |
| `probabilities` | AD vs. Control probability distribution. | All modes |
| `key_contribution_segments` | Text segments with the highest contribution to the prediction; denser color = more impact. | All modes |

> **Note:** `key_contribution_segments` includes a `note` field explaining the color scheme and a `segments` array where each segment has a `text` and a `contribution_score` (0.0–1.0).
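One way a frontend might render "denser color = more impact" is to map each `contribution_score` to an opacity. This is an illustrative rendering convention, not something the API mandates:

```python
def contribution_rgba(score, base_rgb=(220, 38, 38)):
    """Map a contribution_score in [0.0, 1.0] to a CSS rgba() string,
    using opacity for density. base_rgb is an arbitrary highlight color."""
    score = min(max(score, 0.0), 1.0)      # clamp defensively
    r, g, b = base_rgb
    return f"rgba({r}, {g}, {b}, {score:.2f})"
```

Clamping keeps the output valid even if a score ever falls slightly outside the documented range.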


## Example API Requests (cURL)

### Model V1 / V2 (CHA File Only)

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V1" \
  -F "file=@/path/to/your/transcript.cha"
```

### Model V3: CHA Only

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "file=@/path/to/your/transcript.cha"
```

### Model V3: CHA + Audio

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "file=@/path/to/transcript.cha" \
  -F "audio_file=@/path/to/audio.wav"
```

### Model V3: Audio Only (Auto-Discovery or Participant-Only)

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "audio_file=@/path/to/F03.mp3"
```

If `SEGMENTATION_ROOT_PATH` is set and `F03.csv` exists, Mode 3 is used; otherwise, Mode 4.

### Model V3: Audio + Explicit Segmentation CSV

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "audio_file=@/path/to/audio.wav" \
  -F "segmentation_file=@/path/to/segments.csv"
```

## Older Model Response Formats

### A. Model V1 Output

Focuses on sequence classification and per-sentence attention scores.

```json
{
  "filename": "sample.cha",
  "prediction": "DEMENTIA",
  "confidence": 0.85,
  "is_dementia": true,
  "attention_map": [
    {
      "sentence": "I saw the cookie jar.",
      "attention_score": 0.92
    }
  ],
  "model_used": "hybrid_deberta"
}
```

### B. Model V2 Output

Provides rich metadata and linguistic features for explainability.

```json
{
  "filename": "sample.cha",
  "prediction": "Dementia",
  "probability_dementia": 0.78,
  "metadata": {
    "age": 72,
    "gender": "Female",
    "sentence_count": 15
  },
  "linguistic_features": {
    "TTR": 0.45,
    "fillers_ratio": 0.05,
    "repetitions_ratio": 0.02,
    "retracing_ratio": 0.01,
    "incomplete_ratio": 0.03,
    "pauses_ratio": 0.12
  },
  "key_segments": [
    {
      "text": "Um... checking the... the overflowing water.",
      "importance": 0.88
    }
  ],
  "model_used": "model_v2"
}
```

## πŸ“‚ Project Structure

```text
.
β”œβ”€β”€ main.py                 # Entry point, API routes, and CORS config
β”œβ”€β”€ models/                 # Model definitions and wrappers
β”‚   β”œβ”€β”€ base.py             # Base class for model wrappers
β”‚   β”œβ”€β”€ model_v1/           # Logic for 'Model V1' (DeBERTa hybrid)
β”‚   β”œβ”€β”€ model_v2/           # Logic for 'Model V2' (explainable + linguistic)
β”‚   └── model_v3/           # Logic for 'Model V3 (Multimodal)'
β”‚       β”œβ”€β”€ config.py       # Model configuration (weights path, model names)
β”‚       β”œβ”€β”€ model.py        # Neural network architecture (TextBranch, AudioBranch, etc.)
β”‚       β”œβ”€β”€ processor.py    # Preprocessing (linguistic features, spectrograms, ASR)
β”‚       └── wrapper.py      # Main wrapper class integrating all components
└── requirements.txt        # Project dependencies
```