---
title: Adtrack Backend
emoji: πŸ“š
colorFrom: red
colorTo: pink
sdk: docker
pinned: false
---

# Alzheimer's Detection Backend API

This repository contains a FastAPI-based backend for detecting Alzheimer's disease from linguistic data. It supports multiple machine learning models, including text-only analysis from .cha transcripts and a multimodal model (V3) that can process both text and audio.

## πŸš€ Features

- **FastAPI Framework**: High-performance, easy-to-use API.
- **Support for `.cha` Files**: Specialized parsing for CHAT-format transcripts.
- **Multimodal Audio Support (V3)**: Process raw audio files with Automatic Speech Recognition (ASR).
- **Multiple AI Models**:
  - **Model V1**: A DeBERTa-based hybrid model focused on semantic understanding.
  - **Model V2**: An explainable model with rich linguistic features (TTR, fillers, pauses, etc.).
  - **Model V3 (Multimodal)**: A fusion model combining text, audio spectrograms, and linguistic features.
- **CORS Support**: Configured to allow requests from frontend applications.

πŸ› οΈ Prerequisites

  • Python 3.8+
  • pip package manager
  • FFmpeg: Required for audio processing in Model V3.

## πŸ“₯ Installation

1. Clone the repository:

   ```bash
   git clone <repository_url>
   cd <repository_name>
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   # Windows
   python -m venv venv
   .\venv\Scripts\activate

   # macOS/Linux
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

> **Note:** Install `torch` with CUDA support if you intend to run on a GPU.

πŸƒ Usage

Start the server using uvicorn:

uvicorn main:app --reload

The server will start at http://127.0.0.1:8000.

## Configuration (Environment Variables)

| Variable | Description | Default |
| --- | --- | --- |
| `SEGMENTATION_ROOT_PATH` | Root path for auto-discovering segmentation CSVs (expected layout: `<path>/AD/*.csv` and `<path>/Control/*.csv`) | None (disabled) |

Example (Windows PowerShell):

```powershell
$env:SEGMENTATION_ROOT_PATH = "D:\dataset\segmentation"
uvicorn main:app --reload
```

## πŸ“– API Documentation

### 1. Health Check

**Endpoint:** `GET /health`

Checks whether the API is active and lists the loaded models.

**Response:**

```json
{
  "status": "active",
  "loaded_models": ["Model V1", "Model V2", "Model V3 (Multimodal)"]
}
```

### 2. List Available Models

**Endpoint:** `GET /models`

Returns all available model keys that can be used for prediction.

**Response:**

```json
{
  "models": ["Model V1", "Model V2", "Model V3 (Multimodal)"]
}
```

### 3. Predict / Analyze

**Endpoint:** `POST /predict`

Uploads files and processes them with the specified model.

**Request Type:** `multipart/form-data`

**Parameters:**

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| `model_name` | string | Yes | The model key (e.g., `Model V1`, `Model V2`, `Model V3 (Multimodal)`). |
| `file` | file (`.cha`) | Depends on model | The CHAT-format transcript file. |
| `audio_file` | file (audio) | No | An audio file (e.g., `.wav`, `.mp3`). Model V3 only. |
| `segmentation_file` | file (`.csv`) | No | Segmentation CSV with speaker intervals. Model V3 only. |

#### Input Validation Rules

| Model | `file` (`.cha`) | `audio_file` | `segmentation_file` | Notes |
| --- | --- | --- | --- | --- |
| Model V1 | Required | Ignored | Ignored | Text-only model. |
| Model V2 | Required | Ignored | Ignored | Text-only model. |
| Model V3 (Multimodal) | Optional | Optional | Optional | At least one file is required. Supports 4 modes. |

## 🧠 Model V3 (Multimodal): Deep Dive

Model V3 fuses three branches of information for its predictions:

1. **Text Branch**: a DeBERTa transformer with an LSTM layer that encodes textual semantics.
2. **Audio Branch**: a Vision Transformer (ViT) trained on spectrograms derived from the audio.
3. **Linguistic Branch**: a feedforward network over extracted linguistic features (TTR, filler ratio, pause ratio, etc.).
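The fusion step can be illustrated with a toy late-fusion sketch in plain Python. The vectors, weights, and bias here are illustrative stand-ins — the real model uses learned DeBERTa, ViT, and MLP parameters:

```python
import math

def fuse_and_classify(text_vec, audio_vec, ling_vec, weights, bias):
    """Toy late-fusion head: concatenate the three branch embeddings,
    apply a linear layer, and squash the logit into an AD probability.
    All inputs are illustrative, not the backend's actual parameters."""
    fused = text_vec + audio_vec + ling_vec          # list concatenation
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    p_ad = 1 / (1 + math.exp(-logit))                # sigmoid
    return {"AD": p_ad, "Control": 1 - p_ad}
```

With all-zero weights the head is maximally uncertain (0.5/0.5); training shifts the weights so informative branch dimensions dominate the logit.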

### Processing Modes

Model V3 handles four input scenarios:

#### Mode 1: CHA File Only

- **Input:** a `.cha` transcript file.
- **Process:**
  1. Parses `*PAR:` (participant) lines from the CHA file.
  2. Cleans the text for the DeBERTa model.
  3. Extracts a 6-dimensional linguistic feature vector (TTR, fillers, repetitions, retracing, errors, pauses).
  4. The audio branch receives a zero tensor (no audio input).
- **Use Case:** you have a pre-existing transcript and no audio.
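Steps 1 and 3 above can be sketched as follows — a simplified illustration, since the production parser also handles multi-line utterances and CHAT annotation codes not shown here:

```python
import re

def parse_par_lines(cha_text):
    """Collect participant (*PAR:) utterances from a CHAT transcript."""
    utterances = []
    for line in cha_text.splitlines():
        if line.startswith("*PAR:"):
            # Drop the speaker tier label; keep the utterance text.
            utterances.append(line[len("*PAR:"):].strip())
    return utterances

def type_token_ratio(utterances):
    """TTR = unique word types / total word tokens (0.0 for empty input)."""
    tokens = [w.lower() for u in utterances for w in re.findall(r"[a-zA-Z']+", u)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "*PAR:\tthe boy is is on the stool .\n%mor:\t...\n*INV:\tmhm .\n"
utts = parse_par_lines(sample)
```

Only the `*PAR:` tier is kept; investigator (`*INV:`) and dependent (`%`) tiers are ignored, so the features describe participant speech alone.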

#### Mode 2: CHA File + Audio

- **Input:** a `.cha` transcript file AND an audio file.
- **Process:**
  1. Parses the CHA file for text and linguistic features (as in Mode 1).
  2. Extracts timestamps (e.g., `15123_456`) from the CHA file.
  3. Slices the corresponding segments from the full audio using those timestamps.
  4. Concatenates the slices and generates a spectrogram.
  5. Passes the spectrogram to the ViT-based audio branch.
- **Use Case:** maximum accuracy, when you have a professionally transcribed CHA file time-aligned with its source audio.
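Steps 2 and 3 can be sketched like this, under the assumption that timestamps appear as `start_end` millisecond pairs; real CHAT files wrap them in control characters that a production parser strips first:

```python
import re

TIMESTAMP_RE = re.compile(r"(\d+)_(\d+)")  # assumed start_end pairs in ms

def extract_intervals(cha_text):
    """Pull (start_ms, end_ms) pairs from CHAT timestamp markers,
    skipping malformed pairs where end <= start."""
    intervals = []
    for start, end in TIMESTAMP_RE.findall(cha_text):
        start_ms, end_ms = int(start), int(end)
        if end_ms > start_ms:
            intervals.append((start_ms, end_ms))
    return intervals

def intervals_to_samples(intervals, sample_rate):
    """Convert millisecond intervals to waveform sample index ranges."""
    return [(s * sample_rate // 1000, e * sample_rate // 1000)
            for s, e in intervals]
```

The sample ranges can then be used to slice the decoded waveform before concatenation and spectrogram generation.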

#### Mode 3: Audio + Segmentation (Auto-Discovery)

- **Input:** an audio file (e.g., `F03.mp3`).
- **Process:**
  1. Auto-discovers a segmentation CSV under the configured `SEGMENTATION_ROOT_PATH`:
     - `SEGMENTATION_ROOT_PATH/AD/F03.csv`
     - `SEGMENTATION_ROOT_PATH/Control/F03.csv`
  2. Parses the CSV to extract PAR (participant) speech intervals.
  3. Slices the audio on these intervals to isolate participant speech.
  4. Generates a spectrogram from the sliced audio.
  5. Transcribes with Whisper ASR and extracts linguistic features.
- **CSV Format:** `speaker,start_ms,end_ms` (e.g., `PAR,1234,5678`)
- **Use Case:** mixed audio (participant + interviewer) with segmentation files in a dataset folder.
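Parsing the documented `speaker,start_ms,end_ms` format can be sketched as below — a minimal version that tolerates a header row and skips non-PAR speakers; the real loader may accept additional speaker labels:

```python
import csv
import io

def par_intervals_from_csv(csv_text):
    """Extract (start_ms, end_ms) intervals for the PAR speaker from a
    segmentation CSV in speaker,start_ms,end_ms format."""
    intervals = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue                       # skip blank/malformed rows
        speaker, start, end = row
        # A header row ("speaker,...") fails this check and is skipped.
        if speaker.strip().upper() == "PAR":
            intervals.append((int(start), int(end)))
    return intervals
```

Only the PAR rows survive, so interviewer speech never reaches the spectrogram or ASR stages.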

> **Note:** If `SEGMENTATION_ROOT_PATH` is not configured or no matching CSV is found, the backend falls back to Mode 4.

#### Mode 4: Audio Only (Participant-Only Audio)

- **Input:** an audio file only (no `.cha` or segmentation file found).
- **Process:**
  1. Assumes the entire audio contains only participant speech.
  2. Transcribes the full audio with Whisper ASR.
  3. Applies CHAT-like formatting rules (pause detection, repetition markers).
  4. Extracts linguistic features from the generated transcript.
  5. Generates a spectrogram from the full audio file.
- **Use Case:** pre-segmented recordings or participant-only audio files.
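The pause-detection part of step 3 can be sketched from ASR word timings. The threshold value and the `(word, start, end)` tuple shape are illustrative assumptions, not the backend's actual ASR output format:

```python
def insert_pause_markers(words, pause_threshold_s=0.25):
    """Rebuild a transcript from ASR word timings, inserting a
    CHAT-style pause marker (.) wherever the silent gap between
    consecutive words exceeds the threshold (in seconds)."""
    out = []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and start - prev_end > pause_threshold_s:
            out.append("(.)")
        out.append(word)
        prev_end = end
    return " ".join(out)
```

The resulting CHAT-like string can then feed the same linguistic feature extraction used for real `.cha` transcripts.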

### Model V3 Response Format

```json
{
  "model_version": "v3_multimodal",
  "filename": "sample.cha",
  "predicted_label": "AD",
  "confidence": 0.8721,
  "modalities_used": ["text", "linguistic", "audio"],
  "visualizations": {
    "probabilities": {
      "AD": 0.8721,
      "Control": 0.1279
    },
    "key_contribution_segments": {
      "note": "Denser color indicates higher contribution to prediction",
      "segments": [
        {"text": "uh the water is overflowing", "contribution_score": 1.0},
        {"text": "and the the mother", "contribution_score": 0.5}
      ]
    }
  }
}
```
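A client can condense this response shape into the essentials like so — a small sketch assuming the fields documented above; `summarize_prediction` is an illustrative helper, not part of the API:

```python
def summarize_prediction(response):
    """Reduce a V3 /predict response to (label, confidence, top_segment).

    Returns None for the segment when no contribution data is present.
    """
    label = response["predicted_label"]
    confidence = response["confidence"]
    segments = (response.get("visualizations", {})
                        .get("key_contribution_segments", {})
                        .get("segments", []))
    top = max(segments, key=lambda s: s["contribution_score"], default=None)
    return label, confidence, top["text"] if top else None
```

Using `.get()` with defaults keeps the helper robust if a mode omits part of the `visualizations` payload.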

**Response Fields:**

| Field | Type | Description |
| --- | --- | --- |
| `model_version` | string | Always `"v3_multimodal"` for this model. |
| `filename` | string | Name of the uploaded file, or `"audio_upload"` if no CHA file was provided. |
| `predicted_label` | string | The classification result: `"AD"` or `"Control"`. |
| `confidence` | float | The model's confidence score for the predicted label. |
| `modalities_used` | array[string] | The modalities used (`"text"`, `"linguistic"`, `"audio"`). |
| `visualizations` | object | Visualization data for frontend rendering. |

**Visualizations:**

| Visualization | Description | Availability |
| --- | --- | --- |
| `probabilities` | AD vs. Control probability distribution. | All modes |
| `key_contribution_segments` | Text segments with the highest contribution to the prediction; denser color = more impact. | All modes |

> **Note:** `key_contribution_segments` includes a `note` field explaining the color scheme and a `segments` array where each segment has a `text` and a `contribution_score` (0.0–1.0).
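One way a frontend might render "denser color = more impact" is to map each `contribution_score` to an opacity. This is an illustrative rendering convention, not something the API mandates:

```python
def contribution_rgba(score, base_rgb=(220, 38, 38)):
    """Map a contribution_score in [0.0, 1.0] to a CSS rgba() string,
    using opacity for density. base_rgb is an arbitrary highlight color."""
    score = min(max(score, 0.0), 1.0)      # clamp defensively
    r, g, b = base_rgb
    return f"rgba({r}, {g}, {b}, {score:.2f})"
```

Clamping keeps the output valid even if a score ever falls slightly outside the documented range.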


## Example API Requests (cURL)

### Model V1 / V2 (CHA File Only)

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V1" \
  -F "file=@/path/to/your/transcript.cha"
```

### Model V3: CHA Only

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "file=@/path/to/your/transcript.cha"
```

### Model V3: CHA + Audio

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "file=@/path/to/transcript.cha" \
  -F "audio_file=@/path/to/audio.wav"
```

### Model V3: Audio Only (Auto-Discovery or Participant-Only)

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "audio_file=@/path/to/F03.mp3"
```

If `SEGMENTATION_ROOT_PATH` is set and `F03.csv` exists, Mode 3 is used; otherwise, Mode 4.

### Model V3: Audio + Explicit Segmentation CSV

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "audio_file=@/path/to/audio.wav" \
  -F "segmentation_file=@/path/to/segments.csv"
```

## Older Model Response Formats

### A. Model V1 Output

Focuses on sequence classification and per-sentence attention scores.

```json
{
  "filename": "sample.cha",
  "prediction": "DEMENTIA",
  "confidence": 0.85,
  "is_dementia": true,
  "attention_map": [
    {
      "sentence": "I saw the cookie jar.",
      "attention_score": 0.92
    }
  ],
  "model_used": "hybrid_deberta"
}
```

### B. Model V2 Output

Provides rich metadata and linguistic features for explainability.

```json
{
  "filename": "sample.cha",
  "prediction": "Dementia",
  "probability_dementia": 0.78,
  "metadata": {
    "age": 72,
    "gender": "Female",
    "sentence_count": 15
  },
  "linguistic_features": {
    "TTR": 0.45,
    "fillers_ratio": 0.05,
    "repetitions_ratio": 0.02,
    "retracing_ratio": 0.01,
    "incomplete_ratio": 0.03,
    "pauses_ratio": 0.12
  },
  "key_segments": [
    {
      "text": "Um... checking the... the overflowing water.",
      "importance": 0.88
    }
  ],
  "model_used": "model_v2"
}
```

## πŸ“‚ Project Structure

```text
.
β”œβ”€β”€ main.py                 # Entry point, API routes, and CORS config
β”œβ”€β”€ models/                 # Model definitions and wrappers
β”‚   β”œβ”€β”€ base.py             # Base class for model wrappers
β”‚   β”œβ”€β”€ model_v1/           # Logic for 'Model V1' (DeBERTa hybrid)
β”‚   β”œβ”€β”€ model_v2/           # Logic for 'Model V2' (explainable + linguistic)
β”‚   └── model_v3/           # Logic for 'Model V3 (Multimodal)'
β”‚       β”œβ”€β”€ config.py       # Model configuration (weights path, model names)
β”‚       β”œβ”€β”€ model.py        # Neural network architecture (TextBranch, AudioBranch, etc.)
β”‚       β”œβ”€β”€ processor.py    # Preprocessing (linguistic features, spectrograms, ASR)
β”‚       └── wrapper.py      # Main wrapper class integrating all components
└── requirements.txt        # Project dependencies
```