---
title: Adtrack Backend
emoji: 🧠
colorFrom: red
colorTo: pink
sdk: docker
pinned: false
---
# Alzheimer's Detection Backend API
This repository contains a FastAPI-based backend for detecting Alzheimer's disease from linguistic data. It supports multiple machine learning models, including text-only analysis from .cha transcripts and a multimodal model (V3) that can process both text and audio.
## Features
- FastAPI Framework: High-performance, easy-to-use API.
- Support for .cha Files: Specialized parsing for CHAT format transcripts.
- Multimodal Audio Support (V3): Process raw audio files with Automatic Speech Recognition (ASR).
- Multiple AI Models:
  - Model V1: A DeBERTa-based hybrid model focusing on semantic understanding.
  - Model V2: An explainable model with rich linguistic features (TTR, fillers, pauses, etc.).
  - Model V3 (Multimodal): A multimodal fusion model combining text, audio spectrograms, and linguistic features.
- CORS Support: Configured to allow requests from frontend applications.
## Prerequisites
- Python 3.8+
- pip package manager
- FFmpeg: Required for audio processing in Model V3.
## Installation
1. Clone the repository:

```bash
git clone <repository_url>
cd <repository_name>
```

2. Create and activate a virtual environment (recommended):

```bash
# Windows
python -m venv venv
.\venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```
Note: Ensure you have torch installed with CUDA support if you intend to run on GPU.
## Usage
Start the server using uvicorn:
```bash
uvicorn main:app --reload
```
The server will start at http://127.0.0.1:8000.
### Configuration (Environment Variables)
| Variable | Description | Default |
|---|---|---|
| `SEGMENTATION_ROOT_PATH` | Root path for auto-discovering segmentation CSVs (structure: `path/AD/*.csv` and `path/Control/*.csv`) | None (disabled) |
Example (Windows PowerShell):

```powershell
$env:SEGMENTATION_ROOT_PATH = "D:\dataset\segmentation"
uvicorn main:app --reload
```
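The auto-discovery behavior described above can be sketched as follows. This is a minimal illustration, not the exact implementation, and `find_segmentation_csv` is a hypothetical name rather than a function taken from this codebase:

```python
import os
from pathlib import Path
from typing import Optional


def find_segmentation_csv(audio_filename: str) -> Optional[Path]:
    """Look for <root>/AD/<stem>.csv, then <root>/Control/<stem>.csv.

    Mirrors the documented discovery order; returns None when the
    variable is unset or no matching CSV exists (i.e. fall back to Mode 4).
    """
    root = os.environ.get("SEGMENTATION_ROOT_PATH")
    if not root:
        return None
    stem = Path(audio_filename).stem          # "F03.mp3" -> "F03"
    for group in ("AD", "Control"):
        candidate = Path(root) / group / f"{stem}.csv"
        if candidate.is_file():
            return candidate
    return None
```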
## API Documentation
### 1. Health Check

Endpoint: `GET /health`
Checks if the API is active and lists loaded models.
Response:
```json
{
  "status": "active",
  "loaded_models": ["Model V1", "Model V2", "Model V3 (Multimodal)"]
}
```
### 2. List Available Models

Endpoint: `GET /models`
Returns a list of all available model keys that can be used for prediction.
Response:
```json
{
  "models": ["Model V1", "Model V2", "Model V3 (Multimodal)"]
}
```
### 3. Predict / Analyze

Endpoint: `POST /predict`
Uploads files and processes them using the specified model.
Request Type: multipart/form-data
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model_name` | string | Yes | The key of the model (e.g., `Model V1`, `Model V2`, `Model V3 (Multimodal)`) |
| `file` | file (`.cha`) | Depends | The CHAT format transcript file. |
| `audio_file` | file (audio) | No | An audio file (e.g., `.wav`, `.mp3`). Only for Model V3. |
| `segmentation_file` | file (`.csv`) | No | Segmentation CSV with speaker intervals. Only for Model V3. |
#### Input Validation Rules

| Model | `file` (.cha) | `audio_file` | `segmentation_file` | Notes |
|---|---|---|---|---|
| Model V1 | Required | Ignored | Ignored | Text-only model. |
| Model V2 | Required | Ignored | Ignored | Text-only model. |
| Model V3 (Multimodal) | Optional | Optional | Optional | At least one file required. Supports 4 modes. |
## Model V3 (Multimodal) - Deep Dive
Model V3 is a multimodal fusion model that combines three branches of information for its predictions:
- Text Branch: Uses a DeBERTa transformer with an LSTM layer to encode textual semantics.
- Audio Branch: Uses a Vision Transformer (ViT) trained on spectrograms derived from the audio.
- Linguistic Branch: A simple feedforward network processing extracted linguistic features (TTR, filler ratio, pause ratio, etc.).
### Processing Modes
Model V3 intelligently handles four different input scenarios:
#### Mode 1: CHA File Only

- Input: A `.cha` transcript file.
- Process:
  1. Parses `*PAR:` (participant) lines from the CHA file.
  2. Cleans the text for the DeBERTa model.
  3. Extracts a 6-dimensional linguistic feature vector (TTR, fillers, repetitions, retracing, errors, pauses).
  4. The audio branch receives a zero tensor (no audio input).
- Use Case: When you have a pre-existing transcript and no audio.
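A minimal sketch of the Mode 1 text path (helper names are illustrative, not the codebase's actual functions): it pulls `*PAR:` utterances from a transcript and computes one of the six features, the type-token ratio:

```python
import re
from typing import List


def extract_par_lines(cha_text: str) -> List[str]:
    """Collect participant (*PAR:) utterances from a CHAT transcript."""
    return [line[len("*PAR:"):].strip()
            for line in cha_text.splitlines()
            if line.startswith("*PAR:")]


def type_token_ratio(utterances: List[str]) -> float:
    """TTR = unique words / total words, one of the six linguistic features."""
    tokens = re.findall(r"[a-zA-Z']+", " ".join(utterances).lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```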
#### Mode 2: CHA File + Audio

- Input: A `.cha` transcript file AND an audio file.
- Process:
  1. Parses the CHA file for text and linguistic features (same as Mode 1).
  2. Extracts timestamps (e.g., `15123_456`) from the CHA file.
  3. Uses these timestamps to slice the corresponding audio segments from the full audio file.
  4. Concatenates the slices and generates a spectrogram.
  5. Passes the spectrogram to the ViT-based audio branch.
- Use Case: For maximum accuracy when you have a professionally transcribed CHA file that is time-aligned with its source audio.
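The timestamp extraction in step 2 can be sketched with a regex over `start_end` millisecond markers. The function names below are illustrative; the actual slicing step would then hand these intervals to an audio library (e.g. pydub, where `audio[start:end]` selects milliseconds):

```python
import re
from typing import List, Tuple


def extract_time_intervals(cha_text: str) -> List[Tuple[int, int]]:
    """Pull start_end millisecond markers (e.g. 15123_17456) from CHA text."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)_(\d+)", cha_text)]


def total_speech_ms(intervals: List[Tuple[int, int]]) -> int:
    """Duration of the concatenated participant slices, in milliseconds."""
    return sum(max(0, end - start) for start, end in intervals)
```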
#### Mode 3: Audio + Segmentation (Auto-Discovery)

- Input: An audio file (e.g., `F03.mp3`).
- Process:
  1. Auto-discovers a segmentation CSV from the configured `SEGMENTATION_ROOT_PATH`:
     - Searches `SEGMENTATION_ROOT_PATH/AD/F03.csv`
     - Searches `SEGMENTATION_ROOT_PATH/Control/F03.csv`
  2. Parses the CSV to extract `PAR` (participant) speech intervals.
  3. Slices the audio using these intervals to isolate participant speech.
  4. Generates a spectrogram from the sliced audio.
  5. Uses Whisper ASR to transcribe the audio and extracts linguistic features.
- CSV Format: `speaker,start_ms,end_ms` (e.g., `PAR,1234,5678`)
- Use Case: When you have mixed audio (participant + interviewer) and segmentation files in a dataset folder.

> Note: If `SEGMENTATION_ROOT_PATH` is not configured or no matching CSV is found, the model falls back to Mode 4.
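Parsing the segmentation CSV (whether auto-discovered or uploaded as `segmentation_file`) only needs the standard `csv` module. This is a minimal sketch; `parse_par_intervals` is an illustrative name, not the actual function in the codebase:

```python
import csv
import io
from typing import List, Tuple


def parse_par_intervals(csv_bytes: bytes) -> List[Tuple[int, int]]:
    """Read speaker,start_ms,end_ms rows and keep only participant (PAR) speech.

    A header row is skipped naturally because its speaker column
    is "speaker", not "PAR".
    """
    reader = csv.reader(io.StringIO(csv_bytes.decode("utf-8")))
    intervals = []
    for row in reader:
        if len(row) == 3 and row[0].strip().upper() == "PAR":
            intervals.append((int(row[1]), int(row[2])))
    return intervals
```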
#### Mode 4: Audio Only (Participant-Only Audio)

- Input: An audio file only (no `.cha` or segmentation file found).
- Process:
  1. Assumes the entire audio contains only participant speech.
  2. Uses Whisper ASR to transcribe the full audio.
  3. Applies CHAT-like formatting rules (pause detection, repetition markers).
  4. Extracts linguistic features from the generated transcript.
  5. Generates a spectrogram from the full audio file.
- Use Case: Pre-segmented recordings or participant-only audio files.
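The pause-detection part of the CHAT-like formatting could look roughly like this, assuming the ASR pass yields word-level timestamps. The thresholds and the `insert_pause_markers` name are illustrative, not taken from the actual processor:

```python
from typing import List, Tuple


def insert_pause_markers(words: List[Tuple[str, float, float]],
                         short_pause: float = 0.5,
                         long_pause: float = 2.0) -> str:
    """Rebuild a CHAT-like line: (.) for short gaps, (..) for long ones.

    `words` is a list of (token, start_sec, end_sec) triples, as produced
    by a word-timestamped ASR pass.
    """
    out = []
    prev_end = None
    for token, start, end in words:
        if prev_end is not None:
            gap = start - prev_end
            if gap >= long_pause:
                out.append("(..)")
            elif gap >= short_pause:
                out.append("(.)")
        out.append(token)
        prev_end = end
    return " ".join(out)
```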
### Model V3 Response Format
```json
{
  "model_version": "v3_multimodal",
  "filename": "sample.cha",
  "predicted_label": "AD",
  "confidence": 0.8721,
  "modalities_used": ["text", "linguistic", "audio"],
  "visualizations": {
    "probabilities": {
      "AD": 0.8721,
      "Control": 0.1279
    },
    "key_contribution_segments": {
      "note": "Denser color indicates higher contribution to prediction",
      "segments": [
        {"text": "uh the water is overflowing", "contribution_score": 1.0},
        {"text": "and the the mother", "contribution_score": 0.5}
      ]
    }
  }
}
```
Response Fields:
| Field | Type | Description |
|---|---|---|
| `model_version` | string | Always `"v3_multimodal"` for this model. |
| `filename` | string | Name of the uploaded file, or `"audio_upload"` if no CHA file was provided. |
| `predicted_label` | string | The classification result: `"AD"` or `"Control"`. |
| `confidence` | float | The model's confidence score for the predicted label. |
| `modalities_used` | array[string] | Lists the modalities used (`"text"`, `"linguistic"`, `"audio"`). |
| `visualizations` | object | Contains visualization data for frontend rendering. |
Visualizations:
| Visualization | Description | Availability |
|---|---|---|
| `probabilities` | AD vs Control probability distribution. | All modes |
| `key_contribution_segments` | Text segments with the highest contribution to the prediction. Denser color = more impact. | All modes |

> Note: `key_contribution_segments` includes a `note` field explaining the color scheme and a `segments` array where each segment has a `text` and a `contribution_score` (0.0-1.0).
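A frontend might translate `contribution_score` into a highlight opacity. The helper below is hypothetical (the API only guarantees scores in the 0.0-1.0 range, not any particular color scheme):

```python
def score_to_rgba(score: float, base=(220, 53, 69)) -> str:
    """Map a contribution_score in [0, 1] to a CSS rgba() highlight color."""
    score = min(max(score, 0.0), 1.0)  # clamp to the documented range
    r, g, b = base
    return f"rgba({r}, {g}, {b}, {score:.2f})"
```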
### Example API Requests (cURL)
#### Model V1 / V2 (CHA File Only)

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V1" \
  -F "file=@/path/to/your/transcript.cha"
```

#### Model V3: CHA Only

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "file=@/path/to/your/transcript.cha"
```

#### Model V3: CHA + Audio

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "file=@/path/to/transcript.cha" \
  -F "audio_file=@/path/to/audio.wav"
```

#### Model V3: Audio Only (with auto-discovery or participant-only)

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "audio_file=@/path/to/F03.mp3"
```

If `SEGMENTATION_ROOT_PATH` is set and `F03.csv` exists, Mode 3 is used. Otherwise, Mode 4.

#### Model V3: Audio + Explicit Segmentation CSV

```bash
curl -X POST "http://127.0.0.1:8000/predict" \
  -F "model_name=Model V3 (Multimodal)" \
  -F "audio_file=@/path/to/audio.wav" \
  -F "segmentation_file=@/path/to/segments.csv"
```
## Older Model Response Formats
### A. Model V1 Output
Focuses on sequence classification and attention scores for sentences.
```json
{
  "filename": "sample.cha",
  "prediction": "DEMENTIA",
  "confidence": 0.85,
  "is_dementia": true,
  "attention_map": [
    {
      "sentence": "I saw the cookie jar.",
      "attention_score": 0.92
    }
  ],
  "model_used": "hybrid_deberta"
}
```
### B. Model V2 Output
Provides a rich set of metadata and linguistic features for explainability.
```json
{
  "filename": "sample.cha",
  "prediction": "Dementia",
  "probability_dementia": 0.78,
  "metadata": {
    "age": 72,
    "gender": "Female",
    "sentence_count": 15
  },
  "linguistic_features": {
    "TTR": 0.45,
    "fillers_ratio": 0.05,
    "repetitions_ratio": 0.02,
    "retracing_ratio": 0.01,
    "incomplete_ratio": 0.03,
    "pauses_ratio": 0.12
  },
  "key_segments": [
    {
      "text": "Um... checking the... the overflowing water.",
      "importance": 0.88
    }
  ],
  "model_used": "model_v2"
}
```
## Project Structure
```
.
├── main.py                  # Entry point, API routes, and CORS config
├── models/                  # Model definitions and wrappers
│   ├── base.py              # Base class for model wrappers
│   ├── model_v1/            # Logic for 'Model V1' (DeBERTa Hybrid)
│   ├── model_v2/            # Logic for 'Model V2' (Explainable + Linguistic)
│   └── model_v3/            # Logic for 'Model V3 (Multimodal)'
│       ├── config.py        # Model configuration (weights path, model names)
│       ├── model.py         # Neural network architecture (TextBranch, AudioBranch, etc.)
│       ├── processor.py     # Preprocessing (linguistic features, spectrograms, ASR)
│       └── wrapper.py       # The main wrapper class integrating all components
└── requirements.txt         # Project dependencies
```