Akis Giannoukos committed on
Commit 2e9e60e · 1 Parent(s): 21cf285

Updated README.md

Files changed (1):
  1. README.md +75 -267
README.md CHANGED
@@ -10,270 +10,78 @@ pinned: false
  short_description: MedGemma clinician chatbot demo (research prototype)
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-
-
- Technical Design Document: MedGemma-Based PHQ-9 Conversational Assessment Agent
-
- 1. Overview
-
- 1.1 Project Goal
-
- The goal of this project is to develop an AI-driven clinician simulation agent that conducts natural conversations with patients to assess depression severity based on the PHQ-9 (Patient Health Questionnaire-9) scale. Unlike simple questionnaire bots, this system aims to infer a patient's score implicitly through conversation and speech cues, mirroring a clinician's behavior in real-world interviews.
-
- 1.2 Core Concept
-
- The system will:
-
- Engage the user in a realistic, adaptive dialogue (clinician-style questioning).
-
- Continuously analyze textual and vocal features to estimate PHQ-9 category scores.
-
- Stop automatically when confidence in all PHQ-9 items is sufficiently high.
-
- Produce a final PHQ-9 severity report.
-
- The system will use a configurable LLM (e.g., Gemma-2-2B-IT or MedGemma-4B-IT) as the base model for both:
-
- - A Recording Agent (conversational component)
-
- - A Scoring Agent (PHQ-9 inference component)
-
- 2. System Architecture
-
- 2.1 High-Level Components
-
- - Frontend Client: Handles user interaction, voice input/output, and UI display.
- - Speech I/O Module: Converts speech to text (ASR) and text to speech (TTS).
- - Feature Extraction Module: Extracts acoustic and prosodic features via librosa (lightweight prosody proxies) for emotional/speech analysis.
- - Recording Agent (Chatbot): Conducts clinician-like conversation with adaptive questioning.
- - Scoring Agent: Evaluates PHQ-9 symptom probabilities after each exchange and determines confidence in the final diagnosis.
- - Controller / Orchestrator: Manages communication between agents and triggers scoring cycles.
- - Model Backend: Hosts a configurable LLM (e.g., Gemma-2-2B-IT, MedGemma-4B-IT), prompted for clinician reasoning.
-
- 2.2 Architecture Diagram (Text Description)
-
- ┌────────────────────────┐
- │ Frontend Client        │
- │ (Web / Desktop App)    │
- │ - Voice Input/Output   │
- │ - Text Display         │
- └──────────┬─────────────┘
-            │
-      (Audio stream)
-            │
- ┌──────────▼─────────────┐
- │ Speech I/O Module      │
- │ - ASR (Whisper)        │
- │ - TTS (e.g., Coqui)    │
- └──────────┬─────────────┘
-            │
-            ▼
- ┌──────────────────────────────────────────────────────────────┐
- │ Feature Extraction Module                                    │
- │ - librosa (prosody pitch, energy/loudness, timing/phonation) │
- └──────────┬───────────────────────────────────────────────────┘
-            │
-            ▼
- ┌────────────────────────────────┐
- │ Recording Agent (MedGemma)     │
- │ - Generates next question      │
- │ - Conversational context       │
- └──────────┬─────────────────────┘
-            │
-            ▼
- ┌────────────────────────────────┐
- │ Scoring Agent (MedGemma)       │
- │ - Maps text+voice features →   │
- │   PHQ-9 dimension confidences  │
- │ - Determines if assessment done│
- └──────────┬─────────────────────┘
-            │
-            ▼
- ┌────────────────────────────────┐
- │ Controller / Orchestrator      │
- │ - Loop until confidence ≥ τ    │
- │ - Output PHQ-9 report          │
- └────────────────────────────────┘
-
- 3. Agent Design
-
- 3.1 Recording Agent
-
- Role: Simulates a clinician conducting an empathetic, open-ended dialogue to elicit responses relevant to the PHQ-9 categories (mood, sleep, appetite, concentration, energy, self-worth, psychomotor changes, suicidal ideation).
-
- Key Responsibilities:
-
- Maintain conversational context.
-
- Adapt follow-up questions based on the inferred patient state.
-
- Produce text responses using a configurable LLM (e.g., Gemma-2-2B-IT, MedGemma-4B-IT) with a clinician-style prompt template.
-
- After each user response, trigger the Scoring Agent to reassess.
-
- Prompt Skeleton Example:
-
- System: You are a clinician conducting a conversational assessment to infer PHQ-9 symptoms without listing questions.
-         Keep the tone empathetic, natural, and human.
- User: [transcribed patient input]
- Assistant: [clinician-style response / next question]
-
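In code, the skeleton corresponds to a chat-format message list. A minimal sketch, assuming a hypothetical `build_messages` helper (not part of the app's actual code):

```python
# Minimal sketch: assemble the clinician prompt skeleton into chat messages.
# `build_messages` is a hypothetical helper, not the app's real interface.

SYSTEM_PROMPT = (
    "You are a clinician conducting a conversational assessment to infer "
    "PHQ-9 symptoms without listing questions. Keep the tone empathetic, "
    "natural, and human."
)

def build_messages(history, user_text):
    """history: list of (user, assistant) turns; returns chat-format messages."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": user_text})
    return messages

msgs = build_messages(
    [("I haven't been sleeping well.",
      "That sounds hard. How long has this been going on?")],
    "A few weeks now.",
)
```

A list in this shape can be fed to a chat template or an OpenAI-style chat endpoint.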
- 3.2 Scoring Agent
-
- Role: Evaluates the ongoing conversation to infer a PHQ-9 score distribution and confidence values for each symptom.
-
- Input:
-
- Conversation transcript (all turns)
-
- librosa/OpenSmile features (prosody, energy, speech rate)
-
- Optional: timestamped emotional embeddings (via a pretrained affect model)
-
- Output:
-
- Vector of 9 PHQ-9 scores (0–3)
-
- Confidence scores per item
-
- Overall depression severity classification (Minimal, Mild, Moderate, Moderately Severe, Severe)
-
- Operation Flow:
-
- Parse the full transcript and extract statements relevant to each PHQ-9 item.
-
- Combine textual cues + acoustic cues.
-
- Fusion mechanism: Acoustic features are summarized into a compact JSON and included in the scoring prompt alongside the transcript (early, prompt-level fusion).
-
- Use the LLM's reasoning chain to map features to PHQ-9 scores.
-
- When confidence for all items ≥ threshold τ (e.g., 0.8), finalize results and signal termination.
-
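The prompt-level fusion step can be sketched as follows; the feature names and prompt wording here are illustrative assumptions, not the app's actual schema:

```python
import json

def summarize_features(rms, zero_crossing_rate, speech_rate_wps):
    """Summarize acoustic proxies into a compact JSON string for the prompt.
    The keys are hypothetical examples, not a fixed schema."""
    return json.dumps({
        "rms_energy": round(rms, 4),
        "zero_crossing_rate": round(zero_crossing_rate, 4),
        "speech_rate_wps": round(speech_rate_wps, 2),
    })

def build_scoring_prompt(transcript, feature_json):
    # Early (prompt-level) fusion: transcript and acoustic summary share one prompt.
    return (
        "Transcript so far:\n" + transcript + "\n\n"
        "Acoustic feature summary (JSON):\n" + feature_json + "\n\n"
        "Rate each PHQ-9 item 0-3 and give a confidence in [0, 1] for each."
    )

prompt = build_scoring_prompt(
    "Patient: I barely sleep these days.",
    summarize_features(0.0213, 0.091, 1.8),
)
```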
- 4. Data Flow
-
- 1. User speaks → audio captured.
- 2. ASR transcribes text.
- 3. librosa/OpenSmile extracts voice features (prosody proxies).
- 4. Recording Agent uses the transcript (and optionally summarized features) → next conversational message.
- 5. Scoring Agent evaluates cumulative context → PHQ-9 score vector + confidence.
- 6. If confidence < τ → continue conversation; else → output final diagnosis.
- 7. TTS module vocalizes clinician output.
-
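The flow above can be sketched as an orchestrator loop; every function passed in below is a hypothetical stub standing in for the real ASR, feature, and agent components:

```python
# Minimal orchestrator sketch with stubbed components (hypothetical stand-ins).
TAU = 0.8        # confidence threshold τ
MAX_TURNS = 12   # hard stop

def run_assessment(transcribe, extract_features, recording_agent,
                   scoring_agent, get_audio):
    transcript = []
    for _ in range(MAX_TURNS):
        audio = get_audio()                      # 1. capture audio
        text = transcribe(audio)                 # 2. ASR
        features = extract_features(audio)       # 3. prosody proxies
        transcript.append(("patient", text))
        scores, confidences = scoring_agent(transcript, features)  # 5. score
        if min(confidences.values()) >= TAU:     # 6. stop condition
            return scores
        transcript.append(("clinician", recording_agent(transcript)))  # 4. next turn
    return scores  # turn cap reached

# Stub usage: the scoring agent is confident after the first exchange.
result = run_assessment(
    transcribe=lambda a: "I feel tired.",
    extract_features=lambda a: {},
    recording_agent=lambda t: "How is your sleep?",
    scoring_agent=lambda t, f: ({"energy": 2}, {"energy": 0.9}),
    get_audio=lambda: b"",
)
```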
- 5. Implementation Details
-
- 5.1 Models and Libraries
-
- | Function       | Tool / Library                                              |
- |----------------|-------------------------------------------------------------|
- | Base LLM       | Configurable (e.g., Gemma-2-2B-IT, MedGemma-4B-IT)          |
- | ASR            | Whisper                                                     |
- | TTS            | gTTS (preferred), Coqui TTS, or Bark                        |
- | Audio Features | librosa (RMS, ZCR, spectral centroid, f0, energy, duration) |
- | Backend        | Python / Gradio (Spaces)                                    |
- | Frontend       | Gradio                                                      |
- | Communication  | Gradio UI                                                   |
-
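For intuition, two of the listed proxies (RMS energy and zero-crossing rate) can be computed by hand. A pure-Python sketch on a synthetic frame; the app would use librosa's equivalents such as `librosa.feature.rms` and `librosa.feature.zero_crossing_rate`:

```python
import math

def rms(frame):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

# Synthetic 8-sample frame standing in for real audio.
frame = [0.0, 0.5, 0.3, -0.2, -0.4, 0.1, 0.2, -0.1]
energy = rms(frame)               # overall loudness proxy
zcr = zero_crossing_rate(frame)   # rough noisiness/voicing proxy
```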
- 5.2 Confidence Computation
-
- Each PHQ-9 item i has a confidence score c_i ∈ [0, 1].
-
- c_i is estimated via secondary LLM reasoning (e.g., "How confident are you about this inference?").
-
- Global confidence: C = min_i c_i.
-
- Stop condition: C ≥ τ, e.g., τ = 0.8.
-
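The stop rule reduces to a min over per-item confidences; a minimal sketch using the item names from the output specification:

```python
TAU = 0.8  # confidence threshold τ

def should_stop(confidences):
    """confidences: per-item c_i in [0, 1]; stop when C = min_i c_i >= τ."""
    return min(confidences.values()) >= TAU

c = {"interest": 0.9, "mood": 0.85, "sleep": 0.95, "energy": 0.9,
     "appetite": 0.82, "self_worth": 0.88, "concentration": 0.91,
     "motor": 0.84, "suicidal_thoughts": 0.99}
should_stop(c)       # True: every item is at or above τ
c["sleep"] = 0.5
should_stop(c)       # False: one low-confidence item blocks termination
```

Taking the minimum (rather than the mean) guarantees no single PHQ-9 item is finalized while still uncertain.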
- 5.3 Example API Workflow
-
- POST /api/message
- {
-   "audio": <base64 encoded>,
-   "transcript": "...",
-   "features": {...}
- }
- →
- {
-   "agent_response": "...",
-   "phq9_scores": [1, 0, 2, ...],
-   "confidences": [0.9, 0.85, ...],
-   "finished": false
- }
-
- 6. Training and Fine-Tuning (future work; not implemented now, as the data is not yet available)
-
- Supervised Fine-Tuning (SFT) using synthetic dialogues labeled with PHQ-9 scores.
-
- Speech-text alignment: fuse OpenSmile embeddings with conversation text embeddings before feeding them to scoring prompts.
-
- Possible multi-modal fusion via:
-
- Feature concatenation → token embedding, or
-
- a cross-attention adapter (if fine-tuning is allowed).
-
- 7. Output Specification
-
- Final Output:
-
- {
-   "PHQ9_Scores": {
-     "interest": 2,
-     "mood": 3,
-     "sleep": 2,
-     "energy": 2,
-     "appetite": 1,
-     "self_worth": 2,
-     "concentration": 1,
-     "motor": 1,
-     "suicidal_thoughts": 0
-   },
-   "Total_Score": 14,
-   "Severity": "Moderate Depression",
-   "Confidence": 0.86
- }
-
- Displayed alongside a clinician-style summary:
-
- "Based on our discussion, your responses suggest moderate depressive symptoms, with difficulties in mood and sleep being most prominent."
-
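The `Total_Score`-to-`Severity` mapping follows the standard PHQ-9 bands (0–4 minimal, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe); a minimal sketch:

```python
def phq9_severity(total):
    """Standard PHQ-9 severity bands for a total score in 0-27."""
    if total <= 4:
        return "Minimal"
    if total <= 9:
        return "Mild"
    if total <= 14:
        return "Moderate"
    if total <= 19:
        return "Moderately Severe"
    return "Severe"

# The item scores from the example output above.
scores = {"interest": 2, "mood": 3, "sleep": 2, "energy": 2, "appetite": 1,
          "self_worth": 2, "concentration": 1, "motor": 1,
          "suicidal_thoughts": 0}
total = sum(scores.values())
severity = phq9_severity(total)
```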
- 8. Termination and Safety
-
- The system will not offer therapy advice or emergency counseling.
-
- If the patient mentions suicidal thoughts (item 9), the system:
-
- 1. flags high risk,
- 2. terminates the chat, and
- 3. displays emergency contact information (e.g., "If you are in danger or need immediate help, call 988 in the U.S.").
-
- 9. Future Extensions (not implemented now)
-
- Fine-tuned model jointly trained on PHQ-9-labeled conversations.
-
- Multilingual support (via multilingual Whisper and TTS).
-
- Confidence calibration using Bayesian reasoning or uncertainty quantification.
-
- Integration with EHR systems for clinician verification.
-
- 10. Summary
-
- This project creates an intelligent, conversational PHQ-9 assessment agent that blends:
-
- the MedGemma-4B-IT medical LLM,
-
- audio emotion analysis with OpenSmile,
-
- a dual-agent architecture for conversation and scoring,
-
- and multimodal reasoning to deliver clinician-like mental health assessments.
-
- The modular design enables local deployment on GPU servers, privacy-preserving operation, and future research extensions into multimodal diagnostic reasoning.
+ # PHQ-9 Clinician Agent (Voice-first)
+
+ A lightweight research demo that simulates a clinician conducting a brief conversational PHQ-9 screening. The app is voice-first: you tap a circular mic bubble to talk; the model replies and can speak back via TTS. A separate Advanced tab exposes scoring and configuration.
+
+ ## What it does
+ - Conversational assessment to infer PHQ-9 items from natural dialogue (no explicit questionnaire).
+ - Live inference of PHQ-9 item scores, confidences, total score, and severity.
+ - Automatic stop when the minimum confidence across items reaches a threshold or risk is detected.
+ - Optional TTS playback for clinician responses.
+
+ ## UI overview
+ - Main tab: large circular mic "Record" bubble
+   - Tap to start, tap again to stop (processing runs on stop)
+   - While speaking back (TTS), the bubble shows a speaking state
+ - Chat tab: plain chat transcript (for reviewing turns)
+ - Advanced tab:
+   - PHQ-9 Assessment JSON (live)
+   - Severity label
+   - Confidence threshold slider (τ)
+   - Toggle: Speak clinician responses (TTS)
+   - Model ID textbox and "Apply model" button
+
+ ## Quick start (local)
+ 1. Python 3.10+ recommended.
+ 2. Install deps:
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 3. Run the app:
+ ```bash
+ python app.py
+ ```
+ 4. Open the URL shown in the console (the server listens on port `7860`; open `http://localhost:7860` in your browser). Allow microphone access in your browser.
+
+ ## Configuration
+ Environment variables (all optional):
+ - `LLM_MODEL_ID` (default `google/gemma-2-2b-it`): chat model id
+ - `ASR_MODEL_ID` (default `openai/whisper-tiny.en`): speech-to-text model id
+ - `CONFIDENCE_THRESHOLD` (default `0.8`): stop when min item confidence ≥ τ
+ - `MAX_TURNS` (default `12`): hard stop cap
+ - `USE_TTS` (default `true`): enable TTS playback
+ - `MODEL_CONFIG_PATH` (default `model_config.json`): persisted model id
+ - `PORT` (default `7860`): server port
+
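Reading these variables with their documented defaults can be sketched as follows; this is a sketch of the shape only, and the app's actual config code may differ:

```python
import os

# Minimal sketch: read the documented environment variables with their
# documented defaults. Key names in CONFIG are illustrative, not the app's.
CONFIG = {
    "llm_model_id": os.environ.get("LLM_MODEL_ID", "google/gemma-2-2b-it"),
    "asr_model_id": os.environ.get("ASR_MODEL_ID", "openai/whisper-tiny.en"),
    "confidence_threshold": float(os.environ.get("CONFIDENCE_THRESHOLD", "0.8")),
    "max_turns": int(os.environ.get("MAX_TURNS", "12")),
    "use_tts": os.environ.get("USE_TTS", "true").lower() == "true",
    "model_config_path": os.environ.get("MODEL_CONFIG_PATH", "model_config.json"),
    "port": int(os.environ.get("PORT", "7860")),
}
```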
+ Notes:
+ - If a GPU is available, the app will use it automatically for Transformers pipelines.
+ - Changing the model in Advanced will reload the text-generation pipeline on the next turn.
+
+ ## How to use
+ 1. Go to Main and tap the mic bubble. Speak naturally.
+ 2. Tap again to finish your turn. The model replies; if TTS is enabled, you'll hear it.
+ 3. The Advanced tab updates live with PHQ-9 scores and severity. Adjust the confidence threshold if you want the assessment to stop earlier or later.
+
+ ## Troubleshooting
+ - No mic input detected:
+   - Ensure the site has microphone permission in your browser settings.
+   - Try refreshing the page after granting permission.
+ - Can't hear TTS:
+   - Enable the "Speak clinician responses (TTS)" toggle in Advanced.
+   - Ensure your system audio output is correct. Some browsers block auto-play without interaction; use the mic once, then it should work.
+ - Model download slow or fails:
+   - Check internet connectivity and try again. Some models are large.
+ - Assessment doesn't stop:
+   - Lower the confidence threshold slider (τ) in Advanced (stopping requires min item confidence ≥ τ), or wait for the turn cap (`MAX_TURNS`).
+
+ ## Safety
+ This demo does not provide therapy or emergency counseling. If a user expresses suicidal intent or risk is inferred, the app ends the conversation and advises contacting emergency services (e.g., 988 in the U.S.).
+
+ ## Development notes
+ - Framework: Gradio Blocks
+ - ASR: Transformers pipeline (Whisper)
+ - TTS: gTTS
+ - Prosody features: librosa (lightweight proxies) for the scoring prompt
+
+ PRs and experiments are welcome. This is a research prototype and not a clinical tool.