Akis Giannoukos committed
Commit · 2e9e60e
Parent(s): 21cf285
Updated README.md

README.md CHANGED

@@ -10,270 +10,78 @@ pinned: false
short_description: MedGemma clinician chatbot demo (research prototype)
---
| 13 |

     ──────────┬──────────────────────
               │
               ▼
┌───────────────────────────────┐
│ Controller / Orchestrator     │
│ - Loop until confidence ≥ τ   │
│ - Output PHQ-9 report         │
└───────────────────────────────┘

3. Agent Design

3.1 Recording Agent

Role: Simulates a clinician conducting an empathetic, open-ended dialogue to elicit responses relevant to the PHQ-9 categories (mood, sleep, appetite, concentration, energy, self-worth, psychomotor changes, suicidal ideation).

Key Responsibilities:

- Maintain conversational context.
- Adapt follow-up questions based on inferred patient state.
- Produce text responses using a configurable LLM (e.g. Gemma-2-2B-IT, MedGemma-4B-IT) with a clinician-style prompt template.
- After each user response, trigger the Scoring Agent to reassess.

Prompt Skeleton Example:

System: You are a clinician conducting a conversational assessment to infer PHQ-9 symptoms without listing questions.
Keep tone empathetic, natural, and human.
User: [transcribed patient input]
Assistant: [clinician-style response / next question]

3.2 Scoring Agent

Role: Evaluates the ongoing conversation to infer a PHQ-9 score distribution and confidence values for each symptom.

Input:

- Conversation transcript (all turns)
- OpenSmile features (prosody, energy, speech rate)
- Optional: timestamped emotional embeddings (via a pretrained affect model)

Output:

- Vector of 9 PHQ-9 scores (0–3)
- Confidence scores per question
- Overall depression severity classification (Minimal, Mild, Moderate, Moderately Severe, Severe)

Operation Flow:

1. Parse the full transcript and extract statements relevant to each PHQ-9 item.
2. Combine textual cues + acoustic cues.
   - Fusion mechanism: acoustic features are summarized into a compact JSON and included in the scoring prompt alongside the transcript (early, prompt-level fusion).
3. Use the LLM's reasoning chain to map features to PHQ-9 scores.
4. When confidence for all items ≥ threshold τ (e.g., 0.8), finalize results and signal termination.
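The prompt-level fusion step can be sketched as below; the prompt wording and the feature field names are illustrative assumptions, not the project's actual schema.

```python
import json

def build_scoring_prompt(transcript: str, acoustic: dict) -> str:
    """Early (prompt-level) fusion sketch: serialize the acoustic summary as
    compact JSON and place it next to the transcript for the scoring LLM.
    Field names and wording here are hypothetical."""
    features_json = json.dumps(acoustic, separators=(",", ":"))
    return (
        "You are scoring a PHQ-9 screening conversation.\n"
        f"Acoustic summary (JSON): {features_json}\n"
        f"Transcript:\n{transcript}\n"
        "Return a score 0-3 and a confidence in [0,1] for each of the 9 items."
    )

prompt = build_scoring_prompt(
    "Patient: I barely sleep lately...",
    {"speech_rate_wps": 1.4, "mean_f0_hz": 110.2, "rms_energy": 0.03},
)
```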

4. Data Flow

1. User speaks → audio captured.
2. ASR transcribes text.
3. librosa/OpenSmile extracts voice features (prosody proxies).
4. Recording Agent uses the transcript (and optionally summarized features) → next conversational message.
5. Scoring Agent evaluates the cumulative context → PHQ-9 score vector + confidence.
6. If confidence < τ → continue conversation; else → output final diagnosis.
7. TTS module vocalizes the clinician output.
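Assuming each stage is exposed as a callable, the loop above can be sketched end-to-end; every function name below is a hypothetical stand-in for the real ASR/LLM/TTS calls, with stubs so the sketch runs.

```python
def run_turn(audio, state, tau=0.8):
    """One conversational turn of the pipeline sketched in Section 4."""
    transcript = transcribe(audio)               # 2. ASR (e.g. Whisper)
    features = extract_prosody(audio)            # 3. librosa/OpenSmile proxies
    state["history"].append(transcript)
    reply = recording_agent(state["history"])    # 4. next clinician message
    scores, confidences = scoring_agent(state["history"], features)  # 5.
    finished = min(confidences) >= tau           # 6. stop once all items are confident
    return reply, scores, confidences, finished

# Stub implementations standing in for the real components:
def transcribe(audio): return "I haven't been sleeping well."
def extract_prosody(audio): return {"rms": 0.02}
def recording_agent(history): return "How has your energy been during the day?"
def scoring_agent(history, feats): return [1] * 9, [0.9] * 9

reply, scores, confs, done = run_turn(b"...", {"history": []})
```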

5. Implementation Details

5.1 Models and Libraries

| Function | Tool / Library |
| --- | --- |
| Base LLM | Configurable (e.g. Gemma-2-2B-IT, MedGemma-4B-IT) |
| ASR | Whisper |
| TTS | gTTS (preferably), Coqui TTS, or Bark |
| Audio Features | librosa (RMS, ZCR, spectral centroid, f0, energy, duration) |
| Backend | Python / Gradio (Spaces) |
| Frontend | Gradio |
| Communication | Gradio UI |
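Two of the proxies listed in the table (RMS and zero-crossing rate) can be sketched with NumPy alone, which keeps the example dependency-free; the function name is hypothetical, and the real app computes these via librosa.

```python
import numpy as np

def prosody_proxies(y: np.ndarray, sr: int) -> dict:
    """Lightweight stand-ins for RMS, ZCR, and duration."""
    rms = float(np.sqrt(np.mean(y ** 2)))
    # Zero-crossing rate: fraction of consecutive samples whose sign changes.
    zcr = float(np.mean(np.abs(np.diff(np.sign(y))) > 0))
    return {"rms": rms, "zcr": zcr, "duration_s": len(y) / sr}

# One second of a 220 Hz tone at 16 kHz, amplitude 0.1:
sr = 16_000
t = np.linspace(0, 1.0, sr, endpoint=False)
feats = prosody_proxies(0.1 * np.sin(2 * np.pi * 220 * t), sr)
```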

5.2 Confidence Computation

- Each PHQ-9 item i has a confidence score c_i ∈ [0, 1].
- c_i is estimated via secondary LLM reasoning (e.g., "How confident are you about this inference?").
- Global confidence C = min_i c_i.
- Stop condition: C ≥ τ, e.g., 0.8.
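A minimal sketch of this stopping rule:

```python
def global_confidence(confidences: list[float]) -> float:
    # C = min_i c_i: the least certain PHQ-9 item bounds overall confidence.
    return min(confidences)

def should_stop(confidences: list[float], tau: float = 0.8) -> bool:
    # Stop condition from Section 5.2: C >= tau.
    return global_confidence(confidences) >= tau

stop = should_stop([0.9, 0.85, 0.92, 0.88, 0.81, 0.95, 0.9, 0.83, 0.86])
```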

5.3 Example API Workflow

POST /api/message
{
  "audio": <base64 encoded>,
  "transcript": "...",
  "features": {...}
}
→
{
  "agent_response": "...",
  "phq9_scores": [1, 0, 2, ...],
  "confidences": [0.9, 0.85, ...],
  "finished": false
}
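A hypothetical handler illustrating the request/response shapes above; the real app serves turns through Gradio rather than a hand-rolled endpoint, and the fixed values here exist only to keep the sketch runnable.

```python
def handle_message(request: dict, tau: float = 0.8) -> dict:
    """Map one /api/message-style request to the documented response shape."""
    # A real handler would feed request["transcript"] and request["features"]
    # to the Recording and Scoring agents; these values are placeholders.
    scores = [1, 0, 2, 1, 1, 0, 1, 0, 0]
    confidences = [0.9, 0.85, 0.8, 0.82, 0.81, 0.88, 0.8, 0.9, 0.95]
    return {
        "agent_response": "Thanks for sharing that. How has your sleep been?",
        "phq9_scores": scores,
        "confidences": confidences,
        "finished": min(confidences) >= tau,
    }

resp = handle_message({"audio": "<base64>", "transcript": "...", "features": {}})
```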

6. Training and Fine-Tuning (Future work; not implemented now, as we do not have the data at the moment.)

- Supervised Fine-Tuning (SFT) using synthetic dialogues labeled with PHQ-9 scores.
- Speech-text alignment: fuse OpenSmile embeddings with conversation text embeddings before feeding to scoring prompts.
- Possible multi-modal fusion via:
  - feature concatenation → token embedding, or
  - a cross-attention adapter (if fine-tuning is allowed).

7. Output Specification

Final Output:

{
  "PHQ9_Scores": {
    "interest": 2,
    "mood": 3,
    "sleep": 2,
    "energy": 2,
    "appetite": 1,
    "self_worth": 2,
    "concentration": 1,
    "motor": 1,
    "suicidal_thoughts": 0
  },
  "Total_Score": 14,
  "Severity": "Moderate Depression",
  "Confidence": 0.86
}
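The Total_Score and Severity fields follow the standard PHQ-9 bands (0–4 minimal, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe), which can be sketched as:

```python
def phq9_severity(total: int) -> str:
    """Map a PHQ-9 total score (0-27) to the standard severity band."""
    if total <= 4:
        return "Minimal"
    if total <= 9:
        return "Mild"
    if total <= 14:
        return "Moderate"
    if total <= 19:
        return "Moderately Severe"
    return "Severe"

# The item scores from the example output above sum to 14 -> "Moderate".
scores = {"interest": 2, "mood": 3, "sleep": 2, "energy": 2, "appetite": 1,
          "self_worth": 2, "concentration": 1, "motor": 1, "suicidal_thoughts": 0}
total = sum(scores.values())
```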

Displayed alongside a clinician-style summary:

"Based on our discussion, your responses suggest moderate depressive symptoms, with difficulties in mood and sleep being most prominent."

8. Termination and Safety

- The system will not offer therapy advice or emergency counseling.
- If the patient mentions suicidal thoughts (item 9), the system:
  - flags high risk,
  - terminates the chat, and
  - displays emergency contact information (e.g., "If you are in danger or need immediate help, call 988 in the U.S.").

9. Future Extensions (Not implemented now)

- Fine-tuned model jointly trained on PHQ-9 labeled conversations.
- Multilingual support (via multilingual Whisper and TTS).
- Confidence calibration using Bayesian reasoning or uncertainty quantification.
- Integration with EHR systems for clinician verification.

10. Summary

This project creates an intelligent, conversational PHQ-9 assessment agent that blends:

- the MedGemma-4B-IT medical LLM,
- audio emotion analysis with OpenSmile,
- a dual-agent architecture for conversation and scoring, and
- multimodal reasoning to deliver clinician-like mental health assessments.

The modular design enables local deployment on GPU servers, privacy-preserving operation, and future research extensions into multimodal diagnostic reasoning.

# PHQ-9 Clinician Agent (Voice-first)

A lightweight research demo that simulates a clinician conducting a brief conversational PHQ-9 screening. The app is voice-first: you tap a circular mic bubble to talk; the model replies and can speak back via TTS. A separate Advanced tab exposes scoring and configuration.

## What it does

- Conversational assessment to infer PHQ-9 items from natural dialogue (no explicit questionnaire).
- Live inference of PHQ-9 item scores, confidences, total score, and severity.
- Automatic stop when the minimum confidence across items reaches a threshold or risk is detected.
- Optional TTS playback for clinician responses.

## UI overview

- Main tab: large circular mic "Record" bubble
  - Tap to start, tap again to stop (processing runs on stop)
  - While speaking back (TTS), the bubble shows a speaking state
- Chat tab: plain chat transcript (for reviewing turns)
- Advanced tab:
  - PHQ-9 Assessment JSON (live)
  - Severity label
  - Confidence threshold slider (τ)
  - Toggle: Speak clinician responses (TTS)
  - Model ID textbox and "Apply model" button

## Quick start (local)

1. Python 3.10+ recommended.
2. Install deps:
   ```bash
   pip install -r requirements.txt
   ```
3. Run the app:
   ```bash
   python app.py
   ```
4. Open the URL shown in the console (defaults to `http://0.0.0.0:7860`). Allow microphone access in your browser.

## Configuration

Environment variables (all optional):

- `LLM_MODEL_ID` (default `google/gemma-2-2b-it`): chat model id
- `ASR_MODEL_ID` (default `openai/whisper-tiny.en`): speech-to-text model id
- `CONFIDENCE_THRESHOLD` (default `0.8`): stop when min item confidence ≥ τ
- `MAX_TURNS` (default `12`): hard stop cap
- `USE_TTS` (default `true`): enable TTS playback
- `MODEL_CONFIG_PATH` (default `model_config.json`): persisted model id
- `PORT` (default `7860`): server port
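One way to read these variables with their documented defaults; this is a sketch of the pattern, and how `app.py` actually parses them may differ.

```python
import os

def load_config() -> dict:
    """Read the documented environment variables, falling back to defaults."""
    return {
        "llm_model_id": os.environ.get("LLM_MODEL_ID", "google/gemma-2-2b-it"),
        "asr_model_id": os.environ.get("ASR_MODEL_ID", "openai/whisper-tiny.en"),
        "confidence_threshold": float(os.environ.get("CONFIDENCE_THRESHOLD", "0.8")),
        "max_turns": int(os.environ.get("MAX_TURNS", "12")),
        "use_tts": os.environ.get("USE_TTS", "true").lower() == "true",
        "model_config_path": os.environ.get("MODEL_CONFIG_PATH", "model_config.json"),
        "port": int(os.environ.get("PORT", "7860")),
    }

cfg = load_config()
```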

Notes:

- If a GPU is available, the app will use it automatically for Transformers pipelines.
- Changing the model in Advanced will reload the text-generation pipeline on the next turn.

## How to use

1. Go to Main and tap the mic bubble. Speak naturally.
2. Tap again to finish your turn. The model replies; if TTS is enabled, you'll hear it.
3. The Advanced tab updates live with PHQ-9 scores and severity. Adjust the confidence threshold if you want the assessment to stop earlier or later.

## Troubleshooting

- No mic input detected:
  - Ensure the site has microphone permission in your browser settings.
  - Try refreshing the page after granting permission.
- Can't hear TTS:
  - Enable the "Speak clinician responses (TTS)" toggle in Advanced.
  - Ensure your system audio output is correct. Some browsers block auto-play without interaction; use the mic once, then it should work.
- Model download slow or fails:
  - Check internet connectivity and try again. Some models are large.
- Assessment doesn't stop:
  - Lower the confidence threshold slider (τ) in Advanced so the stop condition is reached sooner, or wait until the cap (`MAX_TURNS`).

## Safety

This demo does not provide therapy or emergency counseling. If a user expresses suicidal intent or risk is inferred, the app ends the conversation and advises contacting emergency services (e.g., 988 in the U.S.).

## Development notes

- Framework: Gradio Blocks
- ASR: Transformers pipeline (Whisper)
- TTS: gTTS
- Prosody features: librosa (lightweight proxies) for the scoring prompt

PRs and experiments are welcome. This is a research prototype and not a clinical tool.