Akis Giannoukos committed on
Commit 2e9e60e · 1 Parent(s): 21cf285

Updated README.md

Files changed (1):
  1. README.md +75 -267
README.md CHANGED
@@ -10,270 +10,78 @@ pinned: false
  short_description: MedGemma clinician chatbot demo (research prototype)
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
-
-
- Technical Design Document: MedGemma-Based PHQ-9 Conversational Assessment Agent
-
- 1. Overview
-
- 1.1 Project Goal
-
- The goal of this project is to develop an AI-driven clinician simulation agent that conducts natural conversations with patients to assess depression severity based on the PHQ-9 (Patient Health Questionnaire-9) scale. Unlike simple questionnaire bots, this system aims to infer a patient's score implicitly through conversation and speech cues, mirroring a clinician's behavior in real-world interviews.
-
- 1.2 Core Concept
-
- The system will:
-
- Engage the user in a realistic, adaptive dialogue (clinician-style questioning).
-
- Continuously analyze textual and vocal features to estimate PHQ-9 category scores.
-
- Stop automatically when confidence in all PHQ-9 items is sufficiently high.
-
- Produce a final PHQ-9 severity report.
-
- The system will use a configurable LLM (e.g., Gemma-2-2B-IT or MedGemma-4B-IT) as the base model for both:
-
- - A Recording Agent (conversational component)
-
- - A Scoring Agent (PHQ-9 inference component)
-
- 2. System Architecture
-
- 2.1 High-Level Components
-
- - Frontend Client: Handles user interaction, voice input/output, and UI display.
- - Speech I/O Module: Converts speech to text (ASR) and text to speech (TTS).
- - Feature Extraction Module: Extracts acoustic and prosodic features via librosa (lightweight prosody proxies) for emotional/speech analysis.
- - Recording Agent (Chatbot): Conducts clinician-like conversation with adaptive questioning.
- - Scoring Agent: Evaluates PHQ-9 symptom probabilities after each exchange and determines confidence in the final diagnosis.
- - Controller / Orchestrator: Manages communication between agents and triggers scoring cycles.
- - Model Backend: Hosts a configurable LLM (e.g., Gemma-2-2B-IT, MedGemma-4B-IT), prompted for clinician reasoning.
-
- 2.2 Architecture Diagram (Text Description)
-
- ┌────────────────────────┐
- │ Frontend Client        │
- │ (Web / Desktop App)    │
- │ - Voice Input/Output   │
- │ - Text Display         │
- └──────────┬─────────────┘
-            │
-      (Audio stream)
-            │
- ┌──────────▼─────────────┐
- │ Speech I/O Module      │
- │ - ASR (Whisper)        │
- │ - TTS (e.g., Coqui)    │
- └──────────┬─────────────┘
-            │
-            ▼
- ┌──────────────────────────────────────────────────────────────┐
- │ Feature Extraction Module                                    │
- │ - librosa (prosody pitch, energy/loudness, timing/phonation) │
- └──────────┬───────────────────────────────────────────────────┘
-            │
-            ▼
- ┌────────────────────────────────┐
- │ Recording Agent (MedGemma)     │
- │ - Generates next question      │
- │ - Conversational context       │
- └──────────┬─────────────────────┘
-            │
-            ▼
- ┌────────────────────────────────┐
- │ Scoring Agent (MedGemma)       │
- │ - Maps text+voice features →   │
- │   PHQ-9 dimension confidences  │
- │ - Determines if assessment done│
- └──────────┬─────────────────────┘
-            │
-            ▼
- ┌────────────────────────────────┐
- │ Controller / Orchestrator      │
- │ - Loop until confidence ≥ τ    │
- │ - Output PHQ-9 report          │
- └────────────────────────────────┘
-
- 3. Agent Design
-
- 3.1 Recording Agent
-
- Role: Simulates a clinician conducting an empathetic, open-ended dialogue to elicit responses relevant to the PHQ-9 categories (mood, sleep, appetite, concentration, energy, self-worth, psychomotor changes, suicidal ideation).
-
- Key Responsibilities:
-
- Maintain conversational context.
-
- Adapt follow-up questions based on the inferred patient state.
-
- Produce text responses using a configurable LLM (e.g., Gemma-2-2B-IT, MedGemma-4B-IT) with a clinician-style prompt template.
-
- After each user response, trigger the Scoring Agent to reassess.
-
- Prompt Skeleton Example:
-
- System: You are a clinician conducting a conversational assessment to infer PHQ-9 symptoms without listing questions.
-         Keep the tone empathetic, natural, and human.
- User: [transcribed patient input]
- Assistant: [clinician-style response / next question]
-
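In code, the skeleton corresponds to a chat-format message list. A minimal sketch, assuming a hypothetical `build_messages` helper (not part of the app's actual code):

```python
# Minimal sketch: assemble the clinician prompt skeleton into chat messages.
# `build_messages` is a hypothetical helper, not the app's real interface.

SYSTEM_PROMPT = (
    "You are a clinician conducting a conversational assessment to infer "
    "PHQ-9 symptoms without listing questions. Keep the tone empathetic, "
    "natural, and human."
)

def build_messages(history, user_text):
    """history: list of (user, assistant) turns; returns chat-format messages."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": user_text})
    return messages

msgs = build_messages(
    [("I haven't been sleeping well.",
      "That sounds hard. How long has this been going on?")],
    "A few weeks now.",
)
```

A list in this shape can be fed to a chat template or an OpenAI-style chat endpoint.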
- 3.2 Scoring Agent
-
- Role: Evaluates the ongoing conversation to infer a PHQ-9 score distribution and confidence values for each symptom.
-
- Input:
-
- Conversation transcript (all turns)
-
- librosa/OpenSmile features (prosody, energy, speech rate)
-
- Optional: timestamped emotional embeddings (via a pretrained affect model)
-
- Output:
-
- Vector of 9 PHQ-9 scores (0–3)
-
- Confidence scores per item
-
- Overall depression severity classification (Minimal, Mild, Moderate, Moderately Severe, Severe)
-
- Operation Flow:
-
- Parse the full transcript and extract statements relevant to each PHQ-9 item.
-
- Combine textual cues + acoustic cues.
-
- Fusion mechanism: Acoustic features are summarized into a compact JSON and included in the scoring prompt alongside the transcript (early, prompt-level fusion).
-
- Use the LLM's reasoning chain to map features to PHQ-9 scores.
-
- When confidence for all items ≥ threshold τ (e.g., 0.8), finalize results and signal termination.
-
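The prompt-level fusion step can be sketched as follows; the feature names and prompt wording here are illustrative assumptions, not the app's actual schema:

```python
import json

def summarize_features(rms, zero_crossing_rate, speech_rate_wps):
    """Summarize acoustic proxies into a compact JSON string for the prompt.
    The keys are hypothetical examples, not a fixed schema."""
    return json.dumps({
        "rms_energy": round(rms, 4),
        "zero_crossing_rate": round(zero_crossing_rate, 4),
        "speech_rate_wps": round(speech_rate_wps, 2),
    })

def build_scoring_prompt(transcript, feature_json):
    # Early (prompt-level) fusion: transcript and acoustic summary share one prompt.
    return (
        "Transcript so far:\n" + transcript + "\n\n"
        "Acoustic feature summary (JSON):\n" + feature_json + "\n\n"
        "Rate each PHQ-9 item 0-3 and give a confidence in [0, 1] for each."
    )

prompt = build_scoring_prompt(
    "Patient: I barely sleep these days.",
    summarize_features(0.0213, 0.091, 1.8),
)
```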
- 4. Data Flow
-
- 1. User speaks → audio captured.
- 2. ASR transcribes text.
- 3. librosa/OpenSmile extracts voice features (prosody proxies).
- 4. Recording Agent uses the transcript (and optionally summarized features) → next conversational message.
- 5. Scoring Agent evaluates cumulative context → PHQ-9 score vector + confidence.
- 6. If confidence < τ → continue conversation; else → output final diagnosis.
- 7. TTS module vocalizes clinician output.
-
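The flow above can be sketched as an orchestrator loop; every function passed in below is a hypothetical stub standing in for the real ASR, feature, and agent components:

```python
# Minimal orchestrator sketch with stubbed components (hypothetical stand-ins).
TAU = 0.8        # confidence threshold τ
MAX_TURNS = 12   # hard stop

def run_assessment(transcribe, extract_features, recording_agent,
                   scoring_agent, get_audio):
    transcript = []
    for _ in range(MAX_TURNS):
        audio = get_audio()                      # 1. capture audio
        text = transcribe(audio)                 # 2. ASR
        features = extract_features(audio)       # 3. prosody proxies
        transcript.append(("patient", text))
        scores, confidences = scoring_agent(transcript, features)  # 5. score
        if min(confidences.values()) >= TAU:     # 6. stop condition
            return scores
        transcript.append(("clinician", recording_agent(transcript)))  # 4. next turn
    return scores  # turn cap reached

# Stub usage: the scoring agent is confident after the first exchange.
result = run_assessment(
    transcribe=lambda a: "I feel tired.",
    extract_features=lambda a: {},
    recording_agent=lambda t: "How is your sleep?",
    scoring_agent=lambda t, f: ({"energy": 2}, {"energy": 0.9}),
    get_audio=lambda: b"",
)
```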
- 5. Implementation Details
-
- 5.1 Models and Libraries
-
- | Function       | Tool / Library                                              |
- |----------------|-------------------------------------------------------------|
- | Base LLM       | Configurable (e.g., Gemma-2-2B-IT, MedGemma-4B-IT)          |
- | ASR            | Whisper                                                     |
- | TTS            | gTTS (preferred), Coqui TTS, or Bark                        |
- | Audio Features | librosa (RMS, ZCR, spectral centroid, f0, energy, duration) |
- | Backend        | Python / Gradio (Spaces)                                    |
- | Frontend       | Gradio                                                      |
- | Communication  | Gradio UI                                                   |
-
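For intuition, two of the listed proxies (RMS energy and zero-crossing rate) can be computed by hand. A pure-Python sketch on a synthetic frame; the app would use librosa's equivalents such as `librosa.feature.rms` and `librosa.feature.zero_crossing_rate`:

```python
import math

def rms(frame):
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs that change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

# Synthetic 8-sample frame standing in for real audio.
frame = [0.0, 0.5, 0.3, -0.2, -0.4, 0.1, 0.2, -0.1]
energy = rms(frame)               # overall loudness proxy
zcr = zero_crossing_rate(frame)   # rough noisiness/voicing proxy
```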
- 5.2 Confidence Computation
-
- Each PHQ-9 item i has a confidence score c_i ∈ [0, 1].
-
- c_i is estimated via secondary LLM reasoning (e.g., "How confident are you about this inference?").
-
- Global confidence: C = min_i c_i.
-
- Stop condition: C ≥ τ, e.g., τ = 0.8.
-
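The stop rule reduces to a min over per-item confidences; a minimal sketch using the item names from the output specification:

```python
TAU = 0.8  # confidence threshold τ

def should_stop(confidences):
    """confidences: per-item c_i in [0, 1]; stop when C = min_i c_i >= τ."""
    return min(confidences.values()) >= TAU

c = {"interest": 0.9, "mood": 0.85, "sleep": 0.95, "energy": 0.9,
     "appetite": 0.82, "self_worth": 0.88, "concentration": 0.91,
     "motor": 0.84, "suicidal_thoughts": 0.99}
should_stop(c)       # True: every item is at or above τ
c["sleep"] = 0.5
should_stop(c)       # False: one low-confidence item blocks termination
```

Taking the minimum (rather than the mean) guarantees no single PHQ-9 item is finalized while still uncertain.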
- 5.3 Example API Workflow
-
- POST /api/message
- {
-   "audio": <base64 encoded>,
-   "transcript": "...",
-   "features": {...}
- }
- →
- {
-   "agent_response": "...",
-   "phq9_scores": [1, 0, 2, ...],
-   "confidences": [0.9, 0.85, ...],
-   "finished": false
- }
-
- 6. Training and Fine-Tuning (future work; not implemented now, as the data is not yet available)
-
- Supervised Fine-Tuning (SFT) using synthetic dialogues labeled with PHQ-9 scores.
-
- Speech-text alignment: fuse OpenSmile embeddings with conversation text embeddings before feeding them to scoring prompts.
-
- Possible multi-modal fusion via:
-
- Feature concatenation → token embedding, or
-
- a cross-attention adapter (if fine-tuning is allowed).
-
- 7. Output Specification
-
- Final Output:
-
- {
-   "PHQ9_Scores": {
-     "interest": 2,
-     "mood": 3,
-     "sleep": 2,
-     "energy": 2,
-     "appetite": 1,
-     "self_worth": 2,
-     "concentration": 1,
-     "motor": 1,
-     "suicidal_thoughts": 0
-   },
-   "Total_Score": 14,
-   "Severity": "Moderate Depression",
-   "Confidence": 0.86
- }
-
- Displayed alongside a clinician-style summary:
-
- "Based on our discussion, your responses suggest moderate depressive symptoms, with difficulties in mood and sleep being most prominent."
-
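The `Total_Score`-to-`Severity` mapping follows the standard PHQ-9 bands (0–4 minimal, 5–9 mild, 10–14 moderate, 15–19 moderately severe, 20–27 severe); a minimal sketch:

```python
def phq9_severity(total):
    """Standard PHQ-9 severity bands for a total score in 0-27."""
    if total <= 4:
        return "Minimal"
    if total <= 9:
        return "Mild"
    if total <= 14:
        return "Moderate"
    if total <= 19:
        return "Moderately Severe"
    return "Severe"

# The item scores from the example output above.
scores = {"interest": 2, "mood": 3, "sleep": 2, "energy": 2, "appetite": 1,
          "self_worth": 2, "concentration": 1, "motor": 1,
          "suicidal_thoughts": 0}
total = sum(scores.values())
severity = phq9_severity(total)
```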
- 8. Termination and Safety
-
- The system will not offer therapy advice or emergency counseling.
-
- If the patient mentions suicidal thoughts (item 9), the system:
-
- 1. flags high risk,
- 2. terminates the chat, and
- 3. displays emergency contact information (e.g., "If you are in danger or need immediate help, call 988 in the U.S.").
-
- 9. Future Extensions (not implemented now)
-
- Fine-tuned model jointly trained on PHQ-9-labeled conversations.
-
- Multilingual support (via multilingual Whisper and TTS).
-
- Confidence calibration using Bayesian reasoning or uncertainty quantification.
-
- Integration with EHR systems for clinician verification.
-
- 10. Summary
-
- This project creates an intelligent, conversational PHQ-9 assessment agent that blends:
-
- the MedGemma-4B-IT medical LLM,
-
- audio emotion analysis with OpenSmile,
-
- a dual-agent architecture for conversation and scoring,
-
- and multimodal reasoning to deliver clinician-like mental health assessments.
-
- The modular design enables local deployment on GPU servers, privacy-preserving operation, and future research extensions into multimodal diagnostic reasoning.
+ # PHQ-9 Clinician Agent (Voice-first)
+
+ A lightweight research demo that simulates a clinician conducting a brief conversational PHQ-9 screening. The app is voice-first: you tap a circular mic bubble to talk; the model replies and can speak back via TTS. A separate Advanced tab exposes scoring and configuration.
+
+ ## What it does
+ - Conversational assessment to infer PHQ-9 items from natural dialogue (no explicit questionnaire).
+ - Live inference of PHQ-9 item scores, confidences, total score, and severity.
+ - Automatic stop when the minimum confidence across items reaches a threshold or risk is detected.
+ - Optional TTS playback for clinician responses.
+
+ ## UI overview
+ - Main tab: large circular mic "Record" bubble
+   - Tap to start, tap again to stop (processing runs on stop)
+   - While speaking back (TTS), the bubble shows a speaking state
+ - Chat tab: plain chat transcript (for reviewing turns)
+ - Advanced tab:
+   - PHQ-9 Assessment JSON (live)
+   - Severity label
+   - Confidence threshold slider (τ)
+   - Toggle: Speak clinician responses (TTS)
+   - Model ID textbox and "Apply model" button
+
+ ## Quick start (local)
+ 1. Python 3.10+ recommended.
+ 2. Install deps:
+ ```bash
+ pip install -r requirements.txt
+ ```
+ 3. Run the app:
+ ```bash
+ python app.py
+ ```
+ 4. Open the URL shown in the console (the server listens on port `7860`; open `http://localhost:7860` in your browser). Allow microphone access in your browser.
+
+ ## Configuration
+ Environment variables (all optional):
+ - `LLM_MODEL_ID` (default `google/gemma-2-2b-it`): chat model id
+ - `ASR_MODEL_ID` (default `openai/whisper-tiny.en`): speech-to-text model id
+ - `CONFIDENCE_THRESHOLD` (default `0.8`): stop when min item confidence ≥ τ
+ - `MAX_TURNS` (default `12`): hard stop cap
+ - `USE_TTS` (default `true`): enable TTS playback
+ - `MODEL_CONFIG_PATH` (default `model_config.json`): persisted model id
+ - `PORT` (default `7860`): server port
+
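Reading these variables with their documented defaults can be sketched as follows; this is a sketch of the shape only, and the app's actual config code may differ:

```python
import os

# Minimal sketch: read the documented environment variables with their
# documented defaults. Key names in CONFIG are illustrative, not the app's.
CONFIG = {
    "llm_model_id": os.environ.get("LLM_MODEL_ID", "google/gemma-2-2b-it"),
    "asr_model_id": os.environ.get("ASR_MODEL_ID", "openai/whisper-tiny.en"),
    "confidence_threshold": float(os.environ.get("CONFIDENCE_THRESHOLD", "0.8")),
    "max_turns": int(os.environ.get("MAX_TURNS", "12")),
    "use_tts": os.environ.get("USE_TTS", "true").lower() == "true",
    "model_config_path": os.environ.get("MODEL_CONFIG_PATH", "model_config.json"),
    "port": int(os.environ.get("PORT", "7860")),
}
```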
+ Notes:
+ - If a GPU is available, the app will use it automatically for Transformers pipelines.
+ - Changing the model in Advanced will reload the text-generation pipeline on the next turn.
+
+ ## How to use
+ 1. Go to Main and tap the mic bubble. Speak naturally.
+ 2. Tap again to finish your turn. The model replies; if TTS is enabled, you'll hear it.
+ 3. The Advanced tab updates live with PHQ-9 scores and severity. Adjust the confidence threshold if you want the assessment to stop earlier or later.
+
+ ## Troubleshooting
+ - No mic input detected:
+   - Ensure the site has microphone permission in your browser settings.
+   - Try refreshing the page after granting permission.
+ - Can't hear TTS:
+   - Enable the "Speak clinician responses (TTS)" toggle in Advanced.
+   - Ensure your system audio output is correct. Some browsers block auto-play without interaction; use the mic once, then it should work.
+ - Model download slow or fails:
+   - Check internet connectivity and try again. Some models are large.
+ - Assessment doesn't stop:
+   - Lower the confidence threshold slider (τ) in Advanced (stopping requires min item confidence ≥ τ), or wait for the turn cap (`MAX_TURNS`).
+
+ ## Safety
+ This demo does not provide therapy or emergency counseling. If a user expresses suicidal intent or risk is inferred, the app ends the conversation and advises contacting emergency services (e.g., 988 in the U.S.).
+
+ ## Development notes
+ - Framework: Gradio Blocks
+ - ASR: Transformers pipeline (Whisper)
+ - TTS: gTTS
+ - Prosody features: librosa (lightweight proxies) for the scoring prompt
+
+ PRs and experiments are welcome. This is a research prototype and not a clinical tool.