---
title: Multi-Agents for Clinical Decision Making
emoji: 🏥
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# 🏥 Multi-Agents for Clinical Decision Making

> **What happens when you drop an 8B LLM into a chaotic Emergency Room, surround it with simulated patients and nurses, and force it to learn medicine through trial by fire?**

Built for the [Meta × PyTorch OpenEnv Hackathon, April 2026](https://pytorch.org/blog/openenv/).

![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-green) ![License](https://img.shields.io/badge/License-MIT-blue) ![Python](https://img.shields.io/badge/Python-3.9+-yellow)

---

## 📌 Quick Links

| Resource | Link |
|:---|:---|
| 🌐 **Live Environment (HF Space)** | [huggingface.co/spaces/Uddiii/Multi-Agentic](https://huggingface.co/spaces/Uddiii/Multi-Agentic) |
| 📝 **Engineering Deep Dive (Blog)** | [`blog.md`](./blog.md) |
| 🎬 **Demo Video** | [YouTube](https://www.youtube.com/watch?v=hL7n5TU7Bm4) |
| 📓 **Training Notebook** | [Kaggle / Colab](https://www.kaggle.com/code/aman99123/grpo-rl-trainer) |
| 📊 **Baseline Evaluation** | [`baseline_eval/`](./baseline_eval/) |

> **JUDGES: START WITH THE [BLOG](./blog.md).** It's a 5-minute read that explains why standard medical AI benchmarks fail and what our environment does differently.

---

## 1. Problem Statement

Most "medical-LLM" benchmarks ask a frozen model to one-shot a multiple-choice question. Real emergency medicine is nothing like that. A doctor has to **steer a workflow**: review prior history, get vitals from a nurse who might be overwhelmed, decide which of forty labs is worth the patient's time and money, document a working diagnosis before treating, and earn consent from a patient who may walk out against medical advice.

**The capability gap we target is process-level clinical competence under uncertainty**: the ability to make a sequence of tool-use decisions with imperfect information, while balancing diagnostic accuracy, time, cost, and patient trust.

This needs an **environment**, not a static benchmark, and it needs **dense, multi-component, hack-resistant reward structures**, not a single accuracy score.

---

## 2. Environment

A multi-agent simulation implemented as a Gymnasium environment and served over HTTP by a FastAPI server (OpenEnv-compatible). The environment is built around a **Quad-Agent Architecture**:

```mermaid
flowchart TD
    Doctor[Doctor Agent - 8B LoRA] --> Env[TriageEnv - Gymnasium + FastAPI]
    Env --> Nurse[Nurse Actor - 8B Groq]
    Env --> Patient[Patient Actor - 8B Groq]
    Env --> EJ[Empathy Judge - 70B Groq]
    Env --> MJ[Medical Judge - 70B Groq]
    Nurse --> Env
    Patient --> Env
    EJ --> |empathy score| Env
    MJ --> |treatment grade| Env
    Env --> |observation + reward| Doctor
```

### The Actors
| Agent | Role | Model | Key Behavior |
|:---|:---|:---|:---|
| **Doctor** | RL Trainee | 8B LoRA (Unsloth) | Explores tools, diagnoses, prescribes |
| **Nurse** | Cooperative Colleague | 8B-Instant (Groq) | Executes orders, reports vitals |
| **Patient** | Adversarial Actor | 8B-Instant (Groq) | Hidden trust/anxiety state, can refuse or leave |
| **Empathy Judge** | Per-Message Evaluator | 70B-Versatile (Groq) | Grades Doctor's communication tone |
| **Medical Judge** | Terminal Evaluator | 70B-Versatile (Groq) | Grades treatment accuracy, flags lethal prescriptions |

### Domain Randomization
- **50 diseases** across 10 clinical classes (Cardiovascular, Trauma, Toxicology, Endocrinology, etc.)
- **17,280+ unique persona combinations** from 5 Patient axes × 4 Nurse axes
- **3 difficulty tiers** with phase-aware SOAP noise injection

### ElevenLabs Emotion TTS
A TTS adapter injects emotion tags (`[sigh]`, `[nervous]`, `[hostile]`) based on the Patient's hidden state, producing expressive real-time audio during the dashboard demo.

---

## 3. Capabilities

The Doctor is given **five strict JSON tools**. Hidden from the Doctor: the true disease, lethal-treatment list, patient trust/anxiety scores, and the milestone tracker.

```json
{"tool": "read_soap", "section": "ALL"}
{"tool": "speak_to", "target": "patient", "message": "..."}
{"tool": "speak_to", "target": "nurse",   "message": "..."}
{"tool": "order_lab", "test_name": "troponin"}
{"tool": "update_soap", "section": "Assessment", "content": "..."}
{"tool": "terminal_discharge", "treatment": "...", "is_emergency": true}
```
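
As a rough illustration of what "strict JSON tools" can mean in practice, the sketch below validates a raw Doctor action against the five tool shapes listed above. The function name `validate_action` and the `REQUIRED_FIELDS` table are hypothetical and are not taken from the ER-MAP codebase; the required fields are inferred only from the example calls.

```python
import json

# Hypothetical schema, inferred from the example tool calls in this README.
REQUIRED_FIELDS = {
    "read_soap": {"section"},
    "speak_to": {"target", "message"},
    "order_lab": {"test_name"},
    "update_soap": {"section", "content"},
    "terminal_discharge": {"treatment", "is_emergency"},
}


def validate_action(raw):
    """Return (is_valid, reason) for a raw JSON string emitted by the Doctor."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(action, dict):
        return False, "action must be a JSON object"
    tool = action.get("tool")
    if tool not in REQUIRED_FIELDS:
        return False, f"unknown tool: {tool!r}"
    # Report any required field that the action does not carry.
    missing = REQUIRED_FIELDS[tool] - set(action)
    if missing:
        return False, "missing fields: " + ", ".join(sorted(missing))
    # Assumption: speak_to only addresses the two actors shown above.
    if tool == "speak_to" and action["target"] not in ("patient", "nurse"):
        return False, "target must be 'patient' or 'nurse'"
    return True, "ok"
```

A check like this would let the environment apply the invalid-JSON penalty before any actor LLM is called.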

**Clinical Constraints:**
- **Consent Lock**: Treatment rejected if patient hasn't consented (Phase 2+)
- **Workflow Milestones**: Expected order is `READ_SOAP → PATIENT_CONTACT → VITALS → LABS → ASSESSMENT → DISCHARGE`
- **Emergency Classification**: Doctor must flag time-critical cases
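
One possible reading of the workflow-milestone constraint is sketched below: a milestone earns credit only if it comes later in the expected order than every milestone already credited. This is an assumption about how ordering might be scored, not the environment's actual tracker, and `credited_milestones` is a hypothetical helper.

```python
# Expected order, copied from the constraint above.
EXPECTED_ORDER = [
    "READ_SOAP", "PATIENT_CONTACT", "VITALS", "LABS", "ASSESSMENT", "DISCHARGE",
]
POSITION = {name: i for i, name in enumerate(EXPECTED_ORDER)}


def credited_milestones(events):
    """Return the milestones that were reached without going backwards."""
    credited = []
    highest = -1  # position of the latest credited milestone
    for event in events:
        pos = POSITION.get(event)
        if pos is None:
            continue  # not a milestone event
        if pos > highest:
            credited.append(event)
            highest = pos
    return credited
```

Under this interpretation, a Doctor who checks vitals before talking to the patient keeps credit for `VITALS` but earns nothing for the late `PATIENT_CONTACT`.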

---

## 4. Tasks: 3-Phase Curriculum

| Phase | Name | Difficulty | What Success Looks Like |
|:---|:---|:---|:---|
| 1 | **Tool Mastery** | Easy | Doctor reads SOAP, talks to patient, orders the critical lab, writes Assessment + Plan, discharges correctly. |
| 2 | **Clinical Reasoning** | Medium | SOAP is noisy. Patient is anxious or confused. Doctor must do differential reasoning, not pattern-match. |
| 3 | **Empathetic Negotiation** | Hard | Patient is hostile or non-compliant. Consent is required. Doctor must earn trust or risk an AMA penalty. |

---

## 5. Reward Model / Evaluation Logic

> **Process > Terminal.** Process rewards (~60% of max) dominate terminal rewards (~40% of max). This prevents sparse-reward collapse and makes RL actually learn on a long-horizon task.

| Component | Range | What It Captures | Computed By |
|:---|:---|:---|:---|
| `process` | +0.05/step | JSON-validity, tool-legality | Rule (env) |
| `milestones` | +0.03 to +0.07 | Ordered clinical workflow | Rule |
| `labs` | +0.20 / −0.20 | Critical vs redundant lab choice | Rule + DB |
| `diagnosis` | +0.20 / +0.30 | Assessment accuracy vs true disease | Rule |
| `plan` | +0.15 / +0.25 | Plan accuracy vs correct treatment | Rule |
| `documentation` | +0.08/step | SOAP completion | Rule |
| `empathy` | capped ±0.30/−0.40 | Doctor's communication quality | **70B Empathy Judge** |
| `consent` | +0.25 / −0.50 | Patient AGREE vs AMA outcome | Rule + Patient LLM |
| `emergency_id` | ±0.30 | Emergency classification accuracy | Rule |
| `treatment` | [−0.30, +0.60], −0.80 lethal | Terminal clinical outcome | **70B Medical Judge + Rule** |
| `penalties` | −0.01 to −0.30 | Turn cost, invalid JSON, early discharge | Rule |

### Anti-Reward-Hacking
1. **Dual-Verifier Treatment**: 70B Medical Judge + deterministic keyword verifier (60/40 blend)
2. **Empathy Farming Cap**: Hard-capped at +0.30/episode
3. **Smooth Reward Gradients**: No +1/−1 cliff; rewards scale smoothly for stable GRPO updates
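
The first two safeguards can be sketched in a few lines. The code below is illustrative only: it assumes the 60/40 blend weights the Medical Judge at 60% and the keyword verifier at 40%, and that the empathy total is clamped per episode to the +0.30 cap stated above and the −0.40 floor from the reward table. Both function names are hypothetical.

```python
def blend_treatment(judge_score, keyword_score, judge_weight=0.6):
    """Blend the LLM judge score with the deterministic keyword score.

    Assumption: the judge carries the 60% share of the 60/40 blend.
    """
    return judge_weight * judge_score + (1.0 - judge_weight) * keyword_score


def accumulate_empathy(running_total, delta, cap_hi=0.30, cap_lo=-0.40):
    """Add one per-message empathy score, clamped to the episode-level bounds."""
    return max(cap_lo, min(cap_hi, running_total + delta))
```

With the clamp in place, repeating sympathetic phrases stops paying off once the episode total reaches +0.30, and a judge that is talked into a high grade still cannot lift a treatment the keyword verifier scores at zero above 60% of its value.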

---

## 6. Training Results

Trained for **75 episodes** on a single **Kaggle T4** using **Unsloth 4-bit LoRA** + our custom **manual GRPO** loop. Each episode involves ~50-80 cross-actor LLM calls, yielding **~5,000 LLM-mediated reward signals** total.

### Baseline (Untrained) vs Trained

![Baseline Phase Comparison](baseline_eval/baseline_phases_comparison.png)
*Baseline: untrained 8B model with zero win rate, high variance, and near-zero empathy.*

| Metric | Phase 1 | Phase 2 | Phase 3 |
|:---|:---|:---|:---|
| **Baseline Trained** | ![Phase 1](training_perf3.png) | ![Phase 2](training_per2.png) | ![Phase 3](training_performance1.png) |

### Component-Level Lift

| Component | Baseline Avg | After 75 ep | Δ |
|:---|:---|:---|:---|
| **Process** | 0.42 | 0.85 | +102% |
| **Empathy** | -0.12 | 0.22 | +283% |
| **Labs** | 0.15 | 0.48 | +220% |
| **Diagnosis** | 0.05 | 0.35 | +600% |
| **Plan** | 0.02 | 0.28 | +1300% |
| **Documentation** | 0.10 | 0.45 | +350% |
| **Consent** | -0.30 | 0.15 | +150% |
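
The Δ column appears to be the change relative to the magnitude of the baseline, which is why components with a negative baseline (Empathy, Consent) still show a positive percentage. The short check below reproduces the column under that assumption; `relative_lift` is a hypothetical helper, and the numbers are copied from the table.

```python
def relative_lift(baseline, trained):
    """Percentage change relative to the absolute value of the baseline."""
    return (trained - baseline) / abs(baseline) * 100.0


# (baseline average, average after 75 episodes), copied from the table above.
TABLE = {
    "Process": (0.42, 0.85),
    "Empathy": (-0.12, 0.22),
    "Labs": (0.15, 0.48),
    "Diagnosis": (0.05, 0.35),
    "Plan": (0.02, 0.28),
    "Documentation": (0.10, 0.45),
    "Consent": (-0.30, 0.15),
}

LIFTS = {name: round(relative_lift(b, t)) for name, (b, t) in TABLE.items()}
```

Because several baselines are close to zero, the largest percentages (Diagnosis, Plan) reflect small absolute gains of about 0.3 reward units.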

---

## 7. Post-Training & Self-Improvement Strategy

- **Ablation Runs**: Disable Empathy Judge or use terminal-only rewards to prove necessity of process supervision
- **Wider LoRA on A100**: Target `gate_proj`, `up_proj`, `down_proj` (45M+ trainable params) for nuanced clinical phrasings
- **Phase 4 (Multi-Patient)**: Shift handoffs + juggling two cases with a shared nurse
- **Extended Tool API**: `consult_specialist`, `image_order` (CT/X-ray), `pharmacy_check` (drug-allergy)

---

## 8. OpenEnv Compliance & How to Use

### Endpoints (FastAPI)
```
POST /reset  → {observation, info}       # Start new episode
POST /step   → {observation, reward, done, truncated, info}  # Submit action
GET  /state  → full internal env state   # Debug only
GET  /health → {"status": "ok"}          # Liveness check
GET  /docs   → Swagger UI               # Interactive API docs
```
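
A minimal client for these endpoints might look like the sketch below, using only the Python standard library. The request body for `/step` is an assumption (the tool call posted directly as JSON); check `/docs` on a running server for the real schema. The helper names `build_post`, `send`, and `run_first_step` are hypothetical.

```python
import json
import urllib.request

# Assumption: a local server started as described under "Run Locally".
BASE_URL = "http://localhost:7860"


def build_post(base_url, path, payload=None):
    """Build (but do not send) a JSON POST request for the given endpoint."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_first_step(base_url=BASE_URL):
    """Reset the environment, then submit a single read_soap action."""
    first = send(build_post(base_url, "/reset"))
    # Assumption: /step accepts the tool call itself as the JSON body.
    action = {"tool": "read_soap", "section": "ALL"}
    result = send(build_post(base_url, "/step", action))
    return first["observation"], result["reward"], result["done"]
```

Calling `run_first_step()` against a running server should return the initial observation plus the reward and `done` flag for the first action, provided the assumed payload shape matches the server.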

### Run Locally
```bash
# Option 1: Docker
docker build -t ermap-env .
docker run -p 7860:7860 -e GROQ_API_KEY="your_key" ermap-env

# Option 2: Python
pip install -r requirements.txt
uvicorn ER_MAP.server:app --host 0.0.0.0 --port 7860

# Option 3: Dashboard UI
python -m ER_MAP.dashboard
# Open http://localhost:5050
```

### Do Judges Need API Keys?
**No.** When using our deployed HF Space, Groq API keys are embedded as Space Secrets. The judge simply sends HTTP requests. For local Docker testing, supply `GROQ_API_KEY` as shown above.

---

## 📁 Repository Structure

```
├── README.md                 # This file
├── blog.md                   # Engineering deep dive (HF Blog)
├── openenv.yaml              # OpenEnv manifest
├── Dockerfile                # HF Spaces deployment
├── requirements.txt          # Dependencies
├── setup.py                  # pip install -e .
├── ER_MAP/
│   ├── server.py             # FastAPI OpenEnv wrapper
│   ├── dashboard.py          # Interactive UI + TTS
│   ├── evaluate.py           # Training evaluation
│   ├── evaluate_baseline.py  # Baseline comparison
│   ├── envs/
│   │   ├── triage_env.py     # Core Gymnasium environment
│   │   ├── disease_db.py     # 50-disease database
│   │   ├── randomizer.py     # Persona & scenario generator
│   │   ├── empathy_engine.py # Empathy Judge integration
│   │   └── api_router.py     # Multi-key Groq routing
│   └── training/
│       └── train_grpo.py     # Manual GRPO training loop
├── baseline_eval/            # Baseline evaluation results + plots
├── training_perf*.png        # Per-phase training dashboards
└── kaggle/                   # Kaggle training notebooks
```

---

## Acknowledgements

Hugging Face for credits and the Hub. The OpenEnv/PyTorch team for a well-designed hackathon brief. Unsloth for the 4-bit fused LoRA kernel that makes this fit on a T4. Groq for the 8B and 70B inference APIs. The Kaggle team for free T4 GPU sessions.

The ER-MAP Team