---
title: Multi-Agents for Clinical Decision Making
emoji: 🏥
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# 🏥 Multi-Agents for Clinical Decision Making

> **What happens when you drop an 8B LLM into a chaotic Emergency Room, surround it with simulated patients and nurses, and force it to learn medicine through trial by fire?**

Built for the [Meta × PyTorch OpenEnv Hackathon — April 2026](https://pytorch.org/blog/openenv/).

![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-green) ![License](https://img.shields.io/badge/License-MIT-blue) ![Python](https://img.shields.io/badge/Python-3.9+-yellow)

---

## 📌 Quick Links

| Resource | Link |
|:---|:---|
| 🌐 **Live Environment (HF Space)** | [huggingface.co/spaces/Uddiii/Multi-Agentic](https://huggingface.co/spaces/Uddiii/Multi-Agentic) |
| 📝 **Engineering Deep Dive (Blog)** | [`blog.md`](./blog.md) |
| 🎬 **Demo Video** | [YouTube](https://www.youtube.com/watch?v=hL7n5TU7Bm4) |
| 📓 **Training Notebook** | [Kaggle / Colab](https://www.kaggle.com/code/aman99123/grpo-rl-trainer) |
| 📊 **Baseline Evaluation** | [`baseline_eval/`](./baseline_eval/) |

> **JUDGES: START WITH THE [BLOG](./blog.md)** — it's a 5-minute read that explains why standard medical AI benchmarks fail and what our environment does differently.

---

## 1. Problem Statement

Most "medical-LLM" benchmarks ask a frozen model to one-shot a multiple-choice question. Real emergency medicine is nothing like that. A doctor has to **steer a workflow**: review prior history, get vitals from a nurse who might be overwhelmed, decide which of forty labs is worth the patient's time and money, document a working diagnosis before treating, and earn consent from a patient who may walk out against medical advice.

**The capability gap we target is process-level clinical competence under uncertainty** — the ability to make a sequence of tool-use decisions with imperfect information, while balancing diagnostic accuracy, time, cost, and patient trust.

This needs an **environment**, not a static benchmark, and it needs **dense, multi-component, hack-resistant reward structures**, not a single accuracy score.

---

## 2. Environment

A multi-agent simulation implemented via Gymnasium and served via a FastAPI HTTP server (OpenEnv-compatible). The environment features a unique **Quad-Agent Architecture**:

```mermaid
flowchart TD
    Doctor[Doctor Agent - 8B LoRA] --> Env[TriageEnv - Gymnasium + FastAPI]
    Env --> Nurse[Nurse Actor - 8B Groq]
    Env --> Patient[Patient Actor - 8B Groq]
    Env --> EJ[Empathy Judge - 70B Groq]
    Env --> MJ[Medical Judge - 70B Groq]
    Nurse --> Env
    Patient --> Env
    EJ --> |empathy score| Env
    MJ --> |treatment grade| Env
    Env --> |observation + reward| Doctor
```

### The Actors
| Agent | Role | Model | Key Behavior |
|:---|:---|:---|:---|
| **Doctor** | RL Trainee | 8B LoRA (Unsloth) | Explores tools, diagnoses, prescribes |
| **Nurse** | Cooperative Colleague | 8B-Instant (Groq) | Executes orders, reports vitals |
| **Patient** | Adversarial Actor | 8B-Instant (Groq) | Hidden trust/anxiety state, can refuse or leave |
| **Empathy Judge** | Per-Message Evaluator | 70B-Versatile (Groq) | Grades Doctor's communication tone |
| **Medical Judge** | Terminal Evaluator | 70B-Versatile (Groq) | Grades treatment accuracy, flags lethal prescriptions |

### Domain Randomization
- **50 diseases** across 10 clinical classes (Cardiovascular, Trauma, Toxicology, Endocrinology, etc.)
- **17,280+ unique persona combinations** from 5 Patient axes × 4 Nurse axes
- **3 difficulty tiers** with phase-aware SOAP noise injection

### ElevenLabs Emotion TTS
A TTS adapter injects emotion tags (`[sigh]`, `[nervous]`, `[hostile]`) based on the Patient's hidden state, producing expressive real-time audio during the dashboard demo.

---

## 3. Capabilities

The Doctor is given **five strict JSON tools**. Hidden from the Doctor: the true disease, lethal-treatment list, patient trust/anxiety scores, and the milestone tracker.

```json
{"tool": "read_soap", "section": "ALL"}
{"tool": "speak_to", "target": "patient", "message": "..."}
{"tool": "speak_to", "target": "nurse",   "message": "..."}
{"tool": "order_lab", "test_name": "troponin"}
{"tool": "update_soap", "section": "Assessment", "content": "..."}
{"tool": "terminal_discharge", "treatment": "...", "is_emergency": true}
```

**Clinical Constraints:**
- **Consent Lock**: Treatment rejected if patient hasn't consented (Phase 2+)
- **Workflow Milestones**: Expected order — `READ_SOAP → PATIENT_CONTACT → VITALS → LABS → ASSESSMENT → DISCHARGE`
- **Emergency Classification**: Doctor must flag time-critical cases

---

## 4. Tasks — 3-Phase Curriculum

| Phase | Name | Difficulty | What Success Looks Like |
|:---|:---|:---|:---|
| 1 | **Tool Mastery** | Easy | Doctor reads SOAP, talks to patient, orders the critical lab, writes Assessment + Plan, discharges correctly. |
| 2 | **Clinical Reasoning** | Medium | SOAP is noisy. Patient is anxious or confused. Doctor must do differential reasoning, not pattern-match. |
| 3 | **Empathetic Negotiation** | Hard | Patient is hostile or non-compliant. Consent is required. Doctor must earn trust or risk an AMA penalty. |

---

## 5. Reward Model / Evaluation Logic

> **Process > Terminal.** Process rewards (~60% of max) dominate terminal rewards (~40% of max). This prevents sparse-reward collapse and makes RL actually learn on a long-horizon task.

| Component | Range | What It Captures | Computed By |
|:---|:---|:---|:---|
| `process` | +0.05/step | JSON-validity, tool-legality | Rule (env) |
| `milestones` | +0.03 to +0.07 | Ordered clinical workflow | Rule |
| `labs` | +0.20 / −0.20 | Critical vs redundant lab choice | Rule + DB |
| `diagnosis` | +0.20 / +0.30 | Assessment accuracy vs true disease | Rule |
| `plan` | +0.15 / +0.25 | Plan accuracy vs correct treatment | Rule |
| `documentation` | +0.08/step | SOAP completion | Rule |
| `empathy` | capped ±0.30/−0.40 | Doctor's communication quality | **70B Empathy Judge** |
| `consent` | +0.25 / −0.50 | Patient AGREE vs AMA outcome | Rule + Patient LLM |
| `emergency_id` | ±0.30 | Emergency classification accuracy | Rule |
| `treatment` | [−0.30, +0.60], −0.80 lethal | Terminal clinical outcome | **70B Medical Judge + Rule** |
| `penalties` | −0.01 to −0.30 | Turn cost, invalid JSON, early discharge | Rule |

### Anti-Reward-Hacking
1. **Dual-Verifier Treatment**: 70B Medical Judge + deterministic keyword verifier (60/40 blend)
2. **Empathy Farming Cap**: Hard-capped at +0.30/episode
3. **Smooth Reward Gradients**: No +1/−1 cliff — smooth scaling for stable GRPO updates

---

## 6. Training Results

Trained for **75 episodes** on a single **Kaggle T4** using **Unsloth 4-bit LoRA** + our custom **manual GRPO** loop. Each episode involves ~50-80 cross-actor LLM calls, yielding **~5,000 LLM-mediated reward signals** total.

### Baseline (Untrained) vs Trained

![Baseline Phase Comparison](baseline_eval/baseline_phases_comparison.png)
*Baseline: Untrained 8B model — zero win rate, high variance, near-zero empathy.*

| Metric | Phase 1 | Phase 2 | Phase 3 |
|:---|:---|:---|:---|
| **Baseline Trained** | ![Phase 1](training_perf3.png) | ![Phase 2](training_per2.png) | ![Phase 3](training_performance1.png) |

### Component-Level Lift

| Component | Baseline Avg | After 75 ep | Δ |
|:---|:---|:---|:---|
| **Process** | 0.42 | 0.85 | +102% |
| **Empathy** | -0.12 | 0.22 | +283% |
| **Labs** | 0.15 | 0.48 | +220% |
| **Diagnosis** | 0.05 | 0.35 | +600% |
| **Plan** | 0.02 | 0.28 | +1300% |
| **Documentation** | 0.10 | 0.45 | +350% |
| **Consent** | -0.30 | 0.15 | +150% |

---

## 7. Post-Training & Self-Improvement Strategy

- **Ablation Runs**: Disable Empathy Judge or use terminal-only rewards to prove necessity of process supervision
- **Wider LoRA on A100**: Target `gate_proj`, `up_proj`, `down_proj` (45M+ trainable params) for nuanced clinical phrasings
- **Phase 4 — Multi-Patient**: Shift handoffs + juggling two cases with a shared nurse
- **Extended Tool API**: `consult_specialist`, `image_order` (CT/X-ray), `pharmacy_check` (drug-allergy)

---

## 8. OpenEnv Compliance & How to Use

### Endpoints (FastAPI)
```
POST /reset  → {observation, info}       # Start new episode
POST /step   → {observation, reward, done, truncated, info}  # Submit action
GET  /state  → full internal env state   # Debug only
GET  /health → {"status": "ok"}          # Liveness check
GET  /docs   → Swagger UI               # Interactive API docs
```

### Run Locally
```bash
# Option 1: Docker
docker build -t ermap-env .
docker run -p 7860:7860 -e GROQ_API_KEY="your_key" ermap-env

# Option 2: Python
pip install -r requirements.txt
uvicorn ER_MAP.server:app --host 0.0.0.0 --port 7860

# Option 3: Dashboard UI
python -m ER_MAP.dashboard
# Open http://localhost:5050
```

### Do Judges Need API Keys?
**No.** When using our deployed HF Space, Groq API keys are embedded as Space Secrets. The judge simply sends HTTP requests. For local Docker testing, supply `GROQ_API_KEY` as shown above.

---

## 📁 Repository Structure

```
├── README.md                 # This file
├── blog.md                   # Engineering deep dive (HF Blog)
├── openenv.yaml              # OpenEnv manifest
├── Dockerfile                # HF Spaces deployment
├── requirements.txt          # Dependencies
├── setup.py                  # pip install -e .
├── ER_MAP/
│   ├── server.py             # FastAPI OpenEnv wrapper
│   ├── dashboard.py          # Interactive UI + TTS
│   ├── evaluate.py           # Training evaluation
│   ├── evaluate_baseline.py  # Baseline comparison
│   ├── envs/
│   │   ├── triage_env.py     # Core Gymnasium environment
│   │   ├── disease_db.py     # 50-disease database
│   │   ├── randomizer.py     # Persona & scenario generator
│   │   ├── empathy_engine.py # Empathy Judge integration
│   │   └── api_router.py     # Multi-key Groq routing
│   └── training/
│       └── train_grpo.py     # Manual GRPO training loop
├── baseline_eval/            # Baseline evaluation results + plots
├── training_perf*.png        # Per-phase training dashboards
└── kaggle/                   # Kaggle training notebooks
```

---

## Acknowledgements

Hugging Face for credits and the Hub. The OpenEnv/PyTorch team for a well-designed hackathon brief. Unsloth for the 4-bit fused LoRA kernel that makes this fit on a T4. Groq for the 8B and 70B inference APIs. The Kaggle team for free T4 GPU sessions.

— The ER-MAP Team