---
title: Multi-Agents for Clinical Decision Making
emoji: 🏥
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# 🏥 Multi-Agents for Clinical Decision Making

> **What happens when you drop an 8B LLM into a chaotic Emergency Room, surround it with simulated patients and nurses, and force it to learn medicine through trial by fire?**

Built for the [Meta × PyTorch OpenEnv Hackathon, April 2026](https://pytorch.org/blog/openenv/).

![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-green) ![License](https://img.shields.io/badge/License-MIT-blue) ![Python](https://img.shields.io/badge/Python-3.9+-yellow)

---

## 📌 Quick Links

| Resource | Link |
|:---|:---|
| 🌐 **Live Environment (HF Space)** | [huggingface.co/spaces/Uddiii/Multi-Agentic](https://huggingface.co/spaces/Uddiii/Multi-Agentic) |
| 📝 **Engineering Deep Dive (Blog)** | [`blog.md`](./blog.md) |
| 🎬 **Demo Video** | [YouTube](https://www.youtube.com/watch?v=hL7n5TU7Bm4) |
| 📓 **Training Notebook** | [Kaggle / Colab](https://www.kaggle.com/code/aman99123/grpo-rl-trainer) |
| 📊 **Baseline Evaluation** | [`baseline_eval/`](./baseline_eval/) |

> **JUDGES: START WITH THE [BLOG](./blog.md).** It's a 5-minute read that explains why standard medical AI benchmarks fail and what our environment does differently.

---

## 1. Problem Statement

Most "medical-LLM" benchmarks ask a frozen model to one-shot a multiple-choice question. Real emergency medicine is nothing like that. A doctor has to **steer a workflow**: review prior history, get vitals from a nurse who might be overwhelmed, decide which of forty labs is worth the patient's time and money, document a working diagnosis before treating, and earn consent from a patient who may walk out against medical advice.


**The capability gap we target is process-level clinical competence under uncertainty**: the ability to make a sequence of tool-use decisions with imperfect information, while balancing diagnostic accuracy, time, cost, and patient trust. This needs an **environment**, not a static benchmark, and it needs **dense, multi-component, hack-resistant reward structures**, not a single accuracy score.

---

## 2. Environment

A multi-agent simulation built on Gymnasium and served through a FastAPI HTTP server (OpenEnv-compatible). The environment features a **Quad-Agent Architecture**: four LLM-driven agents (Nurse, Patient, Empathy Judge, Medical Judge) surround the Doctor being trained.

```mermaid
flowchart TD
    Doctor[Doctor Agent - 8B LoRA] --> Env[TriageEnv - Gymnasium + FastAPI]
    Env --> Nurse[Nurse Actor - 8B Groq]
    Env --> Patient[Patient Actor - 8B Groq]
    Env --> EJ[Empathy Judge - 70B Groq]
    Env --> MJ[Medical Judge - 70B Groq]
    Nurse --> Env
    Patient --> Env
    EJ --> |empathy score| Env
    MJ --> |treatment grade| Env
    Env --> |observation + reward| Doctor
```

### The Actors

| Agent | Role | Model | Key Behavior |
|:---|:---|:---|:---|
| **Doctor** | RL Trainee | 8B LoRA (Unsloth) | Explores tools, diagnoses, prescribes |
| **Nurse** | Cooperative Colleague | 8B-Instant (Groq) | Executes orders, reports vitals |
| **Patient** | Adversarial Actor | 8B-Instant (Groq) | Hidden trust/anxiety state, can refuse or leave |
| **Empathy Judge** | Per-Message Evaluator | 70B-Versatile (Groq) | Grades Doctor's communication tone |
| **Medical Judge** | Terminal Evaluator | 70B-Versatile (Groq) | Grades treatment accuracy, flags lethal prescriptions |

### Domain Randomization

- **50 diseases** across 10 clinical classes (Cardiovascular, Trauma, Toxicology, Endocrinology, etc.)
- **17,280+ unique persona combinations** from 5 Patient axes × 4 Nurse axes
- **3 difficulty tiers** with phase-aware SOAP noise injection

### ElevenLabs Emotion TTS

A TTS adapter injects emotion tags (`[sigh]`, `[nervous]`, `[hostile]`) based on the Patient's hidden state, producing expressive real-time audio during the dashboard demo.

---

## 3. Capabilities

The Doctor is given **five strict JSON tools** (`speak_to` accepts either `patient` or `nurse` as its target, so it appears twice below). Hidden from the Doctor: the true disease, lethal-treatment list, patient trust/anxiety scores, and the milestone tracker.

```json
{"tool": "read_soap", "section": "ALL"}
{"tool": "speak_to", "target": "patient", "message": "..."}
{"tool": "speak_to", "target": "nurse", "message": "..."}
{"tool": "order_lab", "test_name": "troponin"}
{"tool": "update_soap", "section": "Assessment", "content": "..."}
{"tool": "terminal_discharge", "treatment": "...", "is_emergency": true}
```

**Clinical Constraints:**

- **Consent Lock**: Treatment is rejected if the patient hasn't consented (Phase 2+)
- **Workflow Milestones**: Expected order is `READ_SOAP → PATIENT_CONTACT → VITALS → LABS → ASSESSMENT → DISCHARGE`
- **Emergency Classification**: Doctor must flag time-critical cases

---

## 4. Tasks: 3-Phase Curriculum

| Phase | Name | Difficulty | What Success Looks Like |
|:---|:---|:---|:---|
| 1 | **Tool Mastery** | Easy | Doctor reads SOAP, talks to patient, orders the critical lab, writes Assessment + Plan, discharges correctly. |
| 2 | **Clinical Reasoning** | Medium | SOAP is noisy. Patient is anxious or confused. Doctor must do differential reasoning, not pattern-match. |
| 3 | **Empathetic Negotiation** | Hard | Patient is hostile or non-compliant. Consent is required. Doctor must earn trust or risk an AMA penalty. |

---

## 5. Reward Model / Evaluation Logic

> **Process > Terminal.** Process rewards (~60% of max) dominate terminal rewards (~40% of max). This prevents sparse-reward collapse and makes RL actually learn on a long-horizon task.

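
Two of the safeguards this section describes, the per-episode empathy cap and the 60/40 judge/keyword blend for the treatment score, can be pictured with a short sketch. This is not the environment's actual code: `blend_treatment`, `EmpathyBudget`, and `credit` are hypothetical names, and two details are assumptions, namely that the 0.6 weight belongs to the Medical Judge and that −0.40 is the lower empathy bound.

```python
from dataclasses import dataclass


def blend_treatment(judge_score: float, keyword_score: float,
                    judge_weight: float = 0.6) -> float:
    """Weighted blend of an LLM-judge score and a keyword-verifier score.

    Assumption: the judge carries the 0.6 weight and the keyword check 0.4.
    """
    return judge_weight * judge_score + (1.0 - judge_weight) * keyword_score


@dataclass
class EmpathyBudget:
    """Tracks cumulative empathy reward and clips it to a per-episode band."""
    cap: float = 0.30     # upper cap per episode, as stated in this section
    floor: float = -0.40  # assumed lower bound, read from the reward table
    total: float = 0.0

    def credit(self, score: float) -> float:
        """Return the part of `score` that still fits inside the band."""
        clipped_total = min(self.cap, max(self.floor, self.total + score))
        credited = clipped_total - self.total
        self.total = clipped_total
        return credited


if __name__ == "__main__":
    budget = EmpathyBudget()
    # Three warm messages in a row: once the cap is reached, extra ones earn nothing.
    print([round(budget.credit(0.2), 2) for _ in range(3)])
    # A judge score with no keyword match is pulled down by the blend.
    print(round(blend_treatment(0.5, 0.0), 2))
```

Clipping the running total rather than each individual score is what removes the incentive to farm empathy: once the cap is reached, further flattering messages add only turn cost.
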

| Component | Range | What It Captures | Computed By |
|:---|:---|:---|:---|
| `process` | +0.05/step | JSON-validity, tool-legality | Rule (env) |
| `milestones` | +0.03 to +0.07 | Ordered clinical workflow | Rule |
| `labs` | +0.20 / −0.20 | Critical vs redundant lab choice | Rule + DB |
| `diagnosis` | +0.20 / +0.30 | Assessment accuracy vs true disease | Rule |
| `plan` | +0.15 / +0.25 | Plan accuracy vs correct treatment | Rule |
| `documentation` | +0.08/step | SOAP completion | Rule |
| `empathy` | capped at +0.30 / −0.40 | Doctor's communication quality | **70B Empathy Judge** |
| `consent` | +0.25 / −0.50 | Patient AGREE vs AMA outcome | Rule + Patient LLM |
| `emergency_id` | ±0.30 | Emergency classification accuracy | Rule |
| `treatment` | [−0.30, +0.60], −0.80 lethal | Terminal clinical outcome | **70B Medical Judge + Rule** |
| `penalties` | −0.01 to −0.30 | Turn cost, invalid JSON, early discharge | Rule |

### Anti-Reward-Hacking

1. **Dual-Verifier Treatment**: 70B Medical Judge + deterministic keyword verifier (60/40 blend)
2. **Empathy Farming Cap**: Hard-capped at +0.30/episode
3. **Smooth Reward Gradients**: No +1/−1 cliff; rewards scale smoothly for stable GRPO updates

---

## 6. Training Results

Trained for **75 episodes** on a single **Kaggle T4** using **Unsloth 4-bit LoRA** + our custom **manual GRPO** loop. Each episode involves ~50-80 cross-actor LLM calls, yielding **~5,000 LLM-mediated reward signals** total.

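
The training loop itself is not reproduced in this README, so the following is only a sketch of the group-relative advantage step that GRPO is generally built around: each rollout's reward is normalized against the other rollouts of the same prompt. The function name, the group of four rollouts, and the reward values are all hypothetical.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mean = statistics.fmean(rewards)
    # Population std is used here; some implementations use the sample std instead.
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every rollout earns the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical episode rewards for four rollouts of the same patient case.
    group = [0.10, 0.45, 0.45, 0.80]
    for reward, adv in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:+.2f}  advantage={adv:+.2f}")
```

Because the advantage is relative to the group, a rollout is only rewarded for beating its siblings on the same case, which is one reason smooth, dense reward components matter: if every rollout scores identically, all advantages collapse to zero and the update carries no signal.
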

### Baseline (Untrained) vs Trained

![Baseline Phase Comparison](baseline_eval/baseline_phases_comparison.png)

*Baseline (untrained 8B model): zero win rate, high variance, near-zero empathy.*

| Metric | Phase 1 | Phase 2 | Phase 3 |
|:---|:---|:---|:---|
| **Trained** | ![Phase 1](training_perf3.png) | ![Phase 2](training_per2.png) | ![Phase 3](training_performance1.png) |

### Component-Level Lift

| Component | Baseline Avg | After 75 ep | Δ |
|:---|:---|:---|:---|
| **Process** | 0.42 | 0.85 | +102% |
| **Empathy** | -0.12 | 0.22 | +283% |
| **Labs** | 0.15 | 0.48 | +220% |
| **Diagnosis** | 0.05 | 0.35 | +600% |
| **Plan** | 0.02 | 0.28 | +1300% |
| **Documentation** | 0.10 | 0.45 | +350% |
| **Consent** | -0.30 | 0.15 | +150% |

*Δ is the change relative to the absolute value of the baseline, so negative baselines still yield a positive percentage.*

---

## 7. Post-Training & Self-Improvement Strategy

- **Ablation Runs**: Disable the Empathy Judge or use terminal-only rewards to show that process supervision is necessary
- **Wider LoRA on A100**: Target `gate_proj`, `up_proj`, `down_proj` (45M+ trainable params) for nuanced clinical phrasings
- **Phase 4 (Multi-Patient)**: Shift handoffs + juggling two cases with a shared nurse
- **Extended Tool API**: `consult_specialist`, `image_order` (CT/X-ray), `pharmacy_check` (drug-allergy)

---

## 8. OpenEnv Compliance & How to Use

### Endpoints (FastAPI)

```
POST /reset   → {observation, info}                            # Start new episode
POST /step    → {observation, reward, done, truncated, info}   # Submit action
GET  /state   → full internal env state                        # Debug only
GET  /health  → {"status": "ok"}                               # Liveness check
GET  /docs    → Swagger UI                                     # Interactive API docs
```

### Run Locally

```bash
# Option 1: Docker
docker build -t ermap-env .
docker run -p 7860:7860 -e GROQ_API_KEY="your_key" ermap-env

# Option 2: Python
pip install -r requirements.txt
uvicorn ER_MAP.server:app --host 0.0.0.0 --port 7860

# Option 3: Dashboard UI
python -m ER_MAP.dashboard
# Open http://localhost:5050
```

### Do Judges Need API Keys?

**No.** When using our deployed HF Space, Groq API keys are stored as Space Secrets. The judge simply sends HTTP requests. For local Docker testing, supply `GROQ_API_KEY` as shown above.

---

## 📁 Repository Structure

```
├── README.md                    # This file
├── blog.md                      # Engineering deep dive (HF Blog)
├── openenv.yaml                 # OpenEnv manifest
├── Dockerfile                   # HF Spaces deployment
├── requirements.txt             # Dependencies
├── setup.py                     # pip install -e .
├── ER_MAP/
│   ├── server.py                # FastAPI OpenEnv wrapper
│   ├── dashboard.py             # Interactive UI + TTS
│   ├── evaluate.py              # Training evaluation
│   ├── evaluate_baseline.py     # Baseline comparison
│   ├── envs/
│   │   ├── triage_env.py        # Core Gymnasium environment
│   │   ├── disease_db.py        # 50-disease database
│   │   ├── randomizer.py        # Persona & scenario generator
│   │   ├── empathy_engine.py    # Empathy Judge integration
│   │   └── api_router.py        # Multi-key Groq routing
│   └── training/
│       └── train_grpo.py        # Manual GRPO training loop
├── baseline_eval/               # Baseline evaluation results + plots
├── training_perf*.png           # Per-phase training dashboards
└── kaggle/                      # Kaggle training notebooks
```

---

## Acknowledgements

Hugging Face for credits and the Hub. The OpenEnv/PyTorch team for a well-designed hackathon brief. Unsloth for the 4-bit fused LoRA kernel that makes this fit on a T4. Groq for the 8B and 70B inference APIs. The Kaggle team for free T4 GPU sessions.

The ER-MAP Team