---
title: Multi-Agents for Clinical Decision Making
emoji: 🏥
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
license: mit
---
# 🏥 Multi-Agents for Clinical Decision Making
> **What happens when you drop an 8B LLM into a chaotic Emergency Room, surround it with simulated patients and nurses, and force it to learn medicine through trial by fire?**
Built for the [Meta × PyTorch OpenEnv Hackathon, April 2026](https://pytorch.org/blog/openenv/).
![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-green) ![License](https://img.shields.io/badge/License-MIT-blue) ![Python](https://img.shields.io/badge/Python-3.9+-yellow)
---
## 📌 Quick Links
| Resource | Link |
|:---|:---|
| 🌐 **Live Environment (HF Space)** | [huggingface.co/spaces/Uddiii/Multi-Agentic](https://huggingface.co/spaces/Uddiii/Multi-Agentic) |
| 📝 **Engineering Deep Dive (Blog)** | [`blog.md`](./blog.md) |
| 🎬 **Demo Video** | [YouTube](https://www.youtube.com/watch?v=hL7n5TU7Bm4) |
| 📓 **Training Notebook** | [Kaggle / Colab](https://www.kaggle.com/code/aman99123/grpo-rl-trainer) |
| 📊 **Baseline Evaluation** | [`baseline_eval/`](./baseline_eval/) |
> **JUDGES: START WITH THE [BLOG](./blog.md).** It's a 5-minute read that explains why standard medical AI benchmarks fail and what our environment does differently.
---
## 1. Problem Statement
Most "medical-LLM" benchmarks ask a frozen model to one-shot a multiple-choice question. Real emergency medicine is nothing like that. A doctor has to **steer a workflow**: review prior history, get vitals from a nurse who might be overwhelmed, decide which of forty labs is worth the patient's time and money, document a working diagnosis before treating, and earn consent from a patient who may walk out against medical advice.
**The capability gap we target is process-level clinical competence under uncertainty**: the ability to make a sequence of tool-use decisions with imperfect information, while balancing diagnostic accuracy, time, cost, and patient trust.
This needs an **environment**, not a static benchmark, and it needs **dense, multi-component, hack-resistant reward structures**, not a single accuracy score.
---
## 2. Environment
A multi-agent simulation implemented via Gymnasium and served via a FastAPI HTTP server (OpenEnv-compatible). The environment features a unique **Quad-Agent Architecture**:
```mermaid
flowchart TD
Doctor[Doctor Agent - 8B LoRA] --> Env[TriageEnv - Gymnasium + FastAPI]
Env --> Nurse[Nurse Actor - 8B Groq]
Env --> Patient[Patient Actor - 8B Groq]
Env --> EJ[Empathy Judge - 70B Groq]
Env --> MJ[Medical Judge - 70B Groq]
Nurse --> Env
Patient --> Env
EJ --> |empathy score| Env
MJ --> |treatment grade| Env
Env --> |observation + reward| Doctor
```
### The Actors
| Agent | Role | Model | Key Behavior |
|:---|:---|:---|:---|
| **Doctor** | RL Trainee | 8B LoRA (Unsloth) | Explores tools, diagnoses, prescribes |
| **Nurse** | Cooperative Colleague | 8B-Instant (Groq) | Executes orders, reports vitals |
| **Patient** | Adversarial Actor | 8B-Instant (Groq) | Hidden trust/anxiety state, can refuse or leave |
| **Empathy Judge** | Per-Message Evaluator | 70B-Versatile (Groq) | Grades Doctor's communication tone |
| **Medical Judge** | Terminal Evaluator | 70B-Versatile (Groq) | Grades treatment accuracy, flags lethal prescriptions |
### Domain Randomization
- **50 diseases** across 10 clinical classes (Cardiovascular, Trauma, Toxicology, Endocrinology, etc.)
- **17,280+ unique persona combinations** from 5 Patient axes × 4 Nurse axes
- **3 difficulty tiers** with phase-aware SOAP noise injection
### ElevenLabs Emotion TTS
A TTS adapter injects emotion tags (`[sigh]`, `[nervous]`, `[hostile]`) based on the Patient's hidden state, producing expressive real-time audio during the dashboard demo.
---
## 3. Capabilities
The Doctor is given **five strict JSON tools**. Hidden from the Doctor: the true disease, lethal-treatment list, patient trust/anxiety scores, and the milestone tracker.
```json
{"tool": "read_soap", "section": "ALL"}
{"tool": "speak_to", "target": "patient", "message": "..."}
{"tool": "speak_to", "target": "nurse", "message": "..."}
{"tool": "order_lab", "test_name": "troponin"}
{"tool": "update_soap", "section": "Assessment", "content": "..."}
{"tool": "terminal_discharge", "treatment": "...", "is_emergency": true}
```
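A strict tool schema like this can be checked before an action reaches the environment. The sketch below is a minimal validator, assuming the required fields are exactly those shown in the examples above; `TOOL_FIELDS` and `parse_action` are hypothetical names, not part of the repository.

```python
import json

# Required fields per tool, copied from the example actions above.
TOOL_FIELDS = {
    "read_soap": {"section"},
    "speak_to": {"target", "message"},
    "order_lab": {"test_name"},
    "update_soap": {"section", "content"},
    "terminal_discharge": {"treatment", "is_emergency"},
}


def parse_action(raw: str) -> dict:
    """Parse one Doctor action and raise ValueError if it is malformed."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    tool = action.get("tool")
    if tool not in TOOL_FIELDS:
        raise ValueError(f"unknown tool: {tool!r}")
    missing = TOOL_FIELDS[tool] - set(action)
    if missing:
        raise ValueError(f"{tool} is missing fields: {sorted(missing)}")
    # Assumption: speak_to accepts only the two targets shown in the examples.
    if tool == "speak_to" and action["target"] not in {"patient", "nurse"}:
        raise ValueError(f"unknown speak_to target: {action['target']!r}")
    return action
```

Rejecting malformed actions up front keeps the invalid-JSON penalty separate from clinical mistakes, which makes reward traces easier to read.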
**Clinical Constraints:**
- **Consent Lock**: Treatment rejected if patient hasn't consented (Phase 2+)
- **Workflow Milestones**: Expected order is `READ_SOAP → PATIENT_CONTACT → VITALS → LABS → ASSESSMENT → DISCHARGE`
- **Emergency Classification**: Doctor must flag time-critical cases
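
The constraints above can be illustrated with a small sketch. It assumes that a milestone earns credit only when it comes later in the expected order than everything already credited, and that the consent lock applies from Phase 2 onward as stated. The function names are hypothetical, and the environment's actual tracker may behave differently.

```python
MILESTONES = [
    "READ_SOAP", "PATIENT_CONTACT", "VITALS", "LABS", "ASSESSMENT", "DISCHARGE",
]


def credited_milestones(events):
    """Return the milestones that were reached in the expected workflow order."""
    credited = []
    last_index = -1
    for event in events:
        index = MILESTONES.index(event)
        # Assumption: out-of-order or repeated milestones earn no credit.
        if index > last_index:
            credited.append(event)
            last_index = index
    return credited


def treatment_allowed(phase: int, patient_consented: bool) -> bool:
    """Consent lock: from Phase 2 onward, treatment requires patient consent."""
    return phase < 2 or patient_consented
```

For example, a Doctor who orders labs before taking vitals would be credited for `LABS` but not for the later `VITALS` step under this rule.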
---
## 4. Tasks β€” 3-Phase Curriculum
| Phase | Name | Difficulty | What Success Looks Like |
|:---|:---|:---|:---|
| 1 | **Tool Mastery** | Easy | Doctor reads SOAP, talks to patient, orders the critical lab, writes Assessment + Plan, discharges correctly. |
| 2 | **Clinical Reasoning** | Medium | SOAP is noisy. Patient is anxious or confused. Doctor must do differential reasoning, not pattern-match. |
| 3 | **Empathetic Negotiation** | Hard | Patient is hostile or non-compliant. Consent is required. Doctor must earn trust or risk an AMA penalty. |
---
## 5. Reward Model / Evaluation Logic
> **Process > Terminal.** Process rewards (~60% of max) dominate terminal rewards (~40% of max). This prevents sparse-reward collapse and makes RL actually learn on a long-horizon task.
| Component | Range | What It Captures | Computed By |
|:---|:---|:---|:---|
| `process` | +0.05/step | JSON-validity, tool-legality | Rule (env) |
| `milestones` | +0.03 to +0.07 | Ordered clinical workflow | Rule |
| `labs` | +0.20 / −0.20 | Critical vs redundant lab choice | Rule + DB |
| `diagnosis` | +0.20 / +0.30 | Assessment accuracy vs true disease | Rule |
| `plan` | +0.15 / +0.25 | Plan accuracy vs correct treatment | Rule |
| `documentation` | +0.08/step | SOAP completion | Rule |
| `empathy` | capped at +0.30 / −0.40 | Doctor's communication quality | **70B Empathy Judge** |
| `consent` | +0.25 / −0.50 | Patient AGREE vs AMA outcome | Rule + Patient LLM |
| `emergency_id` | ±0.30 | Emergency classification accuracy | Rule |
| `treatment` | [−0.30, +0.60], −0.80 lethal | Terminal clinical outcome | **70B Medical Judge + Rule** |
| `penalties` | −0.01 to −0.30 | Turn cost, invalid JSON, early discharge | Rule |
### Anti-Reward-Hacking
1. **Dual-Verifier Treatment**: 70B Medical Judge + deterministic keyword verifier (60/40 blend)
2. **Empathy Farming Cap**: Hard-capped at +0.30/episode
3. **Smooth Reward Gradients**: No +1/−1 cliff; rewards scale smoothly for stable GRPO updates
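
The first two safeguards can be sketched in a few lines. This is an illustration under stated assumptions, not the environment's code: it assumes the 60% share of the blend goes to the Medical Judge (only the ratio is stated), and it takes the −0.40 floor from the reward table. `blended_treatment_score` and `EmpathyAccumulator` are hypothetical names.

```python
def blended_treatment_score(judge_score: float, keyword_score: float,
                            judge_weight: float = 0.6) -> float:
    """Blend the LLM judge's grade with the deterministic keyword verifier."""
    # Assumption: the judge carries the 60% share and the verifier the 40% share.
    return judge_weight * judge_score + (1.0 - judge_weight) * keyword_score


class EmpathyAccumulator:
    """Keep the per-episode empathy total inside [floor, cap]."""

    def __init__(self, cap: float = 0.30, floor: float = -0.40):
        self.cap = cap
        self.floor = floor
        self.total = 0.0

    def add(self, score: float) -> float:
        """Add one per-message score and return the reward actually paid out."""
        new_total = max(self.floor, min(self.cap, self.total + score))
        paid_now = new_total - self.total
        self.total = new_total
        return paid_now
```

Once the cap is reached, further flattering messages pay nothing, so repeating empathetic phrases stops being a profitable strategy.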
---
## 6. Training Results
Trained for **75 episodes** on a single **Kaggle T4** using **Unsloth 4-bit LoRA** + our custom **manual GRPO** loop. Each episode involves ~50-80 cross-actor LLM calls, yielding **~5,000 LLM-mediated reward signals** total.
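
For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts in its group rather than with a learned value function. The sketch below shows that normalization only. It is not the loop in `train_grpo.py`, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize each rollout's reward against the mean and spread of its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

A group in which all rollouts earn the same reward produces zero advantages and therefore no gradient signal, which is one reason dense, varied rewards matter for this setup.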
### Baseline (Untrained) vs Trained
![Baseline Phase Comparison](baseline_eval/baseline_phases_comparison.png)
*Baseline: untrained 8B model with zero win rate, high variance, and near-zero empathy.*
| Metric | Phase 1 | Phase 2 | Phase 3 |
|:---|:---|:---|:---|
| **Trained** | ![Phase 1](training_perf3.png) | ![Phase 2](training_per2.png) | ![Phase 3](training_performance1.png) |
### Component-Level Lift
| Component | Baseline Avg | After 75 ep | Δ |
|:---|:---|:---|:---|
| **Process** | 0.42 | 0.85 | +102% |
| **Empathy** | -0.12 | 0.22 | +283% |
| **Labs** | 0.15 | 0.48 | +220% |
| **Diagnosis** | 0.05 | 0.35 | +600% |
| **Plan** | 0.02 | 0.28 | +1300% |
| **Documentation** | 0.10 | 0.45 | +350% |
| **Consent** | -0.30 | 0.15 | +150% |
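
The Δ column is consistent with relative change measured against the magnitude of the baseline, which is what keeps the sign meaningful for Empathy and Consent, where the baseline is negative. A short sketch of that calculation (the function name is hypothetical):

```python
def relative_lift(baseline: float, trained: float) -> float:
    """Percentage change relative to the absolute value of the baseline."""
    return 100.0 * (trained - baseline) / abs(baseline)


# Values taken from the table above.
for name, base, after in [("Process", 0.42, 0.85), ("Empathy", -0.12, 0.22),
                          ("Consent", -0.30, 0.15)]:
    print(f"{name}: {relative_lift(base, after):+.0f}%")
```

Because small baselines inflate percentages (Plan starts at 0.02), the absolute columns are the more reliable comparison.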
---
## 7. Post-Training & Self-Improvement Strategy
- **Ablation Runs**: Disable Empathy Judge or use terminal-only rewards to prove necessity of process supervision
- **Wider LoRA on A100**: Target `gate_proj`, `up_proj`, `down_proj` (45M+ trainable params) for nuanced clinical phrasings
- **Phase 4 (Multi-Patient)**: Shift handoffs + juggling two cases with a shared nurse
- **Extended Tool API**: `consult_specialist`, `image_order` (CT/X-ray), `pharmacy_check` (drug-allergy)
---
## 8. OpenEnv Compliance & How to Use
### Endpoints (FastAPI)
```
POST /reset  → {observation, info}                           # Start new episode
POST /step   → {observation, reward, done, truncated, info}  # Submit action
GET  /state  → full internal env state                       # Debug only
GET  /health → {"status": "ok"}                              # Liveness check
GET  /docs   → Swagger UI                                    # Interactive API docs
```
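
A client loop over these endpoints might look like the sketch below, which uses only the standard library. The response keys come from the listing above, but the request body shape is an assumption: here `/reset` takes an empty object and `/step` takes the tool call under an `"action"` key. Check the Swagger UI at `/docs` for the real schema. `BASE_URL`, `http_post`, and `run_episode` are hypothetical names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: local server on the Docker port


def http_post(path: str, payload: dict) -> dict:
    """POST a JSON body to the environment server and decode the JSON reply."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_episode(policy, post=http_post, max_steps: int = 30) -> float:
    """Reset the environment, then step until done, truncated, or max_steps."""
    state = post("/reset", {})
    observation = state["observation"]
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)
        # Assumption: the server expects the tool call under an "action" key.
        state = post("/step", {"action": action})
        observation = state["observation"]
        total_reward += state["reward"]
        if state["done"] or state["truncated"]:
            break
    return total_reward
```

Passing `post` as a parameter lets the same loop run against a stub in tests and against the live server otherwise.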
### Run Locally
```bash
# Option 1: Docker
docker build -t ermap-env .
docker run -p 7860:7860 -e GROQ_API_KEY="your_key" ermap-env
# Option 2: Python
pip install -r requirements.txt
uvicorn ER_MAP.server:app --host 0.0.0.0 --port 7860
# Option 3: Dashboard UI
python -m ER_MAP.dashboard
# Open http://localhost:5050
```
### Do Judges Need API Keys?
**No.** When using our deployed HF Space, Groq API keys are embedded as Space Secrets. The judge simply sends HTTP requests. For local Docker testing, supply `GROQ_API_KEY` as shown above.
---
## 📁 Repository Structure
```
├── README.md                    # This file
├── blog.md                      # Engineering deep dive (HF Blog)
├── openenv.yaml                 # OpenEnv manifest
├── Dockerfile                   # HF Spaces deployment
├── requirements.txt             # Dependencies
├── setup.py                     # pip install -e .
├── ER_MAP/
│   ├── server.py                # FastAPI OpenEnv wrapper
│   ├── dashboard.py             # Interactive UI + TTS
│   ├── evaluate.py              # Training evaluation
│   ├── evaluate_baseline.py     # Baseline comparison
│   ├── envs/
│   │   ├── triage_env.py        # Core Gymnasium environment
│   │   ├── disease_db.py        # 50-disease database
│   │   ├── randomizer.py        # Persona & scenario generator
│   │   ├── empathy_engine.py    # Empathy Judge integration
│   │   └── api_router.py        # Multi-key Groq routing
│   └── training/
│       └── train_grpo.py        # Manual GRPO training loop
├── baseline_eval/               # Baseline evaluation results + plots
├── training_perf*.png           # Per-phase training dashboards
└── kaggle/                      # Kaggle training notebooks
```
---
## Acknowledgements
Hugging Face for credits and the Hub. The OpenEnv/PyTorch team for a well-designed hackathon brief. Unsloth for the 4-bit fused LoRA kernel that makes this fit on a T4. Groq for the 8B and 70B inference APIs. The Kaggle team for free T4 GPU sessions.
-- The ER-MAP Team