title: 'NurseSim-RL: Training AI Agents for Clinical Triage'
thumbnail: /blog/assets/nursesim-rl/thumbnail.png
authors:
- user: NurseCitizenDeveloper
tags:
- reinforcement-learning
- healthcare
- openenv
- llama
- unsloth
- clinical-ai
NurseSim-RL: Training AI Agents for Clinical Triage
TL;DR: We built a Gymnasium-compatible RL environment that simulates Emergency Department triage and fine-tuned a Llama 3.2 3B model to master it using Unsloth. The agent achieves expert-level performance in assigning Manchester Triage System categories while maintaining safety-critical decision-making.
The Challenge: OpenEnv 2026
This project was developed for the OpenEnv Challenge, sponsored by PyTorch, Hugging Face, and Unsloth. The goal? Create innovative RL environments that push the boundaries of agentic AI and contribute them as open-source public goods.
Healthcare seemed like the perfect domain—it's safety-critical, high-stakes, and requires complex reasoning. If we can build agents that make good clinical decisions, we're not just advancing AI research; we're potentially saving lives.
The Problem: A&E Triage is Hard
Every day, Emergency Departments (A&E in the UK, ER in the US) face a critical challenge: which patient gets seen first?
Triage nurses use the Manchester Triage System (MTS) to categorize patients into 5 priority levels:
| Category | Priority | Target Time | Example |
|---|---|---|---|
| 1 | Immediate | 0 min | Cardiac arrest, Anaphylaxis |
| 2 | Very Urgent | 10 min | Chest pain (STEMI), Stroke |
| 3 | Urgent | 60 min | Abdominal pain, Fractures |
| 4 | Standard | 120 min | Minor injuries, Viral illness |
| 5 | Non-Urgent | 240 min | Minor cuts, GP-suitable |
Why This Matters
A wrong decision has real consequences:
- Under-triage a Category 1 patient → Life-threatening delay
- Over-triage a Category 5 patient → Wasted critical resources
This isn't just a classification problem—it's a safety-critical resource allocation game.
The Solution: NurseSim-RL Environment
We built NurseSim-Triage-v0, a Gymnasium-compatible environment that models the A&E triage workflow.
How It Works
Observation Space:
{
"patient_complaint": "Crushing chest pain radiating to left arm",
"vitals": {
"HR": 110,
"BP": "90/60",
"SpO2": 94,
"Temp": 37.2
},
"waiting_room": 8,
"available_beds": 2
}
Action Space:
{
"triage_category": 2, # 1-5 (MTS)
"intervention": "send_to_resus" # Clinical action
}
Reward Function:
- +10 for correct triage category
- -50 for critical safety failures (e.g., discharging a Cat 1 patient)
- -1 per minute of wait time for critical patients
Dataset Generation
We created a PatientGenerator class that produces realistic scenarios:
- 500 training examples covering all 5 MTS categories
- Realistic vital sign variations (e.g., tachycardia in sepsis, hypotension in shock)
- Distribution mimicking real A&E patient flow (more Cat 3-4 than Cat 1-2)
Example:
{
"instruction": "You are an expert A&E Triage Nurse...",
"input": "Patient: 68-year-old male, crushing chest pain...",
"output": "CATEGORY 2 (Very Urgent). Rationale: Classic STEMI presentation..."
}
Training: Llama 3.2 + Unsloth = Magic ✨
We used Unsloth to fine-tune Llama-3.2-3B-Instruct with 4-bit QLoRA. Why Unsloth? 2x faster training and 60% less memory.
Setup
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Llama-3.2-3B-Instruct",
max_seq_length=2048,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
Training Results
The convergence was stunning:
| Metric | Value |
|---|---|
| Initial Loss | 2.8 |
| Final Loss | 0.08 |
| Steps | 100 |
| Epochs | ~6 |
| Hardware | NVIDIA A100 (Colab) |
| Time | 15 minutes |
The training loss dropped from 2.8 to <0.1 in just 100 steps, demonstrating rapid domain adaptation.
The model went from "guessing" to "expert" in just 100 optimization steps. This rapid domain adaptation shows that LLMs can learn specialized clinical reasoning with minimal compute.
Training Metrics Deep Dive
Below are the complete training metrics from our W&B run:
📈 Click to View All Training Charts
Loss Progression (Global Steps)
Loss Progression (Epochs)
Gradient Norm Stability
Gradient norm stabilized after ~20 steps, indicating healthy convergence.
Learning Rate Schedule
Key Observations:
- ✅ No overfitting: Smooth loss curve without erratic spikes
- ✅ Stable gradients: No exploding/vanishing gradient issues
- ✅ Efficient optimization: Reached convergence well before max_steps
The Agent in Action
We deployed the fine-tuned model to a Gradio Space powered by ZeroGPU:
Example Inference
Input:
Chief Complaint: "Crushing chest pain and nausea"
Vitals: HR 110, BP 90/60, SpO2 94%
Output:
CATEGORY 2 (Very Urgent - 10 min target)
Rationale: Classic presentation of acute coronary syndrome (ACS).
The crushing chest pain combined with hypotension (BP 90/60) and
mild hypoxia (SpO2 94%) indicates significant cardiac compromise.
Recommended Action: Immediate ECG, troponin, aspirin 300mg, IV access.
Send to Resus for continuous monitoring.
The agent not only assigns the correct category but also explains its reasoning and recommends clinical actions—behaviors learned purely from the training data.
Technical Deep Dive
Why Llama 3.2?
- Instruction-tuned: Already aligned for conversational tasks
- Small enough for edge deployment: 3B parameters = mobile/browser inference
- Meta's clinical pre-training: Better baseline than general-purpose models
Why 4-bit QLoRA?
- Memory: Fits on consumer GPUs (even T4!)
- Speed: Unsloth's kernel optimizations make it viable
- Accuracy: Minimal degradation vs full fine-tuning for this task
Reproducibility
Everything is open-source:
- Dockerfile:
docker build -t nursesim . && docker run -p 7860:7860 nursesim - Colab Notebook: One-click training replication
- GitHub: Full environment code + tests
Lessons Learned
What Worked
- Synthetic data quality matters more than quantity: 500 well-crafted examples > 10,000 noisy ones
- Unsloth is a game-changer: Training went from "weekend project" to "15 minutes"
- Safety constraints are learnable: The model respects the -50 penalty and rarely under-triages
What Could Be Better
- Real clinical validation: We need nurses to red-team the system
- Uncertainty quantification: The model should say "I don't know" when confidence is low
- Multi-modal inputs: Real triage uses visual cues (patient appearance, distress level)
Impact & Future Work
Immediate Applications
- Nursing Education: Students can practice triage scenarios 24/7
- Workforce Augmentation: AI-assisted triage in low-resource settings
- Benchmarking: Other researchers can use NurseSim-RL to test their agents
Next Steps
- Partner with NHS Trusts for real-world pilot testing
- Extend to other clinical domains (radiology, discharge planning)
- Build multi-agent systems (Triage Nurse + Consultant + Pharmacist)
Try It Yourself
All the code, data, and models are open-source:
Acknowledgements
- OpenEnv Challenge - Berkeley RDI, PyTorch, Hugging Face, Unsloth
- Manchester Triage System - Clinical framework
- Unsloth AI - For making LLM fine-tuning actually enjoyable
Built with ❤️ for the OpenEnv Challenge 2026



