Fix Mermaid diagrams for GitHub rendering - simplified all three diagrams, added agentic interaction flow
README.md
CHANGED
|
@@ -48,16 +48,16 @@ A multi-agent simulation implemented via Gymnasium and served via a FastAPI HTTP
|
|
| 48 |
|
| 49 |
```mermaid
|
| 50 |
flowchart TD
|
| 51 |
-
Doctor[
|
| 52 |
-
Env -->
|
| 53 |
-
Env -->
|
| 54 |
-
Env -->
|
| 55 |
-
Env -->
|
| 56 |
-
Nurse -->
|
| 57 |
-
Patient -->
|
| 58 |
-
EJ -->|
|
| 59 |
-
MJ -->|
|
| 60 |
-
Env -->|
|
| 61 |
```
|
| 62 |
|
| 63 |
### The Actors
|
|
|
|
| 48 |
|
| 49 |
```mermaid
|
| 50 |
flowchart TD
|
| 51 |
+
Doctor[Doctor Agent - 8B LoRA] --> Env[TriageEnv - Gymnasium + FastAPI]
|
| 52 |
+
Env --> Nurse[Nurse Actor - 8B Groq]
|
| 53 |
+
Env --> Patient[Patient Actor - 8B Groq]
|
| 54 |
+
Env --> EJ[Empathy Judge - 70B Groq]
|
| 55 |
+
Env --> MJ[Medical Judge - 70B Groq]
|
| 56 |
+
Nurse --> Env
|
| 57 |
+
Patient --> Env
|
| 58 |
+
EJ --> |empathy score| Env
|
| 59 |
+
MJ --> |treatment grade| Env
|
| 60 |
+
Env --> |observation + reward| Doctor
|
| 61 |
```
|
| 62 |
|
| 63 |
### The Actors
|
blog.md
CHANGED
|
@@ -26,6 +26,39 @@ To make a diagnosis, a doctor doesn't just "guess" through symptoms—they follo
|
|
| 26 |
|
| 27 |
Now let's discuss the playground where our agents play:
|
| 28 |
|
| 29 |
### Dual-Judge Architecture: The "Anti-Sycophant" Protocol
|
| 30 |
**What stops the Doctor from auto-discharging a patient without a diagnosis?** In a standard RL environment, the model might "hack" the reward by being incredibly polite to farm empathy points while ignoring the medical crisis.
|
| 31 |
|
|
@@ -57,58 +90,43 @@ The environment's reward is a composite of **eleven named components**, ensuring
|
|
| 57 |
|
| 58 |
### Reward Decomposition Flow
|
| 59 |
|
| 60 |
-
|
| 61 |
|
| 62 |
```mermaid
|
| 63 |
flowchart TD
|
| 64 |
-
DOC[
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
AGG["Σ 11 components → R_trajectory"] --> CLIP["clip into [−5, +5]"]
|
| 99 |
-
CLIP --> OUT["Trajectory reward<br/>(feeds GRPO advantage)"]
|
| 100 |
-
|
| 101 |
-
classDef judge fill:#fde2e4,stroke:#c1121f,color:#000
|
| 102 |
-
classDef bonus fill:#d4edda,stroke:#155724,color:#000
|
| 103 |
-
classDef penalty fill:#f8d7da,stroke:#842029,color:#000
|
| 104 |
-
classDef gate fill:#fff3b0,stroke:#996600,color:#000
|
| 105 |
-
classDef agg fill:#ffe066,stroke:#664d00,color:#000
|
| 106 |
-
|
| 107 |
-
class EJ,MJ judge
|
| 108 |
-
class C_PROC,C_MILES,C_LABS,C_DOCU,C_DIAG,C_PLAN,C_EMP,C_TREAT,C_EMER,C_CONS bonus
|
| 109 |
-
class MALF,C_PEN penalty
|
| 110 |
-
class SCHEMA gate
|
| 111 |
-
class AGG,CLIP,OUT agg
|
| 112 |
```
|
| 113 |
|
| 114 |
### Anti-Reward-Hacking Measures (Penalize model for being oversmart)
|
|
@@ -161,57 +179,27 @@ The pipeline below shows every stage from "scheduler picks a phase" to "AdamW up
|
|
| 161 |
|
| 162 |
```mermaid
|
| 163 |
flowchart TD
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
|
| 181 |
-
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
RAGG --> ADV["Group-Relative Advantage<br/>Aᵢ = (Rᵢ − μ_R) / (σ_R + ε)<br/>no value model — group is the baseline"]
|
| 186 |
-
|
| 187 |
-
ADV --> TRN["FastLM.for_training(model)<br/>re-enable gradient checkpointing"]
|
| 188 |
-
|
| 189 |
-
TRN --> GRPO
|
| 190 |
-
subgraph GRPO["GRPO Update — per-step backward (T4-safe)"]
|
| 191 |
-
direction TB
|
| 192 |
-
STEP["For each (prompt, response) pair<br/>across both trajectories — ≈ 40 pairs"]
|
| 193 |
-
STEP --> LOSS["L_step = −Aᵢ · meanₜ log π_θ(aₜ|sₜ)<br/>KL term off (β = 0)"]
|
| 194 |
-
LOSS --> SCALE["scale by 1 / n_steps_total<br/>loss.backward() per step<br/>release graph, accumulate grad"]
|
| 195 |
-
SCALE --> MORE{more pairs?}
|
| 196 |
-
MORE -->|yes| STEP
|
| 197 |
-
MORE -->|no| OPT["clip_grad_norm = 1.0<br/>AdamW step (lr = 5e-6)<br/>updates LoRA only<br/>q/k/v/o_proj · ~17M params"]
|
| 198 |
-
end
|
| 199 |
-
|
| 200 |
-
GRPO --> CKPT["Periodic LoRA checkpoint<br/>every 10 episodes<br/>checkpoint_epN_phaseM/"]
|
| 201 |
-
GRPO --> METRICS["Append to training_metrics.json<br/>(per-episode log)"]
|
| 202 |
-
GRPO -->|loop until 75 episodes| SCHED
|
| 203 |
-
|
| 204 |
-
SCHED -.budget exhausted.-> FIN["Final save<br/>final_lora/ · final_merged_fp16/<br/>+ post-training inference smoke test"]
|
| 205 |
-
|
| 206 |
-
classDef phase fill:#cce5ff,stroke:#004085,color:#000
|
| 207 |
-
classDef mem fill:#fff3b0,stroke:#996600,color:#000
|
| 208 |
-
classDef rl fill:#d3f9d8,stroke:#2b8a3e,color:#000
|
| 209 |
-
classDef save fill:#f8d7da,stroke:#842029,color:#000
|
| 210 |
-
|
| 211 |
-
class P1,P2,P3 phase
|
| 212 |
-
class INF,TRN mem
|
| 213 |
-
class ADV,GRPO rl
|
| 214 |
-
class FIN,CKPT save
|
| 215 |
```
|
| 216 |
|
| 217 |
### Why Manual GRPO, Not TRL?
|
|
|
|
| 26 |
|
| 27 |
Now let's discuss the playground where our agents play:
|
| 28 |
|
| 29 |
+
### Agentic Interaction Flow
|
| 30 |
+
|
| 31 |
+
Here is the full tool-use flow of a single episode. The Doctor picks one JSON tool per turn, the environment routes it to the right actor, and the reward engine scores the result.
|
| 32 |
+
|
| 33 |
+
```mermaid
|
| 34 |
+
flowchart TD
|
| 35 |
+
Doctor[Doctor Agent - 8B LoRA] --> read_soap[read_soap - Read patient chart]
|
| 36 |
+
Doctor --> speak_patient[speak_to patient - Ask symptoms]
|
| 37 |
+
Doctor --> speak_nurse[speak_to nurse - Request vitals]
|
| 38 |
+
Doctor --> order_lab[order_lab - Order a test]
|
| 39 |
+
Doctor --> update_soap[update_soap - Write Assessment or Plan]
|
| 40 |
+
Doctor --> discharge[terminal_discharge - Final treatment]
|
| 41 |
+
|
| 42 |
+
read_soap --> Env[TriageEnv]
|
| 43 |
+
speak_patient --> Patient[Patient Actor - 8B Groq]
|
| 44 |
+
speak_nurse --> Nurse[Nurse Actor - 8B Groq]
|
| 45 |
+
order_lab --> Env
|
| 46 |
+
update_soap --> Env
|
| 47 |
+
discharge --> Env
|
| 48 |
+
|
| 49 |
+
Patient --> |trust or anxiety status| Env
|
| 50 |
+
Nurse --> |vitals and observations| Env
|
| 51 |
+
|
| 52 |
+
Env --> |per-message| EmpathyJudge[Empathy Judge - 70B]
|
| 53 |
+
Env --> |terminal only| MedicalJudge[Medical Judge - 70B]
|
| 54 |
+
|
| 55 |
+
EmpathyJudge --> |empathy score| Reward[11-Component Reward]
|
| 56 |
+
MedicalJudge --> |treatment grade| Reward
|
| 57 |
+
Env --> |process + milestones + labs| Reward
|
| 58 |
+
|
| 59 |
+
Reward --> |observation + reward| Doctor
|
| 60 |
+
```
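The routing in the diagram above can be sketched in a few lines of Python. This is only an illustration: the tool names come from the diagram, but the JSON field names (`tool`, `target`) and the `route_action` helper are assumptions made for this sketch, not the environment's actual schema.

```python
import json

# Tool names are taken from the diagram; the JSON field names ("tool",
# "target") and this helper are hypothetical, chosen only for illustration.
ENV_TOOLS = {"read_soap", "order_lab", "update_soap", "terminal_discharge"}
ACTOR_TARGETS = {"patient", "nurse"}


def route_action(raw: str) -> str:
    """Return which component would handle one Doctor turn.

    Possible results: "patient", "nurse", "env", or "invalid".
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid"
    if not isinstance(action, dict):
        return "invalid"

    tool = action.get("tool")
    if not isinstance(tool, str):
        return "invalid"

    if tool == "speak_to":
        # speak_to is forwarded to an LLM actor rather than handled by the env.
        target = action.get("target")
        if isinstance(target, str) and target in ACTOR_TARGETS:
            return target
        return "invalid"

    return "env" if tool in ENV_TOOLS else "invalid"


print(route_action('{"tool": "speak_to", "target": "nurse"}'))
print(route_action('{"tool": "order_lab", "test": "CBC"}'))
print(route_action("not json"))
```

Returning a plain label keeps the sketch independent of any actor or judge client; a real environment would presumably call the matching actor and then hand the result to the reward engine.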
|
| 61 |
+
|
| 62 |
### Dual-Judge Architecture: The "Anti-Sycophant" Protocol
|
| 63 |
**What stops the Doctor from auto-discharging a patient without a diagnosis?** In a standard RL environment, the model might "hack" the reward by being incredibly polite to farm empathy points while ignoring the medical crisis.
|
| 64 |
|
|
|
|
| 90 |
|
| 91 |
### Reward Decomposition Flow
|
| 92 |
|
| 93 |
+
The two LLM judges sit on **independent paths**: the Empathy Judge runs every time the Doctor speaks to the patient, while the Medical Judge runs only at the terminal step. The seven rule-based components act as a deterministic floor.
|
| 94 |
|
| 95 |
```mermaid
|
| 96 |
flowchart TD
|
| 97 |
+
DOC[Doctor Action] --> VALID{Valid JSON?}
|
| 98 |
+
VALID -->|no| PEN[Penalty -0.10]
|
| 99 |
+
VALID -->|yes| ENV[TriageEnv processes action]
|
| 100 |
+
|
| 101 |
+
ENV --> RULES[Rule-Based Rewards]
|
| 102 |
+
ENV --> EJ[Empathy Judge 70B - per message]
|
| 103 |
+
ENV --> MJ[Medical Judge 70B - terminal only]
|
| 104 |
+
|
| 105 |
+
RULES --> R1[Process +0.05]
|
| 106 |
+
RULES --> R2[Milestones +0.10]
|
| 107 |
+
RULES --> R3[Labs +0.20 or -0.05]
|
| 108 |
+
RULES --> R4[Documentation +0.20]
|
| 109 |
+
RULES --> R5[Diagnosis +0.15]
|
| 110 |
+
RULES --> R6[Penalties -0.05 to -0.20]
|
| 111 |
+
|
| 112 |
+
EJ --> R7[Empathy +0.05 or -0.08]
|
| 113 |
+
MJ --> R8[Treatment +1.0 or -1.0]
|
| 114 |
+
MJ --> R9[Emergency ID +0.30 or -0.30]
|
| 115 |
+
MJ --> R10[Consent +0.15 or -0.40]
|
| 116 |
+
|
| 117 |
+
R1 --> SUM[Sum of 11 components = Trajectory Reward]
|
| 118 |
+
R2 --> SUM
|
| 119 |
+
R3 --> SUM
|
| 120 |
+
R4 --> SUM
|
| 121 |
+
R5 --> SUM
|
| 122 |
+
R6 --> SUM
|
| 123 |
+
R7 --> SUM
|
| 124 |
+
R8 --> SUM
|
| 125 |
+
R9 --> SUM
|
| 126 |
+
R10 --> SUM
|
| 127 |
+
PEN --> SUM
|
| 128 |
+
|
| 129 |
+
SUM --> GRPO[Feeds into GRPO advantage]
|
| 130 |
```
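As a rough illustration of the final `SUM` node, the sketch below adds up named components into one trajectory reward. The component keys and the `trajectory_reward` helper are hypothetical. The optional clip range of [-5, +5] is taken from the earlier version of this diagram, and whether the simplified flow still clips is an assumption. The example values use a single occurrence of each positive amount shown in the diagram.

```python
def trajectory_reward(components, clip=None):
    """Sum named reward components; optionally clip the total to (low, high)."""
    total = sum(components.values())
    if clip is not None:
        low, high = clip
        total = max(low, min(high, total))
    return total


# Hypothetical episode: one occurrence of each positive value from the diagram.
# The key names are illustrative, not the environment's real component names.
example = {
    "process": 0.05,
    "milestones": 0.10,
    "labs": 0.20,
    "documentation": 0.20,
    "diagnosis": 0.15,
    "penalties": 0.0,
    "malformed_json": 0.0,
    "empathy": 0.05,
    "treatment": 1.0,
    "emergency_id": 0.30,
    "consent": 0.15,
}

print(round(trajectory_reward(example), 2))
print(trajectory_reward({"treatment": 10.0}, clip=(-5.0, 5.0)))
```

In a real episode the per-step components would accumulate over many turns, so the totals here only show how the pieces combine, not typical magnitudes.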
|
| 131 |
|
| 132 |
### Anti-Reward-Hacking Measures (Penalize model for being oversmart)
|
|
|
|
| 179 |
|
| 180 |
```mermaid
|
| 181 |
flowchart TD
|
| 182 |
+
SCHED[Curriculum Scheduler - 75 episodes] --> P1[Phase 1 - Tool Mastery - 20 ep]
|
| 183 |
+
P1 --> P2[Phase 2 - Clinical Reasoning - 25 ep]
|
| 184 |
+
P2 --> P3[Phase 3 - Empathetic Negotiation - 30 ep]
|
| 185 |
+
|
| 186 |
+
P1 --> SEED[Sample shared seed + disease + persona]
|
| 187 |
+
P2 --> SEED
|
| 188 |
+
P3 --> SEED
|
| 189 |
+
|
| 190 |
+
SEED --> INF[Switch to inference mode - saves VRAM]
|
| 191 |
+
INF --> EPA[Episode A - up to 20 steps]
|
| 192 |
+
INF --> EPB[Episode B - same seed different actions]
|
| 193 |
+
|
| 194 |
+
EPA --> REWARDS[Compute trajectory rewards R_A and R_B]
|
| 195 |
+
EPB --> REWARDS
|
| 196 |
+
|
| 197 |
+
REWARDS --> ADV[Group-Relative Advantage - no critic needed]
|
| 198 |
+
ADV --> TRAIN[Switch to training mode]
|
| 199 |
+
TRAIN --> GRPO[GRPO Update - per-step backward]
|
| 200 |
+
GRPO --> LORA[AdamW updates LoRA weights - 17M params]
|
| 201 |
+
LORA --> SAVE[Checkpoint every 10 episodes]
|
| 202 |
+
LORA -->|next episode| SCHED
|
| 203 |
```
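The "Group-Relative Advantage" step can be sketched as follows, using the formula from the earlier version of this diagram, A_i = (R_i - mu_R) / (sigma_R + eps). This is a minimal sketch, not the training script: the function name, the choice of population standard deviation, and the `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize trajectory rewards against their own group.

    The group mean acts as the baseline, so no value model (critic) is needed.
    Population standard deviation and eps=1e-8 are assumptions of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two episodes sharing a seed, as in the diagram (rewards R_A and R_B).
adv_a, adv_b = group_relative_advantages([2.0, 1.0])
print(adv_a, adv_b)

# Identical rewards carry no learning signal: both advantages are zero.
print(group_relative_advantages([1.5, 1.5]))
```

With a group of two, the better episode always gets an advantage near +1 and the worse one near -1, so the update pushes toward whichever trajectory scored higher regardless of the absolute reward scale.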
|
| 204 |
|
| 205 |
### Why Manual GRPO, Not TRL?
|