Uddiii committed on
Commit a0f62f1 · 1 Parent(s): 7a90355

Fix Mermaid diagrams for GitHub rendering - simplified all three diagrams, added agentic interaction flow

Files changed (2)
  1. README.md +10 -10
  2. blog.md +88 -100
README.md CHANGED
@@ -48,16 +48,16 @@ A multi-agent simulation implemented via Gymnasium and served via a FastAPI HTTP
 
  ```mermaid
  flowchart TD
- Doctor["Doctor Agent\n8B Llama LoRA"] -->|"JSON action"| Env["TriageEnv\nGymnasium + FastAPI\n50-disease DB · 17K+ persona combos"]
- Env -->|"speak_to"| Nurse["Nurse Actor\n8B-Instant Groq"]
- Env -->|"speak_to"| Patient["Patient Actor\n8B-Instant Groq\ntrust / anxiety state"]
- Env -->|"per-message"| EJ["Empathy Judge\n70B-Versatile Groq"]
- Env -->|"terminal treatment"| MJ["Medical Judge\n70B-Versatile Groq"]
- Nurse -->|"response"| Env
- Patient -->|"response + status"| Env
- EJ -->|"empathy score"| Env
- MJ -->|"treatment grade"| Env
- Env -->|"observation + reward"| Doctor
  ```
 
  ### The Actors
 
 
  ```mermaid
  flowchart TD
+ Doctor[Doctor Agent - 8B LoRA] --> Env[TriageEnv - Gymnasium + FastAPI]
+ Env --> Nurse[Nurse Actor - 8B Groq]
+ Env --> Patient[Patient Actor - 8B Groq]
+ Env --> EJ[Empathy Judge - 70B Groq]
+ Env --> MJ[Medical Judge - 70B Groq]
+ Nurse --> Env
+ Patient --> Env
+ EJ --> |empathy score| Env
+ MJ --> |treatment grade| Env
+ Env --> |observation + reward| Doctor
  ```
 
  ### The Actors
blog.md CHANGED
@@ -26,6 +26,39 @@ To make a diagnosis, a doctor doesn't just "guess" through symptoms—they follo
 
  Now lets discuss the playground where our agents play:
 
  ### Dual-Judge Architecture: The "Anti-Sycophant" Protocol
  **What stops the Doctor from auto-discharging a patient without a diagnosis?** In a standard RL environment, the model might "hack" the reward by being incredibly polite to farm empathy points while ignoring the medical crisis.
 
@@ -57,58 +90,43 @@ The environment's reward is a composite of **eleven named components**, ensuring
 
  ### Reward Decomposition Flow
 
- This is the path every reward signal takes from a single Doctor action to the trajectory total. The two LLM judges (red) sit on **independent paths** — the Empathy Judge runs every time the Doctor speaks to the patient, the Medical Judge runs only at the terminal step. Because they're both 70B Llama-3.3 with non-overlapping rubrics, the Doctor cannot maximize total reward by hacking just one of them. The seven rule-based components on the left act as a deterministic floor that prevents the model from skipping straight to discharge.
 
  ```mermaid
  flowchart TD
- DOC["Doctor JSON Action<br/>(every step, max 128 tokens)"]
- DOC --> SCHEMA{"JSON + tool<br/>grammar valid?"}
- SCHEMA -->|no| MALF["Malformed Process penalty: −0.10"]
- SCHEMA -->|yes| EXEC["TriageEnv.step()"]
-
- EXEC --> RB
- subgraph RB["Per-step rule-based engine deterministic, no LLM"]
- direction TB
- C_PROC["Process<br/>+0.05 valid step"]
- C_MILES["Milestones<br/>+0.10 chart → labs → SOAP order"]
- C_LABS["Labs<br/>+0.20 relevant / −0.05 distractor"]
- C_DOCU["Documentation<br/>+0.20 SOAP filled / −0.30 missing"]
- C_DIAG["Diagnosis (intermediate)<br/>+0.15 SOAP Assessment matches GT"]
- C_PLAN["Plan (intermediate)<br/>+0.10 plausible plan"]
- C_PEN["Penalties<br/>−0.05 redundancy<br/>−0.10 timeout<br/>−0.20 dismissive discharge"]
- end
-
- EXEC -->|speak_to_patient| EJ["Empathy Judge · 70B<br/>llama-3.3-70b-versatile<br/>scores every Doctor message"]
- EJ --> C_EMP["Empathy<br/>+0.05 explained · +0.03 acknowledged<br/>−0.08 dismissive"]
-
- EXEC -->|treat / discharge| MJ["Medical Judge · 70B<br/>llama-3.3-70b-versatile<br/>terminal grade only"]
- MJ --> TM
- subgraph TM["Terminal components — all from Medical Judge"]
- direction TB
- C_TREAT["Treatment<br/>+1.00 correct / −1.00 lethal"]
- C_EMER["Emergency ID<br/>+0.30 in-time / −0.30 missed"]
- C_CONS["Consent<br/>+0.15 informed / −0.40 forced"]
- end
-
- MALF --> AGG
- RB --> AGG
- C_EMP --> AGG
- TM --> AGG
-
- AGG["Σ 11 components → R_trajectory"] --> CLIP["clip into [−5, +5]"]
- CLIP --> OUT["Trajectory reward<br/>(feeds GRPO advantage)"]
-
- classDef judge fill:#fde2e4,stroke:#c1121f,color:#000
- classDef bonus fill:#d4edda,stroke:#155724,color:#000
- classDef penalty fill:#f8d7da,stroke:#842029,color:#000
- classDef gate fill:#fff3b0,stroke:#996600,color:#000
- classDef agg fill:#ffe066,stroke:#664d00,color:#000
-
- class EJ,MJ judge
- class C_PROC,C_MILES,C_LABS,C_DOCU,C_DIAG,C_PLAN,C_EMP,C_TREAT,C_EMER,C_CONS bonus
- class MALF,C_PEN penalty
- class SCHEMA gate
- class AGG,CLIP,OUT agg
  ```
 
  ### Anti-Reward-Hacking Measures (Penalize model for being oversmart)
@@ -161,57 +179,27 @@ The pipeline below shows every stage from "scheduler picks a phase" to "AdamW up
 
  ```mermaid
  flowchart TD
- subgraph SCHED["Curriculum Scheduler — fixed-budget, 75 episodes"]
- direction LR
- P1["Phase 1: Tool Mastery<br/>20 episodes · compliant patients"]
- P2["Phase 2: Clinical Reasoning<br/>25 episodes · noisy SOAP"]
- P3["Phase 3: Empathetic Negotiation<br/>30 episodes · hostile · cost-sensitive"]
- P1 -->|budget hit| P2 -->|budget hit| P3
- end
-
- SCHED --> SEED["Sample shared seed<br/>+ env_options<br/>(disease, persona, difficulty)"]
-
- SEED --> INF["FastLM.for_inference(model)<br/>drop grad-ckpt buffers T4 OOM fix"]
-
- INF --> ROLL
- subgraph ROLL["Group Rollout · G = 2 trajectories · same seed"]
- direction LR
- EPA["Episode A (≤ 20 steps)<br/>Doctor Nurse / Patient<br/>Empathy Judge per message<br/>Medical Judge at terminal"]
- EPB["Episode B (≤ 20 steps)<br/>same seed · different sample<br/>(temperature 0.7)"]
- end
-
- ROLL --> RAGG["Per-trajectory reward aggregation<br/>11 components → R_A, R_B<br/>(see reward diagram above)"]
-
- RAGG --> ADV["Group-Relative Advantage<br/>Aᵢ = (Rᵢ − μ_R) / (σ_R + ε)<br/>no value model — group is the baseline"]
-
- ADV --> TRN["FastLM.for_training(model)<br/>re-enable gradient checkpointing"]
-
- TRN --> GRPO
- subgraph GRPO["GRPO Update — per-step backward (T4-safe)"]
- direction TB
- STEP["For each (prompt, response) pair<br/>across both trajectories — ≈ 40 pairs"]
- STEP --> LOSS["L_step = −Aᵢ · meanₜ log π_θ(aₜ|sₜ)<br/>KL term off (β = 0)"]
- LOSS --> SCALE["scale by 1 / n_steps_total<br/>loss.backward() per step<br/>release graph, accumulate grad"]
- SCALE --> MORE{more pairs?}
- MORE -->|yes| STEP
- MORE -->|no| OPT["clip_grad_norm = 1.0<br/>AdamW step (lr = 5e-6)<br/>updates LoRA only<br/>q/k/v/o_proj · ~17M params"]
- end
-
- GRPO --> CKPT["Periodic LoRA checkpoint<br/>every 10 episodes<br/>checkpoint_epN_phaseM/"]
- GRPO --> METRICS["Append to training_metrics.json<br/>(per-episode log)"]
- GRPO -->|loop until 75 episodes| SCHED
-
- SCHED -.budget exhausted.-> FIN["Final save<br/>final_lora/ · final_merged_fp16/<br/>+ post-training inference smoke test"]
-
- classDef phase fill:#cce5ff,stroke:#004085,color:#000
- classDef mem fill:#fff3b0,stroke:#996600,color:#000
- classDef rl fill:#d3f9d8,stroke:#2b8a3e,color:#000
- classDef save fill:#f8d7da,stroke:#842029,color:#000
-
- class P1,P2,P3 phase
- class INF,TRN mem
- class ADV,GRPO rl
- class FIN,CKPT save
  ```
 
  ### Why Manual GRPO, Not TRL?
 
 
  Now lets discuss the playground where our agents play:
 
+ ### Agentic Interaction Flow
+
+ Here is the full tool-use flow of a single episode. The Doctor picks one JSON tool per turn, the environment routes it to the right actor, and the reward engine scores the result.
+
+ ```mermaid
+ flowchart TD
+ Doctor[Doctor Agent - 8B LoRA] --> read_soap[read_soap - Read patient chart]
+ Doctor --> speak_patient[speak_to patient - Ask symptoms]
+ Doctor --> speak_nurse[speak_to nurse - Request vitals]
+ Doctor --> order_lab[order_lab - Order a test]
+ Doctor --> update_soap[update_soap - Write Assessment or Plan]
+ Doctor --> discharge[terminal_discharge - Final treatment]
+
+ read_soap --> Env[TriageEnv]
+ speak_patient --> Patient[Patient Actor - 8B Groq]
+ speak_nurse --> Nurse[Nurse Actor - 8B Groq]
+ order_lab --> Env
+ update_soap --> Env
+ discharge --> Env
+
+ Patient --> |trust or anxiety status| Env
+ Nurse --> |vitals and observations| Env
+
+ Env --> |per-message| EmpathyJudge[Empathy Judge - 70B]
+ Env --> |terminal only| MedicalJudge[Medical Judge - 70B]
+
+ EmpathyJudge --> |empathy score| Reward[11-Component Reward]
+ MedicalJudge --> |treatment grade| Reward
+ Env --> |process + milestones + labs| Reward
+
+ Reward --> |observation + reward| Doctor
+ ```
+
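Editor's sketch, not part of the commit: the tool-use flow above implies a gate that rejects anything that is not a well-formed tool call before the environment routes it. The snippet below illustrates one way that gate could look. The tool names come from the diagram and the −0.10 penalty from the reward diagram in this diff; the `"tool"` key, the `parse_action` helper and the constant names are assumptions made for illustration, since the actual action schema is not shown here.

```python
import json

# Tool names as they appear in the diagram. The "tool" key is a hypothetical
# schema detail; the real action format is not shown in this diff.
ALLOWED_TOOLS = {
    "read_soap",
    "speak_to",
    "order_lab",
    "update_soap",
    "terminal_discharge",
}
MALFORMED_PENALTY = -0.10  # value shown in the reward diagram


def parse_action(raw: str):
    """Return (action, penalty); action is None when the text is unusable."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None, MALFORMED_PENALTY
    # Valid JSON that is not an object, or names an unknown tool, is rejected too.
    if not isinstance(action, dict) or action.get("tool") not in ALLOWED_TOOLS:
        return None, MALFORMED_PENALTY
    return action, 0.0


action, penalty = parse_action('{"tool": "order_lab", "test": "troponin"}')
```

Returning the penalty alongside the parsed action keeps the gate a pure function, so it can be unit-tested without spinning up the environment or any LLM actor.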
  ### Dual-Judge Architecture: The "Anti-Sycophant" Protocol
  **What stops the Doctor from auto-discharging a patient without a diagnosis?** In a standard RL environment, the model might "hack" the reward by being incredibly polite to farm empathy points while ignoring the medical crisis.
 
 
 
  ### Reward Decomposition Flow
 
+ The two LLM judges sit on **independent paths** — the Empathy Judge runs every time the Doctor speaks to the patient, the Medical Judge runs only at the terminal step. The seven rule-based components act as a deterministic floor.
 
  ```mermaid
  flowchart TD
+ DOC[Doctor Action] --> VALID{Valid JSON?}
+ VALID -->|no| PEN[Penalty -0.10]
+ VALID -->|yes| ENV[TriageEnv processes action]
+
+ ENV --> RULES[Rule-Based Rewards]
+ ENV --> EJ[Empathy Judge 70B - per message]
+ ENV --> MJ[Medical Judge 70B - terminal only]
+
+ RULES --> R1[Process +0.05]
+ RULES --> R2[Milestones +0.10]
+ RULES --> R3[Labs +0.20 or -0.05]
+ RULES --> R4[Documentation +0.20]
+ RULES --> R5[Diagnosis +0.15]
+ RULES --> R6[Penalties -0.05 to -0.20]
+
+ EJ --> R7[Empathy +0.05 or -0.08]
+ MJ --> R8[Treatment +1.0 or -1.0]
+ MJ --> R9[Emergency ID +0.30 or -0.30]
+ MJ --> R10[Consent +0.15 or -0.40]
+
+ R1 --> SUM[Sum of 11 components = Trajectory Reward]
+ R2 --> SUM
+ R3 --> SUM
+ R4 --> SUM
+ R5 --> SUM
+ R6 --> SUM
+ R7 --> SUM
+ R8 --> SUM
+ R9 --> SUM
+ R10 --> SUM
+ PEN --> SUM
+
+ SUM --> GRPO[Feeds into GRPO advantage]
  ```
 
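Editor's sketch, not part of the commit: the diagram reduces to "accumulate named components over the episode, then sum them into one trajectory reward". A minimal way to express that is below. The `RewardLedger` class and the component keys are hypothetical names; the per-event values are the ones shown in the diagram, and the [−5, +5] clip is taken from the removed version of the diagram, so treat it as an assumption about the current code.

```python
from dataclasses import dataclass, field


@dataclass
class RewardLedger:
    """Accumulates named reward components for one trajectory (hypothetical helper)."""

    components: dict = field(default_factory=dict)

    def add(self, name: str, value: float) -> None:
        # A component can fire several times per episode (e.g. empathy per message).
        self.components[name] = self.components.get(name, 0.0) + value

    def total(self, low: float = -5.0, high: float = 5.0) -> float:
        # Sum every component, then clip into [low, high].
        raw = sum(self.components.values())
        return max(low, min(high, raw))


ledger = RewardLedger()
ledger.add("process", 0.05)    # valid step
ledger.add("labs", 0.20)       # relevant lab ordered
ledger.add("empathy", 0.05)    # Empathy Judge, one message
ledger.add("treatment", 1.0)   # Medical Judge, terminal grade
trajectory_reward = ledger.total()
```

Keeping the components in a dictionary rather than a single running float makes it easy to log which judge or rule contributed what, which is useful when checking for reward hacking.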
  ### Anti-Reward-Hacking Measures (Penalize model for being oversmart)
 
 
  ```mermaid
  flowchart TD
+ SCHED[Curriculum Scheduler - 75 episodes] --> P1[Phase 1 - Tool Mastery - 20 ep]
+ P1 --> P2[Phase 2 - Clinical Reasoning - 25 ep]
+ P2 --> P3[Phase 3 - Empathetic Negotiation - 30 ep]
+
+ P1 --> SEED[Sample shared seed + disease + persona]
+ P2 --> SEED
+ P3 --> SEED
+
+ SEED --> INF[Switch to inference mode - saves VRAM]
+ INF --> EPA[Episode A - up to 20 steps]
+ INF --> EPB[Episode B - same seed different actions]
+
+ EPA --> REWARDS[Compute trajectory rewards R_A and R_B]
+ EPB --> REWARDS
+
+ REWARDS --> ADV[Group-Relative Advantage - no critic needed]
+ ADV --> TRAIN[Switch to training mode]
+ TRAIN --> GRPO[GRPO Update - per-step backward]
+ GRPO --> LORA[AdamW updates LoRA weights - 17M params]
+ LORA --> SAVE[Checkpoint every 10 episodes]
+ LORA -->|next episode| SCHED
  ```
 
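Editor's sketch, not part of the commit: the "Group-Relative Advantage" node stands for the formula spelled out in the removed diagram, Aᵢ = (Rᵢ − μ_R) / (σ_R + ε), where the group of rollouts that share a seed serves as the baseline instead of a learned critic. The function below is a small illustration of that formula. The function name, the use of the population standard deviation and the value of `eps` are assumptions; the diff does not say which variant the training script uses.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize trajectory rewards within one group: (R_i - mean) / (std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts from the same seed (G = 2), with illustrative rewards.
adv_a, adv_b = group_relative_advantages([1.3, 0.4])
```

With a group of two and the population standard deviation, the advantages always come out close to +1 and −1 whenever the rewards differ, and exactly 0 for both when they tie. In other words, the update only learns from which rollout was better, not by how much.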
  ### Why Manual GRPO, Not TRL?