Addyk24 committed on
Commit db9cb7a · verified · 1 Parent(s): ec9a31a

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -29,7 +29,7 @@ short_description: Multi-Agent RL Environment for PRD Negotiation
 
 ---
 
-## The Problem
+## 🧱 The Problem Statement
 
 Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.
 
@@ -42,7 +42,7 @@ This is a gap that matters. Every enterprise AI deployment involves multi-stakeh
 
 ---
 
-## The Environment
+## 🧠 The Environment
 
 An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.
 
@@ -78,7 +78,7 @@ Architectural Highlights:
 - Closed-Loop Reward Engine: The "Judge" monitors the state changes in the environment and provides a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic.
 
 
-### Hidden Constraints (what the agent must discover)
+### 🏛️ Hidden Constraints (what the agent must discover)
 
 | Expert | Hidden Constraint | Hints at |
 |---|---|---|
@@ -90,7 +90,7 @@ The agent never sees these directly. It must ask the right questions, interpret
 
 ```
 ```
-### Actions
+### Actions
 
 ```python
 # Discover constraints
@@ -106,7 +106,7 @@ WorkSpaceAction(action_type="submit_final", target=None,
 content="Final PRD with all three constraints addressed...")
 ```
 
-### Observations
+### 🧱 Observations
 
 ```python
 WorkspaceObservation(
@@ -126,7 +126,7 @@ WorkspaceObservation(
 | Broadcast-to-All rate | high | 0% |
 | Constraint discovery | ~50% | targeted |
 
-## Reward Design
+## Reward Design
 
 This is the core innovation. The reward function has three layers that are hard to game independently.
 
@@ -177,7 +177,7 @@ If the agent asks the same expert 5+ times, that expert's frustration rises and
 
 ---
 
-## Tasks
+## 🧠 Tasks
 
 | Task | Difficulty | Goal | Max Steps | Success Criterion |
 |---|---|---|---|---|
@@ -187,7 +187,7 @@ If the agent asks the same expert 5+ times, that expert's frustration rises and
 
 ---
 
-## Results
+## 🏛️ Results
 
 ### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct- model not sure for before state)
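
The action/observation interface referenced in the README snippets above (`WorkSpaceAction` with `interview` and `submit_final` action types, `WorkspaceObservation` as the step result) can be sketched end to end. The class names and action types follow the diff; everything else — the specific fields, the expert names, and the `ToyWorkspace` environment logic — is a hypothetical illustration, not the repository's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkSpaceAction:
    # Shape mirrors the README snippets; "interview" targets one expert,
    # "submit_final" submits the PRD with target=None.
    action_type: str
    target: Optional[str]
    content: str

@dataclass
class WorkspaceObservation:
    # Illustrative fields (assumed, not from the README).
    step: int
    messages: list = field(default_factory=list)

class ToyWorkspace:
    """Hypothetical stand-in environment: each expert reveals a hidden
    constraint only when interviewed directly (no broadcast)."""

    def __init__(self):
        # Example hidden constraints; the real environment's are different.
        self.hidden = {
            "security": "must use SSO",
            "finance": "budget cap $50k",
            "legal": "EU data residency",
        }
        self.discovered = set()
        self.step_count = 0

    def step(self, action: WorkSpaceAction) -> WorkspaceObservation:
        self.step_count += 1
        obs = WorkspaceObservation(step=self.step_count)
        if action.action_type == "interview" and action.target in self.hidden:
            # A targeted question surfaces that expert's constraint.
            self.discovered.add(action.target)
            obs.messages.append(self.hidden[action.target])
        return obs

env = ToyWorkspace()
for expert in ("security", "finance", "legal"):
    env.step(WorkSpaceAction("interview", expert,
                             f"What constraints do you have, {expert}?"))
final = env.step(WorkSpaceAction("submit_final", None,
                                 "Final PRD addressing SSO, budget cap, and data residency."))
print(len(env.discovered))  # → 3
```

The sketch only shows the targeted-questioning loop that the behavioral-comparison table rewards (0% broadcast-to-all); the layered reward computation described in the diff is left out, since its terms are not specified here.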