---

## 🧱 The Problem Statement
Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.
This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder …

---

## 🧠 The Environment
An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.
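To make this concrete, here is a minimal, purely illustrative stub of such an environment; `WorkspaceEnv` and every field on it are invented names for this sketch, not the project's actual API:

```python
# Minimal stand-in for the environment; every name here is illustrative only.
from dataclasses import dataclass

@dataclass
class WorkspaceEnv:                                       # hypothetical class name
    experts: tuple = ("engineering", "design", "legal")   # three stakeholders
    max_steps: int = 30                                   # episode budget, value illustrative
    step_count: int = 0

    def reset(self) -> dict:
        # The agent starts as the Product Manager with an empty PRD draft.
        self.step_count = 0
        return {"draft": "", "messages": [], "step": 0}

env = WorkspaceEnv()
obs = env.reset()
```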

Architectural Highlights:

- Closed-Loop Reward Engine: The "Judge" monitors state changes in the environment and provides a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic (see the sketch below).
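A rough sketch of what that closed loop could look like; `judge_reward` and the trainer hook are invented for illustration, not the project's actual functions:

```python
# Illustrative closed-loop reward wiring; all names are assumptions.
def judge_reward(prev_constraints: set, new_constraints: set) -> float:
    """The 'Judge' compares consecutive workspace states and emits a float reward."""
    # Reward the constraints newly addressed between the two states.
    return float(len(new_constraints - prev_constraints))

# Each environment step, this float flows back to the GRPO trainer, e.g.:
# trainer.record(step, judge_reward(prev, current))   # trainer hook assumed
reward = judge_reward({"engineering"}, {"engineering", "legal"})  # -> 1.0
```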

### 🏛️ Hidden Constraints (what the agent must discover)

| Expert | Hidden Constraint | Hints at |
|---|---|---|

The agent never sees these directly. It must ask the right questions, interpret …
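One plausible, entirely hypothetical way to encode that mechanic; the `Expert` class and all of its fields are assumptions:

```python
# Hypothetical encoding of the hidden-constraint mechanic; all names invented.
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    hidden_constraint: str   # never shown to the agent directly
    hint_topic: str          # asking about this topic surfaces a hint
    hint: str

    def answer(self, question: str) -> str:
        # A targeted question earns a hint; a vague one earns empty agreement.
        return self.hint if self.hint_topic in question.lower() else "Sounds good to me."
```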

### ✨ Actions

```python
# Discover constraints
# ...

# Submit the final PRD
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```
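Inferring from the call above, the action type plausibly has a shape like the following sketch; the field types and the set of `action_type` values are assumptions:

```python
# Sketch of WorkSpaceAction's shape, inferred from the call above; types assumed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkSpaceAction:
    action_type: str           # e.g. "submit_final"; other values presumably exist
    target: Optional[str]      # an expert to address, or None
    content: str               # a question, message, or the PRD text itself
```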

### 🧱 Observations

```python
WorkspaceObservation(
    # ...
)
```
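Continuing the hypothetical stubs above, a full rollout would look roughly like this; the gym-style `(obs, reward, done, info)` return is an assumption:

```python
# Illustrative rollout; step() signature and episode-end condition are assumed.
def step(env: WorkspaceEnv, action: WorkSpaceAction):
    env.step_count += 1
    reward = 0.0                                   # the Judge's float reward in the real loop
    done = env.step_count >= env.max_steps or action.action_type == "submit_final"
    obs = {"draft": action.content, "step": env.step_count}
    return obs, reward, done, {}

obs, done = env.reset(), False
while not done:
    action = WorkSpaceAction("submit_final", None, "Final PRD ...")  # a real policy chooses here
    obs, reward, done, info = step(env, action)
```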

| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |

## ✨ Reward Design

This is the core innovation. The reward function has three layers that are hard to game independently.
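As a purely illustrative sketch of that structure (the layer names, the normalizations, and the multiplicative combination are all invented here, not taken from the project):

```python
# Illustrative three-layer reward; every name and weight here is an assumption.
def total_reward(constraints_hit: int, total_constraints: int,
                 approvals: int, total_experts: int,
                 steps_used: int, max_steps: int) -> float:
    coverage = constraints_hit / total_constraints    # layer 1: hidden constraints addressed
    approval = approvals / total_experts              # layer 2: explicit sign-off from each expert
    efficiency = 1.0 - steps_used / (2 * max_steps)   # layer 3: mild penalty for wasted steps
    # Multiplicative combination: maxing one layer while ignoring the others earns little.
    return coverage * approval * efficiency

# e.g. total_reward(3, 3, 3, 3, 12, 30) -> 0.8
```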

If the agent asks the same expert 5+ times, that expert's frustration rises and …
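That mechanic can be tracked with something as simple as a per-expert counter; only the five-query threshold comes from the sentence above, the penalty shape is invented:

```python
# Illustrative frustration tracking; only the 5-query threshold comes from the text.
from collections import Counter

queries = Counter()

def frustration_penalty(expert: str) -> float:
    queries[expert] += 1
    # From the fifth repeat question on, the expert's patience wears thin.
    return -0.1 * max(0, queries[expert] - 4)
```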
---

## 🧠 Tasks

| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|

---

## 🏛️ Results

### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct; the exact before-state model is not confirmed)