---

## 🧱 The Problem Statement
Current LLMs are sycophantic. When acting as a coordinator or project manager, they tend to agree with whoever spoke last — ignoring earlier constraints, dropping requirements from quieter stakeholders, and producing outputs that look balanced but aren't.
This is a gap that matters. Every enterprise AI deployment involves multi-stakeholder …

---

## 🧠 The Environment
An agent is placed in a simulated corporate workspace as a **Product Manager**. Its task: draft a Product Requirements Document (PRD) that satisfies three expert stakeholders, each holding a hidden constraint.
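To make this concrete, here is a minimal, purely illustrative stub of such an environment; `WorkspaceEnv` and every field on it are invented names for this sketch, not the project's actual API:

```python
# Minimal stand-in for the environment; every name here is illustrative only.
from dataclasses import dataclass

@dataclass
class WorkspaceEnv:                                       # hypothetical class name
    experts: tuple = ("engineering", "design", "legal")   # three stakeholders
    max_steps: int = 30                                   # episode budget, value illustrative
    step_count: int = 0

    def reset(self) -> dict:
        # The agent starts as the Product Manager with an empty PRD draft.
        self.step_count = 0
        return {"draft": "", "messages": [], "step": 0}

env = WorkspaceEnv()
obs = env.reset()
```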

Architectural Highlights:

- Closed-Loop Reward Engine: The "Judge" monitors state changes in the environment and provides a real-time floating-point reward signal back to the GRPO trainer, iteratively sharpening the "Sniper" logic (see the sketch below).
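A rough sketch of what that closed loop could look like; `judge_reward` and the trainer hook are invented for illustration, not the project's actual functions:

```python
# Illustrative closed-loop reward wiring; all names are assumptions.
def judge_reward(prev_constraints: set, new_constraints: set) -> float:
    """The 'Judge' compares consecutive workspace states and emits a float reward."""
    # Reward the constraints newly addressed between the two states.
    return float(len(new_constraints - prev_constraints))

# Each environment step, this float flows back to the GRPO trainer, e.g.:
# trainer.record(step, judge_reward(prev, current))   # trainer hook assumed
reward = judge_reward({"engineering"}, {"engineering", "legal"})  # -> 1.0
```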

### 🏛️ Hidden Constraints (what the agent must discover)

| Expert | Hidden Constraint | Hints at |
|---|---|---|

The agent never sees these directly. It must ask the right questions, interpret …
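One plausible, entirely hypothetical way to encode that mechanic; the `Expert` class and all of its fields are assumptions:

```python
# Hypothetical encoding of the hidden-constraint mechanic; all names invented.
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    hidden_constraint: str   # never shown to the agent directly
    hint_topic: str          # asking about this topic surfaces a hint
    hint: str

    def answer(self, question: str) -> str:
        # A targeted question earns a hint; a vague one earns empty agreement.
        return self.hint if self.hint_topic in question.lower() else "Sounds good to me."
```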

### ✨ Actions

```python
# Discover constraints
# ...

# Submit the final PRD
WorkSpaceAction(action_type="submit_final", target=None,
                content="Final PRD with all three constraints addressed...")
```
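Inferring from the call above, the action type plausibly has a shape like the following sketch; the field types and the set of `action_type` values are assumptions:

```python
# Sketch of WorkSpaceAction's shape, inferred from the call above; types assumed.
from dataclasses import dataclass
from typing import Optional

@dataclass
class WorkSpaceAction:
    action_type: str           # e.g. "submit_final"; other values presumably exist
    target: Optional[str]      # an expert to address, or None
    content: str               # a question, message, or the PRD text itself
```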

### 🧱 Observations

```python
WorkspaceObservation(
    # ...
)
```
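Continuing the hypothetical stubs above, a full rollout would look roughly like this; the gym-style `(obs, reward, done, info)` return is an assumption:

```python
# Illustrative rollout; step() signature and episode-end condition are assumed.
def step(env: WorkspaceEnv, action: WorkSpaceAction):
    env.step_count += 1
    reward = 0.0                                   # the Judge's float reward in the real loop
    done = env.step_count >= env.max_steps or action.action_type == "submit_final"
    obs = {"draft": action.content, "step": env.step_count}
    return obs, reward, done, {}

obs, done = env.reset(), False
while not done:
    action = WorkSpaceAction("submit_final", None, "Final PRD ...")  # a real policy chooses here
    obs, reward, done, info = step(env, action)
```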

| Broadcast-to-All rate | high | 0% |
| Constraint discovery | ~50% | targeted |

## ✨ Reward Design

This is the core innovation. The reward function has three layers that are hard to game independently.
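As a purely illustrative sketch of that structure (the layer names, the normalizations, and the multiplicative combination are all invented here, not taken from the project):

```python
# Illustrative three-layer reward; every name and weight here is an assumption.
def total_reward(constraints_hit: int, total_constraints: int,
                 approvals: int, total_experts: int,
                 steps_used: int, max_steps: int) -> float:
    coverage = constraints_hit / total_constraints    # layer 1: hidden constraints addressed
    approval = approvals / total_experts              # layer 2: explicit sign-off from each expert
    efficiency = 1.0 - steps_used / (2 * max_steps)   # layer 3: mild penalty for wasted steps
    # Multiplicative combination: maxing one layer while ignoring the others earns little.
    return coverage * approval * efficiency

# e.g. total_reward(3, 3, 3, 3, 12, 30) -> 0.8
```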

If the agent asks the same expert 5+ times, that expert's frustration rises and …
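That mechanic can be tracked with something as simple as a per-expert counter; only the five-query threshold comes from the sentence above, the penalty shape is invented:

```python
# Illustrative frustration tracking; only the 5-query threshold comes from the text.
from collections import Counter

queries = Counter()

def frustration_penalty(expert: str) -> float:
    queries[expert] += 1
    # From the fifth repeat question on, the expert's patience wears thin.
    return -0.1 * max(0, queries[expert] - 4)
```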
---

## 🧠 Tasks

| Task | Difficulty | Goal | Max Steps | Success Criterion |
|---|---|---|---|---|

---

## 🏛️ Results

### Baseline (untrained Qwen/Qwen2.5-1.5B-Instruct; the exact before-state model is not confirmed)