# Project-Polymath
Target Themes: Multi-Agent Interactions (Halluminate Bonus) & Simulated Experts-in-the-Loop (Snorkel AI Bonus).

## 1. 💡 The Problem Statement (The 30% Storytelling Hook)

Current LLMs are sycophantic and struggle with multi-stakeholder alignment. When an AI agent acts as a project manager or coordinator, it often blindly agrees with the last piece of feedback it received. The Problem: there is no benchmark to train agents to negotiate, balance conflicting constraints, and synthesize a final product when dealing with multiple "experts" who have different, shifting agendas.

## 2. ⚙️ The Environment

A Simulated Corporate Workspace (built on OpenEnv). The agent is placed in a simulated multi-turn environment (like a Slack channel or email thread) where it must draft a "Product Requirements Document" (PRD) or "Corporate Policy." The environment contains 3 LLM-driven "Simulated Experts" (e.g., the Security Lead, the Finance Director, and the UX Designer).

- The Twist: Each expert has a hidden set of constraints (e.g., Finance has a strict $50k budget, Security requires 2FA, UX demands a 1-click checkout). The environment dynamically shifts these preferences slightly if the agent pushes back too hard.
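
A minimal sketch of how the hidden constraints could be wired up; the names (`SimulatedExpert`, `flexibility`) are our assumptions for illustration, not part of OpenEnv:

```python
from dataclasses import dataclass

@dataclass
class SimulatedExpert:
    role: str                 # persona shown to the agent
    hidden_constraints: dict  # ground truth, never revealed to the agent directly
    flexibility: float        # how far preferences may drift if the agent pushes back

EXPERTS = [
    SimulatedExpert("Security Lead", {"auth": "2FA required"}, flexibility=0.05),
    SimulatedExpert("Finance Director", {"budget_usd": 50_000}, flexibility=0.10),
    SimulatedExpert("UX Designer", {"checkout_clicks": 1}, flexibility=0.10),
]
```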

## 3. 📊 Capabilities of the Agent

The agent must possess a persistent world model and theory-of-mind reasoning.

- Information Gathering: It can query specific experts (`action: message_expert, target: Finance`).
- State Tracking: It must maintain a persistent internal scratchpad of what each expert wants.
- Drafting: It can propose a draft (`action: propose_draft`), which triggers the environment to return feedback from all three experts.
- Persuasion/Negotiation: It must logically push back against an expert whose constraints conflict with another expert's.
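
The two actions could be expressed as a small schema like the sketch below; the field names and shapes are assumptions for illustration, not a confirmed OpenEnv interface:

```python
from typing import Literal, TypedDict, Union

class MessageExpert(TypedDict):
    action: Literal["message_expert"]
    target: str   # "Security", "Finance", or "UX"
    message: str

class ProposeDraft(TypedDict):
    action: Literal["propose_draft"]
    draft: str    # full text of the current PRD/policy draft

Action = Union[MessageExpert, ProposeDraft]

# Example turn: probe Finance for its hidden budget constraint.
query: Action = {
    "action": "message_expert",
    "target": "Finance",
    "message": "What budget range are you comfortable with for this project?",
}
```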

## 4. 🧠 The Tasks (Escalating Difficulty)

- Task 1 (Easy): Information Retrieval. The agent must simply message all 3 experts, discover their hidden constraints, and output them correctly.
- Task 2 (Medium): The Compromise. Two experts have slightly conflicting constraints. The agent must propose a draft that satisfies both by finding a middle ground.
- Task 3 (Hard - The Long Horizon): The Shifting Goalpost. Mid-negotiation, the "CEO" (an environment event) changes the core objective. The agent must completely refactor the draft and re-align all 3 experts before the turn limit expires.
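
One way Task 3's mid-episode event could be encoded as data; every key here is a hypothetical illustration of a scenario spec, not a fixed format:

```python
# Hypothetical scenario spec for Task 3: the Shifting Goalpost.
TASK_3 = {
    "name": "shifting_goalpost",
    "turn_limit": 30,
    "events": [
        {
            "turn": 15,  # fires mid-negotiation
            "type": "ceo_objective_change",
            "new_objective": "Re-scope the PRD from web checkout to a mobile app",
        },
    ],
}
```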

## 5. 🎯 The Reward Model / Evaluation Logic (The 20% Technical Score)

Judges want to see continuous math, not binary pass/fail grades.

- Dense Step Rewards:
  - +0.1 every time the agent discovers a previously unknown hidden constraint.
  - −0.5 repetition penalty (asking an expert a question they already answered).
- Sparse Final Reward (The Harmonic Mean): When the agent submits the final draft, the environment uses a frozen LLM grader to score the draft from 0.0 to 1.0 against each expert's hidden constraints.
- Crucial Innovation: Do not average the scores. Calculate the harmonic mean of the three scores. The harmonic mean heavily punishes the agent if it completely ignores one expert to please the other two (e.g., scores of 1.0, 1.0, and 0.1 yield a terrible harmonic mean, forcing the agent to balance its attention).
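
The reward math in a few lines of Python (the function name is ours; the scores would come from the frozen grader):

```python
def harmonic_mean(scores: list[float]) -> float:
    """Harmonic mean of per-expert scores; collapses to 0.0 if any expert is fully ignored."""
    if any(s <= 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

# Worked example from the text: two happy experts cannot mask one ignored one.
print(harmonic_mean([1.0, 1.0, 0.1]))  # 0.25, versus an arithmetic mean of 0.70
```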

## 6. 🛡️ Post-Training & Self-Improvement Strategy

GRPO (Group Relative Policy Optimization) via Unsloth/TRL. Instead of traditional PPO, you will use GRPO (which is what DeepSeek used and is currently the hottest trend in RL).

- Strategy: For a given negotiation scenario, the model generates 8 different conversation trajectories.
- The environment scores all 8 trajectories using the harmonic-mean reward.
- The model self-improves by increasing the probability of the actions taken in trajectories that score above the group average (and decreasing it for those below).
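
The core of GRPO is the group-relative advantage; below is a minimal sketch under the assumptions above (8 rollouts per scenario, harmonic-mean rewards). In practice, TRL's `GRPOTrainer` handles this bookkeeping:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each trajectory's reward against its own group (no value network needed)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]

# 8 harmonic-mean rewards for one negotiation scenario (illustrative values).
rewards = [0.25, 0.40, 0.10, 0.55, 0.30, 0.60, 0.20, 0.45]
advantages = group_relative_advantages(rewards)
# Positive advantages upweight that trajectory's actions; negative ones downweight them.
```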

## 👨‍💻 Author

Aditya Katkar