# Evaluation Report: Project Polymath
## 1. Executive Summary
Project Polymath addresses the "Sycophancy Gap" in current Large Language Models (LLMs)—the tendency to prioritize polite, generic prose over strict adherence to conflicting stakeholder constraints.
To bridge this gap, we utilize a **two-stage curriculum** on the **OpenEnv** framework:
1. **Stage 1 (Easy):** Efficiency in hidden constraint discovery (Research).
2. **Stage 2 (Medium):** Balanced synthesis using a **Harmonic Mean Reward** (Decision Making).
---
## 2. Curriculum Stage 1: Research Stability
We validated the environment using a **Scripted Oracle** to ensure the task is solvable and the rewards are deterministic. We then ran a baseline using the untrained LLM.
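The validation loop can be sketched with a toy stand-in. The `ToyEnv` and `ScriptedOracle` interfaces below are generic Gym-style assumptions for illustration, not the actual OpenEnv API:

```python
class ToyEnv:
    """Toy stand-in for the real task: 3 turns, 0.33 reward per turn."""
    def reset(self):
        self.t = 0
        return {"turn": self.t}

    def step(self, action):
        self.t += 1
        return {"turn": self.t}, 0.33, self.t >= 3  # obs, reward, done

class ScriptedOracle:
    """Deterministic policy: queries each expert once, in a fixed order."""
    EXPERTS = ["Finance", "Security", "UX"]

    def act(self, obs):
        return {"target": self.EXPERTS[obs["turn"]]}

def validate(env, oracle, episodes=10):
    """Roll out the scripted policy; check rewards are reproducible."""
    totals = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(oracle.act(obs))
            total += reward
        totals.append(round(total, 6))
    assert len(set(totals)) == 1, "rewards must be deterministic"
    return totals[0]

validate(ToyEnv(), ScriptedOracle())  # ≈ 0.99
```

The Oracle's 0.99 cumulative reward in the table below is exactly this pattern: three efficient queries at 0.33 each.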
### Easy Mode: Performance Metrics
| Metric | Oracle Scripted | Base LLM (Pre-Training) | Delta |
| :--- | :---: | :---: | :---: |
| **Completion Rate** | 1.00 | 1.00 | -- |
| **Avg. Cumulative Reward** | 0.99 | 0.825 | -0.165 |
| **Avg. Final Step Reward** | 0.33 | 0.264 | -0.066 |
| **Avg. Turns Completed** | 3.0 | 3.2 | +0.2 turns |
| **Constraint Discovery Rate** | 100% | 100% | -- |
### The "Policy Efficiency" Gap
While the Base LLM successfully discovers the constraints, it demonstrates **sloppy policy logic**:
* **Redundancy:** Episodes 2 and 4 showed the agent asking the same expert identical questions multiple times.
* **Shortcut Misuse:** Episode 5 showed the agent using `target="All"` to broadcast, resulting in a **0.0 reward** due to environment-enforced privacy/discipline penalties.
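The environment-side rule behind that penalty can be sketched as follows. `apply_discipline_penalty` and the action-dict shape are hypothetical illustrations, not the environment's actual code:

```python
def apply_discipline_penalty(action, base_reward):
    """Hypothetical sketch of the environment rule: broadcasting to every
    expert at once bypasses individual negotiation, so the step reward
    is zeroed regardless of the question asked."""
    if action.get("target") == "All":
        return 0.0  # privacy/discipline penalty
    return base_reward

apply_discipline_penalty({"target": "All", "question": "Constraints?"}, 0.33)     # 0.0
apply_discipline_penalty({"target": "Finance", "question": "Budget cap?"}, 0.33)  # 0.33
```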
---
## 3. Curriculum Stage 2: Synthesis & The "Final Boss"
In Medium Mode, we shift the reward signal from "Discovery" to "Synthesis." The model must now generate a PRD that satisfies all three constraints simultaneously.
### Medium Mode Baseline (The Problem)
| Metric | Base LLM (Pre-Training) | Target (Post-Training) |
| :--- | :--- | :--- |
| **Avg. Final Reward** | **0.00** | **> 0.90** |
| **Synthesis Accuracy** | **Low** | **High** |
**Observation:**
Even when the Base LLM knows the constraints (e.g., $50k budget), it often omits them in the final PRD in favor of professional-sounding "filler" text. Because we use a **Harmonic Mean Reward**, failing to satisfy even one stakeholder (Finance, Security, or UX) results in a total reward collapse to **0.0**.
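This collapse is a direct property of the harmonic mean: a zero on any axis zeroes the whole reward, so the agent cannot trade one stakeholder off against another. A minimal sketch (the stakeholder keys mirror the report; the scoring function itself is an assumption):

```python
def harmonic_mean_reward(scores):
    """Harmonic mean over per-stakeholder scores in [0, 1].

    A single zero collapses the reward to 0.0, so ignoring any one
    stakeholder cannot be bought back by pleasing the others.
    """
    if any(s == 0 for s in scores.values()):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores.values())

# Perfect Security and UX cannot rescue an omitted Finance constraint:
harmonic_mean_reward({"Finance": 0.0, "Security": 1.0, "UX": 1.0})  # 0.0
# Balanced partial satisfaction is rewarded:
harmonic_mean_reward({"Finance": 0.9, "Security": 0.9, "UX": 0.9})  # ≈ 0.9
```

Compare with an arithmetic mean, which would score the first case around 0.67 and let the model ignore Finance entirely.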
---
## 4. Training Roadmap (Onsite 36-Hour Sprint)
Our objective is to close the gap between the **Base LLM** and the **Oracle**.
### Training Targets
| Focus Area | Metric | Baseline | Target |
| :--- | :--- | :---: | :---: |
| **Policy** | Broadcast (`target="All"`) Usage | Present | **0%** |
| **Policy** | Repeated Query Rate | Present | **< 5%** |
| **Logic** | Multi-Constraint Synthesis | 0% | **> 90%** |
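The Repeated Query Rate target can be measured straight from episode logs. One possible implementation, assuming queries are logged as `(target, question)` pairs:

```python
from collections import Counter

def repeated_query_rate(queries):
    """Fraction of queries that repeat an earlier (target, question) pair."""
    if not queries:
        return 0.0
    counts = Counter(queries)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(queries)

log = [("Finance", "What is the budget cap?"),
       ("Finance", "What is the budget cap?"),   # redundant re-ask
       ("UX", "What is the latency target?")]
repeated_query_rate(log)  # ≈ 0.33
```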
---
## 5. Judging Narrative & Pitch
> "Our evaluation proves that LLMs don't fail because they are 'uninformed'; they fail because their default policy is inefficient and sycophantic. We built a stable, verifiable gym where the Oracle proves perfection is possible. Our GRPO training will move the agent from a **0.825** sloppy researcher to a **0.99** disciplined negotiator and bridge the **0.0 to 0.9** gap in multi-stakeholder synthesis."
---
## 6. Failure Breakdown (Pre-Training)
| Failure Type | Count | Interpretation |
| :--- | :---: | :--- |
| Policy Loops | 2/10 | Asked the same question 3 times in one episode. |
| Broadcast Penalty | 2/10 | Tried to message 'All' to skip individual negotiation. |
| Synthesis Failure | 10/10 | Failed to include all 3 exact patterns in the final PRD (Medium). |
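A containment check of this kind is easy to sketch. The snippet below is illustrative only: the `$50k` figure appears in the report, while the other two required snippets are hypothetical stand-ins for the environment's actual patterns:

```python
# Hypothetical stand-ins for the environment's required patterns; only
# the $50k budget figure is taken from the report itself.
REQUIRED_SNIPPETS = ["$50k", "SOC 2", "3-click"]

def synthesis_ok(prd_text):
    """True only if every required constraint snippet appears in the PRD."""
    text = prd_text.lower()
    return all(snippet.lower() in text for snippet in REQUIRED_SNIPPETS)

synthesis_ok("Budget capped at $50k; SOC 2 audit required; 3-click checkout.")  # True
synthesis_ok("A best-in-class experience that delights every stakeholder.")     # False
```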

### The Story This Tells the Judges

If you walk into the judging room with a model that already scores **1.0**, the judges will say, "You didn't need RL for this. Just use a better prompt."

By showing them this baseline, you are saying: "Prompt engineering isn't enough for complex, multi-agent negotiation. The base model gets distracted (Episode 5), hallucinates JSON (Episode 3), and fails to synthesize all constraints into the final document (Episode 2). It's a sycophant that tries to please the last person it talked to. We need Reinforcement Learning to fix this."