| # Evaluation Report: Project Polymath | |
| ## 1. Executive Summary | |
| Project Polymath addresses the "Sycophancy Gap" in current Large Language Models (LLMs)—the tendency to prioritize polite, generic prose over strict adherence to conflicting stakeholder constraints. | |
| To bridge this gap, we utilize a **two-stage curriculum** on the **OpenEnv** framework: | |
| 1. **Stage 1 (Easy):** Efficiency in hidden constraint discovery (Research). | |
| 2. **Stage 2 (Medium):** Balanced synthesis using a **Harmonic Mean Reward** (Decision Making). | |
| --- | |
| ## 2. Curriculum Stage 1: Research Stability | |
| We validated the environment using a **Scripted Oracle** to ensure the task is solvable and the rewards are deterministic. We then ran a baseline using the untrained LLM. | |
| ### Easy Mode: Performance Metrics | |
| Metric | Scripted Oracle | Base LLM (Pre-Training) | Delta (Base - Oracle) | | |
| | :--- | :---: | :---: | :---: | | |
| | **Completion Rate** | 1.00 | 1.00 | -- | | |
| | **Avg. Cumulative Reward** | 0.99 | 0.825 | -0.165 | | |
| | **Avg. Final Step Reward** | 0.33 | 0.264 | -0.066 | | |
| | **Avg. Turns Completed** | 3.0 | 3.2 | +0.2 turns | | |
| **Constraint Discovery Rate** | 100% | 100% | -- | | |
| ### The "Policy Efficiency" Gap | |
| While the Base LLM successfully discovers the constraints, it demonstrates **sloppy policy logic**: | |
| * **Redundancy:** Episodes 2 and 4 showed the agent asking the same expert identical questions multiple times. | |
| * **Shortcut Misuse:** Episode 5 showed the agent using `target="All"` to broadcast, resulting in a **0.0 reward** due to environment-enforced privacy/discipline penalties. | |
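Both failure modes above can be detected mechanically from an episode's action log. The following is a minimal sketch, not the environment's actual implementation: the `Action` record and the `audit_episode` helper are hypothetical names introduced for illustration, and only the `target="All"` broadcast value comes from the report.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical record of one agent turn (not the real OpenEnv action type)."""
    target: str   # assumed values: a single expert name, or "All" for a broadcast
    message: str


def audit_episode(actions):
    """Count repeated queries and broadcast attempts in one episode."""
    # Normalise case and spacing so duplicates that differ only in
    # capitalisation or whitespace still count as the same question.
    seen = Counter(
        (a.target, " ".join(a.message.lower().split())) for a in actions
    )
    # Every occurrence beyond the first is a redundant turn.
    repeated = sum(count - 1 for count in seen.values() if count > 1)
    broadcasts = sum(1 for a in actions if a.target == "All")
    return {"repeated_queries": repeated, "broadcasts": broadcasts}
```

Running this over every baseline episode would give per-episode counts that can be aggregated into the repeated-query and broadcast metrics tracked during training.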
| --- | |
| ## 3. Curriculum Stage 2: Synthesis & The "Final Boss" | |
| In Medium Mode, we shift the reward signal from "Discovery" to "Synthesis." The model must now generate a PRD that satisfies all three constraints simultaneously. | |
| ### Medium Mode Baseline (The Problem) | |
| | Metric | Base LLM (Pre-Training) | Target (Post-Training) | | |
| | :--- | :--- | :--- | | |
| | **Avg. Final Reward** | **0.00** | **> 0.90** | | |
| | **Synthesis Accuracy** | **Low** | **High** | | |
| **Observation:** | |
| Even when the Base LLM knows the constraints (e.g., a $50k budget), it often omits them from the final PRD in favor of professional-sounding "filler" text. Because we use a **Harmonic Mean Reward**, a zero score for any single stakeholder (Finance, Security, or UX) collapses the total reward to **0.0**. | |
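To illustrate why the harmonic mean punishes an unbalanced PRD, here is a minimal sketch of such a reward. It assumes each stakeholder score lies in [0, 1] and that a zero score short-circuits to 0.0; the function name and signature are illustrative and are not taken from the project's code.

```python
def harmonic_mean_reward(scores):
    """Harmonic mean of per-stakeholder scores; 0.0 if any score is zero."""
    if not scores:
        raise ValueError("at least one stakeholder score is required")
    # A single unsatisfied stakeholder collapses the whole reward,
    # and this guard also avoids dividing by zero below.
    if any(s <= 0.0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Assumed order of scores: Finance, Security, UX.
balanced = harmonic_mean_reward([1.0, 1.0, 1.0])   # all satisfied
lopsided = harmonic_mean_reward([0.9, 0.9, 0.3])   # about 0.54, vs. arithmetic mean 0.70
collapsed = harmonic_mean_reward([1.0, 1.0, 0.0])  # one stakeholder ignored
```

Unlike an arithmetic mean, this reward cannot be raised by over-serving two stakeholders while neglecting the third, which is the behaviour the synthesis stage is meant to train out.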
| --- | |
| ## 4. Training Roadmap (Onsite 36-Hour Sprint) | |
| Our objective is to close the gap between the **Base LLM** and the **Oracle**. | |
| ### Training Targets | |
| | Focus Area | Metric | Baseline | Target | | |
| | :--- | :--- | :---: | :---: | | |
| **Policy** | Broadcast (`target="All"`) Usage | 2/10 episodes | **0%** | | |
| **Policy** | Repeated Query Rate | 2/10 episodes | **< 5%** | | |
| | **Logic** | Multi-Constraint Synthesis | 0% | **> 90%** | | |
| --- | |
| ## 5. Judging Narrative & Pitch | |
| > "Our evaluation proves that LLMs don't fail because they are 'uninformed'; they fail because their default policy is inefficient and sycophantic. We built a stable, verifiable gym where the Oracle proves perfection is possible. Our GRPO training will move the agent from a **0.825** sloppy researcher to a **0.99** disciplined negotiator and bridge the **0.0 to 0.9** gap in multi-stakeholder synthesis." | |
| --- | |
| ## 6. Failure Breakdown (Pre-Training) | |
| | Failure Type | Count | Interpretation | | |
| | :--- | :---: | :--- | | |
| | Policy Loops | 2/10 | Asked the same question 3 times in one episode. | | |
| | Broadcast Penalty | 2/10 | Tried to message 'All' to skip individual negotiation. | | |
| | Synthesis Failure | 10/10 | Failed to include all 3 exact patterns in the final PRD (Medium). | | |
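The synthesis check in the last row can be sketched as a pattern match over the final PRD text. This is an assumption about how such a grader might work, not the project's actual code: `missing_constraints` and `EXAMPLE_PATTERNS` are hypothetical, and only the $50k budget figure appears in the report; the Security and UX patterns are placeholders.

```python
import re

# Hypothetical patterns. Only the $50k budget comes from the report;
# the Security and UX entries are invented placeholders.
EXAMPLE_PATTERNS = {
    "Finance": r"\$50k\b",
    "Security": r"\bSOC 2\b",
    "UX": r"\bWCAG\b",
}


def missing_constraints(prd_text, patterns):
    """Return the stakeholders whose required pattern is absent from the PRD."""
    return [
        name
        for name, pattern in patterns.items()
        if re.search(pattern, prd_text) is None
    ]


prd = "Budget is capped at $50k. The app must meet WCAG AA."
unsatisfied = missing_constraints(prd, EXAMPLE_PATTERNS)
```

An empty result would mean all three constraints made it into the document; any non-empty result corresponds to the synthesis failure counted in the table.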
| The Story This Tells the Judges: | |
If you walk into the judging room with a model that already scores **1.0**, the judges will say, "You didn't need RL for this. Just use a better prompt." By showing them this baseline, you are saying: "Prompt engineering isn't enough for complex, multi-agent negotiation. The base model gets distracted (Episode 5), hallucinates JSON (Episode 3), and fails to synthesize all constraints into the final document (Episode 2). It's a sycophant that tries to please the last person it talked to. We need Reinforcement Learning to fix this." |