Spaces:
Sleeping
Evaluation Report: Project Polymath
1. Executive Summary
Project Polymath addresses the "Sycophancy Gap" in current Large Language Models (LLMs)—the tendency to prioritize polite, generic prose over strict adherence to conflicting stakeholder constraints.
To bridge this gap, we utilize a two-stage curriculum on the OpenEnv framework:
- Stage 1 (Easy): Efficiency in hidden constraint discovery (Research).
- Stage 2 (Medium): Balanced synthesis using a Harmonic Mean Reward (Decision Making).
2. Curriculum Stage 1: Research Stability
We validated the environment using a Scripted Oracle to ensure the task is solvable and the rewards are deterministic. We then ran a baseline using the untrained LLM.
Easy Mode: Performance Metrics
| Metric | Oracle Scripted | Base LLM (Pre-Training) | Delta |
|---|---|---|---|
| Completion Rate | 1.00 | 1.00 | -- |
| Avg. Cumulative Reward | 0.99 | 0.825 | -0.165 |
| Avg. Final Step Reward | 0.33 | 0.264 | -0.066 |
| Avg. Turns Completed | 3.0 | 3.2 | +0.2 turns |
| Constraint Discovery Rate | 100% | 100% | -- |
The "Policy Efficiency" Gap
While the Base LLM successfully discovers the constraints, it demonstrates sloppy policy logic:
- Redundancy: Episodes 2 and 4 showed the agent asking the same expert identical questions multiple times.
- Shortcut Misuse: Episode 5 showed the agent using
target="All"to broadcast, resulting in a 0.0 reward due to environment-enforced privacy/discipline penalties.
3. Curriculum Stage 2: Synthesis & The "Final Boss"
In Medium Mode, we shift the reward signal from "Discovery" to "Synthesis." The model must now generate a PRD that satisfies all three constraints simultaneously.
Medium Mode Baseline (The Problem)
| Metric | Base LLM (Pre-Training) | Target (Post-Training) |
|---|---|---|
| Avg. Final Reward | 0.00 | > 0.90 |
| Synthesis Accuracy | Low | High |
Observation: Even when the Base LLM knows the constraints (e.g., $50k budget), it often omits them in the final PRD in favor of professional-sounding "filler" text. Because we use a Harmonic Mean Reward, failing to satisfy even one stakeholder (Finance, Security, or UX) results in a total reward collapse to 0.0.
4. Training Roadmap (Onsite 36-Hour Sprint)
Our objective is to close the gap between the Base LLM and the Oracle.
Training Targets
| Focus Area | Metric | Baseline | Target |
|---|---|---|---|
| Policy | Broadcast (target="All") Usage |
Present | 0% |
| Policy | Repeated Query Rate | Present | < 5% |
| Logic | Multi-Constraint Synthesis | 0% | > 90% |
5. Judging Narrative & Pitch
"Our evaluation proves that LLMs don't fail because they are 'uninformed'; they fail because their default policy is inefficient and sycophantic. We built a stable, verifiable gym where the Oracle proves perfection is possible. Our GRPO training will move the agent from a 0.825 sloppy researcher to a 0.99 disciplined negotiator and bridge the 0.0 to 0.9 gap in multi-stakeholder synthesis."
6. Failure Breakdown (Pre-Training)
| Failure Type | Count | Interpretation |
|---|---|---|
| Policy Loops | 2/10 | Asked the same question 3 times in one episode. |
| Broadcast Penalty | 2/10 | Tried to message 'All' to skip individual negotiation. |
| Synthesis Failure | 10/10 | Failed to include all 3 exact patterns in the final PRD (Medium). |
The Story This Tells the Judges:
If you walk into the judging room with a model that already scores a $1.0$, the judges will say, "You didn't need RL for this. Just use a better prompt."By showing them this baseline, you are saying:"Prompt engineering isn't enough for complex, multi-agent negotiation. The base model gets distracted (Episode 5), hallucinates JSON (Episode 3), and fails to synthesize all constraints into the final document (Episode 2). It's a sycophant that tries to please the last person it talked to. We need Reinforcement Learning to fix this."