Project-Polymath / EVAL_REPORT.md

Addyk24

feat: add eval baseline script (inference testing for env), report, prompter, and env config

fdd6ae0 about 2 months ago

preview code

raw

history blame

4.14 kB

Evaluation Report: Project Polymath

1. Executive Summary

Project Polymath addresses the "Sycophancy Gap" in current Large Language Models (LLMs): the tendency to prioritize polite, generic prose over strict adherence to conflicting stakeholder constraints.

To bridge this gap, we use a two-stage curriculum on the OpenEnv framework:

Stage 1 (Easy): Efficiency in hidden constraint discovery (Research).
Stage 2 (Medium): Balanced synthesis using a Harmonic Mean Reward (Decision Making).

2. Curriculum Stage 1: Research Stability

We validated the environment using a Scripted Oracle to ensure the task is solvable and the rewards are deterministic. We then ran a baseline using the untrained LLM.

Easy Mode: Performance Metrics

| Metric | Scripted Oracle | Base LLM (Pre-Training) | Delta |
| --- | --- | --- | --- |
| Completion Rate | 1.00 | 1.00 | -- |
| Avg. Cumulative Reward | 0.99 | 0.825 | -0.165 |
| Avg. Final Step Reward | 0.33 | 0.264 | -0.066 |
| Avg. Turns Completed | 3.0 | 3.2 | +0.2 turns |
| Constraint Discovery Rate | 100% | 100% | -- |
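
The Delta column can be re-derived from the two score columns. The sketch below is illustrative only: the dictionary layout and the names `easy_mode`, `delta`, and `reward_ratio` are assumptions made for this example, not part of the evaluation script. The values are copied from the table above.

```python
# Values copied from the Easy Mode table: (scripted oracle, base LLM).
# The dictionary layout is an illustrative assumption.
easy_mode = {
    "avg_cumulative_reward": (0.99, 0.825),
    "avg_final_step_reward": (0.33, 0.264),
    "avg_turns_completed": (3.0, 3.2),
}


def delta(oracle, base):
    """Base LLM minus oracle, rounded to avoid floating-point noise."""
    return round(base - oracle, 3)


deltas = {name: delta(oracle, base) for name, (oracle, base) in easy_mode.items()}

# Share of the oracle's cumulative reward that the base LLM already captures.
reward_ratio = round(0.825 / 0.99, 3)

print(deltas)
print(reward_ratio)
```

Read this way, the base LLM already captures roughly five sixths of the oracle's cumulative reward in Easy Mode, so the remaining gap comes from policy efficiency, not constraint discovery.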

The "Policy Efficiency" Gap

Although the Base LLM discovers every constraint, its policy is inefficient in two ways:

Redundancy: Episodes 2 and 4 showed the agent asking the same expert identical questions multiple times.
Shortcut Misuse: Episode 5 showed the agent using target="All" to broadcast, resulting in a 0.0 reward due to environment-enforced privacy/discipline penalties.
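
Both failure modes above can be detected mechanically from an episode's action log. The following is a minimal sketch, assuming a hypothetical log format in which each action is a `(target, question)` tuple. The function name `audit_episode` and the example questions are invented for illustration and do not come from the environment code.

```python
from collections import Counter

BROADCAST_TARGET = "All"  # the target="All" shortcut described above


def audit_episode(actions):
    """Flag broadcasts and repeated questions in one episode.

    `actions` is a list of (target, question) tuples. This log format is a
    hypothetical stand-in for whatever the environment actually records.
    """
    # Normalize case and whitespace so trivially reworded repeats still match.
    normalized = [(target, " ".join(q.lower().split())) for target, q in actions]
    counts = Counter(normalized)
    # Every occurrence beyond the first counts as one repeated query.
    repeated = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "used_broadcast": any(target == BROADCAST_TARGET for target, _ in actions),
        "repeated_queries": repeated,
        "repeated_query_rate": repeated / len(actions) if actions else 0.0,
    }


episode = [
    ("Finance", "What is the budget cap?"),
    ("Finance", "What is the  budget cap?"),
    ("Security", "Which compliance rules apply?"),
    ("All", "Anything else I should know?"),
]
report = audit_episode(episode)
print(report)
```

Averaging `repeated_query_rate` and `used_broadcast` over all evaluation episodes would give one way to track these two behaviors during training.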

3. Curriculum Stage 2: Synthesis & The "Final Boss"

In Medium Mode, we shift the reward signal from "Discovery" to "Synthesis." The model must now generate a PRD that satisfies all three constraints simultaneously.

Medium Mode Baseline (The Problem)

| Metric | Base LLM (Pre-Training) | Target (Post-Training) |
| --- | --- | --- |
| Avg. Final Reward | 0.00 | > 0.90 |
| Synthesis Accuracy | Low | High |

Observation: Even when the Base LLM knows the constraints (e.g., $50k budget), it often omits them in the final PRD in favor of professional-sounding "filler" text. Because we use a Harmonic Mean Reward, failing to satisfy even one stakeholder (Finance, Security, or UX) results in a total reward collapse to 0.0.
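
The collapse follows directly from the definition of the harmonic mean, H = n / (1/s_1 + ... + 1/s_n), which tends to 0 as any single score s_i tends to 0. The sketch below illustrates that property only. The function name, the per-stakeholder scores in [0, 1], and the example values are assumptions, and the environment's actual reward code may differ.

```python
def harmonic_mean_reward(scores):
    """Harmonic mean of per-stakeholder scores, each assumed to lie in [0, 1].

    Returns 0.0 if any score is 0, which is the limiting value of the
    harmonic mean and matches the collapse described above.
    """
    if not scores:
        raise ValueError("at least one stakeholder score is required")
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Hypothetical scores in the order Finance, Security, UX.
all_satisfied = harmonic_mean_reward([1.0, 1.0, 1.0])
one_missed = harmonic_mean_reward([1.0, 1.0, 0.0])
one_weak = harmonic_mean_reward([0.9, 0.9, 0.1])

print(all_satisfied, one_missed, round(one_weak, 3))
```

An arithmetic mean would still award about 0.67 to the `[1.0, 1.0, 0.0]` case, so the agent could ignore one stakeholder and keep most of the reward. The harmonic mean removes that incentive.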

4. Training Roadmap (Onsite 36-Hour Sprint)

Our objective is to close the gap between the Base LLM and the Oracle.

Training Targets

| Focus Area | Metric | Baseline | Target |
| --- | --- | --- | --- |
| Policy | Broadcast (`target="All"`) Usage | Present | 0% |
| Policy | Repeated Query Rate | Present | < 5% |
| Logic | Multi-Constraint Synthesis | 0% | > 90% |

5. Judging Narrative & Pitch

"Our evaluation proves that LLMs don't fail because they are 'uninformed'; they fail because their default policy is inefficient and sycophantic. We built a stable, verifiable gym where the Oracle proves perfection is possible. Our GRPO training will move the agent from a 0.825 sloppy researcher to a 0.99 disciplined negotiator and bridge the 0.0 to 0.9 gap in multi-stakeholder synthesis."

6. Failure Breakdown (Pre-Training)

| Failure Type | Count | Interpretation |
| --- | --- | --- |
| Policy Loops | 2/10 | Asked the same question 3 times in one episode. |
| Broadcast Penalty | 2/10 | Tried to message 'All' to skip individual negotiation. |
| Synthesis Failure | 10/10 | Failed to include all 3 exact patterns in the final PRD (Medium). |
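
A synthesis failure of this kind can be checked with a plain substring test per stakeholder. The sketch below is an assumption-laden illustration: only the `$50k` budget comes from this report, while the Security and UX patterns, the `REQUIRED_PATTERNS` name, and the `missing_constraints` helper are hypothetical placeholders.

```python
# Only the "$50k" budget appears in this report. The Security and UX
# patterns are hypothetical placeholders for the environment's real ones.
REQUIRED_PATTERNS = {
    "Finance": "$50k",
    "Security": "SOC 2",
    "UX": "WCAG 2.1",
}


def missing_constraints(prd_text, required=REQUIRED_PATTERNS):
    """Return the stakeholders whose exact pattern is absent from the PRD."""
    return [name for name, pattern in required.items() if pattern not in prd_text]


draft = "The rollout stays within the $50k budget and follows SOC 2 controls."
missing = missing_constraints(draft)
print(missing)
```

A draft counts as a synthesis success only when the returned list is empty. Any missing stakeholder corresponds to a zero score for that stakeholder in the Medium Mode reward.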

The Story This Tells the Judges:

If you walk into the judging room with a model that already scores 1.0, the judges will say, "You didn't need RL for this. Just use a better prompt." By showing them this baseline, you are saying: "Prompt engineering isn't enough for complex, multi-agent negotiation. The base model gets distracted (Episode 5), hallucinates JSON (Episode 3), and fails to synthesize all constraints into the final document (Episode 2). It's a sycophant that tries to please the last person it talked to. We need Reinforcement Learning to fix this."