# Evaluation Report: Project Polymath
## 1. Executive Summary
Project Polymath addresses the "Sycophancy Gap" in current Large Language Models (LLMs)—the tendency to prioritize polite, generic prose over strict adherence to conflicting stakeholder constraints.
To bridge this gap, we utilize a **two-stage curriculum** on the **OpenEnv** framework:
1. **Stage 1 (Easy):** Efficiency in hidden constraint discovery (Research).
2. **Stage 2 (Medium):** Balanced synthesis using a **Harmonic Mean Reward** (Decision Making).
---
## 2. Curriculum Stage 1: Research Stability
We validated the environment using a **Scripted Oracle** to ensure the task is solvable and the rewards are deterministic. We then ran a baseline using the untrained LLM.
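The validation loop can be sketched with a toy stand-in. The `ToyEnv` and `ScriptedOracle` interfaces below are generic Gym-style assumptions for illustration, not the actual OpenEnv API:

```python
class ToyEnv:
    """Toy stand-in for the real task: 3 turns, 0.33 reward per turn."""
    def reset(self):
        self.t = 0
        return {"turn": self.t}

    def step(self, action):
        self.t += 1
        return {"turn": self.t}, 0.33, self.t >= 3  # obs, reward, done

class ScriptedOracle:
    """Deterministic policy: queries each expert once, in a fixed order."""
    EXPERTS = ["Finance", "Security", "UX"]

    def act(self, obs):
        return {"target": self.EXPERTS[obs["turn"]]}

def validate(env, oracle, episodes=10):
    """Roll out the scripted policy; check rewards are reproducible."""
    totals = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(oracle.act(obs))
            total += reward
        totals.append(round(total, 6))
    assert len(set(totals)) == 1, "rewards must be deterministic"
    return totals[0]

validate(ToyEnv(), ScriptedOracle())  # ≈ 0.99
```

The Oracle's 0.99 cumulative reward in the table below is exactly this pattern: three efficient queries at 0.33 each.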
### Easy Mode: Performance Metrics
| Metric | Oracle Scripted | Base LLM (Pre-Training) | Delta |
| :--- | :---: | :---: | :---: |
| **Completion Rate** | 1.00 | 1.00 | -- |
| **Avg. Cumulative Reward** | 0.99 | 0.825 | -0.165 |
| **Avg. Final Step Reward** | 0.33 | 0.264 | -0.066 |
| **Avg. Turns Completed** | 3.0 | 3.2 | +0.2 turns |
| **Constraint Discovery Rate** | 100% | 100% | -- |
### The "Policy Efficiency" Gap
While the Base LLM successfully discovers the constraints, it demonstrates **sloppy policy logic**:
* **Redundancy:** Episodes 2 and 4 showed the agent asking the same expert identical questions multiple times.
* **Shortcut Misuse:** Episode 5 showed the agent using `target="All"` to broadcast, resulting in a **0.0 reward** due to environment-enforced privacy/discipline penalties.
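The environment-side rule behind that penalty can be sketched as follows. `apply_discipline_penalty` and the action-dict shape are hypothetical illustrations, not the environment's actual code:

```python
def apply_discipline_penalty(action, base_reward):
    """Hypothetical sketch of the environment rule: broadcasting to every
    expert at once bypasses individual negotiation, so the step reward
    is zeroed regardless of the question asked."""
    if action.get("target") == "All":
        return 0.0  # privacy/discipline penalty
    return base_reward

apply_discipline_penalty({"target": "All", "question": "Constraints?"}, 0.33)     # 0.0
apply_discipline_penalty({"target": "Finance", "question": "Budget cap?"}, 0.33)  # 0.33
```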
---
## 3. Curriculum Stage 2: Synthesis & The "Final Boss"
In Medium Mode, we shift the reward signal from "Discovery" to "Synthesis." The model must now generate a PRD that satisfies all three constraints simultaneously.
### Medium Mode Baseline (The Problem)
| Metric | Base LLM (Pre-Training) | Target (Post-Training) |
| :--- | :--- | :--- |
| **Avg. Final Reward** | **0.00** | **> 0.90** |
| **Synthesis Accuracy** | **Low** | **High** |
**Observation:**
Even when the Base LLM knows the constraints (e.g., $50k budget), it often omits them in the final PRD in favor of professional-sounding "filler" text. Because we use a **Harmonic Mean Reward**, failing to satisfy even one stakeholder (Finance, Security, or UX) results in a total reward collapse to **0.0**.
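This collapse is a direct property of the harmonic mean: a zero on any axis zeroes the whole reward, so the agent cannot trade one stakeholder off against another. A minimal sketch (the stakeholder keys mirror the report; the scoring function itself is an assumption):

```python
def harmonic_mean_reward(scores):
    """Harmonic mean over per-stakeholder scores in [0, 1].

    A single zero collapses the reward to 0.0, so ignoring any one
    stakeholder cannot be bought back by pleasing the others.
    """
    if any(s == 0 for s in scores.values()):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores.values())

# Perfect Security and UX cannot rescue an omitted Finance constraint:
harmonic_mean_reward({"Finance": 0.0, "Security": 1.0, "UX": 1.0})  # 0.0
# Balanced partial satisfaction is rewarded:
harmonic_mean_reward({"Finance": 0.9, "Security": 0.9, "UX": 0.9})  # ≈ 0.9
```

Compare with an arithmetic mean, which would score the first case around 0.67 and let the model ignore Finance entirely.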
---
## 4. Training Roadmap (Onsite 36-Hour Sprint)
Our objective is to close the gap between the **Base LLM** and the **Oracle**.
### Training Targets
| Focus Area | Metric | Baseline | Target |
| :--- | :--- | :---: | :---: |
| **Policy** | Broadcast (`target="All"`) Usage | Present | **0%** |
| **Policy** | Repeated Query Rate | Present | **< 5%** |
| **Logic** | Multi-Constraint Synthesis | 0% | **> 90%** |
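The Repeated Query Rate target can be measured straight from episode logs. One possible implementation, assuming queries are logged as `(target, question)` pairs:

```python
from collections import Counter

def repeated_query_rate(queries):
    """Fraction of queries that repeat an earlier (target, question) pair."""
    if not queries:
        return 0.0
    counts = Counter(queries)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(queries)

log = [("Finance", "What is the budget cap?"),
       ("Finance", "What is the budget cap?"),   # redundant re-ask
       ("UX", "What is the latency target?")]
repeated_query_rate(log)  # ≈ 0.33
```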
---
## 5. Judging Narrative & Pitch
> "Our evaluation proves that LLMs don't fail because they are 'uninformed'; they fail because their default policy is inefficient and sycophantic. We built a stable, verifiable gym where the Oracle proves perfection is possible. Our GRPO training will move the agent from a **0.825** sloppy researcher to a **0.99** disciplined negotiator and bridge the **0.0 to 0.9** gap in multi-stakeholder synthesis."
---
## 6. Failure Breakdown (Pre-Training)
| Failure Type | Count | Interpretation |
| :--- | :---: | :--- |
| Policy Loops | 2/10 | Asked the same question 3 times in one episode. |
| Broadcast Penalty | 2/10 | Tried to message 'All' to skip individual negotiation. |
| Synthesis Failure | 10/10 | Failed to include all 3 exact patterns in the final PRD (Medium). |
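A containment check of this kind is easy to sketch. The snippet below is illustrative only: the `$50k` figure appears in the report, while the other two required snippets are hypothetical stand-ins for the environment's actual patterns:

```python
# Hypothetical stand-ins for the environment's required patterns; only
# the $50k budget figure is taken from the report itself.
REQUIRED_SNIPPETS = ["$50k", "SOC 2", "3-click"]

def synthesis_ok(prd_text):
    """True only if every required constraint snippet appears in the PRD."""
    text = prd_text.lower()
    return all(snippet.lower() in text for snippet in REQUIRED_SNIPPETS)

synthesis_ok("Budget capped at $50k; SOC 2 audit required; 3-click checkout.")  # True
synthesis_ok("A best-in-class experience that delights every stakeholder.")     # False
```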

### The Story This Tells the Judges

If you walk into the judging room with a model that already scores **1.0**, the judges will say, "You didn't need RL for this. Just use a better prompt."

By showing them this baseline, you are saying: "Prompt engineering isn't enough for complex, multi-agent negotiation. The base model gets distracted (Episode 5), hallucinates JSON (Episode 3), and fails to synthesize all constraints into the final document (Episode 2). It's a sycophant that tries to please the last person it talked to. We need Reinforcement Learning to fix this."