Addyk24 committed on
Commit f714c26 · verified · 1 Parent(s): 73e82fa

Delete EVAL_REPORT.md

Files changed (1)
  1. EVAL_REPORT.md +0 -73
EVAL_REPORT.md DELETED
@@ -1,73 +0,0 @@
# Evaluation Report: Project Polymath

## 1. Executive Summary
Project Polymath addresses the "Sycophancy Gap" in current Large Language Models (LLMs): the tendency to prioritize polite, generic prose over strict adherence to conflicting stakeholder constraints.

To bridge this gap, we use a **two-stage curriculum** on the **OpenEnv** framework (a configuration sketch follows the list):
1. **Stage 1 (Easy):** Efficiency in hidden constraint discovery (Research).
2. **Stage 2 (Medium):** Balanced synthesis using a **Harmonic Mean Reward** (Decision Making).
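
As a rough sketch of how the two stages differ, the curriculum could be captured in a configuration like the one below. The field names and values are hypothetical, not OpenEnv's actual schema.

```python
# Hypothetical curriculum configuration; all field names are illustrative,
# not OpenEnv's actual schema.
CURRICULUM = [
    {
        "stage": "easy",
        "objective": "research",         # surface the hidden constraints
        "reward": "per-step discovery",  # ~0.33 per constraint found (Section 2)
    },
    {
        "stage": "medium",
        "objective": "synthesis",        # write a PRD satisfying all constraints
        "reward": "harmonic mean of stakeholder satisfaction",
    },
]
```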

---

## 2. Curriculum Stage 1: Research Stability
We validated the environment using a **Scripted Oracle** to ensure the task is solvable and its rewards are deterministic. We then ran a baseline with the untrained LLM.
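
For reference, a minimal sketch of the oracle policy is below. The `reset`/`step` calls and the action dictionary are assumptions for illustration, not confirmed OpenEnv signatures; only the `target` field and the three stakeholders come from this report.

```python
# Minimal sketch of the Scripted Oracle policy: query each stakeholder exactly
# once, never broadcast, then stop. The env API shown here is assumed.

EXPERTS = ["Finance", "Security", "UX"]  # the three stakeholders in Polymath

def run_oracle_episode(env) -> float:
    env.reset()
    total_reward = 0.0
    for expert in EXPERTS:  # one targeted question per expert => 3 turns
        action = {"target": expert, "message": "What is your hard constraint?"}
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward  # ~0.99 per the table below (~0.33 per step)
```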

### Easy Mode: Performance Metrics
| Metric | Scripted Oracle | Base LLM (Pre-Training) | Delta |
| :--- | :---: | :---: | :---: |
| **Completion Rate** | 1.00 | 1.00 | -- |
| **Avg. Cumulative Reward** | 0.99 | 0.825 | -0.165 |
| **Avg. Final Step Reward** | 0.33 | 0.264 | -0.066 |
| **Avg. Turns Completed** | 3.0 | 3.2 | +0.2 turns |
| **Constraint Discovery Rate** | 100% | 100% | -- |

### The "Policy Efficiency" Gap
While the Base LLM successfully discovers the constraints, it demonstrates **sloppy policy logic**:
* **Redundancy:** In Episodes 2 and 4, the agent asked the same expert identical questions multiple times.
* **Shortcut Misuse:** In Episode 5, the agent used `target="All"` to broadcast, earning a **0.0 reward** from environment-enforced privacy/discipline penalties (see the sketch after this list).
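
A minimal sketch of how those two penalties could be expressed, assuming a simple action/history schema (the actual environment logic may differ):

```python
def discipline_penalty(action: dict, history: list[dict]) -> float | None:
    """Illustrative re-creation of the two penalties described above.

    Returns 0.0 if the action should be zeroed out, or None if it is
    disciplined and should be scored normally.
    """
    # Broadcast Penalty: messaging "All" skips individual negotiation.
    if action.get("target") == "All":
        return 0.0

    # Redundancy: re-asking an expert the identical question earns nothing.
    for past in history:
        if past["target"] == action["target"] and past["message"] == action["message"]:
            return 0.0

    return None
```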

---

## 3. Curriculum Stage 2: Synthesis & The "Final Boss"
In Medium Mode, we shift the reward signal from "Discovery" to "Synthesis": the model must now generate a PRD (Product Requirements Document) that satisfies all three stakeholder constraints simultaneously.

### Medium Mode Baseline (The Problem)
| Metric | Base LLM (Pre-Training) | Target (Post-Training) |
| :--- | :---: | :---: |
| **Avg. Final Reward** | **0.00** | **> 0.90** |
| **Synthesis Accuracy** | **Low** | **High** |

**Observation:**
Even when the Base LLM knows the constraints (e.g., the $50k budget), it often omits them from the final PRD in favor of professional-sounding "filler" text. Because we use a **Harmonic Mean Reward**, failing to satisfy even one stakeholder (Finance, Security, or UX) collapses the total reward to **0.0**.
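
To make the collapse concrete, here is a minimal sketch of a harmonic-mean scorer over binary per-stakeholder checks. Only the $50k budget pattern comes from this report; the other patterns, and the function itself, are illustrative.

```python
import re

# Illustrative per-stakeholder checks. Only the $50k budget appears in this
# report; the Security and UX patterns are placeholders.
CONSTRAINT_PATTERNS = {
    "finance": r"\$50k",
    "security": r"<security constraint>",  # placeholder
    "ux": r"<ux constraint>",              # placeholder
}

def harmonic_mean_reward(prd_text: str) -> float:
    """Harmonic mean of per-stakeholder scores (binary pattern checks here).

    The harmonic mean n / sum(1/s_i) is dominated by its smallest term, so a
    single unsatisfied stakeholder collapses the reward to 0.0.
    """
    scores = [
        1.0 if re.search(pat, prd_text, re.IGNORECASE) else 0.0
        for pat in CONSTRAINT_PATTERNS.values()
    ]
    if min(scores) == 0.0:
        return 0.0  # avoid division by zero; one failure zeroes everything
    return len(scores) / sum(1.0 / s for s in scores)
```

An arithmetic mean would award 0.67 for satisfying two of three stakeholders; the harmonic mean makes partial compliance worthless, which is exactly the pressure that forces genuine synthesis.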

---

## 4. Training Roadmap (Onsite 36-Hour Sprint)
Our objective is to close the gap between the **Base LLM** and the **Oracle**.

### Training Targets
| Focus Area | Metric | Baseline | Target |
| :--- | :--- | :---: | :---: |
| **Policy** | Broadcast (`target="All"`) Usage | Present (2/10 episodes) | **0%** |
| **Policy** | Repeated Query Rate | Present (2/10 episodes) | **< 5%** |
| **Logic** | Multi-Constraint Synthesis | 0% | **> 90%** |
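
Both Policy metrics are simple to compute from episode action logs. A sketch, assuming a minimal `{"target", "message"}` log schema:

```python
from collections import Counter

def policy_metrics(episodes: list[list[dict]]) -> dict:
    """Compute the two Policy metrics above from per-episode action logs.

    Each episode is a list of actions shaped like {"target": ..., "message": ...};
    this log schema is assumed for illustration.
    """
    total = sum(len(ep) for ep in episodes) or 1  # guard against empty logs
    broadcasts = sum(a["target"] == "All" for ep in episodes for a in ep)
    repeats = sum(
        n - 1
        for ep in episodes
        for n in Counter((a["target"], a["message"]) for a in ep).values()
    )
    return {
        "broadcast_usage": broadcasts / total,
        "repeated_query_rate": repeats / total,
    }
```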

---

## 5. Judging Narrative & Pitch
> "Our evaluation proves that LLMs don't fail because they are 'uninformed'; they fail because their default policy is inefficient and sycophantic. We built a stable, verifiable gym where the Oracle proves perfection is possible. Our GRPO training will move the agent from a **0.825** sloppy researcher to a **0.99** disciplined negotiator and bridge the **0.0 to 0.9** gap in multi-stakeholder synthesis."

---

## 6. Failure Breakdown (Pre-Training)
| Failure Type | Count | Interpretation |
| :--- | :---: | :--- |
| Policy Loops | 2/10 | Asked the same question 3 times in one episode. |
| Broadcast Penalty | 2/10 | Tried to message 'All' to skip individual negotiation. |
| Synthesis Failure | 10/10 | Failed to include all 3 exact patterns in the final PRD (Medium). |

**The Story This Tells the Judges:**

If you walk into the judging room with a model that already scores a **1.0**, the judges will say, "You didn't need RL for this. Just use a better prompt." By showing them this baseline, you are saying: "Prompt engineering isn't enough for complex, multi-agent negotiation. The base model gets distracted (Episode 5), hallucinates JSON (Episode 3), and fails to synthesize all constraints into the final document (Episode 2). It's a sycophant that tries to please the last person it talked to. We need Reinforcement Learning to fix this."