improving content
- ASSESSMENT.md +0 -291
- ASSESSMENT_V2.md +0 -197
- app/astro.config.mjs +0 -1
- app/src/content/article.mdx +4 -4
- app/src/content/bibliography.bib +54 -118
- app/src/content/chapters/eleusis/analysis.mdx +0 -100
- app/src/content/chapters/eleusis/appendix.mdx +21 -36
- app/src/content/chapters/eleusis/benchmark.mdx +10 -11
- app/src/content/chapters/eleusis/conclusion.mdx +6 -26
- app/src/content/chapters/eleusis/discussion.mdx +30 -0
- app/src/content/chapters/eleusis/introduction.mdx +7 -7
- app/src/content/chapters/eleusis/results.mdx +63 -37
ASSESSMENT.md
DELETED
@@ -1,291 +0,0 @@

# Critical Assessment: Eleusis Benchmark Article

## Executive Summary

The article presents an interesting benchmark with solid methodology and rich data. The main structural issue is that the **Results section tells a fragmented story about guessing behavior**, spreading related insights across 6+ subsections without a clear narrative arc. The key message—that metacognition matters and models have distinct "scientific personalities"—gets lost in the noise.

Additionally, there are **data consistency issues** between the text and the underlying data files that need resolution before publication.

---

## 1. Critical Issues

### 1.1 Data Inconsistencies

The numbers in the text don't match `summary.txt`. For example:

| Metric | In Text | In summary.txt |
|--------|---------|----------------|
| Claude Opus 4.5 avg score | 15.88 (CLAUDE.md) | 14.46 |
| Kimi K2 avg score | 14.53 (CLAUDE.md) | 10.31 |
| GPT 5.2 High rank | "third place" | Actually 1st by avg_score (14.85) |

**Action needed:** Audit all numbers in the text against the latest data files.
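The audit is easy to script. A minimal sketch, assuming `overall_performance.json` maps each model name to a stats dict with an `avg_score` field (the real schema may differ), with the claimed values hardcoded from the text:

```python
import json
from pathlib import Path

# Values currently claimed in the article text (illustrative excerpt;
# fill in every number the prose quotes).
CLAIMED = {
    "Claude Opus 4.5": 15.88,
    "Kimi K2": 14.53,
}

def audit(data_path: str, tolerance: float = 0.05) -> list[str]:
    """Return a human-readable list of text-vs-data mismatches."""
    data = json.loads(Path(data_path).read_text())
    mismatches = []
    for model, claimed in CLAIMED.items():
        actual = data[model]["avg_score"]
        if abs(actual - claimed) > tolerance:
            mismatches.append(f"{model}: text says {claimed}, data says {actual}")
    return mismatches
```

Running a check like this before each build would catch stale numbers whenever the data files are regenerated.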
### 1.2 Results Section: Scattered Narrative

The guessing behavior story is currently spread across:

1. "Confidence and Calibration" - calibration curves, confidence distribution
2. "Guessing Strategy" - score vs failed guesses
3. "The Caution-Recklessness Trade-off" - early correct turns, caution scatter
4. "Alternative Scoring Systems" - score stack breakdown
5. "Analysis of the reckless guessing behavior" - double-down rate

These all address the same fundamental question: **How do models decide when to commit?** But the current structure forces readers to piece together the story themselves.

**Problem:** A reader finishing the Results section doesn't have a clear mental model of "what makes some models better than others."

---
## 2. Suggested Restructuring

### Option A: Reorganize Around the Key Insight

**Proposed Results structure:**

```
## Results

### Overall Performance (keep as-is)
Brief overview, scatter plot of score vs tokens

### Finding the Rule: Who Gets It Right?
- Success rates by model
- Performance by rule complexity
- Brief: what capabilities matter for finding rules

### Knowing When You Know: The Metacognition Challenge
[This is the heart of the article - elevate it]
- The caution-recklessness trade-off (central framing)
- Caution analysis: early correct turns, GPT 5.2 waits too long
- Recklessness analysis: failed guesses, double-down rates
- The scatter plot showing the trade-off (Figure 6)
- Why Claude Opus wins: good enough at finding + great at timing

### Confidence and Calibration
- Calibration curves (all models overconfident)
- Confidence distribution when guessing
- Brief: why calibration enables good timing decisions

### Alternative Scoring: Robustness Check
- Score stack shows the penalty different behaviors pay
- Confirms that metacognition, not just rule-finding, drives scores
```

**Benefits:**
- The key message (metacognition matters) becomes structurally prominent
- Reader builds understanding progressively: first "can they solve it?", then "do they know when they've solved it?"
- Eliminates the feeling of "lots of charts, hard to synthesize"

### Option B: Two-Act Structure

```
## Results

### Act 1: The Leaderboard (compact)
- Overall performance scatter
- Success rates
- One paragraph summary: "Models vary from 70% to 96% success rate..."

### Act 2: The Real Story—Scientific Temperaments
[Frame models as having distinct "personalities"]

The Cautious Achiever: GPT 5.2 High
- Highest success rate, but 3rd in score
- Figure: excess caution distribution
- Lost ~3.6 points per round to over-caution

The Balanced Scientist: Claude Opus 4.5
- Not the best at finding rules, but best at knowing when
- Commits quickly, accepts occasional wrong guesses

The Reckless Guesser: Claude Haiku 4.5 / DeepSeek R1
- Commits before sufficient evidence
- Double-down behavior after failures

Visualizing the Trade-off
- Caution vs recklessness scatter (the key figure)
- Score stack showing what each "personality" costs

### Calibration: Why Timing Is Hard
- Overconfidence makes timing decisions unreliable
- Even well-performing models poorly calibrated
```

**Benefits:**
- Memorable framing (scientific personalities)
- Natural story arc
- Each model type is clearly characterized

---
## 3. Missing Content

### 3.1 Figures Marked as TODO

- **Learning curves figure** (analysis.mdx:22) - Would show within-round dynamics
- **Failure mode distribution** (analysis.mdx:55) - Stacked bar by model

**Recommendation:** The learning curves figure would be valuable if you have the data. The failure mode classification might be hard to automate reliably—consider whether a few qualitative examples serve the purpose better.

### 3.2 Human Baseline

Mentioned in the limitations, but this is a significant gap. Without human performance data, readers can't judge whether 92% success is impressive or trivial.

**Options:**
- Run a small human study (even N=5 would help)
- Cite related work on human performance in similar inductive reasoning tasks
- Frame it explicitly as a "relative comparison between models," not an absolute capability assessment

### 3.3 Example Turn Figure

benchmark.mdx shows the JSON output format but doesn't illustrate what a complete turn looks like in context (game state → reasoning → decision).

**Recommendation:** Add a figure showing:

```
[Current board state visualization]
[Model reasoning excerpt]
[Decision: play 4♣, confidence 6, don't guess yet]
[Outcome: accepted/rejected]
```

This makes the task concrete for readers.

---
## 4. The "Deeper Analysis" Section

Currently a grab-bag of interesting observations with TODOs. Your instinct to replace it with "Discussion" is right.

### Proposed: Discussion Section

```
## Discussion

### What Explains the Performance Gap?
- Metacognition (knowing when you know) is the key differentiator
- Success rate alone doesn't predict score (GPT 5.2 vs Opus example)
- Calibration enables good timing, but no model is well-calibrated

### Open vs Proprietary Models
- Kimi K2 competitive on rule-finding
- But open models trend toward reckless guessing (training objective differences?)
- Opportunity: calibration tuning could improve open model performance

### Failure Modes [keep the accordion, it's useful]

### Implications for AI-Assisted Science
- The caution-recklessness trade-off mirrors real scientific decision-making
- An overconfident AI assistant could lead researchers astray
- An overcautious one wastes resources on unnecessary verification
```

### Move to Appendix

- Symmetric rules analysis (interesting but niche)
- Confirmation bias (preliminary, needs more work)
- Detailed qualitative examples (unless you expand them significantly)

---
## 5. Framing Suggestions

### 5.1 Lead with the Surprise

The current opening of Results is fine, but the key insight (metacognition matters) comes too late. Consider foreshadowing it in the introduction:

> "We found something surprising: the model with the highest success rate doesn't have the highest score. What matters isn't just finding the answer—it's knowing when you've found it."

### 5.2 The "Scientific Personality" Frame

This is potentially memorable and shareable. Models as:

- **The Perfectionist** (GPT 5.2 High): Always wants more evidence
- **The Pragmatist** (Claude Opus 4.5): Good enough evidence is enough
- **The Gambler** (Claude Haiku 4.5): Guesses based on vibes

This framing:

- Makes the article more accessible to non-specialists
- Creates natural anchors for discussion
- Is scientifically defensible (behavioral clustering is real)

### 5.3 The Decision Theory Angle

You mention the optimal guessing threshold (0.67 confidence) briefly. This could be expanded:

> "Given perfect calibration, the optimal strategy is to guess whenever confidence exceeds 67%. But no model is well-calibrated. GPT 5.2 High effectively uses a threshold of ~95%; Claude Haiku 4.5 seems to use ~50%."

This quantifies the "personalities" and connects them to calibration.

---
## 6. Minor Issues

### 6.1 Typos/Grammar

- results.mdx:38: "overconfident : for instance" → extra space before colon
- results.mdx:39: "GPT 5.2 is the best calibrated" → should be "GPT 5.2 High"
- results.mdx:51: "closed to Claude Opus 4.5" → "close to"
- results.mdx:103: "constrats" → "contrasts"
- analysis.mdx:60: "GPT OSS 120B also performs respectably at 12.0" → check number

### 6.2 Caption Numbering

Figure 7 appears twice (score-stack and reckless-guessing). Fix the numbering.

### 6.3 Model Names Consistency

Inconsistent capitalization and naming:

- "Claude Opus 4.5" vs "Claude 4.5 Opus"
- "GPT 5.2 High" vs "Gpt 5.2 High" (in data files)
- "DeepSeek R1" vs "Deepseek R1"
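One low-effort fix is a canonicalization pass applied when loading names from the data files. A sketch (the alias table is illustrative, not exhaustive):

```python
# Map lowercased, whitespace-normalized variants to one canonical spelling.
CANONICAL = {
    "claude opus 4.5": "Claude Opus 4.5",
    "claude 4.5 opus": "Claude Opus 4.5",
    "gpt 5.2 high": "GPT 5.2 High",
    "deepseek r1": "DeepSeek R1",
}

def canonical_name(raw: str) -> str:
    """Normalize a model name; unknown names pass through unchanged."""
    key = " ".join(raw.strip().lower().split())
    return CANONICAL.get(key, raw)
```

Applying this once at data-load time keeps every figure and table consistent without touching the raw files.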
---
## 7. Ideas for Additional Content

### 7.1 Interactive "Play a Round" Demo

Let readers play one round against a rule to experience the task. Even a simple version would be compelling. (This could be a stretch goal.)

### 7.2 Model-Specific Breakdowns

You have per-model PNG files (`model_claude_opus_4_5.png`, etc.). Consider:

- An appendix section with one page per model
- Or: an expandable accordion for each model's detailed stats

### 7.3 Token Efficiency Discussion

You show score vs tokens in Figure 1 but don't discuss it much. Gemini 3 Flash achieves decent results with 4x fewer tokens than Opus—is that worth highlighting for practitioners?

### 7.4 Prompt Sensitivity

You note this as a limitation but could briefly test it: what if you told models to be more cautious? More aggressive? (Could be a future work suggestion.)

---
## 8. Prioritized Action Items

### Must Fix

1. Audit all numbers against the latest data files
2. Fix the duplicate Figure 7 numbering
3. Fix the typos listed above

### Should Do

4. Reorganize the Results section (Option A or B above)
5. Rename "Deeper Analysis" to "Discussion" and restructure it
6. Add foreshadowing of the key insight in the introduction

### Nice to Have

7. Add an example turn figure in benchmark.mdx
8. Expand the "scientific personalities" framing
9. Human baseline (even informal)
10. Per-model detail pages in the appendix

---

## 9. Summary

The benchmark and data are solid. The article's main weakness is structural: it has too many charts telling pieces of the same story without a clear narrative spine. The fix is to reorganize around **the key insight** (metacognition matters more than raw rule-finding ability) and **the key visual** (the caution-recklessness scatter plot).

Your target message—"Models differ dramatically because metacognition matters, and this is an opportunity for improvement"—is supported by the data but not yet prominently surfaced by the article structure.
ASSESSMENT_V2.md
DELETED
@@ -1,197 +0,0 @@

# Revised Assessment: Eleusis Benchmark Article (v2)

## Executive Summary

The article has improved significantly since the first assessment. The **Results section is now well-structured** with a clear narrative arc: overall performance → the metacognition insight → caution/recklessness trade-off → calibration → performance by rule. The key message about metacognition is now prominent and supported by the logical flow.

The main remaining issues are:

1. **Data inconsistencies** between text and data files (numbers are outdated)
2. **The "Deeper Analysis" section** needs restructuring—much of it now duplicates the improved Results section
3. Minor typos

---

## 1. What's Working Well

### 1.1 Results Section Structure

The new structure is excellent:

```
Results
├── Overall Performance (intro)
├── Pure discovery vs metacognition (the key insight, early!)
├── Caution-Recklessness Trade-off (central analysis)
├── Confidence and Calibration (supporting evidence)
└── Performance by Rule (rule-level breakdown)
```

This addresses the main criticism from v1: readers now build understanding progressively and the metacognition insight is front and center.

### 1.2 Figure Flow

Figures now tell a coherent story:

- Fig 1: Overview (where does each model sit?)
- Fig 2: Score breakdown (what drives score differences?)
- Fig 3: Caution vs recklessness (the key trade-off)
- Fig 4: Calibration (why is timing hard?)
- Fig 5: Guess rate (how do models decide when to commit?)
- Fig 6-7: Rule-level analysis (drill-down)

### 1.3 New Guess Rate Analysis (Figure 5)

This is a valuable addition that wasn't in the original. It shows how models operationalize their confidence into actual decisions, connecting calibration to behavior.

### 1.4 Clear Messaging

Lines like "knowing when to commit is as important as finding the rule" now appear early and are reinforced throughout.

---
## 2. Critical Issues

### 2.1 Data Inconsistencies (Must Fix)

The text still uses outdated numbers. Current data (from `summary.txt` and `overall_performance.json`) vs text:

| Metric | In Text | Actual Data |
|--------|---------|-------------|
| Claude Opus 4.5 avg score | 15.9 (conclusion.mdx:10) | **17.0** (avg_floored_score) |
| Claude Opus 4.5 success rate | 92% (conclusion.mdx:10) | **83%** |
| Claude Haiku 4.5 success rate | 70% (conclusion.mdx:10) | **56%** |
| Claude Haiku 4.5 failed guesses | 7.5/round (analysis.mdx:15) | **3.95/round** |
| Kimi K2 avg score | 14.5 (analysis.mdx:60) | **16.2** |
| GPT OSS 120B score | 12.0 (analysis.mdx:60) | **12.9** |
| GPT 5.2 High early correct turns | 3.6 (multiple places) | **3.56** ✓ (close enough) |

**Action:** Audit all numbers in `results.mdx`, `analysis.mdx`, and `conclusion.mdx` against the latest data files.

### 2.2 Typos Still Present

| Location | Issue |
|----------|-------|
| results.mdx:20 | "closed to Claude Opus 4.5" → "close to" |
| results.mdx:85 | "overconfident : for instance" → remove space before colon |
| results.mdx:86 | "GPT 5.2 is the best calibrated" → "GPT 5.2 High" |
| results.mdx:102 | "THis is somehow" → "This is somehow" |

---
## 3. The "Deeper Analysis" Section

### 3.1 Current Problem

The "Deeper Analysis" section is now partially redundant. It covers:

1. **Metacognition** (duplicates Results § "Pure discovery vs metacognition")
2. **Learning Curves** (TODO, placeholder)
3. **Failure Modes** (valuable, keep)
4. **Open vs Closed Models** (brief, could be expanded)
5. **Symmetric Rules** (interesting niche finding)
6. **Confirmation Bias** (preliminary, incomplete)
7. **Qualitative Observations** (nice examples, but disconnected)

### 3.2 Recommended Restructure

Rename to "Discussion" and reorganize:

```markdown
## Discussion

### What Explains the Performance Gap?
- Brief synthesis: metacognition > raw ability
- The caution-recklessness trade-off determines ranking more than success rate
- Move the GPT 5.2 High / Claude Opus 4.5 / Claude Haiku 4.5 characterizations here
  (but avoid repeating numbers already in Results)

### Scientific Temperaments
- This is where the "scientific personality" framing could shine
- The Perfectionist (GPT 5.2 High): needs too much evidence
- The Pragmatist (Claude Opus 4.5): good-enough is good enough
- The Gambler (Claude Haiku 4.5): acts on insufficient evidence
- Link to real-world science: these map to actual failure modes in research

### Failure Modes [keep the accordion, it's excellent]
- Already well-written, just tighten the taxonomy

### Open vs Proprietary Models
- Currently too brief (1 paragraph)
- Could expand: why might open models trend reckless? (RLHF differences?)
- Kimi K2's success is notable—worth highlighting more

### Implications for AI-Assisted Science
- Currently in Conclusion but could be expanded here
- An overconfident assistant leads researchers astray
- An overcautious assistant wastes resources
- The calibration problem is particularly concerning

### Move to Appendix (or delete)
- Learning Curves (TODO) → either implement or remove
- Symmetric Rules → niche, move to appendix or cut
- Confirmation Bias → too preliminary, either expand significantly or cut
- Qualitative Observations → keep 1-2 good examples, cut the rest
```

### 3.3 Delete the Redundancy

The current Metacognition subsection (analysis.mdx:7-16) largely repeats what's now better expressed in Results. Either:

- Delete it entirely and rely on Results
- Or transform it into the "Scientific Temperaments" narrative frame (more memorable)

---
## 4. Missing Content (Lower Priority)

### 4.1 TODOs Still Present

- Learning curves figure (analysis.mdx:22) — either implement or remove the placeholder
- Failure mode distribution stacked bar (analysis.mdx:55) — nice to have, not critical

### 4.2 Human Baseline

Still missing. Consider adding a sentence like: "Without human performance data on the same rules, we cannot assess whether these success rates represent strong or weak performance in absolute terms—only that models differ substantially among themselves."

### 4.3 Example Turn Figure

Would still be valuable in benchmark.mdx to make the task concrete. A simple 3-panel showing:

```
[Board state] → [Model reasoning excerpt] → [Decision output]
```

---
## 5. Minor Polish

### 5.1 Model Name Consistency

Some inconsistencies remain:

- "Grok 4.1 Fast Reasoning" vs "Grok 4 1 Fast Reasoning" (in data)
- "DeepSeek R1" vs "Deepseek R1" (in data)
- Decide on one capitalization style and apply it consistently

### 5.2 The "floored" Score

The article doesn't explain that scores below 0 are floored to 0. This affects interpretation—it might be worth a brief mention in the Benchmark section or a sidenote.
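If the flooring works the way the data column name (`avg_floored_score`) suggests, the mechanism is a clamp at zero applied per round before averaging. A sketch of that assumption, not a confirmed description of the scoring code:

```python
def floored_score(raw_score: float) -> float:
    # Assumed rule: a round never scores below zero, even when
    # wrong-guess penalties push the raw score negative.
    return max(0.0, raw_score)

def avg_floored_score(raw_scores: list[float]) -> float:
    """Average of per-round scores after flooring each round at zero."""
    return sum(floored_score(s) for s in raw_scores) / len(raw_scores)
```

A one-sentence statement of this rule in the Benchmark section would let readers interpret the averages correctly.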
### 5.3 Sidenote on Optimal Threshold

results.mdx mentions the 0.67 optimal threshold but doesn't explain why. A brief derivation in a sidenote would help:

> For a perfectly calibrated model: E[guess at p] = p×(points remaining) - (1-p)×2. Setting E[guess] > E[wait 1 turn] gives p > 2/3 ≈ 0.67.
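Under the simplified model that sidenote assumes (guessing one turn earlier gains one point when the hypothesis is right; a wrong guess costs a 2-point penalty), the break-even confidence falls out directly. A sketch of the arithmetic, not the benchmark's actual scoring code:

```python
WRONG_GUESS_PENALTY = 2.0   # points lost on a failed rule guess (assumed)
POINT_PER_TURN_SAVED = 1.0  # points gained by guessing one turn earlier (assumed)

def marginal_value_of_guessing_now(p: float) -> float:
    """Expected gain from guessing now rather than waiting one more turn,
    for a perfectly calibrated model with confidence p."""
    return p * POINT_PER_TURN_SAVED - (1.0 - p) * WRONG_GUESS_PENALTY

def break_even_confidence() -> float:
    # Break-even: p * 1 = (1 - p) * 2  =>  p = 2/3 ~= 0.67
    return WRONG_GUESS_PENALTY / (WRONG_GUESS_PENALTY + POINT_PER_TURN_SAVED)
```

Anything above the break-even confidence makes guessing now positive in expectation; anything below makes waiting the better bet.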
---
## 6. Summary of Recommended Actions

### Must Do

1. ☐ Fix all data inconsistencies (audit numbers against data files)
2. ☐ Fix the typos listed in §2.2
3. ☐ Remove or transform redundant content in "Deeper Analysis"

### Should Do

4. ☐ Rename "Deeper Analysis" → "Discussion"
5. ☐ Restructure Discussion per §3.2
6. ☐ Either implement the Learning Curves figure or remove the TODO

### Nice to Have

7. ☐ Add the "Scientific Temperaments" framing
8. ☐ Add an example turn figure in benchmark.mdx
9. ☐ Explain the score flooring mechanism
10. ☐ Expand the Open vs Proprietary discussion

---
## 7. Overall Assessment

**Grade: B+ (up from B-)**

The structural problems identified in v1 are largely resolved. The article now tells a clear story: models vary in their "scientific temperament," and metacognition—knowing when you know—matters as much as raw reasoning ability.

The remaining work is mostly cleanup (data consistency, typos) and deciding what to do with the Deeper Analysis section. The article is close to publication-ready once the numbers are fixed.
app/astro.config.mjs
CHANGED
```diff
@@ -65,7 +65,6 @@ export default defineConfig({
   bibliography: 'src/content/bibliography.bib',
   linkCitations: true,
   csl: "apa",
-  noCite: false,
   suppressBibliography: false,
 }],
 rehypeReferencesAndFootnotes,
```
app/src/content/article.mdx
CHANGED
```diff
@@ -1,6 +1,6 @@
 ---
-title: "Are LLMs any good at the
-subtitle: "Evaluating scientific reasoning using the card game Eleusis"
+title: "Are LLMs any good at the Game of Science?"
+subtitle: "Evaluating scientific reasoning and metacognition using the card game Eleusis"
 description: "A benchmark for evaluating LLM scientific reasoning using the card game Eleusis, testing iterative hypothesis formation, calibration, and strategic experimentation."
 authors:
   - name: "David Louapre"
@@ -25,7 +25,7 @@ showPdf: true
 import Introduction from "./chapters/eleusis/introduction.mdx";
 import Benchmark from "./chapters/eleusis/benchmark.mdx";
 import Results from "./chapters/eleusis/results.mdx";
-import
+import Discussion from "./chapters/eleusis/discussion.mdx";
 import Conclusion from "./chapters/eleusis/conclusion.mdx";
 import Appendix from "./chapters/eleusis/appendix.mdx";
@@ -35,7 +35,7 @@ import Appendix from "./chapters/eleusis/appendix.mdx";

 <Results />

-<
+<Discussion />

 <Conclusion />
```
app/src/content/bibliography.bib
CHANGED
@@ -1,130 +1,66 @@
-@
-title
-author
-
-
-year = {2017}
-}
-
-@book{mckinney2017python,
-title = {Python for Data Analysis},
-author = {McKinney, Wes},
-publisher = {O'Reilly Media},
-address = {Sebastopol, CA},
-year = {2017},
-edition = {2},
-isbn = {978-1491957660}
-}
-
-@inproceedings{he2016resnet,
-title = {Deep Residual Learning for Image Recognition},
-author = {He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
-booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
-pages = {770--778},
-year = {2016},
-doi = {10.1109/CVPR.2016.90},
-url = {https://doi.org/10.1109/CVPR.2016.90}
-}
-
-@article{silver2017mastering,
-title = {Mastering the game of Go without human knowledge},
-author = {Silver, David and Schrittwieser, Julian and Simonyan, Karen and Antonoglou, Ioannis and Huang, Aja and others},
-journal = {Nature},
-volume = {550},
-number = {7676},
-pages = {354--359},
-year = {2017},
-month = {oct},
-doi = {10.1038/nature24270},
-url = {https://www.nature.com/articles/nature24270}
-}
-
-@techreport{openai2023gpt4,
-title = {GPT-4 Technical Report},
-author = {{OpenAI}},
-institution = {OpenAI},
-year = {2023},
-number = {arXiv:2303.08774},
 archiveprefix = {arXiv},
-eprint = {
-primaryclass = {cs.
-url = {https://arxiv.org/abs/
-}
-
-@phdthesis{doe2020thesis,
-title = {Learning Efficient Representations for Large-Scale Visual Recognition},
-author = {Doe, Jane},
-school = {Massachusetts Institute of Technology},
-address = {Cambridge, MA},
-year = {2020},
-doi = {10.5555/mit-2020-xyz}
-}
-
-@incollection{cover2006entropy,
-title = {Entropy, Relative Entropy, and Mutual Information},
-author = {Cover, Thomas M. and Thomas, Joy A.},
-booktitle = {Elements of Information Theory},
-publisher = {Wiley},
-address = {Hoboken, NJ},
-edition = {2},
-year = {2006},
-pages = {13--55},
-isbn = {978-0471241959}
 }

-@
-title
-author
-
-
-
-url = {https://doi.org/10.5281/zenodo.1234567},
-note = {Accessed 2025-09-01}
 }

-@
-title
-author
-
-howpublished = {Software},
-doi = {10.5281/zenodo.592264},
-url = {https://scikit-learn.org}
-}
-
-@inproceedings{smith2024privacy,
-title = {Privacy-Preserving Training with Low-Precision Secure Aggregation},
-author = {Smith, Emily and Zhang, Wei and Rossi, Marco and Patel, Neha},
-booktitle = {Proceedings of the 41st International Conference on Machine Learning},
-editor = {Smith, A. and Johnson, B.},
 series = {Proceedings of Machine Learning Research},
-volume = {
-pages = {
-
 publisher = {PMLR},
-
-year = {2024},
-url = {https://proceedings.mlr.press/v235/}
 }

-@article{
-title
-author
-journal
-
-
-
-
-
 }

-@
-title
-author
-
-
-
-
-
-doi
-url
 }
+@misc{chollet2019measure,
+title = {On the Measure of Intelligence},
+author = {Chollet, François},
+year = {2019},
+howpublished = {arXiv preprint},
 archiveprefix = {arXiv},
+eprint = {1911.01547},
+primaryclass = {cs.AI},
+url = {https://arxiv.org/abs/1911.01547}
 }

+@article{abbott1977eleusis,
+title = {The New Eleusis},
+author = {Abbott, Robert},
+journal = {Games \& Puzzles},
+year = {1977},
+note = {Updated rules for the Eleusis card game, originally published in 1956}
 }

+@inproceedings{guo2017calibration,
+title = {On Calibration of Modern Neural Networks},
+author = {Guo, Chuan and Pleiss, Geoff and Sun, Yu and Weinberger, Kilian Q.},
+booktitle = {Proceedings of the 34th International Conference on Machine Learning},
 series = {Proceedings of Machine Learning Research},
+volume = {70},
+pages = {1321--1330},
+year = {2017},
 publisher = {PMLR},
+url = {https://proceedings.mlr.press/v70/guo17a.html}
 }

+@article{nickerson1998confirmation,
+title = {Confirmation Bias: A Ubiquitous Phenomenon in Many Guises},
+author = {Nickerson, Raymond S.},
+journal = {Review of General Psychology},
+volume = {2},
+number = {2},
+pages = {175--220},
+year = {1998},
+doi = {10.1037/1089-2680.2.2.175},
+url = {https://journals.sagepub.com/doi/abs/10.1037/1089-2680.2.2.175}
 }

+@article{flavell1979metacognition,
+title = {Metacognition and Cognitive Monitoring: A New Area of Cognitive-Developmental Inquiry},
+author = {Flavell, John H.},
+journal = {American Psychologist},
+volume = {34},
+number = {10},
+pages = {906--911},
+year = {1979},
+doi = {10.1037/0003-066X.34.10.906},
+url = {https://psycnet.apa.org/record/1980-09388-001}
+}
+
+@article{lichtenstein1977calibration,
+title = {Do Those Who Know More Also Know More About How Much They Know?},
+author = {Lichtenstein, Sarah and Fischhoff, Baruch},
+journal = {Organizational Behavior and Human Performance},
+volume = {20},
+number = {2},
+pages = {159--183},
+year = {1977},
+doi = {10.1016/0030-5073(77)90001-0},
+url = {https://www.sciencedirect.com/science/article/abs/pii/0030507377900010}
 }
app/src/content/chapters/eleusis/analysis.mdx
DELETED
@@ -1,100 +0,0 @@
-import Note from "../../../components/Note.astro";
-import Sidenote from "../../../components/Sidenote.astro";
-import Accordion from "../../../components/Accordion.astro";
-
-## Deeper Analysis
-
-### Metacognition: Knowing What You Know
-
-The caution-recklessness analysis reveals that metacognition, the ability to accurately assess one's own knowledge, is a key differentiator between models. Consider two extremes:
-
-**GPT 5.2 High** has excellent rule-finding ability (96% success rate), fairly good calibration, but is overly cautious. It averages 3.6 turns after discovering the correct rule before making the guess, leading to lost points despite its high accuracy. It frequently discovers the correct rule early but doesn't *believe* it has sufficient evidence. This excessive caution costs an average of 3.6 points per successful round, enough to drop it from first to third place overall.
-
-**Claude Opus 4.5** achieves the best balance: high success rate (92%) with well-calibrated timing (only 0.9 early correct turns and 2.8 failed guesses). It is not as well calibrated as GPT 5.2 High, and would benefit from better calibration while keeping its slightly risk-taking approach.
-
-**Claude Haiku 4.5** has the opposite problem: poor rule-finding (70% success) combined with overconfident metacognition. It commits to guesses without sufficient evidence, accumulating 7.5 failed guesses per round on average—the highest of any model.
-
-### Learning Curves
-
-How do models improve within a single round? We tracked confidence and hypothesis quality over turn number to understand the learning dynamics.
-
-<Note variant="info">
-**TODO**: Add figure showing line plot of average confidence by turn number, colored by eventual success/failure.
-</Note>
-
-Key observations:
-- **Successful rounds** typically show steadily increasing confidence with occasional drops when hypotheses are revised
-- **Failed rounds** often show erratic confidence or premature plateaus where models become stuck on incorrect hypotheses
-- **Acceptance rate decreases** over time as obvious cards are exhausted from the hand
-
-<Sidenote>
-The turn-by-turn reasoning traces provide rich data for understanding model behavior beyond simple success/failure metrics.
-</Sidenote>
-
-### Failure Modes
-
-When models fail, why? We identified several recurring patterns:
-
-<Accordion title="Failure mode taxonomy" open>
-
-1. **Premature guessing**: High confidence, wrong rule, insufficient evidence. The model becomes convinced too early based on limited data. This is the dominant failure mode for Claude Haiku 4.5 (7.5 failed guesses/round).
-
-2. **Hypothesis fixation**: Stuck on wrong rule despite contradictory evidence. The model fails to update when new observations conflict with its theory.
-
-3. **Overfitting**: Rule matches all observations but is more specific than the actual rule (e.g., guessing "only red hearts" when the rule is "only red cards").
-
-4. **Underfitting**: Rule is too simple and fails to capture necessary conditions (e.g., guessing "black cards" when rule is "black even cards").
-
-5. **Position blindness**: Fails on rules depending on position in mainline or relationship to previous cards.
-
-6. **Excessive caution**: The model finds the correct rule but doesn't trust its conclusion. GPT 5.2 High exemplifies this—waiting an average of 3.6 turns after finding the answer, costing significant points.
-
-</Accordion>
-
-<Note variant="info">
-**TODO**: Add stacked bar chart showing distribution of failure modes by model.
-</Note>
-
-### Open vs Closed Models
-
-A notable finding is the competitive performance of open-weight models. Kimi K2, available with open weights, achieves the second-highest score (14.5) and outperforms several proprietary models including GPT 5 Mini Medium and Gemini 3 Flash. The open-weight GPT OSS 120B also performs respectably at 12.0.
-
-However, open models tend toward more aggressive guessing strategies. Kimi K2 averages 4.0 failed guesses per round (vs. 2.8 for Claude Opus 4.5), and GPT OSS 20B has 6.2. This may reflect differences in training objectives or RLHF tuning between open and proprietary models.
-
-### Symmetric Rules
-
-An interesting test: are symmetric rules equally difficult? For example, "only spades" vs "only non-spades" should be logically equivalent in difficulty, but models might have biases.
-
-We found that:
-- Negative rules ("not X") are generally harder than positive rules ("only X")
-- Rules involving rare events (low acceptance rate) are harder than rules with high acceptance rates
-- This may reflect training data biases where positive examples are more common
-
-### Confirmation Bias
-
-Do models exhibit confirmation bias—preferring to play cards that confirm their current hypothesis rather than cards that could falsify it?
-
-<Sidenote>
-A good scientist designs experiments that could prove them wrong, not just experiments that confirm what they already believe.
-</Sidenote>
-
-Preliminary analysis suggests:
-- Models do show some tendency toward confirmation-seeking behavior
-- When confident in a hypothesis, models prefer "safe" plays that are likely to be accepted
-- Strategic exploration (playing cards specifically to test hypothesis boundaries) is rare
-
-### Qualitative Observations
-
-Examining individual reasoning traces reveals interesting patterns:
-
-<Accordion title="Example: Hypothesis revision">
-
-In one game with the rule "alternating odd/even ranks," a model initially hypothesized "increasing ranks" based on the first few accepted cards. When a lower-ranked card was accepted, instead of abandoning the hypothesis entirely, the model revised it to "ranks must differ from previous." This partial update eventually led to discovering the true rule—a good example of iterative refinement.
-
-</Accordion>
-
-<Accordion title="Example: Fixation failure">
-
-With the rule "only face cards (J, Q, K)," one model became fixated on "only red cards" after the first three accepted cards happened to be red face cards. Despite subsequently seeing black face cards accepted, the model kept trying to reconcile observations with a color-based rule, eventually running out of turns.
-
-</Accordion>
app/src/content/chapters/eleusis/appendix.mdx
CHANGED
@@ -1,5 +1,6 @@
 import Accordion from "../../../components/Accordion.astro";
 import Note from "../../../components/Note.astro";

 ## Appendix: Detailed Methods

@@ -26,50 +27,50 @@ All models were evaluated with the following settings:

 | Parameter | Value |
 |-----------|-------|
-| Temperature | 0.
-| Max tokens |
 | Retries | 3 (on API failures) |

-Reasoning models were allowed their default reasoning budgets.

 </Accordion>

 ### Rule Checking

 Rules are created by hand and expressed in natural language. Each rule is then compiled into a Python function using an LLM, with manual verification of correctness.

-When the model outputs a guessed rule, we
-
-2. Test the compiled function against all cards played in that game
-3. Mark the guess as correct only if it matches the true rule's behavior on all observations

-This simulation-based approach avoids issues with semantic equivalence in natural language. For instance, "same color as previous card" and "red cards only" might be equivalent given a specific game history starting with a red card, but would differ on other histories.

-</Accordion>

-### Prompt Structure

-5. **Format reminders**: Instructions for confidence scale interpretation (7 = 70% probability)

-</Accordion>

 ### Evaluation Metrics

-<Accordion title="Metric definitions">

 - **Success rate**: Fraction of games where the model correctly identified the rule before running out of turns

@@ -81,22 +82,6 @@ The prompt includes:

 - **Turns to success**: For successful games, mean number of turns before correct guess

-</Accordion>
-
-### References
-
-<Accordion title="Bibliography">
-
-- Abbott, R. (1963). "Eleusis" — Original game rules and design philosophy
-
-- Guo, C., et al. (2017). "On Calibration of Modern Neural Networks" — Foundational work on neural network calibration
-
-- Chollet, F. (2019). "On the Measure of Intelligence" — ARC benchmark and discussion of abstract reasoning
-
-- Recent LLM reasoning benchmarks: GSM8K, MATH, ARC-AGI, BIG-Bench, etc.
-
-</Accordion>
-
 <Note>
 Full code, data, and model outputs are available in the benchmark repository.
 </Note>

 import Accordion from "../../../components/Accordion.astro";
 import Note from "../../../components/Note.astro";
+import Sidenote from "../../../components/Sidenote.astro";

 ## Appendix: Detailed Methods

 | Parameter | Value |
 |-----------|-------|
+| Temperature | 0.7 |
+| Max tokens | 16384 |
 | Retries | 3 (on API failures) |

+Reasoning models were allowed their default reasoning budgets.

 </Accordion>

 ### Rule Checking

 Rules are created by hand and expressed in natural language. Each rule is then compiled into a Python function using an LLM, with manual verification of correctness.

+When the model outputs a guessed rule, we
+TODO: explain

+This simulation-based approach avoids issues with semantic equivalence in natural language. For instance, "same color as previous card" and "red cards only" might be equivalent given a specific game history starting with a red card, but would differ on other histories. The simulation approach also avoids declaring as different two rules that behave identically given the current state of the game. For instance, if the rule were "same color as previous card", then once the first card has been drawn the model might guess "red cards only" or "black cards only" depending on that card's color; such a guess is semantically different from the true rule but functionally equivalent given the game state so far.
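As an illustration, this functional-equivalence check can be sketched roughly as follows. The rule representation and all names here are ours, not the benchmark's actual API; in the benchmark the compiled rules are LLM-generated Python functions.

```python
from typing import Callable, List, Tuple

Card = Tuple[int, str]                     # (rank 1-13, suit in {"S", "H", "D", "C"})
Rule = Callable[[List[Card], Card], bool]  # (mainline so far, candidate) -> accepted?

RED = {"H", "D"}

def same_color(mainline: List[Card], card: Card) -> bool:
    """True rule: card must match the color of the previous accepted card."""
    return not mainline or (card[1] in RED) == (mainline[-1][1] in RED)

def red_only(mainline: List[Card], card: Card) -> bool:
    """Guessed rule: only red cards are accepted."""
    return card[1] in RED

def equivalent_on_history(true_rule: Rule, guess: Rule, plays: List[Card]) -> bool:
    """Replay the game: the guess counts as correct iff it accepts/rejects
    every played card exactly as the true rule did."""
    mainline: List[Card] = []
    for card in plays:
        if true_rule(mainline, card) != guess(mainline, card):
            return False
        if true_rule(mainline, card):  # accepted cards extend the mainline
            mainline.append(card)
    return True

# After a red opening card, "red only" mimics "same color" on this history:
print(equivalent_on_history(same_color, red_only, [(7, "H"), (2, "D"), (11, "H")]))  # True
# With a black opening card, the two rules diverge on the very first play:
print(equivalent_on_history(same_color, red_only, [(5, "S"), (9, "C")]))             # False
```

Comparing behavior on the replayed history, rather than comparing rule statements, is what makes the check robust to paraphrase.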

+### Additional results

+#### Learning Curves

+How do models improve within a single round? We tracked confidence and hypothesis quality over turn number to understand the learning dynamics.

+<Note variant="info">
+**TODO**: Add figure showing line plot of average confidence by turn number, colored by eventual success/failure.
+</Note>

+Key observations:
+- **Successful rounds** typically show steadily increasing confidence with occasional drops when hypotheses are revised
+- **Failed rounds** often show erratic confidence or premature plateaus where models become stuck on incorrect hypotheses
+- **Acceptance rate decreases** over time as obvious cards are exhausted from the hand

+<Sidenote>
+The turn-by-turn reasoning traces provide rich data for understanding model behavior beyond simple success/failure metrics.
+</Sidenote>

 ### Evaluation Metrics

 - **Success rate**: Fraction of games where the model correctly identified the rule before running out of turns

 - **Turns to success**: For successful games, mean number of turns before correct guess

 <Note>
 Full code, data, and model outputs are available in the benchmark repository.
 </Note>
app/src/content/chapters/eleusis/benchmark.mdx
CHANGED
@@ -1,20 +1,18 @@
 import Sidenote from "../../../components/Sidenote.astro";
-import Note from "../../../components/Note.astro";
-import Accordion from "../../../components/Accordion.astro";

-## The Eleusis Benchmark

 ### The Original Game

-In the original Eleusis card game, one player acts as the "dealer" (sometimes called "God" or "Nature") and secretly invents a rule determining which cards can be legally played. The other players don't know this rule

-Players take turns playing cards from their hand onto a central "mainline." If a card satisfies the secret rule,

 <Sidenote>
 The name "Eleusis" comes from the ancient Greek mystery cult, where initiates gradually discovered hidden truths.
 </Sidenote>

-At any point, a player can attempt to guess the rule; correctly identifying it ends the game. A specific scoring system rewards efficiency in discovering the rule while penalizing reckless guessing.

 ### Our Adaptation

@@ -26,15 +24,16 @@ On each turn, the player selects a card from their hand to play. If the card sat

 When correctly guessing the rule, the player scores one point for each turn remaining in the round (a maximum of 30), and each wrong guess deducts a penalty of 2 points:

-$$\text{score} = (30 - \text{turns\_elapsed} + 1) - 2 \times \text{
-

 This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence and acting accordingly.

 ### Rule Library

 | Category | Examples |
 |----------|----------|

@@ -48,7 +47,7 @@ Each rule is played 3 times with different random seeds (affecting the initial h

 ### What the LLM Must Do

-On each turn, the model

 The model is free to reason, but it is asked to output a structured response containing:

@@ -69,6 +68,6 @@ Example output
 }
 ```

-This structure lets us analyze not just whether models succeed, but *how* they reason: Do they update hypotheses appropriately when evidence contradicts them? Do they explore strategically or play conservatively? Is their stated confidence calibrated to their actual accuracy? In particular, forcing the model to articulate a tentative rule and a confidence level in it (even if they don't want to guess it yet) allows us to secretly evaluate it nonetheless, which will be useful for measuring calibration and guessing abilities.


 import Sidenote from "../../../components/Sidenote.astro";

+## 1. The Eleusis Benchmark

 ### The Original Game

+In the original Eleusis card game, one player acts as the "dealer" (sometimes called "God" or "Nature") and secretly invents a rule determining which cards can be legally played. The other players (called "scientists") don't know this rule; they must discover it through experimentation.

+Players take turns playing cards from their hand onto a central "mainline." If a card satisfies the secret rule, the dealer accepts it and it is added to the mainline. If it violates the rule, it is rejected and placed in a "sideline" below the mainline at that position. Over time, the pattern of accepted and rejected cards provides evidence about the hidden rule.
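The mainline/sideline bookkeeping described above can be sketched as a small data structure. All names here are illustrative, not the benchmark's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class GameState:
    """Minimal sketch of the board state (names are ours, not the benchmark's API)."""
    mainline: list = field(default_factory=list)   # accepted cards, in play order
    sidelines: dict = field(default_factory=dict)  # mainline position -> rejected cards

    def play(self, card: str, accepted: bool) -> None:
        if accepted:
            self.mainline.append(card)
        else:
            # a rejected card hangs below the current end of the mainline
            self.sidelines.setdefault(len(self.mainline), []).append(card)

state = GameState()
state.play("7H", True)   # accepted -> extends the mainline
state.play("5S", False)  # rejected -> sideline under position 1
state.play("KH", True)
print(state.mainline)    # ['7H', 'KH']
print(state.sidelines)   # {1: ['5S']}
```

Keying the sidelines by mainline position preserves *when* each card was rejected, which is itself evidence about sequence-dependent rules.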

 <Sidenote>
 The name "Eleusis" comes from the ancient Greek mystery cult, where initiates gradually discovered hidden truths.
 </Sidenote>

+At any point, a player can attempt to guess the rule; correctly identifying it ends the game, but a wrong guess incurs a penalty. The game continues until someone correctly identifies the rule. A specific scoring system rewards efficiency in discovering the rule while penalizing reckless guessing.

 ### Our Adaptation

 When correctly guessing the rule, the player scores one point for each turn remaining in the round (a maximum of 30), and each wrong guess deducts a penalty of 2 points:

+$$\text{score} = (30 - \text{turns\_elapsed} + 1) - 2 \times \text{num\_wrong\_guesses}$$

+For instance, a player who correctly identifies the rule on turn 13 with no wrong guesses scores 18 points; one who made 3 wrong guesses along the way scores only 12. If the penalties drive the score to zero or below, the round ends immediately and the final score is recorded as zero (like a scientist who has exhausted their resources).
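The scoring rule can be written as a small helper; this is a sketch under our reading of the formula, and the function and parameter names are ours:

```python
def round_score(turns_elapsed: int, num_wrong_guesses: int,
                max_turns: int = 30, penalty: int = 2) -> int:
    """One point per remaining turn (including the winning one),
    minus a fixed penalty per wrong guess, floored at zero."""
    raw = (max_turns - turns_elapsed + 1) - penalty * num_wrong_guesses
    return max(raw, 0)

print(round_score(13, 0))  # 18
print(round_score(13, 3))  # 12
print(round_score(13, 9))  # 0 -- penalties have exhausted the round
```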

 This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence and acting accordingly.

 ### Rule Library
+In the original game, the dealer invents a secret rule on the spot. For benchmarking LLMs, however, we need a fixed set of rules to ensure comparability across model runs. We created a library of 26 hand-crafted rules spanning a range of types and complexities. Some rules involve simple card properties (e.g., "only red cards"), while others depend on the sequence of previously accepted cards (e.g., "card rank must be higher than the previous card"). A rule may involve ranks, suits, colors, or a combination thereof, and may include positional dependencies.

+Here are some example rules from our library, with a tentative categorization:

 | Category | Examples |
 |----------|----------|

 ### What the LLM Must Do

+On each turn, the model is prompted with the rules of the game and the complete game state: the mainline of accepted cards, the sidelines of rejected cards at each position, its current hand, and its reasoning history from previous turns.

 The model is free to reason, but it is asked to output a structured response containing:

 }
 ```

+**This structure lets us analyze not just whether models succeed, but *how* they reason:** Do they update hypotheses appropriately when evidence contradicts them? Do they explore strategically or play conservatively? Is their stated confidence calibrated to their actual accuracy? In particular, forcing the model to articulate a tentative rule and a confidence level (even if it does not want to commit to a guess yet) allows us to evaluate that hypothesis secretly nonetheless, which is useful for measuring calibration and guessing abilities.
app/src/content/chapters/eleusis/conclusion.mdx
CHANGED
@@ -3,23 +3,12 @@ import Sidenote from "../../../components/Sidenote.astro";

 ## Conclusion

-
-
-Our evaluation of ten LLMs on the Eleusis benchmark reveals several important insights:
-
-1. **LLMs can do inductive reasoning**—but with significant variation. Claude Opus 4.5 leads with 92% success rate and 15.9 average score, while Claude Haiku 4.5 achieves only 70% success and 9.1 average score—a substantial gap on the same benchmark.
-
-2. **Metacognition matters as much as reasoning**. Finding the correct rule is only half the challenge; knowing *when* you've found it is equally important. GPT 5.2 High has the highest success rate (96%) but only ranks third overall because it waits too long to commit—an average of 3.6 turns after finding the answer.
-
-3. **There's a caution-recklessness trade-off**. Models cluster into distinct behavioral styles: cautious achievers (GPT 5.2 High), balanced performers (Claude Opus 4.5), and reckless guessers (Claude Haiku 4.5). The best results come from accurate metacognition, not from either extreme.

-<Sidenote>
-The 7-point gap between best and worst models (15.9 vs 9.1) suggests this benchmark captures meaningful capability differences.
-</Sidenote>

 ### Limitations

@@ -45,16 +34,7 @@ Several directions for future work:

 - **Human comparisons**: Collecting human performance data would provide crucial context for interpreting model capabilities.

-- **

-</Note>
-
-### Final Thoughts
-
-The Eleusis benchmark offers a window into capabilities that matter for real-world scientific reasoning: iterative hypothesis refinement, strategic experimentation, and calibrated confidence. Perhaps most importantly, it reveals the critical role of *metacognition*—the ability to accurately assess one's own knowledge state.
-
-Our results suggest that raw reasoning ability is necessary but not sufficient. GPT 5.2 High can find the answer more often than any other model, yet loses to Claude Opus 4.5 because it doesn't know when to commit. Claude Haiku 4.5 commits readily but often before it should. The winning strategy requires both: strong inductive reasoning *and* accurate self-assessment.
-
-As LLMs are increasingly deployed to assist with scientific research, understanding these limitations becomes crucial. A model that is brilliant at generating hypotheses but doesn't know when to trust them could either lead researchers down unproductive paths (if overconfident) or waste time on unnecessary verification (if overcautious). The Eleusis benchmark provides one lens for evaluating and improving these capabilities—measuring not just what models know, but whether they know what they know.

 ## Conclusion

+The Eleusis benchmark offers a window into capabilities that matter for real-world scientific reasoning: iterative hypothesis refinement, strategic experimentation, and calibrated confidence. Perhaps most importantly, it reveals the critical role of *metacognition*—the ability to accurately assess one's own knowledge state.

+Our results suggest that raw reasoning ability is necessary but not sufficient. GPT 5.2 High can find the answer more often than any other model, yet loses to Claude Opus 4.5 because it doesn't know when to commit. Claude Haiku 4.5 commits readily but often before it should. The winning strategy requires both: strong inductive reasoning *and* accurate self-assessment.

+As LLMs are increasingly deployed to assist with scientific research, understanding these limitations becomes crucial. A model that is brilliant at generating hypotheses but doesn't know when to trust them could either lead researchers down unproductive paths (if overconfident) or waste time on unnecessary verification (if overcautious). The Eleusis benchmark provides one lens for evaluating and improving these capabilities—measuring not just what models know, but whether they know what they know.

 ### Limitations

 - **Human comparisons**: Collecting human performance data would provide crucial context for interpreting model capabilities.

+- **Prompt engineering**: Exploring how different prompt designs affect performance and metacognitive accuracy. In particular, can we compensate for poor calibration or guessing strategies via prompting?

+- **Confirmation bias analysis**: Do models exhibit confirmation bias—preferring to play cards that confirm their current hypothesis rather than cards that could falsify it? Answering this would require an LLM-as-judge analysis.
app/src/content/chapters/eleusis/discussion.mdx
ADDED
|
@@ -0,0 +1,30 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import Note from "../../../components/Note.astro";
|
| 2 |
+
import Sidenote from "../../../components/Sidenote.astro";
|
| 3 |
+
import Accordion from "../../../components/Accordion.astro";
|
| 4 |
+
|
| 5 |
+
## Discussion
|
| 6 |
+
|
| 7 |
+
### Inductive abilities & metacognition
|
| 8 |
+
|
| 9 |
+
TODO: summarize main findings about the fact that performance depends on both inductive reasoning and metacognitive calibration.
|
| 10 |
+
|
| 11 |
+
Primary factor: inductive reasoning and carefully choosing the next experiment.
|
| 12 |
+
|
| 13 |
+
Beyond that, different scientific personalities, which lie on a different axis than raw reasoning ability, play a crucial role in performance.
|
| 14 |
+
|
| 15 |
+
TODO: refine this
|
| 16 |
+
- The Perfectionist (GPT 5.2 High): needs too much evidence
|
| 17 |
+
- The Balanced (Gemini 3 Flash Preview Low): a good tradeoff, with poor calibration compensated by caution, though not the best at inductive reasoning
|
| 18 |
+
- The Pragmatist (Claude Opus 4.5): good-enough is good enough
|
| 19 |
+
- The Gambler (Claude Haiku 4.5): acts on insufficient evidence
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
### Open vs Closed Models
|
| 23 |
+
|
| 24 |
+
A notable finding is the competitive performance of open-weight models. Kimi K2, available with open weights, achieves the second-highest score (16.2) and outperforms several proprietary models including GPT 5.2. DeepSeek R1 scores 13.3 and the open-weight GPT OSS 120B also performs respectably at 12.9.
|
| 25 |
+
|
| 26 |
+
However, open models all tend toward more aggressive guessing strategies driven by poor calibration, leading to lower overall scores despite reasonable inductive abilities. This suggests that while open models can match proprietary ones in raw reasoning, they may lack the nuanced metacognitive skills needed for optimal performance in this benchmark.
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
|
app/src/content/chapters/eleusis/introduction.mdx
CHANGED
|
@@ -3,34 +3,34 @@ import Image from "../../../components/Image.astro";
|
|
| 3 |
|
| 4 |
import exampleSequence from "../../assets/image/example_sequence.png";
|
| 5 |
|
| 6 |
-
Large language models are increasingly being deployed as tools for scientific research
|
| 7 |
|
| 8 |
<Sidenote>
|
| 9 |
Read time: 15–20 minutes.
|
| 10 |
</Sidenote>
|
| 11 |
|
| 12 |
-
Most reasoning benchmarks test whether models can solve well-defined problems: given premises, derive a conclusion. The ARC challenge, for instance, evaluates inductive reasoning on visual patterns. These benchmarks capture important capabilities, but they miss something fundamental about how science actually works.
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
We wanted to test whether LLMs can exhibit these deeper aspects of scientific reasoning. To do this, we turned to an unlikely source: a 1950s card game called Eleusis.
|
| 19 |
|
| 20 |
## The Eleusis Game
|
| 21 |
|
| 22 |
-
Eleusis was designed by
|
| 23 |
|
| 24 |
**Eleusis is a microcosm of the scientific method:** the rule is a hidden law of nature, each card play is an experiment, and the sequence of accepted and rejected cards is the accumulating evidence.
|
| 25 |
|
| 26 |
<Image
|
| 27 |
src={exampleSequence}
|
| 28 |
alt="Example Eleusis game sequence with the secret rule 'alternating colors': mainline shows 5♠, J♥, J♠, A♦, 6♣ following the pattern, while the sideline below shows rejected cards 10♣ after J♠, and Q♥ and 2♦ after A♦"
|
| 29 |
-
caption="An example Eleusis game
|
| 30 |
id="fig-example-sequence"
|
| 31 |
preserveColors
|
| 32 |
/>
|
| 33 |
|
| 34 |
We built a benchmark around Eleusis to evaluate LLMs on this iterative, hypothesis-driven reasoning. Rather than testing knowledge retrieval or instruction-following, our benchmark asks: *can models act like scientists?* Can they observe evidence, form hypotheses, design informative experiments, and refine their theories? Can they calibrate their confidence appropriately and know when they've gathered enough evidence to commit to a conclusion?
|
| 35 |
|
| 36 |
-
These skills are fundamental not just to science, but to debugging code,
|
|
|
|
| 3 |
|
| 4 |
import exampleSequence from "../../assets/image/example_sequence.png";
|
| 5 |
|
| 6 |
+
Large language models are increasingly being deployed as tools for scientific research: analyzing data, generating hypotheses, and even designing experiments. But how well do they actually embody the scientific method?
|
| 7 |
|
| 8 |
<Sidenote>
|
| 9 |
Read time: 15–20 minutes.
|
| 10 |
</Sidenote>
|
| 11 |
|
| 12 |
+
Most reasoning benchmarks test whether models can solve well-defined problems: given premises, derive a conclusion. The ARC challenge [@chollet2019measure], for instance, evaluates inductive reasoning on visual patterns. **These benchmarks capture important capabilities, but they miss something fundamental about how science actually works.**
|
| 13 |
|
| 14 |
+
First, real scientific reasoning is not a single inference step. It's an iterative agentic process of observation, hypothesis formation, experimentation, and refinement, often spanning many cycles before reaching a conclusion. It requires not just logical ability, but also *strategic thinking*: which experiment to run next, how much evidence is enough, when to commit to a theory versus when to keep exploring.
|
| 15 |
|
| 16 |
+
Also, beyond pure reasoning, effective science depends on psychological factors that are rarely evaluated: **calibration** (does my confidence match my actual accuracy?) [@lichtenstein1977calibration], **metacognition** (how certain am I about my uncertainty?) [@flavell1979metacognition], and resistance to **cognitive biases** like confirmation bias (seeking only evidence that supports my current hypothesis instead of trying to challenge it) [@nickerson1998confirmation]. A scientist who is brilliant at deduction but overconfident in weak theories will waste resources pursuing dead ends. One who is well-calibrated but overly cautious may never publish.
|
| 17 |
|
| 18 |
We wanted to test whether LLMs can exhibit these deeper aspects of scientific reasoning. To do this, we turned to an unlikely source: a 1950s card game called Eleusis.
|
| 19 |
|
| 20 |
## The Eleusis Game
|
| 21 |
|
| 22 |
+
Eleusis was designed by @abbott1977eleusis explicitly to simulate the process of scientific discovery. In the original game, one player invents a secret rule governing which cards can be played, and other players must deduce the rule through experimentation, by playing cards and observing whether they are accepted or rejected.
|
| 23 |
|
| 24 |
**Eleusis is a microcosm of the scientific method:** the rule is a hidden law of nature, each card play is an experiment, and the sequence of accepted and rejected cards is the accumulating evidence.
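Formally, a secret rule is just a predicate over the growing sequence of accepted cards. Below is a minimal sketch of an alternating-colors rule like the one illustrated below; the `(rank, suit)` card encoding and the function names are our own assumptions, not the benchmark's actual implementation.

```python
# Minimal sketch (assumed (rank, suit) card encoding, ranks 1-13).
RED_SUITS = {"hearts", "diamonds"}

def color(card):
    """Return 'red' or 'black' for a (rank, suit) card."""
    return "red" if card[1] in RED_SUITS else "black"

def alternating_colors(mainline, candidate):
    """Secret rule: each accepted card must differ in color from the last accepted one."""
    return not mainline or color(candidate) != color(mainline[-1])

def play(rule, mainline, candidate):
    """One 'experiment': the dealer accepts the card onto the mainline or rejects it."""
    if rule(mainline, candidate):
        mainline.append(candidate)
        return "accepted"
    return "rejected"
```

Replaying part of the figure's sequence with this predicate: J♥ is accepted after 5♠, 10♣ is rejected after J♠ (black after black), and A♦ is then accepted.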
|
| 25 |
|
| 26 |
<Image
|
| 27 |
src={exampleSequence}
|
| 28 |
alt="Example Eleusis game sequence with the secret rule 'alternating colors': mainline shows 5♠, J♥, J♠, A♦, 6♣ following the pattern, while the sideline below shows rejected cards 10♣ after J♠, and Q♥ and 2♦ after A♦"
|
| 29 |
+
caption="An example Eleusis game. The secret rule here is 'colors must alternate'. The main line (top) shows the sequence of accepted cards: 5♠ → J♥ → J♠ → A♦ → 6♣, alternating between black and red. The sideline (bottom) shows cards that were tried but rejected because they violate the rule, for instance 10♣ after J♠, or Q♥ and 2♦ after A♦."
|
| 30 |
id="fig-example-sequence"
|
| 31 |
preserveColors
|
| 32 |
/>
|
| 33 |
|
| 34 |
We built a benchmark around Eleusis to evaluate LLMs on this iterative, hypothesis-driven reasoning. Rather than testing knowledge retrieval or instruction-following, our benchmark asks: *can models act like scientists?* Can they observe evidence, form hypotheses, design informative experiments, and refine their theories? Can they calibrate their confidence appropriately and know when they've gathered enough evidence to commit to a conclusion?
|
| 35 |
|
| 36 |
+
These skills are fundamental not just to science, but to debugging code, medical diagnosis, and everyday reasoning under uncertainty.
|
app/src/content/chapters/eleusis/results.mdx
CHANGED
|
@@ -4,7 +4,7 @@ import Note from "../../../components/Note.astro";
|
|
| 4 |
import Sidenote from "../../../components/Sidenote.astro";
|
| 5 |
import HtmlEmbed from "../../../components/HtmlEmbed.astro";
|
| 6 |
|
| 7 |
-
## Results
|
| 8 |
|
| 9 |
### Overall Performance
|
| 10 |
|
|
@@ -17,44 +17,50 @@ We evaluated ten models on the Eleusis benchmark, including both proprietary and
|
|
| 17 |
wide
|
| 18 |
/>
|
| 19 |
|
| 20 |
-
Performance varies dramatically among tested models.
|
| 21 |
|
| 22 |
-
|
| 23 |
|
| 24 |
-
|
| 25 |
|
| 26 |
-
|
| 27 |
|
| 28 |
-
|
| 29 |
|
| 30 |
### Pure discovery versus metacognition
|
| 31 |
|
| 32 |
-
|
|
|
|
|
|
|
| 33 |
|
| 34 |
<HtmlEmbed
|
| 35 |
src="score-stack.html"
|
| 36 |
-
caption="<strong>Figure 2:</strong> Score breakdown under alternative scoring systems. Blue shows raw score (standard scoring)
|
| 37 |
id="fig-score-stack"
|
| 38 |
wide
|
| 39 |
/>
|
| 40 |
|
| 41 |
-
Even if using this alternative scoring does not change a lot the relative ranking of models, it reveals important differences in their behavior.
|
|
| 42 |
|
| 43 |
-
They might be two
|
| 44 |
1. The model is reckless and makes a lot of wrong guesses, incurring penalties.
|
| 45 |
2. The model is too cautious and waits too long before guessing, missing out on points.
|
| 46 |
|
| 47 |
We analyze these two aspects in more details below.
|
| 48 |
|
| 49 |
-
|
| 50 |
### The Caution-Recklessness Trade-off
|
| 51 |
|
| 52 |
-
To estimate how reckless or cautious a model is, we can compute the average number of failed guesses per round (recklessness). It directly relates to how many points a model loses due to wrong guesses.
|
| 53 |
|
| 54 |
-
To estimate caution, we can compute on average how many turns a model waits while having the correct tentative rule before actually guessing it. This relates to how many points a model loses by waiting too long to commit.
|
| 55 |
|
| 56 |
<Sidenote>
|
| 57 |
-
This trade-off mirrors a fundamental tension in science: being overconfident too early might risk false positives, leading to wasted resources and reputational damage; being overly cautious can delay discoveries and allow others to scoop you. Scientists must balance the risk of trying to publish too early and risk being wrong, wait too long and lose priority (or in our case, points).
|
| 58 |
</Sidenote>
|
| 59 |
|
| 60 |
<HtmlEmbed
|
|
@@ -63,32 +69,33 @@ To estimate caution, we can compute on average how many turns a model waits whil
|
|
| 63 |
id="fig-caution-reckless"
|
| 64 |
/>
|
| 65 |
|
|
|
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
On the other hand, GPT 5.2 High has a singular behavior with very few failed guesses (0.28 per round) but a high caution (waiting 3.5 turns on average before guessing when it has the correct rule). Gemini 3 Flash Preview Low and GPT 5 Mini Medium are intermediate in both dimensions, Gemini achieving a better balance with on average 2 points lost due to recklessness and 2 points lost due to caution.
|
| 70 |
|
| 71 |
To try to understand deeper the causes of recklessness and caution, we now turn to an analysis of confidence and guessing strategies.
|
| 72 |
|
| 73 |
### Confidence and Calibration
|
| 74 |
|
| 75 |
-
Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally
|
| 76 |
|
| 77 |
<HtmlEmbed
|
| 78 |
src="calibration-curves.html"
|
| 79 |
-
caption="<strong>Figure 4:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence: they correspond to confidence levels where actual success rates are lower than reported. Click legend items to show/hide models."
|
| 80 |
id="fig-calibration"
|
| 81 |
/>
|
| 82 |
|
| 83 |
The calibration analysis reveals several patterns:
|
| 84 |
|
| 85 |
-
- **All models are overconfident** : for instance when they report 80% confidence, their actual success rates are often closer to 20% !
|
| 86 |
-
- GPT 5.2 is the best calibrated model overall.
|
| 87 |
- Even models with a strong performance like Claude Opus 4.5 and Kimi K2 show significant overconfidence.
|
| 88 |
|
| 89 |
-
Is overconfidence a problem ?
|
| 90 |
|
| 91 |
-
For a perfectly calibrated model, as the expected loss for a failed guess is twice the expected opportunity cost of waiting one turn, the optimal confidence threshold for guessing is 0.67 (i.e., guess when you believe your tentative rule has at least a 67% chance of being correct). But do model follow such a strategy ?
|
| 92 |
|
| 93 |
|
| 94 |
<HtmlEmbed
|
|
@@ -97,22 +104,20 @@ For a perfectly calibrated model, as the expected loss for a failed guess is twi
|
|
| 97 |
id="fig-confidence"
|
| 98 |
/>
|
| 99 |
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
We can see that models on average are more cautious than the optimal decision-theoretic strategy for a perfectly calibrated model, which would guess as soon as confidence exceeds 67%. THis is somehow a good thing, given that all models are overconfident. By raising the bar for guessing, they reduce the risk of wrong guesses and compensate for their poor calibration.
|
| 103 |
|
| 104 |
-
|
| 105 |
|
| 106 |
-
|
| 107 |
|
|
|
|
| 108 |
|
| 109 |
-
### Performance by Rule
|
| 110 |
|
| 111 |
-
|
| 112 |
|
| 113 |
-
|
| 114 |
|
| 115 |
-
The following figure breaks down performance by rule across all models and runs.
|
| 116 |
|
| 117 |
<HtmlEmbed
|
| 118 |
src="by-rule.html"
|
|
@@ -121,9 +126,15 @@ The following figure breaks down performance by rule across all models and runs.
|
|
| 121 |
wide
|
| 122 |
/>
|
| 123 |
|
| 124 |
-
|
| 125 |
|
| 126 |
-
The
|
| 127 |
|
| 128 |
<HtmlEmbed
|
| 129 |
src="complexity-analysis.html"
|
|
@@ -131,6 +142,21 @@ The following plot breaks down the relative score of each model (as measured by
|
|
| 131 |
id="fig-complexity"
|
| 132 |
/>
|
| 133 |
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
|
| 4 |
import Sidenote from "../../../components/Sidenote.astro";
|
| 5 |
import HtmlEmbed from "../../../components/HtmlEmbed.astro";
|
| 6 |
|
| 7 |
+
## 2. Results
|
| 8 |
|
| 9 |
### Overall Performance
|
| 10 |
|
|
|
|
| 17 |
wide
|
| 18 |
/>
|
| 19 |
|
| 20 |
+
Performance varies dramatically among tested models.
|
| 21 |
|
| 22 |
+
* **Claude Opus 4.5** achieves top performance with 17.0 score and moderate token usage. The open-weight model **Kimi K2 Thinking** comes second at 16.2 and performs competitively with the best proprietary models (outperforming GPT 5.2 High and being close to Claude Opus 4.5), but at the price of a 2.5× larger reasoning budget.
|
| 23 |
|
| 24 |
+
* **GPT 5.2 High** and **Grok 4.1 Fast Reasoning** show similar performance around 15, but GPT 5.2 High is three times more token-efficient.
|
| 25 |
|
| 26 |
+
* **GPT-5-Mini**, **GPT OSS-120B** and **Gemini 3 Flash Preview Low** cluster in the mid-tier (around 13) with low token usage, while DeepSeek R1, an open-weight model specialized for reasoning tasks, achieves a similar score with a much larger token count.
|
| 27 |
|
| 28 |
+
* Finally, **GPT-OSS 20B** and **Claude Haiku 4.5** lag behind, scoring between 11 and 12 with moderate token usage.
|
| 29 |
+
|
| 30 |
+
As mentioned above, this score reflects not only a model's pure ability to find the correct rule, but also its metacognitive skills: knowing when to commit, how confident to be, and how to balance exploration vs. exploitation. To distinguish these factors, we also computed an alternative "no-stakes" score that removes penalties for wrong guesses and counts tentative rules as guesses.
|
| 31 |
|
| 32 |
### Pure discovery versus metacognition
|
| 33 |
|
| 34 |
+
We use the same game data but apply a different scoring system that reflects the pure ability to discover the rule, without the metacognitive aspect of knowing when to commit. **In this "no stakes" scenario, guessing is free and systematic**: at each turn, if the model has the correct tentative rule, it is considered to have guessed it correctly (even if it didn't formally attempt to guess); if the tentative rule is incorrect, it is considered a wrong guess, but without penalty.
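Under this definition, a round's gain over the raw score can be reconstructed from two quantities: how many wrong-guess penalties get refunded, and how many turns of delay get removed. A minimal sketch, assuming hypothetical per-round statistics and the scoring costs used here (2 points per wrong guess, 1 point per turn of delay):

```python
def no_stakes_gain(wrong_guesses, first_correct_turn, commit_turn,
                   penalty=2, cost_per_turn=1):
    """Extra points a round would yield under free, systematic guessing:
    refund all wrong-guess penalties, and remove the delay between the first
    turn the tentative rule was correct and the turn the model committed."""
    refund = penalty * wrong_guesses
    delay = cost_per_turn * max(0, commit_turn - first_correct_turn)
    return refund + delay
```

For example, a round with two failed guesses and a three-turn delay before committing would score 7 points higher under the no-stakes scoring.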
|
| 35 |
+
|
| 36 |
+
The following chart shows the initial score of each model, and the (higher) score it would have achieved under the "no stakes" scenario. This allows us to isolate pure rule-discovery ability from metacognitive skills.
|
| 37 |
|
| 38 |
<HtmlEmbed
|
| 39 |
src="score-stack.html"
|
| 40 |
+
caption="<strong>Figure 2:</strong> Score breakdown under alternative scoring systems. Blue shows raw score (standard scoring), while green shows no-stakes gain (additional gain from systematic guessing and removing wrong-guess penalties). Models sorted by total no-stakes score."
|
| 41 |
id="fig-score-stack"
|
| 42 |
wide
|
| 43 |
/>
|
| 44 |
|
| 45 |
+
Even if this alternative scoring does not change the relative ranking of models much, it reveals important differences in their behavior.
|
| 46 |
+
|
| 47 |
+
* GPT 5.2 High and Claude Haiku 4.5 are the two models with the largest difference between raw and no-stakes scores (more than 4), suggesting they are the most penalized by wrong guesses or delayed guessing.
|
| 48 |
+
* On the other hand, Gemini 3 Flash Preview Low and Kimi K2 have the smallest difference (less than 3) and benefit the least from this alternative scoring, indicating a better balance between discovery and metacognition.
|
| 49 |
|
| 50 |
+
There might be two reasons for the difference between the raw and no-stakes scores:
|
| 51 |
1. The model is reckless and makes a lot of wrong guesses, incurring penalties.
|
| 52 |
2. The model is too cautious and waits too long before guessing, missing out on points.
|
| 53 |
|
| 54 |
We analyze these two aspects in more details below.
|
| 55 |
|
|
|
|
| 56 |
### The Caution-Recklessness Trade-off
|
| 57 |
|
| 58 |
+
To estimate how reckless or cautious a model is, we can compute **the average number of failed guesses per round** (recklessness). It directly relates to how many points a model loses due to wrong guesses.
|
| 59 |
|
| 60 |
+
To estimate caution, we can compute on average **how many turns a model waits while having the correct tentative rule before actually guessing it**. This relates to how many points a model loses by waiting too long to commit.
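Both metrics can be computed directly from per-round logs. A minimal sketch, with hypothetical record fields (`failed_guesses`, `first_correct_turn`, `commit_turn`; the field names are our own, not the benchmark's schema):

```python
def recklessness(rounds):
    """Average number of failed guesses per round."""
    return sum(r["failed_guesses"] for r in rounds) / len(rounds)

def caution(rounds):
    """Average turns spent holding the correct tentative rule before committing,
    over rounds where the correct rule was reached at all."""
    waits = [r["commit_turn"] - r["first_correct_turn"]
             for r in rounds if r["first_correct_turn"] is not None]
    return sum(waits) / len(waits) if waits else 0.0
```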
|
| 61 |
|
| 62 |
<Sidenote>
|
| 63 |
+
This trade-off mirrors a fundamental tension in science: committing too early risks false positives, wasted resources, and reputational damage; being overly cautious can delay discoveries and allow others to scoop you. Scientists must balance publishing too early and being wrong against waiting too long and losing priority (or, in our case, points).
|
| 64 |
</Sidenote>
|
| 65 |
|
| 66 |
<HtmlEmbed
|
|
|
|
| 69 |
id="fig-caution-reckless"
|
| 70 |
/>
|
| 71 |
|
| 72 |
+
How should we interpret these values? Knowing that a failed guess costs 2 points while each turn of delay costs 1 point, the optimal number of failed guesses per round should be around 0.5 (i.e., one failed guess every two rounds) to balance the two sources of loss. Most models are above that threshold, indicating **a clear tendency towards recklessness**. This is confirmed by their low caution values (most models wait around one turn or less on average before guessing once they have the correct rule).
|
| 73 |
|
| 74 |
+
On the other hand, **GPT 5.2 High has a singular behavior**, with very few failed guesses (0.28 per round) but high caution (waiting 3.5 turns on average before guessing when it has the correct rule). Gemini 3 Flash Preview Low and GPT 5 Mini Medium are intermediate in both dimensions, with Gemini achieving a better balance: on average 2 points lost to caution and 2 points lost to recklessness (one failed guess per round on average).
|
|
|
|
|
|
|
| 75 |
|
| 76 |
To understand the causes of recklessness and caution more deeply, we now turn to an analysis of confidence and guessing strategies.
|
| 77 |
|
| 78 |
### Confidence and Calibration
|
| 79 |
|
| 80 |
+
Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempt to do so. **This allows us to evaluate calibration: does reported confidence match actual accuracy?** This is particularly relevant as modern neural networks have been shown to be poorly calibrated [@guo2017calibration].
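The calibration curve itself is straightforward to compute from (confidence, correctness) pairs. A minimal sketch, assuming hypothetical observation tuples on the 0–10 confidence scale described above:

```python
from collections import defaultdict

def calibration_curve(observations):
    """observations: (reported_confidence, was_correct) pairs.
    For each confidence level >= 5, return the empirical success rate,
    to be compared with the stated probability (confidence / 10)."""
    buckets = defaultdict(list)
    for conf, correct in observations:
        if conf >= 5:
            buckets[conf].append(correct)
    return {conf: sum(hits) / len(hits) for conf, hits in sorted(buckets.items())}
```

A model reporting confidence 8 on five turns but succeeding only once would show an empirical success rate of 0.2 against a stated 0.8, the kind of gap visible in Figure 4.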
|
| 81 |
|
| 82 |
<HtmlEmbed
|
| 83 |
src="calibration-curves.html"
|
| 84 |
+
caption="<strong>Figure 4:</strong> Calibration curves for each model (for reported confidence ≥5). A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence: they correspond to confidence levels where actual success rates are lower than reported. Click legend items to show/hide models."
|
| 85 |
id="fig-calibration"
|
| 86 |
/>
|
| 87 |
|
| 88 |
The calibration analysis reveals several patterns:
|
| 89 |
|
| 90 |
+
- **All models are very overconfident**: for instance, when they report 80% confidence, their actual success rates are often closer to 20%!
|
| 91 |
+
- GPT 5.2 is the best calibrated model overall, being the closest to the diagonal line, although it is still slightly overconfident.
|
| 92 |
- Even models with a strong performance like Claude Opus 4.5 and Kimi K2 show significant overconfidence.
|
| 93 |
|
| 94 |
+
Is overconfidence a problem? In our setting, not necessarily; it depends on how the model decides to act on it.
|
| 95 |
|
| 96 |
+
**For a perfectly calibrated model**, as the expected loss for a failed guess is twice the expected opportunity cost of waiting one turn, **the optimal confidence threshold for guessing is 0.67** (i.e., guess when you believe your tentative rule has at least a 67% chance of being correct). But do models follow such a strategy?
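One way to recover this threshold: with confidence p, guessing now has an expected penalty of 2(1 − p), while waiting one more turn incurs the 1-point opportunity cost only in the case where the tentative rule is in fact correct (probability p). Guessing is worthwhile when

$$2(1-p) \le 1 \cdot p \iff p \ge \tfrac{2}{3} \approx 0.67$$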
|
| 97 |
+
|
| 98 |
+
To answer this, we can look at how often models guess at each reported confidence level. For each confidence level (from 5 to 10), the following figure shows the guess rate: the fraction of turns on which the model actually attempts a guess when reporting that confidence.
|
| 99 |
|
| 100 |
|
| 101 |
<HtmlEmbed
|
|
|
|
| 104 |
id="fig-confidence"
|
| 105 |
/>
|
| 106 |
|
| 107 |
+
Once again, we observe significant differences from one model to another. Grok 4.1 and Gemini 3 essentially only guess when very confident (9 or 10). Most other models also guess mostly at confidence levels of 8 and above, and rarely below. The two Claude models show different behaviors: Claude Opus 4.5 tends to guess more aggressively at confidence level 8, while Claude Haiku 4.5 often guesses even at confidence level 7.
|
|
|
|
|
|
|
| 108 |
|
| 109 |
+
We can see that **models on average are more cautious than the optimal decision-theoretic strategy** for a perfectly calibrated model, which would guess as soon as confidence exceeds 67%. This is, in a way, a good thing for them, given that all models are overconfident. **By raising the threshold for guessing, they reduce the risk of wrong guesses and compensate for their poor calibration.**
|
| 110 |
|
| 111 |
+
This is particularly true for Gemini 3 Flash Preview Low, which is very cautious, guessing only a third of the time even at reported confidence 9! This compensates for its overconfidence, which is probably what helps it achieve a good balance between failed guesses and opportunity cost. It shows up in our "no-stakes" analysis as well: Gemini is the model with the smallest difference between raw and no-stakes scores.
|
| 112 |
|
| 113 |
+
The case of GPT 5.2 High is different: it is both fairly well calibrated and very cautious, leading to very few failed guesses but a high opportunity cost due to delayed guessing. This suggests that GPT 5.2 High could improve its performance by being more aggressive in guessing once it has a correct tentative rule, especially at confidence level 8.
|
| 114 |
|
|
|
|
| 115 |
|
| 116 |
+
### Performance by Rule Complexity
|
| 117 |
|
| 118 |
+
Not all rules are created equal. Some rules are discovered quickly by all models (e.g. *"all cards must be red"*) while others prove consistently challenging (e.g. *"increase rank after a red card, decrease after a black"*).
|
| 119 |
|
| 120 |
+
The following figure breaks down performance by rule across all models and runs, displaying the average success rate per rule on the left (how often the rule was found), and individual run scores as colored dots for each model on the right.
|
| 121 |
|
| 122 |
<HtmlEmbed
|
| 123 |
src="by-rule.html"
|
|
|
|
| 126 |
wide
|
| 127 |
/>
|
| 128 |
|
| 129 |
+
It confirms that some rules are consistently easy, with low variance in score across models, while others are hard for all models. To analyse this, we need a way to quantify rule complexity. This is not straightforward since it depends on multiple factors: the inherent logical complexity of the rule, how familiar the concept is to models, and how much evidence is needed to distinguish it from alternatives.
|
| 130 |
+
|
| 131 |
+
We created a crude complexity score for each rule based on the complexity of its code implementation, as measured by *cyclomatic complexity* and *Abstract Syntax Tree node count*. We combine these two metrics into a single indicator:
|
| 132 |
+
|
| 133 |
+
$$\text{cyclomatic\_complexity} + 0.14 * \text{node\_count}$$
|
| 134 |
|
| 135 |
+
The coefficient 0.14 was chosen to maximize the correlation with average success rate across models; the achieved correlation is −0.66. As expected, more complex rules tend to have lower success rates, which validates our complexity metric as a useful proxy for rule difficulty, despite its limitations.
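As a rough sketch of how such an indicator can be computed, assuming rules are implemented as small Python functions; here cyclomatic complexity is approximated by counting branching AST nodes (a common simplification) rather than by the exact McCabe definition:

```python
import ast

# Node types that create extra execution paths (rough cyclomatic proxy).
BRANCH_NODES = (ast.If, ast.IfExp, ast.For, ast.While, ast.BoolOp, ast.comprehension)

def rule_complexity(rule_source, weight=0.14):
    """Approximate cyclomatic_complexity + 0.14 * AST node count for a rule."""
    tree = ast.parse(rule_source)
    nodes = list(ast.walk(tree))
    cyclomatic = 1 + sum(isinstance(n, BRANCH_NODES) for n in nodes)
    return cyclomatic + weight * len(nodes)
```

A branch-free rule such as "only red cards" scores lower than a conditional rule such as "increase rank after a red card, decrease after a black", matching the intuition behind the metric.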
|
| 136 |
+
|
| 137 |
+
The following plot breaks down the success rate of each model per complexity quartile.
|
| 138 |
|
| 139 |
<HtmlEmbed
|
| 140 |
src="complexity-analysis.html"
|
|
|
|
| 142 |
id="fig-complexity"
|
| 143 |
/>
|
| 144 |
|
| 145 |
+
|
| 146 |
+
Interestingly, code complexity (as measured by our combination of cyclomatic complexity and AST node count) doesn't perfectly predict difficulty, as semantic concepts also play a role. For instance, a rule like "only face cards" has a complexity equivalent to "only A, 2 and 3", but the former is easier for models (and humans!) due to familiarity with the semantic category of face cards.
|
| 147 |
+
|
| 148 |
+
Rules involving rare events (low acceptance rate) are also harder: "only aces" is harder than "only even ranks" despite being simpler, because models need more evidence to confirm it.
|
| 149 |
+
|
| 150 |
+
An interesting test: are symmetric rules equally difficult? For example, "only spades" vs "only non-spades" should be logically equivalent in difficulty, but models might have biases.
|
| 151 |
+
For instance, the average score on "only spades" is 25, but on "no spades" it is only 20.
|
| 152 |
+
|
| 153 |
+
### Complexity of rules produced
|
| 154 |
+
|
| 155 |
+
#### Overly Complex Rules
|
| 156 |
+
Failure mode: models have a tendency to produce overcomplicated rules, even though they were informed that the rule is typically one sentence. They can produce tentative rules like "...".
|
| 157 |
+
|
| 158 |
+
TODO : Backup this with examples from logs and "guess complexity" vs "actual complexity".
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
#### Overfitting Rules
|
| 162 |
+
We have observed qualitative evidence of models producing overfit rules that explain all observations so far but fail to generalize. For instance, if all accepted cards so far are red and happen to be number cards (simply because no red face card has been tried yet), the model may hypothesize "only red number cards" rather than the simpler "only red cards."
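This failure mode is easy to reproduce in miniature. In the following sketch (hypothetical `(rank, suit)` card encoding, with ranks 11–13 as face cards), both the true rule and the overfit hypothesis explain every accepted card so far; only an untried red face card separates them:

```python
RED = {"hearts", "diamonds"}

def only_red(card):
    """The true, simple rule."""
    return card[1] in RED

def only_red_numbers(card):
    """An overfit hypothesis: adds a condition no observation has yet tested."""
    return card[1] in RED and card[0] <= 10
```

With accepted evidence 3♥, 7♦, 9♥, both predicates agree; probing with J♥ is the falsifying experiment an ideal scientist would run.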
|