dlouapre (HF Staff) committed
Commit e0197ee · 1 Parent(s): 4343500

improving content
ASSESSMENT.md DELETED
@@ -1,291 +0,0 @@
- # Critical Assessment: Eleusis Benchmark Article
-
- ## Executive Summary
-
- The article presents an interesting benchmark with solid methodology and rich data. The main structural issue is that the **Results section tells a fragmented story about guessing behavior**, spreading related insights across 6+ subsections without a clear narrative arc. The key message—that metacognition matters and models have distinct "scientific personalities"—gets lost in the noise.
-
- Additionally, there are **data consistency issues** between the text and the underlying data files that need resolution before publication.
-
- ---
-
- ## 1. Critical Issues
-
- ### 1.1 Data Inconsistencies
-
- The numbers in the text don't match `summary.txt`. For example:
-
- | Metric | In Text | In summary.txt |
- |--------|---------|----------------|
- | Claude Opus 4.5 avg score | 15.88 (CLAUDE.md) | 14.46 |
- | Kimi K2 avg score | 14.53 (CLAUDE.md) | 10.31 |
- | GPT 5.2 High rank | "third place" | Actually 1st by avg_score (14.85) |
-
- **Action needed:** Audit all numbers in the text against the latest data files.
-
- ### 1.2 Results Section: Scattered Narrative
-
- The guessing behavior story is currently spread across:
-
- 1. "Confidence and Calibration" - calibration curves, confidence distribution
- 2. "Guessing Strategy" - score vs failed guesses
- 3. "The Caution-Recklessness Trade-off" - early correct turns, caution scatter
- 4. "Alternative Scoring Systems" - score stack breakdown
- 5. "Analysis of the reckless guessing behavior" - double-down rate
-
- These all address the same fundamental question: **How do models decide when to commit?** But the current structure forces readers to piece together the story themselves.
-
- **Problem:** A reader finishing the Results section doesn't have a clear mental model of "what makes some models better than others."
-
- ---
-
- ## 2. Suggested Restructuring
-
- ### Option A: Reorganize Around the Key Insight
-
- **Proposed Results structure:**
-
- ```
- ## Results
-
- ### Overall Performance (keep as-is)
- Brief overview, scatter plot of score vs tokens
-
- ### Finding the Rule: Who Gets It Right?
- - Success rates by model
- - Performance by rule complexity
- - Brief: what capabilities matter for finding rules
-
- ### Knowing When You Know: The Metacognition Challenge
- [This is the heart of the article - elevate it]
- - The caution-recklessness trade-off (central framing)
- - Caution analysis: early correct turns, GPT 5.2 waits too long
- - Recklessness analysis: failed guesses, double-down rates
- - The scatter plot showing the trade-off (Figure 6)
- - Why Claude Opus wins: good enough at finding + great at timing
-
- ### Confidence and Calibration
- - Calibration curves (all models overconfident)
- - Confidence distribution when guessing
- - Brief: why calibration enables good timing decisions
-
- ### Alternative Scoring: Robustness Check
- - Score stack shows the penalty different behaviors pay
- - Confirms that metacognition, not just rule-finding, drives scores
- ```
-
- **Benefits:**
- - The key message (metacognition matters) becomes structurally prominent
- - Reader builds understanding progressively: first "can they solve it?", then "do they know when they've solved it?"
- - Eliminates the feeling of "lots of charts, hard to synthesize"
-
- ### Option B: Two-Act Structure
-
- ```
- ## Results
-
- ### Act 1: The Leaderboard (compact)
- - Overall performance scatter
- - Success rates
- - One paragraph summary: "Models vary from 70% to 96% success rate..."
-
- ### Act 2: The Real Story—Scientific Temperaments
- [Frame models as having distinct "personalities"]
-
- The Cautious Achiever: GPT 5.2 High
- - Highest success rate, but 3rd in score
- - Figure: excess caution distribution
- - Lost ~3.6 points per round to over-caution
-
- The Balanced Scientist: Claude Opus 4.5
- - Not the best at finding rules, but best at knowing when
- - Commits quickly, accepts occasional wrong guesses
-
- The Reckless Guesser: Claude Haiku 4.5 / DeepSeek R1
- - Commits before sufficient evidence
- - Double-down behavior after failures
-
- Visualizing the Trade-off
- - Caution vs recklessness scatter (the key figure)
- - Score stack showing what each "personality" costs
-
- ### Calibration: Why Timing Is Hard
- - Overconfidence makes timing decisions unreliable
- - Even well-performing models poorly calibrated
- ```
-
- **Benefits:**
- - Memorable framing (scientific personalities)
- - Natural story arc
- - Each model type is clearly characterized
-
- ---
-
- ## 3. Missing Content
-
- ### 3.1 Figures Marked as TODO
-
- - **Learning curves figure** (analysis.mdx:22) - Would show within-round dynamics
- - **Failure mode distribution** (analysis.mdx:55) - Stacked bar by model
-
- **Recommendation:** The learning curves figure would be valuable if you have the data. The failure mode classification might be hard to automate reliably—consider whether a few qualitative examples serve the purpose better.
-
- ### 3.2 Human Baseline
-
- Mentioned in limitations but this is a significant gap. Without human performance, readers can't judge if 92% success is impressive or trivial.
-
- **Options:**
- - Run a small human study (even N=5 would help)
- - Cite related work on human performance in similar inductive reasoning tasks
- - Frame it explicitly as "relative comparison between models" not absolute capability assessment
-
- ### 3.3 Example Turn Figure
-
- benchmark.mdx shows the JSON output format but doesn't illustrate what a complete turn looks like in context (game state → reasoning → decision).
-
- **Recommendation:** Add a figure showing:
- ```
- [Current board state visualization]
- [Model reasoning excerpt]
- [Decision: play 4♣, confidence 6, don't guess yet]
- [Outcome: accepted/rejected]
- ```
-
- This makes the task concrete for readers.
-
- ---
-
- ## 4. The "Deeper Analysis" Section
-
- Currently a grab-bag of interesting observations with TODOs. Your instinct to replace with "Discussion" is right.
-
- ### Proposed: Discussion Section
-
- ```
- ## Discussion
-
- ### What Explains the Performance Gap?
- - Metacognition (knowing when you know) is the key differentiator
- - Success rate alone doesn't predict score (GPT 5.2 vs Opus example)
- - Calibration enables good timing, but no model is well-calibrated
-
- ### Open vs Proprietary Models
- - Kimi K2 competitive on rule-finding
- - But open models trend toward reckless guessing (training objective differences?)
- - Opportunity: calibration tuning could improve open model performance
-
- ### Failure Modes [keep the accordion, it's useful]
-
- ### Implications for AI-Assisted Science
- - The caution-recklessness trade-off mirrors real scientific decision-making
- - An overconfident AI assistant could lead researchers astray
- - An overcautious one wastes resources on unnecessary verification
- ```
-
- ### Move to Appendix
-
- - Symmetric rules analysis (interesting but niche)
- - Confirmation bias (preliminary, needs more work)
- - Detailed qualitative examples (unless you expand them significantly)
-
- ---
-
- ## 5. Framing Suggestions
-
- ### 5.1 Lead with the Surprise
-
- Current opening of Results is fine, but the key insight (metacognition matters) comes too late. Consider foreshadowing in the introduction:
-
- > "We found something surprising: the model with the highest success rate doesn't have the highest score. What matters isn't just finding the answer—it's knowing when you've found it."
-
- ### 5.2 The "Scientific Personality" Frame
-
- This is potentially memorable and shareable. Models as:
- - **The Perfectionist** (GPT 5.2 High): Always wants more evidence
- - **The Pragmatist** (Claude Opus 4.5): Good enough evidence is enough
- - **The Gambler** (Claude Haiku 4.5): Guesses based on vibes
-
- This framing:
- - Makes the article more accessible to non-specialists
- - Creates natural anchors for discussion
- - Is scientifically defensible (behavioral clustering is real)
-
- ### 5.3 The Decision Theory Angle
-
- You mention the optimal guessing threshold (0.67 confidence) briefly. This could be expanded:
-
- > "Given perfect calibration, the optimal strategy is to guess whenever confidence exceeds 67%. But no model is well-calibrated. GPT 5.2 High effectively uses a threshold of ~95%; Claude Haiku 4.5 seems to use ~50%."
-
- This quantifies the "personalities" and connects to calibration.
-
- ---
-
- ## 6. Minor Issues
-
- ### 6.1 Typos/Grammar
-
- - results.mdx:38: "overconfident : for instance" → extra space before colon
- - results.mdx:39: "GPT 5.2 is the best calibrated" → should be "GPT 5.2 High"
- - results.mdx:51: "closed to Claude Opus 4.5" → "close to"
- - results.mdx:103: "constrats" → "contrasts"
- - analysis.mdx:60: "GPT OSS 120B also performs respectably at 12.0" → check number
-
- ### 6.2 Caption Numbering
-
- Figure 7 appears twice (score-stack and reckless-guessing). Fix numbering.
-
- ### 6.3 Model Names Consistency
-
- Inconsistent capitalization and naming:
- - "Claude Opus 4.5" vs "Claude 4.5 Opus"
- - "GPT 5.2 High" vs "Gpt 5.2 High" (in data files)
- - "DeepSeek R1" vs "Deepseek R1"
-
- ---
-
- ## 7. Ideas for Additional Content
-
- ### 7.1 Interactive "Play a Round" Demo
-
- Let readers play one round against a rule to experience the task. Even a simple version would be compelling. (This could be a stretch goal.)
-
- ### 7.2 Model-Specific Breakdowns
-
- You have per-model PNG files (`model_claude_opus_4_5.png`, etc.). Consider:
- - Appendix section with one page per model
- - Or: expandable accordion for each model's detailed stats
-
- ### 7.3 Token Efficiency Discussion
-
- You show score vs tokens in Figure 1 but don't discuss it much. Gemini 3 Flash achieves decent results with 4x fewer tokens than Opus—is that worth highlighting for practitioners?
-
- ### 7.4 Prompt Sensitivity
-
- You note this as a limitation but could briefly test: what if you told models to be more cautious? More aggressive? (Could be future work suggestion.)
-
- ---
-
- ## 8. Prioritized Action Items
-
- ### Must Fix
- 1. Audit all numbers against latest data files
- 2. Fix duplicate Figure 7 numbering
- 3. Fix typos listed above
-
- ### Should Do
- 4. Reorganize Results section (Option A or B above)
- 5. Rename "Deeper Analysis" to "Discussion" and restructure
- 6. Add foreshadowing of key insight in introduction
-
- ### Nice to Have
- 7. Add example turn figure in benchmark.mdx
- 8. Expand "scientific personalities" framing
- 9. Human baseline (even informal)
- 10. Per-model detail pages in appendix
-
- ---
-
- ## 9. Summary
-
- The benchmark and data are solid. The article's main weakness is structural: it has too many charts telling pieces of the same story without a clear narrative spine. The fix is to reorganize around **the key insight** (metacognition matters more than raw rule-finding ability) and **the key visual** (the caution-recklessness scatter plot).
-
- Your target message—"Models differ dramatically because metacognition matters, and this is an opportunity for improvement"—is supported by the data but not yet prominently surfaced by the article structure.
ASSESSMENT_V2.md DELETED
@@ -1,197 +0,0 @@
- # Revised Assessment: Eleusis Benchmark Article (v2)
-
- ## Executive Summary
-
- The article has improved significantly since the first assessment. The **Results section is now well-structured** with a clear narrative arc: overall performance → the metacognition insight → caution/recklessness trade-off → calibration → performance by rule. The key message about metacognition is now prominent and supported by the logical flow.
-
- The main remaining issues are:
- 1. **Data inconsistencies** between text and data files (numbers are outdated)
- 2. **The "Deeper Analysis" section** needs restructuring—much of it now duplicates the improved Results section
- 3. Minor typos
-
- ---
-
- ## 1. What's Working Well
-
- ### 1.1 Results Section Structure
- The new structure is excellent:
- ```
- Results
- ├── Overall Performance (intro)
- ├── Pure discovery vs metacognition (the key insight, early!)
- ├── Caution-Recklessness Trade-off (central analysis)
- ├── Confidence and Calibration (supporting evidence)
- └── Performance by Rule (rule-level breakdown)
- ```
-
- This addresses the main criticism from v1: readers now build understanding progressively and the metacognition insight is front and center.
-
- ### 1.2 Figure Flow
- Figures now tell a coherent story:
- - Fig 1: Overview (where does each model sit?)
- - Fig 2: Score breakdown (what drives score differences?)
- - Fig 3: Caution vs recklessness (the key trade-off)
- - Fig 4: Calibration (why is timing hard?)
- - Fig 5: Guess rate (how do models decide when to commit?)
- - Fig 6-7: Rule-level analysis (drill-down)
-
- ### 1.3 New Guess Rate Analysis (Figure 5)
- This is a valuable addition that wasn't in the original. It shows how models operationalize their confidence into actual decisions, connecting calibration to behavior.
-
- ### 1.4 Clear Messaging
- Lines like "knowing when to commit is as important as finding the rule" now appear early and are reinforced throughout.
-
- ---
-
- ## 2. Critical Issues
-
- ### 2.1 Data Inconsistencies (Must Fix)
-
- The text still uses outdated numbers. Current data (from `summary.txt` and `overall_performance.json`) vs text:
-
- | Metric | In Text | Actual Data |
- |--------|---------|-------------|
- | Claude Opus 4.5 avg score | 15.9 (conclusion.mdx:10) | **17.0** (avg_floored_score) |
- | Claude Opus 4.5 success rate | 92% (conclusion.mdx:10) | **83%** |
- | Claude Haiku 4.5 success rate | 70% (conclusion.mdx:10) | **56%** |
- | Claude Haiku 4.5 failed guesses | 7.5/round (analysis.mdx:15) | **3.95/round** |
- | Kimi K2 avg score | 14.5 (analysis.mdx:60) | **16.2** |
- | GPT OSS 120B score | 12.0 (analysis.mdx:60) | **12.9** |
- | GPT 5.2 High early correct turns | 3.6 (multiple places) | **3.56** ✓ (close enough) |
-
- **Action:** Audit all numbers in `results.mdx`, `analysis.mdx`, and `conclusion.mdx` against the latest data files.
-
- ### 2.2 Typos Still Present
-
- | Location | Issue |
- |----------|-------|
- | results.mdx:20 | "closed to Claude Opus 4.5" → "close to" |
- | results.mdx:85 | "overconfident : for instance" → remove space before colon |
- | results.mdx:86 | "GPT 5.2 is the best calibrated" → "GPT 5.2 High" |
- | results.mdx:102 | "THis is somehow" → "This is somehow" |
-
- ---
-
- ## 3. The "Deeper Analysis" Section
-
- ### 3.1 Current Problem
-
- The "Deeper Analysis" section is now partially redundant. It covers:
- 1. **Metacognition** (duplicates Results § "Pure discovery vs metacognition")
- 2. **Learning Curves** (TODO, placeholder)
- 3. **Failure Modes** (valuable, keep)
- 4. **Open vs Closed Models** (brief, could be expanded)
- 5. **Symmetric Rules** (interesting niche finding)
- 6. **Confirmation Bias** (preliminary, incomplete)
- 7. **Qualitative Observations** (nice examples, but disconnected)
-
- ### 3.2 Recommended Restructure
-
- Rename to "Discussion" and reorganize:
-
- ```markdown
- ## Discussion
-
- ### What Explains the Performance Gap?
- - Brief synthesis: metacognition > raw ability
- - The caution-recklessness trade-off determines ranking more than success rate
- - Move the GPT 5.2 High / Claude Opus 4.5 / Claude Haiku 4.5 characterizations here
-   (but avoid repeating numbers already in Results)
-
- ### Scientific Temperaments
- - This is where the "scientific personality" framing could shine
- - The Perfectionist (GPT 5.2 High): needs too much evidence
- - The Pragmatist (Claude Opus 4.5): good-enough is good enough
- - The Gambler (Claude Haiku 4.5): acts on insufficient evidence
- - Link to real-world science: these map to actual failure modes in research
-
- ### Failure Modes [keep the accordion, it's excellent]
- - Already well-written, just tighten the taxonomy
-
- ### Open vs Proprietary Models
- - Currently too brief (1 paragraph)
- - Could expand: why might open models trend reckless? (RLHF differences?)
- - Kimi K2's success is notable—worth highlighting more
-
- ### Implications for AI-Assisted Science
- - Currently in Conclusion but could be expanded here
- - An overconfident assistant leads researchers astray
- - An overcautious assistant wastes resources
- - The calibration problem is particularly concerning
-
- ### Move to Appendix (or delete)
- - Learning Curves (TODO) → either implement or remove
- - Symmetric Rules → niche, move to appendix or cut
- - Confirmation Bias → too preliminary, either expand significantly or cut
- - Qualitative Observations → keep 1-2 good examples, cut the rest
- ```
-
- ### 3.3 Delete the Redundancy
-
- The current Metacognition subsection (analysis.mdx:7-16) largely repeats what's now better expressed in Results. Either:
- - Delete it entirely and rely on Results
- - Or transform it into the "Scientific Temperaments" narrative frame (more memorable)
-
- ---
-
- ## 4. Missing Content (Lower Priority)
-
- ### 4.1 TODOs Still Present
- - Learning curves figure (analysis.mdx:22) — either implement or remove the placeholder
- - Failure mode distribution stacked bar (analysis.mdx:55) — nice to have, not critical
-
- ### 4.2 Human Baseline
- Still missing. Consider adding a sentence like: "Without human performance data on the same rules, we cannot assess whether these success rates represent strong or weak performance in absolute terms—only that models differ substantially among themselves."
-
- ### 4.3 Example Turn Figure
- Would still be valuable in benchmark.mdx to make the task concrete. A simple 3-panel showing:
- ```
- [Board state] → [Model reasoning excerpt] → [Decision output]
- ```
-
- ---
-
- ## 5. Minor Polish
-
- ### 5.1 Model Name Consistency
- Some inconsistencies remain:
- - "Grok 4.1 Fast Reasoning" vs "Grok 4 1 Fast Reasoning" (in data)
- - "DeepSeek R1" vs "Deepseek R1" (in data)
- - Decide on one capitalization style and apply consistently
-
- ### 5.2 The "floored" Score
- The article doesn't explain that scores below 0 are floored to 0. This affects interpretation—might be worth a brief mention in the Benchmark section or a sidenote.
-
- ### 5.3 Sidenote on Optimal Threshold
- Results.mdx mentions the 0.67 optimal threshold but doesn't explain why. A brief derivation in a sidenote would help:
- > For a perfectly calibrated model: E[guess at p] = p×(points remaining) - (1-p)×2. Setting E[guess] > E[wait 1 turn] gives p > 2/3 ≈ 0.67.
-
- ---
-
- ## 6. Summary of Recommended Actions
-
- ### Must Do
- 1. ☐ Fix all data inconsistencies (audit numbers against data files)
- 2. ☐ Fix typos listed in §2.2
- 3. ☐ Remove or transform redundant content in "Deeper Analysis"
-
- ### Should Do
- 4. ☐ Rename "Deeper Analysis" → "Discussion"
- 5. ☐ Restructure Discussion per §3.2
- 6. ☐ Either implement Learning Curves figure or remove the TODO
-
- ### Nice to Have
- 7. ☐ Add "Scientific Temperaments" framing
- 8. ☐ Add example turn figure in benchmark.mdx
- 9. ☐ Explain the score flooring mechanism
- 10. ☐ Expand Open vs Proprietary discussion
-
- ---
-
- ## 7. Overall Assessment
-
- **Grade: B+ (up from B-)**
-
- The structural problems identified in v1 are largely resolved. The article now tells a clear story: models vary in their "scientific temperament," and metacognition—knowing when you know—matters as much as raw reasoning ability.
-
- The remaining work is mostly cleanup (data consistency, typos) and deciding what to do with the Deeper Analysis section. The article is close to publication-ready once the numbers are fixed.
 
app/astro.config.mjs CHANGED
@@ -65,7 +65,6 @@ export default defineConfig({
   bibliography: 'src/content/bibliography.bib',
   linkCitations: true,
   csl: "apa",
-  noCite: false,
   suppressBibliography: false,
 }],
 rehypeReferencesAndFootnotes,
app/src/content/article.mdx CHANGED
@@ -1,6 +1,6 @@
 ---
-title: "Are LLMs any good at the Science Game?"
-subtitle: "Evaluating scientific reasoning using the card game Eleusis"
+title: "Are LLMs any good at the Game of Science?"
+subtitle: "Evaluating scientific reasoning and metacognition using the card game Eleusis"
 description: "A benchmark for evaluating LLM scientific reasoning using the card game Eleusis, testing iterative hypothesis formation, calibration, and strategic experimentation."
 authors:
   - name: "David Louapre"
@@ -25,7 +25,7 @@ showPdf: true
 import Introduction from "./chapters/eleusis/introduction.mdx";
 import Benchmark from "./chapters/eleusis/benchmark.mdx";
 import Results from "./chapters/eleusis/results.mdx";
-import Analysis from "./chapters/eleusis/analysis.mdx";
+import Discussion from "./chapters/eleusis/discussion.mdx";
 import Conclusion from "./chapters/eleusis/conclusion.mdx";
 import Appendix from "./chapters/eleusis/appendix.mdx";

@@ -35,7 +35,7 @@ import Appendix from "./chapters/eleusis/appendix.mdx";

 <Results />

-<Analysis />
+<Discussion />

 <Conclusion />

app/src/content/bibliography.bib CHANGED
@@ -1,130 +1,66 @@
-@inproceedings{vaswani2017attention,
-  title = {Attention Is All You Need},
-  author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, {\L}ukasz and Polosukhin, Illia},
-  booktitle = {Advances in Neural Information Processing Systems},
-  year = {2017}
-}
-
-@book{mckinney2017python,
-  title = {Python for Data Analysis},
-  author = {McKinney, Wes},
-  publisher = {O'Reilly Media},
-  address = {Sebastopol, CA},
-  year = {2017},
-  edition = {2},
-  isbn = {978-1491957660}
-}
-
-@inproceedings{he2016resnet,
-  title = {Deep Residual Learning for Image Recognition},
-  author = {He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
-  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
-  pages = {770--778},
-  year = {2016},
-  doi = {10.1109/CVPR.2016.90},
-  url = {https://doi.org/10.1109/CVPR.2016.90}
-}
-
-@article{silver2017mastering,
-  title = {Mastering the game of Go without human knowledge},
-  author = {Silver, David and Schrittwieser, Julian and Simonyan, Karen and Antonoglou, Ioannis and Huang, Aja and others},
-  journal = {Nature},
-  volume = {550},
-  number = {7676},
-  pages = {354--359},
-  year = {2017},
-  month = {oct},
-  doi = {10.1038/nature24270},
-  url = {https://www.nature.com/articles/nature24270}
-}
-
-@techreport{openai2023gpt4,
-  title = {GPT-4 Technical Report},
-  author = {{OpenAI}},
-  institution = {OpenAI},
-  year = {2023},
-  number = {arXiv:2303.08774},
+@misc{chollet2019measure,
+  title = {On the Measure of Intelligence},
+  author = {Chollet, François},
+  year = {2019},
+  howpublished = {arXiv preprint},
   archiveprefix = {arXiv},
-  eprint = {2303.08774},
-  primaryclass = {cs.CL},
-  url = {https://arxiv.org/abs/2303.08774}
-}
-
-@phdthesis{doe2020thesis,
-  title = {Learning Efficient Representations for Large-Scale Visual Recognition},
-  author = {Doe, Jane},
-  school = {Massachusetts Institute of Technology},
-  address = {Cambridge, MA},
-  year = {2020},
-  doi = {10.5555/mit-2020-xyz}
-}
-
-@incollection{cover2006entropy,
-  title = {Entropy, Relative Entropy, and Mutual Information},
-  author = {Cover, Thomas M. and Thomas, Joy A.},
-  booktitle = {Elements of Information Theory},
-  publisher = {Wiley},
-  address = {Hoboken, NJ},
-  edition = {2},
-  year = {2006},
-  pages = {13--55},
-  isbn = {978-0471241959}
+  eprint = {1911.01547},
+  primaryclass = {cs.AI},
+  url = {https://arxiv.org/abs/1911.01547}
 }
 
-@misc{zenodo2021dataset,
-  title = {ImageNet-21K Subset (Version 2.0)},
-  author = {Smith, John and Lee, Alice and Kumar, Ravi},
-  year = {2021},
-  howpublished = {Dataset on Zenodo},
-  doi = {10.5281/zenodo.1234567},
-  url = {https://doi.org/10.5281/zenodo.1234567},
-  note = {Accessed 2025-09-01}
+@article{abbott1977eleusis,
+  title = {The New Eleusis},
+  author = {Abbott, Robert},
+  journal = {Games \& Puzzles},
+  year = {1977},
+  note = {Updated rules for the Eleusis card game, originally published in 1956}
 }
 
-@misc{sklearn2024,
-  title = {scikit-learn: Machine Learning in Python (Version 1.4)},
-  author = {Pedregosa, Fabian and Varoquaux, Ga{\"e}l and Gramfort, Alexandre and others},
-  year = {2024},
-  howpublished = {Software},
-  doi = {10.5281/zenodo.592264},
-  url = {https://scikit-learn.org}
-}
-
-@inproceedings{smith2024privacy,
-  title = {Privacy-Preserving Training with Low-Precision Secure Aggregation},
-  author = {Smith, Emily and Zhang, Wei and Rossi, Marco and Patel, Neha},
-  booktitle = {Proceedings of the 41st International Conference on Machine Learning},
-  editor = {Smith, A. and Johnson, B.},
+@inproceedings{guo2017calibration,
+  title = {On Calibration of Modern Neural Networks},
+  author = {Guo, Chuan and Pleiss, Geoff and Sun, Yu and Weinberger, Kilian Q.},
+  booktitle = {Proceedings of the 34th International Conference on Machine Learning},
   series = {Proceedings of Machine Learning Research},
-  volume = {235},
-  pages = {12345--12367},
-  address = {Vienna, Austria},
+  volume = {70},
+  pages = {1321--1330},
+  year = {2017},
   publisher = {PMLR},
-  month = {jul},
-  year = {2024},
-  url = {https://proceedings.mlr.press/v235/}
+  url = {https://proceedings.mlr.press/v70/guo17a.html}
 }
 
-@article{kingma2015adam,
-  title = {Adam: A Method for Stochastic Optimization},
-  author = {Kingma, Diederik P. and Ba, Jimmy},
-  journal = {International Conference on Learning Representations (ICLR)},
-  year = {2015},
-  archiveprefix = {arXiv},
-  eprint = {1412.6980},
-  primaryclass = {cs.LG},
-  url = {https://arxiv.org/abs/1412.6980}
+@article{nickerson1998confirmation,
+  title = {Confirmation Bias: A Ubiquitous Phenomenon in Many Guises},
+  author = {Nickerson, Raymond S.},
+  journal = {Review of General Psychology},
+  volume = {2},
+  number = {2},
+  pages = {175--220},
+  year = {1998},
+  doi = {10.1037/1089-2680.2.2.175},
+  url = {https://journals.sagepub.com/doi/abs/10.1037/1089-2680.2.2.175}
 }
 
-@misc{raffel2020t5,
-  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
-  author = {Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and others},
-  year = {2020},
-  howpublished = {arXiv preprint},
-  archiveprefix = {arXiv},
-  eprint = {1910.10683},
-  primaryclass = {cs.LG},
-  doi = {10.48550/arXiv.1910.10683},
-  url = {https://arxiv.org/abs/1910.10683}
+@article{flavell1979metacognition,
+  title = {Metacognition and Cognitive Monitoring: A New Area of Cognitive-Developmental Inquiry},
+  author = {Flavell, John H.},
+  journal = {American Psychologist},
+  volume = {34},
+  number = {10},
+  pages = {906--911},
+  year = {1979},
+  doi = {10.1037/0003-066X.34.10.906},
+  url = {https://psycnet.apa.org/record/1980-09388-001}
+}
+
+@article{lichtenstein1977calibration,
+  title = {Do Those Who Know More Also Know More About How Much They Know?},
+  author = {Lichtenstein, Sarah and Fischhoff, Baruch},
+  journal = {Organizational Behavior and Human Performance},
+  volume = {20},
+  number = {2},
+  pages = {159--183},
+  year = {1977},
+  doi = {10.1016/0030-5073(77)90001-0},
+  url = {https://www.sciencedirect.com/science/article/abs/pii/0030507377900010}
 }
app/src/content/chapters/eleusis/analysis.mdx DELETED
@@ -1,100 +0,0 @@
- import Note from "../../../components/Note.astro";
- import Sidenote from "../../../components/Sidenote.astro";
- import Accordion from "../../../components/Accordion.astro";
-
- ## Deeper Analysis
-
- ### Metacognition: Knowing What You Know
-
- The caution-recklessness analysis reveals that metacognition, the ability to accurately assess one's own knowledge, is a key differentiator between models. Consider two extremes:
-
- **GPT 5.2 High** has excellent rule-finding ability (96% success rate), fairly good calibration, but is overly cautious. It averages 3.6 turns after discovering the correct rule before making the guess, leading to lost points despite its high accuracy. It frequently discovers the correct rule early but doesn't *believe* it has sufficient evidence. This excessive caution costs an average of 3.6 points per successful round, enough to drop it from first to third place overall.
-
- **Claude Opus 4.5** achieves the best balance: high success rate (92%) with well-calibrated timing (only 0.9 early correct turns and 2.8 failed guesses). It is not as well calibrated as GPT 5.2 High, and would benefit from better calibration while keeping its slightly risk-taking approach.
-
- **Claude Haiku 4.5** has the opposite problem: poor rule-finding (70% success) combined with overconfident metacognition. It commits to guesses without sufficient evidence, accumulating 7.5 failed guesses per round on average—the highest of any model.
-
- ### Learning Curves
-
- How do models improve within a single round? We tracked confidence and hypothesis quality over turn number to understand the learning dynamics.
-
- <Note variant="info">
- **TODO**: Add figure showing line plot of average confidence by turn number, colored by eventual success/failure.
- </Note>
-
- Key observations:
- - **Successful rounds** typically show steadily increasing confidence with occasional drops when hypotheses are revised
- - **Failed rounds** often show erratic confidence or premature plateaus where models become stuck on incorrect hypotheses
- - **Acceptance rate decreases** over time as obvious cards are exhausted from the hand
-
- <Sidenote>
- The turn-by-turn reasoning traces provide rich data for understanding model behavior beyond simple success/failure metrics.
- </Sidenote>
-
- ### Failure Modes
-
- When models fail, why? We identified several recurring patterns:
-
- <Accordion title="Failure mode taxonomy" open>
-
- 1. **Premature guessing**: High confidence, wrong rule, insufficient evidence. The model becomes convinced too early based on limited data. This is the dominant failure mode for Claude Haiku 4.5 (7.5 failed guesses/round).
-
- 2. **Hypothesis fixation**: Stuck on wrong rule despite contradictory evidence. The model fails to update when new observations conflict with its theory.
-
- 3. **Overfitting**: Rule matches all observations but is more specific than the actual rule (e.g., guessing "only red hearts" when the rule is "only red cards").
-
- 4. **Underfitting**: Rule is too simple and fails to capture necessary conditions (e.g., guessing "black cards" when rule is "black even cards").
-
- 5. **Position blindness**: Fails on rules depending on position in mainline or relationship to previous cards.
-
- 6. **Excessive caution**: The model finds the correct rule but doesn't trust its conclusion. GPT 5.2 High exemplifies this—waiting an average of 3.6 turns after finding the answer, costing significant points.
-
- </Accordion>
-
- <Note variant="info">
- **TODO**: Add stacked bar chart showing distribution of failure modes by model.
- </Note>
-
- ### Open vs Closed Models
-
- A notable finding is the competitive performance of open-weight models. Kimi K2, available with open weights, achieves the second-highest score (14.5) and outperforms several proprietary models including GPT 5 Mini Medium and Gemini 3 Flash. The open-weight GPT OSS 120B also performs respectably at 12.0.
-
- However, open models tend toward more aggressive guessing strategies. Kimi K2 averages 4.0 failed guesses per round (vs. 2.8 for Claude Opus 4.5), and GPT OSS 20B has 6.2. This may reflect differences in training objectives or RLHF tuning between open and proprietary models.
-
- ### Symmetric Rules
-
- An interesting test: are symmetric rules equally difficult? For example, "only spades" vs "only non-spades" should be logically equivalent in difficulty, but models might have biases.
-
- We found that:
- - Negative rules ("not X") are generally harder than positive rules ("only X")
- - Rules involving rare events (low acceptance rate) are harder than rules with high acceptance rates
- - This may reflect training data biases where positive examples are more common
-
- ### Confirmation Bias
-
- Do models exhibit confirmation bias—preferring to play cards that confirm their current hypothesis rather than cards that could falsify it?
-
- <Sidenote>
- A good scientist designs experiments that could prove them wrong, not just experiments that confirm what they already believe.
- </Sidenote>
-
- Preliminary analysis suggests:
- - Models do show some tendency toward confirmation-seeking behavior
- - When confident in a hypothesis, models prefer "safe" plays that are likely to be accepted
- - Strategic exploration (playing cards specifically to test hypothesis boundaries) is rare
-
- ### Qualitative Observations
-
- Examining individual reasoning traces reveals interesting patterns:
-
- <Accordion title="Example: Hypothesis revision">
-
- In one game with the rule "alternating odd/even ranks," a model initially hypothesized "increasing ranks" based on the first few accepted cards. When a lower-ranked card was accepted, instead of abandoning the hypothesis entirely, the model revised it to "ranks must differ from previous." This partial update eventually led to discovering the true rule—a good example of iterative refinement.
-
- </Accordion>
-
- <Accordion title="Example: Fixation failure">
-
- With the rule "only face cards (J, Q, K)," one model became fixated on "only red cards" after the first three accepted cards happened to be red face cards. Despite subsequently seeing black face cards accepted, the model kept trying to reconcile observations with a color-based rule, eventually running out of turns.
-
- </Accordion>
app/src/content/chapters/eleusis/appendix.mdx CHANGED
@@ -1,5 +1,6 @@
  import Accordion from "../../../components/Accordion.astro";
  import Note from "../../../components/Note.astro";

  ## Appendix: Detailed Methods

@@ -26,50 +27,50 @@ All models were evaluated with the following settings:

  | Parameter | Value |
  |-----------|-------|
- | Temperature | 0.0 (deterministic) |
- | Max tokens | 4096 |
  | Retries | 3 (on API failures) |

- Reasoning models were allowed their default reasoning budgets. Standard models used base inference without additional chain-of-thought prompting beyond what's included in the game prompt.

  </Accordion>

  ### Rule Checking

- <Accordion title="Rule verification methodology">

  Rules are created by hand and expressed in natural language. Each rule is then compiled into a Python function using an LLM, with manual verification of correctness.

- When the model outputs a guessed rule, we:
- 1. Compile the guess into a Python function using the same LLM
- 2. Test the compiled function against all cards played in that game
- 3. Mark the guess as correct only if it matches the true rule's behavior on all observations

- This simulation-based approach avoids issues with semantic equivalence in natural language. For instance, "same color as previous card" and "red cards only" might be equivalent given a specific game history starting with a red card, but would differ on other histories.

- </Accordion>

- ### Prompt Structure

- <Accordion title="Full prompt template">

- The prompt includes:

- 1. **Game rules**: Complete explanation of how Eleusis works, without mentioning the game's name to avoid potential training data leakage

- 2. **Scoring system**: Explicit explanation of the scoring formula and strategic implications

- 3. **Response format**: JSON schema specifying required fields (reasoning, card choice, tentative rule, confidence, guess decision)

- 4. **Game state**: Current mainline, all sidelines, current hand, and reasoning from the previous 3 turns

- 5. **Format reminders**: Instructions for confidence scale interpretation (7 = 70% probability)

- </Accordion>

  ### Evaluation Metrics

- <Accordion title="Metric definitions">

  - **Success rate**: Fraction of games where the model correctly identified the rule before running out of turns

@@ -81,22 +82,6 @@ The prompt includes:

  - **Turns to success**: For successful games, mean number of turns before correct guess

- </Accordion>
-
- ### References
-
- <Accordion title="Bibliography">
-
- - Abbott, R. (1963). "Eleusis" — Original game rules and design philosophy
-
- - Guo, C., et al. (2017). "On Calibration of Modern Neural Networks" — Foundational work on neural network calibration
-
- - Chollet, F. (2019). "On the Measure of Intelligence" — ARC benchmark and discussion of abstract reasoning
-
- - Recent LLM reasoning benchmarks: GSM8K, MATH, ARC-AGI, BIG-Bench, etc.
-
- </Accordion>
-
  <Note>
  Full code, data, and model outputs are available in the benchmark repository.
  </Note>
 
  import Accordion from "../../../components/Accordion.astro";
  import Note from "../../../components/Note.astro";
+ import Sidenote from "../../../components/Sidenote.astro";

  ## Appendix: Detailed Methods

  | Parameter | Value |
  |-----------|-------|
+ | Temperature | 0.7 |
+ | Max tokens | 16384 |
  | Retries | 3 (on API failures) |

+ Reasoning models were allowed their default reasoning budgets.

  </Accordion>

  ### Rule Checking

  Rules are created by hand and expressed in natural language. Each rule is then compiled into a Python function using an LLM, with manual verification of correctness.

+ When the model outputs a guessed rule, we:
+ 1. Compile the guess into a Python function using the same LLM
+ 2. Test the compiled function against all cards played in that game
+ 3. Mark the guess as correct only if it matches the true rule's behavior on all observations

+ This simulation-based approach avoids issues with semantic equivalence in natural language. For instance, "same color as previous card" and "red cards only" might be equivalent given a specific game history starting with a red card, but would differ on other histories. The simulation approach also avoids declaring as different two rules that, given the current state of the game, behave identically. For instance, if the rule is "same color as the previous card", then as soon as the first card has been drawn the model might guess "red cards only" or "black cards only" depending on that card's color. Both guesses are semantically different from the true rule but functionally equivalent given the current game state.
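+ As a minimal sketch of this functional-equivalence check (illustrative names and card encoding, not the benchmark's actual code), two compiled rules can be replayed against the game history and compared card by card:

```python
# Illustrative sketch: cards are (rank, suit) tuples; a compiled rule is a
# predicate taking the candidate card and the current mainline.
RED_SUITS = {"H", "D"}

def same_color_as_previous(card, mainline):
    """True secret rule: same color as the last accepted card."""
    if not mainline:
        return True  # the first card is always accepted
    return (card[1] in RED_SUITS) == (mainline[-1][1] in RED_SUITS)

def red_cards_only(card, mainline):
    """Model's guessed rule, compiled to a predicate."""
    return card[1] in RED_SUITS

def agrees_on_game(rule_a, rule_b, plays):
    """Replay the game: the guess counts as correct iff both rules
    accept/reject every played card identically, given the mainline
    as it stood at that point."""
    mainline = []
    for card in plays:
        if rule_a(card, mainline) != rule_b(card, mainline):
            return False
        if rule_a(card, mainline):  # accepted cards extend the mainline
            mainline.append(card)
    return True

# A history starting with a red card cannot distinguish the two rules
print(agrees_on_game(same_color_as_previous, red_cards_only,
                     [("5", "H"), ("2", "S"), ("K", "D")]))  # True
# A history starting with a black card separates them immediately
print(agrees_on_game(same_color_as_previous, red_cards_only,
                     [("5", "S")]))  # False
```
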
+ ### Additional results

+ #### Learning Curves

+ How do models improve within a single round? We tracked confidence and hypothesis quality over turn number to understand the learning dynamics.

+ <Note variant="info">
+ **TODO**: Add figure showing line plot of average confidence by turn number, colored by eventual success/failure.
+ </Note>

+ Key observations:
+ - **Successful rounds** typically show steadily increasing confidence with occasional drops when hypotheses are revised
+ - **Failed rounds** often show erratic confidence or premature plateaus where models become stuck on incorrect hypotheses
+ - **Acceptance rate decreases** over time as obvious cards are exhausted from the hand

+ <Sidenote>
+ The turn-by-turn reasoning traces provide rich data for understanding model behavior beyond simple success/failure metrics.
+ </Sidenote>

  ### Evaluation Metrics

  - **Success rate**: Fraction of games where the model correctly identified the rule before running out of turns

  - **Turns to success**: For successful games, mean number of turns before correct guess
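
  These metrics can be computed from per-game records along the following lines (a sketch for illustration; the record layout is invented, not the benchmark's actual data format):

```python
from statistics import mean

# Hypothetical per-game records: (solved, final_score, turns_to_success)
games = [
    (True, 18, 13),
    (False, 0, None),   # unsolved games have no turns-to-success
    (True, 12, 16),
]

# Fraction of games where the rule was found before turns ran out
success_rate = mean(1.0 if solved else 0.0 for solved, _, _ in games)
# Mean final score across all games, solved or not
avg_score = mean(score for _, score, _ in games)
# Mean turns before the correct guess, over solved games only
avg_turns_to_success = mean(t for solved, _, t in games if solved)

print(success_rate)          # 2 of 3 games solved
print(avg_score)             # 10
print(avg_turns_to_success)  # 14.5
```
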

  <Note>
  Full code, data, and model outputs are available in the benchmark repository.
  </Note>
app/src/content/chapters/eleusis/benchmark.mdx CHANGED
@@ -1,20 +1,18 @@
  import Sidenote from "../../../components/Sidenote.astro";
- import Note from "../../../components/Note.astro";
- import Accordion from "../../../components/Accordion.astro";

- ## The Eleusis Benchmark

  ### The Original Game

- In the original Eleusis card game, one player acts as the "dealer" (sometimes called "God" or "Nature") and secretly invents a rule determining which cards can be legally played. The other players don't know this rule; they must discover it through experimentation.

- Players take turns playing cards from their hand onto a central "mainline." If a card satisfies the secret rule, it's accepted and added to the mainline. If it violates the rule, it's rejected and placed in a "sideline" below the mainline at that position. Over time, the pattern of accepted and rejected cards provides evidence about the hidden rule.

  <Sidenote>
  The name "Eleusis" comes from the ancient Greek mystery cult, where initiates gradually discovered hidden truths.
  </Sidenote>

- At any point, a player can attempt to guess the rule; correctly identifying it ends the game. A specific scoring system rewards efficiency in discovering the rule while penalizing reckless guessing.

  ### Our Adaptation

@@ -26,15 +24,16 @@ On each turn, the player selects a card from their hand to play. If the card sat

  When correctly guessing the rule, the player earns more points the fewer turns they used, and each wrong guess deducts a penalty of 2 points:

- $$\text{score} = (30 - \text{turns\_elapsed} + 1) - 2 \times \text{num\_wrong\_guesses}$$

- A player who correctly identifies the rule on turn 13 with no wrong guesses scores 18 points; one who made 3 wrong guesses along the way scores only 12. If because of penalties the score drops to zero or below, the round stops and the final score is recorded as zero (similar to a scientist having wasted all their resources).

  This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence and acting accordingly.

  ### Rule Library

- We created a library of 26 hand-crafted rules spanning a range of types and complexity. Some rules involve simple card properties (e.g., "only red cards"), while others depend on the sequence of previously accepted cards (e.g., "card rank must be higher than previous card"). The rule might involve rank, suits, color or a combination thereof, and may include positional dependencies.

  | Category | Examples |
  |----------|----------|
@@ -48,7 +47,7 @@ Each rule is played 3 times with different random seeds (affecting the initial h

  ### What the LLM Must Do

- On each turn, the model receives the complete game state: the mainline of accepted cards, the sidelines of rejected cards at each position, its current hand, and its history of reasoning from the previous turns.

  The model is free to reason, but it is asked to output a structured response containing:

@@ -69,6 +68,6 @@ Example output
  }
  ```

- This structure lets us analyze not just whether models succeed, but *how* they reason: Do they update hypotheses appropriately when evidence contradicts them? Do they explore strategically or play conservatively? Is their stated confidence calibrated to their actual accuracy? In particular, forcing the model to articulate a tentative rule and a confidence level in it (even if they don't want to guess it yet) allows us to (secretly) evaluate it nonetheless, which will be useful for measuring calibration and guessing abilities.

 
 
  import Sidenote from "../../../components/Sidenote.astro";

+ ## 1. The Eleusis Benchmark

  ### The Original Game

+ In the original Eleusis card game, one player acts as the "dealer" (sometimes called "God" or "Nature") and secretly invents a rule determining which cards can be legally played. The other players (called "scientists") don't know this rule; they must discover it through experimentation.

+ Players take turns playing cards from their hand onto a central "mainline." If a card satisfies the secret rule, the dealer accepts it and it is added to the mainline. If it violates the rule, it's rejected and placed in a "sideline" below the mainline at that position. Over time, the pattern of accepted and rejected cards provides evidence about the hidden rule.

  <Sidenote>
  The name "Eleusis" comes from the ancient Greek mystery cult, where initiates gradually discovered hidden truths.
  </Sidenote>

+ At any point, a player can attempt to guess the rule; correctly identifying it ends the game, but a wrong guess incurs a penalty. The game continues until someone correctly identifies the rule. A specific scoring system rewards efficiency in discovering the rule while penalizing reckless guessing.

  ### Our Adaptation

 
 

  When correctly guessing the rule, the player earns more points the fewer turns they used, and each wrong guess deducts a penalty of 2 points:

+ $$\text{score} = (30 - \text{turns\_elapsed} + 1) - 2 \times \text{num\_wrong\_guesses}$$

+ For instance, a player who correctly identifies the rule on turn 13 with no wrong guesses scores 18 points; one who made 3 wrong guesses along the way scores only 12. If, because of penalties, the score drops to zero or below, the current round ends and the final score is recorded as zero (similar to a scientist having wasted all their resources).
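
+ As a sketch of the scoring rule (illustrative code under the assumptions above, not the benchmark implementation):

```python
def round_score(turns_elapsed: int, num_wrong_guesses: int, max_turns: int = 30) -> int:
    """Score for a round ending in a correct guess, per the formula above.

    The zero floor reflects the rule that a round ends with score 0 once
    penalties exhaust the remaining points.
    """
    raw = (max_turns - turns_elapsed + 1) - 2 * num_wrong_guesses
    return max(raw, 0)

print(round_score(13, 0))  # 18: the clean guess from the example above
print(round_score(13, 3))  # 12: three wrong guesses cost 6 points
print(round_score(28, 2))  # 0: penalties have wiped out the remaining points
```
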

  This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence and acting accordingly.

  ### Rule Library
+ In the original game, the dealer has to invent a secret rule on the spot. For benchmarking LLMs, however, we need a fixed set of rules to ensure comparability across model runs. We created a library of 26 hand-crafted rules spanning a range of types and complexity. Some rules involve simple card properties (e.g., "only red cards"), while others depend on the sequence of previously accepted cards (e.g., "card rank must be higher than the previous card"). A rule might involve rank, suit, color or a combination thereof, and may include positional dependencies.

+ Here are some example rules from our library, with a tentative categorization:

  | Category | Examples |
  |----------|----------|
 

  ### What the LLM Must Do

+ On each turn, the model is prompted with the rules of the game and the complete game state: the mainline of accepted cards, the sidelines of rejected cards at each position, its current hand, and its history of reasoning from the previous turns.

  The model is free to reason, but it is asked to output a structured response containing:

 
 
  }
  ```

+ **This structure lets us analyze not just whether models succeed, but *how* they reason:** Do they update hypotheses appropriately when evidence contradicts them? Do they explore strategically or play conservatively? Is their stated confidence calibrated to their actual accuracy? In particular, forcing the model to articulate a tentative rule and a confidence level (even if it doesn't want to commit to a guess yet) allows us to evaluate that rule secretly nonetheless, which is useful for measuring calibration and guessing abilities.

app/src/content/chapters/eleusis/conclusion.mdx CHANGED
@@ -3,23 +3,12 @@ import Sidenote from "../../../components/Sidenote.astro";

  ## Conclusion

- ### Key Findings
-
- Our evaluation of ten LLMs on the Eleusis benchmark reveals several important insights:
-
- 1. **LLMs can do inductive reasoning**—but with significant variation. Claude Opus 4.5 leads with 92% success rate and 15.9 average score, while Claude Haiku 4.5 achieves only 70% success and 9.1 average score—a substantial gap on the same benchmark.
-
- 2. **Metacognition matters as much as reasoning**. Finding the correct rule is only half the challenge; knowing *when* you've found it is equally important. GPT 5.2 High has the highest success rate (96%) but only ranks third overall because it waits too long to commit—an average of 3.6 turns after finding the answer.
-
- 3. **There's a caution-recklessness trade-off**. Models cluster into distinct behavioral styles: cautious achievers (GPT 5.2 High), balanced performers (Claude Opus 4.5), and reckless guessers (Claude Haiku 4.5). The best results come from accurate metacognition, not from either extreme.
-
- 4. **Open models are competitive**. Kimi K2 (open weights) achieves the second-highest score, outperforming several proprietary models. The gap between open and closed models appears to be closing on reasoning tasks.
-
- 5. **Calibration remains imperfect**—models don't always know what they don't know. Most models show systematic overconfidence at high stated confidence levels.
-
- <Sidenote>
- The 7-point gap between best and worst models (15.9 vs 9.1) suggests this benchmark captures meaningful capability differences.
- </Sidenote>

  ### Limitations

@@ -45,16 +34,7 @@ Several directions for future work:

  - **Human comparisons**: Collecting human performance data would provide crucial context for interpreting model capabilities.

- - **Interactive exploration**: Building tools to explore individual game traces could help researchers understand model reasoning more deeply.

- <Note variant="info">
- The benchmark is open source. Try it yourself and contribute new rules or model evaluations!
- </Note>
-
- ### Final Thoughts
-
- The Eleusis benchmark offers a window into capabilities that matter for real-world scientific reasoning: iterative hypothesis refinement, strategic experimentation, and calibrated confidence. Perhaps most importantly, it reveals the critical role of *metacognition*—the ability to accurately assess one's own knowledge state.
-
- Our results suggest that raw reasoning ability is necessary but not sufficient. GPT 5.2 High can find the answer more often than any other model, yet loses to Claude Opus 4.5 because it doesn't know when to commit. Claude Haiku 4.5 commits readily but often before it should. The winning strategy requires both: strong inductive reasoning *and* accurate self-assessment.
-
- As LLMs are increasingly deployed to assist with scientific research, understanding these limitations becomes crucial. A model that is brilliant at generating hypotheses but doesn't know when to trust them could either lead researchers down unproductive paths (if overconfident) or waste time on unnecessary verification (if overcautious). The Eleusis benchmark provides one lens for evaluating and improving these capabilities—measuring not just what models know, but whether they know what they know.
 

  ## Conclusion

+ The Eleusis benchmark offers a window into capabilities that matter for real-world scientific reasoning: iterative hypothesis refinement, strategic experimentation, and calibrated confidence. Perhaps most importantly, it reveals the critical role of *metacognition*—the ability to accurately assess one's own knowledge state.

+ Our results suggest that raw reasoning ability is necessary but not sufficient. GPT 5.2 High can find the answer more often than any other model, yet loses to Claude Opus 4.5 because it doesn't know when to commit. Claude Haiku 4.5 commits readily but often before it should. The winning strategy requires both: strong inductive reasoning *and* accurate self-assessment.

+ As LLMs are increasingly deployed to assist with scientific research, understanding these limitations becomes crucial. A model that is brilliant at generating hypotheses but doesn't know when to trust them could either lead researchers down unproductive paths (if overconfident) or waste time on unnecessary verification (if overcautious). The Eleusis benchmark provides one lens for evaluating and improving these capabilities—measuring not just what models know, but whether they know what they know.

  ### Limitations

  - **Human comparisons**: Collecting human performance data would provide crucial context for interpreting model capabilities.

+ - **Prompt engineering**: Exploring how different prompt designs affect performance and metacognitive accuracy. In particular, can we compensate for poor calibration or guessing strategies via prompting?

+ - **Confirmation bias analysis**: Do models exhibit confirmation bias, i.e. a preference for playing cards that confirm their current hypothesis rather than cards that could falsify it? Measuring this would require an LLM-as-a-judge analysis of the reasoning traces.
 
 
 
 
 
 
 
 
 
app/src/content/chapters/eleusis/discussion.mdx ADDED
@@ -0,0 +1,30 @@
+ import Note from "../../../components/Note.astro";
+ import Sidenote from "../../../components/Sidenote.astro";
+ import Accordion from "../../../components/Accordion.astro";
+
+ ## Discussion
+
+ ### Inductive abilities & metacognition
+
+ TODO: summarize main findings about the fact that performance depends on both inductive reasoning and metacognitive calibration.
+
+ The primary factor is inductive reasoning: forming good hypotheses and carefully choosing the next experiment.
+
+ Beyond raw reasoning ability, models also exhibit distinct scientific personalities, on a separate axis, and these play a crucial role in performance.
+
+ TODO: refine this
+ - The Perfectionist (GPT 5.2 High): needs too much evidence before committing
+ - The Balanced (Gemini 3 Flash Preview Low): a good trade-off, with poor calibration compensated by caution, but not the best at inductive reasoning
+ - The Pragmatist (Claude Opus 4.5): good enough is good enough
+ - The Gambler (Claude Haiku 4.5): acts on insufficient evidence
+
+ ### Open vs Closed Models
+
+ A notable finding is the competitive performance of open-weight models. Kimi K2, available with open weights, achieves the second-highest score (16.2) and outperforms several proprietary models including GPT 5.2. DeepSeek R1 scores 13.3 and the open-weight GPT OSS 120B also performs respectably at 12.9.
+
+ However, open models all tend toward more aggressive guessing strategies driven by poor calibration, leading to lower overall scores despite reasonable inductive abilities. This suggests that while open models can match proprietary ones in raw reasoning, they may lack the nuanced metacognitive skills needed for optimal performance on this benchmark.
app/src/content/chapters/eleusis/introduction.mdx CHANGED
@@ -3,34 +3,34 @@ import Image from "../../../components/Image.astro";

  import exampleSequence from "../../assets/image/example_sequence.png";

- Large language models are increasingly being deployed as tools for scientific research: analyzing data, generating hypotheses, and even designing experiments. But how well do they actually embody the scientific method?

  <Sidenote>
  Read time: 15–20 minutes.
  </Sidenote>

- Most reasoning benchmarks test whether models can solve well-defined problems: given premises, derive a conclusion. The ARC challenge, for instance, evaluates inductive reasoning on visual patterns. These benchmarks capture important capabilities, but they miss something fundamental about how science actually works.

- Real scientific reasoning is not a single inference step. It's an iterative agentic process of observation, hypothesis formation, experimentation, and refinement, often spanning many cycles before reaching a conclusion. It requires not just logical ability, but also *strategic* thinking: which experiment to run next, how much evidence is enough, when to commit to a theory versus when to keep exploring.

- Beyond pure reasoning, effective science depends on psychological factors that are rarely evaluated: **calibration** (does my confidence match my actual accuracy?), **metacognition** (how certain am I about my uncertainty?), and resistance to **cognitive biases** like confirmation bias (seeking only evidence that supports my current hypothesis instead of trying to challenge it). A scientist who is brilliant at deduction but overconfident in weak theories will waste resources pursuing dead ends. One who is well-calibrated but overly cautious may never publish.

  We wanted to test whether LLMs can exhibit these deeper aspects of scientific reasoning. To do this, we turned to an unlikely source: a 1950s card game called Eleusis.

  ## The Eleusis Game

- Eleusis was designed by Robert Abbott explicitly to simulate the process of scientific discovery. In the original game, one player invents a secret rule governing which cards can be played, and other players must deduce the rule through experimentation, by playing cards and observing whether they are accepted or rejected.

  **Eleusis is a microcosm of the scientific method:** the rule is a hidden law of nature, each card play is an experiment, and the sequence of accepted and rejected cards is the accumulating evidence.

  <Image
  src={exampleSequence}
  alt="Example Eleusis game sequence with the secret rule 'alternating colors': mainline shows 5♠, J♥, J♠, A♦, 6♣ following the pattern, while the sideline below shows rejected cards 10♣ after J♠, and Q♥ and 2♦ after A♦"
- caption="An example Eleusis game with the secret rule 'alternating colors'. The main line (top) shows the sequence of accepted cards: 5♠ → J♥ → J♠ → A♦ → 6♣, alternating between black and red. The sideline (bottom) shows cards that were tried but rejected because they violate the pattern, for instance 10♣ after J♠, or Q♥ and 2♦ after A♦."
  id="fig-example-sequence"
  preserveColors
  />

  We built a benchmark around Eleusis to evaluate LLMs on this iterative, hypothesis-driven reasoning. Rather than testing knowledge retrieval or instruction-following, our benchmark asks: *can models act like scientists?* Can they observe evidence, form hypotheses, design informative experiments, and refine their theories? Can they calibrate their confidence appropriately and know when they've gathered enough evidence to commit to a conclusion?

- These skills are fundamental not just to science, but to debugging code, diagnosing problems, and everyday reasoning under uncertainty.
 
3
 
4
  import exampleSequence from "../../assets/image/example_sequence.png";
5
 
6
+ Large language models are increasingly being deployed as tools for scientific research: analyzing data, generating hypotheses, and even designing experiments. But how well do they actually embody the scientific method?

  <Sidenote>
  Read time: 15–20 minutes.
  </Sidenote>

+ Most reasoning benchmarks test whether models can solve well-defined problems: given premises, derive a conclusion. The ARC challenge [@chollet2019measure], for instance, evaluates inductive reasoning on visual patterns. **These benchmarks capture important capabilities, but they miss something fundamental about how science actually works.**

+ First, real scientific reasoning is not a single inference step. It's an iterative agentic process of observation, hypothesis formation, experimentation, and refinement, often spanning many cycles before reaching a conclusion. It requires not just logical ability, but also *strategic thinking*: which experiment to run next, how much evidence is enough, when to commit to a theory versus when to keep exploring.

+ Second, beyond pure reasoning, effective science depends on psychological factors that are rarely evaluated: **calibration** (does my confidence match my actual accuracy?) [@lichtenstein1977calibration], **metacognition** (how certain am I about my uncertainty?) [@flavell1979metacognition], and resistance to **cognitive biases** like confirmation bias (seeking only evidence that supports my current hypothesis instead of trying to challenge it) [@nickerson1998confirmation]. A scientist who is brilliant at deduction but overconfident in weak theories will waste resources pursuing dead ends. One who is well-calibrated but overly cautious may never publish.

  We wanted to test whether LLMs can exhibit these deeper aspects of scientific reasoning. To do this, we turned to an unlikely source: a 1950s card game called Eleusis.

  ## The Eleusis Game

+ Eleusis was designed by @abbott1977eleusis explicitly to simulate the process of scientific discovery. In the original game, one player invents a secret rule governing which cards can be played, and other players must deduce the rule through experimentation, by playing cards and observing whether they are accepted or rejected.

  **Eleusis is a microcosm of the scientific method:** the rule is a hidden law of nature, each card play is an experiment, and the sequence of accepted and rejected cards is the accumulating evidence.

  <Image
    src={exampleSequence}
    alt="Example Eleusis game sequence with the secret rule 'alternating colors': mainline shows 5♠, J♥, J♠, A♦, 6♣ following the pattern, while the sideline below shows rejected cards 10♣ after J♠, and Q♥ and 2♦ after A♦"
+ caption="An example Eleusis game. The secret rule here is 'colors must alternate'. The main line (top) shows the sequence of accepted cards: 5♠ → J♥ → J♠ → A♦ → 6♣, alternating between black and red. The sideline (bottom) shows cards that were tried but rejected because they violate the rule, for instance 10♣ after J♠, or Q♥ and 2♦ after A♦."
    id="fig-example-sequence"
    preserveColors
  />

  We built a benchmark around Eleusis to evaluate LLMs on this iterative, hypothesis-driven reasoning. Rather than testing knowledge retrieval or instruction-following, our benchmark asks: *can models act like scientists?* Can they observe evidence, form hypotheses, design informative experiments, and refine their theories? Can they calibrate their confidence appropriately and know when they've gathered enough evidence to commit to a conclusion?

+ These skills are fundamental not just to science, but to debugging code, medical diagnosis, and everyday reasoning under uncertainty.
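To make the game mechanics concrete: a secret rule is just a predicate over the growing card sequence. Here is a minimal Python sketch replaying the example game from the figure; the card encoding and function names are made up for illustration and are not the benchmark's actual implementation.

```python
# Sketch: an Eleusis secret rule as a predicate over the card sequence.
# Hypothetical representation; the benchmark's real implementation may differ.

RED_SUITS = {"♥", "♦"}

def color(card: str) -> str:
    """Return 'red' or 'black' from a card string like '5♠' or 'J♥'."""
    return "red" if card[-1] in RED_SUITS else "black"

def alternating_colors(mainline: list[str], candidate: str) -> bool:
    """Secret rule: each new card must differ in color from the last accepted one."""
    if not mainline:
        return True  # any first card is accepted
    return color(candidate) != color(mainline[-1])

# Replaying the example game from the figure:
mainline = []
for card in ["5♠", "J♥", "J♠", "A♦", "6♣"]:
    assert alternating_colors(mainline, card)  # accepted → goes on the main line
    mainline.append(card)

assert not alternating_colors(mainline[:3], "10♣")  # rejected after J♠
assert not alternating_colors(mainline[:4], "Q♥")   # rejected after A♦
```

From the player's point of view, each call to the predicate is one experiment: the card is the intervention, and accept/reject is the observation.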
app/src/content/chapters/eleusis/results.mdx CHANGED
@@ -4,7 +4,7 @@ import Note from "../../../components/Note.astro";
  import Sidenote from "../../../components/Sidenote.astro";
  import HtmlEmbed from "../../../components/HtmlEmbed.astro";

- ## Results

  ### Overall Performance

@@ -17,44 +17,50 @@ We evaluated ten models on the Eleusis benchmark, including both proprietary and
  wide
  />

- Performance varies dramatically among tested models. Claude Opus 4.5 achieves top performance with moderate token usage. The open-weight model Kimi K2 comes second and performs competitively with the best proprietary models, outperforming GPT 5.2 High and being closed to Claude Opus 4.5, but at the price of a 2.5× larger reasoning budget.

- GPT 5.2 High and Grok 4.1 Fast Reasoning show a similar performance but GPT 5.2 High is significantly more token efficient.

- GPT-5-Mini, GPT OSS-120B and Gemini 3 Flash Preview Low cluster in the mid-tier (around 13) with moderate token usage.While Deepseek R1, an open-weight model specialized for reasoning tasks, achieves a similar score with much larger token count.

- Finally, GPT-OSS 20B and Claude Haiku 4.5 lag behind, scoring between 11 and 12 with moderate token usage.

- As we mentionned, this score reflects not only the pure model's ability to find the correct rule, but also its metacognitive skills: knowing when to commit, how confident it is, and how to balance exploration vs. exploitation. To distinguish these factors, we also computed an alternative "no-stakes" score that removes penalties for wrong guesses and counts tentative rules as guesses. This allows us to isolate pure rule-discovery ability from metacognitive skills.

  ### Pure discovery versus metacognition

- The following chart shows the score of each model, and which score it would have achieved under a "no stakes" scenario where guessing is free and systematic.

  <HtmlEmbed
    src="score-stack.html"
- caption="<strong>Figure 2:</strong> Score breakdown under alternative scoring systems. Blue shows raw score (standard scoring). while green shows no-stakes gain (additional gain from removing wrong-guess penalties). Models sorted by total no-stakes score."
    id="fig-score-stack"
    wide
  />

- Even if using this alternative scoring does not change a lot the relative ranking of models, it reveals important differences in their behavior. GPT 5.2 High and Claude Haiku 4.5 are the two models with the largest difference between raw and no-stakes scores (more than 4) while Gemini and Kimi K2 have the smallest difference (less than 3).

- They might be two reason for the difference between the raw and the no-stakes scores:
  1. The model is reckless and makes a lot of wrong guesses, incurring penalties.
  2. The model is too cautious and waits too long before guessing, missing out on points.

  We analyze these two aspects in more details below.

-
  ### The Caution-Recklessness Trade-off

- To estimate how reckless or cautious a model is, we can compute the average number of failed guesses per round (recklessness). It directly relates to how many points a model loses due to wrong guesses.

- To estimate caution, we can compute on average how many turns a model waits while having the correct tentative rule before actually guessing it. This relates to how many points a model loses by waiting too long to commit.

  <Sidenote>
- This trade-off mirrors a fundamental tension in science: being overconfident too early might risk false positives, leading to wasted resources and reputational damage; being overly cautious can delay discoveries and allow others to scoop you. Scientists must balance the risk of trying to publish too early and risk being wrong, wait too long and lose priority (or in our case, points).
  </Sidenote>

  <HtmlEmbed
@@ -63,32 +69,33 @@ To estimate caution, we can compute on average how many turns a model waits whil
  id="fig-caution-reckless"
  />

- How should we interpret those values ? Knowing that a failed guess costs 2 points, while each turn of delay costs 1 point, the optimal number of failed guesses per round should be around 0.5 (i.e., 1 failed guess every 2 rounds) to balance the two sources of loss. We can see that most models are above that threshold, indicating a tendency towards recklessness. This is confirmed by the fact that they have a low caution value (most models wait around 1 turns on average before guessing when they have the correct rule).
-
- On the other hand, GPT 5.2 High has a singular behavior with very few failed guesses (0.28 per round) but a high caution (waiting 3.5 turns on average before guessing when it has the correct rule). Gemini 3 Flash Preview Low and GPT 5 Mini Medium are intermediate in both dimensions, Gemini achieving a better balance with on average 2 points lost due to recklessness and 2 points lost due to caution.

  To try to understand deeper the causes of recklessness and caution, we now turn to an analysis of confidence and guessing strategies.

  ### Confidence and Calibration

- Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempted to guess. This allows us to evaluate calibration: does reported confidence match actual accuracy?

  <HtmlEmbed
    src="calibration-curves.html"
- caption="<strong>Figure 4:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence: they correspond to confidence levels where actual success rates are lower than reported. Click legend items to show/hide models."
    id="fig-calibration"
  />

  The calibration analysis reveals several patterns:

- - **All models are overconfident** : for instance when they report 80% confidence, their actual success rates are often closer to 20% !
- - GPT 5.2 is the best calibrated model overall.
  - Even models with a strong performance like Claude Opus 4.5 and Kimi K2 show significant overconfidence.

- Is overconfidence a problem ? It depends on how the model decides to act on it.

- For a perfectly calibrated model, as the expected loss for a failed guess is twice the expected opportunity cost of waiting one turn, the optimal confidence threshold for guessing is 0.67 (i.e., guess when you believe your tentative rule has at least a 67% chance of being correct). But do model follow such a strategy ? For this, we can look at how often models guess at each confidence level.

  <HtmlEmbed
@@ -97,22 +104,20 @@ For a perfectly calibrated model, as the expected loss for a failed guess is twi
  id="fig-confidence"
  />

- We can see that some models like Grok 4.1 or Gemini 3 will essentially only guess when very confident (9 or 10). Most other models will guess at confidence levels above 8 and rarely below. The two Claude models show different behaviors: Claude Opus 4.5 tends to guess more agressively at confidence level 8, while Claude Haiku 4.5 guesses even at confidence level 7.
-
- We can see that models on average are more cautious than the optimal decision-theoretic strategy for a perfectly calibrated model, which would guess as soon as confidence exceeds 67%. THis is somehow a good thing, given that all models are overconfident. By raising the bar for guessing, they reduce the risk of wrong guesses and compensate for their poor calibration.

- This is particularly true for Gemini 3 Flash Preview Low which is very cautious despite being overconfident, and this is probably what helps it achieve a good balance between failed guesses and lost opportunity cost. It is also consistent with the fact that it's the model with the smallest difference between raw and no-stakes scores.

- The case of GPT 5.2 High is different: it is both fairly well calibrated and very cautious, leading to very few failed guesses but a high opportunity cost due to delayed guessing. This suggests that GPT 5.2 High could improve its performance by being more agressive in guessing once it has a correct tentative rule.

- ### Performance by Rule

- Not all rules are created equal. Some rules are discovered quickly by all models (e.g. "All cards must be red") while others prove consistently challenging (e.g. "increase rank after a red card, decrease after a black").

- It is not easy to quantify rule complexity, as it depends on multiple factors: the inherent logical complexity of the rule, how familiar the concept is to models, and how much evidence is needed to distinguish it from alternatives. We create a crude complexity score for each rule based on the complexity code implementation, as measured by cyclomatic complexity and Abstract Syntax Tree node count.

- The following figure breaks down performance by rule across all models and runs.

  <HtmlEmbed
    src="by-rule.html"
@@ -121,9 +126,15 @@ The following figure breaks down performance by rule across all models and runs.
  wide
  />

- We can see that the most complex rules are devastating for the reckless models like Claude Haiku 4.5 and DeepSeek R1, which often negative scores on these rules due to multiple wrong guesses. Even the best models struggle on the hardest rules, but their superior metacognition allows them to avoid catastrophic failures.

- The following plot breaks down the relative score of each model (as measured by score on the rule divided by average score on all rules) against the complexity metrics of each rule.

  <HtmlEmbed
    src="complexity-analysis.html"
@@ -131,6 +142,21 @@ The following plot breaks down the relative score of each model (as measured by
  id="fig-complexity"
  />

- <Note variant="info">
- Interestingly, code complexity (cyclomatic complexity, AST node count) doesn't perfectly predict difficulty. Semantically simple rules like "only face cards" can be harder than structurally complex rules if the semantic concept is unfamiliar to models.
- </Note>

  import Sidenote from "../../../components/Sidenote.astro";
  import HtmlEmbed from "../../../components/HtmlEmbed.astro";

+ ## 2. Results

  ### Overall Performance

  wide
  />

+ Performance varies dramatically among tested models.

+ * **Claude Opus 4.5** achieves top performance with a score of 17.0 and moderate token usage. The open-weight model **Kimi K2 Thinking** comes second at 16.2 and performs competitively with the best proprietary models (outperforming GPT 5.2 High and being close to Claude Opus 4.5), but at the price of a 2.5× larger reasoning budget.

+ * **GPT 5.2 High** and **Grok 4.1 Fast Reasoning** show a similar performance around 15, but GPT 5.2 High is 3 times more token efficient.

+ * **GPT-5-Mini**, **GPT OSS-120B** and **Gemini 3 Flash Preview Low** cluster in the mid-tier (around 13) with low token usage, while DeepSeek R1, an open-weight model specialized for reasoning tasks, achieves a similar score but with a much larger token count.

+ * Finally, **GPT-OSS 20B** and **Claude Haiku 4.5** lag behind, scoring between 11 and 12 with moderate token usage.
+
+ As we mentioned, this score reflects not only the model's pure ability to find the correct rule, but also its metacognitive skills: knowing when to commit, how confident it is, and how to balance exploration vs. exploitation. To distinguish these factors, we also computed an alternative "no-stakes" score that removes penalties for wrong guesses and counts tentative rules as guesses.

  ### Pure discovery versus metacognition

+ We use the same game data but apply a different scoring system that reflects the pure ability to discover the rule, without the metacognitive aspect of knowing when to commit. **In this "no stakes" scenario, guessing is free and systematic**: at each turn, if the model has the correct tentative rule, it is considered to have guessed it correctly (even if it didn't formally attempt to guess); if the tentative rule is incorrect, it is considered a wrong guess, but without penalty.
+
+ The following chart shows the initial score of each model, and the (higher) score it would have achieved under the "no stakes" scenario. This allows us to isolate pure rule-discovery ability from metacognitive skills.

  <HtmlEmbed
    src="score-stack.html"
+ caption="<strong>Figure 2:</strong> Score breakdown under alternative scoring systems. Blue shows raw score (standard scoring), while green shows no-stakes gain (additional gain from systematic guessing and removing wrong-guess penalties). Models sorted by total no-stakes score."
    id="fig-score-stack"
    wide
  />

+ Even if this alternative scoring does not change the relative ranking of models much, it reveals important differences in their behavior.
+
+ * GPT 5.2 High and Claude Haiku 4.5 are the two models with the largest difference between raw and no-stakes scores (more than 4), suggesting they are the most penalized by wrong guesses or delayed guessing.
+ * On the other hand, Gemini 3 Flash Preview Low and Kimi K2 have the smallest difference (less than 3) and benefit the least from this alternative scoring, indicating a better balance between discovery and metacognition.

+ There might be two reasons for the difference between the raw and the no-stakes scores:
  1. The model is reckless and makes a lot of wrong guesses, incurring penalties.
  2. The model is too cautious and waits too long before guessing, missing out on points.

  We analyze these two aspects in more detail below.

  ### The Caution-Recklessness Trade-off

+ To estimate how reckless or cautious a model is, we can compute **the average number of failed guesses per round** (recklessness). It directly relates to how many points a model loses due to wrong guesses.

+ To estimate caution, we can compute on average **how many turns a model waits, while having the correct tentative rule, before actually guessing it**. This relates to how many points a model loses by waiting too long to commit.
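These two measurements can be sketched directly from per-round logs. The field names below are hypothetical stand-ins for whatever the benchmark actually records.

```python
# Sketch of the two behavioral metrics, from a simplified per-round log.
# Field names ("failed_guesses", "first_correct_turn", "guess_turn") are
# hypothetical; they stand in for the benchmark's actual log fields.

def recklessness(rounds: list[dict]) -> float:
    """Average number of failed guesses per round."""
    return sum(r["failed_guesses"] for r in rounds) / len(rounds)

def caution(rounds: list[dict]) -> float:
    """Average number of turns a model sits on the correct tentative rule
    before committing, over rounds where the rule was eventually guessed."""
    solved = [r for r in rounds
              if r["guess_turn"] is not None and r["first_correct_turn"] is not None]
    return sum(r["guess_turn"] - r["first_correct_turn"] for r in solved) / len(solved)

rounds = [
    {"failed_guesses": 2, "first_correct_turn": 3, "guess_turn": 4},
    {"failed_guesses": 0, "first_correct_turn": 5, "guess_turn": 8},
    {"failed_guesses": 1, "first_correct_turn": None, "guess_turn": None},  # never found
]
print(recklessness(rounds))  # → 1.0
print(caution(rounds))       # → 2.0
```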

  <Sidenote>
+ This trade-off mirrors a fundamental tension in science: being overconfident too early risks false positives, leading to wasted resources and reputational damage; being overly cautious can delay discoveries, waste resources, and allow others to scoop you. Scientists must balance the two: publish too early and risk being wrong, or wait too long and lose priority (or in our case, points).
  </Sidenote>

  <HtmlEmbed

  id="fig-caution-reckless"
  />

+ How should we interpret those values? Knowing that a failed guess costs 2 points, while each turn of delay costs 1 point, the optimal number of failed guesses per round should be around 0.5 (i.e., 1 failed guess every 2 rounds) to balance the two sources of loss. We can see that most models are above that threshold, indicating **a clear tendency towards recklessness**. This is confirmed by their low caution values (most models wait around 1 turn or less on average before guessing when they have the correct rule).

+ On the other hand, **GPT 5.2 High has a singular behavior** with very few failed guesses (0.28 per round) but a high caution (waiting 3.5 turns on average before guessing when it has the correct rule). Gemini 3 Flash Preview Low and GPT 5 Mini Medium are intermediate in both dimensions, with Gemini achieving a better balance of on average 2 points lost due to caution and 2 points lost due to recklessness (1 failed guess every round on average).

  To understand the causes of recklessness and caution more deeply, we now turn to an analysis of confidence and guessing strategies.

  ### Confidence and Calibration

+ Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempt to do so. **This allows us to evaluate calibration: does reported confidence match actual accuracy?** This is particularly relevant as modern neural networks have been shown to be poorly calibrated [@guo2017calibration].
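The calibration curve itself is simple to compute: bin observations by reported confidence level and compare each level's claimed probability to the empirical success rate. A minimal sketch, with a made-up observation format:

```python
# Sketch: building a calibration curve from (reported confidence, was_correct)
# pairs. The observation format is an illustrative assumption.
from collections import defaultdict

def calibration_curve(observations: list[tuple[int, bool]]) -> dict[int, float]:
    """Map each reported confidence level to the empirical success rate
    of the tentative rules reported at that level."""
    hits, totals = defaultdict(int), defaultdict(int)
    for level, correct in observations:
        totals[level] += 1
        hits[level] += correct
    return {level: hits[level] / totals[level] for level in sorted(totals)}

# An overconfident model: at reported confidence 8 (i.e., claimed 80%),
# only 1 of 4 tentative rules was actually correct.
obs = [(8, True), (8, False), (8, False), (8, False), (10, True), (10, True)]
print(calibration_curve(obs))  # → {8: 0.25, 10: 1.0}
```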

  <HtmlEmbed
    src="calibration-curves.html"
+ caption="<strong>Figure 4:</strong> Calibration curves for each model (for reported confidence ≥5). A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence: they correspond to confidence levels where actual success rates are lower than reported. Click legend items to show/hide models."
    id="fig-calibration"
  />

  The calibration analysis reveals several patterns:

+ - **All models are very overconfident**: for instance, when they report 80% confidence, their actual success rates are often closer to 20%!
+ - GPT 5.2 is the best calibrated model overall, being the closest to the diagonal line, although it is still slightly overconfident.
  - Even models with a strong performance like Claude Opus 4.5 and Kimi K2 show significant overconfidence.

+ Is overconfidence a problem? In our setting, not necessarily; it depends on how the model decides to act on it.

+ **For a perfectly calibrated model**, as the expected loss from a failed guess is twice the expected opportunity cost of waiting one turn, **the optimal confidence threshold for guessing is 0.67** (i.e., guess when you believe your tentative rule has at least a 67% chance of being correct). But do models follow such a strategy?
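One back-of-the-envelope way to recover this threshold (a simplified reading of the decision problem, ignoring the full sequential analysis): let $p$ be the probability that the current tentative rule is correct. Guessing now risks the 2-point penalty with probability $1-p$, while waiting one more turn forfeits 1 point precisely in the case where the rule was already correct. The two expected losses balance at:

$$\underbrace{(1-p) \times 2}_{\text{expected penalty if guessing now}} = \underbrace{p \times 1}_{\text{expected cost of waiting one turn}} \quad\Longrightarrow\quad p = \frac{2}{3} \approx 0.67$$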
+
+ To see this, we can look at how often models guess at each reported confidence level, as shown in the following figure. For each confidence level (from 5 to 10), we compute the guess rate: the fraction of turns on which the model actually attempts to guess when reporting that confidence.
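Concretely, the guess rate at a given level is just a filtered ratio over the turn log; the `(confidence, did_guess)` format below is an illustrative assumption.

```python
# Sketch: guess rate per reported confidence level, i.e. the fraction of turns
# at a given level where the model actually committed to a formal guess.
# The (confidence, did_guess) log format is assumed for illustration.

def guess_rate(turns: list[tuple[int, bool]], level: int) -> float:
    """Fraction of turns reported at `level` on which the model guessed."""
    at_level = [did_guess for conf, did_guess in turns if conf == level]
    return sum(at_level) / len(at_level)

# A cautious model: six turns reported at confidence 9, only two actual guesses.
turns = [(9, False)] * 4 + [(9, True)] * 2 + [(10, True)]
print(round(guess_rate(turns, 9), 2))  # → 0.33
print(guess_rate(turns, 10))           # → 1.0
```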

  <HtmlEmbed

  id="fig-confidence"
  />

+ Once again, we observe significant differences from one model to another. Grok 4.1 and Gemini 3 will essentially only guess when very confident (9 or 10). Most other models will also often guess at confidence levels above 8 and rarely below. The two Claude models show different behaviors: Claude Opus 4.5 tends to guess more aggressively at confidence level 8, while Claude Haiku 4.5 often guesses even at confidence level 7.

+ We can see that **models on average are more cautious than the optimal decision-theoretic strategy** for a perfectly calibrated model, which would guess as soon as confidence exceeds 67%. This is, in a way, a good thing for them, given that all models are overconfident. **By raising the threshold for guessing, they reduce the risk of wrong guesses and compensate for their poor calibration.**

+ This is particularly true for Gemini 3 Flash Preview Low, which is very cautious, guessing only 1/3 of the time at reported confidence 9! This compensates for its overconfidence, which is probably what helps it achieve a good balance between failed guesses and lost opportunity cost. This is reflected in our "no-stakes" analysis by the fact that it's the model with the smallest difference between raw and no-stakes scores.

+ The case of GPT 5.2 High is different: it is both fairly well calibrated and very cautious, leading to very few failed guesses but a high opportunity cost due to delayed guessing. This suggests that GPT 5.2 High could improve its performance by being more aggressive in guessing once it has a correct tentative rule, especially at confidence level 8.

+ ### Performance by Rule Complexity

+ Not all rules are created equal. Some rules are discovered quickly by all models (e.g. *"all cards must be red"*) while others prove consistently challenging (e.g. *"increase rank after a red card, decrease after a black"*).

+ The following figure breaks down performance by rule across all models and runs, displaying the average success rate per rule on the left (how often the rule was found), and individual run scores as colored dots for each model on the right.

  <HtmlEmbed
    src="by-rule.html"

  wide
  />

+ It confirms that some rules are consistently easy, with low variance in score across models, while others are hard for all models. To analyze this, we need a way to quantify rule complexity. This is not straightforward, since it depends on multiple factors: the inherent logical complexity of the rule, how familiar the concept is to models, and how much evidence is needed to distinguish it from alternatives.
+
+ We created a crude complexity score for each rule based on the complexity of its code implementation, as measured by *cyclomatic complexity* and *Abstract Syntax Tree node count*. We combine these two metrics into a single indicator:
+
+ $$\text{cyclomatic\_complexity} + 0.14 \times \text{node\_count}$$

+ The coefficient 0.14 was chosen to maximize correlation with the average success rate across models, the achieved correlation being -0.66. This indicates that, as expected, more complex rules tend to have lower success rates, and validates our complexity metric as a useful proxy for rule difficulty, despite its limitations.
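Both ingredients can be computed with Python's standard `ast` module. The branch-counting approximation of cyclomatic complexity below is a simplification (dedicated tools such as `radon` count more constructs), and the rule implementations are made-up examples:

```python
# Sketch: complexity score = cyclomatic_complexity + 0.14 * AST node count.
# The branch-count approximation of cyclomatic complexity is a simplification;
# dedicated tools (e.g. radon) handle more constructs. Rules are made up.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.BoolOp, ast.IfExp)

def complexity_score(source: str, coeff: float = 0.14) -> float:
    """Cyclomatic complexity plus coeff times the AST node count."""
    tree = ast.parse(source)
    nodes = sum(1 for _ in ast.walk(tree))
    cyclomatic = 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(tree))
    return cyclomatic + coeff * nodes

simple_rule = "def ok(prev, card):\n    return card.is_red\n"
branchy_rule = (
    "def ok(prev, card):\n"
    "    if prev.is_red:\n"
    "        return card.rank > prev.rank\n"
    "    return card.rank < prev.rank\n"
)
print(complexity_score(simple_rule) < complexity_score(branchy_rule))  # → True
```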
+
+ The following plot breaks down the success rate of each model per complexity quartile.

  <HtmlEmbed
    src="complexity-analysis.html"

  id="fig-complexity"
  />

+ Interestingly, code complexity (as measured by our combination of cyclomatic complexity and AST node count) doesn't perfectly predict difficulty, as semantic concepts also play a role. For instance, a rule like "only face cards" has a complexity equivalent to "only A, 2 and 3", but the former is easier for models (and humans!) due to familiarity with the semantic category of face cards.
+
+ Rules involving rare events (low acceptance rate) are also harder: "only aces" is harder than "only even ranks" despite being structurally simpler, simply because models need more evidence to confirm it.
+
+ An interesting test: are symmetric rules equally difficult? For example, "only spades" vs "only non-spades" should be logically equivalent in difficulty, but models might have biases: the average score on "only spades" is 25, while "no spades" scores only 20.
+
+ ### Complexity of rules produced
+
+ #### Overly Complex Rules
+ Failure mode: models have a tendency to produce overcomplicated rules, even though they were informed that the rule is typically one sentence. They can produce tentative rules like "...".
+
+ TODO: Back this up with examples from logs and "guess complexity" vs "actual complexity".
+
+ #### Overfitting Rules
+ We have observed qualitative evidence of models producing overfit rules that explain all observations so far, but fail to generalize. For instance, if all accepted cards so far are red, and happen to be only number cards (simply because no red face card has been tried yet), the model may hypothesize "only red number cards" rather than the simpler "only red cards".
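This failure mode can be made concrete: with a small evidence set, the overfit and the simple hypothesis are both perfectly consistent with the data, so nothing forces the model toward the simpler one until a discriminating card is tried. A sketch with made-up card records:

```python
# Sketch: two hypotheses that both explain the evidence so far. Until a red
# face card is tried, nothing distinguishes the overfit rule from the simple
# one. The card representation is made up for illustration.

accepted = [("5", "♥"), ("9", "♦"), ("2", "♥")]   # all red, all number cards
rejected = [("7", "♣"), ("K", "♠")]               # all black

def only_red(card):
    return card[1] in {"♥", "♦"}

def only_red_numbers(card):  # the overfit hypothesis
    return card[1] in {"♥", "♦"} and card[0] not in {"J", "Q", "K"}

def consistent(rule):
    """A rule is consistent if it accepts all accepted cards and none rejected."""
    return all(rule(c) for c in accepted) and not any(rule(c) for c in rejected)

print(consistent(only_red), consistent(only_red_numbers))  # → True True
# A single probe — playing a red face card like ("Q", "♥") — would separate them:
print(only_red(("Q", "♥")), only_red_numbers(("Q", "♥")))  # → True False
```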