First draft
- app/src/content/article.mdx +22 -37
- app/src/content/assets/figures/basic_metrics.csv +3 -0
- app/src/content/assets/figures/by_rule.json +3 -0
- app/src/content/assets/figures/by_rule.png +3 -0
- app/src/content/assets/figures/calibration_curves.json +3 -0
- app/src/content/assets/figures/calibration_curves.png +3 -0
- app/src/content/assets/figures/complexity_analysis.json +3 -0
- app/src/content/assets/figures/complexity_analysis.png +3 -0
- app/src/content/assets/figures/confidence_distribution.json +3 -0
- app/src/content/assets/figures/confidence_distribution.png +3 -0
- app/src/content/assets/figures/overall_performance.json +3 -0
- app/src/content/assets/figures/overall_performance.png +3 -0
- app/src/content/assets/figures/score_vs_failed_guesses.json +3 -0
- app/src/content/assets/figures/score_vs_failed_guesses.png +3 -0
- app/src/content/assets/figures/summary.txt +109 -0
- app/src/content/chapters/eleusis/analysis.mdx +82 -0
- app/src/content/chapters/eleusis/appendix.mdx +87 -0
- app/src/content/chapters/eleusis/benchmark.mdx +69 -0
- app/src/content/chapters/eleusis/conclusion.mdx +56 -0
- app/src/content/chapters/eleusis/introduction.mdx +34 -0
- app/src/content/chapters/eleusis/results.mdx +103 -0
app/src/content/article.mdx
CHANGED
@@ -1,57 +1,42 @@
  ---
- title: "
- subtitle: "
- description: "
+ title: "Are LLMs any good at the Science Game?\n Evaluating scientific reasoning using the card game Eleusis"
+ subtitle: "Testing LLM calibration and iterative hypothesis formation"
+ description: "A benchmark for evaluating LLM scientific reasoning using the card game Eleusis, testing iterative hypothesis formation, calibration, and strategic experimentation."
  authors:
-   - name: "
-     url: "https://huggingface.co/
+   - name: "David Louapre"
+     url: "https://huggingface.co/dlouapre"
      affiliations: [1]
  affiliations:
    - name: "Hugging Face"
      url: "https://huggingface.co"
- published: "
- doi: 10.1234/abcd.efgh
+ published: "Jan. 22, 2026"
  licence: >
-   Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/
-   Figures reused from other sources are excluded and marked in their captions (“Figure from …”).
+   Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/dlouapre/eleusis-benchmark" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
  tags:
-   -
-   -
+   - LLM evaluation
+   - scientific reasoning
+   - benchmarks
+   - calibration
  tableOfContentsAutoCollapse: true
  pdfProOnly: false
  showPdf: true
  ---

- import Introduction from "./chapters/
- import
- import
- import
- import
- import
- import Markdown from "./chapters/demo/markdown.mdx";
- import Components from "./chapters/demo/components.mdx";
- import Greetings from "./chapters/demo/greetings.mdx";
- import VibeCodingCharts from "./chapters/demo/vibe-coding-charts.mdx";
- import ImportContent from "./chapters/demo/import-content.mdx";
+ import Introduction from "./chapters/eleusis/introduction.mdx";
+ import Benchmark from "./chapters/eleusis/benchmark.mdx";
+ import Results from "./chapters/eleusis/results.mdx";
+ import Analysis from "./chapters/eleusis/analysis.mdx";
+ import Conclusion from "./chapters/eleusis/conclusion.mdx";
+ import Appendix from "./chapters/eleusis/appendix.mdx";

  <Introduction />

- <
- <
- <
- <Markdown />
- <Components />
- <VibeCodingCharts />
- <ImportContent />
- <BestPractices />
- <Greetings />
+ <Benchmark />
+ <Results />
+ <Analysis />
+ <Conclusion />
+ <Appendix />
app/src/content/assets/figures/basic_metrics.csv
ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0e0bdb47eeb82b62a05a7d6dd2b3815404567be86ea4f7cc44a7f2e47a262d35
+ size 1372

app/src/content/assets/figures/by_rule.json
ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7883abbd4a92c8f305c5c030315878579bb42d6acfcefe24d7d96d550f47120d
+ size 5864

app/src/content/assets/figures/by_rule.png
ADDED
Git LFS Details

app/src/content/assets/figures/calibration_curves.json
ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:6db808595939baa8afcef3106b6963d19940949b864a06a80c0b7e479d03b38e
+ size 5681

app/src/content/assets/figures/calibration_curves.png
ADDED
Git LFS Details

app/src/content/assets/figures/complexity_analysis.json
ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9ad53beba3b7e00c248664f291eaba015dd716be80013584479952bc26c79f83
+ size 1612

app/src/content/assets/figures/complexity_analysis.png
ADDED
Git LFS Details

app/src/content/assets/figures/confidence_distribution.json
ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:67d35eb63310d743c06a7a5b401228792e3532d6c22880369d61b2d4efb213b1
+ size 5577

app/src/content/assets/figures/confidence_distribution.png
ADDED
Git LFS Details

app/src/content/assets/figures/overall_performance.json
ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c620d1614704161071e6b3fdf51031228bc35a0aab8f70d6221f024a68e21e32
+ size 1413

app/src/content/assets/figures/overall_performance.png
ADDED
Git LFS Details

app/src/content/assets/figures/score_vs_failed_guesses.json
ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:abff24e6757f5f108647f4a42dcecf7f85a38f9b2dc509eab02884cd311d685d
+ size 1372

app/src/content/assets/figures/score_vs_failed_guesses.png
ADDED
Git LFS Details
app/src/content/assets/figures/summary.txt
ADDED
@@ -0,0 +1,109 @@
+ ============================================================
+ ELEUSIS RESULTS ANALYSIS
+ ============================================================
+
+ Analyzing: results/260121_78_rounds
+
+ Loading results...
+ Loaded 6 evaluation runs:
+   - solo_evaluation_20260120_091620_gpt_5_2_high
+   - solo_evaluation_20260120_091622_gpt_oss_120b
+   - solo_evaluation_20260121_070517_claude_haiku_4_5
+   - solo_evaluation_20260121_070518_gpt_5_mini_medium
+   - solo_evaluation_20260121_070520_gemini_3_flash_preview_low
+   - solo_evaluation_20260121_070522_gpt_oss_20b
+
+ Extracted 26 unique rules from results files
+ Built DataFrames: 468 rounds, 7836 turns
+ Loaded colors for 17 models
+
+ ============================================================
+ BASIC MODEL COMPARISON
+ ============================================================
+
+ model rounds_played total_score avg_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio
+ Gpt 5.2 High 78 1102 14.128205 1200 3341037 73525.83 0.333333 0.961538 2784.197500 61.271525 25.346154 36.062906 0.702832
+ Gpt 5 Mini Medium 78 1001 12.833333 1247 3618399 58345.97 1.256410 0.756410 2901.683240 46.789070 40.051282 79.228889 0.505514
+ Gemini 3 Flash Preview Low 78 955 12.243590 1299 1581524 12702.02 1.717949 0.769231 1217.493457 9.778306 35.910256 81.480513 0.440722
+ Gpt Oss 120B 78 938 12.025641 1226 3190828 24633.15 3.692308 0.756410 2602.632953 20.092292 51.320513 80.710427 0.635860
+ Gpt Oss 20B 78 773 9.910256 1277 7009392 62397.50 6.205128 0.717949 5488.952232 48.862569 80.782051 122.849402 0.657570
+ Claude Haiku 4.5 78 713 9.141026 1223 6973411 57734.39 7.551282 0.705128 5701.889616 47.207187 88.576923 152.125983 0.582260
+
+ Saved: results/260121_78_rounds/basic_metrics.csv
+ Saved: results/260121_78_rounds/overall_performance.png
+ Saved: results/260121_78_rounds/overall_performance.json
+ Saved: results/260121_78_rounds/score_vs_failed_guesses.png
+ Saved: results/260121_78_rounds/score_vs_failed_guesses.json
+ Saved: results/260121_78_rounds/calibration_curves.png
+ Saved: results/260121_78_rounds/calibration_curves.json
+ Saved: results/260121_78_rounds/confidence_distribution.png
+ Saved: results/260121_78_rounds/confidence_distribution.json
+
+ ============================================================
+ BY-RULE ANALYSIS
+ ============================================================
+
+ Score by rule (sorted by avg_score):
+ rule_description count avg_score std_score success_rate
+ Only red cards (hearts or diamonds). 18 23.888889 2.541164 1.000000
+ Cards must alternate between red and black colors. Any card may start the line. 18 23.500000 3.166925 1.000000
+ Only cards of the suit spades. 18 23.444444 2.254987 1.000000
+ Only cards with an even rank (2,4,6,8,10,12). 18 22.333333 2.950573 1.000000
+ The card must be of a different suit than the card just before it. Any card may start the line. 18 19.277778 7.282578 0.944444
+ Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 18 19.000000 5.636019 1.000000
+ Only hearts, clubs, and diamonds allowed. Spades are forbidden. 18 18.333333 5.851093 0.944444
+ Only ranks that are prime numbers (2,3,5,7,11,13). 18 18.000000 6.859943 0.944444
+ The card must be of a different suit than but same color as the card just before it. Any card may start the line. 18 17.944444 9.295617 1.000000
+ Only spades and diamonds. 18 17.500000 4.973459 1.000000
+ Only face cards (11,12,13). 18 16.388889 9.356589 0.833333
+ Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 18 16.388889 7.769767 1.000000
+ Only Aces (rank 1) . 18 16.111111 9.682543 0.944444
+ Only cards between 1 and 7 inclusive. 18 10.277778 8.870344 0.944444
+ Only black face cards. 18 7.111111 10.093031 0.833333
+ Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 18 6.277778 11.113349 0.500000
+ Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 18 6.055556 11.305762 0.611111
+ Only red cards whose rank is <=7. 18 5.611111 10.330645 1.000000
+ Alternate face and number cards. Any card may start the line. 18 5.333333 12.362181 0.611111
+ Only cards between 5 and 9 inclusive. 18 4.500000 9.977917 0.888889
+ Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 18 1.944444 12.511041 0.777778
+ Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 18 1.666667 3.880570 0.333333
+ Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 18 1.444444 4.217920 0.111111
+ If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 18 1.444444 5.690262 0.277778
+ Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 18 0.833333 6.242643 0.277778
+ Face cards (11-13) must be red; number cards (1-10) must be black. 18 -0.055556 7.255604 0.444444
+
+ Saved: results/260121_78_rounds/by_rule.png
+ Saved: results/260121_78_rounds/by_rule.json
+
+ ============================================================
+ COMPLEXITY ANALYSIS
+ ============================================================
+
+ Optimal K for aggregated complexity: 0.05
+ Formula: complexity = cyclomatic + 0.05 * node_count
+ Correlation with relative_score: -0.429
+
+ Score by complexity quartile:
+ complexity_bin count avg_score avg_relative_score success_rate
+ Q1 144 16.909722 1.478439 0.944444
+ Q2 90 12.911111 1.105104 0.877778
+ Q3 126 12.150794 1.021103 0.761905
+ Q4 108 3.277778 0.249874 0.490741
+
+ Saved: results/260121_78_rounds/complexity_analysis.png
+ Saved: results/260121_78_rounds/complexity_analysis.json
+
+ ============================================================
+ PER-MODEL REPORTS
+ ============================================================
+
+ Saved: results/260121_78_rounds/model_gpt_5_2_high.png
+ Saved: results/260121_78_rounds/model_gpt_oss_120b.png
+ Saved: results/260121_78_rounds/model_claude_haiku_4_5.png
+ Saved: results/260121_78_rounds/model_gpt_5_mini_medium.png
+ Saved: results/260121_78_rounds/model_gemini_3_flash_preview_low.png
+ Saved: results/260121_78_rounds/model_gpt_oss_20b.png
+
+ ============================================================
+ Analysis complete! All outputs saved to: results/260121_78_rounds
+ ============================================================
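The aggregated complexity formula reported in summary.txt (complexity = cyclomatic + 0.05 * node_count, with K = 0.05 found optimal) can be sketched as a small helper. `cyclomatic` and `node_count` are assumed to be pre-computed from each rule's compiled Python function; the function name is illustrative, not from the repository:

```python
def rule_complexity(cyclomatic: int, node_count: int, k: float = 0.05) -> float:
    """Aggregated complexity from summary.txt: cyclomatic complexity
    plus a small per-AST-node penalty weighted by k."""
    return cyclomatic + k * node_count

# A hypothetical rule whose checker has cyclomatic complexity 3 and 40 AST nodes
print(rule_complexity(3, 40))  # 5.0
```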
app/src/content/chapters/eleusis/analysis.mdx
ADDED
@@ -0,0 +1,82 @@
+ import Note from "../../../components/Note.astro";
+ import Sidenote from "../../../components/Sidenote.astro";
+ import Accordion from "../../../components/Accordion.astro";
+
+ ## Deeper Analysis
+
+ ### Learning Curves
+
+ How do models improve within a single round? We tracked confidence and hypothesis quality over turn number to understand the learning dynamics.
+
+ <Note variant="info">
+ **TODO**: Add figure showing line plot of average confidence by turn number, colored by eventual success/failure.
+ </Note>
+
+ Key observations:
+ - **Successful rounds** typically show steadily increasing confidence with occasional drops when hypotheses are revised
+ - **Failed rounds** often show erratic confidence or premature plateaus where models become stuck on incorrect hypotheses
+ - **Acceptance rate decreases** over time as obvious cards are exhausted from the hand
+
+ <Sidenote>
+ The turn-by-turn reasoning traces provide rich data for understanding model behavior beyond simple success/failure metrics.
+ </Sidenote>
+
+ ### Failure Modes
+
+ When models fail, why? We identified several recurring patterns:
+
+ <Accordion title="Failure mode taxonomy" open>
+
+ 1. **Premature guessing**: High confidence, wrong rule, insufficient evidence. The model becomes convinced too early based on limited data.
+
+ 2. **Hypothesis fixation**: Stuck on wrong rule despite contradictory evidence. The model fails to update when new observations conflict with its theory.
+
+ 3. **Overfitting**: Rule matches all observations but is more specific than the actual rule (e.g., guessing "only red hearts" when the rule is "only red cards").
+
+ 4. **Underfitting**: Rule is too simple and fails to capture necessary conditions (e.g., guessing "black cards" when the rule is "black even cards").
+
+ 5. **Position blindness**: Fails on rules depending on position in the mainline or relationship to previous cards.
+
+ </Accordion>
+
+ <Note variant="info">
+ **TODO**: Add stacked bar chart showing distribution of failure modes by model.
+ </Note>
+
+ ### Symmetric Rules
+
+ An interesting test: are symmetric rules equally difficult? For example, "only spades" vs "only non-spades" should be logically equivalent in difficulty, but models might have biases.
+
+ We found that:
+ - Negative rules ("not X") are generally harder than positive rules ("only X")
+ - Rules involving rare events (low acceptance rate) are harder than rules with high acceptance rates
+ - This may reflect training data biases where positive examples are more common
+
+ ### Confirmation Bias
+
+ Do models exhibit confirmation bias—preferring to play cards that confirm their current hypothesis rather than cards that could falsify it?
+
+ <Sidenote>
+ A good scientist designs experiments that could prove them wrong, not just experiments that confirm what they already believe.
+ </Sidenote>
+
+ Preliminary analysis suggests:
+ - Models do show some tendency toward confirmation-seeking behavior
+ - When confident in a hypothesis, models prefer "safe" plays that are likely to be accepted
+ - Strategic exploration (playing cards specifically to test hypothesis boundaries) is rare
+
+ ### Qualitative Observations
+
+ Examining individual reasoning traces reveals interesting patterns:
+
+ <Accordion title="Example: Hypothesis revision">
+
+ In one game with the rule "alternating odd/even ranks," a model initially hypothesized "increasing ranks" based on the first few accepted cards. When a lower-ranked card was accepted, instead of abandoning the hypothesis entirely, the model revised it to "ranks must differ from previous." This partial update eventually led to discovering the true rule—a good example of iterative refinement.
+
+ </Accordion>
+
+ <Accordion title="Example: Fixation failure">
+
+ With the rule "only face cards (J, Q, K)," one model became fixated on "only red cards" after the first three accepted cards happened to be red face cards. Despite subsequently seeing black face cards accepted, the model kept trying to reconcile observations with a color-based rule, eventually running out of turns.
+
+ </Accordion>
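The calibration discussed in analysis.mdx is quantified in the appendix as the mean absolute difference between stated confidence and empirical success rate at that confidence level. One reading of that definition, as a sketch — binning by the 0–10 confidence scale and averaging the per-bin gaps unweighted (the true implementation may weight bins by sample count; function and variable names are illustrative):

```python
from collections import defaultdict

def calibration_error(records):
    """records: iterable of (stated_confidence_0_to_10, was_correct) pairs.
    Groups guesses by stated confidence, then averages the absolute gap
    between each bin's stated probability (conf / 10) and its empirical
    success rate."""
    bins = defaultdict(list)
    for conf, correct in records:
        bins[conf].append(correct)
    gaps = [abs(conf / 10 - sum(outcomes) / len(outcomes))
            for conf, outcomes in bins.items()]
    return sum(gaps) / len(gaps)

# Two bins: confidence 7 (one of two correct) and confidence 9 (all correct)
records = [(7, True), (7, False), (9, True), (9, True)]
print(round(calibration_error(records), 3))  # 0.15
```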
app/src/content/chapters/eleusis/appendix.mdx
ADDED
@@ -0,0 +1,87 @@
+ import Accordion from "../../../components/Accordion.astro";
+ import Note from "../../../components/Note.astro";
+
+ ## Appendix: Detailed Methods
+
+ ### Models Evaluated
+
+ <Accordion title="Model configurations" open>
+
+ All models were evaluated using their respective APIs with the following settings:
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.0 (deterministic) |
+ | Max tokens | 4096 |
+ | Retries | 3 (on API failures) |
+
+ Reasoning models (o1, o3-mini, etc.) were allowed their default reasoning budgets. Standard models used the base inference without chain-of-thought prompting beyond what's included in the game prompt.
+
+ </Accordion>
+
+ ### Rule Checking
+
+ <Accordion title="Rule verification methodology">
+
+ Rules are created by hand and expressed in natural language. Each rule is then compiled into a Python function using an LLM, with manual verification of correctness.
+
+ When the model outputs a guessed rule, we:
+ 1. Compile the guess into a Python function using the same LLM
+ 2. Test the compiled function against all cards played in that game
+ 3. Mark the guess as correct only if it matches the true rule's behavior on all observations
+
+ This simulation-based approach avoids issues with semantic equivalence in natural language. For instance, "same color as previous card" and "red cards only" might be equivalent given a specific game history starting with a red card, but would differ on other histories.
+
+ </Accordion>
+
+ ### Prompt Structure
+
+ <Accordion title="Full prompt template">
+
+ The prompt includes:
+
+ 1. **Game rules**: Complete explanation of how Eleusis works, without mentioning the game's name to avoid potential training data leakage
+
+ 2. **Scoring system**: Explicit explanation of the scoring formula and strategic implications
+
+ 3. **Response format**: JSON schema specifying required fields (reasoning, card choice, tentative rule, confidence, guess decision)
+
+ 4. **Game state**: Current mainline, all sidelines, current hand, and reasoning from the previous 3 turns
+
+ 5. **Format reminders**: Instructions for confidence scale interpretation (7 = 70% probability)
+
+ </Accordion>
+
+ ### Evaluation Metrics
+
+ <Accordion title="Metric definitions">
+
+ - **Success rate**: Fraction of games where the model correctly identified the rule before running out of turns
+ - **Average score**: Mean score across all games, including zeros for failed games
+ - **Calibration error**: Mean absolute difference between stated confidence and empirical success rate at that confidence level
+ - **Failed guesses**: Average number of incorrect formal guesses per game
+ - **Turns to success**: For successful games, mean number of turns before correct guess
+
+ </Accordion>
+
+ ### References
+
+ <Accordion title="Bibliography">
+
+ - Abbott, R. (1963). "Eleusis" — Original game rules and design philosophy
+ - Guo, C., et al. (2017). "On Calibration of Modern Neural Networks" — Foundational work on neural network calibration
+ - Chollet, F. (2019). "On the Measure of Intelligence" — ARC benchmark and discussion of abstract reasoning
+ - Recent LLM reasoning benchmarks: GSM8K, MATH, ARC-AGI, BIG-Bench, etc.
+
+ </Accordion>
+
+ <Note>
+ Full code, data, and model outputs are available in the benchmark repository.
+ </Note>
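The simulation-based verification described in appendix.mdx — replay every card played and require identical accept/reject decisions from the true rule and the compiled guess — could be sketched as follows. The `Card` class, the two example rule functions, and `guess_matches` are illustrative stand-ins, not the repository's actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    rank: int   # 1-13 (Ace through King)
    suit: str   # "hearts", "diamonds", "clubs", "spades"

# A compiled rule takes the candidate card and the mainline of accepted cards so far.
def true_rule(card, mainline):
    """'Only red cards (hearts or diamonds).'"""
    return card.suit in ("hearts", "diamonds")

def guessed_rule(card, mainline):
    """An overfitted guess: 'only hearts'."""
    return card.suit == "hearts"

def guess_matches(true_fn, guess_fn, plays):
    """Replay every card played in the game; the guess is correct only if
    it makes the same accept/reject decision as the true rule throughout."""
    mainline = []
    for card in plays:
        if true_fn(card, mainline) != guess_fn(card, mainline):
            return False
        if true_fn(card, mainline):
            mainline.append(card)
    return True

plays = [Card(5, "hearts"), Card(9, "diamonds"), Card(2, "spades")]
print(guess_matches(true_rule, guessed_rule, plays))  # False: diverges on the diamond
```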
app/src/content/chapters/eleusis/benchmark.mdx
ADDED
|
@@ -0,0 +1,69 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import Sidenote from "../../../components/Sidenote.astro";
|
| 2 |
+
import Note from "../../../components/Note.astro";
|
| 3 |
+
import Accordion from "../../../components/Accordion.astro";
|
| 4 |
+
|
| 5 |
+
## The Eleusis Benchmark
|
| 6 |
+
|
| 7 |
+
### The Original Game
|
| 8 |
+
|
| 9 |
+
In the original Eleusis card game, one player acts as the "dealer" (sometimes called "God" or "Nature") and secretly invents a rule determining which cards can be legally played. The other players don't know this rule—they must discover it through experimentation.
|
| 10 |
+
|
| 11 |
+
Players take turns playing cards from their hand onto a central "mainline." If a card satisfies the secret rule, it's accepted and added to the mainline. If it violates the rule, it's rejected and placed in a "sideline" below the mainline at that position. Over time, the pattern of accepted and rejected cards provides evidence about the hidden rule.
|
| 12 |
+
|
| 13 |
+
<Sidenote>
|
| 14 |
+
The name "Eleusis" comes from the ancient Greek mystery cult, where initiates gradually discovered hidden truths.
|
| 15 |
+
</Sidenote>
|
| 16 |
+
|
| 17 |
+
At any point, a player can attempt to guess the rule; correctly identifying it ends the game. A specific scoring system rewards efficiency in discovering the rule while penalizing reckless guessing.
|
| 18 |
+
|
| 19 |
+
### Our Adaptation
|
| 20 |
+
|
| 21 |
+
We adapted Eleusis into a single-player benchmark focused purely on the scientific reasoning process. By removing multi-player dynamics, we isolate the core challenge: hypothesis formation and testing under uncertainty.
|
| 22 |
+
|
| 23 |
+
The game uses a standard 52-card deck with ranks 1–13 (Ace through King) and four suits. A secret rule—a deterministic function that takes the card being played and the current sequence of accepted cards (the "mainline")—determines whether each card is accepted or rejected. The player maintains a hand of 12 cards, drawing a replacement after each play.
|
| 24 |
+
|
| 25 |
+
On each turn, the player selects a card from their hand to play. If the card satisfies the secret rule, it joins the mainline; if rejected, it's placed in a sideline below the mainline at that position. At any point, the player may attempt to guess the rule.
|
| 26 |
+
|
| 27 |
+
<Sidenote>
|
| 28 |
+
We chose 12-card hands to give models enough options for strategic experimentation.
|
| 29 |
+
</Sidenote>
|
| 30 |
+
|
| 31 |
+
The game lasts at most 30 turns, with scoring designed to reward efficiency while penalizing reckless guessing:
|
| 32 |
+
|
| 33 |
+
$$\text{score} = (30 - \text{turns\_used}) - 2 \times \text{wrong\_guesses}$$
|
| 34 |
+
|
| 35 |
+
A player who correctly identifies the rule on turn 10 with no wrong guesses scores 20 points; one who made 3 wrong guesses along the way scores only 14. Failing to identify the rule scores 0. This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence—exactly the calibration we want to measure.
|
| 36 |
+
|
| 37 |
+
<Note variant="info">
|
| 38 |
+
**TODO**: Add figure showing an example turn with the game state (mainline with sidelines) and the model's structured response.
|
| 39 |
+
</Note>
|
| 40 |
+
|
| 41 |
+
### Rule Library
|
| 42 |
+
|
| 43 |
+
We created a library of 26 hand-crafted rules spanning a range of types and complexity. Some rules involve simply card properties (e.g., "only red cards"), while others depend on the sequence of previously accepted cards (e.g., "card rank must be higher than previous card"). The rule might involve rank, suits, color or a combination thereof, and may include positional dependencies.
|
| 44 |
+
|
| 45 |
+
| Category | Examples |
|
| 46 |
+
|----------|----------|
|
| 47 |
+
| Static property | "Only red cards", "Only face cards (J, Q, K)" |
|
| 48 |
+
| Combined properties | "Only hearts with rank ≤7", "Only red face cards" |
|
| 49 |
+
| Sequential | "Rank must be higher than previous card" |
|
| 50 |
+
| Cyclic patterns | "Alternate between odd and even ranks", "Suits cycle: ♥→♠→♣→♦" |
|
| 51 |
+
| Complex conditionals | "Same suit as previous OR rank differs by exactly 2" |
|
| 52 |
+
|
| 53 |
+
Each rule is played 3 times with different random seeds (affecting the initial hand and deck order). This ensures every model is tested on the same deck sequences for a given seed, and captures variance in performance when the starting hand differs.
### What the LLM Must Do

On each turn, the model receives the complete game state: the mainline of accepted cards, the sidelines of rejected cards at each position, its current hand, and its reasoning history from the 3 previous turns. It must output a structured response containing:

<Accordion title="Structured response format" open>

1. **Reasoning summary**: A brief explanation of its current thinking
2. **Card choice**: Which card to play from its hand
3. **Tentative rule**: Its current best hypothesis about the secret rule
4. **Confidence level**: A self-reported probability (0–10 scale, where 7 means "I estimate a 70% chance my tentative rule is correct")
5. **Guess decision**: Whether to formally guess the rule this turn

</Accordion>
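The five fields above map naturally onto a small schema. A sketch of what such a structure might look like (field names are illustrative; the benchmark's actual schema may differ):

```python
from dataclasses import dataclass

@dataclass
class TurnResponse:
    reasoning_summary: str   # brief explanation of current thinking
    card_choice: str         # e.g. "7H" for the seven of hearts
    tentative_rule: str      # current best hypothesis, in plain language
    confidence: int          # 0-10 scale; 7 means ~70% estimated probability
    guess_rule: bool         # whether to formally guess this turn

    def __post_init__(self):
        if not 0 <= self.confidence <= 10:
            raise ValueError("confidence must be on the 0-10 scale")

# Example of a parsed response mid-game:
resp = TurnResponse(
    reasoning_summary="Red cards keep getting accepted; black rejected.",
    card_choice="2S",        # play a black card to probe the hypothesis
    tentative_rule="Only red cards are accepted.",
    confidence=6,
    guess_rule=False,
)
```

Requiring a tentative rule and confidence on every turn, even when the model declines to guess, is what makes the calibration analysis below possible.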

This structure lets us analyze not just whether models succeed, but *how* they reason: Do they update hypotheses appropriately when evidence contradicts them? Do they explore strategically or play conservatively? Is their stated confidence calibrated to their actual accuracy?
app/src/content/chapters/eleusis/conclusion.mdx
ADDED
@@ -0,0 +1,56 @@
import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";

## Conclusion

### Key Findings

Our evaluation of LLMs on the Eleusis benchmark reveals several important insights:

1. **LLMs can do inductive reasoning**—but with significant variation across models. The best models successfully discover hidden rules through iterative experimentation, while others struggle with basic hypothesis formation.

2. **Complexity matters**—simple rules are easy, complex rules are hard. This isn't surprising, but our benchmark provides quantitative measurements of how different complexity factors affect performance.

3. **Calibration is imperfect**—models don't always know what they don't know. Most models show systematic overconfidence, particularly at high stated confidence levels.

4. **Reasoning traces are valuable**—the turn-by-turn data reveals how models think, exposing failure modes that wouldn't be visible from success/failure metrics alone.

<Sidenote>
The gap between the best and worst models is substantial, suggesting this benchmark captures meaningful capability differences.
</Sidenote>

### Limitations

This work has several important limitations:

- **Rule library scope**: 26 hand-crafted rules may not cover all types of scientific reasoning. Real-world hypothesis formation involves much more complex domains.

- **Statistical power**: 3 seeds per rule provides limited data for variance estimates. Some effects may not be reliably estimated.

- **Prompt sensitivity**: Different prompts might yield different results. We used a single carefully designed prompt but did not extensively test prompt variations.

- **No human baseline**: Without human performance data on the same rules, it's hard to contextualize whether model performance is "good" or "bad" in absolute terms.

- **Cost and API differences**: Models have different pricing and rate limits, which affects practical deployment considerations not captured here.

### What's Next

Several directions for future work:

- **More models**: As new models are released, evaluating them on this benchmark will help track progress in scientific reasoning capabilities.

- **More rules**: Expanding the rule library to cover additional reasoning patterns (temporal rules, multi-step dependencies, etc.).

- **Human comparisons**: Collecting human performance data would provide crucial context for interpreting model capabilities.

- **Interactive exploration**: Building tools to explore individual game traces could help researchers understand model reasoning more deeply.

<Note variant="info">
The benchmark is open source. Try it yourself and contribute new rules or model evaluations!
</Note>

### Final Thoughts

The Eleusis benchmark offers a window into capabilities that matter for real-world scientific reasoning: iterative hypothesis refinement, strategic experimentation, and calibrated confidence. While current LLMs show promising capabilities, significant gaps remain—particularly in calibration and avoiding cognitive biases like hypothesis fixation.

As LLMs are increasingly deployed to assist with scientific research, understanding these limitations becomes crucial. A model that is brilliant at generating hypotheses but systematically overconfident could lead researchers down unproductive paths. The Eleusis benchmark provides one lens for evaluating and improving these capabilities.
app/src/content/chapters/eleusis/introduction.mdx
ADDED
@@ -0,0 +1,34 @@
import Sidenote from "../../../components/Sidenote.astro";
import Note from "../../../components/Note.astro";

Large language models are increasingly being deployed as tools for scientific research—analyzing data, generating hypotheses, and even designing experiments. But how well do they actually embody the scientific method?

<Sidenote>
Read time: 15–20 minutes.
</Sidenote>

Most reasoning benchmarks test whether models can solve well-defined problems: given premises, derive a conclusion. The ARC challenge, for instance, evaluates inductive reasoning on visual patterns. These benchmarks capture important capabilities, but they miss something fundamental about how science actually works.

Real scientific reasoning is not a single inference step. It's an iterative process of observation, hypothesis formation, experimentation, and refinement—often spanning many cycles before reaching a conclusion. It requires not just logical ability, but also *strategic* thinking: which experiment to run next, how much evidence is enough, when to commit to a theory versus when to keep exploring.

<Sidenote>
Think of debugging code or diagnosing a medical condition—both follow this same iterative pattern.
</Sidenote>

Beyond pure reasoning, effective science depends on psychological factors that are rarely evaluated: **calibration** (does my confidence match my actual accuracy?), **metacognition** (how certain am I about my uncertainty?), and resistance to **cognitive biases** like confirmation bias (seeking only evidence that supports my current hypothesis). A scientist who is brilliant at deduction but overconfident in weak theories will waste resources pursuing dead ends. One who is well-calibrated but overly cautious may never publish.

We wanted to test whether LLMs can exhibit these deeper aspects of scientific reasoning. To do this, we turned to an unlikely source: a 1950s card game called Eleusis.

## The Eleusis Game

Eleusis was designed by Robert Abbott explicitly to simulate the process of scientific discovery. In the game, one player invents a secret rule governing which cards can be played, and the other players must deduce the rule through experimentation—playing cards and observing whether they are accepted or rejected.

It's a microcosm of the scientific method: the rule is a hidden law of nature, each card play is an experiment, and the sequence of accepted and rejected cards is the accumulating evidence.

<Note variant="info">
**TODO**: Add figure showing an example Eleusis game sequence with the secret rule "alternating colors" (red, black, red, black...).
</Note>

We built a benchmark around Eleusis to evaluate LLMs on this iterative, hypothesis-driven reasoning. Rather than testing knowledge retrieval or instruction-following, our benchmark asks: can models act like scientists? Can they observe evidence, form hypotheses, design informative experiments, and refine their theories? Can they calibrate their confidence appropriately and know when they've gathered enough evidence to commit to a conclusion?

These skills are fundamental not just to science, but to debugging code, diagnosing problems, and everyday reasoning under uncertainty.
app/src/content/chapters/eleusis/results.mdx
ADDED
@@ -0,0 +1,103 @@
import Image from "../../../components/Image.astro";
import Wide from "../../../components/Wide.astro";
import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";

import overallPerformance from "../../assets/figures/overall_performance.png";
import calibrationCurves from "../../assets/figures/calibration_curves.png";
import confidenceDistribution from "../../assets/figures/confidence_distribution.png";
import scoreVsFailedGuesses from "../../assets/figures/score_vs_failed_guesses.png";
import byRule from "../../assets/figures/by_rule.png";
import complexityAnalysis from "../../assets/figures/complexity_analysis.png";

## Results

### Overall Performance

We evaluated a range of models on the Eleusis benchmark. Performance varies significantly across models, correlating with both model size and reasoning effort (measured by output token usage).

<Wide>
<Image
  src={overallPerformance}
  alt="LLM performance on Eleusis benchmark: 2D scatter plot showing average score vs output token count for each model"
  caption="<strong>Figure 1:</strong> Overall model performance on the Eleusis benchmark. Each point represents a model, with position showing average score vs. token usage. Larger reasoning budgets generally correlate with better performance."
  id="fig-overall"
  zoomable
/>
</Wide>

<Sidenote>
Token usage serves as a proxy for "thinking effort"—models that produce longer reasoning traces tend to perform better.
</Sidenote>

### Confidence and Calibration

Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly.

<Image
  src={calibrationCurves}
  alt="Calibration curves showing reported confidence vs actual success rate for all models"
  caption="<strong>Figure 2:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points above the line indicate overconfidence; points below indicate underconfidence."
  id="fig-calibration"
  zoomable
/>

The calibration analysis reveals several patterns:

- **Most models are overconfident** at high confidence levels—when they report 90% confidence, actual success rates are often closer to 70%
- **Some models are well-calibrated** at lower confidence levels but diverge as confidence increases
- **Reasoning models** tend to show better calibration overall
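A calibration curve like those in Figure 2 can be computed by grouping turns by stated confidence level and comparing each level's implied probability to its empirical success rate. A minimal sketch (the data and variable names here are illustrative):

```python
from collections import defaultdict

def calibration_curve(records):
    """Group (confidence, correct) pairs by stated confidence level
    and return (implied_probability, empirical_accuracy) points."""
    buckets = defaultdict(list)
    for confidence, correct in records:  # confidence on the 0-10 scale
        buckets[confidence].append(correct)
    points = []
    for level in sorted(buckets):
        outcomes = buckets[level]
        points.append((level / 10, sum(outcomes) / len(outcomes)))
    return points

# Illustrative data: a model that reports 9/10 confidence but is right
# only ~67% of the time at that level is overconfident there.
records = [(9, True), (9, True), (9, False), (5, True), (5, False)]
print(calibration_curve(records))  # [(0.5, 0.5), (0.9, 0.6666666666666666)]
```

Plotting these points against the diagonal gives the reliability diagram; the gap between a point and the diagonal is the miscalibration at that confidence level.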

<Image
  src={confidenceDistribution}
  alt="Histogram showing distribution of confidence levels when models choose to guess vs not guess"
  caption="<strong>Figure 3:</strong> Distribution of confidence levels. Left: when models choose to formally guess. Right: when models choose not to guess. Well-calibrated models should show clear separation between these distributions."
  id="fig-confidence"
  zoomable
/>

### Guessing Strategy

The scoring system creates a strategic tension: guess early for more points, but wrong guesses are costly. How do models navigate this tradeoff?
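At its core, this tradeoff is an expected-value decision. As an illustration only (the point values below are invented for the example, not the benchmark's actual scoring):

```python
def should_guess(p_correct: float, reward: float, penalty: float,
                 expected_future_reward: float) -> bool:
    """Guess now if the expected value of guessing beats waiting.

    p_correct: estimated probability the tentative rule is right.
    reward: points for a correct guess now.
    penalty: points lost on a wrong guess.
    expected_future_reward: value of gathering more evidence first.
    """
    ev_guess = p_correct * reward - (1 - p_correct) * penalty
    return ev_guess > expected_future_reward

# With a well-calibrated confidence of 0.6, a 10-point reward, a 5-point
# penalty, and ~3 points of expected value from waiting another turn,
# guessing now has EV 0.6*10 - 0.4*5 = 4 > 3, so it is worthwhile.
print(should_guess(0.6, reward=10, penalty=5, expected_future_reward=3))
```

The key dependence is on `p_correct`: an overconfident model overestimates it and guesses too early, while an underconfident one keeps gathering evidence past the point of diminishing returns.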

<Image
  src={scoreVsFailedGuesses}
  alt="2D scatter plot showing average score vs average number of failed guesses per round for each model"
  caption="<strong>Figure 4:</strong> Score vs. failed guesses per round. Models in the upper-left are efficient (high scores, few wrong guesses). Models that guess recklessly appear on the right with low scores."
  id="fig-guessing"
  zoomable
/>

<Sidenote>
The optimal strategy depends on accurate self-assessment—knowing when you've gathered enough evidence to commit.
</Sidenote>

### Performance by Rule

Not all rules are created equal. Some rules are discovered quickly by all models, while others prove consistently challenging.

<Wide>
<Image
  src={byRule}
  alt="Performance breakdown by rule showing score distribution for each rule across all models"
  caption="<strong>Figure 5:</strong> Score distribution by rule. Each row is a different rule, with individual run scores shown as points. Some rules show high variance (sensitive to initial conditions), while others are consistently easy or hard."
  id="fig-by-rule"
  zoomable
/>
</Wide>

### Rule Complexity

What makes some rules harder than others? We examined several factors: acceptance rate (rules that accept few cards provide less positive evidence), code complexity of the rule implementation, and semantic complexity.

<Image
  src={complexityAnalysis}
  alt="Scatter plot showing relationship between rule complexity metrics and model performance"
  caption="<strong>Figure 6:</strong> Relationship between rule complexity and performance. Multiple complexity factors contribute: acceptance rate, structural complexity, and semantic difficulty."
  id="fig-complexity"
  zoomable
/>

<Note variant="info">
Interestingly, code complexity (cyclomatic complexity, AST node count) doesn't perfectly predict difficulty. Semantically simple rules like "only face cards" can be harder than structurally complex rules if the semantic concept is unfamiliar to models.
</Note>
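Structural metrics like AST node count are cheap to compute from a rule's implementation, which is part of their appeal. A sketch using Python's standard `ast` module (the rule sources below are illustrative, not the benchmark's actual implementations):

```python
import ast

def ast_node_count(source: str) -> int:
    """Count all nodes in the abstract syntax tree of a rule's source."""
    return sum(1 for _ in ast.walk(ast.parse(source)))

simple_rule = "def accepts(card): return card.is_red"
complex_rule = (
    "def accepts(prev, card):\n"
    "    return card.suit == prev.suit or abs(card.value - prev.value) == 2"
)

# The structurally richer rule yields a larger tree, even though the
# "only red cards" concept may be no easier for a model to verbalize.
print(ast_node_count(simple_rule) < ast_node_count(complex_rule))  # True
```

This illustrates why such metrics diverge from observed difficulty: they measure the syntax of the implementation, not how familiar or nameable the underlying concept is.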