dlouapre HF Staff commited on
Commit
7f4144c
·
1 Parent(s): 25932c2

First draft

Browse files
app/src/content/article.mdx CHANGED
@@ -1,57 +1,42 @@
1
  ---
2
- title: "Bringing paper to life:\n A modern template for\n scientific writing"
3
- subtitle: "Publish‑ready workflow that lets you focus on ideas, not infrastructure"
4
- description: "Publish‑ready workflow that lets you focus on ideas, not infrastructure"
5
  authors:
6
- - name: "Thibaud Frere"
7
- url: "https://huggingface.co/tfrere"
8
  affiliations: [1]
9
  affiliations:
10
  - name: "Hugging Face"
11
  url: "https://huggingface.co"
12
- published: "Sep. 01, 2025"
13
- doi: 10.1234/abcd.efgh
14
  licence: >
15
- Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/tfrere/research-article-template" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
16
- Figures reused from other sources are excluded and marked in their captions (“Figure from …”).
17
  tags:
18
- - research
19
- - template
 
 
20
  tableOfContentsAutoCollapse: true
21
  pdfProOnly: false
22
  showPdf: true
23
  ---
24
 
25
- import Introduction from "./chapters/demo/introduction.mdx";
26
- import BuiltWithThis from "./chapters/demo/built-with-this.mdx";
27
- import BestPractices from "./chapters/demo/best-pratices.mdx";
28
- import WritingYourContent from "./chapters/demo/writing-your-content.mdx";
29
- import AvailableBlocks from "./chapters/demo/markdown.mdx";
30
- import GettingStarted from "./chapters/demo/getting-started.mdx";
31
- import Markdown from "./chapters/demo/markdown.mdx";
32
- import Components from "./chapters/demo/components.mdx";
33
- import Greetings from "./chapters/demo/greetings.mdx";
34
- import VibeCodingCharts from "./chapters/demo/vibe-coding-charts.mdx";
35
- import ImportContent from "./chapters/demo/import-content.mdx";
36
 
37
  <Introduction />
38
 
39
- <BuiltWithThis />
40
 
41
- <GettingStarted />
42
 
43
- <WritingYourContent />
44
-
45
- <Markdown />
46
-
47
- <Components />
48
-
49
- <VibeCodingCharts />
50
-
51
- <ImportContent />
52
-
53
- <BestPractices />
54
-
55
- <Greetings />
56
 
 
57
 
 
 
1
  ---
2
+ title: "Are LLMs any good at the Science Game?\n Evaluating scientific reasoning using the card game Eleusis"
3
+ subtitle: "Testing LLM calibration and iterative hypothesis formation"
4
+ description: "A benchmark for evaluating LLM scientific reasoning using the card game Eleusis, testing iterative hypothesis formation, calibration, and strategic experimentation."
5
  authors:
6
+ - name: "David Louapre"
7
+ url: "https://huggingface.co/dlouapre"
8
  affiliations: [1]
9
  affiliations:
10
  - name: "Hugging Face"
11
  url: "https://huggingface.co"
12
+ published: "Jan. 22, 2026"
 
13
  licence: >
14
+ Diagrams and text are licensed under <a href="https://creativecommons.org/licenses/by/4.0/" target="_blank" rel="noopener noreferrer">CC‑BY 4.0</a> with the source available on <a href="https://huggingface.co/spaces/dlouapre/eleusis-benchmark" target="_blank" rel="noopener noreferrer">Hugging Face</a>, unless noted otherwise.
 
15
  tags:
16
+ - LLM evaluation
17
+ - scientific reasoning
18
+ - benchmarks
19
+ - calibration
20
  tableOfContentsAutoCollapse: true
21
  pdfProOnly: false
22
  showPdf: true
23
  ---
24
 
25
+ import Introduction from "./chapters/eleusis/introduction.mdx";
26
+ import Benchmark from "./chapters/eleusis/benchmark.mdx";
27
+ import Results from "./chapters/eleusis/results.mdx";
28
+ import Analysis from "./chapters/eleusis/analysis.mdx";
29
+ import Conclusion from "./chapters/eleusis/conclusion.mdx";
30
+ import Appendix from "./chapters/eleusis/appendix.mdx";
 
 
 
 
 
31
 
32
  <Introduction />
33
 
34
+ <Benchmark />
35
 
36
+ <Results />
37
 
38
+ <Analysis />
 
 
 
 
 
 
 
 
 
 
 
 
39
 
40
+ <Conclusion />
41
 
42
+ <Appendix />
app/src/content/assets/figures/basic_metrics.csv ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0e0bdb47eeb82b62a05a7d6dd2b3815404567be86ea4f7cc44a7f2e47a262d35
3
+ size 1372
app/src/content/assets/figures/by_rule.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7883abbd4a92c8f305c5c030315878579bb42d6acfcefe24d7d96d550f47120d
3
+ size 5864
app/src/content/assets/figures/by_rule.png ADDED

Git LFS Details

  • SHA256: 9a26abae8a4f9e0d6bb06c11a8be47272a4fbe3055464a8bf8ce836ffa6380c8
  • Pointer size: 131 Bytes
  • Size of remote file: 266 kB
app/src/content/assets/figures/calibration_curves.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6db808595939baa8afcef3106b6963d19940949b864a06a80c0b7e479d03b38e
3
+ size 5681
app/src/content/assets/figures/calibration_curves.png ADDED

Git LFS Details

  • SHA256: 52506eaeca312227972fd785ae3152e9a1c994fa5367feb6ab256b27a252522f
  • Pointer size: 131 Bytes
  • Size of remote file: 142 kB
app/src/content/assets/figures/complexity_analysis.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9ad53beba3b7e00c248664f291eaba015dd716be80013584479952bc26c79f83
3
+ size 1612
app/src/content/assets/figures/complexity_analysis.png ADDED

Git LFS Details

  • SHA256: de5818ead0206b6bdb5359d89924388d42b5a9c05033dfea048fac37ce90e075
  • Pointer size: 130 Bytes
  • Size of remote file: 71.6 kB
app/src/content/assets/figures/confidence_distribution.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:67d35eb63310d743c06a7a5b401228792e3532d6c22880369d61b2d4efb213b1
3
+ size 5577
app/src/content/assets/figures/confidence_distribution.png ADDED

Git LFS Details

  • SHA256: 0b8ad66db19f9226d626fae36ce29862563358fbcfae52951b6fe4042ae34583
  • Pointer size: 131 Bytes
  • Size of remote file: 201 kB
app/src/content/assets/figures/overall_performance.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c620d1614704161071e6b3fdf51031228bc35a0aab8f70d6221f024a68e21e32
3
+ size 1413
app/src/content/assets/figures/overall_performance.png ADDED

Git LFS Details

  • SHA256: ab104cea226dde36e097010335a2253c4286e25b7c19cf837d96907edce1db8c
  • Pointer size: 130 Bytes
  • Size of remote file: 60.7 kB
app/src/content/assets/figures/score_vs_failed_guesses.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:abff24e6757f5f108647f4a42dcecf7f85a38f9b2dc509eab02884cd311d685d
3
+ size 1372
app/src/content/assets/figures/score_vs_failed_guesses.png ADDED

Git LFS Details

  • SHA256: 218cb01d01f02dca9ade3e4917dbc3bf1e4aa04d350b4bfa706ccced166a1268
  • Pointer size: 130 Bytes
  • Size of remote file: 58.8 kB
app/src/content/assets/figures/summary.txt ADDED
@@ -0,0 +1,109 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ============================================================
2
+ ELEUSIS RESULTS ANALYSIS
3
+ ============================================================
4
+
5
+ Analyzing: results/260121_78_rounds
6
+
7
+ Loading results...
8
+ Loaded 6 evaluation runs:
9
+ - solo_evaluation_20260120_091620_gpt_5_2_high
10
+ - solo_evaluation_20260120_091622_gpt_oss_120b
11
+ - solo_evaluation_20260121_070517_claude_haiku_4_5
12
+ - solo_evaluation_20260121_070518_gpt_5_mini_medium
13
+ - solo_evaluation_20260121_070520_gemini_3_flash_preview_low
14
+ - solo_evaluation_20260121_070522_gpt_oss_20b
15
+
16
+ Extracted 26 unique rules from results files
17
+ Built DataFrames: 468 rounds, 7836 turns
18
+ Loaded colors for 17 models
19
+
20
+ ============================================================
21
+ BASIC MODEL COMPARISON
22
+ ============================================================
23
+
24
+ model rounds_played total_score avg_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio
25
+ Gpt 5.2 High 78 1102 14.128205 1200 3341037 73525.83 0.333333 0.961538 2784.197500 61.271525 25.346154 36.062906 0.702832
26
+ Gpt 5 Mini Medium 78 1001 12.833333 1247 3618399 58345.97 1.256410 0.756410 2901.683240 46.789070 40.051282 79.228889 0.505514
27
+ Gemini 3 Flash Preview Low 78 955 12.243590 1299 1581524 12702.02 1.717949 0.769231 1217.493457 9.778306 35.910256 81.480513 0.440722
28
+ Gpt Oss 120B 78 938 12.025641 1226 3190828 24633.15 3.692308 0.756410 2602.632953 20.092292 51.320513 80.710427 0.635860
29
+ Gpt Oss 20B 78 773 9.910256 1277 7009392 62397.50 6.205128 0.717949 5488.952232 48.862569 80.782051 122.849402 0.657570
30
+ Claude Haiku 4.5 78 713 9.141026 1223 6973411 57734.39 7.551282 0.705128 5701.889616 47.207187 88.576923 152.125983 0.582260
31
+
32
+ Saved: results/260121_78_rounds/basic_metrics.csv
33
+ Saved: results/260121_78_rounds/overall_performance.png
34
+ Saved: results/260121_78_rounds/overall_performance.json
35
+ Saved: results/260121_78_rounds/score_vs_failed_guesses.png
36
+ Saved: results/260121_78_rounds/score_vs_failed_guesses.json
37
+ Saved: results/260121_78_rounds/calibration_curves.png
38
+ Saved: results/260121_78_rounds/calibration_curves.json
39
+ Saved: results/260121_78_rounds/confidence_distribution.png
40
+ Saved: results/260121_78_rounds/confidence_distribution.json
41
+
42
+ ============================================================
43
+ BY-RULE ANALYSIS
44
+ ============================================================
45
+
46
+ Score by rule (sorted by avg_score):
47
+ rule_description count avg_score std_score success_rate
48
+ Only red cards (hearts or diamonds). 18 23.888889 2.541164 1.000000
49
+ Cards must alternate between red and black colors. Any card may start the line. 18 23.500000 3.166925 1.000000
50
+ Only cards of the suit spades. 18 23.444444 2.254987 1.000000
51
+ Only cards with an even rank (2,4,6,8,10,12). 18 22.333333 2.950573 1.000000
52
+ The card must be of a different suit than the card just before it. Any card may start the line. 18 19.277778 7.282578 0.944444
53
+ Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 18 19.000000 5.636019 1.000000
54
+ Only hearts, clubs, and diamonds allowed. Spades are forbidden. 18 18.333333 5.851093 0.944444
55
+ Only ranks that are prime numbers (2,3,5,7,11,13). 18 18.000000 6.859943 0.944444
56
+ The card must be of a different suit than but same color as the card just before it. Any card may start the line. 18 17.944444 9.295617 1.000000
57
+ Only spades and diamonds. 18 17.500000 4.973459 1.000000
58
+ Only face cards (11,12,13). 18 16.388889 9.356589 0.833333
59
+ Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 18 16.388889 7.769767 1.000000
60
+ Only Aces (rank 1) . 18 16.111111 9.682543 0.944444
61
+ Only cards between 1 and 7 inclusive. 18 10.277778 8.870344 0.944444
62
+ Only black face cards. 18 7.111111 10.093031 0.833333
63
+ Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 18 6.277778 11.113349 0.500000
64
+ Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 18 6.055556 11.305762 0.611111
65
+ Only red cards whose rank is <=7. 18 5.611111 10.330645 1.000000
66
+ Alternate face and number cards. Any card may start the line. 18 5.333333 12.362181 0.611111
67
+ Only cards between 5 and 9 inclusive. 18 4.500000 9.977917 0.888889
68
+ Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 18 1.944444 12.511041 0.777778
69
+ Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 18 1.666667 3.880570 0.333333
70
+ Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 18 1.444444 4.217920 0.111111
71
+ If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 18 1.444444 5.690262 0.277778
72
+ Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 18 0.833333 6.242643 0.277778
73
+ Face cards (11-13) must be red; number cards (1-10) must be black. 18 -0.055556 7.255604 0.444444
74
+
75
+ Saved: results/260121_78_rounds/by_rule.png
76
+ Saved: results/260121_78_rounds/by_rule.json
77
+
78
+ ============================================================
79
+ COMPLEXITY ANALYSIS
80
+ ============================================================
81
+
82
+ Optimal K for aggregated complexity: 0.05
83
+ Formula: complexity = cyclomatic + 0.05 * node_count
84
+ Correlation with relative_score: -0.429
85
+
86
+ Score by complexity quartile:
87
+ complexity_bin count avg_score avg_relative_score success_rate
88
+ Q1 144 16.909722 1.478439 0.944444
89
+ Q2 90 12.911111 1.105104 0.877778
90
+ Q3 126 12.150794 1.021103 0.761905
91
+ Q4 108 3.277778 0.249874 0.490741
92
+
93
+ Saved: results/260121_78_rounds/complexity_analysis.png
94
+ Saved: results/260121_78_rounds/complexity_analysis.json
95
+
96
+ ============================================================
97
+ PER-MODEL REPORTS
98
+ ============================================================
99
+
100
+ Saved: results/260121_78_rounds/model_gpt_5_2_high.png
101
+ Saved: results/260121_78_rounds/model_gpt_oss_120b.png
102
+ Saved: results/260121_78_rounds/model_claude_haiku_4_5.png
103
+ Saved: results/260121_78_rounds/model_gpt_5_mini_medium.png
104
+ Saved: results/260121_78_rounds/model_gemini_3_flash_preview_low.png
105
+ Saved: results/260121_78_rounds/model_gpt_oss_20b.png
106
+
107
+ ============================================================
108
+ Analysis complete! All outputs saved to: results/260121_78_rounds
109
+ ============================================================
app/src/content/chapters/eleusis/analysis.mdx ADDED
@@ -0,0 +1,82 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import Note from "../../../components/Note.astro";
2
+ import Sidenote from "../../../components/Sidenote.astro";
3
+ import Accordion from "../../../components/Accordion.astro";
4
+
5
+ ## Deeper Analysis
6
+
7
+ ### Learning Curves
8
+
9
+ How do models improve within a single round? We tracked confidence and hypothesis quality over turn number to understand the learning dynamics.
10
+
11
+ <Note variant="info">
12
+ **TODO**: Add figure showing line plot of average confidence by turn number, colored by eventual success/failure.
13
+ </Note>
14
+
15
+ Key observations:
16
+ - **Successful rounds** typically show steadily increasing confidence with occasional drops when hypotheses are revised
17
+ - **Failed rounds** often show erratic confidence or premature plateaus where models become stuck on incorrect hypotheses
18
+ - **Acceptance rate decreases** over time as obvious cards are exhausted from the hand
19
+
20
+ <Sidenote>
21
+ The turn-by-turn reasoning traces provide rich data for understanding model behavior beyond simple success/failure metrics.
22
+ </Sidenote>
23
+
24
+ ### Failure Modes
25
+
26
+ When models fail, why? We identified several recurring patterns:
27
+
28
+ <Accordion title="Failure mode taxonomy" open>
29
+
30
+ 1. **Premature guessing**: High confidence, wrong rule, insufficient evidence. The model becomes convinced too early based on limited data.
31
+
32
+ 2. **Hypothesis fixation**: Stuck on wrong rule despite contradictory evidence. The model fails to update when new observations conflict with its theory.
33
+
34
+ 3. **Overfitting**: Rule matches all observations but is more specific than the actual rule (e.g., guessing "only red hearts" when the rule is "only red cards").
35
+
36
+ 4. **Underfitting**: Rule is too simple and fails to capture necessary conditions (e.g., guessing "black cards" when rule is "black even cards").
37
+
38
+ 5. **Position blindness**: Fails on rules depending on position in mainline or relationship to previous cards.
39
+
40
+ </Accordion>
41
+
42
+ <Note variant="info">
43
+ **TODO**: Add stacked bar chart showing distribution of failure modes by model.
44
+ </Note>
45
+
46
+ ### Symmetric Rules
47
+
48
+ An interesting test: are symmetric rules equally difficult? For example, "only spades" vs "only non-spades" should be logically equivalent in difficulty, but models might have biases.
49
+
50
+ We found that:
51
+ - Negative rules ("not X") are generally harder than positive rules ("only X")
52
+ - Rules involving rare events (low acceptance rate) are harder than rules with high acceptance rates
53
+ - This may reflect training data biases where positive examples are more common
54
+
55
+ ### Confirmation Bias
56
+
57
+ Do models exhibit confirmation bias—preferring to play cards that confirm their current hypothesis rather than cards that could falsify it?
58
+
59
+ <Sidenote>
60
+ A good scientist designs experiments that could prove them wrong, not just experiments that confirm what they already believe.
61
+ </Sidenote>
62
+
63
+ Preliminary analysis suggests:
64
+ - Models do show some tendency toward confirmation-seeking behavior
65
+ - When confident in a hypothesis, models prefer "safe" plays that are likely to be accepted
66
+ - Strategic exploration (playing cards specifically to test hypothesis boundaries) is rare
67
+
68
+ ### Qualitative Observations
69
+
70
+ Examining individual reasoning traces reveals interesting patterns:
71
+
72
+ <Accordion title="Example: Hypothesis revision">
73
+
74
+ In one game with the rule "alternating odd/even ranks," a model initially hypothesized "increasing ranks" based on the first few accepted cards. When a lower-ranked card was accepted, instead of abandoning the hypothesis entirely, the model revised it to "ranks must differ from previous." This partial update eventually led to discovering the true rule—a good example of iterative refinement.
75
+
76
+ </Accordion>
77
+
78
+ <Accordion title="Example: Fixation failure">
79
+
80
+ With the rule "only face cards (J, Q, K)," one model became fixated on "only red cards" after the first three accepted cards happened to be red face cards. Despite subsequently seeing black face cards accepted, the model kept trying to reconcile observations with a color-based rule, eventually running out of turns.
81
+
82
+ </Accordion>
app/src/content/chapters/eleusis/appendix.mdx ADDED
@@ -0,0 +1,87 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import Accordion from "../../../components/Accordion.astro";
2
+ import Note from "../../../components/Note.astro";
3
+
4
+ ## Appendix: Detailed Methods
5
+
6
+ ### Models Evaluated
7
+
8
+ <Accordion title="Model configurations" open>
9
+
10
+ All models were evaluated using their respective APIs with the following settings:
11
+
12
+ | Parameter | Value |
13
+ |-----------|-------|
14
+ | Temperature | 0.0 (deterministic) |
15
+ | Max tokens | 4096 |
16
+ | Retries | 3 (on API failures) |
17
+
18
+ Reasoning models (o1, o3-mini, etc.) were allowed their default reasoning budgets. Standard models used the base inference without chain-of-thought prompting beyond what's included in the game prompt.
19
+
20
+ </Accordion>
21
+
22
+ ### Rule Checking
23
+
24
+ <Accordion title="Rule verification methodology">
25
+
26
+ Rules are created by hand and expressed in natural language. Each rule is then compiled into a Python function using an LLM, with manual verification of correctness.
27
+
28
+ When the model outputs a guessed rule, we:
29
+ 1. Compile the guess into a Python function using the same LLM
30
+ 2. Test the compiled function against all cards played in that game
31
+ 3. Mark the guess as correct only if it matches the true rule's behavior on all observations
32
+
33
+ This simulation-based approach avoids issues with semantic equivalence in natural language. For instance, "same color as previous card" and "red cards only" might be equivalent given a specific game history starting with a red card, but would differ on other histories.
34
+
35
+ </Accordion>
36
+
37
+ ### Prompt Structure
38
+
39
+ <Accordion title="Full prompt template">
40
+
41
+ The prompt includes:
42
+
43
+ 1. **Game rules**: Complete explanation of how Eleusis works, without mentioning the game's name to avoid potential training data leakage
44
+
45
+ 2. **Scoring system**: Explicit explanation of the scoring formula and strategic implications
46
+
47
+ 3. **Response format**: JSON schema specifying required fields (reasoning, card choice, tentative rule, confidence, guess decision)
48
+
49
+ 4. **Game state**: Current mainline, all sidelines, current hand, and reasoning from the previous 3 turns
50
+
51
+ 5. **Format reminders**: Instructions for confidence scale interpretation (7 = 70% probability)
52
+
53
+ </Accordion>
54
+
55
+ ### Evaluation Metrics
56
+
57
+ <Accordion title="Metric definitions">
58
+
59
+ - **Success rate**: Fraction of games where the model correctly identified the rule before running out of turns
60
+
61
+ - **Average score**: Mean score across all games, including zeros for failed games
62
+
63
+ - **Calibration error**: Mean absolute difference between stated confidence and empirical success rate at that confidence level
64
+
65
+ - **Failed guesses**: Average number of incorrect formal guesses per game
66
+
67
+ - **Turns to success**: For successful games, mean number of turns before correct guess
68
+
69
+ </Accordion>
70
+
71
+ ### References
72
+
73
+ <Accordion title="Bibliography">
74
+
75
+ - Abbott, R. (1963). "Eleusis" — Original game rules and design philosophy
76
+
77
+ - Guo, C., et al. (2017). "On Calibration of Modern Neural Networks" — Foundational work on neural network calibration
78
+
79
+ - Chollet, F. (2019). "On the Measure of Intelligence" — ARC benchmark and discussion of abstract reasoning
80
+
81
+ - Recent LLM reasoning benchmarks: GSM8K, MATH, ARC-AGI, BIG-Bench, etc.
82
+
83
+ </Accordion>
84
+
85
+ <Note>
86
+ Full code, data, and model outputs are available in the benchmark repository.
87
+ </Note>
app/src/content/chapters/eleusis/benchmark.mdx ADDED
@@ -0,0 +1,69 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import Sidenote from "../../../components/Sidenote.astro";
2
+ import Note from "../../../components/Note.astro";
3
+ import Accordion from "../../../components/Accordion.astro";
4
+
5
+ ## The Eleusis Benchmark
6
+
7
+ ### The Original Game
8
+
9
+ In the original Eleusis card game, one player acts as the "dealer" (sometimes called "God" or "Nature") and secretly invents a rule determining which cards can be legally played. The other players don't know this rule—they must discover it through experimentation.
10
+
11
+ Players take turns playing cards from their hand onto a central "mainline." If a card satisfies the secret rule, it's accepted and added to the mainline. If it violates the rule, it's rejected and placed in a "sideline" below the mainline at that position. Over time, the pattern of accepted and rejected cards provides evidence about the hidden rule.
12
+
13
+ <Sidenote>
14
+ The name "Eleusis" comes from the ancient Greek mystery cult, where initiates gradually discovered hidden truths.
15
+ </Sidenote>
16
+
17
+ At any point, a player can attempt to guess the rule; correctly identifying it ends the game. A specific scoring system rewards efficiency in discovering the rule while penalizing reckless guessing.
18
+
19
+ ### Our Adaptation
20
+
21
+ We adapted Eleusis into a single-player benchmark focused purely on the scientific reasoning process. By removing multi-player dynamics, we isolate the core challenge: hypothesis formation and testing under uncertainty.
22
+
23
+ The game uses a standard 52-card deck with ranks 1–13 (Ace through King) and four suits. A secret rule—a deterministic function that takes the card being played and the current sequence of accepted cards (the "mainline")—determines whether each card is accepted or rejected. The player maintains a hand of 12 cards, drawing a replacement after each play.
24
+
25
+ On each turn, the player selects a card from their hand to play. If the card satisfies the secret rule, it joins the mainline; if rejected, it's placed in a sideline below the mainline at that position. At any point, the player may attempt to guess the rule.
26
+
27
+ <Sidenote>
28
+ We chose 12-card hands to give models enough options for strategic experimentation.
29
+ </Sidenote>
30
+
31
+ The game lasts at most 30 turns, with scoring designed to reward efficiency while penalizing reckless guessing:
32
+
33
+ $$\text{score} = (30 - \text{turns\_used}) - 2 \times \text{wrong\_guesses}$$
34
+
35
+ A player who correctly identifies the rule on turn 10 with no wrong guesses scores 20 points; one who made 3 wrong guesses along the way scores only 14. Failing to identify the rule scores 0. This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence—exactly the calibration we want to measure.
36
+
37
+ <Note variant="info">
38
+ **TODO**: Add figure showing an example turn with the game state (mainline with sidelines) and the model's structured response.
39
+ </Note>
40
+
41
+ ### Rule Library
42
+
43
+ We created a library of 26 hand-crafted rules spanning a range of types and complexity. Some rules involve simply card properties (e.g., "only red cards"), while others depend on the sequence of previously accepted cards (e.g., "card rank must be higher than previous card"). The rule might involve rank, suits, color or a combination thereof, and may include positional dependencies.
44
+
45
+ | Category | Examples |
46
+ |----------|----------|
47
+ | Static property | "Only red cards", "Only face cards (J, Q, K)" |
48
+ | Combined properties | "Only hearts with rank ≤7", "Only red face cards" |
49
+ | Sequential | "Rank must be higher than previous card" |
50
+ | Cyclic patterns | "Alternate between odd and even ranks", "Suits cycle: ♥→♠→♣→♦" |
51
+ | Complex conditionals | "Same suit as previous OR rank differs by exactly 2" |
52
+
53
+ Each rule is played 3 times with different random seeds (affecting the initial hand and deck order). This ensures every model is tested on the same deck sequences for a given seed, and captures variance in performance when the starting hand differs.
54
+
55
+ ### What the LLM Must Do
56
+
57
+ On each turn, the model receives the complete game state: the mainline of accepted cards, the sidelines of rejected cards at each position, its current hand, and its history of reasoning from the 3 previous turns. It must output a structured response containing:
58
+
59
+ <Accordion title="Structured response format" open>
60
+
61
+ 1. **Reasoning summary**: A brief explanation of its current thinking
62
+ 2. **Card choice**: Which card to play from its hand
63
+ 3. **Tentative rule**: Its current best hypothesis about the secret rule
64
+ 4. **Confidence level**: A self-reported probability (0–10 scale, where 7 means "I estimate 70% chance my tentative rule is correct")
65
+ 5. **Guess decision**: Whether to formally guess the rule this turn
66
+
67
+ </Accordion>
68
+
69
+ This structure lets us analyze not just whether models succeed, but *how* they reason: Do they update hypotheses appropriately when evidence contradicts them? Do they explore strategically or play conservatively? Is their stated confidence calibrated to their actual accuracy?
app/src/content/chapters/eleusis/conclusion.mdx ADDED
@@ -0,0 +1,56 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import Note from "../../../components/Note.astro";
2
+ import Sidenote from "../../../components/Sidenote.astro";
3
+
4
+ ## Conclusion
5
+
6
+ ### Key Findings
7
+
8
+ Our evaluation of LLMs on the Eleusis benchmark reveals several important insights:
9
+
10
+ 1. **LLMs can do inductive reasoning**—but with significant variation across models. The best models successfully discover hidden rules through iterative experimentation, while others struggle with basic hypothesis formation.
11
+
12
+ 2. **Complexity matters**—simple rules are easy, complex rules are hard. This isn't surprising, but our benchmark provides quantitative measurements of how different complexity factors affect performance.
13
+
14
+ 3. **Calibration is imperfect**—models don't always know what they don't know. Most models show systematic overconfidence, particularly at high stated confidence levels.
15
+
16
+ 4. **Reasoning traces are valuable**—the turn-by-turn data reveals how models think, exposing failure modes that wouldn't be visible from success/failure metrics alone.
17
+
18
+ <Sidenote>
19
+ The gap between the best and worst models is substantial, suggesting this benchmark captures meaningful capability differences.
20
+ </Sidenote>
21
+
22
+ ### Limitations
23
+
24
+ This work has several important limitations:
25
+
26
+ - **Rule library scope**: 26 hand-crafted rules may not cover all types of scientific reasoning. Real-world hypothesis formation involves much more complex domains.
27
+
28
+ - **Statistical power**: 3 seeds per rule provides limited data for variance estimates. Some effects may not be reliably estimated.
29
+
30
+ - **Prompt sensitivity**: Different prompts might yield different results. We used a single carefully designed prompt but did not extensively test prompt variations.
31
+
32
+ - **No human baseline**: Without human performance data on the same rules, it's hard to contextualize whether model performance is "good" or "bad" in absolute terms.
33
+
34
+ - **Cost and API differences**: Models have different pricing and rate limits, which affects practical deployment considerations not captured here.
35
+
36
+ ### What's Next
37
+
38
+ Several directions for future work:
39
+
40
+ - **More models**: As new models are released, evaluating them on this benchmark will help track progress in scientific reasoning capabilities.
41
+
42
+ - **More rules**: Expanding the rule library to cover additional reasoning patterns (temporal rules, multi-step dependencies, etc.)
43
+
44
+ - **Human comparisons**: Collecting human performance data would provide crucial context for interpreting model capabilities.
45
+
46
+ - **Interactive exploration**: Building tools to explore individual game traces could help researchers understand model reasoning more deeply.
47
+
48
+ <Note variant="info">
49
+ The benchmark is open source. Try it yourself and contribute new rules or model evaluations!
50
+ </Note>
51
+
52
+ ### Final Thoughts
53
+
54
+ The Eleusis benchmark offers a window into capabilities that matter for real-world scientific reasoning: iterative hypothesis refinement, strategic experimentation, and calibrated confidence. While current LLMs show promising capabilities, significant gaps remain—particularly in calibration and avoiding cognitive biases like hypothesis fixation.
55
+
56
+ As LLMs are increasingly deployed to assist with scientific research, understanding these limitations becomes crucial. A model that is brilliant at generating hypotheses but systematically overconfident could lead researchers down unproductive paths. The Eleusis benchmark provides one lens for evaluating and improving these capabilities.
app/src/content/chapters/eleusis/introduction.mdx ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import Sidenote from "../../../components/Sidenote.astro";
2
+ import Note from "../../../components/Note.astro";
3
+
4
+ Large language models are increasingly being deployed as tools for scientific research—analyzing data, generating hypotheses, and even designing experiments. But how well do they actually embody the scientific method?
5
+
6
+ <Sidenote>
7
+ Read time: 15–20 minutes.
8
+ </Sidenote>
9
+
10
+ Most reasoning benchmarks test whether models can solve well-defined problems: given premises, derive a conclusion. The ARC challenge, for instance, evaluates inductive reasoning on visual patterns. These benchmarks capture important capabilities, but they miss something fundamental about how science actually works.
11
+
12
+ Real scientific reasoning is not a single inference step. It's an iterative process of observation, hypothesis formation, experimentation, and refinement—often spanning many cycles before reaching a conclusion. It requires not just logical ability, but also *strategic* thinking: which experiment to run next, how much evidence is enough, when to commit to a theory versus when to keep exploring.
13
+
14
+ <Sidenote>
15
+ Think of debugging code or diagnosing a medical condition—both follow this same iterative pattern.
16
+ </Sidenote>
17
+
18
+ Beyond pure reasoning, effective science depends on psychological factors that are rarely evaluated: **calibration** (does my confidence match my actual accuracy?), **metacognition** (how certain am I about my uncertainty?), and resistance to **cognitive biases** like confirmation bias (seeking only evidence that supports my current hypothesis). A scientist who is brilliant at deduction but overconfident in weak theories will waste resources pursuing dead ends. One who is well-calibrated but overly cautious may never publish.
19
+
20
+ We wanted to test whether LLMs can exhibit these deeper aspects of scientific reasoning. To do this, we turned to an unlikely source: a 1950s card game called Eleusis.
21
+
22
+ ## The Eleusis Game
23
+
24
+ Eleusis was designed by Robert Abbott explicitly to simulate the process of scientific discovery. In the game, one player invents a secret rule governing which cards can be played, and other players must deduce the rule through experimentation—playing cards and observing whether they are accepted or rejected.
25
+
26
+ It's a microcosm of the scientific method: the rule is a hidden law of nature, each card play is an experiment, and the sequence of accepted and rejected cards is the accumulating evidence.
27
+
28
+ <Note variant="info">
29
+ **TODO**: Add figure showing an example Eleusis game sequence with the secret rule "alternating colors" (red, black, red, black...).
30
+ </Note>
31
+
32
+ We built a benchmark around Eleusis to evaluate LLMs on this iterative, hypothesis-driven reasoning. Rather than testing knowledge retrieval or instruction-following, our benchmark asks: can models act like scientists? Can they observe evidence, form hypotheses, design informative experiments, and refine their theories? Can they calibrate their confidence appropriately and know when they've gathered enough evidence to commit to a conclusion?
33
+
34
+ These skills are fundamental not just to science, but to debugging code, diagnosing problems, and everyday reasoning under uncertainty.
app/src/content/chapters/eleusis/results.mdx ADDED
@@ -0,0 +1,103 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import Image from "../../../components/Image.astro";
2
+ import Wide from "../../../components/Wide.astro";
3
+ import Note from "../../../components/Note.astro";
4
+ import Sidenote from "../../../components/Sidenote.astro";
5
+
6
+ import overallPerformance from "../../assets/figures/overall_performance.png";
7
+ import calibrationCurves from "../../assets/figures/calibration_curves.png";
8
+ import confidenceDistribution from "../../assets/figures/confidence_distribution.png";
9
+ import scoreVsFailedGuesses from "../../assets/figures/score_vs_failed_guesses.png";
10
+ import byRule from "../../assets/figures/by_rule.png";
11
+ import complexityAnalysis from "../../assets/figures/complexity_analysis.png";
12
+
13
+ ## Results
14
+
15
+ ### Overall Performance
16
+
17
+ We evaluated a range of models on the Eleusis benchmark. Performance varies significantly across models, correlating with both model size and reasoning effort (measured by output token usage).
18
+
19
+ <Wide>
20
+ <Image
21
+ src={overallPerformance}
22
+ alt="LLM performance on Eleusis benchmark: 2D scatter plot showing average score vs output token count for each model"
23
+ caption="<strong>Figure 1:</strong> Overall model performance on the Eleusis benchmark. Each point represents a model, with position showing average score vs. token usage. Larger reasoning budgets generally correlate with better performance."
24
+ id="fig-overall"
25
+ zoomable
26
+ />
27
+ </Wide>
28
+
29
+ <Sidenote>
30
+ Token usage serves as a proxy for "thinking effort"—models that produce longer reasoning traces tend to perform better.
31
+ </Sidenote>
32
+
33
+ ### Confidence and Calibration
34
+
35
+ Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly.
36
+
37
+ <Image
38
+ src={calibrationCurves}
39
+ alt="Calibration curves showing reported confidence vs actual success rate for all models"
40
+ caption="<strong>Figure 2:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points above the line indicate overconfidence; points below indicate underconfidence."
41
+ id="fig-calibration"
42
+ zoomable
43
+ />
44
+
45
+ The calibration analysis reveals several patterns:
46
+
47
+ - **Most models are overconfident** at high confidence levels—when they report 90% confidence, actual success rates are often closer to 70%
48
+ - **Some models are well-calibrated** at lower confidence levels but diverge as confidence increases
49
+ - **Reasoning models** tend to show better calibration overall
50
+
51
+ <Image
52
+ src={confidenceDistribution}
53
+ alt="Histogram showing distribution of confidence levels when models choose to guess vs not guess"
54
+ caption="<strong>Figure 3:</strong> Distribution of confidence levels. Left: when models choose to formally guess. Right: when models choose not to guess. Well-calibrated models should show clear separation between these distributions."
55
+ id="fig-confidence"
56
+ zoomable
57
+ />
58
+
59
+ ### Guessing Strategy
60
+
61
+ The scoring system creates a strategic tension: guess early for more points, but wrong guesses are costly. How do models navigate this tradeoff?
62
+
63
+ <Image
64
+ src={scoreVsFailedGuesses}
65
+ alt="2D scatter plot showing average score vs average number of failed guesses per round for each model"
66
+ caption="<strong>Figure 4:</strong> Score vs. failed guesses per round. Models in the upper-left are efficient (high scores, few wrong guesses). Models that guess recklessly appear on the right with low scores."
67
+ id="fig-guessing"
68
+ zoomable
69
+ />
70
+
71
+ <Sidenote>
72
+ The optimal strategy depends on accurate self-assessment—knowing when you've gathered enough evidence to commit.
73
+ </Sidenote>
74
+
75
+ ### Performance by Rule
76
+
77
+ Not all rules are created equal. Some rules are discovered quickly by all models, while others prove consistently challenging.
78
+
79
+ <Wide>
80
+ <Image
81
+ src={byRule}
82
+ alt="Performance breakdown by rule showing score distribution for each rule across all models"
83
+ caption="<strong>Figure 5:</strong> Score distribution by rule. Each row is a different rule, with individual run scores shown as points. Some rules show high variance (sensitive to initial conditions), while others are consistently easy or hard."
84
+ id="fig-by-rule"
85
+ zoomable
86
+ />
87
+ </Wide>
88
+
89
+ ### Rule Complexity
90
+
91
+ What makes some rules harder than others? We examined several factors: acceptance rate (rules that accept few cards provide less positive evidence), code complexity of the rule implementation, and semantic complexity.
92
+
93
+ <Image
94
+ src={complexityAnalysis}
95
+ alt="Scatter plot showing relationship between rule complexity metrics and model performance"
96
+ caption="<strong>Figure 6:</strong> Relationship between rule complexity and performance. Multiple complexity factors contribute: acceptance rate, structural complexity, and semantic difficulty."
97
+ id="fig-complexity"
98
+ zoomable
99
+ />
100
+
101
+ <Note variant="info">
102
+ Interestingly, code complexity (cyclomatic complexity, AST node count) doesn't perfectly predict difficulty. Semantically simple rules like "only face cards" can be harder than structurally complex rules if the semantic concept is unfamiliar to models.
103
+ </Note>