dlouapre (HF Staff) committed on
Commit 4343500 · 1 Parent(s): aee6411

Improved results and new assessment

Files changed (41)
  1. ASSESSMENT_V2.md +197 -0
  2. app/src/content/assets/data/basic_metrics.csv +2 -2
  3. app/src/content/assets/data/by_rule.json +2 -2
  4. app/src/content/assets/data/by_rule.png +2 -2
  5. app/src/content/assets/data/calibration_curves.json +2 -2
  6. app/src/content/assets/data/calibration_curves.png +2 -2
  7. app/src/content/assets/data/caution_vs_failed_guesses.json +2 -2
  8. app/src/content/assets/data/caution_vs_failed_guesses.png +2 -2
  9. app/src/content/assets/data/complexity_analysis.json +2 -2
  10. app/src/content/assets/data/complexity_analysis.png +2 -2
  11. app/src/content/assets/data/excess_caution.png +0 -3
  12. app/src/content/assets/data/{confidence_distribution.json → guess_rate.json} +2 -2
  13. app/src/content/assets/data/{confidence_distribution.png → guess_rate.png} +2 -2
  14. app/src/content/assets/data/model_claude_haiku_4_5.png +2 -2
  15. app/src/content/assets/data/model_claude_opus_4_5.png +2 -2
  16. app/src/content/assets/data/model_deepseek_r1.png +2 -2
  17. app/src/content/assets/data/model_gemini_3_flash_preview_low.png +2 -2
  18. app/src/content/assets/data/model_gpt_5_2_high.png +2 -2
  19. app/src/content/assets/data/model_gpt_5_mini_medium.png +2 -2
  20. app/src/content/assets/data/model_gpt_oss_120b.png +2 -2
  21. app/src/content/assets/data/model_gpt_oss_20b.png +2 -2
  22. app/src/content/assets/data/model_grok_4_1_fast_reasoning.png +2 -2
  23. app/src/content/assets/data/model_kimi_k2.png +2 -2
  24. app/src/content/assets/data/overall_performance.json +2 -2
  25. app/src/content/assets/data/overall_performance.png +2 -2
  26. app/src/content/assets/data/reckless_guessing.json +2 -2
  27. app/src/content/assets/data/reckless_guessing.png +2 -2
  28. app/src/content/assets/data/score_stack.json +2 -2
  29. app/src/content/assets/data/score_stack.png +2 -2
  30. app/src/content/assets/data/score_vs_failed_guesses.json +2 -2
  31. app/src/content/assets/data/score_vs_failed_guesses.png +2 -2
  32. app/src/content/assets/data/summary.txt +71 -71
  33. app/src/content/chapters/eleusis/benchmark.mdx +7 -3
  34. app/src/content/chapters/eleusis/introduction.mdx +2 -2
  35. app/src/content/chapters/eleusis/results.mdx +49 -85
  36. app/src/content/embeds/banner.html +6 -6
  37. app/src/content/embeds/by-rule.html +3 -3
  38. app/src/content/embeds/{confidence-distribution.html → guess-rate.html} +38 -42
  39. app/src/content/embeds/overall-performance.html +5 -5
  40. app/src/content/embeds/score-stack.html +20 -54
  41. app/src/content/embeds/score-vs-failed-guesses.html +5 -5
ASSESSMENT_V2.md ADDED
@@ -0,0 +1,197 @@
# Revised Assessment: Eleusis Benchmark Article (v2)

## Executive Summary

The article has improved significantly since the first assessment. The **Results section is now well-structured**, with a clear narrative arc: overall performance → the metacognition insight → caution/recklessness trade-off → calibration → performance by rule. The key message about metacognition is now prominent and supported by the logical flow.

The main remaining issues are:
1. **Data inconsistencies** between the text and the data files (several numbers are outdated)
2. **The "Deeper Analysis" section**, much of which now duplicates the improved Results section and needs restructuring
3. Minor typos

---

## 1. What's Working Well

### 1.1 Results Section Structure
The new structure is excellent:
```
Results
├── Overall Performance (intro)
├── Pure discovery vs metacognition (the key insight, early!)
├── Caution-Recklessness Trade-off (central analysis)
├── Confidence and Calibration (supporting evidence)
└── Performance by Rule (rule-level breakdown)
```

This addresses the main criticism from v1: readers now build understanding progressively, and the metacognition insight is front and center.

### 1.2 Figure Flow
Figures now tell a coherent story:
- Fig 1: Overview (where does each model sit?)
- Fig 2: Score breakdown (what drives score differences?)
- Fig 3: Caution vs recklessness (the key trade-off)
- Fig 4: Calibration (why is timing hard?)
- Fig 5: Guess rate (how do models decide when to commit?)
- Figs 6-7: Rule-level analysis (drill-down)

### 1.3 New Guess Rate Analysis (Figure 5)
This is a valuable addition that wasn't in the original. It shows how models operationalize their confidence into actual decisions, connecting calibration to behavior.

### 1.4 Clear Messaging
Lines like "knowing when to commit is as important as finding the rule" now appear early and are reinforced throughout.

---

## 2. Critical Issues

### 2.1 Data Inconsistencies (Must Fix)

The text still uses outdated numbers. Current data (from `summary.txt` and `overall_performance.json`) vs text:

| Metric | In Text | Actual Data |
|--------|---------|-------------|
| Claude Opus 4.5 avg score | 15.9 (conclusion.mdx:10) | **17.0** (avg_floored_score) |
| Claude Opus 4.5 success rate | 92% (conclusion.mdx:10) | **83%** |
| Claude Haiku 4.5 success rate | 70% (conclusion.mdx:10) | **56%** |
| Claude Haiku 4.5 failed guesses | 7.5/round (analysis.mdx:15) | **3.95/round** |
| Kimi K2 avg score | 14.5 (analysis.mdx:60) | **16.2** |
| GPT OSS 120B score | 12.0 (analysis.mdx:60) | **12.9** |
| GPT 5.2 High early correct turns | 3.6 (multiple places) | **3.56** ✓ (close enough) |

**Action:** Audit all numbers in `results.mdx`, `analysis.mdx`, and `conclusion.mdx` against the latest data files.

### 2.2 Typos Still Present

| Location | Issue |
|----------|-------|
| results.mdx:20 | "closed to Claude Opus 4.5" → "close to" |
| results.mdx:85 | "overconfident : for instance" → remove space before colon |
| results.mdx:86 | "GPT 5.2 is the best calibrated" → "GPT 5.2 High" |
| results.mdx:102 | "THis is somehow" → "This is somehow" |

---

## 3. The "Deeper Analysis" Section

### 3.1 Current Problem

The "Deeper Analysis" section is now partially redundant. It covers:
1. **Metacognition** (duplicates Results § "Pure discovery vs metacognition")
2. **Learning Curves** (TODO, placeholder)
3. **Failure Modes** (valuable, keep)
4. **Open vs Closed Models** (brief, could be expanded)
5. **Symmetric Rules** (interesting niche finding)
6. **Confirmation Bias** (preliminary, incomplete)
7. **Qualitative Observations** (nice examples, but disconnected)

### 3.2 Recommended Restructure

Rename it to "Discussion" and reorganize:

```markdown
## Discussion

### What Explains the Performance Gap?
- Brief synthesis: metacognition > raw ability
- The caution-recklessness trade-off determines ranking more than success rate does
- Move the GPT 5.2 High / Claude Opus 4.5 / Claude Haiku 4.5 characterizations here
  (but avoid repeating numbers already in Results)

### Scientific Temperaments
- This is where the "scientific personality" framing could shine
- The Perfectionist (GPT 5.2 High): needs too much evidence
- The Pragmatist (Claude Opus 4.5): good enough is good enough
- The Gambler (Claude Haiku 4.5): acts on insufficient evidence
- Link to real-world science: these map to actual failure modes in research

### Failure Modes [keep the accordion, it's excellent]
- Already well written; just tighten the taxonomy

### Open vs Proprietary Models
- Currently too brief (one paragraph)
- Could expand: why might open models trend reckless? (RLHF differences?)
- Kimi K2's success is notable and worth highlighting more

### Implications for AI-Assisted Science
- Currently in the Conclusion but could be expanded here
- An overconfident assistant leads researchers astray
- An overcautious assistant wastes resources
- The calibration problem is particularly concerning

### Move to Appendix (or delete)
- Learning Curves (TODO) → either implement or remove
- Symmetric Rules → niche; move to an appendix or cut
- Confirmation Bias → too preliminary; either expand significantly or cut
- Qualitative Observations → keep 1-2 good examples, cut the rest
```

### 3.3 Delete the Redundancy

The current Metacognition subsection (analysis.mdx:7-16) largely repeats what is now better expressed in Results. Either:
- Delete it entirely and rely on Results, or
- Transform it into the "Scientific Temperaments" narrative frame (more memorable)

---
136
+
137
+ ## 4. Missing Content (Lower Priority)
138
+
139
+ ### 4.1 TODOs Still Present
140
+ - Learning curves figure (analysis.mdx:22) — either implement or remove the placeholder
141
+ - Failure mode distribution stacked bar (analysis.mdx:55) — nice to have, not critical
142
+
143
+ ### 4.2 Human Baseline
144
+ Still missing. Consider adding a sentence like: "Without human performance data on the same rules, we cannot assess whether these success rates represent strong or weak performance in absolute terms—only that models differ substantially among themselves."
145
+
146
+ ### 4.3 Example Turn Figure
147
+ Would still be valuable in benchmark.mdx to make the task concrete. A simple 3-panel showing:
148
+ ```
149
+ [Board state] → [Model reasoning excerpt] → [Decision output]
150
+ ```
151
+
152
+ ---

## 5. Minor Polish

### 5.1 Model Name Consistency
Some inconsistencies remain:
- "Grok 4.1 Fast Reasoning" vs "Grok 4 1 Fast Reasoning" (in data)
- "DeepSeek R1" vs "Deepseek R1" (in data)
- Decide on one capitalization style and apply it consistently

### 5.2 The "floored" Score
The article doesn't explain that scores below 0 are floored to 0. This affects interpretation, so it might be worth a brief mention in the Benchmark section or a sidenote.

### 5.3 Sidenote on the Optimal Threshold
results.mdx mentions the 0.67 optimal threshold but doesn't explain where it comes from. A brief derivation in a sidenote would help:
> For a perfectly calibrated model: E[guess at p] = p×(points remaining) - (1-p)×2. Setting E[guess] > E[wait 1 turn] gives p > 2/3 ≈ 0.67.

---
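The suggested sidenote can also be backed by a quick numeric check. The sketch below is not the article's scoring code; it assumes, per the sidenote's formula, a fixed penalty of 2 points for a failed guess and a pot that shrinks by 1 point per turn, and it models "wait 1 turn" as scoring `p * (remaining - 1)` with no penalty risk (an additional assumption):

```python
# Numeric check of the proposed sidenote derivation. Assumptions (not taken
# from the article's scoring code): a failed guess costs 2 points, the pot
# shrinks by 1 point per turn, and waiting one turn is modelled as scoring
# p * (remaining - 1) with no penalty risk.

def e_guess_now(p: float, remaining: float) -> float:
    """Expected score from guessing now with confidence p."""
    return p * remaining - (1 - p) * 2

def e_wait(p: float, remaining: float) -> float:
    """Expected score from holding off one turn (assumed model)."""
    return p * (remaining - 1)

def break_even(remaining: float, eps: float = 1e-9) -> float:
    """Binary-search the confidence at which guessing now overtakes waiting."""
    lo, hi = 0.0, 1.0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if e_guess_now(mid, remaining) > e_wait(mid, remaining):
            hi = mid
        else:
            lo = mid
    return lo

# The margin e_guess_now - e_wait simplifies to 3p - 2, so the break-even
# confidence is 2/3 regardless of the points remaining.
for r in (25, 15, 5):
    print(f"remaining={r}: break-even p = {break_even(r):.4f}")
```

Under these assumptions the threshold is independent of the pot size, which matches the 0.67 figure quoted in results.mdx.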

## 6. Summary of Recommended Actions

### Must Do
1. ☐ Fix all data inconsistencies (audit numbers against the data files)
2. ☐ Fix the typos listed in §2.2
3. ☐ Remove or transform the redundant content in "Deeper Analysis"

### Should Do
4. ☐ Rename "Deeper Analysis" → "Discussion"
5. ☐ Restructure the Discussion per §3.2
6. ☐ Either implement the Learning Curves figure or remove the TODO

### Nice to Have
7. ☐ Add the "Scientific Temperaments" framing
8. ☐ Add an example turn figure in benchmark.mdx
9. ☐ Explain the score flooring mechanism
10. ☐ Expand the Open vs Proprietary discussion

---

## 7. Overall Assessment

**Grade: B+ (up from B-)**

The structural problems identified in v1 are largely resolved. The article now tells a clear story: models vary in their "scientific temperament," and metacognition (knowing when you know) matters as much as raw reasoning ability.

The remaining work is mostly cleanup (data consistency, typos) and deciding what to do with the Deeper Analysis section. The article is close to publication-ready once the numbers are fixed.
app/src/content/assets/data/basic_metrics.csv CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:646b5eda63192bed7d4c3372c684b263db844ad6599e2cff7cd34b945e0a03da
- size 2743
+ oid sha256:f67f1217568824b751da562d8106fae602792a64c38abb4b7c8bae75698249c0
+ size 2716
app/src/content/assets/data/by_rule.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:bedd8081e1e412f0d2453c0f6fe78153fed8433520b9e1b729fc7b11dd5b02a8
- size 30709
+ oid sha256:b9331eb6e257a86e681b479a1538f8c885bfc17aa788e67e508b142c0e2de38f
+ size 30754
app/src/content/assets/data/by_rule.png CHANGED

Git LFS Details

  • SHA256: f7c7d4ff1a927f2d44209feb1979ca355f79fa75a03e13ac413d4bdba84012a6
  • Pointer size: 131 Bytes
  • Size of remote file: 363 kB

Git LFS Details

  • SHA256: 20027da06023be1b97fdcbaf832092130f304410546497f94ddc8fa1c12b13d5
  • Pointer size: 131 Bytes
  • Size of remote file: 328 kB
app/src/content/assets/data/calibration_curves.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8f0304136f61e101e6deeb8ed81a47b177930b16666c37270c38e67a68fe8209
- size 9463
+ oid sha256:2db00c0caa1dd76dea08cad41d51e54171853d0a7361bc3b27fac76680310687
+ size 9460
app/src/content/assets/data/calibration_curves.png CHANGED

Git LFS Details

  • SHA256: 77c24472b31437c692d2514838fa1bf168067d95d374c8dc497dfc7aa0724814
  • Pointer size: 131 Bytes
  • Size of remote file: 185 kB

Git LFS Details

  • SHA256: 6affd070d951e0bc4937270d34503d0e12e1dbcec4d9a823c2bdba5df193d780
  • Pointer size: 131 Bytes
  • Size of remote file: 192 kB
app/src/content/assets/data/caution_vs_failed_guesses.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:18ca4459f477b0f044b9a023a6247899a7bd1d03ee007b3d47078ddcca9c5a1b
- size 2465
+ oid sha256:baed4a816edf6d9d421718d40f785a2bfd3eac4c0f9c33c33655d3a12a76690e
+ size 2456
app/src/content/assets/data/caution_vs_failed_guesses.png CHANGED

Git LFS Details

  • SHA256: 5939745db4ee368657decc2fdf6237ddee61a3a7feaff0e1dd687552c3bf4add
  • Pointer size: 130 Bytes
  • Size of remote file: 85 kB

Git LFS Details

  • SHA256: 090ef7b8f760ccda4b22f4acb1006469014efe16a69461c9bd1e5c76adc78f7f
  • Pointer size: 130 Bytes
  • Size of remote file: 85.5 kB
app/src/content/assets/data/complexity_analysis.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a281c2834fce731ee67126dc08e307268f411c4b7ec24006d36edccd303a6e6d
- size 2273
+ oid sha256:5260de9ec24d856da4fbc955b2a1645e00eee76bde6289212ccdd73550594e11
+ size 2363
app/src/content/assets/data/complexity_analysis.png CHANGED

Git LFS Details

  • SHA256: 7d7ed142b4271802c43e9c385ac2fd01da0a9008903655477d1a76608af86fe1
  • Pointer size: 131 Bytes
  • Size of remote file: 111 kB

Git LFS Details

  • SHA256: 167cef2c7885d32c87b478b191e30920725dd3d4e6b136c6d1f0273e1f0bfec8
  • Pointer size: 131 Bytes
  • Size of remote file: 111 kB
app/src/content/assets/data/excess_caution.png DELETED

Git LFS Details

  • SHA256: e92e86b6c731d1d0eda8a856f264198616692ab166b52847600befb6cb631584
  • Pointer size: 130 Bytes
  • Size of remote file: 91.1 kB
app/src/content/assets/data/{confidence_distribution.json → guess_rate.json} RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:c9a993bfa7aeb6a163f8aaeb4f16a9944261bd513660b08c97af69d64bab32f3
- size 9189
+ oid sha256:ef7941e2df9815331f4b02172abf8c14fd147812b4ab5a39cff7e30ccf2e9ac6
+ size 10684
app/src/content/assets/data/{confidence_distribution.png → guess_rate.png} RENAMED
File without changes
app/src/content/assets/data/model_claude_haiku_4_5.png CHANGED

Git LFS Details

  • SHA256: e156f35fcb3f764435fccf4ee3ce16b71f594721e11b673fa122f95cccc5c524
  • Pointer size: 131 Bytes
  • Size of remote file: 248 kB

Git LFS Details

  • SHA256: 0bac0488806ee34594a16054dc08541f23f12434578fc46e71f794534f397cc6
  • Pointer size: 131 Bytes
  • Size of remote file: 243 kB
app/src/content/assets/data/model_claude_opus_4_5.png CHANGED

Git LFS Details

  • SHA256: 250b07856543f2443a6b8ba3c20e15f24e3eb31bbbeda1d1e9555a5d8f4bf1b9
  • Pointer size: 131 Bytes
  • Size of remote file: 217 kB

Git LFS Details

  • SHA256: 48bd3b09a473f51994bc13d87ce4eea68837ad069579129a6cf5ad6865c2350b
  • Pointer size: 131 Bytes
  • Size of remote file: 215 kB
app/src/content/assets/data/model_deepseek_r1.png CHANGED

Git LFS Details

  • SHA256: 9408c2f99fb62f626909a296150243597be4ba2976d68b2c7b848b5fcba4f33a
  • Pointer size: 131 Bytes
  • Size of remote file: 249 kB

Git LFS Details

  • SHA256: ad5b37144c6c79ccf48d8511aa4d28c0b06eaed9d51ef849cd9f76c0a3f091bd
  • Pointer size: 131 Bytes
  • Size of remote file: 244 kB
app/src/content/assets/data/model_gemini_3_flash_preview_low.png CHANGED

Git LFS Details

  • SHA256: 75ca6f7798384cf21e16d6ba6a9a7c8eca3d3abb7849767ee628319486dae785
  • Pointer size: 131 Bytes
  • Size of remote file: 235 kB

Git LFS Details

  • SHA256: fe44a0f0b28e87f452a3c75c94d857e77109684c5d49d52d15cccaf2e1424f97
  • Pointer size: 131 Bytes
  • Size of remote file: 236 kB
app/src/content/assets/data/model_gpt_5_2_high.png CHANGED

Git LFS Details

  • SHA256: 3ce556cf0f570a3c13535287608f734e8f103e308a7cd8db80f355f309003e6c
  • Pointer size: 131 Bytes
  • Size of remote file: 194 kB

Git LFS Details

  • SHA256: 97e06e120f153bf61673b5d22c0df51c1b7a8f2a877415e613a63dae08f39a66
  • Pointer size: 131 Bytes
  • Size of remote file: 195 kB
app/src/content/assets/data/model_gpt_5_mini_medium.png CHANGED

Git LFS Details

  • SHA256: 8f2a375dfbf81219ac33ceef6f42f4c8d9028d4a3867920a44218db927056985
  • Pointer size: 131 Bytes
  • Size of remote file: 210 kB

Git LFS Details

  • SHA256: c6071ad124d175223a7627e98e674224b5ca535f850d3dd277be9a55fa6366e2
  • Pointer size: 131 Bytes
  • Size of remote file: 211 kB
app/src/content/assets/data/model_gpt_oss_120b.png CHANGED

Git LFS Details

  • SHA256: c4d22765888054e1b220a66e9bf42278bf46aab27a3a8733f0ebb7e71db9c13e
  • Pointer size: 131 Bytes
  • Size of remote file: 259 kB

Git LFS Details

  • SHA256: 753654ac9b7282e0466312ec1048e497213d0077b0310eaa64c100fd50468e24
  • Pointer size: 131 Bytes
  • Size of remote file: 256 kB
app/src/content/assets/data/model_gpt_oss_20b.png CHANGED

Git LFS Details

  • SHA256: 41b5dfd6881d9e30e03a91de49517faa4a9cb94c9b88031ce0f52ceb431470df
  • Pointer size: 131 Bytes
  • Size of remote file: 270 kB

Git LFS Details

  • SHA256: 3d6c8716fdf53daa9662b5f3e6f433ee31ee3800f45686603d6c6d14f81ab161
  • Pointer size: 131 Bytes
  • Size of remote file: 264 kB
app/src/content/assets/data/model_grok_4_1_fast_reasoning.png CHANGED

Git LFS Details

  • SHA256: 08f64a210f54161c501c19e8906518c7d5a6cc55b36749e9c31cb570a09170ee
  • Pointer size: 131 Bytes
  • Size of remote file: 221 kB

Git LFS Details

  • SHA256: dfaec9ab9886e4c9dda4bf3208dbfe911f39310c4b52d4c35ae8ec6505e53b8d
  • Pointer size: 131 Bytes
  • Size of remote file: 217 kB
app/src/content/assets/data/model_kimi_k2.png CHANGED

Git LFS Details

  • SHA256: 04d4c263b639177670769f818380e061a69b259e7aa073b1151fbd737d19cd07
  • Pointer size: 131 Bytes
  • Size of remote file: 238 kB

Git LFS Details

  • SHA256: 03cc97b30a1949a60ca51d360b1b177ace72d0b36e3fb2149cf11f2b1beffe62
  • Pointer size: 131 Bytes
  • Size of remote file: 233 kB
app/src/content/assets/data/overall_performance.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:67f55d87526715789a9b2c902de6acc78f69dc5fd13300eb97e511668bca8003
- size 2303
+ oid sha256:6ddb4557b07eab530ae73d9ce849c542f503fc3656166e9b6164034b5cba83bf
+ size 2391
app/src/content/assets/data/overall_performance.png CHANGED

Git LFS Details

  • SHA256: 9d182c87b70f17018bd0664f1812b0ed0b99dbb107e1e455810f64dd21040f24
  • Pointer size: 130 Bytes
  • Size of remote file: 76.2 kB

Git LFS Details

  • SHA256: 02a2fedd1f6b603d295472aa3ceae73c0159a6b6e675311a6376e2323441bf3d
  • Pointer size: 130 Bytes
  • Size of remote file: 79 kB
app/src/content/assets/data/reckless_guessing.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:a708723564f2779c2600346e347e2cff985a247bc950707d7f5c58137e05395b
- size 19220
+ oid sha256:98cdab33871f6d7dfec15536fa6e67c5f05ecf6ca12ff2db506303b68318ec0b
+ size 14795
app/src/content/assets/data/reckless_guessing.png CHANGED

Git LFS Details

  • SHA256: a73c1561ab35ed2e308d9cea71e3c77c116fb3d1d5619878a60d68ee1a031fbe
  • Pointer size: 130 Bytes
  • Size of remote file: 69.6 kB

Git LFS Details

  • SHA256: 24f149c88b5a19fca9028e8284f97105e54e1315330eeed11ac7729511fd432d
  • Pointer size: 130 Bytes
  • Size of remote file: 69 kB
app/src/content/assets/data/score_stack.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:d64dd73c3b7173b627be30fab1720d57fde169a419d6038a9dec3129a2c93a60
- size 3723
+ oid sha256:3e41c2dc2bf2fb303a9a1c2550f9e4f7274812ddbd8a937ed68d7969558f5a1c
+ size 2876
app/src/content/assets/data/score_stack.png CHANGED

Git LFS Details

  • SHA256: 770e7bbbe723acad84dd1ecd4ff8310abd3fd60417953c5b961464d85111e328
  • Pointer size: 130 Bytes
  • Size of remote file: 83.3 kB

Git LFS Details

  • SHA256: 52838c774d90420913b723d63efeb44035c9d1cfd4e482bdb5c2a6d14037f47a
  • Pointer size: 130 Bytes
  • Size of remote file: 81.6 kB
app/src/content/assets/data/score_vs_failed_guesses.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:581795032120f5075ef4f805472d19deebe0602aa6737e07bc62a35062f97758
- size 2215
+ oid sha256:dca6c8ada5a4856ee2227386c4fee89a4f06901708927b47d9b203a360a3bf52
+ size 2294
app/src/content/assets/data/score_vs_failed_guesses.png CHANGED

Git LFS Details

  • SHA256: 655ad7167280626d2e194caf6115b8afee946a61faca5cc0b2b2f9ded65c6999
  • Pointer size: 130 Bytes
  • Size of remote file: 73.7 kB

Git LFS Details

  • SHA256: 17cd1757ff0bd6247884a1f5e565066001ce6f43d51be31fb8b930d9b1cd3a96
  • Pointer size: 130 Bytes
  • Size of remote file: 76.3 kB
app/src/content/assets/data/summary.txt CHANGED
@@ -26,16 +26,16 @@ BASIC MODEL COMPARISON
26
  ============================================================
27
 
28
  model rounds_played total_score avg_score total_floored_score avg_floored_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate total_no_stakes_score avg_no_stakes_score avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio
29
- Gpt 5.2 High 78 1158 14.846154 1174 15.051282 1205 3341037 73525.83 0.333333 0.961538 1505.0 19.294872 2772.644813 61.017286 25.858974 43.513162 0.594279
30
- Claude Opus 4.5 78 1128 14.461538 1324 16.974359 852 4333716 86367.64 2.769231 0.923077 1598.0 20.487179 5086.521127 101.370469 87.525641 180.000684 0.486252
31
- Gpt 5 Mini Medium 78 942 12.076923 1052 13.487179 1261 3618399 58345.97 1.256410 0.756410 1325.0 16.987179 2869.467883 46.269603 58.166667 115.878291 0.501963
32
- Gemini 3 Flash Preview Low 78 817 10.474359 1024 13.128205 1315 1581524 12702.02 1.717949 0.769231 1226.0 15.717949 1202.679848 9.659331 61.128205 154.810427 0.394858
33
- Kimi K2 78 804 10.307692 1262 16.179487 975 12281540 101346.76 4.025641 0.858974 1481.0 18.987179 12596.451282 103.945395 182.564103 343.003761 0.532251
34
- Grok 4 1 Fast Reasoning 78 737 9.448718 1182 15.153846 998 8178655 120364.22 4.320513 0.884615 1441.0 18.474359 8195.045090 120.605431 109.256410 357.652821 0.305482
35
- Gpt Oss 120B 78 580 7.435897 1004 12.871795 1243 3190828 24633.15 3.692308 0.756410 1279.0 16.397436 2567.037812 19.817498 186.794872 225.517949 0.828293
36
- Deepseek R1 78 511 6.551282 1036 13.282051 1104 9229131 165334.16 5.064103 0.833333 1331.0 17.064103 8359.720109 149.759203 152.269231 353.910598 0.430248
37
- Gpt Oss 20B 78 131 1.679487 927 11.884615 1297 7009392 62397.50 6.205128 0.717949 1206.0 15.461538 5404.311488 48.109098 230.115385 421.666496 0.545728
38
- Claude Haiku 4.5 78 -37 -0.474359 894 11.461538 1254 6973411 57734.39 7.551282 0.705128 1198.0 15.358974 5560.933812 46.040183 244.730769 504.499316 0.485096
39
 
40
  Saved: results/260121_78_rounds/basic_metrics.csv
41
  Saved: results/260121_78_rounds/overall_performance.png
@@ -44,8 +44,8 @@ Saved: results/260121_78_rounds/score_vs_failed_guesses.png
44
  Saved: results/260121_78_rounds/score_vs_failed_guesses.json
45
  Saved: results/260121_78_rounds/calibration_curves.png
46
  Saved: results/260121_78_rounds/calibration_curves.json
47
- Saved: results/260121_78_rounds/confidence_distribution.png
48
- Saved: results/260121_78_rounds/confidence_distribution.json
49
  Saved: results/260121_78_rounds/score_stack.png
50
  Saved: results/260121_78_rounds/score_stack.json
51
 
@@ -53,16 +53,16 @@ Saved: results/260121_78_rounds/score_stack.json
53
  COMPLEXITY ANALYSIS
54
  ============================================================
55
 
56
- Optimal K for aggregated complexity: 0.42
57
- Formula: complexity = cyclomatic + 0.42 * node_count
58
- Correlation with success_rate: -0.612
59
 
60
  Stats by complexity quartile:
61
- complexity_bin count avg_score success_rate
62
- Q1 240 18.745833 0.966667
63
- Q2 150 11.246667 0.893333
64
- Q3 180 11.138889 0.866667
65
- Q4 210 -6.761905 0.547619
66
 
67
  Saved: results/260121_78_rounds/complexity_analysis.png
68
  Saved: results/260121_78_rounds/complexity_analysis.json
@@ -71,34 +71,34 @@ Saved: results/260121_78_rounds/complexity_analysis.json
71
  BY-RULE ANALYSIS
72
  ============================================================
73
 
74
- Score by rule (sorted by avg_score):
75
- rule_description count avg_score std_score success_rate
76
- Only red cards (hearts or diamonds). 30 25.633333 2.204749 1.000000
77
- Only cards of the suit spades. 30 25.200000 2.023994 1.000000
78
- Cards must alternate between red and black colors. Any card may start the line. 30 25.166667 2.640315 1.000000
79
- Only cards with an even rank (2,4,6,8,10,12). 30 24.300000 2.692903 1.000000
80
- The card must be of a different suit than the card just before it. Any card may start the line. 30 21.666667 8.659590 0.966667
81
- Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 30 20.666667 5.148373 1.000000
82
- Only Aces (rank 1) . 30 20.233333 8.931476 0.966667
83
- The card must be of a different suit than but same color as the card just before it. Any card may start the line. 30 19.866667 7.541761 1.000000
84
- Only hearts, clubs, and diamonds allowed. Spades are forbidden. 30 19.533333 10.836507 0.966667
85
- Only spades and diamonds. 30 19.066667 4.487018 1.000000
86
- Only ranks that are prime numbers (2,3,5,7,11,13). 30 18.633333 12.527166 0.966667
87
- Only face cards (11,12,13). 30 17.033333 16.044084 0.900000
88
- Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 30 15.100000 12.234350 1.000000
89
- Only cards between 1 and 7 inclusive. 30 13.366667 10.148835 0.966667
90
- Only black face cards. 30 7.700000 16.316165 0.900000
91
- Only red cards whose rank is <=7. 30 4.866667 11.227225 1.000000
92
- Only cards between 5 and 9 inclusive. 30 4.666667 14.406257 0.933333
93
- Alternate face and number cards. Any card may start the line. 30 0.366667 20.553519 0.733333
94
- Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 30 -1.066667 20.915154 0.666667
95
- Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 30 -3.433333 22.931206 0.600000
96
- Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 30 -5.200000 18.917972 0.766667
97
- Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 30 -10.466667 13.050917 0.533333
98
- Face cards (11-13) must be red; number cards (1-10) must be black. 30 -11.500000 17.814659 0.500000
99
- Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 30 -12.066667 16.772172 0.400000
100
- If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 30 -15.633333 15.354396 0.333333
101
- Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 30 -18.000000 16.103116 0.133333
102
 
103
  Saved: results/260121_78_rounds/by_rule.png
104
  Saved: results/260121_78_rounds/by_rule.json
@@ -139,32 +139,32 @@ Double-Down Rate: After a wrong guess, % of next turns with another guess
139
  (Only counts official guesses, not shadow/tentative guesses)
140
 
141
  Model Wrong Guesses Next Turn Guesses Double-Down %
142
- Kimi K2 314 207 65.9
143
- Claude Haiku 4.5 589 362 61.5
144
- Grok 4 1 Fast Reasoning 337 203 60.2
145
- Gpt Oss 20B 484 290 59.9
146
- Deepseek R1 395 229 58.0
147
- Claude Opus 4.5 216 91 42.1
148
- Gpt Oss 120B 288 108 37.5
149
- Gemini 3 Flash Preview Low 134 41 30.6
150
- Gpt 5 Mini Medium 98 9 9.2
151
- Gpt 5.2 High 26 1 3.8
152
 
153
  Wrong Guess Streak Statistics:
154
  Model Streaks Mean Length Max Length Total Wrong
155
- Kimi K2 120 2.62 14 314
156
- Claude Haiku 4.5 244 2.41 16 589
157
- Grok 4 1 Fast Reasoning 149 2.26 12 337
158
- Gpt Oss 20B 207 2.34 13 484
159
- Deepseek R1 180 2.19 9 395
160
- Claude Opus 4.5 139 1.55 5 216
161
- Gpt Oss 120B 184 1.57 8 288
162
- Gemini 3 Flash Preview Low 97 1.38 4 134
163
- Gpt 5 Mini Medium 91 1.08 3 98
164
- Gpt 5.2 High 25 1.04 2 26
165
-
166
- Longest streak: 16 consecutive wrong guesses
167
- - Claude Haiku 4.5 in round 77
168
 
169
  Saved: results/260121_78_rounds/reckless_guessing.png
170
  Saved: results/260121_78_rounds/reckless_guessing.json
 
  ============================================================

  model rounds_played total_score avg_score total_floored_score avg_floored_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate total_no_stakes_score avg_no_stakes_score avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio
+ Claude Opus 4.5 78 1128 14.461538 1324 16.974359 852 4333716 86367.64 2.000000 0.833333 1598.0 20.487179 5086.521127 101.370469 25.000000 81.385983 0.307178
+ Kimi K2 78 804 10.307692 1262 16.179487 975 12281540 101346.76 2.038462 0.769231 1481.0 18.987179 12596.451282 103.945395 25.538462 88.446496 0.288745
+ Grok 4 1 Fast Reasoning 78 737 9.448718 1182 15.153846 998 8178655 120364.22 2.564103 0.717949 1441.0 18.474359 8195.045090 120.605431 25.243590 106.499829 0.237029
+ Gpt 5.2 High 78 1158 14.846154 1174 15.051282 1205 3341037 73525.83 0.282051 0.948718 1505.0 19.294872 2772.644813 61.017286 24.628205 36.601709 0.672870
+ Gpt 5 Mini Medium 78 942 12.076923 1052 13.487179 1261 3618399 58345.97 1.166667 0.705128 1325.0 16.987179 2869.467883 46.269603 39.141026 82.882051 0.472250
+ Deepseek R1 78 511 6.551282 1036 13.282051 1104 9229131 165334.16 3.192308 0.641026 1331.0 17.064103 8359.720109 149.759203 29.628205 115.135043 0.257334
+ Gemini 3 Flash Preview Low 78 817 10.474359 1024 13.128205 1315 1581524 12702.02 0.961538 0.705128 1226.0 15.717949 1202.679848 9.659331 29.923077 83.049573 0.360304
+ Gpt Oss 120B 78 580 7.435897 1004 12.871795 1243 3190828 24633.15 2.153846 0.679487 1279.0 16.397436 2567.037812 19.817498 46.692308 78.676239 0.593474
+ Gpt Oss 20B 78 131 1.679487 927 11.884615 1297 7009392 62397.50 2.974359 0.589744 1206.0 15.461538 5404.311488 48.109098 47.576923 88.239487 0.539180
+ Claude Haiku 4.5 78 -37 -0.474359 894 11.461538 1254 6973411 57734.39 3.948718 0.564103 1198.0 15.358974 5560.933812 46.040183 45.102564 107.387350 0.419999

  Saved: results/260121_78_rounds/basic_metrics.csv
  Saved: results/260121_78_rounds/overall_performance.png

  Saved: results/260121_78_rounds/score_vs_failed_guesses.json
  Saved: results/260121_78_rounds/calibration_curves.png
  Saved: results/260121_78_rounds/calibration_curves.json
+ Saved: results/260121_78_rounds/guess_rate.png
+ Saved: results/260121_78_rounds/guess_rate.json
  Saved: results/260121_78_rounds/score_stack.png
  Saved: results/260121_78_rounds/score_stack.json
 
 
  COMPLEXITY ANALYSIS
  ============================================================

+ Optimal K for aggregated complexity: 0.14
+ Formula: complexity = cyclomatic + 0.14 * node_count
+ Correlation with success_rate: -0.659

  Stats by complexity quartile:
+ complexity_bin count avg_floored_score success_rate
+ Q1 240 19.829167 0.920833
+ Q2 150 14.973333 0.773333
+ Q3 180 15.344444 0.794444
+ Q4 210 5.295238 0.371429

  Saved: results/260121_78_rounds/complexity_analysis.png
  Saved: results/260121_78_rounds/complexity_analysis.json
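For reference, the aggregated complexity above and its correlation with success rate can be recomputed with a short sketch (the plain-Python Pearson helper and the example values are ours, not the benchmark's code):

```python
from statistics import mean

def aggregated_complexity(cyclomatic, node_count, k=0.14):
    # Combined complexity as printed above: cyclomatic + k * node_count.
    return cyclomatic + k * node_count

def pearson(xs, ys):
    # Plain Pearson correlation coefficient, e.g. between per-rule
    # aggregated complexity and per-rule success rate.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

With the benchmark's per-rule values, this correlation comes out negative (-0.659 above): more complex rules are solved less often.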
 
  BY-RULE ANALYSIS
  ============================================================

+ Score by rule (sorted by avg_floored_score):
+ rule_description count avg_floored_score std_floored_score success_rate
+ Only red cards (hearts or diamonds). 30 25.633333 2.204749 1.000000
+ Only cards of the suit spades. 30 25.200000 2.023994 1.000000
+ Cards must alternate between red and black colors. Any card may start the line. 30 25.166667 2.640315 1.000000
+ Only cards with an even rank (2,4,6,8,10,12). 30 24.300000 2.692903 1.000000
+ The card must be of a different suit than the card just before it. Any card may start the line. 30 22.200000 6.477547 0.966667
+ Only hearts, clubs, and diamonds allowed. Spades are forbidden. 30 20.666667 5.516954 0.966667
+ Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 30 20.666667 5.148373 1.000000
+ Only Aces (rank 1) . 30 20.366667 8.580183 0.933333
+ The card must be of a different suit than but same color as the card just before it. Any card may start the line. 30 20.333333 5.541899 0.966667
+ Only ranks that are prime numbers (2,3,5,7,11,13). 30 19.966667 6.965349 0.933333
+ Only face cards (11,12,13). 30 19.833333 8.288269 0.900000
+ Only spades and diamonds. 30 19.066667 4.487018 1.000000
+ Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 30 16.766667 7.663993 0.900000
+ Only cards between 1 and 7 inclusive. 30 14.466667 7.238467 0.900000
+ Only black face cards. 30 11.466667 9.000894 0.700000
+ Alternate face and number cards. Any card may start the line. 30 9.066667 9.409948 0.600000
+ Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 30 8.600000 9.193701 0.533333
+ Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 30 8.333333 9.400416 0.533333
+ Only cards between 5 and 9 inclusive. 30 8.166667 7.153891 0.666667
+ Only red cards whose rank is <=7. 30 7.700000 6.808463 0.666667
+ Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 30 4.966667 6.321738 0.500000
+ Face cards (11-13) must be red; number cards (1-10) must be black. 30 2.966667 5.816050 0.266667
+ Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 30 2.466667 5.612384 0.200000
+ If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 30 1.600000 4.022351 0.166667
+ Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 30 1.533333 3.598212 0.200000
+ Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 30 1.133333 3.549972 0.100000

  Saved: results/260121_78_rounds/by_rule.png
  Saved: results/260121_78_rounds/by_rule.json
 
  (Only counts official guesses, not shadow/tentative guesses)

  Model Wrong Guesses Next Turn Guesses Double-Down %
+ Grok 4 1 Fast Reasoning 200 108 54.0
+ Deepseek R1 249 132 53.0
+ Claude Haiku 4.5 308 161 52.3
+ Kimi K2 159 67 42.1
+ Gpt Oss 20B 232 97 41.8
+ Claude Opus 4.5 156 50 32.1
+ Gpt Oss 120B 168 37 22.0
+ Gemini 3 Flash Preview Low 75 15 20.0
+ Gpt 5 Mini Medium 91 8 8.8
+ Gpt 5.2 High 22 0 0.0

  Wrong Guess Streak Statistics:
  Model Streaks Mean Length Max Length Total Wrong
+ Grok 4 1 Fast Reasoning 103 1.94 8 200
+ Deepseek R1 121 2.06 7 249
+ Claude Haiku 4.5 157 1.96 7 308
+ Kimi K2 100 1.59 7 159
+ Gpt Oss 20B 141 1.65 7 232
+ Claude Opus 4.5 115 1.36 5 156
+ Gpt Oss 120B 133 1.26 5 168
+ Gemini 3 Flash Preview Low 63 1.19 4 75
+ Gpt 5 Mini Medium 85 1.07 3 91
+ Gpt 5.2 High 22 1.00 1 22
+
+ Longest streak: 8 consecutive wrong guesses
+ - Grok 4 1 Fast Reasoning in round 67

  Saved: results/260121_78_rounds/reckless_guessing.png
  Saved: results/260121_78_rounds/reckless_guessing.json
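The double-down rate and streak statistics above can be derived from a per-round sequence of official guess outcomes (a sketch; the `'wrong'`/`'right'`/`None` encoding is ours, and we assume any turn without a wrong guess breaks a streak):

```python
def guess_stats(turns):
    # One round, one entry per turn: 'wrong' (failed official guess),
    # 'right' (correct guess), or None (no official guess that turn).
    wrong = sum(1 for t in turns if t == 'wrong')
    # Double-down: an official guess on the very next turn after a wrong one.
    double_downs = sum(
        1 for prev, cur in zip(turns, turns[1:])
        if prev == 'wrong' and cur is not None
    )
    # Lengths of runs of consecutive wrong guesses.
    streaks, current = [], 0
    for t in turns:
        if t == 'wrong':
            current += 1
        elif current:
            streaks.append(current)
            current = 0
    if current:
        streaks.append(current)
    return {'wrong': wrong, 'double_downs': double_downs, 'streaks': streaks}
```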
app/src/content/chapters/eleusis/benchmark.mdx CHANGED
@@ -24,11 +24,13 @@ The game uses a standard 52-card deck with ranks 1–13 (Ace through King) and f

  On each turn, the player selects a card from their hand to play. If the card satisfies the secret rule, it joins the mainline; if rejected, it's placed in a sideline below the mainline at that position. While playing a card, the player may attempt to guess the rule. The game continues until the player correctly identifies the rule or reaches 30 turns.

- When correctly guessing the rule, the player scores as many points as the number of remaining turns, and each wrong guess deducts a penalty of 2 points:

  $$\text{score} = (30 - \text{turns\_elapsed} + 1) - 2 \times \text{num\_wrong\_guesses}$$

- A player who correctly identifies the rule on turn 13 with no wrong guesses scores 18 points; one who made 3 wrong guesses along the way scores only 12. Failing to identify the rule scores 0 but penalties for wrong guesses still apply, leading to possibly a negative score. This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence, exactly the calibration we want to measure.


  ### Rule Library

@@ -67,4 +69,6 @@ Example output
  }
  ```

- This structure lets us analyze not just whether models succeed, but *how* they reason: Do they update hypotheses appropriately when evidence contradicts them? Do they explore strategically or play conservatively? Is their stated confidence calibrated to their actual accuracy?


  On each turn, the player selects a card from their hand to play. If the card satisfies the secret rule, it joins the mainline; if rejected, it's placed in a sideline below the mainline at that position. While playing a card, the player may attempt to guess the rule. The game continues until the player correctly identifies the rule or reaches 30 turns.

+ When the player correctly guesses the rule, they score as many points as the number of remaining turns (including the current one), and each wrong guess deducts a penalty of 2 points:

  $$\text{score} = (30 - \text{turns\_elapsed} + 1) - 2 \times \text{num\_wrong\_guesses}$$

+ A player who correctly identifies the rule on turn 13 with no wrong guesses scores 18 points; one who made 3 wrong guesses along the way scores only 12. If penalties drive the score to zero or below, the round stops and the final score is recorded as zero (much like a scientist who has exhausted their resources).
+
+ This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence and acting accordingly.
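A minimal sketch of this scoring rule (the function and argument names are ours, not the benchmark's API):

```python
def raw_score(turns_elapsed, num_wrong_guesses, solved, max_turns=30):
    # Remaining turns (including the current one) if the rule was found,
    # minus 2 points per wrong guess; can go negative.
    reward = (max_turns - turns_elapsed + 1) if solved else 0
    return reward - 2 * num_wrong_guesses

def floored_score(turns_elapsed, num_wrong_guesses, solved, max_turns=30):
    # A round whose score drops to zero or below ends with a score of 0.
    return max(raw_score(turns_elapsed, num_wrong_guesses, solved, max_turns), 0)
```

For example, solving on turn 13 with no wrong guesses yields 18 points; with 3 wrong guesses along the way, only 12.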

  ### Rule Library


  }
  ```

+ This structure lets us analyze not just whether models succeed, but *how* they reason: Do they update hypotheses appropriately when evidence contradicts them? Do they explore strategically or play conservatively? Is their stated confidence calibrated to their actual accuracy? In particular, forcing the model to articulate a tentative rule and a confidence level in it (even when it doesn't want to guess yet) lets us evaluate that rule (secretly) nonetheless, which will be useful for measuring calibration and guessing abilities.
+
+
app/src/content/chapters/eleusis/introduction.mdx CHANGED
@@ -11,9 +11,9 @@ Large language models are increasingly being deployed as tools for scientific re

  Most reasoning benchmarks test whether models can solve well-defined problems: given premises, derive a conclusion. The ARC challenge, for instance, evaluates inductive reasoning on visual patterns. These benchmarks capture important capabilities, but they miss something fundamental about how science actually works.

- Real scientific reasoning is not a single inference step. It's an iterative agentic process of observation, hypothesis formation, experimentation, and refinement, often spanning many cycles before reaching a conclusion. It requires not just logical ability, but also *strategic* thinking: which experiment to run next, how much evidence is enough, when to commit to a theory versus when to keep exploring.

- Beyond pure reasoning, effective science depends on psychological factors that are rarely evaluated: **calibration** (does my confidence match my actual accuracy?), **metacognition** (how certain am I about my uncertainty?), and resistance to **cognitive biases** like confirmation bias (seeking only evidence that supports my current hypothesis). A scientist who is brilliant at deduction but overconfident in weak theories will waste resources pursuing dead ends. One who is well-calibrated but overly cautious may never publish.

  We wanted to test whether LLMs can exhibit these deeper aspects of scientific reasoning. To do this, we turned to an unlikely source: a 1950s card game called Eleusis.


  Most reasoning benchmarks test whether models can solve well-defined problems: given premises, derive a conclusion. The ARC challenge, for instance, evaluates inductive reasoning on visual patterns. These benchmarks capture important capabilities, but they miss something fundamental about how science actually works.

+ Real scientific reasoning is not a single inference step. It's an iterative agentic process of observation, hypothesis formation, experimentation, and refinement, often spanning many cycles before reaching a conclusion. It requires not just logical ability, but also *strategic* thinking: which experiment to run next, how much evidence is enough, when to commit to a theory versus when to keep exploring.

+ Beyond pure reasoning, effective science depends on psychological factors that are rarely evaluated: **calibration** (does my confidence match my actual accuracy?), **metacognition** (how certain am I about my uncertainty?), and resistance to **cognitive biases** like confirmation bias (seeking only evidence that supports my current hypothesis instead of trying to challenge it). A scientist who is brilliant at deduction but overconfident in weak theories will waste resources pursuing dead ends. One who is well-calibrated but overly cautious may never publish.

  We wanted to test whether LLMs can exhibit these deeper aspects of scientific reasoning. To do this, we turned to an unlikely source: a 1950s card game called Eleusis.
app/src/content/chapters/eleusis/results.mdx CHANGED
@@ -8,7 +8,7 @@ import HtmlEmbed from "../../../components/HtmlEmbed.astro";

  ### Overall Performance

- We evaluated ten models on the Eleusis benchmark, including both proprietary and open-weight models. Performance is measured as the average score per turn. We also report token usage (output + reasoning) to compare efficiency.

  <HtmlEmbed
  src="overall-performance.html"
@@ -17,130 +17,94 @@ We evaluated ten models on the Eleusis benchmark, including both proprietary and
  wide
  />

- Performance varies dramatically among tested models. Claude Opus 4.5 achieves top performance with moderate token usage. The open-weight model Kimi K2 (14.5 avg score) performs competitively with the best proprietary models, outperforming GPT 5.2 High and being closed to Claude Opus 4.5, but at the price of a 2.5× larger reasoning budget.

- GPT-5-Mini, GPT OSS-120B and Gemini 3 Flash Preview Low cluster in the mid-tier (12–13 avg score) with moderate token usage. Grok 4.1 Fast Reasoning performs similarly but with higher token costs.

- Deepseek R1, an open-weight model specialized for reasoning tasks, lags behind at 10.9 avg score despite high token usage, suggesting limitations in its scientific reasoning capabilities. As we will see, its guessing strategy and calibration are suboptimal.

- ### Confidence and Calibration
-
- Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempted to guess. This allows us to evaluate calibration: does reported confidence match actual accuracy?

- <HtmlEmbed
- src="calibration-curves.html"
- caption="<strong>Figure 2:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence: they correspond to confidence levels where actual success rates are lower than reported. Click legend items to show/hide models."
- id="fig-calibration"
- />
-
- The calibration analysis reveals several patterns:
-
- - **All models are overconfident** : for instance when they report 80% confidence, their actual success rates are often closer to 20% !
- - GPT 5.2 is the best calibrated model overall.
- - Even models with a strong performance like Claude Opus 4.5 and Kimi K2 show significant overconfidence.

- It is also interesting to examine the distribution of confidence levels when models choose to guess.

  <HtmlEmbed
- src="confidence-distribution.html"
- caption="<strong>Figure 3:</strong> Distribution of confidence levels when models choose to formally guess. Each bar shows the proportion of guesses made at that confidence level. Click legend items to show/hide models."
- id="fig-confidence"

  />

- We can see that some models like Grok 4.1 or Gemini 3 will essentially only guess when very confident (9 or 10). Other like GPT 5.2 High or Kimi K2 might also guess at confidence levels 8. Surprisingly, the best performing model Claude Opus 4.5 has a more spread out guessing behavior, often guessing at confidence levels 7 or even 6. Claude Haiku 4.5 has the most reckless guessing behavior, mostly guessing at confidence levels 6 to 8.

- Being able to separate confidence levels when guessing vs not guessing is an important metacognitive skill. Models that guess only when very confident are less likely to make wrong guesses, but may miss opportunities to commit early and gain points. Models that guess at lower confidence levels risk more wrong guesses, but can capitalize on early correct guesses. This trade-off is explored next.

- Note that in principle there is a decision-theoretic optimal confidence threshold for guessing, which depends on the scoring system. Given the scoring that rewards 1 point per turn left, with 2 points penalty for a wrong guess, the optimal threshold is 0.67 (i.e., guess when you believe your tentative rule has at least a 67% chance of being correct). Of course this assumes perfect calibration, which none of the models achieve.

- ### Guessing Strategy

- The scoring system creates a strategic tension: guess early for more points, but wrong guesses are costly. How do models navigate this tradeoff? We can analyze their guessing efficiency by plotting average score vs average number of failed guesses per round.

- <HtmlEmbed
- src="score-vs-failed-guesses.html"
- caption="<strong>Figure 4:</strong> Score vs. failed guesses per round. Models in the upper-left are efficient (high scores, few wrong guesses). Models that guess recklessly appear on the right with low scores."
- id="fig-guessing"
- />

  <Sidenote>
- The optimal strategy depends on accurate self-assessment—knowing when you've gathered enough evidence to commit.
  </Sidenote>

- ### The Caution-Recklessness Trade-off
-
- Failed guesses tell only half the story. A model might avoid wrong guesses by being *too* cautious—waiting many turns after it already has the correct answer. To measure this, we tracked "early correct turns": how many consecutive turns a model's tentative rule was correct before it finally chose to guess.
-
- <HtmlEmbed
- src="excess-caution.html"
- caption="<strong>Figure 5:</strong> Distribution of early correct turns (waiting with the correct answer). Higher values indicate excessive caution—the model knew the answer but hesitated to commit. GPT 5.2 High stands out as extremely cautious, with a mean of 3.6 turns of unnecessary delay."
- id="fig-excess-caution"
- />
-
- The results reveal striking differences in guessing personalities:
-
- - **GPT 5.2 High** is remarkably cautious, waiting an average of 3.6 turns after finding the correct rule before guessing. In 87% of successful rounds, it waited at least one turn too long.
- - **Claude Opus 4.5** shows excellent timing—only 0.9 early correct turns on average, meaning it commits almost immediately after finding the answer.
- - **Claude Haiku 4.5** and **DeepSeek R1** are the least cautious (0.5 early turns), but this comes at a cost: they also have the highest failed guess rates.
-
  <HtmlEmbed
- src="caution-vs-failed-guesses.html"
- caption="<strong>Figure 6:</strong> The caution-recklessness trade-off. Models in the upper-left are cautious (delay correct guesses); models in the lower-right are reckless (many failed guesses). The ideal position is lower-left: quick to commit when right, rarely wrong."
- id="fig-caution-reckless"
  />

- <Sidenote>
- This trade-off mirrors a fundamental tension in science: being overconfident too early might risk false positives, leading to wasted resources and reputational damage; being overly cautious can delay discoveries and allow others to scoop you. Scientists must balance the risk of trying to publish too early and risk being wrong, wait too long and lose priority (or in our case, points).
- </Sidenote>
-
- This visualization reveals distinct behavioral patterns:

- * GPT 5.2 High achieves high score with very few failed guesses.

- * Claude Opus 4.5 and Kimi K2 also achieve high scores but with more failed guesses, suggesting that they are able to compensate the extra penalties they incur with their ability to commit quickly when they find the right answer, and hence score more points overall.

- * Deepseek R1 and Claude Haiku 4.5 cluster in the lower-right, being both reckless and not particularly cautious, leading to poor performance.

- The data suggests that knowing when you know is just as important as knowing the answer. Claude Opus 4.5's strong performance comes not just from finding correct rules, but from accurate metacognition, recognizing when it has gathered enough evidence to commit, even at the risk of occasional wrong guesses.

- This analysis constrats two ways of losing points : by being too cautious (waiting too long to commit) vs by being too reckless (making too many wrong guesses). A way to visualize this is to explore alternative scoring systems, as we do next.

- ### Alternative Scoring Systems

- The Eleusis scoring system includes harsh penalties: wrong guesses cost 2 points each, and rounds can end with negative scores. How much do these penalties affect rankings? To understand the impact of our scoring choices, we compare three scoring variants:

- 1. **Raw score**: The standard scoring (30 - turns - 2×wrong guesses)
- 2. **Floored score**: Same formula, but negative scores are counted as zero
- 3. **No-stakes score**: No penalty for wrong guesses, and tentative rules count as guesses

  <HtmlEmbed
- src="score-stack.html"
- caption="<strong>Figure 7:</strong> Score breakdown under alternative scoring systems. Blue shows raw score (standard scoring). Orange shows flooring gain (what models gain if negative scores count as 0). Green shows no-stakes gain (additional gain from removing wrong-guess penalties). Models sorted by total no-stakes score."
- id="fig-score-stack"
- wide
  />

- The flooring gain (orange) reveals which models frequently go negative. GPT 5.2 High gains almost nothing from flooring (0.2 points), indicating it rarely makes enough wrong guesses to go negative. In contrast, Claude Haiku 4.5 gains 11.9 points—nearly 12 points of damage averted per round on average—showing how its reckless guessing leads to catastrophic losses.
-
- The no-stakes gain (green) shows what models would gain if we simply tested their tentative rule each turn. Interestingly, this gain is relatively consistent across models (2.5–4.2 points), suggesting that most models form correct hypotheses at similar rates, but differ dramatically in their ability to *recognize* when they have the right answer.
-
- Under any scoring system, Claude Opus 4.5 and GPT 5.2 High remain the top performers. The ranking compression at no-stakes scores (15.4 to 20.5 vs raw -0.5 to 14.8) confirms that our scoring system appropriately rewards good metacognition—knowing when you know.

- ### Analysis of the reckless guessing behavior

- Some models loose a lot of points due to reckless guessing. In the "no stakes" scoring system, Claude 4.5 Opus takes the lead, Kimi K2 and Grok 4.1 have similar performance to GPT 5.2 High.

- <HtmlEmbed
- src="reckless-guessing.html"
- caption="<strong>Figure 7b:</strong> Double-down rate: how often a model guesses again immediately after a wrong guess. Higher values indicate more reckless behavior—the model keeps guessing despite recent failures."
- id="fig-reckless-guessing"
- />

  ### Performance by Rule

@@ -152,7 +116,7 @@ The following figure breaks down performance by rule across all models and runs.

  <HtmlEmbed
  src="by-rule.html"
- caption="<strong>Figure 8:</strong> Score distribution by rule. Each row is a different rule, with individual run scores shown as colored dots (one per model run). Hover over rule names for details. The left column shows average success rate. Click legend items to show/hide models."
  id="fig-by-rule"
  wide
  />
@@ -163,7 +127,7 @@ The following plot breaks down the relative score of each model (as measured by

  <HtmlEmbed
  src="complexity-analysis.html"
- caption="<strong>Figure 9:</strong> Relationship between rule complexity and model performance. The heatmap shows relative scores (value > 1 means above-average performance) for each model across complexity quartiles. Hover over cells for details."
  id="fig-complexity"
  />

  ### Overall Performance

+ We evaluated ten models on the Eleusis benchmark, including both proprietary and open-weight models. Performance is measured as the average score per round. We also report token usage (output + reasoning) per turn to compare efficiency.

  <HtmlEmbed
  src="overall-performance.html"

  wide
  />

+ Performance varies dramatically among tested models. Claude Opus 4.5 achieves top performance with moderate token usage. The open-weight model Kimi K2 comes second and performs competitively with the best proprietary models, outperforming GPT 5.2 High and coming close to Claude Opus 4.5, but at the price of a 2.5× larger reasoning budget.

+ GPT 5.2 High and Grok 4.1 Fast Reasoning show similar performance, but GPT 5.2 High is significantly more token-efficient.

+ GPT-5-Mini, GPT OSS-120B and Gemini 3 Flash Preview Low cluster in the mid-tier (around 13) with moderate token usage, while Deepseek R1, an open-weight model specialized for reasoning tasks, achieves a similar score with a much larger token count.

+ Finally, GPT-OSS 20B and Claude Haiku 4.5 lag behind, scoring between 11 and 12 with moderate token usage.

+ As we mentioned, this score reflects not only the model's raw ability to find the correct rule, but also its metacognitive skills: knowing when to commit, how confident it is, and how to balance exploration vs. exploitation. To distinguish these factors, we also computed an alternative "no-stakes" score that removes penalties for wrong guesses and counts tentative rules as guesses. This allows us to isolate pure rule-discovery ability from metacognitive skills.
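Under one natural reading of this variant (our sketch, not the benchmark's actual implementation), the round is credited on the first turn where the stated tentative rule is already correct, with no penalties:

```python
def no_stakes_score(tentative_correct, max_turns=30):
    # tentative_correct: one boolean per turn, True if the model's stated
    # tentative rule on that turn was (secretly) evaluated as correct.
    for turn, correct in enumerate(tentative_correct, start=1):
        if correct:
            return max_turns - turn + 1
    return 0
```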

+ ### Pure discovery versus metacognition

+ The following chart shows the score of each model, and the score it would have achieved under a "no stakes" scenario where guessing is free and systematic.

  <HtmlEmbed
+ src="score-stack.html"
+ caption="<strong>Figure 2:</strong> Score breakdown under alternative scoring systems. Blue shows the raw score (standard scoring), while green shows the no-stakes gain (additional gain from removing wrong-guess penalties). Models sorted by total no-stakes score."
+ id="fig-score-stack"
+ wide
  />

+ Even though this alternative scoring barely changes the relative ranking of models, it reveals important differences in their behavior. GPT 5.2 High and Claude Haiku 4.5 are the two models with the largest gap between raw and no-stakes scores (more than 4 points), while Gemini and Kimi K2 have the smallest (less than 3 points).

+ There can be two reasons for the gap between the raw and no-stakes scores:
+ 1. The model is reckless and makes a lot of wrong guesses, incurring penalties.
+ 2. The model is too cautious and waits too long before guessing, missing out on points.

+ We analyze these two aspects in more detail below.


+ ### The Caution-Recklessness Trade-off

+ To estimate how reckless or cautious a model is, we can compute the average number of failed guesses per round (recklessness), which directly relates to how many points a model loses to wrong guesses.

+ To estimate caution, we can compute how many turns, on average, a model waits while already holding the correct tentative rule before actually guessing it. This relates to how many points a model loses by waiting too long to commit.
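Both metrics are simple aggregates over per-round records (a sketch; the field names `failed_guesses`, `solved`, `guess_turn`, and `first_correct_turn` are illustrative, not the benchmark's schema):

```python
def recklessness(rounds):
    # Average number of failed official guesses per round.
    return sum(r['failed_guesses'] for r in rounds) / len(rounds)

def caution(rounds):
    # Average number of turns spent sitting on a correct tentative rule
    # before committing, over the rounds that were eventually solved.
    solved = [r for r in rounds if r['solved']]
    waits = [r['guess_turn'] - r['first_correct_turn'] for r in solved]
    return sum(waits) / len(waits)
```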

  <Sidenote>
+ This trade-off mirrors a fundamental tension in science: being overconfident too early risks false positives, wasted resources, and reputational damage; being overly cautious can delay discoveries and allow others to scoop you. Scientists must balance publishing too early and being wrong against waiting too long and losing priority (or, in our case, points).
  </Sidenote>

  <HtmlEmbed
+ src="caution-vs-failed-guesses.html"
+ caption="<strong>Figure 3:</strong> The caution-recklessness trade-off. Models in the upper-left are cautious (delay correct guesses); models in the lower-right are reckless (many failed guesses). The ideal position is lower-left: quick to commit when right, rarely wrong."
+ id="fig-caution-reckless"
  />

+ How should we interpret these values? Knowing that a failed guess costs 2 points while each turn of delay costs 1 point, the optimal number of failed guesses per round should be around 0.5 (i.e., 1 failed guess every 2 rounds) to balance the two sources of loss. Most models are above that threshold, indicating a tendency towards recklessness. This is confirmed by their low caution values (most models wait around 1 turn on average before guessing once they have the correct rule).

+ On the other hand, GPT 5.2 High stands out with very few failed guesses (0.28 per round) but high caution (waiting 3.5 turns on average before guessing once it has the correct rule). Gemini 3 Flash Preview Low and GPT 5 Mini Medium are intermediate in both dimensions, with Gemini achieving the better balance: on average, 2 points lost to recklessness and 2 points lost to caution.

+ To dig deeper into the causes of recklessness and caution, we now turn to an analysis of confidence and guessing strategies.

+ ### Confidence and Calibration

+ Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempt to guess. This allows us to evaluate calibration: does reported confidence match actual accuracy?
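The calibration curves below are essentially empirical accuracy bucketed by stated confidence (a sketch with illustrative inputs, not the benchmark's code):

```python
from collections import defaultdict

def calibration_curve(observations):
    # observations: (confidence, correct) pairs, one per turn where the
    # tentative rule was (shadow-)evaluated. Returns, for each stated
    # confidence level, the fraction of tentative rules that were correct.
    hits, totals = defaultdict(int), defaultdict(int)
    for conf, correct in observations:
        totals[conf] += 1
        hits[conf] += bool(correct)
    return {c: hits[c] / totals[c] for c in totals}
```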

+ <HtmlEmbed
+ src="calibration-curves.html"
+ caption="<strong>Figure 4:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence: they correspond to confidence levels where actual success rates are lower than reported. Click legend items to show/hide models."
+ id="fig-calibration"
+ />

+ The calibration analysis reveals several patterns:

+ - **All models are overconfident**: for instance, when they report 80% confidence, their actual success rates are often closer to 20%!
+ - GPT 5.2 is the best-calibrated model overall.
+ - Even models with strong performance like Claude Opus 4.5 and Kimi K2 show significant overconfidence.

+ Is overconfidence a problem? It depends on how the model decides to act on it.

+ For a perfectly calibrated model, as the expected loss from a failed guess is twice the expected opportunity cost of waiting one turn, the optimal confidence threshold for guessing is 0.67 (i.e., guess when you believe your tentative rule has at least a 67% chance of being correct). But do models follow such a strategy? To find out, we can look at how often models guess at each confidence level.
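The 0.67 threshold follows from a one-line expected-value comparison (a sketch, assuming perfect calibration as stated above):

```python
def should_guess(p, wrong_penalty=2, point_per_turn=1):
    # Guessing one turn earlier gains point_per_turn with probability p;
    # a failed guess costs wrong_penalty with probability 1 - p.
    # With the defaults: guess when p * 1 > (1 - p) * 2, i.e. p > 2/3.
    return p * point_per_turn > (1 - p) * wrong_penalty
```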
92
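One way to recover that 0.67 threshold is the following sketch, under the scoring assumptions stated above: a failed guess costs 2 points, while waiting one extra turn costs 1 point, and that turn penalty only bites when the tentative rule was actually correct (a wrong rule would have forced more turns anyway). Setting the two expected costs equal gives 2(1 − p) = p, i.e. p = 2/3.

```javascript
// Sketch of the breakeven computation, under the assumptions above.
// p = believed probability that the current tentative rule is correct.
function shouldGuess(p) {
  const costGuessNow = 2 * (1 - p); // expected failed-guess penalty
  const costWait = 1 * p;           // expected opportunity cost of one turn
  return costGuessNow < costWait;   // breakeven at p = 2/3 ≈ 0.67
}
```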
 
 
 
 
93
 
94
  <HtmlEmbed
95
+ src="guess-rate.html"
96
+ caption="<strong>Figure 5:</strong> Guess rate per confidence level. For a perfectly calibrated model, the optimal decision-theoretic curve would be a step at 67%. Click legend items to show/hide models."
97
+ id="fig-confidence"
 
98
  />
99
 
100
+ We can see that some models, like Grok 4.1 or Gemini 3, essentially only guess when very confident (9 or 10). Most other models guess at confidence levels above 8 and rarely below. The two Claude models behave differently: Claude Opus 4.5 tends to guess more aggressively at confidence level 8, while Claude Haiku 4.5 guesses even at confidence level 7.
 
 
 
 
101
 
102
+ On average, models are more cautious than the optimal decision-theoretic strategy for a perfectly calibrated model, which would guess as soon as confidence exceeds 67%. This is arguably a good thing, given that all models are overconfident: by raising the bar for guessing, they reduce the risk of wrong guesses and compensate for their poor calibration.
103
 
104
+ This is particularly true for Gemini 3 Flash Preview Low, which is very cautious despite being overconfident; this is probably what helps it achieve a good balance between failed guesses and opportunity cost. It is also consistent with the fact that it has the smallest difference between raw and no-stakes scores.
105
 
106
+ The case of GPT 5.2 High is different: it is both fairly well calibrated and very cautious, leading to very few failed guesses but a high opportunity cost from delayed guessing. This suggests that GPT 5.2 High could improve its score by guessing more aggressively once it has a correct tentative rule.
107
 
 
 
 
 
 
108
 
109
  ### Performance by Rule
110
 
 
116
 
117
  <HtmlEmbed
118
  src="by-rule.html"
119
+ caption="<strong>Figure 6:</strong> Score distribution by rule. Each row is a different rule, with individual run scores shown as colored dots (one per model run). Hover over rule names for details. The left column shows average success rate. Click legend items to show/hide models."
120
  id="fig-by-rule"
121
  wide
122
  />
 
127
 
128
  <HtmlEmbed
129
  src="complexity-analysis.html"
130
+ caption="<strong>Figure 7:</strong> Relationship between rule complexity and model performance. The heatmap shows relative scores (value > 1 means above-average performance) for each model across complexity quartiles. Hover over cells for details."
131
  id="fig-complexity"
132
  />
133
 
app/src/content/embeds/banner.html CHANGED
@@ -188,7 +188,7 @@
188
  <div class="model-name" style="color: ${d.color}">${d.name}</div>
189
  <div class="metric">
190
  <span class="metric-label">Score:</span>
191
- <span class="metric-value">${d.avg_score.toFixed(2)}</span>
192
  </div>
193
  <div class="metric">
194
  <span class="metric-label">Tokens/Turn:</span>
@@ -235,11 +235,11 @@
235
  const { innerWidth, innerHeight } = updateSize();
236
 
237
  // Sort models by score descending
238
- const models = [...data.models].sort((a, b) => b.avg_score - a.avg_score);
239
 
240
  // Update scales
241
  xScale
242
- .domain([0, d3.max(models, d => d.avg_score) * 1.05])
243
  .range([0, innerWidth])
244
  .nice();
245
 
@@ -285,7 +285,7 @@
285
  .attr('class', 'bar')
286
  .attr('x', 0)
287
  .attr('y', d => yScale(d.name))
288
- .attr('width', d => xScale(d.avg_score))
289
  .attr('height', barHeight)
290
  .attr('fill', d => d.color)
291
  .attr('rx', 3)
@@ -311,11 +311,11 @@
311
  .data(models, d => d.name)
312
  .join('text')
313
  .attr('class', 'score-label')
314
- .attr('x', d => xScale(d.avg_score) + 6)
315
  .attr('y', d => yScale(d.name) + barHeight / 2)
316
  .attr('dy', '0.35em')
317
  .attr('text-anchor', 'start')
318
- .text(d => d.avg_score.toFixed(1));
319
  }
320
 
321
  // Initialize
 
188
  <div class="model-name" style="color: ${d.color}">${d.name}</div>
189
  <div class="metric">
190
  <span class="metric-label">Score:</span>
191
+ <span class="metric-value">${d.avg_floored_score.toFixed(2)}</span>
192
  </div>
193
  <div class="metric">
194
  <span class="metric-label">Tokens/Turn:</span>
 
235
  const { innerWidth, innerHeight } = updateSize();
236
 
237
  // Sort models by score descending
238
+ const models = [...data.models].sort((a, b) => b.avg_floored_score - a.avg_floored_score);
239
 
240
  // Update scales
241
  xScale
242
+ .domain([0, d3.max(models, d => d.avg_floored_score) * 1.05])
243
  .range([0, innerWidth])
244
  .nice();
245
 
 
285
  .attr('class', 'bar')
286
  .attr('x', 0)
287
  .attr('y', d => yScale(d.name))
288
+ .attr('width', d => xScale(d.avg_floored_score))
289
  .attr('height', barHeight)
290
  .attr('fill', d => d.color)
291
  .attr('rx', 3)
 
311
  .data(models, d => d.name)
312
  .join('text')
313
  .attr('class', 'score-label')
314
+ .attr('x', d => xScale(d.avg_floored_score) + 6)
315
  .attr('y', d => yScale(d.name) + barHeight / 2)
316
  .attr('dy', '0.35em')
317
  .attr('text-anchor', 'start')
318
+ .text(d => d.avg_floored_score.toFixed(1));
319
  }
320
 
321
  // Initialize
app/src/content/embeds/by-rule.html CHANGED
@@ -234,7 +234,7 @@
234
  </div>
235
  <div class="metric">
236
  <span class="metric-label">Average Score:</span>
237
- <span class="metric-value">${rule.avg_score.toFixed(1)}</span>
238
  </div>
239
  <div class="metric">
240
  <span class="metric-label">Cyclomatic Complexity:</span>
@@ -311,7 +311,7 @@
311
  // Update scales
312
  const allScores = [];
313
  rules.forEach(rule => {
314
- Object.values(rule.scores_by_model).forEach(scores => {
315
  allScores.push(...scores);
316
  });
317
  });
@@ -400,7 +400,7 @@
400
  // Data points
401
  const pointData = [];
402
  rules.forEach(rule => {
403
- Object.entries(rule.scores_by_model).forEach(([modelName, scores]) => {
404
  scores.forEach((score, seedIdx) => {
405
  const color = modelColors[modelName] || '#888888';
406
  pointData.push({
 
234
  </div>
235
  <div class="metric">
236
  <span class="metric-label">Average Score:</span>
237
+ <span class="metric-value">${rule.avg_floored_score.toFixed(1)}</span>
238
  </div>
239
  <div class="metric">
240
  <span class="metric-label">Cyclomatic Complexity:</span>
 
311
  // Update scales
312
  const allScores = [];
313
  rules.forEach(rule => {
314
+ Object.values(rule.floored_scores_by_model).forEach(scores => {
315
  allScores.push(...scores);
316
  });
317
  });
 
400
  // Data points
401
  const pointData = [];
402
  rules.forEach(rule => {
403
+ Object.entries(rule.floored_scores_by_model).forEach(([modelName, scores]) => {
404
  scores.forEach((score, seedIdx) => {
405
  const color = modelColors[modelName] || '#888888';
406
  pointData.push({
app/src/content/embeds/{confidence-distribution.html → guess-rate.html} RENAMED
@@ -1,78 +1,78 @@
1
- <div class="d3-confidence-distribution"></div>
2
  <style>
3
- .d3-confidence-distribution {
4
  width: 100%;
5
  margin: 10px 0;
6
  position: relative;
7
  font-family: system-ui, -apple-system, sans-serif;
8
  }
9
 
10
- .d3-confidence-distribution svg {
11
  display: block;
12
  width: 100%;
13
  height: auto;
14
  }
15
 
16
- .d3-confidence-distribution .axes path,
17
- .d3-confidence-distribution .axes line {
18
  stroke: var(--axis-color, var(--text-color));
19
  }
20
 
21
- .d3-confidence-distribution .axes text {
22
  fill: var(--tick-color, var(--muted-color));
23
  font-size: 11px;
24
  }
25
 
26
- .d3-confidence-distribution .grid line {
27
  stroke: var(--grid-color, rgba(0,0,0,.08));
28
  }
29
 
30
- .d3-confidence-distribution .axes text.axis-label {
31
  font-size: 14px;
32
  font-weight: 500;
33
  fill: var(--text-color);
34
  }
35
 
36
- .d3-confidence-distribution .x-axis text {
37
  transform: translateY(4px);
38
  }
39
 
40
- .d3-confidence-distribution .distribution-line {
41
  fill: none;
42
  stroke-width: 1.5;
43
  }
44
 
45
- .d3-confidence-distribution .data-point {
46
  cursor: pointer;
47
  transition: opacity 0.15s ease;
48
  }
49
 
50
- .d3-confidence-distribution .data-point:hover {
51
  opacity: 0.8;
52
  }
53
 
54
- .d3-confidence-distribution .legend {
55
  font-size: 11px;
56
  }
57
 
58
- .d3-confidence-distribution .legend-item {
59
  cursor: pointer;
60
  }
61
 
62
- .d3-confidence-distribution .legend-item.dimmed .legend-line,
63
- .d3-confidence-distribution .legend-item.dimmed .legend-marker {
64
  opacity: 0.3;
65
  }
66
 
67
- .d3-confidence-distribution .legend-item.dimmed text {
68
  opacity: 0.4;
69
  }
70
 
71
- .d3-confidence-distribution .legend-text {
72
  fill: var(--text-color);
73
  }
74
 
75
- .d3-confidence-distribution .d3-tooltip {
76
  position: absolute;
77
  top: 0;
78
  left: 0;
@@ -91,22 +91,22 @@
91
  z-index: 10;
92
  }
93
 
94
- .d3-confidence-distribution .d3-tooltip .model-name {
95
  font-weight: 600;
96
  margin-bottom: 4px;
97
  }
98
 
99
- .d3-confidence-distribution .d3-tooltip .metric {
100
  display: flex;
101
  justify-content: space-between;
102
  gap: 16px;
103
  }
104
 
105
- .d3-confidence-distribution .d3-tooltip .metric-label {
106
  color: var(--muted-color);
107
  }
108
 
109
- .d3-confidence-distribution .d3-tooltip .metric-value {
110
  font-weight: 500;
111
  }
112
  </style>
@@ -129,8 +129,8 @@
129
  const bootstrap = () => {
130
  const scriptEl = document.currentScript;
131
  let container = scriptEl ? scriptEl.previousElementSibling : null;
132
- if (!(container && container.classList && container.classList.contains('d3-confidence-distribution'))) {
133
- const candidates = Array.from(document.querySelectorAll('.d3-confidence-distribution'))
134
  .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
135
  container = candidates[candidates.length - 1] || null;
136
  }
@@ -171,10 +171,10 @@
171
  // Line generator
172
  const line = d3.line()
173
  .x(d => xScale(d.confidence_level))
174
- .y(d => yScale(d.proportion));
175
 
176
  // Data loading
177
- const DATA_URL = '/data/confidence_distribution.json';
178
 
179
  function updateSize() {
180
  width = container.clientWidth || 800;
@@ -199,15 +199,15 @@
199
  <div class="model-name" style="color: ${model.color}">${model.name}</div>
200
  <div class="metric">
201
  <span class="metric-label">Confidence level:</span>
202
- <span class="metric-value">${d.confidence_level * 10}%</span>
203
  </div>
204
  <div class="metric">
205
- <span class="metric-label">Proportion:</span>
206
- <span class="metric-value">${(d.proportion * 100).toFixed(1)}%</span>
207
  </div>
208
  <div class="metric">
209
- <span class="metric-label">Count:</span>
210
- <span class="metric-value">${d.count} / ${model.total_guesses}</span>
211
  </div>
212
  `;
213
 
@@ -250,14 +250,10 @@
250
  .domain([5, 10])
251
  .range([0, innerWidth]);
252
 
253
- // Y scale: proportion (0 to max + padding)
254
- const maxProportion = d3.max(visibleModels, m =>
255
- d3.max(m.distribution, d => d.proportion)
256
- ) || 0.8;
257
  yScale
258
- .domain([0, Math.min(1, maxProportion * 1.1)])
259
- .range([innerHeight, 0])
260
- .nice();
261
 
262
  // Grid lines
263
  const xTicks = [5, 6, 7, 8, 9, 10];
@@ -324,7 +320,7 @@
324
  .attr('y', -52)
325
  .attr('text-anchor', 'middle')
326
  .attr('transform', 'rotate(-90)')
327
- .text('Proportion of Guesses');
328
 
329
  // Lines for each model
330
  gLines.selectAll('.distribution-line')
@@ -358,7 +354,7 @@
358
  .join('circle')
359
  .attr('class', 'data-point data-point-circle')
360
  .attr('cx', d => xScale(d.confidence_level))
361
- .attr('cy', d => yScale(d.proportion))
362
  .attr('r', 4)
363
  .attr('fill', d => d.model.color)
364
  .attr('stroke', 'var(--surface-bg, white)')
@@ -374,7 +370,7 @@
374
  .attr('class', 'data-point data-point-star')
375
  .attr('d', d => starPath(
376
  xScale(d.confidence_level),
377
- yScale(d.proportion),
378
  6, 2.6
379
  ))
380
  .attr('fill', d => d.model.color)
 
1
+ <div class="d3-guess-rate"></div>
2
  <style>
3
+ .d3-guess-rate {
4
  width: 100%;
5
  margin: 10px 0;
6
  position: relative;
7
  font-family: system-ui, -apple-system, sans-serif;
8
  }
9
 
10
+ .d3-guess-rate svg {
11
  display: block;
12
  width: 100%;
13
  height: auto;
14
  }
15
 
16
+ .d3-guess-rate .axes path,
17
+ .d3-guess-rate .axes line {
18
  stroke: var(--axis-color, var(--text-color));
19
  }
20
 
21
+ .d3-guess-rate .axes text {
22
  fill: var(--tick-color, var(--muted-color));
23
  font-size: 11px;
24
  }
25
 
26
+ .d3-guess-rate .grid line {
27
  stroke: var(--grid-color, rgba(0,0,0,.08));
28
  }
29
 
30
+ .d3-guess-rate .axes text.axis-label {
31
  font-size: 14px;
32
  font-weight: 500;
33
  fill: var(--text-color);
34
  }
35
 
36
+ .d3-guess-rate .x-axis text {
37
  transform: translateY(4px);
38
  }
39
 
40
+ .d3-guess-rate .distribution-line {
41
  fill: none;
42
  stroke-width: 1.5;
43
  }
44
 
45
+ .d3-guess-rate .data-point {
46
  cursor: pointer;
47
  transition: opacity 0.15s ease;
48
  }
49
 
50
+ .d3-guess-rate .data-point:hover {
51
  opacity: 0.8;
52
  }
53
 
54
+ .d3-guess-rate .legend {
55
  font-size: 11px;
56
  }
57
 
58
+ .d3-guess-rate .legend-item {
59
  cursor: pointer;
60
  }
61
 
62
+ .d3-guess-rate .legend-item.dimmed .legend-line,
63
+ .d3-guess-rate .legend-item.dimmed .legend-marker {
64
  opacity: 0.3;
65
  }
66
 
67
+ .d3-guess-rate .legend-item.dimmed text {
68
  opacity: 0.4;
69
  }
70
 
71
+ .d3-guess-rate .legend-text {
72
  fill: var(--text-color);
73
  }
74
 
75
+ .d3-guess-rate .d3-tooltip {
76
  position: absolute;
77
  top: 0;
78
  left: 0;
 
91
  z-index: 10;
92
  }
93
 
94
+ .d3-guess-rate .d3-tooltip .model-name {
95
  font-weight: 600;
96
  margin-bottom: 4px;
97
  }
98
 
99
+ .d3-guess-rate .d3-tooltip .metric {
100
  display: flex;
101
  justify-content: space-between;
102
  gap: 16px;
103
  }
104
 
105
+ .d3-guess-rate .d3-tooltip .metric-label {
106
  color: var(--muted-color);
107
  }
108
 
109
+ .d3-guess-rate .d3-tooltip .metric-value {
110
  font-weight: 500;
111
  }
112
  </style>
 
129
  const bootstrap = () => {
130
  const scriptEl = document.currentScript;
131
  let container = scriptEl ? scriptEl.previousElementSibling : null;
132
+ if (!(container && container.classList && container.classList.contains('d3-guess-rate'))) {
133
+ const candidates = Array.from(document.querySelectorAll('.d3-guess-rate'))
134
  .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
135
  container = candidates[candidates.length - 1] || null;
136
  }
 
171
  // Line generator
172
  const line = d3.line()
173
  .x(d => xScale(d.confidence_level))
174
+ .y(d => yScale(d.guess_rate));
175
 
176
  // Data loading
177
+ const DATA_URL = '/data/guess_rate.json';
178
 
179
  function updateSize() {
180
  width = container.clientWidth || 800;
 
199
  <div class="model-name" style="color: ${model.color}">${model.name}</div>
200
  <div class="metric">
201
  <span class="metric-label">Confidence level:</span>
202
+ <span class="metric-value">${d.confidence_level}</span>
203
  </div>
204
  <div class="metric">
205
+ <span class="metric-label">Guess rate:</span>
206
+ <span class="metric-value">${(d.guess_rate * 100).toFixed(1)}%</span>
207
  </div>
208
  <div class="metric">
209
+ <span class="metric-label">Guesses / Turns:</span>
210
+ <span class="metric-value">${d.guess_count} / ${d.total_turns}</span>
211
  </div>
212
  `;
213
 
 
250
  .domain([5, 10])
251
  .range([0, innerWidth]);
252
 
253
+ // Y scale: guess rate (0 to 1)
 
 
 
254
  yScale
255
+ .domain([0, 1])
256
+ .range([innerHeight, 0]);
 
257
 
258
  // Grid lines
259
  const xTicks = [5, 6, 7, 8, 9, 10];
 
320
  .attr('y', -52)
321
  .attr('text-anchor', 'middle')
322
  .attr('transform', 'rotate(-90)')
323
+ .text('Guess Rate');
324
 
325
  // Lines for each model
326
  gLines.selectAll('.distribution-line')
 
354
  .join('circle')
355
  .attr('class', 'data-point data-point-circle')
356
  .attr('cx', d => xScale(d.confidence_level))
357
+ .attr('cy', d => yScale(d.guess_rate))
358
  .attr('r', 4)
359
  .attr('fill', d => d.model.color)
360
  .attr('stroke', 'var(--surface-bg, white)')
 
370
  .attr('class', 'data-point data-point-star')
371
  .attr('d', d => starPath(
372
  xScale(d.confidence_level),
373
+ yScale(d.guess_rate),
374
  6, 2.6
375
  ))
376
  .attr('fill', d => d.model.color)
app/src/content/embeds/overall-performance.html CHANGED
@@ -184,7 +184,7 @@
184
  <div class="model-name" style="color: ${d.color}">${d.name}</div>
185
  <div class="metric">
186
  <span class="metric-label">Score:</span>
187
- <span class="metric-value">${d.avg_score.toFixed(2)}</span>
188
  </div>
189
  <div class="metric">
190
  <span class="metric-label">Tokens/Turn:</span>
@@ -222,7 +222,7 @@
222
 
223
  // Update scales
224
  const xExtent = d3.extent(models, d => d.avg_output_tokens_per_turn);
225
- const yExtent = d3.extent(models, d => d.avg_score);
226
  const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
227
  const yPadding = (yExtent[1] - yExtent[0]) * 0.1;
228
 
@@ -314,7 +314,7 @@
314
  .join('circle')
315
  .attr('class', 'point point-circle')
316
  .attr('cx', d => xScale(d.avg_output_tokens_per_turn))
317
- .attr('cy', d => yScale(d.avg_score))
318
  .attr('r', pointRadius)
319
  .attr('fill', d => d.color)
320
  .attr('stroke', 'none')
@@ -328,7 +328,7 @@
328
  .data(openModels, d => d.name)
329
  .join('path')
330
  .attr('class', 'point point-star')
331
- .attr('d', d => starPath(xScale(d.avg_output_tokens_per_turn), yScale(d.avg_score), pointRadius * 1.2, pointRadius * 0.5))
332
  .attr('fill', d => d.color)
333
  .attr('stroke', 'none')
334
  .on('mouseenter', showTooltip)
@@ -341,7 +341,7 @@
341
  .join('text')
342
  .attr('class', 'point-label')
343
  .attr('x', d => xScale(d.avg_output_tokens_per_turn) + pointRadius + 6)
344
- .attr('y', d => yScale(d.avg_score) + 4)
345
  .text(d => d.name);
346
  }
347
 
 
184
  <div class="model-name" style="color: ${d.color}">${d.name}</div>
185
  <div class="metric">
186
  <span class="metric-label">Score:</span>
187
+ <span class="metric-value">${d.avg_floored_score.toFixed(2)}</span>
188
  </div>
189
  <div class="metric">
190
  <span class="metric-label">Tokens/Turn:</span>
 
222
 
223
  // Update scales
224
  const xExtent = d3.extent(models, d => d.avg_output_tokens_per_turn);
225
+ const yExtent = d3.extent(models, d => d.avg_floored_score);
226
  const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
227
  const yPadding = (yExtent[1] - yExtent[0]) * 0.1;
228
 
 
314
  .join('circle')
315
  .attr('class', 'point point-circle')
316
  .attr('cx', d => xScale(d.avg_output_tokens_per_turn))
317
+ .attr('cy', d => yScale(d.avg_floored_score))
318
  .attr('r', pointRadius)
319
  .attr('fill', d => d.color)
320
  .attr('stroke', 'none')
 
328
  .data(openModels, d => d.name)
329
  .join('path')
330
  .attr('class', 'point point-star')
331
+ .attr('d', d => starPath(xScale(d.avg_output_tokens_per_turn), yScale(d.avg_floored_score), pointRadius * 1.2, pointRadius * 0.5))
332
  .attr('fill', d => d.color)
333
  .attr('stroke', 'none')
334
  .on('mouseenter', showTooltip)
 
341
  .join('text')
342
  .attr('class', 'point-label')
343
  .attr('x', d => xScale(d.avg_output_tokens_per_turn) + pointRadius + 6)
344
+ .attr('y', d => yScale(d.avg_floored_score) + 4)
345
  .text(d => d.name);
346
  }
347
 
app/src/content/embeds/score-stack.html CHANGED
@@ -168,8 +168,7 @@
168
 
169
  // Colors for segments
170
  const segmentColors = {
171
- raw: '#4A90D9', // Blue - raw score
172
- floored: '#E8973E', // Orange - flooring gain
173
  noStakes: '#5AAA5A' // Green - no-stakes gain
174
  };
175
 
@@ -198,14 +197,10 @@
198
  const y = event.clientY - rect.top;
199
 
200
  let segmentName, segmentValue, description;
201
- if (segment === 'raw') {
202
- segmentName = 'Raw Score';
203
- segmentValue = d.avg_score.toFixed(2);
204
- description = 'Standard scoring: 30 - turns - 2×wrong guesses';
205
- } else if (segment === 'floored') {
206
- segmentName = 'Flooring Gain';
207
- segmentValue = '+' + d.floored_delta.toFixed(2);
208
- description = 'Gain if negative scores count as 0';
209
  } else {
210
  segmentName = 'No-Stakes Gain';
211
  segmentValue = '+' + d.no_stakes_delta.toFixed(2);
@@ -221,11 +216,7 @@
221
  <div style="font-size: 11px; color: var(--muted-color); margin-top: 4px;">${description}</div>
222
  <hr style="border: none; border-top: 1px solid var(--border-color); margin: 8px 0;">
223
  <div class="metric">
224
- <span class="metric-label">Raw Score:</span>
225
- <span class="metric-value">${d.avg_score.toFixed(2)}</span>
226
- </div>
227
- <div class="metric">
228
- <span class="metric-label">Floored Score:</span>
229
  <span class="metric-value">${d.avg_floored_score.toFixed(2)}</span>
230
  </div>
231
  <div class="metric">
@@ -257,8 +248,8 @@
257
 
258
  const { innerWidth, innerHeight } = updateSize();
259
 
260
- // Sort models by raw score (descending)
261
- const models = [...data.models].sort((a, b) => b.avg_score - a.avg_score);
262
 
263
  // Update scales
264
  const maxScore = d3.max(models, d => d.avg_no_stakes_score);
@@ -323,49 +314,28 @@
323
  const safeId = toClassName(d.name);
324
 
325
  // Calculate segment positions
326
- // Raw score starts from 0, clamp negative scores to 0
327
- const rawStart = 0;
328
- const rawEnd = Math.max(0, d.avg_score);
329
-
330
- // Floored delta starts where raw score ends (if positive) or at 0 (if raw was negative)
331
- const flooredStart = rawEnd;
332
- const flooredEnd = flooredStart + d.floored_delta;
333
 
334
  // No-stakes delta starts where floored ends
335
  const noStakesStart = flooredEnd;
336
  const noStakesEnd = noStakesStart + d.no_stakes_delta;
337
 
338
- // Raw score segment
339
- gBars.selectAll(`.bar-raw-${safeId}`)
340
  .data([d])
341
  .join('rect')
342
- .attr('class', `bar-segment bar-raw-${safeId}`)
343
- .attr('x', xScale(rawStart))
344
  .attr('y', y)
345
- .attr('width', Math.max(0, xScale(rawEnd) - xScale(rawStart)))
346
  .attr('height', barHeight)
347
- .attr('fill', segmentColors.raw)
348
- .on('mouseenter', (e) => showTooltip(e, d, 'raw'))
349
- .on('mousemove', (e) => showTooltip(e, d, 'raw'))
350
  .on('mouseleave', hideTooltip);
351
 
352
- // Floored delta segment (only if positive)
353
- if (d.floored_delta > 0.01) {
354
- gBars.selectAll(`.bar-floored-${safeId}`)
355
- .data([d])
356
- .join('rect')
357
- .attr('class', `bar-segment bar-floored-${safeId}`)
358
- .attr('x', xScale(flooredStart))
359
- .attr('y', y)
360
- .attr('width', Math.max(0, xScale(flooredEnd) - xScale(flooredStart)))
361
- .attr('height', barHeight)
362
- .attr('fill', segmentColors.floored)
363
- .attr('opacity', 0.5)
364
- .on('mouseenter', (e) => showTooltip(e, d, 'floored'))
365
- .on('mousemove', (e) => showTooltip(e, d, 'floored'))
366
- .on('mouseleave', hideTooltip);
367
- }
368
-
369
  // No-stakes delta segment (only if positive)
370
  if (d.no_stakes_delta > 0.01) {
371
  gBars.selectAll(`.bar-nostakes-${safeId}`)
@@ -386,13 +356,9 @@
386
 
387
  // Update legend
388
  legendDiv.innerHTML = `
389
- <div class="legend-item">
390
- <div class="legend-swatch" style="background: ${segmentColors.raw}"></div>
391
- <span class="legend-label">Raw Score</span>
392
- </div>
393
  <div class="legend-item">
394
  <div class="legend-swatch" style="background: ${segmentColors.floored}"></div>
395
- <span class="legend-label">Flooring Gain</span>
396
  </div>
397
  <div class="legend-item">
398
  <div class="legend-swatch" style="background: ${segmentColors.noStakes}"></div>
 
168
 
169
  // Colors for segments
170
  const segmentColors = {
171
+ floored: '#4A90D9', // Blue - floored score
 
172
  noStakes: '#5AAA5A' // Green - no-stakes gain
173
  };
174
 
 
197
  const y = event.clientY - rect.top;
198
 
199
  let segmentName, segmentValue, description;
200
+ if (segment === 'floored') {
201
+ segmentName = 'Score';
202
+ segmentValue = d.avg_floored_score.toFixed(2);
203
+ description = 'Floored score (negative scores count as 0)';
 
 
 
 
204
  } else {
205
  segmentName = 'No-Stakes Gain';
206
  segmentValue = '+' + d.no_stakes_delta.toFixed(2);
 
216
  <div style="font-size: 11px; color: var(--muted-color); margin-top: 4px;">${description}</div>
217
  <hr style="border: none; border-top: 1px solid var(--border-color); margin: 8px 0;">
218
  <div class="metric">
219
+ <span class="metric-label">Score:</span>
 
 
 
 
220
  <span class="metric-value">${d.avg_floored_score.toFixed(2)}</span>
221
  </div>
222
  <div class="metric">
 
248
 
249
  const { innerWidth, innerHeight } = updateSize();
250
 
251
+ // Sort models by floored score (descending)
252
+ const models = [...data.models].sort((a, b) => b.avg_floored_score - a.avg_floored_score);
253
 
254
  // Update scales
255
  const maxScore = d3.max(models, d => d.avg_no_stakes_score);
 
314
  const safeId = toClassName(d.name);
315
 
316
  // Calculate segment positions
317
+ // Floored score starts from 0
318
+ const flooredStart = 0;
319
+ const flooredEnd = d.avg_floored_score;
 
 
 
 
320
 
321
  // No-stakes delta starts where floored ends
322
  const noStakesStart = flooredEnd;
323
  const noStakesEnd = noStakesStart + d.no_stakes_delta;
324
 
325
+ // Floored score segment (base)
326
+ gBars.selectAll(`.bar-floored-${safeId}`)
327
  .data([d])
328
  .join('rect')
329
+ .attr('class', `bar-segment bar-floored-${safeId}`)
330
+ .attr('x', xScale(flooredStart))
331
  .attr('y', y)
332
+ .attr('width', Math.max(0, xScale(flooredEnd) - xScale(flooredStart)))
333
  .attr('height', barHeight)
334
+ .attr('fill', segmentColors.floored)
335
+ .on('mouseenter', (e) => showTooltip(e, d, 'floored'))
336
+ .on('mousemove', (e) => showTooltip(e, d, 'floored'))
337
  .on('mouseleave', hideTooltip);
338
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
339
  // No-stakes delta segment (only if positive)
340
  if (d.no_stakes_delta > 0.01) {
341
  gBars.selectAll(`.bar-nostakes-${safeId}`)
 
356
 
357
  // Update legend
358
  legendDiv.innerHTML = `
 
 
 
 
359
  <div class="legend-item">
360
  <div class="legend-swatch" style="background: ${segmentColors.floored}"></div>
361
+ <span class="legend-label">Score</span>
362
  </div>
363
  <div class="legend-item">
364
  <div class="legend-swatch" style="background: ${segmentColors.noStakes}"></div>
app/src/content/embeds/score-vs-failed-guesses.html CHANGED
@@ -169,7 +169,7 @@
169
  <div class="model-name" style="color: ${d.color}">${d.name}</div>
170
  <div class="metric">
171
  <span class="metric-label">Score:</span>
172
- <span class="metric-value">${d.avg_score.toFixed(2)}</span>
173
  </div>
174
  <div class="metric">
175
  <span class="metric-label">Failed Guesses:</span>
@@ -207,7 +207,7 @@
207
 
208
  // Update scales
209
  const xExtent = d3.extent(models, d => d.avg_failed_guesses);
210
- const yExtent = d3.extent(models, d => d.avg_score);
211
  const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
212
  const yPadding = (yExtent[1] - yExtent[0]) * 0.1;
213
 
@@ -299,7 +299,7 @@
299
  .join('circle')
300
  .attr('class', 'point point-circle')
301
  .attr('cx', d => xScale(d.avg_failed_guesses))
302
- .attr('cy', d => yScale(d.avg_score))
303
  .attr('r', pointRadius)
304
  .attr('fill', d => d.color)
305
  .attr('stroke', 'none')
@@ -313,7 +313,7 @@
313
  .data(openModels, d => d.name)
314
  .join('path')
315
  .attr('class', 'point point-star')
316
- .attr('d', d => starPath(xScale(d.avg_failed_guesses), yScale(d.avg_score), pointRadius * 1.2, pointRadius * 0.5))
317
  .attr('fill', d => d.color)
318
  .attr('stroke', 'none')
319
  .on('mouseenter', showTooltip)
@@ -326,7 +326,7 @@
326
  .join('text')
327
  .attr('class', 'point-label')
328
  .attr('x', d => xScale(d.avg_failed_guesses) + pointRadius + 6)
329
- .attr('y', d => yScale(d.avg_score) + 4)
330
  .text(d => d.name);
331
  }
332
 
 
169
  <div class="model-name" style="color: ${d.color}">${d.name}</div>
170
  <div class="metric">
171
  <span class="metric-label">Score:</span>
172
+ <span class="metric-value">${d.avg_floored_score.toFixed(2)}</span>
173
  </div>
174
  <div class="metric">
175
  <span class="metric-label">Failed Guesses:</span>
 
207
 
208
  // Update scales
209
  const xExtent = d3.extent(models, d => d.avg_failed_guesses);
210
+ const yExtent = d3.extent(models, d => d.avg_floored_score);
211
  const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
212
  const yPadding = (yExtent[1] - yExtent[0]) * 0.1;
213
 
 
299
  .join('circle')
300
  .attr('class', 'point point-circle')
301
  .attr('cx', d => xScale(d.avg_failed_guesses))
302
+ .attr('cy', d => yScale(d.avg_floored_score))
303
  .attr('r', pointRadius)
304
  .attr('fill', d => d.color)
305
  .attr('stroke', 'none')
 
313
  .data(openModels, d => d.name)
314
  .join('path')
315
  .attr('class', 'point point-star')
316
+ .attr('d', d => starPath(xScale(d.avg_failed_guesses), yScale(d.avg_floored_score), pointRadius * 1.2, pointRadius * 0.5))
317
  .attr('fill', d => d.color)
318
  .attr('stroke', 'none')
319
  .on('mouseenter', showTooltip)
 
326
  .join('text')
327
  .attr('class', 'point-label')
328
  .attr('x', d => xScale(d.avg_failed_guesses) + pointRadius + 6)
329
+ .attr('y', d => yScale(d.avg_floored_score) + 4)
330
  .text(d => d.name);
331
  }
332