eleusis-benchmark


dlouapre HF Staff committed on Jan 28

Commit

aee6411

1 Parent(s): d123922

Adding interactive charts + assessment


Files changed (38)

ASSESSMENT.md +291 -0
app/src/content/assets/data/basic_metrics.csv +2 -2
app/src/content/assets/data/by_rule.json +2 -2
app/src/content/assets/data/by_rule.png +2 -2
app/src/content/assets/data/complexity_analysis.json +2 -2
app/src/content/assets/data/complexity_analysis.png +2 -2
app/src/content/assets/data/model_claude_haiku_4_5.png +3 -0
app/src/content/assets/data/model_claude_opus_4_5.png +3 -0
app/src/content/assets/data/model_deepseek_r1.png +3 -0
app/src/content/assets/data/model_gemini_3_flash_preview_low.png +3 -0
app/src/content/assets/data/model_gpt_5_2_high.png +3 -0
app/src/content/assets/data/model_gpt_5_mini_medium.png +3 -0
app/src/content/assets/data/model_gpt_oss_120b.png +3 -0
app/src/content/assets/data/model_gpt_oss_20b.png +3 -0
app/src/content/assets/data/model_grok_4_1_fast_reasoning.png +3 -0
app/src/content/assets/data/model_kimi_k2.png +3 -0
app/src/content/assets/data/overall_performance.json +2 -2
app/src/content/assets/data/overall_performance.png +2 -2
app/src/content/assets/data/reckless_guessing.json +3 -0
app/src/content/assets/data/reckless_guessing.png +3 -0
app/src/content/assets/data/score_stack.json +3 -0
app/src/content/assets/data/score_stack.png +3 -0
app/src/content/assets/data/score_vs_failed_guesses.json +2 -2
app/src/content/assets/data/score_vs_failed_guesses.png +2 -2
app/src/content/assets/data/summary.txt +91 -51
app/src/content/chapters/eleusis/benchmark.mdx +2 -2
app/src/content/chapters/eleusis/results.mdx +59 -42
app/src/content/embeds/by-rule.html +521 -0
app/src/content/embeds/calibration-curves.html +537 -0
app/src/content/embeds/caution-vs-failed-guesses.html +369 -0
app/src/content/embeds/complexity-analysis.html +492 -0
app/src/content/embeds/confidence-distribution.html +495 -0
app/src/content/embeds/excess-caution.html +384 -0
app/src/content/embeds/reckless-guessing.html +400 -0
app/src/content/embeds/score-stack.html +440 -0
app/src/content/embeds/score-vs-failed-guesses.html +369 -0
dark-mode-image.md +48 -0
interactive-charts.md +498 -0

ASSESSMENT.md ADDED

	@@ -0,0 +1,291 @@

+# Critical Assessment: Eleusis Benchmark Article
+## Executive Summary
+The article presents an interesting benchmark with solid methodology and rich data. The main structural issue is that the **Results section tells a fragmented story about guessing behavior**, spreading related insights across 6+ subsections without a clear narrative arc. The key message—that metacognition matters and models have distinct "scientific personalities"—gets lost in the noise.
+Additionally, there are **data consistency issues** between the text and the underlying data files that need resolution before publication.
+---
+## 1. Critical Issues
+### 1.1 Data Inconsistencies
+The numbers in the text don't match `summary.txt`. For example:
+| Metric | In Text | In summary.txt |
+|--------|---------|----------------|
+| Claude Opus 4.5 avg score | 15.88 (CLAUDE.md) | 14.46 |
+| Kimi K2 avg score | 14.53 (CLAUDE.md) | 10.31 |
+| GPT 5.2 High rank | "third place" | Actually 1st by avg_score (14.85) |
+**Action needed:** Audit all numbers in the text against the latest data files.
+### 1.2 Results Section: Scattered Narrative
+The guessing behavior story is currently spread across:
+1. "Confidence and Calibration" - calibration curves, confidence distribution
+2. "Guessing Strategy" - score vs failed guesses
+3. "The Caution-Recklessness Trade-off" - early correct turns, caution scatter
+4. "Alternative Scoring Systems" - score stack breakdown
+5. "Analysis of the reckless guessing behavior" - double-down rate
+These all address the same fundamental question: **How do models decide when to commit?** But the current structure forces readers to piece together the story themselves.
+**Problem:** A reader finishing the Results section doesn't have a clear mental model of "what makes some models better than others."
+---
+## 2. Suggested Restructuring
+### Option A: Reorganize Around the Key Insight
+**Proposed Results structure:**
+```
+## Results
+### Overall Performance (keep as-is)
+   Brief overview, scatter plot of score vs tokens
+### Finding the Rule: Who Gets It Right?
+   - Success rates by model
+   - Performance by rule complexity
+   - Brief: what capabilities matter for finding rules
+### Knowing When You Know: The Metacognition Challenge
+   [This is the heart of the article - elevate it]
+   - The caution-recklessness trade-off (central framing)
+   - Caution analysis: early correct turns, GPT 5.2 waits too long
+   - Recklessness analysis: failed guesses, double-down rates
+   - The scatter plot showing the trade-off (Figure 6)
+   - Why Claude Opus wins: good enough at finding + great at timing
+### Confidence and Calibration
+   - Calibration curves (all models overconfident)
+   - Confidence distribution when guessing
+   - Brief: why calibration enables good timing decisions
+### Alternative Scoring: Robustness Check
+   - Score stack shows the penalty different behaviors pay
+   - Confirms that metacognition, not just rule-finding, drives scores
+```
+**Benefits:**
+- The key message (metacognition matters) becomes structurally prominent
+- Reader builds understanding progressively: first "can they solve it?", then "do they know when they've solved it?"
+- Eliminates the feeling of "lots of charts, hard to synthesize"
+### Option B: Two-Act Structure
+```
+## Results
+### Act 1: The Leaderboard (compact)
+   - Overall performance scatter
+   - Success rates
+   - One paragraph summary: "Models vary from 70% to 96% success rate..."
+### Act 2: The Real Story—Scientific Temperaments
+   [Frame models as having distinct "personalities"]
+   The Cautious Achiever: GPT 5.2 High
+   - Highest success rate, but 3rd in score
+   - Figure: excess caution distribution
+   - Lost ~3.6 points per round to over-caution
+   The Balanced Scientist: Claude Opus 4.5
+   - Not the best at finding rules, but best at knowing when
+   - Commits quickly, accepts occasional wrong guesses
+   The Reckless Guesser: Claude Haiku 4.5 / DeepSeek R1
+   - Commits before sufficient evidence
+   - Double-down behavior after failures
+   Visualizing the Trade-off
+   - Caution vs recklessness scatter (the key figure)
+   - Score stack showing what each "personality" costs
+### Calibration: Why Timing Is Hard
+   - Overconfidence makes timing decisions unreliable
+   - Even well-performing models poorly calibrated
+```
+**Benefits:**
+- Memorable framing (scientific personalities)
+- Natural story arc
+- Each model type is clearly characterized
+---
+## 3. Missing Content
+### 3.1 Figures Marked as TODO
+- **Learning curves figure** (analysis.mdx:22) - Would show within-round dynamics
+- **Failure mode distribution** (analysis.mdx:55) - Stacked bar by model
+**Recommendation:** The learning curves figure would be valuable if you have the data. The failure mode classification might be hard to automate reliably—consider whether a few qualitative examples serve the purpose better.
+### 3.2 Human Baseline
+Mentioned in limitations but this is a significant gap. Without human performance, readers can't judge if 92% success is impressive or trivial.
+**Options:**
+- Run a small human study (even N=5 would help)
+- Cite related work on human performance in similar inductive reasoning tasks
+- Frame it explicitly as "relative comparison between models" not absolute capability assessment
+### 3.3 Example Turn Figure
+benchmark.mdx shows the JSON output format but doesn't illustrate what a complete turn looks like in context (game state → reasoning → decision).
+**Recommendation:** Add a figure showing:
+```
+[Current board state visualization]
+[Model reasoning excerpt]
+[Decision: play 4♣, confidence 6, don't guess yet]
+[Outcome: accepted/rejected]
+```
+This makes the task concrete for readers.
+---
+## 4. The "Deeper Analysis" Section
+Currently a grab-bag of interesting observations with TODOs. Your instinct to replace with "Discussion" is right.
+### Proposed: Discussion Section
+```
+## Discussion
+### What Explains the Performance Gap?
+   - Metacognition (knowing when you know) is the key differentiator
+   - Success rate alone doesn't predict score (GPT 5.2 vs Opus example)
+   - Calibration enables good timing, but no model is well-calibrated
+### Open vs Proprietary Models
+   - Kimi K2 competitive on rule-finding
+   - But open models trend toward reckless guessing (training objective differences?)
+   - Opportunity: calibration tuning could improve open model performance
+### Failure Modes [keep the accordion, it's useful]
+### Implications for AI-Assisted Science
+   - The caution-recklessness trade-off mirrors real scientific decision-making
+   - An overconfident AI assistant could lead researchers astray
+   - An overcautious one wastes resources on unnecessary verification
+```
+### Move to Appendix
+- Symmetric rules analysis (interesting but niche)
+- Confirmation bias (preliminary, needs more work)
+- Detailed qualitative examples (unless you expand them significantly)
+---
+## 5. Framing Suggestions
+### 5.1 Lead with the Surprise
+Current opening of Results is fine, but the key insight (metacognition matters) comes too late. Consider foreshadowing in the introduction:
+> "We found something surprising: the model with the highest success rate doesn't have the highest score. What matters isn't just finding the answer—it's knowing when you've found it."
+### 5.2 The "Scientific Personality" Frame
+This is potentially memorable and shareable. Models as:
+- **The Perfectionist** (GPT 5.2 High): Always wants more evidence
+- **The Pragmatist** (Claude Opus 4.5): Good enough evidence is enough
+- **The Gambler** (Claude Haiku 4.5): Guesses based on vibes
+This framing:
+- Makes the article more accessible to non-specialists
+- Creates natural anchors for discussion
+- Is scientifically defensible (behavioral clustering is real)
+### 5.3 The Decision Theory Angle
+You mention the optimal guessing threshold (0.67 confidence) briefly. This could be expanded:
+> "Given perfect calibration, the optimal strategy is to guess whenever confidence exceeds 67%. But no model is well-calibrated. GPT 5.2 High effectively uses a threshold of ~95%; Claude Haiku 4.5 seems to use ~50%."
+This quantifies the "personalities" and connects to calibration.
+---
+## 6. Minor Issues
+### 6.1 Typos/Grammar
+- results.mdx:38: "overconfident : for instance" → extra space before colon
+- results.mdx:39: "GPT 5.2 is the best calibrated" → should be "GPT 5.2 High"
+- results.mdx:51: "closed to Claude Opus 4.5" → "close to"
+- results.mdx:103: "constrats" → "contrasts"
+- analysis.mdx:60: "GPT OSS 120B also performs respectably at 12.0" → check number
+### 6.2 Caption Numbering
+Figure 7 appears twice (score-stack and reckless-guessing). Fix numbering.
+### 6.3 Model Names Consistency
+Inconsistent capitalization and naming:
+- "Claude Opus 4.5" vs "Claude 4.5 Opus"
+- "GPT 5.2 High" vs "Gpt 5.2 High" (in data files)
+- "DeepSeek R1" vs "Deepseek R1"
+---
+## 7. Ideas for Additional Content
+### 7.1 Interactive "Play a Round" Demo
+Let readers play one round against a rule to experience the task. Even a simple version would be compelling. (This could be a stretch goal.)
+### 7.2 Model-Specific Breakdowns
+You have per-model PNG files (`model_claude_opus_4_5.png`, etc.). Consider:
+- Appendix section with one page per model
+- Or: expandable accordion for each model's detailed stats
+### 7.3 Token Efficiency Discussion
+You show score vs tokens in Figure 1 but don't discuss it much. Gemini 3 Flash achieves decent results with 4x fewer tokens than Opus—is that worth highlighting for practitioners?
+### 7.4 Prompt Sensitivity
+You note this as a limitation but could briefly test: what if you told models to be more cautious? More aggressive? (Could be future work suggestion.)
+---
+## 8. Prioritized Action Items
+### Must Fix
+1. Audit all numbers against latest data files
+2. Fix duplicate Figure 7 numbering
+3. Fix typos listed above
+### Should Do
+4. Reorganize Results section (Option A or B above)
+5. Rename "Deeper Analysis" to "Discussion" and restructure
+6. Add foreshadowing of key insight in introduction
+### Nice to Have
+7. Add example turn figure in benchmark.mdx
+8. Expand "scientific personalities" framing
+9. Human baseline (even informal)
+10. Per-model detail pages in appendix
+---
+## 9. Summary
+The benchmark and data are solid. The article's main weakness is structural: it has too many charts telling pieces of the same story without a clear narrative spine. The fix is to reorganize around **the key insight** (metacognition matters more than raw rule-finding ability) and **the key visual** (the caution-recklessness scatter plot).
+Your target message—"Models differ dramatically because metacognition matters, and this is an opportunity for improvement"—is supported by the data but not yet prominently surfaced by the article structure.
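
The audit requested in section 1.1 of the assessment could be scripted along the following lines. This is a minimal sketch, not the project's tooling: `audit_claims` is a hypothetical helper, and it assumes `basic_metrics.csv` exposes the `model` and `avg_score` columns printed in `summary.txt` (the file itself is stored in Git LFS, so its layout is not visible here). The sample rows copy values from the updated `summary.txt`, and the claims are the two averages quoted in the section 1.1 table.

```python
import csv
import io


def audit_claims(claims, metrics_csv, column="avg_score", tol=0.01):
    """Return (model, claimed, actual) for each claim that disagrees with the data.

    `actual` is None when the model is missing from the metrics table.
    """
    rows = {
        row["model"].strip(): row
        for row in csv.DictReader(io.StringIO(metrics_csv))
    }
    mismatches = []
    for model, claimed in claims.items():
        row = rows.get(model)
        actual = float(row[column]) if row is not None else None
        if actual is None or abs(actual - claimed) > tol:
            mismatches.append((model, claimed, actual))
    return mismatches


# Illustrative excerpt; values copied from the updated summary.txt.
METRICS_CSV = """model,avg_score
Gpt 5.2 High,14.846154
Claude Opus 4.5,14.461538
Kimi K2,10.307692
"""

# Averages quoted in the text, per the table in section 1.1.
CLAIMS = {"Claude Opus 4.5": 15.88, "Kimi K2": 14.53}

for model, claimed, actual in audit_claims(CLAIMS, METRICS_CSV):
    print(f"{model}: text says {claimed}, data says {actual}")
```

Both claims are reported as mismatches here. The tolerance is a design choice: 0.01 accepts values rounded to two decimals while still flagging stale numbers.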

app/src/content/assets/data/basic_metrics.csv CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:84c461621159e77fa9c7a1370138dd35da740c50943f5b6966fa801a50c8479f
-size 2145
+oid sha256:646b5eda63192bed7d4c3372c684b263db844ad6599e2cff7cd34b945e0a03da
+size 2743

app/src/content/assets/data/by_rule.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1651057430922f5919ff3a4f6c005baed488a730bf90848e078044ddf0910a85
-size 5887
+oid sha256:bedd8081e1e412f0d2453c0f6fe78153fed8433520b9e1b729fc7b11dd5b02a8
+size 30709

app/src/content/assets/data/by_rule.png CHANGED

Git LFS Details

SHA256: 157397fb6d139b6399e87166bc83c7c6a0183ec8aa28a81874a6314d2f092fc7
Pointer size: 131 Bytes
Size of remote file: 340 kB

Git LFS Details

SHA256: f7c7d4ff1a927f2d44209feb1979ca355f79fa75a03e13ac413d4bdba84012a6
Pointer size: 131 Bytes
Size of remote file: 363 kB

app/src/content/assets/data/complexity_analysis.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2217f0d480685678ad3d6724a5d8f8d4eb73a95a883af9fe077c6fcb476eabba
-size 2473
+oid sha256:a281c2834fce731ee67126dc08e307268f411c4b7ec24006d36edccd303a6e6d
+size 2273

app/src/content/assets/data/complexity_analysis.png CHANGED

Git LFS Details

SHA256: 0e4c4614bdbf004ef9e898eaaeab152ea08437891d76d3a6da9ef9cf44c86bbc
Pointer size: 130 Bytes
Size of remote file: 99.7 kB

Git LFS Details

SHA256: 7d7ed142b4271802c43e9c385ac2fd01da0a9008903655477d1a76608af86fe1
Pointer size: 131 Bytes
Size of remote file: 111 kB

app/src/content/assets/data/model_claude_haiku_4_5.png ADDED

Git LFS Details

SHA256: e156f35fcb3f764435fccf4ee3ce16b71f594721e11b673fa122f95cccc5c524
Pointer size: 131 Bytes
Size of remote file: 248 kB

app/src/content/assets/data/model_claude_opus_4_5.png ADDED

Git LFS Details

SHA256: 250b07856543f2443a6b8ba3c20e15f24e3eb31bbbeda1d1e9555a5d8f4bf1b9
Pointer size: 131 Bytes
Size of remote file: 217 kB

app/src/content/assets/data/model_deepseek_r1.png ADDED

Git LFS Details

SHA256: 9408c2f99fb62f626909a296150243597be4ba2976d68b2c7b848b5fcba4f33a
Pointer size: 131 Bytes
Size of remote file: 249 kB

app/src/content/assets/data/model_gemini_3_flash_preview_low.png ADDED

Git LFS Details

SHA256: 75ca6f7798384cf21e16d6ba6a9a7c8eca3d3abb7849767ee628319486dae785
Pointer size: 131 Bytes
Size of remote file: 235 kB

app/src/content/assets/data/model_gpt_5_2_high.png ADDED

Git LFS Details

SHA256: 3ce556cf0f570a3c13535287608f734e8f103e308a7cd8db80f355f309003e6c
Pointer size: 131 Bytes
Size of remote file: 194 kB

app/src/content/assets/data/model_gpt_5_mini_medium.png ADDED

Git LFS Details

SHA256: 8f2a375dfbf81219ac33ceef6f42f4c8d9028d4a3867920a44218db927056985
Pointer size: 131 Bytes
Size of remote file: 210 kB

app/src/content/assets/data/model_gpt_oss_120b.png ADDED

Git LFS Details

SHA256: c4d22765888054e1b220a66e9bf42278bf46aab27a3a8733f0ebb7e71db9c13e
Pointer size: 131 Bytes
Size of remote file: 259 kB

app/src/content/assets/data/model_gpt_oss_20b.png ADDED

Git LFS Details

SHA256: 41b5dfd6881d9e30e03a91de49517faa4a9cb94c9b88031ce0f52ceb431470df
Pointer size: 131 Bytes
Size of remote file: 270 kB

app/src/content/assets/data/model_grok_4_1_fast_reasoning.png ADDED

Git LFS Details

SHA256: 08f64a210f54161c501c19e8906518c7d5a6cc55b36749e9c31cb570a09170ee
Pointer size: 131 Bytes
Size of remote file: 221 kB

app/src/content/assets/data/model_kimi_k2.png ADDED

Git LFS Details

SHA256: 04d4c263b639177670769f818380e061a69b259e7aa073b1151fbd737d19cd07
Pointer size: 131 Bytes
Size of remote file: 238 kB

app/src/content/assets/data/overall_performance.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f797b18c405fd0b865d372d6dfe74e823a42a0f8f057a144b08208c5f6fb29d0
-size 2286
+oid sha256:67f55d87526715789a9b2c902de6acc78f69dc5fd13300eb97e511668bca8003
+size 2303

app/src/content/assets/data/overall_performance.png CHANGED

Git LFS Details

SHA256: 1e634cee9f65439c9a25dad03e70e73f8e5722091da456a6a7206893f50039dc
Pointer size: 130 Bytes
Size of remote file: 75.8 kB

Git LFS Details

SHA256: 9d182c87b70f17018bd0664f1812b0ed0b99dbb107e1e455810f64dd21040f24
Pointer size: 130 Bytes
Size of remote file: 76.2 kB

app/src/content/assets/data/reckless_guessing.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a708723564f2779c2600346e347e2cff985a247bc950707d7f5c58137e05395b
+size 19220

app/src/content/assets/data/reckless_guessing.png ADDED

Git LFS Details

SHA256: a73c1561ab35ed2e308d9cea71e3c77c116fb3d1d5619878a60d68ee1a031fbe
Pointer size: 130 Bytes
Size of remote file: 69.6 kB

app/src/content/assets/data/score_stack.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d64dd73c3b7173b627be30fab1720d57fde169a419d6038a9dec3129a2c93a60
+size 3723

app/src/content/assets/data/score_stack.png ADDED

Git LFS Details

SHA256: 770e7bbbe723acad84dd1ecd4ff8310abd3fd60417953c5b961464d85111e328
Pointer size: 130 Bytes
Size of remote file: 83.3 kB

app/src/content/assets/data/score_vs_failed_guesses.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2dd1c504aa5cfc6e5212ce7838b71c98002a94d1f10d5116ba92cff5924ccaae
-size 2212
+oid sha256:581795032120f5075ef4f805472d19deebe0602aa6737e07bc62a35062f97758
+size 2215

app/src/content/assets/data/score_vs_failed_guesses.png CHANGED

Git LFS Details

SHA256: 6caeee464272ac6774ca71a6f28fd7f1f6c10bb4820ec69161d8213f89be5ccb
Pointer size: 130 Bytes
Size of remote file: 73.4 kB

Git LFS Details

SHA256: 655ad7167280626d2e194caf6115b8afee946a61faca5cc0b2b2f9ded65c6999
Pointer size: 130 Bytes
Size of remote file: 73.7 kB

app/src/content/assets/data/summary.txt CHANGED

@@ -25,17 +25,17 @@ Loaded colors for 17 models
 BASIC MODEL COMPARISON
 ============================================================
-                     model  rounds_played  total_score  avg_score  total_turns  total_output_tokens  total_wall_clock  avg_failed_guesses  success_rate  avg_output_tokens_per_turn  wall_clock_per_turn  intra_rule_variance  inter_rule_variance  variance_ratio
-           Claude Opus 4.5             78         1239  15.884615          825              4333716          86367.64            2.769231      0.923077                 5252.989091           104.688048            30.871795            92.648376        0.333215
-                   Kimi K2             78         1133  14.525641          955             12281540         101346.76            4.025641      0.858974                12860.251309           106.122262            48.679487           116.854872        0.416581
-              Gpt 5.2 High             78         1102  14.128205         1200              3341037          73525.83            0.333333      0.961538                 2784.197500            61.271525            25.346154            36.062906        0.702832
-         Gpt 5 Mini Medium             78         1001  12.833333         1247              3618399          58345.97            1.256410      0.756410                 2901.683240            46.789070            40.051282            79.228889        0.505514
-   Grok 4 1 Fast Reasoning             78          976  12.512821          962              8178655         120364.22            4.320513      0.884615                 8501.720374           125.118732            69.358974           182.704274        0.379624
-Gemini 3 Flash Preview Low             78          955  12.243590         1299              1581524          12702.02            1.717949      0.769231                 1217.493457             9.778306            35.910256            81.480513        0.440722
-              Gpt Oss 120B             78          938  12.025641         1226              3190828          24633.15            3.692308      0.756410                 2602.632953            20.092292            51.320513            80.710427        0.635860
-               Deepseek R1             78          853  10.935897         1069              9229131         165334.16            5.064103      0.833333                 8633.424696           154.662451            69.705128           166.426838        0.418833
-               Gpt Oss 20B             78          773   9.910256         1277              7009392          62397.50            6.205128      0.717949                 5488.952232            48.862569            80.782051           122.849402        0.657570
-          Claude Haiku 4.5             78          713   9.141026         1223              6973411          57734.39            7.551282      0.705128                 5701.889616            47.207187            88.576923           152.125983        0.582260
 Saved: results/260121_78_rounds/basic_metrics.csv
 Saved: results/260121_78_rounds/overall_performance.png
@@ -46,6 +46,26 @@ Saved: results/260121_78_rounds/calibration_curves.png
 Saved: results/260121_78_rounds/calibration_curves.json
 Saved: results/260121_78_rounds/confidence_distribution.png
 Saved: results/260121_78_rounds/confidence_distribution.json
 ============================================================
 BY-RULE ANALYSIS
@@ -53,32 +73,32 @@ BY-RULE ANALYSIS
 Score by rule (sorted by avg_score):
                                                                                                                                             rule_description  count  avg_score  std_score  success_rate
-                                                                                                                        Only red cards (hearts or diamonds).     30  24.666667   2.218004      1.000000
-                                                                                                                              Only cards of the suit spades.     30  24.233333   2.045741      1.000000
-                                                                             Cards must alternate between red and black colors. Any card may start the line.     30  24.200000   2.670400      1.000000
-                                                                                                               Only cards with an even rank (2,4,6,8,10,12).     30  23.466667   2.812942      1.000000
-                                                             The card must be of a different suit than the card just before it. Any card may start the line.     30  21.500000   6.317736      0.966667
-                                                      Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line.     30  20.066667   5.051004      1.000000
-                                                                                             Only hearts, clubs, and diamonds allowed. Spades are forbidden.     30  19.933333   5.501933      0.966667
-                                                                                                                                        Only Aces (rank 1) .     30  19.500000   8.569191      0.966667
-                                                                                                          Only ranks that are prime numbers (2,3,5,7,11,13).     30  19.266667   6.781991      0.966667
-                                           The card must be of a different suit than but same color as the card just before it. Any card may start the line.     30  19.166667   7.479090      1.000000
-                                                                                                                                 Only face cards (11,12,13).     30  19.000000   8.068671      0.900000
-                                                                                                                                   Only spades and diamonds.     30  18.400000   4.476760      1.000000
-                                           Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line.     30  14.700000  12.151770      1.000000
-                                                                                                                       Only cards between 1 and 7 inclusive.     30  13.400000   8.495841      0.966667
-                                                                                                                                      Only black face cards.     30  10.333333   9.830752      0.900000
-                                                                                               Alternate face and number cards. Any card may start the line.     30   7.100000  12.273745      0.733333
-                                                              Each card must have a rank greater or equal to the previous card. Only Ace can start the line.     30   6.966667  10.607360      0.600000
-                                                                                                                       Only cards between 5 and 9 inclusive.     30   6.600000   9.264690      0.933333
-                                 Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line.     30   6.433333  11.990466      0.666667
-                                                                                                                           Only red cards whose rank is <=7.     30   4.366667  11.109124      1.000000
-Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc.     30   1.700000  11.166915      0.766667
-                   Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc.     30   0.766667   4.031628      0.133333
-                                                                                          Face cards (11-13) must be red; number cards (1-10) must be black.     30   0.533333   8.357253      0.500000
-                                     Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line.     30   0.466667   7.951288      0.400000
-    If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive.     30  -1.766667   9.743905      0.333333
-        Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it.     30  -2.233333   8.319828      0.533333
 Saved: results/260121_78_rounds/by_rule.png
 Saved: results/260121_78_rounds/by_rule.json
@@ -112,22 +132,42 @@ Saved: results/260121_78_rounds/caution_vs_failed_guesses.png
 Saved: results/260121_78_rounds/caution_vs_failed_guesses.json
 ============================================================
-COMPLEXITY ANALYSIS
 ============================================================
-Optimal K for aggregated complexity: 0.10
-  Formula: complexity = cyclomatic + 0.10 * node_count
-  Correlation with relative_score: -0.478
-Score by complexity quartile:
-complexity_bin  count  avg_score  avg_relative_score  success_rate
-            Q1    240  18.850000            1.543589      0.966667
-            Q2    150  13.353333            1.076587      0.893333
-            Q3    210  12.228571            0.977793      0.800000
-            Q4    180   3.266667            0.237300      0.572222
-Saved: results/260121_78_rounds/complexity_analysis.png
-Saved: results/260121_78_rounds/complexity_analysis.json
 ============================================================
 PER-MODEL REPORTS

 BASIC MODEL COMPARISON
 ============================================================
+                     model  rounds_played  total_score  avg_score  total_floored_score  avg_floored_score  total_turns  total_output_tokens  total_wall_clock  avg_failed_guesses  success_rate  total_no_stakes_score  avg_no_stakes_score  avg_output_tokens_per_turn  wall_clock_per_turn  intra_rule_variance  inter_rule_variance  variance_ratio
+              Gpt 5.2 High             78         1158  14.846154                 1174          15.051282         1205              3341037          73525.83            0.333333      0.961538                 1505.0            19.294872                 2772.644813            61.017286            25.858974            43.513162        0.594279
+           Claude Opus 4.5             78         1128  14.461538                 1324          16.974359          852              4333716          86367.64            2.769231      0.923077                 1598.0            20.487179                 5086.521127           101.370469            87.525641           180.000684        0.486252
+         Gpt 5 Mini Medium             78          942  12.076923                 1052          13.487179         1261              3618399          58345.97            1.256410      0.756410                 1325.0            16.987179                 2869.467883            46.269603            58.166667           115.878291        0.501963
+Gemini 3 Flash Preview Low             78          817  10.474359                 1024          13.128205         1315              1581524          12702.02            1.717949      0.769231                 1226.0            15.717949                 1202.679848             9.659331            61.128205           154.810427        0.394858
+                   Kimi K2             78          804  10.307692                 1262          16.179487          975             12281540         101346.76            4.025641      0.858974                 1481.0            18.987179                12596.451282           103.945395           182.564103           343.003761        0.532251
+   Grok 4 1 Fast Reasoning             78          737   9.448718                 1182          15.153846          998              8178655         120364.22            4.320513      0.884615                 1441.0            18.474359                 8195.045090           120.605431           109.256410           357.652821        0.305482
+              Gpt Oss 120B             78          580   7.435897                 1004          12.871795         1243              3190828          24633.15            3.692308      0.756410                 1279.0            16.397436                 2567.037812            19.817498           186.794872           225.517949        0.828293
+               Deepseek R1             78          511   6.551282                 1036          13.282051         1104              9229131         165334.16            5.064103      0.833333                 1331.0            17.064103                 8359.720109           149.759203           152.269231           353.910598        0.430248
+               Gpt Oss 20B             78          131   1.679487                  927          11.884615         1297              7009392          62397.50            6.205128      0.717949                 1206.0            15.461538                 5404.311488            48.109098           230.115385           421.666496        0.545728
+          Claude Haiku 4.5             78          -37  -0.474359                  894          11.461538         1254              6973411          57734.39            7.551282      0.705128                 1198.0            15.358974                 5560.933812            46.040183           244.730769           504.499316        0.485096
 Saved: results/260121_78_rounds/basic_metrics.csv
 Saved: results/260121_78_rounds/overall_performance.png
 Saved: results/260121_78_rounds/calibration_curves.json
 Saved: results/260121_78_rounds/confidence_distribution.png
 Saved: results/260121_78_rounds/confidence_distribution.json
+Saved: results/260121_78_rounds/score_stack.png
+Saved: results/260121_78_rounds/score_stack.json
+============================================================
+COMPLEXITY ANALYSIS
+============================================================
+Optimal K for aggregated complexity: 0.42
+  Formula: complexity = cyclomatic + 0.42 * node_count
+  Correlation with success_rate: -0.612
+Stats by complexity quartile:
+complexity_bin  count  avg_score  success_rate
+            Q1    240  18.745833      0.966667
+            Q2    150  11.246667      0.893333
+            Q3    180  11.138889      0.866667
+            Q4    210  -6.761905      0.547619
+Saved: results/260121_78_rounds/complexity_analysis.png
+Saved: results/260121_78_rounds/complexity_analysis.json
 ============================================================
 BY-RULE ANALYSIS
 Score by rule (sorted by avg_score):
                                                                                                                                             rule_description  count  avg_score  std_score  success_rate
+                                                                                                                        Only red cards (hearts or diamonds).     30  25.633333   2.204749      1.000000
+                                                                                                                              Only cards of the suit spades.     30  25.200000   2.023994      1.000000
+                                                                             Cards must alternate between red and black colors. Any card may start the line.     30  25.166667   2.640315      1.000000
+                                                                                                               Only cards with an even rank (2,4,6,8,10,12).     30  24.300000   2.692903      1.000000
+                                                             The card must be of a different suit than the card just before it. Any card may start the line.     30  21.666667   8.659590      0.966667
+                                                      Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line.     30  20.666667   5.148373      1.000000
+                                                                                                                                        Only Aces (rank 1) .     30  20.233333   8.931476      0.966667
+                                           The card must be of a different suit than but same color as the card just before it. Any card may start the line.     30  19.866667   7.541761      1.000000
+                                                                                             Only hearts, clubs, and diamonds allowed. Spades are forbidden.     30  19.533333  10.836507      0.966667
+                                                                                                                                   Only spades and diamonds.     30  19.066667   4.487018      1.000000
+                                                                                                          Only ranks that are prime numbers (2,3,5,7,11,13).     30  18.633333  12.527166      0.966667
+                                                                                                                                 Only face cards (11,12,13).     30  17.033333  16.044084      0.900000
+                                           Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line.     30  15.100000  12.234350      1.000000
+                                                                                                                       Only cards between 1 and 7 inclusive.     30  13.366667  10.148835      0.966667
+                                                                                                                                      Only black face cards.     30   7.700000  16.316165      0.900000
+                                                                                                                           Only red cards whose rank is <=7.     30   4.866667  11.227225      1.000000
+                                                                                                                       Only cards between 5 and 9 inclusive.     30   4.666667  14.406257      0.933333
+                                                                                               Alternate face and number cards. Any card may start the line.     30   0.366667  20.553519      0.733333
+                                 Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line.     30  -1.066667  20.915154      0.666667
+                                                              Each card must have a rank greater or equal to the previous card. Only Ace can start the line.     30  -3.433333  22.931206      0.600000
+Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc.     30  -5.200000  18.917972      0.766667
+        Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it.     30 -10.466667  13.050917      0.533333
+                                                                                          Face cards (11-13) must be red; number cards (1-10) must be black.     30 -11.500000  17.814659      0.500000
+                                     Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line.     30 -12.066667  16.772172      0.400000
+    If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive.     30 -15.633333  15.354396      0.333333
+                   Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc.     30 -18.000000  16.103116      0.133333
 Saved: results/260121_78_rounds/by_rule.png
 Saved: results/260121_78_rounds/by_rule.json
 Saved: results/260121_78_rounds/caution_vs_failed_guesses.json
 ============================================================
+RECKLESS GUESSING ANALYSIS
 ============================================================
+Double-Down Rate: After a wrong guess, % of next turns with another guess
+(Only counts official guesses, not shadow/tentative guesses)
+                     Model  Wrong Guesses  Next Turn Guesses  Double-Down %
+                   Kimi K2            314                207           65.9
+          Claude Haiku 4.5            589                362           61.5
+   Grok 4 1 Fast Reasoning            337                203           60.2
+               Gpt Oss 20B            484                290           59.9
+               Deepseek R1            395                229           58.0
+           Claude Opus 4.5            216                 91           42.1
+              Gpt Oss 120B            288                108           37.5
+Gemini 3 Flash Preview Low            134                 41           30.6
+         Gpt 5 Mini Medium             98                  9            9.2
+              Gpt 5.2 High             26                  1            3.8
+Wrong Guess Streak Statistics:
+                     Model  Streaks  Mean Length  Max Length  Total Wrong
+                   Kimi K2      120         2.62          14          314
+          Claude Haiku 4.5      244         2.41          16          589
+   Grok 4 1 Fast Reasoning      149         2.26          12          337
+               Gpt Oss 20B      207         2.34          13          484
+               Deepseek R1      180         2.19           9          395
+           Claude Opus 4.5      139         1.55           5          216
+              Gpt Oss 120B      184         1.57           8          288
+Gemini 3 Flash Preview Low       97         1.38           4          134
+         Gpt 5 Mini Medium       91         1.08           3           98
+              Gpt 5.2 High       25         1.04           2           26
+Longest streak: 16 consecutive wrong guesses
+  - Claude Haiku 4.5 in round 77
+Saved: results/260121_78_rounds/reckless_guessing.png
+Saved: results/260121_78_rounds/reckless_guessing.json
 ============================================================
 PER-MODEL REPORTS

app/src/content/chapters/eleusis/benchmark.mdx CHANGED

@@ -26,9 +26,9 @@ On each turn, the player selects a card from their hand to play. If the card sat
 When correctly guessing the rule, the player scores as many points as the number of remaining turns, and each wrong guess deducts a penalty of 2 points:
-$$\text{score} = (30 - \text{turns\_used}) - 2 \times \text{wrong\_guesses}$$
-A player who correctly identifies the rule on turn 10 with no wrong guesses scores 20 points; one who made 3 wrong guesses along the way scores only 14. Failing to identify the rule scores 0. This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence, exactly the calibration we want to measure.
 ### Rule Library

 When correctly guessing the rule, the player scores as many points as the number of remaining turns, and each wrong guess deducts a penalty of 2 points:
+$$\text{score} = (30 - \text{turns\_elapsed} + 1) - 2 \times \text{num\_wrong\_guesses}$$
+A player who correctly identifies the rule on turn 13 with no wrong guesses scores 18 points; one who made 3 wrong guesses along the way scores only 12. Failing to identify the rule earns no points, but penalties for wrong guesses still apply, so the final score can be negative. This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence, exactly the calibration we want to measure.
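As a rough illustration of the scoring rule above, here is a minimal Python sketch. The function name `round_score` and its parameters are hypothetical and not taken from the benchmark code. It assumes an unsolved round earns no turn bonus but still pays the wrong-guess penalty, as described in the text.

```python
def round_score(turn_solved, wrong_guesses, max_turns=30, penalty=2):
    """Score one round under the rule described above (illustrative sketch).

    turn_solved: 1-based turn on which the rule was correctly guessed,
                 or None if the rule was never identified (hypothetical encoding).
    wrong_guesses: number of wrong official guesses made during the round.
    """
    # Remaining-turn bonus: (30 - turns_elapsed + 1), or nothing if unsolved.
    bonus = 0 if turn_solved is None else max_turns - turn_solved + 1
    # Each wrong guess costs `penalty` points, even when the round is lost.
    return bonus - penalty * wrong_guesses


print(round_score(13, 0))    # solved on turn 13, no wrong guesses
print(round_score(13, 3))    # same turn, three wrong guesses
print(round_score(None, 4))  # never solved, four wrong guesses
```

The three calls reproduce the cases discussed in the text: 18 points, 12 points, and a negative score for an unsolved round with wrong guesses.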
 ### Rule Library

app/src/content/chapters/eleusis/results.mdx CHANGED

@@ -3,13 +3,6 @@ import Wide from "../../../components/Wide.astro";
 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
 import HtmlEmbed from "../../../components/HtmlEmbed.astro";
-import calibrationCurves from "../../assets/data/calibration_curves.png";
-import confidenceDistribution from "../../assets/data/confidence_distribution.png";
-import scoreVsFailedGuesses from "../../assets/data/score_vs_failed_guesses.png";
-import cautionVsFailedGuesses from "../../assets/data/caution_vs_failed_guesses.png";
-import excessCaution from "../../assets/data/excess_caution.png";
-import byRule from "../../assets/data/by_rule.png";
-import complexityAnalysis from "../../assets/data/complexity_analysis.png";
 ## Results
@@ -34,12 +27,10 @@ Deepseek R1, an open-weight model specialized for reasoning tasks, lags behind a
 Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempt to guess. This allows us to evaluate calibration: does reported confidence match actual accuracy?
-<Image
-  src={calibrationCurves}
-  alt="Calibration curves showing reported confidence vs actual success rate for all models"
-  caption="<strong>Figure 2:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence : they correspond to confidence levels where actual success rates are lower than reported."
   id="fig-calibration"
-  zoomable
 />
 The calibration analysis reveals several patterns:
@@ -51,29 +42,27 @@ The calibration analysis reveals several patterns:
 It is also interesting to examine the distribution of confidence levels when models choose to guess.
-<Image
-  src={confidenceDistribution}
-  alt="Histogram showing distribution of confidence levels when models choose to guess vs not guess"
-  caption="<strong>Figure 3:</strong> Distribution of confidence levels. Left: when models choose to formally guess. Right: when models choose not to guess. Well-calibrated models should show clear separation between these distributions."
   id="fig-confidence"
-  zoomable
 />
 We can see that some models, like Grok 4.1 or Gemini 3, essentially only guess when very confident (9 or 10). Others, like GPT 5.2 High or Kimi K2, also guess at confidence level 8. Surprisingly, the best-performing model, Claude Opus 4.5, has a more spread-out guessing behavior, often guessing at confidence level 7 or even 6. Claude Haiku 4.5 has the most reckless guessing behavior, mostly guessing at confidence levels 6 to 8.
 Being able to separate confidence levels when guessing vs not guessing is an important metacognitive skill. Models that guess only when very confident are less likely to make wrong guesses, but may miss opportunities to commit early and gain points. Models that guess at lower confidence levels risk more wrong guesses, but can capitalize on early correct guesses. This trade-off is explored next.
 ### Guessing Strategy
 The scoring system creates a strategic tension: guess early for more points, but wrong guesses are costly. How do models navigate this tradeoff? We can analyze their guessing efficiency by plotting average score vs average number of failed guesses per round.
-<Image
-  src={scoreVsFailedGuesses}
-  alt="2D scatter plot showing average score vs average number of failed guesses per round for each model"
   caption="<strong>Figure 4:</strong> Score vs. failed guesses per round. Models in the upper-left are efficient (high scores, few wrong guesses). Models that guess recklessly appear on the right with low scores."
   id="fig-guessing"
-  zoomable
 />
 <Sidenote>
@@ -84,12 +73,10 @@ The scoring system creates a strategic tension: guess early for more points, but
 Failed guesses tell only half the story. A model might avoid wrong guesses by being *too* cautious—waiting many turns after it already has the correct answer. To measure this, we tracked "early correct turns": how many consecutive turns a model's tentative rule was correct before it finally chose to guess.
-<Image
-  src={excessCaution}
-  alt="Box plot showing distribution of early correct turns for each model"
-  caption="<strong>Figure 5:</strong> Distribution of early correct turns (waiting with the correct answer). Higher values indicate excessive caution—the model knew the answer but hesitated to commit. GPT 5.2 High stands out as extremely cautious, with a median of 3 turns of unnecessary delay."
   id="fig-excess-caution"
-  zoomable
 />
 The results reveal striking differences in guessing personalities:
@@ -98,12 +85,10 @@ The results reveal striking differences in guessing personalities:
 - **Claude Opus 4.5** shows excellent timing—only 0.9 early correct turns on average, meaning it commits almost immediately after finding the answer.
 - **Claude Haiku 4.5** and **DeepSeek R1** are the least cautious (0.5 early turns), but this comes at a cost: they also have the highest failed guess rates.
-<Image
-  src={cautionVsFailedGuesses}
-  alt="Scatter plot showing caution (early correct turns) vs recklessness (failed guesses) for each model"
   caption="<strong>Figure 6:</strong> The caution-recklessness trade-off. Models in the upper-left are cautious (delay correct guesses); models in the lower-right are reckless (many failed guesses). The ideal position is lower-left: quick to commit when right, rarely wrong."
   id="fig-caution-reckless"
-  zoomable
 />
 <Sidenote>
@@ -118,8 +103,45 @@ This visualization reveals distinct behavioral patterns:
 * DeepSeek R1 and Claude Haiku 4.5 cluster in the lower-right, being both reckless and not particularly cautious, leading to poor performance.
 The data suggests that knowing when you know is just as important as knowing the answer. Claude Opus 4.5's strong performance comes not just from finding correct rules, but from accurate metacognition, recognizing when it has gathered enough evidence to commit, even at the risk of occasional wrong guesses.
 ### Performance by Rule
 Not all rules are created equal. Some rules are discovered quickly by all models (e.g. "All cards must be red") while others prove consistently challenging (e.g. "increase rank after a red card, decrease after a black").
@@ -128,26 +150,21 @@ It is not easy to quantify rule complexity, as it depends on multiple factors: t
 The following figure breaks down performance by rule across all models and runs.
-<Wide>
-<Image
-  src={byRule}
-  alt="Performance breakdown by rule showing score distribution for each rule across all models"
-  caption="<strong>Figure 7:</strong> Score distribution by rule. Each row is a different rule, with individual run scores shown as points. Some rules are consistently easy for all models, while others show wide variance and lower scores, indicating higher complexity. For each rule, we computed a complexity score (see below) to analyze its impact on performance."
   id="fig-by-rule"
-  zoomable
 />
-</Wide>
 We can see that the most complex rules are devastating for the reckless models like Claude Haiku 4.5 and DeepSeek R1, which often get negative scores on these rules due to multiple wrong guesses. Even the best models struggle on the hardest rules, but their superior metacognition allows them to avoid catastrophic failures.
 The following plot breaks down the relative score of each model (as measured by score on the rule divided by average score on all rules) against the complexity metrics of each rule.
-<Image
-  src={complexityAnalysis}
-  alt="Scatter plot showing relationship between rule complexity metrics and model performance"
-  caption="<strong>Figure 8:</strong> Relationship between rule complexity and performance. Multiple complexity factors contribute: acceptance rate, structural complexity, and semantic difficulty."
   id="fig-complexity"
-  zoomable
 />
 <Note variant="info">

 import Note from "../../../components/Note.astro";
 import Sidenote from "../../../components/Sidenote.astro";
 import HtmlEmbed from "../../../components/HtmlEmbed.astro";
 ## Results
 Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempt to guess. This allows us to evaluate calibration: does reported confidence match actual accuracy?
+<HtmlEmbed
+  src="calibration-curves.html"
+  caption="<strong>Figure 2:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence: they correspond to confidence levels where actual success rates are lower than reported. Click legend items to show/hide models."
   id="fig-calibration"
 />
 The calibration analysis reveals several patterns:
 It is also interesting to examine the distribution of confidence levels when models choose to guess.
+<HtmlEmbed
+  src="confidence-distribution.html"
+  caption="<strong>Figure 3:</strong> Distribution of confidence levels when models choose to formally guess. Each bar shows the proportion of guesses made at that confidence level. Click legend items to show/hide models."
   id="fig-confidence"
 />
 We can see that some models, like Grok 4.1 or Gemini 3, essentially only guess when very confident (9 or 10). Others, like GPT 5.2 High or Kimi K2, also guess at confidence level 8. Surprisingly, the best-performing model, Claude Opus 4.5, has a more spread-out guessing behavior, often guessing at confidence level 7 or even 6. Claude Haiku 4.5 has the most reckless guessing behavior, mostly guessing at confidence levels 6 to 8.
 Being able to separate confidence levels when guessing vs not guessing is an important metacognitive skill. Models that guess only when very confident are less likely to make wrong guesses, but may miss opportunities to commit early and gain points. Models that guess at lower confidence levels risk more wrong guesses, but can capitalize on early correct guesses. This trade-off is explored next.
+Note that in principle there is a decision-theoretic optimal confidence threshold for guessing, which depends on the scoring system. Given the scoring that rewards 1 point per turn left, with 2 points penalty for a wrong guess, the optimal threshold is 0.67 (i.e., guess when you believe your tentative rule has at least a 67% chance of being correct). Of course this assumes perfect calibration, which none of the models achieve.
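One way to see where this threshold comes from is a simplified one-step comparison (a sketch, not necessarily the authors' derivation). Let $p$ be the probability that the current tentative rule is correct, and assume the only alternative to guessing now is to wait one more turn before guessing. Guessing now gains 1 point if the rule is right (one more remaining turn) and costs 2 points if it is wrong, so the expected gain from guessing now is

$$\mathbb{E}[\text{gain}] = p \cdot 1 - (1 - p) \cdot 2 = 3p - 2,$$

which is non-negative exactly when $p \geq 2/3 \approx 0.67$.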
 ### Guessing Strategy
 The scoring system creates a strategic tension: guess early for more points, but wrong guesses are costly. How do models navigate this tradeoff? We can analyze their guessing efficiency by plotting average score vs average number of failed guesses per round.
+<HtmlEmbed
+  src="score-vs-failed-guesses.html"
   caption="<strong>Figure 4:</strong> Score vs. failed guesses per round. Models in the upper-left are efficient (high scores, few wrong guesses). Models that guess recklessly appear on the right with low scores."
   id="fig-guessing"
 />
 <Sidenote>
 Failed guesses tell only half the story. A model might avoid wrong guesses by being *too* cautious—waiting many turns after it already has the correct answer. To measure this, we tracked "early correct turns": how many consecutive turns a model's tentative rule was correct before it finally chose to guess.
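The "early correct turns" metric can be sketched as follows. This is a minimal illustration, assuming each round is available as a list of booleans saying whether the tentative rule was correct on each turn before the final guess. The function name and input encoding are hypothetical and not taken from the benchmark code.

```python
def early_correct_turns(tentative_correct):
    """Count the consecutive turns, immediately before the guess, on which
    the model's tentative rule was already correct (illustrative sketch).

    tentative_correct: list of booleans, one per turn preceding the turn on
    which the model finally guessed (hypothetical encoding).
    """
    count = 0
    # Walk backwards from the turn just before the guess and stop at the
    # first turn where the tentative rule was not correct.
    for was_correct in reversed(tentative_correct):
        if not was_correct:
            break
        count += 1
    return count


# The model held the right rule for three turns before committing.
print(early_correct_turns([False, False, True, True, True]))
```

A value of 0 means the model guessed as soon as its tentative rule became correct. Larger values mean it waited while already holding the right answer.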
+<HtmlEmbed
+  src="excess-caution.html"
+  caption="<strong>Figure 5:</strong> Distribution of early correct turns (waiting with the correct answer). Higher values indicate excessive caution—the model knew the answer but hesitated to commit. GPT 5.2 High stands out as extremely cautious, with a mean of 3.6 turns of unnecessary delay."
   id="fig-excess-caution"
 />
 The results reveal striking differences in guessing personalities:
 - **Claude Opus 4.5** shows excellent timing—only 0.9 early correct turns on average, meaning it commits almost immediately after finding the answer.
 - **Claude Haiku 4.5** and **DeepSeek R1** are the least cautious (0.5 early turns), but this comes at a cost: they also have the highest failed guess rates.
+<HtmlEmbed
+  src="caution-vs-failed-guesses.html"
   caption="<strong>Figure 6:</strong> The caution-recklessness trade-off. Models in the upper-left are cautious (delay correct guesses); models in the lower-right are reckless (many failed guesses). The ideal position is lower-left: quick to commit when right, rarely wrong."
   id="fig-caution-reckless"
 />
 <Sidenote>
 * DeepSeek R1 and Claude Haiku 4.5 cluster in the lower-right, being both reckless and not particularly cautious, leading to poor performance.
 The data suggests that knowing when you know is just as important as knowing the answer. Claude Opus 4.5's strong performance comes not just from finding correct rules, but from accurate metacognition, recognizing when it has gathered enough evidence to commit, even at the risk of occasional wrong guesses.
+This analysis contrasts two ways of losing points: being too cautious (waiting too long to commit) and being too reckless (making too many wrong guesses). One way to visualize this is to explore alternative scoring systems, as we do next.
+### Alternative Scoring Systems
+The Eleusis scoring system includes harsh penalties: wrong guesses cost 2 points each, and rounds can end with negative scores. How much do these penalties affect rankings? To understand the impact of our scoring choices, we compare three scoring variants:
+1. **Raw score**: The standard scoring (30 - turns + 1 - 2×wrong guesses)
+2. **Floored score**: Same formula, but negative scores are counted as zero
+3. **No-stakes score**: No penalty for wrong guesses, and tentative rules count as guesses
+<HtmlEmbed
+  src="score-stack.html"
+  caption="<strong>Figure 7:</strong> Score breakdown under alternative scoring systems. Blue shows raw score (standard scoring). Orange shows flooring gain (what models gain if negative scores count as 0). Green shows no-stakes gain (additional gain from removing wrong-guess penalties). Models sorted by total no-stakes score."
+  id="fig-score-stack"
+  wide
+/>
+The flooring gain (orange) reveals which models frequently go negative. GPT 5.2 High gains almost nothing from flooring (0.2 points), indicating it rarely makes enough wrong guesses to go negative. In contrast, Claude Haiku 4.5 gains 11.9 points per round on average, showing how its reckless guessing leads to catastrophic losses.
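The flooring gain can be computed from per-round raw scores as sketched below. The function name and the example scores are hypothetical, chosen only to illustrate the arithmetic. They are not benchmark data.

```python
def flooring_gain(raw_scores):
    """Average points per round gained when negative round scores are
    counted as zero (illustrative sketch).

    raw_scores: list of per-round raw scores for one model.
    """
    floored = [max(score, 0) for score in raw_scores]
    # Gain = mean floored score minus mean raw score.
    return (sum(floored) - sum(raw_scores)) / len(raw_scores)


# Made-up rounds: two of the four go negative and are floored to 0.
print(flooring_gain([18, -6, 0, -2]))
```

A model that never goes negative has a flooring gain of 0. The gain grows with both the frequency and the depth of negative rounds.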
+The no-stakes gain (green) shows what models would gain if we simply tested their tentative rule each turn. Interestingly, this gain is relatively consistent across models (2.5–4.2 points), suggesting that most models form correct hypotheses at similar rates, but differ dramatically in their ability to *recognize* when they have the right answer.
+Under any scoring system, Claude Opus 4.5 and GPT 5.2 High remain the top performers. The ranking compression at no-stakes scores (15.4 to 20.5 vs raw -0.5 to 14.8) confirms that our scoring system appropriately rewards good metacognition—knowing when you know.
+### Analysis of Reckless Guessing Behavior
+Some models lose a lot of points due to reckless guessing. In the "no stakes" scoring system, Claude Opus 4.5 takes the lead, and Kimi K2 and Grok 4.1 perform similarly to GPT 5.2 High.
+<HtmlEmbed
+  src="reckless-guessing.html"
+  caption="<strong>Figure 7b:</strong> Double-down rate: how often a model guesses again immediately after a wrong guess. Higher values indicate more reckless behavior—the model keeps guessing despite recent failures."
+  id="fig-reckless-guessing"
+/>
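The double-down rate and wrong-guess streaks can be sketched as below. This assumes each round is a list of per-turn outcomes encoded as `"none"` (no official guess), `"wrong"`, or `"correct"`. That encoding and the function names are hypothetical. The sketch also assumes a streak means wrong guesses on consecutive turns, which matches the reported totals but is not confirmed by the source.

```python
def double_down_rate(rounds):
    """Percentage of wrong guesses followed by another official guess
    on the very next turn (illustrative sketch)."""
    wrong = 0
    followed = 0
    for turns in rounds:
        for i, outcome in enumerate(turns):
            if outcome != "wrong":
                continue
            wrong += 1
            # Count a double-down if the next turn contains any official guess.
            if i + 1 < len(turns) and turns[i + 1] in ("wrong", "correct"):
                followed += 1
    return 100.0 * followed / wrong if wrong else 0.0


def wrong_streaks(turns):
    """Lengths of runs of wrong guesses on consecutive turns in one round."""
    streaks, run = [], 0
    for outcome in turns:
        if outcome == "wrong":
            run += 1
        elif run:
            streaks.append(run)
            run = 0
    if run:
        streaks.append(run)
    return streaks


# Made-up example rounds, not benchmark data.
rounds = [
    ["none", "wrong", "wrong", "none", "wrong", "correct"],
    ["wrong", "none"],
]
print(double_down_rate(rounds))
print([wrong_streaks(r) for r in rounds])
```

In this sketch the denominator is every wrong guess, including one made on the last turn of a round. That is consistent with the reported table, for example 207 / 314 ≈ 65.9% for Kimi K2.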
 ### Performance by Rule
 Not all rules are created equal. Some rules are discovered quickly by all models (e.g. "All cards must be red") while others prove consistently challenging (e.g. "increase rank after a red card, decrease after a black").
 The following figure breaks down performance by rule across all models and runs.
+<HtmlEmbed
+  src="by-rule.html"
+  caption="<strong>Figure 8:</strong> Score distribution by rule. Each row is a different rule, with individual run scores shown as colored dots (one per model run). Hover over rule names for details. The left column shows average success rate. Click legend items to show/hide models."
   id="fig-by-rule"
+  wide
 />
 We can see that the most complex rules are devastating for the reckless models like Claude Haiku 4.5 and DeepSeek R1, which often get negative scores on these rules due to multiple wrong guesses. Even the best models struggle on the hardest rules, but their superior metacognition allows them to avoid catastrophic failures.
 The following plot breaks down the relative score of each model (as measured by score on the rule divided by average score on all rules) against the complexity metrics of each rule.
+<HtmlEmbed
+  src="complexity-analysis.html"
+  caption="<strong>Figure 9:</strong> Relationship between rule complexity and model performance. The heatmap shows relative scores (value > 1 means above-average performance) for each model across complexity quartiles. Hover over cells for details."
   id="fig-complexity"
 />
 <Note variant="info">

app/src/content/embeds/by-rule.html ADDED

	@@ -0,0 +1,521 @@

+<div class="d3-by-rule"></div>
+<style>
+  .d3-by-rule {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-by-rule svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-by-rule .axes path,
+  .d3-by-rule .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-by-rule .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-by-rule .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  .d3-by-rule .axes text.axis-label {
+    font-size: 14px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-by-rule .x-axis text {
+    transform: translateY(4px);
+  }
+  .d3-by-rule .rule-label {
+    font-size: 10px;
+    fill: var(--text-color);
+    cursor: pointer;
+  }
+  .d3-by-rule .rule-label:hover {
+    text-decoration: underline;
+  }
+  .d3-by-rule .complexity-bar {
+    opacity: 0.85;
+  }
+  .d3-by-rule .complexity-text {
+    font-size: 9px;
+    font-weight: 600;
+    pointer-events: none;
+  }
+  .d3-by-rule .point {
+    opacity: 0.85;
+    transition: opacity 0.1s ease;
+  }
+  .d3-by-rule .point:hover {
+    opacity: 1;
+  }
+  .d3-by-rule .point.dimmed {
+    opacity: 0.15;
+  }
+  .d3-by-rule .legend-item {
+    cursor: pointer;
+  }
+  .d3-by-rule .legend-item.inactive .legend-dot {
+    opacity: 0.3;
+  }
+  .d3-by-rule .legend-item.inactive .legend-text {
+    opacity: 0.5;
+    text-decoration: line-through;
+  }
+  .d3-by-rule .legend-text {
+    font-size: 10px;
+    fill: var(--text-color);
+  }
+  .d3-by-rule .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.5;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+    max-width: 320px;
+  }
+  .d3-by-rule .d3-tooltip .rule-name {
+    font-weight: 600;
+    margin-bottom: 6px;
+  }
+  .d3-by-rule .d3-tooltip .rule-desc {
+    margin-bottom: 8px;
+    color: var(--muted-color);
+    font-size: 11px;
+  }
+  .d3-by-rule .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-by-rule .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-by-rule .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+</style>
+<script>
+  (() => {
+    // Capture the script element synchronously: document.currentScript is null
+    // by the time bootstrap runs from a load or DOMContentLoaded callback.
+    const scriptEl = document.currentScript;
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-by-rule'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-by-rule'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gComplexity = gRoot.append('g').attr('class', 'complexity');
+      const gPoints = gRoot.append('g').attr('class', 'points');
+      const gLabels = gRoot.append('g').attr('class', 'labels');
+      const gLegend = gRoot.append('g').attr('class', 'legend');
+      // State
+      let data = null;
+      let modelColors = null;
+      let width = 800;
+      let height = 800;
+      const margin = { top: 20, right: 140, bottom: 50, left: 180 };
+      const complexityBarWidth = 30;
+      const complexityGap = 8;
+      // Active models (all visible by default)
+      let activeModels = new Set();
+      // Scales
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleBand();
+      // Green to red scale: high success (1.0) = green, low success (0) = red
+      const successColorScale = d3.scaleSequential(d3.interpolateRdYlGn);
+      // Data loading
+      const DATA_URL = '/data/by_rule.json';
+      const COLORS_URL = '/data/overall_performance.json';
+      function updateSize() {
+        width = container.clientWidth || 800;
+        const numRules = data ? data.rules.length : 26;
+        const rowHeight = 24;
+        height = margin.top + margin.bottom + numRules * rowHeight;
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function formatRuleName(name) {
+        return name.replace(/_/g, ' ').replace(/\b\w/g, c => c.toUpperCase());
+      }
+      function showRuleTooltip(event, rule) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        tip.innerHTML = `
+          <div class="rule-name">${formatRuleName(rule.name)}</div>
+          <div class="rule-desc">${rule.description}</div>
+          <div class="metric">
+            <span class="metric-label">Success Rate:</span>
+            <span class="metric-value">${(rule.success_rate * 100).toFixed(1)}%</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Average Score:</span>
+            <span class="metric-value">${rule.avg_score.toFixed(1)}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Cyclomatic Complexity:</span>
+            <span class="metric-value">${rule.cyclomatic_complexity}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">AST Node Count:</span>
+            <span class="metric-value">${rule.node_count}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Aggregated Complexity:</span>
+            <span class="metric-value">${rule.aggregated_complexity.toFixed(1)}</span>
+          </div>
+        `;
+        const tipWidth = tip.offsetWidth || 200;
+        const tipHeight = tip.offsetHeight || 140;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function getContrastColor(color) {
+        // Handle both hex (#rrggbb) and rgb(r, g, b) formats
+        let r, g, b;
+        if (color.startsWith('#')) {
+          const hex = color.replace('#', '');
+          r = parseInt(hex.slice(0, 2), 16) / 255;
+          g = parseInt(hex.slice(2, 4), 16) / 255;
+          b = parseInt(hex.slice(4, 6), 16) / 255;
+        } else if (color.startsWith('rgb')) {
+          const match = color.match(/rgb\((\d+),\s*(\d+),\s*(\d+)\)/);
+          if (match) {
+            r = parseInt(match[1], 10) / 255;
+            g = parseInt(match[2], 10) / 255;
+            b = parseInt(match[3], 10) / 255;
+          } else {
+            return '#000000';
+          }
+        } else {
+          return '#000000';
+        }
+        const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
+        return luminance > 0.5 ? '#000000' : '#ffffff';
+      }
+      function toggleModel(modelName) {
+        if (activeModels.has(modelName)) {
+          activeModels.delete(modelName);
+        } else {
+          activeModels.add(modelName);
+        }
+        render();
+      }
+      function render() {
+        if (!data || !modelColors) return;
+        const { innerWidth, innerHeight } = updateSize();
+        const rules = data.rules;
+        const chartWidth = innerWidth - complexityBarWidth - complexityGap;
+        // Update scales
+        const allScores = [];
+        rules.forEach(rule => {
+          Object.values(rule.scores_by_model).forEach(scores => {
+            allScores.push(...scores);
+          });
+        });
+        const scoreExtent = d3.extent(allScores);
+        const scorePadding = (scoreExtent[1] - scoreExtent[0]) * 0.05;
+        xScale
+          .domain([scoreExtent[0] - scorePadding, scoreExtent[1] + scorePadding])
+          .range([complexityBarWidth + complexityGap, innerWidth])
+          .nice();
+        yScale
+          .domain(rules.map(r => r.name))
+          .range([0, innerHeight])
+          .padding(0.3);
+        // Success rate domain: 0 to 1 (will display as 0% to 100%)
+        successColorScale.domain([0, 1]);
+        // Grid lines
+        const xTicks = xScale.ticks(8);
+        gGrid.selectAll('.grid-x')
+          .data(xTicks)
+          .join('line')
+          .attr('class', 'grid-x')
+          .attr('x1', d => xScale(d))
+          .attr('x2', d => xScale(d))
+          .attr('y1', 0)
+          .attr('y2', innerHeight);
+        // X-axis
+        gAxes.selectAll('.x-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'x-axis')
+          .attr('transform', `translate(0,${innerHeight})`)
+          .call(d3.axisBottom(xScale).ticks(8).tickSizeInner(-6).tickSizeOuter(0));
+        // X-axis label
+        gAxes.selectAll('.x-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-label axis-label')
+          .attr('x', (complexityBarWidth + complexityGap + innerWidth) / 2)
+          .attr('y', innerHeight + 40)
+          .attr('text-anchor', 'middle')
+          .text('Score');
+        // Success rate bars
+        gComplexity.selectAll('.complexity-bar')
+          .data(rules, d => d.name)
+          .join('rect')
+          .attr('class', 'complexity-bar')
+          .attr('x', 0)
+          .attr('y', d => yScale(d.name))
+          .attr('width', complexityBarWidth)
+          .attr('height', yScale.bandwidth())
+          .attr('fill', d => successColorScale(d.success_rate))
+          .attr('rx', 2);
+        gComplexity.selectAll('.complexity-text')
+          .data(rules, d => d.name)
+          .join('text')
+          .attr('class', 'complexity-text')
+          .attr('x', complexityBarWidth / 2)
+          .attr('y', d => yScale(d.name) + yScale.bandwidth() / 2)
+          .attr('text-anchor', 'middle')
+          .attr('dominant-baseline', 'central')
+          .style('fill', d => getContrastColor(successColorScale(d.success_rate)))
+          .text(d => Math.round(d.success_rate * 100) + '%');
+        // Rule labels (Y-axis)
+        gLabels.selectAll('.rule-label')
+          .data(rules, d => d.name)
+          .join('text')
+          .attr('class', 'rule-label')
+          .attr('x', -8)
+          .attr('y', d => yScale(d.name) + yScale.bandwidth() / 2)
+          .attr('text-anchor', 'end')
+          .attr('dominant-baseline', 'central')
+          .text(d => formatRuleName(d.name))
+          .on('mouseenter', (event, d) => showRuleTooltip(event, d))
+          .on('mousemove', (event, d) => showRuleTooltip(event, d))
+          .on('mouseleave', hideTooltip);
+        // Data points
+        const pointData = [];
+        rules.forEach(rule => {
+          Object.entries(rule.scores_by_model).forEach(([modelName, scores]) => {
+            scores.forEach((score, seedIdx) => {
+              const color = modelColors[modelName] || '#888888';
+              pointData.push({
+                rule: rule.name,
+                model: modelName,
+                score: score,
+                seed: seedIdx,
+                color: color
+              });
+            });
+          });
+        });
+        const pointRadius = Math.max(3, Math.min(5, yScale.bandwidth() / 4));
+        const jitterStrength = yScale.bandwidth() * 0.3;
+        // Simple hash for consistent jitter
+        const hashStr = (str) => {
+          let hash = 0;
+          for (let i = 0; i < str.length; i++) {
+            hash = ((hash << 5) - hash) + str.charCodeAt(i);
+            hash |= 0;
+          }
+          return hash;
+        };
+        gPoints.selectAll('.point')
+          .data(pointData, d => `${d.rule}-${d.model}-${d.seed}`)
+          .join('circle')
+          .attr('class', d => `point ${activeModels.has(d.model) ? '' : 'dimmed'}`)
+          .attr('cx', d => xScale(d.score))
+          .attr('cy', d => {
+            const baseY = yScale(d.rule) + yScale.bandwidth() / 2;
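+            // hashStr can be negative and JS % keeps the sign, so normalize to 0..99 first
+            const bucket = ((hashStr(d.model + d.seed) % 100) + 100) % 100;
+            const jitter = (bucket / 100 - 0.5) * jitterStrength;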
+            return baseY + jitter;
+          })
+          .attr('r', pointRadius)
+          .attr('fill', d => d.color)
+          .attr('stroke', 'var(--surface-bg)')
+          .attr('stroke-width', 0.5);
+        // Legend
+        const legendX = innerWidth + 15;
+        const legendItemHeight = 16;
+        const modelNames = data.models;
+        const legendItems = gLegend.selectAll('.legend-item')
+          .data(modelNames)
+          .join('g')
+          .attr('class', d => `legend-item ${activeModels.has(d) ? '' : 'inactive'}`)
+          .attr('transform', (d, i) => `translate(${legendX}, ${i * legendItemHeight})`)
+          .style('cursor', 'pointer')
+          .on('click', (event, d) => toggleModel(d));
+        legendItems.selectAll('.legend-dot')
+          .data(d => [d])
+          .join('circle')
+          .attr('class', 'legend-dot')
+          .attr('cx', 5)
+          .attr('cy', 6)
+          .attr('r', 4)
+          .attr('fill', d => modelColors[d] || '#888888');
+        legendItems.selectAll('.legend-text')
+          .data(d => [d])
+          .join('text')
+          .attr('class', 'legend-text')
+          .attr('x', 14)
+          .attr('y', 9)
+          .text(d => d);
+      }
+      // Initialize
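+      // Fail with a readable message on HTTP errors instead of a JSON parse error
+      const fetchJson = (url) => fetch(url, { cache: 'no-cache' }).then(r => {
+        if (!r.ok) throw new Error(`HTTP ${r.status} for ${url}`);
+        return r.json();
+      });
+      Promise.all([
+        fetchJson(DATA_URL),
+        fetchJson(COLORS_URL)
+      ])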
+        .then(([byRuleData, perfData]) => {
+          data = byRuleData;
+          // Build color map from overall_performance.json
+          modelColors = {};
+          perfData.models.forEach(m => {
+            modelColors[m.name] = m.color;
+          });
+          // Initialize all models as active
+          activeModels = new Set(data.models);
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
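
Review note on the point jitter in by-rule.html: the vertical offset is derived from a signed 32-bit string hash, and in JavaScript `%` keeps the sign of the dividend, so the hash has to be normalized before it is mapped to an offset. The following is a minimal sketch under that assumption, not the author's code; `jitterFraction` is a hypothetical helper name that does not appear in the diff, while `hashStr` mirrors the function in the embed.

```javascript
// Same 32-bit string hash as in the embed above.
function hashStr(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0; // truncate to a signed 32-bit integer
  }
  return hash;
}

// Hypothetical helper: map a key to a deterministic fraction in [-0.5, 0.5).
function jitterFraction(key) {
  // (h % 100) lies in -99..99; adding 100 and reducing again gives 0..99.
  const bucket = ((hashStr(key) % 100) + 100) % 100;
  return bucket / 100 - 0.5;
}
```

Multiplying the result by `jitterStrength` keeps every point within half that distance of its row's center line, and the same model and seed always land at the same offset.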

app/src/content/embeds/calibration-curves.html ADDED

	@@ -0,0 +1,537 @@

+<div class="d3-calibration-curves"></div>
+<style>
+  .d3-calibration-curves {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-calibration-curves svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-calibration-curves .axes path,
+  .d3-calibration-curves .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-calibration-curves .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-calibration-curves .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  .d3-calibration-curves .axes text.axis-label {
+    font-size: 14px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-calibration-curves .x-axis text {
+    transform: translateY(4px);
+  }
+  .d3-calibration-curves .calibration-line {
+    fill: none;
+    stroke-width: 1.5;
+  }
+  .d3-calibration-curves .perfect-line {
+    fill: none;
+    stroke: var(--muted-color);
+    stroke-width: 1.5;
+    stroke-dasharray: 8, 6;
+    opacity: 0.6;
+  }
+  .d3-calibration-curves .data-point {
+    cursor: pointer;
+    transition: transform 0.15s ease, opacity 0.15s ease;
+  }
+  .d3-calibration-curves .data-point:hover {
+    opacity: 0.8;
+  }
+  .d3-calibration-curves .legend {
+    font-size: 11px;
+  }
+  .d3-calibration-curves .legend-item {
+    cursor: pointer;
+  }
+  .d3-calibration-curves .legend-item.dimmed .legend-line,
+  .d3-calibration-curves .legend-item.dimmed .legend-marker {
+    opacity: 0.3;
+  }
+  .d3-calibration-curves .legend-item.dimmed text {
+    opacity: 0.4;
+  }
+  .d3-calibration-curves .legend-text {
+    fill: var(--text-color);
+  }
+  .d3-calibration-curves .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.4;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+  }
+  .d3-calibration-curves .d3-tooltip .model-name {
+    font-weight: 600;
+    margin-bottom: 4px;
+  }
+  .d3-calibration-curves .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-calibration-curves .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-calibration-curves .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+</style>
+<script>
+  (() => {
+    // Capture the script element synchronously: document.currentScript is null
+    // by the time bootstrap runs from a load or DOMContentLoaded callback.
+    const scriptEl = document.currentScript;
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-calibration-curves'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-calibration-curves'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups (order matters for layering)
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gPerfect = gRoot.append('g').attr('class', 'perfect');
+      const gLines = gRoot.append('g').attr('class', 'lines');
+      const gPoints = gRoot.append('g').attr('class', 'points');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gLegend = gRoot.append('g').attr('class', 'legend');
+      // State
+      let data = null;
+      let width = 800;
+      let height = 500;
+      const margin = { top: 20, right: 180, bottom: 56, left: 72 };
+      let hiddenModels = new Set();
+      // Scales
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleLinear();
+      // Line generator - convert confidence level to probability (divide by 10)
+      const line = d3.line()
+        .x(d => xScale(d.confidence_level / 10))
+        .y(d => yScale(d.actual_success_rate));
+      // Data loading
+      const DATA_URL = '/data/calibration_curves.json';
+      function updateSize() {
+        width = container.clientWidth || 800;
+        // Calculate inner dimensions, ensuring square plot area
+        const availableWidth = width - margin.left - margin.right;
+        const maxHeight = Math.round(width * 0.8); // Limit max height
+        const innerSize = Math.min(availableWidth, maxHeight - margin.top - margin.bottom);
+        height = innerSize + margin.top + margin.bottom;
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: innerSize,
+          innerHeight: innerSize
+        };
+      }
+      function showTooltip(event, d, model) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        const reportedConfidence = d.confidence_level / 10;
+        tip.innerHTML = `
+          <div class="model-name" style="color: ${model.color}">${model.name}</div>
+          <div class="metric">
+            <span class="metric-label">Reported confidence:</span>
+            <span class="metric-value">${Math.round(reportedConfidence * 100)}%</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Actual success:</span>
+            <span class="metric-value">${(d.actual_success_rate * 100).toFixed(1)}%</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Sample size:</span>
+            <span class="metric-value">${d.sample_count}</span>
+          </div>
+        `;
+        const tipWidth = tip.offsetWidth || 150;
+        const tipHeight = tip.offsetHeight || 100;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function toggleModel(modelName) {
+        if (hiddenModels.has(modelName)) {
+          hiddenModels.delete(modelName);
+        } else {
+          hiddenModels.add(modelName);
+        }
+        render();
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        const models = data.models;
+        // Equal scales for both axes (0-1 probability) to ensure 45° diagonal
+        xScale
+          .domain([0, 1])
+          .range([0, innerWidth]);
+        yScale
+          .domain([0, 1])
+          .range([innerHeight, 0]);
+        // Grid lines - same ticks for both axes
+        const ticks = [0, 0.2, 0.4, 0.6, 0.8, 1.0];
+        const xTicks = ticks;
+        const yTicks = ticks;
+        gGrid.selectAll('.grid-x')
+          .data(xTicks)
+          .join('line')
+          .attr('class', 'grid-x')
+          .attr('x1', d => xScale(d))
+          .attr('x2', d => xScale(d))
+          .attr('y1', 0)
+          .attr('y2', innerHeight);
+        gGrid.selectAll('.grid-y')
+          .data(yTicks)
+          .join('line')
+          .attr('class', 'grid-y')
+          .attr('x1', 0)
+          .attr('x2', innerWidth)
+          .attr('y1', d => yScale(d))
+          .attr('y2', d => yScale(d));
+        // Perfect calibration line (diagonal from 0,0 to 1,1)
+        gPerfect.selectAll('.perfect-line')
+          .data([0])
+          .join('line')
+          .attr('class', 'perfect-line')
+          .attr('x1', xScale(0))
+          .attr('y1', yScale(0))
+          .attr('x2', xScale(1))
+          .attr('y2', yScale(1));
+        // Axes - format as percentages
+        const tickSize = 6;
+        const percentFormat = d => `${Math.round(d * 100)}%`;
+        gAxes.selectAll('.x-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'x-axis')
+          .attr('transform', `translate(0,${innerHeight})`)
+          .call(d3.axisBottom(xScale)
+            .tickValues(xTicks)
+            .tickFormat(percentFormat)
+            .tickSizeInner(-tickSize)
+            .tickSizeOuter(0));
+        gAxes.selectAll('.y-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'y-axis')
+          .call(d3.axisLeft(yScale)
+            .tickValues(yTicks)
+            .tickFormat(percentFormat)
+            .tickSizeInner(-tickSize)
+            .tickSizeOuter(0));
+        // Axis labels
+        gAxes.selectAll('.x-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-label axis-label')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 44)
+          .attr('text-anchor', 'middle')
+          .text('Reported Confidence');
+        gAxes.selectAll('.y-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'y-label axis-label')
+          .attr('x', -innerHeight / 2)
+          .attr('y', -52)
+          .attr('text-anchor', 'middle')
+          .attr('transform', 'rotate(-90)')
+          .text('Actual Success Rate');
+        // Lines for each model
+        const visibleModels = models.filter(m => !hiddenModels.has(m.name));
+        gLines.selectAll('.calibration-line')
+          .data(visibleModels, d => d.name)
+          .join('path')
+          .attr('class', 'calibration-line')
+          .attr('d', d => line(d.calibration_points))
+          .attr('stroke', d => d.color);
+        // Data points - circles for closed models, stars for open models
+        const allPoints = visibleModels.flatMap(model =>
+          model.calibration_points.map(p => ({ ...p, model }))
+        );
+        const closedPoints = allPoints.filter(d => !d.model.is_open);
+        const openPoints = allPoints.filter(d => d.model.is_open);
+        // Helper function to create a 5-point star path
+        const starPath = (cx, cy, outerR, innerR) => {
+          const points = [];
+          for (let i = 0; i < 10; i++) {
+            const r = i % 2 === 0 ? outerR : innerR;
+            const angle = (Math.PI / 2) + (i * Math.PI / 5);
+            points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
+          }
+          return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
+        };
+        // Circles for closed models
+        gPoints.selectAll('.data-point-circle')
+          .data(closedPoints, d => `${d.model.name}-${d.confidence_level}`)
+          .join('circle')
+          .attr('class', 'data-point data-point-circle')
+          .attr('cx', d => xScale(d.confidence_level / 10))
+          .attr('cy', d => yScale(d.actual_success_rate))
+          .attr('r', 4)
+          .attr('fill', d => d.model.color)
+          .attr('stroke', 'var(--surface-bg, white)')
+          .attr('stroke-width', 1)
+          .on('mouseenter', (event, d) => showTooltip(event, d, d.model))
+          .on('mousemove', (event, d) => showTooltip(event, d, d.model))
+          .on('mouseleave', hideTooltip);
+        // Stars for open models
+        gPoints.selectAll('.data-point-star')
+          .data(openPoints, d => `${d.model.name}-${d.confidence_level}`)
+          .join('path')
+          .attr('class', 'data-point data-point-star')
+          .attr('d', d => starPath(
+            xScale(d.confidence_level / 10),
+            yScale(d.actual_success_rate),
+            6, 2.6
+          ))
+          .attr('fill', d => d.model.color)
+          .attr('stroke', 'var(--surface-bg, white)')
+          .attr('stroke-width', 0.8)
+          .on('mouseenter', (event, d) => showTooltip(event, d, d.model))
+          .on('mousemove', (event, d) => showTooltip(event, d, d.model))
+          .on('mouseleave', hideTooltip);
+        // Legend
+        const legendX = innerWidth + 16;
+        const legendItemHeight = 20;
+        // Perfect calibration in legend
+        const legendItems = [
+          { name: 'Perfect calibration', color: 'var(--muted-color)', isPerfect: true }
+        ].concat(models);
+        gLegend.selectAll('.legend-item')
+          .data(legendItems, d => d.name)
+          .join('g')
+          .attr('class', d => {
+            if (d.isPerfect) return 'legend-item';
+            return `legend-item ${hiddenModels.has(d.name) ? 'dimmed' : ''}`;
+          })
+          .attr('transform', (d, i) => `translate(${legendX}, ${i * legendItemHeight})`)
+          .each(function(d) {
+            const g = d3.select(this);
+            g.selectAll('*').remove();
+            if (d.isPerfect) {
+              // Dashed line for perfect calibration
+              g.append('line')
+                .attr('class', 'legend-line')
+                .attr('x1', 0)
+                .attr('x2', 20)
+                .attr('y1', 0)
+                .attr('y2', 0)
+                .attr('stroke', d.color)
+                .attr('stroke-width', 1.5)
+                .attr('stroke-dasharray', '6, 4')
+                .attr('opacity', 0.6);
+            } else {
+              // Line segment (solid for all models)
+              g.append('line')
+                .attr('class', 'legend-line')
+                .attr('x1', 0)
+                .attr('x2', 20)
+                .attr('y1', 0)
+                .attr('y2', 0)
+                .attr('stroke', d.color)
+                .attr('stroke-width', 1.5);
+              // Marker - circle for closed, star for open
+              if (d.is_open) {
+                // Small star for open models (reuses the starPath helper defined earlier in render)
+                g.append('path')
+                  .attr('class', 'legend-marker')
+                  .attr('d', starPath(10, 0, 6, 2.6))
+                  .attr('fill', d.color);
+              } else {
+                g.append('circle')
+                  .attr('class', 'legend-marker')
+                  .attr('cx', 10)
+                  .attr('cy', 0)
+                  .attr('r', 3.5)
+                  .attr('fill', d.color);
+              }
+            }
+            g.append('text')
+              .attr('class', 'legend-text')
+              .attr('x', 26)
+              .attr('y', 4)
+              .text(d.name);
+            if (!d.isPerfect) {
+              g.style('cursor', 'pointer')
+                .on('click', () => toggleModel(d.name));
+            }
+          });
+        // Legend note about marker shapes
+        const noteY = legendItems.length * legendItemHeight + 12;
+        gLegend.selectAll('.legend-note')
+          .data([0])
+          .join('text')
+          .attr('class', 'legend-note')
+          .attr('x', legendX)
+          .attr('y', noteY)
+          .attr('font-size', '10px')
+          .attr('fill', 'var(--muted-color)')
+          .text('● = Closed, ★ = Open');
+      }
+      // Initialize
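+      fetch(DATA_URL, { cache: 'no-cache' })
+        .then(r => {
+          // Fail with a readable message on HTTP errors instead of a JSON parse error
+          if (!r.ok) throw new Error(`HTTP ${r.status} for ${DATA_URL}`);
+          return r.json();
+        })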
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
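
Review note on the data shape calibration-curves.html expects: the diff does not include `calibration_curves.json`, but the field names can be read off the rendering code. The sketch below shows that inferred shape and how each point maps to plot coordinates. The field names come from the embed; every value, the model name, and the `toPlotPoints` helper are hypothetical.

```javascript
// Hypothetical sample: field names are those read by the embed, values are made up.
const sample = {
  models: [{
    name: 'example-model',
    color: '#1f77b4',
    is_open: true, // open models are drawn as stars, closed ones as circles
    calibration_points: [
      { confidence_level: 5, actual_success_rate: 0.4, sample_count: 12 },
      { confidence_level: 9, actual_success_rate: 0.8, sample_count: 30 }
    ]
  }]
};

// Hypothetical helper: convert each point to the probabilities the chart plots.
function toPlotPoints(model) {
  return model.calibration_points.map(p => {
    const x = p.confidence_level / 10; // reported confidence on a 0..1 scale
    const y = p.actual_success_rate;   // observed success rate on a 0..1 scale
    return { x, y, gap: y - x };       // gap < 0 means the point lies below the diagonal
  });
}
```

A point below the dashed diagonal (negative `gap`) means the model reported more confidence than its success rate supports.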

app/src/content/embeds/caution-vs-failed-guesses.html ADDED

	@@ -0,0 +1,369 @@

+<div class="d3-caution-vs-failed-guesses"></div>
+<style>
+  .d3-caution-vs-failed-guesses {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-caution-vs-failed-guesses svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-caution-vs-failed-guesses .axes path,
+  .d3-caution-vs-failed-guesses .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-caution-vs-failed-guesses .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-caution-vs-failed-guesses .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  .d3-caution-vs-failed-guesses .axes text.axis-label {
+    font-size: 15px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-caution-vs-failed-guesses .x-axis text {
+    transform: translateY(4px);
+  }
+  .d3-caution-vs-failed-guesses .point {
+    cursor: pointer;
+    transition: opacity 0.15s ease;
+  }
+  .d3-caution-vs-failed-guesses .point:hover {
+    opacity: 0.8;
+  }
+  .d3-caution-vs-failed-guesses .point-label {
+    font-size: 11px;
+    fill: var(--text-color);
+    pointer-events: none;
+  }
+  .d3-caution-vs-failed-guesses .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.4;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+  }
+  .d3-caution-vs-failed-guesses .d3-tooltip .model-name {
+    font-weight: 600;
+    margin-bottom: 4px;
+  }
+  .d3-caution-vs-failed-guesses .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-caution-vs-failed-guesses .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-caution-vs-failed-guesses .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-caution-vs-failed-guesses'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-caution-vs-failed-guesses'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gPoints = gRoot.append('g').attr('class', 'points');
+      const gLabels = gRoot.append('g').attr('class', 'labels');
+      // State
+      let data = null;
+      let width = 800;
+      let height = 450;
+      const margin = { top: 20, right: 120, bottom: 56, left: 72 };
+      // Scales
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleLinear();
+      // Data loading
+      const DATA_URL = '/data/caution_vs_failed_guesses.json';
+      function updateSize() {
+        width = container.clientWidth || 800;
+        height = Math.max(300, Math.round(width / 1.5));
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function showTooltip(event, d) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        tip.innerHTML = `
+          <div class="model-name" style="color: ${d.color}">${d.name}</div>
+          <div class="metric">
+            <span class="metric-label">Early Correct Turns:</span>
+            <span class="metric-value">${d.avg_early_correct_turns.toFixed(2)}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Failed Guesses:</span>
+            <span class="metric-value">${d.avg_failed_guesses.toFixed(2)}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Type:</span>
+            <span class="metric-value">${d.is_open ? 'Open' : 'Closed'}</span>
+          </div>
+        `;
+        const tipWidth = tip.offsetWidth || 150;
+        const tipHeight = tip.offsetHeight || 80;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        const models = data.models;
+        // Update scales - both axes start at 0
+        const xExtent = d3.extent(models, d => d.avg_failed_guesses);
+        const yExtent = d3.extent(models, d => d.avg_early_correct_turns);
+        const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
+        const yPadding = (yExtent[1] - yExtent[0]) * 0.1;
+        xScale
+          .domain([0, xExtent[1] + xPadding])
+          .range([0, innerWidth])
+          .nice();
+        yScale
+          .domain([0, yExtent[1] + yPadding])
+          .range([innerHeight, 0])
+          .nice();
+        // Grid lines
+        const xTicks = xScale.ticks(6);
+        const yTicks = yScale.ticks(6);
+        gGrid.selectAll('.grid-x')
+          .data(xTicks)
+          .join('line')
+          .attr('class', 'grid-x')
+          .attr('x1', d => xScale(d))
+          .attr('x2', d => xScale(d))
+          .attr('y1', 0)
+          .attr('y2', innerHeight);
+        gGrid.selectAll('.grid-y')
+          .data(yTicks)
+          .join('line')
+          .attr('class', 'grid-y')
+          .attr('x1', 0)
+          .attr('x2', innerWidth)
+          .attr('y1', d => yScale(d))
+          .attr('y2', d => yScale(d));
+        // Axes with inner ticks
+        const tickSize = 6;
+        gAxes.selectAll('.x-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'x-axis')
+          .attr('transform', `translate(0,${innerHeight})`)
+          .call(d3.axisBottom(xScale).ticks(6).tickSizeInner(-tickSize).tickSizeOuter(0));
+        gAxes.selectAll('.y-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'y-axis')
+          .call(d3.axisLeft(yScale).ticks(6).tickSizeInner(-tickSize).tickSizeOuter(0));
+        // Axis labels
+        gAxes.selectAll('.x-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-label axis-label')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 44)
+          .attr('text-anchor', 'middle')
+          .text('Average Failed Guesses per Round');
+        gAxes.selectAll('.y-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'y-label axis-label')
+          .attr('x', -innerHeight / 2)
+          .attr('y', -52)
+          .attr('text-anchor', 'middle')
+          .attr('transform', 'rotate(-90)')
+          .text('Average Early Correct Turns');
+        // Points - circles for closed models, stars for open models
+        const pointRadius = Math.max(8, Math.min(16, innerWidth / 60));
+        // Helper function to create a 5-point star path
+        const starPath = (cx, cy, outerR, innerR) => {
+          const points = [];
+          for (let i = 0; i < 10; i++) {
+            const r = i % 2 === 0 ? outerR : innerR;
+            const angle = (Math.PI / 2) + (i * Math.PI / 5);
+            points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
+          }
+          return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
+        };
+        // Closed models as circles
+        const closedModels = models.filter(d => !d.is_open);
+        gPoints.selectAll('.point-circle')
+          .data(closedModels, d => d.name)
+          .join('circle')
+          .attr('class', 'point point-circle')
+          .attr('cx', d => xScale(d.avg_failed_guesses))
+          .attr('cy', d => yScale(d.avg_early_correct_turns))
+          .attr('r', pointRadius)
+          .attr('fill', d => d.color)
+          .attr('stroke', 'none')
+          .on('mouseenter', showTooltip)
+          .on('mousemove', showTooltip)
+          .on('mouseleave', hideTooltip);
+        // Open models as stars
+        const openModels = models.filter(d => d.is_open);
+        gPoints.selectAll('.point-star')
+          .data(openModels, d => d.name)
+          .join('path')
+          .attr('class', 'point point-star')
+          .attr('d', d => starPath(xScale(d.avg_failed_guesses), yScale(d.avg_early_correct_turns), pointRadius * 1.2, pointRadius * 0.5))
+          .attr('fill', d => d.color)
+          .attr('stroke', 'none')
+          .on('mouseenter', showTooltip)
+          .on('mousemove', showTooltip)
+          .on('mouseleave', hideTooltip);
+        // Point labels
+        gLabels.selectAll('.point-label')
+          .data(models)
+          .join('text')
+          .attr('class', 'point-label')
+          .attr('x', d => xScale(d.avg_failed_guesses) + pointRadius + 6)
+          .attr('y', d => yScale(d.avg_early_correct_turns) + 4)
+          .text(d => d.name);
+      }
+      // Initialize
+      fetch(DATA_URL, { cache: 'no-cache' })
+        .then(r => { if (!r.ok) throw new Error(`HTTP ${r.status}`); return r.json(); })
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
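
Two pieces of the embed above are plain arithmetic and can be checked outside the browser: the star marker geometry (`starPath`) and the tooltip clamping inside `showTooltip`. The sketch below copies that arithmetic into standalone functions. `placeTooltip` is a hypothetical name introduced here for illustration; it does not exist in the commit, and the sample sizes in the usage lines are assumptions.

```javascript
// Same construction as the embed's starPath: 10 vertices alternating between
// the outer and inner radius, starting at the top (angle PI/2), stepping PI/5.
function starPath(cx, cy, outerR, innerR) {
  const points = [];
  for (let i = 0; i < 10; i++) {
    const r = i % 2 === 0 ? outerR : innerR;
    const angle = Math.PI / 2 + (i * Math.PI) / 5;
    // SVG's y axis points down, so the sine term is subtracted.
    points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
  }
  return 'M' + points.map((p) => p.join(',')).join('L') + 'Z';
}

// Hypothetical helper mirroring the clamping in showTooltip:
// prefer right of the cursor, flip left on overflow, keep an 8px vertical inset.
function placeTooltip(x, y, tipWidth, tipHeight, width, height) {
  let tipX = x + 12;
  let tipY = y - tipHeight / 2;
  if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
  if (tipY < 0) tipY = 8;
  if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
  return { tipX, tipY };
}

// Cursor near the top-right corner of an assumed 800x533 chart:
console.log(placeTooltip(750, 10, 150, 80, 800, 533)); // { tipX: 588, tipY: 8 }
// A star of outer radius 10 and inner radius 4 has 10 vertices:
console.log(starPath(0, 0, 10, 4).split('L').length); // 10
```

Keeping these as pure functions would make them easy to unit test, at the cost of one more indirection in the embed; the commit inlines them, which is also reasonable for a self-contained HTML fragment.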

app/src/content/embeds/complexity-analysis.html ADDED

	@@ -0,0 +1,492 @@

+<div class="d3-complexity-analysis"></div>
+<style>
+  .d3-complexity-analysis {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-complexity-analysis svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-complexity-analysis .axes path,
+  .d3-complexity-analysis .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-complexity-analysis .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-complexity-analysis .axes text.axis-label {
+    font-size: 14px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-complexity-analysis .axes text.chart-title {
+    font-size: 16px;
+    font-weight: 600;
+    fill: var(--text-color);
+  }
+  .d3-complexity-analysis .cell {
+    stroke: var(--surface-bg, #fff);
+    stroke-width: 2;
+    cursor: pointer;
+    transition: opacity 0.1s ease;
+  }
+  .d3-complexity-analysis .cell:hover {
+    opacity: 0.85;
+  }
+  .d3-complexity-analysis .cell-text {
+    font-size: 13px;
+    font-weight: 600;
+    pointer-events: none;
+  }
+  .d3-complexity-analysis .model-label {
+    font-size: 12px;
+    fill: var(--text-color);
+  }
+  .d3-complexity-analysis .quartile-label {
+    font-size: 12px;
+    fill: var(--text-color);
+  }
+  .d3-complexity-analysis .legend-title {
+    font-size: 11px;
+    fill: var(--muted-color);
+  }
+  .d3-complexity-analysis .legend-tick {
+    font-size: 10px;
+    fill: var(--muted-color);
+  }
+  .d3-complexity-analysis .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.5;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+    max-width: 280px;
+  }
+  .d3-complexity-analysis .d3-tooltip .model-name {
+    font-weight: 600;
+    margin-bottom: 4px;
+  }
+  .d3-complexity-analysis .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-complexity-analysis .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-complexity-analysis .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+  .d3-complexity-analysis .d3-tooltip .interpretation {
+    margin-top: 6px;
+    font-size: 11px;
+    color: var(--muted-color);
+    font-style: italic;
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-complexity-analysis'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-complexity-analysis'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gCells = gRoot.append('g').attr('class', 'cells');
+      const gLegend = gRoot.append('g').attr('class', 'legend');
+      // State
+      let data = null;
+      let width = 700;
+      let height = 450;
+      const margin = { top: 60, right: 100, bottom: 60, left: 160 };
+      // Scales
+      const xScale = d3.scaleBand();
+      const yScale = d3.scaleBand();
+      // Sequential color scale: red (0%) -> green (highest observed score).
+      // The domain is set in render() once the data has loaded.
+      const colorScale = d3.scaleSequential(d3.interpolateRdYlGn);
+      const DATA_URL = '/data/complexity_analysis.json';
+      function updateSize() {
+        width = Math.min(container.clientWidth || 700, 800);
+        const numModels = data ? data.models.length : 10;
+        const cellHeight = 36;
+        height = margin.top + margin.bottom + numModels * cellHeight;
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function getContrastColor(hexColor) {
+        const hex = hexColor.replace('#', '');
+        const r = parseInt(hex.slice(0, 2), 16) / 255;
+        const g = parseInt(hex.slice(2, 4), 16) / 255;
+        const b = parseInt(hex.slice(4, 6), 16) / 255;
+        const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
+        return luminance > 0.5 ? '#000000' : '#ffffff';
+      }
+      function rgbToHex(rgb) {
+        // Convert rgb(r, g, b) string to #rrggbb
+        const match = rgb.match(/rgb\((\d+),\s*(\d+),\s*(\d+)\)/);
+        if (!match) return rgb;
+        const r = parseInt(match[1], 10).toString(16).padStart(2, '0');
+        const g = parseInt(match[2], 10).toString(16).padStart(2, '0');
+        const b = parseInt(match[3], 10).toString(16).padStart(2, '0');
+        return `#${r}${g}${b}`;
+      }
+      function showTooltip(event, d) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        const pct = d.score * 100;
+        const interpretation = pct > 100
+          ? `Performs ${(pct - 100).toFixed(0)}% above average on ${d.quartile} rules`
+          : pct < 100
+            ? `Performs ${(100 - pct).toFixed(0)}% below average on ${d.quartile} rules`
+            : 'Performs at average on these rules';
+        const quartileDesc = {
+          'Q1': 'Easiest (lowest complexity)',
+          'Q2': 'Easy-Medium',
+          'Q3': 'Medium-Hard',
+          'Q4': 'Hardest (highest complexity)'
+        };
+        tip.innerHTML = `
+          <div class="model-name">${d.model}</div>
+          <div class="metric">
+            <span class="metric-label">Quartile:</span>
+            <span class="metric-value">${d.quartile}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Difficulty:</span>
+            <span class="metric-value">${quartileDesc[d.quartile]}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Relative Score:</span>
+            <span class="metric-value">${pct.toFixed(0)}%</span>
+          </div>
+          <div class="interpretation">${interpretation}</div>
+        `;
+        const tipWidth = tip.offsetWidth || 200;
+        const tipHeight = tip.offsetHeight || 120;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        const quartiles = data.quartiles;
+        const models = data.models;
+        // Update scales
+        xScale
+          .domain(quartiles)
+          .range([0, innerWidth])
+          .padding(0.08);
+        yScale
+          .domain(models.map(m => m.name))
+          .range([0, innerHeight])
+          .padding(0.08);
+        // Find score extent for color scale (in percentage: 0-100%+)
+        const allScores = [];
+        models.forEach(m => {
+          quartiles.forEach(q => {
+            allScores.push(m.quartile_scores[q] * 100);
+          });
+        });
+        const minPct = Math.min(...allScores);
+        const maxPct = Math.max(...allScores);
+        // Color domain runs from 0% (red) to the highest observed score (green)
+        colorScale.domain([0, maxPct]);
+        // Build cell data (with percentage values)
+        const cellData = [];
+        models.forEach(m => {
+          quartiles.forEach(q => {
+            cellData.push({
+              model: m.name,
+              quartile: q,
+              score: m.quartile_scores[q],
+              pct: m.quartile_scores[q] * 100
+            });
+          });
+        });
+        // Draw cells
+        gCells.selectAll('.cell')
+          .data(cellData, d => `${d.model}-${d.quartile}`)
+          .join('rect')
+          .attr('class', 'cell')
+          .attr('x', d => xScale(d.quartile))
+          .attr('y', d => yScale(d.model))
+          .attr('width', xScale.bandwidth())
+          .attr('height', yScale.bandwidth())
+          .attr('fill', d => colorScale(d.pct))
+          .attr('rx', 4)
+          .on('mouseenter', showTooltip)
+          .on('mousemove', showTooltip)
+          .on('mouseleave', hideTooltip);
+        // Draw cell text
+        gCells.selectAll('.cell-text')
+          .data(cellData, d => `${d.model}-${d.quartile}`)
+          .join('text')
+          .attr('class', 'cell-text')
+          .attr('x', d => xScale(d.quartile) + xScale.bandwidth() / 2)
+          .attr('y', d => yScale(d.model) + yScale.bandwidth() / 2)
+          .attr('text-anchor', 'middle')
+          .attr('dominant-baseline', 'central')
+          .style('fill', d => {
+            const bgColor = colorScale(d.pct);
+            const hex = bgColor.startsWith('rgb') ? rgbToHex(bgColor) : bgColor;
+            return getContrastColor(hex);
+          })
+          .text(d => `${d.pct.toFixed(0)}%`);
+        // Model labels (Y-axis)
+        gAxes.selectAll('.model-label')
+          .data(models, d => d.name)
+          .join('text')
+          .attr('class', 'model-label')
+          .attr('x', -10)
+          .attr('y', d => yScale(d.name) + yScale.bandwidth() / 2)
+          .attr('text-anchor', 'end')
+          .attr('dominant-baseline', 'central')
+          .text(d => d.name);
+        // Quartile labels (X-axis)
+        gAxes.selectAll('.quartile-label')
+          .data(quartiles)
+          .join('text')
+          .attr('class', 'quartile-label')
+          .attr('x', d => xScale(d) + xScale.bandwidth() / 2)
+          .attr('y', -10)
+          .attr('text-anchor', 'middle')
+          .text(d => d);
+        // X-axis title
+        gAxes.selectAll('.x-title')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-title axis-label')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 40)
+          .attr('text-anchor', 'middle')
+          .text('Complexity Quartile (Q1 = easiest)');
+        // Chart title
+        gAxes.selectAll('.chart-title')
+          .data([0])
+          .join('text')
+          .attr('class', 'chart-title')
+          .attr('x', innerWidth / 2)
+          .attr('y', -35)
+          .attr('text-anchor', 'middle')
+          .text('Model Performance by Rule Complexity');
+        // Legend
+        const legendWidth = 20;
+        const legendHeight = innerHeight * 0.6;
+        const legendX = innerWidth + 30;
+        const legendY = (innerHeight - legendHeight) / 2;
+        // Create gradient
+        const gradientId = 'complexity-legend-gradient';
+        let defs = svg.select('defs');
+        if (defs.empty()) {
+          defs = svg.append('defs');
+        }
+        defs.selectAll(`#${gradientId}`).remove();
+        const gradient = defs.append('linearGradient')
+          .attr('id', gradientId)
+          .attr('x1', '0%')
+          .attr('x2', '0%')
+          .attr('y1', '100%')
+          .attr('y2', '0%');
+        const numStops = 11;
+        for (let i = 0; i <= numStops; i++) {
+          const t = i / numStops;
+          const value = t * maxPct;
+          gradient.append('stop')
+            .attr('offset', `${t * 100}%`)
+            .attr('stop-color', colorScale(value));
+        }
+        // Legend rectangle
+        gLegend.selectAll('.legend-rect')
+          .data([0])
+          .join('rect')
+          .attr('class', 'legend-rect')
+          .attr('x', legendX)
+          .attr('y', legendY)
+          .attr('width', legendWidth)
+          .attr('height', legendHeight)
+          .attr('fill', `url(#${gradientId})`)
+          .attr('rx', 2)
+          .attr('stroke', 'var(--border-color)')
+          .attr('stroke-width', 0.5);
+        // Legend ticks (in percentage)
+        const legendScale = d3.scaleLinear()
+          .domain([0, maxPct])
+          .range([legendY + legendHeight, legendY]);
+        // Tick values for the percentage scale; add a top tick only when it is distinct from 100%
+        const tickValues = [0, 50, 100];
+        if (maxPct >= 110) tickValues.push(Math.floor(maxPct / 10) * 10);
+        gLegend.selectAll('.legend-tick')
+          .data(tickValues.filter(v => v <= maxPct))
+          .join('text')
+          .attr('class', 'legend-tick')
+          .attr('x', legendX + legendWidth + 6)
+          .attr('y', d => legendScale(d))
+          .attr('dominant-baseline', 'middle')
+          .text(d => `${d}%`);
+        // Legend title
+        gLegend.selectAll('.legend-title')
+          .data([0])
+          .join('text')
+          .attr('class', 'legend-title')
+          .attr('x', legendX + legendWidth / 2)
+          .attr('y', legendY - 12)
+          .attr('text-anchor', 'middle')
+          .text('Relative Score');
+      }
+      // Initialize
+      fetch(DATA_URL, { cache: 'no-cache' })
+        .then(r => { if (!r.ok) throw new Error(`HTTP ${r.status}`); return r.json(); })
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
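
The heatmap above picks black or white cell text from the background color in two steps: `rgbToHex` normalizes the `rgb(r, g, b)` string that the color scale returns, and `getContrastColor` thresholds a weighted luminance at 0.5. The weights 0.299, 0.587 and 0.114 are the Rec. 601 luma coefficients. A minimal standalone sketch of the same logic, runnable without D3 or a DOM; the sample colors are illustrative inputs, not values taken from the benchmark data:

```javascript
// Convert an "rgb(r, g, b)" string to "#rrggbb"; any other input is returned unchanged.
function rgbToHex(rgb) {
  const match = rgb.match(/rgb\((\d+),\s*(\d+),\s*(\d+)\)/);
  if (!match) return rgb;
  return '#' + match
    .slice(1, 4)
    .map((c) => parseInt(c, 10).toString(16).padStart(2, '0'))
    .join('');
}

// Return black text for light backgrounds and white text for dark ones.
function getContrastColor(hexColor) {
  const hex = hexColor.replace('#', '');
  const r = parseInt(hex.slice(0, 2), 16) / 255;
  const g = parseInt(hex.slice(2, 4), 16) / 255;
  const b = parseInt(hex.slice(4, 6), 16) / 255;
  // Rec. 601 luma: green contributes most to perceived brightness.
  const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
  return luminance > 0.5 ? '#000000' : '#ffffff';
}

console.log(rgbToHex('rgb(165, 0, 38)'));                     // #a50026
console.log(getContrastColor(rgbToHex('rgb(165, 0, 38)')));   // #ffffff (dark red background)
console.log(getContrastColor(rgbToHex('rgb(255, 255, 191)'))); // #000000 (pale yellow background)
```

This is a simple threshold rather than a WCAG contrast-ratio calculation, which is likely sufficient for a diverging palette whose midpoint is light and whose ends are dark.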

app/src/content/embeds/confidence-distribution.html ADDED

	@@ -0,0 +1,495 @@

+<div class="d3-confidence-distribution"></div>
+<style>
+  .d3-confidence-distribution {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-confidence-distribution svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-confidence-distribution .axes path,
+  .d3-confidence-distribution .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-confidence-distribution .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-confidence-distribution .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  .d3-confidence-distribution .axes text.axis-label {
+    font-size: 14px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-confidence-distribution .x-axis text {
+    transform: translateY(4px);
+  }
+  .d3-confidence-distribution .distribution-line {
+    fill: none;
+    stroke-width: 1.5;
+  }
+  .d3-confidence-distribution .data-point {
+    cursor: pointer;
+    transition: opacity 0.15s ease;
+  }
+  .d3-confidence-distribution .data-point:hover {
+    opacity: 0.8;
+  }
+  .d3-confidence-distribution .legend {
+    font-size: 11px;
+  }
+  .d3-confidence-distribution .legend-item {
+    cursor: pointer;
+  }
+  .d3-confidence-distribution .legend-item.dimmed .legend-line,
+  .d3-confidence-distribution .legend-item.dimmed .legend-marker {
+    opacity: 0.3;
+  }
+  .d3-confidence-distribution .legend-item.dimmed text {
+    opacity: 0.4;
+  }
+  .d3-confidence-distribution .legend-text {
+    fill: var(--text-color);
+  }
+  .d3-confidence-distribution .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.4;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+  }
+  .d3-confidence-distribution .d3-tooltip .model-name {
+    font-weight: 600;
+    margin-bottom: 4px;
+  }
+  .d3-confidence-distribution .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-confidence-distribution .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-confidence-distribution .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-confidence-distribution'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-confidence-distribution'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups (order matters for layering)
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gLines = gRoot.append('g').attr('class', 'lines');
+      const gPoints = gRoot.append('g').attr('class', 'points');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gLegend = gRoot.append('g').attr('class', 'legend');
+      // State
+      let data = null;
+      let width = 800;
+      let height = 500;
+      const margin = { top: 20, right: 180, bottom: 56, left: 72 };
+      let hiddenModels = new Set();
+      // Scales
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleLinear();
+      // Line generator
+      const line = d3.line()
+        .x(d => xScale(d.confidence_level))
+        .y(d => yScale(d.proportion));
+      // Data loading
+      const DATA_URL = '/data/confidence_distribution.json';
+      function updateSize() {
+        width = container.clientWidth || 800;
+        const availableWidth = width - margin.left - margin.right;
+        const maxHeight = Math.round(width * 0.7);
+        const innerSize = Math.min(availableWidth, maxHeight - margin.top - margin.bottom);
+        height = innerSize + margin.top + margin.bottom;
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function showTooltip(event, d, model) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        tip.innerHTML = `
+          <div class="model-name" style="color: ${model.color}">${model.name}</div>
+          <div class="metric">
+            <span class="metric-label">Confidence level:</span>
+            <span class="metric-value">${d.confidence_level * 10}%</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Proportion:</span>
+            <span class="metric-value">${(d.proportion * 100).toFixed(1)}%</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Count:</span>
+            <span class="metric-value">${d.count} / ${model.total_guesses}</span>
+          </div>
+        `;
+        const tipWidth = tip.offsetWidth || 150;
+        const tipHeight = tip.offsetHeight || 100;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function toggleModel(modelName) {
+        if (hiddenModels.has(modelName)) {
+          hiddenModels.delete(modelName);
+        } else {
+          hiddenModels.add(modelName);
+        }
+        render();
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        const models = data.models;
+        const visibleModels = models.filter(m => !hiddenModels.has(m.name));
+        // X scale: confidence levels 5-10
+        xScale
+          .domain([5, 10])
+          .range([0, innerWidth]);
+        // Y scale: proportion (0 to max + padding)
+        const maxProportion = d3.max(visibleModels, m =>
+          d3.max(m.distribution, d => d.proportion)
+        ) || 0.8;
+        yScale
+          .domain([0, Math.min(1, maxProportion * 1.1)])
+          .range([innerHeight, 0])
+          .nice();
+        // Grid lines
+        const xTicks = [5, 6, 7, 8, 9, 10];
+        const yTicks = yScale.ticks(6);
+        gGrid.selectAll('.grid-x')
+          .data(xTicks)
+          .join('line')
+          .attr('class', 'grid-x')
+          .attr('x1', d => xScale(d))
+          .attr('x2', d => xScale(d))
+          .attr('y1', 0)
+          .attr('y2', innerHeight);
+        gGrid.selectAll('.grid-y')
+          .data(yTicks)
+          .join('line')
+          .attr('class', 'grid-y')
+          .attr('x1', 0)
+          .attr('x2', innerWidth)
+          .attr('y1', d => yScale(d))
+          .attr('y2', d => yScale(d));
+        // Axes
+        const tickSize = 6;
+        const percentFormat = d => `${Math.round(d * 100)}%`;
+        gAxes.selectAll('.x-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'x-axis')
+          .attr('transform', `translate(0,${innerHeight})`)
+          .call(d3.axisBottom(xScale)
+            .tickValues(xTicks)
+            .tickFormat(d => d)
+            .tickSizeInner(-tickSize)
+            .tickSizeOuter(0));
+        gAxes.selectAll('.y-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'y-axis')
+          .call(d3.axisLeft(yScale)
+            .ticks(6)
+            .tickFormat(percentFormat)
+            .tickSizeInner(-tickSize)
+            .tickSizeOuter(0));
+        // Axis labels
+        gAxes.selectAll('.x-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-label axis-label')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 44)
+          .attr('text-anchor', 'middle')
+          .text('Confidence Level');
+        gAxes.selectAll('.y-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'y-label axis-label')
+          .attr('x', -innerHeight / 2)
+          .attr('y', -52)
+          .attr('text-anchor', 'middle')
+          .attr('transform', 'rotate(-90)')
+          .text('Proportion of Guesses');
+        // Lines for each model
+        gLines.selectAll('.distribution-line')
+          .data(visibleModels, d => d.name)
+          .join('path')
+          .attr('class', 'distribution-line')
+          .attr('d', d => line(d.distribution))
+          .attr('stroke', d => d.color);
+        // Data points - circles for closed models, stars for open models
+        const allPoints = visibleModels.flatMap(model =>
+          model.distribution.map(p => ({ ...p, model }))
+        );
+        const closedPoints = allPoints.filter(d => !d.model.is_open);
+        const openPoints = allPoints.filter(d => d.model.is_open);
+        // Helper function to create a 5-point star path
+        const starPath = (cx, cy, outerR, innerR) => {
+          const points = [];
+          for (let i = 0; i < 10; i++) {
+            const r = i % 2 === 0 ? outerR : innerR;
+            const angle = (Math.PI / 2) + (i * Math.PI / 5);
+            points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
+          }
+          return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
+        };
+        // Circles for closed models
+        gPoints.selectAll('.data-point-circle')
+          .data(closedPoints, d => `${d.model.name}-${d.confidence_level}`)
+          .join('circle')
+          .attr('class', 'data-point data-point-circle')
+          .attr('cx', d => xScale(d.confidence_level))
+          .attr('cy', d => yScale(d.proportion))
+          .attr('r', 4)
+          .attr('fill', d => d.model.color)
+          .attr('stroke', 'var(--surface-bg, white)')
+          .attr('stroke-width', 1)
+          .on('mouseenter', (event, d) => showTooltip(event, d, d.model))
+          .on('mousemove', (event, d) => showTooltip(event, d, d.model))
+          .on('mouseleave', hideTooltip);
+        // Stars for open models
+        gPoints.selectAll('.data-point-star')
+          .data(openPoints, d => `${d.model.name}-${d.confidence_level}`)
+          .join('path')
+          .attr('class', 'data-point data-point-star')
+          .attr('d', d => starPath(
+            xScale(d.confidence_level),
+            yScale(d.proportion),
+            6, 2.6
+          ))
+          .attr('fill', d => d.model.color)
+          .attr('stroke', 'var(--surface-bg, white)')
+          .attr('stroke-width', 0.8)
+          .on('mouseenter', (event, d) => showTooltip(event, d, d.model))
+          .on('mousemove', (event, d) => showTooltip(event, d, d.model))
+          .on('mouseleave', hideTooltip);
+        // Legend
+        const legendX = innerWidth + 16;
+        const legendItemHeight = 20;
+        // Legend markers reuse the star helper defined above
+        const legendStarPath = starPath;
+        gLegend.selectAll('.legend-item')
+          .data(models, d => d.name)
+          .join('g')
+          .attr('class', d => `legend-item ${hiddenModels.has(d.name) ? 'dimmed' : ''}`)
+          .attr('transform', (d, i) => `translate(${legendX}, ${i * legendItemHeight})`)
+          .each(function(d) {
+            const g = d3.select(this);
+            g.selectAll('*').remove();
+            // Line segment (solid for all models)
+            g.append('line')
+              .attr('class', 'legend-line')
+              .attr('x1', 0)
+              .attr('x2', 20)
+              .attr('y1', 0)
+              .attr('y2', 0)
+              .attr('stroke', d.color)
+              .attr('stroke-width', 1.5);
+            // Marker - circle for closed, star for open
+            if (d.is_open) {
+              g.append('path')
+                .attr('class', 'legend-marker')
+                .attr('d', legendStarPath(10, 0, 6, 2.6))
+                .attr('fill', d.color);
+            } else {
+              g.append('circle')
+                .attr('class', 'legend-marker')
+                .attr('cx', 10)
+                .attr('cy', 0)
+                .attr('r', 3.5)
+                .attr('fill', d.color);
+            }
+            g.append('text')
+              .attr('class', 'legend-text')
+              .attr('x', 26)
+              .attr('y', 4)
+              .text(d.name);
+            g.style('cursor', 'pointer')
+              .on('click', () => toggleModel(d.name));
+          });
+        // Legend note
+        const noteY = models.length * legendItemHeight + 12;
+        gLegend.selectAll('.legend-note')
+          .data([0])
+          .join('text')
+          .attr('class', 'legend-note')
+          .attr('x', legendX)
+          .attr('y', noteY)
+          .attr('font-size', '10px')
+          .attr('fill', 'var(--muted-color)')
+          .text('● = Closed, ★ = Open');
+      }
+      // Initialize
+      fetch(DATA_URL, { cache: 'no-cache' })
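+        .then(r => {
+          // Fail fast on HTTP errors so the message shown below is meaningful
+          if (!r.ok) throw new Error(`HTTP ${r.status}`);
+          return r.json();
+        })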
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
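
Review note: the tooltip positioning in `showTooltip` above can be read as a small pure function. The following is a minimal sketch, not code from this commit; `placeTooltip` and its parameter names are hypothetical, while the 12px cursor offset and 8px edge padding are taken from the embed.

```javascript
// Hypothetical helper (not part of the commit): mirrors the clamping logic
// in showTooltip, using the same 12px offset and 8px edge padding.
function placeTooltip(x, y, tipWidth, tipHeight, width, height) {
  let tipX = x + 12;            // default: to the right of the cursor
  let tipY = y - tipHeight / 2; // vertically centred on the cursor
  if (tipX + tipWidth > width) tipX = x - tipWidth - 12;        // flip to the left
  if (tipY < 0) tipY = 8;                                       // clamp to the top edge
  if (tipY + tipHeight > height) tipY = height - tipHeight - 8; // clamp to the bottom edge
  return { tipX, tipY };
}

// Cursor near the top-right corner of an 800x500 chart
console.log(placeTooltip(790, 10, 150, 100, 800, 500)); // { tipX: 628, tipY: 8 }
```

Factoring the arithmetic out this way would let the three embeds share one implementation, though that is a suggestion rather than something the commit does.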

app/src/content/embeds/excess-caution.html ADDED

	@@ -0,0 +1,384 @@

+<div class="d3-excess-caution"></div>
+<style>
+  .d3-excess-caution {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-excess-caution svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-excess-caution .axes path,
+  .d3-excess-caution .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-excess-caution .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-excess-caution .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  .d3-excess-caution .axes text.axis-label {
+    font-size: 14px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-excess-caution .strip-point {
+    opacity: 0.5;
+  }
+  .d3-excess-caution .mean-line {
+    stroke-width: 4;
+    cursor: pointer;
+  }
+  .d3-excess-caution .mean-line:hover {
+    stroke-width: 5;
+  }
+  .d3-excess-caution .legend {
+    font-size: 11px;
+  }
+  .d3-excess-caution .legend-text {
+    fill: var(--text-color);
+  }
+  .d3-excess-caution .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.4;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+  }
+  .d3-excess-caution .d3-tooltip .model-name {
+    font-weight: 600;
+    margin-bottom: 4px;
+  }
+  .d3-excess-caution .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-excess-caution .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-excess-caution .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    // Capture now: document.currentScript is null once bootstrap runs from a callback
+    const scriptEl = document.currentScript;
+    const bootstrap = () => {
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-excess-caution'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-excess-caution'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gPoints = gRoot.append('g').attr('class', 'points');
+      const gMeans = gRoot.append('g').attr('class', 'means');
+      const gLegend = gRoot.append('g').attr('class', 'legend');
+      // State
+      let data = null;
+      let width = 800;
+      let height = 450;
+      const margin = { top: 20, right: 30, bottom: 50, left: 160 };
+      // Scales (swapped: X is now linear, Y is categorical)
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleBand();
+      // Data loading
+      const DATA_URL = '/data/excess_caution.json';
+      // Seeded random for consistent jitter
+      function seededRandom(seed) {
+        const x = Math.sin(seed) * 10000;
+        return x - Math.floor(x);
+      }
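+      // Approximate quartiles by index lookup on the sorted values (no interpolation);
+      // all three fields are undefined when values is empty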
+      function computeQuartiles(values) {
+        const sorted = [...values].sort((a, b) => a - b);
+        const n = sorted.length;
+        const q1 = sorted[Math.floor(n * 0.25)];
+        const median = sorted[Math.floor(n * 0.5)];
+        const q3 = sorted[Math.floor(n * 0.75)];
+        return { q1, median, q3 };
+      }
+      function showTooltip(event, model) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        const quartiles = computeQuartiles(model.values);
+        tip.innerHTML = `
+          <div class="model-name" style="color: ${model.color}">${model.name}</div>
+          <div class="metric">
+            <span class="metric-label">Mean:</span>
+            <span class="metric-value">${model.mean.toFixed(2)}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Median:</span>
+            <span class="metric-value">${quartiles.median}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Q1 / Q3:</span>
+            <span class="metric-value">${quartiles.q1} / ${quartiles.q3}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Samples:</span>
+            <span class="metric-value">${model.count}</span>
+          </div>
+        `;
+        const tipWidth = tip.offsetWidth || 150;
+        const tipHeight = tip.offsetHeight || 100;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function updateSize() {
+        width = container.clientWidth || 800;
+        // Taller chart for horizontal layout with 10 models
+        height = Math.max(400, Math.round(width * 0.6));
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        // Sort models by mean (descending - most cautious at top)
+        const models = [...data.models].sort((a, b) => b.mean - a.mean);
+        // X scale: linear (early correct turns)
+        const maxValue = d3.max(models, m => d3.max(m.values)) || 10;
+        xScale
+          .domain([0, maxValue + 0.5])
+          .range([0, innerWidth]);
+        // Y scale: categorical (model names)
+        yScale
+          .domain(models.map(m => m.name))
+          .range([0, innerHeight])
+          .padding(0.3);
+        // Grid lines (vertical)
+        const xTicks = xScale.ticks(6);
+        gGrid.selectAll('.grid-x')
+          .data(xTicks)
+          .join('line')
+          .attr('class', 'grid-x')
+          .attr('x1', d => xScale(d))
+          .attr('x2', d => xScale(d))
+          .attr('y1', 0)
+          .attr('y2', innerHeight);
+        // Remove old horizontal grid lines
+        gGrid.selectAll('.grid-y').remove();
+        // Axes
+        const tickSize = 6;
+        gAxes.selectAll('.x-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'x-axis')
+          .attr('transform', `translate(0,${innerHeight})`)
+          .call(d3.axisBottom(xScale)
+            .ticks(6)
+            .tickFormat(d3.format('d'))
+            .tickSizeInner(-tickSize)
+            .tickSizeOuter(0));
+        gAxes.selectAll('.y-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'y-axis')
+          .call(d3.axisLeft(yScale)
+            .tickSizeInner(-tickSize)
+            .tickSizeOuter(0));
+        // X-axis label
+        gAxes.selectAll('.x-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-label axis-label')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 40)
+          .attr('text-anchor', 'middle')
+          .text('Early Correct Turns');
+        // Remove old Y-axis label
+        gAxes.selectAll('.y-label').remove();
+        // Create flat array of all points with horizontal jitter
+        const bandHeight = yScale.bandwidth();
+        const jitterWidth = 8; // Fixed horizontal jitter in pixels
+        const pointRadius = Math.min(2.5, bandHeight / 20);
+        const allPoints = models.flatMap((model, modelIdx) =>
+          model.values.map((value, i) => ({
+            model,
+            value,
+            // Seeded random jitter for consistency (horizontal)
+            jitter: (seededRandom(modelIdx * 1000 + i) - 0.5) * jitterWidth
+          }))
+        );
+        // Draw all points as small circles
+        gPoints.selectAll('.strip-point')
+          .data(allPoints, (d, i) => `${d.model.name}-${i}`)
+          .join('circle')
+          .attr('class', 'strip-point')
+          .attr('cx', d => xScale(d.value) + d.jitter)
+          .attr('cy', d => yScale(d.model.name) + bandHeight / 2)
+          .attr('r', pointRadius)
+          .attr('fill', d => d.model.color);
+        // Mean lines with hover (now vertical)
+        const meanLineHeight = bandHeight * 0.78;
+        gMeans.selectAll('.mean-line')
+          .data(models, d => d.name)
+          .join('line')
+          .attr('class', 'mean-line')
+          .attr('x1', d => xScale(d.mean))
+          .attr('x2', d => xScale(d.mean))
+          .attr('y1', d => yScale(d.name) + bandHeight / 2 - meanLineHeight / 2)
+          .attr('y2', d => yScale(d.name) + bandHeight / 2 + meanLineHeight / 2)
+          .attr('stroke', d => d.color)
+          .on('mouseenter', (event, d) => showTooltip(event, d))
+          .on('mousemove', (event, d) => showTooltip(event, d))
+          .on('mouseleave', hideTooltip);
+        // Legend
+        gLegend.selectAll('.legend-note')
+          .data([0])
+          .join('text')
+          .attr('class', 'legend-note legend-text')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 40)
+          .attr('text-anchor', 'middle')
+          .attr('font-size', '11px')
+          .text('');
+      }
+      // Initialize
+      fetch(DATA_URL, { cache: 'no-cache' })
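+        .then(r => {
+          // Fail fast on HTTP errors so the message shown below is meaningful
+          if (!r.ok) throw new Error(`HTTP ${r.status}`);
+          return r.json();
+        })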
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
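
Review note: `computeQuartiles` in this embed picks quartiles by index, so for an even-length sample it returns the upper of the two middle values as the median. The sketch below reproduces that behaviour next to an interpolating variant for comparison. `indexQuartiles` and `interpolatedQuantile` are hypothetical names and the interpolating version is not part of the commit; it only shows how the tooltip values might differ.

```javascript
// Same index-lookup approach as computeQuartiles in the embed.
function indexQuartiles(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  return {
    q1: sorted[Math.floor(n * 0.25)],
    median: sorted[Math.floor(n * 0.5)],
    q3: sorted[Math.floor(n * 0.75)],
  };
}

// Hypothetical alternative (not in the commit): linear interpolation
// between the two nearest ranks of the sorted sample.
function interpolatedQuantile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  if (sorted.length === 0) return undefined;
  const h = (sorted.length - 1) * p;
  const lo = Math.floor(h);
  const hi = Math.min(lo + 1, sorted.length - 1);
  return sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo]);
}

const sample = [1, 2, 3, 4, 5, 6, 7, 8];
console.log(indexQuartiles(sample));            // { q1: 3, median: 5, q3: 7 }
console.log(interpolatedQuantile(sample, 0.5)); // 4.5
```

Because the plotted values are integer turn counts, the index-based version always shows integers in the tooltip, which may well be the intended trade-off.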

app/src/content/embeds/reckless-guessing.html ADDED

	@@ -0,0 +1,400 @@

+<div class="d3-reckless-guessing"></div>
+<style>
+  .d3-reckless-guessing {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-reckless-guessing svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-reckless-guessing .axes path,
+  .d3-reckless-guessing .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-reckless-guessing .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 12px;
+  }
+  .d3-reckless-guessing .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  .d3-reckless-guessing .axes text.axis-label {
+    font-size: 14px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-reckless-guessing .axes text.chart-title {
+    font-size: 16px;
+    font-weight: 600;
+    fill: var(--text-color);
+  }
+  .d3-reckless-guessing .axes text.subtitle {
+    font-size: 11px;
+    font-style: italic;
+    fill: var(--muted-color);
+  }
+  .d3-reckless-guessing .model-label {
+    font-size: 13px;
+    font-weight: 500;
+  }
+  .d3-reckless-guessing .bar {
+    cursor: pointer;
+    transition: opacity 0.15s ease;
+  }
+  .d3-reckless-guessing .bar:hover {
+    opacity: 0.8;
+  }
+  .d3-reckless-guessing .percent-label {
+    font-size: 12px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-reckless-guessing .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.4;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+  }
+  .d3-reckless-guessing .d3-tooltip .model-name {
+    font-weight: 600;
+    margin-bottom: 4px;
+  }
+  .d3-reckless-guessing .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-reckless-guessing .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-reckless-guessing .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    // Capture now: document.currentScript is null once bootstrap runs from a callback
+    const scriptEl = document.currentScript;
+    const bootstrap = () => {
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-reckless-guessing'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-reckless-guessing'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gBars = gRoot.append('g').attr('class', 'bars');
+      const gLabels = gRoot.append('g').attr('class', 'labels');
+      // State
+      let data = null;
+      let width = 800;
+      let height = 450;
+      const margin = { top: 40, right: 50, bottom: 56, left: 20 };
+      // Scales
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleBand();
+      // Data loading
+      const JSON_PATHS = [
+        '/data/reckless_guessing.json',
+        './assets/data/reckless_guessing.json',
+        '../assets/data/reckless_guessing.json',
+        '../../assets/data/reckless_guessing.json'
+      ];
+      const fetchFirstAvailable = async (paths) => {
+        for (const p of paths) {
+          try {
+            const r = await fetch(p, { cache: 'no-cache' });
+            if (r.ok) return await r.json();
+          } catch (_) {}
+        }
+        throw new Error('Data not found');
+      };
+      function updateSize() {
+        width = container.clientWidth || 800;
+        const numModels = data ? data.models.length : 10;
+        const barHeight = 36;
+        height = margin.top + margin.bottom + numModels * barHeight;
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function showTooltip(event, d) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        tip.innerHTML = `
+          <div class="model-name" style="color: ${d.color}">${d.name}</div>
+          <div class="metric">
+            <span class="metric-label">Double-Down Rate:</span>
+            <span class="metric-value">${(d.double_down_rate * 100).toFixed(0)}%</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Wrong Guesses:</span>
+            <span class="metric-value">${d.wrong_guesses}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Next Turn Guesses:</span>
+            <span class="metric-value">${d.next_turn_guesses}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Max Streak:</span>
+            <span class="metric-value">${d.max_streak}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Type:</span>
+            <span class="metric-value">${d.is_open ? 'Open' : 'Closed'}</span>
+          </div>
+        `;
+        const tipWidth = tip.offsetWidth || 180;
+        const tipHeight = tip.offsetHeight || 120;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      // Estimate perceived brightness (Rec. 601 luma weights applied to gamma-encoded
+      // sRGB channels) and return black or white for better contrast.
+      // Expects a 6-digit hex color such as '#4A90D9'.
+      function getContrastColor(hexColor) {
+        const hex = hexColor.replace('#', '');
+        const r = parseInt(hex.slice(0, 2), 16) / 255;
+        const g = parseInt(hex.slice(2, 4), 16) / 255;
+        const b = parseInt(hex.slice(4, 6), 16) / 255;
+        const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
+        return luminance > 0.5 ? '#000000' : '#ffffff';
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        // Sort models by double_down_rate descending
+        const models = [...data.models].sort((a, b) => b.double_down_rate - a.double_down_rate);
+        // Update scales
+        xScale
+          .domain([0, 0.8])
+          .range([0, innerWidth]);
+        yScale
+          .domain(models.map(d => d.name))
+          .range([0, innerHeight])
+          .padding(0.25);
+        // Grid lines (vertical)
+        const xTicks = [0, 0.2, 0.4, 0.6, 0.8];
+        gGrid.selectAll('.grid-x')
+          .data(xTicks)
+          .join('line')
+          .attr('class', 'grid-x')
+          .attr('x1', d => xScale(d))
+          .attr('x2', d => xScale(d))
+          .attr('y1', 0)
+          .attr('y2', innerHeight);
+        // Title
+        gAxes.selectAll('.chart-title')
+          .data([0])
+          .join('text')
+          .attr('class', 'chart-title')
+          .attr('x', innerWidth / 2)
+          .attr('y', -20)
+          .attr('text-anchor', 'middle')
+          .text('After Wrong Guess: % Guessing Again Next Turn');
+        // X-axis (bottom)
+        gAxes.selectAll('.x-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'x-axis')
+          .attr('transform', `translate(0,${innerHeight})`)
+          .call(d3.axisBottom(xScale)
+            .tickValues(xTicks)
+            .tickFormat(d => `${Math.round(d * 100)}%`)
+            .tickSizeOuter(0));
+        // X-axis label
+        gAxes.selectAll('.x-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-label axis-label')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 34)
+          .attr('text-anchor', 'middle')
+          .text('Double-Down Rate');
+        // Subtitle
+        gAxes.selectAll('.subtitle')
+          .data([0])
+          .join('text')
+          .attr('class', 'subtitle')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 48)
+          .attr('text-anchor', 'middle')
+          .text('Higher = more reckless (keeps guessing after failures)');
+        // Bars
+        const barHeight = yScale.bandwidth();
+        // All models with filled bars
+        gBars.selectAll('.bar')
+          .data(models, d => d.name)
+          .join('rect')
+          .attr('class', 'bar')
+          .attr('x', 0)
+          .attr('y', d => yScale(d.name))
+          .attr('width', d => xScale(d.double_down_rate))
+          .attr('height', barHeight)
+          .attr('fill', d => d.color)
+          .attr('rx', 3)
+          .attr('ry', 3)
+          .on('mouseenter', showTooltip)
+          .on('mousemove', showTooltip)
+          .on('mouseleave', hideTooltip);
+        // Model labels (inside bars)
+        gLabels.selectAll('.model-label')
+          .data(models, d => d.name)
+          .join('text')
+          .attr('class', 'model-label')
+          .attr('x', 8)
+          .attr('y', d => yScale(d.name) + barHeight / 2)
+          .attr('dy', '0.35em')
+          .attr('text-anchor', 'start')
+          .style('fill', d => getContrastColor(d.color))
+          .text(d => d.name);
+        // Percentage labels (end of bars)
+        gLabels.selectAll('.percent-label')
+          .data(models, d => d.name)
+          .join('text')
+          .attr('class', 'percent-label')
+          .attr('x', d => xScale(d.double_down_rate) + 6)
+          .attr('y', d => yScale(d.name) + barHeight / 2)
+          .attr('dy', '0.35em')
+          .attr('text-anchor', 'start')
+          .text(d => `${Math.round(d.double_down_rate * 100)}%`);
+      }
+      // Initialize
+      fetchFirstAvailable(JSON_PATHS)
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
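
The `getContrastColor` helper in the embed above applies its weights to gamma-encoded channel values. A minimal sketch of a variant that linearizes the sRGB channels first and then compares contrast ratios against white and black, in the style of the WCAG contrast formula. The names `srgbToLinear`, `relativeLuminance` and `contrastTextColor` are illustrative and not part of this commit, and the sketch assumes 6-digit hex input.

```javascript
// Hypothetical helper: convert one 8-bit sRGB channel (0-255) to linear light.
function srgbToLinear(channel) {
  const c = channel / 255;
  return c <= 0.04045 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a 6-digit hex color, in the range [0, 1].
function relativeLuminance(hexColor) {
  const hex = hexColor.replace('#', '');
  const r = srgbToLinear(parseInt(hex.slice(0, 2), 16));
  const g = srgbToLinear(parseInt(hex.slice(2, 4), 16));
  const b = srgbToLinear(parseInt(hex.slice(4, 6), 16));
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// Return whichever of black or white text has the higher contrast ratio
// against the given background color.
function contrastTextColor(hexColor) {
  const lum = relativeLuminance(hexColor);
  const contrastWithWhite = 1.05 / (lum + 0.05);
  const contrastWithBlack = (lum + 0.05) / 0.05;
  return contrastWithBlack >= contrastWithWhite ? '#000000' : '#ffffff';
}
```

For mid-tone bar colors the two approaches can disagree, so whether the swap is worthwhile depends on the palette in the JSON data.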

app/src/content/embeds/score-stack.html ADDED

	@@ -0,0 +1,440 @@

+<div class="d3-score-stack"></div>
+<style>
+  .d3-score-stack {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-score-stack svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-score-stack .axes path,
+  .d3-score-stack .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-score-stack .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-score-stack .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  .d3-score-stack .axes text.axis-label {
+    font-size: 15px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-score-stack .bar-segment {
+    cursor: pointer;
+    transition: opacity 0.15s ease;
+  }
+  .d3-score-stack .bar-segment:hover {
+    opacity: 0.8;
+  }
+  .d3-score-stack .model-label {
+    font-size: 12px;
+    fill: var(--text-color);
+  }
+  .d3-score-stack .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.4;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+  }
+  .d3-score-stack .d3-tooltip .model-name {
+    font-weight: 600;
+    margin-bottom: 4px;
+  }
+  .d3-score-stack .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-score-stack .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-score-stack .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+  .d3-score-stack .legend {
+    display: flex;
+    flex-wrap: wrap;
+    justify-content: center;
+    gap: 16px;
+    margin-top: 12px;
+    font-size: 12px;
+  }
+  .d3-score-stack .legend-item {
+    display: flex;
+    align-items: center;
+    gap: 6px;
+  }
+  .d3-score-stack .legend-swatch {
+    width: 14px;
+    height: 14px;
+    border-radius: 2px;
+  }
+  .d3-score-stack .legend-label {
+    color: var(--text-color);
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-score-stack'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-score-stack'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gBars = gRoot.append('g').attr('class', 'bars');
+      // Legend container
+      const legendDiv = document.createElement('div');
+      legendDiv.className = 'legend';
+      container.appendChild(legendDiv);
+      // State
+      let data = null;
+      let width = 800;
+      let height = 500;
+      const margin = { top: 20, right: 30, bottom: 56, left: 160 };
+      // Colors for segments
+      const segmentColors = {
+        raw: '#4A90D9',       // Blue - raw score
+        floored: '#E8973E',   // Orange - flooring gain
+        noStakes: '#5AAA5A'   // Green - no-stakes gain
+      };
+      // Scales
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleBand();
+      // Data loading
+      const DATA_URL = '/data/score_stack.json';
+      function updateSize() {
+        width = container.clientWidth || 800;
+        const barCount = data ? data.models.length : 10;
+        height = Math.max(400, barCount * 44 + margin.top + margin.bottom);
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function showTooltip(event, d, segment) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        let segmentName, segmentValue, description;
+        if (segment === 'raw') {
+          segmentName = 'Raw Score';
+          segmentValue = d.avg_score.toFixed(2);
+          description = 'Standard scoring: 30 - turns - 2×wrong guesses';
+        } else if (segment === 'floored') {
+          segmentName = 'Flooring Gain';
+          segmentValue = '+' + d.floored_delta.toFixed(2);
+          description = 'Gain if negative scores count as 0';
+        } else {
+          segmentName = 'No-Stakes Gain';
+          segmentValue = '+' + d.no_stakes_delta.toFixed(2);
+          description = 'Additional gain without guess penalties';
+        }
+        tip.innerHTML = `
+          <div class="model-name" style="color: ${d.color}">${d.name}</div>
+          <div class="metric">
+            <span class="metric-label">${segmentName}:</span>
+            <span class="metric-value">${segmentValue}</span>
+          </div>
+          <div style="font-size: 11px; color: var(--muted-color); margin-top: 4px;">${description}</div>
+          <hr style="border: none; border-top: 1px solid var(--border-color); margin: 8px 0;">
+          <div class="metric">
+            <span class="metric-label">Raw Score:</span>
+            <span class="metric-value">${d.avg_score.toFixed(2)}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Floored Score:</span>
+            <span class="metric-value">${d.avg_floored_score.toFixed(2)}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">No-Stakes Score:</span>
+            <span class="metric-value">${d.avg_no_stakes_score.toFixed(2)}</span>
+          </div>
+        `;
+        const tipWidth = tip.offsetWidth || 200;
+        const tipHeight = tip.offsetHeight || 150;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        // Sort models by raw score (descending)
+        const models = [...data.models].sort((a, b) => b.avg_score - a.avg_score);
+        // Update scales
+        const maxScore = d3.max(models, d => d.avg_no_stakes_score);
+        xScale
+          .domain([0, maxScore + 1])
+          .range([0, innerWidth])
+          .nice();
+        yScale
+          .domain(models.map(d => d.name))
+          .range([0, innerHeight])
+          .padding(0.25);
+        // Grid lines
+        const xTicks = xScale.ticks(8);
+        gGrid.selectAll('.grid-x')
+          .data(xTicks)
+          .join('line')
+          .attr('class', 'grid-x')
+          .attr('x1', d => xScale(d))
+          .attr('x2', d => xScale(d))
+          .attr('y1', 0)
+          .attr('y2', innerHeight);
+        // Axes
+        const tickSize = 6;
+        gAxes.selectAll('.x-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'x-axis')
+          .attr('transform', `translate(0,${innerHeight})`)
+          .call(d3.axisBottom(xScale).ticks(8).tickSizeInner(-tickSize).tickSizeOuter(0));
+        gAxes.selectAll('.y-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'y-axis')
+          .call(d3.axisLeft(yScale).tickSize(0))
+          .selectAll('text')
+          .attr('class', 'model-label');
+        // Axis label
+        gAxes.selectAll('.x-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-label axis-label')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 44)
+          .attr('text-anchor', 'middle')
+          .text('Score');
+        const barHeight = yScale.bandwidth();
+        // Helper to sanitize names for use in CSS class selectors (replaces periods, spaces, etc. with hyphens)
+        const toClassName = (name) => name.replace(/[^a-zA-Z0-9]/g, '-');
+        // Draw stacked bars for each model
+        models.forEach(d => {
+          const y = yScale(d.name);
+          const safeId = toClassName(d.name);
+          // Calculate segment positions
+          // Raw score starts from 0, clamp negative scores to 0
+          const rawStart = 0;
+          const rawEnd = Math.max(0, d.avg_score);
+          // Floored delta starts where raw score ends (if positive) or at 0 (if raw was negative)
+          const flooredStart = rawEnd;
+          const flooredEnd = flooredStart + d.floored_delta;
+          // No-stakes delta starts where floored ends
+          const noStakesStart = flooredEnd;
+          const noStakesEnd = noStakesStart + d.no_stakes_delta;
+          // Raw score segment
+          gBars.selectAll(`.bar-raw-${safeId}`)
+            .data([d])
+            .join('rect')
+            .attr('class', `bar-segment bar-raw-${safeId}`)
+            .attr('x', xScale(rawStart))
+            .attr('y', y)
+            .attr('width', Math.max(0, xScale(rawEnd) - xScale(rawStart)))
+            .attr('height', barHeight)
+            .attr('fill', segmentColors.raw)
+            .on('mouseenter', (e) => showTooltip(e, d, 'raw'))
+            .on('mousemove', (e) => showTooltip(e, d, 'raw'))
+            .on('mouseleave', hideTooltip);
+          // Floored delta segment (drawn only when the gain exceeds 0.01)
+          if (d.floored_delta > 0.01) {
+            gBars.selectAll(`.bar-floored-${safeId}`)
+              .data([d])
+              .join('rect')
+              .attr('class', `bar-segment bar-floored-${safeId}`)
+              .attr('x', xScale(flooredStart))
+              .attr('y', y)
+              .attr('width', Math.max(0, xScale(flooredEnd) - xScale(flooredStart)))
+              .attr('height', barHeight)
+              .attr('fill', segmentColors.floored)
+              .attr('opacity', 0.5)
+              .on('mouseenter', (e) => showTooltip(e, d, 'floored'))
+              .on('mousemove', (e) => showTooltip(e, d, 'floored'))
+              .on('mouseleave', hideTooltip);
+          }
+          // No-stakes delta segment (drawn only when the gain exceeds 0.01)
+          if (d.no_stakes_delta > 0.01) {
+            gBars.selectAll(`.bar-nostakes-${safeId}`)
+              .data([d])
+              .join('rect')
+              .attr('class', `bar-segment bar-nostakes-${safeId}`)
+              .attr('x', xScale(noStakesStart))
+              .attr('y', y)
+              .attr('width', Math.max(0, xScale(noStakesEnd) - xScale(noStakesStart)))
+              .attr('height', barHeight)
+              .attr('fill', segmentColors.noStakes)
+              .attr('opacity', 0.5)
+              .on('mouseenter', (e) => showTooltip(e, d, 'noStakes'))
+              .on('mousemove', (e) => showTooltip(e, d, 'noStakes'))
+              .on('mouseleave', hideTooltip);
+          }
+        });
+        // Update legend
+        legendDiv.innerHTML = `
+          <div class="legend-item">
+            <div class="legend-swatch" style="background: ${segmentColors.raw}"></div>
+            <span class="legend-label">Raw Score</span>
+          </div>
+          <div class="legend-item">
+            <div class="legend-swatch" style="background: ${segmentColors.floored}"></div>
+            <span class="legend-label">Flooring Gain</span>
+          </div>
+          <div class="legend-item">
+            <div class="legend-swatch" style="background: ${segmentColors.noStakes}"></div>
+            <span class="legend-label">No-Stakes Gain</span>
+          </div>
+        `;
+      }
+      // Initialize
+      fetch(DATA_URL, { cache: 'no-cache' })
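+        .then(r => {
+          // Surface HTTP errors (e.g. 404) instead of failing later in JSON parsing
+          if (!r.ok) throw new Error(`HTTP ${r.status} for ${DATA_URL}`);
+          return r.json();
+        })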
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
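
The stacking arithmetic inside `render()` in the embed above can be read as a small pure function, which makes the handling of negative raw scores easier to check. The sketch below mirrors that logic. The field names (`avg_score`, `floored_delta`, `no_stakes_delta`) come from the embed, while `stackSegments` and the sample numbers are hypothetical and not benchmark results.

```javascript
// Sketch of the stacking arithmetic from render(), pulled out as a pure function.
// Returns [start, end] pairs in score units for each of the three segments.
function stackSegments(d) {
  const rawEnd = Math.max(0, d.avg_score);            // negative raw scores draw with zero width
  const flooredEnd = rawEnd + d.floored_delta;        // flooring gain stacks on the raw segment
  const noStakesEnd = flooredEnd + d.no_stakes_delta; // no-stakes gain stacks on top of that
  return {
    raw: [0, rawEnd],
    floored: [rawEnd, flooredEnd],
    noStakes: [flooredEnd, noStakesEnd]
  };
}

// Example with made-up numbers (not taken from score_stack.json)
const example = stackSegments({ avg_score: -2, floored_delta: 3, no_stakes_delta: 4 });
console.log(example);
```

With a negative `avg_score`, the stack ends at `floored_delta + no_stakes_delta`. If the deltas are differences between the averaged scores (an assumption, since the JSON schema is not shown in this diff), that end point lies to the right of `avg_no_stakes_score`, so the bar slightly overstates the total for such models.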

app/src/content/embeds/score-vs-failed-guesses.html ADDED

	@@ -0,0 +1,369 @@

+<div class="d3-score-vs-failed-guesses"></div>
+<style>
+  .d3-score-vs-failed-guesses {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-score-vs-failed-guesses svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  .d3-score-vs-failed-guesses .axes path,
+  .d3-score-vs-failed-guesses .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-score-vs-failed-guesses .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-score-vs-failed-guesses .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  .d3-score-vs-failed-guesses .axes text.axis-label {
+    font-size: 15px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-score-vs-failed-guesses .x-axis text {
+    transform: translateY(4px);
+  }
+  .d3-score-vs-failed-guesses .point {
+    cursor: pointer;
+    transition: opacity 0.15s ease;
+  }
+  .d3-score-vs-failed-guesses .point:hover {
+    opacity: 0.8;
+  }
+  .d3-score-vs-failed-guesses .point-label {
+    font-size: 11px;
+    fill: var(--text-color);
+    pointer-events: none;
+  }
+  .d3-score-vs-failed-guesses .d3-tooltip {
+    position: absolute;
+    top: 0;
+    left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.4;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+  }
+  .d3-score-vs-failed-guesses .d3-tooltip .model-name {
+    font-weight: 600;
+    margin-bottom: 4px;
+  }
+  .d3-score-vs-failed-guesses .d3-tooltip .metric {
+    display: flex;
+    justify-content: space-between;
+    gap: 16px;
+  }
+  .d3-score-vs-failed-guesses .d3-tooltip .metric-label {
+    color: var(--muted-color);
+  }
+  .d3-score-vs-failed-guesses .d3-tooltip .metric-value {
+    font-weight: 500;
+  }
+</style>
+<script>
+  (() => {
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-score-vs-failed-guesses'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-score-vs-failed-guesses'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gPoints = gRoot.append('g').attr('class', 'points');
+      const gLabels = gRoot.append('g').attr('class', 'labels');
+      // State
+      let data = null;
+      let width = 800;
+      let height = 450;
+      const margin = { top: 20, right: 120, bottom: 56, left: 72 };
+      // Scales
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleLinear();
+      // Data loading
+      const DATA_URL = '/data/score_vs_failed_guesses.json';
+      function updateSize() {
+        width = container.clientWidth || 800;
+        height = Math.max(300, Math.round(width / 1.3));
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function showTooltip(event, d) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        tip.innerHTML = `
+          <div class="model-name" style="color: ${d.color}">${d.name}</div>
+          <div class="metric">
+            <span class="metric-label">Score:</span>
+            <span class="metric-value">${d.avg_score.toFixed(2)}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Failed Guesses:</span>
+            <span class="metric-value">${d.avg_failed_guesses.toFixed(2)}</span>
+          </div>
+          <div class="metric">
+            <span class="metric-label">Type:</span>
+            <span class="metric-value">${d.is_open ? 'Open' : 'Closed'}</span>
+          </div>
+        `;
+        const tipWidth = tip.offsetWidth || 150;
+        const tipHeight = tip.offsetHeight || 80;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        const models = data.models;
+        // Update scales
+        const xExtent = d3.extent(models, d => d.avg_failed_guesses);
+        const yExtent = d3.extent(models, d => d.avg_score);
+        const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
+        const yPadding = (yExtent[1] - yExtent[0]) * 0.1;
+        xScale
+          .domain([Math.max(0, xExtent[0] - xPadding), xExtent[1] + xPadding])
+          .range([0, innerWidth])
+          .nice();
+        yScale
+          .domain([yExtent[0] - yPadding, yExtent[1] + yPadding])
+          .range([innerHeight, 0])
+          .nice();
+        // Grid lines
+        const xTicks = xScale.ticks(6);
+        const yTicks = yScale.ticks(6);
+        gGrid.selectAll('.grid-x')
+          .data(xTicks)
+          .join('line')
+          .attr('class', 'grid-x')
+          .attr('x1', d => xScale(d))
+          .attr('x2', d => xScale(d))
+          .attr('y1', 0)
+          .attr('y2', innerHeight);
+        gGrid.selectAll('.grid-y')
+          .data(yTicks)
+          .join('line')
+          .attr('class', 'grid-y')
+          .attr('x1', 0)
+          .attr('x2', innerWidth)
+          .attr('y1', d => yScale(d))
+          .attr('y2', d => yScale(d));
+        // Axes with inner ticks
+        const tickSize = 6;
+        gAxes.selectAll('.x-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'x-axis')
+          .attr('transform', `translate(0,${innerHeight})`)
+          .call(d3.axisBottom(xScale).ticks(6).tickSizeInner(-tickSize).tickSizeOuter(0));
+        gAxes.selectAll('.y-axis')
+          .data([0])
+          .join('g')
+          .attr('class', 'y-axis')
+          .call(d3.axisLeft(yScale).ticks(6).tickSizeInner(-tickSize).tickSizeOuter(0));
+        // Axis labels
+        gAxes.selectAll('.x-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'x-label axis-label')
+          .attr('x', innerWidth / 2)
+          .attr('y', innerHeight + 44)
+          .attr('text-anchor', 'middle')
+          .text('Average Failed Guesses');
+        gAxes.selectAll('.y-label')
+          .data([0])
+          .join('text')
+          .attr('class', 'y-label axis-label')
+          .attr('x', -innerHeight / 2)
+          .attr('y', -52)
+          .attr('text-anchor', 'middle')
+          .attr('transform', 'rotate(-90)')
+          .text('Average Score');
+        // Points - circles for closed models, stars for open models
+        const pointRadius = Math.max(8, Math.min(16, innerWidth / 60));
+        // Helper function to create a 5-point star path
+        const starPath = (cx, cy, outerR, innerR) => {
+          const points = [];
+          for (let i = 0; i < 10; i++) {
+            const r = i % 2 === 0 ? outerR : innerR;
+            const angle = (Math.PI / 2) + (i * Math.PI / 5);
+            points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
+          }
+          return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
+        };
+        // Closed models as circles
+        const closedModels = models.filter(d => !d.is_open);
+        gPoints.selectAll('.point-circle')
+          .data(closedModels, d => d.name)
+          .join('circle')
+          .attr('class', 'point point-circle')
+          .attr('cx', d => xScale(d.avg_failed_guesses))
+          .attr('cy', d => yScale(d.avg_score))
+          .attr('r', pointRadius)
+          .attr('fill', d => d.color)
+          .attr('stroke', 'none')
+          .on('mouseenter', showTooltip)
+          .on('mousemove', showTooltip)
+          .on('mouseleave', hideTooltip);
+        // Open models as stars
+        const openModels = models.filter(d => d.is_open);
+        gPoints.selectAll('.point-star')
+          .data(openModels, d => d.name)
+          .join('path')
+          .attr('class', 'point point-star')
+          .attr('d', d => starPath(xScale(d.avg_failed_guesses), yScale(d.avg_score), pointRadius * 1.2, pointRadius * 0.5))
+          .attr('fill', d => d.color)
+          .attr('stroke', 'none')
+          .on('mouseenter', showTooltip)
+          .on('mousemove', showTooltip)
+          .on('mouseleave', hideTooltip);
+        // Point labels
+        gLabels.selectAll('.point-label')
+          .data(models)
+          .join('text')
+          .attr('class', 'point-label')
+          .attr('x', d => xScale(d.avg_failed_guesses) + pointRadius + 6)
+          .attr('y', d => yScale(d.avg_score) + 4)
+          .text(d => d.name);
+      }
+      // Initialize
+      fetch(DATA_URL, { cache: 'no-cache' })
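+        .then(r => {
+          // Surface HTTP errors (e.g. 404) instead of failing later in JSON parsing
+          if (!r.ok) throw new Error(`HTTP ${r.status} for ${DATA_URL}`);
+          return r.json();
+        })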
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>

dark-mode-image.md ADDED

	@@ -0,0 +1,48 @@

+# Dark Mode Image Handling
+## Problem
+The blog template automatically inverts image colors in dark mode using a CSS filter:
+```css
+:global([data-theme="dark"]) .image-wrapper img {
+  filter: invert(0.925) hue-rotate(180deg);
+}
+```
+This works well for charts and figures with white backgrounds, but is undesirable for images that should retain their original colors (e.g., photographs, illustrations with specific color schemes).
+## Solution
+Added a `preserveColors` prop to the `Image` component that opts out of the dark mode inversion.
+### Usage
+```mdx
+import Image from "../../../components/Image.astro";
+import myImage from "../../assets/image/my_image.png";
+<Image
+  src={myImage}
+  alt="Description"
+  preserveColors
+/>
+```
+### Implementation
+**File: `app/src/components/Image.astro`**
+1. Added `preserveColors?: boolean` to the Props interface
+2. Added `data-preserve-colors` attribute to the wrapper div when the prop is true
+3. Updated CSS selectors to exclude images with this attribute:
+```css
+:global([data-theme="dark"]) .image-wrapper:not([data-preserve-colors]) img {
+  filter: invert(0.925) hue-rotate(180deg);
+}
+```
+### Current Usage
+- `introduction.mdx`: The `example_sequence.png` image uses `preserveColors` to maintain the card colors in dark mode

interactive-charts.md ADDED

	@@ -0,0 +1,498 @@

+# Converting Static Figures to Interactive D3 Charts
+This guide explains how to convert PNG figures into interactive D3.js visualizations for this project.
+## Overview
+Each interactive chart consists of:
+1. **JSON data file** in `app/public/data/` (served at `/data/filename.json`)
+2. **HTML embed file** in `app/src/content/embeds/` (e.g., `chart-name.html`)
+3. **MDX integration** using the `HtmlEmbed` component
+## File Structure
+```
+app/
+├── public/data/                     # JSON data (served at /data/*)
+│   ├── overall_performance.json
+│   ├── calibration_curves.json
+│   └── ...
+└── src/content/embeds/              # HTML chart implementations
+    ├── banner.html                  # Example: scatter plot
+    └── calibration-curves.html      # (to create)
+```
+## Step 1: Understand Your Data
+Check the JSON structure in `app/public/data/`. Common patterns:
+**Scatter plot** (`overall_performance.json`):
+```json
+{
+  "models": [
+    { "name": "Model A", "avg_score": 15.8, "avg_output_tokens_per_turn": 5253, "color": "#FF6B00", "is_open": false }
+  ]
+}
+```
+**Line chart / Calibration** (`calibration_curves.json`):
+```json
+{
+  "models": [
+    {
+      "name": "Model A", "color": "#FF6B00",
+      "calibration_points": [
+        { "confidence_level": 5, "actual_success_rate": 0.041, "sample_count": 73 }
+      ]
+    }
+  ]
+}
+```
+**Histogram** (`confidence_distribution.json`):
+```json
+{
+  "models": [
+    {
+      "name": "Model A", "color": "#FF6B00", "total_guesses": 579,
+      "distribution": [
+        { "confidence_level": 5, "proportion": 0.024, "count": 14 }
+      ]
+    }
+  ]
+}
+```
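Before wiring a data file into a chart, it can help to confirm that it really has the shape shown above. The following is a minimal sketch, not part of the project: `findMissingFields` is a hypothetical helper, and the key list is taken from the scatter-plot example rather than from a formal schema.

```javascript
// Hypothetical helper (not in the repository): reports which expected keys
// are absent from each entry of a `{ models: [...] }` payload.
function findMissingFields(json, requiredKeys) {
  if (!json || !Array.isArray(json.models) || json.models.length === 0) {
    return ['models']; // the top-level array itself is missing or empty
  }
  const missing = [];
  json.models.forEach((model, i) => {
    if (model === null || typeof model !== 'object') {
      missing.push(`models[${i}]`);
      return;
    }
    for (const key of requiredKeys) {
      if (!(key in model)) missing.push(`models[${i}].${key}`);
    }
  });
  return missing;
}

// Sample payload mirroring the scatter-plot example above.
const sample = {
  models: [
    { name: 'Model A', avg_score: 15.8, avg_output_tokens_per_turn: 5253, color: '#FF6B00', is_open: false }
  ]
};
const scatterKeys = ['name', 'avg_score', 'avg_output_tokens_per_turn', 'color', 'is_open'];

console.log(findMissingFields(sample, scatterKeys)); // empty array when every key is present
```

Calling this at the top of `render()` and logging a non-empty result would surface a renamed or missing field early, instead of producing `NaN` coordinates later.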
+## Step 2: Create the HTML Embed
+Create a new file in `app/src/content/embeds/`. Use this template:
+```html
+<div class="d3-CHART-NAME"></div>
+<style>
+  /* Scoped styles - prefix everything with .d3-CHART-NAME */
+  .d3-CHART-NAME {
+    width: 100%;
+    margin: 10px 0;
+    position: relative;
+    font-family: system-ui, -apple-system, sans-serif;
+  }
+  .d3-CHART-NAME svg {
+    display: block;
+    width: 100%;
+    height: auto;
+  }
+  /* Use CSS variables for theme support */
+  .d3-CHART-NAME .axes path,
+  .d3-CHART-NAME .axes line {
+    stroke: var(--axis-color, var(--text-color));
+  }
+  .d3-CHART-NAME .axes text {
+    fill: var(--tick-color, var(--muted-color));
+    font-size: 11px;
+  }
+  .d3-CHART-NAME .grid line {
+    stroke: var(--grid-color, rgba(0,0,0,.08));
+  }
+  /* Use specific selector to override .axes text */
+  .d3-CHART-NAME .axes text.axis-label {
+    font-size: 14px;
+    font-weight: 500;
+    fill: var(--text-color);
+  }
+  .d3-CHART-NAME .axes text.chart-title {
+    font-size: 16px;
+    font-weight: 600;
+    fill: var(--text-color);
+  }
+  /* Adjust tick label spacing if needed */
+  .d3-CHART-NAME .x-axis text {
+    transform: translateY(4px);
+  }
+  /* Tooltip */
+  .d3-CHART-NAME .d3-tooltip {
+    position: absolute;
+    top: 0; left: 0;
+    transform: translate(-9999px, -9999px);
+    pointer-events: none;
+    padding: 10px 12px;
+    border-radius: 8px;
+    font-size: 12px;
+    line-height: 1.4;
+    border: 1px solid var(--border-color);
+    background: var(--surface-bg);
+    color: var(--text-color);
+    box-shadow: 0 4px 24px rgba(0,0,0,.18);
+    opacity: 0;
+    transition: opacity 0.12s ease;
+    z-index: 10;
+  }
+</style>
+<script>
+  (() => {
+    // D3 loader - reuses existing if already loaded
+    const ensureD3 = (cb) => {
+      if (window.d3 && typeof window.d3.select === 'function') return cb();
+      let s = document.getElementById('d3-cdn-script');
+      if (!s) {
+        s = document.createElement('script');
+        s.id = 'd3-cdn-script';
+        s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
+        document.head.appendChild(s);
+      }
+      const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
+      s.addEventListener('load', onReady, { once: true });
+      if (window.d3) onReady();
+    };
+    const bootstrap = () => {
+      // Find container (handles multiple instances)
+      const scriptEl = document.currentScript;
+      let container = scriptEl ? scriptEl.previousElementSibling : null;
+      if (!(container && container.classList && container.classList.contains('d3-CHART-NAME'))) {
+        const candidates = Array.from(document.querySelectorAll('.d3-CHART-NAME'))
+          .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
+        container = candidates[candidates.length - 1] || null;
+      }
+      if (!container) return;
+      if (container.dataset) {
+        if (container.dataset.mounted === 'true') return;
+        container.dataset.mounted = 'true';
+      }
+      // Tooltip setup
+      container.style.position = container.style.position || 'relative';
+      const tip = document.createElement('div');
+      tip.className = 'd3-tooltip';
+      container.appendChild(tip);
+      // SVG setup
+      const svg = d3.select(container).append('svg');
+      const gRoot = svg.append('g');
+      // Chart groups (order matters for layering)
+      const gGrid = gRoot.append('g').attr('class', 'grid');
+      const gAxes = gRoot.append('g').attr('class', 'axes');
+      const gContent = gRoot.append('g').attr('class', 'content');
+      // State
+      let data = null;
+      let width = 800;
+      let height = 450;
+      const margin = { top: 40, right: 120, bottom: 56, left: 72 };
+      // Scales
+      const xScale = d3.scaleLinear();
+      const yScale = d3.scaleLinear();
+      // Data loading - single path since we use public/data/
+      const DATA_URL = '/data/YOUR_DATA_FILE.json';
+      function updateSize() {
+        width = container.clientWidth || 800;
+        height = Math.max(300, Math.round(width / 1.78)); // 16:9 aspect ratio
+        svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
+        gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
+        return {
+          innerWidth: width - margin.left - margin.right,
+          innerHeight: height - margin.top - margin.bottom
+        };
+      }
+      function showTooltip(event, d) {
+        const rect = container.getBoundingClientRect();
+        const x = event.clientX - rect.left;
+        const y = event.clientY - rect.top;
+        tip.innerHTML = `
+          <div style="font-weight: 600; color: ${d.color}">${d.name}</div>
+          <div>Value: ${d.value}</div>
+        `;
+        const tipWidth = tip.offsetWidth || 150;
+        const tipHeight = tip.offsetHeight || 80;
+        let tipX = x + 12;
+        let tipY = y - tipHeight / 2;
+        if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
+        if (tipY < 0) tipY = 8;
+        if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
+        tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
+        tip.style.opacity = '1';
+      }
+      function hideTooltip() {
+        tip.style.opacity = '0';
+        tip.style.transform = 'translate(-9999px, -9999px)';
+      }
+      function render() {
+        if (!data) return;
+        const { innerWidth, innerHeight } = updateSize();
+        // TODO: Implement your chart rendering here
+        // - Update scales with data extent
+        // - Draw grid lines
+        // - Draw axes
+        // - Draw data elements (lines, bars, points, etc.)
+      }
+      // Initialize
+      fetch(DATA_URL, { cache: 'no-cache' })
+        .then(r => r.json())
+        .then(json => {
+          data = json;
+          render();
+        })
+        .catch(err => {
+          const pre = document.createElement('pre');
+          pre.style.color = 'red';
+          pre.style.padding = '16px';
+          pre.textContent = `Error loading data: ${err.message}`;
+          container.appendChild(pre);
+        });
+      // Resize handling
+      if (window.ResizeObserver) {
+        new ResizeObserver(() => render()).observe(container);
+      } else {
+        window.addEventListener('resize', render);
+      }
+      // Theme change handling (re-render on light/dark toggle)
+      const observer = new MutationObserver(() => render());
+      observer.observe(document.documentElement, {
+        attributes: true,
+        attributeFilter: ['data-theme']
+      });
+    };
+    if (document.readyState === 'loading') {
+      document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
+    } else {
+      ensureD3(bootstrap);
+    }
+  })();
+</script>
+```
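The tooltip positioning inside `showTooltip` is easier to reason about when pulled out as a pure function. The sketch below restates the same rules (offset to the right, flip left near the right edge, clamp at the top and bottom). The name `placeTooltip` and the `gap` and `edge` parameters are introduced here for illustration; they default to the 12px and 8px constants used in the template.

```javascript
// Pure restatement of the positioning rules in showTooltip.
// x, y: pointer position relative to the container.
function placeTooltip(x, y, tipWidth, tipHeight, chartWidth, chartHeight, gap = 12, edge = 8) {
  let tipX = x + gap;            // default: to the right of the pointer
  let tipY = y - tipHeight / 2;  // vertically centered on the pointer
  if (tipX + tipWidth > chartWidth) tipX = x - tipWidth - gap;              // flip to the left
  if (tipY < 0) tipY = edge;                                                // clamp at the top
  if (tipY + tipHeight > chartHeight) tipY = chartHeight - tipHeight - edge; // clamp at the bottom
  return { tipX, tipY };
}

console.log(placeTooltip(100, 200, 150, 80, 800, 450)); // { tipX: 112, tipY: 160 }
console.log(placeTooltip(750, 10, 150, 80, 800, 450));  // { tipX: 588, tipY: 8 }
```

Keeping the arithmetic separate from the DOM writes makes the edge cases checkable without a browser.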
+## Step 3: Key Implementation Details
+### CSS Variables (Theme Support)
+Always use CSS variables for colors that need to adapt to light/dark mode:
+| Variable | Purpose |
+|----------|---------|
+| `var(--text-color)` | Main text, labels |
+| `var(--muted-color)` | Secondary text, tick labels |
+| `var(--border-color)` | Borders, outlines |
+| `var(--surface-bg)` | Tooltip background |
+| `var(--page-bg)` | Page background |
+### D3 Patterns Used
+**Scale setup:**
+```javascript
+const xExtent = d3.extent(data, d => d.x);
+const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
+xScale.domain([xExtent[0] - xPadding, xExtent[1] + xPadding])
+      .range([0, innerWidth])
+      .nice();
+```
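One edge case in the padding above: when every point shares the same value, the extent has zero width, so the padding is zero and the domain collapses to a single number. A possible guard, sketched without D3 so it can be run on its own, is shown below. `paddedExtent` is a hypothetical helper, and the fallback choices (`[0, 1]` for empty input, padding relative to the value itself for a zero-width extent) are assumptions rather than project conventions.

```javascript
// Hypothetical helper: returns [min - pad, max + pad], with a fallback
// so the domain never collapses to a single value.
function paddedExtent(values, fraction = 0.1) {
  const finite = values.filter(Number.isFinite);
  if (finite.length === 0) return [0, 1]; // assumed fallback for empty input
  const min = Math.min(...finite);
  const max = Math.max(...finite);
  const span = max - min;
  // Zero-width extent: pad relative to the value itself (or 1 if the value is 0).
  const pad = span === 0 ? (Math.abs(min) || 1) * fraction : span * fraction;
  return [min - pad, max + pad];
}

console.log(paddedExtent([10, 20, 30])); // [8, 32]
console.log(paddedExtent([5, 5]));       // [4.5, 5.5]
```

The result can be passed straight to `xScale.domain(...)` in place of the manual extent arithmetic.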
+**Grid lines:**
+```javascript
+gGrid.selectAll('.grid-x')
+  .data(xScale.ticks(6))
+  .join('line')
+  .attr('class', 'grid-x')
+  .attr('x1', d => xScale(d))
+  .attr('x2', d => xScale(d))
+  .attr('y1', 0)
+  .attr('y2', innerHeight);
+```
+**Axes (basic):**
+```javascript
+gAxes.selectAll('.x-axis')
+  .data([0])
+  .join('g')
+  .attr('class', 'x-axis')
+  .attr('transform', `translate(0,${innerHeight})`)
+  .call(d3.axisBottom(xScale).ticks(6));
+```
+**Axes with inner ticks:**
+```javascript
+const tickSize = 6;
+gAxes.selectAll('.x-axis')
+  .data([0])
+  .join('g')
+  .attr('class', 'x-axis')
+  .attr('transform', `translate(0,${innerHeight})`)
+  .call(d3.axisBottom(xScale)
+    .ticks(6)
+    .tickSizeInner(-tickSize)  // Negative = ticks point inward
+    .tickSizeOuter(0));        // No outer ticks
+```
+**Custom shapes (5-point star):**
+```javascript
+const starPath = (cx, cy, outerR, innerR) => {
+  const points = [];
+  for (let i = 0; i < 10; i++) {
+    const r = i % 2 === 0 ? outerR : innerR;
+    const angle = (Math.PI / 2) + (i * Math.PI / 5);
+    points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
+  }
+  return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
+};
+// Use with path elements
+gContent.selectAll('.point-star')
+  .data(openModels)
+  .join('path')
+  .attr('d', d => starPath(xScale(d.x), yScale(d.y), radius * 1.2, radius * 0.5))
+  .attr('fill', d => d.color);
+```
+**Data-join for elements:**
+```javascript
+gContent.selectAll('.point')
+  .data(models)
+  .join('circle')
+  .attr('class', 'point')
+  .attr('cx', d => xScale(d.x))
+  .attr('cy', d => yScale(d.y))
+  .attr('r', 8)
+  .attr('fill', d => d.color)
+  .on('mouseenter', showTooltip)
+  .on('mousemove', showTooltip)
+  .on('mouseleave', hideTooltip);
+```
+## Step 4: Integrate in MDX
+In your `.mdx` file:
+```mdx
+import HtmlEmbed from "../../../components/HtmlEmbed.astro";
+<HtmlEmbed
+  src="chart-name.html"
+  title="Chart Title"
+  caption="<strong>Figure N:</strong> Description of what this shows."
+/>
+```
+For frameless embedding (like the banner):
+```mdx
+<HtmlEmbed src="banner.html" frameless />
+```
+## Charts to Convert
+| Figure | Data File | Chart Type | Status |
+|--------|-----------|------------|--------|
+| 1 | `overall_performance.json` | Scatter | Done (banner.html) |
+| 2 | `calibration_curves.json` | Multi-line | Done (calibration-curves.html) |
+| 3 | `confidence_distribution.json` | Grouped histogram | Done (confidence-distribution.html) |
+| 4 | `score_vs_failed_guesses.json` | Scatter | TODO |
+| 5 | `excess_caution.json` | Box plot | TODO |
+| 6 | `caution_vs_failed_guesses.json` | Scatter | Done (caution-vs-failed-guesses.html) |
+| 7 | `by_rule.json` | Strip plot | Done (by-rule.html) |
+| 8 | `complexity_analysis.json` | Heatmap | Done (complexity-analysis.html) |
+## Testing
+1. Run dev server: `cd app && npm run dev`
+2. Check the chart loads at the correct URL
+3. Verify tooltip interactions
+4. Toggle light/dark mode to check theme support
+5. Resize the window to verify responsiveness
+## Debugging Tips
+- Open browser console to see data loading errors
+- Check Network tab to verify `/data/filename.json` is being fetched
+- If the chart doesn't render, check that `container.dataset.mounted` isn't already `'true'`
+- CSS scoping: always prefix selectors with `.d3-CHART-NAME`
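One thing to keep in mind when reading data loading errors: `fetch` only rejects on network failure, so a 404 for `/data/filename.json` still resolves and then fails inside `r.json()` with a less obvious parse error. A small guard, sketched here as an optional addition to the template, makes the HTTP status visible. `checkResponse` is a hypothetical helper name.

```javascript
// Hypothetical helper: converts a non-2xx response into a thrown error
// so that the template's .catch() shows the HTTP status.
function checkResponse(response) {
  if (!response.ok) {
    throw new Error(`HTTP ${response.status} while fetching ${response.url}`);
  }
  return response;
}

// Intended use in the template (browser only, shown as a comment):
// fetch(DATA_URL, { cache: 'no-cache' })
//   .then(checkResponse)
//   .then(r => r.json())

// Stand-in objects with the same fields as a fetch Response, for illustration.
const okResponse = { ok: true, status: 200, url: '/data/example.json' };
console.log(checkResponse(okResponse) === okResponse); // true
```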
+## Common Gotchas
+### Using `.style()` vs `.attr()` for Dynamic Colors
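+When a stylesheet rule already sets `fill` or `stroke` on an element, set data-driven colors with `.style()` instead of `.attr()`. SVG presentation attributes are overridden by any matching CSS rule, whereas inline styles take precedence over it: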
+```javascript
+// WON'T WORK if a stylesheet rule sets fill - presentation attributes lose to any matching CSS rule
+.attr('fill', d => getContrastColor(d.color))
+// USE THIS - inline styles take precedence over stylesheet rules
+.style('fill', d => getContrastColor(d.color))
+```
+This is especially important for text labels where you need to calculate contrast colors dynamically. Example contrast function:
+```javascript
+function getContrastColor(hexColor) {
+  const hex = hexColor.replace('#', '');
+  const r = parseInt(hex.slice(0, 2), 16) / 255;
+  const g = parseInt(hex.slice(2, 4), 16) / 255;
+  const b = parseInt(hex.slice(4, 6), 16) / 255;
+  const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
+  return luminance > 0.5 ? '#000000' : '#ffffff';
+}
+// Usage
+gLabels.selectAll('.label')
+  .data(items)
+  .join('text')
+  .style('fill', d => getContrastColor(d.color))
+  .text(d => d.name);
+```
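The contrast function above assumes a six-digit hex string. If the data could ever contain shorthand (`#fff`) or a non-hex value, the channels parse as `NaN` and the comparison silently falls through to white. A more defensive variant might look like the sketch below. `normalizeHex`, `getContrastColorSafe`, and the black fallback are assumptions made for illustration; the luminance weights are the same as above.

```javascript
// Hypothetical helper: returns a 6-digit lowercase hex string, or null
// if the input is not a 3- or 6-digit hex color.
function normalizeHex(hexColor) {
  const hex = String(hexColor).trim().replace(/^#/, '');
  if (/^[0-9a-fA-F]{3}$/.test(hex)) {
    return hex.split('').map(c => c + c).join('').toLowerCase(); // expand #abc to aabbcc
  }
  if (/^[0-9a-fA-F]{6}$/.test(hex)) return hex.toLowerCase();
  return null;
}

// Same luminance rule as getContrastColor, with an explicit fallback
// for values that cannot be parsed.
function getContrastColorSafe(hexColor, fallback = '#000000') {
  const hex = normalizeHex(hexColor);
  if (hex === null) return fallback;
  const r = parseInt(hex.slice(0, 2), 16) / 255;
  const g = parseInt(hex.slice(2, 4), 16) / 255;
  const b = parseInt(hex.slice(4, 6), 16) / 255;
  const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
  return luminance > 0.5 ? '#000000' : '#ffffff';
}

console.log(getContrastColorSafe('#FF6B00')); // '#000000'
console.log(getContrastColorSafe('#000'));    // '#ffffff'
```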
+### CSS Specificity for Axis Labels
+The generic `.axes text` rule applies to ALL text inside the axes group, including axis labels. To style axis labels differently, use a more specific selector:
+```css
+/* This won't work - gets overridden by .axes text */
+.d3-CHART-NAME .axis-label {
+  font-size: 15px;
+}
+/* Use this instead - more specific */
+.d3-CHART-NAME .axes text.axis-label {
+  font-size: 15px;
+  font-weight: 500;
+  fill: var(--text-color);
+}
+```
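The reason the first rule loses can be checked by counting selector components. The sketch below is a deliberately simplified specificity calculator: it only understands ids, classes, attribute selectors, and element names, and ignores pseudo-classes, pseudo-elements, and `:not()`. It is meant as an illustration of the comparison, not as a replacement for the browser's own cascade.

```javascript
// Simplified specificity: returns [ids, classes + attributes, element names].
// Assumption: selectors contain no pseudo-classes or pseudo-elements.
function specificity(selector) {
  let rest = selector;
  const take = (pattern) => {
    const count = (rest.match(pattern) || []).length;
    rest = rest.replace(pattern, ' '); // remove what was counted
    return count;
  };
  const attributes = take(/\[[^\]]*\]/g);
  const ids = take(/#[\w-]+/g);
  const classes = take(/\.[\w-]+/g);
  const types = take(/[a-zA-Z][\w-]*/g); // whatever names remain are element names
  return [ids, classes + attributes, types];
}

// Negative if a is less specific than b, positive if more, 0 if equal.
function compareSpecificity(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return 0;
}

console.log(specificity('.d3-CHART-NAME .axis-label'));           // [0, 2, 0]
console.log(specificity('.d3-CHART-NAME .axes text'));            // [0, 2, 1]
console.log(specificity('.d3-CHART-NAME .axes text.axis-label')); // [0, 3, 1]
```

The generic `.axes text` rule carries one extra element name, so it beats the two-class `.axis-label` selector; adding `.axes text` in front of `.axis-label` brings the class count to three and wins.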
+### Adjusting Tick Label Position
+To move X-axis tick labels down (add spacing from the axis line):
+```css
+.d3-CHART-NAME .x-axis text {
+  transform: translateY(4px);
+}
+```
+### Removing Chart Elements
+When you don't need a title or legend:
+1. Remove the rendering code from `render()`
+2. Remove the CSS styles
+3. Adjust margins accordingly (e.g., reduce `margin.top` if no title)
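To see what a margin adjustment buys, the inner drawing area can be computed directly from the margin object. This is a minimal sketch mirroring the arithmetic in `updateSize()`; `innerSize` is a hypothetical helper, and the reduced `top: 16` is an assumed value chosen only for illustration.

```javascript
// Same arithmetic as updateSize(), with a floor at 0 so tiny containers
// never produce negative dimensions.
function innerSize(width, height, margin) {
  return {
    innerWidth: Math.max(0, width - margin.left - margin.right),
    innerHeight: Math.max(0, height - margin.top - margin.bottom)
  };
}

// Template margins (room for a title) versus an assumed smaller top margin.
const withTitle = innerSize(800, 450, { top: 40, right: 120, bottom: 56, left: 72 });
const withoutTitle = innerSize(800, 450, { top: 16, right: 120, bottom: 56, left: 72 });

console.log(withTitle);    // { innerWidth: 608, innerHeight: 354 }
console.log(withoutTitle); // { innerWidth: 608, innerHeight: 378 }
```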