dlouapre HF Staff committed on
Commit
aee6411
·
1 Parent(s): d123922

Adding interactive charts + assesment

Files changed (38)
  1. ASSESSMENT.md +291 -0
  2. app/src/content/assets/data/basic_metrics.csv +2 -2
  3. app/src/content/assets/data/by_rule.json +2 -2
  4. app/src/content/assets/data/by_rule.png +2 -2
  5. app/src/content/assets/data/complexity_analysis.json +2 -2
  6. app/src/content/assets/data/complexity_analysis.png +2 -2
  7. app/src/content/assets/data/model_claude_haiku_4_5.png +3 -0
  8. app/src/content/assets/data/model_claude_opus_4_5.png +3 -0
  9. app/src/content/assets/data/model_deepseek_r1.png +3 -0
  10. app/src/content/assets/data/model_gemini_3_flash_preview_low.png +3 -0
  11. app/src/content/assets/data/model_gpt_5_2_high.png +3 -0
  12. app/src/content/assets/data/model_gpt_5_mini_medium.png +3 -0
  13. app/src/content/assets/data/model_gpt_oss_120b.png +3 -0
  14. app/src/content/assets/data/model_gpt_oss_20b.png +3 -0
  15. app/src/content/assets/data/model_grok_4_1_fast_reasoning.png +3 -0
  16. app/src/content/assets/data/model_kimi_k2.png +3 -0
  17. app/src/content/assets/data/overall_performance.json +2 -2
  18. app/src/content/assets/data/overall_performance.png +2 -2
  19. app/src/content/assets/data/reckless_guessing.json +3 -0
  20. app/src/content/assets/data/reckless_guessing.png +3 -0
  21. app/src/content/assets/data/score_stack.json +3 -0
  22. app/src/content/assets/data/score_stack.png +3 -0
  23. app/src/content/assets/data/score_vs_failed_guesses.json +2 -2
  24. app/src/content/assets/data/score_vs_failed_guesses.png +2 -2
  25. app/src/content/assets/data/summary.txt +91 -51
  26. app/src/content/chapters/eleusis/benchmark.mdx +2 -2
  27. app/src/content/chapters/eleusis/results.mdx +59 -42
  28. app/src/content/embeds/by-rule.html +521 -0
  29. app/src/content/embeds/calibration-curves.html +537 -0
  30. app/src/content/embeds/caution-vs-failed-guesses.html +369 -0
  31. app/src/content/embeds/complexity-analysis.html +492 -0
  32. app/src/content/embeds/confidence-distribution.html +495 -0
  33. app/src/content/embeds/excess-caution.html +384 -0
  34. app/src/content/embeds/reckless-guessing.html +400 -0
  35. app/src/content/embeds/score-stack.html +440 -0
  36. app/src/content/embeds/score-vs-failed-guesses.html +369 -0
  37. dark-mode-image.md +48 -0
  38. interactive-charts.md +498 -0
ASSESSMENT.md ADDED
@@ -0,0 +1,291 @@
+ # Critical Assessment: Eleusis Benchmark Article
+
+ ## Executive Summary
+
+ The article presents an interesting benchmark with solid methodology and rich data. The main structural issue is that the **Results section tells a fragmented story about guessing behavior**, spreading related insights across 6+ subsections without a clear narrative arc. The key message—that metacognition matters and models have distinct "scientific personalities"—gets lost in the noise.
+
+ Additionally, there are **data consistency issues** between the text and the underlying data files that need resolution before publication.
+
+ ---
+
+ ## 1. Critical Issues
+
+ ### 1.1 Data Inconsistencies
+
+ The numbers in the text don't match `summary.txt`. For example:
+
+ | Metric | In Text | In summary.txt |
+ |--------|---------|----------------|
+ | Claude Opus 4.5 avg score | 15.88 (CLAUDE.md) | 14.46 |
+ | Kimi K2 avg score | 14.53 (CLAUDE.md) | 10.31 |
+ | GPT 5.2 High rank | "third place" | Actually 1st by avg_score (14.85) |
+
+ **Action needed:** Audit all numbers in the text against the latest data files.
+
+ ### 1.2 Results Section: Scattered Narrative
+
+ The guessing behavior story is currently spread across:
+
+ 1. "Confidence and Calibration" - calibration curves, confidence distribution
+ 2. "Guessing Strategy" - score vs failed guesses
+ 3. "The Caution-Recklessness Trade-off" - early correct turns, caution scatter
+ 4. "Alternative Scoring Systems" - score stack breakdown
+ 5. "Analysis of the reckless guessing behavior" - double-down rate
+
+ These all address the same fundamental question: **How do models decide when to commit?** But the current structure forces readers to piece together the story themselves.
+
+ **Problem:** A reader finishing the Results section doesn't have a clear mental model of "what makes some models better than others."
+
+ ---
+
+ ## 2. Suggested Restructuring
+
+ ### Option A: Reorganize Around the Key Insight
+
+ **Proposed Results structure:**
+
+ ```
+ ## Results
+
+ ### Overall Performance (keep as-is)
+ Brief overview, scatter plot of score vs tokens
+
+ ### Finding the Rule: Who Gets It Right?
+ - Success rates by model
+ - Performance by rule complexity
+ - Brief: what capabilities matter for finding rules
+
+ ### Knowing When You Know: The Metacognition Challenge
+ [This is the heart of the article - elevate it]
+ - The caution-recklessness trade-off (central framing)
+ - Caution analysis: early correct turns, GPT 5.2 waits too long
+ - Recklessness analysis: failed guesses, double-down rates
+ - The scatter plot showing the trade-off (Figure 6)
+ - Why Claude Opus wins: good enough at finding + great at timing
+
+ ### Confidence and Calibration
+ - Calibration curves (all models overconfident)
+ - Confidence distribution when guessing
+ - Brief: why calibration enables good timing decisions
+
+ ### Alternative Scoring: Robustness Check
+ - Score stack shows the penalty different behaviors pay
+ - Confirms that metacognition, not just rule-finding, drives scores
+ ```
+
+ **Benefits:**
+ - The key message (metacognition matters) becomes structurally prominent
+ - Reader builds understanding progressively: first "can they solve it?", then "do they know when they've solved it?"
+ - Eliminates the feeling of "lots of charts, hard to synthesize"
+
+ ### Option B: Two-Act Structure
+
+ ```
+ ## Results
+
+ ### Act 1: The Leaderboard (compact)
+ - Overall performance scatter
+ - Success rates
+ - One paragraph summary: "Models vary from 70% to 96% success rate..."
+
+ ### Act 2: The Real Story—Scientific Temperaments
+ [Frame models as having distinct "personalities"]
+
+ The Cautious Achiever: GPT 5.2 High
+ - Highest success rate, but 3rd in score
+ - Figure: excess caution distribution
+ - Lost ~3.6 points per round to over-caution
+
+ The Balanced Scientist: Claude Opus 4.5
+ - Not the best at finding rules, but best at knowing when
+ - Commits quickly, accepts occasional wrong guesses
+
+ The Reckless Guesser: Claude Haiku 4.5 / DeepSeek R1
+ - Commits before sufficient evidence
+ - Double-down behavior after failures
+
+ Visualizing the Trade-off
+ - Caution vs recklessness scatter (the key figure)
+ - Score stack showing what each "personality" costs
+
+ ### Calibration: Why Timing Is Hard
+ - Overconfidence makes timing decisions unreliable
+ - Even well-performing models poorly calibrated
+ ```
+
+ **Benefits:**
+ - Memorable framing (scientific personalities)
+ - Natural story arc
+ - Each model type is clearly characterized
+
+ ---
122
+
123
+ ## 3. Missing Content
124
+
125
+ ### 3.1 Figures Marked as TODO
126
+
127
+ - **Learning curves figure** (analysis.mdx:22) - Would show within-round dynamics
128
+ - **Failure mode distribution** (analysis.mdx:55) - Stacked bar by model
129
+
130
+ **Recommendation:** The learning curves figure would be valuable if you have the data. The failure mode classification might be hard to automate reliably—consider whether a few qualitative examples serve the purpose better.
131
+
132
+ ### 3.2 Human Baseline
133
+
134
+ Mentioned in limitations but this is a significant gap. Without human performance, readers can't judge if 92% success is impressive or trivial.
135
+
136
+ **Options:**
137
+ - Run a small human study (even N=5 would help)
138
+ - Cite related work on human performance in similar inductive reasoning tasks
139
+ - Frame it explicitly as "relative comparison between models" not absolute capability assessment
140
+
141
+ ### 3.3 Example Turn Figure
142
+
143
+ benchmark.mdx shows the JSON output format but doesn't illustrate what a complete turn looks like in context (game state → reasoning → decision).
144
+
145
+ **Recommendation:** Add a figure showing:
146
+ ```
147
+ [Current board state visualization]
148
+ [Model reasoning excerpt]
149
+ [Decision: play 4♣, confidence 6, don't guess yet]
150
+ [Outcome: accepted/rejected]
151
+ ```
152
+
153
+ This makes the task concrete for readers.
154
+
155
+ ---
156
+
157
+ ## 4. The "Deeper Analysis" Section
158
+
159
+ Currently a grab-bag of interesting observations with TODOs. Your instinct to replace with "Discussion" is right.
160
+
161
+ ### Proposed: Discussion Section
162
+
163
+ ```
164
+ ## Discussion
165
+
166
+ ### What Explains the Performance Gap?
167
+ - Metacognition (knowing when you know) is the key differentiator
168
+ - Success rate alone doesn't predict score (GPT 5.2 vs Opus example)
169
+ - Calibration enables good timing, but no model is well-calibrated
170
+
171
+ ### Open vs Proprietary Models
172
+ - Kimi K2 competitive on rule-finding
173
+ - But open models trend toward reckless guessing (training objective differences?)
174
+ - Opportunity: calibration tuning could improve open model performance
175
+
176
+ ### Failure Modes [keep the accordion, it's useful]
177
+
178
+ ### Implications for AI-Assisted Science
179
+ - The caution-recklessness trade-off mirrors real scientific decision-making
180
+ - An overconfident AI assistant could lead researchers astray
181
+ - An overcautious one wastes resources on unnecessary verification
182
+ ```
183
+
184
+ ### Move to Appendix
185
+
186
+ - Symmetric rules analysis (interesting but niche)
187
+ - Confirmation bias (preliminary, needs more work)
188
+ - Detailed qualitative examples (unless you expand them significantly)
189
+
190
+ ---
191
+
192
+ ## 5. Framing Suggestions
193
+
194
+ ### 5.1 Lead with the Surprise
195
+
196
+ Current opening of Results is fine, but the key insight (metacognition matters) comes too late. Consider foreshadowing in the introduction:
197
+
198
+ > "We found something surprising: the model with the highest success rate doesn't have the highest score. What matters isn't just finding the answer—it's knowing when you've found it."
199
+
200
+ ### 5.2 The "Scientific Personality" Frame
201
+
202
+ This is potentially memorable and shareable. Models as:
203
+ - **The Perfectionist** (GPT 5.2 High): Always wants more evidence
204
+ - **The Pragmatist** (Claude Opus 4.5): Good enough evidence is enough
205
+ - **The Gambler** (Claude Haiku 4.5): Guesses based on vibes
206
+
207
+ This framing:
208
+ - Makes the article more accessible to non-specialists
209
+ - Creates natural anchors for discussion
210
+ - Is scientifically defensible (behavioral clustering is real)
211
+
212
+ ### 5.3 The Decision Theory Angle
213
+
214
+ You mention the optimal guessing threshold (0.67 confidence) briefly. This could be expanded:
215
+
216
+ > "Given perfect calibration, the optimal strategy is to guess whenever confidence exceeds 67%. But no model is well-calibrated. GPT 5.2 High effectively uses a threshold of ~95%; Claude Haiku 4.5 seems to use ~50%."
217
+
218
+ This quantifies the "personalities" and connects to calibration.
219
+
220
+ ---
221
+
222
+ ## 6. Minor Issues
223
+
224
+ ### 6.1 Typos/Grammar
225
+
226
+ - results.mdx:38: "overconfident : for instance" → extra space before colon
227
+ - results.mdx:39: "GPT 5.2 is the best calibrated" → should be "GPT 5.2 High"
228
+ - results.mdx:51: "closed to Claude Opus 4.5" → "close to"
229
+ - results.mdx:103: "constrats" → "contrasts"
230
+ - analysis.mdx:60: "GPT OSS 120B also performs respectably at 12.0" → check number
231
+
232
+ ### 6.2 Caption Numbering
233
+
234
+ Figure 7 appears twice (score-stack and reckless-guessing). Fix numbering.
235
+
236
+ ### 6.3 Model Names Consistency
237
+
238
+ Inconsistent capitalization and naming:
239
+ - "Claude Opus 4.5" vs "Claude 4.5 Opus"
240
+ - "GPT 5.2 High" vs "Gpt 5.2 High" (in data files)
241
+ - "DeepSeek R1" vs "Deepseek R1"
242
+
243
+ ---
244
+
245
+ ## 7. Ideas for Additional Content
246
+
247
+ ### 7.1 Interactive "Play a Round" Demo
248
+
249
+ Let readers play one round against a rule to experience the task. Even a simple version would be compelling. (This could be a stretch goal.)
250
+
251
+ ### 7.2 Model-Specific Breakdowns
252
+
253
+ You have per-model PNG files (`model_claude_opus_4_5.png`, etc.). Consider:
254
+ - Appendix section with one page per model
255
+ - Or: expandable accordion for each model's detailed stats
256
+
257
+ ### 7.3 Token Efficiency Discussion
258
+
259
+ You show score vs tokens in Figure 1 but don't discuss it much. Gemini 3 Flash achieves decent results with 4x fewer tokens than Opus—is that worth highlighting for practitioners?
260
+
261
+ ### 7.4 Prompt Sensitivity
262
+
263
+ You note this as a limitation but could briefly test: what if you told models to be more cautious? More aggressive? (Could be future work suggestion.)
264
+
265
+ ---
266
+
267
+ ## 8. Prioritized Action Items
268
+
269
+ ### Must Fix
270
+ 1. Audit all numbers against latest data files
271
+ 2. Fix duplicate Figure 7 numbering
272
+ 3. Fix typos listed above
273
+
274
+ ### Should Do
275
+ 4. Reorganize Results section (Option A or B above)
276
+ 5. Rename "Deeper Analysis" to "Discussion" and restructure
277
+ 6. Add foreshadowing of key insight in introduction
278
+
279
+ ### Nice to Have
280
+ 7. Add example turn figure in benchmark.mdx
281
+ 8. Expand "scientific personalities" framing
282
+ 9. Human baseline (even informal)
283
+ 10. Per-model detail pages in appendix
284
+
285
+ ---
286
+
287
+ ## 9. Summary
288
+
289
+ The benchmark and data are solid. The article's main weakness is structural: it has too many charts telling pieces of the same story without a clear narrative spine. The fix is to reorganize around **the key insight** (metacognition matters more than raw rule-finding ability) and **the key visual** (the caution-recklessness scatter plot).
290
+
291
+ Your target message—"Models differ dramatically because metacognition matters, and this is an opportunity for improvement"—is supported by the data but not yet prominently surfaced by the article structure.
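
Reviewer note on ASSESSMENT.md, section 1.1 and action item 1: the audit of text numbers against the data files could be scripted. The sketch below is only an illustration and is not part of this commit. The helper names `normalize` and `find_mismatches` and the tolerance are hypothetical. The two dictionaries hard-code values quoted in the section 1.1 table, where a real audit would parse them from the `.mdx` text and from `basic_metrics.csv`. Name matching is case-insensitive because section 6.3 notes that the data files write "Gpt 5.2 High" where the text writes "GPT 5.2 High".

```python
import math


def normalize(name: str) -> str:
    """Case-insensitive key so 'Gpt 5.2 High' and 'GPT 5.2 High' match."""
    return " ".join(name.lower().split())


def find_mismatches(claimed, actual, tol=0.005):
    """Return (model, claimed_value, data_value) for every disagreement.

    A model missing from the data is reported with data_value None.
    """
    actual_by_key = {normalize(k): v for k, v in actual.items()}
    mismatches = []
    for model, value in claimed.items():
        found = actual_by_key.get(normalize(model))
        if found is None or not math.isclose(value, found, abs_tol=tol):
            mismatches.append((model, value, found))
    return mismatches


# Values quoted in the text, per the table in section 1.1.
claimed = {"Claude Opus 4.5": 15.88, "Kimi K2": 14.53}
# avg_score values from the section 1.1 table (summary.txt column).
actual = {"Claude Opus 4.5": 14.46, "Kimi K2": 10.31, "Gpt 5.2 High": 14.85}

mismatches = find_mismatches(claimed, actual)
for model, in_text, in_data in mismatches:
    print(f"{model}: text says {in_text}, data says {in_data}")
```

Running a check like this after regenerating the data files would flag stale numbers before publication. The tolerance of 0.005 assumes the text rounds scores to two decimals.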
app/src/content/assets/data/basic_metrics.csv CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:84c461621159e77fa9c7a1370138dd35da740c50943f5b6966fa801a50c8479f
- size 2145
+ oid sha256:646b5eda63192bed7d4c3372c684b263db844ad6599e2cff7cd34b945e0a03da
+ size 2743
app/src/content/assets/data/by_rule.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1651057430922f5919ff3a4f6c005baed488a730bf90848e078044ddf0910a85
- size 5887
+ oid sha256:bedd8081e1e412f0d2453c0f6fe78153fed8433520b9e1b729fc7b11dd5b02a8
+ size 30709
app/src/content/assets/data/by_rule.png CHANGED

Git LFS Details

  • SHA256: 157397fb6d139b6399e87166bc83c7c6a0183ec8aa28a81874a6314d2f092fc7
  • Pointer size: 131 Bytes
  • Size of remote file: 340 kB

Git LFS Details

  • SHA256: f7c7d4ff1a927f2d44209feb1979ca355f79fa75a03e13ac413d4bdba84012a6
  • Pointer size: 131 Bytes
  • Size of remote file: 363 kB
app/src/content/assets/data/complexity_analysis.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2217f0d480685678ad3d6724a5d8f8d4eb73a95a883af9fe077c6fcb476eabba
- size 2473
+ oid sha256:a281c2834fce731ee67126dc08e307268f411c4b7ec24006d36edccd303a6e6d
+ size 2273
app/src/content/assets/data/complexity_analysis.png CHANGED

Git LFS Details

  • SHA256: 0e4c4614bdbf004ef9e898eaaeab152ea08437891d76d3a6da9ef9cf44c86bbc
  • Pointer size: 130 Bytes
  • Size of remote file: 99.7 kB

Git LFS Details

  • SHA256: 7d7ed142b4271802c43e9c385ac2fd01da0a9008903655477d1a76608af86fe1
  • Pointer size: 131 Bytes
  • Size of remote file: 111 kB
app/src/content/assets/data/model_claude_haiku_4_5.png ADDED

Git LFS Details

  • SHA256: e156f35fcb3f764435fccf4ee3ce16b71f594721e11b673fa122f95cccc5c524
  • Pointer size: 131 Bytes
  • Size of remote file: 248 kB
app/src/content/assets/data/model_claude_opus_4_5.png ADDED

Git LFS Details

  • SHA256: 250b07856543f2443a6b8ba3c20e15f24e3eb31bbbeda1d1e9555a5d8f4bf1b9
  • Pointer size: 131 Bytes
  • Size of remote file: 217 kB
app/src/content/assets/data/model_deepseek_r1.png ADDED

Git LFS Details

  • SHA256: 9408c2f99fb62f626909a296150243597be4ba2976d68b2c7b848b5fcba4f33a
  • Pointer size: 131 Bytes
  • Size of remote file: 249 kB
app/src/content/assets/data/model_gemini_3_flash_preview_low.png ADDED

Git LFS Details

  • SHA256: 75ca6f7798384cf21e16d6ba6a9a7c8eca3d3abb7849767ee628319486dae785
  • Pointer size: 131 Bytes
  • Size of remote file: 235 kB
app/src/content/assets/data/model_gpt_5_2_high.png ADDED

Git LFS Details

  • SHA256: 3ce556cf0f570a3c13535287608f734e8f103e308a7cd8db80f355f309003e6c
  • Pointer size: 131 Bytes
  • Size of remote file: 194 kB
app/src/content/assets/data/model_gpt_5_mini_medium.png ADDED

Git LFS Details

  • SHA256: 8f2a375dfbf81219ac33ceef6f42f4c8d9028d4a3867920a44218db927056985
  • Pointer size: 131 Bytes
  • Size of remote file: 210 kB
app/src/content/assets/data/model_gpt_oss_120b.png ADDED

Git LFS Details

  • SHA256: c4d22765888054e1b220a66e9bf42278bf46aab27a3a8733f0ebb7e71db9c13e
  • Pointer size: 131 Bytes
  • Size of remote file: 259 kB
app/src/content/assets/data/model_gpt_oss_20b.png ADDED

Git LFS Details

  • SHA256: 41b5dfd6881d9e30e03a91de49517faa4a9cb94c9b88031ce0f52ceb431470df
  • Pointer size: 131 Bytes
  • Size of remote file: 270 kB
app/src/content/assets/data/model_grok_4_1_fast_reasoning.png ADDED

Git LFS Details

  • SHA256: 08f64a210f54161c501c19e8906518c7d5a6cc55b36749e9c31cb570a09170ee
  • Pointer size: 131 Bytes
  • Size of remote file: 221 kB
app/src/content/assets/data/model_kimi_k2.png ADDED

Git LFS Details

  • SHA256: 04d4c263b639177670769f818380e061a69b259e7aa073b1151fbd737d19cd07
  • Pointer size: 131 Bytes
  • Size of remote file: 238 kB
app/src/content/assets/data/overall_performance.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:f797b18c405fd0b865d372d6dfe74e823a42a0f8f057a144b08208c5f6fb29d0
- size 2286
+ oid sha256:67f55d87526715789a9b2c902de6acc78f69dc5fd13300eb97e511668bca8003
+ size 2303
app/src/content/assets/data/overall_performance.png CHANGED

Git LFS Details

  • SHA256: 1e634cee9f65439c9a25dad03e70e73f8e5722091da456a6a7206893f50039dc
  • Pointer size: 130 Bytes
  • Size of remote file: 75.8 kB

Git LFS Details

  • SHA256: 9d182c87b70f17018bd0664f1812b0ed0b99dbb107e1e455810f64dd21040f24
  • Pointer size: 130 Bytes
  • Size of remote file: 76.2 kB
app/src/content/assets/data/reckless_guessing.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a708723564f2779c2600346e347e2cff985a247bc950707d7f5c58137e05395b
+ size 19220
app/src/content/assets/data/reckless_guessing.png ADDED

Git LFS Details

  • SHA256: a73c1561ab35ed2e308d9cea71e3c77c116fb3d1d5619878a60d68ee1a031fbe
  • Pointer size: 130 Bytes
  • Size of remote file: 69.6 kB
app/src/content/assets/data/score_stack.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d64dd73c3b7173b627be30fab1720d57fde169a419d6038a9dec3129a2c93a60
+ size 3723
app/src/content/assets/data/score_stack.png ADDED

Git LFS Details

  • SHA256: 770e7bbbe723acad84dd1ecd4ff8310abd3fd60417953c5b961464d85111e328
  • Pointer size: 130 Bytes
  • Size of remote file: 83.3 kB
app/src/content/assets/data/score_vs_failed_guesses.json CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:2dd1c504aa5cfc6e5212ce7838b71c98002a94d1f10d5116ba92cff5924ccaae
- size 2212
+ oid sha256:581795032120f5075ef4f805472d19deebe0602aa6737e07bc62a35062f97758
+ size 2215
app/src/content/assets/data/score_vs_failed_guesses.png CHANGED

Git LFS Details

  • SHA256: 6caeee464272ac6774ca71a6f28fd7f1f6c10bb4820ec69161d8213f89be5ccb
  • Pointer size: 130 Bytes
  • Size of remote file: 73.4 kB

Git LFS Details

  • SHA256: 655ad7167280626d2e194caf6115b8afee946a61faca5cc0b2b2f9ded65c6999
  • Pointer size: 130 Bytes
  • Size of remote file: 73.7 kB
app/src/content/assets/data/summary.txt CHANGED
@@ -25,17 +25,17 @@ Loaded colors for 17 models
25
  BASIC MODEL COMPARISON
26
  ============================================================
27
 
28
- model rounds_played total_score avg_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio
29
- Claude Opus 4.5 78 1239 15.884615 825 4333716 86367.64 2.769231 0.923077 5252.989091 104.688048 30.871795 92.648376 0.333215
30
- Kimi K2 78 1133 14.525641 955 12281540 101346.76 4.025641 0.858974 12860.251309 106.122262 48.679487 116.854872 0.416581
31
- Gpt 5.2 High 78 1102 14.128205 1200 3341037 73525.83 0.333333 0.961538 2784.197500 61.271525 25.346154 36.062906 0.702832
32
- Gpt 5 Mini Medium 78 1001 12.833333 1247 3618399 58345.97 1.256410 0.756410 2901.683240 46.789070 40.051282 79.228889 0.505514
33
- Grok 4 1 Fast Reasoning 78 976 12.512821 962 8178655 120364.22 4.320513 0.884615 8501.720374 125.118732 69.358974 182.704274 0.379624
34
- Gemini 3 Flash Preview Low 78 955 12.243590 1299 1581524 12702.02 1.717949 0.769231 1217.493457 9.778306 35.910256 81.480513 0.440722
35
- Gpt Oss 120B 78 938 12.025641 1226 3190828 24633.15 3.692308 0.756410 2602.632953 20.092292 51.320513 80.710427 0.635860
36
- Deepseek R1 78 853 10.935897 1069 9229131 165334.16 5.064103 0.833333 8633.424696 154.662451 69.705128 166.426838 0.418833
37
- Gpt Oss 20B 78 773 9.910256 1277 7009392 62397.50 6.205128 0.717949 5488.952232 48.862569 80.782051 122.849402 0.657570
38
- Claude Haiku 4.5 78 713 9.141026 1223 6973411 57734.39 7.551282 0.705128 5701.889616 47.207187 88.576923 152.125983 0.582260
39
 
40
  Saved: results/260121_78_rounds/basic_metrics.csv
41
  Saved: results/260121_78_rounds/overall_performance.png
@@ -46,6 +46,26 @@ Saved: results/260121_78_rounds/calibration_curves.png
46
  Saved: results/260121_78_rounds/calibration_curves.json
47
  Saved: results/260121_78_rounds/confidence_distribution.png
48
  Saved: results/260121_78_rounds/confidence_distribution.json
49
 
50
  ============================================================
51
  BY-RULE ANALYSIS
@@ -53,32 +73,32 @@ BY-RULE ANALYSIS
53
 
54
  Score by rule (sorted by avg_score):
55
  rule_description count avg_score std_score success_rate
56
- Only red cards (hearts or diamonds). 30 24.666667 2.218004 1.000000
57
- Only cards of the suit spades. 30 24.233333 2.045741 1.000000
58
- Cards must alternate between red and black colors. Any card may start the line. 30 24.200000 2.670400 1.000000
59
- Only cards with an even rank (2,4,6,8,10,12). 30 23.466667 2.812942 1.000000
60
- The card must be of a different suit than the card just before it. Any card may start the line. 30 21.500000 6.317736 0.966667
61
- Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 30 20.066667 5.051004 1.000000
62
- Only hearts, clubs, and diamonds allowed. Spades are forbidden. 30 19.933333 5.501933 0.966667
63
- Only Aces (rank 1) . 30 19.500000 8.569191 0.966667
64
- Only ranks that are prime numbers (2,3,5,7,11,13). 30 19.266667 6.781991 0.966667
65
- The card must be of a different suit than but same color as the card just before it. Any card may start the line. 30 19.166667 7.479090 1.000000
66
- Only face cards (11,12,13). 30 19.000000 8.068671 0.900000
67
- Only spades and diamonds. 30 18.400000 4.476760 1.000000
68
- Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 30 14.700000 12.151770 1.000000
69
- Only cards between 1 and 7 inclusive. 30 13.400000 8.495841 0.966667
70
- Only black face cards. 30 10.333333 9.830752 0.900000
71
- Alternate face and number cards. Any card may start the line. 30 7.100000 12.273745 0.733333
72
- Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 30 6.966667 10.607360 0.600000
73
- Only cards between 5 and 9 inclusive. 30 6.600000 9.264690 0.933333
74
- Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 30 6.433333 11.990466 0.666667
75
- Only red cards whose rank is <=7. 30 4.366667 11.109124 1.000000
76
- Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 30 1.700000 11.166915 0.766667
77
- Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 30 0.766667 4.031628 0.133333
78
- Face cards (11-13) must be red; number cards (1-10) must be black. 30 0.533333 8.357253 0.500000
79
- Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 30 0.466667 7.951288 0.400000
80
- If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 30 -1.766667 9.743905 0.333333
81
- Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 30 -2.233333 8.319828 0.533333
82
 
83
  Saved: results/260121_78_rounds/by_rule.png
84
  Saved: results/260121_78_rounds/by_rule.json
@@ -112,22 +132,42 @@ Saved: results/260121_78_rounds/caution_vs_failed_guesses.png
112
  Saved: results/260121_78_rounds/caution_vs_failed_guesses.json
113
 
114
  ============================================================
115
- COMPLEXITY ANALYSIS
116
  ============================================================
117
 
118
- Optimal K for aggregated complexity: 0.10
119
- Formula: complexity = cyclomatic + 0.10 * node_count
120
- Correlation with relative_score: -0.478
121
-
122
- Score by complexity quartile:
123
- complexity_bin count avg_score avg_relative_score success_rate
124
- Q1 240 18.850000 1.543589 0.966667
125
- Q2 150 13.353333 1.076587 0.893333
126
- Q3 210 12.228571 0.977793 0.800000
127
- Q4 180 3.266667 0.237300 0.572222
128
-
129
- Saved: results/260121_78_rounds/complexity_analysis.png
130
- Saved: results/260121_78_rounds/complexity_analysis.json
131
 
132
  ============================================================
133
  PER-MODEL REPORTS
 
25
  BASIC MODEL COMPARISON
26
  ============================================================
27
 
28
+ model rounds_played total_score avg_score total_floored_score avg_floored_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate total_no_stakes_score avg_no_stakes_score avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio
29
+ Gpt 5.2 High 78 1158 14.846154 1174 15.051282 1205 3341037 73525.83 0.333333 0.961538 1505.0 19.294872 2772.644813 61.017286 25.858974 43.513162 0.594279
30
+ Claude Opus 4.5 78 1128 14.461538 1324 16.974359 852 4333716 86367.64 2.769231 0.923077 1598.0 20.487179 5086.521127 101.370469 87.525641 180.000684 0.486252
31
+ Gpt 5 Mini Medium 78 942 12.076923 1052 13.487179 1261 3618399 58345.97 1.256410 0.756410 1325.0 16.987179 2869.467883 46.269603 58.166667 115.878291 0.501963
32
+ Gemini 3 Flash Preview Low 78 817 10.474359 1024 13.128205 1315 1581524 12702.02 1.717949 0.769231 1226.0 15.717949 1202.679848 9.659331 61.128205 154.810427 0.394858
33
+ Kimi K2 78 804 10.307692 1262 16.179487 975 12281540 101346.76 4.025641 0.858974 1481.0 18.987179 12596.451282 103.945395 182.564103 343.003761 0.532251
34
+ Grok 4 1 Fast Reasoning 78 737 9.448718 1182 15.153846 998 8178655 120364.22 4.320513 0.884615 1441.0 18.474359 8195.045090 120.605431 109.256410 357.652821 0.305482
35
+ Gpt Oss 120B 78 580 7.435897 1004 12.871795 1243 3190828 24633.15 3.692308 0.756410 1279.0 16.397436 2567.037812 19.817498 186.794872 225.517949 0.828293
36
+ Deepseek R1 78 511 6.551282 1036 13.282051 1104 9229131 165334.16 5.064103 0.833333 1331.0 17.064103 8359.720109 149.759203 152.269231 353.910598 0.430248
37
+ Gpt Oss 20B 78 131 1.679487 927 11.884615 1297 7009392 62397.50 6.205128 0.717949 1206.0 15.461538 5404.311488 48.109098 230.115385 421.666496 0.545728
38
+ Claude Haiku 4.5 78 -37 -0.474359 894 11.461538 1254 6973411 57734.39 7.551282 0.705128 1198.0 15.358974 5560.933812 46.040183 244.730769 504.499316 0.485096
39
 
40
  Saved: results/260121_78_rounds/basic_metrics.csv
41
  Saved: results/260121_78_rounds/overall_performance.png
 
46
  Saved: results/260121_78_rounds/calibration_curves.json
47
  Saved: results/260121_78_rounds/confidence_distribution.png
48
  Saved: results/260121_78_rounds/confidence_distribution.json
49
+ Saved: results/260121_78_rounds/score_stack.png
50
+ Saved: results/260121_78_rounds/score_stack.json
51
+
52
+ ============================================================
53
+ COMPLEXITY ANALYSIS
54
+ ============================================================
55
+
56
+ Optimal K for aggregated complexity: 0.42
57
+ Formula: complexity = cyclomatic + 0.42 * node_count
58
+ Correlation with success_rate: -0.612
59
+
60
+ Stats by complexity quartile:
61
+ complexity_bin count avg_score success_rate
62
+ Q1 240 18.745833 0.966667
63
+ Q2 150 11.246667 0.893333
64
+ Q3 180 11.138889 0.866667
65
+ Q4 210 -6.761905 0.547619
66
+
67
+ Saved: results/260121_78_rounds/complexity_analysis.png
68
+ Saved: results/260121_78_rounds/complexity_analysis.json
69
 
70
  ============================================================
71
  BY-RULE ANALYSIS
 
73
 
74
  Score by rule (sorted by avg_score):
75
  rule_description count avg_score std_score success_rate
76
+ Only red cards (hearts or diamonds). 30 25.633333 2.204749 1.000000
77
+ Only cards of the suit spades. 30 25.200000 2.023994 1.000000
78
+ Cards must alternate between red and black colors. Any card may start the line. 30 25.166667 2.640315 1.000000
79
+ Only cards with an even rank (2,4,6,8,10,12). 30 24.300000 2.692903 1.000000
80
+ The card must be of a different suit than the card just before it. Any card may start the line. 30 21.666667 8.659590 0.966667
81
+ Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 30 20.666667 5.148373 1.000000
82
+ Only Aces (rank 1) . 30 20.233333 8.931476 0.966667
83
+ The card must be of a different suit than but same color as the card just before it. Any card may start the line. 30 19.866667 7.541761 1.000000
84
+ Only hearts, clubs, and diamonds allowed. Spades are forbidden. 30 19.533333 10.836507 0.966667
85
+ Only spades and diamonds. 30 19.066667 4.487018 1.000000
86
+ Only ranks that are prime numbers (2,3,5,7,11,13). 30 18.633333 12.527166 0.966667
87
+ Only face cards (11,12,13). 30 17.033333 16.044084 0.900000
88
+ Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 30 15.100000 12.234350 1.000000
89
+ Only cards between 1 and 7 inclusive. 30 13.366667 10.148835 0.966667
90
+ Only black face cards. 30 7.700000 16.316165 0.900000
91
+ Only red cards whose rank is <=7. 30 4.866667 11.227225 1.000000
92
+ Only cards between 5 and 9 inclusive. 30 4.666667 14.406257 0.933333
93
+ Alternate face and number cards. Any card may start the line. 30 0.366667 20.553519 0.733333
94
+ Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 30 -1.066667 20.915154 0.666667
95
+ Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 30 -3.433333 22.931206 0.600000
96
+ Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 30 -5.200000 18.917972 0.766667
97
+ Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 30 -10.466667 13.050917 0.533333
98
+ Face cards (11-13) must be red; number cards (1-10) must be black. 30 -11.500000 17.814659 0.500000
99
+ Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 30 -12.066667 16.772172 0.400000
100
+ If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 30 -15.633333 15.354396 0.333333
101
+ Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 30 -18.000000 16.103116 0.133333
102
 
103
  Saved: results/260121_78_rounds/by_rule.png
104
  Saved: results/260121_78_rounds/by_rule.json
 
132
  Saved: results/260121_78_rounds/caution_vs_failed_guesses.json
133
 
134
  ============================================================
135
+ RECKLESS GUESSING ANALYSIS
136
  ============================================================
137
 
138
+ Double-Down Rate: After a wrong guess, % of next turns with another guess
139
+ (Only counts official guesses, not shadow/tentative guesses)
140
+
141
+ Model Wrong Guesses Next Turn Guesses Double-Down %
142
+ Kimi K2 314 207 65.9
143
+ Claude Haiku 4.5 589 362 61.5
144
+ Grok 4 1 Fast Reasoning 337 203 60.2
145
+ Gpt Oss 20B 484 290 59.9
146
+ Deepseek R1 395 229 58.0
147
+ Claude Opus 4.5 216 91 42.1
148
+ Gpt Oss 120B 288 108 37.5
149
+ Gemini 3 Flash Preview Low 134 41 30.6
150
+ Gpt 5 Mini Medium 98 9 9.2
151
+ Gpt 5.2 High 26 1 3.8
152
+
153
+ Wrong Guess Streak Statistics:
154
+ Model Streaks Mean Length Max Length Total Wrong
155
+ Kimi K2 120 2.62 14 314
156
+ Claude Haiku 4.5 244 2.41 16 589
157
+ Grok 4 1 Fast Reasoning 149 2.26 12 337
158
+ Gpt Oss 20B 207 2.34 13 484
159
+ Deepseek R1 180 2.19 9 395
160
+ Claude Opus 4.5 139 1.55 5 216
161
+ Gpt Oss 120B 184 1.57 8 288
162
+ Gemini 3 Flash Preview Low 97 1.38 4 134
163
+ Gpt 5 Mini Medium 91 1.08 3 98
164
+ Gpt 5.2 High 25 1.04 2 26
165
+
166
+ Longest streak: 16 consecutive wrong guesses
167
+ - Claude Haiku 4.5 in round 77
168
+
169
+ Saved: results/260121_78_rounds/reckless_guessing.png
170
+ Saved: results/260121_78_rounds/reckless_guessing.json
171
 
172
  ============================================================
173
  PER-MODEL REPORTS
app/src/content/chapters/eleusis/benchmark.mdx CHANGED
@@ -26,9 +26,9 @@ On each turn, the player selects a card from their hand to play. If the card sat
26
 
27
  When correctly guessing the rule, the player scores as many points as the number of remaining turns, and each wrong guess deducts a penalty of 2 points:
28
 
29
- $$\text{score} = (30 - \text{turns\_used}) - 2 \times \text{wrong\_guesses}$$
30
 
31
- A player who correctly identifies the rule on turn 10 with no wrong guesses scores 20 points; one who made 3 wrong guesses along the way scores only 14. Failing to identify the rule scores 0. This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence, exactly the calibration we want to measure.
32
 
33
  ### Rule Library
34
 
 
26
 
27
  When correctly guessing the rule, the player scores as many points as the number of remaining turns, and each wrong guess deducts a penalty of 2 points:
28
 
29
+ $$\text{score} = (30 - \text{turns\_elapsed} + 1) - 2 \times \text{num\_wrong\_guesses}$$
30
 
31
+ A player who correctly identifies the rule on turn 13 with no wrong guesses scores 18 points; one who made 3 wrong guesses along the way scores only 12. Failing to identify the rule earns no points, but the penalties for wrong guesses still apply, so the final score can be negative. This creates an interesting tension: guessing early yields more points if correct, but wrong guesses are costly. The optimal strategy requires accurately assessing one's own confidence, exactly the calibration we want to measure.
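
The scoring rule can be checked with a small sketch. The function and constant names here are hypothetical (they do not come from the benchmark code); the logic follows the formula as stated, with an unsolved round earning no base points but still paying the wrong-guess penalties.

```javascript
// Hypothetical helper illustrating the scoring rule described in the text.
const MAX_TURNS = 30;
const WRONG_GUESS_PENALTY = 2;

function roundScore(turnsElapsed, numWrongGuesses, solved) {
  // A solved round earns the remaining turns (counting the current one);
  // an unsolved round earns nothing before penalties.
  const base = solved ? MAX_TURNS - turnsElapsed + 1 : 0;
  return base - WRONG_GUESS_PENALTY * numWrongGuesses;
}

console.log(roundScore(13, 0, true));  // 18
console.log(roundScore(13, 3, true));  // 12
console.log(roundScore(30, 4, false)); // -8
```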
32
 
33
  ### Rule Library
34
 
app/src/content/chapters/eleusis/results.mdx CHANGED
@@ -3,13 +3,6 @@ import Wide from "../../../components/Wide.astro";
3
  import Note from "../../../components/Note.astro";
4
  import Sidenote from "../../../components/Sidenote.astro";
5
  import HtmlEmbed from "../../../components/HtmlEmbed.astro";
6
- import calibrationCurves from "../../assets/data/calibration_curves.png";
7
- import confidenceDistribution from "../../assets/data/confidence_distribution.png";
8
- import scoreVsFailedGuesses from "../../assets/data/score_vs_failed_guesses.png";
9
- import cautionVsFailedGuesses from "../../assets/data/caution_vs_failed_guesses.png";
10
- import excessCaution from "../../assets/data/excess_caution.png";
11
- import byRule from "../../assets/data/by_rule.png";
12
- import complexityAnalysis from "../../assets/data/complexity_analysis.png";
13
 
14
  ## Results
15
 
@@ -34,12 +27,10 @@ Deepseek R1, an open-weight model specialized for reasoning tasks, lags behind a
34
 
35
  Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempt to guess. This allows us to evaluate calibration: does reported confidence match actual accuracy?
36
 
37
- <Image
38
- src={calibrationCurves}
39
- alt="Calibration curves showing reported confidence vs actual success rate for all models"
40
- caption="<strong>Figure 2:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence : they correspond to confidence levels where actual success rates are lower than reported."
41
  id="fig-calibration"
42
- zoomable
43
  />
44
 
45
  The calibration analysis reveals several patterns:
@@ -51,29 +42,27 @@ The calibration analysis reveals several patterns:
51
 
52
  It is also interesting to examine the distribution of confidence levels when models choose to guess.
53
 
54
- <Image
55
- src={confidenceDistribution}
56
- alt="Histogram showing distribution of confidence levels when models choose to guess vs not guess"
57
- caption="<strong>Figure 3:</strong> Distribution of confidence levels. Left: when models choose to formally guess. Right: when models choose not to guess. Well-calibrated models should show clear separation between these distributions."
58
  id="fig-confidence"
59
- zoomable
60
  />
61
 
62
  We can see that some models like Grok 4.1 or Gemini 3 will essentially only guess when very confident (9 or 10). Others, like GPT 5.2 High or Kimi K2, might also guess at confidence level 8. Surprisingly, the best performing model Claude Opus 4.5 has a more spread out guessing behavior, often guessing at confidence levels 7 or even 6. Claude Haiku 4.5 has the most reckless guessing behavior, mostly guessing at confidence levels 6 to 8.
63
 
64
  Being able to separate confidence levels when guessing vs not guessing is an important metacognitive skill. Models that guess only when very confident are less likely to make wrong guesses, but may miss opportunities to commit early and gain points. Models that guess at lower confidence levels risk more wrong guesses, but can capitalize on early correct guesses. This trade-off is explored next.
65
 
 
 
66
 
67
  ### Guessing Strategy
68
 
69
  The scoring system creates a strategic tension: guess early for more points, but wrong guesses are costly. How do models navigate this tradeoff? We can analyze their guessing efficiency by plotting average score vs average number of failed guesses per round.
70
 
71
- <Image
72
- src={scoreVsFailedGuesses}
73
- alt="2D scatter plot showing average score vs average number of failed guesses per round for each model"
74
  caption="<strong>Figure 4:</strong> Score vs. failed guesses per round. Models in the upper-left are efficient (high scores, few wrong guesses). Models that guess recklessly appear on the right with low scores."
75
  id="fig-guessing"
76
- zoomable
77
  />
78
 
79
  <Sidenote>
@@ -84,12 +73,10 @@ The scoring system creates a strategic tension: guess early for more points, but
84
 
85
  Failed guesses tell only half the story. A model might avoid wrong guesses by being *too* cautious—waiting many turns after it already has the correct answer. To measure this, we tracked "early correct turns": how many consecutive turns a model's tentative rule was correct before it finally chose to guess.
86
 
87
- <Image
88
- src={excessCaution}
89
- alt="Box plot showing distribution of early correct turns for each model"
90
- caption="<strong>Figure 5:</strong> Distribution of early correct turns (waiting with the correct answer). Higher values indicate excessive caution—the model knew the answer but hesitated to commit. GPT 5.2 High stands out as extremely cautious, with a median of 3 turns of unnecessary delay."
91
  id="fig-excess-caution"
92
- zoomable
93
  />
94
 
95
  The results reveal striking differences in guessing personalities:
@@ -98,12 +85,10 @@ The results reveal striking differences in guessing personalities:
98
  - **Claude Opus 4.5** shows excellent timing—only 0.9 early correct turns on average, meaning it commits almost immediately after finding the answer.
99
  - **Claude Haiku 4.5** and **DeepSeek R1** are the least cautious (0.5 early turns), but this comes at a cost: they also have the highest failed guess rates.
100
 
101
- <Image
102
- src={cautionVsFailedGuesses}
103
- alt="Scatter plot showing caution (early correct turns) vs recklessness (failed guesses) for each model"
104
  caption="<strong>Figure 6:</strong> The caution-recklessness trade-off. Models in the upper-left are cautious (delay correct guesses); models in the lower-right are reckless (many failed guesses). The ideal position is lower-left: quick to commit when right, rarely wrong."
105
  id="fig-caution-reckless"
106
- zoomable
107
  />
108
 
109
  <Sidenote>
@@ -118,8 +103,45 @@ This visualization reveals distinct behavioral patterns:
118
 
119
  * Deepseek R1 and Claude Haiku 4.5 cluster in the lower-right, being both reckless and not particularly cautious, leading to poor performance.
120
 
 
121
  The data suggests that knowing when you know is just as important as knowing the answer. Claude Opus 4.5's strong performance comes not just from finding correct rules, but from accurate metacognition, recognizing when it has gathered enough evidence to commit, even at the risk of occasional wrong guesses.
122
 
123
  ### Performance by Rule
124
 
125
  Not all rules are created equal. Some rules are discovered quickly by all models (e.g. "All cards must be red") while others prove consistently challenging (e.g. "increase rank after a red card, decrease after a black").
@@ -128,26 +150,21 @@ It is not easy to quantify rule complexity, as it depends on multiple factors: t
128
 
129
  The following figure breaks down performance by rule across all models and runs.
130
 
131
- <Wide>
132
- <Image
133
- src={byRule}
134
- alt="Performance breakdown by rule showing score distribution for each rule across all models"
135
- caption="<strong>Figure 7:</strong> Score distribution by rule. Each row is a different rule, with individual run scores shown as points. Some rules are consistently easy for all models, while others show wide variance and lower scores, indicating higher complexity. For each rule, we computed a complexity score (see below) to analyze its impact on performance."
136
  id="fig-by-rule"
137
- zoomable
138
  />
139
- </Wide>
140
 
141
  We can see that the most complex rules are devastating for the reckless models like Claude Haiku 4.5 and DeepSeek R1, which often get negative scores on these rules due to multiple wrong guesses. Even the best models struggle on the hardest rules, but their superior metacognition allows them to avoid catastrophic failures.
142
 
143
  The following plot breaks down the relative score of each model (as measured by score on the rule divided by average score on all rules) against the complexity metrics of each rule.
144
 
145
- <Image
146
- src={complexityAnalysis}
147
- alt="Scatter plot showing relationship between rule complexity metrics and model performance"
148
- caption="<strong>Figure 8:</strong> Relationship between rule complexity and performance. Multiple complexity factors contribute: acceptance rate, structural complexity, and semantic difficulty."
149
  id="fig-complexity"
150
- zoomable
151
  />
152
 
153
  <Note variant="info">
 
3
  import Note from "../../../components/Note.astro";
4
  import Sidenote from "../../../components/Sidenote.astro";
5
  import HtmlEmbed from "../../../components/HtmlEmbed.astro";
 
 
 
 
 
 
 
6
 
7
  ## Results
8
 
 
27
 
28
  Models are asked to output their confidence level, with clear instructions on what it means (7 = 70% probability of being correct, etc.). Even when they don't guess, they report their tentative rule. When confidence ≥5, we test whether they would have guessed correctly, even if they didn't formally attempt to guess. This allows us to evaluate calibration: does reported confidence match actual accuracy?
29
 
30
+ <HtmlEmbed
31
+ src="calibration-curves.html"
32
+ caption="<strong>Figure 2:</strong> Calibration curves for each model. A perfectly calibrated model would follow the diagonal. Points below the line indicate overconfidence: they correspond to confidence levels where actual success rates are lower than reported. Click legend items to show/hide models."
 
33
  id="fig-calibration"
 
34
  />
35
 
36
  The calibration analysis reveals several patterns:
 
42
 
43
  It is also interesting to examine the distribution of confidence levels when models choose to guess.
44
 
45
+ <HtmlEmbed
46
+ src="confidence-distribution.html"
47
+ caption="<strong>Figure 3:</strong> Distribution of confidence levels when models choose to formally guess. Each bar shows the proportion of guesses made at that confidence level. Click legend items to show/hide models."
 
48
  id="fig-confidence"
 
49
  />
50
 
51
  We can see that some models like Grok 4.1 or Gemini 3 will essentially only guess when very confident (9 or 10). Others, like GPT 5.2 High or Kimi K2, might also guess at confidence level 8. Surprisingly, the best performing model Claude Opus 4.5 has a more spread out guessing behavior, often guessing at confidence levels 7 or even 6. Claude Haiku 4.5 has the most reckless guessing behavior, mostly guessing at confidence levels 6 to 8.
52
 
53
  Being able to separate confidence levels when guessing vs not guessing is an important metacognitive skill. Models that guess only when very confident are less likely to make wrong guesses, but may miss opportunities to commit early and gain points. Models that guess at lower confidence levels risk more wrong guesses, but can capitalize on early correct guesses. This trade-off is explored next.
54
 
55
+ In principle, the scoring system implies a decision-theoretic optimal confidence threshold for guessing. With 1 point earned per remaining turn and a 2-point penalty per wrong guess, the optimal threshold is 0.67: guess when you believe your tentative rule has at least a 67% chance of being correct. This assumes perfect calibration, which none of the models achieve.
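
One simplified way to arrive at a threshold of about 0.67 is sketched below. It rests on assumptions that are not stated in the text: the tentative rule is correct with probability $p$, a correct guess this turn is worth $R$ points, and whether the model waits or guesses wrong, it finds and submits the correct rule on the following turn (worth $R - 1$).

$$\mathbb{E}[\text{guess now}] = p\,R + (1 - p)\,(R - 1 - 2) = R - 3\,(1 - p)$$

$$\mathbb{E}[\text{wait one turn}] = R - 1$$

$$R - 3\,(1 - p) \ge R - 1 \iff p \ge \tfrac{2}{3} \approx 0.67$$

Under these assumptions the threshold does not depend on $R$: a wrong guess costs 3 points in total (the 2-point penalty plus one lost turn), against 1 point for waiting.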
56
+
57
 
58
  ### Guessing Strategy
59
 
60
  The scoring system creates a strategic tension: guess early for more points, but wrong guesses are costly. How do models navigate this tradeoff? We can analyze their guessing efficiency by plotting average score vs average number of failed guesses per round.
61
 
62
+ <HtmlEmbed
63
+ src="score-vs-failed-guesses.html"
 
64
  caption="<strong>Figure 4:</strong> Score vs. failed guesses per round. Models in the upper-left are efficient (high scores, few wrong guesses). Models that guess recklessly appear on the right with low scores."
65
  id="fig-guessing"
 
66
  />
67
 
68
  <Sidenote>
 
73
 
74
  Failed guesses tell only half the story. A model might avoid wrong guesses by being *too* cautious—waiting many turns after it already has the correct answer. To measure this, we tracked "early correct turns": how many consecutive turns a model's tentative rule was correct before it finally chose to guess.
75
 
76
+ <HtmlEmbed
77
+ src="excess-caution.html"
78
+ caption="<strong>Figure 5:</strong> Distribution of early correct turns (waiting with the correct answer). Higher values indicate excessive caution—the model knew the answer but hesitated to commit. GPT 5.2 High stands out as extremely cautious, with a mean of 3.6 turns of unnecessary delay."
 
79
  id="fig-excess-caution"
 
80
  />
81
 
82
  The results reveal striking differences in guessing personalities:
 
85
  - **Claude Opus 4.5** shows excellent timing—only 0.9 early correct turns on average, meaning it commits almost immediately after finding the answer.
86
  - **Claude Haiku 4.5** and **DeepSeek R1** are the least cautious (0.5 early turns), but this comes at a cost: they also have the highest failed guess rates.
87
 
88
+ <HtmlEmbed
89
+ src="caution-vs-failed-guesses.html"
 
90
  caption="<strong>Figure 6:</strong> The caution-recklessness trade-off. Models in the upper-left are cautious (delay correct guesses); models in the lower-right are reckless (many failed guesses). The ideal position is lower-left: quick to commit when right, rarely wrong."
91
  id="fig-caution-reckless"
 
92
  />
93
 
94
  <Sidenote>
 
103
 
104
  * Deepseek R1 and Claude Haiku 4.5 cluster in the lower-right, being both reckless and not particularly cautious, leading to poor performance.
105
 
106
+
107
  The data suggests that knowing when you know is just as important as knowing the answer. Claude Opus 4.5's strong performance comes not just from finding correct rules, but from accurate metacognition, recognizing when it has gathered enough evidence to commit, even at the risk of occasional wrong guesses.
108
 
109
+ This analysis contrasts two ways of losing points: being too cautious (waiting too long to commit) and being too reckless (making too many wrong guesses). One way to visualize this is to explore alternative scoring systems, as we do next.
110
+
111
+
112
+
113
+ ### Alternative Scoring Systems
114
+
115
+ The Eleusis scoring system includes harsh penalties: wrong guesses cost 2 points each, and rounds can end with negative scores. How much do these penalties affect rankings? To understand the impact of our scoring choices, we compare three scoring variants:
116
+
117
+ 1. **Raw score**: The standard scoring (30 - turns elapsed + 1 - 2 × wrong guesses)
118
+ 2. **Floored score**: Same formula, but negative scores are counted as zero
119
+ 3. **No-stakes score**: No penalty for wrong guesses, and tentative rules count as guesses
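
The three variants listed above can be sketched as follows. This is a minimal illustration under assumptions: the round record shape (`turnsElapsed`, `numWrongGuesses`, `solved`, `firstCorrectTentativeTurn`) and the function names are hypothetical, and the no-stakes score is read here as "points for the first turn on which the tentative rule was correct, with no penalties".

```javascript
const MAX_TURNS = 30;
const WRONG_GUESS_PENALTY = 2;

// Standard scoring: remaining turns if solved, minus wrong-guess penalties.
function rawScore(round) {
  const base = round.solved ? MAX_TURNS - round.turnsElapsed + 1 : 0;
  return base - WRONG_GUESS_PENALTY * round.numWrongGuesses;
}

// Same formula, but negative rounds count as zero.
function flooredScore(round) {
  return Math.max(0, rawScore(round));
}

// No penalties; the first correct tentative rule counts as a guess.
// firstCorrectTentativeTurn is null when no tentative rule was ever correct.
function noStakesScore(round) {
  const t = round.firstCorrectTentativeTurn;
  return t === null ? 0 : MAX_TURNS - t + 1;
}

// Invented example round: solved late after many wrong guesses.
const round = { turnsElapsed: 20, numWrongGuesses: 7, solved: true, firstCorrectTentativeTurn: 17 };
console.log(rawScore(round), flooredScore(round), noStakesScore(round)); // -3 0 14
```

In the stacked chart's terms, the flooring gain for this example round would be 3 points (from -3 to 0) and the no-stakes gain a further 14.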
120
+
121
+ <HtmlEmbed
122
+ src="score-stack.html"
123
+ caption="<strong>Figure 7:</strong> Score breakdown under alternative scoring systems. Blue shows raw score (standard scoring). Orange shows flooring gain (what models gain if negative scores count as 0). Green shows no-stakes gain (additional gain from removing wrong-guess penalties). Models sorted by total no-stakes score."
124
+ id="fig-score-stack"
125
+ wide
126
+ />
127
+
128
+ The flooring gain (orange) reveals which models frequently go negative. GPT 5.2 High gains almost nothing from flooring (0.2 points), indicating it rarely makes enough wrong guesses to go negative. In contrast, Claude Haiku 4.5 gains 11.9 points—nearly 12 points of damage averted per round on average—showing how its reckless guessing leads to catastrophic losses.
129
+
130
+ The no-stakes gain (green) shows what models would gain if we simply tested their tentative rule each turn. Interestingly, this gain is relatively consistent across models (2.5–4.2 points), suggesting that most models form correct hypotheses at similar rates, but differ dramatically in their ability to *recognize* when they have the right answer.
131
+
132
+ Under any scoring system, Claude Opus 4.5 and GPT 5.2 High remain the top performers. The compression of the score range under no-stakes scoring (15.4 to 20.5, versus -0.5 to 14.8 for raw scores) indicates that our scoring system rewards good metacognition: knowing when you know.
133
+
134
+
135
+ ### Analysis of Reckless Guessing Behavior
136
+
137
+ Some models lose a lot of points to reckless guessing. Under the "no-stakes" scoring system, Claude Opus 4.5 takes the lead, while Kimi K2 and Grok 4.1 perform similarly to GPT 5.2 High.
138
+
139
+ <HtmlEmbed
140
+ src="reckless-guessing.html"
141
+ caption="<strong>Figure 7b:</strong> Double-down rate: how often a model guesses again immediately after a wrong guess. Higher values indicate more reckless behavior—the model keeps guessing despite recent failures."
142
+ id="fig-reckless-guessing"
143
+ />
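
The double-down rate and the wrong-guess streak statistics can be sketched as follows. This assumes a hypothetical per-round input, one outcome per turn (`'none'`, `'wrong'` or `'correct'`, official guesses only), and treats a streak as wrong guesses on consecutive turns; the actual analysis script may differ.

```javascript
// Computes double-down rate and wrong-guess streak stats for one round.
function recklessStats(turns) {
  let wrongGuesses = 0;
  let nextTurnGuesses = 0;
  const streaks = [];
  let current = 0;

  turns.forEach((outcome, i) => {
    if (outcome === 'wrong') {
      wrongGuesses += 1;
      current += 1;
      // Double-down: any official guess on the very next turn.
      const next = turns[i + 1];
      if (next === 'wrong' || next === 'correct') nextTurnGuesses += 1;
    } else {
      if (current > 0) streaks.push(current);
      current = 0;
    }
  });
  if (current > 0) streaks.push(current);

  return {
    wrongGuesses,
    nextTurnGuesses,
    doubleDownPct: wrongGuesses ? (100 * nextTurnGuesses) / wrongGuesses : 0,
    streakCount: streaks.length,
    meanStreak: streaks.length ? wrongGuesses / streaks.length : 0,
    maxStreak: streaks.length ? Math.max(...streaks) : 0,
  };
}

const stats = recklessStats(['none', 'wrong', 'wrong', 'none', 'wrong', 'correct']);
console.log(stats.doubleDownPct.toFixed(1), stats.streakCount, stats.meanStreak, stats.maxStreak);
```

In this example, two of the three wrong guesses are followed by another guess on the next turn, and the wrong guesses form two streaks of lengths 2 and 1.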
144
+
145
  ### Performance by Rule
146
 
147
  Not all rules are created equal. Some rules are discovered quickly by all models (e.g. "All cards must be red") while others prove consistently challenging (e.g. "increase rank after a red card, decrease after a black").
 
150
 
151
  The following figure breaks down performance by rule across all models and runs.
152
 
153
+ <HtmlEmbed
154
+ src="by-rule.html"
155
+ caption="<strong>Figure 8:</strong> Score distribution by rule. Each row is a different rule, with individual run scores shown as colored dots (one per model run). Hover over rule names for details. The left column shows average success rate. Click legend items to show/hide models."
 
 
156
  id="fig-by-rule"
157
+ wide
158
  />
 
159
 
160
  We can see that the most complex rules are devastating for the reckless models like Claude Haiku 4.5 and DeepSeek R1, which often get negative scores on these rules due to multiple wrong guesses. Even the best models struggle on the hardest rules, but their superior metacognition allows them to avoid catastrophic failures.
161
 
162
  The following plot breaks down the relative score of each model (as measured by score on the rule divided by average score on all rules) against the complexity metrics of each rule.
163
 
164
+ <HtmlEmbed
165
+ src="complexity-analysis.html"
166
+ caption="<strong>Figure 9:</strong> Relationship between rule complexity and model performance. The heatmap shows relative scores (value > 1 means above-average performance) for each model across complexity quartiles. Hover over cells for details."
 
167
  id="fig-complexity"
 
168
  />
169
 
170
  <Note variant="info">
app/src/content/embeds/by-rule.html ADDED
@@ -0,0 +1,521 @@
1
+ <div class="d3-by-rule"></div>
2
+ <style>
3
+ .d3-by-rule {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-by-rule svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-by-rule .axes path,
17
+ .d3-by-rule .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-by-rule .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 11px;
24
+ }
25
+
26
+ .d3-by-rule .grid line {
27
+ stroke: var(--grid-color, rgba(0,0,0,.08));
28
+ }
29
+
30
+ .d3-by-rule .axes text.axis-label {
31
+ font-size: 14px;
32
+ font-weight: 500;
33
+ fill: var(--text-color);
34
+ }
35
+
36
+ .d3-by-rule .x-axis text {
37
+ transform: translateY(4px);
38
+ }
39
+
40
+ .d3-by-rule .rule-label {
41
+ font-size: 10px;
42
+ fill: var(--text-color);
43
+ cursor: pointer;
44
+ }
45
+
46
+ .d3-by-rule .rule-label:hover {
47
+ text-decoration: underline;
48
+ }
49
+
50
+ .d3-by-rule .complexity-bar {
51
+ opacity: 0.85;
52
+ }
53
+
54
+ .d3-by-rule .complexity-text {
55
+ font-size: 9px;
56
+ font-weight: 600;
57
+ pointer-events: none;
58
+ }
59
+
60
+ .d3-by-rule .point {
61
+ opacity: 0.85;
62
+ transition: opacity 0.1s ease;
63
+ }
64
+
65
+ .d3-by-rule .point:hover {
66
+ opacity: 1;
67
+ }
68
+
69
+ .d3-by-rule .point.dimmed {
70
+ opacity: 0.15;
71
+ }
72
+
73
+ .d3-by-rule .legend-item {
74
+ cursor: pointer;
75
+ }
76
+
77
+ .d3-by-rule .legend-item.inactive .legend-dot {
78
+ opacity: 0.3;
79
+ }
80
+
81
+ .d3-by-rule .legend-item.inactive .legend-text {
82
+ opacity: 0.5;
83
+ text-decoration: line-through;
84
+ }
85
+
86
+ .d3-by-rule .legend-text {
87
+ font-size: 10px;
88
+ fill: var(--text-color);
89
+ }
90
+
91
+ .d3-by-rule .d3-tooltip {
92
+ position: absolute;
93
+ top: 0;
94
+ left: 0;
95
+ transform: translate(-9999px, -9999px);
96
+ pointer-events: none;
97
+ padding: 10px 12px;
98
+ border-radius: 8px;
99
+ font-size: 12px;
100
+ line-height: 1.5;
101
+ border: 1px solid var(--border-color);
102
+ background: var(--surface-bg);
103
+ color: var(--text-color);
104
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
105
+ opacity: 0;
106
+ transition: opacity 0.12s ease;
107
+ z-index: 10;
108
+ max-width: 320px;
109
+ }
110
+
111
+ .d3-by-rule .d3-tooltip .rule-name {
112
+ font-weight: 600;
113
+ margin-bottom: 6px;
114
+ }
115
+
116
+ .d3-by-rule .d3-tooltip .rule-desc {
117
+ margin-bottom: 8px;
118
+ color: var(--muted-color);
119
+ font-size: 11px;
120
+ }
121
+
122
+ .d3-by-rule .d3-tooltip .metric {
123
+ display: flex;
124
+ justify-content: space-between;
125
+ gap: 16px;
126
+ }
127
+
128
+ .d3-by-rule .d3-tooltip .metric-label {
129
+ color: var(--muted-color);
130
+ }
131
+
132
+ .d3-by-rule .d3-tooltip .metric-value {
133
+ font-weight: 500;
134
+ }
135
+ </style>
136
+ <script>
137
+ (() => {
138
+ const ensureD3 = (cb) => {
139
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
140
+ let s = document.getElementById('d3-cdn-script');
141
+ if (!s) {
142
+ s = document.createElement('script');
143
+ s.id = 'd3-cdn-script';
144
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
145
+ document.head.appendChild(s);
146
+ }
147
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
148
+ s.addEventListener('load', onReady, { once: true });
149
+ if (window.d3) onReady();
150
+ };
151
+
152
+ const bootstrap = () => {
153
+ const scriptEl = document.currentScript;
154
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
155
+ if (!(container && container.classList && container.classList.contains('d3-by-rule'))) {
156
+ const candidates = Array.from(document.querySelectorAll('.d3-by-rule'))
157
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
158
+ container = candidates[candidates.length - 1] || null;
159
+ }
160
+ if (!container) return;
161
+ if (container.dataset) {
162
+ if (container.dataset.mounted === 'true') return;
163
+ container.dataset.mounted = 'true';
164
+ }
165
+
166
+ // Tooltip setup
167
+ container.style.position = container.style.position || 'relative';
168
+ const tip = document.createElement('div');
169
+ tip.className = 'd3-tooltip';
170
+ container.appendChild(tip);
171
+
172
+ // SVG setup
173
+ const svg = d3.select(container).append('svg');
174
+ const gRoot = svg.append('g');
175
+
176
+ // Chart groups
177
+ const gGrid = gRoot.append('g').attr('class', 'grid');
178
+ const gAxes = gRoot.append('g').attr('class', 'axes');
179
+ const gComplexity = gRoot.append('g').attr('class', 'complexity');
180
+ const gPoints = gRoot.append('g').attr('class', 'points');
181
+ const gLabels = gRoot.append('g').attr('class', 'labels');
182
+ const gLegend = gRoot.append('g').attr('class', 'legend');
183
+
184
+ // State
185
+ let data = null;
186
+ let modelColors = null;
187
+ let width = 800;
188
+ let height = 800;
189
+ const margin = { top: 20, right: 140, bottom: 50, left: 180 };
190
+ const complexityBarWidth = 30;
191
+ const complexityGap = 8;
192
+
193
+ // Active models (all visible by default)
194
+ let activeModels = new Set();
195
+
196
+ // Scales
197
+ const xScale = d3.scaleLinear();
198
+ const yScale = d3.scaleBand();
199
+ // Green to red scale: high success (1.0) = green, low success (0) = red
200
+ const successColorScale = d3.scaleSequential(d3.interpolateRdYlGn);
201
+
202
+ // Data loading
203
+ const DATA_URL = '/data/by_rule.json';
204
+ const COLORS_URL = '/data/overall_performance.json';
205
+
206
+ function updateSize() {
207
+ width = container.clientWidth || 800;
208
+ const numRules = data ? data.rules.length : 26;
209
+ const rowHeight = 24;
210
+ height = margin.top + margin.bottom + numRules * rowHeight;
211
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
212
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
213
+ return {
214
+ innerWidth: width - margin.left - margin.right,
215
+ innerHeight: height - margin.top - margin.bottom
216
+ };
217
+ }
218
+
219
+ function formatRuleName(name) {
220
+ return name.replace(/_/g, ' ').replace(/\b\w/g, c => c.toUpperCase());
221
+ }
222
+
223
+ function showRuleTooltip(event, rule) {
224
+ const rect = container.getBoundingClientRect();
225
+ const x = event.clientX - rect.left;
226
+ const y = event.clientY - rect.top;
227
+
228
+ tip.innerHTML = `
229
+ <div class="rule-name">${formatRuleName(rule.name)}</div>
230
+ <div class="rule-desc">${rule.description}</div>
231
+ <div class="metric">
232
+ <span class="metric-label">Success Rate:</span>
233
+ <span class="metric-value">${(rule.success_rate * 100).toFixed(1)}%</span>
234
+ </div>
235
+ <div class="metric">
236
+ <span class="metric-label">Average Score:</span>
237
+ <span class="metric-value">${rule.avg_score.toFixed(1)}</span>
238
+ </div>
239
+ <div class="metric">
240
+ <span class="metric-label">Cyclomatic Complexity:</span>
241
+ <span class="metric-value">${rule.cyclomatic_complexity}</span>
242
+ </div>
243
+ <div class="metric">
244
+ <span class="metric-label">AST Node Count:</span>
245
+ <span class="metric-value">${rule.node_count}</span>
246
+ </div>
247
+ <div class="metric">
248
+ <span class="metric-label">Aggregated Complexity:</span>
249
+ <span class="metric-value">${rule.aggregated_complexity.toFixed(1)}</span>
250
+ </div>
251
+ `;
252
+
253
+ const tipWidth = tip.offsetWidth || 200;
254
+ const tipHeight = tip.offsetHeight || 140;
255
+ let tipX = x + 12;
256
+ let tipY = y - tipHeight / 2;
257
+
258
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
259
+ if (tipY < 0) tipY = 8;
260
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
261
+
262
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
263
+ tip.style.opacity = '1';
264
+ }
265
+
266
+ function hideTooltip() {
267
+ tip.style.opacity = '0';
268
+ tip.style.transform = 'translate(-9999px, -9999px)';
269
+ }
270
+
271
+ function getContrastColor(color) {
272
+ // Handle both hex (#rrggbb) and rgb(r, g, b) formats
273
+ let r, g, b;
274
+ if (color.startsWith('#')) {
275
+ const hex = color.replace('#', '');
276
+ r = parseInt(hex.substr(0, 2), 16) / 255;
277
+ g = parseInt(hex.substr(2, 2), 16) / 255;
278
+ b = parseInt(hex.substr(4, 2), 16) / 255;
279
+ } else if (color.startsWith('rgb')) {
280
+ const match = color.match(/rgb\((\d+),\s*(\d+),\s*(\d+)\)/);
281
+ if (match) {
282
+ r = parseInt(match[1]) / 255;
283
+ g = parseInt(match[2]) / 255;
284
+ b = parseInt(match[3]) / 255;
285
+ } else {
286
+ return '#000000';
287
+ }
288
+ } else {
289
+ return '#000000';
290
+ }
291
+ const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
292
+ return luminance > 0.5 ? '#000000' : '#ffffff';
293
+ }
294
+
295
+ function toggleModel(modelName) {
296
+ if (activeModels.has(modelName)) {
297
+ activeModels.delete(modelName);
298
+ } else {
299
+ activeModels.add(modelName);
300
+ }
301
+ render();
302
+ }
303
+
304
+ function render() {
305
+ if (!data || !modelColors) return;
306
+
307
+ const { innerWidth, innerHeight } = updateSize();
308
+ const rules = data.rules;
309
+ const chartWidth = innerWidth - complexityBarWidth - complexityGap;
310
+
311
+ // Update scales
312
+ const allScores = [];
313
+ rules.forEach(rule => {
314
+ Object.values(rule.scores_by_model).forEach(scores => {
315
+ allScores.push(...scores);
316
+ });
317
+ });
318
+ const scoreExtent = d3.extent(allScores);
319
+ const scorePadding = (scoreExtent[1] - scoreExtent[0]) * 0.05;
320
+
321
+ xScale
322
+ .domain([scoreExtent[0] - scorePadding, scoreExtent[1] + scorePadding])
323
+ .range([complexityBarWidth + complexityGap, innerWidth])
324
+ .nice();
325
+
326
+ yScale
327
+ .domain(rules.map(r => r.name))
328
+ .range([0, innerHeight])
329
+ .padding(0.3);
330
+
331
+ // Success rate domain: 0 to 1 (will display as 0% to 100%)
332
+ successColorScale.domain([0, 1]);
333
+
334
+ // Grid lines
335
+ const xTicks = xScale.ticks(8);
336
+ gGrid.selectAll('.grid-x')
337
+ .data(xTicks)
338
+ .join('line')
339
+ .attr('class', 'grid-x')
340
+ .attr('x1', d => xScale(d))
341
+ .attr('x2', d => xScale(d))
342
+ .attr('y1', 0)
343
+ .attr('y2', innerHeight);
344
+
345
+ // X-axis
346
+ gAxes.selectAll('.x-axis')
347
+ .data([0])
348
+ .join('g')
349
+ .attr('class', 'x-axis')
350
+ .attr('transform', `translate(0,${innerHeight})`)
351
+ .call(d3.axisBottom(xScale).ticks(8).tickSizeInner(-6).tickSizeOuter(0));
352
+
353
+ // X-axis label
354
+ gAxes.selectAll('.x-label')
355
+ .data([0])
356
+ .join('text')
357
+ .attr('class', 'x-label axis-label')
358
+ .attr('x', (complexityBarWidth + complexityGap + innerWidth) / 2)
359
+ .attr('y', innerHeight + 40)
360
+ .attr('text-anchor', 'middle')
361
+ .text('Score');
362
+
363
+ // Success rate bars
364
+ gComplexity.selectAll('.complexity-bar')
365
+ .data(rules, d => d.name)
366
+ .join('rect')
367
+ .attr('class', 'complexity-bar')
368
+ .attr('x', 0)
369
+ .attr('y', d => yScale(d.name))
370
+ .attr('width', complexityBarWidth)
371
+ .attr('height', yScale.bandwidth())
372
+ .attr('fill', d => successColorScale(d.success_rate))
373
+ .attr('rx', 2);
374
+
375
+ gComplexity.selectAll('.complexity-text')
376
+ .data(rules, d => d.name)
377
+ .join('text')
378
+ .attr('class', 'complexity-text')
379
+ .attr('x', complexityBarWidth / 2)
380
+ .attr('y', d => yScale(d.name) + yScale.bandwidth() / 2)
381
+ .attr('text-anchor', 'middle')
382
+ .attr('dominant-baseline', 'central')
383
+ .style('fill', d => getContrastColor(successColorScale(d.success_rate)))
384
+ .text(d => Math.round(d.success_rate * 100) + '%');
385
+
386
+ // Rule labels (Y-axis)
387
+ gLabels.selectAll('.rule-label')
388
+ .data(rules, d => d.name)
389
+ .join('text')
390
+ .attr('class', 'rule-label')
391
+ .attr('x', -8)
392
+ .attr('y', d => yScale(d.name) + yScale.bandwidth() / 2)
393
+ .attr('text-anchor', 'end')
394
+ .attr('dominant-baseline', 'central')
395
+ .text(d => formatRuleName(d.name))
396
+ .on('mouseenter', (event, d) => showRuleTooltip(event, d))
397
+ .on('mousemove', (event, d) => showRuleTooltip(event, d))
398
+ .on('mouseleave', hideTooltip);
399
+
400
+ // Data points
401
+ const pointData = [];
402
+ rules.forEach(rule => {
403
+ Object.entries(rule.scores_by_model).forEach(([modelName, scores]) => {
404
+ scores.forEach((score, seedIdx) => {
405
+ const color = modelColors[modelName] || '#888888';
406
+ pointData.push({
407
+ rule: rule.name,
408
+ model: modelName,
409
+ score: score,
410
+ seed: seedIdx,
411
+ color: color
412
+ });
413
+ });
414
+ });
415
+ });
416
+
417
+ const pointRadius = Math.max(3, Math.min(5, yScale.bandwidth() / 4));
418
+ const jitterStrength = yScale.bandwidth() * 0.3;
419
+
420
+ // Simple hash for consistent jitter
421
+ const hashStr = (str) => {
422
+ let hash = 0;
423
+ for (let i = 0; i < str.length; i++) {
424
+ hash = ((hash << 5) - hash) + str.charCodeAt(i);
425
+ hash |= 0;
426
+ }
427
+ return hash;
428
+ };
429
+
430
+ gPoints.selectAll('.point')
431
+ .data(pointData, d => `${d.rule}-${d.model}-${d.seed}`)
432
+ .join('circle')
433
+ .attr('class', d => `point ${activeModels.has(d.model) ? '' : 'dimmed'}`)
434
+ .attr('cx', d => xScale(d.score))
435
+ .attr('cy', d => {
436
+ const baseY = yScale(d.rule) + yScale.bandwidth() / 2;
437
+ const jitter = ((Math.abs(hashStr(d.model + d.seed)) % 100) / 100 - 0.5) * jitterStrength;
438
+ return baseY + jitter;
439
+ })
440
+ .attr('r', pointRadius)
441
+ .attr('fill', d => d.color)
442
+ .attr('stroke', 'var(--surface-bg)')
443
+ .attr('stroke-width', 0.5);
444
+
445
+ // Legend
446
+ const legendX = innerWidth + 15;
447
+ const legendItemHeight = 16;
448
+ const modelNames = data.models;
449
+
450
+ const legendItems = gLegend.selectAll('.legend-item')
451
+ .data(modelNames)
452
+ .join('g')
453
+ .attr('class', d => `legend-item ${activeModels.has(d) ? '' : 'inactive'}`)
454
+ .attr('transform', (d, i) => `translate(${legendX}, ${i * legendItemHeight})`)
455
+ .style('cursor', 'pointer')
456
+ .on('click', (event, d) => toggleModel(d));
457
+
458
+ legendItems.selectAll('.legend-dot')
459
+ .data(d => [d])
460
+ .join('circle')
461
+ .attr('class', 'legend-dot')
462
+ .attr('cx', 5)
463
+ .attr('cy', 6)
464
+ .attr('r', 4)
465
+ .attr('fill', d => modelColors[d] || '#888888');
466
+
467
+ legendItems.selectAll('.legend-text')
468
+ .data(d => [d])
469
+ .join('text')
470
+ .attr('class', 'legend-text')
471
+ .attr('x', 14)
472
+ .attr('y', 9)
473
+ .text(d => d);
474
+ }
475
+
476
+ // Initialize
477
+ Promise.all([
478
+ fetch(DATA_URL, { cache: 'no-cache' }).then(r => { if (!r.ok) throw new Error(`HTTP ${r.status} for ${DATA_URL}`); return r.json(); }),
479
+ fetch(COLORS_URL, { cache: 'no-cache' }).then(r => { if (!r.ok) throw new Error(`HTTP ${r.status} for ${COLORS_URL}`); return r.json(); })
480
+ ])
481
+ .then(([byRuleData, perfData]) => {
482
+ data = byRuleData;
483
+ // Build color map from overall_performance.json
484
+ modelColors = {};
485
+ perfData.models.forEach(m => {
486
+ modelColors[m.name] = m.color;
487
+ });
488
+ // Initialize all models as active
489
+ activeModels = new Set(data.models);
490
+ render();
491
+ })
492
+ .catch(err => {
493
+ const pre = document.createElement('pre');
494
+ pre.style.color = 'red';
495
+ pre.style.padding = '16px';
496
+ pre.textContent = `Error loading data: ${err.message}`;
497
+ container.appendChild(pre);
498
+ });
499
+
500
+ // Resize handling
501
+ if (window.ResizeObserver) {
502
+ new ResizeObserver(() => render()).observe(container);
503
+ } else {
504
+ window.addEventListener('resize', render);
505
+ }
506
+
507
+ // Theme change handling
508
+ const observer = new MutationObserver(() => render());
509
+ observer.observe(document.documentElement, {
510
+ attributes: true,
511
+ attributeFilter: ['data-theme']
512
+ });
513
+ };
514
+
515
+ if (document.readyState === 'loading') {
516
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
517
+ } else {
518
+ ensureD3(bootstrap);
519
+ }
520
+ })();
521
+ </script>
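Note on the jitter in by-rule.html: each point's vertical offset is derived from a string hash so that it stays stable across re-renders. The following is a minimal standalone sketch of that idea, not the embed's exact code. `jitterOffset` is a hypothetical helper name, and the key `'gpt_oss_120b' + 0` is only an example of the `model + seed` concatenation. Because JavaScript's `%` keeps the sign of the dividend and the hash is a signed 32-bit integer, the sketch takes the absolute value before the modulo so the offset stays symmetric around zero.

```javascript
// Deterministic 32-bit string hash (same recurrence as hashStr in the embed).
function hashStr(str) {
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash << 5) - hash) + str.charCodeAt(i);
    hash |= 0; // truncate to a signed 32-bit integer
  }
  return hash;
}

// Hypothetical helper: map a key to an offset in [-strength / 2, strength / 2).
function jitterOffset(key, strength) {
  // % keeps the sign of the dividend, so normalise the hash first.
  const bucket = Math.abs(hashStr(key)) % 100; // integer in 0..99
  return (bucket / 100 - 0.5) * strength;
}

// The same key always yields the same offset, so points do not move on re-render.
const first = jitterOffset('gpt_oss_120b' + 0, 10);
const second = jitterOffset('gpt_oss_120b' + 0, 10);
console.log(first === second);
```

Hashing the key instead of calling `Math.random()` keeps the layout reproducible when the chart re-renders on resize or theme change.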
app/src/content/embeds/calibration-curves.html ADDED
@@ -0,0 +1,537 @@
1
+ <div class="d3-calibration-curves"></div>
2
+ <style>
3
+ .d3-calibration-curves {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-calibration-curves svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-calibration-curves .axes path,
17
+ .d3-calibration-curves .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-calibration-curves .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 11px;
24
+ }
25
+
26
+ .d3-calibration-curves .grid line {
27
+ stroke: var(--grid-color, rgba(0,0,0,.08));
28
+ }
29
+
30
+ .d3-calibration-curves .axes text.axis-label {
31
+ font-size: 14px;
32
+ font-weight: 500;
33
+ fill: var(--text-color);
34
+ }
35
+
36
+ .d3-calibration-curves .x-axis text {
37
+ transform: translateY(4px);
38
+ }
39
+
40
+ .d3-calibration-curves .calibration-line {
41
+ fill: none;
42
+ stroke-width: 1.5;
43
+ }
44
+
45
+ .d3-calibration-curves .perfect-line {
46
+ fill: none;
47
+ stroke: var(--muted-color);
48
+ stroke-width: 1.5;
49
+ stroke-dasharray: 8, 6;
50
+ opacity: 0.6;
51
+ }
52
+
53
+ .d3-calibration-curves .data-point {
54
+ cursor: pointer;
55
+ transition: transform 0.15s ease, opacity 0.15s ease;
56
+ }
57
+
58
+ .d3-calibration-curves .data-point:hover {
59
+ opacity: 0.8;
60
+ }
61
+
62
+ .d3-calibration-curves .legend {
63
+ font-size: 11px;
64
+ }
65
+
66
+ .d3-calibration-curves .legend-item {
67
+ cursor: pointer;
68
+ }
69
+
70
+ .d3-calibration-curves .legend-item.dimmed .legend-line,
71
+ .d3-calibration-curves .legend-item.dimmed .legend-marker {
72
+ opacity: 0.3;
73
+ }
74
+
75
+ .d3-calibration-curves .legend-item.dimmed text {
76
+ opacity: 0.4;
77
+ }
78
+
79
+ .d3-calibration-curves .legend-text {
80
+ fill: var(--text-color);
81
+ }
82
+
83
+ .d3-calibration-curves .d3-tooltip {
84
+ position: absolute;
85
+ top: 0;
86
+ left: 0;
87
+ transform: translate(-9999px, -9999px);
88
+ pointer-events: none;
89
+ padding: 10px 12px;
90
+ border-radius: 8px;
91
+ font-size: 12px;
92
+ line-height: 1.4;
93
+ border: 1px solid var(--border-color);
94
+ background: var(--surface-bg);
95
+ color: var(--text-color);
96
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
97
+ opacity: 0;
98
+ transition: opacity 0.12s ease;
99
+ z-index: 10;
100
+ }
101
+
102
+ .d3-calibration-curves .d3-tooltip .model-name {
103
+ font-weight: 600;
104
+ margin-bottom: 4px;
105
+ }
106
+
107
+ .d3-calibration-curves .d3-tooltip .metric {
108
+ display: flex;
109
+ justify-content: space-between;
110
+ gap: 16px;
111
+ }
112
+
113
+ .d3-calibration-curves .d3-tooltip .metric-label {
114
+ color: var(--muted-color);
115
+ }
116
+
117
+ .d3-calibration-curves .d3-tooltip .metric-value {
118
+ font-weight: 500;
119
+ }
120
+ </style>
121
+ <script>
122
+ (() => {
123
+ const ensureD3 = (cb) => {
124
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
125
+ let s = document.getElementById('d3-cdn-script');
126
+ if (!s) {
127
+ s = document.createElement('script');
128
+ s.id = 'd3-cdn-script';
129
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
130
+ document.head.appendChild(s);
131
+ }
132
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
133
+ s.addEventListener('load', onReady, { once: true });
134
+ if (window.d3) onReady();
135
+ };
136
+
137
+ const bootstrap = () => {
138
+ const scriptEl = document.currentScript;
139
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
140
+ if (!(container && container.classList && container.classList.contains('d3-calibration-curves'))) {
141
+ const candidates = Array.from(document.querySelectorAll('.d3-calibration-curves'))
142
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
143
+ container = candidates[candidates.length - 1] || null;
144
+ }
145
+ if (!container) return;
146
+ if (container.dataset) {
147
+ if (container.dataset.mounted === 'true') return;
148
+ container.dataset.mounted = 'true';
149
+ }
150
+
151
+ // Tooltip setup
152
+ container.style.position = container.style.position || 'relative';
153
+ const tip = document.createElement('div');
154
+ tip.className = 'd3-tooltip';
155
+ container.appendChild(tip);
156
+
157
+ // SVG setup
158
+ const svg = d3.select(container).append('svg');
159
+ const gRoot = svg.append('g');
160
+
161
+ // Chart groups (order matters for layering)
162
+ const gGrid = gRoot.append('g').attr('class', 'grid');
163
+ const gPerfect = gRoot.append('g').attr('class', 'perfect');
164
+ const gLines = gRoot.append('g').attr('class', 'lines');
165
+ const gPoints = gRoot.append('g').attr('class', 'points');
166
+ const gAxes = gRoot.append('g').attr('class', 'axes');
167
+ const gLegend = gRoot.append('g').attr('class', 'legend');
168
+
169
+ // State
170
+ let data = null;
171
+ let width = 800;
172
+ let height = 500;
173
+ const margin = { top: 20, right: 180, bottom: 56, left: 72 };
174
+ let hiddenModels = new Set();
175
+
176
+ // Scales
177
+ const xScale = d3.scaleLinear();
178
+ const yScale = d3.scaleLinear();
179
+
180
+ // Line generator - convert confidence level to probability (divide by 10)
181
+ const line = d3.line()
182
+ .x(d => xScale(d.confidence_level / 10))
183
+ .y(d => yScale(d.actual_success_rate));
184
+
185
+ // Data loading
186
+ const DATA_URL = '/data/calibration_curves.json';
187
+
188
+ function updateSize() {
189
+ width = container.clientWidth || 800;
190
+ // Calculate inner dimensions, ensuring square plot area
191
+ const availableWidth = width - margin.left - margin.right;
192
+ const maxHeight = Math.round(width * 0.8); // Limit max height
193
+ const innerSize = Math.min(availableWidth, maxHeight - margin.top - margin.bottom);
194
+ height = innerSize + margin.top + margin.bottom;
195
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
196
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
197
+ return {
198
+ innerWidth: innerSize,
199
+ innerHeight: innerSize
200
+ };
201
+ }
202
+
203
+ function showTooltip(event, d, model) {
204
+ const rect = container.getBoundingClientRect();
205
+ const x = event.clientX - rect.left;
206
+ const y = event.clientY - rect.top;
207
+
208
+ const reportedConfidence = d.confidence_level / 10;
209
+
210
+ tip.innerHTML = `
211
+ <div class="model-name" style="color: ${model.color}">${model.name}</div>
212
+ <div class="metric">
213
+ <span class="metric-label">Reported confidence:</span>
214
+ <span class="metric-value">${Math.round(reportedConfidence * 100)}%</span>
215
+ </div>
216
+ <div class="metric">
217
+ <span class="metric-label">Actual success:</span>
218
+ <span class="metric-value">${(d.actual_success_rate * 100).toFixed(1)}%</span>
219
+ </div>
220
+ <div class="metric">
221
+ <span class="metric-label">Sample size:</span>
222
+ <span class="metric-value">${d.sample_count}</span>
223
+ </div>
224
+ `;
225
+
226
+ const tipWidth = tip.offsetWidth || 150;
227
+ const tipHeight = tip.offsetHeight || 100;
228
+ let tipX = x + 12;
229
+ let tipY = y - tipHeight / 2;
230
+
231
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
232
+ if (tipY < 0) tipY = 8;
233
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
234
+
235
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
236
+ tip.style.opacity = '1';
237
+ }
238
+
239
+ function hideTooltip() {
240
+ tip.style.opacity = '0';
241
+ tip.style.transform = 'translate(-9999px, -9999px)';
242
+ }
243
+
244
+ function toggleModel(modelName) {
245
+ if (hiddenModels.has(modelName)) {
246
+ hiddenModels.delete(modelName);
247
+ } else {
248
+ hiddenModels.add(modelName);
249
+ }
250
+ render();
251
+ }
252
+
253
+ function render() {
254
+ if (!data) return;
255
+
256
+ const { innerWidth, innerHeight } = updateSize();
257
+ const models = data.models;
258
+
259
+ // Equal scales for both axes (0-1 probability) to ensure 45° diagonal
260
+ xScale
261
+ .domain([0, 1])
262
+ .range([0, innerWidth]);
263
+
264
+ yScale
265
+ .domain([0, 1])
266
+ .range([innerHeight, 0]);
267
+
268
+ // Grid lines - same ticks for both axes
269
+ const ticks = [0, 0.2, 0.4, 0.6, 0.8, 1.0];
270
+ const xTicks = ticks;
271
+ const yTicks = ticks;
272
+
273
+ gGrid.selectAll('.grid-x')
274
+ .data(xTicks)
275
+ .join('line')
276
+ .attr('class', 'grid-x')
277
+ .attr('x1', d => xScale(d))
278
+ .attr('x2', d => xScale(d))
279
+ .attr('y1', 0)
280
+ .attr('y2', innerHeight);
281
+
282
+ gGrid.selectAll('.grid-y')
283
+ .data(yTicks)
284
+ .join('line')
285
+ .attr('class', 'grid-y')
286
+ .attr('x1', 0)
287
+ .attr('x2', innerWidth)
288
+ .attr('y1', d => yScale(d))
289
+ .attr('y2', d => yScale(d));
290
+
291
+ // Perfect calibration line (diagonal from 0,0 to 1,1)
292
+ gPerfect.selectAll('.perfect-line')
293
+ .data([0])
294
+ .join('line')
295
+ .attr('class', 'perfect-line')
296
+ .attr('x1', xScale(0))
297
+ .attr('y1', yScale(0))
298
+ .attr('x2', xScale(1))
299
+ .attr('y2', yScale(1));
300
+
301
+ // Axes - format as percentages
302
+ const tickSize = 6;
303
+ const percentFormat = d => `${Math.round(d * 100)}%`;
304
+
305
+ gAxes.selectAll('.x-axis')
306
+ .data([0])
307
+ .join('g')
308
+ .attr('class', 'x-axis')
309
+ .attr('transform', `translate(0,${innerHeight})`)
310
+ .call(d3.axisBottom(xScale)
311
+ .tickValues(xTicks)
312
+ .tickFormat(percentFormat)
313
+ .tickSizeInner(-tickSize)
314
+ .tickSizeOuter(0));
315
+
316
+ gAxes.selectAll('.y-axis')
317
+ .data([0])
318
+ .join('g')
319
+ .attr('class', 'y-axis')
320
+ .call(d3.axisLeft(yScale)
321
+ .tickValues(yTicks)
322
+ .tickFormat(percentFormat)
323
+ .tickSizeInner(-tickSize)
324
+ .tickSizeOuter(0));
325
+
326
+ // Axis labels
327
+ gAxes.selectAll('.x-label')
328
+ .data([0])
329
+ .join('text')
330
+ .attr('class', 'x-label axis-label')
331
+ .attr('x', innerWidth / 2)
332
+ .attr('y', innerHeight + 44)
333
+ .attr('text-anchor', 'middle')
334
+ .text('Reported Confidence');
335
+
336
+ gAxes.selectAll('.y-label')
337
+ .data([0])
338
+ .join('text')
339
+ .attr('class', 'y-label axis-label')
340
+ .attr('x', -innerHeight / 2)
341
+ .attr('y', -52)
342
+ .attr('text-anchor', 'middle')
343
+ .attr('transform', 'rotate(-90)')
344
+ .text('Actual Success Rate');
345
+
346
+ // Lines for each model
347
+ const visibleModels = models.filter(m => !hiddenModels.has(m.name));
348
+
349
+ gLines.selectAll('.calibration-line')
350
+ .data(visibleModels, d => d.name)
351
+ .join('path')
352
+ .attr('class', 'calibration-line')
353
+ .attr('d', d => line(d.calibration_points))
354
+ .attr('stroke', d => d.color);
355
+
356
+ // Data points - circles for closed models, stars for open models
357
+ const allPoints = visibleModels.flatMap(model =>
358
+ model.calibration_points.map(p => ({ ...p, model }))
359
+ );
360
+ const closedPoints = allPoints.filter(d => !d.model.is_open);
361
+ const openPoints = allPoints.filter(d => d.model.is_open);
362
+
363
+ // Helper function to create a 5-point star path
364
+ const starPath = (cx, cy, outerR, innerR) => {
365
+ const points = [];
366
+ for (let i = 0; i < 10; i++) {
367
+ const r = i % 2 === 0 ? outerR : innerR;
368
+ const angle = (Math.PI / 2) + (i * Math.PI / 5);
369
+ points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
370
+ }
371
+ return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
372
+ };
373
+
374
+ // Circles for closed models
375
+ gPoints.selectAll('.data-point-circle')
376
+ .data(closedPoints, d => `${d.model.name}-${d.confidence_level}`)
377
+ .join('circle')
378
+ .attr('class', 'data-point data-point-circle')
379
+ .attr('cx', d => xScale(d.confidence_level / 10))
380
+ .attr('cy', d => yScale(d.actual_success_rate))
381
+ .attr('r', 4)
382
+ .attr('fill', d => d.model.color)
383
+ .attr('stroke', 'var(--surface-bg, white)')
384
+ .attr('stroke-width', 1)
385
+ .on('mouseenter', (event, d) => showTooltip(event, d, d.model))
386
+ .on('mousemove', (event, d) => showTooltip(event, d, d.model))
387
+ .on('mouseleave', hideTooltip);
388
+
389
+ // Stars for open models
390
+ gPoints.selectAll('.data-point-star')
391
+ .data(openPoints, d => `${d.model.name}-${d.confidence_level}`)
392
+ .join('path')
393
+ .attr('class', 'data-point data-point-star')
394
+ .attr('d', d => starPath(
395
+ xScale(d.confidence_level / 10),
396
+ yScale(d.actual_success_rate),
397
+ 6, 2.6
398
+ ))
399
+ .attr('fill', d => d.model.color)
400
+ .attr('stroke', 'var(--surface-bg, white)')
401
+ .attr('stroke-width', 0.8)
402
+ .on('mouseenter', (event, d) => showTooltip(event, d, d.model))
403
+ .on('mousemove', (event, d) => showTooltip(event, d, d.model))
404
+ .on('mouseleave', hideTooltip);
405
+
406
+ // Legend
407
+ const legendX = innerWidth + 16;
408
+ const legendItemHeight = 20;
409
+
410
+ // Perfect calibration in legend
411
+ const legendItems = [
412
+ { name: 'Perfect calibration', color: 'var(--muted-color)', isPerfect: true }
413
+ ].concat(models);
414
+
415
+ gLegend.selectAll('.legend-item')
416
+ .data(legendItems, d => d.name)
417
+ .join('g')
418
+ .attr('class', d => {
419
+ if (d.isPerfect) return 'legend-item';
420
+ return `legend-item ${hiddenModels.has(d.name) ? 'dimmed' : ''}`;
421
+ })
422
+ .attr('transform', (d, i) => `translate(${legendX}, ${i * legendItemHeight})`)
423
+ .each(function(d) {
424
+ const g = d3.select(this);
425
+ g.selectAll('*').remove();
426
+
427
+ if (d.isPerfect) {
428
+ // Dashed line for perfect calibration
429
+ g.append('line')
430
+ .attr('class', 'legend-line')
431
+ .attr('x1', 0)
432
+ .attr('x2', 20)
433
+ .attr('y1', 0)
434
+ .attr('y2', 0)
435
+ .attr('stroke', d.color)
436
+ .attr('stroke-width', 1.5)
437
+ .attr('stroke-dasharray', '6, 4')
438
+ .attr('opacity', 0.6);
439
+ } else {
440
+ // Line segment (solid for all models)
441
+ g.append('line')
442
+ .attr('class', 'legend-line')
443
+ .attr('x1', 0)
444
+ .attr('x2', 20)
445
+ .attr('y1', 0)
446
+ .attr('y2', 0)
447
+ .attr('stroke', d.color)
448
+ .attr('stroke-width', 1.5);
449
+
450
+ // Marker - circle for closed, star for open
451
+ if (d.is_open) {
452
+ // Small star for open models (reuses the starPath helper defined earlier in render())
462
+ g.append('path')
463
+ .attr('class', 'legend-marker')
464
+ .attr('d', starPath(10, 0, 6, 2.6))
465
+ .attr('fill', d.color);
466
+ } else {
467
+ g.append('circle')
468
+ .attr('class', 'legend-marker')
469
+ .attr('cx', 10)
470
+ .attr('cy', 0)
471
+ .attr('r', 3.5)
472
+ .attr('fill', d.color);
473
+ }
474
+ }
475
+
476
+ g.append('text')
477
+ .attr('class', 'legend-text')
478
+ .attr('x', 26)
479
+ .attr('y', 4)
480
+ .text(d.name);
481
+
482
+ if (!d.isPerfect) {
483
+ g.style('cursor', 'pointer')
484
+ .on('click', () => toggleModel(d.name));
485
+ }
486
+ });
487
+
488
+ // Legend note about marker shapes
489
+ const noteY = legendItems.length * legendItemHeight + 12;
490
+ gLegend.selectAll('.legend-note')
491
+ .data([0])
492
+ .join('text')
493
+ .attr('class', 'legend-note')
494
+ .attr('x', legendX)
495
+ .attr('y', noteY)
496
+ .attr('font-size', '10px')
497
+ .attr('fill', 'var(--muted-color)')
498
+ .text('● = Closed, ★ = Open');
499
+ }
500
+
501
+ // Initialize
502
+ fetch(DATA_URL, { cache: 'no-cache' })
503
+ .then(r => { if (!r.ok) throw new Error(`HTTP ${r.status} for ${DATA_URL}`); return r.json(); })
504
+ .then(json => {
505
+ data = json;
506
+ render();
507
+ })
508
+ .catch(err => {
509
+ const pre = document.createElement('pre');
510
+ pre.style.color = 'red';
511
+ pre.style.padding = '16px';
512
+ pre.textContent = `Error loading data: ${err.message}`;
513
+ container.appendChild(pre);
514
+ });
515
+
516
+ // Resize handling
517
+ if (window.ResizeObserver) {
518
+ new ResizeObserver(() => render()).observe(container);
519
+ } else {
520
+ window.addEventListener('resize', render);
521
+ }
522
+
523
+ // Theme change handling
524
+ const observer = new MutationObserver(() => render());
525
+ observer.observe(document.documentElement, {
526
+ attributes: true,
527
+ attributeFilter: ['data-theme']
528
+ });
529
+ };
530
+
531
+ if (document.readyState === 'loading') {
532
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
533
+ } else {
534
+ ensureD3(bootstrap);
535
+ }
536
+ })();
537
+ </script>
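Note on the data consumed by calibration-curves.html: the embed reads `calibration_points` entries with the fields `confidence_level`, `actual_success_rate` and `sample_count`, and divides `confidence_level` by 10 to get a probability. The diff does not show how those points are produced. The sketch below is one plausible way to derive them, assuming a hypothetical input of one record per guess with an integer `confidence` from 0 to 10 and a boolean `correct`. The function name `calibrationPoints` and the input shape are assumptions for illustration only.

```javascript
// Hypothetical input: one record per guess, e.g. { confidence: 8, correct: true }.
function calibrationPoints(guesses) {
  const bins = new Map(); // confidence level -> { correct, total }
  for (const { confidence, correct } of guesses) {
    const bin = bins.get(confidence) || { correct: 0, total: 0 };
    bin.total += 1;
    if (correct) bin.correct += 1;
    bins.set(confidence, bin);
  }
  // Emit one point per observed confidence level, sorted so the line
  // generator draws the curve from left to right.
  return [...bins.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([confidence_level, { correct, total }]) => ({
      confidence_level,
      actual_success_rate: correct / total,
      sample_count: total,
    }));
}

const points = calibrationPoints([
  { confidence: 8, correct: true },
  { confidence: 8, correct: false },
  { confidence: 10, correct: true },
]);
console.log(points);
```

A perfectly calibrated model would have `actual_success_rate` equal to `confidence_level / 10` in every bin, which is the dashed diagonal drawn by the chart. Bins with a small `sample_count` are noisy, which is why the tooltip shows the sample size.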
app/src/content/embeds/caution-vs-failed-guesses.html ADDED
@@ -0,0 +1,369 @@
1
+ <div class="d3-caution-vs-failed-guesses"></div>
2
+ <style>
3
+ .d3-caution-vs-failed-guesses {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-caution-vs-failed-guesses svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-caution-vs-failed-guesses .axes path,
17
+ .d3-caution-vs-failed-guesses .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-caution-vs-failed-guesses .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 11px;
24
+ }
25
+
26
+ .d3-caution-vs-failed-guesses .grid line {
27
+ stroke: var(--grid-color, rgba(0,0,0,.08));
28
+ }
29
+
30
+ .d3-caution-vs-failed-guesses .axes text.axis-label {
31
+ font-size: 15px;
32
+ font-weight: 500;
33
+ fill: var(--text-color);
34
+ }
35
+
36
+ .d3-caution-vs-failed-guesses .x-axis text {
37
+ transform: translateY(4px);
38
+ }
39
+
40
+ .d3-caution-vs-failed-guesses .point {
41
+ cursor: pointer;
42
+ transition: opacity 0.15s ease;
43
+ }
44
+
45
+ .d3-caution-vs-failed-guesses .point:hover {
46
+ opacity: 0.8;
47
+ }
48
+
49
+ .d3-caution-vs-failed-guesses .point-label {
50
+ font-size: 11px;
51
+ fill: var(--text-color);
52
+ pointer-events: none;
53
+ }
54
+
55
+ .d3-caution-vs-failed-guesses .d3-tooltip {
56
+ position: absolute;
57
+ top: 0;
58
+ left: 0;
59
+ transform: translate(-9999px, -9999px);
60
+ pointer-events: none;
61
+ padding: 10px 12px;
62
+ border-radius: 8px;
63
+ font-size: 12px;
64
+ line-height: 1.4;
65
+ border: 1px solid var(--border-color);
66
+ background: var(--surface-bg);
67
+ color: var(--text-color);
68
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
69
+ opacity: 0;
70
+ transition: opacity 0.12s ease;
71
+ z-index: 10;
72
+ }
73
+
74
+ .d3-caution-vs-failed-guesses .d3-tooltip .model-name {
75
+ font-weight: 600;
76
+ margin-bottom: 4px;
77
+ }
78
+
79
+ .d3-caution-vs-failed-guesses .d3-tooltip .metric {
80
+ display: flex;
81
+ justify-content: space-between;
82
+ gap: 16px;
83
+ }
84
+
85
+ .d3-caution-vs-failed-guesses .d3-tooltip .metric-label {
86
+ color: var(--muted-color);
87
+ }
88
+
89
+ .d3-caution-vs-failed-guesses .d3-tooltip .metric-value {
90
+ font-weight: 500;
91
+ }
92
+ </style>
93
+ <script>
94
+ (() => {
95
+ const ensureD3 = (cb) => {
96
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
97
+ let s = document.getElementById('d3-cdn-script');
98
+ if (!s) {
99
+ s = document.createElement('script');
100
+ s.id = 'd3-cdn-script';
101
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
102
+ document.head.appendChild(s);
103
+ }
104
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
105
+ s.addEventListener('load', onReady, { once: true });
106
+ if (window.d3) onReady();
107
+ };
108
+
109
+ const bootstrap = () => {
110
+ const scriptEl = document.currentScript;
111
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
112
+ if (!(container && container.classList && container.classList.contains('d3-caution-vs-failed-guesses'))) {
113
+ const candidates = Array.from(document.querySelectorAll('.d3-caution-vs-failed-guesses'))
114
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
115
+ container = candidates[candidates.length - 1] || null;
116
+ }
117
+ if (!container) return;
118
+ if (container.dataset) {
119
+ if (container.dataset.mounted === 'true') return;
120
+ container.dataset.mounted = 'true';
121
+ }
122
+
123
+ // Tooltip setup
124
+ container.style.position = container.style.position || 'relative';
125
+ const tip = document.createElement('div');
126
+ tip.className = 'd3-tooltip';
127
+ container.appendChild(tip);
128
+
129
+ // SVG setup
130
+ const svg = d3.select(container).append('svg');
131
+ const gRoot = svg.append('g');
132
+
133
+ // Chart groups
134
+ const gGrid = gRoot.append('g').attr('class', 'grid');
135
+ const gAxes = gRoot.append('g').attr('class', 'axes');
136
+ const gPoints = gRoot.append('g').attr('class', 'points');
137
+ const gLabels = gRoot.append('g').attr('class', 'labels');
138
+
139
+ // State
140
+ let data = null;
141
+ let width = 800;
142
+ let height = 450;
143
+ const margin = { top: 20, right: 120, bottom: 56, left: 72 };
144
+
145
+ // Scales
146
+ const xScale = d3.scaleLinear();
147
+ const yScale = d3.scaleLinear();
148
+
149
+ // Data loading
150
+ const DATA_URL = '/data/caution_vs_failed_guesses.json';
151
+
152
+ function updateSize() {
153
+ width = container.clientWidth || 800;
154
+ height = Math.max(300, Math.round(width / 1.5));
155
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
156
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
157
+ return {
158
+ innerWidth: width - margin.left - margin.right,
159
+ innerHeight: height - margin.top - margin.bottom
160
+ };
161
+ }
162
+
163
+ function showTooltip(event, d) {
164
+ const rect = container.getBoundingClientRect();
165
+ const x = event.clientX - rect.left;
166
+ const y = event.clientY - rect.top;
167
+
168
+ tip.innerHTML = `
169
+ <div class="model-name" style="color: ${d.color}">${d.name}</div>
170
+ <div class="metric">
171
+ <span class="metric-label">Early Correct Turns:</span>
172
+ <span class="metric-value">${d.avg_early_correct_turns.toFixed(2)}</span>
173
+ </div>
174
+ <div class="metric">
175
+ <span class="metric-label">Failed Guesses:</span>
176
+ <span class="metric-value">${d.avg_failed_guesses.toFixed(2)}</span>
177
+ </div>
178
+ <div class="metric">
179
+ <span class="metric-label">Type:</span>
180
+ <span class="metric-value">${d.is_open ? 'Open' : 'Closed'}</span>
181
+ </div>
182
+ `;
183
+
184
+ const tipWidth = tip.offsetWidth || 150;
185
+ const tipHeight = tip.offsetHeight || 80;
186
+ let tipX = x + 12;
187
+ let tipY = y - tipHeight / 2;
188
+
189
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
190
+ if (tipY < 0) tipY = 8;
191
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
192
+
193
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
194
+ tip.style.opacity = '1';
195
+ }
196
+
197
+ function hideTooltip() {
198
+ tip.style.opacity = '0';
199
+ tip.style.transform = 'translate(-9999px, -9999px)';
200
+ }
201
+
202
+ function render() {
203
+ if (!data) return;
204
+
205
+ const { innerWidth, innerHeight } = updateSize();
206
+ const models = data.models;
207
+
208
+ // Update scales - X starts at 0
209
+ const xExtent = d3.extent(models, d => d.avg_failed_guesses);
210
+ const yExtent = d3.extent(models, d => d.avg_early_correct_turns);
211
+ const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
212
+ const yPadding = (yExtent[1] - yExtent[0]) * 0.1;
213
+
214
+ xScale
215
+ .domain([0, xExtent[1] + xPadding])
216
+ .range([0, innerWidth])
217
+ .nice();
218
+
219
+ yScale
220
+ .domain([0, yExtent[1] + yPadding])
221
+ .range([innerHeight, 0])
222
+ .nice();
223
+
224
+ // Grid lines
225
+ const xTicks = xScale.ticks(6);
226
+ const yTicks = yScale.ticks(6);
227
+
228
+ gGrid.selectAll('.grid-x')
229
+ .data(xTicks)
230
+ .join('line')
231
+ .attr('class', 'grid-x')
232
+ .attr('x1', d => xScale(d))
233
+ .attr('x2', d => xScale(d))
234
+ .attr('y1', 0)
235
+ .attr('y2', innerHeight);
236
+
237
+ gGrid.selectAll('.grid-y')
238
+ .data(yTicks)
239
+ .join('line')
240
+ .attr('class', 'grid-y')
241
+ .attr('x1', 0)
242
+ .attr('x2', innerWidth)
243
+ .attr('y1', d => yScale(d))
244
+ .attr('y2', d => yScale(d));
245
+
246
+ // Axes with inner ticks
247
+ const tickSize = 6;
248
+ gAxes.selectAll('.x-axis')
249
+ .data([0])
250
+ .join('g')
251
+ .attr('class', 'x-axis')
252
+ .attr('transform', `translate(0,${innerHeight})`)
253
+ .call(d3.axisBottom(xScale).ticks(6).tickSizeInner(-tickSize).tickSizeOuter(0));
254
+
255
+ gAxes.selectAll('.y-axis')
256
+ .data([0])
257
+ .join('g')
258
+ .attr('class', 'y-axis')
259
+ .call(d3.axisLeft(yScale).ticks(6).tickSizeInner(-tickSize).tickSizeOuter(0));
260
+
261
+ // Axis labels
262
+ gAxes.selectAll('.x-label')
263
+ .data([0])
264
+ .join('text')
265
+ .attr('class', 'x-label axis-label')
266
+ .attr('x', innerWidth / 2)
267
+ .attr('y', innerHeight + 44)
268
+ .attr('text-anchor', 'middle')
269
+ .text('Average Failed Guesses per Round');
270
+
271
+ gAxes.selectAll('.y-label')
272
+ .data([0])
273
+ .join('text')
274
+ .attr('class', 'y-label axis-label')
275
+ .attr('x', -innerHeight / 2)
276
+ .attr('y', -52)
277
+ .attr('text-anchor', 'middle')
278
+ .attr('transform', 'rotate(-90)')
279
+ .text('Average Early Correct Turns');
280
+
281
+ // Points - circles for closed models, stars for open models
282
+ const pointRadius = Math.max(8, Math.min(16, innerWidth / 60));
283
+
284
+ // Helper function to create a 5-point star path
285
+ const starPath = (cx, cy, outerR, innerR) => {
286
+ const points = [];
287
+ for (let i = 0; i < 10; i++) {
288
+ const r = i % 2 === 0 ? outerR : innerR;
289
+ const angle = (Math.PI / 2) + (i * Math.PI / 5);
290
+ points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
291
+ }
292
+ return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
293
+ };
294
+
295
+ // Closed models as circles
296
+ const closedModels = models.filter(d => !d.is_open);
297
+ gPoints.selectAll('.point-circle')
298
+ .data(closedModels, d => d.name)
299
+ .join('circle')
300
+ .attr('class', 'point point-circle')
301
+ .attr('cx', d => xScale(d.avg_failed_guesses))
302
+ .attr('cy', d => yScale(d.avg_early_correct_turns))
303
+ .attr('r', pointRadius)
304
+ .attr('fill', d => d.color)
305
+ .attr('stroke', 'none')
306
+ .on('mouseenter', showTooltip)
307
+ .on('mousemove', showTooltip)
308
+ .on('mouseleave', hideTooltip);
309
+
310
+ // Open models as stars
311
+ const openModels = models.filter(d => d.is_open);
312
+ gPoints.selectAll('.point-star')
313
+ .data(openModels, d => d.name)
314
+ .join('path')
315
+ .attr('class', 'point point-star')
316
+ .attr('d', d => starPath(xScale(d.avg_failed_guesses), yScale(d.avg_early_correct_turns), pointRadius * 1.2, pointRadius * 0.5))
317
+ .attr('fill', d => d.color)
318
+ .attr('stroke', 'none')
319
+ .on('mouseenter', showTooltip)
320
+ .on('mousemove', showTooltip)
321
+ .on('mouseleave', hideTooltip);
322
+
323
+ // Point labels
324
+ gLabels.selectAll('.point-label')
325
+ .data(models)
326
+ .join('text')
327
+ .attr('class', 'point-label')
328
+ .attr('x', d => xScale(d.avg_failed_guesses) + pointRadius + 6)
329
+ .attr('y', d => yScale(d.avg_early_correct_turns) + 4)
330
+ .text(d => d.name);
331
+ }
332
+
333
+ // Initialize
334
+ fetch(DATA_URL, { cache: 'no-cache' })
335
+ .then(r => r.json())
336
+ .then(json => {
337
+ data = json;
338
+ render();
339
+ })
340
+ .catch(err => {
341
+ const pre = document.createElement('pre');
342
+ pre.style.color = 'red';
343
+ pre.style.padding = '16px';
344
+ pre.textContent = `Error loading data: ${err.message}`;
345
+ container.appendChild(pre);
346
+ });
347
+
348
+ // Resize handling
349
+ if (window.ResizeObserver) {
350
+ new ResizeObserver(() => render()).observe(container);
351
+ } else {
352
+ window.addEventListener('resize', render);
353
+ }
354
+
355
+ // Theme change handling
356
+ const observer = new MutationObserver(() => render());
357
+ observer.observe(document.documentElement, {
358
+ attributes: true,
359
+ attributeFilter: ['data-theme']
360
+ });
361
+ };
362
+
363
+ if (document.readyState === 'loading') {
364
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
365
+ } else {
366
+ ensureD3(bootstrap);
367
+ }
368
+ })();
369
+ </script>
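
Review note on the embed above: the edge-avoidance rule inside `showTooltip` (flip to the left of the cursor when the tooltip would overflow the right edge, then clamp 8px from the top and bottom) can be checked in isolation. The following is a minimal sketch, assuming a hypothetical helper name `placeTooltip` that is not part of this commit. It mirrors the arithmetic in the diff and takes the measured sizes as plain numbers.

```javascript
// Hypothetical helper (not in the commit): same placement rules as showTooltip.
// x, y       - cursor position relative to the container
// tipWidth,  - measured tooltip size in pixels
// tipHeight
// width,     - current chart size in pixels
// height
function placeTooltip(x, y, tipWidth, tipHeight, width, height) {
  // Default: 12px to the right of the cursor, vertically centered on it.
  let tipX = x + 12;
  let tipY = y - tipHeight / 2;

  // Flip to the left of the cursor if the tooltip would cross the right edge.
  if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
  // Keep an 8px gap from the top and bottom edges.
  if (tipY < 0) tipY = 8;
  if (tipY + tipHeight > height) tipY = height - tipHeight - 8;

  return { tipX, tipY };
}

// Cursor near the top-right corner of an 800x450 chart.
console.log(placeTooltip(700, 10, 150, 80, 800, 450));
```

In the embed, the returned pair would feed `tip.style.transform`. The other two charts in this commit repeat the same arithmetic, so a shared helper of this kind may reduce duplication.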
app/src/content/embeds/complexity-analysis.html ADDED
@@ -0,0 +1,492 @@
1
+ <div class="d3-complexity-analysis"></div>
2
+ <style>
3
+ .d3-complexity-analysis {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-complexity-analysis svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-complexity-analysis .axes path,
17
+ .d3-complexity-analysis .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-complexity-analysis .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 11px;
24
+ }
25
+
26
+ .d3-complexity-analysis .axes text.axis-label {
27
+ font-size: 14px;
28
+ font-weight: 500;
29
+ fill: var(--text-color);
30
+ }
31
+
32
+ .d3-complexity-analysis .axes text.chart-title {
33
+ font-size: 16px;
34
+ font-weight: 600;
35
+ fill: var(--text-color);
36
+ }
37
+
38
+ .d3-complexity-analysis .cell {
39
+ stroke: var(--surface-bg, #fff);
40
+ stroke-width: 2;
41
+ cursor: pointer;
42
+ transition: opacity 0.1s ease;
43
+ }
44
+
45
+ .d3-complexity-analysis .cell:hover {
46
+ opacity: 0.85;
47
+ }
48
+
49
+ .d3-complexity-analysis .cell-text {
50
+ font-size: 13px;
51
+ font-weight: 600;
52
+ pointer-events: none;
53
+ }
54
+
55
+ .d3-complexity-analysis .model-label {
56
+ font-size: 12px;
57
+ fill: var(--text-color);
58
+ }
59
+
60
+ .d3-complexity-analysis .quartile-label {
61
+ font-size: 12px;
62
+ fill: var(--text-color);
63
+ }
64
+
65
+ .d3-complexity-analysis .legend-title {
66
+ font-size: 11px;
67
+ fill: var(--muted-color);
68
+ }
69
+
70
+ .d3-complexity-analysis .legend-tick {
71
+ font-size: 10px;
72
+ fill: var(--muted-color);
73
+ }
74
+
75
+ .d3-complexity-analysis .d3-tooltip {
76
+ position: absolute;
77
+ top: 0;
78
+ left: 0;
79
+ transform: translate(-9999px, -9999px);
80
+ pointer-events: none;
81
+ padding: 10px 12px;
82
+ border-radius: 8px;
83
+ font-size: 12px;
84
+ line-height: 1.5;
85
+ border: 1px solid var(--border-color);
86
+ background: var(--surface-bg);
87
+ color: var(--text-color);
88
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
89
+ opacity: 0;
90
+ transition: opacity 0.12s ease;
91
+ z-index: 10;
92
+ max-width: 280px;
93
+ }
94
+
95
+ .d3-complexity-analysis .d3-tooltip .model-name {
96
+ font-weight: 600;
97
+ margin-bottom: 4px;
98
+ }
99
+
100
+ .d3-complexity-analysis .d3-tooltip .metric {
101
+ display: flex;
102
+ justify-content: space-between;
103
+ gap: 16px;
104
+ }
105
+
106
+ .d3-complexity-analysis .d3-tooltip .metric-label {
107
+ color: var(--muted-color);
108
+ }
109
+
110
+ .d3-complexity-analysis .d3-tooltip .metric-value {
111
+ font-weight: 500;
112
+ }
113
+
114
+ .d3-complexity-analysis .d3-tooltip .interpretation {
115
+ margin-top: 6px;
116
+ font-size: 11px;
117
+ color: var(--muted-color);
118
+ font-style: italic;
119
+ }
120
+ </style>
121
+ <script>
122
+ (() => {
123
+ const ensureD3 = (cb) => {
124
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
125
+ let s = document.getElementById('d3-cdn-script');
126
+ if (!s) {
127
+ s = document.createElement('script');
128
+ s.id = 'd3-cdn-script';
129
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
130
+ document.head.appendChild(s);
131
+ }
132
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
133
+ s.addEventListener('load', onReady, { once: true });
134
+ if (window.d3) onReady();
135
+ };
136
+
137
+ const bootstrap = () => {
138
+ const scriptEl = document.currentScript;
139
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
140
+ if (!(container && container.classList && container.classList.contains('d3-complexity-analysis'))) {
141
+ const candidates = Array.from(document.querySelectorAll('.d3-complexity-analysis'))
142
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
143
+ container = candidates[candidates.length - 1] || null;
144
+ }
145
+ if (!container) return;
146
+ if (container.dataset) {
147
+ if (container.dataset.mounted === 'true') return;
148
+ container.dataset.mounted = 'true';
149
+ }
150
+
151
+ // Tooltip setup
152
+ container.style.position = container.style.position || 'relative';
153
+ const tip = document.createElement('div');
154
+ tip.className = 'd3-tooltip';
155
+ container.appendChild(tip);
156
+
157
+ // SVG setup
158
+ const svg = d3.select(container).append('svg');
159
+ const gRoot = svg.append('g');
160
+
161
+ // Chart groups
162
+ const gAxes = gRoot.append('g').attr('class', 'axes');
163
+ const gCells = gRoot.append('g').attr('class', 'cells');
164
+ const gLegend = gRoot.append('g').attr('class', 'legend');
165
+
166
+ // State
167
+ let data = null;
168
+ let width = 700;
169
+ let height = 450;
170
+ const margin = { top: 60, right: 100, bottom: 60, left: 160 };
171
+
172
+ // Scales
173
+ const xScale = d3.scaleBand();
174
+ const yScale = d3.scaleBand();
175
+
176
+ // Linear color scale: red (0%) -> green (100%+)
177
+ const colorScale = d3.scaleLinear()
178
+ .interpolate(() => d3.interpolateRdYlGn);
179
+
180
+ const DATA_URL = '/data/complexity_analysis.json';
181
+
182
+ function updateSize() {
183
+ width = Math.min(container.clientWidth || 700, 800);
184
+ const numModels = data ? data.models.length : 10;
185
+ const cellHeight = 36;
186
+ height = margin.top + margin.bottom + numModels * cellHeight;
187
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
188
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
189
+ return {
190
+ innerWidth: width - margin.left - margin.right,
191
+ innerHeight: height - margin.top - margin.bottom
192
+ };
193
+ }
194
+
195
+ function getContrastColor(hexColor) {
196
+ const hex = hexColor.replace('#', '');
197
+ const r = parseInt(hex.slice(0, 2), 16) / 255;
198
+ const g = parseInt(hex.slice(2, 4), 16) / 255;
199
+ const b = parseInt(hex.slice(4, 6), 16) / 255;
200
+ const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
201
+ return luminance > 0.5 ? '#000000' : '#ffffff';
202
+ }
203
+
204
+ function rgbToHex(rgb) {
205
+ // Convert rgb(r, g, b) string to #rrggbb
206
+ const match = rgb.match(/rgb\((\d+),\s*(\d+),\s*(\d+)\)/);
207
+ if (!match) return rgb;
208
+ const r = parseInt(match[1]).toString(16).padStart(2, '0');
209
+ const g = parseInt(match[2]).toString(16).padStart(2, '0');
210
+ const b = parseInt(match[3]).toString(16).padStart(2, '0');
211
+ return `#${r}${g}${b}`;
212
+ }
213
+
214
+ function showTooltip(event, d) {
215
+ const rect = container.getBoundingClientRect();
216
+ const x = event.clientX - rect.left;
217
+ const y = event.clientY - rect.top;
218
+
219
+ const pct = d.score * 100;
220
+ const interpretation = pct > 100
221
+ ? `Performs ${(pct - 100).toFixed(0)}% above average on ${d.quartile} rules`
222
+ : pct < 100
223
+ ? `Performs ${(100 - pct).toFixed(0)}% below average on ${d.quartile} rules`
224
+ : 'Performs at average on these rules';
225
+
226
+ const quartileDesc = {
227
+ 'Q1': 'Easiest (lowest complexity)',
228
+ 'Q2': 'Easy-Medium',
229
+ 'Q3': 'Medium-Hard',
230
+ 'Q4': 'Hardest (highest complexity)'
231
+ };
232
+
233
+ tip.innerHTML = `
234
+ <div class="model-name">${d.model}</div>
235
+ <div class="metric">
236
+ <span class="metric-label">Quartile:</span>
237
+ <span class="metric-value">${d.quartile}</span>
238
+ </div>
239
+ <div class="metric">
240
+ <span class="metric-label">Difficulty:</span>
241
+ <span class="metric-value">${quartileDesc[d.quartile]}</span>
242
+ </div>
243
+ <div class="metric">
244
+ <span class="metric-label">Relative Score:</span>
245
+ <span class="metric-value">${pct.toFixed(0)}%</span>
246
+ </div>
247
+ <div class="interpretation">${interpretation}</div>
248
+ `;
249
+
250
+ const tipWidth = tip.offsetWidth || 200;
251
+ const tipHeight = tip.offsetHeight || 120;
252
+ let tipX = x + 12;
253
+ let tipY = y - tipHeight / 2;
254
+
255
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
256
+ if (tipY < 0) tipY = 8;
257
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
258
+
259
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
260
+ tip.style.opacity = '1';
261
+ }
262
+
263
+ function hideTooltip() {
264
+ tip.style.opacity = '0';
265
+ tip.style.transform = 'translate(-9999px, -9999px)';
266
+ }
267
+
268
+ function render() {
269
+ if (!data) return;
270
+
271
+ const { innerWidth, innerHeight } = updateSize();
272
+ const quartiles = data.quartiles;
273
+ const models = data.models;
274
+
275
+ // Update scales
276
+ xScale
277
+ .domain(quartiles)
278
+ .range([0, innerWidth])
279
+ .padding(0.08);
280
+
281
+ yScale
282
+ .domain(models.map(m => m.name))
283
+ .range([0, innerHeight])
284
+ .padding(0.08);
285
+
286
+ // Find score extent for color scale (in percentage: 0-100%+)
287
+ const allScores = [];
288
+ models.forEach(m => {
289
+ quartiles.forEach(q => {
290
+ allScores.push(m.quartile_scores[q] * 100);
291
+ });
292
+ });
293
+ const minPct = Math.min(...allScores);
294
+ const maxPct = Math.max(...allScores);
295
+ // Color domain runs from 0% (red) to the highest observed score (green)
296
+ colorScale.domain([0, maxPct]);
297
+
298
+ // Build cell data (with percentage values)
299
+ const cellData = [];
300
+ models.forEach(m => {
301
+ quartiles.forEach(q => {
302
+ cellData.push({
303
+ model: m.name,
304
+ quartile: q,
305
+ score: m.quartile_scores[q],
306
+ pct: m.quartile_scores[q] * 100
307
+ });
308
+ });
309
+ });
310
+
311
+ // Draw cells
312
+ gCells.selectAll('.cell')
313
+ .data(cellData, d => `${d.model}-${d.quartile}`)
314
+ .join('rect')
315
+ .attr('class', 'cell')
316
+ .attr('x', d => xScale(d.quartile))
317
+ .attr('y', d => yScale(d.model))
318
+ .attr('width', xScale.bandwidth())
319
+ .attr('height', yScale.bandwidth())
320
+ .attr('fill', d => colorScale(d.pct))
321
+ .attr('rx', 4)
322
+ .on('mouseenter', showTooltip)
323
+ .on('mousemove', showTooltip)
324
+ .on('mouseleave', hideTooltip);
325
+
326
+ // Draw cell text
327
+ gCells.selectAll('.cell-text')
328
+ .data(cellData, d => `${d.model}-${d.quartile}`)
329
+ .join('text')
330
+ .attr('class', 'cell-text')
331
+ .attr('x', d => xScale(d.quartile) + xScale.bandwidth() / 2)
332
+ .attr('y', d => yScale(d.model) + yScale.bandwidth() / 2)
333
+ .attr('text-anchor', 'middle')
334
+ .attr('dominant-baseline', 'central')
335
+ .style('fill', d => {
336
+ const bgColor = colorScale(d.pct);
337
+ const hex = bgColor.startsWith('rgb') ? rgbToHex(bgColor) : bgColor;
338
+ return getContrastColor(hex);
339
+ })
340
+ .text(d => `${d.pct.toFixed(0)}%`);
341
+
342
+ // Model labels (Y-axis)
343
+ gAxes.selectAll('.model-label')
344
+ .data(models, d => d.name)
345
+ .join('text')
346
+ .attr('class', 'model-label')
347
+ .attr('x', -10)
348
+ .attr('y', d => yScale(d.name) + yScale.bandwidth() / 2)
349
+ .attr('text-anchor', 'end')
350
+ .attr('dominant-baseline', 'central')
351
+ .text(d => d.name);
352
+
353
+ // Quartile labels (X-axis)
354
+ gAxes.selectAll('.quartile-label')
355
+ .data(quartiles)
356
+ .join('text')
357
+ .attr('class', 'quartile-label')
358
+ .attr('x', d => xScale(d) + xScale.bandwidth() / 2)
359
+ .attr('y', -10)
360
+ .attr('text-anchor', 'middle')
361
+ .text(d => d);
362
+
363
+ // X-axis title
364
+ gAxes.selectAll('.x-title')
365
+ .data([0])
366
+ .join('text')
367
+ .attr('class', 'x-title axis-label')
368
+ .attr('x', innerWidth / 2)
369
+ .attr('y', innerHeight + 40)
370
+ .attr('text-anchor', 'middle')
371
+ .text('Complexity Quartile (Q1 = easiest)');
372
+
373
+ // Chart title
374
+ gAxes.selectAll('.chart-title')
375
+ .data([0])
376
+ .join('text')
377
+ .attr('class', 'chart-title')
378
+ .attr('x', innerWidth / 2)
379
+ .attr('y', -35)
380
+ .attr('text-anchor', 'middle')
381
+ .text('Model Performance by Rule Complexity');
382
+
383
+ // Legend
384
+ const legendWidth = 20;
385
+ const legendHeight = innerHeight * 0.6;
386
+ const legendX = innerWidth + 30;
387
+ const legendY = (innerHeight - legendHeight) / 2;
388
+
389
+ // Create gradient
390
+ const gradientId = 'complexity-legend-gradient';
391
+ let defs = svg.select('defs');
392
+ if (defs.empty()) {
393
+ defs = svg.append('defs');
394
+ }
395
+
396
+ defs.selectAll(`#${gradientId}`).remove();
397
+ const gradient = defs.append('linearGradient')
398
+ .attr('id', gradientId)
399
+ .attr('x1', '0%')
400
+ .attr('x2', '0%')
401
+ .attr('y1', '100%')
402
+ .attr('y2', '0%');
403
+
404
+ const numStops = 11;
405
+ for (let i = 0; i <= numStops; i++) {
406
+ const t = i / numStops;
407
+ const value = t * maxPct;
408
+ gradient.append('stop')
409
+ .attr('offset', `${t * 100}%`)
410
+ .attr('stop-color', colorScale(value));
411
+ }
412
+
413
+ // Legend rectangle
414
+ gLegend.selectAll('.legend-rect')
415
+ .data([0])
416
+ .join('rect')
417
+ .attr('class', 'legend-rect')
418
+ .attr('x', legendX)
419
+ .attr('y', legendY)
420
+ .attr('width', legendWidth)
421
+ .attr('height', legendHeight)
422
+ .attr('fill', `url(#${gradientId})`)
423
+ .attr('rx', 2)
424
+ .attr('stroke', 'var(--border-color)')
425
+ .attr('stroke-width', 0.5);
426
+
427
+ // Legend ticks (in percentage)
428
+ const legendScale = d3.scaleLinear()
429
+ .domain([0, maxPct])
430
+ .range([legendY + legendHeight, legendY]);
431
+
432
+ // Generate nice tick values for percentage scale
433
+ const tickValues = [0, 50, 100];
434
+ if (maxPct >= 110) tickValues.push(Math.floor(maxPct / 10) * 10);
435
+
436
+ gLegend.selectAll('.legend-tick')
437
+ .data(tickValues.filter(v => v <= maxPct))
438
+ .join('text')
439
+ .attr('class', 'legend-tick')
440
+ .attr('x', legendX + legendWidth + 6)
441
+ .attr('y', d => legendScale(d))
442
+ .attr('dominant-baseline', 'middle')
443
+ .text(d => `${d}%`);
444
+
445
+ // Legend title
446
+ gLegend.selectAll('.legend-title')
447
+ .data([0])
448
+ .join('text')
449
+ .attr('class', 'legend-title')
450
+ .attr('x', legendX + legendWidth / 2)
451
+ .attr('y', legendY - 12)
452
+ .attr('text-anchor', 'middle')
453
+ .text('Relative Score');
454
+ }
455
+
456
+ // Initialize
457
+ fetch(DATA_URL, { cache: 'no-cache' })
458
+ .then(r => r.json())
459
+ .then(json => {
460
+ data = json;
461
+ render();
462
+ })
463
+ .catch(err => {
464
+ const pre = document.createElement('pre');
465
+ pre.style.color = 'red';
466
+ pre.style.padding = '16px';
467
+ pre.textContent = `Error loading data: ${err.message}`;
468
+ container.appendChild(pre);
469
+ });
470
+
471
+ // Resize handling
472
+ if (window.ResizeObserver) {
473
+ new ResizeObserver(() => render()).observe(container);
474
+ } else {
475
+ window.addEventListener('resize', render);
476
+ }
477
+
478
+ // Theme change handling
479
+ const observer = new MutationObserver(() => render());
480
+ observer.observe(document.documentElement, {
481
+ attributes: true,
482
+ attributeFilter: ['data-theme']
483
+ });
484
+ };
485
+
486
+ if (document.readyState === 'loading') {
487
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
488
+ } else {
489
+ ensureD3(bootstrap);
490
+ }
491
+ })();
492
+ </script>
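
Review note on the embed above: `getContrastColor` only accepts hex input, so the cell-text code first converts D3's `rgb(r, g, b)` strings with `rgbToHex`. One possible simplification is sketched below, under the assumption that colors arrive either as `#rrggbb` or as `rgb(r, g, b)`. The name `contrastFor` is hypothetical and not part of this commit; the luma weights and the 0.5 threshold are copied from the diff.

```javascript
// Hypothetical helper (not in the commit): choose black or white text
// for a background given as '#rrggbb' or 'rgb(r, g, b)'.
// Unrecognized formats fall back to black text.
function contrastFor(color) {
  let channels = null;
  const rgbMatch = color.match(/^rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)/);
  const hexMatch = color.match(/^#?([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2})$/i);

  if (rgbMatch) {
    channels = rgbMatch.slice(1, 4).map((c) => parseInt(c, 10));
  } else if (hexMatch) {
    channels = hexMatch.slice(1, 4).map((c) => parseInt(c, 16));
  }
  if (!channels) return '#000000';

  const [r, g, b] = channels;
  // Same weights as getContrastColor, normalized to the 0..1 range.
  const luminance = (0.299 * r + 0.587 * g + 0.114 * b) / 255;
  return luminance > 0.5 ? '#000000' : '#ffffff';
}

// Dark red from the low end of a red-yellow-green ramp.
console.log(contrastFor('rgb(165, 0, 38)'));
```

With a helper like this, the `.style('fill', ...)` callback for `.cell-text` could pass `colorScale(d.pct)` straight through without the intermediate hex conversion.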
app/src/content/embeds/confidence-distribution.html ADDED
@@ -0,0 +1,495 @@
1
+ <div class="d3-confidence-distribution"></div>
2
+ <style>
3
+ .d3-confidence-distribution {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-confidence-distribution svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-confidence-distribution .axes path,
17
+ .d3-confidence-distribution .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-confidence-distribution .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 11px;
24
+ }
25
+
26
+ .d3-confidence-distribution .grid line {
27
+ stroke: var(--grid-color, rgba(0,0,0,.08));
28
+ }
29
+
30
+ .d3-confidence-distribution .axes text.axis-label {
31
+ font-size: 14px;
32
+ font-weight: 500;
33
+ fill: var(--text-color);
34
+ }
35
+
36
+ .d3-confidence-distribution .x-axis text {
37
+ transform: translateY(4px);
38
+ }
39
+
40
+ .d3-confidence-distribution .distribution-line {
41
+ fill: none;
42
+ stroke-width: 1.5;
43
+ }
44
+
45
+ .d3-confidence-distribution .data-point {
46
+ cursor: pointer;
47
+ transition: opacity 0.15s ease;
48
+ }
49
+
50
+ .d3-confidence-distribution .data-point:hover {
51
+ opacity: 0.8;
52
+ }
53
+
54
+ .d3-confidence-distribution .legend {
55
+ font-size: 11px;
56
+ }
57
+
58
+ .d3-confidence-distribution .legend-item {
59
+ cursor: pointer;
60
+ }
61
+
62
+ .d3-confidence-distribution .legend-item.dimmed .legend-line,
63
+ .d3-confidence-distribution .legend-item.dimmed .legend-marker {
64
+ opacity: 0.3;
65
+ }
66
+
67
+ .d3-confidence-distribution .legend-item.dimmed text {
68
+ opacity: 0.4;
69
+ }
70
+
71
+ .d3-confidence-distribution .legend-text {
72
+ fill: var(--text-color);
73
+ }
74
+
75
+ .d3-confidence-distribution .d3-tooltip {
76
+ position: absolute;
77
+ top: 0;
78
+ left: 0;
79
+ transform: translate(-9999px, -9999px);
80
+ pointer-events: none;
81
+ padding: 10px 12px;
82
+ border-radius: 8px;
83
+ font-size: 12px;
84
+ line-height: 1.4;
85
+ border: 1px solid var(--border-color);
86
+ background: var(--surface-bg);
87
+ color: var(--text-color);
88
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
89
+ opacity: 0;
90
+ transition: opacity 0.12s ease;
91
+ z-index: 10;
92
+ }
93
+
94
+ .d3-confidence-distribution .d3-tooltip .model-name {
95
+ font-weight: 600;
96
+ margin-bottom: 4px;
97
+ }
98
+
99
+ .d3-confidence-distribution .d3-tooltip .metric {
100
+ display: flex;
101
+ justify-content: space-between;
102
+ gap: 16px;
103
+ }
104
+
105
+ .d3-confidence-distribution .d3-tooltip .metric-label {
106
+ color: var(--muted-color);
107
+ }
108
+
109
+ .d3-confidence-distribution .d3-tooltip .metric-value {
110
+ font-weight: 500;
111
+ }
112
+ </style>
113
+ <script>
114
+ (() => {
115
+ const ensureD3 = (cb) => {
116
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
117
+ let s = document.getElementById('d3-cdn-script');
118
+ if (!s) {
119
+ s = document.createElement('script');
120
+ s.id = 'd3-cdn-script';
121
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
122
+ document.head.appendChild(s);
123
+ }
124
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
125
+ s.addEventListener('load', onReady, { once: true });
126
+ if (window.d3) onReady();
127
+ };
128
+
129
+ const bootstrap = () => {
130
+ const scriptEl = document.currentScript;
131
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
132
+ if (!(container && container.classList && container.classList.contains('d3-confidence-distribution'))) {
133
+ const candidates = Array.from(document.querySelectorAll('.d3-confidence-distribution'))
134
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
135
+ container = candidates[candidates.length - 1] || null;
136
+ }
137
+ if (!container) return;
138
+ if (container.dataset) {
139
+ if (container.dataset.mounted === 'true') return;
140
+ container.dataset.mounted = 'true';
141
+ }
142
+
143
+ // Tooltip setup
144
+ container.style.position = container.style.position || 'relative';
145
+ const tip = document.createElement('div');
146
+ tip.className = 'd3-tooltip';
147
+ container.appendChild(tip);
148
+
149
+ // SVG setup
150
+ const svg = d3.select(container).append('svg');
151
+ const gRoot = svg.append('g');
152
+
153
+ // Chart groups (order matters for layering)
154
+ const gGrid = gRoot.append('g').attr('class', 'grid');
155
+ const gLines = gRoot.append('g').attr('class', 'lines');
156
+ const gPoints = gRoot.append('g').attr('class', 'points');
157
+ const gAxes = gRoot.append('g').attr('class', 'axes');
158
+ const gLegend = gRoot.append('g').attr('class', 'legend');
159
+
160
+ // State
161
+ let data = null;
162
+ let width = 800;
163
+ let height = 500;
164
+ const margin = { top: 20, right: 180, bottom: 56, left: 72 };
165
+ let hiddenModels = new Set();
166
+
167
+ // Scales
168
+ const xScale = d3.scaleLinear();
169
+ const yScale = d3.scaleLinear();
170
+
171
+ // Line generator
172
+ const line = d3.line()
173
+ .x(d => xScale(d.confidence_level))
174
+ .y(d => yScale(d.proportion));
175
+
176
+ // Data loading
177
+ const DATA_URL = '/data/confidence_distribution.json';
178
+
179
+ function updateSize() {
180
+ width = container.clientWidth || 800;
181
+ const availableWidth = width - margin.left - margin.right;
182
+ const maxHeight = Math.round(width * 0.7);
183
+ const innerSize = Math.min(availableWidth, maxHeight - margin.top - margin.bottom);
184
+ height = innerSize + margin.top + margin.bottom;
185
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
186
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
187
+ return {
188
+ innerWidth: width - margin.left - margin.right,
189
+ innerHeight: height - margin.top - margin.bottom
190
+ };
191
+ }
192
+
193
+ function showTooltip(event, d, model) {
194
+ const rect = container.getBoundingClientRect();
195
+ const x = event.clientX - rect.left;
196
+ const y = event.clientY - rect.top;
197
+
198
+ tip.innerHTML = `
199
+ <div class="model-name" style="color: ${model.color}">${model.name}</div>
200
+ <div class="metric">
201
+ <span class="metric-label">Confidence level:</span>
202
+ <span class="metric-value">${d.confidence_level * 10}%</span>
203
+ </div>
204
+ <div class="metric">
205
+ <span class="metric-label">Proportion:</span>
206
+ <span class="metric-value">${(d.proportion * 100).toFixed(1)}%</span>
207
+ </div>
208
+ <div class="metric">
209
+ <span class="metric-label">Count:</span>
210
+ <span class="metric-value">${d.count} / ${model.total_guesses}</span>
211
+ </div>
212
+ `;
213
+
214
+ const tipWidth = tip.offsetWidth || 150;
215
+ const tipHeight = tip.offsetHeight || 100;
216
+ let tipX = x + 12;
217
+ let tipY = y - tipHeight / 2;
218
+
219
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
220
+ if (tipY < 0) tipY = 8;
221
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
222
+
223
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
224
+ tip.style.opacity = '1';
225
+ }
226
+
227
+ function hideTooltip() {
228
+ tip.style.opacity = '0';
229
+ tip.style.transform = 'translate(-9999px, -9999px)';
230
+ }
231
+
232
+ function toggleModel(modelName) {
233
+ if (hiddenModels.has(modelName)) {
234
+ hiddenModels.delete(modelName);
235
+ } else {
236
+ hiddenModels.add(modelName);
237
+ }
238
+ render();
239
+ }
240
+
241
+ function render() {
242
+ if (!data) return;
243
+
244
+ const { innerWidth, innerHeight } = updateSize();
245
+ const models = data.models;
246
+ const visibleModels = models.filter(m => !hiddenModels.has(m.name));
247
+
248
+ // X scale: confidence levels 5-10
249
+ xScale
250
+ .domain([5, 10])
251
+ .range([0, innerWidth]);
252
+
253
+ // Y scale: proportion (0 to max + padding)
254
+ const maxProportion = d3.max(visibleModels, m =>
255
+ d3.max(m.distribution, d => d.proportion)
256
+ ) || 0.8;
257
+ yScale
258
+ .domain([0, Math.min(1, maxProportion * 1.1)])
259
+ .range([innerHeight, 0])
260
+ .nice();
261
+
262
+ // Grid lines
263
+ const xTicks = [5, 6, 7, 8, 9, 10];
264
+ const yTicks = yScale.ticks(6);
265
+
266
+ gGrid.selectAll('.grid-x')
267
+ .data(xTicks)
268
+ .join('line')
269
+ .attr('class', 'grid-x')
270
+ .attr('x1', d => xScale(d))
271
+ .attr('x2', d => xScale(d))
272
+ .attr('y1', 0)
273
+ .attr('y2', innerHeight);
274
+
275
+ gGrid.selectAll('.grid-y')
276
+ .data(yTicks)
277
+ .join('line')
278
+ .attr('class', 'grid-y')
279
+ .attr('x1', 0)
280
+ .attr('x2', innerWidth)
281
+ .attr('y1', d => yScale(d))
282
+ .attr('y2', d => yScale(d));
283
+
284
+ // Axes
285
+ const tickSize = 6;
286
+ const percentFormat = d => `${Math.round(d * 100)}%`;
287
+
288
+ gAxes.selectAll('.x-axis')
289
+ .data([0])
290
+ .join('g')
291
+ .attr('class', 'x-axis')
292
+ .attr('transform', `translate(0,${innerHeight})`)
293
+ .call(d3.axisBottom(xScale)
294
+ .tickValues(xTicks)
295
+ .tickFormat(d => d)
296
+ .tickSizeInner(-tickSize)
297
+ .tickSizeOuter(0));
298
+
299
+ gAxes.selectAll('.y-axis')
300
+ .data([0])
301
+ .join('g')
302
+ .attr('class', 'y-axis')
303
+ .call(d3.axisLeft(yScale)
304
+ .ticks(6)
305
+ .tickFormat(percentFormat)
306
+ .tickSizeInner(-tickSize)
307
+ .tickSizeOuter(0));
308
+
309
+ // Axis labels
310
+ gAxes.selectAll('.x-label')
311
+ .data([0])
312
+ .join('text')
313
+ .attr('class', 'x-label axis-label')
314
+ .attr('x', innerWidth / 2)
315
+ .attr('y', innerHeight + 44)
316
+ .attr('text-anchor', 'middle')
317
+ .text('Confidence Level');
318
+
319
+ gAxes.selectAll('.y-label')
320
+ .data([0])
321
+ .join('text')
322
+ .attr('class', 'y-label axis-label')
323
+ .attr('x', -innerHeight / 2)
324
+ .attr('y', -52)
325
+ .attr('text-anchor', 'middle')
326
+ .attr('transform', 'rotate(-90)')
327
+ .text('Proportion of Guesses');
328
+
329
+ // Lines for each model
330
+ gLines.selectAll('.distribution-line')
331
+ .data(visibleModels, d => d.name)
332
+ .join('path')
333
+ .attr('class', 'distribution-line')
334
+ .attr('d', d => line(d.distribution))
335
+ .attr('stroke', d => d.color);
336
+
337
+ // Data points - circles for closed models, stars for open models
338
+ const allPoints = visibleModels.flatMap(model =>
339
+ model.distribution.map(p => ({ ...p, model }))
340
+ );
341
+ const closedPoints = allPoints.filter(d => !d.model.is_open);
342
+ const openPoints = allPoints.filter(d => d.model.is_open);
343
+
344
+ // Helper function to create a 5-point star path
345
+ const starPath = (cx, cy, outerR, innerR) => {
346
+ const points = [];
347
+ for (let i = 0; i < 10; i++) {
348
+ const r = i % 2 === 0 ? outerR : innerR;
349
+ const angle = (Math.PI / 2) + (i * Math.PI / 5);
350
+ points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
351
+ }
352
+ return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
353
+ };
354
+
355
+ // Circles for closed models
356
+ gPoints.selectAll('.data-point-circle')
357
+ .data(closedPoints, d => `${d.model.name}-${d.confidence_level}`)
358
+ .join('circle')
359
+ .attr('class', 'data-point data-point-circle')
360
+ .attr('cx', d => xScale(d.confidence_level))
361
+ .attr('cy', d => yScale(d.proportion))
362
+ .attr('r', 4)
363
+ .attr('fill', d => d.model.color)
364
+ .attr('stroke', 'var(--surface-bg, white)')
365
+ .attr('stroke-width', 1)
366
+ .on('mouseenter', (event, d) => showTooltip(event, d, d.model))
367
+ .on('mousemove', (event, d) => showTooltip(event, d, d.model))
368
+ .on('mouseleave', hideTooltip);
369
+
370
+ // Stars for open models
371
+ gPoints.selectAll('.data-point-star')
372
+ .data(openPoints, d => `${d.model.name}-${d.confidence_level}`)
373
+ .join('path')
374
+ .attr('class', 'data-point data-point-star')
375
+ .attr('d', d => starPath(
376
+ xScale(d.confidence_level),
377
+ yScale(d.proportion),
378
+ 6, 2.6
379
+ ))
380
+ .attr('fill', d => d.model.color)
381
+ .attr('stroke', 'var(--surface-bg, white)')
382
+ .attr('stroke-width', 0.8)
383
+ .on('mouseenter', (event, d) => showTooltip(event, d, d.model))
384
+ .on('mousemove', (event, d) => showTooltip(event, d, d.model))
385
+ .on('mouseleave', hideTooltip);
386
+
387
+ // Legend
388
+ const legendX = innerWidth + 16;
389
+ const legendItemHeight = 20;
390
+
391
+ // Helper function for legend star
392
+ const legendStarPath = (cx, cy, outerR, innerR) => {
393
+ const points = [];
394
+ for (let i = 0; i < 10; i++) {
395
+ const r = i % 2 === 0 ? outerR : innerR;
396
+ const angle = (Math.PI / 2) + (i * Math.PI / 5);
397
+ points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
398
+ }
399
+ return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
400
+ };
401
+
402
+ gLegend.selectAll('.legend-item')
403
+ .data(models, d => d.name)
404
+ .join('g')
405
+ .attr('class', d => `legend-item ${hiddenModels.has(d.name) ? 'dimmed' : ''}`)
406
+ .attr('transform', (d, i) => `translate(${legendX}, ${i * legendItemHeight})`)
407
+ .each(function(d) {
408
+ const g = d3.select(this);
409
+ g.selectAll('*').remove();
410
+
411
+ // Line segment (solid for all models)
412
+ g.append('line')
413
+ .attr('class', 'legend-line')
414
+ .attr('x1', 0)
415
+ .attr('x2', 20)
416
+ .attr('y1', 0)
417
+ .attr('y2', 0)
418
+ .attr('stroke', d.color)
419
+ .attr('stroke-width', 1.5);
420
+
421
+ // Marker - circle for closed, star for open
422
+ if (d.is_open) {
423
+ g.append('path')
424
+ .attr('class', 'legend-marker')
425
+ .attr('d', legendStarPath(10, 0, 6, 2.6))
426
+ .attr('fill', d.color);
427
+ } else {
428
+ g.append('circle')
429
+ .attr('class', 'legend-marker')
430
+ .attr('cx', 10)
431
+ .attr('cy', 0)
432
+ .attr('r', 3.5)
433
+ .attr('fill', d.color);
434
+ }
435
+
436
+ g.append('text')
437
+ .attr('class', 'legend-text')
438
+ .attr('x', 26)
439
+ .attr('y', 4)
440
+ .text(d.name);
441
+
442
+ g.style('cursor', 'pointer')
443
+ .on('click', () => toggleModel(d.name));
444
+ });
445
+
446
+ // Legend note
447
+ const noteY = models.length * legendItemHeight + 12;
448
+ gLegend.selectAll('.legend-note')
449
+ .data([0])
450
+ .join('text')
451
+ .attr('class', 'legend-note')
452
+ .attr('x', legendX)
453
+ .attr('y', noteY)
454
+ .attr('font-size', '10px')
455
+ .attr('fill', 'var(--muted-color)')
456
+ .text('● = Closed, ★ = Open');
457
+ }
458
+
459
+ // Initialize
460
+ fetch(DATA_URL, { cache: 'no-cache' })
461
+ .then(r => { if (!r.ok) throw new Error(`HTTP ${r.status}`); return r.json(); })
462
+ .then(json => {
463
+ data = json;
464
+ render();
465
+ })
466
+ .catch(err => {
467
+ const pre = document.createElement('pre');
468
+ pre.style.color = 'red';
469
+ pre.style.padding = '16px';
470
+ pre.textContent = `Error loading data: ${err.message}`;
471
+ container.appendChild(pre);
472
+ });
473
+
474
+ // Resize handling
475
+ if (window.ResizeObserver) {
476
+ new ResizeObserver(() => render()).observe(container);
477
+ } else {
478
+ window.addEventListener('resize', render);
479
+ }
480
+
481
+ // Theme change handling
482
+ const observer = new MutationObserver(() => render());
483
+ observer.observe(document.documentElement, {
484
+ attributes: true,
485
+ attributeFilter: ['data-theme']
486
+ });
487
+ };
488
+
489
+ if (document.readyState === 'loading') {
490
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
491
+ } else {
492
+ ensureD3(bootstrap);
493
+ }
494
+ })();
495
+ </script>
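The embed above defines the same star geometry twice, once as `starPath` and once as `legendStarPath`. A minimal sketch of a single shared helper follows. The names `buildStarPath` and `round2` are hypothetical, and the coordinate rounding is an addition that the original code does not perform.

```javascript
// Hypothetical shared helper: builds an SVG path string for a 5-point star.
// Geometry mirrors starPath / legendStarPath in the diff above; the rounding
// step is an assumption added here to keep the path string short.
function buildStarPath(cx, cy, outerR, innerR) {
  const points = [];
  for (let i = 0; i < 10; i++) {
    // Even indices are outer tips, odd indices are inner notches.
    const r = i % 2 === 0 ? outerR : innerR;
    // Start at the top (90 degrees) and advance 36 degrees per vertex.
    const angle = Math.PI / 2 + (i * Math.PI) / 5;
    const x = cx + r * Math.cos(angle);
    const y = cy - r * Math.sin(angle); // SVG y grows downward
    points.push(`${round2(x)},${round2(y)}`);
  }
  return `M${points.join('L')}Z`;
}

// Round to two decimals; adding 0 turns -0 into 0.
function round2(v) {
  return Math.round(v * 100) / 100 + 0;
}
```

With a helper like this, the data-point markers and the legend markers could both call `buildStarPath(x, y, 6, 2.6)` instead of keeping two copies of the loop.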
app/src/content/embeds/excess-caution.html ADDED
@@ -0,0 +1,384 @@
1
+ <div class="d3-excess-caution"></div>
2
+ <style>
3
+ .d3-excess-caution {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-excess-caution svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-excess-caution .axes path,
17
+ .d3-excess-caution .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-excess-caution .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 11px;
24
+ }
25
+
26
+ .d3-excess-caution .grid line {
27
+ stroke: var(--grid-color, rgba(0,0,0,.08));
28
+ }
29
+
30
+ .d3-excess-caution .axes text.axis-label {
31
+ font-size: 14px;
32
+ font-weight: 500;
33
+ fill: var(--text-color);
34
+ }
35
+
36
+
37
+ .d3-excess-caution .strip-point {
38
+ opacity: 0.5;
39
+ }
40
+
41
+ .d3-excess-caution .mean-line {
42
+ stroke-width: 4;
43
+ cursor: pointer;
44
+ }
45
+
46
+ .d3-excess-caution .mean-line:hover {
47
+ stroke-width: 5;
48
+ }
49
+
50
+ .d3-excess-caution .legend {
51
+ font-size: 11px;
52
+ }
53
+
54
+ .d3-excess-caution .legend-text {
55
+ fill: var(--text-color);
56
+ }
57
+
58
+ .d3-excess-caution .d3-tooltip {
59
+ position: absolute;
60
+ top: 0;
61
+ left: 0;
62
+ transform: translate(-9999px, -9999px);
63
+ pointer-events: none;
64
+ padding: 10px 12px;
65
+ border-radius: 8px;
66
+ font-size: 12px;
67
+ line-height: 1.4;
68
+ border: 1px solid var(--border-color);
69
+ background: var(--surface-bg);
70
+ color: var(--text-color);
71
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
72
+ opacity: 0;
73
+ transition: opacity 0.12s ease;
74
+ z-index: 10;
75
+ }
76
+
77
+ .d3-excess-caution .d3-tooltip .model-name {
78
+ font-weight: 600;
79
+ margin-bottom: 4px;
80
+ }
81
+
82
+ .d3-excess-caution .d3-tooltip .metric {
83
+ display: flex;
84
+ justify-content: space-between;
85
+ gap: 16px;
86
+ }
87
+
88
+ .d3-excess-caution .d3-tooltip .metric-label {
89
+ color: var(--muted-color);
90
+ }
91
+
92
+ .d3-excess-caution .d3-tooltip .metric-value {
93
+ font-weight: 500;
94
+ }
95
+ </style>
96
+ <script>
97
+ (() => {
98
+ const ensureD3 = (cb) => {
99
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
100
+ let s = document.getElementById('d3-cdn-script');
101
+ if (!s) {
102
+ s = document.createElement('script');
103
+ s.id = 'd3-cdn-script';
104
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
105
+ document.head.appendChild(s);
106
+ }
107
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
108
+ s.addEventListener('load', onReady, { once: true });
109
+ if (window.d3) onReady();
110
+ };
111
+
112
+ const bootstrap = () => {
113
+ const scriptEl = document.currentScript;
114
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
115
+ if (!(container && container.classList && container.classList.contains('d3-excess-caution'))) {
116
+ const candidates = Array.from(document.querySelectorAll('.d3-excess-caution'))
117
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
118
+ container = candidates[candidates.length - 1] || null;
119
+ }
120
+ if (!container) return;
121
+ if (container.dataset) {
122
+ if (container.dataset.mounted === 'true') return;
123
+ container.dataset.mounted = 'true';
124
+ }
125
+
126
+ // Tooltip setup
127
+ container.style.position = container.style.position || 'relative';
128
+ const tip = document.createElement('div');
129
+ tip.className = 'd3-tooltip';
130
+ container.appendChild(tip);
131
+
132
+ // SVG setup
133
+ const svg = d3.select(container).append('svg');
134
+ const gRoot = svg.append('g');
135
+
136
+ // Chart groups
137
+ const gGrid = gRoot.append('g').attr('class', 'grid');
138
+ const gAxes = gRoot.append('g').attr('class', 'axes');
139
+ const gPoints = gRoot.append('g').attr('class', 'points');
140
+ const gMeans = gRoot.append('g').attr('class', 'means');
141
+ const gLegend = gRoot.append('g').attr('class', 'legend');
142
+
143
+ // State
144
+ let data = null;
145
+ let width = 800;
146
+ let height = 450;
147
+ const margin = { top: 20, right: 30, bottom: 50, left: 160 };
148
+
149
+ // Scales: X is linear (early correct turns), Y is categorical (model names)
150
+ const xScale = d3.scaleLinear();
151
+ const yScale = d3.scaleBand();
152
+
153
+ // Data loading
154
+ const DATA_URL = '/data/excess_caution.json';
155
+
156
+ // Seeded random for consistent jitter
157
+ function seededRandom(seed) {
158
+ const x = Math.sin(seed) * 10000;
159
+ return x - Math.floor(x);
160
+ }
161
+
162
+ // Approximate quartiles by indexing into the sorted array (no interpolation)
163
+ function computeQuartiles(values) {
164
+ const sorted = [...values].sort((a, b) => a - b);
165
+ const n = sorted.length;
166
+ const q1 = sorted[Math.floor(n * 0.25)];
167
+ const median = sorted[Math.floor(n * 0.5)];
168
+ const q3 = sorted[Math.floor(n * 0.75)];
169
+ return { q1, median, q3 };
170
+ }
171
+
172
+ function showTooltip(event, model) {
173
+ const rect = container.getBoundingClientRect();
174
+ const x = event.clientX - rect.left;
175
+ const y = event.clientY - rect.top;
176
+ const quartiles = computeQuartiles(model.values);
177
+
178
+ tip.innerHTML = `
179
+ <div class="model-name" style="color: ${model.color}">${model.name}</div>
180
+ <div class="metric">
181
+ <span class="metric-label">Mean:</span>
182
+ <span class="metric-value">${model.mean.toFixed(2)}</span>
183
+ </div>
184
+ <div class="metric">
185
+ <span class="metric-label">Median:</span>
186
+ <span class="metric-value">${quartiles.median}</span>
187
+ </div>
188
+ <div class="metric">
189
+ <span class="metric-label">Q1 / Q3:</span>
190
+ <span class="metric-value">${quartiles.q1} / ${quartiles.q3}</span>
191
+ </div>
192
+ <div class="metric">
193
+ <span class="metric-label">Samples:</span>
194
+ <span class="metric-value">${model.count}</span>
195
+ </div>
196
+ `;
197
+
198
+ const tipWidth = tip.offsetWidth || 150;
199
+ const tipHeight = tip.offsetHeight || 100;
200
+ let tipX = x + 12;
201
+ let tipY = y - tipHeight / 2;
202
+
203
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
204
+ if (tipY < 0) tipY = 8;
205
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
206
+
207
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
208
+ tip.style.opacity = '1';
209
+ }
210
+
211
+ function hideTooltip() {
212
+ tip.style.opacity = '0';
213
+ tip.style.transform = 'translate(-9999px, -9999px)';
214
+ }
215
+
216
+ function updateSize() {
217
+ width = container.clientWidth || 800;
218
+ // Taller chart for horizontal layout with 10 models
219
+ height = Math.max(400, Math.round(width * 0.6));
220
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
221
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
222
+ return {
223
+ innerWidth: width - margin.left - margin.right,
224
+ innerHeight: height - margin.top - margin.bottom
225
+ };
226
+ }
227
+
228
+ function render() {
229
+ if (!data) return;
230
+
231
+ const { innerWidth, innerHeight } = updateSize();
232
+
233
+ // Sort models by mean (descending - most cautious at top)
234
+ const models = [...data.models].sort((a, b) => b.mean - a.mean);
235
+
236
+ // X scale: linear (early correct turns)
237
+ const maxValue = d3.max(models, m => d3.max(m.values)) || 10;
238
+ xScale
239
+ .domain([0, maxValue + 0.5])
240
+ .range([0, innerWidth]);
241
+
242
+ // Y scale: categorical (model names)
243
+ yScale
244
+ .domain(models.map(m => m.name))
245
+ .range([0, innerHeight])
246
+ .padding(0.3);
247
+
248
+ // Grid lines (vertical)
249
+ const xTicks = xScale.ticks(6);
250
+ gGrid.selectAll('.grid-x')
251
+ .data(xTicks)
252
+ .join('line')
253
+ .attr('class', 'grid-x')
254
+ .attr('x1', d => xScale(d))
255
+ .attr('x2', d => xScale(d))
256
+ .attr('y1', 0)
257
+ .attr('y2', innerHeight);
258
+
259
+ // Remove old horizontal grid lines
260
+ gGrid.selectAll('.grid-y').remove();
261
+
262
+ // Axes
263
+ const tickSize = 6;
264
+
265
+ gAxes.selectAll('.x-axis')
266
+ .data([0])
267
+ .join('g')
268
+ .attr('class', 'x-axis')
269
+ .attr('transform', `translate(0,${innerHeight})`)
270
+ .call(d3.axisBottom(xScale)
271
+ .ticks(6)
272
+ .tickFormat(d3.format('d'))
273
+ .tickSizeInner(-tickSize)
274
+ .tickSizeOuter(0));
275
+
276
+ gAxes.selectAll('.y-axis')
277
+ .data([0])
278
+ .join('g')
279
+ .attr('class', 'y-axis')
280
+ .call(d3.axisLeft(yScale)
281
+ .tickSizeInner(-tickSize)
282
+ .tickSizeOuter(0));
283
+
284
+ // X-axis label
285
+ gAxes.selectAll('.x-label')
286
+ .data([0])
287
+ .join('text')
288
+ .attr('class', 'x-label axis-label')
289
+ .attr('x', innerWidth / 2)
290
+ .attr('y', innerHeight + 40)
291
+ .attr('text-anchor', 'middle')
292
+ .text('Early Correct Turns');
293
+
294
+ // Remove old Y-axis label
295
+ gAxes.selectAll('.y-label').remove();
296
+
297
+ // Create flat array of all points with horizontal jitter
298
+ const bandHeight = yScale.bandwidth();
299
+ const jitterWidth = 8; // Fixed horizontal jitter in pixels
300
+ const pointRadius = Math.min(2.5, bandHeight / 20);
301
+
302
+ const allPoints = models.flatMap((model, modelIdx) =>
303
+ model.values.map((value, i) => ({
304
+ model,
305
+ value,
306
+ // Seeded random jitter for consistency (horizontal)
307
+ jitter: (seededRandom(modelIdx * 1000 + i) - 0.5) * jitterWidth
308
+ }))
309
+ );
310
+
311
+ // Draw all points as small circles
312
+ gPoints.selectAll('.strip-point')
313
+ .data(allPoints, (d, i) => `${d.model.name}-${i}`)
314
+ .join('circle')
315
+ .attr('class', 'strip-point')
316
+ .attr('cx', d => xScale(d.value) + d.jitter)
317
+ .attr('cy', d => yScale(d.model.name) + bandHeight / 2)
318
+ .attr('r', pointRadius)
319
+ .attr('fill', d => d.model.color);
320
+
321
+ // Mean lines with hover (now vertical)
322
+ const meanLineHeight = bandHeight * 0.78;
323
+ gMeans.selectAll('.mean-line')
324
+ .data(models, d => d.name)
325
+ .join('line')
326
+ .attr('class', 'mean-line')
327
+ .attr('x1', d => xScale(d.mean))
328
+ .attr('x2', d => xScale(d.mean))
329
+ .attr('y1', d => yScale(d.name) + bandHeight / 2 - meanLineHeight / 2)
330
+ .attr('y2', d => yScale(d.name) + bandHeight / 2 + meanLineHeight / 2)
331
+ .attr('stroke', d => d.color)
332
+ .on('mouseenter', (event, d) => showTooltip(event, d))
333
+ .on('mousemove', (event, d) => showTooltip(event, d))
334
+ .on('mouseleave', hideTooltip);
335
+
336
+ // Legend
337
+ gLegend.selectAll('.legend-note')
338
+ .data([0])
339
+ .join('text')
340
+ .attr('class', 'legend-note legend-text')
341
+ .attr('x', innerWidth / 2)
342
+ .attr('y', innerHeight + 40)
343
+ .attr('text-anchor', 'middle')
344
+ .attr('font-size', '11px')
345
+ .text('');
346
+ }
347
+
348
+ // Initialize
349
+ fetch(DATA_URL, { cache: 'no-cache' })
350
+ .then(r => { if (!r.ok) throw new Error(`HTTP ${r.status}`); return r.json(); })
351
+ .then(json => {
352
+ data = json;
353
+ render();
354
+ })
355
+ .catch(err => {
356
+ const pre = document.createElement('pre');
357
+ pre.style.color = 'red';
358
+ pre.style.padding = '16px';
359
+ pre.textContent = `Error loading data: ${err.message}`;
360
+ container.appendChild(pre);
361
+ });
362
+
363
+ // Resize handling
364
+ if (window.ResizeObserver) {
365
+ new ResizeObserver(() => render()).observe(container);
366
+ } else {
367
+ window.addEventListener('resize', render);
368
+ }
369
+
370
+ // Theme change handling
371
+ const observer = new MutationObserver(() => render());
372
+ observer.observe(document.documentElement, {
373
+ attributes: true,
374
+ attributeFilter: ['data-theme']
375
+ });
376
+ };
377
+
378
+ if (document.readyState === 'loading') {
379
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
380
+ } else {
381
+ ensureD3(bootstrap);
382
+ }
383
+ })();
384
+ </script>
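`computeQuartiles` in the embed above picks elements by floored index, so for short arrays the reported median and quartiles can sit one element too high, and an empty array yields `undefined` fields. As a rough alternative, here is a sketch using linear interpolation between neighbouring elements (the R-7 method, which is also what `d3.quantile` documents). The function names are hypothetical and not part of the commit.

```javascript
// Linear-interpolation quantile on an ascending-sorted array (R-7 method).
// Returns undefined for an empty array.
function quantileSorted(sorted, p) {
  const n = sorted.length;
  if (n === 0) return undefined;
  const pos = (n - 1) * p;            // fractional index of the quantile
  const lo = Math.floor(pos);
  const hi = Math.min(lo + 1, n - 1); // clamp at the last element
  const frac = pos - lo;
  return sorted[lo] + (sorted[hi] - sorted[lo]) * frac;
}

// Hypothetical drop-in for computeQuartiles; sorts a copy of the input.
function interpolatedQuartiles(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return {
    q1: quantileSorted(sorted, 0.25),
    median: quantileSorted(sorted, 0.5),
    q3: quantileSorted(sorted, 0.75),
  };
}
```

For `[4, 1, 3, 2]` the index-based version reports a median of 3, while the interpolated version reports 2.5. Interpolated values are generally non-integers, so a tooltip using them would probably want `toFixed` formatting.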
app/src/content/embeds/reckless-guessing.html ADDED
@@ -0,0 +1,400 @@
1
+ <div class="d3-reckless-guessing"></div>
2
+ <style>
3
+ .d3-reckless-guessing {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-reckless-guessing svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-reckless-guessing .axes path,
17
+ .d3-reckless-guessing .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-reckless-guessing .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 12px;
24
+ }
25
+
26
+ .d3-reckless-guessing .grid line {
27
+ stroke: var(--grid-color, rgba(0,0,0,.08));
28
+ }
29
+
30
+ .d3-reckless-guessing .axes text.axis-label {
31
+ font-size: 14px;
32
+ font-weight: 500;
33
+ fill: var(--text-color);
34
+ }
35
+
36
+ .d3-reckless-guessing .axes text.chart-title {
37
+ font-size: 16px;
38
+ font-weight: 600;
39
+ fill: var(--text-color);
40
+ }
41
+
42
+ .d3-reckless-guessing .axes text.subtitle {
43
+ font-size: 11px;
44
+ font-style: italic;
45
+ fill: var(--muted-color);
46
+ }
47
+
48
+ .d3-reckless-guessing .model-label {
49
+ font-size: 13px;
50
+ font-weight: 500;
51
+ }
52
+
53
+ .d3-reckless-guessing .bar {
54
+ cursor: pointer;
55
+ transition: opacity 0.15s ease;
56
+ }
57
+
58
+ .d3-reckless-guessing .bar:hover {
59
+ opacity: 0.8;
60
+ }
61
+
62
+
63
+ .d3-reckless-guessing .percent-label {
64
+ font-size: 12px;
65
+ font-weight: 500;
66
+ fill: var(--text-color);
67
+ }
68
+
69
+ .d3-reckless-guessing .d3-tooltip {
70
+ position: absolute;
71
+ top: 0;
72
+ left: 0;
73
+ transform: translate(-9999px, -9999px);
74
+ pointer-events: none;
75
+ padding: 10px 12px;
76
+ border-radius: 8px;
77
+ font-size: 12px;
78
+ line-height: 1.4;
79
+ border: 1px solid var(--border-color);
80
+ background: var(--surface-bg);
81
+ color: var(--text-color);
82
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
83
+ opacity: 0;
84
+ transition: opacity 0.12s ease;
85
+ z-index: 10;
86
+ }
87
+
88
+ .d3-reckless-guessing .d3-tooltip .model-name {
89
+ font-weight: 600;
90
+ margin-bottom: 4px;
91
+ }
92
+
93
+ .d3-reckless-guessing .d3-tooltip .metric {
94
+ display: flex;
95
+ justify-content: space-between;
96
+ gap: 16px;
97
+ }
98
+
99
+ .d3-reckless-guessing .d3-tooltip .metric-label {
100
+ color: var(--muted-color);
101
+ }
102
+
103
+ .d3-reckless-guessing .d3-tooltip .metric-value {
104
+ font-weight: 500;
105
+ }
106
+ </style>
107
+ <script>
108
+ (() => {
109
+ const ensureD3 = (cb) => {
110
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
111
+ let s = document.getElementById('d3-cdn-script');
112
+ if (!s) {
113
+ s = document.createElement('script');
114
+ s.id = 'd3-cdn-script';
115
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
116
+ document.head.appendChild(s);
117
+ }
118
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
119
+ s.addEventListener('load', onReady, { once: true });
120
+ if (window.d3) onReady();
121
+ };
122
+
123
+ const bootstrap = () => {
124
+ const scriptEl = document.currentScript;
125
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
126
+ if (!(container && container.classList && container.classList.contains('d3-reckless-guessing'))) {
127
+ const candidates = Array.from(document.querySelectorAll('.d3-reckless-guessing'))
128
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
129
+ container = candidates[candidates.length - 1] || null;
130
+ }
131
+ if (!container) return;
132
+ if (container.dataset) {
133
+ if (container.dataset.mounted === 'true') return;
134
+ container.dataset.mounted = 'true';
135
+ }
136
+
137
+ // Tooltip setup
138
+ container.style.position = container.style.position || 'relative';
139
+ const tip = document.createElement('div');
140
+ tip.className = 'd3-tooltip';
141
+ container.appendChild(tip);
142
+
143
+ // SVG setup
144
+ const svg = d3.select(container).append('svg');
145
+ const gRoot = svg.append('g');
146
+
147
+ // Chart groups
148
+ const gGrid = gRoot.append('g').attr('class', 'grid');
149
+ const gAxes = gRoot.append('g').attr('class', 'axes');
150
+ const gBars = gRoot.append('g').attr('class', 'bars');
151
+ const gLabels = gRoot.append('g').attr('class', 'labels');
152
+
153
+ // State
154
+ let data = null;
155
+ let width = 800;
156
+ let height = 450;
157
+ const margin = { top: 40, right: 50, bottom: 56, left: 20 };
158
+
159
+ // Scales
160
+ const xScale = d3.scaleLinear();
161
+ const yScale = d3.scaleBand();
162
+
163
+ // Data loading
164
+ const JSON_PATHS = [
165
+ '/data/reckless_guessing.json',
166
+ './assets/data/reckless_guessing.json',
167
+ '../assets/data/reckless_guessing.json',
168
+ '../../assets/data/reckless_guessing.json'
169
+ ];
170
+
171
+ const fetchFirstAvailable = async (paths) => {
172
+ for (const p of paths) {
173
+ try {
174
+ const r = await fetch(p, { cache: 'no-cache' });
175
+ if (r.ok) return await r.json();
176
+ } catch (_) {}
177
+ }
178
+ throw new Error('Data not found');
179
+ };
180
+
181
+ function updateSize() {
182
+ width = container.clientWidth || 800;
183
+ const numModels = data ? data.models.length : 10;
184
+ const barHeight = 36;
185
+ height = margin.top + margin.bottom + numModels * barHeight;
186
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
187
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
188
+ return {
189
+ innerWidth: width - margin.left - margin.right,
190
+ innerHeight: height - margin.top - margin.bottom
191
+ };
192
+ }
193
+
194
+ function showTooltip(event, d) {
195
+ const rect = container.getBoundingClientRect();
196
+ const x = event.clientX - rect.left;
197
+ const y = event.clientY - rect.top;
198
+
199
+ tip.innerHTML = `
200
+ <div class="model-name" style="color: ${d.color}">${d.name}</div>
201
+ <div class="metric">
202
+ <span class="metric-label">Double-Down Rate:</span>
203
+ <span class="metric-value">${(d.double_down_rate * 100).toFixed(0)}%</span>
204
+ </div>
205
+ <div class="metric">
206
+ <span class="metric-label">Wrong Guesses:</span>
207
+ <span class="metric-value">${d.wrong_guesses}</span>
208
+ </div>
209
+ <div class="metric">
210
+ <span class="metric-label">Next Turn Guesses:</span>
211
+ <span class="metric-value">${d.next_turn_guesses}</span>
212
+ </div>
213
+ <div class="metric">
214
+ <span class="metric-label">Max Streak:</span>
215
+ <span class="metric-value">${d.max_streak}</span>
216
+ </div>
217
+ <div class="metric">
218
+ <span class="metric-label">Type:</span>
219
+ <span class="metric-value">${d.is_open ? 'Open' : 'Closed'}</span>
220
+ </div>
221
+ `;
222
+
223
+ const tipWidth = tip.offsetWidth || 180;
224
+ const tipHeight = tip.offsetHeight || 120;
225
+ let tipX = x + 12;
226
+ let tipY = y - tipHeight / 2;
227
+
228
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
229
+ if (tipY < 0) tipY = 8;
230
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
231
+
232
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
233
+ tip.style.opacity = '1';
234
+ }
235
+
236
+ function hideTooltip() {
237
+ tip.style.opacity = '0';
238
+ tip.style.transform = 'translate(-9999px, -9999px)';
239
+ }
240
+
241
+ // Approximate perceived brightness (Rec. 601 luma weights on gamma-encoded RGB) and return black or white for best contrast
242
+ function getContrastColor(hexColor) {
243
+ const hex = hexColor.replace('#', '');
244
+ const r = parseInt(hex.slice(0, 2), 16) / 255;
245
+ const g = parseInt(hex.slice(2, 4), 16) / 255;
246
+ const b = parseInt(hex.slice(4, 6), 16) / 255;
247
+ const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
248
+ return luminance > 0.5 ? '#000000' : '#ffffff';
249
+ }
250
+
251
+ function render() {
252
+ if (!data) return;
253
+
254
+ const { innerWidth, innerHeight } = updateSize();
255
+
256
+ // Sort models by double_down_rate descending
257
+ const models = [...data.models].sort((a, b) => b.double_down_rate - a.double_down_rate);
258
+
259
+ // Update scales
260
+ xScale
261
+ .domain([0, 0.8])
262
+ .range([0, innerWidth]);
263
+
264
+ yScale
265
+ .domain(models.map(d => d.name))
266
+ .range([0, innerHeight])
267
+ .padding(0.25);
268
+
269
+ // Grid lines (vertical)
270
+ const xTicks = [0, 0.2, 0.4, 0.6, 0.8];
271
+ gGrid.selectAll('.grid-x')
272
+ .data(xTicks)
273
+ .join('line')
274
+ .attr('class', 'grid-x')
275
+ .attr('x1', d => xScale(d))
276
+ .attr('x2', d => xScale(d))
277
+ .attr('y1', 0)
278
+ .attr('y2', innerHeight);
279
+
280
+ // Title
281
+ gAxes.selectAll('.chart-title')
282
+ .data([0])
283
+ .join('text')
284
+ .attr('class', 'chart-title')
285
+ .attr('x', innerWidth / 2)
286
+ .attr('y', -20)
287
+ .attr('text-anchor', 'middle')
288
+ .text('After Wrong Guess: % Guessing Again Next Turn');
289
+
290
+ // X-axis (bottom)
291
+ gAxes.selectAll('.x-axis')
292
+ .data([0])
293
+ .join('g')
294
+ .attr('class', 'x-axis')
295
+ .attr('transform', `translate(0,${innerHeight})`)
296
+ .call(d3.axisBottom(xScale)
297
+ .tickValues(xTicks)
298
+ .tickFormat(d => `${Math.round(d * 100)}%`)
299
+ .tickSizeOuter(0));
300
+
301
+ // X-axis label
302
+ gAxes.selectAll('.x-label')
303
+ .data([0])
304
+ .join('text')
305
+ .attr('class', 'x-label axis-label')
306
+ .attr('x', innerWidth / 2)
307
+ .attr('y', innerHeight + 34)
308
+ .attr('text-anchor', 'middle')
309
+ .text('Double-Down Rate');
310
+
311
+ // Subtitle
312
+ gAxes.selectAll('.subtitle')
313
+ .data([0])
314
+ .join('text')
315
+ .attr('class', 'subtitle')
316
+ .attr('x', innerWidth / 2)
317
+ .attr('y', innerHeight + 48)
318
+ .attr('text-anchor', 'middle')
319
+ .text('Higher = more reckless (keeps guessing after failures)');
320
+
321
+ // Bars
322
+ const barHeight = yScale.bandwidth();
323
+
324
+ // All models with filled bars
325
+ gBars.selectAll('.bar')
326
+ .data(models, d => d.name)
327
+ .join('rect')
328
+ .attr('class', 'bar')
329
+ .attr('x', 0)
330
+ .attr('y', d => yScale(d.name))
331
+ .attr('width', d => xScale(d.double_down_rate))
332
+ .attr('height', barHeight)
333
+ .attr('fill', d => d.color)
334
+ .attr('rx', 3)
335
+ .attr('ry', 3)
336
+ .on('mouseenter', showTooltip)
337
+ .on('mousemove', showTooltip)
338
+ .on('mouseleave', hideTooltip);
339
+
340
+ // Model labels (inside bars)
341
+ gLabels.selectAll('.model-label')
342
+ .data(models, d => d.name)
343
+ .join('text')
344
+ .attr('class', 'model-label')
345
+ .attr('x', 8)
346
+ .attr('y', d => yScale(d.name) + barHeight / 2)
347
+ .attr('dy', '0.35em')
348
+ .attr('text-anchor', 'start')
349
+ .style('fill', d => getContrastColor(d.color))
350
+ .text(d => d.name);
351
+
352
+ // Percentage labels (end of bars)
353
+ gLabels.selectAll('.percent-label')
354
+ .data(models, d => d.name)
355
+ .join('text')
356
+ .attr('class', 'percent-label')
357
+ .attr('x', d => xScale(d.double_down_rate) + 6)
358
+ .attr('y', d => yScale(d.name) + barHeight / 2)
359
+ .attr('dy', '0.35em')
360
+ .attr('text-anchor', 'start')
361
+ .text(d => `${Math.round(d.double_down_rate * 100)}%`);
362
+
363
+ }
364
+
365
+ // Initialize
366
+ fetchFirstAvailable(JSON_PATHS)
367
+ .then(json => {
368
+ data = json;
369
+ render();
370
+ })
371
+ .catch(err => {
372
+ const pre = document.createElement('pre');
373
+ pre.style.color = 'red';
374
+ pre.style.padding = '16px';
375
+ pre.textContent = `Error loading data: ${err.message}`;
376
+ container.appendChild(pre);
377
+ });
378
+
379
+ // Resize handling
380
+ if (window.ResizeObserver) {
381
+ new ResizeObserver(() => render()).observe(container);
382
+ } else {
383
+ window.addEventListener('resize', render);
384
+ }
385
+
386
+ // Theme change handling
387
+ const observer = new MutationObserver(() => render());
388
+ observer.observe(document.documentElement, {
389
+ attributes: true,
390
+ attributeFilter: ['data-theme']
391
+ });
392
+ };
393
+
394
+ if (document.readyState === 'loading') {
395
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
396
+ } else {
397
+ ensureD3(bootstrap);
398
+ }
399
+ })();
400
+ </script>
app/src/content/embeds/score-stack.html ADDED
@@ -0,0 +1,440 @@
1
+ <div class="d3-score-stack"></div>
2
+ <style>
3
+ .d3-score-stack {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-score-stack svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-score-stack .axes path,
17
+ .d3-score-stack .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-score-stack .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 11px;
24
+ }
25
+
26
+ .d3-score-stack .grid line {
27
+ stroke: var(--grid-color, rgba(0,0,0,.08));
28
+ }
29
+
30
+ .d3-score-stack .axes text.axis-label {
31
+ font-size: 15px;
32
+ font-weight: 500;
33
+ fill: var(--text-color);
34
+ }
35
+
36
+ .d3-score-stack .bar-segment {
37
+ cursor: pointer;
38
+ transition: opacity 0.15s ease;
39
+ }
40
+
41
+ .d3-score-stack .bar-segment:hover {
42
+ opacity: 0.8;
43
+ }
44
+
45
+ .d3-score-stack .model-label {
46
+ font-size: 12px;
47
+ fill: var(--text-color);
48
+ }
49
+
50
+ .d3-score-stack .d3-tooltip {
51
+ position: absolute;
52
+ top: 0;
53
+ left: 0;
54
+ transform: translate(-9999px, -9999px);
55
+ pointer-events: none;
56
+ padding: 10px 12px;
57
+ border-radius: 8px;
58
+ font-size: 12px;
59
+ line-height: 1.4;
60
+ border: 1px solid var(--border-color);
61
+ background: var(--surface-bg);
62
+ color: var(--text-color);
63
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
64
+ opacity: 0;
65
+ transition: opacity 0.12s ease;
66
+ z-index: 10;
67
+ }
68
+
69
+ .d3-score-stack .d3-tooltip .model-name {
70
+ font-weight: 600;
71
+ margin-bottom: 4px;
72
+ }
73
+
74
+ .d3-score-stack .d3-tooltip .metric {
75
+ display: flex;
76
+ justify-content: space-between;
77
+ gap: 16px;
78
+ }
79
+
80
+ .d3-score-stack .d3-tooltip .metric-label {
81
+ color: var(--muted-color);
82
+ }
83
+
84
+ .d3-score-stack .d3-tooltip .metric-value {
85
+ font-weight: 500;
86
+ }
87
+
88
+ .d3-score-stack .legend {
89
+ display: flex;
90
+ flex-wrap: wrap;
91
+ justify-content: center;
92
+ gap: 16px;
93
+ margin-top: 12px;
94
+ font-size: 12px;
95
+ }
96
+
97
+ .d3-score-stack .legend-item {
98
+ display: flex;
99
+ align-items: center;
100
+ gap: 6px;
101
+ }
102
+
103
+ .d3-score-stack .legend-swatch {
104
+ width: 14px;
105
+ height: 14px;
106
+ border-radius: 2px;
107
+ }
108
+
109
+ .d3-score-stack .legend-label {
110
+ color: var(--text-color);
111
+ }
112
+ </style>
113
+ <script>
114
+ (() => {
115
+ const ensureD3 = (cb) => {
116
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
117
+ let s = document.getElementById('d3-cdn-script');
118
+ if (!s) {
119
+ s = document.createElement('script');
120
+ s.id = 'd3-cdn-script';
121
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
122
+ document.head.appendChild(s);
123
+ }
124
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
125
+ s.addEventListener('load', onReady, { once: true });
126
+ if (window.d3) onReady();
127
+ };
128
+
129
+ const bootstrap = () => {
130
+ const scriptEl = document.currentScript;
131
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
132
+ if (!(container && container.classList && container.classList.contains('d3-score-stack'))) {
133
+ const candidates = Array.from(document.querySelectorAll('.d3-score-stack'))
134
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
135
+ container = candidates[candidates.length - 1] || null;
136
+ }
137
+ if (!container) return;
138
+ if (container.dataset) {
139
+ if (container.dataset.mounted === 'true') return;
140
+ container.dataset.mounted = 'true';
141
+ }
142
+
143
+ // Tooltip setup
144
+ container.style.position = container.style.position || 'relative';
145
+ const tip = document.createElement('div');
146
+ tip.className = 'd3-tooltip';
147
+ container.appendChild(tip);
148
+
149
+ // SVG setup
150
+ const svg = d3.select(container).append('svg');
151
+ const gRoot = svg.append('g');
152
+
153
+ // Chart groups
154
+ const gGrid = gRoot.append('g').attr('class', 'grid');
155
+ const gAxes = gRoot.append('g').attr('class', 'axes');
156
+ const gBars = gRoot.append('g').attr('class', 'bars');
157
+
158
+ // Legend container
159
+ const legendDiv = document.createElement('div');
160
+ legendDiv.className = 'legend';
161
+ container.appendChild(legendDiv);
162
+
163
+ // State
164
+ let data = null;
165
+ let width = 800;
166
+ let height = 500;
167
+ const margin = { top: 20, right: 30, bottom: 56, left: 160 };
168
+
169
+ // Colors for segments
170
+ const segmentColors = {
171
+ raw: '#4A90D9', // Blue - raw score
172
+ floored: '#E8973E', // Orange - flooring gain
173
+ noStakes: '#5AAA5A' // Green - no-stakes gain
174
+ };
175
+
176
+ // Scales
177
+ const xScale = d3.scaleLinear();
178
+ const yScale = d3.scaleBand();
179
+
180
+ // Data loading
181
+ const DATA_URL = '/data/score_stack.json';
182
+
183
+ function updateSize() {
184
+ width = container.clientWidth || 800;
185
+ const barCount = data ? data.models.length : 10;
186
+ height = Math.max(400, barCount * 44 + margin.top + margin.bottom);
187
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
188
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
189
+ return {
190
+ innerWidth: width - margin.left - margin.right,
191
+ innerHeight: height - margin.top - margin.bottom
192
+ };
193
+ }
194
+
195
+ function showTooltip(event, d, segment) {
196
+ const rect = container.getBoundingClientRect();
197
+ const x = event.clientX - rect.left;
198
+ const y = event.clientY - rect.top;
199
+
200
+ let segmentName, segmentValue, description;
201
+ if (segment === 'raw') {
202
+ segmentName = 'Raw Score';
203
+ segmentValue = d.avg_score.toFixed(2);
204
+ description = 'Standard scoring: 30 - turns - 2×wrong guesses';
205
+ } else if (segment === 'floored') {
206
+ segmentName = 'Flooring Gain';
207
+ segmentValue = '+' + d.floored_delta.toFixed(2);
208
+ description = 'Gain if negative scores count as 0';
209
+ } else {
210
+ segmentName = 'No-Stakes Gain';
211
+ segmentValue = '+' + d.no_stakes_delta.toFixed(2);
212
+ description = 'Additional gain without guess penalties';
213
+ }
214
+
215
+ tip.innerHTML = `
216
+ <div class="model-name" style="color: ${d.color}">${d.name}</div>
217
+ <div class="metric">
218
+ <span class="metric-label">${segmentName}:</span>
219
+ <span class="metric-value">${segmentValue}</span>
220
+ </div>
221
+ <div style="font-size: 11px; color: var(--muted-color); margin-top: 4px;">${description}</div>
222
+ <hr style="border: none; border-top: 1px solid var(--border-color); margin: 8px 0;">
223
+ <div class="metric">
224
+ <span class="metric-label">Raw Score:</span>
225
+ <span class="metric-value">${d.avg_score.toFixed(2)}</span>
226
+ </div>
227
+ <div class="metric">
228
+ <span class="metric-label">Floored Score:</span>
229
+ <span class="metric-value">${d.avg_floored_score.toFixed(2)}</span>
230
+ </div>
231
+ <div class="metric">
232
+ <span class="metric-label">No-Stakes Score:</span>
233
+ <span class="metric-value">${d.avg_no_stakes_score.toFixed(2)}</span>
234
+ </div>
235
+ `;
236
+
237
+ const tipWidth = tip.offsetWidth || 200;
238
+ const tipHeight = tip.offsetHeight || 150;
239
+ let tipX = x + 12;
240
+ let tipY = y - tipHeight / 2;
241
+
242
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
243
+ if (tipY < 0) tipY = 8;
244
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
245
+
246
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
247
+ tip.style.opacity = '1';
248
+ }
249
+
250
+ function hideTooltip() {
251
+ tip.style.opacity = '0';
252
+ tip.style.transform = 'translate(-9999px, -9999px)';
253
+ }
254
+
255
+ function render() {
256
+ if (!data) return;
257
+
258
+ const { innerWidth, innerHeight } = updateSize();
259
+
260
+ // Sort models by raw score (descending)
261
+ const models = [...data.models].sort((a, b) => b.avg_score - a.avg_score);
262
+
263
+ // Update scales
264
+ const maxScore = d3.max(models, d => d.avg_no_stakes_score);
265
+
266
+ xScale
267
+ .domain([0, maxScore + 1])
268
+ .range([0, innerWidth])
269
+ .nice();
270
+
271
+ yScale
272
+ .domain(models.map(d => d.name))
273
+ .range([0, innerHeight])
274
+ .padding(0.25);
275
+
276
+ // Grid lines
277
+ const xTicks = xScale.ticks(8);
278
+
279
+ gGrid.selectAll('.grid-x')
280
+ .data(xTicks)
281
+ .join('line')
282
+ .attr('class', 'grid-x')
283
+ .attr('x1', d => xScale(d))
284
+ .attr('x2', d => xScale(d))
285
+ .attr('y1', 0)
286
+ .attr('y2', innerHeight);
287
+
288
+ // Axes
289
+ const tickSize = 6;
290
+ gAxes.selectAll('.x-axis')
291
+ .data([0])
292
+ .join('g')
293
+ .attr('class', 'x-axis')
294
+ .attr('transform', `translate(0,${innerHeight})`)
295
+ .call(d3.axisBottom(xScale).ticks(8).tickSizeInner(-tickSize).tickSizeOuter(0));
296
+
297
+ gAxes.selectAll('.y-axis')
298
+ .data([0])
299
+ .join('g')
300
+ .attr('class', 'y-axis')
301
+ .call(d3.axisLeft(yScale).tickSize(0))
302
+ .selectAll('text')
303
+ .attr('class', 'model-label');
304
+
305
+ // Axis label
306
+ gAxes.selectAll('.x-label')
307
+ .data([0])
308
+ .join('text')
309
+ .attr('class', 'x-label axis-label')
310
+ .attr('x', innerWidth / 2)
311
+ .attr('y', innerHeight + 44)
312
+ .attr('text-anchor', 'middle')
313
+ .text('Score');
314
+
315
+ const barHeight = yScale.bandwidth();
316
+
317
+ // Helper to sanitize names for CSS selectors (remove periods, spaces, etc.)
318
+ const toClassName = (name) => name.replace(/[^a-zA-Z0-9]/g, '-');
319
+
320
+ // Draw stacked bars for each model
321
+ models.forEach(d => {
322
+ const y = yScale(d.name);
323
+ const safeId = toClassName(d.name);
324
+
325
+ // Calculate segment positions
326
+ // Raw score starts from 0, clamp negative scores to 0
327
+ const rawStart = 0;
328
+ const rawEnd = Math.max(0, d.avg_score);
329
+
330
+ // Floored delta starts where raw score ends (if positive) or at 0 (if raw was negative)
331
+ const flooredStart = rawEnd;
332
+ const flooredEnd = flooredStart + d.floored_delta;
333
+
334
+ // No-stakes delta starts where floored ends
335
+ const noStakesStart = flooredEnd;
336
+ const noStakesEnd = noStakesStart + d.no_stakes_delta;
337
+
338
+ // Raw score segment
339
+ gBars.selectAll(`.bar-raw-${safeId}`)
340
+ .data([d])
341
+ .join('rect')
342
+ .attr('class', `bar-segment bar-raw-${safeId}`)
343
+ .attr('x', xScale(rawStart))
344
+ .attr('y', y)
345
+ .attr('width', Math.max(0, xScale(rawEnd) - xScale(rawStart)))
346
+ .attr('height', barHeight)
347
+ .attr('fill', segmentColors.raw)
348
+ .on('mouseenter', (e) => showTooltip(e, d, 'raw'))
349
+ .on('mousemove', (e) => showTooltip(e, d, 'raw'))
350
+ .on('mouseleave', hideTooltip);
351
+
352
+ // Floored delta segment (only if positive)
353
+ if (d.floored_delta > 0.01) {
354
+ gBars.selectAll(`.bar-floored-${safeId}`)
355
+ .data([d])
356
+ .join('rect')
357
+ .attr('class', `bar-segment bar-floored-${safeId}`)
358
+ .attr('x', xScale(flooredStart))
359
+ .attr('y', y)
360
+ .attr('width', Math.max(0, xScale(flooredEnd) - xScale(flooredStart)))
361
+ .attr('height', barHeight)
362
+ .attr('fill', segmentColors.floored)
363
+ .attr('opacity', 0.5)
364
+ .on('mouseenter', (e) => showTooltip(e, d, 'floored'))
365
+ .on('mousemove', (e) => showTooltip(e, d, 'floored'))
366
+ .on('mouseleave', hideTooltip);
367
+ }
368
+
369
+ // No-stakes delta segment (only if positive)
370
+ if (d.no_stakes_delta > 0.01) {
371
+ gBars.selectAll(`.bar-nostakes-${safeId}`)
372
+ .data([d])
373
+ .join('rect')
374
+ .attr('class', `bar-segment bar-nostakes-${safeId}`)
375
+ .attr('x', xScale(noStakesStart))
376
+ .attr('y', y)
377
+ .attr('width', Math.max(0, xScale(noStakesEnd) - xScale(noStakesStart)))
378
+ .attr('height', barHeight)
379
+ .attr('fill', segmentColors.noStakes)
380
+ .attr('opacity', 0.5)
381
+ .on('mouseenter', (e) => showTooltip(e, d, 'noStakes'))
382
+ .on('mousemove', (e) => showTooltip(e, d, 'noStakes'))
383
+ .on('mouseleave', hideTooltip);
384
+ }
385
+ });
386
+
387
+ // Update legend
388
+ legendDiv.innerHTML = `
389
+ <div class="legend-item">
390
+ <div class="legend-swatch" style="background: ${segmentColors.raw}"></div>
391
+ <span class="legend-label">Raw Score</span>
392
+ </div>
393
+ <div class="legend-item">
394
+ <div class="legend-swatch" style="background: ${segmentColors.floored}"></div>
395
+ <span class="legend-label">Flooring Gain</span>
396
+ </div>
397
+ <div class="legend-item">
398
+ <div class="legend-swatch" style="background: ${segmentColors.noStakes}"></div>
399
+ <span class="legend-label">No-Stakes Gain</span>
400
+ </div>
401
+ `;
402
+ }
403
+
404
+ // Initialize
405
+ fetch(DATA_URL, { cache: 'no-cache' })
406
+ .then(r => { if (!r.ok) throw new Error(`HTTP ${r.status} for ${DATA_URL}`); return r.json(); })
407
+ .then(json => {
408
+ data = json;
409
+ render();
410
+ })
411
+ .catch(err => {
412
+ const pre = document.createElement('pre');
413
+ pre.style.color = 'red';
414
+ pre.style.padding = '16px';
415
+ pre.textContent = `Error loading data: ${err.message}`;
416
+ container.appendChild(pre);
417
+ });
418
+
419
+ // Resize handling
420
+ if (window.ResizeObserver) {
421
+ new ResizeObserver(() => render()).observe(container);
422
+ } else {
423
+ window.addEventListener('resize', render);
424
+ }
425
+
426
+ // Theme change handling
427
+ const observer = new MutationObserver(() => render());
428
+ observer.observe(document.documentElement, {
429
+ attributes: true,
430
+ attributeFilter: ['data-theme']
431
+ });
432
+ };
433
+
434
+ if (document.readyState === 'loading') {
435
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
436
+ } else {
437
+ ensureD3(bootstrap);
438
+ }
439
+ })();
440
+ </script>
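Review note: the segment arithmetic inside `models.forEach` in `score-stack.html` is easier to check when pulled out of the D3 calls. Below is a minimal sketch of that arithmetic as a pure function. The name `stackSegments` and the sample values are hypothetical; the field names (`avg_score`, `floored_delta`, `no_stakes_delta`) are the ones the embed reads from `score_stack.json`.

```javascript
// Sketch only: mirrors the start/end computation in score-stack.html.
// `stackSegments` is a hypothetical helper name, not part of the commit.
function stackSegments(model) {
  // A negative raw score draws no blue segment: the bar starts at 0.
  const rawEnd = Math.max(0, model.avg_score);
  // The flooring gain is stacked right after the raw segment.
  const flooredEnd = rawEnd + model.floored_delta;
  // The no-stakes gain is stacked after the flooring gain.
  const noStakesEnd = flooredEnd + model.no_stakes_delta;
  return {
    raw: [0, rawEnd],
    floored: [rawEnd, flooredEnd],
    noStakes: [flooredEnd, noStakesEnd],
  };
}

// Made-up values for illustration (not taken from the benchmark data).
const example = stackSegments({ avg_score: -2.5, floored_delta: 4, no_stakes_delta: 3 });
console.log(example);
// { raw: [ 0, 0 ], floored: [ 0, 4 ], noStakes: [ 4, 7 ] }
```

Each `[start, end]` pair would then be passed through `xScale` to get the `x` and `width` of the corresponding `rect`, as the embed already does.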
app/src/content/embeds/score-vs-failed-guesses.html ADDED
@@ -0,0 +1,369 @@
1
+ <div class="d3-score-vs-failed-guesses"></div>
2
+ <style>
3
+ .d3-score-vs-failed-guesses {
4
+ width: 100%;
5
+ margin: 10px 0;
6
+ position: relative;
7
+ font-family: system-ui, -apple-system, sans-serif;
8
+ }
9
+
10
+ .d3-score-vs-failed-guesses svg {
11
+ display: block;
12
+ width: 100%;
13
+ height: auto;
14
+ }
15
+
16
+ .d3-score-vs-failed-guesses .axes path,
17
+ .d3-score-vs-failed-guesses .axes line {
18
+ stroke: var(--axis-color, var(--text-color));
19
+ }
20
+
21
+ .d3-score-vs-failed-guesses .axes text {
22
+ fill: var(--tick-color, var(--muted-color));
23
+ font-size: 11px;
24
+ }
25
+
26
+ .d3-score-vs-failed-guesses .grid line {
27
+ stroke: var(--grid-color, rgba(0,0,0,.08));
28
+ }
29
+
30
+ .d3-score-vs-failed-guesses .axes text.axis-label {
31
+ font-size: 15px;
32
+ font-weight: 500;
33
+ fill: var(--text-color);
34
+ }
35
+
36
+ .d3-score-vs-failed-guesses .x-axis text {
37
+ transform: translateY(4px);
38
+ }
39
+
40
+ .d3-score-vs-failed-guesses .point {
41
+ cursor: pointer;
42
+ transition: opacity 0.15s ease;
43
+ }
44
+
45
+ .d3-score-vs-failed-guesses .point:hover {
46
+ opacity: 0.8;
47
+ }
48
+
49
+ .d3-score-vs-failed-guesses .point-label {
50
+ font-size: 11px;
51
+ fill: var(--text-color);
52
+ pointer-events: none;
53
+ }
54
+
55
+ .d3-score-vs-failed-guesses .d3-tooltip {
56
+ position: absolute;
57
+ top: 0;
58
+ left: 0;
59
+ transform: translate(-9999px, -9999px);
60
+ pointer-events: none;
61
+ padding: 10px 12px;
62
+ border-radius: 8px;
63
+ font-size: 12px;
64
+ line-height: 1.4;
65
+ border: 1px solid var(--border-color);
66
+ background: var(--surface-bg);
67
+ color: var(--text-color);
68
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
69
+ opacity: 0;
70
+ transition: opacity 0.12s ease;
71
+ z-index: 10;
72
+ }
73
+
74
+ .d3-score-vs-failed-guesses .d3-tooltip .model-name {
75
+ font-weight: 600;
76
+ margin-bottom: 4px;
77
+ }
78
+
79
+ .d3-score-vs-failed-guesses .d3-tooltip .metric {
80
+ display: flex;
81
+ justify-content: space-between;
82
+ gap: 16px;
83
+ }
84
+
85
+ .d3-score-vs-failed-guesses .d3-tooltip .metric-label {
86
+ color: var(--muted-color);
87
+ }
88
+
89
+ .d3-score-vs-failed-guesses .d3-tooltip .metric-value {
90
+ font-weight: 500;
91
+ }
92
+ </style>
93
+ <script>
94
+ (() => {
95
+ const ensureD3 = (cb) => {
96
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
97
+ let s = document.getElementById('d3-cdn-script');
98
+ if (!s) {
99
+ s = document.createElement('script');
100
+ s.id = 'd3-cdn-script';
101
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
102
+ document.head.appendChild(s);
103
+ }
104
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
105
+ s.addEventListener('load', onReady, { once: true });
106
+ if (window.d3) onReady();
107
+ };
108
+
109
+ const bootstrap = () => {
110
+ const scriptEl = document.currentScript;
111
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
112
+ if (!(container && container.classList && container.classList.contains('d3-score-vs-failed-guesses'))) {
113
+ const candidates = Array.from(document.querySelectorAll('.d3-score-vs-failed-guesses'))
114
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
115
+ container = candidates[candidates.length - 1] || null;
116
+ }
117
+ if (!container) return;
118
+ if (container.dataset) {
119
+ if (container.dataset.mounted === 'true') return;
120
+ container.dataset.mounted = 'true';
121
+ }
122
+
123
+ // Tooltip setup
124
+ container.style.position = container.style.position || 'relative';
125
+ const tip = document.createElement('div');
126
+ tip.className = 'd3-tooltip';
127
+ container.appendChild(tip);
128
+
129
+ // SVG setup
130
+ const svg = d3.select(container).append('svg');
131
+ const gRoot = svg.append('g');
132
+
133
+ // Chart groups
134
+ const gGrid = gRoot.append('g').attr('class', 'grid');
135
+ const gAxes = gRoot.append('g').attr('class', 'axes');
136
+ const gPoints = gRoot.append('g').attr('class', 'points');
137
+ const gLabels = gRoot.append('g').attr('class', 'labels');
138
+
139
+ // State
140
+ let data = null;
141
+ let width = 800;
142
+ let height = 450;
143
+ const margin = { top: 20, right: 120, bottom: 56, left: 72 };
144
+
145
+ // Scales
146
+ const xScale = d3.scaleLinear();
147
+ const yScale = d3.scaleLinear();
148
+
149
+ // Data loading
150
+ const DATA_URL = '/data/score_vs_failed_guesses.json';
151
+
152
+ function updateSize() {
153
+ width = container.clientWidth || 800;
154
+ height = Math.max(300, Math.round(width / 1.3));
155
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
156
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
157
+ return {
158
+ innerWidth: width - margin.left - margin.right,
159
+ innerHeight: height - margin.top - margin.bottom
160
+ };
161
+ }
162
+
163
+ function showTooltip(event, d) {
164
+ const rect = container.getBoundingClientRect();
165
+ const x = event.clientX - rect.left;
166
+ const y = event.clientY - rect.top;
167
+
168
+ tip.innerHTML = `
169
+ <div class="model-name" style="color: ${d.color}">${d.name}</div>
170
+ <div class="metric">
171
+ <span class="metric-label">Score:</span>
172
+ <span class="metric-value">${d.avg_score.toFixed(2)}</span>
173
+ </div>
174
+ <div class="metric">
175
+ <span class="metric-label">Failed Guesses:</span>
176
+ <span class="metric-value">${d.avg_failed_guesses.toFixed(2)}</span>
177
+ </div>
178
+ <div class="metric">
179
+ <span class="metric-label">Type:</span>
180
+ <span class="metric-value">${d.is_open ? 'Open' : 'Closed'}</span>
181
+ </div>
182
+ `;
183
+
184
+ const tipWidth = tip.offsetWidth || 150;
185
+ const tipHeight = tip.offsetHeight || 80;
186
+ let tipX = x + 12;
187
+ let tipY = y - tipHeight / 2;
188
+
189
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
190
+ if (tipY < 0) tipY = 8;
191
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
192
+
193
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
194
+ tip.style.opacity = '1';
195
+ }
196
+
197
+ function hideTooltip() {
198
+ tip.style.opacity = '0';
199
+ tip.style.transform = 'translate(-9999px, -9999px)';
200
+ }
201
+
202
+ function render() {
203
+ if (!data) return;
204
+
205
+ const { innerWidth, innerHeight } = updateSize();
206
+ const models = data.models;
207
+
208
+ // Update scales
209
+ const xExtent = d3.extent(models, d => d.avg_failed_guesses);
210
+ const yExtent = d3.extent(models, d => d.avg_score);
211
+ const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
212
+ const yPadding = (yExtent[1] - yExtent[0]) * 0.1;
213
+
214
+ xScale
215
+ .domain([Math.max(0, xExtent[0] - xPadding), xExtent[1] + xPadding])
216
+ .range([0, innerWidth])
217
+ .nice();
218
+
219
+ yScale
220
+ .domain([yExtent[0] - yPadding, yExtent[1] + yPadding])
221
+ .range([innerHeight, 0])
222
+ .nice();
223
+
224
+ // Grid lines
225
+ const xTicks = xScale.ticks(6);
226
+ const yTicks = yScale.ticks(6);
227
+
228
+ gGrid.selectAll('.grid-x')
229
+ .data(xTicks)
230
+ .join('line')
231
+ .attr('class', 'grid-x')
232
+ .attr('x1', d => xScale(d))
233
+ .attr('x2', d => xScale(d))
234
+ .attr('y1', 0)
235
+ .attr('y2', innerHeight);
236
+
237
+ gGrid.selectAll('.grid-y')
238
+ .data(yTicks)
239
+ .join('line')
240
+ .attr('class', 'grid-y')
241
+ .attr('x1', 0)
242
+ .attr('x2', innerWidth)
243
+ .attr('y1', d => yScale(d))
244
+ .attr('y2', d => yScale(d));
245
+
246
+ // Axes with inner ticks
247
+ const tickSize = 6;
248
+ gAxes.selectAll('.x-axis')
249
+ .data([0])
250
+ .join('g')
251
+ .attr('class', 'x-axis')
252
+ .attr('transform', `translate(0,${innerHeight})`)
253
+ .call(d3.axisBottom(xScale).ticks(6).tickSizeInner(-tickSize).tickSizeOuter(0));
254
+
255
+ gAxes.selectAll('.y-axis')
256
+ .data([0])
257
+ .join('g')
258
+ .attr('class', 'y-axis')
259
+ .call(d3.axisLeft(yScale).ticks(6).tickSizeInner(-tickSize).tickSizeOuter(0));
260
+
261
+ // Axis labels
262
+ gAxes.selectAll('.x-label')
263
+ .data([0])
264
+ .join('text')
265
+ .attr('class', 'x-label axis-label')
266
+ .attr('x', innerWidth / 2)
267
+ .attr('y', innerHeight + 44)
268
+ .attr('text-anchor', 'middle')
269
+ .text('Average Failed Guesses');
270
+
271
+ gAxes.selectAll('.y-label')
272
+ .data([0])
273
+ .join('text')
274
+ .attr('class', 'y-label axis-label')
275
+ .attr('x', -innerHeight / 2)
276
+ .attr('y', -52)
277
+ .attr('text-anchor', 'middle')
278
+ .attr('transform', 'rotate(-90)')
279
+ .text('Average Score');
280
+
281
+ // Points - circles for closed models, stars for open models
282
+ const pointRadius = Math.max(8, Math.min(16, innerWidth / 60));
283
+
284
+ // Helper function to create a 5-point star path
285
+ const starPath = (cx, cy, outerR, innerR) => {
286
+ const points = [];
287
+ for (let i = 0; i < 10; i++) {
288
+ const r = i % 2 === 0 ? outerR : innerR;
289
+ const angle = (Math.PI / 2) + (i * Math.PI / 5);
290
+ points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
291
+ }
292
+ return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
293
+ };
294
+
295
+ // Closed models as circles
296
+ const closedModels = models.filter(d => !d.is_open);
297
+ gPoints.selectAll('.point-circle')
298
+ .data(closedModels, d => d.name)
299
+ .join('circle')
300
+ .attr('class', 'point point-circle')
301
+ .attr('cx', d => xScale(d.avg_failed_guesses))
302
+ .attr('cy', d => yScale(d.avg_score))
303
+ .attr('r', pointRadius)
304
+ .attr('fill', d => d.color)
305
+ .attr('stroke', 'none')
306
+ .on('mouseenter', showTooltip)
307
+ .on('mousemove', showTooltip)
308
+ .on('mouseleave', hideTooltip);
309
+
310
+ // Open models as stars
311
+ const openModels = models.filter(d => d.is_open);
312
+ gPoints.selectAll('.point-star')
313
+ .data(openModels, d => d.name)
314
+ .join('path')
315
+ .attr('class', 'point point-star')
316
+ .attr('d', d => starPath(xScale(d.avg_failed_guesses), yScale(d.avg_score), pointRadius * 1.2, pointRadius * 0.5))
317
+ .attr('fill', d => d.color)
318
+ .attr('stroke', 'none')
319
+ .on('mouseenter', showTooltip)
320
+ .on('mousemove', showTooltip)
321
+ .on('mouseleave', hideTooltip);
322
+
323
+ // Point labels
324
+ gLabels.selectAll('.point-label')
325
+ .data(models)
326
+ .join('text')
327
+ .attr('class', 'point-label')
328
+ .attr('x', d => xScale(d.avg_failed_guesses) + pointRadius + 6)
329
+ .attr('y', d => yScale(d.avg_score) + 4)
330
+ .text(d => d.name);
331
+ }
332
+
333
+ // Initialize
334
+ fetch(DATA_URL, { cache: 'no-cache' })
335
+ .then(r => { if (!r.ok) throw new Error(`HTTP ${r.status} for ${DATA_URL}`); return r.json(); })
336
+ .then(json => {
337
+ data = json;
338
+ render();
339
+ })
340
+ .catch(err => {
341
+ const pre = document.createElement('pre');
342
+ pre.style.color = 'red';
343
+ pre.style.padding = '16px';
344
+ pre.textContent = `Error loading data: ${err.message}`;
345
+ container.appendChild(pre);
346
+ });
347
+
348
+ // Resize handling
349
+ if (window.ResizeObserver) {
350
+ new ResizeObserver(() => render()).observe(container);
351
+ } else {
352
+ window.addEventListener('resize', render);
353
+ }
354
+
355
+ // Theme change handling
356
+ const observer = new MutationObserver(() => render());
357
+ observer.observe(document.documentElement, {
358
+ attributes: true,
359
+ attributeFilter: ['data-theme']
360
+ });
361
+ };
362
+
363
+ if (document.readyState === 'loading') {
364
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
365
+ } else {
366
+ ensureD3(bootstrap);
367
+ }
368
+ })();
369
+ </script>
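Review note: `score-stack.html` and `score-vs-failed-guesses.html` both repeat the same tooltip placement logic inside `showTooltip`. A minimal sketch of that logic as a standalone function is shown below, assuming the same 12px cursor offset and 8px edge margin the embeds use. The name `placeTooltip` and its parameter list are hypothetical.

```javascript
// Sketch only: the clamping rules from showTooltip, without any DOM access.
// `placeTooltip` is a hypothetical helper name, not part of the commit.
function placeTooltip(x, y, tipWidth, tipHeight, width, height) {
  // Default position: to the right of the cursor, vertically centered on it.
  let tipX = x + 12;
  let tipY = y - tipHeight / 2;
  // Flip to the left of the cursor if the tooltip would overflow the right edge.
  if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
  // Keep the tooltip inside the top and bottom edges of the chart.
  if (tipY < 0) tipY = 8;
  if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
  return { tipX, tipY };
}

// Cursor near the top-right corner of an 800x450 chart, 150x80 tooltip.
console.log(placeTooltip(790, 10, 150, 80, 800, 450));
// { tipX: 628, tipY: 8 }
```

Inside the embeds, the result would feed `tip.style.transform = translate(...)` exactly as the current inline code does; sharing one helper would keep the two charts from drifting apart.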
dark-mode-image.md ADDED
@@ -0,0 +1,48 @@
+ # Dark Mode Image Handling
+
+ ## Problem
+
+ The blog template automatically inverts image colors in dark mode using a CSS filter:
+
+ ```css
+ :global([data-theme="dark"]) .image-wrapper img {
+   filter: invert(0.925) hue-rotate(180deg);
+ }
+ ```
+
+ This works well for charts and figures with white backgrounds, but is undesirable for images that should retain their original colors (e.g., photographs, illustrations with specific color schemes).
+
+ ## Solution
+
+ Added a `preserveColors` prop to the `Image` component that opts out of the dark mode inversion.
+
+ ### Usage
+
+ ```mdx
+ import Image from "../../../components/Image.astro";
+ import myImage from "../../assets/image/my_image.png";
+
+ <Image
+   src={myImage}
+   alt="Description"
+   preserveColors
+ />
+ ```
+
+ ### Implementation
+
+ **File: `app/src/components/Image.astro`**
+
+ 1. Added `preserveColors?: boolean` to the Props interface
+ 2. Added `data-preserve-colors` attribute to the wrapper div when the prop is true
+ 3. Updated CSS selectors to exclude images with this attribute:
+
+ ```css
+ :global([data-theme="dark"]) .image-wrapper:not([data-preserve-colors]) img {
+   filter: invert(0.925) hue-rotate(180deg);
+ }
+ ```
+
+ ### Current Usage
+
+ - `introduction.mdx`: The `example_sequence.png` image uses `preserveColors` to maintain the card colors in dark mode
interactive-charts.md ADDED
@@ -0,0 +1,498 @@
1
+ # Converting Static Figures to Interactive D3 Charts
2
+
3
+ This guide explains how to convert PNG figures into interactive D3.js visualizations for this project.
4
+
5
+ ## Overview
6
+
7
+ Each interactive chart consists of:
8
+ 1. **JSON data file** in `app/public/data/` (served at `/data/filename.json`)
9
+ 2. **HTML embed file** in `app/src/content/embeds/` (e.g., `chart-name.html`)
10
+ 3. **MDX integration** using the `HtmlEmbed` component
11
+
12
+ ## File Structure
13
+
14
+ ```
15
+ app/
16
+ ├── public/data/ # JSON data (served at /data/*)
17
+ │ ├── overall_performance.json
18
+ │ ├── calibration_curves.json
19
+ │ └── ...
20
+ └── src/content/embeds/ # HTML chart implementations
21
+ ├── banner.html # Example: scatter plot
22
+ └── calibration-curves.html # Example: multi-line chart
23
+ ```
24
+
25
+ ## Step 1: Understand Your Data
26
+
27
+ Check the JSON structure in `app/public/data/`. Common patterns:
28
+
29
+ **Scatter plot** (`overall_performance.json`):
30
+ ```json
31
+ {
32
+ "models": [
33
+ { "name": "Model A", "avg_score": 15.8, "avg_output_tokens_per_turn": 5253, "color": "#FF6B00", "is_open": false }
34
+ ]
35
+ }
36
+ ```
37
+
38
+ **Line chart / Calibration** (`calibration_curves.json`):
39
+ ```json
40
+ {
41
+ "models": [
42
+ {
43
+ "name": "Model A", "color": "#FF6B00",
44
+ "calibration_points": [
45
+ { "confidence_level": 5, "actual_success_rate": 0.041, "sample_count": 73 }
46
+ ]
47
+ }
48
+ ]
49
+ }
50
+ ```
51
+
52
+ **Histogram** (`confidence_distribution.json`):
53
+ ```json
54
+ {
55
+ "models": [
56
+ {
57
+ "name": "Model A", "color": "#FF6B00", "total_guesses": 579,
58
+ "distribution": [
59
+ { "confidence_level": 5, "proportion": 0.024, "count": 14 }
60
+ ]
61
+ }
62
+ ]
63
+ }
64
+ ```
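Before wiring a data file into a chart, it can help to check that it follows one of the shapes above. The sketch below is a minimal, hypothetical helper (`findDataProblems` and its `pointsKey` parameter are not part of the project); it only checks the fields the three examples have in common and assumes colors are written as 6-digit hex strings.

```javascript
// Hypothetical helper (not project code): list shape problems in a parsed
// data file before it is handed to render().
function findDataProblems(json, pointsKey) {
  if (!json || !Array.isArray(json.models)) {
    return ['top-level "models" array is missing'];
  }
  const problems = [];
  json.models.forEach((model, i) => {
    if (typeof model.name !== 'string') {
      problems.push(`models[${i}].name is not a string`);
    }
    // Assumption: colors are 6-digit hex strings, as in the examples above.
    if (!/^#[0-9a-fA-F]{6}$/.test(model.color || '')) {
      problems.push(`models[${i}].color is not a 6-digit hex color`);
    }
    // pointsKey is optional: scatter data keeps its values on the model itself.
    if (pointsKey && !Array.isArray(model[pointsKey])) {
      problems.push(`models[${i}].${pointsKey} is not an array`);
    }
  });
  return problems;
}

// Sample input: the first model follows the calibration shape, the second does not.
const sample = {
  models: [
    {
      name: 'Model A', color: '#FF6B00',
      calibration_points: [
        { confidence_level: 5, actual_success_rate: 0.041, sample_count: 73 }
      ]
    },
    { name: 'Model B', color: 'orange' }
  ]
};
const problems = findDataProblems(sample, 'calibration_points');
console.log(problems);
```

For the sample above, the helper reports two problems, both on the second model: its color is not a hex string and it has no `calibration_points` array.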
65
+
66
+ ## Step 2: Create the HTML Embed
67
+
68
+ Create a new file in `app/src/content/embeds/`. Use this template:
69
+
70
+ ```html
71
+ <div class="d3-CHART-NAME"></div>
72
+ <style>
73
+ /* Scoped styles - prefix everything with .d3-CHART-NAME */
74
+ .d3-CHART-NAME {
75
+ width: 100%;
76
+ margin: 10px 0;
77
+ position: relative;
78
+ font-family: system-ui, -apple-system, sans-serif;
79
+ }
80
+
81
+ .d3-CHART-NAME svg {
82
+ display: block;
83
+ width: 100%;
84
+ height: auto;
85
+ }
86
+
87
+ /* Use CSS variables for theme support */
88
+ .d3-CHART-NAME .axes path,
89
+ .d3-CHART-NAME .axes line {
90
+ stroke: var(--axis-color, var(--text-color));
91
+ }
92
+
93
+ .d3-CHART-NAME .axes text {
94
+ fill: var(--tick-color, var(--muted-color));
95
+ font-size: 11px;
96
+ }
97
+
98
+ .d3-CHART-NAME .grid line {
99
+ stroke: var(--grid-color, rgba(0,0,0,.08));
100
+ }
101
+
102
+ /* Use specific selector to override .axes text */
103
+ .d3-CHART-NAME .axes text.axis-label {
104
+ font-size: 14px;
105
+ font-weight: 500;
106
+ fill: var(--text-color);
107
+ }
108
+
109
+ .d3-CHART-NAME .axes text.chart-title {
110
+ font-size: 16px;
111
+ font-weight: 600;
112
+ fill: var(--text-color);
113
+ }
114
+
115
+ /* Adjust tick label spacing if needed */
116
+ .d3-CHART-NAME .x-axis text {
117
+ transform: translateY(4px);
118
+ }
119
+
120
+ /* Tooltip */
121
+ .d3-CHART-NAME .d3-tooltip {
122
+ position: absolute;
123
+ top: 0; left: 0;
124
+ transform: translate(-9999px, -9999px);
125
+ pointer-events: none;
126
+ padding: 10px 12px;
127
+ border-radius: 8px;
128
+ font-size: 12px;
129
+ line-height: 1.4;
130
+ border: 1px solid var(--border-color);
131
+ background: var(--surface-bg);
132
+ color: var(--text-color);
133
+ box-shadow: 0 4px 24px rgba(0,0,0,.18);
134
+ opacity: 0;
135
+ transition: opacity 0.12s ease;
136
+ z-index: 10;
137
+ }
138
+ </style>
139
+ <script>
140
+ (() => {
141
+ // D3 loader - reuses existing if already loaded
142
+ const ensureD3 = (cb) => {
143
+ if (window.d3 && typeof window.d3.select === 'function') return cb();
144
+ let s = document.getElementById('d3-cdn-script');
145
+ if (!s) {
146
+ s = document.createElement('script');
147
+ s.id = 'd3-cdn-script';
148
+ s.src = 'https://cdn.jsdelivr.net/npm/d3@7/dist/d3.min.js';
149
+ document.head.appendChild(s);
150
+ }
151
+ const onReady = () => { if (window.d3 && typeof window.d3.select === 'function') cb(); };
152
+ s.addEventListener('load', onReady, { once: true });
153
+ if (window.d3) onReady();
154
+ };
155
+
156
+ const bootstrap = () => {
157
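+ // Find container (handles multiple instances). document.currentScript is null when bootstrap runs from a callback (D3 load or DOMContentLoaded), so the fallback below then applies.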
158
+ const scriptEl = document.currentScript;
159
+ let container = scriptEl ? scriptEl.previousElementSibling : null;
160
+ if (!(container && container.classList && container.classList.contains('d3-CHART-NAME'))) {
161
+ const candidates = Array.from(document.querySelectorAll('.d3-CHART-NAME'))
162
+ .filter((el) => !(el.dataset && el.dataset.mounted === 'true'));
163
+ container = candidates[candidates.length - 1] || null;
164
+ }
165
+ if (!container) return;
166
+ if (container.dataset) {
167
+ if (container.dataset.mounted === 'true') return;
168
+ container.dataset.mounted = 'true';
169
+ }
170
+
171
+ // Tooltip setup
172
+ container.style.position = container.style.position || 'relative';
173
+ const tip = document.createElement('div');
174
+ tip.className = 'd3-tooltip';
175
+ container.appendChild(tip);
176
+
177
+ // SVG setup
178
+ const svg = d3.select(container).append('svg');
179
+ const gRoot = svg.append('g');
180
+
181
+ // Chart groups (order matters for layering)
182
+ const gGrid = gRoot.append('g').attr('class', 'grid');
183
+ const gAxes = gRoot.append('g').attr('class', 'axes');
184
+ const gContent = gRoot.append('g').attr('class', 'content');
185
+
186
+ // State
187
+ let data = null;
188
+ let width = 800;
189
+ let height = 450;
190
+ const margin = { top: 40, right: 120, bottom: 56, left: 72 };
191
+
192
+ // Scales
193
+ const xScale = d3.scaleLinear();
194
+ const yScale = d3.scaleLinear();
195
+
196
+ // Data loading - single path since we use public/data/
197
+ const DATA_URL = '/data/YOUR_DATA_FILE.json';
198
+
199
+ function updateSize() {
200
+ width = container.clientWidth || 800;
201
+ height = Math.max(300, Math.round(width / 1.78)); // 16:9 aspect ratio
202
+ svg.attr('width', width).attr('height', height).attr('viewBox', `0 0 ${width} ${height}`);
203
+ gRoot.attr('transform', `translate(${margin.left},${margin.top})`);
204
+ return {
205
+ innerWidth: width - margin.left - margin.right,
206
+ innerHeight: height - margin.top - margin.bottom
207
+ };
208
+ }
209
+
210
+ function showTooltip(event, d) {
211
+ const rect = container.getBoundingClientRect();
212
+ const x = event.clientX - rect.left;
213
+ const y = event.clientY - rect.top;
214
+
215
+ tip.innerHTML = `
216
+ <div style="font-weight: 600; color: ${d.color}">${d.name}</div>
217
+ <div>Value: ${d.value}</div>
218
+ `;
219
+
220
+ const tipWidth = tip.offsetWidth || 150;
221
+ const tipHeight = tip.offsetHeight || 80;
222
+ let tipX = x + 12;
223
+ let tipY = y - tipHeight / 2;
224
+
225
+ if (tipX + tipWidth > width) tipX = x - tipWidth - 12;
226
+ if (tipY < 0) tipY = 8;
227
+ if (tipY + tipHeight > height) tipY = height - tipHeight - 8;
228
+
229
+ tip.style.transform = `translate(${tipX}px, ${tipY}px)`;
230
+ tip.style.opacity = '1';
231
+ }
232
+
233
+ function hideTooltip() {
234
+ tip.style.opacity = '0';
235
+ tip.style.transform = 'translate(-9999px, -9999px)';
236
+ }
237
+
238
+ function render() {
239
+ if (!data) return;
240
+ const { innerWidth, innerHeight } = updateSize();
241
+
242
+ // TODO: Implement your chart rendering here
243
+ // - Update scales with data extent
244
+ // - Draw grid lines
245
+ // - Draw axes
246
+ // - Draw data elements (lines, bars, points, etc.)
247
+ }
248
+
249
+ // Initialize
250
+ fetch(DATA_URL, { cache: 'no-cache' })
251
+ .then(r => { if (!r.ok) throw new Error(`HTTP ${r.status} for ${DATA_URL}`); return r.json(); })
252
+ .then(json => {
253
+ data = json;
254
+ render();
255
+ })
256
+ .catch(err => {
257
+ const pre = document.createElement('pre');
258
+ pre.style.color = 'red';
259
+ pre.style.padding = '16px';
260
+ pre.textContent = `Error loading data: ${err.message}`;
261
+ container.appendChild(pre);
262
+ });
263
+
264
+ // Resize handling
265
+ if (window.ResizeObserver) {
266
+ new ResizeObserver(() => render()).observe(container);
267
+ } else {
268
+ window.addEventListener('resize', render);
269
+ }
270
+
271
+ // Theme change handling (re-render on light/dark toggle)
272
+ const observer = new MutationObserver(() => render());
273
+ observer.observe(document.documentElement, {
274
+ attributes: true,
275
+ attributeFilter: ['data-theme']
276
+ });
277
+ };
278
+
279
+ if (document.readyState === 'loading') {
280
+ document.addEventListener('DOMContentLoaded', () => ensureD3(bootstrap), { once: true });
281
+ } else {
282
+ ensureD3(bootstrap);
283
+ }
284
+ })();
285
+ </script>
286
+ ```
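The positioning arithmetic inside `showTooltip()` can also be written as a pure function, which makes the edge cases easy to check without a browser. This is only a sketch: `placeTooltip` and its `gap` and `edge` parameters are illustrative names, with defaults taken from the 12px and 8px constants in the template.

```javascript
// Illustrative, DOM-free version of the positioning logic in showTooltip().
// x, y: pointer position relative to the container.
// width, height: current chart size.
function placeTooltip(x, y, tipWidth, tipHeight, width, height, gap = 12, edge = 8) {
  let tipX = x + gap;            // default: to the right of the pointer
  let tipY = y - tipHeight / 2;  // vertically centered on the pointer

  // Flip to the left of the pointer when the tooltip would overflow the right edge.
  if (tipX + tipWidth > width) tipX = x - tipWidth - gap;
  // Keep the tooltip inside the top and bottom edges.
  if (tipY < 0) tipY = edge;
  if (tipY + tipHeight > height) tipY = height - tipHeight - edge;

  return { tipX, tipY };
}

const centered = placeTooltip(100, 200, 150, 80, 800, 450);
const topRight = placeTooltip(750, 10, 150, 80, 800, 450);
const bottom = placeTooltip(100, 440, 150, 80, 800, 450);
console.log(centered, topRight, bottom);
```

In the template, the returned values would feed the same `translate(...)` call that `showTooltip()` already uses.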
287
+
288
+ ## Step 3: Key Implementation Details
289
+
290
+ ### CSS Variables (Theme Support)
291
+
292
+ Always use CSS variables for colors that need to adapt to light/dark mode:
293
+
294
+ | Variable | Purpose |
295
+ |----------|---------|
296
+ | `var(--text-color)` | Main text, labels |
297
+ | `var(--muted-color)` | Secondary text, tick labels |
298
+ | `var(--border-color)` | Borders, outlines |
299
+ | `var(--surface-bg)` | Tooltip background |
300
+ | `var(--page-bg)` | Page background |
301
+
302
+ ### D3 Patterns Used
303
+
304
+ **Scale setup:**
305
+ ```javascript
306
+ const xExtent = d3.extent(data, d => d.x);
307
+ const xPadding = (xExtent[1] - xExtent[0]) * 0.1;
308
+ xScale.domain([xExtent[0] - xPadding, xExtent[1] + xPadding])
309
+ .range([0, innerWidth])
310
+ .nice();
311
+ ```
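One case the padding formula above does not cover is a zero-width extent (a single data point, or all values equal): the padding becomes 0 and the domain collapses to a single value. A minimal sketch of a guard follows; `paddedDomain` is a hypothetical helper, and the fallback padding of 1 is an arbitrary choice that should be adapted to the data's units.

```javascript
// Hypothetical helper: same 10% padding as the scale setup above, plus a
// fallback for a zero-width extent. Assumes a non-empty array of finite numbers.
function paddedDomain(values, fraction = 0.1) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const span = max - min;
  // Arbitrary fallback: pad by 1 unit when every value is identical.
  const padding = span > 0 ? span * fraction : 1;
  return [min - padding, max + padding];
}

const domain = paddedDomain([10, 20, 30]); // span 20, padding 2
const flat = paddedDomain([5]);            // zero span, fallback padding 1
console.log(domain, flat);
```

The result can be passed directly to the scale, for example `xScale.domain(paddedDomain(data.map(d => d.x)))`.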
312
+
313
+ **Grid lines:**
314
+ ```javascript
315
+ gGrid.selectAll('.grid-x')
316
+ .data(xScale.ticks(6))
317
+ .join('line')
318
+ .attr('class', 'grid-x')
319
+ .attr('x1', d => xScale(d))
320
+ .attr('x2', d => xScale(d))
321
+ .attr('y1', 0)
322
+ .attr('y2', innerHeight);
323
+ ```
324
+
325
+ **Axes (basic):**
326
+ ```javascript
327
+ gAxes.selectAll('.x-axis')
328
+ .data([0])
329
+ .join('g')
330
+ .attr('class', 'x-axis')
331
+ .attr('transform', `translate(0,${innerHeight})`)
332
+ .call(d3.axisBottom(xScale).ticks(6));
333
+ ```
334
+
335
+ **Axes with inner ticks:**
336
+ ```javascript
337
+ const tickSize = 6;
338
+ gAxes.selectAll('.x-axis')
339
+ .data([0])
340
+ .join('g')
341
+ .attr('class', 'x-axis')
342
+ .attr('transform', `translate(0,${innerHeight})`)
343
+ .call(d3.axisBottom(xScale)
344
+ .ticks(6)
345
+ .tickSizeInner(-tickSize) // Negative = ticks point inward
346
+ .tickSizeOuter(0)); // No outer ticks
347
+ ```
348
+
349
+ **Custom shapes (5-point star):**
350
+ ```javascript
351
+ const starPath = (cx, cy, outerR, innerR) => {
352
+ const points = [];
353
+ for (let i = 0; i < 10; i++) {
354
+ const r = i % 2 === 0 ? outerR : innerR;
355
+ const angle = (Math.PI / 2) + (i * Math.PI / 5);
356
+ points.push([cx + r * Math.cos(angle), cy - r * Math.sin(angle)]);
357
+ }
358
+ return 'M' + points.map(p => p.join(',')).join('L') + 'Z';
359
+ };
360
+
361
+ // Use with path elements
362
+ gContent.selectAll('.point-star')
363
+ .data(openModels)
364
+ .join('path')
365
+ .attr('d', d => starPath(xScale(d.x), yScale(d.y), radius * 1.2, radius * 0.5))
366
+ .attr('fill', d => d.color);
367
+ ```
368
+
369
+ **Data-join for elements:**
370
+ ```javascript
371
+ gContent.selectAll('.point')
372
+ .data(models)
373
+ .join('circle')
374
+ .attr('class', 'point')
375
+ .attr('cx', d => xScale(d.x))
376
+ .attr('cy', d => yScale(d.y))
377
+ .attr('r', 8)
378
+ .attr('fill', d => d.color)
379
+ .on('mouseenter', showTooltip)
380
+ .on('mousemove', showTooltip)
381
+ .on('mouseleave', hideTooltip);
382
+ ```
383
+
384
+ ## Step 4: Integrate in MDX
385
+
386
+ In your `.mdx` file:
387
+
388
+ ```mdx
389
+ import HtmlEmbed from "../../../components/HtmlEmbed.astro";
390
+
391
+ <HtmlEmbed
392
+ src="chart-name.html"
393
+ title="Chart Title"
394
+ caption="<strong>Figure N:</strong> Description of what this shows."
395
+ />
396
+ ```
397
+
398
+ For frameless embedding (like the banner):
399
+ ```mdx
400
+ <HtmlEmbed src="banner.html" frameless />
401
+ ```
402
+
403
+ ## Charts to Convert
404
+
405
+ | Figure | Data File | Chart Type | Status |
406
+ |--------|-----------|------------|--------|
407
+ | 1 | `overall_performance.json` | Scatter | Done (banner.html) |
408
+ | 2 | `calibration_curves.json` | Multi-line | Done (calibration-curves.html) |
409
+ | 3 | `confidence_distribution.json` | Grouped histogram | Done (confidence-distribution.html) |
410
+ | 4 | `score_vs_failed_guesses.json` | Scatter | TODO |
411
+ | 5 | `excess_caution.json` | Box plot | TODO |
412
+ | 6 | `caution_vs_failed_guesses.json` | Scatter | Done (caution-vs-failed-guesses.html) |
413
+ | 7 | `by_rule.json` | Strip plot | Done (by-rule.html) |
414
+ | 8 | `complexity_analysis.json` | Heatmap | Done (complexity-analysis.html) |
415
+
416
+ ## Testing
417
+
418
+ 1. Run dev server: `cd app && npm run dev`
419
+ 2. Open the page that embeds the chart and check that it loads
420
+ 3. Verify tooltip interactions
421
+ 4. Toggle light/dark mode to check theme support
422
+ 5. Resize the window to verify responsiveness
423
+
424
+ ## Debugging Tips
425
+
426
+ - Open browser console to see data loading errors
427
+ - Check Network tab to verify `/data/filename.json` is being fetched
428
+ - If the chart doesn't render, check that `container.dataset.mounted` isn't already `'true'`
429
+ - CSS scoping: always prefix selectors with `.d3-CHART-NAME`
430
+
431
+ ## Common Gotchas
432
+
433
+ ### Using `.style()` vs `.attr()` for Dynamic Colors
434
+
435
+ When a stylesheet rule already sets `fill` or `stroke` on an element (for example `.axes text`), set data-driven colors with `.style()` instead of `.attr()`:
436
+
437
+ ```javascript
438
+ // WON'T WORK if a CSS rule also sets fill - presentation attributes lose to any matching stylesheet rule
439
+ .attr('fill', d => getContrastColor(d.color))
440
+
441
+ // USE THIS - inline styles have higher specificity
442
+ .style('fill', d => getContrastColor(d.color))
443
+ ```
444
+
445
+ This is especially important for text labels where you need to calculate contrast colors dynamically. Example contrast function:
446
+
447
+ ```javascript
448
+ function getContrastColor(hexColor) {
449
+ const hex = hexColor.replace('#', '');
450
+ const r = parseInt(hex.slice(0, 2), 16) / 255;
450
+ const g = parseInt(hex.slice(2, 4), 16) / 255;
451
+ const b = parseInt(hex.slice(4, 6), 16) / 255;
453
+ const luminance = 0.299 * r + 0.587 * g + 0.114 * b;
454
+ return luminance > 0.5 ? '#000000' : '#ffffff';
455
+ }
456
+
457
+ // Usage
458
+ gLabels.selectAll('.label')
459
+ .data(items)
460
+ .join('text')
461
+ .style('fill', d => getContrastColor(d.color))
462
+ .text(d => d.name);
463
+ ```
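The contrast function above assumes a 6-digit hex string. If a data file ever contains shorthand hex (`#f60`) or a non-hex value, `parseInt` yields `NaN` and the comparison silently falls through to white. The sketch below is one possible defensive variant; `getContrastColorSafe` and its `fallback` parameter are illustrative names, not project code.

```javascript
// Illustrative variant of getContrastColor() that accepts 3-digit hex and
// returns a fallback for anything it cannot parse.
function getContrastColorSafe(color, fallback = '#000000') {
  let hex = String(color).trim().replace(/^#/, '');
  if (/^[0-9a-fA-F]{3}$/.test(hex)) {
    // Expand shorthand: "f60" -> "ff6600"
    hex = hex.split('').map((c) => c + c).join('');
  }
  // Named colors, rgb(...) strings, etc. are not handled here.
  if (!/^[0-9a-fA-F]{6}$/.test(hex)) return fallback;

  const r = parseInt(hex.slice(0, 2), 16) / 255;
  const g = parseInt(hex.slice(2, 4), 16) / 255;
  const b = parseInt(hex.slice(4, 6), 16) / 255;
  // Same weighted brightness formula as getContrastColor().
  const brightness = 0.299 * r + 0.587 * g + 0.114 * b;
  return brightness > 0.5 ? '#000000' : '#ffffff';
}

const onOrange = getContrastColorSafe('#FF6B00');
const onBlack = getContrastColorSafe('#000');
const onNamed = getContrastColorSafe('orange', '#123456');
console.log(onOrange, onBlack, onNamed);
```

It is used the same way as the original, for example `.style('fill', d => getContrastColorSafe(d.color))`.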
464
+
465
+ ### CSS Specificity for Axis Labels
466
+
467
+ The generic `.axes text` rule applies to ALL text inside the axes group, including axis labels. To style axis labels differently, use a more specific selector:
468
+
469
+ ```css
470
+ /* This won't work - gets overridden by .axes text */
471
+ .d3-CHART-NAME .axis-label {
472
+ font-size: 15px;
473
+ }
474
+
475
+ /* Use this instead - more specific */
476
+ .d3-CHART-NAME .axes text.axis-label {
477
+ font-size: 15px;
478
+ font-weight: 500;
479
+ fill: var(--text-color);
480
+ }
481
+ ```
482
+
483
+ ### Adjusting Tick Label Position
484
+
485
+ To move X-axis tick labels down (add spacing from the axis line):
486
+
487
+ ```css
488
+ .d3-CHART-NAME .x-axis text {
489
+ transform: translateY(4px);
490
+ }
491
+ ```
492
+
493
+ ### Removing Chart Elements
494
+
495
+ When you don't need a title or legend:
496
+ 1. Remove the rendering code from `render()`
497
+ 2. Remove the CSS styles
498
+ 3. Adjust margins accordingly (e.g., reduce `margin.top` if no title)