dlouapre HF Staff
Improved with 3 more models: GPT 5.2 Pro, Deepseek V3.2 and Gemini 3 Pro High
5619318
============================================================
ELEUSIS RESULTS ANALYSIS
============================================================
Analyzing: results/260216
Loading results...
Loaded 15 evaluation runs:
- solo_evaluation_20260120_091620_gpt_5_2_high
- solo_evaluation_20260120_091622_gpt_oss_120b
- solo_evaluation_20260121_070517_claude_haiku_4_5
- solo_evaluation_20260121_070518_gpt_5_mini_medium
- solo_evaluation_20260121_070520_gemini_3_flash_preview_low
- solo_evaluation_20260121_070522_gpt_oss_20b
- solo_evaluation_20260122_110723_grok_4_1_fast_reasoning
- solo_evaluation_20260122_180323_deepseek_r1
- solo_evaluation_20260126_084128_claude_opus_4_5
- solo_evaluation_20260126_084154_kimi_k2
- solo_evaluation_20260126_084210_glm_4_7
- solo_evaluation_20260202_094341_gemini_3_flash_preview_high
- solo_evaluation_20260211_093748_deepseek_v3_2
- solo_evaluation_20260211_093840_gemini_3_pro_preview_high
- solo_evaluation_20260211_133738_gpt_5_2_pro_high
Extracted 26 unique rules from results files
Built DataFrames: 1170 rounds, 17471 turns
Loaded colors for 20 models
============================================================
BASIC MODEL COMPARISON
============================================================
model rounds_played total_score avg_score total_floored_score avg_floored_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate counting_output_tokens total_no_stakes_score avg_no_stakes_score avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio
Gemini 3 Pro Preview High 78 1183 15.166667 1346 17.256410 888 7535898 77661.61 1.076923 0.897436 6187669 1665.0 21.346154 6968.095721 87.456768 20.743590 46.509402 0.446009
Claude Opus 4.5 78 1128 14.461538 1324 16.974359 756 4333716 86367.64 2.000000 0.833333 3430535 1598.0 20.487179 4537.744709 114.242910 25.000000 81.385983 0.307178
Kimi K2 78 804 10.307692 1262 16.179487 801 12281540 101346.76 2.038462 0.769231 5918992 1481.0 18.987179 7389.503121 126.525293 25.538462 88.446496 0.288745
Glm 4.7 78 856 10.974359 1220 15.641026 862 19398173 398026.00 1.935897 0.743590 5245185 1443.0 18.500000 6084.901392 461.747100 41.500000 92.825983 0.447073
Gpt 5.2 Pro High 78 1188 15.230769 1188 15.230769 1208 2329778 215320.39 0.128205 0.974359 2325128 1668.0 21.384615 1924.774834 178.245356 14.346154 17.624615 0.813984
Grok 4 1 Fast Reasoning 78 737 9.448718 1182 15.153846 795 8178655 120364.22 2.564103 0.717949 4559832 1441.0 18.474359 5735.637736 151.401535 25.243590 106.499829 0.237029
Deepseek V3.2 78 986 12.641026 1174 15.051282 1028 8030639 112903.88 1.243590 0.846154 6262450 1551.0 19.884615 6091.877432 109.828677 36.410256 52.352821 0.695478
Gpt 5.2 High 78 1158 14.846154 1174 15.051282 1195 3341037 73525.83 0.282051 0.948718 3232254 1505.0 19.294872 2704.815063 61.527891 24.628205 36.601709 0.672870
Gemini 3 Flash Preview High 78 786 10.076923 1096 14.051282 1012 17435375 103053.60 1.743590 0.730769 13082063 1365.0 17.500000 12926.939723 101.831621 26.948718 75.419487 0.357318
Gpt 5 Mini Medium 78 942 12.076923 1052 13.487179 1163 3618399 58345.97 1.166667 0.705128 2998454 1325.0 16.987179 2578.206363 50.168504 39.141026 82.882051 0.472250
Deepseek R1 78 511 6.551282 1036 13.282051 851 9229131 165334.16 3.192308 0.641026 5944454 1331.0 17.064103 6985.257344 194.282209 29.628205 115.135043 0.257334
Gemini 3 Flash Preview Low 78 817 10.474359 1024 13.128205 1207 1581524 12702.02 0.961538 0.705128 1389850 1226.0 15.717949 1151.491301 10.523629 29.923077 83.049573 0.360304
Gpt Oss 120B 78 580 7.435897 1004 12.871795 1041 3190828 24633.15 2.153846 0.679487 2250622 1279.0 16.397436 2161.980788 23.662968 46.692308 78.676239 0.593474
Gpt Oss 20B 78 131 1.679487 927 11.884615 972 7009392 62397.50 2.974359 0.589744 3234713 1206.0 15.461538 3327.894033 64.194959 47.576923 88.239487 0.539180
Claude Haiku 4.5 78 -37 -0.474359 894 11.461538 848 6973411 57734.39 3.948718 0.564103 4053200 1198.0 15.358974 4779.716981 68.083007 45.102564 107.387350 0.419999
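Note on derived columns: in the table above, avg_score appears to equal total_score / rounds_played, and variance_ratio appears to equal intra_rule_variance / inter_rule_variance. The sketch below re-checks both relationships for one row. The function name is hypothetical and is not taken from the analysis script; the inputs are the rounded values printed in the log, so the last digit may differ.

```python
def derived_metrics(rounds_played, total_score, intra_rule_variance, inter_rule_variance):
    """Recompute two derived columns from raw columns of the comparison table."""
    return {
        "avg_score": total_score / rounds_played,
        "variance_ratio": intra_rule_variance / inter_rule_variance,
    }

# Values copied from the Gemini 3 Pro Preview High row above.
gemini = derived_metrics(78, 1183, 20.743590, 46.509402)
print(f"avg_score={gemini['avg_score']:.6f}")
print(f"variance_ratio={gemini['variance_ratio']:.6f}")
```

The same check can be repeated for any other row by substituting its four values.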
Saved: results/260216/basic_metrics.csv
Saved: results/260216/overall_performance.png
Saved: results/260216/overall_performance.json
Saved: results/260216/score_vs_failed_guesses.png
Saved: results/260216/score_vs_failed_guesses.json
Saved: results/260216/calibration_curves.png
Saved: results/260216/calibration_curves.json
Saved: results/260216/guess_rate.png
Saved: results/260216/guess_rate.json
Saved: results/260216/score_stack.png
Saved: results/260216/score_stack.json
============================================================
COMPLEXITY ANALYSIS
============================================================
Optimal K for aggregated complexity: 0.23
Formula: complexity = cyclomatic + 0.23 * node_count
Correlation with success_rate: -0.656
Stats by complexity quartile:
complexity_bin count avg_floored_score success_rate
Q1 360 19.772222 0.941667
Q2 225 15.551111 0.813333
Q3 270 15.959259 0.833333
Q4 315 6.276190 0.438095
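Note on the formula: the aggregated complexity combines two per-rule measures with a single weight K. The sketch below shows one way such a weight could be chosen, by scanning candidate values and keeping the one whose complexity correlates most strongly (in absolute value) with success_rate. This selection criterion, the function names, and the toy data are all assumptions; the log does not say how the script picks K.

```python
def aggregated_complexity(cyclomatic, node_count, k):
    """complexity = cyclomatic + k * node_count, as printed in the log."""
    return cyclomatic + k * node_count

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def best_k(rules, candidates):
    """rules: list of (cyclomatic, node_count, success_rate) tuples (hypothetical layout)."""
    def strength(k):
        xs = [aggregated_complexity(c, n, k) for c, n, _ in rules]
        ys = [s for _, _, s in rules]
        return abs(pearson(xs, ys))
    return max(candidates, key=strength)

# Toy data, not taken from the results: success drops as node_count grows.
toy_rules = [(1, 2, 0.9), (1, 4, 0.7), (1, 6, 0.5)]
chosen = best_k(toy_rules, [0.0, 0.5])
print(chosen)
```

In the toy data the cyclomatic term is constant, so K = 0 carries no signal and the scan prefers the non-zero candidate.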
Saved: results/260216/complexity_analysis.png
Saved: results/260216/complexity_analysis.json
============================================================
BY-RULE ANALYSIS
============================================================
Score by rule (sorted by avg_floored_score):
rule_description count avg_floored_score std_floored_score success_rate
Cards must alternate between red and black colors. Any card may start the line. 45 24.711111 2.581598 1.000000
Only cards of the suit spades. 45 24.400000 2.783066 1.000000
Only red cards (hearts or diamonds). 45 24.355556 3.517202 1.000000
Only cards with an even rank (2,4,6,8,10,12). 45 23.866667 2.509980 1.000000
The card must be of a different suit than the card just before it. Any card may start the line. 45 22.177778 5.457642 0.977778
Only hearts, clubs, and diamonds allowed. Spades are forbidden. 45 20.777778 4.742884 0.977778
Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 45 20.777778 5.178140 1.000000
The card must be of a different suit than but same color as the card just before it. Any card may start the line. 45 20.488889 4.993733 0.977778
Only ranks that are prime numbers (2,3,5,7,11,13). 45 20.266667 6.005301 0.955556
Only face cards (11,12,13). 45 20.022222 7.294110 0.933333
Only Aces (rank 1) . 45 19.711111 8.387280 0.933333
Only spades and diamonds. 45 19.466667 3.992038 1.000000
Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 45 17.400000 7.171281 0.911111
Only cards between 1 and 7 inclusive. 45 15.200000 6.614722 0.933333
Only black face cards. 45 12.844444 8.196328 0.800000
Alternate face and number cards. Any card may start the line. 45 10.911111 9.144850 0.688889
Only cards between 5 and 9 inclusive. 45 9.844444 7.416062 0.755556
Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 45 9.288889 8.454752 0.644444
Only red cards whose rank is <=7. 45 8.666667 7.245688 0.711111
Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 45 8.555556 8.983711 0.555556
Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 45 6.133333 7.050467 0.555556
Face cards (11-13) must be red; number cards (1-10) must be black. 45 4.644444 6.729027 0.400000
Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 45 3.800000 6.679684 0.288889
Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 45 2.955556 4.847471 0.311111
If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 45 2.888889 5.924611 0.222222
Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 45 1.466667 4.048569 0.133333
Saved: results/260216/by_rule.png
Saved: results/260216/by_rule.json
============================================================
EXCESS CAUTION ANALYSIS
============================================================
Early Correct Turns: consecutive correct shadow guesses before winning guess
(Only counts successful rounds where model eventually guessed correctly)
Model                        Rounds  Mean  Median  % with Early
Gpt 5.2 Pro High                 76  4.62     4.0          97.4
Gpt 5.2 High                     75  3.56     3.0          86.7
Gemini 3 Pro Preview High        73  2.34     2.0          79.5
Deepseek V3.2                    70  2.19     2.0          74.3
Gemini 3 Flash Preview Low       60  1.92     2.0          68.3
Gemini 3 Flash Preview High      63  1.75     1.0          68.3
Gpt 5 Mini Medium                59  1.32     1.0          67.8
Gpt Oss 120B                     59  1.31     1.0          61.0
Kimi K2                          67  1.28     1.0          65.7
Gpt Oss 20B                      56  1.09     1.0          51.8
Claude Opus 4.5                  72  0.94     1.0          59.7
Glm 4.7                          67  0.94     1.0          59.7
Grok 4 1 Fast Reasoning          69  0.75     0.0          44.9
Claude Haiku 4.5                 55  0.53     0.0          38.2
Deepseek R1                      65  0.49     0.0          33.8
Overall: 1713 total early correct turns across 986 successful rounds
Mean: 1.74, Median: 1.0
639/986 (64.8%) rounds had at least 1 early correct turn
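Note on the metric: "early correct turns" is defined above as the run of consecutive correct shadow guesses immediately before the winning guess. A minimal sketch of that count follows, assuming each successful round is available as a list of booleans (one per turn before the winning turn, oldest first). That data layout and the function names are hypothetical.

```python
from statistics import mean, median

def early_correct_turns(shadow_correct):
    """Count correct shadow guesses directly preceding the winning guess."""
    count = 0
    for ok in reversed(shadow_correct):
        if not ok:
            break  # the run of correct shadow guesses ends here
        count += 1
    return count

def summarize(rounds):
    """Mean, median, and share of rounds with at least one early correct turn."""
    counts = [early_correct_turns(r) for r in rounds]
    with_early = sum(1 for c in counts if c > 0)
    return mean(counts), median(counts), 100.0 * with_early / len(counts)

# Toy rounds, not taken from the results.
toy_rounds = [[False, True, False, True, True], [True, False], []]
print(summarize(toy_rounds))
```

The overall figures printed above are consistent with this kind of summary: 1713 / 986 is about 1.74, and 639 / 986 is about 64.8%.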
Saved: results/260216/excess_caution.png
Saved: results/260216/excess_caution.json
Saved: results/260216/caution_vs_failed_guesses.png
Saved: results/260216/caution_vs_failed_guesses.json
Saved: results/260216/score_vs_recklessness.png
Saved: results/260216/score_vs_recklessness.json
============================================================
RECKLESS GUESSING ANALYSIS
============================================================
Double-Down Rate: After a wrong guess, % of next turns with another guess
(Only counts official guesses, not shadow/tentative guesses)
Model                        Wrong Guesses  Next Turn Guesses  Double-Down %
Grok 4 1 Fast Reasoning                200                108           54.0
Deepseek R1                            249                132           53.0
Claude Haiku 4.5                       308                161           52.3
Kimi K2                                159                 67           42.1
Gpt Oss 20B                            232                 97           41.8
Glm 4.7                                151                 63           41.7
Gemini 3 Pro Preview High               84                 31           36.9
Claude Opus 4.5                        156                 50           32.1
Gemini 3 Flash Preview High            136                 40           29.4
Gpt Oss 120B                           168                 37           22.0
Deepseek V3.2                           97                 20           20.6
Gemini 3 Flash Preview Low              75                 15           20.0
Gpt 5 Mini Medium                       91                  8            8.8
Gpt 5.2 High                            22                  0            0.0
Gpt 5.2 Pro High                        10                  0            0.0
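Note on the metric: the double-down rate is the share of wrong official guesses that are followed by another official guess on the very next turn (for example 108 / 200 = 54.0% in the first row). A minimal sketch, assuming each round is a list of per-turn outcomes encoded as "none", "wrong", or "correct"; this encoding and the function name are hypothetical, and how the script treats a wrong guess on the final turn is not stated in the log.

```python
def double_down_rate(rounds):
    """Return (wrong_guesses, next_turn_guesses, rate_percent) over all rounds."""
    wrong = followups = 0
    for turns in rounds:
        for i, outcome in enumerate(turns):
            if outcome != "wrong":
                continue
            wrong += 1
            # A follow-up is any official guess (right or wrong) on the next turn.
            if i + 1 < len(turns) and turns[i + 1] in ("wrong", "correct"):
                followups += 1
    rate = 100.0 * followups / wrong if wrong else 0.0
    return wrong, followups, rate

# Toy rounds, not taken from the results.
toy = [["none", "wrong", "wrong", "none", "correct"], ["wrong", "correct"]]
print(double_down_rate(toy))
```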
Wrong Guess Streak Statistics:
Model                        Streaks  Mean Length  Max Length  Total Wrong
Grok 4 1 Fast Reasoning          103         1.94           8          200
Deepseek R1                      121         2.06           7          249
Claude Haiku 4.5                 157         1.96           7          308
Kimi K2                          100         1.59           7          159
Gpt Oss 20B                      141         1.65           7          232
Glm 4.7                           91         1.66           5          151
Gemini 3 Pro Preview High         60         1.40           5           84
Claude Opus 4.5                  115         1.36           5          156
Gemini 3 Flash Preview High       99         1.37           4          136
Gpt Oss 120B                     133         1.26           5          168
Deepseek V3.2                     79         1.23           4           97
Gemini 3 Flash Preview Low        63         1.19           4           75
Gpt 5 Mini Medium                 85         1.07           3           91
Gpt 5.2 High                      22         1.00           1           22
Gpt 5.2 Pro High                  10         1.00           1           10
Longest streak: 8 consecutive wrong guesses
- Grok 4 1 Fast Reasoning in round 67
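Note on the metric: in the streak table, Mean Length matches Total Wrong / Streaks (for example 200 / 103 is about 1.94). A minimal sketch of streak extraction follows, assuming a streak is a maximal run of wrong official guesses on consecutive turns and that turns are encoded as "none", "wrong", or "correct". Both the encoding and the function name are hypothetical.

```python
def wrong_guess_streaks(turns):
    """Return the lengths of all maximal runs of consecutive wrong guesses."""
    streaks, run = [], 0
    for outcome in turns:
        if outcome == "wrong":
            run += 1
        else:
            if run:
                streaks.append(run)
            run = 0
    if run:  # a streak that lasts until the final turn
        streaks.append(run)
    return streaks

# Toy round, not taken from the results.
lengths = wrong_guess_streaks(["wrong", "wrong", "none", "wrong", "correct"])
print(len(lengths), sum(lengths) / len(lengths), max(lengths), sum(lengths))
```

The four printed values correspond to the Streaks, Mean Length, Max Length, and Total Wrong columns.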
Saved: results/260216/reckless_guessing.png
Saved: results/260216/reckless_guessing.json
============================================================
COMPLEXITY RATIO ANALYSIS
============================================================
Analyzed 14024 tentative rules with confidence >= 5
Using optimal k = 0.490 for aggregated complexity
Complexity Ratio by Model:
(Ratio = Tentative Complexity / Actual Complexity)
Model Median Q25 Q75 Count
Gpt Oss 120B 1.318 0.873 2.360 1182
Gpt Oss 20B 1.146 0.775 2.068 1219
Claude Haiku 4.5 1.049 0.731 2.000 1001
Deepseek R1 1.000 0.759 1.751 933
Deepseek V3.2 1.000 0.773 1.492 906
Gemini 3 Flash Preview High 1.000 0.824 1.627 958
Gemini 3 Flash Preview Low 1.000 0.775 1.521 1016
Gemini 3 Pro Preview High 1.000 0.691 1.199 789
Glm 4.7 1.000 0.729 1.302 882
Gpt 5 Mini Medium 1.000 0.761 1.656 939
Gpt 5.2 High 1.000 0.788 1.176 857
Gpt 5.2 Pro High 1.000 0.842 1.101 855
Grok 4 1 Fast Reasoning 1.000 0.773 1.664 938
Claude Opus 4.5 0.984 0.708 1.168 664
Kimi K2 0.970 0.621 1.274 885
Interpretation:
- Ratio > 1: Model tends to overcomplicate rules
- Ratio < 1: Model tends to oversimplify rules
- Ratio ~ 1: Model matches actual rule complexity
Highest median: Gpt Oss 120B (1.318)
Lowest median: Kimi K2 (0.970)
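Note on the metric: each ratio divides the complexity of a tentative rule by the complexity of the actual rule, after keeping only tentative rules with confidence >= 5. A minimal sketch of the per-model summary follows, assuming records of the form (tentative_complexity, actual_complexity, confidence). The record layout, the function name, and the linear-interpolation quartile method are assumptions, so quartiles may differ slightly from the script's.

```python
from statistics import quantiles

def ratio_summary(records, min_confidence=5):
    """Median and quartiles of tentative/actual complexity ratios."""
    ratios = [t / a for t, a, conf in records if conf >= min_confidence]
    # method="inclusive" interpolates linearly between the observed values.
    q25, med, q75 = quantiles(ratios, n=4, method="inclusive")
    return {"median": med, "q25": q25, "q75": q75, "count": len(ratios)}

# Toy records, not taken from the results; the last one is filtered out.
toy_records = [(2, 4, 7), (3, 3, 9), (5, 5, 6), (4, 2, 8), (6, 2, 5), (9, 1, 2)]
summary = ratio_summary(toy_records)
print(summary)
```

A median above 1 would indicate a tendency to overcomplicate, and a median below 1 a tendency to oversimplify, as in the interpretation above.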
Saved: results/260216/complexity_ratio.png
Saved: results/260216/complexity_ratio.json
============================================================
OUTPUT TOKENS BY TURN
============================================================
Saved: results/260216/tokens_by_turn.png
Saved: results/260216/tokens_by_turn.json
Tokens trend summary (early vs late turns):
Claude Haiku 4.5: early=3191, late=5889 (+84.5%)
Claude Opus 4.5: early=2649, late=8447 (+218.9%)
Deepseek R1: early=5083, late=10946 (+115.3%)
Deepseek V3.2: early=3660, late=12272 (+235.3%)
Gemini 3 Flash Preview High: early=10023, late=15704 (+56.7%)
Gemini 3 Flash Preview Low: early=1046, late=1351 (+29.1%)
Gemini 3 Pro Preview High: early=5059, late=14670 (+190.0%)
Glm 4.7: early=3885, late=4457 (+14.7%)
Gpt 5 Mini Medium: early=1241, late=4862 (+291.9%)
Gpt 5.2 High: early=963, late=5910 (+514.0%)
Gpt 5.2 Pro High: early=1006, late=3125 (+210.7%)
Gpt Oss 120B: early=1050, late=4475 (+326.2%)
Gpt Oss 20B: early=1744, late=7789 (+346.6%)
Grok 4 1 Fast Reasoning: early=2810, late=17827 (+534.4%)
Kimi K2: early=5545, late=10653 (+92.1%)
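Note on the trend figures: the percentage in parentheses is consistent with the relative change from the early average to the late average. A minimal sketch follows; the function name is hypothetical, and how the script splits turns into "early" and "late" is not stated in the log. Because the printed averages are rounded, recomputing from them can differ from the logged percentage in the last digit.

```python
def pct_change(early, late):
    """Relative change from early to late, in percent."""
    return 100.0 * (late - early) / early

# Values copied from the Claude Opus 4.5 line above.
opus = pct_change(2649, 8447)
print(f"{opus:+.1f}%")
```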
============================================================
PER-MODEL REPORTS
============================================================
Saved: results/260216/model_gpt_5_2_high.png
Saved: results/260216/model_gpt_oss_120b.png
Saved: results/260216/model_claude_haiku_4_5.png
Saved: results/260216/model_gpt_5_mini_medium.png
Saved: results/260216/model_gemini_3_flash_preview_low.png
Saved: results/260216/model_gpt_oss_20b.png
Saved: results/260216/model_grok_4_1_fast_reasoning.png
Saved: results/260216/model_deepseek_r1.png
Saved: results/260216/model_claude_opus_4_5.png
Saved: results/260216/model_kimi_k2.png
Saved: results/260216/model_glm_4_7.png
Saved: results/260216/model_gemini_3_flash_preview_high.png
Saved: results/260216/model_deepseek_v3_2.png
Saved: results/260216/model_gemini_3_pro_preview_high.png
Saved: results/260216/model_gpt_5_2_pro_high.png
============================================================
Analysis complete! All outputs saved to: results/260216
============================================================