| ============================================================ | |
| ELEUSIS RESULTS ANALYSIS | |
| ============================================================ | |
| Analyzing: results/260216 | |
| Loading results... | |
| Loaded 15 evaluation runs: | |
| - solo_evaluation_20260120_091620_gpt_5_2_high | |
| - solo_evaluation_20260120_091622_gpt_oss_120b | |
| - solo_evaluation_20260121_070517_claude_haiku_4_5 | |
| - solo_evaluation_20260121_070518_gpt_5_mini_medium | |
| - solo_evaluation_20260121_070520_gemini_3_flash_preview_low | |
| - solo_evaluation_20260121_070522_gpt_oss_20b | |
| - solo_evaluation_20260122_110723_grok_4_1_fast_reasoning | |
| - solo_evaluation_20260122_180323_deepseek_r1 | |
| - solo_evaluation_20260126_084128_claude_opus_4_5 | |
| - solo_evaluation_20260126_084154_kimi_k2 | |
| - solo_evaluation_20260126_084210_glm_4_7 | |
| - solo_evaluation_20260202_094341_gemini_3_flash_preview_high | |
| - solo_evaluation_20260211_093748_deepseek_v3_2 | |
| - solo_evaluation_20260211_093840_gemini_3_pro_preview_high | |
| - solo_evaluation_20260211_133738_gpt_5_2_pro_high | |
| Extracted 26 unique rules from results files | |
| Built DataFrames: 1170 rounds, 17471 turns | |
| Loaded colors for 20 models | |
| ============================================================ | |
| BASIC MODEL COMPARISON | |
| ============================================================ | |
| model rounds_played total_score avg_score total_floored_score avg_floored_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate counting_output_tokens total_no_stakes_score avg_no_stakes_score avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio | |
| Gemini 3 Pro Preview High 78 1183 15.166667 1346 17.256410 888 7535898 77661.61 1.076923 0.897436 6187669 1665.0 21.346154 6968.095721 87.456768 20.743590 46.509402 0.446009 | |
| Claude Opus 4.5 78 1128 14.461538 1324 16.974359 756 4333716 86367.64 2.000000 0.833333 3430535 1598.0 20.487179 4537.744709 114.242910 25.000000 81.385983 0.307178 | |
| Kimi K2 78 804 10.307692 1262 16.179487 801 12281540 101346.76 2.038462 0.769231 5918992 1481.0 18.987179 7389.503121 126.525293 25.538462 88.446496 0.288745 | |
| Glm 4.7 78 856 10.974359 1220 15.641026 862 19398173 398026.00 1.935897 0.743590 5245185 1443.0 18.500000 6084.901392 461.747100 41.500000 92.825983 0.447073 | |
| Gpt 5.2 Pro High 78 1188 15.230769 1188 15.230769 1208 2329778 215320.39 0.128205 0.974359 2325128 1668.0 21.384615 1924.774834 178.245356 14.346154 17.624615 0.813984 | |
| Grok 4 1 Fast Reasoning 78 737 9.448718 1182 15.153846 795 8178655 120364.22 2.564103 0.717949 4559832 1441.0 18.474359 5735.637736 151.401535 25.243590 106.499829 0.237029 | |
| Deepseek V3.2 78 986 12.641026 1174 15.051282 1028 8030639 112903.88 1.243590 0.846154 6262450 1551.0 19.884615 6091.877432 109.828677 36.410256 52.352821 0.695478 | |
| Gpt 5.2 High 78 1158 14.846154 1174 15.051282 1195 3341037 73525.83 0.282051 0.948718 3232254 1505.0 19.294872 2704.815063 61.527891 24.628205 36.601709 0.672870 | |
| Gemini 3 Flash Preview High 78 786 10.076923 1096 14.051282 1012 17435375 103053.60 1.743590 0.730769 13082063 1365.0 17.500000 12926.939723 101.831621 26.948718 75.419487 0.357318 | |
| Gpt 5 Mini Medium 78 942 12.076923 1052 13.487179 1163 3618399 58345.97 1.166667 0.705128 2998454 1325.0 16.987179 2578.206363 50.168504 39.141026 82.882051 0.472250 | |
| Deepseek R1 78 511 6.551282 1036 13.282051 851 9229131 165334.16 3.192308 0.641026 5944454 1331.0 17.064103 6985.257344 194.282209 29.628205 115.135043 0.257334 | |
| Gemini 3 Flash Preview Low 78 817 10.474359 1024 13.128205 1207 1581524 12702.02 0.961538 0.705128 1389850 1226.0 15.717949 1151.491301 10.523629 29.923077 83.049573 0.360304 | |
| Gpt Oss 120B 78 580 7.435897 1004 12.871795 1041 3190828 24633.15 2.153846 0.679487 2250622 1279.0 16.397436 2161.980788 23.662968 46.692308 78.676239 0.593474 | |
| Gpt Oss 20B 78 131 1.679487 927 11.884615 972 7009392 62397.50 2.974359 0.589744 3234713 1206.0 15.461538 3327.894033 64.194959 47.576923 88.239487 0.539180 | |
| Claude Haiku 4.5 78 -37 -0.474359 894 11.461538 848 6973411 57734.39 3.948718 0.564103 4053200 1198.0 15.358974 4779.716981 68.083007 45.102564 107.387350 0.419999 | |
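The log does not define the `variance_ratio` column, but the printed values are consistent with dividing `intra_rule_variance` by `inter_rule_variance`. A minimal sketch under that assumption (the function name is hypothetical, not taken from the analysis script):

```python
def variance_ratio(intra_rule_variance: float, inter_rule_variance: float) -> float:
    """Assumed definition: within-rule variance divided by between-rule variance."""
    return intra_rule_variance / inter_rule_variance


# Values copied from the table rows above.
gemini_pro = variance_ratio(20.743590, 46.509402)   # Gemini 3 Pro Preview High
gpt_pro = variance_ratio(14.346154, 17.624615)      # Gpt 5.2 Pro High

print(round(gemini_pro, 6), round(gpt_pro, 6))
```

Under this reading, a ratio near 1 (Gpt 5.2 Pro High) means a model's scores vary about as much between repeats of the same rule as between different rules, while a low ratio (Grok 4 1 Fast Reasoning) means most of the spread comes from which rule was played.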
| Saved: results/260216/basic_metrics.csv | |
| Saved: results/260216/overall_performance.png | |
| Saved: results/260216/overall_performance.json | |
| Saved: results/260216/score_vs_failed_guesses.png | |
| Saved: results/260216/score_vs_failed_guesses.json | |
| Saved: results/260216/calibration_curves.png | |
| Saved: results/260216/calibration_curves.json | |
| Saved: results/260216/guess_rate.png | |
| Saved: results/260216/guess_rate.json | |
| Saved: results/260216/score_stack.png | |
| Saved: results/260216/score_stack.json | |
| ============================================================ | |
| COMPLEXITY ANALYSIS | |
| ============================================================ | |
| Optimal K for aggregated complexity: 0.23 | |
| Formula: complexity = cyclomatic + 0.23 * node_count | |
| Correlation with success_rate: -0.656 | |
| Stats by complexity quartile: | |
| complexity_bin count avg_floored_score success_rate | |
| Q1 360 19.772222 0.941667 | |
| Q2 225 15.551111 0.813333 | |
| Q3 270 15.959259 0.833333 | |
| Q4 315 6.276190 0.438095 | |
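The aggregated complexity formula reported above can be written out directly. This is only a sketch of the printed formula; the function name and the example inputs are hypothetical, and how the script measures `cyclomatic` and `node_count` for a rule is not shown in the log.

```python
def aggregated_complexity(cyclomatic: float, node_count: float, k: float = 0.23) -> float:
    """complexity = cyclomatic + k * node_count, with k = 0.23 as reported in this section."""
    return cyclomatic + k * node_count


# Hypothetical rule with cyclomatic complexity 3 and 10 AST nodes.
example = aggregated_complexity(cyclomatic=3, node_count=10)
print(example)
```

Note that the complexity ratio section of the same log reports a different weight (k = 0.490), so `k` is kept as a parameter rather than hard-coded.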
| Saved: results/260216/complexity_analysis.png | |
| Saved: results/260216/complexity_analysis.json | |
| ============================================================ | |
| BY-RULE ANALYSIS | |
| ============================================================ | |
| Score by rule (sorted by avg_floored_score): | |
| rule_description count avg_floored_score std_floored_score success_rate | |
| Cards must alternate between red and black colors. Any card may start the line. 45 24.711111 2.581598 1.000000 | |
| Only cards of the suit spades. 45 24.400000 2.783066 1.000000 | |
| Only red cards (hearts or diamonds). 45 24.355556 3.517202 1.000000 | |
| Only cards with an even rank (2,4,6,8,10,12). 45 23.866667 2.509980 1.000000 | |
| The card must be of a different suit than the card just before it. Any card may start the line. 45 22.177778 5.457642 0.977778 | |
| Only hearts, clubs, and diamonds allowed. Spades are forbidden. 45 20.777778 4.742884 0.977778 | |
| Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 45 20.777778 5.178140 1.000000 | |
| The card must be of a different suit than but same color as the card just before it. Any card may start the line. 45 20.488889 4.993733 0.977778 | |
| Only ranks that are prime numbers (2,3,5,7,11,13). 45 20.266667 6.005301 0.955556 | |
| Only face cards (11,12,13). 45 20.022222 7.294110 0.933333 | |
| Only Aces (rank 1) . 45 19.711111 8.387280 0.933333 | |
| Only spades and diamonds. 45 19.466667 3.992038 1.000000 | |
| Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 45 17.400000 7.171281 0.911111 | |
| Only cards between 1 and 7 inclusive. 45 15.200000 6.614722 0.933333 | |
| Only black face cards. 45 12.844444 8.196328 0.800000 | |
| Alternate face and number cards. Any card may start the line. 45 10.911111 9.144850 0.688889 | |
| Only cards between 5 and 9 inclusive. 45 9.844444 7.416062 0.755556 | |
| Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 45 9.288889 8.454752 0.644444 | |
| Only red cards whose rank is <=7. 45 8.666667 7.245688 0.711111 | |
| Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 45 8.555556 8.983711 0.555556 | |
| Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 45 6.133333 7.050467 0.555556 | |
| Face cards (11-13) must be red; number cards (1-10) must be black. 45 4.644444 6.729027 0.400000 | |
| Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 45 3.800000 6.679684 0.288889 | |
| Face cards imposes the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 45 2.955556 4.847471 0.311111 | |
| If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 45 2.888889 5.924611 0.222222 | |
| Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 45 1.466667 4.048569 0.133333 | |
| Saved: results/260216/by_rule.png | |
| Saved: results/260216/by_rule.json | |
| ============================================================ | |
| EXCESS CAUTION ANALYSIS | |
| ============================================================ | |
| Early Correct Turns: consecutive correct shadow guesses before winning guess | |
| (Only counts successful rounds where model eventually guessed correctly) | |
| Model Rounds Mean Median % with Early | |
| Gpt 5.2 Pro High 76 4.62 4.0 97.4 | |
| Gpt 5.2 High 75 3.56 3.0 86.7 | |
| Gemini 3 Pro Preview High 73 2.34 2.0 79.5 | |
| Deepseek V3.2 70 2.19 2.0 74.3 | |
| Gemini 3 Flash Preview Low 60 1.92 2.0 68.3 | |
| Gemini 3 Flash Preview High 63 1.75 1.0 68.3 | |
| Gpt 5 Mini Medium 59 1.32 1.0 67.8 | |
| Gpt Oss 120B 59 1.31 1.0 61.0 | |
| Kimi K2 67 1.28 1.0 65.7 | |
| Gpt Oss 20B 56 1.09 1.0 51.8 | |
| Claude Opus 4.5 72 0.94 1.0 59.7 | |
| Glm 4.7 67 0.94 1.0 59.7 | |
| Grok 4 1 Fast Reasoning 69 0.75 0.0 44.9 | |
| Claude Haiku 4.5 55 0.53 0.0 38.2 | |
| Deepseek R1 65 0.49 0.0 33.8 | |
| Overall: 1713 total early correct turns across 986 successful rounds | |
| Mean: 1.74, Median: 1.0 | |
| 639/986 (64.8%) rounds had at least 1 early correct turn | |
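One plausible reading of "early correct turns" is the length of the run of correct shadow guesses immediately preceding the winning official guess. The sketch below assumes that reading; the input representation (one boolean per turn before the winning turn) and the function name are hypothetical.

```python
from typing import Sequence


def early_correct_turns(shadow_correct: Sequence[bool]) -> int:
    """Count consecutive correct shadow guesses directly before the winning guess.

    shadow_correct holds one flag per turn played before the winning turn,
    in chronological order (assumed representation).
    """
    run = 0
    for was_correct in reversed(shadow_correct):
        if not was_correct:
            break
        run += 1
    return run


# The model already had the right rule for the last two turns before it committed.
print(early_correct_turns([False, True, False, True, True]))
```

A round with no early correct turns returns 0, which matches the medians of 0.0 reported for the least cautious models.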
| Saved: results/260216/excess_caution.png | |
| Saved: results/260216/excess_caution.json | |
| Saved: results/260216/caution_vs_failed_guesses.png | |
| Saved: results/260216/caution_vs_failed_guesses.json | |
| Saved: results/260216/score_vs_recklessness.png | |
| Saved: results/260216/score_vs_recklessness.json | |
| ============================================================ | |
| RECKLESS GUESSING ANALYSIS | |
| ============================================================ | |
| Double-Down Rate: After a wrong guess, % of next turns with another guess | |
| (Only counts official guesses, not shadow/tentative guesses) | |
| Model Wrong Guesses Next Turn Guesses Double-Down % | |
| Grok 4 1 Fast Reasoning 200 108 54.0 | |
| Deepseek R1 249 132 53.0 | |
| Claude Haiku 4.5 308 161 52.3 | |
| Kimi K2 159 67 42.1 | |
| Gpt Oss 20B 232 97 41.8 | |
| Glm 4.7 151 63 41.7 | |
| Gemini 3 Pro Preview High 84 31 36.9 | |
| Claude Opus 4.5 156 50 32.1 | |
| Gemini 3 Flash Preview High 136 40 29.4 | |
| Gpt Oss 120B 168 37 22.0 | |
| Deepseek V3.2 97 20 20.6 | |
| Gemini 3 Flash Preview Low 75 15 20.0 | |
| Gpt 5 Mini Medium 91 8 8.8 | |
| Gpt 5.2 High 22 0 0.0 | |
| Gpt 5.2 Pro High 10 0 0.0 | |
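The double-down percentage in the table is the "Next Turn Guesses" column divided by the "Wrong Guesses" column (for example 108 / 200 = 54.0%). A minimal sketch of how such counts could be derived from per-round turn logs, assuming each turn is labelled "none", "wrong" or "correct" for its official guess (this labelling and the function name are hypothetical):

```python
from typing import Iterable, Sequence, Tuple


def double_down_rate(rounds: Iterable[Sequence[str]]) -> Tuple[int, int, float]:
    """Return (wrong_guesses, next_turn_guesses, double_down_pct).

    A wrong guess counts as doubled down when the next turn of the same
    round contains another official guess, right or wrong.
    """
    wrong = 0
    followed = 0
    for turns in rounds:
        for i, outcome in enumerate(turns):
            if outcome != "wrong":
                continue
            wrong += 1
            if i + 1 < len(turns) and turns[i + 1] in ("wrong", "correct"):
                followed += 1
    pct = 100.0 * followed / wrong if wrong else 0.0
    return wrong, followed, pct


example_rounds = [
    ["none", "wrong", "wrong", "none", "correct"],
    ["wrong", "none", "wrong", "correct"],
]
result = double_down_rate(example_rounds)
print(result)
```

Whether the actual script counts a wrong guess on the final turn of a round in the denominator is not visible in the log; this sketch does count it.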
| Wrong Guess Streak Statistics: | |
| Model Streaks Mean Length Max Length Total Wrong | |
| Grok 4 1 Fast Reasoning 103 1.94 8 200 | |
| Deepseek R1 121 2.06 7 249 | |
| Claude Haiku 4.5 157 1.96 7 308 | |
| Kimi K2 100 1.59 7 159 | |
| Gpt Oss 20B 141 1.65 7 232 | |
| Glm 4.7 91 1.66 5 151 | |
| Gemini 3 Pro Preview High 60 1.40 5 84 | |
| Claude Opus 4.5 115 1.36 5 156 | |
| Gemini 3 Flash Preview High 99 1.37 4 136 | |
| Gpt Oss 120B 133 1.26 5 168 | |
| Deepseek V3.2 79 1.23 4 97 | |
| Gemini 3 Flash Preview Low 63 1.19 4 75 | |
| Gpt 5 Mini Medium 85 1.07 3 91 | |
| Gpt 5.2 High 22 1.00 1 22 | |
| Gpt 5.2 Pro High 10 1.00 1 10 | |
| Longest streak: 8 consecutive wrong guesses | |
| - Grok 4 1 Fast Reasoning in round 67 | |
| Saved: results/260216/reckless_guessing.png | |
| Saved: results/260216/reckless_guessing.json | |
| ============================================================ | |
| COMPLEXITY RATIO ANALYSIS | |
| ============================================================ | |
| Analyzed 14024 tentative rules with confidence >= 5 | |
| Using optimal k = 0.490 for aggregated complexity | |
| Complexity Ratio by Model: | |
| (Ratio = Tentative Complexity / Actual Complexity) | |
| Model Median Q25 Q75 Count | |
| Gpt Oss 120B 1.318 0.873 2.360 1182 | |
| Gpt Oss 20B 1.146 0.775 2.068 1219 | |
| Claude Haiku 4.5 1.049 0.731 2.000 1001 | |
| Deepseek R1 1.000 0.759 1.751 933 | |
| Deepseek V3.2 1.000 0.773 1.492 906 | |
| Gemini 3 Flash Preview High 1.000 0.824 1.627 958 | |
| Gemini 3 Flash Preview Low 1.000 0.775 1.521 1016 | |
| Gemini 3 Pro Preview High 1.000 0.691 1.199 789 | |
| Glm 4.7 1.000 0.729 1.302 882 | |
| Gpt 5 Mini Medium 1.000 0.761 1.656 939 | |
| Gpt 5.2 High 1.000 0.788 1.176 857 | |
| Gpt 5.2 Pro High 1.000 0.842 1.101 855 | |
| Grok 4 1 Fast Reasoning 1.000 0.773 1.664 938 | |
| Claude Opus 4.5 0.984 0.708 1.168 664 | |
| Kimi K2 0.970 0.621 1.274 885 | |
| Interpretation: | |
| - Ratio > 1: Model tends to overcomplicate rules | |
| - Ratio < 1: Model tends to oversimplify rules | |
| - Ratio ~ 1: Model matches actual rule complexity | |
| Highest median: Gpt Oss 120B (1.318) | |
| Lowest median: Kimi K2 (0.970) | |
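The per-model summary above (median, Q25, Q75 of tentative complexity divided by actual complexity) can be sketched with the standard library. The input pairs are made-up illustration values, and the choice of the "inclusive" quantile method is an assumption; the log does not say which interpolation the script uses.

```python
import statistics
from typing import Iterable, Tuple


def complexity_ratio_summary(pairs: Iterable[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Return (median, q25, q75) of tentative_complexity / actual_complexity."""
    ratios = [tentative / actual for tentative, actual in pairs]
    q25, _, q75 = statistics.quantiles(ratios, n=4, method="inclusive")
    return statistics.median(ratios), q25, q75


# Hypothetical (tentative, actual) complexity pairs for one model.
pairs = [(2.0, 4.0), (3.0, 3.0), (5.0, 5.0), (8.0, 4.0)]
summary = complexity_ratio_summary(pairs)
print(summary)
```

A median of exactly 1.000, as most models show, only says the typical tentative rule matches the true rule's complexity; the Q75 column is what separates models that frequently overcomplicate (Gpt Oss 120B at 2.360) from those that rarely do (Gpt 5.2 Pro High at 1.101).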
| Saved: results/260216/complexity_ratio.png | |
| Saved: results/260216/complexity_ratio.json | |
| ============================================================ | |
| OUTPUT TOKENS BY TURN | |
| ============================================================ | |
| Saved: results/260216/tokens_by_turn.png | |
| Saved: results/260216/tokens_by_turn.json | |
| Tokens trend summary (early vs late turns): | |
| Claude Haiku 4.5: early=3191, late=5889 (+84.5%) | |
| Claude Opus 4.5: early=2649, late=8447 (+218.9%) | |
| Deepseek R1: early=5083, late=10946 (+115.3%) | |
| Deepseek V3.2: early=3660, late=12272 (+235.3%) | |
| Gemini 3 Flash Preview High: early=10023, late=15704 (+56.7%) | |
| Gemini 3 Flash Preview Low: early=1046, late=1351 (+29.1%) | |
| Gemini 3 Pro Preview High: early=5059, late=14670 (+190.0%) | |
| Glm 4.7: early=3885, late=4457 (+14.7%) | |
| Gpt 5 Mini Medium: early=1241, late=4862 (+291.9%) | |
| Gpt 5.2 High: early=963, late=5910 (+514.0%) | |
| Gpt 5.2 Pro High: early=1006, late=3125 (+210.7%) | |
| Gpt Oss 120B: early=1050, late=4475 (+326.2%) | |
| Gpt Oss 20B: early=1744, late=7789 (+346.6%) | |
| Grok 4 1 Fast Reasoning: early=2810, late=17827 (+534.4%) | |
| Kimi K2: early=5545, late=10653 (+92.1%) | |
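The percentages in the trend summary are relative changes from the early-turn mean to the late-turn mean. A small sketch of that arithmetic follows (the function name is hypothetical, and how the script splits turns into "early" and "late" is not shown in the log). Because the printed means are rounded to whole tokens, recomputing from them can differ from the logged percentage in the last digit.

```python
def pct_change(early: float, late: float) -> float:
    """Relative change from early to late, in percent."""
    return 100.0 * (late - early) / early


# Rounded means copied from the summary lines above.
opus = pct_change(2649, 8447)    # logged as +218.9%
grok = pct_change(2810, 17827)   # logged as +534.4%

print(f"{opus:+.1f}% {grok:+.1f}%")
```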
| ============================================================ | |
| PER-MODEL REPORTS | |
| ============================================================ | |
| Saved: results/260216/model_gpt_5_2_high.png | |
| Saved: results/260216/model_gpt_oss_120b.png | |
| Saved: results/260216/model_claude_haiku_4_5.png | |
| Saved: results/260216/model_gpt_5_mini_medium.png | |
| Saved: results/260216/model_gemini_3_flash_preview_low.png | |
| Saved: results/260216/model_gpt_oss_20b.png | |
| Saved: results/260216/model_grok_4_1_fast_reasoning.png | |
| Saved: results/260216/model_deepseek_r1.png | |
| Saved: results/260216/model_claude_opus_4_5.png | |
| Saved: results/260216/model_kimi_k2.png | |
| Saved: results/260216/model_glm_4_7.png | |
| Saved: results/260216/model_gemini_3_flash_preview_high.png | |
| Saved: results/260216/model_deepseek_v3_2.png | |
| Saved: results/260216/model_gemini_3_pro_preview_high.png | |
| Saved: results/260216/model_gpt_5_2_pro_high.png | |
| ============================================================ | |
| Analysis complete! All outputs saved to: results/260216 | |
| ============================================================ | |