eleusis-benchmark

eleusis-benchmark/app/src/content/assets/data/summary.txt

dlouapre HF Staff

Improved with 3 more models: GPT 5.2 Pro, Deepseek V3.2, and Gemini 3 Pro High

5619318 2 months ago

	============================================================
	ELEUSIS RESULTS ANALYSIS
	============================================================

	Analyzing: results/260216

	Loading results...
	Loaded 15 evaluation runs:
	- solo_evaluation_20260120_091620_gpt_5_2_high
	- solo_evaluation_20260120_091622_gpt_oss_120b
	- solo_evaluation_20260121_070517_claude_haiku_4_5
	- solo_evaluation_20260121_070518_gpt_5_mini_medium
	- solo_evaluation_20260121_070520_gemini_3_flash_preview_low
	- solo_evaluation_20260121_070522_gpt_oss_20b
	- solo_evaluation_20260122_110723_grok_4_1_fast_reasoning
	- solo_evaluation_20260122_180323_deepseek_r1
	- solo_evaluation_20260126_084128_claude_opus_4_5
	- solo_evaluation_20260126_084154_kimi_k2
	- solo_evaluation_20260126_084210_glm_4_7
	- solo_evaluation_20260202_094341_gemini_3_flash_preview_high
	- solo_evaluation_20260211_093748_deepseek_v3_2
	- solo_evaluation_20260211_093840_gemini_3_pro_preview_high
	- solo_evaluation_20260211_133738_gpt_5_2_pro_high
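
The run names listed above appear to follow a `solo_evaluation_<YYYYMMDD>_<HHMMSS>_<model_slug>` layout. That layout is inferred from the names themselves, not confirmed by the analysis script, and `parse_run_name` is a hypothetical helper. Under that assumption, a minimal sketch for splitting a run name into a timestamp and a model slug could look like this:

```python
from datetime import datetime

PREFIX = "solo_evaluation_"  # assumed common prefix, inferred from the log


def parse_run_name(name: str) -> tuple[datetime, str]:
    """Split a run name into (start timestamp, model slug)."""
    if not name.startswith(PREFIX):
        raise ValueError(f"unexpected run name: {name}")
    # Split only twice so underscores inside the model slug are preserved.
    date_part, time_part, model_slug = name[len(PREFIX):].split("_", 2)
    started = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return started, model_slug


started, slug = parse_run_name("solo_evaluation_20260120_091620_gpt_5_2_high")
print(started.isoformat(), slug)
```

Splitting with a maximum of two separators keeps slugs such as `gpt_5_2_high` intact.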

	Extracted 26 unique rules from results files
	Built DataFrames: 1170 rounds, 17471 turns
	Loaded colors for 20 models

	============================================================
	BASIC MODEL COMPARISON
	============================================================

	model rounds_played total_score avg_score total_floored_score avg_floored_score total_turns total_output_tokens total_wall_clock avg_failed_guesses success_rate counting_output_tokens total_no_stakes_score avg_no_stakes_score avg_output_tokens_per_turn wall_clock_per_turn intra_rule_variance inter_rule_variance variance_ratio
	Gemini 3 Pro Preview High 78 1183 15.166667 1346 17.256410 888 7535898 77661.61 1.076923 0.897436 6187669 1665.0 21.346154 6968.095721 87.456768 20.743590 46.509402 0.446009
	Claude Opus 4.5 78 1128 14.461538 1324 16.974359 756 4333716 86367.64 2.000000 0.833333 3430535 1598.0 20.487179 4537.744709 114.242910 25.000000 81.385983 0.307178
	Kimi K2 78 804 10.307692 1262 16.179487 801 12281540 101346.76 2.038462 0.769231 5918992 1481.0 18.987179 7389.503121 126.525293 25.538462 88.446496 0.288745
	Glm 4.7 78 856 10.974359 1220 15.641026 862 19398173 398026.00 1.935897 0.743590 5245185 1443.0 18.500000 6084.901392 461.747100 41.500000 92.825983 0.447073
	Gpt 5.2 Pro High 78 1188 15.230769 1188 15.230769 1208 2329778 215320.39 0.128205 0.974359 2325128 1668.0 21.384615 1924.774834 178.245356 14.346154 17.624615 0.813984
	Grok 4 1 Fast Reasoning 78 737 9.448718 1182 15.153846 795 8178655 120364.22 2.564103 0.717949 4559832 1441.0 18.474359 5735.637736 151.401535 25.243590 106.499829 0.237029
	Deepseek V3.2 78 986 12.641026 1174 15.051282 1028 8030639 112903.88 1.243590 0.846154 6262450 1551.0 19.884615 6091.877432 109.828677 36.410256 52.352821 0.695478
	Gpt 5.2 High 78 1158 14.846154 1174 15.051282 1195 3341037 73525.83 0.282051 0.948718 3232254 1505.0 19.294872 2704.815063 61.527891 24.628205 36.601709 0.672870
	Gemini 3 Flash Preview High 78 786 10.076923 1096 14.051282 1012 17435375 103053.60 1.743590 0.730769 13082063 1365.0 17.500000 12926.939723 101.831621 26.948718 75.419487 0.357318
	Gpt 5 Mini Medium 78 942 12.076923 1052 13.487179 1163 3618399 58345.97 1.166667 0.705128 2998454 1325.0 16.987179 2578.206363 50.168504 39.141026 82.882051 0.472250
	Deepseek R1 78 511 6.551282 1036 13.282051 851 9229131 165334.16 3.192308 0.641026 5944454 1331.0 17.064103 6985.257344 194.282209 29.628205 115.135043 0.257334
	Gemini 3 Flash Preview Low 78 817 10.474359 1024 13.128205 1207 1581524 12702.02 0.961538 0.705128 1389850 1226.0 15.717949 1151.491301 10.523629 29.923077 83.049573 0.360304
	Gpt Oss 120B 78 580 7.435897 1004 12.871795 1041 3190828 24633.15 2.153846 0.679487 2250622 1279.0 16.397436 2161.980788 23.662968 46.692308 78.676239 0.593474
	Gpt Oss 20B 78 131 1.679487 927 11.884615 972 7009392 62397.50 2.974359 0.589744 3234713 1206.0 15.461538 3327.894033 64.194959 47.576923 88.239487 0.539180
	Claude Haiku 4.5 78 -37 -0.474359 894 11.461538 848 6973411 57734.39 3.948718 0.564103 4053200 1198.0 15.358974 4779.716981 68.083007 45.102564 107.387350 0.419999
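
Several columns in the table above look like simple functions of other columns: the averages match totals divided by `rounds_played`, and `variance_ratio` matches `intra_rule_variance / inter_rule_variance`. These relations are consistent with the printed values but are not confirmed as the script's actual formulas. The sketch below, with a hypothetical `derived_metrics` helper, rechecks them for the Gemini 3 Pro Preview High row:

```python
def derived_metrics(row: dict) -> dict:
    """Recompute derived columns from a row's raw totals (assumed relations)."""
    rounds = row["rounds_played"]
    return {
        "avg_score": row["total_score"] / rounds,
        "avg_floored_score": row["total_floored_score"] / rounds,
        "variance_ratio": row["intra_rule_variance"] / row["inter_rule_variance"],
    }


# Values copied from the Gemini 3 Pro Preview High row of the table.
gemini_3_pro = {
    "rounds_played": 78,
    "total_score": 1183,
    "total_floored_score": 1346,
    "intra_rule_variance": 20.743590,
    "inter_rule_variance": 46.509402,
}

metrics = derived_metrics(gemini_3_pro)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Because the printed variances are rounded to six decimals, the recomputed ratio can differ from the table in the last digit.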

	Saved: results/260216/basic_metrics.csv
	Saved: results/260216/overall_performance.png
	Saved: results/260216/overall_performance.json
	Saved: results/260216/score_vs_failed_guesses.png
	Saved: results/260216/score_vs_failed_guesses.json
	Saved: results/260216/calibration_curves.png
	Saved: results/260216/calibration_curves.json
	Saved: results/260216/guess_rate.png
	Saved: results/260216/guess_rate.json
	Saved: results/260216/score_stack.png
	Saved: results/260216/score_stack.json

	============================================================
	COMPLEXITY ANALYSIS
	============================================================

	Optimal K for aggregated complexity: 0.23
	Formula: complexity = cyclomatic + 0.23 * node_count
	Correlation with success_rate: -0.656

	Stats by complexity quartile:
	complexity_bin  count  avg_floored_score  success_rate
	Q1                360          19.772222      0.941667
	Q2                225          15.551111      0.813333
	Q3                270          15.959259      0.833333
	Q4                315           6.276190      0.438095

	Saved: results/260216/complexity_analysis.png
	Saved: results/260216/complexity_analysis.json

	============================================================
	BY-RULE ANALYSIS
	============================================================

	Score by rule (sorted by avg_floored_score):
	rule_description count avg_floored_score std_floored_score success_rate
	Cards must alternate between red and black colors. Any card may start the line. 45 24.711111 2.581598 1.000000
	Only cards of the suit spades. 45 24.400000 2.783066 1.000000
	Only red cards (hearts or diamonds). 45 24.355556 3.517202 1.000000
	Only cards with an even rank (2,4,6,8,10,12). 45 23.866667 2.509980 1.000000
	The card must be of a different suit than the card just before it. Any card may start the line. 45 22.177778 5.457642 0.977778
	Only hearts, clubs, and diamonds allowed. Spades are forbidden. 45 20.777778 4.742884 0.977778
	Card rank must have opposite odd/even parity to the previous card's rank. Any card may start the line. 45 20.777778 5.178140 1.000000
	The card must be of a different suit than but same color as the card just before it. Any card may start the line. 45 20.488889 4.993733 0.977778
	Only ranks that are prime numbers (2,3,5,7,11,13). 45 20.266667 6.005301 0.955556
	Only face cards (11,12,13). 45 20.022222 7.294110 0.933333
	Only Aces (rank 1). 45 19.711111 8.387280 0.933333
	Only spades and diamonds. 45 19.466667 3.992038 1.000000
	Suits must repeat in the cyclic order hearts → spades → clubs → diamonds → hearts... Any card may start the line. 45 17.400000 7.171281 0.911111
	Only cards between 1 and 7 inclusive. 45 15.200000 6.614722 0.933333
	Only black face cards. 45 12.844444 8.196328 0.800000
	Alternate face and number cards. Any card may start the line. 45 10.911111 9.144850 0.688889
	Only cards between 5 and 9 inclusive. 45 9.844444 7.416062 0.755556
	Each card must share at least one property with the previous card: same color, or same parity. Any card may start the line. 45 9.288889 8.454752 0.644444
	Only red cards whose rank is <=7. 45 8.666667 7.245688 0.711111
	Each card must have a rank greater or equal to the previous card. Only Ace can start the line. 45 8.555556 8.983711 0.555556
	Suits must appear in pairs: card 1 and 2 same suit, cards 3 and 4 same suit (different from 1 and 2), cards 5 and 6 same suit (different from 3 and 4), etc. 45 6.133333 7.050467 0.555556
	Face cards (11-13) must be red; number cards (1-10) must be black. 45 4.644444 6.729027 0.400000
	Hearts and spades form Group A; clubs and diamonds form Group B. Alternate between groups. Any card may start the line. 45 3.800000 6.679684 0.288889
	Face cards impose the suit: if a face card is played, the next card must match its suit. Otherwise, the next card must be a different suit than it. 45 2.955556 4.847471 0.311111
	If the previous card was red, rank must increase or be equal; if black, rank must decrease or be equal. Starting card must be between 5 and 9 inclusive. 45 2.888889 5.924611 0.222222
	Rank repeats in pairs: ranks must come in doubles: (x, x), then (y, y) with y different from x, then (z, z) with z different from y, etc. 45 1.466667 4.048569 0.133333

	Saved: results/260216/by_rule.png
	Saved: results/260216/by_rule.json

	============================================================
	EXCESS CAUTION ANALYSIS
	============================================================

	Early Correct Turns: consecutive correct shadow guesses before winning guess
	(Only counts successful rounds where model eventually guessed correctly)
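
The definition above can be read as counting backwards from the winning guess while the shadow guess was already correct. The sketch below assumes each round is available as a list of per-turn booleans plus the 0-based index of the winning turn. That data layout and the `early_correct_turns` helper are hypothetical, not taken from the analysis script.

```python
def early_correct_turns(shadow_correct: list[bool], winning_turn: int) -> int:
    """Count consecutive correct shadow guesses right before `winning_turn`.

    `shadow_correct[i]` says whether the shadow guess on turn i matched the
    rule; `winning_turn` is the 0-based turn of the official correct guess.
    """
    count = 0
    for turn in range(winning_turn - 1, -1, -1):
        if not shadow_correct[turn]:
            break  # the streak must be unbroken up to the winning guess
        count += 1
    return count


# Hypothetical round: correct on turns 2 and 3, official guess on turn 4.
print(early_correct_turns([False, False, True, True, True], winning_turn=4))
```

A higher count means the model already held the right rule for several turns before committing to an official guess.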

	Model                        Rounds  Mean  Median  % with Early
	Gpt 5.2 Pro High                 76  4.62     4.0          97.4
	Gpt 5.2 High                     75  3.56     3.0          86.7
	Gemini 3 Pro Preview High        73  2.34     2.0          79.5
	Deepseek V3.2                    70  2.19     2.0          74.3
	Gemini 3 Flash Preview Low       60  1.92     2.0          68.3
	Gemini 3 Flash Preview High      63  1.75     1.0          68.3
	Gpt 5 Mini Medium                59  1.32     1.0          67.8
	Gpt Oss 120B                     59  1.31     1.0          61.0
	Kimi K2                          67  1.28     1.0          65.7
	Gpt Oss 20B                      56  1.09     1.0          51.8
	Claude Opus 4.5                  72  0.94     1.0          59.7
	Glm 4.7                          67  0.94     1.0          59.7
	Grok 4 1 Fast Reasoning          69  0.75     0.0          44.9
	Claude Haiku 4.5                 55  0.53     0.0          38.2
	Deepseek R1                      65  0.49     0.0          33.8

	Overall: 1713 total early correct turns across 986 successful rounds
	Mean: 1.74, Median: 1.0
	639/986 (64.8%) rounds had at least 1 early correct turn

	Saved: results/260216/excess_caution.png
	Saved: results/260216/excess_caution.json
	Saved: results/260216/caution_vs_failed_guesses.png
	Saved: results/260216/caution_vs_failed_guesses.json
	Saved: results/260216/score_vs_recklessness.png
	Saved: results/260216/score_vs_recklessness.json

	============================================================
	RECKLESS GUESSING ANALYSIS
	============================================================

	Double-Down Rate: After a wrong guess, % of next turns with another guess
	(Only counts official guesses, not shadow/tentative guesses)
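
As a rough illustration of the double-down rate defined above, the sketch below labels each turn of a round as `"play"` (no official guess), `"wrong"`, or `"correct"`. Those labels, the `double_down_rate` helper, and the choice to count a correct follow-up guess as a double-down are assumptions. The log only states that the next turn contains another guess.

```python
def double_down_rate(rounds: list[list[str]]) -> tuple[int, int, float]:
    """Return (wrong guesses, next-turn guesses, double-down %)."""
    wrong = followed = 0
    for turns in rounds:
        for i, action in enumerate(turns):
            if action != "wrong":
                continue
            wrong += 1
            # A wrong guess on the final turn has no next turn to inspect.
            if i + 1 < len(turns) and turns[i + 1] in ("wrong", "correct"):
                followed += 1
    rate = 100.0 * followed / wrong if wrong else 0.0
    return wrong, followed, rate


# Hypothetical rounds, not taken from the benchmark data.
toy_rounds = [
    ["play", "wrong", "wrong", "play", "correct"],
    ["wrong", "play", "wrong", "correct"],
]
print(double_down_rate(toy_rounds))
```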

	Model                        Wrong Guesses  Next Turn Guesses  Double-Down %
	Grok 4 1 Fast Reasoning                200                108           54.0
	Deepseek R1                            249                132           53.0
	Claude Haiku 4.5                       308                161           52.3
	Kimi K2                                159                 67           42.1
	Gpt Oss 20B                            232                 97           41.8
	Glm 4.7                                151                 63           41.7
	Gemini 3 Pro Preview High               84                 31           36.9
	Claude Opus 4.5                        156                 50           32.1
	Gemini 3 Flash Preview High            136                 40           29.4
	Gpt Oss 120B                           168                 37           22.0
	Deepseek V3.2                           97                 20           20.6
	Gemini 3 Flash Preview Low              75                 15           20.0
	Gpt 5 Mini Medium                       91                  8            8.8
	Gpt 5.2 High                            22                  0            0.0
	Gpt 5.2 Pro High                        10                  0            0.0

	Wrong Guess Streak Statistics:
	Model                        Streaks  Mean Length  Max Length  Total Wrong
	Grok 4 1 Fast Reasoning          103         1.94           8          200
	Deepseek R1                      121         2.06           7          249
	Claude Haiku 4.5                 157         1.96           7          308
	Kimi K2                          100         1.59           7          159
	Gpt Oss 20B                      141         1.65           7          232
	Glm 4.7                           91         1.66           5          151
	Gemini 3 Pro Preview High         60         1.40           5           84
	Claude Opus 4.5                  115         1.36           5          156
	Gemini 3 Flash Preview High       99         1.37           4          136
	Gpt Oss 120B                     133         1.26           5          168
	Deepseek V3.2                     79         1.23           4           97
	Gemini 3 Flash Preview Low        63         1.19           4           75
	Gpt 5 Mini Medium                 85         1.07           3           91
	Gpt 5.2 High                      22         1.00           1           22
	Gpt 5.2 Pro High                  10         1.00           1           10

	Longest streak: 8 consecutive wrong guesses
	- Grok 4 1 Fast Reasoning in round 67
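
The streak statistics above can be reproduced in spirit by grouping consecutive wrong guesses within a round. The sketch below assumes a streak is a run of wrong official guesses on back-to-back turns, and that turns are labelled `"play"`, `"wrong"`, or `"correct"`. Both are assumptions, and `wrong_streaks` and `streak_stats` are hypothetical helpers.

```python
from itertools import groupby


def wrong_streaks(turns: list[str]) -> list[int]:
    """Lengths of runs of consecutive 'wrong' turns within one round."""
    return [
        sum(1 for _ in run)
        for is_wrong, run in groupby(turns, key=lambda action: action == "wrong")
        if is_wrong
    ]


def streak_stats(rounds: list[list[str]]) -> dict:
    """Aggregate streak count, mean length, max length and total wrong guesses."""
    lengths = [length for turns in rounds for length in wrong_streaks(turns)]
    total = sum(lengths)
    return {
        "streaks": len(lengths),
        "mean_length": total / len(lengths) if lengths else 0.0,
        "max_length": max(lengths, default=0),
        "total_wrong": total,
    }


# Hypothetical round, not taken from the benchmark data.
toy_round = ["play", "wrong", "wrong", "play", "wrong", "correct"]
print(wrong_streaks(toy_round))
print(streak_stats([toy_round]))
```

Under this reading, the mean length is simply total wrong guesses divided by the number of streaks, which agrees with the table (for example 200 / 103 ≈ 1.94).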

	Saved: results/260216/reckless_guessing.png
	Saved: results/260216/reckless_guessing.json

	============================================================
	COMPLEXITY RATIO ANALYSIS
	============================================================

	Analyzed 14024 tentative rules with confidence >= 5
	Using optimal k = 0.490 for aggregated complexity

	Complexity Ratio by Model:
	(Ratio = Tentative Complexity / Actual Complexity)

	Model                        Median    Q25    Q75  Count
	Gpt Oss 120B                  1.318  0.873  2.360   1182
	Gpt Oss 20B                   1.146  0.775  2.068   1219
	Claude Haiku 4.5              1.049  0.731  2.000   1001
	Deepseek R1                   1.000  0.759  1.751    933
	Deepseek V3.2                 1.000  0.773  1.492    906
	Gemini 3 Flash Preview High   1.000  0.824  1.627    958
	Gemini 3 Flash Preview Low    1.000  0.775  1.521   1016
	Gemini 3 Pro Preview High     1.000  0.691  1.199    789
	Glm 4.7                       1.000  0.729  1.302    882
	Gpt 5 Mini Medium             1.000  0.761  1.656    939
	Gpt 5.2 High                  1.000  0.788  1.176    857
	Gpt 5.2 Pro High              1.000  0.842  1.101    855
	Grok 4 1 Fast Reasoning       1.000  0.773  1.664    938
	Claude Opus 4.5               0.984  0.708  1.168    664
	Kimi K2                       0.970  0.621  1.274    885

	Interpretation:
	- Ratio > 1: Model tends to overcomplicate rules
	- Ratio < 1: Model tends to oversimplify rules
	- Ratio ~ 1: Model matches actual rule complexity
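
The ratio and its interpretation can be sketched as follows. The record layout, the `ratio_summary` and `interpret` helpers, the `tolerance` used for "~ 1", and the inclusive quantile method are all assumptions. Only the ratio definition and the confidence >= 5 filter come from the log.

```python
from statistics import quantiles


def ratio_summary(records: list[dict], min_confidence: int = 5):
    """Return (median, q25, q75, count) of tentative/actual complexity ratios."""
    ratios = [
        rec["tentative"] / rec["actual"]
        for rec in records
        if rec["confidence"] >= min_confidence
    ]
    q25, median, q75 = quantiles(ratios, n=4, method="inclusive")
    return median, q25, q75, len(ratios)


def interpret(ratio: float, tolerance: float = 0.05) -> str:
    """Map a ratio to the interpretation above; `tolerance` is hypothetical."""
    if abs(ratio - 1.0) <= tolerance:
        return "matches"
    return "overcomplicates" if ratio > 1.0 else "oversimplifies"


# Hypothetical tentative rules; the last one is dropped by the confidence filter.
toy_records = [
    {"tentative": 2, "actual": 4, "confidence": 7},
    {"tentative": 3, "actual": 3, "confidence": 9},
    {"tentative": 4, "actual": 4, "confidence": 5},
    {"tentative": 6, "actual": 4, "confidence": 8},
    {"tentative": 9, "actual": 3, "confidence": 6},
    {"tentative": 50, "actual": 1, "confidence": 2},
]
print(ratio_summary(toy_records))
print(interpret(1.318), interpret(0.5))
```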

	Highest median: Gpt Oss 120B (1.318)
	Lowest median: Kimi K2 (0.970)

	Saved: results/260216/complexity_ratio.png
	Saved: results/260216/complexity_ratio.json

	============================================================
	OUTPUT TOKENS BY TURN
	============================================================

	Saved: results/260216/tokens_by_turn.png
	Saved: results/260216/tokens_by_turn.json

	Tokens trend summary (early vs late turns):
	Claude Haiku 4.5: early=3191, late=5889 (+84.5%)
	Claude Opus 4.5: early=2649, late=8447 (+218.9%)
	Deepseek R1: early=5083, late=10946 (+115.3%)
	Deepseek V3.2: early=3660, late=12272 (+235.3%)
	Gemini 3 Flash Preview High: early=10023, late=15704 (+56.7%)
	Gemini 3 Flash Preview Low: early=1046, late=1351 (+29.1%)
	Gemini 3 Pro Preview High: early=5059, late=14670 (+190.0%)
	Glm 4.7: early=3885, late=4457 (+14.7%)
	Gpt 5 Mini Medium: early=1241, late=4862 (+291.9%)
	Gpt 5.2 High: early=963, late=5910 (+514.0%)
	Gpt 5.2 Pro High: early=1006, late=3125 (+210.7%)
	Gpt Oss 120B: early=1050, late=4475 (+326.2%)
	Gpt Oss 20B: early=1744, late=7789 (+346.6%)
	Grok 4 1 Fast Reasoning: early=2810, late=17827 (+534.4%)
	Kimi K2: early=5545, late=10653 (+92.1%)
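
The percentages in the trend summary above are consistent with a plain relative change from the early mean to the late mean. The log does not say which turns count as early or late, so the sketch below covers only the percentage step, using a hypothetical `pct_change` helper. Because the printed means are rounded, recomputing from them can differ from the logged figure by a fraction of a percentage point.

```python
def pct_change(early: float, late: float) -> float:
    """Relative change from early to late, in percent."""
    return 100.0 * (late - early) / early


# Means copied from the Claude Opus 4.5 line of the summary.
opus_change = pct_change(2649, 8447)
formatted = f"{opus_change:+.1f}%"
print(f"Claude Opus 4.5: early=2649, late=8447 ({formatted})")
```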

	============================================================
	PER-MODEL REPORTS
	============================================================

	Saved: results/260216/model_gpt_5_2_high.png
	Saved: results/260216/model_gpt_oss_120b.png
	Saved: results/260216/model_claude_haiku_4_5.png
	Saved: results/260216/model_gpt_5_mini_medium.png
	Saved: results/260216/model_gemini_3_flash_preview_low.png
	Saved: results/260216/model_gpt_oss_20b.png
	Saved: results/260216/model_grok_4_1_fast_reasoning.png
	Saved: results/260216/model_deepseek_r1.png
	Saved: results/260216/model_claude_opus_4_5.png
	Saved: results/260216/model_kimi_k2.png
	Saved: results/260216/model_glm_4_7.png
	Saved: results/260216/model_gemini_3_flash_preview_high.png
	Saved: results/260216/model_deepseek_v3_2.png
	Saved: results/260216/model_gemini_3_pro_preview_high.png
	Saved: results/260216/model_gpt_5_2_pro_high.png

	============================================================
	Analysis complete! All outputs saved to: results/260216
	============================================================