Title: Pencil Puzzle Bench: A Benchmark for Multi-Step Verifiable Reasoning

URL Source: https://arxiv.org/html/2603.02119

∗ Solved only through agentic iteration (0% in direct-ask evaluation).

4 Evaluation Methodology

4.1 Strategies

Direct Ask (Single-Shot)

The model receives the puzzle state and must output the complete solution as a JSON move list in a single inference call.

Agentic (Multi-Turn with Iterative Verification)

Model has access to tools: make_move, check_board, reset_puzzle. The model iteratively makes moves, checks the board for constraint violations, and course-corrects based on the specific error feedback. This process continues until the puzzle is solved or the model gives up.
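
The loop this describes can be sketched as follows. Only the three tool names (make_move, check_board, reset_puzzle) come from the harness above; ToyPuzzle and the scripted policy are illustrative stand-ins for the real puzzle environment and model, not the benchmark's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class ToyPuzzle:
    """Stand-in environment: the solution is a fixed move sequence."""
    target: list
    board: list = field(default_factory=list)

    def make_move(self, move):
        self.board.append(move)
        return f"applied {move}"

    def check_board(self):
        # Report the first position that violates the (toy) constraint;
        # "ok" means no violations so far, not that the puzzle is done.
        for i, m in enumerate(self.board):
            if i >= len(self.target) or m != self.target[i]:
                return f"violation at move {i}"
        return "ok"

    def reset_puzzle(self):
        self.board = []
        return "reset"

    def solved(self):
        return self.board == self.target


def agentic_solve(next_action, puzzle, max_turns=50):
    """Iterate: act -> observe feedback -> course-correct, per Sec. 4.1."""
    feedback = "start"
    for _ in range(max_turns):
        tool, arg = next_action(feedback)
        if tool == "make_move":
            feedback = puzzle.make_move(arg)
        elif tool == "check_board":
            feedback = puzzle.check_board()
        elif tool == "reset_puzzle":
            feedback = puzzle.reset_puzzle()
        else:  # model gives up or emits a final answer
            break
        if puzzle.solved():
            return True
    return False
```

A scripted policy that plays a wrong move, sees the violation, resets, and replays correctly exercises exactly the course-correction loop described above.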

Move Format: Moves are coordinate-based strings like mouse,left,3,5 (left-click at column 3, row 5) or mouse,right,1,2 (right-click for secondary state). Line-drawing puzzles use drag syntax: mouse,left,1,1,1,5. This interface mirrors the puzz.link web UI.
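
A parser for these coordinate-based strings might look like the sketch below (click: "mouse,left,3,5"; drag: "mouse,left,1,1,1,5"). The dictionary keys in the return value are our own labels, not the benchmark's schema.

```python
def parse_move(move: str) -> dict:
    """Parse a puzz.link-style move string into a structured record."""
    parts = move.split(",")
    if parts[0] != "mouse" or parts[1] not in ("left", "right"):
        raise ValueError(f"unrecognized move: {move}")
    coords = [int(p) for p in parts[2:]]
    if len(coords) == 2:  # single click
        return {"button": parts[1], "kind": "click",
                "col": coords[0], "row": coords[1]}
    if len(coords) == 4:  # drag, e.g. line drawing
        return {"button": parts[1], "kind": "drag",
                "from": (coords[0], coords[1]),
                "to": (coords[2], coords[3])}
    raise ValueError(f"expected 2 or 4 coordinates, got {len(coords)}")
```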

Note: Both strategies are baselines, not optimized attempts. We did not perform prompt engineering or hill-climbing. The goal is to measure where models are, not to maximize scores.

4.2 Agentic Evaluation Design

Of 51 models, 3 were excluded from agentic evaluation because they lack the tool-calling capability required by the harness. The remaining 48 were evaluated agentically: 45 on a 30-puzzle baseline (4 types: yajilin, sashigane, lits, lightup), and 3 top models (GPT-5.2@xhigh, Claude Opus 4.6@thinking, and Gemini 3.1 Pro) on an expanded 60-puzzle set across all 20 varieties (3 puzzles per variety). This expansion is what solved the 3 previously unsolved varieties.

Agentic Scale

Agentic evaluation is a long-context, many-turn process. Across 1,602 agentic runs (45 models on the 30-puzzle baseline, plus 3 top models on the expanded 60-puzzle set):

  • Turns per attempt: mean 57.2, median 29, P90 = 113, max 1,221

  • Duration per attempt: mean 40 min, median 17 min, P90 = 108 min, max 14.3 hours

These are not brief tool calls; models sustain solving attempts over extended multi-turn conversations, maintaining context across dozens to hundreds of reasoning-action-feedback loops. This makes Pencil Puzzle Bench a demanding test of long-context utilization and sustained reasoning, not just single-turn problem solving.
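
The attempt-level statistics above can be recomputed from per-run records such as those in the published runs.jsonl artifact. The field names here ("turns") are assumed; the artifact may use different keys. P90 is taken as a nearest-rank percentile.

```python
import math
import statistics

def percentile(sorted_vals, q):
    """Nearest-rank percentile, q in [0, 100], on pre-sorted values."""
    k = max(0, math.ceil(q / 100 * len(sorted_vals)) - 1)
    return sorted_vals[k]

def attempt_stats(runs, key="turns"):
    """Mean / median / P90 / max of one numeric field across run records."""
    vals = sorted(r[key] for r in runs)
    return {"mean": statistics.fmean(vals),
            "median": statistics.median(vals),
            "p90": percentile(vals, 90),
            "max": vals[-1]}
```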

4.3 Models

We evaluate 51 models across 11 providers:

  • OpenAI (14): GPT-5.2 (none/low/medium/high/xhigh), GPT-5.2-pro, GPT-5.1@medium, GPT-5@medium, GPT-4.1, GPT-4o, o1, o3, GPT-3.5-turbo, GPT-OSS-120B

  • Anthropic (11): Claude Sonnet 4.5, Claude Sonnet 4.5@thinking, Claude Sonnet 4.6, Claude Sonnet 4.6@thinking, Claude Sonnet 4.6-1m, Claude Opus 4.5 (high), Claude Opus 4.5@thinking, Claude Opus 4.6 (high), Claude Opus 4.6@thinking, Claude Opus 4.6@max, Claude Opus 4.6-1m

  • Google (7): Gemini 3.1 Pro, Gemini 3 Pro (default/minimal/high), Gemini 3 Flash (minimal/high), Gemini 2.5 Pro

  • xAI (3): Grok 4.1 Fast, Grok 4.1 Fast Reasoning, Grok Code Fast

  • DeepSeek (2): V3.2, V3.2-speciale

  • Qwen (5): Qwen3.5-397B, Qwen3-235B, Qwen3-Coder, Qwen3-Next-80B, Qwen3-VL-235B

  • Mistral (2): Mistral Large, Devstral

  • Moonshot (2): Kimi K2@thinking, Kimi K2.5

  • Minimax (2): M2.1, M2.5

  • Zhipu (2): GLM-4.7, GLM-5

  • Xiaomi (1): MiMo-v2

5 Results

5.1 Overall Performance

Figure 1 and Table 3 show the top models ranked by their best success rate across either strategy. Complete results for all 51 models are in Table 8 (Appendix C).

Table 3: Top 15 models by best success rate. Direct ask on 300 puzzles; agentic on 30-puzzle baseline for most models, 60-puzzle expanded set for top 3.

5.2 The Agentic Gap

The agentic strategy, where models iteratively make moves, check the board for constraint violations, and course-correct, reveals a striking pattern: models with weak direct-ask performance benefit disproportionately from agentic iteration. Table 4 shows the top 10 models by agentic success rate with uplift, and Table 5 shows GPT-5.2’s performance across reasoning effort levels.

Table 4: Top 10 models by agentic success rate, with uplift (Δpp = agentic − direct) and cost per attempt.

Table 5: GPT-5.2 performance by reasoning effort level. Direct ask on 300 puzzles; agentic on 30-puzzle baseline.

Claude Opus 4.6 at default effort (no extended thinking) achieves 30.0% agentic success despite only 0.3% direct-ask performance (+30.0 percentage points on the 30 puzzles evaluated in both modes). This is the most extreme case: the model lacks internal chain-of-thought reasoning, so agentic iteration substantially compensates. Even with extended thinking enabled, the gap persists: Claude Opus 4.6@thinking scores 27.3% direct-ask but still gains +4.8pp from agentic iteration (28.6% → 33.3% on the same 84 puzzles). GPT-5.2@xhigh, which already achieves 27.0% direct-ask, gains +35.7pp from agentic iteration (20.2% → 56.0% on the same 84 puzzles). This suggests that reasoning depth and agentic iteration are two distinct axes of capability: the agentic gap ranges from +4.8pp for models with strong reasoning to +30.0pp for models without, and both benefit from iterative verification and course-correction.

Extended Agentic Evaluation

The initial 30-puzzle agentic evaluation covered 4 varieties. We then ran an expanded 60-puzzle evaluation across all 20 varieties for the top 3 models (GPT-5.2@xhigh, Claude Opus 4.6@thinking, and Gemini 3.1 Pro). Infrastructure error rates (timeouts, API failures) differed between models: GPT-5.2@xhigh had a 27.8% error rate on the baseline set vs. 39.6% on the expanded puzzles, while Claude Opus 4.6@thinking showed 50.0% and 31.2% respectively, suggesting different failure modes across models.

Claude’s Arc

At the same configuration (default effort, no extended thinking), agentic performance improves across Claude generations (direct-ask / agentic): Sonnet 4.5 (0.0% / 3.3%), Opus 4.5 (0.3% / 3.3%), Opus 4.6 (0.3% / 30.0%). Without extended thinking, the jump from Opus 4.5 to 4.6 is entirely in agentic mode. With extended thinking enabled, both axes improve: Opus 4.5@thinking (6.0% / 6.7%) to Opus 4.6@thinking (27.3% / 33.3%).

Higher Effort Does Not Always Help

Among non-thinking configurations, Claude Opus 4.6 at default effort (high) outperforms Opus 4.6@max (30.0% vs 23.3% on the baseline 30-puzzle set). Enabling extended thinking (Opus 4.6@thinking, 33.3% on the expanded set) does not surpass the default effort configuration on the baseline puzzles, suggesting that the relationship between reasoning depth and agentic effectiveness is non-monotonic.

Agentic Iteration Solves Previously Unsolved Varieties

The expanded agentic evaluation solved 3 previously unsolved varieties; all 20 varieties have now been solved at least once. This demonstrates that iterative verification and course-correction can unlock capabilities not achieved by any model in single-shot mode.

5.3 Reasoning Effort Scaling

Figure 3 shows GPT-5.2’s outcome breakdown across reasoning effort levels. Each attempt has one of three outcomes: correct (puzzle solved), incorrect (model returned a response but the solution was wrong), or request failed (infrastructure timeout or API error before any response was returned).


Figure 3: GPT-5.2 outcome breakdown by reasoning effort level (direct ask, 300 puzzles). Each bar sums to 100%. Correct answers (blue) improve 81× from none to xhigh, but at xhigh, 35% of requests fail before returning a response (red), revealing a sharp reliability/capability tradeoff.

5.4 Model Capability Over Time

Figure 4 shows model release dates plotted against success rates, illustrating the pace of capability improvement. At medium reasoning effort, we observe steady improvement across GPT-5 generations: GPT-5.0 (6.0%), GPT-5.1 (7.7%), GPT-5.2 (9.3%). A breakaway in capability is visible since late 2025, with dramatic improvements in the most recent three months. However, even the best models reach only 27.0% success in direct-ask mode, and many varieties and larger puzzles remain unsolved across all models. The extended timeline in Figure 7 (Appendix E) shows that models released before late 2024 achieve 0% or near-0% solve rates, confirming that this capability is entirely new.


Figure 4: Model success rates over time with frontier model release dates annotated. The progression shows both generational improvement within model families and the gap between frontier and non-frontier models.

5.5 Cost Analysis

Total recorded benchmark cost: approximately $28,246 across 17,032 runs, computed from per-request token usage in the published artifacts.³ All run records (model, puzzle, strategy, outcome, cost, duration) are published in the runs.jsonl artifact on HuggingFace.⁴ Figure 5 visualizes the cost-success Pareto frontier.

³ These are artifact-recorded costs. Actual provider costs are higher than recorded because when agentic runs terminate with errors, token usage from completed turns before the error is not recorded.

⁴ https://huggingface.co/datasets/approximatelabs/pencil-puzzle-bench


Figure 5: Cost per success vs. success rate across models. The Pareto frontier (lower-right) shows the cost-efficiency tradeoff: Grok 4.1 Fast achieves the lowest cost per success ($0.01) while GPT-5.2@xhigh achieves the highest success rate (56.0%).

Cost-per-success varies by 66,822× across models.
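
The cost-per-success metric and the Pareto frontier in Figure 5 can be computed from per-model aggregates along these lines. The (name, success_rate, cost_per_success) tuples are assumed inputs derived from the run records, not the artifact's exact schema.

```python
def cost_per_success(total_cost, successes):
    """Total spend divided by number of solved puzzles."""
    return float("inf") if successes == 0 else total_cost / successes

def pareto_frontier(models):
    """models: list of (name, success_rate, cost_per_success).
    Keep models not strictly dominated by a cheaper-and-better model."""
    frontier = []
    for name, rate, cps in models:
        dominated = any(r >= rate and c <= cps and (r > rate or c < cps)
                        for _, r, c in models)
        if not dominated:
            frontier.append(name)
    return frontier
```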

6 Analysis

6.1 What Makes a Puzzle Hard?

Our benchmark stratifies puzzles into short, medium, and long tiers by number of required moves within each type. But is move count a good proxy for difficulty? After analysis, we propose solution compression ratio as a better difficulty measure.

Move Count is a Weak Predictor

Regressing solve rate on move count yields adjusted R² = 0.092; move count explains only 9.2% of the variance in solve rate. Puzzle variety alone (η² = 0.242) explains 2.6× more variance than move count.

Solution Compressibility Predicts Difficulty

The compression ratio of the solution move sequence (zlib-compressed size / raw size) achieves adjusted R² = 0.385, 4.2× more predictive than move count. The intuition: solutions that compress well (low ratio) contain repetitive, structured patterns that models can generalize from partial progress, while high compression ratios indicate high-entropy solutions requiring genuinely novel decisions at each step.
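
The metric itself is a one-liner. The joining convention for the move sequence (newline-separated move strings) is an assumption; the paper does not specify it.

```python
import zlib

def compression_ratio(moves):
    """zlib-compressed size of the joined move sequence / raw size."""
    raw = "\n".join(moves).encode()
    return len(zlib.compress(raw)) / len(raw)
```

More repetitive move sequences compress further, yielding smaller ratios than high-entropy sequences of the same length.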

Why Area Doesn’t Add Much

Adding grid area improves the model only modestly (compression + area + log(area): adjusted R² = 0.415 vs. compression alone: 0.385). This is likely because area is confounded with variety; different varieties tend to use characteristic grid sizes. Within a variety, puzzles often cluster in a narrow size range (the Golden 300 has a median area of 100 for nearly every variety), so area provides little additional signal once variety is controlled for. Adding variety dummies to the best three features yields adjusted R² = 0.621, confirming that variety captures additional difficulty structure beyond compressibility.

6.2 Infrastructure Scaling Limits

The 35% request failure rate at xhigh reasoning effort is not random network failure. Error duration analysis reveals 100% of failures occur between 1–4 hours, with 62% clustering in the 2–3 hour range, suggesting a systematic server-side timeout around 2.5–3 hours. These are not transient errors but systematic limits on how long a single inference request can run.

The tradeoff is stark (Figure 3): xhigh gains +6pp over high in direct ask (20.7% → 27.0%), with corresponding gains in agentic mode (36.7% → 56.0%). But at xhigh, 35% of requests fail before returning any response. The marginal capability gain is real, but reliability collapses.

7 Discussion

Two Axes of Capability

Our results reveal two distinct axes along which models can improve at puzzle solving: reasoning depth (controlled by effort levels and extended thinking) and agentic iteration (controlled by iterative verification and course-correction). These axes are complementary, not redundant. The agentic gap is largest for models without internal reasoning: Claude Opus 4.6 (no extended thinking) has near-zero direct-ask performance but reaches 30.0% agentically. With extended thinking enabled, Opus 4.6@thinking achieves 27.3% direct-ask, yet still gains +4.8pp from agentic iteration (28.6% → 33.3% on the same 84 puzzles). GPT-5.2@xhigh, the strongest single-shot reasoner at 27.0%, gains a further +35.7pp agentically (56.0%). The strongest results combine both axes: deep reasoning and iterative verification. The expanded agentic evaluation solved 3 previously unsolved varieties that no model solved in direct-ask mode, demonstrating that agentic iteration provides capabilities beyond what single-shot reasoning alone achieves.

A Grounding Example

To contextualize these results: the best model at maximum reasoning effort (GPT-5.2@xhigh) solves only 33.3% of Sudoku puzzles, a variety familiar to most readers. This underscores that even well-known constraint-satisfaction problems remain genuinely challenging for frontier models, despite Sudoku being one of the most commonly encountered varieties in training data.

Limitations

Models received ASCII text-based board representations. In agentic mode, models also had access to a render_board_as_svg() tool that returns an SVG text representation of the current board state, which some models called for additional spatial context. Rasterized pixel images were not used. Varieties requiring strong visual-spatial reasoning may benefit from multimodal approaches using rendered images. The agentic evaluation used a 30-puzzle subset (4 varieties: yajilin, sashigane, lits, lightup, selected to span a range of difficulty levels and elicit meaningful agentic iteration) for most models, with an expanded 60-puzzle evaluation (3 puzzles per variety across all 20 varieties, selected by difficulty tier) for the top 3 models. Both strategies are baselines without prompt optimization.

All success rates are point estimates from single evaluation runs without confidence intervals or repeated trials. With 15 puzzles per variety, a single additional solve changes per-variety rates by 6.7 percentage points. The agentic uplift figures are computed from small samples and should be interpreted with this caveat. No human performance baseline is established. Future work should include repeated trials, bootstrap confidence intervals, and human calibration studies.
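
The bootstrap confidence intervals called for here are straightforward to add. The sketch below computes a percentile interval for a per-variety solve rate; with only 15 puzzles per variety the interval is wide, which illustrates the caveat above.

```python
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a solve rate.
    outcomes: list of 0/1 per-puzzle results."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n
                   for _ in range(n_boot))
    lo = rates[int(alpha / 2 * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```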

Future Work

The puzzle dataset, benchmark code, and step-level solution traces enable:

  • RL fine-tuning with step-level rewards, following the paradigm of process supervision [21, 49] and reasoning via RL [10]

  • Process reward model training using per-move, per-constraint verification signals

  • Curriculum learning across difficulty levels using compression ratio as a principled difficulty metric

  • Multimodal evaluation using the SVG renderer, where models could “see” the puzzle board as a rendered image rather than parsing ASCII text
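
As one concrete shape the first item could take: per-move verification turns naturally into a dense reward signal. Here verify_move is a hypothetical stand-in for the benchmark's per-move, per-constraint checker, and the +1/−1 scheme is an illustrative choice, not the paper's.

```python
def step_rewards(moves, verify_move):
    """Dense step-level rewards from a per-move verifier (hypothetical):
    +1 for each verified-legal move, -1 at the first constraint
    violation, at which point the episode ends."""
    rewards = []
    for move in moves:
        ok = verify_move(move)
        rewards.append(1.0 if ok else -1.0)
        if not ok:
            break
    return rewards
```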

8 Conclusion

We introduced Pencil Puzzle Bench, a dataset of 62,231 logic puzzles across 94 varieties with step-level solution traces, and an evaluation benchmark of 300 puzzles across 20 varieties with programmatic step-level verification, where every intermediate board state can be checked against variety-specific constraints. Our evaluation of 51 models in two modes (direct ask and agentic) reveals two distinct axes of improvement: reasoning effort scaling (81× from GPT-5.2 none to xhigh) and agentic iteration (up to +30.0pp for models without extended thinking, and +4.8pp even for strong reasoning models). The strongest results combine both axes: GPT-5.2@xhigh achieves 56.0% in agentic mode. Agentic evaluation solved 3 previously unsolved varieties; all 20 varieties have now been solved at least once, though 49% of individual puzzles remain unsolved by any model.

The step-level verification infrastructure provides a foundation for future work on process supervision, step-level reinforcement learning, and curriculum learning for reasoning.

Acknowledgments

This paper was written with heavy use of AI coding agents (Claude Code, Codex) and conversational AI assistants (ChatGPT, Claude). These tools were used extensively throughout the research, implementation, data analysis, and writing process. Any errors or inaccuracies in this work are the sole responsibility of the author.

References

  • [1] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, M. Podstawski, L. Gianinazzi, J. Gajda, T. Lehmann, H. Niewiadomski, P. Nyczyk, and T. Hoefler (2024) Graph of thoughts: solving elaborate problems with large language models. External Links: 2308.09687, Document, Link. Cited by: §2.
  • [2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: 1606.01540, Link. Cited by: §3.4.
  • [3] G. Chen, W. Xu, H. Zhang, H. P. Chan, C. Liu, L. Bing, D. Zhao, A. T. Luu, and Y. Rong (2025) FINEREASON: evaluating and improving llms’ deliberate reasoning through reflective puzzle solving. External Links: 2502.20238, Link. Cited by: §2.
  • [4] J. Chen, Q. He, S. Yuan, A. Chen, Z. Cai, W. Dai, H. Yu, Q. Yu, X. Li, J. Chen, H. Zhou, and M. Wang (2025) Enigmata: scaling logical reasoning in large language models with synthetic verifiable puzzles. External Links: 2505.19914, Link. Cited by: §1, §2.
  • [5] S. Chen, Y. Chen, Z. Li, Y. Jiang, Z. Wan, Y. He, D. Ran, T. Gu, H. Li, T. Xie, and B. Ray (2025) Recent advances in large language model benchmarks against data contamination: from static to dynamic evaluation. External Links: 2502.17521, Link. Cited by: §2.
  • [6] W. Chen, X. Ma, X. Wang, and W. W. Cohen (2023) Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks. External Links: 2211.12588, Link. Cited by: §2.
  • [7] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. External Links: 1803.05457, Link. Cited by: §2.
  • [8] P. Clark, O. Tafjord, and K. Richardson (2020) Transformers as soft reasoners over language. External Links: 2002.05867, Link. Cited by: §2.
  • [9] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. External Links: 2110.14168, Link. Cited by: §2.
  • [10]DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. Wang, H. Ding, H. Xin, H. Gao, H. Qu, H. Li, J. Guo, J. Li, J. Wang, J. Chen, J. Yuan, J. Qiu, J. Li, J. L. Cai, J. Ni, J. Liang, J. Chen, K. Dong, K. Hu, K. Gao, K. Guan, K. Huang, K. Yu, L. Wang, L. Zhang, L. Zhao, L. Wang, L. Zhang, L. Xu, L. Xia, M. Zhang, M. Zhang, M. Tang, M. Li, M. Wang, M. Li, N. Tian, P. Huang, P. Zhang, Q. Wang, Q. Chen, Q. Du, R. Ge, R. Zhang, R. Pan, R. Wang, R. J. Chen, R. L. Jin, R. Chen, S. Lu, S. Zhou, S. Chen, S. Ye, S. Wang, S. Yu, S. Zhou, S. Pan, S. S. Li, S. Zhou, S. Wu, S. Ye, T. Yun, T. Pei, T. Sun, T. Wang, W. Zeng, W. Zhao, W. Liu, W. Liang, W. Gao, W. Yu, W. Zhang, W. L. Xiao, W. An, X. Liu, X. Wang, X. Chen, X. Nie, X. Cheng, X. Liu, X. Xie, X. Liu, X. Yang, X. Li, X. Su, X. Lin, X. Q. Li, X. Jin, X. Shen, X. Chen, X. Sun, X. Wang, X. Song, X. Zhou, X. Wang, X. Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Y. Zhang, Y. Xu, Y. Li, Y. Zhao, Y. Sun, Y. Wang, Y. Yu, Y. Zhang, Y. Shi, Y. Xiong, Y. He, Y. Piao, Y. Wang, Y. Tan, Y. Ma, Y. Liu, Y. Guo, Y. Ou, Y. Wang, Y. Gong, Y. Zou, Y. He, Y. Xiong, Y. Luo, Y. You, Y. Liu, Y. Zhou, Y. X. Zhu, Y. Xu, Y. Huang, Y. Li, Y. Zheng, Y. Zhu, Y. Ma, Y. Tang, Y. Zha, Y. Yan, Z. Z. Ren, Z. Ren, Z. Sha, Z. Fu, Z. Xu, Z. Xie, Z. Zhang, Z. Hao, Z. Ma, Z. Yan, Z. Wu, Z. Gu, Z. Zhu, Z. Liu, Z. Li, Z. Xie, Z. Song, Z. Pan, Z. Huang, Z. Xu, Z. Zhang, and Z. Zhang (2026)DeepSeek-r1: incentivizing reasoning capability in llms via reinforcement learning. External Links: 2501.12948, Document, LinkCited by: §1, §2, 1st item.
  • [11] C. Deng, Y. Zhao, X. Tang, M. Gerstein, and A. Cohan (2024) Investigating data contamination in modern benchmarks for large language models. External Links: 2311.09783, Link. Cited by: §1, §2.
  • [12] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant (2021) Did aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. External Links: 2101.02235, Link. Cited by: §2.
  • [13] P. Giadikiaroglou, M. Lymperaiou, G. Filandrianos, and G. Stamou (2024) Puzzle solving using reasoning of large language models: a survey. External Links: 2402.11291, Document, Link. Cited by: §2.
  • [14] T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang (2024) Large language model based multi-agents: a survey of progress and challenges. External Links: 2402.01680, Link. Cited by: §2.
  • [15] D. Halder, A. Saji, T. Jayakumar, R. Puduppully, A. Kunchukuttan, and R. Dabre (2025) RiddleBench: a new generative reasoning benchmark for llms. External Links: 2510.24932, Link. Cited by: §2.
  • [16] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021) Measuring mathematical problem solving with the math dataset. External Links: 2103.03874, Link. Cited by: §2.
  • [17] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan (2024) SWE-bench: can language models resolve real-world github issues? External Links: 2310.06770, Link. Cited by: §2.
  • [18] Z. Ke, F. Jiao, Y. Ming, X. Nguyen, A. Xu, D. X. Long, M. Li, C. Qin, P. Wang, S. Savarese, C. Xiong, and S. Joty (2025) A survey of frontiers in llm reasoning: inference scaling, learning to reason, and agentic systems. External Links: 2504.09037, Link. Cited by: §2.
  • [19] Y. Li, H. Wang, and C. Zhang (2024) Assessing logical puzzle solving in large language models: insights from a minesweeper case study. External Links: 2311.07387, Document, Link. Cited by: §2.
  • [20] J. Liang, S. Wan, X. Wu, Y. Li, Q. Chen, D. Tang, S. Wang, and Z. Wei (2025) HardcoreLogic: challenging large reasoning models with long-tail logic puzzle games. External Links: 2510.12563, Link. Cited by: §2.
  • [21] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. External Links: 2305.20050, Link. Cited by: §1, §2, 1st item.
  • [22] B. Y. Lin, R. L. Bras, K. Richardson, A. Sabharwal, R. Poovendran, P. Clark, and Y. Choi (2025) ZebraLogic: on the scaling limits of llms for logical reasoning. External Links: 2502.01100, Link. Cited by: §2.
  • [23] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark (2023) Self-refine: iterative refinement with self-feedback. External Links: 2303.17651, Link. Cited by: §2.
  • [24] T. R. McIntosh, T. Susnjak, N. Arachchilage, T. Liu, P. Watters, and M. N. Halgamuge (2024) Inadequacies of large language model benchmarks in the era of generative artificial intelligence. External Links: 2402.09880, Document, Link. Cited by: §2.
  • [25] A. V. Nikolaev (2025) Evolomino is np-complete. External Links: 2503.07611, Document, Link. Cited by: §2.
  • [26]OpenAI, :, A. Jaech, A. Kalai, A. Lerer, A. Richardson, A. El-Kishky, A. Low, A. Helyar, A. Madry, A. Beutel, A. Carney, A. Iftimie, A. Karpenko, A. T. Passos, A. Neitz, A. Prokofiev, A. Wei, A. Tam, A. Bennett, A. Kumar, A. Saraiva, A. Vallone, A. Duberstein, A. Kondrich, A. Mishchenko, A. Applebaum, A. Jiang, A. Nair, B. Zoph, B. Ghorbani, B. Rossen, B. Sokolowsky, B. Barak, B. McGrew, B. Minaiev, B. Hao, B. Baker, B. Houghton, B. McKinzie, B. Eastman, C. Lugaresi, C. Bassin, C. Hudson, C. M. Li, C. de Bourcy, C. Voss, C. Shen, C. Zhang, C. Koch, C. Orsinger, C. Hesse, C. Fischer, C. Chan, D. Roberts, D. Kappler, D. Levy, D. Selsam, D. Dohan, D. Farhi, D. Mely, D. Robinson, D. Tsipras, D. Li, D. Oprica, E. Freeman, E. Zhang, E. Wong, E. Proehl, E. Cheung, E. Mitchell, E. Wallace, E. Ritter, E. Mays, F. Wang, F. P. Such, F. Raso, F. Leoni, F. Tsimpourlas, F. Song, F. von Lohmann, F. Sulit, G. Salmon, G. Parascandolo, G. Chabot, G. Zhao, G. Brockman, G. Leclerc, H. Salman, H. Bao, H. Sheng, H. Andrin, H. Bagherinezhad, H. Ren, H. Lightman, H. W. Chung, I. Kivlichan, I. O’Connell, I. Osband, I. C. Gilaberte, I. Akkaya, I. Kostrikov, I. Sutskever, I. Kofman, J. Pachocki, J. Lennon, J. Wei, J. Harb, J. Twore, J. Feng, J. Yu, J. Weng, J. Tang, J. Yu, J. Q. Candela, J. Palermo, J. Parish, J. Heidecke, J. Hallman, J. Rizzo, J. Gordon, J. Uesato, J. Ward, J. Huizinga, J. Wang, K. Chen, K. Xiao, K. Singhal, K. Nguyen, K. Cobbe, K. Shi, K. Wood, K. Rimbach, K. Gu-Lemberg, K. Liu, K. Lu, K. Stone, K. Yu, L. Ahmad, L. Yang, L. Liu, L. Maksin, L. Ho, L. Fedus, L. Weng, L. Li, L. McCallum, L. Held, L. Kuhn, L. Kondraciuk, L. Kaiser, L. Metz, M. Boyd, M. Trebacz, M. Joglekar, M. Chen, M. Tintor, M. Meyer, M. Jones, M. Kaufer, M. Schwarzer, M. Shah, M. Yatbaz, M. Y. Guan, M. Xu, M. Yan, M. Glaese, M. Chen, M. Lampe, M. Malek, M. Wang, M. Fradin, M. McClay, M. Pavlov, M. Wang, M. Wang, M. Murati, M. Bavarian, M. Rohaninejad, N. McAleese, N. Chowdhury, N. Chowdhury, N. 
Ryder, N. Tezak, N. Brown, O. Nachum, O. Boiko, O. Murk, O. Watkins, P. Chao, P. Ashbourne, P. Izmailov, P. Zhokhov, R. Dias, R. Arora, R. Lin, R. G. Lopes, R. Gaon, R. Miyara, R. Leike, R. Hwang, R. Garg, R. Brown, R. James, R. Shu, R. Cheu, R. Greene, S. Jain, S. Altman, S. Toizer, S. Toyer, S. Miserendino, S. Agarwal, S. Hernandez, S. Baker, S. McKinney, S. Yan, S. Zhao, S. Hu, S. Santurkar, S. R. Chaudhuri, S. Zhang, S. Fu, S. Papay, S. Lin, S. Balaji, S. Sanjeev, S. Sidor, T. Broda, A. Clark, T. Wang, T. Gordon, T. Sanders, T. Patwardhan, T. Sottiaux, T. Degry, T. Dimson, T. Zheng, T. Garipov, T. Stasi, T. Bansal, T. Creech, T. Peterson, T. Eloundou, V. Qi, V. Kosaraju, V. Monaco, V. Pong, V. Fomenko, W. Zheng, W. Zhou, W. McCabe, W. Zaremba, Y. Dubois, Y. Lu, Y. Chen, Y. Cha, Y. Bai, Y. He, Y. Zhang, Y. Wang, Z. Shao, and Z. Li (2024)OpenAI o1 system card. External Links: 2412.16720, LinkCited by: §1, §2.
  • [27] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023) Generative agents: interactive simulacra of human behavior. External Links: 2304.03442, Link. Cited by: §2.
  • [28] S. G. Patil, H. Mao, F. Yan, C. C. Ji, V. Suresh, I. Stoica, and J. E. Gonzalez (2025) The berkeley function calling leaderboard (BFCL): from tool use to agentic evaluation of large language models. In Proceedings of the 42nd International Conference on Machine Learning, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 267, pp. 48371–48392. External Links: Link. Cited by: §2.
  • [29] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. External Links: 2302.04761, Link. Cited by: §2.
  • [30] T. P. Schrader, L. Lange, T. Kaminski, S. Razniewski, and A. Friedrich (2025) A solver-in-the-loop framework for improving llms on answer set programming for logic puzzle solving. External Links: 2512.17093, Link. Cited by: §2.
  • [31] J. Seely, Y. Imajuku, T. Zhao, E. Cetin, and L. Jones (2025) Sudoku-bench: evaluating creative reasoning with sudoku variants. External Links: 2505.16135, Link. Cited by: §2.
  • [32] Z. Shi, X. Jiang, C. Xu, C. Yao, S. Ma, Y. Shen, Z. Li, J. Guo, and Y. Wang (2026) JudgeAgent: beyond static benchmarks for knowledge-driven and dynamic llm evaluation. External Links: 2509.02097, Link. Cited by: §2.
  • [33]A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Slone, A. Rahane, A. S. Iyer, A. Andreassen, A. Madotto, A. Santilli, A. Stuhlmüller, A. Dai, A. La, A. Lampinen, A. Zou, A. Jiang, A. Chen, A. Vuong, A. Gupta, A. Gottardi, A. Norelli, A. Venkatesh, A. Gholamidavoodi, A. Tabassum, A. Menezes, A. Kirubarajan, A. Mullokandov, A. Sabharwal, A. Herrick, A. Efrat, A. Erdem, A. Karakaş, B. R. Roberts, B. S. Loe, B. Zoph, B. Bojanowski, B. Özyurt, B. Hedayatnia, B. Neyshabur, B. Inden, B. Stein, B. Ekmekci, B. Y. Lin, B. Howald, B. Orinion, C. Diao, C. Dour, C. Stinson, C. Argueta, C. F. Ramírez, C. Singh, C. Rathkopf, C. Meng, C. Baral, C. Wu, C. Callison-Burch, C. Waites, C. Voigt, C. D. Manning, C. Potts, C. Ramirez, C. E. Rivera, C. Siro, C. Raffel, C. Ashcraft, C. Garbacea, D. Sileo, D. Garrette, D. Hendrycks, D. Kilman, D. Roth, D. Freeman, D. Khashabi, D. Levy, D. M. González, D. Perszyk, D. Hernandez, D. Chen, D. Ippolito, D. Gilboa, D. Dohan, D. Drakard, D. Jurgens, D. Datta, D. Ganguli, D. Emelin, D. Kleyko, D. Yuret, D. Chen, D. Tam, D. Hupkes, D. Misra, D. Buzan, D. C. Mollo, D. Yang, D. Lee, D. Schrader, E. Shutova, E. D. Cubuk, E. Segal, E. Hagerman, E. Barnes, E. Donoway, E. Pavlick, E. Rodola, E. Lam, E. Chu, E. Tang, E. Erdem, E. Chang, E. A. Chi, E. Dyer, E. Jerzak, E. Kim, E. E. Manyasi, E. Zheltonozhskii, F. Xia, F. Siar, F. Martínez-Plumed, F. Happé, F. Chollet, F. Rong, G. Mishra, G. I. Winata, G. de Melo, G. Kruszewski, G. Parascandolo, G. Mariani, G. Wang, G. Jaimovitch-López, G. Betz, G. Gur-Ari, H. Galijasevic, H. Kim, H. Rashkin, H. Hajishirzi, H. Mehta, H. Bogar, H. Shevlin, H. Schütze, H. Yakura, H. Zhang, H. M. Wong, I. Ng, I. Noble, J. Jumelet, J. Geissinger, J. Kernion, J. Hilton, J. Lee, J. 
F. Fisac, J. B. Simon, J. Koppel, J. Zheng, J. Zou, J. Kocoń, J. Thompson, J. Wingfield, J. Kaplan, J. Radom, J. Sohl-Dickstein, J. Phang, J. Wei, J. Yosinski, J. Novikova, J. Bosscher, J. Marsh, J. Kim, J. Taal, J. Engel, J. Alabi, J. Xu, J. Song, J. Tang, J. Waweru, J. Burden, J. Miller, J. U. Balis, J. Batchelder, J. Berant, J. Frohberg, J. Rozen, J. Hernandez-Orallo, J. Boudeman, J. Guerr, J. Jones, J. B. Tenenbaum, J. S. Rule, J. Chua, K. Kanclerz, K. Livescu, K. Krauth, K. Gopalakrishnan, K. Ignatyeva, K. Markert, K. D. Dhole, K. Gimpel, K. Omondi, K. Mathewson, K. Chiafullo, K. Shkaruta, K. Shridhar, K. McDonell, K. Richardson, L. Reynolds, L. Gao, L. Zhang, L. Dugan, L. Qin, L. Contreras-Ochando, L. Morency, L. Moschella, L. Lam, L. Noble, L. Schmidt, L. He, L. O. Colón, L. Metz, L. K. Şenel, M. Bosma, M. Sap, M. ter Hoeve, M. Farooqi, M. Faruqui, M. Mazeika, M. Baturan, M. Marelli, M. Maru, M. J. R. Quintana, M. Tolkiehn, M. Giulianelli, M. Lewis, M. Potthast, M. L. Leavitt, M. Hagen, M. Schubert, M. O. Baitemirova, M. Arnaud, M. McElrath, M. A. Yee, M. Cohen, M. Gu, M. Ivanitskiy, M. Starritt, M. Strube, M. Swędrowski, M. Bevilacqua, M. Yasunaga, M. Kale, M. Cain, M. Xu, M. Suzgun, M. Walker, M. Tiwari, M. Bansal, M. Aminnaseri, M. Geva, M. Gheini, M. V. T, N. Peng, N. A. Chi, N. Lee, N. G. Krakover, N. Cameron, N. Roberts, N. Doiron, N. Martinez, N. Nangia, N. Deckers, N. Muennighoff, N. S. Keskar, N. S. Iyer, N. Constant, N. Fiedel, N. Wen, O. Zhang, O. Agha, O. Elbaghdadi, O. Levy, O. Evans, P. A. M. Casares, P. Doshi, P. Fung, P. P. Liang, P. Vicol, P. Alipoormolabashi, P. Liao, P. Liang, P. Chang, P. Eckersley, P. M. Htut, P. Hwang, P. Miłkowski, P. Patil, P. Pezeshkpour, P. Oli, Q. Mei, Q. Lyu, Q. Chen, R. Banjade, R. E. Rudolph, R. Gabriel, R. Habacker, R. Risco, R. Millière, R. Garg, R. Barnes, R. A. Saurous, R. Arakawa, R. Raymaekers, R. Frank, R. Sikand, R. Novak, R. Sitelew, R. LeBras, R. Liu, R. Jacobs, R. Zhang, R. Salakhutdinov, R. Chi, R. 
Lee, R. Stovall, R. Teehan, R. Yang, S. Singh, S. M. Mohammad, S. Anand, S. Dillavou, S. Shleifer, S. Wiseman, S. Gruetter, S. R. Bowman, S. S. Schoenholz, S. Han, S. Kwatra, S. A. Rous, S. Ghazarian, S. Ghosh, S. Casey, S. Bischoff, S. Gehrmann, S. Schuster, S. Sadeghi, S. Hamdan, S. Zhou, S. Srivastava, S. Shi, S. Singh, S. Asaadi, S. S. Gu, S. Pachchigar, S. Toshniwal, S. Upadhyay, Shyamolima Debnath, S. Shakeri, S. Thormeyer, S. Melzi, S. Reddy, S. P. Makini, S. Lee, S. Torene, S. Hatwar, S. Dehaene, S. Divic, S. Ermon, S. Biderman, S. Lin, S. Prasad, S. T. Piantadosi, S. M. Shieber, S. Misherghi, S. Kiritchenko, S. Mishra, T. Linzen, T. Schuster, T. Li, T. Yu, T. Ali, T. Hashimoto, T. Wu, T. Desbordes, T. Rothschild, T. Phan, T. Wang, T. Nkinyili, T. Schick, T. Kornev, T. Tunduny, T. Gerstenberg, T. Chang, T. Neeraj, T. Khot, T. Shultz, U. Shaham, V. Misra, V. Demberg, V. Nyamai, V. Raunak, V. Ramasesh, V. U. Prabhu, V. Padmakumar, V. Srikumar, W. Fedus, W. Saunders, W. Zhang, W. Vossen, X. Ren, X. Tong, X. Zhao, X. Wu, X. Shen, Y. Yaghoobzadeh, Y. Lakretz, Y. Song, Y. Bahri, Y. Choi, Y. Yang, Y. Hao, Y. Chen, Y. Belinkov, Y. Hou, Y. Hou, Y. Bai, Z. Seid, Z. Zhao, Z. Wang, Z. J. Wang, Z. Wang, and Z. Wu (2023) Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv:2206.04615. Cited by: §2.
  • [34] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei (2022) Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv:2210.09261. Cited by: §2.

  • [35] H. Tang (2022) A framework for loop and path puzzle satisfiability NP-hardness results. arXiv:2202.02046. Cited by: §2.

  • [36] N. Tyagi, M. Parmar, M. Kulkarni, A. RRV, N. Patel, M. Nakamura, A. Mitra, and C. Baral (2024) Step-by-step reasoning to solve grid puzzles: where do LLMs falter? arXiv:2407.14790. Cited by: §2.

  • [37] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K. Lee, and E. Lim (2023) Plan-and-solve prompting: improving zero-shot chain-of-thought reasoning by large language models. arXiv:2305.04091. Cited by: §2.

  • [38] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171. Cited by: §2.

  • [39] A. Wei, Y. Wu, Y. Wan, T. Suresh, H. Tan, Z. Zhou, S. Koyejo, K. Wang, and A. Aiken (2025) SATBench: benchmarking LLMs’ logical reasoning via automated puzzle generation from SAT formulas. arXiv:2505.14615. Cited by: §2.

  • [40] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou (2023) Chain-of-thought prompting elicits reasoning in large language models. arXiv:2201.11903. Cited by: §1.

  • [41] C. Xu, S. Guan, D. Greene, and M. Kechadi (2024) Benchmark data contamination of large language models: a survey. arXiv:2406.04244. Cited by: §2.

  • [42] F. F. Xu, Y. Song, B. Li, Y. Tang, K. Jain, M. Bao, Z. Z. Wang, X. Zhou, Z. Guo, M. Cao, M. Yang, H. Y. Lu, A. Martin, Z. Su, L. Maben, R. Mehta, W. Chi, L. Jang, Y. Xie, S. Zhou, and G. Neubig (2025) TheAgentCompany: benchmarking LLM agents on consequential real world tasks. arXiv:2412.14161. Cited by: §2.

  • [43] R. Xu, Z. Wang, R. Fan, and P. Liu (2024) Benchmarking benchmark leakage in large language models. arXiv:2404.18824. Cited by: §1, §2.

  • [44] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan (2024) τ-Bench: a benchmark for tool-agent-user interaction in real-world domains. arXiv:2406.12045. Cited by: §2.

  • [45] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. arXiv:2305.10601. Cited by: §2.

  • [46] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. arXiv:2210.03629. Cited by: §2.

  • [47] T. Yato and T. Seta (2002) Complexity and completeness of finding another solution and its application to puzzles. Technical report. Cited by: §2.

  • [48] Z. Zhang, Z. Chen, Z. Zhang, Y. Sun, Y. Tian, Z. Jia, C. Li, X. Liu, X. Min, and G. Zhai (2025) PuzzleBench: a fully dynamic evaluation framework for large multimodal models on puzzle solving. arXiv:2504.10885. Cited by: §2.

  • [49] Z. Zhang, C. Zheng, Y. Wu, B. Zhang, R. Lin, B. Yu, D. Liu, J. Zhou, and J. Lin (2025) The lessons of developing process reward models in mathematical reasoning. arXiv:2501.07301. Cited by: §1, §2, 1st item.

  • [50] C. Zheng, J. Zhu, Z. Ou, Y. Chen, K. Zhang, R. Shan, Z. Zheng, M. Yang, J. Lin, Y. Yu, and W. Zhang (2025) A survey of process reward models: from outcome signals to process supervisions for large language models. arXiv:2510.08049. Cited by: §2.

  • [51] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, and E. Chi (2023) Least-to-most prompting enables complex reasoning in large language models. arXiv:2205.10625. Cited by: §2.

Appendix A: Puzzle Variety Descriptions

Brief descriptions of the 20 varieties in the evaluation set. The full dataset contains 94 varieties.

Country

Draw a loop through orthogonally adjacent cells that visits every outlined country exactly once. Numbers indicate how many cells the loop passes through in that country. The loop cannot branch or cross itself.

Dbchoco

Divide the grid into regions. Each region contains one white and one grey contiguous area of identical size and shape (allowing rotation/reflection). Numbers indicate the size of the area they’re in.

Firefly

Draw paths from every firefly (starting at black dots) to form one connected network. Paths cannot branch, cross, or connect directly between two black dots. Numbers indicate how many times the path turns.

Heyawake

Shade some cells in a grid divided into rooms. Shaded cells cannot be adjacent. Numbers indicate shaded cells in that room. No horizontal/vertical line of unshaded cells can pass through 2+ room borders.

Hitori

Shade some cells so that no row or column contains duplicate unshaded numbers. Shaded cells cannot be adjacent. All unshaded cells must form one orthogonally connected area.
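Constraints like these are cheap to machine-verify, which is what makes the benchmark's feedback loop possible. As an illustration, a minimal Hitori checker in Python (the grid encoding and function name are illustrative, not the benchmark's actual harness):

```python
from collections import deque

def check_hitori(numbers, shaded):
    """numbers: 2D list of ints; shaded: 2D list of bools, same shape."""
    h, w = len(numbers), len(numbers[0])
    # 1. No duplicate unshaded numbers in any row or column.
    for r in range(h):
        row = [numbers[r][c] for c in range(w) if not shaded[r][c]]
        if len(row) != len(set(row)):
            return False
    for c in range(w):
        col = [numbers[r][c] for r in range(h) if not shaded[r][c]]
        if len(col) != len(set(col)):
            return False
    # 2. No two shaded cells orthogonally adjacent (check down/right once each).
    for r in range(h):
        for c in range(w):
            if shaded[r][c]:
                if r + 1 < h and shaded[r + 1][c]:
                    return False
                if c + 1 < w and shaded[r][c + 1]:
                    return False
    # 3. Unshaded cells form one orthogonally connected area (BFS flood fill).
    unshaded = [(r, c) for r in range(h) for c in range(w) if not shaded[r][c]]
    if unshaded:
        reached = {unshaded[0]}
        queue = deque([unshaded[0]])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < h and 0 <= nc < w
                        and not shaded[nr][nc] and (nr, nc) not in reached):
                    reached.add((nr, nc))
                    queue.append((nr, nc))
        if len(reached) != len(unshaded):
            return False
    return True
```

The same structure, local clue checks plus a global connectivity flood fill, recurs in most of the varieties below.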

Kurodoko

Shade some cells (not numbers). Shaded cells cannot be adjacent. Numbers indicate how many unshaded cells can be seen in a straight line horizontally and vertically, including itself. All unshaded cells must connect.

Lightup

Place lights in empty cells to illuminate all non-black cells. Lights illuminate in straight lines until blocked by black cells. Lights cannot illuminate each other. Numbers on black cells indicate adjacent lights.
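These rules translate directly into a line-scan verifier. A sketch in Python, under an assumed grid encoding ('.' empty, '#' unclued black, digit characters for clued black cells); it is not the benchmark's actual checker:

```python
def check_lightup(grid, lights):
    """grid: 2D list of chars; lights: set of (row, col) light positions."""
    h, w = len(grid), len(grid[0])

    def ray(r, c, dr, dc):
        """Empty cells illuminated from (r, c) in one direction, up to a black cell."""
        out = []
        r, c = r + dr, c + dc
        while 0 <= r < h and 0 <= c < w and grid[r][c] == '.':
            out.append((r, c))
            r, c = r + dr, c + dc
        return out

    lit = set(lights)
    for r, c in lights:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            for cell in ray(r, c, dr, dc):
                if cell in lights:
                    return False          # two lights illuminate each other
                lit.add(cell)
    # Every empty cell must be illuminated.
    for r in range(h):
        for c in range(w):
            if grid[r][c] == '.' and (r, c) not in lit:
                return False
    # Clued black cells must have exactly that many orthogonally adjacent lights.
    for r in range(h):
        for c in range(w):
            if grid[r][c].isdigit():
                adj = sum((r + dr, c + dc) in lights
                          for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
                if adj != int(grid[r][c]):
                    return False
    return True
```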

Lits

Place one tetromino (4-cell block) in every outlined region. No 2x2 square can be fully covered. Identical tetrominoes (including rotations/reflections) cannot share an edge. All tetrominoes must connect.

Mashu

Draw a loop through every circle. At black circles the loop must turn, and must go straight through the cells immediately before and after. At white circles the loop must go straight through, but must turn in at least one of the two adjacent cells.

Norinori

Shade cells so that each shaded cell is adjacent to exactly one other shaded cell (forming dominoes). Each outlined region contains exactly 2 shaded cells.
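The domino condition reduces to a neighbour count per shaded cell. A hedged sketch (the region-id encoding is illustrative, not the benchmark's format):

```python
def check_norinori(regions, shaded):
    """regions: 2D list of region ids; shaded: set of (row, col) cells."""
    h, w = len(regions), len(regions[0])
    # Every shaded cell has exactly one shaded orthogonal neighbour (dominoes).
    for r, c in shaded:
        neighbours = sum((r + dr, c + dc) in shaded
                         for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
        if neighbours != 1:
            return False
    # Each outlined region contains exactly two shaded cells.
    counts = {}
    for r in range(h):
        for c in range(w):
            counts.setdefault(regions[r][c], 0)
            if (r, c) in shaded:
                counts[regions[r][c]] += 1
    return all(n == 2 for n in counts.values())
```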

Nurikabe

Shade cells so that the unshaded cells form separate islands. Each island contains exactly one number, indicating its size. Numbers cannot be shaded. Shaded cells cannot form 2x2 squares and must all connect.

Nurimaze

Shade some tiles to form a maze. Tiles are fully shaded or unshaded. Clued tiles cannot be shaded. No 2x2 square can be all shaded or all unshaded. Unshaded cells form a path from S to G passing through all circles.

Nurimisaki

Shade cells so that no 2x2 square is entirely shaded or entirely unshaded. Circles mark cells that are unshaded and adjacent to exactly one other unshaded cell. Numbers indicate how many unshaded cells are visible in a straight line, including the numbered cell itself.

Sashigane

Divide the grid into L-shaped regions (width one cell). Circles must be at the corner of an L. Arrows must be at the ends, pointing toward the corner. Numbers indicate total cells in that L-shape.

Shakashaka

Place right triangles (half-cells) so that every unshaded area forms a rectangle (upright or 45° rotated). Numbers indicate how many adjacent cells contain triangles.

Shikaku

Divide the grid into rectangles. Each rectangle contains exactly one number indicating its size in cells.
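Verifying a Shikaku answer amounts to checking that the rectangles tile the grid and that each one contains a single clue equal to its area. A sketch (the rectangle representation is an assumption, not the benchmark's move format):

```python
def check_shikaku(h, w, clues, rects):
    """clues: {(row, col): size}; rects: list of (top, left, height, width).
    Assumes every rectangle lies inside the h x w grid."""
    covered = set()
    for top, left, height, width in rects:
        cells = {(r, c) for r in range(top, top + height)
                        for c in range(left, left + width)}
        if cells & covered:
            return False                 # rectangles overlap
        covered |= cells
        inside = [pos for pos in clues if pos in cells]
        # Exactly one clue per rectangle, equal to the rectangle's area.
        if len(inside) != 1 or clues[inside[0]] != height * width:
            return False
    return len(covered) == h * w         # rectangles tile the whole grid
```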

Slither

Draw a single loop along cell edges. Numbers indicate how many of that cell’s four edges are part of the loop. The loop cannot branch or cross.

Sudoku

Place a number from 1 to N in each cell (N = grid width). Each row, column, and outlined block contains each number exactly once.
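The generalized-N formulation (with arbitrary outlined blocks rather than fixed 3x3 boxes) can be checked by collecting each group and testing it against the full 1..N set; the block-id encoding here is illustrative:

```python
def check_sudoku(grid, blocks):
    """grid: N x N list of ints; blocks: N x N list of block ids."""
    n = len(grid)
    groups = {}
    for r in range(n):
        for c in range(n):
            v = grid[r][c]
            groups.setdefault(("row", r), []).append(v)
            groups.setdefault(("col", c), []).append(v)
            groups.setdefault(("blk", blocks[r][c]), []).append(v)
    # Every row, column, and block must be a permutation of 1..N.
    target = list(range(1, n + 1))
    return all(sorted(g) == target for g in groups.values())
```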

Tapa

Shade cells (not numbers). Numbers indicate lengths of consecutive shaded cell blocks in the 8 surrounding cells. Shaded cells cannot form 2x2 squares and must all connect.

Yajilin

Shade some cells and draw a loop through the rest. Shaded cells cannot be adjacent. Number clues cannot be shaded and indicate how many shaded cells lie in the arrow’s direction.

Appendix B: Prompt Templates

B.1 DirectAsk (Single-Shot) Strategy

System Prompt:

````
Solve the puzzle!!
Answer with a list of moves you would like to make that
solve the puzzle as json in a markdown json code block
```json
["mouse,left,1,1","mouse,right,3,1",...]
```
````

User Prompt Template:

````
Puzzle Type: {puzzle_type}
Puzzle Rules:
```
{rules_text}
```
Here is an example of inputs/a solved puzzle
(lots of context for you)
```
{example_of_inputs}
```
Here's some more:
{example_move_context}
Note specifically how the coordinate systems work
(for the puzzle vs. the inputs).
For the puzzle you are working on, ensure you fully
understand from the example of input above, that
the move is exactly where you expect.
====================
Here is the puzzle you are to solve:
{puzzle_state}
====================
Please now solve it.
````
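On the harness side, a DirectAsk response can be scored by extracting the fenced JSON block and splitting each move string into device, button, and coordinates. A sketch of such a parser (the regex and tuple layout are illustrative; the paper does not specify the harness internals):

```python
import json
import re

def parse_moves(response_text):
    """Extract the ```json block and split each move string into fields."""
    match = re.search(r"```json\s*(.*?)```", response_text, re.DOTALL)
    if match is None:
        raise ValueError("no json code block found")
    moves = []
    for move in json.loads(match.group(1)):
        parts = move.split(",")
        device, button = parts[0], parts[1]
        coords = [int(p) for p in parts[2:]]   # 2 ints = click, 4 ints = drag
        moves.append((device, button, coords))
    return moves
```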

B.2 BasicAgentic (Multi-Turn) Strategy

System Prompt:

```
Solve the puzzle!!
It is known to be solvable, and you can figure it out.
This is a logic deduction benchmark.
You are graded on both how many steps you take (number of
tool calls), how many moves it takes (puzzle moves), and
ultimately if you can solve the puzzle. This puzzle is
solvable. If you need to, you may reset the puzzle and
keep trying.
```

User Prompt: Same template as DirectAsk (Section B.1).

Available Tools:

  • make_move(movestring): “Make a move, shows the board after the move is applied”

  • make_multi_move(movelist): “Make a series of moves, shows the board after the move is applied”

  • check_board_for_completeness(): “Check the current state of the board against the rules of the puzzle, see if its complete or if errors exist”

  • render_board_as_svg(): “Shows the full detail SVG of the board (useful if you want more information / are worried about your view into errors)”

  • get_rules(): “Gets all the rules for the puzzle”

  • reset_puzzle(): “Fully reset the puzzle (erase all moves, go back to a blank slate). Use this instead of giving up if you want another attempt.”

  • give_up(): Forfeit the attempt

Output Validation Loop: If the model produces a text response before the puzzle is complete, the agent framework automatically retries with the message: “Not done yet, keep going!! Puzzle isn’t complete.” This continues until the puzzle is solved, the model calls give_up(), or the maximum move limit (5000) is reached.
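The validation loop above, together with the tool dispatch, can be sketched as follows. Everything here, the callback signatures, the return values, and the transcript handling, is an illustrative stand-in for the actual harness:

```python
def run_agentic_loop(call_model, tools, max_moves=5000):
    """Drive one agentic attempt.

    call_model(transcript) returns ("tool", name, args) or ("text", message);
    tools maps tool names to callables. Both are caller-supplied stand-ins.
    """
    transcript = []
    moves = 0
    while moves < max_moves:
        kind, *payload = call_model(transcript)
        if kind == "text":
            # Text before completion: accept if solved, otherwise nudge the model.
            if tools["check_board_for_completeness"]() == "solved":
                return "solved"
            transcript.append("Not done yet, keep going!! Puzzle isn't complete.")
            continue
        name, args = payload
        if name == "give_up":
            return "gave_up"
        transcript.append(tools[name](*args))
        if name in ("make_move", "make_multi_move"):
            moves += len(args[0]) if name == "make_multi_move" else 1
    return "move_limit"
```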

Appendix C: Full Model Results

Complete results for all 51 evaluated models across both strategies.

Table 6: Full results by model and strategy. Succ = correct solves, Fail = wrong answers, Err = infrastructure failures (timeouts, API errors), $/Att = cost per attempt. Strategies are evaluated on different puzzle sets and rates are not comparable across columns.

Appendix D: Puzzle Variety Gallery

Figure 6 shows one puzzle from each of the 20 benchmark varieties, each partially solved. Red highlighting indicates cells where variety-specific constraints are currently violated.


Figure 6: All 20 varieties in the benchmark, shown mid-solve with constraint violations highlighted in red.

Appendix E: Full Model Success Over Time

Figure 7 extends the timeline from the main text (Figure 4) to include all evaluated models, spanning release dates from 2022 through early 2026. The extended view reveals that pencil puzzle solving capability is entirely a recent phenomenon: models released before late 2024—including GPT-3.5-turbo, GPT-4o, GPT-4.1, and o1—achieve 0% or near-0% solve rates. The first non-trivial performance appeared with o3 (3.0%) in early 2025, followed by an explosion in capability starting around November 2025 with the GPT-5 family and reasoning-enabled models. This back-testing confirms that our benchmark measures a genuinely new capability frontier, not one that was latent in earlier model generations.


Figure 7: Full model success rates over time with frontier model release dates annotated, including all evaluated models. Models released before late 2024 show 0% or near-0% solve rates, with capability emerging only in late 2025.
