arxiv:2606.17861

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?

Published on Jun 16

Submitted by Tongxu Luo on Jun 17

Chinese University of Hong Kong, Shenzhen

Authors: Tongxu Luo, Rongsheng Wang, Zhengyang Tang, Xinyuan Wang

Abstract

End-to-end game generation presents significant challenges for coding agents, requiring them to create complete playable games from natural language descriptions while meeting specific evaluation criteria for engine grounding, artifact completeness, and interactive verification.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
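The abstract does not spell out how rubric judgments become a benchmark percentage, so the following is only a minimal sketch of one plausible aggregation, not the authors' scoring code. It assumes each task has a rubric of weighted criteria scored in [0, 1] by a judge, that a project which fails to run in the engine scores zero, and that the benchmark score is the mean task score as a percentage. The names `RubricItem`, `task_score`, `benchmark_score`, and `runs_in_engine` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricItem:
    """One rubric criterion. All fields are illustrative assumptions."""
    name: str
    weight: float  # relative importance of the criterion
    score: float   # judge's verdict in [0, 1]


def task_score(items: list[RubricItem], runs_in_engine: bool) -> float:
    """Weighted mean of rubric scores, gated on the project running in the engine."""
    # Assumed gate: a project that does not run cannot be verified interactively.
    if not runs_in_engine or not items:
        return 0.0
    total_weight = sum(item.weight for item in items)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(item.weight * item.score for item in items) / total_weight


def benchmark_score(task_scores: list[float]) -> float:
    """Mean task score expressed as a percentage (0.0 for an empty list)."""
    return 100.0 * mean(task_scores) if task_scores else 0.0


if __name__ == "__main__":
    # Made-up rubric for illustration only; not data from the benchmark.
    rubric = [
        RubricItem("core mechanic works", 2.0, 1.0),
        RubricItem("visual feedback", 1.0, 0.5),
        RubricItem("sufficient content", 1.0, 0.0),
    ]
    print(task_score(rubric, runs_in_engine=True))
```

Under these assumptions, an agent that implements a recognizable mechanic but ships little content or feedback earns partial credit rather than a pass, which matches the failure pattern the abstract describes.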

Community

Zeno-Luo (paper author, paper submitter), about 13 hours ago

Can coding agents build an actual game in a real game engine?

We introduce GameCraft-Bench, a benchmark of 140 Godot tasks across 15 game families for evaluating end-to-end game generation through interactive gameplay verification.

The strongest frontier agent achieves only 41.46%, suggesting that creating complete, playable games remains far from solved.

Demos, code, and data: https://tongxuluo.github.io/gamecraft-bench-website/