arxiv:2606.09826

OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Published on Jun 8

Submitted by Mingxian Lin on Jun 9

Hong Kong University


Authors:

Fan Zhang ,

Abstract

OmniGameArena presents a unified benchmark for evaluating vision-language model agents in diverse game settings with a reflection-based improvement protocol that tracks performance evolution and skill generalization.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing. We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.
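As a rough illustration of the IDC protocol described in the abstract, the loop below plays one cold-start episode, lets a reflector revise a bounded skill prompt for a fixed number of rounds, records the score after each round, and finally scores the learned skill on a held-out variant. This is a minimal sketch, not the authors' harness: `run_idc`, `IDCResult`, `MAX_SKILL_CHARS`, and the toy `toy_play` / `toy_reflect` stand-ins are all hypothetical names, and the 200-character bound is an assumed value, since the page does not state the actual limit or interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed bound on the skill prompt; the real limit is not given in the source.
MAX_SKILL_CHARS = 200


@dataclass
class IDCResult:
    round_scores: List[float] = field(default_factory=list)  # index 0 is the cold-start score
    held_out_score: float = 0.0
    final_skill: str = ""


def run_idc(
    play: Callable[[str, str], float],     # (skill prompt, game variant) -> score
    reflect: Callable[[str, float], str],  # (current skill, last score) -> revised skill
    train_variant: str,
    held_out_variant: str,
    rounds: int,
) -> IDCResult:
    """Hypothetical IDC-style loop: score, reflect, re-score, then test on a held-out variant."""
    result = IDCResult()
    skill = ""  # cold start: no learned skill yet
    result.round_scores.append(play(skill, train_variant))
    for _ in range(rounds):
        # The reflector proposes a new skill; truncation enforces the assumed bound.
        skill = reflect(skill, result.round_scores[-1])[:MAX_SKILL_CHARS]
        result.round_scores.append(play(skill, train_variant))
    result.held_out_score = play(skill, held_out_variant)
    result.final_skill = skill
    return result


def toy_play(skill: str, variant: str) -> float:
    """Toy stand-in for a game episode: each 'tip' in the skill adds to the score."""
    per_tip = 10.0 if variant == "train" else 5.0
    return per_tip * skill.count("tip")


def toy_reflect(skill: str, last_score: float) -> str:
    """Toy stand-in for the reflector LLM: appends one tip per round."""
    return skill + "tip;"
```

Calling `run_idc(toy_play, toy_reflect, "train", "held_out", rounds=3)` yields the two observables the abstract mentions: `round_scores` is the per-round score curve, and `held_out_score` measures how the final skill transfers to an unseen variant.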


Community

mxlin043

Paper submitter about 15 hours ago

OmniGameArena is a real-time benchmark of 12 new Unreal Engine 5 games (7 Solo, 3 PvP, 2 Coop). All twelve games share one action interface, so commercial VLMs, open-weight VLMs, and specialized game policies are evaluated the same way. On top of the cold-start leaderboard, we add the Improvement Dynamics Curve (IDC): the agent reflects on its own play over several rounds, and we track how the score changes from round to round and whether the learned skill still works on unseen game variants. The project page has the leaderboard, gameplay videos, and a demo you can play in the browser.