Papers
arxiv:2602.05986

RISE-Video: Can Video Generators Decode Implicit World Rules?

Published on Feb 5
ยท Submitted by
Xue Yang
on Feb 6
Authors:
,
,
,
,
,
,
,
,
,
,
,

Abstract

RISE-Video presents a novel benchmark for evaluating text-image-to-video synthesis models based on cognitive reasoning rather than visual fidelity, using a multi-dimensional metric system and automated LMM-based evaluation.

AI-generated summary

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.

Community

Paper submitter

Despite strong visual realism, we find that current text-image-to-video models frequently fail to respect implicit world rules when generating complex scenarios. We introduce RISE-Video to systematically evaluate reasoning fidelity in video generation and reveal persistent reasoning gaps across state-of-the-art models.
Code: https://github.com/VisionXLab/Rise-Video
Data: https://huggingface.co/datasets/VisionXLab/RISE-Video

arXivLens breakdown of this paper ๐Ÿ‘‰ https://arxivlens.com/PaperView/Details/rise-video-can-video-generators-decode-implicit-world-rules-4136-2f194534

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.05986 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.05986 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.05986 in a Space README.md to link it from this page.

Collections including this paper 2