arxiv:2604.04415

Structured Causal Video Reasoning via Multi-Objective Alignment

Published on Apr 6

Submitted by Yongxin Guo on Apr 13

University of Western Australia

Authors:

Yongxin Guo ,

Abstract

Video-LLMs trained on structured event facts with causal relationships outperform existing methods on complex video understanding tasks requiring precise temporal inference.

AI-generated summary

Human understanding of video dynamics is typically grounded in a structured mental representation of entities, actions, and temporal relations, rather than relying solely on immediate deductive reasoning. In contrast, existing Video-LLMs largely depend on unstructured video reasoning, where critical visual evidence is embedded in verbose textual descriptions and temporal causality is often weakly modeled. This leads to inefficient reasoning and fragile causal inference. To bridge this cognitive gap, we propose constructing a compact representation of salient events and their causal relationships, which we name Structured Event Facts, prior to the reasoning stage. This structured prior serves as an explicit constraint to promote concise and causally grounded reasoning, while also making intermediate evidence easier to verify. To effectively train models on such structured facts, we introduce CausalFact-60K and a four-stage training pipeline comprising facts alignment, format warm-start, thinking warm-start, and reinforcement learning-based post-training. During the RL stage, we find that this framework introduces competing objectives: structural completeness and causal fidelity must be balanced against reasoning length, which makes the optimization difficult. We address this challenge by formulating the optimization as a Multi-Objective Reinforcement Learning (MORL) problem and explicitly optimizing toward the Pareto frontier to balance these trade-offs. As a result, we introduce Factum-4B, which yields more reliable reasoning and delivers stronger performance on challenging video understanding tasks requiring fine-grained temporal inference.
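
The abstract does not spell out how the Pareto frontier is targeted during RL, so the following is only a minimal sketch of the underlying concept: given several sampled responses scored on the three competing objectives named above, keep those that no other response dominates. The class `ScoredRollout`, its field names, and the example scores are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ScoredRollout:
    """One sampled response with hypothetical per-objective scores (higher is better)."""
    name: str
    completeness: float  # assumed score for structural completeness of the event facts
    fidelity: float      # assumed score for causal fidelity of the reasoning
    brevity: float       # negative reasoning length, so shorter responses score higher

    def objectives(self) -> Tuple[float, float, float]:
        return (self.completeness, self.fidelity, self.brevity)


def dominates(a: ScoredRollout, b: ScoredRollout) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    pairs = list(zip(a.objectives(), b.objectives()))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(rollouts: List[ScoredRollout]) -> List[ScoredRollout]:
    """Return the rollouts that no other rollout dominates, in input order."""
    return [r for r in rollouts if not any(dominates(o, r) for o in rollouts)]


# Illustrative scores only; these are not results from the paper.
rollouts = [
    ScoredRollout("A", completeness=0.9, fidelity=0.8, brevity=-400),
    ScoredRollout("B", completeness=0.7, fidelity=0.9, brevity=-150),
    ScoredRollout("C", completeness=0.6, fidelity=0.7, brevity=-300),  # dominated by B
    ScoredRollout("D", completeness=0.9, fidelity=0.8, brevity=-500),  # dominated by A
]
front = pareto_front(rollouts)
print([r.name for r in front])  # prints ['A', 'B']
```

Neither A nor B dominates the other (A is more complete, B is more faithful and shorter), which is the kind of trade-off a single scalar reward would have to resolve with fixed weights.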

Community

Yongxin-Guo

Paper author Paper submitter about 17 hours ago

We introduce Factum-4B, a Video-LLM that reasons over Structured Event Facts instead of relying on verbose, unstructured descriptions. By explicitly modeling salient events and their causal relations—and training with CausalFact-60K plus Pareto-optimized multi-objective RL—Factum-4B achieves more reliable, causally grounded video understanding.


