arxiv:2510.27544

TempoBench: Evaluating Temporal Causal Reasoning in Large Language Models

Published on Jun 6

Abstract

Large language models excel at forward simulation but struggle with temporal causal reasoning, particularly identifying minimal necessary inputs for observed outcomes, as demonstrated by a new formally verified benchmark showing significant performance gaps between simulation and causal attribution tasks.

Temporal reasoning involves understanding how systems evolve over time through input-driven state transitions. A key aspect is temporal causal reasoning: determining which prior inputs were necessary to cause an observed outcome. While large language models (LLMs) perform well at forward simulation (predicting outputs from inputs), they struggle to identify the minimal causal inputs of outcomes. To study this distinction, we define two tasks: trace simulation (SIM), which requires models to simulate system execution, and minimal causal attribution (MIN), which requires models to identify the minimal set of inputs necessary for a given outcome. We introduce TempoBench, the first formally verified benchmark for temporal causal reasoning, built from synthesized Mealy machines with controllable complexity and provably correct causal labels. Across frontier models, we observe that despite achieving up to 96% accuracy on the SIM task, performance on the causal attribution MIN task drops below 25%; models fail to reason about causal necessity. Over 94% of causal errors involve overspecification, where models perform retrieval and list all possible inputs rather than reasoning about the minimal causal subset. Fine-tuning on the TempoBench training corpus improves causal reasoning and generalizes better than math, code, or instruction training, with gains across standard reasoning benchmarks.
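To make the SIM/MIN distinction concrete, here is a minimal sketch on a toy Mealy machine. Everything in it is an illustrative assumption rather than material from TempoBench: the machine, the state and symbol names, and in particular the working definition of a "minimal cause" (the smallest set of input positions that, held at their observed values, forces the outcome no matter what the other inputs are). The benchmark's formally verified causal labels may rest on a different definition.

```python
from itertools import combinations, product

# Hypothetical toy Mealy machine: (state, input) -> (next_state, output).
TRANSITIONS = {
    ("idle", "req"): ("busy", "ack"),
    ("idle", "noop"): ("idle", "none"),
    ("busy", "req"): ("busy", "none"),
    ("busy", "noop"): ("idle", "done"),
}
ALPHABET = ("req", "noop")


def simulate(inputs, start="idle"):
    """SIM-style task: run the machine forward and return the output trace."""
    state, outputs = start, []
    for symbol in inputs:
        state, out = TRANSITIONS[(state, symbol)]
        outputs.append(out)
    return outputs


def minimal_cause(inputs, step, outcome):
    """MIN-style task under the assumed definition: find the smallest set of
    input positions that, kept at their observed values, yields `outcome` at
    `step` for every possible choice of the remaining inputs (brute force)."""
    n = step + 1  # inputs after `step` cannot affect the output at `step`
    for size in range(n + 1):
        for kept in combinations(range(n), size):
            free = [i for i in range(n) if i not in kept]
            forced = True
            for values in product(ALPHABET, repeat=len(free)):
                trial = list(inputs[:n])
                for i, v in zip(free, values):
                    trial[i] = v
                if simulate(trial)[step] != outcome:
                    forced = False
                    break
            if forced:
                return kept
    return None  # the outcome does not occur in the observed trace


trace = ["noop", "req", "noop"]
print(simulate(trace))                  # ['none', 'ack', 'done']
print(minimal_cause(trace, 2, "done"))  # (1, 2)
```

In this toy trace the first input is irrelevant to the final "done" output, so an answer that lists all three inputs would be an instance of the overspecification error described above.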
