arxiv:2603.19017

What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Published on Mar 19

Abstract

MultiTempBench evaluates multilingual temporal reasoning capabilities of LLMs across different calendar systems and languages, revealing tokenization quality as a key bottleneck in low-resource settings.

AI-generated summary

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at https://github.com/gagan3012/mtb
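To make the idea of date fragmentation concrete, here is a minimal sketch in Python. The abstract does not give the definition of mDFR, so this is not the paper's metric: it is a simplified proxy that counts how many date components (year, month, day) fail to appear as exactly one token, with no human severity calibration. The names `toy_tokenize` and `fragmentation_ratio` are hypothetical, and the tokenizer is a stand-in that cuts digit runs into fixed-size chunks rather than a real subword tokenizer.

```python
import re


def toy_tokenize(text, max_digits=3):
    """Hypothetical tokenizer: digit runs are cut left to right into chunks
    of at most `max_digits`; every other character is its own token."""
    tokens = []
    for match in re.finditer(r"\d+|\D", text):
        piece = match.group()
        if piece.isdigit():
            tokens.extend(
                piece[i:i + max_digits] for i in range(0, len(piece), max_digits)
            )
        else:
            tokens.append(piece)
    return tokens


def token_spans(tokens):
    """Character spans (start, end) of each token, assuming the tokens
    concatenate back to the original text."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def fragmentation_ratio(date_text, components, tokenize):
    """Fraction of date components that are NOT covered by exactly one token.

    `components` lists the component strings in order of appearance,
    e.g. ["2024", "03", "19"] for "2024-03-19".
    This is an illustrative proxy, not the mDFR defined in the paper.
    """
    spans = set(token_spans(tokenize(date_text)))
    broken, cursor = 0, 0
    for comp in components:
        start = date_text.index(comp, cursor)  # search left to right
        end = start + len(comp)
        if (start, end) not in spans:
            broken += 1
        cursor = end
    return broken / len(components)


date = "2024-03-19"
parts = ["2024", "03", "19"]
print(toy_tokenize(date))                            # ['202', '4', '-', '03', '-', '19']
print(fragmentation_ratio(date, parts, toy_tokenize))  # year is split: 1 of 3 components
print(fragmentation_ratio(date, parts, lambda t: toy_tokenize(t, max_digits=4)))  # 0.0
```

With three-digit chunks only the year is split, so one of the three components is fragmented; with four-digit chunks every component survives as a single token. A real analysis would replace `toy_tokenize` with the tokenizer of the model under study and repeat the measurement across languages and calendar formats.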

Community

gagan3012

Paper submitter about 7 hours ago
