What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?
Abstract
MultiTempBench evaluates the multilingual temporal reasoning capabilities of LLMs across calendar systems and languages, and identifies tokenisation quality as a key bottleneck in low-resource settings.
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
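To make "fragmentation disrupts Year/Month/Day separation" concrete, the sketch below computes a toy fragmentation score for a single tokenised date. It is an illustrative proxy only, not the paper's mDFR: the abstract does not give the mDFR formula, and mDFR is additionally calibrated with human severity ratings. The function names, the scoring rule (the share of date components that do not survive as exactly one token), and the example tokenisations are all assumptions made for illustration.

```python
from typing import List, Tuple


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character span of each token in order."""
    spans = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def toy_fragmentation(date_text: str, tokens: List[str], components: List[str]) -> float:
    """Hypothetical proxy: fraction of date components (e.g. year, month, day)
    that are NOT preserved as exactly one token. 0.0 = every component intact,
    1.0 = every component split or merged with neighbouring characters."""
    if "".join(tokens) != date_text:
        raise ValueError("tokens must concatenate to the original date text")
    spans = set(token_spans(tokens))
    intact = 0
    search_from = 0
    for comp in components:
        # Components are assumed to appear left to right in the date text.
        start = date_text.index(comp, search_from)
        end = start + len(comp)
        search_from = end
        if (start, end) in spans:
            intact += 1
    return 1.0 - intact / len(components)


# Made-up tokenisations of the same date, from clean to heavily fragmented.
date = "2024-03-15"
parts = ["2024", "03", "15"]
clean = toy_fragmentation(date, ["2024", "-", "03", "-", "15"], parts)
partial = toy_fragmentation(date, ["2024", "-0", "3-", "15"], parts)
shattered = toy_fragmentation(date, ["20", "24", "-0", "3", "-1", "5"], parts)
```

Under this toy rule, the clean tokenisation keeps all three components intact, the partial one loses only the month, and the shattered one loses all three. A real tokenizer and a non-Gregorian date string could be substituted for the hand-written token lists to compare languages or calendar formats.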